Blame - script/tei2korapxml - KorAP/KorAP-XML-TEI

echo '<node a="v"><node1>some <n/> text</node1><node2>more-text</node2></node>' | perl -e 'use XML::CompactTree::XS; use XML::LibXML::Reader; $reader = XML::LibXML::Reader->new(IO => STDIN); $data = XML::CompactTree::XS::readSubtreeToPerl( $reader, XCT_DOCUMENT_ROOT | XCT_IGNORE_COMMENTS | XCT_LINE_NUMBERS ); print "\x27".$data->[2]->[0]->[5]->[1]->[1]."\x27\n"'

Exploring the structure of $data ( = reference to below array ):

[ 0: XML_READER_TYPE_DOCUMENT,
  1: ?
  2: [ 0: [ 0: XML_READER_TYPE_ELEMENT    <- start recursion with array '$data->[2]' (see retr_info( \$tree_data->[2] ))
            1: 'node'
            2: ?
            3: HASH (attributes)
            4: 1 (line number)
            5: [ 0: [ 0: XML_READER_TYPE_ELEMENT
                      1: 'node1'
                      2: ?
                      3: undefined (no attributes)
                      4: 1 (line number)
                      5: [ 0: [ 0: XML_READER_TYPE_TEXT
                                1: 'some '
                              ]
                           1: [ 0: XML_READER_TYPE_ELEMENT
                                1: 'n'
                                2: ?
                                3: undefined (no attributes)
                                4: 1 (line number)
                                5: undefined (no child-nodes)
                              ]
                           2: [ 0: XML_READER_TYPE_TEXT
                                1: ' text'
                              ]
                         ]
                    ]
                 1: [ 0: XML_READER_TYPE_ELEMENT
                      1: 'node2'
                      2: ?
                      3: undefined (no attributes)
                      4: 1 (line number)
                      5: [ 0: [ 0: XML_READER_TYPE_TEXT
                                1: 'more-text'
                              ]
                         ]
                    ]
               ]
          ]
     ]
]

$data->[0] = 9 (=> type == XML_READER_TYPE_DOCUMENT)

ref($data->[2]) == ARRAY (with 1 element for 'node')
ref($data->[2]->[0]) == ARRAY (with 6 elements)

$data->[2]->[0]->[0] == 1 (=> type == XML_READER_TYPE_ELEMENT)
$data->[2]->[0]->[1] == 'node'
ref($data->[2]->[0]->[3]) == HASH (=> ${$data->[2]->[0]->[3]}{a} == 'v')
$data->[2]->[0]->[4] == 1 (line number)
ref($data->[2]->[0]->[5]) == ARRAY (with 2 elements for 'node1' and 'node2')
                             # child-nodes of actual node (see $_IDX)

ref($data->[2]->[0]->[5]->[0]) == ARRAY (with 6 elements)
$data->[2]->[0]->[5]->[0]->[0] == 1 (=> type == XML_READER_TYPE_ELEMENT)
$data->[2]->[0]->[5]->[0]->[1] == 'node1'
$data->[2]->[0]->[5]->[0]->[3] == undefined (=> no attribute)
$data->[2]->[0]->[5]->[0]->[4] == 1 (line number)
ref($data->[2]->[0]->[5]->[0]->[5]) == ARRAY (with 3 elements for 'some ', '<n/>' and ' text')

ref($data->[2]->[0]->[5]->[0]->[5]->[0]) == ARRAY (with 2 elements)
$data->[2]->[0]->[5]->[0]->[5]->[0]->[0] == 3 (=> type == XML_READER_TYPE_TEXT)
$data->[2]->[0]->[5]->[0]->[5]->[0]->[1] == 'some '

ref($data->[2]->[0]->[5]->[0]->[5]->[1]) == ARRAY (with 5 elements)
$data->[2]->[0]->[5]->[0]->[5]->[1]->[0] == 1 (=> type == XML_READER_TYPE_ELEMENT)
$data->[2]->[0]->[5]->[0]->[5]->[1]->[1] == 'n'
$data->[2]->[0]->[5]->[0]->[5]->[1]->[3] == undefined (=> no attribute)
$data->[2]->[0]->[5]->[0]->[5]->[1]->[4] == 1 (line number)
$data->[2]->[0]->[5]->[0]->[5]->[1]->[5] == undefined (=> no child-nodes)

ref($data->[2]->[0]->[5]->[0]->[5]->[2]) == ARRAY (with 2 elements)
$data->[2]->[0]->[5]->[0]->[5]->[2]->[0] == 3 (=> type == XML_READER_TYPE_TEXT)
$data->[2]->[0]->[5]->[0]->[5]->[2]->[1] == ' text'
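The access paths above can be tried out without running the parser. The following is a minimal sketch that rebuilds '$data->[2]' by hand and walks it depth first. The constant names, the variable '$children' and the function 'walk()' are made up for this illustration and are not part of the script; the slots marked '?' above are simply left undefined here.

```perl
use strict;
use warnings;

# Numeric node types as listed in the notes above (1 = element, 3 = text).
use constant { ELEMENT => 1, TEXT => 3 };

# Hand-built stand-in for $data->[2] of the example (assumption: slot 2 is left undef).
my $children = [
    [ ELEMENT, 'node', undef, { a => 'v' }, 1, [
        [ ELEMENT, 'node1', undef, undef, 1, [
            [ TEXT, 'some ' ],
            [ ELEMENT, 'n', undef, undef, 1 ],    # 5 elements: no child-nodes
            [ TEXT, ' text' ],
        ] ],
        [ ELEMENT, 'node2', undef, undef, 1, [
            [ TEXT, 'more-text' ],
        ] ],
    ] ],
];

# Collect one indented line per node; child-nodes sit at index 5.
sub walk {
    my ( $nodes, $depth, $out ) = @_;
    for my $e (@$nodes) {
        my $indent = '  ' x $depth;
        if ( $e->[0] == ELEMENT ) {
            push @$out, $indent . "<$e->[1]>";
            walk( $e->[5], $depth + 1, $out ) if defined $e->[5];
        }
        else {
            push @$out, $indent . "'$e->[1]'";
        }
    }
    return $out;
}

print "$_\n" for @{ walk( $children, 0, [] ) };
```

The recursion only follows index 5 when it is defined, which is why the empty tag 'n' is printed without children.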

retr_info() is called with \$tree_data->[2], so it starts with the array reference ${$_[0]}, which corresponds to ${\$data->[2]}
(= $data->[2]) in the above example. Hence, the expression @{${$_[0]}} corresponds to @{${\$data->[2]}}, $e to ${${\$data->[2]}}[0]
(= $data->[2]->[0]), and $e->[0] to ${${\$data->[2]}}[0]->[0] (= $data->[2]->[0]->[0]).
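The chain of dereferences can be checked in isolation. This is only a sketch on a hand-built stand-in for the tree; the function name 'first_child_type()' is hypothetical and merely mimics how retr_info() receives its argument.

```perl
use strict;
use warnings;

# Hand-built stand-in: document node (type 9) with one element child (type 1).
my $tree_data = [ 9, undef, [ [ 1, 'node' ] ] ];

sub first_child_type {
    # $_[0] is a reference to the slot holding the array reference,
    # so ${ $_[0] } is the array reference and @{ ${ $_[0] } } its elements.
    my @children = @{ ${ $_[0] } };
    my $e        = $children[0];    # same as $tree_data->[2]->[0]
    return $e->[0];                 # same as $tree_data->[2]->[0]->[0]
}

print first_child_type( \$tree_data->[2] ), "\n";
```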

## Notes on whitespace handling

All whitespace inside the processed text is 'significant' and is recognized as a node of type 'XML_READER_TYPE_SIGNIFICANT_WHITESPACE'
(see function 'retr_info()').

Definition of significant and insignificant whitespace
(source: https://www.oracle.com/technical-resources/articles/wang-whitespace.html):

Significant whitespace is part of the document content and should be preserved.
Insignificant whitespace is used when editing XML documents for readability.
These whitespaces are typically not intended for inclusion in the delivery of the document.

### Regarding XML_READER_TYPE_SIGNIFICANT_WHITESPACE

Besides text-nodes (XML_READER_TYPE_TEXT) and tag-nodes (XML_READER_TYPE_ELEMENT), there is a 3rd kind of node: nodes of the type
'XML_READER_TYPE_SIGNIFICANT_WHITESPACE'.

When modifying the previous example (see: Notes on how 'XML::CompactTree::XS' works) by inserting an additional blank between
'</node1>' and '<node2>', the output for '$data->[2]->[0]->[5]->[1]->[1]' is a blank (' ') and its type is '14'
(XML_READER_TYPE_SIGNIFICANT_WHITESPACE, see 'man XML::LibXML::Reader'):

echo '<node a="v"><node1>some <n/> text</node1> <node2>more-text</node2></node>' | perl -e 'use XML::CompactTree::XS; use XML::LibXML::Reader; $reader = XML::LibXML::Reader->new(IO => STDIN); $data = XML::CompactTree::XS::readSubtreeToPerl( $reader, XCT_DOCUMENT_ROOT | XCT_IGNORE_COMMENTS | XCT_LINE_NUMBERS ); print "node=\x27".$data->[2]->[0]->[5]->[1]->[1]."\x27, type=".$data->[2]->[0]->[5]->[1]->[0]."\n"'

Example: '... <head type="main"><s>Campagne in Frankreich</s></head><head type="sub"> <s>1792</s> ...'

Two text-nodes should normally be separated by a blank. In the above example, these are the 2 text-nodes
'Campagne in Frankreich' and '1792', which are separated by the whitespace-node ' ' (see [2]).

The text-node 'Campagne in Frankreich' causes '$add_one' to be set to 1, so that when the 2nd 'head'-tag is opened,
its from-index is set to the correct start-index of '1792' (and not to the start-index of the whitespace-node ' ').

The assumption here is that in most cases there _is_ a whitespace-node between 2 text-nodes. The below code fragment
provides a way to check, when closing a tag, whether this really _was_ the case for the last 2 'non-tag'-nodes:

When a whitespace-node is read, its from-index is stored as a hash-key (in %ws) to mark that index as belonging to a ws-node.
So when closing a tag, it can be checked whether the previous 'non-tag'-node (text or whitespace), i.e. the one before
the last read 'non-tag'-node, was actually _not_ a ws-node but a text-node. In that case, the from-value of
the last read 'non-tag'-node has to be corrected (see [1]).

For whitespace-nodes, $add_one is set to 0, so when opening the next tag (in the above example the 2nd 's'-tag), no
additional 1 is added (because this was already done by the whitespace-node itself when incrementing the variable $pos).

[1]
Now, what happens when 2 text-nodes are _not_ separated by a whitespace-node (e.g.: <w>Augen<c>,</c></w>)?
In this case, the falsely increased from-value has to be decreased again by 1 when closing the enclosing tag
(see above code fragment '... not exists $ws{ $fval - 1 } ...').

[2]
Comparing the 2 examples '<w>fu</w> <w>bar</w>' and '<w>fu</w><w> </w><w>bar</w>': in both cases, ' ' is handled as a
whitespace-node (XML_READER_TYPE_SIGNIFICANT_WHITESPACE).

The from-index of the 2nd w-tag in the second example refers to 'bar', which may not have been the intention
(even though '<w> </w>' doesn't make a lot of sense). TODO: could this be a bug?

Empty tags also cling to the next text-token, e.g. in '<w>tok1</w> <w>tok2</w><a><b/></a> <w>tok3</w>' the from-
and to-indices of the tags 'a' and 'b' are both 12, which is the start-index of the token 'tok3'.
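
The interplay of $pos, $add_one and %ws described in this section can be modelled in a few lines. The following is a simplified sketch, not the script's actual code: it works on a hand-written event list instead of the parsed tree, the function name 'compute_spans()' and the event format are invented for this illustration, the guard '$fval > 0' is an assumption, and whitespace-nodes are assumed to be exactly one character long. It does not try to reproduce the handling of empty tags mentioned in [2].

```perl
use strict;
use warnings;

# Simplified model of the from/to bookkeeping (assumptions: see lead-in).
sub compute_spans {
    my @events = @_;
    my ( $pos, $add_one ) = ( 0, 0 );
    my %ws;       # from-indices at which a whitespace-node starts
    my @open;     # stack of [ name, from ] for currently open tags
    my @spans;    # finished [ name, from, to ] triples

    for my $ev (@events) {
        my ( $kind, $val ) = @$ev;
        if ( $kind eq 'open' ) {
            # expect a blank after a preceding text-node
            push @open, [ $val, $pos + $add_one ];
        }
        elsif ( $kind eq 'text' ) {
            $pos += length $val;
            $add_one = 1;
        }
        elsif ( $kind eq 'ws' ) {
            $ws{$pos} = 1;          # remember where the ws-node starts
            $pos += length $val;    # the blank itself advances $pos
            $add_one = 0;
        }
        elsif ( $kind eq 'close' ) {
            my ( $name, $fval ) = @{ pop @open };
            # no ws-node directly before the from-value: undo the added 1
            $fval-- if $fval > 0 && !exists $ws{ $fval - 1 };
            push @spans, [ $name, $fval, $pos ];
        }
    }
    return @spans;
}

# <w>Augen<c>,</c></w>: 'c' is opened with from = 6 and corrected to 5 on closing.
my @augen = compute_spans(
    [ 'open', 'w' ], [ 'text', 'Augen' ], [ 'open', 'c' ],
    [ 'text', ',' ], [ 'close', 'c' ], [ 'close', 'w' ],
);
print join( ' ', @$_ ), "\n" for @augen;    # prints "c 5 6" and "w 0 6"

# <head><s>Campagne in Frankreich</s></head><head> <s>1792</s></head>
my @heads = compute_spans(
    [ 'open', 'head' ], [ 'open', 's' ], [ 'text', 'Campagne in Frankreich' ],
    [ 'close', 's' ], [ 'close', 'head' ],
    [ 'open', 'head' ], [ 'ws', ' ' ], [ 'open', 's' ], [ 'text', '1792' ],
    [ 'close', 's' ], [ 'close', 'head' ],
);
print join( ' ', @$_ ), "\n" for @heads;    # second 's' and 'head': 23 27
```

In the second call, the whitespace-node registers index 22 in %ws, so the from-value 23 of the 2nd 'head'-tag is left untouched when the tag is closed.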

## Notes on whitespace fixing

The idea behind the below code fragment was to fix (recreate) missing whitespace in a poorly created corpus, in which linebreaks were
inserted into the text and whitespace before those linebreaks may (or may not) have been unintentionally stripped.

It soon turned out that the best advice is to avoid linebreaks altogether and to put all primary text tokens on one line (see
example further down and notes on 'Input restrictions' in the manpage).

Somehow, an old and very poor first approach remained, which is not stringent, but also doesn't affect one-line text.

Examples (how primary text with linebreaks would be converted by the below code):

'...<w>end</w>\n<w>.</w>...' -> '...<w>end</w> <w>.</w>...'
'...<w>,</w>\n<w>this</w>\n<w>is</w>\n<w>it</w>\n<w>!</w>...' -> '<w>,</w> <w>this</w> <w>is</w> <w>it</w> <w>!</w>'.

Blanks are inserted before the 1st character:

NOTE: not stringent ('...' stands for text):

beg1............................end1 => no blank before 'beg1'
beg2....<pb/>...................end2 => no blank before 'beg2'
beg3....<info attr1="val1"/>....end3 => no blank before 'beg3'
beg4....<test>ok</test>.........end4 => blank before 'beg4'

=> beg1....end1beg2...<pb/>...end2beg3....<info attr1="val1"/>....end3 beg4...<test>ok</test>....end4
                                                                      ^
                                                                      |_blank between 'end3' and 'beg4'

## Notes on segfault prevention

binmode on the input handle prevents segfaulting of 'XML::LibXML::Reader' inside the main loop
(see notes on 'PerlIO layers' in 'man XML::LibXML').
Removing 'use open qw(:std :utf8)' would fix this problem too, but using binmode on input is more granular.
See in perluniintro: You can switch encodings on an already opened stream by using "binmode()".
See in perlfunc: If LAYER is omitted or specified as ":raw" the filehandle is made suitable for passing binary data.
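
The effect of binmode on the input handle can be made visible by listing the PerlIO layers before and after the call. This is a small sketch that only inspects STDIN; it does not hand the handle to 'XML::LibXML::Reader', and the variable names are chosen for this illustration only.

```perl
use strict;
use warnings;
use open qw(:std :utf8);    # puts a UTF-8 layer on STDIN, STDOUT and STDERR

my @before = PerlIO::get_layers( \*STDIN );
binmode(STDIN);             # LAYER omitted, i.e. the same as ":raw"
my @after = PerlIO::get_layers( \*STDIN );

print "before: @before\n";
print "after:  @after\n";
```

Only the input side is changed here; STDOUT and STDERR keep the layer set by 'use open', which is the granularity the note above refers to.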