Blame - script/tei2korapxml - KorAP/KorAP-XML-TEI

echo '<node a="v"><node1>some <n/> text</node1><node2>more-text</node2></node>' | perl -e 'use XML::CompactTree::XS; use XML::LibXML::Reader; $reader = XML::LibXML::Reader->new(IO => STDIN); $data = XML::CompactTree::XS::readSubtreeToPerl( $reader, XCT_DOCUMENT_ROOT | XCT_IGNORE_COMMENTS | XCT_LINE_NUMBERS ); print "\x27".$data->[2]->[0]->[5]->[1]->[1]."\x27\n"'

Exploring the structure of $data ( = reference to below array ):

[ 0: XML_READER_TYPE_DOCUMENT,
  1: ?
  2: [ 0: [ 0: XML_READER_TYPE_ELEMENT <- start recursion with array '$data->[2]' (see main(): retr_info( \$tree_data->[2] ))
            1: 'node'
            2: ?
            3: HASH (attributes)
            4: 1 (line number)
            5: [ 0: [ 0: XML_READER_TYPE_ELEMENT
                      1: 'node1'
                      2: ?
                      3: undefined (no attributes)
                      4: 1 (line number)
                      5: [ 0: [ 0: XML_READER_TYPE_TEXT
                                1: 'some '
                              ]
                           1: [ 0: XML_READER_TYPE_ELEMENT
                                1: 'n'
                                2: ?
                                3: undefined (no attributes)
                                4: 1 (line number)
                                5: undefined (no child-nodes)
                              ]
                           2: [ 0: XML_READER_TYPE_TEXT
                                1: ' text'
                              ]
                         ]
                    ]
                 1: [ 0: XML_READER_TYPE_ELEMENT
                      1: 'node2'
                      2: ?
                      3: undefined (no attributes)
                      4: 1 (line number)
                      5: [ 0: [ 0: XML_READER_TYPE_TEXT
                                1: 'more-text'
                              ]
                         ]
                    ]
               ]
          ]
     ]
]

$data->[0] == 9 (=> type == XML_READER_TYPE_DOCUMENT)

ref($data->[2]) == ARRAY (with 1 element for 'node')
ref($data->[2]->[0]) == ARRAY (with 6 elements)

$data->[2]->[0]->[0] == 1 (=> type == XML_READER_TYPE_ELEMENT)
$data->[2]->[0]->[1] == 'node'
ref($data->[2]->[0]->[3]) == HASH (=> ${$data->[2]->[0]->[3]}{a} == 'v')
$data->[2]->[0]->[4] == 1 (line number)
ref($data->[2]->[0]->[5]) == ARRAY (with 2 elements for 'node1' and 'node2')
# child-nodes of the current node (see $_IDX)

ref($data->[2]->[0]->[5]->[0]) == ARRAY (with 6 elements)
$data->[2]->[0]->[5]->[0]->[0] == 1 (=> type == XML_READER_TYPE_ELEMENT)
$data->[2]->[0]->[5]->[0]->[1] == 'node1'
$data->[2]->[0]->[5]->[0]->[3] == undefined (=> no attribute)
$data->[2]->[0]->[5]->[0]->[4] == 1 (line number)
ref($data->[2]->[0]->[5]->[0]->[5]) == ARRAY (with 3 elements for 'some ', '<n/>' and ' text')

ref($data->[2]->[0]->[5]->[0]->[5]->[0]) == ARRAY (with 2 elements)
$data->[2]->[0]->[5]->[0]->[5]->[0]->[0] == 3 (=> type == XML_READER_TYPE_TEXT)
$data->[2]->[0]->[5]->[0]->[5]->[0]->[1] == 'some '

ref($data->[2]->[0]->[5]->[0]->[5]->[1]) == ARRAY (with 5 elements)
$data->[2]->[0]->[5]->[0]->[5]->[1]->[0] == 1 (=> type == XML_READER_TYPE_ELEMENT)
$data->[2]->[0]->[5]->[0]->[5]->[1]->[1] == 'n'
$data->[2]->[0]->[5]->[0]->[5]->[1]->[3] == undefined (=> no attribute)
$data->[2]->[0]->[5]->[0]->[5]->[1]->[4] == 1 (line number)
$data->[2]->[0]->[5]->[0]->[5]->[1]->[5] == undefined (=> no child-nodes)

ref($data->[2]->[0]->[5]->[0]->[5]->[2]) == ARRAY (with 2 elements)
$data->[2]->[0]->[5]->[0]->[5]->[2]->[0] == 3 (=> type == XML_READER_TYPE_TEXT)
$data->[2]->[0]->[5]->[0]->[5]->[2]->[1] == ' text'
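
The nested layout listed above can be traversed with a short recursive function. The following is a minimal sketch, not the script's retr_info(): it works on a hand-built copy of $data->[2], the numeric type values 1 and 3 are the ones quoted in the listing, the names ELEMENT, TEXT and walk() are made up for illustration, and the 0 at index 2 is only a placeholder for the field marked '?' above.

```perl
use strict;
use warnings;

# Node-type values as quoted in the listing (1 = element, 3 = text).
use constant { ELEMENT => 1, TEXT => 3 };

# Hand-built copy of $data->[2]; index 2 (marked '?' above) is a placeholder.
my $children = [
    [ ELEMENT, 'node', 0, { a => 'v' }, 1, [
        [ ELEMENT, 'node1', 0, undef, 1, [
            [ TEXT, 'some ' ],
            [ ELEMENT, 'n', 0, undef, 1, undef ],
            [ TEXT, ' text' ],
        ] ],
        [ ELEMENT, 'node2', 0, undef, 1, [ [ TEXT, 'more-text' ] ] ],
    ] ],
];

# Hypothetical walker: collects one indented line per element or text node.
sub walk {
    my ( $nodes, $depth, $out ) = @_;
    for my $e (@$nodes) {
        if ( $e->[0] == ELEMENT ) {
            push @$out, ( '  ' x $depth ) . "<$e->[1]>";
            # index 5 holds the child-nodes (undefined for empty elements)
            walk( $e->[5], $depth + 1, $out ) if defined $e->[5];
        }
        elsif ( $e->[0] == TEXT ) {
            push @$out, ( '  ' x $depth ) . "'$e->[1]'";
        }
    }
    return $out;
}

my $lines = walk( $children, 0, [] );
print "$_\n" for @$lines;
```

Each recursion step receives the child array found at index 5 of the enclosing element, which mirrors how the index expressions above grow by one '->[5]->[n]' per level.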

retr_info() starts with the array reference ${$_[0]} (where $_[0] is \$tree_data->[2]), which corresponds to ${\$data->[2]} in the above example.
Hence, the expression @{${$_[0]}} corresponds to @{${\$data->[2]}}, $e to ${${\$data->[2]}}[0] (= $data->[2]->[0]) and $e->[0] to
${${\$data->[2]}}[0]->[0] (= $data->[2]->[0]->[0]).
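
The double dereferencing can be tried out in isolation. This is only an illustrative sketch: the tree is a trimmed-down stand-in for the example, the undef at index 1 is a placeholder for the field marked '?', and first_child() is a hypothetical helper that is called the same way as retr_info( \$tree_data->[2] ).

```perl
use strict;
use warnings;

# Trimmed-down stand-in for the example tree (document -> one 'node' element).
my $data = [ 9, undef, [ [ 1, 'node' ] ] ];

# Hypothetical helper: $_[0] is a reference to the array reference
# stored in $data->[2].
sub first_child {
    my @children = @{ ${ $_[0] } };       # all child nodes
    my $e        = ${ ${ $_[0] } }[0];    # first child, same as $children[0]
    return $e;
}

my $e = first_child( \$data->[2] );
print "type=$e->[0], name=$e->[1]\n";
```

Passing \$data->[2] hands over a reference to the array slot itself, so ${$_[0]} yields the array reference stored in that slot and ${${$_[0]}}[0] its first element.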

## Notes on whitespace handling

All whitespace inside the processed text is 'significant' and is recognized as a node of type 'XML_READER_TYPE_SIGNIFICANT_WHITESPACE'
(see function 'retr_info()').

Definition of significant and insignificant whitespace
(source: https://www.oracle.com/technical-resources/articles/wang-whitespace.html):

Significant whitespace is part of the document content and should be preserved.
Insignificant whitespace is used when editing XML documents for readability.
These whitespaces are typically not intended for inclusion in the delivery of the document.

### Regarding XML_READER_TYPE_SIGNIFICANT_WHITESPACE

Besides text-nodes (XML_READER_TYPE_TEXT) and tag-nodes (XML_READER_TYPE_ELEMENT), there is a 3rd kind of node: nodes of the type
'XML_READER_TYPE_SIGNIFICANT_WHITESPACE'.

When modifying the previous example (see: Notes on how 'XML::CompactTree::XS' works) by inserting an additional blank between
'</node1>' and '<node2>', the output for '$data->[2]->[0]->[5]->[1]->[1]' is a blank (' ') and its type is '14'
(XML_READER_TYPE_SIGNIFICANT_WHITESPACE, see 'man XML::LibXML::Reader'):

echo '<node a="v"><node1>some <n/> text</node1> <node2>more-text</node2></node>' | perl -e 'use XML::CompactTree::XS; use XML::LibXML::Reader; $reader = XML::LibXML::Reader->new(IO => STDIN); $data = XML::CompactTree::XS::readSubtreeToPerl( $reader, XCT_DOCUMENT_ROOT | XCT_IGNORE_COMMENTS | XCT_LINE_NUMBERS ); print "node=\x27".$data->[2]->[0]->[5]->[1]->[1]."\x27, type=".$data->[2]->[0]->[5]->[1]->[0]."\n"'

Example: '... <head type="main"><s>Campagne in Frankreich</s></head><head type="sub"> <s>1792</s> ...'

Two text-nodes should normally be separated by a blank. In the above example, that would be the 2 text-nodes
'Campagne in Frankreich' and '1792', which are separated by the whitespace-node ' ' (see [2]).

The text-node 'Campagne in Frankreich' causes '$add_one' to be set to 1, so that when the 2nd 'head'-tag is opened,
its from-index is set to the correct start-index of '1792' (and not to the start-index of the whitespace-node ' ').

The assumption here is that in most cases there _is_ a whitespace-node between 2 text-nodes. The below code fragment
makes it possible to check, when closing a tag, whether this really _was_ the case for the last 2 'non-tag'-nodes:

When a whitespace-node is read, its from-index is stored as a hash-key (in %ws), to state that it belongs to a ws-node.
So when closing a tag, it can be checked whether the previous 'non-tag'-node (text or whitespace), which is the one before
the last read 'non-tag'-node, was actually _not_ a ws-node, but instead a text-node. In that case, the from-value of
the last read 'non-tag'-node has to be corrected (see [1]).

For whitespace-nodes $add_one is set to 0, so when opening the next tag (in the above example the 2nd 's'-tag), no
additional 1 is added (because this was already done by the whitespace-node itself when incrementing the variable $pos).

[1]
Now, what happens when 2 text-nodes are _not_ separated by a whitespace-node (e.g.: <w>Augen<c>,</c></w>)?
In this case, the falsely increased from-value has to be decreased again by 1 when closing the enclosing tag
(see above code fragment '... not exists $ws{ $fval - 1 } ...').
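
The bookkeeping described so far can be replayed with a toy model. This is a sketch under stated assumptions, not the script's code: the input is a flat list of hypothetical events ('open', 'text', 'ws', 'close') instead of a parsed tree, offsets() is a made-up function, and only the four rules named in the comments are modelled (the variable names $pos, $add_one, %ws and $fval follow the notes). It does not attempt to reproduce the script's handling of empty tags.

```perl
use strict;
use warnings;

# Toy model of the offset rules (assumption, not the script's code):
#   open  : from = $pos + $add_one
#   text  : $pos += length, $add_one = 1
#   ws    : remember from-index in %ws, $pos += length, $add_one = 0
#   close : to = $pos; decrease from by 1 if no ws-node starts at from - 1
sub offsets {
    my @events = @_;
    my ( $pos, $add_one ) = ( 0, 0 );
    my ( %ws, @open, @spans );

    for my $ev (@events) {
        my ( $kind, $val ) = @$ev;
        if ( $kind eq 'open' ) {
            push @open, { tag => $val, from => $pos + $add_one };
        }
        elsif ( $kind eq 'text' ) {
            $pos += length $val;
            $add_one = 1;
        }
        elsif ( $kind eq 'ws' ) {
            $ws{$pos} = 1;
            $pos += length $val;
            $add_one = 0;
        }
        elsif ( $kind eq 'close' ) {
            my $span = pop @open;
            my $fval = $span->{from};
            $fval-- if $fval > 0 && !exists $ws{ $fval - 1 };
            push @spans, [ $span->{tag}, $fval, $pos ];
        }
    }
    return @spans;    # one [ tag, from, to ] triple per closed tag
}

# <w>Augen<c>,</c></w>
my @spans = offsets(
    [ 'open', 'w' ], [ 'text', 'Augen' ],
    [ 'open', 'c' ], [ 'text', ',' ], ['close'], ['close'],
);
print join( ' ', @$_ ), "\n" for @spans;
```

In this model the 'c'-tag is first opened with from-value 6 and corrected to 5 on closing, because no whitespace-node starts at index 5. Feeding it the two 'head'-tags from the example above leaves the from-value of the 2nd 'head' at 23, the start-index of '1792'.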

[2]
Comparing the 2 examples '<w>fu</w> <w>bar</w>' and '<w>fu</w><w> </w><w>bar</w>', ' ' is handled as a
whitespace-node (XML_READER_TYPE_SIGNIFICANT_WHITESPACE) in both cases.

The from-index of the 2nd w-tag in the second example refers to 'bar', which may not have been the intention
(even though '<w> </w>' doesn't make a lot of sense). TODO: could this be a bug?

Empty tags also cling to the next text-token - e.g. in '<w>tok1</w> <w>tok2</w><a><b/></a> <w>tok3</w>', the from- and to-indices
for the tags 'a' and 'b' are both 12, which is the start-index of the token 'tok3'.

## Notes on whitespace fixing

The idea for the below code fragment was to fix (recreate) missing whitespace in a poorly created corpus, in which linebreaks were inserted
into the text and whitespace before those linebreaks may (or may not) have been unintentionally stripped.

It soon turned out that the best advice was simply to avoid linebreaks and to put all primary text tokens on one line (see the
example further down and the notes on 'Input restrictions' in the manpage).

Somehow an old, very poor first approach remained, which is not stringent, but also doesn't affect one-line text.

Examples (how primary text with linebreaks would be converted by the below code):

'...<w>end</w>\n<w>.</w>...' -> '...<w>end</w> <w>.</w>...'
'...<w>,</w>\n<w>this</w>\n<w>is</w>\n<w>it</w>\n<w>!</w>...' -> '...<w>,</w> <w>this</w> <w>is</w> <w>it</w> <w>!</w>...'

Blanks are inserted before the 1st character:

NOTE: not stringent ('...' stands for text):

beg1............................end1 => no blank before 'beg1'
beg2....<pb/>...................end2 => no blank before 'beg2'
beg3....<info attr1="val1"/>....end3 => no blank before 'beg3'
beg4....<test>ok</test>.........end4 => blank before 'beg4'

=> beg1....end1beg2...<pb/>...end2beg3....<info attr1="val1"/>....end3 beg4...<test>ok</test>....end4
                                                                      ^
                                                                      |_blank between 'end3' and 'beg4'

## Notes on segfault prevention

Calling binmode on the input handle prevents 'XML::LibXML::Reader' from segfaulting inside 'main()'
(see the notes on 'PerlIO layers' in 'man XML::LibXML').
Removing 'use open qw(:std :utf8)' would fix this problem too, but using binmode on the input is more granular.
See perluniintro: You can switch encodings on an already opened stream by using "binmode()".
See perlfunc: If LAYER is omitted or specified as ":raw" the filehandle is made suitable for passing binary data.
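
The effect of binmode on the layer stack can be inspected with the built-in PerlIO::get_layers(). This is a small sketch, assuming STDIN is the input handle in question; it only shows the layers and does not involve 'XML::LibXML::Reader' itself.

```perl
use strict;
use warnings;
use open qw(:std :utf8);    # puts a UTF-8 layer on STDIN, STDOUT and STDERR

my @before = PerlIO::get_layers(*STDIN);

# No LAYER argument: equivalent to ':raw', so the handle passes bytes through.
binmode(STDIN);

my @after = PerlIO::get_layers(*STDIN);

print "before: @before\n";
print "after:  @after\n";
```

Only STDIN is changed here; STDOUT and STDERR keep the layers set by 'use open', which is what makes this more granular than removing the pragma.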