Blame - script/tei2korapxml - KorAP/KorAP-XML-TEI

echo '<node a="v"><node1>some <n/> text</node1><node2>more-text</node2></node>' | perl -e 'use XML::CompactTree::XS; use XML::LibXML::Reader; $reader = XML::LibXML::Reader->new(IO => STDIN); $data = XML::CompactTree::XS::readSubtreeToPerl( $reader, XCT_DOCUMENT_ROOT | XCT_IGNORE_COMMENTS | XCT_LINE_NUMBERS ); print "\x27".$data->[2]->[0]->[5]->[1]->[1]."\x27\n"'

Exploring the structure of $data (a reference to the array below):

[ 0: XML_READER_TYPE_DOCUMENT,
  1: ?
  2: [ 0: [ 0: XML_READER_TYPE_ELEMENT   <- start recursion with array '$data->[2]' (see main(): retr_info( \$tree_data->[2] ))
            1: 'node'
            2: ?
            3: HASH (attributes)
            4: 1 (line number)
            5: [ 0: [ 0: XML_READER_TYPE_ELEMENT
                      1: 'node1'
                      2: ?
                      3: undefined (no attributes)
                      4: 1 (line number)
                      5: [ 0: [ 0: XML_READER_TYPE_TEXT
                                1: 'some '
                              ]
                           1: [ 0: XML_READER_TYPE_ELEMENT
                                1: 'n'
                                2: ?
                                3: undefined (no attributes)
                                4: 1 (line number)
                                5: undefined (no child-nodes)
                              ]
                           2: [ 0: XML_READER_TYPE_TEXT
                                1: ' text'
                              ]
                         ]
                    ]
                 1: [ 0: XML_READER_TYPE_ELEMENT
                      1: 'node2'
                      2: ?
                      3: undefined (no attributes)
                      4: 1 (line number)
                      5: [ 0: [ 0: XML_READER_TYPE_TEXT
                                1: 'more-text'
                              ]
                         ]
                    ]
               ]
          ]
     ]
]

$data->[0] == 9 (=> type == XML_READER_TYPE_DOCUMENT)

ref($data->[2]) == ARRAY (with 1 element for 'node')
ref($data->[2]->[0]) == ARRAY (with 6 elements)

$data->[2]->[0]->[0] == 1 (=> type == XML_READER_TYPE_ELEMENT)
$data->[2]->[0]->[1] == 'node'
ref($data->[2]->[0]->[3]) == HASH (=> ${$data->[2]->[0]->[3]}{a} == 'v')
$data->[2]->[0]->[4] == 1 (line number)
ref($data->[2]->[0]->[5]) == ARRAY (with 2 elements for 'node1' and 'node2')
  # child-nodes of the current node (see $_IDX)

ref($data->[2]->[0]->[5]->[0]) == ARRAY (with 6 elements)
$data->[2]->[0]->[5]->[0]->[0] == 1 (=> type == XML_READER_TYPE_ELEMENT)
$data->[2]->[0]->[5]->[0]->[1] == 'node1'
$data->[2]->[0]->[5]->[0]->[3] == undefined (=> no attribute)
$data->[2]->[0]->[5]->[0]->[4] == 1 (line number)
ref($data->[2]->[0]->[5]->[0]->[5]) == ARRAY (with 3 elements for 'some ', '<n/>' and ' text')

ref($data->[2]->[0]->[5]->[0]->[5]->[0]) == ARRAY (with 2 elements)
$data->[2]->[0]->[5]->[0]->[5]->[0]->[0] == 3 (=> type == XML_READER_TYPE_TEXT)
$data->[2]->[0]->[5]->[0]->[5]->[0]->[1] == 'some '

ref($data->[2]->[0]->[5]->[0]->[5]->[1]) == ARRAY (with 5 elements)
$data->[2]->[0]->[5]->[0]->[5]->[1]->[0] == 1 (=> type == XML_READER_TYPE_ELEMENT)
$data->[2]->[0]->[5]->[0]->[5]->[1]->[1] == 'n'
$data->[2]->[0]->[5]->[0]->[5]->[1]->[3] == undefined (=> no attribute)
$data->[2]->[0]->[5]->[0]->[5]->[1]->[4] == 1 (line number)
$data->[2]->[0]->[5]->[0]->[5]->[1]->[5] == undefined (=> no child-nodes)

ref($data->[2]->[0]->[5]->[0]->[5]->[2]) == ARRAY (with 2 elements)
$data->[2]->[0]->[5]->[0]->[5]->[2]->[0] == 3 (=> type == XML_READER_TYPE_TEXT)
$data->[2]->[0]->[5]->[0]->[5]->[2]->[1] == ' text'
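
The access paths above can be tried out without the XML modules. The following is a minimal sketch that rebuilds the documented structure by hand and walks it recursively. The constant names ELEMENT and TEXT and the helper 'collect_text' are made up for this illustration; index 2 of each element is left undefined because its meaning is not described in these notes.

```perl
use strict;
use warnings;

# Type codes as listed above (local stand-ins for the XML::LibXML::Reader constants).
use constant { ELEMENT => 1, TEXT => 3 };

# Hand-built copy of the documented structure (assumption: same layout as the parser output).
my $data = [
    9, undef,
    [
        [ ELEMENT, 'node', undef, { a => 'v' }, 1,
            [
                [ ELEMENT, 'node1', undef, undef, 1,
                    [ [ TEXT, 'some ' ], [ ELEMENT, 'n', undef, undef, 1 ], [ TEXT, ' text' ] ] ],
                [ ELEMENT, 'node2', undef, undef, 1, [ [ TEXT, 'more-text' ] ] ],
            ],
        ],
    ],
];

# Recursively concatenate the content of all text nodes below a list of nodes.
sub collect_text {
    my ($nodes) = @_;
    my $text = '';
    for my $e (@$nodes) {
        if ( $e->[0] == TEXT ) {
            $text .= $e->[1];
        }
        elsif ( $e->[0] == ELEMENT && defined $e->[5] ) {
            $text .= collect_text( $e->[5] );    # index 5 holds the child nodes
        }
    }
    return $text;
}

my $name = $data->[2]->[0]->[5]->[1]->[1];    # same path as in the one-liner
my $text = collect_text( $data->[2] );
print "'$name'\n";
print "'$text'\n";
```

The recursion starts at '$data->[2]', the same entry point that the notes name for retr_info().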

retr_info() is called with $_[0] = \$tree_data->[2], so ${$_[0]} is the array reference $tree_data->[2], which corresponds to
${\$data->[2]} in the above example.
Hence, the expression @{${$_[0]}} corresponds to @{${\$data->[2]}}, $e to ${${\$data->[2]}}[0] (= $data->[2]->[0]) and $e->[0] to
${${\$data->[2]}}[0]->[0] (= $data->[2]->[0]->[0]).
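
The double dereferencing can be checked in isolation. This is only a sketch: 'first_child_type' is a hypothetical helper, and the tree is a reduced, hand-built stand-in for the parser output.

```perl
use strict;
use warnings;

# Reduced stand-in for the parsed tree; only the fields used below are filled in.
my $data = [ 9, undef, [ [ 1, 'node', undef, { a => 'v' }, 1, [] ] ] ];

sub first_child_type {
    # $_[0] is a reference to the scalar holding the array reference,
    # so ${$_[0]} is the array reference and @{${$_[0]}} its elements.
    for my $e ( @{ ${ $_[0] } } ) {
        return $e->[0];
    }
    return;
}

my $via_sub    = first_child_type( \$data->[2] );
my $via_direct = $data->[2]->[0]->[0];
print "$via_sub == $via_direct\n";
```

Both expressions reach the same value, the type code of the 'node' element.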

## Notes on whitespace handling

All whitespace inside the processed text is treated as 'significant' and is recognized as a node of type
'XML_READER_TYPE_SIGNIFICANT_WHITESPACE' (see function 'retr_info()').

Definition of significant and insignificant whitespace
(source: https://www.oracle.com/technical-resources/articles/wang-whitespace.html):

  Significant whitespace is part of the document content and should be preserved.
  Insignificant whitespace is used when editing XML documents for readability.
  These whitespaces are typically not intended for inclusion in the delivery of the document.

### Regarding XML_READER_TYPE_SIGNIFICANT_WHITESPACE

Besides text-nodes (XML_READER_TYPE_TEXT) and tag-nodes (XML_READER_TYPE_ELEMENT), there is a 3rd kind of node: nodes of the type
'XML_READER_TYPE_SIGNIFICANT_WHITESPACE'.

When modifying the previous example (see: Notes on how 'XML::CompactTree::XS' works) by inserting an additional blank between
'</node1>' and '<node2>', the output for '$data->[2]->[0]->[5]->[1]->[1]' is a blank (' ') and its type is '14'
(XML_READER_TYPE_SIGNIFICANT_WHITESPACE, see 'man XML::LibXML::Reader'):

echo '<node a="v"><node1>some <n/> text</node1> <node2>more-text</node2></node>' | perl -e 'use XML::CompactTree::XS; use XML::LibXML::Reader; $reader = XML::LibXML::Reader->new(IO => STDIN); $data = XML::CompactTree::XS::readSubtreeToPerl( $reader, XCT_DOCUMENT_ROOT | XCT_IGNORE_COMMENTS | XCT_LINE_NUMBERS ); print "node=\x27".$data->[2]->[0]->[5]->[1]->[1]."\x27, type=".$data->[2]->[0]->[5]->[1]->[0]."\n"'
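
To see how the three node kinds sit next to each other, here is a small sketch over a hand-built child list for the modified example. It assumes that a whitespace node has the same two-element layout as a text node (type, content); the hash '%type_name' is a local helper and not part of any module.

```perl
use strict;
use warnings;

# Type codes as stated in these notes.
my %type_name = (
    1  => 'XML_READER_TYPE_ELEMENT',
    3  => 'XML_READER_TYPE_TEXT',
    14 => 'XML_READER_TYPE_SIGNIFICANT_WHITESPACE',
);

# Hand-built children of 'node' with a blank between '</node1>' and '<node2>'.
my $children = [
    [ 1, 'node1', undef, undef, 1, [ [ 3, 'some ' ], [ 1, 'n', undef, undef, 1 ], [ 3, ' text' ] ] ],
    [ 14, ' ' ],    # assumed layout of the whitespace node
    [ 1, 'node2', undef, undef, 1, [ [ 3, 'more-text' ] ] ],
];

my @summary = map { sprintf "node='%s', type=%d (%s)", $_->[1], $_->[0], $type_name{ $_->[0] } } @$children;
print "$_\n" for @summary;
```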

Example: '... <head type="main"><s>Campagne in Frankreich</s></head><head type="sub"> <s>1792</s> ...'

Two text-nodes should normally be separated by a blank. In the above example, that would be the 2 text-nodes
'Campagne in Frankreich' and '1792', which are separated by the whitespace-node ' ' (see [2]).

The text-node 'Campagne in Frankreich' leads to the setting of '$add_one' to 1, so that when opening the 2nd 'head'-tag,
its from-index gets set to the correct start-index of '1792' (and not to the start-index of the whitespace-node ' ').

The assumption here is that in most cases there _is_ a whitespace node between 2 text-nodes. The code fragment below
makes it possible to check, when closing a tag, whether this really _was_ the case for the last 2 'non-tag'-nodes:

When a whitespace-node is read, its from-index is stored as a hash-key (in %ws), to state that it belongs to a ws-node.
So when closing a tag, it can be checked whether the previous 'non-tag'-node (text or whitespace), which is the one before
the last read 'non-tag'-node, was actually _not_ a ws-node, but instead a text-node. In that case, the from-value of
the last read 'non-tag'-node has to be corrected (see [1]).

For whitespace-nodes $add_one is set to 0, so when opening the next tag (in the above example the 2nd 's'-tag), no
additional 1 is added (because this was already done by the whitespace-node itself when incrementing the variable $pos).

[1]
Now, what happens when 2 text-nodes are _not_ separated by a whitespace-node (e.g.: <w>Augen<c>,</c></w>)?
In this case, the falsely increased from-value has to be decreased again by 1 when closing the enclosing tag
(see above code fragment '... not exists $ws{ $fval - 1 } ...').
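
The bookkeeping described above can be modelled in a few lines. This is a simplified sketch and not the script's actual code: the sub 'spans' and the event list are invented for the illustration, the to-index is assumed to be the current text position, whitespace nodes are assumed to be a single blank, and empty tags are not modelled.

```perl
use strict;
use warnings;

# Events: [ 'open', $tag ], [ 'close', $tag ], [ 'text', $str ], [ 'ws', $str ]
sub spans {
    my @events = @_;
    my ( $pos, $add_one ) = ( 0, 0 );
    my ( %ws, @open, @result );

    for my $ev (@events) {
        my ( $kind, $val ) = @$ev;
        if ( $kind eq 'open' ) {
            push @open, $pos + $add_one;    # assume a blank follows the last text node
        }
        elsif ( $kind eq 'text' ) {
            $pos += length $val;
            $add_one = 1;
        }
        elsif ( $kind eq 'ws' ) {
            $ws{$pos} = 1;                  # from-index belongs to a whitespace node
            $pos += length $val;            # the blank itself advances the position
            $add_one = 0;
        }
        else {                              # 'close'
            my $from = pop @open;
            # no whitespace node right before: undo the false increment (see [1])
            $from-- if $from > 0 && !exists $ws{ $from - 1 };
            push @result, "$val:$from-$pos";
        }
    }
    return @result;
}

my @head = spans(
    [ 'open', 'head' ], [ 'open', 's' ], [ 'text', 'Campagne in Frankreich' ], [ 'close', 's' ], [ 'close', 'head' ],
    [ 'open', 'head' ], [ 'ws', ' ' ], [ 'open', 's' ], [ 'text', '1792' ], [ 'close', 's' ], [ 'close', 'head' ],
);
my @augen = spans(
    [ 'open', 'w' ], [ 'text', 'Augen' ], [ 'open', 'c' ], [ 'text', ',' ], [ 'close', 'c' ], [ 'close', 'w' ],
);
print "@head\n@augen\n";
```

In this model the 2nd 'head' starts at 23, the start of '1792', and the 'c'-tag in the 'Augen' example is corrected back from 6 to 5.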

[2]
In both of the examples '<w>fu</w> <w>bar</w>' and '<w>fu</w><w> </w><w>bar</w>', the ' ' is handled as a
whitespace-node (XML_READER_TYPE_SIGNIFICANT_WHITESPACE).

The from-index of the 2nd w-tag in the second example refers to 'bar', which may not have been the intention
(even though '<w> </w>' doesn't make a lot of sense). TODO: could this be a bug?

Empty tags also cling to the next text-token: e.g. in '<w>tok1</w> <w>tok2</w><a><b/></a> <w>tok3</w>', the from-
and to-indices of the tags 'a' and 'b' are both 12, which is the start-index of the token 'tok3'.

## Notes on whitespace fixing

The idea for the below code fragment was to fix (recreate) missing whitespace in a poorly created corpus, in which linebreaks were inserted
into the text and in which whitespace before those linebreaks may (or may not) have been unintentionally stripped.

It soon turned out that the best advice is simply to avoid linebreaks and to put all primary text tokens on one line (see
example further down and notes on 'Input restrictions' in the manpage).

Somehow, an old and very poor first approach remained, which is not stringent, but also doesn't affect one-line text.

Examples (how primary text with linebreaks would be converted by below code):

'...<w>end</w>\n<w>.</w>...' -> '...<w>end</w> <w>.</w>...'
'...<w>,</w>\n<w>this</w>\n<w>is</w>\n<w>it</w>\n<w>!</w>...' -> '...<w>,</w> <w>this</w> <w>is</w> <w>it</w> <w>!</w>...'
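
The effect shown in the two examples can be reproduced with a single substitution. This is a minimal sketch of the resulting mapping only, assuming '\n' stands for a real linebreak; the helper 'join_lines' is hypothetical and is not the fragment used in the script, which works on the parsed nodes and is less strict.

```perl
use strict;
use warnings;

# Hypothetical helper: replace a linebreak between a closing and an opening
# tag with a single blank.
sub join_lines {
    my ($xml) = @_;
    $xml =~ s/>\n</> </g;
    return $xml;
}

my $first  = join_lines("<w>end</w>\n<w>.</w>");
my $second = join_lines("<w>,</w>\n<w>this</w>\n<w>is</w>\n<w>it</w>\n<w>!</w>");
print "$first\n$second\n";
```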

Blanks are inserted before the 1st character:

NOTE: not stringent ('...' stands for text):

beg1............................end1 => no blank before 'beg1'
beg2....<pb/>...................end2 => no blank before 'beg2'
beg3....<info attr1="val1"/>....end3 => no blank before 'beg3'
beg4....<test>ok</test>.........end4 => blank before 'beg4'

=> beg1....end1beg2...<pb/>...end2beg3....<info attr1="val1"/>....end3 beg4...<test>ok</test>....end4
                                                                      ^
                                                                      |_blank between 'end3' and 'beg4'

## Notes on segfault prevention

Using binmode on the input handle prevents 'XML::LibXML::Reader' from segfaulting inside 'main()'
(see notes on 'PerlIO layers' in 'man XML::LibXML').
Removing 'use open qw(:std :utf8)' would fix this problem too, but using binmode on the input is more granular.
see in perluniintro: You can switch encodings on an already opened stream by using "binmode()".
see in perlfunc: If LAYER is omitted or specified as ":raw" the filehandle is made suitable for passing binary data.
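
The effect of binmode on the layers of a handle can be inspected with PerlIO::get_layers. The sketch below does not reproduce the segfault; it only shows the layer change, and it uses an in-memory handle opened with ':utf8' as an assumed stand-in for STDIN under 'use open qw(:std :utf8)'.

```perl
use strict;
use warnings;

# In-memory stand-in for the input handle (assumption for this sketch).
my $xml = "<node/>\n";
open( my $in, '<:utf8', \$xml ) or die "cannot open in-memory handle: $!";

my @before = PerlIO::get_layers($in);

# binmode without a LAYER argument behaves like ':raw'
binmode($in) or die "binmode failed: $!";

my @after = PerlIO::get_layers($in);

print "before: @before\n";
print "after:  @after\n";
```

After the call, the 'utf8' entry should no longer be reported for this handle, while other handles keep their layers. That is the sense in which binmode on the input is more granular than dropping the pragma.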