Blame - script/tei2korapxml - KorAP/KorAP-XML-TEI

echo '<node a="v"><node1>some <n/> text</node1><node2>more-text</node2></node>' | perl -e 'use XML::CompactTree::XS; use XML::LibXML::Reader; $reader = XML::LibXML::Reader->new(IO => STDIN); $data = XML::CompactTree::XS::readSubtreeToPerl( $reader, XCT_DOCUMENT_ROOT | XCT_IGNORE_COMMENTS | XCT_LINE_NUMBERS ); print "\x27".$data->[2]->[0]->[5]->[1]->[1]."\x27\n"'

Exploring the structure of $data (= reference to the array below):

[ 0: XML_READER_TYPE_DOCUMENT,
  1: ?
  2: [ 0: [ 0: XML_READER_TYPE_ELEMENT     <- start recursion with array '$data->[2]' (see main(): retr_info( \$tree_data->[2] ))
            1: 'node'
            2: ?
            3: HASH (attributes)
            4: 1 (line number)
            5: [ 0: [ 0: XML_READER_TYPE_ELEMENT
                      1: 'node1'
                      2: ?
                      3: undefined (no attributes)
                      4: 1 (line number)
                      5: [ 0: [ 0: XML_READER_TYPE_TEXT
                                1: 'some '
                              ]
                           1: [ 0: XML_READER_TYPE_ELEMENT
                                1: 'n'
                                2: ?
                                3: undefined (no attributes)
                                4: 1 (line number)
                                5: undefined (no child-nodes)
                              ]
                           2: [ 0: XML_READER_TYPE_TEXT
                                1: ' text'
                              ]
                         ]
                    ]
                 1: [ 0: XML_READER_TYPE_ELEMENT
                      1: 'node2'
                      2: ?
                      3: undefined (no attributes)
                      4: 1 (line number)
                      5: [ 0: [ 0: XML_READER_TYPE_TEXT
                                1: 'more-text'
                              ]
                         ]
                    ]
               ]
          ]
     ]
]

$data->[0] == 9 (=> type == XML_READER_TYPE_DOCUMENT)

ref($data->[2]) == ARRAY (with 1 element for 'node')
ref($data->[2]->[0]) == ARRAY (with 6 elements)

$data->[2]->[0]->[0] == 1 (=> type == XML_READER_TYPE_ELEMENT)
$data->[2]->[0]->[1] == 'node'
ref($data->[2]->[0]->[3]) == HASH (=> ${$data->[2]->[0]->[3]}{a} == 'v')
$data->[2]->[0]->[4] == 1 (line number)
ref($data->[2]->[0]->[5]) == ARRAY (with 2 elements for 'node1' and 'node2')
# child-nodes of the current node (see $_IDX)

ref($data->[2]->[0]->[5]->[0]) == ARRAY (with 6 elements)
$data->[2]->[0]->[5]->[0]->[0] == 1 (=> type == XML_READER_TYPE_ELEMENT)
$data->[2]->[0]->[5]->[0]->[1] == 'node1'
$data->[2]->[0]->[5]->[0]->[3] == undefined (=> no attribute)
$data->[2]->[0]->[5]->[0]->[4] == 1 (line number)
ref($data->[2]->[0]->[5]->[0]->[5]) == ARRAY (with 3 elements for 'some ', '<n/>' and ' text')

ref($data->[2]->[0]->[5]->[0]->[5]->[0]) == ARRAY (with 2 elements)
$data->[2]->[0]->[5]->[0]->[5]->[0]->[0] == 3 (=> type == XML_READER_TYPE_TEXT)
$data->[2]->[0]->[5]->[0]->[5]->[0]->[1] == 'some '

ref($data->[2]->[0]->[5]->[0]->[5]->[1]) == ARRAY (with 5 elements)
$data->[2]->[0]->[5]->[0]->[5]->[1]->[0] == 1 (=> type == XML_READER_TYPE_ELEMENT)
$data->[2]->[0]->[5]->[0]->[5]->[1]->[1] == 'n'
$data->[2]->[0]->[5]->[0]->[5]->[1]->[3] == undefined (=> no attribute)
$data->[2]->[0]->[5]->[0]->[5]->[1]->[4] == 1 (line number)
$data->[2]->[0]->[5]->[0]->[5]->[1]->[5] == undefined (=> no child-nodes)

ref($data->[2]->[0]->[5]->[0]->[5]->[2]) == ARRAY (with 2 elements)
$data->[2]->[0]->[5]->[0]->[5]->[2]->[0] == 3 (=> type == XML_READER_TYPE_TEXT)
$data->[2]->[0]->[5]->[0]->[5]->[2]->[1] == ' text'
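The index listing above can be read as a recipe for a recursive walk. The following is only a minimal sketch under stated assumptions: $data is built by hand to mirror the layout shown above (so that it runs without the XML modules), the entries marked '?' above are filled with undef as placeholders, and collect_text() is a hypothetical helper, not the script's retr_info().

```perl
use strict;
use warnings;

# Node type codes as given in the listing above.
use constant {
    TYPE_ELEMENT => 1,
    TYPE_TEXT    => 3,
};

# Hand-built stand-in for the array returned in the example above.
# The entries marked '?' in the notes are undef placeholders here (assumption).
my $data = [
    9, undef,
    [
        [ 1, 'node', undef, { a => 'v' }, 1,
            [
                [ 1, 'node1', undef, undef, 1,
                    [ [ 3, 'some ' ], [ 1, 'n', undef, undef, 1 ], [ 3, ' text' ] ] ],
                [ 1, 'node2', undef, undef, 1, [ [ 3, 'more-text' ] ] ],
            ] ],
    ],
];

# Hypothetical helper: concatenate all text below a list of child nodes.
sub collect_text {
    my ($children) = @_;
    my $text = '';
    for my $e ( @{ $children // [] } ) {    # index 5 is undefined for empty elements
        if    ( $e->[0] == TYPE_TEXT )    { $text .= $e->[1] }
        elsif ( $e->[0] == TYPE_ELEMENT ) { $text .= collect_text( $e->[5] ) }
    }
    return $text;
}

my $text = collect_text( $data->[2] );
print "'$text'\n";
```

Starting at $data->[2] matches the point at which the notes say the recursion begins; index 5 is only used because the example was read with XCT_LINE_NUMBERS, which puts the line number at index 4.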

retr_info() is called with \$tree_data->[2], so ${$_[0]} is the array reference $tree_data->[2], which corresponds to ${\$data->[2]}
(= $data->[2]) in the above example.
Hence, the expression @{${$_[0]}} corresponds to @{${\$data->[2]}}, $e to ${${\$data->[2]}}[0] (= $data->[2]->[0]) and $e->[0] to
${${\$data->[2]}}[0]->[0] (= $data->[2]->[0]->[0]).
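The double dereference described above can be tried out in isolation. This is a small sketch, assuming a trimmed hand-built $data; first_child_type() is a hypothetical name used for illustration only and merely imitates how retr_info() receives its argument.

```perl
use strict;
use warnings;

# Trimmed stand-in for the example structure: document node with one 'node' child.
my $data = [ 9, undef, [ [ 1, 'node' ] ] ];

# Hypothetical helper that receives a reference to the scalar $data->[2],
# which in turn holds an array reference (as described for retr_info()).
sub first_child_type {
    my @children = @{ ${ $_[0] } };    # same list as @{ $data->[2] }
    my $e        = $children[0];       # same as $data->[2]->[0]
    return $e->[0];                    # same as $data->[2]->[0]->[0]
}

my $type     = first_child_type( \$data->[2] );
my $same_ref = ${ \$data->[2] } == $data->[2] ? 1 : 0;    # both are the same array reference

print "type=$type, same_ref=$same_ref\n";
```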

## Notes on whitespace handling

All whitespace inside the processed text is 'significant' and is recognized as a node of type 'XML_READER_TYPE_SIGNIFICANT_WHITESPACE'
(see function 'retr_info()').

Definition of significant and insignificant whitespace
(source: https://www.oracle.com/technical-resources/articles/wang-whitespace.html):

Significant whitespace is part of the document content and should be preserved.
Insignificant whitespace is used when editing XML documents for readability.
These whitespaces are typically not intended for inclusion in the delivery of the document.

### Regarding XML_READER_TYPE_SIGNIFICANT_WHITESPACE

The 3rd kind of node, besides text-nodes (XML_READER_TYPE_TEXT) and tag-nodes (XML_READER_TYPE_ELEMENT), is a node of the type
'XML_READER_TYPE_SIGNIFICANT_WHITESPACE'.

When modifying the previous example (see: Notes on how 'XML::CompactTree::XS' works) by inserting an additional blank between
'</node1>' and '<node2>', the output for '$data->[2]->[0]->[5]->[1]->[1]' is a blank (' ') and its type is '14'
(XML_READER_TYPE_SIGNIFICANT_WHITESPACE, see 'man XML::LibXML::Reader'):

echo '<node a="v"><node1>some <n/> text</node1> <node2>more-text</node2></node>' | perl -e 'use XML::CompactTree::XS; use XML::LibXML::Reader; $reader = XML::LibXML::Reader->new(IO => STDIN); $data = XML::CompactTree::XS::readSubtreeToPerl( $reader, XCT_DOCUMENT_ROOT | XCT_IGNORE_COMMENTS | XCT_LINE_NUMBERS ); print "node=\x27".$data->[2]->[0]->[5]->[1]->[1]."\x27, type=".$data->[2]->[0]->[5]->[1]->[0]."\n"'

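To make the three kinds of node easier to tell apart, the numeric type codes can be mapped to their names. The sketch below assumes a hand-built copy of the child list of 'node' from the modified example (with the extra blank between '</node1>' and '<node2>'); %TYPE_NAME and @summary are names introduced here for illustration, and the '?' entries from the notes above are undef placeholders.

```perl
use strict;
use warnings;

# Type codes mentioned in these notes (see 'man XML::LibXML::Reader').
my %TYPE_NAME = (
    1  => 'XML_READER_TYPE_ELEMENT',
    3  => 'XML_READER_TYPE_TEXT',
    14 => 'XML_READER_TYPE_SIGNIFICANT_WHITESPACE',
);

# Hand-built stand-in for $data->[2]->[0]->[5] of the modified example.
my $children = [
    [ 1, 'node1', undef, undef, 1,
        [ [ 3, 'some ' ], [ 1, 'n', undef, undef, 1 ], [ 3, ' text' ] ] ],
    [ 14, ' ' ],    # the whitespace-node between '</node1>' and '<node2>'
    [ 1, 'node2', undef, undef, 1, [ [ 3, 'more-text' ] ] ],
];

# One summary line per child: index, type name and name/value at index 1.
my @summary;
for my $i ( 0 .. $#$children ) {
    my ( $type, $value ) = @{ $children->[$i] }[ 0, 1 ];
    push @summary, sprintf "%d: %s '%s'", $i, $TYPE_NAME{$type} // "type $type", $value;
}
print "$_\n" for @summary;
```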

Example: '... <head type="main"><s>Campagne in Frankreich</s></head><head type="sub"> <s>1792</s> ...'

Two text-nodes should normally be separated by a blank. In the above example, these are the 2 text-nodes
'Campagne in Frankreich' and '1792', which are separated by the whitespace-node ' ' (see [2]).

The text-node 'Campagne in Frankreich' causes '$add_one' to be set to 1, so that when the 2nd 'head'-tag is opened,
its from-index is set to the correct start-index of '1792' (and not to the start-index of the whitespace-node ' ').

The assumption here is that in most cases there _is_ a whitespace-node between 2 text-nodes. The code fragment below
makes it possible to check, when closing a tag, whether this really _was_ the case for the last 2 'non-tag'-nodes:

When a whitespace-node is read, its from-index is stored as a hash-key (in %ws), to state that it belongs to a ws-node.
So when closing a tag, it can be checked whether the previous 'non-tag'-node (text or whitespace), i.e. the one before
the last read 'non-tag'-node, was actually _not_ a ws-node but a text-node. In that case, the from-value of
the last read 'non-tag'-node has to be corrected (see [1]).

For whitespace-nodes, $add_one is set to 0, so when the next tag is opened (in the above example the 2nd 's'-tag), no
additional 1 is added (because this was already done by the whitespace-node itself when incrementing the variable $pos).

[1]

Now, what happens when 2 text-nodes are _not_ separated by a whitespace-node (e.g. <w>Augen<c>,</c></w>)?
In this case, the falsely increased from-value has to be decreased again by 1 when closing the enclosing tag
(see above code fragment '... not exists $ws{ $fval - 1 } ...').

[2]

Comparing the 2 examples '<w>fu</w> <w>bar</w>' and '<w>fu</w><w> </w><w>bar</w>': in both cases, ' ' is handled as a
whitespace-node (XML_READER_TYPE_SIGNIFICANT_WHITESPACE).

The from-index of the 2nd w-tag in the second example refers to 'bar', which may not have been the intention
(even though '<w> </w>' doesn't make a lot of sense). TODO: could this be a bug?

Empty tags also cling to the next text-token: e.g. in '<w>tok1</w> <w>tok2</w><a><b/></a> <w>tok3</w>', the from-
and to-indices of the tags 'a' and 'b' are both 12, which is the start-index of the token 'tok3'.

## Notes on whitespace fixing

The idea behind the code fragment below was to fix (recreate) missing whitespace in a poorly created corpus, in which linebreaks were inserted
into the text and whitespace before those linebreaks may (or may not) have been unintentionally stripped.

It soon turned out that the best advice is to avoid linebreaks altogether and to put all primary text tokens on one line (see the
example further down and the notes on 'Input restrictions' in the manpage).

Somehow, a very poor first approach remained, which is not stringent but also doesn't affect one-line text.

Examples (how primary text with linebreaks would be converted by the code below):

'...<w>end</w>\n<w>.</w>...' -> '...<w>end</w> <w>.</w>...'
'...<w>,</w>\n<w>this</w>\n<w>is</w>\n<w>it</w>\n<w>!</w>...' -> '<w>,</w> <w>this</w> <w>is</w> <w>it</w> <w>!</w>'.

Blanks are inserted before the 1st character:

NOTE: not stringent ('...' stands for text):

beg1............................end1 => no blank before 'beg1'
beg2....<pb/>...................end2 => no blank before 'beg2'
beg3....<info attr1="val1"/>....end3 => no blank before 'beg3'
beg4....<test>ok</test>.........end4 => blank before 'beg4'

=> beg1....end1beg2...<pb/>...end2beg3....<info attr1="val1"/>....end3 beg4...<test>ok</test>....end4
                                                                      ^
                                                                      |_blank between 'end3' and 'beg4'

## Notes on segfault prevention

Calling binmode on the input handle prevents 'XML::LibXML::Reader' from segfaulting inside 'main()'
(see notes on 'PerlIO layers' in 'man XML::LibXML').
Removing 'use open qw(:std :utf8)' would fix this problem too, but using binmode on the input is more granular.
See perluniintro: You can switch encodings on an already opened stream by using "binmode()".
See perlfunc: If LAYER is omitted or specified as ":raw" the filehandle is made suitable for passing binary data.
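The effect of binmode on the input handle can be inspected with PerlIO::get_layers(). This is a minimal sketch, assuming the same 'use open qw(:std :utf8)' setup as described above; it only shows the layers of STDIN before and after the call and does not create an 'XML::LibXML::Reader' itself.

```perl
use strict;
use warnings;
use open qw(:std :utf8);    # same pragma as mentioned in the notes above

# Layers on STDIN before and after switching the handle to binary mode.
my @before = PerlIO::get_layers( \*STDIN );
binmode(STDIN);             # LAYER omitted, i.e. handled like ':raw'
my @after = PerlIO::get_layers( \*STDIN );

print "before: @before\n";
print "after:  @after\n";

# A reader created from STDIN at this point would receive undecoded bytes,
# while STDOUT and STDERR keep the layers set by the pragma.
```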