Blame - script/tei2korapxml - KorAP/KorAP-XML-TEI

echo '<node a="v"><node1>some <n/> text</node1><node2>more-text</node2></node>' | perl -e 'use XML::CompactTree::XS; use XML::LibXML::Reader; $reader = XML::LibXML::Reader->new(IO => STDIN); $data = XML::CompactTree::XS::readSubtreeToPerl( $reader, XCT_DOCUMENT_ROOT | XCT_IGNORE_COMMENTS | XCT_LINE_NUMBERS ); print "\x27".$data->[2]->[0]->[5]->[1]->[1]."\x27\n"'

Exploring the structure of $data ( = reference to below array ):

[ 0: XML_READER_TYPE_DOCUMENT,
  1: ?
  2: [ 0: [ 0: XML_READER_TYPE_ELEMENT     <- start recursion with array '$data->[2]' (see main(): retr_info( \$tree_data->[2] ))
            1: 'node'
            2: ?
            3: HASH (attributes)
            4: 1 (line number)
            5: [ 0: [ 0: XML_READER_TYPE_ELEMENT
                      1: 'node1'
                      2: ?
                      3: undefined (no attributes)
                      4: 1 (line number)
                      5: [ 0: [ 0: XML_READER_TYPE_TEXT
                                1: 'some '
                              ]
                           1: [ 0: XML_READER_TYPE_ELEMENT
                                1: 'n'
                                2: ?
                                3: undefined (no attributes)
                                4: 1 (line number)
                                5: undefined (no child-nodes)
                              ]
                           2: [ 0: XML_READER_TYPE_TEXT
                                1: ' text'
                              ]
                         ]
                    ]
                 1: [ 0: XML_READER_TYPE_ELEMENT
                      1: 'node2'
                      2: ?
                      3: undefined (no attributes)
                      4: 1 (line number)
                      5: [ 0: [ 0: XML_READER_TYPE_TEXT
                                1: 'more-text'
                              ]
                         ]
                    ]
               ]
          ]
     ]
]

$data->[0] == 9 (=> type == XML_READER_TYPE_DOCUMENT)

ref($data->[2]) == ARRAY (with 1 element for 'node')
ref($data->[2]->[0]) == ARRAY (with 6 elements)

$data->[2]->[0]->[0] == 1 (=> type == XML_READER_TYPE_ELEMENT)
$data->[2]->[0]->[1] == 'node'
ref($data->[2]->[0]->[3]) == HASH (=> ${$data->[2]->[0]->[3]}{a} == 'v')
$data->[2]->[0]->[4] == 1 (line number)
ref($data->[2]->[0]->[5]) == ARRAY (with 2 elements for 'node1' and 'node2')
  # child-nodes of actual node (see $_IDX)

ref($data->[2]->[0]->[5]->[0]) == ARRAY (with 6 elements)
$data->[2]->[0]->[5]->[0]->[0] == 1 (=> type == XML_READER_TYPE_ELEMENT)
$data->[2]->[0]->[5]->[0]->[1] == 'node1'
$data->[2]->[0]->[5]->[0]->[3] == undefined (=> no attribute)
$data->[2]->[0]->[5]->[0]->[4] == 1 (line number)
ref($data->[2]->[0]->[5]->[0]->[5]) == ARRAY (with 3 elements for 'some ', '<n/>' and ' text')

ref($data->[2]->[0]->[5]->[0]->[5]->[0]) == ARRAY (with 2 elements)
$data->[2]->[0]->[5]->[0]->[5]->[0]->[0] == 3 (=> type == XML_READER_TYPE_TEXT)
$data->[2]->[0]->[5]->[0]->[5]->[0]->[1] == 'some '

ref($data->[2]->[0]->[5]->[0]->[5]->[1]) == ARRAY (with 5 elements)
$data->[2]->[0]->[5]->[0]->[5]->[1]->[0] == 1 (=> type == XML_READER_TYPE_ELEMENT)
$data->[2]->[0]->[5]->[0]->[5]->[1]->[1] == 'n'
$data->[2]->[0]->[5]->[0]->[5]->[1]->[3] == undefined (=> no attribute)
$data->[2]->[0]->[5]->[0]->[5]->[1]->[4] == 1 (line number)
$data->[2]->[0]->[5]->[0]->[5]->[1]->[5] == undefined (=> no child-nodes)

ref($data->[2]->[0]->[5]->[0]->[5]->[2]) == ARRAY (with 2 elements)
$data->[2]->[0]->[5]->[0]->[5]->[2]->[0] == 3 (=> type == XML_READER_TYPE_TEXT)
$data->[2]->[0]->[5]->[0]->[5]->[2]->[1] == ' text'
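The layout listed above can be traversed recursively. The following is a minimal sketch, not the script's retr_info(): it hard-codes the example tree instead of calling XML::CompactTree::XS, fills the slots marked '?' with undef, and uses the numeric node types from the listing (1, 3, 9). The names TYPE_* and collect_text() are made up for illustration.

```perl
use strict;
use warnings;

# Numeric node types as given in the listing above (names are illustrative)
use constant {
  TYPE_ELEMENT  => 1,    # XML_READER_TYPE_ELEMENT
  TYPE_TEXT     => 3,    # XML_READER_TYPE_TEXT
  TYPE_DOCUMENT => 9,    # XML_READER_TYPE_DOCUMENT
};

# Hand-written copy of the example tree; slots marked '?' are left undef
my $data = [
  TYPE_DOCUMENT, undef,
  [
    [ TYPE_ELEMENT, 'node', undef, { a => 'v' }, 1, [
        [ TYPE_ELEMENT, 'node1', undef, undef, 1, [
            [ TYPE_TEXT, 'some ' ],
            [ TYPE_ELEMENT, 'n', undef, undef, 1 ],    # no child-nodes
            [ TYPE_TEXT, ' text' ],
        ] ],
        [ TYPE_ELEMENT, 'node2', undef, undef, 1, [
            [ TYPE_TEXT, 'more-text' ],
        ] ],
    ] ],
  ],
];

# Concatenate all text below a list of nodes (index 5 holds the child-nodes)
sub collect_text {
  my ($nodes) = @_;
  my $text = '';
  for my $e ( @{ $nodes // [] } ) {
    if    ( $e->[0] == TYPE_TEXT )    { $text .= $e->[1] }
    elsif ( $e->[0] == TYPE_ELEMENT ) { $text .= collect_text( $e->[5] ) }
  }
  return $text;
}

my $text = collect_text( $data->[2] );
print "'$text'\n";
```

The recursion starts at $data->[2], the same entry point the notes name for retr_info().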

retr_info() is called with $_[0] = \$tree_data->[2], i.e. a reference to the array reference. Dereferencing it once (${$_[0]}) yields
the array reference itself, which corresponds to ${\$data->[2]} (= $data->[2]) in the above example.
Hence, the expression @{${$_[0]}} corresponds to @{${\$data->[2]}}, $e to ${${\$data->[2]}}[0] (= $data->[2]->[0]) and $e->[0] to
${${\$data->[2]}}[0]->[0] (= $data->[2]->[0]->[0]).
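The double dereference can be tried in isolation. This is only a sketch under assumptions: $tree_data is a shortened stand-in for the example tree, and first_child() is a hypothetical helper that mimics how a sub receives \$tree_data->[2]; it is not the script's retr_info().

```perl
use strict;
use warnings;

# Shortened stand-in for the example tree (9 = document, 1 = element)
my $tree_data = [ 9, undef, [ [ 1, 'node', undef, { a => 'v' }, 1 ] ] ];

# Hypothetical helper: $_[0] is a reference to the array reference of child-nodes
sub first_child {
  my @children = @{ ${ $_[0] } };    # ref -> array reference -> list of nodes
  my $e        = $children[0];       # corresponds to $tree_data->[2]->[0]
  return ( $e->[0], $e->[1] );       # node type and node name
}

my ( $type, $name ) = first_child( \$tree_data->[2] );
print "type=$type, name=$name\n";
```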

## Notes on whitespace handling

Every whitespace inside the processed text is 'significant' and recognized as a node of type 'XML_READER_TYPE_SIGNIFICANT_WHITESPACE'
(see function 'retr_info()').

Definition of significant and insignificant whitespace
(source: https://www.oracle.com/technical-resources/articles/wang-whitespace.html):

Significant whitespace is part of the document content and should be preserved.
Insignificant whitespace is used when editing XML documents for readability.
These whitespaces are typically not intended for inclusion in the delivery of the document.

### Regarding XML_READER_TYPE_SIGNIFICANT_WHITESPACE

The third kind of node, besides text-nodes (XML_READER_TYPE_TEXT) and tag-nodes (XML_READER_TYPE_ELEMENT), is the node of type
'XML_READER_TYPE_SIGNIFICANT_WHITESPACE'.

When modifying the previous example (see: Notes on how 'XML::CompactTree::XS' works) by inserting an additional blank between
'</node1>' and '<node2>', the output for '$data->[2]->[0]->[5]->[1]->[1]' is a blank (' ') and its type is '14'
(XML_READER_TYPE_SIGNIFICANT_WHITESPACE, see 'man XML::LibXML::Reader'):

echo '<node a="v"><node1>some <n/> text</node1> <node2>more-text</node2></node>' | perl -e 'use XML::CompactTree::XS; use XML::LibXML::Reader; $reader = XML::LibXML::Reader->new(IO => STDIN); $data = XML::CompactTree::XS::readSubtreeToPerl( $reader, XCT_DOCUMENT_ROOT | XCT_IGNORE_COMMENTS | XCT_LINE_NUMBERS ); print "node=\x27".$data->[2]->[0]->[5]->[1]->[1]."\x27, type=".$data->[2]->[0]->[5]->[1]->[0]."\n"'
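To see the three kinds of nodes side by side without running the parser, the child list of 'node' for the modified input can be written out by hand. This is a sketch under assumptions: the list is hard-coded from the description above (the whitespace-node is assumed to hold just its type and value), and %KIND is an illustrative lookup table, not something the script defines.

```perl
use strict;
use warnings;

# Numeric node types mentioned in these notes (labels are illustrative)
my %KIND = ( 1 => 'element', 3 => 'text', 14 => 'significant whitespace' );

# Hand-written child-nodes of 'node' after inserting a blank before '<node2>'
my $children = [
  [ 1, 'node1', undef, undef, 1, [ [ 3, 'some ' ], [ 1, 'n', undef, undef, 1 ], [ 3, ' text' ] ] ],
  [ 14, ' ' ],    # the additional blank shows up as its own node
  [ 1, 'node2', undef, undef, 1, [ [ 3, 'more-text' ] ] ],
];

# Classify each child by its type (index 0)
my @kinds = map { $KIND{ $_->[0] } // 'other' } @$children;
print join( ', ', @kinds ), "\n";
print "node='$children->[1]->[1]', type=$children->[1]->[0]\n";
```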

Example: '... <head type="main"><s>Campagne in Frankreich</s></head><head type="sub"> <s>1792</s> ...'

Two text-nodes should normally be separated by a blank. In the above example, these are the 2 text-nodes
'Campagne in Frankreich' and '1792', which are separated by the whitespace-node ' ' (see [2]).

The text-node 'Campagne in Frankreich' sets '$add_one' to 1, so that when the 2nd 'head'-tag is opened,
its from-index gets set to the correct start-index of '1792' (and not to the start-index of the whitespace-node ' ').

The assumption here is that in most cases there _is_ a whitespace-node between 2 text-nodes. The below code fragment
provides a way to check, when closing a tag, whether this really _was_ the case for the last 2 'non-tag'-nodes:

When a whitespace-node is read, its from-index is stored as a hash-key (in %ws) to record that it belongs to a ws-node.
So when closing a tag, it can be checked whether the previous 'non-tag'-node (text or whitespace), which is the one before
the last read 'non-tag'-node, was actually _not_ a ws-node, but a text-node. In that case, the from-value of
the last read 'non-tag'-node has to be corrected (see [1]).

For whitespace-nodes $add_one is set to 0, so when opening the next tag (in the above example the 2nd 's'-tag), no
additional 1 is added (because this was already done by the whitespace-node itself when incrementing the variable $pos).

[1]

Now, what happens when 2 text-nodes are _not_ separated by a whitespace-node (e.g.: <w>Augen<c>,</c></w>)?
In this case, the falsely increased from-value has to be decreased again by 1 when closing the enclosing tag
(see above code fragment '... not exists $ws{ $fval - 1 } ...').
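The correction described in [1] can be illustrated in isolation. The sketch below is not the script's code: corrected_from() is a hypothetical helper that only mirrors the quoted condition 'not exists $ws{ $fval - 1 }', and the offsets are worked out by hand under the assumption that indices count characters of the primary text starting at 0.

```perl
use strict;
use warnings;

# Hypothetical helper: undo the assumed blank if no whitespace-node was
# recorded directly before the from-value
sub corrected_from {
  my ( $fval, $ws ) = @_;
  return $fval - 1 if $fval > 0 && !exists $ws->{ $fval - 1 };
  return $fval;
}

# Case 1: 'Campagne in Frankreich' covers 0..21, the whitespace-node sits at 22,
# so the from-value 23 for '1792' is already correct
my %ws        = ( 22 => 1 );
my $from_1792 = corrected_from( 23, \%ws );

# Case 2: '<w>Augen<c>,</c></w>': 'Augen' covers 0..4; assuming a blank follows
# puts ',' at 6, but no whitespace-node was recorded at 5
my $from_comma = corrected_from( 6, {} );

print "from('1792')=$from_1792, from(',')=$from_comma\n";
```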

[2]

Comparing the 2 examples '<w>fu</w> <w>bar</w>' and '<w>fu</w><w> </w><w>bar</w>': in both cases ' ' is handled as a
whitespace-node (XML_READER_TYPE_SIGNIFICANT_WHITESPACE).

The from-index of the 2nd w-tag in the second example refers to 'bar', which may not have been the intention
(even though '<w> </w>' doesn't make a lot of sense). TODO: could this be a bug?

Empty tags also cling to the next text-token - e.g. in '<w>tok1</w> <w>tok2</w><a><b/></a> <w>tok3</w>' the from- and
to-indices of the tags 'a' and 'b' are both 12, which is the start-index of the token 'tok3'.

## Notes on whitespace fixing

The idea for the below code fragment was to fix (recreate) missing whitespace in a poorly created corpus, in which linebreaks were inserted
into the text, with the complication that whitespace before those linebreaks may (or may not) have been unintentionally stripped.

It soon turned out that it was best to recommend avoiding linebreaks altogether and putting all primary text tokens into one line (see
example further down and notes on 'Input restrictions' in the manpage).

Somehow an old, very poor first approach remained, which is not stringent, but also doesn't affect one-line text.

Examples (how primary text with linebreaks would be converted by below code):

'...<w>end</w>\n<w>.</w>...' -> '...<w>end</w> <w>.</w>...'
'...<w>,</w>\n<w>this</w>\n<w>is</w>\n<w>it</w>\n<w>!</w>...' -> '...<w>,</w> <w>this</w> <w>is</w> <w>it</w> <w>!</w>...'.

Blanks are inserted before the 1st character:

NOTE: not stringent ('...' stands for text):

beg1............................end1 => no blank before 'beg1'
beg2....<pb/>...................end2 => no blank before 'beg2'
beg3....<info attr1="val1"/>....end3 => no blank before 'beg3'
beg4....<test>ok</test>.........end4 => blank before 'beg4'

=> beg1....end1beg2...<pb/>...end2beg3....<info attr1="val1"/>....end3 beg4...<test>ok</test>....end4
                                                                      ^
                                                                      |_blank between 'end3' and 'beg4'
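For input that still contains linebreaks between tokens, one way to follow the 'one line' recommendation is to normalize the text before conversion. The sketch below is an assumption, not the code fragment these notes refer to: join_token_lines() is a hypothetical helper that applies a uniform rule (a linebreak between a closing '>' and an opening '<' becomes one blank). It reproduces the two '<w>' examples above, but deliberately not the non-stringent behaviour shown for 'beg1' to 'beg4'.

```perl
use strict;
use warnings;

# Hypothetical preprocessing helper (not part of the script):
# replace a linebreak between two tags by a single blank
sub join_token_lines {
  my ($xml) = @_;
  ( my $out = $xml ) =~ s/>\n</> </g;
  return $out;
}

my $fixed = join_token_lines("...<w>end</w>\n<w>.</w>...");
print "$fixed\n";
```

Text that is already on one line passes through unchanged, since the pattern only matches a linebreak directly between two tags.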

## Notes on segfault prevention

Calling binmode on the input handle prevents segfaulting of 'XML::LibXML::Reader' inside 'main()'
(see notes on 'PerlIO layers' in 'man XML::LibXML').
Removing 'use open qw(:std :utf8)' would fix this problem too, but using binmode on the input is more granular.
See in perluniintro: You can switch encodings on an already opened stream by using "binmode()".
See in perlfunc: If LAYER is omitted or specified as ":raw" the filehandle is made suitable for passing binary data.
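The effect of binmode on the layers of a handle can be inspected with PerlIO::get_layers(). This is a sketch under assumptions: it uses an in-memory handle opened with ':utf8' as a stand-in for an input handle affected by 'use open qw(:std :utf8)', and it does not involve XML::LibXML::Reader at all; it only shows that binmode without a LAYER argument drops the utf8 flag again.

```perl
use strict;
use warnings;

my $xml = '<node a="v"/>';

# In-memory handle with a :utf8 layer, standing in for the script's input handle
open( my $fh, '<:utf8', \$xml ) or die "cannot open in-memory handle: $!";
my @before = PerlIO::get_layers($fh);

binmode($fh);    # no LAYER given, which perlfunc treats like ':raw'
my @after = PerlIO::get_layers($fh);

print "before: @before\n";
print "after:  @after\n";
close($fh);
```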