Blame - script/tei2korapxml - KorAP/KorAP-XML-TEI

echo '<node a="v"><node1>some <n/> text</node1><node2>more-text</node2></node>' | perl -e 'use XML::CompactTree::XS; use XML::LibXML::Reader; $reader = XML::LibXML::Reader->new(IO => STDIN); $data = XML::CompactTree::XS::readSubtreeToPerl( $reader, XCT_DOCUMENT_ROOT | XCT_IGNORE_COMMENTS | XCT_LINE_NUMBERS ); print "\x27".$data->[2]->[0]->[5]->[1]->[1]."\x27\n"'

Exploring the structure of $data ( = reference to below array ):

[ 0: XML_READER_TYPE_DOCUMENT,
  1: ?
  2: [ 0: [ 0: XML_READER_TYPE_ELEMENT     <- start recursion with array '$data->[2]' (see retr_info( \$tree_data->[2] ))
            1: 'node'
            2: ?
            3: HASH (attributes)
            4: 1 (line number)
            5: [ 0: [ 0: XML_READER_TYPE_ELEMENT
                      1: 'node1'
                      2: ?
                      3: undefined (no attributes)
                      4: 1 (line number)
                      5: [ 0: [ 0: XML_READER_TYPE_TEXT
                                1: 'some '
                              ]
                           1: [ 0: XML_READER_TYPE_ELEMENT
                                1: 'n'
                                2: ?
                                3: undefined (no attributes)
                                4: 1 (line number)
                                5: undefined (no child-nodes)
                              ]
                           2: [ 0: XML_READER_TYPE_TEXT
                                1: ' text'
                              ]
                         ]
                    ]
                 1: [ 0: XML_READER_TYPE_ELEMENT
                      1: 'node2'
                      2: ?
                      3: undefined (no attributes)
                      4: 1 (line number)
                      5: [ 0: [ 0: XML_READER_TYPE_TEXT
                                1: 'more-text'
                              ]
                         ]
                    ]
               ]
          ]
     ]
]

$data->[0] == 9 (=> type == XML_READER_TYPE_DOCUMENT)

ref($data->[2]) == ARRAY (with 1 element for 'node')
ref($data->[2]->[0]) == ARRAY (with 6 elements)

$data->[2]->[0]->[0] == 1 (=> type == XML_READER_TYPE_ELEMENT)
$data->[2]->[0]->[1] == 'node'
ref($data->[2]->[0]->[3]) == HASH (=> ${$data->[2]->[0]->[3]}{a} == 'v')
$data->[2]->[0]->[4] == 1 (line number)
ref($data->[2]->[0]->[5]) == ARRAY (with 2 elements for 'node1' and 'node2')
                                   # child-nodes of current node (see $_IDX)

ref($data->[2]->[0]->[5]->[0]) == ARRAY (with 6 elements)
$data->[2]->[0]->[5]->[0]->[0] == 1 (=> type == XML_READER_TYPE_ELEMENT)
$data->[2]->[0]->[5]->[0]->[1] == 'node1'
$data->[2]->[0]->[5]->[0]->[3] == undefined (=> no attribute)
$data->[2]->[0]->[5]->[0]->[4] == 1 (line number)
ref($data->[2]->[0]->[5]->[0]->[5]) == ARRAY (with 3 elements for 'some ', '<n/>' and ' text')

ref($data->[2]->[0]->[5]->[0]->[5]->[0]) == ARRAY (with 2 elements)
$data->[2]->[0]->[5]->[0]->[5]->[0]->[0] == 3 (=> type == XML_READER_TYPE_TEXT)
$data->[2]->[0]->[5]->[0]->[5]->[0]->[1] == 'some '

ref($data->[2]->[0]->[5]->[0]->[5]->[1]) == ARRAY (with 5 elements)
$data->[2]->[0]->[5]->[0]->[5]->[1]->[0] == 1 (=> type == XML_READER_TYPE_ELEMENT)
$data->[2]->[0]->[5]->[0]->[5]->[1]->[1] == 'n'
$data->[2]->[0]->[5]->[0]->[5]->[1]->[3] == undefined (=> no attribute)
$data->[2]->[0]->[5]->[0]->[5]->[1]->[4] == 1 (line number)
$data->[2]->[0]->[5]->[0]->[5]->[1]->[5] == undefined (=> no child-nodes)

ref($data->[2]->[0]->[5]->[0]->[5]->[2]) == ARRAY (with 2 elements)
$data->[2]->[0]->[5]->[0]->[5]->[2]->[0] == 3 (=> type == XML_READER_TYPE_TEXT)
$data->[2]->[0]->[5]->[0]->[5]->[2]->[1] == ' text'


retr_info() is called with \$tree_data->[2], so ${$_[0]} is the array reference $tree_data->[2], which corresponds to ${\$data->[2]} (= $data->[2]) in the above example.
Hence, the expression @{${$_[0]}} corresponds to @{${\$data->[2]}}, $e to ${${\$data->[2]}}[0] (= $data->[2]->[0]) and $e->[0] to
${${\$data->[2]}}[0]->[0] (= $data->[2]->[0]->[0]).
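
The double dereference can be tried out without the XS module. The following is a minimal sketch on a hand-built copy of '$data->[2]' from the first example; the function name 'walk' is a hypothetical stand-in for retr_info(), the slots marked '?' above are simply left undefined, and the type numbers 1 and 3 are the ones listed above.

```perl
use strict;
use warnings;

# Node types as listed above (values from XML::LibXML::Reader).
use constant {
    XML_READER_TYPE_ELEMENT => 1,
    XML_READER_TYPE_TEXT    => 3,
};
my $_IDX = 5;    # slot of the child-node array, as in the dump above

# Hand-built copy of $data->[2] for '<node a="v"><node1>some <n/> text</node1><node2>more-text</node2></node>'.
my $tree = [
    [ 1, 'node', undef, { a => 'v' }, 1, [
        [ 1, 'node1', undef, undef, 1, [
            [ 3, 'some ' ],
            [ 1, 'n', undef, undef, 1 ],    # no slot 5: no child-nodes
            [ 3, ' text' ],
        ] ],
        [ 1, 'node2', undef, undef, 1, [ [ 3, 'more-text' ] ] ],
    ] ],
];

# Hypothetical stand-in for retr_info(): $_[0] is a reference to an array reference.
sub walk {
    my $depth = shift;
    my @out;
    foreach my $e ( @{ ${ $_[0] } } ) {
        if ( $e->[0] == XML_READER_TYPE_ELEMENT ) {
            push @out, ( '  ' x $depth ) . "<$e->[1]>";
            # recurse only if there is an array of child-nodes
            push @out, walk( $depth + 1, \$e->[$_IDX] ) if defined $e->[$_IDX];
        }
        elsif ( $e->[0] == XML_READER_TYPE_TEXT ) {
            push @out, ( '  ' x $depth ) . "'$e->[1]'";
        }
    }
    return @out;
}

my @lines = walk( 0, \$tree );
print "$_\n" for @lines;
```

Passing '\$tree' mirrors the call 'retr_info( \$tree_data->[2] )': inside the function, '${ $_[0] }' is the array reference itself and '@{ ${ $_[0] } }' is the list of nodes on the current level.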


## Notes on whitespace handling


All whitespace inside the processed text is 'significant' and is recognized as nodes of type 'XML_READER_TYPE_SIGNIFICANT_WHITESPACE'
(see function 'retr_info()').

Definition of significant and insignificant whitespace
(source: https://www.oracle.com/technical-resources/articles/wang-whitespace.html):

Significant whitespace is part of the document content and should be preserved.
Insignificant whitespace is used when editing XML documents for readability.
These whitespaces are typically not intended for inclusion in the delivery of the document.

### Regarding XML_READER_TYPE_SIGNIFICANT_WHITESPACE

The 3rd form of nodes, besides text-nodes (XML_READER_TYPE_TEXT) and tag-nodes (XML_READER_TYPE_ELEMENT), are nodes of the type
'XML_READER_TYPE_SIGNIFICANT_WHITESPACE'.

When modifying the previous example (see: Notes on how 'XML::CompactTree::XS' works) by inserting an additional blank between
'</node1>' and '<node2>', the output for '$data->[2]->[0]->[5]->[1]->[1]' is a blank (' ') and its type is '14'
(XML_READER_TYPE_SIGNIFICANT_WHITESPACE, see 'man XML::LibXML::Reader'):


echo '<node a="v"><node1>some <n/> text</node1> <node2>more-text</node2></node>' | perl -e 'use XML::CompactTree::XS; use XML::LibXML::Reader; $reader = XML::LibXML::Reader->new(IO => STDIN); $data = XML::CompactTree::XS::readSubtreeToPerl( $reader, XCT_DOCUMENT_ROOT | XCT_IGNORE_COMMENTS | XCT_LINE_NUMBERS ); print "node=\x27".$data->[2]->[0]->[5]->[1]->[1]."\x27, type=".$data->[2]->[0]->[5]->[1]->[0]."\n"'
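
The effect on the child list can also be illustrated without the XS module. This is only a sketch: '$children' is a hand-built stand-in for '$data->[2]->[0]->[5]' of the modified example (the child-nodes of 'node1' and 'node2' are omitted), and the '%kind' table is a made-up lookup using the type numbers given in these notes.

```perl
use strict;
use warnings;

# Type numbers as stated in the notes (see 'man XML::LibXML::Reader').
my %kind = (
    1  => 'XML_READER_TYPE_ELEMENT',
    3  => 'XML_READER_TYPE_TEXT',
    14 => 'XML_READER_TYPE_SIGNIFICANT_WHITESPACE',
);

# Hand-built stand-in for the child-nodes of 'node' after inserting the blank.
my $children = [
    [ 1,  'node1', undef, undef, 1 ],
    [ 14, ' ' ],                        # the blank between '</node1>' and '<node2>'
    [ 1,  'node2', undef, undef, 1 ],
];

my @report;
for my $i ( 0 .. $#$children ) {
    my ( $type, $value ) = @{ $children->[$i] }[ 0, 1 ];
    push @report, "[$i] value='$value', type=$type ($kind{$type})";
}
print "$_\n" for @report;
```

The whitespace-node takes index 1 of the child list, so 'node2' moves from index 1 to index 2.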


Example: '... <head type="main"><s>Campagne in Frankreich</s></head><head type="sub"> <s>1792</s> ...'


Two text-nodes should normally be separated by a blank. In the above example, these are the 2 text-nodes
'Campagne in Frankreich' and '1792', which are separated by the whitespace-node ' ' (see [2]).

The text-node 'Campagne in Frankreich' causes '$add_one' to be set to 1, so that when the 2nd 'head'-tag is opened,
its from-index is set to the correct start-index of '1792' (and not to the start-index of the whitespace-node ' ').

The assumption here is that in most cases there _is_ a whitespace-node between 2 text-nodes. The below code fragment
provides a way to check, when closing a tag, whether this really _was_ the case for the last 2 'non-tag'-nodes:

When a whitespace-node is read, its from-index is stored as a hash-key (in %ws) to mark that index as belonging to a ws-node.
So when closing a tag, it can be checked whether the previous 'non-tag'-node (text or whitespace), i.e. the one before
the last read 'non-tag'-node, was actually _not_ a ws-node but a text-node. In that case, the from-value of
the last read 'non-tag'-node has to be corrected (see [1]).

For whitespace-nodes, $add_one is set to 0, so when opening the next tag (in the above example the 2nd 's'-tag), no
additional 1 is added (because this was already done by the whitespace-node itself when incrementing the variable $pos).


[1]
Now, what happens when 2 text-nodes are _not_ separated by a whitespace-node (e.g.: <w>Augen<c>,</c></w>)?
In this case, the falsely increased from-value has to be decreased again by 1 when closing the enclosing tag
(see above code fragment '... not exists $ws{ $fval - 1 } ...').

[2]
Comparing the 2 examples '<w>fu</w> <w>bar</w>' and '<w>fu</w><w> </w><w>bar</w>', ' ' is handled in both cases as a
whitespace-node (XML_READER_TYPE_SIGNIFICANT_WHITESPACE).

The from-index of the 2nd w-tag in the second example refers to 'bar', which may not have been the intention
(even though '<w> </w>' doesn't make a lot of sense). TODO: could this be a bug?

Empty tags also cling to the next text-token - e.g. in '<w>tok1</w> <w>tok2</w><a><b/></a> <w>tok3</w>', the from-
and to-indices for the tags 'a' and 'b' are both 12, which is the start-index of the token 'tok3'.
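
The interplay of $pos, $add_one and %ws can be modelled in a few lines. The sketch below is a simplified model written for these notes, not the code of retr_info(): the function 'spans' and its event list are hypothetical, offsets simply start at 0, and the rule for empty elements (from == to == tentative from-index, no correction) is an assumption chosen so that empty tags take the from-index of the next token, as described above.

```perl
use strict;
use warnings;

# Hypothetical, simplified model of the offset bookkeeping.
# Events: [ 'open', name ], [ 'close', name ], [ 'text', string ], [ 'ws', string ].
sub spans {
    my @events = @_;
    my ( $pos, $add_one, %ws ) = ( 0, 0 );
    my ( @stack, @spans );
    for my $ev (@events) {
        my ( $kind, $val ) = @$ev;
        if ( $kind eq 'open' ) {
            # name, tentative from-index, position at opening time
            push @stack, [ $val, $pos + $add_one, $pos ];
        }
        elsif ( $kind eq 'text' ) {
            $pos += length $val;
            $add_one = 1;     # assume that a blank follows this text-node
        }
        elsif ( $kind eq 'ws' ) {
            $ws{$pos} = 1;    # a whitespace-node starts at this index
            $pos += length $val;
            $add_one = 0;     # the blank has already been counted in $pos
        }
        else {                # 'close'
            my ( $name, $from, $opened_at ) = @{ pop @stack };
            my $to = $from;   # assumption: empty element, from == to
            if ( $pos > $opened_at ) {    # text or whitespace was read inside
                # undo the increment if no whitespace-node precedes the from-index
                $from-- if $from > $opened_at && !exists $ws{ $from - 1 };
                $to = $pos;
            }
            push @spans, [ $name, $from, $to ];
        }
    }
    return @spans;
}

# <w>Augen<c>,</c></w>
my @spans = spans(
    [ 'open', 'w' ], [ 'text', 'Augen' ],
    [ 'open', 'c' ], [ 'text', ',' ], [ 'close', 'c' ],
    [ 'close', 'w' ],
);
print join( ' ', @$_ ), "\n" for @spans;
```

In this model the 'c'-tag is first given the from-index 6 and is corrected to 5 on closing, because no whitespace-node was recorded at index 5.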


## Notes on whitespace fixing


The idea for the below code fragment was to fix (recreate) missing whitespace in a poorly created corpus, in which linebreaks were inserted
into the text and whitespace before those linebreaks may (or may not) have been unintentionally stripped.

It soon turned out that the best advice is to avoid linebreaks altogether and to put all primary text tokens into one line (see
example further down and notes on 'Input restrictions' in the manpage).

An old, very rough first approach has remained in the code; it is not stringent, but it also doesn't affect one-line text.


Examples (how primary text with linebreaks would be converted by below code):


'...<w>end</w>\n<w>.</w>...' -> '...<w>end</w> <w>.</w>...'
'...<w>,</w>\n<w>this</w>\n<w>is</w>\n<w>it</w>\n<w>!</w>...' -> '...<w>,</w> <w>this</w> <w>is</w> <w>it</w> <w>!</w>...'
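
The two conversions above can be reproduced with a single substitution. This is only an illustration of the mapping for linebreaks that sit directly between two tags, not the script's own repair logic; the helper name 'join_tag_lines' is made up for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: replace a linebreak between two tags with one blank.
sub join_tag_lines {
    my ($text) = @_;
    $text =~ s/>\n</> </g;
    return $text;
}

print join_tag_lines("<w>end</w>\n<w>.</w>"), "\n";
```

Linebreaks inside running text are not touched by this substitution, so it says nothing about the non-stringent cases.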


Blanks are inserted before the 1st character:

NOTE: not stringent ('...' stands for text):

beg1............................end1 => no blank before 'beg1'
beg2....<pb/>...................end2 => no blank before 'beg2'
beg3....<info attr1="val1"/>....end3 => no blank before 'beg3'
beg4....<test>ok</test>.........end4 => blank before 'beg4'

=> beg1....end1beg2...<pb/>...end2beg3....<info attr1="val1"/>....end3 beg4...<test>ok</test>....end4
                                                                      ^
                                                                      |_blank between 'end3' and 'beg4'


## Notes on segfault prevention


binmode on the input handle prevents segfaulting of 'XML::LibXML::Reader' inside the main loop
(see notes on 'PerlIO layers' in 'man XML::LibXML').
Removing 'use open qw(:std :utf8)' would fix this problem too, but using binmode on input is more granular.
See in perluniintro: You can switch encodings on an already opened stream by using "binmode()".
See in perlfunc: If LAYER is omitted or specified as ":raw" the filehandle is made suitable for passing binary data.
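
The effect of binmode on a single handle can be inspected with PerlIO::get_layers. The following sketch only shows the layer change on STDIN under 'use open qw(:std :utf8)'; it does not reproduce the segfault, and the exact layer names printed depend on the platform.

```perl
use strict;
use warnings;
use open qw(:std :utf8);    # applies the :utf8 layer to STDIN, STDOUT and STDERR

my @before = PerlIO::get_layers( \*STDIN );
binmode(STDIN);             # no LAYER given: same as ":raw", for this handle only
my @after = PerlIO::get_layers( \*STDIN );

print "before: @before\n";
print "after:  @after\n";

# A reader such as XML::LibXML::Reader->new( IO => \*STDIN ) would be created after this point.
```

STDOUT and STDERR keep their ':utf8' layer, which is what makes binmode on the input handle more granular than removing the pragma.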