=pod

=encoding utf8

=head1 NAME

tei2korapxml - Conversion of TEI P5 based formats to KorAP-XML

=head1 SYNOPSIS

 cat corpus.i5.xml | tei2korapxml > corpus.korapxml.zip

=head1 DESCRIPTION

C<tei2korapxml> is a script to convert TEI P5 and
L<I5|https://www1.ids-mannheim.de/kl/projekte/korpora/textmodell.html>
based documents to the
L<KorAP-XML format|https://github.com/KorAP/KorAP-XML-Krill#about-korap-xml>.
If no specific input is defined, data is
read from C<STDIN>. If no specific output is defined, data is written
to C<STDOUT>.

This program is usually called from inside another script.
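
Because the converter reads from C<STDIN> and writes to C<STDOUT>, a calling
script only has to set up the redirections. The following is a minimal sketch
of such a wrapper, not something shipped with the distribution; the function
name C<convert_all>, the C<CONVERTER> variable and the output naming scheme
are assumptions made for illustration.

```shell
# Hypothetical wrapper: convert every *.i5.xml file in a directory.
# CONVERTER defaults to tei2korapxml and can be overridden, e.g. for a dry run.
CONVERTER="${CONVERTER:-tei2korapxml}"

convert_all() {
  in_dir="$1"
  out_dir="$2"
  mkdir -p "$out_dir" || return 1
  for src in "$in_dir"/*.i5.xml; do
    # Skip the unexpanded pattern when no file matches.
    [ -e "$src" ] || continue
    name=$(basename "$src" .i5.xml)
    # Input via STDIN, output via STDOUT, as described above.
    "$CONVERTER" < "$src" > "$out_dir/$name.korapxml.zip" || return 1
  done
}

# Example call (assumes tei2korapxml is installed):
#   convert_all ./corpora ./converted
```

The function stops at the first failing conversion, so a partially written
output directory signals where the run was interrupted.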

=head1 FORMATS

=head2 Input restrictions

=over 2

=item

UTF-8 encoded

=item

TEI P5 formatted input with certain restrictions:

=over 4

=item

B<mandatory>: text-header with integrated textsigle, text-body

=item

B<optional>: corp-header with integrated corpsigle,
doc-header with integrated docsigle

=back

=item

Tokens inside the primary text must not be
newline separated, because newlines are removed
(see L<KorAP::XML::TEI::Data>) and a conversion of newlines
into blanks between two tokens could introduce additional blanks
where there should be none (e.g., punctuation characters like C<,> or
C<.> should not be separated from their predecessor token).
See also the code section C<~ whitespace handling ~>.

=back
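
The whitespace restriction can be illustrated without the converter itself.
In the sketch below, C<tr> merely stands in for the two ways of treating
newlines; it is not the code used in L<KorAP::XML::TEI::Data>, and all
function names are hypothetical. The C<iconv> call is one common way to check
the UTF-8 requirement beforehand, assuming C<iconv> is installed.

```shell
# Newlines removed: punctuation stays attached to its predecessor token,
# but newline-separated word tokens run together.
strip_newlines() { tr -d '\n'; }

# Newlines turned into blanks: an unwanted blank appears before "," and ".".
newlines_to_blanks() { tr '\n' ' '; }

# Succeeds only if STDIN is valid UTF-8 (iconv fails on invalid sequences).
is_utf8() { iconv -f UTF-8 -t UTF-8 > /dev/null 2>&1; }

printf 'Hello\n,\nworld\n.' | strip_newlines; echo      # prints: Hello,world.
printf 'Hello\n,\nworld\n.' | newlines_to_blanks; echo  # prints: Hello , world .
```

Neither result matches the intended text, which is why the tokens of the
primary text should not be newline separated in the first place.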

=head2 Notes on the output

=over 2

=item

zip file output (default on C<STDOUT>) with UTF-8 encoded entries
(which together form the KorAP-XML format)

=back

=head1 INSTALLATION

C<tei2korapxml> requires the C<libxml2-dev> bindings to build. When
these bindings are available, the preferred way to install the script is
to use L<cpanm|App::cpanminus>.

 $ cpanm https://github.com/KorAP/KorAP-XML-TEI.git

If everything went well, the C<tei2korapxml> tool will
be available on your command line immediately.

Minimum requirement for L<KorAP::XML::TEI> is Perl 5.16.
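
Before installing, it can help to confirm both prerequisites. The sketch
below is not part of the distribution: C<version_at_least> is a hypothetical
helper, and C<xml2-config> is assumed to be present because it ships with the
libxml2 development files on many systems.

```shell
# Compare two dotted numeric versions; succeeds if HAVE >= WANT.
version_at_least() {
  awk -v have="$1" -v want="$2" 'BEGIN {
    n = split(have, h, "."); m = split(want, w, ".")
    max = (n > m) ? n : m
    for (i = 1; i <= max; i++) {
      a = (i <= n) ? h[i] + 0 : 0
      b = (i <= m) ? w[i] + 0 : 0
      if (a > b) exit 0
      if (a < b) exit 1
    }
    exit 0
  }'
}

# Perl reports its version as e.g. "5.30.0" via the %vd format.
if command -v perl > /dev/null 2>&1; then
  perl_version=$(perl -e 'printf "%vd", $^V')
  if version_at_least "$perl_version" 5.16; then
    echo "Perl $perl_version: ok"
  else
    echo "Perl $perl_version: too old, 5.16 or later is required"
  fi
else
  echo "perl not found"
fi

# Assumption: a working xml2-config means the libxml2 headers are installed.
if command -v xml2-config > /dev/null 2>&1; then
  echo "libxml2 $(xml2-config --version): ok"
else
  echo "xml2-config not found; libxml2-dev may be missing"
fi
```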

=head1 OPTIONS

=over 2

=item B<--root|-r>

The root directory for output. Defaults to C<.>.

=item B<--help|-h>

Print help information.

=item B<--version|-v>

Print version information.

=item B<--tokenizer-call|-tc>

Call an external tokenizer process that will tokenize
a single line from C<STDIN> and output one token per line.

=item B<--tokenizer-korap|-tk>

Use the standard KorAP/DeReKo tokenizer.

=item B<--use-intern-tokenization|-ti>

Tokenize the data using two embedded tokenizers
that will take an I<aggressive> and a I<conservative>
approach.

=item B<--log|-l>

Log level for I<Log::Any>. Defaults to C<notice>.

=back
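
To make the contract of C<--tokenizer-call> more concrete, here is a
deliberately naive stand-in for an external tokenizer: it reads lines from
C<STDIN> and writes one whitespace-delimited token per line. The function
name is hypothetical, and the way the command is passed to
C<--tokenizer-call> in the trailing comment is an assumption; the exact
exchange format the script expects is not specified in this document.

```shell
# Naive whitespace tokenizer: one token per output line.
# Punctuation is NOT split off, so this only illustrates the I/O contract.
tokenize_stdin() {
  tr -s '[:space:]' '\n' | sed '/^$/d'
}

printf 'Der alte Mann geht.\n' | tokenize_stdin

# Assumed invocation, with the function body saved as an executable
# script named ./my-tokenizer.sh (hypothetical file name):
#   tei2korapxml --tokenizer-call ./my-tokenizer.sh \
#     < corpus.i5.xml > corpus.korapxml.zip
```

A real tokenizer would also separate punctuation from words; the standard
KorAP/DeReKo tokenizer selected with C<--tokenizer-korap> is the safer choice
when no custom behaviour is needed.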

=head1 COPYRIGHT AND LICENSE

Copyright (C) 2020, L<IDS Mannheim|https://www.ids-mannheim.de/>

Author: Peter Harders

Contributors: Nils Diewald, Marc Kupietz, Carsten Schnober

L<KorAP::XML::TEI> is developed as part of the L<KorAP|https://korap.ids-mannheim.de/>
Corpus Analysis Platform at the
L<Leibniz Institute for the German Language (IDS)|http://ids-mannheim.de/>,
member of the
L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/>.

This program is free software published under the
L<BSD-2 License|https://raw.githubusercontent.com/KorAP/KorAP-XML-TEI/master/LICENSE>.

=cut