=pod

=encoding utf8

=head1 NAME

tei2korapxml - Conversion of TEI P5 based formats to KorAP-XML

=head1 SYNOPSIS

  cat corpus.i5.xml | tei2korapxml -tk - > corpus.korapxml.zip
  tei2korapxml -tk corpus.i5.xml > corpus.korapxml.zip

=head1 DESCRIPTION

C<tei2korapxml> is a script to convert TEI P5 and
L<I5|https://www.ids-mannheim.de/digspra/kl/projekte/korpora/textmodell>
based documents to the
L<KorAP-XML format|https://github.com/KorAP/KorAP-XML-Krill#about-korap-xml>.

This program is usually called from inside another script.

=head1 FORMATS

=head2 Input restrictions

=over 2

=item

TEI P5 formatted input with certain restrictions:

=over 4

=item

B<mandatory>: text-header with integrated textsigle
(or convertible identifier), text-body

=item

B<optional>: corp-header with integrated corpsigle,
doc-header with integrated docsigle

=back

=item

Tokens inside the primary text must not be
separated by newlines, because newlines are removed
(see L<KorAP::XML::TEI::Data>). Converting newlines
into blanks between two tokens could introduce additional blanks
where there should be none (e.g. punctuation characters like C<,> or
C<.> should not be separated from their predecessor token).
See also the code section C<~ whitespace handling ~> in C<script/tei2korapxml>.

=item

Header types, like C<E<lt>idsHeader [...] type="document" [...] E<gt>>,
need to be defined on the same line as the header tag.

=back

=head2 Notes on the output

=over 2

=item

The output is a zip file (written to C<STDOUT> by default) with UTF-8
encoded entries, which together form the KorAP-XML format.

=back

=head1 INSTALLATION

C<tei2korapxml> requires C<libxml2-dev> bindings and L<File::ShareDir::Install> to be installed.
When these requirements are met, the preferred way to install the script is
to use L<cpanm|App::cpanminus>.

  $ cpanm https://github.com/KorAP/KorAP-XML-TEI.git

If everything went well, the C<tei2korapxml> tool will
be available on your command line immediately.

Minimum requirement for L<KorAP::XML::TEI> is Perl 5.16.

=head1 OPTIONS

=over 2

=item B<--input|-i>

The input file to process. If no specific input is defined and a single
dash C<-> is passed as an argument, data is read from C<STDIN>.

Instead of using C<-i>, input files can also be given as trailing arguments
to the command:

  tei2korapxml -tk corpus1.i5.xml corpus2.i5.xml

=item B<--output|-o>

The output zip file to be created. If no specific output is defined,
data is written to C<STDOUT>.

=item B<--root|-r>

The root directory for output. Defaults to C<.>.

=item B<--help|-h>

Print help information.

=item B<--version|-v>

Print version information.

=item B<--tokenizer-korap|-tk>

Use the standard KorAP/DeReKo tokenizer.

=item B<--tokenizer-internal|-ti>

Tokenize the data using two embedded tokenizers
that take an I<aggressive> and a I<conservative>
approach, respectively.

=item B<--tokenizer-call|-tc>

Call an external tokenizer process that tokenizes input
from C<STDIN> and outputs the offsets of all tokens.

Texts are separated using C<\x04\n>. The external process
should add a new line per text.

If the L</--use-tokenizer-sentence-splits> option is activated,
sentences are marked by offsets as well, on additional lines.

To use L<Datok|https://github.com/KorAP/Datok> including sentence
splitting, call C<tei2korapxml> as follows:

  $ cat corpus.i5.xml | tei2korapxml -s \
      -tc 'datok tokenize \
      -t ./tokenizer.matok \
      -p --newline-after-eot --no-sentences \
      --no-tokens --sentence-positions -' - \
      > corpus.korapxml.zip

=item B<--no-tokenizer>

Boolean flag indicating that no tokenizer should be used.
Requiring this flag explicitly is meant to ensure that, by default,
a final token layer always exists.
If a separate tokenizer is chosen, this flag is ignored.

=item B<--skip-inline-tokens>

Boolean flag indicating that inline tokens should not
be processed. Defaults to false (meaning inline tokens will be processed).

=item B<--skip-inline-token-annotations>

Boolean flag indicating that inline token annotations should not
be processed. Defaults to true (meaning inline token annotations
won't be processed). Can be negated with
C<--no-skip-inline-token-annotations>.

=item B<--skip-inline-tags> <tags>

Expects a comma-separated list of tags to be ignored when the structure
is parsed. The content of these tags, however, will still be processed.

=item B<--auto-textsigle> <textsigle>

Expects a text sigle that serves as a fallback if no text sigles
are given in the input data.
The auto text sigle will be incremented for each text processed.

Example:

  tei2korapxml --auto-textsigle 'ICC/GER.00001' -s -tk - \
    < data.i5.xml > korapxml.zip

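The incrementing can be pictured with the following shell sketch.
It is not the code used by C<tei2korapxml>; it assumes that the trailing
number of the sigle is increased by one and that its zero padding is kept.
The function name C<next_textsigle> is hypothetical.

```shell
# Hypothetical helper, not part of tei2korapxml.
# Assumption: the trailing digits of the sigle form a counter that is
# increased by one while the zero padding is preserved.
next_textsigle() {
  sigle=$1
  number=${sigle##*[!0-9]}          # trailing digits, e.g. 00001
  if [ -z "$number" ]; then
    printf '%s\n' "$sigle"          # no trailing counter: leave unchanged
    return
  fi
  prefix=${sigle%"$number"}         # everything before the counter
  width=${#number}                  # number of digits to pad to
  zeros=${number%%[1-9]*}           # leading zeros only
  value=${number#"$zeros"}          # counter without leading zeros
  printf "%s%0${width}d\n" "$prefix" "$(( ${value:-0} + 1 ))"
}

next_textsigle 'ICC/GER.00001'
# prints ICC/GER.00002
```
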
=item B<--xmlid-to-textsigle> <from-regex>@<to-c/to-d/to-t>

Expects a regular replacement expression (separated by B<@> between the
search and the replacement) to convert text id attributes to text sigles
with three parts (separated by B</>).

Example:

  tei2korapxml \
    --xmlid-to-textsigle 'ICC.German\.([^.]+\.[^.]+)\.(.+)@ICCGER/$1/$2' \
    -tk - < t/data/icc_german_sample.p5.xml

Converts text id C<ICC.German.DeReKo.WPD17.G11.00238> to
sigle C<ICCGER/DeReKo.WPD17/G11.00238>.

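The replacement can be tried out in isolation. The following shell sketch
applies the regular expression from the example above with C<sed -E>, using
C<@> as the delimiter. It only imitates the option and does not call
C<tei2korapxml>. Note that C<sed> writes backreferences as C<\1> and C<\2>
instead of C<$1> and C<$2>. The function name C<xmlid_to_textsigle> is
hypothetical.

```shell
# Hypothetical helper that imitates --xmlid-to-textsigle with sed;
# tei2korapxml itself is not involved.
xmlid_to_textsigle() {
  printf '%s\n' "$1" |
    sed -E 's@ICC.German\.([^.]+\.[^.]+)\.(.+)@ICCGER/\1/\2@'
}

xmlid_to_textsigle 'ICC.German.DeReKo.WPD17.G11.00238'
# prints ICCGER/DeReKo.WPD17/G11.00238
```
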
=item B<--inline-tokens> <foundry>#[<file>]

Define the foundry and file (without extension)
to store inline token information in.
Unless C<--skip-inline-token-annotations> is set,
this will contain annotations as well.
Defaults to C<tokens> and C<morpho>.

The inline token data will also be stored in the
inline structures file (see I<--inline-structures>),
unless the inline token foundry is prepended
by an exclamation mark (B<!>), indicating that inline
tokens are stored exclusively in the inline tokens
file.

Example:

  tei2korapxml --no-tokenizer --inline-tokens \
    '!gingko#morpho' < data.i5.xml > korapxml.zip

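How such a value breaks down can be sketched in shell. The function below
is hypothetical and only mirrors the description above (optional leading
C<!>, foundry, C<#>, optional file, with the defaults C<tokens> and
C<morpho>); the actual parsing in C<tei2korapxml> may differ.

```shell
# Hypothetical parser for values like '!gingko#morpho'.
# Assumption: a missing foundry or file falls back to the documented
# defaults 'tokens' and 'morpho'.
parse_inline_tokens() {
  spec=$1
  exclusive=no
  case $spec in
    '!'*) exclusive=yes; spec=${spec#?} ;;   # leading ! means exclusive
  esac
  foundry=${spec%%'#'*}                      # part before the first '#'
  case $spec in
    *'#'*) file=${spec#*'#'} ;;              # part after the first '#'
    *)     file= ;;                          # no file given
  esac
  printf 'foundry=%s file=%s exclusive=%s\n' \
    "${foundry:-tokens}" "${file:-morpho}" "$exclusive"
}

parse_inline_tokens '!gingko#morpho'
# prints foundry=gingko file=morpho exclusive=yes
```
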
=item B<--inline-dependencies> <foundry>#[<file>]

Define the foundry and file (without extension)
to store inline dependency information in.
Defaults to the layer C<dependency>.
The option is ignored if not set, which means that dependency
attributes will be stored in the inline tokens file
(if not skipped).

The dependency data will also be stored in the
inline tokens file (see I<--inline-tokens>),
unless the inline dependencies foundry is prepended
by an exclamation mark (B<!>), indicating that inline
dependency data is stored exclusively in the inline
dependencies file.

Example:

  tei2korapxml --no-tokenizer --inline-dependencies \
    'gingko#dependency' < data.i5.xml > korapxml.zip

=item B<--inline-structures> <foundry>#[<file>]

Define the foundry and file (without extension)
to store inline structure information in.
Defaults to C<struct> and C<structures>.

=item B<--base-foundry> <foundry>

Define the base foundry to store newly generated
token information in.
Defaults to C<base>.

=item B<--data-file> <file>

Define the file (without extension)
to store primary data information in.
Defaults to C<data>.

=item B<--header-file> <file>

Define the file name (without extension)
to store header information on
the corpus, document, and text level in.
Defaults to C<header>.

=item B<--use-tokenizer-sentence-splits|-s>

Replace existing sentence boundary information with the boundaries
provided by the tokenizer, or add them if none exist.
Currently, the KorAP tokenizer and certain external tokenizers support
these boundaries.

=item B<--tokens-file> <file>

Define the file (without extension)
to store generated token information in
(either from the KorAP tokenizer or an externally called tokenizer).
Defaults to C<tokens>.

=item B<--log|-l>

Log level for I<Log::Any>. Defaults to C<notice>.

=back

=head1 ENVIRONMENT VARIABLES

=over 2

=item B<KORAPXMLTEI_DEBUG>

Activate minimal debugging.
Defaults to C<false>.

=back

=head1 COPYRIGHT AND LICENSE

Copyright (C) 2021-2024, L<IDS Mannheim|https://www.ids-mannheim.de/>

Author: Peter Harders

Contributors: Nils Diewald, Marc Kupietz, Carsten Schnober

L<KorAP::XML::TEI> is developed as part of the L<KorAP|https://korap.ids-mannheim.de/>
Corpus Analysis Platform at the
L<Leibniz Institute for the German Language (IDS)|https://www.ids-mannheim.de/>,
member of the
L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/>.

This program is free software published under the
L<BSD-2 License|https://opensource.org/licenses/BSD-2-Clause>.

=cut