=pod

=encoding utf8

=head1 NAME

tei2korapxml - Conversion of TEI P5 based formats to KorAP-XML

=head1 SYNOPSIS

  cat corpus.i5.xml | tei2korapxml -tk - > corpus.korapxml.zip
  tei2korapxml -tk corpus.i5.xml > corpus.korapxml.zip

=head1 DESCRIPTION

C<tei2korapxml> is a script to convert TEI P5 and
L<I5|https://www.ids-mannheim.de/digspra/kl/projekte/korpora/textmodell>
based documents to the
L<KorAP-XML format|https://github.com/KorAP/KorAP-XML-Krill#about-korap-xml>.

This program is usually called from inside another script.

=head1 FORMATS

=head2 Input restrictions

=over 2

=item

TEI P5 formatted input with certain restrictions:

=over 4

=item

B<mandatory>: text-header with integrated textsigle
(or convertible identifier), text-body

=item

B<optional>: corp-header with integrated corpsigle,
doc-header with integrated docsigle

=back

=item

Tokens inside the primary text must not be separated by newlines,
because newlines are removed (see L<KorAP::XML::TEI::Data>).
Converting newlines into blanks between two tokens instead could
introduce additional blanks where there should be none
(e.g. punctuation characters like C<,> or C<.> should not be
separated from their predecessor token).
See also the code section C<~ whitespace handling ~> in
C<script/tei2korapxml>.

=item

Header types, like C<E<lt>idsHeader [...] type="document" [...] E<gt>>
need to be defined on the same line as the header tag.

=back

=head2 Notes on the output

=over 2

=item

The output is a zip file (written to C<STDOUT> by default) with
UTF-8 encoded entries, which together form the KorAP-XML format.

=back

=head1 INSTALLATION

=head2 Docker (Recommended)

The easiest way to use C<tei2korapxml> is via Docker, which bundles all dependencies
(Perl 5.42, Java 21, and required libraries) in a single container image.

B<Pull from Docker Hub:>

  $ docker pull korap/tei2korapxml:latest

B<Usage examples:>

  # Convert a file
  $ docker run --rm -v $(pwd):/data korap/tei2korapxml:latest \
      -s -tk /data/input.i5.xml > output.zip

  # Convert from stdin
  $ cat input.i5.xml | docker run --rm -i korap/tei2korapxml:latest \
      -s -tk - > output.zip

  # Using docker-compose
  $ docker-compose run --rm tei2korapxml -s -tk input.i5.xml > output.zip

B<Build locally:>

  $ docker build -t korap/tei2korapxml:latest .

For a slimmed-down image (using L<mintoolkit|https://github.com/mintoolkit/mint>):

  $ docker build -t korap/tei2korapxml:large .
  $ mint --crt-api-version 1.46 build --http-probe=false \
      --exec='PERL5LIB=/tei2korapxml/script/tei2korapxml -v || test $? -eq 2 && java -jar /tei2korapxml/share/KorAP-Tokenizer-2.3.0-standalone.jar -V' \
      --include-path=/tei2korapxml/lib --include-path=/usr/local/share/perl5 \
      --include-path=/usr/share/perl5 --include-path=/usr/lib/perl5 \
      --tag korap/tei2korapxml:latest \
      korap/tei2korapxml:large

=head2 Traditional Installation

C<tei2korapxml> requires C<libxml2-dev> bindings and L<File::ShareDir::Install> to be installed.
When these requirements are met, the preferred way to install the script is
to use L<cpanm|App::cpanminus>.

  $ cpanm https://github.com/KorAP/KorAP-XML-TEI.git

If everything went well, the C<tei2korapxml> tool will
be available on your command line immediately.

L<KorAP::XML::TEI> requires at least Perl 5.38.

=head1 OPTIONS

=over 2

=item B<--input|-i>

The input file to process. If no specific input is defined and a single
dash C<-> is passed as an argument, data is read from C<STDIN>.

Instead of using C<-i>, input files can also be given as trailing arguments
to the command:

  tei2korapxml -tk corpus1.i5.xml corpus2.i5.xml

=item B<--progress|-p>

Show a progress bar (including ETA).
This option is ignored unless the input is read from a file.

=item B<--output|-o>

The output zip file to be created. If no specific output is defined,
data is written to C<STDOUT>.

=item B<--root|-r>

The root directory for output. Defaults to C<.>.

=item B<--help|-h>

Print help information.

=item B<--version|-v>

Print version information.

=item B<--tokenizer-korap|-tk>

Use the standard KorAP/DeReKo tokenizer.

=item B<--tokenizer-internal|-ti>

Tokenize the data using two embedded tokenizers
that take an I<aggressive> and a I<conservative>
approach, respectively.

=item B<--tokenizer-call|-tc>

Call an external tokenizer process that tokenizes
input from C<STDIN> and outputs the offsets of all tokens.

Texts are separated using C<\x04\n>. The external process
should add a new line per text.

If the L</--use-tokenizer-sentence-splits> option is activated,
sentences are marked by offset as well in new lines.

To use L<Datok|https://github.com/KorAP/Datok> including sentence
splitting, call C<tei2korapxml> as follows:

  $ cat corpus.i5.xml | tei2korapxml -s \
      -tc 'datok tokenize \
      -t ./tokenizer.matok \
      -p --newline-after-eot --no-sentences \
      --no-tokens --sentence-positions -' - \
      > corpus.korapxml.zip
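The input framing described for C<--tokenizer-call> can be illustrated with a small sketch. It only shows how a stream is split into texts at the C<\x04\n> separator; it is not part of C<tei2korapxml>, the helper name C<split_texts> and the sample input are made up for illustration, and the offset output expected from a real tokenizer is not modelled here.

```shell
# Hypothetical helper (not part of tei2korapxml): read a stream in which
# every text is terminated by EOT (\x04) followed by a newline, and print
# one line per text with its running number and its length in characters.
split_texts() {
  awk -v 'RS=\004' '
    { sub(/^\n/, "") }   # drop the newline that followed the previous EOT
    length($0) > 0 { n++; printf "%d\t%d\n", n, length($0) }
  '
}

# Two sample texts, framed the way an external tokenizer would receive them.
printf 'Ein Satz.\004\nNoch ein Satz.\004\n' | split_texts
```

A real external tokenizer would read the same framing but print token offsets for each text instead of a length.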

=item B<--no-tokenizer>

Boolean flag indicating that no tokenizer should be used.
Requiring this flag explicitly is meant to ensure that, by default,
a final token layer always exists.
If a separate tokenizer is chosen, this flag is ignored.

=item B<--skip-inline-tokens>

Boolean flag indicating that inline tokens should not
be processed. Defaults to false (meaning inline tokens will be processed).

=item B<--skip-inline-token-annotations>

Boolean flag indicating that inline token annotations should not
be processed. Defaults to true (meaning inline token annotations
won't be processed). Can be negated with
C<--no-skip-inline-token-annotations>.

=item B<--skip-inline-tags> <tags>

Expects a comma-separated list of tags to be ignored when the structure
is parsed. The content of these tags, however, will be processed.

=item B<--auto-textsigle> <textsigle>

Expects a text sigle that serves as a fallback if no text sigles
are given in the input data.
The auto text sigle will be incremented for each text processed.

Example:

  tei2korapxml --auto-textsigle 'ICC/GER.00001' -s -tk - \
    < data.i5.xml > korapxml.zip
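How the sigle is incremented is not spelled out here. The following sketch shows one plausible reading, assuming that the trailing number is increased by one and its zero padding is kept. The helper C<next_sigle> is hypothetical and not part of C<tei2korapxml>.

```shell
# Hypothetical helper (not part of tei2korapxml). Assumption: the trailing
# number of the sigle is incremented and its zero padding is preserved.
next_sigle() {
  digits="${1##*[!0-9]}"                  # trailing digits, e.g. 00001
  prefix="${1%"$digits"}"                 # everything before them, e.g. ICC/GER.
  value="${digits#"${digits%%[!0]*}"}"    # strip leading zeros (avoids octal parsing)
  printf "%s%0${#digits}d\n" "$prefix" "$(( ${value:-0} + 1 ))"
}

next_sigle 'ICC/GER.00001'   # prints ICC/GER.00002
```

Under this assumption, a corpus with three texts lacking sigles would receive C<ICC/GER.00001>, C<ICC/GER.00002>, and C<ICC/GER.00003>.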

=item B<--xmlid-to-textsigle> <from-regex>@<to-c/to-d/to-t>

Expects a regular expression replacement (with search and replacement
separated by B<@>) to convert text id attributes to text sigles
with three parts (separated by B</>).

Example:

  tei2korapxml \
    --xmlid-to-textsigle 'ICC.German\.([^.]+\.[^.]+)\.(.+)@ICCGER/$1/$2' \
    -tk - < t/data/icc_german_sample.p5.xml

Converts text id C<ICC.German.DeReKo.WPD17.G11.00238> to
sigle C<ICCGER/DeReKo.WPD17/G11.00238>.
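To try out a replacement rule before running a conversion, the same search and replacement can be applied with C<sed>. This is only a sketch for experimenting: the helper name C<id_to_sigle> is made up, and C<sed> writes capture groups as C<\1> and C<\2> where the option uses C<$1> and C<$2>.

```shell
# Hypothetical helper (not part of tei2korapxml): apply the rule from the
# example above to a single text id. "@" doubles as the sed delimiter,
# mirroring the <from-regex>@<to> notation of the option.
id_to_sigle() {
  printf '%s\n' "$1" |
    sed -E 's@ICC.German\.([^.]+\.[^.]+)\.(.+)@ICCGER/\1/\2@'
}

id_to_sigle 'ICC.German.DeReKo.WPD17.G11.00238'   # prints ICCGER/DeReKo.WPD17/G11.00238
```

The first capture group spans two dot-separated parts (C<DeReKo.WPD17>) and becomes the document part of the sigle; the rest of the id becomes the text part.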

=item B<--inline-tokens> <foundry>#[<file>]

Define the foundry and file (without extension)
to store inline token information in.
Unless C<--skip-inline-token-annotations> is set,
this will contain annotations as well.
Defaults to C<tokens> and C<morpho>.

The inline token data will also be stored in the
inline structures file (see I<--inline-structures>),
unless the inline token foundry is prefixed
with an exclamation mark (B<!>), indicating that inline
tokens are stored exclusively in the inline tokens
file.

Example:

  tei2korapxml --no-tokenizer --inline-tokens \
    '!gingko#morpho' < data.i5.xml > korapxml.zip

=item B<--inline-dependencies> <foundry>#[<file>]

Define the foundry and file (without extension)
to store inline dependency information in.
Defaults to the layer C<dependency> and
is ignored if not set (which means that dependency
attributes will be stored in the inline tokens file,
if not skipped).

The dependency data will also be stored in the
inline tokens file (see I<--inline-tokens>),
unless the inline dependencies foundry is prefixed
with an exclamation mark (B<!>), indicating that inline
dependency data is stored exclusively in the inline
dependencies file.

Example:

  tei2korapxml --no-tokenizer --inline-dependencies \
    'gingko#dependency' < data.i5.xml > korapxml.zip

=item B<--inline-structures> <foundry>#[<file>]

Define the foundry and file (without extension)
to store inline structure information in.
Defaults to C<struct> and C<structures>.

=item B<--base-foundry> <foundry>

Define the base foundry to store newly generated
token information in.
Defaults to C<base>.

=item B<--data-file> <file>

Define the file (without extension)
to store primary data information in.
Defaults to C<data>.

=item B<--header-file> <file>

Define the file name (without extension)
to store header information on
the corpus, document, and text level in.
Defaults to C<header>.

=item B<--use-tokenizer-sentence-splits|-s>

Use the sentence boundary information provided by the tokenizer,
replacing existing sentence boundaries or adding new ones.
Currently, the KorAP tokenizer and certain external tokenizers support
these boundaries.

=item B<--tokens-file> <file>

Define the file (without extension)
to store generated token information in
(either from the KorAP tokenizer or an externally called tokenizer).
Defaults to C<tokens>.

=item B<--log|-l>

Log level for I<Log::Any>. Defaults to C<notice>.

=back

=head1 ENVIRONMENT VARIABLES

=over 2

=item B<KORAPXMLTEI_DEBUG>

Activate minimal debugging.
Defaults to C<false>.

=item B<KORAPXMLTEI_TOKENIZER_HEAP_SIZE>

Set the heap size for the tokenizer process.
Defaults to C<512m>.

=back

=head1 COPYRIGHT AND LICENSE

Copyright (C) 2021-2025, L<IDS Mannheim|https://www.ids-mannheim.de/>

Author: Peter Harders

Contributors: Nils Diewald, Marc Kupietz, Carsten Schnober

L<KorAP::XML::TEI> is developed as part of the L<KorAP|https://korap.ids-mannheim.de/>
Corpus Analysis Platform at the
L<Leibniz Institute for the German Language (IDS)|https://www.ids-mannheim.de/>,
member of the
L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/>.

This program is free software published under the
L<BSD-2 License|https://opensource.org/licenses/BSD-2-Clause>.

=cut