=pod

=encoding utf8

=head1 NAME

tei2korapxml - Conversion of TEI P5 based formats to KorAP-XML

=head1 SYNOPSIS

  cat corpus.i5.xml | tei2korapxml -tk - > corpus.korapxml.zip
  tei2korapxml -tk corpus.i5.xml > corpus.korapxml.zip

=head1 DESCRIPTION

C<tei2korapxml> is a script to convert TEI P5 and
L<I5|https://www.ids-mannheim.de/digspra/kl/projekte/korpora/textmodell>
based documents to the
L<KorAP-XML format|https://github.com/KorAP/KorAP-XML-Krill#about-korap-xml>.

This program is usually called from inside another script.
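
A calling script could look like the following minimal sketch, which converts
every C<*.i5.xml> file in a directory into its own zip archive. The function
name C<convert_all> and the output naming scheme are assumptions made for
illustration; only the invocation form from the synopsis is taken from this
documentation.

```shell
# Hypothetical wrapper (not part of tei2korapxml): convert each
# *.i5.xml file in a directory to <name>.korapxml.zip next to it.
convert_all() {
  dir=${1:-.}
  for input in "$dir"/*.i5.xml; do
    # Skip the unexpanded pattern if the directory has no matching files
    [ -e "$input" ] || continue
    output=${input%.i5.xml}.korapxml.zip
    # Same call as in the synopsis: KorAP tokenizer, zip written to stdout
    tei2korapxml -tk "$input" > "$output" || return 1
  done
}
```

Calling C<convert_all corpora/> would then stop at the first failing conversion.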

=head1 FORMATS

=head2 Input restrictions

=over 2

=item

TEI P5 formatted input with certain restrictions:

=over 4

=item

B<mandatory>: text-header with integrated textsigle
(or convertible identifier), text-body

=item

B<optional>: corp-header with integrated corpsigle,
doc-header with integrated docsigle

=back

=item

Tokens inside the primary text must not be separated by
newlines, because newlines are removed
(see L<KorAP::XML::TEI::Data>), and converting newlines
into blanks between two tokens could introduce additional blanks
where there should be none (e.g. punctuation characters like C<,> or
C<.> should not be separated from their predecessor token).
See also the code section C<~ whitespace handling ~> in C<script/tei2korapxml>.

=item

Header types, like C<E<lt>idsHeader [...] type="document" [...] E<gt>>,
need to be defined on the same line as the header tag.

=back

=head2 Notes on the output

=over 2

=item

Zip file output (written to C<stdout> by default) with UTF-8 encoded entries
(which together form the KorAP-XML format)

=back

=head1 INSTALLATION

=head2 Docker (Recommended)

The easiest way to use C<tei2korapxml> is via Docker, which bundles all dependencies
(Perl 5.42, Java 21, and required libraries) in a single container image.

B<Pull from Docker Hub:>

  $ docker pull korap/tei2korapxml:latest

B<Usage examples:>

  # Convert a file
  $ docker run --rm -v $(pwd):/data korap/tei2korapxml:latest \
      -s -tk /data/input.i5.xml > output.zip

  # Convert from stdin
  $ cat input.i5.xml | docker run --rm -i korap/tei2korapxml:latest \
      -s -tk - > output.zip

  # Using docker-compose
  $ docker-compose run --rm tei2korapxml -s -tk input.i5.xml > output.zip

B<Build locally:>

  $ docker build -t korap/tei2korapxml:latest .

For a slimmed-down image (using L<mintoolkit|https://github.com/mintoolkit/mint>):

  $ docker build -t korap/tei2korapxml:large .
  $ mint --crt-api-version 1.46 build --http-probe=false \
      --exec='PERL5LIB=/tei2korapxml/script/tei2korapxml -v || test $? -eq 2 && java -jar /tei2korapxml/share/KorAP-Tokenizer-2.3.0-standalone.jar -V' \
      --include-path=/tei2korapxml/lib --include-path=/usr/local/share/perl5 \
      --include-path=/usr/share/perl5 --include-path=/usr/lib/perl5 \
      --tag korap/tei2korapxml:latest \
      korap/tei2korapxml:large

=head2 Traditional Installation

C<tei2korapxml> requires C<libxml2-dev> bindings and L<File::ShareDir::Install> to be installed.
When these requirements are met, the preferred way to install the script is
to use L<cpanm|App::cpanminus>.

  $ cpanm https://github.com/KorAP/KorAP-XML-TEI.git

In case everything went well, the C<tei2korapxml> tool will
be available on your command line immediately.

Minimum requirement for L<KorAP::XML::TEI> is Perl 5.38.

=head1 OPTIONS

=over 2

=item B<--input|-i>

The input file to process. If no specific input is defined and a single
dash C<-> is passed as an argument, data is read from C<STDIN>.

Instead of using C<-i>, input files can also be defined as trailing arguments
to the command:

  tei2korapxml -tk corpus1.i5.xml corpus2.i5.xml

=item B<--progress|-p>

Show a progress bar (including ETA) written directly to C</dev/tty>,
so it always appears on the terminal regardless of C<stderr> redirection.
This option is ignored if the input is not read from a file,
or if no controlling terminal is available (e.g. in a detached container
or CI environment).

=item B<--output|-o>

The output zip file to be created. If no specific output is defined,
data is written to C<STDOUT>.

=item B<--root|-r>

The root directory for output. Defaults to C<.>.

=item B<--help|-h>

Print help information.

=item B<--version|-v>

Print version information.

=item B<--tokenizer-korap|-tk>

Use the standard KorAP/DeReKo tokenizer.

=item B<--tokenizer-internal|-ti>

Tokenize the data using two embedded tokenizers
that take an I<aggressive> and a I<conservative>
approach.

=item B<--tokenizer-call|-tc>

Call an external tokenizer process that tokenizes text
from C<STDIN> and outputs the offsets of all tokens.

Texts are separated using C<\x04\n>. The external process
should add a new line per text.

If the L</--use-tokenizer-sentence-splits> option is activated,
sentences are marked by offset as well in new lines.

To use L<Datok|https://github.com/KorAP/Datok> including sentence
splitting, call C<tei2korapxml> as follows:

  $ cat corpus.i5.xml | tei2korapxml -s \
      -tc 'datok tokenize \
      -t ./tokenizer.matok \
      -p --newline-after-eot --no-sentences \
      --no-tokens --sentence-positions -' - \
      > corpus.korapxml.zip
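
To make the exchange described above more tangible, the following sketch
mimics what an external tokenizer process does: it reads texts separated by
C<\x04\n> from C<STDIN> and prints one line of token offsets per text. The
function name C<mock_tokenizer>, the whitespace-only tokenization, and the
output layout (space-separated start and end offsets) are assumptions made
for illustration; they are not specified in this documentation.

```shell
# Hypothetical stand-in for an external tokenizer (illustration only).
# Reads texts terminated by \x04\n and prints, for every text, one line
# with "start end" offset pairs of all whitespace-delimited tokens.
mock_tokenizer() {
  awk 'BEGIN { RS = "\004" }
  {
    text = $0
    sub(/^\n/, "", text)            # drop the newline following each \x04
    if (text == "") next            # nothing left after the last separator
    line = ""; pos = 0; rest = text
    while (match(rest, /[^ \t\n]+/)) {
      start = pos + RSTART - 1      # offsets are zero-based
      stop = start + RLENGTH        # end offset is exclusive
      line = line (line == "" ? "" : " ") start " " stop
      pos = stop
      rest = substr(rest, RSTART + RLENGTH)
    }
    print line                      # one output line per text
  }'
}
```

Offsets are counted in the units C<awk> uses for string positions, and empty
texts are skipped, so this is only a toy model of the protocol; for real
data, a tokenizer such as Datok should be used.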

=item B<--no-tokenizer>

Boolean flag indicating that no tokenizer should be used.
Requiring this flag explicitly is meant to ensure that, by default,
a final token layer always exists.
If a separate tokenizer is chosen, this flag is ignored.

=item B<--skip-inline-tokens>

Boolean flag indicating that inline tokens should not
be processed. Defaults to false (meaning inline tokens will be processed).

=item B<--skip-inline-token-annotations>

Boolean flag indicating that inline token annotations should not
be processed. Defaults to true (meaning inline token annotations
won't be processed). Can be negated with
C<--no-skip-inline-token-annotations>.

=item B<--skip-inline-tags> <tags>

Expects a comma-separated list of tags to be ignored when the structure
is parsed. The content of these tags, however, will be processed.

=item B<--auto-textsigle> <textsigle>

Expects a text sigle that serves as a fallback if no text sigles
are given in the input data.
The auto text sigle will be incremented for each text processed.

Example:

  tei2korapxml --auto-textsigle 'ICC/GER.00001' -s -tk - \
      < data.i5.xml > korapxml.zip
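
The incrementing can be pictured with the following sketch. It assumes that
the number after the last dot is counted up and that its zero padding is
kept; how C<tei2korapxml> increments the sigle internally is not described
here, and the helper name C<next_textsigle> is hypothetical.

```shell
# Hypothetical helper: increment the numeric part after the last dot,
# keeping the zero padding (assumed behaviour, for illustration only).
next_textsigle() {
  sigle=$1
  prefix=${sigle%.*}            # everything before the last dot
  number=${sigle##*.}           # digits after the last dot
  width=${#number}
  zeros=${number%%[!0]*}        # leading zeros only
  value=${number#"$zeros"}      # strip them to avoid octal parsing
  printf "%s.%0${width}d\n" "$prefix" "$(( ${value:-0} + 1 ))"
}
```

Under these assumptions, the second text of the example above would receive
the sigle C<ICC/GER.00002>.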

=item B<--xmlid-to-textsigle> <from-regex>@<to-c/to-d/to-t>

Expects a regular expression replacement (search pattern and
replacement separated by B<@>) to convert text id attributes to text sigles
with three parts (separated by B</>).

Example:

  tei2korapxml \
      --xmlid-to-textsigle 'ICC.German\.([^.]+\.[^.]+)\.(.+)@ICCGER/$1/$2' \
      -tk - < t/data/icc_german_sample.p5.xml

Converts text id C<ICC.German.DeReKo.WPD17.G11.00238> to
sigle C<ICCGER/DeReKo.WPD17/G11.00238>.
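
The effect of such an expression can be tried out in isolation. The
following sketch reproduces the mapping with C<sed>, which is a swapped-in
tool used only for illustration: it writes C<\1> and C<\2> where the option
uses C<$1> and C<$2>, and the function name C<xmlid_to_textsigle> is
hypothetical.

```shell
# Hypothetical re-creation of the documented mapping with sed;
# it only mimics the result and is not how tei2korapxml is invoked.
xmlid_to_textsigle() {
  printf '%s\n' "$1" |
    sed -E 's@ICC.German\.([^.]+\.[^.]+)\.(.+)@ICCGER/\1/\2@'
}

xmlid_to_textsigle 'ICC.German.DeReKo.WPD17.G11.00238'
# prints ICCGER/DeReKo.WPD17/G11.00238
```

The first group captures the two dot-separated parts that form the document
level, and the second group captures the remainder as the text level.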

=item B<--inline-tokens> <foundry>#[<file>]

Define the foundry and file (without extension)
to store inline token information in.
Unless C<--skip-inline-token-annotations> is set,
this will contain annotations as well.
Defaults to C<tokens> and C<morpho>.

The inline token data will also be stored in the
inline structures file (see I<--inline-structures>),
unless the inline token foundry is prefixed
with an exclamation mark (B<!>), indicating that inline
tokens are stored exclusively in the inline tokens
file.

Example:

  tei2korapxml --no-tokenizer --inline-tokens \
      '!gingko#morpho' < data.i5.xml > korapxml.zip
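
The structure of the C<E<lt>foundryE<gt>#[E<lt>fileE<gt>]> value can be
spelled out with a small sketch that splits it into its parts. The helper
name C<parse_inline_spec> is hypothetical, and falling back to the
documented defaults C<tokens> and C<morpho> for missing parts is an
assumption about how the value is interpreted.

```shell
# Hypothetical helper: split a "<foundry>#[<file>]" value into its parts.
parse_inline_spec() {
  spec=$1
  exclusive=no
  case $spec in
    '!'*) exclusive=yes; spec=${spec#?} ;;   # leading "!" means exclusive
  esac
  foundry=${spec%%"#"*}                      # part before the first "#"
  case $spec in
    *'#'*) file=${spec#*"#"} ;;              # part after the first "#"
    *)     file= ;;
  esac
  # Fall back to the documented defaults (assumed behaviour)
  printf 'exclusive=%s foundry=%s file=%s\n' \
    "$exclusive" "${foundry:-tokens}" "${file:-morpho}"
}

parse_inline_spec '!gingko#morpho'
# prints exclusive=yes foundry=gingko file=morpho
```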

=item B<--inline-dependencies> <foundry>#[<file>]

Define the foundry and file (without extension)
to store inline dependency information in.
Defaults to the layer of C<dependency> and
will be ignored if not set (which means that dependency
attributes are stored in the inline tokens file,
if not skipped).

The dependency data will also be stored in the
inline token file (see I<--inline-tokens>),
unless the inline dependencies foundry is prefixed
with an exclamation mark (B<!>), indicating that inline
dependency data is stored exclusively in the inline
dependencies file.

Example:

  tei2korapxml --no-tokenizer --inline-dependencies \
      'gingko#dependency' < data.i5.xml > korapxml.zip

=item B<--inline-structures> <foundry>#[<file>]

Define the foundry and file (without extension)
to store inline structure information in.
Defaults to C<struct> and C<structures>.

=item B<--base-foundry> <foundry>

Define the base foundry to store newly generated
token information in.
Defaults to C<base>.

=item B<--data-file> <file>

Define the file (without extension)
to store primary data information in.
Defaults to C<data>.

=item B<--header-file> <file>

Define the file name (without extension)
to store header information on
the corpus, document, and text level in.
Defaults to C<header>.

=item B<--use-tokenizer-sentence-splits|-s>

Use the sentence boundary information provided by the tokenizer,
either replacing existing boundaries or adding new ones.
Currently, KorAP-tokenizer and certain external tokenizers support
these boundaries.

=item B<--tokens-file> <file>

Define the file (without extension)
to store generated token information in
(either from the KorAP tokenizer or an externally called tokenizer).
Defaults to C<tokens>.

=item B<--log|-l>

Log level for I<Log::Any>. Defaults to C<notice>.

=back

=head1 ENVIRONMENT VARIABLES

=over 2

=item B<KORAPXMLTEI_DEBUG>

Activate minimal debugging.
Defaults to C<false>.

=item B<KORAPXMLTEI_TOKENIZER_HEAP_SIZE>

Set the heap size for the tokenizer process.
Defaults to C<512m>.

=back

=head1 COPYRIGHT AND LICENSE

Copyright (C) 2021-2025, L<IDS Mannheim|https://www.ids-mannheim.de/>

Author: Peter Harders

Contributors: Nils Diewald, Marc Kupietz, Carsten Schnober

L<KorAP::XML::TEI> is developed as part of the L<KorAP|https://korap.ids-mannheim.de/>
Corpus Analysis Platform at the
L<Leibniz Institute for the German Language (IDS)|https://www.ids-mannheim.de/>,
member of the
L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/>.

This program is free software published under the
L<BSD-2 License|https://opensource.org/licenses/BSD-2-Clause>.

=cut