=pod

=encoding utf8

=head1 NAME

tei2korapxml - Conversion of TEI P5 based formats to KorAP-XML

=head1 SYNOPSIS

  cat corpus.i5.xml | tei2korapxml -tk - > corpus.korapxml.zip
  tei2korapxml -tk corpus.i5.xml > corpus.korapxml.zip

=head1 DESCRIPTION

C<tei2korapxml> is a script to convert TEI P5 and
L<I5|https://www.ids-mannheim.de/digspra/kl/projekte/korpora/textmodell>
based documents to the
L<KorAP-XML format|https://github.com/KorAP/KorAP-XML-Krill#about-korap-xml>.

This program is usually called from inside another script.

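Since the tool is typically driven by another script, a minimal wrapper
might look like the following sketch. It relies only on the C<-tk>, C<-i>
and C<-o> options documented under L</OPTIONS>. The function name
C<convert_all>, the directory layout and the C<TEI2KORAPXML> override
variable are assumptions made for illustration and are not part of the tool.

```shell
#!/bin/sh
# Sketch: convert every *.i5.xml file in a directory to its own KorAP-XML zip.
# convert_all, its two directory arguments and TEI2KORAPXML are hypothetical names.
convert_all() {
  in_dir=$1
  out_dir=$2
  mkdir -p "$out_dir"
  for f in "$in_dir"/*.i5.xml; do
    [ -e "$f" ] || continue                 # skip if the glob matched nothing
    base=$(basename "$f" .i5.xml)
    # -tk: KorAP tokenizer, -i: input file, -o: output zip
    "${TEI2KORAPXML:-tei2korapxml}" -tk -i "$f" -o "$out_dir/$base.korapxml.zip"
  done
}
```

A call such as C<convert_all ./corpora ./out> would then write one zip file
per input file.
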
=head1 FORMATS

=head2 Input restrictions

=over 2

=item

TEI P5 formatted input with certain restrictions:

=over 4

=item

B<mandatory>: text-header with integrated textsigle
(or convertible identifier), text-body

=item

B<optional>: corp-header with integrated corpsigle,
doc-header with integrated docsigle

=back

=item

Tokens inside the primary text must not be
separated by newlines, because newlines are removed
(see L<KorAP::XML::TEI::Data>), while converting newlines
into blanks between two tokens could introduce additional blanks
where there should be none (e.g. punctuation characters like C<,> or
C<.> should not be separated from their predecessor token).
See also the code section C<~ whitespace handling ~> in C<script/tei2korapxml>.

=item

Header types, like C<E<lt>idsHeader [...] type="document" [...] E<gt>>,
need to be defined on the same line as the header tag.

=back

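The newline restriction can be illustrated with plain shell tools. This is
only a sketch of the effect, not the code the converter uses. The sample
text and the helper names C<strip_newlines> and C<newlines_to_blanks> are
made up for the illustration.

```shell
# Hypothetical primary text with one token per line
# (the layout that the restriction rules out).
tokens='Hello
,
world'

# Removing the newlines glues all tokens together.
strip_newlines() { printf '%s' "$1" | tr -d '\n'; }

# Converting them to blanks instead puts a blank in front of the comma.
newlines_to_blanks() { printf '%s' "$1" | tr '\n' ' '; }

strip_newlines "$tokens"; echo        # prints: Hello,world
newlines_to_blanks "$tokens"; echo    # prints: Hello , world
```

Neither result matches the intended text C<Hello, world>, which is why
tokens should not be separated by newlines in the input.
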
=head2 Notes on the output

=over 2

=item

Zip file output (default on C<stdout>) with UTF-8 encoded entries
(which together form the KorAP-XML format)

=back

=head1 INSTALLATION

C<tei2korapxml> requires C<libxml2-dev> bindings and L<File::ShareDir::Install> to be installed.
When these requirements are met, the preferred way to install the script is
to use L<cpanm|App::cpanminus>.

  $ cpanm https://github.com/KorAP/KorAP-XML-TEI.git

If everything went well, the C<tei2korapxml> tool will
be available on your command line immediately.

Minimum requirement for L<KorAP::XML::TEI> is Perl 5.38.

=head1 OPTIONS

=over 2

=item B<--input|-i>

The input file to process. If no specific input is defined and a single
dash C<-> is passed as an argument, data is read from C<STDIN>.

Instead of using C<-i>, input files can also be defined as trailing arguments
to the command:

  tei2korapxml -tk corpus1.i5.xml corpus2.i5.xml

=item B<--progress|-p>

Show a progress bar (including ETA).
This option is ignored if valid input is not read from a file.

=item B<--output|-o>

The output zip file to be created. If no specific output is defined,
data is written to C<STDOUT>.

=item B<--root|-r>

The root directory for output. Defaults to C<.>.

=item B<--help|-h>

Print help information.

=item B<--version|-v>

Print version information.

=item B<--tokenizer-korap|-tk>

Use the standard KorAP/DeReKo tokenizer.

=item B<--tokenizer-internal|-ti>

Tokenize the data using two embedded tokenizers
that take an I<aggressive> and a I<conservative>
approach.

=item B<--tokenizer-call|-tc>

Call an external tokenizer process that tokenizes
input from STDIN and outputs the offsets of all tokens.

Texts are separated using C<\x04\n>. The external process
should add a new line per text.

If the L</--use-tokenizer-sentence-splits> option is activated,
sentences are marked by offset as well in new lines.

To use L<Datok|https://github.com/KorAP/Datok> including sentence
splitting, call C<tei2korapxml> as follows:

  $ cat corpus.i5.xml | tei2korapxml -s \
      -tc 'datok tokenize \
      -t ./tokenizer.matok \
      -p --newline-after-eot --no-sentences \
      --no-tokens --sentence-positions -' - \
      > corpus.korapxml.zip

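The input side of this protocol can be sketched with standard shell tools.
The helper names C<frame_texts> and C<count_texts> are hypothetical, and the
sketch only shows the C<\x04\n> framing described above. It makes no
assumption about the exact format of the offsets an external tokenizer
writes back.

```shell
# Frame texts the way an external tokenizer receives them: each text is
# followed by the byte \x04 (octal 004) and a newline.
frame_texts() {
  for t in "$@"; do
    printf '%s\004\n' "$t"
  done
}

# Count the texts in such a stream by counting the \x04 markers.
count_texts() {
  tr -cd '\004' | wc -c | tr -d ' '
}

frame_texts 'First text.' 'Second text.' | count_texts    # prints: 2
```

An external tokenizer would be expected to answer such a stream with one
output line per text, here two lines.
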
=item B<--no-tokenizer>

Boolean flag indicating that no tokenizer should be used.
This is meant to ensure that by default a final token layer always
exists.
If a separate tokenizer is chosen, this flag is ignored.

=item B<--skip-inline-tokens>

Boolean flag indicating that inline tokens should not
be processed. Defaults to false (meaning inline tokens will be processed).

=item B<--skip-inline-token-annotations>

Boolean flag indicating that inline token annotations should not
be processed. Defaults to true (meaning inline token annotations
won't be processed). Can be negated with
C<--no-skip-inline-token-annotations>.

=item B<--skip-inline-tags> <tags>

Expects a comma-separated list of tags to be ignored when the structure
is parsed. The content of these tags, however, will still be processed.

=item B<--auto-textsigle> <textsigle>

Expects a text sigle that serves as a fallback if no text sigles
are given in the input data.
The auto text sigle will be incremented for each text processed.

Example:

  tei2korapxml --auto-textsigle 'ICC/GER.00001' -s -tk - \
      < data.i5.xml > korapxml.zip

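How the sigle is incremented is not spelled out here. The following sketch
assumes that the trailing number is increased by one while its zero padding
is kept. The helper C<next_sigle> is hypothetical and not part of the tool.

```shell
# Hypothetical helper: increment the trailing number of a sigle, keeping its width.
# ASSUMPTION: the part after the last dot is a zero-padded decimal counter.
next_sigle() {
  prefix=${1%.*}                  # e.g. ICC/GER
  num=${1##*.}                    # e.g. 00001
  width=${#num}
  num=${num#"${num%%[!0]*}"}      # drop leading zeros so the value is not read as octal
  printf "%s.%0${width}d\n" "$prefix" $(( ${num:-0} + 1 ))
}

next_sigle 'ICC/GER.00001'        # prints: ICC/GER.00002
```
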
=item B<--xmlid-to-textsigle> <from-regex>@<to-c/to-d/to-t>

Expects a regular expression replacement (with B<@> separating the
search pattern from the replacement) to convert text id attributes to
text sigles with three parts (separated by B</>).

Example:

  tei2korapxml \
      --xmlid-to-textsigle 'ICC.German\.([^.]+\.[^.]+)\.(.+)@ICCGER/$1/$2' \
      -tk - < t/data/icc_german_sample.p5.xml

Converts text id C<ICC.German.DeReKo.WPD17.G11.00238> to
sigle C<ICCGER/DeReKo.WPD17/G11.00238>.

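To try out a rule before running a conversion, the replacement can be
approximated with C<sed>. This is a sketch only: the helper
C<xmlid_to_sigle> is hypothetical, and C<sed -E> does not support every
Perl regular expression feature, so complex rules may behave differently
in the tool itself.

```shell
# Hypothetical helper: split '<from-regex>@<to>' at the first '@' and
# apply the rule to a single text id with sed.
xmlid_to_sigle() {
  rule=$1
  id=$2
  from=${rule%%@*}          # search pattern (before the first '@')
  to=${rule#*@}             # replacement (after the first '@')
  # Translate Perl-style $1, $2 into sed-style \1, \2.
  to=$(printf '%s' "$to" | sed 's/\$\([0-9]\)/\\\1/g')
  printf '%s\n' "$id" | sed -E "s#${from}#${to}#"
}

xmlid_to_sigle 'ICC.German\.([^.]+\.[^.]+)\.(.+)@ICCGER/$1/$2' \
  'ICC.German.DeReKo.WPD17.G11.00238'
# prints: ICCGER/DeReKo.WPD17/G11.00238
```
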
=item B<--inline-tokens> <foundry>#[<file>]

Define the foundry and file (without extension)
to store inline token information in.
Unless C<--skip-inline-token-annotations> is set,
this will contain annotations as well.
Defaults to C<tokens> and C<morpho>.

The inline token data will also be stored in the
inline structures file (see I<--inline-structures>),
unless the inline token foundry is prefixed
with an exclamation mark (B<!>), indicating that inline
tokens are stored exclusively in the inline tokens
file.

Example:

  tei2korapxml --no-tokenizer --inline-tokens \
      '!gingko#morpho' < data.i5.xml > korapxml.zip

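The parts of such a value can be pictured with the following sketch. The
function C<parse_inline_spec> is hypothetical and only mirrors the syntax
described above. It is not how the tool itself parses the option.

```shell
# Hypothetical parser for a '<foundry>#[<file>]' value with an optional leading '!'.
parse_inline_spec() {
  spec=$1
  exclusive=no
  case $spec in
    '!'*) exclusive=yes; spec=${spec#'!'} ;;   # '!' requests exclusive storage
  esac
  foundry=${spec%%'#'*}                        # part before the first '#'
  case $spec in
    *'#'*) file=${spec#*'#'} ;;                # part after the first '#'
    *)     file= ;;                            # no file given
  esac
  printf 'foundry=%s file=%s exclusive=%s\n' \
    "$foundry" "${file:-(default)}" "$exclusive"
}

parse_inline_spec '!gingko#morpho'
# prints: foundry=gingko file=morpho exclusive=yes
```
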
=item B<--inline-dependencies> <foundry>#[<file>]

Define the foundry and file (without extension)
to store inline dependency information in.
Defaults to the layer of C<dependency> and
will be ignored if not set (which means, dependency
attributes will be stored in the inline tokens file,
if not skipped).

The dependency data will also be stored in the
inline token file (see I<--inline-tokens>),
unless the inline dependencies foundry is prefixed
with an exclamation mark (B<!>), indicating that inline
dependency data is stored exclusively in the inline
dependencies file.

Example:

  tei2korapxml --no-tokenizer --inline-dependencies \
      'gingko#dependency' < data.i5.xml > korapxml.zip

=item B<--inline-structures> <foundry>#[<file>]

Define the foundry and file (without extension)
to store inline structure information in.
Defaults to C<struct> and C<structures>.

=item B<--base-foundry> <foundry>

Define the base foundry to store newly generated
token information in.
Defaults to C<base>.

=item B<--data-file> <file>

Define the file (without extension)
to store primary data information in.
Defaults to C<data>.

=item B<--header-file> <file>

Define the file name (without extension)
to store header information on
the corpus, document, and text level in.
Defaults to C<header>.

=item B<--use-tokenizer-sentence-splits|-s>

Replace existing sentence boundary information with, or add,
the sentence boundaries provided by the tokenizer.
Currently the KorAP tokenizer and certain external tokenizers support
these boundaries.

=item B<--tokens-file> <file>

Define the file (without extension)
to store generated token information in
(either from the KorAP tokenizer or an externally called tokenizer).
Defaults to C<tokens>.

=item B<--log|-l>

Log level for I<Log::Any>. Defaults to C<notice>.

=back

=head1 ENVIRONMENT VARIABLES

=over 2

=item B<KORAPXMLTEI_DEBUG>

Activate minimal debugging.
Defaults to C<false>.

=item B<KORAPXMLTEI_TOKENIZER_HEAP_SIZE>

Set the heap size for the tokenizer process.
Defaults to C<512m>.

=back

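Environment variables can be set for a single invocation by prefixing the
command. In the sketch below, the wrapper name C<convert_with_big_heap> is
hypothetical and the value C<1g> is an assumption modelled on the
documented default of C<512m>.

```shell
# Hypothetical wrapper: run one conversion with a larger tokenizer heap.
# ASSUMPTION: 1g is an accepted value, by analogy with the default 512m.
convert_with_big_heap() {
  KORAPXMLTEI_TOKENIZER_HEAP_SIZE=1g tei2korapxml -tk -i "$1" -o "$2"
}
```

The variable then only applies to that one call and does not change the
surrounding shell environment for later commands run outside the wrapper.
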
=head1 COPYRIGHT AND LICENSE

Copyright (C) 2021-2025, L<IDS Mannheim|https://www.ids-mannheim.de/>

Author: Peter Harders

Contributors: Nils Diewald, Marc Kupietz, Carsten Schnober

L<KorAP::XML::TEI> is developed as part of the L<KorAP|https://korap.ids-mannheim.de/>
Corpus Analysis Platform at the
L<Leibniz Institute for the German Language (IDS)|https://www.ids-mannheim.de/>,
member of the
L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/>.

This program is free software published under the
L<BSD-2 License|https://opensource.org/licenses/BSD-2-Clause>.

=cut