=pod

=encoding utf8

=head1 NAME

tei2korapxml - Conversion of TEI P5 based formats to KorAP-XML

=head1 SYNOPSIS

  cat corpus.i5.xml | tei2korapxml - > corpus.korapxml.zip

=head1 DESCRIPTION

C<tei2korapxml> is a script to convert TEI P5 and
L<I5|https://www.ids-mannheim.de/digspra/kl/projekte/korpora/textmodell>
based documents to the
L<KorAP-XML format|https://github.com/KorAP/KorAP-XML-Krill#about-korap-xml>.

This program is usually called from inside another script.

=head1 FORMATS

=head2 Input restrictions

=over 2

=item

TEI P5 formatted input with certain restrictions:

=over 4

=item

B<mandatory>: text-header with integrated textsigle
(or convertible identifier), text-body

=item

B<optional>: corp-header with integrated corpsigle,
doc-header with integrated docsigle

=back

=item

Tokens inside the primary text must not be
separated by newlines, because newlines are removed
(see L<KorAP::XML::TEI::Data>) and converting newlines
into blanks between two tokens could introduce additional blanks
where there should be none (e.g. punctuation characters like C<,> or
C<.> should not be separated from their predecessor token).
See also the code section C<~ whitespace handling ~> in C<script/tei2korapxml>.

=item

Header types, like C<E<lt>idsHeader [...] type="document" [...] E<gt>>
need to be defined in the same line as the header tag.

=back

=head2 Notes on the output

=over 2

=item

zip file output (default on C<stdout>) with utf8 encoded entries
(which together form the KorAP-XML format)

=back

=head1 INSTALLATION

C<tei2korapxml> requires C<libxml2-dev> bindings and L<File::ShareDir::Install> to be installed.
When these requirements are met, the preferred way to install the script is
to use L<cpanm|App::cpanminus>.

  $ cpanm https://github.com/KorAP/KorAP-XML-TEI.git

In case everything went well, the C<tei2korapxml> tool will
be available on your command line immediately.

Minimum requirement for L<KorAP::XML::TEI> is Perl 5.16.

=head1 OPTIONS

=over 2

=item B<--input|-i>

The input file to process. If no specific input is defined and a single
dash C<-> is passed as an argument, data is read from C<STDIN>.

=item B<--output|-o>

The output zip file to be created. If no specific output is defined,
data is written to C<STDOUT>.
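As a sketch with placeholder file names, C<-i> and C<-o> can be combined
to read from and write to named files instead of the standard streams:

  tei2korapxml -i corpus.i5.xml -o corpus.korapxml.zip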

=item B<--root|-r>

The root directory for output. Defaults to C<.>.

=item B<--help|-h>

Print help information.

=item B<--version|-v>

Print version information.

=item B<--tokenizer-korap|-tk>

Use the standard KorAP/DeReKo tokenizer.

=item B<--tokenizer-internal|-ti>

Tokenize the data using two embedded tokenizers
that take an I<aggressive> and a I<conservative>
approach, respectively.

=item B<--tokenizer-call|-tc>

Call an external tokenizer process that tokenizes
from STDIN and outputs the offsets of all tokens.

Texts are separated using C<\x04\n>. The external process
should add a new line per text.

If the L</--use-tokenizer-sentence-splits> option is activated,
sentences are marked by offset as well in new lines.

To use L<Datok|https://github.com/KorAP/Datok> including sentence
splitting, call C<tei2korapxml> as follows:

  $ cat corpus.i5.xml | tei2korapxml -s \
      -tc 'datok tokenize \
      -t ./tokenizer.matok \
      -p --newline-after-eot --no-sentences \
      --no-tokens --sentence-positions -' - \
      > corpus.korapxml.zip
Akron0c41ab32020-09-29 07:33:33 +0200143
Akron6b1f26b2024-09-19 11:35:32 +0200144=item B<--no-tokenizer>
145
146Boolean flag indicating that no tokenizer should be used.
147This is meant to ensure that by default a final token layer always
148exists.
149If a separate tokenizer is chosen, this flag is ignored.
150
Akron75d63142021-02-23 18:40:56 +0100151=item B<--skip-inline-tokens>
152
153Boolean flag indicating that inline tokens should not
154be processed. Defaults to false (meaning inline tokens will be processed).
155
Akron692d17d2021-03-05 13:21:03 +0100156=item B<--skip-inline-token-annotations>
157
158Boolean flag indicating that inline token annotations should not
159be processed. Defaults to true (meaning inline token annotations
Akron6b1f26b2024-09-19 11:35:32 +0200160won't be processed). Can be negated with
161C<--no-skip-inline-token-annotations>.
Akron692d17d2021-03-05 13:21:03 +0100162
Akronca70a1d2021-02-25 16:21:31 +0100163=item B<--skip-inline-tags> <tags>
Akron54c3ff12021-02-25 11:33:37 +0100164
165Expects a comma-separated list of tags to be ignored when the structure
166is parsed. Content of these tags however will be processed.
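A sketch of such a call, with C<hi> and C<ref> chosen purely as
placeholder tag names and C<data.i5.xml> as a placeholder input:

  tei2korapxml --skip-inline-tags 'hi,ref' -tk - \
    < data.i5.xml > korapxml.zip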

=item B<--auto-textsigle> <textsigle>

Expects a text sigle that serves as a fallback if no text sigles
are given in the input data.
The auto text sigle will be incremented for each text processed.

Example:

  tei2korapxml --auto-textsigle 'ICC/GER.00001' -s -tk - \
    < data.i5.xml > korapxml.zip

=item B<--xmlid-to-textsigle> <from-regex>@<to-c/to-d/to-t>

Expects a regular replacement expression (separated by B<@> between the
search and the replacement) to convert text id attributes to text sigles
with three parts (separated by B</>).

Example:

  tei2korapxml \
    --xmlid-to-textsigle 'ICC.German\.([^.]+\.[^.]+)\.(.+)@ICCGER/$1/$2' \
    -tk - < t/data/icc_german_sample.p5.xml

Converts text id C<ICC.German.DeReKo.WPD17.G11.00238> to
sigle C<ICCGER/DeReKo.WPD17/G11.00238>.
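The search and replacement parts of this expression can be tried out
in isolation with C<sed -E>. This is only an illustration: C<sed> writes
backreferences as C<\1> instead of C<$1>, and nothing is implied here about
how C<tei2korapxml> applies the expression internally.

  echo 'ICC.German.DeReKo.WPD17.G11.00238' \
    | sed -E 's@ICC.German\.([^.]+\.[^.]+)\.(.+)@ICCGER/\1/\2@'

The command prints C<ICCGER/DeReKo.WPD17/G11.00238>.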

=item B<--inline-tokens> <foundry>#[<file>]

Define the foundry and file (without extension)
to store inline token information in.
Unless C<--skip-inline-token-annotations> is set,
this will contain annotations as well.
Defaults to C<tokens> and C<morpho>.

The inline token data will also be stored in the
inline structures file (see I<--inline-structures>),
unless the inline token foundry is prepended
by an B<!> exclamation mark, indicating that inline
tokens are stored exclusively in the inline tokens
file.

Example:

  tei2korapxml --no-tokenizer --inline-tokens \
    '!gingko#morpho' < data.i5.xml > korapxml.zip

=item B<--inline-dependencies> <foundry>#[<file>]

Define the foundry and file (without extension)
to store inline dependency information in.
Defaults to the layer of C<dependency> and
will be ignored if not set (which means, dependency
attributes will be stored in the inline tokens file,
if not skipped).

The dependency data will also be stored in the
inline token file (see I<--inline-tokens>),
unless the inline dependencies foundry is prepended
by an B<!> exclamation mark, indicating that inline
dependency data is stored exclusively in the inline
dependencies file.

Example:

  tei2korapxml --no-tokenizer --inline-dependencies \
    'gingko#dependency' < data.i5.xml > korapxml.zip

=item B<--inline-structures> <foundry>#[<file>]

Define the foundry and file (without extension)
to store inline structure information in.
Defaults to C<struct> and C<structures>.

=item B<--base-foundry> <foundry>

Define the base foundry to store newly generated
token information in.
Defaults to C<base>.

=item B<--data-file> <file>

Define the file (without extension)
to store primary data information in.
Defaults to C<data>.

=item B<--header-file> <file>

Define the file name (without extension)
to store header information on
the corpus, document, and text level in.
Defaults to C<header>.

=item B<--use-tokenizer-sentence-splits|-s>

Replace existing sentence boundary information with, or add,
the sentence boundaries provided by the tokenizer.
Currently, the KorAP tokenizer and certain external tokenizers support
these boundaries.

=item B<--tokens-file> <file>

Define the file (without extension)
to store generated token information in
(either from the KorAP tokenizer or an externally called tokenizer).
Defaults to C<tokens>.

=item B<--log|-l>

Loglevel for I<Log::Any>. Defaults to C<notice>.
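I<Log::Any> also defines more verbose levels such as C<debug>. A sketch
of raising the verbosity, assuming the level names of I<Log::Any> are
accepted as given and using a placeholder input file:

  tei2korapxml --log debug -tk - < data.i5.xml > korapxml.zip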

=back

=head1 ENVIRONMENT VARIABLES

=over 2

=item B<KORAPXMLTEI_DEBUG>

Activate minimal debugging.
Defaults to C<false>.
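A sketch of setting the variable for a single call, assuming that a true
value such as C<1> activates it (the input file is a placeholder):

  KORAPXMLTEI_DEBUG=1 tei2korapxml -tk - < data.i5.xml > korapxml.zip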

=back

=head1 COPYRIGHT AND LICENSE

Copyright (C) 2021-2024, L<IDS Mannheim|https://www.ids-mannheim.de/>

Author: Peter Harders

Contributors: Nils Diewald, Marc Kupietz, Carsten Schnober

L<KorAP::XML::TEI> is developed as part of the L<KorAP|https://korap.ids-mannheim.de/>
Corpus Analysis Platform at the
L<Leibniz Institute for the German Language (IDS)|https://www.ids-mannheim.de/>,
member of the
L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/>.

This program is free software published under the
L<BSD-2 License|https://opensource.org/licenses/BSD-2-Clause>.

=cut