=pod

=encoding utf8

=head1 NAME

tei2korapxml - Conversion of TEI P5 based formats to KorAP-XML

=head1 SYNOPSIS

  cat corpus.i5.xml | tei2korapxml - > corpus.korapxml.zip

=head1 DESCRIPTION

C<tei2korapxml> is a script to convert TEI P5 and
L<I5|https://www.ids-mannheim.de/digspra/kl/projekte/korpora/textmodell>
based documents to the
L<KorAP-XML format|https://github.com/KorAP/KorAP-XML-Krill#about-korap-xml>.

This program is usually called from inside another script.

=head1 FORMATS

=head2 Input restrictions

=over 2

=item

TEI P5 formatted input with certain restrictions:

=over 4

=item

B<mandatory>: text-header with integrated textsigle
(or convertible identifier), text-body

=item

B<optional>: corp-header with integrated corpsigle,
doc-header with integrated docsigle

=back

=item

Tokens inside the primary text must not be separated by
newlines, because newlines are removed
(see L<KorAP::XML::TEI::Data>), and converting newlines
into blanks between two tokens could introduce additional blanks
where there should be none (e.g. punctuation characters like C<,> or
C<.> should not be separated from their predecessor token).
See also the code section C<~ whitespace handling ~> in C<script/tei2korapxml>.

=item

Header types, like C<E<lt>idsHeader [...] type="document" [...] E<gt>>,
need to be defined on the same line as the header tag.

=back
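
The whitespace restriction above can be illustrated with a small shell sketch.
It is not the converter's actual code. The sample text and the helper names
C<collapse_removed> and C<collapse_blanked> are made up for illustration.

```shell
# Hypothetical primary text in which every token sits on its own line.
text='Hello
,
world
.'

# Removing newlines glues neighbouring tokens together.
collapse_removed() { printf '%s' "$1" | tr -d '\n'; }

# Replacing newlines with blanks puts a blank before punctuation.
collapse_blanked() { printf '%s' "$1" | tr '\n' ' '; }

collapse_removed "$text"; echo   # prints: Hello,world.
collapse_blanked "$text"; echo   # prints: Hello , world .
```

Neither variant restores the intended C<Hello, world.>, which is why tokens
should not be placed on separate lines in the input.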

=head2 Notes on the output

=over 2

=item

ZIP file output (written to C<STDOUT> by default) with UTF-8 encoded entries
(which together form the KorAP-XML format)

=back

=head1 INSTALLATION

C<tei2korapxml> requires C<libxml2-dev> bindings and L<File::ShareDir::Install> to be installed.
When these requirements are met, the preferred way to install the script is
to use L<cpanm|App::cpanminus>.

  $ cpanm https://github.com/KorAP/KorAP-XML-TEI.git

If everything went well, the C<tei2korapxml> tool will
be available on your command line immediately.

The minimum requirement for L<KorAP::XML::TEI> is Perl 5.16.

=head1 OPTIONS

=over 2

=item B<--input|-i>

The input file to process. If no specific input is defined and a single
dash C<-> is passed as an argument, data is read from C<STDIN>.

=item B<--output|-o>

The output zip file to be created. If no specific output is defined,
data is written to C<STDOUT>.

=item B<--root|-r>

The root directory for output. Defaults to C<.>.

=item B<--help|-h>

Print help information.

=item B<--version|-v>

Print version information.

=item B<--tokenizer-korap|-tk>

Use the standard KorAP/DeReKo tokenizer.

=item B<--tokenizer-internal|-ti>

Tokenize the data using two embedded tokenizers
that take an I<aggressive> and a I<conservative>
approach.

=item B<--tokenizer-call|-tc>

Call an external tokenizer process that reads the data
from C<STDIN> and outputs the offsets of all tokens.

Texts are separated using C<\x04\n>. The external process
should add a new line per text.

If the L</--use-tokenizer-sentence-splits> option is activated,
sentences are marked by offset as well in new lines.

To use L<Datok|https://github.com/KorAP/Datok> including sentence
splitting, call C<tei2korapxml> as follows:

  $ cat corpus.i5.xml | tei2korapxml -s \
      -tc 'datok tokenize \
           -t ./tokenizer.matok \
           -p --newline-after-eot --no-sentences \
           --no-tokens --sentence-positions -' - \
      > corpus.korapxml.zip
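
As a rough sketch of this protocol, the following toy tokenizer splits on
blanks only. It assumes that each text arrives on a single line terminated by
C<\x04>, and that offsets are printed as space-separated start and end
positions, one line per text. The exact output format is not spelled out
above, so both points are assumptions, and the function name
C<toy_tokenizer> is made up.

```shell
# Toy whitespace tokenizer (illustrative only, not a real tokenizer).
# ASSUMPTION: one text per input line, each ending in \x04.
# ASSUMPTION: offsets are space-separated "start end" pairs.
toy_tokenizer() {
  awk '{
    sub(/\004$/, "")            # drop the end-of-text marker
    rest = $0; pos = 0; out = ""
    while (match(rest, /[^ ]+/)) {
      start = pos + RSTART - 1  # offset of the first character
      end = start + RLENGTH     # offset just past the last character
      out = out (out == "" ? "" : " ") start " " end
      rest = substr(rest, RSTART + RLENGTH)
      pos = end
    }
    print out
  }'
}

printf 'Hello world\004\nGood day\004\n' | toy_tokenizer
# prints:
# 0 5 6 11
# 0 4 5 8
```

Whether offsets count characters or bytes depends on the awk implementation
and locale. The ASCII sample avoids that difference.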

=item B<--no-tokenizer>

Boolean flag indicating that no tokenizer should be used.
Requiring this flag explicitly is meant to ensure that,
by default, a final token layer always exists.
If a separate tokenizer is chosen, this flag is ignored.

=item B<--skip-inline-tokens>

Boolean flag indicating that inline tokens should not
be processed. Defaults to false (meaning inline tokens will be processed).

=item B<--skip-inline-token-annotations>

Boolean flag indicating that inline token annotations should not
be processed. Defaults to true (meaning inline token annotations
won't be processed). Can be negated with
C<--no-skip-inline-token-annotations>.

=item B<--skip-inline-tags> <tags>

Expects a comma-separated list of tags to be ignored when the structure
is parsed. The content of these tags, however, will still be processed.

=item B<--xmlid-to-textsigle> <from-regex>@<to-c/to-d/to-t>

Expects a regular expression replacement (with the search pattern and
the replacement separated by B<@>) to convert text id attributes to text sigles
with three parts (separated by B</>).

Example:

  tei2korapxml \
    --xmlid-to-textsigle 'ICC.German\.([^.]+\.[^.]+)\.(.+)@ICCGER/$1/$2' \
    -tk - < t/data/icc_german_sample.p5.xml

This converts the text id C<ICC.German.DeReKo.WPD17.G11.00238> to the
sigle C<ICCGER/DeReKo.WPD17/G11.00238>.
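
The substitution in this example can be reproduced with C<sed> as a quick
check of an expression before running the converter. This is only an
approximation: C<sed -E> uses POSIX extended regular expressions and
C<\1> instead of C<$1>, and the helper name C<to_textsigle> is hypothetical.

```shell
# Mirror of the example rule, using "@" as the sed delimiter
# just like the option uses "@" as its separator.
to_textsigle() {
  printf '%s\n' "$1" |
    sed -E 's@ICC.German\.([^.]+\.[^.]+)\.(.+)@ICCGER/\1/\2@'
}

to_textsigle 'ICC.German.DeReKo.WPD17.G11.00238'
# prints: ICCGER/DeReKo.WPD17/G11.00238
```

The first group captures the two dot-separated parts that form the document
level of the sigle, and the second group captures the rest as the text level.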

=item B<--inline-tokens> <foundry>#[<file>]

Define the foundry and file (without extension)
to store inline token information in.
Unless C<--skip-inline-token-annotations> is set,
this will contain annotations as well.
Defaults to C<tokens> and C<morpho>.

The inline token data will also be stored in the
inline structures file (see I<--inline-structures>),
unless the inline token foundry is prefixed
with an exclamation mark (B<!>), indicating that inline
tokens are stored exclusively in the inline tokens
file.

Example:

  tei2korapxml --no-tokenizer --inline-tokens \
    '!gingko#morpho' < data.i5.xml > korapxml.zip
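
How such a value decomposes can be sketched with plain shell parameter
expansion. The helper C<parse_inline_spec> is hypothetical and only mirrors
the syntax described above. It is not how C<tei2korapxml> itself parses the
option, and it does not apply the defaults.

```shell
# Hypothetical helper: split a "<foundry>#[<file>]" value into its parts.
parse_inline_spec() {
  spec=$1
  exclusive=no
  case $spec in
    '!'*) exclusive=yes; spec=${spec#?} ;;   # strip the leading "!"
  esac
  foundry=${spec%%'#'*}                      # part before the first "#"
  case $spec in
    *'#'*) file=${spec#*'#'} ;;              # part after the first "#"
    *) file= ;;                              # no file given
  esac
  printf 'foundry=%s file=%s exclusive=%s\n' "$foundry" "$file" "$exclusive"
}

parse_inline_spec '!gingko#morpho'
# prints: foundry=gingko file=morpho exclusive=yes
```

The same decomposition applies to the values of C<--inline-dependencies>
and C<--inline-structures>, which use the same syntax.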

=item B<--inline-dependencies> <foundry>#[<file>]

Define the foundry and file (without extension)
to store inline dependency information in.
The layer defaults to C<dependency>. The option is
ignored if not set (which means dependency
attributes will be stored in the inline tokens file,
if not skipped).

The dependency data will also be stored in the
inline tokens file (see I<--inline-tokens>),
unless the inline dependencies foundry is prefixed
with an exclamation mark (B<!>), indicating that inline
dependency data is stored exclusively in the inline
dependencies file.

Example:

  tei2korapxml --no-tokenizer --inline-dependencies \
    'gingko#dependency' < data.i5.xml > korapxml.zip

=item B<--inline-structures> <foundry>#[<file>]

Define the foundry and file (without extension)
to store inline structure information in.
Defaults to C<struct> and C<structures>.

=item B<--base-foundry> <foundry>

Define the base foundry to store newly generated
token information in.
Defaults to C<base>.

=item B<--data-file> <file>

Define the file (without extension)
to store primary data information in.
Defaults to C<data>.

=item B<--header-file> <file>

Define the file name (without extension)
to store header information on
the corpus, document, and text level in.
Defaults to C<header>.

=item B<--use-tokenizer-sentence-splits|-s>

Use the sentence boundary information provided by the tokenizer,
either replacing existing boundaries or adding new ones.
Currently the KorAP tokenizer and certain external tokenizers support
these boundaries.

=item B<--tokens-file> <file>

Define the file (without extension)
to store generated token information in
(either from the KorAP tokenizer or an externally called tokenizer).
Defaults to C<tokens>.

=item B<--log|-l>

Log level for I<Log::Any>. Defaults to C<notice>.

=back

=head1 ENVIRONMENT VARIABLES

=over 2

=item B<KORAPXMLTEI_DEBUG>

Activate minimal debugging.
Defaults to C<false>.

=back

=head1 COPYRIGHT AND LICENSE

Copyright (C) 2021-2024, L<IDS Mannheim|https://www.ids-mannheim.de/>

Author: Peter Harders

Contributors: Nils Diewald, Marc Kupietz, Carsten Schnober

L<KorAP::XML::TEI> is developed as part of the L<KorAP|https://korap.ids-mannheim.de/>
Corpus Analysis Platform at the
L<Leibniz Institute for the German Language (IDS)|https://www.ids-mannheim.de/>,
member of the
L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/>.

This program is free software published under the
L<BSD-2 License|https://opensource.org/licenses/BSD-2-Clause>.

=cut