=pod

=encoding utf8

=head1 NAME

tei2korapxml - Conversion of TEI P5 based formats to KorAP-XML

=head1 SYNOPSIS

  cat corpus.i5.xml | tei2korapxml - > corpus.korapxml.zip

=head1 DESCRIPTION

C<tei2korapxml> is a script to convert TEI P5 and
L<I5|https://www.ids-mannheim.de/digspra/kl/projekte/korpora/textmodell>
based documents to the
L<KorAP-XML format|https://github.com/KorAP/KorAP-XML-Krill#about-korap-xml>.

This program is usually called from inside another script.

=head1 FORMATS

=head2 Input restrictions

=over 2

=item

TEI P5 formatted input with certain restrictions:

=over 4

=item

B<mandatory>: text-header with integrated textsigle
(or convertible identifier), text-body

=item

B<optional>: corp-header with integrated corpsigle,
doc-header with integrated docsigle

=back

=item

Tokens inside the primary text must not be separated
by newlines, because newlines are removed
(see L<KorAP::XML::TEI::Data>), and converting newlines
into blanks between two tokens could introduce additional blanks
where there should be none (e.g. punctuation characters like C<,> or
C<.> should not be separated from their predecessor token).
See also the code section C<~ whitespace handling ~> in C<script/tei2korapxml>.

=item

Header types, like C<E<lt>idsHeader [...] type="document" [...] E<gt>>
need to be defined on the same line as the header tag.

=back

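The last two restrictions can be illustrated with a small shell sketch.
It is not part of C<tei2korapxml>: the sample tokens, the C<version>
attribute and the variable names are assumptions made up for this
illustration. The first part shows why newline-separated tokens cannot be
repaired by either removing the newlines or turning them into blanks; the
second part is a rough pre-flight check that lists C<idsHeader> tags whose
C<type> attribute is not on the same line.

```shell
# Part 1: tokens separated by newlines (hypothetical sample input).
tokens='Hallo
,
Welt
.'

# Removing newlines glues neighbouring tokens together ...
removed=$(printf '%s' "$tokens" | tr -d '\n')
# ... while replacing them with blanks puts a blank before punctuation.
blanked=$(printf '%s' "$tokens" | tr '\n' ' ')
printf 'removed: %s\nblanked: %s\n' "$removed" "$blanked"

# Part 2: report header tags whose type attribute is on a different line.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<idsHeader type="corpus" version="1.1">
<idsHeader
  type="document" version="1.1">
<idsHeader type="text" version="1.1">
EOF

# Line numbers of <idsHeader ...> tags without type= on the same line.
offending=$(grep -n '<idsHeader' "$sample" | grep -v 'type=' | cut -d: -f1)
rm -f "$sample"
printf 'header tags without type on the same line: %s\n' "$offending"
```

Neither variant in the first part reproduces the intended text, which is
why the input should keep tokens on one line in the first place.
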
=head2 Notes on the output

=over 2

=item

The output is a zip file (written to C<stdout> by default) with UTF-8
encoded entries (which together form the KorAP-XML format).

=back

=head1 INSTALLATION

C<tei2korapxml> requires C<libxml2-dev> bindings and L<File::ShareDir::Install> to be installed.
When these requirements are met, the preferred way to install the script is
to use L<cpanm|App::cpanminus>.

  $ cpanm https://github.com/KorAP/KorAP-XML-TEI.git

In case everything went well, the C<tei2korapxml> tool will
be available on your command line immediately.

Minimum requirement for L<KorAP::XML::TEI> is Perl 5.36.

=head1 OPTIONS

=over 2

=item B<--input|-i>

The input file to process. If no specific input is defined and a single
dash C<-> is passed as an argument, data is read from C<STDIN>.

=item B<--root|-r>

The root directory for output. Defaults to C<.>.

=item B<--help|-h>

Print help information.

=item B<--version|-v>

Print version information.

=item B<--tokenizer-korap|-tk>

Use the standard KorAP/DeReKo tokenizer.

=item B<--tokenizer-internal|-ti>

Tokenize the data using two embedded tokenizers
that take an I<aggressive> and a I<conservative>
approach.

=item B<--tokenizer-call|-tc>

Call an external tokenizer process that tokenizes
input from C<STDIN> and outputs the offsets of all tokens.

Texts are separated using C<\x04\n>. The external process
should add a new line per text.

If the L</--use-tokenizer-sentence-splits> option is activated,
sentences are marked by offset as well in new lines.

To use L<Datok|https://github.com/KorAP/Datok> including sentence
splitting, call C<tei2korapxml> as follows:

  $ cat corpus.i5.xml | tei2korapxml -s \
      -tc 'datok tokenize \
      -t ./tokenizer.matok \
      -p --newline-after-eot --no-sentences \
      --no-tokens --sentence-positions -' - \
      > corpus.korapxml.zip

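The exchange with an external tokenizer can be sketched with a toy
stand-in. Everything in this block is an assumption for illustration: the
function name C<toy_tokenizer>, the whitespace-only tokenization, and in
particular the output layout (space-separated start and end offsets, one
line per text), which should be checked against the documentation of the
tokenizer actually used. The sample is plain ASCII, because offset counting
in C<awk> may be byte-based depending on the implementation and locale.

```shell
# Toy stand-in for an external tokenizer (NOT Datok or the KorAP tokenizer).
# Reads texts terminated by EOT (\x04) followed by a newline and prints one
# line of "start end" offset pairs per text.
toy_tokenizer() {
  awk '
    BEGIN { eot = sprintf("%c", 4) }
    {
      line = $0
      ended = 0
      # An EOT character at the end of the line closes the current text.
      if (substr(line, length(line), 1) == eot) {
        line = substr(line, 1, length(line) - 1)
        ended = 1
      }
      n = length(line)
      i = 1
      while (i <= n) {
        # Skip blanks and tabs between tokens.
        while (i <= n && substr(line, i, 1) ~ /[ \t]/) i++
        if (i > n) break
        start = i
        while (i <= n && substr(line, i, 1) !~ /[ \t]/) i++
        sep = (out == "") ? "" : " "
        out = out sep (pos + start - 1) " " (pos + i - 1)
      }
      pos += n + 1   # account for the newline inside a multi-line text
      if (ended) { print out; out = ""; pos = 0 }
    }
  '
}

# Two texts, each followed by the \x04\n separator.
offsets=$(printf 'Ein Text.\004\nNoch ein Text.\004\n' | toy_tokenizer)
printf '%s\n' "$offsets"
```

The point of the sketch is the framing rather than the tokenization: the
stand-in emits exactly one output line for every C<\x04\n>-terminated text
it receives.
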
=item B<--skip-inline-tokens>

Boolean flag indicating that inline tokens should not
be processed. Defaults to false (meaning inline tokens will be processed).

=item B<--skip-inline-token-annotations>

Boolean flag indicating that inline token annotations should not
be processed. Defaults to true (meaning inline token annotations
won't be processed).

=item B<--skip-inline-tags> <tags>

Expects a comma-separated list of tags to be ignored when the structure
is parsed. The content of these tags, however, will be processed.

=item B<--xmlid-to-textsigle> <from-regex>@<to-c/to-d/to-t>

Expects a regular expression substitution (with B<@> separating the
search pattern from the replacement) to convert text id attributes to
text sigles with three parts (separated by B</>).

Example:

  tei2korapxml \
    --xmlid-to-textsigle 'ICC.German\.([^.]+\.[^.]+)\.(.+)@ICCGER/$1/$2' \
    -tk - < t/data/icc_german_sample.p5.xml

Converts text id C<ICC.German.DeReKo.WPD17.G11.00238> to
sigle C<ICCGER/DeReKo.WPD17/G11.00238>.

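To try out such an expression before running a conversion, the mapping
can be approximated with C<sed>. This is only a sketch and not how
C<tei2korapxml> evaluates the option internally: it assumes that the
pattern uses syntax common to Perl and POSIX extended regular expressions,
and the variable names are made up for the illustration.

```shell
# The option value from the example above: <from-regex>@<replacement>.
expr='ICC.German\.([^.]+\.[^.]+)\.(.+)@ICCGER/$1/$2'

# Split at the first "@" into search pattern and replacement.
from=${expr%%@*}
to=${expr#*@}

# sed writes back-references as \1, \2 instead of $1, $2.
to_sed=$(printf '%s' "$to" | sed -E 's/\$([0-9])/\\\1/g')

# Apply the substitution to a sample text id, using "@" as sed delimiter.
sigle=$(printf '%s\n' 'ICC.German.DeReKo.WPD17.G11.00238' \
  | sed -E "s@$from@$to_sed@")
printf '%s\n' "$sigle"
```

With the sample id this yields the three-part sigle named above, so the
same approach can be used to check a new expression against a handful of
ids from a corpus.
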
=item B<--inline-tokens> <foundry>#[<file>]

Define the foundry and file (without extension)
to store inline token information in.
Unless C<--skip-inline-token-annotations> is set,
this will contain annotations as well.
Defaults to C<tokens> and C<morpho>.

The inline token data will also be stored in the
inline structures file (see I<--inline-structures>),
unless the inline token foundry is prefixed
with an exclamation mark (B<!>), indicating that inline
tokens are stored exclusively in the inline tokens
file.

Example:

  tei2korapxml --inline-tokens '!gingko#morpho' < data.i5.xml > korapxml.zip

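How such a value breaks down into its parts can be sketched in plain
shell. The function C<parse_inline_tokens> is hypothetical and only mirrors
the syntax described above (optional C<!>, foundry, C<#>, optional file);
it does not reproduce the option handling or the defaults of
C<tei2korapxml>.

```shell
# Hypothetical helper: split "<foundry>#[<file>]" with an optional leading
# "!" into foundry, file and an "exclusive" marker.
parse_inline_tokens() {
  spec=$1
  exclusive=no
  case $spec in
    '!'*) exclusive=yes; spec=${spec#?} ;;   # drop the leading "!"
  esac
  foundry=${spec%%'#'*}                      # everything before the first "#"
  file=
  case $spec in
    *'#'*) file=${spec#*'#'} ;;              # everything after the first "#"
  esac
  printf '%s %s %s\n' "$foundry" "$file" "$exclusive"
}

parse_inline_tokens '!gingko#morpho'
parse_inline_tokens 'tokens#morpho'
```

The single quotes around the value matter on the command line as well: in
an interactive shell an unquoted C<!> may trigger history expansion, and an
unquoted C<#> can start a comment.
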
=item B<--inline-structures> <foundry>#[<file>]

Define the foundry and file (without extension)
to store inline structure information in.
Defaults to C<struct> and C<structures>.

=item B<--base-foundry> <foundry>

Define the base foundry to store newly generated
token information in.
Defaults to C<base>.

=item B<--data-file> <file>

Define the file (without extension)
to store primary data information in.
Defaults to C<data>.

=item B<--header-file> <file>

Define the file name (without extension)
to store header information on
the corpus, document, and text level in.
Defaults to C<header>.

=item B<--use-tokenizer-sentence-splits|-s>

Replace existing sentence boundary information with the boundaries
provided by the tokenizer, or add them if none exist.
Currently KorAP-tokenizer and certain external tokenizers support
these boundaries.

=item B<--tokens-file> <file>

Define the file (without extension)
to store generated token information in
(either from the KorAP tokenizer or an externally called tokenizer).
Defaults to C<tokens>.

=item B<--log|-l>

Log level for I<Log::Any>. Defaults to C<notice>.

=back

=head1 ENVIRONMENT VARIABLES

=over 2

=item B<KORAPXMLTEI_DEBUG>

Activate minimal debugging.
Defaults to C<false>.

=back

=head1 COPYRIGHT AND LICENSE

Copyright (C) 2021-2023, L<IDS Mannheim|https://www.ids-mannheim.de/>

Author: Peter Harders

Contributors: Nils Diewald, Marc Kupietz, Carsten Schnober

L<KorAP::XML::TEI> is developed as part of the L<KorAP|https://korap.ids-mannheim.de/>
Corpus Analysis Platform at the
L<Leibniz Institute for the German Language (IDS)|https://www.ids-mannheim.de/>,
member of the
L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/>.

This program is free software published under the
L<BSD-2 License|https://opensource.org/licenses/BSD-2-Clause>.

=cut