=pod

=encoding utf8

=head1 NAME

tei2korapxml - Conversion of TEI P5 based formats to KorAP-XML

=head1 SYNOPSIS

 cat corpus.i5.xml | tei2korapxml > corpus.korapxml.zip

=head1 DESCRIPTION

C<tei2korapxml> is a script to convert TEI P5 and
L<I5|https://www.ids-mannheim.de/digspra/kl/projekte/korpora/textmodell>
based documents to the
L<KorAP-XML format|https://github.com/KorAP/KorAP-XML-Krill#about-korap-xml>.
If no specific input is defined, data is
read from C<STDIN>. If no specific output is defined, data is written
to C<STDOUT>.

This program is usually called from inside another script.

=head1 FORMATS

=head2 Input restrictions

=over 2

=item

TEI P5 formatted input with certain restrictions:

=over 4

=item

B<mandatory>: text-header with integrated textsigle, text-body

=item

B<optional>: corp-header with integrated corpsigle,
doc-header with integrated docsigle

=back

=item

Tokens inside the primary text must not be
newline separated, because newlines are removed
(see L<KorAP::XML::TEI::Data>) and a conversion of newlines
into blanks between two tokens could lead to additional blanks
where there should be none (e.g. punctuation characters like C<,> or
C<.> should not be separated from their predecessor token).
See also the code section C<~ whitespace handling ~> in C<script/tei2korapxml>.

=item

Header types, like C<E<lt>idsHeader [...] type="document" [...] E<gt>>,
need to be defined on the same line as the header tag.

=back

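The restrictions above can be roughly checked before conversion. The following
is a minimal sketch in shell and not part of C<tei2korapxml>: the function names
C<join_removed>, C<join_blanked> and C<headers_without_type> are hypothetical
helpers introduced for illustration only. The first two simulate why
newline-separated tokens are problematic; the third is a simple line-based
heuristic that lists C<idsHeader> start tags without a C<type> attribute on
the same line.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; not part of tei2korapxml.

# Simulate plain newline removal: adjacent tokens get merged.
join_removed() {
  tr -d '\n'
}

# Simulate newline-to-blank conversion: a blank appears before punctuation.
join_blanked() {
  paste -sd ' ' -
}

# Print the line numbers of <idsHeader ...> start tags that carry no
# type attribute on the same line (line-based heuristic, no XML parsing).
headers_without_type() {
  grep -n '<idsHeader' "$1" | grep -v 'type=' | cut -d: -f1
}

printf 'Hello\nworld\n,\n' | join_removed; echo   # prints "Helloworld,"
printf 'Hello\nworld\n,\n' | join_blanked         # prints "Hello world ,"
```

Calling C<headers_without_type corpus.i5.xml> prints nothing when every line
containing an C<idsHeader> start tag also contains its C<type> attribute.
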
=head2 Notes on the output

=over 2

=item

The output is a zip file (written to C<STDOUT> by default) with UTF-8
encoded entries, which together form the KorAP-XML format.

=back

=head1 INSTALLATION

C<tei2korapxml> requires C<libxml2-dev> bindings and L<File::ShareDir::Install> to be installed.
When these requirements are met, the preferred way to install the script is
to use L<cpanm|App::cpanminus>.

 $ cpanm https://github.com/KorAP/KorAP-XML-TEI.git

If everything went well, the C<tei2korapxml> tool will
be available on your command line immediately.

Minimum requirement for L<KorAP::XML::TEI> is Perl 5.16.

=head1 OPTIONS

=over 2

=item B<--root|-r>

The root directory for output. Defaults to C<.>.

=item B<--help|-h>

Print help information.

=item B<--version|-v>

Print version information.

=item B<--tokenizer-call|-tc>

Call an external tokenizer process that tokenizes
a single line from C<STDIN> and outputs one token per line.

=item B<--tokenizer-korap|-tk>

Use the standard KorAP/DeReKo tokenizer.

=item B<--tokenizer-internal|-ti>

Tokenize the data using two embedded tokenizers
that take an I<aggressive> and a I<conservative>
approach.

=item B<--skip-inline-tokens>

Boolean flag indicating that inline tokens should not
be processed. Defaults to false (meaning inline tokens will be processed).

=item B<--skip-inline-token-annotations>

Boolean flag indicating that inline token annotations should not
be processed. Defaults to true (meaning inline token annotations
won't be processed).

=item B<--skip-inline-tags> <tags>

Expects a comma-separated list of tags to be ignored when the structure
is parsed. The content of these tags, however, will be processed.

=item B<--inline-tokens> <foundry>#[<file>]

Define the foundry and file (without extension)
to store inline token information in.
Unless C<--skip-inline-token-annotations> is set,
this will contain annotations as well.
Defaults to C<tokens> and C<morpho>.

The inline token data will also be stored in the
inline structures file (see C<--inline-structures>),
unless the inline token foundry is prefixed
with an exclamation mark (B<!>), indicating that inline
tokens are stored exclusively in the inline tokens
file.

Example:

 tei2korapxml --inline-tokens '!gingko#morpho' < data.i5.xml > korapxml.zip

=item B<--inline-structures> <foundry>#[<file>]

Define the foundry and file (without extension)
to store inline structure information in.
Defaults to C<struct> and C<structures>.

=item B<--base-foundry> <foundry>

Define the base foundry to store newly generated
token information in.
Defaults to C<base>.

=item B<--data-file> <file>

Define the file (without extension)
to store primary data information in.
Defaults to C<data>.

=item B<--header-file> <file>

Define the file name (without extension)
to store header information on
the corpus, document, and text level in.
Defaults to C<header>.

=item B<--use-tokenizer-sentence-splits|-s>

Replace existing sentence boundary information with, or add new
sentence boundary information from, the KorAP tokenizer
(currently the only tokenizer supported for this option).

=item B<--tokens-file> <file>

Define the file (without extension)
to store generated token information in
(either from the KorAP tokenizer or an externally called tokenizer).
Defaults to C<tokens>.

=item B<--log|-l>

Log level for I<Log::Any>. Defaults to C<notice>.

=back

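As an illustration of the contract described for C<--tokenizer-call> (text in
on C<STDIN>, one token per line out), a whitespace tokenizer could be sketched
in shell as follows. The function name C<ws_tokenize> is hypothetical, and how
C<tei2korapxml> invokes the external process or delimits the tokens of
consecutive input lines is not specified here, so this is a sketch of the
per-line behaviour only, under these assumptions.

```shell
#!/bin/sh
# Hypothetical whitespace tokenizer; illustrates the per-line contract only.
# Reads text from STDIN and prints one token per line.
ws_tokenize() {
  # Turn runs of blanks and tabs into single newlines,
  # then drop empty lines caused by leading whitespace.
  tr -s ' \t' '\n\n' | sed '/^$/d'
}

printf 'The quick  brown fox.\n' | ws_tokenize
```

Note that such a tokenizer keeps punctuation attached to the preceding word;
a real tokenizer would typically split it off as a separate token.
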
=head1 ENVIRONMENT VARIABLES

=over 2

=item B<KORAPXMLTEI_DEBUG>

Activate minimal debugging.
Defaults to C<false>.

=back

=head1 COPYRIGHT AND LICENSE

Copyright (C) 2021, L<IDS Mannheim|https://www.ids-mannheim.de/>

Author: Peter Harders

Contributors: Nils Diewald, Marc Kupietz, Carsten Schnober

L<KorAP::XML::TEI> is developed as part of the L<KorAP|https://korap.ids-mannheim.de/>
Corpus Analysis Platform at the
L<Leibniz Institute for the German Language (IDS)|https://www.ids-mannheim.de/>,
member of the
L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/>.

This program is free software published under the
L<BSD-2 License|https://opensource.org/licenses/BSD-2-Clause>.

=cut