=pod

=encoding utf8

=head1 NAME

tei2korapxml - Conversion of TEI P5 based formats to KorAP-XML

=head1 SYNOPSIS

 cat corpus.i5.xml | tei2korapxml > corpus.korapxml.zip

=head1 DESCRIPTION

C<tei2korapxml> is a script to convert TEI P5 and
L<I5|https://www1.ids-mannheim.de/kl/projekte/korpora/textmodell.html>
based documents to the
L<KorAP-XML format|https://github.com/KorAP/KorAP-XML-Krill#about-korap-xml>.
If no specific input is defined, data is
read from C<STDIN>. If no specific output is defined, data is written
to C<STDOUT>.

This program is usually called from inside another script.

=head1 FORMATS

=head2 Input restrictions

=over 2

=item

TEI P5 formatted input with certain restrictions:

=over 4

=item

B<mandatory>: text-header with integrated textsigle, text-body

=item

B<optional>: corp-header with integrated corpsigle,
doc-header with integrated docsigle

=back

=item

Tokens inside the primary text must not be
separated by newlines, because newlines are removed
(see L<KorAP::XML::TEI::Data>), and converting newlines
into blanks between two tokens could introduce additional blanks
where there should be none (e.g. punctuation characters like C<,> or
C<.> should not be separated from their predecessor token).
See also the code section C<~ whitespace handling ~>.

=back

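The whitespace restriction above can be illustrated with a minimal shell
sketch. It is an illustration only and not the code used by C<tei2korapxml>:
the helper names C<drop_newlines> and C<newlines_to_blanks> are hypothetical,
and C<tr> merely stands in for the two ways of handling newlines.

```shell
# Illustration only: tr stands in for the newline handling, it is not
# the implementation used by tei2korapxml.

# Hypothetical helper: remove newlines entirely.
drop_newlines() { tr -d '\n'; }

# Hypothetical helper: convert newlines into blanks instead.
newlines_to_blanks() { tr '\n' ' '; }

# Newline-separated tokens fuse once the newline is removed.
printf 'Hello\nworld' | drop_newlines; echo

# Converting to blanks would put a blank in front of the comma.
printf 'world\n,' | newlines_to_blanks; echo
```

Input that keeps the tokens of a text on one line, with blanks only where
blanks are intended, avoids both effects.
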
=head2 Notes on the output

=over 2

=item

The output is a zip file (written to C<STDOUT> by default) with UTF-8
encoded entries, which together form the KorAP-XML format.

=back

=head1 INSTALLATION

C<tei2korapxml> requires the C<libxml2-dev> bindings to build. When
these bindings are available, the preferred way to install the script is
to use L<cpanm|App::cpanminus>.

 $ cpanm https://github.com/KorAP/KorAP-XML-TEI.git

If everything went well, the C<tei2korapxml> tool will
be available on your command line immediately.

The minimum requirement for L<KorAP::XML::TEI> is Perl 5.16.

=head1 OPTIONS

=over 2

=item B<--root|-r>

The root directory for output. Defaults to C<.>.

=item B<--help|-h>

Print help information.

=item B<--version|-v>

Print version information.

=item B<--tokenizer-call|-tc>

Call an external tokenizer process that tokenizes
a single line from STDIN and outputs one token per line.

=item B<--tokenizer-korap|-tk>

Use the standard KorAP/DeReKo tokenizer.

=item B<--tokenizer-internal|-ti>

Tokenize the data using two embedded tokenizers
that take an I<aggressive> and a I<conservative>
approach.

=item B<--inline-tokens> <foundry>#[<file>]

Define the foundry and file (without extension)
to store inline token information in.
If L</KORAPXMLTEI_INLINE> is set, this will contain
annotations as well.
Defaults to C<tokens> and C<morpho>.

=item B<--inline-structures> <foundry>#[<file>]

Define the foundry and file (without extension)
to store inline structure information in.
Defaults to C<struct> and C<structures>.

=item B<--base-foundry> <foundry>

Define the base foundry to store newly generated
token information in.
Defaults to C<base>.

=item B<--data-file> <file>

Define the file (without extension)
to store primary data information in.
Defaults to C<data>.

=item B<--header-file> <file>

Define the file name (without extension)
to store header information on
the corpus, document, and text level in.
Defaults to C<header>.

=item B<--use-tokenizer-sentence-splits|-s>

Replace existing sentence boundary information with, or add,
the sentence boundaries provided by the tokenizer
(currently only supported for the KorAP tokenizer).

=item B<--tokens-file> <file>

Define the file (without extension)
to store generated token information in
(either from the KorAP tokenizer or an externally called tokenizer).
Defaults to C<tokens>.

=item B<--log|-l>

Log level for I<Log::Any>. Defaults to C<notice>.

=back

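As an illustration of the contract described for C<--tokenizer-call>, the
following shell function is a minimal sketch of a whitespace tokenizer that
reads lines from C<STDIN> and prints one token per line. The function name
C<tokenize> is hypothetical, whitespace splitting is only a stand-in for a
real tokenizer, and protocol details that are not stated above (such as how
the end of a text is signalled) are not covered.

```shell
# Hypothetical stand-in for an external tokenizer: reads text line by
# line from STDIN and prints each whitespace-separated token on its
# own line.
tokenize() {
  while IFS= read -r line; do
    # Disable globbing so that tokens like "*" are not expanded.
    set -f
    # The unquoted expansion splits the line on blanks and tabs.
    for token in $line; do
      printf '%s\n' "$token"
    done
    set +f
  done
}
```

A script that wraps this function and calls C<tokenize> at its end could
serve as the external tokenizer process, assuming the contract is exactly
as stated in the option description.
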
=head1 ENVIRONMENT VARIABLES

=over 2

=item B<KORAPXMLTEI_DEBUG>

Activate minimal debugging.
Defaults to C<false>.

=item B<KORAPXMLTEI_INLINE>

Process inline annotations, if present.
Defaults to C<false>.

=back

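The following is a minimal sketch of how these variables might be set from
a shell before calling the script. It assumes that a true value such as
C<1> enables a variable (the text above only states the default), and
C<corpus.i5.xml> is a placeholder input file as in the synopsis.

```shell
# Assumption: a true value such as 1 switches the feature on.
export KORAPXMLTEI_INLINE=1   # process inline annotations
export KORAPXMLTEI_DEBUG=1    # activate minimal debugging

# Run the conversion only if the tool and the placeholder input exist.
if command -v tei2korapxml >/dev/null 2>&1 && [ -r corpus.i5.xml ]; then
  tei2korapxml < corpus.i5.xml > corpus.korapxml.zip
fi
```

Exporting the variables keeps them set for later calls in the same shell
session; prefixing a single command with the assignment limits them to
that call.
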
=head1 COPYRIGHT AND LICENSE

Copyright (C) 2021, L<IDS Mannheim|https://www.ids-mannheim.de/>

Author: Peter Harders

Contributors: Nils Diewald, Marc Kupietz, Carsten Schnober

L<KorAP::XML::TEI> is developed as part of the L<KorAP|https://korap.ids-mannheim.de/>
Corpus Analysis Platform at the
L<Leibniz Institute for the German Language (IDS)|http://ids-mannheim.de/>,
member of the
L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/>.

This program is free software published under the
L<BSD-2 License|https://opensource.org/licenses/BSD-2-Clause>.

=cut