=pod

=encoding utf8

=head1 NAME

korapxml2krill - Merge KorAP-XML data and create Krill documents


=head1 SYNOPSIS

 korapxml2krill [archive|extract] --input <directory|archive> [options]


=head1 DESCRIPTION

L<KorAP::XML::Krill> is a library to convert KorAP-XML documents to files
compatible with the L<Krill|https://github.com/KorAP/Krill> indexer.
The C<korapxml2krill> command line tool is a simple wrapper around the library.


=head1 INSTALLATION

The preferred way to install L<KorAP::XML::Krill> is to use L<cpanm|App::cpanminus>.

 $ cpanm https://github.com/KorAP/KorAP-XML-Krill.git

If everything went well, the C<korapxml2krill> tool will
be available on your command line immediately.
The minimum requirement for L<KorAP::XML::Krill> is Perl 5.14.
In addition, the C<unzip> tool needs to be present to work with zip archives.

=head1 ARGUMENTS

 $ korapxml2krill -z --input <directory> --output <filename>

Without arguments, C<korapxml2krill> converts a directory of a single KorAP-XML document.
It expects the input to point to the text level folder.

=over 2

=item B<archive>

 $ korapxml2krill archive -z --input <directory|archive> --output <directory>

Converts an archive of KorAP-XML documents. It expects a directory
(pointing to the corpus level folder) or one or more zip files as input.

=item B<extract>

 $ korapxml2krill extract --input <archive> --output <directory> --sigle <SIGLE>

Extracts KorAP-XML documents from a zip file.

=back
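
As a rough illustration of the C<archive> and C<extract> commands, the
following shell sketch only prints the two calls instead of running them.
The function name C<plan_workflow>, the file names, and the sigle are
hypothetical placeholders and not part of the tool; only the commands and
options shown in this section are used.

```shell
#!/bin/sh
# Dry-run helper: prints one extract call and one archive call.
# All names passed in are hypothetical placeholders.
plan_workflow() {
  archive=$1   # input zip file
  sigle=$2     # sigle of the text(s) to extract
  outdir=$3    # base output directory

  # Extract the KorAP-XML documents matching the sigle from the zip file
  echo "korapxml2krill extract --input $archive --output $outdir/xml --sigle $sigle"
  # Convert the archive to gzip-compressed Krill documents
  echo "korapxml2krill archive -z --input $archive --output $outdir/krill"
}

plan_workflow news.zip NEWS/DOC1/00001 out
```

Printing the calls first makes it easy to check paths and sigles before
running the actual conversion, which may take a while on large archives.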


=head1 OPTIONS

=over 2

=item B<--input|-i> <directory|zip file>

Directory or zip file(s) of documents to convert.

Without arguments, C<korapxml2krill> expects a folder of a single KorAP-XML
document, while C<archive> and C<extract> support zip files as well.

C<archive> supports multiple input zip files with the constraint
that the first archive listed contains all primary data files
and all metadata files.

 -i file/news.zip -i file/news.malt.zip -i "#file/news.tt.zip"

(The directory structure follows the base directory format,
which may include a C<.> root folder.
In this case, further archives lacking a C<.> root folder
need to be passed with a hash sign in front of the archive's name.
This may require quoting the parameter.)

To support zip files, a version of C<unzip> needs to be installed that is
compatible with the archive file.

B<The root folder switch using the hash sign is experimental and
may vanish in future versions.>

=item B<--output|-o> <directory|file>

Output folder for archive processing or
document name for single output (optional).
Writes to C<STDOUT> by default
(in case C<output> is not mandatory due to further options).

=item B<--overwrite|-w>

Overwrite files that already exist.

=item B<--token|-t> <foundry>[#<file>]

Define the default tokenization by specifying
the name of the foundry and optionally the name
of the layer-file. Defaults to C<OpenNLP#tokens>.

=item B<--skip|-s> <foundry>[#<layer>]

Skip specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>. Alternatively you can skip C<#ALL>.
Can be set multiple times.

=item B<--anno|-a> <foundry>#<layer>

Convert specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>.
Can be set multiple times.

=item B<--primary|-p>

Output primary data or not. Defaults to C<true>.
Can be flagged using C<--no-primary> as well.
This is I<deprecated>.

=item B<--jobs|-j>

Define the number of concurrent jobs in separate forks
for archive processing.
Defaults to C<0> (everything runs in a single process).
This is I<experimental>.

=item B<--meta|-m>

Define the metadata parser to use. Defaults to C<I5>.
Metadata parsers can be defined in the C<KorAP::XML::Meta> namespace.
This is I<experimental>.

=item B<--pretty|-y>

Pretty print JSON output. Defaults to C<false>.
This is I<deprecated>.

=item B<--gzip|-z>

Compress the output.
Expects a defined C<output> file in single processing.

=item B<--cache|-c>

File to mmap a cache (using L<Cache::FastMmap>).
Defaults to C<korapxml2krill.cache> in the calling directory.

=item B<--cache-size|-cs>

Size of the cache. Defaults to C<50m>.

=item B<--cache-init|-ci>

Initialize cache file.
Can be flagged using C<--no-cache-init> as well.
Defaults to C<true>.

=item B<--cache-delete|-cd>

Delete cache file after processing.
Can be flagged using C<--no-cache-delete> as well.
Defaults to C<true>.

=item B<--sigle|-sg>

Extract the given texts.
Can be set multiple times.
I<Currently only supported on C<extract>.>
Sigles have the structure C<Corpus>/C<Document>/C<Text>.
In case the C<Text> path is omitted, the whole document will be extracted.
On the document level, the postfix wildcard C<*> is supported.

=item B<--log|-l>

The L<Log4perl> log level, defaults to C<ERROR>.

=item B<--help|-h>

Print this document.

=item B<--version|-v>

Print version information.

=back
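
To make the sigle structure described for C<--sigle> more concrete, the
following shell sketch classifies a sigle by its number of path segments.
It is an illustration only: the function name C<sigle_level> and the example
sigles are hypothetical, and the sketch does not reflect how
C<korapxml2krill> parses sigles internally.

```shell
#!/bin/sh
# Classify a sigle by counting its slash-separated segments.
# Hypothetical helper, not part of korapxml2krill.
sigle_level() {
  case $1 in
    */*/*/*) echo "other" ;;     # more than three segments
    */*/*)   echo "text" ;;      # Corpus/Document/Text
    */*)     echo "document" ;;  # Corpus/Document: the whole document
    *)       echo "other" ;;     # anything else is not covered here
  esac
}

sigle_level "NEWS/DOC1/00001"   # prints "text"
sigle_level "NEWS/DOC1"         # prints "document"
```

A sigle with a trailing wildcard on the document level, such as the
hypothetical C<NEWS/DOC*>, falls into the C<document> branch of this sketch.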

=head1 ANNOTATION SUPPORT

L<KorAP::XML::Krill> has built-in importers for some annotation foundries and layers
developed in the KorAP project that are part of the KorAP preprocessing pipeline.
The base foundry with paragraphs, sentences, and the text element is mandatory for
L<Krill|https://github.com/KorAP/Krill>.

=over 2

=item B<Base>

=over 4

=item #Paragraphs

=item #Sentences

=back

=item B<Connexor>

=over 4

=item #Morpho

=item #Phrase

=item #Sentences

=item #Syntax

=back

=item B<CoreNLP>

=over 4

=item #Constituency

=item #Morpho

=item #NamedEntities

=item #Sentences

=back

=item B<DeReKo>

=over 4

=item #Structure

=back

=item B<Glemm>

=over 4

=item #Morpho

=back

=item B<Mate>

=over 4

=item #Dependency

=item #Morpho

=back

=item B<OpenNLP>

=over 4

=item #Morpho

=item #Sentences

=back

=item B<Sgbr>

=over 4

=item #Lemma

=item #Morpho

=back

=item B<TreeTagger>

=over 4

=item #Morpho

=item #Sentences

=back

=item B<XIP>

=over 4

=item #Constituency

=item #Morpho

=item #Sentences

=back

=back

More importers are in preparation.
New annotation importers can be defined in the C<KorAP::XML::Annotation> namespace.
See the built-in annotation importers as examples.

=head1 AVAILABILITY

 https://github.com/KorAP/KorAP-XML-Krill


=head1 COPYRIGHT AND LICENSE

Copyright (C) 2015-2016, L<IDS Mannheim|http://www.ids-mannheim.de/>

Author: L<Nils Diewald|http://nils-diewald.de/>
Contributor: Eliza Margaretha

L<KorAP::XML::Krill> is developed as part of the L<KorAP|http://korap.ids-mannheim.de/>
Corpus Analysis Platform at the
L<Institute for the German Language (IDS)|http://ids-mannheim.de/>,
member of the
L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/en/about-us/leibniz-competition/projekte-2011/2011-funding-line-2/>.

This program is free software published under the
L<BSD-2 License|https://raw.githubusercontent.com/KorAP/KorAP-XML-Krill/master/LICENSE>.

=cut