=pod

=encoding utf8

=head1 NAME

korapxml2krill - Merge KorAP-XML data and create Krill documents


=head1 SYNOPSIS

  korapxml2krill [archive|extract] --input <directory|archive> [options]


=head1 DESCRIPTION

L<KorAP::XML::Krill> is a library to convert KorAP-XML documents to files
compatible with the L<Krill|https://github.com/KorAP/Krill> indexer.
The C<korapxml2krill> command line tool is a simple wrapper around the library.


=head1 INSTALLATION

The preferred way to install L<KorAP::XML::Krill> is to use L<cpanm|App::cpanminus>.

  $ cpanm https://github.com/KorAP/KorAP-XML-Krill.git

If everything went well, the C<korapxml2krill> tool will
be available on your command line immediately.
L<KorAP::XML::Krill> requires at least Perl 5.16.
In addition, the C<unzip> tool needs to be present to work with zip archives.

=head1 ARGUMENTS

  $ korapxml2krill -z --input <directory> --output <filename>

Without arguments, C<korapxml2krill> converts a directory of a single KorAP-XML document.
It expects the input to point to the text level folder.

=over 2

=item B<archive>

  $ korapxml2krill archive -z --input <directory|archive> --output <directory|tar>

Converts an archive of KorAP-XML documents. It expects a directory
(pointing to the corpus level folder) or one or more zip files as input.

=item B<extract>

  $ korapxml2krill extract --input <archive> --output <directory> --sigle <SIGLE>

Extracts KorAP-XML documents from a zip file.

=item B<serial>

  $ korapxml2krill serial -i <archive1> -i <archive2> -o <directory> -cfg <config-file>

Converts archives sequentially. The inputs are not merged but treated
as they are (so they may be premerged or globs).
The C<--output> directory is treated as the base directory where subdirectories
are created based on the archive name. If the C<--to-tar> flag is given,
the output will be a tar file.

=back


=head1 OPTIONS

=over 2

=item B<--input|-i> <directory|zip file>

Directory or zip file(s) of documents to convert.

Without arguments, C<korapxml2krill> expects a folder of a single KorAP-XML
document, while C<archive> expects a KorAP-XML corpus folder or a zip
file to batch process multiple files.
C<extract> expects zip files only.

C<archive> supports multiple input zip files with the constraint
that the first archive listed contains all primary data files
and all metadata files.

  -i file/news.zip -i file/news.malt.zip -i "#file/news.tt.zip"

Input may also be defined using BSD glob wildcards.

  -i 'file/news*.zip'

The expanded input array will be sorted by length, so the shortest
path needs to contain all primary data files and all metadata files.

(The directory structure follows the base directory format,
which may include a C<.> root folder.
In this case further archives lacking a C<.> root folder
need to be passed with a hash sign in front of the archive's name.
This may require quoting the parameter.)

To support zip files, a version of C<unzip> needs to be installed that is
compatible with the archive file.

B<The root folder switch using the hash sign is experimental and
may vanish in future versions.>

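As a rough illustration of the length-based ordering of glob-expanded inputs
described above, the following shell sketch sorts a few hypothetical archive
paths from shortest to longest. The helper C<sort_by_length> and the file names
are assumptions made for this example. It does not reproduce the tool's internal
sorting code and only shows why the shortest path should belong to the archive
holding the primary data and metadata.

```shell
#!/bin/sh
# Order candidate archive paths from shortest to longest,
# mirroring the length-based ordering described above.
# sort_by_length is a hypothetical helper, not part of korapxml2krill.
sort_by_length() {
  # Prefix each line with its length, sort numerically, drop the prefix.
  awk '{ print length($0) "\t" $0 }' | sort -n | cut -f2-
}

# Hypothetical file names, for illustration only.
printf '%s\n' \
  'file/news.malt.zip' \
  'file/news.zip' \
  'file/news.tt.zip' | sort_by_length
```

With these names, C<file/news.zip> sorts first, so it would be the archive
expected to contain the primary data and metadata.
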
=item B<--input-base|-ib> <directory>

The base directory for inputs.


=item B<--output|-o> <directory|file>

Output folder for archive processing, or document name for
single output (optional). Writes to C<STDOUT> by default
(in case C<output> is not mandatory due to further options).

=item B<--overwrite|-w>

Overwrite files that already exist.


=item B<--token|-t> <foundry>#<file>

Define the default tokenization by specifying
the name of the foundry and optionally the name
of the layer-file. Defaults to C<OpenNLP#tokens>.
This will directly take the file instead of running
the layer implementation!

=item B<--base-sentences|-bs> <foundry>#<layer>

Define the layer for base sentences.
If given, this will be used instead of C<Base#Sentences>.
Currently C<DeReKo#Structure> and C<DGD#Structure> are the only additional
layers supported.

Defaults to unset.


=item B<--base-paragraphs|-bp> <foundry>#<layer>

Define the layer for base paragraphs.
If given, this will be used instead of C<Base#Paragraphs>.
Currently C<DeReKo#Structure> is the only additional layer supported.

Defaults to unset.


=item B<--base-pagebreaks|-bpb> <foundry>#<layer>

Define the layer for base pagebreaks.
Currently C<DeReKo#Structure> is the only layer supported.

Defaults to unset.


=item B<--skip|-s> <foundry>[#<layer>]

Skip specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>. Alternatively you can skip C<#ALL>.
Can be set multiple times.


=item B<--anno|-a> <foundry>#<layer>

Convert specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>.
Can be set multiple times.


=item B<--primary|-p>

Output primary data or not. Defaults to C<true>.
Can be flagged using C<--no-primary> as well.
This is I<deprecated>.


=item B<--non-word-tokens|-nwt>

Tokenize non-word tokens like word tokens (defined as matching
C</[\d\w]/>). Useful to treat punctuation as tokens.

Defaults to unset.


=item B<--non-verbal-tokens|-nvt>

Tokenize non-verbal tokens, marked in the primary data with
the Unicode symbol 'Black Vertical Rectangle' (U+25AE).

Defaults to unset.


=item B<--jobs|-j>

Define the number of concurrent jobs in separate forks
for archive processing.
Defaults to C<0> (everything runs in a single process).

If C<sequential-extraction> is not set to true, this will
also apply to extraction.

Pass C<-1> to set the value automatically to 5
times the number of available cores.
This is I<experimental>.

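To get a feeling for the value that C<--jobs -1> would select on a given
machine, the arithmetic described above can be sketched in the shell.
This is only an approximation under stated assumptions: it uses C<getconf>
to count the online cores, and how the tool itself detects the number of
cores is not specified in this document.

```shell
#!/bin/sh
# Number of online cores as reported by getconf; fall back to 1
# if the query is not supported on this system (assumption).
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# The value --jobs -1 is documented to select: 5 times the core count.
jobs=$((cores * 5))

echo "cores=$cores jobs=$jobs"
```

If the resulting number is too high for the machine, an explicit value
can be passed to C<--jobs> instead.
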
=item B<--koral|-k>

Version of the output format. Supported versions are:
C<0> for legacy serialization, C<0.03> for serialization
with metadata fields as key-values on the root object,
C<0.4> for serialization with metadata fields as a list
of C<"@type":"koral:field"> objects.

Currently defaults to C<0.03>.


=item B<--sequential-extraction|-se>

Flag to indicate that archives are extracted sequentially,
so the C<jobs> value does not apply to extraction.
Some systems may have problems with extracting multiple archives
to the same folder at the same time.
Can be flagged using C<--no-sequential-extraction> as well.
Defaults to C<false>.


=item B<--meta|-m>

Define the metadata parser to use. Defaults to C<I5>.
Metadata parsers can be defined in the C<KorAP::XML::Meta> namespace.
This is I<experimental>.


=item B<--pretty|-y>

Pretty print JSON output. Defaults to C<false>.
This is I<deprecated>.


=item B<--gzip|-z>

Compress the output.
Expects a defined C<output> file in single processing.


=item B<--cache|-c>

File to mmap a cache (using L<Cache::FastMmap>).
Defaults to C<korapxml2krill.cache> in the calling directory.


=item B<--cache-size|-cs>

Size of the cache. Defaults to C<50m>.


=item B<--cache-init|-ci>

Initialize cache file.
Can be flagged using C<--no-cache-init> as well.
Defaults to C<true>.


=item B<--cache-delete|-cd>

Delete cache file after processing.
Can be flagged using C<--no-cache-delete> as well.
Defaults to C<true>.


=item B<--config|-cfg>

Configure the parameters of your call in a file
of key-value pairs separated by whitespace:

  overwrite 1
  token     DeReKo#Structure
  ...

Supported parameters are:
C<overwrite>, C<gzip>, C<jobs>, C<input-base>,
C<token>, C<log>, C<cache>, C<cache-size>, C<cache-delete>, C<meta>,
C<output>, C<koral>,
C<temporary-extract>, C<sequential-extraction>,
C<base-sentences>, C<base-paragraphs>,
C<base-pagebreaks>,
C<skip> (semicolon separated), C<sigle>
(semicolon separated), C<anno> (semicolon separated).

Configuration parameters will always be overwritten by
passed parameters.


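The following shell sketch shows one way such a configuration file could be
written and then combined with a command line call. The file location, the
archive name C<corpus.zip>, the output folder C<out/> and all values are
assumptions chosen for illustration; only the keys and the semicolon
separator for C<skip> are taken from the list above.

```shell
#!/bin/sh
# Write a configuration file of whitespace-separated key-value pairs.
# The file name and all values are illustrative assumptions.
cfg="${TMPDIR:-/tmp}/korapxml2krill.conf"
cat > "$cfg" <<'EOF'
overwrite 1
gzip 1
jobs 4
token OpenNLP#tokens
skip Mate#Morpho;XIP
EOF

# Run the conversion only if the tool is installed. A parameter passed
# on the command line (here --jobs) overrides the value from the file.
if command -v korapxml2krill >/dev/null 2>&1; then
  korapxml2krill archive --config "$cfg" --jobs 2 \
    --input corpus.zip --output out/
else
  echo "korapxml2krill not found; wrote $cfg only"
fi
```

Keeping long-lived settings in the file and passing only the per-run
differences on the command line keeps repeated calls short.
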
=item B<--temporary-extract|-te>

Only valid for the C<archive> command.

This will first extract all files into a
directory and then convert them from there.
If the directory is given as C<:temp:>,
a temporary directory is used.
This is especially useful to avoid
massive unzipping and potential
network latency.


=item B<--to-tar>

Only valid for the C<archive> command.

Writes the output into a tar archive.


=item B<--sigle|-sg>

Extract the given texts.
Can be set multiple times.
I<Currently only supported on C<extract>.>
Sigles have the structure C<Corpus>/C<Document>/C<Text>.
In case the C<Text> path is omitted, the whole document will be extracted.
On the document level, the postfix wildcard C<*> is supported.

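As a sketch of how the three sigle forms described above might be combined in
one C<extract> call, the following shell snippet assembles the argument list
and prints the resulting command instead of running it. The corpus, document
and text names, the archive C<corpus.zip> and the folder C<extracted/> are
hypothetical, and reading the wildcard as a document name prefix is an
assumption based on the description above.

```shell
#!/bin/sh
# Build the argument list for extracting one text, one whole document,
# and all documents whose name starts with a given prefix.
# Corpus, document and text names are hypothetical.
set -- extract --input corpus.zip --output extracted/
for sigle in 'WPD/AAA/00001' 'WPD/BBB' 'WPD/CC*'; do
  set -- "$@" --sigle "$sigle"
done

# Print the command instead of running it. When typing it by hand,
# keep the wildcard quoted so the shell does not expand it.
echo korapxml2krill "$@"
```

Quoting the wildcard sigle matters because an unquoted C<*> could be expanded
by the shell before the tool sees it.
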
=item B<--log|-l>

The L<Log::Log4perl> log level, defaults to C<ERROR>.


=item B<--help|-h>

Print this document.


=item B<--version|-v>

Print version information.

=back


=head1 ANNOTATION SUPPORT

L<KorAP::XML::Krill> has built-in importers for some annotation foundries and layers
developed in the KorAP project that are part of the KorAP preprocessing pipeline.
The base foundry with paragraphs, sentences, and the text element is mandatory for
L<Krill|https://github.com/KorAP/Krill>.

  Base
    #Paragraphs
    #Sentences

  Connexor
    #Morpho
    #Phrase
    #Sentences
    #Syntax

  CoreNLP
    #Constituency
    #Morpho
    #NamedEntities
    #Sentences

  CMC
    #Morpho

  DeReKo
    #Structure

  DGD
    #Morpho
    #Structure

  DRuKoLa
    #Morpho

  Glemm
    #Morpho

  HNC
    #Morpho

  LWC
    #Dependency

  Malt
    #Dependency

  MarMoT
    #Morpho

  Mate
    #Dependency
    #Morpho

  MDParser
    #Dependency

  OpenNLP
    #Morpho
    #Sentences

  Sgbr
    #Lemma
    #Morpho

  Talismane
    #Dependency
    #Morpho

  TreeTagger
    #Morpho
    #Sentences

  XIP
    #Constituency
    #Morpho
    #Sentences


More importers are in preparation.
New annotation importers can be defined in the C<KorAP::XML::Annotation> namespace.
See the built-in annotation importers as examples.


=head1 AVAILABILITY

  https://github.com/KorAP/KorAP-XML-Krill


=head1 COPYRIGHT AND LICENSE

Copyright (C) 2015-2019, L<IDS Mannheim|http://www.ids-mannheim.de/>

Author: L<Nils Diewald|http://nils-diewald.de/>

Contributor: Eliza Margaretha

L<KorAP::XML::Krill> is developed as part of the L<KorAP|http://korap.ids-mannheim.de/>
Corpus Analysis Platform at the
L<Leibniz Institute for the German Language (IDS)|http://ids-mannheim.de/>,
member of the
L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/>.

This program is free software published under the
L<BSD-2 License|https://raw.githubusercontent.com/KorAP/KorAP-XML-Krill/master/LICENSE>.

=cut