=pod

=encoding utf8

=head1 NAME

korapxml2krill - Merge KorAP-XML data and create Krill documents


=head1 SYNOPSIS

 korapxml2krill [archive|extract] --input <directory|archive> [options]


=head1 DESCRIPTION

L<KorAP::XML::Krill> is a library to convert KorAP-XML documents to files
compatible with the L<Krill|https://github.com/KorAP/Krill> indexer.
The C<korapxml2krill> command line tool is a simple wrapper around the library.


=head1 INSTALLATION

The preferred way to install L<KorAP::XML::Krill> is to use L<cpanm|App::cpanminus>.

 $ cpanm https://github.com/KorAP/KorAP-XML-Krill.git

If everything went well, the C<korapxml2krill> tool will
be available on your command line immediately.
The minimum requirement for L<KorAP::XML::Krill> is Perl 5.16.
In addition, the C<unzip> tool needs to be present to work with zip archives.

=head1 ARGUMENTS

 $ korapxml2krill -z --input <directory> --output <filename>

Without arguments, C<korapxml2krill> converts a directory containing a single KorAP-XML document.
It expects the input to point to the text level folder.

=over 2

=item B<archive>

 $ korapxml2krill archive -z --input <directory|archive> --output <directory|tar>

Converts an archive of KorAP-XML documents. It expects a directory
(pointing to the corpus level folder) or one or more zip files as input.

=item B<extract>

 $ korapxml2krill extract --input <archive> --output <directory> --sigle <SIGLE>

Extracts KorAP-XML documents from a zip file.

=item B<serial>

 $ korapxml2krill serial -i <archive1> -i <archive2> -o <directory> -cfg <config-file>

Converts archives sequentially. The inputs are not merged but treated
as they are (so they may be premerged or globs).
The C<--output> directory is treated as the base directory where subdirectories
are created based on the archive name. In case the C<--to-tar> flag is given,
the output will be a tar file.


=back


=head1 OPTIONS

=over 2

=item B<--input|-i> <directory|zip file>

Directory or zip file(s) of documents to convert.

Without arguments, C<korapxml2krill> expects a folder of a single KorAP-XML
document, while C<archive> expects a KorAP-XML corpus folder or a zip
file to batch process multiple files.
C<extract> expects zip files only.

C<archive> supports multiple input zip files with the constraint
that the first archive listed contains all primary data files
and all metadata files.

 -i file/news.zip -i file/news.malt.zip -i "#file/news.tt.zip"

Input may also be defined using BSD glob wildcards.

 -i 'file/news*.zip'

The expanded input array will be sorted by path length, so the shortest
path needs to contain all primary data files and all metadata files.
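
The length-based ordering can be previewed before running the tool.
The following is a minimal shell sketch, not the tool's own implementation:
the helper name C<sort_by_length> is hypothetical, and how C<korapxml2krill>
orders paths of equal length is not stated here.

```shell
# Hypothetical helper: order paths read from stdin by length, shortest first.
sort_by_length() {
  awk '{ print length($0) "\t" $0 }' | sort -n | cut -f2-
}

# Example paths taken from the option description above.
printf '%s\n' file/news.malt.zip file/news.tt.zip file/news.zip | sort_by_length
```

The first line of the output names the archive that would need to contain
the primary data and metadata files.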

(The directory structure follows the base directory format,
which may include a C<.> root folder.
In this case, further archives lacking a C<.> root folder
need to be passed with a hash sign in front of the archive's name.
This may require quoting the parameter.)

To support zip files, a version of C<unzip> needs to be installed that is
compatible with the archive file.

B<The root folder switch using the hash sign is experimental and
may vanish in future versions.>


=item B<--input-base|-ib> <directory>

The base directory for inputs.


=item B<--output|-o> <directory|file>

Output folder for archive processing or
document name for single output (optional).
Writes to C<STDOUT> by default
(in case C<output> is not mandatory due to further options).

=item B<--overwrite|-w>

Overwrite files that already exist.


=item B<--token|-t> <foundry>#<file>

Define the default tokenization by specifying
the name of the foundry and optionally the name
of the layer-file. Defaults to C<OpenNLP#tokens>.


=item B<--base-sentences|-bs> <foundry>#<layer>

Define the layer for base sentences.
If given, this will be used instead of C<Base#Sentences>.
Currently C<DeReKo#Structure> and C<DGD#Structure> are the only additional
layers supported.

 Defaults to unset.


=item B<--base-paragraphs|-bp> <foundry>#<layer>

Define the layer for base paragraphs.
If given, this will be used instead of C<Base#Paragraphs>.
Currently C<DeReKo#Structure> is the only additional layer supported.

 Defaults to unset.


=item B<--base-pagebreaks|-bpb> <foundry>#<layer>

Define the layer for base pagebreaks.
Currently C<DeReKo#Structure> is the only layer supported.

 Defaults to unset.


=item B<--skip|-s> <foundry>[#<layer>]

Skip specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>. Alternatively you can skip C<#ALL>.
Can be set multiple times.


=item B<--anno|-a> <foundry>#<layer>

Convert specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>.
Can be set multiple times.


=item B<--primary|-p>

Output primary data or not. Defaults to C<true>.
Can be flagged using C<--no-primary> as well.
This is I<deprecated>.


=item B<--non-word-tokens|-nwt>

Tokenize non-word tokens like word tokens (word tokens being defined as
matching C</[\d\w]/>). Useful for treating punctuation as tokens.

 Defaults to unset.

=item B<--jobs|-j>

Define the number of concurrent jobs in separate forks
for archive processing.
Defaults to C<0> (everything runs in a single process).

If C<sequential-extraction> is not set to false, this will
also apply to extraction.

Pass C<-1>, and the value will be set automatically to 5
times the number of available cores.
This is I<experimental>.
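
The automatic setting can be illustrated with a small shell sketch.
The function name C<resolve_jobs> is hypothetical and this is not the tool's code;
it only mirrors the arithmetic described above, with the core count passed in explicitly.

```shell
# Hypothetical helper mirroring the documented rule:
# -1 means 5 times the number of cores, any other value is used as given.
resolve_jobs() {
  requested=$1
  cores=$2
  if [ "$requested" -eq -1 ]; then
    echo $((cores * 5))
  else
    echo "$requested"
  fi
}

resolve_jobs -1 4   # automatic value for an assumed machine with 4 cores
resolve_jobs 0 4    # the default: everything runs in a single process
```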


=item B<--koral|-k>

Version of the output format. Supported versions are:
C<0> for legacy serialization, C<0.03> for serialization
with metadata fields as key-values on the root object,
C<0.4> for serialization with metadata fields as a list
of C<"@type":"koral:field"> objects.

Currently defaults to C<0.03>.


=item B<--sequential-extraction|-se>

Flag to indicate whether the C<jobs> value also applies to extraction.
Some systems may have problems with extracting multiple archives
to the same folder at the same time.
Can be flagged using C<--no-sequential-extraction> as well.
Defaults to C<false>.


=item B<--meta|-m>

Define the metadata parser to use. Defaults to C<I5>.
Metadata parsers can be defined in the C<KorAP::XML::Meta> namespace.
This is I<experimental>.


=item B<--pretty|-y>

Pretty print JSON output. Defaults to C<false>.
This is I<deprecated>.


=item B<--gzip|-z>

Compress the output.
Expects a defined C<output> file in single processing.


=item B<--cache|-c>

File to mmap a cache (using L<Cache::FastMmap>).
Defaults to C<korapxml2krill.cache> in the calling directory.


=item B<--cache-size|-cs>

Size of the cache. Defaults to C<50m>.


=item B<--cache-init|-ci>

Initialize cache file.
Can be flagged using C<--no-cache-init> as well.
Defaults to C<true>.


=item B<--cache-delete|-cd>

Delete cache file after processing.
Can be flagged using C<--no-cache-delete> as well.
Defaults to C<true>.


=item B<--config|-cfg>

Configure the parameters of your call in a file
of key-value pairs separated by whitespace:

 overwrite 1
 token     DeReKo#Structure
 ...

Supported parameters are:
C<overwrite>, C<gzip>, C<jobs>, C<input-base>,
C<token>, C<log>, C<cache>, C<cache-size>, C<cache-delete>, C<meta>,
C<output>, C<koral>,
C<temporary-extract>, C<sequential-extraction>,
C<base-sentences>, C<base-paragraphs>,
C<base-pagebreaks>,
C<skip> (semicolon separated), C<sigle>
(semicolon separated), C<anno> (semicolon separated).

Configuration parameters will always be overwritten by
passed parameters.
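
As a minimal sketch, a configuration file could be written and passed as follows.
The temporary file, the input and output paths, and the chosen values are
assumptions for illustration only; the call is printed rather than executed.

```shell
# Write a whitespace-separated key-value configuration (values are examples).
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
overwrite 1
gzip      1
jobs      4
token     DeReKo#Structure
skip      Mate#Morpho;XIP
EOF

# Print the call that would use it; corpus.zip and out/ are placeholders.
echo "korapxml2krill archive -cfg $cfg -i corpus.zip -o out/"
```

Any option passed on the command line in such a call takes precedence over
the corresponding value in the file.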


=item B<--temporary-extract|-te>

Only valid for the C<archive> command.

This first extracts all files into a
directory and then processes the archive from there.
If the directory is given as C<:temp:>,
a temporary directory is used.
This is especially useful to avoid
massive unzipping and potential
network latency.


=item B<--to-tar>

Only valid for the C<archive> command.

Writes the output into a tar archive.


=item B<--sigle|-sg>

Extract the given texts.
Can be set multiple times.
I<Currently only supported on C<extract>.>
Sigles have the structure C<Corpus>/C<Document>/C<Text>.
In case the C<Text> path is omitted, the whole document will be extracted.
On the document level, the postfix wildcard C<*> is supported.
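
To make the sigle structure concrete, the following shell sketch splits a sigle
into its parts. The function name C<split_sigle> and the sigle C<CORP/DOC/00001>
are made up for illustration and do not come from the tool.

```shell
# Hypothetical helper: split a sigle of the form Corpus/Document/Text.
# The text part stays empty when only Corpus/Document is given.
split_sigle() {
  IFS=/ read -r corpus document text <<EOF
$1
EOF
  echo "corpus=$corpus document=$document text=$text"
}

split_sigle 'CORP/DOC/00001'   # a single text
split_sigle 'CORP/DOC'         # a whole document
```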


=item B<--log|-l>

The L<Log4perl> log level, defaults to C<ERROR>.


=item B<--help|-h>

Print this document.


=item B<--version|-v>

Print version information.

=back


=head1 ANNOTATION SUPPORT

L<KorAP::XML::Krill> has built-in importers for some annotation foundries and layers
developed in the KorAP project that are part of the KorAP preprocessing pipeline.
The base foundry with paragraphs, sentences, and the text element is mandatory for
L<Krill|https://github.com/KorAP/Krill>.

  Base
    #Paragraphs
    #Sentences

  Connexor
    #Morpho
    #Phrase
    #Sentences
    #Syntax

  CoreNLP
    #Constituency
    #Morpho
    #NamedEntities
    #Sentences

  CMC
    #Morpho

  DeReKo
    #Structure

  DGD
    #Morpho
    #Structure

  DRuKoLa
    #Morpho

  Glemm
    #Morpho

  HNC
    #Morpho

  LWC
    #Dependency

  Malt
    #Dependency

  MarMoT
    #Morpho

  Mate
    #Dependency
    #Morpho

  MDParser
    #Dependency

  OpenNLP
    #Morpho
    #Sentences

  Sgbr
    #Lemma
    #Morpho

  Talismane
    #Dependency
    #Morpho

  TreeTagger
    #Morpho
    #Sentences

  XIP
    #Constituency
    #Morpho
    #Sentences


More importers are in preparation.
New annotation importers can be defined in the C<KorAP::XML::Annotation> namespace.
See the built-in annotation importers as examples.


=head1 AVAILABILITY

 https://github.com/KorAP/KorAP-XML-Krill


=head1 COPYRIGHT AND LICENSE

Copyright (C) 2015-2019, L<IDS Mannheim|http://www.ids-mannheim.de/>

Author: L<Nils Diewald|http://nils-diewald.de/>

Contributor: Eliza Margaretha

L<KorAP::XML::Krill> is developed as part of the L<KorAP|http://korap.ids-mannheim.de/>
Corpus Analysis Platform at the
L<Leibniz Institute for the German Language (IDS)|http://ids-mannheim.de/>,
member of the
L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/en/about-us/leibniz-competition/projekte-2011/2011-funding-line-2/>.

This program is free software published under the
L<BSD-2 License|https://raw.githubusercontent.com/KorAP/KorAP-XML-Krill/master/LICENSE>.

=cut