=pod

=encoding utf8

=head1 NAME

korapxml2krill - Merge KorapXML data and create Krill documents

=head1 SYNOPSIS

 korapxml2krill [archive|extract] --input <directory|archive> [options]


=head1 DESCRIPTION

L<KorAP::XML::Krill> is a library to convert KorAP-XML documents to files
compatible with the L<Krill|https://github.com/KorAP/Krill> indexer.
The C<korapxml2krill> command line tool is a simple wrapper around the library.


=head1 INSTALLATION

The preferred way to install L<KorAP::XML::Krill> is to use L<cpanm|App::cpanminus>.

 $ cpanm https://github.com/KorAP/KorAP-XML-Krill.git

If everything went well, the C<korapxml2krill> tool will
be available on your command line immediately.
The minimum requirement for L<KorAP::XML::Krill> is Perl 5.16.
In addition, the C<unzip> tool needs to be installed to work with zip archives.
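
The two prerequisites named above can be checked before installing.
The following is a minimal shell sketch and not part of the distribution;
the function name C<check_prereqs> is made up for illustration.

```shell
#!/bin/sh
# Check the two prerequisites named above: Perl >= 5.16 and the unzip tool.
# Prints one status line per prerequisite; it does not install anything.
check_prereqs() {
  rc=0
  # "use 5.016" makes perl itself fail if the interpreter is older than 5.16.
  if command -v perl >/dev/null 2>&1 && perl -e 'use 5.016;' 2>/dev/null; then
    echo "perl: ok"
  else
    echo "perl: missing or older than 5.16"
    rc=1
  fi
  if command -v unzip >/dev/null 2>&1; then
    echo "unzip: ok"
  else
    echo "unzip: missing (needed for zip archives)"
    rc=1
  fi
  return "$rc"
}

check_prereqs || echo "Some prerequisites are missing."
```

The check only tests that C<unzip> is present. Whether the installed version
is compatible with a given archive still has to be verified against that archive.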

=head1 ARGUMENTS

 $ korapxml2krill -z --input <directory> --output <filename>

Without arguments, C<korapxml2krill> converts a directory of a single KorAP-XML document.
It expects the input to point to the text level folder.

=over 2

=item B<archive>

 $ korapxml2krill archive -z --input <directory|archive> --output <directory|tar>

Converts an archive of KorAP-XML documents. It expects a directory
(pointing to the corpus level folder) or one or more zip files as input.

=item B<extract>

 $ korapxml2krill extract --input <archive> --output <directory> --sigle <SIGLE>

Extracts KorAP-XML documents from a zip file.

=item B<serial>

 $ korapxml2krill serial -i <archive1> -i <archive2> -o <directory> -cfg <config-file>

Converts archives sequentially. The inputs are not merged but treated
as they are (so they may be premerged or globs).
The C<--output> directory is treated as the base directory where subdirectories
are created based on the archive name. In case the C<--to-tar> flag is given,
the output will be a tar file.


=back

=head1 OPTIONS

=over 2

=item B<--input|-i> <directory|zip file>

Directory or zip file(s) of documents to convert.

Without arguments, C<korapxml2krill> expects a folder of a single KorAP-XML
document, while C<archive> expects a KorAP-XML corpus folder or a zip
file to batch process multiple files.
C<extract> expects zip files only.

C<archive> supports multiple input zip files with the constraint
that the first archive listed contains all primary data files
and all metadata files.

 -i file/news.zip -i file/news.malt.zip -i "#file/news.tt.zip"

Input may also be defined using BSD glob wildcards.

 -i 'file/news*.zip'

The expanded input array will be sorted by length, so the shortest
path needs to contain all primary data files and all metadata files.

(The directory structure follows the base directory format,
which may include a C<.> root folder.
In this case, further archives lacking a C<.> root folder
need to be passed with a hash sign in front of the archive's name.
This may require quoting the parameter.)

To support zip files, a version of C<unzip> needs to be installed that is
compatible with the archive file.

B<The root folder switch using the hash sign is experimental and
may vanish in future versions.>

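The length ordering of glob-expanded inputs described above can be approximated
in the shell. This sketch only mimics the ordering, using the file names from
the C<-i> example; how the tool itself breaks ties between equally long paths
is not stated here, and C<sort_by_length> is a hypothetical helper.

```shell
#!/bin/sh
# Illustrate "sorted by length": shorter paths come first.
# The file names are taken from the -i example above (without the hash sign).
sort_by_length() {
  # Prefix each path with its length, sort numerically, then drop the prefix.
  awk '{ print length($0), $0 }' | sort -n | cut -d' ' -f2-
}

sorted=$(printf '%s\n' file/news.malt.zip file/news.tt.zip file/news.zip | sort_by_length)
first=$(printf '%s\n' "$sorted" | head -n 1)

printf '%s\n' "$sorted"
echo "Must contain primary data and metadata: $first"
```

With these names, C<file/news.zip> is the shortest path, so it is the archive
that has to carry all primary data and metadata files.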

=item B<--input-base|-ib> <directory>

The base directory for inputs.

=item B<--output|-o> <directory|file>

Output folder for archive processing or
document name for single output (optional).
Writes to C<STDOUT> by default
(unless C<output> is mandatory due to further options).

=item B<--overwrite|-w>

Overwrite files that already exist.


=item B<--token|-t> <foundry>#<file>

Define the default tokenization by specifying
the name of the foundry and optionally the name
of the layer-file. Defaults to C<OpenNLP#tokens>.

=item B<--base-sentences|-bs> <foundry>#<layer>

Define the layer for base sentences.
If given, this will be used instead of C<Base#Sentences>.
Currently C<DeReKo#Structure> is the only additional layer supported.

Defaults to unset.

=item B<--base-paragraphs|-bp> <foundry>#<layer>

Define the layer for base paragraphs.
If given, this will be used instead of C<Base#Paragraphs>.
Currently C<DeReKo#Structure> is the only additional layer supported.

Defaults to unset.

=item B<--base-pagebreaks|-bpb> <foundry>#<layer>

Define the layer for base pagebreaks.
Currently C<DeReKo#Structure> is the only layer supported.

Defaults to unset.


=item B<--skip|-s> <foundry>[#<layer>]

Skip specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>. Alternatively you can skip C<#ALL>.
Can be set multiple times.

=item B<--anno|-a> <foundry>#<layer>

Convert specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>.
Can be set multiple times.

=item B<--primary|-p>

Output primary data or not. Defaults to C<true>.
Can be flagged using C<--no-primary> as well.
This is I<deprecated>.

=item B<--non-word-tokens|-nwt>

Tokenize non-word tokens like word tokens (defined as matching
C</[\d\w]/>). Useful to treat punctuation marks as tokens.

Defaults to unset.

=item B<--jobs|-j>

Define the number of concurrent jobs in separate forks
for archive processing.
Defaults to C<0> (everything runs in a single process).

If C<sequential-extraction> is not set to true, this will
also apply to extraction.

Pass C<-1>, and the value will be set automatically to 5
times the number of available cores.
This is I<experimental>.

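To see which value C<--jobs -1> would roughly amount to on a given machine,
the calculation can be reproduced in the shell. This is a sketch under the
assumption that cores are counted with C<getconf> or C<nproc>; the tool itself
may count them differently.

```shell
#!/bin/sh
# Approximate the value that "--jobs -1" is described to select:
# 5 times the number of available cores.
# Assumption: cores are counted with getconf or nproc; the tool itself
# may count them differently.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || nproc 2>/dev/null || echo 1)

# Fall back to 1 if the detection printed anything but a positive number.
case "$cores" in
  ''|*[!0-9]*|0) cores=1 ;;
esac

job_count=$((cores * 5))
echo "cores=$cores job_count=$job_count"
```

On a machine with four cores this yields 20, which could also be passed
explicitly as C<--jobs 20> instead of relying on C<-1>.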

=item B<--koral|-k>

Version of the output format. Supported versions are:
C<0> for legacy serialization, C<0.03> for serialization
with metadata fields as key-values on the root object,
C<0.4> for serialization with metadata fields as a list
of C<"@type":"koral:field"> objects.

Currently defaults to C<0.03>.

=item B<--sequential-extraction|-se>

Flag to indicate whether the C<jobs> value also applies to extraction.
Some systems may have problems with extracting multiple archives
to the same folder at the same time.
Can be flagged using C<--no-sequential-extraction> as well.
Defaults to C<false>.


=item B<--meta|-m>

Define the metadata parser to use. Defaults to C<I5>.
Metadata parsers can be defined in the C<KorAP::XML::Meta> namespace.
This is I<experimental>.

=item B<--pretty|-y>

Pretty print JSON output. Defaults to C<false>.
This is I<deprecated>.

=item B<--gzip|-z>

Compress the output.
Expects a defined C<output> file in single processing.

=item B<--cache|-c>

File to mmap a cache (using L<Cache::FastMmap>).
Defaults to C<korapxml2krill.cache> in the calling directory.

=item B<--cache-size|-cs>

Size of the cache. Defaults to C<50m>.

=item B<--cache-init|-ci>

Initialize cache file.
Can be flagged using C<--no-cache-init> as well.
Defaults to C<true>.

=item B<--cache-delete|-cd>

Delete cache file after processing.
Can be flagged using C<--no-cache-delete> as well.
Defaults to C<true>.


=item B<--config|-cfg>

Configure the parameters of your call in a file
of key-value pairs separated by whitespace:

 overwrite 1
 token DeReKo#Structure
 ...

Supported parameters are:
C<overwrite>, C<gzip>, C<jobs>, C<input-base>,
C<token>, C<log>, C<cache>, C<cache-size>, C<cache-delete>, C<meta>,
C<output>,
C<temp-extract>, C<sequential-extraction>,
C<base-sentences>, C<base-paragraphs>,
C<base-pagebreaks>,
C<skip> (semicolon separated), C<sigle>
(semicolon separated), C<anno> (semicolon separated).

Configuration parameters will always be overwritten by
parameters passed on the command line.

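A configuration file in this format can be written from the shell.
The following sketch uses only keys from the list above; the file name and
all values are placeholders chosen for illustration, and the final line only
prints a call instead of running the tool.

```shell
#!/bin/sh
# Write a configuration file in the "key value" format described above.
# The file name and all values are placeholders for illustration.
config_file="${TMPDIR:-/tmp}/korapxml2krill.conf"

cat > "$config_file" <<'EOF'
overwrite 1
token DeReKo#Structure
jobs 4
skip Mate#Morpho;XIP
EOF

# Print the call that would use the file. Options given on the command
# line take precedence over the values in the file.
echo "korapxml2krill archive --config $config_file --input <archive> --output <directory>"
```

The C<skip> line shows the semicolon separated form for a parameter that can
be set multiple times on the command line.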

=item B<--temporary-extract|-te>

Only valid for the C<archive> command.

This will first extract all files into a
directory and will then archive.
If the directory is given as C<:temp:>,
a temporary directory is used.
This is especially useful to avoid
massive unzipping and potential
network latency.

=item B<--sigle|-sg>

Extract the given texts.
Can be set multiple times.
I<Currently only supported on C<extract>.>
Sigles have the structure C<Corpus>/C<Document>/C<Text>.
In case the C<Text> path is omitted, the whole document will be extracted.
On the document level, the postfix wildcard C<*> is supported.

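The sigle structure described above can be illustrated with a small shell
helper that tells text level sigles from document level ones. The helper
C<sigle_level> and the sigles used are hypothetical and not part of the tool.

```shell
#!/bin/sh
# Classify a sigle by the number of slash-separated parts, following the
# Corpus/Document/Text structure described above.
# The sigles used below are placeholders, not taken from a real archive.
sigle_level() {
  case "$1" in
    */*/*/*) echo "unknown" ;;   # more than three parts: not covered above
    */*/*)   echo "text" ;;      # Corpus/Document/Text
    */*)     echo "document" ;;  # Corpus/Document (may end in the * wildcard)
    *)       echo "unknown" ;;   # a single part: not covered above
  esac
}

for sigle in 'GOE/AGA/03828' 'GOE/AGA' 'GOE/AG*'; do
  echo "$sigle -> $(sigle_level "$sigle")"
done
```

When passing such sigles to C<extract>, each one gets its own C<--sigle>
option, and a sigle with a wildcard should be quoted so that the shell does
not expand it.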

=item B<--log|-l>

The L<Log4perl> log level, defaults to C<ERROR>.

=item B<--help|-h>

Print this document.

=item B<--version|-v>

Print version information.

=back


=head1 ANNOTATION SUPPORT

L<KorAP::XML::Krill> has built-in importers for some annotation foundries and layers
developed in the KorAP project that are part of the KorAP preprocessing pipeline.
The base foundry with paragraphs, sentences, and the text element is mandatory for
L<Krill|https://github.com/KorAP/Krill>.

  Base
    #Paragraphs
    #Sentences

  Connexor
    #Morpho
    #Phrase
    #Sentences
    #Syntax

  CoreNLP
    #Constituency
    #Morpho
    #NamedEntities
    #Sentences

  CMC
    #Morpho

  DeReKo
    #Structure

  DRuKoLa
    #Morpho

  Glemm
    #Morpho

  HNC
    #Morpho

  LWC
    #Dependency

  Malt
    #Dependency

  MarMoT
    #Morpho

  Mate
    #Dependency
    #Morpho

  MDParser
    #Dependency

  OpenNLP
    #Morpho
    #Sentences

  Sgbr
    #Lemma
    #Morpho

  TreeTagger
    #Morpho
    #Sentences

  XIP
    #Constituency
    #Morpho
    #Sentences


More importers are in preparation.
New annotation importers can be defined in the C<KorAP::XML::Annotation> namespace.
See the built-in annotation importers as examples.

=head1 AVAILABILITY

 https://github.com/KorAP/KorAP-XML-Krill


=head1 COPYRIGHT AND LICENSE

Copyright (C) 2015-2019, L<IDS Mannheim|http://www.ids-mannheim.de/>

Author: L<Nils Diewald|http://nils-diewald.de/>

Contributor: Eliza Margaretha

L<KorAP::XML::Krill> is developed as part of the L<KorAP|http://korap.ids-mannheim.de/>
Corpus Analysis Platform at the
L<Leibniz Institute for the German Language (IDS)|http://ids-mannheim.de/>,
member of the
L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/en/about-us/leibniz-competition/projekte-2011/2011-funding-line-2/>.

This program is free software published under the
L<BSD-2 License|https://raw.githubusercontent.com/KorAP/KorAP-XML-Krill/master/LICENSE>.

=cut