=pod

=encoding utf8

=head1 NAME

korapxml2krill - Merge KorAP-XML data and create Krill documents


=head1 SYNOPSIS

 korapxml2krill [archive|extract|serial] --input <directory|archive> [options]


=head1 DESCRIPTION

L<KorAP::XML::Krill> is a library to convert KorAP-XML documents to files
compatible with the L<Krill|https://github.com/KorAP/Krill> indexer.
The C<korapxml2krill> command line tool is a simple wrapper around the library.


=head1 INSTALLATION

The preferred way to install L<KorAP::XML::Krill> is to use L<cpanm|App::cpanminus>.

 $ cpanm https://github.com/KorAP/KorAP-XML-Krill.git

If everything went well, the C<korapxml2krill> tool will
be available on your command line immediately.
The minimum requirement for L<KorAP::XML::Krill> is Perl 5.14.
In addition, to work with zip archives, the C<unzip> tool needs to be present.

=head1 ARGUMENTS

 $ korapxml2krill -z --input <directory> --output <filename>

Without arguments, C<korapxml2krill> converts a directory of a single KorAP-XML document.
It expects the input to point to the text level folder.

=over 2

=item B<archive>

 $ korapxml2krill archive -z --input <directory|archive> --output <directory>

Converts an archive of KorAP-XML documents. It expects a directory
(pointing to the corpus level folder) or one or more zip files as input.

=item B<extract>

 $ korapxml2krill extract --input <archive> --output <directory> --sigle <SIGLE>

Extracts KorAP-XML documents from a zip file.

=item B<serial>

 $ korapxml2krill serial -i <archive1> -i <archive2> -o <directory> -cfg <config-file>

Converts archives sequentially. The inputs are not merged but treated
as they are (so they may be premerged or globs).
The C<--output> directory is treated as the base directory where subdirectories
are created based on the archive name.


=back


=head1 OPTIONS

=over 2

=item B<--input|-i> <directory|zip file>

Directory or zip file(s) of documents to convert.

Without arguments, C<korapxml2krill> expects a folder of a single KorAP-XML
document, while C<archive> expects a KorAP-XML corpus folder or a zip
file to batch process multiple files.
C<extract> expects zip files only.

C<archive> supports multiple input zip files with the constraint
that the first archive listed contains all primary data files
and all meta data files.

 -i file/news.zip -i file/news.malt.zip -i "#file/news.tt.zip"

Input may also be defined using BSD glob wildcards.

 -i 'file/news*.zip'

The expanded input array will be sorted by length, so the shortest
path needs to contain all primary data files and all meta data files.
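
Which archive ends up first can be previewed before the call. The following is
a minimal shell sketch, not the tool's own implementation: the file names are
hypothetical, path length is counted in characters, and how C<korapxml2krill>
orders paths of equal length is not covered.

```shell
# Hypothetical archive names, as a glob such as 'file/news*.zip' might expand.
set -- file/news.malt.zip file/news.tt.zip file/news.zip

# Prefix each path with its length, sort numerically, then drop the prefix.
sorted=$(for f in "$@"; do printf '%d %s\n' "${#f}" "$f"; done | sort -n | cut -d' ' -f2-)

# The first line is the archive that has to hold primary data and meta data.
printf '%s\n' "$sorted"
```

With these names, C<file/news.zip> is listed first, so it is the archive that
needs to contain the primary data and meta data files.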

(The directory structure follows the base directory format,
which may include a C<.> root folder.
In this case, further archives lacking a C<.> root folder
need to be passed with a hash sign in front of the archive's name.
This may require quoting the parameter.)

To support zip files, a version of C<unzip> needs to be installed that is
compatible with the archive file.

B<The root folder switch using the hash sign is experimental and
may vanish in future versions.>

=item B<--input-base|-ib> <directory>

The base directory for inputs.


=item B<--output|-o> <directory|file>

Output folder for archive processing, or
document name for single output (optional).
Writes to C<STDOUT> by default
(in case C<output> is not mandatory due to further options).

=item B<--overwrite|-w>

Overwrite files that already exist.

=item B<--token|-t> <foundry>#<file>

Define the default tokenization by specifying
the name of the foundry and optionally the name
of the layer-file. Defaults to C<OpenNLP#tokens>.


=item B<--base-sentences|-bs> <foundry>#<layer>

Define the layer for base sentences.
If given, this will be used instead of using C<Base#Sentences>.
Currently C<DeReKo#Structure> is the only additional layer supported.

 Defaults to unset.


=item B<--base-paragraphs|-bp> <foundry>#<layer>

Define the layer for base paragraphs.
If given, this will be used instead of using C<Base#Paragraphs>.
Currently C<DeReKo#Structure> is the only additional layer supported.

 Defaults to unset.


=item B<--base-pagebreaks|-bpb> <foundry>#<layer>

Define the layer for base pagebreaks.
Currently C<DeReKo#Structure> is the only layer supported.

 Defaults to unset.


=item B<--skip|-s> <foundry>[#<layer>]

Skip specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>. Alternatively you can skip C<#ALL>.
Can be set multiple times.

=item B<--anno|-a> <foundry>#<layer>

Convert specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>.
Can be set multiple times.
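
The foundry and layer parts of such a value can be told apart at the first
C<#>. The following shell sketch illustrates the notation only; C<split_spec>
is a hypothetical helper and not part of C<korapxml2krill>.

```shell
# Hypothetical helper: split "<foundry>[#<layer>]" at the first hash sign.
# Sets the variables $foundry and $layer ($layer is empty if no layer is given).
split_spec() {
  spec=$1
  foundry=${spec%%"#"*}
  case $spec in
    *"#"*) layer=${spec#*"#"} ;;
    *)     layer= ;;
  esac
}

split_spec 'Mate#Morpho'
printf 'foundry=%s layer=%s\n' "$foundry" "$layer"   # prints: foundry=Mate layer=Morpho
```

For a value without a C<#>, such as C<Mate>, the layer part stays empty,
which corresponds to addressing the whole foundry.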

=item B<--primary|-p>

Output primary data or not. Defaults to C<true>.
Can be flagged using C<--no-primary> as well.
This is I<deprecated>.

=item B<--jobs|-j>

Define the number of concurrent jobs in separate forks
for archive processing.
Defaults to C<0> (everything runs in a single process).
Pass C<-1>, and the value will be set automatically to 5
times the number of available cores.
This is I<experimental>.
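
To get a rough idea of what C<-1> amounts to on a given machine, the
calculation can be reproduced in the shell. This is a sketch under the
assumption that the core count reported by C<getconf _NPROCESSORS_ONLN>
matches what the tool detects, which may not hold on every system.

```shell
# Number of online cores; fall back to 1 if getconf cannot report it.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# The documented rule for --jobs -1: five times the number of available cores.
job_count=$((cores * 5))

echo "--jobs -1 would correspond to about --jobs $job_count here"
```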

=item B<--meta|-m>

Define the metadata parser to use. Defaults to C<I5>.
Metadata parsers can be defined in the C<KorAP::XML::Meta> namespace.
This is I<experimental>.

=item B<--pretty|-y>

Pretty print JSON output. Defaults to C<false>.
This is I<deprecated>.

=item B<--gzip|-z>

Compress the output.
Expects a defined C<output> file in single processing.

=item B<--cache|-c>

File to mmap a cache (using L<Cache::FastMmap>).
Defaults to C<korapxml2krill.cache> in the calling directory.

=item B<--cache-size|-cs>

Size of the cache. Defaults to C<50m>.

=item B<--cache-init|-ci>

Initialize cache file.
Can be flagged using C<--no-cache-init> as well.
Defaults to C<true>.

=item B<--cache-delete|-cd>

Delete cache file after processing.
Can be flagged using C<--no-cache-delete> as well.
Defaults to C<true>.

=item B<--config|-cfg>

Configure the parameters of your call in a file
of key-value pairs separated by whitespace:

 overwrite 1
 token DeReKo#Structure
 ...

Supported parameters are:
C<overwrite>, C<gzip>, C<jobs>, C<input-base>,
C<token>, C<log>, C<cache>, C<cache-size>, C<cache-delete>, C<meta>,
C<output>, C<base-sentences>, C<temp-extract>, C<base-paragraphs>,
C<base-pagebreaks>, C<skip> (semicolon separated), C<sigle>
(semicolon separated), C<anno> (semicolon separated).
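
A slightly fuller configuration file might look as follows. All values are
illustrative assumptions, including the use of C<1> to switch on C<gzip>
(by analogy with C<overwrite 1> above) and the exact spelling of the
semicolon separated C<skip> list.

```
overwrite 1
gzip 1
jobs -1
log ERROR
cache-size 50m
base-sentences DeReKo#Structure
skip Mate;XIP#Morpho
```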

=item B<--temporary-extract|-te>

Only valid for the C<archive> command.

This will first extract all files into a
directory and then run C<archive> on it.
If the directory is given as C<:temp:>,
a temporary directory is used.
This is especially useful to avoid
massive unzipping and potential
network latency.

=item B<--sigle|-sg>

Extract the given texts.
Can be set multiple times.
I<Currently only supported on C<extract>.>
Sigles have the structure C<Corpus>/C<Document>/C<Text>.
In case the C<Text> path is omitted, the whole document will be extracted.
On the document level, the postfix wildcard C<*> is supported.

=item B<--log|-l>

The L<Log4perl> log level, defaults to C<ERROR>.

=item B<--help|-h>

Print this document.

=item B<--version|-v>

Print version information.

=back

=head1 ANNOTATION SUPPORT

L<KorAP::XML::Krill> has built-in importers for some annotation foundries and layers
developed in the KorAP project that are part of the KorAP preprocessing pipeline.
The base foundry with paragraphs, sentences, and the text element is mandatory for
L<Krill|https://github.com/KorAP/Krill>.

  Base
    #Paragraphs
    #Sentences

  Connexor
    #Morpho
    #Phrase
    #Sentences
    #Syntax

  CoreNLP
    #Constituency
    #Morpho
    #NamedEntities
    #Sentences

  DeReKo
    #Structure

  DRuKoLa
    #Morpho

  Glemm
    #Morpho

  Malt
    #Dependency

  MarMoT
    #Morpho

  Mate
    #Dependency
    #Morpho

  MDParser
    #Dependency

  OpenNLP
    #Morpho
    #Sentences

  Sgbr
    #Lemma
    #Morpho

  TreeTagger
    #Morpho
    #Sentences

  XIP
    #Constituency
    #Morpho
    #Sentences


More importers are in preparation.
New annotation importers can be defined in the C<KorAP::XML::Annotation> namespace.
See the built-in annotation importers as examples.

=head1 AVAILABILITY

 https://github.com/KorAP/KorAP-XML-Krill


=head1 COPYRIGHT AND LICENSE

Copyright (C) 2015-2017, L<IDS Mannheim|http://www.ids-mannheim.de/>

Author: L<Nils Diewald|http://nils-diewald.de/>

Contributor: Eliza Margaretha

L<KorAP::XML::Krill> is developed as part of the L<KorAP|http://korap.ids-mannheim.de/>
Corpus Analysis Platform at the
L<Institute for the German Language (IDS)|http://ids-mannheim.de/>,
member of the
L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/en/about-us/leibniz-competition/projekte-2011/2011-funding-line-2/>.

This program is free software published under the
L<BSD-2 License|https://raw.githubusercontent.com/KorAP/KorAP-XML-Krill/master/LICENSE>.

=cut