=pod

=encoding utf8

=head1 NAME

korapxml2krill - Merge KorAP-XML data and create Krill documents


=head1 SYNOPSIS

 korapxml2krill [archive|extract] --input <directory|archive> [options]

=head1 DESCRIPTION

L<KorAP::XML::Krill> is a library to convert KorAP-XML documents to files
compatible with the L<Krill|https://github.com/KorAP/Krill> indexer.
The C<korapxml2krill> command line tool is a simple wrapper around the library.


=head1 INSTALLATION

The preferred way to install L<KorAP::XML::Krill> is to use L<cpanm|App::cpanminus>.

 $ cpanm https://github.com/KorAP/KorAP-XML-Krill.git

If the installation succeeds, the C<korapxml2krill> tool will
be available on your command line immediately.
The minimum requirement for L<KorAP::XML::Krill> is Perl 5.14.
In addition, to work with zip archives, the C<unzip> tool needs to be installed.

=head1 ARGUMENTS

 $ korapxml2krill -z --input <directory> --output <filename>

Called without one of the following arguments, C<korapxml2krill> converts
a directory containing a single KorAP-XML document.
It expects the input to point to the text level folder.

=over 2

=item B<archive>

 $ korapxml2krill archive -z --input <directory|archive> --output <directory>

Converts an archive of KorAP-XML documents. Expects a directory
(pointing to the text level folder) or one or more zip files as input.

=item B<extract>

 $ korapxml2krill extract --input <archive> --output <directory> --sigle <SIGLE>

Extracts KorAP-XML documents from a zip file.

=back


=head1 OPTIONS

=over 2

=item B<--input|-i> <directory|zip file>

Directory or zip file(s) of documents to convert.

Without arguments, C<korapxml2krill> expects a folder of a single KorAP-XML
document, while C<archive> and C<extract> support zip files as well.

C<archive> supports multiple input zip files, with the constraint
that the first archive listed contains all primary data files
and all metadata files.

 -i file/news.zip -i file/news.malt.zip -i "#file/news.tt.zip"

(The directory structure follows the base directory format,
which may include a C<.> root folder.
In this case, further archives lacking a C<.> root folder
need to be passed with a hash sign in front of the archive's name.
This may require quoting the parameter.)

To support zip files, a version of C<unzip> needs to be installed that is
compatible with the archive file.

B<The root folder switch using the hash sign is experimental and
may vanish in future versions.>
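
As an illustrative sketch of the multiple-archive case, the following call
passes a base archive and two annotation archives to C<archive>.
The file names and the output directory are hypothetical. The last archive
is assumed to lack a C<.> root folder, and it is quoted because an
unquoted leading C<#> would start a comment in most shells.

 $ korapxml2krill archive -z \
     --input corpus/news.zip \
     --input corpus/news.malt.zip \
     --input "#corpus/news.tt.zip" \
     --output krill-json/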

=item B<--output|-o> <directory|file>

Output folder for archive processing, or
document name for single output (optional).
Writes to C<STDOUT> by default
(unless further options, such as C<--gzip>, make C<output> mandatory).

=item B<--overwrite|-w>

Overwrite files that already exist.

=item B<--token|-t> <foundry>[#<file>]

Define the default tokenization by specifying
the name of the foundry and optionally the name
of the layer-file. Defaults to C<OpenNLP#tokens>.

=item B<--skip|-s> <foundry>[#<layer>]

Skip specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>. Alternatively you can skip C<#ALL>.
Can be set multiple times.

=item B<--anno|-a> <foundry>#<layer>

Convert specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>.
Can be set multiple times.
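
As a sketch of how C<--skip> and C<--anno> might be combined, the following
call skips all annotations and then converts only two selected layers.
That C<--anno> re-enables layers after C<--skip '#ALL'> is an assumption
about how the options interact, and the input and output paths are
hypothetical. C<#ALL> is quoted because an unquoted leading C<#> starts
a comment in most shells.

 $ korapxml2krill --input corpus/WPD/AAA/00001 --output 00001.json \
     --skip '#ALL' \
     --anno Mate#Morpho \
     --anno OpenNLP#Sentences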

=item B<--primary|-p>

Output primary data or not. Defaults to C<true>.
Can be negated using C<--no-primary>.
This is I<deprecated>.

=item B<--jobs|-j>

Define the number of concurrent jobs in separate forks
for archive processing.
Defaults to C<0> (everything runs in a single process).
This is I<experimental>.
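
For illustration, a call along the following lines would process an archive
with four forked jobs. The job count and the paths are hypothetical and
would need to be adapted to the machine and the corpus at hand.

 $ korapxml2krill archive -z --jobs 4 \
     --input corpus/news.zip \
     --output krill-json/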

=item B<--meta|-m>

Define the metadata parser to use. Defaults to C<I5>.
Metadata parsers can be defined in the C<KorAP::XML::Meta> namespace.
This is I<experimental>.

=item B<--pretty|-y>

Pretty print JSON output. Defaults to C<false>.
This is I<deprecated>.

=item B<--gzip|-z>

Compress the output.
Expects a defined C<output> file in single processing.

=item B<--cache|-c>

File to mmap a cache (using L<Cache::FastMmap>).
Defaults to C<korapxml2krill.cache> in the calling directory.

=item B<--cache-size|-cs>

Size of the cache. Defaults to C<50m>.

=item B<--cache-init|-ci>

Initialize cache file.
Can be negated using C<--no-cache-init>.
Defaults to C<true>.

=item B<--cache-delete|-cd>

Delete cache file after processing.
Can be negated using C<--no-cache-delete>.
Defaults to C<true>.
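
As a sketch, the cache options could be combined as follows to use a larger
cache file that is not deleted after processing. The cache path, the size,
and the archive names are hypothetical; the size is assumed to use the
same notation as the default C<50m>.

 $ korapxml2krill archive -z \
     --cache /tmp/korapxml2krill.cache \
     --cache-size 100m \
     --no-cache-delete \
     --input corpus/news.zip \
     --output krill-json/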

=item B<--sigle|-sg>

Extract the given texts.
Can be set multiple times.
I<Currently only supported on C<extract>.>
Sigles have the structure C<Corpus>/C<Document>/C<Text>.
In case the C<Text> path is omitted, the whole document will be extracted.
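
As an example of the sigle structure, the following call would extract one
single text and one whole document from an archive. The sigles
C<WPD/AAA/00001> and C<WPD/BBB>, the archive name, and the output
directory are hypothetical.

 $ korapxml2krill extract \
     --input corpus/wpd.zip \
     --output extracted/ \
     --sigle WPD/AAA/00001 \
     --sigle WPD/BBB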

=item B<--log|-l>

The L<Log4perl> log level, defaults to C<ERROR>.

=item B<--help|-h>

Print this document.

=item B<--version|-v>

Print version information.

=back

=head1 ANNOTATION SUPPORT

L<KorAP::XML::Krill> has built-in importers for some annotation foundries and layers
developed in the KorAP project that are part of the KorAP preprocessing pipeline.
The base foundry with paragraphs, sentences, and the text element is mandatory for
L<Krill|https://github.com/KorAP/Krill>.

=over 2

=item B<Base>

=over 4

=item #Paragraphs

=item #Sentences

=back

=item B<Connexor>

=over 4

=item #Morpho

=item #Phrase

=item #Sentences

=item #Syntax

=back

=item B<CoreNLP>

=over 4

=item #Constituency

=item #Morpho

=item #NamedEntities

=item #Sentences

=back

=item B<DeReKo>

=over 4

=item #Structure

=back

=item B<Glemm>

=over 4

=item #Morpho

=back

=item B<Mate>

=over 4

=item #Dependency

=item #Morpho

=back

=item B<OpenNLP>

=over 4

=item #Morpho

=item #Sentences

=back

=item B<Sgbr>

=over 4

=item #Lemma

=item #Morpho

=back

=item B<TreeTagger>

=over 4

=item #Morpho

=item #Sentences

=back

=item B<XIP>

=over 4

=item #Constituency

=item #Morpho

=item #Sentences

=back

=back

More importers are in preparation.
New annotation importers can be defined in the C<KorAP::XML::Annotation> namespace.
See the built-in annotation importers as examples.

=head1 AVAILABILITY

 https://github.com/KorAP/KorAP-XML-Krill


=head1 COPYRIGHT AND LICENSE

Copyright (C) 2015-2016, L<IDS Mannheim|http://www.ids-mannheim.de/>

Author: L<Nils Diewald|http://nils-diewald.de/>

Contributor: Eliza Margaretha

L<KorAP::XML::Krill> is developed as part of the L<KorAP|http://korap.ids-mannheim.de/>
Corpus Analysis Platform at the
L<Institute for the German Language (IDS)|http://ids-mannheim.de/>,
member of the
L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/en/about-us/leibniz-competition/projekte-2011/2011-funding-line-2/>.

This program is free software published under the
L<BSD-2 License|https://raw.githubusercontent.com/KorAP/KorAP-XML-Krill/master/LICENSE>.

=cut