=pod

=encoding utf8

=head1 NAME

korapxml2krill - Merge KorapXML data and create Krill documents


=head1 SYNOPSIS

 korapxml2krill [archive|extract] --input <directory|archive> [options]


=head1 DESCRIPTION

L<KorAP::XML::Krill> is a library to convert KorAP-XML documents to files
compatible with the L<Krill|https://github.com/KorAP/Krill> indexer.
The C<korapxml2krill> command line tool is a simple wrapper around the library.


=head1 INSTALLATION

The preferred way to install L<KorAP::XML::Krill> is to use L<cpanm|App::cpanminus>.

 $ cpanm https://github.com/KorAP/KorAP-XML-Krill.git

If the installation succeeds, the C<korapxml2krill> tool will
be available on your command line immediately.
L<KorAP::XML::Krill> requires Perl 5.14 or later.
To work with zip archives, the C<unzip> tool needs to be installed as well.

=head1 ARGUMENTS

 $ korapxml2krill -z --input <directory> --output <filename>

Without arguments, C<korapxml2krill> converts a directory of a single KorAP-XML document.
It expects the input to point to the text level folder.

=over 2

=item B<archive>

 $ korapxml2krill archive -z --input <directory|archive> --output <directory>

Converts an archive of KorAP-XML documents. It expects a directory
(pointing to the corpus level folder) or one or more zip files as input.

=item B<extract>

 $ korapxml2krill extract --input <archive> --output <directory> --sigle <SIGLE>

Extracts KorAP-XML documents from a zip file.

=back


=head1 OPTIONS

=over 2

=item B<--input|-i> <directory|zip file>

Directory or zip file(s) of documents to convert.

Without arguments, C<korapxml2krill> expects a folder of a single KorAP-XML
document, while C<archive> expects a KorAP-XML corpus folder or a zip
file to batch process multiple files.
C<extract> expects zip files only.

C<archive> supports multiple input zip files with the constraint
that the first archive listed contains all primary data files
and all metadata files.

 -i file/news.zip -i file/news.malt.zip -i "#file/news.tt.zip"

(The directory structure follows the base directory format,
which may include a C<.> root folder.
In this case further archives lacking a C<.> root folder
need to be passed with a hash sign in front of the archive's name.
This may require quoting the parameter.)

To support zip files, a version of C<unzip> needs to be installed that is
compatible with the archive file.

B<The root folder switch using the hash sign is experimental and
may vanish in future versions.>

=item B<--output|-o> <directory|file>

Output folder for archive processing or
document name for single output (optional).
Writes to C<STDOUT> by default
(in case C<output> is not mandatory due to further options).

=item B<--overwrite|-w>

Overwrite files that already exist.

=item B<--token|-t> <foundry>[#<file>]

Define the default tokenization by specifying
the name of the foundry and optionally the name
of the layer-file. Defaults to C<OpenNLP#tokens>.

=item B<--skip|-s> <foundry>[#<layer>]

Skip specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>. Alternatively you can skip C<#ALL>.
Can be set multiple times.

=item B<--anno|-a> <foundry>#<layer>

Convert specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>.
Can be set multiple times.

=item B<--primary|-p>

Output primary data or not. Defaults to C<true>.
Can be flagged using C<--no-primary> as well.
This is I<deprecated>.

=item B<--jobs|-j>

Define the number of concurrent jobs in separate forks
for archive processing.
Defaults to C<0> (everything runs in a single process).
This is I<experimental>.

=item B<--meta|-m>

Define the metadata parser to use. Defaults to C<I5>.
Metadata parsers can be defined in the C<KorAP::XML::Meta> namespace.
This is I<experimental>.

=item B<--pretty|-y>

Pretty print JSON output. Defaults to C<false>.
This is I<deprecated>.

=item B<--gzip|-z>

Compress the output.
Expects a defined C<output> file in single processing.

=item B<--cache|-c>

File to mmap a cache (using L<Cache::FastMmap>).
Defaults to C<korapxml2krill.cache> in the calling directory.

=item B<--cache-size|-cs>

Size of the cache. Defaults to C<50m>.

=item B<--cache-init|-ci>

Initialize cache file.
Can be flagged using C<--no-cache-init> as well.
Defaults to C<true>.

=item B<--cache-delete|-cd>

Delete cache file after processing.
Can be flagged using C<--no-cache-delete> as well.
Defaults to C<true>.

=item B<--sigle|-sg>

Extract the given texts.
Can be set multiple times.
I<Currently only supported on C<extract>.>
Sigles have the structure C<Corpus>/C<Document>/C<Text>.
In case the C<Text> path is omitted, the whole document will be extracted.
On the document level, the postfix wildcard C<*> is supported.

=item B<--log|-l>

The L<Log4perl|Log::Log4perl> log level, defaults to C<ERROR>.

=item B<--help|-h>

Print this document.

=item B<--version|-v>

Print version information.

=back

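
The options above can be combined. As a sketch only, the following calls use
just the switches documented in this section; all file names, folder names,
the job count, and the sigles are made-up placeholders and need to be
adapted to the corpus at hand.
The first call converts a zip archive in four forks with compressed output
and restricts the conversion to a single annotation layer
(assuming C<--skip> and C<--anno> can be combined this way).
The second call extracts one text and all documents sharing a common prefix.

 # Hypothetical paths; '#ALL' is quoted because an unquoted
 # leading hash sign starts a comment in most shells
 $ korapxml2krill archive -z -j 4 \
     --input corpus/news.zip --output krill/ \
     --skip '#ALL' --anno 'Mate#Morpho'

 # Hypothetical sigles; quoting keeps the shell from expanding the wildcard
 $ korapxml2krill extract \
     --input corpus/news.zip --output extracted/ \
     --sigle 'ABC/DEF/00001' --sigle 'ABC/GH*'
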
=head1 ANNOTATION SUPPORT

L<KorAP::XML::Krill> has built-in importers for some annotation foundries and layers
developed in the KorAP project that are part of the KorAP preprocessing pipeline.
The base foundry with paragraphs, sentences, and the text element is mandatory for
L<Krill|https://github.com/KorAP/Krill>.

=over 2

=item B<Base>

=over 4

=item #Paragraphs

=item #Sentences

=back

=item B<Connexor>

=over 4

=item #Morpho

=item #Phrase

=item #Sentences

=item #Syntax

=back

=item B<CoreNLP>

=over 4

=item #Constituency

=item #Morpho

=item #NamedEntities

=item #Sentences

=back

=item B<DeReKo>

=over 4

=item #Structure

=back

=item B<Glemm>

=over 4

=item #Morpho

=back

=item B<Mate>

=over 4

=item #Dependency

=item #Morpho

=back

=item B<OpenNLP>

=over 4

=item #Morpho

=item #Sentences

=back

=item B<Sgbr>

=over 4

=item #Lemma

=item #Morpho

=back

=item B<TreeTagger>

=over 4

=item #Morpho

=item #Sentences

=back

=item B<XIP>

=over 4

=item #Constituency

=item #Morpho

=item #Sentences

=back

=back

More importers are in preparation.
New annotation importers can be defined in the C<KorAP::XML::Annotation> namespace.
See the built-in annotation importers as examples.

=head1 AVAILABILITY

 https://github.com/KorAP/KorAP-XML-Krill


=head1 COPYRIGHT AND LICENSE

Copyright (C) 2015-2016, L<IDS Mannheim|http://www.ids-mannheim.de/>

Author: L<Nils Diewald|http://nils-diewald.de/>
Contributor: Eliza Margaretha

L<KorAP::XML::Krill> is developed as part of the L<KorAP|http://korap.ids-mannheim.de/>
Corpus Analysis Platform at the
L<Institute for the German Language (IDS)|http://ids-mannheim.de/>,
member of the
L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/en/about-us/leibniz-competition/projekte-2011/2011-funding-line-2/>.

This program is free software published under the
L<BSD-2 License|https://raw.githubusercontent.com/KorAP/KorAP-XML-Krill/master/LICENSE>.

=cut