=pod

=encoding utf8

=head1 NAME

korapxml2krill - Merge KorAP-XML data and create Krill documents


=head1 SYNOPSIS

  korapxml2krill [archive|extract] --input <directory|archive> [options]


=head1 DESCRIPTION

L<KorAP::XML::Krill> is a library to convert KorAP-XML documents to files
compatible with the L<Krill|https://github.com/KorAP/Krill> indexer.
The C<korapxml2krill> command line tool is a simple wrapper around the library.


=head1 INSTALLATION

The preferred way to install L<KorAP::XML::Krill> is to use L<cpanm|App::cpanminus>.

  $ cpanm https://github.com/KorAP/KorAP-XML-Krill.git

If the installation succeeds, the C<korapxml2krill> tool is
available on your command line immediately.
The minimum requirement for L<KorAP::XML::Krill> is Perl 5.14.
In addition, the C<unzip> tool needs to be present to work with zip archives.

=head1 ARGUMENTS

  $ korapxml2krill -z --input <directory> --output <filename>

Without arguments, C<korapxml2krill> converts the directory of a single KorAP-XML document.
It expects the input to point to the text level folder.

=over 2

=item B<archive>

  $ korapxml2krill archive -z --input <directory|archive> --output <directory>

Converts an archive of KorAP-XML documents. It expects a directory
(pointing to the corpus level folder) or one or more zip files as input.

=item B<extract>

  $ korapxml2krill extract --input <archive> --output <directory> --sigle <SIGLE>

Extracts KorAP-XML documents from a zip file.

=back


=head1 OPTIONS

=over 2

=item B<--input|-i> <directory|zip file>

Directory or zip file(s) of documents to convert.

Without arguments, C<korapxml2krill> expects a folder of a single KorAP-XML
document, while C<archive> expects a KorAP-XML corpus folder or a zip
file to batch process multiple files.
C<extract> expects zip files only.

C<archive> supports multiple input zip files, with the constraint
that the first archive listed contains all primary data files
and all metadata files.

  -i file/news.zip -i file/news.malt.zip -i "#file/news.tt.zip"

(The directory structure follows the base directory format,
which may include a C<.> root folder.
In this case, further archives lacking a C<.> root folder
need to be passed with a hash sign in front of the archive's name.
This may require quoting the parameter.)

To support zip files, a version of C<unzip> needs to be installed that is
compatible with the archive file.

B<The root folder switch using the hash sign is experimental and
may vanish in future versions.>

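
The need to quote a hash-prefixed archive name can be illustrated with a
minimal sketch, assuming a POSIX-like shell script.
C<count_args> is a hypothetical helper that is not part of C<korapxml2krill>;
it only shows how many arguments the shell passes on.
An unquoted C<#> at the start of a word begins a shell comment,
so the archive name would never reach the tool.

  # count_args is a hypothetical helper: it prints how many arguments it received
  count_args() { echo "$#"; }

  count_args -i "#file/news.tt.zip"   # prints 2: the archive name is passed on
  count_args -i #file/news.tt.zip     # prints 1: the rest of the line is a comment

Single quotes work the same way as the double quotes used here.
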

=item B<--output|-o> <directory|file>

Output folder for archive processing or
document name for single output (optional).
Writes to C<STDOUT> by default
(in case C<output> is not mandatory due to further options).

=item B<--overwrite|-w>

Overwrite files that already exist.

=item B<--token|-t> <foundry>#<file>

Define the default tokenization by specifying
the name of the foundry and optionally the name
of the layer-file. Defaults to C<OpenNLP#tokens>.


=item B<--base-sentences|-bs> <foundry>#<layer>

Define the layer for base sentences.
If given, this will be used instead of C<Base#Sentences>.
Currently C<DeReKo#Structure> is the only additional layer supported.

Defaults to unset.


=item B<--base-paragraphs|-bp> <foundry>#<layer>

Define the layer for base paragraphs.
If given, this will be used instead of C<Base#Paragraphs>.
Currently C<DeReKo#Structure> is the only additional layer supported.

Defaults to unset.


=item B<--skip|-s> <foundry>[#<layer>]

Skip specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>. Alternatively you can skip C<#ALL>.
Can be set multiple times.

=item B<--anno|-a> <foundry>#<layer>

Convert specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>.
Can be set multiple times.

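
As a sketch of how the C<--skip> and C<--anno> options may be combined
on the command line: the paths C<corpus/doc/text> and C<text.json> are
placeholders, and the second call assumes that C<--anno> selects
annotations to convert after everything has been skipped with C<#ALL>,
which is an interpretation of the descriptions above rather than a
documented guarantee.

  # Skip the whole XIP foundry and only the Morpho layer of Mate
  $ korapxml2krill --input corpus/doc/text --output text.json \
      --skip XIP --skip Mate#Morpho

  # Assumption: skip everything, then convert selected layers only
  $ korapxml2krill --input corpus/doc/text --output text.json \
      --skip '#ALL' --anno OpenNLP#Morpho --anno Mate#Dependency

C<#ALL> is quoted because a leading hash sign would otherwise start a
shell comment. A hash sign inside a word, as in C<Mate#Morpho>,
needs no quoting.
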

=item B<--primary|-p>

Whether to output primary data. Defaults to C<true>.
Can be disabled using C<--no-primary>.
This is I<deprecated>.

=item B<--jobs|-j>

Define the number of concurrent jobs in separate forks
for archive processing.
Defaults to C<0> (everything runs in a single process).
This is I<experimental>.

=item B<--meta|-m>

Define the metadata parser to use. Defaults to C<I5>.
Metadata parsers can be defined in the C<KorAP::XML::Meta> namespace.
This is I<experimental>.

=item B<--pretty|-y>

Pretty print JSON output. Defaults to C<false>.
This is I<deprecated>.

=item B<--gzip|-z>

Compress the output.
Expects a defined C<output> file in single processing.

=item B<--cache|-c>

File to mmap a cache (using L<Cache::FastMmap>).
Defaults to C<korapxml2krill.cache> in the calling directory.

=item B<--cache-size|-cs>

Size of the cache. Defaults to C<50m>.

=item B<--cache-init|-ci>

Initialize the cache file.
Can be disabled using C<--no-cache-init>.
Defaults to C<true>.

=item B<--cache-delete|-cd>

Delete the cache file after processing.
Can be disabled using C<--no-cache-delete>.
Defaults to C<true>.

=item B<--sigle|-sg>

Extract the given texts.
Can be set multiple times.
I<Currently only supported on C<extract>.>
Sigles have the structure C<Corpus>/C<Document>/C<Text>.
In case the C<Text> path is omitted, the whole document will be extracted.
On the document level, the postfix wildcard C<*> is supported.

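
The three sigle forms described above might be passed as follows.
This is a sketch only: the archive name C<corpus.zip>, the output folder
C<extracted>, and the sigles are invented for illustration, and the
placement of the wildcard follows one reading of "postfix wildcard on
the document level".

  $ korapxml2krill extract --input corpus.zip --output extracted \
      --sigle GOE/AGA/00001 \
      --sigle GOE/AGB \
      --sigle 'GOE/AGC*'

The first sigle names a single text, the second a whole document, and
the third all documents whose name starts with C<AGC>.
The wildcard form is quoted so that the shell does not expand C<*>
against local file names.
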

=item B<--log|-l>

The L<Log4perl> log level, defaults to C<ERROR>.

=item B<--help|-h>

Print this document.

=item B<--version|-v>

Print version information.

=back

=head1 ANNOTATION SUPPORT

L<KorAP::XML::Krill> has built-in importers for some annotation foundries and layers
developed in the KorAP project that are part of the KorAP preprocessing pipeline.
The base foundry with paragraphs, sentences, and the text element is mandatory for
L<Krill|https://github.com/KorAP/Krill>.

=over 2

=item B<Base>

=over 4

=item #Paragraphs

=item #Sentences

=back

=item B<Connexor>

=over 4

=item #Morpho

=item #Phrase

=item #Sentences

=item #Syntax

=back

=item B<CoreNLP>

=over 4

=item #Constituency

=item #Morpho

=item #NamedEntities

=item #Sentences

=back

=item B<DeReKo>

=over 4

=item #Structure

=back

=item B<Glemm>

=over 4

=item #Morpho

=back

=item B<Mate>

=over 4

=item #Dependency

=item #Morpho

=back

=item B<OpenNLP>

=over 4

=item #Morpho

=item #Sentences

=back

=item B<Sgbr>

=over 4

=item #Lemma

=item #Morpho

=back

=item B<TreeTagger>

=over 4

=item #Morpho

=item #Sentences

=back

=item B<XIP>

=over 4

=item #Constituency

=item #Morpho

=item #Sentences

=back

=back

More importers are in preparation.
New annotation importers can be defined in the C<KorAP::XML::Annotation> namespace.
See the built-in annotation importers as examples.

=head1 AVAILABILITY

  https://github.com/KorAP/KorAP-XML-Krill


=head1 COPYRIGHT AND LICENSE

Copyright (C) 2015-2016, L<IDS Mannheim|http://www.ids-mannheim.de/>

Author: L<Nils Diewald|http://nils-diewald.de/>
Contributor: Eliza Margaretha

L<KorAP::XML::Krill> is developed as part of the L<KorAP|http://korap.ids-mannheim.de/>
Corpus Analysis Platform at the
L<Institute for the German Language (IDS)|http://ids-mannheim.de/>,
member of the
L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/en/about-us/leibniz-competition/projekte-2011/2011-funding-line-2/>.

This program is free software published under the
L<BSD-2 License|https://raw.githubusercontent.com/KorAP/KorAP-XML-Krill/master/LICENSE>.

=cut