=pod

=encoding utf8

=head1 NAME

korapxml2krill - Merge KorapXML data and create Krill documents


=head1 SYNOPSIS

 $ korapxml2krill -z --input <directory> --output <filename>
 $ korapxml2krill archive -z --input <directory|archive> --output <directory>
 $ korapxml2krill extract --input <directory|archive> --output <filename> --sigle <SIGLE>


=head1 DESCRIPTION

L<KorAP::XML::Krill> is a library to convert KorAP-XML documents to files
compatible with the L<Krill|https://github.com/KorAP/Krill> indexer.
The C<korapxml2krill> command line tool is a simple wrapper around the library.


=head1 INSTALLATION

The preferred way to install L<KorAP::XML::Krill> is to use L<cpanm|App::cpanminus>.

 $ cpanm https://github.com/KorAP/KorAP-XML-Krill.git

If everything went well, the C<korapxml2krill> tool will
be available on your command line immediately.
The minimum requirement for L<KorAP::XML::Krill> is Perl 5.14.
In addition, the C<unzip> tool needs to be present to work with zip archives.

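To check these prerequisites from a shell, something like the following
may help. This is only a sketch: the first command fails on a Perl older
than 5.14, the second prints the path of C<unzip> if it is installed, and
the third assumes that the installation succeeded.

 $ perl -e 'use 5.014; print "Perl is recent enough\n"'
 $ command -v unzip
 $ korapxml2krill --version
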
=head1 ARGUMENTS

Without arguments, C<korapxml2krill> processes a directory of a single KorAP-XML document.

=over 2

=item B<archive>

Processes an archive of KorAP-XML documents, given either as a Zip file or as a folder.

=item B<extract>

Extracts KorAP-XML files from a Zip file.

=back


=head1 OPTIONS

=over 2

=item B<--input|-i> <directory|file|files>

Directory or archive file of documents to convert.

Without arguments, C<korapxml2krill> expects a folder of a single KorAP-XML
document, while C<archive> and C<extract> support zip archives as well.

C<archive> supports multiple input archives, with the constraint
that the first archive listed contains all primary data files
and all metadata files.

 -i file/news.zip -i file/news.malt.zip -i "#file/news.tt.zip"

(The directory structure follows the base directory format,
which may include a C<.> root folder.
In this case, further archives lacking a C<.> root folder
need to be passed with a hash sign in front of the archive's name.
This may require quoting the parameter.)

To support zip files, a version of C<unzip> needs to be installed that is
compatible with the archive file.

B<The root folder switch using the hash sign is experimental and
may vanish in future versions.>

=item B<--output|-o> <directory|file>

Output folder for archive processing or
document name for single output (optional).
Writes to C<STDOUT> by default
(in case C<output> is not mandatory due to further options).

=item B<--overwrite|-w>

Overwrite files that already exist.

=item B<--token|-t> <foundry>[#<file>]

Define the default tokenization by specifying
the name of the foundry and optionally the name
of the layer-file. Defaults to C<OpenNLP#tokens>.

=item B<--skip|-s> <foundry>[#<layer>]

Skip specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>. Alternatively you can skip C<#ALL>.
Can be set multiple times.

=item B<--anno|-a> <foundry>#<layer>

Convert specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>.
Can be set multiple times.

=item B<--primary|-p>

Output primary data or not. Defaults to C<true>.
Can be flagged using C<--no-primary> as well.
This is I<deprecated>.

=item B<--jobs|-j>

Define the number of concurrent jobs in separate forks
for archive processing.
Defaults to C<0> (everything runs in a single process).
This is I<experimental>.

=item B<--meta|-m>

Define the metadata parser to use. Defaults to C<I5>.
Metadata parsers can be defined in the C<KorAP::XML::Meta> namespace.
This is I<experimental>.

=item B<--pretty|-y>

Pretty print JSON output. Defaults to C<false>.
This is I<deprecated>.

=item B<--gzip|-z>

Compress the output.
Expects a defined C<output> file in single processing.

=item B<--cache|-c>

File to mmap a cache (using L<Cache::FastMmap>).
Defaults to C<korapxml2krill.cache> in the calling directory.

=item B<--cache-size|-cs>

Size of the cache. Defaults to C<50m>.

=item B<--cache-init|-ci>

Initialize cache file.
Can be flagged using C<--no-cache-init> as well.
Defaults to C<true>.

=item B<--cache-delete|-cd>

Delete cache file after processing.
Can be flagged using C<--no-cache-delete> as well.
Defaults to C<true>.

=item B<--sigle|-sg>

Extract the given text sigles.
Can be set multiple times.
I<Currently only supported on C<extract>.>
Sigles have the structure C<Corpus>/C<Document>/C<Text>.

=item B<--log|-l>

The L<Log4perl|Log::Log4perl> log level, defaults to C<ERROR>.

=item B<--help|-h>

Print this document.

=item B<--version|-v>

Print version information.

=back

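The options above can be combined. The following invocations are sketches
only: all file and folder names are hypothetical, and only flags documented
above are used. The first call converts two archives in four forks,
compresses the output, and overwrites existing files in the output folder.

 $ korapxml2krill archive -z -j 4 -w -i corpus.zip -i corpus.malt.zip -o krill/

The second call converts a single document folder, names the tokenization
explicitly, skips the whole C<XIP> foundry as well as the C<Morpho> layer
of C<Mate>, and raises the log level. The values containing a hash sign
are quoted to keep the shell from interpreting them.

 $ korapxml2krill -i doc/ -o doc.json -t "OpenNLP#tokens" -s "XIP" -s "Mate#Morpho" -l DEBUG

In both sketches the first C<--input> is assumed to hold the primary data
and metadata, as required for multiple input archives.
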
=head1 ANNOTATION SUPPORT

L<KorAP::XML::Krill> has built-in importers for some annotation foundries and layers
developed in the KorAP project that are part of the KorAP preprocessing pipeline.
The base foundry with paragraphs, sentences, and the text element is mandatory for
L<Krill|https://github.com/KorAP/Krill>.

=over 2

=item B<Base>

=over 4

=item #Paragraphs

=item #Sentences

=back

=item B<Connexor>

=over 4

=item #Morpho

=item #Phrase

=item #Sentences

=item #Syntax

=back

=item B<CoreNLP>

=over 4

=item #Constituency

=item #Morpho

=item #NamedEntities

=item #Sentences

=back

=item B<DeReKo>

=over 4

=item #Structure

=back

=item B<Glemm>

=over 4

=item #Morpho

=back

=item B<Mate>

=over 4

=item #Dependency

=item #Morpho

=back

=item B<OpenNLP>

=over 4

=item #Morpho

=item #Sentences

=back

=item B<Sgbr>

=over 4

=item #Lemma

=item #Morpho

=back

=item B<TreeTagger>

=over 4

=item #Morpho

=item #Sentences

=back

=item B<XIP>

=over 4

=item #Constituency

=item #Morpho

=item #Sentences

=back

=back

More importers are in preparation.
New annotation importers can be defined in the C<KorAP::XML::Annotation> namespace.
See the built-in annotation importers as examples.

=head1 AVAILABILITY

 https://github.com/KorAP/KorAP-XML-Krill


=head1 COPYRIGHT AND LICENSE

Copyright (C) 2015-2016, L<IDS Mannheim|http://www.ids-mannheim.de/>

Author: L<Nils Diewald|http://nils-diewald.de/>

L<KorAP::XML::Krill> is developed as part of the L<KorAP|http://korap.ids-mannheim.de/>
Corpus Analysis Platform at the
L<Institute for the German Language (IDS)|http://ids-mannheim.de/>,
member of the
L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/en/about-us/leibniz-competition/projekte-2011/2011-funding-line-2/>.

This program is free software published under the
L<BSD-2 License|https://raw.githubusercontent.com/KorAP/KorAP-XML-Krill/master/LICENSE>.

=cut