=pod

=encoding utf8

=head1 NAME

korapxml2krill - Merge KorAP-XML data and create Krill documents


=head1 SYNOPSIS

  $ korapxml2krill -z --input <directory> --output <filename>
  $ korapxml2krill archive -z --input <directory> --output <directory>
  $ korapxml2krill extract --input <directory> --output <filename> --sigle <SIGLE>


=head1 DESCRIPTION

L<KorAP::XML::Krill> is a library to convert KorAP-XML documents to files
compatible with the L<Krill|https://github.com/KorAP/Krill> indexer.
The C<korapxml2krill> command line tool is a simple wrapper around the library.


=head1 INSTALLATION

The preferred way to install L<KorAP::XML::Krill> is to use L<cpanm|App::cpanminus>.

  $ cpanm https://github.com/KorAP/KorAP-XML-Krill.git

If everything went well, the C<korapxml2krill> tool will
be available on your command line immediately.
L<KorAP::XML::Krill> requires at least Perl 5.14.

=head1 ARGUMENTS

=over 2

=item B<archive>

Process an archive, given as a Zip file or a folder of KorAP-XML documents.

=item B<extract>

Extract KorAP-XML files from a Zip file.

=back


=head1 OPTIONS

=over 2

=item B<--input|-i> <directory|file|files>

Directory or archive file of documents to convert.

Archiving supports multiple input archives, with the constraint
that the first archive listed contains all primary data files
and all metadata files.

  -i file/news.zip -i file/news.malt.zip -i '#file/news.tt.zip'

(The directory structure follows the base directory format,
which may include a C<.> root folder.
In this case, further archives lacking a C<.> root folder
need to be passed with a hash sign in front of the archive's name.
Quote such arguments, as an unquoted leading C<#> starts a comment
in most shells.)

=item B<--output|-o> <directory|file>

Output folder for archive processing or
document name for single output (optional).
Writes to C<STDOUT> by default
(in case C<output> is not mandatory due to further options).

=item B<--overwrite|-w>

Overwrite files that already exist.

=item B<--token|-t> <foundry>[#<file>]

Define the default tokenization by specifying
the name of the foundry and optionally the name
of the layer-file. Defaults to C<OpenNLP#tokens>.

=item B<--skip|-s> <foundry>[#<layer>]

Skip specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>. Alternatively you can skip C<#ALL>.
Can be set multiple times.

=item B<--anno|-a> <foundry>#<layer>

Convert specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>.
Can be set multiple times.

=item B<--primary|-p>

Output primary data or not. Defaults to C<true>.
Can be flagged using C<--no-primary> as well.
This is I<deprecated>.

=item B<--jobs|-j>

Define the number of concurrent jobs in separate forks
for archive processing.
Defaults to C<0> (everything runs in a single process).
This is I<experimental>.

=item B<--meta|-m>

Define the metadata parser to use. Defaults to C<I5>.
Metadata parsers can be defined in the C<KorAP::XML::Meta> namespace.
This is I<experimental>.

=item B<--pretty|-y>

Pretty print JSON output. Defaults to C<false>.
This is I<deprecated>.

=item B<--gzip|-z>

Compress the output.
Expects a defined C<output> file in single processing.

=item B<--cache|-c>

File to mmap a cache (using L<Cache::FastMmap>).
Defaults to C<korapxml2krill.cache> in the calling directory.

=item B<--cache-size|-cs>

Size of the cache. Defaults to C<50m>.

=item B<--cache-init|-ci>

Initialize cache file.
Can be flagged using C<--no-cache-init> as well.
Defaults to C<true>.

=item B<--cache-delete|-cd>

Delete cache file after processing.
Can be flagged using C<--no-cache-delete> as well.
Defaults to C<true>.

=item B<--sigle|-sg>

Extract the texts with the given sigles.
Can be set multiple times.
I<Currently only supported on C<extract>.>
Sigles have the structure C<Corpus>/C<Document>/C<Text>.

=item B<--log|-l>

The L<Log4perl> log level. Defaults to C<ERROR>.

=item B<--help|-h>

Print this document.

=item B<--version|-v>

Print version information.

=back

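The value formats used by C<--token>, C<--skip>, C<--anno> and C<--sigle>
can be illustrated with a small shell sketch.
It only mirrors the notation documented above and is not how
C<korapxml2krill> parses its arguments; the helper names C<split_spec>
and C<split_sigle> and the sigle C<CORP/DOC/00001> are made up for this example.

```shell
# Illustration only: mirrors the documented value formats,
# not the tool's own argument parser.

# Split "<foundry>[#<layer>]" into $foundry and $layer (layer may be empty).
split_spec() {
  spec=$1
  foundry=${spec%%#*}
  case $spec in
    *'#'*) layer=${spec#*#} ;;
    *)     layer= ;;
  esac
}

# Split "Corpus/Document/Text" into $corpus, $document and $text.
split_sigle() {
  sigle=$1
  corpus=${sigle%%/*}
  rest=${sigle#*/}
  document=${rest%%/*}
  text=${rest#*/}
}

split_spec 'Mate#Morpho'
echo "foundry=$foundry layer=$layer"

split_spec 'Mate'
echo "foundry=$foundry layer=$layer"

split_sigle 'CORP/DOC/00001'   # hypothetical sigle
echo "corpus=$corpus document=$document text=$text"
```

Values that start with a hash sign, such as C<#ALL>, are quoted here
because an unquoted leading C<#> would begin a shell comment.
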

=head1 ANNOTATION SUPPORT

L<KorAP::XML::Krill> has built-in importers for some annotation foundries and layers
developed in the KorAP project that are part of the KorAP preprocessing pipeline.
The base foundry with paragraphs, sentences, and the text element is mandatory for
L<Krill|https://github.com/KorAP/Krill>.

=over 2

=item B<Base>

=over 4

=item #Paragraphs

=item #Sentences

=back

=item B<Connexor>

=over 4

=item #Morpho

=item #Phrase

=item #Sentences

=item #Syntax

=back

=item B<CoreNLP>

=over 4

=item #Constituency

=item #Morpho

=item #NamedEntities

=item #Sentences

=back

=item B<DeReKo>

=over 4

=item #Structure

=back

=item B<Glemm>

=over 4

=item #Morpho

=back

=item B<Mate>

=over 4

=item #Dependency

=item #Morpho

=back

=item B<OpenNLP>

=over 4

=item #Morpho

=item #Sentences

=back

=item B<Sgbr>

=over 4

=item #Lemma

=item #Morpho

=back

=item B<TreeTagger>

=over 4

=item #Morpho

=item #Sentences

=back

=item B<XIP>

=over 4

=item #Constituency

=item #Morpho

=item #Sentences

=back

=back

More importers are in preparation.
New annotation importers can be defined in the C<KorAP::XML::Annotation> namespace.
See the built-in annotation importers as examples.

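To convert only a selection of the annotations listed above,
C<--anno> can be passed once per foundry and layer.
The following dry run merely assembles and prints such a command line
without executing it.
The paths C<corpus.zip> and C<out/> are hypothetical, and combining
C<--anno> with C<archive> is an assumption rather than something stated above.

```shell
# Dry run: build the argument list and print it instead of executing it.
# corpus.zip and out/ are hypothetical paths.
set -- archive --input corpus.zip --output out/

# Add one --anno option per selected foundry#layer pair.
for spec in 'Base#Sentences' 'OpenNLP#Morpho' 'Mate#Dependency'; do
  set -- "$@" --anno "$spec"
done

cmd="korapxml2krill $*"
echo "$cmd"
```

Printing the command first makes it easy to check the selection
before running it against a large archive.
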

=head1 AVAILABILITY

  https://github.com/KorAP/KorAP-XML-Krill


=head1 COPYRIGHT AND LICENSE

Copyright (C) 2015-2016, L<IDS Mannheim|http://www.ids-mannheim.de/>

Author: L<Nils Diewald|http://nils-diewald.de/>

L<KorAP::XML::Krill> is developed as part of the L<KorAP|http://korap.ids-mannheim.de/>
Corpus Analysis Platform at the
L<Institute for the German Language (IDS)|http://ids-mannheim.de/>,
member of the
L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/en/about-us/leibniz-competition/projekte-2011/2011-funding-line-2/>.

This program is free software published under the
L<BSD-2 License|https://raw.githubusercontent.com/KorAP/KorAP-XML-Krill/master/LICENSE>.

=cut