=pod

=encoding utf8

=head1 NAME

korapxml2krill - Merge KorapXML data and create Krill documents


=head1 SYNOPSIS

  $ korapxml2krill -z --input <directory> --output <filename>
  $ korapxml2krill archive -z --input <directory> --output <directory>
  $ korapxml2krill extract --input <directory> --output <filename> --sigle <SIGLE>


=head1 DESCRIPTION

L<KorAP::XML::Krill> is a library to convert KorAP-XML documents to files
compatible with the L<Krill|https://github.com/KorAP/Krill> indexer.
The C<korapxml2krill> command line tool is a simple wrapper around the library.


=head1 INSTALLATION

The preferred way to install L<KorAP::XML::Krill> is to use L<cpanm|App::cpanminus>.

  $ cpanm https://github.com/KorAP/KorAP-XML-Krill

If the installation succeeds, the C<korapxml2krill> tool is
immediately available on your command line.


=head1 ARGUMENTS

=over 2

=item B<archive>

Process an archive, given as a Zip file or a folder of KorAP-XML documents.

=item B<extract>

Extract KorAP-XML files from a Zip file.

=back


=head1 OPTIONS

=over 2

=item B<--input|-i> <directory|file>

Directory or archive file of documents to convert.

=item B<--output|-o> <directory|file>

Output folder for archive processing, or
document name for single output (optional).
Writes to C<STDOUT> by default,
unless further options make C<output> mandatory.

=item B<--overwrite|-w>

Overwrite files that already exist.

=item B<--token|-t> <foundry>[#<file>]

Define the default tokenization by specifying
the name of the foundry and optionally the name
of the layer-file. Defaults to C<OpenNLP#tokens>.

=item B<--skip|-s> <foundry>[#<layer>]

Skip specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>. Alternatively you can skip C<#ALL>.
Can be set multiple times.

=item B<--anno|-a> <foundry>#<layer>

Convert specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>.
Can be set multiple times.

=item B<--primary|-p>

Output primary data or not. Defaults to C<true>.
Can be negated using C<--no-primary>.
This is I<deprecated>.

=item B<--jobs|-j>

Define the number of concurrent jobs, run in separate forks,
for archive processing.
Defaults to C<0> (everything runs in a single process).
This is I<experimental>.

=item B<--meta|-m>

Define the metadata parser to use. Defaults to C<I5>.
Metadata parsers can be defined in the C<KorAP::XML::Meta> namespace.
This is I<experimental>.

=item B<--pretty|-y>

Pretty print JSON output. Defaults to C<false>.
This is I<deprecated>.

=item B<--gzip|-z>

Compress the output.
Requires a defined C<output> file when processing a single document.

=item B<--cache|-c>

File used to mmap a cache (using L<Cache::FastMmap>).
Defaults to C<korapxml2krill.cache> in the calling directory.

=item B<--cache-size|-cs>

Size of the cache. Defaults to C<50m>.

=item B<--cache-init|-ci>

Initialize cache file.
Can be negated using C<--no-cache-init>.
Defaults to C<true>.

=item B<--cache-delete|-cd>

Delete cache file after processing.
Can be negated using C<--no-cache-delete>.
Defaults to C<true>.

=item B<--sigle|-sg>

Extract the given text sigles.
Can be set multiple times.
I<Currently only supported on C<extract>.>

=item B<--log|-l>

The L<Log4perl> log level. Defaults to C<ERROR>.

=item B<--help|-h>

Print this document.

=item B<--version|-v>

Print version information.

=back
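
As a brief illustration of how these options combine, the following
invocations use only the flags documented above. All paths and file names
(C<corpus/doc1>, C<corpus.zip>, C<out/>, C<doc1.json.gz>) are hypothetical
placeholders, and the job count is arbitrary.
Note that in most shells a word starting with C<#> begins a comment,
so values such as C<#ALL> should be quoted.

  # Convert a single document, compress the result, and skip two annotations
  $ korapxml2krill -z --input corpus/doc1 --output doc1.json.gz \
      --skip 'Mate' --skip 'XIP#Constituency'

  # Skip all annotations; the quotes keep the shell from treating #ALL as a comment
  $ korapxml2krill -z --input corpus/doc1 --output doc1.json.gz --skip '#ALL'

  # Process a Zip archive into a folder, using four forked jobs (experimental)
  $ korapxml2krill archive -z --input corpus.zip --output out/ --jobs 4

Since C<--jobs> is marked as experimental, it may be safer to start with
the default of C<0> and raise the number only once a run has succeeded.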

=head1 ANNOTATION SUPPORT

L<KorAP::XML::Krill> has built-in importers for some annotation foundries and layers
developed in the KorAP project that are part of the KorAP preprocessing pipeline.
The base foundry, with paragraphs, sentences, and the text element, is mandatory for
L<Krill|https://github.com/KorAP/Krill>.

=over 2

=item B<Base>

=over 4

=item #Paragraphs

=item #Sentences

=back

=item B<Connexor>

=over 4

=item #Morpho

=item #Phrase

=item #Sentences

=item #Syntax

=back

=item B<CoreNLP>

=over 4

=item #Constituency

=item #Morpho

=item #NamedEntities

=item #Sentences

=back

=item B<DeReKo>

=over 4

=item #Structure

=back

=item B<Glemm>

=over 4

=item #Morpho

=back

=item B<Mate>

=over 4

=item #Dependency

=item #Morpho

=back

=item B<OpenNLP>

=over 4

=item #Morpho

=item #Sentences

=back

=item B<Sgbr>

=over 4

=item #Lemma

=item #Morpho

=back

=item B<TreeTagger>

=over 4

=item #Morpho

=item #Sentences

=back

=item B<XIP>

=over 4

=item #Constituency

=item #Morpho

=item #Sentences

=back

=back

More importers are in preparation.
New annotation importers can be defined in the C<KorAP::XML::Annotation> namespace.
See the built-in annotation importers as examples.
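
The foundry and layer names listed above appear to be the values that
C<--anno> and C<--skip> expect, joined as C<foundry#layer>
(compare the C<Mate#Morpho> example in the option descriptions).
A minimal sketch under that assumption, with hypothetical input and output paths:

  # Request two specific layers by foundry and layer name
  $ korapxml2krill -z --input corpus/doc1 --output doc1.json.gz \
      --anno 'CoreNLP#NamedEntities' --anno 'TreeTagger#Morpho'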

=head1 AVAILABILITY

  https://github.com/KorAP/KorAP-XML-Krill


=head1 COPYRIGHT AND LICENSE

Copyright (C) 2015-2016, L<IDS Mannheim|http://www.ids-mannheim.de/>

Author: L<Nils Diewald|http://nils-diewald.de/>

L<KorAP::XML::Krill> is developed as part of the L<KorAP|http://korap.ids-mannheim.de/>
Corpus Analysis Platform at the
L<Institute for the German Language (IDS)|http://ids-mannheim.de/>,
member of the
L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/en/about-us/leibniz-competition/projekte-2011/2011-funding-line-2/>.

This program is free software published under the
L<BSD-2 License|https://raw.githubusercontent.com/KorAP/KorAP-XML-Krill/master/LICENSE>.

=cut