=pod

=encoding utf8

=head1 NAME

korapxml2krill - Merge KorAP-XML data and create Krill documents


=head1 SYNOPSIS

 $ korapxml2krill -z --input <directory> --output <filename>
 $ korapxml2krill extract --input <archive> --output <directory> --sigle <SIGLE>
 $ korapxml2krill archive -z --input <directory|archive> --output <directory>


=head1 DESCRIPTION

L<KorAP::XML::Krill> is a library to convert KorAP-XML documents to files
compatible with the L<Krill|https://github.com/KorAP/Krill> indexer.
The C<korapxml2krill> command line tool is a simple wrapper around the library.


=head1 INSTALLATION

The preferred way to install L<KorAP::XML::Krill> is to use L<cpanm|App::cpanminus>.

 $ cpanm https://github.com/KorAP/KorAP-XML-Krill.git

If the installation succeeds, the C<korapxml2krill> tool is
immediately available on your command line.
L<KorAP::XML::Krill> requires at least Perl 5.14.
In addition, the C<unzip> tool needs to be present to work with zip archives.

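The two prerequisites named above can be checked before installing.
The following is a minimal sketch for a POSIX shell, not part of the distribution;
the helper name C<have> is hypothetical.

```shell
#!/bin/sh
# have is a hypothetical helper: it succeeds if a command is on the PATH.
have() {
  command -v "$1" >/dev/null 2>&1
}

# Perl 5.14 or newer is required; "use 5.014" fails on older versions.
if have perl && perl -e 'use 5.014;' 2>/dev/null; then
  echo "perl: ok"
else
  echo "perl: missing or older than 5.14"
fi

# unzip is only needed when working with zip archives.
if have unzip; then
  echo "unzip: ok"
else
  echo "unzip: missing"
fi
```
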
=head1 ARGUMENTS

Without arguments, C<korapxml2krill> processes a directory containing a single KorAP-XML document.

=over 2

=item B<archive>

Processes an archive, given either as a zip file or as a folder of KorAP-XML documents.

=item B<extract>

Extracts KorAP-XML files from a zip file.

=back


=head1 OPTIONS

=over 2

=item B<--input|-i> <directory|file|files>

Directory or archive file of documents to convert.

Without arguments, C<korapxml2krill> expects a folder containing a single KorAP-XML
document, while C<archive> and C<extract> support zip archives as well.

C<archive> supports multiple input archives with the constraint
that the first archive listed contains all primary data files
and all metadata files.

 -i file/news.zip -i file/news.malt.zip -i "#file/news.tt.zip"

(The directory structure follows the base directory format,
which may include a C<.> root folder.
In this case, further archives lacking a C<.> root folder
need to be passed with a hash sign in front of the archive's name.
This may require quoting the parameter.)

To support zip files, a version of C<unzip> needs to be installed that is
compatible with the archive file.

B<The root folder switch using the hash sign is experimental and
may vanish in future versions.>

=item B<--output|-o> <directory|file>

Output folder for archive processing, or
file name for single document processing (optional).
Writes to C<STDOUT> by default
(in case C<output> is not mandatory due to further options).

=item B<--overwrite|-w>

Overwrite files that already exist.

=item B<--token|-t> <foundry>[#<file>]

Define the default tokenization by specifying
the name of the foundry and optionally the name
of the layer file. Defaults to C<OpenNLP#tokens>.

=item B<--skip|-s> <foundry>[#<layer>]

Skip specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>. Alternatively, you can skip C<#ALL>.
Can be set multiple times.

=item B<--anno|-a> <foundry>#<layer>

Convert specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>.
Can be set multiple times.

=item B<--primary|-p>

Output primary data or not. Defaults to C<true>.
Can be negated using C<--no-primary>.
This is I<deprecated>.

=item B<--jobs|-j>

Define the number of concurrent jobs in separate forks
for archive processing.
Defaults to C<0> (everything runs in a single process).
This is I<experimental>.

=item B<--meta|-m>

Define the metadata parser to use. Defaults to C<I5>.
Metadata parsers can be defined in the C<KorAP::XML::Meta> namespace.
This is I<experimental>.

=item B<--pretty|-y>

Pretty print JSON output. Defaults to C<false>.
This is I<deprecated>.

=item B<--gzip|-z>

Compress the output.
Expects a defined C<output> file when processing a single document.

=item B<--cache|-c>

File to mmap a cache (using L<Cache::FastMmap>).
Defaults to C<korapxml2krill.cache> in the calling directory.

=item B<--cache-size|-cs>

Size of the cache. Defaults to C<50m>.

=item B<--cache-init|-ci>

Initialize cache file.
Can be negated using C<--no-cache-init>.
Defaults to C<true>.

=item B<--cache-delete|-cd>

Delete cache file after processing.
Can be negated using C<--no-cache-delete>.
Defaults to C<true>.

=item B<--sigle|-sg>

Extract the given texts.
Can be set multiple times.
I<Currently only supported by C<extract>.>
Sigles have the structure C<Corpus>/C<Document>/C<Text>.
If the C<Text> part is omitted, the whole document is extracted.

=item B<--log|-l>

The L<Log4perl> log level, defaults to C<ERROR>.

=item B<--help|-h>

Print this document.

=item B<--version|-v>

Print version information.

=back

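The quoting advice for C<--input> matters because a POSIX shell treats an
unquoted word starting with C<#> as the beginning of a comment, so the argument
would be dropped silently.
The following dry run is a minimal sketch that only assembles and prints such a
command line instead of executing it.
The archive names are taken from the C<--input> example above;
the output directory C<krill> is a hypothetical name.

```shell
#!/bin/sh
# Dry run: build the argument list and print it instead of
# executing korapxml2krill. "krill" is a hypothetical output directory.

# The first archive listed must contain all primary data and metadata files.
set -- archive -z -i file/news.zip -i file/news.malt.zip

# This archive lacks the "." root folder, hence the hash sign.
# It is quoted, because an unquoted word starting with "#" begins a comment.
set -- "$@" -i "#file/news.tt.zip" -o krill

cmd="korapxml2krill $*"
echo "$cmd"
```
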
=head1 ANNOTATION SUPPORT

L<KorAP::XML::Krill> has built-in importers for some annotation foundries and layers
developed in the KorAP project that are part of the KorAP preprocessing pipeline.
The base foundry with paragraphs, sentences, and the text element is mandatory for
L<Krill|https://github.com/KorAP/Krill>.

=over 2

=item B<Base>

=over 4

=item #Paragraphs

=item #Sentences

=back

=item B<Connexor>

=over 4

=item #Morpho

=item #Phrase

=item #Sentences

=item #Syntax

=back

=item B<CoreNLP>

=over 4

=item #Constituency

=item #Morpho

=item #NamedEntities

=item #Sentences

=back

=item B<DeReKo>

=over 4

=item #Structure

=back

=item B<Glemm>

=over 4

=item #Morpho

=back

=item B<Mate>

=over 4

=item #Dependency

=item #Morpho

=back

=item B<OpenNLP>

=over 4

=item #Morpho

=item #Sentences

=back

=item B<Sgbr>

=over 4

=item #Lemma

=item #Morpho

=back

=item B<TreeTagger>

=over 4

=item #Morpho

=item #Sentences

=back

=item B<XIP>

=over 4

=item #Constituency

=item #Morpho

=item #Sentences

=back

=back

More importers are in preparation.
New annotation importers can be defined in the C<KorAP::XML::Annotation> namespace.
See the built-in annotation importers as examples.

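On the command line, the foundries and layers listed above are addressed in the
C<< <foundry>#<layer> >> notation of C<--skip> and C<--anno>.
The following sketch merely illustrates how such a specification decomposes;
it is not the parser used by C<korapxml2krill>, and the helper name
C<split_spec> is hypothetical.

```shell
#!/bin/sh
# split_spec is a hypothetical helper, not part of korapxml2krill.
# It splits "<foundry>#<layer>" into its two parts; the layer stays
# empty when only a foundry is given.
split_spec() {
  foundry=${1%%#*}            # everything before the first "#"
  case $1 in
    *"#"*) layer=${1#*#} ;;   # everything after the first "#"
    *)     layer= ;;
  esac
}

split_spec "Mate#Morpho"
echo "foundry=$foundry layer=$layer"

split_spec "Mate"
echo "foundry=$foundry layer=${layer:-(none given)}"
```
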
=head1 AVAILABILITY

 https://github.com/KorAP/KorAP-XML-Krill


=head1 COPYRIGHT AND LICENSE

Copyright (C) 2015-2016, L<IDS Mannheim|http://www.ids-mannheim.de/>

Author: L<Nils Diewald|http://nils-diewald.de/>

L<KorAP::XML::Krill> is developed as part of the L<KorAP|http://korap.ids-mannheim.de/>
Corpus Analysis Platform at the
L<Institute for the German Language (IDS)|http://ids-mannheim.de/>,
member of the
L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/en/about-us/leibniz-competition/projekte-2011/2011-funding-line-2/>.

This program is free software published under the
L<BSD-2 License|https://raw.githubusercontent.com/KorAP/KorAP-XML-Krill/master/LICENSE>.

=cut