=pod

=encoding utf8

=head1 NAME

korapxml2krill - Merge KorAP-XML data and create Krill documents


=head1 SYNOPSIS

  korapxml2krill [archive|extract] --input <directory|archive> [options]


=head1 DESCRIPTION

L<KorAP::XML::Krill> is a library to convert KorAP-XML documents to files
compatible with the L<Krill|https://github.com/KorAP/Krill> indexer.
The C<korapxml2krill> command line tool is a simple wrapper around the library.


=head1 INSTALLATION

The preferred way to install L<KorAP::XML::Krill> is to use L<cpanm|App::cpanminus>.

  $ cpanm https://github.com/KorAP/KorAP-XML-Krill.git

If everything went well, the C<korapxml2krill> tool will
be available on your command line immediately.
The minimum requirement for L<KorAP::XML::Krill> is Perl 5.14.
In addition, to work with zip archives, the C<unzip> tool needs to be present.

=head1 ARGUMENTS

  $ korapxml2krill -z --input <directory> --output <filename>

Without an C<archive> or C<extract> argument, C<korapxml2krill> converts
a directory of a single KorAP-XML document.
It expects the input to point to the text level folder.

=over 2

=item B<archive>

  $ korapxml2krill archive -z --input <directory|archive> --output <directory>

Converts an archive of KorAP-XML documents. It expects a directory
(pointing to the corpus level folder) or one or more zip files as input.

=item B<extract>

  $ korapxml2krill extract --input <archive> --output <directory> --sigle <SIGLE>

Extracts KorAP-XML documents from a zip file.

=back


=head1 OPTIONS

=over 2

=item B<--input|-i> <directory|zip file>

Directory or zip file(s) of documents to convert.

Without arguments, C<korapxml2krill> expects a folder of a single KorAP-XML
document, while C<archive> expects a KorAP-XML corpus folder or a zip
file to batch process multiple files.
C<extract> expects zip files only.

C<archive> supports multiple input zip files with the constraint
that the first archive listed contains all primary data files
and all metadata files.

  -i file/news.zip -i file/news.malt.zip -i "#file/news.tt.zip"

Input may also be defined using BSD glob wildcards.

  -i 'file/news*.zip'

The expanded list of inputs is sorted by path length, so the shortest
path needs to contain all primary data files and all metadata files.

(The directory structure follows the base directory format,
which may include a C<.> root folder.
In this case, further archives lacking a C<.> root folder
need to be passed with a hash sign in front of the archive's name.
This may require quoting the parameter.)

To support zip files, a version of C<unzip> needs to be installed that is
compatible with the archive file.

B<The root folder switch using the hash sign is experimental and
may vanish in future versions.>

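The length ordering of glob-expanded inputs can be illustrated in shell.
This is only a sketch of the documented ordering, not the tool's actual
implementation; C<sort_by_length> is a hypothetical helper name, and the
file names are the ones used in the examples above.

```shell
# Hypothetical helper (not part of korapxml2krill): order the paths
# read from standard input from shortest to longest.
sort_by_length() {
  # Prefix every path with its length, sort numerically, drop the prefix.
  awk '{ print length($0) "\t" $0 }' | sort -n | cut -f2-
}

# The shortest path is listed first, so it is the archive that has to
# contain the primary data and metadata files.
printf '%s\n' file/news.malt.zip file/news.zip file/news.tt.zip | sort_by_length
```

Running such a check before a batch job may help to confirm that the
archive holding the primary data really has the shortest path.
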
=item B<--output|-o> <directory|file>

Output folder for archive processing, or
document name for single output (optional).
Writes to C<STDOUT> by default
(in case C<output> is not mandatory due to further options).

=item B<--overwrite|-w>

Overwrite files that already exist.

=item B<--token|-t> <foundry>#<file>

Define the default tokenization by specifying
the name of the foundry and optionally the name
of the layer-file. Defaults to C<OpenNLP#tokens>.


=item B<--base-sentences|-bs> <foundry>#<layer>

Define the layer for base sentences.
If given, this will be used instead of C<Base#Sentences>.
Currently C<DeReKo#Structure> is the only additional layer supported.

Defaults to unset.


=item B<--base-paragraphs|-bp> <foundry>#<layer>

Define the layer for base paragraphs.
If given, this will be used instead of C<Base#Paragraphs>.
Currently C<DeReKo#Structure> is the only additional layer supported.

Defaults to unset.


=item B<--base-pagebreaks|-bpb> <foundry>#<layer>

Define the layer for base pagebreaks.
Currently C<DeReKo#Structure> is the only layer supported.

Defaults to unset.


=item B<--skip|-s> <foundry>[#<layer>]

Skip specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>. Alternatively you can skip C<#ALL>.
Can be set multiple times.

=item B<--anno|-a> <foundry>#<layer>

Convert specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>.
Can be set multiple times.

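The foundry and layer notation used by C<--skip> and C<--anno> can be
taken apart with plain shell parameter expansion. The following is a
minimal sketch for scripts that build or check such values;
C<split_anno> is a hypothetical helper and says nothing about how the
tool parses the value internally.

```shell
# Hypothetical helper (not part of korapxml2krill): split a
# "<foundry>[#<layer>]" specification into its two parts.
split_anno() {
  spec=$1
  foundry=${spec%%\#*}          # everything before the first '#'
  case $spec in
    *\#*) layer=${spec#*\#} ;;  # everything after the first '#'
    *)    layer= ;;             # no layer given: the whole foundry
  esac
  printf 'foundry=%s layer=%s\n' "$foundry" "$layer"
}

split_anno 'Mate#Morpho'
split_anno 'Mate'
```

Note that the special value C<#ALL> has an empty foundry part in this
sketch and would need separate handling.
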
=item B<--primary|-p>

Output primary data or not. Defaults to C<true>.
Can be flagged using C<--no-primary> as well.
This is I<deprecated>.

=item B<--jobs|-j>

Define the number of concurrent jobs in separate forks
for archive processing.
Defaults to C<0> (everything runs in a single process).
Pass C<-1> to set the value automatically to 5
times the number of available cores.
This is I<experimental>.

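The arithmetic behind C<--jobs -1> can be sketched as follows. This is an
illustration of the documented rule only; C<resolve_jobs> is a
hypothetical helper, and detecting the cores with C<getconf> is an
assumption that may differ from what the tool does.

```shell
# Hypothetical helper (not part of korapxml2krill): resolve a --jobs value.
# $1 is the requested value; $2 optionally overrides the detected core count.
resolve_jobs() {
  requested=$1
  cores=${2:-$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)}
  if [ "$requested" -eq -1 ]; then
    echo $((cores * 5))   # automatic: 5 times the number of cores
  else
    echo "$requested"     # any other value is used as given
  fi
}

resolve_jobs -1 4
resolve_jobs 0 4
```

On a machine with 4 cores, the automatic setting therefore corresponds
to 20 concurrent jobs.
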
=item B<--meta|-m>

Define the metadata parser to use. Defaults to C<I5>.
Metadata parsers can be defined in the C<KorAP::XML::Meta> namespace.
This is I<experimental>.

=item B<--pretty|-y>

Pretty print JSON output. Defaults to C<false>.
This is I<deprecated>.

=item B<--gzip|-z>

Compress the output.
Expects a defined C<output> file in single processing.

=item B<--cache|-c>

File to mmap a cache (using L<Cache::FastMmap>).
Defaults to C<korapxml2krill.cache> in the calling directory.

=item B<--cache-size|-cs>

Size of the cache. Defaults to C<50m>.

=item B<--cache-init|-ci>

Initialize cache file.
Can be flagged using C<--no-cache-init> as well.
Defaults to C<true>.

=item B<--cache-delete|-cd>

Delete cache file after processing.
Can be flagged using C<--no-cache-delete> as well.
Defaults to C<true>.

=item B<--config|-cfg>

Configure the parameters of your call in a file
of key-value pairs separated by whitespace:

  overwrite 1
  token DeReKo#Structure
  ...

Supported parameters are:
C<overwrite>, C<gzip>, C<jobs>,
C<token>, C<log>, C<cache>, C<cache-size>, C<cache-delete>, C<meta>,
C<output>, C<base-sentences>, C<temp-extract>, C<base-paragraphs>,
C<base-pagebreaks>, C<skip> (semicolon separated), C<sigle>
(semicolon separated), C<anno> (semicolon separated).

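A configuration file with a semicolon separated list might be prepared
as sketched below. All values, the temporary file and the input and
output names are illustrative assumptions; the final line only prints a
possible call instead of running it.

```shell
# Write a hypothetical configuration file; all values are illustrative.
conf=$(mktemp)
cat > "$conf" <<'EOF'
overwrite 1
jobs 4
token OpenNLP#tokens
skip Mate#Morpho;XIP
EOF

# Print a call that would use the file (paths are placeholders).
# The input is passed on the command line, as it is not among the
# supported configuration parameters.
echo "korapxml2krill archive --config $conf --input corpus.zip --output out/"
```
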
=item B<--temporary-extract|-te>

Only valid for the C<archive> command.

This first extracts all files into a
directory and then runs the C<archive> processing on that directory.
If the directory is given as C<:temp:>,
a temporary directory is used.
This is especially useful to avoid
massive unzipping and potential
network latency.

=item B<--sigle|-sg>

Extract the given texts.
Can be set multiple times.
I<Currently only supported on C<extract>.>
Sigles have the structure C<Corpus>/C<Document>/C<Text>.
In case the C<Text> path is omitted, the whole document will be extracted.
On the document level, the postfix wildcard C<*> is supported.

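The sigle rules above can be made concrete with a small shell sketch.
It only models the documented behaviour (exact text sigle, whole
document, postfix wildcard on the document level) and is not the tool's
own matching code; C<sigle_matches> and the sigles used are made up for
illustration.

```shell
# Hypothetical helper (not part of korapxml2krill): succeed if the text
# sigle $2 (Corpus/Document/Text) is selected by the requested sigle $1.
sigle_matches() {
  want=$1
  have=$2
  case $want in
    */*/*)                       # full text sigle: exact match only
      [ "$want" = "$have" ] ;;
    */*\*)                       # document sigle with postfix wildcard
      prefix=${want%\*}
      case $have in "$prefix"*/*) return 0 ;; *) return 1 ;; esac ;;
    */*)                         # document sigle: every text below it
      case $have in "$want"/*) return 0 ;; *) return 1 ;; esac ;;
    *)
      return 1 ;;
  esac
}

# The sigles below are made-up examples.
sigle_matches 'WPD/AAA' 'WPD/AAA/00001' && echo 'whole document selected'
sigle_matches 'WPD/A*' 'WPD/ABC/00042' && echo 'wildcard selected'
```
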
=item B<--log|-l>

The L<Log4perl> log level, defaults to C<ERROR>.

=item B<--help|-h>

Print this document.

=item B<--version|-v>

Print version information.

=back

=head1 ANNOTATION SUPPORT

L<KorAP::XML::Krill> has built-in importers for some annotation foundries and layers
developed in the KorAP project that are part of the KorAP preprocessing pipeline.
The base foundry with paragraphs, sentences, and the text element is mandatory for
L<Krill|https://github.com/KorAP/Krill>.

  Base
    #Paragraphs
    #Sentences

  Connexor
    #Morpho
    #Phrase
    #Sentences
    #Syntax

  CoreNLP
    #Constituency
    #Morpho
    #NamedEntities
    #Sentences

  DeReKo
    #Structure

  DRuKoLa
    #Morpho

  Glemm
    #Morpho

  Malt
    #Dependency

  MarMoT
    #Morpho

  Mate
    #Dependency
    #Morpho

  MDParser
    #Dependency

  OpenNLP
    #Morpho
    #Sentences

  Sgbr
    #Lemma
    #Morpho

  TreeTagger
    #Morpho
    #Sentences

  XIP
    #Constituency
    #Morpho
    #Sentences


More importers are in preparation.
New annotation importers can be defined in the C<KorAP::XML::Annotation> namespace.
See the built-in annotation importers as examples.

=head1 AVAILABILITY

  https://github.com/KorAP/KorAP-XML-Krill


=head1 COPYRIGHT AND LICENSE

Copyright (C) 2015-2017, L<IDS Mannheim|http://www.ids-mannheim.de/>

Author: L<Nils Diewald|http://nils-diewald.de/>

Contributor: Eliza Margaretha

L<KorAP::XML::Krill> is developed as part of the L<KorAP|http://korap.ids-mannheim.de/>
Corpus Analysis Platform at the
L<Institute for the German Language (IDS)|http://ids-mannheim.de/>,
member of the
L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/en/about-us/leibniz-competition/projekte-2011/2011-funding-line-2/>.

This program is free software published under the
L<BSD-2 License|https://raw.githubusercontent.com/KorAP/KorAP-XML-Krill/master/LICENSE>.

=cut