=pod

=encoding utf8

=head1 NAME

korapxml2krill - Merge KorapXML data and create Krill documents


=head1 SYNOPSIS

  $ korapxml2krill -z --input <directory> --output <filename>
  $ korapxml2krill archive -z --input <directory> --output <directory>
  $ korapxml2krill extract --input <directory> --output <filename> --sigle <SIGLE>


=head1 DESCRIPTION

L<KorAP::XML::Krill> is a library to convert KorAP-XML documents to files
compatible with the L<Krill|https://github.com/KorAP/Krill> indexer.
The C<korapxml2krill> command line tool is a simple wrapper around the library.


=head1 INSTALLATION

The preferred way to install L<KorAP::XML::Krill> is to use L<cpanm|App::cpanminus>.

  $ cpanm https://github.com/KorAP/KorAP-XML-Krill.git

If everything went well, the C<korapxml2krill> tool will
be available on your command line immediately.
L<KorAP::XML::Krill> requires at least Perl 5.14.
In addition, the C<unzip> tool needs to be present to work with zip archives.
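
The two prerequisites named above can be checked before installing.
The following is a minimal sketch in POSIX shell; the helper name C<have_tool>
is made up for this illustration and is not part of L<KorAP::XML::Krill>.

```shell
# have_tool is a hypothetical helper (not part of KorAP::XML::Krill):
# it succeeds if the named program can be found on the PATH.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

# unzip is only needed when zip archives are processed.
if ! have_tool unzip; then
  echo "unzip not found: zip archives cannot be processed" >&2
fi

# 'require 5.014' makes perl exit with an error on anything older than 5.14.
if ! have_tool perl; then
  echo "perl not found" >&2
elif ! perl -e 'require 5.014' 2>/dev/null; then
  echo "perl is older than 5.14" >&2
fi
```

The script only reports problems on C<STDERR> and prints nothing when both prerequisites are met.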
Akronc13a1702016-03-15 19:33:14 +010034
35=head1 ARGUMENTS
36
37=over 2
38
39=item B<archive>
40
41Process an archive as a Zip-file or a folder of KorAP-XML documents.
42
43=item B<extract>
44
45Extract KorAP-XML files from a Zip-file.
46
47=back
48
49
=head1 OPTIONS

=over 2

=item B<--input|-i> <directory|file|files>

Directory or archive file of documents to convert.

Archiving supports multiple input archives with the constraint
that the first archive listed contains all primary data files
and all meta data files.

  -i file/news.zip -i file/news.malt.zip -i "#file/news.tt.zip"

(The directory structure follows the base directory format,
which may include a C<.> root folder.
In this case further archives lacking a C<.> root folder
need to be passed with a hash sign in front of the archive's name.
Quote such arguments, as most shells treat a word starting with C<#> as a comment.)

B<To support zip files, a version of C<unzip> needs to be installed that is
compatible with the archive file.>

B<The root folder switch is experimental and may vanish in future versions.>

=item B<--output|-o> <directory|file>

Output folder for archive processing, or
document name for single output (optional).
Writes to C<STDOUT> by default
(in case C<output> is not mandatory due to further options).

=item B<--overwrite|-w>

Overwrite files that already exist.

=item B<--token|-t> <foundry>[#<file>]

Define the default tokenization by specifying
the name of the foundry and optionally the name
of the layer-file. Defaults to C<OpenNLP#tokens>.

=item B<--skip|-s> <foundry>[#<layer>]

Skip specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>. Alternatively you can skip C<#ALL>.
Can be set multiple times.

=item B<--anno|-a> <foundry>#<layer>

Convert specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>.
Can be set multiple times.

=item B<--primary|-p>

Output primary data or not. Defaults to C<true>.
Can be flagged using C<--no-primary> as well.
This is I<deprecated>.

=item B<--jobs|-j>

Define the number of concurrent jobs in separate forks
for archive processing.
Defaults to C<0> (everything runs in a single process).
This is I<experimental>.

=item B<--meta|-m>

Define the metadata parser to use. Defaults to C<I5>.
Metadata parsers can be defined in the C<KorAP::XML::Meta> namespace.
This is I<experimental>.

=item B<--pretty|-y>

Pretty print JSON output. Defaults to C<false>.
This is I<deprecated>.

=item B<--gzip|-z>

Compress the output.
Expects a defined C<output> file in single processing.

=item B<--cache|-c>

File to mmap a cache (using L<Cache::FastMmap>).
Defaults to C<korapxml2krill.cache> in the calling directory.

=item B<--cache-size|-cs>

Size of the cache. Defaults to C<50m>.

=item B<--cache-init|-ci>

Initialize cache file.
Can be flagged using C<--no-cache-init> as well.
Defaults to C<true>.

=item B<--cache-delete|-cd>

Delete cache file after processing.
Can be flagged using C<--no-cache-delete> as well.
Defaults to C<true>.

=item B<--sigle|-sg>

Extract the given text sigles.
Can be set multiple times.
I<Currently only supported on C<extract>.>
Sigles have the structure C<Corpus>/C<Document>/C<Text>.

=item B<--log|-l>

The L<Log4perl> log level, defaults to C<ERROR>.

=item B<--help|-h>

Print this document.

=item B<--version|-v>

Print version information.

=back
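
Several option values listed above start with or contain a hash sign
(C<#file/news.tt.zip>, C<#ALL>, C<Mate#Morpho>).
In POSIX shells a word that begins with C<#> starts a comment, so such values
need quoting to reach the program.
The sketch below illustrates this behaviour; C<show_args> is a hypothetical
stand-in that only prints its arguments and does not call C<korapxml2krill>.

```shell
# show_args is a hypothetical stand-in for korapxml2krill:
# it prints every argument it receives on its own line.
show_args() {
  printf '%s\n' "$@"
}

# Quoted: the hash sign is passed through as part of the value.
show_args -i file/news.zip -i '#file/news.tt.zip'
show_args --skip '#ALL'

# A hash sign inside a word is harmless, so this needs no quoting.
show_args --skip Mate#Morpho

# Unquoted at the start of a word: the shell drops the rest of the line
# as a comment, so only "-i" is passed.
show_args -i #file/news.tt.zip
```

Single and double quotes work equally well here, since the shell removes
them before the program sees the value.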

=head1 ANNOTATION SUPPORT

L<KorAP::XML::Krill> has built-in importers for some annotation foundries and layers
developed in the KorAP project that are part of the KorAP preprocessing pipeline.
The base foundry with paragraphs, sentences, and the text element is mandatory for
L<Krill|https://github.com/KorAP/Krill>.

=over 2

=item B<Base>

=over 4

=item #Paragraphs

=item #Sentences

=back

=item B<Connexor>

=over 4

=item #Morpho

=item #Phrase

=item #Sentences

=item #Syntax

=back

=item B<CoreNLP>

=over 4

=item #Constituency

=item #Morpho

=item #NamedEntities

=item #Sentences

=back

=item B<DeReKo>

=over 4

=item #Structure

=back

=item B<Glemm>

=over 4

=item #Morpho

=back

=item B<Mate>

=over 4

=item #Dependency

=item #Morpho

=back

=item B<OpenNLP>

=over 4

=item #Morpho

=item #Sentences

=back

=item B<Sgbr>

=over 4

=item #Lemma

=item #Morpho

=back

=item B<TreeTagger>

=over 4

=item #Morpho

=item #Sentences

=back

=item B<XIP>

=over 4

=item #Constituency

=item #Morpho

=item #Sentences

=back

=back

More importers are in preparation.
New annotation importers can be defined in the C<KorAP::XML::Annotation> namespace.
See the built-in annotation importers as examples.

=head1 AVAILABILITY

  https://github.com/KorAP/KorAP-XML-Krill


=head1 COPYRIGHT AND LICENSE

Copyright (C) 2015-2016, L<IDS Mannheim|http://www.ids-mannheim.de/>

Author: L<Nils Diewald|http://nils-diewald.de/>

L<KorAP::XML::Krill> is developed as part of the L<KorAP|http://korap.ids-mannheim.de/>
Corpus Analysis Platform at the
L<Institute for the German Language (IDS)|http://ids-mannheim.de/>,
member of the
L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/en/about-us/leibniz-competition/projekte-2011/2011-funding-line-2/>.

This program is free software published under the
L<BSD-2 License|https://raw.githubusercontent.com/KorAP/KorAP-XML-Krill/master/LICENSE>.

=cut