=pod

=encoding utf8

=head1 NAME

korapxml2krill - Merge KorAP-XML data and create Krill documents


=head1 SYNOPSIS

  $ korapxml2krill -z --input <directory> --output <filename>
  $ korapxml2krill archive -z --input <directory> --output <directory>
  $ korapxml2krill extract --input <directory> --output <filename> --sigle <SIGLE>


=head1 DESCRIPTION

L<KorAP::XML::Krill> is a library to convert KorAP-XML documents to files
compatible with the L<Krill|https://github.com/KorAP/Krill> indexer.
The C<korapxml2krill> command line tool is a simple wrapper around the library.


=head1 INSTALLATION

The preferred way to install L<KorAP::XML::Krill> is to use L<cpanm|App::cpanminus>.

  $ cpanm https://github.com/KorAP/KorAP-XML-Krill

In case everything went well, the C<korapxml2krill> tool will
be available on your command line immediately.


=head1 ARGUMENTS

=over 2

=item B<archive>

Process an archive, given as a Zip file or a folder of KorAP-XML documents.

=item B<extract>

Extract KorAP-XML files from a Zip file.

=back


=head1 OPTIONS

=over 2

=item B<--input|-i> <directory|file>

Directory or archive file of documents to convert.

=item B<--output|-o> <directory|file>

Output folder for archive processing, or
document name for single output (optional).
Output is written to C<STDOUT> by default,
unless further options make C<--output> mandatory.

=item B<--overwrite|-w>

Overwrite files that already exist.

=item B<--token|-t> <foundry>[#<file>]

Define the default tokenization by specifying
the name of the foundry and optionally the name
of the layer-file. Defaults to C<OpenNLP#tokens>.

=item B<--skip|-s> <foundry>[#<layer>]

Skip specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>. Alternatively you can skip C<#ALL>.
Can be set multiple times.

=item B<--anno|-a> <foundry>#<layer>

Convert specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>.
Can be set multiple times.

=item B<--primary|-p>

Output primary data or not. Defaults to C<true>.
Use C<--no-primary> to disable it.
This is I<deprecated>.

=item B<--jobs|-j>

Define the number of concurrent jobs in separate forks
for archive processing.
Defaults to C<0>.
This is I<experimental>.

=item B<--meta|-m>

Define the metadata parser to use. Defaults to C<I5>.
Metadata parsers can be defined in the C<KorAP::XML::Meta> namespace.
This is I<experimental>.

=item B<--pretty|-y>

Pretty print JSON output. Defaults to C<false>.
This is I<deprecated>.

=item B<--gzip|-z>

Compress the output.
Expects a defined C<output> file in single processing.

=item B<--sigle|-sg>

Extract the given text sigles.
Can be set multiple times.
I<Currently only supported on C<extract>.>

=item B<--log|-l>

The L<Log4perl> log level, defaults to C<ERROR>.

=item B<--help|-h>

Print this document.

=item B<--version|-v>

Print version information.

=back
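
As a rough illustration of how the options above combine, the following
shell sketch assembles an C<archive> call with several C<--skip> values and
prints it instead of running it. The helper name C<build_archive_cmd>, the
paths C<corpus> and C<out>, and the chosen foundries are placeholders for
this example and are not part of the tool.

```shell
# Hypothetical helper: print (do not run) a korapxml2krill archive call.
# Usage: build_archive_cmd <input> <output-dir> [<foundry>[#<layer>] ...]
# Note: paths containing whitespace would need extra quoting.
build_archive_cmd() {
  input=$1
  output=$2
  shift 2
  cmd="korapxml2krill archive -z --input $input --output $output"
  # --skip can be set multiple times, once per foundry or foundry#layer
  for spec in "$@"; do
    cmd="$cmd --skip $spec"
  done
  printf '%s\n' "$cmd"
}

build_archive_cmd corpus out Mate 'XIP#Morpho'
```

Printing the command first makes it easy to review the repeated options
before starting a long-running conversion.
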

=head1 ANNOTATION SUPPORT

L<KorAP::XML::Krill> has built-in importers for some annotation foundries and layers
developed in the KorAP project that are part of the KorAP preprocessing pipeline.
The base foundry with paragraphs, sentences, and the text element is mandatory for
L<Krill|https://github.com/KorAP/Krill>.

=over 2

=item B<Base>

=over 4

=item #Paragraphs

=item #Sentences

=back

=item B<Connexor>

=over 4

=item #Morpho

=item #Phrase

=item #Sentences

=item #Syntax

=back

=item B<CoreNLP>

=over 4

=item #Constituency

=item #Morpho

=item #NamedEntities

=item #Sentences

=back

=item B<DeReKo>

=over 4

=item #Structure

=back

=item B<Glemm>

=over 4

=item #Morpho

=back

=item B<Mate>

=over 4

=item #Dependency

=item #Morpho

=back

=item B<OpenNLP>

=over 4

=item #Morpho

=item #Sentences

=back

=item B<Sgbr>

=over 4

=item #Lemma

=item #Morpho

=back

=item B<TreeTagger>

=over 4

=item #Morpho

=item #Sentences

=back

=item B<XIP>

=over 4

=item #Constituency

=item #Morpho

=item #Sentences

=back

=back

More importers are in preparation.
New annotation importers can be defined in the C<KorAP::XML::Annotation> namespace.
See the built-in annotation importers as examples.
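
The C<< <foundry>#<layer> >> notation used for the foundries and layers
listed above can be illustrated with a small shell sketch that splits such
a specification into its two parts. The helper name C<split_spec> is
hypothetical, and the sketch only demonstrates the notation; it is not how
the tool itself parses its arguments.

```shell
# Hypothetical helper: split "<foundry>[#<layer>]" into its two parts.
split_spec() {
  spec=$1
  foundry=${spec%%'#'*}            # text before the first '#'
  case $spec in
    *'#'*) layer=${spec#*'#'} ;;   # text after the first '#'
    *)     layer= ;;               # no layer given
  esac
  printf 'foundry=%s layer=%s\n' "$foundry" "$layer"
}

split_spec 'Mate#Morpho'
split_spec 'Mate'
```

A specification without a C<#> yields an empty layer, which corresponds to
naming a whole foundry such as C<Mate>.
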

=head1 AVAILABILITY

  https://github.com/KorAP/KorAP-XML-Krill


=head1 COPYRIGHT AND LICENSE

Copyright (C) 2015-2016, L<IDS Mannheim|http://www.ids-mannheim.de/>

Author: L<Nils Diewald|http://nils-diewald.de/>

L<KorAP::XML::Krill> is developed as part of the L<KorAP|http://korap.ids-mannheim.de/>
Corpus Analysis Platform at the
L<Institute for the German Language (IDS)|http://ids-mannheim.de/>,
member of the
L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/en/about-us/leibniz-competition/projekte-2011/2011-funding-line-2/>.

This program is free software published under the
L<BSD-2 License|https://raw.githubusercontent.com/KorAP/KorAP-XML-Krill/master/LICENSE>.

=cut