=pod

=encoding utf8

=head1 NAME

korapxml2krill - Merge KorAP-XML data and create Krill documents


=head1 SYNOPSIS

  $ korapxml2krill -z --input <directory> --output <filename>
  $ korapxml2krill archive -z --input <directory> --output <directory>
  $ korapxml2krill extract --input <directory> --output <filename> --sigle <SIGLE>


=head1 DESCRIPTION

L<KorAP::XML::Krill> is a library to convert KorAP-XML documents to files
compatible with the L<Krill|https://github.com/KorAP/Krill> indexer.
The C<korapxml2krill> command line tool is a simple wrapper around the library.


=head1 INSTALLATION

The preferred way to install L<KorAP::XML::Krill> is to use L<cpanm|App::cpanminus>.

  $ cpanm https://github.com/KorAP/KorAP-XML-Krill

If everything went well, the C<korapxml2krill> tool will
be available on your command line immediately.


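A quick way to confirm the installation is to call the tool with the
C<--version> or C<--help> option described under OPTIONS. This is only a
sketch and assumes that the directory into which C<cpanm> installs scripts
is part of your C<PATH>.

  # Print version information
  $ korapxml2krill --version

  # Print the documentation
  $ korapxml2krill --help
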
=head1 ARGUMENTS

=over 2

=item B<archive>

Process an archive, given as a Zip file or a folder of KorAP-XML documents.

=item B<extract>

Extract KorAP-XML files from a Zip file.

=back


=head1 OPTIONS

=over 2

=item B<--input|-i> <directory|file>

Directory or archive file of documents to convert.

=item B<--output|-o> <directory|file>

Output folder for archive processing, or
document name for single output (optional).
Writes to C<STDOUT> by default,
unless further options make C<output> mandatory.

=item B<--overwrite|-w>

Overwrite files that already exist.

=item B<--token|-t> <foundry>[#<file>]

Define the default tokenization by specifying
the name of the foundry and optionally the name
of the layer-file. Defaults to C<OpenNLP#tokens>.

=item B<--skip|-s> <foundry>[#<layer>]

Skip specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>. Alternatively you can skip C<#ALL>.
Can be set multiple times.

=item B<--anno|-a> <foundry>#<layer>

Convert specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>.
Can be set multiple times.

=item B<--primary|-p>

Output primary data or not. Defaults to C<true>.
Can be negated using C<--no-primary>.
This is I<deprecated>.

=item B<--jobs|-j>

Define the number of concurrent jobs in separate forks
for archive processing.
Defaults to C<0>.
This is I<experimental>.

=item B<--human|-m>

Represent the data in an alternative human readable format.
This is I<deprecated>.

=item B<--pretty|-y>

Pretty print JSON output. Defaults to C<false>.

=item B<--gzip|-z>

Compress the output.
Expects a defined C<output> file in single processing.

=item B<--sigle|-sg>

Extract the given text sigles.
Can be set multiple times.
I<Currently only supported on C<extract>.>

=item B<--log|-l>

The L<Log4perl> log level, defaults to C<ERROR>.

=item B<--help|-h>

Print this document.

=item B<--version|-v>

Print version information.

=back

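The options above can be combined in a single call. The following lines
are only a sketch: the placeholders in angle brackets follow the SYNOPSIS,
the job count C<4> and the log level C<DEBUG> are arbitrary example values,
and the exact behaviour may depend on the installed version.

  # Convert an archive in four forked jobs, compress the results and
  # overwrite files that already exist in the output folder
  $ korapxml2krill archive -z -w -j 4 --input <file> --output <directory>

  # Convert a single document, pretty print the JSON to STDOUT
  # and log more verbosely than the default
  $ korapxml2krill --pretty --log DEBUG --input <directory>

  # Extract several texts from a Zip file by repeating --sigle
  $ korapxml2krill extract --input <file> --output <filename> --sigle <SIGLE> --sigle <SIGLE>

The second call leaves out C<-z>, because compressed output expects
a defined C<output> file in single processing.
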
=head1 ANNOTATION SUPPORT

L<KorAP::XML::Krill> has built-in importers for some annotation foundries and layers
developed in the KorAP project that are part of the KorAP preprocessing pipeline.
The base foundry with paragraphs, sentences, and the text element is mandatory for
L<Krill|https://github.com/KorAP/Krill>.

=over 2

=item B<Base>

=over 4

=item #Paragraphs

=item #Sentences

=back

=item B<Connexor>

=over 4

=item #Morpho

=item #Phrase

=item #Sentences

=item #Syntax

=back

=item B<CoreNLP>

=over 4

=item #Constituency

=item #Morpho

=item #NamedEntities

=item #Sentences

=back

=item B<DeReKo>

=over 4

=item #Structure

=back

=item B<Glemm>

=over 4

=item #Morpho

=back

=item B<Mate>

=over 4

=item #Dependency

=item #Morpho

=back

=item B<OpenNLP>

=over 4

=item #Morpho

=item #Sentences

=back

=item B<Sgbr>

=over 4

=item #Lemma

=item #Morpho

=back

=item B<TreeTagger>

=over 4

=item #Morpho

=item #Sentences

=back

=item B<XIP>

=over 4

=item #Constituency

=item #Morpho

=item #Sentences

=back

=back

More importers are in preparation.
New annotation importers can be defined in the C<KorAP::XML::Annotation> namespace.
See the built-in annotation importers as examples.

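The foundry and layer names listed above appear to be the values that the
C<--anno> and C<--skip> options expect, as in the C<Mate#Morpho> example
given for those options. A minimal sketch, assuming the names are written
exactly as in the list; the placeholders in angle brackets follow the SYNOPSIS.

  # Request two specific layers by repeating --anno
  $ korapxml2krill --anno 'Mate#Morpho' --anno 'TreeTagger#Morpho' --input <directory> --output <filename>

  # Skip all XIP annotations
  $ korapxml2krill --skip XIP --input <directory> --output <filename>

The values are quoted because most shells treat a word that starts
with C<#> as a comment, which matters for a value such as C<'#ALL'>.
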
=head1 AVAILABILITY

  https://github.com/KorAP/KorAP-XML-Krill


=head1 COPYRIGHT AND LICENSE

Copyright (C) 2015-2016, L<IDS Mannheim|http://www.ids-mannheim.de/>

Author: L<Nils Diewald|http://nils-diewald.de/>

L<KorAP::XML::Krill> is developed as part of the L<KorAP|http://korap.ids-mannheim.de/>
Corpus Analysis Platform at the
L<Institute for the German Language (IDS)|http://ids-mannheim.de/>,
member of the
L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/en/about-us/leibniz-competition/projekte-2011/2011-funding-line-2/>.

This program is free software published under the
L<BSD-2 License|https://raw.githubusercontent.com/KorAP/KorAP-XML-Krill/master/LICENSE>.


=cut