=pod

=encoding utf8

=head1 NAME

korapxml2krill - Merge KorAP-XML data and create Krill-friendly documents


=head1 SYNOPSIS

 $ korapxml2krill -z --input <directory> --output <filename>
 $ korapxml2krill archive -z --input <directory> --output <directory>
 $ korapxml2krill extract --input <directory> --output <filename> --sigle <SIGLE>


=head1 DESCRIPTION

L<KorAP::XML::Krill> is a library to convert KorAP-XML documents to files
compatible with the L<Krill|https://github.com/KorAP/Krill> indexer.
The C<korapxml2krill> command line tool is a simple wrapper around the library.


=head1 INSTALLATION

The preferred way to install L<KorAP::XML::Krill> is to use L<cpanm|App::cpanminus>.

 $ cpanm https://github.com/KorAP/KorAP-XML-Krill

If everything went well, the C<korapxml2krill> tool will
be available on your command line.
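
Whether the tool ended up on the command line can be checked from the shell.
The following is a minimal sketch and not part of the distribution:
C<check_tool> is a hypothetical helper that only tests whether a command
is reachable on the C<PATH>.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with KorAP::XML::Krill):
# report whether a command can be found on the PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# If the installation succeeded, this reports the tool as found.
check_tool korapxml2krill || echo "korapxml2krill is not on the PATH yet"
```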


=head1 ARGUMENTS

=over 2

=item B<archive>

Process an archive, given as a Zip-file or a folder of KorAP-XML documents.

=item B<extract>

Extract KorAP-XML files from a Zip-file.

=back
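
The calling conventions from the synopsis (single document, C<archive>,
and C<extract>) differ only in the leading argument and in C<--sigle>.
The sketch below prints the resulting command lines instead of running them.
C<krill_cmd>, the paths, and the sigle are hypothetical placeholders,
so adapt them to your own data.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print a korapxml2krill command line
# for the given mode instead of executing it.
krill_cmd() {
  mode=$1
  input=$2
  output=$3
  sigle=${4:-}
  case "$mode" in
    single)  echo "korapxml2krill --input $input --output $output" ;;
    archive) echo "korapxml2krill archive --input $input --output $output" ;;
    extract) echo "korapxml2krill extract --input $input --output $output --sigle $sigle" ;;
    *)       echo "unknown mode: $mode" >&2; return 1 ;;
  esac
}

# Placeholder arguments; none of these paths or sigles refer to real data.
krill_cmd single  input-path output-path
krill_cmd archive input-path output-path
krill_cmd extract input-path output-path TEXT-SIGLE
```

Printing the command first makes it easy to review the arguments
before starting a long-running archive conversion.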


=head1 OPTIONS

=over 2

=item B<--input|-i> <directory|file>

Directory or archive file of documents to index.

=item B<--output|-o> <directory|file>

Output folder for archive processing or
document name for single output (optional).
Writes to C<STDOUT> by default.

=item B<--overwrite|-w>

Overwrite files that already exist.

=item B<--token|-t> <foundry>[#<file>]

Define the default tokenization by specifying
the name of the foundry and optionally the name
of the layer-file. Defaults to C<OpenNLP#tokens>.

=item B<--skip|-s> <foundry>[#<layer>]

Skip specific foundries by specifying the foundry name,
or specific layers by appending the layer name
to the foundry name, separated by a C<#>,
e.g. C<Mate#Morpho>. Alternatively you can skip C<#ALL>.
Can be set multiple times.

=item B<--anno|-a> <foundry>#<layer>

Allow specific annotation foundries and layers by
combining the foundry name with a C<#> and the layer name.

=item B<--primary|-p>

Output primary data or not. Defaults to C<true>.
Can be negated with C<--no-primary>.
This is deprecated.

=item B<--jobs|-j>

Define the number of concurrent jobs in separate forks
for archive processing, defaults to C<0>.
This is experimental!

=item B<--human|-m>

Represent the data in an alternative human-readable format.
This is deprecated.

=item B<--pretty|-y>

Pretty print JSON output. Defaults to C<false>.

=item B<--gzip|-z>

Compress the output (expects a defined output file in single processing).

=item B<--sigle|-sg>

Extract the given text sigles.
Currently only supported on C<extract>.
Can be set multiple times.

=item B<--log|-l>

The L<Log4perl|Log::Log4perl> log level, defaults to C<ERROR>.

=item B<--help|-h>

Print this document.

=item B<--version|-v>

Print version information.

=back
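
The C<foundry#layer> notation accepted by C<--skip> and C<--anno> can be
illustrated with a short shell sketch. C<split_spec> is a hypothetical helper
that only demonstrates how such a value breaks down into its two parts.
It is an assumption for illustration and not the parsing code of the tool itself.

```shell
#!/bin/sh
# Hypothetical helper: split a "<foundry>#<layer>" specification into
# its two parts using POSIX parameter expansion. The layer part is
# empty when the value contains no "#".
split_spec() {
  spec=$1
  case "$spec" in
    *'#'*) foundry=${spec%%#*}; layer=${spec#*#} ;;
    *)     foundry=$spec; layer= ;;
  esac
  echo "foundry=$foundry layer=$layer"
}

split_spec 'Mate#Morpho'   # prints: foundry=Mate layer=Morpho
split_spec 'OpenNLP'       # prints: foundry=OpenNLP layer=
split_spec '#ALL'          # prints: foundry= layer=ALL
```

When passing such values on the command line, quote them:
a word that begins with C<#>, such as C<#ALL>, is otherwise treated
as the start of a comment by most shells.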

=head1 ANNOTATION SUPPORT

L<KorAP::XML::Krill> has built-in importers for some annotation foundries and layers
developed in the KorAP project that are part of the KorAP preprocessing pipeline.
The base foundry with paragraphs, sentences, and the text element is mandatory for
L<Krill|https://github.com/KorAP/Krill>.

=over 2

=item B<Base>

=over 4

=item Paragraphs

=item Sentences

=back

=item B<Connexor>

=over 4

=item Morpho

=item Phrase

=item Sentences

=item Syntax

=back

=item B<CoreNLP>

=over 4

=item Constituency

=item Morpho

=item NamedEntities

=item Sentences

=back

=item B<DeReKo>

=over 4

=item Structure

=back

=item B<Glemm>

=over 4

=item Morpho

=back

=item B<Mate>

=over 4

=item Dependency

=item Morpho

=back

=item B<OpenNLP>

=over 4

=item Morpho

=item Sentences

=back

=item B<Sgbr>

=over 4

=item Lemma

=item Morpho

=back

=item B<TreeTagger>

=over 4

=item Morpho

=item Sentences

=back

=item B<XIP>

=over 4

=item Constituency

=item Morpho

=item Sentences

=back

=back

More importers are in preparation.
New annotation importers can be defined in the C<KorAP::XML::Annotation> namespace.
See the built-in annotation importers as examples.

=head1 AVAILABILITY

 https://github.com/KorAP/KorAP-XML-Krill


=head1 COPYRIGHT AND LICENSE

Copyright (C) 2015-2016, L<IDS Mannheim|http://www.ids-mannheim.de/>
Author: L<Nils Diewald|http://nils-diewald.de/>

L<KorAP::XML::Krill> is developed as part of the L<KorAP|http://korap.ids-mannheim.de/>
Corpus Analysis Platform at the
L<Institute for the German Language (IDS)|http://ids-mannheim.de/>,
member of the
L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/en/about-us/leibniz-competition/projekte-2011/2011-funding-line-2/>.

This program is free software published under the
L<BSD-2 License|https://raw.githubusercontent.com/KorAP/KorAP-XML-Krill/master/LICENSE>.

=cut
271=cut