=pod

=encoding utf8

=head1 NAME

korapxml2krill - Merge KorAP-XML data and create Krill documents


=head1 SYNOPSIS

  $ korapxml2krill [archive|extract] --input <directory|archive> [options]


=head1 DESCRIPTION

L<KorAP::XML::Krill> is a library to convert KorAP-XML documents to files
compatible with the L<Krill|https://github.com/KorAP/Krill> indexer.
The C<korapxml2krill> command line tool is a simple wrapper around this library.


=head1 INSTALLATION

The preferred way to install L<KorAP::XML::Krill> is to use L<cpanm|App::cpanminus>.

  $ cpanm https://github.com/KorAP/KorAP-XML-Krill.git

In case everything went well, the C<korapxml2krill> tool will
be available on your command line immediately.
Minimum requirement for L<KorAP::XML::Krill> is Perl 5.32.
Optionally installing L<Archive::Tar::Builder> speeds up archive building.
If L<Sys::Info> is installed, it is used to calculate the number of available cores.
In addition, to work with zip archives, the C<unzip> tool needs to be present.

=head1 ARGUMENTS

  $ korapxml2krill -z --input <directory> --output <filename>

Without one of the arguments listed below, C<korapxml2krill> converts
a directory of a single KorAP-XML document.
It expects the input to point to the text level folder.

=over 2

=item B<archive>

  $ korapxml2krill archive -z --input <directory|archive> --output <directory|tar>

Converts an archive of KorAP-XML documents. It expects a directory
(pointing to the corpus level folder) or one or more zip files as input.

=item B<extract>

  $ korapxml2krill extract --input <archive> --output <directory> --sigle <SIGLE>

Extracts KorAP-XML documents from a zip file.

=item B<serial>

  $ korapxml2krill serial -i <archive1> -i <archive2> -o <directory> -cfg <config-file>

Converts archives sequentially. The inputs are not merged but treated
as they are (so they may be premerged or globs).
The C<--output> directory is treated as the base directory where subdirectories
are created based on the archive name. In case the C<--to-tar> flag is given,
the output will be a tar file.


=item B<slimlog>

  $ korapxml2krill slimlog <logfile> > <logfile-slim>

Filters all messages about successful operations out of a log file
to simplify log checks. Expects no further options.


=back


=head1 OPTIONS

=over 2

=item B<--input|-i> <directory|zip file>

Directory or zip file(s) of documents to convert.

Without arguments, C<korapxml2krill> expects a folder of a single KorAP-XML
document, while C<archive> expects a KorAP-XML corpus folder or a zip
file to batch process multiple files.
C<extract> expects zip files only.

C<archive> supports multiple input zip files with the constraint
that the first archive listed contains all primary data files
and all meta data files.

  -i file/news.zip -i file/news.malt.zip -i "#file/news.tt.zip"

Input may also be defined using BSD glob wildcards.

  -i 'file/news*.zip'

The expanded input list will be sorted by path length, so the shortest
path needs to contain all primary data files and all meta data files.

(The directory structure follows the base directory format,
which may include a C<.> root folder.
In this case further archives lacking a C<.> root folder
need to be passed with a hash sign in front of the archive's name.
This may require quoting the parameter.)

To support zip files, a version of C<unzip> needs to be installed that is
compatible with the archive file.

B<The root folder switch using the hash sign is experimental and
may vanish in future versions.>

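The length-based ordering of glob-expanded inputs can be illustrated
with a short shell sketch. It is not taken from C<korapxml2krill> itself:
the helper name C<sort_by_length> is hypothetical, and how the tool
orders paths of equal length is not covered by this sketch.

```shell
  # Hypothetical helper: print the given paths ordered by string length,
  # shortest first, similar to the documented ordering of expanded inputs.
  sort_by_length() {
    for path in "$@"; do
      # Prefix every path with its length so sort can order numerically.
      printf '%s\t%s\n' "${#path}" "$path"
    done | sort -n | cut -f 2-
  }

  sort_by_length file/news.malt.zip file/news.tt.zip file/news.zip
```

In this sketch C<file/news.zip> is listed first, so it is the archive
that would have to contain all primary data and meta data files.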

=item B<--input-base|-ib> <directory>

The base directory for inputs.


=item B<--output|-o> <directory|file>

Output folder for archive processing, or document name
for single output (optional).
Writes to C<STDOUT> by default, unless further options
make C<output> mandatory.

=item B<--overwrite|-w>

Overwrite files that already exist.


=item B<--token|-t> <foundry>#<file>

Define the default tokenization by specifying
the name of the foundry and optionally the name
of the layer-file. Defaults to C<OpenNLP#tokens>.
This will directly take the file instead of running
the layer implementation!


=item B<--base-sentences|-bs> <foundry>#<layer>

Define the layer for base sentences.
If given, this will be used instead of using C<Base#Sentences>.
Currently C<DeReKo#Structure> and C<DGD#Structure> are the only additional
layers supported.

  Defaults to unset.


=item B<--base-paragraphs|-bp> <foundry>#<layer>

Define the layer for base paragraphs.
If given, this will be used instead of using C<Base#Paragraphs>.
Currently C<DeReKo#Structure> and C<DGD#Structure> are the only additional
layers supported.

  Defaults to unset.


=item B<--base-pagebreaks|-bpb> <foundry>#<layer>

Define the layer for base pagebreaks.
Currently C<DeReKo#Structure> is the only layer supported.

  Defaults to unset.


=item B<--skip|-s> <foundry>[#<layer>]

Skip specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>. Alternatively you can skip C<#ALL>.
Can be set multiple times.


=item B<--anno|-a> <foundry>#<layer>

Convert specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>.
Can be set multiple times.

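The C<foundry#layer> notation used by C<--skip> and C<--anno> can be
taken apart as in the following shell sketch. The helper name
C<split_anno> is hypothetical and only illustrates the notation;
it does not reproduce how C<korapxml2krill> parses these values.

```shell
  # Hypothetical helper: split "<foundry>#<layer>" into foundry and layer.
  split_anno() {
    spec=$1
    foundry=${spec%%'#'*}            # everything before the first "#"
    case $spec in
      *'#'*) layer=${spec#*'#'} ;;   # everything after the first "#"
      *)     layer= ;;               # no layer given
    esac
    printf 'foundry=%s layer=%s\n' "$foundry" "$layer"
  }

  split_anno 'Mate#Morpho'   # prints "foundry=Mate layer=Morpho"
  split_anno 'Mate'          # prints "foundry=Mate layer="
```

A value without a layer, such as C<Mate>, addresses the foundry as a whole.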

=item B<--non-word-tokens|-nwt>

Tokenize non-word tokens like word tokens (word tokens are defined
as matching C</[\d\w]/>). Useful to treat punctuation marks as tokens.

  Defaults to unset.


=item B<--non-verbal-tokens|-nvt>

Tokenize non-verbal tokens that are marked in the primary data with
the Unicode symbol 'Black Vertical Rectangle' (U+25AE).

  Defaults to unset.


=item B<--jobs|-j>

Define the number of concurrent jobs in separate forks
for archive processing.
Defaults to C<0> (everything runs in a single process).

If C<sequential-extraction> is not set to true, this will
also apply to extraction.

Pass -1, and the value will be set automatically to 5
times the number of available cores, in case L<Sys::Info>
is available.
This is I<experimental>.

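The arithmetic behind C<--jobs -1> can be sketched in shell as follows.
This is only an illustration: the tool relies on L<Sys::Info> to count
cores, whereas the sketch assumes that C<getconf _NPROCESSORS_ONLN>
works on the current system, and the helper name C<auto_jobs> is hypothetical.

```shell
  # Hypothetical helper: derive a job count as five times the core count.
  auto_jobs() {
    echo $(( $1 * 5 ))
  }

  # Assumption: getconf reports the number of online cores on this system;
  # fall back to 1 if it does not.
  cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
  auto_jobs "$cores"
```

On a machine with 4 cores the sketch prints C<20>.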

=item B<--koral|-k>

Version of the output format. Supported versions are:
C<0> for legacy serialization, C<0.03> for serialization
with metadata fields as key-values on the root object, and
C<0.4> for serialization with metadata fields as a list
of C<"@type":"koral:field"> objects.

Currently defaults to C<0.03>.


=item B<--sequential-extraction|-se>

Flag to extract archives sequentially, so that the C<jobs> value
does not apply to extraction.
Some systems may have problems with extracting multiple archives
to the same folder at the same time.
Can be negated using C<--no-sequential-extraction>.
Defaults to C<false>.


=item B<--meta|-m>

Define the metadata parser to use. Defaults to C<I5>.
Metadata parsers can be defined in the C<KorAP::XML::Meta> namespace.
This is I<experimental>.


=item B<--gzip|-z>

Compress the output.
Expects a defined C<output> file in single processing.


=item B<--cache|-c>

File to mmap a cache (using L<Cache::FastMmap>).
Defaults to C<korapxml2krill.cache> in the calling directory.


=item B<--cache-size|-cs>

Size of the cache. Defaults to C<50m>.


=item B<--cache-init|-ci>

Initialize cache file.
Can be negated using C<--no-cache-init>.
Defaults to C<true>.


=item B<--cache-delete|-cd>

Delete cache file after processing.
Can be negated using C<--no-cache-delete>.
Defaults to C<true>.


=item B<--config|-cfg>

Configure the parameters of your call in a file
of key-value pairs separated by whitespace:

  overwrite 1
  token DeReKo#Structure
  ...

Supported parameters are:
C<overwrite>, C<gzip>, C<jobs>, C<input-base>,
C<token>, C<log>, C<cache>, C<cache-size>, C<cache-delete>, C<meta>,
C<output>, C<koral>,
C<temporary-extract>, C<sequential-extraction>,
C<base-sentences>, C<base-paragraphs>,
C<base-pagebreaks>,
C<skip> (semicolon separated), C<sigle>
(semicolon separated), C<anno> (semicolon separated).

Configuration parameters will always be overwritten by
passed parameters.

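As an illustration of the file format, the following shell sketch writes
a small configuration file and reads values back from it. The reader
C<config_value> is a hypothetical helper and not the parser used by
C<korapxml2krill>; the parameter names are taken from the list above,
while the values are made up for the example.

```shell
  # Write a small configuration file that uses documented parameter names.
  conf=$(mktemp)
  printf '%s\n' \
    'overwrite 1' \
    'gzip 1' \
    'token DeReKo#Structure' \
    'skip Mate#Morpho;XIP' > "$conf"

  # Hypothetical reader: print the value stored for a key (first match wins).
  config_value() {
    awk -v key="$1" '$1 == key { print $2; exit }' "$2"
  }

  config_value token "$conf"                # prints "DeReKo#Structure"
  config_value skip "$conf" | tr ';' '\n'   # prints "Mate#Morpho" and "XIP"
  rm -f "$conf"
```

Such a file would be passed with C<--config>; options given on the
command line still take precedence over the values in the file.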

=item B<--temporary-extract|-te>

Only valid for the C<archive> and C<serial>
commands.

This will first extract all files into a
directory and then process them from there.
If the directory is given as C<:temp:>,
a temporary directory is used.
This is especially useful to avoid
massive unzipping and potential
network latency.


=item B<--to-tar>

Only valid for the C<archive> command.

Writes the output into a tar archive.


=item B<--sigle|-sg>

Extract the given texts.
Can be set multiple times.
I<Currently only supported on C<extract>.>
Sigles have the structure C<Corpus>/C<Document>/C<Text>.
In case the C<Text> path is omitted, the whole document will be extracted.
On the document level, the postfix wildcard C<*> is supported.

=item B<--lang>

Preferred language for metadata fields. In case multiple titles are
given (on any level) with different C<xml:lang> attributes,
the language given is preferred.
Because titles may have different sources and different priorities,
non-specific language titles may still be preferred in case the title
source has a higher priority.


=item B<--log|-l>

The L<Log::Any> log level, defaults to C<ERROR>.


=item B<--quiet>

Silence all information (non-log) outputs.


=item B<--help|-h>

Print help information.


=item B<--version|-v>

Print version information.

=back

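The matching rules described for C<--sigle> (a full text sigle,
a document without the text part, or a document prefix with a trailing
C<*>) can be sketched in shell as follows. The helper name
C<sigle_matches> is hypothetical, and the sketch only reflects the rules
as stated in this document; the actual matching in C<korapxml2krill>
may differ in detail.

```shell
  # Hypothetical helper: succeed if a text sigle is selected by a request.
  sigle_matches() {
    requested=$1
    text_sigle=$2
    case $requested in
      */*/*)
        # Corpus/Document/Text: the text sigle has to match exactly.
        [ "$text_sigle" = "$requested" ] ;;
      *'*')
        # Corpus/Document with postfix wildcard: compare document prefixes.
        doc=${text_sigle%/*}
        prefix=${requested%'*'}
        case $doc in
          "$prefix"*) return 0 ;;
          *)          return 1 ;;
        esac ;;
      *)
        # Corpus/Document: every text of that document is selected.
        [ "${text_sigle%/*}" = "$requested" ] ;;
    esac
  }

  sigle_matches 'WPD17/0*' 'WPD17/060/18486' && echo match
```

The call at the end prints C<match>, because the document part
C<WPD17/060> starts with the requested prefix C<WPD17/0>.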

=head1 ANNOTATION SUPPORT

L<KorAP::XML::Krill> has built-in importers for some annotation foundries and layers
developed in the KorAP project that are part of the KorAP preprocessing pipeline.
The base foundry with paragraphs, sentences, and the text element is mandatory for
L<Krill|https://github.com/KorAP/Krill>.

  Base
    #Paragraphs
    #Sentences

  Connexor
    #Morpho
    #Phrase
    #Sentences
    #Syntax

  CoreNLP
    #Constituency
    #Morpho
    #NamedEntities
    #Sentences

  CMC
    #Morpho

  DeReKo
    #Structure

  DGD
    #Morpho
    #Structure

  DRuKoLa
    #Morpho

  Glemm
    #Morpho

  Gingko
    #Morpho

  HNC
    #Morpho

  LWC
    #Dependency

  Malt
    #Dependency

  MarMoT
    #Morpho

  Mate
    #Dependency
    #Morpho

  MDParser
    #Dependency

  NKJP
    #Morpho
    #NamedEntities

  OpenNLP
    #Morpho
    #Sentences

  RWK
    #Morpho
    #Structure

  Sgbr
    #Lemma
    #Morpho

  Spacy
    #Morpho

  Talismane
    #Dependency
    #Morpho

  TreeTagger
    #Morpho
    #Sentences

  UDPipe
    #Dependency
    #Morpho

  XIP
    #Constituency
    #Morpho
    #Sentences


More importers are in preparation.
New annotation importers can be defined in the C<KorAP::XML::Annotation> namespace.
See the built-in annotation importers as examples.


=head1 METADATA SUPPORT

L<KorAP::XML::Krill> has built-in importers for some meta data variants
developed in the KorAP project that are part of the KorAP preprocessing pipeline.

=over 2

=item I5 - Meta data for all I5 files

=item Sgbr - Meta data from the Schreibgebrauch project

=item Gingko - Meta data from the Gingko project in addition to I5

=item ICC - Meta data for the ICC in addition to I5

=back

More importers are in preparation.
New meta data importers can be defined in the C<KorAP::XML::Meta> namespace.
See the built-in meta data importers as examples.


=head1 About KorAP-XML

KorAP-XML (Bański et al. 2012) is an implementation of the KorAP
data model (Bański et al. 2013), where text data are stored physically
separated from their interpretations (i.e. annotations).
A text document in KorAP-XML therefore consists of several files
containing primary data, metadata and annotations.

The structure of a single KorAP-XML document can be as follows:

  - data.xml
  - header.xml
  + base
    - tokens.xml
    - ...
  + struct
    - structure.xml
    - ...
  + corenlp
    - morpho.xml
    - constituency.xml
    - ...
  + tree_tagger
    - morpho.xml
    - ...
  - ...

The C<data.xml> contains the primary data, the C<header.xml> contains
the metadata, and the annotation layers are stored in subfolders
like C<base>, C<struct> or C<corenlp>
(so-called "foundries"; Bański et al. 2013).

Metadata is available in the TEI-P5 variant I5
(Lüngen and Sperberg-McQueen 2012). See the documentation in
L<KorAP::XML::Meta::I5> for translatable fields.

Annotations correspond to a variant of the TEI-P5 feature structures
(TEI Consortium; Lee et al. 2004).
Annotation feature structures refer to character sequences of the primary text
inside the C<text> element of the C<data.xml>.
A single annotation containing the lemma of a token can have the following structure:

  <span from="0" to="3">
    <fs type="lex" xmlns="http://www.tei-c.org/ns/1.0">
      <f name="lex">
        <fs>
          <f name="lemma">zum</f>
        </fs>
      </f>
    </fs>
  </span>

The C<from> and C<to> attributes refer to the character span
in the primary text.
Depending on the kind of annotation (e.g. token-based, span-based, relation-based),
the structure may vary. See L<KorAP::XML::Annotation::*> for various
annotation preprocessors.

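How the C<from> and C<to> offsets select a part of the primary text can
be sketched in shell. The sketch assumes that the offsets are zero-based
and that C<to> points behind the last character of the span, which fits
the three-letter example above; the helper name C<span_text> and the
sample text are made up, and the sketch is limited to ASCII text.

```shell
  # Hypothetical helper: cut the span between two offsets out of a text.
  span_text() {
    # awk's substr is 1-based, while the offsets are assumed to be 0-based.
    awk -v from="$1" -v to="$2" -v text="$3" \
      'BEGIN { print substr(text, from + 1, to - from) }'
  }

  span_text 0 3 'zum Beispiel'    # prints "zum"
  span_text 4 12 'zum Beispiel'   # prints "Beispiel"
```

An annotation therefore never has to repeat the primary text itself;
the offsets are sufficient to recover the annotated character sequence.
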
Multiple KorAP-XML documents are organized on three levels following
the "IDS Textmodell" (Lüngen and Sperberg-McQueen 2012):
corpus E<gt> document E<gt> text. On each level, metadata
can be stored, which C<korapxml2krill> will merge into a single metadata
object per text. A corpus is therefore structured as follows:

  + <corpus>
    - header.xml
    + <document>
      - header.xml
      + <text>
        - data.xml
        - header.xml
        - ...
    - ...

A single text can be identified by the concatenation of
the corpus identifier, the document identifier and the text identifier.
This identifier is called the text sigle
(e.g. a text with the identifier C<18486> in the document C<060> in the
corpus C<WPD17> has the text sigle C<WPD17/060/18486>, see C<--sigle>).

These corpora are often stored in zip files, which C<korapxml2krill>
can handle. Corpora may also be split into multiple zip archives
(e.g. one zip file per foundry), which is also supported (see C<--input>).

Examples of KorAP-XML files are included in L<KorAP::XML::Krill>
in the form of a test suite.
The resulting JSON format merges all annotation layers
based on a single token stream.

=head2 References

Piotr Bański, Cyril Belica, Helge Krause, Marc Kupietz, Carsten Schnober, Oliver Schonefeld, and Andreas Witt (2011):
KorAP data model: first approximation, December.

Piotr Bański, Peter M. Fischer, Elena Frick, Erik Ketzan, Marc Kupietz, Carsten Schnober, Oliver Schonefeld and Andreas Witt (2012):
"The New IDS Corpus Analysis Platform: Challenges and Prospects",
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012).
L<PDF|http://www.lrec-conf.org/proceedings/lrec2012/pdf/789_Paper.pdf>

Piotr Bański, Elena Frick, Michael Hanl, Marc Kupietz, Carsten Schnober and Andreas Witt (2013):
"Robust corpus architecture: a new look at virtual collections and data access",
Corpus Linguistics 2013. Abstract Book. Lancaster: UCREL, pp. 23-25.
L<PDF|https://ids-pub.bsz-bw.de/frontdoor/deliver/index/docId/4485/file/Ba%c5%84ski_Frick_Hanl_Robust_corpus_architecture_2013.pdf>

Kiyong Lee, Lou Burnard, Laurent Romary, Eric de la Clergerie, Thierry Declerck,
Syd Bauman, Harry Bunt, Lionel Clément, Tomaz Erjavec, Azim Roussanaly and Claude Roux (2004):
"Towards an international standard on feature structure representation",
Proceedings of the fourth International Conference on Language Resources and Evaluation (LREC 2004),
pp. 373-376.
L<PDF|http://www.lrec-conf.org/proceedings/lrec2004/pdf/687.pdf>

Harald Lüngen and C. M. Sperberg-McQueen (2012):
"A TEI P5 Document Grammar for the IDS Text Model",
Journal of the Text Encoding Initiative, Issue 3 | November 2012.
L<PDF|https://journals.openedition.org/jtei/pdf/508>

TEI Consortium, eds:
"Feature Structures",
Guidelines for Electronic Text Encoding and Interchange.
L<html|https://www.tei-c.org/release/doc/tei-p5-doc/en/html/FS.html>

=head1 AVAILABILITY

  https://github.com/KorAP/KorAP-XML-Krill


=head1 COPYRIGHT AND LICENSE

Copyright (C) 2015-2024, L<IDS Mannheim|https://www.ids-mannheim.de/>

Author: L<Nils Diewald|https://www.nils-diewald.de/>

Contributors: Eliza Margaretha, Marc Kupietz

L<KorAP::XML::Krill> is developed as part of the L<KorAP|https://korap.ids-mannheim.de/>
Corpus Analysis Platform at the
L<Leibniz Institute for the German Language (IDS)|https://www.ids-mannheim.de/>,
member of the
L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/>.

This program is free software published under the
L<BSD-2 License|https://opensource.org/licenses/BSD-2-Clause>.

=cut