__END__

=pod

=encoding utf8

=head1 NAME

korapxml2krill - Merge KorAP-XML data and create Krill documents


=head1 SYNOPSIS

  $ korapxml2krill [archive|extract] --input <directory|archive> [options]


=head1 DESCRIPTION

L<KorAP::XML::Krill> is a library to convert KorAP-XML documents to files
compatible with the L<Krill|https://github.com/KorAP/Krill> indexer.
The C<korapxml2krill> command line tool is a simple wrapper of this library.


=head1 INSTALLATION

The preferred way to install L<KorAP::XML::Krill> is to use L<cpanm|App::cpanminus>.

  $ cpanm https://github.com/KorAP/KorAP-XML-Krill.git

If everything went well, the C<korapxml2krill> tool will
be available on your command line immediately.
The minimum requirement for L<KorAP::XML::Krill> is Perl 5.32.
Optionally installing L<Archive::Tar::Builder> speeds up archive building.
If L<Sys::Info> is installed, it is used to calculate the number of available cores.
In addition, the C<unzip> tool needs to be present to work with zip archives.

=head1 ARGUMENTS

  $ korapxml2krill -z --input <directory> --output <filename>

Without arguments, C<korapxml2krill> converts a directory of a single KorAP-XML document.
It expects the input to point to the text level folder.

=over 2

=item B<archive>

  $ korapxml2krill archive -z --input <directory|archive> --output <directory|tar>

Converts an archive of KorAP-XML documents. It expects a directory
(pointing to the corpus level folder) or one or more zip files as input.

=item B<extract>

  $ korapxml2krill extract --input <archive> --output <directory> --sigle <SIGLE>

Extracts KorAP-XML documents from a zip file.

=item B<serial>

  $ korapxml2krill serial -i <archive1> -i <archive2> -o <directory> -cfg <config-file>

Converts archives sequentially. The inputs are not merged but treated
as they are (so they may be premerged or globs).
The C<--output> directory is treated as the base directory where subdirectories
are created based on the archive name. In case the C<--to-tar> flag is given,
the output will be a tar file.


=item B<slimlog>

  $ korapxml2krill slimlog <logfile> > <logfile-slim>

Filters out all irrelevant (i.e. successful) information from logs to simplify
log checks. Expects no further options.


=back


=head1 OPTIONS

=over 2

=item B<--input|-i> <directory|zip file>

Directory or zip file(s) of documents to convert.

Without arguments, C<korapxml2krill> expects a folder of a single KorAP-XML
document, while C<archive> expects a KorAP-XML corpus folder or a zip
file to batch process multiple files.
C<extract> expects zip files only.

C<archive> supports multiple input zip files with the constraint
that the first archive listed contains all primary data files
and all metadata files.

  -i file/news.zip -i file/news.malt.zip -i "#file/news.tt.zip"

Input may also be defined using BSD glob wildcards.

  -i 'file/news*.zip'

The expanded input array will be sorted by length, so the shortest
path needs to contain all primary data files and all metadata files.
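The ordering rule above can be illustrated with a minimal Python sketch. It is not part of C<korapxml2krill>; the function name C<order_inputs> is hypothetical, and the handling of paths of equal length is an assumption, since it is not documented here.

```python
def order_inputs(paths):
    """Return the expanded input paths, shortest path first.

    Assumption: paths of equal length keep their original order
    (Python's sort is stable); the tool's tie-breaking is not documented.
    """
    return sorted(paths, key=len)


# Paths as they might come out of a glob like 'file/news*.zip'.
expanded = ["file/news.malt.zip", "file/news.zip", "file/news.tt.zip"]
ordered = order_inputs(expanded)

# The first entry is the archive that has to hold primary data and metadata.
print(ordered[0])
```

With these sample paths, the base archive C<file/news.zip> ends up first because its path is the shortest.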

(The directory structure follows the base directory format,
which may include a C<.> root folder.
In this case, further archives lacking a C<.> root folder
need to be passed with a hash sign in front of the archive's name.
This may require quoting the parameter.)

To support zip files, a version of C<unzip> needs to be installed that is
compatible with the archive file.

B<The root folder switch using the hash sign is experimental and
may vanish in future versions.>


=item B<--input-base|-ib> <directory>

The base directory for inputs.


=item B<--output|-o> <directory|file>

Output folder for archive processing or
document name for single output (optional).
Writes to C<STDOUT> by default
(in case C<output> is not mandatory due to further options).

=item B<--overwrite|-w>

Overwrite files that already exist.


=item B<--token|-t> <foundry>#<file>

Define the default tokenization by specifying
the name of the foundry and optionally the name
of the layer-file. Defaults to C<OpenNLP#tokens>.
This will directly take the file instead of running
the layer implementation!


=item B<--base-sentences|-bs> <foundry>#<layer>

Define the layer for base sentences.
If given, this will be used instead of C<Base#Sentences>.
Currently C<DeReKo#Structure> and C<DGD#Structure> are the only additional
layers supported.

  Defaults to unset.


=item B<--base-paragraphs|-bp> <foundry>#<layer>

Define the layer for base paragraphs.
If given, this will be used instead of C<Base#Paragraphs>.
Currently C<DeReKo#Structure> and C<DGD#Structure> are the only additional
layers supported.

  Defaults to unset.


=item B<--base-pagebreaks|-bpb> <foundry>#<layer>

Define the layer for base pagebreaks.
Currently C<DeReKo#Structure> is the only layer supported.

  Defaults to unset.


=item B<--skip|-s> <foundry>[#<layer>]

Skip specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>. Alternatively you can skip C<#ALL>.
Can be set multiple times.
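The C<< <foundry>[#<layer>] >> notation can be read as sketched below in Python. This only illustrates the notation and is not the tool's own parser; the names C<AnnotationSpec> and C<parse_spec> are hypothetical, and representing C<#ALL> as an empty foundry is an assumption.

```python
from typing import NamedTuple, Optional


class AnnotationSpec(NamedTuple):
    foundry: Optional[str]  # None stands for the special "#ALL" value (assumption)
    layer: Optional[str]    # None means no layer was given


def parse_spec(value: str) -> AnnotationSpec:
    """Split a '<foundry>[#<layer>]' value at the first '#'."""
    if value == "#ALL":
        return AnnotationSpec(None, None)
    foundry, _, layer = value.partition("#")
    if not foundry:
        raise ValueError(f"missing foundry in {value!r}")
    return AnnotationSpec(foundry, layer or None)


for raw in ("Mate", "Mate#Morpho", "#ALL"):
    print(raw, "->", parse_spec(raw))
```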


=item B<--anno|-a> <foundry>#<layer>

Convert specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>.
Can be set multiple times.


=item B<--non-word-tokens|-nwt>

Tokenize non-word tokens like word tokens (defined as matching
C</[\d\w]/>). Useful to treat punctuation marks as tokens.

  Defaults to unset.
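The word token definition can be tried out with the following Python sketch. It reuses the character class from the text; C<is_word_token> is a hypothetical helper, and Perl and Python may classify a few Unicode characters differently, so this is an approximation.

```python
import re

# Same character class as in the text: a token counts as a word token
# if it contains at least one digit or word character.
WORD_CHAR = re.compile(r"[\d\w]")


def is_word_token(token: str) -> bool:
    return WORD_CHAR.search(token) is not None


tokens = ["Der", "Hund", ",", "der", "bellt", "."]
non_word = [t for t in tokens if not is_word_token(t)]

# These are the tokens that only become part of the token stream
# when --non-word-tokens is set.
print(non_word)
```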


=item B<--non-verbal-tokens|-nvt>

Tokenize non-verbal tokens, marked in the primary data with
the Unicode symbol 'Black Vertical Rectangle' (U+25AE).

  Defaults to unset.


=item B<--jobs|-j>

Define the number of spawned forks for concurrent jobs
of archive processing.
Defaults to C<0> (everything runs in a single process).

If C<sequential-extraction> is not set to true, this will
also apply to extraction.

Pass C<-1>, and the value will be set automatically to 5
times the number of available cores, in case L<Sys::Info>
is available and can read the CPU count (see C<--job-count>).
Be aware that the report of available cores
may not work under certain conditions. Benchmarking the processing
speed based on the number of jobs may be valuable.

This is I<experimental>.
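The arithmetic behind C<-1> can be sketched in Python as follows. This is an illustration only: C<resolve_jobs> is a hypothetical name, C<os.cpu_count()> stands in for L<Sys::Info>, and the fallback to C<0> when no core count is available is an assumption.

```python
import os
from typing import Optional


def resolve_jobs(jobs: int, cores: Optional[int] = None) -> int:
    """Resolve a --jobs value; -1 means 5 times the available cores.

    Assumption: if the core count cannot be determined, fall back to 0
    (a single process). The tool's actual fallback is not documented here.
    """
    if jobs != -1:
        return jobs
    if cores is None:
        cores = os.cpu_count()  # may return None on some systems
    return 5 * cores if cores else 0


print(resolve_jobs(-1, cores=4))  # 5 * 4 jobs
print(resolve_jobs(3))            # explicit values are kept as they are
```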


=item B<--job-count|-jc>

Print job and core information that would be used if
C<-1> was passed to C<--jobs>.


=item B<--koral|-k>

Version of the output format. Supported versions are:
C<0> for legacy serialization, C<0.03> for serialization
with metadata fields as key-values on the root object,
C<0.4> for serialization with metadata fields as a list
of C<"@type":"koral:field"> objects.

Currently defaults to C<0.03>.
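The difference between the C<0.03> and C<0.4> styles can be pictured with the following Python sketch. Only the C<"@type":"koral:field"> marker is taken from the text; the member names C<key> and C<value>, the function C<fields_as_list> and the sample metadata are hypothetical, so the real serialization may look different.

```python
def fields_as_list(metadata: dict) -> list:
    """Turn root-level key-values into a list of field objects.

    Only the '@type' marker comes from the text above; 'key' and 'value'
    are hypothetical member names.
    """
    return [
        {"@type": "koral:field", "key": key, "value": value}
        for key, value in metadata.items()
    ]


# Made-up metadata in the key-value style (as described for 0.03).
root_style = {"author": "Jane Doe", "pubDate": "2024-01-01"}

# The same information as a list of field objects (as described for 0.4).
list_style = fields_as_list(root_style)
print(list_style)
```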


=item B<--sequential-extraction|-se>

Flag to indicate that extraction runs sequentially,
i.e. that the C<jobs> value does not apply to extraction.
Some systems may have problems with extracting multiple archives
to the same folder at the same time.
Can be flagged using C<--no-sequential-extraction> as well.
Defaults to C<false>.


=item B<--meta|-m>

Define the metadata parser to use. Defaults to C<I5>.
Metadata parsers can be defined in the C<KorAP::XML::Meta> namespace.
This is I<experimental>.


=item B<--gzip|-z>

Compress the output.
Expects a defined C<output> file in single processing.


=item B<--cache|-c>

File to mmap a cache (using L<Cache::FastMmap>).
Defaults to C<korapxml2krill.cache> in the calling directory.


=item B<--cache-size|-cs>

Size of the cache. Defaults to C<50m>.


=item B<--cache-init|-ci>

Initialize cache file.
Can be flagged using C<--no-cache-init> as well.
Defaults to C<true>.


=item B<--cache-delete|-cd>

Delete cache file after processing.
Can be flagged using C<--no-cache-delete> as well.
Defaults to C<true>.


=item B<--config|-cfg>

Configure the parameters of your call in a file
of key-value pairs separated by whitespace.

  overwrite 1
  token DeReKo#Structure
  ...

Supported parameters are:
C<overwrite>, C<gzip>, C<jobs>, C<input-base>,
C<token>, C<log>,
C<cache>, C<cache-size>, C<cache-init>, C<cache-delete>, C<meta>,
C<output>, C<koral>,
C<temporary-extract>, C<sequential-extraction>,
C<base-sentences>, C<base-paragraphs>,
C<base-pagebreaks>,
C<skip> (semicolon separated), C<sigle>
(semicolon separated), C<anno> (semicolon separated).

Configuration parameters will always be overwritten by
passed parameters.
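The configuration format and the precedence rule can be sketched in Python as follows. This is not the tool's parser: C<parse_config> and C<effective_options> are hypothetical names, all values are kept as plain strings, and comment lines are not handled because the text does not mention them.

```python
LIST_KEYS = {"skip", "sigle", "anno"}  # semicolon separated, per the text


def parse_config(text: str) -> dict:
    """Parse lines of 'key value' pairs separated by whitespace."""
    config = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # split at the first run of whitespace
        if not parts:
            continue  # skip empty lines
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        if key in LIST_KEYS:
            config[key] = [v.strip() for v in value.split(";") if v.strip()]
        else:
            config[key] = value
    return config


def effective_options(config: dict, passed: dict) -> dict:
    """Passed parameters always win over configuration parameters."""
    return {**config, **passed}


sample = """
overwrite 1
token DeReKo#Structure
skip Mate#Morpho;XIP
"""
options = effective_options(parse_config(sample), {"token": "OpenNLP#tokens"})
print(options)
```

In this example the C<token> value from the file is replaced by the value passed on the command line, while C<overwrite> and C<skip> are taken from the file.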


=item B<--temporary-extract|-te>

Only valid for the C<archive> and C<serial>
commands.

This will first extract all files into a
directory and then process the extracted files.
If the directory is given as C<:temp:>,
a temporary directory is used.
This is especially useful to avoid
massive unzipping and potential
network latency.


=item B<--to-tar>

Only valid for the C<archive> command.

Writes the output into a tar archive.


=item B<--sigle|-sg>

Extract the given texts.
Can be set multiple times.
I<Currently only supported on C<extract>.>
Sigles have the structure C<Corpus>/C<Document>/C<Text>.
In case the C<Text> path is omitted, the whole document will be extracted.
On the document level, the postfix wildcard C<*> is supported.
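How a C<--sigle> value selects texts can be sketched in Python as follows. The function C<sigle_matches> is hypothetical and only models the rules stated above; it assumes that the wildcard is used only when the C<Text> part is omitted.

```python
def sigle_matches(pattern: str, text_sigle: str) -> bool:
    """Check a full 'Corpus/Document/Text' sigle against a --sigle value."""
    corpus, document, text = text_sigle.split("/")
    parts = pattern.split("/")
    if len(parts) == 3:
        # Full sigle: exactly one text is selected.
        return parts == [corpus, document, text]
    if len(parts) == 2:
        # Text part omitted: the whole document is selected.
        want_corpus, want_doc = parts
        if want_corpus != corpus:
            return False
        if want_doc.endswith("*"):
            # Postfix wildcard on the document level.
            return document.startswith(want_doc[:-1])
        return want_doc == document
    return False


print(sigle_matches("WPD17/06*", "WPD17/060/18486"))
```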

=item B<--lang>

Preferred language for metadata fields. In case multiple titles are
given (on any level) with different C<xml:lang> attributes,
the language given is preferred.
Because titles may have different sources and different priorities,
non-specific language titles may still be preferred in case the title
source has a higher priority.


=item B<--log|-l>

The L<Log::Any> log level, defaults to C<ERROR>.


=item B<--quiet>

Silence all information (non-log) outputs.


=item B<--help|-h>

Print help information.


=item B<--version|-v>

Print version information.

=back


=head1 ANNOTATION SUPPORT

L<KorAP::XML::Krill> has built-in importers for some annotation foundries and layers
developed in the KorAP project that are part of the KorAP preprocessing pipeline.
The base foundry with paragraphs, sentences, and the text element is mandatory for
L<Krill|https://github.com/KorAP/Krill>.

  Base
    #Paragraphs
    #Sentences

  Connexor
    #Morpho
    #Phrase
    #Sentences
    #Syntax

  CoreNLP
    #Constituency
    #Morpho
    #NamedEntities
    #Sentences

  CorpusExplorer
    #Morpho

  CMC
    #Morpho

  DeReKo
    #Structure

  DGD
    #Morpho
    #Structure

  DRuKoLa
    #Morpho

  Glemm
    #Morpho

  Gingko
    #Morpho

  HNC
    #Morpho

  LWC
    #Dependency

  Malt
    #Dependency

  MarMoT
    #Morpho

  Mate
    #Dependency
    #Morpho

  MDParser
    #Dependency

  NKJP
    #Morpho
    #NamedEntities

  OpenNLP
    #Morpho
    #Sentences

  RWK
    #Morpho
    #Structure

  Sgbr
    #Lemma
    #Morpho

  Spacy
    #Morpho

  Talismane
    #Dependency
    #Morpho

  TreeTagger
    #Morpho
    #Sentences

  UDPipe
    #Dependency
    #Morpho

  XIP
    #Constituency
    #Morpho
    #Sentences


More importers are in preparation.
New annotation importers can be defined in the C<KorAP::XML::Annotation> namespace.
See the built-in annotation importers as examples.


=head1 METADATA SUPPORT

L<KorAP::XML::Krill> has built-in importers for some metadata variants
developed in the KorAP project that are part of the KorAP preprocessing pipeline.

=over 2

=item I5 - Metadata for all I5 files

=item Sgbr - Metadata from the Schreibgebrauch project

=item Gingko - Metadata from the Gingko project in addition to I5

=item ICC - Metadata for the ICC in addition to I5

=item NKJP - Metadata for the NKJP corpora

=back

More importers are in preparation.
New metadata importers can be defined in the C<KorAP::XML::Meta> namespace.
See the built-in metadata importers as examples.


=head1 About KorAP-XML

KorAP-XML (Bański et al. 2012) is an implementation of the KorAP
data model (Bański et al. 2013), where text data are stored physically
separated from their interpretations (i.e. annotations).
A text document in KorAP-XML therefore consists of several files
containing primary data, metadata and annotations.

The structure of a single KorAP-XML document can be as follows:

  - data.xml
  - header.xml
  + base
    - tokens.xml
    - ...
  + struct
    - structure.xml
    - ...
  + corenlp
    - morpho.xml
    - constituency.xml
    - ...
  + tree_tagger
    - morpho.xml
    - ...
  - ...

The C<data.xml> contains the primary data, the C<header.xml> contains
the metadata, and the annotation layers are stored in subfolders
like C<base>, C<struct> or C<corenlp>
(so-called "foundries"; Bański et al. 2013).
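The layout of a single document can be explored with a short Python sketch like the following. It builds a throwaway folder that mirrors the structure shown above and lists the annotation files per subfolder; C<list_layers> is a hypothetical helper and not part of L<KorAP::XML::Krill>.

```python
import tempfile
from pathlib import Path


def list_layers(document_dir: Path) -> dict:
    """Map each annotation subfolder to the XML files it contains."""
    return {
        folder.name: sorted(f.name for f in folder.glob("*.xml"))
        for folder in sorted(document_dir.iterdir())
        if folder.is_dir()
    }


# Build a throwaway document folder that mirrors the layout above.
with tempfile.TemporaryDirectory() as tmp:
    doc = Path(tmp)
    for name in ("data.xml", "header.xml", "base/tokens.xml",
                 "struct/structure.xml", "corenlp/morpho.xml",
                 "corenlp/constituency.xml"):
        path = doc / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    layers = list_layers(doc)

print(layers)
```

The files C<data.xml> and C<header.xml> do not show up in the result, because they are stored on the document level and not in a subfolder.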

Metadata is available in the TEI-P5 variant I5
(Lüngen and Sperberg-McQueen 2012). See the documentation in
L<KorAP::XML::Meta::I5> for translatable fields.

Annotations correspond to a variant of the TEI-P5 feature structures
(TEI Consortium; Lee et al. 2004).
Annotation feature structures refer to character sequences of the primary text
inside the C<text> element of the C<data.xml>.
A single annotation containing the lemma of a token can have the following structure:

  <span from="0" to="3">
    <fs type="lex" xmlns="http://www.tei-c.org/ns/1.0">
      <f name="lex">
        <fs>
          <f name="lemma">zum</f>
        </fs>
      </f>
    </fs>
  </span>

The C<from> and C<to> attributes refer to the character span
in the primary text.
Depending on the kind of annotation (e.g. token-based, span-based, relation-based),
the structure may vary. See L<KorAP::XML::Annotation::*> for various
annotation preprocessors.
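How such a span annotation relates to the primary text can be sketched in Python as follows. This is not how L<KorAP::XML::Krill> reads annotations; C<read_lemma> is a hypothetical helper, the primary text is made up, and treating C<to> as an exclusive end offset is an assumption derived from the example (a three-letter lemma for the span 0 to 3).

```python
import xml.etree.ElementTree as ET

TEI = {"tei": "http://www.tei-c.org/ns/1.0"}

SPAN = """
<span from="0" to="3">
  <fs type="lex" xmlns="http://www.tei-c.org/ns/1.0">
    <f name="lex">
      <fs>
        <f name="lemma">zum</f>
      </fs>
    </f>
  </fs>
</span>
""".strip()


def read_lemma(span_xml: str):
    """Return (from, to, lemma) for a single span annotation."""
    span = ET.fromstring(span_xml)
    # The feature structure lives in the TEI namespace, the span does not.
    lemma = span.find(".//tei:f[@name='lemma']", TEI)
    return int(span.get("from")), int(span.get("to")), lemma.text


start, end, lemma = read_lemma(SPAN)
primary_text = "Zum letzten kulturellen Anlass"  # made-up primary data
surface = primary_text[start:end]  # assumes 'to' is an exclusive offset
print(surface, "->", lemma)
```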

Multiple KorAP-XML documents are organized on three levels following
the "IDS Textmodell" (Lüngen and Sperberg-McQueen 2012):
corpus E<gt> document E<gt> text. Metadata can be stored on each level;
C<korapxml2krill> merges it into a single metadata
object per text. A corpus is therefore structured as follows:

  + <corpus>
    - header.xml
    + <document>
      - header.xml
      + <text>
        - data.xml
        - header.xml
        - ...
    - ...

A single text can be identified by the concatenation of
the corpus identifier, the document identifier and the text identifier.
This identifier is called the text sigle
(e.g. a text with the identifier C<18486> in the document C<060> in the
corpus C<WPD17> has the text sigle C<WPD17/060/18486>, see C<--sigle>).
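The concatenation can be sketched in Python as follows. The sketch assumes that the folder names on the three levels are identical to the identifiers, which the text does not state explicitly; C<text_sigle> and the base folder C<corpora> are hypothetical.

```python
from pathlib import PurePosixPath


def text_sigle(text_dir: str) -> str:
    """Build the text sigle from the last three path components.

    Assumption: corpus, document and text folders are named after
    their identifiers.
    """
    corpus, document, text = PurePosixPath(text_dir).parts[-3:]
    return f"{corpus}/{document}/{text}"


sigle = text_sigle("corpora/WPD17/060/18486")
print(sigle)
```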

These corpora are often stored in zip files, which C<korapxml2krill>
can process. Corpora may also be split into multiple zip archives
(e.g. one zip file per foundry), which is also supported (see C<--input>).

Examples of KorAP-XML files are included in L<KorAP::XML::Krill>
in the form of a test suite.
The resulting JSON format merges all annotation layers
based on a single token stream.

=head2 References

Piotr Bański, Cyril Belica, Helge Krause, Marc Kupietz, Carsten Schnober, Oliver Schonefeld, and Andreas Witt (2011):
KorAP data model: first approximation, December.

Piotr Bański, Peter M. Fischer, Elena Frick, Erik Ketzan, Marc Kupietz, Carsten Schnober, Oliver Schonefeld and Andreas Witt (2012):
"The New IDS Corpus Analysis Platform: Challenges and Prospects",
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012).
L<PDF|http://www.lrec-conf.org/proceedings/lrec2012/pdf/789_Paper.pdf>

Piotr Bański, Elena Frick, Michael Hanl, Marc Kupietz, Carsten Schnober and Andreas Witt (2013):
"Robust corpus architecture: a new look at virtual collections and data access",
Corpus Linguistics 2013. Abstract Book. Lancaster: UCREL, pp. 23-25.
L<PDF|https://ids-pub.bsz-bw.de/frontdoor/deliver/index/docId/4485/file/Ba%c5%84ski_Frick_Hanl_Robust_corpus_architecture_2013.pdf>

Kiyong Lee, Lou Burnard, Laurent Romary, Eric de la Clergerie, Thierry Declerck,
Syd Bauman, Harry Bunt, Lionel Clément, Tomaz Erjavec, Azim Roussanaly and Claude Roux (2004):
"Towards an international standard on feature structure representation",
Proceedings of the fourth International Conference on Language Resources and Evaluation (LREC 2004),
pp. 373-376.
L<PDF|http://www.lrec-conf.org/proceedings/lrec2004/pdf/687.pdf>

Harald Lüngen and C. M. Sperberg-McQueen (2012):
"A TEI P5 Document Grammar for the IDS Text Model",
Journal of the Text Encoding Initiative, Issue 3 | November 2012.
L<PDF|https://journals.openedition.org/jtei/pdf/508>

TEI Consortium, eds:
"Feature Structures",
Guidelines for Electronic Text Encoding and Interchange.
L<html|https://www.tei-c.org/release/doc/tei-p5-doc/en/html/FS.html>

=head1 AVAILABILITY

  https://github.com/KorAP/KorAP-XML-Krill


=head1 COPYRIGHT AND LICENSE

Copyright (C) 2015-2024, L<IDS Mannheim|https://www.ids-mannheim.de/>

Author: L<Nils Diewald|https://www.nils-diewald.de/>

Contributors: Eliza Margaretha, Marc Kupietz

L<KorAP::XML::Krill> is developed as part of the L<KorAP|https://korap.ids-mannheim.de/>
Corpus Analysis Platform at the
L<Leibniz Institute for the German Language (IDS)|https://www.ids-mannheim.de/>,
member of the
L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/>.

This program is free software published under the
L<BSD-2 License|https://opensource.org/licenses/BSD-2-Clause>.

=cut