=pod

=encoding utf8

=head1 NAME

korapxml2krill - Merge KorAP-XML data and create Krill documents


=head1 SYNOPSIS

  $ korapxml2krill [archive|extract] --input <directory|archive> [options]


=head1 DESCRIPTION

L<KorAP::XML::Krill> is a library to convert KorAP-XML documents to files
compatible with the L<Krill|https://github.com/KorAP/Krill> indexer.
The C<korapxml2krill> command line tool is a simple wrapper around this library.


=head1 INSTALLATION

The preferred way to install L<KorAP::XML::Krill> is to use L<cpanm|App::cpanminus>.

  $ cpanm https://github.com/KorAP/KorAP-XML-Krill.git

If everything went well, the C<korapxml2krill> tool will
be available on your command line immediately.
The minimum requirement for L<KorAP::XML::Krill> is Perl 5.32.
Optionally installing L<Archive::Tar::Builder> speeds up archive building.
If L<Sys::Info> is installed, it is used to determine the number of available cores.
In addition, to work with zip archives, the C<unzip> tool needs to be present.

=head1 ARGUMENTS

  $ korapxml2krill -z --input <directory> --output <filename>

Without arguments, C<korapxml2krill> converts a directory of a single KorAP-XML document.
It expects the input to point to the text level folder.

=over 2

=item B<archive>

  $ korapxml2krill archive -z --input <directory|archive> --output <directory|tar>

Converts an archive of KorAP-XML documents. It expects a directory
(pointing to the corpus level folder) or one or more zip files as input.

=item B<extract>

  $ korapxml2krill extract --input <archive> --output <directory> --sigle <SIGLE>

Extracts KorAP-XML documents from a zip file.

=item B<serial>

  $ korapxml2krill serial -i <archive1> -i <archive2> -o <directory> -cfg <config-file>

Converts archives serially. The inputs are not merged but treated
as they are (so they may be premerged or globs).
The C<--output> directory is treated as the base directory where subdirectories
are created based on the archive name. In case the C<--to-tar> flag is given,
the output will be a tar file.


=item B<slimlog>

  $ korapxml2krill slimlog <logfile> > <logfile-slim>

Filters out all uninformative (that is, successful) information from logs
to simplify log checks. Expects no further options.


=back


=head1 OPTIONS

=over 2

=item B<--input|-i> <directory|zip file>

Directory or zip file(s) of documents to convert.

Without arguments, C<korapxml2krill> expects a folder of a single KorAP-XML
document, while C<archive> expects a KorAP-XML corpus folder or a zip
file to batch process multiple files.
C<extract> expects zip files only.

C<archive> supports multiple input zip files with the constraint
that the first archive listed contains all primary data files
and all meta data files.

  -i file/news.zip -i file/news.malt.zip -i "#file/news.tt.zip"

Input may also be defined using BSD glob wildcards.

  -i 'file/news*.zip'

The expanded input list will be sorted by path length, so the shortest
path needs to contain all primary data files and all meta data files.

(The directory structure follows the base directory format
that may include a C<.> root folder.
In this case further archives lacking a C<.> root folder
need to be passed with a hash sign in front of the archive's name.
This may require quoting the parameter.)

To support zip files, a version of C<unzip> needs to be installed that is
compatible with the archive file.

B<The root folder switch using the hash sign is experimental and
may vanish in future versions.>

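As a rough illustration of the length-based ordering of glob matches
described above, the following shell sketch prints a set of matching paths
sorted by path length, shortest first. The archive names are hypothetical,
the files are empty placeholders, and the sketch only mimics the documented
ordering. It is not the code C<korapxml2krill> runs, and how ties between
equally long paths are resolved is not covered here.

```shell
# Hypothetical archive names, created empty just for this demonstration
tmp=$(mktemp -d)
touch "$tmp/news.zip" "$tmp/news.malt.zip" "$tmp/news.tt.zip"

# Prefix each match with its path length, sort numerically, drop the prefix
ordered=$(
  for f in "$tmp"/news*.zip; do
    printf '%d %s\n' "${#f}" "$f"
  done | sort -n | cut -d' ' -f2-
)
printf '%s\n' "$ordered"
```

The first path printed corresponds to the archive that would need to contain
all primary data files and all meta data files.
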

=item B<--input-base|-ib> <directory>

The base directory for inputs.


=item B<--output|-o> <directory|file>

Output folder for archive processing or
document name for single output (optional).
By default, output is written to C<STDOUT>
(unless further options make C<output> mandatory).

=item B<--overwrite|-w>

Overwrite files that already exist.


=item B<--token|-t> <foundry>[#<file>]

Define the default tokenization by specifying
the name of the foundry and optionally the name
of the layer-file. Defaults to C<OpenNLP#tokens>.
This will directly take the file instead of running
the layer implementation!


=item B<--base-sentences|-bs> <foundry>#<layer>

Define the layer for base sentences.
If given, this will be used instead of using C<Base#Sentences>.
Currently C<DeReKo#Structure> and C<DGD#Structure> are the only additional
layers supported.

  Defaults to unset.


=item B<--base-paragraphs|-bp> <foundry>#<layer>

Define the layer for base paragraphs.
If given, this will be used instead of using C<Base#Paragraphs>.
Currently C<DeReKo#Structure> and C<DGD#Structure> are the only additional
layers supported.

  Defaults to unset.


=item B<--base-pagebreaks|-bpb> <foundry>#<layer>

Define the layer for base pagebreaks.
Currently C<DeReKo#Structure> is the only layer supported.

  Defaults to unset.


=item B<--skip|-s> <foundry>[#<layer>]

Skip specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>. Alternatively you can skip C<#ALL>.
Can be set multiple times.

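Because C<#> starts a comment in most shells when it begins a word, a value
like C<#ALL> has to be quoted on the command line, while C<Mate#Morpho> does
not. The following sketch only counts shell arguments to show the difference.
It does not invoke C<korapxml2krill>.

```shell
# Unquoted: the shell treats "#ALL" as the start of a comment,
# so only one argument survives
set -- --skip #ALL
unquoted_count=$#

# Quoted: the value is passed through as a second argument
set -- --skip '#ALL'
quoted_count=$#
quoted_value=$2

# A hash sign in the middle of a word is taken literally
set -- --skip Mate#Morpho
midword_value=$2

echo "unquoted: $unquoted_count, quoted: $quoted_count ($quoted_value), mid-word: $midword_value"
```

The same quoting applies to input archives that are passed with a leading
hash sign.
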

=item B<--anno|-a> <foundry>[#<layer>]

Convert specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>.
Can be set multiple times.


=item B<--non-word-tokens|-nwt>

Tokenize non-word tokens like word tokens (defined as matching
C</[\d\w]/>). Useful to treat punctuation as tokens.

  Defaults to unset.


=item B<--non-verbal-tokens|-nvt>

Tokenize non-verbal tokens marked in the primary data with
the Unicode symbol 'Black Vertical Rectangle' (U+25AE).

  Defaults to unset.


=item B<--jobs|-j>

Define the number of spawned forks for concurrent jobs
of archive processing.
Defaults to C<0> (everything runs in a single process).

If C<sequential-extraction> is not set to true, this will
also apply to extraction.

If C<-1> is passed, the value is set automatically to 5
times the number of available cores, provided L<Sys::Info>
is available and can read the CPU count (see C<--job-count>).
Be aware that the report of available cores
may not work under certain conditions. Benchmarking the processing
speed for different numbers of jobs may be worthwhile.

This is I<experimental>.

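To get a feeling for the value that C<--jobs -1> would select, the rule
"5 times the number of available cores" can be approximated in the shell.
This is only a sketch: it counts cores with C<nproc> or C<getconf>, which is
an assumption and may differ from what L<Sys::Info> reports.
C<--job-count> prints the values the tool itself would use.

```shell
# Count the cores: nproc (GNU coreutils) if present, else getconf, else 1
if command -v nproc >/dev/null 2>&1; then
  cores=$(nproc)
else
  cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
fi

# The documented rule for "--jobs -1": five times the available cores
job_count=$((cores * 5))
echo "cores=$cores jobs=$job_count"
```
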

=item B<--job-count|-jc>

Print job and core information that would be used if
C<-1> was passed to C<--jobs>.


=item B<--koral|-k>

Version of the output format. Supported versions are:
C<0> for legacy serialization, C<0.03> for serialization
with metadata fields as key-values on the root object, and
C<0.4> for serialization with metadata fields as a list
of C<"@type":"koral:field"> objects.

Currently defaults to C<0.03>.


=item B<--sequential-extraction|-se>

Flag to indicate that archives are extracted sequentially,
so that the C<jobs> value does not apply to extraction.
Some systems may have problems with extracting multiple archives
to the same folder at the same time.
Can be flagged using C<--no-sequential-extraction> as well.
Defaults to C<false>.


=item B<--meta|-m>

Define the metadata parser to use. Defaults to C<I5>.
Metadata parsers can be defined in the C<KorAP::XML::Meta> namespace.
This is I<experimental>.


=item B<--gzip|-z>

Compress the output.
Expects a defined C<output> file in single processing.


=item B<--cache|-c>

File to mmap a cache (using L<Cache::FastMmap>).
Defaults to C<korapxml2krill.cache> in the calling directory.


=item B<--cache-size|-cs>

Size of the cache. Defaults to C<50m>.


=item B<--cache-init|-ci>

Initialize cache file.
Can be flagged using C<--no-cache-init> as well.
Defaults to C<true>.


=item B<--cache-delete|-cd>

Delete cache file after processing.
Can be flagged using C<--no-cache-delete> as well.
Defaults to C<true>.


=item B<--config|-cfg>

Configure the parameters of your call in a file
of key-value pairs separated by whitespace:

  overwrite 1
  token     DeReKo#Structure
  ...

Supported parameters are:
C<overwrite>, C<gzip>, C<jobs>, C<input-base>,
C<token>, C<log>,
C<cache>, C<cache-size>, C<cache-init>, C<cache-delete>, C<meta>,
C<output>, C<koral>,
C<temporary-extract>, C<sequential-extraction>,
C<base-sentences>, C<base-paragraphs>,
C<base-pagebreaks>,
C<skip> (semicolon separated), C<sigle>
(semicolon separated), C<anno> (semicolon separated).

Configuration parameters will always be overridden by
passed parameters.

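A configuration file in the format described above could be created as
follows. This is a sketch: the file location and the chosen values are
hypothetical, and writing flags such as C<gzip> with the value C<1> is
assumed by analogy with the C<overwrite 1> example.

```shell
# Write a small config file; the path and the chosen values are hypothetical
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
overwrite 1
gzip 1
jobs 4
token OpenNLP#tokens
skip Mate#Morpho;XIP
EOF

# Show the resulting key-value pairs, one per line
cat "$cfg"

# The file could then be combined with explicit parameters, e.g.:
#   korapxml2krill archive --config "$cfg" --input <directory|archive> --output <directory>
```

Since passed parameters take precedence, a shared file like this can hold
defaults while individual calls override single values.
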

=item B<--temporary-extract|-te>

Only valid for the C<archive> and C<serial>
commands.

This will first extract all files into a
directory and then process them.
If the directory is given as C<:temp:>,
a temporary directory is used.
This is especially useful to avoid
massive unzipping and potential
network latency.


=item B<--to-tar>

Only valid for the C<archive> command.

Writes the output into a tar archive.
The tar needs to be opened with C<--ignore-zeros> afterwards.

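C<--ignore-zeros> is an option of C<tar> that keeps reading past
end-of-archive markers, which matters when a file consists of several
concatenated tar streams. Whether the archives written by C<--to-tar> are
built that way is an assumption made for this sketch. The demonstration
below uses hypothetical file names and plain C<tar> only.

```shell
# Build two tiny tar archives and concatenate them (all names hypothetical)
work=$(mktemp -d)
echo one > "$work/a.json"
echo two > "$work/b.json"
tar -cf "$work/first.tar" -C "$work" a.json
tar -cf "$work/second.tar" -C "$work" b.json
cat "$work/first.tar" "$work/second.tar" > "$work/joined.tar"

# With --ignore-zeros, tar continues past the end-of-archive marker
# of the first stream and lists the members of both
listing=$(tar --ignore-zeros -tf "$work/joined.tar")
printf '%s\n' "$listing"
```

Extraction works the same way, with C<-x> in place of C<-t>.
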

=item B<--sigle|-sg>

Extract the given texts.
Can be set multiple times.
I<Currently only supported on C<extract>.>
Sigles have the structure C<Corpus>/C<Document>/C<Text>.
In case the C<Text> path is omitted, the whole document will be extracted.
On the document level, the postfix wildcard C<*> is supported.

=item B<--lang>

Preferred language for metadata fields. In case multiple titles are
given (on any level) with different C<xml:lang> attributes,
the language given is preferred.
Because titles may have different sources and different priorities,
non-specific language titles may still be preferred in case the title
source has a higher priority.


=item B<--log|-l>

The L<Log::Any> log level, defaults to C<ERROR>.


=item B<--quiet>

Silence all information (non-log) outputs.


=item B<--help|-h>

Print help information.


=item B<--version|-v>

Print version information.

=back

=head1 PERFORMANCE

There are some ways to improve performance for large tasks:

=over 2

=item First unpack

Using the archive or serial command on one or multiple zip files
can be very slow, as it needs to unpack small portions every time.
It's better to use C<--temporary-extract> to unpack the whole archive
first into a temporary directory and then read the extracted files.
This is especially important for remote archives.

=item Limit annotations

By default, all supported annotation layers are sought. This can be limited
by adding C<--skip '#ALL'> and only listing the expected annotations with C<--anno>.

=item Checking the parallel job count

By providing the number of parallel jobs using C<--jobs>, the execution can be tailored to specific
hardware environments.

=item Install ripunzip

For full extraction of data, L<ripunzip|https://github.com/google/ripunzip> can be
used for improved performance.

=back

=head1 ANNOTATION SUPPORT

L<KorAP::XML::Krill> has built-in importers for some annotation foundries and layers
developed in the KorAP project that are part of the KorAP preprocessing pipeline.
The base foundry with paragraphs, sentences, and the text element is mandatory for
L<Krill|https://github.com/KorAP/Krill>.

  Base
    #Paragraphs
    #Sentences

  Connexor
    #Morpho
    #Phrase
    #Sentences
    #Syntax

  CoreNLP
    #Constituency
    #Morpho
    #NamedEntities
    #Sentences

  CorpusExplorer
    #Morpho

  CMC
    #Morpho

  DeReKo
    #Structure

  DGD
    #Morpho
    #Structure

  DRuKoLa
    #Morpho

  Glemm
    #Morpho

  Gingko
    #Morpho

  HNC
    #Morpho

  LWC
    #Dependency

  Malt
    #Dependency

  MarMoT
    #Morpho

  Mate
    #Dependency
    #Morpho

  MDParser
    #Dependency

  NKJP
    #Morpho
    #NamedEntities

  OpenNLP
    #Morpho
    #Sentences

  RWK
    #Morpho
    #Structure

  Sgbr
    #Lemma
    #Morpho

  Spacy
    #Morpho
    #Dependency

  Talismane
    #Dependency
    #Morpho

  TreeTagger
    #Morpho
    #Sentences

  UDPipe
    #Dependency
    #Morpho

  XIP
    #Constituency
    #Morpho
    #Sentences


More importers are in preparation.
New annotation importers can be defined in the C<KorAP::XML::Annotation> namespace.
See the built-in annotation importers as examples.


=head1 METADATA SUPPORT

L<KorAP::XML::Krill> has built-in importers for some meta data variants
that are part of the KorAP preprocessing pipeline.

=over 2

=item B<I5>

Meta data for all I5 files

Environment variables:

=over 4

=item C<K2K_TRANSLATOR_TEXT>

Index the translator as a text field (attachment otherwise).

=item C<K2K_PUBLISHER_STRING>

Index the publisher as a string field (attachment otherwise).


=back

=item B<Sgbr>

Meta data from the Schreibgebrauch project

=item B<Gingko>

Meta data from the Gingko project in addition to I5

=item B<ICC>

Meta data for the ICC in addition to I5

=item B<NKJP>

Meta data for the NKJP corpora

=back

New meta data importers can be defined in the C<KorAP::XML::Meta> namespace.
See the built-in meta data importers as examples.

The I5 metadata definition is based on TEI-P5 and supports C<E<lt>xenoDataE<gt>>
with C<E<lt>metaE<gt>> elements like

  <meta type="..." name="..." project="..." desc="...">...</meta>

that are directly translated to Krill objects. The supported values are:

=over 2

=item C<type>

=over 4

=item C<string>

String meta data value

=item C<keyword>

String meta data value that can be given multiple times

=item C<text>

String meta data value that is tokenized and can be searched as token sequences

=item C<date>

Date meta data value (as "yyyy/mm/dd" with optional granularity)

=item C<integer>

Numerical meta data value

=item C<attachment>

Non-indexed meta data value (only retrievable)

=item C<uri>

Non-indexed attached URI, takes the C<desc> as the title for links

=back

=item C<name>

The key of the meta object that may be prefixed by C<corpus> or C<doc>, in case the
C<E<lt>xenoDataE<gt>> information is located on these levels. The text level introduces
no prefixes.

=item C<project> (optional)

A prefixed namespace of the key

=item C<desc> (optional)

A description of the key

=item text content

The value of the meta object

=back


=head1 About KorAP-XML

KorAP-XML (Bański et al. 2012) is an implementation of the KorAP
data model (Bański et al. 2013), where text data are stored physically
separated from their interpretations (i.e. annotations).
A text document in KorAP-XML therefore consists of several files
containing primary data, metadata and annotations.

The structure of a single KorAP-XML document can be as follows:

  - data.xml
  - header.xml
    + base
      - tokens.xml
      - ...
    + struct
      - structure.xml
      - ...
    + corenlp
      - morpho.xml
      - constituency.xml
      - ...
    + tree_tagger
      - morpho.xml
      - ...
    - ...

The C<data.xml> contains the primary data, the C<header.xml> contains
the metadata, and the annotation layers are stored in subfolders
like C<base>, C<struct> or C<corenlp>
(so-called "foundries"; Bański et al. 2013).

Metadata is available in the TEI-P5 variant I5
(Lüngen and Sperberg-McQueen 2012). See the documentation in
L<KorAP::XML::Meta::I5> for translatable fields.

Annotations correspond to a variant of the TEI-P5 feature structures
(TEI Consortium; Lee et al. 2004).
Annotation feature structures refer to character sequences of the primary text
inside the C<text> element of the C<data.xml>.
A single annotation containing the lemma of a token can have the following structure:

  <span from="0" to="3">
    <fs type="lex" xmlns="http://www.tei-c.org/ns/1.0">
      <f name="lex">
        <fs>
          <f name="lemma">zum</f>
        </fs>
      </f>
    </fs>
  </span>

The C<from> and C<to> attributes refer to the character span
in the primary text.
Depending on the kind of annotation (e.g. token-based, span-based, relation-based),
the structure may vary. See L<KorAP::XML::Annotation::*> for various
annotation preprocessors.

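How a C<from>/C<to> pair maps onto the primary text can be sketched in the
shell. The primary text below is hypothetical, and the sketch assumes
zero-based offsets with an exclusive end, which is consistent with the
three-letter lemma covering C<0> to C<3> in the example above. It also
assumes plain ASCII text, as C<awk> implementations differ in how they count
multi-byte characters.

```shell
# Hypothetical primary text; the annotation above covers from="0" to="3"
text='zum Beispiel'
from=0
to=3

# awk's substr is 1-based and takes a length, so shift the start by one
surface=$(printf '%s\n' "$text" |
  awk -v f="$from" -v t="$to" '{ print substr($0, f + 1, t - f) }')
echo "$surface"
```
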
Multiple KorAP-XML documents are organized on three levels following
the "IDS Textmodell" (Lüngen and Sperberg-McQueen 2012):
corpus E<gt> document E<gt> text. Metadata can be stored on each level,
and C<korapxml2krill> merges it into a single metadata
object per text. A corpus is therefore structured as follows:

  + <corpus>
    - header.xml
    + <document>
      - header.xml
      + <text>
        - data.xml
        - header.xml
        - ...
    - ...

A single text can be identified by the concatenation of
the corpus identifier, the document identifier and the text identifier.
This identifier is called the text sigle
(e.g. a text with the identifier C<18486> in the document C<060> in the
corpus C<WPD17> has the text sigle C<WPD17/060/18486>, see C<--sigle>).

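The three parts of a text sigle can be taken apart again with plain shell
parameter expansion, for example when building the path of a text folder
inside a corpus. This is a small sketch using the sigle from the example
above. The variable names are chosen for illustration only.

```shell
sigle='WPD17/060/18486'

# Split on "/" with POSIX parameter expansion
corpus=${sigle%%/*}
rest=${sigle#*/}
document=${rest%%/*}
text=${rest#*/}

echo "corpus=$corpus document=$document text=$text"
```
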
These corpora are often stored in zip files, which C<korapxml2krill>
can process. Corpora may also be split into multiple zip archives
(e.g. one zip file per foundry), which is also supported (see C<--input>).

Examples for KorAP-XML files are included in L<KorAP::XML::Krill>
in the form of a test suite.
The resulting JSON format merges all annotation layers
based on a single token stream.

=head2 References

Piotr Bański, Cyril Belica, Helge Krause, Marc Kupietz, Carsten Schnober, Oliver Schonefeld, and Andreas Witt (2011):
KorAP data model: first approximation, December.

Piotr Bański, Peter M. Fischer, Elena Frick, Erik Ketzan, Marc Kupietz, Carsten Schnober, Oliver Schonefeld and Andreas Witt (2012):
"The New IDS Corpus Analysis Platform: Challenges and Prospects",
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012).
L<PDF|http://www.lrec-conf.org/proceedings/lrec2012/pdf/789_Paper.pdf>

Piotr Bański, Elena Frick, Michael Hanl, Marc Kupietz, Carsten Schnober and Andreas Witt (2013):
"Robust corpus architecture: a new look at virtual collections and data access",
Corpus Linguistics 2013. Abstract Book. Lancaster: UCREL, pp. 23-25.
L<PDF|https://ids-pub.bsz-bw.de/frontdoor/deliver/index/docId/4485/file/Ba%c5%84ski_Frick_Hanl_Robust_corpus_architecture_2013.pdf>

Kiyong Lee, Lou Burnard, Laurent Romary, Eric de la Clergerie, Thierry Declerck,
Syd Bauman, Harry Bunt, Lionel Clément, Tomaz Erjavec, Azim Roussanaly and Claude Roux (2004):
"Towards an international standard on feature structure representation",
Proceedings of the fourth International Conference on Language Resources and Evaluation (LREC 2004),
pp. 373-376.
L<PDF|http://www.lrec-conf.org/proceedings/lrec2004/pdf/687.pdf>

Harald Lüngen and C. M. Sperberg-McQueen (2012):
"A TEI P5 Document Grammar for the IDS Text Model",
Journal of the Text Encoding Initiative, Issue 3 | November 2012.
L<PDF|https://journals.openedition.org/jtei/pdf/508>

TEI Consortium, eds:
"Feature Structures",
Guidelines for Electronic Text Encoding and Interchange.
L<html|https://www.tei-c.org/release/doc/tei-p5-doc/en/html/FS.html>

=head1 AVAILABILITY

  https://github.com/KorAP/KorAP-XML-Krill


=head1 COPYRIGHT AND LICENSE

Copyright (C) 2015-2025, L<IDS Mannheim|https://www.ids-mannheim.de/>

Author: L<Nils Diewald|https://www.nils-diewald.de/>

Contributors: Eliza Margaretha, Marc Kupietz

L<KorAP::XML::Krill> is developed as part of the L<KorAP|https://korap.ids-mannheim.de/>
Corpus Analysis Platform at the
L<Leibniz Institute for the German Language (IDS)|https://www.ids-mannheim.de/>,
member of the
L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/>.

This program is free software published under the
L<BSD-2 License|https://opensource.org/licenses/BSD-2-Clause>.

=cut