=pod

=encoding utf8

=head1 NAME

korapxml2krill - Merge KorAP-XML data and create Krill documents


=head1 SYNOPSIS

  korapxml2krill [archive|extract] --input <directory|archive> [options]


=head1 DESCRIPTION

L<KorAP::XML::Krill> is a library to convert KorAP-XML documents to files
compatible with the L<Krill|https://github.com/KorAP/Krill> indexer.
The C<korapxml2krill> command line tool is a simple wrapper around this library.


=head1 INSTALLATION

The preferred way to install L<KorAP::XML::Krill> is to use L<cpanm|App::cpanminus>.

  $ cpanm https://github.com/KorAP/KorAP-XML-Krill.git

If everything went well, the C<korapxml2krill> tool will
be available on your command line immediately.
The minimum requirement for L<KorAP::XML::Krill> is Perl 5.16.
L<Sys::Info> is optionally supported to calculate the number of available cores.
In addition, to work with zip archives, the C<unzip> tool needs to be present.

=head1 ARGUMENTS

  $ korapxml2krill -z --input <directory> --output <filename>

Without arguments, C<korapxml2krill> converts a directory of a single KorAP-XML document.
It expects the input to point to the text level folder.

=over 2

=item B<archive>

  $ korapxml2krill archive -z --input <directory|archive> --output <directory|tar>

Converts an archive of KorAP-XML documents. It expects a directory
(pointing to the corpus level folder) or one or more zip files as input.

=item B<extract>

  $ korapxml2krill extract --input <archive> --output <directory> --sigle <SIGLE>

Extracts KorAP-XML documents from a zip file.

=item B<serial>

  $ korapxml2krill serial -i <archive1> -i <archive2> -o <directory> -cfg <config-file>

Converts archives sequentially. The inputs are not merged but treated
as they are (so they may be premerged or globs).
The C<--output> directory is treated as the base directory where subdirectories
are created based on the archive name. In case the C<--to-tar> flag is given,
the output will be a tar file.


=item B<slimlog>

  $ korapxml2krill slimlog <logfile> > <logfile-slim>

Filters out all successful (and therefore uninteresting) entries from logs
to simplify log checks. Expects no further options.


=back


=head1 OPTIONS

=over 2

=item B<--input|-i> <directory|zip file>

Directory or zip file(s) of documents to convert.

Without arguments, C<korapxml2krill> expects a folder of a single KorAP-XML
document, while C<archive> expects a KorAP-XML corpus folder or a zip
file to batch process multiple files.
C<extract> expects zip files only.

C<archive> supports multiple input zip files with the constraint
that the first archive listed contains all primary data files
and all metadata files.

  -i file/news.zip -i file/news.malt.zip -i "#file/news.tt.zip"

Input may also be defined using BSD glob wildcards.

  -i 'file/news*.zip'

The expanded input array will be sorted by length, so the shortest
path needs to contain all primary data files and all metadata files.
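
As a rough illustration of the ordering rule, the following shell sketch
sorts a list of paths by length, shortest first. The helper C<sort_by_length>
is hypothetical and not part of C<korapxml2krill>; how the tool itself breaks
ties between paths of equal length is not covered here.

```shell
#!/bin/sh
# sort_by_length: print the given paths, shortest first.
# Hypothetical helper for illustration; not part of korapxml2krill.
sort_by_length() {
  for path in "$@"; do
    # Prefix each path with its length so that sort can order numerically.
    printf '%d %s\n' "${#path}" "$path"
  done | sort -n | cut -d' ' -f2-
}

# The archive with primary data and metadata should come out first.
sort_by_length file/news.malt.zip file/news.zip file/news.tt.zip
```

Previewing the order this way can help to check that the archive holding
the primary data really has the shortest path before starting a long run.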

(The directory structure follows the base directory format,
which may include a C<.> root folder.
In this case further archives lacking a C<.> root folder
need to be passed with a hash sign in front of the archive's name.
This may require quoting the parameter.)

To support zip files, a version of C<unzip> needs to be installed that is
compatible with the archive file.

B<The root folder switch using the hash sign is experimental and
may vanish in future versions.>


=item B<--input-base|-ib> <directory>

The base directory for inputs.


=item B<--output|-o> <directory|file>

Output folder for archive processing or
document name for single output (optional).
Writes to C<STDOUT> by default
(in case C<output> is not mandatory due to further options).

=item B<--overwrite|-w>

Overwrite files that already exist.


=item B<--token|-t> <foundry>#<file>

Define the default tokenization by specifying
the name of the foundry and optionally the name
of the layer-file. Defaults to C<OpenNLP#tokens>.
This will directly take the file instead of running
the layer implementation!


=item B<--base-sentences|-bs> <foundry>#<layer>

Define the layer for base sentences.
If given, this will be used instead of using C<Base#Sentences>.
Currently C<DeReKo#Structure> and C<DGD#Structure> are the only additional
layers supported.

  Defaults to unset.


=item B<--base-paragraphs|-bp> <foundry>#<layer>

Define the layer for base paragraphs.
If given, this will be used instead of using C<Base#Paragraphs>.
Currently C<DeReKo#Structure> and C<DGD#Structure> are the only additional
layers supported.

  Defaults to unset.


=item B<--base-pagebreaks|-bpb> <foundry>#<layer>

Define the layer for base pagebreaks.
Currently C<DeReKo#Structure> is the only layer supported.

  Defaults to unset.


=item B<--skip|-s> <foundry>[#<layer>]

Skip specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>. Alternatively you can skip C<#ALL>.
Can be set multiple times.


=item B<--anno|-a> <foundry>[#<layer>]

Convert specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>.
Can be set multiple times.


=item B<--non-word-tokens|-nwt>

Tokenize non-word tokens like word tokens (defined as matching
C</[\d\w]/>). Useful to treat punctuation as tokens.

  Defaults to unset.


=item B<--non-verbal-tokens|-nvt>

Tokenize non-verbal tokens marked in the primary data with
the Unicode symbol 'Black Vertical Rectangle' (U+25AE).

  Defaults to unset.


=item B<--jobs|-j>

Define the number of concurrent jobs in separate forks
for archive processing.
Defaults to C<0> (everything runs in a single process).

If C<sequential-extraction> is not set to true, this will
also apply to extraction.

Pass C<-1>, and the value will be set automatically to 5
times the number of available cores, in case L<Sys::Info>
is available.
This is I<experimental>.
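
The automatic value can be approximated by hand. The sketch below assumes
that C<getconf _NPROCESSORS_ONLN> reports the number of cores (the tool
itself relies on L<Sys::Info>, which may count differently) and multiplies
that number by five.

```shell
#!/bin/sh
# Number of online cores; getconf is an assumption standing in for Sys::Info.
cores=$(getconf _NPROCESSORS_ONLN)

# The documented rule for "-j -1": five times the number of available cores.
job_count=$((cores * 5))

echo "cores=$cores jobs=$job_count"
```

The printed number is only an estimate of what C<-j -1> would choose and
can be passed explicitly with C<--jobs> if a fixed value is preferred.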


=item B<--koral|-k>

Version of the output format. Supported versions are:
C<0> for legacy serialization, C<0.03> for serialization
with metadata fields as key-values on the root object,
C<0.4> for serialization with metadata fields as a list
of C<"@type":"koral:field"> objects.

Currently defaults to C<0.03>.


=item B<--sequential-extraction|-se>

Flag to indicate whether the C<jobs> value also applies to extraction.
Some systems may have problems with extracting multiple archives
to the same folder at the same time.
Can be flagged using C<--no-sequential-extraction> as well.
Defaults to C<false>.


=item B<--meta|-m>

Define the metadata parser to use. Defaults to C<I5>.
Metadata parsers can be defined in the C<KorAP::XML::Meta> namespace.
This is I<experimental>.


=item B<--gzip|-z>

Compress the output.
Expects a defined C<output> file in single processing.


=item B<--cache|-c>

File to mmap a cache (using L<Cache::FastMmap>).
Defaults to C<korapxml2krill.cache> in the calling directory.


=item B<--cache-size|-cs>

Size of the cache. Defaults to C<50m>.


=item B<--cache-init|-ci>

Initialize cache file.
Can be flagged using C<--no-cache-init> as well.
Defaults to C<true>.


=item B<--cache-delete|-cd>

Delete cache file after processing.
Can be flagged using C<--no-cache-delete> as well.
Defaults to C<true>.


=item B<--config|-cfg>

Configure the parameters of your call in a file
of key-value pairs separated by whitespace:

  overwrite 1
  token     DeReKo#Structure
  ...

Supported parameters are:
C<overwrite>, C<gzip>, C<jobs>, C<input-base>,
C<token>, C<log>, C<cache>, C<cache-size>, C<cache-delete>, C<meta>,
C<output>, C<koral>,
C<temporary-extract>, C<sequential-extraction>,
C<base-sentences>, C<base-paragraphs>,
C<base-pagebreaks>,
C<skip> (semicolon separated), C<sigle>
(semicolon separated), C<anno> (semicolon separated).

Configuration parameters will always be overwritten by
passed parameters.
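
A configuration file might be prepared as in the following sketch.
The file name C<korap.conf>, the archive name and the chosen values are
assumptions for illustration; only the parameter names are taken from the
list of supported parameters. The final command is printed rather than executed.

```shell
#!/bin/sh
# Write a configuration file of whitespace-separated key-value pairs.
# The file name korap.conf and all values are illustrative assumptions.
cat > korap.conf <<'EOF'
overwrite 1
jobs 4
token OpenNLP#tokens
skip Mate#Morpho;XIP
EOF

# Show the call that would read the file (printed only, not run here).
echo 'korapxml2krill archive -cfg korap.conf -i corpus.zip -o out'
```

Since passed parameters take precedence, adding for example C<-j 8> to the
printed call would replace the C<jobs> value from the file for that run.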


=item B<--temporary-extract|-te>

Only valid for the C<archive> command.

This will first extract all files into a
directory and then run C<archive> on it.
If the directory is given as C<:temp:>,
a temporary directory is used.
This is especially useful to avoid
massive unzipping and potential
network latency.


=item B<--to-tar>

Only valid for the C<archive> command.

Writes the output into a tar archive.


=item B<--sigle|-sg>

Extract the given texts.
Can be set multiple times.
I<Currently only supported on C<extract>.>
Sigles have the structure C<Corpus>/C<Document>/C<Text>.
In case the C<Text> path is omitted, the whole document will be extracted.
On the document level, the postfix wildcard C<*> is supported.


=item B<--log|-l>

The L<Log::Any> log level, defaults to C<ERROR>.


=item B<--help|-h>

Print help information.


=item B<--version|-v>

Print version information.

=back


=head1 ANNOTATION SUPPORT

L<KorAP::XML::Krill> has built-in importers for some annotation foundries and layers
developed in the KorAP project that are part of the KorAP preprocessing pipeline.
The base foundry with paragraphs, sentences, and the text element is mandatory for
L<Krill|https://github.com/KorAP/Krill>.

  Base
    #Paragraphs
    #Sentences

  Connexor
    #Morpho
    #Phrase
    #Sentences
    #Syntax

  CoreNLP
    #Constituency
    #Morpho
    #NamedEntities
    #Sentences

  CMC
    #Morpho

  DeReKo
    #Structure

  DGD
    #Morpho
    #Structure

  DRuKoLa
    #Morpho

  Glemm
    #Morpho

  Gingko
    #Morpho

  HNC
    #Morpho

  LWC
    #Dependency

  Malt
    #Dependency

  MarMoT
    #Morpho

  Mate
    #Dependency
    #Morpho

  MDParser
    #Dependency

  OpenNLP
    #Morpho
    #Sentences

  RWK
    #Morpho
    #Structure

  Sgbr
    #Lemma
    #Morpho

  Talismane
    #Dependency
    #Morpho

  TreeTagger
    #Morpho
    #Sentences

  XIP
    #Constituency
    #Morpho
    #Sentences


More importers are in preparation.
New annotation importers can be defined in the C<KorAP::XML::Annotation> namespace.
See the built-in annotation importers as examples.


=head1 METADATA SUPPORT

L<KorAP::XML::Krill> has built-in importers for some metadata variants
developed in the KorAP project that are part of the KorAP preprocessing pipeline.

=over 2

=item I5 - Metadata for all I5 files

=item Sgbr - Metadata from the Schreibgebrauch project

=item Gingko - Metadata from the Gingko project in addition to I5

=back

More importers are in preparation.
New metadata importers can be defined in the C<KorAP::XML::Meta> namespace.
See the built-in metadata importers as examples.


=head1 About KorAP-XML

KorAP-XML (Bański et al. 2012) is an implementation of the KorAP
data model (Bański et al. 2013), where text data are stored physically
separated from their interpretations (i.e. annotations).
A text document in KorAP-XML therefore consists of several files
containing primary data, metadata and annotations.

The structure of a single KorAP-XML document can be as follows:

  - data.xml
  - header.xml
  + base
    - tokens.xml
    - ...
  + struct
    - structure.xml
    - ...
  + corenlp
    - morpho.xml
    - constituency.xml
    - ...
  + tree_tagger
    - morpho.xml
    - ...
  - ...

The C<data.xml> contains the primary data, the C<header.xml> contains
the metadata, and the annotation layers are stored in subfolders
like C<base>, C<struct> or C<corenlp>
(so-called "foundries"; Bański et al. 2013).

Metadata is available in the TEI-P5 variant I5
(Lüngen and Sperberg-McQueen 2012). See the documentation in
L<KorAP::XML::Meta::I5> for translatable fields.

Annotations correspond to a variant of the TEI-P5 feature structures
(TEI Consortium; Lee et al. 2004).
Annotation feature structures refer to character sequences of the primary text
inside the C<text> element of the C<data.xml>.
A single annotation containing the lemma of a token can have the following structure:

  <span from="0" to="3">
    <fs type="lex" xmlns="http://www.tei-c.org/ns/1.0">
      <f name="lex">
        <fs>
          <f name="lemma">zum</f>
        </fs>
      </f>
    </fs>
  </span>

The C<from> and C<to> attributes refer to the character span
in the primary text.
Depending on the kind of annotation (e.g. token-based, span-based, relation-based),
the structure may vary. See L<KorAP::XML::Annotation::*> for various
annotation preprocessors.
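
To illustrate the offsets, the sketch below cuts a span out of a made-up
primary text. It assumes that offsets are zero-based and that C<to> is
exclusive, which would match the three-character lemma C<zum> for the span
from C<0> to C<3>; the helper C<span_text> is hypothetical.
Note that C<cut -c> may count bytes rather than characters in some
implementations, so the sketch is only reliable for ASCII text.

```shell
#!/bin/sh
# span_text TEXT FROM TO: print the characters of TEXT from offset FROM
# up to, but not including, offset TO.
# Hypothetical helper; assumes zero-based offsets with an exclusive end.
span_text() {
  # cut counts from 1 and includes the end position, hence FROM+1 .. TO.
  printf '%s\n' "$1" | cut -c "$(($2 + 1))-$3"
}

# Made-up primary text; the first token spans offsets 0 to 3.
span_text 'zum letzten Mal' 0 3
```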

Multiple KorAP-XML documents are organized on three levels following
the "IDS Textmodell" (Lüngen and Sperberg-McQueen 2012):
corpus E<gt> document E<gt> text. Metadata can be stored on each level,
and C<korapxml2krill> will merge it into a single metadata
object per text. A corpus is therefore structured as follows:

  + <corpus>
    - header.xml
    + <document>
      - header.xml
      + <text>
        - data.xml
        - header.xml
        - ...
      - ...

A single text can be identified by the concatenation of
the corpus identifier, the document identifier and the text identifier.
This identifier is called the text sigle
(e.g. a text with the identifier C<18486> in the document C<060> in the
corpus C<WPD17> has the text sigle C<WPD17/060/18486>, see C<--sigle>).
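
The concatenation can be sketched in a few lines of shell, using the
identifiers C<WPD17>, C<060> and C<18486>. The variable names are
illustrative only.

```shell
#!/bin/sh
corpus=WPD17
document=060
text=18486

# Join the three identifiers with slashes to form the text sigle.
sigle="$corpus/$document/$text"
echo "$sigle"

# Leaving out the text identifier addresses the whole document.
document_sigle="$corpus/$document"
echo "$document_sigle"
```

Either value could then be handed to C<--sigle> of the C<extract> command.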

These corpora are often stored in zip files, which C<korapxml2krill>
can process. Corpora may also be split into multiple zip archives
(e.g. one zip file per foundry), which is also supported (see C<--input>).

Examples of KorAP-XML files are included in L<KorAP::XML::Krill>
in the form of a test suite.
The resulting JSON format merges all annotation layers
based on a single token stream.

=head2 References

Piotr Bański, Cyril Belica, Helge Krause, Marc Kupietz, Carsten Schnober, Oliver Schonefeld, and Andreas Witt (2011):
KorAP data model: first approximation, December.

Piotr Bański, Peter M. Fischer, Elena Frick, Erik Ketzan, Marc Kupietz, Carsten Schnober, Oliver Schonefeld and Andreas Witt (2012):
"The New IDS Corpus Analysis Platform: Challenges and Prospects",
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012).
L<PDF|http://www.lrec-conf.org/proceedings/lrec2012/pdf/789_Paper.pdf>

Piotr Bański, Elena Frick, Michael Hanl, Marc Kupietz, Carsten Schnober and Andreas Witt (2013):
"Robust corpus architecture: a new look at virtual collections and data access",
Corpus Linguistics 2013. Abstract Book. Lancaster: UCREL, pp. 23-25.
L<PDF|https://ids-pub.bsz-bw.de/frontdoor/deliver/index/docId/4485/file/Ba%c5%84ski_Frick_Hanl_Robust_corpus_architecture_2013.pdf>

Kiyong Lee, Lou Burnard, Laurent Romary, Eric de la Clergerie, Thierry Declerck,
Syd Bauman, Harry Bunt, Lionel Clément, Tomaz Erjavec, Azim Roussanaly and Claude Roux (2004):
"Towards an international standard on feature structure representation",
Proceedings of the fourth International Conference on Language Resources and Evaluation (LREC 2004),
pp. 373-376.
L<PDF|http://www.lrec-conf.org/proceedings/lrec2004/pdf/687.pdf>

Harald Lüngen and C. M. Sperberg-McQueen (2012):
"A TEI P5 Document Grammar for the IDS Text Model",
Journal of the Text Encoding Initiative, Issue 3 | November 2012.
L<PDF|https://journals.openedition.org/jtei/pdf/508>

TEI Consortium, eds:
"Feature Structures",
Guidelines for Electronic Text Encoding and Interchange.
L<html|https://www.tei-c.org/release/doc/tei-p5-doc/en/html/FS.html>

=head1 AVAILABILITY

  https://github.com/KorAP/KorAP-XML-Krill


=head1 COPYRIGHT AND LICENSE

Copyright (C) 2015-2022, L<IDS Mannheim|https://www.ids-mannheim.de/>

Author: L<Nils Diewald|https://www.nils-diewald.de/>

Contributor: Eliza Margaretha

L<KorAP::XML::Krill> is developed as part of the L<KorAP|https://korap.ids-mannheim.de/>
Corpus Analysis Platform at the
L<Leibniz Institute for the German Language (IDS)|https://www.ids-mannheim.de/>,
member of the
L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/>.

This program is free software published under the
L<BSD-2 License|https://opensource.org/licenses/BSD-2-Clause>.

=cut