=pod

=encoding utf8

=head1 NAME

korapxml2krill - Merge KorAP-XML data and create Krill documents


=head1 SYNOPSIS

  korapxml2krill [archive|extract] --input <directory|archive> [options]


=head1 DESCRIPTION

L<KorAP::XML::Krill> is a library to convert KorAP-XML documents to files
compatible with the L<Krill|https://github.com/KorAP/Krill> indexer.
The C<korapxml2krill> command line tool is a simple wrapper around this library.


=head1 INSTALLATION

The preferred way to install L<KorAP::XML::Krill> is to use L<cpanm|App::cpanminus>.

  $ cpanm https://github.com/KorAP/KorAP-XML-Krill.git

If everything went well, the C<korapxml2krill> tool will
be available on your command line immediately.
The minimum requirement for L<KorAP::XML::Krill> is Perl 5.16.
L<Sys::Info> is optionally supported to calculate the number of available cores.
In addition, the C<unzip> tool needs to be present to work with zip archives.

=head1 ARGUMENTS

  $ korapxml2krill -z --input <directory> --output <filename>

Without arguments, C<korapxml2krill> converts a directory of a single KorAP-XML document.
It expects the input to point to the text level folder.

=over 2

=item B<archive>

  $ korapxml2krill archive -z --input <directory|archive> --output <directory|tar>

Converts an archive of KorAP-XML documents. It expects a directory
(pointing to the corpus level folder) or one or more zip files as input.

=item B<extract>

  $ korapxml2krill extract --input <archive> --output <directory> --sigle <SIGLE>

Extracts KorAP-XML documents from a zip file.

=item B<serial>

  $ korapxml2krill serial -i <archive1> -i <archive2> -o <directory> -cfg <config-file>

Converts archives sequentially. The inputs are not merged but treated
as they are (so they may be premerged or globs).
The C<--output> directory is treated as the base directory where subdirectories
are created based on the archive name. In case the C<--to-tar> flag is given,
the output will be a tar file.


=back


=head1 OPTIONS

=over 2

=item B<--input|-i> <directory|zip file>

Directory or zip file(s) of documents to convert.

Without arguments, C<korapxml2krill> expects a folder of a single KorAP-XML
document, while C<archive> expects a KorAP-XML corpus folder or a zip
file to batch process multiple files.
C<extract> expects zip files only.

C<archive> supports multiple input zip files with the constraint
that the first archive listed contains all primary data files
and all metadata files.

  -i file/news.zip -i file/news.malt.zip -i "#file/news.tt.zip"

Input may also be defined using BSD glob wildcards.

  -i 'file/news*.zip'

The expanded input array will be sorted by path length, so the shortest
path needs to contain all primary data files and all metadata files.

(The directory structure follows the base directory format,
which may include a C<.> root folder.
In this case, further archives lacking a C<.> root folder
need to be passed with a hash sign in front of the archive's name.
This may require quoting the parameter.)

To support zip files, a version of C<unzip> needs to be installed that is
compatible with the archive file.

B<The root folder switch using the hash sign is experimental and
may vanish in future versions.>
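
The ordering rule for expanded inputs can be sketched as follows.
This is a minimal illustration in Python, not the tool's Perl implementation.
The helper name C<order_inputs> is hypothetical, Python's C<glob> only
approximates BSD glob, and the alphabetical tie-break for paths of equal
length is an assumption of this sketch.

```python
from glob import glob

def order_inputs(patterns, expand=glob):
    """Expand wildcard patterns and sort the result by path length.

    `expand` defaults to Python's glob; a pattern without matches
    is kept as a literal path.
    """
    paths = []
    for pattern in patterns:
        matches = expand(pattern)
        paths.extend(matches if matches else [pattern])
    # Shortest path first; equal lengths are ordered alphabetically
    # (the tie-break is an assumption of this sketch).
    return sorted(paths, key=lambda p: (len(p), p))

# Fake expansion so the example does not depend on the file system
fake = lambda pattern: ["file/news.malt.zip", "file/news.zip", "file/news.tt.zip"]
print(order_inputs(["file/news*.zip"], expand=fake))
# ['file/news.zip', 'file/news.tt.zip', 'file/news.malt.zip']
```

With this ordering, C<file/news.zip> comes first and would therefore have
to contain the primary data and metadata files.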


=item B<--input-base|-ib> <directory>

The base directory for inputs.


=item B<--output|-o> <directory|file>

Output folder for archive processing or
document name for single output (optional).
Writes to C<STDOUT> by default
(in case C<output> is not mandatory due to further options).

=item B<--overwrite|-w>

Overwrite files that already exist.


=item B<--token|-t> <foundry>#<file>

Define the default tokenization by specifying
the name of the foundry and optionally the name
of the layer-file. Defaults to C<OpenNLP#tokens>.
This will directly take the file instead of running
the layer implementation!


=item B<--base-sentences|-bs> <foundry>#<layer>

Define the layer for base sentences.
If given, this will be used instead of C<Base#Sentences>.
Currently C<DeReKo#Structure> and C<DGD#Structure> are the only additional
layers supported.

Defaults to unset.


=item B<--base-paragraphs|-bp> <foundry>#<layer>

Define the layer for base paragraphs.
If given, this will be used instead of C<Base#Paragraphs>.
Currently C<DeReKo#Structure> is the only additional layer supported.

Defaults to unset.


=item B<--base-pagebreaks|-bpb> <foundry>#<layer>

Define the layer for base pagebreaks.
Currently C<DeReKo#Structure> is the only layer supported.

Defaults to unset.


=item B<--skip|-s> <foundry>[#<layer>]

Skip specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>. Alternatively you can skip C<#ALL>.
Can be set multiple times.


=item B<--anno|-a> <foundry>#<layer>

Convert specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>.
Can be set multiple times.
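
The C<E<lt>foundryE<gt>[#E<lt>layerE<gt>]> notation used by C<--skip> and C<--anno>
can be illustrated with a short Python sketch. The function names are
hypothetical, and the case-insensitive comparison is an assumption of this
sketch rather than documented behaviour of the tool.

```python
def parse_annotation_spec(spec):
    """Split '<foundry>[#<layer>]' into a (foundry, layer) tuple.

    The layer is None when only a foundry is given.
    """
    foundry, sep, layer = spec.partition("#")
    return (foundry, layer if sep else None)

def is_skipped(foundry, layer, skip_specs):
    """Decide whether an annotation layer is covered by the skip list."""
    for spec in skip_specs:
        if spec == "#ALL":
            return True
        skip_foundry, skip_layer = parse_annotation_spec(spec)
        # Case-insensitive matching is an assumption of this sketch
        if skip_foundry.lower() != foundry.lower():
            continue
        if skip_layer is None or skip_layer.lower() == layer.lower():
            return True
    return False

print(parse_annotation_spec("Mate#Morpho"))                  # ('Mate', 'Morpho')
print(is_skipped("Mate", "Dependency", ["Mate#Morpho"]))     # False
print(is_skipped("Mate", "Dependency", ["Mate"]))            # True
```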


=item B<--non-word-tokens|-nwt>

Tokenize non-word tokens like word tokens (word tokens are defined as
matching C</[\d\w]/>). Useful to treat punctuation marks as tokens.

Defaults to unset.
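
The distinction between word and non-word tokens can be illustrated with
the same pattern in Python. This is only a sketch: the helper name is
hypothetical, and Python's C<\w> may differ from Perl's in edge cases.

```python
import re

# Same character class as in the option description
WORD_CHAR = re.compile(r"[\d\w]")

def is_word_token(surface):
    """Return True if the token contains at least one digit or word character."""
    return WORD_CHAR.search(surface) is not None

tokens = ["Der", "Hund", ",", "3", "..."]
# Tokens that would only be kept when non-word tokens are enabled
print([t for t in tokens if not is_word_token(t)])
# [',', '...']
```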


=item B<--non-verbal-tokens|-nvt>

Tokenize non-verbal tokens, marked in the primary data with
the Unicode symbol 'Black Vertical Rectangle' (U+25AE).

Defaults to unset.


=item B<--jobs|-j>

Define the number of concurrent jobs in separate forks
for archive processing.
Defaults to C<0> (everything runs in a single process).

If C<sequential-extraction> is not set to false, this will
also apply to extraction.

Pass C<-1>, and the value will be set automatically to 5
times the number of available cores, in case L<Sys::Info>
is available.
This is I<experimental>.
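
The automatic value for C<-1> amounts to simple arithmetic, sketched here
in Python. The function name is hypothetical, and the fallback to C<0>
when the number of cores cannot be determined is an assumption of this
sketch, not documented behaviour.

```python
import os

def resolve_jobs(jobs, cores=None):
    """Resolve the --jobs value: -1 means 5 times the available cores."""
    if jobs != -1:
        return jobs
    if cores is None:
        cores = os.cpu_count()  # may be None if it cannot be determined
    if not cores:
        return 0  # assumption: fall back to a single process
    return 5 * cores

print(resolve_jobs(-1, cores=4))  # 20
print(resolve_jobs(3))            # 3
```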


=item B<--koral|-k>

Version of the output format. Supported versions are:
C<0> for legacy serialization, C<0.03> for serialization
with metadata fields as key-values on the root object,
C<0.4> for serialization with metadata fields as a list
of C<"@type":"koral:field"> objects.

Currently defaults to C<0.03>.


=item B<--sequential-extraction|-se>

Flag to indicate whether the C<jobs> value also applies to extraction.
Some systems may have problems with extracting multiple archives
to the same folder at the same time.
Can be flagged using C<--no-sequential-extraction> as well.
Defaults to C<false>.


=item B<--meta|-m>

Define the metadata parser to use. Defaults to C<I5>.
Metadata parsers can be defined in the C<KorAP::XML::Meta> namespace.
This is I<experimental>.


=item B<--gzip|-z>

Compress the output.
Expects a defined C<output> file in single processing.


=item B<--cache|-c>

File to mmap a cache (using L<Cache::FastMmap>).
Defaults to C<korapxml2krill.cache> in the calling directory.


=item B<--cache-size|-cs>

Size of the cache. Defaults to C<50m>.


=item B<--cache-init|-ci>

Initialize cache file.
Can be flagged using C<--no-cache-init> as well.
Defaults to C<true>.


=item B<--cache-delete|-cd>

Delete cache file after processing.
Can be flagged using C<--no-cache-delete> as well.
Defaults to C<true>.


=item B<--config|-cfg>

Configure the parameters of your call in a file
of key-value pairs with whitespace separator

  overwrite 1
  token     DeReKo#Structure
  ...

Supported parameters are:
C<overwrite>, C<gzip>, C<jobs>, C<input-base>,
C<token>, C<log>, C<cache>, C<cache-size>, C<cache-delete>, C<meta>,
C<output>, C<koral>,
C<temporary-extract>, C<sequential-extraction>,
C<base-sentences>, C<base-paragraphs>,
C<base-pagebreaks>,
C<skip> (semicolon separated), C<sigle>
(semicolon separated), C<anno> (semicolon separated).

Parameters passed on the command line always take precedence
over parameters from the configuration file.
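
How such a file relates to command line parameters can be sketched in
Python. This is an illustration under assumptions, not the tool's parser:
the function names are hypothetical, and comment lines or quoting in the
configuration file are not considered.

```python
# Parameters documented as semicolon separated
LIST_KEYS = {"skip", "sigle", "anno"}

def parse_config(text):
    """Parse whitespace separated 'key value' lines into a dict."""
    config = {}
    for line in text.splitlines():
        parts = line.strip().split(None, 1)
        if not parts:
            continue  # ignore empty lines
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        if key in LIST_KEYS:
            config[key] = [v.strip() for v in value.split(";") if v.strip()]
        else:
            config[key] = value
    return config

def effective_options(config, cli_options):
    """Command line values take precedence over configuration values."""
    merged = dict(config)
    merged.update({k: v for k, v in cli_options.items() if v is not None})
    return merged

cfg = parse_config("overwrite 1\ntoken DeReKo#Structure\nskip Mate;XIP#Morpho\n")
print(effective_options(cfg, {"token": "OpenNLP#tokens", "jobs": None}))
# {'overwrite': '1', 'token': 'OpenNLP#tokens', 'skip': ['Mate', 'XIP#Morpho']}
```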


=item B<--temporary-extract|-te>

Only valid for the C<archive> command.

This will first extract all files into a
directory and then process the extracted files.
If the directory is given as C<:temp:>,
a temporary directory is used.
This is especially useful to avoid
massive unzipping and potential
network latency.


=item B<--to-tar>

Only valid for the C<archive> command.

Writes the output into a tar archive.


=item B<--sigle|-sg>

Extract the given texts.
Can be set multiple times.
I<Currently only supported on C<extract>.>
Sigles have the structure C<Corpus>/C<Document>/C<Text>.
In case the C<Text> path is omitted, the whole document will be extracted.
On the document level, the postfix wildcard C<*> is supported.
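
The selection rules for sigles can be sketched as a small matching
function in Python. The function name is hypothetical, and details such
as combining a document wildcard with a text identifier are assumptions
of this sketch rather than documented behaviour.

```python
def sigle_matches(pattern, text_sigle):
    """Check whether a text sigle is selected by a --sigle value.

    pattern:    'Corpus/Document/Text', 'Corpus/Document' or 'Corpus/Doc*'
    text_sigle: full 'Corpus/Document/Text' sigle of a text
    """
    corpus, doc, text = text_sigle.split("/")
    parts = pattern.split("/")
    if len(parts) not in (2, 3) or parts[0] != corpus:
        return False
    doc_pattern = parts[1]
    if doc_pattern.endswith("*"):
        # Postfix wildcard on the document level
        if not doc.startswith(doc_pattern[:-1]):
            return False
    elif doc_pattern != doc:
        return False
    # Without a text part, the whole document is selected
    return len(parts) == 2 or parts[2] == text

print(sigle_matches("WPD17/060", "WPD17/060/18486"))  # True
print(sigle_matches("WPD17/06*", "WPD17/060/18486"))  # True
print(sigle_matches("WPD17/061", "WPD17/060/18486"))  # False
```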


=item B<--log|-l>

The L<Log::Any> log level, defaults to C<ERROR>.


=item B<--help|-h>

Print help information.


=item B<--version|-v>

Print version information.

=back


=head1 ANNOTATION SUPPORT

L<KorAP::XML::Krill> has built-in importers for some annotation foundries and layers
developed in the KorAP project that are part of the KorAP preprocessing pipeline.
The base foundry with paragraphs, sentences, and the text element is mandatory for
L<Krill|https://github.com/KorAP/Krill>.

  Base
    #Paragraphs
    #Sentences

  Connexor
    #Morpho
    #Phrase
    #Sentences
    #Syntax

  CoreNLP
    #Constituency
    #Morpho
    #NamedEntities
    #Sentences

  CMC
    #Morpho

  DeReKo
    #Structure

  DGD
    #Morpho
    #Structure

  DRuKoLa
    #Morpho

  Gingko
    #Morpho

  Glemm
    #Morpho

  HNC
    #Morpho

  LWC
    #Dependency

  Malt
    #Dependency

  MarMoT
    #Morpho

  Mate
    #Dependency
    #Morpho

  MDParser
    #Dependency

  OpenNLP
    #Morpho
    #Sentences

  RWK
    #Morpho
    #Structure

  Sgbr
    #Lemma
    #Morpho

  Talismane
    #Dependency
    #Morpho

  TreeTagger
    #Morpho
    #Sentences

  XIP
    #Constituency
    #Morpho
    #Sentences


More importers are in preparation.
New annotation importers can be defined in the C<KorAP::XML::Annotation> namespace.
See the built-in annotation importers as examples.


=head1 METADATA SUPPORT

L<KorAP::XML::Krill> has built-in importers for some metadata variants
developed in the KorAP project that are part of the KorAP preprocessing pipeline.

=over 2

=item I5 - Metadata for all I5 files

=item Sgbr - Metadata from the Schreibgebrauch project

=item Gingko - Metadata from the Gingko project in addition to I5

=back

More importers are in preparation.
New metadata importers can be defined in the C<KorAP::XML::Meta> namespace.
See the built-in metadata importers as examples.


=head1 About KorAP-XML

KorAP-XML (Bański et al. 2012) is an implementation of the KorAP
data model (Bański et al. 2013), where text data are stored physically
separated from their interpretations (i.e. annotations).
A text document in KorAP-XML therefore consists of several files
containing primary data, metadata and annotations.

The structure of a single KorAP-XML document can be as follows:

  - data.xml
  - header.xml
    + base
      - tokens.xml
      - ...
    + struct
      - structure.xml
      - ...
    + corenlp
      - morpho.xml
      - constituency.xml
      - ...
    + tree_tagger
      - morpho.xml
      - ...
    - ...

The C<data.xml> contains the primary data, the C<header.xml> contains
the metadata, and the annotation layers are stored in subfolders
like C<base>, C<struct> or C<corenlp>
(so-called "foundries"; Bański et al. 2013).

Metadata is available in the TEI-P5 variant I5
(Lüngen and Sperberg-McQueen 2012). See the documentation in
L<KorAP::XML::Meta::I5> for translatable fields.

Annotations correspond to a variant of the TEI-P5 feature structures
(TEI Consortium; Lee et al. 2004).
Annotation feature structures refer to character sequences of the primary text
inside the C<text> element of the C<data.xml>.
A single annotation containing the lemma of a token can have the following structure:

  <span from="0" to="3">
    <fs type="lex" xmlns="http://www.tei-c.org/ns/1.0">
      <f name="lex">
        <fs>
          <f name="lemma">zum</f>
        </fs>
      </f>
    </fs>
  </span>

The C<from> and C<to> attributes refer to the character span
in the primary text.
Depending on the kind of annotation (e.g. token-based, span-based, relation-based),
the structure may vary. See L<KorAP::XML::Annotation::*> for various
annotation preprocessors.
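
How such a span relates to the primary text can be illustrated with a
short Python sketch using the standard library XML parser. This is not
how L<KorAP::XML::Krill> reads annotations; the function name and the
sample primary text are invented for illustration, and treating C<to> as
an exclusive end offset is an assumption.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

SPAN = """
<span from="0" to="3">
  <fs type="lex" xmlns="http://www.tei-c.org/ns/1.0">
    <f name="lex">
      <fs>
        <f name="lemma">zum</f>
      </fs>
    </f>
  </fs>
</span>
"""

def read_lemma(span_xml, primary_text):
    """Return (surface form, lemma) for a single span annotation."""
    span = ET.fromstring(span_xml)
    # Character offsets into the primary text (end assumed exclusive)
    start, end = int(span.get("from")), int(span.get("to"))
    lemma = None
    for feature in span.iter(TEI + "f"):
        if feature.get("name") == "lemma":
            lemma = feature.text
    return primary_text[start:end], lemma

# The primary text is a made-up example
print(read_lemma(SPAN, "Zum letzten kulturellen Anlass"))
# ('Zum', 'zum')
```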

Multiple KorAP-XML documents are organized on three levels following
the "IDS Textmodell" (Lüngen and Sperberg-McQueen 2012):
corpus E<gt> document E<gt> text. On each level, metadata
can be stored, which C<korapxml2krill> will merge into a single metadata
object per text. A corpus is therefore structured as follows:

  + <corpus>
    - header.xml
    + <document>
      - header.xml
      + <text>
        - data.xml
        - header.xml
        - ...
    - ...

A single text can be identified by the concatenation of
the corpus identifier, the document identifier and the text identifier.
This identifier is called the text sigle
(e.g. a text with the identifier C<18486> in the document C<060> in the
corpus C<WPD17> has the text sigle C<WPD17/060/18486>, see C<--sigle>).
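
The relation between the folder structure and text sigles can be sketched
in Python. The function name is hypothetical, and the sketch assumes an
unpacked corpus in which the folder names equal the identifiers and every
text folder contains a C<data.xml>.

```python
import os

def text_sigles(corpus_dir):
    """Yield text sigles for all text folders below a corpus folder.

    A folder counts as a text folder if it sits two levels below the
    corpus folder and contains a data.xml file.
    """
    corpus = os.path.basename(os.path.normpath(corpus_dir))
    for doc in sorted(os.listdir(corpus_dir)):
        doc_path = os.path.join(corpus_dir, doc)
        if not os.path.isdir(doc_path):
            continue  # e.g. the corpus level header.xml
        for text in sorted(os.listdir(doc_path)):
            if os.path.isfile(os.path.join(doc_path, text, "data.xml")):
                yield f"{corpus}/{doc}/{text}"
```

For a corpus folder C<WPD17> containing C<060/18486/data.xml>, this would
yield C<WPD17/060/18486>.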

These corpora are often stored in zip files, which C<korapxml2krill>
can process directly. Corpora may also be split into multiple zip archives
(e.g. one zip file per foundry), which is also supported (see C<--input>).

Examples of KorAP-XML files are included in L<KorAP::XML::Krill>
in the form of a test suite.
The resulting JSON format merges all annotation layers
based on a single token stream.

=head2 References

Piotr Bański, Cyril Belica, Helge Krause, Marc Kupietz, Carsten Schnober, Oliver Schonefeld, and Andreas Witt (2011):
KorAP data model: first approximation, December.

Piotr Bański, Peter M. Fischer, Elena Frick, Erik Ketzan, Marc Kupietz, Carsten Schnober, Oliver Schonefeld and Andreas Witt (2012):
"The New IDS Corpus Analysis Platform: Challenges and Prospects",
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012).
L<PDF|http://www.lrec-conf.org/proceedings/lrec2012/pdf/789_Paper.pdf>

Piotr Bański, Elena Frick, Michael Hanl, Marc Kupietz, Carsten Schnober and Andreas Witt (2013):
"Robust corpus architecture: a new look at virtual collections and data access",
Corpus Linguistics 2013. Abstract Book. Lancaster: UCREL, pp. 23-25.
L<PDF|https://ids-pub.bsz-bw.de/frontdoor/deliver/index/docId/4485/file/Ba%c5%84ski_Frick_Hanl_Robust_corpus_architecture_2013.pdf>

Kiyong Lee, Lou Burnard, Laurent Romary, Eric de la Clergerie, Thierry Declerck,
Syd Bauman, Harry Bunt, Lionel Clément, Tomaz Erjavec, Azim Roussanaly and Claude Roux (2004):
"Towards an international standard on feature structure representation",
Proceedings of the fourth International Conference on Language Resources and Evaluation (LREC 2004),
pp. 373-376.
L<PDF|http://www.lrec-conf.org/proceedings/lrec2004/pdf/687.pdf>

Harald Lüngen and C. M. Sperberg-McQueen (2012):
"A TEI P5 Document Grammar for the IDS Text Model",
Journal of the Text Encoding Initiative, Issue 3 | November 2012.
L<PDF|https://journals.openedition.org/jtei/pdf/508>

TEI Consortium, eds:
"Feature Structures",
Guidelines for Electronic Text Encoding and Interchange.
L<html|https://www.tei-c.org/release/doc/tei-p5-doc/en/html/FS.html>

=head1 AVAILABILITY

  https://github.com/KorAP/KorAP-XML-Krill


=head1 COPYRIGHT AND LICENSE

Copyright (C) 2015-2022, L<IDS Mannheim|https://www.ids-mannheim.de/>

Author: L<Nils Diewald|https://www.nils-diewald.de/>

Contributor: Eliza Margaretha

L<KorAP::XML::Krill> is developed as part of the L<KorAP|https://korap.ids-mannheim.de/>
Corpus Analysis Platform at the
L<Leibniz Institute for the German Language (IDS)|https://www.ids-mannheim.de/>,
member of the
L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/>.

This program is free software published under the
L<BSD-2 License|https://opensource.org/licenses/BSD-2-Clause>.

=cut