=pod

=encoding utf8

=head1 NAME

korapxml2krill - Merge KorAP-XML data and create Krill documents


=head1 SYNOPSIS

  korapxml2krill [archive|extract] --input <directory|archive> [options]


=head1 DESCRIPTION

L<KorAP::XML::Krill> is a library to convert KorAP-XML documents to files
compatible with the L<Krill|https://github.com/KorAP/Krill> indexer.
The C<korapxml2krill> command line tool is a simple wrapper of this library.


=head1 INSTALLATION

The preferred way to install L<KorAP::XML::Krill> is to use L<cpanm|App::cpanminus>.

  $ cpanm https://github.com/KorAP/KorAP-XML-Krill.git

If everything went well, the C<korapxml2krill> tool will
be available on your command line immediately.
The minimum requirement for L<KorAP::XML::Krill> is Perl 5.16.
L<Sys::Info> is optionally supported to determine the number of available cores.
In addition, to work with zip archives, the C<unzip> tool needs to be present.

=head1 ARGUMENTS

  $ korapxml2krill -z --input <directory> --output <filename>

Without one of the arguments listed below, C<korapxml2krill> converts a directory of a single KorAP-XML document.
It expects the input to point to the text level folder.

=over 2

=item B<archive>

  $ korapxml2krill archive -z --input <directory|archive> --output <directory|tar>

Converts an archive of KorAP-XML documents. It expects a directory
(pointing to the corpus level folder) or one or more zip files as input.

=item B<extract>

  $ korapxml2krill extract --input <archive> --output <directory> --sigle <SIGLE>

Extracts KorAP-XML documents from a zip file.

=item B<serial>

  $ korapxml2krill serial -i <archive1> -i <archive2> -o <directory> -cfg <config-file>

Converts archives sequentially. The inputs are not merged but treated
as they are (so they may be premerged or globs).
The C<--output> directory is treated as the base directory where subdirectories
are created based on the archive name. In case the C<--to-tar> flag is given,
the output will be a tar file.


=back


=head1 OPTIONS

=over 2

=item B<--input|-i> <directory|zip file>

Directory or zip file(s) of documents to convert.

Without arguments (see L</ARGUMENTS>), C<korapxml2krill> expects a folder of a single KorAP-XML
document, while C<archive> expects a KorAP-XML corpus folder or a zip
file to batch process multiple files.
C<extract> expects zip files only.

C<archive> supports multiple input zip files with the constraint
that the first archive listed contains all primary data files
and all metadata files.

  -i file/news.zip -i file/news.malt.zip -i "#file/news.tt.zip"

Input may also be defined using BSD glob wildcards.

  -i 'file/news*.zip'

The expanded input list is sorted by path length, so the shortest
path needs to contain all primary data files and all metadata files.

(The directory structure follows the base directory format,
which may include a C<.> root folder.
In this case, further archives lacking a C<.> root folder
need to be passed with a hash sign in front of the archive's name.
This may require quoting the parameter.)

To support zip files, a version of C<unzip> needs to be installed that is
compatible with the archive file.

B<The root folder switch using the hash sign is experimental and
may vanish in future versions.>


=item B<--input-base|-ib> <directory>

The base directory for inputs.


=item B<--output|-o> <directory|file>

Output folder for archive processing or
document name for single output (optional).
Writes to C<STDOUT> by default
(in case C<output> is not mandatory due to further options).

=item B<--overwrite|-w>

Overwrite files that already exist.


=item B<--token|-t> <foundry>#<file>

Define the default tokenization by specifying
the name of the foundry and optionally the name
of the layer-file. Defaults to C<OpenNLP#tokens>.
This will directly take the file instead of running
the layer implementation!


=item B<--base-sentences|-bs> <foundry>#<layer>

Define the layer for base sentences.
If given, this will be used instead of using C<Base#Sentences>.
Currently C<DeReKo#Structure> and C<DGD#Structure> are the only additional
layers supported.

Defaults to unset.


=item B<--base-paragraphs|-bp> <foundry>#<layer>

Define the layer for base paragraphs.
If given, this will be used instead of using C<Base#Paragraphs>.
Currently C<DeReKo#Structure> is the only additional layer supported.

Defaults to unset.


=item B<--base-pagebreaks|-bpb> <foundry>#<layer>

Define the layer for base pagebreaks.
Currently C<DeReKo#Structure> is the only layer supported.

Defaults to unset.


=item B<--skip|-s> <foundry>[#<layer>]

Skip specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>. Alternatively you can skip C<#ALL>.
Can be set multiple times.


=item B<--anno|-a> <foundry>#<layer>

Convert specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>.
Can be set multiple times.


=item B<--non-word-tokens|-nwt>

Tokenize non-word tokens like word tokens (defined as matching
C</[\d\w]/>). Useful to treat punctuation as tokens.

Defaults to unset.


=item B<--non-verbal-tokens|-nvt>

Tokenize non-verbal tokens marked in the primary data with
the Unicode symbol 'Black Vertical Rectangle' (U+25AE).

Defaults to unset.


=item B<--jobs|-j>

Define the number of concurrent jobs in separate forks
for archive processing.
Defaults to C<0> (everything runs in a single process).

If C<sequential-extraction> is not set to false, this will
also apply to extraction.

Pass C<-1>, and the value will be set automatically to 5
times the number of available cores, in case L<Sys::Info>
is available.
This is I<experimental>.


=item B<--koral|-k>

Version of the output format. Supported versions are:
C<0> for legacy serialization, C<0.03> for serialization
with metadata fields as key-values on the root object,
C<0.4> for serialization with metadata fields as a list
of C<"@type":"koral:field"> objects.

Currently defaults to C<0.03>.


=item B<--sequential-extraction|-se>

Flag to indicate whether the C<jobs> value also applies to extraction.
Some systems may have problems with extracting multiple archives
to the same folder at the same time.
Can be flagged using C<--no-sequential-extraction> as well.
Defaults to C<false>.


=item B<--meta|-m>

Define the metadata parser to use. Defaults to C<I5>.
Metadata parsers can be defined in the C<KorAP::XML::Meta> namespace.
This is I<experimental>.


=item B<--gzip|-z>

Compress the output.
Expects a defined C<output> file in single processing.


=item B<--cache|-c>

File to mmap a cache (using L<Cache::FastMmap>).
Defaults to C<korapxml2krill.cache> in the calling directory.


=item B<--cache-size|-cs>

Size of the cache. Defaults to C<50m>.


=item B<--cache-init|-ci>

Initialize cache file.
Can be flagged using C<--no-cache-init> as well.
Defaults to C<true>.


=item B<--cache-delete|-cd>

Delete cache file after processing.
Can be flagged using C<--no-cache-delete> as well.
Defaults to C<true>.


=item B<--config|-cfg>

Configure the parameters of your call in a file
of key-value pairs separated by whitespace:

  overwrite 1
  token DeReKo#Structure
  ...

Supported parameters are:
C<overwrite>, C<gzip>, C<jobs>, C<input-base>,
C<token>, C<log>, C<cache>, C<cache-size>, C<cache-delete>, C<meta>,
C<output>, C<koral>,
C<temporary-extract>, C<sequential-extraction>,
C<base-sentences>, C<base-paragraphs>,
C<base-pagebreaks>,
C<skip> (semicolon separated), C<sigle>
(semicolon separated), C<anno> (semicolon separated).

Configuration parameters will always be overwritten by
passed parameters.


=item B<--temporary-extract|-te>

Only valid for the C<archive> command.

This will first extract all files into a
directory and then process them with C<archive>.
If the directory is given as C<:temp:>,
a temporary directory is used.
This is especially useful to avoid
massive unzipping and potential
network latency.


=item B<--to-tar>

Only valid for the C<archive> command.

Writes the output into a tar archive.


=item B<--sigle|-sg>

Extract the given texts.
Can be set multiple times.
I<Currently only supported on C<extract>.>
Sigles have the structure C<Corpus>/C<Document>/C<Text>.
In case the C<Text> path is omitted, the whole document will be extracted.
On the document level, the postfix wildcard C<*> is supported.


=item B<--log|-l>

The L<Log4perl> log level, defaults to C<ERROR>.


=item B<--help|-h>

Print help information.


=item B<--version|-v>

Print version information.

=back
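
The file format described under C<--config> can be sketched as follows. This is a minimal example under stated assumptions: the file location and all values are hypothetical, only parameter names from the supported list are used, and the final invocation is left as a comment because it needs real input data.

```shell
# Write a hypothetical configuration file: one parameter per line,
# key and value separated by whitespace.
conf=$(mktemp)
cat > "$conf" <<'EOF'
overwrite 1
gzip 1
jobs 4
token OpenNLP#tokens
skip Mate#Morpho;XIP
EOF

# Read the pairs back to check the layout
# (first field: key, rest of the line: value).
pairs=$(awk '{ key = $1; $1 = ""; sub(/^ +/, ""); print key "=" $0 }' "$conf")
printf '%s\n' "$pairs"

# Assumed invocation (not run here; input and output paths are placeholders):
# korapxml2krill archive --config "$conf" --input corpus.zip --output out/
```

Any parameter that is also passed on the command line would replace the value from the file, as stated for C<--config>.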


=head1 ANNOTATION SUPPORT

L<KorAP::XML::Krill> has built-in importers for some annotation foundries and layers
developed in the KorAP project that are part of the KorAP preprocessing pipeline.
The base foundry with paragraphs, sentences, and the text element is mandatory for
L<Krill|https://github.com/KorAP/Krill>.

  Base
    #Paragraphs
    #Sentences

  Connexor
    #Morpho
    #Phrase
    #Sentences
    #Syntax

  CoreNLP
    #Constituency
    #Morpho
    #NamedEntities
    #Sentences

  CMC
    #Morpho

  DeReKo
    #Structure

  DGD
    #Morpho
    #Structure

  DRuKoLa
    #Morpho

  Glemm
    #Morpho

  HNC
    #Morpho

  LWC
    #Dependency

  Malt
    #Dependency

  MarMoT
    #Morpho

  Mate
    #Dependency
    #Morpho

  MDParser
    #Dependency

  OpenNLP
    #Morpho
    #Sentences

  RWK
    #Morpho
    #Structure

  Sgbr
    #Lemma
    #Morpho

  Talismane
    #Dependency
    #Morpho

  TreeTagger
    #Morpho
    #Sentences

  XIP
    #Constituency
    #Morpho
    #Sentences


More importers are in preparation.
New annotation importers can be defined in the C<KorAP::XML::Annotation> namespace.
See the built-in annotation importers as examples.


=head1 About KorAP-XML

KorAP-XML (Bański et al. 2012) is an implementation of the KorAP
data model (Bański et al. 2013), where text data are stored physically
separated from their interpretations (i.e. annotations).
A text document in KorAP-XML therefore consists of several files
containing primary data, metadata and annotations.

The structure of a single KorAP-XML document can be as follows:

  - data.xml
  - header.xml
  + base
    - tokens.xml
    - ...
  + struct
    - structure.xml
    - ...
  + corenlp
    - morpho.xml
    - constituency.xml
    - ...
  + tree_tagger
    - morpho.xml
    - ...
  - ...

The C<data.xml> contains the primary data, the C<header.xml> contains
the metadata, and the annotation layers are stored in subfolders
like C<base>, C<struct> or C<corenlp>
(so-called "foundries"; Bański et al. 2013).
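
To make the folder layout concrete, the following sketch builds an empty skeleton of a single text folder and lists its foundries. It only mirrors the file and folder names from the listing above; the scratch directory is an assumption, and the empty files are placeholders that are not valid KorAP-XML.

```shell
# Build an empty skeleton of a single text folder in a scratch directory.
root=$(mktemp -d)
mkdir -p "$root/base" "$root/struct" "$root/corenlp" "$root/tree_tagger"
touch "$root/data.xml" "$root/header.xml" \
      "$root/base/tokens.xml" "$root/struct/structure.xml" \
      "$root/corenlp/morpho.xml" "$root/corenlp/constituency.xml" \
      "$root/tree_tagger/morpho.xml"

# Every subfolder of the text folder is treated as a foundry here.
foundries=$(cd "$root" && for d in */; do printf '%s\n' "${d%/}"; done | LC_ALL=C sort)
printf '%s\n' "$foundries"
```

The files inside each listed folder correspond to the annotation layers of that foundry.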

Metadata is available in the TEI-P5 variant I5
(Lüngen and Sperberg-McQueen 2012). See the documentation in
L<KorAP::XML::Meta::I5> for translatable fields.

Annotations correspond to a variant of the TEI-P5 feature structures
(TEI Consortium; Lee et al. 2004).
Annotation feature structures refer to character sequences of the primary text
inside the C<text> element of the C<data.xml>.
A single annotation containing the lemma of a token can have the following structure:

  <span from="0" to="3">
    <fs type="lex" xmlns="http://www.tei-c.org/ns/1.0">
      <f name="lex">
        <fs>
          <f name="lemma">zum</f>
        </fs>
      </f>
    </fs>
  </span>

The C<from> and C<to> attributes refer to the character span
in the primary text.
Depending on the kind of annotation (e.g. token-based, span-based, relation-based),
the structure may vary. See L<KorAP::XML::Annotation::*> for various
annotation preprocessors.
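
How such a span maps onto the primary text can be sketched in a few lines. The sample sentence is hypothetical, and the sketch assumes zero-based offsets with an exclusive end, which is an inference from the three-letter lemma covering C<from="0"> to C<to="3"> and not something stated here.

```shell
# Hypothetical primary text; the real text is in the <text> element of data.xml.
text='Zum letzten kulturellen Anlass'
from=0
to=3

# awk's substr() is 1-based, so shift the start and pass the span length.
span=$(awk -v s="$text" -v f="$from" -v t="$to" \
  'BEGIN { print substr(s, f + 1, t - f) }')
printf '%s\n' "$span"
```

The sample is plain ASCII on purpose: whether C<substr> counts characters or bytes depends on the awk implementation and locale, while the offsets in KorAP-XML refer to characters.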

Multiple KorAP-XML documents are organized on three levels following
the "IDS Textmodell" (Lüngen and Sperberg-McQueen 2012):
corpus E<gt> document E<gt> text. On each level, metadata information
can be stored, which C<korapxml2krill> will merge into a single metadata
object per text. A corpus is therefore structured as follows:

  + <corpus>
    - header.xml
    + <document>
      - header.xml
      + <text>
        - data.xml
        - header.xml
        - ...
    - ...

A single text can be identified by the concatenation of
the corpus identifier, the document identifier and the text identifier.
This identifier is called the text sigle
(e.g. a text with the identifier C<18486> in the document C<060> in the
corpus C<WPD17> has the text sigle C<WPD17/060/18486>, see C<--sigle>).
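
The relation between a text sigle and the corpus layout can be sketched as follows. The sigle is the example from the paragraph above; the variable names are illustrative, and the derived path simply follows the corpus E<gt> document E<gt> text folder structure.

```shell
sigle='WPD17/060/18486'

# Split the sigle at the slashes with POSIX parameter expansion.
corpus=${sigle%%/*}
rest=${sigle#*/}
document=${rest%%/*}
text=${rest#*/}
printf 'corpus=%s document=%s text=%s\n' "$corpus" "$document" "$text"

# Relative location of the primary data inside the corpus folder.
data_path="$corpus/$document/$text/data.xml"
printf '%s\n' "$data_path"
```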

These corpora are often stored in zip files, which C<korapxml2krill>
can handle. Corpora may also be split into multiple zip archives
(e.g. one zip file per foundry), which is also supported (see C<--input>).
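
When split archives are passed as a glob, C<--input> states that the resulting list is sorted by length, so it can help to preview that order before a run. The sketch below imitates a length sort with standard tools; it is an approximation, since how the tool itself orders paths of equal length is not documented here. The archive names are the ones used in the C<--input> example.

```shell
# Archive names as they might result from expanding 'file/news*.zip'.
inputs='file/news.tt.zip
file/news.zip
file/news.malt.zip'

# Prefix each path with its length, sort numerically, drop the prefix.
sorted=$(printf '%s\n' "$inputs" \
  | awk '{ print length($0), $0 }' | sort -n | cut -d' ' -f2-)
printf '%s\n' "$sorted"
```

The first path in the sorted list is the archive that has to contain all primary data and metadata files.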

Examples for KorAP-XML files are included in L<KorAP::XML::Krill>
in the form of a test suite.
The resulting JSON format merges all annotation layers
based on a single token stream.

=head2 References

Piotr Bański, Cyril Belica, Helge Krause, Marc Kupietz, Carsten Schnober, Oliver Schonefeld, and Andreas Witt (2011):
KorAP data model: first approximation, December.

Piotr Bański, Peter M. Fischer, Elena Frick, Erik Ketzan, Marc Kupietz, Carsten Schnober, Oliver Schonefeld and Andreas Witt (2012):
"The New IDS Corpus Analysis Platform: Challenges and Prospects",
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012).
L<PDF|http://www.lrec-conf.org/proceedings/lrec2012/pdf/789_Paper.pdf>

Piotr Bański, Elena Frick, Michael Hanl, Marc Kupietz, Carsten Schnober and Andreas Witt (2013):
"Robust corpus architecture: a new look at virtual collections and data access",
Corpus Linguistics 2013. Abstract Book. Lancaster: UCREL, pp. 23-25.
L<PDF|https://ids-pub.bsz-bw.de/frontdoor/deliver/index/docId/4485/file/Ba%c5%84ski_Frick_Hanl_Robust_corpus_architecture_2013.pdf>

Kiyong Lee, Lou Burnard, Laurent Romary, Eric de la Clergerie, Thierry Declerck,
Syd Bauman, Harry Bunt, Lionel Clément, Tomaz Erjavec, Azim Roussanaly and Claude Roux (2004):
"Towards an international standard on feature structure representation",
Proceedings of the fourth International Conference on Language Resources and Evaluation (LREC 2004),
pp. 373-376.
L<PDF|http://www.lrec-conf.org/proceedings/lrec2004/pdf/687.pdf>

Harald Lüngen and C. M. Sperberg-McQueen (2012):
"A TEI P5 Document Grammar for the IDS Text Model",
Journal of the Text Encoding Initiative, Issue 3 | November 2012.
L<PDF|https://journals.openedition.org/jtei/pdf/508>

TEI Consortium, eds:
"Feature Structures",
Guidelines for Electronic Text Encoding and Interchange.
L<html|https://www.tei-c.org/release/doc/tei-p5-doc/en/html/FS.html>

=head1 AVAILABILITY

  https://github.com/KorAP/KorAP-XML-Krill


=head1 COPYRIGHT AND LICENSE

Copyright (C) 2015-2020, L<IDS Mannheim|https://www.ids-mannheim.de/>

Author: L<Nils Diewald|https://nils-diewald.de/>

Contributor: Eliza Margaretha

L<KorAP::XML::Krill> is developed as part of the L<KorAP|http://korap.ids-mannheim.de/>
Corpus Analysis Platform at the
L<Leibniz Institute for the German Language (IDS)|http://ids-mannheim.de/>,
member of the
L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/>.

This program is free software published under the
L<BSD-2 License|https://raw.githubusercontent.com/KorAP/KorAP-XML-Krill/master/LICENSE>.

=cut