=pod

=encoding utf8

=head1 NAME

korapxml2krill - Merge KorAP-XML data and create Krill documents


=head1 SYNOPSIS

  korapxml2krill [archive|extract] --input <directory|archive> [options]


=head1 DESCRIPTION

L<KorAP::XML::Krill> is a library to convert KorAP-XML documents to files
compatible with the L<Krill|https://github.com/KorAP/Krill> indexer.
The C<korapxml2krill> command line tool is a simple wrapper around this library.


=head1 INSTALLATION

The preferred way to install L<KorAP::XML::Krill> is to use L<cpanm|App::cpanminus>.

  $ cpanm https://github.com/KorAP/KorAP-XML-Krill.git

In case everything went well, the C<korapxml2krill> tool will
be available on your command line immediately.
Minimum requirement for L<KorAP::XML::Krill> is Perl 5.16.
L<Sys::Info> is optionally supported to calculate the number of available cores.
In addition, to work with zip archives, the C<unzip> tool needs to be present.

=head1 ARGUMENTS

  $ korapxml2krill -z --input <directory> --output <filename>

Without arguments, C<korapxml2krill> converts a directory of a single KorAP-XML document.
It expects the input to point to the text level folder.

=over 2

=item B<archive>

  $ korapxml2krill archive -z --input <directory|archive> --output <directory|tar>

Converts an archive of KorAP-XML documents. It expects a directory
(pointing to the corpus level folder) or one or more zip files as input.

=item B<extract>

  $ korapxml2krill extract --input <archive> --output <directory> --sigle <SIGLE>

Extracts KorAP-XML documents from a zip file.

=item B<serial>

  $ korapxml2krill serial -i <archive1> -i <archive2> -o <directory> -cfg <config-file>

Converts archives sequentially. The inputs are not merged but treated
as they are (so they may be premerged or globs).
The C<--output> directory is treated as the base directory in which subdirectories
are created based on the archive name. In case the C<--to-tar> flag is given,
the output will be a tar file.


=item B<slimlog>

  $ korapxml2krill slimlog <logfile> > <logfile-slim>

Filters successful (and therefore uninformative) entries out of logs to simplify
log checks. Expects no further options.


=back


=head1 OPTIONS

=over 2

=item B<--input|-i> <directory|zip file>

Directory or zip file(s) of documents to convert.

Without arguments, C<korapxml2krill> expects a folder of a single KorAP-XML
document, while C<archive> expects a KorAP-XML corpus folder or a zip
file to batch process multiple files.
C<extract> expects zip files only.

C<archive> supports multiple input zip files with the constraint
that the first archive listed contains all primary data files
and all metadata files.

  -i file/news.zip -i file/news.malt.zip -i "#file/news.tt.zip"

Input may also be defined using BSD glob wildcards.

  -i 'file/news*.zip'

The expanded input array will be sorted by length, so the shortest
path needs to contain all primary data files and all metadata files.

(The directory structure follows the base directory format,
which may include a C<.> root folder.
In this case, further archives lacking a C<.> root folder
need to be passed with a hash sign in front of the archive's name.
This may require quoting the parameter.)

To support zip files, a version of C<unzip> needs to be installed that is
compatible with the archive file.

B<The root folder switch using the hash sign is experimental and
may vanish in future versions.>

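The length ordering applied to expanded glob inputs can be previewed before a run.
The following is a minimal sketch in POSIX shell and not part of C<korapxml2krill>:
the helper name C<sort_by_length> is hypothetical, and how the tool orders paths
of equal length is not documented here, so ties are simply left to C<sort>.

```shell
#!/bin/sh
# Hypothetical helper: print the given paths, shortest first,
# to preview which archive would be listed first.
sort_by_length() {
  for item in "$@"; do
    # Prefix each path with its length so sort can order numerically
    printf '%d %s\n' "${#item}" "$item"
  done | sort -n | cut -d' ' -f2-
}

sort_by_length file/news.malt.zip file/news.tt.zip file/news.zip
```

For the three sample paths this prints C<file/news.zip> first, so that archive
would have to contain the primary data and metadata files.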

=item B<--input-base|-ib> <directory>

The base directory for inputs.


=item B<--output|-o> <directory|file>

Output folder for archive processing or
document name for single output (optional).
Writes to C<STDOUT> by default
(in case C<output> is not mandatory due to further options).

=item B<--overwrite|-w>

Overwrite files that already exist.


=item B<--token|-t> <foundry>#<file>

Define the default tokenization by specifying
the name of the foundry and optionally the name
of the layer-file. Defaults to C<OpenNLP#tokens>.
This will directly take the file instead of running
the layer implementation!


=item B<--base-sentences|-bs> <foundry>#<layer>

Define the layer for base sentences.
If given, this will be used instead of C<Base#Sentences>.
Currently C<DeReKo#Structure> and C<DGD#Structure> are the only additional
layers supported.

Defaults to unset.


=item B<--base-paragraphs|-bp> <foundry>#<layer>

Define the layer for base paragraphs.
If given, this will be used instead of C<Base#Paragraphs>.
Currently C<DeReKo#Structure> and C<DGD#Structure> are the only additional
layers supported.

Defaults to unset.


=item B<--base-pagebreaks|-bpb> <foundry>#<layer>

Define the layer for base pagebreaks.
Currently C<DeReKo#Structure> is the only layer supported.

Defaults to unset.


=item B<--skip|-s> <foundry>[#<layer>]

Skip specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>. Alternatively you can skip C<#ALL>.
Can be set multiple times.


=item B<--anno|-a> <foundry>#<layer>

Convert specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>.
Can be set multiple times.

=item B<--non-word-tokens|-nwt>

Tokenize non-word tokens like word tokens (word tokens are defined as
matching C</[\d\w]/>). Useful to treat punctuation marks as tokens.

Defaults to unset.


=item B<--non-verbal-tokens|-nvt>

Tokenize non-verbal tokens marked in the primary data with
the Unicode symbol 'Black Vertical Rectangle' (U+25AE).

Defaults to unset.


=item B<--jobs|-j>

Define the number of concurrent jobs in separate forks
for archive processing.
Defaults to C<0> (everything runs in a single process).

Unless C<sequential-extraction> is set, this will
also apply to extraction.

Pass C<-1> and the value will be set automatically to 5
times the number of available cores, in case L<Sys::Info>
is available.
This is I<experimental>.

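The automatic value for C<--jobs -1> is five times the core count.
As a rough sketch of that arithmetic in POSIX shell: the tool itself relies on
L<Sys::Info>, whereas this example assumes C<getconf _NPROCESSORS_ONLN> as a
stand-in core counter and falls back to a single core if that fails.

```shell
#!/bin/sh
# Assumption: getconf stands in for the Sys::Info core detection.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || cores=1
case "$cores" in
  ''|*[!0-9]*) cores=1 ;;  # fall back if the value is not a number
esac

# Five jobs per available core, as described for --jobs -1
job_count=$((cores * 5))
echo "cores=$cores jobs=$job_count"
```

The printed value gives an idea of how many forks an automatic setting
would start on the current machine.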

=item B<--koral|-k>

Version of the output format. Supported versions are:
C<0> for legacy serialization, C<0.03> for serialization
with metadata fields as key-values on the root object,
C<0.4> for serialization with metadata fields as a list
of C<"@type":"koral:field"> objects.

Currently defaults to C<0.03>.


=item B<--sequential-extraction|-se>

Flag to indicate that archives are extracted sequentially,
i.e. that the C<jobs> value does not apply to extraction.
Some systems may have problems with extracting multiple archives
to the same folder at the same time.
Can be flagged using C<--no-sequential-extraction> as well.
Defaults to C<false>.


=item B<--meta|-m>

Define the metadata parser to use. Defaults to C<I5>.
Metadata parsers can be defined in the C<KorAP::XML::Meta> namespace.
This is I<experimental>.


=item B<--gzip|-z>

Compress the output.
Expects a defined C<output> file in single processing.


=item B<--cache|-c>

File to mmap a cache (using L<Cache::FastMmap>).
Defaults to C<korapxml2krill.cache> in the calling directory.


=item B<--cache-size|-cs>

Size of the cache. Defaults to C<50m>.


=item B<--cache-init|-ci>

Initialize cache file.
Can be flagged using C<--no-cache-init> as well.
Defaults to C<true>.


=item B<--cache-delete|-cd>

Delete cache file after processing.
Can be flagged using C<--no-cache-delete> as well.
Defaults to C<true>.


=item B<--config|-cfg>

Configure the parameters of your call in a file
of key-value pairs with whitespace separator:

  overwrite 1
  token DeReKo#Structure
  ...

Supported parameters are:
C<overwrite>, C<gzip>, C<jobs>, C<input-base>,
C<token>, C<log>, C<cache>, C<cache-size>, C<cache-delete>, C<meta>,
C<output>, C<koral>,
C<temporary-extract>, C<sequential-extraction>,
C<base-sentences>, C<base-paragraphs>,
C<base-pagebreaks>,
C<skip> (semicolon separated), C<sigle>
(semicolon separated), C<anno> (semicolon separated).

Configuration parameters will always be overridden by
parameters passed on the command line.

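A configuration file can be written once and reused across runs.
The sketch below is an illustration only: the file name C<krill.conf>, the
archive names, and the chosen values are hypothetical, and C<gzip 1> assumes
that flags are switched on in the same way as C<overwrite 1> above.
The final C<echo> only prints the call instead of running it.

```shell
#!/bin/sh
# Write a whitespace-separated key-value configuration
# (hypothetical file name and values).
cat > krill.conf <<'EOF'
overwrite 1
gzip 1
token DeReKo#Structure
skip Mate#Morpho;XIP
EOF

# Dry run: print the call; parameters passed on the command
# line would take precedence over the values in the file.
echo korapxml2krill serial -i news1.zip -i news2.zip -o out -cfg krill.conf
```

Remove the C<echo> to actually start the conversion once the archives exist.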

=item B<--temporary-extract|-te>

Only valid for the C<archive> command.

This will first extract all files into a
directory and then run the C<archive> conversion on that directory.
If the directory is given as C<:temp:>,
a temporary directory is used.
This is especially useful to avoid
massive unzipping and potential
network latency.


=item B<--to-tar>

Only valid for the C<archive> command.

Writes the output into a tar archive.


=item B<--sigle|-sg>

Extract the given texts.
Can be set multiple times.
I<Currently only supported on C<extract>.>
Sigles have the structure C<Corpus>/C<Document>/C<Text>.
In case the C<Text> path is omitted, the whole document will be extracted.
On the document level, the postfix wildcard C<*> is supported.

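A sigle can be split into its levels with plain shell input handling.
This is a minimal sketch with a hypothetical helper C<split_sigle>; it only
illustrates the C<Corpus>/C<Document>/C<Text> structure and does not reproduce
the wildcard matching performed by C<extract>.

```shell
#!/bin/sh
# Hypothetical helper: split a sigle into corpus, document and text.
# A missing text part stays empty, meaning the whole document.
split_sigle() {
  IFS=/ read -r corpus document text <<EOF
$1
EOF
  echo "corpus=$corpus document=$document text=$text"
}

split_sigle WPD17/060/18486
split_sigle WPD17/060
```

The second call leaves C<text> empty, which corresponds to a sigle that
addresses a whole document.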

=item B<--log|-l>

The L<Log::Any> log level, defaults to C<ERROR>.


=item B<--help|-h>

Print help information.


=item B<--version|-v>

Print version information.

=back


=head1 ANNOTATION SUPPORT

L<KorAP::XML::Krill> has built-in importers for some annotation foundries and layers
developed in the KorAP project that are part of the KorAP preprocessing pipeline.
The base foundry with paragraphs, sentences, and the text element is mandatory for
L<Krill|https://github.com/KorAP/Krill>.

  Base
    #Paragraphs
    #Sentences

  Connexor
    #Morpho
    #Phrase
    #Sentences
    #Syntax

  CoreNLP
    #Constituency
    #Morpho
    #NamedEntities
    #Sentences

  CMC
    #Morpho

  DeReKo
    #Structure

  DGD
    #Morpho
    #Structure

  DRuKoLa
    #Morpho

  Glemm
    #Morpho

  Gingko
    #Morpho

  HNC
    #Morpho

  LWC
    #Dependency

  Malt
    #Dependency

  MarMoT
    #Morpho

  Mate
    #Dependency
    #Morpho

  MDParser
    #Dependency

  OpenNLP
    #Morpho
    #Sentences

  RWK
    #Morpho
    #Structure

  Sgbr
    #Lemma
    #Morpho

  Talismane
    #Dependency
    #Morpho

  TreeTagger
    #Morpho
    #Sentences

  XIP
    #Constituency
    #Morpho
    #Sentences


More importers are in preparation.
New annotation importers can be defined in the C<KorAP::XML::Annotation> namespace.
See the built-in annotation importers as examples.


=head1 METADATA SUPPORT

L<KorAP::XML::Krill> has built-in importers for some metadata variants
developed in the KorAP project that are part of the KorAP preprocessing pipeline.

=over 2

=item I5 - Metadata for all I5 files

=item Sgbr - Metadata from the Schreibgebrauch project

=item Gingko - Metadata from the Gingko project in addition to I5

=back

More importers are in preparation.
New metadata importers can be defined in the C<KorAP::XML::Meta> namespace.
See the built-in metadata importers as examples.


=head1 About KorAP-XML

KorAP-XML (Bański et al. 2012) is an implementation of the KorAP
data model (Bański et al. 2013), where text data are stored physically
separated from their interpretations (i.e. annotations).
A text document in KorAP-XML therefore consists of several files
containing primary data, metadata and annotations.

The structure of a single KorAP-XML document can be as follows:

  - data.xml
  - header.xml
  + base
    - tokens.xml
    - ...
  + struct
    - structure.xml
    - ...
  + corenlp
    - morpho.xml
    - constituency.xml
    - ...
  + tree_tagger
    - morpho.xml
    - ...
  - ...

The C<data.xml> contains the primary data, the C<header.xml> contains
the metadata, and the annotation layers are stored in subfolders
like C<base>, C<struct> or C<corenlp>
(so-called "foundries"; Bański et al. 2013).

Metadata is available in the TEI-P5 variant I5
(Lüngen and Sperberg-McQueen 2012). See the documentation in
L<KorAP::XML::Meta::I5> for translatable fields.

Annotations correspond to a variant of the TEI-P5 feature structures
(TEI Consortium; Lee et al. 2004).
Annotation feature structures refer to character sequences of the primary text
inside the C<text> element of the C<data.xml>.
A single annotation containing the lemma of a token can have the following structure:

  <span from="0" to="3">
    <fs type="lex" xmlns="http://www.tei-c.org/ns/1.0">
      <f name="lex">
        <fs>
          <f name="lemma">zum</f>
        </fs>
      </f>
    </fs>
  </span>

The C<from> and C<to> attributes refer to the character span
in the primary text.
Depending on the kind of annotation (e.g. token-based, span-based, relation-based),
the structure may vary. See L<KorAP::XML::Annotation::*> for various
annotation preprocessors.

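To make the offsets concrete, the sketch below cuts an annotated span out of a
primary text in POSIX shell. It assumes, judging from the example above where
C<from="0"> and C<to="3"> cover the three characters of C<zum>, that C<from> is
zero-based and C<to> is exclusive. The sample text is invented, and C<cut -c> may
count bytes rather than characters in some implementations, so the sample is
kept to ASCII.

```shell
#!/bin/sh
# Invented primary text; only the first token matches the example span.
text='zum Beispiel'
from=0
to=3

# cut counts from 1 and includes the end position, so shift the start by one
span=$(printf '%s' "$text" | cut -c "$((from + 1))-$to")
echo "$span"
```

With the values above this prints C<zum>, the token the lemma annotation
refers to.
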
Multiple KorAP-XML documents are organized on three levels following
the "IDS Textmodell" (Lüngen and Sperberg-McQueen 2012):
corpus E<gt> document E<gt> text. Metadata can be stored on each level,
and C<korapxml2krill> merges it into a single metadata
object per text. A corpus is therefore structured as follows:

  + <corpus>
    - header.xml
    + <document>
      - header.xml
      + <text>
        - data.xml
        - header.xml
        - ...
    - ...

A single text can be identified by the concatenation of
the corpus identifier, the document identifier and the text identifier.
This identifier is called the text sigle
(e.g. a text with the identifier C<18486> in the document C<060> in the
corpus C<WPD17> has the text sigle C<WPD17/060/18486>, see C<--sigle>).

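Given this layout, the text sigles of an unpacked corpus can be listed by
locating the C<data.xml> files. The following POSIX shell sketch is an
illustration under assumptions rather than a description of how
C<korapxml2krill> works internally: the helper name C<list_sigles> is
hypothetical, and it expects the directory that contains the corpus folder.

```shell
#!/bin/sh
# Hypothetical helper: print one text sigle per data.xml found
# exactly three levels (corpus/document/text) below the given directory.
list_sigles() {
  (
    cd "$1" || exit 1
    for file in */*/*/data.xml; do
      [ -f "$file" ] || continue   # skip the unexpanded pattern
      printf '%s\n' "${file%/data.xml}"
    done
  )
}

# Build a tiny sample tree and list its sigles
base=$(mktemp -d)
mkdir -p "$base/WPD17/060/18486"
: > "$base/WPD17/060/18486/data.xml"
list_sigles "$base"
rm -rf "$base"
```

For the sample tree this prints C<WPD17/060/18486>, the sigle used in the
example above.
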
These corpora are often stored in zip files, which C<korapxml2krill>
can process. Corpora may also be split into multiple zip archives
(e.g. one zip file per foundry), which is also supported (see C<--input>).

Examples of KorAP-XML files are included in L<KorAP::XML::Krill>
in the form of a test suite.
The resulting JSON format merges all annotation layers
based on a single token stream.

=head2 References

Piotr Bański, Cyril Belica, Helge Krause, Marc Kupietz, Carsten Schnober, Oliver Schonefeld, and Andreas Witt (2011):
KorAP data model: first approximation, December.

Piotr Bański, Peter M. Fischer, Elena Frick, Erik Ketzan, Marc Kupietz, Carsten Schnober, Oliver Schonefeld and Andreas Witt (2012):
"The New IDS Corpus Analysis Platform: Challenges and Prospects",
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012).
L<PDF|http://www.lrec-conf.org/proceedings/lrec2012/pdf/789_Paper.pdf>

Piotr Bański, Elena Frick, Michael Hanl, Marc Kupietz, Carsten Schnober and Andreas Witt (2013):
"Robust corpus architecture: a new look at virtual collections and data access",
Corpus Linguistics 2013. Abstract Book. Lancaster: UCREL, pp. 23-25.
L<PDF|https://ids-pub.bsz-bw.de/frontdoor/deliver/index/docId/4485/file/Ba%c5%84ski_Frick_Hanl_Robust_corpus_architecture_2013.pdf>

Kiyong Lee, Lou Burnard, Laurent Romary, Eric de la Clergerie, Thierry Declerck,
Syd Bauman, Harry Bunt, Lionel Clément, Tomaz Erjavec, Azim Roussanaly and Claude Roux (2004):
"Towards an international standard on feature structure representation",
Proceedings of the fourth International Conference on Language Resources and Evaluation (LREC 2004),
pp. 373-376.
L<PDF|http://www.lrec-conf.org/proceedings/lrec2004/pdf/687.pdf>

Harald Lüngen and C. M. Sperberg-McQueen (2012):
"A TEI P5 Document Grammar for the IDS Text Model",
Journal of the Text Encoding Initiative, Issue 3 | November 2012.
L<PDF|https://journals.openedition.org/jtei/pdf/508>

TEI Consortium, eds:
"Feature Structures",
Guidelines for Electronic Text Encoding and Interchange.
L<html|https://www.tei-c.org/release/doc/tei-p5-doc/en/html/FS.html>

=head1 AVAILABILITY

  https://github.com/KorAP/KorAP-XML-Krill


=head1 COPYRIGHT AND LICENSE

Copyright (C) 2015-2022, L<IDS Mannheim|https://www.ids-mannheim.de/>

Author: L<Nils Diewald|https://www.nils-diewald.de/>

Contributor: Eliza Margaretha

L<KorAP::XML::Krill> is developed as part of the L<KorAP|https://korap.ids-mannheim.de/>
Corpus Analysis Platform at the
L<Leibniz Institute for the German Language (IDS)|https://www.ids-mannheim.de/>,
member of the
L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/>.

This program is free software published under the
L<BSD-2 License|https://opensource.org/licenses/BSD-2-Clause>.

=cut