=pod

=encoding utf8

=head1 NAME

korapxml2krill - Merge KorAP-XML data and create Krill documents


=head1 SYNOPSIS

  korapxml2krill [archive|extract|serial] --input <directory|archive> [options]


=head1 DESCRIPTION

L<KorAP::XML::Krill> is a library to convert KorAP-XML documents to files
compatible with the L<Krill|https://github.com/KorAP/Krill> indexer.
The C<korapxml2krill> command line tool is a simple wrapper around this library.


=head1 INSTALLATION

The preferred way to install L<KorAP::XML::Krill> is to use L<cpanm|App::cpanminus>.

  $ cpanm https://github.com/KorAP/KorAP-XML-Krill.git

If everything went well, the C<korapxml2krill> tool will
be available on your command line immediately.
The minimum requirement for L<KorAP::XML::Krill> is Perl 5.16.
In addition, to work with zip archives, the C<unzip> tool needs to be present.

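The two requirements named above (Perl 5.16 and C<unzip>) can be checked before installing.
The following is a minimal sketch and not part of the distribution;
the function name C<check_prereqs> is made up for illustration.

```shell
# Hypothetical pre-flight check; not shipped with KorAP::XML::Krill.
check_prereqs() {
  status=0
  # $] holds the version of the running Perl as a number (5.016 is Perl 5.16).
  if ! perl -e 'exit($] >= 5.016 ? 0 : 1)' 2>/dev/null; then
    echo "perl 5.16 or newer not found" >&2
    status=1
  fi
  # unzip is only needed when zip archives are used as input.
  if ! command -v unzip >/dev/null 2>&1; then
    echo "unzip not found (needed for zip archives)" >&2
    status=1
  fi
  return "$status"
}

if check_prereqs; then
  echo "prerequisites look fine"
else
  echo "prerequisites missing"
fi
```

The check only tests for the presence of C<unzip>, not for its compatibility with a given archive.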

=head1 ARGUMENTS

  $ korapxml2krill -z --input <directory> --output <filename>

Without a command argument, C<korapxml2krill> converts a directory of a single KorAP-XML document.
It expects the input to point to the text level folder.

=over 2

=item B<archive>

  $ korapxml2krill archive -z --input <directory|archive> --output <directory|tar>

Converts an archive of KorAP-XML documents. It expects a directory
(pointing to the corpus level folder) or one or more zip files as input.

=item B<extract>

  $ korapxml2krill extract --input <archive> --output <directory> --sigle <SIGLE>

Extracts KorAP-XML documents from a zip file.

=item B<serial>

  $ korapxml2krill serial -i <archive1> -i <archive2> -o <directory> -cfg <config-file>

Converts archives sequentially. The inputs are not merged but treated
as they are (so they may be premerged or globs).
The C<--output> directory is treated as the base directory where subdirectories
are created based on the archive name. In case the C<--to-tar> flag is given,
the output will be a tar file.


=back

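The call patterns above can be tried out with a small dry-run wrapper.
This is a sketch only: the C<run> helper and all file and directory names
(C<wpd17.zip>, C<out/>, C<my.conf>, ...) are placeholders and not part of the tool.
Only options documented in this file are used.

```shell
# Hypothetical dry-run helper: prints the command and executes it
# only when DRY_RUN=0 is set in the environment.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# Single text: the input points to the text level folder.
run korapxml2krill -z --input WPD17/060/18486 --output 18486.json.gz

# Whole corpus from a zip archive, written to a directory.
run korapxml2krill archive -z --input wpd17.zip --output out/

# Extract one text from a zip archive.
run korapxml2krill extract --input wpd17.zip --output extracted/ --sigle WPD17/060/18486

# Convert several archives one after another.
run korapxml2krill serial -i news.zip -i sports.zip -o out/ -cfg my.conf
```

Setting C<DRY_RUN=0> executes the commands instead of only printing them.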

=head1 OPTIONS

=over 2

=item B<--input|-i> <directory|zip file>

Directory or zip file(s) of documents to convert.

Without a command argument, C<korapxml2krill> expects a folder of a single KorAP-XML
document, while C<archive> expects a KorAP-XML corpus folder or a zip
file to batch process multiple files.
C<extract> expects zip files only.

C<archive> supports multiple input zip files with the constraint
that the first archive listed contains all primary data files
and all metadata files.

  -i file/news.zip -i file/news.malt.zip -i "#file/news.tt.zip"

Input may also be defined using BSD glob wildcards.

  -i 'file/news*.zip'

The expanded input array will be sorted by path length, so the shortest
path needs to contain all primary data files and all metadata files.

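To predict which archive of a glob ends up first, the paths can be ordered
by string length beforehand. The sketch below only mimics the rule stated above;
the helper name C<sort_by_length> is made up, and how C<korapxml2krill> orders
paths of equal length is not covered here.

```shell
# Order paths read from stdin by string length, shortest first.
sort_by_length() {
  awk '{ print length($0) "\t" $0 }' | sort -n -k1,1 | cut -f2-
}

printf '%s\n' file/news.tt.zip file/news.zip file/news.malt.zip | sort_by_length
```

For the three example paths, C<file/news.zip> comes first, so it is the archive
that has to contain the primary data and the metadata.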

(The directory structure follows the base directory format,
which may include a C<.> root folder.
In this case further archives lacking a C<.> root folder
need to be passed with a hash sign in front of the archive's name.
This may require quoting the parameter.)

To support zip files, a version of C<unzip> needs to be installed that is
compatible with the archive file.

B<The root folder switch using the hash sign is experimental and
may vanish in future versions.>


=item B<--input-base|-ib> <directory>

The base directory for inputs.


=item B<--output|-o> <directory|file>

Output folder for archive processing or
document name for single output (optional).
Writes to C<STDOUT> by default
(in case C<output> is not mandatory due to further options).

=item B<--overwrite|-w>

Overwrite files that already exist.


=item B<--token|-t> <foundry>#<file>

Define the default tokenization by specifying
the name of the foundry and optionally the name
of the layer-file. Defaults to C<OpenNLP#tokens>.
This will directly take the file instead of running
the layer implementation!


=item B<--base-sentences|-bs> <foundry>#<layer>

Define the layer for base sentences.
If given, this will be used instead of using C<Base#Sentences>.
Currently C<DeReKo#Structure> and C<DGD#Structure> are the only additional
layers supported.

Defaults to unset.


=item B<--base-paragraphs|-bp> <foundry>#<layer>

Define the layer for base paragraphs.
If given, this will be used instead of using C<Base#Paragraphs>.
Currently C<DeReKo#Structure> is the only additional layer supported.

Defaults to unset.


=item B<--base-pagebreaks|-bpb> <foundry>#<layer>

Define the layer for base pagebreaks.
Currently C<DeReKo#Structure> is the only layer supported.

Defaults to unset.


=item B<--skip|-s> <foundry>[#<layer>]

Skip specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>. Alternatively you can skip C<#ALL>.
Can be set multiple times.


=item B<--anno|-a> <foundry>[#<layer>]

Convert specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>.
Can be set multiple times.

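Because the foundry and layer notation uses a hash sign, shell quoting matters.
The following sketch shows which arguments a program would receive;
C<show_args> is a made-up helper that stands in for C<korapxml2krill>.

```shell
# Print every argument on its own line, wrapped in brackets.
show_args() {
  for arg in "$@"; do
    printf '[%s]\n' "$arg"
  done
}

# A hash sign inside a word is passed through unchanged ...
show_args --skip Mate#Morpho --base-sentences DeReKo#Structure

# ... but a word that starts with a hash sign has to be quoted,
# otherwise the shell treats the rest of the line as a comment.
show_args --skip '#ALL'
```

The same quoting applies to input archives that are prefixed with a hash sign.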

=item B<--primary|-p>

Output primary data or not. Defaults to C<true>.
Can be flagged using C<--no-primary> as well.
This is I<deprecated>.


=item B<--non-word-tokens|-nwt>

Tokenize non-word tokens like word tokens (defined as matching
C</[\d\w]/>). Useful to treat punctuation marks as tokens.

Defaults to unset.


=item B<--non-verbal-tokens|-nvt>

Tokenize non-verbal tokens marked in the primary data as
the Unicode symbol 'Black Vertical Rectangle' (C<U+25AE>).

Defaults to unset.


=item B<--jobs|-j>

Define the number of concurrent jobs in separate forks
for archive processing.
Defaults to C<0> (everything runs in a single process).

If C<sequential-extraction> is not set to false, this will
also apply to extraction.

Pass C<-1>, and the value will be set automatically to 5
times the number of available cores.
This is I<experimental>.

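To see which value C<--jobs -1> would roughly correspond to on a given machine,
the rule above (5 times the number of cores) can be reproduced in the shell.
This is a sketch under the assumption that C<getconf _NPROCESSORS_ONLN> reports
the same core count the tool detects; that detection is not documented here.

```shell
# Number of online cores; falls back to 1 if it cannot be determined.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
case $cores in
  ''|*[!0-9]*) cores=1 ;;
esac

# The documented rule for --jobs -1: five times the number of cores.
jobs=$((cores * 5))
echo "--jobs -1 would correspond to about --jobs $jobs on this machine"
```

An explicit value such as C<--jobs 4> may be preferable when the machine is shared.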

=item B<--koral|-k>

Version of the output format. Supported versions are:
C<0> for legacy serialization, C<0.03> for serialization
with metadata fields as key-values on the root object,
C<0.4> for serialization with metadata fields as a list
of C<"@type":"koral:field"> objects.

Currently defaults to C<0.03>.


=item B<--sequential-extraction|-se>

Flag to indicate whether the C<jobs> value also applies to extraction.
Some systems may have problems with extracting multiple archives
to the same folder at the same time.
Can be flagged using C<--no-sequential-extraction> as well.
Defaults to C<false>.


=item B<--meta|-m>

Define the metadata parser to use. Defaults to C<I5>.
Metadata parsers can be defined in the C<KorAP::XML::Meta> namespace.
This is I<experimental>.


=item B<--pretty|-y>

Pretty print JSON output. Defaults to C<false>.
This is I<deprecated>.


=item B<--gzip|-z>

Compress the output.
Expects a defined C<output> file in single processing.


=item B<--cache|-c>

File to mmap a cache (using L<Cache::FastMmap>).
Defaults to C<korapxml2krill.cache> in the calling directory.


=item B<--cache-size|-cs>

Size of the cache. Defaults to C<50m>.


=item B<--cache-init|-ci>

Initialize cache file.
Can be flagged using C<--no-cache-init> as well.
Defaults to C<true>.


=item B<--cache-delete|-cd>

Delete cache file after processing.
Can be flagged using C<--no-cache-delete> as well.
Defaults to C<true>.


=item B<--config|-cfg>

Configure the parameters of your call in a file
of key-value pairs separated by whitespace.

  overwrite 1
  token DeReKo#Structure
  ...

Supported parameters are:
C<overwrite>, C<gzip>, C<jobs>, C<input-base>,
C<token>, C<log>, C<cache>, C<cache-size>, C<cache-delete>, C<meta>,
C<output>, C<koral>,
C<temporary-extract>, C<sequential-extraction>,
C<base-sentences>, C<base-paragraphs>,
C<base-pagebreaks>,
C<skip> (semicolon separated), C<sigle>
(semicolon separated), C<anno> (semicolon separated).

Configuration parameters will always be overridden by
parameters passed on the command line.

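A configuration file can be generated from the shell. The sketch below
assumes that boolean options such as C<gzip> are written as C<1>, in analogy to
the C<overwrite 1> line above; the helper name C<write_config> and the archive
and directory names are placeholders.

```shell
# Write a configuration file of whitespace-separated key-value pairs.
write_config() {
  cat > "$1" <<'EOF'
overwrite 1
gzip 1
jobs 4
token OpenNLP#tokens
skip Mate#Morpho;XIP
EOF
}

cfg=$(mktemp)
write_config "$cfg"

# Print the call that would use the file. According to the rule above,
# --jobs 2 on the command line takes precedence over "jobs 4" in the file.
echo "korapxml2krill archive --input wpd17.zip --output out/ -cfg $cfg --jobs 2"
rm -f "$cfg"
```

Multiple C<skip> values share one line and are separated by semicolons.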

=item B<--temporary-extract|-te>

Only valid for the C<archive> command.

This will first extract all files into a
directory and then run the C<archive> conversion on that directory.
If the directory is given as C<:temp:>,
a temporary directory is used.
This is especially useful to avoid
massive unzipping and potential
network latency.


=item B<--to-tar>

Only valid for the C<archive> command.

Writes the output into a tar archive.


=item B<--sigle|-sg>

Extract the given texts.
Can be set multiple times.
I<Currently only supported on C<extract>.>
Sigles have the structure C<Corpus>/C<Document>/C<Text>.
In case the C<Text> path is omitted, the whole document will be extracted.
On the document level, the postfix wildcard C<*> is supported.


=item B<--log|-l>

The L<Log4perl|Log::Log4perl> log level, defaults to C<ERROR>.


=item B<--help|-h>

Print this document.


=item B<--version|-v>

Print version information.

=back


=head1 ANNOTATION SUPPORT

L<KorAP::XML::Krill> has built-in importers for some annotation foundries and layers
developed in the KorAP project that are part of the KorAP preprocessing pipeline.
The base foundry with paragraphs, sentences, and the text element is mandatory for
L<Krill|https://github.com/KorAP/Krill>.

  Base
    #Paragraphs
    #Sentences

  Connexor
    #Morpho
    #Phrase
    #Sentences
    #Syntax

  CoreNLP
    #Constituency
    #Morpho
    #NamedEntities
    #Sentences

  CMC
    #Morpho

  DeReKo
    #Structure

  DGD
    #Morpho
    #Structure

  DRuKoLa
    #Morpho

  Glemm
    #Morpho

  HNC
    #Morpho

  LWC
    #Dependency

  Malt
    #Dependency

  MarMoT
    #Morpho

  Mate
    #Dependency
    #Morpho

  MDParser
    #Dependency

  OpenNLP
    #Morpho
    #Sentences

  Sgbr
    #Lemma
    #Morpho

  Talismane
    #Dependency
    #Morpho

  TreeTagger
    #Morpho
    #Sentences

  XIP
    #Constituency
    #Morpho
    #Sentences


More importers are in preparation.
New annotation importers can be defined in the C<KorAP::XML::Annotation> namespace.
See the built-in annotation importers as examples.


=head1 About KorAP-XML

KorAP-XML (Bański et al. 2012) is an implementation of the KorAP
data model (Bański et al. 2013), where text data are stored physically
separated from their interpretations (i.e. annotations).
A text document in KorAP-XML therefore consists of several files
containing primary data, metadata and annotations.

The structure of a single KorAP-XML document can be as follows:

  - data.xml
  - header.xml
  + base
    - tokens.xml
    - ...
  + struct
    - structure.xml
    - ...
  + corenlp
    - morpho.xml
    - constituency.xml
    - ...
  + tree_tagger
    - morpho.xml
    - ...
  - ...

The C<data.xml> contains the primary data, the C<header.xml> contains
the metadata, and the annotation layers are stored in subfolders
like C<base>, C<struct> or C<corenlp>
(so-called "foundries"; Bański et al. 2013).

Metadata is available in the TEI-P5 variant I5
(Lüngen and Sperberg-McQueen 2012), while annotations correspond to
a variant of the TEI-P5 feature structures (TEI Consortium; Lee et al. 2004).

Multiple KorAP-XML documents are organized on three levels following
the "IDS Textmodell" (Lüngen and Sperberg-McQueen 2012):
corpus E<gt> document E<gt> text. On each level metadata information
can be stored, which C<korapxml2krill> will merge to a single metadata
object per text. A corpus is therefore structured as follows:

  + <corpus>
    - header.xml
    + <document>
      - header.xml
      + <text>
        - data.xml
        - header.xml
        - ...
    - ...

A single text can be identified by the concatenation of
the corpus identifier, the document identifier and the text identifier.
This identifier is called the text sigle
(e.g. a text with the identifier C<18486> in the document C<060> in the
corpus C<WPD17> has the text sigle C<WPD17/060/18486>, see C<--sigle>).

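The relation between the folder layout and the text sigle can be reproduced
with a throwaway corpus skeleton. This is an illustrative sketch: the files are
empty placeholders, and C<list_sigles> is a made-up helper that is not part of
the distribution.

```shell
# Build an empty corpus > document > text skeleton in a temporary directory.
root=$(mktemp -d)
mkdir -p "$root/WPD17/060/18486/base" "$root/WPD17/060/18487/base"
: > "$root/WPD17/header.xml"
: > "$root/WPD17/060/header.xml"
for text in "$root"/WPD17/060/*/; do
  : > "${text}data.xml"
  : > "${text}header.xml"
done

# Every folder that holds a data.xml is a text; its path relative to
# the base directory corresponds to the text sigle.
list_sigles() {
  ( cd "$1" && find . -name data.xml | sed -e 's|^\./||' -e 's|/data\.xml$||' | sort )
}

sigles=$(list_sigles "$root")
echo "$sigles"
rm -rf "$root"
```

The printed sigles have the form that C<--sigle> expects.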

These corpora are often stored in zip files, which C<korapxml2krill>
can process. Corpora may also be split into multiple zip archives
(e.g. one zip file per foundry), which is also supported (see C<--input>).

Examples of KorAP-XML files are included in L<KorAP::XML::Krill>
in the form of a test suite.
The resulting JSON format merges all annotation layers
based on a single token stream.

=head2 References

Piotr Bański, Cyril Belica, Helge Krause, Marc Kupietz, Carsten Schnober, Oliver Schonefeld, and Andreas Witt (2011):
KorAP data model: first approximation, December.

Piotr Bański, Peter M. Fischer, Elena Frick, Erik Ketzan, Marc Kupietz, Carsten Schnober, Oliver Schonefeld and Andreas Witt (2012):
"The New IDS Corpus Analysis Platform: Challenges and Prospects",
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012).
L<PDF|http://www.lrec-conf.org/proceedings/lrec2012/pdf/789_Paper.pdf>

Piotr Bański, Elena Frick, Michael Hanl, Marc Kupietz, Carsten Schnober and Andreas Witt (2013):
"Robust corpus architecture: a new look at virtual collections and data access",
Corpus Linguistics 2013. Abstract Book. Lancaster: UCREL, pp. 23-25.
L<PDF|https://ids-pub.bsz-bw.de/frontdoor/deliver/index/docId/4485/file/Ba%c5%84ski_Frick_Hanl_Robust_corpus_architecture_2013.pdf>

Kiyong Lee, Lou Burnard, Laurent Romary, Eric de la Clergerie, Thierry Declerck,
Syd Bauman, Harry Bunt, Lionel Clément, Tomaz Erjavec, Azim Roussanaly and Claude Roux (2004):
"Towards an international standard on feature structure representation",
Proceedings of the fourth International Conference on Language Resources and Evaluation (LREC 2004),
pp. 373-376.
L<PDF|http://www.lrec-conf.org/proceedings/lrec2004/pdf/687.pdf>

Harald Lüngen and C. M. Sperberg-McQueen (2012):
"A TEI P5 Document Grammar for the IDS Text Model",
Journal of the Text Encoding Initiative, Issue 3, November 2012.
L<PDF|https://journals.openedition.org/jtei/pdf/508>

TEI Consortium, eds:
"Feature Structures",
Guidelines for Electronic Text Encoding and Interchange.
L<html|https://www.tei-c.org/release/doc/tei-p5-doc/en/html/FS.html>

=head1 AVAILABILITY

  https://github.com/KorAP/KorAP-XML-Krill


=head1 COPYRIGHT AND LICENSE

Copyright (C) 2015-2020, L<IDS Mannheim|https://www.ids-mannheim.de/>

Author: L<Nils Diewald|https://nils-diewald.de/>

Contributor: Eliza Margaretha

L<KorAP::XML::Krill> is developed as part of the L<KorAP|http://korap.ids-mannheim.de/>
Corpus Analysis Platform at the
L<Leibniz Institute for the German Language (IDS)|http://ids-mannheim.de/>,
member of the
L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/>.

This program is free software published under the
L<BSD-2 License|https://raw.githubusercontent.com/KorAP/KorAP-XML-Krill/master/LICENSE>.

=cut