=pod

=encoding utf8

=head1 NAME

tei2korapxml - Conversion of TEI P5 based formats to KorAP-XML

=head1 SYNOPSIS

  cat corpus.i5.xml | tei2korapxml -tk - > corpus.korapxml.zip
  tei2korapxml -tk corpus.i5.xml > corpus.korapxml.zip

=head1 DESCRIPTION

C<tei2korapxml> is a script to convert TEI P5 and
L<I5|https://www.ids-mannheim.de/digspra/kl/projekte/korpora/textmodell>
based documents to the
L<KorAP-XML format|https://github.com/KorAP/KorAP-XML-Krill#about-korap-xml>.

This program is usually called from inside another script.
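
As an illustration of calling the tool from another script, the
following is a minimal sketch in Python. The helper names
C<build_command> and C<convert> are hypothetical and not part of this
distribution; only the options documented below are used.

```python
import subprocess


def build_command(input_path, output_path, tokenizer_flag="-tk"):
    """Assemble the argument list for a tei2korapxml call (hypothetical helper)."""
    return [
        "tei2korapxml",
        tokenizer_flag,
        "--input", input_path,
        "--output", output_path,
    ]


def convert(input_path, output_path):
    """Run the conversion; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_command(input_path, output_path), check=True)
```

Calling C<convert("corpus.i5.xml", "corpus.korapxml.zip")> assumes that
C<tei2korapxml> is available on the command line.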

=head1 FORMATS

=head2 Input restrictions

=over 2

=item

TEI P5 formatted input with certain restrictions:

=over 4

=item

B<mandatory>: text-header with integrated textsigle
(or convertible identifier), text-body

=item

B<optional>: corp-header with integrated corpsigle,
doc-header with integrated docsigle

=back

=item

Tokens inside the primary text must not be separated
by newlines, because newlines are removed
(see L<KorAP::XML::TEI::Data>), and converting newlines
into blanks between two tokens could introduce additional blanks
where there should be none (e.g. punctuation characters like C<,> or
C<.> should not be separated from their predecessor token).
See also the code section C<~ whitespace handling ~> in C<script/tei2korapxml>.

=item

Header types, like C<E<lt>idsHeader [...] type="document" [...] E<gt>>,
need to be defined on the same line as the header tag.

=back
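
The newline restriction above can be illustrated with a small Python
sketch. It is not the code used in L<KorAP::XML::TEI::Data>; the two
helper functions are hypothetical and only contrast removing newlines
with turning them into blanks.

```python
def drop_newlines(text):
    """Remove newlines entirely, as described for the primary text."""
    return text.replace("\n", "")


def newlines_to_blanks(text):
    """Alternative strategy: turn every newline into a single blank."""
    return text.replace("\n", " ")


# Punctuation on its own line: removal keeps it attached to its predecessor,
# whereas conversion to blanks introduces unwanted whitespace.
punctuated = "Hallo\n, Welt\n."
print(drop_newlines(punctuated))       # Hallo, Welt.
print(newlines_to_blanks(punctuated))  # Hallo , Welt .

# Tokens separated only by a newline are fused when newlines are removed.
print(drop_newlines("Hallo\nWelt"))    # HalloWelt
```

The last line shows why tokens that are separated only by a newline are
problematic: once the newline is removed, the token boundary is lost.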

=head2 Notes on the output

=over 2

=item

Zip file output (written to C<STDOUT> by default) with UTF-8 encoded entries
(which together form the KorAP-XML format)

=back
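
To inspect such an archive, a sketch like the following may help. It
uses Python's standard C<zipfile> module; the entry name
C<CORPUS/DOC/TEXT/data.xml> and its content are made-up stand-ins for
illustration, not output produced by C<tei2korapxml>.

```python
import io
import zipfile


def list_entries(zip_bytes):
    """Return (name, text) pairs for all entries, decoding each as UTF-8."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return [
            (name, archive.read(name).decode("utf-8"))
            for name in archive.namelist()
        ]


# Build a tiny stand-in archive in memory (hypothetical entry name and content).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("CORPUS/DOC/TEXT/data.xml", "<text>Grüße</text>")

for name, text in list_entries(buffer.getvalue()):
    print(name, text)
```

For a real conversion result, the bytes would be read from the zip file
written by C<tei2korapxml> instead of the in-memory buffer.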

=head1 INSTALLATION

C<tei2korapxml> requires C<libxml2-dev> bindings and L<File::ShareDir::Install> to be installed.
When these requirements are met, the preferred way to install the script is
to use L<cpanm|App::cpanminus>.

  $ cpanm https://github.com/KorAP/KorAP-XML-TEI.git

If everything went well, the C<tei2korapxml> tool will
be available on your command line immediately.

Minimum requirement for L<KorAP::XML::TEI> is Perl 5.16.

=head1 OPTIONS

=over 2

=item B<--input|-i>

The input file to process. If no specific input is defined and a single
dash C<-> is passed as an argument, data is read from C<STDIN>.

Instead of using C<-i>, input files can also be defined as trailing arguments
to the command:

  tei2korapxml -tk corpus1.i5.xml corpus2.i5.xml

=item B<--output|-o>

The output zip file to be created. If no specific output is defined,
data is written to C<STDOUT>.

=item B<--root|-r>

The root directory for output. Defaults to C<.>.

=item B<--help|-h>

Print help information.

=item B<--version|-v>

Print version information.

=item B<--tokenizer-korap|-tk>

Use the standard KorAP/DeReKo tokenizer.

=item B<--tokenizer-internal|-ti>

Tokenize the data using two embedded tokenizers
that take an I<aggressive> and a I<conservative>
approach.

=item B<--tokenizer-call|-tc>

Call an external tokenizer process that tokenizes data read
from C<STDIN> and outputs the offsets of all tokens.

Texts are separated using C<\x04\n>. The external process
should add a new line per text.

If the L</--use-tokenizer-sentence-splits> option is activated,
sentences are marked by offset as well, on new lines.

To use L<Datok|https://github.com/KorAP/Datok> including sentence
splitting, call C<tei2korapxml> as follows:

  $ cat corpus.i5.xml | tei2korapxml -s \
      -tc 'datok tokenize \
      -t ./tokenizer.matok \
      -p --newline-after-eot --no-sentences \
      --no-tokens --sentence-positions -' - \
      > corpus.korapxml.zip
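
The exchange with an external tokenizer can be sketched in Python as
follows. This is a toy illustration, not a replacement for Datok: the
function names are hypothetical, the token pattern is deliberately
naive, and the output layout (space-separated start and end offsets,
one line per text) is an assumption that should be checked against the
tokenizer actually used.

```python
import re

# Texts arrive separated by EOT (0x04) followed by a newline.
TEXT_SEPARATOR = "\x04\n"


def token_offsets(text):
    """Return a flat list of start/end character offsets, one pair per token."""
    offsets = []
    # Naive pattern: runs of word characters, or single punctuation characters.
    for match in re.finditer(r"\w+|[^\w\s]", text):
        offsets.extend((match.start(), match.end()))
    return offsets


def tokenize_stream(data):
    """Split the input at the separator and emit one line of offsets per text."""
    texts = data.split(TEXT_SEPARATOR)
    if texts and texts[-1] == "":
        texts.pop()  # drop the empty remainder after the final separator
    return "".join(
        " ".join(str(offset) for offset in token_offsets(text)) + "\n"
        for text in texts
    )


print(tokenize_stream("Hallo, Welt.\x04\nJa!\x04\n"), end="")
```

A real external process would read from C<STDIN> incrementally and
would probably need to flush its output after every text.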

=item B<--no-tokenizer>

Boolean flag indicating that no tokenizer should be used.
Requiring this flag to be set explicitly is meant to ensure that,
by default, a final token layer always exists.
If a separate tokenizer is chosen, this flag is ignored.

=item B<--skip-inline-tokens>

Boolean flag indicating that inline tokens should not
be processed. Defaults to false (meaning inline tokens will be processed).

=item B<--skip-inline-token-annotations>

Boolean flag indicating that inline token annotations should not
be processed. Defaults to true (meaning inline token annotations
won't be processed). Can be negated with
C<--no-skip-inline-token-annotations>.

=item B<--skip-inline-tags> <tags>

Expects a comma-separated list of tags to be ignored when the structure
is parsed. The content of these tags, however, will still be processed.

=item B<--auto-textsigle> <textsigle>

Expects a text sigle that serves as a fallback if no text sigles
are given in the input data.
The auto text sigle will be incremented for each text processed.

Example:

  tei2korapxml --auto-textsigle 'ICC/GER.00001' -s -tk - \
    < data.i5.xml > korapxml.zip
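
How such an increment could work is sketched below in Python. The
helper C<next_textsigle> is hypothetical, and the sketch assumes that
the trailing number of the sigle is counted up while its zero padding
is kept; the actual implementation may differ.

```python
import re


def next_textsigle(sigle):
    """Increment the trailing number of a text sigle, keeping its zero padding."""
    match = re.search(r"(\d+)$", sigle)
    if match is None:
        raise ValueError(f"text sigle has no trailing number: {sigle!r}")
    digits = match.group(1)
    incremented = str(int(digits) + 1).zfill(len(digits))
    return sigle[:match.start()] + incremented


print(next_textsigle("ICC/GER.00001"))  # ICC/GER.00002
```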

=item B<--xmlid-to-textsigle> <from-regex>@<to-c/to-d/to-t>

Expects a regular expression replacement (with the search pattern and
the replacement separated by B<@>) to convert text id attributes to text sigles
with three parts (separated by B</>).

Example:

  tei2korapxml \
    --xmlid-to-textsigle 'ICC.German\.([^.]+\.[^.]+)\.(.+)@ICCGER/$1/$2' \
    -tk - < t/data/icc_german_sample.p5.xml

Converts text id C<ICC.German.DeReKo.WPD17.G11.00238> to
sigle C<ICCGER/DeReKo.WPD17/G11.00238>.
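
The conversion from the example can be reproduced with the following
Python sketch. The helper C<xmlid_to_textsigle> is hypothetical; it
assumes that the rule is split at the first B<@> and translates the
Perl-style C<$1> references into Python's replacement syntax.

```python
import re


def xmlid_to_textsigle(rule, xml_id):
    """Apply a '<from-regex>@<replacement>' rule to a text id."""
    # Assumption: the first "@" separates the search pattern from the replacement.
    pattern, _, replacement = rule.partition("@")
    # Translate Perl-style group references ($1) into Python's \g<1> form.
    replacement = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    return re.sub(pattern, replacement, xml_id)


rule = r"ICC.German\.([^.]+\.[^.]+)\.(.+)@ICCGER/$1/$2"
print(xmlid_to_textsigle(rule, "ICC.German.DeReKo.WPD17.G11.00238"))
# ICCGER/DeReKo.WPD17/G11.00238
```

The first capture group spans two dot-separated parts, which is why
C<DeReKo.WPD17> ends up as the document part of the sigle.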

=item B<--inline-tokens> <foundry>#[<file>]

Define the foundry and file (without extension)
to store inline token information in.
Unless C<--skip-inline-token-annotations> is set,
this will contain annotations as well.
Defaults to C<tokens> and C<morpho>.

The inline token data will also be stored in the
inline structures file (see I<--inline-structures>),
unless the inline token foundry is prefixed
with an exclamation mark B<!>, indicating that inline
tokens are stored exclusively in the inline tokens
file.

Example:

  tei2korapxml --no-tokenizer --inline-tokens \
    '!gingko#morpho' < data.i5.xml > korapxml.zip
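
The C<E<lt>foundryE<gt>#[E<lt>fileE<gt>]> notation, including the
optional B<!> prefix, can be read as in the following Python sketch.
The function C<parse_inline_spec> and the keys of its result are
hypothetical and only illustrate the notation described above.

```python
def parse_inline_spec(spec, default_file):
    """Split '<foundry>#[<file>]' with an optional leading '!' into its parts."""
    exclusive = spec.startswith("!")
    if exclusive:
        spec = spec[1:]
    foundry, _, file_name = spec.partition("#")
    return {
        "foundry": foundry,
        "file": file_name or default_file,  # the file part is optional
        "exclusive": exclusive,
    }


print(parse_inline_spec("!gingko#morpho", default_file="morpho"))
```

For C<!gingko#morpho>, this yields the foundry C<gingko>, the file
C<morpho> and the exclusive flag set.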

=item B<--inline-dependencies> <foundry>#[<file>]

Define the foundry and file (without extension)
to store inline dependency information in.
Defaults to the layer C<dependency> and
is ignored if not set (meaning that dependency
attributes are stored in the inline tokens file,
unless skipped).

The dependency data will also be stored in the
inline token file (see I<--inline-tokens>),
unless the inline dependencies foundry is prefixed
with an exclamation mark B<!>, indicating that inline
dependency data is stored exclusively in the inline
dependencies file.

Example:

  tei2korapxml --no-tokenizer --inline-dependencies \
    'gingko#dependency' < data.i5.xml > korapxml.zip

=item B<--inline-structures> <foundry>#[<file>]

Define the foundry and file (without extension)
to store inline structure information in.
Defaults to C<struct> and C<structures>.

=item B<--base-foundry> <foundry>

Define the base foundry to store newly generated
token information in.
Defaults to C<base>.

=item B<--data-file> <file>

Define the file (without extension)
to store primary data information in.
Defaults to C<data>.

=item B<--header-file> <file>

Define the file name (without extension)
to store header information on
the corpus, document, and text level in.
Defaults to C<header>.

=item B<--use-tokenizer-sentence-splits|-s>

Replace existing sentence boundary information with the boundaries
provided by the tokenizer, or add them if none exist.
Currently, KorAP-tokenizer and certain external tokenizers support
these boundaries.

=item B<--tokens-file> <file>

Define the file (without extension)
to store generated token information in
(either from the KorAP tokenizer or an externally called tokenizer).
Defaults to C<tokens>.

=item B<--log|-l>

Log level for I<Log::Any>. Defaults to C<notice>.

=back

=head1 ENVIRONMENT VARIABLES

=over 2

=item B<KORAPXMLTEI_DEBUG>

Activate minimal debugging.
Defaults to C<false>.

=item B<KORAPXMLTEI_TOKENIZER_HEAP_SIZE>

Set the heap size for the tokenizer process.
Defaults to C<512m>.

=back

=head1 COPYRIGHT AND LICENSE

Copyright (C) 2021-2025, L<IDS Mannheim|https://www.ids-mannheim.de/>

Author: Peter Harders

Contributors: Nils Diewald, Marc Kupietz, Carsten Schnober

L<KorAP::XML::TEI> is developed as part of the L<KorAP|https://korap.ids-mannheim.de/>
Corpus Analysis Platform at the
L<Leibniz Institute for the German Language (IDS)|https://www.ids-mannheim.de/>,
member of the
L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/>.

This program is free software published under the
L<BSD-2 License|https://opensource.org/licenses/BSD-2-Clause>.

=cut