=pod

=encoding utf8

=head1 NAME

tei2korapxml - Conversion of TEI P5 based formats to KorAP-XML

=head1 SYNOPSIS

  cat corpus.i5.xml | tei2korapxml -tk - > corpus.korapxml.zip
  tei2korapxml -tk corpus.i5.xml > corpus.korapxml.zip

=head1 DESCRIPTION

C<tei2korapxml> is a script to convert TEI P5 and
L<I5|https://www.ids-mannheim.de/digspra/kl/projekte/korpora/textmodell>
based documents to the
L<KorAP-XML format|https://github.com/KorAP/KorAP-XML-Krill#about-korap-xml>.

This program is usually called from inside another script.

=head1 FORMATS

=head2 Input restrictions

=over 2

=item

TEI P5 formatted input with certain restrictions:

=over 4

=item

B<mandatory>: text-header with integrated textsigle
(or convertible identifier), text-body

=item

B<optional>: corp-header with integrated corpsigle,
doc-header with integrated docsigle

=back

=item

Tokens inside the primary text must not be separated by newlines,
because newlines are removed (see L<KorAP::XML::TEI::Data>), and
converting newlines into blanks between two tokens could introduce
blanks where there should be none (e.g. punctuation characters like
C<,> or C<.> should not be separated from their predecessor token).
See also the code section C<~ whitespace handling ~> in
C<script/tei2korapxml>.

=item

Header types, like C<E<lt>idsHeader [...] type="document" [...] E<gt>>,
need to be defined on the same line as the header tag.

=back

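The whitespace restriction above can be illustrated with a small shell
sketch. It does not call C<tei2korapxml>; the sample tokens are made up,
and C<tr> merely stands in for the two newline treatments described
(removal versus conversion into blanks).

```shell
# Three made-up tokens, one per line: "Hello", "world", "."
tokens='Hello
world
.'

# Newlines removed entirely: neighbouring tokens are glued together
removed=$(printf '%s' "$tokens" | tr -d '\n')

# Newlines turned into blanks: a spurious blank appears before the "."
blanked=$(printf '%s' "$tokens" | tr '\n' ' ')

printf 'removed: [%s]\nblanked: [%s]\n' "$removed" "$blanked"
```

Removing the newlines merges C<Hello> and C<world> into one string, while
converting them into blanks separates the final C<.> from its
predecessor token.
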
=head2 Notes on the output

=over 2

=item

ZIP file output (written to C<STDOUT> by default) with UTF-8 encoded
entries (which together form the KorAP-XML format)

=back

=head1 INSTALLATION

=head2 Docker (Recommended)

The easiest way to use C<tei2korapxml> is via Docker, which bundles all dependencies
(Perl 5.42, Java 21, and required libraries) in a single container image.

B<Pull from Docker Hub:>

  $ docker pull korap/tei2korapxml:latest

B<Usage examples:>

  # Convert a file
  $ docker run --rm -v $(pwd):/data korap/tei2korapxml:latest \
      -s -tk /data/input.i5.xml > output.zip

  # Convert from stdin
  $ cat input.i5.xml | docker run --rm -i korap/tei2korapxml:latest \
      -s -tk - > output.zip

  # Using docker-compose
  $ docker-compose run --rm tei2korapxml -s -tk input.i5.xml > output.zip

B<Build locally:>

  $ docker build -t korap/tei2korapxml:latest .

For a slimmed-down image (using L<mintoolkit|https://github.com/mintoolkit/mint>):

  $ docker build -t korap/tei2korapxml:large .
  $ mint --crt-api-version 1.46 build --http-probe=false \
      --exec='PERL5LIB=/tei2korapxml/script/tei2korapxml -v || test $? -eq 2 && java -jar /tei2korapxml/share/KorAP-Tokenizer-2.3.0-standalone.jar -V' \
      --include-path=/tei2korapxml/lib --include-path=/usr/local/share/perl5 \
      --include-path=/usr/share/perl5 --include-path=/usr/lib/perl5 \
      --tag korap/tei2korapxml:latest \
      korap/tei2korapxml:large

=head2 Traditional Installation

C<tei2korapxml> requires C<libxml2-dev> bindings and L<File::ShareDir::Install> to be installed.
When these requirements are met, the preferred way to install the script is
to use L<cpanm|App::cpanminus>.

  $ cpanm https://github.com/KorAP/KorAP-XML-TEI.git

If everything went well, the C<tei2korapxml> tool will
be available on your command line immediately.

The minimum requirement for L<KorAP::XML::TEI> is Perl 5.38.

=head1 OPTIONS

=over 2

=item B<--input|-i>

The input file to process. If no specific input is defined and a single
dash C<-> is passed as an argument, data is read from C<STDIN>.

Instead of using C<-i>, input files can also be given as trailing arguments
to the command:

  tei2korapxml -tk corpus1.i5.xml corpus2.i5.xml

=item B<--progress|-p>

Show a progress bar (including ETA) written directly to C</dev/tty>,
so it always appears on the terminal regardless of C<stderr> redirection.
This option is ignored if the input is not read from a file,
or if no controlling terminal is available (e.g. in a detached container
or CI environment).

=item B<--output|-o>

The output zip file to be created. If no specific output is defined,
data is written to C<STDOUT>.

=item B<--root|-r>

The root directory for output. Defaults to C<.>.

=item B<--help|-h>

Print help information.

=item B<--version|-v>

Print version information.

=item B<--tokenizer-korap|-tk>

Use the standard KorAP/DeReKo tokenizer.

=item B<--tokenizer-internal|-ti>

Tokenize the data using two embedded tokenizers
that take an I<aggressive> and a I<conservative>
approach.

=item B<--tokenizer-call|-tc>

Call an external tokenizer process that tokenizes
the data read from C<STDIN> and outputs the offsets of all tokens.

Texts are separated using C<\x04\n>. The external process
should add a new line per text.

If the L</--use-tokenizer-sentence-splits> option is activated,
sentences are marked by offset as well in new lines.

To use L<Datok|https://github.com/KorAP/Datok> including sentence
splitting, call C<tei2korapxml> as follows:

  $ cat corpus.i5.xml | tei2korapxml -s \
      -tc 'datok tokenize \
           -t ./tokenizer.matok \
           -p --newline-after-eot --no-sentences \
           --no-tokens --sentence-positions -' - \
      > corpus.korapxml.zip

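The protocol described above (texts terminated by C<\x04\n> on C<STDIN>,
one line of offsets per text on C<STDOUT>) can be sketched with a toy
whitespace tokenizer. This is only an illustration: the function name
C<tokenize_offsets>, the output format (space-separated start and end
character offsets) and the restriction to ASCII input are assumptions,
not the documented behaviour of Datok or any other real tokenizer.

```shell
# Toy tokenizer (assumption-laden sketch): splits on blanks and tabs and
# prints "start end" offset pairs, one output line per input text.
tokenize_offsets() {
  awk '
    BEGIN { offset = 0; out = "" }
    {
      line = $0
      # A trailing \x04 (octal 004) marks the end of the current text
      eot = sub(/\004$/, "", line)
      pos = 0
      rest = line
      while (match(rest, /[^ \t]+/)) {
        start = offset + pos + RSTART - 1
        out = out (out == "" ? "" : " ") start " " (start + RLENGTH)
        pos += RSTART - 1 + RLENGTH
        rest = substr(rest, RSTART + RLENGTH)
      }
      if (eot) { print out; out = ""; offset = 0 }
      else     { offset += length(line) + 1 }  # count the newline as one character
    }
    END { if (out != "") print out }           # text without a final \x04
  '
}

# Two texts, each terminated by \x04 followed by a newline
printf 'Hello world.\004\nA b\004\n' | tokenize_offsets
```

For the two sample texts this prints one line of offsets each. Whether
such output matches what C<tei2korapxml> expects from C<-tc> would have
to be checked against the output of a supported tokenizer.
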
=item B<--no-tokenizer>

Boolean flag indicating that no tokenizer should be used.
Requiring an explicit flag is meant to ensure that, by default,
a final token layer always exists.
If a separate tokenizer is chosen, this flag is ignored.

=item B<--skip-inline-tokens>

Boolean flag indicating that inline tokens should not
be processed. Defaults to false (meaning inline tokens will be processed).

=item B<--skip-inline-token-annotations>

Boolean flag indicating that inline token annotations should not
be processed. Defaults to true (meaning inline token annotations
won't be processed). Can be negated with
C<--no-skip-inline-token-annotations>.

=item B<--skip-inline-tags> <tags>

Expects a comma-separated list of tags to be ignored when the structure
is parsed. The content of these tags will still be processed.

=item B<--auto-textsigle> <textsigle>

Expects a text sigle that serves as a fallback if no text sigles
are given in the input data.
The auto text sigle will be incremented for each text processed.

Example:

  tei2korapxml --auto-textsigle 'ICC/GER.00001' -s -tk - \
    < data.i5.xml > korapxml.zip

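How the sigle is incremented is not spelled out here. A minimal sketch,
assuming that the numeric part after the last dot is counted up and
keeps its zero padding (the helper C<next_sigle> is hypothetical and not
part of C<tei2korapxml>):

```shell
# Hypothetical helper: increment the zero-padded number after the last dot.
next_sigle() {
  prefix=${1%.*}          # e.g. ICC/GER
  num=${1##*.}            # e.g. 00001
  width=${#num}           # keep the original padding width
  # Strip leading zeros so the shell does not read the number as octal
  n=$(printf '%s' "$num" | sed 's/^0*//')
  printf "%s.%0${width}d\n" "$prefix" "$(( ${n:-0} + 1 ))"
}

next_sigle 'ICC/GER.00001'
next_sigle 'ICC/GER.00009'
```

Under this assumption, consecutive texts without their own sigle would
receive consecutive numbers, starting from the value passed on the
command line.
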
=item B<--xmlid-to-textsigle> <from-regex>@<to-c/to-d/to-t>

Expects a regular replacement expression (with B<@> separating the
search pattern from the replacement) to convert text id attributes to
text sigles with three parts (separated by B</>).

Example:

  tei2korapxml \
    --xmlid-to-textsigle 'ICC.German\.([^.]+\.[^.]+)\.(.+)@ICCGER/$1/$2' \
    -tk - < t/data/icc_german_sample.p5.xml

Converts text id C<ICC.German.DeReKo.WPD17.G11.00238> to
sigle C<ICCGER/DeReKo.WPD17/G11.00238>.

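The effect of the replacement expression above can be reproduced outside
of C<tei2korapxml> with C<sed>. This is only an approximation, assuming
that the option behaves like a plain regular expression substitution;
note that C<sed> writes back-references as C<\1> and C<\2> instead of
C<$1> and C<$2>.

```shell
# Apply the search pattern and replacement from the example to a text id.
# '@' is used as the sed delimiter, mirroring the option's separator.
xmlid='ICC.German.DeReKo.WPD17.G11.00238'
sigle=$(printf '%s\n' "$xmlid" |
  sed -E 's@ICC.German\.([^.]+\.[^.]+)\.(.+)@ICCGER/\1/\2@')
printf '%s\n' "$sigle"
# prints ICCGER/DeReKo.WPD17/G11.00238
```

Testing a pattern this way before a long conversion run can help to
catch ids that the search expression does not match.
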
=item B<--inline-tokens> <foundry>#[<file>]

Define the foundry and file (without extension)
to store inline token information in.
Unless C<--skip-inline-token-annotations> is set,
this will contain annotations as well.
Defaults to C<tokens> and C<morpho>.

The inline token data will also be stored in the
inline structures file (see I<--inline-structures>),
unless the inline token foundry is prefixed
with an exclamation mark (B<!>), indicating that inline
tokens are stored exclusively in the inline tokens
file.

Example:

  tei2korapxml --no-tokenizer --inline-tokens \
    '!gingko#morpho' < data.i5.xml > korapxml.zip

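The value syntax (an optional B<!>, a foundry, and an optional file name
after C<#>) can be pictured with a small shell sketch. The helper
C<parse_inline_spec> is hypothetical and only mirrors the description
above; it is not the parser used by C<tei2korapxml>.

```shell
# Hypothetical helper: split "[!]<foundry>#[<file>]" into its parts.
# $1 = option value, $2 = default file name used when none is given
parse_inline_spec() {
  spec=$1
  exclusive=no
  case $spec in
    '!'*) exclusive=yes; spec=${spec#?} ;;   # leading "!" means exclusive storage
  esac
  foundry=${spec%%"#"*}                      # part before the first "#"
  case $spec in
    *'#'*) file=${spec#*"#"} ;;              # part after the first "#"
    *)     file= ;;
  esac
  [ -n "$file" ] || file=$2                  # fall back to the default file
  printf 'foundry=%s file=%s exclusive=%s\n' "$foundry" "$file" "$exclusive"
}

parse_inline_spec '!gingko#morpho' morpho
parse_inline_spec 'tokens' morpho
```

The first call corresponds to the example above, where the inline tokens
are stored exclusively in the C<gingko> foundry; the second shows the
fallback to a default file name when no file is given.
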
=item B<--inline-dependencies> <foundry>#[<file>]

Define the foundry and file (without extension)
to store inline dependency information in.
Defaults to the layer of C<dependency> and
will be ignored if not set (which means that dependency
attributes will be stored in the inline tokens file,
if not skipped).

The dependency data will also be stored in the
inline token file (see I<--inline-tokens>),
unless the inline dependencies foundry is prefixed
with an exclamation mark (B<!>), indicating that inline
dependency data is stored exclusively in the inline
dependencies file.

Example:

  tei2korapxml --no-tokenizer --inline-dependencies \
    'gingko#dependency' < data.i5.xml > korapxml.zip

=item B<--inline-structures> <foundry>#[<file>]

Define the foundry and file (without extension)
to store inline structure information in.
Defaults to C<struct> and C<structures>.

=item B<--base-foundry> <foundry>

Define the base foundry to store newly generated
token information in.
Defaults to C<base>.

=item B<--data-file> <file>

Define the file (without extension)
to store primary data information in.
Defaults to C<data>.

=item B<--header-file> <file>

Define the file name (without extension)
to store header information on
the corpus, document, and text level in.
Defaults to C<header>.

=item B<--use-tokenizer-sentence-splits|-s>

Replace existing sentence boundary information with, or add,
the boundaries provided by the tokenizer.
Currently, the KorAP tokenizer and certain external tokenizers support
these boundaries.

=item B<--tokens-file> <file>

Define the file (without extension)
to store generated token information in
(either from the KorAP tokenizer or an externally called tokenizer).
Defaults to C<tokens>.

=item B<--log|-l>

Log level for I<Log::Any>. Defaults to C<notice>.

=back

=head1 ENVIRONMENT VARIABLES

=over 2

=item B<KORAPXMLTEI_DEBUG>

Activate minimal debugging.
Defaults to C<false>.

=item B<KORAPXMLTEI_TOKENIZER_HEAP_SIZE>

Set the heap size for the tokenizer process.
Defaults to C<512m>.

=back

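The fallback to the documented default heap size can be pictured as
follows. The sketch only mimics the lookup in the shell; the function
C<effective_heap_size> is hypothetical, and C<2g> is an assumed example
value (the accepted value syntax is not specified here beyond the
default C<512m>).

```shell
# Mimic "use the variable if set, else the documented default of 512m".
effective_heap_size() {
  printf '%s\n' "${KORAPXMLTEI_TOKENIZER_HEAP_SIZE:-512m}"
}

unset KORAPXMLTEI_TOKENIZER_HEAP_SIZE
effective_heap_size                  # falls back to the default

KORAPXMLTEI_TOKENIZER_HEAP_SIZE=2g   # assumed example value
export KORAPXMLTEI_TOKENIZER_HEAP_SIZE
effective_heap_size                  # now reports the override
```

When calling the tool, the variable would typically be set for a single
invocation only, for example by prefixing the C<tei2korapxml> command
with C<KORAPXMLTEI_TOKENIZER_HEAP_SIZE=2g>.
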
=head1 COPYRIGHT AND LICENSE

Copyright (C) 2021-2025, L<IDS Mannheim|https://www.ids-mannheim.de/>

Author: Peter Harders

Contributors: Nils Diewald, Marc Kupietz, Carsten Schnober

L<KorAP::XML::TEI> is developed as part of the L<KorAP|https://korap.ids-mannheim.de/>
Corpus Analysis Platform at the
L<Leibniz Institute for the German Language (IDS)|https://www.ids-mannheim.de/>,
member of the
L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/>.

This program is free software published under the
L<BSD-2 License|https://opensource.org/licenses/BSD-2-Clause>.

=cut