=pod

=encoding utf8

=head1 NAME

tei2korapxml - Conversion of TEI P5 based formats to KorAP-XML

=head1 SYNOPSIS

  cat corpus.i5.xml | tei2korapxml -tk - > corpus.korapxml.zip
  tei2korapxml -tk corpus.i5.xml > corpus.korapxml.zip

=head1 DESCRIPTION

C<tei2korapxml> is a script to convert TEI P5 and
L<I5|https://www.ids-mannheim.de/digspra/kl/projekte/korpora/textmodell>
based documents to the
L<KorAP-XML format|https://github.com/KorAP/KorAP-XML-Krill#about-korap-xml>.

This program is usually called from inside another script.

=head1 FORMATS

=head2 Input restrictions

=over 2

=item

TEI P5 formatted input with certain restrictions:

=over 4

=item

B<mandatory>: text-header with integrated textsigle
(or convertible identifier), text-body

=item

B<optional>: corp-header with integrated corpsigle,
doc-header with integrated docsigle

=back

=item

Tokens inside the primary text must not be separated by newlines,
because newlines are removed (see L<KorAP::XML::TEI::Data>), and
converting newlines into blanks between two tokens could introduce
additional blanks where there should be none (e.g. punctuation
characters like C<,> or C<.> should not be separated from their
predecessor token).
See also the code section C<~ whitespace handling ~> in
C<script/tei2korapxml>.

=item

Header types, like C<E<lt>idsHeader [...] type="document" [...] E<gt>>
need to be defined in the same line as the header tag.

=back

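The last restriction can be checked before conversion. The following is a
minimal sketch, not part of C<tei2korapxml>: the helper name
C<check_header_types> is hypothetical, and the check only looks for
C<E<lt>idsHeader> opening tags that have no C<type=> attribute on the same
line.

```shell
# Hypothetical helper (not shipped with tei2korapxml):
# list all lines that open an <idsHeader> tag without
# a type= attribute on the same line.
check_header_types() {
  # grep -n prefixes each hit with its line number
  grep -n '<idsHeader' "$1" | grep -v 'type='
}
```

Calling C<check_header_types corpus.i5.xml> prints nothing if every header
tag carries its type on the same line.
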
=head2 Notes on the output

=over 2

=item

The output is a zip file (written to C<STDOUT> by default) with
UTF-8 encoded entries, which together form the KorAP-XML format.

=back

=head1 INSTALLATION

=head2 Docker (Recommended)

The easiest way to use C<tei2korapxml> is via Docker, which bundles all dependencies
(Perl 5.42, Java 21, and required libraries) in a single container image.

B<Pull from Docker Hub:>

  $ docker pull korap/tei2korapxml:latest

B<Usage examples:>

  # Convert a file
  $ docker run --rm -v $(pwd):/data korap/tei2korapxml:latest \
      -s -tk /data/input.i5.xml > output.zip

  # Convert from stdin
  $ cat input.i5.xml | docker run --rm -i korap/tei2korapxml:latest \
      -s -tk - > output.zip

  # Using docker-compose
  $ docker-compose run --rm tei2korapxml -s -tk input.i5.xml > output.zip

B<Build locally:>

  $ docker build -t korap/tei2korapxml:latest .

For a slimmed-down image (using L<mintoolkit|https://github.com/mintoolkit/mint>):

  $ docker build -t korap/tei2korapxml:large .
  $ mint --crt-api-version 1.46 build --http-probe=false \
      --exec='PERL5LIB=/tei2korapxml/script/tei2korapxml -v || test $? -eq 2 && java -jar /tei2korapxml/share/KorAP-Tokenizer-2.3.0-standalone.jar -V' \
      --include-path=/tei2korapxml/lib --include-path=/usr/local/share/perl5 \
      --include-path=/usr/share/perl5 --include-path=/usr/lib/perl5 \
      --tag korap/tei2korapxml:latest \
      korap/tei2korapxml:large

=head2 Traditional Installation

C<tei2korapxml> requires C<libxml2-dev> bindings and L<File::ShareDir::Install> to be installed.
When these requirements are met, the preferred way to install the script is
to use L<cpanm|App::cpanminus>.

  $ cpanm https://github.com/KorAP/KorAP-XML-TEI.git

If everything went well, the C<tei2korapxml> tool will
be available on your command line immediately.

Minimum requirement for L<KorAP::XML::TEI> is Perl 5.38.

=head1 OPTIONS

=over 2

=item B<--input|-i>

The input file to process. If no specific input is defined and a single
dash C<-> is passed as an argument, data is read from C<STDIN>.

Instead of using C<-i>, input files can also be defined as trailing arguments
to the command:

  tei2korapxml -tk corpus1.i5.xml corpus2.i5.xml

=item B<--progress|-p>

Show a progress bar (including ETA).
This option is ignored if valid input is not read from a file.

=item B<--output|-o>

The output zip file to be created. If no specific output is defined,
data is written to C<STDOUT>.

=item B<--root|-r>

The root directory for output. Defaults to C<.>.

=item B<--help|-h>

Print help information.

=item B<--version|-v>

Print version information.

=item B<--tokenizer-korap|-tk>

Use the standard KorAP/DeReKo tokenizer.

=item B<--tokenizer-internal|-ti>

Tokenize the data using two embedded tokenizers
that take an I<aggressive> and a I<conservative>
approach.

=item B<--tokenizer-call|-tc>

Call an external tokenizer process that tokenizes
from STDIN and outputs the offsets of all tokens.

Texts are separated using C<\x04\n>. The external process
should add a new line per text.

If the L</--use-tokenizer-sentence-splits> option is activated,
sentences are marked by offset as well in new lines.

To use L<Datok|https://github.com/KorAP/Datok> including sentence
splitting, call C<tei2korapxml> as follows:

  $ cat corpus.i5.xml | tei2korapxml -s \
      -tc 'datok tokenize \
      -t ./tokenizer.matok \
      -p --newline-after-eot --no-sentences \
      --no-tokens --sentence-positions -' - \
      > corpus.korapxml.zip

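The text separator described above can be illustrated with a small sketch.
It is not an external tokenizer and makes no claim about the offset output
format. It only shows how the C<\x04> (EOT) bytes frame the texts in the
stream. The helper name C<count_texts> is hypothetical.

```shell
# Hypothetical helper: count the texts in a stream by counting
# the EOT bytes (\x04, octal 004) that terminate each text.
count_texts() {
  # keep only EOT bytes, count them, strip padding added by some wc versions
  tr -cd '\004' | wc -c | tr -d ' '
}

# Two texts, each terminated by "\x04\n"
printf 'First text.\004\nSecond text.\004\n' | count_texts   # prints 2
```

An external tokenizer is expected to emit one line per text, so the number
of output lines can be compared against this count when debugging a
C<-tc> command.
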
=item B<--no-tokenizer>

Boolean flag indicating that no tokenizer should be used.
Requiring this flag explicitly is meant to ensure that, by default,
a final token layer always exists.
If a separate tokenizer is chosen, this flag is ignored.

=item B<--skip-inline-tokens>

Boolean flag indicating that inline tokens should not
be processed. Defaults to false (meaning inline tokens will be processed).

=item B<--skip-inline-token-annotations>

Boolean flag indicating that inline token annotations should not
be processed. Defaults to true (meaning inline token annotations
won't be processed). Can be negated with
C<--no-skip-inline-token-annotations>.

=item B<--skip-inline-tags> <tags>

Expects a comma-separated list of tags to be ignored when the structure
is parsed. The content of these tags, however, will be processed.

=item B<--auto-textsigle> <textsigle>

Expects a text sigle that serves as a fallback if no text sigles
are given in the input data.
The auto text sigle will be incremented for each text processed.

Example:

  tei2korapxml --auto-textsigle 'ICC/GER.00001' -s -tk - \
    < data.i5.xml > korapxml.zip

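How the sigle is incremented is not specified here. The following sketch
shows one plausible reading, under the assumption that the trailing number
after the last dot is increased by one and keeps its zero-padded width. The
helper name C<next_textsigle> is hypothetical and not part of
C<tei2korapxml>.

```shell
# Hypothetical helper, assuming the sigle ends in ".<digits>":
# increment the trailing number and keep its zero-padded width.
next_textsigle() {
  prefix=${1%.*}          # everything before the last dot, e.g. ICC/GER
  num=${1##*.}            # trailing digits, e.g. 00001
  width=${#num}           # number of digits to preserve
  # strip leading zeros so the value is not read as octal
  n=$(printf '%s' "$num" | sed 's/^0*//')
  printf "%s.%0${width}d\n" "$prefix" $(( ${n:-0} + 1 ))
}

next_textsigle 'ICC/GER.00001'   # prints ICC/GER.00002
```

Under this assumption, three texts without sigles would be assigned
C<ICC/GER.00001>, C<ICC/GER.00002>, and C<ICC/GER.00003>.
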
=item B<--xmlid-to-textsigle> <from-regex>@<to-c/to-d/to-t>

Expects a regular expression replacement (search pattern and
replacement separated by B<@>) to convert text id attributes to text sigles
with three parts (separated by B</>).

Example:

  tei2korapxml \
    --xmlid-to-textsigle 'ICC.German\.([^.]+\.[^.]+)\.(.+)@ICCGER/$1/$2' \
    -tk - < t/data/icc_german_sample.p5.xml

Converts text id C<ICC.German.DeReKo.WPD17.G11.00238> to
sigle C<ICCGER/DeReKo.WPD17/G11.00238>.

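To try out a replacement expression before running a conversion, the
substitution can be approximated with C<sed>. This is only a sketch: the
helper name C<xmlid_to_textsigle> is hypothetical, and C<sed -E> uses POSIX
extended regular expressions, which differ from Perl regular expressions in
some details, so results may not match C<tei2korapxml> for every pattern.

```shell
# Hypothetical helper: apply a '<from-regex>@<to>' expression to a text id.
# Approximation with sed -E (POSIX ERE), not the tool's own implementation.
xmlid_to_textsigle() {
  from=${1%%@*}   # search pattern: everything before the first @
  to=${1#*@}      # replacement: everything after the first @
  # translate Perl-style $1, $2, ... into sed backreferences \1, \2, ...
  to=$(printf '%s' "$to" | sed -E 's/\$([0-9])/\\\1/g')
  # use # as delimiter because the replacement contains slashes
  printf '%s\n' "$2" | sed -E "s#${from}#${to}#"
}

xmlid_to_textsigle 'ICC.German\.([^.]+\.[^.]+)\.(.+)@ICCGER/$1/$2' \
  'ICC.German.DeReKo.WPD17.G11.00238'   # prints ICCGER/DeReKo.WPD17/G11.00238
```
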
=item B<--inline-tokens> <foundry>#[<file>]

Define the foundry and file (without extension)
to store inline token information in.
Unless C<--skip-inline-token-annotations> is set,
this will contain annotations as well.
Defaults to C<tokens> and C<morpho>.

The inline token data will also be stored in the
inline structures file (see I<--inline-structures>),
unless the inline token foundry is prepended
by an B<!> exclamation mark, indicating that inline
tokens are stored exclusively in the inline tokens
file.

Example:

  tei2korapxml --no-tokenizer --inline-tokens \
    '!gingko#morpho' < data.i5.xml > korapxml.zip

=item B<--inline-dependencies> <foundry>#[<file>]

Define the foundry and file (without extension)
to store inline dependency information in.
Defaults to the layer of C<dependency> and
will be ignored if not set (which means, dependency
attributes will be stored in the inline tokens file,
if not skipped).

The dependency data will also be stored in the
inline token file (see I<--inline-tokens>),
unless the inline dependencies foundry is prepended
by an B<!> exclamation mark, indicating that inline
dependency data is stored exclusively in the inline
dependencies file.

Example:

  tei2korapxml --no-tokenizer --inline-dependencies \
    'gingko#dependency' < data.i5.xml > korapxml.zip

=item B<--inline-structures> <foundry>#[<file>]

Define the foundry and file (without extension)
to store inline structure information in.
Defaults to C<struct> and C<structures>.

=item B<--base-foundry> <foundry>

Define the base foundry to store newly generated
token information in.
Defaults to C<base>.

=item B<--data-file> <file>

Define the file (without extension)
to store primary data information in.
Defaults to C<data>.

=item B<--header-file> <file>

Define the file name (without extension)
to store header information on
the corpus, document, and text level in.
Defaults to C<header>.

=item B<--use-tokenizer-sentence-splits|-s>

Use the sentence boundary information provided by the tokenizer,
either replacing existing boundaries or adding new ones.
Currently KorAP-tokenizer and certain external tokenizers support
these boundaries.

=item B<--tokens-file> <file>

Define the file (without extension)
to store generated token information in
(either from the KorAP tokenizer or an externally called tokenizer).
Defaults to C<tokens>.

=item B<--log|-l>

Log level for I<Log::Any>. Defaults to C<notice>.

=back

=head1 ENVIRONMENT VARIABLES

=over 2

=item B<KORAPXMLTEI_DEBUG>

Activate minimal debugging.
Defaults to C<false>.

=item B<KORAPXMLTEI_TOKENIZER_HEAP_SIZE>

Set the heap size for the tokenizer process.
Defaults to C<512m>.

=back

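A minimal sketch of setting both variables for a single run follows. The
values C<1> and C<2g> are assumptions chosen for illustration: any true
value is assumed to enable debugging, and the heap size is assumed to
accept the same notation as the C<512m> default. The input file name is
hypothetical.

```shell
# Assumed values, for illustration only.
export KORAPXMLTEI_DEBUG=1
export KORAPXMLTEI_TOKENIZER_HEAP_SIZE=2g

# Run the conversion only if the tool and the (hypothetical) input file exist.
if command -v tei2korapxml >/dev/null 2>&1 && [ -f corpus.i5.xml ]; then
  tei2korapxml -tk corpus.i5.xml > corpus.korapxml.zip
fi
```
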
=head1 COPYRIGHT AND LICENSE

Copyright (C) 2021-2025, L<IDS Mannheim|https://www.ids-mannheim.de/>

Author: Peter Harders

Contributors: Nils Diewald, Marc Kupietz, Carsten Schnober

L<KorAP::XML::TEI> is developed as part of the L<KorAP|https://korap.ids-mannheim.de/>
Corpus Analysis Platform at the
L<Leibniz Institute for the German Language (IDS)|https://www.ids-mannheim.de/>,
member of the
L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/>.

This program is free software published under the
L<BSD-2 License|https://opensource.org/licenses/BSD-2-Clause>.

=cut