Blame - Readme.pod - KorAP/KorAP-XML-TEI

Akron	0c41ab3	2020-09-29 07:33:33 +0200	[diff] [blame]	1	=pod
				2
				3	=encoding utf8
				4
				5	=head1 NAME
				6
				7	tei2korapxml - Conversion of TEI P5 based formats to KorAP-XML
				8
				9	=head1 SYNOPSIS
				10
Marc Kupietz	5b3f1d8	2024-07-05 17:50:55 +0200	[diff] [blame]	11	cat corpus.i5.xml | tei2korapxml -tk - > corpus.korapxml.zip
				12	tei2korapxml -tk corpus.i5.xml > corpus.korapxml.zip
Akron	0c41ab3	2020-09-29 07:33:33 +0200	[diff] [blame]	13
				14	=head1 DESCRIPTION
				15
				16	C<tei2korapxml> is a script to convert TEI P5 and
Akron	d72baca	2021-07-23 13:25:32 +0200	[diff] [blame]	17	L<I5|https://www.ids-mannheim.de/digspra/kl/projekte/korpora/textmodell>
Akron	0c41ab3	2020-09-29 07:33:33 +0200	[diff] [blame]	18	based documents to the
				19	L<KorAP-XML format|https://github.com/KorAP/KorAP-XML-Krill#about-korap-xml>.
Akron	0c41ab3	2020-09-29 07:33:33 +0200	[diff] [blame]	20
				21	This program is usually called from inside another script.
				22
				23	=head1 FORMATS
				24
				25	=head2 Input restrictions
				26
				27	=over 2
				28
				29	=item
				30
Akron	0c41ab3	2020-09-29 07:33:33 +0200	[diff] [blame]	31	TEI P5 formatted input with certain restrictions:
				32
				33	=over 4
				34
				35	=item
				36
Akron	e48bec4	2023-01-05 12:18:45 +0100	[diff] [blame]	37	B<mandatory>: text-header with integrated textsigle
				38	(or convertible identifier), text-body
Akron	0c41ab3	2020-09-29 07:33:33 +0200	[diff] [blame]	39
				40	=item
				41
				42	B<optional>: corp-header with integrated corpsigle,
				43	doc-header with integrated docsigle
				44
				45	=back
				46
				47	=item
				48
				49	Tokens inside the primary text must not be separated
				50	by newlines, because newlines are removed
				51	(see L<KorAP::XML::TEI::Data>), and converting newlines
				52	into blanks between two tokens could introduce blanks
				53	where there should be none (e.g. punctuation characters like C<,> or
				54	C<.> should not be separated from the preceding token).
Akron	8a0c4bf	2021-03-16 16:51:21 +0100	[diff] [blame]	55	See also the code section C<~ whitespace handling ~> in C<script/tei2korapxml>.
Akron	0c41ab3	2020-09-29 07:33:33 +0200	[diff] [blame]	56
Akron	940ca6f	2021-10-11 12:38:39 +0200	[diff] [blame]	57	=item
				58
				59	Header types, like C<E<lt>idsHeader [...] type="document" [...] E<gt>>
				60	need to be defined on the same line as the header tag.
				61
Akron	0c41ab3	2020-09-29 07:33:33 +0200	[diff] [blame]	62	=back
				63
				64	=head2 Notes on the output
				65
				66	=over 2
				67
				68	=item
				69
				70	zip file output (default on C<stdout>) with utf8 encoded entries
				71	(which together form the KorAP-XML format)
				72
				73	=back
				74
				75	=head1 INSTALLATION
				76
Akron	d26319b	2023-01-12 15:34:41 +0100	[diff] [blame]	77	C<tei2korapxml> requires C<libxml2-dev> bindings and L<File::ShareDir::Install> to be installed.
Marc Kupietz	e83a4e9	2021-03-16 20:51:26 +0100	[diff] [blame]	78	When these requirements are met, the preferred way to install the script is
Akron	0c41ab3	2020-09-29 07:33:33 +0200	[diff] [blame]	79	to use L<cpanm|App::cpanminus>.
				80
				81	$ cpanm https://github.com/KorAP/KorAP-XML-TEI.git
				82
				83	In case everything went well, the C<tei2korapxml> tool will
				84	be available on your command line immediately.
				85
Marc Kupietz	4ad648e	2025-12-10 10:38:46 +0100	[diff] [blame]	86	Minimum requirement for L<KorAP::XML::TEI> is Perl 5.38.
Akron	0c41ab3	2020-09-29 07:33:33 +0200	[diff] [blame]	87
				88	=head1 OPTIONS
				89
				90	=over 2
				91
Akron	1148478	2021-11-03 20:12:14 +0100	[diff] [blame]	92	=item B<--input|-i>
				93
				94	The input file to process. If no specific input is defined and a single
				95	dash C<-> is passed as an argument, data is read from C<STDIN>.
				96
Marc Kupietz	5b3f1d8	2024-07-05 17:50:55 +0200	[diff] [blame]	97	Instead of using C<-i>, input files can also be defined as trailing arguments
				98	to the command:
				99
				100	tei2korapxml -tk corpus1.i5.xml corpus2.i5.xml
				101
Marc Kupietz	2115ecc	2025-12-10 11:37:03 +0100	[diff] [blame^]	102	=item B<--progress|-p>
				103
				104	Show a progress bar (including ETA).
				105	This option is ignored if valid input is not read from a file.
				106
Akron	6b1f26b	2024-09-19 11:35:32 +0200	[diff] [blame]	107	=item B<--output|-o>
				108
				109	The output zip file to be created. If no specific output is defined,
				110	data is written to C<STDOUT>.
Akron	1148478	2021-11-03 20:12:14 +0100	[diff] [blame]	111
Akron	0c41ab3	2020-09-29 07:33:33 +0200	[diff] [blame]	112	=item B<--root|-r>
				113
				114	The root directory for output. Defaults to C<.>.
				115
				116	=item B<--help|-h>
				117
				118	Print help information.
				119
				120	=item B<--version|-v>
				121
				122	Print version information.
				123
Akron	e48bec4	2023-01-05 12:18:45 +0100	[diff] [blame]	124	=item B<--tokenizer-korap|-tk>
				125
				126	Use the standard KorAP/DeReKo tokenizer.
				127
				128	=item B<--tokenizer-internal|-ti>
				129
				130	Tokenize the data using two embedded tokenizers,
				131	which take an I<aggressive> and a I<conservative>
				132	approach.
				133
Akron	0c41ab3	2020-09-29 07:33:33 +0200	[diff] [blame]	134	=item B<--tokenizer-call|-tc>
				135
				136	Call an external tokenizer process that tokenizes
Akron	1148478	2021-11-03 20:12:14 +0100	[diff] [blame]	137	input from STDIN and outputs the offsets of all tokens.
				138
				139	Texts are separated using C<\x04\n>. The external process
				140	should add a new line per text.
				141
				142	If the L</--use-tokenizer-sentence-splits> option is activated,
				143	sentences are marked by offset as well in new lines.
				144
				145	To use L<Datok|https://github.com/KorAP/Datok> including sentence
				146	splitting, call C<tei2korapxml> as follows:
				147
				148	$ cat corpus.i5.xml | tei2korapxml -s \
				149	  -tc 'datok tokenize \
				150	  -t ./tokenizer.matok \
				151	  -p --newline-after-eot --no-sentences \
				152	  --no-tokens --sentence-positions -' - \
				153	  > corpus.korapxml.zip
Akron	0c41ab3	2020-09-29 07:33:33 +0200	[diff] [blame]	154
Akron	6b1f26b	2024-09-19 11:35:32 +0200	[diff] [blame]	155	=item B<--no-tokenizer>
				156
				157	Boolean flag indicating that no tokenizer should be used.
				158	The explicit flag ensures that, by default, a final token layer always
				159	exists.
				160	If a separate tokenizer is chosen, this flag is ignored.
				161
Akron	75d6314	2021-02-23 18:40:56 +0100	[diff] [blame]	162	=item B<--skip-inline-tokens>
				163
				164	Boolean flag indicating that inline tokens should not
				165	be processed. Defaults to false (meaning inline tokens will be processed).
				166
Akron	692d17d	2021-03-05 13:21:03 +0100	[diff] [blame]	167	=item B<--skip-inline-token-annotations>
				168
				169	Boolean flag indicating that inline token annotations should not
				170	be processed. Defaults to true (meaning inline token annotations
Akron	6b1f26b	2024-09-19 11:35:32 +0200	[diff] [blame]	171	won't be processed). Can be negated with
				172	C<--no-skip-inline-token-annotations>.
Akron	692d17d	2021-03-05 13:21:03 +0100	[diff] [blame]	173
Akron	ca70a1d	2021-02-25 16:21:31 +0100	[diff] [blame]	174	=item B<--skip-inline-tags> <tags>
Akron	54c3ff1	2021-02-25 11:33:37 +0100	[diff] [blame]	175
				176	Expects a comma-separated list of tags to be ignored when the structure
				177	is parsed. The content of these tags, however, will still be processed.
				178
Marc Kupietz	fc3a0ee	2024-07-05 16:58:16 +0200	[diff] [blame]	179	=item B<--auto-textsigle> <textsigle>
				180
				181	Expects a text sigle that serves as a fallback if no text sigles
				182	are given in the input data.
				183	The auto text sigle will be incremented for each text processed.
				184
				185	Example:
				186
				187	tei2korapxml --auto-textsigle 'ICC/GER.00001' -s -tk - \
				188	< data.i5.xml > korapxml.zip
				189
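The incrementing behaviour can be pictured with a small shell sketch. It rests on an assumption that the text above does not spell out: that the zero-padded number after the last C<.> is the counter and that its width is preserved. The variable names are illustrative and not part of C<tei2korapxml>.

```shell
# Assumption: the numeric suffix after the last '.' is the counter.
sigle='ICC/GER.00001'

prefix=${sigle%.*}     # everything before the last '.': ICC/GER
number=${sigle##*.}    # zero-padded counter: 00001
width=${#number}       # remember the padding width

# Strip leading zeros so the shell does not read the number as octal.
value=$(printf '%s' "$number" | sed 's/^0*//')

# Add one and pad back to the original width.
next_sigle=$(printf "%s.%0${width}d" "$prefix" "$(( ${value:-0} + 1 ))")
printf '%s\n' "$next_sigle"
```

Under this assumption, the second text would receive the sigle C<ICC/GER.00002>.
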
Marc Kupietz	a671ae5	2022-12-22 16:28:14 +0100	[diff] [blame]	190	=item B<--xmlid-to-textsigle> <from-regex>@<to-c/to-d/to-t>
				191
Akron	e48bec4	2023-01-05 12:18:45 +0100	[diff] [blame]	192	Expects a regular replacement expression (separated by B<@> between the
Marc Kupietz	a671ae5	2022-12-22 16:28:14 +0100	[diff] [blame]	193	search and the replacement) to convert text id attributes to text sigles
				194	with three parts (separated by B</>).
				195
				196	Example:
				197
				198	tei2korapxml \
				199	--xmlid-to-textsigle 'ICC.German\.([^.]+\.[^.]+)\.(.+)@ICCGER/$1/$2' \
				200	-tk - < t/data/icc_german_sample.p5.xml
				201
Akron	e48bec4	2023-01-05 12:18:45 +0100	[diff] [blame]	202	Converts text id C<ICC.German.DeReKo.WPD17.G11.00238> to
				203	sigle C<ICCGER/DeReKo.WPD17/G11.00238>.
Marc Kupietz	a671ae5	2022-12-22 16:28:14 +0100	[diff] [blame]	204
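The mapping above can be reproduced outside the tool with C<sed>, which may help when testing a pattern before a long conversion run. This is only a sketch: it does not show how C<tei2korapxml> applies the expression internally, and C<sed> writes the capture groups as C<\1> and C<\2> where the option uses C<$1> and C<$2>.

```shell
# Text id taken from the example above.
text_id='ICC.German.DeReKo.WPD17.G11.00238'

# Same search pattern as in the option value, with '@' as the delimiter;
# the replacement uses sed's \1 and \2 in place of $1 and $2.
sigle=$(printf '%s\n' "$text_id" \
  | sed -E 's@ICC.German\.([^.]+\.[^.]+)\.(.+)@ICCGER/\1/\2@')

printf '%s\n' "$sigle"
```
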
Akron	1a5271a	2021-02-18 13:18:15 +0100	[diff] [blame]	205	=item B<--inline-tokens> <foundry>#[<file>]
				206
				207	Define the foundry and file (without extension)
				208	to store inline token information in.
Akron	8a0c4bf	2021-03-16 16:51:21 +0100	[diff] [blame]	209	Unless C<--skip-inline-token-annotations> is set,
				210	this will contain annotations as well.
Akron	1a5271a	2021-02-18 13:18:15 +0100	[diff] [blame]	211	Defaults to C<tokens> and C<morpho>.
				212
Akron	e2819a1	2021-10-12 15:52:55 +0200	[diff] [blame]	213	The inline token data will also be stored in the
				214	inline structures file (see I<--inline-structures>),
				215	unless the inline token foundry is prefixed
				216	with an exclamation mark (B<!>), indicating that inline
				217	tokens are stored exclusively in the inline tokens
				218	file.
				219
				220	Example:
				221
Akron	6b1f26b	2024-09-19 11:35:32 +0200	[diff] [blame]	222	tei2korapxml --no-tokenizer --inline-tokens \
				223	'!gingko#morpho' < data.i5.xml > korapxml.zip
				224
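To make the C<E<lt>foundryE<gt>#[E<lt>fileE<gt>]> syntax concrete, the following shell sketch splits such a value into its parts. The function C<parse_inline_spec> and its variables are hypothetical and not part of C<tei2korapxml>; the sketch only illustrates how the leading B<!> and the B<#> separator are read, and it does not apply any defaults.

```shell
# Hypothetical helper (not part of tei2korapxml): split a
# '<foundry>#[<file>]' value, with optional leading '!', into its parts.
parse_inline_spec() {
  spec=$1
  exclusive=no
  case $spec in
    '!'*)
      exclusive=yes
      spec=${spec#?}        # drop the leading '!'
      ;;
  esac
  foundry=${spec%%#*}       # text before the first '#'
  case $spec in
    *'#'*) file=${spec#*#} ;;   # text after the first '#'
    *)     file='' ;;           # no file part given
  esac
}

parse_inline_spec '!gingko#morpho'
printf 'exclusive=%s foundry=%s file=%s\n' "$exclusive" "$foundry" "$file"
```
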
				225	=item B<--inline-dependencies> <foundry>#[<file>]
				226
				227	Define the foundry and file (without extension)
				228	to store inline dependency information in.
				229	The layer defaults to C<dependency>. If this option
				230	is not set, it is ignored, which means that dependency
				231	attributes will be stored in the inline tokens file
				232	(if not skipped).
				233
				234	The dependency data will also be stored in the
				235	inline token file (see I<--inline-tokens>),
				236	unless the inline dependencies foundry is prefixed
				237	with an exclamation mark (B<!>), indicating that inline
				238	dependency data is stored exclusively in the inline
				239	dependencies file.
				240
				241	Example:
				242
				243	tei2korapxml --no-tokenizer --inline-dependencies \
				244	'gingko#dependency' < data.i5.xml > korapxml.zip
				245
Akron	e2819a1	2021-10-12 15:52:55 +0200	[diff] [blame]	246
Akron	dd0be8f	2021-02-18 19:29:41 +0100	[diff] [blame]	247	=item B<--inline-structures> <foundry>#[<file>]
				248
				249	Define the foundry and file (without extension)
				250	to store inline structure information in.
				251	Defaults to C<struct> and C<structures>.
Akron	75d6314	2021-02-23 18:40:56 +0100	[diff] [blame]	252
Akron	26a7152	2021-02-19 10:27:37 +0100	[diff] [blame]	253	=item B<--base-foundry> <foundry>
				254
				255	Define the base foundry to store newly generated
				256	token information in.
				257	Defaults to C<base>.
				258
				259	=item B<--data-file> <file>
				260
				261	Define the file (without extension)
				262	to store primary data information in.
				263	Defaults to C<data>.
				264
				265	=item B<--header-file> <file>
				266
				267	Define the file name (without extension)
				268	to store header information on
				269	the corpus, document, and text level in.
				270	Defaults to C<header>.
Akron	dd0be8f	2021-02-18 19:29:41 +0100	[diff] [blame]	271
Marc Kupietz	985da0c	2021-02-15 19:29:50 +0100	[diff] [blame]	272	=item B<--use-tokenizer-sentence-splits|-s>
				273
				274	Use the sentence boundary information provided by the tokenizer,
Akron	1148478	2021-11-03 20:12:14 +0100	[diff] [blame]	275	replacing existing boundaries or adding new ones.
				276	Currently, the KorAP tokenizer and certain external tokenizers support
				277	these boundaries.
Marc Kupietz	985da0c	2021-02-15 19:29:50 +0100	[diff] [blame]	278
Akron	91705d7	2021-02-19 10:59:45 +0100	[diff] [blame]	279	=item B<--tokens-file> <file>
				280
				281	Define the file (without extension)
				282	to store generated token information in
				283	(either from the KorAP tokenizer or an externally called tokenizer).
				284	Defaults to C<tokens>.
				285
Akron	0c41ab3	2020-09-29 07:33:33 +0200	[diff] [blame]	286	=item B<--log|-l>
				287
				288	Log level for I<Log::Any>. Defaults to C<notice>.
				289
				290	=back
				291
Akron	b364947	2020-09-29 08:24:46 +0200	[diff] [blame]	292	=head1 ENVIRONMENT VARIABLES
				293
				294	=over 2
				295
				296	=item B<KORAPXMLTEI_DEBUG>
				297
				298	Activate minimal debugging.
				299	Defaults to C<false>.
				300
Marc Kupietz	d254f5c	2025-04-16 10:37:08 +0200	[diff] [blame]	301	=item B<KORAPXMLTEI_TOKENIZER_HEAP_SIZE>
				302
				303	Set the heap size for the tokenizer process.
				304	Defaults to C<512m>.
				305
Akron	b364947	2020-09-29 08:24:46 +0200	[diff] [blame]	306	=back
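As a usage sketch, the heap size can be raised for a single shell session before calling the tool. The value C<1g> is an assumption that the variable accepts the same size notation as the documented default C<512m>; the file name C<corpus.i5.xml> is taken from the synopsis.

```shell
# Assumption: the value uses the same size notation as the default (512m).
export KORAPXMLTEI_TOKENIZER_HEAP_SIZE=1g

# Run the conversion only if the tool and the input file are present.
if command -v tei2korapxml >/dev/null 2>&1 && [ -f corpus.i5.xml ]; then
  tei2korapxml -tk corpus.i5.xml > corpus.korapxml.zip
fi
```
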
				307
Akron	0c41ab3	2020-09-29 07:33:33 +0200	[diff] [blame]	308	=head1 COPYRIGHT AND LICENSE
				309
Marc Kupietz	b6fd6bc	2025-04-16 12:47:26 +0200	[diff] [blame]	310	Copyright (C) 2021-2025, L<IDS Mannheim|https://www.ids-mannheim.de/>
Akron	0c41ab3	2020-09-29 07:33:33 +0200	[diff] [blame]	311
				312	Author: Peter Harders
				313
Akron	aabd095	2020-09-29 07:35:08 +0200	[diff] [blame]	314	Contributors: Nils Diewald, Marc Kupietz, Carsten Schnober
Akron	0c41ab3	2020-09-29 07:33:33 +0200	[diff] [blame]	315
				316	L<KorAP::XML::TEI> is developed as part of the L<KorAP|https://korap.ids-mannheim.de/>
				317	Corpus Analysis Platform at the
Akron	d72baca	2021-07-23 13:25:32 +0200	[diff] [blame]	318	L<Leibniz Institute for the German Language (IDS)|https://www.ids-mannheim.de/>,
Akron	0c41ab3	2020-09-29 07:33:33 +0200	[diff] [blame]	319	member of the
				320	L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/>.
				321
				322	This program is free software published under the
Marc Kupietz	e955ecc	2021-02-17 17:42:01 +0100	[diff] [blame]	323	L<BSD-2 License|https://opensource.org/licenses/BSD-2-Clause>.
Akron	0c41ab3	2020-09-29 07:33:33 +0200	[diff] [blame]	324
Akron	692d17d	2021-03-05 13:21:03 +0100	[diff] [blame]	325	=cut