=pod

=encoding utf8

=head1 NAME

tei2korapxml - Conversion of TEI P5 based formats to KorAP-XML

=head1 SYNOPSIS

  cat corpus.i5.xml | tei2korapxml - > corpus.korapxml.zip

=head1 DESCRIPTION

C<tei2korapxml> is a script to convert TEI P5 and
L<I5|https://www.ids-mannheim.de/digspra/kl/projekte/korpora/textmodell>
based documents to the
L<KorAP-XML format|https://github.com/KorAP/KorAP-XML-Krill#about-korap-xml>.

This program is usually called from inside another script.

=head1 FORMATS

=head2 Input restrictions

=over 2

=item

TEI P5 formatted input with certain restrictions:

=over 4

=item

B<mandatory>: text-header with integrated textsigle
(or convertible identifier), text-body

=item

B<optional>: corp-header with integrated corpsigle,
doc-header with integrated docsigle

=back

=item

Tokens inside the primary text must not be separated by
newlines, because newlines are removed
(see L<KorAP::XML::TEI::Data>), and converting newlines
into blanks between two tokens could introduce additional blanks
where there should be none (e.g. punctuation characters like C<,> or
C<.> should not be separated from their predecessor token).
See also the code section C<~ whitespace handling ~> in C<script/tei2korapxml>.

=item

Header types, like C<E<lt>idsHeader [...] type="document" [...] E<gt>>,
need to be defined in the same line as the header tag.

=back

=head2 Notes on the output

=over 2

=item

Zip file output (by default on C<stdout>) with UTF-8 encoded entries
(which together form the KorAP-XML format)

=back

=head1 INSTALLATION

C<tei2korapxml> requires C<libxml2-dev> bindings and L<File::ShareDir::Install> to be installed.
When these requirements are met, the preferred way to install the script is
to use L<cpanm|App::cpanminus>.

  $ cpanm https://github.com/KorAP/KorAP-XML-TEI.git

In case everything went well, the C<tei2korapxml> tool will
be available on your command line immediately.

The minimum requirement for L<KorAP::XML::TEI> is Perl 5.16.

=head1 OPTIONS

=over 2

=item B<--input|-i>

The input file to process. If no specific input is defined and a single
dash C<-> is passed as an argument, data is read from C<STDIN>.

=item B<--output|-o>

The output zip file to be created. If no specific output is defined,
data is written to C<STDOUT>.

=item B<--root|-r>

The root directory for output. Defaults to C<.>.

=item B<--help|-h>

Print help information.

=item B<--version|-v>

Print version information.

=item B<--tokenizer-korap|-tk>

Use the standard KorAP/DeReKo tokenizer.

=item B<--tokenizer-internal|-ti>

Tokenize the data using two embedded tokenizers
that take an I<aggressive> and a I<conservative>
approach.

=item B<--tokenizer-call|-tc>

Call an external tokenizer process that tokenizes
from C<STDIN> and outputs the offsets of all tokens.

Texts are separated using C<\x04\n>. The external process
should add a new line per text.

If the L</--use-tokenizer-sentence-splits> option is activated,
sentences are marked by offset as well in new lines.

To use L<Datok|https://github.com/KorAP/Datok> including sentence
splitting, call C<tei2korapxml> as follows:

  $ cat corpus.i5.xml | tei2korapxml -s \
      -tc 'datok tokenize \
      -t ./tokenizer.matok \
      -p --newline-after-eot --no-sentences \
      --no-tokens --sentence-positions -' - \
      > corpus.korapxml.zip

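The exchange with an external tokenizer can be illustrated with a toy
stand-in. The following shell sketch is not part of C<tei2korapxml>: the
function name C<toy_tokenizer> is hypothetical, and the output layout
(space-separated start and end character offsets, one line per text) is
an assumption, not a documented format. It only splits ASCII text on
whitespace and skips empty texts.

```shell
# Hypothetical stand-in for an external tokenizer (illustration only).
# Reads texts terminated by "\x04\n" from STDIN and prints one line per
# text with start/end offsets of every whitespace-delimited token.
toy_tokenizer() {
  awk 'BEGIN { RS = "\004" }
  {
    text = $0
    sub(/^\n/, "", text)          # drop the newline that follows each \x04
    if (text == "") next          # trailing remainder or empty text
    out = ""; pos = 0
    while (match(text, /[^ \t\n]+/)) {
      start = pos + RSTART - 1    # offsets are zero-based
      stop = start + RLENGTH      # end offset is exclusive
      out = out (out == "" ? "" : " ") start " " stop
      text = substr(text, RSTART + RLENGTH)
      pos = stop
    }
    print out
  }'
}

# Two texts, each terminated by \x04 and a newline.
printf 'Hi there\004\nOne two\004\n' | toy_tokenizer
```

Under these assumptions, such a function could be saved as a script
and passed to C<--tokenizer-call> in place of a real tokenizer.
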
=item B<--no-tokenizer>

Boolean flag indicating that no tokenizer should be used.
This is meant to ensure that by default a final token layer always
exists.
If a separate tokenizer is chosen, this flag is ignored.

=item B<--skip-inline-tokens>

Boolean flag indicating that inline tokens should not
be processed. Defaults to false (meaning inline tokens will be processed).

=item B<--skip-inline-token-annotations>

Boolean flag indicating that inline token annotations should not
be processed. Defaults to true (meaning inline token annotations
won't be processed). Can be negated with
C<--no-skip-inline-token-annotations>.

=item B<--skip-inline-tags> <tags>

Expects a comma-separated list of tags to be ignored when the structure
is parsed. The content of these tags will, however, still be processed.

=item B<--auto-textsigle> <textsigle>

Expects a text sigle that serves as a fallback if no text sigles
are given in the input data.
The auto text sigle will be incremented for each text processed.

Example:

  tei2korapxml --auto-textsigle 'ICC/GER.00001' -s -tk - \
    < data.i5.xml > korapxml.zip

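How the sigle is incremented is not specified here. The following shell
sketch shows one plausible reading, assuming that the trailing digits of
the sigle are counted up and that their zero-padded width is preserved.
The function name C<next_textsigle> is hypothetical and not part of
C<tei2korapxml>.

```shell
# Hypothetical illustration: count up the trailing digits of a text
# sigle while keeping their width (an assumption, not the tool's code).
next_textsigle() {
  sigle=$1
  digits=${sigle##*[!0-9]}              # trailing run of digits
  prefix=${sigle%"$digits"}             # everything before that run
  width=${#digits}
  num=${digits#"${digits%%[!0]*}"}      # strip leading zeros (avoid octal)
  printf "%s%0${width}d\n" "$prefix" "$(( ${num:-0} + 1 ))"
}

next_textsigle 'ICC/GER.00001'   # prints ICC/GER.00002
```
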
=item B<--xmlid-to-textsigle> <from-regex>@<to-c/to-d/to-t>

Expects a regular replacement expression (separated by B<@> between the
search and the replacement) to convert text id attributes to text sigles
with three parts (separated by B</>).

Example:

  tei2korapxml \
    --xmlid-to-textsigle 'ICC.German\.([^.]+\.[^.]+)\.(.+)@ICCGER/$1/$2' \
    -tk - < t/data/icc_german_sample.p5.xml

Converts text id C<ICC.German.DeReKo.WPD17.G11.00238> to
sigle C<ICCGER/DeReKo.WPD17/G11.00238>.

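The effect of such an expression can be tried out without running the
converter. The shell sketch below splits the expression at the first
B<@> and applies it with C<sed -E>. It only approximates the Perl regular
expression semantics used by the tool, and the function name
C<xmlid_to_textsigle> is hypothetical.

```shell
# Hypothetical helper (illustration only): apply a <from>@<to> rule to
# a text id using sed's extended regular expressions.
xmlid_to_textsigle() {
  rule=$1
  id=$2
  from=${rule%%@*}    # search pattern: everything before the first @
  to=${rule#*@}       # replacement: everything after the first @
  # sed writes backreferences as \1 where the option uses $1.
  to=$(printf '%s\n' "$to" | sed 's/\$\([0-9]\)/\\\1/g')
  printf '%s\n' "$id" | sed -E "s@${from}@${to}@"
}

xmlid_to_textsigle 'ICC.German\.([^.]+\.[^.]+)\.(.+)@ICCGER/$1/$2' \
  'ICC.German.DeReKo.WPD17.G11.00238'
# prints ICCGER/DeReKo.WPD17/G11.00238
```
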
=item B<--inline-tokens> <foundry>#[<file>]

Define the foundry and file (without extension)
to store inline token information in.
Unless C<--skip-inline-token-annotations> is set,
this will contain annotations as well.
Defaults to C<tokens> and C<morpho>.

The inline token data will also be stored in the
inline structures file (see I<--inline-structures>),
unless the inline token foundry is prepended
by an B<!> exclamation mark, indicating that inline
tokens are stored exclusively in the inline tokens
file.

Example:

  tei2korapxml --no-tokenizer --inline-tokens \
    '!gingko#morpho' < data.i5.xml > korapxml.zip

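The C<E<lt>foundryE<gt>#[E<lt>fileE<gt>]> notation, including the optional
B<!> prefix, can be read as three values. The shell sketch below
illustrates that reading, using the defaults C<tokens> and C<morpho>
named above. It is not the tool's own parser, and the function name
C<parse_inline_spec> is hypothetical.

```shell
# Hypothetical illustration: split "[!]<foundry>#[<file>]" into an
# exclusive flag, a foundry and a file name.
parse_inline_spec() {
  spec=$1
  exclusive=no
  case $spec in
    '!'*) exclusive=yes; spec=${spec#?} ;;   # leading ! means exclusive
  esac
  foundry=${spec%%'#'*}                      # part before the first #
  case $spec in
    *'#'*) file=${spec#*'#'} ;;              # part after the first #
    *)     file= ;;
  esac
  # Fall back to the documented defaults for missing parts.
  printf '%s %s %s\n' "$exclusive" "${foundry:-tokens}" "${file:-morpho}"
}

parse_inline_spec '!gingko#morpho'   # prints: yes gingko morpho
```
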
=item B<--inline-dependencies> <foundry>#[<file>]

Define the foundry and file (without extension)
to store inline dependency information in.
Defaults to the layer of C<dependency> and
will be ignored if not set (which means, dependency
attributes will be stored in the inline tokens file,
if not skipped).

The dependency data will also be stored in the
inline token file (see I<--inline-tokens>),
unless the inline dependencies foundry is prepended
by an B<!> exclamation mark, indicating that inline
dependency data is stored exclusively in the inline
dependencies file.

Example:

  tei2korapxml --no-tokenizer --inline-dependencies \
    'gingko#dependency' < data.i5.xml > korapxml.zip

=item B<--inline-structures> <foundry>#[<file>]

Define the foundry and file (without extension)
to store inline structure information in.
Defaults to C<struct> and C<structures>.

=item B<--base-foundry> <foundry>

Define the base foundry to store newly generated
token information in.
Defaults to C<base>.

=item B<--data-file> <file>

Define the file (without extension)
to store primary data information in.
Defaults to C<data>.

=item B<--header-file> <file>

Define the file name (without extension)
to store header information on
the corpus, document, and text level in.
Defaults to C<header>.

=item B<--use-tokenizer-sentence-splits|-s>

Replace existing sentence boundary information with, or add,
the sentence boundaries provided by the tokenizer.
Currently KorAP-tokenizer and certain external tokenizers support
these boundaries.

=item B<--tokens-file> <file>

Define the file (without extension)
to store generated token information in
(either from the KorAP tokenizer or an externally called tokenizer).
Defaults to C<tokens>.

=item B<--log|-l>

Log level for I<Log::Any>. Defaults to C<notice>.

=back

=head1 ENVIRONMENT VARIABLES

=over 2

=item B<KORAPXMLTEI_DEBUG>

Activate minimal debugging.
Defaults to C<false>.

=back

=head1 COPYRIGHT AND LICENSE

Copyright (C) 2021-2024, L<IDS Mannheim|https://www.ids-mannheim.de/>

Author: Peter Harders

Contributors: Nils Diewald, Marc Kupietz, Carsten Schnober

L<KorAP::XML::TEI> is developed as part of the L<KorAP|https://korap.ids-mannheim.de/>
Corpus Analysis Platform at the
L<Leibniz Institute for the German Language (IDS)|https://www.ids-mannheim.de/>,
member of the
L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/>.

This program is free software published under the
L<BSD-2 License|https://opensource.org/licenses/BSD-2-Clause>.

=cut