Blame - Readme.pod - KorAP/KorAP-XML-TEI

Akron	0c41ab3	2020-09-29 07:33:33 +0200	[diff] [blame]	1	=pod
				2
				3	=encoding utf8
				4
				5	=head1 NAME
				6
				7	tei2korapxml - Conversion of TEI P5 based formats to KorAP-XML
				8
				9	=head1 SYNOPSIS
				10
Akron	1148478	2021-11-03 20:12:14 +0100	[diff] [blame]	11	cat corpus.i5.xml | tei2korapxml - > corpus.korapxml.zip
Akron	0c41ab3	2020-09-29 07:33:33 +0200	[diff] [blame]	12
				13	=head1 DESCRIPTION
				14
				15	C<tei2korapxml> is a script to convert TEI P5 and
Akron	d72baca	2021-07-23 13:25:32 +0200	[diff] [blame]	16	L<I5|https://www.ids-mannheim.de/digspra/kl/projekte/korpora/textmodell>
Akron	0c41ab3	2020-09-29 07:33:33 +0200	[diff] [blame]	17	based documents to the
				18	L<KorAP-XML format|https://github.com/KorAP/KorAP-XML-Krill#about-korap-xml>.
Akron	0c41ab3	2020-09-29 07:33:33 +0200	[diff] [blame]	19
				20	This program is usually called from inside another script.
				21
				22	=head1 FORMATS
				23
				24	=head2 Input restrictions
				25
				26	=over 2
				27
				28	=item
				29
Akron	0c41ab3	2020-09-29 07:33:33 +0200	[diff] [blame]	30	TEI P5 formatted input with certain restrictions:
				31
				32	=over 4
				33
				34	=item
				35
Akron	e48bec4	2023-01-05 12:18:45 +0100	[diff] [blame]	36	B<mandatory>: text-header with integrated textsigle
				37	(or convertible identifier), text-body
Akron	0c41ab3	2020-09-29 07:33:33 +0200	[diff] [blame]	38
				39	=item
				40
				41	B<optional>: corp-header with integrated corpsigle,
				42	doc-header with integrated docsigle
				43
				44	=back
				45
				46	=item
				47
				48	Tokens inside the primary text must not be separated
				49	by newlines, because newlines are removed
				50	(see L<KorAP::XML::TEI::Data>), and converting newlines
				51	into blanks between two tokens could introduce additional blanks
				52	where there should be none (e.g. punctuation characters like C<,> or
				53	C<.> should not be separated from their predecessor token).
Akron	8a0c4bf	2021-03-16 16:51:21 +0100	[diff] [blame]	54	See also the code section C<~ whitespace handling ~> in C<script/tei2korapxml>.
Akron	0c41ab3	2020-09-29 07:33:33 +0200	[diff] [blame]	55
Akron	940ca6f	2021-10-11 12:38:39 +0200	[diff] [blame]	56	=item
				57
				58	Header types, like C<E<lt>idsHeader [...] type="document" [...] E<gt>>
				59	need to be defined on the same line as the header tag.
				60
Akron	0c41ab3	2020-09-29 07:33:33 +0200	[diff] [blame]	61	=back
				62
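The newline restriction above can be illustrated with a small shell sketch. It only mimics the two strategies with C<tr> and is not the converter's actual code; the function names are hypothetical.

```shell
# Hypothetical helpers that mimic the two newline strategies with tr.
remove_newlines() { tr -d '\n'; }       # what the README describes: newlines are dropped
newlines_to_blanks() { tr '\n' ' '; }   # the alternative the README warns about

# Three tokens, one per line: "Hallo", "Welt" and "."
printf 'Hallo\nWelt\n.' | remove_newlines; echo
# prints: HalloWelt.   (the two words are merged)

printf 'Hallo\nWelt\n.' | newlines_to_blanks; echo
# prints: Hallo Welt . (a blank appears before the full stop)
```

Neither result matches the intended text C<Hallo Welt.>, which is why tokens in the input should already be separated by blanks rather than by newlines.
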
				63	=head2 Notes on the output
				64
				65	=over 2
				66
				67	=item
				68
				69	ZIP file output (written to C<STDOUT> by default) with UTF-8 encoded entries
				70	(which together form the KorAP-XML format)
				71
				72	=back
				73
				74	=head1 INSTALLATION
				75
Akron	d26319b	2023-01-12 15:34:41 +0100	[diff] [blame]	76	C<tei2korapxml> requires C<libxml2-dev> bindings and L<File::ShareDir::Install> to be installed.
Marc Kupietz	e83a4e9	2021-03-16 20:51:26 +0100	[diff] [blame]	77	When these requirements are met, the preferred way to install the script is
Akron	0c41ab3	2020-09-29 07:33:33 +0200	[diff] [blame]	78	to use L<cpanm|App::cpanminus>.
				79
				80	$ cpanm https://github.com/KorAP/KorAP-XML-TEI.git
				81
				82	In case everything went well, the C<tei2korapxml> tool will
				83	be available on your command line immediately.
				84
Akron	6b1f26b	2024-09-19 11:35:32 +0200	[diff] [blame^]	85	Minimum requirement for L<KorAP::XML::TEI> is Perl 5.16.
Akron	0c41ab3	2020-09-29 07:33:33 +0200	[diff] [blame]	86
				87	=head1 OPTIONS
				88
				89	=over 2
				90
Akron	1148478	2021-11-03 20:12:14 +0100	[diff] [blame]	91	=item B<--input|-i>
				92
				93	The input file to process. If no specific input is defined and a single
				94	dash C<-> is passed as an argument, data is read from C<STDIN>.
				95
Akron	6b1f26b	2024-09-19 11:35:32 +0200	[diff] [blame^]	96	=item B<--output|-o>
				97
				98	The output zip file to be created. If no specific output is defined,
				99	data is written to C<STDOUT>.
Akron	1148478	2021-11-03 20:12:14 +0100	[diff] [blame]	100
Akron	0c41ab3	2020-09-29 07:33:33 +0200	[diff] [blame]	101	=item B<--root|-r>
				102
				103	The root directory for output. Defaults to C<.>.
				104
				105	=item B<--help|-h>
				106
				107	Print help information.
				108
				109	=item B<--version|-v>
				110
				111	Print version information.
				112
Akron	e48bec4	2023-01-05 12:18:45 +0100	[diff] [blame]	113	=item B<--tokenizer-korap|-tk>
				114
				115	Use the standard KorAP/DeReKo tokenizer.
				116
				117	=item B<--tokenizer-internal|-ti>
				118
				119	Tokenize the data using two embedded tokenizers
				120	that take an I<aggressive> and a I<conservative>
				121	approach, respectively.
				122
Akron	0c41ab3	2020-09-29 07:33:33 +0200	[diff] [blame]	123	=item B<--tokenizer-call|-tc>
				124
				125	Call an external tokenizer process that tokenizes text read
Akron	1148478	2021-11-03 20:12:14 +0100	[diff] [blame]	126	from STDIN and outputs the offsets of all tokens.
				127
				128	Texts are separated using C<\x04\n>. The external process
				129	should add a new line per text.
				130
				131	If the L</--use-tokenizer-sentence-splits> option is activated,
				132	sentences are marked by offset as well in new lines.
				133
				134	To use L<Datok|https://github.com/KorAP/Datok> including sentence
				135	splitting, call C<tei2korapxml> as follows:
				136
				137	$ cat corpus.i5.xml | tei2korapxml -s \
				138	  -tc 'datok tokenize \
				139	  -t ./tokenizer.matok \
				140	  -p --newline-after-eot --no-sentences \
				141	  --no-tokens --sentence-positions -' - \
				142	  > corpus.korapxml.zip
Akron	0c41ab3	2020-09-29 07:33:33 +0200	[diff] [blame]	143
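The exchange with an external tokenizer can be sketched with a toy stand-in. This is only an illustration under stated assumptions: the function name C<toy_tokenizer> is hypothetical, each text is assumed to arrive as a single line terminated by C<\x04\n>, and the offsets are assumed to be expected as space-separated start/end pairs on one line per text, a format this document does not spell out. The sketch splits on blanks only, so it is no replacement for a real tokenizer such as Datok.

```shell
# Toy stand-in for an external tokenizer (hypothetical, for illustration only).
# Assumption: one text per input line, terminated by \x04\n.
# Assumption: output is "start end start end ..." on one line per text.
toy_tokenizer() {
  # Drop the \x04 end-of-text marker, then report blank-delimited tokens.
  tr -d '\004' | awk '{
    line = $0; pos = 0; out = ""
    while (match(line, /[^ \t]+/)) {
      start = pos + RSTART - 1       # zero-based start offset of the token
      end = start + RLENGTH          # end offset (exclusive)
      if (out != "") out = out " "
      out = out start " " end
      pos = end
      line = substr(line, RSTART + RLENGTH)
    }
    print out                        # one output line per text
  }'
}

printf 'Der alte Mann\004\nHallo Welt\004\n' | toy_tokenizer
# prints:
# 0 3 4 8 9 13
# 0 5 6 10
```

Offsets here count whatever C<awk> treats as a character in the current locale, which may differ from what the converter expects for non-ASCII text.
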
Akron	6b1f26b	2024-09-19 11:35:32 +0200	[diff] [blame^]	144	=item B<--no-tokenizer>
				145
				146	Boolean flag indicating that no tokenizer should be used.
				147	This is meant to ensure that by default a final token layer always
				148	exists.
				149	If a separate tokenizer is chosen, this flag is ignored.
				150
Akron	75d6314	2021-02-23 18:40:56 +0100	[diff] [blame]	151	=item B<--skip-inline-tokens>
				152
				153	Boolean flag indicating that inline tokens should not
				154	be processed. Defaults to false (meaning inline tokens will be processed).
				155
Akron	692d17d	2021-03-05 13:21:03 +0100	[diff] [blame]	156	=item B<--skip-inline-token-annotations>
				157
				158	Boolean flag indicating that inline token annotations should not
				159	be processed. Defaults to true (meaning inline token annotations
Akron	6b1f26b	2024-09-19 11:35:32 +0200	[diff] [blame^]	160	won't be processed). Can be negated with
				161	C<--no-skip-inline-token-annotations>.
Akron	692d17d	2021-03-05 13:21:03 +0100	[diff] [blame]	162
Akron	ca70a1d	2021-02-25 16:21:31 +0100	[diff] [blame]	163	=item B<--skip-inline-tags> <tags>
Akron	54c3ff1	2021-02-25 11:33:37 +0100	[diff] [blame]	164
				165	Expects a comma-separated list of tags to be ignored when the structure
				166	is parsed. Content of these tags however will be processed.
				167
Marc Kupietz	a671ae5	2022-12-22 16:28:14 +0100	[diff] [blame]	168	=item B<--xmlid-to-textsigle> <from-regex>@<to-c/to-d/to-t>
				169
Akron	e48bec4	2023-01-05 12:18:45 +0100	[diff] [blame]	170	Expects a regular replacement expression (separated by B<@> between the
Marc Kupietz	a671ae5	2022-12-22 16:28:14 +0100	[diff] [blame]	171	search and the replacement) to convert text id attributes to text sigles
				172	with three parts (separated by B</>).
				173
				174	Example:
				175
				176	tei2korapxml \
				177	--xmlid-to-textsigle 'ICC.German\.([^.]+\.[^.]+)\.(.+)@ICCGER/$1/$2' \
				178	-tk - < t/data/icc_german_sample.p5.xml
				179
Akron	e48bec4	2023-01-05 12:18:45 +0100	[diff] [blame]	180	Converts text id C<ICC.German.DeReKo.WPD17.G11.00238> to
				181	sigle C<ICCGER/DeReKo.WPD17/G11.00238>.
Marc Kupietz	a671ae5	2022-12-22 16:28:14 +0100	[diff] [blame]	182
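The effect of the replacement expression above can be checked without running the converter. The following sketch reproduces the same search and replacement with C<sed>, which happens to accept B<@> as its delimiter; the helper name is hypothetical, and C<sed> writes back-references as C<\1> where the option uses C<$1>.

```shell
# Hypothetical helper: apply the README's search@replacement pair with sed.
# sed uses \1 and \2 where tei2korapxml's option uses $1 and $2.
xmlid_to_textsigle() {
  printf '%s\n' "$1" |
    sed -E 's@ICC.German\.([^.]+\.[^.]+)\.(.+)@ICCGER/\1/\2@'
}

xmlid_to_textsigle 'ICC.German.DeReKo.WPD17.G11.00238'
# prints: ICCGER/DeReKo.WPD17/G11.00238
```

Trying a pattern this way before a long conversion run makes it easy to confirm that the resulting sigle has exactly three parts.
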
Akron	1a5271a	2021-02-18 13:18:15 +0100	[diff] [blame]	183	=item B<--inline-tokens> <foundry>#[<file>]
				184
				185	Define the foundry and file (without extension)
				186	to store inline token information in.
Akron	8a0c4bf	2021-03-16 16:51:21 +0100	[diff] [blame]	187	Unless C<--skip-inline-token-annotations> is set,
				188	this will contain annotations as well.
Akron	1a5271a	2021-02-18 13:18:15 +0100	[diff] [blame]	189	Defaults to C<tokens> and C<morpho>.
				190
Akron	e2819a1	2021-10-12 15:52:55 +0200	[diff] [blame]	191	The inline token data will also be stored in the
				192	inline structures file (see I<--inline-structures>),
				193	unless the inline token foundry is prepended
				194	by an B<!> exclamation mark, indicating that inline
				195	tokens are stored exclusively in the inline tokens
				196	file.
				197
				198	Example:
				199
Akron	6b1f26b	2024-09-19 11:35:32 +0200	[diff] [blame^]	200	tei2korapxml --no-tokenizer --inline-tokens \
				201	'!gingko#morpho' < data.i5.xml > korapxml.zip
				202
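How such a C<< <foundry>#[<file>] >> value breaks down can be sketched in plain shell. The helper below is hypothetical and is not taken from the converter; it only mirrors the description above: an optional leading B<!>, a foundry before the C<#>, and an optional file after it, with the default file name passed in by the caller.

```shell
# Hypothetical parser for a "[!]<foundry>#[<file>]" value (illustration only).
# $1: the option value, $2: the file name to fall back to.
parse_inline_spec() {
  spec=$1
  default_file=$2
  exclusive=no
  case $spec in
    '!'*) exclusive=yes; spec=${spec#?} ;;   # strip the leading "!"
  esac
  foundry=${spec%%"#"*}                      # everything before the first "#"
  case $spec in
    *'#'*) file=${spec#*"#"} ;;              # everything after the first "#"
    *) file= ;;
  esac
  [ -n "$file" ] || file=$default_file
  printf '%s %s %s\n' "$foundry" "$file" "$exclusive"
}

parse_inline_spec '!gingko#morpho' morpho
# prints: gingko morpho yes
parse_inline_spec 'gingko#' morpho
# prints: gingko morpho no
```

The third field shows whether the data would be stored exclusively in the named file, as indicated by the exclamation mark.
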
				203	=item B<--inline-dependencies> <foundry>#[<file>]
				204
				205	Define the foundry and file (without extension)
				206	to store inline dependency information in.
				207	The layer (file) defaults to C<dependency>. The option is
				208	ignored if not set, meaning that dependency
				209	attributes are stored in the inline tokens file
				210	(unless skipped).
				211
				212	The dependency data will also be stored in the
				213	inline token file (see I<--inline-tokens>),
				214	unless the inline dependencies foundry is prepended
				215	by an B<!> exclamation mark, indicating that inline
				216	dependency data is stored exclusively in the inline
				217	dependencies file.
				218
				219	Example:
				220
				221	tei2korapxml --no-tokenizer --inline-dependencies \
				222	'gingko#dependency' < data.i5.xml > korapxml.zip
				223
Akron	e2819a1	2021-10-12 15:52:55 +0200	[diff] [blame]	224
Akron	dd0be8f	2021-02-18 19:29:41 +0100	[diff] [blame]	225	=item B<--inline-structures> <foundry>#[<file>]
				226
				227	Define the foundry and file (without extension)
				228	to store inline structure information in.
				229	Defaults to C<struct> and C<structures>.
Akron	75d6314	2021-02-23 18:40:56 +0100	[diff] [blame]	230
Akron	26a7152	2021-02-19 10:27:37 +0100	[diff] [blame]	231	=item B<--base-foundry> <foundry>
				232
				233	Define the base foundry to store newly generated
				234	token information in.
				235	Defaults to C<base>.
				236
				237	=item B<--data-file> <file>
				238
				239	Define the file (without extension)
				240	to store primary data information in.
				241	Defaults to C<data>.
				242
				243	=item B<--header-file> <file>
				244
				245	Define the file name (without extension)
				246	to store header information on
				247	the corpus, document, and text level in.
				248	Defaults to C<header>.
Akron	dd0be8f	2021-02-18 19:29:41 +0100	[diff] [blame]	249
Marc Kupietz	985da0c	2021-02-15 19:29:50 +0100	[diff] [blame]	250	=item B<--use-tokenizer-sentence-splits|-s>
				251
				252	Replace existing with, or add new, sentence boundary information
Akron	1148478	2021-11-03 20:12:14 +0100	[diff] [blame]	253	provided by the tokenizer.
				254	Currently KorAP-tokenizer and certain external tokenizers support
				255	these boundaries.
Marc Kupietz	985da0c	2021-02-15 19:29:50 +0100	[diff] [blame]	256
Akron	91705d7	2021-02-19 10:59:45 +0100	[diff] [blame]	257	=item B<--tokens-file> <file>
				258
				259	Define the file (without extension)
				260	to store generated token information in
				261	(either from the KorAP tokenizer or an externally called tokenizer).
				262	Defaults to C<tokens>.
				263
Akron	0c41ab3	2020-09-29 07:33:33 +0200	[diff] [blame]	264	=item B<--log|-l>
				265
				266	Loglevel for I<Log::Any>. Defaults to C<notice>.
				267
				268	=back
				269
Akron	b364947	2020-09-29 08:24:46 +0200	[diff] [blame]	270	=head1 ENVIRONMENT VARIABLES
				271
				272	=over 2
				273
				274	=item B<KORAPXMLTEI_DEBUG>
				275
				276	Activate minimal debugging.
				277	Defaults to C<false>.
				278
Akron	b364947	2020-09-29 08:24:46 +0200	[diff] [blame]	279	=back
				280
Akron	0c41ab3	2020-09-29 07:33:33 +0200	[diff] [blame]	281	=head1 COPYRIGHT AND LICENSE
				282
Marc Kupietz	8456675	2024-01-11 14:37:11 +0100	[diff] [blame]	283	Copyright (C) 2021-2024, L<IDS Mannheim|https://www.ids-mannheim.de/>
Akron	0c41ab3	2020-09-29 07:33:33 +0200	[diff] [blame]	284
				285	Author: Peter Harders
				286
Akron	aabd095	2020-09-29 07:35:08 +0200	[diff] [blame]	287	Contributors: Nils Diewald, Marc Kupietz, Carsten Schnober
Akron	0c41ab3	2020-09-29 07:33:33 +0200	[diff] [blame]	288
				289	L<KorAP::XML::TEI> is developed as part of the L<KorAP|https://korap.ids-mannheim.de/>
				290	Corpus Analysis Platform at the
Akron	d72baca	2021-07-23 13:25:32 +0200	[diff] [blame]	291	L<Leibniz Institute for the German Language (IDS)|https://www.ids-mannheim.de/>,
Akron	0c41ab3	2020-09-29 07:33:33 +0200	[diff] [blame]	292	member of the
				293	L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/>.
				294
				295	This program is free software published under the
Marc Kupietz	e955ecc	2021-02-17 17:42:01 +0100	[diff] [blame]	296	L<BSD-2 License|https://opensource.org/licenses/BSD-2-Clause>.
Akron	0c41ab3	2020-09-29 07:33:33 +0200	[diff] [blame]	297
Akron	692d17d	2021-03-05 13:21:03 +0100	[diff] [blame]	298	=cut