=pod

=encoding utf8

=head1 NAME

tei2korapxml - Conversion of TEI P5 based formats to KorAP-XML

=head1 SYNOPSIS

  cat corpus.i5.xml | tei2korapxml - > corpus.korapxml.zip

=head1 DESCRIPTION

C<tei2korapxml> is a script to convert TEI P5 and
L<I5|https://www.ids-mannheim.de/digspra/kl/projekte/korpora/textmodell>
based documents to the
L<KorAP-XML format|https://github.com/KorAP/KorAP-XML-Krill#about-korap-xml>.

This program is usually called from inside another script.

=head1 FORMATS

=head2 Input restrictions

=over 2

=item

TEI P5 formatted input with certain restrictions:

=over 4

=item

B<mandatory>: text-header with integrated textsigle
(or convertible identifier), text-body

=item

B<optional>: corp-header with integrated corpsigle,
doc-header with integrated docsigle

=back

=item

Tokens inside the primary text must not be
newline separated, because newlines are removed
(see L<KorAP::XML::TEI::Data>), and converting newlines
into blanks between two tokens could introduce additional blanks
where there should be none (e.g. punctuation characters like C<,> or
C<.> should not be separated from their predecessor token).
See also the code section C<~ whitespace handling ~> in C<script/tei2korapxml>.

=item

Header types, like C<E<lt>idsHeader [...] type="document" [...] E<gt>>,
need to be defined in the same line as the header tag.

=back
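
The newline restriction above can be illustrated with plain shell tools.
This is a minimal sketch that does not call C<tei2korapxml>; the sample text
and both helper functions are hypothetical and only mimic the two
treatments of newlines that the restriction describes.

```shell
# Hypothetical sample: one token per line, which the restriction advises against.
sample='Hello
,
world
.'

# Removing newlines glues neighbouring tokens together.
remove_newlines() { printf '%s' "$1" | tr -d '\n'; }

# Converting newlines to blanks puts a blank in front of punctuation.
newlines_to_blanks() { printf '%s' "$1" | tr '\n' ' '; }

remove_newlines "$sample"; echo      # prints Hello,world.
newlines_to_blanks "$sample"; echo   # prints Hello , world .
```

Neither result matches the intended C<Hello, world.>, which is why tokens
should not be spread over separate lines in the input.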

=head2 Notes on the output

=over 2

=item

The output is a zip file (written to C<stdout> by default) with
UTF-8 encoded entries, which together form the KorAP-XML format.

=back

=head1 INSTALLATION

C<tei2korapxml> requires C<libxml2-dev> bindings and L<File::ShareDir::Install> to be installed.
When these requirements are met, the preferred way to install the script is
to use L<cpanm|App::cpanminus>.

  $ cpanm https://github.com/KorAP/KorAP-XML-TEI.git

If the installation succeeds, the C<tei2korapxml> tool will
be available on your command line immediately.

Minimum requirement for L<KorAP::XML::TEI> is Perl 5.16.

=head1 OPTIONS

=over 2

=item B<--input|-i>

The input file to process. If no specific input is defined and a single
dash C<-> is passed as an argument, data is read from C<STDIN>.

=item B<--root|-r>

The root directory for output. Defaults to C<.>.

=item B<--help|-h>

Print help information.

=item B<--version|-v>

Print version information.

=item B<--tokenizer-korap|-tk>

Use the standard KorAP/DeReKo tokenizer.

=item B<--tokenizer-internal|-ti>

Tokenize the data using two embedded tokenizers
that take an I<aggressive> and a I<conservative>
approach.

=item B<--tokenizer-call|-tc>

Call an external tokenizer process that tokenizes
from C<STDIN> and outputs the offsets of all tokens.

Texts are separated using C<\x04\n>. The external process
should add a new line per text.

If the L</--use-tokenizer-sentence-splits> option is activated,
sentences are marked by offset as well, on new lines.

To use L<Datok|https://github.com/KorAP/Datok> including sentence
splitting, call C<tei2korapxml> as follows:

  $ cat corpus.i5.xml | tei2korapxml -s \
      -tc 'datok tokenize \
      -t ./tokenizer.matok \
      -p --newline-after-eot --no-sentences \
      --no-tokens --sentence-positions -' - \
      > corpus.korapxml.zip
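
The input side of this exchange can be sketched in shell. The following
is a minimal illustration of how a stream of texts separated by
C<\x04\n> can be split into individual texts; it is not a tokenizer and
does not produce offsets. The function name C<split_texts> and the sample
input are hypothetical.

```shell
# Hypothetical helper: list the texts found in a stream in which every
# text is terminated by \x04 (EOT) followed by a newline.
split_texts() {
  awk 'BEGIN { RS = "\004" }   # EOT ends a record
       { sub(/^\n/, "") }      # drop the newline that followed the previous EOT
       length($0) { n++; printf "text %d: %s\n", n, $0 }'
}

# Two sample texts, each terminated by \x04 and a newline.
printf 'Der alte Mann.\004\nEr ging.\004\n' | split_texts
```

A real external tokenizer would read the same stream and answer with one
line of token offsets per text, as described above.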

=item B<--skip-inline-tokens>

Boolean flag indicating that inline tokens should not
be processed. Defaults to false (meaning inline tokens will be processed).

=item B<--skip-inline-token-annotations>

Boolean flag indicating that inline token annotations should not
be processed. Defaults to true (meaning inline token annotations
won't be processed).

=item B<--skip-inline-tags> <tags>

Expects a comma-separated list of tags to be ignored when the structure
is parsed. The content of these tags, however, will still be processed.

=item B<--xmlid-to-textsigle> <from-regex>@<to-c/to-d/to-t>

Expects a regular replacement expression (separated by B<@> between the
search and the replacement) to convert text id attributes to text sigles
with three parts (separated by B</>).

Example:

  tei2korapxml \
    --xmlid-to-textsigle 'ICC.German\.([^.]+\.[^.]+)\.(.+)@ICCGER/$1/$2' \
    -tk - < t/data/icc_german_sample.p5.xml

Converts text id C<ICC.German.DeReKo.WPD17.G11.00238> to
sigle C<ICCGER/DeReKo.WPD17/G11.00238>.
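
To try out a replacement expression before running a conversion, the same
substitution can be reproduced with C<sed>. This is a minimal sketch, not
the code path used by C<tei2korapxml>; the function name C<id_to_sigle> is
hypothetical, and C<sed> writes capture groups as C<\1> and C<\2> rather
than C<$1> and C<$2>.

```shell
# Hypothetical helper mirroring the example above: rewrite a text id
# into a three-part sigle, using @ as the delimiter as in the option value.
id_to_sigle() {
  printf '%s\n' "$1" |
    sed -E 's@ICC.German\.([^.]+\.[^.]+)\.(.+)@ICCGER/\1/\2@'
}

id_to_sigle 'ICC.German.DeReKo.WPD17.G11.00238'
# prints ICCGER/DeReKo.WPD17/G11.00238
```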

=item B<--inline-tokens> <foundry>#[<file>]

Define the foundry and file (without extension)
to store inline token information in.
Unless C<--skip-inline-token-annotations> is set,
this will contain annotations as well.
Defaults to C<tokens> and C<morpho>.

The inline token data will also be stored in the
inline structures file (see I<--inline-structures>),
unless the inline token foundry is prefixed
with an exclamation mark (B<!>), indicating that inline
tokens are stored exclusively in the inline tokens
file.

Example:

  tei2korapxml --inline-tokens '!gingko#morpho' < data.i5.xml > korapxml.zip
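
The shape of the option value can be made explicit with a small shell
sketch. It only shows how a value of the form C<[!]E<lt>foundryE<gt>#[E<lt>fileE<gt>]>
breaks down into its parts; the function name C<parse_foundry_spec> and
its second argument (the file name to fall back on) are hypothetical and
not taken from the option handling of C<tei2korapxml>.

```shell
# Hypothetical helper: take apart a value of the form [!]<foundry>#[<file>].
# $1 is the option value, $2 the file name to use when none is given.
parse_foundry_spec() {
  spec=$1
  default_file=$2
  exclusive=no
  case $spec in
    '!'*) exclusive=yes; spec=${spec#?} ;;   # a leading ! marks exclusive storage
  esac
  foundry=${spec%%'#'*}                      # part before the first #
  case $spec in
    *'#'*) file=${spec#*'#'} ;;              # part after the first #
    *)     file= ;;
  esac
  printf 'foundry=%s file=%s exclusive=%s\n' \
    "$foundry" "${file:-$default_file}" "$exclusive"
}

parse_foundry_spec '!gingko#morpho' morpho
# prints foundry=gingko file=morpho exclusive=yes
```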

=item B<--inline-structures> <foundry>#[<file>]

Define the foundry and file (without extension)
to store inline structure information in.
Defaults to C<struct> and C<structures>.

=item B<--base-foundry> <foundry>

Define the base foundry to store newly generated
token information in.
Defaults to C<base>.

=item B<--data-file> <file>

Define the file (without extension)
to store primary data information in.
Defaults to C<data>.

=item B<--header-file> <file>

Define the file name (without extension)
to store header information on
the corpus, document, and text level in.
Defaults to C<header>.

=item B<--use-tokenizer-sentence-splits|-s>

Use the sentence boundaries provided by the tokenizer,
replacing existing sentence boundary information or adding it
where none exists.
Currently the KorAP tokenizer and certain external tokenizers support
these boundaries.

=item B<--tokens-file> <file>

Define the file (without extension)
to store generated token information in
(either from the KorAP tokenizer or an externally called tokenizer).
Defaults to C<tokens>.

=item B<--log|-l>

Log level for I<Log::Any>. Defaults to C<notice>.

=back

=head1 ENVIRONMENT VARIABLES

=over 2

=item B<KORAPXMLTEI_DEBUG>

Activate minimal debugging.
Defaults to C<false>.

=back

=head1 COPYRIGHT AND LICENSE

Copyright (C) 2021-2023, L<IDS Mannheim|https://www.ids-mannheim.de/>

Author: Peter Harders

Contributors: Nils Diewald, Marc Kupietz, Carsten Schnober

L<KorAP::XML::TEI> is developed as part of the L<KorAP|https://korap.ids-mannheim.de/>
Corpus Analysis Platform at the
L<Leibniz Institute for the German Language (IDS)|https://www.ids-mannheim.de/>,
member of the
L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/>.

This program is free software published under the
L<BSD-2 License|https://opensource.org/licenses/BSD-2-Clause>.

=cut