=pod

=encoding utf8

=head1 NAME

tei2korapxml - Conversion of TEI P5 based formats to KorAP-XML

=head1 SYNOPSIS

  cat corpus.i5.xml | tei2korapxml > corpus.korapxml.zip

=head1 DESCRIPTION

C<tei2korapxml> is a script to convert TEI P5 and
L<I5|https://www.ids-mannheim.de/digspra/kl/projekte/korpora/textmodell>
based documents to the
L<KorAP-XML format|https://github.com/KorAP/KorAP-XML-Krill#about-korap-xml>.
If no specific input is defined, data is
read from C<STDIN>. If no specific output is defined, data is written
to C<STDOUT>.

This program is usually called from inside another script.

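Because the converter reads C<STDIN> and writes C<STDOUT>, a calling script
only needs to wire up the two streams. The shell function below is a minimal
sketch of such a caller and is not part of the distribution: the function name
C<convert_corpus> and the C<TEI2KORAPXML> override variable are hypothetical
and exist only for this illustration.

```shell
# Hypothetical wrapper: convert one I5/TEI P5 file into a KorAP-XML zip.
# TEI2KORAPXML is an assumed override variable, not something the tool reads.
convert_corpus() {
  input=$1    # path to the I5/TEI P5 source file
  output=$2   # path of the zip archive to create

  # Fail early with a clear message if the input file cannot be read.
  if [ ! -r "$input" ]; then
    echo "cannot read $input" >&2
    return 1
  fi

  # The converter reads STDIN and writes STDOUT, so redirect both streams.
  "${TEI2KORAPXML:-tei2korapxml}" < "$input" > "$output"
}
```

A call such as C<convert_corpus corpus.i5.xml corpus.korapxml.zip> then
behaves like the pipeline shown in the synopsis.
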
=head1 FORMATS

=head2 Input restrictions

=over 2

=item

TEI P5 formatted input with certain restrictions:

=over 4

=item

B<mandatory>: text-header with integrated textsigle, text-body

=item

B<optional>: corp-header with integrated corpsigle,
doc-header with integrated docsigle

=back

=item

Tokens inside the primary text must not be separated
by newlines. Newlines are removed
(see L<KorAP::XML::TEI::Data>), and converting newlines
into blanks between two tokens could introduce additional blanks
where there should be none (e.g. punctuation characters like C<,> or
C<.> should not be separated from their predecessor token).
See also the code section C<~ whitespace handling ~> in C<script/tei2korapxml>.

=item

Header types, like C<E<lt>idsHeader [...] type="document" [...] E<gt>>
need to be defined in the same line as the header tag.

=back

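The newline restriction can be illustrated with plain shell tools. This is
only a sketch of the effect, using C<tr> on a made-up sample string; it does
not reproduce the actual whitespace handling inside C<tei2korapxml>.

```shell
# Made-up primary text with one token per line (the layout the restriction rules out).
text=$(printf 'Hello\n,\nworld\n.')

# Removing the newlines glues neighbouring tokens together.
removed=$(printf '%s' "$text" | tr -d '\n')

# Converting newlines into blanks puts blanks in front of punctuation.
blanked=$(printf '%s' "$text" | tr '\n' ' ')

printf '%s\n' "$removed"   # Hello,world.
printf '%s\n' "$blanked"   # Hello , world .
```

Neither result matches the intended text C<Hello, world.>, so the input should
carry only the whitespace that belongs to the text itself.
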
=head2 Notes on the output

=over 2

=item

ZIP file output (written to C<STDOUT> by default) with UTF-8 encoded entries
(which together form the KorAP-XML format)

=back

=head1 INSTALLATION

C<tei2korapxml> requires L<libxml2-dev> bindings and L<File::ShareDir::Install> to be installed.
When these requirements are met, the preferred way to install the script is
to use L<cpanm|App::cpanminus>.

  $ cpanm https://github.com/KorAP/KorAP-XML-TEI.git

In case everything went well, the C<tei2korapxml> tool will
be available on your command line immediately.

Minimum requirement for L<KorAP::XML::TEI> is Perl 5.16.

=head1 OPTIONS

=over 2

=item B<--root|-r>

The root directory for output. Defaults to C<.>.

=item B<--help|-h>

Print help information.

=item B<--version|-v>

Print version information.

=item B<--tokenizer-call|-tc>

Call an external tokenizer process that tokenizes
a single line from STDIN and outputs one token per line.

=item B<--tokenizer-korap|-tk>

Use the standard KorAP/DeReKo tokenizer.

=item B<--tokenizer-internal|-ti>

Tokenize the data using two embedded tokenizers
that take an I<aggressive> and a I<conservative>
approach.

=item B<--skip-inline-tokens>

Boolean flag indicating that inline tokens should not
be processed. Defaults to false (meaning inline tokens will be processed).

=item B<--skip-inline-token-annotations>

Boolean flag indicating that inline token annotations should not
be processed. Defaults to true (meaning inline token annotations
won't be processed).

=item B<--skip-inline-tags> <tags>

Expects a comma-separated list of tags to be ignored when the structure
is parsed. The content of these tags will still be processed.

=item B<--inline-tokens> <foundry>#[<file>]

Define the foundry and file (without extension)
to store inline token information in.
Unless C<--skip-inline-token-annotations> is set,
this will contain annotations as well.
Defaults to C<tokens> and C<morpho>.

The inline token data will also be stored in the
inline structures file (see C<--inline-structures>),
unless the inline token foundry is prefixed
with an exclamation mark (C<!>), indicating that inline
tokens are stored exclusively in the inline tokens
file.

Example:

  tei2korapxml --inline-tokens '!gingko#morpho' < data.i5.xml > korapxml.zip

=item B<--inline-structures> <foundry>#[<file>]

Define the foundry and file (without extension)
to store inline structure information in.
Defaults to C<struct> and C<structures>.

=item B<--base-foundry> <foundry>

Define the base foundry to store newly generated
token information in.
Defaults to C<base>.

=item B<--data-file> <file>

Define the file (without extension)
to store primary data information in.
Defaults to C<data>.

=item B<--header-file> <file>

Define the file name (without extension)
to store header information on
the corpus, document, and text level in.
Defaults to C<header>.

=item B<--use-tokenizer-sentence-splits|-s>

Replace existing sentence boundary information with, or add new
sentence boundary information from, the KorAP tokenizer
(currently the only tokenizer supported for this option).

=item B<--tokens-file> <file>

Define the file (without extension)
to store generated token information in
(either from the KorAP tokenizer or an externally called tokenizer).
Defaults to C<tokens>.

=item B<--log|-l>

Log level for I<Log::Any>. Defaults to C<notice>.

=back

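The C<< <foundry>#[<file>] >> values accepted by C<--inline-tokens> and
C<--inline-structures> can be read as an optional C<!> marker, a foundry name,
and an optional file name. The shell function below is a hypothetical sketch
of that reading, not the parser used by C<tei2korapxml>: the function name
C<split_spec> is made up, and falling back to the documented
C<--inline-tokens> defaults (C<tokens> and C<morpho>) for empty parts is an
assumption.

```shell
# Hypothetical helper: split an --inline-tokens value into its parts.
# Sets the variables exclusive, foundry and file.
split_spec() {
  spec=$1
  exclusive=false

  # A leading "!" marks the inline tokens file as the exclusive target.
  case $spec in
    '!'*) exclusive=true; spec=${spec#?} ;;
  esac

  # Everything before the first "#" is the foundry, the rest is the file.
  case $spec in
    *'#'*) foundry=${spec%%#*}; file=${spec#*#} ;;
    *)     foundry=$spec;       file= ;;
  esac

  # Assumption: empty parts fall back to the documented defaults.
  foundry=${foundry:-tokens}
  file=${file:-morpho}
}

split_spec '!gingko#morpho'
echo "$exclusive $foundry $file"   # true gingko morpho
```

Applied to the value from the C<--inline-tokens> example, the sketch reports
an exclusive inline tokens file named C<morpho> in the foundry C<gingko>.
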
=head1 ENVIRONMENT VARIABLES

=over 2

=item B<KORAPXMLTEI_DEBUG>

Activate minimal debugging.
Defaults to C<false>.

=back

=head1 COPYRIGHT AND LICENSE

Copyright (C) 2021, L<IDS Mannheim|https://www.ids-mannheim.de/>

Author: Peter Harders

Contributors: Nils Diewald, Marc Kupietz, Carsten Schnober

L<KorAP::XML::TEI> is developed as part of the L<KorAP|https://korap.ids-mannheim.de/>
Corpus Analysis Platform at the
L<Leibniz Institute for the German Language (IDS)|https://www.ids-mannheim.de/>,
member of the
L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/>.

This program is free software published under the
L<BSD-2 License|https://opensource.org/licenses/BSD-2-Clause>.

=cut