=pod

=encoding utf8

=head1 NAME

tei2korapxml - Conversion of TEI P5 based formats to KorAP-XML

=head1 SYNOPSIS

  cat corpus.i5.xml | tei2korapxml > corpus.korapxml.zip

=head1 DESCRIPTION

C<tei2korapxml> is a script to convert TEI P5 and
L<I5|https://www1.ids-mannheim.de/kl/projekte/korpora/textmodell.html>
based documents to the
L<KorAP-XML format|https://github.com/KorAP/KorAP-XML-Krill#about-korap-xml>.
If no specific input is defined, data is
read from C<STDIN>. If no specific output is defined, data is written
to C<STDOUT>.

This program is usually called from inside another script.

=head1 FORMATS

=head2 Input restrictions

=over 2

=item

TEI P5 formatted input with certain restrictions:

=over 4

=item

B<mandatory>: text-header with integrated textsigle, text-body

=item

B<optional>: corp-header with integrated corpsigle,
doc-header with integrated docsigle

=back

=item

Tokens inside the primary text must not be separated by
newlines, because newlines are removed
(see L<KorAP::XML::TEI::Data>), and converting newlines
into blanks between two tokens could introduce additional blanks
where there should be none (e.g. punctuation characters like C<,> or
C<.> should not be separated from their predecessor token).
See also the code section C<~ whitespace handling ~>.

=back

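The effect described in the last restriction can be reproduced with
standard shell tools. The following is only an illustrative sketch: it
uses C<tr> on a made-up four-token input and does not reflect how
L<KorAP::XML::TEI::Data> is actually implemented.

```shell
# Made-up input: four tokens, each separated by a newline.
input=$(printf 'Hello\n,\nworld\n.')

# Converting newlines to blanks puts a blank in front of "," and ".".
as_blanks=$(printf '%s' "$input" | tr '\n' ' ')

# Removing newlines avoids those blanks but fuses neighbouring tokens.
removed=$(printf '%s' "$input" | tr -d '\n')

printf '%s\n' "$as_blanks"   # Hello , world .
printf '%s\n' "$removed"     # Hello,world.
```

Neither result matches the intended text C<Hello, world.>, which is why
tokens in the primary text should share a line instead of being
separated by newlines.
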
=head2 Notes on the output

=over 2

=item

The output is a zip file (written to C<STDOUT> by default) with
UTF-8 encoded entries, which together form the KorAP-XML format.

=back

=head1 INSTALLATION

C<tei2korapxml> requires L<libxml2-dev> bindings to build. When
these bindings are available, the preferred way to install the script is
to use L<cpanm|App::cpanminus>.

  $ cpanm https://github.com/KorAP/KorAP-XML-TEI.git

If everything went well, the C<tei2korapxml> tool will
be available on your command line immediately.

The minimum requirement for L<KorAP::XML::TEI> is Perl 5.16.

=head1 OPTIONS

=over 2

=item B<--root|-r>

The root directory for output. Defaults to C<.>.

=item B<--help|-h>

Print help information.

=item B<--version|-v>

Print version information.

=item B<--tokenizer-call|-tc>

Call an external tokenizer process that tokenizes
a single line from C<STDIN> and outputs one token per line.

=item B<--tokenizer-korap|-tk>

Use the standard KorAP/DeReKo tokenizer.

=item B<--tokenizer-internal|-ti>

Tokenize the data using two embedded tokenizers
that take an I<aggressive> and a I<conservative>
approach.

=item B<--inline-tokens> <foundry>#[<file>]

Define the foundry and file (without extension)
to store inline token information in.
If L</KORAPXMLTEI_INLINE> is set, this will contain
annotations as well.
Defaults to C<tokens> and C<morpho>.

=item B<--use-tokenizer-sentence-splits|-s>

Replace existing sentence boundary information with, or add new,
sentence boundary information provided by the KorAP tokenizer
(currently the only tokenizer supported for this).

=item B<--log|-l>

Log level for I<Log::Any>. Defaults to C<notice>.

=back

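As an illustration of the contract described for B<--tokenizer-call>
(one line in, one token per line out), the following sketch implements
a plain whitespace tokenizer in POSIX shell. The function name
C<tokenize> is hypothetical, and this README does not specify how the
command is passed to the option or whether the real interface expects
anything beyond this contract, so treat it as a starting point only.

```shell
# Hypothetical whitespace tokenizer: reads lines from STDIN and
# prints one token per output line.
tokenize() {
  while IFS= read -r line || [ -n "$line" ]; do
    # Disable globbing so tokens such as "*" are not expanded to file names.
    set -f
    # The unquoted expansion splits the line on whitespace.
    for token in $line; do
      printf '%s\n' "$token"
    done
    set +f
  done
}

result=$(printf 'Ein kleiner Test .\n' | tokenize)
printf '%s\n' "$result"
```

A whitespace split like this only works if punctuation is already set
apart by blanks in the input; a real tokenizer would have to detach it
itself.
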
=head1 ENVIRONMENT VARIABLES

=over 2

=item B<KORAPXMLTEI_DEBUG>

Activate minimal debugging.
Defaults to C<false>.

=item B<KORAPXMLTEI_INLINE>

Process inline annotations, if present.
Defaults to C<false>.

=back

=head1 COPYRIGHT AND LICENSE

Copyright (C) 2021, L<IDS Mannheim|https://www.ids-mannheim.de/>

Author: Peter Harders

Contributors: Nils Diewald, Marc Kupietz, Carsten Schnober

L<KorAP::XML::TEI> is developed as part of the L<KorAP|https://korap.ids-mannheim.de/>
Corpus Analysis Platform at the
L<Leibniz Institute for the German Language (IDS)|http://ids-mannheim.de/>,
member of the
L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/>.

This program is free software published under the
L<BSD-2 License|https://opensource.org/licenses/BSD-2-Clause>.

=cut