 =pod

 =encoding utf8

 =head1 NAME

 tei2korapxml - Conversion of TEI P5 based formats to KorAP-XML

 =head1 SYNOPSIS

   cat corpus.i5.xml | tei2korapxml - > corpus.korapxml.zip

 =head1 DESCRIPTION

 C<tei2korapxml> is a script to convert TEI P5 and
 L<I5|https://www.ids-mannheim.de/digspra/kl/projekte/korpora/textmodell>
 based documents to the
 L<KorAP-XML format|https://github.com/KorAP/KorAP-XML-Krill#about-korap-xml>.

 This program is usually called from inside another script.
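
 A minimal sketch of such a calling script, assuming a directory of C<*.i5.xml> files and using only the C<-> input and C<stdout> output shown in the synopsis. The function name C<convert_dir> and the C<TEI2KORAPXML> override variable are hypothetical and not part of the tool:

```shell
#!/bin/sh
# Convert every *.i5.xml file in a directory to a KorAP-XML zip archive.
# TEI2KORAPXML is a hypothetical override for the converter command,
# which makes the wrapper easy to try out with a stand-in program.
convert_dir() {
  dir=$1
  tool=${TEI2KORAPXML:-tei2korapxml}
  for input in "$dir"/*.i5.xml; do
    # Skip the unexpanded pattern if the directory has no matching files.
    [ -e "$input" ] || continue
    output="${input%.i5.xml}.korapxml.zip"
    # Read the corpus from STDIN ("-") and write the zip to STDOUT.
    "$tool" - < "$input" > "$output" || return 1
  done
}

# Example call (commented out so that sourcing this file has no side effects):
# convert_dir ./corpora
```

 The wrapper stops at the first failing conversion, so a broken input file does not go unnoticed in a batch run.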

 =head1 FORMATS

 =head2 Input restrictions

 =over 2

 =item

 TEI P5 formatted input with certain restrictions:

 =over 4

 =item

 B<mandatory>: text-header with integrated textsigle
 (or convertible identifier), text-body

 =item

 B<optional>: corp-header with integrated corpsigle,
 doc-header with integrated docsigle

 =back

 =item

 Tokens inside the primary text must not be separated by
 newlines, because newlines are removed
 (see L<KorAP::XML::TEI::Data>). Converting newlines
 into blanks between two tokens instead could introduce blanks
 where there should be none (e.g. punctuation characters like C<,> or
 C<.> should not be separated from their preceding token).
 See also the code section C<~ whitespace handling ~> in C<script/tei2korapxml>.

 =item

 Header types, like C<E<lt>idsHeader [...] type="document" [...] E<gt>>,
 need to be defined on the same line as the header tag.

 =back
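
 The last restriction can be checked before conversion. The following is a rough sketch, assuming the headers use the C<idsHeader> tag from the example above. The function name C<check_header_types> is hypothetical, and the check is a plain text heuristic, not an XML parser:

```shell
#!/bin/sh
# Report every line that opens an idsHeader tag without a type attribute
# on the same line. Returns 0 if no such line exists, 1 otherwise.
check_header_types() {
  # "|| true" keeps the pipeline from failing when nothing is found.
  offending=$(grep -n '<idsHeader' "$1" | grep -v 'type=' || true)
  if [ -n "$offending" ]; then
    printf '%s\n' "$offending"
    return 1
  fi
  return 0
}

# Example call (commented out):
# check_header_types corpus.i5.xml || echo "header type on a separate line" >&2
```

 Offending lines are printed with their line numbers, so they can be fixed in the input before C<tei2korapxml> is called.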

 =head2 Notes on the output

 =over 2

 =item

 ZIP file output (by default on C<stdout>) with UTF-8 encoded entries
 (which together form the KorAP-XML format)

 =back

 =head1 INSTALLATION

 C<tei2korapxml> requires C<libxml2-dev> bindings and L<File::ShareDir::Install> to be installed.
 When these requirements are met, the preferred way to install the script is
 to use L<cpanm|App::cpanminus>.

   $ cpanm https://github.com/KorAP/KorAP-XML-TEI.git

 If everything went well, the C<tei2korapxml> tool will
 be available on your command line immediately.

 Minimum requirement for L<KorAP::XML::TEI> is Perl 5.36.

 =head1 OPTIONS

 =over 2

 =item B<--input|-i>

 The input file to process. If no specific input is defined and a single
 dash C<-> is passed as an argument, data is read from C<STDIN>.


 =item B<--root|-r>

 The root directory for output. Defaults to C<.>.

 =item B<--help|-h>

 Print help information.

 =item B<--version|-v>

 Print version information.

 =item B<--tokenizer-korap|-tk>

 Use the standard KorAP/DeReKo tokenizer.

 =item B<--tokenizer-internal|-ti>

 Tokenize the data using two embedded tokenizers
 that take an I<aggressive> and a I<conservative>
 approach.

 =item B<--tokenizer-call|-tc>

 Call an external tokenizer process that tokenizes
 from STDIN and outputs the offsets of all tokens.

 Texts are separated using C<\x04\n>. The external process
 should output one line per text.

 If the L</--use-tokenizer-sentence-splits> option is activated,
 sentences are marked by offset as well in new lines.
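
 When debugging an external tokenizer, it can help to know how many texts a stream contains. The following sketch counts the C<\x04> markers in such a stream. It assumes that every text, including the last one, is followed by the marker. The function name C<count_texts> is hypothetical:

```shell
#!/bin/sh
# Count the end-of-text markers (EOT, octal 004) in the data on STDIN.
count_texts() {
  # Delete every byte except EOT, then count the remaining bytes.
  # The final tr strips the padding some wc implementations add.
  tr -cd '\004' | wc -c | tr -d ' '
}

# Example with two short texts, each followed by the marker and a newline:
# printf 'Ein Text.\004\nNoch ein Text.\004\n' | count_texts
```

 Without sentence splitting, the number reported here should match the number of lines the external process emits.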

 To use L<Datok|https://github.com/KorAP/Datok> including sentence
 splitting, call C<tei2korapxml> as follows:

   $ cat corpus.i5.xml | tei2korapxml -s \
       -tc 'datok tokenize \
            -t ./tokenizer.matok \
            -p --newline-after-eot --no-sentences \
            --no-tokens --sentence-positions -' - \
            > corpus.korapxml.zip

 =item B<--skip-inline-tokens>

 Boolean flag indicating that inline tokens should not
 be processed. Defaults to false (meaning inline tokens will be processed).

 =item B<--skip-inline-token-annotations>

 Boolean flag indicating that inline token annotations should not
 be processed. Defaults to true (meaning inline token annotations
 won't be processed).

 =item B<--skip-inline-tags> <tags>

 Expects a comma-separated list of tags to be ignored when the structure
 is parsed. The content of these tags, however, will still be processed.

 =item B<--xmlid-to-textsigle> <from-regex>@<to-c/to-d/to-t>

 Expects a regular replacement expression (separated by B<@> between the
 search and the replacement) to convert text id attributes to text sigles
 with three parts (separated by B</>).

 Example:

   tei2korapxml  \
     --xmlid-to-textsigle 'ICC.German\.([^.]+\.[^.]+)\.(.+)@ICCGER/$1/$2' \
     -tk - < t/data/icc_german_sample.p5.xml

 Converts text id C<ICC.German.DeReKo.WPD17.G11.00238> to
 sigle C<ICCGER/DeReKo.WPD17/G11.00238>.
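
 The effect of such an expression can be tried out without running a conversion. The sketch below applies the same search and replacement with C<sed>, which writes back-references as C<\1> and C<\2> where the option uses C<$1> and C<$2>. The function name C<xmlid_to_textsigle> is hypothetical and only mirrors the option name:

```shell
#!/bin/sh
# Apply the search/replacement pair from the example above to a single id.
# "@" serves as the sed delimiter, just as it separates both parts in the option.
xmlid_to_textsigle() {
  printf '%s\n' "$1" |
    sed -E 's@ICC.German\.([^.]+\.[^.]+)\.(.+)@ICCGER/\1/\2@'
}

xmlid_to_textsigle 'ICC.German.DeReKo.WPD17.G11.00238'
# prints: ICCGER/DeReKo.WPD17/G11.00238
```

 The first group captures the two dot-separated parts that form the document level of the sigle, and the second group takes the rest of the id as the text level.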

 =item B<--inline-tokens> <foundry>#[<file>]

 Define the foundry and file (without extension)
 to store inline token information in.
 Unless C<--skip-inline-token-annotations> is set,
 this will contain annotations as well.
 Defaults to C<tokens> and C<morpho>.

 The inline token data will also be stored in the
 inline structures file (see I<--inline-structures>),
 unless the inline token foundry is prepended
 by an B<!> exclamation mark, indicating that inline
 tokens are stored exclusively in the inline tokens
 file.

 Example:

   tei2korapxml --inline-tokens '!gingko#morpho' < data.i5.xml > korapxml.zip
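
 To make the C<E<lt>foundryE<gt>#[E<lt>fileE<gt>]> syntax concrete, the following sketch splits such a value into its parts. It only illustrates the notation described above and is not the parsing code of C<tei2korapxml>. The function name C<parse_inline_spec> is hypothetical:

```shell
#!/bin/sh
# Split a value like '!gingko#morpho' into foundry, file and exclusive flag.
parse_inline_spec() {
  spec=$1
  exclusive=no
  case $spec in
    '!'*)
      # A leading "!" marks exclusive storage in the inline tokens file.
      exclusive=yes
      spec=${spec#'!'}
      ;;
  esac
  # Everything before the first "#" is the foundry.
  foundry=${spec%%'#'*}
  # The file part is optional; leave it empty if it is missing.
  case $spec in
    *'#'*) file=${spec#*'#'} ;;
    *)     file= ;;
  esac
  printf 'foundry=%s file=%s exclusive=%s\n' "$foundry" "$file" "$exclusive"
}

parse_inline_spec '!gingko#morpho'
# prints: foundry=gingko file=morpho exclusive=yes
```

 The single quotes around the value matter on the command line, because an unquoted C<!> can trigger history expansion in interactive shells.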

 =item B<--inline-structures> <foundry>#[<file>]

 Define the foundry and file (without extension)
 to store inline structure information in.
 Defaults to C<struct> and C<structures>.

 =item B<--base-foundry> <foundry>

 Define the base foundry to store newly generated
 token information in.
 Defaults to C<base>.

 =item B<--data-file> <file>

 Define the file (without extension)
 to store primary data information in.
 Defaults to C<data>.

 =item B<--header-file> <file>

 Define the file name (without extension)
 to store header information on
 the corpus, document, and text level in.
 Defaults to C<header>.

 =item B<--use-tokenizer-sentence-splits|-s>

 Replace existing sentence boundary information with, or add,
 the sentence boundaries provided by the tokenizer.
 Currently the KorAP tokenizer and certain external tokenizers support
 these boundaries.

 =item B<--tokens-file> <file>

 Define the file (without extension)
 to store generated token information in
 (either from the KorAP tokenizer or an externally called tokenizer).
 Defaults to C<tokens>.

 =item B<--log|-l>

 Log level for I<Log::Any>. Defaults to C<notice>.

 =back

 =head1 ENVIRONMENT VARIABLES

 =over 2

 =item B<KORAPXMLTEI_DEBUG>

 Activate minimal debugging.
 Defaults to C<false>.

 =back

 =head1 COPYRIGHT AND LICENSE

 Copyright (C) 2021-2024, L<IDS Mannheim|https://www.ids-mannheim.de/>

 Author: Peter Harders

 Contributors: Nils Diewald, Marc Kupietz, Carsten Schnober

 L<KorAP::XML::TEI> is developed as part of the L<KorAP|https://korap.ids-mannheim.de/>
 Corpus Analysis Platform at the
 L<Leibniz Institute for the German Language (IDS)|https://www.ids-mannheim.de/>,
 member of the
 L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/>.

 This program is free software published under the
 L<BSD-2 License|https://opensource.org/licenses/BSD-2-Clause>.

 =cut