 =pod

 =encoding utf8

 =head1 NAME

 tei2korapxml - Conversion of TEI P5 based formats to KorAP-XML

 =head1 SYNOPSIS

   cat corpus.i5.xml | tei2korapxml -tk - > corpus.korapxml.zip
   tei2korapxml -tk corpus.i5.xml > corpus.korapxml.zip

 =head1 DESCRIPTION

 C<tei2korapxml> is a script to convert TEI P5 and
 L<I5|https://www.ids-mannheim.de/digspra/kl/projekte/korpora/textmodell>
 based documents to the
 L<KorAP-XML format|https://github.com/KorAP/KorAP-XML-Krill#about-korap-xml>.

 This program is usually called from inside another script.

 =head1 FORMATS

 =head2 Input restrictions

 =over 2

 =item

 TEI P5 formatted input with certain restrictions:

 =over 4

 =item

 B<mandatory>: text-header with integrated textsigle
 (or convertible identifier), text-body

 =item

 B<optional>: corp-header with integrated corpsigle,
 doc-header with integrated docsigle

 =back

 =item

 Tokens inside the primary text must not be separated by newlines.
 Newlines are removed (see L<KorAP::XML::TEI::Data>) rather than converted
 into blanks, because a conversion between two tokens could introduce
 additional blanks where there should be none (e.g. punctuation characters
 like C<,> or C<.> should not be separated from their predecessor token).
 See also the code section C<~ whitespace handling ~> in C<script/tei2korapxml>.

 =item

 Header types, like C<E<lt>idsHeader [...] type="document" [...] E<gt>>,
 must be defined on the same line as the header tag.

 =back
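
 The newline restriction can be illustrated with a small Python sketch.
 It is not the script's actual whitespace handling (see
 L<KorAP::XML::TEI::Data> for that); the helper names C<remove_newlines> and
 C<newlines_to_blanks> and the sample strings are made up for illustration.

```python
def remove_newlines(text: str) -> str:
    """Drop all newline characters (the strategy described above)."""
    return text.replace("\n", "")


def newlines_to_blanks(text: str) -> str:
    """Alternative strategy: turn every newline into a blank."""
    return text.replace("\n", " ")


# Punctuation on its own line: removal keeps it attached to its predecessor,
# while conversion to blanks would detach it.
punct = "Hello\n, world\n."
# Two tokens separated only by a newline: removal merges them into one.
words = "Hello\nworld"

print(repr(remove_newlines(punct)))
print(repr(newlines_to_blanks(punct)))
print(repr(remove_newlines(words)))
```

 The last call shows why tokens in the input must not rely on a newline as
 their only separator.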

 =head2 Notes on the output

 =over 2

 =item

 The output is a zip file (written to C<STDOUT> by default) with UTF-8
 encoded entries, which together form the KorAP-XML format.

 =back
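
 To inspect such an archive, a minimal Python sketch using the standard
 C<zipfile> module could look as follows. The helper C<list_entries> and the
 entry name C<CORP/DOC/00001/data.xml> are assumptions for illustration; the
 real entry layout depends on the sigles and file options in use.

```python
import io
import zipfile


def list_entries(zip_bytes: bytes) -> dict[str, str]:
    """Return a mapping of entry name to its UTF-8 decoded content."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return {
            name: archive.read(name).decode("utf-8")
            for name in archive.namelist()
            if not name.endswith("/")  # skip directory entries
        }


# Build a tiny stand-in archive in memory. With real data, zip_bytes would be
# the content of a file such as corpus.korapxml.zip.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("CORP/DOC/00001/data.xml", "<text>Käse</text>")  # hypothetical entry

for name, content in list_entries(buffer.getvalue()).items():
    print(name, len(content))
```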

 =head1 INSTALLATION

 C<tei2korapxml> requires C<libxml2-dev> bindings and L<File::ShareDir::Install> to be installed.
 When these requirements are met, the preferred way to install the script is
 to use L<cpanm|App::cpanminus>.

   $ cpanm https://github.com/KorAP/KorAP-XML-TEI.git

 If everything went well, the C<tei2korapxml> tool will
 be available on your command line immediately.

 Minimum requirement for L<KorAP::XML::TEI> is Perl 5.16.

 =head1 OPTIONS

 =over 2

 =item B<--input|-i>

 The input file to process. If no specific input is defined and a single
 dash C<-> is passed as an argument, data is read from C<STDIN>.

 Instead of using C<-i>, input files can also be defined as trailing arguments
 to the command:

   tei2korapxml -tk corpus1.i5.xml corpus2.i5.xml

 =item B<--output|-o>

 The output zip file to be created. If no specific output is defined,
 data is written to C<STDOUT>.

 =item B<--root|-r>

 The root directory for output. Defaults to C<.>.

 =item B<--help|-h>

 Print help information.

 =item B<--version|-v>

 Print version information.

 =item B<--tokenizer-korap|-tk>

 Use the standard KorAP/DeReKo tokenizer.

 =item B<--tokenizer-internal|-ti>

 Tokenize the data using two embedded tokenizers
 that take an I<aggressive> and a I<conservative>
 approach.

 =item B<--tokenizer-call|-tc>

 Call an external tokenizer process that tokenizes input
 from STDIN and outputs the offsets of all tokens.

 Texts are separated using C<\x04\n>. The external process
 should add a new line per text.

 If the L</--use-tokenizer-sentence-splits> option is activated,
 sentences are also marked by offset, on new lines.

 To use L<Datok|https://github.com/KorAP/Datok> including sentence
 splitting, call C<tei2korapxml> as follows:

   $ cat corpus.i5.xml | tei2korapxml -s \
       -tc 'datok tokenize \
            -t ./tokenizer.matok \
            -p --newline-after-eot --no-sentences \
            --no-tokens --sentence-positions -' - \
            > corpus.korapxml.zip
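
 The exchange with an external tokenizer can be sketched in Python as
 follows. This is a toy stand-in, not Datok and not the protocol
 specification: the exact output format (here a single line of
 space-separated start and end character offsets per text) and the naive
 token pattern are assumptions, and the names C<token_offsets> and
 C<tokenize_stream> are hypothetical.

```python
import re

EOT = "\x04\n"  # text separator described above
TOKEN = re.compile(r"\w+|[^\w\s]")  # naive: words or single punctuation marks


def token_offsets(text: str) -> list[int]:
    """Return a flat list of start and end character offsets of all tokens."""
    offsets: list[int] = []
    for match in TOKEN.finditer(text):
        offsets.extend((match.start(), match.end()))
    return offsets


def tokenize_stream(data: str) -> list[str]:
    """Produce one output line per EOT-terminated text."""
    texts = data.split(EOT)
    if texts and texts[-1] == "":
        texts.pop()  # nothing follows the final separator
    return [" ".join(map(str, token_offsets(text))) for text in texts]


# A real process would read its input from STDIN and write to STDOUT;
# a fixed sample is used here to keep the sketch self-contained.
sample = "Hello, world." + EOT + "Second text" + EOT
for line in tokenize_stream(sample):
    print(line)
```

 Offsets restart at zero for every text, which is why each text gets a line
 of its own.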

 =item B<--no-tokenizer>

 Boolean flag indicating that no tokenizer should be used.
 Having to set this flag explicitly is meant to ensure that,
 by default, a final token layer always exists.
 If a separate tokenizer is chosen, this flag is ignored.

 =item B<--skip-inline-tokens>

 Boolean flag indicating that inline tokens should not
 be processed. Defaults to false (meaning inline tokens will be processed).

 =item B<--skip-inline-token-annotations>

 Boolean flag indicating that inline token annotations should not
 be processed. Defaults to true (meaning inline token annotations
 won't be processed). Can be negated with
 C<--no-skip-inline-token-annotations>.

 =item B<--skip-inline-tags> <tags>

 Expects a comma-separated list of tags to be ignored when the structure
 is parsed. The content of these tags, however, will still be processed.

 =item B<--auto-textsigle> <textsigle>

 Expects a text sigle that serves as a fallback if no text sigles
 are given in the input data.
 The auto text sigle will be incremented for each text processed.

 Example:

   tei2korapxml --auto-textsigle 'ICC/GER.00001' -s -tk - \
   < data.i5.xml > korapxml.zip

 =item B<--xmlid-to-textsigle> <from-regex>@<to-c/to-d/to-t>

 Expects a regular replacement expression (separated by B<@> between the
 search and the replacement) to convert text id attributes to text sigles
 with three parts (separated by B</>).

 Example:

   tei2korapxml  \
     --xmlid-to-textsigle 'ICC.German\.([^.]+\.[^.]+)\.(.+)@ICCGER/$1/$2' \
     -tk - < t/data/icc_german_sample.p5.xml

 Converts text id C<ICC.German.DeReKo.WPD17.G11.00238> to
 sigle C<ICCGER/DeReKo.WPD17/G11.00238>.
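
 The conversion above can be reproduced with a short Python sketch. It is
 not the script's implementation: the helper C<xmlid_to_textsigle> is
 hypothetical, and Perl and Python regular expressions differ in details, so
 only simple rules like the one shown will behave identically.

```python
import re


def xmlid_to_textsigle(xml_id: str, rule: str) -> str:
    """Apply a '<from-regex>@<to>' rule to a text id."""
    pattern, replacement = rule.split("@", 1)
    # Translate Perl-style $1, $2 into Python-style \g<1>, \g<2>.
    replacement = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    return re.sub(pattern, replacement, xml_id)


rule = r"ICC.German\.([^.]+\.[^.]+)\.(.+)@ICCGER/$1/$2"
print(xmlid_to_textsigle("ICC.German.DeReKo.WPD17.G11.00238", rule))
```

 The first capture group spans two dot-separated components
 (C<DeReKo.WPD17>), so the resulting sigle has the required three parts.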

 =item B<--inline-tokens> <foundry>#[<file>]

 Define the foundry and file (without extension)
 to store inline token information in.
 Unless C<--skip-inline-token-annotations> is set,
 this will contain annotations as well.
 Defaults to C<tokens> and C<morpho>.

 The inline token data will also be stored in the
 inline structures file (see I<--inline-structures>),
 unless the inline token foundry is prepended
 by an B<!> exclamation mark, indicating that inline
 tokens are stored exclusively in the inline tokens
 file.

 Example:

   tei2korapxml --no-tokenizer --inline-tokens \
     '!gingko#morpho' < data.i5.xml > korapxml.zip
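
 How such a C<< <foundry>#[<file>] >> value decomposes can be sketched in
 Python. The parser below is hypothetical and only mirrors the description
 above; in particular, falling back to the defaults for empty parts is an
 assumption, not confirmed behaviour of the script.

```python
from typing import NamedTuple


class InlineSpec(NamedTuple):
    foundry: str
    file: str
    exclusive: bool  # True if the value was prefixed with "!"


def parse_inline_spec(spec: str, default_foundry: str = "tokens",
                      default_file: str = "morpho") -> InlineSpec:
    """Split a '[!]<foundry>#[<file>]' value into its parts."""
    exclusive = spec.startswith("!")
    if exclusive:
        spec = spec[1:]
    foundry, _, file = spec.partition("#")
    # Assumption: empty parts fall back to the documented defaults.
    return InlineSpec(foundry or default_foundry, file or default_file, exclusive)


print(parse_inline_spec("!gingko#morpho"))
print(parse_inline_spec("gingko#"))
```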

 =item B<--inline-dependencies> <foundry>#[<file>]

 Define the foundry and file (without extension)
 to store inline dependency information in.
 The layer defaults to C<dependency>. If the option is
 not set, it is ignored, which means that dependency
 attributes will be stored in the inline tokens file
 (if not skipped).

 The dependency data will also be stored in the
 inline token file (see I<--inline-tokens>),
 unless the inline dependencies foundry is prepended
 by an B<!> exclamation mark, indicating that inline
 dependency data is stored exclusively in the inline
 dependencies file.

 Example:

   tei2korapxml --no-tokenizer --inline-dependencies \
     'gingko#dependency' < data.i5.xml > korapxml.zip


 =item B<--inline-structures> <foundry>#[<file>]

 Define the foundry and file (without extension)
 to store inline structure information in.
 Defaults to C<struct> and C<structures>.

 =item B<--base-foundry> <foundry>

 Define the base foundry to store newly generated
 token information in.
 Defaults to C<base>.

 =item B<--data-file> <file>

 Define the file (without extension)
 to store primary data information in.
 Defaults to C<data>.

 =item B<--header-file> <file>

 Define the file name (without extension)
 to store header information on
 the corpus, document, and text level in.
 Defaults to C<header>.

 =item B<--use-tokenizer-sentence-splits|-s>

 Replace existing sentence boundary information with the boundaries
 provided by the tokenizer, or add them if none exist.
 Currently, the KorAP tokenizer and certain external tokenizers support
 these boundaries.

 =item B<--tokens-file> <file>

 Define the file (without extension)
 to store generated token information in
 (either from the KorAP tokenizer or an externally called tokenizer).
 Defaults to C<tokens>.

 =item B<--log|-l>

 Log level for I<Log::Any>. Defaults to C<notice>.

 =back

 =head1 ENVIRONMENT VARIABLES

 =over 2

 =item B<KORAPXMLTEI_DEBUG>

 Activate minimal debugging.
 Defaults to C<false>.

 =back

 =head1 COPYRIGHT AND LICENSE

 Copyright (C) 2021-2024, L<IDS Mannheim|https://www.ids-mannheim.de/>

 Author: Peter Harders

 Contributors: Nils Diewald, Marc Kupietz, Carsten Schnober

 L<KorAP::XML::TEI> is developed as part of the L<KorAP|https://korap.ids-mannheim.de/>
 Corpus Analysis Platform at the
 L<Leibniz Institute for the German Language (IDS)|https://www.ids-mannheim.de/>,
 member of the
 L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/>.

 This program is free software published under the
 L<BSD-2 License|https://opensource.org/licenses/BSD-2-Clause>.

 =cut