blob: a587374419197eb14a3ce4139a05c3d24a0af97c [file] [log] [blame]
=pod
=encoding utf8
=head1 NAME
tei2korapxml - Conversion of TEI P5 based formats to KorAP-XML
=head1 SYNOPSIS
cat corpus.i5.xml | tei2korapxml -tk - > corpus.korapxml.zip
tei2korapxml -tk corpus.i5.xml > corpus.korapxml.zip
=head1 DESCRIPTION
C<tei2korapxml> is a script to convert TEI P5 and
L<I5|https://www.ids-mannheim.de/digspra/kl/projekte/korpora/textmodell>
based documents to the
L<KorAP-XML format|https://github.com/KorAP/KorAP-XML-Krill#about-korap-xml>.
This program is usually called from inside another script.
=head1 FORMATS
=head2 Input restrictions
=over 2
=item
TEI P5 formatted input with certain restrictions:
=over 4
=item
B<mandatory>: text-header with integrated textsigle
(or convertable identifier), text-body
=item
B<optional>: corp-header with integrated corpsigle,
doc-header with integrated docsigle
=back
=item
All tokens inside the primary text may not be
newline seperated, because newlines are removed
(see L<KorAP::XML::TEI::Data>) and a conversion of newlines
into blanks between 2 tokens could lead to additional blanks,
where there should be none (e.g.: punctuation characters like C<,> or
C<.> should not be seperated from their predecessor token).
(see also code section C<~ whitespace handling ~> in C<script/tei2korapxml>).
=item
Header types, like C<E<lt>idsHeader [...] type="document" [...] E<gt>>
need to be defined in the same line as the header tag.
=back
=head2 Notes on the output
=over 2
=item
zip file output (default on C<stdout>) with utf8 encoded entries
(which together form the KorAP-XML format)
=back
=head1 INSTALLATION
C<tei2korapxml> requires C<libxml2-dev> bindings and L<File::ShareDir::Install> to be installed.
When these requirements are met, the preferred way to install the script is
to use L<cpanm|App::cpanminus>.
$ cpanm https://github.com/KorAP/KorAP-XML-TEI.git
In case everything went well, the C<tei2korapxml> tool will
be available on your command line immediately.
Minimum requirement for L<KorAP::XML::TEI> is Perl 5.16.
=head1 OPTIONS
=over 2
=item B<--input|-i>
The input file to process. If no specific input is defined and a single
dash C<-> is passed as an argument, data is read from C<STDIN>.
Instead of using C<-i> input files can also be defined as trailing arguments
to the command:
tei2korapxml -tk corpus1.i5.xml corpus2.i5.xml
=item B<--output|-o>
The output zip file to be created. If no specific output is defined,
data is written to C<STDOUT>.
=item B<--root|-r>
The root directory for output. Defaults to C<.>.
=item B<--help|-h>
Print help information.
=item B<--version|-v>
Print version information.
=item B<--tokenizer-korap|-tk>
Use the standard KorAP/DeReKo tokenizer.
=item B<--tokenizer-internal|-ti>
Tokenize the data using two embedded tokenizers,
that will take an I<aggressive> and a I<conservative>
approach.
=item B<--tokenizer-call|-tc>
Call an external tokenizer process, that will tokenize
from STDIN and outputs the offsets of all tokens.
Texts are separated using C<\x04\n>. The external process
should add a new line per text.
If the L</--use-tokenizer-sentence-splits> option is activated,
sentences are marked by offset as well in new lines.
To use L<Datok|https://github.com/KorAP/Datok> including sentence
splitting, call C<tei2korap> as follows:
$ cat corpus.i5.xml | tei2korapxml -s \
$ -tc 'datok tokenize \
$ -t ./tokenizer.matok \
$ -p --newline-after-eot --no-sentences \
$ --no-tokens --sentence-positions -' - \
$ > corpus.korapxml.zip
=item B<--no-tokenizer>
Boolean flag indicating that no tokenizer should be used.
This is meant to ensure that by default a final token layer always
exists.
If a separate tokenizer is chosen, this flag is ignored.
=item B<--skip-inline-tokens>
Boolean flag indicating that inline tokens should not
be processed. Defaults to false (meaning inline tokens will be processed).
=item B<--skip-inline-token-annotations>
Boolean flag indicating that inline token annotations should not
be processed. Defaults to true (meaning inline token annotations
won't be processed). Can be negated with
C<--no-skip-inline-token-annotations>.
=item B<--skip-inline-tags> <tags>
Expects a comma-separated list of tags to be ignored when the structure
is parsed. Content of these tags however will be processed.
=item B<--auto-textsigle> <textsigle>
Expects a text sigle thats serves as fallback if no text sigles
are given in the input data.
The auto text sigle will be incremented for each text processed.
Example:
tei2korapxml --auto-textsigle 'ICC/GER.00001' -s -tk - \
< data.i5.xml > korapxml.zip
=item B<--xmlid-to-textsigle> <from-regex>@<to-c/to-d/to-t>
Expects a regular replacement expression (separated by B<@> between the
search and the replacement) to convert text id attributes to text sigles
with three parts (separated by B</>).
Example:
tei2korapxml \
--xmlid-to-textsigle 'ICC.German\.([^.]+\.[^.]+)\.(.+)@ICCGER/$1/$2' \
-tk - < t/data/icc_german_sample.p5.xml
Converts text id C<ICC.German.DeReKo.WPD17.G11.00238> to
sigle C<ICCGER/DeReKo.WPD17/G11.00238>.
=item B<--inline-tokens> <foundry>#[<file>]
Define the foundry and file (without extension)
to store inline token information in.
Unless C<--skip-inline-token-annotations> is set,
this will contain annotations as well.
Defaults to C<tokens> and C<morpho>.
The inline token data will also be stored in the
inline structures file (see I<--inline-structures>),
unless the inline token foundry is prepended
by an B<!> exclamation mark, indicating that inline
tokens are stored exclusively in the inline tokens
file.
Example:
tei2korapxml --no-tokenizer --inline-tokens \
'!gingko#morpho' < data.i5.xml > korapxml.zip
=item B<--inline-dependencies> <foundry>#[<file>]
Define the foundry and file (without extension)
to store inline dependency information in.
Defaults to the layer of C<dependency> and
will be ignored if not set (which means, dependency
attributes will be stored in the inline tokens file,
if not skipped).
The dependency data will also be stored in the
inline token file (see I<--inline-tokens>),
unless the inline dependencies foundry is prepended
by an B<!> exclamation mark, indicating that inline
dependency data is stored exclusively in the inline
dependencies file.
Example:
tei2korapxml --no-tokenizer --inline-dependencies \
'gingko#dependency' < data.i5.xml > korapxml.zip
=item B<--inline-structures> <foundry>#[<file>]
Define the foundry and file (without extension)
to store inline structure information in.
Defaults to C<struct> and C<structures>.
=item B<--base-foundry> <foundry>
Define the base foundry to store newly generated
token information in.
Defaults to C<base>.
=item B<--data-file> <file>
Define the file (without extension)
to store primary data information in.
Defaults to C<data>.
=item B<--header-file> <file>
Define the file name (without extension)
to store header information on
the corpus, document, and text level in.
Defaults to C<header>.
=item B<--use-tokenizer-sentence-splits|-s>
Replace existing with, or add new, sentence boundary information
provided by the tokenizer.
Currently KorAP-tokenizer and certain external tokenizers support
these boundaries.
=item B<--tokens-file> <file>
Define the file (without extension)
to store generated token information in
(either from the KorAP tokenizer or an externally called tokenizer).
Defaults to C<tokens>.
=item B<--log|-l>
Loglevel for I<Log::Any>. Defaults to C<notice>.
=back
=head1 ENVIRONMENT VARIABLES
=over 2
=item B<KORAPXMLTEI_DEBUG>
Activate minimal debugging.
Defaults to C<false>.
=back
=head1 COPYRIGHT AND LICENSE
Copyright (C) 2021-2024, L<IDS Mannheim|https://www.ids-mannheim.de/>
Author: Peter Harders
Contributors: Nils Diewald, Marc Kupietz, Carsten Schnober
L<KorAP::XML::TEI> is developed as part of the L<KorAP|https://korap.ids-mannheim.de/>
Corpus Analysis Platform at the
L<Leibniz Institute for the German Language (IDS)|https://www.ids-mannheim.de/>,
member of the
L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/>.
This program is free software published under the
L<BSD-2 License|https://opensource.org/licenses/BSD-2-Clause>.
=cut