blob: 295e3ab50be678bb68ab330fa368984eddbbd767 [file] [log] [blame]
=pod
=encoding utf8
=head1 NAME
tei2korapxml - Conversion of TEI P5 based formats to KorAP-XML
=head1 SYNOPSIS
cat corpus.i5.xml | tei2korapxml > corpus.korapxml.zip
=head1 DESCRIPTION
C<tei2korapxml> is a script to convert TEI P5 and
L<I5|https://www1.ids-mannheim.de/kl/projekte/korpora/textmodell.html>
based documents to the
L<KorAP-XML format|https://github.com/KorAP/KorAP-XML-Krill#about-korap-xml>.
If no specific input is defined, data is
read from C<STDIN>. If no specific output is defined, data is written
to C<STDOUT>.
This program is usually called from inside another script.
=head1 FORMATS
=head2 Input restrictions
=over 2
=item
TEI P5 formatted input with certain restrictions:
=over 4
=item
B<mandatory>: text-header with integrated textsigle, text-body
=item
B<optional>: corp-header with integrated corpsigle,
doc-header with integrated docsigle
=back
=item
All tokens inside the primary text may not be
newline seperated, because newlines are removed
(see L<KorAP::XML::TEI::Data>) and a conversion of newlines
into blanks between 2 tokens could lead to additional blanks,
where there should be none (e.g.: punctuation characters like C<,> or
C<.> should not be seperated from their predecessor token).
(see also code section C<~ whitespace handling ~> in C<script/tei2korapxml>).
=back
=head2 Notes on the output
=over 2
=item
zip file output (default on C<stdout>) with utf8 encoded entries
(which together form the KorAP-XML format)
=back
=head1 INSTALLATION
C<tei2korapxml> requires L<libxml2-dev> bindings to build. When
these bindings are available, the preferred way to install the script is
to use L<cpanm|App::cpanminus>.
$ cpanm https://github.com/KorAP/KorAP-XML-TEI.git
In case everything went well, the C<tei2korapxml> tool will
be available on your command line immediately.
Minimum requirement for L<KorAP::XML::TEI> is Perl 5.16.
=head1 OPTIONS
=over 2
=item B<--root|-r>
The root directory for output. Defaults to C<.>.
=item B<--help|-h>
Print help information.
=item B<--version|-v>
Print version information.
=item B<--tokenizer-call|-tc>
Call an external tokenizer process, that will tokenize
a single line from STDIN and outputs one token per line.
=item B<--tokenizer-korap|-tk>
Use the standard KorAP/DeReKo tokenizer.
=item B<--tokenizer-internal|-ti>
Tokenize the data using two embedded tokenizers,
that will take an I<Aggressive> and a I<conservative>
approach.
=item B<--skip-inline-tokens>
Boolean flag indicating that inline tokens should not
be processed. Defaults to false (meaning inline tokens will be processed).
=item B<--skip-inline-token-annotations>
Boolean flag indicating that inline token annotations should not
be processed. Defaults to true (meaning inline token annotations
won't be processed).
=item B<--skip-inline-tags> <tags>
Expects a comma-separated list of tags to be ignored when the structure
is parsed. Content of these tags however will be processed.
=item B<--inline-tokens> <foundry>#[<file>]
Define the foundry and file (without extension)
to store inline token information in.
Unless C<--skip-inline-token-annotations> is set,
this will contain annotations as well.
Defaults to C<tokens> and C<morpho>.
=item B<--inline-structures> <foundry>#[<file>]
Define the foundry and file (without extension)
to store inline structure information in.
Defaults to C<struct> and C<structures>.
=item B<--base-foundry> <foundry>
Define the base foundry to store newly generated
token information in.
Defaults to C<base>.
=item B<--data-file> <file>
Define the file (without extension)
to store primary data information in.
Defaults to C<data>.
=item B<--header-file> <file>
Define the file name (without extension)
to store header information on
the corpus, document, and text level in.
Defaults to C<header>.
=item B<--use-tokenizer-sentence-splits|-s>
Replace existing with, or add new, sentence boundary information
provided by the KorAP tokenizer (currently supported only).
=item B<--tokens-file> <file>
Define the file (without extension)
to store generated token information in
(either from the KorAP tokenizer or an externally called tokenizer).
Defaults to C<tokens>.
=item B<--log|-l>
Loglevel for I<Log::Any>. Defaults to C<notice>.
=back
=head1 ENVIRONMENT VARIABLES
=over 2
=item B<KORAPXMLTEI_DEBUG>
Activate minimal debugging.
Defaults to C<false>.
=back
=head1 COPYRIGHT AND LICENSE
Copyright (C) 2021, L<IDS Mannheim|https://www.ids-mannheim.de/>
Author: Peter Harders
Contributors: Nils Diewald, Marc Kupietz, Carsten Schnober
L<KorAP::XML::TEI> is developed as part of the L<KorAP|https://korap.ids-mannheim.de/>
Corpus Analysis Platform at the
L<Leibniz Institute for the German Language (IDS)|http://ids-mannheim.de/>,
member of the
L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/>.
This program is free software published under the
L<BSD-2 License|https://opensource.org/licenses/BSD-2-Clause>.
=cut