=pod

=encoding utf8

=head1 NAME

korapxml2krill - Merge KorapXML data and create Krill documents


=head1 SYNOPSIS

  $ korapxml2krill -z --input <directory> --output <filename>
  $ korapxml2krill archive -z --input <directory|archive> --output <directory>
  $ korapxml2krill extract --input <directory|archive> --output <filename> --sigle <SIGLE>


=head1 DESCRIPTION

L<KorAP::XML::Krill> is a library to convert KorAP-XML documents to files
compatible with the L<Krill|https://github.com/KorAP/Krill> indexer.
The C<korapxml2krill> command line tool is a simple wrapper around the library.


=head1 INSTALLATION

The preferred way to install L<KorAP::XML::Krill> is to use L<cpanm|App::cpanminus>.

  $ cpanm https://github.com/KorAP/KorAP-XML-Krill.git

If the installation succeeds, the C<korapxml2krill> tool will
be available on your command line immediately.
L<KorAP::XML::Krill> requires at least Perl 5.14.
To work with zip archives, the C<unzip> tool must be installed as well.
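
A quick way to check the installation is to ask both tools for their version.
This is only a sketch; the output depends on the installed versions.

  $ korapxml2krill --version
  $ unzip -v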

=head1 ARGUMENTS

Without a command argument, C<korapxml2krill> processes a directory containing a single KorAP-XML document.

=over 2

=item B<archive>

Processes an archive of KorAP-XML documents, given either as a zip file or as a folder.

=item B<extract>

Extracts KorAP-XML files from a zip file.

=back


=head1 OPTIONS

=over 2

=item B<--input|-i> <directory|file|files>

Directory or archive file of documents to convert.

Without a command argument, C<korapxml2krill> expects a folder containing a single KorAP-XML
document, while C<archive> and C<extract> support zip archives as well.

C<archive> supports multiple input archives, with the constraint
that the first archive listed contains all primary data files
and all metadata files.

  -i file/news.zip -i file/news.malt.zip -i "#file/news.tt.zip"

(The directory structure follows the base directory format,
which may include a C<.> root folder.
In this case, further archives lacking a C<.> root folder
need to be passed with a hash sign in front of the archive's name.
The parameter may then need to be quoted.)

To support zip files, a version of C<unzip> needs to be installed that is
compatible with the archive file.

B<The root folder switch using the hash sign is experimental and
may vanish in future versions.>

=item B<--output|-o> <directory|file>

Output folder for archive processing, or
document name for single-document output (optional).
Writes to C<STDOUT> by default
(unless other options make C<output> mandatory).

=item B<--overwrite|-w>

Overwrite files that already exist.

=item B<--token|-t> <foundry>[#<file>]

Define the default tokenization by specifying
the name of the foundry and optionally the name
of the layer-file. Defaults to C<OpenNLP#tokens>.

=item B<--skip|-s> <foundry>[#<layer>]

Skip specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>. Alternatively you can skip C<#ALL>.
Can be set multiple times.

=item B<--anno|-a> <foundry>#<layer>

Convert specific annotations by specifying the foundry
(and optionally the layer with a C<#>-prefix),
e.g. C<Mate> or C<Mate#Morpho>.
Can be set multiple times.

=item B<--primary|-p>

Whether to output primary data. Defaults to C<true>.
Can be disabled with C<--no-primary>.
This is I<deprecated>.

=item B<--jobs|-j>

Define the number of concurrent jobs in separate forks
for archive processing.
Defaults to C<0> (everything runs in a single process).
This is I<experimental>.

=item B<--meta|-m>

Define the metadata parser to use. Defaults to C<I5>.
Metadata parsers can be defined in the C<KorAP::XML::Meta> namespace.
This is I<experimental>.

=item B<--pretty|-y>

Pretty print JSON output. Defaults to C<false>.
This is I<deprecated>.

=item B<--gzip|-z>

Compress the output.
When processing a single document, an C<output> file must be specified.

=item B<--cache|-c>

File to use as a memory-mapped cache (using L<Cache::FastMmap>).
Defaults to C<korapxml2krill.cache> in the calling directory.

=item B<--cache-size|-cs>

Size of the cache. Defaults to C<50m>.

=item B<--cache-init|-ci>

Initialize cache file.
Can be disabled with C<--no-cache-init>.
Defaults to C<true>.

=item B<--cache-delete|-cd>

Delete cache file after processing.
Can be disabled with C<--no-cache-delete>.
Defaults to C<true>.

=item B<--sigle|-sg>

Extract the given text sigles.
Can be set multiple times.
I<Currently only supported on C<extract>.>
Sigles have the structure C<Corpus>/C<Document>/C<Text>.

=item B<--log|-l>

The L<Log4perl|Log::Log4perl> log level. Defaults to C<ERROR>.

=item B<--help|-h>

Print this document.

=item B<--version|-v>

Print version information.

=back
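
The options above can be combined. The following invocations are sketches only:
the placeholders in angle brackets follow the L</SYNOPSIS>, and the sigles
C<WPD/AAA/00001> and C<WPD/AAA/00002> are hypothetical examples of the
C<Corpus>/C<Document>/C<Text> structure, not texts known to exist in any corpus.

  # Convert an archive with four concurrent jobs, compress the output,
  # and skip all Mate annotations as well as the CoreNLP morphology
  $ korapxml2krill archive -z -j 4 -s Mate -s "CoreNLP#Morpho" \
      --input <archive> --output <directory>

  # Extract two texts from an archive by their sigles
  $ korapxml2krill extract --input <archive> --output <filename> \
      --sigle WPD/AAA/00001 --sigle WPD/AAA/00002

Values containing a hash sign are quoted here to stay safe across shells;
a value that starts with the hash sign, such as C<"#ALL">, has to be quoted
because most shells would otherwise treat it as the start of a comment.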

=head1 ANNOTATION SUPPORT

L<KorAP::XML::Krill> has built-in importers for some annotation foundries and layers
developed in the KorAP project that are part of the KorAP preprocessing pipeline.
The base foundry with paragraphs, sentences, and the text element is mandatory for
L<Krill|https://github.com/KorAP/Krill>.

=over 2

=item B<Base>

=over 4

=item #Paragraphs

=item #Sentences

=back

=item B<Connexor>

=over 4

=item #Morpho

=item #Phrase

=item #Sentences

=item #Syntax

=back

=item B<CoreNLP>

=over 4

=item #Constituency

=item #Morpho

=item #NamedEntities

=item #Sentences

=back

=item B<DeReKo>

=over 4

=item #Structure

=back

=item B<Glemm>

=over 4

=item #Morpho

=back

=item B<Mate>

=over 4

=item #Dependency

=item #Morpho

=back

=item B<OpenNLP>

=over 4

=item #Morpho

=item #Sentences

=back

=item B<Sgbr>

=over 4

=item #Lemma

=item #Morpho

=back

=item B<TreeTagger>

=over 4

=item #Morpho

=item #Sentences

=back

=item B<XIP>

=over 4

=item #Constituency

=item #Morpho

=item #Sentences

=back

=back

More importers are in preparation.
New annotation importers can be defined in the C<KorAP::XML::Annotation> namespace.
See the built-in annotation importers as examples.

=head1 AVAILABILITY

  https://github.com/KorAP/KorAP-XML-Krill


=head1 COPYRIGHT AND LICENSE

Copyright (C) 2015-2016, L<IDS Mannheim|http://www.ids-mannheim.de/>

Author: L<Nils Diewald|http://nils-diewald.de/>

L<KorAP::XML::Krill> is developed as part of the L<KorAP|http://korap.ids-mannheim.de/>
Corpus Analysis Platform at the
L<Institute for the German Language (IDS)|http://ids-mannheim.de/>,
member of the
L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/en/about-us/leibniz-competition/projekte-2011/2011-funding-line-2/>.

This program is free software published under the
L<BSD-2 License|https://raw.githubusercontent.com/KorAP/KorAP-XML-Krill/master/LICENSE>.

=cut
