 =pod

 =encoding utf8

 =head1 NAME

 korapxml2krill - Merge KorapXML data and create Krill documents


 =head1 SYNOPSIS

   korapxml2krill [archive|extract] --input <directory|archive> [options]


 =head1 DESCRIPTION

 L<KorAP::XML::Krill> is a library to convert KorAP-XML documents to files
 compatible with the L<Krill|https://github.com/KorAP/Krill> indexer.
 The C<korapxml2krill> command line tool is a simple wrapper to the library.


 =head1 INSTALLATION

 The preferred way to install L<KorAP::XML::Krill> is to use L<cpanm|App::cpanminus>.

   $ cpanm https://github.com/KorAP/KorAP-XML-Krill.git

 If everything went well, the C<korapxml2krill> tool will
 be available on your command line immediately.
 The minimum requirement for L<KorAP::XML::Krill> is Perl 5.16.
 In addition, to work with zip archives, the C<unzip> tool needs to be present.
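
 Both prerequisites can be checked from the shell before installing. The following is a minimal sketch; C<have_tool> is a hypothetical helper and not part of C<korapxml2krill>.

```shell
# Hypothetical helper: succeed if the named command is on PATH.
have_tool() {
  command -v "$1" >/dev/null 2>&1
}

# Report the documented Perl requirement ($] is 5.016000 for Perl 5.16.0).
if have_tool perl && perl -e 'exit($] >= 5.016 ? 0 : 1)'; then
  echo "perl: 5.16 or newer found"
else
  echo "perl: missing or older than 5.16"
fi

# unzip is only needed when working with zip archives.
if have_tool unzip; then
  echo "unzip: found"
else
  echo "unzip: missing (needed for zip archives)"
fi
```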

 =head1 ARGUMENTS

   $ korapxml2krill -z --input <directory> --output <filename>

 Without arguments, C<korapxml2krill> converts a directory of a single KorAP-XML document.
 It expects the input to point to the text level folder.

 =over 2

 =item B<archive>

   $ korapxml2krill archive -z --input <directory|archive> --output <directory|tar>

 Converts an archive of KorAP-XML documents. It expects a directory
 (pointing to the corpus level folder) or one or more zip files as input.

 =item B<extract>

   $ korapxml2krill extract --input <archive> --output <directory> --sigle <SIGLE>

 Extracts KorAP-XML documents from a zip file.

 =item B<serial>

   $ korapxml2krill serial -i <archive1> -i <archive2> -o <directory> -cfg <config-file>

 Converts archives sequentially. The inputs are not merged but treated
 as they are (so they may be premerged or globs).
 The C<--output> directory is treated as the base directory where subdirectories
 are created based on the archive name. In case the C<--to-tar> flag is given,
 the output will be a tar file.


 =back


 =head1 OPTIONS

 =over 2

 =item B<--input|-i> <directory|zip file>

 Directory or zip file(s) of documents to convert.

 Without arguments, C<korapxml2krill> expects a folder of a single KorAP-XML
 document, while C<archive> expects a KorAP-XML corpus folder or a zip
 file to batch process multiple files.
 C<extract> expects zip files only.

 C<archive> supports multiple input zip files with the constraint
 that the first archive listed contains all primary data files
 and all meta data files.

   -i file/news.zip -i file/news.malt.zip -i "#file/news.tt.zip"

 Input may also be defined using BSD glob wildcards.

   -i 'file/news*.zip'

 The extended input array will be sorted in length order, so the shortest
 path needs to contain all primary data files and all meta data files.

 (The directory structure follows the base directory format,
 which may include a C<.> root folder.
 In this case further archives lacking a C<.> root folder
 need to be passed with a hash sign in front of the archive's name.
 This may require quoting the parameter.)

 To support zip files, a version of C<unzip> needs to be installed that is
 compatible with the archive file.

 B<The root folder switch using the hash sign is experimental and
 may vanish in future versions.>
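
 To preview which archive a glob would place first, the candidate paths can be ordered by length in the shell. This is only a sketch of the length-order rule described above: C<order_by_length> is a hypothetical helper, and keeping the original order for paths of equal length is an assumption, not documented behaviour of the tool.

```shell
# Hypothetical helper: print the given paths one per line, shortest first.
# Ties keep their input order (stable sort); the tool may behave differently.
order_by_length() {
  for path in "$@"; do
    printf '%s %s\n' "${#path}" "$path"
  done | sort -s -n -k1,1 | cut -d' ' -f2-
}

order_by_length file/news.malt.zip file/news.zip file/news.tt.zip
```

 With these three names, C<file/news.zip> is printed first, so that archive would have to contain all primary data files and all meta data files.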


 =item B<--input-base|-ib> <directory>

 The base directory for inputs.


 =item B<--output|-o> <directory|file>

 Output folder for archive processing or
 document name for single output (optional),
 writes to C<STDOUT> by default
 (in case C<output> is not mandatory due to further options).

 =item B<--overwrite|-w>

 Overwrite files that already exist.


 =item B<--token|-t> <foundry>#<file>

 Define the default tokenization by specifying
 the name of the foundry and optionally the name
 of the layer-file. Defaults to C<OpenNLP#tokens>.


 =item B<--base-sentences|-bs> <foundry>#<layer>

 Define the layer for base sentences.
 If given, this will be used instead of using C<Base#Sentences>.
 Currently C<DeReKo#Structure> is the only additional layer supported.

 Defaults to unset.


 =item B<--base-paragraphs|-bp> <foundry>#<layer>

 Define the layer for base paragraphs.
 If given, this will be used instead of using C<Base#Paragraphs>.
 Currently C<DeReKo#Structure> is the only additional layer supported.

 Defaults to unset.


 =item B<--base-pagebreaks|-bpb> <foundry>#<layer>

 Define the layer for base pagebreaks.
 Currently C<DeReKo#Structure> is the only layer supported.

 Defaults to unset.


 =item B<--skip|-s> <foundry>[#<layer>]

 Skip specific annotations by specifying the foundry
 (and optionally the layer with a C<#>-prefix),
 e.g. C<Mate> or C<Mate#Morpho>. Alternatively you can skip C<#ALL>.
 Can be set multiple times.


 =item B<--anno|-a> <foundry>#<layer>

 Convert specific annotations by specifying the foundry
 (and optionally the layer with a C<#>-prefix),
 e.g. C<Mate> or C<Mate#Morpho>.
 Can be set multiple times.


 =item B<--primary|-p>

 Output primary data or not. Defaults to C<true>.
 Can be flagged using C<--no-primary> as well.
 This is I<deprecated>.


 =item B<--non-word-tokens|-nwt>

 Tokenize non-word tokens like word tokens (defined as matching
 C</[\d\w]/>). Useful to treat punctuation as tokens.

 Defaults to unset.

 =item B<--jobs|-j>

 Define the number of concurrent jobs in separate forks
 for archive processing.
 Defaults to C<0> (everything runs in a single process).

 If C<sequential-extraction> is not set to false, this will
 also apply to extraction.

 Pass -1, and the value will be set automatically to 5
 times the number of available cores.
 This is I<experimental>.
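
 The value that C<-1> would yield on a given machine can be estimated in the shell. This is a rough sketch: C<auto_jobs> is a hypothetical helper, and counting cores with C<getconf _NPROCESSORS_ONLN> is an assumption, since this document does not state how C<korapxml2krill> determines the number of cores.

```shell
# Hypothetical helper: apply the documented rule "5 times the number of
# available cores" to a given core count.
auto_jobs() {
  echo $(( $1 * 5 ))
}

# getconf _NPROCESSORS_ONLN is a common (not universal) way to count cores;
# fall back to 1 if it is unavailable.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
echo "cores=$cores jobs=$(auto_jobs "$cores")"
```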


 =item B<--koral|-k>

 Version of the output format. Supported versions are:
 C<0> for legacy serialization, C<0.03> for serialization
 with metadata fields as key-values on the root object,
 C<0.4> for serialization with metadata fields as a list
 of C<"@type":"koral:field"> objects.

 Currently defaults to C<0.03>.


 =item B<--sequential-extraction|-se>

 Flag to indicate whether the C<jobs> value also applies to extraction.
 Some systems may have problems with extracting multiple archives
 to the same folder at the same time.
 Can be flagged using C<--no-sequential-extraction> as well.
 Defaults to C<false>.


 =item B<--meta|-m>

 Define the metadata parser to use. Defaults to C<I5>.
 Metadata parsers can be defined in the C<KorAP::XML::Meta> namespace.
 This is I<experimental>.


 =item B<--pretty|-y>

 Pretty print JSON output. Defaults to C<false>.
 This is I<deprecated>.


 =item B<--gzip|-z>

 Compress the output.
 Expects a defined C<output> file in single processing.


 =item B<--cache|-c>

 File to mmap a cache (using L<Cache::FastMmap>).
 Defaults to C<korapxml2krill.cache> in the calling directory.


 =item B<--cache-size|-cs>

 Size of the cache. Defaults to C<50m>.


 =item B<--cache-init|-ci>

 Initialize cache file.
 Can be flagged using C<--no-cache-init> as well.
 Defaults to C<true>.


 =item B<--cache-delete|-cd>

 Delete cache file after processing.
 Can be flagged using C<--no-cache-delete> as well.
 Defaults to C<true>.


 =item B<--config|-cfg>

 Configure the parameters of your call in a file
 of key-value pairs with whitespace as separator:

   overwrite 1
   token     DeReKo#Structure
   ...

 Supported parameters are:
 C<overwrite>, C<gzip>, C<jobs>, C<input-base>,
 C<token>, C<log>, C<cache>, C<cache-size>, C<cache-delete>, C<meta>,
 C<output>, C<koral>,
 C<temporary-extract>, C<sequential-extraction>,
 C<base-sentences>, C<base-paragraphs>,
 C<base-pagebreaks>,
 C<skip> (semicolon separated), C<sigle>
 (semicolon separated), C<anno> (semicolon separated).

 Configuration parameters will always be overwritten by
 passed parameters.
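
 A configuration file of this shape can be written from the shell. The following is a minimal sketch: the file name C<korapxml2krill.conf> and the chosen values are assumptions for illustration, and using C<1> to switch on C<gzip> follows the C<overwrite 1> pattern shown above rather than a documented rule.

```shell
# Hypothetical file name; any path can be passed to --config.
conf=korapxml2krill.conf

# One "key value" pair per line, separated by whitespace.
# skip takes several values separated by semicolons.
printf '%s\n' \
  'overwrite 1' \
  'gzip      1' \
  'jobs      4' \
  'skip      Mate#Morpho;XIP' \
  'log       ERROR' > "$conf"

# Show the result for review before using it.
cat "$conf"
```

 The file could then be passed with C<-cfg korapxml2krill.conf>; parameters given on the command line still take precedence over the file.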


 =item B<--temporary-extract|-te>

 Only valid for the C<archive> command.

 This will first extract all files into a
 directory and then will archive.
 If the directory is given as C<:temp:>,
 a temporary directory is used.
 This is especially useful to avoid
 massive unzipping and potential
 network latency.


 =item B<--sigle|-sg>

 Extract the given texts.
 Can be set multiple times.
 I<Currently only supported on C<extract>.>
 Sigles have the structure C<Corpus>/C<Document>/C<Text>.
 In case the C<Text> path is omitted, the whole document will be extracted.
 On the document level, the postfix wildcard C<*> is supported.


 =item B<--log|-l>

 The L<Log4perl|Log::Log4perl> log level, defaults to C<ERROR>.


 =item B<--help|-h>

 Print this document.


 =item B<--version|-v>

 Print version information.

 =back


 =head1 ANNOTATION SUPPORT

 L<KorAP::XML::Krill> has built-in importers for some annotation foundries and layers
 developed in the KorAP project that are part of the KorAP preprocessing pipeline.
 The base foundry with paragraphs, sentences, and the text element is mandatory for
 L<Krill|https://github.com/KorAP/Krill>.

   Base
     #Paragraphs
     #Sentences

   Connexor
     #Morpho
     #Phrase
     #Sentences
     #Syntax

   CoreNLP
     #Constituency
     #Morpho
     #NamedEntities
     #Sentences

   CMC
     #Morpho

   DeReKo
     #Structure

   DGD
     #Morpho

   DRuKoLa
     #Morpho

   Glemm
     #Morpho

   HNC
     #Morpho

   LWC
     #Dependency

   Malt
     #Dependency

   MarMoT
     #Morpho

   Mate
     #Dependency
     #Morpho

   MDParser
     #Dependency

   OpenNLP
     #Morpho
     #Sentences

   Sgbr
     #Lemma
     #Morpho

   TreeTagger
     #Morpho
     #Sentences

   XIP
     #Constituency
     #Morpho
     #Sentences


 More importers are in preparation.
 New annotation importers can be defined in the C<KorAP::XML::Annotation> namespace.
 See the built-in annotation importers as examples.


 =head1 AVAILABILITY

   https://github.com/KorAP/KorAP-XML-Krill


 =head1 COPYRIGHT AND LICENSE

 Copyright (C) 2015-2019, L<IDS Mannheim|http://www.ids-mannheim.de/>

 Author: L<Nils Diewald|http://nils-diewald.de/>

 Contributor: Eliza Margaretha

 L<KorAP::XML::Krill> is developed as part of the L<KorAP|http://korap.ids-mannheim.de/>
 Corpus Analysis Platform at the
 L<Leibniz Institute for the German Language (IDS)|http://ids-mannheim.de/>,
 member of the
 L<Leibniz-Gemeinschaft|http://www.leibniz-gemeinschaft.de/en/about-us/leibniz-competition/projekte-2011/2011-funding-line-2/>.

 This program is free software published under the
 L<BSD-2 License|https://raw.githubusercontent.com/KorAP/KorAP-XML-Krill/master/LICENSE>.

 =cut