commit	07d97146420a0909c9d2c21ab21c95069b3a98b1
author	Marc Kupietz <kupietz@ids-mannheim.de>	Mon Sep 07 18:03:34 2020 +0200
committer	Marc Kupietz <kupietz@ids-mannheim.de>	Tue Sep 08 08:23:44 2020 +0200
tree	a14bb644bb2b3e729ac0a5a1de58b6ba78d6c83f
parent	c315c2a64a95f9fcf08ae30fef097179bddf7003

Readme.md

KorAP Tokenizer

Efficient, OpenNLP tools compatible DFA tokenizer with character offset output based on JFlex, suitable for German and other European languages.

Description

The KorAP tokenizer is used for the German Reference Corpus DeReKo. Being based on a finite state automaton, it is not as accurate as language-model-based tokenizers, but at ~5 billion words per hour it is typically more efficient. Another important feature in the DeReKo/KorAP context is that it reliably reports the character offsets of the tokens, so that this information can be used for applying standoff annotations.
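To illustrate why reliable offsets matter for standoff annotation, here is a minimal Python sketch in which the text stays untouched and each annotation only points into it. The `Annotation` record, its field names and the labels are hypothetical and not a KorAP format; the sketch only assumes that a span is given by the offset of its first character and the offset of the first character after it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical standoff record: it refers to the text, it does not contain it."""
    start: int  # offset of the first character of the span
    end: int    # offset of the first character after the span
    label: str  # illustrative annotation value


def covered_text(text: str, ann: Annotation) -> str:
    """Return the part of the text that an annotation refers to."""
    return text[ann.start:ann.end]


text = "This is a text."
annotations = [Annotation(0, 4, "PRON"), Annotation(5, 7, "VERB")]

for ann in annotations:
    print(ann.label, repr(covered_text(text, ann)))
```

Because the annotations live outside the text, several annotation layers can refer to the same primary data, provided the offsets are exact.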

The main class KorAPTokenizerImpl implements the opennlp.tools.tokenize.Tokenizer interface and can thus be used as a drop-in replacement in OpenNLP applications.

The scanner is based on the Lucene scanner with modifications from David Hall.

Our changes mainly concern good coverage of German abbreviations and some updates for handling computer-mediated communication, optimized and tested against the gold data from the EmpiriST 2015 shared task (Beißwenger et al. 2016).

Installation

$ mvn clean install

… with changed jflex tokenizer source

Because of the large table of abbreviations, the conversion from the JFlex source to Java, i.e. the calculation of the DFA, takes more than 10 minutes and requires a lot of heap space.

For this reason, the Java source generated from the JFlex source is distributed with the source code and is not deleted by mvn clean.

If you want to modify the JFlex source while keeping the abbreviation lists, you will need at least 10 GB of free RAM and must set the Maven options accordingly, e.g.:

$ MAVEN_OPTS="-Xss600m -Xmx16000m" mvn clean install

Documentation

The KorAP tokenizer reads from standard input and writes to standard output. It currently supports two modes.

In the default mode, the tokenizer prints, for each token, the offset of its first character and the offset of the first character after it. To end a text, flush the output and reset the character position, send the magic escape sequence \n\x03\n.
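As a sketch of how a calling program might frame several texts for the tokenizer, the following Python snippet appends the escape sequence to each text. The helper names `frame_texts` and `run_tokenizer` are hypothetical; the jar path is left to the caller, and piping the payload through `subprocess.run` is only one possible way to drive the command-line tool.

```python
import subprocess

END_OF_TEXT = "\n\x03\n"  # escape sequence that ends a text


def frame_texts(texts):
    """Concatenate texts, terminating each one with the end-of-text sequence."""
    return "".join(text + END_OF_TEXT for text in texts)


def run_tokenizer(jar_path, texts):
    """Pipe the framed texts through the tokenizer jar and return its raw stdout.

    Assumes a `java` executable on the PATH and a jar built beforehand.
    """
    result = subprocess.run(
        ["java", "-jar", jar_path],
        input=frame_texts(texts),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


payload = frame_texts(["This is a text.", "And this is another text."])
print(repr(payload))
```

With `text=True` the payload is encoded using the platform default; if the texts contain non-ASCII characters, it may be safer to pass an explicit `encoding` argument.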

Invocation Example

$ echo -n -e 'This is a text.\x0a\x03\x0aAnd this is another text.\n\x03\n' |\
   java -jar target/KorAP-Tokenizer-1.2-SNAPSHOT.jar

0 4 5 7 8 9 10 15 
0 3 4 8 9 11 12 19 20 25
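The output can be turned back into tokens by pairing up the offsets and slicing the original text. The Python sketch below assumes, as the example suggests, that each input text yields one line of whitespace-separated offsets; the function names are hypothetical.

```python
def parse_offsets(line):
    """Turn one output line into a list of (start, end) pairs."""
    numbers = [int(n) for n in line.split()]
    if len(numbers) % 2 != 0:
        raise ValueError("expected an even number of offsets")
    # Even positions are start offsets, odd positions are end offsets.
    return list(zip(numbers[0::2], numbers[1::2]))


def tokens_from_offsets(text, line):
    """Recover the token strings by slicing the text; end offsets are exclusive."""
    return [text[start:end] for start, end in parse_offsets(line)]


texts = ["This is a text.", "And this is another text."]
output_lines = ["0 4 5 7 8 9 10 15", "0 3 4 8 9 11 12 19 20 25"]

for text, line in zip(texts, output_lines):
    print(tokens_from_offsets(text, line))
```

Since the character position is reset after each text, the offsets in every line are relative to the start of that text, not to the whole input stream.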

Development and License

Authors:

This package is developed as part of the KorAP Corpus Analysis Platform at the Leibniz Institute for German Language (IDS).

The package contains code from Apache Lucene with modifications by Jim Hall.

It is published under the Apache 2.0 License.

Contributions

Contributions are very welcome!

Your contributions should ideally be committed via our Gerrit server to facilitate reviewing (see Gerrit Code Review - A Quick Introduction if you are not familiar with Gerrit). However, we are also happy to accept comments and pull requests via GitHub.

References

Beißwenger, Michael / Bartsch, Sabine / Evert, Stefan / Würzner, Kay-Michael (2016). EmpiriST 2015: A Shared Task on the Automatic Linguistic Annotation of Computer-Mediated Communication and Web Corpora. pp. 44-56. DOI: 10.18653/v1/W16-2606.