KorAP Tokenizer

Interface and implementation of a tokenizer and sentence splitter that can be used
for German, English, French, and, with some limitations, also for other languages
as a standalone tokenizer and/or sentence splitter
or within the KorAP ingestion pipeline
or within the OpenNLP tools framework.

The included implementations (DerekoDfaTokenizer_de, DerekoDfaTokenizer_de_old, DerekoDfaTokenizer_en, DerekoDfaTokenizer_fr) are highly efficient DFA tokenizers and sentence splitters, generated with JFlex, that can output character offsets. The de variant is used for the German Reference Corpus DeReKo and supports gender-sensitive forms (e.g., Nutzer:in, Nutzer/innen). Because they are based on finite-state automata, the tokenizers are potentially less accurate than language-model-based ones, but at ~5 billion words per hour they are typically more efficient. Another important feature in the DeReKo/KorAP context is that token character offsets can be reported, which makes it possible to apply standoff annotations.

German Tokenizer Variants

de (default): Modern German with support for gender-sensitive forms. Forms like Nutzer:in, Nutzer/innen, Kaufmann/frau are kept as single tokens.
de_old: Traditional German without gender-sensitive rules. These forms are split into separate tokens (e.g., Nutzer:in → Nutzer : in). Useful for processing older texts or when gender forms should not be treated specially.

Complexity and Performance

Unlike simple script-based or regex-based tokenizers, the KorAP Tokenizer uses high-performance Deterministic Finite Automata (DFA) generated by JFlex. This allows for extremely high throughput (5-20 MB/s) while handling thousands of complex rules and abbreviations simultaneously (see Diewald/Kupietz/Lüngen 2022).

The following table shows the complexity of the underlying automata for each language variant:

Language	DFA States	DFA Transitions (Edges)	Generated Java Code
German (`de`)	~15,000	1,737,648	~67,000 lines
German (`de_old`)	~15,000	1,669,140	~61,000 lines
English (`en`)	~15,000	1,186,205	~38,000 lines
French (`fr`)	~15,000	1,188,825	~38,000 lines

The German DFA is larger primarily because of the integrated list of over 5,000 specialized abbreviations and the complex lookahead rules for gender-sensitive forms (e.g., distinguishing :in from namespace colons).

The included implementations of the KorapTokenizer interface also implement the opennlp.tools.tokenize.Tokenizer and opennlp.tools.sentdetect.SentenceDetector interfaces and can thus be used as drop-in replacements in OpenNLP applications.

The underlying scanner is based on the Lucene scanner with modifications from David Hall.

Our changes mainly concern good coverage of German abbreviations, and optionally of some English and French ones, as well as updates for handling computer-mediated communication. For German, these were optimized and tested against the gold data from the EmpiriST 2015 shared task (Beißwenger et al. 2016).

Installation

mvn clean package

Note

Because of the complexity of the task and the large table of abbreviations, the conversion from the JFlex source to Java, i.e. the calculation of the DFA, takes about 15 to 60 minutes, depending on your hardware, and requires a lot of heap space.

For development, you can disable the large abbreviation lists to speed up the build:

mvn clean generate-sources -Dforce.fast=true

Example Usage

By default, the KorAP Tokenizer reads from standard input and writes to standard output. It supports multiple modes of operation.

Split English text into tokens

$ echo "It's working." | java -jar target/KorAP-Tokenizer-*-standalone.jar -l en
It
's
working
.

Split French text into tokens and sentences

$ echo "C'est une phrase. Ici, il s'agit d'une deuxième phrase." \
  | java -jar target/KorAP-Tokenizer-*-standalone.jar -s -l fr
C'
est
une
phrase
.

Ici
,
il
s'
agit
d'
une
deuxième
phrase
.

Print token character offsets

With the --positions option, the tokenizer prints, for each token, the offset of its first character and the offset of the first character after it. To end a text, flush the output, and reset the character position, an EOT character (0x04) can be used.

$ echo -n -e 'This is a text.\x0a\x04\x0aAnd this is another text.\n\x04\n' |\
     java -jar target/KorAP-Tokenizer-*-standalone.jar  --positions
This
is
a
text
.
0 4 5 7 8 9 10 14 14 15
And
this
is
another
text
.
0 3 4 8 9 11 12 19 20 24 24 25
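
The offsets line can be read as consecutive start/end pairs, one pair per token, where the end offset is exclusive. The following is a minimal Java sketch of how such a line could be mapped back onto the input text. It is not part of the KorAP Tokenizer API: the class name OffsetPairs and its methods are hypothetical helpers, and the sketch assumes that the reported offsets line up with Java String indices (UTF-16 code units), which holds for the ASCII example above but is an assumption for text outside the Basic Multilingual Plane.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the KorAP Tokenizer itself.
class OffsetPairs {

    /** Parses a whitespace-separated offsets line into [start, end) pairs. */
    static List<int[]> parseSpans(String offsetsLine) {
        List<int[]> spans = new ArrayList<>();
        String trimmed = offsetsLine.trim();
        if (trimmed.isEmpty()) {
            return spans; // no tokens reported
        }
        String[] fields = trimmed.split("\\s+");
        if (fields.length % 2 != 0) {
            throw new IllegalArgumentException("odd number of offsets: " + fields.length);
        }
        for (int i = 0; i < fields.length; i += 2) {
            int start = Integer.parseInt(fields[i]);
            int end = Integer.parseInt(fields[i + 1]);
            spans.add(new int[] { start, end });
        }
        return spans;
    }

    /** Cuts the token strings out of the original text using the spans. */
    static List<String> sliceTokens(String text, List<int[]> spans) {
        List<String> tokens = new ArrayList<>();
        for (int[] span : spans) {
            // Assumption: offsets are valid Java String indices for this text.
            tokens.add(text.substring(span[0], span[1]));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "This is a text.";
        String offsets = "0 4 5 7 8 9 10 14 14 15"; // first offsets line from the example above
        System.out.println(sliceTokens(text, parseSpans(offsets)));
        // prints [This, is, a, text, .]
    }
}
```

Because the character position is reset after each EOT, the offsets of the second text are relative to the start of that text, not to the start of the whole input stream.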

Print token and sentence offsets

$ echo -n -e ' This ist a start of a text. And this is a sentence!!! But what the hack????\x0a\x04\x0aAnd this is another text.'  |\
   java -jar target/KorAP-Tokenizer-*-standalone.jar --no-tokens --positions --sentence-boundaries
1 5 6 9 10 11 12 17 18 20 21 22 23 27 27 28 29 32 33 37 38 40 41 42 43 51 51 54 55 58 59 63 64 67 68 72 72 76
1 28 29 54 55 76
0 3 4 8 9 11 12 19 20 24 24 25
0 25
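
In this output, each text yields two lines: the token offsets followed by the sentence offsets, both as start/end pairs with an exclusive end. As a sketch of how the two lines relate, the following Java snippet assigns each token to the sentence span that contains it. The class SentenceGrouping and its methods are hypothetical and not part of the KorAP Tokenizer; the sketch assumes that both lines are sorted by position and that offsets can be used as Java String indices.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the KorAP Tokenizer itself.
class SentenceGrouping {

    /** Parses a whitespace-separated offsets line into a flat int array. */
    static int[] parseOffsets(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return new int[0];
        }
        String[] fields = trimmed.split("\\s+");
        int[] offsets = new int[fields.length];
        for (int i = 0; i < fields.length; i++) {
            offsets[i] = Integer.parseInt(fields[i]);
        }
        return offsets;
    }

    /** Collects, for each sentence span, the tokens whose spans lie inside it. */
    static List<List<String>> groupTokens(String text, int[] tok, int[] sent) {
        List<List<String>> sentences = new ArrayList<>();
        int t = 0; // index of the current token's start offset in tok
        for (int s = 0; s + 1 < sent.length; s += 2) {
            int start = sent[s];
            int end = sent[s + 1];
            // Skip tokens that begin before this sentence (assumes sorted input).
            while (t + 1 < tok.length && tok[t] < start) {
                t += 2;
            }
            List<String> tokens = new ArrayList<>();
            while (t + 1 < tok.length && tok[t + 1] <= end) {
                tokens.add(text.substring(tok[t], tok[t + 1]));
                t += 2;
            }
            sentences.add(tokens);
        }
        return sentences;
    }

    public static void main(String[] args) {
        // First text and its two offset lines from the example above.
        String text = " This ist a start of a text. And this is a sentence!!! But what the hack????";
        int[] tok = parseOffsets("1 5 6 9 10 11 12 17 18 20 21 22 23 27 27 28 29 32 33 37 38 40 "
                + "41 42 43 51 51 54 55 58 59 63 64 67 68 72 72 76");
        int[] sent = parseOffsets("1 28 29 54 55 76");
        for (List<String> sentence : groupTokens(text, tok, sent)) {
            System.out.println(sentence);
        }
        // prints:
        // [This, ist, a, start, of, a, text, .]
        // [And, this, is, a, sentence, !!!]
        // [But, what, the, hack, ????]
    }
}
```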

Adding Support for more Languages

To adapt the included implementations to more languages, take one of the language-specific_<language>.jflex-macro files as a template and modify, for example, the SEABBR macro for abbreviations. Then add an execution section for the new language to the jcp (java-comment-preprocessor) artifact in pom.xml, following the example of one of the configurations there. After building the project (see Installation above), your added language-specific tokenizer / sentence splitter should be selectable with the --language option.

Alternatively, you can also provide KorAPTokenizer implementations independently on the class path and select them with the --tokenizer-class option.

Development and License

Authors:

Contributor:

Gregor Middell

This package is developed as part of the KorAP Corpus Analysis Platform at the Leibniz Institute for German Language (IDS).

The package contains code from Apache Lucene with modifications by David Hall.

It is published under the Apache 2.0 License.

Contributions

Contributions are very welcome!

Your contributions should ideally be committed via our Gerrit server to facilitate reviewing (see Gerrit Code Review - A Quick Introduction if you are not familiar with Gerrit). However, we are also happy to accept comments and pull requests via GitHub.

References

Beißwenger, Michael / Bartsch, Sabine / Evert, Stefan / Würzner, Kay-Michael (2016): EmpiriST 2015: A Shared Task on the Automatic Linguistic Annotation of Computer-Mediated Communication and Web Corpora. pp. 44-56. DOI: 10.18653/v1/W16-2606.
Diewald, Nils / Kupietz, Marc / Lüngen, Harald (2022): Tokenizing on scale – Preprocessing large text corpora on the lexical and sentence level. In Klosa-Kückelhaus, Annette / Engelberg, Stefan / Möhrs, Christine / Storjohann, Petra (eds.): Proceedings of the XX EURALEX International Congress (EURALEX 2022).
