Tokenization Benchmark

This repository contains benchmark scripts for comparing different tokenizers and sentence segmenters for German. For trouble-free testing, all tools are provided via a Dockerfile.

This work will be presented at EURALEX 2022. Please cite as:

Diewald, N./Kupietz, M./Lüngen, H. (2022): Tokenizing on scale - Preprocessing large text corpora on the lexical and sentence level. In: Proceedings of EURALEX 2022. Mannheim, Germany.

Creating the container

To build the Docker image, run

$ docker build -f Dockerfile -t korap/tokenbench .

This creates an image of approximately 12 GB.

Running the evaluation suite

To run the benchmark, call

$ docker run --rm -i \
  -v ${PWD}/benchmarks:/tokenbench/benchmarks \
  -v ${PWD}/corpus:/tokenbench/corpus \
  korap/tokenbench benchmarks/[BENCHMARK-SCRIPT]

The supported benchmark scripts are:

benchmark.pl

Performance measurements of the tools. See the tools section for some remarks to take into account. Accepts two numerical parameters:

  • The duplication count of the example file
  • The number of iterations
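As a minimal sketch, a complete call might look like the following. It assumes that both parameters are passed as positional arguments after the script name, which this README does not state explicitly. The values 100 and 3 are arbitrary examples, and DRY_RUN is a helper variable of the sketch, not an option of the benchmark.

```shell
#!/bin/sh
# Hypothetical example values; adjust them to your setup.
DUPLICATION=100   # duplication count of the example file
ITERATIONS=3      # number of iterations

# Assumption: both parameters follow the script name as positional arguments.
set -- docker run --rm -i \
  -v "${PWD}/benchmarks:/tokenbench/benchmarks" \
  -v "${PWD}/corpus:/tokenbench/corpus" \
  korap/tokenbench benchmarks/benchmark.pl "$DUPLICATION" "$ITERATIONS"

# DRY_RUN is a helper variable of this sketch: by default the command is
# only printed; set DRY_RUN=0 to actually run it.
if [ "${DRY_RUN:-1}" = "1" ]; then
  printf '%s\n' "$*"
else
  "$@"
fi
```

Printing the command first makes it easy to check the mounts and parameters before starting a long benchmark run.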

benchmark_batches.pl

Performance measurements of the tools at increasing input sizes. See the tools section for some remarks to take into account. Accepts one numerical parameter:

  • The number of iterations

The script runs all tools on batches of 1000, 2000, 4000, 8000 ... 8192000 tokens.
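The listed sizes suggest that the batch size doubles at every step. Assuming this holds for the whole series, the sizes can be enumerated with a short loop. This is only an illustration and is not taken from benchmark_batches.pl.

```shell
#!/bin/sh
# Enumerate the batch sizes, assuming each batch is twice the previous one.
size=1000
count=0
while [ "$size" -le 8192000 ]; do
  echo "$size"
  size=$((size * 2))
  count=$((count + 1))
done
echo "batches: $count"
```

Under this assumption the series has 14 batch sizes, so every iteration runs each tool 14 times.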

empirist.pl

To run the EmpiriST evaluation suite, you first need to download the EmpiriST gold standard corpus and tooling and extract them into the corpus directory:

$ wget https://sites.google.com/site/empirist2015/home/shared-task-data/empirist_gold_cmc.zip
$ unzip empirist_gold_cmc.zip -d corpus

$ wget https://sites.google.com/site/empirist2015/home/shared-task-data/empirist_gold_web.zip
$ unzip empirist_gold_web.zip -d corpus

To inspect the output, start the benchmark with the output folders mounted by adding the following options to the docker run call:

-v ${PWD}/output_cmc:/tokenbench/empirist_cmc
-v ${PWD}/output_web:/tokenbench/empirist_web
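Put together, the call could look like the sketch below. It assumes that the two output mounts are simply added to the docker run call from the section on running the evaluation suite and that empirist.pl needs no further parameters. DRY_RUN is a helper variable of the sketch, not an option of the benchmark.

```shell
#!/bin/sh
# Assumption: the output mounts are added to the standard docker run call;
# output_cmc and output_web live in the current directory.
set -- docker run --rm -i \
  -v "${PWD}/benchmarks:/tokenbench/benchmarks" \
  -v "${PWD}/corpus:/tokenbench/corpus" \
  -v "${PWD}/output_cmc:/tokenbench/empirist_cmc" \
  -v "${PWD}/output_web:/tokenbench/empirist_web" \
  korap/tokenbench benchmarks/empirist.pl

# By default the command is only printed; set DRY_RUN=0 to run it.
if [ "${DRY_RUN:-1}" = "1" ]; then
  printf '%s\n' "$*"
else
  # Create the host folders first so that they belong to the current user.
  mkdir -p output_cmc output_web
  "$@"
fi
```

After a real run, the tool output should be available in output_cmc and output_web on the host.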

ud_tokens.pl

To run the token evaluation suite against the Universal Dependencies corpus, first install the EmpiriST tooling as explained above, then download the corpus:

$ wget https://github.com/UniversalDependencies/UD_German-GSD/raw/master/de_gsd-ud-train.conllu \
  -O corpus/de_gsd-ud-train.conllu
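To check that the download is complete, it can help to count the sentences and words in the file. The following sketch relies only on the CoNLL-U format: word lines start with an integer ID, and every sentence contains exactly one word with ID 1. Multiword token ranges such as 1-2 and empty nodes such as 1.1 are skipped. The function name conllu_stats is chosen for illustration and is not part of the benchmark.

```shell
#!/bin/sh
# Count sentences and words in a CoNLL-U file (illustrative helper).
conllu_stats() {
  awk -F '\t' '
    # Every sentence has exactly one word line with ID 1.
    $1 == "1"       { sentences++ }
    # Word lines have a plain integer ID; ranges (1-2), empty nodes (1.1),
    # comments and blank lines do not match.
    $1 ~ /^[0-9]+$/ { words++ }
    END { printf "sentences=%d words=%d\n", sentences, words }
  ' "$1"
}

# Usage:
# conllu_stats corpus/de_gsd-ud-train.conllu
```

Note that this counts syntactic words, which can differ from the number of surface tokens when the corpus contains multiword tokens.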

ud_sentences.pl

To run the sentence evaluation suite, first download the Universal Dependencies corpus as explained above.

Caveat

When running this benchmark using Docker, you may need to run the container in privileged mode to get meaningful results:

docker run --privileged -v

Tools

Our tools for token and sentence boundary detection:

  • KorAP-Tokenizer is rule-based. Using the lexical analyzer generator JFlex, it compiles a list of regular expressions into a deterministic finite state automaton that can introduce segment boundaries at terminal nodes. The rule set is based on Apache Lucene's tokenizer and has been extensively modified. Rule sets are available for English, French and German. KorAP-Tokenizer is used in production for tokenization and (among other tools) for sentence segmentation of DeReKo.
  • Datok is rule-based and generates an extended deterministic finite state automaton based on a finite state transducer generated by XFST (Beesley/Karttunen 2003), which is reduced to a few transition rules and can be interpreted by Datok for tokenization and sentence segmentation. The rule set of KorAP-Tokenizer was transferred to XFST for this purpose. The generation is done with Foma (Hulden 2009). Rule sets are only available for German at this time. Datok is currently being evaluated experimentally.

Tools for token and sentence boundary detection:

  • SoMaJo (Proisl/Uhrig 2016) is rule-based and applies a list of regular expressions to segment a text. SoMaJo won first place in the EmpiriST 2015 Shared Task on tokenizing German-language Web and CMC corpora and has been regularly improved since then. SoMaJo is available specifically for German.
  • Cutter (Graën et al. 2018) is rule-based and recursively applies language-specific and language-independent rules to a text to segment it. Compared to other rule-based tools, Cutter uses a context-free rather than a regular grammar.
  • OpenNLP is a framework that offers both tokenizers and sentence segmenters in different models. Both tools are based on a maximum entropy approach. In addition, OpenNLP offers SimpleTokenizer, a tool based on simple character class decisions.
  • JTok is based on cascading regular expressions that segment tokens until they can be assigned to a token class, which (cf. SoMaJo) can also be returned. Rules exist for English, German and Italian.
  • Waste (Jurish/Würzner 2013) is based on a hidden Markov model in which a pre-segmented stream of (pseudo)tokens is re-evaluated at the boundaries found and classified as to whether the tokens are word-initial or sentence-initial.
  • Stanford Tokenizer is rule-based and relies on JFlex (cf. KorAP-Tokenizer) to compile a deterministic finite state automaton based on a list of regular expressions that can introduce segment boundaries at terminal nodes.
  • SpaCy is a framework in which the tokenization stage is rule-based and runs in several phases in which the tokens are split into increasingly finer segments. Rule sets are provided for numerous languages. Different models are offered for sentence segmentation: Sentencizer is rule-based, Dependency Parser performs a syntactic analysis, and Statistical segments based on a simple statistical model.
  • Syntok is rule-based and applies successive separation rules, primarily in the form of regular expressions, to an input string for segmentation. There is both a tokenizer and a sentence segmenter based on it. Rules exist for Spanish, English, and German.
  • BlingFire is rule-based and compiles a deterministic finite state automaton based on regular expressions, which segments at terminal nodes. The tested model is implemented cross-language with a focus on English.

Tools for token boundary detection only:

  • TreeTagger (Schmid 1994) is a part-of-speech tagger that comes with a separate rule-based tokenization tool that also uses a set of regular expressions to segment a text. TreeTagger does not itself introduce markers for sentence boundaries.
  • Elephant (Evang et al. 2013) is a machine-trained system for segmentation based on Conditional Random Fields and Recurrent Neural Networks. We evaluate here a wrapper implementation (Moreau/Vogel 2018) that considers only token segmentation and not sentence segmentation, although Elephant provides both.

Tools for sentence boundary detection only:

  • Deep-EOS (Schweter/Ahmed 2019) is based on different implementations of neural networks with long short-term memory (LSTM), bidirectional LSTM, and convolutional neural networks. It is not based on pre-tokenization and operates directly on character streams.
  • NNSplit is a machine-trained approach based on a byte-level LSTM neural network.

Results

Speed was measured on the native output of each tool. Measuring accuracy required further reshaping of the output to make it comparable to the gold standard.

Literature

  • Beesley, K. R./Karttunen, L. (2003): Finite State Morphology. CSLI Publications.
  • Evang, K./Basile, V./Chrupała, G./Bos, J. (2013): Elephant: Sequence Labeling for Word and Sentence Segmentation. Proceedings of the EMNLP 2013: Conference on Empirical Methods in Natural Language Processing, Seattle, US.
  • Graën, J./Bertamini, M./Volk, M. (2018): Cutter – a universal multilingual tokenizer. In: Cieliebak, M./Tuggener, D./Benites, F. (eds.): Swiss Text Analytics Conference, No. 2226, pp. 75–81. CEUR-WS.
  • Hulden, M. (2009): Foma: A finite-state toolkit and library. Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pp. 29–32.
  • Jurish, B./Würzner, K.-M. (2013): Word and Sentence Tokenization with Hidden Markov Models. JLCL, 28 (2), pp. 61–83.
  • Moreau, E./Vogel, C. (2018): Multilingual Word Segmentation: Training Many Language-Specific Tokenizers Smoothly Thanks to the Universal Dependencies Corpus. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Miyazaki, Japan.
  • Proisl, T./Uhrig, P. (2016): SoMaJo: State-of-the-art tokenization for German web and social media texts. Proceedings of the 10th Web as Corpus Workshop, pp. 57–62.
  • Schmid, H. (1994): Probabilistic Part-of-Speech Tagging Using Decision Trees. Proceedings of International Conference on New Methods in Language Processing.
  • Schweter, S./Ahmed, S. (2019): Deep-EOS: General-Purpose Neural Networks for Sentence Boundary Detection. Proceedings of the 15th Conference on Natural Language Processing (KONVENS). KONVENS, Erlangen, Germany.