Add tools and literature to repository

Change-Id: I481575b457ec905ca53f90d0cd6c3169bcd6b80a

commit: 1448b11568319a11fc3933f6ac9f8bbc996a8189
author: Akron <nils@diewald-online.de> Mon Mar 21 23:31:49 2022 +0100
committer: Akron <nils@diewald-online.de> Mon Mar 21 23:31:49 2022 +0100
tree: d7933833fa7e61bc2fe5e3a791f4cbaa350ff525
parent: 325193e098e45f7a8eb174e3393af1e615c18eb1
diff --git a/Readme.md b/Readme.md
index 8f75046..70b6f89 100644
--- a/Readme.md
+++ b/Readme.md

@@ -1,23 +1,18 @@
-# Creating the container
+# EURALEX 2022 - Tokenization Benchmark
+
+This repository contains benchmark scripts for comparing different tokenizers and sentence segmenters for German. For trouble-free testing, all tools are provided through a Dockerfile.
+
+## Creating the container
 
 To build the Docker image, run
 
 ```shell
 $ docker build -f Dockerfile -t korap/euralex22 .
 ```
-This will download and install an image of approximately 6GB.
-
-It will download and install the following
-tokenizers in an image to your system:
-
-...
-
-To run the evaluation suite ...
-
-...
+This will create and install an image of approximately 12GB.
 
 
-# Running the evaluation suite
+## Running the evaluation suite
 
 To run the benchmark, call
 
@@ -30,7 +25,7 @@
 
 The supported benchmark scripts are:
 
-## `benchmark.pl`
+### `benchmark.pl`
 
 Performance measurements of the tools. See the tools section for some
 remarks to take into account. Accepts two numerical parameters:
@@ -38,8 +33,7 @@
 - The duplication count of the example file
 - The number of iterations
 
-
-## `empirist.pl`
+### `empirist.pl`
 
 To run the empirist evaluation suite, you first need to download
 the empirist gold standard corpus and tooling, and extract it into
@@ -53,8 +47,6 @@
 $ unzip empirist_gold_web.zip -d corpus
 ```
 
-Quality measurements based on EmpiriST 2015.
-
 To investigate the output, start the benchmark with mounted
 output folders
 
@@ -63,7 +55,7 @@
 -v ${PWD}/output_web:/euralex/empirist_web
 ```
 
-## `ud_tokens.pl`
+### `ud_tokens.pl`
 
 To run the token evaluation suite against the 
 [Universal Dependency](https://github.com/UniversalDependencies/UD_German-GSD)
@@ -75,54 +67,13 @@
   -O corpus/de_gsd-ud-train.conllu
 ```
 
-## `ud_sentences.pl`
+### `ud_sentences.pl`
 
 To run the sentence evaluation suite, first download the corpus
 as explained above.
 
 
-# Tools
-
-## Waste
-- Tokenization
-
-## OpenNLP
-- Tokenization
-
-## TreeTagger
-- Tokenization
-
-## JTok
-- Tokenization
-
-## SynTok
-- Tokenization
-
-## SoMaJo
-- Tokenization
-
-## Stanford CoreNLP
-- Tokenization
-
-All tools are run using [pipelining](https://stanfordnlp.github.io/CoreNLP/pipeline.html),
-which obviously introduces some overhead, that needs to be taken into account.
-
-## KorAP-Tokenizer
-- Tokenization + Sentence Splitting
-
-## Datok
-- Tokenization + Sentence Splitting
-
-
-# Licenses
-
-For Treetagger:
-Please read the [license terms](https://cis.uni-muenchen.de/~schmid/tools/TreeTagger/Tagger-Licence),
-before you download the software!
-By downloading the software, you agree to the terms stated there. 
-
-
-# Caveat
+## Caveat
 
 When running this benchmark using Docker you may need
 to run all processes privileged to get
@@ -132,4 +83,44 @@
 docker run --privileged -v
 ```
 
-# Literature
+## Tools
+
+### Our tools for token and sentence boundary detection:
+
+- [KorAP-Tokenizer](https://github.com/KorAP/KorAP-Tokenizer) is rule-based and compiles, using the lexical analysis generator framework [JFlex](https://jflex.de/), a list of regular expressions into a deterministic finite state automaton that can introduce segment boundaries at terminal nodes. The ruleset is based on [Apache Lucene](https://lucene.apache.org/)'s tokenizer and has been extensively modified. Rule sets are available for English, French and German. KorAP-Tokenizer is used productively for tokenization and (among other tools) for sentence segmentation of DeReKo.
+- [Datok](https://github.com/KorAP/Datok) is rule-based and generates an extended deterministic finite state automaton based on a finite state transducer generated by XFST (Beesley/Karttunen 2003), which is reduced to a few transition rules and can be interpreted by Datok for tokenization and sentence segmentation. The rule set of KorAP-Tokenizer was transferred to XFST for this purpose. The generation is done with Foma (Hulden 2009). Rule sets are only available for German at this time. Datok is currently being evaluated experimentally.
+
+### Tools for token and sentence boundary detection: 
+
+- [SoMaJo](https://github.com/tsproisl/SoMaJo) (Proisl/Uhrig 2016) is rule-based and applies a list of regular expressions to segment a text. SoMaJo won first place in the competition of the aforementioned EmpiriST 2015 Shared Task for tokenizing German-language Web and CMC corpora and has been regularly improved since then. SoMaJo is available specifically for German.
+- [Cutter](https://pub.cl.uzh.ch/wiki/public/cutter/start) (Graën et al. 2018) is rule-based and recursively applies language-specific and language-independent rules to a text to segment it. Compared to other rule-based tools, Cutter uses a context-free rather than a regular grammar.
+- [OpenNLP](https://opennlp.apache.org/) is a framework that offers both tokenizers and sentence segmenters in different models. Both tools are based on a maximum entropy approach. In addition, OpenNLP offers SimpleTokenizer, a tool based on simple character class decisions.
+- [JTok](https://github.com/DFKI-MLT/JTok) is based on cascading regular expressions that segment tokens until they can be assigned to a token class, which (cf. SoMaJo) can also be returned. Rules exist for English, German and Italian.
+- [Waste](https://kaskade.dwds.de/waste/) (Jurish/Würzner 2013) is based on a hidden Markov model in which the (pseudo)tokens of a pre-segmented stream are re-evaluated at the boundaries found and classified as word-initial or sentence-initial.
+- [Stanford Tokenizer](https://nlp.stanford.edu/software/tokenizer.shtml) is rule-based and relies on JFlex (cf. KorAP-Tokenizer) to compile a deterministic finite state automaton based on a list of regular expressions that can introduce segment boundaries at terminal nodes.
+- [SpaCy](https://spacy.io/usage/linguistic-features) is a framework in which the tokenization stage is rule-based and runs in several phases in which the tokens are split into increasingly finer segments. Rule sets are provided for numerous languages. Different models are offered for sentence segmentation: Sentencizer is rule-based, Dependency Parser performs a syntactic analysis, and Statistical segments based on a simple statistical model.
+- [Syntok](https://github.com/fnl/syntok) is rule-based and applies successive separation rules, primarily in the form of regular expressions, to an input string for segmentation. There is both a tokenizer and a sentence segmenter based on it. Rules exist for Spanish, English, and German.
+- [BlingFire](https://github.com/microsoft/BlingFire) is rule-based and compiles a deterministic finite state automaton based on regular expressions, which segments at terminal nodes. The tested model is implemented cross-language with a focus on English.
+
+### Tools for token boundary detection only:
+
+- [TreeTagger](https://cis.uni-muenchen.de/~schmid/tools/TreeTagger/) (Schmid 1994) is a part-of-speech tagger that ships with a separate rule-based tokenization tool that also uses a set of regular expressions to segment a text. TreeTagger does not itself introduce markers for sentence boundaries. Please read the [license terms](https://cis.uni-muenchen.de/~schmid/tools/TreeTagger/Tagger-Licence) before you download the software.
+- [Elephant](https://gmb.let.rug.nl/elephant/about.php) (Evang et al. 2013) is a machine-trained system for segmentation based on Conditional Random Fields and Recurrent Neural Networks. We evaluate here a [wrapper implementation](https://github.com/erwanm/elephant-wrapper) (Moreau/Vogel 2018) that considers only token segmentation and not sentence segmentation, although Elephant provides both.
+
+### Tools for sentence boundary detection only:
+
+- [Deep-EOS](https://github.com/dbmdz/deep-eos) (Schweter/Ahmed 2019) is based on different implementations of neural networks with long short-term memory (LSTM), bidirectional LSTM, and convolutional neural networks. It is not based on pre-tokenization and operates directly on character streams.
+- [NNSplit](https://bminixhofer.github.io/nnsplit/) is a machine-trained approach based on a byte-level LSTM neural network.
+
+
+## Literature
+
+- Beesley, K. R./Karttunen, L. (2003): Finite State Morphology. CSLI Publications.
+- Evang, K./Basile, V./Chrupała, G./Bos, J. (2013): Elephant: Sequence Labeling for Word and Sentence Segmentation. Proceedings of the EMNLP 2013: Conference on Empirical Methods in Natural Language Processing, Seattle, US.
+- Graën, J./Bertamini, M./Volk, M. (2018): [Cutter – a universal multilingual tokenizer](https://doi.org/10.5167/uzh-157243). In: Cieliebak, M./Tuggener, D./Benites, F. (eds.): Swiss text analytics conference, Nr. 2226, pp. 75–81. CEUR-WS.
+- Hulden, M. (2009): Foma: A finite-state toolkit and library. Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pp. 29–32.
+- Jurish, B./Würzner, K.-M. (2013): Word and Sentence Tokenization with Hidden Markov Models. JLCL, 28 (2), pp. 61–83.
+- Moreau, E./Vogel, C. (2018): [Multilingual Word Segmentation: Training Many Language-Specific Tokenizers Smoothly Thanks to the Universal Dependencies Corpus](https://aclanthology.org/L18-1180). Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Miyazaki, Japan.
+- Proisl, T./Uhrig, P. (2016): SoMaJo: State-of-the-art tokenization for German web and social media texts. Proceedings of the 10th Web as Corpus Workshop, pp. 57–62.
+- Schmid, H. (1994): Probabilistic Part-of-Speech Tagging Using Decision Trees. Proceedings of International Conference on New Methods in Language Processing.
+- Schweter, S./Ahmed, S. (2019): Deep-EOS: General-Purpose Neural Networks for Sentence Boundary Detection. Proceedings of the 15th Conference on Natural Language Processing (KONVENS). KONVENS, Erlangen, Germany.