Update complexity info in Readme
Change-Id: I11a909c063bda51b7e11abcd3c8469c38479a439
diff --git a/Readme.md b/Readme.md
index c85d814..6778093 100644
--- a/Readme.md
+++ b/Readme.md
@@ -19,6 +19,21 @@
- **`de`** (default): Modern German with support for gender-sensitive forms. Forms like `Nutzer:in`, `Nutzer/innen`, `Kaufmann/frau` are kept as single tokens.
- **`de_old`**: Traditional German without gender-sensitive rules. These forms are split into separate tokens (e.g., `Nutzer:in` → `Nutzer` `:` `in`). Useful for processing older texts or when gender forms should not be treated specially.
+### Complexity and Performance
+
+Unlike simple script- or regex-based tokenizers, the KorAP Tokenizer uses high-performance deterministic finite automata (DFAs) generated by JFlex. This allows for very high throughput (5–20 MB/s) while handling thousands of complex rules and abbreviations simultaneously (see Diewald/Kupietz/Lüngen 2022).
+
+The following table shows the complexity of the underlying automata for each language variant:
+
+| Language | DFA States | DFA Transitions (Edges) | Generated Java Code |
+| :--- | :--- | :--- | :--- |
+| **German** (`de`) | ~15,000 | 1,737,648 | ~67,000 lines |
+| **German** (`de_old`) | ~15,000 | 1,669,140 | ~61,000 lines |
+| **English** (`en`) | ~15,000 | 1,186,205 | ~38,000 lines |
+| **French** (`fr`) | ~15,000 | 1,188,825 | ~38,000 lines |
+
+The significant size of the German DFA is primarily due to the integrated list of over 5,000 specialized abbreviations and the complex lookahead rules for gender-neutral forms (e.g., handling `:in` vs. namespace colons).
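+The effect of these gender-form rules can be observed by comparing the two German variants directly. The following is an illustrative session (the expected tokenization follows the variant descriptions above; it assumes that `de_old` is accepted as a value for the `-l` option):
+
+```
+$ echo "Nutzer:in" | java -jar target/KorAP-Tokenizer-*-standalone.jar -l de
+Nutzer:in
+
+$ echo "Nutzer:in" | java -jar target/KorAP-Tokenizer-*-standalone.jar -l de_old
+Nutzer
+:
+in
+```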
+
The included implementations of the `KorapTokenizer` interface also implement the [`opennlp.tools.tokenize.Tokenizer`](https://opennlp.apache.org/docs/2.3.0/apidocs/opennlp-tools/opennlp/tools/tokenize/Tokenizer.html)
and [`opennlp.tools.sentdetect.SentenceDetector`](https://opennlp.apache.org/docs/2.3.0/apidocs/opennlp-tools/opennlp/tools/sentdetect/SentenceDetector.html)
@@ -31,12 +46,15 @@
## Installation
+
```shell
mvn clean package
```
+
#### Note
-Because of the large table of abbreviations, the conversion from the jflex source to java,
-i.e. the calculation of the DFA, takes about 20 to 40 minutes, depending on your hardware,
+
+Because of the complexity of the task and the large table of abbreviations, the conversion from the JFlex source to Java,
+i.e. the calculation of the DFA, takes about 15 to 60 minutes, depending on your hardware,
and requires a lot of heap space.
For development, you can disable the large abbreviation lists to speed up the build:
@@ -44,10 +62,14 @@
mvn clean generate-sources -Dforce.fast=true
```
+
## Example Usage
+
By default, the KorAP Tokenizer reads from standard input and writes to standard output. It supports multiple modes of operation.
#### Split English text into tokens
+
```
$ echo "It's working." | java -jar target/KorAP-Tokenizer-*-standalone.jar -l en
It
@@ -55,7 +77,9 @@
working
.
```
+
#### Split French text into tokens and sentences
+
```
$ echo "C'est une phrase. Ici, il s'agit d'une deuxième phrase." \
| java -jar target/KorAP-Tokenizer-*-standalone.jar -s -l fr
@@ -79,6 +103,7 @@
```
#### Print token character offsets
+
With the `--positions` option, the tokenizer additionally prints, for each token, the offset of its first character and the offset of the first character after the token.
To end a text, flush the output, and reset the character position, an EOT character (0x04) can be used.
```
@@ -125,7 +150,7 @@
**Contributor**:
* [Gregor Middell](https://github.com/gremid)
-Copyright (c) 2023-2025, [Leibniz Institute for the German Language](http://www.ids-mannheim.de/), Mannheim, Germany
+Copyright (c) 2023-2026, [Leibniz Institute for the German Language](http://www.ids-mannheim.de/), Mannheim, Germany
This package is developed as part of the [KorAP](http://korap.ids-mannheim.de/)
Corpus Analysis Platform at the Leibniz Institute for the German Language