Update complexity info in Readme
Change-Id: I11a909c063bda51b7e11abcd3c8469c38479a439
diff --git a/Readme.md b/Readme.md
index c85d814..6778093 100644
--- a/Readme.md
+++ b/Readme.md
@@ -19,6 +19,21 @@
- **`de`** (default): Modern German with support for gender-sensitive forms. Forms like `Nutzer:in`, `Nutzer/innen`, `Kaufmann/frau` are kept as single tokens.
- **`de_old`**: Traditional German without gender-sensitive rules. These forms are split into separate tokens (e.g., `Nutzer:in` → `Nutzer` `:` `in`). Useful for processing older texts or when gender forms should not be treated specially.
+### Complexity and Performance
+
+Unlike simple script- or regex-based tokenizers, the KorAP Tokenizer uses high-performance deterministic finite automata (DFAs) generated by JFlex. This allows for very high throughput (5–20 MB/s) while handling thousands of complex rules and abbreviations simultaneously (see Diewald/Kupietz/Lüngen 2022).
+
+The following table shows the complexity of the underlying automata for each language variant:
+
+| Language | DFA States | DFA Transitions (Edges) | Generated Java Code |
+| :--- | :--- | :--- | :--- |
+| **German** (`de`) | ~15,000 | 1,737,648 | ~67,000 lines |
+| **German** (`de_old`) | ~15,000 | 1,669,140 | ~61,000 lines |
+| **English** (`en`) | ~15,000 | 1,186,205 | ~38,000 lines |
+| **French** (`fr`) | ~15,000 | 1,188,825 | ~38,000 lines |
+
+The significant size of the German DFA is primarily due to the integrated list of over 5,000 specialized abbreviations and the complex lookahead rules for gender-neutral forms (e.g., handling `:in` vs. namespace colons).
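+The effect of these gender-form rules can be observed by comparing the two German variants directly. The following is an illustrative session (the expected tokenization follows the variant descriptions above; it assumes that `de_old` is accepted as a value for the `-l` option):
+
+```
+$ echo "Nutzer:in" | java -jar target/KorAP-Tokenizer-*-standalone.jar -l de
+Nutzer:in
+
+$ echo "Nutzer:in" | java -jar target/KorAP-Tokenizer-*-standalone.jar -l de_old
+Nutzer
+:
+in
+```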
+
The included implementations of the `KorapTokenizer` interface also implement the [`opennlp.tools.tokenize.Tokenizer`](https://opennlp.apache.org/docs/2.3.0/apidocs/opennlp-tools/opennlp/tools/tokenize/Tokenizer.html)
and [`opennlp.tools.sentdetect.SentenceDetector`](https://opennlp.apache.org/docs/2.3.0/apidocs/opennlp-tools/opennlp/tools/sentdetect/SentenceDetector.html)
@@ -31,12 +46,15 @@
## Installation
+
```shell
mvn clean package
```
+
#### Note
-Because of the large table of abbreviations, the conversion from the jflex source to java,
-i.e. the calculation of the DFA, takes about 20 to 40 minutes, depending on your hardware,
+
+Because of the complexity of the task and the large table of abbreviations, the conversion from the JFlex source to Java,
+i.e. the calculation of the DFA, takes about 15 to 60 minutes, depending on your hardware,
and requires a lot of heap space.
For development, you can disable the large abbreviation lists to speed up the build:
@@ -44,10 +62,14 @@
mvn clean generate-sources -Dforce.fast=true
```
+
## Example Usage
+
By default, the KorAP Tokenizer reads from standard input and writes to standard output. It supports multiple modes of operation.
#### Split English text into tokens
+
```
$ echo "It's working." | java -jar target/KorAP-Tokenizer-*-standalone.jar -l en
It
@@ -55,7 +77,9 @@
working
.
```
+
#### Split French text into tokens and sentences
+
```
$ echo "C'est une phrase. Ici, il s'agit d'une deuxième phrase." \
| java -jar target/KorAP-Tokenizer-*-standalone.jar -s -l fr
@@ -79,6 +103,7 @@
```
#### Print token character offsets
+
With the `--positions` option, the tokenizer additionally prints, for each token, the offset of its first character and the offset of the first character after the token.
To end a text, flush the output, and reset the character position, an EOT character (0x04) can be used.
```
@@ -125,7 +150,7 @@
**Contributor**:
* [Gregor Middell](https://github.com/gremid)
-Copyright (c) 2023-2025, [Leibniz Institute for the German Language](http://www.ids-mannheim.de/), Mannheim, Germany
+Copyright (c) 2023-2026, [Leibniz Institute for the German Language](http://www.ids-mannheim.de/), Mannheim, Germany
This package is developed as part of the [KorAP](http://korap.ids-mannheim.de/)
Corpus Analysis Platform at the Leibniz Institute for the German Language