Improve readme
Change-Id: I94bc735bc78e0a261e493a148dc658db30306a1a
diff --git a/Readme.md b/Readme.md
index a7930a2..123e333 100644
--- a/Readme.md
+++ b/Readme.md
@@ -1,17 +1,16 @@
# Datok - Finite State Tokenizer
-This is an implementation of an FSA for natural language
-tokenization, either in form of a matrix representation
-or as a double array.
-The system accepts a finite state transducer (FST)
-describing a tokenizer generated by
-[Foma](https://fomafst.github.io/)
-that needs to follow some conventional rules as described
-below.
+Implementation of a finite state automaton for
+natural language tokenization, based on a finite state
+transducer generated with [Foma](https://fomafst.github.io/).
+
+The library contains sources for a German tokenizer
+based on [KorAP-Tokenizer](https://github.com/KorAP/KorAP-Tokenizer).
## Conventions
-The FST generated by Foma must adhere to the following rules:
+For Datok to convert it, the FST generated by Foma must
+adhere to the following rules:
- Character accepting arcs need to be translated
*only* to themselves or to ε (the empty symbol).
@@ -26,7 +25,7 @@
A minimal usable tokenizer written in XFST and following
the guidelines to tokenizers in Beesley and Karttunen (2003)
-and Beesley (2004) could look like this:
+and Beesley (2004) would look like this:
```xfst
define TE "@_TOKEN_SYMBOL_@";
@@ -60,7 +59,7 @@
$ go build ./cmd/datok.go
```
-To create a foma file from example XFST sources, first install
+To create a foma file from the example sources, first install
[Foma](https://fomafst.github.io/), then run in
the root directory of this repository
@@ -72,17 +71,18 @@
```
This will load and compile `tokenizer.xfst` and will save
-the generated FST as `mytokenizer.fst`
+the compiled FST as `mytokenizer.fst`
in the root directory.
-To generate a matrix representation of this FST, run
+To generate a Datok FSA (matrix representation) based on
+this FST, run
```shell
$ datok convert -i mytokenizer.fst -o mytokenizer.datok
```
-To generate a double array representation
-of this FST, run
+To generate a Datok FSA (double array representation) based
+on this FST, run
```shell
$ datok convert -i mytokenizer.fst -o mytokenizer.datok -d
@@ -90,7 +90,7 @@
*Caution*: This may take some time depending on the number of arcs in the FST.
-The final datok file can then be used as an input to the tokenizer.
+The final datok file can then be used as a model for the tokenizer.
## Example
@@ -113,13 +113,16 @@
## Technology
-The double array representation (Aoe 1989) of all transitions
-in the FST is implemented as an extended DFA following Mizobuchi
-et al. (2000) and implementation details following Kanda et al. (2018).
+Internally, the FSA is represented
+either as a matrix or as a double array.
Both representations mark all non-word-character targets with a
leading bit. The transduction is greedy with a single backtracking
-option to the last ε transition.
+option to the last ε (aka *tokenend*) transition.
+
+The double array representation (Aoe 1989) of all transitions
+in the FST is implemented as an extended DFA following Mizobuchi
+et al. (2000) and implementation details following Kanda et al. (2018).
## License