Improve readme

Change-Id: I94bc735bc78e0a261e493a148dc658db30306a1a

commit: e0dffe015655ca4768a769a58d8a3181d525663b
author: Akron <nils@diewald-online.de> Fri Oct 15 19:28:11 2021 +0200
committer: Akron <nils@diewald-online.de> Fri Oct 15 19:28:11 2021 +0200
tree: 53ad62f34a6319917afaf67dc68a2e925e423031
parent: e7751b807975e77757a63456aa89094f490b9dce
diff --git a/Readme.md b/Readme.md
index a7930a2..123e333 100644
--- a/Readme.md
+++ b/Readme.md

@@ -1,17 +1,16 @@
 # Datok - Finite State Tokenizer
 
-This is an implementation of an FSA for natural language
-tokenization, either in form of a matrix representation
-or as a double array.
-The system accepts a finite state transducer (FST)
-describing a tokenizer generated by
-[Foma](https://fomafst.github.io/)
-that needs to follow some conventional rules as described
-below.
+Implementation of a finite state automaton for
+natural language tokenization, based on a finite state
+transducer generated with [Foma](https://fomafst.github.io/).
+
+The library contains sources for a german tokenizer
+based on [KorAP-Tokenizer](https://github.com/KorAP/KorAP-Tokenizer).
 
 ## Conventions
 
-The FST generated by Foma must adhere to the following rules:
+The FST generated by Foma must adhere to the following rules,
+to be converted by Datok:
 
 - Character accepting arcs need to be translated
   *only* to themselves or to ε (the empty symbol).
@@ -26,7 +25,7 @@
 
 A minimal usable tokenizer written in XFST and following
 the guidelines to tokenizers in Beesley and Karttunen (2003)
-and Beesley (2004) could look like this:
+and Beesley (2004) would look like this:
 
 ```xfst
 define TE "@_TOKEN_SYMBOL_@";
@@ -60,7 +59,7 @@
 $ go build ./cmd/datok.go
 ```
 
-To create a foma file from example XFST sources, first install
+To create a foma file from the example sources, first install
 [Foma](https://fomafst.github.io/), then run in
 the root directory of this repository
 
@@ -72,17 +71,18 @@
 ```
 
 This will load and compile `tokenizer.xfst` and will save
-the generated FST as `mytokenizer.fst`
+the compiled FST as `mytokenizer.fst`
 in the root directory.
 
-To generate a matrix representation of this FST, run
+To generate a Datok FSA (matrix representation) based on
+this FST, run
 
 ```shell
 $ datok convert -i mytokenizer.fst -o mytokenizer.datok
 ```
 
-To generate a double array representation
-of this FST, run
+To generate a Datok FSA (double array representation) based
+on this FST, run
 
 ```shell
 $ datok convert -i mytokenizer.fst -o mytokenizer.datok -d
@@ -90,7 +90,7 @@
 
 *Caution*: This may take some time depending on the number of arcs in the FST.
 
-The final datok file can then be used as an input to the tokenizer.
+The final datok file can then be used as a model for the tokenizer.
 
 ## Example
 
@@ -113,13 +113,16 @@
 
 ## Technology
 
-The double array representation (Aoe 1989) of all transitions
-in the FST is implemented as an extended DFA following Mizobuchi
-et al. (2000) and implementation details following Kanda et al. (2018).
+Internally the FSA is represented
+either as a matrix or as a double array.
 
 Both representations mark all non-word-character targets with a
 leading bit. The transduction is greedy with a single backtracking
-option to the last ε transition.
+option to the last ε (aka *tokenend*) transition.
+
+The double array representation (Aoe 1989) of all transitions
+in the FST is implemented as an extended DFA following Mizobuchi
+et al. (2000) and implementation details following Kanda et al. (2018).
 
 ## License