Merge "Introduce token_writer object"

commit: ce018e1800185f46a832cf01e3990b709ccb10e9
author: Akron <diewald@ids-mannheim.de> Wed Oct 20 19:08:57 2021 +0200
committer: Gerrit Code Review <gerrit2@korap.ids-mannheim.de> Wed Oct 20 19:08:57 2021 +0200
tree: 076e825c3e369dd9393b72f94d2c3fe482f47839
parent: e0dffe015655ca4768a769a58d8a3181d525663b
parent: e396a93ea5848a941e664f992aed89b057ca3120
diff --git a/Readme.md b/Readme.md
index 956bcf6..99ace7f 100644
--- a/Readme.md
+++ b/Readme.md

@@ -1,17 +1,16 @@
 # Datok - Finite State Tokenizer
 
-This is an implementation of an FSA for natural language
-tokenization, either in form of a matrix representation
-or as a double array.
-The system accepts a finite state transducer (FST)
-describing a tokenizer generated by
-[Foma](https://fomafst.github.io/)
-that needs to follow some conventional rules as described
-below.
+Implementation of a finite state automaton for
+natural language tokenization, based on a finite state
+transducer generated with [Foma](https://fomafst.github.io/).
+
+The library contains sources for a German tokenizer
+based on [KorAP-Tokenizer](https://github.com/KorAP/KorAP-Tokenizer).
 
 ## Conventions
 
-The FST generated by Foma must adhere to the following rules:
+The FST generated by Foma must adhere to the following rules,
+to be converted by Datok:
 
 - Character accepting arcs need to be translated
   *only* to themselves or to ε (the empty symbol).
@@ -28,7 +27,7 @@
 
 A minimal usable tokenizer written in XFST and following
 the guidelines to tokenizers in Beesley and Karttunen (2003)
-and Beesley (2004) could look like this:
+and Beesley (2004) would look like this:
 
 ```xfst
 define TE "@_TOKEN_SYMBOL_@";
@@ -62,7 +61,7 @@
 $ go build ./cmd/datok.go
 ```
 
-To create a foma file from example XFST sources, first install
+To create a foma file from the example sources, first install
 [Foma](https://fomafst.github.io/), then run in
 the root directory of this repository
 
@@ -74,17 +73,18 @@
 ```
 
 This will load and compile `tokenizer.xfst` and will save
-the generated FST as `mytokenizer.fst`
+the compiled FST as `mytokenizer.fst`
 in the root directory.
 
-To generate a matrix representation of this FST, run
+To generate a Datok FSA (matrix representation) based on
+this FST, run
 
 ```shell
 $ datok convert -i mytokenizer.fst -o mytokenizer.datok
 ```
 
-To generate a double array representation
-of this FST, run
+To generate a Datok FSA (double array representation) based
+on this FST, run
 
 ```shell
 $ datok convert -i mytokenizer.fst -o mytokenizer.datok -d
@@ -92,7 +92,7 @@
 
 *Caution*: This may take some time depending on the number of arcs in the FST.
 
-The final datok file can then be used as an input to the tokenizer.
+The final datok file can then be used as a model for the tokenizer.
 
 ## Example
 
@@ -115,13 +115,16 @@
 
 ## Technology
 
-The double array representation (Aoe 1989) of all transitions
-in the FST is implemented as an extended DFA following Mizobuchi
-et al. (2000) and implementation details following Kanda et al. (2018).
+Internally the FSA is represented
+either as a matrix or as a double array.
 
 Both representations mark all non-word-character targets with a
 leading bit. The transduction is greedy with a single backtracking
-option to the last ε transition.
+option to the last ε (aka *tokenend*) transition.
+
+The double array representation (Aoe 1989) of all transitions
+in the FST is implemented as an extended DFA following Mizobuchi
+et al. (2000) and implementation details following Kanda et al. (2018).
 
 ## License
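
Note on the Technology hunk: the phrase "greedy with a single backtracking option to the last ε (aka *tokenend*) transition" can be illustrated with a small, self-contained Go sketch. This is not Datok's code and uses none of its API. The toy automaton (`step`, `isFinal`) and the `tokenize` function are hypothetical names invented for this illustration, and the token grammar (letters, optionally joined by single inner dots, plus single punctuation marks) is an assumption chosen only to make the backtracking visible.

```go
package main

import (
	"fmt"
	"unicode"
)

// Toy automaton (hypothetical, not Datok's FSA):
//   state 0: start of a token
//   state 1: inside a word (a token may end here)
//   state 2: after a dot inside a word (a token may NOT end here)
//   state 3: single punctuation token (a token may end here)
func step(state int, r rune) (int, bool) {
	switch state {
	case 0:
		if unicode.IsLetter(r) {
			return 1, true
		}
		return 3, true
	case 1:
		if unicode.IsLetter(r) {
			return 1, true
		}
		if r == '.' {
			return 2, true
		}
	case 2:
		if unicode.IsLetter(r) {
			return 1, true
		}
	}
	return 0, false // no transition: the automaton is stuck
}

func isFinal(state int) bool { return state == 1 || state == 3 }

// tokenize consumes input greedily and remembers only the most recent
// position at which a token could have ended. When the automaton gets
// stuck, it falls back to that single remembered position.
func tokenize(input string) []string {
	runes := []rune(input)
	var tokens []string
	start := 0
	for start < len(runes) {
		if unicode.IsSpace(runes[start]) {
			start++ // whitespace separates tokens and is dropped
			continue
		}
		state, lastEnd := 0, -1
		for pos := start; pos < len(runes); {
			next, ok := step(state, runes[pos])
			if !ok {
				break
			}
			state = next
			pos++
			if isFinal(state) {
				lastEnd = pos // the single backtracking option
			}
		}
		if lastEnd < 0 {
			lastEnd = start + 1 // defensive: always make progress
		}
		tokens = append(tokens, string(runes[start:lastEnd]))
		start = lastEnd
	}
	return tokens
}

func main() {
	fmt.Printf("%q\n", tokenize("Visit example.com. Now!"))
}
```

In this sketch the scanner reads past the trailing dot of `example.com.`, finds no continuation, and returns to the last recorded token end, so the dot becomes a token of its own.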