Added readme file

commit: 31cc3076563703cb40d437949a3e92913e002f56
author: Akron <nils@diewald-online.de> Fri Aug 13 10:52:01 2021 +0200
committer: Akron <nils@diewald-online.de> Fri Aug 13 10:52:01 2021 +0200
tree: 709cc9d640b81cb6c5848bf24f4df85141c22160
parent: 1e10d008fad06a5757dfc748d631b1c30fbcb9ae
diff --git a/Readme.md b/Readme.md
new file mode 100644
index 0000000..3b39060
--- /dev/null
+++ b/Readme.md

@@ -0,0 +1,143 @@
+# Datok - Double Array Tokenizer
+
+This is an implementation of a double array based
+finite state automaton (FSA) for natural language tokenization.
+The system accepts a finite state transducer (FST)
+describing a tokenizer generated by
+[Foma](https://fomafst.github.io/)
+that needs to follow some rules as described below.
+
+# Conventions
+
+The FST generated by Foma must adhere to the following rules:
+
+- Character accepting arcs need to be translated
+  *only* to themselves or to ε (the empty symbol).
+- Multi-character symbols are not allowed,
+  except for `@_TOKEN_SYMBOL_@`,
+  which denotes the end of a token.
+- ε accepting arcs (transitions not consuming
+  any character) need to be translated to
+  the `@_TOKEN_SYMBOL_@`.
+- Two consecutive `@_TOKEN_SYMBOL_@`s mark a sentence end.
+- Flag diacritics are not supported.
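+
+To illustrate how the conventions above are meant to be read, here is a
+minimal Go sketch that groups a stream of output symbols into tokens and
+sentences. It is not part of Datok; the function name `splitSentences`
+and the representation of the output as a slice of strings are
+assumptions made for this example only.
+
+```go
+package main
+
+import "fmt"
+
+// tokenEnd stands in for the only allowed multi-character symbol.
+const tokenEnd = "@_TOKEN_SYMBOL_@"
+
+// splitSentences is a hypothetical helper: it collects characters into
+// tokens, closes a token at each tokenEnd, and closes a sentence when
+// two tokenEnd symbols follow each other directly.
+func splitSentences(symbols []string) [][]string {
+	var sentences [][]string
+	var sentence []string
+	token := ""
+	for _, sym := range symbols {
+		if sym != tokenEnd {
+			token += sym
+			continue
+		}
+		if token != "" {
+			// First token end: the current token is complete.
+			sentence = append(sentence, token)
+			token = ""
+			continue
+		}
+		// Token end directly after a token end: the sentence is complete.
+		if len(sentence) > 0 {
+			sentences = append(sentences, sentence)
+			sentence = nil
+		}
+	}
+	// Flush whatever is left when the stream ends.
+	if token != "" {
+		sentence = append(sentence, token)
+	}
+	if len(sentence) > 0 {
+		sentences = append(sentences, sentence)
+	}
+	return sentences
+}
+
+func main() {
+	out := []string{"H", "i", tokenEnd, ".", tokenEnd, tokenEnd, "O", "k", tokenEnd}
+	fmt.Println(splitSentences(out)) // prints [[Hi .] [Ok]]
+}
+```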
+
+A minimal usable tokenizer, written in XFST and following
+the guidelines for tokenizers in Beesley and Karttunen (2003),
+could look like this:
+
+```xfst
+define TE "@_TOKEN_SYMBOL_@";
+
+define WS [" "|"\u000a"|"\u0009"];
+
+define PUNCT ["."|"?"|"!"];
+
+define Char \[WS|PUNCT];
+
+define Word Char+;
+
+! Compose token ends
+define Tokenizer [[Word|PUNCT] @-> ... TE] .o.
+! Compose whitespace removal
+       [WS+ @-> 0] .o.
+! Compose sentence ends
+       [[PUNCT+] @-> ... TE \/ TE _ ];
+
+read regex Tokenizer;
+```
+
+*Hint*: For development it's easier to replace `@_TOKEN_SYMBOL_@`
+with a newline.
+
+# Building
+
+To build the Double Array Tokenizer tool, run
+
+```shell
+$ go build ./cmd/datok.go
+```
+
+To create a foma file from the example XFST sources, first install
+[Foma](https://fomafst.github.io/), then run the following in
+the root directory of this repository:
+
+```shell
+$ cd src && \
+  foma -e "source tokenizer.xfst" \
+  -e "save stack ../mytokenizer.fst" -q -s && \
+  cd ..
+```
+
+This will load and compile `tokenizer.xfst` and will save
+the generated FST as `mytokenizer.fst`
+in the root directory.
+
+To generate a double array representation
+of this FST, run
+
+```shell
+$ ./datok convert -i mytokenizer.fst -o mytokenizer.datok
+```
+
+*Caution*: This may take some time depending on the number of arcs in the FST.
+
+The final datok file can then be used as an input to the tokenizer.
+
+# Example
+
+```shell
+$ echo "Es war spät, schon ca. 2 Uhr. ;-)" | ./datok tokenize -t testdata/tokenizer.datok
+Es
+war
+spät
+,
+schon
+ca.
+2
+Uhr
+.
+
+;-)
+```
+
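+*Caution*: When experimenting with STDIN this way, you may need to disable history expansion
+(in Bash, for example, with `set +H`), because a `!` inside double quotes is otherwise expanded by the shell.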
+
+# Technology
+
+Datok is based on a double array representation (Aoe 1989) of all transitions in the FST,
+implemented as an extended FSA following Mizobuchi et al. (2000),
+with implementation details following Kanda et al. (2018).
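+
+As a rough illustration of the underlying idea, the following Go sketch
+shows the textbook double array transition from Aoe (1989): from state `s`
+on symbol code `c`, the target is `t = base[s] + c`, and the transition is
+valid only if `check[t] == s`. This is not Datok's actual data layout
+(the extended FSA stores additional information per transition); the type
+`doubleArray`, its methods, and the hand-built arrays are assumptions
+made for this example.
+
+```go
+package main
+
+import "fmt"
+
+// doubleArray is a hypothetical, minimal double array automaton.
+// Index 0 is unused; state 1 is the start state.
+type doubleArray struct {
+	base  []int
+	check []int
+}
+
+// step follows one transition: t = base[s] + c, valid iff check[t] == s.
+func (da *doubleArray) step(s, c int) (int, bool) {
+	t := da.base[s] + c
+	if t <= 0 || t >= len(da.check) || da.check[t] != s {
+		return 0, false
+	}
+	return t, true
+}
+
+// accepts runs the symbol codes through the automaton and reports
+// whether it ends in one of the given final states.
+func (da *doubleArray) accepts(codes []int, final map[int]bool) bool {
+	s := 1
+	for _, c := range codes {
+		next, ok := da.step(s, c)
+		if !ok {
+			return false
+		}
+		s = next
+	}
+	return final[s]
+}
+
+func main() {
+	// Symbol codes: a = 1, b = 2. The automaton accepts "ab" and "b".
+	da := &doubleArray{
+		base:  []int{0, 1, 2, 0, 0},
+		check: []int{0, 0, 1, 1, 2},
+	}
+	final := map[int]bool{3: true, 4: true}
+
+	fmt.Println(da.accepts([]int{1, 2}, final)) // "ab": true
+	fmt.Println(da.accepts([]int{2}, final))    // "b":  true
+	fmt.Println(da.accepts([]int{1}, final))    // "a":  false
+}
+```
+
+Each lookup costs one addition and one comparison, independent of the
+number of outgoing arcs of a state.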
+
+The shipped German tokenizer is based on work done by the
+[Lucene project](https://github.com/apache/lucene-solr)
+(published under the Apache License),
+[David Hall](https://github.com/dlwh/epic)
+(published under the Apache License),
+[Çağrı Çöltekin](https://github.com/coltekin/TRmorph/)
+(published under the MIT License),
+and [Marc Kupietz](https://github.com/KorAP/KorAP-Tokenizer)
+(published under the Apache License).
+
+The foma parser is based on
+[*foma2js*](https://github.com/mhulden/foma),
+written by Mans Hulden (published under the Apache License).
+
+# Bibliography
+
+Aoe, Jun-ichi (1989): *An Efficient Digital Search Algorithm by Using a Double-Array Structure*.
+IEEE Transactions on Software Engineering, 15 (9), pp. 1066-1077.
+
+Beesley, Kenneth R. & Lauri Karttunen (2003): *Finite State Morphology*. Stanford, CA: CSLI Publications.
+
+Hulden, Mans (2009): *Foma: a finite-state compiler and library*. In: Proceedings of the
+12th Conference of the European Chapter of the Association for Computational Linguistics,
+Association for Computational Linguistics, pp. 29-32.
+
+Mizobuchi, Shoji, Toru Sumitomo, Masao Fuketa & Jun-ichi Aoe (2000):
+*An efficient representation for implementing finite state machines based on the double-array*.
+Information Sciences 129, pp. 119-139.
+
+Kanda, Shunsuke, Yuma Fujita, Kazuhiro Morita & Masao Fuketa (2018):
+*Practical rearrangement methods for dynamic double-array dictionaries*.
+Software: Practice and Experience (SPE), 48 (1), pp. 65-83.
\ No newline at end of file