Add introduction video to readme

Change-Id: I8ec7b9a9f80fd668d4c408c946cd28d468fbccb9

commit: 3d31453bc5d2b04fd2cdb8eaf851825d76be5dc6
author: Akron <nils@diewald-online.de> Tue Nov 02 21:53:55 2021 +0100
committer: Akron <nils@diewald-online.de> Tue Nov 02 21:53:55 2021 +0100
tree: 795fdfa0b8f151e344674d4ae4a417f8222e311f
parent: 6792bd282edf68403e38fecbdc95329eb605ab3f
diff --git a/Readme.md b/Readme.md
index e1642b3..6365186 100644
--- a/Readme.md
+++ b/Readme.md

@@ -1,36 +1,51 @@
 # Datok - Finite State Tokenizer
 
+![Introduction to Datok](https://raw.githubusercontent.com/KorAP/Datok/master/misc/introducing-datok.gif)
+
 Implementation of a finite state automaton for
-natural language tokenization, based on a finite state
+fast natural language tokenization, based on a finite state
 transducer generated with [Foma](https://fomafst.github.io/).
 
 The library contains sources for a german tokenizer
 based on [KorAP-Tokenizer](https://github.com/KorAP/KorAP-Tokenizer).
 
-## Tokenizing
+## Tokenization
 
-```shell
-$ echo "Es war spät, schon ca. <time datetime=\"02:00\">02:00 Uhr</time>. ;-)" | ./datok tokenize -t testdata/tokenizer.matok -
-Es
-war
-spät
-,
-schon
-ca.
-<time datetime="02:00">
-02:00
-Uhr
-</time>
-.
+```
+Usage: datok tokenize --tokenizer=STRING <input>
 
-;-)
+Arguments:
+  <input>    Input file to tokenize (use - for STDIN)
 
+Flags:
+  -h, --help                  Show context-sensitive help.
 
+  -t, --tokenizer=STRING      The Matrix or Double Array Tokenizer file
+      --[no-]tokens           Print token surfaces (defaults to true)
+      --[no-]sentences        Print sentence boundaries (defaults to true)
+  -p, --token-positions       Print token offsets (defaults to false)
+      --sentence-positions    Print sentence offsets (defaults to false)
+      --newline-after-eot     Ignore newline after EOT (defaults to false)
 ```
 
 The special `END OF TRANSMISSION` character (`\x04`) can be used to mark the end of a text.
 
-> *Caution*: When experimenting with STDIN this way, you may need to disable history expansion.
+> *Caution*: When experimenting with STDIN and echo,
+> you may need to disable history expansion.
+
+## Conversion
+
+```
+Usage: datok convert --foma=STRING --tokenizer=STRING
+
+Flags:
+  -h, --help                Show context-sensitive help.
+
+  -i, --foma=STRING         The Foma FST file
+  -o, --tokenizer=STRING    The Tokenizer file
+  -d, --double-array        Convert to Double Array instead of Matrix
+                            representation
+```
 
 ## Conventions
 
@@ -126,8 +141,10 @@
 either as a matrix or as a double array.
 
 Both representations mark all non-word-character targets with a
-leading bit. The transduction is greedy with a single backtracking
-option to the last ε (aka *tokenend*) transition.
+leading bit. All ε (aka *tokenend*) transitions mark the end of a
+token or the end of a sentence (2 subsequential ε).
+The transduction is greedy with a single backtracking
+option to the last ε transition.
 
 The double array representation (Aoe 1989) of all transitions
 in the FST is implemented as an extended DFA following Mizobuchi

diff --git a/misc/introducing-datok.gif b/misc/introducing-datok.gif
new file mode 100644
index 0000000..603c25e
--- /dev/null
+++ b/misc/introducing-datok.gif
Binary files differ