commit | 0630be5878075f7b9d7f497ebafa0f8c58fbbe44 | |
---|---|---|
author | Akron <nils@diewald-online.de> | Sat Aug 28 09:06:16 2021 +0200 |
committer | Akron <nils@diewald-online.de> | Sat Aug 28 09:06:16 2021 +0200 |
tree | 63d07742f9e3bb5a993c7e235125adf7c29408cb | |
parent | 235ea12bd2814b6cdf4bb4c275e895d3f0588ae0 | |
Fix parsing of end states
This is an implementation of a double array based finite state automaton (FSA) for natural language tokenization. The system accepts a finite state transducer (FST) that describes a tokenizer, is generated by Foma, and follows the rules described below.
The FST generated by Foma must adhere to the following rules:
- […] `@_TOKEN_SYMBOL_@`, which denotes the end of a token.
- […] `@_TOKEN_SYMBOL_@`.
- […] `@_TOKEN_SYMBOL_@`s mark a sentence end.

A minimal usable tokenizer written in XFST, following the tokenizer guidelines in Beesley and Karttunen (2003) and Beesley (2004), could look like this:
define TE "@_TOKEN_SYMBOL_@";

define WS [" "|"\u000a"|"\u0009"];

define PUNCT ["."|"?"|"!"];

define Char \[WS|PUNCT];

define Word Char+;

! Compose token ends
define Tokenizer [[Word|PUNCT] @-> ... TE] .o.
! Compose Whitespace ignorance
[WS+ @-> 0] .o.
! Compose sentence ends
[[PUNCT+] @-> ... TE \/ TE _ ];

read regex Tokenizer;
Hint: For development it's easier to replace `@_TOKEN_SYMBOL_@` with a newline.
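The token symbol convention can be illustrated with a short Go sketch. It is not Datok's implementation: the function `segment` and its input format (a slice of transducer output symbols) are hypothetical, and the sketch assumes that a single `@_TOKEN_SYMBOL_@` closes the current token while a second one directly after it closes the current sentence.

```go
package main

import "fmt"

// tokenSymbol is the multi-character symbol named in the conventions.
const tokenSymbol = "@_TOKEN_SYMBOL_@"

// segment groups output symbols into sentences of tokens.
// Assumption: one token symbol ends a token, and a second token symbol
// immediately after it ends the sentence.
func segment(symbols []string) [][]string {
	var sentences [][]string
	var sentence []string
	token := ""
	prevWasEnd := false

	for _, sym := range symbols {
		switch {
		case sym != tokenSymbol:
			// Ordinary character: extend the current token.
			token += sym
			prevWasEnd = false
		case token != "":
			// First marker after some characters: the token is complete.
			sentence = append(sentence, token)
			token = ""
			prevWasEnd = true
		case prevWasEnd:
			// Second marker in a row: the sentence is complete.
			sentences = append(sentences, sentence)
			sentence = nil
			prevWasEnd = false
		}
	}
	// Flush whatever is left when the input ends without markers.
	if token != "" {
		sentence = append(sentence, token)
	}
	if len(sentence) > 0 {
		sentences = append(sentences, sentence)
	}
	return sentences
}

func main() {
	out := []string{"H", "i", tokenSymbol, ".", tokenSymbol, tokenSymbol, "O", "k", tokenSymbol}
	for _, sentence := range segment(out) {
		for _, token := range sentence {
			fmt.Println(token)
		}
		fmt.Println() // blank line between sentences
	}
}
```

Printing one token per line with a blank line between sentences mirrors the hint above about replacing the token symbol with a newline during development.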
To build the Double Array Tokenizer tool, run
$ go build ./cmd/datok.go
To create a foma file from example XFST sources, first install Foma, then run in the root directory of this repository
$ cd src && \
  foma -e "source tokenizer.xfst" \
  -e "save stack ../mytokenizer.fst" -q -s && \
  cd ..
This will load and compile `tokenizer.xfst` and save the generated FST as `mytokenizer.fst` in the root directory.
To generate a double array representation of this FST, run
$ ./datok convert -i mytokenizer.fst -o mytokenizer.datok
Caution: This may take some time depending on the number of arcs in the FST.
The final datok file can then be used as an input to the tokenizer.
$ echo "Es war spät, schon ca. 2 Uhr. ;-)" | ./datok tokenize -t testdata/tokenizer.datok
Es
war
spät
,
schon
ca.
2
Uhr
.

;-)
Caution: When experimenting with STDIN this way, you may need to disable history expansion, since shells such as Bash otherwise interpret `!` in the quoted input.
Datok is based on a double array representation (Aoe 1989) of all transitions in the FST, implemented as an extended FSA following Mizobuchi et al. (2000) and implementation details following Kanda et al. (2018).
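As a rough illustration of the double array idea in Aoe (1989), the following Go sketch stores a small word list in two parallel arrays: a transition from state `s` on symbol code `c` leads to `t = base[s] + c` and is valid only if `check[t] == s`. This is a minimal sketch under stated assumptions, not Datok's data layout. It covers only a plain trie acceptor and omits everything specific to the extended automaton Datok uses. All names (`doubleArray`, `build`, `accepts`), the byte-based symbol codes, and the NUL terminator are hypothetical choices made for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// trie is a map-based trie, used here only as build input.
type trie map[byte]trie

// doubleArray: a transition from state s on symbol code c leads to
// t = base[s] + c and is valid only if check[t] == s.
type doubleArray struct {
	base  []int
	check []int // owning state of each slot, or free
}

const (
	free = -1     // marks an unoccupied slot
	end  = "\x00" // terminator appended to each word (assumes no NUL in input)
)

func (da *doubleArray) grow(n int) {
	for len(da.check) < n {
		da.base = append(da.base, 0)
		da.check = append(da.check, free)
	}
}

// fits reports whether base value b leaves every slot b+c unoccupied.
func (da *doubleArray) fits(b int, codes []int) bool {
	for _, c := range codes {
		if t := b + c; t < len(da.check) && da.check[t] != free {
			return false
		}
	}
	return true
}

// place copies node n into state s and recursively places its children.
func (da *doubleArray) place(n trie, s int) {
	if len(n) == 0 {
		return
	}
	codes := make([]int, 0, len(n))
	for sym := range n {
		codes = append(codes, int(sym)+1) // symbol codes start at 1
	}
	sort.Ints(codes)
	b := 1
	for !da.fits(b, codes) {
		b++
	}
	da.base[s] = b
	da.grow(b + codes[len(codes)-1] + 1)
	for _, c := range codes {
		da.check[b+c] = s // claim all slots before descending
	}
	for _, c := range codes {
		da.place(n[byte(c-1)], b+c)
	}
}

// build inserts the words into a trie and converts it to a double array.
func build(words ...string) *doubleArray {
	root := trie{}
	for _, w := range words {
		n := root
		for _, sym := range []byte(w + end) {
			if n[sym] == nil {
				n[sym] = trie{}
			}
			n = n[sym]
		}
	}
	da := &doubleArray{}
	da.grow(2)
	da.check[0], da.check[1] = 0, 0 // slot 0 is reserved; state 1 is the root
	da.place(root, 1)
	return da
}

// accepts needs one array lookup per input byte plus one for the terminator.
func (da *doubleArray) accepts(word string) bool {
	s := 1
	for _, sym := range []byte(word + end) {
		t := da.base[s] + int(sym) + 1
		if t >= len(da.check) || da.check[t] != s {
			return false
		}
		s = t
	}
	return true
}

func main() {
	da := build("ca", "ca.", "Uhr")
	fmt.Println(da.accepts("ca."), da.accepts("c"), da.accepts("Uhu"))
}
```

The builder uses a simple first-fit search for each `base` value, which is easy to follow but slow for large automata. That search cost is one reason why converting an FST with many arcs can take a while.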
The German tokenizer shipped with Datok is based on work done by the Lucene project (published under the Apache License), David Hall (published under the Apache License), Çağrı Çöltekin (published under the MIT License), and Marc Kupietz (published under the Apache License).
The foma parser is based on foma2js, written by Mans Hulden (published under the Apache License).
Aoe, Jun-ichi (1989): An Efficient Digital Search Algorithm by Using a Double-Array Structure. IEEE Transactions on Software Engineering, 15 (9), pp. 1066-1077.
Beesley, Kenneth R. & Lauri Karttunen (2003): Finite State Morphology. Stanford, CA: CSLI Publications.
Beesley, Kenneth R. (2004): Tokenizing Transducers. https://web.stanford.edu/~laurik/fsmbook/clarifications/tokfst.html
Hulden, Mans (2009): Foma: a finite-state compiler and library. In: Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, pp. 29-32.
Mizobuchi, Shoji, Toru Sumitomo, Masao Fuketa & Jun-ichi Aoe (2000): An efficient representation for implementing finite state machines based on the double-array. Information Sciences 129, pp. 119-139.
Kanda, Shunsuke, Yuma Fujita, Kazuhiro Morita & Masao Fuketa (2018): Practical rearrangement methods for dynamic double-array dictionaries. Software: Practice and Experience (SPE), 48 (1), pp. 65-83.