commit | c419d5b22c508a352f00fb11f23034a10bfbaf3d | |
---|---|---|
author | Marc Kupietz <kupietz@ids-mannheim.de> | Thu Sep 17 15:21:26 2020 +0200 |
committer | Akron <nils@diewald-online.de> | Fri Sep 18 17:44:06 2020 +0200 |
tree | 83eb21aa0d47920a7c1fcd225601249c1dc4108c | |
parent | de949deb083c43f4e0fed3713617aed768c000aa | |
Add new command line options using picocli and sanitize code

Usage: koraptokenizer [-hnpsV] [--force] [-ktt] [--[no-]tokens]
                      [-o=<output_fillename>] [<inputFiles>...]
Tokenizes (and sentence splits) text input.
      [<inputFiles>...]     input files
      --force               Force overwrite (default: false)
  -h, --help                Show this help message and exit.
      -ktt                  Deprecated. For internal use only. (default: false)
  -n, --normalize           Normalize tokens (default: false)
      --[no-]tokens         Print tokens (default: true)
  -o, --output-file=<output_fillename>
                            Output file (default: -)
  -p, --positions           Print token start and end positions as character
                              offsets (default: false)
  -s, --sentence-boundaries Print sentence boundary positions (default: false)
  -V, --version             Print version information and exit.

Change-Id: Ib92678c832a2d95799a8f503c3e86dd4da2b4d73
Efficient, OpenNLP tools compatible DFA tokenizer and sentence splitter with character offset output based on JFlex, suitable for German and other European languages.
The KorAP tokenizer is used for the German Reference Corpus DeReKo. Being based on a finite state automaton, it is not as accurate as language-model-based tokenizers, but at ~5 billion words per hour it is typically more efficient. An important feature in the DeReKo/KorAP context is that it reliably reports the character offsets of the tokens, so that this information can be used for applying standoff annotations.
The main class KorAPTokenizerImpl implements the opennlp.tools.tokenize.Tokenizer and opennlp.tools.sentdetect.SentenceDetector interfaces and can thus be used as a drop-in replacement in OpenNLP applications.
The scanner is based on the Lucene scanner with modifications from David Hall.
Our changes mainly concern good coverage of German abbreviations and some updates for handling computer-mediated communication, optimized and tested against the gold data from the EmpiriST 2015 shared task (Beißwenger et al. 2016).
$ MAVEN_OPTS="-Xss50m" mvn clean install
Because of the large table of abbreviations, the conversion from the JFlex source to Java, i.e. the calculation of the DFA, takes about 4 to 20 minutes, depending on your hardware, and requires a lot of heap space.
For this reason, the Java source generated from the JFlex source is distributed with the source code and is not deleted on mvn clean.
If you want to modify the JFlex source while keeping the abbreviation lists, you will need at least 5 GB of free RAM.
The KorAP tokenizer reads from standard input and writes to standard output. It supports multiple modes of operation.
With the --positions option, for example, the tokenizer prints, for each token, the offset of its first character and the offset of the first character after it. To end a text, flush the output and reset the character position, an EOT character (0x04) can be used.
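As a minimal sketch of how a caller might assemble such an input stream, the following Python helper joins several texts and terminates each one with an EOT character. The function name build_input is hypothetical, and placing a newline before and after the EOT character is an assumption about what the scanner expects, not something this README specifies.

```python
EOT = "\x04"  # End of Transmission, the text terminator named above


def build_input(texts):
    """Concatenate texts, terminating each with newline, EOT, newline.

    The surrounding newlines are an assumption, not a documented requirement.
    """
    return "".join(text + "\n" + EOT + "\n" for text in texts)


if __name__ == "__main__":
    stream = build_input(["This is a text.", "And this is another text."])
    # repr() makes the control characters visible
    print(repr(stream))
```

The resulting string could then be written to the tokenizer's standard input, for example through a pipe.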
$ echo -n -e 'This is a text.\x0a\x04\x0aAnd this is another text.\n\x04\n' |\
   java -jar target/KorAP-Tokenizer-1.3-SNAPSHOT.jar --positions
0 4 5 7 8 9 10 15
0 3 4 8 9 11 12 19 20 25
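Because each token is reported as a start offset and an end offset that points just past the token, a line of position output maps directly onto half-open string slices. The following Python sketch illustrates this for one line of output. The helper names parse_offsets and slice_tokens are hypothetical, and the sketch assumes that the reported offsets count the same units as Python string indices, which may not hold for characters outside the Basic Multilingual Plane.

```python
def parse_offsets(line):
    """Turn a line such as '0 3 4 8' into [(0, 3), (4, 8)]."""
    numbers = [int(field) for field in line.split()]
    if len(numbers) % 2 != 0:
        raise ValueError("expected an even number of offsets")
    # Pair every start offset with the end offset that follows it
    return list(zip(numbers[0::2], numbers[1::2]))


def slice_tokens(text, spans):
    """Recover the token strings from the original text (standoff style)."""
    return [text[start:end] for start, end in spans]


if __name__ == "__main__":
    text = "And this is another text."
    spans = parse_offsets("0 3 4 8 9 11 12 19 20 25")
    for (start, end), token in zip(spans, slice_tokens(text, spans)):
        print(start, end, token)
```

Keeping only the spans and looking up the token text on demand is the basic idea behind standoff annotation: the original text stays untouched, and annotations refer to it by offset.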
$ echo -n -e ' This ist a start of a text. And this is a sentence!!! But what the hack????\x0a\x04\x0aAnd this is another text.\n\x04\nAnd this a sentence without marker\n' |\
   java -jar target/KorAP-Tokenizer-1.3-SNAPSHOT.jar --positions --sentence-boundaries
1 5 6 9 10 11 12 17 18 20 21 22 23 27 27 28 29 32 33 37 38 40 41 42 43 51 51 54 55 58 59 63 64 67 68 72 72 76
1 28 29 54 55 76
0 3 4 8 9 11 12 19 20 24 24 25
0 25
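Token positions and sentence boundaries can be combined to group tokens by sentence. The following Python sketch does this for the first text of the example above, assuming that the token line and the sentence line of one text have already been separated. The names pairs and group_by_sentence are hypothetical.

```python
def pairs(line):
    """Turn a line of whitespace-separated offsets into (start, end) tuples."""
    numbers = [int(field) for field in line.split()]
    if len(numbers) % 2 != 0:
        raise ValueError("expected an even number of offsets")
    return list(zip(numbers[0::2], numbers[1::2]))


def group_by_sentence(text, token_line, sentence_line):
    """Return one list of token strings per reported sentence span."""
    tokens = pairs(token_line)
    sentences = []
    for s_start, s_end in pairs(sentence_line):
        # A token belongs to a sentence if it lies completely inside its span
        sentences.append(
            [text[t_start:t_end] for t_start, t_end in tokens
             if s_start <= t_start and t_end <= s_end]
        )
    return sentences


if __name__ == "__main__":
    text = (" This ist a start of a text. And this is a sentence!!!"
            " But what the hack????")
    token_line = ("1 5 6 9 10 11 12 17 18 20 21 22 23 27 27 28 29 32 33 37 "
                  "38 40 41 42 43 51 51 54 55 58 59 63 64 67 68 72 72 76")
    sentence_line = "1 28 29 54 55 76"
    for sentence in group_by_sentence(text, token_line, sentence_line):
        print(sentence)
```

The containment check is quadratic in the worst case, which is fine for a sketch. For large inputs, a single pass over both sorted lists would be the more natural choice.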
Authors:
Copyright (c) 2020, Leibniz Institute for the German Language, Mannheim, Germany
This package is developed as part of the KorAP Corpus Analysis Platform at the Leibniz Institute for the German Language (IDS).
The package contains code from Apache Lucene with modifications by David Hall.
It is published under the Apache 2.0 License.
Contributions are very welcome!
Your contributions should ideally be committed via our Gerrit server to facilitate reviewing (see Gerrit Code Review - A Quick Introduction if you are not familiar with Gerrit). However, we are also happy to accept comments and pull requests via GitHub.