Merge "Update Readme.md"
diff --git a/Readme.md b/Readme.md
index 959fc33..ef34e4e 100644
--- a/Readme.md
+++ b/Readme.md
@@ -1,76 +1,65 @@
# KorAP Tokenizer
Interface and implementation of a tokenizer and sentence splitter that can be used
+* for German, English, French, and, with some limitations, also for other languages
* as standalone tokenizer and/or sentence splitter
-* within the KorAP ingestion pipeline
-* within the [OpenNLP tools](https://opennlp.apache.org) framework
+* or within the KorAP ingestion pipeline
+* or within the [OpenNLP tools](https://opennlp.apache.org) framework
-## DeReKo Tokenizer (included default implementation)
-The included default implementation (`DerekoDfaTokenizer_de`) is a highly efficient DFA tokenizer and sentence splitter with character offset output based on [JFlex](https://www.jflex.de/), suitable for German and other European languages.
-It is used for the German Reference Corpus DeReKo. Being based on a finite state automaton,
-it is not as accurate as language model based tokenizers, but with ~5 billion words per hour typically more efficient.
-An important feature in the DeReKo/KorAP context is also, that it reliably reports the character offsets of the tokens
-so that this information can be used for applying standoff annotations.
+The included implementations (`DerekoDfaTokenizer_de`, `DerekoDfaTokenizer_en`, `DerekoDfaTokenizer_fr`) are highly efficient DFA tokenizers and sentence splitters, based on [JFlex](https://www.jflex.de/), with character offset output.
+The de-variant is used for the German Reference Corpus DeReKo. Being based on finite state automata,
+the tokenizers are potentially not as accurate as language-model-based ones, but, at ~5 billion words per hour, typically more efficient.
+Another important feature in the DeReKo/KorAP context is that token character offsets can be reported, so that this information can be used for applying standoff annotations.
-`DerekoDfaTokenizer_de` and any implementation of the `KorapTokenizer` interface also implement the [`opennlp.tools.tokenize.Tokenizer`](https://opennlp.apache.org/docs/1.8.2/apidocs/opennlp-tools/opennlp/tools/tokenize/Tokenizer.html)
+The included implementations of the `KorapTokenizer` interface also implement the [`opennlp.tools.tokenize.Tokenizer`](https://opennlp.apache.org/docs/1.8.2/apidocs/opennlp-tools/opennlp/tools/tokenize/Tokenizer.html)
and [`opennlp.tools.sentdetect.SentenceDetector`](https://opennlp.apache.org/docs/1.8.2/apidocs/opennlp-tools/opennlp/tools/sentdetect/SentenceDetector.html)
-interfaces and can thus be used as a drop-in replacement in OpenNLP applications.
+interfaces and can thus be used as drop-in replacements in OpenNLP applications.
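+
+A minimal sketch of such drop-in use through the OpenNLP interfaces (the
+package name and the no-argument constructor are assumptions; check the
+sources):
+```java
+import opennlp.tools.sentdetect.SentenceDetector;
+import opennlp.tools.tokenize.Tokenizer;
+import opennlp.tools.util.Span;
+
+public class DropInExample {
+    public static void main(String[] args) {
+        // assumed package and constructor; adjust to the actual sources
+        Tokenizer tokenizer = new de.ids_mannheim.korap.tokenizer.DerekoDfaTokenizer_de();
+        // tokenizePos() is part of the OpenNLP Tokenizer interface and
+        // yields the character offsets used for standoff annotation
+        for (Span span : tokenizer.tokenizePos("Das ist ein Satz.")) {
+            System.out.println(span.getStart() + " " + span.getEnd());
+        }
+        // as stated above, the same object is also an OpenNLP SentenceDetector
+        SentenceDetector detector = (SentenceDetector) tokenizer;
+        for (String sentence : detector.sentDetect("Ein Satz. Noch ein Satz.")) {
+            System.out.println(sentence);
+        }
+    }
+}
+```
+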
-The scanner is based on the Lucene scanner with modifications from [David Hall](https://github.com/dlwh).
+The underlying scanner is based on the Lucene scanner with modifications from [David Hall](https://github.com/dlwh).
-Our changes mainly concern a good coverage of German abbreviations,
-and some updates for handling computer mediated communication, optimized and tested against the gold data from the [EmpiriST 2015](https://sites.google.com/site/empirist2015/) shared task (Beißwenger et al. 2016).
+Our changes mainly concern good coverage of German and, optionally, of some English and French abbreviations,
+as well as some updates for handling computer-mediated communication, optimized and tested, in the case of German, against the gold data from the [EmpiriST 2015](https://sites.google.com/site/empirist2015/) shared task (Beißwenger et al. 2016).
-### Adding Support for more Languages
-To adapt the included implementations to more languages, take one of the `language-specific_<language>.jflex-macro` files as template and
-modify for example the macro for abbreviations `SEABBR`. Then add an `execution` section for the new language
-to the jcp ([java-comment-preprocessor](https://github.com/raydac/java-comment-preprocessor)) artifact in `pom.xml` following the example of one of the configurations there.
-After building the project (see below) your added language specific tokenizer / sentence splitter should be selectable with the `--language` option.
-
-Alternatively, you can also provide `KorAPTokenizer` implementations independently on the class path and select them with the `--tokenizer-class` option.
## Installation
```shell script
-$ MAVEN_OPTS="-Xss2m" mvn clean install
+mvn clean install
```
#### Note
Because of the large table of abbreviations, the conversion from the jflex source to java,
-i.e. the calculation of the DFA, takes about 4 to 20 minutes, depending on your hardware,
+i.e. the calculation of the DFA, takes about 5 to 30 minutes, depending on your hardware,
and requires a lot of heap space.
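+If this step fails for lack of stack space, raising the JVM thread stack size may help, e.g.:
+```shell script
+MAVEN_OPTS="-Xss2m" mvn clean install
+```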
-## Documentation
-The KorAP tokenizer reads from standard input and writes to standard output. It supports multiple modes of operations.
+## Usage Examples
+By default, the KorAP tokenizer reads from standard input and writes to standard output. It supports multiple modes of operation.
-#### Split into tokens
+#### Split English text into tokens
```
-$ echo 'This is a sentence. This is a second sentence.' | java -jar target/KorAP-Tokenizer-2.2.0-standalone.jar
-This
-is
-a
-sentence
+$ echo "It's working." | java -jar target/KorAP-Tokenizer-2.2.0-standalone.jar -l en
+It
+'s
+working
.
-This
-is
-a
-second
-sentence
+```
+#### Split French text into tokens and sentences
+```
+$ echo "C'est une phrase. Ici, il s'agit d'une deuxième phrase." \
+ | java -jar target/KorAP-Tokenizer-2.2.0-standalone.jar -s -l fr
+C'
+est
+une
+phrase
.
-```
-#### Split into tokens and sentences
-```
-$ echo 'This is a sentence. This is a second sentence.' | java -jar target/KorAP-Tokenizer-2.2.0-standalone.jar -s
-This
-is
-a
-sentence
-.
-
-This
-is
-a
-second
-sentence
+Ici
+,
+il
+s'
+agit
+d'
+une
+deuxième
+phrase
.
```
@@ -105,6 +94,14 @@
0 25
```
+### Adding Support for more Languages
+To adapt the included implementations to more languages, take one of the `language-specific_<language>.jflex-macro` files as a template and
+modify, for example, the macro for abbreviations `SEABBR`. Then add an `execution` section for the new language
+to the jcp ([java-comment-preprocessor](https://github.com/raydac/java-comment-preprocessor)) artifact in `pom.xml`, following the example of one of the configurations there.
+After building the project (see above), your added language-specific tokenizer / sentence splitter should be selectable with the `--language` option.
+
+Alternatively, you can also provide your own `KorAPTokenizer` implementations on the class path and select them with the `--tokenizer-class` option.
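+
+A hypothetical invocation (both `org.example.MyTokenizer` and the main class
+name are assumptions, not part of this project; check the manifest of the
+standalone jar for the actual entry point):
+```shell script
+java -cp target/KorAP-Tokenizer-2.2.0-standalone.jar:my-tokenizer.jar \
+     de.ids_mannheim.korap.tokenizer.Main \
+     --tokenizer-class org.example.MyTokenizer
+```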
+
## Development and License
**Authors**: