Merge "Update Readme.md"
diff --git a/Readme.md b/Readme.md
index 959fc33..ef34e4e 100644
--- a/Readme.md
+++ b/Readme.md
@@ -1,76 +1,65 @@
# KorAP Tokenizer
Interface and implementation of a tokenizer and sentence splitter that can be used
+* for German, English, French, and, with some limitations, also for other languages
* as standalone tokenizer and/or sentence splitter
-* within the KorAP ingestion pipeline
-* within the [OpenNLP tools](https://opennlp.apache.org) framework
+* or within the KorAP ingestion pipeline
+* or within the [OpenNLP tools](https://opennlp.apache.org) framework
-## DeReKo Tokenizer (included default implementation)
-The included default implementation (`DerekoDfaTokenizer_de`) is a highly efficient DFA tokenizer and sentence splitter with character offset output based on [JFlex](https://www.jflex.de/), suitable for German and other European languages.
-It is used for the German Reference Corpus DeReKo. Being based on a finite state automaton,
-it is not as accurate as language model based tokenizers, but with ~5 billion words per hour typically more efficient.
-An important feature in the DeReKo/KorAP context is also, that it reliably reports the character offsets of the tokens
-so that this information can be used for applying standoff annotations.
+The included implementations (`DerekoDfaTokenizer_de`, `DerekoDfaTokenizer_en`, `DerekoDfaTokenizer_fr`) are highly efficient DFA tokenizers and sentence splitters, based on [JFlex](https://www.jflex.de/), with character offset output.
+The de-variant is used for the German Reference Corpus DeReKo. Being based on finite state automata,
+the tokenizers are potentially not as accurate as language-model-based ones, but, at ~5 billion words per hour, typically more efficient.
+Another important feature in the DeReKo/KorAP context is that token character offsets can be reported, so that this information can be used for applying standoff annotations.
-`DerekoDfaTokenizer_de` and any implementation of the `KorapTokenizer` interface also implement the [`opennlp.tools.tokenize.Tokenizer`](https://opennlp.apache.org/docs/1.8.2/apidocs/opennlp-tools/opennlp/tools/tokenize/Tokenizer.html)
+The included implementations of the `KorapTokenizer` interface also implement the [`opennlp.tools.tokenize.Tokenizer`](https://opennlp.apache.org/docs/1.8.2/apidocs/opennlp-tools/opennlp/tools/tokenize/Tokenizer.html)
and [`opennlp.tools.sentdetect.SentenceDetector`](https://opennlp.apache.org/docs/1.8.2/apidocs/opennlp-tools/opennlp/tools/sentdetect/SentenceDetector.html)
-interfaces and can thus be used as a drop-in replacement in OpenNLP applications.
+interfaces and can thus be used as drop-in replacements in OpenNLP applications.
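+
+A minimal sketch of such drop-in use through the OpenNLP interfaces (the
+package name and the no-argument constructor are assumptions; check the
+sources):
+```java
+import opennlp.tools.sentdetect.SentenceDetector;
+import opennlp.tools.tokenize.Tokenizer;
+import opennlp.tools.util.Span;
+
+public class DropInExample {
+    public static void main(String[] args) {
+        // assumed package and constructor; adjust to the actual sources
+        Tokenizer tokenizer = new de.ids_mannheim.korap.tokenizer.DerekoDfaTokenizer_de();
+        // tokenizePos() is part of the OpenNLP Tokenizer interface and
+        // yields the character offsets used for standoff annotation
+        for (Span span : tokenizer.tokenizePos("Das ist ein Satz.")) {
+            System.out.println(span.getStart() + " " + span.getEnd());
+        }
+        // as stated above, the same object is also an OpenNLP SentenceDetector
+        SentenceDetector detector = (SentenceDetector) tokenizer;
+        for (String sentence : detector.sentDetect("Ein Satz. Noch ein Satz.")) {
+            System.out.println(sentence);
+        }
+    }
+}
+```
+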
-The scanner is based on the Lucene scanner with modifications from [David Hall](https://github.com/dlwh).
+The underlying scanner is based on the Lucene scanner with modifications from [David Hall](https://github.com/dlwh).
-Our changes mainly concern a good coverage of German abbreviations,
-and some updates for handling computer mediated communication, optimized and tested against the gold data from the [EmpiriST 2015](https://sites.google.com/site/empirist2015/) shared task (Beißwenger et al. 2016).
+Our changes mainly concern good coverage of German and, optionally, of some English and French abbreviations,
+as well as some updates for handling computer-mediated communication, optimized and tested, in the case of German, against the gold data from the [EmpiriST 2015](https://sites.google.com/site/empirist2015/) shared task (Beißwenger et al. 2016).
-### Adding Support for more Languages
-To adapt the included implementations to more languages, take one of the `language-specific_<language>.jflex-macro` files as template and
-modify for example the macro for abbreviations `SEABBR`. Then add an `execution` section for the new language
-to the jcp ([java-comment-preprocessor](https://github.com/raydac/java-comment-preprocessor)) artifact in `pom.xml` following the example of one of the configurations there.
-After building the project (see below) your added language specific tokenizer / sentence splitter should be selectable with the `--language` option.
-
-Alternatively, you can also provide `KorAPTokenizer` implementations independently on the class path and select them with the `--tokenizer-class` option.
## Installation
```shell script
-$ MAVEN_OPTS="-Xss2m" mvn clean install
+mvn clean install
```
#### Note
Because of the large table of abbreviations, the conversion from the jflex source to java,
-i.e. the calculation of the DFA, takes about 4 to 20 minutes, depending on your hardware,
+i.e. the calculation of the DFA, takes about 5 to 30 minutes, depending on your hardware,
and requires a lot of heap space.
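+If this step fails for lack of stack space, raising the JVM thread stack size may help, e.g.:
+```shell script
+MAVEN_OPTS="-Xss2m" mvn clean install
+```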
-## Documentation
-The KorAP tokenizer reads from standard input and writes to standard output. It supports multiple modes of operations.
+## Usage Examples
+By default, the KorAP tokenizer reads from standard input and writes to standard output. It supports multiple modes of operation.
-#### Split into tokens
+#### Split English text into tokens
```
-$ echo 'This is a sentence. This is a second sentence.' | java -jar target/KorAP-Tokenizer-2.2.0-standalone.jar
-This
-is
-a
-sentence
+$ echo "It's working." | java -jar target/KorAP-Tokenizer-2.2.0-standalone.jar -l en
+It
+'s
+working
.
-This
-is
-a
-second
-sentence
+```
+#### Split French text into tokens and sentences
+```
+$ echo "C'est une phrase. Ici, il s'agit d'une deuxième phrase." \
+ | java -jar target/KorAP-Tokenizer-2.2.0-standalone.jar -s -l fr
+C'
+est
+une
+phrase
.
-```
-#### Split into tokens and sentences
-```
-$ echo 'This is a sentence. This is a second sentence.' | java -jar target/KorAP-Tokenizer-2.2.0-standalone.jar -s
-This
-is
-a
-sentence
-.
-
-This
-is
-a
-second
-sentence
+Ici
+,
+il
+s'
+agit
+d'
+une
+deuxième
+phrase
.
```
@@ -105,6 +94,14 @@
0 25
```
+### Adding Support for more Languages
+To adapt the included implementations to more languages, take one of the `language-specific_<language>.jflex-macro` files as a template and
+modify, for example, the macro for abbreviations `SEABBR`. Then add an `execution` section for the new language
+to the jcp ([java-comment-preprocessor](https://github.com/raydac/java-comment-preprocessor)) artifact in `pom.xml`, following the example of one of the configurations there.
+After building the project (see above), your added language-specific tokenizer / sentence splitter should be selectable with the `--language` option.
+
+Alternatively, you can also provide your own `KorAPTokenizer` implementations on the class path and select them with the `--tokenizer-class` option.
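+
+A hypothetical invocation (both `org.example.MyTokenizer` and the main class
+name are assumptions, not part of this project; check the manifest of the
+standalone jar for the actual entry point):
+```shell script
+java -cp target/KorAP-Tokenizer-2.2.0-standalone.jar:my-tokenizer.jar \
+     de.ids_mannheim.korap.tokenizer.Main \
+     --tokenizer-class org.example.MyTokenizer
+```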
+
## Development and License
**Authors**: