Changelog

2.0.0

Dependencies updated
Tokenizer and sentence splitter for English (-l en option) added
Tokenizer and sentence splitter for French (-l fr option) added
Support for adding more languages
UTF-8 input encoding is now expected by default; different encodings can be set with the --encoding <enc> option
By default, tokens are now printed to stdout (use options --no-tokens --positions to print character offsets instead)
Abbreviated German street names like Kunststr. are now recognized as tokens
Added heuristics for distinguishing between I. as an abbreviation vs PPER / CARD
URLs without a URI scheme are now recognized as single tokens if they start with www.

Quoted email names containing space characters, like "John Doe"@xx.com, are no longer interpreted as single tokens
Sentence splitter functionality added (--sentence-boundaries option)

First version published on https://korap.ids-mannheim.de/gerrit/plugins/gitiles/KorAP/KorAP-Tokenizer
Extracted from KorAP-internal ingestion pipeline