Changelog
2.0.0
- Dependencies updated
- Tokenizer and sentence splitter for English (-l en option) added
- Tokenizer and sentence splitter for French (-l fr option) added
- Support for adding more languages
- UTF-8 input encoding is now expected by default; different encodings can be set with the --encoding <enc> option
- By default, tokens are now printed to stdout (use the options --no-tokens --positions to print character offsets instead)
- Abbreviated German street names like Kunststr. are now recognized as tokens
- Added heuristics for distinguishing between I. as abbreviation vs. PPER / CARD
- URLs without URI scheme are now recognized as single tokens if they start with www.
1.3
- Standard EOT/EOF character \x04 is used instead of magic escape \n\x03\n
- Quoted email names containing space characters, like "John Doe"@xx.com, are no longer interpreted as single tokens
- Sentence splitter functionality added (--sentence-boundaries option)
1.2