commit | 04041eea4e5eae8cc2fbc258ad598bd935cf69c8 | |
---|---|---|
author | Nils Diewald <nils@diewald-online.de> | Thu Nov 20 21:02:26 2014 +0000 |
committer | Nils Diewald <nils@diewald-online.de> | Thu Nov 20 21:02:26 2014 +0000 |
tree | fd5fe764a22273a2a28795ca9b51d2eada1691e8 | |
parent | 38a9466f27d8c54aceba39b79d6f33b09e9697d8 | |
Added readme
KorAP is available at https://korap.ids-mannheim.de/
The Lucene backend is token-based, not character-based, and it supports only a single tokenization. Multiple annotation layers can be attached to that tokenization, but their character offsets have to match those of the basic tokens.
Token annotations that do not match the basic tokenization are not indexed. Span annotations that cover a smaller range than one basic token are not indexed either.
Tokens are indexed only if they are word tokens, i.e. not punctuation. This limitation is necessary to make distance queries work at the word level.
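The offset rule can be illustrated with a small sketch. This is not the backend's code: the function names, the tuple layout `(start, end, label)` and the reading of "smaller than one basic token" as "does not fully cover any basic token" are all assumptions made for illustration.

```python
from typing import List, Tuple

Offset = Tuple[int, int]           # (start, end) character offsets, end exclusive
Annotation = Tuple[int, int, str]  # (start, end, label); hypothetical layout


def indexable_token_annotations(
    base_tokens: List[Offset], annotations: List[Annotation]
) -> List[Annotation]:
    """Keep only token annotations whose offsets equal those of a basic token."""
    base = set(base_tokens)
    return [a for a in annotations if (a[0], a[1]) in base]


def indexable_spans(
    base_tokens: List[Offset], spans: List[Annotation]
) -> List[Annotation]:
    """Keep only spans that fully cover at least one basic token (assumed reading)."""
    return [
        s for s in spans
        if any(s[0] <= start and end <= s[1] for start, end in base_tokens)
    ]


# Basic tokenization of "Der Hund bellt"
base_tokens = [(0, 3), (4, 8), (9, 14)]

token_annotations = [(0, 3, "ART"), (4, 8, "NN"), (4, 6, "PREFIX")]
span_annotations = [(0, 8, "NP"), (4, 6, "MORPH")]

kept_tokens = indexable_token_annotations(base_tokens, token_annotations)
kept_spans = indexable_spans(base_tokens, span_annotations)
```

In this example the `PREFIX` annotation and the `MORPH` span are dropped, because neither lines up with a complete basic token.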
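Why skipping punctuation matters for distance queries can be shown with a toy example. The tokenizer below is a simple regular expression chosen for illustration, and `tokenize`, `word_tokens` and `distance` are hypothetical helpers, not part of the backend.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def word_tokens(tokens: List[str]) -> List[str]:
    """Drop punctuation, keeping only word tokens."""
    return [t for t in tokens if re.fullmatch(r"\w+", t)]


def distance(tokens: List[str], a: str, b: str) -> int:
    """Distance in token positions between the first occurrences of a and b."""
    return abs(tokens.index(b) - tokens.index(a))


all_tokens = tokenize("Hello, world!")
words = word_tokens(all_tokens)

with_punctuation = distance(all_tokens, "Hello", "world")  # comma takes a position
words_only = distance(words, "Hello", "world")             # adjacent words
```

If the comma were indexed as a token, "Hello" and "world" would be two positions apart. With only word tokens indexed they are adjacent, so a distance of one word means what a user would expect.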
The maximum value for repetitions is 100.
The maximum value for distance units is 100.
Copyright 2014, IDS Mannheim, Germany

Authors: Nils Diewald, Eliza Margaretha and contributors.
???
The named entities in the test data were annotated by CoreNLP using models based on: Manaal Faruqui and Sebastian Padó (2010): Training and Evaluating a German Named Entity Recognizer with Semantic Generalization. Proceedings of KONVENS 2010, Saarbrücken, Germany.