commit	04041eea4e5eae8cc2fbc258ad598bd935cf69c8
author	Nils Diewald <nils@diewald-online.de>	Thu Nov 20 21:02:26 2014 +0000
committer	Nils Diewald <nils@diewald-online.de>	Thu Nov 20 21:02:26 2014 +0000
tree	fd5fe764a22273a2a28795ca9b51d2eada1691e8
parent	38a9466f27d8c54aceba39b79d6f33b09e9697d8

README.md

KorAP Lucene Index

KorAP is available at https://korap.ids-mannheim.de/

Limitations

Tokenization

The Lucene backend is token-based rather than character-based, and it supports only a single tokenization. Multiple annotation layers can be attached to that tokenization, but every annotation has to match the character offsets of the basic tokens.

Token annotations that do not match the basic tokenization are not indexed. Span annotations that cover a smaller range than a single basic token are not indexed either.
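The offset rule can be illustrated with a small sketch. This is not KorAP's indexing code: the `Annotation` type, the function names, the labels, and the sample offsets are all hypothetical. The span rule is read here as "a span must cover at least one whole basic token", which is an assumption about the sentence above.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    """Hypothetical annotation record: character offsets plus a label."""
    start: int
    end: int
    label: str


def indexable_token_annotations(basic_tokens, annotations):
    """Keep token annotations whose offsets equal those of a basic token."""
    basic_offsets = set(basic_tokens)
    return [a for a in annotations if (a.start, a.end) in basic_offsets]


def indexable_span_annotations(basic_tokens, spans):
    """Keep spans that cover at least one whole basic token (assumed reading)."""
    return [
        s for s in spans
        if any(s.start <= start and end <= s.end for start, end in basic_tokens)
    ]


# Basic tokenization of "New York is big" as (start, end) character offsets.
basic_tokens = [(0, 3), (4, 8), (9, 11), (12, 15)]

token_annotations = [
    Annotation(0, 3, "pos:ADJ"),         # matches basic token (0, 3)
    Annotation(0, 8, "lemma:New York"),  # other tokenization, no matching token
]
span_annotations = [
    Annotation(0, 8, "ne:LOC"),    # covers two basic tokens
    Annotation(4, 6, "morph:Yo"),  # smaller than basic token (4, 8)
]

kept_tokens = indexable_token_annotations(basic_tokens, token_annotations)
kept_spans = indexable_span_annotations(basic_tokens, span_annotations)
```

In this sketch only the annotation aligned with the token `(0, 3)` and the span covering `(0, 8)` survive; the other two would be dropped at indexing time.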

Tokens are indexed only if they are word tokens, i.e. not punctuation. This limitation is necessary to make distance queries work at the word level.
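A minimal sketch of why skipping punctuation matters for word distances follows. It assumes that a word token is simply a run of word characters (`\w+`), which is an approximation and not necessarily the definition KorAP uses; the function names are hypothetical.

```python
import re


def word_tokens(text):
    """Return word tokens only; list index serves as the token position."""
    # Assumption: a "word token" is a run of word characters (\w+).
    return [m.group(0) for m in re.finditer(r"\w+", text)]


def word_distance(words, first, second):
    """Distance in word positions between the first occurrences of two words."""
    return abs(words.index(second) - words.index(first))


words = word_tokens("Yes, indeed: it works.")
distance = word_distance(words, "Yes", "indeed")
```

Because the comma receives no position, "Yes" and "indeed" are adjacent word tokens. If punctuation were indexed as tokens, the same pair would be one position further apart.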

Repetitions

The maximum value for repetitions is 100.

Distances

The maximum value for distance units is 100.
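Client code that builds queries may want to check both limits before sending a request. The sketch below is an assumption-laden illustration: the README does not say whether larger values are rejected or capped, nor what the lower bound is, so this version rejects anything outside 0..100. The constant and function names are hypothetical.

```python
MAX_REPETITION = 100  # from the README: maximum value for repetitions
MAX_DISTANCE = 100    # from the README: maximum value for distance units


def check_limit(value, maximum, name):
    """Return value if it lies within 0..maximum, otherwise raise ValueError."""
    # Assumption: 0 is the lower bound and out-of-range values are errors.
    if not 0 <= value <= maximum:
        raise ValueError(f"{name} must be between 0 and {maximum}, got {value}")
    return value


repetition = check_limit(3, MAX_REPETITION, "repetition")
distance = check_limit(100, MAX_DISTANCE, "distance")
```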

Copyright

Citation

???

Further References

The named entity annotations that CoreNLP produced for the test data are based on models described in: Manaal Faruqui and Sebastian Padó (2010): Training and Evaluating a German Named Entity Recognizer with Semantic Generalization. Proceedings of KONVENS 2010, Saarbrücken, Germany.