commit | 93d6d1b375ad301fee1f4ee8172202d11770c717 | |
---|---|---|
author | Nils Diewald <nils@diewald-online.de> | Mon Feb 02 21:47:43 2015 +0000 |
committer | Nils Diewald <nils@diewald-online.de> | Mon Feb 02 21:47:43 2015 +0000 |
tree | dc611d4b1c8b14d609f8e8b82b13644251c8a643 | |
parent | 0cc4f2eb90ea1f6bf7c3d0524abdc1f2d01a90a0 | |
Cleanup and initial position frames
KorAP is available at https://korap.ids-mannheim.de/
...
...
To run the test suite, type
$ mvn test
To start the server, type
$ mvn compile exec:java
To compile the indexer, type
$ mvn compile assembly:single
To run the indexer, type
$ java -jar target/KorAP-lucene-index-X.XX.jar src/main/resources/korap.conf src/test/resources/examples/
For changes in the current version, please consult the Changes file.
The Lucene backend is token-based, not character-based, and it supports only a single tokenization. Multiple annotations can be layered on that tokenization, but their character offsets have to match those of the basic tokens.
Token annotations that do not match the basic tokenization are not indexed. Span annotations that cover a smaller range than one basic token are not indexed either.
Tokens are only indexed if they are word tokens, i.e. not punctuation. This limitation is necessary to make distance queries work at the word level.
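The offset rules above can be illustrated with a minimal sketch. It is not taken from the KorAP code base: the class `OffsetAlignmentSketch`, the `Span` type, and both method names are hypothetical, and reading "smaller than one basic token" as "does not fully cover any basic token" is an assumption.

```java
import java.util.List;

// Hypothetical illustration of the offset rules; not part of KorAP.
final class OffsetAlignmentSketch {

    // A character range [start, end) in the primary text.
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // A token annotation is kept only if its offsets are identical
    // to the offsets of some basic token.
    static boolean matchesBasicToken(List<Span> basicTokens, Span annotation) {
        for (Span token : basicTokens) {
            if (token.start == annotation.start && token.end == annotation.end) {
                return true;
            }
        }
        return false;
    }

    // Assumption: a span annotation is kept only if it fully covers
    // at least one basic token.
    static boolean coversBasicToken(List<Span> basicTokens, Span annotation) {
        for (Span token : basicTokens) {
            if (annotation.start <= token.start && token.end <= annotation.end) {
                return true;
            }
        }
        return false;
    }
}
```

With basic tokens at offsets 0-3 and 4-9, a token annotation at 4-7 would be dropped under this reading, while a span annotation at 0-9 would be kept.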
The maximum value for repetitions is 100.
The maximum value for distance units is 100.
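The two limits above could be checked before a query is built, roughly as sketched here. The class `QueryLimitsSketch` and its methods are hypothetical, and the README does not say whether out-of-range values are rejected or clamped, so this sketch only reports whether a range is within the limits.

```java
// Hypothetical illustration of the documented limits; not part of KorAP.
final class QueryLimitsSketch {

    // Limits as stated in the README.
    static final int MAX_REPETITION = 100;
    static final int MAX_DISTANCE = 100;

    // True if a repetition range {min,max} stays within the limit.
    static boolean isValidRepetition(int min, int max) {
        return 0 <= min && min <= max && max <= MAX_REPETITION;
    }

    // True if a distance range, counted in distance units, stays within the limit.
    static boolean isValidDistance(int min, int max) {
        return 0 <= min && min <= max && max <= MAX_DISTANCE;
    }
}
```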
Before contributing, please reformat your code according to the KorAP style guideline, which is provided as an Eclipse style sheet (korap-style.xml). You can reformat either in Eclipse or with Maven, using the command
$ mvn java-formatter:format
???
Named entities in the test data were annotated by CoreNLP using models based on:
Manaal Faruqui and Sebastian Padó (2010): Training and Evaluating a German Named Entity Recognizer with Semantic Generalization, Proceedings of KONVENS 2010, Saarbrücken, Germany
Copyright 2014, IDS Mannheim, Germany
Authors: Nils Diewald, Eliza Margaretha, and contributors.