The proceedings of the Konvens 2012 conference (the 11th Conference on Natural Language Processing) are now online, including the paper “Using Information Retrieval Technology for a Corpus Analysis Platform”, which was written as part of the KorAP project.
We are happy to report that yesterday we submitted a paper titled “Using Information Retrieval Technology for a Corpus Analysis Platform” to Konvens 2012 (the 11th Conference on Natural Language Processing)!
The Lucene project has released version 3.6 today. The release mainly brings improvements in text processing, features from which KorAP does not benefit very much. In addition, several bugs have been fixed, full Java 7 support has been introduced, and the finite state transducers used for certain queries have been improved.
Our setup is a collection of index structures that are physically distributed across multiple machines (worker nodes). Their purpose is to allow fast querying of different segmentations (tokenization, sentence boundaries, etc.) and annotations (e.g. part-of-speech tags, dependencies, syntactic constituents) on arbitrary document collections (corpora). Conversely, this implies that the union of all the distributed indexes makes up the complete corpus collection.
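The distributed setup described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not KorAP's actual implementation: the names `WorkerIndex`, `search_all`, and `corpus` are hypothetical, each worker node is simulated by an in-memory object, and an annotation layer is reduced to one value per token position.

```python
from collections import defaultdict


class WorkerIndex:
    """Toy in-memory index held by one worker node (hypothetical)."""

    def __init__(self):
        # (layer, value) -> set of (doc_id, token position)
        self.postings = defaultdict(set)
        self.doc_ids = set()

    def add(self, doc_id, layer, values):
        """Index one annotation layer (e.g. POS tags) of a document."""
        self.doc_ids.add(doc_id)
        for pos, value in enumerate(values):
            self.postings[(layer, value)].add((doc_id, pos))

    def search(self, layer, value):
        """Return all (doc_id, position) hits stored on this worker."""
        return self.postings.get((layer, value), set())


def search_all(workers, layer, value):
    """Send the query to every worker and merge the partial results."""
    hits = set()
    for worker in workers:
        hits |= worker.search(layer, value)
    return sorted(hits)


def corpus(workers):
    """The complete corpus is the union of the documents on all workers."""
    return set().union(*(worker.doc_ids for worker in workers))


# Two simulated worker nodes, each holding a different document.
workers = [WorkerIndex(), WorkerIndex()]
workers[0].add("doc1", "pos", ["DET", "NOUN", "VERB"])
workers[1].add("doc2", "pos", ["NOUN", "VERB"])

print(search_all(workers, "pos", "NOUN"))
print(sorted(corpus(workers)))
```

In this sketch no single worker knows the whole collection; a query only becomes complete once the partial results from all workers are merged.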