Today, the Lucene team announced the release of Lucene 4.0. We have been migrating our Lucene-based code to Lucene 4.0 since the alpha was released in July this year. Many thanks to all the Lucene developers for another great piece of open source software!
KorAP focusses on processing and analysing linguistic corpora, and we do not make use of typical information retrieval methods when analysing text. For us, the most interesting new features are therefore those that improve indexing and querying performance. The following list, extracted from the release announcement, covers the new functionality most relevant to us.
- The index formats for terms, postings lists, stored fields, term vectors, etc. are pluggable via the Codec API.
- When indexing via multiple threads, each IndexWriter thread now flushes its own segment to disk concurrently, resulting in substantial performance improvements.
- New default term dictionary/index (BlockTree) that indexes shared prefixes instead of every n’th term.
- FuzzyQuery is 100-200 times faster than in past releases.
- Substantially faster performance when using a Filter during searching.
- Added index statistics such as the number of tokens for a term or field, number of postings for a field, and number of documents with a posting for a field.
- Term offsets can be optionally encoded into the postings lists and can be retrieved per-position.
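
To make the index statistics item more concrete, here is a minimal sketch of what those counts mean for a single field. It is a toy in-memory counter written in plain Java, not the Lucene API: the class `ToyIndexStats` and all of its method names are hypothetical and chosen only for illustration. It assumes each document is already tokenised into a list of strings.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy illustration of per-field index statistics.
// NOT the Lucene API: every name in this class is a hypothetical stand-in.
public class ToyIndexStats {
    private final Map<String, Long> tokensPerTerm = new HashMap<>();    // number of tokens for a term
    private final Map<String, Integer> docsPerTerm = new HashMap<>();   // number of documents containing a term
    private long tokensInField = 0;   // number of tokens for the field
    private long postingsInField = 0; // number of (term, document) postings for the field
    private int docsWithPosting = 0;  // number of documents with at least one posting

    // Adds one already-tokenised document to the statistics.
    public void addDocument(List<String> tokens) {
        if (tokens.isEmpty()) {
            return; // a document without tokens contributes no postings
        }
        Set<String> seenInThisDoc = new HashSet<>();
        for (String token : tokens) {
            tokensPerTerm.merge(token, 1L, Long::sum);
            tokensInField++;
            // The first occurrence of a term in a document creates a new posting.
            if (seenInThisDoc.add(token)) {
                docsPerTerm.merge(token, 1, Integer::sum);
                postingsInField++;
            }
        }
        docsWithPosting++;
    }

    public long tokensForTerm(String term) { return tokensPerTerm.getOrDefault(term, 0L); }
    public int docsForTerm(String term) { return docsPerTerm.getOrDefault(term, 0); }
    public long tokensInField() { return tokensInField; }
    public long postingsInField() { return postingsInField; }
    public int docsWithPosting() { return docsWithPosting; }

    // Relative frequency of a term in the field, a typical corpus-linguistic measure.
    public double relativeFrequency(String term) {
        return tokensInField == 0 ? 0.0 : (double) tokensForTerm(term) / tokensInField;
    }

    public static void main(String[] args) {
        ToyIndexStats stats = new ToyIndexStats();
        stats.addDocument(Arrays.asList("der", "Hund", "der"));
        stats.addDocument(Arrays.asList("der", "Baum"));
        System.out.println("tokens for 'der': " + stats.tokensForTerm("der"));
        System.out.println("relative frequency of 'der': " + stats.relativeFrequency("der"));
    }
}
```

For corpus analysis, having such counts stored in the index is attractive because measures like relative frequency can be derived without iterating over the postings again. How the real statistics are exposed is described in the Lucene documentation.
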
See the Lucene homepage for more details about implementation and theoretical background.