Using Lucene with Pre-defined Tokenization

Lucene is probably THE state-of-the-art tool for indexing text. It creates inverted indexes in which, in short, tokens are keys and their positions are the values, so that any term can be looked up rapidly. By default, the Lucene pipeline for creating an inverted index works like this:

  1. Read the input text, including XML parsing etc. to extract plain text.
  2. Analyse the text in order to break it up into single tokens (and potentially perform other modifications like lower-casing, filtering, etc.).
  3. Create an inverted index having the tokens as keys.

Numerous built-in analysers make this process flexible enough for most typical use cases, and if Lucene does not provide the right one, a custom analyser can be implemented.
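For reference, here is a minimal sketch of that default pipeline, assuming a recent Lucene version; the index directory, field name, and sample text are purely illustrative and not part of our actual set-up.

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class DefaultPipeline {
        public static void main(String[] args) throws Exception {
            // Steps 2 and 3: the analyser tokenizes the text, the writer builds the inverted index.
            try (IndexWriter writer = new IndexWriter(
                    FSDirectory.open(Paths.get("index")),
                    new IndexWriterConfig(new StandardAnalyzer()))) {
                Document doc = new Document();
                // Step 1 (XML parsing etc.) is assumed to have produced this plain text already.
                doc.add(new TextField("text", "Some plain text extracted from XML.", Field.Store.YES));
                writer.addDocument(doc);
            }
        }
    }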

Unfortunately, our set-up cannot strictly follow the typical Lucene pipeline because tokenization (step 2) is already done by external tools, before and independently of Lucene indexing. Multiple tokenizations and other segmentations are supposed to produce different indexes so that we can choose which index to use for querying; since reproducibility is highly important for linguistic research, the tokenization on which a query is based is very relevant. Those segmentations are not always transparent and may be produced with proprietary tools or uploaded by users, so implementing a Lucene analyser that re-does the tokenization process is not an option.

In our scenario, segmentation (including tokenization) information is stored in a stand-alone XML file that contains elements like <span id="t_0" from="5" to="10"/>. The primary text is stored in another file (text.xml), from which the token's text value is extracted at offsets 5 to 10 of the raw text. Further analysis is not necessary.
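To illustrate, here is a minimal sketch of such offset-based token extraction. It assumes that the span file is flat XML as in the example above, that the "to" offset is exclusive, and that the primary text has already been pulled out of text.xml into a String; the class and method names are invented for this example.

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    public class SpanReader {
        // Reads all <span> elements and cuts the corresponding substrings out of the primary text.
        public static List<String> extractTokens(File spanFile, String primaryText) throws Exception {
            List<String> tokens = new ArrayList<>();
            NodeList spans = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(spanFile)
                    .getElementsByTagName("span");
            for (int i = 0; i < spans.getLength(); i++) {
                Element span = (Element) spans.item(i);
                int from = Integer.parseInt(span.getAttribute("from"));
                int to = Integer.parseInt(span.getAttribute("to"));
                // The token text is simply a substring of the primary text; no further analysis is needed.
                tokens.add(primaryText.substring(from, to));
            }
            return tokens;
        }
    }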

We have considered several more or less promising ideas to overcome the issue that we cannot let Lucene analyse incoming plain text at indexing time:

  • Implement a custom TokenStream class that holds a document field providing both the primary text and all token boundaries for that text. Its incrementToken() method emits one token per call. For each document, a Lucene Field is initialized as Field("text", ts), where ts is the TokenStream object whose document field has been set beforehand so that it matches the document being indexed at that moment (a sketch follows below this list).
    One downside of that method is that, in the default Lucene Field implementation, a value supplied as a TokenStream cannot also be stored in the index. Hence, we would have to implement our own Field class if we need to do so. The other disadvantage is that all documents are parsed before indexing and therefore have to be kept either in memory (which becomes impossible for large corpora at some point) or on disk, which requires an additional intermediate storage format and time-consuming disk I/O operations.
  • Implement a custom Lucene Analyzer that performs a pseudo-analysis during indexing. Instead of really analysing incoming text, it would pass the location of the source XML files to an XML parser whenever a document is added and retrieve the tokens after the parser has extracted them. The field value can be a simple String object that acts as a pointer to the file.
  • Re-write the primary text so that it can be safely tokenized by a Lucene built-in analyser, e.g. insert a white space at each token boundary and use the Lucene WhitespaceAnalyzer. One downside has been mentioned above: doing the re-write is costly. Another one is that the original character offsets are shifted by the additional spaces, so that the offsets have to be corrected in a final analysis step.
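As a rough illustration of the first idea, the following sketch shows a TokenStream that emits pre-defined tokens. It assumes a recent Lucene version; the names PredefinedTokenStream and Span are invented for this example.

    import java.io.IOException;
    import java.util.List;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

    // Emits one externally defined token per incrementToken() call.
    public class PredefinedTokenStream extends TokenStream {

        // Simple value holder for an externally produced token (hypothetical helper class).
        public static class Span {
            final int from, to;
            final String text;
            public Span(int from, int to, String text) { this.from = from; this.to = to; this.text = text; }
        }

        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
        private final List<Span> tokens;
        private int pos = 0;

        public PredefinedTokenStream(List<Span> tokens) {
            this.tokens = tokens;
        }

        @Override
        public void reset() throws IOException {
            super.reset();
            pos = 0;
        }

        @Override
        public boolean incrementToken() {
            if (pos >= tokens.size()) {
                return false;
            }
            clearAttributes();
            Span s = tokens.get(pos++);
            termAtt.setEmpty().append(s.text);
            offsetAtt.setOffset(s.from, s.to);
            return true;
        }
    }

A document field would then be added as new TextField("text", new PredefinedTokenStream(tokens)); as noted above, a field constructed from a TokenStream cannot also store its value.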

For the time being, the second option (going through a custom Analyzer) seems to be the most practical one regarding memory, computation time, and I/O operations.
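As a sketch of what such a pseudo-analysing Analyzer could look like (assuming a recent Lucene version; the names PseudoAnalyzer and SpanFileTokenizer are made up, and the actual span parsing is delegated to a callback such as the SpanReader shown earlier):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.util.Iterator;
    import java.util.List;
    import java.util.function.Function;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    // Pseudo-analysis: the field value is only a file path, the tokens come from an external parser.
    public class PseudoAnalyzer extends Analyzer {

        // Callback that maps a span-file path to the externally produced token list.
        private final Function<String, List<String>> externalParser;

        public PseudoAnalyzer(Function<String, List<String>> externalParser) {
            this.externalParser = externalParser;
        }

        class SpanFileTokenizer extends Tokenizer {
            private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
            private Iterator<String> tokens;

            @Override
            public void reset() throws IOException {
                super.reset();
                // The incoming "text" is merely a pointer to the span XML file.
                String path = new BufferedReader(input).readLine();
                tokens = externalParser.apply(path).iterator();
            }

            @Override
            public boolean incrementToken() {
                if (tokens == null || !tokens.hasNext()) {
                    return false;
                }
                clearAttributes();
                termAtt.setEmpty().append(tokens.next());
                return true;
            }
        }

        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            return new TokenStreamComponents(new SpanFileTokenizer());
        }
    }

The IndexWriter would be configured with this analyser, and each document would get a field whose string value is the path of its span file, e.g. doc.add(new TextField("text", "/path/to/spans.xml", Field.Store.NO)); the path here is of course only illustrative.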
