‘Koral’ query serializer released!

We are happy to announce the release of Koral, the module which KorAP uses to translate queries from its supported query languages into KoralQuery, a general protocol for queries to corpus analysis systems. Taking a query string as its input, Koral generates a corresponding KoralQuery instance which represents that query independently of the source query language, such that the system may work in a query language-agnostic fashion. Besides the actual linguistic query, KoralQuery also has facilities to represent virtual collection definitions as well as error and warning messages that may arise during query processing.

You can access and download the Koral sources from the KorAP GitHub repository. Please note that the current version, 0.1.0, is not final and is still under active development, so further releases will follow in the near future.

Issues with MATE pipeline

We’ve come across a small coding problem while building a Java pipeline to process our texts with the MATE tools. In the current version of the MATE source (revision 234 at the time of writing), the dependency parser in is2.parser.Parser can only be invoked through its main() method, once per document, which means that the parsing model is reloaded for every document. With a few million texts to process, this would take ages… Unfortunately, the central out() method of the class [source], which does most of the work and is called once the model has been loaded, is declared private, so our Java pipeline cannot call it. Interestingly, the equivalent methods in the lemmatizer and tagger classes are public. As a fix, we checked out the MATE source and made the method public so that we could use it in our pipeline. We are not sure why the MATE developers made the method private, but since we see no benefit in this, and since the central methods in the other classes are public, we believe it was not done on purpose.

We encountered similar problems with the method is2.parser.Pipe.nextInstance() [source] and the fields is2.mtag.Tagger.pipe and is2.mtag.Tagger.params [source]. As in the first case with is2.parser.Parser.out(), we made the method and fields public in order to use them.
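For readers who cannot or do not want to patch the source, Java reflection is a possible alternative for reaching private members. This is not what we did, and the sketch below does not use the MATE classes: ClosedComponent, its out() method and its params field are hypothetical stand-ins, and the real signatures in MATE may differ.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;

public class ReflectiveAccessSketch {

    /** Hypothetical stand-in for a library class with private members (not MATE code). */
    static class ClosedComponent {
        private String params = "default-params";

        private String out(String sentence) {
            return "out:" + sentence;
        }
    }

    /** Calls the private out(String) method via reflection. */
    static String callPrivateOut(ClosedComponent target, String sentence)
            throws ReflectiveOperationException {
        Method m = ClosedComponent.class.getDeclaredMethod("out", String.class);
        m.setAccessible(true); // lift the access check for this Method object only
        return (String) m.invoke(target, sentence);
    }

    /** Reads the private params field via reflection. */
    static String readPrivateParams(ClosedComponent target)
            throws ReflectiveOperationException {
        Field f = ClosedComponent.class.getDeclaredField("params");
        f.setAccessible(true); // lift the access check for this Field object only
        return (String) f.get(target);
    }

    public static void main(String[] args) throws ReflectiveOperationException {
        ClosedComponent component = new ClosedComponent();
        System.out.println(callPrivateOut(component, "a sentence"));
        System.out.println(readPrivateParams(component));
    }
}
```

Reflection keeps the library untouched, but it is slower than a direct call and breaks silently if a member is renamed, so changing the access modifier in a local checkout is arguably the simpler route when the source is available.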

Our pipeline/MATE wrapper now works fine, and we only have to load each model once for all processed files!
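The load-once pattern described above can be sketched roughly as follows. StubParser, its parse() method and the model path are hypothetical placeholders standing in for the real parser class and its now-public method; the sketch only illustrates the control flow, not the MATE API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LoadOncePipeline {

    /** Hypothetical stand-in for a parser with an expensive model (not the MATE API). */
    static class StubParser {
        static int modelLoads = 0; // counts how often a "model" was loaded

        StubParser(String modelPath) {
            // A real parser would read the model from modelPath here.
            modelLoads++;
        }

        /** Plays the role of the method that had to be made public. */
        public String parse(String document) {
            return "parsed(" + document + ")";
        }
    }

    /** Loads the model once, then reuses the same parser for every document. */
    static List<String> processAll(List<String> documents, String modelPath) {
        StubParser parser = new StubParser(modelPath); // single, expensive load
        List<String> results = new ArrayList<>();
        for (String doc : documents) {
            results.add(parser.parse(doc)); // cheap per-document call
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> results = processAll(Arrays.asList("doc1", "doc2", "doc3"), "model.bin");
        System.out.println(results);
        System.out.println("model loads: " + StubParser.modelLoads);
    }
}
```

Going through main() for each document would instead construct the parser inside the loop, so the load counter would grow with the number of documents rather than staying at one.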