Rabbid – Rapid Application Development Environment released on GitHub!

Rabbid - Recherche- und Analyse-Basis für Belegstellen in Diskursen

We are happy to announce the open source release of Rabbid (“Recherche- und Analyse-Basis für Belegstellen in Diskursen”). Rabbid is a standalone rapid application development environment for KorAP and is used in production for the creation and management of collections of textual examples in the area of discourse analysis and discourse lexicography.

The development of Rabbid was a joint effort by the KorAP project and Dr. Ruth Mell of the Demokratiediskurs 1918-1925 project at the Institute for the German Language in Mannheim.

Unlike KorAP, Rabbid provides only a limited set of search operators for small, non-annotated corpora.

You can download Rabbid from GitHub. Rabbid is free software published under the BSD 2-Clause License.

Rabbid - Screenshots

Progress in NER annotation

Comparing two models for NER (the Huge German Corpus-generalized classifier and the deWaC-generalized classifier), we found that neither of them performed really well on our German Wikipedia corpus. The deWaC-generalized classifier possibly led to slightly more intuitive results, but as I did not do a statistical validation, this is hard to say.

Therefore we decided to run the NER task with both models, so that users can evaluate and compare the results, and to store the data of both models in the coreNLP foundry (the name of the model is included in the file name). This foundry is open to extension by other tools provided by the Stanford NLP team.

The process of Named Entity Recognition and Classification (NER/NEC) is done by a pipeline which loads the classifier(s) once for the entire annotation process and then iterates over the corpus. For each document, the Stanford NE Recognizer takes the text of the document and the loaded model. The output is written to the coreNLP foundry: one file per model classifier <ne_[MODEL].xml> containing the named entities found, a file <tokens.xml> with absolute token spans, and a file <sentences.xml> with the sentence spans.
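As a sketch, the load-once/iterate pattern looks roughly like this. All class, method, and file names below are simplified stand-ins for illustration (the real pipeline uses Stanford's CRFClassifier, loaded once per model); only the structure of the loop is the point:

```java
import java.util.*;

// Simplified sketch of the annotation loop: every model is loaded exactly
// once, then reused for each document in the corpus.
public class NerPipeline {

    // Stand-in for a loaded NER model (e.g. a Stanford classifier).
    static class Classifier {
        final String name;
        Classifier(String name) { this.name = name; }

        // Stand-in for the actual classification call.
        List<String> classify(String text) {
            return text.isEmpty() ? List.of() : List.of("NE@" + name);
        }
    }

    static Map<String, List<String>> annotate(List<String> docs, List<String> modelNames) {
        // Load each classifier once, before touching any document.
        List<Classifier> models = new ArrayList<>();
        for (String m : modelNames) models.add(new Classifier(m));

        // One output entry per document and model: <ne_[MODEL].xml>.
        Map<String, List<String>> foundry = new LinkedHashMap<>();
        for (int i = 0; i < docs.size(); i++) {
            for (Classifier c : models) {
                foundry.put("doc" + i + "/ne_" + c.name + ".xml", c.classify(docs.get(i)));
            }
        }
        return foundry;
    }
}
```

With two documents and two models, this yields four <ne_[MODEL].xml> entries while each model object is constructed only once.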

Issues with the MATE pipeline

We’ve come across a small coding problem when building a Java pipeline to process our texts with the MATE tools. In the current version of the MATE source (revision 234 as of this writing), the dependency parser is2.parser.Parser can only be invoked through its main() method, separately for every single document, which also means that the parsing model has to be loaded anew for every document. With a few million texts to be processed, this would take ages… Unfortunately, the central out() method of the class [source], which does most of the work and is called after the model has been loaded, has private access, i.e. our Java pipeline cannot call it. Interestingly, the equivalent methods in the lemmatizer and tagger classes are public. As a fix, we checked out the MATE source and set the method to public so that we could use it in our pipeline. We are not sure why the MATE developers set the method to private, but as we see no gain in this, and given the public access of the central methods in the other classes, we believe it was not done on purpose.
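Patching the source is one way out; another, which avoids maintaining a modified MATE checkout, is Java reflection: setAccessible(true) lifts the private-access check at runtime. A minimal, self-contained illustration (the nested Parser class here is only a stand-in; the real is2.parser.Parser.out() has a different signature):

```java
import java.lang.reflect.Method;

public class ReflectionDemo {

    // Stand-in for a third-party class whose central method is private.
    static class Parser {
        private String out(String sentence) {
            return "parsed:" + sentence;
        }
    }

    // Invoke the private method via reflection instead of patching the source.
    static String callPrivateOut(Parser parser, String sentence) {
        try {
            Method out = Parser.class.getDeclaredMethod("out", String.class);
            out.setAccessible(true);  // lift the private-access check
            return (String) out.invoke(parser, sentence);
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }
}
```

The reflective lookup and setAccessible(true) only need to happen once; the Method object can then be reused for every document, just like the loaded model.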

We encountered similar problems with the method is2.parser.Pipe.nextInstance() [source] and with the fields is2.mtag.Tagger.pipe and is2.mtag.Tagger.params [source]. As in the first case with is2.parser.Parser.out(), we set the method and fields to public in order to use them.
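Private fields can be reached the same way via getDeclaredField, again without touching the MATE source (the Tagger class below is only a stand-in for is2.mtag.Tagger, whose real fields have different types):

```java
import java.lang.reflect.Field;

public class FieldAccessDemo {

    // Stand-in for a class with private fields such as 'pipe' and 'params'.
    static class Tagger {
        private String pipe = "mtag-pipe";
    }

    // Read a private field by name via reflection.
    static Object readPrivateField(Object target, String name) {
        try {
            Field f = target.getClass().getDeclaredField(name);
            f.setAccessible(true);  // lift the private-access check
            return f.get(target);
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }
}
```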

Our pipeline/MATE wrapper now works fine, and we only have to load each model once for all processed files!

New Mate annotations

Thanks to the work of Bastian Beyer last year, continued and expanded first by Carsten and now by Joachim, we are in the process of adding a new foundry (annotation set) produced by the Mate parser.

We use Mate for several reasons:

  • we wanted to be able to release dependency annotations to the public,
  • Mate was recommended to us as a reliable parser,
  • it can produce interesting annotation layers for German (beyond the standard segmentation-based annotations, we also get two kinds of dependency annotations and semantic role annotations); the two kinds of dependencies are quite precious as reference for the ISO CQLF work done by Andreas, Elena and myself;
  • it is also a bit of a challenge, because unlike the other annotation tools that we use, Mate does not come with its own tokenization tool and can thus, in principle, be made to use any of our existing tokenization layers. This forces us to tighten some aspects of our data model, e.g. to require the presence of the element that encodes the tokenization layer in each instance of foundry metadata, and to enable it to act as a soft link in cases where a given foundry has to rely on tokenization information external to it. For now, our Mate foundries will use the conservative tokenization layers of the Base foundry, thereby strengthening the concept of the Base foundry even further.