Public DeReKo access

KorAP

We have just relaunched KorAP, which provides a large subset of DeReKo (with most data from the W archive of COSMAS II, comprising more than 11 million documents). The data is annotated with part-of-speech information from CoreNLP, MarMoT, OpenNLP and TreeTagger, additional morphological features from MarMoT, lemma annotations from TreeTagger, constituency annotations from CoreNLP and dependency annotations from Malt.

To grant access to the restricted corpora, we are currently fixing a critical bug in the integration of the user management of COSMAS II. Until this integration is finished, KorAP is temporarily not accessible from outside the IDS.

Rabbid – Rapid Application Development Environment released on GitHub!

Rabbid - Recherche- und Analyse-Basis für Belegstellen in Diskursen

We are happy to announce the open source release of Rabbid (“Recherche- und Analyse-Basis für Belegstellen in Diskursen”). Rabbid is a standalone rapid application development environment for KorAP and is used in production for the creation and management of collections of textual examples in the area of discourse analysis and discourse lexicography.

The development of Rabbid was a joint effort by the KorAP project and Dr. Ruth Mell of the Demokratiediskurs 1918-1925 project at the Institute for the German Language in Mannheim.

Unlike KorAP, Rabbid provides only a limited set of search operators for small, non-annotated corpora.

You can download Rabbid from GitHub. Rabbid is free software published under the BSD 2-Clause License.

Rabbid - Screenshots

Kalamar – User Frontend released on GitHub!

Mojolicious-based Frontend to KorAP

We are happy to announce the open source release of Kalamar, the Mojolicious-based frontend for KorAP!

Kalamar is written in Perl and JavaScript, acts as a proof-of-concept for the KorAP API, and provides, among other features, …

  • aligned KWIC views,
  • multiple highlighting,
  • table views of morphological annotations,
  • tree views of hierarchical annotations,
  • localization,
  • a language-independent query helper for multiple tag sets,
  • and embedded, interactive documentation!

Screenshots

Expect more features to come! You can already use Kalamar from inside the IDS and download the sources from GitHub.

EDIT: The IDS instance of KorAP is currently not accessible from outside the IDS.

Krill – Lucene-based Search Backend released on GitHub!

A Corpusdata Retrieval Index using Lucene for Look-Ups

We are happy to announce the open source release of Krill, the Lucene-based search backend for KorAP! Krill is the reference implementation for KoralQuery, covering most of the protocol's features, including …

  • Fulltext search
  • Token-based annotation search
  • Span-based annotation search
  • Distance search
  • Positional search
  • Nested queries

… and many more!

You can download Krill on GitHub – feedback and contributions are very welcome!

‘Koral’ query serializer released!

We are happy to announce the release of Koral, the module which KorAP uses to translate queries from its supported query languages into KoralQuery, a general protocol for queries to corpus analysis systems. Taking a query string as its input, Koral generates a corresponding KoralQuery instance which represents that query independently of the source query language, such that the system may work in a query language-agnostic fashion. Besides the actual linguistic query, KoralQuery also has facilities to represent virtual collection definitions as well as error and warning messages that may arise during query processing.
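The idea of query language-agnostic processing can be illustrated with a small sketch. It is not Koral's implementation and does not reproduce the KoralQuery schema: the two toy query syntaxes, the function names and the dictionary layout are assumptions made up for this example. It only shows how two different surface syntaxes can be mapped to one neutral structure that also carries errors and warnings.

```python
import json


def parse_conjunction(query: str, separator: str) -> dict:
    """Parse 'term <separator> term ...' into a language-neutral structure.

    The returned layout is hypothetical and NOT the real KoralQuery format.
    """
    parts = [part.strip() for part in query.split(separator)]
    warnings = []
    if any(not part for part in parts):
        warnings.append("empty operand ignored")
    operands = [{"type": "term", "key": part} for part in parts if part]
    if not operands:
        return {"query": None, "errors": ["query contains no terms"],
                "warnings": warnings}
    if len(operands) == 1:
        tree = operands[0]
    else:
        tree = {"type": "group", "operation": "and", "operands": operands}
    return {"query": tree, "errors": [], "warnings": warnings}


# Two toy front ends (hypothetical syntaxes) sharing one target representation.
FRONTENDS = {
    "toy-symbolic": lambda q: parse_conjunction(q, "&"),
    "toy-keyword": lambda q: parse_conjunction(q, " AND "),
}


def serialize(query: str, language: str) -> dict:
    """Translate a query string from the given toy language."""
    if language not in FRONTENDS:
        return {"query": None,
                "errors": [f"unsupported query language: {language}"],
                "warnings": []}
    return FRONTENDS[language](query)


if __name__ == "__main__":
    print(json.dumps(serialize("Hund & Katze", "toy-symbolic"), indent=2))
```

In this sketch, serialize("Hund & Katze", "toy-symbolic") and serialize("Hund AND Katze", "toy-keyword") return the same structure, so a backend consuming it would not need to know which syntax the user typed.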

You can access and download the Koral sources from the KorAP GitHub repository. Please note that the current version 0.1.0 is not final and is still work in progress; further releases will follow in the near future.

KorAP-SRU

KorAP has been integrated into the CLARIN technology and infrastructure, in particular CLARIN-FCS (Federated Content Search). CLARIN-FCS is an interface specification built on Search/Retrieve via URL (SRU) and the Contextual Query Language (CQL): SRU is a standard XML-based client-server protocol in which CQL queries are passed in a URL to perform a search. CLARIN-FCS allows searching within resource content stored in CLARIN repositories.

KorAP-SRU, an implementation of a CLARIN-FCS endpoint, has been released. It allows searching in the IDS Mannheim repository via KorAP. KorAP-SRU currently offers the basic search capability defined by CLARIN-FCS, supporting term-only queries (e.g. Hund) and boolean queries (AND and OR). It interprets queries as case-sensitive.
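As a rough illustration of what such requests look like on the wire, the following sketch builds SRU searchRetrieve URLs for a term-only and a boolean query. The endpoint address is a placeholder, and the choice of SRU version 1.2 and of the maximumRecords default are assumptions for this example, not details taken from the KorAP-SRU release.

```python
from urllib.parse import urlencode

# Placeholder address, NOT the real KorAP-SRU endpoint.
ENDPOINT = "https://example.org/korap-sru"


def sru_search_url(cql_query: str, maximum_records: int = 10) -> str:
    """Build an SRU searchRetrieve URL carrying a CQL query."""
    params = {
        "operation": "searchRetrieve",  # standard SRU operation name
        "version": "1.2",               # assumed SRU version
        "query": cql_query,             # the CQL query, URL-encoded below
        "maximumRecords": maximum_records,
    }
    return f"{ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    print(sru_search_url("Hund"))             # term-only query
    print(sru_search_url("Hund AND Katze"))   # boolean query
    print(sru_search_url("Hund OR Katze"))
```

Because queries are interpreted as case-sensitive, "Hund" and "hund" would be sent as two different queries.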

Typically, an FCS endpoint needs to translate the query in an SRU searchRetrieve request into the query language of the search engine. Since KorAP accepts various query languages, including CQL, the KorAP-SRU endpoint does not need to alter the CQL query. It simply includes the query in an HTTP request and sends it to the KorAP public search service. The KorAP service sends back the query results serialized as JSON, and KorAP-SRU translates them into the CLARIN-FCS result format.
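The pass-through design described here can be sketched as follows. This is not the KorAP-SRU code: the backend call is replaced by a stand-in function, the JSON shape with "matches", "left", "hit" and "right" is an assumption, and the XML output is a heavily simplified placeholder rather than the actual CLARIN-FCS result format.

```python
from urllib.parse import parse_qs, urlsplit
from xml.sax.saxutils import escape


def fake_backend_search(query: str, query_language: str) -> dict:
    """Stand-in for the HTTP call to the search service.

    The response shape is hypothetical and only serves this example.
    """
    return {"matches": [{"left": "Der ", "hit": query, "right": " bellt."}]}


def handle_sru_request(url: str, search=fake_backend_search) -> str:
    """Forward the CQL query unchanged and convert the JSON-like result."""
    params = parse_qs(urlsplit(url).query)
    cql = params["query"][0]        # taken over as-is, no translation step
    result = search(cql, "cql")     # the backend is told the query is CQL
    records = "".join(
        f"<record>{escape(m['left'])}<hit>{escape(m['hit'])}</hit>"
        f"{escape(m['right'])}</record>"
        for m in result["matches"]
    )
    return f"<records>{records}</records>"  # simplified, not FCS-conformant


if __name__ == "__main__":
    request = "https://example.org/sru?operation=searchRetrieve&query=Hund"
    print(handle_sru_request(request))
```

The only real work left to the endpoint in this sketch is the conversion of the result, which mirrors the division of labour described above.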

The KorAP-SRU endpoint has been registered in the CLARIN center registry, specifically in the IDS center information. It is connected to the Aggregator, a CLARIN-FCS client that sends search requests to multiple CLARIN repositories and collects and displays the results. In the near future, it will be integrated into WebLicht, where it can be used as a tool in building a linguistic processing tool chain or pipeline.

Monitoring KorAP

As part of the monitoring functionality of KorAP, I have implemented a framework for logging and tracing user and service activities. While that alone fulfils a design requirement imposed by the licence agreements covering the contained data (text and annotations alike), the framework can also record users’ query activity. KorAP’s Auditing framework thus allows the retrieval of query information such as the most frequently formulated queries or the data and annotation layers that are queried most often. Not only does this allow KorAP to create usage statistics about the underlying data set, it also enables the developers to improve KorAP’s usability in future releases based on the data retrieved.

For example, when tracking null queries (queries with no results), the data can be used to extend and improve the documentation or even to uncover pitfalls in users’ understanding and application of query language constructs. All recorded information is subject to data protection law, and if it is published, the user data will be anonymized. When in production mode, KorAP will inform users about the extent of record keeping and the usage of the data.
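The kind of analysis described above can be sketched in a few lines. The record layout below (query string, annotation layer, number of hits) is an assumption made for this example and not the format used by KorAP's Auditing framework; user data is left out entirely.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryRecord:
    """Hypothetical audit record; field names are illustrative only."""
    query: str
    layer: str   # annotation layer addressed by the query
    hits: int    # number of results returned


def most_frequent_queries(records, n=3):
    """Return the n most frequently formulated queries with their counts."""
    return Counter(r.query for r in records).most_common(n)


def most_queried_layers(records, n=3):
    """Return the n annotation layers that are queried most often."""
    return Counter(r.layer for r in records).most_common(n)


def null_queries(records):
    """Return all distinct queries that produced no result, sorted."""
    return sorted({r.query for r in records if r.hits == 0})


if __name__ == "__main__":
    log = [
        QueryRecord("Hund", "orth", 120),
        QueryRecord("Hund", "orth", 120),
        QueryRecord("[pos=XYZ]", "pos", 0),
    ]
    print(most_frequent_queries(log, 1))
    print(most_queried_layers(log, 1))
    print(null_queries(log))
```

A list of null queries like this one could point to tags or constructs that users expect to exist, which is where documentation improvements would start.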

The API was designed with the following concepts in mind:

  • easy extensibility, for the convenience of future developers
  • separation of duties (Auditing takes place outside of the system logic via Spring proxy calls)
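The separation of duties can be illustrated with a plain Python proxy object standing in for Spring's proxy mechanism. This is a sketch of the general pattern only, not KorAP's code: the SearchService class, its search method and the structure of the audit entries are hypothetical.

```python
import functools


class SearchService:
    """Hypothetical service containing only system logic, no auditing."""

    def search(self, query):
        return {"query": query, "status": "ok"}


class AuditingProxy:
    """Wraps a target object and records every method call made through it."""

    def __init__(self, target, audit_log):
        self._target = target
        self._audit_log = audit_log

    def __getattr__(self, name):
        # Only called for attributes not found on the proxy itself.
        attr = getattr(self._target, name)
        if not callable(attr):
            return attr

        @functools.wraps(attr)
        def audited(*args, **kwargs):
            result = attr(*args, **kwargs)
            # Auditing happens here, outside the service's own code.
            self._audit_log.append(
                {"method": name, "args": args, "kwargs": kwargs}
            )
            return result

        return audited


if __name__ == "__main__":
    log = []
    service = AuditingProxy(SearchService(), log)
    service.search("Hund")
    print(log)
```

The service class never mentions auditing, so audit rules can change, or be switched off, without touching the system logic.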