commit	8104dacfe6f2038e17d2f818b7405b5de0bb8054
author	Marc Kupietz <kupietz@ids-mannheim.de>	Tue Dec 06 18:08:32 2022 +0100
committer	Marc Kupietz <kupietz@ids-mannheim.de>	Wed Dec 07 06:59:23 2022 +0100
tree	e71802b17b94678f9c2b891b1ba3b51285220d9e
parent	5c859014da7379e23f3658d89231b48b927541f5

Readme.md

KorAP-Docker

The KorAP Corpus Analysis Platform consists of several independent components, but they can easily be installed together using Docker. This repository contains a recipe to install all components needed to run KorAP on a local machine with a single command.

In addition, the recipe installs all tools needed to convert and index corpora in the widely used TEI P5 (I5) format for KorAP. For the options each tool supports, see the respective repositories.

Requirements

Install Docker and Docker Compose.
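
Before starting, it can help to confirm that both tools are on the PATH. The following is only a minimal sketch; the helper name have is hypothetical and not part of this repository.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given command is on the PATH.
have() {
    command -v "$1" >/dev/null 2>&1
}

# Report the status of both requirements without aborting.
for tool in docker docker-compose; do
    if have "$tool"; then
        echo "$tool: found"
    else
        echo "$tool: missing"
    fi
done
```

If Compose is installed as a Docker plugin, docker-compose may be reported as missing even though docker compose works.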

Starting

To download, initialize, and run KorAP with an existing index (in this example, the directory index in the current working directory), run

$ INDEX=./index docker-compose --profile=lite up

This makes the frontend available at localhost:64543.
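
The containers may need a moment before the frontend answers. A small polling loop along the following lines can be used to wait for it. This is a sketch under assumptions: wait_for_frontend is a hypothetical helper, it requires curl, and it assumes that the frontend answers a plain GET request on its root URL with a success status.

```shell
#!/bin/sh
# Hypothetical helper: poll a URL until it answers or the attempts run out.
# Arguments: URL, number of attempts (default 30), delay in seconds (default 2).
wait_for_frontend() {
    url=$1
    attempts=${2:-30}
    delay=${3:-2}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # -s: silent, -f: treat HTTP errors as failure, -o: discard the body
        if curl -sf -o /dev/null "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}

# Usage, once the containers are starting:
#   wait_for_frontend http://localhost:64543 && echo "KorAP is up"
```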

To run the service with an additional user management system, initialize and start the service with

$ INDEX=./index docker-compose --profile=init up
$ INDEX=./index docker-compose --profile=full up

The init step creates a file called super_client_info in the current directory that acts as a shared secret between the frontend and the backend. To enable this in Kalamar, the configuration file kalamar.production.conf must point to the mounted file, so it requires a configuration along the lines of

{
    Kalamar => {
        plugins  => ['Auth']
    },
    'Kalamar-Auth' => {
        client_file => '/kalamar/super_client_info'
    }
}
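
Because the full profile depends on the file written by the init step, a guard like the following can catch a skipped or failed init. It is a sketch only; require_secret is a hypothetical helper, and it checks nothing beyond the file existing and being non-empty.

```shell
#!/bin/sh
# Hypothetical guard: succeed only if the shared-secret file exists
# and is non-empty. Defaults to super_client_info in the current directory.
require_secret() {
    file=${1:-super_client_info}
    if [ -s "$file" ]; then
        return 0
    fi
    echo "missing or empty: $file (run the init profile first)" >&2
    return 1
}

# Usage:
#   require_secret && INDEX=./index docker-compose --profile=full up
```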

Corpus Conversion

In order to create an index based on existing corpus data, some conversion steps are usually necessary. In the case of a conversion from TEI P5 (I5) format, the tools required for this have already been installed with the command above.

In the following we take the open part of the Dortmunder Chatkorpus 2.2 (Beißwenger & Storrer 2008) as an example to build an index.

The file is located at example/dck-part1.i5.xml.

The command ...

$ docker run --rm \
  -v ${PWD}/example:/data:z korap/kalamar:latest-conv \
  tei2korapxml \
  --inline-tokens '!cmc#morpho' \
  --input /data/dck-part1.i5.xml > dck.zip

... will convert the i5 file into a KorAP-XML file using tei2korapxml.

This format is designed so that further arbitrary annotations can be added to the primary data. In this example, however, we stick with the inline annotations that the example corpus already contains and make them available later under the label cmc.
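
Because the conversion writes the archive to standard output, a failed run can leave an empty or truncated dck.zip behind. The following quick sanity check relies only on the fact that ZIP archives begin with the bytes PK; looks_like_zip is a hypothetical helper, and passing the check does not prove that the archive is complete.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the file is non-empty and starts
# with the ZIP signature "PK".
looks_like_zip() {
    [ -s "$1" ] || return 1
    # Read the first two bytes and compare them with the signature.
    magic=$(head -c 2 "$1")
    [ "$magic" = "PK" ]
}

# Usage after the conversion:
#   looks_like_zip dck.zip && unzip -l dck.zip | head
```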

To convert the KorAP-XML archive into individual Krill-compatible JSON files in a second step, the following command ...

$ mkdir json
$ docker run --rm -u root \
  -v ${PWD}:/kalamar/data:z korap/kalamar:latest-conv \
  korapxml2krill archive \
  -z \
  -i /kalamar/data/dck.zip \
  --jobs -1 \
  --token 'cmc#morpho' \
  --base-paragraphs 'DeReKo#Structure' \
  --base-sentences 'DeReKo#Structure' \
  -o ./data/json/

... will use korapxml2krill.

Depending on how the source data is designed, different parameters must be specified for the conversion.

Here, the inline token annotation is used as the basis for word tokenization, and the included document structure is used for default annotation of sentence and paragraph boundaries.
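
To see whether the conversion produced anything, the files in the output directory can be counted. This sketch assumes that the output names end in .json or .json.gz (the latter on the assumption that -z compresses the output, which the korapxml2krill documentation should confirm); count_json is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: count converted files in a directory.
# Assumes output names end in .json or .json.gz.
count_json() {
    find "$1" -type f \( -name '*.json' -o -name '*.json.gz' \) | wc -l | tr -d ' '
}

# Usage after the conversion:
#   echo "$(count_json json) files converted"
```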

Index Creation

Krill's indexer tool can now be used to index the JSON files:

$ mkdir index
$ docker run -u root --rm -v ${PWD}:/data:z korap/kustvakt \
  Krill-Indexer.jar -c /kustvakt/kustvakt-lite.conf \
  -i /data/json -o /data/index/

After that, the index can be loaded with the docker-compose call from the Starting section (INDEX=./index) and searched via the browser.
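
The conversion and indexing steps can also be chained in one script. The following is a sketch, not part of this repository: the names run, run_to, build_index and the DRY_RUN variable are hypothetical, and mkdir -p is used in place of mkdir so that the script can be rerun. By default it only prints the commands; setting DRY_RUN=0 executes them.

```shell
#!/bin/sh
# Sketch: chain the conversion and indexing commands from this README.
# By default the commands are only printed; set DRY_RUN=0 to execute them.
set -eu
DRY_RUN=${DRY_RUN:-1}

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helper: like run, but redirect standard output to a file.
run_to() {
    out=$1
    shift
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $* > $out"
    else
        "$@" > "$out"
    fi
}

build_index() {
    # Step 1: TEI P5 (I5) to KorAP-XML
    run_to dck.zip docker run --rm \
        -v "${PWD}/example:/data:z" korap/kalamar:latest-conv \
        tei2korapxml --inline-tokens '!cmc#morpho' \
        --input /data/dck-part1.i5.xml

    run mkdir -p json index

    # Step 2: KorAP-XML to Krill-compatible JSON
    run docker run --rm -u root \
        -v "${PWD}:/kalamar/data:z" korap/kalamar:latest-conv \
        korapxml2krill archive -z -i /kalamar/data/dck.zip \
        --jobs -1 --token 'cmc#morpho' \
        --base-paragraphs 'DeReKo#Structure' \
        --base-sentences 'DeReKo#Structure' \
        -o ./data/json/

    # Step 3: index the JSON files
    run docker run -u root --rm -v "${PWD}:/data:z" korap/kustvakt \
        Krill-Indexer.jar -c /kustvakt/kustvakt-lite.conf \
        -i /data/json -o /data/index/
}

build_index
```

With set -eu, a real run stops at the first failing step, so a broken conversion does not silently produce an empty index.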

Development and License

Authors: Nils Diewald, Harald Lüngen, Marc Kupietz

KorAP-Docker is published under the BSD-2 License.

The example corpus corresponds to the released part of the Dortmunder Chatkorpus 2.2 as prepared by DeReKo. The corpus is released under the CC BY 4.0 License. Legal restrictions may arise from data protection legislation.

Bibliography

Beißwenger, Michael / Storrer, Angelika (2008): Corpora of Computer-Mediated Communication. In: Anke Lüdeling & Merja Kytö (Eds.): Corpus Linguistics. An International Handbook. Volume 1. Berlin, New York (Handbooks of Linguistics and Communication Science 29.1), pp. 292–308.