Add export recipe

Change-Id: I5dd2dce3a83fdab0702f11dea45287847a4e0b8b

Update Kalamar to v0.64

Change-Id: Ie5ea8d6bb02b8aa910e736ed04c4b63a9a628aed

Add CSP exception

Better add another stub in the future.

Change-Id: Id641b9679c00b76019e6b315e864e56566672f85

Pin export plugin to 0.3.4

Change-Id: I0fbd3af58474b6d3ce911ef5db79a620e1b2f5e4

Add export plugin config via env variables

Change-Id: I3a641142480568941b07d3039a030b3a0e49d99f

Fix export icon escaping

Change-Id: I893b4da02969e2040d9a43bc6998015ecd1679f0

Update Readme.md with export plugin instructions

Change-Id: I38c1335f10c1b676f03e4d1a97985c27f29aad44

Bump actions/cache from 4 to 5

Bumps [actions/cache](https://github.com/actions/cache) from 4 to 5.
- [Release notes](https://github.com/actions/cache/releases)
- [Changelog](https://github.com/actions/cache/blob/main/RELEASES.md)
- [Commits](https://github.com/actions/cache/compare/v4...v5)

---
updated-dependencies:
- dependency-name: actions/cache
  dependency-version: '5'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

Closes #17

Change-Id: I7df4dc89e16985c88922b1d6607eb749f507e924

Bump actions/checkout from 4 to 6

Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 6.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](https://github.com/actions/checkout/compare/v4...v6)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>

Closes #15

Change-Id: Iac23c6ef1526c0d18ae0d5fa715c137172149c6d

KorAP-Docker

The KorAP Corpus Analysis Platform consists of several independent components, but they can easily be installed together using Docker. This repository contains a recipe to install all components needed to run KorAP on a local machine with a single command.

In addition, all tools needed to convert and index corpora in the widely used TEI-P5 (I5) format for KorAP are installed and made available. For the options of these tools, see their respective repositories.

Requirements

Install Docker and Docker Compose (v2 or later, as a CLI plugin).

Starting

To get KorAP running, an index is required. For testing, an example index is available as a Docker image. Run

INDEX='example-index' docker compose -p korap --profile=lite --profile=example up

to start the example image and the service on Linux (see the Windows section below for Windows).

To include the export plugin, add the export profile:

COMPOSE_PROFILES="export" INDEX='example-index' docker compose -p korap --profile=lite --profile=example --profile=export up

Alternatively, you can use the sample index provided by Kustvakt. To download and initialize it and run KorAP pointing to that index folder (in this example, the index folder in the current directory), run

INDEX=./index docker compose -p korap --profile=lite up

This makes the frontend available at localhost:64543.
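Because the containers need a moment to come up, it can be handy to wait until the frontend answers before opening the browser. The following is a minimal sketch, not part of KorAP-Docker: `wait_for_frontend` and the `PROBE` variable are hypothetical names, and the `http://` scheme is an assumption (the text above only gives host and port).

```shell
#!/bin/sh
# Hypothetical helper: poll a URL until it answers or the attempts run out.
# PROBE can be overridden for testing; by default curl is used, where
# -f fails on HTTP errors, -sS is quiet but still reports errors, and
# -o /dev/null discards the response body.
wait_for_frontend() {
  url="$1"
  tries="${2:-30}"   # number of attempts (assumed default)
  delay="${3:-2}"    # seconds between attempts (assumed default)
  i=0
  while [ "$i" -lt "$tries" ]; do
    if ${PROBE:-curl -fsS -o /dev/null} "$url"; then
      echo "frontend is up at $url"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "frontend not reachable at $url" >&2
  return 1
}

# Usage, after starting the services in another terminal:
#   wait_for_frontend http://localhost:64543
```

The probe is kept overridable so the loop itself can be exercised without a running service.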

To use your own index, please follow the instructions on Corpus Conversion first.

To run the service with a user management system, first create a directory named data in your working directory, then start it with

INDEX=./index docker compose -p korap --profile=full up

Log in with user1 and password1. To change authentication settings, see the /kustvakt/ldap folder inside the Docker container and Kustvakt's LDAP Settings Wiki for documentation.
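The start-up variants above differ only in the value of INDEX and the set of profiles. As an illustration, the sketch below prints the corresponding command line instead of running it, so it can be inspected first. `korap_compose_cmd` is a hypothetical helper, not something shipped with this repository, and it only covers INDEX and the `--profile` flags (it does not set COMPOSE_PROFILES, which the export example sets in addition).

```shell
#!/bin/sh
# Hypothetical helper: print the `docker compose` call for an index
# and a list of profiles. First argument is the INDEX value, the
# remaining arguments are profile names.
korap_compose_cmd() {
  index="$1"
  shift
  cmd="INDEX='$index' docker compose -p korap"
  for profile in "$@"; do
    cmd="$cmd --profile=$profile"
  done
  printf '%s up\n' "$cmd"
}

# Example index image with the lite profile
korap_compose_cmd example-index lite example
# The same, with the export plugin profile added
korap_compose_cmd example-index lite example export
# Local index folder with user management
korap_compose_cmd ./index full
```

Printing rather than executing keeps the sketch safe to run on a machine without Docker.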

Corpus Conversion

In order to create an index based on existing corpus data, some conversion steps are usually necessary. In the case of a conversion from TEI P5 (I5) format, the tools required for this have already been installed with the command above.

In the following we take the open part of the Dortmunder Chatkorpus 2.2 (Beißwenger & Storrer 2008) as an example to build an index.

The file is located at example/dck-part1.i5.xml.

The command ...

docker run --rm \
  -v ${PWD}/example:/data:z korap/kalamar:latest \
  tei2korapxml \
  --inline-tokens '!cmc#morpho' \
  --no-tokenizer \
  --input /data/dck-part1.i5.xml \
  --output dck.zip

... will convert the i5 file into a KorAP-XML file using tei2korapxml.

This format is designed to add further arbitrary annotations to the primary data. In this example, however, we will stick with the inline annotations that the example corpus already contains and will make available later under the label cmc.

To convert the KorAP-XML archive into individual Krill-compatible JSON files in a second step, the following command ...

mkdir json
docker run --rm -u root \
  -v ${PWD}:/kalamar/data:z korap/kalamar:latest \
  korapxml2krill archive \
  --gzip \
  --input /kalamar/data/dck.zip \
  --jobs -1 \
  --token 'cmc#morpho' \
  --base-paragraphs 'DeReKo#Structure' \
  --base-sentences 'DeReKo#Structure' \
  --output ./data/json/

... will use korapxml2krill.

Depending on how the source data is designed, different parameters must be specified for the conversion.

Here, the inline token annotation is used as the basis for word tokenization, and the included document structure is used for default annotation of sentence and paragraph boundaries.

Index Creation

Krill's indexer tool can now be used to index the JSON files:

mkdir index
docker run -u root --rm -v ${PWD}:/data:z korap/kustvakt \
  Krill-Indexer.jar -c /kustvakt/kustvakt-lite.conf \
  -i /data/json -o /data/index/

After that, the index can be loaded with the aforementioned call and is searchable via the browser.
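Since indexing depends on the output of the conversion step, a small pre-flight check can catch an empty json folder before the indexer is started. This is only a sketch under assumptions: `dir_has_entries` and `check_index_inputs` are hypothetical helpers, and the folder names follow the examples above.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the directory exists and
# contains at least one entry (hidden files included).
dir_has_entries() {
  [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# Hypothetical helper: check that converted files are present and
# that the output directory for the index has been created.
check_index_inputs() {
  json_dir="$1"
  index_dir="$2"
  if ! dir_has_entries "$json_dir"; then
    echo "no converted files in $json_dir" >&2
    return 1
  fi
  if [ ! -d "$index_dir" ]; then
    echo "missing output directory $index_dir" >&2
    return 1
  fi
  echo "ready to index $json_dir into $index_dir"
}

# Usage, from the directory that holds json/ and index/:
#   check_index_inputs json index && <indexer command from above>
```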

Windows

On Windows with PowerShell, environment variables have to be passed in a different way. In addition, the PWD variable is not set by default. To run, e.g., the KorAP one-liner on Windows, use

$env:INDEX='example-index'; $env:PWD='.'; docker compose -p korap --profile=lite --profile=example up

Development and License

Authors: Nils Diewald, Harald Lüngen, Marc Kupietz

Copyright (c) 2022-2025, IDS Mannheim, Germany

KorAP-Docker is published under the BSD-2 License.

The example corpus corresponds to the release part of the Dortmunder Chatkorpus 2.2 as prepared by DeReKo. The corpus is released under the CC BY 4.0 License. Legal restrictions may arise from data protection legislation.

Bibliography

Beißwenger, Michael / Storrer, Angelika (2008): Corpora of Computer-Mediated Communication. In: Anke Lüdeling & Merja Kytö (Eds): Corpus Linguistics. An International Handbook. Volume 1. Berlin. New York (Handbooks of Linguistics and Communication Science 29.1), pp. 292–308.