Reimplementation and future replacement of the Perl version

korapxmltool

Tool package to convert and annotate KorAP-XML ZIP files.

Up to 200 times faster and more accurate drop-in replacement for the korapxml2conllu part of KorAP-XML-CoNLL-U.

For some conversion tasks, however, you currently need the conllu2korapxml part of KorAP-XML-CoNLL-U.

Download

You can download the latest jar build from the build artifacts here.

Build it yourself

./gradlew shadowJar

Conversion to CoNLL-U format

$ java -jar ./app/build/libs/korapxmltool.jar app/src/test/resources/wdf19.zip | head -10

# foundry = base
# filename = WDF19/A0000/13072/base/tokens.xml
# text_id = WDF19_A0000.13072
# start_offsets = 0 0 14 17 25 30 35 42 44 52 60 73
# end_offsets = 74 12 16 24 29 34 41 43 51 59 72 74
1	Australasien	_	_	_	_	_	_	_	_
2	on	_	_	_	_	_	_	_	_
3	devrait	_	_	_	_	_	_	_	_
4	peut	_	_	_	_	_	_	_	_
5	être	_	_	_	_	_	_	_	_
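Because the output is plain CoNLL-U on standard output, it can be piped into ordinary text tools. The following is a minimal sketch, not part of korapxmltool: `count_tokens_per_text` is a hypothetical helper that assumes every text is introduced by a `# text_id = ...` comment line, as in the output above, and that token lines are tab-separated with a plain integer ID in the first column.

```shell
# Hypothetical helper (not part of korapxmltool): count token lines per text.
# Assumes the "# text_id = ..." comment format shown in the example output.
count_tokens_per_text() {
  awk -F'\t' '
    # Remember the id of the text that is currently being read.
    /^# text_id = / { id = substr($0, 13); next }
    # Count lines whose first column is a plain integer token id.
    $1 ~ /^[0-9]+$/ { count[id]++ }
    END { for (id in count) print id "\t" count[id] }
  ' | sort
}
```

It could be used as `java -jar ./app/build/libs/korapxmltool.jar app/src/test/resources/wdf19.zip | count_tokens_per_text`. Lines with range or decimal IDs (for example `1-2`) are deliberately ignored in this sketch.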

Conversion from KorAP-XML to language model training input format

$ java -jar ./app/build/libs/korapxmltool.jar --word2vec t/data/wdf19.zip

Arts visuels Pourquoi toujours vouloir séparer BD et Manga ?
Ffx 18:20 fév 25 , 2003 ( CET ) soit on ne sépara pas , soit alors on distingue aussi , le comics , le manwa , le manga ..
la bd belge et touts les auteurs européens ..
on commence aussi a parlé de la bd africaine et donc ...
wikipedia ce prete parfaitement à ce genre de decryptage .

Example producing language model training input with preceding metadata columns

$ java -jar ./app/build/libs/korapxmltool.jar -m '<textSigle>([^<]+)' -m '<creatDate>([^<]+)' --word2vec t/data/wdf19.zip
WDF19/A0000.10894	2014.08.28	Arts visuels Pourquoi toujours vouloir séparer BD et Manga ?
WDF19/A0000.10894	2014.08.28	Ffx 18:20 fév 25 , 2003 ( CET ) soit on ne sépara pas , soit alors on distingue aussi , le comics , le manwa , le manga ..
WDF19/A0000.10894	2014.08.28	la bd belge et touts les auteurs européens ..
WDF19/A0000.10894	2014.08.28	on commence aussi a parlé de la bd africaine et donc ...
WDF19/A0000.10894	2014.08.28	wikipedia ce prete parfaitement à ce genre de decryptage .
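The metadata columns make it easy to filter the training data before it reaches a language model toolkit. The following is a rough sketch under assumptions taken from the example above: columns are tab-separated, the second column holds the creation date in `YYYY.MM.DD` form, and the third column holds the sentence. `sentences_from_year` is a hypothetical helper, not an option of korapxmltool.

```shell
# Hypothetical helper (not part of korapxmltool): print only the sentence text
# of lines whose second column (creation date, YYYY.MM.DD) falls in a given year.
sentences_from_year() {
  awk -F'\t' -v year="$1" 'index($2, year ".") == 1 { print $3 }'
}
```

For instance, appending `| sentences_from_year 2014` to the command above would keep only sentences from texts created in 2014 and drop the metadata columns again.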

Annotation

Tagging with the integrated MarMoT POS tagger, writing directly to a new KorAP-XML ZIP file

You need to download the pre-trained MarMoT models from here.

$ java -jar ./app/build/libs/korapxmltool.jar -f zip -t marmot:models/de.marmot app/src/test/resources/goe.zip

Tagging with the integrated OpenNLP POS tagger, writing directly to a new KorAP-XML ZIP file

You need to download the pre-trained OpenNLP models from here or older models from here.

$ java -jar ./app/build/libs/korapxmltool.jar -f zip -t opennlp:/usr/local/kl/korap/Ingestion/lib/models/opennlp/de-pos-maxent.bin /tmp/zca24.zip

Tag and lemmatize with TreeTagger

This requires the TreeTagger Docker Image with CoNLL-U Support. Language models are downloaded automatically.

$ java -jar app/build/libs/korapxmltool.jar app/src/test/resources/wdf19.zip | docker run --rm -i korap/conllu2treetagger -l french | conllu2korapxml

Tag and lemmatize with spaCy

This requires the spaCy Docker Image with CoNLL-U Support and is only available for German.

$ java -jar app/build/libs/korapxmltool.jar app/src/test/resources/goe.zip | docker run --rm -i korap/conllu2spacy | conllu2korapxml > goe.spacy.zip
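The TreeTagger and spaCy pipelines each chain three external programs (`java`, `docker` and `conllu2korapxml`), and a missing one only shows up once the pipe is already running. As a small, optional sketch, the hypothetical helper `require_cmds` below checks that the commands are on the `PATH` before a pipeline is started; it is not part of korapxmltool.

```shell
# Hypothetical helper (not part of korapxmltool): report every given command
# that cannot be found on PATH; returns non-zero if at least one is missing.
require_cmds() {
  missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A pipeline could then be guarded with `require_cmds java docker conllu2korapxml && ...`, so that it is only started when all three tools are available.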

Parsing

Using the integrated MaltParser

You need to download the pre-trained MaltParser models from here. Note that parsers take POS-tagged input.

$ java -jar ./app/build/libs/korapxmltool.jar -f zip -T2 -P malt:libs/german.mco goe.tree_tagger.zip

Development and License

Author:

Copyright (c) 2024-2025, Leibniz Institute for the German Language, Mannheim, Germany

This package is developed as part of the KorAP Corpus Analysis Platform at the Leibniz Institute for the German Language (IDS).

It is published under the GNU General Public License, Version 3, 29 June 2007.

Contributions

Contributions are very welcome!

Your contributions should ideally be committed via our Gerrit server to facilitate reviewing (see Gerrit Code Review - A Quick Introduction if you are not familiar with Gerrit). However, we are also happy to accept comments and pull requests via GitHub.