commit 5b16f657e16c7d1a8a2161a0472484bcc9e0cf7b
Author: Marc Kupietz <kupietz@ids-mannheim.de>
Date:   Tue Nov 18 06:07:31 2025 +0100

    Use thread-local document builders in krill output

    Change-Id: I123d5b7015ae4f6ac361884c158794640518be23
Converts between the KorAP-XML ZIP format and formats such as CoNLL-U, Krill, word2vec, and NOW, and annotates KorAP-XML ZIPs with various taggers and parsers.
It is a drop-in replacement for korapxml2conllu (KorAP-XML-CoNLL-U) and korapxml2krill (KorAP-XML-Krill).
./gradlew build
After building, the executable will be available at ./build/bin/korapxmltool.
$ ./build/bin/korapxmltool app/src/test/resources/wdf19.zip | head -10
# foundry = base
# filename = WDF19/A0000/13072/base/tokens.xml
# text_id = WDF19_A0000.13072
# start_offsets = 0 0 14 17 25 30 35 42 44 52 60 73
# end_offsets = 74 12 16 24 29 34 41 43 51 59 72 74
1 Australasien _ _ _ _ _ _ _ _
2 on _ _ _ _ _ _ _ _
3 devrait _ _ _ _ _ _ _ _
4 peut _ _ _ _ _ _ _ _
5 être _ _ _ _ _ _ _ _
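The start_offsets/end_offsets comment lines give 0-based character spans into the original text, so each token can be recovered by its offsets. A minimal shell sketch (using a made-up text fragment, not the actual corpus file):

```shell
# Offsets are 0-based; cut -c is 1-based, so span [0,12) maps to characters 1..12.
text='Australasien on devrait'
printf '%s\n' "$text" | cut -c1-12
# prints: Australasien
```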
$ ./build/bin/korapxmltool --word2vec t/data/wdf19.zip
Arts visuels Pourquoi toujours vouloir séparer BD et Manga ?
Ffx 18:20 fév 25 , 2003 ( CET ) soit on ne sépara pas , soit alors on distingue aussi , le comics , le manwa , le manga ..
la bd belge et touts les auteurs européens ..
on commence aussi a parlé de la bd africaine et donc ...
wikipedia ce prete parfaitement à ce genre de decryptage .
…
./build/bin/korapxmltool -m '<textSigle>([^<]+)' -m '<creatDate>([^<]+)' --word2vec t/data/wdf19.zip
WDF19/A0000.10894 2014.08.28 Arts visuels Pourquoi toujours vouloir séparer BD et Manga ?
WDF19/A0000.10894 2014.08.28 Ffx 18:20 fév 25 , 2003 ( CET ) soit on ne sépara pas , soit alors on distingue aussi , le comics , le manwa , le manga ..
WDF19/A0000.10894 2014.08.28 la bd belge et touts les auteurs européens ..
WDF19/A0000.10894 2014.08.28 on commence aussi a parlé de la bd africaine et donc ...
WDF19/A0000.10894 2014.08.28 wikipedia ce prete parfaitement à ce genre de decryptage .
One text per line with <p> as sentence delimiter.
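Because each text occupies a single line with <p> marking sentence boundaries, standard tools can recover the sentence segmentation. A sketch (with invented example text, not actual corpus output):

```shell
# Split a NOW-style line back into one sentence per line at the <p> delimiter.
printf 'Première phrase . <p> Deuxième phrase .\n' \
  | awk '{ gsub(/ <p> /, "\n"); print }'
# prints:
# Première phrase .
# Deuxième phrase .
```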
./build/bin/korapxmltool -f now /vol/corpora/DeReKo/current/KorAP/zip/*24.zip | pv > dach24.txt
If lemma annotations (morpho layer) are present alongside the base tokens, you can output lemmas instead of surface tokens with --lemma.
# Word2Vec style output with lemmas where available
./build/bin/korapxmltool --lemma -f w2v app/src/test/resources/goe.tree_tagger.zip | head -3

# NOW corpus style output with lemmas
./build/bin/korapxmltool --lemma -f now app/src/test/resources/goe.tree_tagger.zip | head -1
If the lemma for a token is missing (_), the surface form is used as a fallback.
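For illustration, the fallback rule can be emulated in the shell on CoNLL-U-style tab-separated input (surface form in column 2, lemma in column 3); this is a sketch of the behavior, not the tool's implementation:

```shell
# If the lemma column holds "_", fall back to the surface form.
printf '1\tHäuser\tHaus\n2\tKorAP\t_\n' \
  | awk -F'\t' '{ if ($3 == "_") print $2; else print $3 }'
# prints:
# Haus
# KorAP
```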
--lemma-only: For -f w2v and -f now, skip loading data.xml and output only lemmas from morpho.xml. This reduces memory use and speeds up throughput.

--sequential: Process entries inside each zip sequentially (zips can still run in parallel). Recommended for w2v/now to keep locality and lower memory use.

--zip-parallelism N: Limit how many zips are processed concurrently (defaults to --threads). Helps avoid disk thrashing and native inflater pressure.

--exclude-zip-glob GLOB (repeatable): Skip zip basenames that match the glob (e.g., --exclude-zip-glob 'w?d24.tree_tagger.zip').

Example for a large NOW export with progress display and exclusions:
KORAPXMLTOOL_XMX=64g KORAPXMLTOOL_MODELS_PATH=/data/models KORAPXMLTOOL_JAVA_OPTS="-XX:+UseG1GC -Djdk.util.zip.disableMemoryMapping=true -Djdk.util.zip.reuseInflater=true" \
./build/bin/korapxmltool -l info --threads 100 --zip-parallelism 8 \
--lemma-only --sequential -f now \
--exclude-zip-glob 'w?d24.tree_tagger.zip' \
/vol/corpora/DeReKo/current/KorAP/zip/*24.tree_tagger.zip | pv > dach2024.lemma.txt
At INFO level the tool logs progress information (including whether --lemma-only is in effect).

Generate a tar archive containing gzipped Krill/KoralQuery JSON files across all provided foundries:
./build/bin/korapxmltool -f krill -D out/krill \
  app/src/test/resources/wud24_sample.zip \
  app/src/test/resources/wud24_sample.spacy.zip \
  app/src/test/resources/wud24_sample.marmot-malt.zip
This writes out/krill/wud24_sample.krill.tar plus a log file. Add more annotated KorAP-XML zips (e.g., TreeTagger, CoreNLP) to merge their layers into the same Krill export; use --non-word-tokens if punctuation should stay in the token stream.
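As a sketch of the archive layout, the tar bundles one gzipped KoralQuery JSON file per text, which can be listed and decompressed with standard tools. The file and directory names below are illustrative stand-ins, not taken from an actual export:

```shell
# Build a stand-in archive with one gzipped JSON per text, then inspect it.
mkdir -p demo/krill
printf '{"textSigle":"WUD24/A0000.00001"}\n' \
  | gzip > demo/krill/WUD24-A0000.00001.json.gz
tar -cf demo/wud24_sample.krill.tar -C demo/krill .
tar -tf demo/wud24_sample.krill.tar               # list contained files
gunzip -c demo/krill/WUD24-A0000.00001.json.gz    # show one record
```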
You need to download the pre-trained MarMoT models from the MarMoT models repository.
You can specify the full path to the model, or set the KORAPXMLTOOL_MODELS_PATH environment variable to specify a default search directory. If not set, KORAPXMLTOOL_MODELS_PATH defaults to ../lib/models relative to the executable location.
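The resolution order described above can be sketched as a small shell function (purely an illustration of the documented behavior, not the tool's actual code; `resolve_model` is a hypothetical name):

```shell
# Return the path the tool would try for a model argument:
# an existing path is used as-is; otherwise the name is looked up in the
# models directory, with KORAPXMLTOOL_MODELS_PATH defaulting to ../lib/models.
resolve_model() {
  if [ -f "$1" ]; then
    printf '%s\n' "$1"
  else
    printf '%s/%s\n' "${KORAPXMLTOOL_MODELS_PATH:-../lib/models}" "$1"
  fi
}

KORAPXMLTOOL_MODELS_PATH=/data/models resolve_model de.marmot
# -> /data/models/de.marmot (assuming ./de.marmot does not exist)
```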
# With full path
./build/bin/korapxmltool -f zip -t marmot:models/de.marmot app/src/test/resources/goe.zip

# With KORAPXMLTOOL_MODELS_PATH (searches in /data/models/ if model not found locally)
export KORAPXMLTOOL_MODELS_PATH=/data/models
./build/bin/korapxmltool -f zip -t marmot:de.marmot app/src/test/resources/goe.zip

# Without setting KORAPXMLTOOL_MODELS_PATH (uses default ../lib/models relative to the executable)
./build/bin/korapxmltool -f zip -t marmot:de.marmot app/src/test/resources/goe.zip
You need to download the pre-trained OpenNLP models from the OpenNLP model download page or older models from the legacy OpenNLP models archive.
./build/bin/korapxmltool -f zip -t opennlp:/usr/local/kl/korap/Ingestion/lib/models/opennlp/de-pos-maxent.bin /tmp/zca24.zip
This requires the TreeTagger Docker Image with CoNLL-U Support. Language models are downloaded automatically.
./build/bin/korapxmltool app/src/test/resources/wdf19.zip | docker run --rm -i korap/conllu2treetagger -l french | conllu2korapxml
This requires the spaCy Docker Image with CoNLL-U Support and is only available for German.
./build/bin/korapxmltool -T4 -A "docker run -e SPACY_USE_DEPENDENCIES=False --rm -i korap/conllu2spacy:latest" -f zip ./app/src/test/resources/goe.zip
./build/bin/korapxmltool -T4 -A "docker run -e SPACY_USE_DEPENDENCIES=True --rm -i korap/conllu2spacy:latest" -f zip ./app/src/test/resources/goe.zip
Download the Stanford CoreNLP v3.X POS tagger and constituency parser models (e.g., german-fast.tagger and germanSR.ser.gz) into libs/.
./build/bin/korapxmltool -f zip -D out \
  -t corenlp:libs/german-fast.tagger \
  -P corenlp:libs/germanSR.ser.gz \
  app/src/test/resources/wud24_sample.zip
The resulting out/wud24_sample.corenlp.zip contains corenlp/morpho.xml and corenlp/constituency.xml alongside the base tokens.
You need to download the pre-trained MaltParser models from the MaltParser model repository. Note that the parser requires POS-tagged input.
./build/bin/korapxmltool -f zip -T2 -P malt:german.mco goe.tree_tagger.zip
./build/bin/korapxmltool -f zip -t marmot:models/de.marmot -P malt:german.mco goe.zip
Author:
Copyright (c) 2024-2025, Leibniz Institute for the German Language, Mannheim, Germany
This package is developed as part of the KorAP Corpus Analysis Platform at the Leibniz Institute for the German Language (IDS).
It is published under the GNU General Public License, Version 3, 29 June 2007.
Contributions are very welcome!
Your contributions should ideally be committed via our Gerrit server to facilitate reviewing (see Gerrit Code Review - A Quick Introduction if you are not familiar with Gerrit). However, we are also happy to accept comments and pull requests via GitHub.