Docker image for spaCy POS tagging, lemmatization and dependency parsing with support for input and output in CoNLL-U format.
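For orientation, each token in CoNLL-U output occupies one line of ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). A minimal sketch of that mapping (the helper function is illustrative, not part of this image):

```python
def conllu_line(idx, form, lemma, upos, xpos="_", feats="_", head=0, deprel="_"):
    """Format one token as a 10-column CoNLL-U line (DEPS and MISC left empty)."""
    return "\t".join(
        [str(idx), form, lemma, upos, xpos, feats, str(head), deprel, "_", "_"]
    )

# Example: the German token "Häuser" lemmatized to "Haus", tagged NOUN
print(conllu_line(1, "Häuser", "Haus", "NOUN", xpos="NN", head=2, deprel="nsubj"))
```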
This is a slim, focused implementation extracted from sota-pos-lemmatizers, originally developed by José Angel Daza, following the same pattern as conllu-treetagger-docker.
```shell
docker pull korap/conllu-spacy
```
```shell
git clone https://github.com/KorAP/conllu-spacy-docker.git
cd conllu-spacy-docker
make
```
```shell
# Default: German model with dependency parsing and GermaLemma
docker run --rm -i korap/conllu-spacy < input.conllu > output.conllu
```
```shell
# Disable dependency parsing for faster processing
docker run --rm -i korap/conllu-spacy -d < input.conllu > output.conllu
```
```shell
# Use a smaller German model
docker run --rm -i korap/conllu-spacy -m de_core_news_sm < input.conllu > output.conllu

# Use French model
docker run --rm -i korap/conllu-spacy -m fr_core_news_lg < input.conllu > output.conllu

# Use English model (disable GermaLemma for non-German)
docker run --rm -i korap/conllu-spacy -m en_core_web_lg -g < input.conllu > output.conllu
```
To avoid downloading the language model on every run, mount a local directory to /local/models:
```shell
chmod 777 /path/to/local/models
docker run --rm -i -v /path/to/local/models:/local/models korap/conllu-spacy < input.conllu > output.conllu
```
The first run will download the model to /path/to/local/models/, and subsequent runs will reuse it.
There are several ways to preload models before running the container:
```shell
# Preload the default model (de_core_news_lg)
./preload-models.sh

# Preload a specific model
./preload-models.sh de_core_news_sm

# Preload to a custom directory
./preload-models.sh de_core_news_lg /path/to/models

# Then run with the preloaded models
docker run --rm -i -v ./models:/local/models korap/conllu-spacy < input.conllu
```
```shell
# Build an image with models pre-installed
docker build -f Dockerfile.with-models -t korap/conllu-spacy:with-models .

# Run without needing to mount volumes
docker run --rm -i korap/conllu-spacy:with-models < input.conllu > output.conllu
```
Edit `Dockerfile.with-models` to include additional models (sm, md) by uncommenting the relevant lines.
```shell
# Create models directory
mkdir -p ./models

# Download using a temporary container
docker run --rm -v ./models:/models python:3.12-slim bash -c "
  pip install -q spacy &&
  python -m spacy download de_core_news_lg &&
  python -c 'import spacy, shutil, site; shutil.copytree(site.getsitepackages()[0] + \"/de_core_news_lg\", \"/models/de_core_news_lg\")'
"

# Use the preloaded model
docker run --rm -i -v ./models:/local/models korap/conllu-spacy < input.conllu
```
korapxmltool, which includes korapxml2conllu as a shortcut, can be downloaded from https://github.com/KorAP/korapxmltool.
```shell
korapxml2conllu goe.zip | docker run --rm -i korap/conllu-spacy
```
```shell
korapxmltool -A "docker run --rm -i korap/conllu-spacy" -t zip goe.zip
```
```
Usage: docker run --rm -i korap/conllu-spacy [OPTIONS]

Options:
  -h         Display help message
  -m MODEL   Specify spaCy model (default: de_core_news_lg)
  -L         List available/installed models
  -V         Display spaCy version information
  -d         Disable dependency parsing (faster processing)
  -g         Disable GermaLemma (use spaCy lemmatizer only)
```
To check which version of conllu-spacy-docker and its components are installed:
```shell
docker run --rm korap/conllu-spacy -V
```
Example output:
```
=== Version Information ===
conllu-spacy-docker version: 3.8.11-1
spaCy version: 3.8.11
GermaLemma version: 0.1.3
Python version: 3.12.1
```
You can customize processing behavior with environment variables:
```shell
docker run --rm -i \
  -e SPACY_USE_DEPENDENCIES="False" \
  -e SPACY_USE_GERMALEMMA="True" \
  -e SPACY_CHUNK_SIZE="10000" \
  -e SPACY_BATCH_SIZE="1000" \
  -e SPACY_N_PROCESS="1" \
  -e SPACY_PARSE_TIMEOUT="30" \
  -e SPACY_MAX_SENTENCE_LENGTH="500" \
  korap/conllu-spacy < input.conllu > output.conllu
```
Available environment variables:
- `SPACY_USE_DEPENDENCIES`: Enable/disable dependency parsing (default: `"True"`)
- `SPACY_USE_GERMALEMMA`: Enable/disable GermaLemma (default: `"True"`)
- `SPACY_CHUNK_SIZE`: Number of sentences to process per chunk (default: 20000)
- `SPACY_BATCH_SIZE`: Batch size for spaCy processing (default: 2000)
- `SPACY_N_PROCESS`: Number of processes (default: 10)
- `SPACY_PARSE_TIMEOUT`: Timeout for dependency parsing per sentence in seconds (default: 30)
- `SPACY_MAX_SENTENCE_LENGTH`: Maximum sentence length for dependency parsing in tokens (default: 500)

```shell
# Fast processing: disable dependency parsing
docker run --rm -i korap/conllu-spacy -d < input.conllu > output.conllu

# Use spaCy lemmatizer only (without GermaLemma)
docker run --rm -i korap/conllu-spacy -g < input.conllu > output.conllu

# Smaller model for faster download
docker run --rm -i korap/conllu-spacy -m de_core_news_sm < input.conllu > output.conllu

# Persistent model storage
docker run --rm -i -v ./models:/local/models korap/conllu-spacy < input.conllu > output.conllu
```
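As a rough illustration of how such `"True"`/`"False"` strings and numeric settings might be interpreted inside the container (a sketch only; the helper names are hypothetical and the actual entrypoint script may differ):

```python
import os

def env_bool(name, default="True"):
    # Interpret "True"/"False" strings case-insensitively (assumed convention)
    return os.environ.get(name, default).strip().lower() == "true"

def env_int(name, default):
    # Numeric settings such as chunk and batch sizes
    return int(os.environ.get(name, str(default)))

use_dependencies = env_bool("SPACY_USE_DEPENDENCIES")   # default: True
use_germalemma = env_bool("SPACY_USE_GERMALEMMA")       # default: True
chunk_size = env_int("SPACY_CHUNK_SIZE", 20000)
batch_size = env_int("SPACY_BATCH_SIZE", 2000)
n_process = env_int("SPACY_N_PROCESS", 10)
parse_timeout = env_int("SPACY_PARSE_TIMEOUT", 30)
max_sentence_length = env_int("SPACY_MAX_SENTENCE_LENGTH", 500)
```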
List installed models:
```shell
docker run --rm -i korap/conllu-spacy -L
```
Open a shell within the container:
```shell
docker run --rm -it --entrypoint /bin/bash korap/conllu-spacy
```
Any spaCy model can be specified with the -m option. Models will be downloaded automatically on first use.
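Conceptually, the first-use check amounts to testing whether the model package is importable and downloading it if not. A stdlib sketch of that idea (illustrative only; the container's entrypoint and its `/local/models` cache handle this in practice, and spaCy itself provides `spacy.util.is_package`):

```python
import importlib.util
import subprocess
import sys

def model_installed(name):
    # A spaCy model ships as a regular Python package, so importability is a
    # workable proxy for "already downloaded" (sketch only)
    return importlib.util.find_spec(name) is not None

def ensure_model(name):
    # Download on first use, as `python -m spacy download` would (illustrative)
    if not model_installed(name):
        subprocess.run([sys.executable, "-m", "spacy", "download", name], check=True)
```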
spaCy provides trained models for 70+ languages. See spaCy Models for the complete list.
- `de_core_news_lg` (default, 560MB) - Large model, best accuracy
- `de_core_news_md` (100MB) - Medium model, balanced
- `de_core_news_sm` (15MB) - Small model, fastest

```shell
# Use French small model
docker run --rm -i -v ./models:/local/models korap/conllu-spacy -m fr_core_news_sm < input.conllu
```
- `fr_core_news_lg` (560MB) - Large French model
- `fr_core_news_md` (100MB) - Medium French model
- `fr_core_news_sm` (15MB) - Small French model

```shell
# Use English model
docker run --rm -i -v ./models:/local/models korap/conllu-spacy -m en_core_web_lg < input.conllu
```
- `en_core_web_lg` (560MB) - Large English model
- `en_core_web_md` (100MB) - Medium English model
- `en_core_web_sm` (15MB) - Small English model

Note: GermaLemma integration only works with German models. For other languages, the standard spaCy lemmatizer is used (pass the `-g` flag to disable GermaLemma).
From the sota-pos-lemmatizers benchmarks on the TIGER corpus (50,472 sentences):
| Configuration | Lemma Acc | POS Acc | POS F1 | sents/sec |
|---|---|---|---|---|
| spaCy + GermaLemma | 90.98 | 99.07 | 95.84 | 1,230 |
| spaCy (without GermaLemma) | 85.33 | 99.07 | 95.84 | 1,577 |
Note: Disabling dependency parsing (-d flag) significantly improves processing speed while maintaining POS tagging and lemmatization quality.
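Besides the global `-d` switch, `SPACY_MAX_SENTENCE_LENGTH` suggests parsing is also skipped per sentence when a limit is exceeded. One way to picture such a guard (names and fallback behavior are illustrative assumptions, not the container's actual code):

```python
def dependency_columns(tokens, parse_fn, max_len=500):
    # Sketch: skip dependency parsing for over-long sentences and fall back to
    # empty HEAD/DEPREL columns, keeping POS/lemma annotations intact.
    # (max_len mirrors SPACY_MAX_SENTENCE_LENGTH; parse_fn is hypothetical.)
    if len(tokens) > max_len:
        return [("_", "_")] * len(tokens)
    return parse_fn(tokens)
```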
The project consists of:
Based on the sota-pos-lemmatizers evaluation project, originally by José Angel Daza and Marc Kupietz, with contributions by Rebecca Wilm. It follows the pattern established by conllu-treetagger-docker.
This project's source code is licensed under the BSD 2-Clause License.
See, however, the licenses of the individual components: