
spaCy Docker Image with CoNLL-U Support


Docker image for spaCy POS tagging, lemmatization and dependency parsing with support for input and output in CoNLL-U format.
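For reference, CoNLL-U is a tab-separated format with ten columns per token (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), with `#` comment lines and blank lines separating sentences. A minimal fragment (annotation values are illustrative, not actual tool output):

```
# text = Das ist ein Test.
1	Das	der	PRON	PDS	Case=Nom|Gender=Neut|Number=Sing	4	nsubj	_	_
2	ist	sein	AUX	VAFIN	Number=Sing|Person=3|Tense=Pres	4	cop	_	_
3	ein	ein	DET	ART	Case=Nom|Gender=Masc|Number=Sing	4	det	_	_
4	Test	Test	NOUN	NN	Case=Nom|Gender=Masc|Number=Sing	0	root	_	SpaceAfter=No
5	.	.	PUNCT	$.	_	4	punct	_	_
```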

This is a slim, focused implementation extracted from sota-pos-lemmatizers, originally developed by José Angel Daza, following the same pattern as conllu-treetagger-docker.

Features

  • Multi-language support: Works with any spaCy model for 70+ languages
  • CoNLL-U input/output: Reads and writes CoNLL-U format
  • On-demand model fetching: Models are downloaded on first run and cached in /local/models
  • GermaLemma integration: Enhanced lemmatization for German (optional, German models only)
  • Morphological features: Extracts and formats morphological features in CoNLL-U format
  • Dependency parsing: Optional dependency relations (HEAD/DEPREL columns)
  • Flexible configuration: Environment variables for batch size, chunk size, timeouts, etc.

Installation

From Docker Hub

docker pull korap/conllu-spacy

From source

git clone https://github.com/KorAP/conllu-spacy-docker.git
cd conllu-spacy-docker
make

Usage

Basic usage

# Default: German model with dependency parsing and GermaLemma
docker run --rm -i korap/conllu-spacy < input.conllu > output.conllu

Without dependency parsing

# Disable dependency parsing for faster processing
docker run --rm -i korap/conllu-spacy -d < input.conllu > output.conllu

Using different language models

# Use a smaller German model
docker run --rm -i korap/conllu-spacy -m de_core_news_sm < input.conllu > output.conllu

# Use French model
docker run --rm -i korap/conllu-spacy -m fr_core_news_lg < input.conllu > output.conllu

# Use English model (disable GermaLemma for non-German)
docker run --rm -i korap/conllu-spacy -m en_core_web_lg -g < input.conllu > output.conllu

Persisting Models

To avoid downloading the language model on every run, mount a local directory to /local/models:

chmod 777 /path/to/local/models
docker run --rm -i -v /path/to/local/models:/local/models korap/conllu-spacy < input.conllu > output.conllu

The first run will download the model to /path/to/local/models/, and subsequent runs will reuse it.

Preloading Models

There are several ways to preload models before running the container:

Option 1: Using the preload script (recommended)

# Preload the default model (de_core_news_lg)
./preload-models.sh

# Preload a specific model
./preload-models.sh de_core_news_sm

# Preload to a custom directory
./preload-models.sh de_core_news_lg /path/to/models

# Then run with the preloaded models
docker run --rm -i -v ./models:/local/models korap/conllu-spacy < input.conllu

Option 2: Build image with models included

# Build an image with models pre-installed
docker build -f Dockerfile.with-models -t korap/conllu-spacy:with-models .

# Run without needing to mount volumes
docker run --rm -i korap/conllu-spacy:with-models < input.conllu > output.conllu

Edit Dockerfile.with-models to include additional models (sm, md) by uncommenting the relevant lines.
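The commented-out lines presumably follow the standard spaCy model-download pattern; a sketch of what enabling the smaller German models might look like (not the literal file contents):

```dockerfile
# Pre-install additional German models at image build time
RUN python -m spacy download de_core_news_md
RUN python -m spacy download de_core_news_sm
```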

Option 3: Manual download

# Create models directory
mkdir -p ./models

# Download using a temporary container
docker run --rm -v ./models:/models python:3.12-slim bash -c "
  pip install -q spacy &&
  python -m spacy download de_core_news_lg &&
  python -c 'import shutil, site; shutil.copytree(site.getsitepackages()[0] + \"/de_core_news_lg\", \"/models/de_core_news_lg\")'
"

# Use the preloaded model
docker run --rm -i -v ./models:/local/models korap/conllu-spacy < input.conllu

Running with korapxmltool

korapxmltool, which includes korapxml2conllu as a shortcut, can be downloaded from https://github.com/KorAP/korapxmltool.

korapxml2conllu goe.zip | docker run --rm -i korap/conllu-spacy

Generate a spaCy-tagged KorAP XML zip directly

korapxmltool -A "docker run --rm -i korap/conllu-spacy" -t zip goe.zip

Command-line Options

Usage: docker run --rm -i korap/conllu-spacy [OPTIONS]

Options:
  -h            Display help message
  -m MODEL      Specify spaCy model (default: de_core_news_lg)
  -L            List available/installed models
  -V            Display spaCy version information
  -d            Disable dependency parsing (faster processing)
  -g            Disable GermaLemma (use spaCy lemmatizer only)

Version Information

To check which versions of conllu-spacy-docker and its components are installed:

docker run --rm korap/conllu-spacy -V

Example output:

=== Version Information ===
conllu-spacy-docker version: 3.8.11-1
spaCy version: 3.8.11
GermaLemma version: 0.1.3
Python version: 3.12.1

Environment Variables

You can customize processing behavior with environment variables:

docker run --rm -i \
  -e SPACY_USE_DEPENDENCIES="False" \
  -e SPACY_USE_GERMALEMMA="True" \
  -e SPACY_CHUNK_SIZE="10000" \
  -e SPACY_BATCH_SIZE="1000" \
  -e SPACY_N_PROCESS="1" \
  -e SPACY_PARSE_TIMEOUT="30" \
  -e SPACY_MAX_SENTENCE_LENGTH="500" \
  korap/conllu-spacy < input.conllu > output.conllu

Available environment variables:

  • SPACY_USE_DEPENDENCIES: Enable/disable dependency parsing (default: "True")
  • SPACY_USE_GERMALEMMA: Enable/disable GermaLemma (default: "True")
  • SPACY_CHUNK_SIZE: Number of sentences to process per chunk (default: 20000)
  • SPACY_BATCH_SIZE: Batch size for spaCy processing (default: 2000)
  • SPACY_N_PROCESS: Number of processes (default: 10)
  • SPACY_PARSE_TIMEOUT: Timeout for dependency parsing per sentence in seconds (default: 30)
  • SPACY_MAX_SENTENCE_LENGTH: Maximum sentence length for dependency parsing in tokens (default: 500)
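For repeated runs, the same settings could be kept in a Compose file. A sketch (the service name and file layout are assumptions, not part of this repository):

```yaml
services:
  conllu-spacy:
    image: korap/conllu-spacy
    volumes:
      - ./models:/local/models
    environment:
      SPACY_USE_DEPENDENCIES: "True"
      SPACY_USE_GERMALEMMA: "True"
      SPACY_CHUNK_SIZE: "20000"
      SPACY_BATCH_SIZE: "2000"
```

Run it with `docker compose run --rm -T conllu-spacy < input.conllu > output.conllu` (`-T` disables the pseudo-TTY so stdin/stdout redirection works).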

Examples

# Fast processing: disable dependency parsing
docker run --rm -i korap/conllu-spacy -d < input.conllu > output.conllu

# Use spaCy lemmatizer only (without GermaLemma)
docker run --rm -i korap/conllu-spacy -g < input.conllu > output.conllu

# Smaller model for faster download
docker run --rm -i korap/conllu-spacy -m de_core_news_sm < input.conllu > output.conllu

# Persistent model storage
docker run --rm -i -v ./models:/local/models korap/conllu-spacy < input.conllu > output.conllu
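To spot-check the tagged output outside the container, the CoNLL-U columns can be inspected with standard tools. A sketch (the file name is illustrative):

```shell
# Print FORM (column 2) and LEMMA (column 3) for each token,
# skipping comment lines and the blank lines between sentences
awk -F'\t' '!/^#/ && NF >= 10 { print $2, $3 }' output.conllu
```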

Miscellaneous commands

List installed models:

docker run --rm -i korap/conllu-spacy -L

Open a shell within the container:

docker run --rm -it --entrypoint /bin/bash korap/conllu-spacy

Supported Languages and Models

Any spaCy model can be specified with the -m option. Models will be downloaded automatically on first use.

spaCy provides trained models for 70+ languages. See spaCy Models for the complete list.

Example: German models (default)

  • de_core_news_lg (default, 560MB) - Large model, best accuracy
  • de_core_news_md (100MB) - Medium model, balanced
  • de_core_news_sm (15MB) - Small model, fastest

Example: French models

  • fr_core_news_lg (560MB) - Large French model
  • fr_core_news_md (100MB) - Medium French model
  • fr_core_news_sm (15MB) - Small French model

# Use French small model
docker run --rm -i -v ./models:/local/models korap/conllu-spacy -m fr_core_news_sm < input.conllu

Example: English models

  • en_core_web_lg (560MB) - Large English model
  • en_core_web_md (100MB) - Medium English model
  • en_core_web_sm (15MB) - Small English model

# Use English model
docker run --rm -i -v ./models:/local/models korap/conllu-spacy -m en_core_web_lg < input.conllu

Note: GermaLemma integration works only with German models. For other languages, pass the -g flag to disable GermaLemma so that the standard spaCy lemmatizer is used.

Performance

From the sota-pos-lemmatizers benchmarks on the TIGER corpus (50,472 sentences):

| Configuration              | Lemma Acc | POS Acc | POS F1 | sents/sec |
|----------------------------|-----------|---------|--------|-----------|
| spaCy + GermaLemma         | 90.98     | 99.07   | 95.84  | 1,230     |
| spaCy (without GermaLemma) | 85.33     | 99.07   | 95.84  | 1,577     |

Note: Disabling dependency parsing (-d flag) significantly improves processing speed while maintaining POS tagging and lemmatization quality.

Architecture

The project consists of:

  • Dockerfile: Multi-stage build for optimized image size
  • docker-entrypoint.sh: Entry point script that handles model fetching and CLI argument parsing
  • systems/parse_spacy_pipe.py: Main spaCy processing pipeline
  • lib/CoNLL_Annotation.py: CoNLL-U format parsing and token classes
  • my_utils/file_utils.py: File handling utilities for chunked processing

Credits

Based on the sota-pos-lemmatizers evaluation project, originally by José Angel Daza and Marc Kupietz, with contributions by Rebecca Wilm. It follows the pattern established by conllu-treetagger-docker.

License

This project's source code is licensed under the BSD 2-Clause License.

See, however, the licenses of the individual components:

  • spaCy: MIT License
  • GermaLemma: Apache 2.0 License