Docker image for Helmut Schmid's TreeTagger (based on Stefan Fischer's docker-treetagger) with support for input and output in CoNLL-U format.
Please read Helmut Schmid's license terms before using this Dockerfile.
docker pull korap/conllu-treetagger
git clone https://github.com/KorAP/conllu-treetagger-docker.git
cd conllu-treetagger-docker
make build-docker
$ docker run --rm -i korap/conllu-treetagger < goe.conllu | head -8
# foundry = tree_tagger
# filename = GOE/AGA/00000/base/tokens.xml
# text_id = GOE_AGA.00000
# start_offsets = 0 0 9 12
# end_offsets = 22 8 11 22
1 Campagne <unknown> _ NN _ _ _ _ _
2 in in _ APPR _ _ _ _ _
3 Frankreich Frankreich _ NE _ _ _ _ _
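The tagged output can be reduced to the columns of interest with standard tools. This is a minimal sketch, assuming the fields are tab-separated as CoNLL-U specifies; the sample lines follow the output above, and the variable names `tagged` and `columns` are illustrative.

```shell
# Sample lines in the shape of the tagger output
# (tab-separated fields are an assumption based on the CoNLL-U format).
tagged=$(printf '# text_id = GOE_AGA.00000\n1\tCampagne\t<unknown>\t_\tNN\t_\t_\t_\t_\t_\n2\tin\tin\t_\tAPPR\t_\t_\t_\t_\t_\n')

# Keep FORM (column 2), LEMMA (column 3) and the tag (column 5);
# comment lines and lines with fewer than 10 fields are dropped.
columns=$(printf '%s\n' "$tagged" | awk -F'\t' '!/^#/ && NF >= 10 { print $2 "\t" $3 "\t" $5 }')
printf '%s\n' "$columns"
```

In real use, the `awk` filter would sit at the end of the `docker run` pipeline in place of `head -8`.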
To output alternative POS/lemma interpretations together with their probabilities, use the -p option. You can optionally specify a threshold with -t (default: 0.1):
$ docker run --rm -i korap/conllu-treetagger -p -t 0.01 < goe.conllu | head -8
# foundry = tree_tagger
# filename = GOE/AGA/00000/base/tokens.xml
# text_id = GOE_AGA.00000
# start_offsets = 0 0 9 12
# end_offsets = 22 8 11 22
1 Campagne <unknown> _ NN _ _ _ _ _
2 in in _ APPR _ _ _ _ _
3 Frankreich Frankreich _ NE|NN|ADJD _ _ _ _ 0.956|0.032|0.012
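The alternatives in the -p output can be unpacked into one line per interpretation. This is a minimal sketch, assuming tab-separated fields with the tags in column 5 and the probabilities in column 10, as in the example above; the variable names `ambiguous` and `pairs` are illustrative.

```shell
# Sample lines in the shape of the -p output
# (tab-separated fields are an assumption based on the CoNLL-U format).
ambiguous=$(printf '1\tCampagne\t<unknown>\t_\tNN\t_\t_\t_\t_\t_\n3\tFrankreich\tFrankreich\t_\tNE|NN|ADJD\t_\t_\t_\t_\t0.956|0.032|0.012\n')

# For tokens with more than one interpretation, split the tag column and the
# probability column on "|" and print FORM, tag and probability pairwise.
pairs=$(printf '%s\n' "$ambiguous" | awk -F'\t' '
  /^#/ || NF < 10 { next }
  index($10, "|") > 0 {
    n = split($5, tags, "[|]")
    split($10, probs, "[|]")
    for (i = 1; i <= n; i++) printf "%s\t%s\t%s\n", $2, tags[i], probs[i]
  }')
printf '%s\n' "$pairs"
```

Tokens with a single interpretation, such as Campagne in the sample, are skipped because their last column contains no `|`.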
korapxmltool, which includes korapxml2conllu as a shortcut, can be downloaded from https://github.com/KorAP/korapxmltool.
korapxml2conllu goe.zip | docker run --rm -i korap/conllu-treetagger -l german -p
korapxmltool -A "docker run --rm -i korap/conllu-treetagger" -t zip t24.zip
To avoid downloading the language model on every run, you can mount a local directory to /local/models:
korapxml2conllu goe.zip | docker run --rm -i -v /path/to/local/models:/local/models korap/conllu-treetagger -l german
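The mount can be wrapped in a small shell function so that the models directory is set up once and reused. This is a sketch only: the directory location and the names `MODELS_DIR` and `tag_german` are illustrative choices, not part of the image.

```shell
# Hypothetical location for cached parameter files; override by exporting MODELS_DIR.
# $PWD keeps the path absolute, which a bind mount with -v requires.
MODELS_DIR="${MODELS_DIR:-$PWD/models}"
mkdir -p "$MODELS_DIR"

# Reads CoNLL-U on stdin and writes tagged CoNLL-U to stdout,
# with the local directory mounted at /local/models.
tag_german() {
  docker run --rm -i -v "$MODELS_DIR:/local/models" korap/conllu-treetagger -l german
}
```

With the function defined, the pipeline above becomes `korapxml2conllu goe.zip | tag_german`.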
For an overview of the available languages and models, run the following command:
docker run --rm -i korap/conllu-treetagger -L
Open a shell within the container:
docker run --rm -it --entrypoint /bin/bash korap/conllu-treetagger
The language can be specified with the -l option. Parameter files will be downloaded automatically from the tagger's website.
The following languages are available: Bulgarian, Catalan, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Middle High German, Greek, Ancient Greek, Ancient Greek (beta encoding), Italian, Korean, Latin, Norwegian (Bokmål), Polish, Portuguese, Portuguese (fine-grained tagset), Portuguese (alternative corpus), Romanian, Russian, Slovak, Slovenian, Spanish, Spanish (Ancora corpus), Swahili, Swedish.