SOTA Part-of-Speech and Lemmatizers for German

Build and run as a Docker Image

docker build -t conllu2spacy .

Then run the image for example with:

korapxml2conllu rei.zip | docker run -i conllu2spacy | conllu2korapxml > rei.spacy.zip

Build locally and install Requirements

Create a Virtual Environment

	python3 -m venv venv
	source venv/bin/activate
	export PYTHONPATH=$PYTHONPATH:.

Install Libraries (as needed)

SpaCy 2.x

pip install -U pip setuptools wheel
pip install -U spacy==2.3.2

For more details you can visit the official website
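Once installed, a quick sanity check looks roughly like the sketch below (illustrative only; it assumes a German model such as de_core_news_lg has been downloaded beforehand, e.g. with python -m spacy download de_core_news_lg):

    # Minimal sketch: tag and lemmatize one sentence with SpaCy 2.x
    import spacy

    nlp = spacy.load("de_core_news_lg", disable=["ner", "parser"])  # keep only tagger + lemmatizer
    doc = nlp("Das ist ein kurzer Testsatz.")
    for token in doc:
        print(token.text, token.tag_, token.lemma_)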

Germalemma

pip install -U germalemma

More details on their website
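Basic usage is straightforward; the sketch below assumes germalemma is installed and uses its find_lemma(word, pos) call, which expects the word plus a coarse POS hint:

    # Minimal sketch of Germalemma usage
    from germalemma import GermaLemma

    lemmatizer = GermaLemma()
    # find_lemma() needs the word plus a coarse POS hint (N, V, ADJ or ADV)
    print(lemmatizer.find_lemma("Feinstaubbelastungen", "N"))   # -> "Feinstaubbelastung"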

Turku Parser

A neural parsing pipeline for segmentation, morphological tagging, dependency parsing and lemmatization with pre-trained models for more than 50 languages. Top ranker in the CoNLL-18 Shared Task.

Follow the installation instructions on their website

TreeTagger

A tool for annotating text with part-of-speech and lemma information.

Follow the installation instructions on their website

RNNTagger

RNNTagger is implemented in Python using the deep learning library PyTorch. Compared to TreeTagger, it has higher tagging accuracy and lemmatizes all tokens; however, it is much slower and requires a GPU.

Follow the installation instructions on their website

SpaCy 3.x

(If you use this, it needs its own virtual environment to avoid conflict with SpaCy 2.x)

python3 -m venv venv
source venv/bin/activate
pip install -U pip setuptools wheel
pip install -U spacy-nightly --pre

For more details on this version you can visit the official website

Pre-trained Lemmatization & POS Taggers

Run TurkuNLP Parser

We assume the Parser is already available at /path/to/Turku-neural-parser-pipeline/.

This is client-server software: you must first start the server (for example, inside a separate screen session) and then use the client code provided in our repo.

  1. Download the pre-trained models that you wish to use and uncompress them in the root folder of the Turku-neural-parser-pipeline repository. Models are available at this URL; for example, the German model is called models_de_gsd.tgz.

  2. Run the Server:

screen -S turku-server

cd /path/to/Turku-neural-parser-pipeline/

source venv-parser-neural/bin/activate

python full_pipeline_server.py --port 7689 --conf models_de_gsd/pipelines.yaml parse_conllu

  3. Once the server is up and running, run the client script located at systems/parse_turku.py.

The script receives the following parameters:

  • input_file: path to the (empty) CoNLL file that will be processed. An empty CoNLL file means that all columns (except column one, which should contain a word token) can be empty.
  • output_file: prefix for the parsed output files (see the output naming below).
  • corpus_name: A string to distinguish the current corpus being processed.
  • gld_token_type: Determines the precise CoNLL format of the input file. Current options: CoNLL09_Token | CoNLLUP_Token (CoNLL-U).
  • comment_str: Indicates which string marks a line as a comment inside the CoNLL file.

For example, to parse the TIGER corpus (which is in CoNLL-U token format), one runs:

screen -S turku-client

cd /path/to/this-repo/

python systems/parse_turku.py \
	--input_file /vol/work/kupietz/Tiger_2_2/data/german/tiger/test/german_tiger_test.conll \
	--output_file /path/to/german_tiger_test \
	--corpus_name TigerTestOld \
	--gld_token_type CoNLLUP_Token 

This script splits the CoNLL file into N chunks (currently hardcoded to 10K sentences per chunk, so that any large file fits in memory). Once the script finishes, it produces a log file at logs/Parse_{corpus_name}_Turku.log as well as output files of the form {output_file}.parsed.{chunk_index}.conllu
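The chunking itself is conceptually simple; here is a minimal sketch of the idea (not the repo's exact implementation; the helper name read_conll_chunks is hypothetical):

    # Sketch: stream a CoNLL file and yield chunks of 10,000 sentences,
    # so that arbitrarily large files never have to be held in memory at once.
    def read_conll_chunks(path, chunk_size=10000, comment_str="#"):
        chunk, sentence = [], []
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if line.startswith(comment_str):
                    continue
                if not line.strip():              # blank line = sentence boundary
                    if sentence:
                        chunk.append(sentence)
                        sentence = []
                    if len(chunk) == chunk_size:
                        yield chunk
                        chunk = []
                else:
                    sentence.append(line.split("\t"))
        if sentence:
            chunk.append(sentence)
        if chunk:
            yield chunk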

Run TreeTagger

TreeTagger requires a particular input file format where each line must have only one token and sentences are separated by a special token. We include a script to convert "normal" CoNLL files into this format. To do so, run:

python my_utils/conll_to_tok.py \
	--src_file /vol/work/kupietz/Tiger_2_2/data/german/tiger/test/german_tiger_test.conll \
	--output_file german_tiger_test \
	--sent_sep "</S>" \
	--token_type CoNLLUP_Token

This outputs a file output_file.sep.tok with the proper format. Then go to /path/to/TreeTagger/ and, once inside that folder, run:

bash cmd/tree-tagger-german-notokenize /path/to/german_tiger_test.sep.tok > /path/to/output/german_tiger_test.TreeTagger.conll
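For reference, the .sep.tok format produced by conll_to_tok.py is simply one token per line with the chosen sentence separator; a tiny illustrative sketch (not the script itself, toy sentences made up):

    # Illustration of the one-token-per-line input TreeTagger expects
    sentences = [["Das", "ist", "ein", "Test", "."], ["Noch", "ein", "Satz", "."]]
    with open("example.sep.tok", "w", encoding="utf-8") as out:
        for sent in sentences:
            for tok in sent:
                out.write(tok + "\n")   # exactly one token per line
            out.write("</S>\n")         # sentence separator passed via --sent_sep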

Run RNNTagger

RNNTagger uses a format similar to TreeTagger's, but does not require a specific </S> separator. You can obtain the proper file with:

python my_utils/conll_to_tok.py \
	--src_file /vol/work/kupietz/Tiger_2_2/data/german/tiger/test/german_tiger_test.conll \
	--output_file german_tiger_test \
	--token_type CoNLLUP_Token

This outputs a file output_file.tok with the proper format. Then go to /path/to/RNNTagger/, activate the Python environment where PyTorch is installed, and once inside that folder, run:

bash cmd/rnn-tagger-german-notokenize.sh /path/to/german_tiger_test.tok > /path/to/output/german_tiger_test.RNNTagger.conll

Run SpaCy 2.x

The pre-trained SpaCy POS tagger was trained on the TIGER corpus and includes a lookup table for lemmatization.

Our script for running the SpaCy parser can receive 8 parameters:

  • input_file: path to the (empty) CoNLL file that will be processed. An empty CoNLL file means that all columns (except column one, which should contain a word token) can be empty.
  • corpus_name: A string to distinguish the current corpus being processed
  • output_file: File where the SpaCy Predictions will be saved
  • text_file: Output text file where the sentences will be saved in plain text (one sentence per line)
  • spacy_model: a string indicating which SpaCy model should be used to process the text (e.g. de_core_news_lg)
  • gld_token_type: Determines the precise CoNLL Format of the input file. Current Options: CoNLL09_Token | CoNLLUP_Token (CoNLL-U) and others (see lib/CoNLL_Annotation.py for the classes of tokens that can be used).
  • use_germalemma: Flag to decide whether to apply the Germalemma lemmatizer on top of SpaCy or not ("True" is recommended!)
  • comment_str: Indicates which string marks a line as a comment inside the CoNLL file.

To continue with the tiger_test example, one can obtain the SpaCy annotations by running:

python systems/parse_spacy.py \
	--corpus_name TigerTest \
	--gld_token_type CoNLLUP_Token \
	--input_file /vol/work/kupietz/Tiger_2_2/data/german/tiger/test/german_tiger_test.conll \
	--output_file /path/to/output/tiger_test.spacy.conllu \
	--text_file /path/to/output/tiger_test_sentences.txt

Note that the script is already optimized for reading CoNLL-U files, keeping the appropriate comments, partitioning huge files into N chunks, and running them in parallel with M processes. These parameters are currently hard-coded; however, they already hold the values found optimal for processing the whole DeReKo corpus (and they can of course be adapted further).
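A minimal sketch of that chunk-and-parallelize idea (hypothetical constants, not the actual parse_spacy.py internals; it assumes de_core_news_lg is installed):

    # Sketch: split the sentences into chunks and tag each chunk in its own process.
    import multiprocessing as mp
    import spacy

    N_PROCESSES = 4        # "M processes" from the text; adjust to the machine
    CHUNK_SIZE = 10000     # sentences per chunk

    def tag_chunk(sentences):
        # load the model inside each worker so processes do not share state
        nlp = spacy.load("de_core_news_lg", disable=["ner", "parser"])
        return [[(tok.text, tok.tag_, tok.lemma_) for tok in doc]
                for doc in nlp.pipe(sentences, batch_size=1000)]

    if __name__ == "__main__":
        sentences = ["Das ist ein Satz.", "Noch ein Satz."]   # normally read from text_file
        chunks = [sentences[i:i + CHUNK_SIZE] for i in range(0, len(sentences), CHUNK_SIZE)]
        with mp.Pool(N_PROCESSES) as pool:
            results = pool.map(tag_chunk, chunks)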

Evaluation of Taggers

We evaluate the models on accuracy and macro F1, as well as processing speed (a sketch of how these metrics can be computed follows the tables).

  1. TIGER Corpus (50,472 sentences) - to test speed

     System                Lemma Acc   POS Acc   POS F1   sents/sec
     TreeTagger*           90.62       95.24     74.35    12,618
     SpaCy                 85.33       99.07     95.84    1,577
     SpaCy + Germalemma    90.98       99.07     95.84    1,230
     Turku NLP [CPU]       78.90       94.43     70.78    151
     RNNTagger* [GPU]      97.93       99.44     93.72    141

* Because of divergences in lemmas and POS tags, TreeTagger and RNNTagger needed post-processing to agree with the DE_GSD gold standard.

One can see that SpaCy offers the best performance-speed trade-off, especially when several CPUs are available.

  2. TIGER Test (767 sentences)

     System                Lemma Acc   POS Acc   POS F1
     RNNTagger*            97.57       99.41     98.41
     SpaCy + Germalemma    91.24       98.97     97.01
     TreeTagger*           90.21       95.42     79.73
     Turku NLP             77.07       94.65     78.24

* with post-processing applied

  3. DE_GSD Test (977 sentences) - Universal Dependencies, CoNLL-18 Dataset

     System                  Lemma Acc   POS Acc   POS F1
     Turku NLP               81.97       97.07     86.58
     RNNTagger*              93.87       95.89     82.86
     SpaCy + Germalemma      90.59       95.43     83.63
     SpaCy                   85.92       95.43     83.63
     TreeTagger*             90.91       93.64     75.70
     RNNTagger (original)    93.87       90.41     80.97
     TreeTagger (original)   79.65       88.17     73.83

* with post-processing applied
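For reference, the metrics reported above can be computed along these lines (a minimal sketch on toy, token-aligned gold and predicted annotations; it assumes scikit-learn is available, which is not a stated dependency of this repo):

    # Sketch of accuracy and macro F1 on toy, token-aligned data
    from sklearn.metrics import accuracy_score, f1_score

    gold_pos = ["ART", "NN", "VVFIN", "$."]
    pred_pos = ["ART", "NN", "VVINF", "$."]
    gold_lem = ["der", "Hund", "bellen", "."]
    pred_lem = ["der", "Hund", "bellen", "."]

    pos_acc = accuracy_score(gold_pos, pred_pos)
    pos_macro_f1 = f1_score(gold_pos, pred_pos, average="macro")
    lemma_acc = accuracy_score(gold_lem, pred_lem)
    print(f"POS Acc {pos_acc:.2%} | POS Macro F1 {pos_macro_f1:.2%} | Lemma Acc {lemma_acc:.2%}")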

Custom Train

Train TurkuNLP Parser

You can follow the instructions for training custom models here

Train Spacy 2.x Tagger

  • It is possible to train a POS tagger from scratch or fine-tune one using the SpaCy API. It is also possible to load custom lemmatizer rules (currently SpaCy only uses a lookup table, which is why adding Germalemma improved performance).

  • To train a Spacy model in the 2.x version, you can follow the dummy code provided in spacy_train/custom_spacy_tagger_2x.py.
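For orientation, a stripped-down version of such a training loop looks roughly like this (a sketch along the lines of the official SpaCy 2.x tagger example, not the repo's custom_spacy_tagger_2x.py; the tiny TRAIN_DATA is made up):

    # Sketch: train a SpaCy 2.x POS tagger from scratch on toy data with STTS-like tags
    import random
    import spacy
    from spacy.util import minibatch

    TRAIN_DATA = [
        ("Das ist ein Satz .", {"tags": ["PDS", "VAFIN", "ART", "NN", "$."]}),
        ("Der Hund bellt .",   {"tags": ["ART", "NN", "VVFIN", "$."]}),
    ]

    nlp = spacy.blank("de")
    tagger = nlp.create_pipe("tagger")
    for _, annots in TRAIN_DATA:
        for tag in annots["tags"]:
            tagger.add_label(tag)
    nlp.add_pipe(tagger)

    optimizer = nlp.begin_training()
    for epoch in range(10):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for batch in minibatch(TRAIN_DATA, size=8):
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, losses=losses)
        print(epoch, losses)
    nlp.to_disk("toy_tagger_model")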

Train Spacy 3.x Tagger (Transformer)

  • SpaCy 3.x is in BETA; however, it will provide a more robust API for training custom models, as well as access to all of the models available in the Hugging Face Transformers library. More information about this version of SpaCy is available in their blog.

  • This version will also provide a more flexible API for lemmatization; however, this is not yet implemented.

  1. To train a POS tagger in SpaCy 3.x, follow these steps (using TIGER as an example, but any data available in CoNLL or SpaCy-JSON format can be used):

    1. Convert the input CoNLL file into SpaCy-JSON training format with spacy_train/conll2spacy.py:
    	python DeReKo/spacy_train/conll2spacy.py --corpus_name TigerALL --gld_token_type CoNLLUP_Token \
    		--input_file /vol/work/kupietz/Tiger_2_2/data/german/tiger/train/german_tiger_train.conll \
    		--output_file /path/to/Tiger.train.json \
    		--text_file /path/to/Tiger.train.sents.txt
    
    2. Convert to the SpaCy 3.x dataset format by running:
    	python -m spacy convert -c json /path/to/Tiger.train.json  out_dir_only/
    
    3. Create a basic config file following the SpaCy API. An example is provided in ids-projects/DeReKo/spacy_train/basic_config_newOrth.cfg.
    4. Create the final config file with:
    	python -m spacy init fill-config basic_config.cfg final_config.cfg
    
    5. Train the model using a GPU:
    	python -m spacy train final_config.cfg --output tiger_spacy --verbose --gpu-id 0
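    After training, spacy train writes model-best and model-last folders inside the --output directory; loading the result is then just (a minimal sketch):

    # Load the trained tagger from the output directory (standard SpaCy 3.x layout)
    import spacy
    nlp = spacy.load("tiger_spacy/model-best")
    doc = nlp("Das ist ein kurzer Testsatz.")
    print([(t.text, t.tag_) for t in doc])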
    

Overall Repository Structure

DeReKo

Specific scripts for the parallel execution of the SpaCy lemmatizer and tagger on large DeReKo files

  • Text files used to process DeReKo: dereko_all_filenames.txt and dereko_mini_test.txt
  • exec_dereko_parallel.sh: Main Execution Script for running N SpaCy processes on ALL DeReKo Files.
  • explore_dereko.py: Prints available files inside the DeReKo directory
  • Directory spacy_train:
    • conll2spacy.py: creates JSON dataset files readable by SpaCy scripts
    • custom_spacy_dereko.py: Prepares pre-trained vector files to be SpaCy readable
    • custom_spacy_tagger_2x.py: Trains a POS Tagger using SpaCy 2.x library
    • Config Files *.cfg used by SpaCy 3.x scripts

lib

Main Class definitions and other useful resources

  • CoNLL_Annotation.py: Contains the CoNLL Token Class definitions used by most systems to process CoNLL datasets (a conceptual sketch follows below).
  • German_STTS_Tagset.tsv: Inventory of German POS Tags as defined in the TIGER corpus
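For orientation, a CoNLL-U token record conceptually looks like the following (a hypothetical sketch; the actual class definitions live in lib/CoNLL_Annotation.py):

    # Hypothetical sketch of a CoNLL-U token; field order follows the CoNLL-U spec.
    from dataclasses import dataclass

    @dataclass
    class ConlluToken:
        token_id: str        # column 1: ID
        form: str            # column 2: surface form
        lemma: str = "_"
        upos: str = "_"
        xpos: str = "_"      # language-specific tag, e.g. STTS for German
        feats: str = "_"
        head: str = "_"
        deprel: str = "_"
        deps: str = "_"
        misc: str = "_"

        def as_conllu_line(self) -> str:
            return "\t".join([self.token_id, self.form, self.lemma, self.upos, self.xpos,
                              self.feats, self.head, self.deprel, self.deps, self.misc])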

logs

Directory where the logging *.log files are saved

my_utils

Contains auxiliary scripts for file-handling, pre-processing and execution of systems

  • clean_dereko_vectors.py:
  • conll_to_tok.py: Converts CoNLL files into the one-token-per-line .tok format required by TreeTagger and RNNTagger
  • file_utils.py:
  • make_new_orth_silver_lemmas.py: DELETE!?
  • make_tiger_new_orth.py:

outputs

This is where all experiments' outputs are saved, including error analysis, evaluation stats, etc.

systems

Main scripts to execute the Lemmatizers and Taggers on any dataset

  • parse_spacy.py: Runs the SpaCy 2.x tagger and lemmatizer (optionally with Germalemma) on a CoNLL file
  • parse_spacy3.py: Counterpart of parse_spacy.py for SpaCy 3.x
  • parse_turku.py: Client script that sends the chunked CoNLL input to the TurkuNLP parser server
  • Run_Tree-RNN_Taggers.txt:

systems_eval

Scripts to evaluate and compare systems' performance

  • eval_old_vs_new_tiger.py:
  • evaluate.py: