Build the Docker image with:

docker build -t conllu2spacy .

Then run the image, for example, with:
korapxml2conllu rei.zip | docker run -i conllu2spacy | conllu2korapxml > rei.spacy.zip
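For scripted use, the same conversion-tagging-conversion pipeline can also be driven from Python. The sketch below is only an illustration, assuming korapxml2conllu, docker and conllu2korapxml are available on the PATH and that the conllu2spacy image has been built as above:

```python
# Sketch: drive the shell pipeline from Python via subprocess.
# Assumes korapxml2conllu, docker and conllu2korapxml are installed and on PATH.
import subprocess

with open("rei.spacy.zip", "wb") as out:
    p1 = subprocess.Popen(["korapxml2conllu", "rei.zip"], stdout=subprocess.PIPE)
    p2 = subprocess.Popen(["docker", "run", "-i", "conllu2spacy"],
                          stdin=p1.stdout, stdout=subprocess.PIPE)
    p3 = subprocess.Popen(["conllu2korapxml"], stdin=p2.stdout, stdout=out)
    p1.stdout.close()  # let p1 receive SIGPIPE if the downstream process exits
    p2.stdout.close()
    p3.wait()
```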
Create a virtual environment and install SpaCy 2.x:

python3 -m venv venv
source venv/bin/activate
export PYTHONPATH=$PYTHONPATH:.
pip install -U pip setuptools wheel
pip install -U spacy==2.3.2

For more details, you can visit the official website.
For better lemmatization, install GermaLemma as well:

pip install -U germalemma

More details can be found on their website.
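The sketch below shows one plausible way to combine SpaCy 2.x and GermaLemma (this is not the repo's parse_spacy.py, just an illustration): SpaCy provides tokenization, STTS tags and a lookup lemma, while GermaLemma refines the lemmas of content words and everything else falls back to SpaCy.

```python
# Sketch: SpaCy 2.x tagging with GermaLemma-based lemma refinement.
import spacy
from germalemma import GermaLemma

nlp = spacy.load("de_core_news_lg")   # assumes the model has been downloaded
lemmatizer = GermaLemma()

def lemmatize(token):
    # GermaLemma only covers nouns, verbs, adjectives and adverbs;
    # it raises ValueError for other POS tags, so fall back to SpaCy's lookup lemma.
    try:
        return lemmatizer.find_lemma(token.text, token.tag_)
    except ValueError:
        return token.lemma_

doc = nlp("Die Feinstaubbelastungen in den Städten sind gestiegen.")
for token in doc:
    print(token.text, token.tag_, lemmatize(token))
```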
Turku NLP: a neural parsing pipeline for segmentation, morphological tagging, dependency parsing and lemmatization, with pre-trained models for more than 50 languages; a top ranker in the CoNLL-18 Shared Task. Follow the installation instructions on their website.
TreeTagger: a tool for annotating text with part-of-speech and lemma information. Follow the installation instructions on their website.
RNNTagger: implemented in Python using the deep learning library PyTorch. Compared to TreeTagger, it has higher tagging accuracy and lemmatizes all tokens; however, it is much slower and requires a GPU. Follow the installation instructions on their website.
(If you use the SpaCy 3.x nightly version, it needs its own virtual environment to avoid conflicts with SpaCy 2.x.)

python3 -m venv venv
source venv/bin/activate
pip install -U pip setuptools wheel
pip install -U spacy-nightly --pre

For more details on this version, you can visit the official website.
We assume the parser is already available at /path/to/Turku-neural-parser-pipeline/.
This is client-server software; therefore, you must first start the server (for example inside a separate screen session), and separately use the client code provided in our repo.
Download the pre-trained models that you wish to use and uncompress them in the root folder of the Turku-neural-parser-pipeline repository. Models are available at this URL; for example, the German model is called models_de_gsd.tgz.
Run the Server:
screen -S turku-server
cd /path/to/Turku-neural-parser-pipeline/
source venv-parser-neural/bin/activate
python full_pipeline_server.py --port 7689 --conf models_de_gsd/pipelines.yaml parse_conllu
Run the Client (see systems/parse_turku.py). The script receives 4 parameters: --input_file, --output_file, --corpus_name, and --gld_token_type. For example, to parse the TIGER corpus (which is in CoNLL-U token format), one runs:
screen -S turku-client
cd /path/to/this-repo/
python systems/parse_turku.py \
    --input_file /vol/work/kupietz/Tiger_2_2/data/german/tiger/test/german_tiger_test.conll \
    --output_file /path/to/german_tiger_test \
    --corpus_name TigerTestOld \
    --gld_token_type CoNLLUP_Token
This code fragments the CoNLL file into N chunks (currently hardcoded to chunks of 10K sentences each, so that any large file fits in memory). Once the script finishes, it produces a log file in logs/Parse_{corpus_name}_Turku.log as well as output files of the form {output_file}.parsed.{chunk_index}.conllu.
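For reference, a minimal client for the server started above could look as follows. This is just a sketch assuming the Turku server accepts the input as a plain HTTP POST body on the configured port; the repo's actual client, with chunking and logging, is systems/parse_turku.py.

```python
# Sketch of a minimal Turku parser client (assumption: the server accepts
# a raw POST body and returns the parsed CoNLL-U as the response body).
import requests

with open("german_tiger_test.conll", encoding="utf-8") as f:   # example input path
    conllu_text = f.read()

resp = requests.post("http://localhost:7689",
                     data=conllu_text.encode("utf-8"),
                     headers={"Content-Type": "text/plain; charset=utf-8"})
resp.raise_for_status()

with open("german_tiger_test.parsed.conllu", "w", encoding="utf-8") as out:
    out.write(resp.text)
```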
TreeTagger requires a particular input file format where each line must have only one token and sentences are separated by a special token. We include a script to convert "normal" CoNLL files into this format. To do so, run:
python my_utils/conll_to_tok.py \
    --src_file /vol/work/kupietz/Tiger_2_2/data/german/tiger/test/german_tiger_test.conll \
    --output_file german_tiger_test \
    --sent_sep "</S>" \
    --token_type CoNLLUP_Token
This outputs an output_file.sep.tok file with the proper format. Then go to path/to/TreeTagger/ and, once inside that folder, run:
bash cmd/tree-tagger-german-notokenize /path/to/german_tiger_test.sep.tok > /path/to/output/german_tiger_test.TreeTagger.conll
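For reference, the conversion performed by my_utils/conll_to_tok.py roughly amounts to the following simplified sketch (the actual script additionally uses the token classes from lib/CoNLL_Annotation.py and supports more options):

```python
# Simplified sketch of the CoNLL -> one-token-per-line conversion:
# write column 2 (the word form) of every token row and emit the chosen
# sentence separator (e.g. "</S>" for TreeTagger) after each sentence.
def conll_to_tok(src_file, out_file, sent_sep="</S>"):
    with open(src_file, encoding="utf-8") as src, \
         open(out_file, "w", encoding="utf-8") as out:
        for line in src:
            line = line.strip()
            if not line:                    # blank line = sentence boundary
                out.write(sent_sep + "\n")
                continue
            if line.startswith("#"):        # skip CoNLL-U comments
                continue
            out.write(line.split("\t")[1] + "\n")   # FORM column

conll_to_tok("german_tiger_test.conll", "german_tiger_test.sep.tok")
```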
RNNTagger uses a similar format to TreeTagger, but does not require a specific </S> separator. You can obtain the proper file with:
python my_utils/conll_to_tok.py \
    --src_file /vol/work/kupietz/Tiger_2_2/data/german/tiger/test/german_tiger_test.conll \
    --output_file german_tiger_test \
    --token_type CoNLLUP_Token
This outputs an output_file.tok file with the proper format. Then go to path/to/RNNTagger/, activate the Python environment where PyTorch is installed, and once inside that folder, run:
bash cmd/rnn-tagger-german-notokenize.sh /path/to/german_tiger_test.tok > /path/to/output/german_tiger_test.RNNTagger.conll
The Pre-trained SpaCy POS Parser was trained using the TIGER Corpus and it contains a Lookup Table for Lemmatization.
Our script for running the SpaCy parser can receive 8 parameters, among them the SpaCy model to load (e.g. de_core_news_lg) and the gold token type (see lib/CoNLL_Annotation.py for the classes of tokens that can be used). To keep with the tiger_test example, one can obtain the SpaCy annotations by running:
python systems/parse_spacy.py \
    --corpus_name TigerTest \
    --gld_token_type CoNLLUP_Token \
    --input_file /vol/work/kupietz/Tiger_2_2/data/german/tiger/test/german_tiger_test.conll \
    --output_file /path/to/output/tiger_test.spacy.conllu \
    --text_file /path/to/output/tiger_test_sentences.txt
Note that the script is already optimized for reading CoNLL-U files, keeping the appropriate comments, partitioning huge files into N chunks, and running them in parallel with M processes. These parameters are currently hard-coded; however, they already contain the values that were found optimal for processing the whole DeReKo corpus (but of course they can be further adapted...).
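The chunking and multiprocessing idea is roughly the following (a stripped-down sketch, not the actual parse_spacy.py; the chunk size, process count and file names here are only illustrative):

```python
# Sketch: tag a large sentence file with several SpaCy processes in parallel,
# writing one tab-separated line per token and a blank line per sentence.
from multiprocessing import Pool
import spacy

N_SENTS_PER_CHUNK = 10000   # illustrative; the repo hard-codes its own values
N_PROCESSES = 4

def tag_chunk(sentences):
    # each worker loads its own model instance
    nlp = spacy.load("de_core_news_lg", disable=["ner", "parser"])
    rows = []
    for doc in nlp.pipe(sentences, batch_size=1000):
        for i, tok in enumerate(doc, start=1):
            rows.append(f"{i}\t{tok.text}\t{tok.lemma_}\t{tok.pos_}\t{tok.tag_}")
        rows.append("")                      # sentence boundary
    return "\n".join(rows)

if __name__ == "__main__":
    with open("tiger_test_sentences.txt", encoding="utf-8") as f:
        sentences = [line.strip() for line in f if line.strip()]
    chunks = [sentences[i:i + N_SENTS_PER_CHUNK]
              for i in range(0, len(sentences), N_SENTS_PER_CHUNK)]
    with Pool(N_PROCESSES) as pool, \
         open("tiger_test.spacy.conllu", "w", encoding="utf-8") as out:
        out.write("\n".join(pool.map(tag_chunk, chunks)))
```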
We evaluate the models for accuracy and macro F1, as well as processing speed:
System | Lemma Acc | POS Acc | POS F1 | sents/sec |
---|---|---|---|---|
TreeTagger* | 90.62 | 95.24 | 74.35 | 12,618 |
SpaCy | 85.33 | 99.07 | 95.84 | 1,577 |
SpaCy + Germalemma | 90.98 | 99.07 | 95.84 | 1,230 |
Turku NLP [CPU] | 78.90 | 94.43 | 70.78 | 151 |
RNNTagger* [GPU] | 97.93 | 99.44 | 93.72 | 141 |
* Because of divergences in lemmas and POS tags, TreeTagger and RNNTagger needed post-processing to agree with the DE_GSD gold standard.
One can see that the best performance-speed trade-off is with SpaCy, especially when several CPUs are available.
System | Lemma Acc | POS Acc | POS F1 |
---|---|---|---|
RNNTagger* | 97.57 | 99.41 | 98.41 |
SpaCy + Germalemma | 91.24 | 98.97 | 97.01 |
TreeTagger* | 90.21 | 95.42 | 79.73 |
Turku NLP | 77.07 | 94.65 | 78.24 |
* with post-processing applied
System | Lemma Acc | POS Acc | POS F1 |
---|---|---|---|
Turku NLP | 81.97 | 97.07 | 86.58 |
RNNTagger* | 93.87 | 95.89 | 82.86 |
SpaCy + Germalemma | 90.59 | 95.43 | 83.63 |
SpaCy | 85.92 | 95.43 | 83.63 |
TreeTagger* | 90.91 | 93.64 | 75.70 |
RNNTagger (original) | 93.87 | 90.41 | 80.97 |
TreeTagger (original) | 79.65 | 88.17 | 73.83 |
* with post-processing applied
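For reference, the accuracy and macro-F1 figures above can be computed along the following lines, given aligned gold and predicted annotations (a toy sketch of the metrics only, not the repo's evaluate.py):

```python
# Sketch: token-level lemma/POS accuracy and macro-averaged F1 over POS tags.
from sklearn.metrics import accuracy_score, f1_score

gold_lemmas = ["gehen", "der", "Haus"]      # toy gold annotations
pred_lemmas = ["gehen", "die", "Haus"]
gold_pos    = ["VVFIN", "ART", "NN"]
pred_pos    = ["VVFIN", "ART", "NE"]

lemma_acc = accuracy_score(gold_lemmas, pred_lemmas)       # fraction of exact matches
pos_acc   = accuracy_score(gold_pos, pred_pos)
pos_f1    = f1_score(gold_pos, pred_pos, average="macro")  # unweighted mean of per-tag F1
print(f"Lemma Acc {lemma_acc:.2%}  POS Acc {pos_acc:.2%}  POS Macro-F1 {pos_f1:.2%}")
```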
You can follow the instructions for training custom models here. It is possible to train from scratch or fine-tune a POS tagger using the SpaCy API. It is also possible to load custom lemmatizer rules (currently SpaCy only uses a lookup table, which is why adding GermaLemma improved performance).
To train a SpaCy model with the 2.x version, you can follow the dummy code provided in spacy_train/custom_spacy_tagger_2x.py.
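The gist of that script is the standard SpaCy 2.x training loop, sketched below with toy data (the real script reads the converted TIGER annotations; the tags and iteration count here are only illustrative):

```python
# Minimal SpaCy 2.x POS tagger training loop (toy data for illustration only).
import random
import spacy

TAG_MAP = {                                   # minimal STTS subset, illustrative
    "ART":   {"pos": "DET"},
    "NN":    {"pos": "NOUN"},
    "VVFIN": {"pos": "VERB"},
}
TRAIN_DATA = [
    ("Der Hund bellt", {"tags": ["ART", "NN", "VVFIN"]}),
    ("Eine Katze schläft", {"tags": ["ART", "NN", "VVFIN"]}),
]

nlp = spacy.blank("de")
tagger = nlp.create_pipe("tagger")
for tag, values in TAG_MAP.items():
    tagger.add_label(tag, values)
nlp.add_pipe(tagger)

optimizer = nlp.begin_training()
for epoch in range(25):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer, losses=losses)
    print(epoch, losses)

nlp.to_disk("custom_de_tagger")
```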
SpaCy 3.x is still in beta; however, it will provide a more robust API for training custom models, as well as implementations of all the models available in the Hugging Face Transformers library. More information about this version of SpaCy is available in their blog. This version will also provide a more flexible API for lemmatization; however, this is still not implemented...
To train a POS tagger in SpaCy 3.x, one must take the following steps (using TIGER as an example, but any data available in CoNLL or SpaCy-JSON format can be used):
1. Create the JSON training files with spacy_train/conll2spacy.py:

   python DeReKo/spacy_train/conll2spacy.py --corpus_name TigerALL --gld_token_type CoNLLUP_Token \
       --input_file /vol/work/kupietz/Tiger_2_2/data/german/tiger/train/german_tiger_train.conll \
       --output_file /path/to/Tiger.train.json \
       --text_file /path/to/Tiger.train.sents.txt

2. Convert the JSON file into SpaCy's binary training format:

   python -m spacy convert -c json /path/to/Tiger.train.json out_dir_only/

3. Create a training config file (an example is in ids-projects/DeReKo/spacy_train/basic_config_newOrth.cfg) and fill in the remaining defaults:

   python -m spacy init fill-config basic_config.cfg final_config.cfg

4. Run the training:

   python -m spacy train final_config.cfg --output tiger_spacy --verbose --gpu-id 0
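Once training has finished, the resulting pipeline can be loaded like any other SpaCy model; for example (SpaCy 3.x writes model-best/ and model-last/ subdirectories under the --output directory):

```python
# Load the trained tagger and inspect its predictions.
import spacy

nlp = spacy.load("tiger_spacy/model-best")
doc = nlp("Das neue Modell etikettiert deutsche Sätze.")
for token in doc:
    print(token.text, token.tag_)
```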
DeReKo/: Specific scripts for parallel execution of the SpaCy Lemmatizer and Tagger on DeReKo big files

- dereko_all_filenames.txt and dereko_mini_test.txt
- exec_dereko_parallel.sh: Main execution script for running N SpaCy processes on ALL DeReKo files
- explore_dereko.py: Prints available files inside the DeReKo directory
- spacy_train/:
  - conll2spacy.py: creates JSON dataset files readable by the SpaCy scripts
  - custom_spacy_dereko.py: Prepares pre-trained vector files to be SpaCy readable
  - custom_spacy_tagger_2x.py: Trains a POS Tagger using the SpaCy 2.x library
  - *.cfg: config files used by the SpaCy 3.x scripts

lib/: Main class definitions and other useful resources

- CoNLL_Annotation.py: Contains the CoNLL token class definitions used by most systems to process CoNLL datasets
- German_STTS_Tagset.tsv: Inventory of German POS tags as defined in the TIGER corpus

logs/: Directory where the logging *.log files are saved

my_utils/: Auxiliary scripts for file handling, pre-processing and execution of systems

- clean_dereko_vectors.py
- conll_to_tok.py
- file_utils.py
- make_new_orth_silver_lemmas.py: DELETE!?
- make_tiger_new_orth.py

Here is where all experiment outputs are saved, including error analysis, evaluation stats, etc.

systems/: Main scripts to execute the Lemmatizers and Taggers on any dataset

- parse_spacy.py
- parse_spacy3.py
- parse_turku.py
- Run_Tree-RNN_Taggers.txt

Scripts to evaluate and compare systems' performance:

- eval_old_vs_new_tiger.py
- evaluate.py