docker build -t korap/conllu2spacy:latest .
Then run the image, for example with:
korapxml2conllu rei.zip | docker run -i korap/conllu2spacy | conllu2korapxml > rei.spacy.zip
python3 -m venv venv
source venv/bin/activate
export PYTHONPATH=$PYTHONPATH:.
pip install -U pip setuptools wheel
pip install -U spacy
For more details, you can visit the official website.
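Assuming you use the pre-trained German model referenced below (de_core_news_lg), download it once with:

python -m spacy download de_core_news_lg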
pip install -U germalemma
More details on their website.
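Basic usage looks roughly as follows (a minimal sketch: GermaLemma expects the word plus a coarse STTS-style POS class such as 'N', 'V', 'ADJ' or 'ADV'):

```python
from germalemma import GermaLemma

lemmatizer = GermaLemma()
# GermaLemma only lemmatizes nouns, verbs, adjectives and adverbs,
# identified by a coarse STTS-style POS class.
print(lemmatizer.find_lemma("Feinstaubbelastungen", "N"))  # -> Feinstaubbelastung
```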
A neural parsing pipeline for segmentation, morphological tagging, dependency parsing and lemmatization with pre-trained models for more than 50 languages. Top ranker in the CoNLL-18 Shared Task.
Follow the installation instructions on their website.
A tool for annotating text with part-of-speech and lemma information.
Follow the installation instructions on their website.
RNNTagger is implemented in Python using the deep-learning library PyTorch. Compared to TreeTagger, it has higher tagging accuracy and lemmatizes all tokens; however, it is much slower and requires a GPU.
Follow the installation instructions on their website.
(If you use this, it needs its own virtual environment to avoid conflicts with SpaCy 2.x.)
python3 -m venv venv
source venv/bin/activate
pip install -U pip setuptools wheel
pip install -U spacy-nightly --pre
For more details on this version, you can visit the official website.
We assume the parser is already available at /path/to/Turku-neural-parser-pipeline/.
This is client-server software: you must first start the server (for example inside a separate screen session) and then use the client code provided in our repo.
Download the pre-trained models that you wish to use and uncompress them in the root folder of the Turku-neural-parser-pipeline repository. Models are available at this URL; for example, the German model is called models_de_gsd.tgz.
Run the Server:
screen -S turku-server
cd /path/to/Turku-neural-parser-pipeline/
source venv-parser-neural/bin/activate
python full_pipeline_server.py --port 7689 --conf models_de_gsd/pipelines.yaml parse_conllu
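Once the server is up, you can optionally sanity-check it from another shell. This assumes the server accepts plain HTTP POST requests of the input on the configured port, as described in the Turku pipeline documentation; otherwise just use the client script below:

curl --request POST --header 'Content-Type: text/plain; charset=utf-8' --data-binary @some_input.conllu http://localhost:7689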
Run the Client (our script systems/parse_turku.py). The script receives 4 parameters: --input_file, --output_file, --corpus_name and --gld_token_type.
For example, to parse the TIGER corpus (which is in CoNLL-U token format), run:
screen -S turku-client
cd /path/to/this-repo/
python systems/parse_turku.py \
    --input_file /vol/work/kupietz/Tiger_2_2/data/german/tiger/test/german_tiger_test.conll \
    --output_file /path/to/german_tiger_test \
    --corpus_name TigerTestOld \
    --gld_token_type CoNLLUP_Token
This code splits the CoNLL file into N chunks (currently hard-coded to 10K sentences per chunk, so that even very large files fit in memory). Once the script finishes, it produces a log file at logs/Parse_{corpus_name}_Turku.log as well as output files of the form {output_file}.parsed.{chunk_index}.conllu.
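For illustration only, the chunking idea boils down to something like the following sketch (this is not the repo's code; it just shows how a CoNLL-U file can be split at sentence boundaries into 10K-sentence chunks):

```python
# Illustrative sketch, not the repo's implementation.
def iter_sentence_chunks(conllu_path, chunk_size=10_000):
    """Yield blocks of at most `chunk_size` CoNLL-U sentences from a large file."""
    chunk, n_sents = [], 0
    with open(conllu_path, encoding="utf-8") as f:
        for line in f:
            chunk.append(line)
            if not line.strip():          # blank line marks a sentence boundary
                n_sents += 1
                if n_sents == chunk_size:
                    yield "".join(chunk)
                    chunk, n_sents = [], 0
    if chunk:                             # last, possibly smaller, chunk
        yield "".join(chunk)
```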
TreeTagger requires a particular input file format where each line must have only one token and sentences are separated by a special token. We include a script to convert "normal" CoNLL files into this format. To do so, run:
python my_utils/conll_to_tok.py \
    --src_file /vol/work/kupietz/Tiger_2_2/data/german/tiger/test/german_tiger_test.conll \
    --output_file german_tiger_test \
    --sent_sep "</S>" \
    --token_type CoNLLUP_Token
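For reference, the converted file contains one token per line, with sentences separated by the token passed via --sent_sep; a hypothetical excerpt:

```
Der
Mann
schläft
.
</S>
Die
Frau
liest
.
</S>
```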
The script writes an output_file.sep.tok in this format. Then go to path/to/TreeTagger/ and, once inside that folder, run:
bash cmd/tree-tagger-german-notokenize /path/to/german_tiger_test.sep.tok > /path/to/output/german_tiger_test.TreeTagger.conll
RNNTagger uses a format similar to TreeTagger's, but does not require a specific </S> separator. You can obtain the proper file with:
python my_utils/conll_to_tok.py \
    --src_file /vol/work/kupietz/Tiger_2_2/data/german/tiger/test/german_tiger_test.conll \
    --output_file german_tiger_test \
    --token_type CoNLLUP_Token
This outputs an output_file.tok in the proper format. Then go to path/to/RNNTagger/, activate the Python environment where PyTorch is installed, and, once inside that folder, run:
bash cmd/rnn-tagger-german-notokenize.sh /path/to/german_tiger_test.tok > /path/to/output/german_tiger_test.RNNTagger.conll
The pre-trained SpaCy POS tagger was trained on the TIGER corpus and uses a lookup table for lemmatization.
Our script for running the SpaCy parser can receive 8 parameters, among them the SpaCy model to load (e.g. de_core_news_lg) and the gold token type (see lib/CoNLL_Annotation.py for the classes of tokens that can be used). To keep with the tiger_test example, one can obtain the SpaCy annotations by running:
python systems/parse_spacy.py \
    --corpus_name TigerTest \
    --gld_token_type CoNLLUP_Token \
    --input_file /vol/work/kupietz/Tiger_2_2/data/german/tiger/test/german_tiger_test.conll \
    --output_file /path/to/output/tiger_test.spacy.conllu \
    --text_file /path/to/output/tiger_test_sentences.txt
Note that the script is already optimized for reading CoNLL-U files, keeping the appropriate comments, partitioning huge files into N chunks, and running them in parallel with M processes. These parameters are currently hard-coded; however, they already hold the values that were found optimal for processing the whole DeReKo corpus (and they can of course be adapted further).
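Conceptually, the partition-and-parallelize step corresponds to a standard worker pool, sketched below (parse_chunk is a hypothetical placeholder, not the repo's API):

```python
from multiprocessing import Pool

def parse_chunk(chunk_text):
    # Hypothetical stand-in: in the real script this would run the SpaCy
    # pipeline over one chunk and return its CoNLL-U serialization.
    return chunk_text

def parse_in_parallel(chunks, n_processes=4):
    # The N chunks are distributed over M worker processes.
    with Pool(processes=n_processes) as pool:
        return pool.map(parse_chunk, chunks)
```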
We evaluate the models for accuracy and macro F1, as well as processing speed (sentences per second):
System | Lemma Acc | POS Acc | POS F1 | sents/sec |
---|---|---|---|---|
TreeTagger* | 90.62 | 95.24 | 74.35 | 12,618 |
SpaCy | 85.33 | 99.07 | 95.84 | 1,577 |
SpaCy + Germalemma | 90.98 | 99.07 | 95.84 | 1,230 |
Turku NLP [CPU] | 78.90 | 94.43 | 70.78 | 151 |
RNNTagger* [GPU] | 97.93 | 99.44 | 93.72 | 141 |
* Because of divergences in lemmas and POS tags, TreeTagger and RNNTagger needed post-processing to agree with the DE_GSD gold standard.
One can see that SpaCy offers the best performance-speed trade-off, especially when several CPUs are available.
System | Lemma Acc | POS Acc | POS F1 |
---|---|---|---|
RNNTagger* | 97.57 | 99.41 | 98.41 |
SpaCy + Germalemma | 91.24 | 98.97 | 97.01 |
TreeTagger* | 90.21 | 95.42 | 79.73 |
Turku NLP | 77.07 | 94.65 | 78.24 |
* with post-processing applied
System | Lemma Acc | POS Acc | POS F1 |
---|---|---|---|
Turku NLP | 81.97 | 97.07 | 86.58 |
RNNTagger* | 93.87 | 95.89 | 82.86 |
SpaCy + Germalemma | 90.59 | 95.43 | 83.63 |
SpaCy | 85.92 | 95.43 | 83.63 |
TreeTagger* | 90.91 | 93.64 | 75.70 |
RNNTagger (original) | 93.87 | 90.41 | 80.97 |
TreeTagger (original) | 79.65 | 88.17 | 73.83 |
* with post-processing applied
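For reference, accuracy and macro F1 over aligned gold/predicted tag sequences can be computed, e.g., with scikit-learn (a minimal sketch, not the repo's evaluate.py):

```python
from sklearn.metrics import accuracy_score, f1_score

gold = ["NN", "VVFIN", "ART", "NN"]   # toy gold POS tags, for illustration only
pred = ["NN", "VVFIN", "ART", "NE"]   # toy predicted POS tags

print("POS Acc:", accuracy_score(gold, pred))
print("POS F1 (macro):", f1_score(gold, pred, average="macro"))
```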
You can follow the instructions for training custom models here
It is possible to train a POS tagger from scratch or fine-tune one using the SpaCy API. It is also possible to load custom lemmatizer rules (currently SpaCy only uses a lookup table, which is why adding GermaLemma improved performance).
To train a SpaCy model with the 2.x version, you can follow the dummy code provided in spacy_train/custom_spacy_tagger_2x.py.
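Independently of that script, the SpaCy 2.x training loop follows the pattern below (a minimal sketch based on the official SpaCy 2.x tagger example; the tag map and training data are toy placeholders, not the TIGER setup):

```python
import random
import spacy
from spacy.util import minibatch

# Toy tag map and training data, for illustration only.
TAG_MAP = {"N": {"pos": "NOUN"}, "V": {"pos": "VERB"}, "J": {"pos": "ADJ"}}
TRAIN_DATA = [
    ("Ich mag grüne Äpfel", {"tags": ["N", "V", "J", "N"]}),
    ("Ich esse rote Paprika", {"tags": ["N", "V", "J", "N"]}),
]

nlp = spacy.blank("de")                  # blank German pipeline (SpaCy 2.x)
tagger = nlp.create_pipe("tagger")
for tag, values in TAG_MAP.items():
    tagger.add_label(tag, values)
nlp.add_pipe(tagger)

optimizer = nlp.begin_training()
for epoch in range(25):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for batch in minibatch(TRAIN_DATA, size=8):
        texts, annotations = zip(*batch)
        nlp.update(texts, annotations, sgd=optimizer, losses=losses)
    print(epoch, losses)
```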
SpaCy 3.x is still in beta; however, it will provide a more robust API for training custom models, as well as access to all models available from the Hugging Face Transformers library. More information about this version of SpaCy is available on their blog.
This version will also provide a more flexible API for lemmatization; however, this is not yet implemented.
To train a POS tagger in SpaCy 3.x, take the following steps (using TIGER as an example, but any data available in CoNLL or SpaCy JSON format can be used):
Create a SpaCy-readable JSON dataset with spacy_train/conll2spacy.py:

python DeReKo/spacy_train/conll2spacy.py --corpus_name TigerALL --gld_token_type CoNLLUP_Token \
    --input_file /vol/work/kupietz/Tiger_2_2/data/german/tiger/train/german_tiger_train.conll \
    --output_file /path/to/Tiger.train.json \
    --text_file /path/to/Tiger.train.sents.txt
python -m spacy convert -c json /path/to/Tiger.train.json out_dir_only/
An example base config is provided in ids-projects/DeReKo/spacy_train/basic_config_newOrth.cfg. Fill it with the remaining default values by running:

python -m spacy init fill-config basic_config.cfg final_config.cfg
python -m spacy train final_config.cfg --output tiger_spacy --verbose --gpu-id 0
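After training, the best checkpoint can be loaded like any other SpaCy pipeline (assuming the default output layout of spacy train, which writes model-best and model-last into the output directory):

```python
import spacy

nlp = spacy.load("tiger_spacy/model-best")   # best checkpoint written by `spacy train`
doc = nlp("Der Mann schläft .")
for token in doc:
    print(token.text, token.tag_)
```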
Specific scripts for parallel execution of the SpaCy lemmatizer and tagger on big DeReKo files:
- dereko_all_filenames.txt and dereko_mini_test.txt
- exec_dereko_parallel.sh: Main execution script for running N SpaCy processes on ALL DeReKo files.
- explore_dereko.py: Prints the available files inside the DeReKo directory
- spacy_train:
  - conll2spacy.py: creates JSON dataset files readable by SpaCy scripts
  - custom_spacy_dereko.py: Prepares pre-trained vector files to be SpaCy readable
  - custom_spacy_tagger_2x.py: Trains a POS Tagger using the SpaCy 2.x library
  - *.cfg: used by SpaCy 3.x scripts

Main class definitions and other useful resources:
- CoNLL_Annotation.py: Contains the CoNLL token class definitions used by most systems to process CoNLL datasets.
- German_STTS_Tagset.tsv: Inventory of German POS tags as defined in the TIGER corpus

Directory where the logging *.log files are saved.
Contains auxiliary scripts for file handling, pre-processing and execution of systems:
- clean_dereko_vectors.py
- conll_to_tok.py
- file_utils.py
- make_new_orth_silver_lemmas.py (DELETE!?)
- make_tiger_new_orth.py

Here is where all experiments' outputs are saved, including error analysis, evaluation stats, etc.
Main scripts to execute the lemmatizers and taggers on any dataset:
- parse_spacy.py
- parse_spacy3.py
- parse_turku.py
- Run_Tree-RNN_Taggers.txt

Scripts to evaluate and compare systems' performance:
- eval_old_vs_new_tiger.py
- evaluate.py