    python3 -m venv venv
    source venv/bin/activate
    export PYTHONPATH=$PYTHONPATH:.
    pip install -U pip setuptools wheel
    pip install -U spacy==2.3.2

For more details you can visit the official website.
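To check that the environment works, you can load a pre-trained German model from Python. This is only a minimal sanity check; it assumes you have downloaded de_core_news_lg, the model referenced by our SpaCy parsing script below:

```python
# Minimal sanity check for the SpaCy 2.3.2 environment.
# Assumes the German model has been downloaded first, e.g.:
#   python -m spacy download de_core_news_lg
import spacy

nlp = spacy.load("de_core_news_lg")
doc = nlp("Der schnelle braune Fuchs springt über den faulen Hund.")
for token in doc:
    print(token.text, token.lemma_, token.tag_)
```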
A neural parsing pipeline for segmentation, morphological tagging, dependency parsing and lemmatization with pre-trained models for more than 50 languages. Top ranker in the CoNLL-18 Shared Task.
Follow the installation instructions on their website.
A tool for annotating text with part-of-speech and lemma information.
Follow the installation instructions on their website.
RNNTagger was implemented in Python using the Deep Learning library PyTorch. Compared to TreeTagger, it has higher tagging accuracy and lemmatizes all tokens; however, it is much slower and requires a GPU.
Follow the installation instructions on their website.
(If you use this, it needs its own virtual environment to avoid conflict with SpaCy 2.x)
    python3 -m venv venv
    source venv/bin/activate
    pip install -U pip setuptools wheel
    pip install -U spacy-nightly --pre

For more details on this version you can visit the official website.
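A quick way to confirm that the nightly 3.x version (and not the 2.x one) is the one active in this environment:

```python
# Confirm that the nightly (3.x) SpaCy is active in this environment.
import spacy
print(spacy.__version__)  # expected: a 3.x pre-release, not 2.3.2
```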
We assume the parser is already available at /path/to/Turku-neural-parser-pipeline/. This is client-server software; therefore, you must first start the server (for example, inside a separate screen session) and then use the client code provided in our repo.

Download the pre-trained models that you wish to use and uncompress them in the root folder of the Turku-neural-parser-pipeline repository. Models are available at this URL; for example, the German model is called models_de_gsd.tgz.
Run the Server:

    screen -S turku-server
    cd /path/to/Turku-neural-parser-pipeline/
    source venv-parser-neural/bin/activate
    python full_pipeline_server.py --port 7689 --conf models_de_gsd/pipelines.yaml parse_conllu
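Once the server is up, clients talk to it over HTTP on the chosen port. The following is only a rough sketch of such a request (the exact payload format expected by the Turku server is an assumption here; the supported way is the client script described next):

```python
import requests

# Assumption: the Turku pipeline server accepts the input in the body of a
# POST request on the configured port and returns CoNLL-U text. Check the
# pipeline's own documentation; systems/parse_turku.py is the supported client.
with open("german_tiger_test.conll", encoding="utf-8") as f:
    payload = f.read()

response = requests.post("http://localhost:7689", data=payload.encode("utf-8"))
print(response.text)  # parsed CoNLL-U
```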
Run the Client (systems/parse_turku.py). The script receives 4 parameters: --input_file, --output_file, --corpus_name and --gld_token_type. For example, to parse the TIGER corpus (which is in CoNLL-U token format), one runs:

    screen -S turku-client
    cd /path/to/this-repo/
    python systems/parse_turku.py \
        --input_file /vol/work/kupietz/Tiger_2_2/data/german/tiger/test/german_tiger_test.conll \
        --output_file /path/to/german_tiger_test \
        --corpus_name TigerTestOld \
        --gld_token_type CoNLLUP_Token
This code fragments the CoNLL file into N chunks (currently hardcoded to chunks of 10K sentences each, so that arbitrarily large files fit in memory). Once the script finishes, it produces a log file in logs/Parse_{corpus_name}_Turku.log as well as output files of the form {output_file}.parsed.{chunk_index}.conllu.
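The chunking logic is conceptually along these lines (a simplified sketch, not the repository's actual implementation):

```python
# Sketch: split a CoNLL file into chunks of at most N sentences each,
# so that very large files never have to be held in memory at once.
def iter_conll_sentences(path):
    """Yield sentences (lists of lines) from a CoNLL file."""
    sentence = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                sentence.append(line)
            elif sentence:          # blank line = sentence boundary
                yield sentence
                sentence = []
    if sentence:
        yield sentence

def chunk_sentences(sentences, chunk_size=10000):
    """Group sentences into chunks of at most chunk_size."""
    chunk = []
    for sent in sentences:
        chunk.append(sent)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk
```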
TreeTagger requires a particular input file format where each line must have only one token and sentences are separated by a special token. We include a script to convert "normal" CoNLL files into this format. To do so, run:
    python my_utils/conll_to_tok.py \
        --src_file /vol/work/kupietz/Tiger_2_2/data/german/tiger/test/german_tiger_test.conll \
        --output_file german_tiger_test \
        --sent_sep "</S>" \
        --token_type CoNLLUP_Token
This outputs a file output_file.sep.tok with the proper format. Then go to path/to/TreeTagger/ and, once inside that folder, run:

    bash cmd/tree-tagger-german-notokenize /path/to/german_tiger_test.sep.tok > /path/to/output/german_tiger_test.TreeTagger.conll
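For reference, the conversion performed by my_utils/conll_to_tok.py is conceptually along these lines (a simplified sketch that assumes the word form sits in the second tab-separated column; the real script supports several token classes via --token_type):

```python
# Rough sketch of the CoNLL -> one-token-per-line conversion.
# Assumptions: word form in the second tab-separated column, blank lines
# mark sentence boundaries; the actual my_utils/conll_to_tok.py does more.
def conll_to_tok(src_file, out_file, sent_sep=""):
    with open(src_file, encoding="utf-8") as src, \
         open(out_file, "w", encoding="utf-8") as out:
        for line in src:
            line = line.rstrip("\n")
            if not line.strip():            # sentence boundary
                out.write((sent_sep + "\n") if sent_sep else "\n")
            elif not line.startswith("#"):  # skip CoNLL-U comment lines
                out.write(line.split("\t")[1] + "\n")
```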
RNNTagger uses a format similar to TreeTagger's, but does not require a specific </S> separator. You can obtain the proper file with:
    python my_utils/conll_to_tok.py \
        --src_file /vol/work/kupietz/Tiger_2_2/data/german/tiger/test/german_tiger_test.conll \
        --output_file german_tiger_test \
        --token_type CoNLLUP_Token
This outputs a file output_file.tok with the proper format. Then go to path/to/RNNTagger/, activate the Python environment where PyTorch is installed, and once inside that folder, run:

    bash cmd/rnn-tagger-german-notokenize.sh /path/to/german_tiger_test.tok > /path/to/output/german_tiger_test.RNNTagger.conll
Our script for running the SpaCy pretrained parser can receive 8 parameters, among them the SpaCy model to use (e.g. de_core_news_lg) and the gold token type (see lib/CoNLL_Annotation.py for the classes of tokens that can be used). To keep with the tiger_test example, one can obtain the SpaCy annotations by running:
    python systems/parse_spacy.py \
        --corpus_name TigerTest \
        --gld_token_type CoNLLUP_Token \
        --input_file /vol/work/kupietz/Tiger_2_2/data/german/tiger/test/german_tiger_test.conll \
        --output_file /path/to/output/tiger_test.spacy.conllu \
        --text_file /path/to/output/tiger_test_sentences.txt
Note that the script is already optimized for reading CoNLL-U files, keeping the appropriate comments, partitioning huge files into N chunks and running them in parallel with M processes. These parameters are currently hard-coded; however, they already hold the values that were found optimal for processing the whole DeReKo corpus (and they can of course be adapted further...).
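Conceptually, the parallel annotation relies on SpaCy's streaming interface, roughly as in the following sketch (the chunk size and process count shown are illustrative, not the script's hard-coded values):

```python
# Rough sketch of chunked, multi-process tagging with SpaCy 2.x.
# The batch size and number of processes below are illustrative only;
# parse_spacy.py hard-codes its own values, tuned for the full DeReKo corpus.
import spacy

nlp = spacy.load("de_core_news_lg")

def annotate(sentences, n_process=4, batch_size=1000):
    # nlp.pipe streams the input and distributes work over several processes
    for doc in nlp.pipe(sentences, n_process=n_process, batch_size=batch_size):
        yield [(tok.text, tok.lemma_, tok.tag_) for tok in doc]
```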
You can follow the instructions for training custom models here
<< Insert Table of Results HERE >>
Specific scripts for parallel execution of SpaCy Lemmatizer and Tagger on DeReKo big files:

- dereko_all_filenames.txt and dereko_mini_test.txt
- exec_dereko_parallel.sh: Main Execution Script for running N SpaCy processes on ALL DeReKo Files.
- explore_dereko.py: Prints available information of the DeReKo Directory
- turku_client_parser.py (do I need this still? Must be the same as in systems/... DOUBLE CHECK!)
- spacy_train:
    - conll2spacy.py: creates JSON dataset files readable by SpaCy scripts
    - custom_spacy_dereko.py: Prepares pre-trained vector files to be SpaCy readable
    - custom_spacy_tagger_2x.py: Trains a POS Tagger using the SpaCy 2.x library
    - custom_spacy_tagger_3x.py: Trains a POS Tagger using the SpaCy 3.x library
    - *.cfg files used by the SpaCy 3.x scripts

Main Class definitions and other useful resources:

- CoNLL_Annotation.py: Contains the CoNLL Token Class definitions used by most systems to process CoNLL datasets.
- German_STTS_Tagset.tsv: Inventory of German POS Tags as defined in the TIGER corpus

Directory where the logging *.log files are saved.

Contains auxiliary scripts for file-handling, pre-processing and execution of systems:

- clean_dereko_vectors.py
- conll_to_tok.py
- file_utils.py
- make_new_orth_silver_lemmas.py (DELETE!?)
- make_tiger_new_orth.py

Here is where all experiments' outputs are saved, including error analysis, evaluation stats, etcetera...

Main scripts to execute the Lemmatizers and Taggers on any dataset:

- parse_spacy.py
- parse_spacy3.py
- parse_turku.py
- Run_Tree-RNN_Taggers.txt

Scripts to evaluate and compare systems' performance:

- eval_old_vs_new_tiger.py
- evaluate.py