    python3 -m venv venv
    source venv/bin/activate
    export PYTHONPATH=$PYTHONPATH:.
    pip install -U pip setuptools wheel
    pip install -U spacy==2.3.2

For more details you can visit the official website.
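To check that the environment works, you can load a pre-trained German model from Python. This is only a minimal sanity check; it assumes you have downloaded de_core_news_lg, the model referenced by our SpaCy parsing script below:

```python
# Minimal sanity check for the SpaCy 2.3.2 environment.
# Assumes the German model has been downloaded first, e.g.:
#   python -m spacy download de_core_news_lg
import spacy

nlp = spacy.load("de_core_news_lg")
doc = nlp("Der schnelle braune Fuchs springt über den faulen Hund.")
for token in doc:
    print(token.text, token.lemma_, token.tag_)
```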
A neural parsing pipeline for segmentation, morphological tagging, dependency parsing and lemmatization with pre-trained models for more than 50 languages. Top ranker in the CoNLL-18 Shared Task.
Follow the installation instructions on their website.
A tool for annotating text with part-of-speech and lemma information.
Follow the installation instructions on their website.
RNNTagger was implemented in Python using the Deep Learning library PyTorch. Compared to TreeTagger, it has higher tagging accuracy and lemmatizes all tokens; however, it is much slower and requires a GPU.
Follow the installation instructions on their website.
(If you use this, it needs its own virtual environment to avoid conflict with SpaCy 2.x)
    python3 -m venv venv
    source venv/bin/activate
    pip install -U pip setuptools wheel
    pip install -U spacy-nightly --pre

For more details on this version you can visit the official website.
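A quick way to confirm that the nightly 3.x version (and not the 2.x one) is the one active in this environment:

```python
# Confirm that the nightly (3.x) SpaCy is active in this environment.
import spacy
print(spacy.__version__)  # expected: a 3.x pre-release, not 2.3.2
```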
We assume the parser is already available at /path/to/Turku-neural-parser-pipeline/. This is client-server software; therefore, you must first start the server (for example, inside a separate screen session) and then use the client code provided in our repo.

Download the pre-trained models that you wish to use and uncompress them in the root folder of the Turku-neural-parser-pipeline repository. Models are available at this URL; for example, the German model is called models_de_gsd.tgz.
Run the Server:

    screen -S turku-server
    cd /path/to/Turku-neural-parser-pipeline/
    source venv-parser-neural/bin/activate
    python full_pipeline_server.py --port 7689 --conf models_de_gsd/pipelines.yaml parse_conllu
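Once the server is up, clients talk to it over HTTP on the chosen port. The following is only a rough sketch of such a request (the exact payload format expected by the Turku server is an assumption here; the supported way is the client script described next):

```python
import requests

# Assumption: the Turku pipeline server accepts the input in the body of a
# POST request on the configured port and returns CoNLL-U text. Check the
# pipeline's own documentation; systems/parse_turku.py is the supported client.
with open("german_tiger_test.conll", encoding="utf-8") as f:
    payload = f.read()

response = requests.post("http://localhost:7689", data=payload.encode("utf-8"))
print(response.text)  # parsed CoNLL-U
```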
Run the Client (systems/parse_turku.py). The script receives 4 parameters: --input_file, --output_file, --corpus_name and --gld_token_type. For example, to parse the TIGER corpus (which is in CoNLL-U token format), one runs:

    screen -S turku-client
    cd /path/to/this-repo/
    python systems/parse_turku.py \
        --input_file /vol/work/kupietz/Tiger_2_2/data/german/tiger/test/german_tiger_test.conll \
        --output_file /path/to/german_tiger_test \
        --corpus_name TigerTestOld \
        --gld_token_type CoNLLUP_Token
This code fragments the CoNLL file into N chunks (currently hardcoded to chunks of 10K sentences each, so that arbitrarily large files fit in memory). Once the script finishes, it produces a log file in logs/Parse_{corpus_name}_Turku.log as well as output files of the form {output_file}.parsed.{chunk_index}.conllu.
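The chunking logic is conceptually along these lines (a simplified sketch, not the repository's actual implementation):

```python
# Sketch: split a CoNLL file into chunks of at most N sentences each,
# so that very large files never have to be held in memory at once.
def iter_conll_sentences(path):
    """Yield sentences (lists of lines) from a CoNLL file."""
    sentence = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                sentence.append(line)
            elif sentence:          # blank line = sentence boundary
                yield sentence
                sentence = []
    if sentence:
        yield sentence

def chunk_sentences(sentences, chunk_size=10000):
    """Group sentences into chunks of at most chunk_size."""
    chunk = []
    for sent in sentences:
        chunk.append(sent)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk
```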
TreeTagger requires a particular input file format where each line must have only one token and sentences are separated by a special token. We include a script to convert "normal" CoNLL files into this format. To do so, run:
    python my_utils/conll_to_tok.py \
        --src_file /vol/work/kupietz/Tiger_2_2/data/german/tiger/test/german_tiger_test.conll \
        --output_file german_tiger_test \
        --sent_sep "</S>" \
        --token_type CoNLLUP_Token
This outputs a file output_file.sep.tok with the proper format. Then go to path/to/TreeTagger/ and, once inside that folder, run:

    bash cmd/tree-tagger-german-notokenize /path/to/german_tiger_test.sep.tok > /path/to/output/german_tiger_test.TreeTagger.conll
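For reference, the conversion performed by my_utils/conll_to_tok.py is conceptually along these lines (a simplified sketch that assumes the word form sits in the second tab-separated column; the real script supports several token classes via --token_type):

```python
# Rough sketch of the CoNLL -> one-token-per-line conversion.
# Assumptions: word form in the second tab-separated column, blank lines
# mark sentence boundaries; the actual my_utils/conll_to_tok.py does more.
def conll_to_tok(src_file, out_file, sent_sep=""):
    with open(src_file, encoding="utf-8") as src, \
         open(out_file, "w", encoding="utf-8") as out:
        for line in src:
            line = line.rstrip("\n")
            if not line.strip():            # sentence boundary
                out.write((sent_sep + "\n") if sent_sep else "\n")
            elif not line.startswith("#"):  # skip CoNLL-U comment lines
                out.write(line.split("\t")[1] + "\n")
```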
RNNTagger uses a format similar to TreeTagger's, but does not require a specific </S> separator. You can obtain the proper file with:
    python my_utils/conll_to_tok.py \
        --src_file /vol/work/kupietz/Tiger_2_2/data/german/tiger/test/german_tiger_test.conll \
        --output_file german_tiger_test \
        --token_type CoNLLUP_Token
This outputs a file output_file.tok with the proper format. Then go to path/to/RNNTagger/, activate the Python environment where PyTorch is installed, and once inside that folder, run:

    bash cmd/rnn-tagger-german-notokenize.sh /path/to/german_tiger_test.tok > /path/to/output/german_tiger_test.RNNTagger.conll
Our script for running the SpaCy pretrained parser can receive 8 parameters, among them the SpaCy model to use (e.g. de_core_news_lg) and the gold token type (see lib/CoNLL_Annotation.py for the classes of tokens that can be used). To keep with the tiger_test example, one can obtain the SpaCy annotations by running:
    python systems/parse_spacy.py \
        --corpus_name TigerTest \
        --gld_token_type CoNLLUP_Token \
        --input_file /vol/work/kupietz/Tiger_2_2/data/german/tiger/test/german_tiger_test.conll \
        --output_file /path/to/output/tiger_test.spacy.conllu \
        --text_file /path/to/output/tiger_test_sentences.txt
Note that the script is already optimized for reading CoNLL-U files, keeping the appropriate comments, partitioning huge files into N chunks and running them in parallel with M processes. These parameters are currently hard-coded; however, they already hold the values that were found optimal for processing the whole DeReKo corpus (and they can of course be adapted further...).
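Conceptually, the parallel annotation relies on SpaCy's streaming interface, roughly as in the following sketch (the chunk size and process count shown are illustrative, not the script's hard-coded values):

```python
# Rough sketch of chunked, multi-process tagging with SpaCy 2.x.
# The batch size and number of processes below are illustrative only;
# parse_spacy.py hard-codes its own values, tuned for the full DeReKo corpus.
import spacy

nlp = spacy.load("de_core_news_lg")

def annotate(sentences, n_process=4, batch_size=1000):
    # nlp.pipe streams the input and distributes work over several processes
    for doc in nlp.pipe(sentences, n_process=n_process, batch_size=batch_size):
        yield [(tok.text, tok.lemma_, tok.tag_) for tok in doc]
```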
You can follow the instructions for training custom models here
<< Insert Table of Results HERE >>
Specific scripts for parallel execution of SpaCy Lemmatizer and Tagger on DeReKo big files:

- dereko_all_filenames.txt and dereko_mini_test.txt
- exec_dereko_parallel.sh: Main Execution Script for running N SpaCy processes on ALL DeReKo Files.
- explore_dereko.py: Prints available information of the DeReKo Directory
- turku_client_parser.py (do I need this still? Must be the same as in systems/... DOUBLE CHECK!)
- spacy_train:
    - conll2spacy.py: creates JSON dataset files readable by SpaCy scripts
    - custom_spacy_dereko.py: Prepares pre-trained vector files to be SpaCy readable
    - custom_spacy_tagger_2x.py: Trains a POS Tagger using the SpaCy 2.x library
    - custom_spacy_tagger_3x.py: Trains a POS Tagger using the SpaCy 3.x library
    - *.cfg files used by the SpaCy 3.x scripts

Main Class definitions and other useful resources:

- CoNLL_Annotation.py: Contains the CoNLL Token Class definitions used by most systems to process CoNLL datasets.
- German_STTS_Tagset.tsv: Inventory of German POS Tags as defined in the TIGER corpus

Directory where the logging *.log files are saved.

Contains auxiliary scripts for file-handling, pre-processing and execution of systems:

- clean_dereko_vectors.py
- conll_to_tok.py
- file_utils.py
- make_new_orth_silver_lemmas.py (DELETE!?)
- make_tiger_new_orth.py

Here is where all experiments' outputs are saved, including error analysis, evaluation stats, etcetera...

Main scripts to execute the Lemmatizers and Taggers on any dataset:

- parse_spacy.py
- parse_spacy3.py
- parse_turku.py
- Run_Tree-RNN_Taggers.txt

Scripts to evaluate and compare systems' performance:

- eval_old_vs_new_tiger.py
- evaluate.py