Finished repo README
diff --git a/DeReKo/spacy_train/custom_spacy_tagger_3x.py b/DeReKo/spacy_train/custom_spacy_tagger_3x.py
deleted file mode 100644
index 45a76c9..0000000
--- a/DeReKo/spacy_train/custom_spacy_tagger_3x.py
+++ /dev/null
@@ -1,4 +0,0 @@
-import spacy
-nlp = spacy.load("de_dep_news_trf")
-doc = nlp("Das ist ein Satz.")
-print(doc)
\ No newline at end of file
diff --git a/DeReKo/turku_client_parser.py b/DeReKo/turku_client_parser.py
deleted file mode 100644
index 0eba137..0000000
--- a/DeReKo/turku_client_parser.py
+++ /dev/null
@@ -1,82 +0,0 @@
-# TODO: write a client to make multiple requests to the server!
-import subprocess, json, time
-import glob, logging
-import os.path, sys
-from my_utils.file_utils import *
-from lib.CoNLL_Annotation import CoNLLUP_Token
-
-# TODO: Add logging instead of Prints!
-
-DEREKO_DIR = "/export/netapp/kupietz/N-GRAMM-STUDIE/conllu/"
-
-def get_filenames(data_dir):
- filenames = []
- for filepath in glob.iglob(f'{data_dir}/*.conllu.gz', recursive=False):
- fname = filepath.split("/")[-1]
- filenames.append(filepath)
- return sorted(filenames)
-
-
-def expand_file(f):
- # Expand the .tgz file
- fname = f[:-3]
- if not os.path.isfile(fname):
- p = subprocess.call(f"gunzip -c {f} > {fname}", shell=True)
- if p == 0:
- logger.info("Successfully uncompressed file")
- else:
- logger.info(f"Couldn't expand file {f}")
- raise Exception
- else:
- logger.info(f"File {fname} is already uncompressed. Skipping this step...")
-
- # Substitute the Commentary Lines on the Expanded file
- fixed_filename = f"{fname}.fixed"
- p = subprocess.call(f"sed 's/^# /###C: /g' {fname}", shell=True, stdout=open(fixed_filename, "w")) # stdout=subprocess.PIPE
- if p == 0:
- logger.info("Successfully fixed comments on file")
- else:
- logger.info(f"Something went wrong when substituting commentaries")
- raise Exception
- return fixed_filename
-
-
-
-if __name__ == "__main__":
- conll_files = get_filenames(DEREKO_DIR)[:1] # This is for Development Purposes only process the first [at most] 2 files
- #print(conll_files)
- #conll_files = ["tutorial_examples/mini_test_raw.conllu.gz"]
- file_has_next, chunk_ix = True, 0
- CHUNK_SIZE = 20000
-
- # =====================================================================================
- # LOGGING INFO ...
- # =====================================================================================
- logger = logging.getLogger(__name__)
- console_hdlr = logging.StreamHandler(sys.stdout)
- file_hdlr = logging.FileHandler(filename=f"ParseTests.log")
- logging.basicConfig(level=logging.INFO, handlers=[console_hdlr, file_hdlr])
- logger.info("Start Logging")
- logger.info(f"Chunking in Files of {CHUNK_SIZE} Sentences")
-
- # =====================================================================================
- # PROCESS (PARSE) ALL FILES FOUND ...
- # =====================================================================================
- for f in conll_files:
- start = time.time()
- text_filename = expand_file(f)
- line_generator = file_generator(text_filename)
- total_processed_sents = 0
- while file_has_next:
- raw_text, file_has_next, n_sents = get_file_chunk(line_generator, chunk_size=CHUNK_SIZE, token_class=CoNLLUP_Token)
- total_processed_sents += n_sents
- if len(raw_text) > 0:
- turku_parse_file(raw_text, text_filename, chunk_ix)
- now = time.time()
- elapsed = (now - start)
- logger.info(f"Time Elapsed: {elapsed}. Processed {total_processed_sents}. [{total_processed_sents/elapsed} Sents/sec]\n") # Toks/Sec???
- chunk_ix += 1
- end = time.time()
- logger.info(f"Processing File {f} took {(end - start)} seconds!")
-
-
\ No newline at end of file
diff --git a/README.md b/README.md
index 5ce3ed9..50c5093 100644
--- a/README.md
+++ b/README.md
@@ -18,6 +18,14 @@
For more details you can visit the [official website](https://spacy.io/usage#quickstart)
+#### Germalemma
+
+```
+pip install -U germalemma
+```
+
+More details on their [website](https://github.com/WZBSocialScienceCenter/germalemma)
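+
+A minimal usage sketch (assuming the `GermaLemma` API documented there; `find_lemma()` expects the token plus a coarse POS hint such as `N`, `V`, `ADJ` or `ADV`):
+
+```
+from germalemma import GermaLemma
+
+lemmatizer = GermaLemma()
+# look up the lemma of a token, given its (coarse) POS tag
+print(lemmatizer.find_lemma("Feinstaubbelastungen", "N"))  # Feinstaubbelastung
+```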
+
#### Turku Parser
A neural parsing pipeline for segmentation, morphological tagging, dependency parsing and lemmatization with pre-trained models for more than 50 languages. Top ranker in the CoNLL-18 Shared Task.
@@ -136,7 +144,9 @@
### Run SpaCy 2.x
-Our script for running the SpaCy pretrained parser can receive 8 parameters:
+The pre-trained SpaCy POS tagger was trained on the TIGER corpus and uses a lookup table for lemmatization.
+
+Our script for running the SpaCy parser can receive 8 parameters:
* **input_file**: path to the (empty) CoNLL file that will be processed. An *empty* CoNLL file means that all columns (except column one, which should contain a word token) can be empty
* **corpus_name**: A string to distinguish the current corpus being processed
* **output_file**: File where the SpaCy Predictions will be saved
@@ -159,24 +169,96 @@
Note that the script is already optimized for reading CoNLL-U files, keeping the appropriate comments, and partitioning huge files into N chunks and running them in parallel with M processes. These parameters are currently hard-coded; however, they already contain the values that were found optimal for processing the whole DeReKo corpus (but of course they can be further adapted...).
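+
+To illustrate the chunk-and-parallelize idea, here is a rough, self-contained sketch (the `parse_chunk()` body and the input path are placeholders, not the repository's actual code):
+
+```
+from itertools import islice
+from multiprocessing import Pool
+
+CHUNK_SIZE = 20000   # sentences per chunk
+M_PROCESSES = 4      # parallel worker processes
+
+def read_sentences(conllu_path):
+    """Yield CoNLL-U sentence blocks, keeping comment lines with their sentence."""
+    with open(conllu_path, encoding="utf-8") as fh:
+        sent = []
+        for line in fh:
+            if line.strip():
+                sent.append(line)
+            elif sent:
+                yield "".join(sent)
+                sent = []
+        if sent:
+            yield "".join(sent)
+
+def chunked(iterable, size):
+    iterator = iter(iterable)
+    while True:
+        block = list(islice(iterator, size))
+        if not block:
+            return
+        yield block
+
+def parse_chunk(sentences):
+    # placeholder: the real scripts run the tagger over this chunk and write its annotations
+    return len(sentences)
+
+if __name__ == "__main__":
+    with Pool(processes=M_PROCESSES) as pool:
+        sizes = pool.map(parse_chunk, chunked(read_sentences("input.conllu"), CHUNK_SIZE))
+    print(f"Processed {sum(sizes)} sentences in {len(sizes)} chunks")
+```
+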
+## Evaluation of Taggers
+
+We evaluate the models for accuracy and macro F1, as well as processing speed.
+
+1. Whole TIGER Corpus (50,472 sentences) - to test speed...
+
+| System | Lemma Acc | POS Acc | POS F1 | sents/sec |
+|-------------------------------|-----------|-------------|------------|-----------|
+| TreeTagger* | 90.62 | 95.24 | 74.35 | 12,618 |
+| SpaCy | 85.33 | 99.07 | 95.84 | 1,577 |
+| **SpaCy + Germalemma** | **90.98** | **99.07** | **95.84** | **1,230** |
+| Turku NLP [CPU] | 78.90 | 94.43 | 70.78 | 151 |
+| RNNTagger* [GPU] | 97.93 | 99.44 | 93.72 | 141 |
+
+\* Because of divergences in lemmas and POS tags, TreeTagger and RNNTagger needed *post-processing* to agree with the DE_GSD gold standard.
+
+One can see that the best performance-speed trade-off is with SpaCy, especially when several CPUs are available.
+
+
+2. TIGER Test (767 sentences)
+
+| System | Lemma Acc | POS Acc | POS F1 |
+|---------------------------|-------------|---------|---------|
+| RNNTagger* | 97.57 | 99.41 | 98.41 |
+| SpaCy + Germalemma | 91.24 | 98.97 | 97.01 |
+| TreeTagger* | 90.21 | 95.42 | 79.73 |
+| Turku NLP | 77.07 | 94.65 | 78.24 |
+
+\* with post-processing applied
+
+3. DE_GSD Test (977 sentences) [Universal Dependencies - CoNLL-18 Dataset](https://universaldependencies.org/conll18/)
+
+| System | Lemma Acc | POS Acc | POS F1 |
+|-----------------------|-----------|---------|--------|
+| RNNTagger* | 93.87 | 95.89 | 82.86 |
+| TreeTagger* | 90.91 | 93.64 | 75.70 |
+| SpaCy + Germalemma | 90.59 | 95.43 | 83.63 |
+| SpaCy | 85.92 | 95.43 | 83.63 |
+| Turku NLP | 81.97 | 97.07 | 86.58 |
+| RNNTagger (original) | 93.87 | 90.41 | 80.97 |
+| TreeTagger (original) | 79.65 | 88.17 | 73.83 |
+
+\* with post-processing applied
+
+
## Custom Train
### Train TurkuNLP Parser
You can follow the instructions for training custom models [here](https://turkunlp.org/Turku-neural-parser-pipeline/training.html)
-
-## Evaluation of Taggers
-
-<< Insert Table of Results HERE >>
-
-
### Train Spacy 2.x Tagger
+* It is possible to train from scratch or fine-tune a POS tagger using the SpaCy API. It is also possible to load custom lemmatizer rules (currently SpaCy only uses a lookup table, which is why adding GermaLemma improved performance).
+
+* To train a SpaCy model in the 2.x version, you can follow the dummy code provided in `spacy_train/custom_spacy_tagger_2x.py`; a condensed sketch is shown below.
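+
+Not the actual repository script, just a condensed sketch of the SpaCy 2.x training loop with a single toy sentence (real training data comes from the converted TIGER files):
+
+```
+import random
+import spacy
+
+# toy example: (text, token-level STTS tags)
+TRAIN_DATA = [
+    ("Das ist ein Satz.", {"tags": ["PDS", "VAFIN", "ART", "NN", "$."]}),
+]
+
+nlp = spacy.blank("de")
+tagger = nlp.create_pipe("tagger")
+for _, annotations in TRAIN_DATA:
+    for tag in annotations["tags"]:
+        tagger.add_label(tag)
+nlp.add_pipe(tagger)
+
+optimizer = nlp.begin_training()
+for _ in range(20):
+    random.shuffle(TRAIN_DATA)
+    losses = {}
+    for text, annotations in TRAIN_DATA:
+        nlp.update([text], [annotations], sgd=optimizer, losses=losses)
+
+nlp.to_disk("tiger_spacy_2x")  # hypothetical output directory
+```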
+
### Train Spacy 3.x Tagger (Transformer)
+* [SpaCy 3.x](https://nightly.spacy.io/usage/v3) is still in BETA, but it will provide a more robust API for training custom models as well as access to all of the models available in the [Hugging Face Transformers library](https://huggingface.co/transformers/pretrained_models.html). More information about this version of SpaCy is available on [their blog](https://explosion.ai/blog/spacy-v3-nightly).
+* This version will also provide a more flexible API for lemmatization; however, this is not implemented yet...
+
+1. To train a POS tagger in SpaCy 3.x, follow these steps (using TIGER as an example, but any data available in CoNLL or SpaCy-JSON format can be used):
+ 1. Convert the input CoNLL file into SpaCy-JSON training format with `spacy_train/conll2spacy.py`:
+
+ ```
+ python DeReKo/spacy_train/conll2spacy.py --corpus_name TigerALL --gld_token_type CoNLLUP_Token \
+ --input_file /vol/work/kupietz/Tiger_2_2/data/german/tiger/train/german_tiger_train.conll \
+ --output_file /path/to/Tiger.train.json \
+ --text_file /path/to/Tiger.train.sents.txt
+ ```
+
+ 2. Convert to SpaCy 3.x [dataset format](https://nightly.spacy.io/api/data-formats#training) file by running:
+ ```
+ python -m spacy convert -c json /path/to/Tiger.train.json out_dir_only/
+ ```
+
+ 3. Create a basic config file following [SpaCy API](https://nightly.spacy.io/usage/training#config-custom)
+ 4. Create the final Config file with:
+ ```
+ python -m spacy init fill-config basic_config.cfg final_config.cfg
+ ```
+ 5. Train the Model using GPU
+ ```
+ python -m spacy train final_config.cfg --output tiger_spacy --verbose --gpu-id 0
+ ```
+
+* More information available at the [SpaCy API webpage](https://nightly.spacy.io/usage/training).
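+
+Once training has finished, the best checkpoint can be loaded like any other SpaCy model (a small sketch; `tiger_spacy/model-best` assumes the `--output tiger_spacy` value used above):
+
+```
+import spacy
+
+# spacy train writes model-best/ and model-last/ into the --output directory
+nlp = spacy.load("tiger_spacy/model-best")
+doc = nlp("Das ist ein Satz.")
+print([(token.text, token.tag_) for token in doc])
+```
+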
## Overall Repository Structure
@@ -185,13 +267,11 @@
* Text files used to process DeReKo: `dereko_all_filenames.txt` and `dereko_mini_test.txt`
* `exec_dereko_parallel.sh`: Main Execution Script for running *N* SpaCy processes on **ALL DeReKo Files**.
-* `explore_dereko.py`: Prints available information of DeReko Directory
-* `turku_client_parser.py` (do I need this still? Must be the same as in systems/... DOUBLE CHECK!)
+* `explore_dereko.py`: Prints available files inside the DeReko Directory
* Directory `spacy_train`:
* `conll2spacy.py`: creates JSON dataset files readable by SpaCy scripts
* `custom_spacy_dereko.py`: Prepares pre-trained vector files to be SpaCy readable
* `custom_spacy_tagger_2x.py`: Trains a POS Tagger using SpaCy 2.x library
- * `custom_spacy_tagger_3x.py`: Trains a POS Tagger using SpaCy 3.x library
* Config Files `*.cfg` used by SpaCy 3.x scripts
#### lib