Finished repo README
diff --git a/DeReKo/spacy_train/custom_spacy_tagger_3x.py b/DeReKo/spacy_train/custom_spacy_tagger_3x.py
deleted file mode 100644
index 45a76c9..0000000
--- a/DeReKo/spacy_train/custom_spacy_tagger_3x.py
+++ /dev/null
@@ -1,4 +0,0 @@
-import spacy
-nlp = spacy.load("de_dep_news_trf")
-doc = nlp("Das ist ein Satz.")
-print(doc)
\ No newline at end of file
diff --git a/DeReKo/turku_client_parser.py b/DeReKo/turku_client_parser.py
deleted file mode 100644
index 0eba137..0000000
--- a/DeReKo/turku_client_parser.py
+++ /dev/null
@@ -1,82 +0,0 @@
-# TODO: write a client to make multiple requests to the server!
-import subprocess, json, time
-import glob, logging
-import os.path, sys
-from my_utils.file_utils import *
-from lib.CoNLL_Annotation import CoNLLUP_Token
-
-# TODO: Add logging instead of Prints!
-
-DEREKO_DIR = "/export/netapp/kupietz/N-GRAMM-STUDIE/conllu/"
-
-def get_filenames(data_dir):
- filenames = []
- for filepath in glob.iglob(f'{data_dir}/*.conllu.gz', recursive=False):
- fname = filepath.split("/")[-1]
- filenames.append(filepath)
- return sorted(filenames)
-
-
-def expand_file(f):
- # Expand the .tgz file
- fname = f[:-3]
- if not os.path.isfile(fname):
- p = subprocess.call(f"gunzip -c {f} > {fname}", shell=True)
- if p == 0:
- logger.info("Successfully uncompressed file")
- else:
- logger.info(f"Couldn't expand file {f}")
- raise Exception
- else:
- logger.info(f"File {fname} is already uncompressed. Skipping this step...")
-
- # Substitute the Commentary Lines on the Expanded file
- fixed_filename = f"{fname}.fixed"
- p = subprocess.call(f"sed 's/^# /###C: /g' {fname}", shell=True, stdout=open(fixed_filename, "w")) # stdout=subprocess.PIPE
- if p == 0:
- logger.info("Successfully fixed comments on file")
- else:
- logger.info(f"Something went wrong when substituting commentaries")
- raise Exception
- return fixed_filename
-
-
-
-if __name__ == "__main__":
- conll_files = get_filenames(DEREKO_DIR)[:1] # This is for Development Purposes only process the first [at most] 2 files
- #print(conll_files)
- #conll_files = ["tutorial_examples/mini_test_raw.conllu.gz"]
- file_has_next, chunk_ix = True, 0
- CHUNK_SIZE = 20000
-
- # =====================================================================================
- # LOGGING INFO ...
- # =====================================================================================
- logger = logging.getLogger(__name__)
- console_hdlr = logging.StreamHandler(sys.stdout)
- file_hdlr = logging.FileHandler(filename=f"ParseTests.log")
- logging.basicConfig(level=logging.INFO, handlers=[console_hdlr, file_hdlr])
- logger.info("Start Logging")
- logger.info(f"Chunking in Files of {CHUNK_SIZE} Sentences")
-
- # =====================================================================================
- # PROCESS (PARSE) ALL FILES FOUND ...
- # =====================================================================================
- for f in conll_files:
- start = time.time()
- text_filename = expand_file(f)
- line_generator = file_generator(text_filename)
- total_processed_sents = 0
- while file_has_next:
- raw_text, file_has_next, n_sents = get_file_chunk(line_generator, chunk_size=CHUNK_SIZE, token_class=CoNLLUP_Token)
- total_processed_sents += n_sents
- if len(raw_text) > 0:
- turku_parse_file(raw_text, text_filename, chunk_ix)
- now = time.time()
- elapsed = (now - start)
- logger.info(f"Time Elapsed: {elapsed}. Processed {total_processed_sents}. [{total_processed_sents/elapsed} Sents/sec]\n") # Toks/Sec???
- chunk_ix += 1
- end = time.time()
- logger.info(f"Processing File {f} took {(end - start)} seconds!")
-
-
\ No newline at end of file
diff --git a/README.md b/README.md
index 5ce3ed9..50c5093 100644
--- a/README.md
+++ b/README.md
@@ -18,6 +18,14 @@
For more details you can visit the [official website](https://spacy.io/usage#quickstart)
+#### Germalemma
+
+```
+pip install -U germalemma
+```
+
+More details on their [website](https://github.com/WZBSocialScienceCenter/germalemma)
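+
+A minimal usage sketch (assuming the `GermaLemma` API documented there; `find_lemma()` expects the token plus a coarse POS hint such as `N`, `V`, `ADJ` or `ADV`):
+
+```
+from germalemma import GermaLemma
+
+lemmatizer = GermaLemma()
+# look up the lemma of a token, given its (coarse) POS tag
+print(lemmatizer.find_lemma("Feinstaubbelastungen", "N"))  # Feinstaubbelastung
+```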
+
#### Turku Parser
A neural parsing pipeline for segmentation, morphological tagging, dependency parsing and lemmatization with pre-trained models for more than 50 languages. Top ranker in the CoNLL-18 Shared Task.
@@ -136,7 +144,9 @@
### Run SpaCy 2.x
-Our script for running the SpaCy pretrained parser can receive 8 parameters:
+The pre-trained SpaCy POS tagger was trained on the TIGER corpus and uses a lookup table for lemmatization.
+
+Our script for running the SpaCy parser can receive 8 parameters:
* **input_file**: path to the (empty) CoNLL file that will be processed. An *empty* CoNLL file means that all columns (except column one, which should contain a word token) can be empty
* **corpus_name**: A string to distinguish the current corpus being processed
* **output_file**: File where the SpaCy Predictions will be saved
@@ -159,24 +169,96 @@
Note that the script is already optimized for reading CoNLL-U files, keeping the appropriate comments, and partitioning huge files into N chunks and running them in parallel with M processes. These parameters are currently hard-coded; however, they already contain the values that were found optimal for processing the whole DeReKo corpus (but of course they can be further adapted...).
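+
+To illustrate the chunk-and-parallelize idea, here is a rough, self-contained sketch (the `parse_chunk()` body and the input path are placeholders, not the repository's actual code):
+
+```
+from itertools import islice
+from multiprocessing import Pool
+
+CHUNK_SIZE = 20000   # sentences per chunk
+M_PROCESSES = 4      # parallel worker processes
+
+def read_sentences(conllu_path):
+    """Yield CoNLL-U sentence blocks, keeping comment lines with their sentence."""
+    with open(conllu_path, encoding="utf-8") as fh:
+        sent = []
+        for line in fh:
+            if line.strip():
+                sent.append(line)
+            elif sent:
+                yield "".join(sent)
+                sent = []
+        if sent:
+            yield "".join(sent)
+
+def chunked(iterable, size):
+    iterator = iter(iterable)
+    while True:
+        block = list(islice(iterator, size))
+        if not block:
+            return
+        yield block
+
+def parse_chunk(sentences):
+    # placeholder: the real scripts run the tagger over this chunk and write its annotations
+    return len(sentences)
+
+if __name__ == "__main__":
+    with Pool(processes=M_PROCESSES) as pool:
+        sizes = pool.map(parse_chunk, chunked(read_sentences("input.conllu"), CHUNK_SIZE))
+    print(f"Processed {sum(sizes)} sentences in {len(sizes)} chunks")
+```
+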
+## Evaluation of Taggers
+
+We evaluate the models for accuracy and macro F1, as well as processing speed.
+
+1. Whole TIGER Corpus (50,472 sentences) - to test speed...
+
+| System | Lemma Acc | POS Acc | POS F1 | sents/sec |
+|-------------------------------|-----------|-------------|------------|-----------|
+| TreeTagger* | 90.62 | 95.24 | 74.35 | 12,618 |
+| SpaCy | 85.33 | 99.07 | 95.84 | 1,577 |
+| **SpaCy + Germalemma** | **90.98** | **99.07** | **95.84** | **1,230** |
+| Turku NLP [CPU] | 78.90 | 94.43 | 70.78 | 151 |
+| RNNTagger* [GPU] | 97.93 | 99.44 | 93.72 | 141 |
+
+\* Because of divergences in lemmas and POS tags, TreeTagger and RNNTagger needed *post-processing* to agree with the DE_GSD gold standard.
+
+One can see that the best performance-speed trade-off is with SpaCy, especially when several CPUs are available.
+
+
+2. TIGER Test (767 sentences)
+
+| System | Lemma Acc | POS Acc | POS F1 |
+|---------------------------|-------------|---------|---------|
+| RNNTagger* | 97.57 | 99.41 | 98.41 |
+| SpaCy + Germalemma | 91.24 | 98.97 | 97.01 |
+| TreeTagger* | 90.21 | 95.42 | 79.73 |
+| Turku NLP | 77.07 | 94.65 | 78.24 |
+
+\* with post-processing applied
+
+3. DE_GSD Test (977 sentences) [Universal Dependencies - CoNLL-18 Dataset](https://universaldependencies.org/conll18/)
+
+| System | Lemma Acc | POS Acc | POS F1 |
+|-----------------------|-----------|---------|--------|
+| RNNTagger* | 93.87 | 95.89 | 82.86 |
+| TreeTagger* | 90.91 | 93.64 | 75.70 |
+| SpaCy + Germalemma | 90.59 | 95.43 | 83.63 |
+| SpaCy | 85.92 | 95.43 | 83.63 |
+| Turku NLP | 81.97 | 97.07 | 86.58 |
+| RNNTagger (original) | 93.87 | 90.41 | 80.97 |
+| TreeTagger (original) | 79.65 | 88.17 | 73.83 |
+
+\* with post-processing applied
+
+
## Custom Train
### Train TurkuNLP Parser
You can follow the instructions for training custom models [here](https://turkunlp.org/Turku-neural-parser-pipeline/training.html)
-
-## Evaluation of Taggers
-
-<< Insert Table of Results HERE >>
-
-
### Train Spacy 2.x Tagger
+* It is possible to train from scratch or fine-tune a POS tagger using the SpaCy API. It is also possible to load custom lemmatizer rules (currently SpaCy only uses a lookup table, which is why adding GermaLemma improved performance).
+
+* To train a SpaCy model in the 2.x version, you can follow the dummy code provided in `spacy_train/custom_spacy_tagger_2x.py`; a condensed sketch is shown below.
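+
+Not the actual repository script, just a condensed sketch of the SpaCy 2.x training loop with a single toy sentence (real training data comes from the converted TIGER files):
+
+```
+import random
+import spacy
+
+# toy example: (text, token-level STTS tags)
+TRAIN_DATA = [
+    ("Das ist ein Satz.", {"tags": ["PDS", "VAFIN", "ART", "NN", "$."]}),
+]
+
+nlp = spacy.blank("de")
+tagger = nlp.create_pipe("tagger")
+for _, annotations in TRAIN_DATA:
+    for tag in annotations["tags"]:
+        tagger.add_label(tag)
+nlp.add_pipe(tagger)
+
+optimizer = nlp.begin_training()
+for _ in range(20):
+    random.shuffle(TRAIN_DATA)
+    losses = {}
+    for text, annotations in TRAIN_DATA:
+        nlp.update([text], [annotations], sgd=optimizer, losses=losses)
+
+nlp.to_disk("tiger_spacy_2x")  # hypothetical output directory
+```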
+
### Train Spacy 3.x Tagger (Transformer)
+* [SpaCy 3.x](https://nightly.spacy.io/usage/v3) is still in BETA, but it will provide a more robust API for training custom models as well as access to all of the models available in the [Hugging Face Transformers library](https://huggingface.co/transformers/pretrained_models.html). More information about this version of SpaCy is available on [their blog](https://explosion.ai/blog/spacy-v3-nightly).
+* This version will also provide a more flexible API for lemmatization; however, this is not implemented yet...
+
+1. To train a POS tagger in SpaCy 3.x, follow these steps (using TIGER as an example, but any data available in CoNLL or SpaCy-JSON format can be used):
+ 1. Convert the input CoNLL file into SpaCy-JSON training format with `spacy_train/conll2spacy.py`:
+
+ ```
+ python DeReKo/spacy_train/conll2spacy.py --corpus_name TigerALL --gld_token_type CoNLLUP_Token \
+ --input_file /vol/work/kupietz/Tiger_2_2/data/german/tiger/train/german_tiger_train.conll \
+ --output_file /path/to/Tiger.train.json \
+ --text_file /path/to/Tiger.train.sents.txt
+ ```
+
+ 2. Convert to SpaCy 3.x [dataset format](https://nightly.spacy.io/api/data-formats#training) file by running:
+ ```
+ python -m spacy convert -c json /path/to/Tiger.train.json out_dir_only/
+ ```
+
+ 3. Create a basic config file following [SpaCy API](https://nightly.spacy.io/usage/training#config-custom)
+ 4. Create the final Config file with:
+ ```
+ python -m spacy init fill-config basic_config.cfg final_config.cfg
+ ```
+ 5. Train the Model using GPU
+ ```
+ python -m spacy train final_config.cfg --output tiger_spacy --verbose --gpu-id 0
+ ```
+
+* More information available at the [SpaCy API webpage](https://nightly.spacy.io/usage/training).
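+
+Once training has finished, the best checkpoint can be loaded like any other SpaCy model (a small sketch; `tiger_spacy/model-best` assumes the `--output tiger_spacy` value used above):
+
+```
+import spacy
+
+# spacy train writes model-best/ and model-last/ into the --output directory
+nlp = spacy.load("tiger_spacy/model-best")
+doc = nlp("Das ist ein Satz.")
+print([(token.text, token.tag_) for token in doc])
+```
+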
## Overall Repository Structure
@@ -185,13 +267,11 @@
* Text files used to process DeReKo: `dereko_all_filenames.txt` and `dereko_mini_test.txt`
* `exec_dereko_parallel.sh`: Main Execution Script for running *N* SpaCy processes on **ALL DeReKo Files**.
-* `explore_dereko.py`: Prints available information of DeReko Directory
-* `turku_client_parser.py` (do I need this still? Must be the same as in systems/... DOUBLE CHECK!)
+* `explore_dereko.py`: Prints available files inside the DeReko Directory
* Directory `spacy_train`:
* `conll2spacy.py`: creates JSON dataset files readable by SpaCy scripts
* `custom_spacy_dereko.py`: Prepares pre-trained vector files to be SpaCy readable
* `custom_spacy_tagger_2x.py`: Trains a POS Tagger using SpaCy 2.x library
- * `custom_spacy_tagger_3x.py`: Trains a POS Tagger using SpaCy 3.x library
* Config Files `*.cfg` used by SpaCy 3.x scripts
#### lib