commit	da9c41126f573bb0cf8d55fdc3126931048a4874
author	Akron <nils@diewald-online.de>	Sat Mar 19 17:51:05 2022 +0100
committer	Akron <nils@diewald-online.de>	Sat Mar 19 17:51:05 2022 +0100
tree	b233d86d1f8ec74e9b87cbd90a73e5264e073e33
parent	93ff869c29a56fc0803a432e0c7d42bcefa5a0a1

Readme.md

Creating the container

To build the Docker image, run

$ docker build -f Dockerfile -t korap/euralex22 .

The resulting image is approximately 6 GB in size.

The build downloads and installs the following tokenizers into the image:

...

To run the evaluation suite ...

...

Running the evaluation suite

To run the benchmark, call

$ docker run --rm -i \
  -v ${PWD}/benchmarks:/euralex/benchmarks \
  -v ${PWD}/corpus:/euralex/corpus \
  korap/euralex22 benchmarks/[BENCHMARK-SCRIPT]

The supported benchmark scripts are:

`benchmark.pl`

Measures the performance of the tools. See the Tools section for some remarks to take into account when interpreting the results. The script accepts two numerical parameters:

The duplication count of the example file
The number of iterations
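
As an illustration, a call with both parameters might look like the following. This is only a sketch: it assumes that the parameters are passed positionally after the script name, in the order listed above, and the values 100 and 10 are arbitrary examples.

```shell
# Hypothetical example values; the positional order is an assumption.
DUPLICATIONS=100   # duplication count of the example file
ITERATIONS=10      # number of iterations

# Assemble the call so that it can be inspected before it is started.
set -- docker run --rm -i \
  -v "${PWD}/benchmarks:/euralex/benchmarks" \
  -v "${PWD}/corpus:/euralex/corpus" \
  korap/euralex22 benchmarks/benchmark.pl "$DUPLICATIONS" "$ITERATIONS"

echo "$@"   # dry run: prints the assembled call
# "$@"      # uncomment to actually start the container
```

Printing the call first makes it easy to verify the mounts and parameters before starting a potentially long benchmark run.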

`empirist.pl`

To run the EmpiriST evaluation suite, first download the EmpiriST gold standard corpus and tooling, and extract them into the corpus directory.

$ wget https://sites.google.com/site/empirist2015/home/shared-task-data/empirist_gold_cmc.zip
$ unzip empirist_gold_cmc.zip -d corpus

$ wget https://sites.google.com/site/empirist2015/home/shared-task-data/empirist_gold_web.zip
$ unzip empirist_gold_web.zip -d corpus

The script reports quality measurements based on EmpiriST 2015.

To inspect the output, start the benchmark with mounted output folders by adding the following options to the docker run call:

-v ${PWD}/output_cmc:/euralex/empirist_cmc
-v ${PWD}/output_web:/euralex/empirist_web
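
Combined with the general benchmark call shown earlier, the complete command might look like the following sketch. It assumes that `empirist.pl` needs no further arguments and that the output folders are created in the current directory.

```shell
# Assemble the base call plus the two output mounts.
set -- docker run --rm -i \
  -v "${PWD}/benchmarks:/euralex/benchmarks" \
  -v "${PWD}/corpus:/euralex/corpus" \
  -v "${PWD}/output_cmc:/euralex/empirist_cmc" \
  -v "${PWD}/output_web:/euralex/empirist_web" \
  korap/euralex22 benchmarks/empirist.pl

echo "$@"   # dry run: prints the assembled call
# "$@"      # uncomment to actually start the container
```

It can help to create the two host folders beforehand (`mkdir -p output_cmc output_web`), so that they exist before the container writes to them.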

`ud_tokens.pl`

To run the token evaluation suite against the Universal Dependencies corpus, first install the EmpiriST tooling as explained above, then download the corpus.

$ wget https://github.com/UniversalDependencies/UD_German-GSD/raw/master/de_gsd-ud-train.conllu \
  -O corpus/de_gsd-ud-train.conllu

`ud_sentences.pl`

To run the sentence evaluation suite, first download the corpus as explained above.
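
Since both UD scripts depend on the downloaded corpus file, a small check before starting the container can save a failed run. The helper below is a hypothetical convenience and not part of the repository; it only tests for the file name used in the download step above.

```shell
# Hypothetical helper: report whether the UD corpus file is present.
# $1: corpus directory (defaults to ./corpus)
ud_corpus_ready() {
  file="${1:-corpus}/de_gsd-ud-train.conllu"
  if [ -s "$file" ]; then
    # File exists and is not empty
    echo "found: $file"
  else
    echo "missing: $file" >&2
    return 1
  fi
}

ud_corpus_ready corpus || echo "download the corpus before running ud_sentences.pl"
```

The check uses `-s` rather than `-f`, so an empty file left behind by an interrupted download is also reported as missing.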

Tools

Waste

Tokenization

OpenNLP

Tokenization

TreeTagger

Tokenization

JTok

Tokenization

SynTok

Tokenization

SoMaJo

Tokenization

Stanford CoreNLP

Tokenization

All tools are run using pipelining, which introduces some overhead that needs to be taken into account.

KorAP-Tokenizer

Tokenization + Sentence Splitting

Datok

Tokenization + Sentence Splitting

Licenses

For TreeTagger: Please read the license terms before you download the software! By downloading the software, you agree to the terms stated there.

Caveat

When running this benchmark using Docker, you may need to run all processes privileged to get meaningful results.

docker run --privileged -v