commit	a5c191f4216b69803fc05067e3ba3415b4b162c8	[log] [tgz]
author	Marc Kupietz <kupietz@ids-mannheim.de>	Thu Dec 01 13:30:12 2022 +0100
committer	Marc Kupietz <kupietz@ids-mannheim.de>	Thu Dec 01 13:30:12 2022 +0100
tree	e793289e22774ccd542376aa7d422003b23c2608
parent	b6f601f7a9191b457e55f639cd0dcef7ce782fee [diff]

tree: e793289e22774ccd542376aa7d422003b23c2608

Readme.md

totalngrams

Package for effectively processing token lists from very large corpora in tab separated value format, by making full use of multicore-processors.

An older version of totalngrams was used for Koplenig et al. (2022).

Synopsis

totalngrams [-dhlNSV] [--force] [--pad] [-f=<fold>] [-F=<FOLDS>]
                   [-L=<logFileName>] [-n=<ngram_size>] [-o=<output_fillename>]
                   [-p=<worker_pool_specification>] [-P=<max_threads>]
                   <inputFiles>...
sum ngram counts from KorAP-XML, CoNLL-U files and frequency lists
      <inputFiles>...    input files
  -d, --downcase         Convert all token characters into lower case (default:
                           false)
  -f, --fold=<fold>      current fold (default: 1)
  -F, --folds=<FOLDS>    number of random folds (default: 1)
      --force            Force overwrite (default: false)
  -h, --help             Show this help message and exit.
  -l, --with-lemma-pos   Use also lemma and part-of-speech annotations
                           (default: false
  -L, --log-file=<logFileName>
                         log file name (default: totalngrams.log)
  -n, --ngram-size=<ngram_size>
                         n-gram size (default: 1)
  -N, --numeric-secondary-sort
                         Sort entries with same frequency numerically
                           (default: false)
  -o, --output-file=<output_fillename>
                         Output file (default: -)
  -p, --worker-pool=<worker_pool_specification>
                         Run preprocessing on extern hosts, e.g. '10*local,
                           5*host1,3*smith@host2' (default: )
  -P, --max-procs=<max_threads>
                         Run up to max-procs processes at a time (default: 6)
      --pad              Add padding «START» and «END» symbols at text edges
                           (default: false)
  -S, --sort             Toggle output sorting (default: true)
  -V, --version          Print version information and exit.

Example usage

FOLDS=16
BASE="."

for l in "-l"; do #  "-l"
  for n in $(seq 1 2 3); do
    for f in $(seq 1 $FOLDS); do
      totalngrams\
        --pad \
        -P 79 \
        -n $n \
        -f $f \
        -F $FOLDS \
        $l -o "$BASE/paddedlemmaposfreq/$n-gram-token$l-freqs.$f.tsv.xz" $BASE/conllu/*.conllu.gz
    done
  done
done

Sampling into Folds

The deterministic pseudo-random sampling into folds is based on the cryptographic hash algorithm BLAKE2b (Aumasson et al. 2014)

Scripts

The package also contains some groovy scripts for handling pseudonymization tasks, i.e. replacing each token or lemma with a corresponding number according to separate key files.

You can run the groovy scripts directly, if you have installed groovy or from the totalngrams jar, otjherwise.

GeneratePseudonymKey

Example usage

./src/main/groovy/org/ids_mannheim/GeneratePseudonymKey.groovy -h

or:

generate_pseudonym_key -c 0 1-gram-token-l-freqs.*.tsv.xz | xz -T0 > token_key.tsv.xz

Pseudonymize

Example usage

pseudonymize -d /tmp -k tokens_key.tsv.xz -k lemma_key.tsv.xz  *-gram-token-l-freqs.*.tsv.xz

FilterKeys

Example usage

filter_keys -k token_keys.tsv.xz -k lemma_keys.tsv.xz 1-gram-token-l-freqs.*.tsv.xz

Installation

Prerequisites

Install

git clone "https://korap.ids-mannheim.de/gerrit/IDS-Mannheim/totalngrams"
cd totalngrams
mvn install
export PATH=`pwd`/appassembler/bin:$PATH

References

Koplenig, Alexander/Kupietz, Marc/Wolfer, Sascha (2022): Testing the relationship between word length, frequency, and predictability based on the German Reference Corpus. Cognitive Science 46(6)
Aumasson, Jean-Philippe/Meier, Willi/Phan, Raphael C-W/Henzen, Luca (2014): BLAKE2. In: The Hash Function BLAKE. Springer. p. 165–183.