commit	0c778db1638c257478440c0a728d3d1e3bf0ba32
author	Marc Kupietz <kupietz@ids-mannheim.de>	Wed Dec 21 21:08:03 2022 +0100
committer	Marc Kupietz <kupietz@ids-mannheim.de>	Wed Dec 21 21:08:03 2022 +0100
tree	de3399cb596754de67965a0c068d0fcd8df3ace6
parent	73397d808ed666dcba649c2be9a27a8567b590a1


totalngrams

Package for efficiently processing token lists from very large corpora in tab-separated value format by making full use of multi-core processors.

An older version of totalngrams was used for Koplenig et al. (2022).

Synopsis

totalngrams [-dhlNSV] [--force] [--pad] [-f=<fold>] [-F=<FOLDS>]
                   [-L=<logFileName>] [-n=<ngram_size>] [-o=<output_fillename>]
                   [-p=<worker_pool_specification>] [-P=<max_threads>]
                   <inputFiles>...
sum ngram counts from KorAP-XML, CoNLL-U files and frequency lists
      <inputFiles>...    input files
  -d, --downcase         Convert all token characters into lower case (default:
                           false)
      --exclude-punctuation
                         Ignore all tokens tagged as punctuation (according to
                           STTS tags set, i.e. starting with '$') (default:
                           false)
  -f, --fold=<fold>      current fold (default: 1)
  -F, --folds=<FOLDS>    number of random folds (default: 1)
      --force            Force overwrite (default: false)
  -h, --help             Show this help message and exit.
  -l, --with-lemma-pos   Use also lemma and part-of-speech annotations
                           (default: false)
  -L, --log-file=<logFileName>
                         log file name (default: totalngrams.log)
  -n, --ngram-size=<ngram_size>
                         n-gram size (default: 1)
  -N, --numeric-secondary-sort
                         Sort entries with same frequency numerically
                           (default: false)
  -o, --output-file=<output_fillename>
                         Output file (default: -)
  -p, --worker-pool=<worker_pool_specification>
                         Run preprocessing on extern hosts, e.g. '10*local,
                           5*host1,3*smith@host2' (default: )
  -P, --max-procs=<max_threads>
                         Run up to max-procs processes at a time (default: 6)
      --pad              Add padding «START» and «END» symbols at text edges
                           (default: false)
  -S, --sort             Toggle output sorting (default: true)
  -V, --version          Print version information and exit.
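
To make the effect of `-n`, `-d` and `--pad` more concrete, here is a minimal Python sketch of n-gram counting over tokenized texts. It is not the tool's implementation: the function name `ngram_counts` is hypothetical, and the sketch assumes that padding adds exactly one «START» symbol before and one «END» symbol after each text, which the help text above does not specify.

```python
from collections import Counter
from typing import Iterable, List, Tuple

START, END = "«START»", "«END»"


def ngram_counts(texts: Iterable[List[str]], n: int = 1,
                 pad: bool = False, downcase: bool = False) -> Counter:
    """Count n-grams over tokenized texts (illustrative sketch only)."""
    counts: Counter = Counter()
    for tokens in texts:
        toks = [t.lower() for t in tokens] if downcase else list(tokens)
        if pad:
            # Assumption: one padding symbol at each text edge.
            toks = [START] + toks + [END]
        # Slide a window of size n over the (possibly padded) token list.
        for i in range(len(toks) - n + 1):
            ngram: Tuple[str, ...] = tuple(toks[i:i + n])
            counts[ngram] += 1
    return counts


texts = [["Der", "Hund", "bellt"], ["Der", "Hund", "schläft"]]
bigrams = ngram_counts(texts, n=2, pad=True)

# Print as tab-separated lines, most frequent first, ties sorted by n-gram.
for ngram, freq in sorted(bigrams.items(), key=lambda kv: (-kv[1], kv[0])):
    print("\t".join(ngram), freq, sep="\t")
```

The output column layout here is chosen for illustration only and is not meant to match the tool's actual output format.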

Example usage

FOLDS=16
BASE="."

# Only the lemma/POS variant ("-l") is run here.
for l in "-l"; do
  for n in $(seq 1 2 3); do  # seq 1 2 3 steps by 2, i.e. n = 1 and n = 3
    for f in $(seq 1 "$FOLDS"); do
      totalngrams \
        --pad \
        -P 79 \
        -n "$n" \
        -f "$f" \
        -F "$FOLDS" \
        $l -o "$BASE/paddedlemmaposfreq/$n-gram-token$l-freqs.$f.tsv.xz" "$BASE"/conllu/*.conllu.gz
    done
  done
done

Sampling into Folds

The deterministic pseudo-random sampling into folds is based on the cryptographic hash algorithm BLAKE2b (Aumasson et al. 2014).
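
The general idea of hash-based fold assignment can be sketched in Python as follows. This is an illustration under assumptions, not the package's code: the function name `fold_of`, the choice of a text identifier as hash input, the 8-byte digest size and the modulo mapping are all hypothetical. Only the use of BLAKE2b is taken from the description above.

```python
import hashlib


def fold_of(text_id: str, folds: int) -> int:
    """Map an identifier to a fold in 1..folds (illustrative sketch only)."""
    # Hash the identifier with BLAKE2b; 8 bytes are enough for a sketch.
    digest = hashlib.blake2b(text_id.encode("utf-8"), digest_size=8).digest()
    # Interpret the digest as an integer and reduce it to the fold range.
    return int.from_bytes(digest, "big") % folds + 1


# The same identifier always lands in the same fold, on every run and host.
for text_id in ["text-0001", "text-0002", "text-0003"]:  # hypothetical IDs
    print(text_id, fold_of(text_id, 16))
```

Because the assignment depends only on the hashed input, repeated runs with the same `-F` value would select the same texts for a given `-f`, without any shared random state between worker processes.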

Scripts

The package also contains some Groovy scripts for pseudonymization tasks, i.e. for replacing each token or lemma with a corresponding number according to separate key files.
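
As a rough illustration of what such a key-based replacement amounts to, consider the following Python sketch. The helper names `build_key` and `pseudonymize_tokens`, the numbering scheme (consecutive integers from 0 in order of first appearance) and the in-memory dictionary are assumptions made for this example. The actual key file format and numbering used by the scripts are not described here.

```python
from typing import Dict, Iterable, List


def build_key(tokens: Iterable[str]) -> Dict[str, int]:
    """Assign each distinct token a number in order of first appearance."""
    key: Dict[str, int] = {}
    for token in tokens:
        # setdefault keeps the number of a token that was already seen.
        key.setdefault(token, len(key))
    return key


def pseudonymize_tokens(tokens: Iterable[str], key: Dict[str, int]) -> List[int]:
    """Replace every token with its number; unknown tokens raise KeyError."""
    return [key[token] for token in tokens]


key = build_key(["der", "Hund", "der", "bellt"])
print(key)
print(pseudonymize_tokens(["Hund", "der"], key))
```

Keeping separate keys for tokens and lemmas, as the description above suggests, would simply mean building one such mapping per annotation layer.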

You can run the Groovy scripts directly if you have Groovy installed, or otherwise from the totalngrams jar.

GeneratePseudonymKey

Example usage

./src/main/groovy/org/ids_mannheim/GeneratePseudonymKey.groovy -h

or:

generate_pseudonym_key -c 0 1-gram-token-l-freqs.*.tsv.xz | xz -T0 > token_key.tsv.xz

Pseudonymize

Example usage

pseudonymize -d /tmp -k token_key.tsv.xz -k lemma_key.tsv.xz *-gram-token-l-freqs.*.tsv.xz

FilterKeys

Example usage

filter_keys -k token_key.tsv.xz -k lemma_key.tsv.xz 1-gram-token-l-freqs.*.tsv.xz

Installation

Prerequisites

Install

git clone "https://korap.ids-mannheim.de/gerrit/IDS-Mannheim/totalngrams"
cd totalngrams
mvn install
export PATH="$(pwd)/appassembler/bin:$PATH"

References

Koplenig, Alexander/Kupietz, Marc/Wolfer, Sascha (2022): Testing the relationship between word length, frequency, and predictability based on the German Reference Corpus. Cognitive Science 46(6).
Aumasson, Jean-Philippe/Meier, Willi/Phan, Raphael C-W/Henzen, Luca (2014): BLAKE2. In: The Hash Function BLAKE. Springer. pp. 165–183.