totalngrams

Package for effectively processing token lists from very large corpora in tab separated value format, by making full use of multicore-processors.

An older version of totalngrams was used for Koplenig et al. (2022).

Requirements

Java 17 or higher

Install from binary distribution

unzip totalngrams-2.2.1-bin.zip
export PATH=`pwd`/totalngrams-2.2.1/bin:$PATH

Synopsis

totalngrams [-dhlNSV] [--force] [--pad] [-f=<fold>] [-F=<FOLDS>]
                   [-L=<logFileName>] [-n=<ngram_size>] [-o=<output_fillename>]
                   [-p=<worker_pool_specification>] [-P=<max_threads>]
                   <inputFiles>...
sum ngram counts from KorAP-XML, CoNLL-U files and frequency lists
      <inputFiles>...    input files
  -d, --downcase         Convert all token characters into lower case (default:
                           false)
      --exclude-punctuation
                         Ignore all tokens tagged as punctuation (according to
                           STTS tags set, i.e. starting with '$') (default:
                           false)
  -f, --fold=<fold>      current fold (default: 1)
  -F, --folds=<FOLDS>    number of random folds (default: 1)
      --force            Force overwrite (default: false)
  -h, --help             Show this help message and exit.
  -l, --with-lemma-pos   Use also lemma and part-of-speech annotations
                           (default: false
  -L, --log-file=<logFileName>
                         log file name (default: totalngrams.log)
  -n, --ngram-size=<ngram_size>
                         n-gram size (default: 1)
  -N, --numeric-secondary-sort
                         Sort entries with same frequency numerically
                           (default: false)
  -o, --output-file=<output_fillename>
                         Output file (default: -)
  -p, --worker-pool=<worker_pool_specification>
                         Run preprocessing on extern hosts, e.g. '10*local,
                           5*host1,3*smith@host2' (default: )
  -P, --max-procs=<max_threads>
                         Run up to max-procs processes at a time (default: 6)
      --pad              Add padding «START» and «END» symbols at text edges
                           (default: false)
  -S, --sort             Toggle output sorting (default: true)
  -V, --version          Print version information and exit.

Example usage

FOLDS=16
BASE="."

for l in "-l"; do #  "-l"
  for n in $(seq 1 2 3); do
    for f in $(seq 1 $FOLDS); do
      totalngrams\
        --pad \
        -P 79 \
        -n $n \
        -f $f \
        -F $FOLDS \
        $l -o "$BASE/paddedlemmaposfreq/$n-gram-token$l-freqs.$f.tsv.xz" $BASE/conllu/*.conllu.gz
    done
  done
done

Sampling into Folds

The deterministic pseudo-random sampling into folds is based on the cryptographic hash algorithm BLAKE2b (Aumasson et al. 2014)

Scripts

The package also contains some groovy scripts for handling pseudonymization tasks, i.e. replacing each token or lemma with a corresponding number according to separate key files.

You can run the groovy scripts directly, if you have installed groovy or from the totalngrams jar, otjherwise.

GeneratePseudonymKey

Example usage

./src/main/groovy/org/ids_mannheim/GeneratePseudonymKey.groovy -h

or:

generate_pseudonym_key -c 0 1-gram-token-l-freqs.*.tsv.xz | xz -T0 > token_key.tsv.xz

Pseudonymize

Example usage

pseudonymize -d /tmp -k tokens_key.tsv.xz -k lemma_key.tsv.xz  *-gram-token-l-freqs.*.tsv.xz

FilterKeys

Example usage

filter_keys -k token_keys.tsv.xz -k lemma_keys.tsv.xz 1-gram-token-l-freqs.*.tsv.xz

Installation from source

Prerequisites

Build and install from git

git clone "https://korap.ids-mannheim.de/gerrit/IDS-Mannheim/totalngrams"
cd totalngrams
mvn install
export PATH=`pwd`/target/bin:$PATH

Changes

See Changelog.

Development and License

Authors:

Marc Kupietz

This package is published under the Apache 2.0 License.

Contributions

Contributions are very welcome!

Your contributions should ideally be committed via our Gerrit server to facilitate reviewing (see Gerrit Code Review - A Quick Introduction if you are not familiar with Gerrit).

References

Koplenig, Alexander/Kupietz, Marc/Wolfer, Sascha (2022): Testing the relationship between word length, frequency, and predictability based on the German Reference Corpus. Cognitive Science 46(6)
Aumasson, Jean-Philippe/Meier, Willi/Phan, Raphael C-W/Henzen, Luca (2014): BLAKE2. In: The Hash Function BLAKE. Springer. p. 165–183.