Package for efficiently processing token lists from very large corpora in tab-separated-value format, making full use of multicore processors.
An older version of totalngrams
was used for Koplenig et al. (2022).
```sh
unzip totalngrams-2.3.0-bin.zip
export PATH=`pwd`/totalngrams-2.3.0/bin:$PATH
```
```
totalngrams [-dhlNSV] [--force] [--pad] [-f=<fold>] [-F=<FOLDS>] [-L=<logFileName>]
            [-n=<ngram_size>] [-o=<output_filename>] [-p=<worker_pool_specification>]
            [-P=<max_threads>] <inputFiles>...

Sum ngram counts from KorAP-XML, CoNLL-U files and frequency lists.

      <inputFiles>...       input files
  -d, --downcase            Convert all token characters into lower case (default: false)
      --exclude-punctuation Ignore all tokens tagged as punctuation (according to the STTS
                              tag set, i.e. starting with '$') (default: false)
  -f, --fold=<fold>         current fold (default: 1)
  -F, --folds=<FOLDS>       number of random folds (default: 1)
      --force               Force overwrite (default: false)
  -h, --help                Show this help message and exit.
  -l, --with-lemma-pos      Also use lemma and part-of-speech annotations (default: false)
  -L, --log-file=<logFileName>
                            log file name (default: totalngrams.log)
  -n, --ngram-size=<ngram_size>
                            n-gram size (default: 1)
  -N, --numeric-secondary-sort
                            Sort entries with the same frequency numerically (default: false)
  -o, --output-file=<output_filename>
                            Output file (default: -)
  -p, --worker-pool=<worker_pool_specification>
                            Run preprocessing on external hosts,
                              e.g. '10*local, 5*host1, 3*smith@host2' (default: )
  -P, --max-procs=<max_threads>
                            Run up to max-procs processes at a time (default: 6)
      --pad                 Add padding «START» and «END» symbols at text edges (default: false)
  -S, --sort                Toggle output sorting (default: true)
  -V, --version             Print version information and exit.
```
```sh
FOLDS=16
BASE="."
for l in "-l"; do
    for n in $(seq 1 3); do
        for f in $(seq 1 $FOLDS); do
            totalngrams \
                --pad \
                -P 79 \
                -n $n \
                -f $f \
                -F $FOLDS \
                --exclude-empty-texts \
                $l -o "$BASE/paddedlemmaposfreq/$n-gram-token$l-freqs.$f.tsv.xz" \
                $BASE/conllu/*.conllu.gz
        done
    done
done
```
The deterministic pseudo-random sampling into folds is based on the cryptographic hash algorithm BLAKE2b (Aumasson et al. 2014).
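The idea behind hash-based fold sampling can be illustrated with a minimal sketch (this is an illustration, not the package's actual implementation; the function name, the choice of digest width, and hashing the text identifier are assumptions):

```python
import hashlib


def fold_of(text_id: str, folds: int = 16) -> int:
    """Deterministically assign a text to one of `folds` folds.

    Hashing a stable identifier with BLAKE2b yields a pseudo-random but
    reproducible fold assignment: the same text always lands in the same
    fold, independent of processing order or machine.
    """
    # 8-byte digest is plenty for a fold index; interpret it as an integer.
    digest = hashlib.blake2b(text_id.encode("utf-8"), digest_size=8).digest()
    # Folds are numbered 1..FOLDS, matching the tool's -f/--fold convention.
    return int.from_bytes(digest, "big") % folds + 1


# The same identifier always maps to the same fold:
assert fold_of("corpus/doc-0001") == fold_of("corpus/doc-0001")
```

Because the assignment depends only on the identifier, each of the `-f 1` … `-f $FOLDS` runs in the loop above processes a disjoint, reproducible subset of the corpus.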
The package also contains some Groovy scripts for handling pseudonymization tasks, i.e. replacing each token or lemma with a corresponding number according to separate key files.
You can run the Groovy scripts directly if you have Groovy installed; otherwise, run them from the totalngrams jar.
./src/main/groovy/org/ids_mannheim/GeneratePseudonymKey.groovy -h
or:
generate_pseudonym_key -c 0 1-gram-token-l-freqs.*.tsv.xz | xz -T0 > token_key.tsv.xz
pseudonymize -d /tmp -k token_key.tsv.xz -k lemma_key.tsv.xz *-gram-token-l-freqs.*.tsv.xz
filter_keys -k token_key.tsv.xz -k lemma_key.tsv.xz 1-gram-token-l-freqs.*.tsv.xz
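The pseudonymization scheme described above can be sketched as follows (a simplified illustration, not the Groovy scripts' actual code; the key layout, numbering-by-first-appearance, and function names are assumptions):

```python
def build_key(tokens):
    """Hypothetical key construction: number distinct tokens in order of
    first appearance. In practice the key would be written to a separate
    key file (e.g. token_key.tsv.xz) so it can be reused and filtered."""
    key = {}
    for t in tokens:
        key.setdefault(t, len(key))
    return key


def pseudonymize(tokens, key):
    """Replace each token with its corresponding number from the key,
    so the frequency lists no longer expose the surface forms."""
    return [key[t] for t in tokens]


tokens = ["der", "Hund", "der", "Katze"]
key = build_key(tokens)
print(pseudonymize(tokens, key))  # → [0, 1, 0, 2]
```

Keeping the token and lemma keys in separate files means the mapping can be stored apart from the pseudonymized frequency lists, and `filter_keys` can later restrict a key to the entries actually occurring in a given list.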
```sh
git clone "https://korap.ids-mannheim.de/gerrit/IDS-Mannheim/totalngrams"
cd totalngrams
mvn install
export PATH=`pwd`/target/bin:$PATH
```
See Changelog.
Authors:
Copyright (c) 2023, Leibniz Institute for the German Language, Mannheim, Germany
This package is published under the Apache 2.0 License.
Contributions are very welcome!
Your contributions should ideally be committed via our Gerrit server to facilitate reviewing (see Gerrit Code Review - A Quick Introduction if you are not familiar with Gerrit).
Koplenig, Alexander/Kupietz, Marc/Wolfer, Sascha (2022): Testing the relationship between word length, frequency, and predictability based on the German Reference Corpus. Cognitive Science 46(6).
Aumasson, Jean-Philippe/Meier, Willi/Phan, Raphael C-W/Henzen, Luca (2014): BLAKE2. In: The Hash Function BLAKE. Springer. p. 165–183.