commit | a5c191f4216b69803fc05067e3ba3415b4b162c8 | [log] [tgz] |
---|---|---|
author | Marc Kupietz <kupietz@ids-mannheim.de> | Thu Dec 01 13:30:12 2022 +0100 |
committer | Marc Kupietz <kupietz@ids-mannheim.de> | Thu Dec 01 13:30:12 2022 +0100 |
tree | e793289e22774ccd542376aa7d422003b23c2608 | |
parent | b6f601f7a9191b457e55f639cd0dcef7ce782fee [diff] |
Add BLAKE2b reference to Readme Change-Id: Ifc7a0ecfc5c95147238ede97a605c156a12c97b1
Package for effectively processing token lists from very large corpora in tab separated value format, by making full use of multicore-processors.
An older version of totalngrams
was used for Koplenig et al. (2022).
totalngrams [-dhlNSV] [--force] [--pad] [-f=<fold>] [-F=<FOLDS>] [-L=<logFileName>] [-n=<ngram_size>] [-o=<output_fillename>] [-p=<worker_pool_specification>] [-P=<max_threads>] <inputFiles>... sum ngram counts from KorAP-XML, CoNLL-U files and frequency lists <inputFiles>... input files -d, --downcase Convert all token characters into lower case (default: false) -f, --fold=<fold> current fold (default: 1) -F, --folds=<FOLDS> number of random folds (default: 1) --force Force overwrite (default: false) -h, --help Show this help message and exit. -l, --with-lemma-pos Use also lemma and part-of-speech annotations (default: false -L, --log-file=<logFileName> log file name (default: totalngrams.log) -n, --ngram-size=<ngram_size> n-gram size (default: 1) -N, --numeric-secondary-sort Sort entries with same frequency numerically (default: false) -o, --output-file=<output_fillename> Output file (default: -) -p, --worker-pool=<worker_pool_specification> Run preprocessing on extern hosts, e.g. '10*local, 5*host1,3*smith@host2' (default: ) -P, --max-procs=<max_threads> Run up to max-procs processes at a time (default: 6) --pad Add padding «START» and «END» symbols at text edges (default: false) -S, --sort Toggle output sorting (default: true) -V, --version Print version information and exit.
FOLDS=16 BASE="." for l in "-l"; do # "-l" for n in $(seq 1 2 3); do for f in $(seq 1 $FOLDS); do totalngrams\ --pad \ -P 79 \ -n $n \ -f $f \ -F $FOLDS \ $l -o "$BASE/paddedlemmaposfreq/$n-gram-token$l-freqs.$f.tsv.xz" $BASE/conllu/*.conllu.gz done done done
The deterministic pseudo-random sampling into folds is based on the cryptographic hash algorithm BLAKE2b (Aumasson et al. 2014)
The package also contains some groovy scripts for handling pseudonymization tasks, i.e. replacing each token or lemma with a corresponding number according to separate key files.
You can run the groovy scripts directly, if you have installed groovy or from the totalngrams jar, otjherwise.
./src/main/groovy/org/ids_mannheim/GeneratePseudonymKey.groovy -h
or:
generate_pseudonym_key -c 0 1-gram-token-l-freqs.*.tsv.xz | xz -T0 > token_key.tsv.xz
pseudonymize -d /tmp -k tokens_key.tsv.xz -k lemma_key.tsv.xz *-gram-token-l-freqs.*.tsv.xz
filter_keys -k token_keys.tsv.xz -k lemma_keys.tsv.xz 1-gram-token-l-freqs.*.tsv.xz
git clone "https://korap.ids-mannheim.de/gerrit/IDS-Mannheim/totalngrams" cd totalngrams mvn install export PATH=`pwd`/appassembler/bin:$PATH
Koplenig, Alexander/Kupietz, Marc/Wolfer, Sascha (2022): Testing the relationship between word length, frequency, and predictability based on the German Reference Corpus. Cognitive Science 46(6)
Aumasson, Jean-Philippe/Meier, Willi/Phan, Raphael C-W/Henzen, Luca (2014): BLAKE2. In: The Hash Function BLAKE. Springer. p. 165–183.