commit | 1e09a23b4b246646212d785b118561c603378cf5 | [log] [tgz] |
---|---|---|
author | Marc Kupietz <kupietz@ids-mannheim.de> | Thu Dec 01 13:31:28 2022 +0100 |
committer | Marc Kupietz <kupietz@ids-mannheim.de> | Thu Dec 01 13:31:28 2022 +0100 |
tree | cc696e1479adc34ff74a346634764ea1629a6f3a | |
parent | a5c191f4216b69803fc05067e3ba3415b4b162c8 [diff] |
Cosmetic code changes Change-Id: I0e4218de3657890b43afec95596b9ddcdb5fef60
Package for effectively processing token lists from very large corpora in tab separated value format, by making full use of multicore-processors.
An older version of totalngrams
was used for Koplenig et al. (2022).
totalngrams [-dhlNSV] [--force] [--pad] [-f=<fold>] [-F=<FOLDS>] [-L=<logFileName>] [-n=<ngram_size>] [-o=<output_fillename>] [-p=<worker_pool_specification>] [-P=<max_threads>] <inputFiles>... sum ngram counts from KorAP-XML, CoNLL-U files and frequency lists <inputFiles>... input files -d, --downcase Convert all token characters into lower case (default: false) -f, --fold=<fold> current fold (default: 1) -F, --folds=<FOLDS> number of random folds (default: 1) --force Force overwrite (default: false) -h, --help Show this help message and exit. -l, --with-lemma-pos Use also lemma and part-of-speech annotations (default: false -L, --log-file=<logFileName> log file name (default: totalngrams.log) -n, --ngram-size=<ngram_size> n-gram size (default: 1) -N, --numeric-secondary-sort Sort entries with same frequency numerically (default: false) -o, --output-file=<output_fillename> Output file (default: -) -p, --worker-pool=<worker_pool_specification> Run preprocessing on extern hosts, e.g. '10*local, 5*host1,3*smith@host2' (default: ) -P, --max-procs=<max_threads> Run up to max-procs processes at a time (default: 6) --pad Add padding «START» and «END» symbols at text edges (default: false) -S, --sort Toggle output sorting (default: true) -V, --version Print version information and exit.
FOLDS=16 BASE="." for l in "-l"; do # "-l" for n in $(seq 1 2 3); do for f in $(seq 1 $FOLDS); do totalngrams\ --pad \ -P 79 \ -n $n \ -f $f \ -F $FOLDS \ $l -o "$BASE/paddedlemmaposfreq/$n-gram-token$l-freqs.$f.tsv.xz" $BASE/conllu/*.conllu.gz done done done
The deterministic pseudo-random sampling into folds is based on the cryptographic hash algorithm BLAKE2b (Aumasson et al. 2014)
The package also contains some groovy scripts for handling pseudonymization tasks, i.e. replacing each token or lemma with a corresponding number according to separate key files.
You can run the groovy scripts directly, if you have installed groovy or from the totalngrams jar, otjherwise.
./src/main/groovy/org/ids_mannheim/GeneratePseudonymKey.groovy -h
or:
generate_pseudonym_key -c 0 1-gram-token-l-freqs.*.tsv.xz | xz -T0 > token_key.tsv.xz
pseudonymize -d /tmp -k tokens_key.tsv.xz -k lemma_key.tsv.xz *-gram-token-l-freqs.*.tsv.xz
filter_keys -k token_keys.tsv.xz -k lemma_keys.tsv.xz 1-gram-token-l-freqs.*.tsv.xz
git clone "https://korap.ids-mannheim.de/gerrit/IDS-Mannheim/totalngrams" cd totalngrams mvn install export PATH=`pwd`/appassembler/bin:$PATH
Koplenig, Alexander/Kupietz, Marc/Wolfer, Sascha (2022): Testing the relationship between word length, frequency, and predictability based on the German Reference Corpus. Cognitive Science 46(6)
Aumasson, Jean-Philippe/Meier, Willi/Phan, Raphael C-W/Henzen, Luca (2014): BLAKE2. In: The Hash Function BLAKE. Springer. p. 165–183.