commit | 545bb919879b8d4a130eb1040efa5443e18a63fe | [log] [tgz] |
---|---|---|
author | Marc Kupietz <kupietz@ids-mannheim.de> | Thu Dec 08 08:05:59 2022 +0100 |
committer | Marc Kupietz <kupietz@ids-mannheim.de> | Thu Dec 08 10:25:28 2022 +0100 |
tree | d2e6a1f1b2b54fc6cf7e02f7c47bc2f7f6ba67bc | |
parent | 575ab1cfa7b5ed38f8e50881ef06398100c5183c [diff] |
Extend isPunctuation unit test Change-Id: I670a7679e774e23aa914230bf2b72df2a3bd839e
Package for effectively processing token lists from very large corpora in tab separated value format, by making full use of multicore-processors.
An older version of totalngrams
was used for Koplenig et al. (2022).
totalngrams [-dhlNSV] [--force] [--pad] [-f=<fold>] [-F=<FOLDS>] [-L=<logFileName>] [-n=<ngram_size>] [-o=<output_fillename>] [-p=<worker_pool_specification>] [-P=<max_threads>] <inputFiles>... sum ngram counts from KorAP-XML, CoNLL-U files and frequency lists <inputFiles>... input files -d, --downcase Convert all token characters into lower case (default: false) --exclude-punctuation Ignore all tokens tagged as punctuation (according to STTS tags set, i.e. starting with '$') (default: false) -f, --fold=<fold> current fold (default: 1) -F, --folds=<FOLDS> number of random folds (default: 1) --force Force overwrite (default: false) -h, --help Show this help message and exit. -l, --with-lemma-pos Use also lemma and part-of-speech annotations (default: false -L, --log-file=<logFileName> log file name (default: totalngrams.log) -n, --ngram-size=<ngram_size> n-gram size (default: 1) -N, --numeric-secondary-sort Sort entries with same frequency numerically (default: false) -o, --output-file=<output_fillename> Output file (default: -) -p, --worker-pool=<worker_pool_specification> Run preprocessing on extern hosts, e.g. '10*local, 5*host1,3*smith@host2' (default: ) -P, --max-procs=<max_threads> Run up to max-procs processes at a time (default: 6) --pad Add padding «START» and «END» symbols at text edges (default: false) -S, --sort Toggle output sorting (default: true) -V, --version Print version information and exit.
FOLDS=16 BASE="." for l in "-l"; do # "-l" for n in $(seq 1 2 3); do for f in $(seq 1 $FOLDS); do totalngrams\ --pad \ -P 79 \ -n $n \ -f $f \ -F $FOLDS \ $l -o "$BASE/paddedlemmaposfreq/$n-gram-token$l-freqs.$f.tsv.xz" $BASE/conllu/*.conllu.gz done done done
The deterministic pseudo-random sampling into folds is based on the cryptographic hash algorithm BLAKE2b (Aumasson et al. 2014)
The package also contains some groovy scripts for handling pseudonymization tasks, i.e. replacing each token or lemma with a corresponding number according to separate key files.
You can run the groovy scripts directly, if you have installed groovy or from the totalngrams jar, otjherwise.
./src/main/groovy/org/ids_mannheim/GeneratePseudonymKey.groovy -h
or:
generate_pseudonym_key -c 0 1-gram-token-l-freqs.*.tsv.xz | xz -T0 > token_key.tsv.xz
pseudonymize -d /tmp -k tokens_key.tsv.xz -k lemma_key.tsv.xz *-gram-token-l-freqs.*.tsv.xz
filter_keys -k token_keys.tsv.xz -k lemma_keys.tsv.xz 1-gram-token-l-freqs.*.tsv.xz
git clone "https://korap.ids-mannheim.de/gerrit/IDS-Mannheim/totalngrams" cd totalngrams mvn install export PATH=`pwd`/appassembler/bin:$PATH
Koplenig, Alexander/Kupietz, Marc/Wolfer, Sascha (2022): Testing the relationship between word length, frequency, and predictability based on the German Reference Corpus. Cognitive Science 46(6)
Aumasson, Jean-Philippe/Meier, Willi/Phan, Raphael C-W/Henzen, Luca (2014): BLAKE2. In: The Hash Function BLAKE. Springer. p. 165–183.