commit | ed14736732d67ee47f5c0aa31bf5050b9e01e879 | |
---|---|---|
author | Marc Kupietz <kupietz@ids-mannheim.de> | Thu Dec 01 13:27:20 2022 +0100 |
committer | Marc Kupietz <kupietz@ids-mannheim.de> | Thu Dec 01 13:27:20 2022 +0100 |
tree | cc2cc7a96fed3295dec3b2f5164ab7361bd19b1a | |
parent | 31574d915d1ed6ab02b1291fb817e14ee0401785 | |

clean up ci pipeline

Change-Id: Iee56fa93a0d9a608f58e6657ccf5ae732766f3ed

Package for efficiently processing token lists from very large corpora in tab-separated value format, making full use of multicore processors.
An older version of totalngrams was used for Koplenig et al. (2022).

    totalngrams [-dhlNSV] [--force] [--pad] [-f=<fold>] [-F=<FOLDS>]
                [-L=<logFileName>] [-n=<ngram_size>] [-o=<output_fillename>]
                [-p=<worker_pool_specification>] [-P=<max_threads>]
                <inputFiles>...
    sum ngram counts from KorAP-XML, CoNLL-U files and frequency lists
          <inputFiles>...       input files
      -d, --downcase            Convert all token characters into lower case
                                  (default: false)
      -f, --fold=<fold>         current fold (default: 1)
      -F, --folds=<FOLDS>       number of random folds (default: 1)
          --force               Force overwrite (default: false)
      -h, --help                Show this help message and exit.
      -l, --with-lemma-pos      Use also lemma and part-of-speech annotations
                                  (default: false
      -L, --log-file=<logFileName>
                                log file name (default: totalngrams.log)
      -n, --ngram-size=<ngram_size>
                                n-gram size (default: 1)
      -N, --numeric-secondary-sort
                                Sort entries with same frequency numerically
                                  (default: false)
      -o, --output-file=<output_fillename>
                                Output file (default: -)
      -p, --worker-pool=<worker_pool_specification>
                                Run preprocessing on extern hosts, e.g.
                                  '10*local, 5*host1,3*smith@host2' (default: )
      -P, --max-procs=<max_threads>
                                Run up to max-procs processes at a time
                                  (default: 6)
          --pad                 Add padding «START» and «END» symbols at text
                                  edges (default: false)
      -S, --sort                Toggle output sorting (default: true)
      -V, --version             Print version information and exit.

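To illustrate what `-n` and `--pad` mean for the resulting counts, here is a minimal shell sketch of padded n-gram counting. It is not how totalngrams is implemented. The function name `count_ngrams`, the input format (one text per line, whitespace-separated tokens) and the padding scheme (n-1 «START» and «END» symbols per text edge) are assumptions made for this example only.

```shell
#!/usr/bin/env bash
# Hypothetical helper, NOT part of totalngrams.
# Reads one text per line (whitespace-separated tokens) from stdin and
# prints tab-separated n-grams followed by their frequency.
count_ngrams() {
  local n="$1"
  local tab
  tab=$(printf '\t')
  awk -v n="$n" '
    {
      m = 0
      # Assumed padding scheme: n-1 symbols at each text edge.
      for (i = 1; i < n; i++) tok[++m] = "«START»"
      for (i = 1; i <= NF; i++) tok[++m] = $i
      for (i = 1; i < n; i++) tok[++m] = "«END»"
      # Slide a window of size n over the padded token list.
      for (i = 1; i + n - 1 <= m; i++) {
        gram = tok[i]
        for (j = 1; j < n; j++) gram = gram "\t" tok[i + j]
        count[gram]++
      }
    }
    END { for (g in count) print g "\t" count[g] }
  ' | LC_ALL=C sort -t "$tab" -k "$((n + 1)),$((n + 1))nr" -k "1,$n"
}
```

For example, `printf 'a b a b\n' | count_ngrams 2` lists the bigram `a b` first with frequency 2, followed by the bigrams that occur once, including the padded ones at the text edges.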

    FOLDS=16
    BASE="."

    for l in "-l"; do # "-l"
      # Note: seq FIRST INCREMENT LAST, so "seq 1 2 3" yields 1 and 3.
      for n in $(seq 1 2 3); do
        for f in $(seq 1 $FOLDS); do
          totalngrams \
            --pad \
            -P 79 \
            -n $n \
            -f $f \
            -F $FOLDS \
            $l -o "$BASE/paddedlemmaposfreq/$n-gram-token$l-freqs.$f.tsv.xz" $BASE/conllu/*.conllu.gz
        done
      done
    done

The package also contains some Groovy scripts for handling pseudonymization tasks, i.e. replacing each token or lemma with a corresponding number according to separate key files.
You can run the Groovy scripts directly if you have Groovy installed, or otherwise from the totalngrams jar.
./src/main/groovy/org/ids_mannheim/GeneratePseudonymKey.groovy -h
or:
generate_pseudonym_key -c 0 1-gram-token-l-freqs.*.tsv.xz | xz -T0 > token_key.tsv.xz
pseudonymize -d /tmp -k token_key.tsv.xz -k lemma_key.tsv.xz *-gram-token-l-freqs.*.tsv.xz
filter_keys -k token_keys.tsv.xz -k lemma_keys.tsv.xz 1-gram-token-l-freqs.*.tsv.xz
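
The idea behind pseudonymization, replacing each token by a number looked up in a key file, can be sketched in a few lines of shell. This is only an illustration, not the logic of the Groovy scripts: the function name `pseudonymize_column` and the key file layout (token, tab, number, uncompressed) are assumptions for this example.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, NOT the pseudonymize script from this package.
# Usage: pseudonymize_column KEY_FILE < frequency_list.tsv
# Assumed key file layout: token<TAB>number, one entry per line.
pseudonymize_column() {
  local key_file="$1"
  awk -F '\t' -v OFS='\t' '
    # First file (the key file): remember the number for each token.
    NR == FNR { key[$1] = $2; next }
    # Remaining input (stdin): replace column 1 by its number.
    ($1 in key) { $1 = key[$1]; print; next }
    # Abort rather than leak a token that has no key.
    { print "unknown token: " $1 > "/dev/stderr"; exit 1 }
  ' "$key_file" -
}
```

Compressed inputs could be piped through `xz -dc` first, e.g. `xz -dc freqs.tsv.xz | pseudonymize_column key.tsv`, where both file names are placeholders. Failing on unknown tokens is a design choice of this sketch, so that no original token passes through unreplaced.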

    git clone "https://korap.ids-mannheim.de/gerrit/IDS-Mannheim/totalngrams"
    cd totalngrams
    mvn install
    export PATH=`pwd`/appassembler/bin:$PATH