commit | 32530791b767bf1757bd0ed1ac389fcc85ba75ce | [log] [tgz] |
---|---|---|
author | Marc Kupietz <kupietz@ids-mannheim.de> | Tue Nov 08 08:52:16 2022 +0100 |
committer | Marc Kupietz <kupietz@ids-mannheim.de> | Tue Nov 08 08:52:16 2022 +0100 |
tree | 1e28d08dd7805e25fe97a33e06daee0de82318ab | |
parent | ab91cf0ebc01ce806ceef927c73bbad4130b6450 [diff] |
Update changelog Change-Id: I38f5ef1d2570d9a9051fabc190e23fe41b2eb678
Package for effectively processing frequency lists from very large corpora in tab separated value format, by making full use of multicore-processors.
An older version of totalngrams
was used for Koplenig et al. (2022).
totalngrams [-dhlNSV] [--force] [--pad] [-f=<fold>] [-F=<FOLDS>] [-L=<logFileName>] [-n=<ngram_size>] [-o=<output_fillename>] [-p=<worker_pool_specification>] [-P=<max_threads>] <inputFiles>... sum ngram counts from KorAP-XML, CoNLL-U files and frequency lists <inputFiles>... input files -d, --downcase Convert all token characters into lower case (default: false) -f, --fold=<fold> current fold (default: 1) -F, --folds=<FOLDS> number of random folds (default: 1) --force Force overwrite (default: false) -h, --help Show this help message and exit. -l, --with-lemma-pos Use also lemma and part-of-speech annotations (default: false -L, --log-file=<logFileName> log file name (default: totalngrams.log) -n, --ngram-size=<ngram_size> n-gram size (default: 1) -N, --numeric-secondary-sort Sort entries with same frequency numerically (default: false) -o, --output-file=<output_fillename> Output file (default: -) -p, --worker-pool=<worker_pool_specification> Run preprocessing on extern hosts, e.g. '10*local, 5*host1,3*smith@host2' (default: ) -P, --max-procs=<max_threads> Run up to max-procs processes at a time (default: 6) --pad Add padding «START» and «END» symbols at text edges (default: false) -S, --sort Toggle output sorting (default: true) -V, --version Print version information and exit.
The package also contains some groovy scripts for handling pseudonymization tasks, i.e. replacing each token or lemma with a corresponding number according to separate key files.
You can run the groovy scripts directly, if you have installed groovy or from the totalngrams jar, otjherwise.
./src/main/groovy/org/ids_mannheim/GeneratePseudonymKey.groovy -h
java -Dgroovy.grape.enable=false -cp target/totalngrams-2.1.0.jar\ org.ids_mannheim.GeneratePseudonymKey -c 0 1-gram-token-l-freqs.*.tsv.xz | xz -T0 > token_key.tsv.xz java -Dgroovy.grape.enable=false -cp target/totalngrams-2.1.0.jar\ org.ids_mannheim.GeneratePseudonymKey -c 1 1-gram-token-l-freqs.*.tsv.xz
java -Dgroovy.grape.enable=false -cp totalngrams-2.1.0.jar org.ids_mannheim.Pseudonymize
java -Xmx160000m -Dgroovy.grape.enable=false -cp totalngrams-2.1.0.jar org.ids_mannheim.FilterKeys\ -k token_keys.tsv.xz -k lemma_keys.tsv.xz 1-gram-token-l-freqs.*.tsv.xz
git clone "https://korap.ids-mannheim.de/gerrit/IDS-Mannheim/totalngrams" cd totalngrams mvn install