tree: 1e28d08dd7805e25fe97a33e06daee0de82318ab [path history] [tgz]
  1. src/
  2. .gitignore
  3. .gitlab-ci.yml
  4. CHANGELOG.md
  5. config.groovy
  6. pom.xml
  7. Readme.md
Readme.md

totalngrams

Package for effectively processing frequency lists from very large corpora in tab separated value format, by making full use of multicore-processors.

An older version of totalngrams was used for Koplenig et al. (2022).

Synopsis

totalngrams [-dhlNSV] [--force] [--pad] [-f=<fold>] [-F=<FOLDS>]
                   [-L=<logFileName>] [-n=<ngram_size>] [-o=<output_fillename>]
                   [-p=<worker_pool_specification>] [-P=<max_threads>]
                   <inputFiles>...
sum ngram counts from KorAP-XML, CoNLL-U files and frequency lists
      <inputFiles>...    input files
  -d, --downcase         Convert all token characters into lower case (default:
                           false)
  -f, --fold=<fold>      current fold (default: 1)
  -F, --folds=<FOLDS>    number of random folds (default: 1)
      --force            Force overwrite (default: false)
  -h, --help             Show this help message and exit.
  -l, --with-lemma-pos   Use also lemma and part-of-speech annotations
                           (default: false
  -L, --log-file=<logFileName>
                         log file name (default: totalngrams.log)
  -n, --ngram-size=<ngram_size>
                         n-gram size (default: 1)
  -N, --numeric-secondary-sort
                         Sort entries with same frequency numerically
                           (default: false)
  -o, --output-file=<output_fillename>
                         Output file (default: -)
  -p, --worker-pool=<worker_pool_specification>
                         Run preprocessing on extern hosts, e.g. '10*local,
                           5*host1,3*smith@host2' (default: )
  -P, --max-procs=<max_threads>
                         Run up to max-procs processes at a time (default: 6)
      --pad              Add padding «START» and «END» symbols at text edges
                           (default: false)
  -S, --sort             Toggle output sorting (default: true)
  -V, --version          Print version information and exit.

Scripts

The package also contains some groovy scripts for handling pseudonymization tasks, i.e. replacing each token or lemma with a corresponding number according to separate key files.

You can run the groovy scripts directly, if you have installed groovy or from the totalngrams jar, otjherwise.

GeneratePseudonymKey

Example usage

./src/main/groovy/org/ids_mannheim/GeneratePseudonymKey.groovy -h
java -Dgroovy.grape.enable=false -cp target/totalngrams-2.1.0.jar\
 org.ids_mannheim.GeneratePseudonymKey -c 0 1-gram-token-l-freqs.*.tsv.xz | xz -T0 > token_key.tsv.xz

java -Dgroovy.grape.enable=false -cp target/totalngrams-2.1.0.jar\
 org.ids_mannheim.GeneratePseudonymKey -c 1 1-gram-token-l-freqs.*.tsv.xz

Pseudonymize

Example usage

java -Dgroovy.grape.enable=false -cp totalngrams-2.1.0.jar org.ids_mannheim.Pseudonymize

FilterKeys

Example usage

java -Xmx160000m -Dgroovy.grape.enable=false -cp totalngrams-2.1.0.jar org.ids_mannheim.FilterKeys\
 -k token_keys.tsv.xz -k lemma_keys.tsv.xz 1-gram-token-l-freqs.*.tsv.xz

Installation

Prerequisites

git clone "https://korap.ids-mannheim.de/gerrit/IDS-Mannheim/totalngrams"
cd totalngrams
mvn install

References