commit	0ab973950366ba7853a894e21676b86fd6b53453
author	Marc Kupietz <kupietz@ids-mannheim.de>	Tue Dec 10 16:16:32 2024 +0100
committer	Marc Kupietz <kupietz@ids-mannheim.de>	Tue Dec 10 16:18:46 2024 +0100
tree	a4db99b5517139b60aa57cdab20754690f814cd7
parent	57cc90271fcfcb0a7938dc5379e524aca71c359f

README.md

DeReKoVecs (server and web app)

Visualizes paradigmatic and syntagmatic relations between words based on wang2vec / structured skip-n-gram (Ling et al. 2015) word embeddings (Mikolov et al. 2013) and word embedding networks.

DeReKoVecs (Fankhauser / Kupietz 2017, 2019, 2022; Kupietz et al. 2018) serves as part of the new open lab of the Corpus Linguistics group at IDS Mannheim. Similar to the Collocation Database CCDB (Keibel / Belica 2007, Belica 2011), DeReKoVecs is used to investigate and compare measurements, dimension reduction procedures, visualizations, etc., in order to track down detailed paradigmatic and syntagmatic relations between words based on their use in very large corpora such as the German Reference Corpus DeReKo (Kupietz et al. 2010).

Installation from source

Dependencies

libcollocatordb >= v1.3.2

Build and install

cpanm https://github.com/Akron/Mojolicious-Plugin-Localize.git
cpanm --installdeps .

perl Makefile.PL
make
make install

A detailed installation procedure that is known to work can also be found in the GitLab CI pipeline script.

Please note that IDS::DeReKoVecs::Read is not yet stable and its use is not yet recommended.

Build your own models

You can build your own models with dereko2vec.

Run

From prebuilt docker image

docker run -v ./example-models:/example-models:z -e MOJO_CONFIG=/example-models/example-docker.conf -p 3000:3000 idscorpuslinguistics/derekovecs

From prebuilt docker image with docker compose

docker compose up

From source in debug mode

MOJO_CONFIG=$(pwd)/example.conf morbo script/derekovecs-server

From source in production mode

MOJO_CONFIG=$(pwd)/example.conf hypnotoad script/derekovecs-server

The web user interface will then be available, for example, at http://localhost:3000

Web Service API

In addition to the web user interface, derekovecs also provides a web API, which is, however, still very unsystematic and not stable. To figure out the meaning of result components that are still undocumented, have a look at the table head mouse-overs in the GUI or at the source code around here.

Command	Parameters	Description
/	word, n, dedupe, cutoff, json=1	get paradigmatic and syntagmatic neighbours, from word embeddings
getCollocationAssociation	w, c	get association scores for specific node collocate pairs
getSimilarity	w1, w2	get cosine similarity of w1 and w2
getVersion		get version of derekovecs
getModelName		get name of model (inferred from the file name)
getVocabSize		get vocabulary size of model
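For orientation, the cosine similarity that getSimilarity reports for w1 and w2 can be sketched as follows. This is a generic illustration of the measure on plain Python lists, not the server's actual implementation; the function name `cosine_similarity` is made up for this example.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)
```

Vectors pointing in the same direction yield 1, orthogonal vectors 0, and opposite vectors -1, independent of their lengths.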

Get classical (count-based) collocates

Command	Parameters	Description
getClassicCollocators	w	get count based collocates of word w

Example Result (node: Grund)

{
   "N" : 55650540526,           // number of tokens in corpus
   "collocates" : [             // array of collocates
      {
         "afwin" : 64,          // binary encoded auto-focus window
                                // (see Perkuhn et al. 2012: E8-15):
                                // 64 = 2^6 ≙ 00010 node 00000
                                // (Aus [gutem] Grund)
         "delta" : 0,           // rank delta compared to collocation in a background
                                // corpus (currently unused)
         "dice" : 0.00198886,   // dice score
         "f" : 113490,          // abs. frequency of collocation
         "f2" : 10965575,       // abs. frequency of collocate
         "ld" : 5.02616,        // log-dice score (Rychlý 2008) for whole window
         "ldaf" : 7.39257,      // log-dice score for auto focus window
         "lfmd" : 36.0655,      // log-frequency biased mutual dependency ≙ pmi³
                                // (Daille 1994; Thanopoulos et al. 2002)
         "llr" : 204906,        // log-likelihood (Dunning 1993; Evert 2004)
         "ln_count" : 36,       // frequency of collocate as left neighbour of node
         "ln_pmi" : -5.81926,   // pmi as left neighbour
         "md" : 19.2733,        // mutual dependency ≙ pmi²
                                // (Daille 1994; Thanopoulos et al. 2002)
         "npmi" : 0.111633,     // normalized pmi (Bouma 2009)
         "pmi" : 2.4811,        // pointwise mutual information
         "rn_count" : 386,      // frequency of collocate as right neighbour of node
         "rn_pmi" : -2.39672,   // pmi as right neighbour
         "win" : 1023,          // binary encoded positions at which the collocate
                                // appears at least once:
                                // 1023 = 2^10-1 ≙ 11111 node 11111
                                // (unmarked scores refer to this)
         "word" : "Aus"         // collocate
      },
      // ...
   ]
}
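The following sketch shows one way to read the example result in Python. `decode_window` assumes a window of five positions on each side of the node with the most significant bit as the leftmost position, which is what the two annotated values (64 and 1023) suggest. The log-dice relation is the one given by Rychlý (2008). The relations used for `md` and `lfmd` are only observations that hold for the example values above; whether the server computes the scores this way is an assumption. All function names are made up for this example.

```python
import math


def decode_window(mask: int, width: int = 5) -> str:
    """Render a window bitmask as '<left bits> node <right bits>'.

    Assumes 2 * width positions, most significant bit = leftmost position.
    """
    if not 0 <= mask < 1 << (2 * width):
        raise ValueError("mask does not fit into the window")
    bits = format(mask, f"0{2 * width}b")
    return f"{bits[:width]} node {bits[width:]}"


def log_dice(dice: float) -> float:
    """Log-dice after Rychlý (2008): 14 + log2(dice)."""
    return 14 + math.log2(dice)


# Values copied from the example result above (collocate "Aus" of "Grund").
collocate = {"f": 113490, "dice": 0.00198886, "pmi": 2.4811,
             "ld": 5.02616, "md": 19.2733, "lfmd": 36.0655,
             "afwin": 64, "win": 1023}

log_f = math.log2(collocate["f"])

print(decode_window(collocate["afwin"]))  # 00010 node 00000
print(decode_window(collocate["win"]))    # 11111 node 11111

# Each pair below should agree up to rounding of the published values.
print(log_dice(collocate["dice"]), collocate["ld"])
print(collocate["pmi"] + log_f, collocate["md"])        # observed, not documented
print(collocate["pmi"] + 2 * log_f, collocate["lfmd"])  # observed, not documented
```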

Get top predictive collocates position-wise

Command	Parameters	Description
/getPosWiseW2VCollocators	w(,max=200,format=json)	get top `max` predictive collocates position-wise of word w

Examples

GET 'http://localhost:3000/?word=Grund&n=10&dedupe=0&sort=0&cutoff=1000000&json=1' | json_pp | less

curl -L 'http://localhost:3000/getClassicCollocators?w=Grund'

GET 'http://localhost:3000/getCollocationAssociation?w=Grund&c=diesem'

GET 'http://localhost:3000/getPosWiseW2VCollocators?w=Test'
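The same requests can be issued from a script. The following is a minimal sketch using only the Python standard library. The helper names (`build_url`, `fetch_json`, `top_collocates`) are made up for this example, and it assumes that the endpoint answers with plain JSON shaped like the getClassicCollocators example result (without the explanatory comments). Since the API is not stable, treat it as a starting point only.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_url(base: str, command: str = "", **params) -> str:
    """Compose a request URL such as <base>/<command>?w=Grund."""
    url = f"{base.rstrip('/')}/{command.lstrip('/')}"
    query = urlencode(params)  # percent-encodes non-ASCII words as UTF-8
    return f"{url}?{query}" if query else url


def fetch_json(url: str, timeout: float = 30.0):
    """GET a URL and parse the body as JSON (assumes a JSON response)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def top_collocates(result: dict, key: str = "ld", n: int = 10) -> list:
    """Return the n collocate words with the highest value for `key`."""
    ranked = sorted(result["collocates"], key=lambda c: c[key], reverse=True)
    return [c["word"] for c in ranked[:n]]
```

With a server running on localhost, `top_collocates(fetch_json(build_url("http://localhost:3000", "getClassicCollocators", w="Grund")))` would list the ten collocates of Grund with the highest log-dice score.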

(Build and) run using docker / podman

Optional: Build docker image from source

docker build -t idscorpuslinguistics/derekovecs .

Optional: Slim down image using Slim(toolkit)

slim build --include-path /usr/local/share/perl5 --mount ./example-models:/example-models:z --env MOJO_CONFIG=/example-models/example-docker.conf idscorpuslinguistics/derekovecs

This builds an image idscorpuslinguistics/derekovecs.slim that is reduced to about 25% of the original size.

Run docker image

docker run -v ./example-models:/example-models:z -e MOJO_CONFIG=/example-models/example-docker.conf -p 3000:3000 idscorpuslinguistics/derekovecs

Client library for R

See rderekovecs.

News

See Changelog

Development and License

Author: Marc Kupietz

Contributors: Peter Fankhauser, Rainer Perkuhn, Tim Feldmüller

DeReKoVecs is published under the Apache 2.0 License.

How to cite

If you use DeReKoVecs or its results in a scientific publication, please cite at least Fankhauser / Kupietz (2022).

References

Belica, Cyril (2011): Semantische Nähe als Ähnlichkeit von Kookkurrenzprofilen. In: Abel, Andrea / Zanin, Renata (eds.): Korpora in Lehre und Forschung. Bozen-Bolzano University Press, Freie Universität Bozen-Bolzano, pp. 155-178.

Bouma, Gerlof (2009): Normalized (pointwise) mutual information in collocation extraction. In: Proceedings of GSCL.

Daille, B. (1994): Approche mixte pour l’extraction automatique de terminologie: statistiques lexicales et filtres linguistiques. PhD thesis, Université Paris 7.

Fankhauser, Peter / Kupietz, Marc (2022): Count-Based and Predictive Language Models for Exploring DeReKo. In: Proceedings of the LREC 2022 Workshop on Challenges in the Management of Large Corpora (CMLC-10 2022). Paris/Marseille: ELRA. pp. 27-31.

Fankhauser, Peter / Kupietz, Marc (2017): Visualizing Language Change in a Corpus of Contemporary German. In: Proceedings of the 9th International Corpus Linguistics Conference. Birmingham: University of Birmingham.

Fankhauser, Peter / Kupietz, Marc (2019): Analyzing domain specific word embeddings for a large corpus of contemporary German. International Corpus Linguistics Conference, Cardiff, Wales, UK, July 22-26, 2019. 6 pp.

Keibel, H. / Belica, C. (2007): CCDB: A Corpus-Linguistic Research and Development Workbench. In: Proceedings of the 4th Corpus Linguistics Conference (CL 2007). Birmingham: University of Birmingham.

Kupietz, M. / Belica, C. / Keibel, H. / Witt, A. (2010): The German Reference Corpus DeReKo: A primordial sample for linguistic research. In: Calzolari, N. et al. (eds.): Proceedings of the seventh conference on International Language Resources and Evaluation (LREC 2010). Paris: ELRA, 1848-1854.

Kupietz, M. / Lüngen, H. / Kamocki, P. / Witt, A. (2018): German Reference Corpus DeReKo: New Developments – New Opportunities. In: Calzolari, N. et al. (eds.): Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018). Miyazaki: ELRA, 4353-4360.

Ling, Wang / Dyer, C. / Black, A. / Trancoso, I. (2015): Two/too simple adaptations of word2vec for syntax problems. In Proc. of NAACL.

Mikolov, T. / Sutskever, I. / Chen, K. / Corrado, G. S. / Dean, J. (2013): Distributed representations of words and phrases and their compositionality. In: Proceedings of NIPS (Advances in Neural Information Processing Systems) 2013, 3111–3119.

Perkuhn, Rainer / Keibel, Holger / Kupietz, Marc (2012): Korpuslinguistik. Paderborn: Fink, 2012. Addendum

Rychlý, Pavel (2008): A lexicographer-friendly association score. In Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN, 6–9, 2008

Thanopoulos, A. / Fakotakis, N. / Kokkinakis, G. (2002): Comparative evaluation of collocation extraction metrics. In: Proc. of LREC 2002: 620–625.