commit | 0ab973950366ba7853a894e21676b86fd6b53453 | [log] [tgz] |
---|---|---|
author | Marc Kupietz <kupietz@ids-mannheim.de> | Tue Dec 10 16:16:32 2024 +0100 |
committer | Marc Kupietz <kupietz@ids-mannheim.de> | Tue Dec 10 16:18:46 2024 +0100 |
tree | a4db99b5517139b60aa57cdab20754690f814cd7 | |
parent | 57cc90271fcfcb0a7938dc5379e524aca71c359f [diff] |
Add getPosWiseW2VCollocators -> json to API Resolves #8 Change-Id: Id1eca2896fadc59bebb2786e45fd8a6b3efdca6c
Visualizes paradigmatic and syntagmatic relations between words based on wang2vec / structured skip-n-gram (Ling et al. 2015) word embeddings (Mikolov et al. 2013) and word embedding networks.
DeReKoVecs (Fankhauser / Kupietz 2017, 2019, 2022; Kupietz et al. 2018) serves as part of the new open lab of the Corpus Linguistics group at IDS Mannheim. Similar to the Collocation Database CCDB (Keibel / Belica 2007, Belica 2011), DeReKoVecs is used to investigate and compare measurements, dimension reduction procedures, visualizations, etc., in order to track down detailed paradigmatic and syntagmatic relations between words based on their use in very large corpora such as the German Reference Corpus DeReKo (Kupietz et al. 2010).
cpanm https://github.com/Akron/Mojolicious-Plugin-Localize.git
cpanm --installdeps .
perl Makefile.PL
make
make install
A detailed, known-to-work installation procedure can also be found in the GitLab CI pipeline script.
Please note that IDS::DeReKoVecs::Read is not yet stable and not yet recommended for use.
You can build your own models with dereko2vec.
docker run -v ./example-models:/example-models:z -e MOJO_CONFIG=/example-models/example-docker.conf -p 3000:3000 idscorpuslinguistics/derekovecs
docker compose up
MOJO_CONFIG=$(pwd)/example.conf morbo script/derekovecs-server
MOJO_CONFIG=$(pwd)/example.conf hypnotoad script/derekovecs-server
The web user interface will then be available, for example, at http://localhost:3000
In addition to the web user interface, derekovecs also provides a web API, which is, however, still very unsystematic and not stable. To figure out the meaning of result components that are still undocumented, have a look at the table head mouse-overs in the GUI or at the source code around here.
Command | Parameters | Description |
---|---|---|
/ | word, n, dedupe, cutoff, json=1 | get paradigmatic and syntagmatic neighbours, from word embeddings |
getCollocationAssociation | w, c | get association scores for specific node collocate pairs |
getSimilarity | w1, w2 | get cosine similarity of w1 and w2 |
getVersion | | get version of derekovecs |
getModelName | | get name of model (inferred from the file name) |
getVocabSize | | get vocabulary size of model |
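For readers unfamiliar with the score that getSimilarity reports, the following is a minimal sketch of cosine similarity itself. The vectors are toy values chosen for illustration, not embeddings from an actual model, and the helper name `cosine_similarity` is hypothetical; how derekovecs computes the score internally is not shown here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Toy 2-dimensional vectors (assumption: real embeddings have many more dimensions)
    print(cosine_similarity([3.0, 4.0], [4.0, 3.0]))
```

The score ranges from -1 (opposite directions) through 0 (orthogonal) to 1 (same direction), independent of vector length.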
Command | Parameters | Description |
---|---|---|
getClassicCollocators | w | get count based collocates of word w |
{
  "N" : 55650540526,          // number of tokens in corpus
  "collocates" : [            // array of collocates
    {
      "afwin" : 64,           // binary encoded auto-focus window
                              // (see Perkuhn et al. 2012: E8-15):
                              // 64 = 2^6 ≙ 00010 node 00000
                              // (Aus [gutem] Grund)
      "delta" : 0,            // rank delta compared to collocation in a background
                              // corpus (currently unused)
      "dice" : 0.00198886,    // dice score
      "f" : 113490,           // abs. frequency of collocation
      "f2" : 10965575,        // abs. frequency of collocate
      "ld" : 5.02616,         // log-dice score (Rychlý 2008) for whole window
      "ldaf" : 7.39257,       // log-dice score for auto focus window
      "lfmd" : 36.0655,       // log-frequency biased mutual dependency ≙ pmi³
                              // (Daille 1994; Thanopoulos et al. 2002)
      "llr" : 204906,         // log-likelihood (Dunning 1993; Evert 2004)
      "ln_count" : 36,        // frequency of collocate as left neighbour of node
      "ln_pmi" : -5.81926,    // pmi as left neighbour
      "md" : 19.2733,         // mutual dependency ≙ pmi²
                              // (Daille 1994; Thanopoulos et al. 2002)
      "npmi" : 0.111633,      // normalized pmi (Bouma 2009)
      "pmi" : 2.4811,         // pointwise mutual information
      "rn_count" : 386,       // frequency of collocate as right neighbour of node
      "rn_pmi" : -2.39672,    // pmi as right neighbour
      "win" : 1023,           // binary encoded positions at which the collocate
                              // appears at least once 1023 = 2^10-1 ≙ 11111 node 11111
                              // (unmarked scores refer to this)
      "word" : "Aus"          // collocate
    },
    // ...
  ]
}
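The bit-encoded window fields and the log-dice score in the sample response above can be checked with a few lines of Python. This is only a sketch under stated assumptions: the bit layout (most significant bit = leftmost position, five positions on each side of the node) is inferred from the two examples in the comments (64 and 1023), and the helper names are hypothetical. The log-dice formula, 14 + log2(Dice), follows Rychlý (2008).

```python
import math

HALF_WIDTH = 5  # assumption: 5 positions on each side of the node (1023 = 2^10 - 1)


def window_pattern(mask, half_width=HALF_WIDTH):
    """Render a window bitmask in the 'LLLLL node RRRRR' notation used above."""
    width = 2 * half_width
    if not 0 <= mask < 2 ** width:
        raise ValueError(f"mask {mask} does not fit into {width} bits")
    bits = format(mask, f"0{width}b")  # most significant bit first
    return f"{bits[:half_width]} node {bits[half_width:]}"


def window_offsets(mask, half_width=HALF_WIDTH):
    """List set positions relative to the node (-2 = two tokens to the left)."""
    bits = window_pattern(mask, half_width).replace(" node ", "")
    offsets = []
    for index, bit in enumerate(bits):
        if bit == "1":
            # First half of the string is the left context, second half the right one
            left = index < half_width
            offsets.append(index - half_width if left else index - half_width + 1)
    return offsets


def log_dice(dice):
    """logDice = 14 + log2(Dice) (Rychlý 2008)."""
    return 14 + math.log2(dice)


if __name__ == "__main__":
    print(window_pattern(64))                # 00010 node 00000
    print(window_offsets(64))                # [-2]
    print(round(log_dice(0.00198886), 4))    # 5.0262
```

Applied to the sample values, `afwin` = 64 marks the position two tokens to the left of the node (as in "Aus [gutem] Grund"), and the `dice` value reproduces the reported `ld` score up to rounding.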
Command | Parameters | Description |
---|---|---|
/getPosWiseW2VCollocators | w(,max=200,format=json) | get top max predictive collocates position-wise of word w |
GET 'http://localhost:3000/?word=Grund&n=10&dedupe=0&sort=0&cutoff=1000000&json=1' | json_pp |less
curl -L 'http://localhost:3000/getClassicCollocators?w=Grund'
GET 'http://localhost:3000/getCollocationAssociation?w=Grund&c=diesem'
GET 'http://localhost:3000/getPosWiseW2VCollocators?w=Test'
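The same requests can also be issued from a script. The following is a minimal sketch using only the Python standard library; it assumes a server running at http://localhost:3000 as in the examples above and that the queried endpoint answers with JSON. The helper names `build_url` and `fetch_json` are hypothetical and not part of derekovecs.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:3000"  # assumption: local server as in the examples above


def build_url(command, base_url=BASE_URL, **params):
    """Compose the request URL for an API command and its query parameters."""
    url = f"{base_url}/{command.lstrip('/')}"
    query = urlencode(params)  # percent-encodes non-ASCII words such as "Nähe"
    return f"{url}?{query}" if query else url


def fetch_json(command, base_url=BASE_URL, **params):
    """Send the request and decode the JSON body (requires a running server)."""
    with urlopen(build_url(command, base_url, **params)) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only prints the URLs; call fetch_json(...) instead once a server is running
    print(build_url("getClassicCollocators", w="Grund"))
    print(build_url("", word="Grund", n=10, json=1))
```

With a server running, `fetch_json("getClassicCollocators", w="Grund")["collocates"]` would give the list of collocate records, assuming the response has the structure shown in the sample above.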
docker build -t idscorpuslinguistics/derekovecs .
slim build --include-path /usr/local/share/perl5 --mount ./example-models:/example-models:z --env MOJO_CONFIG=/example-models/example-docker.conf idscorpuslinguistics/derekovecs
This builds an image ids-kl/derekovecs.slim, reduced to ~25% of the original size.
docker run -v ./example-models:/example-models:z -e MOJO_CONFIG=/example-models/example-docker.conf -p 3000:3000 idscorpuslinguistics/derekovecs
See rderekovecs.
See Changelog
Author: Marc Kupietz
Contributors: Peter Fankhauser, Rainer Perkuhn, Tim Feldmüller
Copyright (c) 2016-2024, Leibniz Institute for the German Language, Mannheim, Germany
DeReKoVecs is published under the Apache 2.0 License.
If you use DeReKoVecs (or results obtained with it) in a scientific publication, please cite at least Fankhauser / Kupietz (2022).
Belica, Cyril (2011): Semantische Nähe als Ähnlichkeit von Kookkurrenzprofilen. In: Andrea Abel, Renata Zanin, Hrsg., Korpora in Lehre und Forschung, S. 155-178. Bozen-Bolzano University Press. Freie Universität Bozen-Bolzano.
Bouma, Gerlof (2009): Normalized (pointwise) mutual information in collocation extraction. In: Proceedings of GSCL.
Daille, B. (1994): Approche mixte pour l’extraction automatique de terminologie: statistiques lexicales et filtres linguistiques. PhD thesis, Université Paris 7.
Fankhauser, Peter / Kupietz, Marc (2022): Count-Based and Predictive Language Models for Exploring DeReKo. In: Proceedings of the LREC 2022 Workshop on Challenges in the Management of Large Corpora (CMLC-10 2022). Paris/Marseille: ELRA. pp. 27-31.
Fankhauser, Peter / Kupietz, Marc (2017): Visualizing Language Change in a Corpus of Contemporary German. In: Proceedings of the 9th International Corpus Linguistics Conference. Birmingham: University of Birmingham.
Fankhauser, Peter / Kupietz, Marc (2019): Analyzing domain specific word embeddings for a large corpus of contemporary German. International Corpus Linguistics Conference, Cardiff, Wales, UK, July 22-26, 2019. 6 pp.
Keibel, H. / Belica, C. (2007): CCDB: A Corpus-Linguistic Research and Development Workbench. In: Proceedings of the 4th Corpus Linguistics Conference (CL 2007). Birmingham: University of Birmingham.
Kupietz, M. / Belica, C. / Keibel, H. / Witt, A. (2010): The German Reference Corpus DeReKo: A primordial sample for linguistic research. In: Calzolari, N. et al. (eds.): Proceedings of the seventh conference on International Language Resources and Evaluation (LREC 2010). Paris: ELRA, 1848-1854.
Kupietz, M. / Lüngen, H. / Kamocki, P. / Witt, A. (2018): German Reference Corpus DeReKo: New Developments – New Opportunities. In: Calzolari, N. et al. (eds.): Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018). Miyazaki: ELRA, 4353-4360.
Ling, Wang / Dyer, C. / Black, A. / Trancoso, I. (2015): Two/too simple adaptations of word2vec for syntax problems. In Proc. of NAACL.
Mikolov, T. / Sutskever, I. / Chen, K. / Corrado, G. S. / Dean, J. (2013): Distributed representations of words and phrases and their compositionality. In: Proceedings of NIPS (Advances in Neural Information Processing Systems) 2013, 3111–3119.
Perkuhn, Rainer / Keibel, Holger / Kupietz, Marc (2012): Korpuslinguistik. Paderborn: Fink, 2012. Addendum
Rychlý, Pavel (2008): A lexicographer-friendly association score. In Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN, 6–9, 2008
Thanopoulos, A. / Fakotakis, N. / Kokkinakis, G. (2002): Comparative evaluation of collocation extraction metrics. In: Proc. of LREC 2002: 620–625.