commit | b37bd84fef2e5f0a6786144a7d36ee8c0cd3b580 | |
---|---|---|
author | Marc Kupietz <kupietz@ids-mannheim.de> | Fri Apr 05 11:22:30 2024 +0200 |
committer | Marc Kupietz <kupietz@ids-mannheim.de> | Fri Apr 05 11:22:30 2024 +0200 |
tree | 08a72327b0487e269ba1f191565462eede38ba41 | |
parent | dd8cfbeffdd66a5e8651ecc2bb79ef5c5c8007ca | |

Add missing reference to latest derekovecs publication

Change-Id: Ic729a6b01f92324c33ca778cff5a7a191b75beca
Visualizes paradigmatic and syntagmatic relations between words based on wang2vec / structured skip-n-gram (Ling et al. 2015) word embeddings (Mikolov et al. 2013) and word embedding networks.
DeReKoVecs (Fankhauser / Kupietz 2017, 2019, 2022; Kupietz et al. 2018) is part of the new open lab of the Corpus Linguistics group at IDS Mannheim. Like the Collocation Database CCDB (Keibel / Belica 2007, Belica 2011), DeReKoVecs is used to investigate and compare measures, dimension reduction procedures, visualizations, etc., in order to track down detailed paradigmatic and syntagmatic relations between words based on their use in very large corpora such as the German Reference Corpus DeReKo (Kupietz et al. 2010).
cpanm https://github.com/Akron/Mojolicious-Plugin-Localize.git
cpanm --installdeps .
perl Makefile.PL
make
make install
A detailed, known-to-work installation procedure can also be found in the GitLab CI pipeline script.
Please note that IDS::DeReKoVecs::Read is not yet stable and is not yet recommended for use.
You can build your own models with dereko2vec.
MOJO_CONFIG=$(pwd)/example.conf morbo script/derekovecs-server
MOJO_CONFIG=$(pwd)/example.conf hypnotoad script/derekovecs-server
The web user interface will then be available, for example, at http://localhost:3000
In addition to the web user interface, derekovecs also provides a web API, which is, however, still very unsystematic and not stable. To figure out the meaning of result components that are still undocumented, have a look at the table head mouse-overs in the GUI or at the source code around here.
Command | Parameters | Description |
---|---|---|
/ | word, n, dedupe, cutoff, json=1 | get paradigmatic and syntagmatic neighbours, from word embeddings |
getCollocationAssociation | w, c | get association scores for specific node collocate pairs |
getSimilarity | w1, w2 | get cosine similarity of w1 and w2 |
getVersion | | get version of derekovecs |
getModelName | | get name of model (inferred from the file name) |
getVocabSize | | get vocabulary size of model |
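For orientation, the cosine similarity that `getSimilarity` reports for two words can be sketched in a few lines of Python. This only illustrates the measure itself and is not the derekovecs implementation; the function name `cosine_similarity` and the toy vectors are assumptions made up for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


# Toy 3-dimensional "embeddings" (hypothetical values, not taken from a model)
w1 = [1.0, 2.0, 0.0]
w2 = [2.0, 4.0, 0.0]  # points in the same direction as w1
w3 = [0.0, 0.0, 5.0]  # orthogonal to w1

print(cosine_similarity(w1, w2))  # close to 1: same direction
print(cosine_similarity(w1, w3))  # 0: no shared direction
```

Values near 1 indicate vectors pointing in nearly the same direction, values near 0 indicate unrelated directions.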
Command | Parameters | Description |
---|---|---|
getClassicCollocators | w | get count based collocates of word w |
{
  "N" : 55650540526,         // number of tokens in corpus
  "collocates" : [           // array of collocates
    {
      "afwin" : 64,          // binary encoded auto-focus window
                             // (see Perkuhn et al. 2012: E8-15):
                             // 64 = 2^6 ≙ 00010 node 00000
                             // (Aus [gutem] Grund)
      "delta" : 0,           // rank delta compared to collocation in a background
                             // corpus (currently unused)
      "dice" : 0.00198886,   // dice score
      "f" : 113490,          // abs. frequency of collocation
      "f2" : 10965575,       // abs. frequency of collocate
      "ld" : 5.02616,        // log-dice score (Rychlý 2008) for whole window
      "ldaf" : 7.39257,      // log-dice score for auto focus window
      "lfmd" : 36.0655,      // log-frequency biased mutual dependency ≙ pmi³
                             // (Daille 1994; Thanopoulos et al. 2002)
      "llr" : 204906,        // log-likelihood (Dunning 1993; Evert 2004)
      "ln_count" : 36,       // frequency of collocate as left neighbour of node
      "ln_pmi" : -5.81926,   // pmi as left neighbour
      "md" : 19.2733,        // mutual dependency ≙ pmi²
                             // (Daille 1994; Thanopoulos et al. 2002)
      "npmi" : 0.111633,     // normalized pmi (Bouma 2009)
      "pmi" : 2.4811,        // pointwise mutual information
      "rn_count" : 386,      // frequency of collocate as right neighbour of node
      "rn_pmi" : -2.39672,   // pmi as right neighbour
      "win" : 1023,          // binary encoded positions at which the collocate
                             // appears at least once:
                             // 1023 = 2^10-1 ≙ 11111 node 11111
                             // (unmarked scores refer to this)
      "word" : "Aus"         // collocate
    },
    // ...
  ]
}
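The `win` and `afwin` bit masks can be unpacked as in the following sketch. It assumes, based only on the two examples in the comments above (64 ≙ 00010 node 00000 and 1023 ≙ 11111 node 11111), that a mask is a 10-bit number whose five high bits are the left context, read left to right, and whose five low bits are the right context. The function names `window_offsets` and `render_window` are hypothetical helpers, not part of derekovecs.

```python
WINDOW = 5  # assumed number of positions on each side of the node


def _bits(mask, window):
    """Zero-padded binary string of the mask, left context first."""
    if not 0 <= mask < 2 ** (2 * window):
        raise ValueError("mask does not fit into the window")
    return format(mask, "0{}b".format(2 * window))


def window_offsets(mask, window=WINDOW):
    """Relative positions (-5..-1 and +1..+5) whose bit is set in the mask."""
    offsets = []
    for i, bit in enumerate(_bits(mask, window)):
        if bit == "1":
            # first half of the string: left context, second half: right context
            offsets.append(i - window if i < window else i - window + 1)
    return offsets


def render_window(mask, window=WINDOW):
    """Render the mask in the '00010 node 00000' notation used above."""
    bits = _bits(mask, window)
    return "{} node {}".format(bits[:window], bits[window:])


print(render_window(64), window_offsets(64))
print(render_window(1023), window_offsets(1023))
```

Under these assumptions, `afwin` 64 decodes to the single offset -2, which fits the example "Aus [gutem] Grund", where "Aus" stands two positions to the left of the node "Grund".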
GET 'http://localhost:3000/?word=Grund&n=10&dedupe=0&sort=0&cutoff=1000000&json=1' | json_pp |less
curl -L 'http://localhost:3000/getClassicCollocators?w=Grund'
GET 'http://localhost:3000/getCollocationAssociation?w=Grund&c=diesem'
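A response of the shape documented for `getClassicCollocators` can be post-processed as in the following sketch. To stay self-contained it does not contact a server but parses an embedded sample that copies a few fields of the "Aus" collocate shown above, without the explanatory comments. The helpers `log_dice` and `rank_collocates` are hypothetical names; the relation logDice = 14 + log2(Dice) follows Rychlý (2008) and agrees with the sample values, but whether derekovecs computes `ld` exactly this way is an assumption.

```python
import json
import math

# Abbreviated sample response; the values are copied from the example above.
SAMPLE = """
{
  "N": 55650540526,
  "collocates": [
    {"word": "Aus", "f": 113490, "f2": 10965575,
     "dice": 0.00198886, "ld": 5.02616, "ldaf": 7.39257}
  ]
}
"""


def log_dice(dice):
    """logDice = 14 + log2(Dice), following Rychlý (2008)."""
    return 14 + math.log2(dice)


def rank_collocates(response, key="ld"):
    """Return the collocates sorted by the given score, highest first."""
    return sorted(response["collocates"], key=lambda c: c[key], reverse=True)


response = json.loads(SAMPLE)
for collocate in rank_collocates(response):
    # Print the reported log-dice next to the value recomputed from "dice".
    recomputed = round(log_dice(collocate["dice"]), 4)
    print(collocate["word"], collocate["ld"], recomputed)
```

Against a running instance, the same functions could be applied to the parsed JSON body returned by the `curl` call above.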
docker build -t ids-kl/derekovecs .
mkdir config
cp example.conf config/derekovecs.conf
docker run -d=false -p 3000:3000 --rm -v $(pwd)/config:/config:z ids-kl/derekovecs
See Changelog
Author: Marc Kupietz
Copyright (c) 2016-2023, Leibniz Institute for the German Language, Mannheim, Germany
DeReKoVecs is published under the Apache 2.0 License.
If you use DeReKoVecs or its results in a scientific publication, please cite at least Fankhauser / Kupietz (2022).
Belica, Cyril (2011): Semantische Nähe als Ähnlichkeit von Kookkurrenzprofilen. In: Andrea Abel, Renata Zanin (eds.), Korpora in Lehre und Forschung, pp. 155-178. Bozen-Bolzano University Press. Freie Universität Bozen-Bolzano.
Bouma, Gerlof (2009): Normalized (pointwise) mutual information in collocation extraction. In: Proceedings of GSCL.
Daille, B. (1994): Approche mixte pour l’extraction automatique de terminologie: statistiques lexicales et filtres linguistiques. PhD thesis, Université Paris 7.
Fankhauser, Peter / Kupietz, Marc (2022): Count-Based and Predictive Language Models for Exploring DeReKo. In: Proceedings of the LREC 2022 Workshop on Challenges in the Management of Large Corpora (CMLC-10 2022). Paris/Marseille: ELRA. pp. 27-31.
Fankhauser, Peter / Kupietz, Marc (2017): Visualizing Language Change in a Corpus of Contemporary German. In: Proceedings of the 9th International Corpus Linguistics Conference. Birmingham: University of Birmingham.
Fankhauser, Peter / Kupietz, Marc (2019): Analyzing domain specific word embeddings for a large corpus of contemporary German. International Corpus Linguistics Conference, Cardiff, Wales, UK, July 22-26, 2019. 6 pp.
Keibel, H. / Belica, C. (2007): CCDB: A Corpus-Linguistic Research and Development Workbench. In: Proceedings of the 4th Corpus Linguistics Conference (CL 2007). Birmingham: University of Birmingham.
Kupietz, M. / Belica, C. / Keibel, H. / Witt, A. (2010): The German Reference Corpus DeReKo: A primordial sample for linguistic research. In: Calzolari, N. et al. (eds.): Proceedings of the seventh conference on International Language Resources and Evaluation (LREC 2010). Paris: ELRA, 1848-1854.
Kupietz, M. / Lüngen, H. / Kamocki, P. / Witt, A. (2018): German Reference Corpus DeReKo: New Developments – New Opportunities. In: Calzolari, N. et al. (eds.): Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018). Miyazaki: ELRA, 4353-4360.
Ling, Wang / Dyer, C. / Black, A. / Trancoso, I. (2015): Two/too simple adaptations of word2vec for syntax problems. In Proc. of NAACL.
Mikolov, T. / Sutskever, I. / Chen, K. / Corrado, G. S. / Dean, J. (2013): Distributed representations of words and phrases and their compositionality. In: Proceedings of NIPS (Advances in Neural Information Processing Systems) 2013, 3111–3119.
Perkuhn, Rainer / Keibel, Holger / Kupietz, Marc (2012): Korpuslinguistik. Paderborn: Fink, 2012. Addendum
Rychlý, Pavel (2008): A lexicographer-friendly association score. In: Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2008, 6–9.
Thanopoulos, A. / Fakotakis, N. / Kokkinakis, G. (2002): Comparative evaluation of collocation extraction metrics. In: Proc. of LREC 2002: 620–625.