commit | 0551bc4f586ef7d7065ca31ffde49053f233e8c0 | |
---|---|---|
author | Marc Kupietz <kupietz@ids-mannheim.de> | Tue Jul 19 17:50:56 2022 +0200 |
committer | Marc Kupietz <kupietz@ids-mannheim.de> | Tue Jul 19 17:50:56 2022 +0200 |
tree | 925f36034e4fb40f554fcce49784ae170834cbd3 | |
parent | c82b15fd22b3aabe7a59c41a8ced2d0d66de502f | |
Readme: make examples work out of the box

Change-Id: I5131c356d55b6750ac8bcf5dcd7f9360821738f2
Visualizes paradigmatic and syntagmatic relations between words based on wang2vec / structured skip-n-gram (Ling et al. 2015) word embeddings (Mikolov et al. 2013) and word embedding networks.
DeReKoVecs (Fankhauser & Kupietz 2017, 2019; Kupietz et al. 2018) is part of the new open lab of the Corpus Linguistics group at IDS Mannheim. Similar to the Collocation Database CCDB (Keibel & Belica 2007, Belica 2011), DeReKoVecs is used to investigate and compare measurements, dimension reduction procedures, visualizations, etc., in order to track down detailed paradigmatic and syntagmatic relations between words based on their use in very large corpora such as the German Reference Corpus DeReKo (Kupietz et al. 2010).
cpanm https://github.com/Akron/Mojolicious-Plugin-Localize.git
cpanm --installdeps .
perl Makefile.PL
make
make install
A detailed, known-to-work installation procedure can also be found in the GitLab CI pipeline script.
Please note that IDS::DeReKoVecs::Read is not yet stable and is not yet recommended for use.
You can build your own models with dereko2vec.
MOJO_CONFIG=$(pwd)/example.conf morbo script/derekovecs-server
MOJO_CONFIG=$(pwd)/example.conf hypnotoad script/derekovecs-server
The web user interface will then be available, for example, at http://localhost:3000
In addition to the web user interface, DeReKoVecs also provides a web API, which is, however, still unsystematic and not yet stable.
Command | Parameters | Description |
---|---|---|
/ | word, n, dedupe, cutoff, json=1 | get paradigmatic and syntagmatic neighbours from word embeddings |
getCollocationAssociation | w, c | get association scores for specific node-collocate pairs |
getClassicCollocators | w | get count-based collocates of word w |
GET 'http://localhost:3000/?word=Grund&n=10&dedupe=0&sort=0&cutoff=1000000&json=1' | json_pp | less
curl -L 'http://localhost:3000/getClassicCollocators?w=Grund'
GET 'http://localhost:3000/getCollocationAssociation?w=Grund&c=diesem'
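The same requests can also be issued from a script. The following is a minimal Python sketch, not part of DeReKoVecs itself: the helper names `build_url`, `fetch_text` and `neighbours` are hypothetical, and the base URL assumes a server started locally as described above. Only the neighbour query with `json=1` is parsed as JSON, since that is the one response piped through `json_pp` above; the other commands are returned as raw text because their response format is not documented here. As the API is not stable, treat this as an illustration only.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the server runs locally on the default port used above.
BASE_URL = "http://localhost:3000"


def build_url(command: str = "", base_url: str = BASE_URL, **params) -> str:
    """Build a request URL for one of the commands listed in the table.

    `command` is "" for the neighbour query, or e.g. "getClassicCollocators".
    Keyword arguments become query parameters (percent-encoded as UTF-8).
    """
    url = f"{base_url.rstrip('/')}/{command.lstrip('/')}"
    query = urlencode(params)
    return f"{url}?{query}" if query else url


def fetch_text(url: str, timeout: float = 30.0) -> str:
    """Fetch a URL and return the body as text (redirects are followed)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def neighbours(word: str, n: int = 10, dedupe: int = 0, cutoff: int = 1000000):
    """Query "/" with json=1 and return the decoded JSON response."""
    url = build_url("", word=word, n=n, dedupe=dedupe, cutoff=cutoff, json=1)
    return json.loads(fetch_text(url))


if __name__ == "__main__":
    # Requires a running server; prints the URLs and the raw responses.
    print(build_url("getCollocationAssociation", w="Grund", c="diesem"))
    print(fetch_text(build_url("getClassicCollocators", w="Grund")))
    print(neighbours("Grund"))
```

Building the URL with `urlencode` rather than by string concatenation ensures that words containing umlauts or other non-ASCII characters are encoded correctly.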
docker build -t ids-kl/derekovecs .
mkdir config
cp example.conf config/derekovecs.conf
docker run -d=false -p 3000:3000 --rm -v $(pwd)/config:/config:z ids-kl/derekovecs
Author: Marc Kupietz
Copyright (c) 2016-2022, Leibniz Institute for the German Language, Mannheim, Germany
DeReKoVecs is published under the Apache 2.0 License.
Belica, C. (2011): Semantische Nähe als Ähnlichkeit von Kookkurrenzprofilen. In: Abel, A., Zanin, R. (eds.): Korpora in Lehre und Forschung, pp. 155-178. Bozen-Bolzano University Press. Freie Universität Bozen-Bolzano.
Fankhauser, P., Kupietz, M. (2022): Count-Based and Predictive Language Models for Exploring DeReKo. In: Proceedings of the LREC 2022 Workshop on Challenges in the Management of Large Corpora (CMLC-10 2022). Paris/Marseille: ELRA. pp. 27-31.
Fankhauser, P., Kupietz, M. (2017): Visualizing Language Change in a Corpus of Contemporary German. In: Proceedings of the 9th International Corpus Linguistics Conference. Birmingham: University of Birmingham.
Fankhauser, P., Kupietz, M. (2019): Analyzing domain specific word embeddings for a large corpus of contemporary German. International Corpus Linguistics Conference, Cardiff, Wales, UK, July 22-26, 2019. 6 pp.
Keibel, H., Belica, C. (2007): CCDB: A Corpus-Linguistic Research and Development Workbench. In: Proceedings of the 4th Corpus Linguistics Conference (CL 2007). Birmingham: University of Birmingham.
Kupietz, M., Belica, C., Keibel, H., Witt, A. (2010): The German Reference Corpus DeReKo: A primordial sample for linguistic research. In: Calzolari, N. et al. (eds.): Proceedings of the seventh conference on International Language Resources and Evaluation (LREC 2010). Paris: ELRA, 1848-1854.
Kupietz, M., Lüngen, H., Kamocki, P., Witt, A. (2018): German Reference Corpus DeReKo: New Developments – New Opportunities. In: Calzolari, N. et al. (eds.): Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018). Miyazaki: ELRA, 4353-4360.
Ling, W., Dyer, C., Black, A., Trancoso, I. (2015): Two/too simple adaptations of word2vec for syntax problems. In: Proc. of NAACL.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., Dean, J. (2013): Distributed representations of words and phrases and their compositionality. In: Proceedings of NIPS (Advances in Neural Information Processing Systems) 2013, 3111–3119.
@InProceedings{Ling:2015:naacl,
  author = {Ling, Wang and Dyer, Chris and Black, Alan and Trancoso, Isabel},
  title = "Two/Too Simple Adaptations of word2vec for Syntax Problems",
  booktitle = "Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
  year = "2015",
  publisher = "Association for Computational Linguistics",
  location = "Denver, Colorado",
}

@InProceedings{FankhauserKupietz2019,
  author = {Peter Fankhauser and Marc Kupietz},
  title = {Analyzing domain specific word embeddings for a large corpus of contemporary German},
  series = {Proceedings of the 10th International Corpus Linguistics Conference},
  publisher = {University of Cardiff},
  address = {Cardiff},
  year = {2019},
  note = {\url{https://doi.org/10.14618/ids-pub-9117}}
}