
DeReKoVecs (server and web app)

Visualizes paradigmatic and syntagmatic relations between words based on wang2vec / structured skip-n-gram (Ling et al. 2015) word embeddings (Mikolov et al. 2013) and word embedding networks.

DeReKoVecs (Fankhauser & Kupietz 2017, 2019; Kupietz et al. 2018) is part of the new open lab of the Corpus Linguistics group at IDS Mannheim. Like the Collocation Database CCDB (Keibel & Belica 2007; Belica 2011), it is used to investigate and compare measurements, dimension reduction procedures, visualizations, etc., in order to track down detailed paradigmatic and syntagmatic relations between words based on their use in very large corpora such as the German Reference Corpus DeReKo (Kupietz et al. 2010).

Installation

Dependencies

Build and install

cpanm https://github.com/Akron/Mojolicious-Plugin-Localize.git
cpanm --installdeps .

perl Makefile.PL
make
make install

A detailed, known-to-work installation procedure can also be found in the GitLab CI pipeline script.

Please note that IDS::DeReKoVecs::Read is not yet stable and is not recommended for use.

Build your own models

You can build your own models with dereko2vec.

Run

Debugging mode

MOJO_CONFIG=example.conf morbo script/derekovecs-server

Production mode

MOJO_CONFIG=example.conf hypnotoad script/derekovecs-server

The web user interface will then be available, for example, at http://localhost:3000
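
Because hypnotoad starts the server in the background, a script that goes on to send requests may want to wait until the server answers. A minimal sketch, assuming curl is installed; the function name dv_wait and its defaults are made up for illustration and are not part of DeReKoVecs:

```shell
# Hypothetical helper: poll a URL until it answers or the attempts run out.
# Usage: dv_wait [URL] [ATTEMPTS]
dv_wait() {
  dv_url=${1:-http://localhost:3000}
  dv_tries=${2:-30}
  while [ "$dv_tries" -gt 0 ]; do
    # -f makes curl fail on HTTP errors, -s silences progress output
    if curl -sf -o /dev/null "$dv_url"; then
      return 0
    fi
    dv_tries=$((dv_tries - 1))
    sleep 1
  done
  return 1
}
```

It could be used as `dv_wait http://localhost:3000 30 && echo "server is up"` right after starting the server.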

Web Service API

In addition to the web user interface, DeReKoVecs also provides a web API, which is, however, still unsystematic and not stable.

| Command | Parameters | Description |
|---|---|---|
| / | word, n, dedupe, cutoff, json=1 | get paradigmatic and syntagmatic neighbours, from word embeddings |
| getCollocationAssociation | w, c | get association scores for specific node collocate pairs |
| getClassicCollocators | w | get count based collocates of word w |

Examples

GET 'http://localhost:3000/?word=Grund&n=10&dedupe=0&sort=0&cutoff=1000000&json=1' | json_pp | less
curl -L 'http://localhost:3000/getClassicCollocators?w=Grund'
GET 'http://localhost:3000/getCollocationAssociation?w=Grund&c=diesem'
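
The three calls above can be wrapped in small shell functions. The following is a minimal sketch, not part of DeReKoVecs: the function names and the DEREKOVECS_URL variable are made up for illustration, and it assumes that curl is installed and that the server runs at http://localhost:3000 as in the examples. Using curl's -G together with --data-urlencode means that words containing umlauts or spaces are percent-encoded automatically.

```shell
# Base URL of a running derekovecs-server (assumption: same as in the examples).
: "${DEREKOVECS_URL:=http://localhost:3000}"

# Hypothetical helper: neighbours of a word from the word embeddings, as JSON.
# Usage: dv_neighbours WORD [N]
dv_neighbours() {
  curl -sL -G "$DEREKOVECS_URL/" \
    --data-urlencode "word=$1" \
    --data-urlencode "n=${2:-10}" \
    --data-urlencode "json=1"
}

# Hypothetical helper: count-based collocates of a word.
# Usage: dv_collocators WORD
dv_collocators() {
  curl -sL -G "$DEREKOVECS_URL/getClassicCollocators" \
    --data-urlencode "w=$1"
}

# Hypothetical helper: association scores for a node/collocate pair.
# Usage: dv_association NODE COLLOCATE
dv_association() {
  curl -sL -G "$DEREKOVECS_URL/getCollocationAssociation" \
    --data-urlencode "w=$1" \
    --data-urlencode "c=$2"
}
```

For example, `dv_neighbours Grund 10 | json_pp | less` corresponds to the first call above without the optional dedupe, sort and cutoff parameters.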

Build and run using docker / podman

Build image

docker build -t ids-kl/derekovecs .

Copy configuration to make it mountable

mkdir config
cp example.conf config/derekovecs.conf

Run

docker run -d=false -p 3000:3000 --rm -v "$(pwd)/config:/config:z" ids-kl/derekovecs
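
The heading mentions podman, while the commands above only show docker. A minimal sketch of a wrapper that takes the engine as a parameter; the function name and the host-port argument are assumptions for illustration, and it relies on podman accepting the same run options used here (--rm, -p, and -v with the :z label). The image has to be built with the same engine first.

```shell
# Hypothetical wrapper: run the image with docker or podman.
# Usage: dv_container_run [ENGINE] [HOST_PORT]
dv_container_run() {
  dv_engine=${1:-docker}
  dv_port=${2:-3000}
  # Container port 3000 and the /config mount point are taken from the command above.
  "$dv_engine" run --rm \
    -p "$dv_port:3000" \
    -v "$(pwd)/config:/config:z" \
    ids-kl/derekovecs
}
```

Calling `dv_container_run podman 8080` would then serve the web interface on host port 8080.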

Development and License

Author: Marc Kupietz

Copyright (c) 2016-2022, Leibniz Institute for the German Language, Mannheim, Germany

DeReKoVecs is published under the Apache 2.0 License.

References

Belica, C. (2011): Semantische Nähe als Ähnlichkeit von Kookkurrenzprofilen. In: Abel, A., Zanin, R. (eds.): Korpora in Lehre und Forschung. Bozen-Bolzano University Press, Freie Universität Bozen-Bolzano, pp. 155-178.

Fankhauser, P., Kupietz, M. (2022): Count-Based and Predictive Language Models for Exploring DeReKo. In: Proceedings of the LREC 2022 Workshop on Challenges in the Management of Large Corpora (CMLC-10 2022). Paris/Marseille: ELRA, pp. 27-31.

Fankhauser, P., Kupietz, M. (2017): Visualizing Language Change in a Corpus of Contemporary German. In: Proceedings of the 9th International Corpus Linguistics Conference. Birmingham: University of Birmingham.

Fankhauser, P., Kupietz, M. (2019): Analyzing domain specific word embeddings for a large corpus of contemporary German. International Corpus Linguistics Conference, Cardiff, Wales, UK, July 22-26, 2019. 6 pp.

Keibel, H., Belica, C. (2007): CCDB: A Corpus-Linguistic Research and Development Workbench. In: Proceedings of the 4th Corpus Linguistics Conference (CL 2007). Birmingham: University of Birmingham.

Kupietz, M., Belica, C., Keibel, H., Witt, A. (2010): The German Reference Corpus DeReKo: A primordial sample for linguistic research. In: Calzolari, N. et al. (eds.): Proceedings of the seventh conference on International Language Resources and Evaluation (LREC 2010). Paris: ELRA, 1848-1854.

Kupietz, M., Lüngen, H., Kamocki, P., Witt, A. (2018): German Reference Corpus DeReKo: New Developments – New Opportunities. In: Calzolari, N. et al. (eds.): Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018). Miyazaki: ELRA, 4353-4360.

Ling, W., Dyer, C., Black, A., Trancoso, I. (2015): Two/too simple adaptations of word2vec for syntax problems. In: Proceedings of NAACL.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., Dean, J. (2013): Distributed representations of words and phrases and their compositionality. In: Proceedings of NIPS (Advances in Neural Information Processing Systems) 2013, 3111–3119.

BibTeX

@InProceedings{Ling:2015:naacl,
    author    = {Ling, Wang and Dyer, Chris and Black, Alan and Trancoso, Isabel},
    title     = {Two/Too Simple Adaptations of word2vec for Syntax Problems},
    booktitle = {Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
    year      = {2015},
    publisher = {Association for Computational Linguistics},
    location  = {Denver, Colorado}
}

@InProceedings{FankhauserKupietz2019,
    author    = {Peter Fankhauser and Marc Kupietz},
    title     = {Analyzing domain specific word embeddings for a large corpus of contemporary German},
    booktitle = {Proceedings of the 10th International Corpus Linguistics Conference},
    publisher = {University of Cardiff},
    address   = {Cardiff},
    year      = {2019},
    note      = {\url{https://doi.org/10.14618/ids-pub-9117}}
}