commit 41c1e3abb8d6cb6f86238db1a98ac4f377de337e
author/committer: feldmueller <feldmueller@posteo.de>, Mon May 19 14:46:32 2025 +0200
commit message: fix test for word_embedding()
pyderekovecs is a Python client package for the DeReKoVecs (Fankhauser & Kupietz, 2022) web service API.
Install directly from the Git repository:

```shell
pip install git+https://korap.ids-mannheim.de/gerrit/IDS-Mannheim/pyderekovecs.git
```
Or clone the repository and install locally:

```shell
git clone https://korap.ids-mannheim.de/gerrit/IDS-Mannheim/pyderekovecs.git
cd pyderekovecs
pip install -e .
```
```python
import pyderekovecs as pdv

# Get paradigmatic neighbours for a word
neighbors = pdv.paradigmatic_neighbours("Haus")
print(neighbors.head())

# Get syntagmatic neighbours
collocates = pdv.syntagmatic_neighbours("Haus")
print(collocates.head())

# Get word embedding
embedding = pdv.word_embedding("Haus")
print(len(embedding))  # Should be 200

# Calculate cosine similarity between two words
similarity = pdv.cosine_similarity("Haus", "Gebäude")
print(f"Similarity: {similarity}")
```
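The `cosine_similarity` call above is computed server-side, but the measure itself is the standard cosine of the angle between two embedding vectors (of the kind returned by `word_embedding`). A minimal local sketch using only the standard library (`cosine_sim` is a hypothetical helper for illustration, not part of the package):

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Parallel vectors give a similarity near 1.0, orthogonal vectors 0.0.
print(cosine_sim([1.0, 2.0], [2.0, 4.0]))  # ≈ 1.0
print(cosine_sim([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Because `word_embedding` returns normalized vectors, the cosine reduces to a plain dot product in that case.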
```python
import os
os.environ["DEREKOVECS_SERVER"] = "https://corpora.ids-mannheim.de/openlab/kokokomvecs"
```
Based on the 2015 edition of the English Wikipedia article and talk pages corpus wxe15 (see Margaretha & Lüngen, 2014).
```python
import os
os.environ["DEREKOVECS_SERVER"] = "https://corpora.ids-mannheim.de/openlab/enwikivecs"

import pyderekovecs as envecs

neighbors = envecs.paradigmatic_neighbours("runs")
print(neighbors)
```
```python
import os
os.environ["DEREKOVECS_SERVER"] = "https://corpora.ids-mannheim.de/openlab/corolavecs"
```
The package provides the following functions:

`syntagmatic_neighbours(word, **params)`
: Get the syntagmatic neighbour predictions of a word

`countbased_collocates(w, **params)`
: Get the collocates of a word in the count-based DeReKo model

`word_frequency(w, **params)`
: Get the absolute frequency of a word in the corpus

`corpus_size(w, **params)`
: Get the token size of the corpus used to train the model

`paradigmatic_neighbours(word, **params)`
: Get the paradigmatic neighbours of a word

`word_embedding(word, **params)`
: Get the normalized embedding vector of a word

`frequency_rank(word, **params)`
: Get the frequency rank of a word in the training data

`server_version()`
: Get the version of the DeReKoVecs server

`vocab_size()`
: Get the vocabulary size of the model

`model_name()`
: Get the name of the model

`collocation_scores(w, c, **params)`
: Calculate the association scores between a node and a collocate

`cosine_similarity(w1, w2, **params)`
: Calculate the cosine similarity between two words

To run tests:
```shell
python -m unittest discover tests
```
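The association scores returned by `collocation_scores` are computed by the DeReKoVecs server; the exact measures it reports are defined there. As a generic illustration of how such a score can be derived from corpus counts, here is pointwise mutual information (PMI), a common association measure (this is a sketch of the general idea, not necessarily one of the scores the service returns):

```python
import math

def pmi(f_node: int, f_colloc: int, f_pair: int, n: int) -> float:
    """Pointwise mutual information in bits:
    log2( P(node, colloc) / (P(node) * P(colloc)) ),
    with probabilities estimated from raw corpus counts."""
    p_pair = f_pair / n
    p_node = f_node / n
    p_colloc = f_colloc / n
    return math.log2(p_pair / (p_node * p_colloc))

# Toy counts: the node occurs 1000 times, the collocate 500 times,
# and they co-occur 100 times in a corpus of 1,000,000 tokens.
print(round(pmi(1000, 500, 100, 1_000_000), 2))  # 7.64
```

A positive PMI indicates the pair co-occurs more often than chance would predict; the corresponding corpus quantities can be obtained from `word_frequency` and `corpus_size`.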
Fankhauser, Peter/Kupietz, Marc (2022): Count-based and predictive language models for exploring DeReKo. In: Bański, Piotr/Barbaresi, Adrien/Clematide, Simon/Kupietz, Marc/Lüngen, Harald (eds.): Proceedings of the LREC 2022 Workshop on Challenges in the Management of Large Corpora (CMLC-10 2022). Marseille, 20 June 2022. Paris: European Language Resources Association (ELRA), pp. 27–31. https://aclanthology.org/2022.cmlc-1.5/
Margaretha, Eliza/Lüngen, Harald (2014): Building linguistic corpora from Wikipedia articles and discussions. Journal for Language Technology and Computational Linguistics, 29(2), pp. 59–82. https://doi.org/10.21248/jlcl.29.2014.189.