# pyderekovecs

A Python client package that makes the DeReKoVecs web service API accessible from Python.
## Installation

```shell
pip install git+https://korap.ids-mannheim.de/gerrit/IDS-Mannheim/pyderekovecs.git
```
Or clone the repository and install locally:
```shell
git clone https://korap.ids-mannheim.de/gerrit/IDS-Mannheim/pyderekovecs.git
cd pyderekovecs
pip install -e .
```
## Usage

```python
import pyderekovecs as pd

# Get paradigmatic neighbours for a word
neighbors = pd.paradigmatic_neighbours("Haus")
print(neighbors.head())

# Get syntagmatic neighbours
collocates = pd.syntagmatic_neighbours("Haus")
print(collocates.head())

# Get the word embedding
embedding = pd.word_embedding("Haus")
print(len(embedding))  # Should be 200

# Calculate the cosine similarity between two words
similarity = pd.cosine_similarity("Haus", "Gebäude")
print(f"Similarity: {similarity}")
```
## Using other models

To query a different DeReKoVecs model, point the client at the corresponding server by setting the `DEREKOVECS_SERVER` environment variable, e.g.:

```python
import os

os.environ["DEREKOVECS_SERVER"] = "https://corpora.ids-mannheim.de/openlab/kokokomvecs"
```
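To check which model a call will actually hit after switching servers, the `model_name()` and `server_version()` functions documented below can be used; a minimal sketch:

```python
import os
import pyderekovecs as pd

# The server is read from the environment, so it can be switched at runtime
# (the English example below also sets it after importing the package).
os.environ["DEREKOVECS_SERVER"] = "https://corpora.ids-mannheim.de/openlab/kokokomvecs"

print(pd.model_name())      # name of the model behind the configured server
print(pd.server_version())  # version of the derekovecs server answering the calls
```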
### English

Based on the 2015 edition of the English Wikipedia article and talk pages corpus wxe15 (see Margaretha & Lüngen, 2014).
```python
import pyderekovecs as envecs
import os

os.environ["DEREKOVECS_SERVER"] = "https://corpora.ids-mannheim.de/openlab/enwikivecs"

neighbors = envecs.paradigmatic_neighbours("runs")
print(neighbors)
```
### CoRoLa

Other models can be selected in the same way, e.g.:

```python
import os

os.environ["DEREKOVECS_SERVER"] = "https://corpora.ids-mannheim.de/openlab/corolavecs"
```
## Functions

The package provides the following functions (a combined usage sketch follows the list):

syntagmatic_neighbours(word, **params)
: Get the syntagmatic neighbour predictions of a word

countbased_collocates(w, **params)
: Get the collocates of a word in the count-based dereko model

word_frequency(w, **params)
: Get the absolute frequency of a word in the corpus

corpus_size(w, **params)
: Get the token size of the corpus used to train the model

paradigmatic_neighbours(word, **params)
: Get the paradigmatic neighbours of a word

word_embedding(word, **params)
: Get the normalized embedding vector of a word

frequency_rank(word, **params)
: Get the frequency rank of a word in the training data

server_version()
: Get the version of the derekovecs server

vocab_size()
: Get the vocabulary size of the model

model_name()
: Get the name of the model

collocation_scores(w, c, **params)
: Calculate the association scores between a node and a collocate

cosine_similarity(w1, w2, **params)
: Calculate the cosine similarity between two words
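A sketch combining several of these functions; the collocate `"bauen"` and the assumption that the frequency and similarity functions return plain numbers (as the quick-start example suggests) are illustrative only:

```python
import pyderekovecs as pd

# Relative frequency: absolute frequency divided by the corpus token count.
freq = pd.word_frequency("Haus")
size = pd.corpus_size("Haus")
print(f"Relative frequency of 'Haus': {freq / size}")

# Association scores between a node and one of its collocates.
print(pd.collocation_scores("Haus", "bauen"))

# Compare candidate near-synonyms by cosine similarity.
for candidate in ("Gebäude", "Wohnung"):
    print(candidate, pd.cosine_similarity("Haus", candidate))
```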
## Tests

To run tests:

```shell
python -m unittest discover tests
```
## References

Margaretha, E., & Lüngen, H. (2014). Building linguistic corpora from Wikipedia articles and discussions. Journal for Language Technology and Computational Linguistics, 29(2), 59–82. https://doi.org/10.21248/jlcl.29.2014.189