commit	04784b96d4ac2e3a57e1bf4e503c808881d0a4e4
author	Marc Kupietz <kupietz@ids-mannheim.de>	Sun May 04 13:38:12 2025 +0200
committer	Marc Kupietz <kupietz@ids-mannheim.de>	Sun May 04 13:41:10 2025 +0200
tree	fd9f6908c479e09ad768d6d2995d58ee53026ce5

README.md

pyderekovecs

A Python client package that makes the DeReKoVecs web service API accessible from Python.

Installation

pip install git+https://korap.ids-mannheim.de/gerrit/IDS-Mannheim/pyderekovecs.git

Or clone the repository and install locally:

git clone https://korap.ids-mannheim.de/gerrit/IDS-Mannheim/pyderekovecs.git
cd pyderekovecs
pip install -e .

Usage

import pyderekovecs as pd

# Get paradigmatic neighbors for a word
neighbors = pd.paradigmatic_neighbours("Haus")
print(neighbors.head())

# Get syntagmatic neighbors
collocates = pd.syntagmatic_neighbours("Haus")
print(collocates.head())

# Get word embedding
embedding = pd.word_embedding("Haus")
print(len(embedding))  # Should be 200

# Calculate cosine similarity between two words
similarity = pd.cosine_similarity("Haus", "Gebäude")
print(f"Similarity: {similarity}")

Accessing other DeReKoVecs instances

KoKoKom

import os
os.environ["DEREKOVECS_SERVER"] = "https://corpora.ids-mannheim.de/openlab/kokokomvecs"

CoRoLa (Contemporary Reference Corpus of the Romanian Language)

import os
os.environ["DEREKOVECS_SERVER"] = "https://corpora.ids-mannheim.de/openlab/corolavecs"
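Switching between instances can be wrapped in a small helper. The following is a minimal sketch and not part of pyderekovecs: the `INSTANCES` mapping and the `use_instance` function are hypothetical names, and it assumes the package reads `DEREKOVECS_SERVER` when a request is made rather than once at import time. If the variable is only read at import, set it before importing the package instead.

```python
import os
from contextlib import contextmanager

# Server URLs copied from the examples above; the short keys are illustrative only.
INSTANCES = {
    "kokokom": "https://corpora.ids-mannheim.de/openlab/kokokomvecs",
    "corola": "https://corpora.ids-mannheim.de/openlab/corolavecs",
}


@contextmanager
def use_instance(name):
    """Temporarily point DEREKOVECS_SERVER at a named instance (hypothetical helper)."""
    previous = os.environ.get("DEREKOVECS_SERVER")
    os.environ["DEREKOVECS_SERVER"] = INSTANCES[name]
    try:
        yield INSTANCES[name]
    finally:
        # Restore the earlier setting, or unset the variable if there was none.
        if previous is None:
            os.environ.pop("DEREKOVECS_SERVER", None)
        else:
            os.environ["DEREKOVECS_SERVER"] = previous
```

Inside a `with use_instance("corola"):` block the variable holds the CoRoLa URL. On exit, the previous setting is restored.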

Available Functions

syntagmatic_neighbours(word, **params): Get the syntagmatic neighbour predictions of a word
countbased_collocates(w, **params): Get the collocates of a word in the count-based DeReKo model
word_frequency(w, **params): Get the absolute frequency of a word in the corpus
corpus_size(w, **params): Get the token size of the corpus used to train the model
paradigmatic_neighbours(word, **params): Get the paradigmatic neighbours of a word
word_embedding(word, **params): Get the normalized embedding vector of a word
frequency_rank(word, **params): Get the frequency rank of a word in the training data
server_version(): Get the version of the DeReKoVecs server
vocab_size(): Get the vocabulary size of the model
model_name(): Get the name of the model
collocation_scores(w, c, **params): Calculate the association scores between a node and a collocate
cosine_similarity(w1, w2, **params): Calculate the cosine similarity between two words
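Since `word_embedding` is described as returning a normalized vector, the cosine similarity of two such vectors reduces to their dot product, assuming "normalized" means unit Euclidean length. The sketch below illustrates that relationship offline with made-up three-dimensional vectors rather than real 200-dimensional embeddings. `normalize` and `cosine` are hypothetical helpers, not package functions, and whether the server computes `cosine_similarity` this way is an assumption.

```python
import math


def normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity of two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for two word embeddings.
u = normalize([1.0, 2.0, 2.0])
v = normalize([2.0, 1.0, 2.0])

# For unit-length vectors the plain dot product equals the cosine similarity.
dot_uv = sum(x * y for x, y in zip(u, v))
print(dot_uv, cosine(u, v))
```

Both printed values agree up to floating-point rounding, so embeddings fetched once could be compared locally without a further request per word pair.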

Development

To run tests:

python -m unittest discover tests