pyderekovecs

A client package that makes the DeReKoVecs (Fankhauser & Kupietz, 2022) web service API accessible from Python.

Installation

pip install git+https://korap.ids-mannheim.de/gerrit/IDS-Mannheim/pyderekovecs.git

Or clone the repository and install locally:

git clone https://korap.ids-mannheim.de/gerrit/IDS-Mannheim/pyderekovecs.git
cd pyderekovecs
pip install -e .
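
To check that the installation worked and that the default DeReKoVecs server is reachable, you can print the server version. This is just a quick sanity check using the server_version() function listed under Available Functions below:

python -c "import pyderekovecs as pdv; print(pdv.server_version())"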

Usage

import pyderekovecs as pdv

# Get paradigmatic neighbors for a word
neighbors = pdv.paradigmatic_neighbours("Haus")
print(neighbors.head())

# Get syntagmatic neighbors
collocates = pdv.syntagmatic_neighbours("Haus")
print(collocates.head())

# Get word embedding
embedding = pdv.word_embedding("Haus")
print(len(embedding))  # Should be 200

# Calculate cosine similarity between two words
similarity = pdv.cosine_similarity("Haus", "Gebäude")
print(f"Similarity: {similarity}")

Accessing other DeReKoVecs instances

KoKoKom

import os
os.environ["DEREKOVECS_SERVER"] = "https://corpora.ids-mannheim.de/openlab/kokokomvecs"

enwiki

Based on the 2015 edition of the English Wikipedia article and talk pages corpus wxe15 (see Margaretha & Lüngen, 2014).

import os
import pyderekovecs as envecs
os.environ["DEREKOVECS_SERVER"] = "https://corpora.ids-mannheim.de/openlab/enwikivecs"
neighbors = envecs.paradigmatic_neighbours("runs")
print(neighbors)

CoRoLa (Contemporary Reference Corpus of the Romanian Language)

import os
os.environ["DEREKOVECS_SERVER"] = "https://corpora.ids-mannheim.de/openlab/corolavecs"

Available Functions

  • syntagmatic_neighbours(word, **params): Get the syntagmatic neighbour predictions of a word
  • countbased_collocates(w, **params): Get the collocates of a word in the count-based DeReKo model
  • word_frequency(w, **params): Get the absolute frequency of a word in the corpus
  • corpus_size(w, **params): Get the size in tokens of the corpus used to train the model
  • paradigmatic_neighbours(word, **params): Get the paradigmatic neighbours of a word
  • word_embedding(word, **params): Get the normalized embedding vector of a word
  • frequency_rank(word, **params): Get the frequency rank of a word in the training data
  • server_version(): Get the version of the DeReKoVecs server
  • vocab_size(): Get the vocabulary size of the model
  • model_name(): Get the name of the model
  • collocation_scores(w, c, **params): Calculate the association scores between a node and a collocate (see the example after this list)
  • cosine_similarity(w1, w2, **params): Calculate the cosine similarity between two words
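
For example, association scores for a node/collocate pair and the collocates from the count-based model can be retrieved as follows (a sketch; the word pair is only an illustration, and the results are simply printed because their exact shape is not specified here):

import pyderekovecs as pdv

# Association scores between the node "Haus" and the collocate "bauen"
print(pdv.collocation_scores("Haus", "bauen"))

# Collocates of "Haus" in the count-based DeReKo model
print(pdv.countbased_collocates("Haus"))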

Development

To run tests:

python -m unittest discover tests
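
A minimal test case might, for example, check the dimensionality of an embedding vector (a sketch only; it assumes the default model returns 200-dimensional vectors, as in the Usage example above):

import unittest
import pyderekovecs as pdv

class TestWordEmbedding(unittest.TestCase):
    def test_embedding_dimension(self):
        # The default DeReKoVecs model is expected to return 200-dimensional vectors.
        embedding = pdv.word_embedding("Haus")
        self.assertEqual(len(embedding), 200)

if __name__ == "__main__":
    unittest.main()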

References

Fankhauser, Peter/Kupietz, Marc (2022): Count-based and predictive language models for exploring DeReKo. In: Bański, Piotr/Barbaresi, Adrien/Clematide, Simon/Kupietz, Marc/Lüngen, Harald (eds.): Proceedings of the LREC 2022 Workshop on Challenges in the Management of Large Corpora (CMLC-10 2022). Marseille, 20 June 2022. Paris: European Language Resources Association (ELRA), pp. 27–31. https://aclanthology.org/2022.cmlc-1.5/

Margaretha, Eliza/Lüngen, Harald (2014): Building linguistic corpora from Wikipedia articles and discussions. Journal for Language Technology and Computational Linguistics, 29(2), pp. 59–82. https://doi.org/10.21248/jlcl.29.2014.189