

A fork of wang2vec with extensions for retraining and count-based models, support for tokens with frequencies > 2³², and a more accurate ETA estimate.



Build and install

cd dereko2vec
mkdir build
cd build
cmake ..
make && ctest3 --extra-verbose && sudo make install


The command to build word embeddings is the same as in the original version, except that type 5 has been added for building a purely count-based collocation database.

The -type argument is an integer that selects the architecture to use. The possible values are:
0 - cbow
1 - skipngram
2 - cwindow (see below)
3 - structured skipngram (see below)
4 - Collobert's SENNA context window model (still experimental)
5 - build a collocation count database instead of word embeddings


./dereko2vec -train input_file -output embedding_file -type 0 -size 50 -window 5 -negative 10 -nce 0 -hs 0 -sample 1e-4 -threads 1 -binary 1 -iter 5 -cap 0
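
When running several configurations, it can help to assemble the command line from named values and inspect it before launching. The following is a minimal sketch, not part of dereko2vec: the function name `dereko2vec_cmd` and its argument order are hypothetical, and it uses only the options shown in the example above.

```shell
# Hypothetical helper (not part of dereko2vec): prints a training command
# built from the options in the example above, so it can be reviewed first.
# Usage: dereko2vec_cmd INPUT OUTPUT TYPE [SIZE] [WINDOW]
dereko2vec_cmd() {
    input=$1
    output=$2
    arch=$3
    size=${4:-50}      # default taken from the example above
    window=${5:-5}     # default taken from the example above

    # -type must be one of the architectures 0..5 listed above.
    case $arch in
        [0-5]) ;;
        *) echo "unsupported -type: $arch" >&2; return 1 ;;
    esac

    printf './dereko2vec -train %s -output %s -type %s -size %s -window %s' \
        "$input" "$output" "$arch" "$size" "$window"
    printf ' -negative 10 -nce 0 -hs 0 -sample 1e-4 -threads 1 -binary 1 -iter 5 -cap 0\n'
}
```

For instance, `dereko2vec_cmd input_file embedding_file 3 200` prints a structured skipngram command with 200 dimensions. The helper only prints the command and does not quote paths, so file names containing spaces would need extra care.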

Generate dereko2vec training input files from KorAP-XML ZIPs

The KorAP-XML-CoNLL-U tool can generate input files for dereko2vec from KorAP-XML ZIPs, using the tokenization and sentence boundary information they contain. For example:

korapxml2conllu --word2vec > wpd19.w2vinput
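
Before starting a long training run, it may be worth checking that the generated file is not empty. The sketch below assumes the usual word2vec input layout of one sentence per line with whitespace-separated tokens. That layout and the function name `w2v_input_stats` are assumptions, not taken from the KorAP-XML-CoNLL-U documentation.

```shell
# Hypothetical helper: prints "<lines> <tokens>" for a training input file.
# Assumes one sentence per line and whitespace-separated tokens.
w2v_input_stats() {
    # NF is the number of whitespace-separated fields on the current line;
    # "+ 0" forces a numeric 0 when the file is empty.
    awk '{ tokens += NF } END { print NR, tokens + 0 }' "$1"
}
```

Calling `w2v_input_stats wpd19.w2vinput` then gives a rough sentence and token count to compare against the expected corpus size.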

Retrain existing model with new data

For example:

dereko2vec -train new.traindata -output new.vecs -save-net -type 3 -size 200 -window 5 -negative 10 -threads 44 -binary 1 -iter 100 -read-vocab old.vocab -read-net
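
Retraining depends on files from the earlier run, and a missing one is cheaper to catch before a 100-iteration job starts. A minimal sketch, assuming the file names from the example above; `check_retrain_inputs` is a hypothetical helper, not a dereko2vec feature.

```shell
# Hypothetical pre-flight check: every file passed as an argument must exist
# and be non-empty. Reports all problems and returns non-zero if any was found.
check_retrain_inputs() {
    status=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            status=1
        fi
    done
    return $status
}
```

It could be used as `check_retrain_inputs new.traindata old.vocab && echo "inputs look fine"` before issuing the retraining command.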


author    = {Ling, Wang and Dyer, Chris and Black, Alan and Trancoso, Isabel},
title     = {Two/Too Simple Adaptations of word2vec for Syntax Problems},
booktitle = {Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
publisher = {Association for Computational Linguistics},
location  = {Denver, Colorado},

author    = {Peter Fankhauser and Marc Kupietz},
title     = {Analyzing domain specific word embeddings for a large corpus of contemporary German},
series = {Proceedings of the 10th International Corpus Linguistics Conference},
publisher = {University of Cardiff},
address   = {Cardiff},
year      = {2019},