commit 3b8d2eff3d4154dba3289b26a24128fa83604ae9
author:    Marc Kupietz <kupietz@ids-mannheim.de>  Wed Jan 31 19:05:28 2024 +0100
committer: Marc Kupietz <kupietz@ids-mannheim.de>  Wed Jan 31 19:05:28 2024 +0100
tree:      e73b5aff6bf56f331f6d27a3cc31c76de119648b
parent:    9ce82fa6f9752462c2392da9e85bbbe0752af196

Update examples in README.md

Change-Id: I58954646e87b1e7bd1e9f2d84beab9555ebe635d
# dereko2vec

Fork of wang2vec with extensions for re-training and count-based models, and a more accurate ETA estimate.
## Build and install

```shell
cd dereko2vec
mkdir build
cd build
cmake ..
make && ctest3 --extra-verbose && sudo make install
```
## Usage

The command to build word embeddings is exactly the same as in the original version, except that we added type 5 for setting up a purely count-based collocation database.
The `-type` argument is an integer that selects the architecture to use. These are the possible values:
0 - cbow
1 - skipngram
2 - cwindow (see below)
3 - structured skipngram (see below)
4 - Collobert's SENNA context window model (still experimental)
5 - build a collocation count database instead of word embeddings
```shell
./dereko2vec -train input_file -output embedding_file -type 3 -size 200 -window 5 -negative 10 -threads 1 -binary 1 -iter 5
```
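Since the invocation is otherwise identical to the original, building a purely count-based collocation database should mainly be a matter of passing `-type 5`. The following is a sketch only: the output name is made up, and which of the embedding-specific flags remain meaningful for type 5 is an assumption.

```shell
# Hypothetical sketch: same call shape as above, with -type 5 selecting
# the collocation count database instead of word embeddings.
./dereko2vec -train input_file -output collocation_counts -type 5 -window 5 -threads 1
```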
The KorAP-XML-CoNLL-U tool can be used to generate input files for dereko2vec from KorAP-XML ZIPs, using its tokenization and sentence boundary information.
```shell
korapxml2conllu --word2vec wpd19.zip > wpd19.w2vinput
```
```shell
korapxml2conllu -m '<creatDate>([^<]{4})' -m '<catRef n="." target="topic.([^.]+)' --word2vec wpd19.zip > wpd19.w2vinput
./dereko2vec -train wpd19.w2vinput -output wpd19.vecs -metadata-categories 2 -type 3 -size 200 -window 5 -negative 10 -threads 1 -binary 1 -iter 10 -min-count 2
```
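The README does not spell out the output format, but tools in the word2vec family write, with `-binary 0`, the standard text format: a header line `vocab_size dimensions`, followed by one word and its vector per line. A minimal sketch of inspecting such a file; the `toy.vecs` contents below are made up for illustration:

```shell
# Simulate a tiny vectors file in word2vec text format (assumed output
# format when training with -binary 0): header "vocab_size dims", then
# one word and its vector components per line.
printf '2 3\nhaus 0.1 0.2 0.3\nbaum 0.4 0.5 0.6\n' > toy.vecs

# Report vocabulary size and dimensionality from the header line:
awk 'NR==1 {print "words:", $1, "dims:", $2}' toy.vecs
# prints: words: 2 dims: 3
```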
## References

```bibtex
@InProceedings{Ling:2015:naacl,
  author    = {Ling, Wang and Dyer, Chris and Black, Alan and Trancoso, Isabel},
  title     = {Two/Too Simple Adaptations of word2vec for Syntax Problems},
  booktitle = {Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
  year      = {2015},
  publisher = {Association for Computational Linguistics},
  location  = {Denver, Colorado},
}

@InProceedings{FankhauserKupietz2019,
  author    = {Peter Fankhauser and Marc Kupietz},
  title     = {Analyzing domain specific word embeddings for a large corpus of contemporary German},
  series    = {Proceedings of the 10th International Corpus Linguistics Conference},
  publisher = {University of Cardiff},
  address   = {Cardiff},
  year      = {2019},
  note      = {\url{https://doi.org/10.14618/ids-pub-9117}}
}
```