dereko2vec (experimental metadata branch)

Fork of wang2vec with extensions for metadata, re-training, count-based models, and a more accurate ETA prediction.

Installation

Dependencies

Build and install

cd dereko2vec
mkdir build
cd build
cmake ..
make && ctest3 --extra-verbose && sudo make install

Run

The command to build word embeddings is mostly the same as in the original version, with two additions: -metadata-categories <num> specifies the number of metadata categories, and -type 5 sets up a purely count-based collocation database.

The -type argument is an integer that selects the architecture to use. The possible values are:
0 - cbow
1 - skipngram
2 - cwindow (see below)
3 - structured skipngram (see below)
4 - Collobert's senna context window model (still experimental)
5 - build a collocation count database instead of word embeddings
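
In wrapper scripts it can help to check the -type value before starting a long training run. The following is a minimal sketch, not part of dereko2vec: the helper name type_name is hypothetical, and it only mirrors the list above.

```shell
# Hypothetical helper (not shipped with dereko2vec): map a -type value
# to the architecture name from the list above, or fail for other values.
type_name() {
  case "$1" in
    0) echo "cbow" ;;
    1) echo "skipngram" ;;
    2) echo "cwindow" ;;
    3) echo "structured skipngram" ;;
    4) echo "senna context window (experimental)" ;;
    5) echo "collocation count database" ;;
    *) echo "unknown -type value: $1" >&2; return 1 ;;
  esac
}

# Example: look up the architecture used in the examples of this README.
type_name 3
```

A script could call type_name "$TYPE" || exit 1 before invoking dereko2vec, so that a mistyped value is caught immediately.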

Example

./dereko2vec -train input_file -output embedding_file -type 3 -size 200 -window 5 -negative 10 -threads 1 -binary 1 -iter 5

Generate dereko2vec training input files from KorAP-XML ZIPs

The KorAP-XML-CoNLL-U tool can be used to generate input files for dereko2vec from KorAP-XML ZIPs, using their tokenization and sentence boundary information.

Example

korapxml2conllu --word2vec wpd19.zip > wpd19.w2vinput

Example with year of creation and topic domain as metadata

korapxml2conllu -m '<creatDate>([^<]{4})' -m '<catRef n="." target="topic.([^.]+)' --word2vec  wpd19.zip > wpd19.w2vinput
./dereko2vec -train wpd19.w2vinput -output wpd19.vecs -metadata-categories 2 -type 3 -size 200 -window 5 -negative 10 -threads 1 -binary 1 -iter 10 -min-count 2
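
To see what the two -m patterns above would capture, they can be tried on a sample string with sed. This is only a sketch under assumptions: the header fragment is made up for illustration (real KorAP-XML headers may look different), the helper first_group is hypothetical, and it is assumed that each pattern contributes its first capture group as one metadata value.

```shell
# Made-up header fragment for illustration only.
header='<creatDate>2019.03.15</creatDate><catRef n="1" target="topic.politik.inland"/>'

# Hypothetical helper: print the first capture group of an extended
# regular expression ($1) matched against a string ($2).
first_group() {
  printf '%s\n' "$2" | sed -nE "s/.*$1.*/\1/p"
}

# The same patterns as passed to korapxml2conllu via -m above.
year=$(first_group '<creatDate>([^<]{4})' "$header")
topic=$(first_group '<catRef n="." target="topic.([^.]+)' "$header")

printf 'year=%s topic=%s\n' "$year" "$topic"
# prints: year=2019 topic=politik
```

The two patterns correspond to the two metadata categories announced to dereko2vec with -metadata-categories 2.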

References

@InProceedings{Ling:2015:naacl,
author = {Ling, Wang and Dyer, Chris and Black, Alan and Trancoso, Isabel},
title = {Two/Too Simple Adaptations of word2vec for Syntax Problems},
booktitle = {Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
year = {2015},
publisher = {Association for Computational Linguistics},
location = {Denver, Colorado},
}

@incollection{fankhauser_count-based_2022,
 address = {Paris},
 title = {Count-based and predictive language models for exploring {DeReKo}},
 isbn = {979-10-95546-83-2},
 url = {http://www.lrec-conf.org/proceedings/lrec2022/workshops/CMLC10/pdf/2022.cmlc10-1.5.pdf},
 abstract = {We present the use of count-based and predictive language models for exploring language use in the German Reference Corpus DeReKo. For collocation analysis along the syntagmatic axis we employ traditional association measures based on co-occurrence counts as well as predictive association measures derived from the output weights of skipgram word embeddings. For inspecting the semantic neighbourhood of words along the paradigmatic axis we visualize the high dimensional word embeddings in two dimensions using t-stochastic neighbourhood embeddings. Together, these visualizations provide a complementary, explorative approach to analysing very large corpora in addition to corpus querying. Moreover, we discuss count-based and predictive models w.r.t. scalability and maintainability in very large corpora.},
 booktitle = {Proceedings of the {LREC} 2022 {Workshop} on {Challenges} in the {Management} of {Large} {Corpora} ({CMLC}-10 2022). {Marseille}, 20 {June} 2022},
 publisher = {European Language Resources Association (ELRA)},
 author = {Fankhauser, Peter and Kupietz, Marc},
 editor = {Bański, Piotr and Barbaresi, Adrien and Clematide, Simon and Kupietz, Marc and Lüngen, Harald},
 year = {2022},
 keywords = {Korpus, Deutsch, Assoziationsmaß, collocation analysis, Deutsches Referenzkorpus (DeReKo), German Reference Corpus (DeReKo), Kollokation, language models, Paradigma, Syntagma, word embeddings},
 pages = {27--31},
}

@InProceedings{FankhauserKupietz2019,
author    = {Peter Fankhauser and Marc Kupietz},
title     = {Analyzing domain specific word embeddings for a large corpus of contemporary German},
booktitle = {Proceedings of the 10th International Corpus Linguistics Conference},
publisher = {University of Cardiff},
address   = {Cardiff},
year      = {2019},
note      = {\url{https://doi.org/10.14618/ids-pub-9117}}
}