Update README.md

Change-Id: Id9df2d50d651e5bb4658a765324bb5819fdf5ed4
diff --git a/README.md b/README.md
index 4e4dfa3..51e6e18 100644
--- a/README.md
+++ b/README.md
@@ -1,13 +1,22 @@
-# wang2vec
-Extension of the original word2vec (https://code.google.com/p/word2vec/) using different architectures
+# dereko2vec
+Fork of [wang2vec](https://github.com/wlin12/wang2vec) with extensions for re-training, for count-based
+models, and for a more accurate ETA estimate.
 
-To build the code, simply run:
+## Installation
+### Dependencies
+* cmake3
+* [libcollocatordb](https://korap.ids-mannheim.de/gerrit/plugins/gitiles/private/collocatordb) >= v1.3.0
+### Build and install
+```
+cd dereko2vec
+mkdir build
+cd build
+cmake ..
+make && ctest3 --extra-verbose && sudo make install
+```
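+
+As a quick smoke test after installation, the binary can be run without arguments; like the
+original word2vec, it presumably prints its usage summary (an assumption that the fork keeps
+this behaviour):
+```
+dereko2vec
+```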
+## Run
 
-make
-
-The command to build word embeddings is exactly the same as in the original version, except that we removed the argument -cbow and replaced it with the argument -type:
-
-./word2vec -train input_file -output embedding_file -type 0 -size 50 -window 5 -negative 10 -nce 0 -hs 0 -sample 1e-4 -threads 1 -binary 1 -iter 5 -cap 0
+The command to build word embeddings is exactly the same as in the original version, except that we added type 5 for building a purely count-based collocation database.
 
-The -type argument is a integer that defines the architecture to use. These are the possible parameters:  
+The -type argument is an integer that defines the architecture to use. These are the possible parameters:  
 0 - cbow  
@@ -15,9 +24,16 @@
 2 - cwindow (see below)  
-3 - structured skipngram(see below)  
-4 - collobert's senna context window model (still experimental)  
+3 - structured skipngram (see below)  
+4 - Collobert's SENNA context window model (still experimental)  
+5 - build a collocation count database instead of word embeddings (see the sketch below)  
 
-If you use functionalities we added to the original code for research, please support us by citing our paper (thanks!):
+### Examples
+```
+./dereko2vec -train input_file -output embedding_file -type 0 -size 50 -window 5 -negative 10 -nce 0 -hs 0 -sample 1e-4 -threads 1 -binary 1 -iter 5 -cap 0
+```
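+
+For type 5, a minimal sketch of an invocation is shown below. Apart from -train, -output,
+and -type, the flags and the output name are illustrative assumptions carried over from the
+embedding modes, not documented options of the count-based mode:
+```
+./dereko2vec -train input_file -output collocation_database -type 5 -window 5 -threads 1
+```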
 
+## References
+```
 @InProceedings{Ling:2015:naacl,  
 author = {Ling, Wang and Dyer, Chris and Black, Alan and Trancoso, Isabel},  
 title="Two/Too Simple Adaptations of word2vec for Syntax Problems",  
@@ -27,50 +43,13 @@
 location="Denver, Colorado",  
 }
 
-The main changes we made to the code are:
-
-****** Structured Skipngram and CWINDOW ******
-
-The two NN architectures cwindow and structured skipngram (aimed for solving syntax problems). 
-
-These are described in our paper:
-
--Two/Too Simple Adaptations of word2vec for Syntax Problems
-
-****** Noise Contrastive Estimation objective ******
-
-Noise contrastive estimation is another approximation for the word softmax objective function, in additon to Hierarchical softmax and negative sampling, which are implemented in the default word2vec toolkit. This can be turned on by setting the -nce argument. Simply set -nce 10, to use 10 negative samples. Also remember to set -negative and -hs to 0.
-
-****** Parameter Capping ******
-
-By default parameters are updated freely, and are not checked for algebric overflows to maximize efficiency. However, we had some datasets where the CWINDOW architecture overflows, which leads to segfaults, If this happens, even in other architectures, try setting the paramter -cap 1 in order to avoid this problem at the cost of a small degradation in computational speed.
-
-****** Class-based Negative Sampling ******
-
-A new argument -negative-classes can be added to specify groups of classes. It receives a file in the format:
- 
-N dog  
-N cat  
-N worm  
-V doing  
-V finding  
-V dodging  
-A charming  
-A satirical  
-
-where each line defines a class and a word belonging to that class. For words belonging to the class, negative sampling is only performed on words on that class. For instance, if the desired output is dog, we would only sample from cat and worm. For words not in the list, sampling is performed over all word types.
-
-warning: the file must be order so that all words in the same class are grouped, so the following would not work correctly.
-
-N dog  
-A charming  
-N cat  
-N worm  
-V doing  
-V finding  
-V dodging  
-A satirical  
-
-****** Minor Changes ******
-
-The distance_txt and kmeans_txt are adaptations of the original distance and kmeans code to take textual (-binary 0) embeddings as input
+@InProceedings{FankhauserKupietz2019,
+author    = {Peter Fankhauser and Marc Kupietz},
+title     = {Analyzing domain specific word embeddings for a large corpus of contemporary German},
+series    = {Proceedings of the 10th International Corpus Linguistics Conference},
+publisher = {University of Cardiff},
+address   = {Cardiff},
+year      = {2019},
+note      = {\url{https://doi.org/10.14618/ids-pub-9117}}
+}
+```