Update README.md: mention support for frequencies > 2³² and tag code fences
Change-Id: I25c4af0e6d1aad706ab4f9ce092bd5c020dc6e05
diff --git a/README.md b/README.md
index 6316f29..b4052ce 100644
--- a/README.md
+++ b/README.md
@@ -1,19 +1,24 @@
# dereko2vec
-Fork of [wang2vec](https://github.com/wlin12/wang2vec) with extensions for re-training and count based models and a
-more accurate ETA prognosis.
+
+Fork of [wang2vec](https://github.com/wlin12/wang2vec) with extensions for re-training and count-based models, support for tokens with frequencies > 2³², and a more accurate ETA estimate.
## Installation
+
### Dependencies
+
* cmake3
* [libcollocatordb](https://korap.ids-mannheim.de/gerrit/plugins/gitiles/ids-kl/collocatordb) >= v1.3.0
+
### Build and install
-```
+
+```bash
cd dereko2vec
mkdir build
cd build
cmake ..
make && ctest3 --extra-verbose && sudo make install
```
+
## Run
The command to build word embeddings is exactly the same as in the original version, except that we added type 5 for setting up a purely count-based collocation database.
@@ -27,7 +32,8 @@
5 - build a collocation count database instead of word embeddings
### Example
-```
+
+```bash
./dereko2vec -train input_file -output embedding_file -type 0 -size 50 -window 5 -negative 10 -nce 0 -hs 0 -sample 1e-4 -threads 1 -binary 1 -iter 5 -cap 0
```
@@ -35,12 +41,13 @@
The [KorAP-XML-CoNLL-U](https://github.com/KorAP/KorAP-XML-CoNLL-U) tool can be used to generate input files for dereko2vec from KorAP-XML ZIPs using its tokenization and sentence boundary information, for example:
-```
+```bash
korapxml2conllu --word2vec wpd19.zip > wpd19.w2vinput
```
## References
-```
+
+```bibtex
@InProceedings{Ling:2015:naacl,
author = {Ling, Wang and Dyer, Chris and Black, Alan and Trancoso, Isabel},
title="Two/Too Simple Adaptations of word2vec for Syntax Problems",