Add documentation for getClassicCollocators web service function result
Change-Id: I1abb3df3c2bde6417cb962183a1f184bdca1e170
diff --git a/README.md b/README.md
index fb544c5..40f6bdf 100644
--- a/README.md
+++ b/README.md
@@ -2,8 +2,8 @@
Visualizes paradigmatic and syntagmatic relations between words based on [wang2vec](https://github.com/wlin12/wang2vec) / structured skip-n-gram (Ling et al. 2015) word embeddings (Mikolov et al. 2013) and word embedding networks.
-DeReKoVecs (Fankhauser & Kupietz 2017, 2019; Kupietz et al. 2018) serves as part of the new open lab of the Corpus Linguistics group at IDS Mannheim. Similar to the
-Collocation Database <a href="http://corpora.ids-mannheim.de/">CCDB</a> (Keibel & Belica 2007, Belica 2011), DeReKoVecs serves for investigating and comparing
+DeReKoVecs (Fankhauser / Kupietz 2017, 2019, 2022; Kupietz et al. 2018) serves as part of the new open lab of the Corpus Linguistics group at IDS Mannheim. Similar to the
+Collocation Database [CCDB](http://corpora.ids-mannheim.de/) (Keibel / Belica 2007, Belica 2011), DeReKoVecs serves for investigating and comparing
of measurements, dimension reduction procedures, visualizations etc., to track down detailed paradigmatic and syntagmatic relations
between words based on their use in very large corpora such as the German Reference Corpus DeReKo (Kupietz et al. 2010).
@@ -55,8 +55,47 @@
| ----------- | ----------- | ----------- |
| / | word, n, dedupe, cutoff, json=1 | get paradigmatic and syntagmatic neighbours, from word embeddings |
| getCollocationAssociation | w, c | get association scores for specific node collocate pairs |
+
+### Get classical (count-based) collocates
+
+| Command | Parameters | Description |
+| ----------- | ----------- | ----------- |
| getClassicCollocators | w | get count-based collocates of word w |
+#### Example Result (node: Grund)
+
+```jsonc
+{
+ "N" : 55650540526, // number of tokens in corpus
+ "collocates" : [ // array of collocates
+ {
+ "afwin" : 64, // auto-focus window (see Perkuhn et al. 2012: E8-15) bit field 64 = 2^6 ≙ 00010 node 00000 (Aus [gutem] Grund)
+ "delta" : 0, // rank delta compared to collocation in background corpus (currently unused)
+ "dice" : 0.00198886, // dice score
+ "f" : 113490, // abs. frequency of collocation
+ "f2" : 10965575, // abs. frequency of collocate
+ "ld" : 5.02616, // log-dice score (Rychlý 2008) for whole window
+ "ldaf" : 7.39257, // log-dice score for auto-focus window
+ "lfmd" : 36.0655, // log-frequency biased mutual dependency ≙ pmi³
+ // (Daille 1994; Thanopoulos et al. 2002)
+ "llr" : 204906, // log-likelihood (Dunning 1993; Evert 2004)
+ "ln_count" : 36, // frequency of collocate as left neighbour of node
+ "ln_pmi" : -5.81926, // pmi as left neighbour
+ "md" : 19.2733, // mutual dependency ≙ pmi²
+ // (Daille 1994; Thanopoulos et al. 2002)
+ "npmi" : 0.111633, // normalized pmi (Bouma 2009)
+ "pmi" : 2.4811, // pointwise mutual information
+ "rn_count" : 386, // frequency of collocate as right neighbour of node
+ "rn_pmi" : -2.39672, // pmi as right neighbour
+ "win" : 1023, // full window around node as bit field 1023 = 2^10-1 ≙ 11111 node 11111
+ // (unmarked scores refer to this)
+ "word" : "Aus" // collocate
+ },
+ // ...
+ ]
+}
+```
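
The `win` and `afwin` bit fields can be unpacked into positions relative to the node. The following Python sketch assumes the layout suggested by the two examples above: ten bits, five positions on each side of the node, with the most significant bit farthest to the left. The function name `window_offsets` is hypothetical and not part of DeReKoVecs.

```python
def window_offsets(bits: int, half_width: int = 5) -> list[int]:
    """Return the node-relative positions set in a window bit field.

    Assumed layout (inferred from the README examples, not from the
    DeReKoVecs sources): the first half of the bit pattern covers
    positions -half_width..-1, the second half covers +1..+half_width.
    """
    width = 2 * half_width
    if not 0 <= bits < 2 ** width:
        raise ValueError(f"bit field must fit into {width} bits")
    pattern = format(bits, f"0{width}b")  # e.g. 64 -> "0001000000"
    left, right = pattern[:half_width], pattern[half_width:]
    # Left half: index 0 is the position farthest to the left of the node.
    offsets = [i - half_width for i, bit in enumerate(left) if bit == "1"]
    # Right half: index 0 is the position directly after the node.
    offsets += [i + 1 for i, bit in enumerate(right) if bit == "1"]
    return offsets
```

Under this assumption, `window_offsets(64)` yields `[-2]`, which matches the example "Aus [gutem] Grund", where the collocate stands two positions to the left of the node.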
+
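
Some of the scores in the example result can be cross-checked against each other. The Python sketch below uses the log-dice definition from Rychlý (2008), `14 + log2(dice)`. The relations for `md` and `lfmd` are only an observation that fits the values of this one example (each adds `log2(f)` to the previous score); they are not taken from the DeReKoVecs sources, and the helper names are hypothetical.

```python
import math

# Values copied from the example result above (collocate "Aus" of "Grund").
EXAMPLE = {
    "dice": 0.00198886,
    "ld": 5.02616,
    "f": 113490,
    "pmi": 2.4811,
    "md": 19.2733,
    "lfmd": 36.0655,
}


def log_dice(dice: float) -> float:
    """Log-dice score following Rychlý (2008): 14 + log2(dice)."""
    return 14 + math.log2(dice)


def pmi_k(pmi: float, f: int, k: int) -> float:
    """pmi plus (k - 1) * log2(f).

    Assumption: this frequency-based form is what fits the example values
    for k = 2 (md) and k = 3 (lfmd); it is not a documented formula.
    """
    return pmi + (k - 1) * math.log2(f)


print(round(log_dice(EXAMPLE["dice"]), 4))                  # compare with "ld"
print(round(pmi_k(EXAMPLE["pmi"], EXAMPLE["f"], 2), 4))     # compare with "md"
print(round(pmi_k(EXAMPLE["pmi"], EXAMPLE["f"], 3), 4))     # compare with "lfmd"
```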
### Examples
```bash
GET 'http://localhost:3000/?word=Grund&n=10&dedupe=0&sort=0&cutoff=1000000&json=1' | json_pp |less
@@ -67,7 +106,7 @@
```
```bash
-$ GET 'http://localhost:3000/getCollocationAssociation?w=Grund&c=diesem'
+GET 'http://localhost:3000/getCollocationAssociation?w=Grund&c=diesem'
```
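
The same kind of request can be issued from a script. This is a minimal Python sketch using only the standard library. It assumes a local instance on `http://localhost:3000` as in the examples above, and it assumes that `getClassicCollocators` is addressed in the same way as `getCollocationAssociation` and returns JSON directly. The helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: local instance, as in the GET examples above.
BASE_URL = "http://localhost:3000"


def classic_collocators_url(word: str, base_url: str = BASE_URL) -> str:
    """Build the request URL for the count-based collocates of `word`."""
    query = urlencode({"w": word})  # percent-encodes non-ASCII characters
    return f"{base_url}/getClassicCollocators?{query}"


def fetch_classic_collocators(word: str, base_url: str = BASE_URL) -> dict:
    """Fetch and decode the JSON result (requires a running service)."""
    with urlopen(classic_collocators_url(word, base_url)) as response:
        return json.load(response)


def top_by_log_dice(result: dict, n: int = 5) -> list[tuple[str, float]]:
    """Return the n collocates with the highest log-dice score ("ld")."""
    ranked = sorted(result["collocates"], key=lambda c: c["ld"], reverse=True)
    return [(c["word"], c["ld"]) for c in ranked[:n]]
```

With a running service, `top_by_log_dice(fetch_classic_collocators("Grund"))` would list the five collocates of "Grund" with the highest log-dice score.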
## Build and run using docker / podman
@@ -79,12 +118,14 @@
```
### Copy configuration to make it mountable
+
```bash
mkdir config
cp example.conf config/derekovecs.conf
```
### Run
+
``` bash
docker run -d=false -p 3000:3000 --rm -v $(pwd)/config:/config:z ids-kl/derekovecs
```
@@ -97,24 +138,36 @@
DeReKoVecs is published under the [Apache 2.0 License](LICENSE).
+## How to cite
+
+If you use DeReKoVecs or its results in a scientific publication, please cite at least Fankhauser / Kupietz (2022).
+
## References
-Belica, Cyril (2011): Semantische Nähe als Ähnlichkeit von Kookkurrenzprofilen. In: Andrea Abel, Renata Zanin, Hrsg., Korpora in Lehre und Forschung, S. 155-178. Bozen-Bolzano University Press. Freie Universität Bozen-Bolzano.
+Belica, Cyril (2011): [Semantische Nähe als Ähnlichkeit von Kookkurrenzprofilen](https://nbn-resolving.org/urn:nbn:de:bsz:mh39-28361). In: Andrea Abel, Renata Zanin, Hrsg., Korpora in Lehre und Forschung, S. 155-178. Bozen-Bolzano University Press. Freie Universität Bozen-Bolzano.
-Fankhauser, P., Kupietz, M.(2022): [Count-Based and Predictive Language Models for Exploring DeReKo](http://www.lrec-conf.org/proceedings/lrec2022/workshops/CMLC10/pdf/2022.cmlc10-1.5.pdf). In: Proceedings of the LREC 2022 Workshop on Challenges in the Management of Large Corpora (CMLC-10 2022). Paris/Marseille: ELRA. pp. 27-31.
+Bouma, Gerlof (2009): Normalized (pointwise) mutual information in collocation extraction. In: Proceedings of GSCL.
-Fankhauser, P., Kupietz, M. (2017): Visualizing Language Change in a Corpus of Contemporary German. In: Proceedings of the 9th International Corpus Linguistics Conference. Birmingham: University of Birmingham.
+Daille, B. (1994): Approche mixte pour l’extraction automatique de terminologie: statistiques lexicales et filtres linguistiques. PhD thesis, Université Paris 7.
+
+Fankhauser, Peter / Kupietz, Marc (2022): [Count-Based and Predictive Language Models for Exploring DeReKo](http://www.lrec-conf.org/proceedings/lrec2022/workshops/CMLC10/pdf/2022.cmlc10-1.5.pdf). In: Proceedings of the LREC 2022 Workshop on Challenges in the Management of Large Corpora (CMLC-10 2022). Paris/Marseille: ELRA. pp. 27-31.
+
+Fankhauser, Peter / Kupietz, Marc (2017): Visualizing Language Change in a Corpus of Contemporary German. In: Proceedings of the 9th International Corpus Linguistics Conference. Birmingham: University of Birmingham.
Fankhauser, Peter/Kupietz, Marc (2019): [Analyzing domain specific word embeddings for a large corpus of contemporary German](https://doi.org/10.14618/ids-pub-9117). International Corpus Linguistics Conference, Cardiff, Wales, UK, July 22-26, 2019. 2019. 6 S.
-Keibel, H., Belica, C. (2007): CCDB: A Corpus-Linguistic Research and Development Workbench. In: Proceedings of the 4th Corpus Linguistics Conference (CL 2007). Birmingham: University of Birmingham.
+Keibel, H. / Belica, C. (2007): CCDB: A Corpus-Linguistic Research and Development Workbench. In: Proceedings of the 4th Corpus Linguistics Conference (CL 2007). Birmingham: University of Birmingham.
-Kupietz, M., Belica, C., Keibel, H., Witt, A. (2010): The German Reference Corpus DeReKo: A primordial sample for linguistic research. In: Calzolari, N. et al. (eds.): Proceedings of the seventh conference on International Language Resources and Evaluation (LREC 2010). Paris: ELRA, 1848-1854.
+Kupietz, M. / Belica, C. / Keibel, H. / Witt, A. (2010): The German Reference Corpus DeReKo: A primordial sample for linguistic research. In: Calzolari, N. et al. (eds.): Proceedings of the seventh conference on International Language Resources and Evaluation (LREC 2010). Paris: ELRA, 1848-1854.
-Kupietz, M., Lüngen, H., Kamocki, P., Witt, A. (2018): German Reference Corpus DeReKo: New Developments – New Opportunities. In: Calzolari, N. et al (eds): Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018). Miyazaki: ELRA, 4353-4360
+Kupietz, M. / Lüngen, H. / Kamocki, P. / Witt, A. (2018): German Reference Corpus DeReKo: New Developments – New Opportunities. In: Calzolari, N. et al. (eds.): Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018). Miyazaki: ELRA, 4353-4360.
-Ling, W., Dyer, C., Black, A., & Trancoso, I. (2015): Two/too simple adaptations of word2vec for syntax problems. In Proc. of NAACL.
+Ling, Wang / Dyer, C. / Black, A. / Trancoso, I. (2015): Two/too simple adaptations of word2vec for syntax problems. In Proc. of NAACL.
-Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., Dean, J.(2013): Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS (Advances in Neural Information Processing Systems) 2013, 3111–3119.
+Mikolov, T. / Sutskever, I. / Chen, K. / Corrado, G. S. / Dean, J. (2013): Distributed representations of words and phrases and their compositionality. In: Proceedings of NIPS (Advances in Neural Information Processing Systems) 2013, 3111–3119.
+Perkuhn, Rainer / Keibel, Holger / Kupietz, Marc (2012): Korpuslinguistik. Paderborn: Fink, 2012. [Addendum](http://corpora.ids-mannheim.de/libac/doc/libac-addOn-Kookkurrenz.pdf)
+Rychlý, Pavel (2008): A lexicographer-friendly association score. In: Proceedings of Recent Advances in Slavonic Natural Language Processing (RASLAN 2008), 6–9.
+
+Thanopoulos, A. / Fakotakis, N. / Kokkinakis, G. (2002): Comparative evaluation of collocation extraction metrics. In: Proc. of LREC 2002: 620–625.