Add Readme.md

Change-Id: Ieb8142e25eda3b31e68fd044c48785908e0edfb6

commit: ebf837e12f222e9204b9e518538dbc507f1f7018
author: Marc Kupietz <kupietz@ids-mannheim.de> Thu Jul 07 13:33:59 2022 +0200
committer: Marc Kupietz <kupietz@ids-mannheim.de> Thu Jul 07 15:36:12 2022 +0200
tree: f22645bd46298422e21a56c5bc96d5481f758502
diff --git a/Readme.md b/Readme.md
new file mode 100644
index 0000000..850409e
--- /dev/null
+++ b/Readme.md

@@ -0,0 +1,67 @@
+# Language model training data version of the [20th Century in Basic Terms](https://www.ids-mannheim.de/lexik/pb1/woerter-medien-und-gesellschaft/politisch-soziale-grundbegriffe-grosser-reichweite-und-dauer/) project corpus.
+
+## Corpus definition
+
+The corpus is a virtual sub-corpus of the German Reference Corpus DeReKo (DeReKo-2022-I) (IDS 2022, Kupietz et al. 2010, 2018), containing the following parts:
+
+| Title | Corpus-IDs (Sigles) |
+|-------|--------------|
+| Die Zeit | Z53-Z20    |
+| Der Spiegel | S47-S20 |
+| die tageszeitung | T86-T99 |
+| [Bonner Zeitungskorpus](http://www1.ids-mannheim.de/kl/projekte/korpora/archiv/bzk.html) | bzk |
+| [Handbuchkorpus](http://www1.ids-mannheim.de/kl/projekte/korpora/archiv/hbk.html) | hbk |
+| [Wendekorpus Bundesrepublik](https://www.ids-mannheim.de/digspra/kl/projekte/korpora/archiv/wk/)  | wkb|
+| [Wendekorpus DDR](https://www.ids-mannheim.de/digspra/kl/projekte/korpora/archiv/wk/) | wkd|
+| Umbruchsgeschichte | umb45, umb68|
+
+## Data Format
+
+The file `20CBT.tsv.bz2` contains all sentence-like segments of the virtual corpus defined above, encoded as randomly shuffled lines with three tab-separated values:
+
+1. document sigle (=document id)
+2. year (and month) of first publication
+3. one tokenized (space-separated) sentence (Kupietz / Diewald 2021, Diewald / Kupietz / Lüngen 2022)
+  
+For example:
+
+```tsv
+T91/MAI	1991.05	Beide Male fällt auch auf , daß niemand festgenommen wurde .
+```
+
+### Construct KorAP corpus queries based on the data
+
+For the example above:
+
+<https://korap.ids-mannheim.de/?q=Beide+Male+fällt&cq=docSigle+%3D+"T91%2FMAI">
+
+image.png
+
+
+## Software used
+
+The archive `20CBT.tsv.bz2` was generated with the script `extract-shuffled-sentences.sh` provided here, which uses the [korapxml2conllu](https://github.com/KorAP/KorAP-XML-CoNLL-U) tool to produce the one-sentence-per-line TSV format with metadata, using the following command:
+
+```bash
+korapxml2conllu -m '<textSigle>([^<.]+)' -m '<creatDate>([^<]{4,7})' --word2vec $corpus > $dest
+```
+
+## License of the data
+
+The corpus contains copyrighted and licensed material. Therefore, although the sentences are shuffled in random order, the corpus may only be shared among members of the project [Das 20. Jahrhundert in Grundbegriffen](https://www.zfl-berlin.org/projekt/das-20-jahrhundert-in-grundbegriffen.html), funded by the Leibniz Association from 2022 to 2024, for text and data mining purposes in accordance with the TDM exception of the German Copyright Act (§ 60d UrhG), and must be deleted upon completion of the project.
+
+
+## References
+
+- Diewald, Nils / Kupietz, Marc / Lüngen, Harald (2022): Tokenizing on scale – Preprocessing large text corpora on the lexical and sentence level. In: Proceedings of EURALEX 2022, Mannheim.
+
+- IDS (2022): [Deutsches Referenzkorpus / Archiv der Korpora geschriebener Gegenwartssprache 2022-I (Release vom 08.03.2022)](http://www.dereko.de/).
+Mannheim: Leibniz-Institut für Deutsche Sprache.
+
+- Kupietz, Marc/Lüngen, Harald/Kamocki, Paweł/Witt, Andreas (2018): [The German Reference Corpus DeReKo: New Developments – New Opportunities](https://nbn-resolving.org/urn:nbn:de:bsz:mh39-74917).
+In: Calzolari, Nicoletta/Choukri, Khalid/Cieri, Christopher/Declerck, Thierry/Goggi, Sara/Hasida, Koiti/Isahara, Hitoshi/Maegaard, Bente/Mariani, Joseph/Mazo, Hélène/Moreno, Asuncion/Odijk, Jan/Piperidis, Stelios/Tokunaga, Takenobu (eds.): Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Miyazaki: European Language Resources Association (ELRA). 4353-4360.
+
+- Kupietz, Marc/Belica, Cyril/Keibel, Holger/Witt, Andreas (2010): [The German Reference Corpus DeReKo: A primordial sample for linguistic research](https://ids-pub.bsz-bw.de/frontdoor/deliver/index/docId/2837/file/Kupietz_Belica_Keibel_Witt_The+German+Reference_Corpus.pdf). In: Calzolari, Nicoletta et al. (eds.): Proceedings of the 7th conference on International Language Resources and Evaluation (LREC 2010). Valletta, Malta: European Language Resources Association (ELRA). 1848-1854.
+
+- Kupietz, Marc / Diewald, Nils (2021): KorAP/KorAP-Tokenizer: KorAP-Tokenizer v2.2.0 (v2.1.0.9000). Zenodo. [doi: 10.5281/zenodo.5144835](https://doi.org/10.5281/zenodo.5144835)
+
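
The three-column format described under "Data Format" and the link shown under "Construct KorAP corpus queries based on the data" can be combined roughly as follows. This is a minimal Python sketch, not code from the repository: it assumes the archive is UTF-8 encoded and that the first few tokens of a sentence are usable as a query. The helper names `parse_line`, `iter_segments` and `korap_query_url` and the parameter `n_tokens` are hypothetical.

```python
import bz2
from urllib.parse import urlencode

# Base URL as it appears in the Readme example.
KORAP_URL = "https://korap.ids-mannheim.de/"


def parse_line(line):
    """Split one TSV line into (document sigle, date, tokens)."""
    # maxsplit=2 keeps the sentence column intact even if it contained a tab.
    doc_sigle, date, sentence = line.rstrip("\n").split("\t", 2)
    return doc_sigle, date, sentence.split(" ")


def iter_segments(path):
    """Yield parsed segments from the bzip2-compressed TSV file.

    Assumption: the file is UTF-8 encoded.
    """
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip empty lines, if any
                yield parse_line(line)


def korap_query_url(doc_sigle, tokens, n_tokens=3):
    """Build a KorAP URL that searches the first n_tokens within one document.

    n_tokens=3 mirrors the Readme example; it is an arbitrary default.
    """
    params = {
        "q": " ".join(tokens[:n_tokens]),
        "cq": f'docSigle = "{doc_sigle}"',
    }
    return KORAP_URL + "?" + urlencode(params)


if __name__ == "__main__":
    example = "T91/MAI\t1991.05\tBeide Male fällt auch auf , daß niemand festgenommen wurde ."
    sigle, date, tokens = parse_line(example)
    print(korap_query_url(sigle, tokens))
```

`urlencode` percent-encodes the quotation marks and the umlaut, so the printed URL is equivalent to the link in the Readme without being character-identical to it. Tokens that have a special meaning in the KorAP query language may need escaping, which this sketch does not attempt.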