commit | ebf837e12f222e9204b9e518538dbc507f1f7018 | |
---|---|---|
author | Marc Kupietz <kupietz@ids-mannheim.de> | Thu Jul 07 13:33:59 2022 +0200 |
committer | Marc Kupietz <kupietz@ids-mannheim.de> | Thu Jul 07 15:36:12 2022 +0200 |
tree | f22645bd46298422e21a56c5bc96d5481f758502 |
Add Readme.md Change-Id: Ieb8142e25eda3b31e68fd044c48785908e0edfb6
The corpus is a virtual sub-corpus of the German Reference Corpus DeReKo (DeReKo-2022-I) (IDS 2022, Kupietz et al. 2010, 2018), containing the following parts:
Title | Corpus-IDs (Sigles) |
---|---|
Die Zeit | Z53-Z20 |
Der Spiegel | S47-S20 |
die tageszeitung | T86-T99 |
Bonner Zeitungskorpus | bzk |
Handbuchkorpus | hbk |
Wendekorpus Bundesrepublik | wkb |
Wendekorpus DDR | wkd |
Umbruchsgeschichte | umb45, umb68 |
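The file 20CBT.tsv.bz2 contains all sentence-like segments of the aforementioned virtual corpus, encoded as randomly shuffled lines with three tab-separated values: the sigle (truncated to the document level), the creation date (truncated to at most year and month), and the tokenized sentence.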
For example:
T91/MAI 1991.05 Beide Male fällt auch auf , daß niemand festgenommen wurde .
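A minimal sketch of how such lines could be read in Python, assuming the archive is UTF-8 encoded and every line has exactly three tab-separated columns. The names `Segment`, `parse_line` and `read_segments` are hypothetical helpers introduced here for illustration; they are not part of the corpus tooling.

```python
import bz2
from typing import Iterator, NamedTuple


class Segment(NamedTuple):
    sigle: str     # first column, e.g. "T91/MAI"
    date: str      # second column, e.g. "1991.05"
    sentence: str  # third column: tokens separated by single spaces


def parse_line(line: str) -> Segment:
    # Split on the first two tabs only, so the sentence column is kept whole.
    sigle, date, sentence = line.rstrip("\n").split("\t", 2)
    return Segment(sigle, date, sentence)


def read_segments(path: str) -> Iterator[Segment]:
    # Text mode decompresses on the fly; the archive is streamed line by line
    # instead of being loaded into memory at once. UTF-8 is an assumption.
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_line(line)


example = parse_line(
    "T91/MAI\t1991.05\tBeide Male fällt auch auf , daß niemand festgenommen wurde .\n"
)
print(example.sigle, example.date, len(example.sentence.split()))
# prints: T91/MAI 1991.05 11
```

Iterating over `read_segments("20CBT.tsv.bz2")` would then yield one `Segment` per line; because the lines are shuffled, no ordering by date or source should be expected.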
The sigle and the beginning of the sentence can be used to look a segment up in KorAP. For the example above:
https://korap.ids-mannheim.de/?q=Beide+Male+fällt&cq=docSigle+%3D+"T91%2FMAI"
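A URL of this shape can be assembled from a sigle and a sentence. The following is a sketch only: `korap_query_url` and its `n_tokens` parameter are hypothetical, and the parameter names `q` and `cq` as well as the `docSigle = "..."` expression are simply copied from the URL above. Using the first three tokens as the query is an assumption that happens to match the example; tokens that are punctuation or query-language metacharacters may need extra handling.

```python
from urllib.parse import urlencode

KORAP_BASE = "https://korap.ids-mannheim.de/"


def korap_query_url(sigle: str, sentence: str, n_tokens: int = 3) -> str:
    # Use the first few tokens of the sentence as the search query.
    query = " ".join(sentence.split()[:n_tokens])
    # Restrict the search to the document the segment was taken from.
    corpus_query = f'docSigle = "{sigle}"'
    # urlencode percent-encodes the values and turns spaces into "+".
    return KORAP_BASE + "?" + urlencode({"q": query, "cq": corpus_query})


url = korap_query_url(
    "T91/MAI",
    "Beide Male fällt auch auf , daß niemand festgenommen wurde .",
)
print(url)
```

The result percent-encodes `ä` and the double quotes, which the URL above leaves as-is; both forms decode to the same query.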
The archive 20CBT.tsv.bz2 was generated using the script extract-shuffled-sentences.sh provided here, which uses the korapxml2conllu tool to generate the one-sentence-per-line TSV format with metadata, using the following command:
korapxml2conllu -m '<textSigle>([^<.]+)' -m '<creatDate>([^<]{4,7})' --word2vec $corpus > $dest
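To illustrate what the two `-m` patterns capture, the sketch below applies them with Python's `re` module to a made-up header fragment. It assumes that each pattern is matched against the text header and that its first capture group becomes one metadata column; the header string and its values are hypothetical, and Python's regex engine stands in for whatever engine the tool uses.

```python
import re

# The two patterns passed via -m in the command above.
TEXT_SIGLE = re.compile(r"<textSigle>([^<.]+)")
CREAT_DATE = re.compile(r"<creatDate>([^<]{4,7})")

# Hypothetical header fragment; real headers contain many more elements.
header = "<textSigle>T91/MAI.00001</textSigle> <creatDate>1991.05.02</creatDate>"

# [^<.]+ stops at the first dot, so the text number is cut off and only
# the document-level part of the sigle remains.
sigle = TEXT_SIGLE.search(header).group(1)

# {4,7} matches at most seven characters, i.e. the year and, if present,
# the month, but never the day.
date = CREAT_DATE.search(header).group(1)

print(sigle, date)
# prints: T91/MAI 1991.05
```

Under these assumptions, truncating both values means a shuffled sentence can be traced back to a document and a month, but not directly to an individual text or day.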
The corpus contains copyrighted and licensed material. Therefore, although the sentences are shuffled in random order, the corpus may only be shared among members of the project Das 20. Jahrhundert in Grundbegriffen (funded by the Leibniz Association, 2022-2024) for text and data mining purposes in accordance with the TDM exception of the German Copyright Act (§ 60d UrhG), and it must be deleted upon completion of the project.
Diewald, Nils/Kupietz, Marc/Lüngen, Harald (2022): Tokenizing on scale – Preprocessing large text corpora on the lexical and sentence level. In: Proceedings of EURALEX 2022. Mannheim.
IDS (2022): Deutsches Referenzkorpus / Archiv der Korpora geschriebener Gegenwartssprache 2022-I (Release vom 08.03.2022). Mannheim: Leibniz-Institut für Deutsche Sprache.
Kupietz, Marc/Lüngen, Harald/Kamocki, Paweł/Witt, Andreas (2018): The German Reference Corpus DeReKo: New Developments – New Opportunities. In: Calzolari, Nicoletta/Choukri, Khalid/Cieri, Christopher/Declerck, Thierry/Goggi, Sara/Hasida, Koiti/Isahara, Hitoshi/Maegaard, Bente/Mariani, Joseph/Mazo, Hélène/Moreno, Asuncion/Odijk, Jan/Piperidis, Stelios/Tokunaga, Takenobu (Hrsg.): Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Miyazaki: European Language Resources Association (ELRA). 4353-4360.
Kupietz, Marc/Belica, Cyril/Keibel, Holger/Witt, Andreas (2010): The German Reference Corpus DeReKo: A primordial sample for linguistic research. In: Calzolari, Nicoletta et al. (eds.): Proceedings of the 7th conference on International Language Resources and Evaluation (LREC 2010). Valletta, Malta: European Language Resources Association (ELRA). 1848-1854.
Kupietz, Marc/Diewald, Nils (2021): KorAP/KorAP-Tokenizer: KorAP-Tokenizer v2.2.0 (v2.1.0.9000). Zenodo. https://doi.org/10.5281/zenodo.5144835