commit	c0eff68d083f8640e86c47d63ced9b3f19f6cabd
author	Marc Kupietz <kupietz@ids-mannheim.de>	Thu Jul 07 16:02:11 2022 +0200
committer	Marc Kupietz <kupietz@ids-mannheim.de>	Thu Jul 07 16:14:59 2022 +0200
tree	b919504b0ebc8e488ee350ccaaa78ae346956970
parent	0e7dd3a2ef1ad462f17758e1684003aed423b288

Readme.md

Language model training data version of the 20th Century in Basic Terms project corpus.

Corpus Definition

The corpus is a virtual sub-corpus of the German Reference Corpus DeReKo (DeReKo-2022-I) (IDS 2022, Kupietz et al. 2010, 2018), containing the following parts:

Title	Corpus-IDs (Sigles)
Die Zeit	Z53-Z20
Der Spiegel	S47-S20
die tageszeitung	T86-T99
Bonner Zeitungskorpus	bzk
Handbuchkorpus	hbk
Wendekorpus Bundesrepublik	wkb
Wendekorpus DDR	wkd
Umbruchsgeschichte	umb45, umb68

Data Format

The file 20CBT.tsv.bz2 contains all sentence-like segments of the aforementioned virtual corpus, encoded as randomly shuffled lines, each with three tab-separated values:

document sigle (=document id)
year (and month) of first publication
one tokenized (space-separated) sentence (Kupietz / Diewald 2021, Diewald / Kupietz / Lüngen 2022)

For example:

T91/MAI 1991.05 Beide Male fällt auch auf , daß niemand festgenommen wurde .
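
The format described above can be read with a few lines of Python. The following is a minimal sketch, not part of the project tooling: the names `Segment`, `parse_line` and `read_segments` are hypothetical, and it assumes that the file is UTF-8 encoded and that the three fields are separated by literal tab characters.

```python
import bz2
from typing import Iterator, List, NamedTuple


class Segment(NamedTuple):
    sigle: str         # document sigle, e.g. "T91/MAI"
    date: str          # year, optionally with month, e.g. "1991.05"
    tokens: List[str]  # tokens of one sentence


def parse_line(line: str) -> Segment:
    """Split one data line into its three tab-separated fields."""
    sigle, date, sentence = line.rstrip("\n").split("\t", 2)
    return Segment(sigle, date, sentence.split(" "))


def read_segments(path: str) -> Iterator[Segment]:
    """Stream segments from the bzip2 file without unpacking it to disk.

    Assumption: the file is UTF-8 encoded.
    """
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip empty lines, if any
                yield parse_line(line)


# The example line from above, with explicit tabs between the fields.
example = "T91/MAI\t1991.05\tBeide Male fällt auch auf , daß niemand festgenommen wurde ."
segment = parse_line(example)
print(segment.sigle, segment.date, len(segment.tokens))
```

Because the tokens are already separated by single spaces, a plain `split(" ")` recovers them; punctuation marks such as `,` and `.` are tokens of their own.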

Construct KorAP Corpus Queries Based on a Data Line

For the example above:

https://korap.ids-mannheim.de/?q=Beide+Male+fällt&cq=docSigle+%3D+"T91%2FMAI"
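
Such a URL can also be built programmatically. The sketch below is an illustration under assumptions: the function name `korap_query_url` and the choice of using the first three tokens as the query (as in the example above) are not prescribed by the project. The standard library percent-encodes `ä` and the double quotes, so the resulting string looks different from the URL above but decodes to the same `q` and `cq` values.

```python
from urllib.parse import urlencode

KORAP_URL = "https://korap.ids-mannheim.de/"


def korap_query_url(line: str, n_tokens: int = 3) -> str:
    """Build a KorAP URL that searches the first tokens of a data line
    within the document the line was taken from.

    n_tokens is an assumption; the example above uses three tokens.
    """
    sigle, _date, sentence = line.rstrip("\n").split("\t", 2)
    query = " ".join(sentence.split(" ")[:n_tokens])
    params = {
        "q": query,                       # the token sequence to search for
        "cq": f'docSigle = "{sigle}"',    # restrict the search to this document
    }
    return KORAP_URL + "?" + urlencode(params)


line = "T91/MAI\t1991.05\tBeide Male fällt auch auf , daß niemand festgenommen wurde ."
print(korap_query_url(line))
```

Tokens that have a special meaning in the query language (for example punctuation marks) may need additional escaping, which this sketch does not attempt.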

Used Software

The file 20CBT.tsv.bz2 was generated using the script extract-shuffled-sentences.sh provided here, which uses the korapxml2conllu tool to produce the one-sentence-per-line TSV format with metadata, using the following command:

korapxml2conllu -m '<textSigle>([^<.]+)' -m '<creatDate>([^<]{4,7})' --word2vec $corpus > $dest
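
The two patterns passed via `-m` explain the first two columns of the data format. The following Python sketch only illustrates what the two regular expressions capture; it does not reproduce how korapxml2conllu applies them internally, and the `header` string is a hypothetical stand-in for the metadata of one text, not an excerpt from the corpus.

```python
import re

# The two patterns from the command above, unchanged.
SIGLE_RE = re.compile(r"<textSigle>([^<.]+)")
DATE_RE = re.compile(r"<creatDate>([^<]{4,7})")

# Hypothetical metadata fragment, made up for illustration.
header = "<textSigle>T91/MAI.12345</textSigle><creatDate>1991.05.17</creatDate>"

# [^<.]+ stops at the first dot, so the text number is cut off and
# only the document sigle remains.
sigle = SIGLE_RE.search(header).group(1)

# {4,7} keeps at most seven characters: the year, or year and month.
date = DATE_RE.search(header).group(1)

print(sigle, date)
```

Under these assumptions, the first pattern reduces a text sigle to its document sigle and the second truncates the creation date to year and month, which matches the example line `T91/MAI 1991.05 ...` shown earlier.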

License of the Data

The corpus contains copyrighted and licensed material. Therefore, although the sentences are shuffled in random order, the corpus may only be shared among members of the project Das 20. Jahrhundert in Grundbegriffen, funded by the Leibniz Association 2022-2024, for text and data mining purposes in accordance with the TDM exception of the German Copyright Act (§ 60d UrhG) and must be deleted upon completion of the project.

References

Diewald, Nils / Kupietz, Marc / Lüngen, Harald (2022): Tokenizing on scale – Preprocessing large text corpora on the lexical and sentence level. In: Proceedings of EURALEX 2022, Mannheim.
IDS (2022): Deutsches Referenzkorpus / Archiv der Korpora geschriebener Gegenwartssprache 2022-I (Release vom 08.03.2022). Mannheim: Leibniz-Institut für Deutsche Sprache.
Kupietz, Marc/Lüngen, Harald/Kamocki, Paweł/Witt, Andreas (2018): The German Reference Corpus DeReKo: New Developments – New Opportunities. In: Calzolari, Nicoletta/Choukri, Khalid/Cieri, Christopher/Declerck, Thierry/Goggi, Sara/Hasida, Koiti/Isahara, Hitoshi/Maegaard, Bente/Mariani, Joseph/Mazo, Hélène/Moreno, Asuncion/Odijk, Jan/Piperidis, Stelios/Tokunaga, Takenobu (Hrsg.): Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Miyazaki: European Language Resources Association (ELRA). 4353-4360.
Kupietz, Marc/Belica, Cyril/Keibel, Holger/Witt, Andreas (2010): The German Reference Corpus DeReKo: A primordial sample for linguistic research. In: Calzolari, Nicoletta et al. (eds.): Proceedings of the 7th conference on International Language Resources and Evaluation (LREC 2010). Valletta, Malta: European Language Resources Association (ELRA). 1848-1854.
Kupietz, Marc / Diewald, Nils (2021): KorAP/KorAP-Tokenizer: KorAP-Tokenizer v2.2.0 (v2.1.0.9000). Zenodo. DOI: https://doi.org/10.5281/zenodo.5144835