commit	ebf837e12f222e9204b9e518538dbc507f1f7018
author	Marc Kupietz <kupietz@ids-mannheim.de>	Thu Jul 07 13:33:59 2022 +0200
committer	Marc Kupietz <kupietz@ids-mannheim.de>	Thu Jul 07 15:36:12 2022 +0200
tree	f22645bd46298422e21a56c5bc96d5481f758502

Readme.md

Language model training data version of the 20th Century in Basic Terms project corpus.

Corpus definition

The corpus is a virtual sub-corpus of the German Reference Corpus DeReKo (DeReKo-2022-I) (IDS 2022, Kupietz et al. 2010, 2018), containing the following parts:

Title	Corpus-IDs (Sigles)
Die Zeit	Z53-Z20
Der Spiegel	S47-S20
die tageszeitung	T86-T99
Bonner Zeitungskorpus	bzk
Handbuchkorpus	hbk
Wendekorpus Bundesrepublik	wkb
Wendekorpus DDR	wkd
Umbruchsgeschichte	umb45, umb68

Data Format

The file 20CBT.tsv.bz2 contains all sentence-like segments of the aforementioned virtual corpus, encoded as randomly shuffled lines with three tab-separated values:

document sigle (=document id)
year (and month) of first publication
one tokenized (space-separated) sentence (Kupietz / Diewald 2021, Diewald / Kupietz / Lüngen 2022)

For example:

T91/MAI 1991.05 Beide Male fällt auch auf , daß niemand festgenommen wurde .
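
The three-column format described above can be read line by line without unpacking the archive first. The following is a minimal sketch, not part of the project tooling: the names `Segment`, `parse_line` and `read_segments` are hypothetical, and it assumes the file is UTF-8 encoded and that the fields of each line are separated by tab characters as stated above.

```python
import bz2
from typing import Iterator, NamedTuple


class Segment(NamedTuple):
    doc_sigle: str      # document sigle, e.g. "T91/MAI"
    date: str           # year, optionally followed by ".MM"
    tokens: list[str]   # tokens of one sentence-like segment


def parse_line(line: str) -> Segment:
    """Split one tab-separated line into its three fields."""
    # Split at most twice so that the sentence field is kept whole.
    doc_sigle, date, sentence = line.rstrip("\n").split("\t", 2)
    return Segment(doc_sigle, date, sentence.split(" "))


def read_segments(path: str) -> Iterator[Segment]:
    """Stream segments from the bz2-compressed file (UTF-8 is an assumption)."""
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip empty lines, if any
                yield parse_line(line)


# The example line from above, with explicit tabs between the three fields.
example = "T91/MAI\t1991.05\tBeide Male fällt auch auf , daß niemand festgenommen wurde ."
segment = parse_line(example)
print(segment.doc_sigle, segment.date, segment.tokens[:3])
```

Iterating over `read_segments("20CBT.tsv.bz2")` would then yield one `Segment` per line, which keeps memory use low for a file of this size.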

Construct KorAP corpus queries based on the data

For the example above:

https://korap.ids-mannheim.de/?q=Beide+Male+fällt&cq=docSigle+%3D+"T91%2FMAI"
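
A query URL of this shape can also be built programmatically from a data line. The sketch below is only an illustration: the helper `korap_url` and its parameter `n_tokens` are hypothetical, and using the first three tokens as the query is an assumption taken from the example above. Segments that begin with punctuation or other special characters may need extra handling before they form a valid query.

```python
from urllib.parse import urlencode

KORAP_BASE = "https://korap.ids-mannheim.de/"


def korap_url(doc_sigle: str, tokens: list[str], n_tokens: int = 3) -> str:
    """Build a KorAP URL that searches the first tokens within one document."""
    query = " ".join(tokens[:n_tokens])        # e.g. "Beide Male fällt"
    corpus_query = f'docSigle = "{doc_sigle}"' # restricts the search to the document
    # urlencode percent-encodes both values and joins them with "&".
    return KORAP_BASE + "?" + urlencode({"q": query, "cq": corpus_query})


url = korap_url("T91/MAI", "Beide Male fällt auch auf".split(" "))
print(url)
```

The result is the fully percent-encoded equivalent of the URL shown above, so non-ASCII characters and the quotation marks appear in escaped form.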

Software used

The archive 20CBT.tsv.bz2 was generated using the script extract-shuffled-sentences.sh provided here, which uses the korapxml2conllu tool to produce the one-sentence-per-line TSV format with metadata, using the following command:

korapxml2conllu -m '<textSigle>([^<.]+)' -m '<creatDate>([^<]{4,7})' --word2vec $corpus > $dest
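
To illustrate what the two patterns passed via -m capture, the following sketch applies them with Python's `re` module. The header fragment is hypothetical and only assumes that a text sigle continues after a dot and that the creation date may include a day; real KorAP-XML headers may look different, and the sketch does not reproduce how korapxml2conllu itself applies the patterns.

```python
import re

# The two patterns from the command above.
TEXT_SIGLE = re.compile(r"<textSigle>([^<.]+)")
CREAT_DATE = re.compile(r"<creatDate>([^<]{4,7})")

# Hypothetical header fragment, made up for illustration only.
header = "<textSigle>T91/MAI.00001</textSigle><creatDate>1991.05.02</creatDate>"

# [^<.]+ stops at the first dot, leaving the document-level part of the sigle.
sigle = TEXT_SIGLE.search(header).group(1)

# {4,7} keeps at most seven characters, i.e. the year and, if present, the month.
date = CREAT_DATE.search(header).group(1)

print(sigle, date)
```

Under these assumptions, the first pattern yields the document sigle and the second yields the year with an optional month, which corresponds to the first two columns of the data format.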

License of the data

The corpus contains copyrighted and licensed material. Therefore, although the sentences are shuffled in random order, the corpus may only be shared among members of the project Das 20. Jahrhundert in Grundbegriffen (funded by the Leibniz Association, 2022-2024) for text and data mining purposes, in accordance with the TDM exception of the German Copyright Act (§ 60d UrhG). It must be deleted upon completion of the project.

References

Diewald, Nils / Kupietz, Marc / Lüngen, Harald (2022): Tokenizing on scale – Preprocessing large text corpora on the lexical and sentence level. In: Proceedings of EURALEX 2022, Mannheim.
IDS (2022): Deutsches Referenzkorpus / Archiv der Korpora geschriebener Gegenwartssprache 2022-I (Release vom 08.03.2022). Mannheim: Leibniz-Institut für Deutsche Sprache.
Kupietz, Marc/Lüngen, Harald/Kamocki, Paweł/Witt, Andreas (2018): The German Reference Corpus DeReKo: New Developments – New Opportunities. In: Calzolari, Nicoletta/Choukri, Khalid/Cieri, Christopher/Declerck, Thierry/Goggi, Sara/Hasida, Koiti/Isahara, Hitoshi/Maegaard, Bente/Mariani, Joseph/Mazo, Hélène/Moreno, Asuncion/Odijk, Jan/Piperidis, Stelios/Tokunaga, Takenobu (Hrsg.): Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). Miyazaki: European Language Resources Association (ELRA). 4353-4360.
Kupietz, Marc/Belica, Cyril/Keibel, Holger/Witt, Andreas (2010): The German Reference Corpus DeReKo: A primordial sample for linguistic research. In: Calzolari, Nicoletta et al. (eds.): Proceedings of the 7th conference on International Language Resources and Evaluation (LREC 2010). Valletta, Malta: European Language Resources Association (ELRA). 1848-1854.
Kupietz, Marc / Diewald, Nils (2021): KorAP/KorAP-Tokenizer: KorAP-Tokenizer v2.2.0 (v2.1.0.9000). Zenodo. doi: 10.5281/zenodo.5144835 (https://doi.org/10.5281/zenodo.5144835)