Language model training data version of the 20th Century in Basic Terms project corpus.

Corpus definition

The corpus is a virtual sub-corpus of the German Reference Corpus DeReKo (DeReKo-2022-I) (IDS 2022, Kupietz et al. 2010, 2018), containing the following parts:

Title                        Corpus-IDs (Sigles)
Die Zeit                     Z53-Z20
Der Spiegel                  S47-S20
die tageszeitung             T86-T99
Bonner Zeitungskorpus        bzk
Handbuchkorpus               hbk
Wendekorpus Bundesrepublik   wkb
Wendekorpus DDR              wkd
Umbruchsgeschichte           umb45, umb68

Data Format

The file 20CBT.tsv.bz2 contains all sentence-like segments of the aforementioned virtual corpus, encoded as randomly shuffled lines with three tab-separated values:

  1. document sigle (=document id)
  2. year (and month) of first publication
  3. one tokenized (space-separated) sentence (Kupietz / Diewald 2021, Diewald / Kupietz / Lüngen 2021)

For example:

T91/MAI 1991.05 Beide Male fällt auch auf , daß niemand festgenommen wurde .
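
The three-field line format can be read with a few lines of Python. The following is a minimal sketch, assuming the file is UTF-8 encoded and that every non-empty line has exactly the three tab-separated fields listed above; the names Segment, parse_line and read_segments are illustrative and not part of the project.

```python
import bz2
from typing import Iterator, List, NamedTuple


class Segment(NamedTuple):
    doc_sigle: str     # field 1: document sigle (document id)
    date: str          # field 2: year, optionally with month, e.g. "1991.05"
    tokens: List[str]  # field 3: the space-separated tokens of one sentence


def parse_line(line: str) -> Segment:
    """Split one line of the TSV file into its three fields."""
    # maxsplit=2 keeps any further tab characters inside the sentence field.
    doc_sigle, date, sentence = line.rstrip("\n").split("\t", 2)
    return Segment(doc_sigle, date, sentence.split(" "))


def read_segments(path: str) -> Iterator[Segment]:
    """Stream segments from the bzip2-compressed file without unpacking it."""
    # Assumption: the archive is UTF-8 encoded.
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip blank lines
                yield parse_line(line)
```

Iterating over read_segments("20CBT.tsv.bz2") then yields one Segment per line; because the lines are shuffled, consecutive segments generally come from different documents.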

Construct KorAP corpus queries based on the data

For the example above:

https://korap.ids-mannheim.de/?q=Beide+Male+fällt&cq=docSigle+%3D+"T91%2FMAI"
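
Such a URL can be assembled from a data line programmatically. The sketch below is only an illustration: it assumes the q and cq parameters are used exactly as in the example URL, takes the first three tokens as the search string, and does not escape tokens that carry a special meaning in the KorAP query language. The function name korap_query_url and the n_tokens parameter are hypothetical.

```python
from typing import Sequence
from urllib.parse import urlencode

KORAP_BASE = "https://korap.ids-mannheim.de/"


def korap_query_url(doc_sigle: str, tokens: Sequence[str], n_tokens: int = 3) -> str:
    """Build a KorAP URL that searches the first tokens within one document."""
    query = " ".join(tokens[:n_tokens])
    # Restrict the search to the document the sentence was taken from.
    corpus_query = f'docSigle = "{doc_sigle}"'
    # urlencode percent-encodes reserved and non-ASCII characters,
    # so "ä" becomes "%C3%A4" and the double quotes become "%22".
    return KORAP_BASE + "?" + urlencode({"q": query, "cq": corpus_query})


tokens = "Beide Male fällt auch auf , daß niemand festgenommen wurde .".split(" ")
print(korap_query_url("T91/MAI", tokens))
# → https://korap.ids-mannheim.de/?q=Beide+Male+f%C3%A4llt&cq=docSigle+%3D+%22T91%2FMAI%22
```

The result differs from the URL above only in that "ä" and the double quotes are percent-encoded, which browsers treat as equivalent.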


Software used

The archive 20CBT.tsv.bz2 was generated with the script extract-shuffled-sentences.sh (provided here), which uses the korapxml2conllu tool to produce the one-sentence-per-line TSV format with metadata, using the following command:

korapxml2conllu -m '<textSigle>([^<.]+)' -m '<creatDate>([^<]{4,7})' --word2vec $corpus > $dest
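
The two patterns passed with -m explain the shape of the first two TSV fields. The Python sketch below only illustrates what the capture groups match, assuming they are applied as ordinary regular expressions to each text's header; the header fragment and the function name extract_metadata are hypothetical, and this is not how korapxml2conllu itself is implemented.

```python
import re
from typing import Optional, Tuple

# The same patterns as in the command above.
TEXT_SIGLE = re.compile(r"<textSigle>([^<.]+)")    # stops at the first "." or "<"
CREAT_DATE = re.compile(r"<creatDate>([^<]{4,7})")  # 4 to 7 characters: YYYY or YYYY.MM


def extract_metadata(header: str) -> Tuple[Optional[str], Optional[str]]:
    """Return the captured sigle and date, or None where a pattern does not match."""
    sigle = TEXT_SIGLE.search(header)
    date = CREAT_DATE.search(header)
    return (
        sigle.group(1) if sigle else None,
        date.group(1) if date else None,
    )


# Hypothetical header fragment, made up for illustration.
sample_header = "<textSigle>T91/MAI.12345</textSigle><creatDate>1991.05.03</creatDate>"
print(extract_metadata(sample_header))
# → ('T91/MAI', '1991.05')
```

Because the first pattern excludes the dot, the captured sigle ends before the text number, which is why the first field identifies a document rather than a single text. The {4,7} quantifier cuts the date off after the month.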

License of the data

The corpus contains copyrighted and licensed material. Therefore, although the sentences are shuffled in random order, the corpus may only be shared among members of the project Das 20. Jahrhundert in Grundbegriffen (funded by the Leibniz Association, 2022-2024) for text and data mining purposes in accordance with the TDM exception of the German Copyright Act (§ 60d UrhG), and it must be deleted upon completion of the project.

References