Data Set Released

We have published a test data set!

Warning: the archive contains a snapshot of the full German Wikipedia plus annotations, which is why its total size is 1.6 GByte.

Creative Commons License
A corpus derived from the 2005 version of German Wikipedia including TreeTagger annotations by Institut für Deutsche Sprache is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
Based on a work at www.ids-mannheim.de.
Permissions beyond the scope of this license may be available at http://korap.ids-mannheim.de.

README (also included in the archive)

Piotr Bański, Carsten Schnober
IDS Mannheim, March 2012
{banski,schnober}@ids-mannheim.de

1. Introduction

This package contains a snapshot of the German Wikipedia from 2005, as included in the DeReKo corpus (Deutsches Referenzkorpus), and annotations produced by TreeTagger, published by the KorAP project at the Institut für Deutsche Sprache (IDS). Recent versions should be available for download from http://korap.ids-mannheim.de/downloads/.

Please note: this is not an attempt to come up with a new standard, far from it. This is an abstraction of the underlying KorAP data model, and we will soon(ish) provide a way to translate from the KorAP format into others, most probably via the SaltNPepper framework. The data is released as a by-product of our project, in the hope of making it easier for colleagues who cooperate with us to download the suite. The rest of the world is welcome to use it as well; please be so kind as to send us a note at {korap in-the-domain ids-mannheim.de} if you decide to use it. All remarks are welcome.

A more complete description of the data structures used here will be available soon as a whitepaper. In this document, we only cover the basics. Please be so kind as to visit the project page and share your comments there or e-mail us at the above-mentioned address.

The release of the data set is accompanied by a separate release of the validator tool for it.

2. Credits

The package has been prepared by Piotr Bański and Carsten Schnober. We would like to thank Elena Frick for her work on testing the data set.

The textual content has been produced by the contributors to the German Wikipedia project and has been released under the Creative Commons BY-SA 3.0 Unported license. We are grateful to Helmut Schmid, the creator of TreeTagger, for his consent to release the TT annotations of KorAP-WPD under the same license.

The KorAP project is funded within the Senate Committee Competition (SAW) programme of the Leibniz Association.

3. Directory structure

This is a quick walkthrough across the directory tree.

3.1. root directory

  • schemas/ – RNG schemas for validation
  • WPD/ – corpus data (see section 3.2 below)
  • catalog – XML catalog file
  • LICENCE – CC BY-SA 3.0 Unported
  • README – this file

3.2. WPD directory

  • WPD_corpus_header.xml – the header of the entire Wikipedia subcorpus from DeReKo
  • Sub-directories AAA, BBB, …, ZZZ containing all document directories whose titles begin with the respective letters
  • Each sub-directory AAA, …, ZZZ contains a sub-corpus header file WPD-[A-Z][A-Z][A-Z]_header.xml

3.3. document directory

  • text.xml – raw text as a single UTF-8 sequence within /raw_text/text (excuse the element names)
  • header.xml – the header file, currently wrapping the contents of DeReKo headers
  • metadata.xml – pointers to other resources in this directory, in particular to “foundries” that contain various views of the raw text
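
The layout described in sections 3.1 to 3.3 can be walked with a few lines of Python. The following is only a sketch: it assumes nothing about how document directories are named and simply treats any directory that holds all three files listed above as a document directory. The function name find_document_dirs is our own and is not part of the release.

```python
from pathlib import Path
from typing import Iterator

# The three files that section 3.3 lists for every document directory.
REQUIRED_FILES = ("text.xml", "header.xml", "metadata.xml")


def find_document_dirs(corpus_root: Path) -> Iterator[Path]:
    """Yield every directory below corpus_root that holds all three files.

    Foundry subdirectories also carry a metadata.xml, but they lack
    text.xml and header.xml, so they are skipped here.
    """
    for metadata in sorted(corpus_root.rglob("metadata.xml")):
        doc_dir = metadata.parent
        if all((doc_dir / name).is_file() for name in REQUIRED_FILES):
            yield doc_dir
```

Calling list(find_document_dirs(Path("WPD"))) from the root of the unpacked archive should then give one path per document, in sorted order.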

3.4. subdirectories/foundries

Subdirectories contain what we call “foundries” (with thanks to Cyril Belica for the convenient term), each of which contains a set of views of the raw text connected by a common theme: in our case, the product of a single tool, such as TT, Connexor MPT, etc. There is one privileged foundry, the base foundry, which contains segmentation information: paragraph divisions, sentence boundaries, and two types of tokenization: conservative (stops at whitespace) and greedy (splits words whenever the character class changes). Other foundries may use the segmentation information of the base, or have their own (which is actually the case for the tree_tagger foundry in this release).
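
To illustrate the difference between the two tokenization types, here is a rough Python sketch. It is not the tokenizer used to build the base foundry: the three character classes (letter, digit, other) are our assumption, as are the function names, and the actual rules may well be finer-grained. Spans are given as end-exclusive character offsets.

```python
import re
import unicodedata
from typing import List, Tuple

Span = Tuple[int, int]  # (from, to) character offsets, end-exclusive


def conservative_tokens(text: str) -> List[Span]:
    """Conservative tokenization: split at whitespace only."""
    return [m.span() for m in re.finditer(r"\S+", text)]


def char_class(ch: str) -> str:
    """Coarse character class (assumed for this sketch): letter, digit, other."""
    category = unicodedata.category(ch)
    if category.startswith("L"):
        return "letter"
    if category.startswith("N"):
        return "digit"
    return "other"


def greedy_tokens(text: str) -> List[Span]:
    """Greedy tokenization: split at whitespace and at every class change."""
    spans = []
    for start, end in conservative_tokens(text):
        token_start = start
        for i in range(start + 1, end):
            if char_class(text[i]) != char_class(text[i - 1]):
                spans.append((token_start, i))
                token_start = i
        spans.append((token_start, end))
    return spans
```

For the string "Im Jahr 2005, z.B." the conservative variant yields four spans, while the greedy variant yields eight, because "2005," and "z.B." are split at each change between letters, digits and punctuation.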

4. Metadata documents

Each subdirectory contains a metadata.xml document, which points at the particular annotation layers. The set of permitted layer names is restricted and regulated in metadata.rng. An example metadata.xml document for a single text may look as follows:

<?xml-model href="metadata.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
  <metadata docid="WPD_AAA.00002" masked="0" type="document" xmlns="http://ids-mannheim.de/ns/KorA">
    <doc file="text.xml"/>
    <foundry name="base" path="base/"/>
    <foundry name="mpt" path="connexor/" restricted="1"/>
    <foundry name="tt" path="tree_tagger/"/>
    <foundry name="xip" path="xip/" restricted="1"/>
  </metadata>

The processing instruction (PI) invokes the metadata.rng schema document, which can be found if the catalog file (from the root directory) is installed on your system (see e.g. http://xmlsoft.org/catalog.html for hints). The @docid is a unique ID of the entire document (understood as text and annotations), @masked is set to “1” if the document is hidden, and @type can be set to “document” or “foundry”. As can be seen, some foundries are restricted by IPR; sorry about that, we will attempt to use as much open content as possible.
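
A document-level metadata.xml such as the one above can be read with Python's standard library. The sketch below is only an illustration: the function name read_document_metadata is ours, and the "{*}" namespace wildcard (available from Python 3.8) is used so that the code does not depend on the exact namespace URI.

```python
import xml.etree.ElementTree as ET
from typing import Dict

SAMPLE = """<?xml-model href="metadata.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
<metadata docid="WPD_AAA.00002" masked="0" type="document" xmlns="http://ids-mannheim.de/ns/KorA">
  <doc file="text.xml"/>
  <foundry name="base" path="base/"/>
  <foundry name="mpt" path="connexor/" restricted="1"/>
  <foundry name="tt" path="tree_tagger/"/>
  <foundry name="xip" path="xip/" restricted="1"/>
</metadata>"""


def read_document_metadata(xml_text: str) -> Dict[str, object]:
    """Return docid, masked flag, type and the list of foundries."""
    root = ET.fromstring(xml_text)
    foundries = [
        {
            "name": foundry.get("name"),
            "path": foundry.get("path"),
            # A missing @restricted attribute is read as "not restricted".
            "restricted": foundry.get("restricted") == "1",
        }
        for foundry in root.findall("{*}foundry")
    ]
    return {
        "docid": root.get("docid"),
        "masked": root.get("masked") == "1",
        "type": root.get("type"),
        "foundries": foundries,
    }


if __name__ == "__main__":
    info = read_document_metadata(SAMPLE)
    for foundry in info["foundries"]:
        status = "restricted" if foundry["restricted"] else "open"
        print(foundry["name"], foundry["path"], status)
```

Treating an absent @restricted as unrestricted is our reading of the example, not something the schema is known to guarantee.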

An example of a tree_tagger foundry-level metadata document is as follows:

<metadata docid="WPD_AAA.00002" type="foundry" xmlns="http://ids-mannheim.de/ns/KorA">
  <doc file="../text.xml" />
  <foundry name="tt">
    <layer name="token" type="segm" gran="tok" file="tokens.xml" />
    <layer name="morph" file="morpho.xml" />
    <layer name="sent" type="segm" gran="s" file="sentences.xml" />
  </foundry>
</metadata>

The @name attribute of <layer> is restricted by the metadata.rng schema; at present, its permitted values are a list of the kinds of annotations that occur in the entire, unrestricted suite. The names are meant to be partially mnemonic, but they are not documented yet. The relationship between the filename and the @name attribute is fully arbitrary and mediated by the <layer> element.

If a layer is of the type “segm” for segmentation (= exhaustively covering the text string with discrete spans, possibly excluding the intervening whitespace), its @gran attribute can take the following granularity values: s, chunk (for when we are not sure), para, tok.
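
The foundry-level metadata can be read in the same way. The following sketch lists the layers of a foundry and checks the granularity of segmentation layers against the four values named above. The function name read_layers and the decision to raise an error on an unknown value are our own choices; the authoritative check remains validation against metadata.rng.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional

# Granularity values listed in the README for layers of type "segm".
SEGM_GRANULARITIES = {"s", "chunk", "para", "tok"}

FOUNDRY_SAMPLE = """<metadata docid="WPD_AAA.00002" type="foundry" xmlns="http://ids-mannheim.de/ns/KorA">
  <doc file="../text.xml" />
  <foundry name="tt">
    <layer name="token" type="segm" gran="tok" file="tokens.xml" />
    <layer name="morph" file="morpho.xml" />
    <layer name="sent" type="segm" gran="s" file="sentences.xml" />
  </foundry>
</metadata>"""


def read_layers(xml_text: str) -> List[Dict[str, Optional[str]]]:
    """Return one dict per <layer>, mapping the layer name to its file."""
    root = ET.fromstring(xml_text)
    layers = []
    # "{*}" matches any namespace (Python 3.8 or later).
    for layer in root.findall("{*}foundry/{*}layer"):
        entry = {
            "name": layer.get("name"),
            "type": layer.get("type"),  # None for non-segmentation layers
            "gran": layer.get("gran"),
            "file": layer.get("file"),
        }
        if entry["type"] == "segm" and entry["gran"] not in SEGM_GRANULARITIES:
            raise ValueError(f"unexpected granularity: {entry['gran']!r}")
        layers.append(entry)
    return layers


if __name__ == "__main__":
    for layer in read_layers(FOUNDRY_SAMPLE):
        print(layer["name"], "->", layer["file"])
```

Because the link between @name and the filename is arbitrary, code that needs a particular layer should look it up by @name and then follow @file, rather than guess the filename.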

5. Annotation layers

The annotation layers that provide information beyond segmentation, in this release, are uniformly spans with feature structures attached to them, as in the fragment below.

<span id="s_178" from="936" to="940">
  <fs type="lex" xmlns="http://www.tei-c.org/ns/1.0">
    <f name="lex">
      <fs>
        <f name="lemma">noch</f>
        <f name="certainty">0.871970</f>
        <f name="ctag">ADV</f>
      </fs>
    </f>
    <f name="lex">
      <fs>
        <f name="lemma">noch</f>
        <f name="certainty">0.128030</f>
        <f name="ctag">KON</f>
      </fs>
    </f>
  </fs>
</span>

The names of the features are hopefully mostly self-explanatory (‘ctag’ is CES legacy, equivalent to ‘pos’).
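
A span like the one above can be unpacked into a list of competing analyses. The sketch below is an illustration only: the function names are ours, and it assumes that @from and @to are end-exclusive character offsets into the raw text, which fits the example (940 − 936 = 4, the length of “noch”) but is not stated explicitly in this README.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

SPAN_SAMPLE = """<span id="s_178" from="936" to="940">
  <fs type="lex" xmlns="http://www.tei-c.org/ns/1.0">
    <f name="lex"><fs>
      <f name="lemma">noch</f>
      <f name="certainty">0.871970</f>
      <f name="ctag">ADV</f>
    </fs></f>
    <f name="lex"><fs>
      <f name="lemma">noch</f>
      <f name="certainty">0.128030</f>
      <f name="ctag">KON</f>
    </fs></f>
  </fs>
</span>"""


def read_span(span_xml: str) -> Dict[str, object]:
    """Return id, offsets and analyses, most certain analysis first."""
    span = ET.fromstring(span_xml)
    analyses: List[Dict[str, object]] = []
    # "{*}" matches any namespace (Python 3.8 or later).
    for outer in span.findall("{*}fs/{*}f"):
        if outer.get("name") != "lex":
            continue
        features = {
            f.get("name"): (f.text or "").strip()
            for f in outer.findall("{*}fs/{*}f")
        }
        analyses.append({
            "lemma": features.get("lemma"),
            "ctag": features.get("ctag"),
            # A missing certainty is treated as 0.0 in this sketch.
            "certainty": float(features.get("certainty") or 0.0),
        })
    analyses.sort(key=lambda a: a["certainty"], reverse=True)
    return {
        "id": span.get("id"),
        "from": int(span.get("from")),
        "to": int(span.get("to")),
        "analyses": analyses,
    }


def covered_text(raw_text: str, span: Dict[str, object]) -> str:
    """Slice the raw text, assuming end-exclusive character offsets."""
    return raw_text[span["from"]:span["to"]]
```

With the sample above, read_span(SPAN_SAMPLE)["analyses"][0] is the ADV reading, since it carries the higher certainty value.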

Unfortunately, this data set does not exemplify dependency or hierarchical annotations, because we are not free to release them, but this will hopefully change as we move to a new dependency parser soon. We may be free to release Connexor MPT annotations to Connexor licence holders, but that requires clearance, so please contact us if you think you may be eligible.

————————————————————–
Enjoy, we will be grateful for any feedback.
