Release 0.3 of the data set contains a couple of minor changes to the XML specification. In addition, the TreeTagger foundry has been re-created with TreeTagger version 3.2, which fixes issues with the tokenization.
Here’s the link: http://korap.ids-mannheim.de/files/WPD.rootbasett_0.3.tar.bz2
Once again, thanks to the patience and sharp eye of Eliza Margaretha, we have spotted a bug in the data set we released. In places, the TreeTagger foundry had issues caused by a bug in converting certain special quotation marks from UTF-8 to Latin-1 encoding.
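To illustrate why such a conversion can go wrong, here is a minimal Python sketch. It is not the conversion code used for the data set. The sample string and the helper name `latin1_unsafe` are assumptions made for illustration, and the post does not say which quotation marks were actually affected. The point is only that Latin-1 covers the code points U+0000 to U+00FF, so typographic quotes such as „ and “ have no Latin-1 representation, while » and « do.

```python
def latin1_unsafe(text):
    """Return the characters in text that Latin-1 cannot represent."""
    # Latin-1 (ISO 8859-1) maps exactly the code points 0x00..0xFF.
    return [ch for ch in text if ord(ch) > 0xFF]


# Hypothetical sample: German low/high quotes plus guillemets.
sample = "\u201eHallo\u201c \u00bbWelt\u00ab"

unsafe = latin1_unsafe(sample)

# A lossy conversion silently replaces the unsupported quotes with "?".
lossy = sample.encode("latin-1", errors="replace")

print(unsafe)
print(lossy)
```

A strict `sample.encode("latin-1")` would raise `UnicodeEncodeError` instead, which is usually the safer default because the problem surfaces immediately.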
Lucene is probably THE state-of-the-art tool for indexing text. It creates inverted indexes in which, in short, tokens are the keys and their positions are the values, so that any term can be looked up rapidly.
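The idea of an inverted index can be sketched in a few lines of Python. This is a toy illustration of the tokens-to-positions mapping described above, not Lucene's actual data structure or API. The function name `build_inverted_index`, the whitespace tokenization, and the sample documents are all assumptions.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each token to a list of (doc_id, position) pairs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        # Naive lowercase/whitespace tokenization; real analyzers do far more.
        for position, token in enumerate(text.lower().split()):
            index[token].append((doc_id, position))
    return dict(index)


# Hypothetical sample documents.
docs = {
    "d1": "the tree stands by the river",
    "d2": "a river runs past the old tree",
}

index = build_inverted_index(docs)

# Looking up a term is a single dictionary access.
print(index["tree"])
```

Because the token is the key, a lookup does not have to scan the documents themselves, and that is what makes term search fast.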
The KorAP Validator iterates over the given directory or directories and validates data in the internal KorAP format, checking both the XML format and, to a certain extent, the content. It is intended for validating our test data set.
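The directory walk can be pictured with the following Python sketch. It is not the KorAP Validator itself. It only checks that each `.xml` file is well-formed, whereas the real tool also validates against the KorAP format and inspects the content. The function name `find_malformed_xml` and the sample files are assumptions.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def find_malformed_xml(directories):
    """Return (path, error message) for every .xml file that fails to parse."""
    problems = []
    for directory in directories:
        for root, _dirs, files in os.walk(directory):
            for name in sorted(files):
                if not name.endswith(".xml"):
                    continue
                path = os.path.join(root, name)
                try:
                    ET.parse(path)
                except ET.ParseError as err:
                    problems.append((path, str(err)))
    return problems


# Demo on a temporary directory with one good and one broken file.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "good.xml"), "w", encoding="utf-8") as fh:
        fh.write("<text><s>ok</s></text>")
    with open(os.path.join(tmp, "bad.xml"), "w", encoding="utf-8") as fh:
        fh.write("<text><s>unclosed</text>")
    problems = find_malformed_xml([tmp])
    bad_names = [os.path.basename(path) for path, _ in problems]

print(bad_names)
```

Collecting all problems rather than stopping at the first one suits a data set check, since a single run then reports every file that needs attention.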
Warning: the archive contains a snapshot of the full German Wikipedia plus annotations, which is why it is 1.6 GByte in size.