Release 0.3 of the data set contains a couple of minor changes to the XML specification. In addition, the TreeTagger foundry has been re-created with TreeTagger version 3.2, which fixes issues with the tokenization.
Here’s the link: http://korap.ids-mannheim.de/files/WPD.rootbasett_0.3.tar.bz2
And again, thanks to the patience and keen eye of Eliza Margaretha, we have spotted a bug in the data set we released. The TreeTagger foundry had problems in some places, caused by a bug in the conversion of certain special quotation marks from UTF-8 to Latin-1 encoding.
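To illustrate why such a conversion is fragile, here is a minimal Python sketch. It is not the actual conversion code used for the data set; the function names `naive_to_latin1` and `safe_to_latin1` and the mapping `QUOTE_MAP` are hypothetical and exist only for this example. The point is that typographic quotation marks such as „ and “ have no code point in Latin-1, so a naive conversion has to drop or replace them, while one possible workaround is to map them to ASCII quotes first.

```python
# Hypothetical illustration only; not the conversion code used for the data set.

# Typographic quotation marks that cannot be represented in Latin-1 (ISO 8859-1),
# mapped to plain ASCII quotes. The mapping is an assumption made for this sketch.
QUOTE_MAP = str.maketrans({
    "\u201e": '"',  # double low-9 quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u201a": "'",  # single low-9 quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
})


def naive_to_latin1(text: str) -> bytes:
    """Encode directly; unrepresentable characters become '?'."""
    return text.encode("latin-1", errors="replace")


def safe_to_latin1(text: str) -> bytes:
    """Map typographic quotes to ASCII quotes before encoding."""
    return text.translate(QUOTE_MAP).encode("latin-1", errors="replace")


# A German word in typographic quotes; the sharp s (U+00DF) exists in Latin-1.
sample = "\u201eStra\u00dfe\u201c"

print(naive_to_latin1(sample))  # the quotation marks are lost
print(safe_to_latin1(sample))   # the quotation marks survive as ASCII quotes
```

In the naive version the quotation marks are silently turned into question marks, which is the kind of damage that can later confuse a tokenizer. Mapping them beforehand is only one option; keeping the whole pipeline in UTF-8 would avoid the problem altogether.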
Warning: the archive contains a snapshot of the full German Wikipedia plus annotations, which is why its full size is 1.6 GByte.