Progress in NER annotation

Comparing two models for NER (the Huge German Corpus-generalized classifier and the deWaC-generalized classifier), we found that neither performed particularly well on our German Wikipedia corpus. The deWaC-generalized classifier may have produced slightly more intuitive results, but as I did not perform a statistical validation, this is hard to say.

Therefore we decided to run the NER task with both models, allowing the user to evaluate and compare the results, and to store the data of both models in the coreNLP foundry (the name of the model is included in the file name). This foundry is open to extension with other tools provided by the Stanford NLP team.
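
As a minimal sketch of how the model name could be carried over into the foundry file name: this assumes the Stanford German model files german.hgc_175m_600.crf.ser.gz and german.dewac_175m_600.crf.ser.gz (the exact file names depend on the release), and foundryFileName is a hypothetical helper, not part of our pipeline code.

```java
import java.io.File;

public class FoundryNames {

    // Hypothetical helper: derive the per-model foundry file name, e.g.
    // "classifiers/german.dewac_175m_600.crf.ser.gz" -> "ne_dewac_175m_600.xml"
    static String foundryFileName(String modelPath) {
        String base = new File(modelPath).getName();                 // strip directories
        String model = base.replaceFirst("^german\\.", "")           // drop language prefix
                           .replaceFirst("\\.crf\\.ser\\.gz$", "");  // drop model extension
        return "ne_" + model + ".xml";
    }

    public static void main(String[] args) {
        // Prints "ne_hgc_175m_600.xml" and "ne_dewac_175m_600.xml"
        System.out.println(foundryFileName("classifiers/german.hgc_175m_600.crf.ser.gz"));
        System.out.println(foundryFileName("classifiers/german.dewac_175m_600.crf.ser.gz"));
    }
}
```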

The process of Named Entity Recognition and Classification (NER/NEC) is handled by a pipeline which loads the classifier(s) once for the entire annotation process and then iterates over the corpus. For each document, the Stanford NE Recognizer takes the text of the document and the loaded model. The output is written to the coreNLP foundry: one file per model classifier, <ne_[MODEL].xml>, containing the recognized Named Entities; a file <tokens.xml> with absolute token spans; and a file <sentences.xml> with the sentence spans.
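
The following is a minimal sketch of such a pipeline using the Stanford NER API (CRFClassifier.getClassifier and classifyToCharacterOffsets are real library calls; the hard-coded document list, the model path, and the simplified XML layout are stand-ins for our corpus reader and the actual foundry format):

```java
import java.io.PrintWriter;
import java.util.Arrays;
import java.util.List;

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.util.Triple;

public class NerAnnotationPipeline {

    public static void main(String[] args) throws Exception {
        // Stand-in corpus: in the real pipeline the documents come from
        // the German Wikipedia corpus, not from a hard-coded list.
        List<String> documents = Arrays.asList(
                "Angela Merkel besuchte Berlin.",
                "Die Universität Wien liegt in Österreich.");

        // Assumed model file name; repeat the loop below for the second model.
        String modelPath = "classifiers/german.dewac_175m_600.crf.ser.gz";

        // Load the classifier once for the entire annotation run ...
        CRFClassifier<CoreLabel> classifier = CRFClassifier.getClassifier(modelPath);

        int docId = 0;
        for (String text : documents) {
            // ... and iterate over the corpus, one document at a time.
            // classifyToCharacterOffsets yields (label, begin, end) triples
            // with absolute character offsets into the document text.
            List<Triple<String, Integer, Integer>> entities =
                    classifier.classifyToCharacterOffsets(text);

            // One output file per model classifier; the XML layout here is a
            // simplified stand-in for the actual foundry format.
            String outFile = "doc" + docId + "_ne_dewac_175m_600.xml";
            try (PrintWriter out = new PrintWriter(outFile, "UTF-8")) {
                out.println("<entities>");
                for (Triple<String, Integer, Integer> e : entities) {
                    out.printf("  <ne type=\"%s\" from=\"%d\" to=\"%d\"/>%n",
                            e.first(), e.second(), e.third());
                }
                out.println("</entities>");
            }
            docId++;
        }
    }
}
```

Loading the model once and reusing it across all documents matters in practice: deserializing a CRF model is far more expensive than classifying a single document, so per-document loading would dominate the runtime on a large corpus.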