readme.md

Creating a TEI corpus file from XML documents of the Bulgarian National Corpus

This tool is intended for data from the Bulgarian National Corpus. It contains two Python scripts, bunc2tei.py and escape_amp.py.

Usage

Converting the XML files and storing them in one corpus file

The script bunc2tei.py can be executed with the command bunc2tei.py *.xml > corpus.p5.xml.

The main function creates a new XML tree with the element teiCorpus as its root. The corpus file is built from this tree.
To build the corpus, each XML file is first passed to the function extract_data, which tries to parse it. If parsing fails, the file is faulty and must be repaired first with the script escape_amp.py, as explained in the next section. If parsing succeeds, extract_data stores the metadata and text elements in a dictionary, which is then passed to the function create_tree.
The function create_tree creates a new XML tree with the element teiDoc as its root, inserts all relevant elements at the desired positions and fills them with the data from the dictionary.
The resulting trees are appended to the corpus tree.
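
The flow described above can be sketched roughly as follows. This is a minimal illustration and not the actual code of bunc2tei.py: only the names extract_data, create_tree, teiCorpus and teiDoc come from this readme. The input layout (a title and a text element), the dictionary keys, the teiHeader and p elements, and the helper build_corpus are assumptions made for the example, and the sketch works on strings rather than files to stay self-contained.

```python
import xml.etree.ElementTree as ET


def extract_data(xml_string):
    """Parse one document; return a dict of its data, or None if parsing fails."""
    try:
        root = ET.fromstring(xml_string)
    except ET.ParseError:
        # Faulty document: it has to be repaired before it can be converted.
        return None
    # Assumed input layout: a <title> and a <text> element somewhere below the root.
    return {
        "title": root.findtext(".//title", default=""),
        "text": root.findtext(".//text", default=""),
    }


def create_tree(data):
    """Build a teiDoc element and fill it with the extracted data."""
    doc = ET.Element("teiDoc")
    header = ET.SubElement(doc, "teiHeader")  # assumed target layout
    ET.SubElement(header, "title").text = data["title"]
    text = ET.SubElement(doc, "text")
    ET.SubElement(text, "p").text = data["text"]
    return doc


def build_corpus(xml_strings):
    """Hypothetical helper: collect one teiDoc per parsable document under teiCorpus."""
    corpus = ET.Element("teiCorpus")
    skipped = 0
    for xml_string in xml_strings:
        data = extract_data(xml_string)
        if data is None:
            skipped += 1
            continue
        corpus.append(create_tree(data))
    return corpus, skipped
```

In this sketch a document that cannot be parsed is counted and skipped, so the caller can see how many inputs still need to be repaired with escape_amp.py.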

Additional preprocessing for unescaped symbols

The script escape_amp.py repairs XML files that cannot be parsed because they contain unescaped symbols. In the Bulgarian National Corpus, this was the case for ampersands inside text elements. The script can be executed with the command escape_amp.py *.xml.

The script tries to parse each XML file. If parsing fails, the file is passed to the function escape.
The function escape uses a regular expression to find every & symbol that is not yet escaped and that does not itself begin the escape sequence of another symbol (for example &lt;).
Each unescaped & symbol found is then replaced by its escaped variant, &amp;, inside the file.
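
A regular expression of the kind described above might look like the following. The pattern is an assumption for illustration and may differ from the one used in escape_amp.py. The helper repair is hypothetical and works on strings instead of files. The name escape is taken from this readme.

```python
import re
import xml.etree.ElementTree as ET

# Match "&" unless it already starts a reference such as &amp;, &#38; or &#x26;.
UNESCAPED_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)


def escape(text):
    """Replace every unescaped ampersand with its escaped variant."""
    return UNESCAPED_AMP.sub("&amp;", text)


def repair(xml_string):
    """Hypothetical helper: escape ampersands only if the document fails to parse."""
    try:
        ET.fromstring(xml_string)
    except ET.ParseError:
        return escape(xml_string)
    return xml_string
```

The negative lookahead leaves existing references untouched, so running escape twice gives the same result as running it once. One limitation of this sketch: a sequence that merely looks like a named reference, such as &foo;, is also left alone and would still cause a parse error.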