commit | 97c3e5373df62a4cab0925438f0e21636378ceaf | |
---|---|---|
author | lora-sp <lora.spassova@swhk.ids-mannheim.de> | Thu Apr 06 11:44:50 2023 +0200 |
committer | lora-sp <lora.spassova@swhk.ids-mannheim.de> | Thu Apr 06 11:44:50 2023 +0200 |
tree | c4ebea4cd8894a11bd73bec385e409d36dabee40 | |
parent | 011209887ad0d8ab90ead02d69500f1bf19f97cc | |
updated readme Change-Id: Iec7b252cf2bc0c50fe565c8eb2dea3728bda7dd8
This tool is intended for use on data from the Bulgarian National Corpus. It contains two Python scripts, `bunc2tei.py` and `escape_amp.py`.
The script `bunc2tei.py` can be executed with the command `bunc2tei.py *.xml > corpus.p5.xml`.
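The invocation above hands a list of XML files to the script and redirects its standard output into the corpus file. As a rough illustration of that flow, the following sketch parses each file, collects a few values in a dictionary, and appends one `teiDoc` per file to a `teiCorpus` root. It is not the actual `bunc2tei.py` code: the function names `extract_data` and `create_tree` mirror the ones this README mentions, while `build_corpus`, the source element names `title` and `text`, and the layout inside `teiDoc` are assumptions made for the example.

```python
import sys
import xml.etree.ElementTree as ET


def extract_data(path):
    """Parse one input file; return a dict, or None if the XML is malformed."""
    try:
        root = ET.parse(path).getroot()
    except ET.ParseError:
        return None  # the file has to be repaired before it can be converted
    # Assumed source layout: a <title> and a <text> element below the root.
    return {
        "title": root.findtext(".//title", default=""),
        "text": root.findtext(".//text", default=""),
    }


def create_tree(data):
    """Build a teiDoc element filled with the values from the dictionary."""
    doc = ET.Element("teiDoc")
    header = ET.SubElement(doc, "teiHeader")  # assumed target layout
    ET.SubElement(header, "title").text = data["title"]
    body = ET.SubElement(ET.SubElement(doc, "text"), "body")
    ET.SubElement(body, "p").text = data["text"]
    return doc


def build_corpus(paths):
    """Collect one teiDoc per parsable file under a teiCorpus root."""
    corpus = ET.Element("teiCorpus")
    for path in paths:
        data = extract_data(path)
        if data is None:
            print(f"skipping unparsable file: {path}", file=sys.stderr)
            continue
        corpus.append(create_tree(data))
    return corpus
```

Serializing the result with `ET.tostring(build_corpus(sys.argv[1:]), encoding="unicode")` and writing it to standard output would match the redirection used in the command above.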
An XML tree is first created with the element `teiCorpus` as its root. The corpus file will be built from it. Each input file is then passed to the function `extract_data`.

`extract_data` tries to parse the XML file. If the parsing fails, the XML file is faulty and needs to be repaired first. This is done by the script `escape_amp.py` and will be explained in the next section. If the parsing succeeds, the relevant data is extracted by saving the metadata and text elements in a dictionary. The dictionary is then passed to the function `create_tree`.

`create_tree` creates a new XML tree with the element `teiDoc` as its root. All relevant elements are then inserted at the desired position in the XML tree and filled with the data from the dictionary.

The script `escape_amp.py` aims at fixing XML files that cannot be processed due to unescaped symbols. In the Bulgarian National Corpus, this was the case for ampersand symbols inside text elements. The script can be executed using `escape_amp.py *.xml`.
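To illustrate the failure mode, the short sketch below shows how a bare ampersand inside a text element makes a standard XML parser reject the input, while the escaped form parses normally. The helper `is_well_formed` and the sample strings are made up for this example and are not part of the tool.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the string parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


broken = "<text>Tom & Jerry</text>"      # bare ampersand: not well-formed
fixed = "<text>Tom &amp; Jerry</text>"   # escaped ampersand: parses fine

print(is_well_formed(broken))  # False
print(is_well_formed(fixed))   # True
```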
Each file is passed to the function `escape`. `escape` uses a regular expression to look for all `&` symbols that are not yet escaped and that are not themselves used to escape another symbol. These `&` symbols are replaced by their escaped variant, `&amp;`, inside the file.
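One possible way to express such a search is a negative lookahead: match every `&` that is not followed by something that looks like a named, decimal, or hexadecimal reference ending in `;`. The pattern below is an assumption about how this could be done, not the expression used in `escape_amp.py`. The helper `escape_file` and the UTF-8 encoding are likewise assumptions.

```python
import re

# An ampersand that does NOT start a reference such as &amp;, &#38; or &#x26;.
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape(text):
    """Replace every bare ampersand in the string with &amp;."""
    return BARE_AMP.sub("&amp;", text)


def escape_file(path):
    """Rewrite the file in place if it contains bare ampersands (assumes UTF-8)."""
    with open(path, encoding="utf-8") as f:
        original = f.read()
    fixed = escape(original)
    if fixed != original:
        with open(path, "w", encoding="utf-8") as f:
            f.write(fixed)
```

Note that this pattern treats anything of the form `&name;` as an existing reference, so text such as `AT&T;` would be left untouched even though it is not a valid entity.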