commit | 52f1a295b8c4f92ab8544631868dc9ffeccdd7b3 | |
---|---|---|
author | lora-sp <lora.spassova@swhk.ids-mannheim.de> | Fri Mar 17 10:31:50 2023 +0100 |
committer | Lora Spassova <lora.spassova@swhk.ids-mannheim.de> | Mon Mar 27 09:47:09 2023 +0200 |
tree | a70fe9c223ee376e1bb9d7ec771e5cbb252c4c50 | |
parent | a158640b30d861640773ac79f62d575e8e659c36 | |

added readme

Change-Id: I595ac3767e58d81148ba145516ccf16bb47ed044

This tool is to be used on data from the Bulgarian National Corpus. It contains two script files, `bunc2tei.py` and `escape_amp.py`.

The script `bunc2tei.py` can be executed with the command `bunc2tei.py *.xml` inside a directory containing the data.
The script first creates a tree with `teiCorpus` as its root and stores it as `tree_structure` in a new folder called `input`. This file will serve as the tree from which the new corpus will be built.

Each XML file is then passed to `convert`. `convert` tries to parse the XML file. If the parsing fails, the XML file is faulty and needs to be cleaned first. This is done by the script `escape_amp.py` and is explained in the next section. If the parsing succeeds, the file is converted and appended to the corpus.

The resulting corpus is stored in the file `corpus.p5.xml` inside the folder `output`.

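The parse-then-append step described above can be sketched as follows. This is a minimal illustration and not the code of `bunc2tei.py`: the helper names `try_parse` and `build_corpus` are hypothetical, the standard-library `xml.etree.ElementTree` parser is an assumption, and the actual TEI conversion is replaced by a placeholder that appends the parsed element unchanged.

```python
import xml.etree.ElementTree as ET


def try_parse(path):
    """Return the parsed root element, or None if the file is not well-formed."""
    try:
        return ET.parse(path).getroot()
    except ET.ParseError:
        return None


def build_corpus(paths):
    """Collect every parseable file under a teiCorpus root and list the rest."""
    corpus = ET.Element("teiCorpus")
    failed = []
    for path in paths:
        root = try_parse(path)
        if root is None:
            # Faulty file: it has to be cleaned first, e.g. with escape_amp.py.
            failed.append(path)
            continue
        # Placeholder for the real conversion step (assumption):
        # the parsed element is appended as it is.
        corpus.append(root)
    return corpus, failed
```

Returning the list of files that failed to parse makes it easy to see which ones still need to be run through `escape_amp.py` before a second attempt.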
The script `escape_amp.py` aims at fixing XML files that cannot be processed due to unescaped symbols. In the Bulgarian National Corpus, this was the case for ampersand symbols inside text elements. The script can be executed using `escape_amp.py *.xml` inside a directory containing the data.
Each file is passed to `escape`. `escape` uses a regular expression to look for all `&` symbols that are not yet escaped and that are not themselves used to escape another symbol. These `&` symbols are replaced by their escaped variant inside the file.
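
One way to express such a regular expression is a negative lookahead that skips every `&` already starting an entity or character reference. The pattern and the function name `escape_ampersands` below are assumptions made for illustration; the expression actually used in `escape_amp.py` may differ.

```python
import re

# Matches "&" unless it already begins a reference such as &amp;, &#38; or &#x26;.
UNESCAPED_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_ampersands(text):
    """Replace every bare "&" with "&amp;" and leave existing references alone."""
    return UNESCAPED_AMP.sub("&amp;", text)
```

For example, `escape_ampersands("A & B &lt; C")` leaves `&lt;` untouched and only rewrites the bare ampersand. Running the function a second time changes nothing, so files that are already clean are not affected. A bare ampersand that happens to be followed by letters and a semicolon (as in `AT&T;`) would be mistaken for an entity and left as it is.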