added readme

Change-Id: I595ac3767e58d81148ba145516ccf16bb47ed044
1 file changed
tree: a70fe9c223ee376e1bb9d7ec771e5cbb252c4c50
  1. input/
  2. output/
  3. bunc2tei.py
  4. dnevnik.bg - 2020-01-01.xml
  5. dnevnik.bg - 2020-01-02.xml
  6. dnevnik.bg - 2020-01-03.xml
  7. dnevnik.bg - 2020-01-04.xml
  8. dnevnik.bg - 2020-01-05.xml
  9. dnevnik.bg - 2020-01-06.xml
  10. dnevnik.bg - 2020-01-07.xml
  11. dnevnik.bg - 2020-01-08.xml
  12. dnevnik.bg - 2020-01-09.xml
  13. dnevnik.bg - 2020-01-10.xml
  14. escape_amp.py
  15. fakti.bg - 2020-01-01.xml
  16. fakti.bg - 2020-01-02.xml
  17. fakti.bg - 2020-01-03.xml
  18. fakti.bg - 2020-01-04.xml
  19. fakti.bg - 2020-01-05.xml
  20. fakti.bg - 2020-01-06.xml
  21. fakti.bg - 2020-01-07.xml
  22. fakti.bg - 2020-01-08.xml
  23. fakti.bg - 2020-01-09.xml
  24. fakti.bg - 2020-01-10.xml
  25. ill-formed_docs.txt
  26. investor.bg - 2020-01-01.xml
  27. investor.bg - 2020-01-02.xml
  28. investor.bg - 2020-01-03.xml
  29. investor.bg - 2020-01-04.xml
  30. investor.bg - 2020-01-05.xml
  31. investor.bg - 2020-01-06.xml
  32. investor.bg - 2020-01-07.xml
  33. investor.bg - 2020-01-08.xml
  34. investor.bg - 2020-01-09.xml
  35. investor.bg - 2020-01-10.xml
  36. marica.bg - 2020-01-01.xml
  37. marica.bg - 2020-01-02.xml
  38. marica.bg - 2020-01-03.xml
  39. marica.bg - 2020-01-04.xml
  40. marica.bg - 2020-01-05.xml
  41. marica.bg - 2020-01-06.xml
  42. marica.bg - 2020-01-07.xml
  43. marica.bg - 2020-01-08.xml
  44. marica.bg - 2020-01-09.xml
  45. marica.bg - 2020-01-10.xml
  46. readme.md
  47. svobodnaevropa.bg - 2020-01-01.xml
  48. svobodnaevropa.bg - 2020-01-02.xml
  49. svobodnaevropa.bg - 2020-01-03.xml
  50. svobodnaevropa.bg - 2020-01-04.xml
  51. svobodnaevropa.bg - 2020-01-05.xml
  52. svobodnaevropa.bg - 2020-01-06.xml
  53. svobodnaevropa.bg - 2020-01-07.xml
  54. svobodnaevropa.bg - 2020-01-08.xml
  55. svobodnaevropa.bg - 2020-01-09.xml
  56. svobodnaevropa.bg - 2020-01-10.xml
  57. webcafe.bg - 2020-01-01.xml
  58. webcafe.bg - 2020-01-02.xml
  59. webcafe.bg - 2020-01-03.xml
  60. webcafe.bg - 2020-01-04.xml
  61. webcafe.bg - 2020-01-05.xml
  62. webcafe.bg - 2020-01-06.xml
  63. webcafe.bg - 2020-01-07.xml
  64. webcafe.bg - 2020-01-08.xml
  65. webcafe.bg - 2020-01-09.xml
  66. webcafe.bg - 2020-01-10.xml
readme.md

Creating a TEI corpus file out of xml documents

This tool is intended for data from the Bulgarian National Corpus. It consists of two scripts, bunc2tei.py and escape_amp.py.

Usage

Converting the xml files and storing them into one corpus file

The script bunc2tei.py can be executed with the command bunc2tei.py *.xml inside a directory containing the data.

  1. A new xml file is created in the main function, with the element teiCorpus as its root.
  2. The corpus file is parsed and saved as the file tree_structure in a new folder called input. This file serves as the tree from which the new corpus is built.
  3. To build the corpus, each xml file is first passed to the function convert, which tries to parse it. If parsing fails, the xml file is ill-formed and needs to be cleaned first; this is done by the script escape_amp.py, as explained in the next section. If parsing succeeds, the file is converted and appended to the corpus:
    1. To convert the file, the data (text metadata, text elements) is first stored in a list.
    2. The tree structure of the xml file is then deleted.
    3. The new tree structure is built and the data is inserted into the correct position.
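
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code of bunc2tei.py: the source element names (`doc`, `title`, `p`), the helper `build_corpus`, and the exact TEI layout are assumptions made for the example, since the readme does not describe the input schema.

```python
import xml.etree.ElementTree as ET


def convert(xml_string):
    """Return a TEI element for one source document, or None if it is ill-formed."""
    try:
        source = ET.fromstring(xml_string)
    except ET.ParseError:
        return None  # ill-formed input: it would have to be cleaned first

    # Step 1: store the data (metadata and text) in plain Python values.
    # The element names "title" and "p" are hypothetical.
    title = source.findtext("title", default="")
    paragraphs = [p.text or "" for p in source.iter("p")]

    # Steps 2 and 3: discard the old tree and build a new one,
    # inserting the stored data at the right positions.
    tei = ET.Element("TEI")
    header = ET.SubElement(tei, "teiHeader")
    file_desc = ET.SubElement(header, "fileDesc")
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = title
    body = ET.SubElement(ET.SubElement(tei, "text"), "body")
    for paragraph in paragraphs:
        ET.SubElement(body, "p").text = paragraph
    return tei


def build_corpus(documents):
    """Build a teiCorpus element from (name, xml_string) pairs (hypothetical helper)."""
    corpus = ET.Element("teiCorpus")
    ill_formed = []
    for name, xml_string in documents:
        tei = convert(xml_string)
        if tei is None:
            ill_formed.append(name)  # remember files that need cleaning
        else:
            corpus.append(tei)
    return corpus, ill_formed
```

Keeping the extracted data in plain values before building the new tree means the conversion never depends on the layout of the old tree once the data has been read.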

The resulting corpus file is stored in the file corpus.p5.xml inside the folder output.
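
Writing the result could look like the sketch below. The function name `write_corpus` and its parameters are hypothetical; only the folder name output and the file name corpus.p5.xml come from the description above.

```python
import os
import xml.etree.ElementTree as ET


def write_corpus(corpus, directory="output", filename="corpus.p5.xml"):
    """Serialize a corpus element to directory/filename and return the path."""
    os.makedirs(directory, exist_ok=True)  # create the folder if it is missing
    path = os.path.join(directory, filename)
    # Write UTF-8 with an XML declaration so Cyrillic text survives unchanged.
    ET.ElementTree(corpus).write(path, encoding="utf-8", xml_declaration=True)
    return path
```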

Additional preprocessing for unescaped symbols

The script escape_amp.py fixes xml files that cannot be parsed because of unescaped symbols. In the Bulgarian National Corpus, this was the case for ampersand symbols inside text elements. The script can be executed using escape_amp.py *.xml inside a directory containing the data.

  1. The script tries to parse the xml files. If the parsing fails, the file is passed to the function escape.
  2. Using a regular expression, escape finds every & symbol that is neither already escaped nor itself part of an escape sequence for another symbol.
  3. The unescaped &-symbols are replaced by their escaped variant inside the file.
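
One way to express this is a negative lookahead, as sketched below. The pattern and the helper `clean` are assumptions made for illustration; the regular expression actually used in escape_amp.py may differ.

```python
import re
import xml.etree.ElementTree as ET

# An ampersand that does NOT start an entity reference (&name;) or a
# character reference (&#38; or &#x26;). This pattern is an assumption.
UNESCAPED_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)"
)


def escape(text):
    """Replace every bare ampersand with its escaped form &amp;."""
    return UNESCAPED_AMP.sub("&amp;", text)


def clean(text):
    """Return text unchanged if it parses, otherwise with bare ampersands escaped."""
    try:
        ET.fromstring(text)
        return text
    except ET.ParseError:
        return escape(text)
```

For example, `escape("<p>A & B &amp; C</p>")` leaves the existing `&amp;` alone and only rewrites the bare ampersand between A and B. Named references other than the five predefined XML entities are left untouched by this pattern, so a file that contains them may still fail to parse.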