updated readme
Change-Id: Iec7b252cf2bc0c50fe565c8eb2dea3728bda7dd8
diff --git a/.Rhistory b/.Rhistory
new file mode 100644
index 0000000..e69de29
--- /dev/null
+++ b/.Rhistory
diff --git a/readme.md b/readme.md
index 423f904..4daea4c 100644
--- a/readme.md
+++ b/readme.md
@@ -1,23 +1,20 @@
-## Creating a TEI corpus file out of xml documents
+## Creating a TEI corpus file out of XML documents from the Bulgarian National Corpus
-This tool is to be used on data from the Bulgarian National Corpus. It contains two script files,`bunc2tei.py` and `escape_amp.py`.
+This tool is intended for data from the Bulgarian National Corpus. It contains two Python scripts, `bunc2tei.py` and `escape_amp.py`.
### Usage
#### Converting the xml files and storing them into one corpus file
-The script `bunc2tei.py` can be executed with the command `bunc2tei.py *.xml` inside a directory containing the data.
-1. A new xml file is created in the main function, with the element `teiCorpus` as its root.
-2. The corpus file parsed as such and saved into the file `tree_structure` in a new folder called `input`. This file will serve as the tree from which the new corpus will be built.
-3. To build the corpus, each xml file is first passed to the function `convert`. `Convert` tries parsing the xml file. If the parsing fails, the xml file is faulty and needs to be cleaned first. This is done by the script `escape_amp.py` and will be explained in the next section. If the parsing succeeds, the file is converted and appended to the corpus:
- 1. To convert the file, the data (text metadata, text elements) is first stored in a list.
- 2. The tree structure of the xml file is then deleted.
- 3. The new tree structure is built and the data is inserted into the correct position.
-
-The resulting corpus file is stored in the file `corpus.p5.xml` inside the folder `output`.
+The script `bunc2tei.py` can be executed with the command `bunc2tei.py *.xml > corpus.p5.xml`, which redirects the resulting corpus into the file `corpus.p5.xml`.
+1. A new XML tree is created in the main function, with the element `teiCorpus` as its root. The corpus file is built from this tree.
+2. To build the corpus, each XML file is first passed to the function `extract_data`, which tries to parse it. If parsing fails, the file is faulty and must be repaired first; this is done by the script `escape_amp.py`, described in the next section. If parsing succeeds, the metadata and text elements are extracted and stored in a dictionary, which is then passed to the function `create_tree`.
+3. `create_tree` creates a new XML tree with the element `teiDoc` as its root. All relevant elements are inserted at the appropriate positions in this tree and filled with the data from the dictionary.
+4. Each resulting tree is appended to the corpus tree.
+
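The steps above can be sketched roughly as follows. This is a minimal illustration and not the actual code of `bunc2tei.py`: it assumes the standard library module `xml.etree.ElementTree`, and the input element names (`meta/title`, `p`), the elements placed inside `teiDoc` (`teiHeader`, `title`, `text`) and the helper `build_corpus` are hypothetical placeholders.

```python
import xml.etree.ElementTree as ET


def extract_data(xml_string):
    """Parse one document and return its data as a dict, or None if parsing fails."""
    try:
        root = ET.fromstring(xml_string)
    except ET.ParseError:
        # Faulty file: it has to be repaired before it can be converted.
        return None
    return {
        # Hypothetical input layout: <doc><meta><title/></meta><text><p/></text></doc>
        "title": root.findtext("meta/title", default=""),
        "paragraphs": [p.text or "" for p in root.iter("p")],
    }


def create_tree(data):
    """Build a new tree rooted at teiDoc and fill it with the extracted data."""
    doc = ET.Element("teiDoc")
    header = ET.SubElement(doc, "teiHeader")  # hypothetical child elements
    ET.SubElement(header, "title").text = data["title"]
    text = ET.SubElement(doc, "text")
    for paragraph in data["paragraphs"]:
        ET.SubElement(text, "p").text = paragraph
    return doc


def build_corpus(documents):
    """Append one teiDoc tree per parsable document to a teiCorpus root."""
    corpus = ET.Element("teiCorpus")
    for xml_string in documents:
        data = extract_data(xml_string)
        if data is not None:
            corpus.append(create_tree(data))
    return corpus
```

In this sketch, a document that cannot be parsed is simply skipped; the result of `build_corpus` can be serialized with `ET.tostring(corpus, encoding="unicode")` and printed, so that the shell redirection writes it to the corpus file.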
#### Additional preprocessing for unescaped symbols
-The script `escape_amp.py` aims at fixing xml files that cannot be processed due to unescaped symbols. In the Bulgarian National Corpus, this was the case for ampersand symbols inside text elements. The script can be executed using `escape_amp.py *.xml` inside a directory containing the data.
+The script `escape_amp.py` repairs XML files that cannot be parsed because of unescaped symbols. In the Bulgarian National Corpus, this was the case for ampersand symbols inside text elements. The script can be executed using `escape_amp.py *.xml`.
1. The script tries to parse the xml files. If the parsing fails, the file is passed to the function `escape`.
2. `escape` uses a regular expression to find every `&` symbol that is neither already escaped nor itself part of an escape sequence for another symbol.
3. The unescaped `&`-symbols are replaced by their escaped variant inside the file.
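The replacement step could look roughly like the sketch below. The pattern and the function name `escape_ampersands` are assumptions made for illustration; the regular expression actually used in `escape_amp.py` may differ. The idea is a negative lookahead: an `&` is left alone if it already starts a named or numeric entity reference, and is replaced by `&amp;` otherwise.

```python
import re

# An '&' that is NOT followed by a named entity (e.g. amp;), a decimal
# character reference (e.g. #38;) or a hexadecimal one (e.g. #x26;).
UNESCAPED_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)"
)


def escape_ampersands(text):
    """Replace every unescaped '&' in text with '&amp;'."""
    return UNESCAPED_AMP.sub("&amp;", text)
```

Note that this sketch works on the raw text of a file and does not distinguish ampersands inside CDATA sections or comments, which is acceptable only if the data contains neither.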