commit | a158640b30d861640773ac79f62d575e8e659c36 | |
---|---|---|
author | lora-sp <lora.spassova@swhk.ids-mannheim.de> | Mon Mar 13 15:58:30 2023 +0100 |
committer | Lora Spassova <lora.spassova@swhk.ids-mannheim.de> | Fri Mar 17 08:53:43 2023 +0100 |
tree | fa740b3e3dba75785842d149eb049b1fcc7e36ce | |
parent | 688e8461a9a46a1bcee84f1fd9255ee9570dbaa3 | |

Converting xml files and merging them into one corpus file

Change-Id: I0fbdd1e89658523c0f4bbcda73b41af7e277c2f8

diff --git a/ill-formed_docs.txt b/ill-formed_docs.txt
new file mode 100644
index 0000000..a058fc0
--- /dev/null
+++ b/ill-formed_docs.txt
@@ -0,0 +1,4 @@
+Ill-formed documents:
+
+- 132 instances of unescaped "&" in text-elements
+- doc "investor.bg - 2020-01-04.xml" contains ill-formed line "<p><div</p>" (line 168)
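
Both kinds of problem recorded in ill-formed_docs.txt can be detected mechanically before the documents are merged. The following is a minimal Python sketch of one way to do that; it is not taken from this commit. The names `BARE_AMP`, `escape_bare_ampersands` and `is_well_formed` are hypothetical, and the sketch assumes that the only entity references in the sources are the five predefined XML entities and numeric character references. Any other `&` is treated as unescaped text.

```python
import re
import xml.etree.ElementTree as ET

# An "&" that does not start a predefined or numeric entity reference.
# Assumption: the source documents use no custom named entities.
BARE_AMP = re.compile(r"&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(xml_text):
    """Return (fixed_text, number_of_replacements)."""
    return BARE_AMP.subn("&amp;", xml_text)


def is_well_formed(xml_text):
    """Return True if the text parses as a single XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    sample = "<doc><p>A & B &amp; C</p></doc>"
    fixed, count = escape_bare_ampersands(sample)
    print(count, is_well_formed(sample), is_well_formed(fixed))
```

Escaping only repairs the ampersand case. A line such as `<p><div</p>` still fails `is_well_formed` afterwards and has to be corrected by hand.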