Ill-formed documents:

- 132 instances of unescaped "&" in text-elements (fixed)
- doc "investor.bg - 2020-01-04.xml" contains ill-formed line "<p><div</p>" (line 168)
- doc "svobodnaevropa.bg - 2020-01-04.xml" lacks author name for second text
- doc "webcafe.bg - 2020-01-10.xml" lacks at least one author name
- in all of the 10 docs from dnevnik.bg, there is a string of the following form:
[class*="general-article"] .article-content > p:first-of-type::first-letter { float: none; font-size: 17px; line-height: 1.42em; padding: 0; }
it can be found by the command grep -e "\[.*\}" *.xml
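
The checks in the list above could be scripted roughly as follows. This is only a sketch, not the procedure actually used on the corpus: the function name check_document and the problem labels are made up for illustration, the ampersand pattern is an assumption about what counts as "unescaped", and the CSS pattern is a Python rendering of the grep expression above.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: an "&" counts as unescaped when it does not start a named
# entity (&amp;) or a numeric character reference (&#38; / &#x26;).
UNESCAPED_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")

# Python rendering of the grep pattern: "[" followed later by "}" on one line.
CSS_RESIDUE = re.compile(r"\[.*\}")


def check_document(text):
    """Return a list of (line_number, problem) tuples for one XML document."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if UNESCAPED_AMP.search(line):
            problems.append((number, "unescaped ampersand"))
        if CSS_RESIDUE.search(line):
            problems.append((number, "CSS residue"))
    # Let the standard-library parser confirm well-formedness; it stops at
    # the first error, so at most one such entry is reported per document.
    try:
        ET.fromstring(text)
    except ET.ParseError as error:
        problems.append((error.position[0], "not well-formed"))
    return problems
```

To run it over the corpus, one would read each *.xml file as text and pass the contents to check_document. Missing author names are not covered here, since the element that holds the author is not named in these notes.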