2.7.3 2026-04-03
        - Upgrade KorAP-Tokenizer to v2.4.1.
        - KorAP-Tokenizer now fixes Unicode surrogate-pair handling in
          German gender-sensitive forms to avoid crashes on unmatched
          characters.
        - Restart tokenizer on failure.

2.7.2 2026-03-05
        - Fix XML parser error caused by elements (e.g. <ref>) whose
          attributes span multiple lines.
        - Progress bar now writes directly to /dev/tty (CON on Windows)
          instead of stderr, so it is not captured by log redirection.
          Automatically disabled when no controlling terminal is available
          (e.g. detached container or CI environment).

2.7.1 2026-03-05
        - Fix parser error when closing body and text tags
          appear on the same line.

2.7.0 2026-03-03
        - Upgrade KorAP-Tokenizer to v2.4.0
          with fixes for soft hyphens, thousands separators, and
          support for German gender-sensitive spelling forms
          separated by colons, slashes, and brackets.

2.6.2 2025-12-10
        - Upgrade KorAP-Tokenizer to v2.3.0 (resolves issues with
          gendersternchen after hyphens, emoji clusters, and Wikipedia templates).
        - Upgrade Java dependency to 21.
        - Added --progress option.

2.6.1 2025-04-16
        - Fix ASCII entity resolution.
        - Make KorAP-Tokenizer heap size configurable via environment
          variable KORAPXMLTEI_TOKENIZER_HEAP_SIZE.

2.6.0 2024-11-11
        - Add -o parameter.
        - Add support for inline dependency relations.
        - Add support for --auto-textsigle.
        - Add support for multiple input files.

2.5.0 2024-01-24
        - Upgrade minimum Perl version to 5.36 to improve
          Unicode handling.
        - Upgrade KorAP-Tokenizer to v2.2.5 and Java to 17 to
          improve Unicode handling.

2.4.4 2023-04-25
        - Allow line breaks in text-only lines.

2.4.3 2023-03-02
        - Allow closing elements to start with "text".

2.4.2 2023-02-10
        - Improve checks for numerical annotation bounds.

2.4.1 2023-02-07
        - Fix test.

2.4.0 2023-02-07
        - Conversion of standard TEI P5 should now work, at least
          in some cases.
        - Option --xmlid-to-textsigle <from-regex>@<to-c/to-d/to-t>
          added to convert standard P5 text id attributes to I5
          sigles with three parts.
        - Add --no-tokenizer parameter as a requirement
          for relying on inline tokens only.

2.3.4 2022-11-09
        - Improve stability of XML entity replacement.
        - Check version for script and KorAP-Tokenizer
          library when requested.

2.3.3 2022-03-30
        - Load KorAP-Tokenizer only on request.

2.3.2 2022-03-23
        - Do not reference metadata.xml
        - Remove schema references from header files.
        - Improve test suite for inability to use
          KorAP-Tokenizer.

2.3.1 2022-01-14 Release
        - Improve script handling of broken data
        - Improve handling of unknown header types
        - Check for valid sigles to avoid broken directories
        - Introduce exclusivity for inline tokens handling.
        - Use single dash for STDIN.
        - Update KorAP-Tokenizer to v2.2.2 (single quote, "du." bug fixes)

2.2.0 2021-08-26 Release
        - Remove unnecessary branch in recursive call
        - Support inline-structures parameter
        - Introduce --base-foundry, --data-file, and --header-file parameters
        - Introduce --tokens-file parameter
        - Introduce --skip-inline-tokens parameter
        - Minor cleanups and improvements
        - Introduce --skip-inline-tags parameter
        - Introduce KorAP::XML::TEI::Inline class
        - Introduce --skip-inline-token-annotations parameter
        - Deprecate KORAPXMLTEI_INLINE environment variable
          in favor of --skip-inline-token-annotations

1.0.0 2021-02-18 Release
        - -s option added that uses sentence boundaries
          provided by the KorAP tokenizer (-tk)
        - Tokenizer invocation comments removed from KorAP XML output
        - Indentation of </span> tags fixed
        - Character entities used in DeReKo are automatically
          replaced by their corresponding characters
        - Resources defined in Makefile
        - Fixed possible IO deadlock with KorAP tokenizer
        - Simplified debugging by combining with X::C::T line numbers
        - Support inline-tokens parameter
        - Move verbose code documentation to trailing
          script section

0.03 2021-01-12
        - Update KorAP-Tokenizer to released 2.0 version
        - Improve test suite for recent version
          of Mojolicious.

0.02 2020-11-27
        - Update KorAP-Tokenizer to v2.0.0.
        - Switch input encoding based on XML
          processing instruction.
        - Fix handling of UTF-8 in sigles.

0.01 2020-09-28
        - Initial release to GitHub.