Update Readme.md

Change-Id: I8d189a3f218ba6e368faff2134891aea49b4762a

commit: 00d894aebc009bc6914e5188c39d6a2d64fcfb9e
author: Marc Kupietz <kupietz@ids-mannheim.de> Fri Apr 10 14:45:17 2026 +0200
committer: Marc Kupietz <kupietz@ids-mannheim.de> Fri Apr 10 14:51:13 2026 +0200
tree: eaa59345d2d9dfa0538c7f05efbbd0b3b0b7f50f
parent: 804750def74b4a5b9ce607e375042628b9ff5d69
diff --git a/Readme.md b/Readme.md
index 9acb989..e4cb61b 100644
--- a/Readme.md
+++ b/Readme.md

@@ -2,19 +2,64 @@
 
 [![Docker](https://img.shields.io/docker/v/korap/conllu-cmc?label=Docker&sort=semver)](https://hub.docker.com/r/korap/conllu-cmc)
 
-Reads CoNLL-U format from stdin and annotates emojis, emoticons, hashtags, URLs, email addresses, action words, @names, and Wikipedia emoji templates with their corresponding STTS-IBK POS tag (Beißwenger/Bartsch/Evert/Würzner 2016). Writes CoNLL-U format to stdout.
+Reads CoNLL-U format from stdin and annotates Unicode emojis, ASCII emoticons, Wikipedia emoji templates, action words, hashtags, URLs, email addresses, and @names with CMC-oriented STTS-IBK tags in the XPOS column (Beißwenger/Bartsch/Evert/Würzner 2016). Writes CoNLL-U format to stdout.
 
 For Unicode emojis (`EMOIMG`), the base emoji without skin tone modifiers
 is written to the LEMMA column and Unicode emoji metadata is added to the FEATS column:
 
+<!-- markdownlint-disable MD010 -->
 ```tsv
 # text = 😂
 1	😂	😂	_	EMOIMG	g=smileys_&_emotion|s=face_smiling|q=fully_qualified|v=E0.6|n=face_with_tears_of_joy	_	_	_	_
 ```
+<!-- markdownlint-enable MD010 -->
 
 The FEATS field encodes: `g` (group), `s` (subgroup), `q` (qualification status), `v` (Unicode version first introduced), `n` (emoji name – including skin tone). See <https://www.unicode.org/Public/UCD/latest/emoji/emoji-test.txt> for details.
 
+## Tagset
 
+All CMC annotations are written to the XPOS column (column 5 in CoNLL-U). For `EMOIMG`, the tagger also normalizes the LEMMA column to the base emoji and enriches FEATS with emoji metadata when available.
+
+| Tag | Phenomenon | Example token | Output behavior |
+| --- | --- | --- | --- |
+| `EMOWIKI` | Wikipedia emoji templates | `[_EMOJI:{{S\|;)}}_]` | Writes `EMOWIKI` to XPOS |
+| `EMOIMG` | Unicode emoji tokens | `😂`, `😇` | Writes `EMOIMG` to XPOS, normalizes LEMMA to the base emoji, and adds FEATS metadata |
+| `AKW` | Action words / inflectives | `:grins:` | Writes `AKW` to XPOS |
+| `EMOASC` | ASCII emoticons | `:)`, `<3` | Writes `EMOASC` to XPOS |
+| `HST` | Hashtags | `#KorAP`, `#10` | Writes `HST` to XPOS |
+| `URL` | URLs | `https://korap.ids-mannheim.de` | Writes `URL` to XPOS |
+| `EML` | Email addresses | `mail@example.org` | Writes `EML` to XPOS |
+| `ADR` | `@`-names / addresses | `@markup` | Writes `ADR` to XPOS |
+
+## CoNLL-U Output Examples
+
+The following example shows how the different tags appear in CoNLL-U output. In all cases, the annotation is written to XPOS; only `EMOIMG` additionally changes LEMMA and FEATS.
+
+<!-- markdownlint-disable MD010 -->
+```tsv
+# foundry = cmc
+# text_id = readme-demo
+# text = [_EMOJI:{{cool}}_] 😂 :grins: :) #KorAP https://korap.ids-mannheim.de mail@example.org @handle <3
+1	[_EMOJI:{{cool}}_]	_	_	EMOWIKI	_	_	_	_	_
+2	😂	😂	_	EMOIMG	g=smileys_&_emotion|s=face_smiling|q=fully_qualified|v=E0.6|n=face_with_tears_of_joy	_	_	_	_
+3	:grins:	_	_	AKW	_	_	_	_	_
+4	:)	_	_	EMOASC	_	_	_	_	_
+5	#KorAP	_	_	HST	_	_	_	_	_
+6	https://korap.ids-mannheim.de	_	_	URL	_	_	_	_	_
+7	mail@example.org	_	_	EML	_	_	_	_	_
+8	@handle	_	_	ADR	_	_	_	_	_
+9	<3	_	_	EMOASC	_	_	_	_	_
+```
+<!-- markdownlint-enable MD010 -->
+
+For compound emojis with modifiers or zero-width joiners, the tagger still writes `EMOIMG` and reduces LEMMA to the base emoji. For example, `✊🏿` is normalized to lemma `✊`, and `👨‍👨‍👦` is normalized to lemma `👨`.
+
+## Current Limitations
+
+- The tagger is purely pattern-based. It does not consider sentential, pragmatic, or discourse context.
+- The matching strategy is intentionally recall-oriented rather than precision-oriented. Ambiguous strings such as `<3` may therefore produce false positives.
+- Annotation quality depends heavily on tokenization. Unicode emojis, grapheme clusters, zero-width joiners, modifiers, emoticons, and Wikipedia emoji templates should already be segmented into correct token units before tagging.
+- We recommend [KorAP-Tokenizer](https://github.com/KorAP/KorAP-Tokenizer), which supports Unicode 17.0, including grapheme clusters, zero-width joiners, modifiers, emoticons, and Wikipedia-template-based emojis.
 
 ## Local Usage
 
@@ -53,11 +98,36 @@
 docker run --rm korap/conllu-cmc --help
 ```
 
+## Performance
+
+The tagger is implemented in Node.js because the runtime provides efficient regular-expression execution, which is central to this regex-based annotation workflow.
+
+On CMC corpora with many matches, throughput is above 13 MB/s. This includes dense CMC material such as the NottDeuYTSch corpus.
+
+## Applications
+
+The tagger's annotations are already used for corpus analysis on the [KorAP](https://github.com/KorAP/) platform.
+
+### German Wikipedia Talk Pages
+
+The German Wikipedia Talk Pages corpus is available at <https://korap.ids-mannheim.de/instance/wiki>. The following query matches an `EMOWIKI` token, then an `EMOASC` token, then an `EMOIMG` token within one posting, with up to 12 intervening tokens between consecutive matches:
+
+```cqp
+[cmc/p=EMOWIKI] []{0,12} [cmc/p=EMOASC] []{0,12} [cmc/p=EMOIMG]
+```
+
+You can run this query directly here: <https://korap.ids-mannheim.de/instance/wiki?q=[cmc%2Fp%3DEMOWIKI]+[]{0%2C12}+[cmc%2Fp%3DEMOASC]+[]{0%2C12}+[cmc%2Fp%3DEMOIMG]>.
+
+### NottDeuYTSch
+
+The NottDeuYTSch corpus (Cotgrove 2023) is accessible on request via <https://korap.ids-mannheim.de/instance/nottdeuytsch>.
+
 ## Installation
 
 ### Pre-built Binaries
 
 Download pre-built executables from the [Releases](https://github.com/KorAP/KorAP-CoNLL-U-CMC/releases) page:
+
 - `conllu2cmc` - Linux x64
 - `conllu2cmc` - macOS x64
 - `conllu2cmc.exe` - Windows x64
@@ -94,7 +164,8 @@
 docker pull korap/conllu-cmc
 ```
 
-
 ## References
 
-Beißwenger, Michael/Bartsch, Sabine/Evert, Stefan/Würzner, Kay-Michael (2016): EmpiriST 2015: A Shared Task on the Automatic Linguistic Annotation of Computer-Mediated Communication and Web Corpora. In: Proceedings of the 10th Web as Corpus Workshop. Berlin: Association for Computational Linguistics, pp. 44–56. https://doi.org/10.18653/v1/W16-2606.
+- Beißwenger, Michael/Bartsch, Sabine/Evert, Stefan/Würzner, Kay-Michael (2016): EmpiriST 2015: A Shared Task on the Automatic Linguistic Annotation of Computer-Mediated Communication and Web Corpora. In: Proceedings of the 10th Web as Corpus Workshop. Berlin: Association for Computational Linguistics, pp. 44–56. <https://doi.org/10.18653/v1/W16-2606>.
+- Cotgrove, Louis (2023): New opportunities for researching digital youth language: The NottDeuYTSch corpus. In: Kupietz, Marc/Schmidt, Thomas (eds.): Neue Entwicklungen in der Korpuslandschaft der Germanistik. Beiträge zur IDS-Methodenmesse 2022. (= Korpuslinguistik und interdisziplinäre Perspektiven auf Sprache (CLIP) 11). Tübingen: Narr, pp. 102–115.
+- Margaretha, Eliza/Lüngen, Harald/Diewald, Nils/Kupietz, Marc/Yaddehige, Rameela (2025): Building and querying Wikipedia discussion corpora using KorAP. In: Impulses and Approaches to Computer-Mediated Communication: Proceedings of the 12th International Conference on Computer Mediated Communication and Social Media Corpora for the Humanities (CMC 2025). Edited by Annamária Fábián/Igor Trost, pp. 123–124.
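
The FEATS encoding documented in the updated Readme (`g`, `s`, `q`, `v`, `n`, joined by `|`) can be read back by downstream tools. The following Node.js snippet is a minimal sketch of such a consumer and is not part of conllu-cmc: the function names `parseEmojiFeats` and `parseTokenLine` and the long key names in `FEATS_KEYS` are assumptions made for illustration.

```javascript
// Hypothetical helpers for reading tagger output; not part of conllu-cmc.
// Long names for the one-letter FEATS keys described in the Readme.
const FEATS_KEYS = {
  g: "group",
  s: "subgroup",
  q: "qualification",
  v: "version",
  n: "name",
};

// Turn "g=...|s=...|..." into an object; "_" (empty FEATS) yields {}.
function parseEmojiFeats(feats) {
  if (!feats || feats === "_") return {};
  const result = {};
  for (const pair of feats.split("|")) {
    const eq = pair.indexOf("=");
    if (eq < 0) continue; // skip malformed entries
    const key = pair.slice(0, eq);
    result[FEATS_KEYS[key] ?? key] = pair.slice(eq + 1);
  }
  return result;
}

// Parse one CoNLL-U token line (10 tab-separated columns).
// Comment lines and blank lines return null.
function parseTokenLine(line) {
  const cols = line.split("\t");
  if (cols.length !== 10) return null;
  const [id, form, lemma, upos, xpos, feats] = cols;
  return { id, form, lemma, upos, xpos, feats: parseEmojiFeats(feats) };
}

const line =
  "1\t\u{1F602}\t\u{1F602}\t_\tEMOIMG\t" +
  "g=smileys_&_emotion|s=face_smiling|q=fully_qualified|v=E0.6|n=face_with_tears_of_joy" +
  "\t_\t_\t_\t_";
const token = parseTokenLine(line);
console.log(token.xpos, token.feats.name); // EMOIMG face_with_tears_of_joy
```

Splitting each pair at the first `=` keeps values intact even if they contain further `=` characters; unknown keys are passed through unchanged.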
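
The lemma normalization that the Readme describes for compound emojis (skin tone modifiers removed, ZWJ sequences reduced to the first emoji) can be approximated in a few lines. This is a sketch under stated assumptions, not the tagger's actual implementation: the function name `baseEmoji` is hypothetical, and the snippet only handles the two cases named in the Readme, ignoring variation selectors, flags, and keycap sequences.

```javascript
// Hypothetical sketch of base-emoji normalization; not the tagger's code.
// Emoji skin tone (Fitzpatrick) modifiers occupy U+1F3FB..U+1F3FF.
const SKIN_TONE_MODIFIERS = /[\u{1F3FB}-\u{1F3FF}]/gu;
// Zero-width joiner used to build compound emojis such as families.
const ZWJ = "\u200D";

function baseEmoji(token) {
  // Drop skin tone modifiers, then keep only the first ZWJ-joined component.
  const withoutTone = token.replace(SKIN_TONE_MODIFIERS, "");
  return withoutTone.split(ZWJ)[0];
}

// Raised fist with dark skin tone reduces to the plain raised fist.
console.log(baseEmoji("\u270A\u{1F3FF}") === "\u270A"); // true
// Family (man, man, boy) reduces to the first man.
console.log(
  baseEmoji("\u{1F468}\u200D\u{1F468}\u200D\u{1F466}") === "\u{1F468}"
); // true
```

The `u` flag is required so that the character class matches whole code points above U+FFFF rather than individual surrogate halves.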