Reads CoNLL-U format from stdin and annotates Unicode emojis, ASCII emoticons, Wikipedia emoji templates, action words, hashtags, URLs, email addresses, and @names with CMC-oriented STTS-IBK tags in the XPOS column (Beißwenger/Bartsch/Evert/Würzner 2016). Writes CoNLL-U format to stdout.
For Unicode emojis (EMOIMG), the base emoji without skin tone modifiers is written to the LEMMA column and Unicode emoji metadata is added to the FEATS column:
    # text = 😂
    1 😂 😂 _ EMOIMG g=smileys_&_emotion|s=face_smiling|q=fully_qualified|v=E0.6|n=face_with_tears_of_joy _ _ _ _
The FEATS field encodes: g (group), s (subgroup), q (qualification status), v (Unicode version first introduced), n (emoji name – including skin tone). In n, separators such as : and , are preserved without following spaces, e.g. thumbs_up:light_skin_tone or family:man,man,boy. See https://www.unicode.org/Public/UCD/latest/emoji/emoji-test.txt for details.
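Downstream scripts can split the FEATS value back into its keys. The following is a minimal sketch, assuming the `key=value` pairs are joined by `|` as in the example above; `parseFeats` is a hypothetical helper and not part of the tagger.

```javascript
// Hypothetical helper (not part of the tagger): turn a FEATS value into an object.
function parseFeats(feats) {
  const result = {};
  // "_" marks an empty column in CoNLL-U.
  if (!feats || feats === "_") return result;
  for (const pair of feats.split("|")) {
    // Split on the first "=" only, so values keep any later "=" characters.
    const idx = pair.indexOf("=");
    if (idx === -1) continue;
    result[pair.slice(0, idx)] = pair.slice(idx + 1);
  }
  return result;
}

const feats = parseFeats(
  "g=smileys_&_emotion|s=face_smiling|q=fully_qualified|v=E0.6|n=face_with_tears_of_joy"
);
console.log(feats.g, feats.n);
```

Because the name in n keeps separators such as : and , unchanged, the value can be used as-is or split further if the individual components are needed.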
All CMC annotations are written to the XPOS column (column 5 in CoNLL-U). For EMOIMG, the tagger also normalizes the LEMMA column to the base emoji and enriches FEATS with emoji metadata when available.
| Tag | Phenomenon | Example token | Output behavior |
|---|---|---|---|
| EMOWIKI | Wikipedia emoji templates | [_EMOJI:{{S\|;)}}_] | Writes EMOWIKI to XPOS |
| EMOIMG | Unicode emoji tokens | 😂, 😇 | Writes EMOIMG to XPOS, normalizes LEMMA to the base emoji, and adds FEATS metadata |
| AKW | Action words / inflectives | :grins: | Writes AKW to XPOS |
| EMOASC | ASCII emoticons | :), <3 | Writes EMOASC to XPOS |
| HST | Hashtags | #KorAP, #3D | Writes HST to XPOS when the hashtag contains at least one letter |
| URL | URLs | https://korap.ids-mannheim.de | Writes URL to XPOS |
| EML | Email addresses | mail@example.org | Writes EML to XPOS |
| ADR | @-names / addresses | @markup | Writes ADR to XPOS |
Numeric-only forms such as #10 are not tagged as HST.
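The letter requirement can be expressed as a regular expression with a lookahead. The following is a minimal sketch and not the tagger's actual pattern; the `HASHTAG` expression and the set of characters it allows inside a hashtag (letters, digits, underscore) are assumptions made for illustration.

```javascript
// Assumed pattern: "#" followed by letters, digits, or underscores,
// with a lookahead that demands at least one letter somewhere in the body.
const HASHTAG = /^#(?=[\p{L}\p{N}_]*\p{L})[\p{L}\p{N}_]+$/u;

// Hypothetical helper: true if the token would qualify as HST under this sketch.
function looksLikeHashtag(token) {
  return HASHTAG.test(token);
}

for (const token of ["#KorAP", "#3D", "#10"]) {
  console.log(token, looksLikeHashtag(token));
}
```

Under this sketch, #KorAP and #3D are accepted, while #10 is rejected because it contains no letter.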
The following example shows how the different tags appear in CoNLL-U output. In all cases, the annotation is written to XPOS; only EMOIMG additionally changes LEMMA and FEATS.
    # foundry = cmc
    # text_id = readme-demo
    # text = [_EMOJI:{{cool}}_] 😂 :grins: :) #KorAP https://korap.ids-mannheim.de mail@example.org @handle <3
    1 [_EMOJI:{{cool}}_] _ _ EMOWIKI _ _ _ _ _
    2 😂 😂 _ EMOIMG g=smileys_&_emotion|s=face_smiling|q=fully_qualified|v=E0.6|n=face_with_tears_of_joy _ _ _ _
    3 :grins: _ _ AKW _ _ _ _ _
    4 :) _ _ EMOASC _ _ _ _ _
    5 #KorAP _ _ HST _ _ _ _ _
    6 https://korap.ids-mannheim.de _ _ URL _ _ _ _ _
    7 mail@example.org _ _ EML _ _ _ _ _
    8 @handle _ _ ADR _ _ _ _ _
    9 <3 _ _ EMOASC _ _ _ _ _
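Because every annotation lands in XPOS, tagged output is easy to post-process. The following is a minimal sketch that counts the tags in a CoNLL-U string, assuming tab-separated columns as the CoNLL-U format prescribes; `countXpos` is a hypothetical helper and not part of the package.

```javascript
// Hypothetical helper: count non-empty XPOS values in CoNLL-U text.
function countXpos(conllu) {
  const counts = new Map();
  for (const line of conllu.split(/\r?\n/)) {
    // Skip sentence separators and comment lines such as "# text = ...".
    if (line === "" || line.startsWith("#")) continue;
    const cols = line.split("\t");
    if (cols.length < 5) continue; // not a token line
    const xpos = cols[4]; // column 5 (index 4) is XPOS
    if (xpos === "_") continue; // "_" means no annotation
    counts.set(xpos, (counts.get(xpos) || 0) + 1);
  }
  return counts;
}

// Small illustrative input, built with explicit tabs.
const sample = [
  "# text = :) #KorAP <3",
  ["1", ":)", "_", "_", "EMOASC", "_", "_", "_", "_", "_"].join("\t"),
  ["2", "#KorAP", "_", "_", "HST", "_", "_", "_", "_", "_"].join("\t"),
  ["3", "<3", "_", "_", "EMOASC", "_", "_", "_", "_", "_"].join("\t"),
].join("\n");

const counts = countXpos(sample);
console.log(Object.fromEntries(counts));
```

Token lines always start with the numeric ID, so a hashtag in the FORM column is not mistaken for a comment line.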
For compound emojis with modifiers or zero-width joiners, the tagger still writes EMOIMG and reduces LEMMA to the base emoji. For example, ✊🏿 is normalized to lemma ✊, and 👨👨👦 is normalized to lemma 👨.
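One way to approximate this normalization is to drop skin tone modifiers and keep only the first component of a zero-width-joiner sequence. The following is a minimal sketch under that assumption; it is not the tagger's actual implementation, and `baseEmoji` is a hypothetical name.

```javascript
// Skin tone modifiers occupy U+1F3FB to U+1F3FF.
const SKIN_TONE = /[\u{1F3FB}-\u{1F3FF}]/gu;
// Zero-width joiner that glues emoji components together.
const ZWJ = "\u200D";

// Hypothetical helper: reduce an emoji token to an assumed base form.
function baseEmoji(token) {
  const withoutTone = token.replace(SKIN_TONE, "");
  // Keep only the first component of a ZWJ sequence.
  return withoutTone.split(ZWJ)[0];
}

const fist = baseEmoji("\u270A\u{1F3FF}"); // raised fist + dark skin tone
const family = baseEmoji("\u{1F468}\u200D\u{1F468}\u200D\u{1F466}"); // man ZWJ man ZWJ boy
console.log(fist === "\u270A", family === "\u{1F468}");
```

The code points are written as escapes so the example does not depend on how an editor displays invisible joiners.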
The ASCII emoticon <3 may produce false positives.

The npm package exposes the CLI under both conllu-cmc and cmc-tagger.
    cat ./test/data/ndy.conllu | npx conllu-cmc

    # Show version
    npx conllu-cmc -V

    # Alternative CLI name
    npm exec --yes --package=. cmc-tagger -- -V
    cat ./test/data/ndy.conllu | ./bin/linux/cmc-tagger

    # Show version
    ./bin/linux/cmc-tagger -V
korapxml2conllu kyc.zip | cmc-tagger -s | conllu2korapxml > kyc.cmc.zip
    # Annotate CoNLL-U input
    cat ./test/data/ndy.conllu | docker run --rm -i korap/conllu-cmc

    # With sparse output (only annotated lines)
    cat ./test/data/ndy.conllu | docker run --rm -i korap/conllu-cmc -s

    # Generate KorAP-XML zip with CMC annotations
    # For korapxmltool see <https://github.com/KorAP/korapxmltool>
    korapxmltool -A "docker run --rm -i korap/conllu-cmc -s" -t zip ndy.zip

    # Show help
    docker run --rm korap/conllu-cmc --help

    # Show version
    docker run --rm korap/conllu-cmc -V
The tagger is implemented in Node.js because the runtime provides efficient regular-expression execution, which is central to this regex-based annotation workflow.
On CMC corpora with many matches, throughput is above 10 MB/s. This includes dense CMC material such as the NottDeuYTSch corpus.
The tagger is already used in corpus analysis scenarios with the corpus analysis platform KorAP.
The German Wikipedia Talk Pages corpus is available at https://korap.ids-mannheim.de/instance/wiki. A query for an EMOWIKI, an EMOASC, and an EMOIMG sequence in one posting with up to 12 intervening tokens between each match is:
[cmc/p=EMOWIKI] []{0,12} [cmc/p=EMOASC] []{0,12} [cmc/p=EMOIMG]
You can run this query directly here: https://korap.ids-mannheim.de/instance/wiki?q=[cmc%2Fp%3DEMOWIKI]+[]{0%2C12}+[cmc%2Fp%3DEMOASC]+[]{0%2C12}+[cmc%2Fp%3DEMOIMG].
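Such links can also be generated programmatically. The following is a minimal sketch using the standard URL API available in Node.js; it assumes only that the instance accepts the query in the q parameter, as in the link above. The generated link may percent-encode more characters than the hand-written one, but it decodes to the same query.

```javascript
// Query string exactly as it would be typed into the KorAP search field.
const query = "[cmc/p=EMOWIKI] []{0,12} [cmc/p=EMOASC] []{0,12} [cmc/p=EMOIMG]";

// Let the URL API handle the escaping of the q parameter.
const url = new URL("https://korap.ids-mannheim.de/instance/wiki");
url.searchParams.set("q", query);

console.log(url.href);
```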
The NottDeuYTSch corpus (Cotgrove 2023) is accessible on request via https://korap.ids-mannheim.de/instance/nottdeuytsch.
Download pre-built executables from the Releases page:
- cmc-tagger - Linux x64
- cmc-tagger - macOS x64
- cmc-tagger.exe - Windows x64

npm install 'git+https://gitlab.ids-mannheim.de/KorAP/conllu-cmc-docker.git'
npm install
    # Build for all platforms
    npm run pkg-all

    # Or build for specific platforms
    npm run pkg-linux   # Linux x64
    npm run pkg-macos   # macOS x64
    npm run pkg-win     # Windows x64
Executables are created as cmc-tagger or cmc-tagger.exe in bin/linux/, bin/macos/, and bin/win/.
docker pull korap/conllu-cmc