Add --version option

Change-Id: I4390f2df5a7e0c61c6f8a5233e5092f9f907435e
4 files changed
Readme.md

conllu-cmc

Reads CoNLL-U format from stdin and annotates Unicode emojis, ASCII emoticons, Wikipedia emoji templates, action words, hashtags, URLs, email addresses, and @names with CMC-oriented STTS-IBK tags in the XPOS column (Beißwenger/Bartsch/Evert/Würzner 2016). Writes CoNLL-U format to stdout.

For Unicode emojis (EMOIMG), the base emoji without skin tone modifiers is written to the LEMMA column and Unicode emoji metadata is added to the FEATS column:

# text = 😂
1	😂	😂	_	EMOIMG	g=smileys_&_emotion|s=face_smiling|q=fully_qualified|v=E0.6|n=face_with_tears_of_joy	_	_	_	_

The FEATS field encodes: g (group), s (subgroup), q (qualification status), v (Unicode version first introduced), n (emoji name – including skin tone). In n, separators such as : and , are preserved without following spaces, e.g. thumbs_up:light_skin_tone or family:man,man,boy. See https://www.unicode.org/Public/UCD/latest/emoji/emoji-test.txt for details.
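
To show how the FEATS encoding can be consumed downstream, here is a minimal sketch that splits such a value into key/value pairs. The helper name `parseFeats` is hypothetical and not part of conllu-cmc. It assumes that `|` separates entries and that the first `=` separates key from value, so `:` and `,` inside `n` stay untouched.

```javascript
// Hypothetical helper (not part of conllu-cmc): split a FEATS value into an object.
function parseFeats(feats) {
  const result = {};
  if (feats === "_" || feats === "") return result; // "_" marks an empty CoNLL-U field
  for (const pair of feats.split("|")) {
    const eq = pair.indexOf("=");
    if (eq === -1) continue; // skip malformed entries without "="
    result[pair.slice(0, eq)] = pair.slice(eq + 1);
  }
  return result;
}

const feats = parseFeats(
  "g=smileys_&_emotion|s=face_smiling|q=fully_qualified|v=E0.6|n=face_with_tears_of_joy"
);
console.log(feats.g, feats.v, feats.n);
```

Splitting on the first `=` only keeps values intact even if they contain further `=` characters.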

Tagset

All CMC annotations are written to the XPOS column (column 5 in CoNLL-U). For EMOIMG, the tagger also normalizes the LEMMA column to the base emoji and enriches FEATS with emoji metadata when available.

Tag      Phenomenon                  Example token                  Output behavior
EMOWIKI  Wikipedia emoji templates   [_EMOJI:{{S|;)}}_]             Writes EMOWIKI to XPOS
EMOIMG   Unicode emoji tokens        😂, 😇                         Writes EMOIMG to XPOS, normalizes LEMMA to the base emoji, and adds FEATS metadata
AKW      Action words / inflectives  :grins:                        Writes AKW to XPOS
EMOASC   ASCII emoticons             :), <3                         Writes EMOASC to XPOS
HST      Hashtags                    #KorAP, #3D                    Writes HST to XPOS when the hashtag contains at least one letter
URL      URLs                        https://korap.ids-mannheim.de  Writes URL to XPOS
EML      Email addresses             mail@example.org               Writes EML to XPOS
ADR      @-names / addresses         @markup                        Writes ADR to XPOS

Numeric-only forms such as #10 are not tagged as HST.
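
The "at least one letter" rule for HST can be expressed with a lookahead. The pattern below is only an illustration of that rule, under the assumption that a hashtag body consists of letters, digits, and underscores. The tagger's actual regular expression may differ, and the names `HASHTAG` and `looksLikeHashtag` are hypothetical.

```javascript
// Illustrative pattern only; the tagger's real regular expression may differ.
// "#" followed by letters, digits, or underscores, with at least one letter.
const HASHTAG = /^#(?=[\p{L}\p{N}_]*\p{L})[\p{L}\p{N}_]+$/u;

function looksLikeHashtag(token) {
  return HASHTAG.test(token);
}

for (const token of ["#KorAP", "#3D", "#10"]) {
  console.log(token, looksLikeHashtag(token));
}
```

The lookahead `(?=[\p{L}\p{N}_]*\p{L})` requires a letter somewhere in the body, which is what excludes numeric-only forms such as #10.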

CoNLL-U Output Examples

The following example shows how the different tags appear in CoNLL-U output. In all cases, the annotation is written to XPOS; only EMOIMG additionally changes LEMMA and FEATS.

# foundry = cmc
# text_id = readme-demo
# text = [_EMOJI:{{cool}}_] 😂 :grins: :) #KorAP https://korap.ids-mannheim.de mail@example.org @handle <3
1	[_EMOJI:{{cool}}_]	_	_	EMOWIKI	_	_	_	_	_
2	😂	😂	_	EMOIMG	g=smileys_&_emotion|s=face_smiling|q=fully_qualified|v=E0.6|n=face_with_tears_of_joy	_	_	_	_
3	:grins:	_	_	AKW	_	_	_	_	_
4	:)	_	_	EMOASC	_	_	_	_	_
5	#KorAP	_	_	HST	_	_	_	_	_
6	https://korap.ids-mannheim.de	_	_	URL	_	_	_	_	_
7	mail@example.org	_	_	EML	_	_	_	_	_
8	@handle	_	_	ADR	_	_	_	_	_
9	<3	_	_	EMOASC	_	_	_	_	_
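
Because every annotation lands in the fifth tab-separated column, output like the above is easy to post-process. The following sketch counts XPOS values in CoNLL-U text. `countXpos` is a hypothetical helper, not part of conllu-cmc, and it counts every non-empty XPOS value, so tags written by other tools would be counted as well.

```javascript
// Hypothetical helper: count XPOS values (column 5) in CoNLL-U text.
function countXpos(conllu) {
  const counts = new Map();
  for (const line of conllu.split(/\r?\n/)) {
    if (line === "" || line.startsWith("#")) continue; // skip blank lines and comments
    const cols = line.split("\t");
    if (cols.length < 5) continue; // not a token line
    const xpos = cols[4];
    if (xpos === "_") continue; // token without annotation
    counts.set(xpos, (counts.get(xpos) ?? 0) + 1);
  }
  return counts;
}

const sample = [
  "# text = :) #KorAP <3",
  "1\t:)\t_\t_\tEMOASC\t_\t_\t_\t_\t_",
  "2\t#KorAP\t_\t_\tHST\t_\t_\t_\t_\t_",
  "3\t<3\t_\t_\tEMOASC\t_\t_\t_\t_\t_",
].join("\n");
console.log(Object.fromEntries(countXpos(sample)));
```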

For compound emojis with modifiers or zero-width joiners, the tagger still writes EMOIMG and reduces LEMMA to the base emoji. For example, ✊🏿 is normalized to lemma ✊, and 👨‍👨‍👦 is normalized to lemma 👨.
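
The reduction to a base emoji can be approximated as follows. This is a sketch under the assumption that the base emoji is the first component of a zero-width-joiner sequence with skin tone modifiers (U+1F3FB to U+1F3FF) removed. The tagger's own normalization may handle further cases, and `baseEmoji` is a hypothetical name.

```javascript
// Assumption: the "base emoji" is approximated as the first ZWJ component
// with skin tone modifiers (U+1F3FB to U+1F3FF) removed.
const ZWJ = "\u200D";
const SKIN_TONE = /[\u{1F3FB}-\u{1F3FF}]/gu;

function baseEmoji(token) {
  return token.split(ZWJ)[0].replace(SKIN_TONE, "");
}

// Raised fist with dark skin tone
console.log(baseEmoji("\u270A\u{1F3FF}"));
// Family: man, man, boy (three emojis joined by ZWJ)
console.log(baseEmoji("\u{1F468}\u200D\u{1F468}\u200D\u{1F466}"));
```

The inputs are written as escape sequences so that the individual code points stay visible.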

Current Limitations

  • The tagger is purely pattern-based. It does not consider sentential, pragmatic, or discourse context.
  • The matching strategy is intentionally recall-oriented rather than precision-oriented. Ambiguous strings such as <3 may therefore produce false positives.
  • Annotation quality depends heavily on tokenization. Unicode emojis, grapheme clusters, zero-width joiners, modifiers, emoticons, and Wikipedia emoji templates should already be segmented into correct token units before tagging.
  • We recommend KorAP-Tokenizer, which supports Unicode 17.0, including grapheme clusters, zero-width joiners, modifiers, emoticons, and Wikipedia-template-based emojis.

Local Usage

Using npm/node

cat ./test/data/ndy.conllu | npx conllu2cmc

# Show version
npx conllu2cmc -V

Using standalone binary

cat ./test/data/ndy.conllu | ./bin/linux/conllu2cmc

# Show version
./bin/linux/conllu2cmc -V

Generate KorAP-XML zip with CMC annotations

korapxml2conllu kyc.zip | conllu2cmc -s | conllu2korapxml > kyc.cmc.zip

Docker Usage

# Annotate CoNLL-U input
cat ./test/data/ndy.conllu | docker run --rm -i korap/conllu-cmc

# With sparse output (only annotated lines)
cat ./test/data/ndy.conllu | docker run --rm -i korap/conllu-cmc -s

# Generate KorAP-XML zip with CMC annotations
# For korapxmltool see <https://github.com/KorAP/korapxmltool>
korapxmltool -A "docker run --rm -i korap/conllu-cmc -s" -t zip ndy.zip

# Show help
docker run --rm korap/conllu-cmc --help

# Show version
docker run --rm korap/conllu-cmc -V

Performance

The tagger is implemented in Node.js because the runtime provides efficient regular-expression execution, which is central to this regex-based annotation workflow.

On CMC corpora with many matches, throughput is above 10 MB/s. This includes dense CMC material such as the NottDeuYTSch corpus.

Applications

The tagger is already used in corpus analysis scenarios with the corpus analysis platform KorAP.

German Wikipedia Talk Pages

The German Wikipedia Talk Pages corpus is available at https://korap.ids-mannheim.de/instance/wiki. The following query finds an EMOWIKI, an EMOASC, and an EMOIMG, in that order, within one posting, with up to 12 intervening tokens between consecutive matches:

[cmc/p=EMOWIKI] []{0,12} [cmc/p=EMOASC] []{0,12} [cmc/p=EMOIMG]

You can run this query directly here: https://korap.ids-mannheim.de/instance/wiki?q=[cmc%2Fp%3DEMOWIKI]+[]{0%2C12}+[cmc%2Fp%3DEMOASC]+[]{0%2C12}+[cmc%2Fp%3DEMOIMG].
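
Such links can also be generated programmatically. The sketch below uses the standard `URL` API available in Node.js. The `q` parameter name is taken from the link above, while `korapQueryUrl` is a hypothetical helper.

```javascript
// Build a KorAP query link; the "q" parameter name follows the link shown above.
function korapQueryUrl(instance, query) {
  const url = new URL(instance);
  url.searchParams.set("q", query); // percent-encodes the query value
  return url.href;
}

const query = "[cmc/p=EMOWIKI] []{0,12} [cmc/p=EMOASC] []{0,12} [cmc/p=EMOIMG]";
const link = korapQueryUrl("https://korap.ids-mannheim.de/instance/wiki", query);
console.log(link);
```

The generated link also percent-encodes brackets and braces, so it looks different from the link above but decodes to the same query string.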

NottDeuYTSch

The NottDeuYTSch corpus (Cotgrove 2023) is accessible on request via https://korap.ids-mannheim.de/instance/nottdeuytsch.

Installation

Pre-built Binaries

Download pre-built executables from the Releases page:

  • conllu2cmc - Linux x64
  • conllu2cmc - macOS x64
  • conllu2cmc.exe - Windows x64

npm

npm install 'git+https://gitlab.ids-mannheim.de/KorAP/conllu-cmc-docker.git'

Build from source

npm install

Build standalone executables

# Build for all platforms
npm run pkg-all

# Or build for specific platforms
npm run pkg-linux   # Linux x64
npm run pkg-macos   # macOS x64
npm run pkg-win     # Windows x64

Executables are created in bin/linux/, bin/macos/, and bin/win/.

Docker

docker pull korap/conllu-cmc

References

  • Beißwenger, Michael/Bartsch, Sabine/Evert, Stefan/Würzner, Kay-Michael (2016): EmpiriST 2015: A Shared Task on the Automatic Linguistic Annotation of Computer-Mediated Communication and Web Corpora. In: Proceedings of the 10th Web as Corpus Workshop. Berlin: Association for Computational Linguistics, pp. 44–56. https://doi.org/10.18653/v1/W16-2606.
  • Cotgrove, Louis (2023): New opportunities for researching digital youth language: The NottDeuYTSch corpus. In: Kupietz, Marc/Schmidt, Thomas (eds.): Neue Entwicklungen in der Korpuslandschaft der Germanistik. Beiträge zur IDS-Methodenmesse 2022. (= Korpuslinguistik und interdisziplinäre Perspektiven auf Sprache (CLIP) 11). Tübingen: Narr, pp. 102–115.
  • Margaretha, Eliza/Lüngen, Harald/Diewald, Nils/Kupietz, Marc/Yaddehige, Rameela (2025): Building and querying Wikipedia discussion corpora using KorAP. In: Fábián, Annamária/Trost, Igor (eds.): Impulses and Approaches to Computer-Mediated Communication: Proceedings of the 12th International Conference on Computer Mediated Communication and Social Media Corpora for the Humanities (CMC 2025), pp. 123–124.