| commit | 804750def74b4a5b9ce607e375042628b9ff5d69 | |
|---|---|---|
| author | Marc Kupietz <kupietz@ids-mannheim.de> | Fri Apr 10 14:44:13 2026 +0200 |
| committer | Marc Kupietz <kupietz@ids-mannheim.de> | Fri Apr 10 14:44:13 2026 +0200 |
| tree | 39c7550aa58c87eff43cd10f0686e3d617391cbe | |
| parent | a17c2e5577e966bb78e6abbfd277a8e7d2c63706 | |
Always tag as ADR if pattern matches

Change-Id: I7f8636eb3d9e8b03fad2e8747d50d4cb208bfc43
Reads CoNLL-U format from stdin and annotates emojis, emoticons, hashtags, URLs, email addresses, action words, @names, and Wikipedia emoji templates with their corresponding STTS-IBK POS tag (Beißwenger/Bartsch/Evert/Würzner 2016). Writes CoNLL-U format to stdout.
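The annotation step described above can be pictured roughly as follows. This is a minimal sketch, not the tool's actual implementation: the regular expressions are simplified, hypothetical patterns, the function name `annotateLine` is invented for illustration, and the tag names (URL, EML, HST, ADR) are assumed to follow STTS-IBK. The tag is written to the fifth column (XPOS), matching the position of EMOIMG in the example output of this README.

```javascript
// Simplified, hypothetical patterns; the real tool's rules may differ.
const PATTERNS = [
  { tag: "URL", re: /^(https?:\/\/|www\.)\S+$/i },
  { tag: "EML", re: /^[^\s@]+@[^\s@]+\.[^\s@]+$/ },
  { tag: "HST", re: /^#\p{L}[\p{L}\p{N}_]*$/u },
  { tag: "ADR", re: /^@[\p{L}\p{N}_.-]+$/u },
];

// Annotate a single CoNLL-U line and return the (possibly modified) line.
function annotateLine(line) {
  // Comment lines and blank sentence separators pass through unchanged.
  if (line === "" || line.startsWith("#")) return line;
  const cols = line.split("\t");
  // A regular token line has exactly ten tab-separated columns.
  if (cols.length !== 10) return line;
  const match = PATTERNS.find(({ re }) => re.test(cols[1])); // test FORM
  if (match) cols[4] = match.tag; // overwrite XPOS
  return cols.join("\t");
}

console.log(annotateLine("1\t#korap\t_\t_\t_\t_\t_\t_\t_\t_"));
```

To process a whole file, a function like this could be applied to each line read from stdin, with the results written to stdout.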
For Unicode emojis (EMOIMG), the base emoji without skin tone modifiers is written to the LEMMA column and Unicode emoji metadata is added to the FEATS column:
    # text = 😂
    1	😂	😂	_	EMOIMG	g=smileys_&_emotion|s=face_smiling|q=fully_qualified|v=E0.6|n=face_with_tears_of_joy	_	_	_	_
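The LEMMA normalization mentioned above (base emoji without skin tone modifiers) could look roughly like this. It is a sketch under one assumption: that removing the five Fitzpatrick modifier code points U+1F3FB to U+1F3FF is sufficient. The helper name `baseEmoji` is hypothetical, and the tool itself may derive the base form differently.

```javascript
// Fitzpatrick skin tone modifiers occupy U+1F3FB..U+1F3FF.
const SKIN_TONE = /[\u{1F3FB}-\u{1F3FF}]/gu;

// Return the emoji with all skin tone modifiers removed.
function baseEmoji(emoji) {
  return emoji.replace(SKIN_TONE, "");
}

// U+1F44D (thumbs up) followed by U+1F3FD (medium skin tone)
console.log(baseEmoji("\u{1F44D}\u{1F3FD}") === "\u{1F44D}");
```

Other parts of a sequence, such as zero-width joiners, are left untouched, so only the modifier is dropped from multi-part emojis.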
The FEATS field encodes: g (group), s (subgroup), q (qualification status), v (version in which the emoji was first introduced, e.g. E0.6), n (emoji name, including skin tone). See https://www.unicode.org/Public/UCD/latest/emoji/emoji-test.txt for details.
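For downstream processing, the FEATS value can be split back into its key/value pairs. The following is a minimal sketch; the function names `parseFeats` and `parseTokenLine` are illustrative and not part of conllu2cmc. It assumes tab-separated columns in the standard CoNLL-U order and `|`-separated `key=value` pairs, as in the example line.

```javascript
// Parse a FEATS value such as "g=...|s=...|q=...|v=...|n=..." into an object.
// A value of "_" means the field is empty.
function parseFeats(feats) {
  const result = {};
  if (!feats || feats === "_") return result;
  for (const pair of feats.split("|")) {
    const idx = pair.indexOf("=");
    if (idx === -1) continue; // skip malformed entries
    result[pair.slice(0, idx)] = pair.slice(idx + 1);
  }
  return result;
}

// Split one CoNLL-U token line into its ten tab-separated columns.
function parseTokenLine(line) {
  const [id, form, lemma, upos, xpos, feats, head, deprel, deps, misc] =
    line.split("\t");
  return { id, form, lemma, upos, xpos, feats: parseFeats(feats), head, deprel, deps, misc };
}

const exampleLine =
  "1\t\u{1F602}\t\u{1F602}\t_\tEMOIMG\t" +
  "g=smileys_&_emotion|s=face_smiling|q=fully_qualified|v=E0.6|n=face_with_tears_of_joy" +
  "\t_\t_\t_\t_";
const token = parseTokenLine(exampleLine);
console.log(token.xpos, token.feats.g, token.feats.n);
```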
cat ./test/data/ndy.conllu | npx conllu2cmc
cat ./test/data/ndy.conllu | ./bin/linux/conllu2cmc
korapxml2conllu kyc.zip | conllu2cmc -s | conllu2korapxml > kyc.cmc.zip
    # Annotate CoNLL-U input
    cat ./test/data/ndy.conllu | docker run --rm -i korap/conllu-cmc

    # With sparse output (only annotated lines)
    cat ./test/data/ndy.conllu | docker run --rm -i korap/conllu-cmc -s

    # Generate KorAP-XML zip with CMC annotations
    # For korapxmltool see <https://github.com/KorAP/korapxmltool>
    korapxmltool -A "docker run --rm -i korap/conllu-cmc -s" -t zip ndy.zip

    # Show help
    docker run --rm korap/conllu-cmc --help
Download pre-built executables from the Releases page:
- conllu2cmc - Linux x64
- conllu2cmc - macOS x64
- conllu2cmc.exe - Windows x64

npm install 'git+https://gitlab.ids-mannheim.de/KorAP/conllu-cmc-docker.git'
npm install
    # Build for all platforms
    npm run pkg-all

    # Or build for specific platforms
    npm run pkg-linux   # Linux x64
    npm run pkg-macos   # macOS x64
    npm run pkg-win     # Windows x64
Executables are created in bin/linux/, bin/macos/, and bin/win/.
docker pull korap/conllu-cmc
Beißwenger, Michael/Bartsch, Sabine/Evert, Stefan/Würzner, Kay-Michael (2016): EmpiriST 2015: A Shared Task on the Automatic Linguistic Annotation of Computer-Mediated Communication and Web Corpora. In: Proceedings of the 10th Web as Corpus Workshop. Berlin: Association for Computational Linguistics, pp. 44–56. https://doi.org/10.18653/v1/W16-2606.