
conllu-cmc


Reads CoNLL-U from stdin and annotates emojis, emoticons, hashtags, URLs, email addresses, action words, @names, and Wikipedia emoji templates with their corresponding STTS-IBK POS tags (Beißwenger/Bartsch/Evert/Würzner 2016). Writes the annotated CoNLL-U to stdout.
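To give a rough idea of what this kind of annotation involves, here is a minimal sketch that maps a token to an STTS-IBK tag with regular expressions. It is not the code used by conllu2cmc: the function cmcTag, the RULES table, and all patterns are hypothetical simplifications, and Wikipedia emoji templates are left out.

```javascript
// Hypothetical, simplified patterns for illustration only;
// the rules actually used by conllu2cmc may differ considerably.
const RULES = [
  ["URL", /^(?:https?:\/\/|www\.)\S+$/i],            // URLs
  ["EML", /^[^\s@]+@[^\s@]+\.[^\s@]+$/],              // email addresses
  ["HST", /^#[\p{L}\p{N}_]+$/u],                      // hashtags
  ["ADR", /^@[\p{L}\p{N}_.-]+$/u],                    // @names
  ["AKW", /^\*[^\s*]+\*$/],                           // action words such as *grins*
  ["EMOASC", /^[:;=8][-o^']?[)(\]\[DPpOo\/\\|*]+$/],  // ASCII emoticons
  ["EMOIMG", /^\p{Extended_Pictographic}/u],          // Unicode emojis
];

// Return the first matching tag, or null if the token is not a CMC phenomenon.
function cmcTag(form) {
  for (const [tag, pattern] of RULES) {
    if (pattern.test(form)) return tag;
  }
  return null;
}

for (const form of ["https://example.org", "#korap", "@alice", ":-)", "😂", "Haus"]) {
  console.log(form, cmcTag(form));
}
```

The rules are tried in order, so the more specific patterns (URL, email) come before the more general ones.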

For Unicode emojis (EMOIMG), the base emoji without skin tone modifiers is written to the LEMMA column and Unicode emoji metadata is added to the FEATS column:

# text = 😂
1	😂	😂	_	EMOIMG	g=smileys_&_emotion|s=face_smiling|q=fully_qualified|v=E0.6|n=face_with_tears_of_joy	_	_	_	_

The FEATS field encodes g (group), s (subgroup), q (qualification status), v (Emoji version in which the emoji was first introduced), and n (emoji name, including skin tone). See https://www.unicode.org/Public/UCD/latest/emoji/emoji-test.txt for details.
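For downstream processing, the FEATS column of an EMOIMG token can be split back into its key/value pairs. The following is a minimal sketch, assuming the tab-separated line shown above; parseEmojiFeats and stripSkinTone are hypothetical helper names, not part of conllu2cmc, and the tool itself may derive the base emoji differently.

```javascript
// Hypothetical helpers for illustration; not part of conllu2cmc.

// Skin tone (Fitzpatrick) modifiers occupy U+1F3FB..U+1F3FF.
const SKIN_TONE = /[\u{1F3FB}-\u{1F3FF}]/gu;

// Remove skin tone modifiers to obtain a base emoji, as described for the LEMMA column.
function stripSkinTone(emoji) {
  return emoji.replace(SKIN_TONE, "");
}

// Split "k1=v1|k2=v2" into an object; CoNLL-U marks an empty field with "_".
function parseEmojiFeats(feats) {
  const result = {};
  if (feats === "_") return result;
  for (const pair of feats.split("|")) {
    const i = pair.indexOf("=");
    if (i === -1) continue; // skip malformed entries
    result[pair.slice(0, i)] = pair.slice(i + 1);
  }
  return result;
}

const line =
  "1\t😂\t😂\t_\tEMOIMG\tg=smileys_&_emotion|s=face_smiling|q=fully_qualified|v=E0.6|n=face_with_tears_of_joy\t_\t_\t_\t_";
const columns = line.split("\t");          // ID, FORM, LEMMA, UPOS, XPOS, FEATS, ...
const feats = parseEmojiFeats(columns[5]); // FEATS is the sixth column

console.log(feats.g, feats.s, feats.n);
console.log(stripSkinTone("👍🏽"));
```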

Local Usage

Using npm/node

cat ./test/data/ndy.conllu | npx conllu2cmc

Using standalone binary

cat ./test/data/ndy.conllu | ./bin/linux/conllu2cmc

Generate KorAP-XML zip with CMC annotations

korapxml2conllu kyc.zip | conllu2cmc -s | conllu2korapxml > kyc.cmc.zip

Docker Usage

# Annotate CoNLL-U input
cat ./test/data/ndy.conllu | docker run --rm -i korap/conllu-cmc

# With sparse output (only annotated lines)
cat ./test/data/ndy.conllu | docker run --rm -i korap/conllu-cmc -s

# Generate KorAP-XML zip with CMC annotations
# For korapxmltool see <https://github.com/KorAP/korapxmltool>
korapxmltool -A "docker run --rm -i korap/conllu-cmc -s" -t zip ndy.zip

# Show help
docker run --rm korap/conllu-cmc --help

Installation

Pre-built Binaries

Download pre-built executables from the Releases page:

  • conllu2cmc - Linux x64
  • conllu2cmc - macOS x64
  • conllu2cmc.exe - Windows x64

npm

npm install 'git+https://gitlab.ids-mannheim.de/KorAP/conllu-cmc-docker.git'

Build from source

npm install

Build standalone executables

# Build for all platforms
npm run pkg-all

# Or build for specific platforms
npm run pkg-linux   # Linux x64
npm run pkg-macos   # macOS x64
npm run pkg-win     # Windows x64

Executables are created in bin/linux/, bin/macos/, and bin/win/.

Docker

docker pull korap/conllu-cmc

References

Beißwenger, Michael/Bartsch, Sabine/Evert, Stefan/Würzner, Kay-Michael (2016): EmpiriST 2015: A Shared Task on the Automatic Linguistic Annotation of Computer-Mediated Communication and Web Corpora. In: Proceedings of the 10th Web as Corpus Workshop. Berlin: Association for Computational Linguistics, pp. 44–56. https://doi.org/10.18653/v1/W16-2606.