
conllu-gender

Reads CoNLL-U format from stdin and annotates German gender-sensitive personal nouns, gendered determiners/pronouns, and neo-pronouns with correct POS (UPOS and XPOS/STTS), lemma, and morphological features. Writes CoNLL-U format to stdout.

Existing annotations for matched tokens are replaced; all other tokens pass through unchanged.
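The replace-or-pass-through behaviour can be pictured with a minimal Python sketch. It is not the tool's actual implementation: `annotate_lines` and `demo_annotator` are hypothetical names, and the only assumption about the data is the standard ten-column, tab-separated CoNLL-U token line.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple

# CoNLL-U columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
FORM, LEMMA, UPOS, XPOS, FEATS = 1, 2, 3, 4, 5

# Hypothetical annotator: returns (lemma, upos, xpos, feats) or None.
Annotator = Callable[[str], Optional[Tuple[str, str, str, str]]]


def annotate_lines(lines: Iterable[str], annotate: Annotator) -> Iterator[str]:
    """Overwrite LEMMA/UPOS/XPOS/FEATS on matched tokens; keep the rest as-is."""
    for line in lines:
        line = line.rstrip("\n")
        cols = line.split("\t")
        # Comments, blank separators, and malformed rows pass through unchanged.
        if line.startswith("#") or len(cols) != 10:
            yield line
            continue
        result = annotate(cols[FORM])
        if result is not None:
            cols[LEMMA], cols[UPOS], cols[XPOS], cols[FEATS] = result
        yield "\t".join(cols)


def demo_annotator(form: str) -> Optional[Tuple[str, str, str, str]]:
    # Toy rule for illustration only: a single hard-coded form.
    if form == "Lehrer*in":
        return ("Lehrer*in", "NOUN", "NN", "Gender=Fem,Masc,NonBin|Number=Sing")
    return None
```

Feeding the sketch a token line whose FORM is `Lehrer*in` replaces columns 3 to 6 and leaves HEAD, DEPREL, DEPS, and MISC untouched; every other line comes back exactly as it went in.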

Background

Based on the morphosyntactic analysis in Ochs (2026), the tool covers all six gender-marker types distinguished by Ochs & Rüdiger (2025):

| Type | Examples | Intent | Gender feature |
|---|---|---|---|
| Genderstern `*` | `Lehrer*in`, `Bürger*innen` | non-binary | `Fem,Masc,NonBin` |
| Doppelpunkt `:` | `Lehrer:in`, `Bürger:innen` | non-binary | `Fem,Masc,NonBin` |
| Unterstrich `_` | `Lehrer_in`, `Bürger_innen` | non-binary | `Fem,Masc,NonBin` |
| Binnen-I `I` | `LehrerIn`, `LehrerInnen` | binary | `Masc,Fem` |
| Klammern `()` | `Lehrer(in)`, `Lehrer(innen)` | binary | `Masc,Fem` |
| Schrägstrich `/` | `Lehrer/in`, `Lehrer/-innen` | binary | `Masc,Fem` |
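As a rough illustration of how the six marker types map to the two Gender values, the following Python sketch classifies a noun by regular expression. The patterns are assumptions fitted to the example nouns in the table, and `MARKER_PATTERNS` and `classify` are hypothetical names; the tool's real matching rules may differ.

```python
import re
from typing import Optional, Tuple

NON_BINARY = "Fem,Masc,NonBin"
BINARY = "Masc,Fem"

# (marker type, pattern, Gender value). Illustrative patterns only.
MARKER_PATTERNS = [
    ("Genderstern", re.compile(r"^\w+\*in(nen)?$"), NON_BINARY),
    ("Doppelpunkt", re.compile(r"^\w+:in(nen)?$"), NON_BINARY),
    ("Unterstrich", re.compile(r"^[^\W_]+_in(nen)?$"), NON_BINARY),
    # Binnen-I: a capital I directly after a lowercase letter.
    ("Binnen-I", re.compile(r"^[^\W_]*[a-zäöüß]In(nen)?$"), BINARY),
    ("Klammern", re.compile(r"^\w+\((in|innen)\)$"), BINARY),
    ("Schrägstrich", re.compile(r"^\w+/-?in(nen)?$"), BINARY),
]


def classify(form: str) -> Optional[Tuple[str, str]]:
    """Return (marker type, Gender value) or None for unmarked forms."""
    for name, pattern, gender in MARKER_PATTERNS:
        if pattern.match(form):
            return name, gender
    return None
```

Under these patterns, `classify("Bürger:innen")` yields the Doppelpunkt type with `Fem,Masc,NonBin`, while an unmarked form such as `Lehrerin` yields `None`.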

Annotation

Nouns (NOUN / NN)

| Surface form | Lemma | FEATS |
|---|---|---|
| `Lehrer*in` | `Lehrer*in` | `Gender=Fem,Masc,NonBin\|Number=Sing` |
| `Lehrer*innen` | `Lehrer*in` | `Gender=Fem,Masc,NonBin\|Number=Plur` |
| `Lehrer:in` | `Lehrer:in` | `Gender=Fem,Masc,NonBin\|Number=Sing` |
| `Lehrer:innen` | `Lehrer:in` | `Gender=Fem,Masc,NonBin\|Number=Plur` |
| `Lehrer_in` | `Lehrer_in` | `Gender=Fem,Masc,NonBin\|Number=Sing` |
| `LehrerIn` | `LehrerIn` | `Gender=Masc,Fem\|Number=Sing` |
| `LehrerInnen` | `LehrerIn` | `Gender=Masc,Fem\|Number=Plur` |
| `Lehrer(in)` | `Lehrer(in)` | `Gender=Masc,Fem\|Number=Sing` |
| `Lehrer/in` | `Lehrer/in` | `Gender=Masc,Fem\|Number=Sing` |
| `Lehrer/-innen` | `Lehrer/in` | `Gender=Masc,Fem\|Number=Plur` |

The lemma is always the nominative singular of the gender-marked derivative, preserving the original gender marker. Plural markers are stripped and their information is encoded in Number=Plur. The Gender feature is retained for plural forms because the marker is still visibly present on the surface form.
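The lemma rule can be sketched in a few lines of Python. This is an illustration under assumptions, not the tool's code: `annotate_noun` is a hypothetical helper, it expects a form already recognised as a gender-marked noun, and the handling of `Lehrer(innen)` is a guess by analogy, since that form has no row in the table.

```python
from typing import Tuple


def annotate_noun(form: str, gender: str) -> Tuple[str, str]:
    """Return (lemma, FEATS) for a form already known to be gender-marked."""
    if form.endswith("(innen)"):
        # Assumed by analogy: the plural inside the brackets is reduced to (in).
        lemma, number = form[: -len("(innen)")] + "(in)", "Plur"
    elif form.endswith("/-innen"):
        # The hyphen belongs to the plural spelling and is dropped in the lemma.
        lemma, number = form[: -len("/-innen")] + "/in", "Plur"
    elif form.endswith(("innen", "Innen")):
        # Strip the plural ending "nen"; the gender marker itself stays.
        lemma, number = form[:-3], "Plur"
    else:
        lemma, number = form, "Sing"
    return lemma, f"Gender={gender}|Number={number}"
```

For example, `annotate_noun("Lehrer/-innen", "Masc,Fem")` returns the lemma `Lehrer/in` together with `Gender=Masc,Fem|Number=Plur`, matching the last row of the table.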

Determiners (DET / ART, PIAT, PWAT, …)

Common gendered determiners are annotated.

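| Surface form | Lemma | FEATS |
|---|---|---|
| `jede*r` | `jede*r` | `Gender=Fem,Masc,NonBin` |
| `jede:r` | `jede:r` | `Gender=Fem,Masc,NonBin` |
| `eine*n` | `eine*n` | `Gender=Fem,Masc,NonBin` |
| `kein_e` | `kein_e` | `Gender=Fem,Masc,NonBin` |
| `die/der` | `die/der` | `Gender=Masc,Fem` |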

Non-binary markers (*, :, _) yield Gender=Fem,Masc,NonBin; Schrägstrich (/) yields Gender=Masc,Fem.

Pronouns / Neo-pronouns (PRON / PPER)

Merged pronoun pairs with gender markers receive Gender=Fem,Masc,NonBin:

| Surface form | Lemma | FEATS |
|---|---|---|
| `sie*er` | `sie*er` | `Gender=Fem,Masc,NonBin` |
| `er:sie` | `er:sie` | `Gender=Fem,Masc,NonBin` |

Lexicon-based neo-pronouns are matched by exact surface form (case-insensitive) against a built-in lexicon sourced from pronomen.net. All forms are tagged PRON PPER with Gender=Fem,Masc,NonBin|PronType=Prs; the lemma is the nominative form.
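A minimal Python sketch of the lexicon lookup follows. The dictionary holds only a handful of forms from the sier, xier, and hen paradigms listed in this section, and `NEO_PRONOUNS` and `annotate_pronoun` are hypothetical names rather than the tool's actual data structures.

```python
from typing import Optional, Tuple

# Small excerpt: lower-cased surface form -> nominative lemma.
NEO_PRONOUNS = {
    "sier": "sier", "sies": "sier", "siem": "sier", "sien": "sier",
    "xier": "xier", "xies": "xier", "xiem": "xier", "xien": "xier",
    "hen": "hen", "hens": "hen", "hem": "hen",
}

PRONOUN_FEATS = "Gender=Fem,Masc,NonBin|PronType=Prs"


def annotate_pronoun(form: str) -> Optional[Tuple[str, str, str, str]]:
    """Return (lemma, UPOS, XPOS, FEATS) for a known neo-pronoun, else None."""
    # Exact match on the whole token, ignoring case only.
    lemma = NEO_PRONOUNS.get(form.lower())
    if lemma is None:
        return None
    return (lemma, "PRON", "PPER", PRONOUN_FEATS)
```

A sentence-initial `Xiem` is matched through lower-casing and receives the lemma `xier`; a standard pronoun such as `sie` is absent from the lexicon and is left alone.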

Verschmelzung (blend pronouns)

| Paradigm (NOM/DAT) | NOM | GEN | DAT | ACC |
|---|---|---|---|---|
| sier/siem | sier | sies | siem | sien |
| xier/xiem | xier | xies | xiem | xien |
| ersie/ihmihr | ersie | seinihr | ihmihr | ihnsie |

They-ähnlich

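| Paradigm | NOM | GEN | DAT | ACC |
|---|---|---|---|---|
| dej/denen/dej | dej | (deren) | (denen) | dej |
| dey/denen/dem | dey | (deren) | (denen) | (dem) |
| dey/denen/demm | dey | (deren) | (denen) | demm |
| ey/emm | ey | eys | emm | emm |
| they/them | they | their | them | them |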

Forms in parentheses are excluded from the lexicon because they are homonymous with standard German words (deren, denen, dem).

Neuer Stamm (new-stem pronouns)

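| Paradigm | NOM | GEN | DAT | ACC |
|---|---|---|---|---|
| el/em | el | ems | em | en |
| em/em | em | ems | em | em |
| en/en | en | enses | en | en |
| en/em | en | ens | em | en |
| ens/ens | ens | ens | ens | ens |
| et/siem | et | sier | siem | sien |
| ex/ex | ex | ex | ex | ex |
| hän/sim | hän | sir | sim | sin |
| hen/hem | hen | hens | hem | hen |
| hie/hiem | hie | hein | hiem | hie |
| iks/iks | iks | ikses | iks | iks |
| ind/inde | ind | inds | inde | ind |
| mensch/mensch | mensch | menschs | mensch | mensch |
| nin/nim | nin | nims | nim | nin |
| oj/ojm | oj | juj | ojm | ojn |
| per/per | per | pers | per | per |
| ser/sem | ser | ses | sem | sen |
| Y/Y | Y | Ys | Y | Y |
| zet/zerm | zet | zets | zerm | zern |
| `*/*` (Stern) | `*` | `*s` | `*` | `*` |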

Note: oblique forms of et/siem (sier, siem, sien) are shared with the sier paradigm and annotated with lemma sier.
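One way to obtain both behaviours described in this section (dropping the homonymous forms and assigning shared forms to the sier lemma) is to build the form-to-lemma map row by row and let the first paradigm that claims a form keep it. The Python sketch below assumes exactly that; `PARADIGMS`, `EXCLUDED`, and `build_lexicon` are hypothetical names, and the row order is an assumption, not something documented for the tool.

```python
from typing import Dict, Iterable, Tuple

# Forms homonymous with standard German words, kept out of the lexicon.
EXCLUDED = {"deren", "denen", "dem"}

# Excerpt of paradigm rows as (NOM, GEN, DAT, ACC); order matters below.
PARADIGMS = [
    ("sier", "sies", "siem", "sien"),
    ("dey", "deren", "denen", "demm"),
    ("et", "sier", "siem", "sien"),
]


def build_lexicon(rows: Iterable[Tuple[str, str, str, str]]) -> Dict[str, str]:
    """Map each lower-cased surface form to the nominative of its paradigm."""
    lexicon: Dict[str, str] = {}
    for row in rows:
        lemma = row[0]  # the nominative form serves as lemma
        for form in row:
            if form in EXCLUDED:
                continue
            # First paradigm wins, so forms shared with sier keep lemma "sier".
            lexicon.setdefault(form.lower(), lemma)
    return lexicon
```

With the rows in this order, `siem` resolves to the lemma `sier` even though it also occurs in the et/siem paradigm, while `et` itself keeps the lemma `et`.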

Known limitations

  • Binnen-I with a non-final capital (e.g. jedEn, jedEr): detecting a capital letter embedded at a non-final position requires morphological analysis beyond simple pattern matching and is not currently supported.
  • Gendered adjectives (e.g. begeisterte*n): not yet annotated (occur in ~5 % of gendered NP elements per Ochs 2026, §7.3.2).
  • Inflected case suffixes on gendered nouns (e.g. genitive Lehrers*in, or extra dative plural marking): these are rare and not detected.
  • Ambiguous neo-pronoun forms: dem, deren, denen are excluded from the neo-pronoun lexicon as they are indistinguishable from standard German determiners/pronouns without syntactic context. per is included despite its use as a preposition.

Usage

```shell
# Annotate CoNLL-U input
korapxml2conllu doc.zip | conllu-gender

# Sparse output (only annotated tokens, with their sentence headers)
korapxml2conllu doc.zip | conllu-gender -s

# Pipe with other KorAP annotation tools
korapxml2conllu doc.zip | conllu-cmc | conllu-gender | conllu2korapxml > doc.annotated.zip
```

Options

| Option | Description |
|---|---|
| `-s`, `--sparse` | Print only tokens that received new annotations (with sentence headers). |
| `-h`, `--help` | Print usage guide. |
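The effect of sparse mode can be pictured with the following Python sketch. It is an interpretation of the option description, not the tool's code: `sparse` is a hypothetical function, the input shape is invented for the example, and dropping sentences without any annotated token is an assumption.

```python
from typing import Iterable, Iterator, List, Tuple

# One sentence: (header comment lines, [(token line, was_annotated), ...])
Sentence = Tuple[List[str], List[Tuple[str, bool]]]


def sparse(sentences: Iterable[Sentence]) -> Iterator[str]:
    """Yield sentence headers plus only the newly annotated token lines."""
    for headers, tokens in sentences:
        kept = [line for line, annotated in tokens if annotated]
        if not kept:
            # Assumption: sentences without any hit are omitted entirely.
            continue
        yield from headers
        yield from kept
        yield ""  # blank line closes the sentence block
```

Keeping the header comments alongside the retained tokens means each annotated token can still be traced back to its sentence in the full document.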

Installation

npm

```shell
npm install 'git+https://gitlab.ids-mannheim.de/KorAP/conllu-gender.git'
```

Build from source

```shell
npm install
```

Build standalone binary

```shell
npm run pkg-linux   # Linux x64
npm run pkg-macos   # macOS x64
npm run pkg-win     # Windows x64
npm run pkg-all     # all platforms
```

Testing

```shell
npm test
```

References

Ochs, Samira/Rüdiger, Jan Oliver (2025): Of stars and colons: A corpus-based analysis of gender-inclusive orthographies in German press texts. In: Schmitz, Dominic/Stein, Simon David/Schneider, Viktoria (eds.): Linguistic intersections of language and gender. Of gender bias and gender fairness. Berlin/Boston: De Gruyter, pp. 31–62. https://doi.org/10.1515/9783111388694.

Ochs, Samira (2026): Die morphosyntaktische Integration neuer Gendersuffixe: Eine korpusbasierte Analyse deutschsprachiger Pressetexte. In: Gender Linguistics 2. https://doi.org/10.65020/0619d927.