| commit | c9b7c43aedf86020e172a2e5093b8f6f21383b33 | |
|---|---|---|
| author | Marc Kupietz <kupietz@ids-mannheim.de> | Sat Mar 07 11:34:28 2026 +0100 |
| committer | Marc Kupietz <kupietz@ids-mannheim.de> | Sat Mar 07 11:34:28 2026 +0100 |
| tree | c954852a82226166aef8a8310477e63b16f3fe39 | |
| parent | ebf770ed13ad7709d754f552d2912a420c8ffb73 | |

Fix citations in Readme

Change-Id: I3252dcc6199cf7fb95725ab343c21f28054a309c

Reads CoNLL-U format from stdin and annotates German gender-sensitive personal nouns, gendered determiners/pronouns, and neo-pronouns with correct POS (UPOS and XPOS/STTS), lemma, and morphological features. Writes CoNLL-U format to stdout.
Existing annotations for matched tokens are replaced; all other tokens pass through unchanged.
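
The pass-through behaviour described above can be pictured with a minimal sketch. It is not the tool's actual code: `annotateConllu` and the `annotate` callback are hypothetical names, and the wiring to stdin/stdout is left out. The only thing assumed from the CoNLL-U format itself is the standard ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC).

```javascript
// CoNLL-U column indices (0-based) used by this sketch.
const FORM = 1, LEMMA = 2, UPOS = 3, XPOS = 4, FEATS = 5;

// `annotate` is a hypothetical callback: it receives a surface form and
// returns { lemma, upos, xpos, feats } for a match, or null otherwise.
function annotateConllu(text, annotate) {
  return text
    .split('\n')
    .map((line) => {
      // Comment lines and blank sentence separators pass through unchanged.
      if (line === '' || line.startsWith('#')) return line;
      const cols = line.split('\t');
      if (cols.length < 10) return line; // not a well-formed token line
      const hit = annotate(cols[FORM]);
      if (!hit) return line; // unmatched tokens keep their annotations
      // Matched tokens: existing LEMMA/UPOS/XPOS/FEATS are replaced.
      cols[LEMMA] = hit.lemma;
      cols[UPOS] = hit.upos;
      cols[XPOS] = hit.xpos;
      cols[FEATS] = hit.feats;
      return cols.join('\t');
    })
    .join('\n');
}
```

Keeping the per-token decision in a callback makes the line handling independent of the matching rules for nouns, determiners and pronouns.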
Based on the morphosyntactic analysis in Ochs (2026), the tool covers all six gender-marker types distinguished by Ochs & Rüdiger (2025):
| Type | Examples | Intent | Gender feature |
|---|---|---|---|
| Genderstern `*` | `Lehrer*in`, `Bürger*innen` | non-binary | `NonBin` |
| Doppelpunkt `:` | `Lehrer:in`, `Bürger:innen` | non-binary | `NonBin` |
| Unterstrich `_` | `Lehrer_in`, `Bürger_innen` | non-binary | `NonBin` |
| Binnen-I `I` | `LehrerIn`, `LehrerInnen` | binary | `Masc,Fem` |
| Klammern `()` | `Lehrer(in)`, `Lehrer(innen)` | binary | `Masc,Fem` |
| Schrägstrich `/` | `Lehrer/in`, `Lehrer/-innen` | binary | `Masc,Fem` |
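
As a rough illustration of the table, the mapping from marker type to Gender value could be written as a handful of regular expressions. This is a sketch only: `genderFeature` is a hypothetical helper, the patterns cover just the `-in`/`-innen` suffix shapes shown in the examples, and the tool's real matching rules may differ.

```javascript
// Hypothetical helper: returns the Gender value for a gender-marked noun,
// or null when no marker from the table is recognised.
function genderFeature(form) {
  // Genderstern, Doppelpunkt, Unterstrich: non-binary intent.
  if (/[*:_]in(nen)?$/.test(form)) return 'NonBin';
  // Binnen-I: capital I directly after a lowercase letter.
  if (/[a-zäöüß]In(nen)?$/.test(form)) return 'Masc,Fem';
  // Klammern: suffix enclosed in parentheses.
  if (/\(in(nen)?\)$/.test(form)) return 'Masc,Fem';
  // Schrägstrich: optional hyphen after the slash, as in Lehrer/-innen.
  if (/\/-?in(nen)?$/.test(form)) return 'Masc,Fem';
  return null;
}
```

The Binnen-I pattern is deliberately case-sensitive, so an ordinary feminine form such as `Lehrerin` is not matched.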
Nouns (NOUN / NN)

| Surface form | Lemma | FEATS |
|---|---|---|
| `Lehrer*in` | `Lehrer*in` | `Gender=NonBin\|Number=Sing` |
| `Lehrer*innen` | `Lehrer*in` | `Gender=NonBin\|Number=Plur` |
| `Lehrer:in` | `Lehrer:in` | `Gender=NonBin\|Number=Sing` |
| `Lehrer:innen` | `Lehrer:in` | `Gender=NonBin\|Number=Plur` |
| `Lehrer_in` | `Lehrer_in` | `Gender=NonBin\|Number=Sing` |
| `LehrerIn` | `LehrerIn` | `Gender=Masc,Fem\|Number=Sing` |
| `LehrerInnen` | `LehrerIn` | `Gender=Masc,Fem\|Number=Plur` |
| `Lehrer(in)` | `Lehrer(in)` | `Gender=Masc,Fem\|Number=Sing` |
| `Lehrer/in` | `Lehrer/in` | `Gender=Masc,Fem\|Number=Sing` |
| `Lehrer/-innen` | `Lehrer/in` | `Gender=Masc,Fem\|Number=Plur` |

The lemma is always the nominative singular of the gender-marked derivative, preserving the original gender marker. Plural markers are stripped and their information is encoded in Number=Plur. The Gender feature is retained for plural forms because the marker is still visibly present on the surface form.
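
The lemma rule can be sketched as follows. `lemmaAndNumber` is a hypothetical name, and the function assumes its input has already been recognised as a gender-marked noun in `-in`/`-innen`. Dropping the hyphen after a slash is inferred from the `Lehrer/-innen` row of the table, not from the tool's source.

```javascript
// Hypothetical helper: derives the lemma and the Number value from a
// surface form that is already known to be a gender-marked noun.
function lemmaAndNumber(form) {
  // Plural forms end in "innen"/"Innen", optionally followed by ")".
  const plural = /(?:in|In)nen\)?$/.test(form);
  // Strip the plural ending "nen" but keep a closing parenthesis.
  let lemma = plural ? form.replace(/nen(\)?)$/, '$1') : form;
  // "Lehrer/-innen" is lemmatised as "Lehrer/in": drop the hyphen.
  lemma = lemma.replace('/-', '/');
  return { lemma, number: plural ? 'Plur' : 'Sing' };
}
```

The gender marker itself is never touched, so the lemma of `LehrerInnen` stays `LehrerIn` rather than collapsing to `Lehrer` or `Lehrerin`.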
Determiners (DET / ART, PIAT, PWAT, …)

Common gendered determiners are annotated.

| Surface form | Lemma | FEATS |
|---|---|---|
| `jede*r` | `jede*r` | `Gender=NonBin` |
| `jede:r` | `jede:r` | `Gender=NonBin` |
| `eine*n` | `eine*n` | `Gender=NonBin` |
| `kein_e` | `kein_e` | `Gender=NonBin` |
| `die/der` | `die/der` | `Gender=Masc,Fem` |

Non-binary markers (*, :, _) yield Gender=NonBin; Schrägstrich (/) yields Gender=Masc,Fem.
Pronouns (PRON / PPER)

Merged pronoun pairs with gender markers receive Gender=NonBin:

| Surface form | Lemma | FEATS |
|---|---|---|
| `sie*er` | `sie*er` | `Gender=NonBin` |
| `er:sie` | `er:sie` | `Gender=NonBin` |

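
A merged pronoun pair can be recognised with a single pattern. The sketch below is illustrative only: `annotatePronounPair` is a hypothetical name, and accepting `_` alongside `*` and `:` is an assumption based on the list of non-binary markers, since the table shows only the first two. Pairs joined by a slash are left unmatched because the table does not cover them.

```javascript
// Matches "er" and "sie" joined by a non-binary marker, in either order.
const PRONOUN_PAIR = /^(?:er[*:_]sie|sie[*:_]er)$/i;

// Hypothetical helper: returns the annotation for a merged pronoun pair,
// or null if the form is not such a pair.
function annotatePronounPair(form) {
  if (!PRONOUN_PAIR.test(form)) return null;
  // As in the table, the lemma is the surface form itself.
  return { lemma: form, upos: 'PRON', xpos: 'PPER', feats: 'Gender=NonBin' };
}
```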
Lexicon-based neo-pronouns are matched by exact surface form (case-insensitive) against a built-in lexicon sourced from pronomen.net. All forms are tagged PRON PPER with Gender=NonBin|PronType=Prs; the lemma is the nominative form.
| Paradigm (NOM/DAT) | NOM | GEN | DAT | ACC |
|---|---|---|---|---|
| sier/siem | sier | sies | siem | sien |
| xier/xiem | xier | xies | xiem | xien |
| ersie/ihmihr | ersie | seinihr | ihmihr | ihnsie |

| Paradigm | NOM | GEN | DAT | ACC |
|---|---|---|---|---|
| dej/denen/dej | dej | (deren) | (denen) | dej |
| dey/denen/dem | dey | (deren) | (denen) | (dem) |
| dey/denen/demm | dey | (deren) | (denen) | demm |
| ey/emm | ey | eys | emm | emm |
| they/them | they | their | them | them |

Forms in parentheses are excluded from the lexicon because they are homonymous with standard German words (deren, denen, dem).

| Paradigm | NOM | GEN | DAT | ACC |
|---|---|---|---|---|
| el/em | el | ems | em | en |
| em/em | em | ems | em | em |
| en/en | en | enses | en | en |
| en/em | en | ens | em | en |
| ens/ens | ens | ens | ens | ens |
| et/siem | et | sier | siem | sien |
| ex/ex | ex | ex | ex | ex |
| hän/sim | hän | sir | sim | sin |
| hen/hem | hen | hens | hem | hen |
| hie/hiem | hie | hein | hiem | hie |
| iks/iks | iks | ikses | iks | iks |
| ind/inde | ind | inds | inde | ind |
| mensch/mensch | mensch | menschs | mensch | mensch |
| nin/nim | nin | nims | nim | nin |
| oj/ojm | oj | juj | ojm | ojn |
| per/per | per | pers | per | per |
| ser/sem | ser | ses | sem | sen |
| Y/Y | Y | Ys | Y | Y |
| zet/zerm | zet | zets | zerm | zern |
| `*/*` (Stern) | `*` | `*s` | `*` | `*` |

Note: oblique forms of et/siem (sier, siem, sien) are shared with the sier paradigm and annotated with lemma sier.
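
The lexicon lookup can be sketched with a small `Map`. Everything here is illustrative: the names `PARADIGMS`, `buildLexicon` and `annotateNeoPronoun` are hypothetical, and only four of the paradigms from the tables are included. Resolving shared forms to the first paradigm that lists them is one simple way to obtain the behaviour in the note, where `siem` receives the lemma `sier`.

```javascript
// Each paradigm lists NOM, GEN, DAT, ACC; the nominative is the lemma.
const PARADIGMS = [
  ['sier', 'sies', 'siem', 'sien'],
  ['xier', 'xies', 'xiem', 'xien'],
  ['dey', 'deren', 'denen', 'dem'],
  ['et', 'sier', 'siem', 'sien'],
];

// Forms homonymous with standard German words are left out of the lexicon.
const EXCLUDED = new Set(['deren', 'denen', 'dem']);

function buildLexicon(paradigms) {
  const lexicon = new Map();
  for (const forms of paradigms) {
    const lemma = forms[0];
    for (const form of forms) {
      const key = form.toLowerCase();
      if (EXCLUDED.has(key)) continue;
      // First paradigm wins, so shared oblique forms keep the lemma "sier".
      if (!lexicon.has(key)) lexicon.set(key, lemma);
    }
  }
  return lexicon;
}

const LEXICON = buildLexicon(PARADIGMS);

// Case-insensitive exact match on the surface form.
function annotateNeoPronoun(form) {
  const lemma = LEXICON.get(form.toLowerCase());
  if (lemma === undefined) return null;
  return { lemma, upos: 'PRON', xpos: 'PPER', feats: 'Gender=NonBin|PronType=Prs' };
}
```

With this ordering, `annotateNeoPronoun('Xiem')` yields the lemma `xier`, while `dem` is not matched at all.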
Known limitations:

- `jedEn`, `jedEr`: these forms embed the capital letter at a non-final position; detection requires morphological analysis beyond simple pattern matching and is not currently supported.
- `begeisterte*n`: not yet annotated (occur in ~5 % of gendered NP elements per Ochs 2026, §7.3.2).
- `Lehrers*in`, dative plural extra marking: rare and not detected.
- `dem`, `deren`, `denen` are excluded from the neo-pronoun lexicon as they are indistinguishable from standard German determiners/pronouns without syntactic context. `per` is included despite its use as a preposition.

```shell
# Annotate CoNLL-U input
korapxml2conllu doc.zip | conllu-gender

# Sparse output (only annotated tokens, with their sentence headers)
korapxml2conllu doc.zip | conllu-gender -s

# Pipe with other KorAP annotation tools
korapxml2conllu doc.zip | conllu-cmc | conllu-gender | conllu2korapxml > doc.annotated.zip
```

| Option | Description |
|---|---|
| `-s`, `--sparse` | Print only tokens that received new annotations (with sentence headers). |
| `-h`, `--help` | Print usage guide. |

```shell
npm install 'git+https://gitlab.ids-mannheim.de/KorAP/conllu-gender.git'
```

```shell
npm install

npm run pkg-linux   # Linux x64
npm run pkg-macos   # macOS x64
npm run pkg-win     # Windows x64
npm run pkg-all     # all platforms

npm test
```

Ochs, Samira/Rüdiger, Jan Oliver (2025): Of stars and colons: A corpus-based analysis of gender-inclusive orthographies in German press texts. In: Schmitz, Dominic/Stein, Simon David/Schneider, Viktoria (eds.): Linguistic intersections of language and gender. Of gender bias and gender fairness. Berlin/Boston: De Gruyter, pp. 31–62. https://doi.org/10.1515/9783111388694.

Ochs, Samira (2026): Die morphosyntaktische Integration neuer Gendersuffixe: Eine korpusbasierte Analyse deutschsprachiger Pressetexte. Gender Linguistics, 2. https://doi.org/10.65020/0619d927.