Controlling Gender Equality with Shallow NLP Techniques
M. Carl, S. Garnier, J. Haller
Institut für Angewandte Informationsforschung
66111 Saarbrücken
Germany
{carl,sandrine,hans}@iai.uni-sb.de
A. Altmayer and B. Miemietz
Universität des Saarlandes
66123 Saarbrücken, Germany
anne@altmayer.info
Miemietz.Baerbel@MH-Hannover.DE
Abstract
This paper introduces the "Gendercheck Editor", a tool to check
German texts for gender-discriminatory formulations. It relies on
shallow rule-based techniques as used in the Controlled Language
Authoring Technology (CLAT). The paper outlines major sources of
gender imbalance in German texts. It gives background on the
underlying CLAT technology and describes the marking and annotation
strategy used to automatically detect and visualize the questionable
pieces of text. The paper provides a detailed evaluation of the editor.
1 Introduction
The times of feminist (language) revolution are gone, but they have
left their mark in the form of a changed society with a changed
consciousness and changed gender roles. Nevertheless, language use
seems to resist change much more strongly than society does. As the
use of non-discriminatory language is nowadays compulsory in
administration and official documents, a number of guidelines and
recommendations exist which help to avoid gender imbalance and
stereotypes in language use. Although in some cases men may be
concerned (for example, most terms referring to criminals are
masculine nouns), the main concern is the adequate representation of
women in language, especially in a professional context. Psychological
tests demonstrate that persons reading or hearing masculine job titles
(so-called generic terms allegedly meaning both women and men) do not
visualize women working in this field.
In order to avoid this kind of discrimination
two main principles are often suggested (e.g. as
in (Ges, 1999; Uni, 2000)):
1. use of gender-neutral language, which
rather "disguises" the acting person by us-
ing impersonal phrases.
2. explicit naming of women and men as
equally represented acting persons.
Applying these guidelines faithfully is time-consuming and requires a
great amount of practice, which cannot always be provided,
particularly by inexperienced writers. Moreover, these guidelines are
often completely unknown in non-feminist circles. A tool which checks
texts for discriminatory use of language is thus mandatory to promote
written gender equality and to educate and remind writers of
unacceptable forms to avoid.
In this paper we describe the project "Gendercheck", which uses a
controlled-language authoring tool (CLAT) as a platform and editor to
check German texts for discriminatory language.
In section 2 we introduce three categories of
gender discrimination in German texts and pro-
vide possibilities for their reformulation.
Section 3 introduces the technology on which the Gendercheck editor is
based. The linguistic engine proceeds in two steps, a marking and a
filtering phase, in which gender-discriminatory formulations are
automatically detected. A graphical interface displays the detected
formulations and shows the corresponding messages for correction.
Section 4 then goes into the details of the marking and filtering
technique. We use the shallow pattern formalism kurd (Carl and
Schmidt-Wigger, 1998; Ins, 2004) first to mark possibly erroneous
formulations and then to filter out those which occur in a "gendered"
context.
Section 5 evaluates the Gendercheck editor on two texts.
2 Gender Inequality in German
Texts
The most prominent task in achieving gender equality on a linguistic
level in German texts is to find solutions and alternatives for the
so-called generic masculine: the masculine form is taken as the
generic form to designate all persons of any sex. The major problem is
to figure out whether or not a given person denotation refers to a
particular person. For instance, in example (1a) "Beamter" (officer)
is most likely used in its generic reading and refers to female
officers (Beamtinnen) and male officers (Beamten). To achieve gender
equality an appropriate reformulation is required, as shown in example
(1b).
(1a) Der Beamte muss den Anforderungen Genüge leisten.
(1b) Alle Beamten und Beamtinnen müssen den Anforderungen Genüge
leisten.
Since we tackle texts from administrative and legal domains, we assume
unspecified references in principle. That is, a masculine (or
feminine!) noun will not denote a concrete person but rather refers to
all persons, irrespective of their sex.
A second class of errors are masculine relative, possessive and
personal pronouns which refer to a generic masculine or an indefinite
masculine pronoun.
(2a) Der Beamte muss seine Wohnung in der Nähe des Arbeitsplatzes
suchen.
(2b) Jeder muss seinen Beitrag dazu leisten.
(2c) Wer Rechte hat, der hat auch Pflichten.
The possessive pronoun "seine" (his) in example (2a) refers to the
preceding "Beamte" (officer). The generic masculine use of "Beamte"
and the referring pronoun will be marked. The same holds for sentence
(2b), where the possessive pronoun refers to the indefinite pronoun
"jeder" (every, masc.). The indefinite pronouns "jemand" (someone) and
"wer" (who) count as acceptable. However, masculine pronouns referring
to them will be marked. In example (2c), the masculine relative
pronoun "der" can be omitted.
A third class of gender inequality is lack of agreement between the
subject and the predicative noun. In example (3a), the masculine
subject "Ansprechpartner" (partner, masc.) occurs with the female
predicative noun "Frau Müller" (Mrs. Müller).
(3a) Ihr Ansprechpartner ist Frau Müller.
(3b) Ihre Ansprechpartnerin ist Frau Müller.
A solution for this class of errors is shown in
example (3b) where the subject (Ansprechpart-
nerin) is adapted to the female gender of the
predicate.
Suggestions for reformulating gender imbalances as shown in examples
(1) and (2) can be classified into two main categories:
1. Whenever possible, use gender-neutral formulations. These include
collective nouns (e.g. Lehrkörper (teaching staff) or
Arbeitnehmerschaft (collective of employees)) as well as nominalized
participles (Studierende (students)) or nominalized adjectives
(Berechtigte).
2. Use both forms if gender-neutral formulations cannot be found. That
is, the feminine and the masculine form are to be coordinated with
"und", "oder" or "bzw.". A coordination with a slash "/" will also be
suggested but should only be used in forms, ordinances and
regulations.
Amendments should conform to general German writing rules. The
so-called "Binnen-I", an upper-case "I" as in "StudentInnen", will not
be suggested, and naming the female suffix in parentheses should also
be avoided. The same holds for the indefinite pronoun "frau" (woman),
which was occasionally suggested to complement the pronoun "man".
3 The Gendercheck Editor
Controlled-Language Authoring Technology (CLAT)
CLAT has been developed to suit the need of some companies to
automatically check their technical texts for general language and
company-specific language conventions. Within CLAT, texts are checked
with respect to:
• orthographic correctness
• company-specific terminology and abbreviations
• general and company-specific grammatical correctness
• stylistic correctness according to general and company-specific
  requirements
The orthographic control examines texts for orthographic errors and
proposes alternative spellings. The terminology component matches the
text against a terminology and abbreviation database, where term
variants are also detected (Carl et al., 2004). Grammar control checks
the text for grammatical correctness and disambiguates multiple
readings. Stylistic control detects stylistic inconsistencies.

Figure 1: The Gendercheck Editor

The components build on each other's output. Besides the described
control mechanisms, CLAT also has a graphical front-end which makes it
possible to mark segments in the texts with different colors. Single
error codes can be switched off or on, and segments of text can be
edited or ignored according to the author's needs. CLAT also allows
batch processing, where XML-annotated text output is generated.
Figure 1 shows the graphical interface of the editor. The lower part
of the editor displays an input sentence. The highlighted SGML codes
are manually annotated gender "mistakes". The upper part displays the
automatically annotated sentence with underlined gender mistakes.
As we shall discuss in section 5, gender imbalances are manually
annotated to make automatic evaluation easier. In this example, the
highlighted words "Deutscher" (German) and "EG-Bürger" (EU citizen)
are identical in the manually annotated text and in the automatically
annotated text. The user can click on one of the highlighted words in
the upper window to display the explanatory message in the middle part
of the screen. Further information and correction or reformulation
hints can also be obtained in an additional window, as shown on the
right side of the figure. The messages are designed according to the
main classes of gender-discriminatory formulations discussed
previously.
4 Gender Checking Strategy
Gendercheck uses a marking and filtering strategy: first, all possible
occurrences of words in an error class are marked. In a second step,
"gendered" formulations are filtered out. The remaining marked words
are assigned an error code which is displayed in the Gendercheck
editor.
Following the classification in section 2, this section examines the
marking and filtering strategy for generic masculine agents (section
4.1), pronouns which refer to generic masculine agents (section 4.2)
and errors in agreement of predicative nouns (section 4.3).
Marking and filtering is realized with kurd, a pattern matching
formalism described in (Carl and Schmidt-Wigger, 1998; Ins, 2004). The
input to kurd is morphologically analyzed and semantically tagged
text.
4.1 Class 1: Agents
4.1.1 Marking Agents
Two mechanisms are used to mark denotations of persons:
a) The morphological analysis of mpro (Maas, 1996) generates not only
derivational and inflectional information for German words, but also
assigns a small set of semantic values. Male and female human agents
such as "Soldat" (soldier), "Bürgermeister" (mayor, masc.), "Beamte"
(officer, masc.), "Krankenschwester" (nurse, fem.) etc. are assigned a
semantic feature s=agent. Words that carry this feature will be marked
style=agent.
b) Problems occur for nouns whose base word is a nominalized verb. For
instance, "Gewichtheber" (weightlifter) and "Busfahrer" (bus driver)
will not be assigned the feature s=agent by mpro, since a "lifter" and
a "driver" can be a thing or a human. Gender inequalities, however,
only apply to humans. Given that the tool is used in a restricted
domain, a special list of lexemes can be used to assign these words
the style feature style=agent. The kurd rule Include shows some of the
lexemes from this list. The lexemes are chosen to cover a maximum
number of words. For instance, the lexeme absolvieren (graduate) will
match "Absolvent" (alumnus, masc.), "Absolventin" (alumna, fem.),
"Absolventen" (alumni, masc. pl.) and "Absolventinnen" (alumnae, fem.
pl.).
1 Include =
2 Ae{c=noun,
3 ls:absolvieren$;
4 dezernieren$;
5 richten$;
6 fahren$;
7 administrieren$;
8 vorstand$}
9 : Ag{style=agent}.
Lines 3 to 8 enumerate a list of lexemes separated by semicolons. The
colon in line 3 following the attribute name ls tells kurd to
interpret the values as regular expressions. Since the dollar sign $
matches the end of the value in the input object, each lexeme in the
list can also be the head of a compound word. Thus, the test
ls:fahren$ matches all lexemes that have fahren as their head word,
such as "Fahrer" (driver), "Busfahrer" (bus driver), etc. The action
Ag{style=agent} marks the matched words as agents.
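The effect of the Include rule can be approximated outside kurd with ordinary regular expressions. The following Python sketch is only an illustration under stated assumptions: the token dictionaries, the key names ('c', 'ls', 'style') and the compound encoding "bus#fahren" are hypothetical stand-ins for the mpro analysis, not the actual data format of the system.

```python
import re

# Lexeme patterns copied from the Include rule. The trailing "$" anchors
# each pattern at the end of the lexeme, so compound heads match as well.
INCLUDE_PATTERNS = ["absolvieren$", "dezernieren$", "richten$",
                    "fahren$", "administrieren$", "vorstand$"]


def mark_agents(tokens):
    """Set style='agent' on nouns whose lexeme ends in an include pattern.

    Each token is a dict with the hypothetical keys 'c' (category) and
    'ls' (lexeme string); this is an assumed stand-in for mpro output.
    """
    for tok in tokens:
        if tok.get("c") != "noun":
            continue
        if any(re.search(p, tok["ls"]) for p in INCLUDE_PATTERNS):
            tok["style"] = "agent"
    return tokens


# The lexeme values below are invented for illustration only.
tokens = [
    {"word": "Busfahrer", "c": "noun", "ls": "bus#fahren"},
    {"word": "Absolventin", "c": "noun", "ls": "absolvieren"},
    {"word": "Wohnung", "c": "noun", "ls": "wohnen"},
]
marked = [t["word"] for t in mark_agents(tokens) if t.get("style") == "agent"]
```

With these assumed lexemes, "Busfahrer" and "Absolventin" are marked while "Wohnung" is left untouched.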
4.1.2 Filtering "gendered" Agents
The text then passes through several filters which delete marks on
words if they appear within gendered formulations.
a) Excluded are marked agents which precede a family name. The marking
of "Beamte" in example (4) will be erased since it is followed by the
family name "Meier". "Beamte Meier" is likely to have a specific
reference.
(4) Der Beamte Meier hat gegen die Vorschrift verstoßen.
In terms of kurd this can be achieved with
the rule AgentMitFname: if a family name
(s=fname) follows a sequence of marked agents
(style=agent) the marks in the agent nodes are
removed (r{style=nil}).
1 AgentMitFname =
2 +Ae{style=agent},
3 Ae{c=noun,s=fname}
4 : Ar{style=nil}.
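The behaviour of AgentMitFname can be sketched in Python as follows. This is a simplified illustration, not the kurd implementation: tokens are hypothetical dictionaries, and None plays the role of the value nil.

```python
def unmark_agents_before_family_name(tokens):
    """Clear style='agent' on a run of marked agents that is directly
    followed by a family name (c='noun', s='fname')."""
    i = 0
    while i < len(tokens):
        # Extend j over a contiguous run of marked agents starting at i.
        j = i
        while j < len(tokens) and tokens[j].get("style") == "agent":
            j += 1
        followed_by_name = (j < len(tokens)
                            and tokens[j].get("c") == "noun"
                            and tokens[j].get("s") == "fname")
        if j > i and followed_by_name:
            for tok in tokens[i:j]:
                tok["style"] = None  # None stands in for kurd's nil
        i = max(j, i + 1)
    return tokens


# Example (4): "Beamte" is followed by the family name "Meier".
specific = [
    {"word": "Der"},
    {"word": "Beamte", "c": "noun", "style": "agent"},
    {"word": "Meier", "c": "noun", "s": "fname"},
    {"word": "hat"},
]
# Example (1a): no family name follows, so the mark must survive.
generic = [
    {"word": "Der"},
    {"word": "Beamte", "c": "noun", "style": "agent"},
    {"word": "muss"},
]
unmark_agents_before_family_name(specific)
unmark_agents_before_family_name(generic)
```

After the call, "Beamte" in the first sentence is no longer marked, whereas the generic use in the second sentence keeps its mark.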
b) Also excluded are nominalized plural adjectives and participles,
since they are well suited for gender-neutral formulations. In example
(5), the nominalized plural adjective "Sachverständige" (experts) is
ambiguous with respect to gender. The mark will thus be removed.
(5) Sind bereits Sachverständige bestellt?
c) Marked words in already gendered formulations are also erased.
Pairing female and male forms by conjunction is a recommended way to
produce gender equality. In example (6), the subject "Die Beamtin oder
der Beamte" (the officer, fem., or the officer, masc.) as well as the
pronouns which refer to it, "sie oder er" (she or he) and "ihrer oder
seiner" (her or his), are gender-equal formulations.
(6) Die Beamtin oder der Beamte auf Lebenszeit oder auf Zeit ist in
den Ruhestand zu versetzen, wenn sie oder er infolge eines
körperlichen Gebrechens oder wegen Schwäche ihrer oder seiner
körperlichen oder geistigen Kräfte zur Erfüllung ihrer oder seiner
Dienstpflichten dauernd unfähig (dienstunfähig) ist.
The kurd rule gegendert removes these marks. The description in lines
2 to 5 matches a conjunction of two marked agents (style=agent) which
share the same lexeme ls=_L but which differ in gender. This latter
constraint is expressed by the two tests ehead={g=_G} and
ehead={g~=_G}, which only unify if the gender features "g" have
non-identical sets of values.
1 gegendert =
2 Ae{style=agent,ls=_L,ehead={g=_G}},
3 e{lu=oder;und;bzw.;/},
4 *a{style~=agent}e{c=w},
5 Ae{style=agent,ls=_L,ehead={g~=_G}}
6 : Ar{style=nil}.
The rule allows the conjunctions "und", "oder", "bzw." and "/".
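As an illustration of the pairing filter, the following Python sketch mimics the gegendert rule on a list of token dictionaries. It rests on several assumptions: the keys 'lu', 'ls', 'c', 'g' and 'style' are hypothetical, the shared lexeme value "beamte" is invented, and gender is simplified to a single value instead of the sets of values that kurd unifies.

```python
CONJUNCTIONS = {"oder", "und", "bzw.", "/"}


def unmark_paired_forms(tokens):
    """Remove agent marks from 'feminine CONJ masculine' pairs (or the
    reverse) that share a lexeme but differ in gender."""
    for i, first in enumerate(tokens):
        if first.get("style") != "agent":
            continue
        if i + 1 >= len(tokens) or tokens[i + 1].get("lu") not in CONJUNCTIONS:
            continue
        # Skip unmarked function words (c='w'), e.g. the article "der".
        j = i + 2
        while (j < len(tokens) and tokens[j].get("style") != "agent"
               and tokens[j].get("c") == "w"):
            j += 1
        if j >= len(tokens):
            continue
        second = tokens[j]
        if (second.get("style") == "agent"
                and second.get("ls") == first.get("ls")
                and second.get("g") != first.get("g")):
            first["style"] = None
            second["style"] = None
    return tokens


# Beginning of example (6): "Die Beamtin oder der Beamte".
paired = [
    {"word": "Die", "c": "w"},
    {"word": "Beamtin", "c": "noun", "ls": "beamte", "g": "f", "style": "agent"},
    {"word": "oder", "c": "w", "lu": "oder"},
    {"word": "der", "c": "w"},
    {"word": "Beamte", "c": "noun", "ls": "beamte", "g": "m", "style": "agent"},
]
unmark_paired_forms(paired)
still_marked = [t["word"] for t in paired if t.get("style") == "agent"]
```

Both "Beamtin" and "Beamte" lose their marks, so nothing in the coordinated subject is reported to the user.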
d) Some nouns are erroneously marked even though no gender-equal
formulation is possible. For instance, words such as "Mensch" (human
being), "Gast" (guest) and "Flüchtling" (refugee) are masculine in
gender, yet there is no corresponding female form in German. These
words are included in an exclude list which works similarly to the
include list discussed previously.
1 exclude =
2 Aa{style=agent,
3 lu:mensch$;
4 flüchtling$;
5 säugling$;
6 gast$;
7 rat$}
8 : Ar{style=nil}.
4.1.3 Non marked Expressions
a) Currently, we do not mark compound nouns which have an agent as
their modifier and a non-agent as their head. However, words such as
"Rednerpult" (speaker desk = lectern) and "Teilnehmerliste"
(participants list = list of participants) are also candidates for
gender mainstreaming and should be written as "Redepult" (talk desk)
and "Teilnehmendeliste" (participating list).
b) We do not mark articles and adjectives which precede the marked
noun. This would be troublesome in constructions like example (7),
where the article "der" (the) and the corresponding noun "Dezernent"
(head of department) are separated by an intervening adjectival
phrase.
(7) Den Vorsitz führt der jeweils für die Aufgaben zuständige
Dezernent.
c) It is currently impossible to look beyond the sentence boundary. As
a consequence, the referent of an agent cannot be detected if it
occurs in the preceding sentence. For instance, "Herr Müller" is the
referent of "Beamte" in the second sentence of example (8).
(8) Herr Müller hat die Dienstvorschrift verletzt. Der Beamte ist
somit zu entlassen.
The word "Beamte" will be erroneously marked because information from
the preceding sentence is not available to resolve the reference.
4.2 Class 2: Pronouns
Personal pronouns, possessive pronouns, relative pronouns and
indefinite pronouns are also marked. The strategy is similar to the
one for agents above: first all pronouns are marked, and in a second
step markings in correct formulations are erased.
With the exception of indefinite pronouns ("Mancher", "Jemand",
"Niemand" etc.), a marked referent agent must be available in the same
sentence. Three different rules are used to mark relative pronouns,
personal pronouns and possessive pronouns.
1 MarkRelativPronomen =
2 e{style=agent,ehead={g=_G}},
3 *a{lu~=&cm},
4 e{lu=&cm},
5 Ae{lu=d_rel,ehead={g=_G}}
6 : Ag{style=agent}.
a) The rule MarkRelativPronomen detects a marked agent in line 2.
Lines 3 and 4 search for the next comma that follows the marked agent
(commas are coded as "&cm" in the formalism), and line 5 matches the
relative pronoun that immediately follows the comma (relative pronouns
are assigned the lexeme "d_rel"). The relative pronoun must agree in
gender with the agent (ehead={g=_G}). As we shall see in section 5,
this is an error-prone approximation to reference resolution.

Table 1: Size of the test texts and classes of errors

        Size of test                    Classes of errors
Text    #sent.  #words  Errors/sent.    Class 1  Class 2  Class 3  Σ
ET1     95      1062    1.83            97       62       15       174
TT2     251     6473    0.46            95       21       -        116

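The relative-pronoun heuristic can be illustrated with a short Python sketch. It is a simplification under assumptions: tokens are hypothetical dictionaries, gender is a single value, and only the codes "&cm" and "d_rel" are taken from the rule itself. The sample sentence is the beginning of example (10) from section 5.

```python
def mark_relative_pronouns(tokens):
    """Mark a relative pronoun that directly follows the first comma
    after a marked agent, provided both carry the same gender value."""
    agent_positions = [i for i, t in enumerate(tokens)
                       if t.get("style") == "agent"]
    for i in agent_positions:
        # Advance to the next comma after the agent.
        j = i + 1
        while j < len(tokens) and tokens[j].get("lu") != "&cm":
            j += 1
        k = j + 1  # position immediately after the comma
        if (k < len(tokens) and tokens[k].get("lu") == "d_rel"
                and tokens[k].get("g") == tokens[i].get("g")):
            tokens[k]["style"] = "agent"
    return tokens


sentence = [
    {"word": "Der"},
    {"word": "Beamte", "style": "agent", "g": "m"},
    {"word": "ist"},
    {"word": "auf"},
    {"word": "seinen"},
    {"word": "Antrag", "g": "m"},
    {"word": ",", "lu": "&cm"},
    {"word": "der", "lu": "d_rel", "g": "m"},
    {"word": "binnen"},
]
mark_relative_pronouns(sentence)
marked_pronouns = [t["word"] for t in sentence
                   if t.get("lu") == "d_rel" and t.get("style") == "agent"]
```

The pronoun "der" is marked because it agrees in gender with "Beamte", although it actually refers to "Antrag". This is the kind of noise that a purely gender-based approximation produces.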
b) Personal and possessive pronouns are only marked if they refer to a
male agent. The two rules MarkPersonalPronomen and
MarkPossesivPronomen work in a similar fashion: in line 2 the marked
masculine referent is matched. Lines 3 and 4 match the following
personal pronoun (c=w,sc=pers) or, respectively, possessive pronoun
(c=w,sc=poss). In line 5, the pronouns are marked.
1 MarkPersonalPronomen =
2 e{style=agent,ehead={g=m}},
3 1Ae{lu=er;er_es,c=w,sc=pers}
4 |e{s~=agent,sc~=punct}
5 : Ar{style=agent}.
1 MarkPossesivPronomen =
2 e{style=agent,ehead={g=m}},
3 1Ae{lu=sein,c=w,sc=poss}
4 |e{s~=agent,sc~=punct}
5 : Ar{style=agent}.
After the marking step, pronoun marks are filtered. Filtering of
pronouns is similar to the previously discussed rule gegendert.
4.3 Class 3: Predicative Noun
Missing agreement between subject and pred-
icative noun is detected with the following kurd
rule:
1 Praedikatsnomen =
2 +Ae{mark=np,style=agent,ehead={g=_G}},
3 *Ae{mark=np},
4 e{ls=sein,c=verb},
5 *Ae{style~=agent},
6 Ae{mark=np,style=agent,ehead={g~=_G}},
7 *Ae{mark=np}
8 : Ar{bstyle=Gen3,estyle=Gen3}.
Lines 2 and 3 detect the marked subject. Notice that noun groups are
marked with the feature mark=np by a previous chunking module. Lines 5
to 7 match the predicative noun. Both parts of the sentence are
connected by the copula "sein" (be). Similar to the rule gegendert,
the rule only applies if both parts differ in gender.
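A rough Python counterpart of the agreement check is given below. It is a deliberately simplified sketch: it ignores the noun-group chunking (mark=np), treats gender as a single value, and assumes that "Frau" is tagged as an agent; the token keys are hypothetical.

```python
def check_predicative_agreement(tokens):
    """Return (subject, predicate) word pairs around a form of 'sein'
    whose marked agents differ in gender."""
    errors = []
    for i, tok in enumerate(tokens):
        if tok.get("ls") != "sein" or tok.get("c") != "verb":
            continue
        subjects = [t for t in tokens[:i] if t.get("style") == "agent"]
        predicates = [t for t in tokens[i + 1:] if t.get("style") == "agent"]
        # Compare the agent closest before and closest after the copula.
        if subjects and predicates and subjects[-1]["g"] != predicates[0]["g"]:
            errors.append((subjects[-1]["word"], predicates[0]["word"]))
    return errors


# Example (3a): masculine subject, feminine predicative noun.
sentence_3a = [
    {"word": "Ihr"},
    {"word": "Ansprechpartner", "style": "agent", "g": "m"},
    {"word": "ist", "ls": "sein", "c": "verb"},
    {"word": "Frau", "style": "agent", "g": "f"},
    {"word": "Müller"},
]
# Example (3b): subject adapted to the feminine predicative noun.
sentence_3b = [
    {"word": "Ihre"},
    {"word": "Ansprechpartnerin", "style": "agent", "g": "f"},
    {"word": "ist", "ls": "sein", "c": "verb"},
    {"word": "Frau", "style": "agent", "g": "f"},
    {"word": "Müller"},
]
errors_3a = check_predicative_agreement(sentence_3a)
errors_3b = check_predicative_agreement(sentence_3b)
```

Under these assumptions, sentence (3a) yields one mismatch between "Ansprechpartner" and "Frau", while (3b) yields none.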
5 Evaluation of Gendercheck
We evaluated the Gendercheck editor based on
two texts:
ET1 A collection of unconnected negative examples taken from (Ges,
1999) and (Sch, 1996).
TT2 The deputy law of the German Bundestag.
Gender imbalances were manually annotated with SGML codes, where each
different code refers to a different rewrite proposal to be displayed
in the editor, as in the lower part of figure 1. Table 1 shows the
distribution of error classes in the two texts. Each error class had
several subtypes, which are omitted here for the sake of simplicity.
In ET1 every sentence has at least one error; on average one word out
of six is marked as "ungendered". Since ET1 is a set of negative
examples, errors are uniformly distributed. The distribution of errors
in text TT2 is different from ET1. TT2 does not contain a single
occurrence of a class 3 error. On average, only one word out of 60 is
manually marked and, due to the length of the sentences, there are
0.46 errors per sentence on average.
Text ET1 was used to develop and adjust the kurd rule system for
marking, filtering and error code assignment. We iteratively compared
the automatically annotated text with the manually annotated text and
computed precision and recall. Based on the misses and the noise, we
adapted the style module as well as the error annotation schema. In a
first annotation schema we assigned more than 30 different error codes
taken literally from (Ges, 1999) and (Uni, 2000). However, it turned
out that this granularity was too fine to be reproduced automatically,
and values for precision and recall were very low. We then assigned
only one error class and achieved very good values, with precision
over 95% and recall over 89%. Based on these results we carefully
refined a number of subtypes of the three error classes.
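The precision and recall values in this evaluation can be recomputed from the counts of hits, misses and noise. The sketch below assumes the standard definitions (precision = hits / (hits + noise), recall = hits / (hits + misses)), which are not spelled out in the text but are consistent with the reported figures; the counts used are the ET1 totals from the final results (155 hits, 19 misses, 7 noise).

```python
def precision_recall(hits, misses, noise):
    """Precision and recall from hit, miss and noise counts.

    hits   = errors found by the system and by the annotators
    misses = manually annotated errors the system did not find
    noise  = system marks without a manual annotation
    """
    precision = hits / (hits + noise)
    recall = hits / (hits + misses)
    return precision, recall


# ET1 totals as reported in the final results.
p, r = precision_recall(hits=155, misses=19, noise=7)
print(f"precision={p:.3f} recall={r:.3f}")  # precision=0.957 recall=0.891
```

Note that hits plus misses gives the number of manually annotated errors (155 + 19 = 174 for ET1).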
Final results are shown in Table 2.

Table 2: Results for texts ET1 and TT2

Text ET1
Error    hit  misses  noise  precision  recall
Class 1  85   12      1      0.988      0.876
Class 2  55   7       5      0.917      0.887
Class 3  15   0       1      0.937      1.000
Σ        155  19      7      0.957      0.891

Text TT2
Error    hit  misses  noise  precision  recall
Class 1  86   9       5      0.945      0.905
Class 2  14   7       4      0.778      0.667
Σ        100  16      9      0.917      0.862

Results for the test text TT2 are slightly inferior to those of the
development text ET1. We briefly discuss typical instances of misses
and noise.
a) Noise in class 1 (generic use of masculine) is mainly due to
"-ling" derivations such as "Abkömmling" (descendant), which are
masculine in German and for which no female equivalent forms exist.
These words could be included in the exclude lexicon (see section 4).
b) In some cases nominalized participles such as "Angestellte"
(employee) and "Hinterbliebene" (surviving dependant), which are
usually very well suited for gendered formulations due to their
ambiguity in gender, were erroneously disambiguated. These instances
produced noise because the filters did not apply.
c) Misses in class 1 can be traced back to some words which have not
been detected as human agents, such as "Schriftführer" (recording
clerk) and "Ehegatte" (spouse). These words could be entered into the
include lexicon. Both lexicons should be made user-adaptable and
user-extensible in future versions of the system.
d) Many of the misses in class 2 are due to a referent in the
preceding sentence. Since the system is currently sentence-based,
there is no easy way to handle this type of error. The possessive
pronoun "seiner" in the second sentence of example (9) refers to
"Bewerber" (applicant) in the first sentence. This connection cannot
be reproduced if the system works on a sentence basis.
(9) Einem Bewerber um einen Sitz im Bundestag ist zur Vorbereitung
seiner Wahl innerhalb der letzten zwei Monate vor dem Wahltag auf
Antrag Urlaub von bis zu zwei Monaten zu gewähren. Ein Anspruch auf
Fortzahlung seiner Bezüge besteht für die Dauer der Beurlaubung nicht.
e) An example of noise in class 2 is shown in example (10). The
relative pronoun "der" (who, which) was detected by Gendercheck but
was erroneously linked to "Beamte" instead of "Antrag" (application),
both of which are masculine in German.
(10) Der Beamte ist auf seinen Antrag, der binnen drei Monaten seit
der Beendigung der Mitgliedschaft zu stellen ist, ...
Much more powerful mechanisms are required to achieve a breakthrough
for this kind of error.
6 Conclusion
This paper describes and evaluates the "Gendercheck Editor", a tool to
check German administrative and legal texts for gender-equal
formulations. The tool is based on the Controlled Language Authoring
Technology (CLAT), a software package to control and check technical
documents for orthographic, grammatical and stylistic correctness. A
part of the Style component has been modified and adapted to the
requirements of linguistic gender mainstreaming. The paper outlines a
shallow technique to discover gender imbalance and evaluates the
technique on two texts. Values for precision and recall of more than
90% and 85%, respectively, are reported.

References

Michael Carl and Antje Schmidt-Wigger. 1998. Shallow Postmorphological
Processing with KURD. In Proceedings of NeMLaP3/CoNLL98, pages
257-265, Sydney.

Michael Carl, Maryline Hernandez, Susanne Preuß, and Chantal
Enguehard. 2004. English Terminology in CLAT. In LREC-Workshop on
Computational & Computer-assisted Terminology, Lisbon.

Gesellschaft für Informatik (Hg.), Bonn, 1999. Gleichbehandlung im
Sprachgebrauch: Reden und Schreiben für Frauen und Männer.

Institut für Angewandte Informationsforschung, Saarbrücken, 2004.
Working paper 38. To appear.

Heinz-Dieter Maas. 1996. MPRO - Ein System zur Analyse und Synthese
deutscher Wörter. In Roland Hausser, editor, Linguistische
Verifikation, Sprache und Information. Max Niemeyer Verlag, Tübingen.

Schweizerische Bundeskanzlei (Hg.), Bern, 1996. Leitfaden zur
sprachlichen Gleichbehandlung im Deutschen.

Universität Zürich (Hg.), Zürich, 2000. Leitfaden zur sprachlichen
Gleichbehandlung von Frau und Mann.
