I-IBI~Z J. WEBEa 
THE AUTOMATICALLY BUILT 
UP HOMOGRAPH DICTIONARY 
- A COMPONENT OF A DYNAMIC LEXICAL SYSTEM - 
O. Introduction. 
Ambiguous word forms (often called "homonyms " or - in writ- 
ten language - "homographs ") are known as obstacles in many fields 
of computational linguistics, especially in automatic documentation, 
content analysis or mechanical translation. In this respect two problems 
must be distinguished: 
1) the detection of homographic word fonus in the text, 
2) their disambiguation by analysis procedures. 
This paper exclusively deals with the first problem. 
1. The Detection Of Homographs. 
1.1. In current procedures for the detection of homographs two 
alternatives can be differentiated: 
i) Homographs are identified like monosemic word forms by seg- 
mentation and looking up in the standard lexicon. Homographs are 
detected, if segments of text word forms correspond with more than 
one lexicon-entry. Lexicon-entries representing homographic items 
therefore need no special marking. 
ii) Homographs are identified by means of a special homograph 
dictionary, which can be worked out in two versions: 
1) the homograph dictionary contains the graphenfic shapes of 
all homographic word forms (full forms) and their possible linguistic 
specifications. In this case no segmentation procedures are required. 
2) the dictionary does not contain full forms but only the respective 
canonical forms. A special marking gives information about other 
corresponding dictionary-entries and the extent of their overlapping. 
458 tmNz j. WF~R 
In both cases (1) and (2) the identification of homographic text 
word forms is separated from the identification of monosemic ones. 
Procedures (i) and (ii) have some characteristic advantages and 
defects, which I will consider rather briefly. 
1.2. As already pointed out the first method requires (1) a segmen- 
tational algorithm, with the help of which the word forms of a text 
can be parsed into segments (e.g. stems and inflectional affixes), (2) 
an identificational component composed of a grapheme-sequence- 
comparing algorithm and the standard lexicon; thus it can be checked, 
whether a text segment detected by (1) is the expression-side of one (or 
perhaps more) lexical unit(s). If this is the case, the content-side(s) of 
the corresponding unit(s) can be assigned to the text segment. 
1.2.1. According to this conception the identification of word 
forms would offer no problems in the following cases: 
i) the word form represents only one sequence of segments (is 
monosemic), that means that each segment corresponds to one lexicon- 
entry: 
Germ.: ...OkindesO... -/kind/+/es/ Kind 
Engl. : ...OchildsO... -/child/+/s/ child 
Frnc. : ...OenfantO .,. -/enfant/+/121 / enfant 
Russ.: ...Oreb~nokO... -/reb~nok/+\[O/ reb~nok 
ii) The word form can be parsed into more than one set of seg- 
ments (is homographic), and the possible readings show coinciding 
segment-boundaries: 
Germ.: ...OlautO... -Ilautl+lol L`'.t (SUB) 
Ilautl+lOI laut (ADJ) 
 .ngl.: ...OmeanO... -Imeanl+l l mean (ADJ) 
Im anl+Ol to me,,. 
... ,,, 
Frnc. : ...OmortO... -/mort/+/£1/ mort (SUB) 
Imort/+/o/ mort (ADJ) 
,,, .°, 
tkuss.: ...O~ilaO... @il/+/a/ ~ilal 
/~il/+\[a/ ~ilaz 
THE AUTOMATICALLY BUILT UP HOMOGRAPH DICTIONARY 459 
If the detected segments are compared with all graphematically 
corresponding lexicon-entries, that is, if the lexicon-look-up is not 
stopped after the first correspondence, this case can also easily be coped 
with. If the entries are arranged in alphabetical order, the respective 
units will immediately succeed one another. 
1.2.2. The detection becomes difficult however, if one word form 
is parsed into more than one set of segments (is homographic), while 
the segment-boundaries in the possible readings are overlapping: 
Germ. : ...OgetriebenO... -Itriebl+/ge-en/ treiben 
Igetriebel+ln/ Getriebe 
Engl. : ...ohearingfS... -\]hear/+/ing/ to hear 
-/hearing/+/O/ hearing 
Frnc.: ...OpacherO... -Ipachl+/er p cher (VRB) 
/p~cher/-}-\]O/ p~cher (SUB) 
Russ.: ...f3valaxf3... -/val/q-/ax/ val 
/valax/+/O/ valax 
As the segments which have thus been detected do not coincide 
graphematically (e.g./trieb/-/getriebe{), i.e., as the respective lexicon- 
entries are to be fotmd at different places in the lexicon, for the identif- 
ication of such homographs enormous parsing - and comparing - 
procedures are required. As cases of homography with overlapping 
segment-boundaries in the various readings are encountered quite 
frequently in languages with extensive inflection (e.g. German, French, 
Russian), method 1.1. (i) is not the best in any case. 
1.2.3. The advantage of this method is above all to be seen in the 
fact that the identification of homographs can be managed automati- 
cally, and that no special marking of the respective entries is necessary. 
This is especially important with regard to dynamic lexical systems, 
where the number of lexicon-entries and their specification can vary; 
new entries do not require a change of the detection procedure. The 
disadvantage consists in the fact that monosemic and ambiguous word 
forms are submitted to the same procedure, which amounts to an undue 
delay of the determination ofmonosemic word forms. Multiple parsing 
with subsequent lexicon-look-up has always to be applied if the re- 
spective text word forms contain grapheme sequences, whidl correspond 
to inflectional affixes. Only after this can it be found out whether more 
than one plausible reading has resulted: 
460 HEINZ J. WEBER 
Germ.: ...OgetriebenO... 
or only one reading is true: 
Germ. : ...~3gelagenO... 
-Itriebl+lge-enl tr iben /getriebe/+/n/ G trieb  
-lgelagel + lnl Celage 
but not: 
/lag/+/ge-en/ liegen 
1.3. Method 1.1. (ii) does not have this disadvantage. Homographs 
are separately registered and marked according to their readings; thus 
a considerable acceleration of the identificational procedure is made 
possible. Monosemic word forms, the stems of which are listed up in 
the standard lexicon, are identified more easily, as the segmentation and 
lexicon-look-up can already be stopped after assignment of one reading. 
Ambiguous word forms are specified more easily, as the extensive 
segmentation- and comparing-procedures do not have to be applied 
(as the various readings are registered in the homograph dictionary 
- version 1.1. (ii) (1) -) or are reduced to a minimum (as the respective 
lexicon-entries bear a special marking, by which their homography 
can be derived - version (ii) (2) -). These advantages however entail 
certain disadvantages: as a rule homograph dictionaries are built up 
manually and have to be manually complemented, when the standard 
lexicon is extended; the same has to be stated for the marking of lex- 
icon-entries. Aside from this troublesome and time-consuming business 
one cannot be sure that all homographies are registered or are marked 
exhaustively. 
2. The Automatically Built Up Homograph Dictionary. 
2.1. In this paper a method will be outlined, in which the advan- 
tages of the first procedure are combined with those of the second one: 
the standard lexicon therefore can be extended automatically' without 
delaying the identificational procedure. The homograph dictionary is 
compiled by analysis of the standard lexicon; all stems representing 
homographic items are taken away from it and integrated into the 
homograph dictionary. The same algorithm, which detects homogra- 
phies incorporated in the standard lexicon, can be used to find out by 
analysis of both lexica, whether new entries and all inflected forms 
represented by them are homographs. If this is the case, they are re- 
gistered in the homograph dictionary, otherwise in the standard 
lexicon. Thus the number of entries in both lexica can be increased 
THE AUTOMATICALLY BUILT UP HOMOGRAPH DICTIONARY 461 
automatically and the specifications of ambiguities in the homograph 
dictionary always correspond to the current state of information. 
New Entries 
~ standard Lex\[icon ... 
Algorithm.~ 
~Homograph Dictionary ... 
2.2. The standard lexicon can be characterized as follows: the 
lexicon-entries are "stems "; each stem representing a set of inflected 
forms, which is called a "paradigm " or "part of a paradigm ". In 
order to abbreviate the graphemic assimilation of text word forms to 
the graphemic shapes of lexicon-entries during the identificational pro- 
cedure, morphologically and syntactically determinated allomorphs 
have also been noted. Stems of complex lexical units (e.g./kickOtheO 
bucket/, /zumOzugOkomm/) are, however, ignored: 
stems of (Get.) graben (VRB): /grab/, /grub/ 
beipflichten : \[beipflicht/, /pflicht/ ... 
stems of (Eng.) to sing: /sing/, /sang/, /sung/ 
stems of (Frc.) mourir: /mour/, /meur/, /mort/ .... 
stems of (Rus.) reMnok: /reb~nok/,/reb~nk/, ... 
2.3. The homograph dictionary is built up by comparing selected 
entries of the standard lexicon. In order to elucidate the comparing 
procedure we restrict ourselves to the coordination of just two lexicon- 
entries. Two stems represent homographic inflected forms, if the fol- 
lowing conditions are fulfilled: 
i) The graphemic shapes of the stems belonging to the paradigms 
P1 and P2 are identical: 
In this case homography exists, if any inflectional affixes co-occurring 
with the respective stems are homographic too: 
462 HmNZ J. WEBER 
ii) The graphemic shape belonging to the stem of paradigm P~, 
concatenated with a sequence G~, is homographic with the stem of 
paradigm P~: 
In this case the graphemic shapes of the co-occurring inflectional af- 
fixes have to correspond in the following way: 
G~(Vl) \ G,, ~ G}(P~). 
Concerning Gk some restrictions have to be observed: 
iii) G~ has to be homographic or partially homographic with any 
inflectional affix of the respective language: 
G~ ~ G~, for G~ as an element of the finite set ~,, which contains all 
inflectional affixes. 
iv) G~ for its part must be co-occurring with the respective stem 
of the paradigm: 
=- G (Vx). 
The co-occurence of stem and affix is specified by the respective in- 
flection-class-marking of the lexicon-entry. 
2.4. Conditions (i) and (ii) determine the selection of stems. (iii) 
prevents that all partially homographic stems in the lexicon are selected 
and examined. Thus - for example - a coordination of the German 
stems/arm/Arm and/armee/Armee is not permitted (though they are 
fulfilling condition (ii): Gs(PI)×/...ee/~= G~(P~.)), as the sequence 
G, =/...eel does not fulfill the condition G, ~ G}. 
iv) further reduces the number of the selected stems to those 
cases, where G} (~_ G,) is co-occurring with Gs(Pl) (e.g. German 
/schwer/schwer and/schwert\[ Schwert are not combined, though both 
stems fulfill conditions (ii) and G~ (=/...t/) fulfills condition (iii); but 
G~ does not fulfill (iv):/...tl is not homographic with an affix co-occur- 
ring with/schwer/ (ADJ)). 
The relationship between the graphemic shapes of the affixes co-oc- 
curring with the stems of both paradigms is finally specified by the 
complementary part of condition (ii): G}(P1) \ Gk ~ G~(P~). E.g. the 
German stems/hoer/ h6ren and/gehoer/ Geh6r (SUB) could be com- 
bined according to condition (ii):/hoer/×/ge.../~/gehoer/. Moreover 
conditions (iii) and (iv) concerning Gk =/ge.../ would be fulfilled: /ge.../c G~ and G~(=/ge-t/)---G}(Vl). 
But there is no inflectional 
THE AUTOMATICALLY BUILT UP HOMOGRAPH DICTIONARY 463 
afire /t/ co-occurring with the stem /gehoer/of paradigm P~, which 
corresponds to /ge-t/in the mentioned way: /ge-t/\ \[ge.../~-/t/. 
2.5. These conditions determining the selection of lexicon- 
entries and their examination for homographic items can be transferred 
into programmed instructions: 
start 
take'stem of 
paradigm P~ 
rtakc stem of~ stem of P,+~. ,.,a-_ 
para \[igm /3i+l 
stem of~. 
t 
condition (i): 7 
= q(v,+~) 
condition (ii): 
q(P,) x c,o ~ c,(p,+d 
,,4, condition (iii) : 
condition (iv): 
q;. ~ c;(P,) 
compare intl. aft. 
-of both paradigms 
according to (ii): c}(p,) ', c~ ~= c}(x,,.,) 
generate homographic 
intl. forms of both 
para.digms Pi, Pi+l; 
mark stems accord- 
ing to the overlap- 
ping of both paradigms 
"'"~Pi 
compare intl. 
aft. of both 
paradigms ac- 
"~cording to (i) 
~-. i G;.(P,) "" q.(P,.l) 
464 HEINZ j. WEBER 
2.6. Prerequisites for the outlined algorithm are: 
1) a computerized stem-lexicon; the entries, which have to be 
arranged in alphabetical order, must bear an inflection-class-marking, 
which makes it possible to generate all inflected forms represented by 
the respective stems: 
,o# 
/ficht\] :fechten, VRB, intl.-class 48 
/fichte/ : Fichte, SUB, intl.-class 5 
... 
2) A complete list of inflectional affixes, which bear the possible 
inflection-class-markings corresponding to those of the stems: 
/0/ : SUB, intl.-classes ..., 5 .... 
VRB, intl.-classes ..., 48, ... 
.., 
/st/ : VRB, intl.-classes ..., 48, ... 
2.7. The selection of stems and the comparison of the co-occurring 
inflectional afftxes could be carried out in a slightly modified way. As 
already pointed out, the selection of stems is in the main determined 
by the grapheme sequence G, (which specifies the graphematic over- 
lapping of non-homographic stems). Further restrictions concern the 
correspondence between Gk and the inflectional affixes co-occurring 
with the selected stems (see 2.3. (iii) and (iv). As the inflection-class- 
markings of stems and affixes (which are similar) are shortened distribu- 
tional classifications, it is obvious to bring them into a system, accord- 
ing to the respective specifications of Gk. A matrix is built up by 
which it can be seen whether a G~ -specification restricts the coordi- 
nation of stems with certain inflection-classmarkings. In this way the 
detailed examination and comparison of all co-occurring affixes (in 
accordance to condition 2.3. (iv)) can be substituted by one single 
operation, at least in a good number of cases. 
THE AUTOMATICALLY BUILT UP HOMOGRAPH DICTIONARY 
coordinated Gk-specifications of stem1 : 
intl.-classes of stem,/stem, 
G, = ~...el ... 
465 
..° 
VRB 1 /SUB 5 
VRB 481SU13 5 
VRB 55/SUB 5 
°°o 
+ 
+ 
°°. 
SUB 5: /stelle/ Stelle; /fichte/Fichte; 
/wiese/ Wiese; ... 
VRB 1: /stell/ stellen; ... 
VRB 48:/ficht/fechten; ... 
VRB 55:/wies/ weisen; ... 
This matrix forbids the coordination of two stems belonging to 
the classes VRB 48 and SUB 5 (in this succession), though they may 
correspond in accordance to condition (ii): 
/fieht/ x ~...el ~= If, ehtel. 
On the other hand the coordination of two stems belonging to the 
classes VRB 1 or VRB 55 and SUB 5 is admissibile: 
/stell/ X ~...el ~= lstelle/or /wies/x 
~...el ~-/wiese/. 
The building-up of such matrices seems to be a useful device, as the 
number of G,-specifications in the respective languages (German, 
French, Russian) is limited. In German we have found out ten frequent 
and about thirty extremely rare G~-specifications. In English homo- 
graphies with graphemetically overlapping stems are without that 
rather seldom. 
2.8. The conceived algorithm selects just two entries (respectively 
their paradigms), which are examined for homographic word forms. 
After the first cycle of selecting and comparing -as pointed out in 2.5. - 
homographs, which are members of more than two paradigms (e.g. 
466 UEINZ j. w~R 
German OalbenO: Album, Alba, Albe, Albx, Alb~), therefore, are not 
specified exhaustively: 
Album n Alba ={alben} 
Album A Albe = {alben} 
Album fl Alb~ ={alben} 
Album A Alb~ = {alben} 
Alba fl Albe ={alben} 
Alba I"1 Albx = {alben} 
Alba n Alb2 ={alben} 
Albe fl Alb, = {albe, alben} 
Albe fl Alb~ ={alben} 
Albl fl Alb~ = {alb, alben} 
In this example OalbenO is described as a member of altogether 
ten intersection sets (which are the results of the first coordination- 
cycle). In a second cycle (which will not be dealt with in detail) all 
intersection sets of the first cycle are examined for identical word forms. 
Oalbenl3 now can be described as an intersection set of five paradigms: 
Album I"1 Alba fl Albe n Albl fl Alb~ ={alben} 
Albe fl Albl = {albe} 
Albl n Alb2 = {alb) 
The coordination of intersection sets in the second cycle is - as well 
as the coordination of paradigms in the first one - determined by 
conditions, which are derived from the graphemic shapes of the respec- 
tive stems. In all probability stems like /album/, /alba/, /albe/, /alb/ 
will represent at least one homographic word form, while stems like 
/album/,/alge/,/alibi/,/altar/will not. 
2.9. As the outlined method of building up a homograph dic- 
tionary is, in the main, using facts of the expression-side of lexical units, 
it can be applied to various lexicon-types. The content-sides of the 
respective entries (i.e. stems) can bear either semantically, syntacti- 
cally or otherwise relevant information. 

REFERENCES 

H. EGGERS, et al., Elektronische Syntaxa- 
nalyse der deutschen Gegenwartssprache, 
Tiibingen 1969. 

D. K.RALLMANN, Linguistische Datenbank 
und kumulatives Wiirterbuch, in Kollo- 
quium Maschinelle Sprachverarbeitung, 
Mannheim 1968. 

W. L~NDraS, Static And Dymanic Lexical 
Systems, Bonn 1969. 

H. J. Wr~, Mehrdeutige Wortformen - 
grammatische Beschreibung und texiko- 
graphische Erfassung (thesis), Saar- 
brticken 1973. 
