A Best-Match Algorithm for Broad-Coverage 
Example-Based Disambiguation 
Naohiko URAMOTO
IBM Research, Tokyo Research Laboratory
1623-14 Shimotsuruma, Yamato-shi, Kanagawa-ken 242 Japan
uramoto@trl.vnet.ibm.com
Abstract 
To improve the coverage of example-bases, two methods are
introduced into the best-match algorithm. The first is for
acquiring conjunctive relationships from corpora, as measures
of word similarity that can be used in addition to thesauruses.
The second, used when a word does not appear in an example-base
or a thesaurus, is for inferring links to words in the
example-base by comparing the usage of the word in the text and
that of words in the example-base.
1 Introduction 
Improvement of coverage in practical domains is one of the most
important issues in the area of example-based systems. The
example-based approach [6] has become a common technique for
natural language processing applications such as machine
translation and disambiguation (e.g. [5, 10]). However, few
existing systems can cover a practical domain or handle a broad
range of phenomena.
The most serious obstacle to robust example-based systems is
the coverage of example-bases. It is an open question how many
examples are required for disambiguating sentences in a
specific domain.
The Sentence Analyzer (SENA) was developed in order to resolve
attachment, word-sense, and conjunctive ambiguities by using
constraints and example-based preferences [11]. It lists about
57,000 disambiguated head-modifier relationships and about
300,000 synonyms and is-a binary relationships. Even so, lack
of examples (no relevant examples) accounted for 46.1% of
failures in an experiment with SENA [12].
Previously, it was believed to be easier to collect examples
than to develop rules for resolving ambiguities. However, the
coverage of each example is much more local than that of a
rule, and therefore a huge number of examples is required in
order to resolve realistic problems. There has been some
corpus-based research on how to acquire large-scale knowledge
automatically in order to cover the domain to be disambiguated,
but there are still major problems to be overcome.
First, semantic knowledge such as word-sense cannot be
extracted by automatic corpus-based knowledge acquisition. The
example-base in SENA is developed by using a bootstrapping
method. However, the results of word-sense disambiguation must
be checked by a human, and word-senses are tagged to only about
a half of all the examples, since the task is very
time-consuming.
A second difficulty in the example-based approach is the
algorithm itself, namely, the best-match algorithm, which was
used in earlier systems built around a thesaurus that consisted
of a hierarchy of is-a or synonym relationships between words
(word-senses).
This paper proposes two methods for improving the coverage of
example-bases. The selected domain is that of sentences in
computer manuals. First, knowledge that represents a type of
similarity other than synonym or is-a relationships is
acquired. As one measurement of the similarity,
interchangeability between words can be used. In this paper,
two types of relationship reflect such interchangeability.
First, the elements of coordinated structures are good clues to
the interchangeability of words. Words can be extracted easily
from a domain-specific corpus, and therefore the example-base
can be adapted to the specific domain by using the
domain-specific relationships.
If there are no examples and relations in the thesaurus, the
example-base gives no information for disambiguation. However,
the text to be disambiguated provides useful knowledge for this
purpose [7, 3]. The relationships between words in the
example-base and an unknown word can be guessed by comparing
that word's usage in extracted examples and in the text.
2 A Best-Match Algorithm 
In this section, conventional algorithms for example-based
disambiguation, and their associated problems, are briefly
introduced. The algorithms of most example-based systems
consist of the following three steps (in some systems, the
exact-match and the best-match steps are merged):
1. Searching for examples
2. Exact matching
3. Best matching with a thesaurus

("store+V" *store1 "in" "disk" *disk 1)
("store+V" *store1 "in" "storage-device" *device 2)
("store+V" *store1 "in" "cell" *cell 1)
("store+V" *store1 "in" "computer" *computer1 4)
("store+V" *store1 "in" "storage" *storage2 3)
("store+V" *store1 "in" "format" *format1 1)
("store+V" *store1 "in" "data-network" *network3 1)
Fig. 1: Examples for R1

("program+N" *prog1 "in" "profile+N" *profile 5)
("program+N" *prog1 "in" "data-storage+N" *storage3 1)
("program+N" *prog1 "in" "publication+N" *publication1 2)
("program+N" *prog1 "in" "form+N" *form1 2)
("program+N" *prog2 "in" "group+N" *group1 1)
Fig. 2: Examples for R2

Suppose the prepositional phrase attachment ambiguity in S1 is
resolved by using these steps.
(S1) A managed AS/400 system can store a 
new program in the repository. 
There are two candidates for the attachment of the
prepositional phrase "in the repository." They are represented
by the following head-modifier relationships:
(R1) ("store+V" (PP "in") "repository+N")
(R2) ("program+N" (PP "in") "repository+N") 
In R1 the noun "repository" modifies the verb "store" with
"in," while in R2, it modifies the noun "program."
First, SENA searches for examples whose heads match the
candidate. Figures 1 and 2 show the relevant examples for R1
and R2. They represent the head-modifier relationships,
including word-senses, a relation label between the word-senses
(e.g. "in"), and a frequency.
If a relationship identical to either of the can- 
didates R1 and R2 is found, a high similarity is 
attached to the candidate and the example (exact 
matching). 
Word-sense ambiguities are resolved by using the same framework
[12]. In this case, each candidate represents one word sense.
For example, the word-sense *store1 is preferred among the
examples shown in Fig. 1.
If no examples are obtained by the exact-matching process, the
system executes the best-matching process, which is the most
important mechanism in the example-based approach. For the
comparison, synonym or is-a relationships described in a
thesaurus are used. For example, if a synonym relation is found
between "repository" and "disk" in the first example for R1, a
similarity whose value is smaller than that for exact matching
is given to the example. The most preferable candidate is
selected by comparing all examples in Fig. 1 and computing the
total similarity value for each candidate. If multiple
candidates have the same similarity value, the frequency of the
example and some heuristics (for example, innermost attachment
is preferred) are used to weight the similarities.
Experience with SENA reveals two problems that 
prevent an improvement in the performance of the 
best-matching algorithm. First, the approach is 
strongly dependent on the thesaurus. Many sys- 
tems calculate the similarity or preference mainly 
or entirely by using the hierarchy of the thesaurus. 
However, these relationships indicate only a cer- 
tain kind of similarity between words. To improve 
the coverage of the example-base, other additional 
types of knowledge are required, as will be discussed 
in the following sections. 
Another problem is the existence of unknown words; that is,
words that are described in the system dictionary but do not
appear in the example-base or the thesaurus. In SENA, the New
Collins Thesaurus [1] is used to disambiguate sentences in
computer manuals. Many unknown words appear, especially nouns,
since the thesaurus is for the general domain. Therefore, a
mechanism for handling the unknown words is required. This is
covered in Chapter 4.
3 Knowledge Acquisition for 
Robust Best-Matching 
As described in the previous section, the best-matching
algorithm is a basic element of example-based disambiguation,
but is strongly dependent on the thesaurus. Nirenburg [8]
discusses the type of knowledge needed for the matching; in his
method, morphological information and antonyms are used in
addition to synonym and is-a relationships. This section
discusses the acquisition of knowledge from other aspects for a
broad-coverage best-match algorithm.
3.1 Acquisition of Conjunctive Rela- 
tionships from Corpora 
The New Collins Thesaurus, which is used in SENA 
as a source of synonym or is-a relationships, gives 
the following synonyms of "store": 
store: 
accumulate, deposit, garner, hoard, keep, etc. 
In our example-base, there are few examples for any of the
words except "keep," since the example-base was developed
mainly to resolve sentences in technical documents such as
computer manuals.
When the domain is changed, the vocabulary and the usage of
words also change. Even a general-domain thesaurus sometimes
does not suit a specific domain. Moreover, development of a
domain-specific thesaurus is a time-consuming task.
The use of synonym or is-a relationships suggests the
hypothesis that, from the viewpoint of the example-based
approach, a word in a sentence can be replaced by its synonyms
or taxonyms. That is, it supports the existence of the
(virtual) example S1' when "store" and "keep" have a synonym
relationship.
(S1') A managed AS/400 system can keep a new
program in the repository.
Interchangeability is an important condition for calculating
similarity or preferences between words. Our claim is that if
words are interchangeable in sentences, they should have strong
similarity.
In this paper, conjunctive relationships, which are common in
technical documents, are proposed as relationships that satisfy
the condition of interchangeability. Sentences in which the
word "store" is used as an element of a coordinated structure
can be extracted from computer manuals, as the following
examples show:
(1) The service retrieves, formats, and stores a message
for the user.
(2) Delete the identifier being stored or modified from
the table.
(3) This EXEC verifies and stores the language defaults
in your file.
(4) You use the function to add, store, retrieve, and
update information about documents.
From the sentences, the following words that are
interchangeable with "store" are acquired:
store: retrieve, format, modify, verify, add, update
Often the words share case-patterns, which is a useful
characteristic for determining interchangeability. Another
reason we use conjunctive relationships is that they can be
extracted semi-automatically from untagged or tagged corpora by
using a simple pattern-matching method. We extracted about 700
conjunctive relationships from untagged computer manuals by
pattern matching. The relationships include various types of
knowledge, such as (a) antonyms (e.g. "private" and "public"),
(b) sequences of actions (e.g. "load" and "edit"), (c) (weak)
synonyms (e.g. "program" and "service"), and (d) part-of
relationships (e.g. "tape" and "device"). Another merit of
conjunctive relationships is that they reflect domain-specific
relations.
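
The paper does not give the patterns that were used, so the following Python sketch is only one possible way to pull coordinated words out of untagged text. The function name conjunctive_pairs and the token rules are assumptions made for illustration. It looks for "and"/"or", takes the words on either side, and walks left across a comma-separated list of single words.

```python
import re
from itertools import combinations

def conjunctive_pairs(sentence):
    """Return unordered pairs of words coordinated by 'and'/'or'."""
    tokens = re.findall(r"[A-Za-z]+|,", sentence.lower())
    pairs = set()
    for i, tok in enumerate(tokens):
        if tok not in ("and", "or") or i == 0 or i + 1 >= len(tokens):
            continue
        group = [tokens[i + 1]]           # word right after the conjunction
        j = i - 1
        if tokens[j] == ",":              # skip a serial comma: "a, b, and c"
            j -= 1
        while j >= 0 and tokens[j] != ",":
            group.append(tokens[j])
            if j >= 2 and tokens[j - 1] == ",":
                j -= 2                    # keep walking left over ", word"
            else:
                break
        pairs.update(frozenset(p) for p in combinations(group, 2)
                     if p[0] != p[1])
    return pairs

s4 = ("You use the function to add, store, retrieve, and update "
      "information about documents.")
for pair in sorted(sorted(p) for p in conjunctive_pairs(s4)):
    print(pair)
```

A pattern this naive over-generates (a comma that ends a clause is treated like a list comma) and does no lemmatization, which fits the description of the extraction as semi-automatic: the output would still need filtering.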
3.2 Acquisition from Text to Be Dis- 
ambiguated 
If there are no examples of a word to be disambiguated, and the
word does not appear in the thesaurus, no relationships are
acquired.
The existence of words that are unknown to the example-base and
the thesaurus is inevitable when one is dealing with the
disambiguation of sentences in practical domains. Computer
manuals, for example, contain many special nouns such as names
of commands and products, but there are no thesauruses for such
highly domain-specific words.
One way of resolving the problem is to use the text to be
processed as the most domain-specific example-base. This idea
is supported by the fact that most word-to-word dependencies
including the unknown words appear many times in the same text.
Nasukawa [7] developed the Discourse Analyzer (DIANA), which
resolves ambiguities in a text by dynamically referring to
contextual information. Kinoshita et al. [3] also proposed a
method for machine translation by parsing a complete text in
advance and using it as an example-base. However, neither
system works for unknown words, since both use only
dependencies that appear explicitly in the text.
4 An Algorithm to Search for
Unknown Words
We first give an enhanced best-match algorithm for
disambiguation. The steps given in Chapter 2 are modified as
follows:
1. Searching for examples
2. Exact matching
3. Best matching with a thesaurus and conjunctive
relationships
4. Unknown-word-matching using a context-base
The outline of the algorithm is as follows: Sentences in the
text to be processed are parsed in advance, and the parse trees
are stored as a context-base. The context-base can include
ambiguous word-to-word dependencies, since no disambiguation
process is executed. Using an example-base and the
context-base, the sentences in the text are disambiguated
sequentially. If an ambiguous word does not appear in an
example-base or in the thesaurus, an unknown-word search is
executed (otherwise, the conventional best-match process is
executed). The unknown-word-matching process includes the
following steps:
1. The dependencies that include the unknown word are
extracted from the context-base.
2. A candidate set of words that are interchangeable with the
unknown word is searched for in the example-base by using the
context dependency.
3. The candidate set acquired in step 2 is compared with the
examples extracted for each candidate of interpretation. A
preference value is calculated by using the sets, and the most
preferred interpretation is selected.
Let us see how the algorithm resolves the attach- 
ment ambiguity in sentence S1 from Chapter 2, 
which is taken from a text (manual) for the AS/400 
system. 
(S1) A managed AS/400 system can store a
new program in the repository. 
The text that contains S1 is parsed in advance, and stored in
the context-base. The results of the example search are shown
in Fig. 1. There are two candidate relationships for the
attachment of the prepositional phrase "in the repository".
(R1) ("store+V" (PP "in") "repository+N") 
(R2) ("program+N" (PP "in") "repository+N")
The noun "repository" does not appear in the example-base or
thesaurus, and therefore no information for the attachment is
acquired.
Consequently, the word-to-word dependencies 
that contain "repository" are searched for in the 
context-base. The following sentences appear be- 
fore or after S1 in the text: 
(CB1) The repository can hold objects that are
ready to be sent or that have been received 
from another user library. 
(CB2) A distribution catalog entry exists for
each object in the distribution repository. 
(CB3) A data object can be loaded into the 
distribution repository from an AS/400 library. 
(CB4) The object type of the object specified 
must match the information in the distribution 
repository. 
From the sentences, the head-modifier relationships that
contain the unknown word "repository" are listed. These
relationships are called the context dependency for the word.
The context dependency of "repository" is as follows:
(D1) ("hold+V" (subj) "repository+N"): 1
(D2) ("exist+V" (PP "in") "repository+N"): 0.5
(D3) ("object+N" (PP "in") "repository+N"): 0.5
(D4) ("load+V" (PP "into") "repository+N"): 1
(D5) ("information+N" (PP "in") "repository+N"): 0.5
(D6) ("match+V" (PP "in") "repository+N"): 0.5
The last number in each relation is the certainty factor (CF)
of the relationship. The value is 1/(the number of candidates
for resolving the ambiguity). For example, the attachment of
"repository" in CB2 has two candidates, D2 and D3. Therefore,
the certainty factors for D2 and D3 are 1/2.
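
As a small worked illustration of the certainty factor, the Python sketch below groups the dependencies D1 to D6 by the attachment ambiguity they come from (one group per context sentence CB1 to CB4) and assigns each the value 1/(number of candidates). The function name and the grouping representation are assumptions made for this example.

```python
def certainty_factors(ambiguity_groups):
    """Assign CF = 1 / (number of competing candidates) to each dependency."""
    cf = {}
    for group in ambiguity_groups:
        for dep in group:
            cf[dep] = 1.0 / len(group)
    return cf

# One group per attachment of "repository" in CB1..CB4.
groups = [["D1"], ["D2", "D3"], ["D4"], ["D5", "D6"]]
cf = certainty_factors(groups)
print(cf)
```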
For each dependency, candidate words (CB) in the context-base
are searched for in the example-base. The words in the set can
be considered as substitutable synonyms of the unknown word.
For example, in the case of D1, the WORDs that satisfy the
relationship ("hold+V" (subj) WORD+N) are searched for. The
following are candidate words in the context-base for the word
"repository."
CB1 = {I, user, cradle, rock} (for D1)
CB2 = {storage, transient data} (for D2)
CB3 = {condition, format, path, 1916, technique,
control area} (for D3)
CB4 = {systema8, facility} (for D4)
CB5 = {record} (for D5)
CB6 = {} (for D6)
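
The lookup that produces such a candidate set can be sketched as a filter over the example-base. The toy example-base below is invented for illustration (its entries are chosen so that the lookup for D1 reproduces CB1), and part-of-speech suffixes on modifiers are dropped for brevity.

```python
# Toy example-base of (head, relation, modifier) triples; illustrative only.
EXAMPLE_BASE = [
    ("hold+V", "subj", "I"),
    ("hold+V", "subj", "user"),
    ("hold+V", "subj", "cradle"),
    ("hold+V", "subj", "rock"),
    ("hold+V", "obj", "object"),
    ("exist+V", "PP in", "storage"),
]

def candidate_words(dependency, example_base):
    """Words that fill the unknown word's slot in the example-base."""
    head, relation, _unknown = dependency
    return {mod for h, r, mod in example_base if (h, r) == (head, relation)}

d1 = ("hold+V", "subj", "repository")
print(sorted(candidate_words(d1, EXAMPLE_BASE)))
```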
The total set of candidate words (CB) of "repository" is the
union of CB1 through CB6. The set is compared with the
extracted examples for each attachment candidate (Figs. 1 and
2). The words in the examples are candidate words in the
example-base (C1 for R1 and C2 for R2). By intersecting the
candidate words in the context-base and the example-base, words
that are interchangeable with the unknown word can be
extracted. The intersections of each set are as follows:
For R1, CB ∩ C1 = {storage, format}
For R2, CB ∩ C2 = {}
This result means that "storage" and "format" have the same
usage as "repository" (or are interchangeable with it) in the
text. The preference value P(R) for the candidate R with the
interchangeable words w is calculated by the formula:
$$P(R) = \sum_{w} (\mathrm{CF}) \times (\mathrm{frequency})$$
In this case, P(R1) = 0.5 × 1 + 0.5 × 1 = 1.0, and P(R2) = 0
(supposing that the frequency of the words is 1). As a result,
R1 is preferred to R2.
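
The calculation can be reproduced with a short Python sketch. The data structures are assumptions made for illustration: each context dependency carries its CF and its candidate words (a subset of the sets listed above is enough here), C1 and C2 hold the modifier words of the examples in Figs. 1 and 2, and every frequency defaults to 1 as in the text.

```python
# (CF, candidate words) per context dependency; a subset of the sets above.
CONTEXT_CANDIDATES = {
    "D1": (1.0, {"I", "user", "cradle", "rock"}),
    "D2": (0.5, {"storage", "transient data"}),
    "D3": (0.5, {"condition", "format", "path", "technique", "control area"}),
    "D5": (0.5, {"record"}),
    "D6": (0.5, set()),
}
# Modifier words of the examples in Fig. 1 (C1) and Fig. 2 (C2).
C1 = {"disk", "storage-device", "cell", "computer",
      "storage", "format", "data-network"}
C2 = {"profile", "data-storage", "publication", "form", "group"}

def preference(context_candidates, example_words, frequency=None):
    """Sum CF x frequency over the interchangeable words of one candidate."""
    frequency = frequency or {}
    total = 0.0
    for cf, words in context_candidates.values():
        for w in words & example_words:   # interchangeable words
            total += cf * frequency.get(w, 1)
    return total

p_r1 = preference(CONTEXT_CANDIDATES, C1)
p_r2 = preference(CONTEXT_CANDIDATES, C2)
print(p_r1, p_r2)  # 1.0 0.0
```

Passing a frequency dictionary (for example the frequencies listed in Fig. 1) would weight each interchangeable word accordingly; how a word that occurs in several candidate sets should be counted is not specified in the text, and this sketch simply sums each occurrence.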
If both sets of candidates are empty, the numbers of extracted
examples are compared (this is called Heuristic-1). If there
are no related words in this case, R1 is preferred to R2 (see
Fig. 1). This heuristic indicates that "in" is preferred after
"store," irrespective of the head word of the prepositional
phrase.
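
Combining the preference value with the fallback gives a simple selection procedure, sketched below in Python. The function name and the dictionary inputs are assumptions for illustration, and ties are not treated specially (the first maximal candidate is returned). The example counts 7 and 5 are the numbers of examples listed in Figs. 1 and 2.

```python
def select_attachment(preferences, example_counts):
    """Pick a candidate by preference value, else by Heuristic-1."""
    if max(preferences.values()) > 0:
        return max(preferences, key=preferences.get)
    # Heuristic-1: no interchangeable words were found for any candidate,
    # so fall back to the number of extracted examples.
    return max(example_counts, key=example_counts.get)

counts = {"R1": 7, "R2": 5}
print(select_attachment({"R1": 1.0, "R2": 0.0}, counts))
print(select_attachment({"R1": 0.0, "R2": 0.0}, counts))
```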
5 Experimental Results 
5.1 Example-Base and Thesaurus 
An example-base for disambiguation of sentences in computer
manuals is now being developed. Table 1 shows its current size.
The sentences are extracted from examples in the Longman
Dictionary of Contemporary English [9] and definitions in the
IBM Dictionary of Computing [2]. Synonym and is-a relationships
are extracted from the New Collins Thesaurus [1] and Webster's
Seventh New Collegiate Dictionary [4].
Our example-base is a set of head-modifier binary dependencies
with relations between words, such as (subject), (object), and
(PP "in"). It was developed by a bootstrapping method with
human correction. In SENA, the example-base is used to resolve
three types of ambiguity: attachment, word-sense, and
coordination. The level of knowledge depends on the type of
ambiguity.
Table 1: Size of the Example-Base and Thesaurus
Example-Base
  Examples        57,170 binary relationships (in 9,500 sentences)
  Distinct words  8,602
Thesaurus
  Synonyms        283,211 binary relationships (11,006 entries)
  Is-a relations  6,353 binary relationships
Success with unknown-word matching   52.4%
Success with Heuristic-1             20.0%
Failure                              27.6%
Fig. 3: Result of disambiguation
To resolve semantic ambiguities, the examples should be
disambiguated semantically. On the other hand, structural
dependencies can be extracted from raw or tagged corpora by
using simple rules or patterns. In our approach, multilevel
descriptions of examples are allowed: one example may provide
both structural and word-sense information, while another may
provide only structural dependencies. Word-senses are added to
a half of the sentences in the example-base.
5.2 Experiment 
We did a small experiment on disambiguation of prepositional
phrase attachment. First, we prepared 105 ambiguous test data
selected randomly from 3,000 sentences in a computer manual.
The format of the data was as follows:
verb noun prep unknown-noun 
None of these data can be disambiguated by using the
conventional best-matching algorithm, since noun2 (the unknown
noun) does not appear in the example-base or thesaurus.
Conjunctive relationships, described in Chapter 3, are used
with the example-base and the thesaurus.
The results of the disambiguation are shown in Fig. 3. We were
able to disambiguate 52.4% of the test data by using
unknown-word-matching. By using Heuristic-1 in addition, we
obtained a 72.4% success rate for unknown words.
One cause of failure is imbalance among examples. The number of
examples for frequent verbs is larger than the number of
examples for frequent nouns. As a result, verb attachment tends
to be preferred. (We did not use other heuristics such as
preference for inner attachment.) Another cause of failure is
the number of context dependencies. In the experiment, at most
the nearest eight sentences were used; the optimum number is
still an open question.
6 Conclusion 
Methods for improving the coverage of example-bases were
proposed in order to allow the realization of broad-coverage
example-based systems. We are evaluating our approach with
larger amounts of data. For future progress, the following
issues must be discussed:
1. In this paper, conjunctive relationships were used as
knowledge with the best-match algorithm, in addition to a
thesaurus. However, various types of knowledge will be required
on a large scale for a more robust system. Automatic or
semi-automatic acquisition, using corpus-based methods, is also
needed.
2. If there are many unknown words in an ambiguity,
unknown-word matching will not work well. In addition to
scaling up the example-base and the thesaurus, we should
develop a more robust algorithm.
References 
[1] Collins. The New Collins Thesaurus. Collins Publishers,
Glasgow, 1984.
[2] IBM Corporation. IBM Dictionary of Computing, volume
SC20-1699-07. IBM Corporation, 1988.
[3] S. Kinoshita, M. Shimazu, and H. Hirakawa. "Better
Translation with Knowledge Extracted from Source Text". In
Proceedings of TMI-93, pages 240-252, 1993.
[4] Merriam. Webster's Seventh New Collegiate Dictionary.
G. & C. Merriam, Springfield, Massachusetts, 1963.
[5] K. Nagao. "Dependency Analyzer: A Knowledge-Based Approach
to Structural Disambiguation". In Proceedings of COLING-90,
pages 282-287, 1990.
[6] M. Nagao. "A Framework of a Mechanical Translation between
Japanese and English by Analogy Principle". In A. Elithorn and
R. Banerji, editors, Artificial and Human Intelligence. NATO,
1984.
[7] T. Nasukawa. "Discourse Constraint in Computer Manuals". In
Proceedings of TMI-93, pages 183-194, 1993.
[8] S. Nirenburg, C. Domashnev, and D. J. Grannes. "Two
Approaches to Matching in Example-Based Machine Translation".
In Proceedings of TMI-93, pages 47-57, 1993.
[9] P. Procter. Longman Dictionary of Contemporary English.
Longman Group Limited, Harlow and London, England, 1978.
[10] S. Sato and M. Nagao. "Towards Memory-Based Translation".
In Proceedings of COLING-90, pages 146-152, 1990.
[11] N. Uramoto. "Lexical and Structural Disambiguation Using
an Example-Base". In The 2nd Japan-Australia Joint Symposium on
Natural Language Processing, pages 150-160, 1991.
[12] N. Uramoto. "Example-Based Word-Sense Disambiguation".
IEICE Transactions on Information and Systems, E77-D(2), 1994.
