BUILDING A LARGE ONTOLOGY 
FOR MACHINE TRANSLATION 
Kevin Knight 
USC/Information Sciences Institute 
4676 Admiralty Way 
Marina del Rey, CA 90292 
ABSTRACT 
This paper describes efforts underway to construct a large- 
scale ontology to support semantic processing in the PAN- 
GLOSS knowledge-base machine translation system. Be- 
cause we axe aiming at broad sem~tntic coverage, we are focus- 
ing on automatic and semi-automatic methods of knowledge 
acquisition. Here we report on algorithms for merging com- 
plementary online resources, in particular the LDOCE and 
WordNet dictionaries. We discuss empirical results, and how 
these results have been incorporated into the PANGLOSS 
ontology. 
1. Introduction 
The PANGLOSS project is a three-site collaborative 
effort to build a large-scale knowledge-based machine 
translation system. Key components of PANGLOSS 
include New Mexico State University's ULTRA parser 
\[Farwell and Wilks, 1991\], Carnegie Mellon's interlingua 
representation format \[Nirenburg and Defrise, 1991\], and 
USC/ISI's PENMAN English generation system \[Pen- 
man, 1989\]. Another key component currently under 
construction at ISI is the PANGLOSS ontology, a large- 
scale conceptual network intended to support seman- 
tic processing in other PANGLOSS modules. This net- 
work will contain 50,000 nodes representing commonly 
encountered objects, entities, qualities, and relations. 
The upper (more abstract) region of the ontology is 
called the Ontology Base (OB) and contains approxi- 
mately 400 items that represent generalizations essential 
for the various PANGLOSS modules' linguistic process- 
ing during translation. The middle region of the ontol- 
ogy, approximately 50,000 items, provides a framework 
for a generic world model, containing items representing 
many English word senses. The lower (more specific) 
regions O f the ontology provide anchor points for differ- 
ent application domains. Both the middle and domain 
model regions of the ontology house the open-class terms 
of the MT interlingua. They also contain specific infor- 
mation used to screen unlikely semantic and anaphoric 
interpretations. 
The Ontology Base is a synthesis of USC/ISI's PEN- 
MAN Upper Model \[Bateman, 1990\] and CMU's ON- 
TOS concept hierarchy \[Carlson and Nirenburg, 1990\]. 
Both of these high-level ontologies were built by hand, 
and they were merged manually. Theoretical motiva- 
tions behind the OB and its current status are described 
in \[Hovy and Knight, 1993\]. 
The problem we focus on in this paper is the construc- 
tion of the large middle region of the ontology. Because 
large-scale knowledge resources are difficult to build by 
hand, we are pursuing primarily automatic methods ap- 
plied in several stages. During the first stage we created 
several tens of thousands of nodes, organized them into 
sub/superclass taxonomies, and subordinated those tax- 
onomies to the 400-node Ontology Base. This work we 
describe below. Later stages will address the insertion of 
additional semantic information such as restrictions on 
actors in events, domain/range constraints on relations, 
and so forth. 
For the major node creation and taxonomization stage, 
we have primarily used two on line sources of infor- 
mation: (1) the Longman Dictionary of Contempo- 
rary English (LDOCE)\[Group, 1978\], and (2) the lexical 
database WordNet \[Miller, 1990\]. 
2. Merging LDOCE and WordNet 
LDOCE is a learner's dictionary of English with 27,758 
words and 74,113 word senses. Each word sense comes 
with: 
A short definition. One of the unique features of 
LDOCE is that its definitions only use words from a 
"control vocabulary" list of 2000 words. This makes 
it attractive from the point of view of extracting 
semantic information by parsing dictionary entries. 
Examples of usage. 
One or more of 81 syntactic codes. 
For nouns, one of 33 semantic codes. 
For nouns, one of 124 pragmatic codes. 
185 
WordNet is a semantic word database based on psy- 
cholinguistic principles. Its size is comparable to 
LDOCE, but its information is organized in a completely 
different manner. WordNet groups synonymous word 
senses into single units ("synsets"). Noun senses are 
organized into a deep hierarchy, and the database ~so 
contains part-of links, antonym links, and others. Ap- 
proximately 55% of WordNet synsets have brief informal 
definitions. 
Each of these resources has something to offer a large- 
scale natural language system, but each is missing im- 
portant features present in the other. What we need is 
a combination of the features of both. 
Our most significant project to date has been to merge 
LDOCE and WordNet. This involves producing a list of 
matching pairs of word senses, e.g.: 
LDOCE WORDNET 
(abdomen_O_O ABDOMEN-l) 
(crane_l_2 CRANE-l) 
(crane_l_1 CRANE-2) 
(abbess_O_O ABBESS-I) 
(abbott_O_O ABBOTT-I) 
... °,° 
Section 4 describes how we produced this list semi- 
automatically. Solving this problem yields several bene- 
fits: 
• It allows us to taxonomize tens of thousands of 
LDOCE word senses and subordinate them quickly 
to the Ontology Base. Section 5 describes how we 
did this. 
• It provides a syntactic and pragmatic lexicon for 
WordNet, as well as careful definitions. 
• It groups LDOCE senses into synonyms sets and 
taxonomies. 
* It allows us to identify and correct errors in the 
original resources. 
3. Related Work 
Our ontology is a symbolic model for fueling semantic 
processing in a knowledge-based MT system. We are 
aiming at broader coverage (dictionary-scale) than has 
previously been available to symbolic MT systems. Also, 
we are committed to automatic and semi-automatic 
methods of knowledge acquisition from the start. This, 
and the fact that we are concentrating on a partic- 
ular language-processing application, distinguishes the 
PANGLOSS work from the CYC knowledge base \[Lenat 
and Guha, 1990\]. We also believe that dictionaries and 
corpora are imperfect sources of knowledge, so we still 
employ human effort to check the results of our semi- 
automatic algorithms. This is in contrast to purely sta- 
tistical systems (e.g., \[Brown et al., 1992\]), which are 
difficult to inspect and modify. 
There has been considerable use in the NLP community 
of both WordNet (e.g., \[Lehman et al., 1992; Resnik, 
1992\]) and LDOCE (e.g..., \[Liddy et aL, 1992; Wilks et 
al., 1990\]), but no one has merged the two in order to 
combine their strengths. The next section describes our 
approach in detail. 
4. Algorithms and Results 
We have developed two algorithms for merging LDOCE 
and WordNet. Both algorithms generate lists of sense 
pairs, where each pair consists of one sense from LDOCE 
and the proposed matching sense from WordNet, if any. 
4.1. Definition Match 
The Definition Match algorithm is based on the idea 
that two word senses should he matched if their two 
definitions share words. For example, there are two noun 
definitions of "batter" in LDOCE: 
(batter_2_0) "mixture of flour, eggs, and milk, 
beaten together and used in cooking" 
(batter_3_0) "a person who bats, esp in baseball -- 
compare BATSMAN" 
and two definitions in WordNet: 
• (BATTER-l) "ballplayer who bats" 
• (BATTER-2) "a flour mixture thin enough to pour 
or drop from a spoon" 
The Definition Match Algorithm will match (batter_2_0) 
with (BATTER-2) because their definitions share words 
like "flour" and "mixture." Similarly (batter_3_0) and 
(BATTER-I) both contain the word "bats," so they are 
also matched together. 
Not all senses in WordNet have definitions, but most 
have synonyms and superordinates. For this reason, the 
algorithm looks not only at WordNet definitions, but 
also at locally related words and senses. For example, if 
186 
synonyms of WordNet sense x appear in the definition 
of LDOCE sense y, then this is evidence that x and y 
should be matched. 
Here is the algorithm: 
Definition-Match 
For each English word w found in both LDOCE and 
WordNet: 
1. Let n be the number of senses of w in LDOCE. 
2. Let m be the number of senses of w in WordNet. 
. Identify and stem all open-class, content words in 
the definitions (and example sentences) of all senses 
of w in both resources. 
. 
. 
Let ULD be the union of all stemmed content words 
appearing in LDOCE definitions. 
Let UWN be the same for WordNet, plus all syn- 
onyms of the senses, their direct superordinates, sib- 
lings, super-superordinates, as well as stemmed con- 
tent words from the definitions of direct superordi- 
nates. 
6. Let CW=(ULD N UWN) - w. These are definition 
words common to LDOCE and WordNet. 
. Create matrix L of the n LDOCE senses and the 
words fromCW. Forall0<i<nand0<z< \[ 
CW l: 
L\[i,z\]= { 0.011"00 
if the definition of sense i 
in LDOCE contains word x 
otherwise 
8. Create matrix W of the m WordNet senses and the 
words fromCW. For all0<j< mand0<x < \] 
CW l: 
1.00 
0.80 
w\[x,j\] = 
0.60 
0.01 
if x is a synonym or 
superordinate of sense j 
in WordNet 
if x is contained in the 
definition of sense j or 
the definition of its 
superordinate 
if x is a sibling or 
super-superordinate of sense 
j in WordNet 
otherwise 
9. Create similarity matrix SIM of LDOCE and Word- 
Net senses. For all 0 _< i < n and 0 < j < m: 
FlCWl-  \] 
SIMti, j I = .\[ ~ (L\[i,x\]-W\[x,jl) / I CWl 
10. Repeat until SIM is a zero matrix: 
(a) Let SIM\[y, z\] be the largest value in the SIM 
matrix. 
(b) Generate matched pair of LDOCE sense y and 
WordNet sense z. 
(c) For all 0 _< i < n, set SIM\[i, z\] = 0.0. 
(d) For all 0 < j < m, set sIm\[y, j\] = 0.0. 
In constructing the SIM matrix the algorithm comes up 
with a similarity measure between each of the n.m possi- 
ble pairs of LDOCE and WordNet senses. This measure, 
SIM\[i, j\], is a number from 0 to 1, with 1 being as good a 
match as possible. Thus, every matching pair proposed 
by the algorithm comes with a confidence factor. 
Empirical results are as follows. We ran the algorithm 
over all nouns in both LDOCE and WordNet. We judged 
the correctness of its proposed matches, keeping records 
of the confidence levels and the degree of ambiguity 
present. 
For low-ambiguity words (words with exactly two senses 
in LDOCE and two in WordNet), the results are: 
confidence pct. pct. 
level correct coverage 
> 0.0 75% 100% 
>_ 0.4 85% 53% 
_> 0.8 90% 27% 
At confidence levels > 0.0, 75% of the proposed matches 
are correct. If we restrict ourselves to only matches pro- 
posed at confidence ~ 0.8, accuracy increases to 90%, 
but we only get 27% of the possible matches. 
For high-ambiguity words (more than five senses in 
LDOCE and WordNet), the results are: 
confidence pet. pct. 
level correct coverage 
> 0.0 47% 100% 
>_ O. 1 76% 44% 
>_ 0.2 81% 20% 
187 
Accuracy here is worse, but increases sharply when we 
only consider high confidence matches. 
The algorithm's performance is quite reasonable, given 
that 45% of WordNet senses have no definitions and 
that many existing definitions are brief and contain 
misspellings. Still, there are several improvements to 
be made e.g., modify the "greedy" strategy in which 
matches are extracted from SIM matrix, weigh rare 
words in definitions more highly than common ones, 
and/or score senses with long definitions lower than ones 
with short definitions. These improvements yield only 
slightly better results, however, because most failures 
are simply due to the fact that matching sense defini- 
tions have no words in common. For example, "seal" 
has 5 noun senses in LDOCE, one of which is: 
(seal_l_l) "any of several types of large fish- 
eating animals living mostly on cool seacoasts 
and floating ice, with broad flat limbs (FLIP- 
PERs) suitable for swimming" 
WordNet has 7 definitions of "seal," one of which is: 
For example, (bat_l_l) is defined as "any of the several 
types of specially shaped wooden stick used for ..." The 
genus term for (bat_l_l) is (stick_l_l). As another exam- 
ple, the genus sense of (aisle_0_l) is (passage_0_7). The 
genus sense and the semantic code hierarchies were ex- 
tracted automatically from LDOCE. The semantic code 
hierarchy is fairly robust, but since the genus sense hier- 
archy was generated heuristically, it is only 80% correct. 
The idea of the Hierarchy Match algorithm is that once 
two senses are matched, it is a good idea to look at 
their respective ancestors and descendants for further 
matches. For example, once (animal_l_2) and ANIMAL- 
1 are matched, we can look into the respective animal- 
subhierarchies. We find that the word "seal" is locally 
unambiguous---only one sense of "seal" refers to an an- 
imal (in both LDOCE and WordNet). So we feel con- 
fident to match those seal-animal senses. As another 
example, suppose we know that (swan_dive-0_0) is the 
same concept as (SWAN-DIVE-l). We can then match 
their superordinates (dive_2_l) and (DIVE-3) with high 
confidence; we need not consider other senses of "dive." 
Here is the algorithm: 
(SEAL-7) "any of numerous marine mammals 
that come on shore to breed; chiefly of cold 
regions" 
The Definition Match algorithm cannot see any simi- 
larity between (seal_l_1) and (SEAL-7), so it does not 
match them. However, we have developed another match 
algorithm that can handle cases like these. 
4.2. Hierarchy Match 
The Hierarchy Match algorithm dispenses with sense def- 
initions altogether. Instead, it uses the various sense 
hierarchies inside LDOCE and WordNet. 
WordNet noun senses are arranged in a deep is-a hierar- 
chy. For example, SEAL-7 is a PINNIPED-1, which is 
on AQUATIC-MAMMAL-l, which is a EUTHERIAN- 
1, which is a MAMMAL-l, which is ultimately an 
ANIMAL-I, and so forth. 
LDOCE has two fairly flat hierarchies. The semantic 
code hierarchy is induced by a set of 33 semantic codes 
drawn up by Longman lexicographers. Each sense is 
marked with one of these codes, e.g., "H" for human 
"P" for plant, "J" for movable object. The other hier- 
archy is the genus sense hierarchy. Researchers at New 
Mexico State University have built an automatic algo- 
rithm \[Bruce and Guthrie, 1992\] for locating and disam- 
biguating genus terms (head nouns) in sense definitions. 
Hierarchy-Match 
1. Initialize the set of matches: 
(a) Retrieve all words that are unambiguous in 
both LDOCE and WordNet. Match their cor- 
responding senses, and place all the matches 
on a list called M1. 
(b) Retrieve a prepared list of hand-crafted 
matches. Place these matches on a list called 
M2. We created 15 of these, mostly high- 
level matches like (person_0_l, PERSON-2) 
and (plant_2_l, PLANT-3). This step is not 
strictly necessary, but provides guidance to the 
algorithm. 
2. Repeat until M1 and M2 are empty: 
(a) For each match on M2, look for words that 
are unambiguous within the hierarchies rooted 
at the two matched senses. Match the senses 
of locally unambiguous words and place the 
matches on M1. 
(b) Move all matches from M2 to a list called M3. 
(c) For each match on M1, look upward in the two 
hierarchies from the matched senses. When- 
ever a word appears in both hierarchies, match 
the corresponding senses, and place the match 
on M2. 
188 
(d) Move all matches from M1 to M2. 
The algorithm operate in phases, shifting matches from 
M1 to M2 to M3, placing newly-generated matches on 
M1 and M2. Once M1 and M2 are exhausted, M3 con- 
tains the final list of matches proposed by the algorithm. 
Again, we can measure the success of the algorithm along 
two dimensions, coverage and correctness: 
pct. matches 
phase correct proposed 
Step 1 99% 7563 
94% 876 Step 2(a) 
Step 2@ 
Step 2(a) 
Step 2@ 
Step 2(a) 
Step 2(c) 
85% 530 
93% 2018 
83% 40 
92% 99 
100% 2 
In the end, the algorithm produced 11,128 matches at 
96% accuracy. We expected 100% accuracy, but the algo- 
rithm was foiled at several places by errors in one or an- 
other of the hierarchies. For example, (savings_bank_0_0) 
is mistakenly a subclass of river-bank (bank_l_1) in the 
LDOCE genus hierarchy, rather than (bank_l_4), the 
money-bank. "Savings bank" senses are matched in step 
l(a), so step 2(c) erroneously goes on to match the river- 
bank of LDOCE with the money-bank of WordNet. 
Fortunately, the Definition and Hierarchy Match algo- 
rithms complement one another, and there are several 
ways to combine them. Our practical experience has 
been to run the Hierarchy Match algorithm to comple- 
tion, remove the matched senses from the databases, 
then run the Definition Match algorithm. The Definition 
Match algorithm's performance improves slightly after 
hierarchy matching removes some word senses. Once the 
high confidence definition matches have.been verified, we 
use them as fuel for another run of the Hierarchy Match 
algorithm. 
We have built an interface that allows a person to verify 
matches produced by both algorithms, and to reject or 
correct faulty matches. So far, we have 15,000 correct 
matches, with 10,000 to follow shortly. The next section 
describes what we do with them in our ontology. 
5. The Current Ontology 
The ontology currently contains 15,000 noun senses from 
LDOCE and 20,000 more from WordNet. Its purpose is 
to support semantic processing in the PANGLOSS anal- 
ysis and generation modules. Because we have not yet 
taxonomized adjective and verb senses (see Section 6) 
semantic support is still very limited. 
On the generation side, the PENMAN system requires 
that all concepts be subordinated to the PENMAN Up- 
per Model, which is part of the Ontology Base (OB). It 
is difficult to subordinate tens of thousands of LDOCE 
word senses to the OB individually, but if we instead 
subordinate various WordNet hierarchies to the OB, 
the LDOCE senses will follow automatically via the 
WordNet-LDOCE merge. 
Subordinating the WordNet noun hierarchy to the OB 
required about 100 manual operations. Each operation 
either merged a WordNet concept with an OB equiva- 
lent, inserted one or more WordNet concepts between 
two OB concepts, or attached a WordNet concept below 
an OB concept. The noun senses from WordNet (and 
their matches from LDOCE) fall under all three of the 
OB's primary top-level categories of OBJECT, PROCESS, 
and QUALITY. The PENMAN generator now has access 
to the semantic knowledge it needs to generate a broad 
range of English. 
To support parsing, we have manually added about 20 
mutual-disjoint assertions into the ontology. One of 
these assertions states that no individual can be both 
an INANIMATE-OBJECT and an ANIMATE-OBJECT, another 
states that PERSON and 1011-HtrtlAN-ANItlAL are mutually 
disjoint, and so forth. A parser can use such information 
to disambiguate sentences like "this crane is my pet," 
where "crane" and "pet" have several senses in LDOCE 
(crane_l_l, a machine; crane_l_2, a bird; pet_l_1, a do- 
mestic animal; pet_l_2, a favorite person; etc.). The only 
pair of senses that are not mutually disjoint in our on- 
tology is (crane_l.2)/(pet_l_l), so this is the preferred 
interpretation. So far, all mutual-disjoint links are be- 
tween OB concepts. We plan a study of our lexicon to 
determine which nouns have senses that are not distin- 
guishable on the basis of mutual-disjointness, and this 
will drive further knowledge acquisition of these asser- 
tions. 
We are now integrating the ontology with ULTRA, 
the Prolog-based parsing component of the PANGLOSS 
translator. Although ULTRA parses Spanish input for 
PANGLOSS, the lexical items have already been seman- 
tically tagged with LDOCE sense keys, so no large-scale 
knowledge acquisition is necessary. Our first step has 
been to produce a Prolog version of the ontology, with in- 
ference rules for inheritance and propagation of mutual- 
disjoint links. 
Another use of the ontology has been to help us refine 
LDOCE and WordNet themselves. For example, any 
189 
sample of the automatically-generated LDOCE genus- 
sense hierarchy has approximately 20% errors. Using our 
merged LDOCE-WordNet-OB ontology as a standard, 
we have been able to locate and fix a large number of 
these errors automatically. 
6. Future Work 
There are several items on our immediate agenda: 
• Ontologize adjective, verb, and adverb senses 
from LDOCE. Most adjective senses either per- 
tain to objects (e.g., atomic_l_l) or represent 
slot-value pairs in the ontology (e.g.,green_l_l 
refers to COLOR/GREEI-C0LOR as pertaining to 
PItYSICAL-OBJECTs). Most verb senses refer to 
PROCESSes, whose participants have class restric- 
tions, and so forth. Much of this information can be 
mined from WordNet and LDOCE, as well as from 
online corpora. 
• Extract a large Spanish lexicon for the ontology. We 
plan to use a bilingual Spanish-English dictionary 
(and merging techniques similar in spirit to the ones, 
described in this paper) in order to roughly annotate 
the ontology with Spanish words and phrases. 
• Incrementally flesh out the ontology to improve the 
quality of PANGLOSS translations. We will focus 
on acquiring relations like SIZE, PURPOSE, PART-0F, 
POSTC011DITI01I, etc., through primarily automatic 
methods, including parsing of LDOCE definitions 
and processing corpora. 
7. Acknowledgments 
I would like to thank Richard Whitney for significant as- 
sistance in programming and verification. The Ontology 
Base was built by Eduard Hovy, Licheng Zeng, Akitoshi 
Okumura, Richard Whitney, and the author. I wish to 
express gratitude to Longman Group, Ltd., for making 
the machine readable version of LDOCE, 2nd edition, 
available to us. Louise Guthrie assisted in LDOCE ex- 
traction and kindly provided us with the LDOCE genus 
sense hierarchy. This work was carried out under ARPA 
Order No. 8073, contract MDA904-91-C-5224. 
References 
Bateman, J. 1990. Upper modeling: Organizing knowl- 
edge for natural language processing. In Proc. Fifth 
International Workshop on Natural Language Gener- 
ation, Pittsburgh, PA. 
Brown, P., V. Della Pietra, P. deSouza, J. Lai, and 
R. Mercer. 1992. Class-based n-gram models of natu- 
ral language. Computational Linguistics 18(4). 
Bruce, Rebecca and Louise Guthrie. 1992. Genus dis- 
ambiguation: A study in weighted preference. In Pro- 
ceedings of the 15th International Conference on Com- 
putational Linguistics (COLING-92). 
Carlson, L. and S. Nirenburg. 1990. World Modeling 
for NLP. Tech. Rep. CMU-CMT-90-121, Center for 
Machine Translation, Carnegie Mellon University. 
Farwell, D. and Y. Wilks. 1991. Ultra: A multilin- 
gual machine translator. In Proceedings of the 3rd 
MT Summit. 
Longman Group. 1978. Longman Dictionary of Con- 
temporary English. Essex, UK: Longman. 
Hovy, E. and K. Knight. 1993. Motivating shared knowl- 
edge resources: An example from the pangloss collab- 
oration. (Submitted to: Theoretical and Method- 
ological Issues in Machine Translation). 
Lehman, J., A. Newell, T. Polk, and R. Lewis. 1992. The 
rule of language in cognition. In Conceptions of the 
Human Mind, ed. G. Harman. Hillsdale, N J: Lawrence 
Erlbaum. (Forthcoming). 
Lenat, D. and R.V. Guha. 1990. Building Large 
Knowledge-Based Systems. Reading, MA: Addison- 
Wesley. 
Liddy, E., W. Paik, and J. Woelfel. 1992. Use of sub- 
ject field codes from a machine-readable dictionary for 
automatic classification of documents. In Advances 
in Classification Research: Proc. 3rd ASIS SIG/CR 
Classification Research Workshop. 
Miller, George. 1990. Wordnet: An on-line lexical 
database. International Journal of Lexicography 3(4). 
(Special Issue). 
Nirenburg, S. and C. Defrise. 1991. Aspects of text 
meaning. In Semantics and the Lexicon, ed. J. Puste- 
jovsky. Dordrecht, Holland: Kluwer. 
Penman. 1989. The Penman Documentation. Teeh. rep., 
USC/Information Sciences Institute. 
Resnik, P. 1992. Wordnet and distributional analy- 
sis: A class-based approach to lexical discovery. In 
Proc. AAAI Workshop on Statistically-Based NLP 
techniques. 
Wilks, Y., D. Fuss, C. Guo, J. McDonald, T. Plate, and 
B. Slator. 1990. Providing machine tractable dictio- 
nary tools. Machine Translation 5. 
190 
