Measuring Semantic Coverage 
Sergei Nirenburg, Kavi Mahesh and Stephen Beale 
Computing Research Laboratory
New Mexico State University
Las Cruces, NM 88003-0001
USA 
sergei,mahesh,sb@crl.nmsu.edu
Abstract 
The development of natural language processing systems is currently
driven to a large extent by measures of knowledge-base size and
coverage of individual phenomena relative to a corpus. While these
measures have led to significant advances for knowledge-lean
applications, they do not adequately motivate progress in
computational semantics leading to the development of large-scale,
general purpose NLP systems. In this article, we argue that depth of
semantic representation is essential for covering a broad range of
phenomena in the computational treatment of language and propose
depth as an important additional dimension for measuring the semantic
coverage of NLP systems. We propose an operationalization of this
measure and show how to characterize an NLP system along the
dimensions of size, corpus coverage, and depth. The proposed
framework is illustrated using several prominent NLP systems. We hope
the preliminary proposals made in this article will lead to prolonged
debates in the field and will continue to be refined.
1 Measures of Size versus 
Measures of Depth 
Evaluation of current and potential performance of an NLP system or
method is of crucial importance to researchers, developers and users.
Current performance of systems is directly measured using a variety
of tests and techniques. Often, as in the case of machine translation
or information extraction, an entire "industry" of evaluation gets
developed (see, for example, ARPA MT Evaluation; MUC-4). Measuring
the performance of an NLP method, approach or technique (and through
it the promise of a system based on it) is more difficult, as
judgments must be made about "blame assignment" and the impact of
improving a variety of system components on the overall future
performance. One of the widely accepted measures of potential
performance improvement is the feasibility of scaling up the static
knowledge sources of an NLP system: its grammars, lexicons, world
knowledge bases and other sets of language descriptions (the
reasoning being that the larger the system's grammars and lexicons,
the greater the percentage of input they would be able to match and,
therefore, the better the performance of the system¹). As a result, a
system would be considered very promising if its knowledge sources
could be significantly scaled up at a reasonable expense. Naturally,
the expense is lowest if acquisition is performed automatically. This
consideration and the recent resurgence of corpus-based methods
heighten the interest in the automation of knowledge acquisition.
However, we believe that such acquisition should not be judged solely
by the utility of acquired knowledge for a particular application.
¹ Incidentally, this consideration contributes to evaluation of
current performance as well. In the absence of actual evaluation
results, it is customary to claim the utility of the system by simply
mentioning the size of its knowledge sources (e.g., "over 550 grammar
rules, over 50,000 concepts in the ontology and over 100,000 word
senses in the dictionary").
A preliminary to the scalability estimates is a judgment of the
current coverage of a system's static knowledge sources.
Unfortunately, judgments based purely on size are often misleading.
While they may be sufficiently straightforward for less
knowledge-intensive methods used in such applications as information
extraction and retrieval, part of speech tagging, bilingual corpus
alignment, and so on, the same is not true about more rule- and
knowledge-based methods (such as syntactic parsers, semantic
analyzers, semantic lexicons, ontological world models, etc.). It is
widely accepted, for instance, that judgments of the coverage of a
syntactic grammar in terms of the number of rules are flawed. It is
somewhat less self-evident, however, that the number of lexicon
entries or ontology concepts is not an adequate measure of the
quality or coverage of NLP systems. An adequate measure of these must
examine not only size and its scalability, but also depth of
knowledge along with its scalability. In addition, these size and
depth measures cannot be generalized over the whole system, but must
be directly associated with individual areas that cover the breadth
of NLP problems (i.e., morphology, word-sense ambiguity, semantic
dependency, coreference, discourse, semantic inference, etc.). And
finally, the most helpful measurements will not judge the system
solely as it stands, but must in some way reflect the ultimate
potential of the system, along with a quantification of how far
additional work aimed at size and depth will bring about advancement
toward that potential.
In this article, we attempt to formulate mea- 
sures of coverage important to the development 
and evaluation of semantic systems. We proceed 
from the assumption that coverage is a function of
not only the number of elements in (i.e., size of) 
a static knowledge source but also of the amount 
of information (i.e., depth) and the types of in- 
formation (i.e., breadth) contained in each such 
element. Static size is often emphasized in evalu- 
ations with no attention paid to the often very in- 
significant amount of information associated with 
each of the many "labels" or primitive symbols. 
We suggest a starting framework for measuring
size together with other significant dimensions 
of semantic coverage. In particular, the evalu- 
ation measures we propose reflect the necessary 
contribution of the depth and breadth of seman- 
tic descriptions. Depth and breadth of seman- 
tic description are essential for progress in com- 
putational semantics and, ultimately, for build- 
ing large-scale, general purpose NLP systems. Of 
course, for a number of applications a very lim- 
ited semantic analysis (e.g., in terms of, say, a 
dozen separate features) may be adequate for suf- 
ficiently high performance. However, in the long 
run, progress towards the ultimate goal of NLP is 
not possible without depth and breadth in seman- 
tic description and analysis. 
There is a well-known belief that it is not ap- 
propriate to measure success of NLP using field- 
internal criteria. Its adherents maintain that 
NLP should be evaluated exclusively through 
evaluating its applications: information retrieval, 
machine translation, robotic planning,
human-computer interaction, etc. (see, for example, the
Proc. of the Active NLP Workshop; ARPA MT 
Evaluation). This may be true for NLP users, but 
developers must have internal measures of success. 
This is because it is very difficult to assign blame 
for the success or failure of an application on spe- 
cific components of an NLP system. For exam- 
ple, in reporting on the MUC-3 evaluation efforts, 
Lehnert and Sundheim (1991) write: 
    A wide range of language processing strategies was employed by
    the top-scoring systems, indicating that many natural
    language-processing techniques provide a viable foundation for
    sophisticated text analysis. Further evaluation is needed to
    produce a more detailed assessment of the relative merits of
    specific technologies and establish true performance limits for
    automated information extraction. [emphasis added.]
Thus, evaluating the information extraction ap- 
plication did not provide constructive criticism on 
particular NLP techniques to enable advances in 
the state of the art. Also, evaluating an appli- 
cation does not directly contribute to progress in 
NLP as such. This is in part because a majority 
of current and exploratory NLP systems are not 
complete enough to fit an application but rather 
are devoted to one or more of a variety of components
of a comprehensive NLP system (static, e.g., lexicons,
grammars, etc.; or dynamic, e.g., an algorithm for
treating metonymy in English).
1.1 Current Measures of Coverage 
Success in NLP (including semantic analysis and 
related areas) is currently measured by the follow- 
ing criteria: 
• Size of static knowledge sources: A mere number indicating the size
of a knowledge source does not tell us much about the coverage of the
system, let alone its semantic capabilities. For example, most
machine readable dictionaries (MRD) are larger than computational
lexicons but they are not usable for computational semantics.
• Coverage of corpus, either blanket coverage ("56% of sentences were
translated correctly") or resolution of a certain phenomenon ("78% of
anaphors were determined correctly"). These measures are often
misleading by themselves since what may be covered are just one or
two highly specific phenomena such as recognizing place or product
names (i.e., limited breadth). NLP is not yet at a stage where
"covering a corpus" can mean "analyzing all elements of meanings of
texts in the corpus." It may be noted that "correctly" is a
problematic term since people often have difficulty judging what is
"correct" (Will, 1993). Moreover, correctness is orthogonal to the
entire discussion here since we would like to increase semantic
coverage along various dimensions while maintaining an acceptable
degree of correctness. On the same lines, processing efficiency
(often specified in terms such as "A sentence of length 9 takes 750
milliseconds to process") is also more or less orthogonal to the
dimensions we propose for measuring semantic coverage. Increasing
semantic coverage would be futile if processing became exponentially
expensive as a result.
Figure 1: Dimensions of Semantic Coverage: Current and Desired
Directions (labels in the figure: Phenomena, Knowledge Base, Current
State, Desired State)
Figure 1 shows the dimensions of size and breadth (or phenomenon
coverage) along the horizontal plane. Depth (or richness) of a
semantic system is shown on the vertical axis. We believe that recent
progress in NLP with its emphasis on corpus linguistics and
statistical methods has resulted in a significant spread along the
horizontal plane but little has been done to grow the field in the
vertical dimension. Figure 1 also shows the desired state of
computational semantics advanced along each of the three dimensions
shown.
If we proceed from the assumption that high-quality NLP systems
require optimum coverage on all three scales, then apparently
different roads can be taken to that target. The spectrum of choices
ranges from developing all three dimensions more or less
simultaneously to taking care of them in turn. As is often the case
in long-term high-risk enterprises, many researchers opt to start out
with acquisition work which promises short-term gains on one of the
coverage dimensions, with little thought about further steps. Often
the reason they cite can be summarized by the phrase "Science is the
art of the possible." This position is quite defensible, if no claims
are made about broad semantic coverage. Indeed, it is quite
legitimate to study a particular language phenomenon exclusively or
to cover large chunks of the lexis of a language in a shallow manner.
However, for practical gains in large-scale computational-semantic
applications one needs to achieve results on each of the three
dimensions of coverage.
1.2 Desiderata for Large-Scale 
Computational Semantics
Once the initial knowledge acquisition campaign for a particular
application has been concluded, the following crucial scalability
issues² must be addressed, if any understanding of the longer-term
significance of the research is sought:
• domain independence: scalability to new domains; general-purpose
NLP
• language independence: scalability across languages
• phenomenon coverage: scalability to new phenomena; going beyond
core semantic analysis; ease of integrating component processes and
resources.
• application independence: scalability to new applications; toolkit
of NLP techniques applicable to any task.
We believe that coverage in terms of the depth and breadth of the
knowledge given to an NLP system is mandatory for attaining the above
goals in the long run. Such coverage is best estimated not in terms
of raw sizes of lexicons or world models but rather through the
availability in them of information necessary for the treatment of a
variety of phenomena in natural language: issues related to semantic
dependency building, lexical disambiguation, semantic constraint
tracking and relaxation (for the cases of unexpected input, including
non-literal language as well as treatment of unknown lexis),
reference, pragmatic impact and discourse structure. The resolution
of these issues is at the core of post-syntactic text processing. We
believe that one can treat the above phenomena only by acquiring a
broad range of relevant knowledge elements for the system. One useful
measure for sufficiency of information would be an analysis of kinds
of knowledge necessary to generate a text (or dialog) meaning
representation. For applications in which more procedural
computational semantics is preferable, a corresponding measure of
sufficiency should be developed.
There exist other, broader desiderata which are applicable to any AI
system. They include concerns about system robustness, correctness,
and efficiency which are orthogonal to the above issues. Equally
important but more broadly applicable are considerations of economy
and ease of acquisition of knowledge sources, for example, reducing
the size of knowledge bases and sharing knowledge across
applications.
² At present, scalability is considered in the field almost
exclusively as propagation of the number of entries in the NLP
knowledge bases, not the quantity and quality of information inside
each such entry.
2 How to Reason about Depth, 
Breadth and Size 
A useful measure of semantic coverage must in- 
volve measurement along each of the three dimen- 
sions with respect to correctness (or success rate) 
and efficiency (or speed). In this first attempt 
at a qualitative metric, we list questions relevant 
for assigning qualitative ("tendency") scores to 
an NLP system to measure its semantic coverage. 
Our experience over the years has led us to the 
following sets of criteria for measuring semantic 
coverage. However, we understand that the following
are not complete or unique; they are representative
of the types of issues that are relevant
to measuring semantic coverage. 
2.1 Lexical Coverage 
• To what extent do entries share semantic primitives (or concepts)
to represent word meanings? What is the relation between the number
of semantic primitives defined and the number of word senses covered?
• What is the size of the semantic zones of the entry? How many
semantic features are covered?
• How many word senses from standard human-oriented dictionaries are
covered in the NLP-oriented lexicon entry?
• What types of information are included?
  - selectional restrictions
  - constraint relaxation information
  - syntax-semantics linking
  - collocations
  - procedural attachments for contextual processing
  - stylistic parameters
  - aspectual, temporal, modal and attitudinal meanings
  - other idiosyncratic information about the word
• and, finally, the total number of entries in the lexicon.
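
As an illustration only, several of the questions above could be turned into simple counts over a lexicon. The following is a minimal Python sketch under stated assumptions: the `LexiconEntry` structure, the `INFO_TYPES` labels and the `lexical_profile` function are hypothetical names introduced here, not part of any system discussed in this paper, and the resulting numbers are tendencies rather than a validated metric.

```python
from dataclasses import dataclass, field

# Hypothetical labels for the information types listed above.
INFO_TYPES = {
    "selectional_restrictions",
    "constraint_relaxation",
    "syntax_semantics_linking",
    "collocations",
    "procedural_attachments",
    "stylistic_parameters",
    "aspect_tense_modality_attitude",
    "idiosyncratic",
}


@dataclass
class LexiconEntry:
    """Assumed entry layout: word senses mapped to the primitives they use."""
    word: str
    senses: dict[str, set[str]] = field(default_factory=dict)
    info_types: set[str] = field(default_factory=set)


def lexical_profile(lexicon):
    """Return raw counts and ratios relevant to lexical coverage."""
    senses = [prims for entry in lexicon for prims in entry.senses.values()]
    primitives = set().union(*senses)  # distinct primitives shared by senses
    n_entries = len(lexicon)
    return {
        "entries": n_entries,
        "word_senses": len(senses),
        "primitives": len(primitives),
        # Relation between primitives defined and word senses covered.
        "senses_per_primitive": len(senses) / len(primitives) if primitives else 0.0,
        # Average number of recognized information types per entry.
        "mean_info_types": (
            sum(len(e.info_types & INFO_TYPES) for e in lexicon) / n_entries
            if n_entries else 0.0
        ),
    }
```

A profile of this kind reports the total number of entries last, alongside the other counts, in keeping with the point that size alone says little about coverage.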
2.2 Ontological Coverage 
The total number of primitive labels in a world 
model is not a useful measure of the semantic cov- 
erage of a system. At least the following consid- 
erations must be factored in: 
• The number of properties and links defined 
for an individual concept 
• Number of types of non-taxonomic relation- 
ships among concepts 
• Average number of links per concept: "con- 
nectivity" 
• Types of knowledge included: defaults, selec- 
tional constraints, complex events, etc. 
• Ratio of number of entries in a lexicon to 
number of concepts in the ontology 
• and, finally, total number of concepts in the 
ontology. 
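
The considerations above lend themselves to a small computation. The sketch below is one possible reading, not a definitive implementation: the `Concept` structure, the `TAXONOMIC` relation names and the `ontology_profile` function are assumptions made for illustration, and a real ontology would need its own mapping onto them.

```python
from dataclasses import dataclass, field

# Assumed names for taxonomic relations; everything else counts as non-taxonomic.
TAXONOMIC = {"is-a", "subclass-of", "instance-of"}


@dataclass
class Concept:
    """Assumed concept layout: a name plus (relation, target) links."""
    name: str
    links: list[tuple[str, str]] = field(default_factory=list)


def ontology_profile(concepts, lexicon_size):
    """Return counts and ratios relevant to ontological coverage."""
    n = len(concepts)
    total_links = sum(len(c.links) for c in concepts)
    non_taxonomic = {
        rel for c in concepts for rel, _target in c.links if rel not in TAXONOMIC
    }
    return {
        "concepts": n,
        # Average number of links per concept ("connectivity").
        "connectivity": total_links / n if n else 0.0,
        "non_taxonomic_relation_types": len(non_taxonomic),
        # Ratio of lexicon entries to ontology concepts.
        "lexicon_to_concept_ratio": lexicon_size / n if n else 0.0,
    }
```

An ontology with many concepts but a connectivity close to one taxonomic link per concept would score high on size and low on depth under this reading.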
2.3 Measuring Breadth of Meaning 
Representations 
Apart from lexical and ontological coverage, the 
depth and breadth of the meaning representations 
constructed by a system are good indicators of 
the overall semantic coverage of the system. The
number of different types of meaning elements included
from the following set provides a reasonable
measure of coverage:
• Argument structure only 
• Template filling only 
• Events and participants 
• Thematic role assignments 
• Time and temporal relations 
• Aspect 
• Properties: attributes of events and objects; 
relations between events and objects. 
• Reference and coreference
• Attitude, modality, stylistics
• Quantitative, comparative, and other mathematical relations
• Textual relations and other discourse relations
• Multiple ambiguous interpretations
• Propositional and story/dialog structure
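
Counting the types of meaning elements a system includes can be sketched directly. In the Python sketch below, the identifiers in `MEANING_ELEMENT_TYPES` are hypothetical labels for the thirteen items listed above (the qualifier "only" is dropped from the first two), and `representation_breadth` is an illustrative helper rather than an established measure.

```python
# Hypothetical identifiers for the thirteen meaning element types listed above.
MEANING_ELEMENT_TYPES = (
    "argument_structure",
    "template_filling",
    "events_and_participants",
    "thematic_roles",
    "time_and_temporal_relations",
    "aspect",
    "properties",
    "reference_and_coreference",
    "attitude_modality_stylistics",
    "quantitative_and_comparative_relations",
    "textual_and_discourse_relations",
    "multiple_ambiguous_interpretations",
    "propositional_and_story_structure",
)


def representation_breadth(supported):
    """Return (number of listed types a system includes, total listed types)."""
    unknown = set(supported) - set(MEANING_ELEMENT_TYPES)
    if unknown:
        raise ValueError(f"unrecognized meaning element types: {sorted(unknown)}")
    covered = [t for t in MEANING_ELEMENT_TYPES if t in supported]
    return len(covered), len(MEANING_ELEMENT_TYPES)
```

The count is deliberately unweighted; whether some element types should count for more than others is left open here, as it is in the list itself.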
3 Measuring Semantic Coverage: 
Examples 
Figure 2 shows the approximate position of several well-known
approaches and systems (including a possible Cyc-based system) in the
3-dimensional space of semantic coverage. We have chosen
representative systems from the different approaches for lack of
precise terms to name the approaches.
How do the approaches illustrated in Figure 2 
rate with respect to the metrics suggested above?
When estimating their profiles, we thought either 
about some representative systems belonging to 
an approach or thought of the properties of a pro- 
totypical system in a particular paradigm if no 
examples presented themselves readily. In the in- 
terests of space, we consider the above criteria 
for measuring semantic coverage but only provide 
brief summaries of how each system or approach 
is located along the dimensions of depth, breadth 
and size. 
Figure 2: Dimensions of Semantic Coverage: Current and Desired
Directions
The schema-based reasoner, Boris (Lehnert et al., 1983) was used as a
prototype system for the domain- and task-dependent, AI-style,
schema-based NLP system. It may be considered an extreme example of a
system with deep, rich knowledge of its, rather narrow, world in
which covering language phenomena is needed only inasmuch as it
supports general reasoning. Boris was able to process a very small
number of texts sufficiently for its goals. The coverage of phenomena
was strictly utilitarian (which, we believe, is quite appropriate).
It was not demonstrated that Boris can be scaled up to cover a
significant part of the English lexicon.
As an example of an early knowledge-based MT system (that is, unlike
the above, a system whose goals were mainly computational-linguistic)
we chose the KBMT-89 system (Goodman and Nirenburg, 1991). It covered
its small corpus relatively completely and described the necessary
phenomena relatively fully. It was a primary goal of this line of
research to begin meeting the above criteria for semantic coverage.
A putative NLP system based on the Cyc project has been selected as a
prototype for systems not devised for a particular application. The
Cyc large-scale knowledge base has significant amounts of deep
knowledge. However, it is not clear whether the knowledge is
applicable in a straightforward manner to deal with a range of
linguistic phenomena. The big question for this kind of system is
whether it is, in fact, possible to acquire knowledge without a
reference to an intended application.
A purely corpus-based, statistical approach to NLP, on the other
hand, has an extremely narrow range of knowledge, but may have a
large size. For example, such a system may have a large lexicon with
only word frequency and collocation information in each entry.
Although statistical methods have been shown to work on some problems
and applications, they are typically applied to one or two phenomena
at a time. It is not clear that statistical information acquired for
one problem (such as sense disambiguation) is of use in handling
other problems (such as processing non-literal expressions).
Mixed-strategy NLP systems are epitomized by Pangloss (1994), a
multi-engine translation system in which semantic processing is only
one of the possible translation engines. The semantics engine of this
system is equipped with a large-size ontology of over 50,000 entries
(Knight and Luk, 1994) which is used essentially as an anchor for
mapping lexical units from the source to the target language. As
shown in Figure 2, Pangloss has a large size and covers a good range
of phenomena as well. However, there is little information (only
taxonomic and partonomic relationships) in each concept in its Sensus
ontology. The limited depth constrains the ultimate potential of the
system as a semantic and pragmatic processor. For example, there is
no information in its knowledge sources to make judgements about
constraint relaxation to process non-literal expressions such as
metonymies and metaphors.
The Mikrokosmos system (e.g., Onyshkevych and Nirenburg, 1994) has
attempted to cover each dimension equally well. Its knowledge bases
and text meaning representations are rather deep and of nontrivial
sizes. It has been designed from the start to deal with a
comprehensive range of semantic phenomena including the linking of
syntax and semantics, core semantic analysis, sense disambiguation,
processing non-literal expressions, and so on, although not all of
them have yet been implemented.
From the above examples, it is clear that having good coverage along
one or two of the three dimensions is not good enough for meeting the
long-term goals of NLP. Poor coverage of language phenomena (i.e.,
poor breadth) indicates that the acquired knowledge, even when it is
deep and large in size, may not be applicable to other phenomena and
may not transfer to other applications. Poor depth suggests that
knowledge and processing techniques are either application- or
language-specific and limits the ultimate potential of the system in
solving semantic problems. Depth and breadth are of course of little
use if the system cannot be scaled up to a significant size.
Moreover, as already noted, coverage in depth, breadth, and size must
all be achieved in conjunction with maintaining good measures of
correctness, efficiency, and robustness.
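
The observation that good coverage on one or two dimensions is not enough can be read as a weakest-dimension check over qualitative ("tendency") scores. The Python sketch below assumes a three-level ordinal scale; `Tendency`, `CoverageProfile` and the default threshold are hypothetical choices made for illustration, and no scores are assigned here to the systems discussed above.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tendency(IntEnum):
    """Assumed three-level qualitative scale."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class CoverageProfile:
    depth: Tendency
    breadth: Tendency
    size: Tendency

    def weakest_dimensions(self):
        """Names of the dimension(s) holding the lowest score, sorted."""
        scores = {"depth": self.depth, "breadth": self.breadth, "size": self.size}
        lowest = min(scores.values())
        return sorted(name for name, score in scores.items() if score == lowest)

    def balanced(self, threshold=Tendency.MEDIUM):
        """True only if every dimension reaches the threshold."""
        return min(self.depth, self.breadth, self.size) >= threshold
```

Under this reading, a profile that is high on depth but low on breadth and size is flagged as unbalanced regardless of how deep its knowledge is.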
4 Discussion and Conclusions 
An oft-quoted objection to having deep semantic coverage is the
difficulty in scaling up such a system along the dimension of size.
This is a valid concern. However, the situation can be ameliorated to
a large extent by developing a methodology (see, e.g., Mahesh and
Nirenburg, 1995) for constraining knowledge acquisition to minimally
meet semantic processing needs. Such concentration of effort will
allow knowledge acquirers to spend a fraction of the effort that must
go into building a general machine-tractable encyclopedia of
knowledge and yet to attain significant coverage of language
phenomena. Significant scale-up can be accomplished under such a
constraint without jeopardizing the high values on the depth and
breadth scales.
Size is important in NLP. But size alone is not a sufficient metric
for evaluating semantic coverage. Focusing on size to the exclusion
of other criteria has biased the field away from semantic solutions
to NLP problems. We have made a first step in formulating a more
appropriate and complete set of measures of semantic coverage. Depth
and breadth of knowledge necessary to cover a wide range of language
phenomena are at least as important to NLP as size. The discussion of
peculiarities of the various approaches should be expanded in at
least two directions: greater detail of description and analysis of
the relative difficulty of reaching the set goal of attaining an
optimum value on each of the three measurement scales. We hope that
this paper will elicit interest in continued discussion of the issues
of coverage measurement, which, in turn, will lead to better
(quantitative as well as qualitative) measures, including a
methodology for comparing lexicons and ontologies.
Acknowledgments 
Many thanks to Yorick Wilks for his constructive 
criticism. 
References 
Active NLP Workshop: Working Notes from the 
AAAI Spring Symposium "Active NLP: Natural 
Language Understanding in Integrated Systems" 
March 21-23, 1994, Stanford University, Califor- 
nia (Also available as a Technical Report from the 
American Association for Artificial Intelligence). 
ARPA MT Evaluation: Report of the Advanced
Research Projects Agency, Machine Translation 
Program System Evaluation, May-August 1993. 
Goodman, K. and S. Nirenburg (eds.) (1991). 
The KBMT Project: A Case Study in
Knowledge-Based Machine Translation. San Mateo, CA:
Morgan Kaufmann.
Knight, K. and Luk, S. K. (1994). Building a 
Large-Scale Knowledge Base for Machine Trans- 
lation. In Proc. Twelfth National Conf. on Arti- 
ficial Intelligence, (AAAI-94). 
Lenat, D. B. and Guha, R. V. (1990). Building
Large Knowledge-Based Systems. Reading, MA:
Addison-Wesley.
Lehnert, W. G., Dyer, M. G., Johnson, P. N.,
Yang, C. J., and Harley, S. (1983). BORIS - An
Experiment in In-Depth Understanding of Narratives.
Artificial Intelligence, 20(1):15-62.
Lehnert, W. G. and Sundheim, B. (1991). A 
performance evaluation of text-analysis technolo- 
gies. AI Magazine, 12(3):81-94. 
Mahesh, K. and Nirenburg, S. (1995). A situ- 
ated ontology for practical NLP. In Proceedings 
of the Workshop on Basic Ontological Issues in 
Knowledge Sharing, International Joint Confer- 
ence on Artificial Intelligence (IJCAI-95), Mon- 
treal, Canada, August 1995. 
MUC-4: Proc. Fourth Message Understanding
Conference (MUC-4), June 1992. Defense Advanced
Research Projects Agency. Morgan Kaufmann
Publishers.
Onyshkevych, B. and Nirenburg, S. (1994). The 
lexicon in the scheme of KBMT things. Technical 
Report MCCS-94-277, Computing Research Lab- 
oratory, New Mexico State University. Also to 
appear in Machine Translation.
Pangloss. (1994). The PANGLOSS Mark III
Machine Translation System. A Joint Technical 
Report by NMSU CRL, USC ISI and CMU CMT, 
Jan. 1994. 
Will, C. A. (1993). Comparing human and ma- 
chine performance for natural language informa- 
tion extraction: Results from the Tipster evalua- 
tion. Proc. Tipster Text Program, ARPA, Mor- 
gan Kaufmann Publishers. 
