An Empirical Assessment of Semantic Interpretation 
Martin Romacker & Udo Hahn
Text Understanding Lab, Computational Linguistics Group,
Freiburg University, Freiburg, D-79085, Germany
{mr,hahn}@coling.uni-freiburg.de
Abstract 
We introduce a framework for semantic interpreta- 
tion in which dependency structures are mapped to 
conceptual representations based on a parsimonious 
set of interpretation schemata. Our focus is on the 
empirical evaluation of this approach to semantic in- 
terpretation, i.e., its quality in terms of recall and 
precision. Measurements are taken with respect to 
two real-world domains, viz. information technology 
test reports and medical finding reports. 
1 Introduction 
Semantic interpretation was an actively investigated issue on the research agenda of the logic-based paradigm of NLP in the late eighties (e.g., Charniak
and Goldman (1988), Moore (1989), Pereira and 
Pollack (1991)). With the emergence of empirical 
methodologies in the early nineties, attention has al- 
most completely shifted away from this topic. Since 
then, semantic issues have mainly been dealt with 
under a lexical perspective, viz. in terms of the resolution of lexico-semantic ambiguities (e.g., Schütze (1998), Pedersen and Bruce (1998)) and the gener-
ation of lexical hierarchies from large text corpora 
(e.g., Li and Abe (1996), Hirakawa et al. (1996)), both making massive use of statistical techniques.
The research on semantic interpretation that was 
conducted in the pre-empiricist age of NLP was 
mainly driven by an interest in logical formalisms 
as carriers for appropriate semantic representations 
of NL utterances. With this representational bias, 
computational matters -- how can semantic repre- 
sentation structures be properly derived from parse 
trees for a large variety of linguistic phenomena? -- 
became a secondary issue. In particular, this research entirely lacked quantitative data reflecting the accuracy of the proposed semantic interpretation mechanisms on real-world data.
One might be tempted to argue that recent eval- 
uation efforts within the field of information extrac- 
tion (IE) systems (Chinchor et al., 1993) are going to 
remedy this shortcoming. Given, however, the fixed 
number of knowledge templates and the restricted 
types of entities, locations, and events they encode 
as target information to be extracted, one readily re- 
alizes that such an evaluation framework provides, 
at best, a considerably biased, overly selective test 
environment for judging the understanding potential 
of text analysis systems which are not tuned for this 
special application. 
On the other hand, the IE experiments clearly in- 
dicate the need for a quantitative assessment of the 
interpretative performance of natural language un-
derstanding systems. We will focus on this challenge 
and propose such a general evaluation framework. 
We first outline the model of semantic interpretation 
underlying our approach and then focus on its empirical assessment for two basic syntactic structures of the German language, viz. genitives and auxiliary
constructions, in two domains. 
2 The Basic Model for Semantic 
Interpretation 
The problem of semantic interpretation can be de- 
scribed as the mapping from syntactic to semantic 
(or conceptual) representation structures. In our ap- 
proach, the syntactic representation structures are 
given as dependency graphs (Hahn et al., 1994). Un- 
like constituency-based syntactic descriptions, de- 
pendency graphs consist of lexical nodes only, and these nodes are connected by edges, each of which is labeled by a particular dependency relation
(cf. Figure 1). 
For the purpose of semantic interpretation, de- 
pendency graphs can be decomposed into semanti- 
cally interpretable subgraphs.1 Basically, two types
of semantically interpretable subgraphs can be dis- 
tinguished. The first one consists of lexical nodes 
which are labeled by content words only (lexical in- 
stances of verbs, nouns, adjectives or adverbs) and 
which are directly linked by a single dependency re- 
lation of any type whatsoever. Such a subgraph is 
illustrated in Figure 1 by Speicher - genatt - Com-
puters. The second type of subgraph is also delim- 
ited by labels of content words but, in addition, a 
series of n = 1 ... 4 intermediary lexical nodes may
1This notion and all subsequent criteria for interpretation 
are formally described in Romacker et al. (1999). 
[Figure: dependency graph for "Der Speicher des Computers kann mit SDRAM-Modulen erweitert werden", with the content words Speicher, Computers, erweitert, and SDRAM-Modulen linked via the dependency relations genatt, subject, verbpart, ppadjunct, and pobject, and with boxes marking the semantically interpretable subgraphs]
[The memory -- of the computer -- can -- with SDRAM-modules -- extended -- be]
The memory of the computer can be extended with SDRAM-modules
Figure 1: Dependency Graph for a Sample Sentence
appear between these content words, all of which are 
labeled by non-content words (such as auxiliary or 
modal verbs, prepositions). Hence, in contrast to 
direct linkage we speak here of indirect linkage be- 
tween content words. Such a subgraph, with two 
intervening non-content words - the modal "kann" 
and the auxiliary "werden" -, is given in Figure 1 by 
Speicher - subject - kann - verbpart - werden - verbpart - erweitert. Another subgraph, with just one intervening non-content word - the preposition "mit" - is illustrated by erweitert - ppadjunct - mit - pobject - SDRAM-Modulen. From these considerations it follows that, e.g., the subgraph spanned by Speicher and SDRAM-Modulen does not form a semantically interpretable subgraph, since the content word erweitert intervenes on the linking path.
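To make the decomposition concrete, the following sketch (our own illustrative Python encoding, not the authors' implementation) extracts the semantically interpretable subgraphs of Figure 1 by pairing content words that are linked either directly or only through non-content nodes:

```python
from collections import deque
from itertools import combinations

# Dependency graph of Figure 1, encoded as head -> [(dependent, relation)].
GRAPH = {
    "kann": [("Speicher", "subject"), ("werden", "verbpart")],
    "werden": [("erweitert", "verbpart")],
    "erweitert": [("mit", "ppadjunct")],
    "mit": [("SDRAM-Modulen", "pobject")],
    "Speicher": [("Computers", "genatt")],
}
CONTENT = {"Speicher", "Computers", "erweitert", "SDRAM-Modulen"}

def interpretable_pairs(graph, content):
    """Pair content words linked directly or only via non-content nodes."""
    adj = {}  # undirected adjacency, ignoring relation labels
    for head, deps in graph.items():
        for dep, _rel in deps:
            adj.setdefault(head, set()).add(dep)
            adj.setdefault(dep, set()).add(head)
    pairs = set()
    for a, b in combinations(sorted(content), 2):
        seen, queue = {a}, deque([a])
        while queue:
            node = queue.popleft()
            for nxt in adj.get(node, ()):
                if nxt == b:              # reached the other content word
                    pairs.add((a, b))
                    queue.clear()
                    break
                if nxt not in seen and nxt not in content:
                    seen.add(nxt)         # traverse non-content nodes only
                    queue.append(nxt)
    return pairs
```

On the graph above, this yields exactly the three content-word pairs discussed in the text while excluding the pair Speicher / SDRAM-Modulen, whose linking path contains the content word erweitert.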
Our approach to semantic interpretation sub- 
scribes to the principles of locality and composition- 
ality. It operates on discrete and well-defined units 
(subgraphs) of the parse tree, and the results of se- 
mantic interpretation are incrementally combined by 
fusing semantically interpretable subgraphs. 
As semantic target language we have chosen the
framework of KL-ONE-type description logics (DL) 
(Woods and Schmolze, 1992). Since these logics are characterized by a set-theoretical semantics, we stay on solid formal ground. Furthermore, we take ad-
vantage of the powerful inference engine of DL sys- 
tems, the description classifier, which turns out to be 
essential for embedded reasoning during the seman- 
tic interpretation process. By equating the semantic representation with the conceptual one, we follow arguments discussed by Allen (1993).
The basic idea for semantic interpretation is as 
follows: Each lexical surface form of a content word 
is associated with a set of concept identifiers repre- 
senting its (different) lexical meanings. This way, 
lexical ambiguity is accounted for. These concep- 
tual correlates are internal to the domain knowledge 
base, where they are described by a list of attributes 
or conceptual roles, and corresponding restrictions 
on permitted attribute values or role fillers are asso- 
ciated with them. 
[Figure: concept graph for the sample sentence, with instances rendered as solid rectangles (e.g., COMPUTER-SYSTEM.02, EXTENSION.04) connected by labeled role edges such as HAS-WORKING-MEMORY and EXTENSION-PATIENT; dashed rectangles hold the tense and MODALITY markers]
Figure 2: Concept Graph for a Sample Sentence
As an example, consider the description for the 
concept COMPUTER-SYSTEM. It may be character- 
ized by a set of roles, such as HAS-HARD-DISK or HAS-
WORKING-MEMORY, with corresponding restrictions 
on the concept types of potential role fillers. HAS- 
WORKING-MEMORY, e.g., sanctions only fillers of 
the concept type MEMORY. These conceptual con- 
straints are used for semantic filtering, i.e., for the 
elimination of syntactically admissible dependency 
graphs which, nevertheless, do not have a valid se- 
mantic interpretation. 
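A minimal sketch of this filtering step, with a toy lexicon, role inventory, and concept taxonomy (the concept and role names follow the examples in the text, but the encoding and the taxonomy are ours):

```python
# Each content word maps to a set of concept identifiers (lexical ambiguity);
# each concept carries role restrictions used for semantic filtering.
LEXICON = {
    "Speicher": {"MEMORY", "WAREHOUSE"},  # German: RAM vs. storehouse
    "Computer": {"COMPUTER-SYSTEM"},
}

ROLES = {  # concept -> {role: required filler type}
    "COMPUTER-SYSTEM": {"HAS-HARD-DISK": "HARD-DISK",
                        "HAS-WORKING-MEMORY": "MEMORY"},
}

ISA = {"MEMORY": "STORAGE-DEVICE"}  # toy concept taxonomy (invented)

def subsumes(general, specific):
    """Walk up the ISA hierarchy to test concept subsumption."""
    while specific is not None:
        if specific == general:
            return True
        specific = ISA.get(specific)
    return False

def admissible_fillers(head_concept, filler_candidates):
    """Semantic filtering: keep only (role, filler) pairs whose filler
    satisfies the role's type restriction."""
    out = []
    for role, required in ROLES.get(head_concept, {}).items():
        for filler in filler_candidates:
            if subsumes(required, filler):
                out.append((role, filler))
    return out
```

Applied to the ambiguous "Speicher", only the MEMORY reading survives as a filler of HAS-WORKING-MEMORY; the WAREHOUSE reading is filtered out.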
Semantic interpretation, in effect, boils down to 
finding appropriate conceptual relations in the do- 
main knowledge that link the conceptual correlates 
of the two content words spanning the semanti- 
cally interpretable subgraph, irrespective of whether 
a direct or an indirect linkage holds at the syn- 
tactic level. Accordingly, Figure 2 depicts the se- 
mantic/conceptual interpretation of the dependency 
structure given in Figure 1. Instances represent- 
ing the concrete discourse entities and events in 
the sample sentence are visualized as solid rectan- 
gles containing a unique identifier (e.g., COMPUTER- 
SYSTEM.02). Labeled and directed edges indicate 
instance roles. Dashed rectangles characterize sym- 
bols used as markers for tense and modality. 2
Note that in Figure 2 each tuple of content words 
which configures a minimal subgraph in Figure 1 
has already received an interpretation in terms of a 
relation linking the conceptual correlates. For exam- 
ple, Speicher - genatt - Computers (cf. Figure 1,
box 1) is mapped to COMPUTER-SYSTEM.02 HAS- 
WORKING-MEMORY MEMORY.01 (cf. Figure 2, box 
1). However, the search for a valid conceptual relation is not limited to a simple one-link slot-filler structure. Rather, we may determine conceptual relation paths between conceptual correlates of lexical items, the length of which may be greater than 1.
2We currently do not further interpret the information con- 
tained in tense or modality markers. 
[Figure: fragment of the lexeme class hierarchy rooted in Lexeme. VerbTrans carries <subject: {agent patient}> and <dirobject: {patient co-patient}>, with the concrete lexeme "erweitern" (extend); Auxiliary contains "werden"_passive <{patient co-patient}>; Nominal splits into Noun, with an unconstrained <genitive attribute:> and the lexeme "Speicher" (memory), and Pronoun; Preposition contains "mit" (with) <{has-part instrument ...}>]
Figure 3: Fragment of the Lexeme Class Hierarchy
(Thus, the need for role composition in the DL lan- 
guage becomes evident.) The directed search in the 
concept graph of the domain knowledge requires so- 
phisticated structural and topological constraints to 
be manageable at all. These constraints are encap- 
sulated in a special path finding and path evaluation 
algorithm specified in Markert and Hahn (1997). 
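The path search itself can be sketched as a bounded traversal of the role graph; the actual algorithm of Markert and Hahn (1997) applies much richer structural and topological constraints, so the following is only a toy illustration, reconstructing the PRICE - PRODUCT - MANUFACTURER role chain discussed in footnote 4:

```python
# Toy role graph: concept -> [(role, target concept)] (names from footnote 4).
ROLE_GRAPH = {
    "PRICE": [("PRICE-OF", "PRODUCT")],
    "PRODUCT": [("HAS-MANUFACTURER", "MANUFACTURER")],
}

def role_paths(start, goal, max_len=2):
    """Return all role chains from start to goal up to max_len links."""
    paths, frontier = [], [(start, [])]
    while frontier:
        concept, chain = frontier.pop()
        if concept == goal and chain:
            paths.append(tuple(chain))
            continue
        if len(chain) < max_len:          # bound the search depth
            for role, target in ROLE_GRAPH.get(concept, []):
                frontier.append((target, chain + [role]))
    return paths
```

With `max_len=2` the composed chain PRICE-OF followed by HAS-MANUFACTURER is found, which illustrates why role composition must be available in the DL language.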
Besides these conceptual constraints holding in 
the domain knowledge, we further attempt to reduce 
the search space for finding relation paths by two 
kinds of syntactic criteria. First, the search may be 
constrained by the type of dependency relation hold- 
ing between the content words of the currently con- 
sidered semantically interpretable subgraph (direct 
linkage), or it may be constrained by the intervening 
lexical material, viz. the non-content words (indirect 
linkage). Each of these syntactic constraints has an 
immediate mapping to conceptual ones. 
For some dependency configurations, however, no 
syntactic constraints may apply. Such a case of un- 
constrained semantic interpretation (e.g., for geni- 
tive attributes directly linked by the genatt relation) 
leads to an exhaustive directed search in the knowl- 
edge base in order to find all conceptually compati- 
ble role fillings among the two concepts involved. 
Syntactic restrictions on semantic interpretation 
either come from lexeme classes or concrete lexemes. 
They are organized in terms of the lexeme class hi- 
erarchy superimposed on the fully lexicalized depen- 
dency grammar we use (Hahn et al., 1994). In the 
fragment depicted in Figure 3, the lexeme class of 
transitive verbs, VERBTRANS, requires that when- 
ever a subject dependency relation is encountered, 
semantic interpretation is constrained to the con- 
ceptual roles AGENT or PATIENT and all their sub- 
relations (such as EXTENSION-PATIENT). All other 
conceptual roles are excluded from the subsequent 
semantic interpretation. Exploiting the property in- 
heritance mechanisms provided by the hierarchic or- 
ganization of the lexicalized dependency grammar, 
all concrete lexemes subsumed by the lexeme class 
VERBTRANS, like "erweitern" (extend), inherit the 
corresponding constraint. However, there are lexeme classes such as NOUN which do not impose any constraints on dependency relations, as evidenced by gen[itive] att[ribute] (cf. Fig. 3).
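The inheritance of static constraints along the lexeme class hierarchy of Figure 3 can be sketched directly with class inheritance (a hypothetical encoding for illustration, not the grammar formalism actually used):

```python
class Lexeme:
    # dependency relation -> set of allowed conceptual roles
    constraints = {}

    @classmethod
    def allowed_roles(cls, relation):
        # None signals an unconstrained search of the domain knowledge base
        return cls.constraints.get(relation)

class VerbTrans(Lexeme):
    constraints = {"subject": {"AGENT", "PATIENT"},
                   "dirobject": {"PATIENT", "CO-PATIENT"}}

class Noun(Lexeme):
    constraints = {"genatt": None}  # genitives: no static restriction

class Erweitern(VerbTrans):
    """Concrete lexeme "erweitern" (extend); inherits VerbTrans constraints."""
```

The concrete lexeme inherits the subject and direct-object constraints from its class, while a noun's genitive attribute yields no static restriction at all.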
It may even happen that such restrictions can only 
be attached to concrete lexemes in order to avoid 
overgeneralization. Fortunately, we observed that 
this only happened to be the case for closed-class, 
i.e., non-content words. Accordingly, in Figure 3 
the preposition "with" is characterized by the con- 
straint that only the conceptual roles HAS-PART, INSTRUMENT, etc. may be taken into consideration for semantic interpretation.
Since the constraints at the lexeme class or the lex- 
eme level are hard-wired in the class hierarchy, we 
refer to the mapping of dependency relations (or id- 
iosyncratic lexemes) to a set of conceptual relations 
(expanded to their transitive closure) as static inter- 
pretation. In contradistinction, the computation of 
relation paths for tuples of concepts during the sen- 
tence analysis process is called dynamic interpreta- 
tion, since the latter process incorporates additional 
conceptual constraints on the fly. 
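Static interpretation thus amounts to precomputing, for each constrained dependency relation, the transitive closure of the licensed conceptual roles under the role hierarchy. A minimal sketch (subrole inventory invented for illustration, following the EXTENSION-PATIENT example above):

```python
# Toy subrole hierarchy: role -> list of direct subroles.
SUBROLES = {"PATIENT": ["EXTENSION-PATIENT"], "AGENT": []}

def expand(roles):
    """Close a set of roles under the subrole hierarchy (transitive closure)."""
    closed, todo = set(), list(roles)
    while todo:
        r = todo.pop()
        if r not in closed:
            closed.add(r)
            todo.extend(SUBROLES.get(r, []))
    return closed
```

Expanding the static {AGENT, PATIENT} constraint of VERBTRANS then also licenses EXTENSION-PATIENT without any run-time hierarchy traversal.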
The above-mentioned conventions allow the 
specification of high-level semantic interpretation 
schemata covering a large variety of different syntac- 
tic constructions by a single schema. For instance, 
each syntactic construction for which no conceptual 
constraints apply (e.g., the interpretation of geni- 
tives, most adjectives, etc.) receives its semantic 
interpretation by instantiating the same interpreta- 
tion schema (Romacker et al., 1999). The power of 
this approach comes from the fact that these high- 
level schemata are instantiated in the course of the 
parsing process by exploiting the dense specifications 
of the inheritance hierarchies both at the grammar 
level (the lexeme class hierarchy), as well as the con- 
ceptual level (the concept and role hierarchies). 
We currently supply up to ten semantic interpre- 
tation schemata for declaratives, relatives, and passives at the clause level, complement subcategoriza-
tion via PPs, auxiliaries, all tenses at the VP level, 
pre- and postnominal modifiers at the NP level,
and anaphoric expressions. We currently do not ac- 
count for control verbs (work in progress), coordina- 
tion and quantification. 
3 The Evaluation of Semantic 
Interpretation 
In this section, we discuss, for two particular types of German language phenomena, the adequacy of our approach in the light of concrete data taken from the two corpora we work with. This
part of the enterprise, the empirical assessment of se- 
mantic interpretation, is almost entirely neglected in 
the literature (for two notable exceptions, cf. Bon- 
nema et al. (1997) and Bean et al. (1998)). 
Though similarities exist (viz. dealing with the 
performance of NLP systems in terms of their abil- 
ity to generate semantic/conceptual structures), the 
semantic interpretation (SI) task has to be clearly 
distinguished from the information extraction (IE) 
task and its standard evaluation settings (Chinchor 
et al., 1993). In the IE task, a small subset of the templates from the entire domain is selected, into which information from the texts is mapped. Also, the design of these templates focuses on particularly
interesting facets (roles, in our terminology), so that 
an IE system does not have to deal with the full 
range of qualifications that might occur -- even re- 
lating to relevant, selected concepts. Note that in 
any case, a priori relevance decisions limit the range 
of a posteriori fact retrieval. 
The SI task, however, is far less restricted. We 
here evaluate the adequacy of the conceptual rep- 
resentation structures relating, in principle (only re- 
stricted, of course, by the limits of the knowledge ac- 
quisition devices), to the entire domain of discourse, 
with all qualifications mentioned in a text. Whether 
these are relevant or not for a particular application 
has to be determined by subsequent data/knowledge 
cleansing. In this sense, semantic interpretation 
might deliver the raw data for transformation into 
appropriate IE target structures. Only for reasons of feasibility do the designers of IE systems equate IE with SI. The cross-linking of IE and SI
tasks, however, bears the risk of having to determine, 
in advance, what will be relevant or not for later re- 
trieval processes, assumptions which are likely to be 
flawed by the dynamics of domains and the unpre- 
dictability of the full range of interests of prospective 
users. 
3.1 Methodological Issues 
Our methodology to deal with the evaluation of se- 
mantic interpretation is based on a triple division of 
test conditions. The first category relates to checks 
whether so-called static constraints, effected by the 
mapping from a single dependency relation to one 
or more conceptual relations, are valid (cf. Figure 3 
for restrictions of this type). Second, one may in- 
vestigate the appropriateness of the results from the 
search of the domain knowledge base, i.e., whether a 
relation between two concepts can be determined at 
all, and, if so, whether that relation (or role chain) 
is adequate. The conceptual constraints which come 
into play at this stage of processing are here referred 
to as dynamic constraint propagation, since they are 
to be computed on the fly, while judging the valid- 
ity of the role chain in question. 3 Third, interactions 
between the above-mentioned static constraints and 
dynamic constraint propagation may occur. This 
is the case for the interpretation of auxiliaries or 
prepositions, where intervening lexical material and 
associated constraints have to be accounted for si- 
multaneously. 
In our evaluation study, we investigated the effects 
of category II and category III phenomena by consid- 
ering genitives and modal as well as auxiliary verbs, 
respectively. The knowledge background is consti- 
tuted by a domain ontology that is divided into an 
upper generic part (containing about 1,500 concepts 
and relations) and domain-specific extensions. We 
here report on the two specialized domains we deal 
with -- a hardware-biased information technology 
(IT) domain model and an ontology covering parts 
of anatomical medicine (MED). Each of these two domain models adds roughly 1,400 concepts and relations to the upper model. Corresponding
lexeme entries in the lexicon provide linkages to the 
entire ontology. In order to avoid error chaining, we 
always assume a correct parse to be delivered for the 
semantic interpretation process. 
We took a random selection of 54 texts (compris- 
ing 18,500 words) from the two text corpora, viz. 
IT test reports and MEDical finding reports. For 
evaluation purposes (cf. Table 1), we concentrated 
on the interpretation of genitives (as an instance of 
direct linkage; GEN) and on the interpretation of 
periphrastic verbal complexes, i.e., passive, tempo- 
ral and modal constructions (as instances of indirect 
linkage; MODAUX). 
The choice of these two grammatical patterns al- 
lows us to ignore the problems caused by syntac- 
tic ambiguity, since in our data no structural am- 
3Note that computations at the domain knowledge level 
which go beyond mere type checking are usually located outside the scope of the semantic considerations. This is due to
the fact that encyclopedic knowledge and its repercussions on 
the understanding process are typically not considered part 
of the semantic interpretation task proper. While this may 
be true from a strict linguistic point of view, from the com- 
putational perspective of NLP this position cannot seriously 
be maintained. Even more so, when semantic and conceptual 
representations are collapsed. 
biguities occurred. If one were to investigate the combined effects of syntactic ambiguity and semantic interpretation, the evaluation scenario would have to be changed. Methodologically, the first step would be to explore the precision of a semantic interpretation task without structural ambiguities (as we do) and then, in the next step, to incorporate the treatment of syntactic ambiguities (e.g., by semantic filtering devices, cf. Bonnema et al. (1997)).
Several guidelines were defined for the evaluation 
procedure. A major issue dealt with the correctness 
of a semantic interpretation. In cases with interpre- 
tation, we considered a semantic interpretation to 
be a correct one, if the conceptual relation between 
the two concepts involved was considered adequate 
by introspection (otherwise, incorrect). This qualification is not as subjective as it may sound, since we applied strict conditions adjusted to the fine-grained domain knowledge. 4 Interpretations were considered to be correct in those cases which contained exactly one relation, as well as in cases of semantic/conceptual ambiguities (three readings at most), provided the relation set contained the correct one. 5 A special case of incorrectness,
called nil, occurred when no relation path could be 
determined though the two concepts under scrutiny 
were contained in the domain knowledge base and 
an interpretation should have been computed. 
We further categorized the cases where the sys- 
tem failed to produce an interpretation due to at 
least one concept specification missing (with respect 
to the two linked content words in a semantically 
interpretable subgraph). In all those cases with- 
out interpretation, insufficient coverage of the upper 
model was contrasted with that of the two domain 
models in focus, MED and IT, and with cases in 
which concepts referred to other domains, e.g., fash- 
ion or food. Ontological subareas that could nei- 
ther be assigned to the upper model nor to partic- 
ular domains were denoted by phrases referring to 
time (e.g., "the beginning of the year"), space (e.g., 
4The majority of cases were easy to judge. For instance, 
"the infiltration of the stroma" resulted in a correct reading 
- STROMA being the PATIENT of the INFILTRATION event -, 
as well as in an incorrect one - being the AGENT of the IN- 
FILTRATION. Among the incorrect semantic interpretations we 
also categorized, e.g., the interpretation of the expression "the 
prices of the manufacturers" as a conceptual linkage from 
PRICE via PRICE-OF to PRODUCT via HAS-MANUFACTURER to 
MANUFACTURER (this type of role chaining can be considered 
an intriguing example of the embedded reasoning performed 
by the description logic inference engine), since it did not ac- 
count for the interpretation that MANUFACTURERS fix PRICES 
as part of their marketing strategies. After all, correct inter- 
pretations always boiled down to entirely evident cases, e.g., 
HARD-DISK PART-OF COMPUTER. 
5At the level of semantic interpretation, the notion of se- 
mantic ambiguity relates to the fact that the search algorithm 
for valid conceptual relation paths retrieves more than a single 
relation (chain). 
"the surface of the storage medium"), and abstract 
notions (e.g., "the acceptance of IT technology"). 
Finally, we further distinguished evaluative expres- 
sions (e.g., "the advantages of plasma display") from figurative speech, including idiomatic expressions (e.g., "the heart of the notebook").
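The guidelines above can be summarized as a small scoring function (our own reconstruction of the categories used in Table 1, not the evaluation code actually employed):

```python
def score(readings, gold, concepts_covered=True):
    """Classify one interpretation attempt.

    readings: set of role chains the system found;
    gold: the relation judged adequate by introspection;
    concepts_covered: False if a concept specification was missing.
    """
    if not concepts_covered:
        return "without interpretation"   # missing concept specification
    if not readings:
        return "nil"                      # concepts known, but no path found
    if gold in readings and len(readings) == 1:
        return "correct (single reading)"
    if gold in readings and len(readings) <= 3:
        return "correct (multiple readings)"
    return "incorrect"
```

The function mirrors the guidelines: a reading set of up to three relations counts as correct if it contains the adequate one, and nil covers the case where both concepts exist in the knowledge base but no relation path is computed.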
At first glance, the choice of genitives may appear 
somewhat trivial. From a syntactic point of view, 
genitives are directly linked and, indeed, constitute 
an easy case to deal with at the dependency level. 
From a conceptual perspective, however, they pro- 
vide a real challenge. Since no static constraints are 
involved in the interpretation of genitives (cf. Figure 
3, lexeme class NOUN) and, hence, no prescriptions 
of (dis)allowed conceptual relations are made, an un- 
constrained search (apart from connectivity condi- 
tions imposed on the emerging role chains) of the 
domain knowledge base is started. Hence, the main 
burden rests on the dynamic constraint processing 
part of semantic interpretation, i.e., the path find- 
ing procedure muddling through the complete do- 
main knowledge base in order to select the adequate 
conceptual reading(s). Therefore, genitives make a 
strong case for test category II mentioned above. 
Dependency graphs involving modal verbs or aux- 
iliaries are certainly more complex at the syntac- 
tic level, since the corresponding semantically in- 
terpretable subgraphs may be composed of up to 
six lexical nodes. However, the intervening non-content-word nodes accumulate constraints for the search of a valid relation for semantic interpretation and, hence, allow us to test category III phenomena. The search space is usually pruned, since only
those relations that are sanctioned by the interven- 
ing nodes have to be taken into consideration. 
3.2 Evaluation Data 
We considered a total of almost 250 genitives in all 
these texts, from which about 59%/33% (MED/IT) 
received an interpretation. 6 Out of the total loss due 
to incomplete conceptual coverage, 56%/58% (23 of 
41 genitives/57 of 98 genitives) can be attributed to 
insufficient coverage of the domain models. Only the 
remaining 44%/42% are due to the residual factors 
listed in Table 1. 
In our sample, the number of syntactic constructions containing modal verbs or auxiliaries amounts to 292 examples. Compared to genitives, we obtained
a more favorable recall for both domains: 66% for 
MED and 40% for IT. As for genitives, lacking in- 
terpretations, in the majority of cases, can be at- 
tributed to insufficient conceptual coverage. For the 
IT domain, however, a dramatic increase in the num- 
ber of missing concepts is due to gaps in the upper 
model (78 or 63%) indicating that a large number of 
6Confidence intervals at a 95% reliability level are given in 
brackets in Table 1. 
                                 MED-GEN    IT-GEN     MED-MODAUX  IT-MODAUX
# texts                          29         25         29          25
# words                          4,300      14,200     4,300       14,200
recall                           57%        31%        66%         40%
precision                        97%        94%        95%         85%
# occurrences ...                100        147        58          234
... with interpretation          59 (59%)   49 (33%)   40 (69%)    111 (47%)
    [confidence intervals]       [48%-67%]  [24%-41%]  [56%-81%]   [40%-53%]
    correct (single reading)     53 (53%)   28 (19%)   38 (66%)    88 (38%)
    correct (multiple readings)  4 (4%)     18 (12%)   0 (0%)      6 (3%)
    incorrect                    0          3          0           14
    nil                          2          0          2           3
... without interpretation       41 (41%)   98 (67%)   18 (31%)    123 (53%)
    domain model (MED/IT)        23 (23%)   57 (39%)   11 (19%)    42 (34%)
    upper model                  3          23                     78
    other domains                0          4                      0
      time                       0          15                     1
      space                      7          8                      5
      abstracta, generics        11         12                     16
      evaluative expressions     0          8                      3
      figurative speech          1          17                     24
    miscellaneous                0          1                      3

Table 1: Empirical Results for the Semantic Interpretation of Genitives (GEN) and Modal Verbs and Auxiliaries (MODAUX) in the IT and MED domains
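For reference, the headline figures of Table 1 can be recomputed from the raw counts; the sketch below (our own reconstruction, not the authors' evaluation code) derives recall and precision for the MED-GEN column and approximates the bracketed 95% confidence interval with the standard normal approximation for a binomial proportion:

```python
import math

def recall_precision(correct, with_interpretation, occurrences):
    """Recall relates correct interpretations to all occurrences;
    precision relates them to the occurrences that received one."""
    return correct / occurrences, correct / with_interpretation

def confidence_interval(k, n, z=1.96):
    """Approximate 95% interval for the proportion k/n."""
    p = k / n
    half = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)

# MED-GEN column: 53 + 4 = 57 correct, 59 with interpretation, 100 total.
recall, precision = recall_precision(57, 59, 100)
interval = confidence_interval(59, 100)   # interval for "with interpretation"
```

This reproduces the reported 57% recall and 97% precision; the normal-approximation interval for 59/100 comes out close to, though not identical with, the bracketed figures, which may have been computed with a different interval estimator.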
essential concepts for verbs were not modeled. Also, 
figurative speech plays a more important role in IT 
with 24 occurrences. Both observations mirror the 
fact that IT reports are linguistically far less con- 
strained and are rhetorically more advanced than 
their MED counterparts. 
Another interesting observation which is not made 
explicit in Table 1 concerns the distribution of modal 
verbs and auxiliaries. In MED, we encountered 57 
passives and just one modal verb and no temporal 
auxiliaries, i.e., our data are in line with prevailing 
findings about the basic patterns of medical sublan- 
guage (Dunham, 1986). For the IT domain, cor- 
responding occurrences were far less biased, viz. 80 
passives, 131 modal verbs, and 23 temporal auxil- 
iaries. Finally, for the two domains 25 samples con- 
tained both modal verbs and auxiliaries, thus form- 
ing semantically interpretable subgraphs with four 
word nodes. 
One might be tempted to formulate a null hy- 
pothesis concerning the detrimental impact of the 
length of semantically interpretable subgraphs (i.e., 
the number of intervening lexical nodes carrying 
non-content words) on the quality of semantic inter- 
pretation. In order to assess the role of the length 
of the path in a dependency graph, we separately 
investigated the results for these subclasses of combined verbal complexes. From the entire four-node
set (cf. Table 2) with 25 occurrences (3 for MED and 
22 for IT), 16 received an interpretation (3 for MED, 
13 for IT). While we neglect the MED data due to 
the small absolute numbers, the IT domain revealed 
                          MED        IT
                          4-nodes    4-nodes
recall                    -          59%
precision                 -          85%
# occurrences ...         3          22
... with interpretation   3          13
    correct               3          11

Table 2: Interpretation Results for Semantically Interpretable Graphs Consisting of Four Nodes
59% recall and 85% precision. If we compare this 
to the overall figures for recall (40%) and precision 
(85%), the data might indicate a gain in recall for 
longer subgraphs, while precision remains stable.
The results we have worked out are just a first step 
into a larger series of broader and deeper evaluation 
efforts. The concrete values we present, sobering as 
they may be for recall (57%/31% for genitives and 
66%/40% for modal verbs and auxiliaries), encour- 
aging, however, for precision (97%/94% for genitives 
and 95%/85% for modal verbs and auxiliaries), can 
only be interpreted relative to other data still lacking 
on a broader scale. 
As with any such evaluation, idiosyncrasies of the 
coverage of the knowledge bases are inevitably tied 
with the results and, thus, put limits on too far- 
reaching generalizations. However, our data reflect 
the intention to submit a knowledge-intensive text 
understander to a realistic, i.e., conceptually un- 
constrained and therefore "unfriendly" test environ- 
ment. 
Judged from the figures of our recall data, there is no doubt whatsoever that conceptual coverage of the domain constitutes the bottleneck for any knowledge-based approach to NLP. 7 Sublanguage differences are also mirrored systematically in these
data, since medical texts adhere more closely to well- 
established concept taxonomies and writing stan- 
dards than magazine articles in the IT domain. 
4 Related Work 
After a period of active research within the logic- 
based paradigm (e.g., Charniak and Goldman 
(1988), Moore (1989), Pereira and Pollack (1991)), 
work on semantic interpretation has almost ceased 
with the emergence of the empiricist movement in 
NLP (cf. Bos et al. (1996) for one of the more recent 
studies dealing with logic-based semantic interpreta- 
tion in the framework of the VERBMOBIL project). 
Only few methodological proposals for semantic 
computations were made since then (e.g., higher- 
order colored unification as a mechanism to avoid 
over-generation inherent to unconstrained higher- 
order unification (Gardent and Kohlhase, 1996)). 
An issue which has lately received more focused at- 
tention are ways to cope with the tremendous com- 
plexity of semantic interpretations in the light of an 
exploding number of (scope) ambiguities. Within 
the underspecification framework of semantic repre- 
sentations, e.g., Dörre (1997) proposes a polynomial
algorithm which constructs packed semantic repre- 
sentations directly from parse forests. 
All the previously mentioned studies (with the ex- 
ception of the experimental setup in Dörre (1997)),
however, lack an empirical foundation of their var- 
ious claims. Though the MUC evaluation rounds 
(Chinchor et al., 1993) yield the flavor of an empiri- 
cal assessment of semantic structures, their scope is 
far too limited to count as an adequate evaluation 
platform for semantic interpretation. Nirenburg et 
al. (1996) already criticize the 'black-box' architec- 
ture underlying MUC-style evaluations, which precludes drawing serious conclusions from the short-
comings of MUC-style systems as far as single lin- 
guistic modules are concerned. More generally, in that paper the rationale underlying size (of the lexicons, knowledge or rule bases) as the major assessment category is questioned. Rather, dimensions re-
lating to the depth and breadth of the knowledge 
sources involved in complex system behavior should 
be taken more seriously into consideration. This is 
exactly what we intended to provide in this paper. 
As far as evaluation studies dealing with the assessment of semantic interpretations are concerned, few
7At least for the medical domain, we are currently actively 
pursuing research on the semiautomatic creation of large-scale 
ontologies from weak knowledge sources (medical terminolo- 
gies); cf. Schulz and Hahn (2000). 
have been carried out, some of which under severe 
restrictions. For instance, Bean et al. (1998) nar- 
row semantic interpretation down to a very limited 
range of spatial relations in anatomy, while Gomez et 
al. (1997) bias the result by preselecting only those 
phrases that were already covered by their domain 
models, thus optimizing for precision while shunting 
aside recall considerations. 
A recent study by Bonnema et al. (1997) comes
closest to a serious confrontation with a wide range
of real-world data (Dutch dialogues in a train travel
domain). This study proceeds from a corpus of
annotated parse trees to which type-logical formulae
expressing the corresponding semantic interpretation
are assigned. The goal of this work is to
compute the most probable semantic interpretation
for a given parse tree. Accuracy (i.e., precision) is
rather high and ranges between 89.2% and 92.3%, de-
pending on the training size and the depth of the parse
tree. Our accuracy criterion is weaker (the intended
meaning must be included in the set of all read-
ings), which might explain the slightly higher rates
we achieve for precision. However, this study neither
distinguishes between the different syntactic construc-
tions that undergo semantic interpretation, nor does
it consider the level of conceptual interpretation (which
we focus on) as distinguished from the level of semantic
interpretation to which Bonnema et al. refer.
5 Conclusions 
The evaluation of the quality and adequacy of se-
mantic interpretation data is still in its infancy. Our
approach, which confronts semantic interpretation
devices with a random sample of textual real-world
data without intentionally constraining the selec-
tion of these data, is a real challenge for
the proposed methodology and is unique in its
experimental rigor.
However, our work is just a step in the right di-
rection rather than giving a complete picture or al-
lowing final conclusions. Two reasons may be given
for the lack of such experiments. First, interest in
the deeper conceptual aspects of text interpretation
has waned in recent years, with almost all efforts
devoted to robust and shallow syntactic processing
of large data sets. This has also resulted in a lack of
sophisticated semantic and conceptual specifications,
in particular for larger text analysis systems. Sec-
ond, providing a gold standard for semantic inter-
pretation is, in itself, an extremely underconstrained
and time-consuming process for which almost no re-
sources have been allocated in the NLP community
up to now.
Acknowledgements. We want to thank the mem-
bers of our group for close cooperation. Martin Ro-
macker is supported by a grant from DFG (Ha 2097/5-1).

References 

James F. Allen. 1993. Natural language, knowledge
representation, and logical form. In M. Bates and
R. M. Weischedel, editors, Challenges in Natural
Language Processing, pages 146-175. Cambridge:
Cambridge University Press.

Carol A. Bean, Thomas C. Rindflesch, and
Charles A. Sneiderman. 1998. Automatic semantic
interpretation of anatomic spatial relationships
in clinical text. In Proceedings of the 1998 AMIA
Annual Fall Symposium, pages 897-901. Orlando,
Florida, November 7-11, 1998.

Remko Bonnema, Rens Bod, and Remko Scha. 1997.
A DOP model for semantic interpretation. In
Proceedings of the 35th Annual Meeting of the
Association for Computational Linguistics and 8th
Conference of the European Chapter of the ACL,
pages 159-167. Madrid, Spain, July 7-12, 1997.

Johan Bos, Björn Gambäck, Christian Lieske,
Yoshiki Mori, Manfred Pinkal, and Karsten
Worm. 1996. Compositional semantics in VERBMOBIL.
In COLING'96 - Proceedings of the 16th
International Conference on Computational
Linguistics, pages 131-136. Copenhagen, Denmark,
August 5-9, 1996.

Eugene Charniak and Robert Goldman. 1988. A 
logic for semantic interpretation. In Proceedings 
of the 26th Annual Meeting of the Association for 
Computational Linguistics, pages 87-94. Buffalo, 
New York, U.S.A., 7-10 June 1988. 

Nancy Chinchor, Lynette Hirschman, and David D.
Lewis. 1993. Evaluating message understanding
systems: an analysis of the third Message
Understanding Conference (MUC-3). Computational
Linguistics, 19(3):409-447.

Jochen Dörre. 1997. Efficient construction of
underspecified semantics under massive ambiguity.
In Proceedings of the 35th Annual Meeting of the
Association for Computational Linguistics and 8th
Conference of the European Chapter of the ACL,
pages 386-393. Madrid, Spain, July 7-12, 1997.

George Dunham. 1986. The role of syntax in the
sublanguage of medical diagnostic statements. In
R. Grishman and R. Kittredge, editors, Analyzing
Language in Restricted Domains: Sublanguage
Description and Processing, pages 175-194. Hillsdale,
NJ & London: Lawrence Erlbaum.

Claire Gardent and Michael Kohlhase. 1996.
Higher-order coloured unification and natural
language semantics. In ACL'96 - Proceedings of the
34th Annual Meeting of the Association for
Computational Linguistics, pages 1-9. Santa Cruz,
California, U.S.A., 24-27 June 1996.

Fernando Gomez, Carlos Segami, and Richard
Hull. 1997. Determining prepositional attachment,
prepositional meaning, verb meaning and
thematic roles. Computational Intelligence, 13(1):1-31.

Udo Hahn, Susanne Schacht, and Norbert Bröker.
1994. Concurrent, object-oriented natural language
parsing: the PARSETALK model. International
Journal of Human-Computer Studies,
41(1/2):179-222.

Hideki Hirakawa, Zhonghui Xu, and Kenneth Haase. 
1996. Inherited feature-based similarity measure 
based on large semantic hierarchy and large text 
corpus. In COLING'96 - Proceedings of the 16th 
International Conference on Computational Linguistics, pages 508-513. Copenhagen, Denmark, 
August 5-9, 1996. 

Hang Li and Naoki Abe. 1996. Clustering words 
with the MDL principle. In COLING'96 - Proceedings of the 16th International Conference on 
Computational Linguistics, pages 4-9. Copenhagen, Denmark, August 5-9, 1996. 

Katja Markert and Udo Hahn. 1997. On the
interaction of metonymies and anaphora. In
IJCAI'97 - Proceedings of the 15th International
Joint Conference on Artificial Intelligence, pages
1010-1015. Nagoya, Japan, August 23-29, 1997.

Robert C. Moore. 1989. Unification-based semantic interpretation. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, pages 33-41. Vancouver, B.C., 
Canada, 26-29 June 1989. 

Sergei Nirenburg, Kavi Mahesh, and Stephen Beale. 
1996. Measuring semantic coverage. In COLING'96 - Proceedings of the 16th International 
Conference on Computational Linguistics, pages 
83-88. Copenhagen, Denmark, August 5-9, 1996. 

Ted Pedersen and Rebecca Bruce. 1998. Knowledge 
lean word-sense disambiguation. In AAAI'98 - Proceedings of the 15th National Conference on 
Artificial Intelligence, pages 800-805. Madison, 
Wisconsin, July 26-30, 1998. 

Fernando C.N. Pereira and Martha E. Pollack. 1991. 
Incremental interpretation. Artificial Intelligence, 
50(1):37-82. 

Martin Romacker, Katja Markert, and Udo Hahn. 
1999. Lean semantic interpretation. In IJCAI'99 
- Proceedings of the 16th International Joint Conference on Artificial Intelligence, pages 868-875. 
Stockholm, Sweden, July 31 - August 6, 1999. 

Stefan Schulz and Udo Hahn. 2000. Knowledge engineering by large-scale knowledge reuse: experience from the medical domain. In Proceedings of 
the 7th International Conference on Principles of 
Knowledge Representation and Reasoning. Breckenridge, CO, USA, April 12-15, 2000. 

Hinrich Schütze. 1998. Automatic word sense
discrimination. Computational Linguistics,
24(1):97-124.

William A. Woods and James G. Schmolze. 1992.
The KL-ONE family. Computers & Mathematics
with Applications, 23(2/5):133-177.
