A Lexicon for Underspecified Semantic Tagging 
Paul Buitelaar 
Dept. of Computer Science 
Brandeis University 
Waltham, MA 02254-9110, USA 
paulb@cs, brandeis, edu 
Abstract 
The paper defends the notion that seman- 
tic tagging should be viewed as more than 
disambiguation between senses. Instead, 
semantic tagging should be a first step 
in the interpretation process by assigning 
each lexJ.cal item a representation of all 
of its sy=stematically related senses, from 
which fuxther semantic processing steps 
can derive discourse dependent interpre- 
tations. This leads to a new type of se- 
mantic lexicon (CoRv.Lzx) that supports 
underspecified semantic tagging through 
a design based on systematic polysemous 
classes and a class-based acquisition of lex- 
ical knowledge for specific domains. 
1 Underspecified semantic tagging 
Semantic tagging has mostly been considered as 
nothing more than disambiguation to be performed 
along the same lines as part-of-speech tagging: given 
n lexical items each with m senses apply linguis- 
tic heuristics and/or statistical measures to pick 
the most likely sense for each lexical item (see eg: 
(Yarowsky, 1Q92) (Stevenson and Wilks, 1997)). 
I do not believe this to be the right approach because 
it blurs the distinction between 'related' (systematic 
polysemy) and 'unrelated' senses (homonymy : bank 
- bank). Although homonyms need to be tagged with 
a disambiguated sense, this is not necessarily so in 
the case of systematic polysemy. There are two rea- 
sons for this that I will discuss briefly here. 
First, the problem of multiple reference. Consider 
this example from the BROWN corpus: 
\[A long book heavily weighted with 
milltary technlcalities\]Np, in this edi- 
tion it is neither so long nor so technical as 
it was originally. 
The discourse marker (it) refers back to an NP that 
expresses more than one interpretation at the same 
time. The head of the NP (book) has a number 
of systematically related senses that are being ex- 
pressed simultaneously. The meaning of book in this 
sentence cannot be disambiguated between the num- 
ber of interpretations that are implied: the informa- 
tional content of the book (military technicali- 
ties), its physical appearance (heavily weighted) 
and the events that are involved in its construction 
and use (long). 
The example illustrates the fact that disambigua- 
tion between related senses is not always possible, 
which leads to the further question if a discrete dis- 
tinction between such senses is desirable at all. A 
number of researchers have answered this question 
negatively (see eg: (Pustejovsky, 1995) (Killgariff, 
1992)). Consider these examples from BROWN: 
(1) fast run-up (of the stock) 
(2) fast action (by the city government) 
(3) fast footwork (by Washington) 
(4) fast weight gaining 
(5) fast condition (of the track) 
(6) fast response time 
(7) fast people 
(8) fast ball 
Each use of the adjective 'fast' in these examples has 
a slightly different interpretation that could be cap- 
tured in a number of senses, reflecting the different 
syntactic and semantic patterns. For instance: 
1. 'a fast action' (1, 2, 3, 4) 
2. 'a fast state of affairs' (5, 6) 
3. 'a fast object' (7, 8) 
25 
On the other hand all of the interpretations have 
something in common also, namely the idea of 
'speed'. It seems therefore useful to underspecify 
the lexical meaning of 'fast' to a representation that 
captures this primary semantic aspect and gives a 
general structure for its combination with other lex- 
ical items, both locally (in compositional semantics) 
and globally (in discourse structure). 
Both the multiple reference and the sense enumer- 
ation problem show that lexical items mostly have 
an indefinite number of related but highly discourse 
dependent interpretations, between which cannot be 
distinguished by semantic tagging alone. Instead, se- 
mantic tagging should be a first step in the interpre- 
tation process by assigning each lexical item a repre- 
sentation of all of its systematically related 'senses'. 
Further semantic processing steps derive discourse 
dependent interpretations from this representation. 
Semantic tags are therefore more like pointers to 
complex knowledge representations, which can be 
seen as underspecified lexical meanings. 
2 CORELEX: A Semantic Lexicon 
with Systematic Polysemous 
Classes 
In this section I describe the structure and content 
of a lexicon (CORELEX) that builds on the assump- 
tions about lexical semantics and discourse outlined 
above. More specifically, it is to be 'structured in 
such a way that it reflects the lexical semantics 
of a language in systematic and predictable ways' 
(Pustejovsky, Boguraev, and Johnston, 1995). This 
assumption is fundamentally different from the de- 
sign philosophies behind existing lexical semantic re- 
sources like WORDNET that do not account for any 
regularities between senses. For instance, WORD- 
NET assigns to the noun book the following senses: 
the content that is being communicated (commu- 
nicatiofl) and the medium of communication (ar- 
tifact). More accurately, book should be assigned a 
qualia structure which implies both of these interpre- 
tations and connects them to each of the more spe- 
cific senses that WORDNET assigns: that is, facts, 
drama and a journal can be part-of the content of a 
book; a section is part-of both the content and the 
medium; publication, production and record- 
ing are all events in which both the content and the 
medium aspects of a book can be involved. 
An important advantage of the CORELEX approach 
is more consistency among the assignments of lex- 
ical semantic structure. Consider the senses that 
WORDNET assigns to door, gate and window: 
door 
movable_barrier -,~ artifact 
entrance ~-~ opening 
access ~* cognition, knowledge 
house ~-, ?? 
room ~-~ ?? 
gate 
movable_barrier -,~ artifact 
computer_circult -,~ opening 
grossAncome -,~ opening 
window 
opening -~ opening 
panel --~ artifact 
display ~-* cognition, knowledge 
publication 
product, production 
fact 
dramatic_composltion, dramatic_work 
record 
section, subdivision 
journal 
Figure I: WORDNET senses for the noun book 
At the top of the WORDNET hierarchy these seven 
senses can be reduced to two unrelated 'basic senses': 
26 
Figure 2: WORDNET senses for the nouns door, 
window and gate 
Obviously these are similar words, something which 
is not expressed in the WORDNET sense assign- 
ments. In the CORELEX approach, these nouns are 
given the same semantic type, which is underspeci- 
fled for any specific 'sense' but assigns them consis- 
tently with the same basic lexical semantic structure 
that expresses the regularities between all of their 
interpretations. 
However, despite its shortcomings WORDNET is a 
vast resource of lexical semantic knowledge that can 
m 
m 
m 
mm 
m 
\[\] 
m 
m 
n 
\[\] 
m 
m 
n 
m 
m 
m 
m 
m 
n 
m 
m 
be mined, restructured and extended, which makes 
it a good starting point for the construction of 
CORELEX. The next sections describe how system- 
atic polysem0us classes and underspecified semantic 
types can be derived from WORDNET. In this pa- 
per I only consider classes of noun,s, but the process 
described here can also be applied to other parts of 
speech. 
2.1 Systematic polysemous classes 
We can arrive at classes of systematically poly- 
semous lexical items by investigating which items 
share the same senses and are thus polysemous in 
the same way. This comparison is done at the top 
levels of the WORDNET hierarchy. WORDNET does 
not have an explicit level structure, but for the pur- 
pose of this research one can distinguish a set of 32 
=basic senses' that partly coincides with, but is not 
based directly on WORDNET'S list of 26 'top types': 
act (act), agent (agt), animal (~.m), 
artifact (art), attribute (air), blun- 
der (bln), cell (cel), chemical (chm), 
communication (corn), 
event (evl;), food (rod), form (frm), 
group_biological (grb), group (grp), 
group_social (grs), h-m~n (hum), lln- 
ear_measure (1me), location (loc), 1o- 
cation_geographical (log), measure 
(mea), natural_object (nat), phe- 
nomenon (p\]m), plant (plt), posses- 
sion (pos), part (prt), psychological 
(psy), quantity_definite (qud), quan- 
tity_indefinite (qui), relation (re1), 
space (spc), state (sta), time (tree) 
Figure 3 shows their distribution among noun stems 
in the BROWN corpus. For instance there are 2550 
different noun stems (with 49,824 instances) that 
have each 2 out of the 32 'basic senses' assigned to 
them in 238 different combinations (a subset of 322 
= 1024 possible combinations). 
We now reduce all of WORDNET'S sense assignments 
to these basic senses. For instance, the seven differ- 
ent senses that WORDNET assigns to the lexical item 
book (see Figure I above) can be reduced to the two 
basic senses: 'art corn'. We do this for each lexical 
item and then group them into classes according to 
their assignments. 
From these one can filter out those classes that have 
only one member because they obviously do not rep- 
resent a systematically polysemous class. The lexical 
items in those classes have a highly idiosyncratic be- 
havior and are most likely homonyms. This leaves 
27 
senses comb's stems instances 
2 238 2550 49824 
3 379 936 35608 
4 268 347 22543 
5 148 154 15345 
6 52 52 5915 
7 27 27 5073 
8 10 10 3273 
9 3 3 1450 
I0 1 1 483 
11 2 2 959 
12 1 1 441 
1161 10797 140914 
Figure 3: Polysemy of nouns in BROWN 
a set of 442 polysemous classes, of which Figure 4 
gives a selection: 
act art evt rel 
act art log 
act evt nat 
chin sta 
com prt 
frm sta 
line qud 
loc psy 
log pos sta 
phm pos 
tel sta 
click modification reverse 
berth habitation mooring 
ascent climb 
grease ptomaine 
appendix brickbat index 
solid vacancy void 
em fathom fthm inch mil 
bourn bourne demarcation 
fairyland rubicon trend vertex 
barony province 
accretion usance wastage 
baronetcy connectedness 
context efficiency inclusion 
liquid relationship 
Figure 4: A selection of polysemous classes 
Not all of the 442 classes are systematically polyse- 
mous. Consider for example the following classes: 
Some of these classes are collections of homonyms 
that are ambigtzotz,s in similar ways, but do not lead 
to any kind of predictable polysemous behavior, for 
instance the class 'act anm art' with the lexical 
items: drill ruff solitaire stud. Other classes con- 
sist of both homonyms and systematically polyse- 
mous lexical items like the class act log, which in- 
cludes caliphate, clearing, emirate, prefecture, repair, 
wheeling vs. bolivia, charleston, chicago, michigan. 
m 
m 
act ~nm art 
act log 
act plt 
axt rod loc 
chmpsy 
rod hum plt 
drill ruff solitaire stud 
bolivia caliphate charleston 
chicago clearing emirate michigan 
prefecture repair santiago wheeling 
chess grapevine rape 
pike port 
complex incense 
mandarin sage swede 
Figure 5: A selection of ambiguous classes 
Whereas the first group of nouns express two sepa- 
rated but related meanings (the act of clearing, re- 
pair, etc. takes place at a certain location), the 
second group expresses two meanings that are not 
related (the charleston dance which was named after 
the town by the same name). 
The ambiguous classes need to be removed alto- 
gether, while the ones with mixed ambiguous and 
polllsemous lexical items are to be weeded out care- 
fully. 
2.2 Underspecified semantic types 
The next step in the research is to organize the re- 
maining classes into knowledge representations that 
relate their senses to each other. These representa- 
tions are based on Generative Lexicon theory (G£), 
using qualia roles and (dotted) types (Pustejovsky, 
19os). 
Qualia roles distinguish different semantic aspects: 
FORMAL indicates semantic type; CONSTITUTIVE 
part-whole information; AGENTIVE and TELIC asso- 
ciated events (the first dealing with the origin of 
the object, the second with its purpose). Each role 
is typed to a specific class of lexical items. Types 
are either simple (human, artifact,...) or complex 
(e.g., information.physical). Complex types are 
called dotted types after the 'dots' that are used as 
type constructors. Here I introduce two kinds of 
dots: 
Closed clots '.' connect systematically re- 
lated types that are always interpreted si- 
multaneonsly. 
Open dots 'o' connect systematically re- 
lated types that are not (normally) inter- 
preted simultaneously. 
Both '#*~" and 'aor' denote sets of pairs of objects 
(a, b), a an object of type ~ and b an object of type 
~'. A condition aRb restricts this set of pairs to only 
those for which some relation R holds, where R de- 
notes a subset of the Cartesian product of the sets 
of type ~ objects and type r objects. 
The difference between types '#or' and 'cot' is in 
the nature of the objects they denote. The type 
'aer' denotes sets of pairs of objects where each 
pair behaves as a complex object in discourse struc- 
ture. For instance, the pairs of objects that are in- 
troduced by the type informationephysical (book, 
journal, scoreboard .... ) are addressed as the complex 
objects (x:information, y:physical) in discourse. 
On the other hand, the type '#or' denotes simply 
a set of pairs of objects that do not occur together 
in discourse structure. For instance, the pairs of ob- 
jects that are introduced by the type form.artifact 
(door, gate, window .... ) are not (normally) ad- 
dressed simultaneously in discourse, rather one side 
of the object is picked out in a particular context. 
Nevertheless, the pair as a whole remains active dur- 
ing processing. 
The resulting representations can be seen as under- 
specified lexical meanings and are therefore referred 
to as underspecified semantic types. CORELEX cur- 
rently covers 104 underspecified semantic types. 
This section presents a number of examples, for a 
complete overview see the CORELEX webpage: 
http://~, ca. brandeis, edu/"paulb/Cor eLex/corelex, html 
Closed Dots Consider the underspecified repre- 
sentation for the semantic type actorelation: 
FORMAL = Q:act.relation 
CONSTITUTIVE = 
X:act V Y:relation V Z:act,relation 
TELIC --- 
P:event (acterelation) A act (R1) A 
relation(R2,Rs) 
Figure 6: Representation for type: actorelation 
The representation introduces a number of objects 
that are of a certain type. The FORMAL role in- 
troduces an object Q of type actorelation. The 
CONSTITUTIVE introduces objects that are in a part- 
whole relationship with Q. These are either of the 
same type actorelation or of the simple types act or 
relation. The TELIC expresses the event P that can 
be associated with an object of type acterelation. 
For instance, the event of increase as in 'increasing the 
communication between member states' implies 'in- 
creasing' both the act of communicating an object 
28 
m 
m 
m 
m 
m 
m 
m 
m 
m 
\[\] 
m 
mm 
m 
m 
m 
m 
m 
m 
m 
m 
m 
m 
mm 
mm 
m 
m 
m 
m 
m 
RI and the communication relation between two 
objects R2 and Rs. All these objects are introduced 
on the semantic level and correspond to a number 
of objects that will be realized in syntax. However, 
not all semantic objects will be realized in syntax. 
(See Section 3.4 for more on the syntax-semantics 
interface.) 
The instances for the type act*relation are given 
in Figure 7, covering three different systematic pol- 
ysemous classes. We could have chosen to include 
only the instances of the 'act rel' class, but the 
nouns in the other two classes seem similar enough 
to describe all of them with the same type. 
generative the lexicon should be and if one allows 
overgeneration of semantic objects. 
.nm rod bluepoint capon clam cockle crawdad 
crawfish crayfish duckling fowl 
grub hen lamb langouste limpet 
lobster monkfish mussel octopus panfish 
partridge pheasant pigeon poultry 
prawn pullet quail saki scallop 
scollop shellfish shrimp snail 
squid whelk whitebait whitefish winkle 
act evt rel 
act rel 
act rel s~a 
blend competition flux 
transformation 
acceleration communication 
dealings designation discourse gait 
glide likening negation neologism 
neology prevention qualifying 
sharing synchronisation 
synchronization synchronizing 
coordination gradation involvement 
Figure 7: Instances for the type: act*relation 
Open Dots The type act.relation describes in- 
terpretations that can not be separated from each 
other (the act and relation aspects are intimately 
connected). The following representation for type 
-nimalofood describes interpretations that can not 
occur simultaneously but are however related ~. It 
therefore uses a 'o' instead of a '.' as a type con- 
structor: 
FORMAL = Q:animalofood 
CONSTITUTIVE = X:an~mal V Y:food 
TELIC = 
Pz :act (Rz,"n|mal) V P2 :act (animal,R2) 
v P3:act(R3,food) 
Figure 8: Representation for type: animalofood 
The instances for this type only cover the class ' ~,m 
rod'. A case could be made for including also every 
instance of the class c~-m' because in principal every 
animal could be eaten. This is a question of how 
1See the literature on animal grinding, for instance 
(Copestake and Briscoe, 1992) 
29 
Figure 9: Instances for the type: animalofood 
2.3 Homonyms 
CORELEX is designed around the idea of system- 
atic polysemons classes that exclude homonyms. 
Traditionally a lot of research in lexical semantics 
has been occupied with the problem of ambiguity 
in homonyms. Our research shows however that 
homonyms only make up a fraction of the whole of 
the lexicon of a language. Out of the 37,793 noun 
stems that were derived from WORDNET 1637 are 
to be viewed as true homonyms because they have 
two or more unrelated senses, less than 5%. The re- 
maining 95% are nouns that do have (an indefinite 
number of) different interpretations, hut all of these 
are somehow related and should be inferred from a 
common knowledge representation. These numbers 
suggest a stronger emphasis in research on system- 
atic polysemy and less on homonyms, an approach 
that is advocated here (see also (Killgariff, 1992)). 
In CORZLEX homonyms are simply assigned two or 
more underspecified semantic types, that need to be 
disambiguated in a traditional way. There is how- 
ever an added value also here because each disam- 
biguated type can generate any number of context 
dependent interpretations. 
3 Adapting CORELEx to Domain 
Specific Corpora 
The underspectfied semantic type that CORELEX as- 
signs to a noun provides a basic lexical semantic 
structure that can be seen as the class-wide back- 
bone semantic description on top of which specific 
information for each lexical item is to be defined. 
That is, doors and gates are both artifacts but they 
have different appearances. Gates are typically open 
constructions, whereas doors tend to be solid. This 
kind of information however is corpus specific and 
therefore needs to be adapted specifically to and on 
the basis of that particular corpus of texts. 
This process involves a number of consecutive steps 
that includes the probabilistic classification of un- 
known lexical items: 
1. Assignment of underspecified semantic tags to 
those nouns that are in CORELEX 
2. Running class-sensitive patterns over the 
(partly) tagged corpus 
3. (a) Constructing a probabilistic classifier from 
the data obtained in step 2. 
(b) Probabilistically tag nouns that are not in 
CORELEX according to this classifier 
4. Relating the data obtained in step 2. to one or 
more qualia roles 
Step 1. is trivial, but steps 2. through 4. form 
a complex process of constructing a corpus specific 
semantic lexicon that is to be used in additional 
processing for knowledge intensive reasoning steps 
(i.e. abduction (Hobbs et al., 1993)) that would solve 
metaphoric, metonymic and other non-literal use of 
language. 
3.1 Assignment of CORELEX Tags 
The first step in analyzing a new corpus involves 
tagging each noun that is in CORELEX with an un- 
derspecified semantic tag. This tag represents the 
following information: a definition of the type of 
the noun (FORMAL); a definition of types of pos- 
sible nouns it can stand in a part-whole relation- 
ship with (CONSTITUTIVE); a definition of types of 
possible verbs it can occur with and their argument 
structures (AGENTIVE / TELIC). CORELEX is imple- 
mented as a database of associative arrays, which 
allows a fast lookup of this information in pattern 
matching. 
3.2 Class-Sensitive Pattern Matching 
The pattern matcher runs over corpora that are: 
part-of-speech tagged using a widely used tagger 
(Brill, 1992); stemmed by using an experimental sys- 
tem that extends the Porter stemmer, a stemming 
algorithm widely used in information retrieval, with 
the Celex database on English morphology; (partly) 
semantically tagged using the CORELEX set of un- 
derspecified semantic tags as discussed in the previ- 
ous section. 
There are about 30 different patterns that are ar- 
ranged around the headnoun of an NP. They cover 
30 
the following syntactic constructions that roughly 
correspond to a VP, an S, an NP and an NP fol- 
lowed by a PP: 
• verb-headnoun 
• headnoun-verb 
• adjective-headnoun 
• modiflernoun-headnoun 
• headnoun-preposition-headnoun 
The patterns assume NP's of the following generic 
structure 2: 
PreDet* Det* Num* (Adj INamelNoun)* Noun 
The heuristics for finding the headnoun is then sim- 
ply to take the rightmost noun in the NP, which for 
English is mostly correct. 
The verb-headnoun patterns approach that of a 
true 'verb-obj' analysis by including a normalization 
of passive constructions as follows: 
\[Noun Have? Be Adv? Verb\] =~ \[Verb Noun\] 
Similarly, the headnoun-verb patterns approach 
a true 'sub j-verb' analysis. However, because no 
deep syntactic analysis is performed, the patterns 
can only approximate subjects and Objects in this 
way and I therefore do not refer to these patterns as 
'subject-verb' and 'verb-object' respectively. 
The pattern matching is class-sensitive in employing 
the assigned CORELEX tag to determine if the appli- 
cation of this pattern is appropriate. For instance, 
one of the headnoun-preposition-headnoun pat- 
terns is the following, that is used to detect part- 
whole (CONSTITUTIVE) relations: 
PreDet* Det* Num* (Adj \[ Name \[ Noun)* Noun of 
PreDet* Det* Num* (Adj \[NameJNoun)* Noun 
Clearly not every syntactic construction that fits this 
pattern is to be interpreted as the expression of a 
part-whole relation. One of the heuristics we there- 
fore use is that the pattern may only apply if both 
head nouns carry the same CORELEx tag or if the 
tag of the second head noun subsumes the tag of the 
first one through a dotted type. That is, if the sec- 
ond head noun is of a dotted type and the first is of 
one of its composing types. For instance, 'paragraph' 
~The interpretation of '*' and '?' in this section fol- 
lows that of common usage in regular expressions: 'w 
indicates 0 or more occurrences; '?' indicates 0 or 1 
occurrence 
and 'journal' can be in a part-whole relation to each 
other because the first is of type information, while 
the second is of type information*physical. Simi- 
lar heuristics can be identified for the application of 
other patterns. 
Recall of the patterns (percentage of nouns that 
are covered) is on average, among different cor- 
pora (wsJ, BROWN, PDGF - a corpus we constructed 
for independent purposes from 1000 medical ab- 
stracts in the MEDLINE database on Platelet Derived 
Growth Factor - and DARWIN - the complete Origin 
of Species), about 70% to 80%. Precision is much 
harder to measure, but depends both on the accu- 
racy of the output of the part-of-speech tagger and 
on the accuracy of class-sensitive heuristics. 
3.3 Probabilistic Classification 
The knowledge about the linguistic context of nouns 
in the corpus that is collected by the pattern matcher 
is now used to classify unknown nouns. This involves 
a similarity measure between the linguistic contexts 
of classes of nouns that are in CORELEX and the 
linguistic context of unknown nouns. For this pur- 
pose the pattern matcher keeps two separate arrays, 
one that collects knowledge only on COrtELEx nouns 
and the other collecting knowledge on all nouns. 
The classifier uses mutual information (MI) scores 
rather than the raw frequences of the occurring pat- 
terns (Church and Hanks, 1990). Computing MI 
scores is by now a standard procedure for measuring 
the co-occurrence between objects relative to their 
overall occurrence. MI is defined in general as fol- 
lows: 
y) 
I ix y) = log2 P(x) P(y) 
We can use this definition to derive an estimate of 
the connectedness between words, in terms of collo- 
cations (Smadja, 1993), but also in terms of phrases 
and grammatical relations (Hindle, 1990). For in- 
stance the co-occurrence of verbs and the heads of 
their NP objects iN: size of the corpus, i.e. the num- 
ber of stems): 
N Cobj (v n) = log2 /(v) /(n) 
N N 
All nouns are now classified by running a simi- 
laxity measure over their MI scores and the MI 
scores of each CoRELEx class. For this we use the 
Jaccard measure that compares objects relative to 
the attributes they share (Grefenstette, 1994). In 
our case the 'attributes' are the different linguistic 
constructions a noun occurs in: headnoun-verb, 
adjective-headnoun, modifiernoun-headnoun, 
etc. 
The Jaccard measure is defined as the number of 
attributes shared by two objects divided by the total 
number of unique attributes shared by both objects: 
A 
A+B+C 
A : attributes shared by both objects 
B : attributes unique to object 1 
C : attributes unique to object 2 
The Jaccard scores for each CORELEx class are 
sorted and the class with the highest score is as- 
signed to the noun. If the highest score is equal to 
0, no class is assigned. 
The classification process is evaluated in terms of 
precision and recall figures, but not directly on the 
classified unknown nouns, because their precision is 
hard to measure. Rather we compute precision and 
recall on the classification of those nouns that are in 
CoreLex, because we can check their class automati- 
cally. The assumption then is that the precision and 
recall figures for the classification of nouns that are 
known correspond to those that are unknown. An 
additional measure of the effectiveness of the clas- 
sifter is measuring the recall on classification of all 
nouns, known and unknown. This number seems to 
correlate with the size of the corpus, in larger cor- 
pora more nouns are being classified, but not nec- 
essarily more correctly. Correct classification rather 
seems to depend on the homogeneity of the corpus: 
if it is written in one style, with one theme and so 
on. 
Recall of the classifier (percentage of all nouns that 
are classified > 0) is on average, among different 
larger corpora (> 100,000 tokens), about 80% to 
90%. Recall on the nouns in CoRELEx is between 
35% and 55%, while precision is between 20% and 
40%. The last number is much better on smaller cor- 
pora (70% on average). More detailed information 
about the performance of the classifier, matcher and 
acquisition tool (see below) can be obtained from 
(Buitelaar, forthcoming). 
3.4 Lexical Knowledge Acquisition 
The final step in the process of adapting CORELEx 
to a specific domain involves the 'translation' of ob- 
served syntactic patterns into corresponding seman- 
tic ones and generating a semantic lexicon represent- 
ing that information. 
31 
There are basically three kinds of semantic patterns 
that are utilized in a CORELEX lexicon: hyponymy 
(sub-supertype information) in the FORMAL role, 
meronymy (part-whole information) in the CONSTI- 
TUTIVE role and predicate-argument structure in the 
TELIC and AGENTIVE roles. There are no compelling 
reasons to exclude other kinds of information, but 
for now we base our basic design on ~£, which only 
includes these three in its definition of qualia struc- 
ture. 
Hyponymic information is acquired through the clas- 
sification process discussed in Sections 2.2 and 3.3. 
Meronymic information is obtained through a trans- 
lation of various VP and PP patterns into 'has-part' 
and 'part-of' relations. Predicate-argument struc- 
ture finally, is derived from verb-headnoun and 
headnoun-verb constructions. 
The semantic lexicon that is generated in such a 
way comes in two formats: T2)£, a Type De- 
scription Language based on typed feature-logic 
(Krieger and Schaefer, 1994a) (Krieger and Schae- 
fer, 1994b) and HTML, the markup language for the 
World Wide Web. The first provides a constraint- 
based formalism that allows CORELEX lexicons to 
be used stralghtforwardiy in constraint-based gram- 
mars. The second format is used to present a gen- 
erated semantic lexicon as a semantic index on a 
World Wide Web document. We will not elaborate 
on this further because the subject of semantic in- 
dexing is out of the scope of this paper, but we refer 
to (Pustejovsky et al., 1997). 
3.5 An Example: The PDGF Lexicon 
The semantic lexicon we generated for the PDGF 
corpus covers 1830 noun stems, spread over 81 
CORELEX types. For instance, the noun evidence 
is of type communication.psychological and the 
following representation is generated: 
4 Conclusion 
In this paper I discuss the construction of a new 
type of semantic lexicon that supports underspeci- 
fled semantic tagging. Traditional semantic tagging 
assumes a number of distinct senses for each lexical 
item between which the system should choose. Un- 
derspecified semantic tagging however assumes no 
finite lists of senses, but instead tags each lexical 
item with a comprehensive knowledge representa- 
tion from which a specific interpretation can be con- 
structed. CORZLEx provides such knowledge rep- 
resentations, and as such it is fundamentally differ- 
ent from existing semantic lexicons like WORDNET. 
32 
"evidence 
FORMAL "- 
= \[ARG1 = commlmlcation\]\] 
CLOSED LARG2 psychological J J 
CONSTITUTIVE -~ 
I HAS-PAI~ ----- 
TELIC = ip ov,.e \] 
FIRST L ARG-STRUCT ---~ ... 
REST ---- .o. 
FIRST = structure\] 1 
REST ... 
Figure 10: Lexical entry for: evidence 
Additionally, it was shown that CoI~LEx provides 
for more consistent assignments of lexical semantic 
structure among classes of lexical items. Finally, 
the approach described above allows one to gener- 
ate domain specific semantic lexicons by enhancing 
CORELEX lexical entries with corpus based informa- 
tion. 

References 
Brill, Eric. 1992. A simple rule-based part of speech 
tagger. In Procee~ngs of the Third Conference on 
Applied Natural Language Processing. ACL. 
Buitelaar, Paul. forthcoming. CORELEX: An 
Adaptable Semantic Lexicon with Systematic Pol- 
ysemous Classes. Ph.D. thesis, Brandeis Univer- 
sity, Department of Computer Science. 
Church, K. W. and P. Hanks. 1990. Word associa- 
tion norms, mutual information, and lexicography. 
Computational Linguistics, 16:22-29. 
Copestake, A. and E. Briscoe. 1992. Lexical opera- 
tions in a unification-based framework. In James 
Pustejovsky and Sabine Bergler, editors, Lez/ca/ 
Semantics and Knowledge Representation. Lec- 
ture Notes in Artificial Intelligence 627, pages 22- 
29, Berlin. Springer Verlag. 
Grefenstette, Gregory. 1994. Explorations in Au- 
tomatic Thesaurus Discovery. Kluwer Academic 
Press, Boston. 
Hindle, Donald. 1990. Noun classification from 
predicate-argument structures. In Proceedings of 
the 28th Annual Meeting of the ACL, pages 268- 
275. 
Hobbs, J., M. Stickel, P. Martin, and D. Edwards. 
1993. Interpretation as abduction. Artificial In- 
teUigence, 63. 
KiUgariff, Adam. 1992. Polysemy. Ph.D. thesis, 
University of Sussex, Brighton. 
Krieger, Hans-Ulrich and Ulrich Schaefer. 1994a. 
Tdl-a type description language for hpsg. partl: 
Overview. Technical Report RR-94-37, DFKI, 
Saarbruecken, Germany. 
Krieger, Hans-Ulrich and Ulrich Schaefer. 1994b. 
Tdloa type description language for hpsg. part2: 
Reference manual. Technical Report D-94-14, 
DFKI, Saarbruecken, Germany. 
Pustejovsky, J., B. Boguraev, M. Verhagen, P.P. 
Buitelaar, and M. Johnston. 1997. Semantic in- 
dexing and typed hyperlinking. In AAAI Spring 
1997 Workshop on Natural Language Processing 
for the World Wibe Web. 
Pustejovsky, James. 1995. The Generative Lexicon. 
MIT Press, Cambridge, MA. 
Pustejovsky, James, Bran Boguraev, and Michael 
Johnston. 1995. A core lexical engine: The con- 
textual determination of word sense. Technical 
report, Department of Computer Science, Bran- 
deis University.. 
Smadja, Frank. 1993. Retrieving collocations from 
text: Xtraet. Computational Linguistics, 19(1). 
Stevenson, Mark and Yorick Wilks. 1997. The 
grammar of sense: Is word-sense tagging much 
more than part-of-speech tagging? To appear in 
a Special Issue of Computational Linguistics on 
Word sense Disambiguation. 
Yarowsky, David. 1992. Word sense disambigua- 
tion using statistical models of roget's categories 
trained on large corpora. In Proceedings of COL- 
ING92, pages 454-460. 
