Learning Correlations between Linguistic Indicators and Semantic 
Constraints: 
Reuse of Context-Dependent Descriptions of Entities 
Dragomir R. Radev 
Department of Computer Science 
Columbia University 
New York, NY 10027 
radev@cs.columbia.edu 
Abstract 
This paper presents the results of a study on the 
semantic constraints imposed on lexical choice 
by certain contextual indicators. We show how 
such indicators are computed and how correla- 
tions between them and the choice of a noun 
phrase description of a named entity can be au- 
tomatically established using supervised learn- 
ing. Based on this correlation, we have devel- 
oped a technique for automatic lexical choice of 
descriptions of entities in text generation. We 
discuss the underlying relationship between the 
pragmatics of choosing an appropriate descrip- 
tion that serves a specific purpose in the auto- 
matically generated text and the semantics of 
the description itself. We present our work in 
the framework of the more general concept of 
reuse of linguistic structures that are automati- 
cally extracted from large corpora. We present 
a formal evaluation of our approach and we con- 
clude with some thoughts on potential applica- 
tions of our method. 
1 Introduction 
Human writers constantly make deliberate deci- 
sions about picking a particular way of express- 
ing a certain concept. These decisions are made 
based on the topic of the text and the effect that 
the writer wants to achieve. Such contextual 
and pragmatic constraints are obvious to ex- 
perienced writers who produce context-specific 
text without much effort. However, in order for 
a computer to produce text in a similar way, 
either these constraints have to be added man- 
ually by an expert or the system must be able 
to acquire them in an automatic way. 
An example related to the lexical choice of 
an appropriate nominal description of a person 
should make the above clear. Even though it 
seems intuitive that Bill Clinton should always 
be described with the NP "U. S. president" or a 
variation thereof, it turns out that many other 
descriptions appear in on-line news stories that 
characterize him in light of the topic of the arti- 
cle. For example, an article from 1996 on elec- 
tions uses "Bill Clinton, the democratic pres- 
idential candidate", while a 1997 article on a 
false bomb alert in Little Rock, Ark. uses "Bill 
Clinton, an Arkansas native". 
This paper presents the results of a study of 
the correlation between named entities (people, 
places, or organizations) and noun phrases used 
to describe them in a corpus. 
Intuitively, the use of a description is based on 
a deliberate decision on the part of the author 
of a piece of text. A writer is likely to select a 
description that puts the entity in the context 
of the rest of the article. 
It is known that the distribution of words in 
a document is related to its topic (Salton and 
McGill, 1983). We have developed related tech- 
niques for approximating pragmatic constraints 
using words that appear in the immediate con- 
text of the entity. 
We will show that context influences the 
choice of a description, as do several other lin- 
guistic indicators. Each of the indicators by it- 
self doesn't provide enough empirical data that 
distinguishes among all descriptions that are re- 
lated to a an entity. However, a carefully se- 
lected combination of such indicators provides 
enough information in order pick an appropriate 
description with more than 80% accuracy. 
Section 2 describes how we can automatically 
obtain enough constraints on the usage of de- 
scriptions. In Section 3, we show how such con- 
structions are related to language reuse. 
In Section 4 we describe our experimental 
setup and the algorithms that we have designed. 
Section 5 includes a description of our results. 
1072 
In Section 6 we discuss some possible exten- 
sions to our study and we provide some thoughts 
about possible uses of our framework. 
2 Problem Description 
Let's define the relation DescriptionOf(E) to 
be the one between a named entity E and a 
noun phrase, D, describing the named entity. 
In the example shown in Table 1, there are two 
entity-description pairs. 
DescriptionOf ("Tareq Aziz") = "Iraq's 
Deputy Prime Minister" 
DescriptionOf ("Richard Butler") = "Chief 
U.N. arms inspector" 
Chief U.N. arms inspector Richard Butler 
met Iraq's Deputy Prime Minister Tareq Aziz 
Monday after rejecting Iraqi attempts to set 
deadlines for finishing his work. 
Figure 1: Sample sentence containing two 
entity-description pairs. 
Each entity appearing in a text can have mul- 
tiple descriptions (up to several dozen) associ- 
ated with it. 
We call the set of all descriptions related to 
the same entity in a corpus, a profile of that 
entity. Profiles for a large number of entities 
were compiled using our earlier system, PRO- 
FILE (Radev and McKeown, 1997). It turns 
out that there is a large variety in the size of 
the profile (number of distinct descriptions) for 
different entities. Table 1 shows a subset of the 
profile for Ung Huot, the former foreign minister 
of Cambodia, who was elected prime minister at 
some point of time during the run of our exper- 
iment. A few sample semantic features of the 
descriptions in Table 1 are shown as separate 
columns. 
We used information extraction techniques to 
collect entities and descriptions from a corpus 
and analyzed their lexical and semantic proper- 
ties. 
We have processed 178 MB 1 of newswire 
and analyzed the use of descriptions related 
to 11,504 entities. Even though PROFILE ex- 
tracts other entities in addition to people (e.g., 
1The corpus contains 19,473 news stories that cover 
the period October 1, 1997 - January 9, 1998 that were 
available through PROFILE. 
places and organizations), we have restricted 
our analysis to names of people only. We claim, 
however, that a large portion of our findings re- 
late to the other types of entities as well. 
We have investigated 35,206 tuples, consist- 
ing of an entity, a description, an article ID, 
and the position (sentence number) in the arti- 
cle in which the entity-description pair occurs. 
Since there are 11,504 distinct entities, we had 
on average 3.06 distinct descriptions per entity 
(DDPE). Table 2 shows the distribution of 
DDPE values across the corpus. Notice that a 
large number of entities (9,053 out of the 11,504) 
have a single description. These are not as in- 
teresting for our analysis as the remaining 2,451 
entities that have DDPE values between 2 and 
24. 
10' :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: !F!il i!i i i!i!i 
i if! ii !iiii!i 
'o' a!!!!!~!~iii!iiii!!~i!!~: !!iii!~i!~!!i~!~i!iiii~ii~..: ~!!i~iiiiiiii~i!~!i!iiiii~i~!!!~i!i!iiii~i\]i iil i ! 
,o ° ,o' ,o' 
X - Number of d~inct ~l~crlpl~ne per ent Ily (DDPE) 
Figure 2: Number of distinct descriptions per 
entity (log-log scale) 
3 Language Reuse in Text 
Generation 
Text generation usually involves lexical choice - 
that is, choosing one way of referring to an en- 
tity over another. Lexical choice refers to a vari- 
ety of decisions that have to made in text gener- 
ation. For example, picking one among several 
equivalent (or nearly equivalent) constructions 
is a form of lexical choice (e.g., "The Utah Jazz 
handed the Boston Celtics a de fear' vs. "The 
Utah Jazz defeated the Boston Celtics" (Robin, 
1994)). We are interested in a different aspect 
of the problem: namely learning the rules that 
can be used for automatically selecting an ap- 
propriate description of an entity in a specific 
1073 
Description 
a senior member 
Cambodia's 
Cambodian foreign minister 
co-premier 
first prime minister 
foreign minister 
His Excellency 
Mr. 
new co-premier 
new first prime minister 
newly-appointed first prime minister 
premier 
prime minister 
addressing 
X 
Semantic categories 
country male new political post 
X 
X 
X 
Table 1: Profile of Ung Huot 
cou at 
~6 
12 
lO 
8 
2 
3 
8 
9 
10 
11 
12 
13 
14 
9,053 
1,481 
472 
182 
112 
74 
31 
X 
X 
X 
seniority 
X 
X 
X 
X 
X 
X 
X 
X 
X 
X 
DDPE ~4 
16 
17 
18 
19 
24 
Table 2: Number of distinct descriptions per entity (DDPE). 
context. 
To be feasible and scaleable, a technique for 
solving a particular case of the problem of lex- 
icai choice must involve automated learning. It 
is also useful if the technique can specify enough 
constraints on the text to be generated so that 
the number of possible surface realizations that 
match the semantic constraints is reduced sig- 
nificantly. The easiest case in which lexical 
choice can be made is when the full surface 
structure can be used, and when it has been au- 
tomatically extracted from a corpus. Of course, 
the constraints on the use of the structure in the 
generated text have to be reasonably similar to 
the ones in the source text. 
We have found that a natural application for 
the analysis of entity-description pairs is lan- 
guage reuse, which includes techniques of ex- 
tracting shallow structure from a corpus and 
applying that structure to computer-generated 
texts. 
Language reuse involves two components: a 
source text written by a human and a target 
text, that is to be automatically generated by 
a computer, partially making use of structures 
reused from the source text. The source text 
is the one from which particular surface struc- 
tures are extracted automatically, along with 
the appropriate syntactic, semantic, and prag- 
matic constraints under which they are used. 
Some examples of language reuse include col- 
location analysis (Smadja, 1993), the use of 
entire factual sentences extracted from cor- 
pora (e.g., "'Toy Story' is the Academy Award 
winning animated film developed by Pixar~'), 
and summarization using sentence extraction 
(Paice, 1990; Kupiec et al., 1995). In the case 
of summarization through sentence extraction, 
the target text has the additional property of 
being a subtext of the source text. Other tech- 
niques that can be broadly categorized as lan- 
guage reuse are learning relations from on-line 
texts (Mitchell, 1997) and answering natural 
language questions using an on-line encyclope- 
dia (Kupiec, 1993). 
Stydying the concept of language reuse is re- 
warding because it allows generation systems to 
leverage on texts written by humans and their 
deliberate choice of words, facts, structure. 
We mentioned that for language reuse to take 
1074 
place, the generation system has to use the same 
surface structure in the same syntactic, seman- 
tic, and pragmatic context as the source text 
from which it was extracted. Obviously, all of 
this information is typically not available to a 
generation system. There are some special cases 
in which most of it can be automatically com- 
puted. 
Descriptions of entities are a particular in- 
stance of a surface structure that can be reused 
relatively easily. Syntactic constraints related 
to the use of descriptions are modest - since de- 
scriptions are always noun phrases that appear 
as either pre-modifiers or appositions 2, they are 
quite flexibly usable in any generated text in 
which an entity can be modified with an ap- 
propriate description. We will show in the rest 
of the paper how the requisite semantic (i.e., 
"what is the meaning of the description to pick") 
and pragmatic constraints (i.e., "what purpose 
does using the description achieve ?') can be ex- 
tracted automatically. 
Given a profile like the one shown in Table 1, 
and an appropriate set of semantic constraints 
(columns 2-7 of the table), the generation com- 
ponent needs to perform a profile lookup and 
select a row (description) that satisfies most or 
all semantic constraints. For example, if the se- 
mantic constraints specify that the description 
has to include the country and the political po- 
sition of Ung Huot, the most appropriate de- 
scription is "Cambodian foreign minister". 
4 Experimental Setup 
In our experiments, we have used two widely 
available tools - WordNet and Ripper. 
WordNet (Miller et al., 1990) is an on-line 
hierarchical lexical database which contains se- 
mantic information about English words (in- 
cluding hypernymy relations which we use in 
our system). We use chains of hypernyms when 
we need to approximate the usage of a particu- 
lar word in a description using its ancestor and 
sibling nodes in WordNet. Particularly useful 
for our application are the synset offsets of the 
words in a description. The synset offset is a 
number that uniquely identifies a concept node 
(synset) in the WordNet hierarchy. Figure 3 
shows that the synset offset for the concept "ad- 
ministrator, decision maker" is "(07063507}', 
2We haven't included relative clauses in our study. 
while its hypernym, "head, chief, top dog" has 
a synset offset of "~07311393} ". 
Ripper (Cohen, 1995) is an algorithm that 
learns rules from example tuples in a relation. 
Attributes in the tuples can be integers (e.g., 
length of an article, in words), sets (e.g., se- 
mantic features), or bags (e.g., words that ap- 
pear in a sentence or document). We use Rip- 
per to learn rules that correlate context and 
other linguistic indicators with the semantics 
of the description being extracted and subse- 
quently reused. It is important to notice that 
Ripper is designed to learn rules that classify 
data into atomic classes (e.g., "good", "aver- 
age", and "bad"). We had to modify its al- 
gorithm in order to classify data into sets of 
atoms. For example, a rule can have the form 
"if CONDITION then \[( 07063762} { 02864326} 
{ 0001795~}\] "3 . This rule states that if a certain 
"CONDITION" (which is a function of the in- 
dicators related to the description) is met, then 
the description is likely to contain words that 
are semantically related to the three WordNet 
nodes \[{07063762} {02864326} {00017954}\]. 
The stages of our experiments are described 
in detail in the remainder of this section. 
4.1 Semantic tagging of descriptions 
Our system, PROFILE, processes WWW- 
accessible newswire on a round-the-clock basis 
and extracts entities (people, places, and orga- 
nizations) along with related descriptions. The 
extraction grammar, developed in CREP (Du- 
ford, 1993), covers a variety of pre-modifier and 
appositional noun phrases. 
For each word wi in a description, we use a 
version of WordNet to extract the synset offset 
of the immediate parent of wi. 
4.2 Finding linguistic cues 
Initially, we were interested in discovering rules 
manually and then validating them using the 
learning algorithm. However, the task proved 
(nearly) impossible considering the sheer size 
of the corpus. One possible rule that we hy- 
pothesized and wanted to verify empirically at 
this stage was parallelism. This linguistically- 
motivated rule states that in a sentence with 
a parallel structure (consider, for instance, the 
3These offsets correspond to the WordNet nodes 
"manager", "internetn, and "group" 
1075 
DIRECTOR: {07063762} director, manager, managing director 
=~ {07063507} administrator, decision maker 
=~ {07311393} head, chief, top dog 
=~ {06950891} leader 
=~ {00004123} person, individual, someone, somebody, mortal, human, soul 
=~ {00002086} life form, organism, being, living thing 
=~ {00001740} entity, something 
Figure 3: Hypernym chain of "director" in WordNet, showing synset offsets. 
sentence fragment "... Alija Izetbegovic, a Mus- 
lim, Kresimir Zubak, a Croat, and Momcilo 
Krajisnik, a Serb... ") all entities involved have 
similar descriptions. However, rules at such a 
detailed syntactic level take too long to process 
on a 180 MB corpus and, further, no more than 
a handful of such rules can be discovered manu- 
ally. As a result, we made a decision to extract 
all indicators automatically. We would also like 
to note that using syntactic information on such 
a large corpus doesn't appear particularly fea- 
sible. We limited therefore our investigation 
to lexicai, semantic, and contextual indicators 
only. The following subsection describes the at- 
tributes used. 
4.3 Extracting linguistic cues 
automatically 
The list of indicators that we use in our system 
are the following: 
• Context: (using a window of size 4, ex- 
cluding the actual description used, but 
not the entity itself) - e.g., "\['clinton' 
'clinton' 'counsel' 'counsel' 'decision' 'deci- 
sion' 'gore' 'gore' 'ind' 'ind' 'index' 'news' 
'november' 'wednesday'\]" is a bag of words 
found near the description of Bill Clinton 
in the training corpus. 
• Length of the article: - an integer. 
• Name of the entity: - e.g., "Bill Clin- 
ton". 
• Profile: The entire profile related to a per- 
son (all descriptions of that person that are 
found in the training corpus). 
• Synset Offsets: - the WordNet node num- 
bers of all words (and their parents)) that 
appear in the profile associated with the 
entity that we want to describe. 
4.4 Applying machine learning method 
To learn rules, we ran Ripper on 90% (10,353) 
of the entities in the entire corpus. We kept the 
remaining 10% (or 1,151 entities) for evaluation. 
Sample rules discovered by the system are 
shown in Table 3. 
5 Results and Evaluation 
We have performed a standard evaluation of the 
precision and recall that our system achieves in 
selecting a description. Table 4 shows our re- 
sults under two sets of parameters. 
Precision and recall are based on how well the 
system predicts a set of semantic constraints. 
Precision (or P) is defined to be the number of 
matches divided by the number of elements in 
the predicted set. Recall (or R) is the number 
of matches divided by the number of elements 
in the correct set. If, for example, the system 
predicts \[A\] \[B\] \[C\], but the set of constraints 
on the actual description is \[B\] \[D\], we would 
compute that P = 33.3% and R --- 50.0%. Ta- 
ble 4 reports the average values of P and R for 
all training examples 4. 
Selecting appropriate descriptions based on 
our algorithm is feasible even though the val- 
ues of precision and recall obtained may seem 
only moderately high. The reason for this is 
that the problem that we axe trying to solve is 
underspecified. That is, in the same context, 
more than one description can be potentially 
used. Mutually interchangeable descriptions in- 
clude synonyms and near synonyms ("leader" 
vs. "chief) or pairs of descriptions of different 
generality (U.S. president vs. president). This 
4We run Ripper in a so-called "noise-free mode", 
which causes the condition parts of the rules it discovers 
to be mutually exclusive and therefore, the values of P 
and R on the training data are both 100~. 
1076 
Rule Decision 
IF CONTEXT - inflation {09613349} (politician) 
IF PROFILES " detective AND CONTEXT " agency {07485319} (policeman) 
IF CONTEXT - celine {07032298} (north_american) 
Table 3" Sample rules discovered by the system. 
Training set size 
500 
1,000 
2,000 
5,000 
10,000 
15,000 
20,000 
25,000 
30,000 
50,000 
word nodes only 
Precision Recall 
64.29% 2.86% 
71.43% 2.86% 
42.86% 40.71% 
59.33% 48.40% 
69.72% 45.04% 
76.24% 44.02% 
76.25% 49.91% 
83.37% 52.26% 
80.14% 50.55% 
83.13% 58.54% 
word and parent nodes 
Precision 
78.57% 
85.71% 
67.86% 
64.67% 
74.44% 
73.39% 
79.08% 
82.39% 
82.77% 
88.87% 
Recall 
2.86% 
2.86% 
62.14% 
53.73% 
59.32% 
53.17% 
58.70% 
57.49% 
57.66% 
63.39% 
Table 4: Values for precision and recall using word nodes only (left) and both word and parent 
nodes (right). 
type of evaluation requires the availability of hu- 
man judges. 
There are two parts to the evaluation: how 
well does the system performs in selecting se- 
mantic features (WordNet nodes) and how well 
it works in constraining the choice of a descrip- 
tion. To select a description, our system does a 
lookup in the profile for a possible description 
that satisfies most semantic constraints (e.g., we 
select a row in Table 1 based on constraints on 
the columns). 
Our system depends crucially on the multiple 
components that we use. For example, the shal- 
low CREP grammar that is used in extracting 
entities and descriptions often fails to extract 
good descriptions, mostly due to incorrect PP 
attachment. We have also had problems from 
the part-of-speech tagger and, as a result, we 
occasionally incorrectly extract word sequences 
that do not represent descriptions. 
6 Applications and Future Work 
We should note that PROFILE is part of a 
large system for information retrieval and sum- 
marization of news through information extrac- 
tion and symbolic text generation (McKeown 
and Radev, 1995). We intend to use PROFILE 
to improve lexical choice in the summary gen- 
eration component, especially when producing 
user-centered summaries or summary updates 
(Radev and McKeown, 1998 to appear). There 
are two particularly appealing cases - (1) when 
the extraction component has failed to extract a 
description and (2) when the user model (user's 
interests, knowledge of the entity and personal 
preferences for sources of information and for ei- 
ther conciseness or verbosity) dictates that a de- 
scription should be used even when one doesn't 
appear in the texts being summarized. 
A second potentially interesting application 
involves using the data and rules extracted by 
PROFILE for language regeneration. In (Radev 
and McKeown, 1998 to appear) we show how 
the conversion of extracted descriptions into 
components of a generation grammar allows for 
flexible (re)generation of new descriptions that 
don't appear in the source text. For example, 
a description can be replaced by a more general 
one, two descriptions can be combined to form 
a single one, or one long description can be de- 
constructed into its components, some of which 
can be reused as new descriptions. 
We are also interested in investigating an- 
other idea - that of predicting the use of a de- 
scription of an entity even when the correspond- 
ing profile doesn't contain any description at all, 
or when it contains only descriptions that con- 
tain words that are not directly related to the 
words predicted by the rules of PROFILE. In 
this case, if the system predicts a semantic cat- 
1077 
egory that doesn't match any of the descriptions 
in a specific profile, two things can be done: (1) 
if there is a single description in the profile, to 
pick that one, and (2) if there is more than one 
description, pick the one whose semantic vector 
is closest to the predicted semantic vector. 
Finally, the profile extractor will be used as 
part of a large-scale, automatically generated 
Who's who site which will be accessible both 
by users through a Web interface and by NLP 
systems through a client-server API. 
7 Conclusion 
In this paper, we showed that context and other 
linguistic indicators correlate with the choice 
of a particular noun phrase to describe an en- 
tity. Using machine learning techniques from a 
very large corpus, we automatically extracted 
a large set of rules that predict the choice of a 
description out of an entity profile. We showed 
that high-precision automatic prediction of an 
appropriate description in a specific context is 
possible. 
8 Acknowledgments 
This material is based upon work supported by 
the National Science Foundation under Grants 
No. IRI-96-19124, IRI-96-18797, and CDA-96- 
25374, as well as a grant from Columbia Uni- 
versity's Strategic Initiative Fund sponsored by 
the Provost's Office. Any opinions, findings, 
and conclusions or recommendations expressed 
in this material are those of the author(s) and 
do not necessarily reflect the views of the Na- 
tional Science Foundation. 
The author is grateful to the following 
people for their comments and suggestions: 
Kathy McKeown, Vasileios Hatzivassiloglou, 
and Hongyan Jing. 

References 
William W. Cohen. 1995. Fast effective rule 
induction. In Proc. 12th International Con- 
ference on Machine Learning, pages 115-123. 
Morgan Kaufmann. 
Darrin Duford. 1993. CREP: a regular 
expression-matching textual corpus tool. 
Technical Report CUCS-005-93, Columbia 
University. 
Julian M. Kupiec, Jan Pedersen, and Francine 
Chen. 1995. A trainable document summa- 
rizer. In Proceedings, 18th Annual Interna- 
tional ACM SIGIR Conference on Research 
and Development in Information Retrieval, 
pages 68-73, Seattle, Washington, July. 
Julian M. Kupiec. 1993. MURAX: A robust 
linguistic approach for question answering us- 
ing an on-line encyclopedia. In Proceedings, 
16th Annual International A CM SIGIR Con- 
ference on Research and Development in In- 
formation Retrieval. 
Kathleen R. McKeown and Dragomir R. Radev. 
1995. Generating summaries of multiple news 
articles. In Proceedings, 18th Annual Interna- 
tional A CM SIGIR Conference on Research 
and Development in Information Retrieval, 
pages 74-82, Seattle, Washington, July. 
George A. Miller, Richard Beckwith, Christiane 
Fellbaum, Derek Gross, and Katherine J. 
Miller. 1990. Introduction to WordNet: An 
on-line lexical database. International Jour- 
hal of Lexicography (special issue), 3(4):235- 
312. 
Tom M. Mitchell. 1997. Does machine learning 
really work? AI Magazine, 18(3). 
Chris Paice. 1990. Constructing literature 
abstracts by computer: Techniques and 
prospects. Information Processing and Man- 
agement, 26:171-186. 
Dragomir R. Radev and Kathleen R. McKe- 
own. 1997. Building a generation knowledge 
source using internet-accessible newswire. In 
Proceedings of the 5th Conference on Ap- 
plied Natural Language Processing, Washing- 
ton, DC, April. 
Dragomir R. Radev and Kathleen R. McK- 
eown. 1998, to appear. Generating natu- 
ral language summaries from multiple on-line 
sources. Computational Linguistics. 
Jacques Robin. 1994. Revision-Based Gener- 
ation of Natural Language Summaries Pro- 
viding Historical Background. Ph.D. the- 
sis, Computer Science Department, Columbia 
University. 
G. Salton and M.J. McGill. 1983. Introduction 
to Modern Information Retrieval. Computer 
Series. McGraw Hill, New York. 
Frank Smadja. 1993. Retrieving collocations 
from text: Xtract. Computational Linguis- 
tics, 19(1):143-177, March. 
