Using Syntactic Information to Extract Relevant Terms for Multi-Document
Summarization
Enrique Amigó, Julio Gonzalo, Víctor Peinado, Anselmo Peñas, Felisa Verdejo
Departamento de Lenguajes y Sistemas Informáticos
Universidad Nacional de Educación a Distancia
c/ Juan del Rosal, 16 - 28040 Madrid - Spain
http://nlp.uned.es
Abstract
The identification of the key concepts in a set of
documents is a useful source of information for
several information access applications. We are
interested in its application to multi-document
summarization, both for the automatic genera-
tion of summaries and for interactive summa-
rization systems.
In this paper, we study whether the syntactic po-
sition of terms in the texts can be used to predict
which terms are good candidates as key con-
cepts. Our experiments show that a) distance
to the verb is highly correlated with the proba-
bility of a term being part of a key concept; b)
subject modifiers are the best syntactic locations
to find relevant terms; and c) in the task of auto-
matically finding key terms, the combination of
statistical term weights with shallow syntactic
information gives better results than statistical
measures alone.
1 Introduction
The fundamental question addressed in this article
is: can syntactic information be used to find the
key concepts of a set of documents? We will pro-
vide empirical answers to this question in a multi-
document summarization environment.
The identification of key terms out of a set of doc-
uments is a common problem in information access
applications and, in particular, in text summariza-
tion: a fragment containing one or more key con-
cepts can be a good candidate to be part of a sum-
mary.
In single-document summarization, key terms are
usually obtained from the document title or head-
ing (Edmundson, 1969; Preston, 1994; Kupiec et
al., 1995). In multi-document summarization, how-
ever, some processing is needed to identify key con-
cepts (Lin and Hovy, 2002; Kraaij et al., 2002;
Schlesinger et al., 2002). Most approaches are
based on statistical criteria.
Criteria to elaborate a manual summary depend,
by and large, on the user's interpretation of both the
information need and the content of the documents.
This is why this task has also been attempted from
an interactive perspective (Boguraev et al., 1998;
Buyukkokten et al., 1999; Neff and Cooper, 1999;
Jones et al., 2002; Leuski et al., 2003). A standard
feature of such interactive summarization assistants
is that they offer a list of relevant terms (automati-
cally extracted from the documents) which the user
may select to decide or refine the focus of the sum-
mary.
Our hypothesis is that the key concepts of a doc-
ument set will tend to appear in certain syntactic
functions along the sentences and clauses of the
texts. To confirm this hypothesis, we have used
a test bed with manually produced summaries to
study:
• which are the most likely syntactic functions for the key concepts manually identified in the document sets;
• whether this information can be used to automatically extract the relevant terms from a set of documents, as compared to standard statistical term weights.
Our reference corpus is a set of 72 lists of key
concepts, manually elaborated by 9 subjects on
8 different topics, with 100 documents per topic.
It was built to study Information Synthesis tasks
(Amigó et al., 2004) and it is, to the best of
our knowledge, the multi-document summarization
test bed with the largest number of documents per
topic. This feature enables us to obtain reliable
statistics on term occurrences and prominent syn-
tactic functions.
The paper is organized as follows: in Section 2
we review the main approaches to the evaluation
of automatically extracted key concepts for summa-
rization. In Section 3 we describe the creation of the
reference corpus. In Section 4 we study the correla-
tion between key concepts and syntactic function in
texts, and in Section 5 we discuss the experimental
results of syntactic function as a predictor to extract
key concepts. Finally, in Section 6 we draw some
conclusions.
2 Evaluation of automatically extracted
key concepts
It is necessary, in the context of an interactive sum-
marization system, to measure the quality of the
terms suggested by the system, i.e., to what extent
they are related to the key topics of the document
set.
(Lin and Hovy, 1997) compared different strate-
gies to generate lists of relevant terms for summa-
rization using Topic Signatures. The evaluation was
extrinsic, comparing the quality of the summaries
generated by a system using different term lists as
input. The results, however, cannot be directly ex-
trapolated to interactive summarization systems, be-
cause the evaluation does not consider how informa-
tive terms are for a user.
From an interactive point of view, the evaluation
of term extraction approaches can be done, at least,
in two ways:
• Evaluating the summaries produced in the in-
teractive summarization process. This option
is difficult to implement (how do we evaluate
a human produced summary? What is the ref-
erence gold standard?) and, in any case, it is
too costly: every alternative approach would
require at least a few additional subjects per-
forming the summarization task.
• Comparing automatically generated term lists
with manually generated lists of key concepts.
For instance, (Jones et al., 2002) describes a
process of supervised learning of key concepts
from a training corpus of manually generated
lists of phrases associated with a single docu-
ment.
We will, therefore, use the second approach,
evaluating the quality of automatically generated
term lists by comparing them to lists of key con-
cepts which are generated by human subjects after a
multi-document summarization process.
3 Test bed: the ISCORPUS
We have created a reference test bed, the ISCOR-
PUS1 (Amigó et al., 2004), which contains 72 man-
ually generated reports summarizing the relevant in-
formation for a given topic contained in a large doc-
ument set.
For the creation of the corpus, nine subjects per-
formed a complex multi-document summarization
1 Available at http://nlp.uned.es/ISCORPUS.
task for eight different topics and one hundred rele-
vant documents per topic. After creating each topic-
oriented summary, subjects were asked to make a
list of relevant concepts for the topic, in two cate-
gories: relevant entities (people, organizations, etc.)
and relevant factors (such as “ethnic conflicts” as
the origin of a civil war) which play a key role in
the topic being summarized.
These are the relevant details of the ISCORPUS
test bed:
3.1 Document collection and topic set
We have used the Spanish CLEF 2001-2003 news
collection testbed (Peters et al., 2002), and selected
the eight topics with the largest number of docu-
ments manually judged as relevant from the CLEF
assessment pools. All the selected CLEF topics
have more than one hundred documents judged as
relevant by the CLEF assessors; for homogeneity,
we have restricted the task to the first 100 docu-
ments for each topic (using a chronological order).
This set of eight CLEF topics was found to have
two differentiated subsets: in six topics, it is neces-
sary to study how a situation evolves in time: the
importance of every event related to the topic can
only be established in relation to the others. The
invasion of Haiti by UN and USA troops is an ex-
ample of this kind of topic. We refer to them as
“Topic Tracking” (TT) topics, because they are suit-
able for such a task. The other two topics, how-
ever, resemble “Information Extraction” (IE) tasks:
essentially, the user has to detect and describe in-
stances of a generic event (here, cases of hunger
strikes and campaigns against racism in Europe);
hence we will refer to them as IE summaries.
3.2 Generation of manual summaries
Nine subjects between 25 and 35 years old were re-
cruited for the manual generation of summaries. All
subjects were given an in-place detailed description
of the task, in order to minimize divergent interpre-
tations. They were told they had to generate sum-
maries with a maximum of information about ev-
ery topic within a 50 sentence space limit, using a
maximum of 30 minutes per topic. The 50 sentence
limit can be temporarily exceeded; once the 30
minutes have expired, the user can still remove sen-
tences until the summary is again within the sen-
tence limit.
3.3 Manual identification of key concepts
After summarizing every topic, the following ques-
tionnaire was filled in by users:
• Who are the main people involved in the topic?
• What are the main organizations participating in the topic?
• What are the key factors in the topic?
Users provided free-text answers to these ques-
tions, with their freshly generated summary at hand.
We did not provide any suggestions or constraints
at this point, except that a maximum of eight slots
were available per question (i.e., a maximum of
8 × 3 = 24 key concepts per topic, per user).
This is, for instance, the answer of one user for
a topic about the invasion of Haiti by UN and USA
troops:
People:
• Jean Bertrand Aristide
• Clinton
• Raoul Cedras
• Philippe Biambi
• Michel Josep Francois

Organizations:
• ONU (UN)
• EEUU (USA)
• OEA (OAS)

Factors:
• militares golpistas (coup-attempting soldiers)
• golpe militar (coup attempt)
• restaurar la democracia (reinstatement of democracy)
Finally, a single list of key concepts is generated
for each topic, joining all the answers given by the
nine subjects. These lists of key concepts constitute
the gold standard for all the experiments described
below.
3.4 Shallow parsing of documents
Documents are processed with a robust shallow
parser based on finite automata. The parser splits
sentences into chunks and assigns a label to every
chunk. The set of labels is:
• [N]: noun phrases, which correspond to
nouns or adjectives preceded by a determiner,
punctuation sign, or the beginning of a sentence.
• [V]: verb forms.
• [Mod]: adverbial and prepositional phrases,
made up of noun phrases introduced by an ad-
verb or preposition. Note that this is the mech-
anism to express NP modifiers in Spanish (as
compared to English, where noun compound-
ing is equally frequent).
• [Sub]: words introducing new subordinate
clauses within a sentence (que, cuando, mien-
tras, etc.).
• [P]: punctuation marks.
This is an example output of the chunker:
Previamente [Mod] , [P] el presidente Bill Clinton [N] había
dicho [V] que [Sub] tenemos [V] la obligación [N] de cambiar la
política estadounidense [Mod] que [Sub] no ha funcionado [V] en
Haití [Mod] . [P]
Although the precision of the parser is limited,
the results are good enough for the statistical mea-
sures used in our experiments.
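As a rough illustration, the bracketed output above can be parsed back into (text, label) pairs with a few lines of Python. The output format and label set are inferred from the example; the actual finite-automata parser is, of course, more involved.

```python
import re

def parse_chunks(tagged):
    """Parse the chunker's bracketed output ("text [LABEL] ...") into
    (text, label) pairs. The format is inferred from the example above;
    this is an illustrative sketch, not the actual shallow parser."""
    return [(text.strip(), label)
            for text, label in re.findall(r"(.*?)\[(N|V|Mod|Sub|P)\]", tagged)]
```

For instance, `parse_chunks("Previamente [Mod] , [P]")` yields `[("Previamente", "Mod"), (",", "P")]`.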
4 Distribution of key concepts in syntactic
structures
We have extracted empirical data to answer these
questions:
• Is the probability of finding a key concept cor-
related with the distance to the verb in a sen-
tence or clause?
• Is the probability of finding a key concept in a
noun phrase correlated with the syntactic func-
tion of the phrase (subject, object, etc.)?
• Within a noun phrase, where is it more likely
to find key concepts: in the noun phrase head,
or in the modifiers?
We have used certain properties of Spanish syn-
tax (such as being an SVO language) to decide
which noun phrases play a subject function, which
are the head and modifiers of a noun phrase, etc. For
instance, NP modifiers usually appear after the NP
head in Spanish, and the specification of a concept
is usually made from left to right.
4.1 Distribution of key concepts with verb
distance
Figure 1 shows, for every topic, the probability of
finding a word from the manual list of key con-
cepts in fixed distances from the verb of a sen-
tence. Stop words are not considered for computing
word distance. The thicker line represents the aver-
age across topics, and the horizontal dashed line is
the average probability across all positions, i.e., the
probability that a word chosen at random belongs to
the list of key concepts.
The plot shows some clear tendencies in the data:
the probability gets higher when we get close to the
verb, falls abruptly after the verb, and then grows
steadily again. For TT topics, the probability of
finding relevant concepts immediately before the
verb is 56% larger than the average (0.39 before the
verb, versus 0.25 in any position). This is true not
only as an average, but also for all individual TT
topics. This can be an extremely valuable result: it
shows a direct correlation between the position of a
term in a sentence and the importance of the term
in the topic. Of course, this direct distance to the
verb should be adapted for languages with different
syntactic properties, and should be validated for dif-
ferent domains.
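The statistic behind Figure 1 can be sketched in a few lines. The sentence representation (a token list plus the index of the main verb), the stop-word list, and the key-term set are all illustrative assumptions rather than the paper's actual implementation.

```python
from collections import Counter

def keyword_prob_by_verb_distance(sentences, key_terms, stop_words):
    """For each signed word distance from the verb (negative = before it),
    estimate the probability that the word belongs to a key concept.
    Each sentence is a (words, verb_index) pair; stop words are skipped
    when counting distance, mirroring the setup described above.
    Assumes the verb itself is not in the stop-word list."""
    totals, hits = Counter(), Counter()
    for words, verb_idx in sentences:
        # drop stop words before measuring distances
        content = [(i, w) for i, w in enumerate(words) if w not in stop_words]
        verb_pos = next(k for k, (i, _) in enumerate(content) if i == verb_idx)
        for k, (_, w) in enumerate(content):
            d = k - verb_pos
            if d == 0:
                continue  # skip the verb itself
            totals[d] += 1
            if w in key_terms:
                hits[d] += 1
    return {d: hits[d] / totals[d] for d in totals}
```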
The behavior of TT and IE topics is substantially
different. IE topics have smaller probabilities over-
all, because there are fewer key concepts common to
all documents. For instance, if the topic is “cases of
hunger strikes”, there is little in common between
all the cases of hunger strikes found in the collection;
each case has its own relevant people and organiza-
tions. Users try to abstract away from individual
cases when writing key concepts, and so the num-
ber of key concepts is smaller. The tendency to
have larger probabilities just before the verb and
smaller probabilities just after the verb, however,
can also be observed for IE topics.

Figure 1: Probability of finding key concepts at fixed distances from the verb
Figure 2: Probability of finding key concepts in sub-
ject NPs versus other NPs
4.2 Key Concepts and Noun Phrase Syntactic
Function
We also wanted to confirm that it is more likely to
find a key concept in a subject noun phrase than
in NPs in general. For this, we have split compound
sentences in chunks, separating subordinate clauses
([Sub] type chunks). Then we have extracted se-
quences with the pattern [N][Mod]*. We assume
that the sentence subject is a sequence [N][Mod]*
occurring immediately before the verb. For in-
stance:
El presidente [N] en funciones [Mod] de
Haití [Mod] ha afirmado [V] que [Sub]...
The rest of [N] and [Mod] chunks are consid-
ered as part of the sentence verb phrase. In a ma-
jority of cases, these assumptions lead to a correct
identification of the sentence subject. We do not
capture, however, subjects of subordinate sentences
or subjects appearing after the verb.
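Under these SVO assumptions, subject detection reduces to scanning for an [N][Mod]* run that ends immediately before the first [V] chunk. A minimal sketch (the (text, label) chunk representation is our own assumption):

```python
def extract_subject(chunks):
    """Given (text, label) chunks for one clause, return the [N][Mod]*
    sequence immediately preceding the first [V] chunk, which is taken
    as the sentence subject (SVO assumption for Spanish). Returns []
    when no such sequence precedes the verb, so subjects of subordinate
    clauses or post-verbal subjects are not captured."""
    try:
        v = next(i for i, (_, lab) in enumerate(chunks) if lab == "V")
    except StopIteration:
        return []  # no verb in this clause
    subject = []
    i = v - 1
    # collect trailing [Mod] chunks, then one [N] head, right to left
    while i >= 0 and chunks[i][1] == "Mod":
        subject.append(chunks[i])
        i -= 1
    if i >= 0 and chunks[i][1] == "N":
        subject.append(chunks[i])
        return list(reversed(subject))
    return []  # no [N][Mod]* pattern directly before the verb
```

Applied to the example above, it returns the first three chunks as the subject.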
Figure 2 shows how the probability of finding a
key concept is always larger in sentence subjects.
This result supports the assumption in (Boguraev
et al., 1998), where noun phrases receive a higher
weight, as representative terms, if they are syntactic
subjects.
4.3 Distribution of key concepts within noun
phrases
Figure 3: Probability of finding key concepts in NP
head versus NP modifiers
For this analysis, we assume that, in
[N][Mod]* sequences identified as subjects,
[N] is the head and [Mod]* are the modifiers.
Figure 3 shows that the probability of finding a
key concept in the NP modifiers is always higher
than in the head (except for topic TT3, where it is
equal). This is not intuitive a priori; an examination
of the data reveals that the most characteristic con-
cepts for a topic tend to be in the complements: for
instance, in “the president of Haiti”, “Haiti” carries
more domain information than “president”. This
seems to be the most common case in our news
collection. Of course, it cannot be guaranteed that
these results will hold in other domains.
5 Automatic Selection of Key Terms
We have shown that there is indeed a correlation be-
tween syntactic information and the possibility of
finding a key concept. Now, we want to explore
whether this syntactic information can effectively
be used for the automatic extraction of key concepts.
The problem of extracting key concepts for sum-
marization involves two related issues: a) What
kinds of terms should be considered as candidates?
and b) What are the optimal weighting criteria for
them?
There are several possible answers to the first
question. Previous work includes using noun
phrases (Boguraev et al., 1998; Jones et al., 2002),
words (Buyukkokten et al., 1999), n-grams (Leuski
et al., 2003; Lin and Hovy, 1997) or proper
nouns, multi-word terms and abbreviations (Neff
and Cooper, 1999).
Here we will focus, however, on finding appro-
priate weighting schemes on the set of candidate
terms. The most common approach in interactive
single-document summarization is using tf.idf mea-
sures (Jones et al., 2002; Buyukkokten et al., 1999;
Neff and Cooper, 1999), which favour terms which
are frequent in a document and infrequent across
the collection. In the iNEATS system (Leuski et al.,
2003), the identification of relevant terms is ori-
ented towards multi-document summarization, and
they use a likelihood ratio (Dunning, 1993) which
favours terms which are representative of the set of
documents as opposed to the full collection.
Other sources of information that have been used
as complementary measures consider, for instance,
the number of references of a concept (Boguraev
et al., 1998), its localization (Jones et al., 2002)
or the distribution of the term along the document
(Buyukkokten et al., 1999; Boguraev et al., 1998).
5.1 Experimental setup
A technical difficulty is that the key concepts in-
troduced by the users are intellectual elaborations,
which result in complex expressions that might
not even be present (literally) in the documents.
Hence, we will concentrate on extracting lists of
terms, checking whether these terms are part of
some key concept. We will assume that, once key
terms are found, it is possible to generate full nomi-
nal expressions using, for instance, phrase browsing
strategies (Pe˜nas et al., 2002).
We will then compare different weighting criteria
to select key terms, using two evaluation measures:
a recall measure saying how well manually selected
key concepts are covered by the automatically gen-
erated term list; and a noise measure counting the
number of terms which do not belong to any key
concept. An optimal list will reach maximum recall
with a minimum of noise. Formally:

R = |Cl| / |C|        Noise = |Ln|
where C is the set of key concepts manually se-
lected by users; L is a (ranked) list of terms gen-
erated by some weighting schema; Ln is the subset
of terms in L which do not belong to any key con-
cept; and Cl is the subset of key concepts which are
represented by at least one term in the ranked list L.
Here is a (fictitious) example of how R and
Noise are computed:
C = {Haiti, reinstatement of democracy, UN and USA troops}
L = {Haiti, soldiers, UN, USA, attempt}
→ Cl = {Haiti, UN and USA troops}, R = 2/3
Ln = {soldiers, attempt}, Noise = 2
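These two measures can be computed directly from the definitions. The word-level matching (a key concept counts as covered if any of its words appears in the term list) is a simplifying assumption for illustration.

```python
def recall_and_noise(key_concepts, term_list):
    """Recall/Noise as defined above: a key concept is covered if at
    least one of its words appears in the term list; Noise counts terms
    that belong to no key concept. Matching is naive whole-word
    overlap, a simplifying assumption."""
    concept_words = {c: set(c.lower().split()) for c in key_concepts}
    terms = {t.lower() for t in term_list}
    covered = {c for c, ws in concept_words.items() if ws & terms}
    all_words = set().union(*concept_words.values())
    noise_terms = terms - all_words
    recall = len(covered) / len(key_concepts)
    return recall, len(noise_terms)
```

On the fictitious example above it reproduces R = 2/3 and Noise = 2.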
We will compare the following weighting strate-
gies:
TF The frequency of a word in the set of documents
is taken as a baseline measure.
Likelihood ratio This is taken from (Leuski et al.,
2003) and used as a reference measure. We
have implemented the procedure described in
(Rayson and Garside, 2000) using unigrams
only.
OKAPImod We have also considered a measure
derived from Okapi and used in (Robertson et
al., 1992). We have adapted the measure to
consider the set of 100 documents as one single
document.
TFSYNTAX Using our first experimental result,
TFSYNTAX computes the weight of a term
as the number of times it appears preceding a
verb.
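TFSYNTAX itself is a simple count. A sketch, reusing the hypothetical (words, verb_index) sentence representation from before:

```python
from collections import Counter

def tfsyntax_weights(sentences, stop_words):
    """TFSYNTAX as described above: a term's weight is the number of
    times it occurs immediately before the verb, with stop words skipped
    when measuring adjacency. Each sentence is a (words, verb_index)
    pair; only pre-verbal words ever receive a nonzero weight."""
    weights = Counter()
    for words, verb_idx in sentences:
        for w in reversed(words[:verb_idx]):
            if w not in stop_words:
                weights[w] += 1
                break  # count only the word immediately preceding the verb
    return weights
```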
Figure 4: Comparison of weighting schemes to ex-
tract relevant terms
5.2 Results
Figure 4 draws Recall/Noise curves for all weight-
ing criteria. They all give similar results except our
TFSYNTAX measure, which performs better than
the others for TT topics. Note that the TFSYN-
TAX measure only considers 10% of the vocabu-
lary, which are the words immediately preceding
verbs in the texts.
In order to check whether this result is consistent
across topics (and not merely an effect of averaging),
we have compared recall for term lists of size 50 for
individual topics. We have selected 50 as a number
which is large enough to reach a good coverage and
permit additional filtering in an interactive summa-
rization process, such as the iNEATS terminological
clustering described in (Leuski et al., 2003).
Figure 5 shows these results by topic. TFSYN-
TAX performs consistently better for all topics ex-
cept one of the IE topics, where the likelihood ratio
measure is slightly better.
Apart from the fact that TFSYNTAX performs
better than all the other methods, it is worth noting
that sophisticated weighting mechanisms, such as
Okapi and the likelihood ratio, do not behave bet-
ter than a simple frequency count (TF).
6 Conclusions
The automatic extraction of relevant concepts for
a set of related documents is a part of many mod-
els of automatic or interactive summarization. In
this paper, we have analyzed the distribution of rel-
evant concepts across different syntactic functions,
and we have measured the usefulness of detecting
key terms to extract relevant concepts.
Our results suggest that the distribution of key
concepts in sentences is not uniform, having a max-
imum in positions immediately preceding the sen-
tence main verb, in noun phrases acting as subjects
and, more specifically, in the complements (rather
than the head) of noun phrases acting as subjects.
This evidence has been collected using a Spanish
news collection, and should be corroborated outside
the news domain and also adapted for use with
non-SVO languages.
We have also obtained empirical evidence that
statistical weights to select key terms can be im-
proved if we restrict candidate words to those which
precede the verb in some sentence. The combi-
nation of statistical measures and syntactic criteria
outperforms purely statistical weights, at least for TT
topics, where there is a certain consistency in the key
concepts across documents.
Acknowledgments
This research has been partially supported by a re-
search grant of the Spanish Government (project
Hermes) and a research grant from UNED. We are
indebted to J. Cigarrán, who calculated the Okapi
weights used in this work.
Figure 5: Comparison of weighting schemes by topic

References
E. Amigó, J. Gonzalo, V. Peinado, A. Peñas, and
F. Verdejo. 2004. Information synthesis: an em-
pirical study. In Proceedings of the 42nd Annual
Meeting of the ACL, Barcelona, July.

B. Boguraev, C. Kennedy, R. Bellamy, S. Brawer,
Y. Wong, and J. Swartz. 1998. Dynamic Presen-
tation of Document Content for Rapid On-line
Skimming. In Proceedings of the AAAI Spring
1998 Symposium on Intelligent Text Summariza-
tion, Stanford, CA.

O. Buyukkokten, H. García-Molina, and
A. Paepcke. 1999. Seeing the Whole in
Parts: Text Summarization for Web Browsing
on Handheld Devices. In Proceedings of the 10th
International WWW Conference.

T. Dunning. 1993. Accurate Methods for the Statis-
tics of Surprise and Coincidence. Computational
Linguistics, 19(1):61–74.

H. P. Edmundson. 1969. New Methods in Auto-
matic Extracting. Journal of the Association for
Computing Machinery, 16(2):264–285.

S. Jones, S. Lundy, and G. W. Paynter. 2002. In-
teractive Document Summarization Using Auto-
matically Extracted Keyphrases. In Proceedings
of the 35th Hawaii International Conference on
System Sciences, Big Island, Hawaii.

W. Kraaij, M. Spitters, and A. Hulth. 2002.
Headline Extraction based on a Combination of
Uni- and Multi-Document Summarization Tech-
niques. In Proceedings of the DUC 2002 Work-
shop on Multi-Document Summarization Evalua-
tion, Philadelphia, PA, July.

J. Kupiec, J. Pedersen, and F. Chen. 1995. A train-
able document summarizer. In Proceedings of SI-
GIR’95.

A. Leuski, C. Y. Lin, and S. Stubblebine. 2003.
iNEATS: Interactive Multidocument Summariza-
tion. In Proceedings of the 41st Annual Meeting
of the ACL (ACL 2003), Sapporo, Japan.

C.-Y. Lin and E.H. Hovy. 1997. Identifying Top-
ics by Position. In Proceedings of the 5th Con-
ference on Applied Natural Language Processing
(ANLP), Washington, DC.

C. Lin and E. Hovy. 2002. NeATS in DUC
2002. In Proceedings of the DUC 2002 Work-
shop on Multi-Document Summarization Evalu-
ation, Philadelphia, PA, July.

M. S. Neff and J. W. Cooper. 1999. ASHRAM: Ac-
tive Summarization and Markup. In Proceedings
of HICSS-32: Understanding Digital Documents.

A. Peñas, F. Verdejo, and J. Gonzalo. 2002. Ter-
minology Retrieval: Towards a Synergy be-
tween Thesaurus and Free Text Searching. In IB-
ERAMIA 2002, pages 684–693, Sevilla, Spain.

C. Peters, M. Braschler, J. Gonzalo, and M. Kluck,
editors. 2002. Evaluation of Cross-Language
Information Retrieval Systems, volume 2406 of
Lecture Notes in Computer Science. Springer-
Verlag, Berlin-Heidelberg-New York.

K. Preston and S. Williams. 1994. Managing the
Information Overload. Physics in Business, June.

P. Rayson and R. Garside. 2000. Comparing Cor-
pora Using Frequency Profiling. In Proceedings
of the workshop on Comparing Corpora, pages
1–6, Hong Kong.

S. E. Robertson, S. Walker, M. Hancock-Beaulieu,
A. Gull, and M. Lau. 1992. Okapi at TREC. In
Text REtrieval Conference, pages 21–30.

J. D. Schlesinger, M. E. Okurowski, J. M. Conroy,
D. P. O’Leary, A. Taylor, J. Hobbs, and H. Wil-
son. 2002. Understanding Machine Performance
in the Context of Human Performance for Multi-
Document Summarization. In Proceedings of the
DUC 2002 Workshop on Multi-Document Sum-
marization Evaluation, Philadelphia, PA, July.
