Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language
Processing (HLT/EMNLP), pages 33–40, Vancouver, October 2005. c©2005 Association for Computational Linguistics
Improving Multilingual Summarization: Using Redundancy in the Input to
Correct MT errors
Advaith Siddharthan and Kathleen McKeown
Columbia University Computer Science Department
1214 Amsterdam Avenue, New York, NY 10027, USA.
a0 as372,kathy
a1 @cs.columbia.edu
Abstract
In this paper, we use the information re-
dundancy in multilingual input to correct
errors in machine translation and thus im-
prove the quality of multilingual sum-
maries. We consider the case of multi-
document summarization, where the input
documents are in Arabic, and the output
summary is in English. Typically, infor-
mation that makes it to a summary appears
in many different lexical-syntactic forms
in the input documents. Further, the use of
multiple machine translation systems pro-
vides yet more redundancy, yielding dif-
ferent ways to realize that information in
English. We demonstrate how errors in the
machine translations of the input Arabic
documents can be corrected by identify-
ing and generating from such redundancy,
focusing on noun phrases.
1 Introduction
Multilingual summarization is a relatively nascent
research area which has, to date, been addressed
through adaptation of existing extractive English
document summarizers. Some systems (e.g. SUM-
MARIST (Hovy and Lin, 1999)) extract sentences
from documents in a variety of languages, and trans-
late the resulting summary. Other systems (e.g.
Newsblaster (Blair-Goldensohn et al., 2004)) per-
form translation before sentence extraction. Read-
ability is a major issue for these extractive systems.
The output of machine translation software is usu-
ally errorful, especially so for language pairs such
as Chinese or Arabic and English. The ungrammati-
cality and inappropriate word choices resulting from
the use of MT systems leads to machine summaries
that are dif cult to read.
Multi-document summarization, however, has in-
formation available that was not available during the
translation process and which can be used to im-
prove summary quality. A multi-document summa-
rizer is given a set of documents on the same event
or topic. This set provides redundancy; for example,
each document may refer to the same entity, some-
times in different ways. It is possible that by ex-
amining many translations of references to the same
entity, a system can gather enough accurate informa-
tion to improve the translated reference in the sum-
mary. Further, as a summary is short and serves as
a surrogate for a large set of documents, it is worth
investing more resources in its translation; readable
summaries can help end users decide which docu-
ments they want to spend time deciphering.
Current extractive approaches to summarization
are limited in the extent to which they address qual-
ity issues when the input is noisy. Some new sys-
tems attempt substituting sentences or clauses in
the summary with similar text from extraneous but
topic related English documents (Blair-Goldensohn
et al., 2004). This improves readability, but can only
be used in limited circumstances, in order to avoid
substituting an English sentence that is not faith-
ful to the original. Evans and McKeown (2005)
consider the task of summarizing a mixed data set
that contains both English and Arabic news reports.
Their approach is to separately summarize informa-
tion that is contained in only English reports, only
Arabic reports, and in both. While the only-English
and in-both information can be summarized by se-
lecting text from English reports, the summaries of
only-Arabic suffer from the same readability issues.
In this paper, we use principles from information
33
theory (Shannon, 1948) to address the issue of read-
ability in multilingual summarization. We take as
input, multiple machine translations into English of
a cluster of news reports in Arabic. This input is
characterized by high levels of linguistic noise and
by high levels of information redundancy (multiple
documents on the same or related topics and mul-
tiple translations into English). Our aim is to use
automatically acquired knowledge about the English
language in conjunction with the information redun-
dancy to perform error correction on the MT. The
main bene t of our approach is to make machine
summaries of errorful input easier to read and com-
prehend for end-users.
We focus on noun phrases in this paper. The
amount of error correction possible depends on the
amount of redundancy in the input and the depth of
knowledge about English that we can utilize. We
begin by tackling the problem of generating refer-
ences to people in English summaries of Arabic texts
(a2 2). This special case involves large amounts of re-
dundancy and allows for relatively deep English lan-
guage modeling, resulting in good error correction.
We extend our approach to arbitrary NPs in a2 3.
The evaluation emphasis in multi-document sum-
marization has been on evaluating content (not read-
ability), using manual (Nenkova and Passonneau,
2004) as well as automatic (Lin and Hovy, 2003)
methods. We evaluate readability of the generated
noun phrases by computing precision, recall and f-
measure of the generated version compared to mul-
tiple human models of the same reference, comput-
ing these metrics on n-grams. Our results show that
our system performs signi cantly better on precision
over two baselines (most frequent initial reference
and randomly chosen initial reference). Precision is
the most important of these measures as it is impor-
tant to have a correct reference, even if we don’t re-
tain all of the words used in the human models.
2 References to people
2.1 Data
We used data from the DUC 2004 Multilingual
summarization task. The Document Understanding
Conference (http://duc.nist.gov) has been run annu-
ally since 2001 and is the biggest summarization
evaluation effort, with participants from all over the
world. In 2004, for the  rst time, there was a multi-
lingual multi-document summarization task. There
were 25 sets to be summarized. For each set con-
sisting of 10 Arabic news reports, the participants
were provided with 2 different machine translations
into English (using translation software from ISI
and IBM). The data provided under DUC includes
4 human summaries for each set for evaluation pur-
poses; the human summarizers were provided a hu-
man translation into English of each of the Arabic
New reports, and did not have to read the MT output
that the machine summarizers took as input.
2.2 Task de nition
An analysis of premodi cation in initial references
to people in DUC human summaries for the mono-
lingual task from 2001 2004 showed that 71% of
premodifying words were either title or role words
(eg. Prime Minister, Physicist or Dr.) or temporal
role modifying adjectives such as former or desig-
nate. Country, state, location or organization names
constituted 22% of premodifying words. All other
kinds of premodifying words, such as moderate or
loyal constitute only 7%. Thus, assuming the same
pattern in human summaries for the multilingual
task (cf. section 2.6 on evaluation), our task for each
person referred to in a document set is to:
1. Collect all references to the person in both translations of
each document in the set.
2. Identify the correct roles (including temporal modi ca-
tion) and af liations for that person,  ltering any noise.
3. Generate a reference using the above attributes and the
person’s name.
2.3 Automatic semantic tagging
As the task de nition above suggests, our approach
is to identify particular semantic attributes for a per-
son, and generate a reference formally from this se-
mantic input. Our analysis of human summaries tells
us that the semantic attributes we need to identify
are role, organization, country, state,
location and temporal modifier. In addi-
tion, we also need to identify the person name.
We used BBN’s IDENTIFINDER (Bikel et al., 1999)
to mark up person names, organizations and lo-
cations. We marked up countries and (American)
states using a list obtained from the CIA factsheet1.
1http://www.cia.gov/cia/publications/factbook provides a
list of countries and states, abbreviations and adjectival forms,
for example United Kingdom/U.K./British/Briton and Califor-
nia/Ca./Californian.
34
To mark up roles, we used a list derived from Word-
Net (Miller et al., 1993) hyponyms of the person
synset. Our list has 2371 entries including multi-
word expressions such as chancellor of the exche-
quer, brother in law, senior vice president etc. The
list is quite comprehensive and includes roles from
the  elds of sports, politics, religion, military, busi-
ness and many others. We also used WordNet to ob-
tain a list of 58 temporal adjectives. WordNet classi-
 es these as pre- (eg. occasional, former, incoming
etc.) or post-nominal (eg. elect, designate, emeritus
etc.). This information is used during generation.
Further, we identi ed elementary noun phrases us-
ing the LT TTT noun chunker (Grover et al., 2000),
and combined NP of NP sequences into one com-
plex noun phrase. An example of the output of our
semantic tagging module on a portion of machine
translated text follows:
...a3 NPa4 a3 ROLEa4 representative a3a6a5 ROLEa4 of
a3 COUNTRYa4 Iraq a3a6a5 COUNTRYa4 of the a3 ORGa4
United Nations a3a6a5 ORGa4a7a3 PERSONa4 Nizar Hamdoon
a3a6a5 PERSONa4a8a3a6a5 NPa4 that a3 NPa4 thousands of people
a3a6a5 NPa4 killed or wounded in a3 NPa4 the a3 TIMEa4 next
a3a6a5 TIMEa4 few days four of the aerial bombardment of
a3 COUNTRYa4 Iraq a3a6a5 COUNTRYa4a9a3a6a5 NPa4 ...
Our principle data structure for this experiment is
the attribute value matrix (AVM). For example, we
create the following AVM for the reference to Nizar
Hamdoon in the tagged example above:
a10a11
a12a14a13a16a15a18a17a20a19 a21a23a22a25a24a27a26a29a28a31a30a23a26a33a32a31a34a36a35a37a35a39a38
a40a37a41a43a42
a19 a28a45a44a18a46a47a28a45a44a18a48a49a44a50a38a16a51a52a26a53a51a45a22a25a54a43a44
a55a33a41a29a56
a13a37a57
a40a39a58 a59
a28a52a26a39a60
a61a63a62a16a64a53a65a67a66a53a68
a41a29a40a43a69
a15a29a13a47a70a53a71a16a15a33a57a36a70
a41
a13 a72a23a38a20a22a73a51a45a44a27a34a74a21a23a26a29a51a45a22a25a35a39a38a47a48
a61a63a62a36a64a53a65a47a75a39a68
a76a78a77
a79
Note that we store the relative positions (arg 1
and arg 2) of the country and organization attributes.
This information is used both for error reduction and
for generation as detailed below. We also replace
adjectival country attributes with the country name,
using the correspondence in the CIA factsheet.
2.4 Identifying redundancy and  ltering noise
We perform coreference by comparing AVMs. Be-
cause of the noise present in MT (For example,
words might be missing, or proper names might be
spelled differently by different MT systems), simple
name comparison is not suf cient. We form a coref-
erence link between two AVMs if:
1. The last name and (if present) the  rst name match.
2. OR, if the role, country, organization and time attributes
are the same.
The assumption is that in a document set to be
summarized (which consists of related news re-
ports), references to people with the same af liation
and role are likely to be references to the same per-
son, even if the names do not match due to spelling
errors. Thus we form one AVM for each person, by
combining AVMs. For Nizar Hamdoon, to whom
there is only one reference in the set (and thus two
MT versions), we obtain the AVM:
a10a11
a12a14a13a36a15a50a17a20a19 a21a6a22a25a24a27a26a29a28a80a30a23a26a33a32a31a34a36a35a37a35a39a38
a61a81a75a33a68
a40a16a41a39a42
a19 a28a45a44a50a46a47a28a45a44a18a48a49a44a18a38a37a51a52a26a53a51a45a22a78a54a39a44
a61a81a75a39a68
a55a39a41a53a56
a13a37a57
a40a43a58 a59
a28a52a26a39a60
a61a81a75a39a68a82a61a63a62a36a64a53a65a67a66a53a68
a41a33a40a39a69
a15a29a13a47a70a29a71a37a15a33a57a36a70
a41
a13 a72a6a38a20a22a73a51a45a44a27a34a74a21a83a26a53a51a45a22a25a35a39a38a20a48
a61a81a75a33a68a84a61a63a62a16a64a29a65a47a75a33a68
a76a78a77
a79
where the numbers in brackets represents the
counts of this value across all references. The arg
values now represent the most frequent ordering of
these organizations and countries in the input refer-
ences. As an example of a combined AVM for a
person with a lot of references, consider:
a10a11
a11
a11
a11
a12
a13a16a15a18a17a20a19 a85a86a44a50a28a45a35a33a87a88a26a29a89
a61a81a75a53a90a37a68a92a91a94a93
a22a78a26a33a32a80a22a25a38a47a44a95a85a67a44a50a28a45a35a39a87a20a26a33a89
a61a81a75a33a96a39a68
a40a37a41a43a42
a19 a46a47a28a45a44a18a48a49a22a78a34a47a44a50a38a16a51
a61a81a75a29a97a43a68a92a91
a89a25a44a27a26a39a34a36a44a50a28
a61a81a75a39a68
a55a33a41a29a56
a13a37a57
a40a39a58 a98
a89a25a99a39a44a50a28a45a22a78a26
a61a100a66a27a101a43a68a84a61a63a62a16a64a53a65a67a66a53a68
a41a29a40a43a69
a15a29a13a47a70a53a71a16a15a33a57a36a70
a41
a13 a102a6a44a18a38a47a35a53a54a39a26a53a51a45a22a78a35a33a38a7a103a104a26a29a28a49a51a106a105
a61a81a75a33a68a107a61a63a62a16a64a29a65a67a66a27a68a92a91
a98a23a108
a103
a61a100a66a53a68a107a61a63a62a16a64a53a65a67a66a53a68
a57a16a70a50a17a20a19 a109a110a35a33a28a45a32a80a44a92a28
a61a100a66a53a68
a76a78a77
a77
a77
a77
a79
This example displays common problems when
generating a reference. Zeroual has two af liations -
Leader of the Renovation Party, and Algerian Presi-
dent. There is additional noise - the values AFP and
former are most likely errors. As none of the organi-
zation or country values occur in the same reference,
all are marked arg1; no relative ordering statistics
are derivable from the input. For an example demon-
strating noise in spelling, consider:
a10a11
a11
a11
a11
a11
a11
a11
a12
a13a16a15a18a17a20a19 a111a84a87a88a26a33a32a80a32a31a26a53a28a31a112a83a26a33a34a20a34a20a26a53a113
a61a100a66a18a96a43a68a92a91
a111a84a87a88a26a33a32a80a32a31a26a53a28a31a114a83a26a33a34a20a34a20a26a53a113
a61a100a66a18a96a43a68a92a91
a112a83a26a39a34a20a34a47a26a29a113
a61a63a90a37a68a92a91
a114a83a26a33a34a20a34a47a26a29a113
a61a63a90a43a68
a40a37a41a43a42
a19 a89a78a44a18a26a39a34a47a44a92a28a80a115a18a35a39a89a25a35a33a38a20a44a18a89
a61a100a66a53a75a33a68a92a91
a115a18a35a33a89a78a35a33a38a20a44a18a89
a61a63a90a43a68
a89a78a44a18a26a39a34a47a44a92a28
a61a116a97a43a68a92a91
a32a80a22a78a38a47a22a25a48a100a51a45a44a50a28
a61a81a75a39a68a92a91a110a117
a87a20a48a100a51a45a22a25a115a18a44
a61a100a66a27a68
a55a33a41a29a56
a13a37a57
a40a39a58
a93
a22a25a118a37a105a16a26
a61a81a119a33a68a82a61a63a62a16a64a29a65a67a66a27a68
a41a29a40a43a69
a15a29a13a47a70a53a71a16a15a33a57a36a70
a41
a13 a103a120a44a27a26a33a115a18a44a95a121a122a35a39a87a20a38a37a51a49a28a49a105
a61a81a75a39a68a31a61a63a62a16a64a29a65a47a75a33a68a92a91
a121a122a35a39a87a20a38a37a51a49a28a49a105a7a103a120a44a27a26a29a115a18a44
a61a100a66a27a68a84a61a63a62a36a64a53a65a67a66a53a68
a76a78a77
a77
a77
a77
a77
a77
a77
a79
Our approach to removing noise is to:
1. Select the most frequent name with more than one word
(this is the most likely full name).
2. Select the most frequent role.
3. Prune the AVM of values that occur with a frequency be-
low an empirically determined threshold.
Thus we obtain the following AVMs for the three
examples above:
35
a10a11
a12a14a13a16a15a18a17a20a19 a21a23a22a25a24a27a26a29a28a31a30a23a26a33a32a31a34a36a35a37a35a39a38
a40a37a41a43a42
a19 a28a45a44a18a46a47a28a45a44a18a48a49a44a50a38a16a51a52a26a53a51a45a22a25a54a43a44
a55a33a41a29a56
a13a37a57
a40a39a58 a59
a28a52a26a39a60
a61a63a62a16a64a53a65a67a66a53a68
a41a29a40a43a69
a15a29a13a47a70a53a71a16a15a33a57a36a70
a41
a13 a72a23a38a20a22a73a51a45a44a27a34a74a21a23a26a29a51a45a22a25a35a39a38a47a48
a61a63a62a36a64a53a65a47a75a39a68
a76a78a77
a79
a123
a13a16a15a18a17a20a19
a93
a22a78a26a33a32a80a22a25a38a47a44a95a85a67a44a50a28a45a35a39a87a20a26a33a89
a40a37a41a43a42
a19 a46a47a28a45a44a18a48a49a22a78a34a36a44a18a38a37a51
a55a33a41a29a56
a13a37a57
a40a39a58a124a98
a89a25a99a39a44a92a28a45a22a125a26
a61a63a62a16a64a29a65a67a66a27a68 a126
a123
a13a16a15a18a17a20a19 a111a84a87a88a26a29a32a80a32a31a26a29a28a31a112a83a26a33a34a20a34a47a26a29a113
a40a37a41a43a42
a19 a89a25a44a27a26a39a34a36a44a50a28a80a115a18a35a33a89a78a35a33a38a20a44a18a89
a55a33a41a29a56
a13a37a57
a40a39a58
a93
a22a25a118a37a105a16a26
a61a63a62a16a64a29a65a67a66a27a68 a126
This is the input semantics for our generation mod-
ule described in the next section.
2.5 Generating references from AVMs
In order to generate a reference from the words in an
AVM, we need knowledge about syntax. The syn-
tactic frame of a reference to a person is determined
by the role. Our approach is to automatically acquire
these frames from a corpus of English text. We used
the Reuters News corpus for extracting frames. We
performed the semantic analysis of the corpus, as in
a2 2.3; syntactic frames were extracted by identifying
sequences involving locations, organizations, coun-
tries, roles and prepositions. An example of auto-
matically acquired frames with their maximum like-
lihood probabilities for the role ambassador is:
ROLE=ambassador
(p=.35) COUNTRY ambassador PERSON
(.18) ambassador PERSON
(.12) COUNTRY ORG ambassador PERSON
(.12) COUNTRY ambassador to COUNTRY PERSON
(.06) ORG ambassador PERSON
(.06) COUNTRY ambassador to LOCATION PERSON
(.06) COUNTRY ambassador to ORG PERSON
(.03) COUNTRY ambassador in LOCATION PERSON
(.03) ambassador to COUNTRY PERSON
These frames provide us with the required syn-
tactic information to generate from, including word
order and choice of preposition. We select the most
probable frame that matches the semantic attributes
in the AVM. We also use a default set of frames
shown below for instances where no automatically
acquired frames exist:
ROLE=a3 Defaulta4
COUNTRY ROLE PERSON
ORG ROLE PERSON
COUNTRY ORG ROLE PERSON
ROLE PERSON
If no frame matches, organizations, countries and
locations are dropped one by one in decreasing or-
der of argument number, until a matching frame is
found. After a frame is selected, any prenominal
temporal adjectives in the AVM are inserted to the
left of the frame, and any postnominal temporal ad-
jectives are inserted to the immediate right of the
role in the frame. Country names that are not ob-
jects of a preposition are replaced by their adjectival
forms (using the correspondences in the CIA fact-
sheet). For the AVMs above, our generation module
produces the following referring expressions:
a127 Iraqi United Nations representative Nizar Hamdoon
a127 Algerian President Liamine Zeroual
a127 Libyan Leader Colonel Muammar Qadda 
2.6 Evaluation
To evaluate the referring expressions generated by
our program, we used the manual translation of each
document provided by DUC. The drawback of us-
ing a summarization corpus is that only one human
translation is provided for each document, while
multiple model references are required for automatic
evaluation. We created multiple model references
by using the initial references to a person in the
manual translation of each input document in the
set in which that person was referenced. We cal-
culated unigram, bigram, trigram and fourgram pre-
cision, recall and f-measure for our generated ref-
erences evaluated against multiple models from the
manual translations. To illustrate the scoring, con-
sider evaluating a generated phrase  a b d against
three model references  a b c d ,  a b c and  b c
d . The bigram precision is a128a36a129a88a130a132a131a134a133a120a135a78a136 (one out of
two bigrams in generated phrase occurs in the model
set), bigram recall is a130a86a129a67a137a138a131a139a133a120a135a78a130a67a140a67a141 (two out of 7 bi-
grams in the models occurs in the generated phrase)
and f-measure (a142a143a131a144a130a33a145a147a146a149a148a150a129a120a151a125a145a9a152a153a148a150a154 ) is a133a120a135a78a155a67a141a20a156 . For
fourgrams, P, R and F are zero, as there is a fourgram
in the models, but none in the generated NP.
We used 6 document sets from DUC’04 for devel-
opment purposes and present the average P, R and F
for the remaining 18 sets in Table 1. There were 210
generated references in the 18 testing sets. The table
also shows the popular BLEU (Papineni et al., 2002)
and NIST2 MT metrics. We also provide two base-
lines - most frequent initial reference to the person
in the input (Base1) and a randomly selected initial
reference to the person (Base2). As Table 1 shows,
Base1 performs better than random selection. This
2http://www.nist.gov/speech/tests/mt/resources/scoring.htm
36
UNIGRAMS a157a159a158a106a160 a161a162a158a106a160 a163a164a158a106a160
Generated 0.847*@ 0.786 0.799*@
Base1 0.753* 0.805 0.746*
Base2 0.681 0.767 0.688
BIGRAMS a157a159a158a106a160 a161a162a158a106a160 a163a164a158a106a160
Generated 0.684*@ 0.591 0.615*
Base1 0.598* 0.612 0.562*
Base2 0.492 0.550 0.475
TRIGRAMS a157a159a158a106a160 a161a162a158a106a160 a163a164a158a106a160
Generated 0.514*@ 0.417 0.443*
Base1 0.424* 0.432 0.393*
Base2 0.338 0.359 0.315
FOURGRAMS a157a159a158a106a160 a161a162a158a106a160 a163a164a158a106a160
Generated 0.411*@ 0.336 0.351*
Base1 0.320 0.360* 0.302
Base2 0.252 0.280 0.235
@ Signi cantly better than Base1
* Signi cantly better than Base2
(Signi cance tested using unpaired t-test at 95% con dence)
MT Metrics Generated Base1 Base2
BLEU 0.898 0.499 0.400
NIST 8.802 6.423 5.658
Table 1: Evaluation of generated reference
is intuitive as it also uses redundancy to correct er-
rors, at the level of phrases rather than words. The
generation module outperforms both baselines, par-
ticularly on precision - which for unigrams gives an
indication of the correctness of lexical choice, and
for higher ngrams gives an indication of grammati-
cality. The unigram recall of a133a120a135a125a137a88a140a67a141 indicates that we
are not losing too much information at the noise  l-
tering stage. Note that we expect a low a165a167a166a53a168 for our
approach, as we only generate particular attributes
that are important for a summary. The important
measure is a169a170a166a27a168 , on which we do well. This is also
re ected in the high scores on BLEU and NIST.
It is instructive to see how these numbers vary as
the amount of redundancy increases. Information
theory tells us that information should be more re-
coverable with greater redundancy. Figure 1 plots
f-measure against the minimum amount of redun-
dancy. In other words, the value at X=3 gives the
f-measure averaged over all people who were men-
tioned at least thrice in the input. Thus X=1 includes
all examples and is the same as Table 1.
As the graphs show, the quality of the generated
reference improves appreciably when there are at
least 5 references to the person in the input. This is a
convenient result for summarization because people
who are mentioned more frequently in the input are
more likely to be mentioned in the summary.
1 2 3 4 5 6 7 8
0.50
0.55
0.60
0.65
0.70
0.75
0.80
0.85
0.90
Redundancy in input (At least X references)
F−Measure
Unigrams
1 2 3 4 5 6 7 8
0.30
0.35
0.40
0.45
0.50
0.55
0.60
0.65
0.70
Redundancy in input (At least X references)
F−Measure
Bigrams
1 2 3 4 5 6 7 8
0.15
0.20
0.25
0.30
0.35
0.40
0.45
0.50
0.55
Redundancy in input (At least X references)
F−Measure
Trigrams
1 2 3 4 5 6 7 8
0.10
0.15
0.20
0.25
0.30
0.35
0.40
0.45
0.50
Redundancy in input (At least X references)
F−Measure
Fourgrams
KEY  Generated    Base1         Base2
Figure 1: Improvement in F-measure for n-grams in
output with increased redundancy in input.
2.7 Advantages over using extraneous sources
Our approach performs noise reduction and gener-
ates a reference from information extracted from the
machine translations. Information about a person
can be obtained in other ways; for example, from a
database, or by collecting references to the person
from extraneous English-language reports. There
are two drawbacks to using extraneous sources:
1. People usually have multiple possible roles and af lia-
tions, so descriptions obtained from an external source
might not be appropriate in the current context.
2. Selecting descriptions from external sources can change
perspective  one country’s terrorist is another country’s
freedom  ghter.
In contrast, our approach generates references
that are appropriate and re ect the perspectives ex-
pressed in the source.
3 Arbitrary noun phrases
In the previous section, we showed how accurate ref-
erences to people can be generated using an infor-
mation theoretic approach. While this is an impor-
tant result in itself for multilingual summarization,
the same approach can be extended to correct errors
in noun phrases that do not refer to people. This ex-
tension is trickier to implement, however, because:
1. Collecting redundancy: Common noun coreference is a
hard problem, even within a single clean English text, and
harder still across multiple MT texts.
37
2. Generating: The semantics for an arbitrary noun phrase
cannot be de ned suf ciently for formal generation;
hence our approach is to select the most plausible of the
coreferring NPs according to an inferred language model.
When suf cient redundancy exists, it is likely that there
is at least one option that is superior to most.
Interestingly, the nature of multi-document sum-
marization allows us to perform these two hard
tasks. We follow the same theoretical framework
(identify redundancy, and then generate from this),
but the techniques we use are necessarily different.
3.1 Alignment of NPs across translations
We used the BLAST algorithm (Altschul et al., 1997)
for aligning noun phrases between two translations
of the same Arabic sentence. We obtained the best
results when each translation was analyzed for noun
chunks, and the alignment operation was performed
over sequences of words and a171 NPa172 and a171 /NPa172
tags. BLAST is an ef cient alignment algorithm that
assumes that words in the two sentences are roughly
in the same order from a global perspective. As nei-
ther of the MT systems used performs much clause
or phrase reorganization, this assumption is not a
problem for our task. An example of two aligned
sentences is shown in  gure 2. We then extract core-
ferring noun phrases by selecting the text between
aligned a171 NPa172 and a171 /NPa172 tags; for example:
1. the Special Commission in charge of disarmament of
Iraq’s weapons of mass destruction
2. the Special Commission responsible disarmament Iraqi
weapons of mass destruction
3.2 Alignment of NPs across documents
This task integrates well with the clustering ap-
proach to multi-document summarization (Barzilay,
2003), where sentences in the input documents are
 rst clustered according to their similarity, and then
one sentence is generated from each cluster. This
clustering approach basically does at the level of
sentences what we are attempting at the level of
noun phrases. After clustering, all sentences within
a cluster should represent similar information. Thus,
similar noun phrases in sentences within a cluster
are likely to refer to the same entities. We do noun
phrase coreference by identifying lexically similar
noun phrases within a cluster. We use SimFinder
(Hatzivassiloglou et al., 1999) for sentence cluster-
ing and the f-measure for word overlap to compare
noun phrases. We set a threshold for deciding coref-
erence by experimenting on the 6 development sets
(cf. a2 2.6) the most accurate coreference occurred
with a threshold of f=0.6 and a constraint that the
two noun phrases must have at least 2 words in com-
mon that were neither determiners nor prepositions.
For the reference to the UN Special Commission in
 gure 2, we obtained the following choices from
alignments and coreference across translations and
documents within a sentence cluster:
1. the United nations Special Commission in charge of dis-
armament of Iraq’s weapons of mass destruction
2. the the United Nations Special Commission responsible
disarmament Iraqi weapons of mass destruction
3. the Special Commission in charge of disarmament of
Iraq’s weapons of mass destruction
4. the Special Commission responsible disarmament Iraqi
weapons of mass destruction
5. the United nations Special Commission in charge of dis-
armament of Iraq’s weapons of mass destruction
6. the Special Commission of the United Nations responsi-
ble disarmament Iraqi weapons of mass destruction
Larger sentence clusters represent information
that is repeated more often across input documents;
hence the size of a cluster is indicative of the impor-
tance of that information, and the summary is com-
posed by considering each sentence cluster in de-
creasing order of size and generating one sentence
from it. From our perspective of  xing errors in
noun phrases, there is likely to be more redundancy
in a large cluster; hence this approach is likely to
work better within clusters that are important for
generating the summary.
3.3 Generation of noun phrases
As mentioned earlier, formal generation from a set
of coreferring noun phrases is impractical due to the
unrestricted nature of the underlying semantics. We
thus focus on selecting the best of the possible op-
tions  the option with the least garbled word order;
for example, selecting 1) from the following:
1. the malicious campaigns in some Western media
2. the campaigns tendentious in some of the media Western
European
The basic insight that we utilize is  when two
words in a NP occur together in the original docu-
ments more often than they should by chance, it is
likely they really should occur together in the gen-
erated NP. Our approach therefore consists of iden-
tifying collocations of length two. Let the number
of words in the input documents be a173 . For each
38
<S1> <NP> Ivanov </NP> stressed <NP> it </NP> should be to <NP> Baghdad </NP> to resume <NP> work </NP> with| | | | | | | | | | | | | | | |
<S2> <NP> Ivanov </NP> stressed however <NP> it </NP> should to <NP> Baghdad </NP> reconvening <NP> work </NP> with
<NP> the Special Commission in charge of disarmament of Iraq’s weapons of mass destruction </NP> . </S1>| | | | | | | | | |
<NP> the Special Commission </NP> <NP> responsible disarmament Iraqi weapons of mass destruction </NP> . </S2>
Figure 2: Two noun chunked MT sentences (S1 and S2) with the words aligned using BLAST.
pair of words a174 and a175 , we use maximum likelihood
to estimate the probabilities of observing the strings
a176
a174a162a175a39a177 , a176 a174a94a177 and a176 a175a39a177 . The observed frequency of these
strings in the corpus divided by the corpus size a173
gives the maximum likelihood probabilities of these
events a145a23a151a100a174a159a178a29a175a37a154 ,a145a83a151a100a174a120a154 anda145a83a151a49a175a43a154 . The natural way to de-
termine how dependent the distributions of a174 and a175
are is to calculate their mutual information (Church
and Hanks, 1991):a179
a151a100a174a159a178a29a175a37a154a162a131a181a180a116a182a67a183a150a184
a145a83a151a100a174a164a178a29a175a37a154
a145a23a151a100a174a104a154a170a146a185a145a23a151a49a175a37a154
If the occurrences of a174 and a175 were completely
independent of each other, we would expect the
maximum likelihood probability a145a23a151a100a174a159a178a29a175a43a154 of the string
a176
a174a186a175a39a177 to be a145a23a151a100a174a104a154a82a146a9a145a83a151a49a175a43a154 . Thus mutual information
is zero when a174 and a175 are independent, and positive
otherwise. The greater the value of
a179
a151a100a174a164a178a29a175a37a154 , the more
likely that a176 a174a187a175a39a177 is a collocation. Returning to our
problem of selecting the best NP from a set of core-
ferring NPs, we compute a score for each NP (con-
sisting of the string of words a188a190a189a16a135a63a135a63a135a73a188a80a191 ) by averaging
the mutual information for each bigram:
a192a83a193a43a194
a148a86a195a150a151a106a188a196a189a16a135a63a135a63a135a73a188a80a191a197a154a198a131a200a199a202a201a63a203
a191a205a204a206a189
a201a63a203
a189
a179
a151a106a188
a201
a178a27a188
a201a116a207
a189a39a154
a208a210a209
a128
We then select the NP with the highest score. This
model successfully selects the malicious campaigns
in some Western media in the example above and
the United nations Special Commission in charge of
disarmament of Iraq’s weapons of mass destruction
in the example in a2 3.2.
3.4 Automatic Evaluation
Our approach to evaluation is similar to that for
evaluating references to people. For each collection
of coreferring NPs, we identi ed the corresponding
model NPs from the manual translations of the input
documents by using the BLAST algorithm for word
alignment between the MT sentences and the cor-
responding manually translated sentence. Table 2
below gives the average unigram, bigram, trigram
and fourgram precision, recall and f-measure for the
UNIGRAMS a157a94a211a27a212 a161a6a211a27a212 a163a94a211a27a212
Mutual information 0.615*@ 0.658 0.607*
Base1 0.584 0.662 0.592
Base2 0.583 0.652 0.586
BIGRAMS a157a94a211a27a212 a161a6a211a27a212 a163a94a211a27a212
Mutual information 0.388*@ 0.425* 0.374*@
Base1 0.340 0.402 0.339
Base2 0.339 0.387 0.330
TRIGRAMS a157 a211a27a212 a161 a211a27a212 a163 a211a27a212
Mutual information 0.221*@ 0.204* 0.196*@
Base1 0.177 0.184 0.166
Base2 0.181 0.171 0.160
FOURGRAMS a157 a211a27a212 a161 a211a27a212 a163 a211a27a212
Mutual information 0.092* 0.090* 0.085*
Base1 0.078 0.080 0.072
Base2 0.065 0.066 0.061
@ Signi cantly better than Base1
* Signi cantly better than Base2
(Signi cance tested using unpaired t-test at 95% con dence)
MT Metrics Mutual information Base1 Base2
BLEU 0.276 0.206 0.184
NIST 5.886 4.979 4.680
Table 2: Evaluation of noun phrase selection
selected NPs, evaluated against the models. We ex-
cluded references to people as these were treated for-
mally in a2 2. This left us with 961 noun phrases from
the 18 test sets to evaluate. Table 2 also provides the
BLEU and NIST MT evaluation scores.
We again provide two baselines - most frequent
NP in the set (Base1) and a randomly selected NP
from the set (Base2). The numbers in Table 2 are
lower than those in Table 1. This is because generat-
ing references to people is a more restricted problem
 there is less error in MT output, and a formal gen-
eration module is employed for error reduction. In
the case of arbitrary NPs, we only select between the
available options. However, the information theo-
retic approach gives signi cant improvement for the
arbitrary NP case as well, particularly for precision,
which is an indicator of grammaticality.
3.5 Manual Evaluation
To evaluate how much impact the rewrites have on
summaries, we ran our summarizer on the 18 test
sets, and manually evaluated the selected sentences
39
and their rewritten versions for accuracy and  u-
ency. There were 118 sentences, out of which 94
had at least one modi cation after the rewrite pro-
cess. We selected 50 of these 94 sentences at ran-
dom and asked 2 human judges to rate each sen-
tence and its rewritten form on a scale of 1 5 for
accuracy and  uency3. We used 4 human judges,
each judging 25 sentence pairs. The original and
rewritten sentences were presented in random order,
so judges did not know which sentences were rewrit-
ten. Fluency judgments were made before seeing the
human translated sentence, and accuracy judgments
were made by comparing with the human transla-
tion. The average scores before and after rewrite
were a130a94a135a25a133a86a140 and a130a94a135a78a130a67a141 respectively for  uency and a155a94a135a25a133a67a133
and a155a94a135a116a128a16a213 respectively for accuracy. Thus the rewrite
operations increases both scores by around 0.2.
4 Conclusions and future work
We have demonstrated how the information redun-
dancy in the multilingual multi-document summa-
rization task can be used to reduce MT errors. We
do not use any related English news reports for sub-
stituting text; hence our approach is not likely to
change the perspectives expressed in the original
Arabic news to those expressed in English news re-
ports. Further, our approach does not perform any
corrections speci c to any particular MT system.
Thus the techniques described in this paper will re-
main relevant even with future improvements in MT
technology, and will be redundant only when MT is
perfect. We have used the Arabic-English data from
DUC’04 for this paper, but our approach is equally
applicable to other language pairs. Further, our tech-
niques integrate easily with the sentence clustering
approach to multi-document summarization  sen-
tence clustering allows us to reliably identify noun
phrases that corefer across documents.
In this paper we have considered the case of noun
phrases. In the future, we plan to consider other
types of constituents, such as correcting errors in
verb groups, and in the argument structure of verbs.
This will result in a more generative and less ex-
3We followed the DARPA/LDC guidelines from http://
ldc.upenn.edu/Projects/TIDES/Translation/TranAssessSpec.pdf.
For  uency, the scale was 5:Flawless, 4:Good, 3:Non-native,
2:Dis uent, 1:Incomprehensible. The accuracy scale for
information covered (comparing with human translation) was
5:All, 4:Most, 3:Much, 2:Little, 1:None.
tractive approach to summarization - indeed the case
for generative approaches to summarization is more
convincing when the input is noisy.
References
S.F. Altschul, T. L. Madden, A.A. Schaffer, J. Zhang,
Z. Zhang, W. Miller, and D. J. Lipman. 1997. Gapped
BLAST and PSI-BLAST: a new generation of protein
database search programs. Nucleic Acids Research,
17(25):3389 3402.
R. Barzilay. 2003. Information Fusion for Multidocu-
ment Summarization: Paraphrasing and Generation.
Ph.D. thesis, Columbia University, New York.
D. Bikel, R. Schwartz, and R. Weischedel. 1999. An al-
gorithm that learns what’s in a name. Machine Learn-
ing, 34:211 231.
S. Blair-Goldensohn, D. Evans, V. Hatzivassiologlou,
K. McKeown, A. Nenkova, R. Passonneau, B. Schiff-
man, A. Schlajiker, A. Siddharthan, and S. Siegelman.
2004. Columbia University at DUC 2004. In Proceed-
ings of DUC’04, pages 23 30, Boston, USA.
K. Church and P. Hanks. 1991. Word association norms,
mutual information and lexicography. Computational
Linguistics, 16(1):22 29.
D. Evans and K. McKeown. 2005. Identifying similar-
ities and differences across english and arabic news.
In Proceedings of International Conference on Intelli-
gence Analysis, pages 23 30, McLean, VA.
C. Grover, C. Matheson, A. Mikheev, and M. Moens.
2000. LT TTT - A  exible tokenisation tool. In Pro-
ceedings of LREC’00, pages 1147 1154.
V. Hatzivassiloglou, J. Klavans, and E. Eskin. 1999. De-
tecting text similarity over short passages: exploring
linguistic feature combinations via machine learning.
In Proceedings of EMNLP’99, MD, USA.
E.H. Hovy and Chin-Yew Lin. 1999. Automated text
summarization in summarist. In I. Mani and M. May-
bury, editors, Advances in Automated Text Summariza-
tion, chapter 8. MIT Press.
C. Lin and E. Hovy. 2003. Automatic evaluation of sum-
maries using n-gram co-occurrence statistics. In Pro-
ceedings of HLT-NAACL’03, Edmonton.
G.A. Miller, R. Beckwith, C.D. Fellbaum, D. Gross, and
K. Miller. 1993. Five Papers on WordNet. Technical
report, Princeton University, Princeton, N.J.
Ani Nenkova and Rebecca Passonneau. 2004. Evaluat-
ing content selection in summarization: The pyramid
method. In HLT-NAACL 2004: Main Proceedings,
pages 145 152, Boston, MA, USA.
K. Papineni, S. Roukos, T. Ward, and W. Zhu. 2002.
Bleu: A method for automatic evaluation of machine
translation. In Proceedings of ACL’02.
C. E. Shannon. 1948. A mathematical theory of commu-
nication. Bell System Tech. Journal, 27:379 423.
40
