A Phrase-Based HMM Approach to Document/Abstract Alignment
Hal Daum´e III and Daniel Marcu
Information Sciences Institute
University of Southern California
4676 Admiralty Way, Suite 1001
Marina del Rey, CA 90292
fhdaume,marcug@isi.edu
Abstract
We describe a model for creating word-to-word and
phrase-to-phrase alignments between documents
and their human written abstracts. Such alignments
are critical for the development of statistical sum-
marization systems that can be trained on large cor-
pora of document/abstract pairs. Our model, which
is based on a novel Phrase-Based HMM, outper-
forms both the Cut & Paste alignment model (Jing,
2002) and models developed in the context of ma-
chine translation (Brown et al., 1993).
1 Introduction
There are a wealth of document/abstract pairs that
statistical summarization systems could leverage to
learn how to create novel abstracts. Detailed stud-
ies of such pairs (Jing, 2002) show that human ab-
stractors perform a range of very sophisticated op-
erations when summarizing texts, which include re-
ordering, fusion, and paraphrasing. Unfortunately,
existing document/abstract alignment models are
not powerful enough to capture these operations.
To get around directly tackling this problem, re-
searchers in text summarization have employed one
of several techniques.
Some researchers (Banko et al., 2000) have de-
veloped simple statistical models for aligning doc-
uments and headlines. These models, which imple-
ment IBM Model 1 (Brown et al., 1993), treat docu-
ments and headlines as simple bags of words and
learn probabilistic word-based mappings between
the words in the documents and the words in the
headlines. As our results show, these models are
too weak for capturing the operations that are em-
ployed by humans in summarizing texts beyond the
headline level.
Other researchers have developed models that
make unreasonable assumptions about the data,
which lead to the utilization of a very small per-
cent of available data. For instance, the docu-
ment and sentence compression models of Daum´e
III, Knight, and Marcu (Knight and Marcu, 2002;
Daum´e III and Marcu, 2002a) assume that sen-
tences/documents can be summarized only through
deletion of contiguous text segments. Knight and
Marcu found that from a corpus of 39;060 abstract
sentences, only 1067 sentence extracts existed: a re-
call of only 2:7%.
An alternate techinque employed in a large vari-
ety of systems is to treat the summarization prob-
lem as a sentence extraction problem. Such sys-
tems can be trained either on human constructed ex-
tracts or extracts generated automatically from doc-
ument/abstract pairs (see (Marcu, 1999; Jing and
McKeown, 1999) for two such approaches).
None of these techniques is adequate. Even for a
relatively simple sentence from an abstract, we can
see that none of the assumptions listed above holds.
In Figure 1, we observe several phenomena:
 Alignments can occur at the granularity of
words and at the granularity of phrases.
 The ordering of phrases in an abstract can be
different from the ordering in the document.
 Some abstract words do not have direct cor-
respondents in the document, and some doc-
ument words are never used.
It is thus desirable to be able to automatically
construct alignments between documents and their
abstracts, so that the correspondences between the
pairs are obvious. One might be initially tempted
to use readily-available machine translation systems
like GIZA++ (Och and Ney, 2003) to perform such
Connecting Point has become the single largest Mac retailer after tripling it ’s Macintosh sales since January 1989 .
Connecting Point Systems tripled it ’s sales of Apple Macintosh systems since last January .  It is now the single largest seller of Macintosh .
Figure 1: Example abstract/text alignment.
alignments. However, as we will show, the align-
ments produced by such a system are inadequate for
this task.
The solution that we propose to this problem is
an alignment model based on a novel mathematical
structure we call the Phrase-Based HMM.
2 Designing a Model
As observed in Figure 1, our model needs to be able
to account for phrase-to-phrase alignments. It also
needs to be able to align abstract phrases with arbi-
trary parts of the document, and not require a mono-
tonic, left-to-right alignment.1
2.1 The Generative Story
The model we propose calculates the probability of
an alignment/abstract pair in a generative fashion,
generating the summary S = hs1 ::: smi from the
document D = hd1 ::: dni.
In a document/abstract corpus that we have
aligned by hand (see Section 3), we have observed
that 16% of abstract words are left unaligned. Our
model assumes that these “null-generated” words
and phrases are produced by a unique document
word ?, called the “null word.” The parame-
ters of our model are stored in two tables: a
rewrite/paraphrase table and a jump table. The
rewrite table stores probabilities of producing sum-
mary words/phrases from document words/phrases
and from the null word (namely, probabilities of the
form rewrite   s  d and rewrite ( s ?)); the jump ta-
ble stores the probabilities of moving within a doc-
ument from one position to another, and from and
to ?.
The generation of a summary from a document is
assumed to proceed as follows:
1In the remainder of the paper, we will use the words “sum-
mary” and “abstract” interchangeably. This is because we wish
to use the letter s to refer to summaries. We could use the letter
a as an abbreviation for “abstract”; however, in the definition
of the Phrase-Based HMM, we reuse common notation which
ascribes a different interpretation to a.
1. Choose a starting index i and jump to po-
sition di in the document with probability
jump (i). (If the first summary phrase is null-
generated, jump to the null-word with proba-
bility jump (?).)
2. Choose a document phrase of length k  0 and
a summary phrase of length l  1. Generate
summary words sl1 from document words di+ki
with probability rewrite
 
sl1 di+ki
 
.2
3. Choose a new document index i0 and
jump to position di0 with probability
jump (i0  (i + k)) (or, if the new document
position is the empty state, then jump (?)).
4. Choose k0 and l0 as in step 2, and gener-
ate the summary words s1+l+l01+l from the
document words di0+k0i0 with probability
rewrite
 
s1+l+l01+l di0+k0i0
 
.
5. Repeat from step 3 until the entire summary
has been generated.
6. Jump to position dn+1 in the document with
probability jump (n + 1  (i0 + k0)).
Note that such a formulation allows the same
document word/phrase to generate many summary
words: unlike machine translation, where such be-
havior is typically avoided, in summarization, we
observe that such phenomena do occur. However,
if one were to build a decoder based on this model,
one would need to account for this issue to avoid
degenerate summaries from being produced.
The formal mathematical model behind the align-
ments is as follows: An alignment @ defines both
a segmentation of the summary S and a mapping
from the segments of S to the segments of the doc-
ument D. We write si to refer to the ith segment of
S, and M to refer to the total number of segments
2We write xb
a for the subsequence hxa : : : xbi.
in S. We write d@(i) to refer to the words in the
document which correspond to segment si. Then,
the probability of a summary/alignment pair given a
document (Pr(S;@ D)), becomes:
M+1Y
i=1
 jump (@(i) @(i  1)) rewrite  s
i d@(i)
  
Here, we implicitly define sm+1 to be the end-of-
document token h!i and d@(m+1) to generate this
with probability 1. We also define the initial posi-
tion in the document, @(0) to be 0, and assume a
uniform prior on segmentations.
2.2 The Mathematical Model
Having decided to use this model, we must now
find a way to efficiently train it. The model is very
much like a Hidden Markov Model in which the
summary is the observed sequence. However, us-
ing a standard HMM would not allow us to account
for phrases in the summary. We therefore extend
a standard HMM to allow multiple observations to
be emitted on one transition. We call this model a
Phrase-Based HMM (PBHMM).
For this model, we have developed equiva-
lents of the forward and backward algorithms,
Viterbi search and forward-backward parameter re-
estimation. Our notation is shown in Table 1.
Here, S is the state space, and the observation se-
quences come from the alphabet K.  j is the prob-
ability of beginning in state j. The transition prob-
ability ai;j is the probability of transitioning from
state i to state j. bi;j; k is the probability of emitting
(the non-empty) observation sequence  k while tran-
sitioning from state i to state j. Finally, xt denotes
the state after emitting t symbols.
The full derivation of the model is too lengthy to
include; the interested reader is directed to (Daum´e
III and Marcu, 2002b) for the derivations and proofs
of the formulae. To assist the reader in understand-
ing the mathematics, we follow the same notation as
(Manning and Schutze, 2000). The formulae for the
calculations are summarized in Table 2.
2.2.1 Forward algorithm
The forward algorithm calculates the probability of
an observation sequence. We define  j(t) as the
probability of being in state j after emitting the first
t  1 symbols (in whatever grouping we want).
2.2.2 Backward algorithm
Just as we can compute the probability of an obser-
vation sequence by moving forward, so can we cal-
culate it by going backward. We define  i(t) as the
probability of emitting the sequence oTt given that
we are starting out in state i.
2.2.3 Best path
We define a path as a sequence P = hp1 ::: pLi such
that pi is a tuple ht;xi where t corresponds to the
last of the (possibly multiple) observations made,
and x refers to the state we were coming from when
we output this observation (phrase). Thus, we want
to find:
argmax
P
Pr  P oT1 ;  = argmax
P
Pr  P;oT1   
To do this, as in a traditional HMM, we estimate
the  table. When we calculate  j(t), we essentially
need to choose an appropriate i and t0, which we
store in another table, so we can calculate the actual
path at the end.
2.2.4 Parameter re-estimation
We want to find the model  which best explains
observations. There is no known analytic solution
for standard HMMs, so we are fairly safe in assum-
ing that we will not find an analytic solution for this
more complex problem. Thus, we also revert to an
iterative hill-climbing solution analogous to Baum-
Welch re-estimation (i.e., the Forward Backward al-
gorithm). The equations for the re-estimated values
^a and ^b are shown in Table 2.
2.2.5 Dirichlet Priors
Using simple maximum likelihood estimation is in-
adequate for this model: the maximum likelihood
solution is simply to make phrases as long as pos-
sible; unfortunately, doing so will first cut down on
the number of probabilities that need to be multi-
plied and second make nearly all observed summary
phrase/document phrase alignments unique, thus re-
sulting in rewrite probabilities of 1 after normaliza-
tion. In order to account for this, instead of finding
the maximum likelihood solution, we instead seek
the maximum a posteriori solution.
The distributions we deal with in HMMs, and,
in particular, PBHMMs, are all multinomial. The
Dirichlet distribution is in the conjugate family to
the multinomial distribution3. This makes Dirich-
let priors very appealing to work with, so long as
3This effectively means that the product of a Dirichlet and
multinomial yields a multinomial.
S set of states
K output alphabet
 = f j : j 2 Sg initial state probabilities
A = fai;j : i;j 2 Sg transition probabilities
B = fbi;j; k : i;j 2 S;  k 2 K+g emission probabilities
Table 1: Notation used for the PBHMM
 j(t) = Pr  ot 11 ;xt 1 = j   =
t 1X
t0=0
X
i2S
 
 i(t0 + 1)  ai;j  bi;j;ot
t0+1
 
 i(t) = Pr  oTt  ;xt 1 = i =
TX
t0=t
X
j2S
 
ai;j  bi;j;ot0
t
  j(t0 + 1)
 
 j(t) = max
l;pl 11
Pr
 
pl 11 ;ot 11 ;pl:t = t  1;pl:x = j  
 
=  i(t0)ai;jbi;j;ot 1
t0
 i;j(t0;t) = E  # of transitions i; j emitting ott0 =
 i(t0)ai;jbi;j;ot
t0
 j(t + 1)
Pr oT1   
^ai;j = E [# of transitions i ;j]E [# of transitions i;?] =
PT
t0=1
PT
t=t0  i;j(t
0;t)
PT
t0=1
PT
t=t0
P
j02S  i;j0(t0;t)
^bi;j; k = E
 # of transitions i;j with  k observed 
E [# of transitions i ;j] =
PT+1 j kj
t=1  ( k;o
t+j kj 1
t ) i;j(t;t + j kj 1)P
T
t0=1
PT
t=t0  i;j(t0;t)
Table 2: Summary of equations for a PBHMM
we can adequately express our prior beliefs in their
form. (See (Gauvain and Lee, 1994) for the appli-
cation to standard HMMs.)
Applying a Dirichlet prior effectively allows us to
add “fake counts” during parameter re-estimation,
according to the prior. The prior we choose has a
form such that fake counts are added as follows:
word-to-word rewrites get an additional count of 2;
identity rewrites get an additional count of 4; stem-
identity rewrites get an additional count of 3.
2.3 Constructing the PBHMM
Given our generative story, we construct a PBHMM
to calculate these probabilities efficiently. The
structure of the PBHMM for a given document is
conceptually simple. We provide values for each of
the following: the set of possible states S; the out-
put alphabet K; the initial state probabilities  ; the
transition probabilities A; and the emission proba-
bilities B.
2.3.1 State Space
The state set is large, but structured. There is a
unique initial state p, a unique final state q, and a
state for each possible document phrase. That is, for
all 1  i  i0  n, there is a state that corresponds
to the document phrase beginning at position i and
ending at position i0, di0i , which we will refer to as
ri;i0. There is also a null state for each document po-
sition ra0 ;i, so that when jumping out of a null state,
we can remember what our previous position in the
document was. Thus, S = fp;qg [fri;i0 : 1  i  
i0  ng [ fra0 ;i : 1  i  ng. Figure 2 shows the
schematic drawing of the PBHMM constructed for
the document “a b”. K, the output alphabet, con-
sists of each word found in S, plus the token !.
2.3.2 Initial State Probabilities
For initial state probabilities: since p is our initial
state, we say that  p = 1 and that  r = 0 for all
r 6= p.
2.3.3 Transition Probabilities
The transition probabilities A are governed by the
jump table. Each possible jump type and it’s as-
sociated probability is shown in Table 3. By these
calculations, regardless of document phrase lengths,
transitioning forward between two consecutive seg-
ments will result in jump (1). When transitioning
jump(2)
jump(1)
jump(1)
jump(0)
jump(2)
jump(1)
jump(2)
jump(0)
jump(1)
ba
ab
b
qp
jump(1)jump(   )
a
Figure 2: Schematic drawing of the PBHMM (with some transition probabilities) for the document “a b”
source target probability
p ri;i0 jump (i)
ri;i0 rj;j0 jump (j  i0)
ri;j0 q jump (m + 1  i0)
p ra0 ;i jump (?) jump (i)
ra0 ;i rj;j0 jump (j  i)
ra0 ;i ra0 ;j jump (?) jump (j  i)
ra0 ;i q jump (m + 1  i)
ri;i0 ra0 ;j jump (?) jump (j  i0)
Table 3: Jump probability decomposition
from p to ri;i0, the value ap;ri;i0 = jump (i). Thus,
if we begin at the first word in the document, we
incur a transition probability of jump (1). There are
no transitions into p.
2.3.4 Rewrite Probabilities
Just as the transition probabilities are governed by
the jump table, the emission probabilities B are
governed by the rewrite table. In general, we write
bx;y; k to mean the probability of generating  k while
transitioning from state x to state y. However, in
our case we do not need the x parameter, so we
will refer to these as bj; k, the probability of generat-
ing  k when jumping into state j. When j = ri;i0,
this is rewrite
  
k di0i
 
. When j = ra0 ;i, this is
rewrite   k ? . Finally, any state transitioning into
q generates the phrase h!i with probability 1 and
any other phrase with probability 0.
Consider again the document “a b” (the PBHMM
for which is shown in Figure 2) in the case when
the corresponding summary is “c d”. Suppose the
correct alignment is that “c d” is aligned to “a” and
“b” is left unaligned. Then, the path taken through
the PBHMM is p ! a ! q. During the transi-
tion p ! a, “c d” is emitted. During the transition
a ! q, ! is emitted. Thus, the probability for the
alignment is: jump (1) rewrite (“cd” “a”) jump (2).
The rewrite probabilities themselves are gov-
erned by a mixture model with unknown mixing pa-
rameters. There are three mixture component, each
of which is represented by a multinomial. The first
is the standard word-for-word and phrase-for-phrase
table seen commonly in machine translation, where
rewrite   s  d is simply a normalized count of how
many times we have seen  s aligned to  d. The sec-
ond is a stem-based table, in which suffixes (using
Porter’s stemmer) of the words in  s and  d are thrown
out before a comparison is made. The third is a
simple identity function, which has a constant zero
value when  s and  d are different (up to stem) and
a constant non-zero value when they have the same
stem. The mixing parameters are estimated simul-
taneously during EM.
2.3.5 Parameter Initialization
Instead of initializing the jump and rewrite tables
randomly or uniformly, as it typically done with
HMMs, we initialize the tables according to the dis-
tribution specified by the prior. This is not atypi-
cal practice in problems in which a MAP solution is
sought.
3 Evaluation and Results
In this section, we describe an intrinsic evaluation of
the PBHMM document/abstract alignment model.
All experiments in this paper are done on the Ziff-
Davis corpus (statistics are in Table 4). In order
to judge the quality of the alignments produced by
a system, we first need to create a set of “gold
standard” alignments. Two human annotators man-
ually constructed such alignments between docu-
ments and their abstracts. Software for assisting this
process was developed and is made freely available.
An annotation guide, which explains in detail the
document/abstract alignment process was also pre-
pared and is freely available.4
4Both the software and documentation are available on the
first author’s web page. The alignments are also available; con-
tact the authors for a copy.
Abstracts Extracts
Documents 2033
Sentences 13k 41k
Words 261k 1m
Types 14k 26k
29k
Sentences/Doc 6:28 21:51
Words/Doc 128:52 510:99
Words/Sent 20:47 23:77
Table 4: Ziff-Davis extract corpus statistics
3.1 Human Annotation
From the Ziff-Davis corpus, we randomly selected
45 document/abstract pairs and had both annotators
align them. The first five were annotated separately
and then discussed; the last 40 were done indepen-
dently.
Annotators were asked to perform phrase-to-
phrase alignments between abstracts and documents
and to classify each alignment as either possible P
or sure S, where P  S. In order to calculate
scores for phrase alignments, we convert all phrase
alignments to word alignments. That is, if we have
an alignment between phrases A and B, then this
induces word alignments between a and b for all
words a 2 A and b 2 B. Given an alignment A,
we could calculate precision and recall as (see (Och
and Ney, 2003)):
Precision = jA\PjjAj Recall = jA\SjjSj
One problem with these definitions is that phrase-
based models are fond of making phrases. That is,
when given an abstract containing “the man” and a
document also containing “the man,” a human may
prefer to align “the” to “the” and “man” to “man.”
However, a phrase-based model will almost always
prefer to align the entire phrase “the man” to “the
man.” This is because it results in fewer probabili-
ties being multiplied together.
To compensate for this, we define soft precision
(SoftP in the tables) by counting alignments where
“a b” is aligned to “a b” the same as ones in which
“a” is aligned to “a” and “b” is aligned to “b.” Note,
however, that this is not the same as “a” aligned to
“a b” and “b” aligned to “b”. This latter alignment
will, of course, incur a precision error. The soft pre-
cision metric induces a new, soft F-Score, labeled
SoftF.
Often, even humans find it difficult to align func-
tion words and punctuation. A list of 58 function
words and punctuation marks which appeared in the
corpus (henceforth called the ignore-list) was as-
sembled. Agreement and precision/recall have been
calculated both on all words and on all words that
do not appear in the ignore-list.
Annotator agreement was strong for Sure align-
ments and fairly weak for Possible alignments (con-
sidering only the 40 independently annotated pairs).
When considering only Sure alignments, the kappa
statistic (over 7:2 million items, 2 annotators and 2
categories) for agreement was 0:63. When words
from the ignore-list were thrown out, this rose to
0:68. Carletta (1995) suggests that kappa values
over 0:80 reflect very strong agreement and that
kappa values between 0:60 and 0:80 reflect good
agreement.
3.2 Machine Translation Experiments
In order to establish a baseline alignment model,
we used the IBM Model 4 (Brown et al., 1993)
and the HMM model (Stephan Vogel and Tillmann,
1996) as implemented in the GIZA++ package (Och
and Ney, 2003). We modified this slightly to allow
longer inputs and higher fertilities.
Such translation models require that input be in
sentence-aligned form. In the summarization task,
however, one abstract sentence often corresponds
to multiple document sentences. In order to over-
come this problem, each sentence in an abstract was
paired with three sentences from the corresponding
document, selected using the techniques described
by Marcu (1999). In an informal evaluation, 20 such
pairs were randomly extracted and evaluated by a
human. Each pair was ranked as 0 (document sen-
tences contain little-to-none of the information in
the abstract sentence), 1 (document sentences con-
tain some of the information in the abstract sen-
tence) or 2 (document sentences contain all of the
information). Of the twenty random examples, none
were labeled as 0; five were labeled as 1; and 15
were labeled as 2, giving a mean rating of 1:75.
We ran experiments using the document sen-
tences as both the source and the target language
in GIZA++. When document sentences were used
as the target language, each abstract word needed
to produce many document words, leading to very
high fertilities. However, since each target word is
generated independently, this led to very flat rewrite
tables and, hence, to poor results. Performance in-
creased dramatically by using the document as the
source language and the abstract as the target lan-
guage.
In all MT cases, the corpus was appended with
one-word sentence pairs for each word where that
word is translated as itself. In the two basic mod-
els, HMM and Model 4, the abstract sentence is the
source language and the document sentences are the
target language. To alleviate the fertility problem,
we also ran experiments with the translation going
in the opposite direction. These are called HMM-
flipped and Model 4-flipped, respectively. These
tend to out-perform the original translation direc-
tion. In all of these setups, 5 iterations of Model
1 were run, followed by 5 iterations of the HMM
model. In the Model 4 cases, 5 iterations of Model
4 were run, following the HMM.
3.3 Cut and Paste Experiments
We also tested alignments using the Cut and
Paste summary decomposition method (Jing, 2002),
based on a non-trainable HMM. Briefly, the Cut and
Paste HMM searches for long contiguous blocks of
words in the document and abstract that are iden-
tical (up to stem). The longest such sequences are
aligned. By fixing a length cutoff of n and ignoring
sequences of length less than n, one can arbitrarily
increase the precision of this method. We found that
n = 2 yields the best balance between precision and
recall (and the highest F-measure). The results of
these experiments are shown under the header “Cut
& Paste.” It clearly outperforms all of the MT-based
models.
3.4 PBHMM Experiments
While the PBHMM is based on a dynamic program-
ming algorithm, the effective search space in this
model is enormous, even for moderately sized doc-
ument/abstract pairs. We selected the 2000 shortest
document/abstract pairs from the Ziff-Davis corpus
for training; however, only 12 of the hand-annotated
documents were included in this set, so we addition-
ally added the other 33 hand-annotate documents to
this set, yielding 2033 document/abstract pairs. We
then performed sentence extraction on this corpus
exactly as in the MT case, using the technique of
(Marcu, 1999). The relevant data for this corpus is
in Table 4. We also restrict the state-space with a
beam, sized at 50% of the unrestricted state-space.
The PBHMM system was then trained on this ab-
stract/extract corpus. The precision/recall results
are shown in Table 5. Under the methodology for
combining the two human annotations by taking the
union, either of the human scores would achieve a
System SoftP Recall SoftF
Human1 0.727 0.746 0.736
Human2 0.680 0.695 0.687
HMM 0.120 0.260 0.164
Model 4 0.117 0.260 0.161
HMM-flipped 0.295 0.250 0.271
Model 4-flipped 0.280 0.247 0.262
Cut & Paste 0.349 0.379 0.363
PBHMM 0.456 0.686 0.548
PBHMM O 0.523 0.686 0.594
Table 5: Results on the Ziff-Davis corpus
precision and recall of 1:0. To give a sense of how
well humans actually perform on this task (in addi-
tion to the kappa scores reported earlier), we com-
pare each human against the other.
One common precision mistake made by the
PBHMM system is to accidentally align words on
the summary side to words on the document side,
when the summary word should be null-aligned.
The PBHMMO system is an oracle system in which
system-produced alignments are removed for sum-
mary words that should be null-aligned (according
to the hand-annotated data). Doing this results in a
rather significant gain in SoftP score.
As we can see from Table 5, none of the ma-
chine translation models is well suited to this task,
achieving, at best, an F-score of 0:298. The Cut &
Paste method performs significantly better, which is
to be expected, since it is designed specifically for
summarization. As one would expect, this method
achieves higher precision than recall, though not by
very much. Our method significantly outperforms
both the IBM models and the Cut & Paste method,
achieving a precision of 0:456 and a recall nearing
0:7, yielding an overall F-score of 0:548.
4 Conclusions and Future Work
Despite the success of our model, it’s performance
still falls short of human performance (we achieve
an F-score of 0:548 while humans achieve 0:736).
Moreover, this number for human performance is
a lower-bound, since it is calculated with only one
reference, rather than two.
We have begun to perform a rigorous error anal-
ysis of the model to attempt to identify its deficien-
cies: currently, these appear to primarily be due to
the model having a zeal for aligning identical words.
This happens for one of two reasons: either a sum-
mary word should be null-aligned (but it is not),
or a summary word should be aligned to a differ-
ent, non-identical document word. We can see the
PBHMMO model as giving us an upper bound on
performance if we were to fix this first problem. The
second problem has to do either with synonyms that
do not appear frequently enough for the system to
learn reliable rewrite probabilities, or with corefer-
ence issues, in which the system chooses to align,
for instance, “Microsoft” to “Microsoft,” rather than
“Microsoft” to “the company,” as might be correct
in context. Clearly more work needs to be done
to fix these problems; we are investigating solving
the first problem by automatically building a list of
synonyms from larger corpora and using this in the
mixture model, and the second problem by inves-
tigating the possibility of including some (perhaps
weak) coreference knowledge into the model.
Finally, we are looking to incorporate the results
of this model into a real system. This can be done ei-
ther by using the word-for-word alignments to auto-
matically build sentence-to-sentence alignments for
training a sentence extraction system (in which case
the precision/recall numbers over full sentences are
likely to be much higher), or by building a system
that exploits the word-for-word alignments explic-
itly.
5 Acknowledgments
This work was partially supported by DARPA-ITO
grant N66001-00-1-9814, NSF grant IIS-0097846,
and a USC Dean Fellowship to Hal Daum´e III.
Thanks to Franz Josef Och and Dave Blei for dis-
cussions related to the project.

References
Michele Banko, Vibhu Mittal, and Michael Witbrock. 2000. Headline generation based on statistical translation. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL–2000), pages 318–325, Hong Kong, October 1–8.
Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.
Jean Carletta. 1995. Assessing agreement on classification tasks: the kappa statistic. Computational Linguistics, 22(2):249–254.
Hal Daum´e III and Daniel Marcu. 2002a. A noisy-channel model for document compression. In Proceedings of the Conference of the Association of Computational Linguistics (ACL 2002).
Hal Daum´e III and Daniel Marcu. 2002b. A phrase-based HMM. Unpublished; available at http://www.isi.edu/˜hdaume/docs/daume02pbhmm.ps, December.
J. Gauvain and C. Lee. 1994. Maximum a-posteriori estimation for multivariate gaussian mixture observations of markov chains. IEEE Transactions SAP, 2:291–298.
Hongyan Jing and Kathleen R. McKeown. 1999. The decomposition of human-written summary sentences. In Proceedings of the 22nd Conference on Research and Development in Information Retrieval (SIGIR–99), Berkeley, CA, August 15–19.
Hongyan Jing. 2002. Using hidden markov modeling to decompose human-written summaries. Computational Linguistics, 28(4):527 – 544, December.
Kevin Knight and Daniel Marcu. 2002. Summarization beyond sentence extraction: A probabilistic approach to sentence compression. Artificial Intelligence, 139(1).
Christopher Manning and Hinrich Schutze. 2000. Foundations of Statistical Natural Language Processing. The MIT Press.
Daniel Marcu. 1999. The automatic construction of large-scale corpora for summarization research. In Proceedings of the 22nd Conference on Research and Development in Information Retrieval (SIGIR–99), pages 137–144, Berkeley, CA, August 15–19.
Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.
Hermann Ney Stephan Vogel and Christoph Tillmann. 1996. HMM-based word alignment in statistical translation. In COLING ’96: The 16th Int. Conf. on Computational Linguistics, pages 836–841.
