<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3216"> <Title>A Phrase-Based HMM Approach to Document/Abstract Alignment</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Designing a Model </SectionTitle> <Paragraph position="0"> As observed in Figure 1, our model needs to be able to account for phrase-to-phrase alignments. It also needs to be able to align abstract phrases with arbitrary parts of the document, and not require a monotonic, left-to-right alignment.1</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 The Generative Story </SectionTitle> <Paragraph position="0"> The model we propose calculates the probability of an alignment/abstract pair in a generative fashion, generating the summary S = hs1 ::: smi from the document D = hd1 ::: dni.</Paragraph> <Paragraph position="1"> In a document/abstract corpus that we have aligned by hand (see Section 3), we have observed that 16% of abstract words are left unaligned. Our model assumes that these &quot;null-generated&quot; words and phrases are produced by a unique document word ?, called the &quot;null word.&quot; The parameters of our model are stored in two tables: a rewrite/paraphrase table and a jump table. The rewrite table stores probabilities of producing summary words/phrases from document words/phrases and from the null word (namely, probabilities of the form rewrite s d and rewrite ( s ?)); the jump table stores the probabilities of moving within a document from one position to another, and from and to ?.</Paragraph> <Paragraph position="2"> The generation of a summary from a document is assumed to proceed as follows: 1In the remainder of the paper, we will use the words &quot;summary&quot; and &quot;abstract&quot; interchangeably. This is because we wish to use the letter s to refer to summaries. We could use the letter a as an abbreviation for &quot;abstract&quot;; however, in the definition of the Phrase-Based HMM, we reuse common notation which ascribes a different interpretation to a.</Paragraph> <Paragraph position="3"> 1. Choose a starting index i and jump to position di in the document with probability jump (i). (If the first summary phrase is nullgenerated, jump to the null-word with probability jump (?).) 2. Choose a document phrase of length k 0 and a summary phrase of length l 1. Generate summary words sl1 from document words di+ki with probability rewrite sl1 di+ki 3. Choose a new document index i0 and jump to position di0 with probability jump (i0 (i + k)) (or, if the new document position is the empty state, then jump (?)).</Paragraph> <Paragraph position="4"> 4. Choose k0 and l0 as in step 2, and generate the summary words s1+l+l01+l from the document words di0+k0i0 with probability rewrite s1+l+l01+l di0+k0i0 .</Paragraph> <Paragraph position="5"> 5. Repeat from step 3 until the entire summary has been generated.</Paragraph> <Paragraph position="6"> 6. Jump to position dn+1 in the document with probability jump (n + 1 (i0 + k0)).</Paragraph> <Paragraph position="7"> Note that such a formulation allows the same document word/phrase to generate many summary words: unlike machine translation, where such behavior is typically avoided, in summarization, we observe that such phenomena do occur. 
<Paragraph position="8"> The formal mathematical model behind the alignments is as follows. An alignment α defines both a segmentation of the summary S and a mapping from the segments of S to the segments of the document D. We write s_i to refer to the i-th segment of S, and M to refer to the total number of segments in S. We write d_{α(i)} to refer to the words in the document that correspond to segment s_i. Then, the probability of a summary/alignment pair given a document, Pr(S, α | D), becomes: </Paragraph>
<Paragraph position="10"> Pr(S, α | D) = ∏_{i=1}^{M+1} jump(α(i) | α(i - 1)) · rewrite(s_i | d_{α(i)}) </Paragraph>
<Paragraph position="11"> Here, we implicitly define s_{M+1} to be the end-of-document token ⟨!⟩ and d_{α(M+1)} to generate this token with probability 1. We also define the initial position in the document, α(0), to be 0, and we assume a uniform prior on segmentations. </Paragraph>
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 2.2 The Mathematical Model </SectionTitle>
<Paragraph position="0"> Having decided to use this model, we must now find a way to train it efficiently. The model is very much like a Hidden Markov Model in which the summary is the observed sequence. However, a standard HMM would not allow us to account for phrases in the summary. We therefore extend the standard HMM to allow multiple observations to be emitted on a single transition. We call this model a Phrase-Based HMM (PBHMM). </Paragraph>
<Paragraph position="1"> For this model, we have developed equivalents of the forward and backward algorithms, Viterbi search and forward-backward parameter re-estimation. Our notation is shown in Table 1. </Paragraph>
<Paragraph position="2"> Here, S is the state space, and the observation sequences come from the alphabet K. π_j is the probability of beginning in state j. The transition probability a_{i,j} is the probability of transitioning from state i to state j. b_{i,j,k} is the probability of emitting the (non-empty) observation sequence k while transitioning from state i to state j. Finally, x_t denotes the state after emitting t symbols. </Paragraph>
<Paragraph position="3"> The full derivation of the model is too lengthy to include; the interested reader is directed to (Daumé III and Marcu, 2002b) for the derivations and proofs of the formulae. To assist the reader in understanding the mathematics, we follow the same notation as (Manning and Schütze, 2000). The formulae for the calculations are summarized in Table 2. </Paragraph>
<Paragraph position="4"> The forward algorithm calculates the probability of an observation sequence. We define α_j(t) as the probability of being in state j after emitting the first t - 1 symbols (in whatever grouping we want). </Paragraph>
<Paragraph position="5"> Just as we can compute the probability of an observation sequence by moving forward, so can we calculate it by going backward. We define β_i(t) as the probability of emitting the sequence o_t^T given that we are starting out in state i. </Paragraph>
<Paragraph position="6"> We define a path as a sequence P = ⟨p_1 ... p_L⟩ such that p_i is a tuple ⟨t, x⟩, where t corresponds to the last of the (possibly multiple) observations made and x refers to the state we were coming from when we emitted this observation (phrase). Thus, we want to find: </Paragraph>
<Paragraph position="7"> P* = argmax_P Pr(P | o_1 ... o_T) </Paragraph>
<Paragraph position="8"> To do this, as in a traditional HMM, we estimate the δ table. When we calculate δ_j(t), we essentially need to choose an appropriate i and t', which we store in another (backpointer) table so that we can recover the actual path at the end. </Paragraph>
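As a rough illustration of how the forward recursion generalizes to phrase emissions, the sketch below computes forward probabilities for a PBHMM with a bounded phrase length. The callables pi, a and b, the max_phrase bound, and the indexing convention (alpha[j][t] after t emitted observations, rather than the paper's t - 1 convention) are our assumptions, not a transcription of the paper's Table 2.

```python
# Sketch of a PBHMM forward pass.  pi(j), a(i, j) and b(i, j, phrase) are
# assumed callables for the initial, transition and emission tables; one
# transition may emit a phrase of up to max_phrase observations, which is
# what distinguishes the PBHMM from a standard HMM.

def pbhmm_forward(obs, states, pi, a, b, max_phrase=3):
    T = len(obs)
    alpha = {j: [0.0] * (T + 1) for j in states}
    for j in states:
        alpha[j][0] = pi(j)                       # before anything is emitted
    for t in range(1, T + 1):
        for j in states:
            total = 0.0
            # the phrase obs[t0:t] is emitted while transitioning from i to j
            for t0 in range(max(0, t - max_phrase), t):
                phrase = tuple(obs[t0:t])
                for i in states:
                    total += alpha[i][t0] * a(i, j) * b(i, j, phrase)
            alpha[j][t] = total
    return alpha    # the mass in the designated final state(s) at time T
                    # gives the probability of the observation sequence
```

The Viterbi recursion is obtained from the same loops by replacing the sums with maximizations and recording, for each cell, the best (i, t') pair as a backpointer, which is the table mentioned above.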
<Paragraph position="9"> We want to find the model that best explains the observations. There is no known analytic solution for standard HMMs, so we are fairly safe in assuming that we will not find one for this more complex problem. Thus, we also resort to an iterative hill-climbing solution analogous to Baum-Welch re-estimation (i.e., the forward-backward algorithm). The equations for the re-estimated values â and b̂ are shown in Table 2. </Paragraph>
<Paragraph position="10"> Simple maximum likelihood estimation is inadequate for this model: the maximum likelihood solution is simply to make phrases as long as possible. Unfortunately, doing so first cuts down on the number of probabilities that need to be multiplied and second makes nearly all observed summary phrase/document phrase alignments unique, thus resulting in rewrite probabilities of 1 after normalization. To account for this, instead of the maximum likelihood solution, we seek the maximum a posteriori (MAP) solution. </Paragraph>
<Paragraph position="11"> The distributions we deal with in HMMs, and in particular in PBHMMs, are all multinomial. The Dirichlet distribution is the conjugate family to the multinomial distribution. This makes Dirichlet priors very appealing to work with, so long as we can adequately express our prior beliefs in their form. (See (Gauvain and Lee, 1994) for the application to standard HMMs.) Applying a Dirichlet prior effectively allows us to add "fake counts" during parameter re-estimation, according to the prior. The prior we choose has a form such that fake counts are added as follows: word-to-word rewrites get an additional count of 2; identity rewrites get an additional count of 4; stem-identity rewrites get an additional count of 3. </Paragraph>
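As a rough sketch of how such fake counts could enter re-estimation, the code below adds prior counts to expected rewrite counts before normalizing. The helper names, the assumption that the three count types are mutually exclusive rather than additive, and the treatment of the null word are ours, not the paper's.

```python
from collections import defaultdict

# Fake counts from the prior, as stated above: identity rewrites +4,
# stem-identity rewrites +3, other word-to-word rewrites +2.
# Whether these counts stack in the paper is not specified; here they do not.

def map_rewrite_estimates(expected_counts, stem):
    """expected_counts[(s, d)]: expected count of rewriting document phrase d
    as summary phrase s, collected during forward-backward (assumed given).
    stem: a stemming function, e.g. a Porter stemmer."""
    counts = defaultdict(float)
    for (s, d), c in expected_counts.items():
        prior = 0.0
        if d is not None and " " not in s and " " not in d:   # single words only
            if s == d:
                prior = 4.0                                    # identity
            elif stem(s) == stem(d):
                prior = 3.0                                    # stem identity
            else:
                prior = 2.0                                    # word-to-word
        counts[(s, d)] = c + prior
    # normalize per document phrase to obtain rewrite(s | d)
    totals = defaultdict(float)
    for (s, d), c in counts.items():
        totals[d] += c
    return {(s, d): (c / totals[d] if totals[d] > 0 else 0.0)
            for (s, d), c in counts.items()}
```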
</Section>
<Section position="3" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 2.3 Constructing the PBHMM </SectionTitle>
<Paragraph position="0"> Given our generative story, we construct a PBHMM to calculate these probabilities efficiently. The structure of the PBHMM for a given document is conceptually simple. We provide values for each of the following: the set of possible states S; the output alphabet K; the initial state probabilities π; the transition probabilities A; and the emission probabilities B. </Paragraph>
<Paragraph position="1"> The state set is large, but structured. There is a unique initial state p, a unique final state q, and a state for each possible document phrase. That is, for all 1 ≤ i ≤ i' ≤ n, there is a state r_{i,i'} that corresponds to the document phrase d_i^{i'} beginning at position i and ending at position i'. There is also a null state r_{∅,i} for each document position i, so that when jumping out of a null state we can remember what our previous position in the document was. Thus, S = {p, q} ∪ {r_{i,i'} : 1 ≤ i ≤ i' ≤ n} ∪ {r_{∅,i} : 1 ≤ i ≤ n}. Figure 2 shows a schematic drawing of the PBHMM constructed for the document "a b". K, the output alphabet, consists of each word found in S, plus the token ⟨!⟩. </Paragraph>
<Paragraph position="2"> For the initial state probabilities: since p is our initial state, we set π_p = 1 and π_r = 0 for all r ≠ p. </Paragraph>
<Paragraph position="3"> The transition probabilities A are governed by the jump table. Each possible jump type and its associated probability is shown in Table 3. Under these calculations, regardless of document phrase lengths, transitioning forward between two consecutive segments results in jump(1). When transitioning from p to r_{i,i'}, the value a_{p,r_{i,i'}} = jump(i). Thus, if we begin at the first word in the document, we incur a transition probability of jump(1). There are no transitions into p. </Paragraph>
<Paragraph position="8"> Just as the transition probabilities are governed by the jump table, the emission probabilities B are governed by the rewrite table. In general, we write b_{x,y,k} to mean the probability of generating k while transitioning from state x to state y. However, in our case we do not need the x parameter, so we will write b_{j,k} for the probability of generating k when jumping into state j. When j = r_{i,i'}, this is rewrite(k | d_i^{i'}). When j = r_{∅,i}, this is rewrite(k | ∅). Finally, any state transitioning into q generates the phrase ⟨!⟩ with probability 1 and any other phrase with probability 0. </Paragraph>
<Paragraph position="9"> Consider again the document "a b" (the PBHMM for which is shown in Figure 2) in the case when the corresponding summary is "c d". Suppose the correct alignment is that "c d" is aligned to "a" and "b" is left unaligned. Then, the path taken through the PBHMM is p → a → q. During the transition p → a, "c d" is emitted. During the transition a → q, ⟨!⟩ is emitted. Thus, the probability of the alignment is: jump(1) · rewrite("c d" | "a") · jump(2). The rewrite probabilities themselves are governed by a mixture model with unknown mixing parameters. There are three mixture components, each of which is represented by a multinomial. The first is the standard word-for-word and phrase-for-phrase table seen commonly in machine translation, where rewrite(s | d) is simply a normalized count of how many times we have seen s aligned to d. The second is a stem-based table, in which the suffixes of the words in s and d (found using Porter's stemmer) are removed before a comparison is made. The third is a simple identity function, which has a constant zero value when s and d are different (up to stem) and a constant non-zero value when they have the same stem. The mixing parameters are estimated simultaneously during EM. </Paragraph>
<Paragraph position="10"> Instead of initializing the jump and rewrite tables randomly or uniformly, as is typically done with HMMs, we initialize the tables according to the distribution specified by the prior. This is not atypical practice in problems in which a MAP solution is sought. </Paragraph>
</Section>
</Section>
</Paper>
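To tie the construction together, here is a small illustrative sketch (ours, with toy probability values that are not from the paper) that enumerates the state set described in Section 2.3 for a document of length n and reproduces the worked "a b" / "c d" alignment score jump(1) · rewrite("c d" | "a") · jump(2).

```python
# Sketch of the PBHMM state space and the worked example from Section 2.3.
# State encoding is ours: ("p",) and ("q",) are the start/final states,
# ("r", i, j) is the state for document phrase d_i..d_j, and ("r", None, i)
# is the null state remembering document position i.

def build_states(n):
    states = [("p",), ("q",)]
    states += [("r", i, j) for i in range(1, n + 1) for j in range(i, n + 1)]
    states += [("r", None, i) for i in range(1, n + 1)]
    return states

def example_alignment_score(jump, rewrite):
    # Path p -> r(1,1) -> q for document "a b" and summary "c d",
    # with "c d" aligned to "a" and "b" left unaligned.
    return jump(1) * rewrite("c d", "a") * jump(2)

print(build_states(2))            # 7 states for the document "a b"
# Toy table values (ours): jump(1)=0.5, jump(2)=0.2, rewrite("c d"|"a")=0.1
print(example_alignment_score({1: 0.5, 2: 0.2}.get, lambda s, d: 0.1))  # 0.01
```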