<?xml version="1.0" standalone="yes"?> <Paper uid="J05-4004"> <Title>Induction of Word and Phrase Alignments for Automatic Document Summarization</Title> <Section position="3" start_page="508" end_page="508" type="metho"> <SectionTitle> 2. Human-produced Alignments </SectionTitle> <Paragraph position="0"> In order to decide how to design an alignment model and to judge the quality of the alignments produced by a system, we first need to create a set of &quot;gold standard&quot; alignments. To this end, we asked two human annotators to manually construct such alignments between documents and their abstracts. These <document, abstract> pairs were drawn from the Ziff-Davis collection (Marcu 1999). Of the roughly 7,000 documents in that corpus, we randomly selected 45 pairs for annotation. We added to this set of 45 pairs the 2,000 shorter documents from this collection, and all the work described in the remainder of this paper focuses on this subset of 2,033 <document, abstract> pairs.</Paragraph> <Paragraph position="1"> Statistics for this sub-corpus and for the pairs selected for annotation are shown in Table 1. As can be simply computed from this table, the compression rate in this corpus is about 12%. The first five human-produced alignments were completed separately and then discussed; the last 40 were done independently.</Paragraph> <Section position="1" start_page="508" end_page="508" type="sub_section"> <SectionTitle> 2.1 Annotation Guidelines </SectionTitle> <Paragraph position="0"> Annotators were asked to perform word-to-word and phrase-to-phrase alignments between abstracts and documents, and to classify each alignment as either possible (P) or sure (S), where S [?] P, following the methodology used in the machine translation community (Och and Ney 2003). The direction of containment (S [?] P) is because being a sure alignment is a stronger requirement than being a possible alignment. A full description of the annotation guidelines is available in a document available with the alignment software on the first author's web site (http://www.isi.edu/[?]hdaume/HandAlign).</Paragraph> <Paragraph position="1"> Here, we summarize the main points.</Paragraph> <Paragraph position="2"> The most important instruction that annotators were given was to align everything in the summary to something. This was not always possible, as we will discuss shortly, but by and large it was an appropriate heuristic. The second major instruction was to choose alignments with maximal consecutive length: If there are two possible alignments for a phrase, the annotators were instructed to choose the one that will result in the longest consecutive alignment. 
For example, in Figure 1, this rule governs the choice of the alignment of the word Macintosh on the summary side: lexically, it could be aligned to the final occurrence of the word Macintosh on the document side, but by aligning it to Apple Macintosh systems, we are able to achieve a longer consecutive sequence of aligned words.</Paragraph> <Paragraph position="3"> The remainder of the instructions have to do primarily with clarifying particular linguistic phenomena, including punctuation, anaphora (for entities, annotators are told to feel free to align names to pronouns, for instance) and metonymy, null elements, genitives, appositives, and ellipsis.</Paragraph> <Paragraph position="4"> 3 The reason there are 2,033 pairs, not 2,045, is that 12 of the original 45 pairs were among the 2,000 shortest, so the 2,033 pairs are obtained by taking the 2,000 shortest and adding to them the 33 annotated pairs that were not already among the 2,000 shortest.</Paragraph> </Section> <Section position="2" start_page="509" end_page="509" type="sub_section"> <SectionTitle> 2.2 Annotator Agreement </SectionTitle> <Paragraph position="0"> To compute annotator agreement, we employed the kappa statistic. To do so, we treat the problem as a sequence of binary decisions: given a single summary word and document word, should the two be aligned? To account for phrase-to-phrase alignments, we first converted these into word-to-word alignments using the "all pairs" heuristic. By looking at all such pairs, we wound up with 7.2 million items over which to compute the kappa statistic (with two annotators and two categories). Annotator agreement was strong for sure alignments and fairly weak for possible alignments.</Paragraph> <Paragraph position="1"> When considering only sure alignments, the kappa statistic for agreement was 0.63 (though it dropped drastically to 0.42 on possible alignments).</Paragraph> <Paragraph position="2"> In performing the annotation, we found that punctuation and non-content words are often very difficult to align (despite the discussion of these issues in the alignment guidelines). The primary difficulty with function words is that when the summarizers have chosen to reorder words to use slightly different syntactic structures, there are lexical changes that are hard to predict.</Paragraph> <Paragraph position="3"> Fortunately, for many summarization tasks, it is much more important to get content words right than function words. When words on a stop list of 58 function words and punctuation were ignored, the kappa value rose to 0.68. Carletta (1995) has suggested that kappa values over 0.80 reflect very strong agreement and that kappa values between 0.60 and 0.80 reflect good agreement.</Paragraph> </Section>
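To make the agreement computation concrete, here is a small sketch (our illustration, not the annotation toolkit) of the kappa statistic over binary align/do-not-align decisions, with phrase-to-phrase alignments first expanded by the "all pairs" heuristic; the data structures and the toy example are hypothetical.

```python
from itertools import product

def expand_all_pairs(phrase_alignments):
    """Convert phrase-to-phrase alignments into word-to-word pairs using the
    "all pairs" heuristic: every summary word in the phrase is paired with
    every document word in the phrase."""
    pairs = set()
    for summary_span, document_span in phrase_alignments:
        pairs.update(product(summary_span, document_span))
    return pairs

def kappa(annotator_a, annotator_b, n_summary_words, n_document_words):
    """Cohen's kappa over all summary-word/document-word cells, where each
    annotator's alignment is a set of (summary index, document index) pairs."""
    total = n_summary_words * n_document_words
    both = len(annotator_a & annotator_b)
    only_a = len(annotator_a - annotator_b)
    only_b = len(annotator_b - annotator_a)
    neither = total - both - only_a - only_b

    observed = (both + neither) / total
    # Chance agreement from each annotator's marginal rate of saying "aligned."
    p_a = len(annotator_a) / total
    p_b = len(annotator_b) / total
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (observed - expected) / (1 - expected)

# Toy example: a 3-word summary against a 5-word document.
a = expand_all_pairs([((0, 1), (0, 1)), ((2,), (4,))])
b = expand_all_pairs([((0,), (0,)), ((1,), (1,)), ((2,), (3,))])
print(kappa(a, b, n_summary_words=3, n_document_words=5))  # ~0.33
```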
<Section position="3" start_page="509" end_page="510" type="sub_section"> <SectionTitle> 2.3 Results of Annotation </SectionTitle> <Paragraph position="0"> After the completion of these alignments, we can investigate some of their properties. Such an investigation is interesting both from the perspective of designing a model and from a linguistic perspective.</Paragraph> <Paragraph position="1"> In the alignments, we found that roughly 16% of the abstract words are left unaligned. This figure includes both standard lexical words and punctuation. Of this 16%, 4% are punctuation marks (though not all punctuation is unaligned) and 7% are function words. The remaining 5% are words that would typically be considered content words. This rather surprising result tells us that any model we build needs to allow a reasonable portion of the abstract to have no direct correspondence to any portion of the document.</Paragraph> <Paragraph position="2"> 4 For example, the change from "I gave a gift to the boy." to "The boy received a gift from me." is relatively straightforward; however, it is a matter of opinion whether to and from should be aligned.</Paragraph> <Paragraph position="3"> To get a sense of the importance of producing alignments at the phrase level, we computed that roughly 75% of the alignments produced by humans involve only one word on both sides. In 80% of the alignments, the summary side is a single word (thus in 5% of the cases, a single summary word is aligned to more than one document word). In 6.1% of the alignments, the summary side involved a phrase of length two, and in 2.2% of the cases it involved a phrase of length three. In interpreting all these numbers, care should be taken to note that the humans were instructed to produce phrase alignments only when word alignments were impossible. Thus, it is entirely likely that summary word i is aligned to document word j and summary word i + 1 is aligned to document word j + 1, in which case we count this as two singleton alignments rather than an alignment of length two. These numbers suggest that looking at phrases in addition to words is empirically important.</Paragraph> <Paragraph position="4"> Lexical choice is another important aspect of the alignment process. Of all the aligned summary words and phrases, the corresponding document word or phrase was exactly the same as that on the summary side in 51% of the cases. When this constraint was weakened to looking only at stems (for multi-word phrases, a match meant that each corresponding word matched up to stem), this number rose to 67%. When broken down into cases of singletons and non-singletons, we saw that 86% of singletons are identical up to stem, and 48% of phrases are identical up to stem. This suggests that looking at stems, rather than lexical items, is useful.</Paragraph> <Paragraph position="5"> Finally, we investigated the issue of adjacency in the alignments. Specifically, we consider the following question: Given that a summary phrase ending at position i is aligned to a document phrase ending at position j, what is a likely position in the document for the summary phrase beginning at position i + 1? It turns out that this is overwhelmingly j + 1. In Figure 2, we have plotted the frequencies of such relative jumps over the human-aligned data. This graph suggests that a model biased toward stepping forward monotonically in the document is likely to be appropriate. However, it should also be noted that backward jumps are quite common, suggesting that a strictly monotonic alignment model is inappropriate for this task.</Paragraph> </Section> </Section>
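The adjacency statistic behind Figure 2 can be tallied with a few lines of code. The sketch below is ours and assumes the gold alignments for one <document, abstract> pair are available as (summary start, summary end, document start, document end) tuples sorted by summary position; a relative jump of 0 corresponds to the j + 1 case discussed above.

```python
from collections import Counter

def relative_jump_histogram(alignments):
    """alignments: list of (sum_start, sum_end, doc_start, doc_end) tuples, one per
    aligned phrase pair, sorted by summary position. Returns a Counter over relative
    jumps: the document start of the next phrase minus (document end of the current
    phrase + 1), so 0 means "step forward to the immediately following document word"."""
    jumps = Counter()
    for (_, _, _, prev_doc_end), (_, _, next_doc_start, _) in zip(alignments, alignments[1:]):
        jumps[next_doc_start - (prev_doc_end + 1)] += 1
    return jumps

# Toy example: three summary phrases aligned mostly monotonically.
gold = [(0, 1, 3, 4), (2, 2, 5, 5), (3, 4, 2, 3)]
print(relative_jump_histogram(gold))  # Counter({0: 1, -4: 1})
```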
<Section position="5" start_page="510" end_page="515" type="metho"> <SectionTitle> 3. Statistical Alignment Model </SectionTitle> <Paragraph position="0"> Based on the linguistic observations from the previous section, we reach several conclusions regarding the development of a statistical model to produce such alignments. First, the model should be able to produce alignments between phrases of arbitrary length (but perhaps with a bias toward single words). Second, it should not be constrained by any assumptions of monotonicity or word (or stem) identity, but it should be able to realize that monotonicity and word and stem identity are good indicators of alignment. Third, our model must be able to account for words on the abstract side that have no correspondence on the document side (following the terminology from the machine translation community, we will refer to such words as null generated).</Paragraph> <Section position="1" start_page="510" end_page="512" type="sub_section"> <SectionTitle> 3.1 Generative Story </SectionTitle> <Paragraph position="0"> Based on these observations, and with an eye toward computational tractability, we posit the following generative story for how a summary is produced, given a document: 1. Repeat until the whole summary is generated: (a) Choose a document position j and jump there. (b) Choose a document phrase length l. (c) Generate a summary phrase based on the document phrase spanning positions j to j + l. 2. Jump to the end of the document.</Paragraph> <Paragraph position="1"> In order to account for null generated summary words, we augment the above generative story with the option to jump to a specifically designated null state from which a summary phrase may be generated without any correspondence in the document. From inspection of the human-aligned data, most such null generated words are function words or punctuation; however, in some cases, there are pieces of information in the summary that truly did not exist in the original document. The null generated words can account for these as well (additionally, the null generated words allow the model to "give up" when it cannot do anything better). We require that summary phrases produced from the null state have length 1, so that in order to generate multiple null generated words, they must be generated independently.</Paragraph> <Paragraph position="2"> In Figure 3, we have shown a portion of the generative process that would give rise to the alignment in Figure 1.</Paragraph> <Paragraph position="3"> This generative story implicitly induces an alignment between the document and the summary: the summary phrase is considered to be aligned to the document phrase that "generated" it.</Paragraph> <Paragraph position="4"> Figure 3 Beginning and end of the generative process that gave rise to the alignment in Figure 1, which is reproduced here for convenience.</Paragraph> <Paragraph position="5"> In order to make this computationally tractable, we must introduce some conditional independence assumptions. Specifically, we assume the following: 1. Decision (a) in our generative story depends only on the position of the end of the current document phrase (i.e., j + l). 2. Decision (b) is conditionally independent of every other decision. 3. Decision (c) depends only on the phrase at the current document position.</Paragraph> </Section>
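The generative story can be made concrete with a short sampler. The following sketch is our illustration (not the authors' implementation); the three sampler arguments are hypothetical stand-ins for the jump and rewrite distributions defined in the next subsection.

```python
import random

def sample_summary(document, sample_jump, sample_length, sample_phrase, max_phrases=10):
    """Sample a summary according to the generative story of Section 3.1.
    sample_jump(prev_end)   -> a document position, or None for the null state
    sample_length(j)        -> a phrase length l (length 1 is forced for the null state)
    sample_phrase(doc_span) -> a list of summary words "rewritten" from doc_span
    For illustration we cap the number of generated phrases at max_phrases."""
    summary, prev_end = [], 0
    for _ in range(max_phrases):          # "repeat until the whole summary is generated"
        j = sample_jump(prev_end)         # (a) choose a document position and jump there
        if j is None:                     # null state: emit one word with no correspondence
            summary.extend(sample_phrase(None))
            continue
        l = sample_length(j)              # (b) choose a document phrase length
        summary.extend(sample_phrase(document[j:j + l]))  # (c) rewrite d_{j:j+l}
        prev_end = j + l
    return summary                        # (implicitly) jump to the end of the document

# Toy stand-ins: mostly monotone jumps, short phrases, identity rewrites.
doc = "connecting point systems tripled it 's sales of apple macintosh systems".split()
random.seed(0)
print(sample_summary(
    doc,
    sample_jump=lambda prev: None if random.random() < 0.1 else min(prev + random.choice([0, 1, 2]), len(doc) - 1),
    sample_length=lambda j: random.choice([1, 1, 2]),
    sample_phrase=lambda span: ["[null]"] if span is None else list(span),
))
```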
<Section position="2" start_page="512" end_page="512" type="sub_section"> <SectionTitle> 3.2 Statistical Model </SectionTitle> <Paragraph position="0"> Based on the generative story and independence assumptions described above, we can model the entire summary generation process according to two distributions: jump(j' | j + l), the probability of jumping to position j' in the document when the previous phrase ended at position j + l; and rewrite(s | d_{j:j+l}), the rewrite probability of generating summary phrase s given that we are considering the sub-phrase of d beginning at position j and ending at position j + l.</Paragraph> <Paragraph position="1"> Specific parameterizations of the distributions jump and rewrite will be discussed in Section 4 to enable the focus here to be on the more general problems of inference and decoding in such a model. The model described by these independence assumptions very much resembles a hidden Markov model (HMM), where states in the state space are document ranges and emissions are summary words. The difference is that instead of generating a single word in each transition between states, an entire phrase is generated. This difference is captured by the semi-Markov model or segmental HMM framework, described in great detail by Ostendorf, Digalakis, and Kimball (1996); see also Ferguson (1980), Gales and Young (1993), Mitchell, Jamieson, and Harper (1995), Smyth, Heckerman, and Jordan (1997), Ge and Smyth (2000), and Aydin, Altunbasak, and Borodovsky (2004) for more detailed descriptions of these models as well as other applications in speech processing and computational biology. In the following subsections, we will briefly discuss the aspects of inference that are relevant to our problem, but the interested reader is directed to Ostendorf, Digalakis, and Kimball (1996) for more details.</Paragraph> </Section> <Section position="3" start_page="512" end_page="513" type="sub_section"> <SectionTitle> 3.3 Creating the State Space </SectionTitle> <Paragraph position="0"> Given our generative story, we can construct a semi-HMM to calculate precisely the alignment probabilities specified by our model in an efficient manner. A semi-HMM is fully defined by a state space (with designated start and end states), an output alphabet, transition probabilities, and observation probabilities. The semi-HMM functions like an HMM: beginning at the start state, stochastic transitions are made through the state space according to the transition probabilities. At each step, one or more observations are generated. The machine stops when it reaches the end state.</Paragraph> <Paragraph position="1"> In our case, the state set is large, but well structured. There is a unique initial state <start>, a unique final state <end>, and a state for each possible document phrase. That is, for a document of length n, for all 1 ≤ i ≤ i' ≤ n, there is a state that corresponds to the document phrase beginning at position i and ending at position i', which we will refer to as r_{i,i'}. There is also a null state for each document position i, written r_{∅,i}, so that the full state set is {<start>, <end>} ∪ {r_{i,i'} : 1 ≤ i ≤ i' ≤ n} ∪ {r_{∅,i} : 1 ≤ i ≤ n}. The output alphabet consists of each word found in the summary, plus the end-of-sentence word o. We only allow the word o to be emitted on a transition to the end state. The transition probabilities are managed by the jump model, and the emission probabilities are managed by the rewrite model. Consider the document "a b" (the semi-HMM for which is shown in Figure 4) in the case when the corresponding summary is "c d". Suppose the correct alignment is that "c d" is aligned to a and b is left unaligned. Then, the path taken through the semi-HMM is <start> -> a -> <end>. During the transition <start> -> a, "c d" is emitted. During the transition a -> <end>, o is emitted.</Paragraph> </Section>
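To see how large and how structured this state space is, it helps to enumerate it for a toy document. The sketch below is ours; the state encoding (tuples for r_{i,i'} and null states) is an assumption made purely for illustration.

```python
from itertools import combinations_with_replacement

def build_state_space(doc_len, max_phrase_len=None):
    """Enumerate semi-HMM states for a document of length doc_len:
    <start>, <end>, one state r(i, i') for each document phrase with
    1 <= i <= i' <= doc_len (optionally capped at max_phrase_len),
    and one null state per document position."""
    states = ["<start>", "<end>"]
    for i, i_prime in combinations_with_replacement(range(1, doc_len + 1), 2):
        if max_phrase_len is None or i_prime - i + 1 <= max_phrase_len:
            states.append(("r", i, i_prime))
    states.extend([("null", i) for i in range(1, doc_len + 1)])
    return states

# A 5-word document with phrases capped at length 3: 19 states in total.
space = build_state_space(5, max_phrase_len=3)
print(len(space), space[:6])
# Without the cap, the number of phrase states grows quadratically in doc_len.
```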
<Section position="4" start_page="513" end_page="514" type="sub_section"> <SectionTitle> 3.4 Expectation Maximization </SectionTitle> <Paragraph position="0"> The alignment task, as described above, is a chicken-and-egg problem: if we knew the model components (namely, the rewrite and jump tables), we would be able to efficiently find the best alignment. Similarly, if we knew the correct alignments, we would be able to estimate the model components. Unfortunately, we have neither. Expectation maximization is a general technique for learning in such chicken-and-egg situations (Dempster, Laird, and Rubin 1977; Boyles 1983; Wu 1983). The basic idea is to make a guess at the alignments, and then use this guess to estimate the parameters for the relevant distributions. We can use these re-estimated distributions to make a better guess at the alignments, and then use these (ideally better) alignments to re-estimate the parameters.</Paragraph> <Paragraph position="1"> Figure 4 Schematic drawing of the semi-HMM (with some transition probabilities) for the document "a b".</Paragraph> <Paragraph position="2"> Formally, the EM family of algorithms tightly bounds the log of an expectation of a function by the expectation of the log of that function, through the use of Jensen's inequality (Jensen 1906). The tightness of the bound means that when we attempt to estimate the model parameters, we may do so over expected alignments, rather than the true (but unknown) alignments. EM gives formal guarantees of convergence, but is only guaranteed to find local maxima.</Paragraph> </Section>
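The overall training procedure is then an instance of the following generic EM skeleton (a hedged sketch, not the authors' code); expected_counts stands in for the forward-backward computation described in the next subsection.

```python
def run_em(corpus, params, expected_counts, n_iterations=10):
    """Generic EM skeleton for the alignment model.
    corpus:          iterable of (document, summary) pairs
    params:          dict mapping events to probabilities; an event is a tuple whose
                     last element is the outcome and whose prefix is the conditioning
                     context, e.g. ('jump', delta) or ('rewrite', doc_phrase, sum_phrase)
    expected_counts: function (document, summary, params) -> dict of fractional counts,
                     assumed to be implemented with the forward-backward recursions."""
    for _ in range(n_iterations):
        counts = {}
        for document, summary in corpus:                       # E-step: expected alignments
            for event, count in expected_counts(document, summary, params).items():
                counts[event] = counts.get(event, 0.0) + count
        # M-step: relative frequencies, normalizing each conditional distribution
        # (all events sharing a conditioning context are normalized together).
        totals = {}
        for event, count in counts.items():
            totals[event[:-1]] = totals.get(event[:-1], 0.0) + count
        params = {event: count / totals[event[:-1]] for event, count in counts.items()}
    return params
```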
<Section position="5" start_page="514" end_page="515" type="sub_section"> <SectionTitle> 3.5 Model Inference </SectionTitle> <Paragraph position="0"> All the inference techniques utilized in this paper are standard applications of semi-Markov model techniques. The relevant equations are summarized in Figure 5 and described here. In all these equations, the variables t and t' range over phrases in the summary (specifically, the phrase s_{t:t'}), and the variables i and j range over phrases in the document. The interested reader is directed to Ostendorf, Digalakis, and Kimball (1996) for more details on the generic form of these models and their inference techniques.</Paragraph> <Paragraph position="1"> Unfortunately, the number of possible alignments for a given <document, summary> pair is exponential in the length of the summary. This would make a naïve implementation of the computation of p(s | d) intractable without a more clever solution. Instead, we are able to employ a variant of the forward algorithm to compute these probabilities recursively. The basic idea is to compute the probability of generating a prefix of the summary and ending up at a particular position in the document (this is known as the forward probability). Since our independence assumptions tell us that it does not matter how we got to this position, we can use this forward probability to compute the probability of taking one more step in the summary. At the end, the desired probability p(s | d) is simply the forward probability of reaching the end of the summary and document simultaneously. The forward probabilities are calculated in the α table in Figure 5. This equation essentially says that the probability of emitting the first t − 1 words of the summary and ending at position j in the document can be computed by summing over our previous position (t') and previous state (i) and multiplying the probability of getting there (the corresponding α value) by the probability of moving from there to the current position.</Paragraph> <Paragraph position="2"> Figure 5 Summary of inference equations for a semi-Markov model.</Paragraph> <Paragraph position="3"> The second standard inference problem is the calculation of the best alignment: the Viterbi alignment. This alignment can be computed in exactly the same fashion as the forward algorithm, with two small changes. First, the forward probabilities implicitly include a sum over all previous states, whereas the Viterbi probabilities replace this with a max operator. Second, in order to recover the actual Viterbi alignment, we keep track of which previous state this max operator chose. This is computed by filling out the ζ table from Figure 5. This is almost identical to the computation of the forward probabilities, except that instead of summing over all possible t' and i, we take the maximum over those variables.</Paragraph> <Paragraph position="4"> The final inference problem is parameter re-estimation. In the case of standard HMMs, this is known as the Baum-Welch, Baum-Eagon, or Forward-Backward algorithm (Baum and Petrie 1966; Baum and Eagon 1967). By introducing backward probabilities analogous to the forward probabilities, we can compute alignment probabilities of suffixes of the summary. The backward table is the β table in Figure 5, which is analogous to the α table, except that the computation proceeds from the end to the start. By combining the forward and backward probabilities, we can compute the expected number of times a particular alignment was made (the E-step in the EM framework). Based on these expectations, we can simply sum and normalize to get new parameters (the M-step). The expected transitions are computed according to the τ table, which makes use of the forward and backward probabilities. Finally, the re-estimated jump probabilities are given by â and the re-estimated rewrite probabilities are given by b̂, which are essentially relative frequencies of the fractional counts given by the τ values.</Paragraph> <Paragraph position="5"> The computational complexity for the Viterbi algorithm and for the parameter re-estimation is O(N^2 T^2), where N is the length of the summary and T is the number of states (in our case, T is roughly the length of the document times the maximum phrase length allowed). However, we will typically bound the maximum length of a phrase; we are unlikely to otherwise encounter enough training data to get reasonable estimates of emission probabilities. If we enforce a maximum observation sequence length of l, then this drops to O(N T^2 l). Moreover, if the transition network is sparse, as it is in our case, and the maximum out-degree of any node is b, then the complexity drops to O(N T b l).</Paragraph> </Section> </Section>
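As a concrete illustration of the forward recursion, the following simplified sketch computes p(s | d) for a semi-Markov model with bounded phrase lengths. It is our approximation of the α recursion in Figure 5, not the paper's exact table: null states are omitted, and the jump and rewrite models are passed in as stand-in callables.

```python
from itertools import product

def forward_probability(summary, document, jump_prob, rewrite_prob,
                        max_sum_len=3, max_doc_len=3):
    """Semi-Markov forward recursion (simplified sketch): alpha[t][j] is the
    probability of having generated the first t summary words with the previous
    document phrase ending at position j. jump_prob(new_start, prev_end) and
    rewrite_prob(sum_phrase, doc_phrase) are stand-ins for the jump and rewrite
    models; position M (= len(document)) plays the role of the end state."""
    N, M = len(summary), len(document)
    alpha = [[0.0] * (M + 1) for _ in range(N + 1)]
    alpha[0][0] = 1.0
    for t, j in product(range(N), range(M + 1)):
        if alpha[t][j] == 0.0:
            continue
        for t2 in range(t + 1, min(t + max_sum_len, N) + 1):
            for j_start in range(M):
                for j2 in range(j_start + 1, min(j_start + max_doc_len, M) + 1):
                    alpha[t2][j2] += (alpha[t][j]
                                      * jump_prob(j_start, j)
                                      * rewrite_prob(tuple(summary[t:t2]),
                                                     tuple(document[j_start:j2])))
    # Probability of the summary: all N words emitted, then a final jump to the end.
    return sum(alpha[N][j] * jump_prob(M, j) for j in range(M + 1))

# Toy usage with uniform stand-in distributions.
doc, summ = "a b c".split(), "x y".split()
p = forward_probability(summ, doc,
                        jump_prob=lambda new, prev: 1.0 / (len(doc) + 1),
                        rewrite_prob=lambda s, d: 0.1)
print(p)
```

The Viterbi recursion follows the same loop structure with the inner sum replaced by a max and a backpointer table, mirroring the description above.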
<Section position="6" start_page="515" end_page="521" type="metho"> <SectionTitle> 4. Model Parameterization </SectionTitle> <Paragraph position="0"> Beyond the conditional independence assumptions made by the semi-HMM, there are nearly no additional constraints imposed on the parameterization (in terms of the jump and rewrite distributions) of the model. There is one additional technical requirement involving parameter re-estimation, which essentially says that the expectations calculated during the forward-backward algorithm must be sufficient statistics for the parameters of the jump and rewrite models. This constraint simply requires that whatever information we need to re-estimate their parameters is available to us from the forward-backward algorithm.</Paragraph> <Section position="1" start_page="515" end_page="517" type="sub_section"> <SectionTitle> 4.1 Parameterizing the Jump Model </SectionTitle> <Paragraph position="0"> Recall that the responsibility of the jump model is to compute probabilities of the form jump(j' | j), where j' is a new position and j is an old position. We have explored several possible parameterizations of the jump table. The first simply computes a table of likely jump distances (i.e., jump forward 1, jump backward 3, etc.). The second models this distribution as a Gaussian (though, based on Figure 2, this is perhaps not the best model). Both of these models have been explored in the machine translation community. Our third parameterization employs a novel syntax-aware jump model that attempts to take advantage of local syntactic information in computing jumps.</Paragraph> <Paragraph position="1"> In the relative jump model, we maintain one parameter jump_rel(i) for each possible jump distance i, and compute jump(j' | j) = jump_rel(j' − j). Each possible jump type and its associated probability is shown in Table 2. By these calculations, regardless of document phrase lengths, transitioning forward between two consecutive segments will result in jump_rel(1). When transitioning from the start state to state r_{i,i'}, the value we use is a jump length of i. Thus, if we begin at the first word in the document, we incur a transition probability of jump_rel(1). There are no transitions into the start state. We additionally keep a specific parameter jump_rel(∅) for the probability of transitioning to a null state. It is straightforward to estimate these parameters based on the estimations from the forward-backward algorithm. In particular, jump_rel(i) is simply the relative frequency of length-i jumps, and jump_rel(∅) is simply the ratio of the count of jumps that end in a null state to the total number of jumps. The null state remembers the position we ended in before we jumped there, and so to jump out of a null state, we make a jump based on this previous position.</Paragraph>
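A minimal sketch of the relative jump model (ours, with hypothetical data structures): probabilities are relative frequencies of jump distances, with one extra parameter for jumps into the null state, re-estimated from the (possibly fractional) counts produced by forward-backward.

```python
from collections import Counter

class RelativeJumpModel:
    """jump_rel(i) as relative frequencies of jump distances, plus a single
    parameter for jumping to the null state, re-estimated from (possibly
    fractional) counts such as those produced by forward-backward."""
    def __init__(self):
        self.table = {}

    def reestimate(self, jump_counts, null_count):
        total = sum(jump_counts.values()) + null_count
        self.table = {dist: c / total for dist, c in jump_counts.items()}
        self.table["NULL"] = null_count / total

    def prob(self, new_pos, prev_end):
        # jump(j' | j) = jump_rel(j' - j); unseen distances get probability 0.
        return self.table.get(new_pos - prev_end, 0.0)

model = RelativeJumpModel()
model.reestimate(Counter({1: 7.5, 2: 1.5, -3: 0.5}), null_count=0.5)
print(model.prob(6, 5), model.table["NULL"])  # 0.75 0.05
```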
<Paragraph position="2"> The Gaussian jump model attempts to alleviate the sparsity of data problem in the relative jump model by assuming a parametric form for the jumps. In particular, we assume there is a mean jump length μ and a jump variance σ^2, and then the probability of a jump of length i is given by: jump(i) ∝ exp[−(i − μ)^2 / (2σ^2)]. Some care must be taken in employing this model, since the normal distribution is defined over a continuous space. Thus, when we discretize the calculation, the normalizing constant changes slightly from that of a continuous normal distribution. In practice, we normalize by summing over a sufficiently large range of possible values of i. The parameters μ and σ are estimated by computing the mean jump length in the expectations and its empirical variance. We model null states identically to the relative jump model.</Paragraph> <Paragraph position="3"> 6 In order for the null state to remember where we were, we actually introduce one null state for each document position, and require that from a document phrase d ending at position j + l we may only jump to the null state associated with position j + l.</Paragraph> <Paragraph position="4"> The relative and Gaussian jump models are extremely naïve in that they look only at the distance jumped and completely ignore what is being jumped over. In the syntax-aware jump model, we wish to enable the model to take advantage of syntactic knowledge in a very weak fashion. This is quite different from the various approaches to incorporating syntactic knowledge into machine translation systems, wherein strong assumptions about the possible syntactic operations are made (Yamada and Knight 2001; Eisner 2003; Gildea 2003). To motivate this model, consider the first document sentence shown with its syntactic parse tree in Figure 6. Though it is not always the case, forward jumps of distance more than one are often indicative of skipped words. From the standpoint of the relative jump models, jumping over the four words tripled it 's sales and jumping over the four words of Apple Macintosh systems are exactly the same. However, intuitively, we would be much more willing to jump over the latter than the former. The latter phrase is a full syntactic constituent, while the first phrase is just a collection of nearby words. Furthermore, the latter phrase is a prepositional phrase (and prepositional phrases might be more likely to be dropped than other phrases), while the former phrase includes a verb, a pronoun, a possessive marker, and a plain noun.</Paragraph> <Paragraph position="5"> To formally capture this notion, we parameterize the syntax-aware jump model according to the types of phrases being jumped over. That is, to jump over tripled it 's sales would have probability jump_syn(VBD PRP POS NNS), while to jump over of Apple Macintosh systems would have probability jump_syn(PP). In order to compute the probabilities for jumps over many components, we factorize so that the first probability becomes jump_syn(VBD) · jump_syn(PRP) · jump_syn(POS) · jump_syn(NNS). This factorization explicitly encodes our preference for jumping over single units rather than several syntactically unrelated units. In order to work with this model, we must first parse the document side of the corpus; we used Charniak's parser (Charniak 1997). Given the document parse trees, the re-estimation of the components of this probability distribution is done by simply counting what sorts of phrases are being jumped over. Again, we keep a single parameter jump_syn(∅) for jumping to null states. To handle backward jumps, we simply consider a duplication of the tag set, where jump_syn(NP-f) denotes a forward jump over an NP and jump_syn(NP-b) denotes a backward jump over an NP.</Paragraph> </Section>
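The factorization used by the syntax-aware jump model can be illustrated directly; in the sketch below (ours), the constituents spanned by a jump are assumed to have already been read off the parse tree, and the parameter values are invented for the example.

```python
from math import prod

def syntax_jump_probability(jumped_over, jump_syn, direction="f"):
    """Probability of jumping over a sequence of syntactic units, factorized as a
    product of per-unit probabilities, with '-f'/'-b' marking forward vs. backward
    jumps. `jumped_over` is the list of constituent/POS labels spanned by the jump
    (e.g. ["PP"] or ["VBD", "PRP", "POS", "NNS"]), assumed to come from a parse."""
    return prod(jump_syn[f"{label}-{direction}"] for label in jumped_over)

# Hypothetical parameter table (re-estimated in practice from fractional counts).
jump_syn = {"PP-f": 0.20, "VBD-f": 0.02, "PRP-f": 0.05, "POS-f": 0.05, "NNS-f": 0.04}
print(syntax_jump_probability(["PP"], jump_syn))                        # 0.2
print(syntax_jump_probability(["VBD", "PRP", "POS", "NNS"], jump_syn))  # ~2e-06
```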
<Section position="2" start_page="517" end_page="519" type="sub_section"> <SectionTitle> 4.2 Parameterizing the Rewrite Model </SectionTitle> <Paragraph position="0"> As observed from the human-aligned summaries, a good rewrite model should be able to account for alignments between identical words and phrases, between words that are identical up to stem, and between different words. Intuition (as well as further investigations of the data) also suggests that synonymy is an important factor to take into consideration in a successful rewrite model. We account for each of these four factors in four separate components of the model and then take a linear interpolation of them to produce the final probability: rewrite(s | d) = λ_id rewrite_id(s | d) + λ_stem rewrite_stem(s | d) + λ_wn rewrite_wn(s | d) + λ_wr rewrite_wr(s | d), where the λs are constrained to sum to unity. The four rewrite distributions used are: id is a word identity model, which favors alignment of identical words; stem is a model designed to capture the notion that matches at the stem level are often sufficient for alignment (i.e., walk and walked are likely to be aligned); wn is a rewrite model based on similarity according to WordNet; and wr is the basic rewrite model, similar to a translation table in machine translation. These four models are described in detail in this section, followed by a description of how to compute their λs during EM.</Paragraph> <Paragraph position="1"> 7 As can be seen from this example, we have preprocessed the data to split off possessive terms, such as the mapping from its to it 's.</Paragraph> <Paragraph position="2"> 8 In general, there are many ways to get from one position to another. For instance, to get from systems to January, we could either jump forward over an RB and a JJ, or we could jump forward over an ADVP and backward over an NN. In our version, we restrict all jumps to the same direction, and take the shortest jump sequence, in terms of the number of nodes jumped over.</Paragraph> <Paragraph position="3"> Figure 6 The syntactic tree for an example document sentence.</Paragraph> <Paragraph position="4"> The word identity model is rewrite_id(s | d) = 1 if s = d and 0 otherwise. That is, the probability is 1 exactly when s and d are identical, and 0 when they differ. This model has no parameters.</Paragraph> <Paragraph position="5"> The stem identity model is very similar to the word identity model: rewrite_stem(s | d) = 1/Z_d when |s| = |d| and each word of s matches the corresponding word of d up to stem, and 0 otherwise. That is, the probability of a phrase s given d is uniform over all phrases s' that match d up to stem (and are of the same length, i.e., |s'| = |d|), and zero otherwise. The normalization constant Z_d is computed offline based on a pre-computed vocabulary. This model also has no parameters.</Paragraph> <Paragraph position="6"> The WordNet rewrite model allows document phrases to be rewritten to semantically "related" summary phrases. To compute the value for rewrite_wn(s | d), we first require that both s and d can be found in WordNet. If either cannot be found, then the probability is zero. If they both can be found, then the graph distance between their first senses is computed (we traverse the hypernymy tree up until they meet). If the two paths do not meet, then the probability is again taken to be zero.
We place an exponential model on the hypernym tree-based distance: rewrite_wn(s | d) = (1 / Z_d) exp[−e · dist(s, d)] (5), where dist is the calculated distance, taken to be +∞ whenever either of the failure conditions is met. The single parameter of this model is e, which is computed according to the maximum likelihood criterion from the expectations during training. The normalization constant Z_d is calculated by summing over the exponential distribution for all s' that occur on the summary side of our corpus.</Paragraph> <Paragraph position="7"> The basic rewrite model, wr, exists to handle the cases not handled by the above models. It is analogous to a translation table (t-table) in statistical machine translation (we will continue to use this terminology for the remainder of the article), and simply computes a matrix of (fractional) counts corresponding to all possible phrase pairs. Upon normalization, this matrix gives the rewrite distribution.</Paragraph> <Paragraph position="8"> 4.2.5 Estimation of the Weight Parameters. In order to weight the four models, we need to estimate values for the λ components. This computation can be performed inside of the EM iterations by considering, for each rewritten pair, its expectation of belonging to each of the models. We use these expectations to maximize the likelihood with respect to the λs and then normalize them so they sum to one.</Paragraph> </Section>
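Putting the four components together, the interpolated rewrite probability looks roughly like the sketch below (ours); the stemmer, the WordNet distance table, and the t-table are toy stand-ins, and the normalization constants Z_d for the stem and WordNet components are omitted.

```python
from math import exp

def rewrite_prob(s, d, lambdas, wn_distance, ttable, eta=0.5):
    """Linear interpolation of the four rewrite components: identity, stem identity,
    WordNet similarity, and the learned t-table. `wn_distance` and `ttable` are toy
    stand-ins for WordNet and the EM-estimated translation table; `lambdas` must sum
    to one, and normalization constants are omitted for brevity."""
    stem = lambda w: w[:4]  # crude stemmer, purely for illustration

    p_id = 1.0 if s == d else 0.0
    p_stem = 1.0 if len(s) == len(d) and all(stem(a) == stem(b) for a, b in zip(s, d)) else 0.0
    dist = wn_distance.get((s, d))
    p_wn = exp(-eta * dist) if dist is not None else 0.0  # unnormalized sketch of eq. (5)
    p_wr = ttable.get((d, s), 0.0)

    l_id, l_stem, l_wn, l_wr = lambdas
    return l_id * p_id + l_stem * p_stem + l_wn * p_wn + l_wr * p_wr

print(rewrite_prob(("walk",), ("walked",), lambdas=(0.3, 0.3, 0.2, 0.2),
                   wn_distance={}, ttable={(("walked",), ("walk",)): 0.1}))  # ~0.32
```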
<Section position="3" start_page="519" end_page="521" type="sub_section"> <SectionTitle> 4.3 Model Priors </SectionTitle> <Paragraph position="0"> In the standard HMM case, the learning task is simply one of parameter estimation, wherein the maximum likelihood criterion under which the parameters are typically trained performs well. However, in our model we are, in a sense, simultaneously estimating parameters and selecting a model: the model selection is taking place at the level of deciding how to segment the observed summary. Unfortunately, in such model selection problems, likelihood increases monotonically with model complexity. Thus, EM will find for us the most complex model; in our case, this will correspond to a model in which the entire summary is produced at once, and no generalization will be possible.</Paragraph> <Paragraph position="1"> This suggests that a criterion other than maximum likelihood (ML) is more appropriate. We advocate the maximum a posteriori (MAP) criterion in this case. While ML optimizes the probability of the data given the parameters (the likelihood), MAP optimizes the product of the probability of the parameters with the likelihood (the unnormalized posterior). The difficulty in our model that makes ML estimation perform poorly is centered in the lexical rewrite model. Under ML estimation, we will simply insert an entry in the t-table for the entire summary for some uncommon or unique document word and be done. However, a priori we do not believe that such a parameter is likely. The question then becomes how to express this in a way that inference remains tractable.</Paragraph> <Paragraph position="2"> From a statistical point of view, the t-table is nothing but a large multinomial model (technically, one multinomial for each possible document phrase). Under a multinomial distribution with parameter θ with J-many components (with all θ_j positive and summing to one), the probability of an observation x is given by p(x | θ) = ∏_{j=1}^{J} θ_j^{x_j} (here, we consider x to be a vector of length J in which all components are zero except for one, corresponding to the actual observation).</Paragraph> <Paragraph position="3"> This distribution belongs to the exponential family and therefore has a natural conjugate distribution. Informally, two distributions are conjugate if you can multiply them together and get the original distribution back. In the case of the multinomial, the conjugate distribution is the Dirichlet distribution. A Dirichlet distribution is parameterized by a vector α of length J with α_j ≥ 0, but not necessarily summing to one. The Dirichlet distribution can be used as a prior distribution over multinomial parameters and has density p(θ | α) = (Γ(∑_{j=1}^{J} α_j) / ∏_{j=1}^{J} Γ(α_j)) ∏_{j=1}^{J} θ_j^{α_j − 1}. The fraction before the product is simply a normalization term that ensures that the density over all possible θ integrates to one.</Paragraph> <Paragraph position="4"> The Dirichlet is conjugate to the multinomial because when we compute the posterior of θ given α and x, we arrive back at a Dirichlet distribution: p(θ | x, α) ∝ ∏_{j=1}^{J} θ_j^{x_j + α_j − 1}. This distribution has the same density as the original model, but a "fake count" of α_j − 1 has been added to component j. This means that if we are able to express our prior beliefs about the multinomial parameters found in the t-table in the form of a Dirichlet distribution, the computation of the MAP solution can be performed exactly as described before, but with the appropriate fake counts added to the observed variables (in our case, the observed variables are the alignments between a document phrase and a summary phrase). The application of Dirichlet priors to standard HMMs has previously been considered in signal processing (Gauvain and Lee 1994). These fake counts act as a smoothing parameter, similar to Laplace smoothing (Laplace smoothing is the special case where α_j = 2 for all j).</Paragraph> <Paragraph position="5"> In our case, we believe that singleton rewrites are worth 2 fake counts, that lexical identity rewrites are worth 4 fake counts, and that stem identity rewrites are worth 3 fake counts. Indeed, since a singleton alignment between identical words satisfies all of these criteria, it will receive a fake count of 9. The selection of these counts is intuitive, but clearly arbitrary. However, this selection was not "tuned" to the data to get better performance. As we will discuss later, inference in this model over the sizes of documents and summaries we consider is quite computationally expensive. As is appropriate, we specified this prior according to our prior beliefs, and left the rest to the inference mechanism.</Paragraph> </Section> <Section position="4" start_page="521" end_page="521" type="sub_section"> <SectionTitle> 4.4 Parameter Initialization </SectionTitle> <Paragraph position="0"> We initialize all the parameters uniformly, but in the case of the rewrite parameters, since there is a prior on them, they are effectively initialized to the maximum likelihood solution under their prior.</Paragraph> </Section> </Section> </Paper>