<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1139"> <Title>Stochastic Language Generation Using WIDL-expressions and its Application in Machine Translation and Summarization</Title> <Section position="5" start_page="1106" end_page="1108" type="metho"> <SectionTitle> 3 Stochastic Language Generation from </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="1106" end_page="1106" type="sub_section"> <SectionTitle> WIDL-expressions 3.1 Interpolating Probability Distributions in a Log-linear Framework </SectionTitle> <Paragraph position="0"> Let us assume a finite set a61 of strings over a finite alphabet a1 , representing the set of possible sentence realizations. In a log-linear framework, we have a vector of feature functions a62 a6</Paragraph> <Paragraph position="2"> a61 , the interpolated probability a67a222a18a68a66a58a21 can be written under a log-linear model as in Equation 1:</Paragraph> <Paragraph position="4"> We can formulate the search problem of finding the most probable realization a66 under this model as shown in Equation 2, and therefore we do not need to be concerned about computing expensive normalization factors.</Paragraph> <Paragraph position="6"> For a given WIDL-expression a5 over a1 , the set a61 is defined by a21 a27a30a29a196a18a48a9a47a10a13a12a15a14a17a16a19a18a20a5a22a21a98a21 , and feature function a62a13a67 is taken to be a9a13a10a13a12a15a14a17a16a82a18a20a5a22a21 . Any language model we want to employ may be added in Equation 2 as a feature function a62a13a52 , a92a85a84a24a38 .</Paragraph> </Section> <Section position="2" start_page="1106" end_page="1108" type="sub_section"> <SectionTitle> 3.2 Algorithms for Intersecting WIDL-expressions with Language Models </SectionTitle> <Paragraph position="0"> Algorithm WIDL-NGLM-Aa86 (Figure 3) solves the search problem defined by Equation 2 for a WIDL-expression a5 (which provides feature function a62 a67 ) and a87 a0 -gram language models (which provide feature functions a62a167a55 a37a53a57a53a57a53a57a58a37 a62 a63 a21 . It does so by incrementally computing UNFOLD for a42 a31 (i.e., on-demand computation of the corresponding pFSA a57 a31 ), by keeping track of a set of active states, called a88a90a89a92a91a94a93a96a95a78a97 . The set of newly UNFOLDed states is called a98a100a99a53a101a71a102a78a103a105a104 . Using Equation 1 (unnormalized), we EVALUATE the current a67a183a18a68a66a58a21 scores for the a98a73a99a92a101a106a102a78a103a107a104 states. Additionally, EVALUATE uses an admissible heuristic function to compute future (admissible) scores for the a98a73a99a92a101a71a102a108a103a107a104 states. The algorithm PUSHes each state from the current a98a100a99a53a101a71a102a78a103a105a104 into a priority queue a109 , which sorts the states according to their total score (current a110 admissible). In the next iteration, a88a90a89a53a91a94a93a111a95a108a97 is a singleton set containing the state POPed out from the top of a109 . The admissible heuristic function we use is the one defined in (Soricut and Marcu, 2005), using Equation 1 (unnormalized) for computing the event costs. 
<Paragraph position="1"> Given the existence of the admissible heuristic and the monotonicity property of the unfolding provided by the priority queue $Q$, the proof of A* optimality (Russell and Norvig, 1995) guarantees that WIDL-NGLM-A* finds a path in $A_W$ that provides an optimal solution.</Paragraph> <Paragraph position="2"> An important property of the WIDL-NGLM-A* algorithm is that the UNFOLD relation (and, implicitly, the acceptor $A_W$) is computed only partially, for those states for which the total cost is less than the cost of the optimal path. This results in important savings, both in space and time, over simply running a single-source shortest-path algorithm for directed acyclic graphs (Cormen et al., 2001) over the full acceptor $A_W$ (Soricut and Marcu, 2005).</Paragraph> </Section> </Section> <Section position="6" start_page="1108" end_page="1109" type="metho"> <SectionTitle> 4 Headline Generation using WIDL-expressions </SectionTitle> <Paragraph position="0"> We employ the WIDL formalism (Section 2) and the WIDL-NGLM-A* algorithm (Section 3) in a summarization application that aims at producing both informative and fluent headlines. Our headlines are generated in an abstractive, bottom-up manner, starting from words and phrases. A more common, extractive approach operates top-down, by starting from an extracted sentence that is compressed (Dorr et al., 2003) and annotated with additional information (Zajic et al., 2004).</Paragraph> <Paragraph position="1"> Automatic Creation of WIDL-expressions for Headline Generation. We generate WIDL-expressions starting from an input document. First, we extract a weighted list of topic keywords from the input document using the algorithm of Zhou and Hovy (2003). This list is enriched with phrases created from the lexical dependencies the topic keywords have in the input document. We associate probability distributions with these phrases using their frequency (we assume that higher frequency is indicative of increased importance) and their position in the document (we assume that proximity to the beginning of the document is also indicative of importance). In Figure 4, we present an example of input keywords and lexical-dependency phrases automatically extracted from a document describing incidents at the Turkey-Iraq border.

[Figure 4 (excerpt): Keywords {iraq 0.32, syria 0.25, rebels 0.22, kurdish 0.17, turkish 0.14, attack 0.10}; Phrases: iraq {in iraq 0.4, northern iraq 0.5, iraq and iran 0.1}, syria {into syria 0.6, and syria 0.4}, rebels {attacked rebels 0.7, rebels fighting 0.3}]</Paragraph> <Paragraph position="2"> The algorithm for producing WIDL-expressions combines the lexical-dependency phrases for each keyword using a $\vee_\delta$ (weighted disjunction) operator, with the probability value of each phrase multiplied by the probability value of its topic keyword. It then combines all the $\vee_\delta$-headed expressions into a single WIDL-expression using a $\parallel_\delta$ (weighted interleave) operator with uniform probability. The WIDL-expression in Figure 1 is a (scaled-down) example of the expressions created by this algorithm.</Paragraph> <Paragraph position="3"> On average, a WIDL-expression created by this algorithm, using $n = 6$ keywords and an average of $m = 4$ lexical-dependency phrases per keyword, compactly encodes a candidate set of about 3 million possible realizations.</Paragraph>
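The 3-million figure follows from the combinatorics of the two operators: the $\parallel_\delta$ operator permits all $n!$ orderings of the keyword slots, and each slot independently selects one of its $m$ phrases via $\vee_\delta$. A quick sanity check in Python (a hypothetical helper, not part of the paper's system):

```python
from math import factorial

def candidate_set_size(n_keywords, phrases_per_keyword):
    # n! interleavings of the keyword slots, times m phrase choices per slot
    return factorial(n_keywords) * phrases_per_keyword ** n_keywords

print(candidate_set_size(6, 4))  # 2949120 -- the "about 3 million" cited above
```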
<Paragraph position="4"> As the specification of the $\parallel_\delta$ operator takes space $O(1)$ for uniform $\delta$, Theorem 1 guarantees that the space complexity of these expressions is $O(nm)$.</Paragraph> <Paragraph position="5"> Finally, we generate headlines from WIDL-expressions using the WIDL-NGLM-A* algorithm, which interpolates the probability distributions represented by the WIDL-expressions with $n$-gram language model distributions. The output presented in Figure 4 is the most likely headline realization produced by our system.</Paragraph> <Paragraph position="6"> Headline Generation Evaluation. To evaluate the accuracy of our headline generation system, we use the documents from the DUC 2003 evaluation competition. Half of these documents are used as a development set (283 documents), and the other half is used as a test set (273 documents). We automatically measure performance by comparing the produced headlines against one reference headline produced by a human, using ROUGE (Lin, 2004).</Paragraph> <Paragraph position="7"> For each input document, we train two language models, using the SRI Language Model Toolkit (with modified Kneser-Ney smoothing). A general trigram language model, trained on 170M English words from the Wall Street Journal, is used to model fluency. A document-specific trigram language model, trained on the fly for each input document, accounts for both fluency and content validity. We also employ a word-count model (which counts the number of words in a proposed realization) and a phrase-count model (which counts the number of phrases in a proposed realization); these allow the system to learn to produce headlines that respect the restriction on the number of words allowed (10, in our case). The interpolation weights $\lambda$ (Equation 2) are trained on the development set using discriminative training (Och, 2003), with ROUGE as the objective function.</Paragraph> <Paragraph position="8"> The results are presented in Table 1. We compare the performance of several extractive algorithms (which operate on an extracted sentence to arrive at a headline) against several abstractive algorithms (which create headlines starting from scratch). For the extractive algorithms, Lead10 is a baseline which simply proposes as headline the lead sentence, cut after the first 10 words; HedgeTrimmer is our implementation of the Hedge Trimmer system (Dorr et al., 2003); and Topiary is our implementation of the Topiary system (Zajic et al., 2004). For the abstractive algorithms, Keywords is a baseline that proposes as headline the sequence of topic keywords; Webcl is the system described in (Zhou and Hovy, 2003); and WIDL-A* is the algorithm described in this paper.</Paragraph> <Paragraph position="9"> [Figure 5: Sample headlines generated by the WIDL-based sentence realization system, e.g. THREE GORGES PROJECT IN CHINA HAS WON APPROVAL; WATER IS LINK BETWEEN CLUSTER OF E. COLI CASES; SRI LANKA 'S JOINT VENTURE TO EXPAND EXPORTS; OPPOSITION TO EUROPEAN UNION SINGLE CURRENCY EURO; OF INDIA AND BANGLADESH WATER BARRAGE]</Paragraph>
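To make the interpolation concrete, the sketch below scores one candidate headline with the feature functions just described. It is a minimal illustration under stated assumptions: the function and weight names are ours, the two language-model callbacks are assumed to return log-probabilities, and `widl_logprob` stands for the precomputed log-probability $h_0$ that the WIDL-expression assigns to this realization.

```python
def score_headline(words, phrases, weights, lm_general, lm_doc, widl_logprob):
    """Unnormalized log-linear score of Equation 2 for one candidate."""
    features = {
        "widl": widl_logprob,             # h_0: WIDL-expression distribution
        "lm_general": lm_general(words),  # WSJ trigram model: fluency
        "lm_doc": lm_doc(words),          # document-specific model: fluency + content
        "word_count": float(len(words)),  # trained weight steers toward the 10-word limit
        "phrase_count": float(len(phrases)),
    }
    return sum(weights[name] * value for name, value in features.items())
```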
<Paragraph position="10"> This evaluation shows that our WIDL-based approach to generation is capable of obtaining headlines that compare favorably, in both content and fluency, with state-of-the-art extractive results (Zajic et al., 2004), while it outperforms a previously proposed abstractive system by a wide margin (Zhou and Hovy, 2003). Also note that our evaluation makes these results directly comparable, as all systems use the same parsing and topic identification algorithms. In Figure 5, we present a sample of headlines produced by our system, which includes both good and not-so-good outputs.</Paragraph> </Section> <Section position="8" start_page="1109" end_page="1109" type="metho"> <SectionTitle> 5 Machine Translation using WIDL-expressions </SectionTitle> <Paragraph position="0"> We also employ our WIDL-based realization engine in a machine translation application that uses a two-phase generation approach: in a first phase, WIDL-expressions representing large sets of possible translations are created from input foreign-language sentences; in a second phase, we use our generic, WIDL-based sentence realization engine to intersect the WIDL-expressions with an $n$-gram language model. In the experiments reported here, we translate between Chinese (source language) and English (target language).</Paragraph> <Paragraph position="1"> Automatic Creation of WIDL-expressions for MT. We generate WIDL-expressions from Chinese strings by exploiting a phrase-based translation table (Koehn et al., 2003). We use an algorithm resembling probabilistic bottom-up parsing to build a WIDL-expression for an input Chinese string: each contiguous span $(i, j)$ over a Chinese string $c_{ij}$ is considered a possible "constituent", and the "non-terminals" associated with each constituent are the English phrase translations $e_{ij}$ that correspond to the Chinese string $c_{ij}$ in the translation table. Multiple-word English phrases, such as $e_1 e_2 e_3$, are represented as WIDL-expressions using the precedence ($\cdot$) and lock ($\times$) operators, as $\times(e_1 \cdot e_2 \cdot e_3)$. To limit the number of possible translations $e_{ij}$ corresponding to a Chinese span $c_{ij}$, we use a probabilistic beam $b$ and a histogram beam $h$ to beam out low-probability translation alternatives. At this point, each span $c_{ij}$ is "tiled" with likely translations $e_{ij}$ taken from the translation table.</Paragraph> <Paragraph position="2"> Tiles that are adjacent are joined together into a larger tile by a $\parallel_\delta$ operator, where $\delta$ assigns non-zero probability to all orderings of the component tiles, but the longer the movement from the original order of the tiles, the lower the probability. (This distortion model is similar to the one used in (Koehn, 2004).) When multiple tiles are available for the same span $(i, j)$, they are joined by a $\vee_\delta$ operator, where $\delta$ is given by the probability distributions specified in the translation table. Usually, statistical phrase-based translation tables specify not only one, but multiple distributions that account for context preferences.</Paragraph>
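As an illustration of the tiling step just described, the sketch below enumerates the contiguous spans of a Chinese input and applies the two beams. The data structures are assumptions on our part (the phrase table maps a span, as a tuple of words, to a list of (English phrase, probability) pairs), and we read the probabilistic beam as a relative threshold against the best option for the span, as in Pharaoh-style decoders:

```python
def tile_sentence(chinese_words, phrase_table, b=0.01, h=10):
    """Collect beamed translation options ("tiles") for every span (i, j)."""
    tiles = {}
    n = len(chinese_words)
    for i in range(n):
        for j in range(i + 1, n + 1):
            options = phrase_table.get(tuple(chinese_words[i:j]), [])
            if not options:
                continue
            best = max(p for _, p in options)
            kept = [(e, p) for e, p in options if p >= b * best]  # probabilistic beam
            kept = sorted(kept, key=lambda x: -x[1])[:h]          # histogram beam
            # options for one span are later joined by a v_delta operator;
            # adjacent tiles are joined by a ||_delta (interleave) operator
            tiles[(i, j)] = kept
    return tiles
```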
<Paragraph position="3"> In our experiments, we consider four probability distributions, $\phi(e|f)$, $\phi(f|e)$, $p_w(e|f)$, and $p_w(f|e)$, where $f$ and $e$ are Chinese-English phrase translations as they appear in the translation table. In Figure 6, we show an example of a WIDL-expression created by this algorithm, which provides a translation as the best-scoring hypothesis under interpolation with a trigram language model (English reference: "the gunman was shot dead by the police").</Paragraph> <Paragraph position="4"> On average, a WIDL-expression created by this algorithm, using an average of $n$ tiles per sentence (for an average input sentence length of 30 words) and an average of $m$ possible translations per tile, encodes a candidate set on the order of $10^{30}$ possible translations. As the specification of the $\parallel_\delta$ operators takes space $O(1)$, Theorem 1 guarantees that these WIDL-expressions compactly encode these huge spaces in $O(nm)$ space. In the second phase, we employ our WIDL-based realization engine to interpolate the probability distributions of the WIDL-expressions with a trigram language model. In the notation of Equation 2, we use four feature functions $h_0, \ldots, h_3$ for the WIDL-expression distributions (one for each probability distribution encoded); a feature function $h_4$ for a trigram language model; a feature function $h_5$ for a word-count model; and a feature function $h_6$ for a phrase-count model.</Paragraph> <Paragraph position="5"> As acknowledged in the Machine Translation literature (Germann et al., 2003), full A* search is not usually possible, due to the large size of the search spaces. We therefore use an approximation: a variant of WIDL-NGLM-A* that considers for unfolding only those nodes extracted from the priority queue $Q$ that have already unfolded a path of length greater than or equal to the maximum length already unfolded minus a small fixed margin (held constant in the experiments reported here).</Paragraph> <Paragraph position="6"> MT Performance Evaluation. We evaluate against the state-of-the-art phrase-based decoder Pharaoh (Koehn, 2004) under the same experimental conditions: a translation table trained on the FBIS corpus (7.2M Chinese words and 9.2M English words of parallel text); a trigram language model trained on 155M words of English newswire; interpolation weights $\lambda$ (Equation 2) trained using discriminative training (Och, 2003) on the 2002 NIST MT evaluation set; probabilistic beam $b$ set to 0.01; and histogram beam $h$ set to 10. Using BLEU (Papineni et al., 2002) as our metric, our approximate WIDL-NGLM-A* algorithm produces translations with a BLEU score of 0.2570, while Pharaoh produces translations with a BLEU score of 0.2635; the difference is not statistically significant at the 95% confidence level.</Paragraph> <Paragraph position="7"> These results show that the WIDL-based approach to machine translation is powerful enough to achieve translation accuracy comparable with state-of-the-art machine translation systems.</Paragraph> </Section> </Paper>