<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1661"> <Title>Statistical Ranking in Tactical Generation</Title> <Section position="5" start_page="517" end_page="519" type="metho"> <SectionTitle></SectionTitle> <Paragraph position="0"> Now, let our (positive) training data be given as a set of pairs of input semantics and corresponding preferred realizations, where a given input may be paired with several (equally preferred) realizations. In our set-up, this is the case where multiple HPSG derivations for the same input semantics project identical surface strings.</Paragraph> <Paragraph position="1"> Given a set of d features (as further described in Section 3.2), each pair of semantic input s and hypothesized realization r is mapped to a feature vector Φ(s, r) ∈ ℝ^d. The goal is then to find a vector of weights w ∈ ℝ^d that optimizes the likelihood of the training data. A conditional MaxEnt model of the probability of a realization r given the semantics s is defined as p_w(r|s) = exp(w · Φ(s, r)) / Σ_{r'} exp(w · Φ(s, r')), where the sum ranges over all candidate realizations r' for the input s.</Paragraph> <Paragraph position="2"> When we want to find the best realization for a given input semantics according to a model p_w, it is sufficient to compute the score function as in Equation (4) and then use the decision function previously given in Equation (2) above. When it comes to estimating the parameters w, the procedure seeks to maximize the (log of) a penalized likelihood function as in Equation (6), L(w) = Σ_i log p_w(r_i|s_i) − Σ_j w_j² / (2σ²). (The models are trained using a limited-memory variable metric optimization method, with the convergence threshold and the variance of the prior determined experimentally.) The last term of the likelihood function in Equation (6) is a penalty term that is commonly used for reducing the tendency of log-linear models to over-fit, especially when training on sparse data using many features (Chen & Rosenfeld, 1999; Johnson et al., 1999; Malouf & van Noord, 2004). More specifically, it defines a zero-mean Gaussian prior on the feature weights, which effectively leads to less extreme values. After empirically tuning the prior on our 'Jotunheimen' treebank (training and testing by 10-fold cross-validation), we ended up using σ² = 0.003 for the MaxEnt models applied in this paper.</Paragraph>
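<Paragraph> As an illustration of the MaxEnt formulation above, the sketch below scores candidate realizations with the linear function w · Φ(s, r), normalizes the scores into a conditional distribution, and computes the penalized log-likelihood with the Gaussian prior. It is only a minimal sketch under assumed data structures (sparse feature vectors as Python dicts, candidates keyed by realization identifiers); the function names are illustrative, and the models reported in the paper are estimated with a dedicated toolkit rather than code like this.

import math

def score(weights, features):
    # Linear score w . Phi(s, r) for one candidate realization;
    # both arguments are sparse vectors represented as dicts.
    return sum(weights.get(f, 0.0) * v for f, v in features.items())

def maxent_probabilities(weights, candidates):
    # Conditional MaxEnt distribution over the candidate realizations of
    # one input; `candidates` maps a realization id to its feature vector.
    scores = {r: score(weights, feats) for r, feats in candidates.items()}
    top = max(scores.values())
    exps = {r: math.exp(s - top) for r, s in scores.items()}  # stable softmax
    z = sum(exps.values())
    return {r: e / z for r, e in exps.items()}

def best_realization(weights, candidates):
    # Decision rule: the normalization is monotone, so the most probable
    # realization is simply the one with the highest linear score.
    return max(candidates, key=lambda r: score(weights, candidates[r]))

def penalized_log_likelihood(weights, training_data, sigma_squared=0.003):
    # Log-likelihood of the preferred realizations minus the zero-mean
    # Gaussian (L2) penalty on the feature weights.
    loglike = 0.0
    for candidates, preferred in training_data:
        probs = maxent_probabilities(weights, candidates)
        loglike += math.log(probs[preferred])
    penalty = sum(w * w for w in weights.values()) / (2.0 * sigma_squared)
    return loglike - penalty
</Paragraph>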
<Section position="1" start_page="518" end_page="519" type="sub_section"> <SectionTitle> 2.3 SVM Rankers </SectionTitle> <Paragraph position="0"> In this section we briefly formulate the optimization problem in terms of support vector machines. Our starting point is the SVM approach introduced in Joachims (2002) for learning ranking functions for information retrieval. In our case the aim is to learn a ranking function from a set of preference relations on sentences generated for a given input semantics.</Paragraph> <Paragraph position="1"> In contrast to the MaxEnt approach, the SVM approach has a geometric rather than probabilistic view on the problem. Similarly to the MaxEnt set-up, the SVM learner will try to learn a linear scoring function as defined in Equation (4) above. However, instead of maximizing the probability of the preferred or positive realizations, we try to maximize their value for F_w directly.</Paragraph> <Paragraph position="2"> Recall our definition of the set of positive training examples in Section 2.2. Let us here analogously define the set of negative examples as the non-preferred realizations of each input. Following Joachims (2002), the goal is then to minimize (1/2) w · w + C Σ_i ξ_i, subject to constraints requiring each positive realization to be scored above the negative realizations for the same input by a margin of at least 1 − ξ_i. The slack variables ξ_i are commonly used in SVMs to make it possible to approximate a solution by allowing some error in cases where a separating hyperplane cannot be found. The trade-off between maximizing the margin size and minimizing the training error is governed by the constant C. Using the SVMlight package by Joachims (1999), we empirically specified C = 0.005 for the model described in this paper. Note that, for the experiments reported here, we will only be making binary distinctions of preferred/non-preferred realizations, although the approach presented in Joachims (2002) is formulated for the more general case of learning ordinal ranking functions. Finally, given a linear SVM, we score and select realizations in the same way as we did with the MaxEnt model, according to Equations (4) and (2). Note, however, that it is also possible to use non-linear kernel functions with this set-up, since the ranking function can be represented as a linear combination of the feature vectors of the training examples.</Paragraph> </Section> <Section position="2" start_page="518" end_page="519" type="sub_section"> <SectionTitle> 2.4 Evaluation Measures </SectionTitle> <Paragraph position="0"> The models presented in this paper are evaluated with respect to two simple metrics: exact match accuracy and word accuracy. The exact match measure simply counts the number of times that the model assigns the highest score to a string that exactly matches a corresponding 'gold' or reference sentence (i.e. a sentence that is marked as preferred in the treebank). This score is discounted appropriately in the case of ties between preferred and non-preferred candidates, i.e. if several realizations are given the top rank by the model. We also include the exact match accuracy for the five best candidates according to the models (see the n-best columns of Table 6).</Paragraph> <Paragraph position="1"> The simple measure of exact match accuracy offers a very intuitive and transparent view on model performance. However, it is also in some respects too harsh as an evaluation measure in our setting, since there will often be more than just one of the candidate realizations that provides a reasonable rendering of the input semantics. We therefore also include word accuracy (WA) as a similarity-based evaluation metric. This measure is based on the Levenshtein distance between a candidate string and a reference, also known as edit distance: the minimum number of deletions, substitutions and insertions of words that are required to transform one string into another. If we let d, s and i represent the number of necessary deletions, substitutions and insertions respectively, and let l be the length of the reference, then WA is defined as WA = 1 − (d + s + i) / l.</Paragraph> <Paragraph position="2"> The scores produced by similarity measures such as WA are often difficult to interpret, but at least they provide an alternative view on the relative performance of the different models that we want to compare.</Paragraph>
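<Paragraph> To make the two evaluation measures concrete, the sketch below computes word accuracy from a word-level edit distance and gives one possible reading of the tie-discounted exact match credit (fractional credit when several candidates share the top score). The discounting scheme is an assumption, since the paper does not spell out its exact form.

def word_accuracy(candidate, reference):
    # WA = 1 - (deletions + substitutions + insertions) / reference length,
    # with the edit operations counted by a word-level Levenshtein distance.
    cand, ref = candidate.split(), reference.split()
    previous = list(range(len(cand) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, cand_word in enumerate(cand, start=1):
            substitution = previous[j - 1] + (0 if ref_word == cand_word else 1)
            current.append(min(previous[j] + 1,      # word only in the reference
                               current[j - 1] + 1,   # word only in the candidate
                               substitution))        # substitution or match
        previous = current
    return 1.0 - previous[len(cand)] / len(ref)

def exact_match_credit(scores, references):
    # Fractional credit for one item: if k candidates share the top score
    # and h of them match a reference, the item contributes h / k.
    top = max(scores.values())
    tied = [r for r, s in scores.items() if s == top]
    hits = sum(1 for r in tied if r in references)
    return hits / len(tied)
</Paragraph>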
<Paragraph position="3"> We could also have used several other similarity measures here, such as the BLEU score, which is a well-established evaluation metric within MT, but in our experience the various string similarity measures usually agree on the relative ranking of the different models.</Paragraph> </Section> </Section> <Section position="7" start_page="519" end_page="521" type="metho"> <SectionTitle> 3 Data Sets and Features </SectionTitle> <Paragraph position="0"> The following sections summarize the data sets and the feature types used in the experiments.</Paragraph> <Section position="1" start_page="519" end_page="520" type="sub_section"> <SectionTitle> 3.1 Symmetric Treebanks </SectionTitle> <Paragraph position="0"> Conditional parse selection models are standardly trained on a treebank consisting of strings paired with their optimal analyses. For our discriminative realization ranking experiments we require training corpora that provide the inverse relation. By assuming that the preferences captured in a standard treebank can constitute a bidirectional relation, Velldal et al. (2004) propose a notion of symmetric treebanks as the combination of (a) a set of pairings of surface forms and associated semantics; (b) the sets of alternative analyses for each surface form; and (c) sets of alternate realizations of each semantic form.</Paragraph> <Paragraph position="1"> Table 2: The generation treebanks 'Jotunheimen' (top) and 'Rondane' (bottom). The former data set was used for development and cross-validation testing, the latter for cross-genre held-out testing. The data items are aggregated relative to their number of realizations. The columns are, from left to right, the subdivision of the data according to the number of realizations, the total number of items scored (excluding items with only one realization and ones where all realizations are marked as preferred), average string length, average number of realizations, and average number of references. The rightmost column shows a random choice baseline, i.e. the probability of selecting the preferred realization by chance.</Paragraph> <Paragraph position="2"> Using the semantics of the preferred analyses in an existing treebank as input to the generator, we can produce all equivalent paraphrases of the original string. Furthermore, assuming that the original surface form is an optimal verbalization of the corresponding semantics, we can automatically label the preferred realization(s) by matching the yields of the generated trees against the original strings in the 'source' treebank.</Paragraph>
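<Paragraph> A minimal sketch of the labelling step just described: generated realizations whose yields match the original treebank string are marked as preferred. The normalization applied before matching (here only whitespace and case) is an assumption; the actual matching criteria of the treebanking machinery may differ.

def normalize_yield(text):
    # Crude surface normalization before comparing yields; what exactly
    # counts as "matching" the original string is an assumption here.
    return " ".join(text.lower().split())

def label_realizations(original_string, generated_realizations):
    # Pair every generated realization with a flag saying whether its
    # yield matches the original string, i.e. whether it is a preferred
    # (positive) example for the generation treebank.
    gold = normalize_yield(original_string)
    return [(realization, normalize_yield(realization) == gold)
            for realization in generated_realizations]

# Example (hypothetical strings):
# label_realizations("The dog barks.", ["The dog barks.", "Barks the dog."])
# -> [("The dog barks.", True), ("Barks the dog.", False)]
</Paragraph>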
<Paragraph position="3"> The result is what we call a generation treebank, which, taken together with the original parse-oriented pairings, constitutes a full symmetric treebank.</Paragraph> <Paragraph position="4"> We have successfully applied this technique to the tourism segments of the LinGO Redwoods treebank, which in turn is built atop the ERG. Table 2 summarizes the two resulting data sets, which both consist of instructional texts on tourist activities, the application domain of the background MT system.</Paragraph> </Section> <Section position="2" start_page="520" end_page="521" type="sub_section"> <SectionTitle> 3.2 Feature Templates </SectionTitle> <Paragraph position="0"> For the purpose of parse selection, Toutanova, Manning, Shieber, Flickinger, & Oepen (2002) and Toutanova & Manning (2002) train a discriminative log-linear model on the Redwoods parse treebank, using features defined over derivation trees with non-terminals representing the construction types and lexical types of the HPSG grammar (see Figure 1). The basic feature set of our MaxEnt realization ranker is defined in the same way (corresponding to the PCFG-S model of Toutanova & Manning, 2002), each feature capturing a sub-tree from the derivation limited to depth one. Table 3 shows example features in our MaxEnt and SVM models, where feature template # 1 corresponds to local derivation sub-trees. To reduce the effects of data sparseness, feature type # 2 in Table 3 provides a back-off to derivation sub-trees, where the sequence of daughters is reduced, in turn, to just one of the daughters. Conversely, to facilitate sampling of larger contexts than just sub-trees of depth one, feature template # 1 allows optional grandparenting, including the upward chain of dominating nodes in some features. In our experiments, we found that grandparenting of up to three dominating nodes gave the best balance of enlarged context vs. data sparseness.</Paragraph> <Paragraph position="1"> Figure 1: Derivation tree for the sentence 'the dog barks'. Phrasal nodes are labeled with identifiers of grammar rules, and (preterminal) lexical nodes with class names for types of lexical entries.</Paragraph> <Paragraph position="2"> Table 3: Example features extracted from the derivation tree in Figure 1. The first column identifies the feature template corresponding to each example; in the examples, the first integer value is a parameter to the feature templates, i.e. the depth of grandparenting (types 1 and 2) or n-gram size (types 3 and 4). Two special symbols denote the root of the tree and the left periphery of the yield, respectively.</Paragraph> <Paragraph position="3"> In addition to these dominance-oriented features taken from the derivation trees of each realization, our models also include more surface-oriented features, viz. n-grams of lexical types with or without lexicalization. Feature type # 3 in Table 3 defines n-grams of variable size, where (in a loose analogy to part of speech tagging) sequences of lexical types capture syntactic category assignments. Feature templates # 3 and # 4 only differ with regard to lexicalization, as the former includes the surface token associated with the rightmost element of each n-gram. Unless otherwise noted, we used a maximum n-gram size of three in the experiments reported here, again due to its empirically determined best overall performance.</Paragraph>
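<Paragraph> The sketch below illustrates the four feature templates described above over a toy derivation-tree representation. The node encoding, the exact feature tuples, and the handling of the root and left-periphery boundary symbols (omitted here) are all assumptions; the code only mirrors the general shape of templates 1-4 in Table 3.

from collections import namedtuple

# Toy derivation-tree node: `label` is a grammar-rule identifier for phrasal
# nodes or a lexical-type name for preterminals, `token` is the surface word
# of a preterminal (None otherwise), `children` is a list of daughter nodes.
Node = namedtuple("Node", "label token children")

def subtree_features(node, ancestors=(), max_grandparents=3):
    # Template 1: depth-one sub-trees, optionally extended with the chain of
    # up to `max_grandparents` dominating nodes.
    # Template 2: back-off variants that keep only one daughter at a time.
    feats = []
    if not node.children:
        return feats
    daughters = tuple(child.label for child in node.children)
    for depth in range(min(max_grandparents, len(ancestors)) + 1):
        chain = ancestors[len(ancestors) - depth:] if depth else ()
        feats.append((1, depth) + chain + (node.label,) + daughters)
    for daughter in daughters:
        feats.append((2, 0, node.label, daughter))
    for child in node.children:
        feats.extend(subtree_features(child, ancestors + (node.label,),
                                      max_grandparents))
    return feats

def lextype_ngram_features(preterminals, max_n=3):
    # Templates 3 and 4: n-grams of lexical types, with (template 3) and
    # without (template 4) the surface token of the rightmost element.
    feats = []
    labels = [p.label for p in preterminals]
    for n in range(1, max_n + 1):
        for start in range(len(labels) - n + 1):
            gram = tuple(labels[start:start + n])
            feats.append((4, n) + gram)
            feats.append((3, n) + gram + (preterminals[start + n - 1].token,))
    return feats
</Paragraph>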
<Paragraph position="4"> The number of instantiated features produced by the feature templates easily grows quite large. For the 'Jotunheimen' data the total number of distinct feature instantiations is 312,650. For the experiments in this paper we implemented a simple frequency-based cutoff by removing features that are observed as relevant fewer than c times. We here follow the approach of Malouf & van Noord (2004), where a feature is defined to be relevant for a given input if it takes on different values for two competing candidates for that input. A feature is only included in training if it is relevant for more than c items in the training data. Table 4 shows the effect on the accuracy of the MaxEnt model when varying the cutoff. We see that a model can be compacted quite aggressively without sacrificing much in performance. For all models presented below we use a cutoff of c = 3.</Paragraph> <Paragraph position="5"> Table 4: Frequency-based feature selection with respect to model size and accuracy.</Paragraph> <Paragraph position="6"> model configuration                     match   WA
basic model of (Velldal et al., 2004)   63.09   0.904
basic plus partial daughter sequence    64.64   0.910
basic plus grandparenting               67.54   0.923
basic plus lexical type trigrams        68.61   0.921
basic plus all of the above             70.28   0.927
basic plus language model               67.96   0.912
basic plus all of the above             72.28   0.928
Table 5: Performance summaries of the best-performing realization rankers using various feature configurations, compared to the set-up of Velldal et al. (2004). These scores were computed using a relevance cutoff of 3 and optimizing the variance of the prior for individual configurations.</Paragraph> </Section> </Section> </Paper>