File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/00/w00-1401_metho.xml

<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1401">
  <Title>Evaluation Metrics for Generation</Title>
  <Section position="3" start_page="1" end_page="2" type="metho">
    <SectionTitle>
2 System Overview
</SectionTitle>
    <Paragraph position="0"> FERGUS is composed of three mQdules: .the Tree Chooser, tile Unraveler, and the Linear Precedence (LP) Chooser (Figure 1). Tile input to the system is a dependency tree as shown in Figure 2. t Note that the nodes are unordered and are labeled only with lexemes, not with any sort of syntactic annotations. 2 The Tree Chooser uses a stochastic tree model to choose syntactic properties (expressed as trees in a Tree Adjoining Grammar) for the nodes in the input structure. This step can be seen as analogous to &amp;quot;supertagging&amp;quot; -(Bangalore-und doshh 1:999);. except that now supertags (i.e., names of trees which encode the syntactic properties of a lexical head) must be found for words in a tree rather than for words in a linear sequence. The Tree Chooser makes the siinplifying assumptions that the choice of a tree for a node depends only on its daughter nodes, thus allowing for a top-down algorithm. The Tree Chooser draws on a tree model, which is a analysis in terms of syntactic dependency for 1,000,000 words of the Wall Street Journal (WSJ). 3 The supertagged tree which is output from the Tree Chooser still does not fully determine the surface string, because there typically are different ways to attach a daughter node to her mother (for example, an adverb can be placed in different positions with respect to its verbal head). The Unraveler therefore uses the XTAG grammar of English (XTAG-Group, 1999) to produce a lattice of all possible linearizations that are compatible with the supertagged tree. Specifically, the daughter nodes are ordered with respect to the head at each level of the derivation tree. In cases where the XTAG grammar allows a daughter node to be attached at more than one place in the mother supertag (as is the case in our example for was and for; generaUy, such underspecification occurs with adjuncts and with arguments if their syntactic role is not specified), a disjunction of all these positions is assigned to the daughter node. A bottom-up algorithm then constructs a lattice that encodes the strings represented by each level of the derivation tree. The lattice at the root of the derivation tree is the result of the Unraveler.</Paragraph>
    <Paragraph position="1"> Finally. the LP Chooser chooses the most likely traversal of this lattice, given a linear language 1The sentence generated by this tree is a predicativenoun construction. The XTAG grammar analyzes these as being headed by the noun,rather-than by.the copula, and we follow the XTAG analysis. However, it would of course also be possible to use a graminar that allows for the copula-headed analysis.</Paragraph>
    <Paragraph position="2"> 21n the system that we used in the experiments described in Section 3. all words (including function words) need to be present in the input representation, fully inflected. Furthermore, there is no indication of syntactic role at all. This is of course unrealistic f~r applications see ,Section 5 for further renlarks.</Paragraph>
    <Paragraph position="3"> :3This wa~s constructed from the Penn Tree Bank using some heuristics, sirice the. l)enn Tree Bank does not contain full head-dependerit information; as a result of the tlse of heuristics, the Tree Model is tint fully correct.</Paragraph>
    <Paragraph position="4">  raveler encodes all possible word sequences permitted by the supertagged dependency structure. \Ve rank these word sequences in the order of their likeNhood by composing the lattice with a finite-state machine representing a trigram language model. This model has been constructed from the 1.000,0000 words WSJ training corpus. We pick the best path through the lattice resulting from the composition using the Viterbi algorithm, and this top ranking word sequence is the output of the LP Chooser and the generator.</Paragraph>
    <Paragraph position="5"> When we tally the results we obtain the score shown in the first column of Table 1.</Paragraph>
    <Paragraph position="6"> Note that if there are insertions and deletions, the number of operations may be larger than the number of tokens involved for either one of the two strings.</Paragraph>
    <Paragraph position="7"> As a result, the simple string accuracy metric may 3 Baseline-Qua_ntitmtive,Metrics ...,:-~.--..~,-:..,be.:..~eg~i~ee (t:hoagk:it, As, nevel:-greater._than 1, of We have used four different baseline quantitative metrics for evaluating our generator. The first two metrics are based entirely on the surface string. The next two metrics are based on a syntactic representation of the sentence.</Paragraph>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.1 String-Based Metrics
</SectionTitle>
      <Paragraph position="0"> We employ two metrics that measure the accuracy of a generated string. The first metric, simple accuracy, is the same string distance metric used for measuring speech recognition accuracy. This metric has also been used to measure accuracy of MT systems (Alshawi et al., 1998). It is based on string edit distance between the output of the generation system and the reference corpus string. Simple accuracy is the number of insertion (I), deletion (D) and substitutions (S) errors between the reference strings in the test corpus and the strings produced by the generation model. An alignment algorithm using substitution, insertion and deletion of tokens as operations attempts to match the generated string with the reference string. Each of these operations is assigned a cost value such that a substitution operation is cheaper than the combined cost of a deletion and an insertion operation. The alignment algorithm attempts to find the set of operations that minimizes the cost of aligning the generated string to tile reference string. Tile metric is summarized in Equation (1). R is the number of tokens in the target string.</Paragraph>
      <Paragraph position="1"> course).</Paragraph>
      <Paragraph position="2"> The simple string accuracy metric penalizes a misplaced token twice, as a deletion from its expected position and insertion at a different position. This is particularly worrisome in our case, since in our evaluation scenario the generated sentence is a permutation of the tokens in the reference string. We therefore use a second metric, Generation String Accuracy, shown in Equation (3), which treats deletion of a token at one location in the string and the insertion of the same token at another location in the string as one single movement error (M). This  is in addition to the remaining insertions (I') and deletions (D').</Paragraph>
      <Paragraph position="3"> (3) Generation String Accuracy =</Paragraph>
      <Paragraph position="5"> In our example sentence (2), we see that the insertion and deletion of no can be collapsed into one move. However, the wrong positions of cost and of phase are not analyzed as two moves, since one takes the place of the other, and these two tokens still result in one deletion, one substitution, and one insertion. 5 Thus, the generation string accuracy depenalizes simple moves, but still treats complex moves (involving more than one token) harshly. Overall, the scores for the two metrics introduced so far are shown in the first two columns of Table 1.</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.2 Tree-Based Metrics
</SectionTitle>
      <Paragraph position="0"> (1) Simple String Accuracy = (1 I+*)+s IC/ ) \Vhile tile string-b~u~ed metrics are very easy to apply, they have the disadvantage that they do not reflect the intuition that all token moves are not Consider tile fifth)wing example. The target sentence is on top, tile generated sentence below. Tile third equally &amp;quot;bad&amp;quot;. Consider the subphrase estimate for line represents the operation needed to. transfor m .. phase the second of the sentence in (2). \Vhile this is one sentence into another: a period is used t.o indi- bad;it seems better:tiara art alternative such as escate that no operation is needed. 4 (2) There was no cost estimate for tile  There was estimate for l)hase tile d (1 i</Paragraph>
      <Paragraph position="2"> timate phase for tile second. Tile difference between the two strings is that the first scrambled string, but not tile second, can be read off fl'om tile dependency tree for the sentence (as shown ill Figure 2) without violation of projectivity, i.e., without (roughly STiffs shows the importance of the alignment algorithm in the definition of Ihese two metrics: had it. not, aligned phase and cost as a substitution (but each with an empty position in the other~string-:instead),, then ~khe simple string accuracy would have 6 errors instead of 5, but the generation string accuracy would have 3 errors instead of ,1, speaking) creating discontinuous constituents. It has long been observed (though informally) that the dependency trees of a vast majority of sentences in the languages of the world are projective (see e.g.</Paragraph>
      <Paragraph position="3"> (Mel'euk, 1988)), so that a violation of projectivity is presumably a more severe error than a word order variation that does not violate projectivity.</Paragraph>
      <Paragraph position="4"> We designed thetree-based'-acettrucymetrics in order to account for this effect. Instead of comparing two strings directly, we relate the two strings to a dependency tree of the reference string. For each treelet (i.e., non-leaf node with all of its daughters) of the reference dependency tree, we construct strings of the head and its dependents in the order they appear in the reference string, and in the order they appear in the result string. We then calculate the number of substitutions, deletions, and insertions as for the simple string accuracy, and the number of substitutions, moves, and remaining deletions and insertions as for the generation string metrics, for all treelets that form the dependency tree. We sum these scores, and then use the values obtained in the formulas given above for the two string-based metrics, yielding the Simple Tree Accuracy and Generation Tree Accuracy. The scores for our example sentence are shown in the last two columns of Table 1.</Paragraph>
    </Section>
    <Section position="3" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.3 Evaluation Results
</SectionTitle>
      <Paragraph position="0"> The simple accuracy, generation accuracy, simple tree accuracy and generation tree accuracy for the two experiments are tabulated in Table 2. The test corpus is a randomly chosen subset of 100 sentences from the Section 20 of WSJ. The dependency structures for the test sentences were obtained automatically from converting the Penn TreeBank phrase structure trees, in the same way as was done to Create the training corpus. The average length of the test sentences is 16.7 words with a longest sentence being 24 words in length. As can be seen, the supertag-based model improves over the baseline LR model on all four baseline quantitative metrics.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="2" end_page="2" type="metho">
    <SectionTitle>
4 Qualitative Evaluation of the Quantitative Metrics
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
4.1 The Experiments
</SectionTitle>
      <Paragraph position="0"> We have presented four metrics which we can compute automatically. In order to determine whether the metrics correlate with independent notions understandability or quality, we have performed evaluation experiments with human subjects.</Paragraph>
      <Paragraph position="1"> In the web-based experiment, we ask human subjects to read a short paragraph from the WSJ. We present three or five variants of the last sentence of this paragraph on the same page, and ask the sub-ject to judge them along two dimensions: Here we summarize two experiments that we have performed that use different tree nmdels. (For a more detailed comparisons of different tree models, see (Bangalore and Rainbow, 2000).) o For the baseline experiment, we impose a random tree structure for each sentence of the corpus and build a Tree Model whose parameters consist of whether a lexeme ld precedes or follows her mother lexeme \[ .... We call this the Baseline Left-Right (LR) Model. This model generates There was estimate for phase the second no cost. for our example input.</Paragraph>
      <Paragraph position="2"> o In the second experiment we use the-system as described in Section 2. We employ the supertag-based tree model whose parameters consist of whether a lexeme ld with supertag sd is a dependent of lexeme 1,,, with supertag s,,,. Furthermore we use the information provided by the XTAG grammar to order the dependents. This model generates There was no cost estimate for&amp;quot; the second phase . for our example input, .which is indeed.the sentence found in the WS.I.</Paragraph>
      <Paragraph position="3"> o Understandability: How easy is this sentence to understand? Options range from &amp;quot;Extremely easy&amp;quot; (= 7) to &amp;quot;Just barely possible&amp;quot; (=4) to &amp;quot;Impossible&amp;quot; (=1). (Intermediate numeric values can also be chosen but have no description associated with them.) o Quality: How well-written is this sentence? Options range from &amp;quot;Extremely well-written'&amp;quot; (= 7) to &amp;quot;Pretty bad&amp;quot; (=4) to &amp;quot;Horrible (=1). (Again. intermediate numeric values can also t)e chosen, but have no description associated with them.) The 3-5 variants of each of 6 base sentences are construtted by us (most of the variants lraxre not actually been generated by FERGUS) to sample multiple values of each intrinsic metric as well as to contrast differences between the intrinsic measures. Thus for one sentence &amp;quot;tumble&amp;quot;, two of the five variants have approximately identical values for each of the metrics but with the absolute values being high (0.9) and medium (0.7) respectively. For two other sen\[,('II('(}S ~ve have contrasting intrinsic values for tree trod string based measures. For .the final sentence we have contrasts between the string measures with  tree measures being approximately equal. Ten subjects who were researchers from AT&amp;T carried out the experiment. Each subject made a total of 24 judgments.</Paragraph>
      <Paragraph position="4"> Given the variance between subjects we first normalized the data. We subtracted the mean score for each subject from each observed score and then divided this by standard deviation of the scores for that subject. As expected our data showed strong correlations between normalized understanding and quality judgments for each sentence variant (r(22) = 0.94, p &lt; 0.0001).</Paragraph>
      <Paragraph position="5"> Our main hypothesis is that the two tree-based metrics correlate better with both understandability and quality than the string-based metrics. This was confirmed. Correlations of the two string metrics with normalized understanding for each sentence variant were not significant (r(22) = 0.08 and rl.2.21 = 0.23, for simple accuracy and generation accuracy: for both p &gt; 0.05). In contrast both of the tree metrics were significant (r(2.2) = 0.51 and r(22) = 0.48: for tree accuracy and generation tree accuracy, for both p &lt; 0.05). Similar results were achieved--for thegorrealized quality metric: (r(.2.21 = 0.16 and r(221 = 0,33: for simple accuracy and generation accuracy, for both p &gt; 0.05), (r(ee) = 0.45 and r(.2.2) = 0.42, for tree accuracy and generation tree accuracy, for both p &lt; 0.05).</Paragraph>
      <Paragraph position="6"> A second aim of ()Lit&amp;quot; qualitative evaluation was to lest various models of the relationship between intrinsic variables and qualitative user judgments. \Ve proposed a mmlber-of'models:in which various conLfrom the two tree models binations of intrinsic metrics were used to predict user judgments of understanding and quality. .We conducted a series of linear regressions with normalized judgments of understanding and quality as the dependent measures and as independent measures different combinations of one of our four metrics with sentence length, and with the &amp;quot;problem&amp;quot; variables that we used to define the string metrics (S, I, D, M, I', D' - see Section 3 for definitions). One sentence variant was excluded from the data set, on the grounds that the severely &amp;quot;mangled&amp;quot; sentence happened to turn out well-formed and with nearly the same nleaning as the target sentence. The results are shown in Table 3.</Paragraph>
      <Paragraph position="7"> We first tested models using one of our metrics as a single intrinsic factor to explain the dependent variable. We then added the &amp;quot;problem&amp;quot; variables. 6 and could boost tile explanatory power while maintaining significance. In Table 3, we show only some con&gt; binations, which show that tile best results were obtained by combining the simple tree accuracy with the number of Substitutions (S) and the sentence length. As we can see, the number of substitutions ..... has an.important effecVon explanatory.power,, while that of sentence length is much more modest (but more important for quality than for understanding).</Paragraph>
      <Paragraph position="8"> Furthermore, the number of substitutions has more explanatory power than the number of moves (and in fact. than any of the other &amp;quot;problem&amp;quot; variables). The two regressions for understanding and writing show very sinlilar results. Normalized understand- null ing was best modeled as: Normalized understanding = 1.4728*simple tree accuracy - 0.1015*substitutions0.0228 * length - 0.2127.</Paragraph>
      <Paragraph position="9"> This model was significant: F(3,1.9 ) = 6.62, p &lt; 0.005. Tile model is plotted in Figure 3. with the data point representing the removed outlier at the top of the diagram.</Paragraph>
      <Paragraph position="10"> This model is also intuitively plausible. The simple tree metric was designed to measure the quality of a sentence and it has a positive coefficient. A substitution represents a case in the string metrics in which not only a word is in the wrong place, but the word that should have been in that place is somewhere else, Therefore, substitutions, more than moves or insertions or deletions, represent grave cases of word order anomalies. Thus, it is plausible to penalize them separately. (,Note that tile simple tree accuracy is bounded by 1, while the number of substitutions is l/ounded by the length of the sentence. In practice, in our sentences S ranges between 0 and 10 with a mean of 1,583.) Finally, it is also plausible that longer sentem:es are more difficult to understand, so that length has a (small) negative coefficient.</Paragraph>
      <Paragraph position="11"> We now turn to model for quality, Normalized quality = 1.2134*simple tree accuracy- 0.0839*substitutions - 0.0280 * length - 0.0689.</Paragraph>
      <Paragraph position="12"> This model was also significant: F(3A9) = 7.23, p &lt; 0.005. The model is plotted in Figure 4, with the data point representing the removed outlier at the top of the diagram. The quality model is plausible for the same reasons that the understanding model is.</Paragraph>
      <Paragraph position="14"/>
    </Section>
    <Section position="2" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
4.2 Two New Metrics
</SectionTitle>
      <Paragraph position="0"> A further goal of these experiments was to obtain one or two metrics which can be automatically computed, and which have been shown to significantly correlate with relevant human judgments* We use as a starting point the two linear models for normalized understanding and quality given above, but we make two changes. First, we observe that while it is plausible to model human judgments by penalizing long sentences, this seems unmotivated in an accuracy metric: we do not want to give a perfectly generated longer sentence a lower score than a perfectly generated shorter sentence. We therefore use models that just use the simple tree accuracy and the number of substitutions as independent variables* Second, we note that once we have done so, a perfect sentence gets a score of 0.8689 (for understandability) or 0.6639 (for quality). We therefore divide by this score to assure that a perfect sentence gets a score of 1. (As for the previously introduced metrics, the scores may be less than 0.) \Ve obtain the following new metrics:  curacy- 0.0869*substitutions - 0.3553) / 0.6639 \\e reevahtated our system and the baseline model using the new metrics, in order to veri(v whether the nloro motivated metrics we have developed still show that FER(;I:S improves l)erforniance over the baseline. This is indeed the case: the resuhs are</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>