<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1140">
  <Title>Learning to Say It Well: Reranking Realizations by Predicted Synthesis Quality</Title>
  <Section position="5" start_page="1114" end_page="1116" type="metho">
    <SectionTitle>
3 Reranking Realizations by Predicted Synthesis Quality
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="1114" end_page="1115" type="sub_section">
      <SectionTitle>
3.1 Generating Alternatives
</SectionTitle>
      <Paragraph position="0"> Our experiments with integrating language generation and synthesis have been carried out in the context of the COMIC3 multimodal dialogue system (den Os and Boves, 2003). The COMIC system adds a dialogue interface to a CAD-like application used in sales situations to help clients redesign their bathrooms. The input to the system includes speech, handwriting, and pen gestures; the output combines synthesized speech, an animated talking head, deictic gestures at on-screen objects, and direct control of the underlying application. null  Drawing on the materials used in (Foster and White, 2005) to evaluate adaptive generation in COMIC, we selected a sample of 104 sentences from 38 different output turns across three dialogues. For each sentence in the set, a variant was included that expressed the same content adapted to a different user model or adapted to a different dialogue history. For example, a description of a certain design's colour scheme for one user might be phrased as As you can see, the tiles have a blue and green colour scheme, whereas a variant expression of the same content for a different user could be Although the tiles have a blue colour scheme, the design does also feature green, if the user disprefers blue.</Paragraph>
      <Paragraph position="1"> In COMIC, the sentence planner uses XSLT to generate disjunctive logical forms (LFs), which specify a range of possible paraphrases in a nested free-choice form (Foster and White, 2004). Such disjunctive LFs can be efficiently realized using the OpenCCG realizer (White, 2004; White, 2006b; White, 2006a). Note that for the experiments reported here, we manually augmented the disjunctive LFs for the 104 sentences in our sample to make greater use of the periphrastic capabilities of the COMIC grammar; it remains for future work to augment the COMIC sentence planner produce these more richly disjunctive LFs automatically. null OpenCCG includes an extensible API for integrating language modeling and realization. To select preferred word orders, from among all those allowed by the grammar for the input LF, we used a backoff trigram model trained on approximately 750 example target sentences, where certain words were replaced with their semantic classes (e.g.</Paragraph>
      <Paragraph position="2"> MANUFACTURER, COLOUR) for better generalization. For each of the 104 sentences in our sample, we performed 25-best realization from the disjunctive LF, and then randomly selected up to 12 different realizations to include in our experiments based on a simulated coin flip for each realization, starting with the top-scoring one. We used this procedure to sample from a larger portion of the N-best realizations, while keeping the sample size manageable.</Paragraph>
      <Paragraph position="3"> Figure 1 shows an example of 12 paraphrases for a sentence chosen for inclusion in our sample.</Paragraph>
      <Paragraph position="4"> Note that the realizations include words with pitch accent annotations as well as boundary tones as separate, punctuation-like words. Generally the  quality of the sampled paraphrases is very high, only occasionally including dispreferred word orders such as We here have a design in the family style, where here is in medial position rather than fronted.4</Paragraph>
    </Section>
    <Section position="2" start_page="1115" end_page="1116" type="sub_section">
      <SectionTitle>
3.2 Synthesizing Utterances
</SectionTitle>
      <Paragraph position="0"> For synthesis, OpenCCG's output realizations are converted to APML,5 a markup language which allows pitch accents and boundary tones to be specified, and then passed to the Festival speech synthesis system (Taylor et al., 1998; Clark et al., 2004). Festival uses the prosodic markup in the text analysis phase of synthesis in place of the structures that it would otherwise have to predict from the text. The synthesiser then uses the context provided by the markup to enforce the selec- null festival/apml.html.</Paragraph>
      <Paragraph position="1"> tion of suitable units from the database.</Paragraph>
      <Paragraph position="2"> A custom synthetic voice for the COMIC system was developed, as follows. First, a domain-specific recording script was prepared by selecting about 150 sentences from the larger set of target sentences used to train the system's n-gram model. The sentences were greedily selected with the goals of ensuring that (i) all words (including proper names) in the target sentences appeared at least once in the record script, and (ii) all bigrams at the level of semantic classes (e.g. MANUFAC-TURER, COLOUR) were covered as well. For the cross-validation study reported in the next section, we also built a trigram model on the words in the domain-specific recording script, without replacing any words with semantic classes, so that we could examine whether the more frequent occurrence of the specific words and phrases in this part of the script is predictive of synthesis quality.</Paragraph>
      <Paragraph position="3"> The domain-specific script was augmented with a set of 600 newspaper sentences selected for diphone coverage. The newspaper sentences make it possible for the voice to synthesize words outside of the domain-specific script, though not necessarily with the same quality. Once these scripts were in place, an amateur voice talent was recorded reading the sentences in the scripts during two recording sessions. Finally, after the speech files were semi-automatically segmented into individual sentences, the speech database was constructed, using fully automatic labeling.</Paragraph>
      <Paragraph position="4"> We have found that the utterances synthesized with the COMIC voice vary considerably in their naturalness, due to two main factors. First, the system underwent further development after the voice was built, leading to the addition of a variety of new phrases to the system's repertoire, as well as many extra proper names (and their pronunciations); since these names and phrases usually require going outside of the domain-specific part of the speech database, they often (though not always) exhibit a considerable dropoff in synthesis quality.6 And second, the boundaries of the automatically assigned unit labels were not always accurate, leading to problems with unnatural joins and reduced intelligibility. To improve the reliability of the COMIC voice, we could have recorded more speech, or manually corrected label bound6Note that in the current version of the system, proper names are always required parts of the output, and thus the discriminative reranker cannot learn to simply choose paraphrases that leave out problematic names.</Paragraph>
      <Paragraph position="5">  aries; the goal of this paper is to examine whether the naturalness of a dialogue system's output can be improved in a less labor-intensive way.</Paragraph>
    </Section>
    <Section position="3" start_page="1116" end_page="1116" type="sub_section">
      <SectionTitle>
3.3 Rating Synthesis Quality
</SectionTitle>
      <Paragraph position="0"> To obtain data for training our realization reranker, we solicited judgements of the naturalness of the synthesized speech produced by Festival for the utterances in our sample COMIC corpus. Two judges (the first two authors) provided judgements on a 1-7 point scale, with higher scores representing more natural synthesis. Ratings were gathered using WebExp2,7 with the periphrastic alternatives for each sentence presented as a group in a randomized order. Note that for practical reasons, the utterances were presented out of the dialogue context, though both judges were familiar with the kinds of dialogues that the COMIC system is capable of.</Paragraph>
      <Paragraph position="1"> Though the numbers on the seven point scale were not assigned labels, they were roughly taken to be &amp;quot;horrible,&amp;quot; &amp;quot;poor,&amp;quot; &amp;quot;fair,&amp;quot; &amp;quot;ok,&amp;quot; &amp;quot;good,&amp;quot; &amp;quot;very good&amp;quot; and &amp;quot;perfect.&amp;quot; The average assigned rating across all utterances was 4.05 (&amp;quot;ok&amp;quot;), with a standard deviation of 1.56. The correlation between the two judges' ratings was 0.45, with one judge's ratings consistently higher than the other's.</Paragraph>
      <Paragraph position="2"> Some common problems noted by the judges included slurred words, especially the sometimes sounding like ther or even their; clipped words, such as has shortened at times to the point of sounding like is, or though clipped to unintelligibility; unnatural phrasing or emphasis, e.g. occasional pauses before a possessive 's, or words such as style sounding emphasized when they should be deaccented; unnatural rate changes; &amp;quot;choppy&amp;quot; speech from poor joins; and some unintelligible proper names.</Paragraph>
    </Section>
    <Section position="4" start_page="1116" end_page="1116" type="sub_section">
      <SectionTitle>
3.4 Ranking
</SectionTitle>
      <Paragraph position="0"> While Collins (2000) and Walker et al. (2002) develop their rankers using the RankBoost algorithm (Freund et al., 1998), we have instead chosen to use Joachims' (2002) method of formulating ranking tasks as Support Vector Machine (SVM) constraint optimization problems.8 This choice has been motivated primarily by convenience, as Joachims' SVMa108a105a103a104a116 package is easy to  of SVM ranking in generation, namely to the task of ranking alternative text orderings for local coherence.</Paragraph>
      <Paragraph position="1"> use; we leave it for future work to compare the performance of RankBoost and SVMa108a105a103a104a116 on our ranking task.</Paragraph>
      <Paragraph position="2"> The ranker takes as input a set of paraphrases that express the desired content of each sentence, optionally together with synthesized utterances for each paraphrase. The output is a ranking of the paraphrases according to the predicted naturalness of their corresponding synthesized utterances. Ranking is more appropriate than classification for our purposes, as naturalnesss is a graded assessment rather than a categorical one.</Paragraph>
      <Paragraph position="3"> To encode the ranking task as an SVM constraint optimization problem, each paraphrase a106 of a sentence a105 is represented by a feature vector a8a40a115a105a106a41 a61 a104a102a49a40a115a105a106a41a59a58a58a58a59a102a109a40a115a105a106a41a105, where a109 is the number of features. In the training data, the feature vectors are paired with the average value of their corresponding human judgements of naturalness. From this data, ordered pairs of paraphrases a40a115a105a106a59a115a105a107a41 are derived, where a115a105a106 has a higher naturalness rating than a115a105a107. The constraint optimization problem is then to derive a parameter vector a126a119 that yields a ranking score function a126a119 a1 a8a40a115a105a106a41 which minimizes the number of pairwise ranking violations. Ideally, for every ordered pair a40a115a105a106a59a115a105a107a41, we would have a126a119 a1 a8a40a115a105a106a41 a62 a126a119 a1 a8a40a115a105a107a41; in practice, it is often impossible or intractable to find such a parameter vector, and thus slack variables are introduced that allow for training errors.</Paragraph>
      <Paragraph position="4"> A parameter to the algorithm controls the trade-off between ranking margin and training error.</Paragraph>
      <Paragraph position="5"> In testing, the ranker's accuracy can be determined by comparing the ranking scores for every ordered pair a40a115a105a106a59a115a105a107a41 in the test data, and determining whether the actual preferences are borne out by the predicted preference, i.e. whether a126a119 a1 a8a40a115a105a106a41 a62 a126a119 a1 a8a40a115a105a107a41 as desired. Note that the ranking scores, unlike the original ratings, do not have any meaning in the absolute sense; their import is only to order alternative paraphrases by their predicted naturalness.</Paragraph>
      <Paragraph position="6"> In our ranking experiments, we have used SVMa108a105a103a104a116 with all parameters set to their default values.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="1116" end_page="1117" type="metho">
    <SectionTitle>
3.5 Features
</SectionTitle>
    <Paragraph position="0"> Table 1 shows the feature sets we have investigated for reranking, distinguished by the availability of the features and the need for discriminative training. The first row shows the feature sets that are  available to the realizer. There are two n-gram models that can be used to directly rank alternative realizations: NGRAM-1, the language model used in COMIC, and NGRAM-2, the language model derived from the domain-specific recording script; for feature values, the negative logarithms are used. There are also two WORDS feature sets (shown in the second column): WORDS-BI, which includes NGRAMS plus a feature for every possible unigram and bigram, where the value of the feature is the count of the unigram or bigram in a given realization; and WORDS-TRI, which includes all the features in WORDS-BI, plus a feature for every possible trigram. The second row shows the feature sets that require information from the synthesizer. The COSTS feature set includes NGRAMS plus the total join and target costs from the unit selection search. Note that a weighted sum of these costs could be used to directly rerank realizations, in much the same way as relative frequencies and concatenation costs are used in (Bulyko and Ostendorf, 2002); in our experiments, we let SVMa108a105a103a104a116 determine how to weight these costs. Finally, there are two ALL feature sets: ALL-BI includes NGRAMS, WORDS-BI and COSTS, plus features for every possible phone and diphone, and features for every specific unit in the database; ALL-TRI includes NGRAMS, WORDS-TRI, COSTS, and a feature for every phone, diphone and triphone, as well as specific units in the database. As with WORDS, the value of a feature is the count of that feature in a given synthesized utterance.</Paragraph>
  </Section>
  <Section position="7" start_page="1117" end_page="1118" type="metho">
    <SectionTitle>
4 Cross-Validation Study
</SectionTitle>
    <Paragraph position="0"> To train and test our ranker on our feature sets, we partitioned the corpus into 10 folds and performed 10-fold cross-validation. For each fold, 90% of the examples were used for training the ranker and the remaining unseen 10% were used for testing. The folds were created by randomly choosing from among the sentence groups, resulting in all of the paraphrases for a given sentence occurring in the same fold, and each occurring ex- null actly once in the testing set as a whole.</Paragraph>
    <Paragraph position="1"> We evaluated the performance of our ranker by determining the average score of the best ranked paraphrase for each sentence, under each of the following feature combinations: NGRAM1, NGRAM-2, COSTS, WORDS-BI, WORDS-TRI, ALL-BI, and ALL-TRI. Note that since we used the human ratings to calculate the score of the highest ranked utterance, the score of the highest ranked utterance cannot be higher than that of the highest human-rated utterance. Therefore, we effectively set the human ratings as the topline (BEST). For the baseline, we randomly chose an utterance from among the alternatives, and used its associated score. In 15 tests generating the random scores, our average scores ranged from 3.884.18. We report the median score of 4.11 as the average for the baseline, along with the mean of the topline and each of the feature subsets, in Table 2.</Paragraph>
    <Paragraph position="2"> We also report the ordering accuracy of each feature set used by the ranker in Table 2. As mentioned in Section 3.4, the ordering accuracy of the ranker using a given feature set is determined by a99a61a78, where a99 is the number of correctly ordered pairs (of each paraphrase, not just the top ranked one) produced by the ranker, and a78 is the total number of human-ranked ordered pairs.</Paragraph>
    <Paragraph position="3"> As Table 2 indicates, the mean of BEST is 5.38, whereas our ranker using WORDS-TRI features achieves a mean score of 4.95. This is a difference of 0.42 on a seven point scale, or only a 6% difference. The ordering accuracy of WORDS-TRI is 77.3%.</Paragraph>
    <Paragraph position="4"> We also measured the improvement of our ranker with each feature set over the random base-line as a percentage of the maximum possible gain (which would be to reproduce the human topline). The results appear in Figure 2. As the  maximum possible gain over the random baseline.</Paragraph>
    <Paragraph position="5"> figure indicates, the maximum possible gain our ranker achieves over the baseline is 66% (using the WORDS-TRI or ALL-BI feature set) . By comparison, NGRAM-1 and NGRAM-2 achieve less than 20% of the possible gain.</Paragraph>
    <Paragraph position="6"> To verify our main hypothesis that our ranker would significantly outperform the baselines, we computed paired one-tailed a116-tests between WORDS-TRI and RANDOM (a116 a61 a50a58a52, a112 a60 a56a58a57a120a49a48a0a49a51), and WORDS-TRI and NGRAM-1 (a116 a61 a49a58a52, a112 a60 a52a58a53a120a49a48a0a56). Both differences were highly significant. We also performed seven post-hoc comparisons using two-tailed a116-tests, as we did not have an a priori expectation as to which feature set would work better. Using the Bonferroni adjustment for multiple comparisons, the a112value required to achieve an overall level of significance of 0.05 is 0.007. In the first post-hoc test, we found a significant difference between BEST and WORDS-TRI (a116 a61 a56a58a48,a112 a60 a49a58a56a54a120a49a48a0a49a50), indicating that there is room for improvement of our ranker. However, in considering the top scoring feature sets, we did not find a significant difference between WORDS-TRI and WORDS-BI (a116 a61 a50a58a51, a112 a60 a48a58a48a50a50), from which we infer that the difference among all of WORDS-TRI, ALL-BI, ALL-TRI and WORDS-BI is not significant also.</Paragraph>
    <Paragraph position="7"> This suggests that the synthesizer features have no substantial impact on our ranker, as we would expect ALL-TRI to be significantly higher than WORDS-TRI if so. However, since COSTS does significantly improve upon NGRAM2 (a116 a61 a51a58a53, a112 a60 a48a58a48a48a49), there is some value to the use of synthesizer features in the absence of WORDS. We also looked at the comparison for the WORDS models and COSTS. While WORDS-BI did not perform significantly better than COSTS ( a116 a61 a50a58a51, a112 a60 a48a58a48a50a53), the added trigrams in WORDS-TRI did improve ranker performance significantly over COSTS (a116 a61 a51a58a55, a112 a60 a51a58a50a57a120a49a48a0a52). Since COSTS ranks realizations in the much the same way as (Bulyko and Ostendorf, 2002), the fact that WORDS-TRI outperforms COSTS indicates that our discriminative reranking method can significantly improve upon their non-discriminative approach. null</Paragraph>
  </Section>
class="xml-element"></Paper>