<?xml version="1.0" standalone="yes"?> <Paper uid="W00-0306"> <Title>Stochastic Language Generation for Spoken Dialogue Systems</Title> <Section position="2" start_page="0" end_page="29" type="metho"> <SectionTitle> 1 Content Planning </SectionTitle> <Paragraph position="0"> In content planning we decide which attributes (represented as word classes, see Figure 3) should be included in an utterance. In a task-oriented dialogue, the number of attributes generally increases during the course of the dialogue. Therefore, as the dialogue progresses, we need to decide which ones to include at each system turn. If we include all of them every time (indirect echoing, see Hayes and Reddy, 1983), the utterances become overly lengthy, but if we remove all unnecessary attributes, the user may get confused. With a fairly high recognition error rate, this becomes an even more important issue.</Paragraph> <Paragraph position="1"> The problem, then, is to find a compromise between the two. We compared two ways to systematically generate system utterances with only selected attributes, such that the user hears repetition of some of the constraints he/she has specified, at appropriate points in the dialogue, without sacrificing naturalness and efficiency.</Paragraph> <Paragraph position="2"> The specific problems, then, are deciding what should be repeated, and when. We first describe a simple heuristic of old versus new information. Then we present a statistical approach, based on bigram models.</Paragraph> <Paragraph position="3"> 1.1 First approach: old versus new As a simple solution, we can use the previous dialogue history, by tagging the attribute-value pairs as old (previously said by the system) information or new (not said by the system yet) information. The generation module would select only new information to be included in the system utterances. 
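The old-versus-new heuristic can be sketched as a simple filter over the frame. This is a hypothetical illustration, not the paper's implementation: the history representation (a set of attribute-value pairs already said by the system) and the attribute names are assumptions.

```python
# Hypothetical sketch of the old-versus-new heuristic: keep only
# attribute-value pairs the system has not yet said. The history
# representation (a set of pairs) is an assumption for illustration.

def select_new_attributes(frame, history):
    """Return only the attribute-value pairs tagged as new information."""
    return {attr: value for attr, value in frame.items()
            if (attr, value) not in history}

history = {("depart_city", "New York")}                  # already echoed
frame = {"depart_city": "New York", "arrive_city": "Boston"}
print(select_new_attributes(frame, history))             # {'arrive_city': 'Boston'}
```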
Consequently, information given by the user is repeated only once in the dialogue, usually in the utterance immediately following the user utterance in which the new information was given. (If the system utterance uses a template that does not contain slots for the new information given in the previous user utterance, that information is confirmed in the next available system utterance whose template contains those slots.)</Paragraph> <Paragraph position="4"> Although this approach seems to work fairly well, echoing the user's constraints only once may not be the right thing to do. Looking at human-human dialogues, we observe that this is not very natural for a conversation: humans often repeat mutually known information, and they also often do not repeat some information at all. This model also fails to capture the close relationship between two consecutive utterances within a dialogue. The second approach tries to address these issues.</Paragraph> <Section position="1" start_page="0" end_page="29" type="sub_section"> <SectionTitle> 1.2 Second approach: statistical model </SectionTitle> <Paragraph position="0"> For this approach, we adopt the first of the two sub-maxims in Oberlander (1998): &quot;Do the human thing.&quot; Oberlander (1998) discusses the generation of referring expressions, but it is universally valid, at least within natural language generation, to say that the best we can do is to mimic human behavior.</Paragraph> <Paragraph position="1"> Hence, we built a two-stage statistical model of human-human dialogues using the CMU corpus. The model first predicts the number of attributes in the system utterance given the utterance class, then predicts the attributes given the attributes in the previous user utterance.</Paragraph> <Paragraph position="2"> 1.2.1 The number of attributes model The first model predicts the number of attributes in a system utterance given the utterance class.
The model is the probability distribution P(n_k | c_k), where n_k is the number of attributes and c_k is the utterance class for system utterance k.</Paragraph> <Paragraph position="3"> This model predicts which attributes to use in a system utterance. Using a statistical model, we need to find the set of attributes A* = {a_1, a_2, ..., a_n} such that A* = arg max P(a_1, a_2, ..., a_n). We assume that the distributions of the a_i's depend on the attributes in the previous utterances. As a simple model, we look only at the utterance immediately preceding the current utterance and build a bigram model of the attributes. In other words, A* = arg max P(A | B), where B = {b_1, b_2, ..., b_m} is the set of m attributes in the preceding user utterance.</Paragraph> <Paragraph position="4"> If we tried to apply the above model directly, we would run into a serious data sparseness problem, so we make two independence assumptions. The first is that the attributes in the user utterance contribute independently to the probabilities of the attributes in the system utterance following it. Applying this assumption to the model above, we get A* = arg max Σ_{k=1..m} P(b_k) P(A | b_k). The second independence assumption is that the attributes in the system utterance are independent of each other.
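Taken together, the two independence assumptions suggest a selection rule that can be sketched as follows. This is a toy illustration: the attribute names and probability tables are invented values, not estimates from the paper's CMU corpus, and n would come from the number-of-attributes model.

```python
# A toy sketch of attribute selection under the two independence
# assumptions: score each candidate attribute a by
# sum over user attributes b of P(b) * P(a | b), then keep the n
# highest-scoring attributes, with n taken from the first model.
# All attribute names and probabilities here are invented values.

def select_attributes(candidates, user_attrs, p_b, p_a_given_b, n):
    scores = {a: sum(p_b.get(b, 0.0) * p_a_given_b.get((a, b), 0.0)
                     for b in user_attrs)
              for a in candidates}
    return sorted(candidates, key=scores.get, reverse=True)[:n]

p_b = {"arrive_city": 1.0}
p_a_given_b = {("arrive_city", "arrive_city"): 0.6,
               ("depart_time", "arrive_city"): 0.3,
               ("depart_city", "arrive_city"): 0.1}
print(select_attributes(["depart_city", "arrive_city", "depart_time"],
                        ["arrive_city"], p_b, p_a_given_b, n=2))
```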
This gives the final model that we used for selecting the attributes.</Paragraph> <Paragraph position="6"> Although this independence assumption is an oversimplification, this simple model is a good starting point for our initial implementation of this approach.</Paragraph> </Section> </Section> <Section position="3" start_page="29" end_page="29" type="metho"> <SectionTitle> 2 Stochastic Surface Realization </SectionTitle> <Paragraph position="0"> We follow Busemann and Horacek (1998) in designing our generation engine with &quot;different levels of granularity.&quot; The different levels contribute to the specific needs of the various utterance classes. For example, at the beginning of the dialogue, a system greeting can be generated simply by a &quot;canned&quot; expression. Other short, simple utterances can be generated efficiently by templates. In Busemann and Horacek (1998), the remaining output is generated by grammar rules.</Paragraph> <Paragraph position="1"> We replace the generation grammar with a simple statistical language model to generate more complex utterances.</Paragraph> <Paragraph position="2"> There are four aspects to our stochastic surface realizer: building language models, generating candidate utterances, scoring the utterances, and filling in the slots. We explain each of these below.</Paragraph> <Section position="1" start_page="29" end_page="29" type="sub_section"> <SectionTitle> 2.1 Building Language Models </SectionTitle> <Paragraph position="0"> Using the tagged utterances as described in the introduction, we built an unsmoothed n-gram language model for each utterance class.</Paragraph> <Paragraph position="1"> Tokens that belong to word classes (e.g., &quot;U.S. Airways&quot; in class &quot;airline&quot;) were replaced by the word classes before building the language models.
We selected n = 5 to introduce some variability in the output utterances while preventing nonsense utterances.</Paragraph> <Paragraph position="2"> Note that language models are not used here in the same way as in speech recognition. In speech recognition, the language model probability acts as a 'prior' in determining the most probable sequence of words given the acoustics. In other words,</Paragraph> <Paragraph position="4"> W* = arg max P(A | W) P(W), where W is the string of words w_1, ..., w_n, and A is the acoustic evidence (Jelinek 1998).</Paragraph> <Paragraph position="5"> Although we use the same statistical tool, we compute and use the language model probability directly to predict the next word. In other words, the most likely utterance is W* = arg max P(W | u), where u is the utterance class.</Paragraph> <Paragraph position="6"> We do not, however, look for the most likely hypothesis, but rather generate each word randomly according to the distribution, as illustrated in the next section.</Paragraph> </Section> <Section position="2" start_page="29" end_page="29" type="sub_section"> <SectionTitle> 2.2 Generating Utterances </SectionTitle> <Paragraph position="0"> The input to NLG from the dialogue manager is a frame of attribute-value pairs. The first two attribute-value pairs specify the utterance class. The rest of the frame contains word classes and their values. Figure 4 is an example of an input frame to NLG.</Paragraph> <Paragraph position="1"> The generation engine uses the appropriate language model for the utterance class and generates word sequences randomly according to the language model distributions. As in speech recognition, the probability of a word using the n-gram language model is</Paragraph> <Paragraph position="3"> P(w_i | w_{i-n+1}, ..., w_{i-1}, u), where u is the utterance class.
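The word-by-word random generation described above can be sketched as follows. This is a hypothetical illustration: the data layout (a dict mapping an (n-1)-word context tuple to (word, probability) pairs) and the sentence-boundary markers are assumptions, and the toy model here is a bigram rather than the paper's 5-gram.

```python
import random

# Sketch of sampling one utterance word-by-word from an unsmoothed
# n-gram model. The data layout (a dict mapping an (n-1)-word context
# tuple to (word, probability) pairs) and the <s>/</s> boundary markers
# are assumptions for illustration; the paper uses n = 5.

def sample_utterance(model, n, max_len=50):
    context = ("<s>",) * (n - 1)
    words = []
    while len(words) < max_len:
        r, total = random.random(), 0.0
        for word, p in model[context]:   # draw the next word at random
            total += p
            if r <= total:
                break
        if word == "</s>":               # end-of-utterance marker
            break
        words.append(word)
        context = context[1:] + (word,)  # slide the n-gram window
    return words

# A deterministic toy bigram model (n = 2) for demonstration:
model = {("<s>",): [("what", 1.0)],
         ("what",): [("time", 1.0)],
         ("time",): [("</s>", 1.0)]}
print(sample_utterance(model, n=2))   # ['what', 'time']
```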
Since we have built separate models for each of the utterance classes, we can ignore u and say that</Paragraph> <Paragraph position="5"> P(w_i | w_{i-n+1}, ..., w_{i-1}), using the language model for u.</Paragraph> <Paragraph position="6"> Since we use unsmoothed 5-grams, we will not generate any unseen 5-grams (or smaller n-grams at the beginning and end of an utterance). This precludes generation of nonsense utterances, at least within the 5-word window. Using a smoothed n-gram would result in more randomness, but with conventional back-off methods (Jelinek 1998), the probability mass assigned to unseen 5-grams would be very small, and those rare occurrences of unseen n-grams may not make sense anyway. As in speech recognition with n-gram language models, long-distance dependencies cannot be captured.</Paragraph> <Paragraph position="7"> 2.3 Scoring Utterances For each randomly generated utterance, we compute a penalty score. The score is based on heuristics we selected empirically.</Paragraph> <Paragraph position="8"> Penalties are assigned to an utterance that 1. is too short or too long (determined by utterance-class-dependent thresholds), 2. contains repetitions of any of the slots, 3. contains slots for which there is no valid value in the frame, or 4. does not have some required slots (see section 2 for deciding which slots are required).</Paragraph> <Paragraph position="9"> The generation engine generates a candidate utterance and scores it, keeping only the best-scoring utterance so far. It stops and returns the best utterance when it finds one with a zero penalty score or runs out of time.</Paragraph> </Section> <Section position="3" start_page="29" end_page="29" type="sub_section"> <SectionTitle> 2.4 Filling Slots </SectionTitle> <Paragraph position="0"> The last step is filling slots with the appropriate values.
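This substitution step can be sketched as a one-line replacement. The curly-brace slot-marker syntax follows the paper's own example; the frame contents here are toy values.

```python
import re

# Minimal sketch of the slot-filling step: substitute each {slot}
# marker in the generated utterance with its value from the input
# frame. The curly-brace marker syntax follows the paper's example;
# the frame contents are toy values.

def fill_slots(utterance, frame):
    return re.sub(r"\{(\w+)\}", lambda m: frame[m.group(1)], utterance)

frame = {"depart_city": "New York"}
print(fill_slots("What time would you like to leave {depart_city}?", frame))
# What time would you like to leave New York?
```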
For example, the utterance &quot;What time would you like to leave {depart_city}?&quot; becomes &quot;What time would you like to leave New York?&quot;.</Paragraph> </Section> </Section> </Paper>