<?xml version="1.0" standalone="yes"?> <Paper uid="C00-1073"> <Title>Automatic Optimization of Dialogue Management</Title> <Section position="3" start_page="502" end_page="502" type="metho"> <SectionTitle> 2 Reinforcement Learning for </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="502" end_page="502" type="sub_section"> <SectionTitle> Dialogue </SectionTitle> <Paragraph position="0"> Due to space limitations, we 1)resent only a 1)rief overview of how di~dogue strategy optimization can be viewed as an llL 1)roblem; for more details, see Singh ctal. (\]999), Walker el; a.1. (\]998), Levin et al. (2000). A dialogue strategy is a mapl)ing h'om a set ot! states (which summarize the entire dialogue so far) to a set of actions (such as the system's uttermines and database queries). There are nmltil)l(~ reasonable action choices in each state; tyl)ically these choices are made by the system designer. Our RL-I)ased at)l)roach is to build a system that explores these choices in a systematic way through experiments with rel)resentative human us(!rs. A scalar i)erf()rmanee llleasllre, called a rewal'd, is t h(m (;aleulated for each Cxl)erimental diMogue. (We discuss various choices for this reward measure later, but in our experiments only terminal dialogue states ha,re nonzero rewi-l,rds, slid the reward lneasul'(}s arc quantities directly obtMnable from the experimental set-up, such as user satisfaction or task coml)letion. ) This experimental data is used to construct an MDP which models the users' intera(:tion with the system.</Paragraph> <Paragraph position="1"> The l)roblem of learning the best dialogue strategy from data is thus reduced to computing the optimal policy tbr choosing actions in an MDP - that is, the system's goal is to take actions so as to maximize expected reward. The comput~ttion of the ol)timal policy given the MDP can be done etficiently using stan&trd RL algorithms.</Paragraph> <Paragraph position="2"> How do we build the desired MDP from sample dialogues? Following Singh et al. (1999), we can view a dialogue as a trajectory in the chosen state space determined by the system actions and user resl) onses: S1 -~al,rl '5'2 --}a~,rs 83 &quot;-~aa,ra &quot;'&quot; Here si -%,,,.~ si+l indicates that at the ith exchange, the system was in state si, executed action ai, received reward ri, and then the state changed to si+~. Dialogue sequences obtained froln training data can be used to empirically estimate the transition probabilities P(.s&quot;la', a) (denoting the probability of a mmsition to state s', given that the system was in state .s and took ;ration a), and the reward function R(.s, (t). The estilnated transition 1)tel)abilities and rewi~rd flmction constitute an MDP model of the nser population's interaction with the system.</Paragraph> <Paragraph position="3"> Given this MDP, the exl)ected cumnlative reward (or Q-value) Q(s, a) of taking action a from state s can be calculated in terms of the Q-wdues of successor states via the following recursive equation:</Paragraph> <Paragraph position="5"> These Q-values can be estimated to within a desired threshold using the standard RL value iteration algorithm (Sutton, 1.991.), which iteratively updates the estimate of Q(s, a) based on the current Q-vahms of neighboring states. 
<Paragraph position="6"> While this approach is theoretically appealing, the cost of obtaining sample human dialogues makes it crucial to limit the size of the state space, to minimize data sparsity problems, while retaining enough information in the state to learn an accurate model.</Paragraph> <Paragraph position="7"> Our approach is to work directly in a minimal but carefully designed state space (Singh et al., 1999).</Paragraph> <Paragraph position="8"> The contribution of this paper is to empirically validate a practical methodology for using RL to build a dialogue system that optimizes its behavior from dialogue data. Our methodology involves 1) representing a dialogue strategy as a mapping from each state in the chosen state space S to a set of dialogue actions, 2) deploying an initial training system that generates exploratory training data with respect to S, 3) constructing an MDP model from the obtained training data, 4) using value iteration to learn the optimal dialogue strategy in the learned MDP, and 5) redeploying the system using the learned state/action mapping. The next section details the use of this methodology to design the NJFun system.</Paragraph> </Section> </Section> <Section position="4" start_page="502" end_page="505" type="metho"> <SectionTitle> 3 The NJFun System </SectionTitle> <Paragraph position="0"> NJFun is a real-time spoken dialogue system that provides users with information about things to do in New Jersey. NJFun is built using a general purpose platform for spoken dialogue systems (Levin et al., 1999), with support for modules for automatic speech recognition (ASR), spoken language understanding, text-to-speech (TTS), database access, and dialogue management. NJFun uses a speech recognizer with stochastic language and understanding models trained from example user utterances, and a TTS system based on concatenative diphone synthesis. Its database is populated from the nj.online webpage to contain information about activities. NJFun indexes this database using three attributes: activity type, location, and time of day (which can assume values morning, afternoon, or evening).</Paragraph> <Paragraph position="1"> Informally, the NJFun dialogue manager sequentially queries the user regarding the activity, location and time attributes, respectively. NJFun first asks the user for the current attribute (and possibly the other attributes, depending on the initiative).</Paragraph> <Paragraph position="2"> If the current attribute's value is not obtained, NJFun asks for the attribute (and possibly the later attributes) again. If NJFun still does not obtain a value, NJFun moves on to the next attribute(s).</Paragraph> <Paragraph position="3"> Whenever NJFun successfully obtains a value, it can confirm the value, or move on to the next attribute(s). When NJFun has finished acquiring attributes, it queries the database (using a wildcard for each unobtained attribute value). The length of NJFun dialogues ranges from 1 to 12 user utterances before the database query. Although the NJFun dialogues are fairly short (since NJFun only asks for an attribute twice), the information access part of the dialogue is similar to more complex tasks.</Paragraph>
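The attribute-by-attribute behavior just described can be summarized as a control loop. The sketch below paraphrases the informal description and is not NJFun's actual code: the helper callables (ask_user, confirm_with_user, query_database, choose) are hypothetical, and it simplifies away the fact that open prompts may obtain several attributes in a single user utterance.

```python
WILDCARD = "*"
ATTRIBUTES = ("activity", "location", "time")

def dialogue_manager(ask_user, confirm_with_user, query_database, choose):
    """Sequentially acquire activity, location, and time, asking at most twice
    per attribute, optionally confirming each value, then query the database
    with a wildcard for every attribute that was not obtained."""
    values = {}
    for attr in ATTRIBUTES:
        for attempt in range(2):                  # NJFun asks at most twice
            # 'choose' decides the initiative (user / system / mixed) -- one of
            # the two action choices left to the learner in this paper.
            value = ask_user(attr, initiative=choose("initiative", attr, attempt))
            if value is None:
                continue                          # reask, or give up after 2 tries
            # Confirmation vs. no confirmation is the other learned choice.
            if choose("confirm", attr, attempt) == "explicit":
                if not confirm_with_user(attr, value):
                    continue                      # rejected value: try again
            values[attr] = value
            break
    query = {attr: values.get(attr, WILDCARD) for attr in ATTRIBUTES}
    return query_database(query)
```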
<Paragraph position="4"> As discussed above, our methodology for using RL to optimize dialogue strategy requires that all potential actions for each state be specified. Note that at some states it is easy for a human to make the correct action choice. We made obvious dialogue strategy choices in advance, and used learning only to optimize the difficult choices (Walker et al., 1998).</Paragraph> <Paragraph position="5"> In NJFun, we restricted the action choices to 1) the type of initiative to use when asking or reasking for an attribute, and 2) whether to confirm an attribute value once obtained. The optimal actions may vary with dialogue state, and are subject to active debate in the literature.</Paragraph> <Paragraph position="6"> The examples in Figure 2 show that NJFun can ask the user about the first 2 attributes1 using three types of initiative, based on the combination of the wording of the system prompt (open versus directive), and the type of grammar NJFun uses during ASR (restrictive versus non-restrictive). If NJFun uses an open question with an unrestricted grammar, it is using user initiative (e.g., GreetU). If NJFun instead uses a directive prompt with a restricted grammar, the system is using system initiative (e.g., GreetS). If NJFun uses a directive question with a non-restrictive grammar, it is using mixed initiative, because it allows the user to take the initiative by supplying extra information (e.g., ReAsk1M).</Paragraph> <Paragraph position="7"> NJFun can also vary the strategy used to confirm each attribute. If NJFun asks the user to explicitly verify an attribute, it is using explicit confirmation (e.g., ExpConf2 for the location, exemplified by S2 in Figure 1). If NJFun does not generate any confirmation prompt, it is using no confirmation (the NoConf action).</Paragraph> <Paragraph position="8"> Solely for the purposes of controlling its operation (as opposed to the learning, which we discuss in a moment), NJFun internally maintains an operations vector of 14 variables. 2 variables track whether the system has greeted the user, and which attribute the system is currently attempting to obtain. For each of the 3 attributes, 4 variables track whether the system has obtained the attribute's value, the system's confidence in the value (if obtained), the number of times the system has asked the user about the attribute, and the type of ASR grammar most recently used to ask for the attribute.</Paragraph> <Paragraph position="9"> The formal state space S maintained by NJFun for the purposes of learning is much simpler than the operations vector, due to the data sparsity concerns already discussed. The dialogue state space S contains only 7 variables, as summarized in Figure 3. S is computed from the operations vector using a hand-designed algorithm. The &quot;greet&quot; variable tracks whether NJFun has greeted the user or not (no=0, yes=1). &quot;Attr&quot; specifies which attribute NJFun is currently attempting to obtain or verify (activity=1, location=2, time=3, done with attributes=4). &quot;Conf&quot; represents the confidence that NJFun has after obtaining a value for an attribute.</Paragraph> <Paragraph position="10"> The values 0, 1, and 2 represent the lowest, middle and highest ASR confidence values.2 The values 3 and 4 are set when ASR hears &quot;yes&quot; or &quot;no&quot; after a confirmation question. &quot;Val&quot; tracks whether NJFun has obtained a value for the attribute (no=0, yes=1). &quot;Times&quot; tracks the number of times that NJFun has asked the user about the attribute. &quot;Gram&quot; tracks the type of grammar most recently used to obtain the attribute (0=non-restrictive, 1=restrictive). Finally, &quot;hist&quot; (history) represents whether NJFun had trouble understanding the user in the earlier part of the conversation (bad=0, good=1). We omit the full definition, but as an example, when NJFun is working on the second attribute (location), the history variable is set to 0 if NJFun does not have an activity, has an activity but has no confidence in the value, or needed two queries to obtain the activity. 2 For each utterance, the ASR output includes not only the recognized string, but also an associated acoustic confidence score. Based on data obtained during system development, we defined a mapping from raw confidence values into 3 approximately equally populated partitions.</Paragraph>
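One plausible encoding of the 7-variable learning state is sketched below. The field names follow Figure 3, but the concrete layout, the raw-score cut points in bin_confidence, and the example state are illustrative assumptions rather than NJFun's actual values.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DialogueState:
    greet: int   # 0 = not greeted, 1 = greeted
    attr: int    # 1 = activity, 2 = location, 3 = time, 4 = done with attributes
    conf: int    # 0/1/2 = low/middle/high ASR confidence, 3 = heard "yes", 4 = heard "no"
    val: int     # 0 = no value obtained for the current attribute, 1 = value obtained
    times: int   # number of times the current attribute has been asked
    gram: int    # 0 = non-restrictive grammar, 1 = restrictive grammar
    hist: int    # 0 = trouble understanding the user earlier, 1 = no trouble

def bin_confidence(raw_score, low_cut=-3.0, high_cut=-1.0):
    """Map a raw acoustic confidence score into 3 bins (0, 1, 2).
    The cut points here are made up; the paper chose them so that the bins
    were approximately equally populated in development data."""
    if raw_score < low_cut:
        return 0
    if raw_score < high_cut:
        return 1
    return 2

# A plausible rendering of the initial dialogue state "0 1 0 0 0 0 0":
# not yet greeted, working on attribute 1, nothing obtained so far.
initial_state = DialogueState(greet=0, attr=1, conf=0, val=0, times=0, gram=0, hist=0)
```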
<Paragraph position="11"> As mentioned above, the goal is to design a small state space that makes enough critical distinctions to support learning. The use of S reduces the number of states to only 62, and supports the construction of an MDP model that is not sparse with respect to S, even using limited training data.3 The state space that we utilize here, although minimal, allows us to make initiative decisions based on the success of earlier exchanges, and confirmation decisions based on ASR confidence scores and grammars. 3 62 refers to those states that can actually occur in a dialogue. For example, greet=0 is only possible in the initial dialogue state &quot;0 1 0 0 0 0 0&quot;. Thus, all other states beginning with 0 (e.g., &quot;0 1 0 0 1 0 0&quot;) will never occur.</Paragraph> <Paragraph position="12"> The state/action mapping representing NJFun's initial dialogue strategy EIC (Exploratory for Initiative and Confirmation) is in Figure 4. Only the exploratory portion of the strategy is shown, namely those states for which NJFun has an action choice.</Paragraph> <Paragraph position="13"> For each such state, we list the two choices of actions available. (The action choices in boldface are the ones eventually identified as optimal by the learning process, and are discussed in detail later.) The EIC strategy chooses randomly between these two actions in the indicated state, to maximize exploration and minimize data sparseness when constructing our model. Since there are 42 states with 2 choices each, there is a search space of 2^42 potential global dialogue strategies; the goal of RL is to identify an apparently optimal strategy from this large search space. Note that due to the randomization of the EIC strategy, the prompts are designed to ensure the coherence of all possible action sequences.</Paragraph> <Paragraph position="15"> Figure 5 illustrates how the dialogue strategy in Figure 4 generates the dialogue in Figure 1. Each row indicates the state that NJFun is in, the action executed in this state, the corresponding turn in Figure 1, and the reward received. The initial state represents that NJFun will first attempt to obtain attribute 1. NJFun executes GreetU (although as shown in Figure 4, GreetS is also possible), generating the first utterance in Figure 1. After the user's response, the next state represents that NJFun has now greeted the user and obtained the activity value with high confidence, by using a non-restrictive grammar. NJFun then chooses the NoConf strategy, so it does not attempt to confirm the activity, which causes the state to change but no prompt to be generated. The third state represents that NJFun is now working on the second attribute (location), that it already has this value with high confidence (location was obtained with activity after the user's first utterance), and that the dialogue history is good.4 This time NJFun chooses the ExpConf2 strategy, and confirms the attribute with the second NJFun utterance, and the state changes again. The processing of time is similar to that of location, which leads NJFun to the final state, where it performs the action &quot;Tell&quot; (corresponding to querying the database, presenting the results to the user, and asking the user to provide a reward). Note that in NJFun, the reward is always 0 except at the terminal state, as shown in the last column of Figure 5. 4 Recall that only the current attribute's features are in the state. However, the operations vector contains information regarding previous attributes.</Paragraph> </Section> </Section>
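The random exploration performed by EIC, together with the per-dialogue logging of states, actions, and the terminal-only reward illustrated in Figure 5, could be sketched as follows; the callables and data structures are hypothetical stand-ins for the real system components.

```python
import random

def eic_choose(state, action_choices):
    """EIC: pick uniformly at random between the two allowed actions for this
    state (the per-state choices listed in Figure 4)."""
    return random.choice(action_choices[state])

def run_and_log(initial_state, action_choices, execute, terminal_reward):
    """Run one dialogue, recording (state, action, reward, next_state) exchanges.
    'execute' advances the dialogue and returns None at the end; 'terminal_reward'
    scores the final exchange. All intermediate rewards are 0, as in NJFun."""
    trajectory = []
    state = initial_state
    while state is not None:
        action = eic_choose(state, action_choices)
        next_state = execute(state, action)
        reward = terminal_reward(state, action) if next_state is None else 0
        trajectory.append((state, action, reward, next_state))
        state = next_state
    return trajectory
```

Trajectories logged this way are exactly the input assumed by the estimate_mdp sketch in Section 2.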
<Section position="5" start_page="505" end_page="506" type="metho"> <SectionTitle> 4 Experimentally Optimizing a </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="505" end_page="506" type="sub_section"> <SectionTitle> Strategy </SectionTitle> <Paragraph position="0"> We collected experimental dialogues for both training and testing our system. To obtain training dialogues, we implemented NJFun using the EIC dialogue strategy described in Section 3. We used these dialogues to build an empirical MDP, and then computed the optimal dialogue strategy in this MDP (as described in Section 2). In this section we describe our experimental design and the learned dialogue strategy. In the next section we present results from testing our learned strategy and show that it improves task completion rates, the performance measure we chose to optimize.</Paragraph> <Paragraph position="1"> Experimental subjects were employees not associated with the NJFun project. There were 54 subjects for training and 21 for testing. Subjects were distributed so the training and testing pools were balanced for gender, English as a first language, and expertise with spoken dialogue systems.</Paragraph> <Paragraph position="3"> During both training and testing, subjects carried out free-form conversations with NJFun to complete six application tasks.
For example, the task executed by the user in Figure 1 was: &quot;You feel thirsty and want to do some winetasting in the morning. Are there any wineries close by your house in Lambertville?&quot; Subjects read task descriptions on a web page, then called NJFun from their office phone.</Paragraph> <Paragraph position="5"> At the end of the task, NJFun asked for feedback on their experience (e.g., utterance S4 in Figure 1). Users then hung up the phone and filled out a user survey (Singh et al., 2000) on the web.</Paragraph> <Paragraph position="7"> The training phase of the experiment resulted in 311 complete dialogues (not all subjects completed all tasks), for which NJFun logged the sequence of states and the corresponding executed actions.</Paragraph> <Paragraph position="8"> The number of samples per state for the initial ask choices are:</Paragraph> <Paragraph position="10"> Such data illustrates that the random action choice strategy led to a fairly balanced action distribution per state. Similarly, the small state space, and the fact that we only allowed 2 action choices per state, prevented a data sparseness problem. The first state in Figure 4, the initial state for every dialogue, was the most frequently visited state (with 311 visits). Only 8 states that occur near the end of a dialogue were visited less than 10 times.</Paragraph> <Paragraph position="12"> The logged data was then used to construct the empirical MDP. As we have mentioned, the measure we chose to optimize is a binary reward function based on the strongest possible measure of task completion, called StrongComp, that takes on value 1 if NJFun queries the database using exactly the attributes specified in the task description, and 0 otherwise. Then we computed the optimal dialogue strategy in this MDP using RL (cf. Section 2). The action choices constituting the learned strategy are in boldface in Figure 4. Note that no choice was fixed for several states, meaning that the Q-values were identical after value iteration. Thus, even when using the learned strategy, NJFun still sometimes chooses randomly between certain action pairs.</Paragraph> <Paragraph position="13"> Intuitively, the learned strategy says that the optimal use of initiative is to begin with user initiative, then back off to either mixed or system initiative when reasking for an attribute. Note, however, that the specific backoff method differs with attribute (e.g., system initiative for attribute 1, but generally mixed initiative for attribute 2). With respect to confirmation, the optimal strategy is to mainly confirm at lower confidence values. Again, however, the point where confirmation becomes unnecessary differs across attributes (e.g., confidence level 2 for attribute 1, but sometimes lower levels for attributes 2 and 3), and also depends on other features of the state besides confidence (e.g., grammar and history). This use of ASR confidence by the dialogue strategy is more sophisticated than previous approaches, e.g. (Niimi and Kobayashi, 1996; Litman and Pan, 2000). NJFun can learn such fine-grained distinctions because the optimal strategy is based on a comparison of 2^42 possible exploratory strategies. Both the initiative and confirmation results suggest that the beginning of the dialogue was the most problematic for NJFun. Figure 1 is an example dialogue using the optimal strategy.</Paragraph> </Section> </Section>
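Reading the learned strategy out of the Q-values, including the states where the two Q-values tie and the system keeps choosing randomly, might look like the following sketch; it assumes a Q dictionary keyed by (state, action) pairs, as produced by a value-iteration routine like the one sketched in Section 2.

```python
import random

def extract_policy(Q, tolerance=1e-9):
    """Return, for every state, the list of actions whose Q-value is maximal.
    A singleton list is a fixed choice; a longer list means the Q-values tied
    and the system may keep choosing randomly among those actions."""
    by_state = {}
    for (state, action), q in Q.items():
        by_state.setdefault(state, []).append((action, q))
    policy = {}
    for state, action_values in by_state.items():
        best = max(q for _, q in action_values)
        policy[state] = [a for a, q in action_values if best - q <= tolerance]
    return policy

def act(policy, state):
    """Pick an action under the learned strategy, breaking ties at random."""
    return random.choice(policy[state])
```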
<Section position="6" start_page="506" end_page="507" type="metho"> <SectionTitle> 5 Experimentally Evaluating the </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="506" end_page="507" type="sub_section"> <SectionTitle> Strategy </SectionTitle> <Paragraph position="0"> For the testing phase, NJFun was reimplemented to use the learned strategy. 21 test subjects then performed the same 6 tasks used during training, resulting in 124 complete test dialogues. One of our main results is that task completion as measured by StrongComp increased from 52% in training to 64% in testing (p < .06).5 There is also a significant interaction effect between strategy and task (p < .01) for StrongComp. Previous work has suggested that novice users perform comparably to experts after only 2 tasks (Kamm et al., 1998). Since our learned strategy was based on 6 tasks with each user, one explanation of the interaction effect is that the learned strategy is slightly optimized for expert users. To explore this hypothesis, we divided our corpus into dialogues with &quot;novice&quot; (tasks 1 and 2) and &quot;expert&quot; (tasks 3-6) users. We found that the learned strategy did in fact lead to a large and significant improvement in StrongComp for experts (EIC=.46, learned=.69, p < .001), and a non-significant degradation for novices (EIC=.66, learned=.55, p < .3). 5 The experimental design described above consists of 2 factors: the within-group factor strategy and the between-groups factor task. We use a two-way analysis of variance (ANOVA) to compute whether main and interaction effects of strategy are statistically significant (p < .05) or indicative of a statistical trend (p < .10). Main effects of strategy are task-independent, while interaction effects involving strategy are task-dependent.</Paragraph> <Paragraph position="1"> An apparent limitation of these results is that EIC may not be the best baseline strategy for comparison to our learned strategy. A more standard alternative would be comparison to the very best hand-designed fixed strategy. However, there is no agreement in the literature, nor amongst the authors, as to what the best hand-designed strategy might have been. There is agreement, however, that the best strategy is sensitive to many unknown and unmodeled factors: the user population, the specifics of the task, the particular ASR used, etc. Furthermore, EIC was carefully designed so that the random choices it makes never result in an unnatural dialogue. Finally, a companion paper (Singh et al., 2000) shows that the performance of the learned strategy is better than several &quot;standard&quot; fixed strategies (such as always use system-initiative and no-confirmation).</Paragraph> <Paragraph position="3"> Although many types of measures have been used to evaluate dialogue systems (e.g., task success, dialogue quality, efficiency, usability (Danieli and Gerbino, 1995; Kamm et al., 1998)), we optimized only for one task success measure, StrongComp.</Paragraph> <Paragraph position="4"> However, we also examined the performance of the learned strategy using other evaluation measures (which possibly could have been used as our reward function).
WeakComp is a relaxed version of task completion that gives partial credit: if all attribute values are either correct or wildcards, the value is the sum of the correct number of attributes. Otherwise, at least one attribute is wrong (e.g., the user says &quot;Lambertville&quot; but the system hears &quot;Morristown&quot;), and the value is -1. ASR is a dialogue quality measure that approximates speech recognition accuracy for the database query, and is computed by adding 1 for each correct attribute value and .5 for every wildcard. Thus, if the task is to go winetasting near Lambertville in the morning, and the system queries the database for an activity in New Jersey in the morning, StrongComp=0, WeakComp=1, and ASR=2. In addition to the objective measures discussed above, we also computed two subjective usability measures. Feedback is obtained from the dialogue (e.g. S4 in Figure 5), by mapping good, so-so, bad to 1, 0, and -1, respectively. User satisfaction (UserSat, ranging from 0-20) is obtained by summing the answers of the web-based user survey.</Paragraph> <Paragraph position="5"> Table 1 summarizes the difference in performance of NJFun for our original reward function and the above alternative evaluation measures, from training (EIC) to test (learned strategy for StrongComp).</Paragraph> <Paragraph position="6"> For WeakComp, the average reward increased from 1.75 to 2.19 (p < 0.02), while for ASR the average reward increased from 2.5 to 2.67 (p < 0.04). Again, these improvements occur even though the learned strategy was not optimized for these measures.</Paragraph> <Paragraph position="7"> The last two rows of the table show that for the subjective measures, performance does not significantly differ for the EIC and learned strategies. Interestingly, the distributions of the subjective measures move to the middle from training to testing, i.e., test users reply to the survey using less extreme answers than training users. Explaining the subjective results is an area for future work.</Paragraph> </Section> </Section> </Paper>
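For concreteness, the three objective measures defined above can be computed as follows. The attribute dictionaries and the wildcard marker are assumed representations; the scoring itself follows the definitions in the text, and the final assertions reproduce the winetasting example (StrongComp=0, WeakComp=1, ASR=2).

```python
WILDCARD = "*"
ATTRIBUTES = ("activity", "location", "time")

def strong_comp(task, query):
    """1 if the database query uses exactly the attributes in the task, else 0."""
    return int(all(query[a] == task[a] for a in ATTRIBUTES))

def weak_comp(task, query):
    """Partial credit: number of correct attributes if the rest are wildcards,
    -1 if any attribute value is actually wrong."""
    if any(query[a] not in (task[a], WILDCARD) for a in ATTRIBUTES):
        return -1
    return sum(query[a] == task[a] for a in ATTRIBUTES)

def asr_score(task, query):
    """Approximate recognition accuracy: 1 per correct attribute, 0.5 per wildcard."""
    return sum(1.0 if query[a] == task[a] else
               0.5 if query[a] == WILDCARD else 0.0
               for a in ATTRIBUTES)

# The example from the text: the task is winetasting near Lambertville in the
# morning, but the system queries for any activity anywhere in New Jersey in the morning.
task = {"activity": "winetasting", "location": "Lambertville", "time": "morning"}
query = {"activity": WILDCARD, "location": WILDCARD, "time": "morning"}
assert strong_comp(task, query) == 0
assert weak_comp(task, query) == 1
assert asr_score(task, query) == 2.0
```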