<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0304">
  <Title>NJFun: A Reinforcement Learning Spoken Dialogue System</Title>
  <Section position="3" start_page="0" end_page="18" type="metho">
    <SectionTitle>
2 The NJFun System
</SectionTitle>
    <Paragraph position="0"> NJFun is a real-time spoken dialogue system that provides users with information about things to do in New Jersey. An example dialogue with NJFun is shown in Figure 1. NJFun is built using an internal platform for spoken dialogue systems. NJFun uses a speech recognizer with stochastic language models trained from example user utterances, and a TTS system based on concatenative diphone synthesis.</Paragraph>
    <Paragraph position="1"> Its database is populated from the nj.online webpage to contain information about activities. NJFun indexes this database using three attributes: activity type, location, and time of day.</Paragraph>
    <Paragraph position="2"> Informally, the NJFun dialogue manager sequentially queries the user regarding the activity, location, and time attributes, respectively. NJFun first asks the user for the current attribute (and possibly the other attributes, depending on the initiative). If the current attribute's value is not obtained, NJFun asks for the attribute (and possibly the later attributes) again. If NJFun still does not obtain a value, NJFun moves on to the next attribute(s). Whenever NJFun successfully obtains a value, it can confirm the value, or move on and attempt to obtain the next attribute(s). (Users can supply multiple attributes, in any order, in a single utterance; however, NJFun always processes multiple attributes using its predefined sequential ordering.) When NJFun has finished asking about the attributes, it queries the database (using a wildcard for each unobtained attribute value).</Paragraph>
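    <Paragraph> The control flow just described can be sketched as follows. This is a minimal illustration only, not the actual NJFun implementation; the component names ask, confirm, and query_database are hypothetical stand-ins.

      # Hypothetical sketch of NJFun's sequential attribute querying (illustration only).
      ATTRIBUTES = ["activity", "location", "time"]
      MAX_ASKS = 2  # assumption: ask at most twice before giving up on an attribute

      def run_dialogue(ask, confirm, query_database):
          """ask, confirm, and query_database stand in for the real system components."""
          values = {}
          for attr in ATTRIBUTES:
              for _ in range(MAX_ASKS):
                  value = ask(attr)         # the reply may also supply later attributes
                  if value is not None:
                      values[attr] = value
                      confirm(attr, value)  # optional: the strategy may skip confirmation
                      break
              # if no value was obtained, fall through to the next attribute
          wildcards = [a for a in ATTRIBUTES if a not in values]
          return query_database(values, wildcards)  # wildcards for unobtained attributes
    </Paragraph>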
    <Paragraph position="3"> We use reinforcement learning (RL) to optimize dialogue strategy. RL requires that all potential actions for each state be specified. Note that at some states it is easy for a human to make the correct action choice. We made obvious dialogue strategy choices in advance, and used learning only to optimize the difficult choices. In NJFun, we restricted the action choices to 1) the type of initiative to use when asking or reasking for an attribute, and 2) whether to confirm an attribute value once obtained.</Paragraph>
    <Paragraph position="4"> Figure 1 (example dialogue with NJFun): S1: Welcome to NJFun. How may I help you? U1: I'd like to find um winetasting in Lambertville in the morning. (ASR output: I'd like to find out wineries the in the Lambertville in the morning.) S2: Did you say you are interested in Lambertville? U2: Yes.</Paragraph>
    <Paragraph position="5"> S3: Did you say you want to go in the morning? U3: Yes.</Paragraph>
    <Paragraph position="6"> S4: I found a winery near Lambertville that is open in the morning. It is [...]. Please give me feedback by saying 'good', 'so-so', or 'bad'. U4: Good.</Paragraph>
    <Paragraph position="7"> Figure 2 (sample system prompts): "Welcome to NJFun. Please say an activity name or say 'list activities' for a list of activities I know about." "Welcome to NJFun. How may I help you?" "I know about amusement parks, aquariums, cruises, historic sites, museums, parks, theaters, wineries, and zoos. Please say an activity name from this list."</Paragraph>
    <Paragraph position="8"> "Please tell me the activity type. You can also tell me the location and time." "Please say the name of the town or city that you are interested in." "Please give me more information."</Paragraph>
    <Paragraph position="9"> "Please tell me the name of the town or city that you are interested in." "Please tell me the location that you are interested in. You can also tell me the time."</Paragraph>
    <Paragraph position="10"> The optimal actions may vary with dialogue state, and are subject to active debate in the literature.</Paragraph>
    <Paragraph position="11"> The examples in Figure 2 show that NJFun can ask the user about the first two attributes using three types of initiative, based on the combination of the wording of the system prompt (open versus directive) and the type of grammar NJFun uses during ASR (restrictive versus non-restrictive). (Greeting the user is equivalent to asking for the first attribute; NJFun always uses system initiative for the third attribute, because at that point the user can only provide the time of day.) If NJFun uses an open question with an unrestricted grammar, it is using user initiative (e.g., GreetU). If NJFun instead uses a directive prompt with a restricted grammar, the system is using system initiative (e.g., GreetS). If NJFun uses a directive question with a non-restrictive grammar, it is using mixed initiative, because it is giving the user an opportunity to take the initiative by supplying extra information (e.g., ReAsk1M).</Paragraph>
    <Paragraph position="12"> NJFun can also vary the strategy used to confirm each attribute. If NJFun asks the user to explicitly verify an attribute, it is using explicit confirmation (e.g., ExpConf2 for the location, exemplified by S2 in Figure 1). If NJFun does not generate any confirmation prompt, it is using no confirmation (an action we call NoConf).</Paragraph>
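    <Paragraph> As a compact restatement of these action choices (illustrative only; the tables below are not code from the system), the three initiative types are combinations of prompt wording and ASR grammar, and confirmation is a separate binary choice:

      # Hypothetical enumeration of the learnable action dimensions described above.
      INITIATIVE_ACTIONS = {
          "user":   {"prompt": "open",      "grammar": "non-restrictive"},  # e.g., GreetU
          "system": {"prompt": "directive", "grammar": "restrictive"},      # e.g., GreetS
          "mixed":  {"prompt": "directive", "grammar": "non-restrictive"},  # e.g., ReAsk1M
      }
      CONFIRMATION_ACTIONS = ["explicit_confirm", "no_confirm"]  # e.g., ExpConf2 vs. NoConf
    </Paragraph>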
    <Paragraph position="13"> Solely for the purposes of controlling its operation (as opposed to the learning, which we discuss in a moment), NJFun internally maintains an operations vector of 14 variables. Two variables track whether the system has greeted the user, and which attribute the system is currently attempting to obtain. For each of the 3 attributes, 4 variables track whether the system has obtained the attribute's value, the system's confidence in the value (if obtained), the number of times the system has asked the user about the attribute, and the type of ASR grammar most recently used to ask for the attribute.</Paragraph>
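    <Paragraph> One possible layout of this 14-variable operations vector (2 dialogue-level variables plus 4 variables for each of the 3 attributes) is sketched below; the field names and types are assumptions, not the system's actual representation.

      # Hypothetical layout of the operations vector (illustration only).
      from dataclasses import dataclass, field

      @dataclass
      class AttributeStatus:
          obtained: bool = False      # has a value been obtained?
          confidence: float = 0.0     # ASR confidence in the value, if obtained
          times_asked: int = 0        # number of times the user has been asked
          grammar: str = "none"       # type of ASR grammar most recently used

      @dataclass
      class OperationsVector:
          greeted: bool = False       # has the system greeted the user?
          current_attribute: int = 1  # which attribute is currently being obtained
          attributes: dict = field(default_factory=lambda: {
              a: AttributeStatus() for a in ("activity", "location", "time")
          })
    </Paragraph>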
    <Paragraph position="14"> The formal state space S maintained by NJFun for the purposes of learning is much simpler than the operations vector, due to data sparsity concerns.</Paragraph>
    <Paragraph position="15"> The dialogue state space S contains only 7 variables, which are summarized in Figure 3, and is easily computed from the operations vector. The "greet" variable tracks whether the system has greeted the user or not (no=0, yes=1). "Attr" specifies which attribute NJFun is currently attempting to obtain or verify (activity=1, location=2, time=3, done with attributes=4). "Conf" represents the confidence that NJFun has after obtaining a value for an attribute. The values 0, 1, and 2 represent low, medium, and high ASR confidence. The values 3 and 4 are set when ASR hears "yes" or "no" after a confirmation question. "Val" tracks whether NJFun has obtained a value for the attribute (no=0, yes=1).</Paragraph>
    <Paragraph position="16"> "Times" tracks the number of times that NJFun has asked the user about the attribute. "Gram" tracks the type of grammar most recently used to obtain the attribute (0=non-restrictive, 1=restrictive). Finally, "history" represents whether NJFun had trouble understanding the user in the earlier part of the conversation (bad=0, good=1). We omit the full definition, but as an example, when NJFun is working on the second attribute (location), the history variable is set to 0 if NJFun does not have an activity, has an activity but has no confidence in the value, or needed two queries to obtain the activity.</Paragraph>
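    <Paragraph> The 7-variable learning state can be pictured as a simple tuple; the encoding below follows the value conventions just described, but the class itself is a hypothetical illustration, not the system's actual data structure.

      # Hypothetical encoding of the 7-variable dialogue state used for learning.
      from typing import NamedTuple

      class DialogueState(NamedTuple):
          greet: int    # 0 = not yet greeted, 1 = greeted
          attr: int     # 1 = activity, 2 = location, 3 = time, 4 = done with attributes
          conf: int     # 0/1/2 = low/medium/high ASR confidence; 3/4 = heard yes/no
          val: int      # 0 = no value obtained, 1 = value obtained
          times: int    # number of times the attribute has been asked about
          gram: int     # 0 = non-restrictive grammar, 1 = restrictive grammar
          history: int  # 0 = trouble earlier in the dialogue, 1 = no trouble

      # Example: working on location, value obtained with high confidence, good history.
      example_state = DialogueState(greet=1, attr=2, conf=2, val=1, times=1, gram=0, history=1)
    </Paragraph>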
    <Paragraph position="17"> In order to apply RL with a limited amount of training data, we need to design a small state space that makes enough critical distinctions to support learning. The use of S yields a state space of size 62. The state space that we utilize here, although minimal, allows us to make initiative decisions based on the success of earlier exchanges, and confirmation decisions based on ASR confidence scores and grammars. In order to learn a good dialogue strategy via RL we have to explore the state-action space. The state/action mapping representing NJFun's initial exploratory dialogue strategy EIC (Exploratory for Initiative and Confirmation) is given in Figure 4.</Paragraph>
    <Paragraph position="18"> Only the exploratory portion of the strategy is shown, namely all those states for which NJFun has an action choice. For each such state, we list the two choices of actions available. (The action choices in boldface are the ones eventually identified as optimal by the learning process.) The EIC strategy chooses randomly between these two actions when in the indicated state, in order to maximize exploration and minimize data sparseness when constructing our model. Since there are 42 states with 2 choices each, there is a search space of 2^42 potential dialogue strategies; the goal of the RL is to identify an apparently optimal strategy from this large search space. Note that due to the randomization of the EIC strategy, the prompts are designed to ensure the coherence of all possible action sequences.</Paragraph>
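    <Paragraph> The exploratory behaviour of EIC amounts to flipping a fair coin between the two allowed actions at each choice state; a minimal sketch, with a hypothetical action_choices table standing in for Figure 4, is given below.

      # Sketch of the EIC exploration policy (illustration only).
      import random

      def eic_policy(state, action_choices):
          """action_choices maps each choice state to its two allowed actions."""
          options = action_choices.get(state)
          if options is None:
              raise KeyError("state has no exploratory choice; use the fixed action")
          return random.choice(options)  # uniform choice over the two options

      # With 42 choice states and 2 options each, there are 2**42 deterministic strategies.
    </Paragraph>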
    <Paragraph position="19"> Figure 5 illustrates how the dialogue strategy in Figure 4 generates the dialogue in Figure 1. Each row indicates the state that NJFun is in, the action executed in this state, the corresponding turn in Figure 1, and the reward received. The initial state represents that NJFun will first attempt to obtain attribute 1. NJFun executes GreetU (although as shown in Figure 4, GreetS is also possible), generating the first utterance in Figure 1. After the user's response, the next state represents that NJFun has now greeted the user and obtained the activity value with high confidence, by using a non-restrictive grammar. NJFun chooses not to confirm the activity, which causes the state to change but no prompt to be generated. The third state represents that NJFun is now working on the second attribute (location), that it already has this value with high confidence (location was obtained with activity after the user's first utterance), and that the dialogue history is good. This time NJFun chooses to confirm the attribute with the second NJFun utterance, and the state changes again. The processing of time is similar to that of location, which leads NJFun to the final state, where it performs the action "Tell" (corresponding to querying the database, presenting the results to the user, and asking the user to provide a reward). Note that in NJFun, the reward is always 0 except at the terminal state, as shown in the last column of Figure 5.</Paragraph>
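    <Paragraph> Under this reward structure, each logged dialogue becomes a trajectory of (state, action, reward) triples in which every reward is 0 except the final one; a minimal sketch (with hypothetical function and argument names) is shown below.

      # Hypothetical conversion of one dialogue log into an RL trajectory.
      def as_trajectory(states, actions, final_reward):
          """states and actions are parallel lists from a single dialogue."""
          trajectory = []
          last = len(actions) - 1
          for i, (s, a) in enumerate(zip(states, actions)):
              reward = final_reward if i == last else 0  # reward only at the terminal action
              trajectory.append((s, a, reward))
          return trajectory
    </Paragraph>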
  </Section>
  <Section position="4" start_page="18" end_page="19" type="metho">
    <SectionTitle>
3 Sample Dialogues
</SectionTitle>
    <Paragraph position="0"> Figures 1 and 6 present two dialogues that were generated using different dialogue strategies in the EIC search space. For example, although NJFun begins both dialogues in the first state in Figure 4, NJFun executes GreetU in one dialogue and GreetS in the other. Similarly, although the next state of NJFun [...] Figure 6 (example dialogue): S1: Welcome to NJFun. Please say an activity name or say 'list activities' for a list of activities I know about. U1: I would like to visit a historic site in Stanhope this morning. (ASR output: I would like to the zoo historic sites in stanhope historic) S2: Did you say you are interested in going to a zoo? U2: No.</Paragraph>
    <Paragraph position="1"> S3: I know about amusement parks, aquariums, cruises, historic sites, museums, parks, theaters, wineries, and zoos. Please say an activity name from this list.</Paragraph>
    <Paragraph position="2"> U3: I would like to visit a historic site. (ASR output: I would like to visit historic sites) S4: Did you say you are interested in going to a historic site? U4: Yes.</Paragraph>
    <Paragraph position="3"> S5: Please give me more information.</Paragraph>
    <Paragraph position="4"> U5: (user says nothing) S6: Please tell me the location that you are interested in. You can also tell me the time. U6: Visit a historic. (ASR output: What historic) S7: What time of the day do you want to go? U7: Stanhope. (no ASR output) S8: Do you want to go in the morning, in the afternoon, or in the evening? U8: Morning.</Paragraph>
  </Section>
  <Section position="5" start_page="19" end_page="19" type="metho">
    <SectionTitle>
4 Contributions
</SectionTitle>
    <Paragraph position="0"> The main contribution of this work is that we have developed and empirically validated a practical methodology for using RL to build a real dialogue system that optimizes its behavior from dialogue data. Unlike traditional approaches to learning dialogue strategy from data, which are limited to searching a handful of policies, our RL approach is able to search many tens of thousands of dialogue strategies. In particular, the traditional approach is to pick a handful of strategies that experts intuitively feel are good, implement each policy as a separate system, collect data from representative human users for each system, and then use standard statistical tests on that data to pick the best system, e.g. (Danieli and Gerbino, 1995). In contrast, our use of RL allowed us to explore the 2^42 strategies that were left in our search space after we excluded strategies that were clearly suboptimal.</Paragraph>
    <Paragraph position="1"> An empirical validation of our approach is detailed in two forthcoming technical papers (Singh et al., 2000; Litman et al., 2000). We obtained 311 dialogues with the exploratory (i.e., training) version of NJFun, constructed an MDP from this training data, used RL to compute the optimal dialogue strategy in this MDP, reimplemented NJFun such that it used this learned dialogue strategy, and obtained 124 more dialogues. Our main result was that task completion improved from 52% to 64% from training to test data. Furthermore, analysis of our MDP showed that the learned strategy was not only better than EIC, but also better than other fixed choices proposed in the literature (Singh et al., 2000).</Paragraph>
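    <Paragraph> As a rough illustration of this pipeline (a generic sketch only; the actual estimation and optimization procedure is described in Singh et al. (2000) and Litman et al. (2000)), an empirical MDP can be estimated from logged trajectories and solved by value iteration as follows.

      # Generic sketch: estimate an MDP from (state, action, reward) trajectories
      # and compute a greedy policy by value iteration. Illustration only.
      from collections import defaultdict

      def estimate_mdp(trajectories):
          counts = defaultdict(lambda: defaultdict(int))  # (state, action) to next-state counts
          reward_sum = defaultdict(float)
          for traj in trajectories:
              for i, (s, a, r) in enumerate(traj):
                  nxt = traj[i + 1][0] if i + 1 != len(traj) else "TERMINAL"
                  counts[(s, a)][nxt] += 1
                  reward_sum[(s, a)] += r
          trans = {sa: {n: c / sum(d.values()) for n, c in d.items()}
                   for sa, d in counts.items()}
          rew = {sa: reward_sum[sa] / sum(d.values()) for sa, d in counts.items()}
          return trans, rew

      def value_iteration(trans, rew, gamma=1.0, sweeps=200):
          q = defaultdict(float)
          v = defaultdict(float)  # the terminal state keeps value 0
          for _ in range(sweeps):
              for (s, a), dist in trans.items():
                  q[(s, a)] = rew[(s, a)] + gamma * sum(p * v[n] for n, p in dist.items())
              for s in {s2 for (s2, _) in trans}:
                  v[s] = max(q[(s, a)] for (s2, a) in trans if s2 == s)
          policy = {s: max(((a, q[(s, a)]) for (s2, a) in trans if s2 == s),
                           key=lambda pair: pair[1])[0]
                    for s in {s2 for (s2, _) in trans}}
          return policy
    </Paragraph>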
  </Section>
</Paper>