<?xml version="1.0" standalone="yes"?> <Paper uid="N01-1028"> <Title>Learning optimal dialogue management rules by using reinforcement learning and inductive logic programming</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Example dialogue system </SectionTitle> <Paragraph position="0"> In this section we present a simple dialogue system that we use in the rest of the paper to describe and explain our results. This system will be used with automated users in order to simulate dialogues. The aim of the system is to be simple enough so that its operation is easy to understand while being complex enough to allow the study of the phenomena we are interested in. This will provide a simple way to explain our approach.</Paragraph> <Paragraph position="1"> We chose a system whose goal is to nd values for three pieces of information, called, unoriginally, a, b and c. In a practical system such as an automated travel agent for example, these values could be departure and arrival cities and the time of a ight.</Paragraph> <Paragraph position="2"> We now describe the system in terms of states, transitions, actions and rewards, which are the basic notions of reinforcement learning. The system has four actions at its disposal: prepare to ask (prepAsk) a question about one of the pieces of information, prepare to recognize (prepRec) a user's utterance about a piece of information, ask and recognize (ask&recognize) which outputs all the prepared questions and tries to recognize all the expected utterances, and end (end) which terminates the dialogue.</Paragraph> <Paragraph position="3"> We chose these actions as they are common, in one form or another, in most speech dialogue systems. To get a speci c piece of information, the system must prepare a question about it and expect a user utterance as an answer before carrying out an ask&recognize action. The system can try to get more than one piece of information in a single ask&recognize action by preparing more than one question and preparing to recognize more than one answer.</Paragraph> <Paragraph position="4"> Actions are associated with rewards or penalties. Every system action, except ending, has a penalty of -5 corresponding to some imagined processing cost. Ending provides a reward of 100 times the number of pieces of information known when the dialogue ends. We hope that these numbers simulate a realistic reward function. They could be tuned to re ect user satisfaction for a real dialogue manager.</Paragraph> <Paragraph position="5"> The state of the system represents which pieces of information are known or unknown and what questions and recognitions have been prepared. There is also a special end state. For this example, there are 513 di erent states.</Paragraph> <Paragraph position="6"> Pieces of information become known when users answer the system's questions. In our tutorial example, we used automated users. These users always give one piece of information if properly asked as explained above, and answer potential further questions with a decreasing probability (0.5 for a second piece of information, and 0.25 for a third in our example). We could tune these probabilities to re ect real user behavior. Using simulated users enables us to quickly train our system. 
</Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Learning rules from optimal strategy </SectionTitle> <Paragraph position="0"> In this section we explain how we obtain and interpret rules expressing the optimal management strategy found by reinforcement learning, both for the system presented in section 2 and for a more realistic one.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Example system </SectionTitle> <Paragraph position="0"> We first search for the optimal strategy of our example system by using reinforcement learning. We do this by running repeated dialogues with the automated users and evaluating the average reward of the actions taken by the system. When deciding what to do in each state, we choose the up-to-now best action with probability 0.8 and other actions with uniform probability totaling 0.2. This allows the system to explore the dialogue space while preferably following the best strategy found.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> State Action </SectionTitle> <Paragraph position="0"> The optimal strategy found is the following:
{unknown(c), unknown(b), unknown(a)}                          ->  prepRec(a)
{unknown(c), unknown(b), unknown(a), prepRec(a)}              ->  prepAsk(a)
{unknown(c), unknown(b), unknown(a), prepRec(a), prepAsk(a)}  ->  ask&recognize
{unknown(c), unknown(b), known(a)}                            ->  prepRec(b)
{unknown(c), unknown(b), known(a), prepRec(b)}                ->  prepAsk(b)
{unknown(c), unknown(b), known(a), prepRec(b), prepAsk(b)}    ->  ask&recognize
{unknown(c), known(b), known(a)}                              ->  prepRec(c)
{unknown(c), known(b), known(a), prepRec(c)}                  ->  prepAsk(c)
{unknown(c), known(b), known(a), prepRec(c), prepAsk(c)}      ->  ask&recognize
{known(c), known(b), known(a)}                                ->  end
The strategy is to get one piece of information at a time until all the pieces have been collected, and then end the dialogue.</Paragraph> <Paragraph position="1"> A typical dialogue following this strategy would simply go like this, using the travel agent example:
S: Where do you want to leave from? U: Cambridge.</Paragraph> <Paragraph position="2"> S: Where do you want to go to? U: Seattle.</Paragraph> <Paragraph position="3"> S: When do you want to travel? U: Tomorrow.</Paragraph> <Paragraph position="4"> Then, in order to learn rules generalizing the optimal strategy, we use foidl. foidl is a program which learns first-order rules from examples (Mooney and Califf, 1995; Mitchell, 1997, ch. 10). foidl starts with rules without conditions and then adds further terms so that they cover the examples given but not others. In our case, rule conditions are about properties of states and rule actions are the best actions to take. Some advantages of foidl are that it can learn from a relatively small set of positive examples without the need for explicit negative examples and that it uses intensional background knowledge (Califf and Mooney, 1998). foidl has two main learning modes. When the examples are functional, i.e., for each state there is only one best action, foidl learns a set of ordered rules, from the more generic to the more specific. When applying the rules, only the first (most generic) rule whose precondition holds needs to be taken into account. When the examples are not functional, i.e., there is at least one state where two actions are equally good, foidl learns a bag of rules. All rules whose preconditions hold are applied. Ordered rules are usually easier to understand. In this paper, we use foidl in both modes: functional mode for the tutorial example and non-functional mode for the other example.</Paragraph>
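<Paragraph> As an illustration of how the two modes differ at application time, here is a small Python sketch; representing a rule as a (precondition, action) pair, with the precondition a Boolean function of the state, is our choice and not foidl's notation.
    def apply_ordered_rules(rules, state):
        # Functional mode: the rules form an ordered list and only the first
        # rule whose precondition holds is used.
        for precondition, action in rules:
            if precondition(state):
                return action
        return None

    def apply_rule_bag(rules, state):
        # Non-functional mode: all rules whose preconditions hold are applied,
        # so several actions may be recommended for the same state.
        return [action for precondition, action in rules if precondition(state)]
The rules of table 2 below can be read in the first way, those of section 3.2 in the second.
</Paragraph>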
<Paragraph position="5"> The rules learned by foidl from the optimal strategy are presented in table 2. Preconditions on states express the required and sufficient conditions for the action to be taken in a state. Uppercase letters represent variables (à la Prolog) which can unify with a, b or c.</Paragraph> <Paragraph position="6"> Rules were learned in functional mode. The more generic rules are at the bottom of the table and the more specific at the top. It can be quite clearly seen from these rules that the strategy is composed of two kinds of rules: ordering rules, which indicate in what order the variables should be obtained, and generic rules, typeset in italic, which express the strategy of obtaining one piece of information at a time. The ordering, which is a, then b, then c, is arbitrary; it was imposed by the reinforcement learning algorithm. The general strategy consists in preparing to ask for whatever piece of information the ordering rules have decided to recognize, and then asking and recognizing a piece of information as soon as we can. By expressing the strategy in the form of rules it becomes apparent how it operates. It would then be relatively easy for a dialogue engineer to implement a strategy that keeps the optimal one-at-a-time questioning strategy but does not necessarily impose the same order on the pieces of information.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Real world example </SectionTitle> <Paragraph position="0"> Although the tutorial example showed how rules could be obtained and interpreted, it does not say much about the practical use of our approach for real-world dialogues. In order to study this, we applied foidl to the optimal strategy presented in (Litman et al., 2000), which "presents a large-scale application of RL [reinforcement learning] to the problem of optimizing dialogue strategy selection [...]". This system is a more realistic application than our introductory example. It has been used by human users over the phone. The dialogue is about activities in New Jersey. A user states his/her preferences for a type of activity (museum visit, etc.) and availability, and the system retrieves potential activities. The system can vary its dialogue strategy by allowing or not allowing users to give extra information when answering a question. It can also decide to confirm or not to confirm a piece of information it has received.</Paragraph> <Paragraph position="1"> The state of the system represents what pieces of information have been obtained and some information on how the dialogue has evolved so far. This information is represented by variables indicating, in column order in table 3, whether the system has greeted the user (0=no, 1=yes), which piece of information the system wants (1, 2 or 3), the confidence in the user's last answer (0=low, 1=medium, 2=high, 3=accept, 4=deny), whether the system got a value for the piece of information (0=no, 1=yes), the number of times the system asked for that piece of information, whether the last question was open-ended (0=open-ended, 1=restrictive), and how well the dialogue was going (0=bad, 1=good).</Paragraph>
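<Paragraph> For readability, the state just described can be pictured as a simple record; the following sketch is ours (the field names are invented), with the value encodings taken from the text and from Litman et al. (2000).
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class DialogueState:
        greeted: int      # 0 = not greeted yet, 1 = greeted
        attribute: int    # which piece of information the system wants: 1, 2 or 3
        confidence: int   # 0 = low, 1 = medium, 2 = high, 3 = accept, 4 = deny
        value: int        # 0 = no value obtained, 1 = value obtained
        tries: int        # times the system asked for this piece of information
        grammar: int      # 0 = last question open-ended, 1 = restrictive
        history: int      # 0 = dialogue going badly, 1 = going well
</Paragraph>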
<Paragraph position="2"> See Litman et al. (2000) for a more detailed explanation of the state representation. The actions the system can take are: greeting users (greetu), asking questions to users (asku), re-asking questions to users with an open or restrictive question (reaskm/reasks), and asking for confirmation or not (expconf/noconf). The optimal strategy is composed of 42 state-action pairs. It can be reduced to 24 equivalent rules.</Paragraph> <Paragraph position="3"> We present the rules in table 3. Some of these rules are very specific to the state they apply to. The more generic ones, which are valid whatever the exact piece of information being asked, are typeset in italic. The number of states they generalize is indicated in brackets.</Paragraph> <Paragraph position="4"> These rules can be divided into four categories; a small code sketch of the generic rules follows the descriptions below. Asking: the first rule simply states that asking (asku) is the best thing to do if we have never asked for the value of a piece of information before.</Paragraph> <Paragraph position="5"> Re-asking: the second rule states that the system should re-ask for a value with a restricted grammar (reasks), i.e., a grammar that does not allow mixed initiative, if the previous attempt was made with an open-ended grammar and the user denied the value obtained. The third rule states that re-asking with an open-ended question (reaskm) is fine when the user denied the value obtained but the dialogue was going well until now.</Paragraph> <Paragraph position="6"> Confirming: the fourth and fifth rules state that the system should explicitly confirm (expconf) a value if the grammar used to get it was open-ended and the confidence in the value obtained is medium or even high.</Paragraph> <Paragraph position="7"> No confirmation (noconf) is needed when the confidence is high and the answer was obtained with a restricted grammar, even when the dialogue is going badly.</Paragraph> <Paragraph position="8"> Greeting: the last rule indicates that the system should greet the user if it has not done so already.</Paragraph>
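<Paragraph> The generic rules just described can be written down directly. The sketch below reuses the DialogueState record introduced above; it is reconstructed from the prose descriptions only, so the actual rules of table 3 may carry additional conditions, and the constant names are ours.
    LOW, MEDIUM, HIGH, ACCEPT, DENY = 0, 1, 2, 3, 4
    OPEN, RESTRICTIVE = 0, 1
    BAD, GOOD = 0, 1

    def generic_rules(s):
        # Returns every action recommended by an applicable generic rule.
        actions = []
        if s.greeted == 0:
            actions.append("greetu")   # greet the user if not done already
        if s.tries == 0:
            actions.append("asku")     # never asked for this value before
        if s.confidence == DENY and s.grammar == OPEN:
            actions.append("reasks")   # re-ask with a restricted grammar
        if s.confidence == DENY and s.history == GOOD:
            actions.append("reaskm")   # re-ask open-ended: dialogue going well
        if s.grammar == OPEN and s.confidence in (MEDIUM, HIGH):
            actions.append("expconf")  # confirm values obtained open-ended
        if s.grammar == RESTRICTIVE and s.confidence == HIGH:
            actions.append("noconf")   # high confidence, restricted grammar
        return actions
</Paragraph>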
<Paragraph position="9"> When preconditions hold for more than one rule, which can for example be the case for reasks and reaskm in some situations, all the actions allowed by the activated rules are possible.</Paragraph> <Paragraph position="10"> The generic rules are more explicit than the state-based decision table given by reinforcement learning. For example, the rules about asking and greeting are obvious, and it is reassuring that the approach suggests them. The effects of open-ended or closed questions on the re-asking and confirming policies also become much more apparent. Restricting the potential inputs is the best thing to do when re-asking, except if the dialogue was going well until that point; in that case the system can risk using an open-ended grammar. The rules on confirmation show the preference to confirm if the value was obtained via an open-ended grammar, and that no confirmation is required if the system has high confidence in a value asked via a closed grammar, even if the dialogue is going badly. Because the rules enable us to better understand what the optimal policy does, we may be able to re-use the strategy learned in this specific situation in other dialogue situations. It should be noted that the generic rules generalize only a part of the total strategy (18 states out of 42 in the example). Therefore a lot remains to be explained about the less generic rules. For example, the second piece of information does not require confirmation even if we got it with a low confidence value, provided the grammar was restrictive and the dialogue was going well. Under the same conditions the first piece of information would require a confirmation. The underlying reasons for these differences are not clear. Some of the decisions made by the reinforcement learning algorithm are also hard to explain, whether in the form of rules or not. For example, the optimal strategy states that the third piece of information does not require confirmation if we got it with low confidence and the dialogue was going badly. It is difficult to explain why this is the best action to take.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Learning optimal strategy using rules </SectionTitle> <Paragraph position="0"> In this section, we discuss the use of rules during learning. Since rules can generalize the optimal strategy, as we saw in the previous section, it is interesting to see whether they can also generalize strategies obtained during training. If the rules can generalize the up-to-now best strategy, we may be able to benefit from them to guide the search for the optimal strategy throughout the search space. In order to test this, we ran the same reinforcement learning algorithm to find the optimal policy in the same setting as the example system of section 3. We also ran the same algorithm but this time stopped it every 5000 iterations. An iteration corresponds to a transition between states in the dialogue. We then searched for rules summarizing the best policy found until then. We took the generic rules found, i.e., not the ones that are specific to a particular state, and used these to direct the search. That is to say, when a rule applied we chose to take the action it suggested rather than the action suggested by the state values (this is still subject to the 0.8 probability selection). The idea behind this was that, if the rules correctly generalize the best strategy, following them would guide us more quickly to the best policy than a blind exploration. It should be noted that the underlying representation is still state-based, i.e., we do not generalize the state evaluation function. Our method is therefore guaranteed to find the optimal policy even if the actions suggested by the rules are not the right ones.</Paragraph>
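<Paragraph> For concreteness, the rule-guided action selection just described could look like the following sketch; the Q-table, the helper names and the (precondition, action) rule representation are ours, while the 0.8/0.2 selection is the one used in section 3.1 and the value function itself stays tabular. The list of rules would be refreshed by re-running foidl on the best policy every 5000 iterations.
    import random

    def choose_action(state, actions, q_values, rules, explore=0.2):
        # Prefer an action suggested by an applicable generic rule; otherwise
        # fall back on the action with the best estimated value so far.
        suggested = [a for precondition, a in rules if precondition(state)]
        if suggested:
            preferred = random.choice(suggested)
        else:
            preferred = max(actions, key=lambda a: q_values.get((state, a), 0.0))
        # The preferred action is still taken only with probability 0.8; the
        # remaining 0.2 is spread uniformly over the other actions.
        others = [a for a in actions if a != preferred]
        if not others or random.random() < 1.0 - explore:
            return preferred
        return random.choice(others)
</Paragraph>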
<Paragraph position="1"> Table 4 summarizes the value of the best policy found after each step of 5000 iterations. A star (*) indicates that the optimal strategy has been consistently found. As can be seen from this table, using rules during learning improved the value of the best strategy found so far and reduced the number of iterations needed to find the optimal strategy for this particular example.</Paragraph> <Paragraph position="2"> The main effect of using rules seems to be the stabilization of the search on the optimal policy. The search without rules finds the optimal policy but then goes off track before coming back to it. This may not always be the case [1], since the best strategy found at first may not be optimal at all (for example, a rather good strategy at first is to end the dialogue immediately, since it avoids negative rewards), or the dialogue may not be regular enough for rules to be useful. In these cases using rules may well be detrimental. Nevertheless it is important to see that rules can help reduce, in this case by a factor of 2, the number of iterations needed to find the optimal strategy. [1] We do not claim any statistical evidence, since we ran only a limited set of experiments on the effects of rules and present just one here. Even if we ran enough experiments to get statistically significant results, they would be of little use as they would depend on a particular type of dialogue. Much more work needs to be done to evaluate the influence of rules on reinforcement learning and, if possible, in which conditions they are useful.</Paragraph> <Paragraph position="3"> Computationally, using rules may not be much different from not using them, since the benefits of fewer reinforcement learning cycles are counter-balanced by the inductive learning costs. However, requiring fewer training dialogues is still an important advantage of this method. This is especially true for systems that train online with real users rather than simulated ones. In this case, example dialogues are an expensive commodity and reducing the need for training dialogues is beneficial.</Paragraph> </Section> </Paper>