<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-1037">
<Title>Using Reinforcement Learning to Build a Better Model of Dialogue State</Title>
<Section position="3" start_page="0" end_page="289" type="intro">
<SectionTitle> 2 Background </SectionTitle>
<Paragraph position="0"> We follow past lines of research (such as Singh et al. (1999)) in describing a dialogue D as a trajectory within a Markov Decision Process (Sutton and Barto, 1998). An MDP has four main components: states, actions, a policy, which specifies the best action to take in each state, and a reward function, which specifies the utility of each state and of the process as a whole. Dialogue management is easily described using an MDP because one can consider the actions as actions made by the system, the state as the dialogue context, and a reward which, for many dialogue systems, tends to be task completion success or dialogue length.</Paragraph>
<Paragraph position="1"> Typically the state is viewed as a vector of features such as dialogue history, speech recognition confidence, etc.</Paragraph>
<Paragraph position="2"> The goal of using MDPs is to determine the best policy \pi for a given state and action space. That is, we wish to find the combination of states and actions that maximizes the reward at the end of the dialogue. In most dialogues the exact reward for each state is not known immediately; in fact, usually only the final reward is known, at the end of the dialogue. As long as we have a reward function, Reinforcement Learning allows one to compute the best policy automatically. The following recursive equation gives us a way of calculating the expected cumulative value (V-value) of a state:
$$V(s) = \sum_{s'} P(s' \mid s, \pi(s)) \left[ R(s, \pi(s), s') + \gamma V(s') \right]$$
where \pi(s) is the best action for state s at this time, P(s' \mid s, \pi(s)) is the probability of getting from state s to state s' with that action, and R(s, \pi(s), s') is the reward for that traversal, to which the value of the new state is added after being multiplied by a discount factor \gamma, which ranges between 0 and 1 and discounts the value of past states. The policy iteration algorithm (Sutton and Barto, 1998) iteratively updates the value V(s) of each state based on the values of its neighboring states. The iteration stops when each update yields a difference of less than some epsilon (implying that V(s) has converged), and we then select, for each state, the action that produces the highest V-value.</Paragraph>
<Paragraph position="7"> Normally one would want a dialogue system to interact with users thousands of times in order to explore the entire traversal space of the MDP; in practice, however, that is very time-consuming. Instead, the next best tactic is to train the MDP (that is, calculate the transition probabilities for getting from one state to another, and the reward for each state) on already-collected data. Of course, the whole space will not be considered, but if one reduces the size of the state vector effectively, data size becomes less of an issue (Singh et al., 2002).</Paragraph>
</Section>
</Paper>
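As an illustration of the computation described above, the following is a minimal value-iteration-style sketch in Python: the V-value equation is applied repeatedly until every update changes V(s) by less than epsilon, and the policy then selects the highest-valued action in each state. The dialogue states, actions, transition probabilities, and rewards are hypothetical placeholders, not values from the paper; in practice P and R would be estimated from already-collected dialogue data, as the training procedure above describes.

```python
# Minimal sketch of the V-value computation described in Section 2.
# All states, actions, probabilities, and rewards below are illustrative.

GAMMA = 0.9      # discount factor, between 0 and 1
EPSILON = 1e-6   # convergence threshold for the value updates

# P[(s, a)] is a list of (next_state, probability, reward) triples,
# assumed to have been estimated from already-collected dialogues.
P = {
    ("open", "ask_question"):  [("answered", 0.7, 0.0), ("confused", 0.3, -1.0)],
    ("open", "give_hint"):     [("answered", 0.5, 0.0), ("confused", 0.5, -1.0)],
    ("answered", "close"):     [("done", 1.0, 10.0)],
    ("confused", "give_hint"): [("answered", 0.6, 0.0), ("confused", 0.4, -1.0)],
    ("confused", "close"):     [("done", 1.0, 0.0)],
}
states = {"open", "answered", "confused", "done"}


def available_actions(s):
    """Actions with estimated transitions out of state s."""
    return [a for (s2, a) in P if s2 == s]


def action_value(s, a, V):
    """Expected reward for the traversal plus the discounted value of the new state."""
    return sum(p * (r + GAMMA * V[s_next]) for s_next, p, r in P[(s, a)])


# Iterate the recursive V-value equation until every update differs by less
# than EPSILON (i.e. V(s) has converged).
V = {s: 0.0 for s in states}
while True:
    delta = 0.0
    for s in states:
        if not available_actions(s):   # terminal state: keep V(s) = 0
            continue
        best = max(action_value(s, a, V) for a in available_actions(s))
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < EPSILON:
        break

# The policy selects, for each state, the action with the highest V-value.
policy = {
    s: max(available_actions(s), key=lambda a: action_value(s, a, V))
    for s in states if available_actions(s)
}
print(V)
print(policy)
```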