<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-1068">
  <Title>Flexible Mixed-Initiative Dialogue Management using Concept-Level Confidence Measures of Speech Recognizer Output</Title>
  <Section position="2" start_page="0" end_page="468" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> In a spoken dialogue system, it frequently occurs that the system incorrectly recognizes user utterances and that the user makes expressions the system has not expected. These problems are essentially inevitable in handling natural language by computers, even if the vocabulary and grammar of the system are tuned. This lack of robustness is one of the reasons why spoken dialogue systems have not been widely deployed.</Paragraph>
    <Paragraph position="1"> In order to realize a robust spoken dialogue system, it is indispensable to handle speech recognition errors. To suppress recognition errors, system-initiative dialogue is effective, but it can be adopted only in a simple task. For instance, the form-filling task can be realized by a simple strategy where the system asks the user the slot values in a fixed order. In such a system-initiated interaction, the recognizer easily narrows down the vocabulary of the next user utterance, thus recognition gets easier.</Paragraph>
    <Paragraph position="2"> On the other hand, in a more complicated task such as information retrieval, the vocabulary of the next utterance cannot be limited on all occasions, because the user should be able to input the values in various orders based on his preference. Therefore, without imposing a rigid template upon the user, the system must behave appropriately even when the speech recognizer output contains some errors.</Paragraph>
    <Paragraph position="3"> Obviously, making confirmation is effective to avoid misunderstandings caused by speech recognition errors. However, when confirmations are made for every utterance, the dialogue becomes too redundant and consequently troublesome for users. Previous works have shown that the confirmation strategy should be decided according to the frequency of speech recognition errors, using a mathematical formula (Niimi and Kobayashi, 1996) and using computer-to-computer simulation (Watanabe et al., 1998). These works assume fixed performance (averaged speech recognition accuracy) over the whole dialogue with any speaker. For flexible dialogue management, however, the confirmation strategy must be dynamically changed based on the individual utterances. For instance, we humans make confirmation only when we are not confident. Similarly, confidence measures (CMs) of every speech recognition output should be modeled as a criterion to control dialogue management.</Paragraph>
    <Paragraph position="4"> CMs have been calculated in previous works using transcripts and various knowledge sources (Litman et al., 1999; Pao et al., 1998). For more flexible interaction, it is desirable that CMs be defined on each word rather than on the whole sentence, because the system can then handle only the unreliable portions of an utterance instead of accepting or rejecting the whole sentence.</Paragraph>
    <Paragraph position="5">  In this paper, we propose two concept-level CMs, defined at the content-word level and at the semantic-attribute level, for every content word.</Paragraph>
    <Paragraph position="6"> Because the CMs are defined using only the speech recognizer output, they can be computed in real time. The system can make efficient confirmation and effective guidance according to the CMs. Even when successful interpretation is not obtained on the content-word level, the system generates system-initiative guidance based on the semantic-attribute level, which leads the next user utterance to successful interpretation.</Paragraph>
    <Paragraph position="7"> 2 Definition of Confidence Measures (CMs) Confidence measures (CMs) have been studied for utterance verification, which verifies the speech recognition result as a post-processing step (Kawahara et al., 1998). Since automatic speech recognition is a process of finding the sentence hypothesis with the maximum likelihood for an input speech, some measure is needed in order to distinguish a correct recognition result from an incorrect one. In this section, we describe the definition of two levels of CMs, on content words and on semantic attributes, using the 10-best output of the speech recognizer and parsing with phrase-level grammars.</Paragraph>
    <Section position="1" start_page="467" end_page="467" type="sub_section">
      <SectionTitle>
2.1 Definition of CM for Content Word
</SectionTitle>
      <Paragraph position="0"> In the speech recognition process, both the acoustic probability and the linguistic probability of words are multiplied (summed up in log scale) over a sentence, and the sequence having maximum likelihood is obtained by a search algorithm. The score of a sentence derived from the speech recognizer is the log-scaled likelihood of a hypothesis sequence. We use the grammar-based speech recognizer Julian (Lee et al., 1999), which was developed in our laboratory. It correctly obtains the N-best candidates and their scores by using an A* search algorithm.</Paragraph>
      <Paragraph position="1"> Using the scores of these N-best candidates, we calculate content-word CMs as below. The content words are extracted by parsing with the phrase-level grammars that are used in the speech recognition process. In this paper, we set N = 10 after examining various values of N as the number of computed candidates (even if we set N larger than 10, the scores of the i-th hypotheses (i &gt; 10) are too small to affect the resulting CMs). First, each i-th score is multiplied by a factor a (a &lt; 1). This factor smoothes the differences among the N-best scores to get adequately distributed CMs. Because the distribution of the absolute values differs among kinds of statistical acoustic models (monophone, triphone, and so on), different values must be used. The value of a is examined in a preliminary experiment.</Paragraph>
      <Paragraph position="2"> In this paper, we set a = 0.05 when using a triphone model as the acoustic model. Next, the scores are transformed from log-scaled values (a &#183; scaled_i) to the probability dimension by taking their exponential, and an a posteriori probability is calculated for each i-th candidate (Bouwman et al., 1999).</Paragraph>
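      <Paragraph position="3"> The equation originally at this position was lost in extraction; based on the surrounding text (log-scaled scores s_i smoothed by the factor a, exponentiated, and normalized over the N-best list), it presumably takes the standard posterior-normalization form:

```latex
p_i = \frac{\exp(a \cdot s_i)}{\sum_{j=1}^{N} \exp(a \cdot s_j)}
```
</Paragraph>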
      <Paragraph position="4"> This p_i represents the a posteriori probability of the i-th sentence hypothesis.</Paragraph>
      <Paragraph position="5"> Then, we compute the a posteriori probability for a word. If the i-th sentence contains a word w, let 5w,i = 1, and 0 otherwise. The a posteriori probability that a word w is contained (Pw) is derived as the summation of the a posteriori probabilities of the sentences that contain the word.</Paragraph>
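      <Paragraph position="6"> The equation originally at this position was lost in extraction; from the definition just given (summing the sentence posteriors p_i over hypotheses containing w, using the indicator delta_{w,i}), it presumably reads:

```latex
P_w = \sum_{i=1}^{N} \delta_{w,i} \, p_i
```
</Paragraph>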
      <Paragraph position="7"> We define this Pw as the content-word CM (CMw). This CMw is calculated for every content word. Intuitively, words that appear many times in the N-best hypotheses get high CMs, and ones frequently substituted across the N-best hypotheses are judged as unreliable.</Paragraph>
      <Paragraph position="8"> In Figure 1, we show an example of CMw calculation with recognizer outputs (i-th recognized candidates and their a posteriori probabilities) for the utterance &quot;Futaishisetsu ni resutoran no aru yado (Tell me hotels with a restaurant facility.)&quot;. It can be observed that the correct content word 'restaurant (facility)' gets a high CM value (CMw = 1). The others, which are incorrectly recognized, get low CMs and shall be rejected.</Paragraph>
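      <Paragraph position="9"> The computation described above can be sketched in a few lines of code. This is a minimal illustration, not the authors' implementation: the function names, the example word lists, and the log scores are hypothetical, and a = 0.05 is the triphone setting reported in the text.

```python
import math

def posterior_probs(log_scores, alpha=0.05):
    """Turn N-best log-scaled scores into a posteriori probabilities p_i.

    Each score is multiplied by the smoothing factor alpha, exponentiated,
    and normalized over the N-best list. The max is subtracted before
    exponentiation purely for numerical stability; it cancels out.
    """
    scaled = [alpha * s for s in log_scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def content_word_cm(nbest_words, log_scores, alpha=0.05):
    """CM_w: for each content word, sum the posteriors of the hypotheses
    that contain it (delta_{w,i} = 1)."""
    probs = posterior_probs(log_scores, alpha)
    cm = {}
    for words, p in zip(nbest_words, probs):
        for w in set(words):  # count each word at most once per hypothesis
            cm[w] = cm.get(w, 0.0) + p
    return cm
```

A word appearing in every hypothesis thus gets CM_w = 1, while words that are substituted across the N-best list split the probability mass, mirroring the behavior observed in Figure 1.
</Paragraph>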
    </Section>
    <Section position="2" start_page="467" end_page="468" type="sub_section">
      <SectionTitle>
2.2 CM for Semantic Attribute
</SectionTitle>
      <Paragraph position="0"> A concept category is a semantic attribute assigned to content words, and it is identified by parsing with the phrase-level grammars that are used in the speech recognition process and represented with Finite State Automata (FSA). Since</Paragraph>
    </Section>
    <Section position="3" start_page="468" end_page="468" type="sub_section">
      <SectionTitle>
Recognition candidates
</SectionTitle>
      <Paragraph position="0"> [Figure 1: recognition candidates and their interpretations, recovered from garbled extraction:
aa shisetsu ni resutoran no kayacho / with restaurant facility / Kayacho (location);
aa shisetsu ni resutoran no katsura no / with restaurant facility / Katsura (location);
aa shisetsu ni resutoran no kamigamo / with restaurant facility / Kamigamo (location);
aa shisetsu ni resutoran no kafe / with restaurant facility / cafe (facility);
&lt;g&gt; shisetsu ni resutoran no kafe / with restaurant facility / cafe (facility);
&lt;g&gt; setsubi wo resutoran no kayacho / with restaurant facility / Kayacho (location);
&lt;g&gt; setsubi wo resutoran no katsura no / with restaurant facility / Katsura (location)]
these FSAs are classified into concept categories beforehand, we can automatically derive the concept categories of words by parsing with these grammars. In our hotel query task, there are seven concept categories such as 'location', 'facility', and so on.</Paragraph>
      <Paragraph position="1"> For this concept category, we also define semantic-attribute CMs (CMc) as follows. First, we calculate the a posteriori probabilities of the N-best sentences in the same way as in computing the content-word CM. If a concept category c is contained in the i-th sentence, let 5c,i = 1, and 0 otherwise. The probability that a concept category c is contained (Pc) is derived likewise.</Paragraph>
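      <Paragraph position="2"> The equation originally at this position was lost in extraction; by analogy with the content-word case (summing the sentence posteriors p_i over hypotheses whose parse contains category c, using the indicator delta_{c,i}), it presumably reads:

```latex
P_c = \sum_{i=1}^{N} \delta_{c,i} \, p_i
```
</Paragraph>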
      <Paragraph position="3"> We define this Pc as the semantic-attribute CM (CMc). This CMc estimates which category the user refers to and is used to generate effective guidance.</Paragraph>
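      <Paragraph position="4"> The semantic-attribute CM aggregates the same sentence posteriors, only grouped by concept category instead of by content word. The sketch below is illustrative only: the function name, example categories, and log scores are hypothetical, and a = 0.05 follows the triphone setting in Section 2.1.

```python
import math

def semantic_attribute_cm(nbest_categories, log_scores, alpha=0.05):
    """CM_c: sum the a posteriori probabilities of the N-best hypotheses
    whose parse contains concept category c (delta_{c,i} = 1).

    nbest_categories: per-hypothesis lists of concept categories, as
    obtained by parsing with the phrase-level FSA grammars.
    """
    scaled = [alpha * s for s in log_scores]
    m = max(scaled)  # subtract max before exp for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    cm = {}
    for cats, p in zip(nbest_categories, probs):
        for c in set(cats):  # count each category once per hypothesis
            cm[c] = cm.get(c, 0.0) + p
    return cm
```

Even when no content word is reliable, a high CM_c for, say, 'facility' tells the system which attribute the user is talking about, which is what drives the guidance generation described above.
</Paragraph>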
    </Section>
  </Section>
</Paper>