<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3009"> <Title>Using Higher-level Linguistic Knowledge for Speech Recognition Error Correction in a Spoken Q/A Dialog</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Noisy Channel Error Correction Model </SectionTitle> <Paragraph position="0"> The noisy channel error correction framework has been applied to a wide range of problems, such as spelling correction, statistical machine translation, and ASR error correction (Brill and Moore, 2000; Brown et al., 1990; Ringger and Allen, 1996). The key idea of noisy channel model is that we can model some channel properties through estimating the posterior probabilities.</Paragraph> <Paragraph position="1"> The problem of ASR error correction can be stated in this model as follows: For an input sentence, O = o1;o2;:::;on produced as the output sequence of ASR, find the best word sequence, ^W = w1;w2;:::;wn, that maximizes the posterior probability P(WjO). Then, applying Bayes' rule and dropping the constant denominator, we can rewrite as:</Paragraph> <Paragraph position="3"> Now, we have a noisy channel model for ASR error correction, with two components, the source model P(W) and the channel model P(OjW). The probability P(W) is given by the language model and can be decomposed as:</Paragraph> <Paragraph position="5"> The distribution P(W) can be defined using n-grams, structured language model (Chelba, 1997), or any other tool in the statistical language modeling.</Paragraph> <Paragraph position="6"> Next, the conditional probability, P(OjW) reflects the channel characteristics of the ASR environment. If we assume that the output word sequence produced under ASR are independent of one another, we have the following formula:</Paragraph> <Paragraph position="8"> However, this simple one-to-one model is not suitable to handling split or merged errors, which frequently appear in an ASR output, because we assume that the output word sequence are independent of one another. For example, 1figure 2 shows a split or a merged error problem. To solve this problem, Ringger and Allen used the fertility of pre-channel word (Ringger and Allen, 1996).</Paragraph> <Paragraph position="9"> Following (Brown et al., 1990), we refer to the number of post-channel words oi produced by a pre-channel word wi as a fertility. They simplified the fertility model of IBM statistical MT model-4, and permitted the fertility within 2 windows such as P(oi!1;oijwi) for twoto-one channel probability, and P(oijwi;wi+1) for oneto-two channel probability. So, the fertility model can deal with (TO LEAVE, TOLEDO) substitution. But this improved fertility model only slightly increased the accuracy in experiments (Ringger and Allen, 1996), and we think the major reason is due to the data-sparseness problem. Because substitution probability is based on the whole word-level, this fertility model requires enormous training data. We call the model a word-based channel model, because this model is based on the word-to-word transformation. The word-based model focused on inter-word substitutions, so it requires enough results of ASR and transcription pairs. 
<Paragraph position="10"> Considering the cost of building a sufficient number of such correction pairs, we need a unit smaller than the word to overcome data sparseness.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Syllable-based Channel Model </SectionTitle> <Paragraph position="0"> We suggest an improved channel model that needs less training data. If we use a unit smaller than the word, such as the letter, phoneme, or syllable, a relatively smaller training set suffices. To handle intra-word transformations, we suggest a syllable-based channel model, which deals with syllable-to-syllable transformations. This model is especially reasonable for Korean: in agglutinative languages such as Korean, the syllable is a basic unit of the written form, much like a Chinese character, and a Korean word contains about three or four syllables on average.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 The Model </SectionTitle> <Paragraph position="0"> Suppose S = s_1, s_2, ..., s_n is the syllable sequence of the ASR output and W = w_1, w_2, ..., w_m is a source word sequence; our purpose is then to find the best word sequence ^W as follows:</Paragraph> <Paragraph position="1"> ^W = argmax_W P(W|S) </Paragraph> <Paragraph position="2"> We can apply the same Bayes' rule and decompose the syllable-to-word channel model into a syllable-to-syllable channel model:</Paragraph> <Paragraph position="3"> P(S|W) = P(S|X) P(X|W) </Paragraph> <Paragraph position="4"> So, the final formula can be written as:</Paragraph> <Paragraph position="5"> ^W = argmax_W P(W) P(X|W) P(S|X) </Paragraph> <Paragraph position="6"> Here, P(S|X) is the probability of a syllable-to-syllable transformation, where X = x_1, x_2, ..., x_n is a source syllable sequence, and P(X|W) is a word model, which converts the syllable lattice into a word lattice. The conversion can be done efficiently by dictionary look-up.</Paragraph> <Paragraph position="7"> This model is similar to a standard hidden Markov model (HMM) for continuous speech recognition: in a speech recognition system, P(S|X) corresponds to an acoustic model at the signal-to-phoneme level, and P(X|W) to a pronunciation dictionary. We then applied fertility to our syllable-to-syllable channel model, setting a maximum syllable fertility of 2, which was determined experimentally.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Training the Model </SectionTitle> <Paragraph position="0"> To train the model, we need training data consisting of {X, S} pairs, i.e., manually transcribed strings and the corresponding ASR outputs. We align each pair by minimizing the edit distance between x_i and s_i with dynamic programming. Figure 3 shows an alignment for the syllable model (for ease of understanding, we use an English example and a letter-to-letter alignment; in Korean, each syllable is as clearly distinguished as a letter is in English; detailed character-level match lines are omitted from the figure to simplify it, and whole-word matches are depicted in bold lines). For example, the (TO LEAVE, TOLEDO) pair from the previous section is divided into (TO, TO), (L, L), (EA, E), and (VE, DO) with fertility 2.</Paragraph> <Paragraph position="1"> We can then calculate the probability of each substitution P(s_i|x_i) by Maximum-Likelihood Estimation (MLE). Let C(x_i) be the frequency of a source syllable, and C(x_i, s_i) the frequency of events in which x_i is realized as s_i. Then:</Paragraph> <Paragraph position="2"> P(s_i|x_i) = C(x_i, s_i) / C(x_i) </Paragraph>
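The alignment-and-count training step can be sketched as follows. This is a minimal illustration, assuming single characters stand in for Korean syllables; the alignment is standard edit-distance dynamic programming, and the resulting counts feed the MLE estimate P(s_i|x_i) = C(x_i, s_i) / C(x_i). The training pair shown is a toy example, not real data.

```python
from collections import defaultdict

def align(src, out):
    """Align source units (src) with ASR output units (out) by minimum
    edit distance, returning (x_i, s_i) pairs; '-' marks ins/del."""
    n, m = len(src), len(out)
    d = [[0] * (m + 1) for _ in range(n + 1)]  # d[i][j]: distance of src[:i], out[:j]
    for i in range(n + 1): d[i][0] = i
    for j in range(m + 1): d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if src[i-1] == out[j-1] else 1
            d[i][j] = min(d[i-1][j] + 1, d[i][j-1] + 1, d[i-1][j-1] + cost)
    # Backtrace to recover the aligned pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i-1][j-1] + (0 if src[i-1] == out[j-1] else 1):
            pairs.append((src[i-1], out[j-1])); i -= 1; j -= 1
        elif i > 0 and d[i][j] == d[i-1][j] + 1:
            pairs.append((src[i-1], "-")); i -= 1   # deletion
        else:
            pairs.append(("-", out[j-1])); j -= 1   # insertion
    return list(reversed(pairs))

# MLE channel probabilities from aligned {X, S} pairs.
count_x = defaultdict(int)    # C(x_i)
count_xs = defaultdict(int)   # C(x_i, s_i)
for src, out in [("TOLEAVE", "TOLEDO")]:   # toy training pair
    for x, s in align(src, out):
        count_x[x] += 1
        count_xs[(x, s)] += 1

def p_mle(s, x):
    return count_xs[(x, s)] / count_x[x] if count_x[x] else 0.0

print(align("TOLEAVE", "TOLEDO"))
print(p_mle("O", "O"))
```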
<Paragraph position="3"> The total number of theoretically possible syllables in Korean is about ten thousand, but only about 2,300 distinct syllables appear at least once in a corpus of about 3 billion syllables. Thus, we used the Witten-Bell method to smooth unseen substitutions (Witten and Bell, 1991). Let T(x_i) be the number of substitution types observed for x_i, and N the number of syllables in the training data. For Witten-Bell discounting, we also define Z(x_i), the number of syllables s_i with count C(x_i, s_i) = 0. Then we can write:</Paragraph> <Paragraph position="4"> P(s_i|x_i) = C(x_i, s_i) / (N + T(x_i)) if C(x_i, s_i) > 0; P(s_i|x_i) = T(x_i) / (Z(x_i) (N + T(x_i))) if C(x_i, s_i) = 0 </Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Decoding the Model </SectionTitle> <Paragraph position="0"> Given a syllable sequence S, we want to find argmax_W P(W) P(X|W) P(S|X). The decoder returns an N-best list of candidates according to the channel models and then rescores these candidates by taking the language model probabilities into account. To rescore the candidates, we used the Viterbi search algorithm to find the best sequence. For candidate generation, we store the syllable channel probabilities P(s_i|x_i) in a hash table for fast lookup. The system generates a candidate word-sequence network using the syllable channel model and a lexicon, and then finds the optimal sequence, the one with the best probability, through Viterbi decoding that incorporates a language model.</Paragraph> </Section>
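The rescoring step can be illustrated with a small Viterbi sketch over a candidate lattice. This is a simplified sketch, not the paper's implementation: it assumes the candidate words per position have already been generated from the syllable channel model, and it takes the language model and channel scores as callables with hypothetical toy values.

```python
from math import log

def viterbi(candidates, lm_logp, ch_logp):
    """Find the best word sequence through a candidate lattice.
    candidates: list over positions, each a list of candidate words.
    lm_logp(prev, w): log P(w | prev)  (bigram language model)
    ch_logp(pos, w):  log P(observed syllables at pos | w)  (channel model)"""
    # best[w] = (log score, word history) for partial paths ending in w
    best = {"<s>": (0.0, [])}
    for pos, words in enumerate(candidates):
        new_best = {}
        for w in words:
            score, hist = max(
                ((s + lm_logp(prev, w) + ch_logp(pos, w), h)
                 for prev, (s, h) in best.items()),
                key=lambda t: t[0])
            new_best[w] = (score, hist + [w])
        best = new_best
    _, (score, hist) = max(best.items(), key=lambda kv: kv[1][0])
    return hist, score

# Toy usage with hypothetical scores:
cands = [["TO"], ["LEAVE", "LEE"]]
lm = lambda p, w: log(0.5)
ch = lambda i, w: log(0.9 if w != "LEE" else 0.1)
print(viterbi(cands, lm, ch))  # -> (['TO', 'LEAVE'], ...)
```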
<Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4 Using Syntactic and Semantic Knowledge </SectionTitle> <Paragraph position="0"> In related areas such as spelling error correction and optical character recognition (OCR) error correction, NLP researchers have traditionally identified five levels of errors in a text: (1) the lexical level, (2) the syntactic level, (3) the semantic level, (4) the discourse structure level, and (5) the pragmatic level (Kukich, 1992). In spelling correction and OCR error correction, correction schemes have mainly focused on non-word errors at the lexical level, i.e., the isolated-word correction problem. Speech recognition errors, however, tend to be continuous word errors that are better classified as syntactic- and semantic-level errors, because the recognizer only produces word sequences that exist in its lexicon. This section therefore presents a more syntax- and semantics-oriented approach to correcting the erroneous output of a speech recognizer, using domain knowledge that provides syntactic and semantic information. We focus on continuous word error detection and correction using syntactic and semantic knowledge, and pipeline this high-level error correction method with the syllable-based channel model.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Lexico-Semantic Pattern </SectionTitle> <Paragraph position="0"> A lexico-semantic pattern (LSP) is a structure in which linguistic entries and semantic types are combined to abstract certain sequences of words in a text.</Paragraph> <Paragraph position="1"> It has been used in natural language interfaces to databases (NLIDB) (Jung et al., 2003) and in a TREC QA system for matching user queries with appropriate answer types at the syntactic/semantic level (Kim et al., 2001; Lee et al., 2001). In an LSP, linguistic entries consist of words, phrases, and part-of-speech (POS) tags, such as 'YMCA,' 'Young Men's Christian Association,' and 'NNP' (the Penn TreeBank POS tag for a proper noun (Marcus et al., 1994)). Semantic types consist of common semantic classes and domain-specific (or user-defined) semantic classes. The common semantic tags include attribute values in databases, such as '@corp' for a company name like 'IBM,' and 83 predefined semantic category values, such as '@location' for location names like 'New York' (Jung et al., 2003). Figure 4 shows an example of the predefined common semantic category values used in the ontology dictionary.</Paragraph> <Paragraph position="2"> In a domain-specific application, well-defined semantic concepts are required, and the domain-specific semantic classes represent these requirements. They include special attribute names in databases, such as '%action' for 'active' and 'inactive,' and semantic category names, such as '%hobby' for 'reading' and 'recreation,' for which the user wants a specific meaning in the application domain. Moreover, we used these classes to abstract several synonyms into a single concept. For example, the domain-specific semantic class '%question' represents words such as 'question,' 'query,' 'asking,' and 'answer.' The domain dictionary is a subset of the general semantic category dictionary and covers only the narrow extent of the knowledge it concerns, since it is impossible to cover all the knowledge of the world when implementing an application. The ontology dictionary of common semantic classes, on the other hand, reflects general knowledge of the world and hence plays a supplementary role in extracting semantic information. The domain dictionary provides the specific vocabulary used in the semantic representation of user queries and in the template database.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Construction of the Domain Knowledge </SectionTitle> <Paragraph position="0"> For semantic-oriented error correction, we constructed domain knowledge consisting of a domain dictionary, an ontology dictionary, and template queries similar to the question types in a QA system (Lee et al., 2001). Query sentences are semantically abstracted by LSPs and automatically collected into the template database.</Paragraph> <Paragraph position="1"> Because Fujii et al. (2002B) have shown the importance of a language model that describes the domain knowledge well, we encode the domain information in a template database: a database of template queries over the source statements, which is used for the actual error detection and correction task after speech recognition. The template queries are automatically acquired by Query-to-LSP translation from the source statements using the two semantic category dictionaries: the domain dictionary and the ontology dictionary. Assuming that the speech statements for a specific target domain are predefined, a record of the template database is composed of a fixed number of LSP elements, such as POS tags, semantic tags, and domain-specific semantic classes.</Paragraph>
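To give a rough feel for LSP abstraction, the sketch below maps the words of a query to LSP elements through toy dictionaries. The '@' and '%' tag prefixes follow the paper's notation, but the dictionary entries, lookup priority, and POS fallback are all hypothetical simplifications; the actual pipeline (morphological analysis, NE recognition, NE tagging) is described after Table 1 below.

```python
# Toy dictionaries (hypothetical entries). Per the paper's notation, '@' marks
# common semantic classes and '%' marks domain-specific semantic classes.
domain_dict = {"question": "%question", "query": "%question", "asking": "%question"}
ontology_dict = {"seoul": "@location", "ibm": "@corp"}
pos_fallback = {"what": "WP", "is": "VBZ", "the": "DT"}

def query_to_lsp(words):
    """Abstract a tokenized query into an LSP element sequence.
    Hypothetical priority: domain tag, then ontology tag, then POS tag."""
    lsp = []
    for w in words:
        key = w.lower()
        lsp.append(domain_dict.get(key)
                   or ontology_dict.get(key)
                   or pos_fallback.get(key, "NNP"))  # default: proper noun
    return lsp

print(query_to_lsp("What is the question".split()))
# -> ['WP', 'VBZ', 'DT', '%question']
```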
<Paragraph position="2"> Table 1 shows an example of a template abstracted by LSP conversion in the predefined domain of "on-line education."</Paragraph> <Paragraph position="3"> Query-to-LSP translation transforms a given query into the corresponding LSP, and LSPs enhance the coverage of extraction by abstracting information through a many-to-one mapping between queries and an LSP. The words of a query sentence are converted into the LSP in several steps. First, morphological analysis segments the sentence into morphemes and attaches POS tags to them (Lee et al., 2002). NE recognition then discovers all the possible semantic types for each word by consulting the domain dictionary and the ontology dictionary. Finally, NE tagging selects one semantic type for each word, so that the sentence can be mapped onto a suitable LSP sequence by searching the types in the semantic dictionaries (An et al., 2003).</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Semantic-oriented Error Correction Process </SectionTitle> <Paragraph position="0"> We now show the working mechanism of post-error correction of a speech recognition result using the domain knowledge, i.e., the template database and the domain-specific dictionary. Figure 5 is a schematic diagram of the post-error correction process.</Paragraph> <Paragraph position="1"> The overall process is divided into two stages: a syntactic/semantic recovery stage and a lexical recovery stage. In the semantic error detection stage, a recognized query is converted into the corresponding LSP. The converted LSP may be ill-formed, depending on the errors in the recognized query. Semantic error correction is performed by replacing these syntactic and/or semantic errors using a semantic confusion table. We use the pre-collected template database to recover semantic-level errors, and the search for the most similar templates is based on a minimum-edit-distance dynamic programming search, which has been used for similarity search in many areas such as spelling correction, OCR post-correction, and DNA sequence analysis (Wagner and Fischer, 1974). The semantic confusion table supplies the matching cost, i.e., the semantic similarity, to the dynamic programming search. The minimum edit distance between two words is originally defined as the minimum number of deletions, insertions, and substitutions required to transform one word into the other. We compute the minimum edit distances between the erroneous LSPs and the template LSPs in the template database using similarity cost functions at the semantic level, and select as the final template query the one with the minimum distance. At this stage, the replaced LSP elements provide the subsequent lexical recovery stage with clues to the recognition errors and to the original query's meaning. Moreover, candidate error boundaries can also be detected by this procedure.</Paragraph> <Paragraph position="2"> After this procedure, lexical recovery is performed in the next stage. The recovered semantic tags and the erroneous query produced by the ASR are the clues for lexical recovery. The erroneous query and the recovered template query are aligned by dynamic programming once more, after which lexical candidates are generated by our improved syllable-based channel model.</Paragraph>
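The template search described above amounts to a weighted minimum edit distance between LSP sequences, with substitution costs drawn from the semantic confusion table. A minimal sketch, assuming unit insertion/deletion costs and a hand-made confusion table (both hypothetical):

```python
def lsp_distance(query_lsp, template_lsp, confusion, ins_del_cost=1.0):
    """Weighted minimum edit distance between two LSP sequences.
    confusion[(a, b)] gives the substitution cost between tags a and b
    (0 for identical tags, low for semantically close ones)."""
    n, m = len(query_lsp), len(template_lsp)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1): d[i][0] = i * ins_del_cost
    for j in range(1, m + 1): d[0][j] = j * ins_del_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a, b = query_lsp[i-1], template_lsp[j-1]
            sub = 0.0 if a == b else confusion.get((a, b), 1.0)
            d[i][j] = min(d[i-1][j] + ins_del_cost,
                          d[i][j-1] + ins_del_cost,
                          d[i-1][j-1] + sub)
    return d[n][m]

def best_template(query_lsp, templates, confusion):
    return min(templates, key=lambda t: lsp_distance(query_lsp, t, confusion))

# Toy example: '@corp' and '@location' are cheap to confuse here (hypothetical costs).
conf = {("@corp", "@location"): 0.3, ("@location", "@corp"): 0.3}
templates = [["WP", "VBZ", "@location"], ["WP", "VBZ", "%question"]]
print(best_template(["WP", "VBZ", "@corp"], templates, conf))
# -> ['WP', 'VBZ', '@location']
```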
<Paragraph position="3"> Figure 6 shows an example of the semantic error correction process, using the same data as TRAIN-95 (Allen et al., 1996).</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Experiments </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Experimental Setup </SectionTitle> <Paragraph position="0"> We performed several experiments in the domain of in-vehicle telematics IR, related to navigation question answering services. The speech transcripts used in the experiments consisted of 462 queries, collected from one male speaker in a real application. We used two Korean speech recognizers: one made by LG-Elite (LG Electronics Institute of Technology) and a commercial Korean recognizer, ByVoice (see http://www.voicetech.co.kr). For our semantic-oriented error correction, we constructed domain knowledge for the target domain: a domain dictionary of 3,195 entries, an ontology dictionary of 13,154 entries, and 436 semantic templates generated automatically using the two dictionaries.</Paragraph> <Paragraph position="1"> We implemented both the word-based and the syllable-based model for comparison, and combined the syllable-based lexical correction system with the LSP-based semantic error correction. For the experiments, we used a trigram language model generated by the SRILM toolkit (Stolcke, 2002) and a channel-model training program of our own. We divided the 462 queries into 6 different sets and evaluated each model by 6-fold cross-validation.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Results </SectionTitle> <Paragraph position="0"> To measure error correction performance, we use the word error rate (WER) and the term error rate (TER):</Paragraph> <Paragraph position="1"> WER = (S_w + D_w + I_w) / |W_truth|, TER = (S_t + D_t + I_t) / |T_truth| </Paragraph> <Paragraph position="2"> Here S, D, and I are the numbers of substituted, deleted, and inserted words (for WER) or terms (for TER); |W_truth| is the number of words in the reference transcript, and |T_truth| is the number of query terms (keywords) among them. TER is thus an error rate over the content words that directly affect the performance of IR and QA systems (Fujii et al., 2002A).</Paragraph>
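Both metrics can be computed from the same edit-distance alignment, differing only in which tokens are counted. A small sketch, assuming a stopword test as a hypothetical stand-in for the paper's definition of query terms:

```python
def edit_ops(ref, hyp):
    """Count substitutions, deletions, and insertions needed to turn
    ref into hyp, via standard edit-distance dynamic programming."""
    n, m = len(ref), len(hyp)
    d = [[(0, 0, 0, 0)] * (m + 1) for _ in range(n + 1)]  # (cost, S, D, I)
    for i in range(1, n + 1): d[i][0] = (i, 0, i, 0)
    for j in range(1, m + 1): d[0][j] = (j, 0, 0, j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if ref[i-1] == hyp[j-1]:
                d[i][j] = d[i-1][j-1]  # match: no new operation
            else:
                sub, dele, ins = d[i-1][j-1], d[i-1][j], d[i][j-1]
                best = min(sub, dele, ins, key=lambda t: t[0])
                if best is sub:    d[i][j] = (best[0]+1, best[1]+1, best[2], best[3])
                elif best is dele: d[i][j] = (best[0]+1, best[1], best[2]+1, best[3])
                else:              d[i][j] = (best[0]+1, best[1], best[2], best[3]+1)
    return d[n][m][1:]  # (S, D, I)

def wer(ref, hyp):
    s, dl, i = edit_ops(ref, hyp)
    return (s + dl + i) / len(ref)

def ter(ref, hyp, is_term):
    # TER: the same computation restricted to query terms (content words).
    return wer([w for w in ref if is_term(w)], [w for w in hyp if is_term(w)])

ref = "show me the route to seoul station".split()
hyp = "show me the route to soul station".split()
is_term = lambda w: w not in {"show", "me", "the", "to"}  # hypothetical stopword test
print(f"WER={wer(ref, hyp):.2f}  TER={ter(ref, hyp, is_term):.2f}")
```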
<Paragraph position="3"> Tables 2 and 3 present the WER results for the baseline ASR systems, the word-based channel model, our syllable-based channel model, and the syllable-based channel model combined with the LSP semantic correction model. The performance of the baseline systems was about 79%-81% on the utterances in the in-vehicle telematics IR domain. This result shows that semantic error correction of speech recognition results is a viable approach to improving performance. Using both baseline ASR systems, we achieved error reduction rates of 39% and 27%. Compared with the previous word-based model, our new approach corrects errors more accurately in this domain. Table 4 shows the experimental results for TER. The TER results show that the baseline ASR systems alone are not adequate for processing user queries in a speech-driven IR, QA, or dialog understanding system.</Paragraph> <Paragraph position="4"> With post-error correction, however, the error reduction rate for TER is much higher than that for WER, and we achieved better performance than with the word-based model.</Paragraph> <Paragraph position="5"> These results suggest that our methods are well suited to speech-driven IR and QA applications.</Paragraph> <Paragraph position="6"> Compared with the word-based noisy channel model, which has been the best approach to error correction so far, our semantic-oriented error correction offers an alternative, more successful method for speech recognition error correction.</Paragraph> </Section> </Section> </Paper>