<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1025">
  <Title>A Method for Open-Vocabulary Speech-Driven Text Retrieval</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 System Overview
</SectionTitle>
    <Paragraph position="0"> Figure 1 depicts the overall design of our speech-driven text retrieval system, which consists of speech recognition, text retrieval and query completion modules. Although our system is currently implemented for Japanese, our methodology is language-independent. We explain the retrieval process based on this figure.</Paragraph>
    <Paragraph position="1"> Given a query spoken by a user, the speech recognition module uses a dictionary and acoustic/language models to generate a transcription of the user speech. During this process, OOV words, which are not listed in the dictionary, are also detected. For this purpose, our language model includes both words and syllables so that OOV words are transcribed as sequences of syllables.</Paragraph>
    <Paragraph position="2"> For example, in the case where &amp;quot;kankitsu (citrus)&amp;quot; is not listed in the dictionary, this word should be transcribed as /ka N ki tsu/. However, it is possible that this word is mistakenly transcribed, such as /ka N ke tsu/ and /ka N ke tsu ke ko/.</Paragraph>
    <Paragraph position="3"> To improve the quality of our system, these syllable sequences have to be transcribed as words, which is one of the central issues in this paper. In the case of speech-driven retrieval, where users usually have specific information needs, it is feasible that users utter contents related to a target collection. In other words, there is a great possibility that detected OOV words can be identified as index terms that are phonetically identical or similar.</Paragraph>
    <Paragraph position="4"> However, since a) a single sound can potentially correspond to more than one word (i.e., homonyms) and b) searching the entire collection for phonetically identical/similar terms is prohibitive, we need an efficient disambiguation method. Specifically, in the case of Japanese, the homonym problem is multiply crucial because words consist of different character types, i.e., &amp;quot;kanji,&amp;quot; &amp;quot;katakana,&amp;quot; &amp;quot;hiragana,&amp;quot; alphabets and other characters like numerals1.</Paragraph>
    <Paragraph position="5"> To resolve this problem, we use a two-stage retrieval method. In the first stage, we delete OOV words from the transcription, and perform text retrieval using remaining words, to obtain a specific number of top-ranked documents according to the degree of relevance. Even if speech recognition is  and katakana and hiragana are phonograms.</Paragraph>
    <Paragraph position="6"> lection. Thus, we search only these documents for index terms corresponding to detected OOV words.</Paragraph>
    <Paragraph position="7"> Then, in the second stage, we replace detected OOV words with identified index terms so as to complete the transcription, and re-perform text retrieval to obtain final outputs. However, we do not re-perform speech recognition in the second stage.</Paragraph>
    <Paragraph position="8"> In the above example, let us assume that the user also utters words related to &amp;quot;kankitsu (citrus),&amp;quot; such as &amp;quot;orenji (orange)&amp;quot; and &amp;quot;remon (lemon),&amp;quot; and that these words are correctly recognized as words. In this case, it is possible that retrieved documents contain the word &amp;quot;kankitsu (citrus).&amp;quot; Thus, we replace the syllable sequence /ka N ke tsu/ in the query with &amp;quot;kankitsu,&amp;quot; which is additionally used as a query term in the second stage.</Paragraph>
    <Paragraph position="9"> It may be argued that our method resembles the notion of pseudo-relevance feedback (or local feedback) for IR, where documents obtained in the first stage are used to expand query terms, and final outputs are refined in the second stage (Kwok and Chan, 1998). However, while relevance feedback is used to improve only the retrieval accuracy, our method improves the speech recognition and retrieval accuracy.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Speech Recognition
</SectionTitle>
    <Paragraph position="0"> The speech recognition module generates word sequence a2 , given phone sequence a3 . In a stochastic speech recognition framework (Bahl et al., 1983), the task is to select the a2 maximizing a4a6a5  Here, a4a6a5a29a3 a7a2 a9 models a probability that word sequence a2 is transformed into phone sequence a3 , and a4a6a5 a2 a9 models a probability that a2 is linguistically acceptable. These factors are usually called acoustic and language models, respectively.</Paragraph>
    <Paragraph position="1"> For the speech recognition module, we use the Japanese dictation toolkit (Kawahara et al., 2000)2, which includes the &amp;quot;Julius&amp;quot; recognition engine and acoustic/language models. The acoustic model was produced by way of the ASJ speech database (ASJ-JNAS) (Itou et al., 1998; Itou et al., 1999), which contains approximately 20,000 sentences uttered by 132 speakers including the both gender groups.</Paragraph>
    <Paragraph position="2"> This toolkit also includes development softwares so that acoustic and language models can be produced and replaced depending on the application.</Paragraph>
    <Paragraph position="3"> While we use the acoustic model provided in the toolkit, we use a new language model including both words and syllables. For this purpose, we used the &amp;quot;ChaSen&amp;quot; morphological analyzer3 to extract words from ten years worth of &amp;quot;Mainichi Shimbun&amp;quot; newspaper articles (1991-2000).</Paragraph>
    <Paragraph position="4"> Then, we selected 20,000 high-frequency words to produce a dictionary. At the same time, we segmented remaining lower-frequency words into syllables based on the Japanese phonogram system.</Paragraph>
    <Paragraph position="5"> The resultant number of syllable types was approximately 700. Finally, we produced a word/syllable-based trigram language model. In other words, OOV words were modeled as sequences of syllables.</Paragraph>
    <Paragraph position="6"> Thus, by using our language model, OOV words can easily be detected.</Paragraph>
    <Paragraph position="7"> In spoken document retrieval, an open-vocabulary method, which combines recognition methods for words and syllables in target speech documents, was also proposed (Wechsler et al., 1998). However, this method requires an additional computation for recognizing syllables, and thus is expensive. In contrast, since our language model is a regular statistical a30 -gram model, we can use the same speech recognition framework as in Equation (1).</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Text Retrieval
</SectionTitle>
    <Paragraph position="0"> The text retrieval module is based on the &amp;quot;Okapi&amp;quot; probabilistic retrieval method (Robertson and Walker, 1994), which is used to compute the relevance score between the transcribed query and each document in a target collection. To produce an inverted file (i.e., an index), we use ChaSen to extract content words from documents as terms, and perform a word-based indexing. We also extract terms from transcribed queries using the same method.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Query Completion
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.1 Overview
</SectionTitle>
      <Paragraph position="0"> As explained in Section 3, the basis of the query completion module is to correspond OOV words detected by speech recognition (Section 4) to index terms used for text retrieval (Section 5). However, to identify corresponding index terms efficiently, we limit the number of documents in the first stage retrieval. In principle, terms that are indexed in top-ranked documents (those retrieved in the first stage) and have the same sound with detected OOV words can be corresponding terms.</Paragraph>
      <Paragraph position="1"> However, a single sound often corresponds to multiple words. In addition, since speech recognition on a syllable-by-syllable basis is not perfect, it is possible that OOV words are incorrectly transcribed. For example, in some cases the Japanese word &amp;quot;kankitsu (citrus)&amp;quot; is transcribed as /ka N ke tsu/. Thus, we also need to consider index terms that are phonetically similar to OOV words. To sum up, we need a disambiguation method to select appropriate corresponding terms, out of a number of candidates.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.2 Formalization
</SectionTitle>
      <Paragraph position="0"> Intuitively, it is feasible that appropriate terms: a31 have identical/similar sound with OOV words detected in spoken queries, a31 frequently appear in a top-ranked document set, a31 and appear in higher-ranked documents.</Paragraph>
      <Paragraph position="1"> From the viewpoint of probability theory, possible representations for the above three properties include Equation (2), where each property corresponds to different parameters. Our task is to select the a32 maximizing the value computed by this equation as the corresponding term for OOV word a33 .</Paragraph>
      <Paragraph position="3"> Here, a45a6a46 is the top-ranked document set retrieved in the first stage, given query a44 . a4a6a5a25a33 a7a32a20a9 is a probability that index term a32 can be replaced with detected OOV word a33 , in terms of phonetics. a4a6a5a25a32 a7a42 a9 is the relative frequency of term a32 in document a42 . a4a6a5 a42a47a7a44 a9 is a probability that document a42 is relevant to query a44 , which is associated with the score formalized in the Okapi method.</Paragraph>
      <Paragraph position="4"> However, from the viewpoint of empiricism, Equation (2) is not necessarily effective. First, it is not easy to estimate a4a6a5a25a33 a7a32a20a9 based on the probability theory. Second, the probability score computed by the Okapi method is an approximation focused mainly on relative superiority among retrieved documents, and thus it is difficult to estimate a4a22a5 a42a47a7a44 a9 in a rigorous manner. Finally, it is also difficult to determine the degree to which each parameter influences in the final probability score.</Paragraph>
      <Paragraph position="5"> In view of these problems, through preliminary experiments we approximated Equation (2) and formalized a method to compute the degree (not the probability) to which given index terma32 corresponds to OOV word a33 .</Paragraph>
      <Paragraph position="6"> First, we estimate a4a6a5a29a33 a7a32a20a9 by the ratio between the number of syllables commonly included in both a33 and a32 and the total number of syllables in a33 . We use a DP matching method to identify the number of cases related to deletion, insertion, and substitution in a33 , on a syllable-by-syllable basis.</Paragraph>
      <Paragraph position="7"> Second, a4a22a5a29a33 a7a32a48a9 should be more influential than</Paragraph>
      <Paragraph position="9"> last two parameters are effective in the case where a large number of candidates phonetically similar to a33 are obtained. To decrease the effect of a4a6a5a25a32</Paragraph>
      <Paragraph position="11"> a9 , we tentatively use logarithms of these parameters. In addition, we use the score computed by the Okapi method as a4a6a5 a42a47a7a44 a9 .</Paragraph>
      <Paragraph position="12"> According to the above approximation, we compute the score of a32 as in Equation (3).</Paragraph>
      <Paragraph position="14"> It should be noted that Equation (3) is independent of the indexing method used, and therefore a32 can be any sequences of characters contained in a45 a46 . In other words, any types of indexing methods (e.g., word-based and phrase-based indexing methods) can be used in our framework.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.3 Implementation
</SectionTitle>
      <Paragraph position="0"> Since computation time is crucial for a real-time usage, we preprocess documents in a target collection so as to identify candidate terms efficiently. This process is similar to the indexing process performed in the text retrieval module.</Paragraph>
      <Paragraph position="1"> In the case of text retrieval, index terms are organized in an inverted file so that documents including terms that exactly match with query keywords can be retrieved efficiently.</Paragraph>
      <Paragraph position="2"> However, in the case of query completion, terms that are included in top-ranked documents need to be retrieved. In addition, to minimize a score computation (for example, DP matching is time-consuming), it is desirable to delete terms that are associated with a diminished phonetic similarity value, a4a6a5a25a33 a7a32a20a9 , prior to the computation of Equation (3). In other words, an index file for query completion has to be organized so that a partial matching method can be used. For example, /ka N ki tsu/ has to be retrieved efficiently in response to /ka N ke tsu/.</Paragraph>
      <Paragraph position="3"> Thus, we implemented a forward/backward partial-matching method, in which entries can be retrieved by any substrings from the first/last characters. In addition, we index words and word-based bigrams, because preliminary experiments showed that OOV words detected by our speech recognition module are usually single words or short phrases, such as &amp;quot;ozon-houru (ozone hole).&amp;quot;</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
7 Experimentation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
7.1 Methodology
</SectionTitle>
      <Paragraph position="0"> To evaluate the performance of our speech-driven retrieval system, we used the IREX collection4. This test collection, which resembles one used in the TREC ad hoc retrieval track, includes 30 Japanese topics (information need) and relevance assessment (correct judgement) for each topic, along with target</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4http://cs.nyu.edu/cs/projects/proteus/irex/index-e.html
</SectionTitle>
      <Paragraph position="0"> documents. The target documents are 211,853 articles collected from two years worth of &amp;quot;Mainichi Shimbun&amp;quot; newspaper (1994-1995).</Paragraph>
      <Paragraph position="1"> Each topic consists of the ID, description and narrative. While descriptions are short phrases related to the topic, narratives consist of one or more sentences describing the topic. Figure 2 shows an example topic in the SGML form (translated into English by one of the organizers of the IREX workshop).</Paragraph>
      <Paragraph position="2"> However, since the IREX collection does not contain spoken queries, we asked four speakers (two males/females) to dictate the narrative field. Thus, we produced four different sets of 30 spoken queries.</Paragraph>
      <Paragraph position="3"> By using those queries, we compared the following different methods: 1. text-to-text retrieval, which used written narratives as queries, and can be seen as a perfect speech-driven text retrieval, 2. speech-driven text retrieval, in which only words listed in the dictionary were modeled in the language model (in other words, the OOV word detection and query completion modules were not used), 3. speech-driven text retrieval, in which OOV words detected in spoken queries were simply deleted (in other words, the query completion module was not used), 4. speech-driven text retrieval, in which our method proposed in Section 3 was used.</Paragraph>
      <Paragraph position="4"> In cases of methods 2-4, queries dictated by four speakers were used independently. Thus, in practice we compared 13 different retrieval results. In addition, for methods 2-4, ten years worth of Mainichi Shimbun Japanese newspaper articles (1991-2000) were used to produce language models. However, while method 2 used only 20,000 high-frequency words for language modeling, methods 3 and 4 also used syllables extracted from lower-frequency words (see Section 4).</Paragraph>
      <Paragraph position="5"> Following the IREX workshop, each method retrieved 300 top documents in response to each query, and non-interpolated average precision values were used to evaluate each method.</Paragraph>
      <Paragraph position="6"> &lt;TOPIC&gt;&lt;TOPIC-ID&gt;1001&lt;/TOPIC-ID&gt; &lt;DESCRIPTION&gt;Corporate merging&lt;/DESCRIPTION&gt; &lt;NARRATIVE&gt;The article describes a corporate merging and in the article, the name of companies have to be identifiable. Information including the field and the purpose of the merging have to be identifiable. Corporate merging</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
7.2 Results
</SectionTitle>
      <Paragraph position="0"> First, we evaluated the performance of detecting OOV words. In the 30 queries used for our evaluation, 14 word tokens (13 word types) were OOV words unlisted in the dictionary for speech recognition. Table 1 shows the results on a speaker-byspeaker basis, where &amp;quot;#Detected&amp;quot; and &amp;quot;#Correct&amp;quot; denote the total number of OOV words detected by our method and the number of OOV words correctly detected, respectively. In addition, &amp;quot;#Completed&amp;quot; denotes the number of detected OOV words that were corresponded to correct index terms in 300 top documents.</Paragraph>
      <Paragraph position="1"> It should be noted that &amp;quot;#Completed&amp;quot; was greater than &amp;quot;#Correct&amp;quot; because our method often mistakenly detected words in the dictionary as OOV words, but completed them with index terms correctly. We estimated recall and precision for detecting OOV words, and accuracy for query completion, as in Equation (4).</Paragraph>
      <Paragraph position="2">  Looking at Table 1, one can see that recall was generally greater than precision. In other words, our method tended to detect as many OOV words as possible. In addition, accuracy of query completion was relatively low.</Paragraph>
      <Paragraph position="3"> Figure 3 shows example words in spoken queries, detected as OOV words and correctly completed with index terms. In this figure, OOV words are transcribed with syllables, where &amp;quot;:&amp;quot; denotes a long vowel. Hyphens are inserted between Japanese words, which inherently lack lexical segmentation.</Paragraph>
      <Paragraph position="4"> Second, to evaluate the effectiveness of our query completion method more carefully, we compared retrieval accuracy for methods 1-4 (see Section 7.1). Table 2 shows average precision values, averaged over the 30 queries, for each method5. The average precision values of our method (i.e., method 4) was approximately 87% of that for text-to-text retrieval. By comparing methods 2-4, one can see that our method improved average precision values of the other methods irrespective of the speaker. To put it more precisely, by comparing methods 3 and 4, one can see the effectiveness of the query completion method. In addition, by comparing methods 2 and 4, one can see that a combination of the OOV word detection and query completion methods was effective.</Paragraph>
      <Paragraph position="5"> It may be argued that the improvement was relatively small. However, since the number of OOV words inherent in 30 queries was only 14, the effect of our method was overshadowed by a large number of other words. In fact, the number of words used as query terms for our method, averaged over the four speakers, was 421. Since existing test collections for IR research were not produced to explore the OOV problem, it is difficult to derive conclusions that are statistically valid. Experiments using larger-scale test collections where the OOV problem is more crucial need to be further explored.</Paragraph>
      <Paragraph position="6"> Finally, we investigated the time efficiency of our method, and found that CPU time required for the query completion process per detected OOV word was 3.5 seconds (AMD Athlon MP 1900+). However, an additional CPU time for detecting OOV words, which can be performed in a conventional speech recognition process, was not crucial.</Paragraph>
      <Paragraph position="7"> 5Average precision is often used to evaluate IR systems, which should not be confused with evaluation measures in Equation (4).</Paragraph>
      <Paragraph position="8">  /gu re : pu ra chi na ga no/ gureepu-furuutsu /gu re : pu fu ru : tsu/ grapefruit /ya yo i chi ta/ Yayoi-jidai /ya yo i ji da i/ the Yayoi period /ni ku ku ra i su/ nikku-puraisu /ni q ku pu ra i su/ Nick Price /be N pi/ benpi /be N pi/ constipation</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
7.3 Analyzing Errors
</SectionTitle>
      <Paragraph position="0"> We manually analyzed seven cases where the average precision value of our method was significantly lower than that obtained with method 2 (the total number of cases was the product of numbers of queries and speakers).</Paragraph>
      <Paragraph position="1"> Among these seven cases, in five cases our query completion method selected incorrect index terms, although correct index terms were included in top-ranked documents obtained with the first stage. For example, in the case of the query 1021 dictated by a female speaker, the word &amp;quot;seido (institution)&amp;quot; was mistakenly transcribed as /se N do/. As a result, the word &amp;quot;sendo (freshness),&amp;quot; which is associated with the same syllable sequences, was selected as the index term. The word &amp;quot;seido (institution)&amp;quot; was the third candidate based on the score computed by Equation (3). To reduce these errors, we need to enhance the score computation.</Paragraph>
      <Paragraph position="2"> In another case, our speech recognition module did not correctly recognize words in the dictionary, and decreased the retrieval accuracy.</Paragraph>
      <Paragraph position="3"> In the final case, a fragment of a narrative sentence consisting of ten words was detected as a single OOV word. As a result, our method, which can complete up to two word sequences, mistakenly processed that word, and decreased the retrieval accuracy. However, this case was exceptional. In most cases, functional words, which were recognized with a high accuracy, segmented OOV words into shorter fragments.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>