<?xml version="1.0" standalone="yes"?> <Paper uid="N04-1017"> <Title>Lattice-Based Search for Spoken Utterance Retrieval</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Methods </SectionTitle> <Paragraph position="0"> In this section we describe the overall structure of our system and give details of the techniques used in our investigations. The system consists of three main components. First, the ASR component is used to convert speech into a lattice representation, together with timing information. Second, this representation is indexed for efficient retrieval. These two steps are performed off-line.</Paragraph> <Paragraph position="1"> Finally, when the user enters a query the index is searched and matching audio segments are returned.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Automatic Speech Recognition </SectionTitle> <Paragraph position="0"> We use a state-of-the-art HMM based large vocabulary continuous speech recognition (LVCSR) system. The acoustic models consist of decision tree state clustered triphones and the output distributions are mixtures of Gaussians. The language models are pruned backoff tri-gram models. The pronunciation dictionaries contain few alternative pronunciations. Pronunciations that are not in our baseline pronunciation dictionary (including OOV query words) are generated using a text-to-speech (TTS) frontend. The TTS frontend can produce multiple pronunciations. The ASR systems used in this study are single pass systems. The recognition networks are represented as weighted finite state machines (FSMs).</Paragraph> <Paragraph position="1"> The output of the ASR system is also represented as an FSM and may be in the form of a best hypothesis string or a lattice of alternate hypotheses. The labels on the arcs of the FSM may be words or phones, and the conversion between the two can easily be done using FSM composition. The costs on the arcs are negative log likelihoods. Additionally, timing information can also be present in the output.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Lattice Indexing and Retrieval </SectionTitle> <Paragraph position="0"> In the case of lattices, we store a set of indices, one for each arc label (word or phone) l, that records the lattice number L[a], input-state k[a] of each arc a labeled with l in each lattice, along with the probability mass f(k[a]) leading to that state, the probability of the arc itself p(a|k[a]) and an index for the next state. To retrieve a single label from a set of lattices representing a speech corpus one simply retrieves all arcs in each lattice from the label index. The lattices are first normalized by weight pushing (Mohri et al., 2002) so that the probability of the set of all paths leading from the arc to the final state is 1. After weight pushing, for a given arc a, the probability of the set of all paths containing that arc is given by</Paragraph> <Paragraph position="2"> namely the probability of all paths leading into that arc, multiplied by the probability of the arc itself. For a lattice L we construct a &quot;count&quot; C(l|L) for a given label l using the information stored in the index I(l) as follows,</Paragraph> <Paragraph position="4"> where C(l|pi) is the number of times l is seen on path pi and d(a,l) is 1 if arc a has the label l and 0 otherwise. 
<Paragraph position="5"> To search for a multi-label expression (e.g., a multi-word phrase) w1 w2 ... wn, we first look up each label in the expression, and then for each pair (wi, wi+1) join the output states of wi with the matching input states of wi+1; in this way we retrieve just those path segments in each lattice that match the entire multi-label expression. The probability of each match is defined as f(k[a1]) p(a1|k[a1]) p(a2|k[a2]) ... p(an|k[an]), where a1, a2, ..., an are the arcs of the matching path segment and p(ai|k[ai]) is the probability of the ith arc. The total &quot;count&quot; for the lattice is computed as defined above.</Paragraph>
<Paragraph position="6"> Note that in the limit case where each lattice is an unweighted single path -- i.e., a string of labels -- the above scheme reduces to a standard inverted index.</Paragraph>
<Paragraph position="7"> The count C(l|L) can be interpreted as a lattice-based confidence measure. Although it may be possible to use more sophisticated confidence measures, the use of (posterior) probabilities allows a simple factorization that makes indexing efficient.</Paragraph> </Section>
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Indexing Using Sub-word Units </SectionTitle>
<Paragraph position="0"> In order to deal with queries that contain OOV words, we investigate the use of sub-word units for indexing. In this study we use phones as the sub-word units. There are two methods for obtaining a phonetic representation of an input utterance.</Paragraph>
<Paragraph position="1"> 1. Phone recognition using an ASR system whose recognition units are phones. This is achieved by using a phone-level language model instead of the word-level language model used in the baseline ASR system.</Paragraph>
<Paragraph position="2"> 2. Converting the word-level representation of the utterance into a phone-level representation. This is achieved by using the baseline ASR system and replacing each word in the output with its pronunciation(s) in terms of phones.</Paragraph>
<Paragraph position="3"> Both methods have shortcomings. Phone recognition is known to be less accurate than word recognition. On the other hand, the second method can only generate phone strings that are substrings of the pronunciations of in-vocabulary word strings. An alternative is to use the hybrid language models developed for OOV word detection (Yazgan and Saraclar, 2004).</Paragraph>
<Paragraph position="4"> For retrieval, each query word is converted into phone string(s) using its pronunciation(s). The phone index can then be searched for each phone string. Note that this approach will generate many false alarms, particularly for short query words, which are likely to be substrings of longer words. In order to control for this, a minimum pronunciation length can be imposed on query words. Since most short words are in vocabulary, this bound has little effect on recall.</Paragraph> </Section>
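The pronunciation-based lookup just described, together with the multi-label search of Section 3.2, can be sketched roughly as follows. It builds on the hypothetical LatticeIndex above; search_sequence, search_by_pronunciation, the example lexicon format, and the min_phones bound are illustrative assumptions rather than details from the paper.

```python
from collections import defaultdict

def search_sequence(index, labels):
    """Match a multi-label expression w1 w2 ... wn: for each consecutive pair,
    join arcs whose next (output) state equals the input state of an arc for
    the following label within the same lattice.  Each matching path segment
    contributes f(k[a1]) p(a1|k[a1]) ... p(an|k[an]) to its lattice's count."""
    # Partial matches: (lattice, state reached so far, accumulated probability).
    partial = [(e.lattice, e.next_state, e.f_k * e.p_arc)
               for e in index.arcs[labels[0]]]
    for label in labels[1:]:
        extended = []
        for lattice, state, prob in partial:
            for e in index.arcs[label]:
                if e.lattice == lattice and e.input_state == state:
                    extended.append((lattice, e.next_state, prob * e.p_arc))
        partial = extended
    counts = defaultdict(float)
    for lattice, _, prob in partial:
        counts[lattice] += prob
    return counts

def search_by_pronunciation(phone_index, word, lexicon, min_phones=4):
    """Retrieve a (possibly OOV) query word from the phone index by searching
    for each of its pronunciations as a phone string.  A minimum pronunciation
    length guards against false alarms from very short phone strings."""
    results = defaultdict(float)
    for pron in lexicon.get(word, []):   # e.g. [["w", "ih", "t", "n", "ax", "s"]]
        if len(pron) < min_phones:
            continue
        for lattice, count in search_sequence(phone_index, pron).items():
            results[lattice] += count
    return results
```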
<Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.4 Using Both Word and Sub-word Indices </SectionTitle>
<Paragraph position="0"> Given a word index and a sub-word index, it is possible to improve the retrieval performance of the system by using both indices. There are many strategies for doing this:
1. combination: Search both the word index and the sub-word index and combine the results.
2. vocabulary cascade: Search the word index for in-vocabulary queries and the sub-word index for OOV queries.
3. search cascade: Search the word index; if no result is returned, search the sub-word index.</Paragraph>
<Paragraph position="1"> In the first case, if the indices are obtained from ASR best hypotheses, then the result combination is a simple union of the separate sets of results. However, if the indices are obtained from lattices, then in addition to taking a union of the results, retrieval can be done using a combined score. Given a query q, let Cw(q) and Cp(q) be the lattice counts obtained from the word index and the phone index, respectively. We also define the normalized lattice count for the phone index as</Paragraph>
<Paragraph position="2"> Cp^norm(q) = Cp(q)^(1/|pron(q)|), </Paragraph>
<Paragraph position="3"> where |pron(q)| is the length of the pronunciation of query q. We then define the combined score to be</Paragraph>
<Paragraph position="4"> Cwp(q) = Cw(q) + λ Cp^norm(q), </Paragraph>
<Paragraph position="5"> where λ is an empirically determined scaling factor.</Paragraph>
<Paragraph position="6"> In the other cases, instead of using two different thresholds we use a single threshold on Cw(q) and Cp^norm(q) during retrieval.</Paragraph> </Section>
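A rough sketch of the combined word/phone scoring follows; the function name, the default value of λ, and the exact form of the length normalization follow the reconstruction above and are our assumptions, since the text only states that the scaling factor is determined empirically.

```python
def combined_score(c_word, c_phone, pron_len, lam=0.5):
    """Combine the word-index count Cw(q) and phone-index count Cp(q) for a
    query q.  The phone count is first length-normalized by |pron(q)| (assumed
    here to be the |pron(q)|-th root, per the reconstruction above), then added
    to the word count with an empirically chosen scaling factor lam (lambda)."""
    c_phone_norm = c_phone ** (1.0 / pron_len) if c_phone > 0 else 0.0
    return c_word + lam * c_phone_norm
```

</Section> </Paper>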