File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/04/w04-2907_metho.xml

Size: 17,231 bytes

Last Modified: 2025-10-06 14:09:29

<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-2907">
  <Title>General Indexation of Weighted Automata - Application to Spoken Utterance Retrieval</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Indexation Algorithm
</SectionTitle>
    <Paragraph position="0"> This section presents an algorithm for the construction of an efficient index for a large set of speech utterances.</Paragraph>
    <Paragraph position="1"> We assume that for each speech utterance ui of the dataset considered, i = 1,...,n, a weighted automaton Ai over the alphabet S and the log semiring, e.g., phone or word lattice output by an automatic speech recognizer, is given. The problem consists of creating a full index, that is one that can be used to search directly any factor of any string accepted by these automata. Note that this problem crucially differs from classical indexation problems in that the input data is uncertain. Our algorithm must make use of the weights associated to each string by the input automata.</Paragraph>
    <Paragraph position="2"> The main idea behind the design of the algorithm described is that the full index can be represented by a weighted finite-state transducer T mapping each factor x to the set of indices of the automata in which x appears and the negative log of the expected count of x. More precisely, let Pi be the probability distribution defined by the weighted automaton Ai over the set of strings S[?] and let Cx(u) denote the number of occurrences of a factor x in u, then, for any factor x [?] S[?] and automaton index</Paragraph>
    <Paragraph position="4"> Our algorithm for the construction of the index is simple, it is based on general weighted automata and transducer algorithms. We describe the consecutive stages of the algorithm. null This algorithm can be seen as a generalization to weighted automata of the notion of suffix automaton and factor automaton for strings. The suffix (factor) automaton of a string u is the minimal deterministic finite automata recognizing exactly the set of suffixes (resp. factors) of u (Blumer et al., 1985; Crochemore, 1986). The size of both automata is linear in the length of u and both can be built in linear time. These are classical representations used in text indexation (Blumer et al., 1987; Crochemore, 1986).</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Preprocessing
</SectionTitle>
      <Paragraph position="0"> When the automata Ai are word or phone lattices output by a speech recognition or other natural language processing system, the path weights correspond to joint probabilities. We can apply to Ai a general weight-pushing algorithm in the log semiring (Mohri, 1997) which converts these weights into the desired (negative log of) posterior probabilities. More generally, the path weights in the resulting automata can be interpreted as log-likelihoods. We denote by Pi the corresponding probability distribution. When the input automaton Ai is acyclic, the complexity of the weight-pushing algorithm is linear in its size (O(|Ai|)). Figures 1(b)(d) illustrates the application of the algorithm to the automata of Figures 1(a)(c).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Construction of Transducer Index T
</SectionTitle>
      <Paragraph position="0"> Let Bi = (S,Qi,Ii,Fi,Ei,li,ri) denote the result of the application of the weight pushing algorithm to the automaton Ai. The weight associated by Bi to each string it accepts can be interpreted as the log-likelihood of that string for the utterance ui given the models used to generate the automata. More generally, Bi defines a probability distribution Pi over all strings x [?] S[?] which is just the sum of the probability of all paths of Bi in which x appears.</Paragraph>
      <Paragraph position="1"> For each state q [?] Qi, denote by d[q] the shortest distance from Ii to q (or -log of the forward probability) and by f[q] the shortest distance from q to F (or -log of the backward probability):</Paragraph>
      <Paragraph position="3"> The shortest distances d[q] and f[q] can be computed for all states q [?] Qi in linear time (O(|Bi|)) when Bi is acyclic (Mohri, 2002). Then,</Paragraph>
      <Paragraph position="5"> From the weighted automaton Bi, one can derive a weighted transducer Ti in two steps:  1. Factor Selection. In the general case we select all the factors to be indexed in the following way: * Replace each transition (p,a,w,q) [?] QixSx RxQi by (p,a,a,w,q) [?] QixSxSxRxQi; * Create a new state s negationslash[?] Qi and make s the unique initial state; * Create a new state e negationslash[?] Qi and make e the unique final state; * Create a new transition (s,epsilon1,epsilon1,d[q],q) for each state q [?] Qi; * Create a new transition (q,epsilon1,i,f[q],e) for each state q [?] Qi; 2. Optimization. The resulting transducer can be optimized by applying weighted epsilon1-removal, weighted determinization, and minimization over the log semiring by viewing it as an acceptor, i.e., input null output labels are encoded a single labels.</Paragraph>
      <Paragraph position="6"> It is clear from Equation 4 that for any factor x [?] S[?]:</Paragraph>
      <Paragraph position="8"> This construction is illustrated by Figures 2(a)(b). Our full index transducer T is the constructed by * taking the [?]log-sum (or union) of all the transducers Ti, i = 1,...,n; * defining T as the result of determinization (in the log semiring) applied to that transducer.</Paragraph>
      <Paragraph position="9"> Figure 3 is illustrating this construction and optimization.  tomata B1 given Figure 1(b): (a) intermediary result after factor selection and (b) resulting weighted transducer T1.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Search
</SectionTitle>
    <Paragraph position="0"> The full index represented by the weighted finite-state transducer T is optimal. Indeed, T contains no transition with input epsilon1 other than the final transitions labeled with an output index and it is deterministic. Thus, the set of indices Ix of the weighted automata containing a factor x can be obtained in O(|x|+|Ix|) by reading in T the unique path with input label x and then the transitions with input epsilon1 which have each a distinct output label.</Paragraph>
    <Paragraph position="1"> The user's query is typically an unweighted string, but it can be given as an arbitrary weighted automaton X.</Paragraph>
    <Paragraph position="2"> This covers the case of Boolean queries or regular expressions which can be compiled into automata. The response to a query X is computed using the general algorithm of composition of weighted transducers (Mohri et al., 1996) followed by projection on the output:</Paragraph>
    <Paragraph position="4"> which is then epsilon1-removed and determinized to give directly the list of all indices and their corresponding log- null ing the weighted automata B1 and B2 given in Figures 1(b)(d) likelihoods. The final result can be pruned to include only the most likely responses. The pruning threshold may be used to vary the number of responses.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 General Indexation Framework
</SectionTitle>
    <Paragraph position="0"> The indexation technique just outlined can be easily extended to include many of the techniques used for speech indexation. This can be done by introducing a transducer F that converts between different levels of information sources or structures, or that filters out or reweights index entries. The filter F can be applied (i) before, (ii) during or (iii) after the construction of the index. For case (i), the filter is used directly on the input and the indexation algorithm is applied to the weighted automata (F *Ai)1[?]i[?]n.</Paragraph>
    <Paragraph position="1"> For case (ii), filtering is done after the factor selection step of the algorithm and the filter applies to the factors, typically to restrict the factors that will be indexed. For case (iii), the filter is applied to the index. Obviously different filters can be used in combination at different stages.</Paragraph>
    <Paragraph position="2"> When such a filter is used, the response to a query X is obtained using another transducer Fprime 1 and the following composition and projection:</Paragraph>
    <Paragraph position="4"> Since composition is associative, it does not impose a specific order to its application. However, in practice, it is often advantageous to compute X *Fprime before application of T. The following are examples of some filter transducers that can be of interest in many applications.</Paragraph>
    <Paragraph position="5">  tionary can be used to map word sequences into their phonemic transcriptions, thus transform word lattices into equivalent phone lattices. This mapping can represented by a weighted transducer F.</Paragraph>
    <Paragraph position="6"> Using an index based on phone lattices allows a user to search for words that are not in the ASR vocabulary. In this case, the inverse transduction Fprime is a grapheme to phoneme converter, commonly present in TTS front-ends. Among others, Witbrock and Hauptmann (1997) present a system where a phonetic transcript is obtained from the word transcript and retrieval is performed using both word and phone indices.</Paragraph>
    <Paragraph position="7"> * Vocabulary Restriction: in some cases using a full index can be prohibitive and unnecessary. It might be desirable to do partial indexing by ignoring some words (or phones) in the input. For example, we might wish to index only &amp;quot;named entities&amp;quot;, or just the consonants. This is mostly motivated by the reduction of the size of the index while retaining the necessary information. A similar approach is to apply a many to one mapping to index groups of phones, or metaphones (Amir et al., 2001), to over- null come phonetic errors.</Paragraph>
    <Paragraph position="8"> * Reweighting: a weighted transducer can be used to emphasize some words in the input while deemphasizing other. The weights, for example might correspond to TF-IDF weights. Another reweighting method might involve edit distance or confusion statistics.</Paragraph>
    <Paragraph position="9"> * Classification: an extreme form of summarizing the  information contained in the indexed material is to assign a class label, such as a topic label, to each input. The query would also be classified and all answers with the same class label would be returned as relevant.</Paragraph>
    <Paragraph position="10"> * Length Restriction: a common way of indexing phone strings is to index fixed length overlapping phone strings (Logan et al., 2002). This results in a partial index with only fixed length strings. More generally a minimum and maximum string length may be imposed on the index. An example restriction automaton is given in Figure 4. In this case, the filter applies to the factors and has to be applied during or after indexation. The restricted index will be smaller in size but contains less information and may result in degradation in retrieval performance, especially for long queries.</Paragraph>
    <Paragraph position="11"> The length restriction filter requires a modification of the search procedure. Assume a fixed - say r - length restriction filter and a string query of length k. If k &lt; r,</Paragraph>
    <Paragraph position="13"> ducer given in Figure 3(b).</Paragraph>
    <Paragraph position="14"> then we need to pad the input to length r with Sr[?]k. If k [?] r, then we must search for all substrings of length r in the index. A string is present in a certain lattice if all its substrings are (and not vice versa). So, the results of each substring search must be intersected. The probability of each substring xi+r[?]1i for i [?] {1,...,k + 1 [?] r} is an upper bound on the probability of the string xk1, and the count of each substring is an upper bound on the count of the string, so for i [?] {1,...,k + 1[?]r} EP[C(xk1)] [?] EP[C(xi+r[?]1i )].</Paragraph>
    <Paragraph position="15"> Therefore, the intersection operation must use minimum for combining the expected counts of substrings. In other words, the expected count of the string is approximated by the minimum of the probabilities of each of its substrings, null</Paragraph>
    <Paragraph position="17"> In addition to a filter transducer, pruning can be applied at different stages of the algorithm to reduce the size of the index. Pruning eliminates least likely paths in a weighted automaton or transducer. Applying pruning to Ai can be seen as part of the process that generates the uncertain input data. When pruning is applied to Bi, only the more likely alternatives will be indexed. If pruning is applied to Ti, or to T, pruning takes the expected counts into consideration and not the probabilities. Note that the threshold used for this type of pruning is directly comparable to the threshold used for pruning the search results in Section 4 since both are thresholds on expected counts.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Experimental Results
</SectionTitle>
    <Paragraph position="0"> Our task is retrieving the utterances (or short audio segments) that a given query appears in. The experimental setup is identical to that of Saraclar and Sproat (2004).</Paragraph>
    <Paragraph position="1"> Since, we take the system described there as our baseline, we give a brief review of the basic indexation algorithm used there. The algorithm uses the same pre-processing step. For each label in S, an index file is constructed. For each arc a that appears in the preprocessed weighted automaton Bi, the following information is stored: (i,p[a],n[a],d[p[a]],w[a]). Since the pre-processing ensures that f[q] = 0 for all q in Bi, it is possible to compute [?]log(EPi[Cx]) as in Equation 4 using the information stored in the index.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.1 Evaluation Metrics
</SectionTitle>
      <Paragraph position="0"> For evaluating retrieval performance we use precision and recall with respect to manual transcriptions. Let Correct(q) be the number of times the query q is found correctly, Answer(q) be the number of answers to the query q, and Reference(q) be the number of times q is found in the reference.</Paragraph>
      <Paragraph position="2"> We compute precision and recall rates for each query and report the average over all queries. The set of queries Q includes all the words seen in the reference except for a stoplist of 100 most common words.</Paragraph>
      <Paragraph position="4"> For lattice based retrieval methods, different operating points can be obtained by changing the threshold. The precision and recall at these operating points can be plotted as a curve.</Paragraph>
      <Paragraph position="5"> In addition to individual precision-recall values we also compute the F-measure defined as</Paragraph>
      <Paragraph position="7"> and report the maximum F-measure (maxF) to summarize the information in a precision-recall curve.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.2 Corpora
</SectionTitle>
      <Paragraph position="0"> We use three different corpora to assess the effectiveness of different retrieval techniques.</Paragraph>
      <Paragraph position="1"> The first corpus is the DARPA Broadcast News corpus consisting of excerpts from TV or radio programs including various acoustic conditions. The test set is the 1998 Hub-4 Broadcast News (hub4e98) evaluation test set (available from LDC, Catalog no. LDC2000S86) which is 3 hours long and was manually segmented into 940 segments. It contains 32411 word tokens and 4885 word types. For ASR we use a real-time system (Saraclar et al., 2002). Since the system was designed for SDR, the recognition vocabulary of the system has over 200K words.</Paragraph>
      <Paragraph position="2"> The second corpus is the Switchboard corpus consisting of two party telephone conversations. The test set is the RT02 evaluation test set which is 5 hours long, has 120 conversation sides and was manually segmented into 6266 segments. It contains 65255 word tokens and 3788 word types. For ASR we use the first pass of the evaluation system (Ljolje et al., 2002). The recognition vocabulary of the system has over 45K words.</Paragraph>
      <Paragraph position="3"> The third corpus is named Teleconferences since it consists of multi-party teleconferences on various topics. A test set of six teleconferences (about 3.5 hours) was transcribed. It contains 31106 word tokens and 2779 word types. Calls are automatically segmented into a total of 1157 segments prior to ASR. We again use the first pass of the Switchboard evaluation system for ASR.</Paragraph>
      <Paragraph position="4"> We use the AT&amp;T DCD Library (Allauzen et al., 2003) as our ASR decoder and our implementation of the algorithm is based on the AT&amp;T FSM Library (Mohri et al., 2000), both of which are available for download.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>