<?xml version="1.0" standalone="yes"?>
<Paper uid="H05-1119">
  <Title>SEARCHING THE AUDIO NOTEBOOK: KEYWORD SEARCH IN RECORDED CONVERSATIONS</Title>
  <Section position="4" start_page="0" end_page="947" type="metho">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Lisa Stifelman proposed in her thesis the idea of the &amp;quot;Audio Notebook,&amp;quot; where audio recordings of lectures and interviews are retained along with the notes (Stifelman, 1997). She has shown that the audio recordings are valuable to users if portions of interest can be accessed quickly and easily.</Paragraph>
    <Paragraph position="1"> Stifelman explored various techniques for this, including user-activity based techniques (most noteworthy timestamping notes so they can serve as an index into the recording) and content-based ones (signal processing for accelerated playback, &amp;quot;snap-to-grid&amp;quot; (=phrase boundary) based on prosodic cues). The latter are intended for situations where the former fail, e.g. when the user has no time for taking notes, does not wish to pay attention to it, or cannot keep up with complex subject matter, and as a consequence the audio is left without index. In this paper, we investigate technologies for searching the spoken content of the audio recording.</Paragraph>
    <Paragraph position="2"> Several approaches have been reported in the literature for the problem of indexing spoken words in audio recordings. The TREC (Text REtrieval Conference) Spoken-Document Retrieval (SDR) track has fostered research on audio-retrieval of broadcast-news clips. Most TREC benchmarking systems use broadcast-news recognizers to generate approximate transcripts, and apply text-based information retrieval to these. They achieve retrieval accuracy similar to using human reference transcripts, and ad-hoc retrieval for broadcast news is considered a &amp;quot;solved problem&amp;quot; (Garofolo, 2000). Noteworthy are the rather low word-error rates (20%) in the TREC evaluations, and that recognition errors did not lead to catastrophic failures due to redundancy of news segments and queries.</Paragraph>
    <Paragraph position="3"> However, in our scenario, requirements are rather different. First, word-error rates are much higher (4060%). Directly searching such inaccurate speech recognition transcripts suffers from a poor recall. Second, unlike broadcast-news material, user recordings of conversations will not be limited to a few specific domains. This not only poses difficulties for obtaining domain-specific training data, but also implies an unlimited vocabulary of query phrases users want to use. Third, audio recordings will accumulate. When the audio database grows to hundreds or even thousands of hours, a reasonable response time is still needed.</Paragraph>
    <Paragraph position="4"> A successful way to deal with high word error rates is the use of recognition alternates (lattices). For example, (Seide and Yu, 2004; Yu and Seide, 2004) reports a substantial 50% improvement of FOM (Figure Of Merit) for a word-spotting task in voicemails. Improvements from using lattices were also reported by (Saraclar and Sproat, 2004) and (Chelba and Acero, 2005).</Paragraph>
    <Paragraph position="5"> To address the problem of domain independence, a subword-based approach is needed. In (Logan, 2002) the authors address the problem by indexing phonetic or word-fragment based transcriptions. Similar approaches, e.g. using overlapping M-grams of phonemes, are discussed in (Sch&amp;quot;auble, 1995) and (Ng, 2000). (James and Young, 1994) introduces the approach of searching phoneme lattices. (Clements, 2001) proposes a similar idea called &amp;quot;phonetic search track.&amp;quot; In previous work (Seide and Yu, 2004), promising results were obtained with phonetic lattice search in voicemails. In (Yu and  w o r dr e c o g n i z e r p h o n e t i cr e c o g n i z e r p o s t e r i o rc o n v e r s i o n &amp; m e r g i n g w o r dl a t t i c e l a t t i c e i n d e x e r l a t t i c es t o r e i n d e xl o o k u pr e s u l t l i s t l i n e a r s e a r c h q u e r y p h o n e m el a t t i c ea u d i os t r e a m p r o m i s i n gs e g m e n t s r a n k e r i n d e x i n gs e a r c h h i t l i s t h y b r i d la t t i c e q u e r y i n v e r t e di n d e x l e t t e r t o s o u n d w o r d s t r i n gp h o n e m e s t r i n g h y b r i dl a t t i c e  Seide, 2004), it was found that even better result can be achieved by combining a phonetic search with a word-level search.</Paragraph>
    <Paragraph position="6"> For the third problem, quick response time is commonly achieved by indexing techniques. However, in the context of phonetic lattice search, the concept of &amp;quot;indexing&amp;quot; becomes a non-trivial problem, because due to the unknown-word nature, we need to deal with an open set of index keys. (Saraclar and Sproat, 2004) proposes to store the individual lattice arcs (inverting the lattice). (Allauzen et al., 2004) introduces a general indexation framework by indexing expected term frequencies (&amp;quot;expected counts&amp;quot;) instead of each individual keyword occurrence or lattice arcs. In (Yu et al., 2005), a similar idea of indexing expected term frequencies is proposed, suggesting to approximate expected term frequencies by M-gram phoneme language models estimated on segments of audio.</Paragraph>
    <Paragraph position="7"> In this paper, we combine previous work on phonetic lattice search, hybrid search and lattice indexing into a real system for searching recorded conversations that achieves high accuracy and can handle hundreds of hours of audio. The main contributions of this paper are: a real system for searching conversational speech, a novel method for combining phoneme and word lattices, and experimental results for searching recorded conversations. null The paper is organized as follows. Section 2 gives an overview of the system. Section 3 introduces the over-all criterion, based on which the system is developed, Section 4 introduces our implementation for a hybrid word/phoneme search system, and Section 5 discusses the lattice indexing mechanism. Section 6 presents the experimental results, and Section 7 concludes.</Paragraph>
  </Section>
  <Section position="5" start_page="947" end_page="948" type="metho">
    <SectionTitle>
2 A System For Searching Conversations
</SectionTitle>
    <Paragraph position="0"> A system for searching the spoken content of recorded conversations has several distinct properties. Users are searching their own meetings, so most searches will be known-item searches with at most a few correct hits in the archive. Users will often search for specific phrases that they remember, possibly with boolean operators. Relevance weighting of individual query terms is less of an issue in this scenario.</Paragraph>
    <Paragraph position="1"> We identified three user requirements: * high recall and accurate ranking of phrase matches; * domain independence - it should work for any topic, ideally without need to adapt vocabularies or language models; * reasonable response time - a few seconds at most, independent of the size of the conversation archive.</Paragraph>
    <Paragraph position="2"> We address them as follows. First, to increase recall we search recognition alternates based on lattices. Lattice oracle word-error rates1 are significantly lower than word-error rates of the best path. For example, (Chelba and Acero, 2005) reports a lattice oracle error rate of 22% for lecture recordings at a top-1 word-error rate of 45%2. To utilize recognizer scores in the lattices, we formulating the ranking problem as one of risk minimization and derive that keyword hits should be ranked by their word (phrase) posterior probabilities.</Paragraph>
    <Paragraph position="3"> Second, domain independence is achieved by combining large-vocabulary recognition with a phonetic search. This helps especially for proper names and specialized terminology, which are often either missing in the vocabulary or not well-predicted by the language model.</Paragraph>
    <Paragraph position="4"> Third, to achieve quick response time, we use an M-gram based indexing approach. It has two stages, where the first stage is a fast index lookup to create a short-list of candidate lattices. In the second stage, a detailed lattice match is applied to the lattices in the short-list. We call the second stage linear search because search time grows linearly with the duration of the lattices searched.</Paragraph>
    <Paragraph position="5">  The resulting system architecture is shown in Fig. 1. In the following three sections, we will discuss our solutions in these three aspects in details respectively.</Paragraph>
  </Section>
  <Section position="6" start_page="948" end_page="948" type="metho">
    <SectionTitle>
3 Ranking Criterion
</SectionTitle>
    <Paragraph position="0"> For ranking search results according to &amp;quot;relevance&amp;quot; to the user's query, several relevance measures have been proposed in the text-retrieval literature. The key element of these measures is weighting the contribution of individual keywords to the relevance-ranking score. Unfortunately, text ranking criteria are not directly applicable to retrieval of speech because recognition alternates and confidence scores are not considered.</Paragraph>
    <Paragraph position="1"> Luckily, this is less of an issue in our known-item style search, because the simplest of relevance measures can be used: A search hit is assumed relevant if the query phrase was indeed said there (and fulfills optional boolean constraints), and it is not relevant otherwise.</Paragraph>
    <Paragraph position="2"> This simple relevance measure, combined with a variant of the probability ranking principle (Robertson, 1977), leads to a system where phrase hits are ranked by their phrase posterior probability. This is derived through a Bayes-risk minimizing approach as follows:  1. Let the relevance be R(Q,hiti) of a returned audio hit - hiti to a user's query Q formally defined is 1 (match) if the hit is an occurrence of the query term with time boundaries (thitis ,thitie ), or 0 if not.</Paragraph>
    <Paragraph position="3"> 2. The user expects the system to return a list of audio hits, ranked such that the accumulative relevance of the top n hits (hit1...hitn), averaged over a range of</Paragraph>
    <Paragraph position="5"> Note that this is closely related to popular word-spotting metrics, such as the NIST (National Institute of Standards &amp; Technology) Figure Of Merit.</Paragraph>
    <Paragraph position="6"> To the retrieval system, the true transcription of each audio file is unknown, so it must maximize Eq. (1) in the sense of an expected value</Paragraph>
    <Paragraph position="8"> where O denotes the totality of all audio files (O for observation), W = (w1,w2,...,wN) a hypothesized transcription of the entire collection, and T = (t1,t2,...,tN+1) the associated time boundaries on a shared collection-wide time axis.</Paragraph>
    <Paragraph position="9"> RWT(*) shall be relevance w.r.t. the hypothesized transcription and alignment. The expected value is taken w.r.t. the posterior probability distribution P(WT|O) provided by our speech recognizer in the form of scored lattices. It is easy to see that this expression is maximal if the hits are ranked by their expected relevance EWT|O{RWT(Q,hiti)}. In our definition of relevance, RWT(Q,hiti) is written as</Paragraph>
    <Paragraph position="11"> and the expected relevance is computed as</Paragraph>
    <Paragraph position="13"> For single-word queries, this is the well-known word posterior probability (Wessel et al., 2000; Evermann et al., 2000). To cover multi-label phrase queries, we will call it phrase posterior probability.</Paragraph>
    <Paragraph position="14"> The formalism in this section is applicable to all sorts of units, such as fragments, syllables, or words. The transcription W and its units wk, as well as the query string Q, should be understood in this sense. For a regular word-level search, W and Q are just strings of words In the context of phonetic search, W and Q are strings of phonemes.</Paragraph>
    <Paragraph position="15"> For simplicity of notation, we have excluded the issue of multiple pronunciations of a word. Eq. (2) can be trivially extended by summing up over all alternative pronunciations of the query. And in a hybrid search, there would be multiple representations of the query, which are just as pronunciation variants.</Paragraph>
  </Section>
  <Section position="7" start_page="948" end_page="950" type="metho">
    <SectionTitle>
4 Word/Phoneme Hybrid Search
</SectionTitle>
    <Paragraph position="0"> For a combined word and phoneme based search, two problems need to be considered: * Recognizer configuration. While established solutions exist for word-lattice generation, what needs to be done for generating high-quality phoneme lattices? null * How should word and phoneme lattices be jointly represented for the purpose of search, and how should they be searched?</Paragraph>
    <Section position="1" start_page="948" end_page="949" type="sub_section">
      <SectionTitle>
4.1 Speech Recognition
</SectionTitle>
      <Paragraph position="0"> Word lattices are generated by a common speaker-independent large-vocabulary recognizer. Because the speaking style of conversations is very different from, say,  your average speech dictation system, specialized acoustic models are used. These are trained on conversational speech to match the speaking style. The vocabulary and the trigram language model are designed to cover a broad range of topics.</Paragraph>
      <Paragraph position="1"> The drawback of large-vocabulary recognition is, of course, that it is infeasible to have the vocabulary cover all possible keywords that a user may use, particularly proper names and specialized terminology.</Paragraph>
      <Paragraph position="2"> One way to address this out-of-vocabulary problem is to mine the user's documents or e-mails to adapt the recognizer's vocabulary. While this is workable for some scenarios, it is not a good solution e.g. when new words are frequently introduced in the conversations themselves rather than preceding written conversations, where the spelling of a new word is not obvious and thus inconsistent, or when documents with related documents are not easily available on the user's hard disk but would have to be specifically gathered by the user.</Paragraph>
      <Paragraph position="3"> A second problem is that the performance of state-of-the-art speech recognition relies heavily on a well-trained domain-matched language model. Mining user data can only yield a comparably small amount of training data.</Paragraph>
      <Paragraph position="4"> Adapting a language model with it would barely yield a robust language model for newly learned words, and their usage style may differ in conversational speech.</Paragraph>
      <Paragraph position="5"> For the above reasons, we decided not to attempt to adapt vocabulary and language model. Instead, we use a fixed broad-domain vocabulary and language model for large-vocabulary recognition, and augment this system with maintenance-free phonetic search to cover new words and mismatched domains.</Paragraph>
      <Paragraph position="6">  The simplest phonetic recognizer is a regular recognizer with the vocabulary replaced by the list of phonemes of the language, and the language model replaced by a phoneme M-gram. However, such phonetic language model is much weaker than a word language model. This results in poor accuracy and inefficient search.</Paragraph>
      <Paragraph position="7"> Instead, our recognizer uses &amp;quot;phonetic word fragments&amp;quot; (groups of phonemes similar to syllables or halfsyllables) as its vocabulary and in the language model. This provides phonotactic constraints for efficient decoding and accurate phoneme-boundary decisions, while remaining independent of any specific vocabulary. A set of about 600 fragments was automatically derived from the language-model training set by a bottom-up grouping procedure (Klakow, 1998; Ng, 2000; Seide and Yu, 2004). Example fragments are /-k-ih-ng/ (the syllable king), /ih-n-t-ax-r-/ (inter-), and /ih-z/ (the word is). With this, lattices are generated using the common Viterbi decoder with word-pair approximation (Schwartz et al., 1994; Ortmanns et al., 1996). The decoder has been modified to keep track of individual phoneme boundaries and scores. These are recorded in the lattices, while fragment-boundary information is discarded. This way, phoneme lattices are generated.</Paragraph>
      <Paragraph position="8"> In the results section we will see that, even with a well-trained domain-matching word-level language model, searching phoneme lattices can yield search accuracies comparable with word-level search, and that the best performance is achieved by combining both into a hybrid word/phoneme system.</Paragraph>
    </Section>
    <Section position="2" start_page="949" end_page="950" type="sub_section">
      <SectionTitle>
4.2 Unified Hybrid Lattice Representation
</SectionTitle>
      <Paragraph position="0"> Combining word and phonetic search is desirable because they are complementary: Word-based search yields better precision, but has a recall issue for unknown and rare words, while phonetic search has very good recall but suffers from poor precision especially for short words.</Paragraph>
      <Paragraph position="1"> Combining the two is not trivial. Several strategies are discussed in (Yu and Seide, 2004), including using a hybrid recognizer, combining lattices from two separate recognizers, and combining the results of two separate systems. Both hybrid recognizer configuration and lattice combination turned out difficult because of the different dynamic range of scores in word and phonetic paths.</Paragraph>
      <Paragraph position="2"> We found it beneficial to convert both lattices into posterior-based representations called posterior lattices first, which are then merged into a hybrid posterior lattice. Search is performed in a hybrid lattice in a unified manner using both phonetic and word representations as &amp;quot;alternative pronunciation&amp;quot; of the query, and summing up the resulting phrase posteriors.</Paragraph>
      <Paragraph position="3"> Posterior lattices are like regular lattices, except that they do not store acoustic likelihoods, language model probabilities, and precomputed forward/backward probabilities, but arc and node posteriors. An arc's posterior is the probability that the arc (with its associated word or phoneme hypothesis) lies on the correct path, while a node posterior is the probability that the correct path connects two word/phoneme hypotheses through this node.</Paragraph>
      <Paragraph position="4"> In our actual system, a node is only associated with a point in time, and the node posterior is the probability of having a word or phoneme boundary at its associated time point.</Paragraph>
      <Paragraph position="5"> The inclusion of node posteriors, which to our knowledge is a novel contribution of this paper, makes an exact computation of phrase posteriors from posterior lattices possible. In the following we will explain this in detail.</Paragraph>
      <Paragraph position="6">  A lattice L = (N,A,nstart,nend) is a directed acyclic graph (DAG) with N being the set of nodes, A is the set of arcs, and nstart, nend [?] N being the unique initial and unique final node, respectively. Nodes represent times and possibly context conditions, while arcs represent word or phoneme hypotheses.3 Each node n [?]N has an associated time t[n] and possibly an acoustic or language-model context condition.</Paragraph>
      <Paragraph position="7"> Arcs are 4-tuples a = (S[a],E[a],I[a],w[a]). S[a], E[a] 3Alternative definitions of lattices are possible, e.g. nodes representing words and arcs representing word transitions.</Paragraph>
      <Paragraph position="8">  [?] N denote the start and end node of the arc. I[a] is the arc label4, which is either a word (in word lattices) or a phoneme (in phonetic lattices). Last, w[a] shall be a weight assigned to the arc by the recognizer. Specifically, w[a]=pac(a)1/l*PLM(a) with acoustic likelihood pac(a), language model probability PLM, and language-model weight l.</Paragraph>
      <Paragraph position="9"> In addition, we define paths pi = (a1,*** ,aK) as sequences of connected arcs. We use the symbols S, E, I, and w for paths as well to represent the respective properties for entire paths, i.e. the path start node S[pi] = S[a1], path end node E[pi] = E[aK], path label sequence I[pi] = (I[a1],*** ,I[aK]), and total path weight w[pi] = producttextKk=1 w[ak].</Paragraph>
      <Paragraph position="10"> Finally, we define P(n1,n2) as the entirety of all paths that start at node n1 and end in node n2: P(n1,n2) =</Paragraph>
      <Paragraph position="12"> With this, the phrase posteriors defined in Eq. 2 can be written as follows.</Paragraph>
      <Paragraph position="13"> In the simplest case, Q is a single word token. Then, the phrase posterior is just the word posterior and, as shown in e.g. (Wessel et al., 2000) or (Evermann et al., 2000), can be computed as</Paragraph>
      <Paragraph position="15"> with the forward/backward probabilities an and bn defined as:</Paragraph>
      <Paragraph position="17"> an and bn can conveniently be computed from the word lattices by the well-known forward/backward recursion:</Paragraph>
      <Paragraph position="19"> 4Lattices are often interpreted as weighted finite-state acceptors, where the arc labels are the input symbols, hence the symbol I.</Paragraph>
      <Paragraph position="20"> Now, in the general case of multi-label queries, the phrase posterior can be computed as</Paragraph>
      <Paragraph position="22"> The posterior-lattice representation has several advantages over traditional lattices. First, lattice storage is reduced because only one value (node posterior) needs to be stored per node instead of two (a, b)6. Second, node and arc posteriors have a smaller and similar dynamic range than an, bn, and w[a], which is beneficial when the values should be stored with a small number of bits.</Paragraph>
      <Paragraph position="23"> Further, for the case of word-based search, the summation in Eq. 3 can also be precomputed by merging all lattice nodes that carry the same time label, and merging the corresponding arcs by summing up their arc posteriors.</Paragraph>
      <Paragraph position="24"> In such a &amp;quot;pinched&amp;quot; lattice, word posteriors for single-label queries can now be looked up directly. However, posteriors for multi-label strings cannot be computed precisely anymore. Our experiments have shown that the impact on ranking accuracy caused by this approximation is neglectable. Unfortunately, we have also found that the same is not true for phonetic search.</Paragraph>
      <Paragraph position="25"> The most important advantage of posterior lattices for our system is that they provide a way of combining the word and phoneme lattices into a single structure - by simply merging their start nodes and their end nodes. This allows to implement hybrid queries in a single unified search, treating the phonetic and the word-based representation of the query as alternative pronunciations.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="950" end_page="951" type="metho">
    <SectionTitle>
5 Lattice Indexing
</SectionTitle>
    <Paragraph position="0"> Searching lattices is time-consuming. It is not feasible to search large amounts of audio. To deal with hundreds or even thousands of hours of audio, we need some form of inverted indexing mechanism.</Paragraph>
    <Paragraph position="1"> This is comparably straight-forward when indexing text. It is also not difficult for indexing word lattices. In both case, the set of words to index is known. However, indexing phoneme lattices is very different, because theoretically any phoneme string could be an indexing item. 5Again, mind that in our lattice formulation word/phoneme hypotheses are represented by arcs, while nodes just represent connection points. The node posterior is the probability that the correct path passes through a connection point.</Paragraph>
    <Paragraph position="2"> 6Note, however, that storage for the traditional lattice can also be reduced to a single number per node by weight pushing (Saraclar and Sproat, 2004), using an algorithm that is very similar to the forward/backward procedure.</Paragraph>
    <Paragraph position="3">  We address this by our M-gram lattice-indexing scheme. It was originally designed for phoneme lattices, but can be - and is actually - used in our system for indexing word lattices.</Paragraph>
    <Paragraph position="4"> First, audio files are clipped into homogeneous segments. For an audio segment i, we define the expected term frequency (ETF) of a query string Q as summation of phrase posteriors of all hits in this segment:</Paragraph>
    <Paragraph position="6"> with Pi being the set of all paths of segment i.</Paragraph>
    <Paragraph position="7"> At indexing time, ETFs of a list of M-grams for each segment are calculated. They are stored in an inverted structure that allows retrieval by M-gram.</Paragraph>
    <Paragraph position="8"> In search time, the ETFs of the query string are estimated by the so-called &amp;quot;M-gram approximation&amp;quot;. In order to explain this concept, we need to first introduce P(Q|Oi) - the probability of observing query string Q at any word boundary in the recording Oi. P(Q|Oi) has a</Paragraph>
    <Paragraph position="10"> with ~Ni being the expected number of words in the segment i. It can also be computed as ~Ni = summationdisplay n[?]Ni p[n], where Ni is the node set for segment i. Like the M-gram approximation in language-model theory, we approximate P(Q|Oi) as</Paragraph>
    <Paragraph position="12"> while the right-hand items can be calculated from M-gram ETFs:</Paragraph>
    <Paragraph position="14"> The actual implementation uses only M-grams extracted from a large background dictionary, with a simple backoff strategy for unseen M-grams, see (Yu et al., 2005) for details.</Paragraph>
    <Paragraph position="15"> The resulting index is used in a two stage-search manner: The index itself is only used as the first stage to determine a short-list of promising segments that may contain the query. The second stage involves a linear lattice search to get final results.</Paragraph>
  </Section>
class="xml-element"></Paper>