<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-1055">
  <Title>Position Specific Posterior Lattices for Indexing Speech</Title>
  <Section position="4" start_page="444" end_page="445" type="metho">
    <SectionTitle>
3 Text Document Retrieval
</SectionTitle>
    <Paragraph position="0"> Probably the most widespread text retrieval model is the TF-IDF vector model (Baeza-Yates and Ribeiro-Neto, 1999). For a given query Q = q1 ...qi ...qQ and document Dj one calculates a similarity measure by accumulating the TF-IDF score wi,j for each query term qi, possibly weighted by a document specific weight:</Paragraph>
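The similarity measure described above has the standard TF-IDF form; a sketch in conventional notation, consistent with the definitions in the surrounding text:

```latex
S(D_j, Q) \;=\; \sum_{i=1}^{Q} w_{i,j},
\qquad
w_{i,j} \;=\; f_{i,j} \cdot \mathrm{idf}_i,
\qquad
\mathrm{idf}_i \;=\; \log \frac{N}{n_i}
```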
    <Paragraph position="2"> where fi,j is the normalized frequency of word qi in document Dj and the inverse document frequency for query term qi is idfi = log(N/ni), where N is the total number of documents in the collection and ni is the number of documents containing qi.</Paragraph>
    <Paragraph position="3"> The main criticism of the TF-IDF relevance score is that the query terms are assumed to be independent. Proximity information is not taken into account at all: e.g., whether the words LANGUAGE and MODELING occur next to each other or not in a document is not used for relevance scoring.</Paragraph>
    <Paragraph position="4"> Another issue is that query terms may be encountered in different contexts in a given document: title, abstract, author name, font size, etc. For hypertext document collections even more context information is available: anchor text, as well as other mark-up tags designating various parts of a given document being just a few examples. The TF-IDF ranking scheme completely discards such information although it is clearly important in practice.</Paragraph>
    <Section position="1" start_page="444" end_page="444" type="sub_section">
      <SectionTitle>
3.1 Early Google Approach
</SectionTitle>
      <Paragraph position="0"> Aside from using PageRank for relevance ranking, (Brin and Page, 1998) also makes heavy use of both proximity and context information when assigning a relevance score to a given document -- see Section 4.5.1 of (Brin and Page, 1998) for details.</Paragraph>
      <Paragraph position="1"> For each given query term qi one retrieves the list of hits corresponding to qi in document D. Hits can be of various types depending on the context in which the hit occurred: title, anchor text, etc. Each type of hit has its own type-weight and the type-weights are indexed by type.</Paragraph>
      <Paragraph position="2"> For a single word query, their ranking algorithm takes the inner-product between the type-weight vector and a vector consisting of count-weights (tapered counts such that the effect of large counts is discounted) and combines the resulting score with PageRank in a final relevance score.</Paragraph>
      <Paragraph position="3"> For multiple word queries, terms co-occurring in a given document are considered as forming different proximity-types based on their proximity, from adjacent to &amp;quot;not even close&amp;quot;. Each proximity type comes with a proximity-weight and the relevance score includes the contribution of proximity information by taking the inner product over all types, including the proximity ones.</Paragraph>
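The inner-product scoring described above can be sketched as follows. The specific type-weights and the shape of the count-weight taper are assumptions for illustration; (Brin and Page, 1998) do not publish exact values.

```python
import math

# Hypothetical type-weights; actual values in the early Google engine are unpublished.
TYPE_WEIGHTS = {"title": 5.0, "anchor": 4.0, "body": 1.0}

def count_weight(count, cap=8):
    # Tapered count: linear up to `cap`, then logarithmic, so large
    # counts contribute progressively less to the score.
    return min(count, cap) + math.log1p(max(0, count - cap))

def ir_score(hits_by_type):
    # Inner product of the type-weight vector with the count-weight vector.
    return sum(TYPE_WEIGHTS[t] * count_weight(c) for t, c in hits_by_type.items())

score = ir_score({"title": 1, "body": 12})
```

Proximity types would extend `TYPE_WEIGHTS` with entries per proximity bucket (adjacent, near, far), scored by the same inner product.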
    </Section>
    <Section position="2" start_page="444" end_page="445" type="sub_section">
      <SectionTitle>
3.2 Inverted Index
</SectionTitle>
      <Paragraph position="0"> Essential to fast retrieval on static document collections of medium to large size is the use of an inverted index. The inverted index stores a list of hits for each word in a given vocabulary. The hits are grouped by document. For each document, the list of hits for a given query term must include position -- needed to evaluate counts of proximity types -- as well as all the context information needed to calculate the relevance score of a given document using the scheme outlined previously. For details, the reader is referred to (Brin and Page, 1998), Section 4.</Paragraph>
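A minimal sketch of such an inverted index, storing per-document hit lists with position and context (names and layout are illustrative, not a specific engine's format):

```python
from collections import defaultdict

# word -> doc_id -> list of (position, context) hits
index = defaultdict(lambda: defaultdict(list))

def add_hit(word, doc_id, pos, context="body"):
    # Record one occurrence of `word` with its position and context type.
    index[word][doc_id].append((pos, context))

def hits(word, doc_id):
    # All hits for a query term within one document, as needed for scoring.
    return index[word].get(doc_id, [])

add_hit("language", 7, 3, "title")
add_hit("modeling", 7, 4)
```

Proximity between two query terms in a document is then evaluated by comparing positions across their hit lists.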
    </Section>
  </Section>
  <Section position="5" start_page="445" end_page="446" type="metho">
    <SectionTitle>
4 Position Specific Posterior Lattices
</SectionTitle>
    <Paragraph position="0"> As highlighted in the previous section, position information is crucial for being able to evaluate proximity information when assigning a relevance score to a given document.</Paragraph>
    <Paragraph position="1"> In the spoken document case, however, we are faced with a dilemma. On one hand, using the 1-best ASR output as the transcription to be indexed is sub-optimal due to the high WER, which is likely to lead to low recall: query terms that were in fact spoken are wrongly recognized and thus not retrieved.</Paragraph>
    <Paragraph position="2"> On the other hand, ASR lattices do have much better WER -- in our case the 1-best WER was 55% whereas the lattice WER was 30% -- but the position information is not readily available: it is easy to evaluate whether two words are adjacent but questions about the distance in number of links between the occurrences of two query words in the lattice are very hard to answer.</Paragraph>
    <Paragraph position="3"> The position information needed for recording a given word hit is not readily available in ASR lattices -- for details on the format of typical ASR lattices and the information stored in such lattices the reader is referred to (Young et al., 2002). To simplify the discussion let's consider that a traditional text-document hit for a given word consists of just (document id, position).</Paragraph>
    <Paragraph position="4"> The occurrence of a given word in a lattice obtained from a given spoken document is uncertain and so is the position at which the word occurs in the document.</Paragraph>
    <Paragraph position="5"> The ASR lattices do contain the information needed to evaluate proximity information, since on a given path through the lattice we can easily assign a position index to each link/word in the normal way.</Paragraph>
    <Paragraph position="6"> Each path occurs with a given posterior probability, easily computable from the lattice, so in principle one could index soft-hits which specify (document id, position, posterior probability).</Paragraph>
    <Paragraph position="8"> Since more than one path may contain the same word in the same position, one needs to sum over all possible paths in the lattice that contain a given word at a given position.</Paragraph>
    <Paragraph position="9"> A simple dynamic programming algorithm, a variation on the standard forward-backward algorithm, can be employed for performing this computation. The computation for the backward pass stays unchanged, whereas during the forward pass one needs to split the forward probability arriving at a given node n, alpha_n, according to the length l -- measured in the number of links along the partial path that contain a word; null (epsilon) links are not counted when calculating path length -- of the partial paths that start at the start node of the lattice and end at n.</Paragraph>
    <Paragraph position="11"> The backward probability beta_n has the standard definition (Rabiner, 1989).</Paragraph>
    <Paragraph position="12"> To formalize the calculation of the position-specific forward-backward pass, the initialization and one elementary forward step in the forward pass are carried out using Eq. (1) -- see Figure 1 for notation:</Paragraph>
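The length-split forward recursion described above can be sketched as follows, where alpha_n[l] is the forward probability mass of partial paths ending at node n with word-length l, n_0 is the start node, and the indicator excludes null (epsilon) links from the length count:

```latex
\alpha_{n_0}[l] \;=\;
\begin{cases}
  1, & l = 0\\
  0, & l \neq 0
\end{cases}
\qquad
\alpha_n[l] \;=\; \sum_{l_i:\ \mathrm{end}(l_i) = n}
  \alpha_{\mathrm{start}(l_i)}\!\big[\, l - \mathbf{1}(\mathrm{word}(l_i) \neq \epsilon) \,\big]
  \cdot P(l_i)
```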
    <Paragraph position="14"> The &amp;quot;probability&amp;quot; P(li) of a given link li is stored as a log-probability and commonly evaluated in the log domain:</Paragraph>
    <Paragraph position="16"> where logPAM(li) is the acoustic model score, logPLM(word(li)) is the language model score, LMw &gt; 0 is the language model weight, logPIP &gt; 0 is the &amp;quot;insertion penalty&amp;quot; and FLATw is a flattening weight. In N-gram lattices where N &gt;= 2, all links ending at a given node n must contain the same word word(n), so the posterior probability of a given word w occurring at a given position l can be easily calculated using:</Paragraph>
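The two quantities just described can be sketched as follows. This is a reconstruction in standard notation: the exact placement of the weights in the link score varies across ASR systems, and alpha/beta denote the length-split forward and standard backward probabilities:

```latex
\log P(l_i) \;=\; \mathrm{FLAT}_w \left[
  \frac{\log P_{AM}(l_i)}{\mathrm{LM}_w}
  \;+\; \log P_{LM}(\mathrm{word}(l_i))
  \;-\; \frac{\log P_{IP}}{\mathrm{LM}_w}
\right]

P(w, l \mid \mathrm{LAT}) \;=\;
  \frac{\displaystyle\sum_{n:\ \mathrm{word}(n) = w} \alpha_n[l] \cdot \beta_n}
       {\beta_{n_0}}
```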
    <Paragraph position="18"> The Position Specific Posterior Lattice (PSPL) is a representation of the P(w,l|LAT) distribution: for each position bin l store the words w along with their posterior probability P(w,l|LAT).</Paragraph>
  </Section>
  <Section position="6" start_page="446" end_page="447" type="metho">
    <SectionTitle>
5 Spoken Document Indexing and Search
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="446" end_page="446" type="sub_section">
      <SectionTitle>
Using PSPL
</SectionTitle>
      <Paragraph position="0"> Spoken documents rarely contain only speech. Often they have a title, author and creation date. There might also be a text abstract associated with the speech, video or even slides in some standard format. The idea of saving context information when indexing HTML documents and web pages can thus be readily used for indexing spoken documents, although the context information is of a different nature. As for the actual speech content of a spoken document, the previous section showed how ASR technology and PSPL lattices can be used to automatically convert it to a format that allows the indexing of soft hits -- a soft index stores the posterior probability along with the position information for term occurrences in a given document.</Paragraph>
    </Section>
    <Section position="2" start_page="446" end_page="446" type="sub_section">
      <SectionTitle>
5.1 Speech Content Indexing Using PSPL
</SectionTitle>
      <Paragraph position="0"> Speech content can be very long. In our case the speech content of a typical spoken document was approximately 1 hr long; it is customary to segment a given speech file into shorter segments.</Paragraph>
      <Paragraph position="1"> A spoken document thus consists of an ordered list of segments. For each segment we generate a corresponding PSPL lattice. Each document and each segment in a given collection are mapped to an integer value using a collection descriptor file which lists all documents and segments. Each soft hit in our index will store the PSPL position and posterior probability.</Paragraph>
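A sketch of such a soft index: each soft hit stores the segment, the PSPL position bin, and the posterior probability, keyed by word and document id (the layout is illustrative, not the exact on-disk format):

```python
from collections import defaultdict

# (word, doc_id) -> list of (segment, position_bin, posterior) soft hits
soft_index = defaultdict(list)

def add_soft_hit(word, doc_id, segment, pos, posterior):
    # Record one uncertain occurrence of `word` with its PSPL posterior.
    soft_index[(word, doc_id)].append((segment, pos, posterior))

def expected_count(word, doc_id):
    # Expected count of `word` in the document under the PSPL distribution:
    # sum of posteriors across all segments and position bins.
    return sum(p for _, _, p in soft_index[(word, doc_id)])

add_soft_hit("lattice", 0, 0, 2, 0.7)
add_soft_hit("lattice", 0, 1, 5, 0.2)
```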
    </Section>
    <Section position="3" start_page="446" end_page="447" type="sub_section">
      <SectionTitle>
5.2 Speech Content Relevance Ranking Using
PSPL Representation
</SectionTitle>
      <Paragraph position="0"> Consider a given query Q = q1 ...qi ...qQ and a spoken document D represented as a PSPL. Our ranking scheme follows the description in Section 3.1.</Paragraph>
      <Paragraph position="1"> The words in the document D clearly belong to the ASR vocabulary V whereas the words in the query may be out-of-vocabulary (OOV). As argued in Section 2, the query-OOV rate is an important factor in evaluating the impact of having a finite ASR vocabulary on the retrieval accuracy. We assume that the words in the query are all contained in V; OOV words are mapped to UNK and cannot be matched in any document D.</Paragraph>
      <Paragraph position="2"> For all query terms, a 1-gram score is calculated by summing the PSPL posterior probability across all segments s and positions k. This is equivalent to calculating the expected count of a given query term qi according to the PSPL probability distribution P(wk(s)|D) for each segment s of document D. The results are aggregated in a common value S1-gram(D,Q).</Paragraph>
      <Paragraph position="4"> Similar to (Brin and Page, 1998), logarithmic tapering is used to discount the effect of large counts in a given document.</Paragraph>
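The 1-gram score just described (expected counts under the PSPL distribution, with logarithmic tapering) can be sketched as:

```latex
S_{\text{1-gram}}(D, Q) \;=\; \sum_{i=1}^{Q}
  \log\!\left( 1 + \sum_{s} \sum_{k} P\big(w_k(s) = q_i \,\big|\, D\big) \right)
```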
      <Paragraph position="5"> Our current ranking scheme takes into account proximity in the form of matching N-grams present in the query. Similar to the 1-gram case, we calculate an expected tapered count for each N-gram qi ...qi+N-1 in the query and then aggregate the results in a common value SN-gram(D,Q) for each order N.</Paragraph>
      <Paragraph position="7"> The different proximity types, one for each N-gram order allowed by the query length, are combined by taking the inner product with a vector of weights.</Paragraph>
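A sketch of the N-gram score and of the final combination across proximity types, consistent with the description above (w_N denotes the weight of the order-N proximity type):

```latex
S_{N\text{-gram}}(D, Q) \;=\; \sum_{i=1}^{Q-N+1}
  \log\!\left( 1 + \sum_{s} \sum_{k}
    \prod_{j=0}^{N-1} P\big(w_{k+j}(s) = q_{i+j} \,\big|\, D\big) \right)

S(D, Q) \;=\; \sum_{N=1}^{Q} w_N \cdot S_{N\text{-gram}}(D, Q)
```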
      <Paragraph position="9"> Only documents containing all the terms in the query are returned. In the current implementation the weights increase linearly with the N-gram order.</Paragraph>
      <Paragraph position="10"> Clearly, better weight assignments must exist, and as the hit types are enriched beyond using just N-grams, the weights will have to be determined using machine learning techniques.</Paragraph>
      <Paragraph position="11"> It is worth noting that the transcription for any given segment can also be represented as a PSPL with exactly one word per position bin. It is easy to see that in this case the relevance scores calculated according to Eq. (3-4) reduce to the ones specified in Section 3.1.</Paragraph>
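The full scoring scheme of Eq. (3-4) can be sketched end-to-end. A PSPL is modeled here as a list of segments, each segment a list of position bins mapping word to posterior; the linearly increasing weights w_N = N follow the current implementation described above:

```python
import math

def ngram_score(pspl, query, n):
    # Expected tapered count of every order-n query N-gram under the PSPL
    # distribution, summed over query positions (Eq. 3 in sketch form).
    total = 0.0
    for i in range(len(query) - n + 1):
        expected = 0.0
        for segment in pspl:
            for k in range(len(segment) - n + 1):
                p = 1.0
                for j in range(n):
                    # Posterior that bin k+j of this segment holds query word i+j.
                    p *= segment[k + j].get(query[i + j], 0.0)
                expected += p
        total += math.log1p(expected)
    return total

def relevance(pspl, query):
    # Combine proximity types with linearly increasing weights w_N = N (Eq. 4).
    return sum(n * ngram_score(pspl, query, n) for n in range(1, len(query) + 1))

pspl = [[{"position": 0.8, "posterior": 0.2}, {"specific": 1.0}]]
score = relevance(pspl, ["position", "specific"])
```

This sketch omits the filtering step from the text (returning only documents that contain all query terms); a one-word-per-bin PSPL with posterior 1.0 reduces it to plain transcript scoring.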
    </Section>
  </Section>
</Paper>