<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-2003">
  <Title>Searching the Web by Voice</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Collocations
</SectionTitle>
    <Paragraph position="0"> A collocation is &amp;quot;an expression of two or more words that corresponds to some conventional way of saying things&amp;quot; (Manning and Sch&amp;quot;utze, 1999). Sometimes, the notion of collocation is defined in terms of syntax (by possible part-of-speech patterns) or in terms of semantics (requiring collocations to exhibit non-compositional meaning) (Smadja, 1993). We adopt an empirical approach and consider any sequence of words that co-occurs more often than chance a potential collocation. null</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 The Likelihood Ratio
</SectionTitle>
      <Paragraph position="0"> We adopted a method for collocation discovery based on the likelihood ratio (Dunning, 1993). Suppose we wish to test whether two words DB</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
BD
DB
BE
</SectionTitle>
    <Paragraph position="0"> form a collocation. Under the independence hypothesis we assume that the probability of observing the  B5. The likelihood ratio AL is calculated by dividing the likelihood of observing the data under the hypothesis of independence, C4B4C0</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
CX
</SectionTitle>
    <Paragraph position="0"> B5, by the likelihood of observing the data under the hypothesis that the words form a collocation, C4B4C0  B5, and compute the two likelihoods using the binomial distribution (see (Manning and Sch&amp;quot;utze, 1999) for details). If the likelihood ratio is small, then C0</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
CR
</SectionTitle>
    <Paragraph position="0"> explains the data much better than</Paragraph>
    <Paragraph position="2"> , and so the word sequence is likely to be a collocation. null</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Discovering Longer Collocations
</SectionTitle>
      <Paragraph position="0"> Two-word collocations can be discovered by carrying out the calculations described above for all frequent two-word sequences, ranking the sequences according to their likelihood ratios, and selecting all sequences with ratios below a threshold. Collocations are not limited to two words, however. We have extended Dunning's scheme to discover longer collocations by performing the likelihood ratio tests iteratively. The algorithm for this is shown below.</Paragraph>
      <Paragraph position="1">  1. Count occurrences of sequences of tokens (initially, words) for lengths of up to D2 tokens.</Paragraph>
      <Paragraph position="2"> 2. For each sequence CB BP DB</Paragraph>
      <Paragraph position="4"> of D2 tokens in the training data, let ALB4CBB5 be the greatest likelihood ratio found by considering all possible ways to split the D2-token sequence into two contiguous parts.</Paragraph>
      <Paragraph position="5">  3. Sort the D2-token sequences CB by ALB4CBB5, and designate the C3 D2 sequences with the lowest ALB4CBB5 values as collocations. 4. Re-tokenize the data by treating each collocation as a single token.</Paragraph>
      <Paragraph position="6"> 5. Set D2 BP D2A0BD.</Paragraph>
      <Paragraph position="7"> 6. Repeat through D2 BPBE.</Paragraph>
      <Paragraph position="9"> of desired collocations of length D2, are chosen manually. This algorithm solves two key problems in discovering longer collocations. The first problem concerns long word sequences that include shorter collocations. For example, consider the sequence New York flowers: this sequence does indeed occur together more often than chance, but if we identify New York as a collocation then including New York flowers as an additional collocation provides little additional benefit (as measured by the reduction in per-query perplexity).</Paragraph>
      <Paragraph position="10"> To solve this problem, step 2 in the collocation discovery algorithm considers all D2 A0 BD possible ways to divide a potential collocation of length D2 into two parts. For the case of New York flowers, this means considering the combinations New York + flowers and New + York flowers. The likelihood ratio used to decide whether the word sequence should be considered a collocation is the maximum of the ratios for all possible splits. Since flowers is close to independent from New York, the potential collocation is rejected.</Paragraph>
      <Paragraph position="11"> The second problem concerns subsequences of long collocations. For example, consider the collocation New York City. New York is a collocation in its own right, but York City is not. To distinguish between these two cases, we need to note that York City occurs more often than chance, but usually as part of the larger collocation New York City, while New York occurs more often than chance outside the larger collocation as well.</Paragraph>
      <Paragraph position="12"> The solution to this problem is to find larger collocations first, and to re-tokenize the data to treat collocations as a single token (step 4 above). In this way, after New York City is identified as a collocation, all instances of it are treated as a single token, and do not contribute to the counts for New York or York City. Since New York occurs outside the larger collocation, it is still correctly identified as a collocation, but York City drops out.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Implementing Voice Search
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Training and Test Data
</SectionTitle>
      <Paragraph position="0"> To create the various language models for the voice search system, we used training data consisting of 19.8 million query occurrences, with 12.6 million distinct queries. There were 54.9 million word occurrences, and 3.4 million distinct words. The evaluation data consisted of 2.5 million query occurrences, with 1.9 million distinct queries. It included 7.1 million word occurrences, corresponding to 750,000 distinct words.</Paragraph>
      <Paragraph position="1"> We used a vocabulary of 100,000 items (depending on the model, the vocabulary included words only, or words and collocations). The word with the lowest frequency occurred 31 times.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Constructing the Language Model
</SectionTitle>
      <Paragraph position="0"> The procedure for constructing the language model was as follows:  1. Obtain queries by extracting a sample from Google's query logs.</Paragraph>
      <Paragraph position="1"> 2. Filter out non-English queries by discarding queries that were made from abroad, requested result sets in foreign languages, etc. 3. Use Google's spelling correction mechanism to correct misspelled queries.</Paragraph>
      <Paragraph position="2"> 4. Create lists of collocations as described in Section 3 above.</Paragraph>
      <Paragraph position="3"> 5. Create the vocabulary consisting of the most frequent words and collocations.</Paragraph>
      <Paragraph position="4"> 6. Use a dictionary and an automatic text-tophonemes tool to obtain phonetic transcriptions for the vocabulary, applying a separate algorithm to special terms (such as acronyms, numerals, URLs, and filenames).</Paragraph>
      <Paragraph position="5"> 7. Estimate n-gram probabilities to create the language model.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 System Architecture
</SectionTitle>
      <Paragraph position="0"> Figure 1 presents an overview of the voice search system. The left-hand side of the diagram represents the off-line steps of creating the statistical language model. The language model is used with a commercially available speech recognition engine, which supplies the acoustic models and the decoder.</Paragraph>
      <Paragraph position="1"> The right-hand side of the diagram represents the run-time flow of a voice query. The speech recognition engine returns a list of the n-best recognition hypotheses. A disjunctive query is derived from this n-best list, and the query is issued to the Google search engine.</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Coverage and Perplexity Results
</SectionTitle>
    <Paragraph position="0"> We evaluated the coverage and perplexity of different language models. In our experiments, we varied the language models along two dimensions:</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Spelling Correction
Filtering and
</SectionTitle>
      <Paragraph position="0"> Context. We evaluated unigram, bigram, and tri-gram language models to see the effect of taking more context into account.</Paragraph>
      <Paragraph position="1"> Collocations. We evaluated language models whose vocabulary included only the 100,000 most frequent words, as well as models whose vocabulary included the most frequent words and collocations. Specifically, we ran the algorithm in Section 3.2 to obtain 5000 three-word collocations, and then 20,000 two-token collocations (which could contain two, four, or six words). To obtain the final vocabulary of 100,000 words and collocations, we tokenized the training corpus using a vocabulary with all 25,000 collocations, and then selected the 100,000 most frequent tokens. Most of the collocations were included in the final vocabulary.</Paragraph>
    </Section>
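    <Paragraph> The vocabulary selection step can be stated compactly; the sketch below assumes the training corpus has already been re-tokenized with all 25,000 collocations treated as single tokens, and is our own paraphrase rather than the authors' code.</Paragraph>
    <Paragraph>
from collections import Counter

def build_vocabulary(tokenized_corpus, size=100000):
    # tokenized_corpus: queries as lists of tokens, where each collocation
    # is already a single token (Section 3.2, step 4).
    counts = Counter()
    for query in tokenized_corpus:
        counts.update(query)
    # Keep the most frequent tokens, words and collocations alike.
    return [token for token, _ in counts.most_common(size)]
    </Paragraph>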
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Query Coverage
</SectionTitle>
      <Paragraph position="0"> We say that a vocabulary covers a query when all words (and collocations, if applicable) in the query are in the vocabulary. Table 1 summarizes the coverage of different-sized vocabularies composed of words, words + collocations, or entire queries.</Paragraph>
      <Paragraph position="1">  At a vocabulary size of 100,000 items, there is only a difference of 2.7% between an all-word vocabulary, and a vocabulary that includes words and collocations. Thus, using collocations does not result in a large loss of coverage.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Perplexity Results
</SectionTitle>
      <Paragraph position="0"> We compared the perplexity of different models with a 100,000 item vocabulary in two ways: by measuring the per-token perplexity, and by measuring the per-query perplexity. Per-token perplexity measures how well the language model is able to predict the next word (or collocation), while per-query perplexity measures the contribution of the language model to recognizing the entire query.</Paragraph>
      <Paragraph position="1"> To avoid complications related to out-of-vocabulary words, we computed perplexity only on queries covered by the vocabulary (79.2% of the test queries for the all-word vocabulary, and 76.9% for words plus collocations). The results are shown in Table 2.</Paragraph>
      <Paragraph position="2">  These results show that there is a large decrease in perplexity from the unigram model to the bigram model, but there is a much smaller decrease in perplexity in moving to a trigram model. Furthermore, the per-token perplexity of the unigram model with collocations is about 25% higher than that of the word-based unigram model. This shows that the distribution of the word plus collocation vocabulary is more random than the distribution of words alone.</Paragraph>
      <Paragraph position="3"> The bigram and trigram models exhibit the same effect. null</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 Per-Query Perplexity
</SectionTitle>
      <Paragraph position="0"> Per-query perplexity shows the gains from including collocations in the vocabulary. Using collocations means that the average number of tokens (words or collocations) per query decreases, which leads to less uncertainty per query, making recognition of entire queries significantly easier. For the unigram model, collocations lead to a reduction of per-query perplexity by a factor of 14. We can see that the per-query perplexity of the unigram model with collocations is about halfway between the word-based unigram and bigram models. In other words, collocations seem to give us about half the effect of word bigrams.</Paragraph>
      <Paragraph position="1"> Similarly, the per-query perplexity of the bigram model with collocations is very close to the perplexity of the word-based trigram model. Furthermore, moving from a collocation bigram model to a collocation trigram model only yields a small additional per-query perplexity decrease.</Paragraph>
    </Section>
  </Section>
  <Section position="10" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Recall Evaluation
</SectionTitle>
    <Paragraph position="0"> We also evaluated the recall of the voice search system using audio recordings that we collected for this purpose. Since only unigram models yielded close to real-time performance for the speech recognizer, we limited our attention to comparing unigram models with a vocabulary size of 100,000 items consisting of either words, or words and collocations. With these unigram models, the recognizer took only 1-2 seconds to process each query.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.1 Data Collection
</SectionTitle>
      <Paragraph position="0"> We collected voice query data using a prototype of the voice search system connected to the phone network. In total, 18 speakers made 809 voice queries.</Paragraph>
      <Paragraph position="1"> The collected raw samples exhibited a variety of problems, such as low volume, loud breath sounds, clicks, distortions, dropouts, initial cut-off, static, hiccups, and other noises. We set aside all samples with insurmountable problems and speakers with very strong accents. This left 581 good samples.</Paragraph>
      <Paragraph position="2"> These good samples include a variety of speakers, various brands of cell phones as well as desktop phones, and different cell phone carriers. The average length of the utterances was 2.1 words.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.2 Recall Results
</SectionTitle>
      <Paragraph position="0"> We used the 581 good audio samples from the data collection to evaluate recognition recall, for which we adopted a strict definition: disregarding singular/plural variations of nouns, did the recognizer return the exact transcription of the audio sample as one of the top D2 (1, 5, 10) hypotheses? Note that this recall metric incorporates coverage as well as accuracy: if a query contains a word not in the vocabulary, the recognizer cannot possibly recognize it correctly. The results are shown in Table 3.</Paragraph>
      <Paragraph position="1">  These results show that adding collocations to the recognition vocabulary leads to a recall improvement of 14-16 percentage points.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>