<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-2003">
  <Title>Getting More Mileage from Web Text Sources for Conversational Speech Language Modeling using Class-Dependent Mixtures</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Collecting Text from the Web
</SectionTitle>
    <Paragraph position="0"> The amount of text available on the web is enormous (over 3 billion web pages are indexed via Google alone) and continues to grow. Most of the text on the web is non-conversational, but there is a fair amount of chat-like material that is similar to conversational speech though often omitting disfluencies. This was our primary target when extracting data from the web. Queries submitted to Google were composed of N-grams that occur most frequently in the switchboard training corpus, e.g. &amp;quot;I never thought I would&amp;quot;, &amp;quot;I would think so&amp;quot;, etc. We were searching for the exact match to one or more of these N-grams within the text of the web pages. Web pages returned by Google for the most part consisted of conversational style phrases like &amp;quot;we were friends but we don't actually have a relationship&amp;quot; and &amp;quot;well I actually I I really haven't seen her for years.&amp;quot; We used a slightly different search strategy when collecting topic-specific data. First we extended the base-line vocabulary with words from a small in-domain training corpus (Schwarm and Ostendorf, 2002), and then we used N-grams with these new words in our web queries, e.g. &amp;quot;wireless mikes like&amp;quot;, &amp;quot;I know that recognizer&amp;quot; for a meeting transcription task (Morgan et al, 2001). Web pages returned by Google mostly contained technical material related to topics similar to what was discussed in the meetings, e.g. &amp;quot;we were inspired by the weighted count scheme...&amp;quot;, &amp;quot;for our experiments we used the Bellman-Ford algorithm...&amp;quot;, etc.</Paragraph>
    <Paragraph position="1"> The retrieved web pages were filtered before their content could be used for language modeling. First we stripped the HTML tags and ignored any pages with a very high OOV rate. We then piped the text through a maximum entropy sentence boundary detector (Ratnaparkhi, 1996) and performed text normalization using NSW tools (Sproat et al, 2001).</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Class-dependent Mixture of LMs
</SectionTitle>
    <Paragraph position="0"> Linear interpolation is a standard approach to combining language models, where the probability of a word wi given history h is computed as a linear combination of the corresponding N-gram probabilities from S different models: p(wijh) = Ps2S sps(wijh). Depending on how much adaptation data is available it may be beneficial to estimate a larger number of mixture weights s (more than one per data source) in order to handle source mismatch, specifically letting the mixture weight depend on the contexth. One approach is to use a mixture weight corresponding to the source posterior probability s(h) = p(sjh) (Weintraub et al, 1996). Here, we instead choose to let the weight vary as a function of the previous word class, i.e. p(wijh) = Ps2S s(c(wi 1))ps(wijh), where classes c(wi 1) are part-of-speech tags except for the 100 most frequent words which form their own individual classes. Such a scheme can generalize across domains by tapping into the syntactic structure (POS tags), already shown to be useful for cross-domain language modeling (Iyer and Ostendorf, 1997), and at the same time target conversational speech since the top 100 words cover 70% of tokens in Switchboard training corpus.</Paragraph>
    <Paragraph position="1"> Combining several N-grams can produce a model with a very large number of parameters, which is costly in decoding. In such cases N-grams are typically pruned. Here we use entropy-based pruning (Stolcke, 1998) after mixing unpruned models, and reduce the model aggressively to about 15% of its original size. The same pruning parameters were applied to all models in our experiments.</Paragraph>
  </Section>
class="xml-element"></Paper>