<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1122">
  <Title>Automatic Acquisition of Phrase Grammars for Stochastic Language Modeling</Title>
  <Section position="2" start_page="0" end_page="188" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Traditionally, n-gram language models implicitly assume words as the basic lexical units. However, certain word sequences (phrases) recur in constrained-domain languages and can be thought of as single lexical entries (e.g., by and large, I would like to, United States of America). A traditional word n-gram language model can benefit greatly from variable-length units, which capture long-spanning dependencies for any given order n of the model. Furthermore, language modeling based on longer units is applicable to languages that have no predefined notion of a word. However, the problem of data sparseness is more acute in phrase-based language models than in word-based language models. Clustering words into classes has been used to overcome data sparseness in word-based language models (et al., 1992; Kneser and Ney, 1993; Pereira et al., 1993; McCandless and Glass, 1993; Bellegarda et al., 1996; Saul and Pereira, 1997). Although automatically acquired phrases can later be clustered into classes to overcome data sparseness, we present a novel approach that constructs classes during the acquisition of phrases. This integration of phrase acquisition and class construction results in the acquisition of phrase-grammar fragments. In (Gorin, 1996; Arai et al., 1997), grammar-fragment acquisition is performed through Kullback-Leibler divergence techniques, with application to topic classification from text.</Paragraph>
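The idea of variable-length units can be illustrated with a minimal sketch: greedily merge the most frequent adjacent word pair into a single unit and repeat. This toy procedure (the corpus, the `min_count` threshold, and the fixed number of rounds are all illustrative assumptions, not the algorithm of Riccardi et al., 1997) shows how recurrent sequences such as "I would like to" collapse into one lexical entry:

```python
from collections import Counter

def acquire_phrases(corpus, min_count=2, rounds=3):
    """Toy phrase acquisition: repeatedly merge the single most
    frequent adjacent word pair into one unit. Purely illustrative;
    not the acquisition algorithm of the paper."""
    sents = [s.split() for s in corpus]
    for _ in range(rounds):
        pairs = Counter()
        for s in sents:
            pairs.update(zip(s, s[1:]))
        if not pairs:
            break
        (w1, w2), count = pairs.most_common(1)[0]
        if count < min_count:
            break
        merged = f"{w1}_{w2}"
        new_sents = []
        for s in sents:
            out, i = [], 0
            while i < len(s):
                if i + 1 < len(s) and s[i] == w1 and s[i + 1] == w2:
                    out.append(merged)  # replace the pair with one unit
                    i += 2
                else:
                    out.append(s[i])
                    i += 1
            new_sents.append(out)
        sents = new_sents
    return sents

corpus = ["i would like to make a call",
          "i would like to check my bill",
          "i would like a refund"]
print(acquire_phrases(corpus))
```

After three merge rounds, the recurrent prefix becomes the single unit `i_would_like_to`, so a trigram model over these units effectively conditions on a much longer word history.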
    <Paragraph position="1"> Although phrase-grammar fragments reduce the problem of data sparseness, they can result in over-generalization. For example, one of the classes induced in our experiments was C1 = {and, but, because}, which one might call the class of conjunctions. However, this class was part of a phrase-grammar fragment such as A T C1 T, which generates the phrases A T and T, A T but T, and A T because T: a clear case of over-generalization given our corpus. Hence we need to further stochastically separate the phrases generated by a phrase-grammar fragment. In this paper, we present our approach to integrating phrase acquisition and clustering, and our technique for specializing the acquired phrase fragments. We extensively evaluate the effectiveness of the phrase-grammar-based n-gram language model and demonstrate that it outperforms a phrase-based n-gram language model in an end-to-end evaluation of a spoken language application. The outline of the paper is as follows. In Section 2, we review the phrase acquisition algorithm presented in (Riccardi et al., 1997). In Section 3, we discuss our approaches to phrase acquisition and clustering. The algorithm integrating the phrase acquisition and clustering processes is presented in Section 4. The spoken language application for automatic call routing (How May I Help You? (HMIHY)) that is used for evaluating our approach, and the results of our experiments, are described in Section 5.</Paragraph>
  </Section>
</Paper>