<?xml version="1.0" standalone="yes"?>
<Paper uid="P01-1003">
  <Title>Improvement of a Whole Sentence Maximum Entropy Language Model Using Grammatical Features</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Language modeling is an important component in computational applications such as speech recognition, automatic translation, optical character recognition, information retrieval etc. (Jelinek, 1997; Borthwick, 1997). Statistical language models have gained considerable acceptance due to the efficiency demonstrated in the fields in which they have been applied (Bahal et al., 1983; Jelinek et al., 1991; Ratnapharkhi, 1998; Borthwick, 1999).</Paragraph>
    <Paragraph position="1"> Traditional statistical language models calculate the probability of a sentence a4 using the chain rule:</Paragraph>
    <Paragraph position="3"> a32This work has been partially supported by the Spanish CYCIT under contract (TIC98/0423-C06).</Paragraph>
    <Paragraph position="4"> a33Granted by Universidad del Cauca, Popay'an (Colombia) null where a30a34a26 a10 a12a15a14a35a19a21a19a21a19a22a12 a26a13a36 a14 , which is usually known as the history of a12 a26 . The effort in the language modeling techniques is usually directed to the estimation ofa5a7a6a13a12 a26a22a29a30a34a26 a8 . The language model defined by the expression a5a7a6a13a12 a26a22a29a30a34a26 a8 is named the conditional language model. In principle, the determination of the conditional probability in (1) is expensive, because the possible number of word sequences is very great. Traditional conditional language models assume that the probability of the word a12 a26 does not depend on the entire history, and the history is limited by an equivalence relation a37 , and (1) is rewritten as:</Paragraph>
    <Paragraph position="6"> The most commonly used conditional language model is the n-gram model. In the n-gram model, the history is reduced (by the equivalence relation) to the last a41a43a42a45a44 words. The power of the n-gram model resides in: its consistence with the training data, its simple formulation, and its easy implementation. However, the n-gram model only uses the information provided by the last a41a46a42a47a44 words to predict the next word and so only makes use of local information. In addition, the value of n must be low (a48a50a49 ) because for a41a52a51a45a49 there are problems with the parameter estimation.</Paragraph>
    <Paragraph position="7"> Hybrid models have been proposed, in an attempt to supplement the local information with long-distance information. They combine different types of models, like n-grams, with long-distance information, generally by means of linear interpolation, as has been shown in (Bellegarda, 1998; Chelba and Jelinek, 2000; Bened'i and S'anchez, 2000).</Paragraph>
    <Paragraph position="8"> A formal framework to include long-distance and local information in the same language model is based on the Maximum Entropy principle (ME). Using the ME principle, we can combine information from a variety of sources into the same language model (Berger et al., 1996; Rosenfeld, 1996). The goal of the ME principle is that, given a set of features (pieces of desired information contained in the sentence), a set of functions  a14a21a54a21a19a21a19a21a19 a53a9a55 (measuring the contribution of each feature to the model) and a set of constraints1, we have to find the probability distribution that satisfies the constraints and minimizes the relative entropy (Divergence of Kullback-Leibler) a56 a6a57a5 a29a40a29a5a59a58 a8 , with respect to the distribution a5a59a58 .</Paragraph>
    <Paragraph position="9"> The general Maximum Entropy probability distribution relative to a prior distribution a5 a58 is given</Paragraph>
    <Paragraph position="11"> where a61 is the normalization constant and a82 a26 are parameters to be found. The a82 a26 represent the contribution of each feature to the distribution.</Paragraph>
    <Paragraph position="12"> From (3) it is easy to derive the Maximum Entropy conditional language model (Rosenfeld, 1996): if a83 is the context space and a84 is the vocabulary, then a83 xa84 is the states space, and if</Paragraph>
    <Paragraph position="14"> where a97a98a6a13a85 a8 is the normalization constant depending on the context a85 . Although the conditional ME language model is more flexible than n-gram models, there is an important obstacle to its general use: conditional ME language models have a high computational cost (Rosenfeld, 1996), spe- null theoretical expectation and the empirical expectation over the training corpus.</Paragraph>
    <Paragraph position="15"> Although we can incorporate local information (like n-grams) and some kinds of long-distance information (like triggers) within the conditional ME model, the global information contained in the sentence is poorly encoded in the ME model, as happens with the other conditional models.</Paragraph>
    <Paragraph position="16"> There is a language model which is able to take advantage of the local information and at the same time allows for the use of the global properties of the sentence: the Whole Sentence Maximum Entropy model (WSME) (Rosenfeld, 1997). We can include classical information such us n-grams, distance n-grams or triggers and global properties of the sentence, as features into the WSME framework. Besides the fact that the WSME model training procedure is less expensive than the conditional ME model, the most important training step is based on well-developed statistical sampling techniques. In recent works (Chen and Rosenfeld, 1999a), WSME models have been successfully trained using features of n-grams and distance n-grams.</Paragraph>
    <Paragraph position="17"> In this work, we propose adding information to the WSME model which is provided by the grammatical structure of the sentence. The information is added in the form of features by means of a Stochastic Context-Free Grammar (SCFG).</Paragraph>
    <Paragraph position="18"> The grammatical information is combined with features of n-grams and triggers.</Paragraph>
    <Paragraph position="19"> In section 2, we describe the WSME model and the training procedure in order to estimate the parameters of the model. In section 3, we define the grammatical features and the way of obtaining them from the SCFG. Finally, section 4 presents the experiments carried out using a part of the Wall Street Journal in order evalute the behavior of this proposal.</Paragraph>
  </Section>
class="xml-element"></Paper>