Improvement of a Whole Sentence Maximum Entropy Language Model Using Grammatical Features

2 Whole Sentence Maximum Entropy Model

The whole sentence Maximum Entropy (WSME) model directly models the probability distribution of the complete sentence (by sentence, we understand any sequence of linguistic units that belongs to a certain vocabulary). The WSME language model has the form given in (3).

In order to simplify the notation we write $\alpha_i = e^{\lambda_i}$ and define

$$P(s) \;=\; \frac{1}{Z}\, p_0(s) \prod_{i=1}^{N} \alpha_i^{f_i(s)} \qquad (7)$$

where $s$ is a sentence, $p_0$ is the initial (prior) distribution, the $f_i$ are the feature functions, $Z$ is the global normalization constant, and the $\alpha_i$ are now the parameters to be learned.

The training procedure used to estimate the parameters of the model is the Improved Iterative Scaling (IIS) algorithm (Della Pietra et al., 1995). IIS is based on the change of the log-likelihood over the training corpus $T$ when each of the parameters changes from $\lambda_i$ to $\lambda_i + \delta_i$, $\delta_i \in \mathbb{R}$. Mathematical considerations on the change in the log-likelihood give the training equation:

$$\sum_{s} P(s)\, f_i(s)\, e^{\delta_i f_{\#}(s)} \;=\; \frac{1}{|T|}\sum_{s \in T} f_i(s), \qquad \text{where } f_{\#}(s) = \sum_{j=1}^{N} f_j(s). \qquad (8)$$

In each iteration of the IIS, we have to find the value of the improvement $\delta_i$ of the parameters, solving (8) with respect to $\delta_i$ for each $i = 1, \ldots, N$.

The main obstacle in the WSME training process resides in the calculation of the first sum in (8). The sum extends over all the sentences $s$ of a given length, and the great number of such sentences makes it impossible, from a computational point of view, to calculate the sum even for a moderate length. Nevertheless, such a sum is the statistical expected value of a function of $s$ with respect to the distribution $P$, namely $E_P[f_i(s)\, e^{\delta_i f_{\#}(s)}]$. As is well known, it can be estimated using the sampling expectation as

$$E_P[f_i(s)\, e^{\delta_i f_{\#}(s)}] \;\approx\; \frac{1}{M}\sum_{j=1}^{M} f_i(s_j)\, e^{\delta_i f_{\#}(s_j)},$$

where $s_1, \ldots, s_M$ is a random sample drawn from $P$.

Note that in (7) the constant $Z$ is unknown, so direct sampling from $P$ is not possible. In sampling from such types of probability distributions, Monte Carlo Markov Chain (MCMC) sampling methods have been successfully used when the distribution is not completely known (Neal, 1993). MCMC methods are based on the convergence of certain Markov chains to a target distribution $P$: a path of the Markov chain is run for a long time, after which the visited states are taken as sampling elements. MCMC sampling methods have been used in the parameter estimation of WSME language models, especially the Independence Metropolis-Hastings (IMH) and the Gibbs sampling algorithms (Chen and Rosenfeld, 1999a; Rosenfeld, 1997). The best results have been obtained using the IMH algorithm.
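As an illustration of how the two sides of (8) can be estimated from data, the following is a minimal sketch, not the implementation used in the paper; it assumes binary feature functions, a list of sentences already drawn from the current model $P$ (for instance with the IMH sampler), and illustrative names throughout:

```python
import math

def f_sharp(sentence, features):
    """f#(s) = sum_j f_j(s): total number of active features in the sentence."""
    return sum(f(sentence) for f in features)

def model_expectations(sample, features, deltas):
    """Monte-Carlo estimate of E_P[f_i(s) * exp(delta_i * f#(s))] for every
    feature i (the first sum in the IIS training equation (8)), computed
    from a sample s_1..s_M assumed to be drawn from the current model P."""
    M = len(sample)
    estimates = []
    for i, f_i in enumerate(features):
        total = sum(f_i(s) * math.exp(deltas[i] * f_sharp(s, features))
                    for s in sample)
        estimates.append(total / M)
    return estimates

def empirical_expectations(corpus, features):
    """Right-hand side of (8): relative frequency of each feature over the
    training corpus T."""
    return [sum(f(s) for s in corpus) / len(corpus) for f in features]
```

In each IIS iteration one would then search, for instance by bisection or Newton's method, for the $\delta_i$ that makes the two estimates coincide.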
Although MCMC performs well, the distribution from which the sample is obtained is only an approximation of the target sampling distribution. Therefore, samples obtained from such distributions may introduce some bias into sample statistics, such as the sampling mean. Recently, another sampling technique which is also based on Markov chains was developed by Propp and Wilson (Propp and Wilson, 1996): the Perfect Sampling (PS) technique, which is based on the concept of Coupling From the Past.

In PS, several paths of the Markov chain are run from the past, one path starting in each state of the chain. All the paths use the same set of random numbers to transit from one state to another; thus, if two paths coincide in the same state at some time $t$, they remain in the same states for the rest of the time. In such a case, we say that the two paths have collapsed.

Now, if all the paths have collapsed at a given time, from that point on we are sure that we are sampling from the true target distribution $P$. The Coupling From the Past algorithm systematically goes further into the past, running paths from all states, and repeats this procedure until a starting time $-T$ has been found such that all the paths that begin at time $-T$ have collapsed by time $t = 0$. Then we run a path of the chain from the state at time $-T$ up to the current time ($t = 0$), and the last state reached is a sample from the target distribution. The reason for going from the past to the current time is technical, and is detailed in (Propp and Wilson, 1996). If the state space is huge (as is the case here, where the state space is the set of all sentences), we must define a stochastic order over the state space and then run only two paths: one beginning in the minimum state and the other in the maximum state, following the same mechanism described above until the two paths collapse. In this way, it is proved that we get a sample from the exact target distribution and not from an approximate distribution as in MCMC algorithms (Propp and Wilson, 1996). Thus, we hope that statistical parameter estimators computed from samples generated with perfect sampling may be less biased than those computed from samples generated with MCMC.

Recently, PS was successfully used to estimate the parameters of a WSME language model (Amaya and Benedí, 2000). In that work, a comparison was made between the performance of WSME models trained using MCMC and WSME models trained using PS. Features of n-grams and features of triggers were used in both kinds of models, and the WSME model trained with PS had the better performance. We therefore considered it appropriate to use PS in the training procedure of the WSME model.

The model parameters were completed with the estimation of the global normalization constant $Z$. Since

$$Z \;=\; \sum_{s} p_0(s) \prod_{i=1}^{N} \alpha_i^{f_i(s)} \;=\; E_{p_0}\Big[\prod_{i=1}^{N} \alpha_i^{f_i(s)}\Big],$$

we can estimate $Z$ using the sampling expectation

$$\hat{Z} \;=\; \frac{1}{M}\sum_{j=1}^{M} \prod_{i=1}^{N} \alpha_i^{f_i(s_j)},$$

where $s_1, \ldots, s_M$ is a random sample from $p_0$. Because we have total control over the distribution $p_0$, it is easy to sample from it in the traditional way.
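A minimal sketch of this estimate of $Z$ follows (illustrative names, binary features assumed; `sample_from_p0` stands for sentences drawn from the prior $p_0$ by ordinary direct sampling):

```python
def estimate_Z(sample_from_p0, features, alphas):
    """Estimate Z = E_{p0}[ prod_i alpha_i^{f_i(s)} ] by averaging the
    product of feature weights over sentences drawn directly from p0."""
    total = 0.0
    for s in sample_from_p0:
        weight = 1.0
        for f, alpha in zip(features, alphas):
            weight *= alpha ** f(s)   # alpha_i^{f_i(s)}
        total += weight
    return total / len(sample_from_p0)
```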
3 The grammatical features

The main goal of this paper is the incorporation of grammatical features into the WSME model. Grammatical information may be helpful in many applications of computational linguistics. The grammatical structure of the sentence provides long-distance information to the model, thereby complementing the information provided by other sources and improving the performance of the model. Grammatical features give a better weight to the parameters in grammatically correct sentences than in grammatically incorrect ones, thereby helping the model to assign better probabilities to correct sentences of the language of the application. To capture the grammatical information, we use Stochastic Context-Free Grammars (SCFGs).

Over the last decade, there has been an increasing interest in Stochastic Context-Free Grammars (SCFGs) for use in different tasks (K., 1979; Jelinek, 1991; Ney, 1992; Sakakibara, 1990). The reason for this can be found in the capability of SCFGs to model the long-term dependencies established between the different lexical units of a sentence, and in the possibility of incorporating the stochastic information that allows for an adequate modeling of the variability phenomena. Thus, SCFGs have been successfully used on limited-domain tasks of low perplexity. However, SCFGs work poorly for large-vocabulary, general-purpose tasks, because the parameter learning and the computation of word transition probabilities present serious problems for complex real tasks.

To capture the long-term relations and to solve the main problems derived from the use of SCFGs in large-vocabulary complex tasks, we consider the proposal in (Benedí and Sánchez, 2000): define a category-based SCFG and a probabilistic model of word distribution in the categories. The use of categories as terminals of the grammar reduces the number of rules to be taken into account and, thus, the time complexity of the SCFG learning procedure. The use of the probabilistic model of word distribution in the categories allows us to obtain the best derivation of the sentences in the application.

In practice, we have to solve two problems: the estimation of the parameters of the two models, and their integration to obtain the best derivation of a sentence.

The parameters of the two models are estimated from a training sample. Each word in the training sample has a part-of-speech tag (POStag) associated with it. These POStags are considered as word categories and are the terminal symbols of our SCFG.

Given a category, the probability distribution of a word is estimated by means of the relative frequency of the word in the category, i.e. the relative frequency with which the word $w$ has been labelled with the POStag (a word $w$ may belong to different categories).

To estimate the SCFG parameters, several algorithms have been proposed (K. and S.J., 1991; Pereira and Schabes, 1992; Amaya et al., 1999; Sánchez and Benedí, 1999). Taking into account the good results achieved on real tasks, we used the algorithm of (Sánchez and Benedí, 1999) to learn our category-based SCFG.

To solve the integration problem, we used an algorithm that computes the probability of the best derivation that generates a sentence, given the category-based grammar and the model of word distribution in the categories (Benedí and Sánchez, 2000). This algorithm is based on the well-known Viterbi-like scheme for SCFGs.
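A minimal sketch of the word-distribution-in-categories model described above, assuming the training sample is given as sentences of (word, POStag) pairs (illustrative names, not the authors' code):

```python
from collections import Counter, defaultdict

def word_category_distribution(tagged_corpus):
    """Estimate P(word | category) by relative frequency.  The corpus is a
    list of sentences, each a list of (word, POStag) pairs; POStags play the
    role of categories, and a word may appear under several categories."""
    counts = defaultdict(Counter)
    for sentence in tagged_corpus:
        for word, tag in sentence:
            counts[tag][word] += 1

    distribution = {}
    for tag, word_counts in counts.items():
        total = sum(word_counts.values())
        distribution[tag] = {w: c / total for w, c in word_counts.items()}
    return distribution


# Toy example with a two-sentence sample:
corpus = [[("the", "DT"), ("dog", "NN"), ("runs", "VB")],
          [("the", "DT"), ("run", "NN")]]
print(word_category_distribution(corpus)["NN"])   # {'dog': 0.5, 'run': 0.5}
```

The resulting table plays the role of the word-given-category distribution when the best derivation of a sentence is computed with the Viterbi-like scheme.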
Once the grammatical framework is defined, we are in a position to make use of the information provided by the SCFG. In order to define the grammatical features, we first introduce some notation. We consider only context-free grammars in Chomsky normal form, that is, grammars with rules of the form $A \rightarrow BC$ or $A \rightarrow a$, where $A, B, C \in N$ (the non-terminal symbols) and $a \in \Sigma$ (the terminal symbols). An SCFG additionally defines a probability distribution over the grammar rules.

The grammatical features are defined as follows. Let $x = w_1 \ldots w_n$ be a sentence of the training set. As mentioned above, we can compute the best derivation of the sentence $x$ using the defined SCFG and obtain the parse tree of the sentence. Once we have the parse trees of all the sentences in the training corpus, we can collect the set of all the production rules used in the derivation of the sentences in the corpus.

Formally, for each sentence $x$ we define the set $R(x)$ of triples associated with the rules used in the derivation tree of $x$. We make use of a special symbol \$ which is neither in the terminals nor in the non-terminals: if a rule of the form $A \rightarrow a$ occurs in the derivation tree of $x$, the corresponding element of $R(x)$ is written as $(A, a, \$)$. The set $R = \bigcup_{x \in T} R(x)$ (the union taken over all the sentences of the corpus) is the set of grammatical features.

$R$ is the set representation of the grammatical information contained in the derivation trees of the sentences, and it may be incorporated into the WSME model by means of characteristic functions defined as

$$f_{(\alpha,\beta,\gamma)}(x) \;=\; \begin{cases} 1 & \text{if } (\alpha,\beta,\gamma) \in R(x) \\ 0 & \text{otherwise.} \end{cases}$$

Thus, whenever the WSME model processes a sentence $x$ and looks for a specific grammatical feature, say $(\alpha, \beta, \gamma)$, we obtain the derivation tree of $x$ and calculate the set $R(x)$ from it. Finally, the model asks whether the tuple $(\alpha, \beta, \gamma)$ is an element of $R(x)$. If it is, the feature is active; if not, the feature $(\alpha, \beta, \gamma)$ does not contribute to the sentence probability. Therefore, a sentence may be considered grammatically incorrect (relative to the SCFG used) if low-frequency derivations appear in it.
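To make the feature definition concrete, here is a minimal sketch under the following assumptions, which go slightly beyond what the text specifies: parse trees are nested tuples, a binary rule A -> B C contributes the triple (A, B, C), and a rule A -> a rewriting to a terminal (a POStag category) contributes (A, a, '$'); all names are illustrative.

```python
def rule_set(tree):
    """Collect R(x): the triples for the rules used in the derivation tree
    of a sentence x.  A node is (A, left, right) for a binary rule A -> B C,
    or (A, category) for a rule A -> a, padded with the special symbol '$'."""
    rules = set()

    def walk(node):
        if len(node) == 3:                      # binary rule A -> B C
            a, left, right = node
            rules.add((a, left[0], right[0]))
            walk(left)
            walk(right)
        else:                                   # rule A -> a (terminal = POStag)
            a, category = node
            rules.add((a, category, "$"))

    walk(tree)
    return rules


def characteristic_function(triple):
    """Characteristic function of one grammatical feature: returns 1 iff the
    triple occurs among the rules of the sentence's derivation tree."""
    return lambda tree: 1 if triple in rule_set(tree) else 0


# Example: a small derivation tree whose leaves are POStag categories.
tree = ("S", ("NP", "NN"), ("VP", ("V", "VB"), ("NP", "NN")))
f = characteristic_function(("S", "NP", "VP"))
print(rule_set(tree))   # {('S','NP','VP'), ('NP','NN','$'), ('VP','V','NP'), ('V','VB','$')}
print(f(tree))          # 1 -> the feature is active for this sentence
```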