<?xml version="1.0" standalone="yes"?> <Paper uid="P00-1061"> <Title>Lexicalized Stochastic Modeling of Constraint-Based Grammars using Log-Linear Measures and EM Training</Title>
<Section position="4" start_page="0" end_page="0" type="intro">
<SectionTitle> 2 Incomplete-Data Estimation for Log-Linear Models 2.1 Log-Linear Models </SectionTitle>
<Paragraph position="0"> A log-linear distribution $p_\lambda(x)$ on the set of analyses $X$ of a constraint-based grammar can be defined as follows: $$p_\lambda(x) = Z_\lambda^{-1} \, e^{\lambda \cdot \nu(x)} \, p_0(x),$$ where $Z_\lambda = \sum_{x \in X} e^{\lambda \cdot \nu(x)} p_0(x)$ is a normalizing constant, $\lambda = (\lambda_1, \ldots, \lambda_n) \in \mathbb{R}^n$ is a vector of log-parameters, $\nu = (\nu_1, \ldots, \nu_n)$ is a vector of property-functions $\nu_i : X \to \mathbb{R}$, $\lambda \cdot \nu(x) = \sum_{i=1}^{n} \lambda_i \nu_i(x)$ is their dot product, and $p_0$ is a fixed reference distribution.</Paragraph>
<Paragraph position="1"> The task of probabilistic modeling with log-linear distributions is to build salient properties of the data into the probability model as property-functions $\nu_1, \ldots, \nu_n$. For a given vector of property-functions, the task of statistical inference is to tune the parameters $\lambda$ to best reflect the empirical distribution of the training data.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 2.2 Incomplete-Data Estimation </SectionTitle>
<Paragraph position="0"> Standard numerical methods for statistical inference of log-linear models from fully annotated data (so-called complete data) are the iterative scaling methods of Darroch and Ratcliff (1972) and Della Pietra et al. (1997). For data consisting of unannotated sentences (so-called incomplete data), the iterative method of the EM algorithm (Dempster et al., 1977) has to be employed. However, since even complete-data estimation for log-linear models requires iterative methods, an application of EM to log-linear models results in an algorithm that is expensive because it is doubly-iterative. A singly-iterative algorithm interleaving EM and iterative scaling into a mathematically well-defined estimation method for log-linear models from incomplete data is the IM algorithm of Riezler (1999). Applying this algorithm to stochastic constraint-based grammars, we assume the following to be given: a training sample of unannotated sentences $y$ from a set $Y$, observed with empirical probability $\tilde{p}(y)$; a constraint-based grammar yielding a set $X(y)$ of parses for each sentence $y$; and a log-linear model $p_\lambda(\cdot)$ on the parses $X = \bigcup_{y \in Y} X(y)$ for the sentences in the training corpus, with known values of the property-functions $\nu$ and unknown values of $\lambda$.</Paragraph>
<Paragraph position="1"> The aim of incomplete-data maximum likelihood estimation (MLE) is to find a value $\lambda^*$ that maximizes the incomplete-data log-likelihood $$L(\lambda) = \sum_{y \in Y} \tilde{p}(y) \ln \sum_{x \in X(y)} p_\lambda(x).$$ Closed-form parameter-updates for this problem can be computed by the algorithm of Fig. 1 [Fig. 1, not reproduced here: input is a reference model $p_0$, the property-functions $\nu$, and the parse sets $X(y)$; until convergence, for $i$ from 1 to $n$, the parameters $\lambda_i$ are updated in closed form], where $$k_\lambda(x \mid y) = \frac{p_\lambda(x)}{\sum_{x' \in X(y)} p_\lambda(x')}$$ is the conditional probability of a parse $x$ given the sentence $y$ and the current parameter value $\lambda$.</Paragraph>
<Paragraph position="2"> Note that because of the restriction of $X$ to the parses obtainable by the grammar from the training corpus, we have a log-linear probability measure only on those parses and not on all possible parses of the grammar. We shall therefore speak of mere log-linear measures in our application to disambiguation.</Paragraph>
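<Paragraph position="3"> The following Python fragment is a minimal sketch of this estimation scheme rather than a reproduction of Fig. 1: it assumes a uniform reference model $p_0$ (which cancels in the normalized probabilities), represents each parse only by its property vector $\nu(x)$, uses toy data in which the property sum $\nu_\# = \sum_i \nu_i(x)$ is constant across parses, and takes the closed-form update to be a GIS-style step that moves each $\lambda_i$ by $1/\nu_\#$ times the log-ratio of the posterior-expected to the model-expected value of $\nu_i$. All identifiers and the data are illustrative.

import math

# Toy incomplete data: each sentence y is observed with empirical weight
# p_tilde(y) and comes with its parse set X(y); a parse is represented only
# by its property vector nu(x).  Property sums are constant (nu_# = 2) so
# that a GIS-style closed-form update applies.  A uniform reference model
# p_0 cancels in all normalized probabilities and is therefore omitted.
corpus = [
    # (p_tilde(y), [nu(x) for x in X(y)])
    (0.50, [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]),
    (0.25, [[2.0, 0.0, 0.0], [1.0, 0.0, 1.0]]),
    (0.25, [[0.0, 2.0, 0.0], [0.0, 1.0, 1.0]]),
]
n = 3            # number of property-functions
nu_hash = 2.0    # constant property sum  sum_i nu_i(x)

def dot(lam, nu):
    return sum(l * v for l, v in zip(lam, nu))

def im_step(lam):
    """One interleaved EM / iterative-scaling step."""
    # E-step quantities: conditional parse probabilities k_lambda(x|y) and
    # the posterior-expected property values  sum_y p_tilde(y) sum_x k(x|y) nu_i(x).
    expected = [0.0] * n
    for p_y, parses in corpus:
        weights = [math.exp(dot(lam, nu)) for nu in parses]
        z_y = sum(weights)
        for w, nu in zip(weights, parses):
            k = w / z_y                          # k_lambda(x|y)
            for i in range(n):
                expected[i] += p_y * k * nu[i]
    # Model expectations  sum_x p_lambda(x) nu_i(x)  over all parses in the corpus.
    all_parses = [nu for _, parses in corpus for nu in parses]
    z = sum(math.exp(dot(lam, nu)) for nu in all_parses)
    model = [0.0] * n
    for nu in all_parses:
        p_x = math.exp(dot(lam, nu)) / z         # p_lambda(x)
        for i in range(n):
            model[i] += p_x * nu[i]
    # M-step: single GIS-style closed-form update (constant nu_# assumed).
    return [lam[i] + (1.0 / nu_hash) * math.log(expected[i] / model[i])
            for i in range(n)]

lam = [0.0, 0.0, 0.0]    # uniform initialization (cf. Sec. 2.3)
for _ in range(50):
    lam = im_step(lam)
print(lam)

Each pass combines an E-step (the conditional probabilities $k_\lambda(x \mid y)$) with a single scaling update, which is what makes the procedure singly- rather than doubly-iterative.</Paragraph>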
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 2.3 Searching for Order in Chaos </SectionTitle>
<Paragraph position="0"> For incomplete-data estimation, the sequence of likelihood values produced by this procedure is guaranteed to converge to a critical point of the likelihood function $L$; this is shown for the IM algorithm in Riezler (1999). The process of finding likelihood maxima is chaotic in that the final likelihood value is extremely sensitive to the starting values of $\lambda$: limit points can be local maxima (or saddle points), which are not necessarily also global maxima. A way to search for order in this chaos is to search for starting values that are hopefully attracted by the global maximum of $L$. This problem can best be explained in terms of the minimum divergence paradigm (Kullback, 1959), which is equivalent to the maximum likelihood paradigm by the following theorem. Let $p[f] = \sum_{x} p(x) f(x)$ be the expectation of a function $f$ with respect to a distribution $p$. Then the probability distribution $p^*$ that minimizes the divergence $D(p \,\|\, p_0)$ to the reference distribution $p_0$, over the set of models $p$ to which the constraints $p[\nu_i] = q[\nu_i]$ (for the empirical distribution $q$ of the training data) are applied, is the maximum likelihood model in the parametric family of log-linear distributions based on $p_0$ and $\nu$. Clearly, this argument applies to both complete-data and incomplete-data estimation. Note that for a uniformly distributed reference model $p_0$, the minimum divergence model is a maximum entropy model (Jaynes, 1957); a one-line derivation of this fact is given below. In Sec. 4, we will demonstrate that a uniform initialization of the IM algorithm yields a significant improvement in likelihood maximization as well as in linguistic performance when compared to standard random initialization.</Paragraph>
</Section> </Section> </Paper>
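A one-line derivation makes the maximum entropy remark explicit (assuming a uniform reference distribution $p_0(x) = 1/|X|$ on a finite parse set $X$, and writing $H(p)$ for the entropy of $p$):

\[
D(p \,\|\, p_0) \;=\; \sum_{x \in X} p(x)\,\ln\frac{p(x)}{p_0(x)}
\;=\; \sum_{x \in X} p(x)\,\ln p(x) \;+\; \ln|X|
\;=\; \ln|X| \;-\; H(p),
\]

so, for uniform $p_0$, minimizing the divergence $D(p \,\|\, p_0)$ over any constraint set is the same as maximizing the entropy $H(p) = -\sum_{x} p(x) \ln p(x)$.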