<?xml version="1.0" standalone="yes"?> <Paper uid="A00-2021"> <Title>Exploiting auxiliary distributions in stochastic unification-based grammars</Title> <Section position="3" start_page="154" end_page="155" type="metho"> <SectionTitle> 2 Stochastic Unification-based Grammars </SectionTitle> <Paragraph position="0"> Most of the classes of probabilistic language models used in computational linguistic are exponential families. That is, the probability P(w) of a well-formed syntactic structure w E ~ is defined by a function of the form</Paragraph> <Paragraph position="2"> eters, Q is a function of w (which Jelinek (1997) calls a reference distribution when it is not an indicator function), and ZA = fn Q(w) ex'f(~)dw is a normalization factor called the partition function. (Note that a feature here is just a real-valued function of a syntactic structure w; to avoid confusion we use the term &quot;attribute&quot; to refer to a feature in a feature structure). If Q(w) = 1 then the class of exponential distributions is precisely the class of distributions with maximum entropy satisfying the constraint that the expected values of the features is a certain specified value (e.g., a value estimated from training data), so exponential models are sometimes also called &quot;Maximum Entropy&quot; models. For example, the class of distributions obtained by varying the parameters of a PCFG is an exponential family. In a PCFG each rule or production is associated with a feature, so m is the number of rules and the jth feature value fj (o.,) is the number of times the j rule is used in the derivation of the tree w E ~. Simple manipulations show that P,x (w) is equivalent to the PCFG distribution ifAj = logpj, where pj is the rule emission probability, and Q(w) = Z~ = 1.</Paragraph> <Paragraph position="3"> If the features satisfy suitable Markovian independence constraints, estimation from fully observed training data is straight-forward. For example, because the rule features of a PCFG meet &quot;context-free&quot; Markovian independence conditions, the well-known &quot;relative frequency&quot; estimator for PCFGs both maximizes the likelihood of the training data (and hence is asymptotically consistent and efficient) and minimizes the Kullback-Leibler divergence between training and estimated distributions.</Paragraph> <Paragraph position="4"> However, the situation changes dramatically if we enforce non-local or context-sensitive constraints on linguistic structures of the kind that can be expressed by a UBG. As Abney (1997) showed, under these circumstances the relative frequency estimator is in general inconsistent, even if one restricts attention to rule features.</Paragraph> <Paragraph position="5"> Consequently, maximum likelihood estimation is much more complicated, as discussed in section 2.2. Moreover, while rule features are natural for PCFGs given their context-free independence properties, there is no particular reason to use only rule features in Stochastic UBGs (SUBGs). Thus an SUBG is a triple (G, f, A), where G is a UBG which generates a set of well-formed linguistic structures i2, and f and A are vectors of feature functions and feature parameters as above. The probability of a structure w E ~ is given by (1) with Q(w) = 1. 
<Section position="1" start_page="154" end_page="155" type="sub_section">
<SectionTitle> 2.1 Stochastic Lexical-Functional Grammar </SectionTitle>
<Paragraph position="0"> Stochastic Lexical-Functional Grammar (SLFG) is a stochastic extension of Lexical-Functional Grammar (LFG), a UBG formalism developed by Kaplan and Bresnan (1982).</Paragraph>
<Paragraph position="1"> Given a base LFG, an SLFG is constructed by defining features which identify salient constructions in a linguistic structure (in LFG this is a c-structure/f-structure pair and its associated mapping; see Kaplan (1995)). Apart from the auxiliary distributions, we based our features on those used in Johnson et al. (1999), which should be consulted for further details. Most of these feature values range over the natural numbers, counting the number of times that a particular construction appears in a linguistic structure. For example, adjunct and argument features count the number of adjunct and argument attachments, permitting SLFG to capture a general argument attachment preference, while more specialized features count the number of attachments to each grammatical function (e.g., SUBJ, OBJ, COMP, etc.).</Paragraph>
<Paragraph position="2"> The flexibility of features in stochastic UBGs permits us to include features for relatively complex constructions, such as date expressions (it seems that date interpretations, if possible, are usually preferred), right-branching constituent structures (usually preferred) and non-parallel coordinate structures (usually dispreferred). Johnson et al. remark that they would have liked to have included features for lexical selectional preferences. While such features are perfectly acceptable in an SLFG, they felt that their corpora were so small that the large number of lexical dependency parameters could not be accurately estimated. The present paper proposes a method to address this by using an auxiliary distribution estimated from a corpus large enough to (hopefully) provide reliable estimates for these parameters.</Paragraph>
</Section>
<Section position="2" start_page="155" end_page="155" type="sub_section">
<SectionTitle> 2.2 Estimating stochastic unification-based grammars </SectionTitle>
<Paragraph position="0"> Suppose ω̃ = ω_1, ..., ω_n is a corpus of n syntactic structures. Letting f_j(ω̃) = Σ_{i=1}^{n} f_j(ω_i) and assuming each ω_i ∈ Ω, the likelihood of the corpus L_ω̃(λ) and its partial derivatives are:</Paragraph>
<Paragraph position="1"> L_ω̃(λ) = ∏_{i=1}^{n} P_λ(ω_i)   (2) </Paragraph>
<Paragraph position="2"> ∂ log L_ω̃(λ) / ∂λ_j = f_j(ω̃) − n E_λ(f_j)   (3) </Paragraph>
<Paragraph position="3"> where E_λ(f_j) is the expected value of f_j under the distribution P_λ. The maximum likelihood estimates are the λ which maximize (2), or equivalently, which make (3) zero, but as Johnson et al. (1999) explain, there seems to be no practical way of computing these for realistic SUBGs, since evaluating (2) and its derivatives (3) involves integrating over all syntactic structures Ω.</Paragraph>
<Paragraph position="4"> However, Johnson et al. observe that parsing applications require only the conditional probability distribution P_λ(ω | y), where y is the terminal string or yield being parsed, and that this can be estimated by maximizing the pseudo-likelihood of the corpus PL_ω̃(λ):</Paragraph>
<Paragraph position="5"> PL_ω̃(λ) = ∏_{i=1}^{n} P_λ(ω_i | y_i) = ∏_{i=1}^{n} e^{λ·f(ω_i)} / ∫_{Ω(y_i)} e^{λ·f(ω)} dω   (4) </Paragraph>
<Paragraph position="6"> where Ω(y_i) is the set of all syntactic structures in Ω with yield y_i (i.e., all parses of y_i generated by the base UBG). It turns out that calculating the pseudo-likelihood of a corpus only involves integrations over the sets of parses of its yields Ω(y_i), which is feasible for many interesting UBGs. Moreover, the maximum pseudo-likelihood estimator is asymptotically consistent for the conditional distribution P(ω | y). For the reasons explained in Johnson et al. (1999) we actually estimate λ by maximizing a regularized version of the log pseudo-likelihood (5), where σ_j is 7 times the maximum value of f_j found in the training corpus:</Paragraph>
<Paragraph position="7"> λ̂ = argmax_λ ( log PL_ω̃(λ) − Σ_{j=1}^{m} λ_j² / (2σ_j²) )   (5) </Paragraph>
<Paragraph position="8"> See Johnson et al. (1999) for details of the calculation of this quantity and its derivatives, and of the conjugate gradient routine used to calculate the λ which maximize the regularized log pseudo-likelihood of the training corpus.</Paragraph>
</Section>
</Section>
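The following sketch (not the authors' implementation; the data layout and names are assumptions) computes the negative regularized log pseudo-likelihood of (4)-(5) when each ambiguous sentence is represented by a matrix of per-parse feature vectors; a generic optimizer such as scipy.optimize.minimize could stand in for the conjugate gradient routine mentioned above:

```python
import numpy as np

def neg_reg_log_pseudo_likelihood(lam, sentences, sigma):
    """Negative regularized log pseudo-likelihood.

    sentences: list of (feature_matrix, correct_index) pairs, where feature_matrix
               has one row f(omega) per parse of the sentence's yield.
    sigma:     per-feature regularization constants (e.g. 7 * max feature value).
    """
    total = 0.0
    for feats, correct in sentences:
        scores = feats @ lam                 # lambda . f(omega) for each parse of the yield
        log_z = np.logaddexp.reduce(scores)  # log of the sum over all parses of the yield
        total += scores[correct] - log_z     # log P_lambda(correct parse | yield)
    return -(total - np.sum(lam ** 2 / (2.0 * sigma ** 2)))

# Toy usage: two sentences, three features; parse 0 is marked correct in each.
sentences = [(np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 2.0]]), 0),
             (np.array([[2.0, 1.0, 0.0], [1.0, 1.0, 1.0], [0.0, 2.0, 1.0]]), 0)]
sigma = 7.0 * np.array([2.0, 2.0, 2.0])
print(neg_reg_log_pseudo_likelihood(np.zeros(3), sentences, sigma))
```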
<Section position="4" start_page="155" end_page="156" type="metho">
<SectionTitle> 3 Auxiliary distributions </SectionTitle>
<Paragraph position="0"> We modify the estimation problem presented in section 2.2 by assuming that in addition to the corpus ω̃ and the m feature functions f we are given k auxiliary distributions Q_1, ..., Q_k, whose support includes Ω, that we suspect may be related to the joint distribution P(ω) or conditional distribution P(ω | y) that we wish to estimate. We do not require that the Q_j be probability distributions, i.e., it is not necessary that ∫_Ω Q_j(ω) dω = 1, but we do require that they are strictly positive (i.e., Q_j(ω) > 0 for all ω ∈ Ω).</Paragraph>
<Paragraph position="1"> We define k new features f_{m+1}, ..., f_{m+k}, where f_{m+j}(ω) = log Q_j(ω), which we call auxiliary features. The m+k parameters associated with the resulting m+k features can be estimated using any method for estimating the parameters of an exponential family with real-valued features (in our experiments we used the pseudo-likelihood estimation procedure reviewed in section 2.2). Such a procedure estimates parameters λ_{m+1}, ..., λ_{m+k} associated with the auxiliary features, so the estimated distributions take the form (6) (for simplicity we only discuss joint distributions here, but the treatment of conditional distributions is parallel).</Paragraph>
<Paragraph position="2"> P_λ(ω) = ( ∏_{j=1}^{k} Q_j(ω)^{λ_{m+j}} ) e^{Σ_{j=1}^{m} λ_j f_j(ω)} / Z_λ   (6) </Paragraph>
<Paragraph position="3"> Note that the auxiliary distributions Q_j are treated as fixed distributions for the purposes of this estimation, even though each Q_j may itself be a complex model obtained via a previous estimation process. Comparing (6) with (1), we see that the two equations become identical if the reference distribution Q in (1) is replaced by a geometric mixture of the auxiliary distributions Q_j, i.e., if:</Paragraph>
<Paragraph position="4"> Q(ω) = ∏_{j=1}^{k} Q_j(ω)^{λ_{m+j}}   (7) </Paragraph>
<Paragraph position="5"> The parameter associated with an auxiliary feature represents the weight of that feature in the mixture. If a parameter λ_{m+j} = 1 then the corresponding auxiliary distribution Q_j is equivalent to a reference distribution in Jelinek's sense, while if λ_{m+j} = 0 then Q_j is effectively ignored. Thus our approach can be regarded as a smoothed version of Jelinek's reference distribution approach, generalized to permit multiple auxiliary distributions.</Paragraph>
</Section>
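A small sketch of how the auxiliary distributions enter the model in practice (hypothetical array layout, assuming per-parse feature matrices as in the previous sketch): each Q_j contributes one extra column log Q_j(ω), and the ordinary estimation code is then run unchanged on the widened feature matrix.

```python
import numpy as np

def add_auxiliary_features(feature_matrix, aux_values):
    """Append auxiliary features f_{m+j}(omega) = log Q_j(omega) to each parse's features.

    feature_matrix: (num_parses, m) ordinary feature values.
    aux_values:     (num_parses, k) strictly positive values Q_j(omega);
                    they need not sum to one over parses or structures.
    """
    assert np.all(aux_values > 0.0), "auxiliary distributions must be strictly positive"
    return np.hstack([feature_matrix, np.log(aux_values)])

# The fitted weight lambda_{m+j} then acts as the exponent of Q_j in the geometric
# mixture Q(omega) = prod_j Q_j(omega)^{lambda_{m+j}}: a weight of 1 recovers Jelinek's
# reference-distribution use of Q_j, while a weight of 0 ignores Q_j entirely.
feats = np.array([[1.0, 0.0], [0.0, 1.0]])   # m = 2 ordinary features, 2 parses
q = np.array([[1e-3], [4e-3]])               # one auxiliary distribution (k = 1)
extended = add_auxiliary_features(feats, q)  # shape (2, 3)
```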
<Section position="5" start_page="156" end_page="156" type="metho">
<SectionTitle> 4 Lexical selectional preferences </SectionTitle>
<Paragraph position="0"> The auxiliary distribution we used here is based on the probabilistic model of lexical selectional preferences described in Rooth et al. (1999). An existing broad-coverage parser was used to find shallow parses (shallow compared to the LFG parses) for the 117 million word British National Corpus (Carroll and Rooth, 1998). We based our auxiliary distribution on the 3.7 million (g, r, a) tuples (belonging to 600,000 types) we extracted from these parses, where g is a lexical governor (for the shallow parses, g is either a verb or a preposition), a is the head of one of its NP arguments, and r is the grammatical relationship between the governor and argument (in the shallow parses r is always OBJ for prepositional governors, and r is either SUBJ or OBJ for verbal governors).</Paragraph>
<Paragraph position="1"> In order to avoid sparse data problems we smoothed this distribution over tuples as described in (Rooth et al., 1999). We assume that governor-relation pairs (g, r) and arguments a are independently generated from 25 hidden classes c, i.e.:</Paragraph>
<Paragraph position="2"> P(g, r, a) = Σ_{c=1}^{25} P(c) P(g, r | c) P(a | c) </Paragraph>
<Paragraph position="3"> where the class-conditional distributions are estimated from the training tuples using the Expectation-Maximization algorithm. While the hidden classes are not given any prior interpretation, they often cluster semantically coherent predicates and arguments, as shown in Figure 1.</Paragraph>
[Figure 1 caption: The matrix shows at the top the 30 most probable nouns in the P(a | 16) distribution and their probabilities, and at the left the 30 most probable verbs and prepositions listed according to P((g, r) | 16) and their probabilities. Dots in the matrix indicate that the respective pair was seen in the training data. Predicates with suffix :s indicate the subject slot of an intransitive or transitive verb; the suffix :o specifies the nouns in the corresponding row as objects of verbs or prepositions.]
<Paragraph position="4"> The smoothing power of a clustering model such as this can be calculated explicitly as the percentage of possible tuples which are assigned a non-zero probability. For the 25-class model we get a smoothing power of 99%, compared to only 1.7% using the empirical distribution of the training data.</Paragraph>
</Section>
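A toy illustration of the latent-class tuple model above (all parameters are invented for illustration; the real model uses 25 classes trained with EM on the BNC tuples, which is not shown here):

```python
import numpy as np

# Toy class-based model P(g,r,a) = sum_c P(c) P(g,r|c) P(a|c), with 2 classes,
# 2 (g,r) pairs and 2 arguments instead of the real 25-class model.
p_c = np.array([0.6, 0.4])                  # P(c)
p_gr_given_c = np.array([[0.7, 0.1],        # rows: (g,r) pairs, columns: classes
                         [0.3, 0.9]])
p_a_given_c = np.array([[0.8, 0.2],         # rows: arguments, columns: classes
                        [0.2, 0.8]])

def p_tuple(gr, a):
    """P(g,r,a) under the latent-class model."""
    return float(np.sum(p_c * p_gr_given_c[gr] * p_a_given_c[a]))

def p_arg_given_gov(a, gr):
    """P(a | g,r), the conditional used later for the auxiliary feature."""
    return p_tuple(gr, a) / sum(p_tuple(gr, a2) for a2 in range(p_a_given_c.shape[0]))

# "Smoothing power": fraction of possible (g,r,a) tuples assigned non-zero probability.
joint = (p_gr_given_c * p_c) @ p_a_given_c.T   # matrix of P(g,r,a) values
print(np.mean(joint > 0.0))                    # 1.0 for this toy; 99% reported for the 25-class model
```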
<Section position="6" start_page="156" end_page="159" type="metho">
<SectionTitle> 5 Empirical evaluation </SectionTitle>
<Paragraph position="0"> Hadar Shemtov and Ron Kaplan at Xerox PARC provided us with two LFG-parsed corpora called the Verbmobil corpus and the Homecentre corpus. These contain parse forests for each sentence (packed according to the scheme described in Maxwell and Kaplan (1995)), together with a manual annotation as to which parse is correct. The Verbmobil corpus contains 540 sentences relating to appointment planning, while the Homecentre corpus contains 980 sentences from Xerox documentation on their "homecentre" multifunction devices. Xerox did not provide us with the base LFGs for intellectual property reasons, but from inspection of the parses it seems that slightly different grammars were used with each corpus, so we did not merge the corpora. We chose the features of our SLFG solely on the basis of the Verbmobil corpus, so the Homecentre corpus can be regarded as a held-out evaluation corpus.</Paragraph>
<Paragraph position="1"> We discarded the unambiguous sentences in each corpus for both training and testing (as explained in Johnson et al. (1999), pseudo-likelihood estimation ignores unambiguous sentences), leaving us with 324 ambiguous sentences in the Verbmobil corpus and 481 in the Homecentre corpus; these sentences had a total of 3,245 and 3,169 parses respectively.</Paragraph>
<Paragraph position="2"> The (non-auxiliary) features used were based on those described by Johnson et al. (1999). Different numbers of features were used with the two corpora because some of the features were generated semi-automatically (e.g., we introduced a feature for every attribute-value pair found in any feature structure), and "pseudo-constant" features (i.e., features whose values never differ on the parses of the same sentence) were discarded. We used 172 features in the SLFG for the Verbmobil corpus and 186 features in the SLFG for the Homecentre corpus.</Paragraph>
<Paragraph position="3"> We used three additional auxiliary features derived from the lexical selectional preference model described in section 4. These were defined in the following way. For each governing predicate g, grammatical relation r and argument a, let n_{(g,r,a)}(ω) be the number of times that the f-structure [PRED g, r [PRED a]] appears as a subgraph of the f-structure of ω, i.e., the number of times that a fills the grammatical role r of g. We used the lexical model described in the last section to estimate P(a | g, r), and defined our first auxiliary feature as f_l(ω) = log( P(g_0) ∏_{(g,r,a)} P(a | g, r)^{n_{(g,r,a)}(ω)} ), where g_0 is the predicate of the root feature structure. The justification for this feature is that if f-structures were in fact trees, f_l(ω) would be the logarithm of a probability distribution over them. The auxiliary feature f_l is defective in many ways. Because LFG f-structures are DAGs with reentrancies rather than trees, we double-count certain arguments, so f_l is certainly not the logarithm of a probability distribution (which is why we stressed that our approach does not require an auxiliary distribution to be a probability distribution).</Paragraph>
<Paragraph position="4"> The number of governor-argument tuples found in different parses of the same sentence can vary markedly. Since the conditional probabilities P(a | g, r) are usually very small, we found that f_l(ω) was strongly related to the number of tuples found in ω, so the parse with the smaller number of tuples usually obtains the higher f_l score. We tried to address this by adding two additional features. We set f_c(ω) to be the number of tuples in ω, i.e., f_c(ω) = Σ_{(g,r,a)} n_{(g,r,a)}(ω), and we set f_n(ω) = f_l(ω)/f_c(ω), i.e., f_n(ω) is the average log probability of a lexical dependency tuple under the auxiliary lexical distribution. We performed our experiments with f_l as the sole auxiliary feature, and with f_l, f_c and f_n as three auxiliary features.</Paragraph>
<Paragraph position="5"> Because our corpora were so small, we trained and tested these models using a 10-fold cross-validation paradigm; the cumulative results are shown in Table 1. On each fold we evaluated each model in two ways. The correct parses measure simply counts the number of test sentences for which the estimated model assigns its maximum parse probability to the correct parse, with ties broken randomly. The pseudo-likelihood measure is the pseudo-likelihood of the test set parses, i.e., the conditional probability of the test parses given their yields. We actually report the negative log of this measure, so a smaller score corresponds to better performance. The correct parses measure is most closely related to parser performance, but the pseudo-likelihood measure is more closely related to the quantity we are optimizing and may be more relevant to applications where the parser has to return a certainty factor associated with each parse.</Paragraph>
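A sketch of the two evaluation measures (assuming the same per-parse feature-matrix layout as the earlier sketches; random tie-breaking as described above):

```python
import numpy as np

def evaluate(lam, test_sentences, seed=0):
    """Return (correct-parses count, negative log pseudo-likelihood) on held-out data.

    test_sentences: list of (feature_matrix, correct_index) pairs, one row per parse.
    """
    rng = np.random.default_rng(seed)
    correct, neg_log_pl = 0, 0.0
    for feats, gold in test_sentences:
        scores = feats @ lam
        log_z = np.logaddexp.reduce(scores)
        neg_log_pl -= scores[gold] - log_z             # -log P_lambda(correct parse | yield)
        best = np.flatnonzero(scores == scores.max())  # indices of maximal-probability parses
        if rng.choice(best) == gold:                   # ties broken randomly
            correct += 1
    return correct, neg_log_pl
```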
<Paragraph position="6"> Table 1 also provides the number of indistinguishable sentences under each model. A sentence y is indistinguishable with respect to features f iff f(ω_c) = f(ω′) for some ω′ ∈ Ω(y) with ω′ ≠ ω_c, where ω_c is the correct parse of y, i.e., the feature values of the correct parse of y are identical to the feature values of some other parse of y. If a sentence is indistinguishable it is not possible to assign its correct parse a (conditional) probability higher than the (conditional) probability assigned to other parses, so all else being equal we would expect an SUBG with fewer indistinguishable sentences to perform better than one with more.</Paragraph>
<Paragraph position="7"> Adding auxiliary features reduced the already low number of indistinguishable sentences in the Verbmobil corpus by only 11%, while it reduced the number of indistinguishable sentences in the Homecentre corpus by 24%. This probably reflects the fact that the feature set was designed by inspecting only the Verbmobil corpus.</Paragraph>
<Paragraph position="8"> We must admit disappointment with these results. Adding auxiliary lexical features improves the correct parses measure only slightly, and degrades rather than improves performance on the pseudo-likelihood measure. Perhaps this is because adding auxiliary features increases the dimensionality of the feature vector f, so the pseudo-likelihood scores with different numbers of features are not strictly comparable. The small improvement in the correct parses measure is typical of the improvement we might expect to achieve by adding a "good" non-auxiliary feature, but given the importance usually placed on lexical dependencies in statistical models, one might have expected more improvement. Probably the poor performance is due in part to the fairly large differences between the parses from which the lexical dependencies were estimated and the parses produced by the LFG. LFG parses are very detailed, and many ambiguities depend on the precise grammatical relationship holding between a predicate and its argument. It could also be that better performance could be achieved if the lexical dependencies were estimated from a corpus more closely related to the actual test corpus. For example, the verb feed in the Homecentre corpus is used in the sense of "insert (paper into printer)", which hardly seems to be a prototypical usage.</Paragraph>
[Table 1 caption (partial): The features are described in the text. The column labelled "indistinguishable" gives the number of indistinguishable sentences with respect to each feature set, while "correct" and "- log PL" give the correct parses and pseudo-likelihood measures respectively.]
<Paragraph position="9"> Note that overall system performance is quite good; taking the unambiguous sentences into account, the combined LFG parser and statistical model finds the correct parse for 73% of the Verbmobil test sentences and 80% of the Homecentre test sentences. On just the ambiguous sentences, our system selects the correct parse for 56% of the Verbmobil test sentences and 59% of the Homecentre test sentences.</Paragraph>
</Section>
</Paper>