<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1014">
  <Title>Parsing the WSJ using CCG and Log-Linear Models</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Log-Linear Parsing Models
</SectionTitle>
    <Paragraph position="0"> Log-linear models (also known as Maximum Entropy models) are popular in NLP because of the ease with which discriminating features can be included in the model. Log-linear models have been applied to the parsing problem across a range of grammar formalisms, e.g. Riezler et al. (2002) and Toutanova et al. (2002). One motivation for using a log-linear model is that long-range dependencies which CCG was designed to handle can easily be encoded as features.</Paragraph>
    <Paragraph position="1"> A conditional log-linear model of a parse !2 , given a sentence S , is defined as follows:</Paragraph>
    <Paragraph position="3"> where :f(!) = Pi i fi(!). The function fi is a feature of the parse which can be any real-valued function over the space of parses . Each feature fi has an associated weight i which is a parameter of the model to be estimated. ZS is a normalising constant which ensures that P(!jS ) is a probability distribution:</Paragraph>
    <Paragraph position="5"> where (S ) is the set of possible parses for S .</Paragraph>
    <Paragraph position="6"> For the dependency model a parse, !, is ahd; i pair (as given in (1)). A feature is a count of the number of times some configuration occurs in d or the number of times some dependency occurs in .</Paragraph>
    <Paragraph position="7"> Section 6 gives examples of features.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 The Dependency Model
</SectionTitle>
      <Paragraph position="0"> We follow Riezler et al. (2002) in using a discriminative estimation method by maximising the conditional likelihood of the model given the data. For the dependency model, the data consists of sentences S 1;:::;S m, together with gold standard dependency structures, 1;:::; m. The gold standard structures are multisets of dependencies, as described earlier.</Paragraph>
      <Paragraph position="1"> Section 6 explains how the gold standard structures are obtained.</Paragraph>
      <Paragraph position="2"> The objective function of a model is the conditional log-likelihood, L( ), minus a Gaussian prior term, G( ), used to reduce overfitting (Chen and Rosenfeld, 1999). Hence, given the definition of the probability of a dependency structure (1), the objective function is as follows:</Paragraph>
      <Paragraph position="4"> where n is the number of features. Rather than have a different smoothing parameter i for each feature, we use a single parameter .</Paragraph>
      <Paragraph position="5"> We use a technique from the numerical optimisation literature, the L-BFGS algorithm (Nocedal and Wright, 1999), to optimise the objective function.</Paragraph>
      <Paragraph position="6"> L-BFGS is an iterative algorithm which requires the gradient of the objective function to be computed at each iteration. The components of the gradient vector are as follows:</Paragraph>
      <Paragraph position="8"> The first two terms in (5) are expectations of feature fi: the first expectation is over all derivations leading to each gold standard dependency structure; the second is over all derivations for each sentence in the training data. Setting the gradient to zero yields the usual maximum entropy constraints (Berger et al., 1996), except that in this case the empirical values are themselves expectations (over all derivations leading to each gold standard dependency structure). The estimation process attempts to make the expectations equal, by putting as much mass as possible on the derivations leading to the gold standard structures.1 The Gaussian prior term penalises any model whose weights get too large in absolute value.</Paragraph>
      <Paragraph position="9"> Calculation of the feature expectations requires summing over all derivations for a sentence, and summing over all derivations leading to a gold standard dependency structure. In both cases there can be exponentially many derivations, and so enumerating all derivations is not possible (at least for wide-coverage automatically extracted grammars).</Paragraph>
      <Paragraph position="10"> Clark and Curran (2003) show how the sum over the complete derivation space can be performed efficiently using a packed chart and a variant of the inside-outside algorithm. Section 5 shows how the same technique can also be applied to all derivations leading to a gold standard dependency structure.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 The Normal-Form Model
</SectionTitle>
      <Paragraph position="0"> The objective function and gradient vector for the normal-form model are as follows:</Paragraph>
      <Paragraph position="2"> context of LFG parsing.</Paragraph>
      <Paragraph position="3"> where d j is the the gold standard derivation for sentence S j and (S j) is the set of possible derivations for S j. Note that the empirical expectation in (7) is simply a count of the number of times the feature appears in the gold-standard derivations.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Packed Charts
</SectionTitle>
    <Paragraph position="0"> The packed charts perform a number of roles: they are a compact representation of a very large number of CCG derivations; they allow recovery of the highest scoring parse or dependency structure without enumerating all derivations; and they represent an instance of what Miyao and Tsujii (2002) call a feature forest, which is used to efficiently estimate a log-linear model. The idea behind a packed chart is simple: equivalent chart entries of the same type, in the same cell, are grouped together, and back pointers to the daughters indicate how an individual entry was created. Equivalent entries form the same structures in any subsequent parsing.</Paragraph>
    <Paragraph position="1"> Since the packed charts are used for model estimation and recovery of the highest scoring parse or dependency structure, the features in the model partly determine which entries can be grouped together. In this paper we use features from the dependency structure, and features defined on the local rule instantiations.2 Hence, any two entries with identical category type, identical head, and identical unfilled dependencies are equivalent. Note that not all features are local to a rule instantiation; for example, features encoding long-range dependencies may involve words which are a long way apart in the sentence.</Paragraph>
    <Paragraph position="2"> For the purposes of estimation and finding the highest scoring parse or dependency structure, only entries which are part of a derivation spanning the whole sentence are relevant. These entries can be easily found by traversing the chart top-down, starting with the entries which span the sentence. The entries within spanning derivations form a feature forest (Miyao and Tsujii, 2002). A feature forest</Paragraph>
    <Paragraph position="4"> The individual entries in a cell are conjunctive nodes, and the equivalence classes of entries are dis2By rule instantiation we mean the local tree arising from the application of a CCG combinatory rule.</Paragraph>
    <Paragraph position="5"> hC;D;R; ; iis a packed chart / feature forest G is a set of gold standard dependencies Let c be a conjunctive node Let d be a disjunctive node deps(c) is the set of dependencies on node c</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Efficient Estimation
</SectionTitle>
    <Paragraph position="0"> The L-BFGS algorithm requires the following values at each iteration: the expected value, and the empirical expected value, of each feature (to calculate the gradient); and the value of the likelihood function. For the normal-form model, the empirical expected values and the likelihood can easily be obtained, since these only involve the single gold-standard derivation for each sentence. The expected values can be calculated using the method in Clark and Curran (2003).</Paragraph>
    <Paragraph position="1"> For the dependency model, the computations of the empirical expected values (5) and the likelihood function (4) are more complex, since these require sums over just those derivations leading to the gold standard dependency structure. We will refer to such derivations as correct derivations.</Paragraph>
    <Paragraph position="2"> Figure 1 gives an algorithm for finding nodes in a packed chart which appear in correct derivations.</Paragraph>
    <Paragraph position="3"> cdeps(c) is the number of correct dependencies on conjunctive node c, and takes the value 1 if there are any incorrect dependencies on c. dmax(c) is 3A more complete description of CCG feature forests is given in Clark and Curran (2003).</Paragraph>
    <Paragraph position="4"> the maximum number of correct dependencies produced by any sub-derivation headed by c, and takes the value 1 if there are no sub-derivations producing only correct dependencies. dmax(d) is the same value but for disjunctive node d. Recursive definitions for calculating these values are given in Figure 1; the base case occurs when conjunctive nodes have no disjunctive daughters.</Paragraph>
    <Paragraph position="5"> The algorithm identifies all those root nodes heading derivations which produce just the correct dependencies, and traverses the chart top-down marking the nodes in those derivations. The insight behind the algorithm is that, for two conjunctive nodes in the same equivalence class, if one node heads a sub-derivation producing more correct dependencies than the other node (and each sub-derivation only produces correct dependencies), then the node with less correct dependencies cannot be part of a correct derivation.</Paragraph>
    <Paragraph position="6"> The conjunctive and disjunctive nodes appearing in correct derivations form a new correct feature forest. The correct forest, and the complete forest containing all derivations spanning the sentence, can be used to estimate the required likelihood value and feature expectations. Let E fi be the expected value of fi over the forest for model ; then the values in (5) can be obtained by calculating E j fi for the complete forest j for each sentence S j in the training data (the second sum in (5)), and also E j fi for each forest j of correct derivations (the first sum in (5)):</Paragraph>
    <Paragraph position="8"> The gold standard dependency structures are produced by running our CCG parser over the normal-form derivations in CCGbank (Hockenmaier, 2003a). Not all rule instantiations in CCGbank are instances of combinatory rules, and not all can be produced by the parser, and so gold standard structures were created for 85.5% of the sentences in sections 2-21 (33,777 sentences).</Paragraph>
    <Paragraph position="9"> The same parser is used to produce the packed charts. The parser uses a maximum entropy supertagger (Clark and Curran, 2004) to assign lexical categories to the words in a sentence, and applies the CKY chart parsing algorithm described in Steedman (2000). For parsing the training data, we ensure that the correct category is a member of the set assigned to each word. The average number of categories assigned to each word is determined by a parameter in the supertagger. For the first set of experiments, we used a setting which assigns 1.7 categories on average per word.</Paragraph>
    <Paragraph position="10"> The feature set for the dependency model consists of the following types of features: dependency features (with and without distance measures), rule instantiation features (with and without a lexical head), lexical category features, and root category features. Dependency features are the 5-tuples defined in Section 1. There are also three additional dependency feature types which have an extra distance field (and only include the head of the lexical category, and not the head of the argument); these count the number of words (0, 1, 2 or more), punctuation marks (0, 1, 2 or more), and verbs (0, 1 or more) between head and dependent. Lexical category features are word-category pairs at the leaf nodes, and root features are headword-category pairs at the root nodes. Rule instantiation features simply encode the combining categories together with the result category. There is an additional rule feature type which also encodes the lexical head of the resulting category. Additional generalised features for each feature type are formed by replacing words with their POS tags.</Paragraph>
    <Paragraph position="11"> The feature set for the normal-form model is the same except that, following Hockenmaier and Steedman (2002), the dependency features are defined in terms of the local rule instantiations, by adding the heads of the combining categories to the rule instantiation features. Again there are 3 additional distance feature types, as above, which only include the head of the resulting category. We had hoped that by modelling the predicate-argument dependencies produced by the parser, rather than local rule dependencies, we would improve performance.</Paragraph>
    <Paragraph position="12"> However, using the predicate-argument dependencies in the normal-form model instead of, or in addition to, the local rule dependencies, has not led to an improvement in parsing accuracy.</Paragraph>
    <Paragraph position="13"> Only features which occurred more than once in the training data were included, except that, for the dependency model, the cutoff for the rule features was 9 and the counting was performed across all derivations, not just the gold-standard derivation.</Paragraph>
    <Paragraph position="14"> The normal-form model has 482,007 features and the dependency model has 984,522 features.</Paragraph>
    <Paragraph position="15"> We used 45 machines of a 64-node Beowulf cluster to estimate the dependency model, with an average memory usage of approximately 550 MB for each machine. For the normal-form model we were able to reduce the size of the charts considerably by applying two types of restriction to the parser: first, categories can only combine if they appear together in a rule instantiation in sections 2-21 of CCGbank; and second, we apply the normal-form restrictions described in Eisner (1996). (See Clark and Curran (2004) for a description of the Eisner constraints.) The normal-form model requires only 5 machines for estimation, with an average memory usage of 730 MB for each machine.</Paragraph>
    <Paragraph position="16"> Initially we tried the parallel version of GIS described in Clark and Curran (2003) to perform the estimation, running over the Beowulf cluster.</Paragraph>
    <Paragraph position="17"> However, we found that GIS converged extremely slowly; this is in line with other recent results in the literature applying GIS to globally optimised models such as conditional random fields, e.g. Sha and Pereira (2003). As an alternative to GIS, we have implemented a parallel version of our L-BFGS code using the Message Passing Interface (MPI) standard.</Paragraph>
    <Paragraph position="18"> L-BFGS over forests can be parallelised, using the method described in Clark and Curran (2003) to calculate the feature expectations. The L-BFGS algorithm, run to convergence on the cluster, takes 479 iterations and 2 hours for the normal-form model, and 1,550 iterations and roughly 17 hours for the dependency model.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
7 Parsing Algorithm
</SectionTitle>
    <Paragraph position="0"> For the normal-form model, the Viterbi algorithm is used to find the most probable derivation. For the dependency model, the highest scoring dependency structure is required. Clark and Curran (2003) outlines an algorithm for finding the most probable dependency structure, which keeps track of the highest scoring set of dependencies for each node in the chart. For a set of equivalent entries in the chart (a disjunctive node), this involves summing over all conjunctive node daughters which head sub-derivations leading to the same set of high scoring dependencies. In practice large numbers of such conjunctive nodes lead to very long parse times.</Paragraph>
    <Paragraph position="1"> As an alternative to finding the most probable dependency structure, we have developed an algorithm which maximises the expected labelled recall over dependencies. Our algorithm is based on Goodman's (1996) labelled recall algorithm for the phrase-structure PARSEVAL measures.</Paragraph>
    <Paragraph position="2"> Let L be the number of correct dependencies in with respect to a gold standard dependency structure G; then the dependency structure, max, which maximises the expected recall rate is:</Paragraph>
    <Paragraph position="4"> where S is the sentence for gold standard dependency structure G and i ranges over the dependency structures for S . This expression can be expanded further:</Paragraph>
    <Paragraph position="6"> The final score for a dependency structure is a sum of the scores for each dependency in ; and the score for a dependency is the sum of the probabilities of those derivations producing . This latter sum can be calculated efficiently using inside and  where c is the inside score and c is the outside score for node c (see Clark and Curran (2003)); C is the set of conjunctive nodes in the packed chart for sentence S and deps(c) is the set of dependencies on conjunctive node c. The intuition behind the expected recall score is that a dependency structure scores highly if it has dependencies produced by high scoring derivations.4 The algorithm which finds max is a simple variant on the Viterbi algorithm, efficiently finding a derivation which produces the highest scoring set of dependencies.</Paragraph>
  </Section>
class="xml-element"></Paper>