<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1013"> <Title>Log-Linear Models for Wide-Coverage CCG Parsing</Title>
<Section position="4" start_page="2" end_page="2" type="metho"> <SectionTitle> 3 Log-Linear Models for CCG </SectionTitle>
<Paragraph position="0"> Previous parsing models for CCG include a generative model over normal-form derivations (Hockenmaier and Steedman, 2002) and a conditional model over dependency structures (Clark et al., 2002).</Paragraph>
<Paragraph position="1"> We follow Clark et al. in modelling dependency structures, but, unlike Clark et al., do so in terms of derivations. An advantage of our approach is that the model can potentially include derivation-specific features in addition to dependency information. Also, modelling derivations provides a close link between the model and the parsing algorithm, which makes it easier to define dynamic programming techniques for efficient model estimation and decoding, and also to apply beam search to reduce the search space.</Paragraph>
<Paragraph position="2"> The probability of a dependency structure, π ∈ Π, given a sentence, S, is defined as follows:

  P(π|S) = Σ_{d ∈ Δ(π,S)} P(d, π|S)     (5)

where Δ(π,S) is the set of derivations for S which lead to π and Π is the set of dependency structures.</Paragraph>
<Paragraph position="5"> Note that Δ(π,S) includes the non-standard derivations allowed by CCG. This model allows the possibility of including features from the non-standard derivations, such as features encoding the use of type-raising or function composition.</Paragraph>
<Paragraph position="6"> A log-linear model of a parse, ω ∈ Ω, given a sentence, S, is defined as follows:

  P(ω|S) = (1/Z_S) e^{Σ_i λ_i f_i(ω)}     (6)

This model can be applied to any kind of parse, but for this paper a parse, ω, is a ⟨d, π⟩ pair (as given in (5)). The function f_i is a feature of the parse, which can be any real-valued function over the space of parses Ω. (We use the term decoding to refer to the process of finding the most probable dependency structure from a packed chart.) In this paper each f_i(ω) is a count of the number of times some dependency occurs in ω. Each feature f_i has an associated weight λ_i, a parameter of the model to be estimated. Z_S is a normalising constant which ensures that P(ω|S) is a probability distribution:

  Z_S = Σ_{ω'} e^{Σ_i λ_i f_i(ω')}

where the sum is over all parses for S.</Paragraph>
<Paragraph position="16"> The advantage of a log-linear model is that the features can be arbitrary functions over parses. This means that any dependencies - including overlapping and long-range dependencies - can be included in the model, irrespective of whether those dependencies are independent.</Paragraph>
<Paragraph position="17"> The theory underlying log-linear models is described in Della Pietra et al. (1997) and Berger et al. (1996). Briefly, the log-linear form in (6) is derived by choosing the model with maximum entropy from a set of models that satisfy a certain set of constraints (Rosenfeld, 1996). The constraints are that, for each feature f_i:

  Σ_{ω,S} P̃(S) P(ω|S) f_i(ω) = Σ_{ω,S} P̃(ω,S) f_i(ω)     (8)

where the sums are over all possible parse-sentence pairs and P̃(S) is the relative frequency of sentence S in the data. The value on the left of (8) is the expected value of f_i according to the model, E_P f_i, and the value on the right is the empirical expected value, E_P̃ f_i.</Paragraph>
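To make (6) and the normalising constant Z_S concrete, the following minimal Python sketch computes P(ω|S) for a handful of parses from their feature counts. The feature names, counts and weights are invented for illustration and do not come from the paper.

import math

# Hypothetical feature weights (the lambda_i of equation (6)).
weights = {"dep:buy-subj-IBM": 0.8, "dep:buy-obj-Lotus": 0.5, "cat:NP": -0.2}

def score(feature_counts):
    """Sum of lambda_i * f_i(omega) for a single parse omega."""
    return sum(weights.get(f, 0.0) * c for f, c in feature_counts.items())

def parse_probabilities(parses):
    """parses: list of feature-count dicts, one per parse omega of sentence S."""
    scores = [score(fc) for fc in parses]
    z_s = sum(math.exp(s) for s in scores)      # normalising constant Z_S
    return [math.exp(s) / z_s for s in scores]  # P(omega | S) for each parse

# Two hypothetical parses of the same sentence, given as feature counts f_i(omega).
parses = [
    {"dep:buy-subj-IBM": 1, "dep:buy-obj-Lotus": 1},
    {"dep:buy-subj-IBM": 1, "cat:NP": 2},
]
print(parse_probabilities(parses))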
<Paragraph position="18"> Estimating the parameters of a log-linear model requires the values in (8) to be calculated for each feature. Calculating the empirical expected values requires a treebank of CCG derivations plus dependency structures. For this we use CCGbank (Hockenmaier, 2003), a corpus of normal-form CCG derivations derived semi-automatically from the Penn Treebank. Following Clark et al., gold-standard dependency structures are obtained for each derivation by running a dependency-producing parser over the derivations. The empirical expected value of a feature f_i is then calculated as follows:

  E_P̃ f_i = (1/N) Σ_{j=1}^{N} f_i(ω_j)

where ω_1, ..., ω_N are the parses in the training data (each consisting of a normal-form derivation plus dependency structure) and f_i(ω_j) is the number of times f_i occurs in ω_j. (An alternative is to use feature counts from all derivations leading to the gold-standard dependency structure, including the non-standard derivations, to calculate E_P̃ f_i.)</Paragraph>
<Paragraph position="27"> Parameter estimation also requires calculation of the expected values of the features according to the model, E_P f_i. This requires summing over all parses (derivation plus dependency structure) for the sentences in the data, a difficult task since the total number of parses can grow exponentially with sentence length. For some sentences in CCGbank, the parser described in Section 6 produces trillions of parses.</Paragraph>
<Paragraph position="29"> The next section shows how a packed chart can efficiently represent the parse space, and how GIS applied to the packed chart can be used to estimate the parameters.</Paragraph> </Section>
<Section position="5" start_page="2" end_page="4" type="metho"> <SectionTitle> 4 Packed Charts </SectionTitle>
<Paragraph position="0"> Geman and Johnson (2002) have proposed a dynamic programming estimation method for packed representations of unification-based parses. Miyao and Tsujii (2002) have proposed a similar method for feature forests, which they apply to the derivations of an automatically extracted Tree-Adjoining Grammar. We apply Miyao and Tsujii's method to the derivations and dependency structures produced by our CCG parser.</Paragraph>
<Paragraph position="1"> The dynamic programming method relies on a packed chart, in which chart entries of the same type in the same cell are grouped together, and back pointers to the daughters keep track of how an individual entry was created. The intuition behind the dynamic programming is that, for the purposes of building a dependency structure, chart entries of the same type are equivalent. Consider the composition of will, with the category (S[dcl]\NP)/(S[b]\NP), and buy, with the category (S[b]\NP)/NP, using the forward composition rule. The type of the resulting entry is its category ((S[dcl]\NP)/NP), plus the dependencies yet to be filled. The dependencies are not shown here, but there are two subject dependencies on the first NP, one encoding the subject of will and one encoding the subject of buy, and there is an object dependency on the second NP encoding the object of buy. Entries of the same type are identical for the purposes of creating new dependencies for the remainder of the parsing.</Paragraph>
<Paragraph position="4"> Any rule instantiation used by the parser creates both a set of dependencies and a set of features. For the previous example, one dependency is created: the dependency in which buy fills the verbal complement slot of the lexical category for will. This dependency will be a feature created by the rule instantiation. We also use less specific features, such as the dependency with the words replaced by POS tags. Section 7 describes the features used.</Paragraph>
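As an illustration of how a single filled dependency gives rise to several features, the sketch below generates the word-level dependency feature plus the less specific variants with words replaced by POS tags (Section 7.2 notes that each dependency yields four such features). The tuple layout and example values are illustrative, not the parser's internal representation.

# A filled dependency: head word, its lexical category, argument slot, and the
# argument head word, together with POS tags for the two words.
def dependency_features(head, head_pos, cat, slot, arg, arg_pos):
    """Return the word- and POS-level variants of one dependency feature."""
    return [
        ("dep", head, cat, slot, arg),          # both words
        ("dep", head_pos, cat, slot, arg),      # head backed off to its POS tag
        ("dep", head, cat, slot, arg_pos),      # argument backed off to its POS tag
        ("dep", head_pos, cat, slot, arg_pos),  # both backed off to POS tags
    ]

# Hypothetical example: "buy" fills the verbal complement slot of "will".
for f in dependency_features("will", "MD", r"(S[dcl]\NP)/(S[b]\NP)", 2, "buy", "VB"):
    print(f)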
<Paragraph position="7"> The feature forests of Miyao and Tsujii are defined in terms of conjunctive and disjunctive nodes.</Paragraph>
<Paragraph position="8"> For our purposes, a conjunctive node is an individual entry in a cell, including the features created when the entry was derived, plus pointers to the entry's daughters. A disjunctive node represents an equivalence class of nodes in a cell, using the type equivalence relation described above. A conjunctive node results either from the combination of two disjunctive nodes using a binary rule, e.g. forward composition; or from a single disjunctive node using a unary rule, e.g. type-raising; or is a leaf node (a word plus lexical category).</Paragraph>
<Paragraph position="9"> Features in the model can only result from a single rule instantiation. (By rule instantiation we mean the local tree arising from the application of a CCG combinatory rule.) It is possible to define features covering a larger part of the dependency structure; for example, we might encode all three elements of the triple in a PP-attachment as a single feature. The disadvantage of using such features is that this reduces the efficiency of the dynamic programming.</Paragraph>
<Paragraph position="10"> Note, however, that the equivalence relation defining disjunctive nodes takes into account unfilled dependencies, which may be long-range dependencies being "passed up" the derivation tree. This means that long-range dependencies can be features in our model, even though the lexical items involved may be far apart in the sentence.</Paragraph>
<Paragraph position="11"> In the will/buy example above, the co-indexing of heads in the marked-up category for will ((S[dcl]\NP_1)/(S[b]\NP_1)) ensures the subject dependency for buy is "passed up" to the subject NP of the resulting category.</Paragraph>
<Paragraph position="14"> The packed structure we have described is an example of a feature forest (Miyao and Tsujii, 2002), defined as a tuple ⟨C, D, R, γ, δ⟩ where: C is a set of conjunctive nodes; D is a set of disjunctive nodes; R ⊆ D is the set of root disjunctive nodes; γ : D → 2^C is a function returning the conjunctive nodes in a disjunctive node; and δ : C → 2^D is a function returning the daughter disjunctive nodes of a conjunctive node. Features are defined over conjunctive nodes; the value of f_i for a parse is then the sum of the values of f_i at each conjunctive node in the parse.</Paragraph> </Section>
<Section position="6" start_page="4" end_page="8" type="metho"> <SectionTitle> 5 Estimation using GIS </SectionTitle>
<Paragraph position="0"> GIS is a very simple algorithm for estimating the parameters of a log-linear model. The parameters are initialised to some arbitrary constant and the following update rule is applied until convergence:

  λ_i^(t+1) = λ_i^(t) + (1/C) log( E_P̃ f_i / E_P^(t) f_i )     (10)

where C = max_ω Σ_i f_i(ω); in practice C is maximised over the sentences in the training data. Implementations of GIS typically use a "correction feature", but following Curran and Clark (2003) we do not use such a feature, which simplifies the algorithm.</Paragraph>
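A minimal sketch of the update rule in (10), assuming the empirical and model expectations have already been computed; the correction feature is omitted, as in the text, and the example numbers are invented.

import math

def gis_update(weights, empirical_exp, model_exp, C):
    """One GIS iteration: lambda_i += (1/C) * log(E~p[f_i] / Ep[f_i])."""
    new_weights = {}
    for i, lam in weights.items():
        if empirical_exp.get(i, 0.0) > 0.0 and model_exp.get(i, 0.0) > 0.0:
            lam += (1.0 / C) * math.log(empirical_exp[i] / model_exp[i])
        new_weights[i] = lam
    return new_weights

# Hypothetical expectations for three features; C is max over parses of sum_i f_i(omega).
weights = {"f1": 0.0, "f2": 0.0, "f3": 0.0}
empirical = {"f1": 120.0, "f2": 45.0, "f3": 3.0}
model = {"f1": 100.0, "f2": 60.0, "f3": 3.0}
print(gis_update(weights, empirical, model, C=25.0))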
<Paragraph position="4"> Calculating the model expectation E_P f_i requires summing over all derivations which include f_i, for all the sentences in the training data. The key to performing this sum efficiently is to write the sum in terms of inside and outside scores for each conjunctive node. The inside and outside scores can be defined recursively, as in the inside-outside algorithm for PCFGs.</Paragraph>
<Paragraph position="7"> If the inside score for a conjunctive node c is denoted φ_c, and the outside score ψ_c, then the expected value of f_i according to the model can be written as follows:

  E_P f_i = Σ_S P̃(S) (1/Z_S) Σ_{c ∈ C_S} f_i(c) φ_c ψ_c

where C_S is the set of conjunctive nodes for S and Z_S is the inside score at the root of the forest for S.</Paragraph>
<Paragraph position="11"> Consider the example feature forest in Figure 1. The figure shows the nodes used to calculate the inside and outside scores for a conjunctive node. The inside score of a conjunctive node c is the product of the inside scores of its daughter disjunctive nodes, multiplied by the feature weights at c, and the inside score of a disjunctive node d is the sum of the inside scores of its conjunctive nodes:

  φ_c = ( Π_{d ∈ δ(c)} φ_d ) e^{Σ_i λ_i f_i(c)}
  φ_d = Σ_{c ∈ γ(d)} φ_c
</Paragraph>
<Paragraph position="13"> The intuition for calculating outside scores is similar, but a little more involved. The outside score for a conjunctive node, c, is the outside score for its disjunctive node mother:

  ψ_c = ψ_d  where c ∈ γ(d)     (14)

The outside score for a disjunctive node is a sum, over its mother nodes, of the product of the outside score of the mother, the inside score of the sister, and the feature weights at the mother. (The notation is taken from Miyao and Tsujii (2002); note, however, that Miyao and Tsujii (2002) ignore the feature weights on the mother, which ignores some of the probability mass for the outside, at least for the feature forests we have defined.) For example, the outside score of a disjunctive node in Figure 1 with two mothers is the sum of two values: the product of the outside score of the first mother, the inside score of its sister, and the feature weights at that mother; plus the corresponding product for the second mother. The recursive definition is as follows: the outside score for a root disjunctive node is 1, otherwise

  ψ_d = Σ_{c : d ∈ δ(c)} ψ_c ( Π_{d' ∈ δ(c), d' ≠ d} φ_{d'} ) e^{Σ_i λ_i f_i(c)}     (15)
</Paragraph>
<Paragraph position="0"> In order to calculate inside scores, the scores for daughter nodes need to be calculated before the scores for mother nodes (and vice versa for the outside scores). This can easily be achieved by ordering the nodes in the bottom-up CKY parsing order.</Paragraph>
<Paragraph position="1"> Note that the inside-outside approach can be combined with any maximum entropy estimation procedure, such as those evaluated by Malouf (2002). Finally, in order to avoid overfitting, we use a Gaussian prior on the parameters of the model (Chen and Rosenfeld, 1999), which requires a slight modification to the update rule in (10). A Gaussian prior also handles the problem of "pseudo-maximal" features (Johnson et al., 1999).</Paragraph> </Section>
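The following sketch pulls together the data structures and recursions above: conjunctive and disjunctive node classes, inside and outside scores, and one sentence's contribution to E_P f_i. The tiny forest, feature names and weights are invented for illustration, and the computation is done in probability space rather than the log space used in the real implementation (Section 7.3).

import math
from collections import defaultdict

class Conjunctive:
    def __init__(self, name, features, daughters):
        self.name = name
        self.features = features      # {feature_id: count} created at this node
        self.daughters = daughters    # daughter disjunctive nodes, delta(c)

class Disjunctive:
    def __init__(self, name, members):
        self.name = name
        self.members = members        # member conjunctive nodes, gamma(d)

def node_weight(c, weights):
    return math.exp(sum(weights.get(f, 0.0) * n for f, n in c.features.items()))

def inside(disj_in_order, weights):
    """Inside scores, visiting disjunctive nodes in bottom-up order."""
    phi = {}
    for d in disj_in_order:
        for c in d.members:
            phi[c] = node_weight(c, weights)
            for dd in c.daughters:
                phi[c] *= phi[dd]
        phi[d] = sum(phi[c] for c in d.members)
    return phi

def outside(disj_in_order, roots, phi, weights):
    """Outside scores, visiting disjunctive nodes in top-down order."""
    psi = defaultdict(float)
    for r in roots:
        psi[r] = 1.0
    for d in reversed(disj_in_order):
        for c in d.members:
            psi[c] = psi[d]                      # equation (14)
            w = node_weight(c, weights)
            for dd in c.daughters:               # one mother's term of equation (15)
                sisters = math.prod(phi[s] for s in c.daughters if s is not dd)
                psi[dd] += psi[c] * sisters * w
    return psi

# Tiny hypothetical forest: two leaves combined by a single rule instantiation.
leaf1 = Conjunctive("c_will", {"cat:will": 1}, [])
leaf2 = Conjunctive("c_buy", {"cat:buy": 1}, [])
d1, d2 = Disjunctive("d1", [leaf1]), Disjunctive("d2", [leaf2])
top = Conjunctive("c_top", {"dep:will-buy": 1}, [d1, d2])
root = Disjunctive("root", [top])

order = [d1, d2, root]
weights = {"cat:will": 0.1, "cat:buy": 0.2, "dep:will-buy": 0.5}
phi = inside(order, weights)
psi = outside(order, [root], phi, weights)
z = phi[root]
# Contribution of this sentence to E_P[f_i]: sum_c f_i(c) * phi_c * psi_c / Z_S
expectation = defaultdict(float)
for c in [leaf1, leaf2, top]:
    for f, n in c.features.items():
        expectation[f] += n * phi[c] * psi[c] / z
print(dict(expectation))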
<Section position="7" start_page="8" end_page="8" type="metho"> <SectionTitle> 6 The Parser </SectionTitle>
<Paragraph position="0"> The parser is based on Clark et al. (2002) and takes as input a POS-tagged sentence with a set of possible lexical categories assigned to each word. The supertagger of Clark (2002) provides the lexical categories, with a parameter setting which assigns around 4 categories per word on average. The parsing algorithm is the CKY bottom-up chart-parsing algorithm described in Steedman (2000). The combinatory rules used by the parser are functional application (forward and backward), generalised forward composition, backward composition, generalised backward-crossed composition, and type-raising. There is also a coordination rule which conjoins categories of the same type. Restrictions are placed on some of the rules, such as that given by Steedman (2000, p.62) for backward-crossed composition.</Paragraph>
<Paragraph position="1"> Type-raising is applied to the categories NP, PP and S[adj]\NP (adjectival phrase), and is implemented by adding the relevant set of type-raised categories to the chart whenever an NP, PP or S[adj]\NP is present. The sets of type-raised categories are based on the most commonly used type-raising rule instantiations in sections 2-21 of CCGbank, and contain 8 type-raised categories for NP and 1 each for PP and S[adj]\NP.</Paragraph>
<Paragraph position="2"> The parser also uses a number of lexical rules and punctuation rules. These rules are based on those occurring roughly more than 200 times in sections 2-21 of CCGbank. An example of a lexical rule used by the parser is the following, which takes a passive form of a verb and creates a nominal modifier:

  S[pss]\NP ⇒ NP\NP

This rule is used to create NPs such as the role played by Kim Cattrall. Note that there is a dependency relation on the resulting category; in the previous example role would fill a nominal modifier dependency headed by played.</Paragraph>
<Paragraph position="4"> Currently, the only punctuation marks handled by the parser are commas, and all other punctuation is removed after the supertagging phase. An example of a comma rule is the following:

  S/S ,  ⇒  S/S

This rule takes a sentential modifier followed by a comma (for example Currently , in the sentence above in the text) and returns a sentential modifier of the same type.</Paragraph>
<Paragraph position="5"> The next section describes the efficient implementation of the parser and model estimator.</Paragraph> </Section>
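As an illustration of how unary rules of the kind described in the preceding section (type-raising and lexical rules such as the passive-to-nominal-modifier rule) can be applied during chart construction, here is a small sketch. The rule table is illustrative: the two NP entries are standard type-raised categories, not the exact set of eight used by the parser.

# Illustrative unary-rule table: category -> categories added to the same cell.
UNARY_RULES = {
    r"NP": [r"S/(S\NP)", r"(S\NP)\((S\NP)/NP)"],   # type-raising (illustrative subset)
    r"S[pss]\NP": [r"NP\NP"],                      # passive -> nominal modifier
}

def apply_unary_rules(cell):
    """cell: set of category strings in one chart cell; returns the enlarged cell."""
    added = set()
    for cat in cell:
        added.update(UNARY_RULES.get(cat, []))
    return cell | added

print(apply_unary_rules({r"NP"}))
print(apply_unary_rules({r"S[pss]\NP"}))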
<Section position="8" start_page="8" end_page="8" type="metho"> <SectionTitle> 7 Implementation </SectionTitle>
<Section position="1" start_page="8" end_page="8" type="sub_section"> <SectionTitle> 7.1 Parser Implementation </SectionTitle>
<Paragraph position="0"> The non-standard derivations allowed by CCG, together with the wide-coverage grammar, result in extremely large charts. This means that efficient implementation of the parsing process is imperative for performing large-scale experiments.</Paragraph>
<Paragraph position="1"> The packed chart prevents combinatorial explosion in the number of category combinations by grouping equivalent categories into a single entry.</Paragraph>
<Paragraph position="2"> The speed of the parser is heavily dependent on the efficiency of equivalence testing, and of category unification and construction. These are performed efficiently by always creating categories in a canonical form which can then be compared rapidly using hash functions over categories.</Paragraph>
<Paragraph position="3"> The parser produces a packed chart from which the most probable dependency structure can be recovered. Since the same dependency structure can be generated by more than one derivation, a dependency structure's score is the sum of the log-linear scores for each derivation. Finding the structure with the highest score is not trivial, since filled dependencies are only stored at the conjunctive nodes where they are created. This means that a dependency appearing in a structure can be created in different parts of the chart for different derivations. We solve this in practice using a hash function over dependencies, which can be used to quickly determine whether two derivations lead to the same structure. For each node in the chart, we can keep track of the derivation leading to the set of dependencies with the highest score for that node.</Paragraph> </Section>
<Section position="2" start_page="8" end_page="8" type="sub_section"> <SectionTitle> 7.2 Data Generation </SectionTitle>
<Paragraph position="0"> Data for model estimation is created in two steps. First, the parser is run over the normal-form derivations in Sections 2-21 of CCGbank, outputting the corresponding dependencies and other features. The features used in our preliminary implementation are dependency features, lexical category features, and root category features. Dependency features are 5-tuples as defined in Section 2. Further dependency features are formed by substituting POS tags for the words, which leads to a total of 4 features for each dependency. Lexical category features are word-category pairs on the leaf nodes, and root features are headword-category pairs on root nodes. Extra features are formed by replacing words with their POS tags. The total number of features is 817,658, but we reduce this to 243,603 by only including features which appear at least twice in the data.</Paragraph>
<Paragraph position="2"> The second step of data generation involves using the parser to create a feature forest for each sentence, using the feature set extracted from CCGbank. The parser is interrupted if a sentence takes longer than 60 seconds to process or if more than 500,000 conjunctive nodes are created in the chart. If this occurs, the process is repeated with a smaller number of categories assigned to each word by the supertagger. Approximately 93% of the sentences in sections 2-21 can be processed in this way, giving 36,400 training sentences. Creating the forests takes approximately one hour using 40 nodes of our Beowulf cluster, and produces 19.9 GB of data.</Paragraph> </Section>
<Section position="3" start_page="8" end_page="8" type="sub_section"> <SectionTitle> 7.3 Estimation </SectionTitle>
<Paragraph position="0"> The parse forests regularly represent trillions of possible parses for a sentence. The estimation process involves summing feature weights over all these parses, a total which cannot be represented using double-precision arithmetic (limited to less than 10^308). Our implementation therefore uses the sum, rather than product, form of (6), so that logarithms can be used to avoid numerical overflow. For converting the sum of products in Equation 15 to log space, we use a technique commonly used in speech recognition (p.c. Simon King).</Paragraph>
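The log-space conversion referred to above is typically handled with the "log-sum-exp" trick; the sketch below shows the standard technique as an assumption about what is meant, not code from the paper.

import math

def log_sum_exp(log_values):
    """Compute log(sum_i exp(x_i)) without overflow, for x_i given in log space."""
    m = max(log_values, default=float("-inf"))
    if m == float("-inf"):          # guard for an empty (or all-zero) sum
        return m
    return m + math.log(sum(math.exp(x - m) for x in log_values))

# Summing three inside*outside products whose logs are far too large to exponentiate:
log_products = [7500.2, 7499.9, 7498.5]   # e.g. log(phi_c * psi_c) for three nodes
print(log_sum_exp(log_products))          # finite, whereas math.exp(7500) overflows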
<Paragraph position="1"> We have implemented a parallel version of our GIS code using the MPICH library (Gropp et al., 1996), an open-source implementation of the Message Passing Interface (MPI) standard. MPI parallel programming involves explicit synchronisation and information transfer between the parallel processes using messages. It is ideal for the development of parallel programs for cluster architectures.</Paragraph>
<Paragraph position="2"> GIS over parse forests is straightforward to parallelise. The parse forests are divided among the machines in the cluster (in our current implementation, each machine receives 979 forests). Each machine calculates the inside and outside scores for each node in its parse forests and updates the estimated feature expectations. The feature expectations are then summed across all of the machines using a global operation (called a reduce operation). Every machine receives this sum, which is then used to calculate the normal GIS weight update. In our preliminary tests, each process used approximately 750 MB of RAM, giving a total usage of 30 GB across the cluster. One iteration of GIS takes approximately 1 minute. Given the large number of features, we estimate at least 1,000 iterations will be needed for convergence. The scale of the estimation task is summarised in the following table:

                        GIS - CCG    IIS - TAG
  number of features      243,603        5,715
  number of sentences      36,400          868
  avg. num. of nodes       52,000       17,412
  memory usage               30 GB       1.5 GB
  disk usage               19.9 GB            -
</Paragraph> </Section> </Section> </Paper>
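To close, here is a sketch of the parallel expectation step described in Section 7.3, written with mpi4py as a stand-in for the MPICH/C setup used for the experiments. The forest loading and inside-outside computation are stubbed out, and all names and values are illustrative.

import numpy as np
from mpi4py import MPI  # assumes an MPI installation plus the mpi4py package

NUM_FEATURES = 243_603

def local_expectations(forests, weights):
    """Stub: run inside-outside over this process's share of the parse forests and
    accumulate contributions to E_P[f_i] into a vector (see Section 5)."""
    return np.zeros(NUM_FEATURES)

def parallel_gis_iteration(comm, my_forests, weights, empirical, C):
    local = local_expectations(my_forests, weights)
    # Global reduce: every process receives the summed model expectations.
    model = np.empty_like(local)
    comm.Allreduce(local, model, op=MPI.SUM)
    # Normal GIS update, computed identically on every process.
    mask = (model > 0) & (empirical > 0)
    weights[mask] += (1.0 / C) * np.log(empirical[mask] / model[mask])
    return weights

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    weights = np.zeros(NUM_FEATURES)
    empirical = np.ones(NUM_FEATURES)   # placeholder empirical expectations
    my_forests = []                     # this process's share of the parse forests
    weights = parallel_gis_iteration(comm, my_forests, weights, empirical, C=25.0)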