<?xml version="1.0" standalone="yes"?>
<Paper uid="P03-1046">
  <Title>Parsing with generative models of predicate-argument structure</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 A model of surface dependencies
</SectionTitle>
    <Paragraph position="0"> Hockenmaier and Steedman (2002b) define a surface dependency model (henceforth: SD) HWDep which captures word-word dependencies that are defined in terms of the derivation tree itself. It assumes that binary trees (with parent category P) have one head child (with category H) and one non-head child (with category D), and that each node has one lexical head h = ⟨c, w⟩. In the following tree, P = S[dcl]\NP, H = (S[dcl]\NP)/NP, D = NP, h  Like Clark et al. (2002), we define predicate-argument structure for CCG in terms of the dependencies that hold between words with lexical functor categories and their arguments. We assume that a lexical head is a pair ⟨c, w⟩, consisting of a word w and its lexical category c. Each constituent has at least one lexical head (more if it is a coordinate construction). The arguments of functor categories are numbered from 1 to n, starting at the innermost argument, where n is the arity of the functor, e.g.</Paragraph>
    <Paragraph position="2"> Dependencies hold between lexical heads whose category is a functor category and the lexical heads of their arguments. Such dependencies can be expressed as 3-tuples ⟨⟨c, w⟩, i, ⟨c  lexical head of the ith argument of c.</Paragraph>
    <Paragraph position="3"> The predicate-argument structure that corresponds to a derivation contains not only local, but also long-range dependencies that are projected from the lexicon or through some rules such as the coordination of functor categories. For details, see Hockenmaier (2003).</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Word-word dependencies in Dutch
</SectionTitle>
    <Paragraph position="0"> Dutch has a much freer word order than English.</Paragraph>
    <Paragraph position="1"> The analyses given in Steedman (2000) assume that this can be accounted for by an extended use of composition. As indicated by the indices (which are only included to improve readability), in the following examples, hij is the subject (NP</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="1" type="metho">
    <SectionTitle>
3
</SectionTitle>
    <Paragraph position="0"> ) of geeft, de politieman the indirect object (NP  An SD model estimated from a corpus containing these three sentences would not be able to capture the correct dependencies. Unless we assume that the above indices are given as a feature on the NP categories, the model could not distinguish between the dependency relations of Hij and geeft in the first sentence, bloem and geeft in the second sentence and politieman and geeft in the third sentence. Even with the indices, either the dependency between politieman and geeft or between bloem and geeft in the first sentence could not be captured by a model that assumes that each local tree has exactly one head. Furthermore, if one of these sentences occurred in the training data, all of the dependencies in the other variants of this sentence would be unseen to the model. However, in terms of the predicate-argument structure, all three examples express the same relations. The model we propose here would therefore be able to generalize from one example to the word-word dependencies in the other examples.</Paragraph>
    <Paragraph position="1">  The variables T are uninstantiated for reasons of space. The cross-serial dependencies of Dutch are one of the syntactic constructions that led people to believe that more than context-free power is required for natural language analysis. Here is an example together with the CCG derivation from Steedman (2000): dat ik Cecilia de paarden zag voeren (that I Cecilia the horses saw feed)  Again, a local dependency model would systematically model the wrong dependencies in this case, since it would assume that all noun phrases are arguments of the same verb.</Paragraph>
    <Paragraph position="2"> However, since there is no Dutch corpus that is annotated with CCG derivations, we restrict our attention to English in the remainder of this paper.</Paragraph>
  </Section>
  <Section position="6" start_page="1" end_page="2" type="metho">
    <SectionTitle>
5 A model of predicate-argument structure
</SectionTitle>
    <Paragraph position="0"> We first explain how word-word dependencies in the predicate-argument structure can be captured in a generative model, and then describe how these probabilities are estimated in the current implementation.</Paragraph>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
5.1 Modelling local dependencies
</SectionTitle>
      <Paragraph position="0"> We first define the probabilities for purely local dependencies without coordination. By excluding non-local dependencies and coordination, at most one dependency relation holds for each word. Consider the following sentence:  This derivation expresses the following dependencies:</Paragraph>
      <Paragraph position="2"> We assume again that heads are generated before their modifiers or arguments, and that word-word dependencies are expressed by conditioning modifiers or arguments on heads. Therefore, the head words of arguments (such as Smith) are generated in the following manner:  model assumes that word-word dependencies can be defined at the maximal projection of a constituent. However, as the Dutch examples show, the argument slot i can only be determined if the head constituent is fully expanded. For instance, if S[dcl] expands to a non-head S/(S/NP) and to a head S[dcl]/NP, it is necessary to know how the S[dcl]/NP expands to determine which argument is filled by the non-head, even if we already know that the lexical category of the head word of S[dcl]/NP is a ditransitive ((S[dcl]/NP)/NP)/NP. Therefore, we assume that the non-head child of a node is only expanded after the head child has been fully expanded.</Paragraph>
    </Section>
    <Section position="2" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
5.2 Modelling long-range dependencies
</SectionTitle>
      <Paragraph position="0"> The predicate-argument structure that corresponds to a derivation contains not only local, but also long-range dependencies that are projected from the lexicon or through some rules such as the coordination of functor categories. In the following derivation, Smith is the subject of resigned and of left:  In order to express both dependencies, Smith has to be conditioned on resigned and on left:</Paragraph>
      <Paragraph position="2"> In terms of the predicate-argument structure, resigned and left are both lexical heads of this sentence. Since neither fills an argument slot of the other, we assume that they are generated independently. This is different from the SD model, which conditions the head word of the second and subsequent conjuncts on the head word of the first conjunct. Similarly, in a sentence such as Miller and Smith resigned, the current model assumes that the two heads of the subject noun phrase are conditioned on the verb, but not on each other.</Paragraph>
      <Paragraph position="3"> Argument-cluster coordination constructions such as give a dog a bone and a policeman a flower are another example where the dependencies in the predicate-argument structure cannot be expressed at the level of the local trees that combine the individual arguments. Instead, these dependencies are projected down through the category of the  Lexical categories that project long-range dependencies include cases such as relative pronouns, control verbs, auxiliaries, modals and raising verbs. This can be expressed by co-indexing their arguments, e.g. (NP\NP  Again, in order to capture this dependency, we assume that the entire verb phrase is generated before the subject.</Paragraph>
      <Paragraph position="4"> In relative clauses, there is a dependency between the verbs in the relative clause and the head of the noun phrase that is modified by the relative clause:  Since the entire relative clause is an adjunct, it is generated after the noun phrase Smith. Therefore, we cannot capture the dependency between Smith and resigned by conditioning Smith on resigned. Instead, resigned needs to be conditioned on the fact that its subject is Smith. This is similar to the way in which head words of adjuncts such as yesterday are generated. In addition to this dependency, we also assume that there is a dependency between who and resigned. It follows that if we want to capture unbounded long-range dependencies such as object extraction, words can no longer be generated at the maximal projection of constituents. Consider the following examples:  In both cases, there is an S[dcl]/NP with lexical head (S[dcl]\NP)/NP; however, in the second case, the NP argument is not the object of the transitive verb. This problem can be solved by generating words at the leaf nodes instead of at the maximal projection of constituents. After expanding the (S[dcl]\NP)/NP node to (S[dcl]\NP)/NP and NP/NP, the NP that is co-indexed with woman can no longer be unified with the object of saw.</Paragraph>
      <Paragraph position="5"> These examples have shown that two changes to the generative process are necessary if word-word dependencies in the predicate-argument structure are to be captured. First, head constituents have to be fully expanded before non-head constituents are generated. Second, words have to be generated at the leaves of the tree, not at the maximal projection of constituents.</Paragraph>
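      <Paragraph> The two changes summarized above can be illustrated with a small traversal. This is only a sketch under stated assumptions: the Node class, its field names and the generation_order function are hypothetical and are not part of the parser described here. It shows the order in which words would be generated if the head child is expanded completely before the non-head child and words are emitted only at the leaves.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """A derivation-tree node; `word` is set only on leaves (hypothetical layout)."""
    category: str
    word: Optional[str] = None
    head: Optional["Node"] = None      # head child
    non_head: Optional["Node"] = None  # non-head child, absent for unary nodes


def generation_order(node: Node) -> List[str]:
    """Return words in the order they are generated: the head child is
    expanded completely before the non-head child, and a word is emitted
    only when its leaf is reached."""
    if node.head is None:              # leaf: the word is generated here
        return [node.word]
    order = generation_order(node.head)
    if node.non_head is not None:
        order += generation_order(node.non_head)
    return order


# "Smith resigned": the verb phrase is the head, the subject NP the non-head.
tree = Node("S[dcl]",
            head=Node("S[dcl]\\NP", word="resigned"),
            non_head=Node("NP", word="Smith"))
print(generation_order(tree))  # ['resigned', 'Smith']
```

 In this toy tree the verb is generated before its subject, so the subject can be conditioned on the verb and on the argument slot it fills.</Paragraph>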
    </Section>
    <Section position="3" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
5.3 The word probabilities
</SectionTitle>
      <Paragraph position="0"> Not all words have functor categories or fill argument slots of other functors. For instance, punctuation marks, conjunctions, and the heads of entire sentences are not conditioned on any other words.</Paragraph>
      <Paragraph position="1"> They are therefore conditioned only on their lexical categories. As a result, this model contains the following three kinds of word probabilities:</Paragraph>
    </Section>
    <Section position="4" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
5.4 The structural probabilities
</SectionTitle>
      <Paragraph position="0"> Like the SD model, we assume an underlying process which generates CCG derivation trees starting from the root node. Each node in a derivation tree has a category, a list of lexical heads and a (possibly empty) list of dependency relations to be filled by its lexical heads. As discussed in the previous section, head words cannot in general be generated at the maximal projection if unbounded long-range dependencies are to be captured. This is not the case for lexical categories. We therefore assume that a node's lexical head category is generated at its maximal projection, whereas head words are generated at the leaf nodes. Since lexical categories are generated at the maximal projection, our model has the same structural probabilities as the LexCat model of Hockenmaier and Steedman (2002b).</Paragraph>
    </Section>
    <Section position="5" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
5.5 Estimating word probabilities
</SectionTitle>
      <Paragraph position="0"> This model generates words in three different ways: as arguments of functors that are already generated, as functors which already have one (or more) arguments instantiated, or independent of the surrounding context. The last case is simple, as this probability can be estimated directly, by counting the number of times c is the lexical category of w in the training corpus, and dividing this by the number of times c occurs as a lexical category in the training corpus:</Paragraph>
      <Paragraph position="2"> In order to estimate the probability of an argument w, we count the number of times it occurs with lexical category c and is the ith argument of the lexical  ith argument of any lexical head with category c. For instance, in order to compute the probability of yesterday modifying resigned as in the previous section, we count the number of times the transitive verb resigned was modified by the adverb yesterday and divide this by the number of times resigned was modified by any adverb of the same category.</Paragraph>
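      <Paragraph> The relative-frequency estimates described above can be sketched as follows. The data, the tuple layout and the function names p_word and p_argument are assumptions made for illustration; they are not taken from CCGbank or from the actual implementation, and the sketch leaves out the interpolation step mentioned later in this section.

```python
from collections import Counter

# Toy observations (made up for illustration, not taken from CCGbank).
# Each lexical item is (word, category); each dependency is
# (head_word, head_category, slot, argument_word, argument_category).
lexical_items = [("resigned", "S[dcl]\\NP"), ("left", "S[dcl]\\NP"),
                 ("resigned", "S[dcl]\\NP"), ("Smith", "NP"),
                 ("Miller", "NP"), ("Smith", "NP")]
dependencies = [("resigned", "S[dcl]\\NP", 1, "Smith", "NP"),
                ("resigned", "S[dcl]\\NP", 1, "Miller", "NP"),
                ("left", "S[dcl]\\NP", 1, "Smith", "NP"),
                ("resigned", "S[dcl]\\NP", 1, "Smith", "NP")]

word_cat = Counter(lexical_items)
cat = Counter(c for _, c in lexical_items)
arg_full = Counter(dependencies)
arg_context = Counter((hw, hc, i, ac) for hw, hc, i, _, ac in dependencies)


def p_word(w, c):
    """Relative-frequency estimate of P(w | c)."""
    return word_cat[(w, c)] / cat[c] if cat[c] else 0.0


def p_argument(w, c, i, head_word, head_cat):
    """Relative-frequency estimate of the probability that the i-th argument
    of the given head, known to have category c, is the word w."""
    denom = arg_context[(head_word, head_cat, i, c)]
    return arg_full[(head_word, head_cat, i, w, c)] / denom if denom else 0.0


print(p_word("Smith", "NP"))
print(p_argument("Smith", "NP", 1, "resigned", "S[dcl]\\NP"))
```

 Both estimates are plain count ratios, so any event that is absent from the training counts receives probability zero unless the estimates are smoothed.</Paragraph>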
      <Paragraph position="3"> We have seen that functor probabilities are not only necessary for adjuncts, but also for certain types of long-range dependencies such as the relation between the noun modified by a relative clause and the verb in the relative clause. In the case of zero or reduced relative clauses, some of these dependencies are also captured by the SD model. However, in that model, only counts from the same type of construction could be used, whereas in our model, the functor probability for a verb in a zero or reduced relative clause can be estimated from all occurrences of the head noun. In particular, all instances of the noun and verb occurring together in the training data (with the same predicate-argument relation between them, but not necessarily with the same surface configuration) are taken into account by the new model. To obtain the model probabilities, the relative frequency estimates of the functor and argument probabilities are both interpolated with the word proba-</Paragraph>
    </Section>
    <Section position="6" start_page="1" end_page="2" type="sub_section">
      <SectionTitle>
5.6 Conditioning events on multiple heads
</SectionTitle>
      <Paragraph position="0"> In the presence of long-range dependencies and coordination, the new model requires the conditioning of certain events on multiple heads. Since it is unlikely that such probabilities can be estimated directly from data, they have to be approximated in some manner.</Paragraph>
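      <Paragraph> One simple way to approximate a probability conditioned on several heads is to average the estimates obtained from each head separately, which is consistent with the later remark in this section that the approximation averages over individual distributions. The sketch below assumes a uniform average; that weighting, the function names average_over_heads and lookup, and all numbers are assumptions made for illustration.

```python
def average_over_heads(word, category, dependencies, p_single):
    """Approximate P(word | category, dep_1..dep_n) by the mean of the
    single-dependency estimates P(word | category, dep_k)."""
    if not dependencies:
        raise ValueError("need at least one dependency")
    total = sum(p_single(word, category, dep) for dep in dependencies)
    return total / len(dependencies)


# Hypothetical single-dependency estimates for the subject of
# "resigned and left" (the numbers are invented).
table = {("Smith", "NP", ("resigned", 1)): 0.02,
         ("Smith", "NP", ("left", 1)): 0.04}


def lookup(w, c, dep):
    """Look up a single-dependency estimate, defaulting to zero."""
    return table.get((w, c, dep), 0.0)


print(average_over_heads("Smith", "NP", [("resigned", 1), ("left", 1)], lookup))
```

 An average is cheap to compute, but a word that is likely under one head and very unlikely under another still receives a moderate score.</Paragraph>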
      <Paragraph position="1"> If we assume that all dependencies dep  This approximation has the advantage that it is easy to compute, but might not give a good estimate, since it averages over all individual distributions. 6 Dynamic programming and beam search This section describes how this model is integrated into a CKY chart parser. Dynamic programming and effective beam search strategies are essential to guarantee efficient parsing in the face of the high ambiguity of wide-coverage grammars. Both use the inside probability of constituents. In lexicalized models where each constituent has exactly one lexical head, and where this lexical head can only depend on the lexical head of one other constituent, the inside probability of a constituent is the probability that a node with the label and lexical head of this constituent expands to the tree below this node. The probability of generating a node with this label and lexical head is given by the outside probability of the constituent.</Paragraph>
      <Paragraph position="2"> In the model defined here, the lexical head of a constituent can depend on more than one other word. As explained in section 5.2, there are instances where the categorial functor is conditioned on its arguments - the example given above showed that verbs in relative clauses are conditioned on the lexical head of the noun which is modified by the relative clause. Therefore, the inside probability of a constituent cannot include the probability of any lexical head whose argument slots are not all filled. This means that the equivalence relation defined by the probability model needs to take into account not only the head of the constituent itself, but also all other lexical heads within this constituent which have at least one unfilled argument slot. As a consequence, dynamic programming becomes less effective. There is a related problem for the beam search: in our model, the inside probabilities of constituents within the same cell cannot be directly compared anymore. Instead, the number of unfilled lexical heads needs to be taken into account. If a lexical head ⟨c, w⟩ is unfilled, the evaluation of the probability of w is delayed. This creates a problem for the beam search strategy.</Paragraph>
      <Paragraph position="3"> The fact that constituents can have more than one lexical head causes similar problems for dynamic programming and the beam search.</Paragraph>
      <Paragraph position="4"> In order to be able to parse efficiently with our model, we use the following approximations for dynamic programming and the beam search: Two constituents with the same span and the same category are considered equivalent if they delay the evaluation of the probabilities of the same words, if they have the same number of lexical heads, and if the first two elements of their lists of lexical heads are identical (the same words and lexical categories). This is only an approximation to true equivalence, since we do not check the entire list of lexical heads. Furthermore, if a cell contains more than 100 constituents, we iteratively narrow the beam (by halving it in size) until the beam search has no further effect or the cell contains fewer than 100 constituents. This is a very aggressive strategy, and it is likely to adversely affect parsing accuracy. However, more lenient strategies were found to require too much space for the chart to be held in memory. A better way of dealing with the space requirements of our model would be to implement a packed shared parse forest, but we leave this to future work.</Paragraph>
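      <Paragraph> The iterative narrowing of the beam can be sketched as below. Several details are assumptions, since the text does not spell them out: the beam is modelled as a log-probability width relative to the best constituent in the cell, halving is applied to that width, and the loop stops at the first halving that removes nothing. The names prune and narrow_cell and the starting width are hypothetical.

```python
import math


def prune(cell, width):
    """Keep constituents whose log-probability lies within `width` of the best."""
    best = max(score for score, _ in cell)
    return [(score, item) for score, item in cell if score >= best - width]


def narrow_cell(cell, width, limit=100):
    """If the cell holds more than `limit` constituents, halve the beam width
    repeatedly until pruning removes nothing or fewer than `limit` remain."""
    if len(cell) > limit:
        while True:
            width /= 2.0
            pruned = prune(cell, width)
            no_effect = len(pruned) == len(cell)
            cell = pruned
            if no_effect or limit > len(cell):
                break
    return cell, width


# 150 toy constituents with log-probabilities 0, -0.1, -0.2, ...
cell = [(-0.1 * k, "item%d" % k) for k in range(150)]
kept, final_width = narrow_cell(cell, width=math.log(10000))
print(len(kept), final_width)
```

 Cells at or below the limit are returned unchanged, so the narrowing only affects the most ambiguous spans.</Paragraph>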
    </Section>
  </Section>
  <Section position="7" start_page="2" end_page="2" type="metho">
    <SectionTitle>
7 An experiment
</SectionTitle>
    <Paragraph position="0"> We use sections 02-21 of CCGbank for training, section 00 for development, and section 23 for testing. The input is POS-tagged using the tagger of Ratnaparkhi (1996). However, since parsing with the new model is less efficient, only sentences of ≤ 40 tokens are used to test the model. A frequency cutoff of ≤ 20 was used to determine rare words in the training data, which are replaced with their POS-tags. Unknown words in the test data are also replaced by their POS-tags. The models are evaluated according to their Parseval scores and to the recovery of dependencies in the predicate-argument structure. Like Clark et al. (2002), we do not take the lexical category of the dependent into account, and evaluate ⟨⟨c, w⟩, i, ⟨ , w  Beam search is as in Hockenmaier and Steedman (2002b). nalize the parser if it mistakes a complement for an adjunct or vice versa.</Paragraph>
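    <Paragraph> The treatment of rare and unknown words described above can be sketched as follows. The function names build_vocabulary and replace_rare, the data layout and the toy sentences are assumptions made for illustration; the example uses a cutoff of 1 rather than 20 only to keep the data small.

```python
from collections import Counter


def build_vocabulary(training_sentences, cutoff=20):
    """Words whose training frequency exceeds `cutoff` are kept as words."""
    counts = Counter(word for sent in training_sentences for word, _ in sent)
    return {word for word, n in counts.items() if n > cutoff}


def replace_rare(sentence, vocabulary):
    """Replace rare or unknown words by their POS tags."""
    return [word if word in vocabulary else tag for word, tag in sentence]


# Toy data: (word, POS tag) pairs.
train = [[("the", "DT"), ("shares", "NNS"), ("rose", "VBD")],
         [("the", "DT"), ("shares", "NNS"), ("fell", "VBD")]]
vocab = build_vocabulary(train, cutoff=1)
print(replace_rare([("the", "DT"), ("bonds", "NNS"), ("rose", "VBD")], vocab))
# ['the', 'NNS', 'VBD']
```

 Here bonds is unknown and rose is rare, so both are replaced by their tags, while the remains a word.</Paragraph>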
    <Paragraph position="1"> In order to determine the impact of capturing different kinds of long-range dependencies, four different models were investigated: The baseline model is like the LexCat model of Hockenmaier and Steedman (2002b), since the structural probabilities of our model are like those of that model. Local only takes local dependencies into account. LeftArgs only takes long-range dependencies that are projected through left arguments (\X) into account. This includes for instance long-range dependencies projected by subjects, subject and object control verbs, subject extraction and left-node raising. All takes all long-range dependencies into account; in particular, it extends LeftArgs by also capturing the unbounded dependencies arising through right-node raising and object extraction. Local, LeftArgs and All are all tested with the aggressive beam strategy described above.</Paragraph>
    <Paragraph position="2"> In all cases, the CCG derivation includes all long-range dependencies. However, with the models that exclude certain kinds of dependencies, it is possible that a word is conditioned on no dependencies. In these cases, the word is generated with P(w|c).</Paragraph>
    <Paragraph position="3"> Table 1 gives the performance of all four models on section 23 in terms of the accuracy of lexical categories, Parseval scores, and in terms of the recovery of word-word dependencies in the predicate-argument structure. Here, results are further broken up into the recovery of local, all long-range, bounded long-range and unbounded long-range dependencies. LexCat does not capture any word-word dependencies. Its performance on the recovery of predicate-argument structure can be improved by 3% by capturing only local word-word dependencies (Local). This excludes certain kinds of dependencies that were captured by the SD model. For instance, the dependency between the head of a noun phrase and the head of a reduced relative clause (the shares bought by John) is captured by the SD model, since shares and bought are both heads of the local trees that are combined to form the complex noun phrase. However, in the SD model the probability of this dependency can only be estimated from occurrences of the same construction, since dependency relations are defined in terms of local trees and not in terms of the underlying predicate-argument structure. By including long-range dependencies on left arguments (such as subjects) (LeftArgs), a further improvement of 0.7% on the recovery of predicate-argument structure is obtained. This model captures the dependency between shares and bought. In contrast to the SD model, it can use all instances of shares as the subject of a passive verb in the training data to estimate this probability. Therefore, even if shares and bought do not co-occur in this particular construction in the training data, the event that is modelled by our dependency model might not be unseen, since it could have occurred in another syntactic context.</Paragraph>
    <Paragraph position="4"> Our results indicate that in order to perform well on long-range dependencies, they have to be included in the model, since Local, the model that captures only local dependencies, performs worse on long-range dependencies than LexCat, the model that captures no word-word dependencies. However, with more than 5% difference on labelled precision and recall on long-range dependencies, the model which captures long-range dependencies on left arguments performs significantly better on recovering long-range dependencies than Local. The greatest difference in performance between the models which do capture long-range dependencies and the models which do not is on long-range dependencies. This indicates that, at least in the kind of model considered here, it is very important to model not just local, but also long-range dependencies. It is not clear why All, the model that includes all dependencies, performs slightly worse than the model which includes only long-range dependencies on subjects.</Paragraph>
    <Paragraph position="5"> On the Wall Street Journal task, the overall performance of this model is lower than that of the SD model of Hockenmaier and Steedman (2002b).</Paragraph>
    <Paragraph position="6"> In that model, words are generated at the maximal projection of constituents; therefore, the structural probabilities can also be conditioned on words, which improves the scores by about 2%. It is also very likely that the performance of the new models is harmed by the very aggressive beam search.</Paragraph>
  </Section>
</Paper>