<?xml version="1.0" standalone="yes"?> <Paper uid="P02-1042"> <Title>Building Deep Dependency Structures with a Wide-Coverage CCG Parser</Title> <Section position="7" start_page="379" end_page="379" type="evalu"> <SectionTitle> 6 Results </SectionTitle>
<Paragraph position="0"> To measure the performance of the parser, we compared the dependencies output by the parser with those in the gold standard, and computed precision and recall figures over the dependencies. Recall that a dependency is defined as a 4-tuple: a head of a functor, a functor category, an argument slot, and a head of an argument. Figures were calculated for labelled dependencies (LP, LR) and unlabelled dependencies (UP, UR). To obtain a point for a labelled dependency, each element of the 4-tuple must match exactly. Note that the category set we are using distinguishes around 400 distinct types; for example, tensed transitive buy is treated as a distinct category from infinitival transitive buy. Thus this evaluation criterion is much more stringent than that for a standard pos-tag label-set (there are around 50 pos-tags used in the Penn Treebank).</Paragraph>
<Paragraph position="1"> To obtain a point for an unlabelled dependency, the heads of the functor and argument must appear together in some relation (either as functor or argument) for the relevant sentence in the gold standard. The results are shown in Table 1, with an additional column giving the category accuracy.</Paragraph>
<Paragraph position="3"> As an additional experiment, we conditioned the dependency probabilities in (10) on a &quot;distance measure&quot; (∆). Distance has been shown to be a useful feature for context-free treebank-style parsers (e.g. Collins (1996), Collins (1999)), although our hypothesis was that it would be less useful here, because the CCG grammar provides many of the constraints given by ∆, and distance measures are biased against long-range dependencies.</Paragraph>
<Paragraph position="4"> We tried a number of distance measures, and the one used here encodes the relative position of the heads of the argument and functor (left or right), counts the number of verbs between argument and functor (up to 1), and counts the number of punctuation marks (up to 2). The results are also given in Table 1, and show that, as expected, adding distance gives no improvement overall.</Paragraph>
<Paragraph position="5"> An advantage of the dependency-based evaluation is that results can be given for individual dependency relations. Labelled precision and recall on Section 00 for the most frequent dependency types are shown in Table 2 (for the model without distance measures).9 The columns # deps give the total number of dependencies, first the number put forward by the parser, and second the number in the gold standard. F-score is calculated as (2*LP*LR)/(LP+LR).</Paragraph>
<Paragraph position="6"> We also give the scores for the dependencies created by the subject and object relative pronoun categories, including the headless object relative pronoun category.</Paragraph>
<Paragraph position="7"> We would like to compare these results with those of other parsers that have presented dependency-based evaluations. However, the few that exist (Lin, 1995; Carroll et al., 1998; Collins, 1999) have used either different data or different sets of dependencies (or both). In future work we plan to map our CCG dependencies onto the set used by Carroll and Briscoe and parse their evaluation corpus so that a direct comparison can be made.</Paragraph>
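To make the evaluation criterion above concrete, the following is a minimal sketch (not the authors' implementation) of labelled and unlabelled precision/recall over dependency 4-tuples, together with the F-score formula given in the text. The data layout, function names, and the assumption that heads are token-identified strings are illustrative choices, not taken from the paper.

```python
# Sketch of the dependency-based evaluation described above. Each dependency is
# assumed to be a 4-tuple (functor head, functor category, argument slot, argument head).
from typing import Set, Tuple

Dep = Tuple[str, str, int, str]  # (functor head, functor category, slot, argument head)

def labelled_scores(parsed: Set[Dep], gold: Set[Dep]) -> Tuple[float, float]:
    """Labelled precision/recall: all four elements of the tuple must match exactly."""
    correct = len(parsed & gold)
    lp = correct / len(parsed) if parsed else 0.0
    lr = correct / len(gold) if gold else 0.0
    return lp, lr

def unlabelled_scores(parsed: Set[Dep], gold: Set[Dep]) -> Tuple[float, float]:
    """Unlabelled precision/recall: the functor and argument heads must appear
    together in some relation, in either order, in the other structure."""
    parsed_pairs = [frozenset((f, a)) for f, _, _, a in parsed]
    gold_pairs = [frozenset((f, a)) for f, _, _, a in gold]
    up = sum(p in set(gold_pairs) for p in parsed_pairs) / len(parsed_pairs) if parsed_pairs else 0.0
    ur = sum(g in set(parsed_pairs) for g in gold_pairs) / len(gold_pairs) if gold_pairs else 0.0
    return up, ur

def f_score(p: float, r: float) -> float:
    """F-score as defined in the text: (2*LP*LR)/(LP+LR)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0
```

Applied per sentence to parser output and gold standard and then aggregated over the corpus, such functions would yield figures of the kind reported in Table 1.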
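The distance measure described above can be sketched as follows; this is an illustrative reconstruction rather than the authors' code. Only the bucketing (direction of the argument head relative to the functor head, intervening verbs capped at 1, intervening punctuation capped at 2) comes from the text; the tag inventories and function name are assumptions.

```python
# Illustrative sketch of the distance measure described above, computed from a
# sentence's pos-tag sequence and the token positions of the two heads.
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}  # assumed Penn Treebank verb tags
PUNCT_TAGS = {",", ":", ".", ";"}                      # assumed punctuation tags

def distance_feature(pos_tags, functor_idx, argument_idx):
    """Return (direction, verb_count, punct_count) for a functor/argument head pair."""
    direction = "left" if argument_idx < functor_idx else "right"
    lo, hi = sorted((functor_idx, argument_idx))
    between = pos_tags[lo + 1:hi]
    verbs = min(sum(tag in VERB_TAGS for tag in between), 1)   # capped at 1
    punct = min(sum(tag in PUNCT_TAGS for tag in between), 2)  # capped at 2
    return direction, verbs, punct
```

The resulting triple would then serve as an additional conditioning variable for the dependency probabilities, as described in the experiment above.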
<Paragraph position="8"> As far as long-range dependencies are concerned, it is similarly hard to give a precise evaluation. Note that the scores in Table 2 currently conflate extracted and in-situ arguments, so that the scores for the direct objects, for example, include extracted objects. The scores for the relative pronoun categories give a good indication of the performance on extraction cases, although even here it is not possible at present to determine exactly how well the parser is performing at recovering extracted arguments.</Paragraph>
<Paragraph position="9"> In an attempt to obtain a more thorough analysis, we analysed the performance of the parser on the 24 cases of extracted objects in the gold-standard Section 00 (development set) that were passed down the object relative pronoun category.10 Of these, 10 were recovered correctly by the parser; 10 were incorrect because the wrong category was assigned to the relative pronoun, 3 were incorrect because the relative pronoun was attached to the wrong noun, and 1 was incorrect because the wrong category was assigned to the predicate from which the object was extracted. 9In the gold standard, every noun in an N N compound is taken to modify the final N, as a default, since the structure of the compound is not present in the Penn Treebank. Thus the scores for N N are not particularly informative. Removing these relations reduces the overall scores by around 2%. Also, the scores in Table 2 are for around 95% of the sentences in Section 00, because of the problem obtaining gold standard dependency structures for all sentences, noted earlier.</Paragraph>
<Paragraph position="10"> 10The number of extracted objects need not equal the number of occurrences of the category, since coordination can introduce more extracted objects. The tendency for the parser to assign the wrong category to the relative pronoun in part reflects the fact that complementiser that is fifteen times as frequent as object relative pronoun that.</Paragraph>
<Paragraph position="11"> However, the supertagger alone gets 74% of the object relative pronouns correct, if it is used to provide a single category per word, so it seems that our dependency model is further biased against object extractions, possibly because of the technical unsoundness noted earlier.</Paragraph>
<Paragraph position="12"> It should be recalled in judging these figures that they are only a first attempt at recovering these long-range dependencies, which most other wide-coverage parsers make no attempt to recover at all. To get an idea of just how demanding this task is, it is worth looking at an example of object relativization that the parser gets correct. Figure 2 gives part of a dependency structure returned by the parser for a sentence from Section 00 (with the relations omitted).11 Notice that both respect and confidence are objects of had. The relevant dependency quadruples found by the parser are the following: 11The full sentence is The events of April through June damaged the respect and confidence which most Americans previously had for the leaders of China.</Paragraph>
<Paragraph position="13"> [Figure 2 fragment: respect and confidence which most Americans previously had; the listed dependency quadruples are not recoverable from the extracted text.]</Paragraph> </Section></Paper>