<?xml version="1.0" standalone="yes"?> <Paper uid="P03-1054"> <Title>Accurate Unlexicalized Parsing</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 External vs. Internal Annotation </SectionTitle> <Paragraph position="0"> The two major previous annotation strategies, parent annotation and head lexicalization, can be seen as instances of external and internal annotation, respectively. Parent annotation lets us indicate an important feature of the external environment of a node which influences the internal expansion of that node. On the other hand, lexicalization is a (radical) method of marking a distinctive aspect of the otherwise hidden internal contents of a node which influence the external distribution. Both kinds of annotation can be useful. To identify split states, we add suffixes of the form -X to mark internal content features, and ^X to mark external features.</Paragraph> <Paragraph position="1"> To illustrate the difference, consider unary productions. In the raw grammar, there are many unaries, and once any major category is constructed over a span, most others become constructible as well using unary chains (see Klein and Manning (2001) for discussion). Such chains are rare in real treebank trees: unary rewrites only appear in very specific contexts, for example S complements of verbs where the S has an empty, controlled subject. Figure 4 shows an erroneous output of the parser, using the baseline markovized grammar. Intuitively, there are several reasons this parse should be ruled out, but one is that the lower S slot, which is intended primarily for S complements of communication verbs, is not a unary rewrite position (such complements usually have subjects). It would therefore be natural to annotate the trees so as to confine unary productions to the contexts in which they are actually appropriate. We tried two annotations. First, UNARY-INTERNAL marks (with a -U) any nonterminal node which has only one child. In isolation, this resulted in an absolute gain of 0.55% (see figure 3). The same sentence, parsed using only the baseline and UNARY-INTERNAL, is parsed correctly, because the VP rewrite in the incorrect parse ends with an S^VP-U with very low probability.8 Alternately, UNARY-EXTERNAL, marked nodes which had no siblings with ^U. It was similar to UNARY-INTERNAL in solo benefit (0.01% worse), but provided far less marginal benefit on top of other later features (none at all on top of UNARY-INTERNAL for our top models), and was discarded.9 One restricted place where external unary annotation was very useful, however, was at the preterminal level, where internal annotation was meaningless. One distributionally salient tag conflation in the Penn treebank is the identification of demonstratives (that, those) and regular determiners (the, a). Splitting DT tags based on whether they were only children (UNARY-DT) captured this distinction. The same external unary annotation was even more effective when applied to adverbs (UNARY-RB), distinguishing, for example, as well from also). Beyond these cases, unary tag marking was detrimental. The F1 after UNARY-INTERNAL, UNARY-DT, and UNARY-RB was 78.86%.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Tag Splitting </SectionTitle> <Paragraph position="0"> The idea that part-of-speech tags are not fine-grained enough to abstract away from specific-word behaviour is a cornerstone of lexicalization. 
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Tag Splitting </SectionTitle>
<Paragraph position="0"> The idea that part-of-speech tags are not fine-grained enough to abstract away from specific-word behaviour is a cornerstone of lexicalization. The UNARY-DT annotation, for example, showed that the determiners which occur alone are usefully distinguished from those which occur with other nominal material. This marks the DT nodes with a single bit about their immediate external context: whether there are sisters. Given the success of parent annotation for nonterminals, it makes sense to parent annotate tags as well (TAG-PA). In fact, as figure 3 shows, exhaustively marking all preterminals with their parent category was the most effective single annotation we tried. Why should this be useful? Most tags have a canonical category. For example, NNS tags occur under NP nodes (only 234 of 70855 do not, mostly mistakes). However, when a tag somewhat regularly occurs in a non-canonical position, its distribution is usually distinct.</Paragraph>
[Figure 5: (a) the incorrect baseline parse and (b) the correct TAG-PA parse. SPLIT-IN also resolves this error.]
<Paragraph position="1"> For example, the most common adverbs directly under ADVP are also (1599) and now (544). Under VP, they are n't (3779) and not (922). Under NP, only (215) and just (132), and so on. TAG-PA brought F1 up substantially, to 80.62%.</Paragraph>
<Paragraph position="2"> In addition to the adverb case, the Penn tag set conflates various grammatical distinctions that are commonly made in traditional and generative grammar, and from which a parser could hope to get useful information. For example, subordinating conjunctions (while, as, if), complementizers (that, for), and prepositions (of, in, from) all get the tag IN.</Paragraph>
<Paragraph position="3"> Many of these distinctions are captured by TAG-PA (subordinating conjunctions occur under S and prepositions under PP), but some are not (both subordinating conjunctions and complementizers appear under SBAR). Also, there are exclusively noun-modifying prepositions (of), predominantly verb-modifying ones (as), and so on. The annotation SPLIT-IN does a linguistically motivated 6-way split of the IN tag, and brought the total to 81.19%.</Paragraph>
<Paragraph position="4"> Figure 5 shows an example error in the baseline which is equally well fixed by either TAG-PA or SPLIT-IN. In this case, the more common nominal use of works is preferred unless the IN tag is annotated to allow if to prefer S complements.</Paragraph>
<Paragraph position="5"> We also got value from three other annotations which subcategorized tags for specific lexemes.</Paragraph>
<Paragraph position="6"> First we split off auxiliary verbs with the SPLIT-AUX annotation, which appends ^BE to all forms of be and ^HAVE to all forms of have.10 (Footnote 10: This is an extended uniform version of the partial auxiliary annotation of Charniak (1997), wherein all auxiliaries are marked as AUX and a -G is added to gerund auxiliaries and gerund VPs.)</Paragraph>
<Paragraph position="7"> More minorly, SPLIT-CC marked conjunction tags to indicate whether or not they were the strings [Bb]ut or &, each of which has a distinctly different distribution from other conjunctions. Finally, we gave the percent sign (%) its own tag, in line with the dollar sign ($) already having its own. Together these three annotations brought the F1 to 81.81%.</Paragraph>
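TAG-PA is likewise a single relabeling pass over the training trees, copying each preterminal's parent label onto the tag. A minimal sketch under the same assumptions as the earlier one (our encoding and names, not the authors' code):

    # Sketch of TAG-PA: annotate every preterminal tag with its parent category.
    def is_preterminal(node):
        return isinstance(node, list) and len(node) == 2 and isinstance(node[1], str)

    def tag_pa(tree, parent=None):
        if not isinstance(tree, list):
            return tree                                  # bare word
        if is_preterminal(tree) and parent is not None:
            return [tree[0] + "^" + parent, tree[1]]     # e.g. RB^ADVP vs. RB^VP
        return [tree[0]] + [tag_pa(child, tree[0]) for child in tree[1:]]

    example = ["VP", ["RB", "not"], ["ADVP", ["RB", "also"]]]
    print(tag_pa(example))
    # ['VP', ['RB^VP', 'not'], ['ADVP', ['RB^ADVP', 'also']]]

The lexeme-specific splits (SPLIT-IN, SPLIT-AUX, SPLIT-CC, the % tag) follow the same pattern, except that the test inspects the word below the tag rather than the category above it.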
<Paragraph position="8"> 5 What is an Unlexicalized Grammar?
Around this point, we must address exactly what we mean by an unlexicalized PCFG. To the extent that we go about subcategorizing POS categories, many of them might come to represent a single word. One might thus feel that the approach of this paper is to walk down a slippery slope, and that we are merely arguing degrees. However, we believe that there is a fundamental qualitative distinction, grounded in linguistic practice, between what we see as permitted in an unlexicalized PCFG as against what one finds and hopes to exploit in lexicalized PCFGs. The division rests on the traditional distinction between function words (or closed-class words) and content words (or open-class or lexical words). It is standard practice in linguistics, dating back decades, to annotate phrasal nodes with important function-word distinctions, for example to have a CP[for] or a PP[to], whereas content words are not part of grammatical structure, and one would not have special rules or constraints for an NP[stocks], for example. We follow this approach in our model: various closed classes are subcategorized to better represent important distinctions, and important features commonly expressed by function words are annotated onto phrasal nodes (such as whether a VP is finite, or a participle, or an infinitive clause). However, no use is made of lexical class words, to provide either monolexical or bilexical probabilities.11 At any rate, we have kept ourselves honest by estimating our models exclusively by maximum likelihood estimation over our subcategorized grammar, without any form of interpolation or shrinkage to unsubcategorized categories (although we do markovize rules, as explained above). This effectively means that the subcategories that we break off must themselves be very frequent in the language.</Paragraph>
Footnote 11: It should be noted that we started with four tags in the Penn treebank tagset that rewrite as a single word: EX (there), WP$ (whose), # (the pound sign), and TO, and some others such as WP, POS, and some of the punctuation tags, which rewrite as barely more. To the extent that we subcategorize tags, there will be more such cases, but many of them already exist in other tag sets. For instance, many tag sets, such as the Brown and CLAWS (c5) tagsets, give a separate set of tags to each form of the verbal auxiliaries be, do, and have, most of which rewrite as only a single word (and any corresponding contractions).
<Paragraph position="9"> In such a framework, if we try to annotate categories with any detailed lexical information, many sentences either entirely fail to parse, or have only extremely weird parses. The resulting battle against sparsity means that we can only afford to make a few distinctions which have major distributional impact.</Paragraph>
<Paragraph position="10"> Even with the individual-lexeme annotations in this section, the grammar still has only 9255 states compared to the 7619 of the baseline model.</Paragraph> </Section>
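Since the model is estimated exclusively by maximum likelihood over the annotated symbols, with no interpolation or shrinkage, grammar estimation reduces to counting. A toy sketch of that step (our own illustration; the trees and names are invented, and preterminal-to-word rules are omitted for brevity):

    # Sketch of MLE rule estimation over annotated trees:
    # P(LHS -> RHS) = count(LHS -> RHS) / count(LHS), with no smoothing.
    from collections import defaultdict

    def count_rules(tree, rule_counts, lhs_counts):
        if not isinstance(tree, list) or (len(tree) == 2 and isinstance(tree[1], str)):
            return                                       # skip words and tag->word rewrites
        lhs, rhs = tree[0], tuple(child[0] for child in tree[1:])
        rule_counts[(lhs, rhs)] += 1
        lhs_counts[lhs] += 1
        for child in tree[1:]:
            count_rules(child, rule_counts, lhs_counts)

    def estimate(trees):
        rule_counts, lhs_counts = defaultdict(int), defaultdict(int)
        for t in trees:
            count_rules(t, rule_counts, lhs_counts)
        return {rule: c / lhs_counts[rule[0]] for rule, c in rule_counts.items()}

    trees = [["S", ["NP-U", ["PRP", "He"]], ["VP", ["VBD", "ran"]]],
             ["S", ["NP", ["DT", "The"], ["NN", "dog"]], ["VP", ["VBD", "ran"]]]]
    print(estimate(trees))
    # ('S', ('NP-U', 'VP')) -> 0.5 and ('S', ('NP', 'VP')) -> 0.5; all other rules -> 1.0

A split symbol that is too rare receives unreliable relative-frequency estimates, which is why the splits must be high-frequency distinctions.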
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Annotations Already in the Treebank </SectionTitle>
<Paragraph position="0"> At this point, one might wonder as to the wisdom of stripping off all treebank functional tags, only to heuristically add other such markings back in to the grammar. By and large, the treebank out-of-the-package tags, such as PP-LOC or ADVP-TMP, have negative utility. Recall that the raw treebank grammar, with no annotation or markovization, had an F1 of 72.62% on our development set. With the functional annotation left in, this drops to 71.49%. The h ≤ 2, v ≤ 1 markovization baseline of 77.77% dropped even further, all the way to 72.87%, when these annotations were included.</Paragraph>
<Paragraph position="1"> Nonetheless, some distinctions present in the raw treebank trees were valuable. For example, an NP with an S parent could be either a temporal NP or a subject. For the annotation TMP-NP, we retained the original -TMP tags on NPs, and, furthermore, propagated the tag down to the tag of the head of the NP. This is illustrated in figure 6, which also shows an example of its utility, clarifying that CNN last night is not a plausible compound and facilitating the otherwise unusual high attachment of the smaller NP.</Paragraph>
[Figure 6: (a) the incorrect baseline parse and (b) the correct TMP-NP parse.]
<Paragraph position="2"> TMP-NP brought the cumulative F1 to 82.25%. Note that this technique of pushing the functional tags down to preterminals might be useful more generally; for example, locative PPs expand roughly the same way as all other PPs (usually as IN NP), but they do tend to have different prepositions below IN.</Paragraph>
<Paragraph position="3"> A second kind of information in the original trees is the presence of empty elements. Following Collins (1999), the annotation GAPPED-S marks S nodes which have an empty subject (i.e., raising and control constructions). This brought F1 to 82.28%.</Paragraph> </Section>
<Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 7 Head Annotation </SectionTitle>
[Figure 7: (a) the incorrect baseline parse and (b) the correct SPLIT-VP parse.]
<Paragraph position="0"> The notion that the head word of a constituent can affect its behavior is a useful one. However, often the head tag is as good (or better) an indicator of how a constituent will behave.12 We found several head annotations to be particularly effective. First, possessive NPs have a very different distribution than other NPs: in particular, NP → NP α rules are only used in the treebank when the leftmost child is possessive (as opposed to other imaginable uses like for New York lawyers, which is left flat). To address this, POSS-NP marked all possessive NPs. This brought the total F1 to 83.06%. Second, the VP symbol is very overloaded in the Penn treebank, most severely in that there is no distinction between finite and infinitival VPs. An example of the damage this conflation can do is given in figure 7, where one needs to capture the fact that present-tense verbs do not generally take bare infinitive VP complements. To allow the finite/non-finite distinction, and other verb type distinctions, SPLIT-VP annotated all VP nodes with their head tag, merging all finite forms to a single tag VBF. In particular, this also accomplished Charniak's gerund-VP marking. This was extremely useful, bringing the cumulative F1 to 85.72%, a 2.66% absolute improvement (more than its solo improvement over the baseline).</Paragraph>
<Paragraph position="1"> Footnote 12: This is part of the explanation of why Charniak (2000) finds that early generation of head tags as in Collins (1999) is so beneficial. The rest of the benefit is presumably in the availability of the tags for smoothing purposes.</Paragraph> </Section>
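Head-driven splits such as SPLIT-VP need only the head tag, not the head word, so they too can be written as a small relabeling pass. The sketch below is ours and deliberately crude: it takes the first verbal child as the head instead of using real head-percolation rules, and the -VBF spelling of the merged finite tag is our notation.

    # Sketch of SPLIT-VP: mark each VP with (a collapsed form of) its head tag.
    FINITE = {"VBZ", "VBP", "VBD", "MD"}           # finite forms merged into one VBF value
    VERBAL = FINITE | {"VB", "VBG", "VBN", "TO"}   # crude stand-in for head-finding rules

    def head_tag(vp):
        for child in vp[1:]:
            if isinstance(child, list) and child[0] in VERBAL:
                return child[0]
        return None

    def split_vp(tree):
        if not isinstance(tree, list):
            return tree
        label = tree[0]
        if label == "VP":
            h = head_tag(tree)
            if h is not None:
                label = "VP-" + ("VBF" if h in FINITE else h)
        return [label] + [split_vp(c) for c in tree[1:]]

    example = ["VP", ["VBZ", "wants"], ["S", ["VP", ["TO", "to"], ["VP", ["VB", "go"]]]]]
    print(split_vp(example))
    # ['VP-VBF', ['VBZ', 'wants'], ['S', ['VP-TO', ['TO', 'to'], ['VP-VB', ['VB', 'go']]]]]

With finite and infinitival VPs carrying different symbols, a rule that puts a bare-infinitive VP complement under a present-tense verb receives little or no probability mass.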
<Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 8 Distance </SectionTitle>
<Paragraph position="0"> Error analysis at this point suggested that many remaining errors involved attachment level and conjunction scope. While these kinds of errors are undoubtedly profitable targets for lexical preference, most attachment mistakes were overly high attachments, indicating that the overall right-branching tendency of English was not being captured. Indeed, this tendency is a difficult trend to capture in a PCFG because often the high and low attachments involve the very same rules. Even if not, attachment height is not modeled by a PCFG unless it is somehow explicitly encoded into category labels. More complex parsing models have indirectly overcome this by modeling distance (rather than height).</Paragraph>
<Paragraph position="1"> Linear distance is difficult to encode in a PCFG: marking nodes with the size of their yields massively multiplies the state space.13 Therefore, we wish to find indirect indicators that distinguish high attachments from low ones. In the case of two PPs following an NP, with the question of whether the second PP is a second modifier of the leftmost NP or should attach lower, inside the first PP, the important distinction is usually that the lower site is a non-recursive base NP. Collins (1999) captures this by introducing the notion of a base NP, in which any NP which dominates only preterminals is marked with a -B. Further, if an NP-B does not have a non-base NP parent, it is given one with a unary production. This was helpful, but substantially less effective than marking base NPs without introducing the unary, whose presence actually erased a useful internal indicator: base NPs are more frequent in subject position than object position, for example. In isolation, the Collins method actually hurt the baseline (absolute cost to F1 of 0.37%), while skipping the unary insertion added an absolute 0.73% to the baseline, and brought the cumulative F1 to 86.04%.</Paragraph>
<Paragraph position="2"> In the case of attachment of a PP to an NP either above or inside a relative clause, the high NP is distinct from the low one in that the already modified one contains a verb (and the low one may be a base NP as well). This is a partial explanation of the utility of verbal distance in Collins (1999). To capture this, DOMINATES-V marks all nodes which dominate any verbal node (V*, MD) with a -V. This brought the cumulative F1 to 86.91%. We also tried marking nodes which dominated prepositions and/or conjunctions, but these features did not help the cumulative hill-climb.</Paragraph>
<Paragraph position="3"> Footnote 13: The inability to encode distance naturally in a naive PCFG is somewhat ironic. In the heart of any PCFG parser, the fundamental table entry or chart item is a label over a span, for example an NP from position 0 to position 5. The concrete use of a grammar rule is to take two adjacent span-marked labels and combine them (for example NP[0,5] and VP[5,12] into S[0,12]). Yet, only the labels are used to score the combination.</Paragraph>
<Paragraph position="4"> The final distance/depth feature we used was an explicit attempt to model depth, rather than use distance and linear intervention as a proxy. With RIGHT-REC-NP, we marked all NPs which contained another NP on their right periphery (i.e., as a right-most descendant). This captured some further attachment trends, and brought us to a final development F1 of 87.04%.</Paragraph> </Section>
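DOMINATES-V, like RIGHT-REC-NP, is a property that can be computed in one bottom-up pass over each tree. A rough sketch under the same conventions as the earlier snippets (the verbal-tag test is our loose reading of "V*, MD", and leaving the verbal preterminal itself unmarked is our choice):

    # Sketch of DOMINATES-V: add -V to every node dominating a verbal preterminal.
    def is_verbal_tag(tag):
        return tag.startswith("VB") or tag == "MD"     # loose reading of "V*, MD"

    def dominates_v(tree):
        """Return (annotated tree, does this subtree contain a verbal tag?)."""
        if not isinstance(tree, list):
            return tree, False
        label, children = tree[0], tree[1:]
        if len(children) == 1 and isinstance(children[0], str):
            return [label, children[0]], is_verbal_tag(label)   # preterminal left unmarked
        new_children, has_verb = [], False
        for c in children:
            new_c, child_has_verb = dominates_v(c)
            new_children.append(new_c)
            has_verb = has_verb or child_has_verb
        return [label + "-V" if has_verb else label] + new_children, has_verb

    example = ["NP", ["NP", ["DT", "the"], ["NN", "man"]],
                     ["SBAR", ["WHNP", ["WP", "who"]], ["S", ["VP", ["VBD", "left"]]]]]
    print(dominates_v(example)[0])
    # ['NP-V', ['NP', ['DT', 'the'], ['NN', 'man']], ['SBAR-V', ['WHNP', ['WP', 'who']],
    #  ['S-V', ['VP-V', ['VBD', 'left']]]]]

The already-modified (higher) NP thus carries a -V while a candidate low attachment site typically does not, which is exactly the high/low contrast the annotation is meant to expose.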
<Section position="9" start_page="0" end_page="0" type="metho"> <SectionTitle> 9 Final Results </SectionTitle>
<Paragraph position="0"> We took the final model and used it to parse section 23 of the treebank. Figure 8 shows the results. The test set F1 is 86.32% for ≤ 40 words, already higher than early lexicalized models, though of course lower than the state-of-the-art parsers.</Paragraph> </Section> </Paper>