Accurate Unlexicalized Parsing

2 Vertical and Horizontal Markovization

The traditional starting point for unlexicalized parsing is the raw n-ary treebank grammar read from training trees (after removing functional tags and null elements). This basic grammar is imperfect in two well-known ways. First, the category symbols are too coarse to adequately render the expansions independent of the contexts. For example, subject NP expansions are very different from object NP expansions: a subject NP is 8.7 times more likely than an object NP to expand as just a pronoun. Having separate symbols for subject and object NPs allows this variation to be captured and used to improve parse scoring. One way of capturing this kind of external context is to use parent annotation, as presented in Johnson (1998). For example, NPs with S parents (like subjects) will be marked NP^S, while NPs with VP parents (like objects) will be NP^VP.

The second basic deficiency is that many rule types have been seen only once (and therefore have their probabilities overestimated), and many rules which occur in test sentences will never have been seen in training (and therefore have their probabilities underestimated; see Collins (1999) for analysis). Note that in parsing with the unsplit grammar, not having seen a rule does not mean one gets a parse failure, but rather a possibly very weird parse (Charniak, 1996). One successful method of combating sparsity is to markovize the rules (Collins, 1999). In particular, we follow that work in markovizing out from the head child, despite the grammar being unlexicalized, because this seems the best way to capture the traditional linguistic insight that phrases are organized around a head (Radford, 1988).

Both parent annotation (adding context) and RHS markovization (removing it) can be seen as two instances of the same idea. In parsing, every node has a vertical history, including the node itself, parent, grandparent, and so on. A reasonable assumption is that only the past v vertical ancestors matter to the current expansion. Similarly, only the previous h horizontal ancestors matter (we assume that the head child always matters). It is a historical accident that the default notion of a treebank PCFG grammar takes v = 1 (only the current node matters vertically) and h = ∞ (rule right hand sides do not decompose at all). On this view, it is unsurprising that increasing v and decreasing h have historically helped.

As an example, consider the case of v = 1, h = 1, applied to a rule such as VP → VBZ NP PP. The rule is broken into several stages, each a binary or unary rule, which conceptually represent a head-outward generation of the right hand side, as shown in figure 1. The bottom layer will be a unary over the head declaring the goal: <VP: [VBZ]> → VBZ. The square brackets indicate that the VBZ is the head, while the angle brackets indicate that the symbol <X> is an intermediate symbol (equivalently, an active or incomplete state). The next layer up will generate the first rightward sibling of the head child: <VP: [VBZ]...NP> → <VP: [VBZ]> NP. Next, the PP is generated: <VP: [VBZ]...PP> → <VP: [VBZ]...NP> PP. We would then branch off left siblings if there were any.[7] Finally, we have another unary to finish the VP. Note that while it is convenient to think of this as a head-outward process, these are just PCFG rewrites, and so the actual scores attached to each rule will correspond to a downward generation order.

[7] In our system, the last few right children carry over as preceding context for the left children, distinct from common practice. We found this wrapped horizon to be beneficial, and it also unifies the infinite order model with the unmarkovized raw rules.
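To make the head-outward decomposition concrete, here is a minimal sketch that reproduces the stages above for VP → VBZ NP PP at h = 1. It is an illustration, not the paper's implementation: the function name and the (parent, children, head_index) rule representation are assumptions; the intermediate-symbol format follows the text, and the carry-over of right context onto left siblings mirrors footnote 7.

```python
def markovize_rule(parent, children, head_index, h=1):
    """Decompose an n-ary rule head-outward into unary/binary PCFG rules,
    keeping only the h most recent sibling labels in each intermediate
    symbol, e.g. <VP: [VBZ]...NP> for h = 1."""
    head = children[head_index]

    def state(generated):
        hist = generated[-h:] if h > 0 else []   # guard: [-0:] would keep everything
        suffix = "..." + " ".join(hist) if hist else ""
        return f"<{parent}: [{head}]{suffix}>"

    rules = []
    prev = state([])
    rules.append((prev, [head]))                 # unary over the head, declaring the goal
    hist = []
    for sib in children[head_index + 1:]:        # rightward siblings, inside out
        hist.append(sib)
        cur = state(hist)
        rules.append((cur, [prev, sib]))
        prev = cur
    for sib in reversed(children[:head_index]):  # then leftward siblings
        hist.append(sib)                         # right context wraps around (footnote 7)
        cur = state(hist)
        rules.append((cur, [sib, prev]))
        prev = cur
    rules.append((parent, [prev]))               # final unary finishes the parent
    return rules

for lhs, rhs in markovize_rule("VP", ["VBZ", "NP", "PP"], head_index=0):
    print(lhs, "->", " ".join(rhs))
# <VP: [VBZ]> -> VBZ
# <VP: [VBZ]...NP> -> <VP: [VBZ]> NP
# <VP: [VBZ]...PP> -> <VP: [VBZ]...NP> PP
# VP -> <VP: [VBZ]...PP>
```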
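The vertical half of the same idea, parent annotation (the v = 2 case), is a one-pass tree transform. Again a hedged sketch: the (label, children) tree representation is assumed, and preterminals are left unsplit here; only the NP^S / NP^VP naming comes from the text.

```python
def parent_annotate(tree, parent=None):
    """Split each phrasal category by its parent, as in Johnson (1998):
    an NP under S becomes NP^S, an NP under VP becomes NP^VP."""
    label, children = tree
    if isinstance(children, str):    # preterminal (tag, word): left unsplit in this sketch
        return (label, children)
    new_label = f"{label}^{parent}" if parent is not None else label
    return (new_label, [parent_annotate(child, label) for child in children])

tree = ("S", [("NP", [("PRP", "She")]),
              ("VP", [("VBZ", "sees"), ("NP", [("NN", "stars")])])])
print(parent_annotate(tree))
# ('S', [('NP^S', [('PRP', 'She')]),
#        ('VP^S', [('VBZ', 'sees'), ('NP^VP', [('NN', 'stars')])])])
```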
Figure 2 presents a grid of horizontal and vertical markovizations of the grammar. The raw treebank grammar corresponds to v = 1, h = ∞ (the upper right corner), while the parent annotation in (Johnson, 1998) corresponds to v = 2, h = ∞, and the second-order model in Collins (1999) is broadly a smoothed version of v = 2, h = 2. In addition to exact nth-order models, we tried variable-history models similar in intent to those described in Ron et al. (1994). For variable horizontal histories, we did not split intermediate states below 10 occurrences of a symbol. For example, if the symbol <VP: [VBZ]...PP PP> were too rare, we would collapse it to <VP: [VBZ]...PP> (a sketch of this backoff appears at the end of this section). For vertical histories, we used a cutoff which included both frequency and mutual information between the history and the expansions (this was not appropriate for the horizontal case because MI is unreliable at such low counts).

[Displaced caption fragment, apparently from a later annotation-results table: "...[an]notated models, starting with the markovized baseline. The right two columns show the change in F1 from the baseline for each annotation introduced, both cumulatively and for each single annotation applied to the baseline in isolation."]

Figure 2 shows parsing accuracies as well as the number of symbols in each markovization. These symbol counts include all the intermediate states which represent partially completed constituents.

The general trend is that, in the absence of further annotation, more vertical annotation is better, even exhaustive grandparent annotation. This is not true for horizontal markovization, where the variable-order second-order model was superior. The best entry, v = 3, h ≤ 2, has an F1 of 79.74, already a substantial improvement over the baseline.

In the remaining sections, we discuss other annotations which increasingly split the symbol space. Since we expressly do not smooth the grammar, not all splits are guaranteed to be beneficial, and not all sets of useful splits are guaranteed to co-exist well. In particular, while v = 3, h ≤ 2 markovization is good on its own, it has a large number of states and does not tolerate further splitting well. Therefore, we base all further exploration on the v ≤ 2, h ≤ 2 grammar. Although it does not necessarily jump out of the grid at first glance, this point represents the best compromise between a compact grammar and useful Markov histories.
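The sketch of the variable-horizontal-history backoff promised above: rare intermediate states fall back to successively shorter histories. The symbol format and the threshold of 10 come from the text; the function names and the iterative loop are our assumptions.

```python
from collections import Counter

def shorten(symbol):
    """Drop the oldest sibling from a state: <VP: [VBZ]...NP PP> becomes
    <VP: [VBZ]...PP>; with a single sibling left, the history is dropped."""
    prefix, _, hist = symbol.rstrip(">").partition("...")
    sibs = hist.split()
    return f"{prefix}...{' '.join(sibs[1:])}>" if len(sibs) > 1 else prefix + ">"

def backoff(symbol, counts, min_count=10):
    """Replace an intermediate state seen fewer than min_count times by
    shorter and shorter histories until one is attested often enough."""
    while counts[symbol] < min_count and "..." in symbol:
        symbol = shorten(symbol)
    return symbol

counts = Counter({"<VP: [VBZ]...PP PP>": 3, "<VP: [VBZ]...PP>": 120})
print(backoff("<VP: [VBZ]...PP PP>", counts))   # <VP: [VBZ]...PP>
```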