<?xml version="1.0" standalone="yes"?>
<Paper uid="W01-0714">
  <Title>References</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Search vs. Clustering
</SectionTitle>
    <Paragraph position="0"> Whether grammar induction is viewed as a search problem or a clustering problem is a matter of perspective, and the two views are certainly not mutually exclusive. The search view focuses on the recursive relationships between the non-terminals in the grammar. The clustering view, which is perhaps more applicable to the present work, focuses on the membership of (terminal) sequences in the classes represented by the non-terminals. For example, the non-terminal symbol NP can be thought of as a cluster of (terminal) sequences which can be generated starting from NP. This clustering is inherently soft, since sequences can be ambiguous.</Paragraph>
    <Paragraph position="1"> Unlike standard clustering tasks, though, a sequence token in a given sentence need not be a constituent at all. For example, DT NN is an extremely common NP, and when it occurs, it is a constituent around 82% of the time in the data. However, when it occurs as a subsequence of DT NN NN it is usually not a constituent. In fact, the difficult decisions for a supervised parser, such as attachment level or coordination scope, are decisions as to which sequences are constituents, not what their tags would be if they were. For example, DT NN IN DT NN is virtually always an NP when it is a constituent, but it is only a constituent 66% of the time, mostly because the PP, IN DT NN, is attached elsewhere.</Paragraph>
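The constituency rates cited above (DT NN a constituent about 82% of the time; DT NN IN DT NN about 66%) are simple corpus statistics. A minimal sketch of how such rates could be computed, using a hypothetical `(tags, spans)` encoding of treebank parses rather than the paper's actual data format:

```python
def constituency_rate(sentences, target):
    """Fraction of occurrences of a POS sequence that are constituents.

    sentences: list of (tags, spans) pairs, where tags is the POS
    sequence of a sentence and spans is the set of (i, j) constituent
    spans of its parse (j exclusive). Illustrative encoding only.
    """
    n = len(target)
    total = hits = 0
    for tags, spans in sentences:
        for i in range(len(tags) - n + 1):
            if tuple(tags[i:i + n]) == tuple(target):
                total += 1                  # the sequence occurred here
                if (i, i + n) in spans:
                    hits += 1               # ... and was a constituent
    return hits / total if total else 0.0

# Toy data: "DT NN" is a constituent alone, but not inside "DT NN NN".
toy = [
    (["DT", "NN", "VBZ"], {(0, 2), (2, 3), (0, 3)}),
    (["DT", "NN", "NN", "VBZ"], {(0, 3), (3, 4), (0, 4)}),
]
print(constituency_rate(toy, ["DT", "NN"]))  # constituent in 1 of 2 occurrences
```

On real treebank data the same loop, run over all frequent sequences, yields exactly the kind of percentages quoted in the text.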
    <Paragraph position="2"> One way to deal with this issue is to have an explicit class for &amp;quot;not a constituent&amp;quot; (see section 4.2). There are difficulties in modeling such a class, mainly stemming from the differences between this class and the constituent classes. In particular, this class will not be distributionally cohesive. Also, for example, DT NN and DT JJ NN being generally of category NP seems to be a highly distributional fact, while DT NN not being a constituent in the context DT NN NN seems more properly modeled by the competing productions of the grammar.</Paragraph>
    <Paragraph position="3"> Another approach is to model the non-constituents either implicitly or independently of the clustering model (see section 4.1). The drawback to insufficiently modeling non-constituency is that for acquisition systems which essentially work bottom-up, non-constituent chunks such as NN IN or IN DT are hard to rule out locally.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Systems
</SectionTitle>
    <Paragraph position="0"> We present two systems. The first, GREEDY-MERGE, learns symbolic CFGs for partial parsing. The rules it learns are of high quality (see figures 3 and 4), but parsing coverage is relatively shallow.</Paragraph>
    <Paragraph position="1"> The second, CONSTITUENCY-PARSER, learns distributions over sequences representing the probability that a constituent is realized as that sequence (see figure 1). It produces full binary parses.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 GREEDY-MERGE
</SectionTitle>
      <Paragraph position="0"> GREEDY-MERGE is a precision-oriented system which, to a first approximation, can be seen as an agglomerative clustering process over sequences.</Paragraph>
      <Paragraph position="1"> For each pair of sequences, a normalized divergence is calculated as follows:</Paragraph>
      <Paragraph position="3"> The pair with the least divergence is merged.2 Merging two sequences involves the creation of a single new non-terminal category which rewrites as either sequence. Once there are non-terminal categories, the definitions of sequences and contexts become slightly more complex. The input sentences are parsed with the previous grammar state, using a shallow parser which ties all parentless nodes together under a TOP root node. Sequences are then the ordered sets of adjacent sisters in this parse, and the context of a sequence can be either the preceding and following tags or a higher node in the tree. To illustrate, in figure 2, the sequence VBZ RB could be considered to be in either context [Z1 ... #] or [NN ... #]. Taking the highest potential context ([Z1 ... #] in this case) performed slightly better.3 Merging a sequence and a single non-terminal results in a rule which rewrites the non-terminal as the sequence (i.e., that sequence is added to that non-terminal's class), and merging two non-terminals involves collapsing the two symbols in the grammar (i.e., those classes are merged). After the merge, re-analysis of the grammar rule RHSs is necessary.</Paragraph>
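The merge-selection step can be sketched as follows. The paper's normalized divergence formula is not reproduced in this section, so the sketch substitutes the symmetric Jensen-Shannon divergence between context distributions as a stand-in; `context_dists` and its toy contents are hypothetical:

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two context distributions
    (dicts mapping context -> probability). A stand-in for the paper's
    normalized divergence, which is not shown in this section."""
    m = {c: 0.5 * (p.get(c, 0.0) + q.get(c, 0.0)) for c in set(p) | set(q)}
    def kl(a):
        return sum(a[c] * math.log(a[c] / m[c]) for c in a if a[c] > 0)
    return 0.5 * (kl(p) + kl(q))

def best_merge(context_dists):
    """One GREEDY-MERGE step: pick the pair of sequences whose context
    distributions have the least divergence."""
    seqs = list(context_dists)
    best, best_div = None, float("inf")
    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            d = js_divergence(context_dists[seqs[i]], context_dists[seqs[j]])
            if d < best_div:
                best, best_div = (seqs[i], seqs[j]), d
    return best, best_div

# Toy context distributions: the two NP-like sequences share contexts.
dists = {
    "DT NN":    {"#_VBZ": 0.6, "IN_#": 0.4},
    "DT JJ NN": {"#_VBZ": 0.5, "IN_#": 0.5},
    "MD VB":    {"NN_#": 0.9, "NNS_#": 0.1},
}
pair, div = best_merge(dists)
print(pair)  # the two NP-like sequences are chosen for merging
```

After each such merge the new non-terminal's context distribution would be re-estimated from the re-parsed corpus, as described above.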
      <Paragraph position="4"> An important point about GREEDY-MERGE is that stopping the system at the correct point is critical. Since our greedy criterion is not a measure over entire grammar states, we have no way to detect the optimal point beyond heuristics (the same category appears in several merges in a row, for example) or by using a small supervision set to detect a parse performance drop. The figures shown are from stopping the system manually just before the first significant drop in parsing accuracy.</Paragraph>
      <Paragraph position="5"> Footnote 2: We required that the candidates be among the 250 most frequent sequences. The exact threshold was not important, but without some threshold, long singleton sequences with zero divergence are always chosen. This suggests that we need a greater bias towards quantity of evidence in our basic method. Footnote 3: An option which was not tried would be to consider a non-terminal as a distribution over the tags of the right or left corners of the sequences belonging to that non-terminal.</Paragraph>
      <Paragraph position="6"> The grammar rules produced by the system are a strict subset of general CFG rules in several ways.</Paragraph>
      <Paragraph position="7"> First, no unary rewriting is learned. Second, no non-terminals which have only a single rewrite are ever proposed, though this situation can occur as a result of later merges. The effect of these restrictions is discussed below.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 CONSTITUENCY-PARSER
</SectionTitle>
      <Paragraph position="0"> The second system, CONSTITUENCY-PARSER, is recall-oriented. Unlike GREEDY-MERGE, this system always produces a full, binary parse of each input sentence. However, its parsing behavior is secondary. It is primarily a clustering system which views the data as the entire set of (sequence, context) pairs (σ, c) that occurred in the sentences. Each pair token comes from some specific sentence and is classified with a binary judgement b of that token's constituency in that sentence. We assume that these pairs are generated by the following model:</Paragraph>
      <Paragraph position="2"> We use EM to maximize the likelihood of these pairs given the hidden judgements b, subject to the constraints that the judgements for the pairs from a given sentence must form a valid binary parse.</Paragraph>
      <Paragraph position="3"> Initialization was done either by giving initial seeds for the probabilities above or by forcing a certain set of parses on the first round. To do the re-estimation, we must have some method of deciding which binary bracketing to prefer. The chance of a pair (σ, c) being a constituent is</Paragraph>
      <Paragraph position="5"> and we score a tree T by the likelihood product of its judgements b(σ, T). The best tree is then</Paragraph>
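A plausible rendering of the model and scoring criteria described here, under assumed notation ((σ, c) for a pair, b ∈ {0, 1} for its constituency judgement, T for a tree):

```latex
% Model over (sequence, context) pairs with hidden judgement b:
P(\sigma, c, b) = P(b)\, P(\sigma \mid b)\, P(c \mid b)

% The chance of a pair being a constituent:
P(b = 1 \mid \sigma, c)
  = \frac{P(b{=}1)\, P(\sigma \mid b{=}1)\, P(c \mid b{=}1)}
         {\sum_{b'} P(b')\, P(\sigma \mid b')\, P(c \mid b')}

% The best tree maximizes the product of its judgement likelihoods:
T^{*} = \operatorname*{argmax}_{T} \prod_{(\sigma, c)} P\bigl(b(\sigma, T) \mid \sigma, c\bigr)
```

This pair-wise factorization is what makes the objective a random field over parses rather than a generative tree model, as the following paragraph notes.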
      <Paragraph position="7"> As we are considering each pair independently from the rest of the parse, this model does not correspond to a generative model of the kind standardly associated with PCFGs, but can be seen as a random field over the possible parses, with the features being the sequences and contexts (see (Abney, 1997)). However, note that we were primarily interested in the clustering behavior, not the parsing behavior, and that the random field parameters have not been fit to any distribution over trees. The parsing model is very crude, primarily serving to eliminate systematically mutually incompatible analyses.</Paragraph>
      <Paragraph position="8">  Since this system does not postulate any non-terminal symbols, but works directly with terminal sequences, sparsity will be extremely severe for any reasonably long sequences. Substantial smoothing was done to all terms; for the P(b | σ) estimates we interpolated the previous counts equally with a uniform P(b); otherwise most sequences would remain locked in their initial behaviors. This heavy smoothing made rare sequences behave primarily according to their contexts, removed the initial invariance problem, and, after a few rounds of re-estimation, had little effect on parser performance.</Paragraph>
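The smoothing described here, interpolating the count-based estimate of a sequence's constituency probability equally with a uniform prior (1/2 for a binary judgement), can be sketched as follows; `smoothed_b_given_seq` is a hypothetical name, not the paper's code:

```python
def smoothed_b_given_seq(constituent_count, total_count):
    """Smoothed estimate of P(b=1 | sigma): interpolate the count-based
    (MLE) estimate equally with a uniform P(b) = 1/2, so that rare
    sequences are driven mainly by their contexts. A sketch of the
    interpolation described above, under assumed notation."""
    if total_count == 0:
        return 0.5                      # unseen sequence: pure prior
    mle = constituent_count / total_count
    return 0.5 * mle + 0.5 * 0.5

# A sequence seen once, always as a constituent, is pulled toward 1/2:
print(smoothed_b_given_seq(1, 1))       # 0.75
# A frequent, reliably constituent sequence keeps a higher estimate:
print(smoothed_b_given_seq(82, 100))    # close to 0.66
```

The equal-weight interpolation is what prevents a sequence's first-round judgements from locking in during subsequent EM rounds.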
      <Paragraph position="9">  CONSTITUENCY-PARSER's behavior is determined by the initialization it is given, either by initial parameter estimates or by fixed first-round parses. We used four methods: RANDOM, ENTROPY, RIGHTBRANCH, and GREEDY.</Paragraph>
      <Paragraph position="10"> For RANDOM, we initially parsed randomly. For ENTROPY, we weighted P(b | σ) proportionally to the entropy H(p(σ)). For RIGHTBRANCH, we forced right-branching structures (thereby introducing a bias towards English structure). Finally, GREEDY used the output from GREEDY-MERGE (using the grammar state in figure 3) to parse initially.</Paragraph>
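Of the four initializations, RIGHTBRANCH is the easiest to make concrete: every sentence is forced into a fully right-branching binary tree. A minimal sketch, where the (i, j)-with-j-exclusive span encoding is an assumption for illustration:

```python
def right_branching_spans(n):
    """Constituent spans of a fully right-branching binary parse over
    n terminals: the single-terminal spans (i, i+1) plus the nested
    right spans (0, n), (1, n), ..., (n-2, n). A sketch of what the
    RIGHTBRANCH initializer forces on the first round."""
    spans = {(i, i + 1) for i in range(n)}      # terminals
    spans |= {(i, n) for i in range(n - 1)}     # nested right spine
    return spans

# A 4-word sentence gets 2n - 1 = 7 nodes in its forced binary parse.
print(sorted(right_branching_spans(4)))
```

Forcing these spans on the first round biases the judgements toward English-like structure, exactly the bias the text notes.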
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Results
</SectionTitle>
    <Paragraph position="0"> Two kinds of results are presented. First, we discuss the grammars learned by GREEDY-MERGE and the constituent distributions learned by CONSTITUENCY-PARSER. Then we apply both systems to parsing free text from the WSJ section of the Penn treebank.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Grammars learned by GREEDY-MERGE
</SectionTitle>
      <Paragraph position="0"> Figure 3 shows a grammar learned at one stage of a run of GREEDY-MERGE on the WSJ sentences of up to 10 words, after the removal of punctuation (≈7500 sentences). The non-terminal categories proposed by the system are internally given arbitrary designations, but we have relabeled them to indicate the best recall match for each.</Paragraph>
      <Paragraph position="1"> Categories corresponding to NP, VP, PP, and S are learned, although some are split into sub-categories (transitive and intransitive VPs, proper NPs and two N-bar or zero-determiner NP kinds of common NPs, and so on).4 Provided one is willing to accept a verb-group analysis, this grammar seems sensible, though quite a few constructions, such as relative clauses, are missing entirely. Figure 4 shows a grammar learned at one stage of a run when verbs were split by transitivity. This grammar is similar, but also includes analyses of sentential coordination, adverbials, and subordinate clauses. The only rule in this grammar which seems overly suspect is ZVP → IN ZS, which analyzes complementized subordinate clauses as VPs.</Paragraph>
      <Paragraph position="4"> In general, the major mistakes the GREEDY-MERGE system makes are of three sorts: • Mistakes of omission. Even though the grammar shown has correct, recursive analyses of many categories, no rule can non-trivially incorporate a number (CD). There is also no analysis for many common constructions.</Paragraph>
      <Paragraph position="5"> • Alternate analyses. The system almost invariably forms verb groups, merging MD VB sequences with single main verbs to form verb-group constituents (argued for at times by some linguists (Halliday, 1994)). Also, PPs are sometimes attached to NPs below determiners (which is in fact a standard linguistic analysis (Abney, 1987)). It is not always clear whether these analyses should be considered mistakes.</Paragraph>
      <Paragraph position="6"> • Over-merging. These errors are the most serious. Since at every step two sequences are merged, the process will eventually learn the grammar where X → X X and X → (any terminal). However, very incorrect merges are sometimes made relatively early on (such as merging VPs with PPs, or merging the sequences IN NNP IN and IN).</Paragraph>
      <Paragraph position="7"> Footnote 4: Splits often occur because unary rewrites are not learned in the current system.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 CONSTITUENCY-PARSER's Distributions
</SectionTitle>
      <Paragraph position="0"> The CONSTITUENCY-PARSER's state is not a symbolic grammar but a set of constituency estimates for terminal sequences. These distributions, while a less compelling representation of syntactic knowledge than CFGs, clearly have significant facts about language embedded in them, and accurately learning them can be seen as a kind of acquisition.</Paragraph>
      <Paragraph position="1"> Figure 5 shows the sequences whose constituency counts are most incorrect for the GREEDY-RE setting. An interesting analysis given by the system is the constituency of NNP POS NN sequences as NNP (POS NN) which is standard in linguistic analyses (Radford, 1988), as opposed to the treebank's systematic (NNP POS) NN. Other common errors, like the overcount of JJ NN or JJ NNS are partially due to parsing inside NPs which are flat in the treebank (see section 5.3).</Paragraph>
      <Paragraph position="2"> It is informative to see how re-estimation with CONSTITUENCY-PARSER improves and worsens the GREEDY-MERGE initial parses. Coverage is improved; for example, NPs and PPs involving the CD tag are consistently parsed as constituents, while GREEDY-MERGE did not include them in parses at all. On the other hand, the GREEDY-MERGE system had learned the standard subject-verb-object attachment order, though this has disappeared, as can be seen in the undercounts of VP sequences. Since many VPs did not fit the conservative VP grammar in figure 3, subjects and verbs were often grouped together even on the initial parses, and the CONSTITUENCY-PARSER has a further bias towards over-identifying frequent constituents.</Paragraph>
      <Paragraph position="3"> [Figure 5 caption: sequences identified as constituents by CONSTITUENCY-PARSER using GREEDY-RE (ENTROPY-RE is similar). &amp;quot;Total&amp;quot; is the frequency of the sequence in the flat data. &amp;quot;True&amp;quot; is the frequency as a constituent in the treebank's parses. &amp;quot;Estimated&amp;quot; is the frequency as a constituent in the system's parses.]</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 Parsing results
</SectionTitle>
      <Paragraph position="0"> Some issues impact the way the results of parsing treebank sentences should be interpreted. Both systems, but especially the CONSTITUENCY-PARSER, tend to form verb groups and often attach the subject below the object for transitive verbs. Because of this, certain VPs are systematically incorrect and VP accuracy suffers dramatically, substantially pulling down the overall figures.5 Secondly, the treebank's grammar is an imperfect standard for an unsupervised learner. For example, transitive sentences are bracketed [subject [verb object]] (&amp;quot;The president [executed the law]&amp;quot;) while nominalizations are bracketed [[possessive noun] complement] (&amp;quot;[The president's execution] of the law&amp;quot;), an arbitrary inconsistency which is unlikely to be learned automatically. The treebank is also, somewhat purposefully, very flat. For example, there is no analysis of the inside of many short noun phrases. The GREEDY-MERGE grammars above, however, give a (correct) analysis of the insides of NPs like DT JJ NN NN, for which they will be penalized in terms of unlabeled precision (though not crossing brackets) when compared to the treebank.</Paragraph>
      <Paragraph position="1"> An issue with GREEDY-MERGE is that the grammar learned is symbolic, not probabilistic. Any disambiguation is done arbitrarily. Therefore, even adding a linguistically valid rule can degrade numerical performance (sometimes dramatically) by introducing ambiguity to a greater degree than it improves coverage.</Paragraph>
      <Paragraph position="2"> In figure 6, we report summary results for each system on the ≤10-word sentences of the WSJ section. GREEDY is the above snapshot of the GREEDY-MERGE system. RANDOM, ENTROPY, and RIGHTBRANCH are the behaviors of the random-parse baseline, the right-branching baseline, and the entropy-scored initialization for CONSTITUENCY-PARSER. The -RE settings are the result of context-based re-estimation from the respective baselines using CONSTITUENCY-PARSER.6 NCB precision is the percentage of proposed brackets which do not cross a correct bracket.</Paragraph>
      <Paragraph position="3"> Footnote 5: The RIGHTBRANCH baseline is in the opposite situation. Its high overall figures are in large part due to extremely high VP accuracy, while NP and PP accuracy (which is more important for tasks such as information extraction) is very low. Footnote 6: RIGHTBRANCH was invariant under re-estimation, and RIGHTBRANCH-RE is therefore omitted.</Paragraph>
      <Paragraph position="4"> Recall is also shown separately for VPs and NPs to illustrate the VP effect noted above.</Paragraph>
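NCB precision as defined here can be computed from span sets alone. A sketch with a toy gold/proposed example; the crossing test is the standard one (two spans overlap but neither contains the other):

```python
def crosses(span, gold_spans):
    """True if span (i, j) crosses any gold bracket: the two spans
    overlap but neither contains the other."""
    i, j = span
    return any(i < a < j < b or a < i < b < j for a, b in gold_spans)

def ncb_precision(proposed, gold):
    """Percentage of proposed brackets that do not cross a gold
    bracket (a sketch of the NCB metric described above)."""
    if not proposed:
        return 0.0
    ok = sum(1 for s in proposed if not crosses(s, gold))
    return 100.0 * ok / len(proposed)

gold = {(0, 2), (2, 5), (0, 5)}      # e.g. [subject] [verb object]
proposed = [(0, 2), (1, 3), (0, 5)]  # (1, 3) crosses both gold VPs
print(ncb_precision(proposed, gold))
```

Note that a non-crossing bracket need not be correct (it may simply be a finer-grained analysis, as with the flat treebank NPs discussed above), which is why NCB precision is more forgiving than exact-match precision.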
      <Paragraph position="5"> The general results are encouraging. GREEDY is, as expected, higher precision than the other settings. Re-estimation from that initial point improves recall at the expense of precision. In general, re-estimation improves parse accuracy, despite the indirect relationship between the criterion being maximized (constituency cluster fit) and parse quality.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Limitations of this study
</SectionTitle>
    <Paragraph position="0"> This study presents preliminary investigations and has several significant limitations.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.1 Tagged Data
</SectionTitle>
      <Paragraph position="0"> A possible criticism of this work is that it relies on part-of-speech tagged data as input. In particular, while there has been work on acquiring parts-of-speech distributionally (Finch et al., 1995; Schütze, 1995), it is clear that manually constructed tag sets and taggings embody linguistic facts which are not generally detected by a distributional learner. For example, transitive and intransitive verbs are identically tagged yet distributionally dissimilar.</Paragraph>
      <Paragraph position="1"> In principle, an acquisition system could be designed to exploit non-distributionality in the tags.</Paragraph>
      <Paragraph position="2"> For example, verb subcategorization or selection could be induced from the ways in which a given lexical verb's distribution differs from the average, as in (Resnik, 1993). However, rather than being exploited by the systems here, the distributional nonunity of these tags appears to actually degrade performance. As an example, the systems more reliably group verbs and their objects together (rather than verbs and their subjects) when transitive and intransitive verbs are given separate tags.</Paragraph>
      <Paragraph position="3"> Future experiments will investigate the impact of distributional tagging, but, despite the degradation in tag quality that one would expect, it is also possible that some current mistakes will be corrected.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.2 Individual system limitations
</SectionTitle>
      <Paragraph position="0"> For GREEDY-MERGE, the primary limitations are that there is no clear halting condition, there is no ability to un-merge or to stop merging existing classes while still increasing coverage, and the system is potentially very sensitive to the tagset used. For CONSTITUENCY-PARSER, the primary limitations are that no labels or recursive grammars are learned, and that the behavior is highly dependent on initialization.</Paragraph>
    </Section>
  </Section>
</Paper>