<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1009"> <Title>Transformational Priors Over Grammars</Title> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> 1 A Sketch of the Concrete Problem </SectionTitle> <Paragraph position="0"> This paper uses a new kind of statistical model to smooth the probabilities of PCFG rules. It focuses on "flat" or "dependency-style" rules. These resemble subcategorization frames, but include adjuncts as well as arguments.</Paragraph> <Paragraph position="1"> The verb put typically generates 3 dependents--a subject NP at left, and an object NP and goal PP at right:
S → NP put NP PP: Jim put [the pizza] [in the oven]
But put may also take other dependents, in other rules:
S → NP Adv put NP PP: Jim often put [a pizza] [in the oven]
S → NP put NP PP PP: Jim put soup [in an oven] [at home]
S → NP put NP: Jim put [some shares of IBM stock]
S → NP put Prt NP: Jim put away [the sauce]
S → TO put NP PP: to put [the pizza] [in the oven]
S → NP put NP PP SBAR: Jim put it [to me] [that …]
These other rules arise because put can add, drop, reorder, or retype its dependents. These edit operations on rules are semantically motivated and quite common (Table 1).</Paragraph> <Paragraph position="2"> We wish to learn contextual probabilities for the edit operations, based on an observed sample of flat rules. In English we should discover, for example, that it is quite common to add or delete PP at the right edge of a rule.</Paragraph> <Paragraph position="3"> These contextual edit probabilities will help us guess the true probabilities of novel or little-observed rules.</Paragraph> <Paragraph position="4"> However, rules are often idiosyncratic. Our smoothing method should not keep us from noticing (given enough evidence) that put takes a PP more often than most verbs. Hence this paper's proposal is a Bayesian smoothing method that allows idiosyncrasy in the grammar while presuming regularity to be more likely a priori. The model will assign a positive probability to each of the infinitely many formally possible rules. The following bizarre rule is not observed in training, and seems very unlikely. But there is no formal reason to rule it out, and it might help us parse an unlikely test sentence. So the model will allow it some tiny probability:
S → NP Adv PP put PP PP PP NP AdjP S</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Background and Other Approaches </SectionTitle> <Paragraph position="0"> A PCFG is a conditional probability function p(RHS | LHS).1 For example, p(V NP PP | VP) gives the probability of the rule VP → V NP PP. With lexicalized nonterminals, it has the form p(V_put NP_pizza PP_in | VP_put).</Paragraph> <Paragraph position="1"> Usually one makes an independence assumption and defines this as p(V_put NP PP | VP_put) times factors that choose the dependent headwords pizza and in according to the selectional preferences of put. This paper is about estimating the first factor, p(V_put NP PP | VP_put).</Paragraph> <Paragraph position="2"> In supervised learning, it is simplest to use a maximum likelihood estimate (perhaps with backoff from put). Charniak (1997) calls this a "Treebank grammar" and gambles that assigning 0 probability to rules unseen in training data will not hurt parsing accuracy too much.</Paragraph> <Paragraph position="3"> However, there are four reasons not to use a Treebank grammar. First, ignoring unseen rules necessarily sacrifices some accuracy.
Second, we will show that it improves accuracy to flatten the parse trees and use flat, dependency-style rules like p(NP put NP PPjSput); this avoids overly strong independence assumptions, but it increases the number of unseen rules and so makes Treebank grammars less tenable. Third, backing off from the word is a crude technique that does not distinguish among words.2 Fourth, one would eventually like to reduce or eliminate supervision, and then generalization is important to constrain the search to reasonable grammars.</Paragraph> <Paragraph position="4"> To smooth the distribution p(RHSjLHS), one can define it in terms of a set of parameters and then estimate those parameters. Most researchers have used an n-gram model (Eisner, 1996; Charniak, 2000) or more general Markov model (Alshawi, 1996) to model the sequence of nonterminals in the RHS. The sequence Vput NP PP in our example is then assumed to be emitted by some Markov model of VPput rules (again with backoff from put). Collins (1997, model 2) uses a more sophisticated model in which all arguments in this sequence are generated jointly, as in a Treebank grammar, and then a Markov process is used to insert adjuncts among the arguments.</Paragraph> <Paragraph position="5"> While Treebank models overfit the training data, Markov models underfit. A simple compromise (novel to this paper) is a hybrid Treebank/Markov model, which backs off from a Treebank model to a Markov. Like this paper's main proposal, it can learn well-observed idiosyncratic rules but generalizes when data are sparse.3 These models are beaten by our rather different model, transformational smoothing, which learns common rules and common edits to them. The comparison is a direct one, based on the perplexity or cross-entropy of the trained models on a test set of S! rules.4 A subtlety is that two annotation styles are possible. In the Penn Treebank, put is the head of three constituents (V, VP, and S, where underlining denotes a head child) and joins with different dependents at different levels: [S [NP Jim] [VP [V put] [NP pizza] [PP in the oven]]] In the flattened or dependency version that we prefer, each word joins with all of its dependents at once: [S [NP Jim] put [NP pizza] [PP in the oven]] A PCFG generating the flat structure must estimate p(NP put NP PP j Sput). A non-flat PCFG adds the dependents of put in 3 independent steps, so in effect it factors the flat rule's probability into 3 supposedly independent &quot;subrule probabilities,&quot; p(NP VPput j Sput) p(Vput NP PPjVPput) p(putjVput).</Paragraph> <Paragraph position="6"> Our evaluation judges the estimates of flat-rule probabilities. Is it better to estimate these directly, or as a product of estimated subrule probabilities?5 Transformational smoothing is best applied to the former, so that the edit operations can freely rearrange all of a word's dependents. We will see that the Markov and Treebank/Markov models also work much better this way--a useful finding.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 The Abstract Problem: Designing Priors </SectionTitle> <Paragraph position="0"> This section outlines the Bayesian approach to learning probabilistic grammars (for us, estimating a distribution over flat CFG rules). 
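Before continuing with the Bayesian setup, the contrast drawn in Section 2 can be made concrete. The sketch below is an illustration only, not the paper's code: the counts and the probability tables are invented, and the function names are hypothetical. It estimates the same flat-rule probability two ways, directly from flat rules and as a product of "subrule" probabilities under the non-flat factorization.

    from collections import Counter

    # Toy counts for the headword "put" (invented for illustration only).
    flat_counts = Counter({
        ("NP", "put", "NP", "PP"): 6,   # S_put -> NP put NP PP
        ("NP", "put", "NP"): 3,         # S_put -> NP put NP
        ("TO", "put", "NP", "PP"): 1,   # S_put -> TO put NP PP
    })

    def p_flat(rhs):
        """Direct maximum-likelihood estimate of p(RHS | S_put) from flat rules."""
        total = sum(flat_counts.values())
        return flat_counts[rhs] / total

    # Non-flat factorization: p(NP VP_put | S_put) * p(V_put NP PP | VP_put) * p(put | V_put).
    # Each factor is estimated separately, under the independence assumption
    # criticized in Section 2.  These numbers are also invented.
    p_S_expansion  = {("NP", "VP"): 0.9, ("TO", "VP"): 0.1}      # p(... | S_put)
    p_VP_expansion = {("V", "NP", "PP"): 0.7, ("V", "NP"): 0.3}  # p(... | VP_put)
    p_V_is_put     = 1.0                                         # p(put | V_put)

    def p_nonflat():
        return p_S_expansion[("NP", "VP")] * p_VP_expansion[("V", "NP", "PP")] * p_V_is_put

    print(p_flat(("NP", "put", "NP", "PP")))  # direct flat estimate: 0.6
    print(p_nonflat())                        # product of subrule estimates: 0.63

The two estimates need not agree; the paper's evaluation asks which style of estimate better predicts held-out flat rules.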
By choosing among the many grammars that could have generated the training data, the learner is choosing how to generalize to novel sentences.</Paragraph> <Paragraph position="1"> To guide the learner's choice, one can explicitly specify a prior probability distribution p( ) over possible grammars , which themselves specify probability distributions over strings, rules, or trees. A learner should seek that maximizes p( ) p(D j ), where D is the set of strings, rules, or trees observed by the learner. The first factor favors regularity (&quot;pick an a priori plausible grammar&quot;), while the second favors fitting the idiosyncrasies of the data, especially the commonest data.6 to evaluate rule distributions that they acquired from an automatically-parsed treebank.</Paragraph> <Paragraph position="2"> Priors can help both unsupervised and supervised learning. (In the semi-supervised experiments here, training data is not raw text but a sparse sample of flat rules.) Indeed a good deal of syntax induction work has been carried out in just this framework (Stolcke and Omohundro, 1994; Chen, 1996; De Marcken, 1996; Gr&quot;unwald, 1996; Osborne and Briscoe, 1997). However, all such work to date has adopted rather simple prior distributions. Typically, it has definedp( ) to favor PCFGs whose rules are few, short, nearly equiprobable, and defined over a small set of nonterminals. Such definitions are convenient, especially when specifying an encoding for MDL, but since they treat all rules alike, they may not be good descriptions of linguistic plausibility. For example, they will never penalize the absence of a predictable rule.</Paragraph> <Paragraph position="3"> A prior distribution can, however, be used to encode various kinds of linguistic notions. After all, a prior is really a soft form of Universal Grammar: it gives the learner enough prior knowledge of grammar to overcome Chomsky's &quot;poverty of the stimulus&quot; (i.e., sparse data). A preference for small or simple grammars, as above.</Paragraph> <Paragraph position="4"> Substantive preferences, such as a preference for verbs to take 2 nominal arguments, or to allow PP adjuncts.</Paragraph> <Paragraph position="5"> Preferences for systematicity, such as a preference for the rules to be consistently head-initial or head-final.</Paragraph> <Paragraph position="6"> This paper shows how to design a prior that favors a certain kind of systematicity. Lexicalized grammars for natural languages are very large--each word specifies a distribution over all possible dependency rules it could head--but they tend to have internal structure. The new prior prefers grammars in which a rule's probability can be well-predicted from the probabilities of other rules, using linguistic transformations such as edit operations.</Paragraph> <Paragraph position="7"> For example, p(NP Adv w put NP PPjSw) correlates with p(NP w NP PPjSw). Both numbers are high for w = put, medium for w = fund, and low for w = sleep. The slope of the regression line has to do with the rate of preverbal Adv-insertion in English.</Paragraph> <Paragraph position="8"> The correlation is not perfect (some verbs are especially prone to adverbial modification), which is why we will only model it with a prior. 
To just the extent that evidence aboutw is sparse, the prior will cause the learner to smooth the two probabilities toward the regression line.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Patterns Worth Modeling </SectionTitle> <Paragraph position="0"> Before spelling out our approach, let us do a sanity check.</Paragraph> <Paragraph position="1"> A frame is a flat rule whose headword is replaced with teriori learning, since it is equivalent to maximizing p( jD).</Paragraph> <Paragraph position="2"> It is also equivalent to Minimum Description Length (MDL) learning, which minimizes the total number of bits'( )+'(Dj ) needed to encode grammar and data, because one can choose an encoding scheme where '(x) = log2 p(x), or conversely, define probability distributions by p(x) = 2 '(x).</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> MI MI MI </SectionTitle> <Paragraph position="0"> 9.01 [NP ADJP-PRD] [NP RB ADJP-PRD] 4.76 [TO S] [ S] 5.54 [TO NP PP] [NP TO NP] 8.65 [NP ADJP-PRD] [NP PP-LOC-PRD] 4.17 [TO S] [TO NP PP] 5.25 [TO NP PP] [NP MD NP .] 8.01 [NP ADJP-PRD] [NP NP-PRD] 2.77 [TO S] [TO NP] 4.67 [TO NP PP] [NP MD NP] 7.69 [NP ADJP-PRD] [NP ADJP-PRD .] 6.13 [TO NP] [TO NP SBAR-TMP] 4.62 [TO NP PP] [TO ] 8.49 [NP NP-PRD] [NP NP-PRD .] 5.72 [TO NP] [TO NP PP PP] 3.19 [TO NP PP] [TO NP] 7.91 [NP NP-PRD] [NP ADJP-PRD .] 5.36 [TO NP] [NP MD RB NP] 2.05 [TO NP PP] [ NP] 7.01 [NP NP-PRD] [NP ADJP-PRD] 5.16 [TO NP] [TO NP PP PP-TMP] 5.08 [ NP] [ADVP-TMP NP] 8.45 [NP ADJP-PRD .] [NP PP-LOC-PRD] 5.11 [TO NP] [TO NP ADVP] 4.86 [ NP] [ADVP NP] 8.30 [NP ADJP-PRD .] [NP NP-PRD .] 4.85 [TO NP] [TO NP PP-LOC] 4.53 [ NP] [ NP PP-LOC] 8.04 [NP ADJP-PRD .] [NP NP-PRD] 4.84 [TO NP] [MD NP] 3.50 [ NP] [ NP PP] 7.01 [NP ADJP-PRD .] [NP ADJP-PRD] 4.49 [TO NP] [NP TO NP] 3.17 [ NP] [ S] 7.01 [NP SBAR] [NP SBAR . &quot;] 4.36 [TO NP] [NP MD S] 2.28 [ NP] [NP NP] 4.75 [NP SBAR] [NP SBAR .] 4.36 [TO NP] [NP TO NP PP] 1.89 [ NP] [NP NP .] 6.94 [NP SBAR .] [&quot; NP SBAR .] 4.26 [TO NP] [NP MD NP PP] 2.56 [NP NP] [NP NP .] 5.94 [NP SBAR .] [NP SBAR . &quot;] 4.26 [TO NP] [TO NP PP-TMP] 2.20 [NP NP] [ NP] 5.90 [NP SBAR .] [S , NP .] 4.21 [TO NP] [TO PRT NP] 4.89 [NP NP .] [NP ADVP-TMP NP .] 5.82 [NP SBAR .] [NP ADVP SBAR .] 4.20 [TO NP] [NP MD NP] 4.57 [NP NP .] [NP ADVP NP .] 4.68 [NP SBAR .] [ SBAR] 3.99 [TO NP] [TO NP PP] 4.51 [NP NP .] [NP NP PP-TMP] 4.50 [NP SBAR .] [NP SBAR] 3.69 [TO NP] [NP MD NP .] 3.35 [NP NP .] [NP S .] 3.23 [NP SBAR .] [NP S .] 3.60 [TO NP] [TO ] 2.99 [NP NP .] [NP NP] 2.07 [NP SBAR .] [NP ] 3.56 [TO NP] [TO PP] 2.96 [NP NP .] [NP NP PP .] 1.91 [NP SBAR .] [NP NP .] 2.56 [TO NP] [NP NP PP] 2.25 [NP NP .] [ NP PP] 1.63 [NP SBAR .] [NP NP] 2.04 [TO NP] [NP S] 2.20 [NP NP .] [ NP] 4.52 [NP S] [NP S .] 1.99 [TO NP] [NP NP] 4.82 [NP S .] [ S] 4.27 [NP S] [ S] 1.69 [TO NP] [NP NP .] 4.58 [NP S .] [NP S] 3.36 [NP S] [NP ] 1.68 [TO NP] [NP NP PP .] 3.30 [NP S .] [NP ] 2.66 [NP S] [NP NP .] 1.03 [TO NP] [ NP] 2.93 [NP S .] [NP NP .] 2.37 [NP S] [NP NP] 4.75 [S , NP .] [NP SBAR .] 2.28 [NP S .] [NP NP] the position, then S! also tends to appear at least once with that headword. MI measures the mutual information of these two events, computed over all words. 
When MI is large, as here, the edit distance between and tends to be strikingly small (1 or 2), and certain linguistically plausible edits are extremely common.</Paragraph> <Paragraph position="1"> the variable &quot; &quot; (corresponding towabove). Table 1 illustrates that in the Penn Treebank, if frequent rules with frame imply matching rules with frame , there are usually edit operations (section 1) to easily turn into .</Paragraph> <Paragraph position="2"> How about rare rules, whose probabilities are most in need of smoothing? Are the same edit transformations that we can learn from frequent cases (Table 1) appropriate for predicting the rare cases? The very rarity of these rules makes it impossible to create a table like Table 1.</Paragraph> <Paragraph position="3"> However, rare rules can be measured in the aggregate, and the result suggests that the same kinds of transformations are indeed useful--perhaps even more useful--in predicting them. Let us consider the set R of 2,809,545 possible flat rules that stand at edit distance 1 from the set of S! rules observed in our English training data.</Paragraph> <Paragraph position="4"> That is, a rule such as Sput !NP put NP is in R if it did not appear in training data itself, but could be derived by a single edit from some rule that did appear.</Paragraph> <Paragraph position="5"> A bigram Markov model (section 2) was used to identify 2,714,763 rare rules in R--those that were predicted to occur with probability < 0:0001 given their headwords. 79 of these rare rules actually appeared in a development-data set of 1423 rules. The bigram model would have expected only 26.2 appearances, given the lexical headwords in the test data set. The difference is statistically significant (p< 0:001, bootstrap test).</Paragraph> <Paragraph position="6"> In other words, the bigram model underpredicts the edit-distance &quot;neighbors&quot; of observed rules by a factor of 3.7 One can therefore hope to use the edit transformations to improve on the bigram model. For example, the 7Similar results are obtained when we examine just one particular kind of edit operation, or rules of one particular length.</Paragraph> <Paragraph position="7"> DeleteYtransformation recognizes that if X Y Z has been observed, then X Z is plausible even if the bigram X Z has not previously been observed.</Paragraph> <Paragraph position="8"> Presumably, edit operations are common because they modify a rule in semantically useful ways, allowing the filler of a semantic role to be expressed (Insert), suppressed (Delete), retyped (Substitute), or heavy-shifted (Swap). Such &quot;valency-affecting operations&quot; have repeatedly been invoked by linguists; they are not confined to English.8 So a learner of an unknown language can reasonably expect a priori that flat rules related by edit operations may have related probabilities.</Paragraph> <Paragraph position="9"> However, which edit operations varies by language.</Paragraph> <Paragraph position="10"> Each language defines its own weighted, contextual, asymmetric edit distance. So the learner will have to discover how likely particular edits are in particular contexts. For example, it must learn the rates of preverbal Adv-insertion and right-edge PP-insertion. 
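As a concrete illustration of these edit operations, the following sketch enumerates the frames at edit distance 1 from a given flat frame, the construction used above to define the set R. It is not the paper's code: the nonterminal inventory is a placeholder, and treating Swap as an adjacent transposition is a simplifying assumption.

    NONTERMINALS = ["NP", "PP", "Adv", "SBAR", "Prt"]  # assumed small inventory for illustration

    def neighbors(frame):
        """Frames reachable from `frame` by one edit: Insert, Delete, Substitute, or Swap.
        `frame` is a tuple of dependents with the headword slot marked '*',
        e.g. ('NP', '*', 'NP', 'PP') for S -> NP w NP PP."""
        out = set()
        n = len(frame)
        for i in range(n + 1):                      # Insert a dependent at position i
            for x in NONTERMINALS:
                out.add(frame[:i] + (x,) + frame[i:])
        for i in range(n):
            if frame[i] == "*":
                continue                            # never edit the head itself
            out.add(frame[:i] + frame[i+1:])        # Delete dependent i
            for x in NONTERMINALS:                  # Substitute (retype) dependent i
                if x != frame[i]:
                    out.add(frame[:i] + (x,) + frame[i+1:])
            if i + 1 < n and frame[i+1] != "*":     # Swap dependent i with its right neighbor
                out.add(frame[:i] + (frame[i+1], frame[i]) + frame[i+2:])
        out.discard(frame)
        return out

    # The neighbors of S -> NP w NP PP include S -> NP Adv w NP PP (Insert)
    # and S -> NP w NP (Delete):
    print(len(neighbors(("NP", "*", "NP", "PP"))))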
Evidence about these rates comes mainly from the frequent rules.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 A Transformation Model </SectionTitle> <Paragraph position="0"> The form of our new model is shown in Figure 1. The vertices are flat context-free rules, and the arcs between them represent edit transformations. The set of arcs leaving any given vertex has total probability 1. The learner's job is to discover the probabilities.</Paragraph> <Paragraph position="1"> 8Carpenter (1991) writes that whenever linguists run into the problem of systematic redundancy in the syntactic lexicon, they design a scheme in which lexical entries can be derived from one another by just these operations. We are doing the same thing. The only twist is that the lexical entries (in our case, flat PCFG rules) have probabilities that must also be derived, so we will assume that the speaker applies these operations (randomly from the hearer's viewpoint) at various rates to be learned.</Paragraph> <Paragraph position="2"> [Figure 1 caption, partially recovered: "... are omitted to avoid visual clutter.) Arc probabilities are determined log-linearly, as shown, from a real-valued vector θ of feature weights. The Z values are chosen so that the arcs leaving each vertex have total probability 1. Dashed arrows represent arcs not shown here (there are hundreds from each vertex, mainly insertions). Also, not all features are shown (see Table 2)."]</Paragraph> <Paragraph position="3"> Fortunately, the learner does not have to learn a separate probability for each of the (infinitely) many arcs, since many of the arcs represent identical or similar edits.</Paragraph> <Paragraph position="4"> As shown in Figure 1, an arc's probability is determined from meaningful features of the arc, using a conditional log-linear model of p(arc | source vertex). The learner only has to learn the finite vector θ of feature weights.</Paragraph> <Paragraph position="5"> Arcs that represent similar transformations have similar features, so they tend to have similar probabilities.</Paragraph> <Paragraph position="6"> This transformation model is really a PCFG with an unusual parameterization. That is, for any value of θ, it defines a language-specific probability distribution over all possible context-free rules (graph vertices). To sample from this distribution, take a random walk from the special vertex START to the special vertex HALT. The rule at the last vertex reached before HALT is the sample.</Paragraph> <Paragraph position="7"> This sampling procedure models a process where the speaker chooses an initial rule and edits it repeatedly.</Paragraph> <Paragraph position="8"> The random walk might reach S_fund → To fund NP in two steps and simply halt there. This happens with probability 0.0011 · (exp θ1 / Z1) · (exp 0 / Z2). Or, having arrived at S_fund → To fund NP, it might transform it into S_fund → To fund PP NP and then further to S_fund → To fund NP PP before halting.</Paragraph> <Paragraph position="9"> Thus, p_θ(S_fund → To fund NP PP) denotes the probability that the random walk somehow reaches S_fund → To fund NP PP and halts there. Conditionalizing this probability gives p_θ(To NP PP | S_fund), as needed for the PCFG.9</Paragraph> <Paragraph position="10"> 9The experiments of this paper do not allow transformations that change the LHS or headword of a rule, so it is trivial to find the divisor p_θ(S_fund): in Figure 1 it is 0.0011. But in general, LHS-changing transformations can be useful (Eisner, 2001).</Paragraph> <Paragraph position="11"> Given θ, it is nontrivial to solve for the probability distribution over grammar rules e. Let I_θ(e) denote the flow to vertex e. This is defined to be the total probability of all paths from START to e. Equivalently, it is the expected number of times e would be visited by a random walk from START.
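The generative story just described can be prototyped directly. The sketch below is a toy illustration, not the paper's implementation: the graph, feature names, and weights are all invented. Each arc gets probability proportional to exp(θ · features), and a rule is sampled by random walk from START to HALT; averaging many walks gives a Monte Carlo estimate of p_θ(e), which the recurrence below computes exactly.

    import math, random
    from collections import Counter

    # Toy transformation graph: each vertex lists its outgoing arcs as
    # (destination, feature tuple).  Feature names and weights are invented.
    GRAPH = {
        "START":            [("S->TO w NP", ("initial",))],
        "S->TO w NP":       [("HALT", ()), ("S->TO w NP PP", ("insert", "insert-PP-right"))],
        "S->TO w NP PP":    [("HALT", ()), ("S->TO w NP PP PP", ("insert", "insert-PP-right"))],
        "S->TO w NP PP PP": [("HALT", ())],
    }
    THETA = {"initial": 1.0, "insert": -0.5, "insert-PP-right": 0.8}

    def arc_probs(vertex, theta):
        """Conditional log-linear model p(arc | source vertex)."""
        scores = [math.exp(sum(theta.get(f, 0.0) for f in feats)) for _, feats in GRAPH[vertex]]
        z = sum(scores)
        return [(dest, s / z) for (dest, _), s in zip(GRAPH[vertex], scores)]

    def sample_rule(theta):
        """Random walk from START to HALT; return the last rule visited before HALT."""
        vertex, last_rule = "START", None
        while vertex != "HALT":
            if vertex != "START":
                last_rule = vertex
            dests, probs = zip(*arc_probs(vertex, theta))
            vertex = random.choices(dests, probs)[0]
        return last_rule

    # Monte Carlo estimate of p_theta(e) for each rule e.
    counts = Counter(sample_rule(THETA) for _ in range(20000))
    for rule, c in counts.most_common():
        print(rule, c / 20000)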
The following recurrence defines p_θ(e):10
    I_θ(e) = δ_{e,START} + Σ_{e′} I_θ(e′) · p_θ(e | e′)    (1)
    p_θ(e) = I_θ(e) · p_θ(HALT | e)    (2)
</Paragraph> <Paragraph position="10"> Since solving the large linear system (1) would be prohibitively expensive, in practice we use an approximate relaxation algorithm (Eisner, 2001) that propagates flow through the graph until near-convergence. In general this may underestimate the true probabilities somewhat.</Paragraph> <Paragraph position="11"> Now consider how the parameter vector θ affects the distribution over rules, p_θ(e), in Figure 1: By raising the initial weight θ1, one can increase the flow to S_fund → To fund NP, S_merge → To merge NP, and the like. By equation (2), this also increases the probability of these rules. But the effect also feeds through the graph to increase the flow and probability at those rules' descendants in the graph, such as S_merge → To merge NP PP.</Paragraph> <Paragraph position="12"> So a single parameter θ1 controls a whole complex of rule probabilities (roughly speaking, the infinitival transitives). The model thereby captures the fact that, although rules are mutually exclusive events whose probabilities sum to 1, transformationally related rules have positively correlated probabilities that rise and fall together.</Paragraph> <Paragraph position="13"> 10Where δ_{x,y} = 1 if x = y, else δ_{x,y} = 0.</Paragraph> <Paragraph position="14"> The exception weight θ9 appears on all and only the arcs to S_merge → To merge NP PP. That rule has even higher probability than predicted by PP-insertion as above (since merge, unlike fund, actually tends to subcategorize for PP_with). To model its idiosyncratic probability, one can raise θ9. This "lists" the rule specially in the grammar. Rules derived from it also increase in probability (e.g., S_merge → To Adv merge NP PP), since again the effect feeds through the graph.</Paragraph> <Paragraph position="15"> The generalization weight θ3 models the strength of the PP-insertion relationship. Equations (1) and (2) imply that p_θ(S_fund → To fund NP PP) is modeled as a linear combination of the probabilities of that rule's parents in the graph. θ3 controls the coefficient of p_θ(S_fund → To fund NP) in this linear combination, with the coefficient approaching zero as θ3 → −∞.</Paragraph> <Paragraph position="16"> Narrower generalization weights such as θ4 and θ5 control where PP is likely to be inserted. To learn the feature weights is to learn which features of a transformation make it probable or improbable in the language. Note that the vertex labels, graph topology, and arc parameters are language independent. That is, Figure 1 is supposed to represent Universal Grammar: it tells a learner what kinds of generalizations to look for. The language-specific part is θ, which specifies which generalizations and exceptions help to model the data.</Paragraph> </Section> <Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 The Prior </SectionTitle> <Paragraph position="0"> The model has more parameters than data. Why? Beyond the initial weights and generalization weights, in practice we allow one exception weight (e.g., θ8, θ9) for each rule that appeared in training data. (This makes it possible to learn arbitrary exceptions, as in a Treebank grammar.)
Parameter estimation is nonetheless possible, using a prior to help choose among the many values of θ that do a reasonable job of explaining the training data. The prior constrains the degrees of freedom: while many parameters are available in principle, the prior will ensure that the data are described using as few of them as possible.</Paragraph> <Paragraph position="1"> The point of reparameterizing a PCFG in terms of θ, as in Figure 1, is precisely that only one parameter is needed per linguistically salient property of the PCFG.</Paragraph> <Paragraph position="2"> Making θ3 > 0 creates a broadly targeted transformation. Making θ9 ≠ 0 or θ1 ≠ 0 lists an idiosyncratic rule, or class of rules, together with other rules derived from them. But it takes more parameters to encode less systematic properties, such as narrowly targeted edit transformations (θ4, θ5) or families of unrelated exceptions.</Paragraph> <Paragraph position="3"> A natural prior for the parameter vector θ ∈ R^k is therefore specified in terms of a variance σ2. We simply say that the weights θ1, θ2, ..., θk are independent samples from the normal distribution with mean 0 and variance σ2,
    p(θi) = (2πσ2)^(−1/2) exp(−θi2 / (2σ2)),
</Paragraph> <Paragraph position="5"> or equivalently, that θ is drawn from a multivariate Gaussian with mean 0 and diagonal covariance matrix σ2 I, i.e., θ ~ N(0, σ2 I).</Paragraph> <Paragraph position="6"> This says that a priori, the learner expects most features in Figure 1 to have weights close to zero, i.e., to be irrelevant. Maximizing p(θ) · p(D | θ) means finding a relatively small set of features that adequately describe the rules and exceptions of the grammar. Reducing the variance σ2 strengthens this bias toward simplicity.</Paragraph> <Paragraph position="7"> For example, if S_fund → To fund NP PP and S_merge → To merge NP PP are both observed more often than the current p_θ distribution predicts, then the learner can follow either (or both) of two strategies: raise θ8 and θ9, or raise θ3. The former strategy fits the training data only; the latter affects many disparate arcs and leads to generalization. The latter strategy may harm p(D | θ) but is preferred by the prior p(θ) because it uses one parameter instead of two. If more than two words act like merge and fund, the pressure to generalize is stronger.</Paragraph> </Section> <Section position="9" start_page="0" end_page="0" type="metho"> <SectionTitle> 7 Perturbation Parameters </SectionTitle> <Paragraph position="0"> In experiments, we have found that a slight variation on this model gets slightly better results. Let θ_e denote the exception weight (if any) that allows one to tune the probability of rule e. We eliminate θ_e and introduce a different parameter Δ_e, called a perturbation, which is used in the following replacements for equations (1) and (2):</Paragraph> <Paragraph position="2"> Increasing either θ_e or Δ_e will raise p(e); the learner may do this to account for observations of e in training data. The probabilities of other rules consequently decrease so that Σ_e p(e) = 1. When Δ_e is raised, all rules' probabilities are scaled down slightly and equally (because Z increases). When θ_e is raised, e steals probability from its siblings,11 but these are similar to e so tend to appear in test data if e is in training data.
Raising e without disproportionately harming e's siblings requires manipulation of many other parameters, which is discouraged by the prior and may also suffer from search error.</Paragraph> <Paragraph position="3"> We speculate that this is why e works better.</Paragraph> <Paragraph position="4"> 11Raising the probability of an arc from e0 to e decreases the probabilities of arcs from e0 to siblings of e, as they sum to 1. (Insert) (Insert, target) (Insert, left) (Insert, target, left) (Insert, right) (Insert, target, right) (Insert, left, right) (Insert, side) (Insert, side, target) (Insert, side, left) (Insert, side, target, left) (Insert, side, right) (Insert, side, target, right) (Insert, side, left, right) given arc are found by instantiating the tuples above, as shown. Each instantiated tuple has a weight specified in .</Paragraph> <Paragraph position="5"> S! rules only train dev test To evaluate the quality of generalization, we used preparsed training data D and testing data E (Table 3). Each dataset consisted of a collection of flat rules such as Sput!NP put NP PP extracted from the Penn Tree-bank (Marcus et al., 1993). Thus, p(D j ; ) and p(Ej ; ) were each defined as a product of rule probabilities of the form p ; (NP put NP PPjSput).</Paragraph> <Paragraph position="6"> The learner attempted to maximize p( ; ) p(D j ; ) by gradient ascent. This amounts to learning the generalizations and exceptions that related the training rules D. The evaluation measure was then the perplexity on test data, log2p(E j ; )=jEj. To get a good (low) perplexity score, the model had to assign reasonable probabilities to the many novel rules in E (Table 3). For many of these rules, even the frame was novel.</Paragraph> <Paragraph position="7"> Note that although the training data was preparsed into rules, it was not annotated with the paths in Figure 1 that generated those rules, so estimating and was still an unsupervised learning problem.</Paragraph> <Paragraph position="8"> The transformation graph had about 14 features per arc (Table 2). In the finite part of the transformation graph that was actually explored (including bad arcs that compete with good ones), about 70000 distinct features were encountered, though after training, only a few hundred 12See (Eisner, 2001) for full details of data preparation, model structure, parameter initialization, backoff levels for the comparison models, efficient techniques for computing the objective and its gradient, and more analysis of the results. aBack off from Treebank grammar with Katz vs. one-count backoff (Chen and Goodman, 1996) (Note: One-count was always used for backoff within the n-gram and Collins models.) bSee section 2 for discussion cCollins (1997, model 2) dAverage of transformation model with best other model training set. (b) Half training set (sections 0-7 only). feature weights were substantial, and only a few thousand were even far enough from zero to affect performance.</Paragraph> <Paragraph position="9"> There was also a parameter e for each observed rule e. Results are given in Table 4a, which compares the transformation model to various competing models discussed in section 2. The best (smallest) perplexities appear in boldface. The key results: The transformation model was the winner, reducing perplexity by 20% over the best model replicated from previous literature (a bigram model).</Paragraph> <Paragraph position="10"> Much of this improvement could be explained by the transformation model's ability to model exceptions. 
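For reference, the evaluation measure described above can be written out explicitly. The sketch below is an illustration with invented probabilities and hypothetical function names, not the paper's evaluation code; whether one reports the per-rule cross-entropy in bits or its exponentiation as a perplexity, the ranking of models is the same.

    import math

    def perplexity(model_prob, test_rules):
        """Per-rule perplexity of a rule distribution on a held-out set of flat rules.
        `model_prob(frame, head)` should return p(frame | S_head) under the trained model."""
        log2_total = 0.0
        for head, frame in test_rules:
            log2_total += math.log2(model_prob(frame, head))
        cross_entropy = -log2_total / len(test_rules)   # bits per rule
        return 2 ** cross_entropy

    # Illustration with an invented model that returns fixed probabilities:
    test_rules = [("put", ("NP", "NP", "PP")), ("fund", ("TO", "NP"))]
    fake_model = lambda frame, head: {("NP", "NP", "PP"): 0.01, ("TO", "NP"): 0.02}[frame]
    print(perplexity(fake_model, test_rules))   # about 70.7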
Adding this ability more directly to the bigram model, using the new Treebank/Markov approach of section 2, also reduced perplexity from the bigram model, by 6% or 14% depending on whether Katz or one-count backoff was used, versus the transformation model's 20%.</Paragraph> <Paragraph position="11"> Averaging the transformation model with the best competing model (Treebank/bigram) improved it by an additional 6%. So using transformations yields a total perplexity reduction of 12% over Treebank/bigram, and 24% over the best previous model from the literature (bigram). What would be the cost of achieving such a perplexity improvement by additional annotation? Training the averaged model on only the first half of the training set, with no further tuning of any options (Table 4b), yielded a test set perplexity of 118.0. So by using transformations, we can achieve about the same perplexity as the best model without transformations (Treebank/bigram, 116.2), using only half as much training data.</Paragraph> <Paragraph position="12"> Furthermore, comparing Tables 4a and 4b shows that the transformation model had the most graceful performance degradation when the dataset was reduced in size. best transformation-free model. Improvements fall above the main diagonal; dashed diagonals indicate a factor of two. The three log-log plots (at different scales!) partition the rules by the number of training observations: 0 (left graph), 1 (middle), 2 (right). This is an encouraging result for the use of the method in less supervised contexts (although results on a noisy dataset would be more convincing in this regard).</Paragraph> <Paragraph position="13"> The competing models from the literature are best used to predict flat rules directly, rather than by summing over their possible non-flat internal structures, as has been done in the past. This result is significant in itself. Extending Johnson (1998), it shows the inappropriateness of the traditional independence assumptions that build up a frame by several rule expansions (section 2).</Paragraph> <Paragraph position="14"> Figure 2 shows that averaging the transformation model with the Treebank/bigram model improves the latter not merely on balance, but across the board. In other words, there is no evident class of phenomena for which incorporating transformations would be a bad idea.</Paragraph> <Paragraph position="15"> Transformations particularly helped raise the estimates of the low-probability novel rules in test data, as hoped. Transformations also helped on test rules that had been observed once in training with relatively infrequent words. (In other words, the transformation model does not discount singletons too much.) Transformations hurt slightly on balance for rules observed more than once in training, but the effect was tiny. All these differences are slightly exaggerated if one compares the transformation model directly with the Treebank/bigram model, without averaging.</Paragraph> <Paragraph position="16"> The transformation model was designed to use edit operations in order to generalize appropriately from a word's observed frames to new frames that are likely to appear with that word in test data. 
To directly test the model's success at such generalization, we compared it to the bigram model on a pseudo-disambiguation task.</Paragraph> <Paragraph position="17"> Each instance of the task consisted of a pair of rules from test data, expressed as (word, frame) pairs (w1;f1) and (w2;f2), such that f1 and f2 are &quot;novel&quot; frames that did not appear in training data (with any headword).</Paragraph> <Paragraph position="18"> Each model was then asked: Does f1 go with w1 and f2 with w2, or vice-versa? In other words, which is bigger, p(f1 jw1) p(f2 jw2) or p(f2 jw1) p(f1 jw2)? Since the frames were novel, the model had to make the choice according to whether f1 or f2 looked more like the frames that had actually been observed with w1 in the past, and likewise w2. What this means depends on the model. The bigram model takes two frames to look alike if they contain many bigrams in common. The transformation model takes two frames to look alike if they are connected by a path of probable transformations. The test data contained 62 distinct rules (w;f) in which f was a novel frame. This yielded 62 612 = 1891 pairs of rules, leading to 1811 task instances after obvious ties were discarded.13 Baseline performance on this difficult task is 50% (random guess). The bigram model chose correctly in 1595 of the 1811 instances (88.1%). Parameters for &quot;memorizing&quot; specific frames do not help on this task, which involves only novel frames, so the Treebank/bigram model had the same performance. By contrast, the transformation model got 1669 of 1811 correct (92.2%), for a morethan-34% reduction in error rate. (The development set showed similar results.) However, since the 1811 task instances were derived non-independently from just 62 novel rules, this result is based on a rather small sample.</Paragraph> </Section> class="xml-element"></Paper>
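As an addendum, the pseudo-disambiguation test described in the final section can be illustrated as follows. This is a sketch only: the scoring function and data are placeholders, not the paper's evaluation code. Each task instance asks whether p(f1 | w1) · p(f2 | w2) exceeds p(f2 | w1) · p(f1 | w2).

    from itertools import combinations

    def pseudo_disambiguation_accuracy(test_rules, model_prob):
        """test_rules: list of (word, frame) pairs whose frames are novel (unseen in training).
        model_prob(frame, word) returns the model's estimate of p(frame | S_word).
        For each pair of rules, the model must match each frame with its true headword."""
        correct = total = 0
        for (w1, f1), (w2, f2) in combinations(test_rules, 2):
            right = model_prob(f1, w1) * model_prob(f2, w2)
            wrong = model_prob(f2, w1) * model_prob(f1, w2)
            if right == wrong:
                continue                  # skip ties (the paper discards "obvious ties")
            total += 1
            correct += (right > wrong)
        return correct / total if total else float("nan")

On the paper's 62 novel test rules this pairing scheme yields 1891 pairs, 1811 of which remain as task instances after ties are discarded.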