<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1638"> <Title>Better Informed Training of Latent Syntactic Features</Title> <Section position="4" start_page="317" end_page="317" type="metho"> <SectionTitle> 2 Partially supervised EM learning </SectionTitle> <Paragraph position="0"> The parameters of a PCFG can be learned with or without supervision. In the supervised case, the complete tree is observed, and the rewrite rule probabilities can be estimated directly from the observed rule counts. In the unsupervised case, only the words are observed, and the learning method must induce the whole structure above them. (See Table 1.) In the partially supervised case we will consider, some part of the tree is observed, and the remaining information has to be induced.</Paragraph> <Paragraph position="1"> Pereira and Schabes (1992) estimate PCFG parameters from partially bracketed sentences, using the inside-outside algorithm to induce the missing brackets and the missing node labels. Some authors define a complete tree as one that specifies not only a label but also a &quot;head child&quot; for each node. Chiang and Bikel (2002) induce the missing head-child information; Prescher (2005) induces both the head-child information and the latent annotations we will now discuss.</Paragraph> </Section> <Section position="5" start_page="317" end_page="320" type="metho"> <SectionTitle> 3 Feature Grammars </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="317" end_page="318" type="sub_section"> <SectionTitle> 3.1 The PCFG-LA Model </SectionTitle> <Paragraph position="0"> Staying in the partially supervised paradigm, the PCFG-LA model described in Matsuzaki et al.</Paragraph> <Paragraph position="1"> (2005) observes whole treebank trees, but learns an &quot;annotation&quot; on each nonterminal token: an unspecified and uninterpreted integer that distinguishes otherwise identical nonterminals. 
Just as Collins manually split the S nonterminal label into S and SG for sentences with and without subjects, Matsuzaki et al. (2005) split S into S[1], S[2], ..., S[L], where L is a predefined number; but they do it automatically and systematically, and not only for S but for every nonterminal. Their partially supervised learning procedure observes trees that are fully bracketed and fully labeled, except for the integer subscript used to annotate each node.</Paragraph> <Paragraph position="2"> After automatically inducing the annotations with EM, their resulting parser performs just as well as one learned from a treebank whose nonterminals were manually refined through linguistic and error analysis (Klein and Manning, 2003).</Paragraph> <Paragraph position="3"> In Matsuzaki's PCFG-LA model, rewrite rules take the form X[α] → Y[β] Z[γ] (1) in the binary case, and X[α] → w (2)</Paragraph> <Paragraph position="5"> in the lexical case. The probability of a tree consisting of rules r_1, r_2, ... is given by the probability of its root symbol times the conditional probabilities of the rules. The annotated tree T1 in Fig. 1, for example, has the following probability:</Paragraph> <Paragraph position="7"> where, to simplify the notation, we use</Paragraph> <Paragraph position="9"> P(X → Y Z) to denote the conditional probability that a given node with label X will have children Y Z.</Paragraph> <Paragraph position="10"> Degrees of freedom. We will want to compare models that have about the same size. Models with more free parameters have an inherent advantage in modeling copious data because of their greater expressiveness. Models with fewer free parameters are easier to train accurately on sparse data, as well as being more efficient in space and often in time. Our question is therefore what can be accomplished with a given number of parameters. How many free parameters does a PCFG-LA model have? Such a model is created by annotating the nonterminals of a standard PCFG (extracted from the given treebank) with the various integers from 1 to L. 
If the original, &quot;backbone&quot; grammar has R_3 binary rules of the form X → Y Z, then the resulting PCFG-LA model has L^3 × R_3 such rules: X[1] → Y[1] Z[1], X[1] → Y[1] Z[2], ..., X[L] → Y[L] Z[L].</Paragraph> <Paragraph position="12"> Similarly, if the backbone grammar has R_2 rules of the form X → Y (footnote 3), the PCFG-LA model has L^2 × R_2 such rules, and the R_1 lexical rules of the form X → w yield L × R_1 rules.</Paragraph> <Paragraph position="14"> The PCFG-LA has as many parameters to learn as rules: one probability per rule. However, not all these parameters are free, as there are L × N sum-to-one constraints, where N is the number of backbone nonterminals. Thus we have L^3 R_3 + L^2 R_2 + L R_1 − L N (3) degrees of freedom.</Paragraph> <Paragraph position="15"> We note that Goodman (1997) mentioned possible ways to factor the probability in (1), making independence assumptions in order to reduce the number of parameters.</Paragraph> <Paragraph position="16"> Runtime. Assuming there are no unary rule cycles in the backbone grammar, bottom-up chart parsing of a length-n sentence at test time takes time proportional to n^3 L^3 R_3 + n^2 L^2 R_2 + n L R_1, by attempting to apply each rule everywhere in the sentence. (The dominating term comes from equation (4) of Table 2: we must loop over all n^3 triples i, j, k, all R_3 backbone rules X → Y Z, and all L^3 triples α, β, γ.) As a function of n and L only, this is O(n^3 L^3). [Footnote 3: We use unary rules of this form (e.g., the Treebank's S → NP) in our reimplementation of Matsuzaki's algorithm.]</Paragraph> <Paragraph position="17"> At training time, to induce the annotations on a given backbone tree with n nodes, one can run a constrained version of this algorithm that loops over only the n triples i, j, k that are consistent with the given tree (and considers only the single consistent backbone rule for each one). 
This takes time O(n L^3), as does the inside-outside version we actually use to collect expected PCFG-LA rule counts for EM training.</Paragraph> <Paragraph position="18"> We now introduce a model that is smaller, and has a lower runtime complexity, because it adheres to specified ways of propagating features through the tree.</Paragraph> </Section> <Section position="2" start_page="318" end_page="320" type="sub_section"> <SectionTitle> 3.2 Feature Passing: The INHERIT Model </SectionTitle> <Paragraph position="0"> Many linguistic theories assume that features get passed from the mother node to its children or some of its children. In many cases it is the head child that gets passed its feature value from its mother (e.g., Kaplan and Bresnan (1982), Pollard and Sag (1994)). In some cases the feature is passed to both the head and the non-head child, or perhaps even to the non-head alone.</Paragraph> <Paragraph position="1"> [Figure 2 caption fragment: ... at different positions in the tree.]</Paragraph> <Paragraph position="2"> In the example in Fig. 2, the tense feature (pres) is always passed to the head child (underlined). How the number feature (sg/pl) is passed depends on the rewrite rule: S → NP VP passes it to both children, to enforce subject-verb agreement, while VP → V NP only passes it to the head child, since the object NP is free not to agree with the verb. A feature grammar can incorporate such patterns of feature passing. We introduce additional parameters that define the probability of passing a feature to certain children. The head child of each node is given deterministically by the head rules of Collins (1996).</Paragraph> <Paragraph position="3"> [Table 2. Columns: Model; Runtime and d.f.; Simplified equation for inside probabilities (ignores unary rules).]</Paragraph> <Paragraph position="5"> In Table 2, &quot;d.f.&quot; stands for &quot;degrees of freedom&quot; (i.e., free parameters). 
The B terms are inside probabilities; to compute Viterbi parse probabilities instead, replace summation by maximization. Note the use of the intermediate quantity B_X(i, j) to improve runtime complexity by moving some summations out of the inner loop; this is an instance of a &quot;folding transformation&quot; (Blatz and Eisner, 2006).</Paragraph> <Paragraph position="6"> Figure 3. Left: T2. The feature is passed to the head child (underlined). Right: T3.</Paragraph> <Paragraph position="7"> The feature is passed to both children.</Paragraph> <Paragraph position="8"> Under the INHERIT model that we propose, the probabilities of tree T2 in Fig. 3 are calculated as follows, with P_ann(1 | NP) being the probability of annotating an NP with feature 1 if it does not inherit its parent's feature. The VP is boldfaced to indicate that it is the head child of this rule.</Paragraph> <Paragraph position="10"> In T2, the subject NP chose feature 1 or 2 independently of its parent S, according to the distribution P_ann(· | NP). In T3, it was constrained to inherit its parent's feature 2.</Paragraph> <Paragraph position="11"> Degrees of freedom. The INHERIT model may be regarded as containing all the same rules (see (1)) as the PCFG-LA model. However, these rules' probabilities are now collectively determined by a smaller set of shared parameters (footnote 4). That is because the distribution of the child features β and γ no longer depends arbitrarily on the rest of the rule: β is either equal to α or chosen independently of everything but Y.</Paragraph> <Paragraph position="12"> The model needs probabilities for L × R_3 binary-rule parameters like P(S[2] → NP VP) above, as well as L × R_2 unary-rule and L × R_1 lexical-rule parameters. None of these consider the annotations on the children. 
They are subject to L × N sum-to-one constraints.</Paragraph> <Paragraph position="13"> The model also needs 4 × R_3 passpattern probabilities like P(pass to head | X → Y Z) above, with R_3 sum-to-one constraints, and L × N noninherited annotation parameters P_ann(α | X), with N sum-to-one constraints.</Paragraph> <Paragraph position="14"> [Footnote 4, beginning truncated: ... below. Like equation (5), it is P(X[α] → Y Z) times a sum of up to 4 products, corresponding to the 4 passpattern cases.]</Paragraph> <Paragraph position="16"> Adding these up and canceling the two L × N terms, the INHERIT model has</Paragraph> <Paragraph position="17"> L(R_3 + R_2 + R_1) + 3 R_3 − N</Paragraph> <Paragraph position="19"> degrees of freedom. Thus for a typical grammar where R_3 dominates, we have reduced the number of free parameters from about L^3 R_3 to only about L R_3.</Paragraph> <Paragraph position="20"> Runtime. We may likewise reduce an L^3 factor to L in the runtime. Table 2 shows dynamic programming equations for the INHERIT model. By exercising care, they are able to avoid summing over all possible values of β and γ within the inner loop. This is possible because when β and γ are not inherited, they do not depend on X, Y, Z, or α.</Paragraph> </Section> <Section position="3" start_page="320" end_page="320" type="sub_section"> <SectionTitle> 3.3 Multiple Features </SectionTitle> <Paragraph position="0"> The INHERIT model described above is linguistically naive in several ways. One problem (see section 6 for others) is that each nonterminal has only a single feature to pass. Linguists, however, usually annotate each phrase with multiple features.</Paragraph> <Paragraph position="1"> Our example tree in Fig. 2 was annotated with both tense and number features, with different inheritance patterns.</Paragraph> <Paragraph position="2"> As a step up from INHERIT, we propose an INHERIT2 model where each nonterminal carries two features. Thus, we will have L^6 R_3 binary rules instead of L^3 R_3. 
However, we assume that the two features choose their passpatterns independently, and that when a feature is not inherited, it is chosen independently of the other feature. This keeps the number of parameters down.</Paragraph> <Paragraph position="3"> In effect, we are defining</Paragraph> <Paragraph position="5"> where P1 and P2 choose child features as if they were separate single-feature INHERIT models.</Paragraph> <Paragraph position="6"> We omit discussion of dynamic programming speedups for INHERIT2. Empirically, the hope is that the two features, when learned with the EM algorithm, will pick out different linguistic properties of the constituents in the treebank tree.</Paragraph> </Section> </Section> <Section position="6" start_page="320" end_page="321" type="metho"> <SectionTitle> 4 Annealing-Like Training Approaches </SectionTitle> <Paragraph position="0"> Training latent PCFG models, like training most other unsupervised models, requires non-convex optimization. To find good parameter values, it is often helpful to train a simpler model first and use its parameters to derive a starting guess for the harder optimization problem. A well-known example is the training of the IBM models for statistical machine translation (Berger et al., 1994).</Paragraph> <Paragraph position="1"> In this vein, we did an experiment in which we gradually increased L during EM training of the PCFG-LA and INHERIT models. Whenever the training likelihood began to converge, we manually and globally increased L, simply doubling or tripling it (see &quot;clone all&quot; in Table 3 and Fig. 5). The probability of X[α] → Y[β] Z[γ] under the new model was initialized to be proportional to the probability of X[α mod L] → Y[β mod L] Z[γ mod L] (where L refers to the old L; see footnote 5), times a random &quot;jitter&quot; to break symmetry. 
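The cloning step just described can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation used in the experiments: it covers binary rules only, the function name clone_all and its dictionary representation are hypothetical, the jitter width is an assumed value, and renormalizing over each parent X[a] is one reading of the phrase proportional to.

```python
import itertools
import random
from collections import defaultdict


def clone_all(rule_probs, old_L, factor, rng=None, jitter=1e-4):
    """Clone every annotation of a binary-rule PCFG-LA grammar (sketch).

    rule_probs maps (X, a, Y, b, Z, g) to P(X[a] -> Y[b] Z[g]), with the
    annotations a, b, g in range(old_L). The result uses annotations in
    range(old_L * factor). All names here are hypothetical.
    """
    rng = rng or random.Random(0)  # fixed seed so the sketch is repeatable
    new_L = old_L * factor
    backbone = sorted({(X, Y, Z) for (X, _, Y, _, Z, _) in rule_probs})

    unnormalized = {}
    for (X, Y, Z) in backbone:
        for a, b, g in itertools.product(range(new_L), repeat=3):
            # Each new rule starts from the old rule it was cloned from.
            old_key = (X, a % old_L, Y, b % old_L, Z, g % old_L)
            old_p = rule_probs.get(old_key, 0.0)
            # Assumed jitter range; it only serves to break symmetry.
            noise = rng.uniform(1.0 - jitter, 1.0 + jitter)
            unnormalized[(X, a, Y, b, Z, g)] = old_p * noise

    # Scale all rules of one parent X[a] by the same proportion.
    totals = defaultdict(float)
    for (X, a, *_), p in unnormalized.items():
        totals[(X, a)] += p
    return {rule: p / totals[(rule[0], rule[1])]
            for rule, p in unnormalized.items()
            if totals[(rule[0], rule[1])] > 0}
```

Calling clone_all(grammar, old_L=2, factor=2) on such a dictionary would give an L=4 starting point for further EM iterations; unary and lexical rules would need the analogous treatment.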
In a second annealing experiment (&quot;clone some&quot;) we addressed a weakness of the PCFG-LA and INHERIT models: they give every nonterminal the same number of latent annotations.</Paragraph> <Paragraph position="2"> It would seem that different coarse-grained nonterminals in the original Penn Treebank have different degrees of impurity (Klein and Manning, 2003). There are linguistically many kinds of NP, which are differentially selected for by various contexts and hence are worth distinguishing.</Paragraph> <Paragraph position="3"> By contrast, -LRB- is almost always realized as a left parenthesis and may not need further refinement. Our &quot;clone some&quot; annealing starts by training a model with L=2 to convergence. Then, instead of cloning all nonterminals as in the previous annealing experiments, we clone only those that seem to have benefited most from their previous refinement. This benefit is measured by the Jensen-Shannon divergence of the two distributions P(X[0] → ···) and P(X[1] → ···). [Footnote 5: Notice that as well as cloning X[α], this procedure multiplies by 4, 2, and 1 the number of binary, unary, and lexical rules that rewrite X[α]. To leave the backbone grammar unchanged, we should have scaled down the probabilities of such rules by 1/4, 1/2, and 1 respectively. Instead, we simply scaled them all down by the same proportion. While this temporarily changes the balance of probability among the three kinds of rules, EM immediately corrects this balance on the next training iteration to match the observed balance on the treebank trees; hence the one-iteration downtick in Figure 5.]</Paragraph> <Paragraph position="4"> The Jensen-Shannon divergence is defined as $\mathrm{JS}(p, q) = \frac{1}{2}\left( D\left(p \,\|\, \frac{p+q}{2}\right) + D\left(q \,\|\, \frac{p+q}{2}\right) \right)$, where D denotes the Kullback-Leibler divergence.</Paragraph> <Paragraph position="6"> These experiments are a kind of &quot;poor man's version&quot; of the deterministic annealing clustering algorithm (Pereira et al., 1993; Rose, 1998), which gradually increases the number of clusters during the clustering process. 
In deterministic annealing, one starts in principle with a very large number of clusters, but maximizes likelihood only under a constraint that the joint distribution p(point, cluster) must have very high entropy. This drives all of the cluster centroids to coincide exactly, redundantly representing just one effective cluster. As the entropy is permitted to decrease, some of the cluster centroids find it worthwhile to drift apart (footnote 6). In future work, we would like to apply this technique to split nonterminals gradually, by initially requiring high-entropy parse forests on the training data and slowly relaxing this constraint.</Paragraph> </Section> <Section position="7" start_page="321" end_page="323" type="metho"> <SectionTitle> 5 Experiments </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="321" end_page="321" type="sub_section"> <SectionTitle> 5.1 Setup </SectionTitle> <Paragraph position="0"> We ran several experiments to compare the INHERIT model with the PCFG-LA model and to examine the effect of different Treebank preprocessing and of the annealing-like procedures.</Paragraph> <Paragraph position="1"> We used sections 2-20 of the Penn Treebank 2 Wall Street Journal corpus (Marcus et al., 1993) for training, section 22 as the development set, and section 23 for testing. Following Matsuzaki et al. (2005), we replaced words occurring fewer than 4 times in the training corpus by unknown-word symbols that encoded certain suffix and capitalization information.</Paragraph> <Paragraph position="2"> All experiments used simple add-lambda smoothing (λ = 0.1) during the reestimation step (M step) of training.</Paragraph> <Paragraph position="3"> Binarization and Markovization. Before extracting the backbone PCFG and running the constrained inside-outside (EM) training algorithm, we preprocessed the Treebank using center-parent binarization (Matsuzaki et al., 2005). 
Besides making the rules at most binary, this preprocessing also helpfully enriched the backbone nonterminals. [Footnote 6: In practice, each very large group of centroids (effective cluster) is represented by just two, until such time as those two drift apart to represent separate effective clusters; then each is cloned.]</Paragraph> <Paragraph position="4"> For all but the first (&quot;Basic&quot;) experiments, we also enriched the nonterminals with order-1 horizontal and order-2 vertical markovization (Klein and Manning, 2003; see footnote 7). Figure 4 shows what a multiple-child structure X → A B H C D looks like after binarization and markovization. The binarization process starts at the head of the sentence and moves to the right, inserting an auxiliary node for each picked-up child, then moves to the left.</Paragraph> <Paragraph position="5"> Each auxiliary node consists of the parent label, the direction (L or R), and the label of the child. [Figure 4 caption fragment: ... and center-parent binarization of the rule X → A B H C D, where H is the head child.]</Paragraph> <Paragraph position="6"> Initialization. The backbone PCFG grammar was read off the altered Treebank, and the initial annotated grammar was created by creating several versions of every rewrite rule. The probabilities of these newly created rules are initialized uniformly across the annotated versions, in proportion to the probability of the original rule, and multiplied by a random factor sampled uniformly from [0.9999, 1.0001] to break symmetry.</Paragraph> </Section> <Section position="2" start_page="321" end_page="322" type="sub_section"> <SectionTitle> 5.2 Decoding </SectionTitle> <Paragraph position="0"> To test the PCFG learned by a given method, we attempted to recover the unannotated parse of each sentence in the development set. We then scored these parses by debinarizing or demarkovizing them and then measuring their precision and recall of the labeled constituents from the gold-standard Treebank parses.</Paragraph> <Paragraph position="1"> Footnote 7: The vertical markovization was applied before binarization. 
Matsuzaki et al. (2005) used a markovized grammar to get a better unannotated parse forest during decoding, but they did not markovize the training data.</Paragraph> <Paragraph position="2"> [Figure 5 caption fragment: ... We increased L after iteration 50 and, for the INHERIT model, iteration 110. The downward spikes in the two annealed cases are due to perturbation of the model parameters (footnote 5).]</Paragraph> <Paragraph position="3"> An unannotated parse's probability is the total probability, under our learned PCFG, of all of its annotated refinements. This total can be efficiently computed by the constrained version of the inside algorithm in Table 2.</Paragraph> <Paragraph position="4"> How do we obtain the unannotated parse whose total probability is greatest? It does not suffice to find the single best annotated parse and then strip off the annotations. Matsuzaki et al. (2005) note that the best unannotated parse is in fact NP-hard to find. We use their reranking approximation. A 1000-best list for each sentence in the decoding set was created by parsing with our markovized unannotated grammar and extracting the 1000 best parses using k-best Algorithm 3 of Huang and Chiang (2005). Then we chose the most probable of these 1000 unannotated parses under our PCFG, first finding the total probability of each by using the constrained inside algorithm as explained above (footnote 8).</Paragraph> </Section> <Section position="3" start_page="322" end_page="322" type="sub_section"> <SectionTitle> 5.3 Results and Discussion </SectionTitle> <Paragraph position="0"> Table 3 summarizes the results on development and test data (footnote 9). Figure 5 shows the training log-likelihoods. 
[Footnote 8: For the first set of experiments, in which the models were trained on a simple non-markovized grammar, the 1000-best trees had to be &quot;demarkovized&quot; before our PCFG was able to rescore them.] [Footnote 9: All results are reported on sentences of 40 words or fewer.]</Paragraph> <Paragraph position="1"> First, markovization of the Treebank leads to striking improvements. The &quot;Basic&quot; block of experiments in Table 3 used non-markovized grammars, as in Matsuzaki et al. (2005). The next block of experiments, introducing markovized grammars, shows a considerable improvement. This is not simply because markovization increases the number of parameters: markovization with L = 2 already beats basic models that have much higher L and far more parameters.</Paragraph> <Paragraph position="2"> Evidently, markovization pre-splits the labels in the trees in a reasonable way, so EM has less work to do. This is not to say that markovization eliminates the need for hidden annotations: with markovization, going from L=1 to L=2 increases the parsing accuracy even more than without it.</Paragraph> <Paragraph position="3"> Second, our &quot;clone all&quot; training technique (shown in the next block of Table 3) did not help performance and may even have hurt slightly.</Paragraph> <Paragraph position="4"> Here we initialized the L=2×2 model with the trained L=2 model for PCFG-LA, the L=3×3 model with the L=3 model, and the L=3×3×3 model with the L=3×3 model.</Paragraph> <Paragraph position="5"> Third, our &quot;clone some&quot; training technique appeared to work. On PCFG-LA, the L &lt; 2×2 condition (i.e., train with L=2 and then clone some) matched the performance of L=4 with 30% fewer parameters. On INHERIT, L &lt; 2×2 beat L=4 with 8% fewer parameters. 
In these experiments, we used the average divergence as a threshold: X[0] and X[1] are split again if the divergence of their rewrite distributions is higher than average.</Paragraph> <Paragraph position="6"> Fourth, our INHERIT model was a disappointment. It generally performed slightly worse than PCFG-LA when given about as many degrees of freedom. This was also the case in some cursory experiments on smaller training corpora.</Paragraph> <Paragraph position="7"> It is tempting to conclude that INHERIT simply adopted overly strong linguistic constraints, but relaxing those constraints by moving to the INHERIT2 model did not seem to help. Our one experiment with INHERIT2 (not shown in Table 3), using 2 features that can each take L=2 values (d.f.: 212,707), obtained an F1 score of only 83.67, worse than 1 feature taking L=4 values.</Paragraph> </Section> <Section position="4" start_page="322" end_page="323" type="sub_section"> <SectionTitle> 5.4 Analysis: What was learned by INHERIT? </SectionTitle> <Paragraph position="0"> INHERIT did seem to discover &quot;linguistic&quot; features, as intended, even though this did not improve parse accuracy. We trained INHERIT and PCFG-LA models (both L=2, non-markovized) and noticed the following.</Paragraph> <Paragraph position="1"> [Table 3 caption fragment: ... mean (F1). &quot;Basic&quot; models are trained on a non-markovized treebank (as in Matsuzaki et al. (2005)); all others are trained on a markovized treebank. The best model (PCFG-LA with &quot;clone some&quot; annealing, F1=86.43) has also been decoded on the final test set, reaching P/R=86.94/85.40 (F1=86.17).] We used both models to assign the most probable annotations to the gold parses of the development set. Under the INHERIT model, NP[0] vs. NP[1] constituents were 21% plural vs. 41% plural. Under PCFG-LA this effect was weaker (30% vs. 39%), although it was significant in both (Fisher's exact test, p &lt; 0.001). 
Strikingly, under the INHERIT model, the NP's were 10 times more likely to pass this feature to both children (Fisher's exact test, p &lt; 0.001), just as we would expect for a number feature, since the determiner and head noun of an NP must agree.</Paragraph> <Paragraph position="2"> The INHERIT model also learned to use feature value 1 for &quot;tensed auxiliary.&quot; The VP[1] nonterminal was far more likely than VP[0] to expand as V VP, where V represents any of the tensed verb preterminals VBZ, VBG, VBN, VBD, VBP. Furthermore, these expansion rules had a very strong preference for &quot;pass to head,&quot; so that the left child would also be annotated as a tensed auxiliary, typically causing it to expand as a form of be, have, or do. In short, the feature ensured that it was genuine auxiliary verbs that subcategorized for VPs.</Paragraph> <Paragraph position="3"> (The PCFG-LA model actually arranged the same behavior, e.g., similarly preferring VBZ[1] in the auxiliary expansion rule VP → VBZ VP. The difference is that the PCFG-LA model was able to express this preference directly without propagating the [1] up to the VP parent. Hence neither VP[0] nor VP[1] became strongly associated with the auxiliary rule.) Many things are learned equally by both models: they learn the difference between subordinating conjunctions (while, if) and prepositions (under, after), putting them in distinct groups of the original IN tag, which typically combine with sentences and noun phrases, respectively. 
Both models also split the conjunction CC into two distinct groups: a group of conjunctions starting with an upper-case letter at the beginning of the sentence and a group containing all other conjunctions.</Paragraph> </Section> </Section> <Section position="8" start_page="323" end_page="324" type="metho"> <SectionTitle> 6 Future Work: Log-Linear Modeling </SectionTitle> <Paragraph position="0"> Our approach in the INHERIT model made certain strict independence assumptions, with no backoff.</Paragraph> <Paragraph position="1"> The choice of a particular passpattern, for example, depends on all and only the three nonterminals X, Y, Z. However, given sparse training data, sometimes it is advantageous to back off to smaller amounts of contextual information; the nonterminal X or Y might alone be sufficient to predict the passpattern.</Paragraph> <Paragraph position="2"> A very reasonable framework for handling this issue is to model P(X[α] → Y[β] Z[γ]) with a log-linear model (footnote 10). Feature functions would consider the values of variously sized, overlapping subsets of X, Y, Z, α, β, γ. For example, a certain feature might fire when X[α] = NP[1] and Z[γ] = N[2]. This approach can be extended to the multi-feature case, as in INHERIT2. Inheritance as in the INHERIT model can then be expressed by features like α = β, or α = β and X = VP. During early iterations, we could use a prior to encourage a strong positive weight on these inheritance features, and gradually relax this bias, akin to the &quot;structural annealing&quot; of Smith and Eisner (2006).</Paragraph> <Paragraph position="3"> When modeling the lexical rule P(X[α] → w), we could use features that consider the spelling of the word w in conjunction with the value of α. Thus, we might learn that V[1] is particularly likely to rewrite as a word ending in -s. 
Spelling features that are predictable from string context are important clues to the existence and behavior of the hidden annotations we wish to induce.</Paragraph> <Paragraph position="4"> A final remark is that &quot;inheritance&quot; does not necessarily have to mean that α = β. It is enough that α and β should have high mutual information, so that one can be predicted from the other; they do not actually have to be represented by the same integer. More broadly, we might like α to have high mutual information with the pair (β, γ).</Paragraph> <Paragraph position="5"> One might try using this sort of intuition directly in an unsupervised learning procedure (Elidan and Friedman, 2003).</Paragraph> </Section> </Paper>