<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-2903">
  <Title>Non-Local Modeling with a Mixture of PCFGs</Title>
  <Section position="4" start_page="0" end_page="14" type="metho">
    <SectionTitle>
2 Empirical Motivation
</SectionTitle>
    <Paragraph position="0"> It is commonly accepted that the context freedom assumptions underlying the PCFG model are too  for the American financial ($). On the right hand side we show the ten rules whose likelihoods are most increased in a sentence containing this rule.</Paragraph>
    <Paragraph position="1"> strong and that weakening them results in better models of language (Johnson, 1998; Gildea, 2001; Klein and Manning, 2003). In particular, certain grammar productions often cooccur with other productions, which may be either near or distant in the parse tree. In general, there exist three types of correlations: (i) local (e.g. parent-child), (ii) non-local, and (iii) self correlations (which may be local or non-local).</Paragraph>
    <Paragraph position="2"> In order to quantify the strength of a correlation, we use a likelihood ratio (LR). For two rules X-a and Y-b, we compute LR(X-a, Y-b) = P(a,b|X,Y )P(a|X,Y )P(b|X,Y ) This measures how much more often the rules occur together than they would in the case of independence. For rules that are correlated, this score will be high ([?] 1); if the rules are independent, it will be around 1, and if they are anti-correlated, it will be near 0.</Paragraph>
    <Paragraph position="3"> Among the correlations present in the Penn Treebank, the local correlations are the strongest ones; they contribute 65% of the rule pairs with LR scores above 90 and 85% of those with scores over 200.</Paragraph>
    <Paragraph position="4"> Non-local and self correlations are in general common but weaker, with non-local correlations contributing approximately 85% of all correlations1. By adding a latent variable conditioning all productions, 1Quantifying the amount of non-local correlation is problematic; most pairs of cooccuring rules are non-local and will, due to small sample effects, have LR ratios greater than 1 even if they were truly independent in the limit.</Paragraph>
    <Paragraph position="5"> we aim to capture some of this interdependence between rules.</Paragraph>
    <Paragraph position="6"> Correlations at short distances have been captured effectively in previous work (Johnson, 1998; Klein and Manning, 2003); vertical markovization (annotating nonterminals with their ancestor symbols) does this by simply producing a different distribution for each set of ancestors. This added context leads to substantial improvement in parsing accuracy. With local correlations already well captured, our main motivation for introducing a mixture of grammars is to capture long-range rule cooccurrences, something that to our knowledge has not been done successfully in the past.</Paragraph>
    <Paragraph position="7"> As an example, the rule QP-# CD CD, representing a quantity of British currency, cooccurs with itself 132 times as often as if occurrences were independent. These cooccurrences appear in cases such as seen in Figure 1. Similarly, the rules VP-VBD NP PP , S and VP-VBG NP PP PP cooccur in the Penn Tree-bank 100 times as often as we would expect if they were independent. They appear in sentences of a very particular form, telling of an action and then giving detail about it; an example can be seen in Figure 2.</Paragraph>
  </Section>
  <Section position="5" start_page="14" end_page="16" type="metho">
    <SectionTitle>
3 Mixtures of PCFGs
</SectionTitle>
    <Paragraph position="0"> In a probabilistic context-free grammar (PCFG), each rule X-a is associated with a conditional probability P(a|X) (Manning and Sch&amp;quot;utze, 1999).</Paragraph>
    <Paragraph position="1"> Together, these rules induce a distribution over trees P(T). A mixture of PCFGs enriches the basic model  grammar: rules VP-VBD NP PP , S and VP-VBG NP PP PP and rules VP-VBP RB ADJP and VP-VBP ADVP PP. (b) Sibling effects, though not parallel structure, rules: NX-NNS and NX-NN NNS. (d) A special structure for footnotes has rules ROOT-X and X-SYM coocurring with high probability.</Paragraph>
    <Paragraph position="2"> by allowing for multiple grammars, Gi, which we call individual grammars, as opposed to a single grammar. Without loss of generality, we can assume that the individual grammars share the same set of rules. Therefore, each original rule X-a is now associated with a vector of probabilities, P(a|X,i). If, in addition, the individual grammars are assigned prior probabilities P(i), then the entire mixture induces a joint distribution over derivations</Paragraph>
    <Paragraph position="4"> tribution over trees by summing over the grammar index i.</Paragraph>
    <Paragraph position="5"> As a generative derivation process, we can think of this in two ways. First, we can imagine G to be a latent variable on which all productions are conditioned. This view emphasizes that any otherwise unmodeled variable or variables can be captured by the latent variable G. Second, we can imagine selecting an individual grammar Gi and then generating a sentence using that grammar. This view is associated with the expectation that there are multiple grammars for a language, perhaps representing different genres or styles. Formally, of course, the two views are the same.</Paragraph>
    <Section position="1" start_page="14" end_page="16" type="sub_section">
      <SectionTitle>
3.1 Hierarchical Estimation
</SectionTitle>
      <Paragraph position="0"> So far, there is nothing in the formal mixture model to say that rule probabilities in one component have any relation to those in other components. However, we have a strong intuition that many rules, such as NP-DT NN, will be common in all mixture components. Moreover, we would like to pool our data across components when appropriate to obtain more reliable estimators.</Paragraph>
      <Paragraph position="1"> This can be accomplished with a hierarchical estimator for the rule probabilities. We introduce a shared grammar Gs. Associated to each rewrite is now a latent variable L = {S, I} which indicates whether the used rule was derived from the shared grammar Gs or one of the individual grammars Gi:</Paragraph>
      <Paragraph position="3"> where l [?] P(l = I) is the probability of choosing the individual grammar and can also be viewed as a mixing coefficient. Note that</Paragraph>
      <Paragraph position="5"> grammar is the same for all individual grammars.</Paragraph>
      <Paragraph position="6"> This kind of hierarchical estimation is analogous to that used in hierarchical mixtures of naive-Bayes for  text categorization (McCallum et al., 1998). The hierarchical estimator is most easily described as a generative model. First, we choose a individual grammar Gi. Then, for each nonterminal, we select a level from the back-off hierarchy grammar: the individual grammar Gi with probability l, and the shared grammar Gs with probability 1[?]l.</Paragraph>
      <Paragraph position="7"> Finally, we select a rewrite from the chosen level. To emphasize: the derivation of a phrase-structure tree in a hierarchically-estimated mixture of PCFGs involves two kinds of hidden variables: the grammar G used for each sentence, and the level L used at each tree node. These hidden variables will impact both learning and inference in this model.</Paragraph>
    </Section>
    <Section position="2" start_page="16" end_page="16" type="sub_section">
      <SectionTitle>
3.2 Inference: Parsing
</SectionTitle>
      <Paragraph position="0"> Parsing involves inference for a given sentence S.</Paragraph>
      <Paragraph position="1"> One would generally like to calculate the most probable parse - that is, the tree T which has the highest probability P(T|S)[?]summationtexti P(i)P(T|i). However, this is difficult for mixture models. For a single grammar we have:</Paragraph>
      <Paragraph position="3"> This score decomposes into a product and it is simple to construct a dynamic programming algorithm to find the optimal T (Baker, 1979). However, for a mixture of grammars we need to sum over the indi-</Paragraph>
      <Paragraph position="5"> Because of the outer sum, this expression unfortunately does not decompose into a product over scores of subparts. In particular, a tree which maximizes the sum need not be a top tree for any single component.</Paragraph>
      <Paragraph position="6"> As is true for many other grammar formalisms in which there is a derivation / parse distinction, an alternative to finding the most probable parse is to find the most probable derivation (Vijay-Shankar and Joshi, 1985; Bod, 1992; Steedman, 2000). Instead of finding the tree T which maximizes summationtexti P(T,i), we find both the tree T and component i which maximize P(T,i). The most probable derivation can be found by simply doing standard PCFG parsing once for each component, then comparing the resulting trees' likelihoods.</Paragraph>
    </Section>
    <Section position="3" start_page="16" end_page="16" type="sub_section">
      <SectionTitle>
3.3 Learning: Training
</SectionTitle>
      <Paragraph position="0"> Training a mixture of PCFGs from a treebank is an incomplete data problem. We need to decide which individual grammar gave rise to a given observed tree. Moreover, we need to select a generation path (individual grammar or shared grammar) for each rule in the tree. To learn estimate parameters, we can use a standard Expectation-Maximization (EM) approach.</Paragraph>
      <Paragraph position="1"> In the E-step, we compute the posterior distributions of the latent variables, which are in this case both the component G of each sentence and the hierarchy level L of each rewrite. Note that, unlike during parsing, there is no uncertainty over the actual rules used, so the E-step does not require summing over possible trees. Specifically, for the variable G we have</Paragraph>
      <Paragraph position="3"> For the hierarchy level L we can write</Paragraph>
      <Paragraph position="5"> where we slightly abuse notation since the rule X -a can occur multiple times in a tree T.</Paragraph>
      <Paragraph position="6"> In the M-step, we find the maximum-likelihood model parameters given these posterior assignments; i.e., we find the best grammars given the way the training data's rules are distributed between individual and shared grammars. This is done exactly as in the standard single-grammar model using relative expected frequencies. The updates are shown in Figure 3.3, where T = {T1,T2,...}is the training set.</Paragraph>
      <Paragraph position="7"> We initialize the algorithm by setting the assignments from sentences to grammars to be uniform between all the individual grammars, with a small random perturbation to break symmetry.</Paragraph>
    </Section>
  </Section>
</Paper>