<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-1027">
  <Title>Supervised and unsupervised PCFG adaptation to novel domains</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 MAP estimation
</SectionTitle>
    <Paragraph position="0"> In the maximum a posteriori estimation framework described in detail in Gauvain and Lee (1994), the model parameters are assumed to be a random vector in the space . Given an observation sample x, the MAP estimate is obtained as the mode of the posterior distribution of denoted</Paragraph>
    <Paragraph position="2"> In the case of n-gram model adaptation, as discussed in Bacchiani and Roark (2003), the objective is to estimate probabilities for a discrete distribution across words, entirely analogous to the distribution across mixture components within a mixture density, which is a common use for MAP estimation in ASR. A practical candidate for the prior distribution of the weights !1;!2; ;!K, is its conjugate prior, the Dirichlet density,</Paragraph>
    <Paragraph position="4"> where i &gt; 0 are the parameters of the Dirichlet distribution. With such a prior, if the expected counts for the i-th component is denoted as ci, the mode of the posterior distribution is obtained as</Paragraph>
    <Paragraph position="6"> We can use this formulation to estimate the posterior, but we must still choose the parameters of the Dirichlet. First, let us introduce some notation. A context-free grammar (CFG) G = (V;T;P;Sy), consists of a set of non-terminal symbols V, a set of terminal symbols T, a start symbol Sy2V, and a set of rule productions P of the form: A ! , where A 2 V and 2 (V [T) . A probabilistic context-free grammar (PCFG) is a CFG with a probability assigned to each rule, such that the probabilities of all rules expanding a given non-terminal sum to one; specifically, each right-hand side has a probability given the left-hand side of the rule1.</Paragraph>
    <Paragraph position="7"> LetAdenote the left-hand side of a production, and i the i-th possible expansion of A. Let the probability estimate for the production A ! i according to the out-of-domain model be denoted as eP( ijA) and let the expected adaptation counts be denoted as c(A ! i). Then the parameters of the prior distribution for left-hand side A are chosen as</Paragraph>
    <Paragraph position="9"> where A is the left-hand side dependent prior weighting parameter. This choice of prior parameters defines the MAP estimate of the probability of expansion i from the left-hand side A as</Paragraph>
    <Paragraph position="11"> A +PKk=1c(A! k) 1 i K: (5) Note that the MAP estimates with this parameterization reduce to the out-of-domain model parameters in the absence of adaptation data.</Paragraph>
    <Paragraph position="12"> Each left-hand side A has its own prior distribution, parameterized with A. This presents an over-parameterization problem. We follow Gauvain and Lee (1994) in adopting a parameter tying approach. As pointed out in Bacchiani and Roark (2003), two methods of parameter tying, in fact, correspond to two well known model mixing approaches, namely count merging and model interpolation. Let eP andec denote the probabilities and counts from the out-of-domain model, and let P and c denote the probabilities and counts from the adaptation model (i.e. in-domain).</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Count Merging
</SectionTitle>
      <Paragraph position="0"> If the left-hand side dependent prior weighting parameter is chosen as</Paragraph>
      <Paragraph position="2"> the MAP adaptation reduces to count merging, scaling the out-of-domain counts with a factor and the in-domain counts with a factor :</Paragraph>
      <Paragraph position="4"> 1An additional condition for well-formedness is that the PCFG is consistent or tight, i.e. there is no probability mass lost to infinitely large trees. Chi and Geman (1998) proved that this condition is met if the rule probabilities are estimated using relative frequency estimation from a corpus.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Model Interpolation
</SectionTitle>
      <Paragraph position="0"> If the left-hand side dependent prior weighting parameter is</Paragraph>
      <Paragraph position="2"/>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Other Tying Candidates
</SectionTitle>
      <Paragraph position="0"> While we will not be presenting empirical results for other parameter tying approaches in this paper, we should point out that the MAP framework is general enough to allow for other schema, which could potentially improve performance over simple count merging and model interpolation approaches. For example, one may choose a more complicated left-hand side dependent prior weighting parameter such as</Paragraph>
      <Paragraph position="2"> for some threshold . Such a schema may do a better job of managing how quickly the model moves away from the prior, particularly if there is a large difference in the respective sizes of the in-domain and out-of domain corpora. We leave the investigation of such approaches to future research.</Paragraph>
      <Paragraph position="3"> Before providing empirical results on the count merging and model interpolation approaches, we will introduce the parser and parsing models that were used.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Grammar and parser
</SectionTitle>
    <Paragraph position="0"> For the empirical trials, we used a top-down, left-to-right (incremental) statistical beam-search parser (Roark, 2001a; Roark, 2003). We refer readers to the cited papers for details on this parsing algorithm. Briefly, the parser maintains a set of candidate analyses, each of which is extended to attempt to incorporate the next word into a fully connected partial parse. As soon as &amp;quot;enough&amp;quot; candidate parses have been extended to the next word, all parses that have not yet attached the word are discarded, and the parser moves on to the next word. This beam search is parameterized with a base beam parameter , which controls how many or how few parses constitute &amp;quot;enough&amp;quot;. Candidate parses are ranked by a figure-of-merit, which promotes better candidates, so that they are worked on earlier. The figure-of-merit consists of the probability of the parse to that point times a look-ahead statistic, which is an estimate of how much probability mass it will take to connect the parse with the next word. It is a generative parser that does not require any pre-processing, such as POS tagging or chunking. It has been demonstrated in the above papers to perform competitively on standard statistical parsing tasks with full coverage. Baseline results below will provide a comparison with other well known statistical parsers.</Paragraph>
    <Paragraph position="1"> The PCFG is a Markov grammar (Collins, 1997; Charniak, 2000), i.e. the production probabilities are estimated by decomposing the joint probability of the categories on the right-hand side into a product of conditionals via the chain rule, and making a Markov assumption. Thus, for example, a first order Markov grammar conditions the probability of the category of thei-th child of the left-hand side on the category of the left-hand side and the category of the (i-1)-th child of the left-hand side. The benefits of Markov grammars for a top-down parser of the sort we are using is detailed in Roark (2003). Further, as in Roark (2001a; 2003), the production probabilities are conditioned on the label of the left-hand side of the production, as well as on features from the left-context. The model is smoothed using standard deleted interpolation, wherein a mixing parameter is estimated using EM on a held out corpus, such that probability of a production A ! , conditioned on j features from the left context, Xj1 = X1:::Xj, is defined recursively as</Paragraph>
    <Paragraph position="3"> where bP is the maximum likelihood estimate of the conditional probability. These conditional probabilities decompose via the chain rule as mentioned above, and a Markov assumption limits the number of previous children already emitted from the left-hand side that are conditioned upon.</Paragraph>
    <Paragraph position="4"> These previous children are treated exactly as other conditioning features from the left context. Table 1 gives the conditioning features that were used for all empirical trials in this paper. There are different conditioning features for parts-of-speech (POS) and non-POS non-terminals. Deleted interpolation leaves out one feature at a time, in the reverse order as they are presented in the table 1.</Paragraph>
    <Paragraph position="5"> The grammar that is used for these trials is a PCFG that is induced using relative frequency estimation from a transformed treebank. The trees are transformed with a selective left-corner transformation (Johnson and Roark, 2000) that has been flattened as presented in Roark (2001b). This transform is only applied to left-recursive productions, i.e.</Paragraph>
    <Paragraph position="6"> productions of the form A ! A . The transformed trees look as in figure 1. The transform has the benefit for a top-down incremental parser of this sort of delaying many of the parsing decisions until later in the string, without unduly disrupting the immediate dominance relationships that provide conditioning features for the probabilistic model.</Paragraph>
    <Paragraph position="7">  used in the reported empirical trials The parse trees that are returned by the parser are then detransformed to the original form of the grammar for evaluation2. null For the trials reported in the next section, the base beam parameter is set at = 10. In order to avoid being pruned, a parse must be within a probability range of the best scoring parse that has incorporated the next word. Letk be the number of parses that have incorporated the next word, and let ~p be the best probability from among that set. Then the probability of a parse must be above ~pk310 to avoid being pruned.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Empirical trials
</SectionTitle>
    <Paragraph position="0"> The parsing models were trained and tested on treebanks from the Penn Treebank II. For the Wall St. Journal portion, we used the standard breakdown: sections 2-21 were kept training data; section 24 was held-out development data; and section 23 was for evaluation. For the Brown corpus portion, we obtained the training and evaluation sections used in Gildea (2001). In that paper, no held-out section was used for parameter tuning3, so we further partitioned the training data into kept and held-out data. The sizes of the corpora are given in table 2, as well as labels that are used to refer to the corpora in subsequent tables.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Baseline performance
</SectionTitle>
      <Paragraph position="0"> The first results are for parsing the Brown corpus. Table 3 presents our baseline performance, compared with the Gildea (2001) results. Our system is labeled as 'MAP'. All parsing results are presented as labeled precision and recall.</Paragraph>
      <Paragraph position="1"> Whereas Gildea (2001) reported parsing results just for sentences of length less than or equal to 40, our results are for all sentences. The goal is not to improve upon Gildea's parsing performance, but rather to try to get more benefit from the out-of-domain data. While our performance is 0.51.5 percent better than Gildea's, the same trends hold - low eighties in accuracy when using the Wall St. Journal (out-ofdomain) training; mid eighties when using the Brown corpus training. Notice that using the Brown held out data with the Wall St. Journal training improved precision substantially.</Paragraph>
      <Paragraph position="2"> Tuning the parameters on in-domain data can make a big difference in parser performance. Choosing the smoothing parameters as Gildea did, based on the distribution within the corpus itself, may be effective when parsing within the same distribution, but appears less so when using the tree-bank for parsing outside of the domain.</Paragraph>
      <Paragraph position="3">  the WSJ Treebank. Note, again, that the Gildea results are for sentences 40 words in length, while all others are for all sentences in the test set. Also, Gildea did not report performance of a Brown corpus trained parser on the WSJ. Our performance under that condition is not particularly good, but again using an in-domain held out set for parameter tuning provided a substantial increase in accuracy, somewhat more in terms of precision than recall. Our baseline results for a WSJ section 2-21 trained parser are slightly better than the Gildea parser, at more-or-less the same level of performance as Charniak (1997) and Ratnaparkhi (1999), but several points below the best reported results on this task.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Supervised adaptation
</SectionTitle>
      <Paragraph position="0"> Table 5 presents parsing results on the Brown;E test set for models using both in-domain and out-of-domain training data. The table gives the adaptation (in-domain) treebank that was used, and the A that was used to combine the adaptation counts with the model built from the out-of-domain treebank. Recall that ec(A) times the out-of-domain model yields count merging, with the ratio of out-of-domain to in-domain counts; and c(A) times the out-of-domain model yields model interpolation, with the ratio of out-of-domain to in-domain probabilities. Gildea (2001) merged the two corpora, which just adds the counts from the out-of-domain treebank to the in-domain treebank, i.e. = 1.</Paragraph>
      <Paragraph position="1"> This resulted in a 0.25 improvement in the F-measure. In our case, combining the counts in this way yielded a half a point, perhaps because of the in-domain tuning of the smoothing parameters. However, when we optimize empirically on the held-out corpus, we can get nearly a full point improvement. Model interpolation in this case per- null that the Gildea results are for sentences 40 words in length. All others include all sentences.</Paragraph>
      <Paragraph position="2"> forms nearly identically to count merging.</Paragraph>
      <Paragraph position="3"> Adaptation to the Brown corpus, however, does not adequately represent what is likely to be the most common adaptation scenario, i.e. adaptation to a consistent domain with limited in-domain training data. The Brown corpus is not really a domain; it was built as a balanced corpus, and hence is the aggregation of multiple domains. The reverse scenario - Brown corpus as out-of-domain parsing model and Wall St. Journal as novel domain - is perhaps a more natural one. In this direction, Gildea (2001) also reported very small improvements when adding in the out-of-domain treebank. This may be because of the same issue as with the Brown corpus, namely that the optimal ratio of in-domain to out-of-domain is not 1 and the smoothing parameters need to be tuned to the new domain; or it may be because the new domain has a million words of training data, and hence has less use for out-of-domain data. To tease these apart, we partitioned the WSJ training data (sections 2-21) into smaller treebanks, and looked at the gain provided by adaptation as the in-domain observations grow. These smaller treebanks provide a more realistic scenario: rapid adaptation to a novel domain will likely occur with far less manual annotation of trees within the new domain than can be had in the full Penn Treebank.</Paragraph>
      <Paragraph position="4"> Table 6 gives the baseline performance on WSJ;23, with models trained on fractions of the entire 2-21 test set. Sections 2-21 contain approximately 40,000 sentences, and we partitioned them by percentage of total sentences. From table 6 we can see that parser performance degrades quite dramatically when there is less than 20,000 sentences in the training set, but that even with just 2000 sentences, the system outperforms one trained on the Brown corpus.</Paragraph>
      <Paragraph position="5"> Table 7 presents parsing accuracy when a model trained on the Brown corpus is adapted with part or all of the WSJ training corpus. From this point forward, we only present results for count merging, since model interpolation consistently performed 0.2-0.5 points below the count merging  approach4. The A mixing parameter was empirically optimized on the held out set when the in-domain training was just 10% of the total; this optimization makes over a point difference in accuracy. Like Gildea, with large amounts of in-domain data, adaptation improved our performance by half a point or less. When the amount of in-domain data is small, however, the impact of adaptation is much greater.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Unsupervised adaptation
</SectionTitle>
      <Paragraph position="0"> Bacchiani and Roark (2003) presented unsupervised MAP adaptation results for n-gram models, which use the same methods outlined above, but rather than using a manually annotated corpus as input to adaptation, instead use an automatically annotated corpus. Their automatically annotated corpus was the output of a speech recognizer which used the out-of-domain n-gram model. In our case, we use the parsing model trained on out-of-domain data, and output a set of candidate parse trees for the strings in the in-domain corpus, with their normalized scores. These normalized scores (posterior probabilities) are then used to give weights to the features extracted from each candidate parse, in just the way that they provide expected counts for an expectation maximization algorithm.</Paragraph>
      <Paragraph position="1"> For the unsupervised trials that we report, we collected up to 20 candidate parses per string5. We were interested in investigating the effects of adaptation, not in optimizing performance, hence we did not empirically optimize the mixing parameter A for the new trials, so as to avoid obscuring the effects due to adaptation alone. Rather, we used the best 4This is consistent with the results presented in Bacchiani and Roark (2003), which found a small but consistent improvement in performance with count merging versus model interpolation for n-gram modeling.</Paragraph>
      <Paragraph position="2"> 5Because of the left-to-right, heuristic beam-search, the parser does not produce a chart, rather a set of completed parses.</Paragraph>
      <Paragraph position="3"> performing parameter from the supervised trials, namely 0.20ec(A). Since we are no longer limited to manually annotated data, the amount of in-domain WSJ data that we can include is essentially unlimited. Hence the trials reported go beyond the 40,000 sentences in the Penn WSJ Treebank, to include up to 5 times that number of sentences from other years of the WSJ.</Paragraph>
      <Paragraph position="4"> Table 8 shows the results of unsupervised adaptation as we have described it. Note that these improvements are had without seeing any manually annotated Wall St. Journal treebank data. Using the approximately 40,000 sentences in f2-21, we derived a 3.8 percent F-measure improvement over using just the out of domain data. Going beyond the size of the Penn Treebank, we continued to gain in accuracy, reaching a total F-measure improvement of 4.2 percent with 200 thousand sentences, approximately 5 million words. A second iteration with this best model, i.e. re-parsing the 200 thousand sentences with the adapted model and re-training, yielded an additional 0.65 percent F-measure improvement, for a total F-measure improvement of 4.85 percent over the baseline model.</Paragraph>
      <Paragraph position="5"> A final unsupervised adaptation scenario that we investigated is self-adaptation, i.e. adaptation on the test set itself. Because this adaptation is completely unsupervised, thus does not involve looking at the manual annotations at all, it can be equally well applied using the test set as the unsupervised adaptation set. Using the same adaptation procedure presented above on the test set itself, i.e. producing the top 20 candidates from WSJ;23 with normalized posterior probabilities and re-estimating, we produced a self-adapted parsing model. This yielded an F-measure accuracy of 76.8, which is a 1.1 percent improvement over the baseline.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>