File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/evalu/06/w06-2903_evalu.xml
Size: 7,499 bytes
Last Modified: 2025-10-06 13:59:55
<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2903"> <Title>Non-Local Modeling with a Mixture of PCFGs</Title> <Section position="6" start_page="16" end_page="19" type="evalu"> <SectionTitle> 4 Results </SectionTitle> <Paragraph position="0"> We ran our experiments on the Wall Street Journal (WSJ) portion of the Penn Treebank using the standard setup: We trained on sections 2 to 21, and we used section 22 as a validation set for tuning model hyperparameters. Results are reported on all sentences of 40 words or less from section 23. We use a markovized grammar which was annotated with parent and sibling information as a baseline (see Section 4.2). Unsmoothed maximum-likelihood estimates were used for rule probabilities as in Charniak (1996). For the tagging probabilities, we used maximum-likelihood estimates for P(tag|word). Add-one smoothing was applied to unknown and rare (seen ten times or less during training) words before inverting those estimates to give P(word|tag). Parsing was done with a simple Java implementation of an agenda-based chart parser.</Paragraph> <Section position="1" start_page="17" end_page="17" type="sub_section"> <SectionTitle> 4.1 Parsing Accuracy </SectionTitle> <Paragraph position="0"> The EM algorithm is guaranteed to continuously increase the likelihood on the training set until convergence to a local maximum. However, the likelihood on unseen data will start decreasing after a number of iterations, due to overfitting. This is demonstrated in Figure 4. We use the likelihood on the validation set to stop training before overfitting occurs.</Paragraph> <Paragraph position="1"> In order to evaluate the performance of our model, we trained mixture grammars with various numbers of components. For each configuration, we used EM to obtain twelve estimates, each time with a different random initialization. We show the F1-score for the model with highest log-likelihood on the validation set in Figure 4. The results show that a mixture of grammars outperforms a standard, single grammar PCFG parser.2</Paragraph> </Section> <Section position="2" start_page="17" end_page="17" type="sub_section"> <SectionTitle> 4.2 Capturing Rule Correlations </SectionTitle> <Paragraph position="0"> As described in Section 2, we hope that the mixture model will capture long-range correlations in by adding parent annotation, we combine our mixture model with a grammar in which node probabilities depend on the parent (the last vertical ancestor) and the closest sibling (the last horizontal ancestor). Klein and Manning (2003) refer to this grammar as a markovized grammar of vertical order = 2 and horizontal order = 1. Because many local correlations are captured by the markovized grammar, there is a greater hope that observed improvements stem from non-local correlations.</Paragraph> <Paragraph position="1"> In fact, we find that the mixture does capture non-local correlations. We measure the degree to which a grammar captures correlations by calculating the total squared error between LR scores of the grammar and corpus, weighted by the probability of seeing nonterminals. This is 39422 for a single PCFG, but drops to 37125 for a mixture with five individual grammars, indicating that the mixture model better captures the correlations present in the corpus. 
<Section position="2" start_page="17" end_page="17" type="sub_section">
<SectionTitle> 4.2 Capturing Rule Correlations </SectionTitle>
<Paragraph position="0"> As described in Section 2, we hope that the mixture model will capture long-range correlations in the data. Since local correlations can be modeled by adding parent annotation, we combine our mixture model with a grammar in which node probabilities depend on the parent (the last vertical ancestor) and the closest sibling (the last horizontal ancestor). Klein and Manning (2003) refer to this as a markovized grammar of vertical order = 2 and horizontal order = 1. Because many local correlations are already captured by the markovized grammar, there is a greater hope that observed improvements stem from non-local correlations.</Paragraph>
<Paragraph position="1"> In fact, we find that the mixture does capture non-local correlations. We measure the degree to which a grammar captures correlations by calculating the total squared error between the LR scores of the grammar and those of the corpus, weighted by the probability of seeing the nonterminals. This error is 39422 for a single PCFG, but drops to 37125 for a mixture with five individual grammars, indicating that the mixture model better captures the correlations present in the corpus. As a concrete example, in the Penn Treebank we often see the rules FRAG → ADJP and PRN → , SBAR , co-occurring; their LR is 134.</Paragraph>
<Paragraph position="2"> When we learn a single markovized PCFG from the treebank, that grammar gives a likelihood ratio of only 61. However, when we train a hierarchical model composed of a shared grammar and four individual grammars, the grammar's likelihood ratio for these rules rises to 126, which is very close to the empirical ratio.</Paragraph>
</Section>
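For concreteness, the correlation measure used above can be written schematically. The forms below are an assumption made for illustration, not necessarily the exact definitions given earlier in the paper; here w(r_1, r_2) denotes the weight derived from the probability of seeing the corresponding nonterminals.

\[
  \mathrm{LR}(r_1, r_2) \;=\; \frac{P(r_1, r_2)}{P(r_1)\,P(r_2)},
  \qquad
  E \;=\; \sum_{r_1, r_2} w(r_1, r_2)\,
  \bigl(\mathrm{LR}_{\mathrm{grammar}}(r_1, r_2) - \mathrm{LR}_{\mathrm{corpus}}(r_1, r_2)\bigr)^2 .
\]

Under a score of this form, a value of 1 means the two rules occur independently of each other, while values well above 1, such as the LR of 134 quoted above, indicate strong co-occurrence.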
<Section position="3" start_page="17" end_page="18" type="sub_section">
<SectionTitle> 4.3 Genre </SectionTitle>
<Paragraph position="0"> The mixture of grammars model can equivalently be viewed as capturing either non-local correlations or variations in grammar. The latter view suggests that the model might benefit when the syntactic structure varies significantly, as it does between different genres. We tested this with the Brown corpus, of which we used eight different genres (f, g, k, l, m, n, p, and r). We follow Gildea (2001) in using the ninth and tenth sentences of every block of ten as validation and test data, respectively, because a contiguous test section might not be representative due to the genre variation. To test the effects of genre variation, we evaluated various training schemes on the Brown corpus.</Paragraph>
<Paragraph position="1"> The single-grammar baseline for this corpus gives F1 = 79.75, with a log likelihood (LL) on the test data of -242561. The first test, then, was to estimate each individual grammar from only one genre. We did this by assigning sentences to individual grammars by genre, without any EM training. This increases the data likelihood, though it reduces the F1 score (F1 = 79.48, LL = -242332). The increase in likelihood indicates that there are genre-specific features that our model can represent. (The lack of F1 improvement may be attributed to the increased difficulty of estimating rule probabilities after dividing the already scant data available in the Brown corpus; this small quantity of data makes overfitting almost certain.) However, local minima and lack of data make it difficult to learn genre-specific features. If we start with sentences assigned by genre as before, but then train with EM, both F1 and test-data log likelihood drop (F1 = 79.37, LL = -242100). When we use EM with a random initialization, so that sentences are not assigned directly to grammars, the scores go down even further (F1 = 79.16, LL = -242459). This indicates that the model can capture variation between genres, but that maximizing training-data likelihood does not necessarily give maximum accuracy.</Paragraph>
<Paragraph position="2"> Presumably, with more genre-specific data available, learning would generalize better. So genre-specific grammar variation is real, but it is difficult to capture via EM.</Paragraph>
</Section>
<Section position="4" start_page="18" end_page="19" type="sub_section">
<SectionTitle> 4.4 Smoothing Effects </SectionTitle>
<Paragraph position="0"> While the mixture of grammars captures rule correlations, it may also enhance performance via smoothing effects. Splitting the data randomly could produce a smoothed shared grammar, Gs, that is a kind of held-out estimate, which could be superior to the unsmoothed ML estimates of the single-component grammar.</Paragraph>
<Paragraph position="1"> We tested the degree of generalization by evaluating the shared grammar alone, and also a mixture of the shared grammar with the known single grammar. These shared grammars were extracted after training the mixture model with four individual grammars. We found that both the shared grammar alone (F1 = 79.13, LL = -333278) and the shared grammar mixed with the single grammar (F1 = 79.36, LL = -331546) perform worse than a single PCFG (F1 = 79.37, LL = -327658). This indicates that smoothing is not the primary learning effect contributing to the increased F1.</Paragraph>
</Section>
</Section>
</Paper>