File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/01/w01-0521_metho.xml
Size: 12,882 bytes
Last Modified: 2025-10-06 14:07:40
<?xml version="1.0" standalone="yes"?> <Paper uid="W01-0521"> <Title>Corpus Variation and Parser Performance</Title>
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 The Parsing Model </SectionTitle>
<Paragraph position="0"> We take as our baseline parser Model 1 of Collins (1997). The model is a history-based, generative model, in which the probability of a parse tree is found by expanding each node in the tree in turn into its child nodes and multiplying the probabilities of each action in the derivation. It can be thought of as a variety of lexicalized probabilistic context-free grammar, with the rule probabilities factored into three distributions. The first distribution gives the probability of the syntactic category H of the head child of a parent node with category P, head word Hhw, and head tag Hht (the part-of-speech tag of the head word):</Paragraph>
<Paragraph position="1"> P_h(H | P, Hhw, Hht) </Paragraph>
<Paragraph position="2"> The head word and head tag of the new node H are defined to be the same as those of its parent. The remaining two distributions generate the non-head children one after the other. A special #STOP# symbol is generated to terminate the sequence of children for a given parent. Each child is generated in two steps: first, its syntactic category C and head tag Cht are chosen given the parent's and head child's features and a function Δ representing the distance from the head child:</Paragraph>
<Paragraph position="3"> P_c(C, Cht | P, H, Hhw, Hht, Δ) </Paragraph>
<Paragraph position="4"> Then the new child's head word Chw is chosen:</Paragraph>
<Paragraph position="5"> P_cw(Chw | P, H, Hhw, Hht, C, Cht) </Paragraph>
<Paragraph position="6"> For each of the three distributions, the empirical distribution of the training data is interpolated with less specific backoff distributions, as we will see in Section 5. Further details of the model, including the distance features used and the special handling of punctuation, conjunctions, and base noun phrases, are described in Collins (1999).</Paragraph>
<Paragraph position="7"> The fundamental features used in the probability distributions are the lexical heads and head tags of each constituent, the co-occurrences of parent nodes and their head children, and the co-occurrences of child nodes with their head siblings and parents. The probability models of Charniak (1997), Magerman (1995), and Ratnaparkhi (1997) differ in their details but are based on similar features. Models 2 and 3 of Collins (1997) add some slightly more elaborate features to the probability model, as do the additions of Charniak (2000) to the model of Charniak (1997).</Paragraph>
<Paragraph position="8"> Our implementation of Collins' Model 1 performs at 86% precision and recall of labeled parse constituents on the standard Wall Street Journal training and test sets. While this does not reflect the state-of-the-art performance on the WSJ task achieved by the more complex models of Charniak (2000) and Collins (2000), we regard it as a reasonable baseline for the investigation of corpus effects on statistical parsing.</Paragraph>
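To make the factorization above concrete, here is a minimal sketch of how the probability of a single node expansion decomposes into the three distributions, with a #STOP# symbol ending the child sequence. It is an illustration only, not Collins' implementation: the Dist class, the flat count-based estimates, the 1e-12 floor for unseen events, and the simplified distance feature are assumptions made for the sake of a runnable example, and the smoothing discussed in Section 5 is omitted.

```python
import math
from collections import defaultdict

STOP = "#STOP#"

class Dist:
    """Hypothetical conditional distribution estimated from raw counts:
    P(outcome | context) = count(context, outcome) / count(context)."""
    def __init__(self):
        self.joint = defaultdict(int)
        self.context = defaultdict(int)

    def add(self, context, outcome):
        self.joint[(context, outcome)] += 1
        self.context[context] += 1

    def prob(self, context, outcome):
        if self.context[context] == 0:
            return 0.0
        return self.joint[(context, outcome)] / self.context[context]

# The three factored distributions of Collins' Model 1 (Section 3).
P_h = Dist()   # P_h(H | P, Hhw, Hht)
P_c = Dist()   # P_c(C, Cht | P, H, Hhw, Hht, distance)
P_cw = Dist()  # P_cw(Chw | P, H, Hhw, Hht, C, Cht)

def expansion_log_prob(parent, hhw, hht, head_cat, children):
    """Log-probability of expanding one parent node.

    `children` is the ordered list of non-head children as
    (category, head_tag, head_word, distance) tuples; a #STOP#
    child terminates the sequence, as in the model description.
    """
    logp = math.log(P_h.prob((parent, hhw, hht), head_cat) or 1e-12)
    for cat, cht, chw, dist in children + [(STOP, STOP, STOP, None)]:
        logp += math.log(
            P_c.prob((parent, head_cat, hhw, hht, dist), (cat, cht)) or 1e-12)
        if cat != STOP:  # no head word is generated for the STOP symbol
            logp += math.log(
                P_cw.prob((parent, head_cat, hhw, hht, cat, cht), chw) or 1e-12)
    return logp
```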
</Section>
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Parsing Results on the Brown Corpus </SectionTitle>
<Paragraph position="0"> We conducted separate experiments using WSJ data, Brown data, and a combination of the two as training material. For the WSJ data, we observed the standard division into training (sections 2 through 21 of the treebank) and test (section 23) sets. For the Brown data, we reserved every tenth sentence in the corpus as test data, using the other nine for training. This may underestimate the difficulty of the Brown corpus by including sentences from the same documents in the training and test sets.</Paragraph>
<Paragraph position="1"> However, because of the variation within the Brown corpus, we felt that a single contiguous test section might not be representative. Only the subset of the Brown corpus available in the Treebank II bracketing format was used. This subset consists primarily of various fiction genres. Corpus sizes are shown in Table 1. Results are restricted to sentences of 40 words or less. The Brown test set's average sentence was shorter despite the length restriction.</Paragraph>
<Paragraph position="2"> Results for the Brown corpus, along with WSJ results for comparison, are shown in Table 2. The basic mismatch between the two corpora is reflected in the significantly lower performance of the WSJ-trained model on Brown data than on WSJ data (rows 1 and 2). A model trained on Brown data only does significantly better, despite the smaller size of the training set. Combining the WSJ and Brown training data in one model improves performance further, but by less than 0.5% absolute. Similarly, adding the Brown data to the WSJ model increased performance on WSJ by less than 0.5%. Thus, even a large amount of additional data seems to have relatively little impact if it is not matched to the test material.</Paragraph>
<Paragraph position="3"> The more varied nature of the Brown corpus also seems to affect results, as all of the results on Brown are lower than the WSJ result.</Paragraph> </Section>
<Section position="6" start_page="0" end_page="2" type="metho"> <SectionTitle> 5 The Effect of Lexical Dependencies </SectionTitle>
<Paragraph position="0"> The parsers cited above all use some variety of lexical dependency feature to capture statistics on the co-occurrence of pairs of words found in parent-child relations within the parse tree. These word pair relations, also called lexical bigrams (Collins, 1996), are reminiscent of dependency grammars such as Mel'čuk (1988) and the link grammar of Sleator and Temperley (1993). In Collins' Model 1, the word pair statistics occur in the distribution</Paragraph>
<Paragraph position="1"> P_cw(Chw | P, H, Hhw, Hht, C, Cht) </Paragraph>
<Paragraph position="2"> where Hhw represents the head word of a parent node in the tree and Chw the head word of its (non-head) child. (The head word of a parent is the same as the head word of its head child.) Because this is the only part of the model that involves pairs of words, it is also where the bulk of the parameters are found. The large number of possible pairs of words in the vocabulary makes the training data necessarily sparse. In order to avoid assigning zero probability to unseen events, it is necessary to smooth the training data.</Paragraph>
<Paragraph position="3"> The Collins model uses linear interpolation to estimate probabilities from empirical distributions of varying specificities:
P_cw(Chw | P, H, Hhw, Hht, C, Cht) =
    λ1 ~P(Chw | P, H, Hhw, Hht, C, Cht)
  + λ2 ~P(Chw | P, H, Hht, C, Cht)
  + λ3 ~P(Chw | Cht)        (1)
where the interpolation weights λ1, λ2, and λ3 are chosen as a function of the number of examples seen for the conditioning events and the number of unique values seen for the predicted variable. Only the first distribution in this interpolation scheme involves pairs of words, and the third component is simply the probability of a word given its part of speech.</Paragraph>
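As a concrete reading of Equation 1, the sketch below interpolates three empirical estimates of decreasing specificity, setting each weight from the number of examples seen for the context and the number of distinct outcomes observed, in the spirit described above. The recursive weighting scheme, the constant c = 5, and the class and function names are illustrative assumptions rather than the parser's exact formula.

```python
from collections import defaultdict

class BackoffLevel:
    """One empirical distribution ~P(outcome | context) plus the counts
    needed to set its interpolation weight."""
    def __init__(self):
        self.joint = defaultdict(int)     # count(context, outcome)
        self.total = defaultdict(int)     # count(context)
        self.unique = defaultdict(set)    # distinct outcomes per context

    def add(self, context, outcome):
        self.joint[(context, outcome)] += 1
        self.total[context] += 1
        self.unique[context].add(outcome)

    def empirical(self, context, outcome):
        n = self.total[context]
        return self.joint[(context, outcome)] / n if n else 0.0

    def weight(self, context, c=5.0):
        # lambda = n / (n + c * u): more examples and fewer distinct
        # outcomes -> more trust in this level (c is an assumed constant).
        n = self.total[context]
        u = len(self.unique[context])
        return n / (n + c * u) if n else 0.0

def interpolated_prob(levels, contexts, outcome):
    """Recursive linear interpolation over backoff levels, most specific
    first: P = lambda * ~P_specific + (1 - lambda) * P_less_specific."""
    level, context = levels[0], contexts[0]
    lam = level.weight(context)
    p_here = level.empirical(context, outcome)
    if len(levels) == 1:
        return p_here  # last level, e.g. ~P(Chw | Cht) in Equation 1
    p_backoff = interpolated_prob(levels[1:], contexts[1:], outcome)
    return lam * p_here + (1.0 - lam) * p_backoff
```

For P_cw, the three levels would use the contexts (P, H, Hhw, Hht, C, Cht), (P, H, Hht, C, Cht), and (Cht); dropping the most specific level yields the two-component model of Equation 2 discussed below.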
<Paragraph position="4"> Because the word pair feature is the most specific in the model, it is likely to be the most corpus-specific. The vocabularies used in corpora vary, as do the word frequencies. It is reasonable to expect word co-occurrences to vary as well. In order to test this hypothesis, we removed the distribution ~P(Chw | P, H, Hht, Hhw, C, Cht) from the parsing model entirely, relying on the interpolation of the two less specific distributions in the parser:</Paragraph>
<Paragraph position="5"> P'_cw(Chw | P, H, Hhw, Hht, C, Cht) = λ2 ~P(Chw | P, H, Hht, C, Cht) + λ3 ~P(Chw | Cht)        (2) </Paragraph>
<Paragraph position="6"> We performed cross-corpus experiments as before to determine whether the simpler parsing model might be more robust to corpus effects. Results are shown in Table 3.</Paragraph>
<Paragraph position="7"> Perhaps the most striking result is just how little the elimination of lexical bigrams affects the baseline system: performance on the WSJ corpus decreases by less than 0.5% absolute. Moreover, the performance of a WSJ-trained system without lexical bigrams on Brown test data is identical to that of the WSJ-trained system with lexical bigrams. Lexical co-occurrence statistics seem to be of no benefit when attempting to generalize to a new corpus.</Paragraph> </Section>
<Section position="7" start_page="2" end_page="2" type="metho"> <SectionTitle> 6 Pruning Parser Parameters </SectionTitle>
<Paragraph position="0"> The relatively high performance of a parsing model with no lexical bigram statistics on the WSJ task led us to explore whether it might be possible to significantly reduce the size of the parsing model by selectively removing parameters without sacrificing performance. Such a technique reduces the parser's memory requirements as well as the overhead of loading and storing the model, which could be desirable for an application where limited computing resources are available.</Paragraph>
<Paragraph position="1"> Significant effort has gone into developing techniques for pruning statistical language models for speech recognition, and we borrow from this work, using the weighted difference technique of Seymore and Rosenfeld (1996). This technique applies to any statistical model that estimates probabilities by backing off, that is, by using probabilities from a less specific distribution when no data are available for the full distribution, as the following equations show for the general case:</Paragraph>
<Paragraph position="2">
P(e | h) = ~P(e | h)          if e ∉ BO(h)
P(e | h) = α(h) P(e | h')     if e ∈ BO(h)
</Paragraph>
<Paragraph position="3"> Here e is the event to be predicted, h is the set of conditioning events or history, α is a backoff weight, and h' is the subset of conditioning events used for the less specific backoff distribution. BO(h) is the backoff set of events for which no data are present in the specific distribution ~P. In the case of n-gram language modeling, e is the next word to be predicted, and the conditioning events are the n-1 preceding words. In our case, the specific distribution ~P of the backoff model is the P_cw of Equation 1, itself a linear interpolation of three empirical distributions from the training data. The less specific distribution of the backoff model is the P'_cw of Equation 2, an interpolation of two empirical distributions. The backoff weight α is simply 1 - λ1 in our linear interpolation model. The Seymore/Rosenfeld pruning technique can be used to prune backoff probability models regardless of whether the backoff weights are derived from linear interpolation weights or from discounting techniques such as Good-Turing. In order to ensure that the model's probabilities still sum to one, the backoff weight α must be adjusted whenever a parameter is removed from the model. In the Seymore/Rosenfeld approach, parameters are pruned according to the following criterion:
N(e, h) [log p(e | h) - log p'(e | h)]        (3)
where p'(e | h) represents the new backed-off probability estimate after removing p(e | h) from the model and adjusting the backoff weight, and N(e, h) is the count in the training data. This criterion aims to prune probabilities that are similar to their backoff estimates and that are not frequently used. As shown by Stolcke (1998), this criterion is an approximation of the relative entropy between the original and pruned distributions, but it does not take into account the effect of changing the backoff weight on other events' probabilities.</Paragraph>
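The sketch below applies a weighted-difference criterion in the spirit of Equation 3 to a single conditioning history: each explicitly stored probability is compared against the estimate that would replace it after backing off, the log difference is weighted by the training count N(e, h), and entries scoring below the threshold θ are removed, with the backoff weight α recomputed so the distribution still sums to one. The dictionaries, the toy numbers, and the renormalization details are simplifying assumptions, not the parser's actual data structures.

```python
import math

def backoff_alpha(explicit_probs, backoff_probs):
    """alpha(h): probability mass left for unseen events, divided by the
    backoff mass of those events, so the combined model sums to one."""
    covered = sum(explicit_probs.values())
    covered_backoff = sum(backoff_probs[e] for e in explicit_probs)
    return (1.0 - covered) / (1.0 - covered_backoff)

def weighted_difference_prune(explicit_probs, backoff_probs, counts, theta):
    """Prune explicit parameters p(e|h) for a single history h.

    explicit_probs: {event: p(e|h)} stored in the specific distribution
    backoff_probs:  {event: P(e|h')} from the less specific distribution
    counts:         {event: N(e,h)} training counts
    theta:          pruning threshold

    An entry is removed when N(e,h) * [log p(e|h) - log p'(e|h)] < theta,
    where p'(e|h) = alpha'(h) * P(e|h') is the backed-off estimate after
    removing the entry and re-adjusting the backoff weight (cf. Equation 3).
    """
    pruned = dict(explicit_probs)
    for event, p in explicit_probs.items():
        trial = {e: q for e, q in pruned.items() if e != event}
        alpha_trial = backoff_alpha(trial, backoff_probs)
        p_backed_off = alpha_trial * backoff_probs[event]
        score = counts[event] * (math.log(p) - math.log(p_backed_off))
        if score < theta:
            pruned = trial
    return pruned, backoff_alpha(pruned, backoff_probs)

# Toy usage: one history with three explicit events backed off to a
# tag-only distribution (all numbers are made up for illustration).
explicit = {"shares": 0.5, "points": 0.3, "percent": 0.1}
backoff = {"shares": 0.2, "points": 0.2, "percent": 0.3, "cents": 0.3}
counts = {"shares": 50, "points": 3, "percent": 1}
kept, alpha = weighted_difference_prune(explicit, backoff, counts, theta=0.5)
```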
<Paragraph position="4"> Adjusting the threshold θ below which parameters are pruned allows us to successively remove more and more parameters. Results for different values of θ are shown in Table 4.</Paragraph>
<Paragraph position="5"> The complete parsing model derived from the WSJ training set has 735,850 parameters in a total of nine distributions: three levels of backoff for each of the three distributions P_h, P_c, and P_cw. The lexical bigrams are contained in the most specific distribution for P_cw. Removing all of these parameters reduces the total model size by 43%. The results show a gradual degradation as more parameters are pruned.</Paragraph>
<Paragraph position="8"> The ten lexical bigrams with the highest scores for the pruning metric are shown in Table 5 for WSJ and in Table 6 for Brown. The pruning metric of Equation 3 has been normalized by corpus size to allow comparison between WSJ and Brown. The only overlap between the two sets is for pairs of unknown word tokens. The WSJ bigrams are almost all specific to finance, are all word pairs that are likely to appear immediately adjacent to one another, and are all children of the base NP syntactic category. The Brown bigrams, which have lower correlation values by our metric, include verb/subject and preposition/object relations and seem more broadly applicable as a model of English. However, the pairs are not strongly related semantically, no doubt because the first term of the pruning criterion favors the most frequent words, such as forms of common verbs. [Table 5 caption fragment: WSJ, with parent category (other syntactic context variables not shown) and pruning metric.]</Paragraph>
</Section> </Paper>