<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1088"> <Title>Wolfson Building Parks Road</Title> <Section position="5" start_page="698" end_page="699" type="metho"> <SectionTitle> 3 CCG Supertagging and Parsing </SectionTitle> <Paragraph position="0"> Parsing using CCG can be viewed as a two-stage process: firstassignlexicalcategoriestothewords in the sentence, and then combine the categories The WSJ is a paper that I enjoy reading</Paragraph> </Section> <Section position="6" start_page="699" end_page="699" type="metho"> <SectionTitle> NP/N N (S[dcl]\NP)/NP NP/N N (NP\NP)/(S[dcl]/NP) NP (S[dcl]\NP)/(S[ng]\NP) (S[ng]\NP)/NP </SectionTitle> <Paragraph position="0"> together using CCG's combinatory rules.1 We perform stage one using a supertagger.</Paragraph> <Paragraph position="1"> The set of lexical categories used by the supertagger is obtained from CCGbank (Hockenmaier, 2003), a corpus of CCG normal-form derivations derived semi-automatically from the Penn Treebank. Following our earlier work, we apply a frequency cutoff to the training set, only using those categories which appear at least 10 times in sections 02-21, which results in a set of 425 categories. We have shown that the resulting set has very high coverage on unseen data (Clark and Curran, 2004a). Figure 1 gives an example sentence with the CCG lexical categories.</Paragraph> <Paragraph position="2"> The parser is described in Clark and Curran (2004b). It takes POS tagged sentences as input witheachwordassignedasetoflexicalcategories.</Paragraph> <Paragraph position="3"> A packed chart is used to efficiently represent all the possible analyses for a sentence, and the CKY chart parsing algorithm described in Steedman (2000) is used to build the chart. A log-linear model is used to score the alternative analyses.</Paragraph> <Paragraph position="4"> In Clark and Curran (2004a) we described a novel approach to integrating the supertagger and parser: startwithaveryrestrictivesupertaggersetting, so that only a small number of lexical categories is assigned to each word, and only assign more categories if the parser cannot find a spanning analysis. This strategy results in an efficient and accurate parser, with speeds up to 35 sentences per second. Accurate supertagging at low levels of lexical category ambiguity is therefore particularly important when using this strategy.</Paragraph> <Paragraph position="5"> We found in Clark and Curran (2004b) that a large drop in parsing accuracy occurs if automatically assigned POS tags are used throughout the parsing process, rather than gold standard POS tags (almost 2% F-score over labelled dependencies). This is due to the drop in accuracy of the supertagger (see Table 3) and also the fact that the log-linear parsing model uses POS tags as features. The large drop in parsing accuracy demonstrates that improving the performance of POS tag- null different levels of ambiguity.</Paragraph> <Paragraph position="6"> gers is still an important research problem. In this paper we aim to reduce the performance drop of the supertagger by maintaing some POS ambiguity through to the supertagging phase. 
4 Multi-tagging Experiments

We performed several sets of experiments for POS tagging and CCG supertagging to explore the trade-off between ambiguity and tagging accuracy. For both POS tagging and supertagging we varied the average number of tags assigned to each word, to see whether it is possible to significantly increase tagging accuracy with only a small increase in ambiguity. For CCG supertagging, we also compared multi-tagging approaches, with a fixed category ambiguity of 1.4 categories per word.

All of the experiments used Sections 02-21 of CCGbank as training data, Section 00 as development data and Section 23 as final test data. We evaluate both per-word tag accuracy and sentence accuracy, which is the percentage of sentences for which every word is tagged correctly. For the multi-tagging results we consider a word to be tagged correctly if the correct tag appears in the set of tags assigned to the word.

4.1 Results

Table 1 shows the results for multi-POS tagging at different levels of ambiguity. The row corresponding to 1.01 tags per word shows that adding even a tiny amount of ambiguity (1 extra tag in every 100 words) gives a reasonable improvement, whilst adding 1 tag in 20 words, or approximately one extra tag per sentence on the WSJ, gives a significant boost of 1.6% word accuracy and almost 20% sentence accuracy.

The bottom row of Table 1 gives an upper bound on accuracy if the maximum ambiguity is allowed. This involves setting the β value to 0, so all feasible tags are assigned. Note that the performance gain is only 1.6% in sentence accuracy, compared with the previous row, at the cost of a large increase in ambiguity.

Our first set of CCG supertagging experiments compared the performance of several approaches. In Table 2 we give the accuracies when using gold standard POS tags, and also POS tags automatically assigned by our POS tagger described above. Since POS tags are important features for the supertagger maximum entropy model, erroneous tags have a significant impact on supertagging accuracy.

The single method is the single-tagger supertagger, which at 91.5% per-word accuracy is too inaccurate for use with the CCG parser. The remaining rows in the table give multi-tagger results for a category ambiguity of 1.4 categories per word. The noseq method, which performs significantly better than single, does not take into account the previously assigned categories. The best hist method gains roughly another 1% in accuracy over noseq by taking the greedy approach of using only the two most probable previously assigned categories. Finally, the full forward-backward approach described in Section 2.1 gains roughly another 0.6% by considering all possible category histories. We see the largest jump in accuracy just by returning multiple categories. The other more modest gains come from producing progressively better models of the category sequence.
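The β-cutoff rule that controls ambiguity in all of these experiments can be sketched compactly. This is an illustrative implementation, assuming per-word tag distributions (e.g. forward-backward posteriors) are already available as an array; the tag inventory and names are hypothetical.

```python
import numpy as np

# Beta-cutoff multi-tagging: each word keeps every tag whose posterior
# probability is within a factor beta of its most probable tag.
# `posteriors` is an (n_words, n_tags) array of per-word probabilities.

def select_multitags(posteriors: np.ndarray, tags: list, beta: float):
    selected = []
    for word_probs in posteriors:
        threshold = beta * word_probs.max()
        keep = [t for t, p in zip(tags, word_probs) if p >= threshold]
        selected.append(keep)
    return selected

# beta close to 1.0 approaches single-tagging; beta = 0.0 keeps all
# feasible tags (the upper-bound rows in Tables 1 and 3). The reported
# ambiguity is the average number of kept tags per word.
```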
The final set of supertagging experiments in Table 3 demonstrates the trade-off between ambiguity and accuracy. Note that the ambiguity levels need to be much higher to produce similar performance to the POS tagger, and that the upper bound case (β = 0) has a very high average ambiguity. This is to be expected given the much larger CCG tag set.

5 Tag uncertainty throughout the pipeline

Tables 2 and 3 show that supertagger accuracy when using gold-standard POS tags is typically 1% higher than when using automatically assigned POS tags. Clearly, correct POS tags are important features for the supertagger.

Errors made by the supertagger can multiply out when incorrect lexical categories are passed to the parser, so a 1% increase in lexical category error can become much more significant in the parser evaluation. For example, when using the dependency-based evaluation in Clark and Curran (2004b), getting the lexical category wrong for a ditransitive verb automatically leads to three dependencies in the output being incorrect.

We have shown that multi-tagging can significantly increase the accuracy of the POS tagger with only a small increase in ambiguity. What we would like to do is maintain some degree of POS tag ambiguity and pass multiple POS tags through to the supertagging stage (and eventually the parser). There are several ways to encode multiple POS tags as features. The simplest approach is to treat all of the POS tags as binary features, but this does not take into account the uncertainty in each of the alternative tags. What we need is a way of incorporating probability information into the Maximum Entropy supertagger.

6 Real-values in ME models

Maximum Entropy (ME) models, in the NLP literature, are typically defined with binary features, although they do allow real-valued features. The only constraint comes from the optimisation algorithm; for example, GIS only allows non-negative values. Real-valued features are commonly used with other machine learning algorithms.

Binary features suffer from certain limitations of the representation, which make them unsuitable for modelling some properties. For example, POS taggers have difficulty determining whether capitalised, sentence-initial words are proper nouns. A useful way to model this property is to determine the ratio of capitalised and non-capitalised instances of a particular word in a large corpus and use a real-valued feature which encodes this ratio (Vadas and Curran, 2005). The only way to include this feature in a binary representation is to discretize (or bin) the feature values. For this type of feature, choosing appropriate bins is difficult and it may be hard to find a discretization scheme that performs optimally.

Another problem with discretizing feature values is that it imposes artificial boundaries to define the bins. For the example above, we may choose the bins 0 ≤ x < 1 and 1 ≤ x < 2, which separate the values 0.99 and 1.01 even though they are close in value. At the same time, the model does not distinguish between 0.01 and 0.99 even though they are much further apart. Further, if we have not seen cases for the bin 2 ≤ x < 3, then the discretized model has no evidence to determine the contribution of this feature. But for the real-valued model, evidence supporting 1 ≤ x < 2 and 3 ≤ x < 4 provides evidence for the missing bin. Thus the real-valued model generalises more effectively.
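To make the contrast concrete, here is a minimal sketch of the two encodings for a hypothetical capitalisation-ratio feature; the feature names and bin width are illustrative, not those used in the experiments.

```python
# Binned binary encoding vs. a single real-valued feature for a
# hypothetical capitalisation-ratio property. Feature names are
# invented for illustration.

def binned_features(ratio: float, bin_width: float = 1.0) -> dict:
    # Indicator encoding: one binary feature per bin, value 1.0.
    # 0.99 and 1.01 land in different bins, while 0.01 and 0.99
    # land in the same bin and become indistinguishable.
    lo = int(ratio // bin_width) * bin_width
    return {f"cap_ratio_{lo:g}_{lo + bin_width:g}": 1.0}

def real_valued_feature(ratio: float) -> dict:
    # Real-valued encoding: one feature whose value is the ratio
    # itself, so nearby values have nearby effects on the model score.
    return {"cap_ratio": ratio}

print(binned_features(0.99), binned_features(1.01))         # different bins
print(real_valued_feature(0.99), real_valued_feature(1.01))  # close values
```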
One issue that is not addressed here is the interaction between the Gaussian smoothing parameter and real-valued features. Using the same smoothing parameter for real-valued features with vastly different distributions is unlikely to be optimal. However, for these experiments we have used the same value for the smoothing parameter on all real-valued features. This is the same value we have used for the binary features.

7 Multi-POS Supertagging Experiments

We have experimented with four different approaches to passing multiple POS tags as features through to the supertagger. For the later experiments, this required the existing binary-valued framework to be extended to support real values. The level of POS tag ambiguity was varied between 1.05 and 1.3 POS tags per word on average. These results are shown in Table 4.

The first approach is to treat the multiple POS tags as binary features (bin). This simply involves adding the multiple POS tags for each word in both the training and test data. Every assigned POS tag is treated as a separate feature and considered equally important regardless of its uncertainty. Here we see a minor increase in performance over the original supertagger at the lower levels of POS ambiguity. However, as the POS ambiguity is increased, the performance of the binary-valued features decreases and is eventually worse than the original supertagger. This is because at the lowest levels of ambiguity the extra POS tags can be treated as being of similar reliability, whereas at higher levels of ambiguity many unreliable POS tags are added which should not be trusted equally.

The second approach (split) uses real-valued features to model some degree of uncertainty in the POS tags, dividing the POS tag probability mass evenly among the alternatives. This has the effect of giving smaller feature values to tags where many alternative tags have been assigned. This produces similar results to the binary-valued features, again performing best at low levels of ambiguity.

The third approach (invrank) is to use the inverse rank of each POS tag as a real-valued feature. The inverse rank is the reciprocal of the tag's rank ordered by decreasing probability. This method assumes the POS tagger correctly orders the alternative tags, but does not rely on the probability assigned to each tag. Overall, invrank performs worse than split.

The final and best approach is to use the probabilities assigned to each alternative tag as real-valued features:

  f_t(w) = p(t | w)

where p(t | w) is the probability the POS tagger assigns to tag t for word w, for each tag t in the word's multi-tag set. This model gives the best performance at 1.1 POS tags per-word average ambiguity. Note that, even when using the probabilities as features, only a small amount of additional POS ambiguity is required to significantly improve performance.
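A compact sketch of the four encodings follows, under the assumption that each word arrives with its multi-tag set as (tag, probability) pairs sorted by decreasing probability; the feature-name scheme is invented for illustration.

```python
# Illustrative sketch of the four POS feature encodings compared in
# Table 4. `pos_tags` is one word's multi-tag set as (tag, probability)
# pairs, sorted by decreasing probability.

def pos_features(pos_tags, method: str) -> dict:
    feats = {}
    for rank, (tag, prob) in enumerate(pos_tags, start=1):
        if method == "bin":       # every tag an equally-trusted indicator
            feats[f"pos={tag}"] = 1.0
        elif method == "split":   # probability mass divided evenly
            feats[f"pos={tag}"] = 1.0 / len(pos_tags)
        elif method == "invrank": # reciprocal of the tag's rank
            feats[f"pos={tag}"] = 1.0 / rank
        elif method == "prob":    # the tagger's own probability (best)
            feats[f"pos={tag}"] = prob
    return feats

tags = [("NN", 0.7), ("VB", 0.2), ("JJ", 0.1)]
for m in ("bin", "split", "invrank", "prob"):
    print(m, pos_features(tags, m))
```

The encodings differ only in the feature values they emit, which is why extending the framework to real values was the main implementation requirement.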
Table 4: Supertagging accuracy with different levels of POS ambiguity and using different approaches to POS feature encoding.

Table 5 shows our best performance figures for the multi-POS supertagger, against the previously described method using both gold standard and automatically assigned POS tags. Table 6 uses the Section 23 test data to demonstrate the improvement in supertagging when moving from single-tagging (single) to simple multi-tagging (noseq); from simple multi-tagging to the full forward-backward algorithm (fwdbwd); and finally when using the probabilities of multiply-assigned POS tags as features (MULTI-POS column). All of these multi-tagging experiments use an ambiguity level of 1.4 categories per word, and the last result uses a POS tag ambiguity of 1.1 tags per word.