<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1157">
<Title>Verb Phrase Ellipsis detection using Automatically Parsed Text</Title>
<Section position="4" start_page="0" end_page="0" type="metho">
<SectionTitle> 3 Corpus description </SectionTitle>
<Paragraph position="0"> The British National Corpus (BNC) (Leech, 1992) is annotated with POS tags, using the CLAWS-4 tagset. A range of sections of the BNC, containing around 370k words with 645 samples of VPE, was used as training data. The separate development data consists of around 74k words with 200 samples of VPE.</Paragraph>
<Paragraph position="1"> The Penn Treebank (Marcus et al., 1994) has more than a hundred phrase labels and a number of empty categories, but uses a coarser tagset. A mixture of sections from the Wall Street Journal and Brown corpus was used.</Paragraph>
<Paragraph position="2"> The training section consists of around 540k words and contains 522 samples of VPE. The development section consists of around 140k words and contains 150 samples of VPE.</Paragraph>
</Section>
<Section position="5" start_page="0" end_page="0" type="metho">
<SectionTitle> 4 Experiments using the Penn Treebank </SectionTitle>
<Paragraph position="0"> To test what gains are possible through the use of more complex data such as parse trees, the Penn Treebank is used for the second round of experiments. The results are presented as new features are added in a cumulative fashion, so each experiment also contains the data contained in those before it; the close-to-punctuation experiment contains the words and POS tags from the experiment before it, the next experiment contains all of these plus the heuristic baseline, and so on.</Paragraph>
<Paragraph position="1"> Words and POS tags The Treebank, besides POS tags and category headers associated with the nodes of the parse tree, includes empty category information. For the initial experiments, the empty category information is ignored, and the words and POS tags are extracted from the trees. The results in Table 2 are considerably poorer than those for the BNC, despite the comparable data sizes. This can be accounted for by the coarser tagset employed.</Paragraph>
<Paragraph position="2"> Close to punctuation A very simple feature that checks for auxiliaries close to punctuation marks was tested. Table 3 shows the performance of the feature itself, characterised by very low precision, and the results obtained by using it. It gives a 3% increase in F1 for GIS-MaxEnt, but a 1.5% decrease for L-BFGS-MaxEnt and a 0.5% decrease for MBL.</Paragraph>
<Paragraph position="3"> This brings up the point that the individual success rate of the features will not be in direct correlation with gains in overall results. Their contribution will be high if they have high precision for the cases they are meant to address, and if they produce a different set of results from those already handled well, complementing the existing features. Overlap between features can be useful to give greater confidence when they agree, but low precision in a feature can also increase false positives, decreasing performance. Also, the small size of the development set can contribute to fluctuations in results.</Paragraph>
<Paragraph position="4"> Heuristic Baseline A simple heuristic approach was developed to form a baseline using only POS data. The method takes all auxiliaries as possible candidates and then eliminates them using local syntactic information in a very simple way. It searches forwards within a short range of words and, if it encounters any other verbs, adjectives, nouns, prepositions, pronouns or numbers, classifies the auxiliary as not elliptical. It also does a short backwards search for verbs. The forward search looks 7 words ahead and the backwards search 3. Both skip 'asides', which are taken to be snippets between commas without verbs in them, such as: "... papers do, however, show ...". This feature gives a 3.5 - 4.5% improvement.</Paragraph>
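<Paragraph> As an illustration only (not code from the paper), the baseline just described can be sketched roughly as follows, assuming pre-tagged (word, tag) input. The tag groupings, the helper names, and the assumption that a verb shortly before the auxiliary also rules a candidate out are our own, since the paper does not list the exact tag sets used.

# Rough sketch of the POS-only heuristic baseline described above.
# Tag tests are placeholders; the paper does not enumerate the tag sets.

FORWARD_WINDOW = 7   # words searched ahead of the candidate auxiliary
BACKWARD_WINDOW = 3  # words searched behind it

def is_verb(tag):
    return tag.startswith("V")            # assumed convention for verb tags

def blocks_ellipsis(tag):
    # Verbs, adjectives, nouns, prepositions, pronouns or numbers ahead of
    # the auxiliary indicate a non-elliptical use (illustrative prefixes).
    return tag[:1] in {"V", "J", "N", "I", "P"} or tag.startswith("CD")

def strip_asides(span):
    """Drop 'asides': spans between commas that contain no verb,
    e.g. ", however," in "papers do, however, show"."""
    kept, aside, inside = [], [], False
    for word, tag in span:
        if word == ",":
            if inside:
                if any(is_verb(t) for _, t in aside):
                    kept.extend(aside)    # not an aside: it contains a verb
                aside, inside = [], False
            else:
                inside = True
            continue
        (aside if inside else kept).append((word, tag))
    return kept + aside

def looks_elliptical(tagged, i):
    """Classify the auxiliary at position i of a (word, tag) list."""
    ahead = strip_asides(tagged[i + 1:i + 1 + FORWARD_WINDOW])
    if any(blocks_ellipsis(t) for _, t in ahead):
        return False
    behind = strip_asides(tagged[max(0, i - BACKWARD_WINDOW):i])
    # Assumption: a verb shortly before the auxiliary also rules it out.
    return not any(is_verb(t) for _, t in behind)
</Paragraph>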
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> Surrounding categories </SectionTitle>
<Paragraph position="0"> The next feature added is the categories of the previous and the next branches of the tree. So in the example in Figure 1, the previous category of the elliptical verb is ADVP-PRD-TPC-2, and the next category NP-SBJ.</Paragraph>
<Paragraph position="1"> The results of using this feature are seen in Table 5, giving a 1.6 - 3.5% boost.</Paragraph>
</Section>
</Section>
<Section position="6" start_page="0" end_page="0" type="metho">
<SectionTitle> Auxiliary-final VP </SectionTitle>
<Paragraph position="0"> For auxiliary verbs parsed as verb phrases (VP), this feature checks if the final element in the VP is an auxiliary or negation. If so, no main verb can be present, as a main verb cannot be followed by an auxiliary or negation. Empty VP VPE sites are marked in the Treebank by an empty VP, and searching for the pattern '(VP (-NONE- *?*))' identifies them. This achieves 60% F1 on our development set.</Paragraph>
<Paragraph position="1"> Our findings are in line with Hardt's, who reports 48% F1, with the difference being due to the different sections of the Treebank used.</Paragraph>
<Paragraph position="2"> It was observed that this search may be too restrictive to catch pseudo-gapping (Figure 2) and some examples of VPE in the corpus (Figure 3). We modify the search pattern to be '(VP (-NONE- *?*) ... )', which is a VP that contains an empty element but can contain other categories after it as well. This improves the feature itself by 10% in F1 and gives a 10 - 14% improvement to the algorithms (Table 7); a sketch of both patterns is given below.</Paragraph>
<Paragraph position="3"> Finally, empty category information is included completely, such that empty categories are treated as words, or leaves of the parse tree, and included in the context. Table 8 shows that adding this information results in a 2.5 - 4.9% increase.</Paragraph>
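<Paragraph> For concreteness, the auxiliary-final VP test could look roughly like the following over nltk-style Treebank parses; the auxiliary and negation word lists are our own illustrative assumption, since the paper does not enumerate them.

from nltk.tree import Tree

# Illustrative word lists; the paper does not give the exact sets used.
AUXILIARIES = {"be", "is", "are", "was", "were", "been", "being",
               "do", "does", "did", "have", "has", "had",
               "can", "could", "will", "would", "shall", "should",
               "may", "might", "must"}
NEGATION = {"not", "n't"}

def is_aux_or_neg(preterminal):
    word = preterminal[0].lower()
    return word in AUXILIARIES or word in NEGATION or preterminal.label() == "MD"

def auxiliary_final_vp(vp):
    """True if the last element of this VP is an auxiliary or negation,
    so no main verb can follow inside the phrase."""
    last = vp[-1]
    return isinstance(last, Tree) and last.height() == 2 and is_aux_or_neg(last)

# e.g. auxiliary_final_vp(Tree.fromstring("(VP (VBZ does) (RB n't))"))  -> True
</Paragraph>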
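<Paragraph> The strict and relaxed empty-VP searches described above can likewise be pictured as a small tree filter. The pattern strings come from the text; the matching code itself is an assumed reimplementation using nltk, not the authors' program.

from nltk.tree import Tree

def empty_vps(tree, strict=True):
    """Find VP nodes signalling VPE via an empty element.

    strict=True  mirrors '(VP (-NONE- *?*))': a VP whose only child is the
                 empty element *?*.
    strict=False mirrors '(VP (-NONE- *?*) ... )': the empty element comes
                 first but further material may follow (e.g. pseudo-gapping).
    """
    hits = []
    for sub in tree.subtrees(lambda t: t.label() == "VP"):
        first = sub[0]
        starts_empty = (isinstance(first, Tree) and first.label() == "-NONE-"
                        and first[0] == "*?*")
        if starts_empty and (not strict or len(sub) == 1):
            hits.append(sub)
    return hits

# t = Tree.fromstring("(S (NP-SBJ (NNP John)) (VP (VBD did) (VP (-NONE- *?*))))")
# len(empty_vps(t, strict=False))  -> 1
</Paragraph>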
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> Cross-validation </SectionTitle>
<Paragraph position="0"> We perform cross-validation with and without the features developed, to measure the improvement obtained through their use. The cross-validation results show a different ranking of the algorithms by performance than on the development set (Table 9), but one consistent with the results for the BNC corpus. MBL shows consistent performance, L-BFGS-MaxEnt gets somewhat lower results and GIS-MaxEnt much lower. These results indicate that the confidence threshold settings of the maximum entropy models were over-optimized for the development data, and perhaps the smoothing for L-BFGS-MaxEnt was as well. MBL, which was used as-is, does not suffer these performance drops. The increase in F1 achieved by adding the features is similar for all algorithms, ranging from 17.9 to 19.8%.</Paragraph>
</Section>
</Section>
<Section position="7" start_page="0" end_page="0" type="metho">
<SectionTitle> 5 Experiments with Automatically Parsed data </SectionTitle>
<Paragraph position="0"> The next set of experiments uses the BNC and Treebank, but strips their POS and parse information and parses them automatically using two different parsers. This enables us to test what kind of performance is possible for real-world applications.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 5.1 Parsers used </SectionTitle>
<Paragraph position="0"> Charniak's parser (2000) is a combined probabilistic context-free grammar and maximum entropy parser. It is trained on the Penn Treebank, and achieves a 90.1% recall and precision average for sentences of 40 words or less. While Charniak's parser does not generate empty category information, Johnson (2002) has developed an algorithm that extracts patterns from the Treebank which can be used to insert empty categories into the parser's output. This program will be used in conjunction with Charniak's parser.</Paragraph>
<Paragraph position="1"> Robust Accurate Statistical Parsing (RASP) (Briscoe and Carroll, 2002) uses a combination of statistical techniques and a hand-crafted grammar. RASP is trained on a range of corpora, and uses a more complex tagging system (CLAWS-2), like that of the BNC. On our data, this parser generated full parses for 70% of the sentences and partial parses for 28%, while 2% were not parsed, returning POS tags only.</Paragraph>
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 5.2 Reparsing the Treebank </SectionTitle>
<Paragraph position="0"> The results of experiments using the two parsers (Table 10) show generally similar performance. Preiss (2003) shows that for the task of anaphora resolution these two parsers produce very similar results, which is consistent with our findings. Compared to results on the original Treebank with similar data (Table 6), the results are low, which is not surprising given the errors introduced by the parsing process. It is noticeable that the addition of features has less effect: 0-6%.</Paragraph>
<Paragraph position="1"> The auxiliary-final VP feature (Table 11), which is determined by parse structure, is only half as successful for RASP. Conversely, the heuristic baseline, which relies on POS tags, is more successful for RASP, as it has a more detailed tagset. The empty VP feature retains a high precision of over 80%, but its recall drops by 50% to 20%, suggesting that the empty-category insertion algorithm is sensitive to parsing errors.</Paragraph>
<Paragraph position="2"> The results are consistent with experiments on the development set.</Paragraph>
<Paragraph position="3"> This is because settings were not optimized for the development set, but were left as they were from previous experiments. Results here show better performance for RASP overall.</Paragraph>
</Section>
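<Paragraph> Since every table in these sections reports precision, recall and F1 over VPE instances, a small scoring helper may help fix the terms. These are the standard definitions, not code from the paper, and the example counts are invented for illustration only.

def prf1(tp, fp, fn):
    """Precision, recall and F1 from true positive, false positive and
    false negative counts over VPE decisions (standard definitions)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical counts: 40 VPEs found out of 150, with 10 false alarms.
# prf1(40, 10, 110)  -> (0.8, 0.267, 0.4)
</Paragraph>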
<Section position="3" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 5.3 Parsing the BNC </SectionTitle>
<Paragraph position="0"> Experiments using parsed versions of the BNC corpora (Tables 13, 15) show similar results to the original ones (Table 1), but the features generate only a 3% improvement, suggesting that many of the cases in the test set can be identified using similar contexts in the training data, and that the features do not add extra information. The performance of the features (Table 14) remains similar to that for the re-parsed Treebank experiments, except for empty VP, where there is a 7% drop in F1, due to Charniak's parser being trained on the Treebank only.</Paragraph>
</Section>
<Section position="4" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 5.4 Combining BNC and Treebank data </SectionTitle>
<Paragraph position="0"> Combining the re-parsed BNC and Treebank data gives a more robust training set of 1167 VPEs and a development set of 350 VPEs. The results (Tables 16, 17) show only a 2-3% improvement when the features are added. Again, simple contextual information is successful in correctly identifying most of the VPEs.</Paragraph>
<Paragraph position="1"> It is also seen that the increase in data size is not matched by a large increase in performance. This may be because simple cases are already handled, and for more complex cases the context size limits the usefulness of added data. The differences between the two corpora may also limit the relevance of examples from one to the other.</Paragraph>
</Section>
</Section>
</Paper>