<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-2009">
  <Title>Robust VPE detection using Automatically Parsed Text</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Previous work
</SectionTitle>
    <Paragraph position="0"> Hardt's (1997) algorithm for detecting VPE in the Penn Treebank (see Section 3) achieves precision levels of 44% and recall of 53%, giving an F11 of 48%, using a simple search technique, which relies on the parse annotation having identified empty expressions correctly.</Paragraph>
    <Paragraph position="1"> In previous work (Nielsen, 2003a; Nielsen, 2003b) we performed experiments on the British National Corpus using a variety of machine learning techniques. These earlier results are not directly comparable to Hardt's, due to the different corpora used. The expanded set of results are summarised in Table 1, for Transformation Based Learning (TBL) (Brill, 1995), GIS based Maximum Entropy Modelling (GIS-MaxEnt) (Ratnaparkhi, 1998), L-BFGS based Maximum Entropy</Paragraph>
    <Paragraph position="3"> For all of these experiments, the training features consisted of lexical forms and Part of Speech (POS) tags of the words in a three word forward/backward window of the auxiliary being tested. This context size was determined empirically to give optimum results, and will be used throughout this paper. The L-BFGS-MaxEnt uses Gaussian Prior smoothing which was optimized for the BNC data, while the GIS-MaxEnt has a simple smoothing option available, but this deteriorates results and is not used. MBL was used with its default settings.</Paragraph>
    <Paragraph position="4"> While TBL gave the best results, the software we used (Lager, 1999) ran into memory problems and proved problematic with larger datasets. Decision trees, on the other hand, tend to oversimplify due to the very sparse nature of ellipsis, and produce a single rule that classifies everything as non-VPE. This leaves Maximum Entropy and MBL for further experiments.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Corpus description
</SectionTitle>
    <Paragraph position="0"> The British National Corpus (BNC) (Leech, 1992) is annotated with POS tags, using the CLAWS-4 tagset. A range of V sections of the BNC, containing around 370k words3 with 645 samples of VPE was used as training data. The separate test data consists of around 74k words4 with 200 samples of VPE.</Paragraph>
    <Paragraph position="1"> The Penn Treebank (Marcus et al., 1994) has more than a hundred phrase labels, and a number of empty categories, but uses a coarser tagset. A mixture of sections from the Wall Street Journal and Brown corpus were used. The training section5 consists of around 540k words and contains 522 samples of VPE. The test section6 consists of around 140k words and contains 150 samples of VPE.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Experiments using the Penn Treebank
</SectionTitle>
    <Paragraph position="0"> To experiment with what gains are possible through the use of more complex data such as  parse trees, the Penn Treebank is used for the second round of experiments. The results are presented as new features are added in a cumulative fashion, so each experiment also contains the data contained in those before it.</Paragraph>
    <Paragraph position="1"> Words and POS tags The Treebank, besides POS tags and category headers associated with the nodes of the parse tree, includes empty category information. For the initial experiments, the empty category information is ignored, and the words and POS tags are extracted from the trees. The results in Table 2 are seen to be considerably poorer than those for BNC, despite the comparable data sizes. This can be accounted for by the coarser tagset employed.  Close to punctuation A very simple feature, that checks for auxiliaries close to punctuation marks was tested. Table 3 shows the performance of the feature itself, characterised by very low precision, and results obtained by using it. It gives a 2% increase in F1 for MBL, 3% for GIS-MaxEnt, but a 1.5% decrease for L-BFGS-MaxEnt.</Paragraph>
    <Paragraph position="2"> This brings up the point that the individual success rate of the features will not be in direct correlation with gains in overall results. Their contribution will be high if they have high precision for the cases they are meant to address, and if they produce a different set of results from those already handled well, complementing the existing features. Overlap between features can be useful to have greater confidence when they agree, but low precision in the feature can increase false positives as well, decreasing performance. Also, the small size of the test set can contribute to fluctuations in results.</Paragraph>
    <Paragraph position="3"> Heuristic Baseline A simple heuristic approach was developed to form a baseline. The method takes all auxiliaries</Paragraph>
    <Paragraph position="5"> as possible candidates and then eliminates them using local syntactic information in a very simple way. It searches forwards within a short range of words, and if it encounters any other verbs, adjectives, nouns, prepositions, pronouns or numbers, classifies the auxiliary as not elliptical. It also does a short backwards search for verbs. The forward search looks 7 words ahead and the backwards search 3. Both skip 'asides', which are taken to be snippets between commas without verbs in them, such as : &amp;quot;... papers do, however, show ...&amp;quot;. This feature gives a 4.5% improvement for MBL (Table</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Surrounding categories
</SectionTitle>
      <Paragraph position="0"> The next feature added is the categories of the previous branch of the tree, and the next branch. So in the example in Figure 1, the previous category of the elliptical verb is ADVP-PRD-TPC-2, and the next category NP-SBJ. The results of using this feature are seen in Table 5, giving a 3.5% boost to</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
Auxiliary-final VP
</SectionTitle>
    <Paragraph position="0"> For auxiliary verbs parsed as verb phrases (VP), this feature checks if the final element in the VP is an auxiliary or negation. If so, no main verb can be present, as a main verb cannot be followed by an auxiliary or negation. This feature was used by Hardt (1993) and gives a 3.5% boost to performance for MBL, 6% for GIS-MaxEnt, and 3.4%</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Empty VP
</SectionTitle>
      <Paragraph position="0"> Hardt (1997) uses a simple pattern check to search for empty VP's identified by the Treebank, (VP (-NONE- *?*)), which achieves 60% F1 on our test set. Our findings are in line with Hardt's, who reports 48% F1, with the difference being due to the different sections of the Treebank used.</Paragraph>
      <Paragraph position="1"> It was observed that this search may be too restrictive to catch some examples of VPE in the corpus, and pseudo-gapping. Modifying the search pattern to be '(VP (-NONE- *?*)' instead improves the feature itself by 10% in F1 and gives the results seen in Table 7, increasing MBL's F1 by 10%, GIS-MaxEnt by 14% and L-BFGS-MaxEnt by 11.7%.</Paragraph>
      <Paragraph position="2">  feature Empty categories Finally, including empty category information completely, such that empty categories are treated as words and included in the context. Table 8 shows that adding this information results in a 4% increase in F1 for MBL, 4.9% for GIS-MaxEnt, and 2.5% for L-BFGS-MaxEnt.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Experiments with Automatically Parsed Data
</SectionTitle>
    <Paragraph position="0"> Parsed data The next set of experiments use the BNC and Treebank, but strip POS and parse information, and parse them automatically using two different parsers. This enables us to test what kind of performance is possible for real-world applications.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Parsers used
</SectionTitle>
      <Paragraph position="0"> Charniak's parser (2000) is a combination probabilistic context free grammar and maximum entropy parser. It is trained on the Penn Treebank, and achieves a 90.1% recall and precision average for sentences of 40 words or less.</Paragraph>
      <Paragraph position="1"> Robust Accurate Statistical Parsing (RASP) (Briscoe and Carroll, 2002) uses a combination of statistical techniques and a hand-crafted grammar.</Paragraph>
      <Paragraph position="2"> RASP is trained on a range of corpora, and uses a more complex tagging system (CLAWS-2), like that of the BNC. This parser, on our data, generated full parses for 70% of the sentences, partial parses for 28%, while 2% were not parsed, returning POS tags only.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Reparsing the Treebank
</SectionTitle>
      <Paragraph position="0"> The results of experiments using the two parsers (Table 9) show generally similar performance.</Paragraph>
      <Paragraph position="1"> Compared to results on the original treebank with similar data (Table 6), the results are 4-6% lower, or in the case of GIS-MaxEnt, 4% lower or 2% higher, depending on parser. This drop in performance is not surprising, given the errors introduced by the parsing process. As the parsers do not generate empty-category information, their overall results are 14-20% lower, compared to those in Table 8.</Paragraph>
      <Paragraph position="2"> The success rate for the features used (Table 10) stay the same, except for auxiliary-final VP, which is determined by parse structure, is only half as successful for RASP. Conversely, the heuristic baseline is more successful for RASP, as it relies on POS tags, which is to be expected as RASP has a more detailed tagset.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 Parsing the BNC
</SectionTitle>
      <Paragraph position="0"> Experiments using parsed versions of the BNC corpora (Table 11) show similar results to the original results (Table 1) - except L-BFGS-MaxEnt which scores 4-8% lower - meaning that the added information from the features mitigates the errors introduced in parsing. The performance of the features (Table 12) remain similar to those for the re-parsed treebank experiments.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.4 Combining BNC and Treebank data
</SectionTitle>
      <Paragraph position="0"> Combining the re-parsed BNC and Treebank data diversifies and increases the size of the test data, making conclusions drawn empirically more reliable, and the wider range of training data makes it more robust. This gives a training set of 1167 VPE's and a test set of 350 VPE's. The results in Table 13 show little change from the previous experiments.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>