<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1042">
  <Title>Error mining in parsing results</Title>
  <Section position="6" start_page="332" end_page="335" type="evalu">
    <SectionTitle>
4 Results
</SectionTitle>
    <Paragraph position="0"> In this section, we mostly focus on the results of our error mining algorithm on the parsing results provided by SXLFG on the MD corpus. We first present results when only forms are taken into account, and then give an insight on results when both forms and form bigrams are considered.</Paragraph>
    <Paragraph position="1"> 5Such an information, which is extremely valuable for the developers of the resources, can not be obtained by global (form-level and not occurrence-level) approaches such as the err(f)-based approach of (van Noord, 2004). Indeed, enumerating all sentences which include a given form f, and which did not receive a full parse, is not precise enough: it would show at the same time sentences wich fail because of f (e.g., because its lexical entry lacks a given sub-categorization frame) and sentences which fail for an other independent reason.</Paragraph>
    <Section position="1" start_page="332" end_page="333" type="sub_section">
      <SectionTitle>
4.1 Finding suspicious forms
</SectionTitle>
      <Paragraph position="0"> The execution of our error mining script on MD/SXLFG, with imax = 50 iterations and when only (isolated) forms are taken into account, takes less than one hour on a 3.2 GHz PC running Linux with a 1.5 Go RAM. It outputs 18,334 relevant suspicious forms (out of the 327,785 possible ones), where a relevant suspicious form is defined as a form f that satisfies the following arbitrary constraints:6 S(imax)f &gt; 1,5 * S and |Of |&gt; 5.</Paragraph>
      <Paragraph position="1"> We still can not prove theoretically the convergence of the algorithm.7 But among the 1000 best-ranked forms, the last iteration induces a mean variation of the suspicion rate that is less than 0.01%.</Paragraph>
      <Paragraph position="2"> On a smaller corpus like the EASy corpus, 200 iterations take 260s. The algorithm outputs less than 3,000 relevant suspicious forms (out of the 61,125 possible ones). Convergence information 6These constraints filter results, but all forms are taken into account during all iterations of the algorithm. 7However, the algorithms shares many common points with iterative algorithm that are known to converge and that have been proposed to find maximum entropy probability distributions under a set of constraints (Berger et al., 1996). Such an algorithm is compared to ours later on in this paper.  is the same as what has been said above for the MD corpus.</Paragraph>
      <Paragraph position="3"> Table 2 gives an idea of the repartition of suspicious forms w.r.t. their frequency (for FRMG on MD), showing that rare forms have a greater probability to be suspicious. The most frequent suspicious form is the double-quote, with (only) Sf = 9%, partly because of segmentation problems.</Paragraph>
    </Section>
    <Section position="2" start_page="333" end_page="333" type="sub_section">
      <SectionTitle>
4.2 Analyzing results
</SectionTitle>
      <Paragraph position="0"> Table 3 gives an insight on the output of our algorithm on parsing results obtained by SXLFG on the MD corpus. For each form f (in fact, for each couple of the form (token,form)), this table displays its suspicion rate and its number of occurrences, as well as the rate err(f) of non-parsable sentences among those where f appears and a short manual analysis of the underlying error.</Paragraph>
      <Paragraph position="1"> In fact, a more in-depth manual analysis of the results shows that they are very good: errors are correctly identified, that can be associated with four error sources: (1) the Lefff lexicon, (2) the SXPipe pre-syntactic processing chain, (3) imperfections of the grammar, but also (4) problems related to the corpus itself (and to the fact that it is a raw corpus, with meta-data and typographic noise).</Paragraph>
      <Paragraph position="2"> On the EASy corpus, results are also relevant, but sometimes more difficult to interpret, because of the relative small size of the corpus and because of its heterogeneity. In particular, it contains e-mail and oral transcriptions sub-corpora that introduce a lot of noise. Segmentation problems (caused both by SXPipe and by the corpus itself, which is already segmented) play an especially important role.</Paragraph>
    </Section>
    <Section position="3" start_page="333" end_page="333" type="sub_section">
      <SectionTitle>
4.3 Comparing results with results of other
algorithms
</SectionTitle>
      <Paragraph position="0"> In order to validate our approach, we compared our results with results given by two other relevant algorithms: * van Noord's (van Noord, 2004) (form-level and non-iterative) evaluation of err(f) (the rate of non-parsable sentences among sentences containing the form f), * a standard (occurrence-level and iterative) maximum entropy evaluation of each form's contribution to the success or the failure of a sentence (we used the MEGAM package (Daume III, 2004)).</Paragraph>
      <Paragraph position="1"> As done for our algorithm, we do not rank forms directly according to the suspicion rate Sf computed by these algorithms. Instead, we use the Mf measure presented above (Mf = Sf *ln|Of|). Using directly van Noord's measure selects as most suspicious words very rare words, which shows the importance of a good balance between suspicion rate and frequency (as noted by (van Noord, 2004) in the discussion of his results). This remark applies to the maximum entropy measure as well. Table 4 shows for all algorithms the 10 best-ranked suspicious forms, complemented by a manual evaluation of their relevance. One clearly sees that our approach leads to the best results. Van Noord's technique has been initially designed to find errors in resources that already ensured a very high coverage. On our systems, whose development is less advanced, this technique ranks as most suspicious forms those which are simply the most frequent ones. It seems to be the case for the standard maximum entropy algorithm, thus showing the importance to take into account the fact that there is at least one cause of error in any sentence whose parsing failed, not only to identify a main suspicious form in each sentence, but also to get relevant global results.</Paragraph>
    </Section>
    <Section position="4" start_page="333" end_page="333" type="sub_section">
      <SectionTitle>
4.4 Comparing results for both parsers
</SectionTitle>
      <Paragraph position="0"> We complemented the separated study of error mining results on the output of both parsers by an analysis of merged results. We computed for each form the harmonic mean of both measures</Paragraph>
      <Paragraph position="2"> tem. Results (not shown here) are very interesting, because they identify errors that come mostly from resources that are shared by both systems (the Lefff lexicon and the pre-syntactic processing chain SXPipe). Although some errors come from common lacks of coverage in both grammars, it is nevertheless a very efficient mean to get a first repartition between error sources.</Paragraph>
    </Section>
    <Section position="5" start_page="333" end_page="335" type="sub_section">
      <SectionTitle>
4.5 Introducing form bigrams
</SectionTitle>
      <Paragraph position="0"> As said before, we also performed experiments where not only forms but also form bigrams are treated as potential causes of errors. This approach allows to identify situations where a form is not in itself a relevant cause of error, but leads often to a parse failure when immediately followed or preceded by an other form.</Paragraph>
      <Paragraph position="1"> Table 5 shows best-ranked form bigrams (forms that are ranked in-between are not shown, to em- null Sf * ln|Of|). These results have been computed on a subset of the MD corpus (60,000 sentences).  phasize bigram results), with the same data as in table 3.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>