<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1042">
  <Title>Error mining in parsing results</Title>
  <Section position="4" start_page="0" end_page="330" type="intro">
    <SectionTitle>
2 Principles
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="329" type="sub_section">
      <SectionTitle>
2.1 General idea
</SectionTitle>
      <Paragraph position="0"> The idea we implemented is inspired from (van Noord, 2004). In order to identify missing and erroneous information in a parsing system, one can analyze a large corpus and study with statistical tools what differentiates sentences for which parsing succeeded from sentences for which it failed.</Paragraph>
      <Paragraph position="1"> The simplest application of this idea is to look for forms, called suspicious forms, that are found more frequently in sentences that could not be parsed. This is what van Noord (2004) does, without trying to identify a suspicious form in any sentence whose parsing failed, and thus without taking into account the fact that there is (at least) one cause of error in each unparsable sentence.1 On the contrary, we will look, in each sentence on which parsing failed, for the form that has the highest probability of being the cause of this failure: it is the main suspect of the sentence.</Paragraph>
      <Paragraph position="2"> This form may be incorrectly or only partially described in the lexicon, it may take part in constructions that are not described in the grammar, or it may exemplify imperfections of the pre-syntactic processing chain. This idea can be easily extended to sequences of forms, which is what we do by tak- null ing form bigrams into account, but also to lemmas (or sequences of lemmas).</Paragraph>
    </Section>
    <Section position="2" start_page="329" end_page="330" type="sub_section">
      <SectionTitle>
2.2 Form-level probabilistic model
</SectionTitle>
      <Paragraph position="0"> We suppose that the corpus is split in sentences, sentences being segmented in forms. We denote by si the i-th sentence. We denote by oi,j, (1 [?] j [?] |si|) the occurrences of forms that constitute si, and by F(oi,j) the corresponding forms. Finally, we call error the function that associates to each sentence si either 1, if si's parsing failed, and</Paragraph>
      <Paragraph position="2"> Let Of be the set of the occurrences of a form f in the corpus: Of = {oi,j|F(oi,j) = f}. The number of occurrences of f in the corpus is therefore |Of|.</Paragraph>
      <Paragraph position="3"> Let us define at first the mean global suspicion rate S, that is the mean probability that a given occurrence of a form be the cause of a parsing failure. We make the assumption that the failure of the parsing of a sentence has a unique cause (here, a unique form. . . ). This assumption, which is not necessarily exactly verified, simplifies the model and leads to good results. If we call occtotal the total amount of forms in the corpus, we have then:</Paragraph>
      <Paragraph position="5"> Let f be a form, that occurs as the j-th form of sentence si, which means that F(oi,j) = f. Let us assume that si's parsing failed: error(si) = 1. We call suspicion rate of the j-th form oi,j of sentence si the probability, denoted by Si,j, that the occurrence oi,j of form form f be the cause of the si's parsing failure. If, on the contrary, si's parsing succeeded, its occurrences have a suspicion rate that is equal to zero.</Paragraph>
      <Paragraph position="6"> We then define the mean suspicion rate Sf of a form f as the mean of all suspicion rates of its occurrences:</Paragraph>
      <Paragraph position="8"> To compute these rates, we use a fix-point algorithm by iterating a certain amount of times the following computations. Let us assume that we just completed the n-th iteration: we know, for each sentence si, and for each occurrence oi,j of this sentence, the estimation of its suspicion rate Si,j as computed by the n-th iteration, estimation that is denoted by S(n)i,j . From this estimation, we compute the n + 1-th estimation of the mean suspicion rate of each form f, denoted by S(n+1)f :</Paragraph>
      <Paragraph position="10"> This rate2 allows us to compute a new estimation of the suspicion rate of all occurrences, by giving to each occurrence if a sentence si a suspicion rate S(n+1)i,j that is exactly the estimation S(n+1)f of the mean suspicion rate of Sf of the corresponding form, and then to perform a sentence-level normalization. Thus:</Paragraph>
      <Paragraph position="12"> At this point, the n+1-th iteration is completed, and we can resume again these computations, until convergence on a fix-point. To begin the whole process, we just say, for an occurrence oi,j of sentence si, that S(0)i,j = error(si)/|si|. This means that for a non-parsable sentence, we start from a baseline where all of its occurrences have an equal probability of being the cause of the failure.</Paragraph>
      <Paragraph position="13"> After a few dozens of iterations, we get stabilized estimations of the mean suspicion rate each form, which allows: * to identify the forms that most probably cause errors, * for each form f, to identify non-parsable sentences si where an occurrence oi,j [?] Of of f is a main suspect and where oi,j has a very 2We also performed experiment in which Sf was estimated by an other estimator, namely the smoothed mean suspicion rate, denoted by ~S(n)f , that takes into account the number of occurrences of f. Indeed, the confidence we can have in the estimation S(n)f is lower if the number of occurrences of f is lower. Hence the idea to smooth S(n)f by replacing it with a weighted mean ~S(n)f between S(n)f and S, where the weights l and 1 [?] l depend on |Of|: if |Of |is high, ~S(n)f will be close from S(n)f ; if it is low, it will be closer from S:</Paragraph>
      <Paragraph position="15"> In these experiments, we used the smoothing function l(|Of|) = 1 [?] e[?]b|Of |with b = 0.1. But this model, used with the ranking according to Mf = Sf * ln|Of |(see below), leads results that are very similar to those obtained without smoothing. Therefore, we describe the smoothingless model, which has the advantage not to use an empirically chosen smoothing function.</Paragraph>
      <Paragraph position="16">  high suspicion rate among all occurrences of form f.</Paragraph>
      <Paragraph position="17"> We implemented this algorithm as a perl script, with strong optimizations of data structures so as to reduce memory and time usage. In particular, form-level structures are shared between sentences. null</Paragraph>
    </Section>
    <Section position="3" start_page="330" end_page="330" type="sub_section">
      <SectionTitle>
2.3 Extensions of the model
</SectionTitle>
      <Paragraph position="0"> This model gives already very good results, as we shall see in section 4. However, it can be extended in different ways, some of which we already implemented. null First of all, it is possible not to stick to forms. Indeed, we do not only work on forms, but on couples made out of a form (a lexical entry) and one or several token(s) that correspond to this form in the raw text (a token is a portion of text delimited by spaces or punctuation tokens).</Paragraph>
      <Paragraph position="1"> Moreover, one can look for the cause of the failure of the parsing of a sentence not only in the presence of a form in this sentence, but also in the presence of a bigram3 of forms. To perform this, one just needs to extend the notions of form and occurrence, by saying that a (generalized) form is a unigram or a bigram of forms, and that a (generalized) occurrence is an occurrence of a generalized form, i.e., an occurrence of a unigram or a bigram of forms. The results we present in section 4 includes this extension, as well as the previous one.</Paragraph>
      <Paragraph position="2"> Another possible generalization would be to take into account facts about the sentence that are not simultaneous (such as form unigrams and form bigrams) but mutually exclusive, and that must therefore be probabilized as well. We have not yet implemented such a mechanism, but it would be very interesting, because it would allow to go beyond forms or n-grams of forms, and to manipulate also lemmas (since a given form has usually several possible lemmas).</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>