<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1042">
  <Title>Error mining in parsing results</Title>
  <Section position="5" start_page="330" end_page="332" type="metho">
    <SectionTitle>
3 Experiments
</SectionTitle>
    <Paragraph position="0"> In order to validate our approach, we applied these principles to look for error causes in the parsing results produced by two deep parsing systems for French, FRMG and SXLFG, on large corpora.</Paragraph>
    <Paragraph position="1"> (Footnote 3) One could generalize this to n-grams, but as n gets higher, the number of occurrences of each n-gram gets lower, leading to non-significant statistics.</Paragraph>
    <Section position="1" start_page="330" end_page="331" type="sub_section">
      <SectionTitle>
3.1 Parsers
</SectionTitle>
      <Paragraph position="0"> Both parsing systems we used are based on deep non-probabilistic parsers. They share: * the Lefff 2 syntactic lexicon for French (Sagot et al., 2005), which contains 500,000 entries (representing 400,000 different forms); each lexical entry contains morphological information, sub-categorization frames (when relevant), and complementary syntactic information, in particular for verbal forms (controls, attributives, impersonals, . . . ); * the SXPipe pre-syntactic processing chain (Sagot and Boullier, 2005), which converts a raw text into a sequence of DAGs of forms that are present in the Lefff; SXPipe contains, among other modules, a sentence-level segmenter, a tokenization and spelling-error correction module, named-entity recognizers, and a non-deterministic multi-word identifier.</Paragraph>
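The "DAGs of forms" produced by SXPipe can be pictured with a minimal sketch; the representation below is hypothetical (not SXPipe's actual format) and only illustrates how a non-deterministic multi-word identifier yields several readings of the same token span:

```python
# A sentence as a DAG of forms (sketch of SXPipe-style output):
# nodes are inter-token positions, edges carry candidate forms.
# "pomme de terre" may be one multi-word form or three simple forms.
# (Toy representation -- hypothetical, not SXPipe's actual format.)
dag = {
    0: [("pomme", 1), ("pomme de terre", 3)],  # ambiguity: simple vs multi-word
    1: [("de", 2)],
    2: [("terre", 3)],
    3: [],                                      # final node
}

def paths(node=0, prefix=()):
    """Enumerate all readings (sequences of forms) encoded by the DAG."""
    if not dag[node]:
        yield prefix
    for form, nxt in dag[node]:
        yield from paths(nxt, prefix + (form,))

for reading in paths():
    print(reading)   # one line per alternative reading
```

Each path through the DAG is one candidate segmentation that the downstream parser may try.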
      <Paragraph position="1"> But FRMG and SXLFG use completely different parsers, which rely on different formalisms, different grammars and different parser builders. Therefore, comparing the error mining results on the output of these two systems makes it possible to distinguish errors coming from the Lefff or from SXPipe from those coming from one grammar or the other. Let us describe the characteristics of these two parsers in more detail.</Paragraph>
      <Paragraph position="2"> The FRMG parser (Thomasset and Villemonte de la Clergerie, 2005) is based on a compact TAG for French that is automatically generated from a meta-grammar. The compilation and execution of the parser is performed in the framework of the DYALOG system (Villemonte de la Clergerie, 2005).</Paragraph>
      <Paragraph position="3"> The SXLFG parser (Boullier and Sagot, 2005b; Boullier and Sagot, 2005a) is an efficient and robust LFG parser. Parsing is performed in two steps. First, an Earley-like parser builds a shared forest that represents all constituent structures that satisfy the context-free skeleton of the grammar.</Paragraph>
      <Paragraph position="4"> Then functional structures are built, in one or more bottom-up passes. Parsing efficiency is achieved thanks to several techniques, such as compact data representation, systematic use of structure and computation sharing, lazy evaluation, and heuristic, almost non-destructive pruning during parsing. Both parsers also implement advanced error recovery and tolerance techniques, but these were useless for the experiments described here, since we only want to distinguish sentences that receive a full parse (without any recovery technique) from those that do not.</Paragraph>
    </Section>
    <Section position="2" start_page="331" end_page="331" type="sub_section">
      <SectionTitle>
3.2 Corpora
</SectionTitle>
      <Paragraph position="0"> We parsed the following corpora with these two systems: MD corpus : This corpus consists of 14.5 million words (570,000 sentences) of general journalistic text, namely articles from the Monde diplomatique.</Paragraph>
      <Paragraph position="1"> EASy corpus : This is the 40,000-sentence corpus that was built for the EASy parsing evaluation campaign for French (Paroubek et al., 2005). We only used the raw corpus (without taking into account the fact that a manual parse is available for 10% of all sentences). The EASy corpus contains several sub-corpora of varied styles: journalistic, literary, legal, medical, transcribed speech, e-mail, questions, etc.</Paragraph>
      <Paragraph position="2"> Both corpora are raw, in the sense that no cleaning whatsoever has been performed to eliminate character sequences that cannot really be considered sentences.</Paragraph>
      <Paragraph position="3"> Table 1 gives some general information on these corpora, as well as the results we obtained with both parsing systems. It should be noted that the two parsers did not parse exactly the same set (or number) of sentences for the MD corpus, and that they do not define the notion of sentence in exactly the same way.</Paragraph>
    </Section>
    <Section position="3" start_page="331" end_page="332" type="sub_section">
      <SectionTitle>
3.3 Results visualization environment
</SectionTitle>
      <Paragraph position="0"> We developed a visualization tool for the results of the error mining, which allows the user to examine and annotate them. It has the form of an HTML page that uses dynamic generation methods, in particular JavaScript. An example is shown in Figure 1.</Paragraph>
      <Paragraph position="1"> To achieve this, suspicious forms are ranked according to a measure Mf that models, for a given form f, the expected benefit of trying to correct the (potential) corresponding error in the resources. A user who wants to concentrate on almost certain errors, rather than on the most frequent ones, can visualize suspicious forms ranked according to Mf = Sf. On the contrary, a user who wants to concentrate on the most frequent potential errors, rather than on the confidence that the algorithm has assigned to errors, can visualize suspicious forms ranked according to Mf = Sf·|Of| (see footnote 4). The default choice, which is adopted to produce all tables shown in this paper, is a balance between these two possibilities, and ranks suspicious forms according to</Paragraph>
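The two explicit ranking options can be sketched as follows; this is a minimal illustration assuming the suspicion rates Sf and occurrence counts |Of| have already been computed by the iterative error-mining algorithm (the form names and values below are hypothetical toy data):

```python
# Rank suspicious forms by a measure Mf, given precomputed
# suspicion rates Sf and occurrence counts |Of| (toy values, hypothetical).
suspicion = {"formA": 0.92, "formB": 0.40, "formC": 0.75}    # Sf
occurrences = {"formA": 3, "formB": 500, "formC": 40}        # |Of|

def rank(measure):
    """Return forms sorted by decreasing measure Mf."""
    return sorted(suspicion, key=measure, reverse=True)

# "Almost certain errors first": Mf = Sf
by_confidence = rank(lambda f: suspicion[f])

# "Most frequent potential errors first": Mf = Sf * |Of|
by_frequency = rank(lambda f: suspicion[f] * occurrences[f])

print(by_confidence)   # formA first: highest suspicion rate
print(by_frequency)    # formB first: many suspicious occurrences
```

The default measure mentioned in the text is a compromise between these two extremes; its exact formula is not reproduced here since it is missing from this extract.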
      <Paragraph position="3"> The visualization environment allows the user to browse through (ranked) suspicious forms in a scrolling list on the left part of the page (A). When the suspicious form is associated with a token that is identical to the form, only the form is shown. Otherwise, the token is separated from the form by the symbol &amp;quot; / &amp;quot;. The right part of the page shows various pieces of information about the currently selected form. After its rank according to the chosen ranking measure Mf (B), a field is available to add or edit an annotation associated with the suspicious form (D). These annotations, aimed at easing the analysis of the error mining results by linguists and by the developers of parsers and resources (lexica, grammars), are saved in a database (SQLITE). Statistical information is also given about f (E), including its number of occurrences occf, the number of occurrences of f in non-parsable sentences, the final estimate of its mean suspicion rate Sf, and the rate err(f) of non-parsable sentences among those in which f appears. These indications are complemented by a brief summary of the iterative process that shows the convergence of the successive estimates of Sf. The lower part of the page gives a means of identifying the cause of f-related errors by showing (Footnote 4: Let f be a form. The suspicion rate Sf can be considered as the probability that a particular occurrence of f causes a parsing error; therefore, Sf·|Of| models the number of occurrences of f that do cause a parsing error.)</Paragraph>
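The per-form statistics displayed in panel (E) can be computed directly from the parsing results; below is a small sketch under the assumption that each sentence is represented as a list of forms paired with a parse-success flag (names and data are hypothetical):

```python
# Compute occ_f, the number of occurrences of f in non-parsable
# sentences, and err(f): the rate of non-parsable sentences among
# those containing f. Toy corpus of (forms, parsed?) pairs -- hypothetical.
corpus = [
    (["le", "chat", "dort"], True),
    (["le", "chat", "formA"], False),
    (["formA", "dort"], False),
    (["le", "formA", "chat"], True),
]

def stats(form):
    occ = sum(s.count(form) for s, _ in corpus)      # occ_f
    with_f = [ok for s, ok in corpus if form in s]   # sentences containing f
    failed = with_f.count(False)                     # non-parsable among them
    err = failed / len(with_f)                       # err(f)
    return occ, failed, err

occ, failed, err = stats("formA")
print(occ, failed, round(err, 2))   # 3 occurrences, 2 failed sentences, err = 0.67
```

Note that err(f) is a raw corpus statistic, whereas Sf is the iteratively re-estimated suspicion rate; the tool shows both so that their divergence can be inspected.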
      <Paragraph position="4">  f's entries in the Lefff lexicon (G), as well as non-parsable sentences in which f is the main suspect and in which one of its occurrences has a particularly high suspicion rate (see footnote 5) (H).</Paragraph>
      <Paragraph position="5"> The whole page (with annotations) can be sent by e-mail, for example to the developer of the lexicon or to the developer of one of the parsers (C).</Paragraph>
    </Section>
  </Section>
</Paper>