<?xml version="1.0" standalone="yes"?> <Paper uid="P04-1057"> <Title>Error Mining for Wide-Coverage Grammar Engineering</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> As we all know, hand-crafted linguistic descriptions such as wide-coverage grammars and large-scale dictionaries contain mistakes and are incomplete.</Paragraph> <Paragraph position="1"> In the context of parsing, people often construct sets of example sentences that the system should be able to parse correctly. If a sentence cannot be parsed, it is a clear sign that something is wrong. This technique works only insofar as the problems that might occur have been anticipated. More recently, tree-banks have become available, and we can apply the parser to the sentences of the tree-bank and compare the resulting parse trees with the gold standard. Such techniques are limited, however, because tree-banks are relatively small. This is a serious problem, because the distribution of words is Zipfian (there are very many words that occur very infrequently), and the same appears to hold for syntactic constructions.</Paragraph> <Paragraph position="2"> In this paper, an error mining technique is described which is very effective at automatically discovering systematic mistakes in a parser by using very large (but unannotated) corpora. The idea is very simple. We run the parser on a large set of sentences, and then analyze those sentences the parser cannot parse successfully. Depending on the nature of the parser, we define the notion 'successful parse' in different ways. In the experiments described here, we use the Alpino wide-coverage parser for Dutch (Bouma et al., 2001; van der Beek et al., 2002b). This parser is based on a large constructionalist HPSG for Dutch as well as a very large electronic dictionary (partly derived from CELEX, Parole, and CGN). 
The parser is robust in the sense that it essentially always produces a parse. If a full parse is not possible for a given sentence, then the parser returns a (minimal) number of parsed non-overlapping sentence parts. In the context of the present paper, a parse is called successful only if the parser finds an analysis spanning the full sentence.</Paragraph> <Paragraph position="3"> The basic idea is to compare the frequency of words and word sequences in sentences that cannot be parsed successfully with the frequency of the same words and word sequences in unproblematic sentences. As we illustrate in section 3, this technique obtains very good results if it is applied to large sets of sentences.</Paragraph> <Paragraph position="4"> To compute the frequency of word sequences of arbitrary length for very large corpora, we use a new combination of suffix arrays and perfect hash finite automata. This implementation is described in section 4.</Paragraph> <Paragraph position="5"> The error mining technique is able to discover systematic problems which lead to parsing failure.</Paragraph> <Paragraph position="6"> This includes missing, incomplete and incorrect lexical entries and grammar rules. Problems which cause the parser to assign complete but incorrect parses cannot be discovered. Therefore, tree-banks and hand-crafted sets of example sentences remain important to discover problems of the latter type.</Paragraph> </Section></Paper>