<?xml version="1.0" standalone="yes"?> <Paper uid="E93-1027"> <Title>Linguistic Knowledge Acquisition from Parsing Failures</Title> <Section position="3" start_page="0" end_page="222" type="intro"> <SectionTitle> 2 Robust Parsing and Linguistic Knowledge Acquisition </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="222" type="sub_section"> <SectionTitle> 2.1 Search Space of Possible Hypotheses </SectionTitle> <Paragraph position="0"> When a parser fails to analyse an input sentence, a robust parser hypothesizes possible errors in the input in order to complete the analysis and correct the errors [Douglas and Dale, 1992]: for example, deletion of necessary words (Ex. I have book), insertion of unnecessary words (Ex. I have a the book), disorder of words (Ex. I a book have), spelling errors (Ex. I have a bok), etc.</Paragraph> <Paragraph position="1"> As there is usually a set of possible hypotheses that complete the analysis, this error detection process becomes non-deterministic. Furthermore, allowing operations such as deletion and insertion of arbitrary sequences of words, or unrestricted permutation of word sequences, radically expands the search space. The process generates many nonsensical hypotheses unless we restrict the search space, either by heuristics-based cost functions [Mellish, 1989] or by introducing prior knowledge about regularities of errors in the form of annotated rules [Goeser, 1992].</Paragraph> <Paragraph position="2"> Our framework of knowledge acquisition from parsing failures, on the other hand, does not assume that the input contains errors; instead, it assumes that the linguistic knowledge of the system is incomplete. This means that we do not need to, or should not, allow the costly operations of changing the input, and therefore the search space explosion encountered by a robust parser does not occur.</Paragraph> <Paragraph position="3"> For example, when a string of characters that is not registered in the dictionary as a word appears, a robust parser may assume that there are spelling errors and try to identify them by changing the character string (deleting characters, adding new characters, etc.) so as to find the "closest" legitimate word in the dictionary. This is because the dictionary is assumed to be complete, i.e. to contain all lexical items that will appear. We, on the other hand, simply hypothesize that the string of characters is a word which should be registered in the dictionary, together with the lexical properties that are compatible with the surrounding syntactic/semantic context of the input.</Paragraph> <Paragraph position="4"> Table 1 shows the different types of hypotheses produced by a robust parser and by a program for knowledge acquisition from parsing failures.</Paragraph> <Paragraph position="5"> Table 1: Types of hypotheses produced for each type of parsing failure.
Remaining constituents to be collected. Robust parsing: hypotheses of deletion of necessary words, insertion of unnecessary words, or disorder of words. Knowledge acquisition: hypotheses of lack of necessary rules.
Failure of application of an existing rule. Robust parsing: relaxation of feature agreements. Knowledge acquisition: identification of disagreeing features.
Unrecognized sequence of characters. Robust parsing: hypotheses of spelling errors. Knowledge acquisition: hypotheses of new words.</Paragraph>
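<Paragraph position="6"> The following Python sketch merely illustrates the contrast just described and summarized in Table 1; it is not the authors' implementation. The toy dictionary, the function names, and the use of string similarity (difflib) as a stand-in for a genuine spelling-error model are assumptions made for the example: robust parsing maps an unrecognized token onto the closest existing dictionary word, whereas knowledge acquisition hypothesizes a new lexical entry whose categories come from the surrounding context.

# Illustrative sketch only (not the authors' system): two reactions to an
# unrecognized character sequence, as contrasted in the text above.

from difflib import SequenceMatcher

# Toy dictionary: word to set of known syntactic categories (assumption).
DICTIONARY = {
    "I": {"pron"},
    "have": {"verb"},
    "a": {"det"},
    "book": {"noun"},
}

def robust_parsing_hypothesis(unknown, dictionary=DICTIONARY):
    """Robust parsing: assume a spelling error in the input and return the
    'closest' legitimate word in the dictionary (string similarity is used
    here in place of a real spelling-error model)."""
    return max(dictionary, key=lambda w: SequenceMatcher(None, unknown, w).ratio())

def acquisition_hypothesis(unknown, context_categories):
    """Knowledge acquisition: assume the input is legitimate and hypothesize a
    new lexical entry whose categories are those compatible with the
    surrounding syntactic/semantic context (passed in precomputed here)."""
    return {"word": unknown, "categories": set(context_categories)}

if __name__ == "__main__":
    # Input "I have a bok": robust parsing repairs the input ...
    print(robust_parsing_hypothesis("bok"))          # prints 'book'
    # ... while knowledge acquisition extends the dictionary instead.
    print(acquisition_hypothesis("bok", {"noun"}))   # new entry for 'bok'
</Paragraph>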
<Paragraph position="7"> Although the assumption that the input is legitimate significantly reduces the size of the search space, the assumption of incomplete linguistic knowledge introduces another type of non-determinism and a potentially very large search space. For example, even if a word is registered in the dictionary as a noun, it can in theory have arbitrary parts of speech such as verb, adjective, adverb, etc., as there is no guarantee that the current dictionary exhausts all possible usages of the word. A naive method would therefore end up with an explosion of hypotheses.</Paragraph> </Section> <Section position="2" start_page="222" end_page="222" type="sub_section"> <SectionTitle> 2.2 Corpus-based Knowledge Acquisition </SectionTitle> <Paragraph position="0"> Apart from the differences in the types of hypotheses, an essential difference lies in the very nature of the errors in the two paradigms. While errors in ill-formed input, by definition, are not supposed to show any significant regularity, incompleteness (or "linguistic knowledge errors") is supposed to be observed recurrently in a corpus.</Paragraph> <Paragraph position="1"> From the practical viewpoint of adapting knowledge to a new application domain, disparities between existing knowledge and actual language usage that are manifested only rarely in a sample corpus of reasonable size are less significant than those observed recurrently. Furthermore, unlike robust parsing, we do not need to identify the causes of parsing failures at the time of parsing. That is, though there is in general a set of hypotheses that equally explain the parsing failures of single sentences, we can choose the most plausible ones by observing statistical properties (for example, frequencies) of the same hypotheses generated in the analysis of a whole corpus. This is a reasonable approach, as significant disparities between knowledge and actual usage are supposed to be observed recurrently.</Paragraph> <Paragraph position="2"> One of the crucial differences between the two paradigms, therefore, is that unlike robust parsing, we need not narrow the number of hypotheses down to one by using heuristics based on cues inside single sentences. Multiple hypotheses are not seriously damaging, though it is desirable for them to be reasonably restricted. The final decision is made by observing the hypotheses generated from the analysis of a whole corpus.</Paragraph> </Section> </Section> </Paper>