<?xml version="1.0" standalone="yes"?> <Paper uid="W06-2305"> <Title>Robust Parsing: More with Less</Title> <Section position="2" start_page="0" end_page="26" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Traditionally, broad coverage has been considered a desirable property of a grammar: the more linguistic phenomena are treated properly by the grammar, the better the results that can be expected when applying it to unrestricted text (cf. Grover et al., 1993; Doran et al., 1994). With the advent of empirical methods and the corresponding evaluation metrics, however, this view changed considerably. Abney (1996) was among the first to note that the relationship between coverage and statistical parsing quality is more complex. Adding new rules to the grammar, i.e. increasing its coverage, not only allows the parser to deal with more phenomena, and hence more sentences; it also opens up new possibilities for abusing the newly introduced rules to mis-analyse constructions which were already treated properly before. As a consequence, a net reduction in parsing quality might be observed for simple statistical reasons, since the gain is usually obtained for relatively rare phenomena, while the adverse effects might well affect frequent ones.</Paragraph> <Paragraph position="1"> Abney (1996) uses this observation to argue in favour of stochastic models which attempt to choose the optimal structural interpretation instead of only providing a list of equally probable alternatives. However, using such an optimization procedure is not necessarily sufficient to rule out the effect completely.</Paragraph> <Paragraph position="2"> Compared to traditional handwritten grammars, successful stochastic models like those of Collins (1999) and Charniak (2000) open up an even greater space of alternatives for the parser and accordingly offer many opportunities to construct odd structural descriptions from them. 
Whether the guidance of the stochastic model can really prevent the parser from making use of these unwanted opportunities remains unclear so far.</Paragraph> <Paragraph position="3"> In the following we make a first attempt to quantify the consequences that different degrees of coverage have for the output quality of a wide-coverage parser. For this purpose we use a Weighted Constraint Dependency Grammar (WCDG), which covers even relatively rare syntactic phenomena of German and performs reliably across a wide variety of different text genres (Foth et al., 2005). By combining hand-written rules with an optimization procedure for hypothesis selection, such a parser makes it possible to successively exclude certain rare phenomena from the coverage of the grammar and to study the impact of these modifications on its output quality. 2 Some rare phenomena of German What are good candidates for 'rare' phenomena that might be intentionally removed from the coverage of our grammar? One possibility is to remove coverage for constructions that are already slightly dispreferred. For instance, apposition and coordination of noun phrases often violate the principle of projectivity: &quot;I got a sled for Christmas, a parrot and a motor-bike.&quot; This is quite a common construction, but still 'rare' in the sense that the great majority of appositions do respect projectivity, so that the example seems at least slightly unusual. But there are also syntactic relations that are quite rare but nevertheless appear perfectly normal when they do occur, such as direct appellations: &quot;James, please open the door.&quot; This might be because their frequency varies considerably between text types; everyone is familiar with personal appellation from everyday conversation, but it would be surprising to hear it from the mouth of a television news reader.</Paragraph> <Paragraph position="4"> Finally, some constructions form variants, e.g. 
by omitting certain words: &quot;I bought a new broom [in order] to clean the driveway.&quot; Here the longer variant is unambiguously a subclause expressing purpose, while the shorter might be mistaken for a prepositional phrase, so it could be regarded as misleading for the parser.</Paragraph> <Paragraph position="5"> The selection is necessarily subjective, not only because the delimitation of a phenomenon is subjective (are all kinds of ellipsis fundamentally the same phenomenon or not?) but also because we can remove only those phenomena that are already covered in the first place. Therefore we have selected phenomena * that were explicitly added to the grammar at some point in order to deal with actually occurring unforeseen constructions, * that can easily be removed from the grammar without affecting other phenomena, * and that are relatively rare in all the texts we have investigated.</Paragraph> <Paragraph position="6"> Table 1 shows the 21 phenomena that we consider in this paper. (Note that the three earlier example sentences correspond to lines 1, 4, and 10 in this table, but that not all lines have exact counterparts in English.) The last column gives the overall frequency per 1,000 sentences of each phenomenon when measured across all trees in our collection.</Paragraph> <Paragraph position="7"> The collection contains sections of Bible text (Genesis 1-50), law text (the constitutions of Federal Germany and of the European Union), online technical newscasts (www.heise.de), novel text, and sentences from the NEGRA corpus of newspaper articles. Table 2 shows the sentence counts of the different sections and the frequency per 1,000 sentences of all 21 phenomena in each text type. 
It can be seen that most of the constructions remain quite rare overall, but their frequency often depends heavily on the text type, so the choice of corpus can be expected to have a strong influence on our experiments.</Paragraph> </Section> </Paper>