<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1079">
  <Title>Best Analysis Selection in Inflectional Languages</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Ambiguity on all levels of representation is an inherent property of natural languages and a central problem of natural language parsing. One consequence is the high number of possible parser outputs, which are usually represented as labeled trees. The average number of parse trees per input sentence depends strongly on the underlying grammar and hence on the language. Some natural language grammars produce at most hundreds or thousands of parse trees, but there are also highly ambiguous grammar systems that produce an enormous number of results. For example, a grammar extracted from the Penn Treebank and tested on a set of sentences randomly generated from a probabilistic version of the grammar has on average 7.2 × 10^27 parses per sentence (Moore, 2000). Such a mammoth number of results is no exception in the parsing of Czech either (Smrž and Horák, 2000) (see Fig. 1, which relates the resulting analyses to the number of words in the input sentence), owing to free word order and the rich morphology of word forms, whose grammatical case often cannot be determined unambiguously.</Paragraph>
    <Paragraph position="1"> A traditional solution to these problems is offered by probabilistic parsing techniques (Bunt and Nijholt, 2000), which aim to find the most probable parse of a given input sentence. This methodology is usually based on the relative frequencies of the possible relations in a representative corpus. &quot;Best&quot; trees are judged by a probabilistic figure of merit (FOM).</Paragraph>
    <Paragraph position="2"> The term &quot;figure of merit&quot; usually refers to a function used to prune implausible partial analyses during parsing. In this paper, we instead take the figure of merit to be a measure bounding the true probabilities of the complete parses.</Paragraph>
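    <Paragraph> The relative-frequency idea described above can be illustrated with a minimal Python sketch. It is not the figure of merit proposed in this paper: it assumes a plain PCFG-style score, in which each rule probability is estimated as count(rule) / count(parent) and a complete parse is scored by the product of its rule probabilities. The toy treebank and the names relative_frequencies, figure_of_merit and best_parse are hypothetical and serve only as illustration.

```python
from collections import Counter
from math import prod

# Hypothetical toy "treebank": each parse is listed as the grammar rules
# (parent, children) it uses. Real corpora are of course far larger.
toy_treebank = [
    [("S", ("NP", "VP")), ("VP", ("V", "NP"))],
    [("S", ("NP", "VP")), ("VP", ("V", "NP", "PP"))],
    [("S", ("NP", "VP")), ("VP", ("V", "NP"))],
]


def relative_frequencies(treebank):
    """Estimate P(rule | parent) as count(rule) / count(parent)."""
    rule_counts = Counter(rule for tree in treebank for rule in tree)
    parent_counts = Counter(parent for tree in treebank for parent, _ in tree)
    return {rule: n / parent_counts[rule[0]] for rule, n in rule_counts.items()}


def figure_of_merit(parse, probs):
    """Score a complete parse as the product of its rule probabilities.

    Rules never seen in the treebank get probability 0.0 (no smoothing).
    """
    return prod(probs.get(rule, 0.0) for rule in parse)


def best_parse(candidates, probs):
    """Return the candidate parse with the highest score."""
    return max(candidates, key=lambda parse: figure_of_merit(parse, probs))


probs = relative_frequencies(toy_treebank)
candidates = [toy_treebank[1], toy_treebank[0]]
print(best_parse(candidates, probs))
```

    Such a score depends only on rule counts and is therefore independent of the language being parsed. A practical system would also need smoothing for unseen rules.</Paragraph>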
    <Paragraph position="3"> The standard methods of best analysis selection (Caraballo and Charniak, 1998) usually use simple stochastic functions that are independent of the peculiarities of the underlying language. This approach seems to work satisfactorily for analytic languages. The obstacles that synthetic languages pose for these simple statistical techniques, on the other hand, are far from negligible.</Paragraph>
    <Paragraph position="4"> Therefore, we try to improve the standard FOMs by taking into consideration specific features of free word order languages. The following text discusses the strengths of three figures of merit that reflect selected phenomena of the Czech language.</Paragraph>
  </Section>
</Paper>