<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0622">
  <Title>Guiding a Well-Founded Parser with Corpus Statistics</Title>
  <Section position="3" start_page="0" end_page="179" type="intro">
    <SectionTitle>
2 Setup
</SectionTitle>
    <Paragraph position="0"> Our lexicon is built from two resources.</Paragraph>
    <Paragraph position="1"> COMLEX (Grishman et al., 1994) provides the syntactic and morphological information for 39,000 lemmas. WordNet (Fellbaum, 1998) provides the semantic information. In addition, we add to our lexicon approximately 47,000 &quot;multiword&quot; nouns found in WordNet.</Paragraph>
    <Paragraph position="2"> SAPIR, the parser we are using, employs a feature-based general grammar of English that has been in development at The Boeing Company over the past fifteen years (Harrison and Maxwell, 1986). The grammar consists of approximately 500 rules. By mapping COMLEX's lexical entries into a format understandable by SAPIR, we have a general-purpose, well-founded, English language parser.</Paragraph>
    <Paragraph position="3"> With such a parser, we can use Penn's Treebank the way it was probably intended: as a set of bracketing constraints for the syntactic analysis of a sentence. The Linguistic Data Consortium provides a preliminary version (1.075) of the Treebank's bracketing of the Brown Corpus (Kučera and Francis, 1967). Fortunately, SAPIR provides an interface whereby a sentence and a partially specified parse tree can be fed to the parser, so that only the syntactic analyses that conform with the provided bracketing will be pursued.</Paragraph>
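    <Paragraph> As a rough illustration of what conforming to a bracketing could mean, the sketch below treats a constituent as a (label, start, end) triple over word positions and accepts an analysis when it contains every supplied bracket, while allowing it to add constituents of its own. The representation and the function name conforms are assumptions made for this example; SAPIR's actual interface is not described here.

```python
# Assumed representation: a constituent is (label, start, end) over word
# positions. This is an illustration, not SAPIR's actual interface.

def conforms(parse_constituents, required_brackets):
    """True if every supplied bracket appears in the parse; extras are allowed."""
    return set(required_brackets).issubset(parse_constituents)


# Supplied brackets for the three words "in going home" (positions 0 to 3).
constraint = {("PP", 0, 3), ("VP", 1, 3), ("NP", 2, 3)}

# An analysis that keeps all supplied brackets and adds an NP over the VP.
with_extra_np = constraint | {("NP", 1, 3)}

# An analysis that groups the words differently and drops the VP bracket.
other_analysis = {("PP", 0, 2), ("NP", 2, 3)}

print(conforms(with_extra_np, constraint))   # True
print(conforms(other_analysis, constraint))  # False
```
</Paragraph>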
    <Paragraph position="4"> It is with this mechanism that we create our corpus. We start with the bracketed (but not part-of-speech-tagged) version of the Treebank, and process the bracketings in several ways, most notably removing quotation marks and &quot;assuaging&quot; gaps. The latter consists primarily of identifying the Treebank's sentential constructs which are considered verb phrases by SAPIR, and making that transformation. For example, a Treebank tree like (PP in (S (NP *) (VP going (NP home)))) would be mapped to (PP in (VP going (NP home))). Since this bracketing is supplied to SAPIR as a constraint, the parser is free to construct the gerundive NP containing solely the VP. In fact, the bracketing corresponding to the parse found by SAPIR is:</Paragraph>
    <Paragraph position="6"> (Note that postfix &quot; indicates a one-bar level phrase, as per X-bar theory (Jackendoff, 1977).) Approximately 30% of the Treebank bracketings are parseable, after this assuagement, by our parser. These 30% comprise our corpus.</Paragraph>
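    <Paragraph> The gap-assuaging step described above can be sketched in a few lines. This is a minimal illustration that assumes trees are held as nested lists and that the only pattern handled is an S whose subject is the empty (NP *) and whose second child is a VP. The function names parse_brackets, assuage_gaps and to_brackets are hypothetical, and the actual preprocessing may cover more cases.

```python
# Illustrative sketch only: the tree format and function names are assumptions,
# not the tooling actually used to prepare the corpus.

def parse_brackets(text):
    """Read one bracketed tree into nested lists, e.g. ['NP', 'home']."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    stack = [[]]
    for token in tokens:
        if token == "(":
            stack.append([])
        elif token == ")":
            node = stack.pop()
            stack[-1].append(node)
        else:
            stack[-1].append(token)
    return stack[0][0]


def is_empty_subject(node):
    """True for a trace-only subject such as (NP *)."""
    return isinstance(node, list) and node == ["NP", "*"]


def assuage_gaps(node):
    """Replace (S (NP *) VP) with the VP itself, recursively."""
    if not isinstance(node, list):
        return node
    label = node[0]
    children = [assuage_gaps(child) for child in node[1:]]
    if (
        label == "S"
        and len(children) == 2
        and is_empty_subject(children[0])
        and isinstance(children[1], list)
        and children[1][0] == "VP"
    ):
        return children[1]
    return [label] + children


def to_brackets(node):
    """Serialize nested lists back to a bracketed string."""
    if not isinstance(node, list):
        return node
    return "(" + " ".join(to_brackets(child) for child in node) + ")"


tree = parse_brackets("(PP in (S (NP *) (VP going (NP home))))")
print(to_brackets(assuage_gaps(tree)))  # (PP in (VP going (NP home)))
```

A sentence with an overt subject, such as (S (NP he) (VP left)), is left unchanged by this sketch.</Paragraph>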
    <Paragraph position="7"> Of course, a (parseable) bracketing does not always yield just one parse. SAPIR has some hand-coded costs on syntactic rules which have served as its preference mechanism to date.</Paragraph>
    <Paragraph position="8"> When SAPIR finds more than one parse for a given bracketing, we simply choose its most preferred one to use in our corpus. While we certainly do not feel that this is the best way to create our corpus, we would like to note that over 25% of the parseable bracketings yield a unique parse, and over 50% have just one or two possible parses. We should also note that the parseable bracketings are, of course, shorter (on average) than the unparseable ones. The average length of all of the sentences is 17.3 words, while the average for our corpus is 11.2.</Paragraph>
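    <Paragraph> The selection of a most preferred parse can be pictured with the sketch below. It assumes that a parse is represented by the list of rules it uses, that a parse's cost is the sum of its rule costs, and that lower cost means more preferred. None of these details is stated for SAPIR, and the rule names and cost values are invented for illustration.

```python
# Hypothetical rule names and costs, invented for illustration only.
RULE_COSTS = {
    "vp-gerund": 0.5,
    "np-from-vp": 2.0,
    "pp-attach-noun": 1.0,
    "pp-attach-verb": 0.5,
}


def parse_cost(rules_used, rule_costs):
    """Total cost of a parse, assumed here to be the sum of its rule costs."""
    return sum(rule_costs[rule] for rule in rules_used)


def most_preferred(parses, rule_costs):
    """Return the lowest-cost parse; ties go to the earliest parse listed."""
    return min(parses, key=lambda rules: parse_cost(rules, rule_costs))


candidates = [
    ["vp-gerund", "np-from-vp", "pp-attach-noun"],  # cost 3.5
    ["vp-gerund", "pp-attach-verb"],                # cost 1.0
]
print(most_preferred(candidates, RULE_COSTS))  # ['vp-gerund', 'pp-attach-verb']
```
</Paragraph>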
  </Section>
</Paper>