File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/90/p90-1031_intro.xml
Size: 1,853 bytes
Last Modified: 2025-10-06 14:05:02
<?xml version="1.0" standalone="yes"?> <Paper uid="P90-1031"> <Title>PARSING THE LOB CORPUS</Title> <Section position="4" start_page="0" end_page="0" type="intro"> <SectionTitle> INTRODUCTION </SectionTitle> <Paragraph position="0"> We have implemented and tested a parsing system that is rapid and robust enough to apply to large bodies of unedited text. We have used our system to gather data from the Lancaster/Oslo-Bergen (LOB) corpus, generating parses which conform to a version of current Government-Binding theory, and aim to use the system to parse 25 million words of text. The system consists of an interface to the LOB corpus, a part-of-speech disambiguator, and a novel parser. The disambiguator uses multivaluedness so that, in conjunction with the parser, it performs substantially more accurately than current algorithms. The parser employs bottom-up recognition to create rules which fire top-down, enabling it to rapidly parse the constituent phrases of a larger structure that might itself be difficult to analyze. The complexity of some of the free text in the LOB corpus demands this; we have not sought to parse sentences completely, but rather to ensure that our parses are accurate. The parser output can be modified to conform to any of a number of linguistic theories.</Paragraph> <Paragraph position="1"> This paper is divided into sections discussing the LOB corpus, statistical disambiguation, the parser, and our results.</Paragraph> <Paragraph position="2"> 1 This paper reports work done at the MIT Artificial Intelligence Laboratory. Support for this research was provided in part by grants from the National Science Foundation (under a Presidential Young Investigator award to Prof. Robert C. Berwick); the Kapor Family Foundation; and the Siemens Corporation.</Paragraph> </Section> </Paper>