<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1159">
  <Title>Extending A Broad-Coverage Parser for a General NLP Toolkit</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> As natural language processing (NLP) techniques improve, the range of domains for NLP systems, especially systems that handle speech input, is growing rapidly. However, most computer programmers do not have enough linguistic knowledge to develop NLP systems.</Paragraph>
    <Paragraph position="1"> There is a genuine demand for a general toolkit from which programmers with no linguistic knowledge can rapidly build NLP systems that handle domain-specific problems accurately (Alam, 2000). The toolkit will allow programmers to generate natural language front ends for new and existing applications using, for example, a program-through-example method. In this methodology, the programmer specifies a set of sample input sentences or a domain corpus for each task.</Paragraph>
    <Paragraph position="2"> The toolkit will then organize the sentences by similarity and generate a large set of syntactic variations of a given sentence. It will also generate the code that takes a user's natural language request and executes a command on an application. Currently this is an active research area, and the Advanced Technology Program (ATP) of the National Institute of Standards and Technology (NIST) is funding part of the work.</Paragraph>
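The organize-by-similarity step is not specified in detail in this introduction. As a minimal sketch of one plausible realization, the sample sentences could be grouped greedily by token-overlap (Jaccard) similarity; every name and threshold below is illustrative, not taken from the toolkit described in the paper:

```python
# Hypothetical sketch: group sample sentences by token-overlap (Jaccard)
# similarity. The paper does not describe the toolkit's actual grouping
# method; this only illustrates the kind of step involved.

def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two sentences."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def group_by_similarity(sentences, threshold=0.5):
    """Greedily place each sentence into the first group whose
    representative (first member) is similar enough; otherwise
    start a new group."""
    groups = []
    for s in sentences:
        for g in groups:
            if jaccard(s, g[0]) >= threshold:
                g.append(s)
                break
        else:
            groups.append([s])
    return groups

samples = [
    "open the log file",
    "open the config file",
    "delete all temporary files",
]
print(group_by_similarity(samples))
# -> [['open the log file', 'open the config file'],
#     ['delete all temporary files']]
```

Grouping by surface overlap is only a crude stand-in; a real toolkit might compare parses or semantic forms instead of raw tokens.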
    <Paragraph position="3"> In order to handle natural language input, an NLP toolkit must have a parser that maps a sentence string to a syntactic structure. The parser must be both general and accurate. It has to be general because programmers from different domains will use the toolkit to generate their specific parsers. It has to be accurate because the toolkit targets commercial domains, which usually require high accuracy.</Paragraph>
    <Paragraph position="4"> The accuracy of the parser directly affects the accuracy of the generated NL interface. In the program-through-example approach, the toolkit should convert the example sentences into semantic representations so as to capture their meanings. In a real world application, this process will involve a large quantity of data. If the programmers have to check each syntactic or semantic form by hand in order to decide if the corresponding sentence is parsed correctly, they are likely to be overwhelmed by the workload imposed by the large number of sentences, not to mention that they do not have the necessary linguistic knowledge to do this.</Paragraph>
    <Paragraph position="5"> Therefore the toolkit should have a broad-coverage parser that has the accuracy of a parser designed specifically for a domain.</Paragraph>
    <Paragraph position="6"> One solution is to use an existing parser with relatively high accuracy, such as those of (Charniak, 2000; Collins, 1999), which would eliminate the need to build a parser from scratch. However, there are two problems with this approach. First, many parsers report high precision in terms of the number of correctly parsed syntactic relations rather than complete sentences, whereas in commercial applications users are often concerned with the number of whole sentences that are parsed correctly. Measured by this standard, precision might drop considerably. In addition, although many parsers are domain independent, they perform much better in the domains they are trained on or implemented in. Therefore, relying solely on a general parser would not satisfy the accuracy needs of a particular domain.</Paragraph>
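To make the relation-level versus sentence-level gap concrete: if per-relation precision is p and a sentence contains n relations, then under a rough independence assumption the chance that the entire sentence is parsed correctly is about p^n. The 90% and 10-relation figures below are illustrative, not taken from any cited parser:

```python
# Back-of-the-envelope arithmetic (illustrative figures, assuming
# roughly independent errors across relations in a sentence).
p = 0.90   # hypothetical per-relation precision
n = 10     # hypothetical number of relations in one sentence
sentence_acc = p ** n
print(round(sentence_acc, 2))  # -> 0.35
```

So a parser that looks strong relation by relation can still get only about a third of whole sentences exactly right, which is the drop the paragraph above warns about.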
    <Paragraph position="7"> Second, since each domain has its own problems, which cannot be foreseen in the design of the toolkit, customization of the parser might be needed. Unfortunately, using an existing parser does not normally allow this option. One solution is to build another parser on top of the general parser that can be customized to address domain specific parsing problems such as ungrammatical sentences.</Paragraph>
    <Paragraph position="8"> This domain-specific parser can be built relatively quickly because it only needs to handle a small set of natural language phenomena. In this way, the toolkit will have a parser that covers a wide range of applications and at the same time can be customized to handle domain-specific phenomena with high accuracy. In this paper we adopt this methodology.</Paragraph>
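The layered design described above, a small customizable parser in front of a broad-coverage one, can be sketched as a simple dispatch. Everything here is a stub for illustration; none of the classes or parse representations come from the paper's implementation:

```python
# Sketch of the two-layer parsing strategy: a small, customizable
# domain-specific parser handles known domain phenomena (including
# ungrammatical input); everything else falls through to a
# broad-coverage parser. All parsers below are illustrative stubs.

class DomainParser:
    """Covers a small, customizable set of domain phenomena."""
    def __init__(self, patterns):
        self.patterns = patterns  # sentence -> parse (stub representation)

    def parse(self, sentence):
        return self.patterns.get(sentence.lower())  # None if not covered

class BroadCoverageParser:
    """Stand-in for a general parser such as Minipar."""
    def parse(self, sentence):
        return ("S", sentence.split())  # trivial stub parse

def parse(sentence, domain, general):
    """Prefer the domain parser; fall back to the general one."""
    return domain.parse(sentence) or general.parse(sentence)

domain = DomainParser({"reboot server": ("CMD", ["reboot", "server"])})
general = BroadCoverageParser()
print(parse("reboot server", domain, general))
# -> ('CMD', ['reboot', 'server'])
print(parse("open the log file", domain, general))
# -> ('S', ['open', 'the', 'log', 'file'])
```

The point of the dispatch is that the domain layer stays small enough to build quickly, while the general parser guarantees coverage for everything the domain layer does not recognize.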
    <Paragraph position="9"> The paper is organized into 6 sections. In Section 2, we briefly describe the NLP toolkit for which the parser is proposed and implemented. Section 3 introduces Minipar, the broad-coverage parser we choose for our toolkit, and the problems this parser has when parsing a corpus we collected in an IT domain.</Paragraph>
    <Paragraph position="10"> In Section 4, we present the design of the shallow parser and its disadvantages. In Section 5, we describe how we combine the strengths of the two parsers and report the test results. Finally, in Section 6, we draw conclusions and propose future work.</Paragraph>
  </Section>
</Paper>