File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/02/w02-1210_abstr.xml

Size: 4,642 bytes

Last Modified: 2025-10-06 13:42:35

<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1210">
  <Title>Efficient Deep Processing of Japanese</Title>
  <Section position="1" start_page="0" end_page="1" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> We present a broad-coverage Japanese grammar written in the HPSG formalism with MRS semantics. The grammar is created for use in real-world applications, so robustness and performance play an important role. It is connected to a POS tagging and word segmentation tool.</Paragraph>
    <Paragraph position="1"> This grammar is being developed in a multilingual context, requiring MRS structures that are easily comparable across languages.</Paragraph>
    <Paragraph position="2"> Introduction: Natural language processing technology has recently reached a point where applications that rely on deep linguistic processing are becoming feasible. Such applications (e.g. message extraction systems, machine translation and dialogue understanding systems) require natural language understanding, or at least an approximation thereof. This, in turn, requires rich and highly precise information as the output of a parse. However, if the technology is to meet the demands of real-world applications, this must not come at the cost of robustness. Robustness requires not only wide coverage by the grammar (in both syntax and semantics), but also large and extensible lexica as well as interfaces to preprocessing systems for named entity recognition, non-linguistic structures such as addresses, etc. Furthermore, applications built on deep NLP technology should be extensible to multiple languages. This requires flexible yet well-defined output structures that can be adapted to grammars of many different languages. Finally, for use in real-world applications, NLP systems meeting the above desiderata must also be efficient.</Paragraph>
    <Paragraph position="3"> In this paper, we describe the development of a broad-coverage grammar for Japanese that is used in an automatic email response application. The grammar is based on work done in the Verbmobil project (Siegel 2000) on machine translation of spoken dialogues in the domain of travel planning. It has since been greatly extended to accommodate written Japanese and new domains.</Paragraph>
    <Paragraph position="4"> The grammar is couched in the theoretical framework of Head-Driven Phrase Structure Grammar (HPSG) (Pollard &amp; Sag 1994), with semantic representations in Minimal Recursion Semantics (MRS) (Copestake et al. 2001).</Paragraph>
    <Paragraph position="5"> HPSG is well suited to the task of multilingual development of broad-coverage grammars: it is flexible (analyses can be shared across languages but also tailored as necessary), and it has a rich theoretical literature from which to draw analyses and inspiration. The characteristic type hierarchy of HPSG also facilitates the development of grammars that are easy to extend. MRS is a flat semantic formalism that works well with typed feature structures and is flexible in that it provides structures that are underspecified for scopal information. These structures give compact representations of ambiguities that are often irrelevant to the task at hand.</Paragraph>
    <Paragraph position="6"> HPSG and MRS have the further advantage that there are practical and useful open-source tools for writing, testing, and efficiently processing grammars written in these formalisms. The tools we are using in this project include the LKB system (Copestake 2002) for grammar development, [incr tsdb()] (Oepen &amp; Carroll 2000) for testing the grammar and tracking changes, and PET (Callmeier 2000), a very efficient HPSG parser, for processing. We also use the ChaSen tokenizer and POS tagger (Asahara &amp; Matsumoto 2000).</Paragraph>
    <Paragraph position="7"> While couched within the same general framework (HPSG), our approach differs from that of Kanayama et al. (2000). The work described there achieves impressive coverage (83.7% on the EDR corpus of newspaper text) with an underspecified grammar consisting of a small number of lexical entries, lexical types associated with parts of speech, and six underspecified grammar rules. In contrast, our grammar is much larger in terms of the number of lexical entries, the number of grammar rules, and the constraints on both, and takes correspondingly more effort to bring up to that level of coverage. The higher level of detail allows us to output precise semantic representations as well as to use syntactic, semantic and lexical information to reduce ambiguity and rank parses.</Paragraph>
  </Section>
</Paper>