<?xml version="1.0" standalone="yes"?> <Paper uid="C00-1057"> <Title>Robust Segmentation of Japanese Text into a Lattice for Parsing</Title> <Section position="1" start_page="0" end_page="0" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> We describe a segmentation component that utilizes minimal syntactic knowledge to produce a lattice of word candidates for a broad coverage Japanese NL parser. The segmenter is a finite state morphological analyzer and text normalizer designed to handle the orthographic variations characteristic of written Japanese, including alternate spellings, script variation, vowel extensions and word-internal parenthetical material. This architecture differs from conventional Japanese wordbreakers in that it does not attempt to simultaneously attack the problems of identifying segmentation candidates and choosing the most probable analysis. To minimize duplication of effort between components and to give the segmenter greater freedom to address orthography issues, the task of choosing the best analysis is handled by the parser, which has access to a much richer set of linguistic information. By maximizing recall in the segmenter and allowing a precision of 34.7%, our parser currently achieves a breaking accuracy of ~97% over a wide variety of corpora.</Paragraph> <Paragraph position="1"> Introduction The task of segmenting Japanese text into word units (or other units such as bunsetsu (phrases)) has been discussed at great length in Japanese NL literature ([Kurohashi98], [Fuchi98], [Nagata94], et al.). Japanese does not typically have spaces between words, which means that a parser must first have the input string broken into usable units before it can analyze a sentence. 
Moreover, a variety of issues complicate this operation, most notably that potential word candidate records may overlap (causing ambiguities for the parser) or there may be gaps where no suitable record is found (causing a broken span).</Paragraph> <Paragraph position="2"> These difficulties are commonly addressed using either heuristics or statistical methods to create a model for identifying the best (or n-best) sequence of records for a given input string. This is typically done using a connective-cost model ([Hisamitsu90]), which is either maintained laboriously by hand, or trained on large corpora.</Paragraph> <Paragraph position="3"> Both of these approaches suffer from problems.</Paragraph> <Paragraph position="4"> Handcrafted heuristics may become a maintenance quagmire, and as [Kurohashi98] suggests in his discussion of the JUMAN segmenter, statistical models may become increasingly fragile as the system grows and eventually reach a point where side effects rule out further improvements. The sparse data problem commonly encountered in statistical methods is exacerbated in Japanese by widespread orthographic variation (see §3).</Paragraph> <Paragraph position="5"> Our system addresses these pitfalls by assigning completely separate roles to the segmenter and the parser to allow each to delve deeper into the complexities inherent in its tasks.</Paragraph> <Paragraph position="6"> Other NL systems ([Kitani93], [Kurohashi98]) have separated the segmentation and parsing components. However, these dual-level systems are prone to duplication of effort since many segmentation ambiguities cannot be resolved without invoking higher-level syntactic or semantic knowledge. Our system avoids this duplication by relaxing the requirement that the segmenter identify the best path (or even n-best paths) through the lattice of possible records. The segmenter is responsible only for ensuring that a correct set of records is present in its output. 
It is the function of the parsing component to select the best analysis from this lattice. With this model, our system achieves roughly 97% recall/precision (see [Suzuki00] for more details).</Paragraph> </Section> </Paper>