File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/00/w00-1321_intro.xml
Size: 4,735 bytes
Last Modified: 2025-10-06 14:01:04
<?xml version="1.0" standalone="yes"?> <Paper uid="W00-1321"> <Title>Reducing Parsing Complexity by Intra-Sentence Segmentation based on Maximum Entropy Model</Title> <Section position="3" start_page="0" end_page="164" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Long sentence analysis has been a critical problem in machine translation because of its high complexity. In EBMT (example-based machine translation), the longer a sentence is, the less likely it is to have an exact match in the translation archive, and the less flexible an EBMT system will be (Cranias et al., 1994). In idiom-based machine translation (Lee, 1993), long sentence parsing is difficult because more resources are spent during the idiom recognition phase as sentence length increases. A parser is often unable to analyze long sentences owing to their complexity, even though they contain no grammatical errors (Nasukawa, 1995).</Paragraph> <Paragraph position="1"> In English-Korean machine translation, an idiom-based approach is adopted to overcome the structural differences between the two languages and to obtain more accurate translations. The parser is a chart parser capable of idiom recognition and translation, adapted to English-Korean machine translation. Idioms are recognized prior to syntactic analysis, and the part of a sentence covered by an idiom takes an edge in the chart (Winograd, 1983). When parsing long sentences, ambiguity in an idiom's range may produce more edges than the number of words in the idiom (Yoon, 1994), which greatly increases parsing complexity. A parser in a practical machine translation system should be able to analyze long sentences in a reasonable time.</Paragraph> <Paragraph position="2"> Most context-free parsing algorithms have O(n^3) parsing complexity in time and space, where n is the length of a sentence (Tomita, 1986). 
Our work is motivated by the fact that parsing becomes more efficient as n becomes smaller. This paper addresses the problem of parsing complexity by reducing the length of the sentence to be analyzed. This reduction is achieved by intra-sentence segmentation, which is distinct from the inter-sentence segmentation used for text categorization (Beeferman et al., 1997) or sentence boundary identification (Palmer and Hearst, 1997; Reynar and Ratnaparkhi, 1997). Intra-sentence segmentation serves as a preliminary step to a chart-based, context-free parser in English-Korean machine translation.</Paragraph> <Paragraph position="3"> Several methods have been proposed for reducing parsing complexity by intra-sentence segmentation. Lyon and Frank (1995) and Lyon and Dickerson (1997) took advantage of the fact that declarative sentences almost always consist of three segments: [pre-subject : subject : predicate].</Paragraph> <Paragraph position="4"> The complexity can be reduced by decomposing a sentence into these three sections. Pattern rules (Li et al., 1990) and sentence patterns (Kim and Khn, 1995) were used to segment long English sentences. These approaches showed low segmentation coverage, meaning that many long sentences are not segmented by the pattern rules or sentence patterns. They also require much human effort to construct pattern rules or collect sentence patterns. These factors may prevent them from being applicable to practical machine translation systems.</Paragraph> <Paragraph position="5"> This paper presents a trainable model for identifying potential segmentation positions in a sentence and determining appropriate segmentation positions. Given a corpus annotated with segmentation positions, our model automatically learns contextual evidence about segmentation positions, which relieves humans of the burden of constructing pattern rules or sentence patterns. 
This evidence is combined under the maximum entropy framework (Jaynes, 1957) to estimate the segmentation probability for each position. With intra-sentence segmentation based on the proposed model, we improve parsing efficiency by 77% in time and 71% in space.</Paragraph> <Paragraph position="6"> In Section 2 we introduce the maximum entropy model. Section 3 describes the features incorporated into the model and the process of identifying potential segmentation positions.</Paragraph> <Paragraph position="7"> The schemes for determining segmentation positions are described in Section 4. Section 5 presents the segmentation performance of the model and the degree to which segmentation contributes to efficient parsing. We also compare our approach with other intra-sentence segmentation approaches. Section 6 draws conclusions and outlines future work.</Paragraph> </Section> </Paper>