<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1321">
  <Title>Reducing Parsing Complexity by Intra-Sentence Segmentation based on Maximum Entropy Model</Title>
  <Section position="7" start_page="168" end_page="170" type="evalu">
    <SectionTitle>
5 Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="168" end_page="168" type="sub_section">
      <SectionTitle>
5.1 Corpus and Construction of the
Maximum Entropy Model
</SectionTitle>
      <Paragraph position="0"> We construct the corpus from two different domains, where sentences longer than 15 words are extracted3. The training portion is used to generate lexical contextual constraints and to collect statistics for maximum entropy model construction. From high school English texts, 1500 sentences are tagged with segmentation positions by humans. Two people with some knowledge of English syntactic structures read the sentences and marked as segmentation positions the words at which they paused.</Paragraph>
      <Paragraph position="1"> After generating lexical contextual constraints, we constructed the maximum entropy model p(y|x), where x is a lexical contextual constraint and y ∈ {0,1}. The model incorporates features that occur more than 5 times in the training data. 3626 candidate features were generated without word sets and 3878 with word sets. Table 2 shows the training time and the number of active features of the models.</Paragraph>
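As an illustrative sketch (not the authors' implementation), the feature cutoff and the binary model p(y|x) described above can be expressed with indicator features and a simple logistic-regression trainer, which is the maximum-entropy model for a binary outcome; the feature names and training settings below are invented for the example:

```python
from collections import Counter
import math

def select_features(events, min_count=5):
    # Keep candidate features occurring more than min_count times
    # in the training data, as described above.
    counts = Counter(f for feats, _ in events for f in feats)
    return {f for f, c in counts.items() if c > min_count}

def train(events, features, epochs=200, lr=0.5):
    # Fit a binary model p(y|x) over indicator features via
    # per-example gradient ascent on the log-likelihood.
    w = {f: 0.0 for f in features}
    b = 0.0
    for _ in range(epochs):
        for feats, y in events:
            z = b + sum(w[f] for f in feats if f in w)
            p = 1.0 / (1.0 + math.exp(-z))
            b += lr * (y - p)
            for f in feats:
                if f in w:
                    w[f] += lr * (y - p)
    return w, b

def p_segment(w, b, feats):
    # p(y=1 | x): probability that this position is a segmentation point.
    z = b + sum(w.get(f, 0.0) for f in feats)
    return 1.0 / (1.0 + math.exp(-z))
```

Each training event pairs the set of active contextual features at a word position with a human label (1 = segmentation position, 0 = not).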
      <Paragraph position="2"> Segmentation performance is evaluated on the test portion, which consists of 1800 sentences from two domains: high school English texts and the Byte Magazine.</Paragraph>
    </Section>
    <Section position="2" start_page="168" end_page="169" type="sub_section">
      <SectionTitle>
5.2 Segmentation Performance
</SectionTitle>
      <Paragraph position="0"> In addition to coverage and accuracy, the SC value is defined to express the degree to which segmentation contributes to efficient parsing. It is the ratio of the sentences that can benefit from intra-sentence segmentation. If a long sentence is not segmented, or is segmented at unsafe segmentation positions, it is called a segmentation error sentence.</Paragraph>
      <Paragraph position="1"> The SC value is calculated as SC = 1 - (# of segmentation error sentences) / (# of segmentation target sentences). A sentence longer than α words is considered a segmentation target sentence, where α is set to 12. Table 3 compares segmentation performance for each determination scheme.</Paragraph>
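A minimal sketch of the SC computation and the target-sentence criterion, assuming SC is one minus the segmentation-error rate over target sentences (consistent with the definition above); the function names are illustrative:

```python
def is_target(sentence, alpha=12):
    # A sentence longer than alpha words is a segmentation target sentence.
    return len(sentence.split()) > alpha

def sc_value(n_error, n_target):
    # SC: fraction of target sentences that benefit from segmentation,
    # i.e., 1 minus the segmentation-error rate.
    return 1.0 - n_error / n_target
```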
      <Paragraph position="2"> Comparing the baseline scheme with the others shows that accuracy depends on the context information. Word sets help increase coverage with less degradation of accuracy. Each scheme is superior on a different measure, but in terms of applicability to practical systems, the third scheme is best for our purpose. Table 4 shows the segmentation performance of the scheme using LCC with word sets.</Paragraph>
      <Paragraph position="3"> The SC value for the sentences from the same domain as the training data is about 0.88, and about 0.85 for the sentences from the Byte Magazine. Though the values differ slightly between test domains, about 87% of long sentences can be parsed with less complexity and without causing parsing failures. This suggests that the intra-sentence segmentation method can be utilized for efficient parsing of long sentences.</Paragraph>
    </Section>
    <Section position="3" start_page="169" end_page="169" type="sub_section">
      <SectionTitle>
5.3 Parsing Efficiency
</SectionTitle>
      <Paragraph position="0"> Parsing efficiency is generally measured by the required time and memory for parsing.</Paragraph>
      <Paragraph position="1"> In most cases, parsing sentences longer than 30 words could not be completed without intra-sentence segmentation. Therefore, parsing is performed on sentences longer than 15 words and shorter than 30 words. An Ultra-Sparc 30 machine was used for the experiments. The efficiency improvement was measured by</Paragraph>
      <Paragraph position="3"> improvement_t = (t_unseg - t_seg) / t_unseg x 100 (%) and improvement_m = (m_unseg - m_seg) / m_unseg x 100 (%), where t_unseg and m_unseg are the time and memory for parsing without segmentation, and t_seg and m_seg are those for parsing with segmentation.</Paragraph>
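The improvement can be computed from the measured quantities with a helper like the following; the percentage form is an assumption consistent with the improvement figures reported for this method (about 77% in time and 71% in space):

```python
def improvement(without_seg, with_seg):
    # Relative saving from segmentation, as a percentage.
    # Applies to both time (t_unseg vs. t_seg) and memory (m_unseg vs. m_seg).
    return 100.0 * (without_seg - with_seg) / without_seg
```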
      <Paragraph position="4"> Table 5 summarizes the results.</Paragraph>
      <Paragraph position="5"> By segmenting long sentences into several manageable-sized segments, we can parse long sentences with much less time and space.</Paragraph>
    </Section>
    <Section position="4" start_page="169" end_page="170" type="sub_section">
      <SectionTitle>
5.4 Comparison with Related Works
</SectionTitle>
      <Paragraph position="0"> The intra-sentence segmentation method based on the maximum entropy model is compared with other approaches in terms of segmentation coverage and the improvement of parsing efficiency.</Paragraph>
      <Paragraph position="1"> In (Lyon and Frank, 1995; Lyon and Dickerson, 1997), a sentence is segmented into three segments. Though parsing efficiency can be improved by segmenting a sentence, this method may be applied only to simple sentences4. Long sentences are generally coordinate sentences5 or complex sentences6. They have more than two subjects, so applying this method to such sentences seems inappropriate. In (Kim and Kim, 1995), sentence patterns are used to segment long sentences. This method improves parsing efficiency by 30% in time and 58% in space. However, collecting sentence patterns requires much human effort, and segmentation coverage is only about 36%. Li's method (Li et al., 1990) for sentence segmentation also depends upon manually intensive pattern rules. Its segmentation coverage seems unsatisfactory for a practical machine translation system.</Paragraph>
      <Paragraph position="2"> The proposed method can be applied to coordinate and complex sentences as well as simple sentences. It shows segmentation coverage of about 96%. In addition, it needs no human effort other than constructing the training data. Human annotators have only to read sentences and mark segmentation positions, which is simpler than collecting pattern rules or sentence patterns. We also obtain much improved parsing efficiency: about 77% in time and about 71% in space.</Paragraph>
    </Section>
  </Section>
</Paper>