<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-1812">
  <Title>Segmentation Rate(%) Number of Words Economics Engineering Correct Segmentation Rate Unsegmentation Rate</Title>
  <Section position="4" start_page="0" end_page="1" type="metho">
    <SectionTitle>
3 Evaluation Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="1" type="sub_section">
      <SectionTitle>
3.1 Experimental Data and Procedure
</SectionTitle>
      <Paragraph position="0"> To evaluate the adaptability of the proposed method to different fields and its effectiveness for Chinese word segmentation, we use Chinese text from two specialized fields of the Sinica Corpus: the economics text contains 92,085 words and the engineering text contains 70,017 words, for a total of 162,102 words. The economics text covers economic systems, economic policy and economic theory. The engineering text covers electronics, communication engineering, mechanical engineering and the nuclear industry.</Paragraph>
      <Paragraph position="1"> To confirm that the proposed method adapts to the user, we start with an empty initial dictionary. We input one paragraph of about one hundred words at a time, alternating between the texts of the two fields.</Paragraph>
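      <Paragraph position="2"> The input procedure described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the names chunk_words and alternate_fields are hypothetical, the texts are treated as already tokenized word lists (the real input is unsegmented Chinese text), and "alternating" is read as switching fields after every paragraph of about one hundred words.

```python
from itertools import zip_longest


def chunk_words(words, size=100):
    """Split a word list into consecutive chunks of at most `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]


def alternate_fields(economics, engineering, size=100):
    """Interleave chunks of the two fields: economics, engineering, economics, ...

    Assumption: strict alternation after every chunk; when one field runs out,
    the remaining chunks of the other field follow on their own.
    """
    schedule = []
    pairs = zip_longest(chunk_words(economics, size), chunk_words(engineering, size))
    for econ_chunk, eng_chunk in pairs:
        if econ_chunk is not None:
            schedule.append(("economics", econ_chunk))
        if eng_chunk is not None:
            schedule.append(("engineering", eng_chunk))
    return schedule


# The experiment starts from an empty dictionary (hypothetical representation).
dictionary = set()
```

 Each entry of the returned schedule pairs a field label with one input paragraph, so a segmenter under test could be fed the entries in order while its dictionary grows.</Paragraph>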
    </Section>
    <Section position="2" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
3.2 Experimental Results
</SectionTitle>
      <Paragraph position="0"> The results of the experiment are shown in Table 1. Fig. 4 shows the change in CSR, ESR and USR. In our method, the correct segmentation number is the number of segmentations that a user judges to be correct. The unsegmentation number is the number of words obtained when all unsegmented strings are segmented correctly. The erroneous segmentation number is obtained by subtracting the correct segmentation number and the unsegmentation number from the number of all words in the input text. To evaluate the experimental results, we use three rates: CSR (Correct Segmentation Rate), ESR (Erroneous Segmentation Rate) and USR (Unsegmented Rate).</Paragraph>
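      <Paragraph position="1"> The three rates can be illustrated with a short Python sketch. The formulas themselves are not present in this text, so the sketch assumes that each rate is the corresponding count divided by the number of all words in the input text, expressed as a percentage. The function name segmentation_rates and the example counts are hypothetical.

```python
def segmentation_rates(total_words, correct, unsegmented):
    """Return (CSR, ESR, USR) in percent.

    Assumption: each rate is its count divided by the number of all words in
    the input text; the erroneous count is what remains after subtracting the
    correct and unsegmented counts, as described in the text.
    """
    if total_words == 0:
        raise ValueError("total_words must be non-zero")
    if correct + unsegmented > total_words:
        raise ValueError("correct + unsegmented exceeds total_words")
    erroneous = total_words - correct - unsegmented
    csr = 100.0 * correct / total_words
    esr = 100.0 * erroneous / total_words
    usr = 100.0 * unsegmented / total_words
    return csr, esr, usr


# Made-up counts for illustration only; they are not taken from Table 1.
csr, esr, usr = segmentation_rates(total_words=200, correct=150, unsegmented=30)
print(csr, esr, usr)
```

 Under this assumption the three rates always sum to 100%, because the erroneous count is defined as the remainder.</Paragraph>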
    </Section>
    <Section position="3" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
4.1 Adaptability to Different Fields
</SectionTitle>
      <Paragraph position="0"> Fig. 4 shows the experimental results for the two fields. When the text switches to a different domain, the correct segmentation rate falls temporarily because new words from the other field appear. However, as more sentences are processed, the correct segmentation rate quickly rises again.</Paragraph>
      <Paragraph position="1"> We may therefore consider that the proposed method adapts to different fields. The correct segmentation rate is sometimes slightly lower because the text within one field also varies in domain; for example, the economics text covers economic systems, economic policy, economic theory and so on.</Paragraph>
    </Section>
    <Section position="4" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
4.2 Evaluation of Ability for Predicting Unknown Words
</SectionTitle>
      <Paragraph position="0"> We use 50,000 words to discuss the ability of the proposed method to predict unknown words.</Paragraph>
      <Paragraph position="2"> Precision and recall are computed from three counts: CWN is the number of words that are predicted correctly, TWN is the total number of words that are predicted, and TUN is the total number of unknown words.</Paragraph>
      <Paragraph position="3"> The precision and recall are shown in Fig. 5. The average precision is 26.0% and the average recall is 31.0%. As more words are registered in the dictionary, the prediction of unknown words improves: after 40,000 words are processed, the precision and the recall are 85.0% and 40.0%, respectively.</Paragraph>
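      <Paragraph position="4"> How the counts CWN, TWN and TUN lead to precision and recall can be sketched in Python. The defining equations are not present in this text, so the sketch assumes the usual definitions: precision is CWN divided by TWN and recall is CWN divided by TUN, both as percentages. The function name prediction_scores and the example counts are hypothetical.

```python
def prediction_scores(cwn, twn, tun):
    """Return (precision, recall) in percent.

    Assumed definitions: precision = CWN / TWN and recall = CWN / TUN, where
    CWN is the number of correctly predicted words, TWN the number of
    predicted words and TUN the number of unknown words.
    A zero denominator yields 0.0 instead of raising an error.
    """
    precision = 100.0 * cwn / twn if twn else 0.0
    recall = 100.0 * cwn / tun if tun else 0.0
    return precision, recall


# Made-up counts for illustration only; they are not taken from Fig. 5.
precision, recall = prediction_scores(cwn=3, twn=4, tun=10)
print(precision, recall)
```

 Under these definitions precision can be high while recall stays low when the method predicts few words but most of its predictions are right.</Paragraph>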
    </Section>
    <Section position="5" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
4.3 Analysis of Erroneous Segmentation
</SectionTitle>
      <Paragraph position="0"> We select 1,000 words from the beginning and 1,000 words from the end of the experimental data to analyze the causes of erroneous segmentation. At the beginning, the ESR caused by unregistered words is 18.0%, but after 16,000 words are processed it is 0.9%. However, the ESR caused by ambiguity keeps increasing, from 1.6% to 7.0%, as the number of registered words in the dictionary grows. Ambiguous segmentation is still a difficult problem, so it is necessary to improve the ability to deal with ambiguity.</Paragraph>
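      <Paragraph position="1"> A per-cause breakdown of the ESR like the one reported above can be tallied with a small Python sketch. It assumes that every erroneously segmented word in a sample has been labeled by hand with a single cause and that each per-cause ESR is the number of such words divided by the sample size. The function name esr_by_cause and the cause labels are hypothetical.

```python
from collections import Counter


def esr_by_cause(error_causes, sample_size):
    """Return a dict mapping each cause label to its ESR in percent.

    error_causes: one label per erroneously segmented word in the sample,
    for example "unregistered" or "ambiguity" (hypothetical labels).
    sample_size: number of words in the sample, e.g. 1000.
    """
    if sample_size == 0:
        raise ValueError("sample_size must be non-zero")
    counts = Counter(error_causes)
    return {cause: 100.0 * n / sample_size for cause, n in counts.items()}


# Made-up labels for illustration only; they are not the paper's data.
labels = ["unregistered"] * 3 + ["ambiguity"]
print(esr_by_cause(labels, sample_size=50))
```

 Running the same tally on a sample from the beginning and on one from the end of the data gives the kind of before and after comparison described in the text.</Paragraph>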
    </Section>
  </Section>
</Paper>