<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-2116">
  <Title>Automatic Corpus-Based Thai Word Extraction with the C4.5 Learning Algorithm</Title>
  <Section position="5" start_page="804" end_page="805" type="evalu">
    <SectionTitle>
4 Experimental Results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="804" end_page="805" type="sub_section">
      <SectionTitle>
4.1 The Results
</SectionTitle>
      <Paragraph position="0"> To measure the accuracy of the algorithm, we consider two statistics: precision and recall. The precision of our algorithm is 87.3% on the training set and 84.1% on the test set. The recall of extraction is 56% on both the training and test sets. We compare the recall of our word extraction with the recall obtained using the Thai Royal Institute dictionary (RID). The recall of our approach and that of RID are comparable, and our approach should outperform the existing dictionary on larger corpora. Both precision and recall are quite close between the training and test sets.</Paragraph>
      <Paragraph position="1"> This closeness indicates that the induced decision tree is robust on unseen data. Table 3 also shows that more than 30% of the extracted words are not found in RID. These words are candidates for new dictionary entries.</Paragraph>
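      <Paragraph position="2"> As an illustration of the two measures reported above, the following is a minimal sketch that assumes the standard set-based definitions: precision is the fraction of extracted strings that are correct words, and recall is the fraction of reference words that were extracted. This section does not spell out the exact counting scheme, so the function name precision_recall and the toy string sets are hypothetical and are not taken from the experiments.

```python
def precision_recall(extracted, reference):
    """Return (precision, recall) of extracted strings against a reference word set.

    Assumes the standard set-based definitions; the counting scheme used
    in the experiments themselves is not specified in this section.
    """
    extracted = set(extracted)
    reference = set(reference)
    # Strings that were extracted and are also reference words.
    correct = extracted.intersection(reference)
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical toy data, not taken from the paper.
extracted = {"w1", "w2", "w3", "w4"}
reference = {"w1", "w2", "w3", "w5", "w6", "w7"}

p, r = precision_recall(extracted, reference)
print(f"precision={p:.2f} recall={r:.2f}")  # prints "precision=0.75 recall=0.50"
```

Running the same function separately on the training and test sets would give the pair of figures that the text compares when judging robustness on unseen data.</Paragraph>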
    </Section>
    <Section position="2" start_page="805" end_page="805" type="sub_section">
      <SectionTitle>
4.2 The Relationship of Accuracy, Occurrence
and Length
</SectionTitle>
      <Paragraph position="0"> In this section, we consider how extraction accuracy relates to string length and number of occurrences. Figures 2 and 3 show that both precision and recall tend to increase as the number of string occurrences increases. This implies that accuracy should be higher for larger corpora. Similarly, Figures 4 and 5 show that accuracy tends to be higher for longer strings. Newly created words and loan words tend to be long. Our extraction therefore gives high accuracy on such words and is very useful for extracting them.</Paragraph>
    </Section>
  </Section>
</Paper>