<?xml version="1.0" standalone="yes"?>
<Paper uid="P01-1041">
  <Title>Japanese Named Entity Recognition based on a Simple Rule Generator and Decision Tree Learning</Title>
  <Section position="8" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
3 Results
</SectionTitle>
    <Paragraph position="0"> Now, we compare our method with the ME system. We used the standard IREX training data (CRL NE 1.4 MB and NERT 30 KB) and the formal run test data (GENERAL and AR-REST). When human annotators were not sure, they used &lt;OPTIONAL POSSIBILITY=...&gt; where POSSIBILITY is a list of possible NE classes. We also used 7.4 MB of in-house NE data that did not contain optional tags. All of the training data (all = CRL NE+NERT+in-house) were based on the Mainichi Newspaper's 1994 and 1995 CD-ROMs. Table 1 shows the details.</Paragraph>
    <Paragraph position="1"> We removed an optional tag when its possibility list contains NONE, which means this part is accepted without a tag. Otherwise, we selected the majority class in the list. As a result, 56 NEs were added to CRL NE.</Paragraph>
    <Paragraph position="2"> For tokenization, we used chasen 2.2.1 (http://chasen.aist-nara.ac.jp/).</Paragraph>
    <Paragraph position="3"> It has about 90 POS tags and large proper noun dictionaries (persons = 32,167, organizations = 16,610, locations = 67,296, miscellaneous proper nouns = 26,106). (Large dictionaries sometimes make the extraction of NEs difficult. If OO-SAKA-GIN-KOU is registered as a single word, GIN-KOU is not extracted as an organization suffix from this example.) We tuned chasen's parameters for NE recognition. In order to avoid the excessive division of unknown words (see Introduction), we reduced the cost for unknown words (30000 a52 7000). We also changed its setting so that an unknown word are classified as a misc-proper-noun.</Paragraph>
    <Paragraph position="4"> Then, we compared the above methods in terms of the averaged F-measures by 5-fold cross-validation of CRL NE data. The ME system attained 82.77% for a30  attained 81.18% for a30a8a57a22a6a13a60a22a35 by removing bad templates with fewer positive examples than negative ones.) Thus, the two methods returned similar results. However, we cannot expect good performance for other documents because CRL NE is limited to January, 1995.</Paragraph>
    <Paragraph position="5"> Figure 2 compares these systems by using the formal run data. We cannot show the ME results for the large training data because Ristad's toolkit crashes even on a 2 GB memory machine.</Paragraph>
    <Paragraph position="6"> According to this graph, the RG+DT system's scores are comparable to those of the ME system. When all the training data was used, RG+DT's F-measure for GENERAL was 87.43%. We also examined RG+DT's variants. When we replaced character types of one-word NEs by '*', the score dropped to 86.79%. When we did not replace any character type by '*' at all, the score was 86.63%. RG+DT/n in the figure is a variant that also applies suffix dictionary to numerical NE classes. When we used tokenized CRL NE for training, the RG+DT system's training time was about 3 minutes on a Pentium III 866 MHz 256 MB memory Linux machine. This performance is much faster than that of the ME system, which takes a few hours; this difference cannot be explained by the fact that the ME system is implemented on a slower machine. When we used all of the training data, the training time was less than one hour and the processing time of tokenized GENERAL (79 KB before tokenization) was about 14 seconds.</Paragraph>
  </Section>
class="xml-element"></Paper>