<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1506"> <Title>Multi-Language Named-Entity Recognition System based on HMM</Title> <Section position="5" start_page="0" end_page="2" type="evalu"> <SectionTitle> 4. Experiments </SectionTitle> <Paragraph position="0"> To evaluate our system, we prepared original corpora for Japanese, Chinese, Korean and English. The material was mainly taken from newspapers and Web texts. For morphological analysis, we used the definitions of the Penn Treebank for English [11], Jtag for Japanese [12], Beijing Univ. for Chinese [13], and MATEC99 for Korean [14].</Paragraph> <Paragraph position="1"> The named-entity tag definitions were based on MUC [15] for English and IREX [16] for Japanese. We defined Chinese and Korean named-entity tags following the Japanese IREX specifications. Table 5 shows the dictionary and corpus sizes. "Dictionary words" denotes the size of the dictionary for morphological analysis. "Total words" and "sentences" denote the size of the corpus for named-entity recognition.</Paragraph> <Paragraph position="2"> Named-entity accuracy is expressed in terms of recall and precision. We also use the F-measure to indicate the overall performance. It is calculated as follows: F-measure = (2 × recall × precision) / (recall + precision). Table 6 shows the F-measure for all languages. Since we used our original corpora in this evaluation, we cannot compare our results to those of previous works. Accordingly, we also evaluated an SVM on our original corpora (see Table 6) [17]. The accuracies of HMM and SVM were approximately equivalent, but the analysis speed of HMM was ten times faster than that of SVM [9]. This means that our system is very fast and has state-of-the-art accuracy in four languages.</Paragraph> <Paragraph position="3"> We noted that the accuracy of SVM is unusually low compared with that of HMM for Japanese. We have not yet confirmed the cause of this, but a plausible explanation is as follows. First, word segmentation ambiguity has a worse effect on accuracy than expected. 
This is because current SVM implementations cannot handle N-best morpheme candidates, so lower-order candidates are not considered in named-entity recognition. Second, SVM may not suit the analysis of irregular, ill-structured, and informal sentences such as Web texts. Our original corpus data was taken from newspapers and Web texts; the former contain complete, grammatical sentences, unlike the latter. It is often said that HMM is robust enough to analyze such dirty sentences. In any case, our next step is to analyze the results of named-entity recognition in more detail.</Paragraph> </Section> </Paper>