<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-2167">
  <Title>Japanese Named Entity Extraction Evaluation - Analysis of Results -</Title>
  <Section position="7" start_page="1107" end_page="1109" type="evalu">
    <SectionTitle>
2.5 Results
</SectionTitle>
    <Paragraph position="0"> 8 groups and 11 systems participated in the dry run, and 14 groups and 15 systems participated in the %rmal run 2. Tim evaluation results were made public anonymously using systeln ID's. Table 3 shows the ew~luation results (Fmeasure) of the formal run. F-measure is calculated from recall and precision (IREX Coinmittee, 1999). It ranges from 0 to 100, and the larger the better</Paragraph>
    <Section position="1" start_page="1107" end_page="1107" type="sub_section">
      <SectionTitle>
3.1 Difficulty across NE type
</SectionTitle>
      <Paragraph position="0"> In Table 4, tile F-measure of the best performing system is shown in the &amp;quot;Best&amp;quot; column; the average F-measures are shown in the &amp;quot;Average&amp;quot; column tbr each NE type on the formal runs. It can be observed that identifying time and nulneric expressions is relatively easy, as the average F-measures are more than 80%. In contrast, the accuracy of the other types of NE is not so good. Ill particular, artifacts are quite difficult to identify. It is interesting to see that tagging artifacts in the general domain is mud1 harder thins in the restricted domain. This is because of the limited types of artifacts in the restricted domain. Most of the artifacts in the 2The participation to the dry run was not obligatory.</Paragraph>
      <Paragraph position="1"> This is why the number of participants is smaller in the dry run than that in the formal rmL restricted domain are the names of laws, as the doinain is the arrest domain. Systems lnight be able to find such types of names easily because they could be recognized by a small number of simple patterns or by a short list. The types of tim artifacts in the general donmin are quite diverse, including names of prizes, novels, ships, or paintings. It nfight be difficult to build patterns for these itelns, or systems may need very complicated rules or large dictionaries.</Paragraph>
    </Section>
    <Section position="2" start_page="1107" end_page="1108" type="sub_section">
      <SectionTitle>
3.2 Three types of systems
</SectionTitle>
      <Paragraph position="0"> Based on the questionnaire for the particit)ants we gatlmred alter the formal runs, we found that there are three types of systems.</Paragraph>
      <Paragraph position="1"> * Hand created pattern based These are pattern based systems where the patterns are created by hand. A typical system used prefix, sutlqx and proper noun dictionaries. Patterns in these systems look like &amp;quot;If proper nouns are followed by a suffix of person name (for example, a common suflqx like &amp;quot;San&amp;quot;, which is ahnost equivalent to Mr. and Ms.) then the proper nouns are a t)erson nmne&amp;quot;. This type of system was very common; there were 8 systems in this category.</Paragraph>
      <Paragraph position="2"> * Automatically created pattern based These are pattern based systems where some or all of the patterns are created antolnatically using a training corl)us. There were three systems in this category, and these systems used quite different methods. One of them used the &amp;quot;error driven method&amp;quot;, in which hand created patterns were applied to tagged training data and the system learned from tlm mistakes. Another system learned patterns for a wide range of information, including, syntax, verb frame and discourse information fl'om training data. The last system used the local context of training data and several filters were applied to get more accurate patterns.</Paragraph>
      <Paragraph position="3"> Fully automatic Systems in this category created their knowledge automatically from a training corpus. There were four systems in this category. These systems basicMly tried to assign one of the four tags, beginning, middle or ending of an NE, or out-ofNE, to  each word or each character. The source information for the training was typically character type, POS, dictionary in%rmation or lexical information. As tile learning meclmnism, Maximmn Entrot)y models, decision trees, and HMMs were used.</Paragraph>
      <Paragraph position="4"> It is interesting to see that tile top three systems came, fi'om each category; the best system was a hand create, d pattern based system, tile second system was an automatically created pattern based system and the third system was a fully automatic system. So we believe we can not conclude which type is SUl)erior to the ethel'S. null Analyzing the results of the top three systo, ms, we observed the, importance of tile dictionaries. The best hand created pattern based system seems to have a wide coverage dictionary for person, organization and location names and achieved very good accuracy tbr those categories. Howe.ver, the hand created pattern based system failed to capture the evahmtion specific pattenls like &amp;quot;the middle of April&amp;quot;. Systems wore required to extract the entire ex1)ression as a date expression, but; the system only extracted &amp;quot;April&amp;quot;. The best hand created rule based system, as well as the best ~mtolnatically created pattern lmsed system also missed other specific patterns which inchlde abbreviations (&amp;quot;Rai-Nichi&amp;quot; = Visit-Japan), conjmmtions of locations (&amp;quot;Nichi-Be+-&amp;quot; = ,\]alton-US), and street; addresse, s (&amp;quot;Meguro-ku, 0okayama 2-12-1&amp;quot;). The best hilly autolnatic system was successflfl in extracting lllOSt of these specific patterns. However, the flflly automatic system has a probleln in its coverage. In lmrticular, the training data was newspaper articles published in 1994 and the test; data was fro151 1999, so there are several new names, e.g. the prilne minister's name which is not so co1111noll (0buchi) and a location nmne like &amp;quot;Kosovo&amp;quot;, wlfich were rarely mentioned in 1994 but apt)eared a lot in 1999. '.Fhe system missed many of them.</Paragraph>
    </Section>
    <Section position="3" start_page="1108" end_page="1108" type="sub_section">
      <SectionTitle>
3.3 Domain dependency
</SectionTitle>
      <Paragraph position="0"> In Table 3, the differences in performance between the general domain and the restricted domain ~r(' shown ill the cohmm &amp;quot;diff.&amp;quot;. Many systems 1)erfl)nne(l better in the restri(:ted domain, although ~ small ntllnl)er of systems perforlned better ill the genera\] domain. There were two systems which intentiomflly tuned their systems towards the restricted domain, which are shown in bold in the table. Both of these were alnong the systelns which perfbrlned much better (more than 7%) ill the restricte.d domain. The system which achieved the largest improvement was a fully automatic system, and it only replaced the training data for the domain restricted task (so this is an intentionally tuned system). It shows the domain dependency of the task, although further investigation is needed to see why some other systems can perform nmch better even without domain tinting.</Paragraph>
    </Section>
    <Section position="4" start_page="1108" end_page="1109" type="sub_section">
      <SectionTitle>
3.4 Comparison to human performance
</SectionTitle>
      <Paragraph position="0"> In Table 4, hulllan performance is shown in the &amp;quot;Novice&amp;quot; and &amp;quot;Expert&amp;quot; cohtmns. &amp;quot;Novice&amp;quot; means tile average F-measure of three graduate students all(l &amp;quot;l;xpert&amp;quot; means the average F-measure of the two people who were most re- null sponsible for creating tile definition and created the answer. They frst created two answers independently and checked them by themselves.</Paragraph>
      <Paragraph position="1"> The results after the checking are shown in the table, so many careless mistakes were deleted at this time. We Call say that 98-99 F-measure is the performance of experts who create them very carefully, and 94 is a usual person's perfof lnance.</Paragraph>
      <Paragraph position="2"> We can find a similar pattern of performance among different NEs. Hmnans also performed more poorly for artifacts and very well for time and numeric expressions.</Paragraph>
      <Paragraph position="3"> Tile difference t)etween the best system performance and hulnan perfbrlnance is 7 or more F-measure, as opposed to the case in English where the top systems perform at, a level comparable or superior to human perfornlancc. There could be several reasons for this. One obvious reason is that we introduced a ditficult NE type, artifact~ which degrades the overall performance more for the system side than the lmman side.</Paragraph>
      <Paragraph position="4"> Also, the difficulty of identifying the expression boundaries may contribute to the difference. Finally, we believe that the systems can possibly improve, as IREX was the first evaluation based project in Japanese, whereas in English there trove been 7 MUC's and tile technology may have matured by now.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>