<?xml version="1.0" standalone="yes"?>
<Paper uid="P02-1051">
  <Title>Translating Named Entities Using Monolingual and Bilingual Resources</Title>
  <Section position="7" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
6 Evaluation and Discussion
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.1 Test Set
</SectionTitle>
      <Paragraph position="0"> This section presents our evaluation results on the named entity translation task. We compare the translation results obtained from human translations, a commercial MT system, and our named entity translation system. The evaluation corpus consists of two different test sets, a development test set and a blind test set. The first set consists of 21 Arabic newspaper articles taken from the political affairs section of the daily newspaper Al-Riyadh. Named entity phrases in these articles were hand-tagged according to the MUC (Chinchor, 1997) guidelines.</Paragraph>
      <Paragraph position="1"> They were then translated to English by a bilingual speaker (a native speaker of Arabic) given the text they appear in. The Arabic phrases were then paired with their English translations.</Paragraph>
      <Paragraph position="2"> The blind test set consists of 20 Arabic newspaper articles that were selected from the political section of the Arabic daily Al-Hayat. The articles have already been translated into English by professional translators.3 Named entity phrases in these articles were hand-tagged, extracted, and paired with their English translations to create the blind test set.</Paragraph>
      <Paragraph position="3"> Table 1 shows the distribution of the named entity phrases into the three categories PERSON, ORGA-NIZATION , and LOCATION in the two data sets.</Paragraph>
      <Paragraph position="4"> The English translations in the two data sets were reviewed thoroughly to correct any wrong translations made by the original translators. For example, to find the correct translation of a politician's name, official government web pages were used to find the  test sets into the categories PERSON, ORGANI-ZATION , and LOCATION. The numbers shown are the ratio of each category to the total.</Paragraph>
      <Paragraph position="5"> correct spelling. In cases where the translation could not be verified, the original translation provided by the human translator was considered the &amp;quot;correct&amp;quot; translation. The Arabic phrases and their correct translations constitute the gold-standard translation for the two test sets.</Paragraph>
      <Paragraph position="6"> According to our evaluation criteria, only translations that match the gold-standard are considered as correct. In some cases, this criterion is too rigid, as it will consider perfectly acceptable translations as incorrect. However, since we use it mainly to compare our results with those obtained from the human translations and the commercial system, this criterion is sufficient. The actual accuracy figures might be slightly higher than what we report here.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.2 Evaluation Results
</SectionTitle>
      <Paragraph position="0"> In order to evaluate human performance at this task, we compared the translations by the original human translators with the correct translations on the goldstandard. The errors made by the original human translators turned out to be numerous, ranging from simple spelling errors (e.g., Custa Rica vs. Costa Rica) to more serious errors such as transliteration errors (e.g., John Keele vs. Jon Kyl) and other translation errors (e.g., Union Reserve Council vs. Federal Reserve Board).</Paragraph>
      <Paragraph position="1"> The Arabic documents were also translated using a commercial Arabic-to-English translation system.4 The translation of the named entity phrases are then manually extracted from the translated text.</Paragraph>
      <Paragraph position="2"> When compared with the gold-standard, nearly half of the phrases in the development test set and more than a third of the blind test were translated incorrectly by the commercial system. The errors can be classified into several categories including: poor  transliterations (e.g., Koln Baol vs. Colin Powell), translating a name instead of sounding it out (e.g., O'Neill's urine vs. Paul O'Neill), wrong translation (e.g., Joint Corners Organization vs.</Paragraph>
      <Paragraph position="3"> Joint Chiefs of Staff) or wrong word order (e.g.,the Church of the Orthodox Roman).</Paragraph>
      <Paragraph position="4"> Table 2 shows a detailed comparison of the translation accuracy between our system, the commercial system, and the human translators. The translations obtained by our system show significant improvement over the commercial system. In fact, in some cases it outperforms the human translator. When we consider the top-20 translations, our system's overall accuracy (84%) is higher than the human's (75.3%) on the blind test set. This means that there is a lot of room for improvement once we consider more effective re-scoring methods. Also, the top-20 list in itself is often useful in providing phrasal translation candidates for general purpose statistical machine translation systems or other NLP systems.</Paragraph>
      <Paragraph position="5"> The strength of our translation system is in translating person names, which indicates the strength of our transliteration module. This might also be attributed to the low named entity coverage of our bilingual dictionary. In some cases, some words that need to be translated (as opposed to transliterated) are not found in our bilingual dictionary which may lead to incorrect location or organization translations but does not affect person names. The reason word translations are sometimes not found in the dictionary is not necessarily because of the spotty coverage of the dictionary but because of the way we access definitions in the dictionary. Only shallow morphological analysis (e.g., removing prefixes and suffixes) is done before accessing the dictionary, whereas a full morphological analysis is necessary, especially for morphologically rich languages such as Arabic. Another reason for doing poorly on organizations is that acronyms and abbreviations in the Arabic text (e.g., &amp;quot; a0 a8a43 w-as,&amp;quot; the Saudi Press Agency) are currently not handled by our system.</Paragraph>
      <Paragraph position="6"> The blind test set was selected from the FBIS 2001 Multilingual Corpus. The FBIS data is collected by the Foreign Broadcast Information Service for the benefit of the US government. We suspect that the human translators who translated the documents into English are somewhat familiar with the genre of the articles and hence the named entities  on the development and blind test sets. Only a match with the translation in the gold-standard is considered a correct translation. The human translator results are obtained by comparing the translations provided by the original human translator with the translations in the gold-standard. The Sakhr results are for the Web version of Sakhr's commercial system. The Top-1 results of our system considers whether the correct answer is the top candidate or not, while the Top-20 results considers whether the correct answer is among the top-20 candidates. Overall is a weighted average of the three named entity categories.  tally. Straight Web Counts re-score candidates based on their Web counts. Contextual Web Counts uses Web counts within a given context (we used here title of the document as the contextual information). In Co-reference, if the phrase to be translated is part of a longer phrase then we use the the ranking of the candidates for the longer phrase to re-rank the candidates of the short one, otherwise we leave the list as is. that appear in the text. On the other hand, the development test set was randomly selected by us from our pool of Arabic articles and then submitted to the human translator. Therefore, the human translations in the blind set are generally more accurate than the human translations in the development test. Another reason might be the fact that the human translator who translated the development test is not a professional translator.</Paragraph>
      <Paragraph position="7"> The only exception to this trend is organizations.</Paragraph>
      <Paragraph position="8"> After reviewing the translations, we discovered that many of the organization translations provided by the human translator in the blind test set that were judged incorrect were acronyms or abbreviations for the full name of the organization (e.g., the INC instead of the Iraqi National Congress).</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.3 Effects of Re-Scoring
</SectionTitle>
      <Paragraph position="0"> As we described earlier in this paper, our translation system first generates a list of translation candidates, then re-scores them using several re-scoring methods. The list of translation candidates we used for these experiments are of size 20. The re-scoring methods are applied incrementally where the re-ranked list of one module is the input to the next module. Table 3 shows the translation accuracy after each of the methods we evaluated.</Paragraph>
      <Paragraph position="1"> The most effective re-scoring method was the simplest, the straight Web counts. This is because re-scoring methods are applied incrementally and straight Web counts was the first to be applied, and so it helps to resolve the &amp;quot;easy&amp;quot; cases, whereas the other methods are left with the more &amp;quot;difficult&amp;quot; cases. It would be interesting to see how rearranging the order in which the modules are applied might affect the overall accuracy of the system.</Paragraph>
      <Paragraph position="2"> The re-scoring methods we used so far are in general most effective when applied to person name translation because corpus phrase counts are already being used by the candidate generator for producing candidates for locations and organizations, but not for persons. Also, the re-scoring methods we used were initially developed and applied to per-son names. More effective re-scoring methods are clearly needed especially for organization names.</Paragraph>
      <Paragraph position="3"> One method is to count phrases only if they are tagged by a named entity identifier with the same tag we are interested in. This way we can eliminate counting wrong translations such as enthusiasm when translating &amp;quot; a0 a24a32 a87 h. m-as&amp;quot; (Hamas).</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>