<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1010">
  <Title>Named Entity Transliteration with Comparable Corpora</Title>
  <Section position="6" start_page="75" end_page="78" type="evalu">
    <SectionTitle>
4 Evaluation
</SectionTitle>
    <Paragraph position="0"> We use a comparable English-Chinese corpus to evaluate our methods for Chinese transliteration.</Paragraph>
    <Paragraph position="1"> We take one day's worth of comparable news articles (234 Chinese stories and 322 English stories), generate about 600 English names with the entity recognizer (Li et al., 2004) as described above, and find potential Chinese transliterations as previously described. We generated 627 Chinese candidates. In principle, all 600 × 627 pairs are potential transliterations. We then apply the phonetic and time correlation methods to score and rank all the candidate Chinese-English correspondences. To evaluate the proposed transliteration methods quantitatively, we measure the accuracy of the ranked list by Mean Reciprocal Rank (MRR), a measure commonly used in information retrieval when there is precisely one correct answer (Kantor and Voorhees, 2000). The reciprocal rank is the reciprocal of the rank of the correct answer.</Paragraph>
    <Paragraph position="2"> For example, if the correct answer is ranked as the first, the reciprocal rank would be 1.0, whereas if it is ranked the second, it would be 0.5, and so forth. To evaluate the results for a set of English names, we take the mean of the reciprocal rank of each English name.</Paragraph>
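As an illustration of how MRR is computed, the following is a minimal sketch; the query names, candidate lists, and answer keys are hypothetical, not from the paper's data:

```python
# Minimal MRR sketch: each English name maps to a ranked candidate list
# (best first) and at most one correct answer.
def mean_reciprocal_rank(ranked_lists, answers):
    """ranked_lists: dict name -> list of candidates, best first.
    answers: dict name -> the single correct candidate (or absent)."""
    total = 0.0
    for name, candidates in ranked_lists.items():
        answer = answers.get(name)
        if answer in candidates:
            rank = candidates.index(answer) + 1  # ranks are 1-based
            total += 1.0 / rank
        # a name whose answer is missing contributes 0, as in AllMRR
    return total / len(ranked_lists)
```

Restricting the input to names whose answer does appear in the candidate list gives the CoreMRR variant described below.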
    <Paragraph position="3"> We attempted to create a complete set of answers for all the English names in our test set, but a small number of English names do not seem to have any standard transliteration according to the resources we consulted. We ended up with a list of about 490 of the 600 English names judged. We further notice that some answers (about 20%) are not in our Chinese candidate set. This can happen for two reasons: (1) the answer does not occur in the Chinese news articles we examined; (2) the answer is there, but our candidate generation method missed it. To see more clearly how accurate each method is for ranking the candidates, we also compute the MRR for the subset of English names whose transliteration answers are in our candidate list. We distinguish the MRRs computed on these two sets of English names as &quot;AllMRR&quot; and &quot;CoreMRR&quot;. Below we first discuss the results of each of the two methods; we then compare the two methods and discuss results from combining them.</Paragraph>
    <Section position="1" start_page="76" end_page="76" type="sub_section">
      <SectionTitle>
4.1 Phonetic Correspondence
</SectionTitle>
      <Paragraph position="0"> We show sample results for the phonetic scoring method in Table 1. This table shows the 10 highest scoring transliterations for each Chinese character sequence based on all texts in the Chinese and English Xinhua newswire for the 13th of August, 2001. 8 out of these 10 are correct. [Table 1: highest scoring transliterations from the Xinhua corpus for 8/13/01. The final column is the -log P estimate for the transliteration. Starred entries are incorrect.]</Paragraph>
      <Paragraph position="1"> For all the English names the MRR is 0.3, and for the core names it is 0.89. Thus, on average, the correct answer, when it is included in our candidate list, is ranked first or nearly first.</Paragraph>
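The phonetic correspondence idea can be sketched as a best-alignment cost over phoneme sequences. The per-pair cost table below is purely illustrative and stands in for the paper's trained transliteration model; a real system would derive these costs from training data:

```python
# Hedged sketch: score an English/Chinese pair by the cheapest alignment
# of their phoneme sequences under per-pair costs, standing in for the
# -log P estimate reported in Table 1. Lower cost = better match.
import math

def alignment_cost(eng_phones, chi_phones, pair_cost, gap_cost=3.0):
    n, m = len(eng_phones), len(chi_phones)
    # dp[i][j] = minimal cost of aligning the first i English phonemes
    # with the first j Chinese phonemes
    dp = [[math.inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0:  # leave an English phoneme unaligned
                dp[i][j] = min(dp[i][j], dp[i-1][j] + gap_cost)
            if j > 0:  # leave a Chinese phoneme unaligned
                dp[i][j] = min(dp[i][j], dp[i][j-1] + gap_cost)
            if i > 0 and j > 0:  # align a phoneme pair
                c = pair_cost.get((eng_phones[i-1], chi_phones[j-1]), gap_cost)
                dp[i][j] = min(dp[i][j], dp[i-1][j-1] + c)
    return dp[n][m]
```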
    </Section>
    <Section position="2" start_page="76" end_page="77" type="sub_section">
      <SectionTitle>
4.2 Frequency correlation
</SectionTitle>
      <Paragraph position="1"> We proposed three similarity measures for the frequency correlation method: cosine similarity, the Pearson correlation coefficient, and Jensen-Shannon divergence. In Table 2, we show their MRRs. Given that the only resource the method needs is comparable text documents over a sufficiently long period, these results are quite encouraging. For example, with Pearson correlation, when the Chinese transliteration of an English name is included in our candidate list, the correct answer is, on average, ranked third or better. The results thus show that the idea of exploiting frequency correlation works. Among the three similarity measures, Pearson correlation performs best, followed by cosine similarity and then JS-divergence.</Paragraph>
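The three similarity measures can be sketched directly over day-by-day frequency vectors. The vectors in the usage example are hypothetical raw counts, one entry per day:

```python
# Sketch of the three similarity measures compared in Table 2, applied to
# frequency vectors of an English name and a Chinese candidate.
import math

def cosine(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    np_ = math.sqrt(sum(a * a for a in p))
    nq = math.sqrt(sum(b * b for b in q))
    return dot / (np_ * nq)

def pearson(p, q):
    n = len(p)
    mp, mq = sum(p) / n, sum(q) / n
    cov = sum((a - mp) * (b - mq) for a, b in zip(p, q))
    sp = math.sqrt(sum((a - mp) ** 2 for a in p))
    sq = math.sqrt(sum((b - mq) ** 2 for b in q))
    return cov / (sp * sq)

def js_divergence(p, q):
    # normalize raw counts to probability distributions first
    tp, tq = sum(p), sum(q)
    p = [x / tp for x in p]
    q = [x / tq for x in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(u, v):
        return sum(a * math.log(a / b) for a, b in zip(u, v) if a > 0)
    return (kl(p, m) + kl(q, m)) / 2
```

Note that cosine and Pearson are similarities (higher = more similar) while JS-divergence is a distance (lower = more similar), so candidates would be ranked in opposite directions.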
      <Paragraph position="2"> Compared with the phonetic correspondence method, the performance of the frequency correlation method is in general much worse, which is not surprising, given the fact that terms may be correlated merely because they are topically related.</Paragraph>
    </Section>
    <Section position="3" start_page="77" end_page="77" type="sub_section">
      <SectionTitle>
4.3 Combination of phonetic correspondence and frequency correlation
</SectionTitle>
      <Paragraph position="0"> Since the two methods exploit complementary resources, it is natural to ask whether we can improve performance by combining them. Indeed, intuitively the best candidate is the one that has a good pronunciation alignment as well as a correlated frequency distribution with the English name. We evaluated two strategies for combining the two methods. The first strategy is to use the phonetic model to filter out clearly impossible candidates and then use the frequency correlation method to rank the remaining candidates. The second is to combine the scores of the two methods. Since the correlation coefficient has a maximum value of 1, we normalize the phonetic correspondence scores by dividing all scores by the maximum score, so that the maximum normalized value is also 1.</Paragraph>
      <Paragraph position="1"> We then take the average of the two scores and rank the candidates based on their average scores.</Paragraph>
      <Paragraph position="2"> Note that the second strategy implies the application of the first strategy.</Paragraph>
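A minimal sketch of the two strategies follows; it assumes higher phonetic scores are better, and the filter cutoff is a hypothetical threshold for the clearly impossible candidates, not a value from the paper:

```python
# Sketch of the two combination strategies:
# (1) keep only candidates passing a phonetic filter;
# (2) average the max-normalized phonetic score with the correlation score.
def combine(candidates, phonetic, correlation, filter_cutoff=0.0):
    """candidates: list of Chinese candidate strings.
    phonetic, correlation: dicts mapping candidate to score, higher = better.
    filter_cutoff: hypothetical threshold for the phonetic filter."""
    best = max(phonetic[c] for c in candidates)
    ranked = []
    for c in candidates:
        if phonetic[c] > filter_cutoff:          # strategy 1: filter
            norm = phonetic[c] / best            # scale so the max is 1
            avg = (norm + correlation[c]) / 2.0  # strategy 2: average
            ranked.append((avg, c))
    ranked.sort(reverse=True)
    return [c for _, c in ranked]
```

Because any candidate that fails the filter receives no combined score, ranking by the averaged score subsumes the filtering step, which mirrors the remark that the second strategy implies the first.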
      <Paragraph position="3"> The results of these two combination strategies are shown in Table 3 along with the results of the two individual methods. We see that both combination strategies are effective and the MRRs of the combined results are all better than those of the two individual methods. It is interesting to see that the benefit of applying the phonetic correspondence model as a filter is quite significant. Indeed, although the performance of the frequency correlation method alone is much worse than that of the phonetic correspondence method, when working on the subset of candidates passing the phonetic filter (i.e., those candidates that have a reasonable phonetic alignment with the English name), it can outperform the phonetic correspondence method.</Paragraph>
      <Paragraph position="4"> This once again indicates that exploiting the frequency correlation can be effective. When combining the scores of these two methods, we not only (implicitly) apply the phonetic filter, but also exploit the discriminative power provided by the phonetic correspondence scores and this is shown to bring in additional benefit, giving the best performance among all the methods.</Paragraph>
    </Section>
    <Section position="4" start_page="77" end_page="78" type="sub_section">
      <SectionTitle>
4.4 Error Analysis
</SectionTitle>
      <Paragraph position="0"> From the results above, we see that the MRRs for the core English names are substantially higher than those for all the English names. This means that our methods perform very well whenever we have the answer in our candidate list, but we have also missed the answers for many English names.</Paragraph>
      <Paragraph position="1"> A missing answer in the candidate list is thus a major source of errors. To further understand the upper bound of our method, we manually add the missing correct answers to our candidate set and apply all the methods to rank this augmented set of candidates. The performance is reported in Table 4 alongside the corresponding performance on the original candidate set. We see that, as expected, the performance on the augmented candidate list, which can be interpreted as an upper bound of our method, is indeed much better. This suggests that if we can improve the candidate generation method to include the answers in the list, we can expect to significantly improve the performance of all the methods, which is clearly an interesting topic for further research. The relative performance of the different methods on this augmented candidate list is roughly the same as on the original candidate list, except that &quot;Freq+PhoneticFilter&quot; is slightly worse than the phonetic method alone, though still much better than frequency correlation alone. One possible explanation is that since these names do not necessarily occur in our comparable corpora, we may not have sufficient frequency observations for some of them.</Paragraph>
    </Section>
    <Section position="5" start_page="78" end_page="78" type="sub_section">
      <SectionTitle>
4.5 Experiments on score propagation
</SectionTitle>
      <Paragraph position="0"> To demonstrate that score propagation can further help transliteration, we use the combination scores in Table 3 as the initial scores and apply our propagation algorithm to iteratively update them. We remove entries that do not co-occur with others; there are 25 such English name candidates. Thus, the initial scores are actually slightly different from the values in Table 3. We show the new scores and the best propagation scores in Table 5. In the table, &quot;init.&quot; refers to the initial scores, and &quot;CO&quot; and &quot;MI&quot; stand for the best scores obtained using the co-occurrence or mutual information method, respectively. While both methods result in gains, CO very slightly outperforms the MI approach. In the score propagation process, we introduce two additional parameters: the interpolation parameter α and the number of iterations k. Figure 3 and Figure 4 show the effects of these parameters. Intuitively, we want to preserve the initial score of a pair but add a slight boost from its neighbors. Thus, we set α very close to 1 (0.9 and 0.95) and allow the system to perform 20 iterations. In both figures, the first few iterations clearly improve transliteration performance, demonstrating that the propagation method works. However, performance drops when more iterations are used, presumably due to noise introduced from more distantly connected nodes. Thus, a relatively conservative approach is to choose a high α value and run only a few iterations. Note, finally, that the CO method seems to be more stable than the MI method.</Paragraph>
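The propagation step can be sketched as the interpolation described above; the neighbor weights here are hypothetical stand-ins for the row-normalized co-occurrence (CO) or mutual-information (MI) weights, which this sketch does not estimate:

```python
# Hedged sketch of score propagation: each name pair keeps most of its own
# score (weight alpha) and receives a small boost from co-occurring
# neighbor pairs. Assumes each pair's neighbor weights sum to 1.
def propagate(scores, neighbors, alpha=0.9, iterations=5):
    """scores: dict pair_id -> initial combined score.
    neighbors: dict pair_id -> list of (other_pair_id, weight)."""
    current = dict(scores)
    for _ in range(iterations):
        updated = {}
        for pair, s in current.items():
            boost = sum(w * current[other]
                        for other, w in neighbors.get(pair, []))
            updated[pair] = alpha * s + (1 - alpha) * boost
        current = updated
    return current
```

With α near 1 each iteration perturbs the scores only slightly, which matches the observation that a few iterations help while many iterations let noise from distantly connected nodes accumulate.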
    </Section>
  </Section>
</Paper>