<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1630"> <Title>Unsupervised Named Entity Transliteration Using Temporal and Phonetic Correlation</Title> <Section position="6" start_page="252" end_page="255" type="evalu"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"> We evaluate our algorithms on three comparable corpora: English/Arabic, English/Chinese, and English/Hindi. Data statistics are shown in Table 5.</Paragraph> <Paragraph position="1"> From each data set in Table 5, we picked out all news articles from seven randomly selected days. We identified about 6800 English names using the entity recognizer from (Carlson et al., 1999), and chose the most frequent 200 names as our English named entity candidates. Note that we chose the most frequent names because the reliability of the statistical correlation depends on the size of the sample data. When a name is rare in a collection, one can either use only the phonetic model, which does not depend on the sample size, or else one must expand the data set and hope for more occurrences. To generate the Hindi and Arabic candidates, all words from the same seven days were extracted. The words were stemmed all possible ways using simple hand-developed affix lists: for example, given a Hindi word c1c2c3, if both c3 and c2c3 are in our suffix and ending list, then this single word generates three possible candidates: c1, c1c2, and c1c2c3. In contrast, Chinese candidates were extracted using a list of 495 characters that are frequently used for foreign names (Sproat et al., 1996). A sequence of three or more such characters from the list is taken as a possible name. 
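The two candidate-generation strategies just described can be sketched as follows. This is a minimal illustration, not the authors' code: the suffix list and the foreign-name character set below are small stand-ins for the actual hand-developed affix lists and the 495-character list from (Sproat et al., 1996).

```python
# Placeholder suffix/ending list (the real resource is hand-developed per language).
SUFFIXES = {"on", "en", "o"}

def stem_candidates(word, suffixes=SUFFIXES):
    """Stem a word all possible ways: strip each matching suffix
    independently, keeping the full word itself as a candidate
    (e.g. c1c2c3 with suffixes c3 and c2c3 yields c1, c1c2, c1c2c3)."""
    candidates = {word}
    for suf in suffixes:
        if word.endswith(suf) and len(word) > len(suf):
            candidates.add(word[: -len(suf)])
    return candidates

def chinese_candidates(text, charset, min_len=3):
    """Extract maximal runs of at least min_len consecutive characters
    drawn from the foreign-name character set."""
    out, run = [], []
    for ch in text:
        if ch in charset:
            run.append(ch)
        else:
            if len(run) >= min_len:
                out.append("".join(run))
            run = []
    if len(run) >= min_len:
        out.append("".join(run))
    return out
```

For the Chinese case, any character outside the set terminates the current run, so only contiguous sequences of listed characters survive as name candidates.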
The number of candidates for each target language is presented in the last column of Table 5.</Paragraph> <Paragraph position="2"> We measured the accuracy of transliteration by Mean Reciprocal Rank (MRR), a measure commonly used in information retrieval when there is precisely one correct answer (Kantor and Voorhees, 2000).</Paragraph> <Paragraph position="3"> We attempted to create a complete set of answers for the 200 English names in our test set, but a small number of English names do not seem to have any standard transliteration in the target language according to the resources that we consulted, and we removed these names from the evaluation set. Thus we ended up with a list of fewer than 200 English names, shown in the second column of Table 6 (All). Furthermore, some correct transliterations are not found in our candidate list for the second language, for two reasons: (1) the answer does not occur at all in the target news articles (Table 6, # Missing 1); (2) the answer is there, but our candidate generation method has missed it (Table 6, # Missing 2).</Paragraph> <Paragraph position="4"> This results in an even smaller set of names to evaluate (Core); this smaller number is given in the fifth column of Table 6. We compute MRRs on the two sets of candidates -- those represented by the count in column 2, and the smaller set represented by the count in column 5; we term the former MRR &quot;AllMRR&quot; and the latter &quot;CoreMRR&quot;.2 It is worth noting that the major reason for not finding a candidate transliteration of an English name in the target language is almost always that it is genuinely not there, rather than that our candidate generation method has missed it. Presumably this reflects the fact that the corpora are merely comparable, rather than parallel. 
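The MRR computation above can be made concrete with a short sketch. This is a generic illustration of the metric, not the authors' evaluation code; a query whose correct answer is absent from the candidate ranking contributes a reciprocal rank of 0.

```python
def mean_reciprocal_rank(ranked_lists, answers):
    """MRR over a set of queries: ranked_lists[i] is the candidate
    ranking (best first) for query i, answers[i] its single correct
    answer. Each query contributes 1/rank of the correct answer,
    or 0 if the answer is missing from the candidates."""
    total = 0.0
    for ranking, gold in zip(ranked_lists, answers):
        if gold in ranking:
            total += 1.0 / (ranking.index(gold) + 1)
    return total / len(ranked_lists)
```

Under this definition, AllMRR and CoreMRR differ only in which queries are included: AllMRR keeps names whose answers may be missing (contributing 0), while CoreMRR restricts to names whose correct transliteration is among the candidates.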
But the important point is that if we were working with truly parallel data, where virtually all source-language names would have target-language equivalents, the true performance of the system would be closer to what we report below for CoreMRR.</Paragraph> <Section position="1" start_page="254" end_page="254" type="sub_section"> <SectionTitle> 4.1 Performance of phonetic method and time correlation method </SectionTitle> <Paragraph position="0"> The performance of the phonetic method and the time correlation method is reported in Table 7, top and middle panels, respectively. In addition to the MRR scores, we also report another metric -- CorrRate, namely the proportion of times the first candidate is the correct one.</Paragraph> <Paragraph position="1"> Each of the two methods has advantages and disadvantages. The time correlation method relies more on the quality of the comparable corpora.</Paragraph> <Paragraph position="2"> It is perhaps not surprising that the time correlation method performs best on English/Chinese, since these data come from the same source (Xinhua). Because the English and Hindi corpora are from different news agencies (Xinhua and Naidunia), the method performs relatively poorly there.</Paragraph> <Paragraph position="3"> On the other hand, the phonetic method is less affected by corpus quality, but is sensitive to differences between languages. (Footnote 2: We are aware that the resulting test set is very small, but we believe that it is large enough to demonstrate that the method is effective.)</Paragraph> <Paragraph position="4"> As discussed in the introduction, Hindi is relatively easy, and so we see the best MRR scores there. The performance is worse on Chinese and Arabic. 
It makes sense then to consider combining the two methods.</Paragraph> </Section> <Section position="2" start_page="254" end_page="255" type="sub_section"> <SectionTitle> 4.2 Method combination </SectionTitle> <Paragraph position="0"> In this section, we evaluate the performance of such a combination. We first use the phonetic method to filter out unlikely candidates, and then apply both the phonetic method and the time correlation method to rank the candidates.</Paragraph> <Paragraph position="1"> We explore two combination methods: score combination and rank combination. In score combination, since the scores of the two methods are not on the same scale, we first normalize them into the range [0, 1], where 1 is the best transliteration score and 0 the worst. Given a phonetic score p and a time correlation score t on the same transliteration pair, the final combination score f is: f = a·p + (1 − a)·t, where a ∈ [0, 1] is a linear combination parameter. For rank combination, we take the unnormalized rankings of each candidate pair by the two methods and combine them as follows: r_combined = a·r_p + (1 − a)·r_t, where r_p and r_t are the phonetic and temporal rankings, respectively.</Paragraph> <Paragraph position="2"> The bottom panel of Table 7 shows the CoreMRR scores for these combination methods.</Paragraph> <Paragraph position="3"> In the second and third columns, we repeat the phonetic and time correlation scores for ease of comparison. The fourth column and the sixth column give the combination results with a = 0.5 for both combination methods. The fifth column and the last column give the best MRR scores that we can achieve by tuning a. Score combination, in particular, significantly outperforms the individual phonetic and time correlation methods alone.</Paragraph> <Paragraph position="4"> Figure 1 plots the performance for all three languages over a range of values of a for the score combination method. 
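The two combination schemes can be sketched as below. This is an illustrative implementation under stated assumptions, not the authors' code: min-max normalization is one natural reading of "normalize into the range [0, 1]", and we assume higher raw scores are better while lower ranks are better.

```python
def normalize(scores):
    """Min-max normalize scores into [0, 1], with 1 the best.
    Assumes higher raw score = better transliteration."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def score_combination(phonetic, temporal, a=0.5):
    """f = a*p + (1 - a)*t on normalized scores, a in [0, 1]."""
    p, t = normalize(phonetic), normalize(temporal)
    return [a * pi + (1 - a) * ti for pi, ti in zip(p, t)]

def rank_combination(phonetic_ranks, temporal_ranks, a=0.5):
    """r_combined = a*r_p + (1 - a)*r_t on raw (unnormalized)
    ranks; a lower combined rank is better."""
    return [a * rp + (1 - a) * rt
            for rp, rt in zip(phonetic_ranks, temporal_ranks)]
```

With a = 1 either scheme reduces to the phonetic method alone, and with a = 0 to the time correlation method alone, which is why sweeping a traces out the trade-off between the two.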
Note that a higher a puts more weight on the phonetic model.</Paragraph> <Paragraph position="5"> As we have noted above, favoring the phonetic model is an advantage in our English/Hindi evaluation, where the phonetic correspondence between the two languages is fairly close but the data sources are quite different; for Arabic and Chinese we observe the opposite tendency. This suggests that one can set a according to whether one trusts one's data source versus whether one trusts the similarity of the two languages' phonotactics. (Footnote 3: A reviewer notes that we have not compared our method to state-of-the-art supervised transliteration models. This is true, but in the absence of a common evaluation set for transliteration, such a comparison would be meaningless. Certainly there are no standard databases, so far as we know, for the three language pairs we have been considering.)</Paragraph> </Section> </Section> </Paper>