<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1630">
  <Title>Unsupervised Named Entity Transliteration Using Temporal and Phonetic Correlation</Title>
  <Section position="5" start_page="250" end_page="252" type="metho">
    <SectionTitle>
3 Transliteration with Comparable Corpora
</SectionTitle>
    <Paragraph position="0"> We start from comparable corpora, consisting of newspaper articles in English and the target languages for the same time period. In this paper, the target languages are Arabic, Chinese and Hindi.</Paragraph>
    <Paragraph position="1"> We then extract named entities in the English text using the named-entity recognizer described in (Li et al., 2004), which is based on the SNoW machine learning toolkit (Carlson et al., 1999). To perform transliteration, we use the following general approach: (1) extract named entities from the English corpus for each day; (2) extract candidates from the same day's newspapers in the target language; (3) for each English named entity, score and rank the target-language candidates as potential transliterations. We apply two unsupervised methods, time correlation and pronunciation-based scoring, both independently and in combination.</Paragraph>
    <Section position="1" start_page="250" end_page="251" type="sub_section">
      <SectionTitle>
3.1 Candidate scoring based on pronunciation
</SectionTitle>
      <Paragraph position="0"> Our phonetic transliteration score uses a standard string-alignment and alignment-scoring technique based on (Kruskal, 1999), in which the distance is determined by a combination of substitution, insertion and deletion costs. These costs are computed from a language-universal cost matrix based on phonological features and the degree of phonetic similarity. (Our technique is thus similar to other work on phonetic similarity such as (Frisch, 1996), though details differ.) We construct a single cost matrix and apply it to English and all target languages. This technique requires knowledge of the phonetics and the sound change patterns of the language, but it does not require a transliteration-pair training dictionary. In this paper we assume the WorldBet transliteration system (Hieronymus, 1995), an ASCII-only version of the IPA.</Paragraph>
      <Paragraph position="1"> The cost matrix is constructed in the following way. All phonemes are decomposed into standard phonological features. However, phonological features alone are not enough to model the possible substitution/insertion/deletion patterns of languages. For example, /h/ is deleted more frequently than other consonants, yet no single phonological feature distinguishes /h/ from other consonants. Similarly, stop and fricative consonants such as /p, t, k, b, d, g, s, z/ are frequently deleted when they appear in coda position. This tendency is especially salient when the target language does not allow coda consonants or consonant clusters. For instance, Chinese allows only [n, N] in coda position, so stop consonants in coda position are frequently lost; Stanford is transliterated as sitanfu, with the final /d/ lost. Since phonological features do not take the position in the syllable into account, this pattern cannot be captured by conventional phonological features alone.</Paragraph>
      <Paragraph position="2"> To capture this, an additional feature, &quot;deletion of stop/fricative consonant in the coda position&quot;, is added. We base these observations, and the concomitant pseudofeatures, on pronunciation error data of learners of English as a second language, as reported in (Swan and Smith, 2002). Errors in second language pronunciation are determined by the difference between the phonological systems of the learner's first and second languages. The same substitution/deletion/insertion patterns found in second language learners' errors also appear in the transliteration of foreign names. For example, if the learner's first language does not have a particular phoneme found in English, it is substituted by the most similar phoneme in their first language. Since Chinese does not have /v/, it is frequently substituted by /w/ or /f/. This substitution occurs frequently in the transliteration of foreign names in Chinese. Swan &amp; Smith's study covers 25 languages, and includes Asian languages such as Thai, Korean, Chinese and Japanese, European languages such as German, Italian, French and Polish, and Middle Eastern languages such as Arabic and Farsi. Frequent substitution/insertion/deletion patterns of phonemes are collected from these data. Some examples are presented in Table 1.</Paragraph>
      <Paragraph position="3"> Twenty phonological features and 14 pseudofeatures are used for the construction of the cost matrix. All features are classified into 5 classes.</Paragraph>
      <Paragraph position="4"> There are 4 classes of consonantal features: place, manner, laryngeality and major (consonant, sonorant, syllabicity). Vocalic features form a separate fifth class. The purpose of these classes is to define groups of features which share the same substitution/insertion/deletion costs. Formally, given a class C and a cost C_C, for each feature f ∈ C, C_C defines the cost of substituting a different value for f than the one present in the source phoneme. Among manner features, the feature continuous is classified separately, since substitution between stop and fricative consonants is very frequent, whereas substitution between, say, nasals and fricatives is much less common. The cost for frequent sound change patterns should be low. Based on our intuitions, our pseudofeatures are classified into one or another of the above-mentioned five classes. The substitution/deletion/insertion cost for a pair of phonemes is the sum of the individual costs of the features which are different between the two phonemes.</Paragraph>
      <Paragraph position="5"> For example, /n/ and /p/ differ in the sonorant, labial and coronal features. Therefore, the substitution cost of /n/ for /p/ is the sum of the sonorant, labial and coronal costs (20+10+10 = 40). Features and associated costs are shown in Table 2. Sample substitution, insertion, and deletion costs for /g/ are presented in Table 3.</Paragraph>
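The feature-sum rule in the worked example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three costs (sonorant 20, labial 10, coronal 10) are taken from the /n/ versus /p/ example, while the binary feature bundles, the extra phoneme /t/, and the names FEATURE_COST, PHONEMES and substitution_cost are hypothetical and do not reproduce the full inventory of Table 2.

```python
# Costs for the three features named in the worked example (20 + 10 + 10 = 40).
# The complete cost table (Table 2) is not reproduced here.
FEATURE_COST = {"sonorant": 20, "labial": 10, "coronal": 10}

# Hypothetical binary feature bundles, restricted to the three features above.
PHONEMES = {
    "n": {"sonorant": 1, "labial": 0, "coronal": 1},
    "p": {"sonorant": 0, "labial": 1, "coronal": 0},
    "t": {"sonorant": 0, "labial": 0, "coronal": 1},
}


def substitution_cost(a, b):
    """Sum the costs of every feature whose value differs between a and b."""
    fa, fb = PHONEMES[a], PHONEMES[b]
    return sum(cost for feat, cost in FEATURE_COST.items() if fa[feat] != fb[feat])


if __name__ == "__main__":
    print(substitution_cost("n", "p"))  # the three differing features are summed
```

With this toy inventory, identical phonemes cost nothing, and phonemes that differ in fewer features are cheaper to substitute, which is the behaviour the cost matrix is designed to have.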
      <Paragraph position="6"> The resulting cost matrix based on these principles is then used to calculate the edit distance between two phonetic strings. Pronunciations for English words are obtained using the Festival text-to-speech system (Taylor et al., 1998), and the target language words are automatically converted into their phonemic level transcriptions by various language-dependent means. In the case of Mandarin Chinese this is based on the standard pinyin transliteration system. For Arabic this is based on the orthography, which works reasonably well given that (apart from the fact that short vowels are not represented) the script is fairly phonemic.</Paragraph>
      <Paragraph position="7"> Similarly, the pronunciation of Hindi can be reasonably well approximated based on the standard Devanagari orthographic representation. The edit cost for the pair of strings is normalized by the number of phonemes. The resulting score ranges from zero upwards; the score is used to rank candidate transliterations, with the candidate having the lowest cost being considered the most likely transliteration. Some examples of English words and the top three ranking candidates among all of the potential target-language candidates are given in Table 4. Starred entries are correct.</Paragraph>
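A minimal sketch of the scoring and ranking step, assuming a standard dynamic-programming edit distance. The flat toy costs below stand in for the feature-based cost matrix, and dividing by the length of the longer phoneme string is one assumed reading of "normalized by the number of phonemes"; the text does not specify which count is used. All function and variable names are illustrative.

```python
def edit_cost(src, tgt, sub_cost, ins_cost, del_cost):
    """Weighted edit distance between two phoneme sequences."""
    rows, cols = len(src) + 1, len(tgt) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):  # delete every source phoneme
        d[i][0] = d[i - 1][0] + del_cost(src[i - 1])
    for j in range(1, cols):  # insert every target phoneme
        d[0][j] = d[0][j - 1] + ins_cost(tgt[j - 1])
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j - 1] + sub_cost(src[i - 1], tgt[j - 1]),
                d[i - 1][j] + del_cost(src[i - 1]),
                d[i][j - 1] + ins_cost(tgt[j - 1]),
            )
    return d[-1][-1]


def normalized_cost(src, tgt, **costs):
    """Edit cost divided by the longer sequence length (an assumption)."""
    return edit_cost(src, tgt, **costs) / max(len(src), len(tgt), 1)


def rank_candidates(src, candidates, **costs):
    """Sort candidates so that the lowest normalized cost comes first."""
    return sorted(candidates, key=lambda cand: normalized_cost(src, cand, **costs))


def toy_sub(a, b):
    return 0 if a == b else 10  # placeholder for the feature-based cost


def toy_indel(p):
    return 10  # placeholder insertion/deletion cost


TOY_COSTS = {"sub_cost": toy_sub, "ins_cost": toy_indel, "del_cost": toy_indel}

if __name__ == "__main__":
    source = ("a", "b", "c")
    cands = [("x", "y", "z"), ("a", "b", "d"), ("a", "b", "c")]
    print(rank_candidates(source, cands, **TOY_COSTS))
```

Phoneme sequences are passed as tuples rather than strings because WorldBet symbols can span more than one ASCII character.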
    </Section>
    <Section position="2" start_page="251" end_page="252" type="sub_section">
      <SectionTitle>
3.2 Candidate scoring based on time correlation
</SectionTitle>
      <Paragraph position="0"> Names of the same entity that occur in different languages often have correlated frequency patterns due to common triggers such as a major event. For example, the 2004 tsunami disaster was covered in news articles in many different languages. We would thus expect to see a peak in the frequency of names such as Sri Lanka, India, and Indonesia in news articles published in multiple languages in the same time period. In general, we may expect topically related names in different languages to tend to co-occur over time. Thus if we have comparable news articles over a sufficiently long time period, it is possible to exploit such correlations to learn the associations of names in different languages.</Paragraph>
      <Paragraph position="1"> The idea of exploiting time correlation has been well studied. We adopt the method proposed in (Tao and Zhai, 2005) to represent the source name and each name candidate with a frequency vector and score each candidate by the similarity of the two frequency vectors. [Table caption fragment, apparently for Table 1; beginning lost in extraction: … learner's data reported in (Swan and Smith, 2002). Each row shows an input phoneme class, possible output phonemes (including null), and the positions where the substitution (or deletion) is likely to occur.] [Table note fragment; beginning lost in extraction: … denotes a situation such as the semivowel [j] substituting for the affricate [dZ]. Substitutions between these two sounds actually occur frequently in second-language error data.]</Paragraph>
      <Paragraph position="2"> This is very similar to the case in information retrieval, where a query and a document are often represented by term vectors and documents are ranked by the similarity between their vectors and the query vector (Salton and McGill, 1983). But the vectors are very different and should be constructed in quite different ways. Following (Tao and Zhai, 2005), we also normalize the raw frequency vector so that it becomes a frequency distribution over all the time points. To compute the similarity between two distribution vectors x = (x_1, ..., x_T) and y = (y_1, ..., y_T), the Pearson correlation coefficient was used in (Tao and Zhai, 2005). We also consider two other commonly used measures, cosine (Salton and McGill, 1983) and Jensen-Shannon divergence (Lin, 1991), though our results show that the Pearson correlation coefficient performs better than these two other methods. Since the time correlation method and the phonetic correspondence method exploit distinct resources, it makes sense to combine them. We explore two approaches to combining these two methods, namely score combination and rank combination. These will be defined below in Section 4.2.</Paragraph>
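The three similarity measures named in this section can be sketched as follows, using textbook definitions only. The function names are illustrative, the zero-variance guard in pearson and the base-2 logarithm in js_divergence are assumptions, and nothing here is claimed to match the exact implementation in (Tao and Zhai, 2005).

```python
import math


def to_distribution(counts):
    """Normalize raw per-day counts into a distribution over time points."""
    total = sum(counts)
    if total == 0:
        raise ValueError("name never occurs in the time window")
    return [c / total for c in counts]


def pearson(x, y):
    """Pearson correlation coefficient; higher means more similar."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # assumed convention for a perfectly flat vector
    return cov / (sx * sy)


def cosine(x, y):
    """Cosine of the angle between the two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)


def js_divergence(x, y):
    """Jensen-Shannon divergence in bits; lower means more similar."""
    m = [(a + b) / 2 for a, b in zip(x, y)]

    def kl(p, q):
        return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)

    return 0.5 * kl(x, m) + 0.5 * kl(y, m)


if __name__ == "__main__":
    english = to_distribution([3, 1, 0, 0])    # hypothetical daily counts
    candidate = to_distribution([6, 2, 0, 0])  # same shape, different scale
    print(pearson(english, candidate), js_divergence(english, candidate))
```

Note that Pearson and cosine are similarities while Jensen-Shannon is a divergence, so a ranking based on the latter has to sort in the opposite direction.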
    </Section>
  </Section>
</Paper>