<?xml version="1.0" standalone="yes"?> <Paper uid="P02-1051"> <Title>Translating Named Entities Using Monolingual and Bilingual Resources</Title> Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002, pp. 400-408. <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Our Approach </SectionTitle> <Paragraph position="0"> The frequency of named-entity phrases in news text reflects the significance of the events they are associated with. When translating named entities in news stories of international importance, the same event will most likely be reported in many languages, including the target language. Instead of having to come up with translations for named entities in a document that may contain many unknown words, it is sometimes easier for a human to find a document in the target language that is similar to, but not necessarily a translation of, the original document, and then to extract the translations. 
Let's illustrate this idea with the following example:</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Example </SectionTitle> <Paragraph position="0"> We would like to translate the named entities that appear in the following Arabic excerpt:</Paragraph> <Paragraph position="2"> The Arabic newspaper article from which we extracted this excerpt is about negotiations between the US and North Korean authorities regarding the search for the remains of US soldiers who died during the Korean war.</Paragraph> <Paragraph position="3"> We presented the Arabic document to a bilingual speaker and asked them to translate the locations.</Paragraph> <Paragraph position="5"> The translations provided were Chozin Reserve, Onsan, and Kojanj.</Paragraph> <Paragraph position="6"> The human clearly attempted to sound out the names but, despite coming close, failed to render them correctly, as we will see later.</Paragraph> <Paragraph position="7"> When translating unknown or unfamiliar names, one effective approach is to search for an English document that discusses the same subject and then extract the translations from it. For this example, we start by creating the following Web query to use with the search engine: Search Query 1: soldiers remains, search, North Korea, and US.</Paragraph> <Paragraph position="8"> This query returned many hits. 
The top document returned by the search engine we used contained the following paragraph: The targeted area is near Unsan, which saw several battles between the U.S. Army's 8th Cavalry regiment and Chinese troops who launched a surprise offensive in late 1950.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> </SectionTitle> <Paragraph position="0"> </Paragraph> <Paragraph position="1"> This allowed us to create a more precise query by adding Unsan to the search terms: Search Query 2: soldiers remains, search, North Korea, US, and Unsan.</Paragraph> <Paragraph position="2"> This search query returned only 3 documents. The first is the document above. The third is the top-level page for the second document. The second document contained the following excerpt: Operations in 2001 will include areas of investigation near Kaechon, approximately 18 miles south of Unsan and Kujang. Kaechon includes an area nicknamed the &quot;Gauntlet,&quot; where the U.S. Army's 2nd Infantry Division conducted its famous fighting withdrawal along a narrow road through six miles of Chinese ambush positions during November and December 1950. 
More than 950 missing in action soldiers are believed to be located in these three areas.</Paragraph> <Paragraph position="3"> The Chosin Reservoir campaign left approximately 750 Marines and soldiers missing in action from both the east and west sides of the reservoir in northeastern North Korea.</Paragraph> <Paragraph position="4"> This human search-based method gives us the correct translations for the names we are interested in.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Two-Step Approach </SectionTitle> <Paragraph position="0"> Inspired by this, our goal is to tackle the named entity translation problem using the same approach described above, but fully automatically and with the least amount of hard-to-obtain bilingual resources.</Paragraph> <Paragraph position="1"> As shown in Figure 1, the translation process in our system is carried out in two main steps. Given a named entity in the source language, our translation algorithm first generates a ranked list of translation candidates using bilingual and monolingual resources, which we describe in Section 3. Then, the list of candidates is re-scored using different monolingual clues (Section 4).</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Producing Translation Candidates </SectionTitle> <Paragraph position="0"> Named entity phrases can be identified fairly accurately (e.g., Bikel et al. (1999) report an F-MEASURE of 94.9%). In addition to identifying phrase boundaries, named-entity identifiers also provide the category and sub-category of a phrase (e.g., ENTITY NAME and PERSON). Different types of named entities are translated differently, and hence our candidate generator has a specialized module for each type. Numerical and temporal expressions typically use a limited set of vocabulary words (e.g., names of months, days of the week, etc.) 
and can be translated fairly easily using simple translation patterns. Therefore, we will not address them in this paper. Instead, we will focus on person names, locations, and organizations. But before we present further details, we will discuss how words can be transliterated (i.e., &quot;sounded out&quot;), which is a crucial component of our named entity translation algorithm.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Transliteration </SectionTitle> <Paragraph position="0"> Transliteration is the process of replacing words in the source language with their approximate phonetic or spelling equivalents in the target language.</Paragraph> <Paragraph position="1"> Transliteration between languages that use similar alphabets and sound systems is very simple. However, transliterating names from Arabic into English is a non-trivial task, mainly due to the differences in their sound and writing systems. Vowels in Arabic come in two varieties: long vowels and short vowels. Short vowels are rarely written in Arabic newspaper text, which makes pronunciation and meaning highly ambiguous. Also, there is no one-to-one correspondence between Arabic sounds and English sounds. For example, English P and B are both mapped into Arabic &quot;b&quot;; Arabic &quot;h.&quot; and &quot;h-&quot; are both mapped into English H; and so on.</Paragraph> <Paragraph position="2"> Stalls and Knight (1998) present an Arabic-to-English back-transliteration system based on the source-channel framework. The transliteration process is based on a generative model of how an English name is transliterated into Arabic. It consists of several steps, each defined as a probabilistic model represented as a finite-state machine. First, an English word is generated according to its unigram probabilities P(e). 
Then, the English word is pronounced with probability P(p | e), which is collected directly from an English pronunciation dictionary. Finally, the English phoneme sequence is converted into Arabic writing with probability P(a | p). According to this model, the transliteration probability is given by the following equation:</Paragraph> <Paragraph position="4"> P_p(a | e) = Σ_p P(p | e) · P(a | p). The transliterations proposed by this model are generally accurate. However, one serious limitation of this method is that only English words with known pronunciations can be produced. Also, human translators often transliterate words based on how they are spelled in the source language. For example, Graham is transliterated into Arabic according to its spelling rather than its pronunciation. To address these limitations, we extend this approach with a new spelling-based model in addition to the phonetic-based model.</Paragraph> <Paragraph position="5"> The spelling-based model we propose (described in detail in (Al-Onaizan and Knight, 2002)) directly maps English letter sequences into Arabic letter sequences with probability P(a | e), which is trained on a small English/Arabic name list without the need for English pronunciations. Since no pronunciations are needed, this list is easily obtainable for many language pairs. We also extend the model P(e) to include a letter trigram model in addition to the word unigram model. This makes it possible to generate words that are not already defined in the word unigram model. 
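To make the spelling-based model concrete, here is a minimal, self-contained sketch. The probability tables are invented toy values (the paper's models are trained on an English/Arabic name list), and the alignment between English and romanized Arabic letters is assumed to be given one-to-one, whereas the real model handles alignments with finite-state machinery.

```python
import math

SMOOTH = 1e-4  # floor for unseen events (an assumption, not from the paper)

# Toy letter-trigram model P(e): each English letter conditioned on the
# two preceding letters ("^" pads the left context). Values are invented.
TRIGRAM = {
    ("^", "^", "g"): 0.05, ("^", "g", "r"): 0.30, ("g", "r", "a"): 0.40,
    ("r", "a", "h"): 0.25, ("a", "h", "a"): 0.20, ("h", "a", "m"): 0.50,
}

# Toy letter-mapping model P(a | e) for pre-aligned English/romanized-Arabic
# letter pairs. Values are invented.
MAP = {
    ("g", "gh"): 0.7, ("r", "r"): 0.9, ("a", "a"): 0.6,
    ("h", "h"): 0.8, ("m", "m"): 0.9,
}

def spelling_score(english, arabic_letters):
    """log P(e) + log P(a | e): letter-trigram score of the English
    spelling plus the letter-mapping score of the aligned Arabic letters."""
    padded = ["^", "^"] + list(english)
    logp = sum(math.log(TRIGRAM.get(tuple(padded[i:i + 3]), SMOOTH))
               for i in range(len(english)))
    logp += sum(math.log(MAP.get(pair, SMOOTH))
                for pair in zip(english, arabic_letters))
    return logp
```

Under these toy tables, the spelled-out rendering graham scores higher against its aligned Arabic letters than the pronunciation-style rendering gram, mirroring the Graham example above.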
The transliteration score according to this model is given by: P_s(a | e), the probability the letter-mapping model assigns directly to the Arabic letter sequence given the English letter sequence.</Paragraph> <Paragraph position="7"> The phonetic-based and spelling-based models are combined into a single transliteration model.</Paragraph> <Paragraph position="8"> The transliteration score for an English word e given an Arabic word a is a linear combination of the phonetic-based and the spelling-based transliteration scores as follows:</Paragraph> <Paragraph position="10"> score(e | a) = λ · P_p(a | e) + (1 - λ) · P_s(a | e) (3) </Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Producing Candidates for Person Names </SectionTitle> <Paragraph position="0"> Person names are almost always transliterated. The translation candidates for typical person names are generated using the transliteration module described above. Finite-state devices produce a lattice containing all possible transliterations for a given name. The candidate list is created by extracting the n-best transliterations for a given name. The score of each candidate in the list is the transliteration probability as given by Equation 3.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Producing Candidates for Location and Organization Names </SectionTitle> <Paragraph position="0"> Words in organization and location names, on the other hand, are either translated or transliterated</Paragraph> <Paragraph position="2"> (e.g., &quot;tVswzyn&quot; as Chosin), and it is not clear when a word must be translated and when it must be transliterated. So to generate translation candidates for a given phrase f, each word in the phrase is both translated using a bilingual dictionary and transliterated. Our candidate generator combines the dictionary entries and n-best transliterations for each word in the given phrase into a regular expression that accepts all possible permutations of word translation/transliteration combinations. 
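The regular-expression construction just described can be sketched as follows. This is a simplified illustration: the per-word option lists are hypothetical n-best output, the zero-fertility word set is the pair of examples used in the text, and a real implementation would match against a large news corpus rather than a three-line list.

```python
import itertools
import re

# Optional English zero-fertility words (no Arabic counterpart) allowed
# between words; "of" and "the" are the examples used in the text.
ZERO_FERTILITY = r"(?:(?:of|the)\s+)?"

def phrase_pattern(word_options):
    """Build a regex accepting any permutation of one option per source
    word, with optional zero-fertility words in between.
    word_options: one list of English translations/transliterations per
    source-phrase word (hypothetical n-best output)."""
    alternatives = []
    for perm in itertools.permutations(word_options):
        groups = ["(?:%s)" % "|".join(map(re.escape, opts)) for opts in perm]
        alternatives.append((r"\s+" + ZERO_FERTILITY).join(groups))
    return re.compile(r"\b(?:%s)\b" % "|".join(alternatives), re.IGNORECASE)

# Sketch of the Bay of Pigs example: per-word candidate options.
pattern = phrase_pattern([["bay", "gulf"], ["pigs"]])
corpus = ["the Bay of Pigs invasion", "a gulf of pigs story", "pigs fly south"]
matches = [line for line in corpus if pattern.search(line)]
```

The permutation step makes the pattern insensitive to word-order differences between Arabic and English, while the optional zero-fertility slot lets of and the appear in the English match without a source counterpart.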
In addition to the word transliterations and translations, English zero-fertility words (i.e., words that might not have Arabic equivalents in the named entity phrase, such as of and the) are considered. This regular expression is then matched against a large English news corpus. All matches are then scored according to their individual word translation/transliteration scores. The score for a given candidate e is given by a modified IBM Model 1 probability (Brown et al., 1993) as follows: P(e | f) = s / (l + 1)^m · Π_{j=1..m} t(f_j | e_{a_j})</Paragraph> <Paragraph position="4"> where l is the length of e, m is the length of f, s is a scaling factor based on the number of matches of e found, and a_j is the index of the English word aligned with f_j according to alignment a. The word score t(f_j | e_{a_j}) is the maximum</Paragraph> <Paragraph position="6"> of the transliteration and translation score, where the translation score is a uniform probability over all dictionary entries for f_j.</Paragraph> <Paragraph position="7"> The scored matches form the list of translation candidates. For example, the candidate list for the phrase romanized as</Paragraph> <Paragraph position="9"> &quot;V lyVg&quot; includes Bay of Pigs and Gulf of Pigs.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Re-Scoring Candidates </SectionTitle> <Paragraph position="0"> Once a ranked list of translation candidates is generated for a given phrase, several monolingual English resources are used to help re-rank the list. The candidates are re-ranked according to the following equation: s_new(c) = s_old(c) · RF(c)</Paragraph> <Paragraph position="2"> where RF(c) is the re-scoring factor used.</Paragraph> <Paragraph position="3"> Straight Web Counts: Grefenstette (1999) used phrase Web frequency to disambiguate possible English translations for German and Spanish compound nouns. We use normalized Web counts of named entity phrases as the first re-scoring factor for translation candidates. 
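The contrast between per-word and full-phrase Web counts can be sketched numerically. The raw hit counts below are invented, chosen only to be roughly consistent with the normalized values discussed in the text.

```python
def normalize(counts):
    """Turn raw hit counts into relative frequencies."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}

# Invented raw Web hit counts for the individual words of a person name.
first = normalize({"John": 9269, "Jon": 688})
last = normalize({"Keele": 32, "Kyl": 11})

# Unigram-style scoring of first/last combinations.
unigram = {"%s %s" % (f, l): first[f] * last[l] for f in first for l in last}
best_unigram = max(unigram, key=unigram.get)   # picks the wrong name

# Normalized counts for the full names: querying the whole phrase instead.
full = normalize({"Jon Kyl": 8976, "John Kyl": 936,
                  "John Keele": 87, "Jon Keele": 1})
best_full = max(full, key=full.get)            # picks the right name
```

Multiplying per-word frequencies ranks the most frequent individual words first, while counting the full phrase directly captures which combination actually occurs on the Web.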
For the Bay of Pigs example,</Paragraph> <Paragraph position="5"> the normalized Web counts favor the correct translation, which is therefore ranked highest.</Paragraph> <Paragraph position="6"> It is important to consider counts for the full name rather than the individual words in the name to get accurate counts. To illustrate this point, consider a person name whose Arabic form ends in &quot;kyl.&quot;</Paragraph> <Paragraph position="8"> The transliteration module proposes Jon and John as possible transliterations for the first name, and Keele and Kyl among others for the last name. The normalized counts for the individual words are: (John, 0.9269), (Jon, 0.0688), (Keele, 0.0032), and (Kyl, 0.0011).</Paragraph> <Paragraph position="9"> If we were to use these normalized counts to score and rank the first name/last name combinations in a way similar to a unigram language model, we would get the following name/score pairs: (John Keele, 0.003), (John Kyl, 0.001), (Jon Keele, 0.0002), and (Jon Kyl, 0.00008). However, the normalized phrase counts for the possible full names are: (Jon Kyl, 0.8976), (John Kyl, 0.0936), (John Keele, 0.0087), and (Jon Keele, 0.0001), which is more desirable, as Jon Kyl is an often-mentioned US Senator.</Paragraph> <Paragraph position="10"> Co-reference: When a named entity is first mentioned in a news article, typically the full form of the phrase (e.g., the full name of a person) is used. Later references to the name often use a shortened version of the name (e.g., the last name of the person). Shortened versions are by nature more ambiguous than the full version of a phrase and hence more difficult to translate. Also, longer phrases tend to have more accurate Web counts than shorter ones, as shown above. For example, the phrase</Paragraph> <Paragraph position="12"> &quot;mVgls al-nw-ab&quot; is translated as the House of Representatives. The word &quot;al-mVgls&quot; might be used for later references to this phrase. 
In that case, we are confronted with the task of translating &quot;al-mVgls,&quot; which is ambiguous and could refer to a number of things, including: the Council when referring to &quot;mVgls al-amn&quot; (the Security Council); the House when referring to &quot;mVgls</Paragraph> <Paragraph position="14"> al-nw-ab&quot; (the House of Representatives); and the Assembly when referring to &quot;mVgls al-amt.&quot;</Paragraph> <Paragraph position="16"> In each case the shortened form is the word &quot;mVgls&quot; but with the definite article al- attached. If we are able to determine that it was in fact referring to the House of Representatives, then we can translate it accurately as the House. This can be done by comparing the shortened phrase with the rest of the named entity phrases of the same type. If the shortened phrase is found to be a sub-phrase of exactly one other phrase, we conclude that the shortened phrase is another reference to the same named entity. In that case, we use the counts of the longer phrase to re-rank the candidates of the shorter one.</Paragraph> <Paragraph position="17"> Contextual Web Counts: In some cases, straight Web counting does not help the re-scoring. For example, the top two translation candidates for a name whose first word is romanized as &quot;dwn-ald&quot; are Donald Martin and Donald Marron. Their straight Web counts are 2992 and 2509, respectively. These counts do not change the ranking of the candidate list. We next seek a more accurate counting method that counts phrases only if they appear within a certain context. Using search engines, this can be done with the Boolean operator AND. For the previous example, we use Wall Street as the contextual information. In this case, we get counts of 15 and 113 for Donald Martin and Donald Marron, respectively. 
This is enough to get the correct translation as the top candidate.</Paragraph> <Paragraph position="18"> The challenge is to find the contextual information that provides the most accurate counts. We have experimented with several techniques for identifying the contextual information automatically. Some of these techniques use document-wide contextual information, such as the title of the document or selected key terms mentioned in the document. One way to identify those key terms is to use the tf.idf measure. Others use contextual information that is local to the named entity in question, such as the n words that precede and/or succeed the named entity, or other named entities mentioned close to the one in question.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Extending the Candidates List </SectionTitle> <Paragraph position="0"> The re-scoring methods described above assume that the correct translation is in the candidates list. When it is not in the list, the re-scoring will fail. To address this situation, we need to extrapolate from the candidate list. We do this by searching for the correct translation rather than generating it, either by using sub-phrases from the candidates list or by searching for documents in the target language similar to the one being translated. For example, for a person name, instead of searching for the full name, we search for the first name and the last name separately. Then, we use the IdentiFinder named entity identifier (Bikel et al., 1999) to identify all named entities in the top n retrieved documents for each sub-phrase. All named entities found in the retrieved documents that are of the same type as the entity in question (e.g., PERSON) and that contain the sub-phrase used in the search are scored using our transliteration module and added to the list of translation candidates, and the re-scoring is repeated. 
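The expansion procedure can be sketched as follows. Both the retrieval function and the transliteration scorer are stand-ins passed as parameters (a real system would call a search engine and the transliteration model of Section 3.1), and a crude capitalized-bigram regex stands in for IdentiFinder.

```python
import re

def extend_candidates(first_names, last_names, retrieve, translit_score):
    """Expand a person-name candidate list: search for each sub-phrase,
    harvest person names from the retrieved documents, and score them."""
    person = re.compile(r"\b([A-Z][a-z]+) ([A-Z][a-z]+)\b")  # NE stand-in
    found = {}
    for query in set(first_names) | set(last_names):
        for doc in retrieve(query):
            for first, last in person.findall(doc):
                # keep only names containing a sub-phrase we searched for
                if first in first_names or last in last_names:
                    name = "%s %s" % (first, last)
                    found[name] = translit_score(name)
    return sorted(found, key=found.get, reverse=True)

# Toy run for the Kofi Annan example: one stub document and a stub scorer.
docs = {"Annan": ["Secretary General Kofi Annan said on Monday ..."]}
expanded = extend_candidates(
    {"Coffee", "Covey"}, {"Annan", "Engen", "Anton"},
    retrieve=lambda q: docs.get(q, []),
    translit_score=lambda name: 1.0 if name == "Kofi Annan" else 0.1,
)
```

Because harvested names need only share one sub-phrase with the search terms, a correct translation absent from the generated list (here Kofi Annan, found via the last name Annan) can still enter the candidate pool and be re-scored.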
To illustrate this method, consider the name whose first word is romanized as &quot;kwfy.&quot; Our translation module proposes Coffee Annan, Coffee Engen, Coffee Anton, Coffee Anyone, and Covey Annan, but not the correct translation, Kofi Annan. We would like to find the most common person names that have either Coffee or Covey as a first name, or Annan, Engen, Anton, or Anyone as a last name. One way to do this is to search using wild cards. Since we are not aware of any search engine that allows wild-card Web search, we can instead perform a wild-card search over our news corpus. The problem is that our news corpus is dated material, and it might not contain the information we are interested in. Our news corpus, for example, might predate the appointment of Kofi Annan as Secretary General of the UN.</Paragraph> <Paragraph position="1"> Alternatively, using a search engine, we retrieve the top n matching documents for each of the names Coffee, Covey, Annan, Engen, Anton, and Anyone.</Paragraph> <Paragraph position="2"> All person names found in the retrieved documents that contain any of the first or last names we used in the search are added to the list of translation candidates. We hope that the correct translation is among the names found in the retrieved documents. The re-scoring procedure is then applied once more to the expanded candidates list. In this example, we add Kofi Annan to the candidate list, and it is subsequently ranked at the top.</Paragraph> <Paragraph position="3"> To address cases where neither the correct translation nor any of its sub-phrases can be found in the list of translation candidates, we attempt to search for, instead of generating, translation candidates.</Paragraph> <Paragraph position="4"> This can be done by searching for a document in the target language that is similar to the one being translated from the source language. 
This is especially useful when translating named entities in news stories of international importance, where the same event will most likely be reported in many languages, including the target language. We currently do this by repeating the extrapolation procedure described above, but this time using contextual information such as the title of the original document to find similar documents in the target language. Ideally, one would use a cross-lingual IR system to find relevant documents more successfully.</Paragraph> </Section> </Paper>