<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1809"> <Title>Corpus-Based Pinyin Name Resolution</Title> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> 1 Pinyin Name Extraction </SectionTitle> <Paragraph position="0"> Chinese person names in Pinyin have fairly predictable formats: the first letter of the family name (surname) is capitalized, as is the first letter of the first (and sometimes the second) word of the given name.</Paragraph> <Paragraph position="1"> Two-syllable given names may appear as one word or two. The latter may be hyphenated, a practice popular in places such as Taiwan or Hong Kong. Thus, one may find Chairman Mao's name in any of the following formats:</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Mao Ze Dong; Mao ZeDong; Mao Zedong; Mao Ze-Dong; Mao Ze-dong </SectionTitle> <Paragraph position="0"> Some publications also place the given name in front of the surname to follow the Western naming convention. This style is supported but not used in this paper.</Paragraph> <Paragraph position="1"> While the set of surname characters is essentially closed, the set of given-name characters is not. It is well known that the most popular Chinese surnames number about 100. Including less frequent ones brings the number to about 400, which is the set we use; see the Hundred Surname website (2002). Sun et al.</Paragraph> <Paragraph position="2"> (1994) reported over 700 surnames in their studies when additional infrequent ones are included. With a few exceptions, the surnames in this set have unique Pinyin codes. These surname codes constitute an important clue for spotting a name sequence. The capitalized word(s), and the monosyllabic nature of the words immediately after (or before) the surname, give further support for the presence of a name. We also loosen the name definition to detect entries that have a hyphen but no surname. Some rare surnames can be two syllables long, and often pair with one-syllable given names. 
A woman may include her own family name in addition to her husband's. For our current study, we limit testing to sequences of two to three Pinyin syllables only. This seems sufficient for the large majority of names encountered. Fig. 1 shows our procedure to identify possible Pinyin names without the need for a training corpus or a name dictionary.</Paragraph> </Section> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Mapping Pinyin Name to Chinese </SectionTitle> <Paragraph position="0"> To suggest Chinese characters for the detected Pinyin, we downloaded about 200K Chinese names. This is augmented with another half million Chinese name usages isolated from the</Paragraph> <Paragraph position="2"> [Fig. 1 legend, partly lost: "... first character; Gg, G-g = concatenated or hyphenated syllables, second one with lower-case)"] IdentiFinder (see Section 4). Last names and given names/characters are stored separately to form a database of name usage with frequencies.</Paragraph> <Paragraph position="3"> Two-character given names are stored both ways: as a single entry (observed) and as two separate characters. Observed usage items have their frequencies multiplied by a large factor to separate them from the unobserved type. A potential Pinyin surname is mapped to a set of possible characters. The existence of such characters in the surname database is the first step in deciding that a string may be a name sequence.</Paragraph> <Paragraph position="4"> Otherwise, we assume the Pinyin is not a name.</Paragraph> <Paragraph position="5"> Knight and Graehl (1997) proposed composing a set of weighted finite-state transducers to solve the much more complicated problem of back-transliteration from Japanese Katakana to English. Their concerns include all types of source Katakana terms (not just names), corruptions due to OCR, approximations due to the modeling of English and Japanese pronunciations, and a language model for English. 
Proposing Chinese characters for Pinyin is like back-transliteration and can also be viewed probabilistically. Some unique considerations, however, lead to a much simpler problem.</Paragraph> <Paragraph position="6"> Given an English Pinyin name E=EsEg (surname Es, given name Eg), our concern is to find the best Chinese name character sequence C=CsCg that maximizes P(C|E), or equivalently P(E|C)*P(C). Since surnames (Es, Cs) and given names (Eg, Cg) can be considered independent, this probability can be rewritten as P(Es|C)*P(Cs)*P(Eg|C)*P(Cg). The conditioning on C can be replaced by Cs and Cg respectively, since the Chinese given name Cg should not influence the English surname Es, and Cs should not influence Eg. As discussed before, with a few exceptions Chinese characters have unique Pinyin, and hence P(Es|Cs) and P(Eg|Cg) are deterministic. Maximizing P(C|E) is therefore equivalent to maximizing P(Cs')*P(Cg'), where Cs' and Cg' range over the sets of characters that map from Es and Eg respectively. These probabilities are obtainable from frequencies in our database.</Paragraph> <Paragraph position="7"> Given names are limited to one or two syllables.</Paragraph> <Paragraph position="8"> In the latter case, the two characters are also assumed independent, and estimates of P(Cg') are smoothed between character pairs and their corresponding singles.</Paragraph> <Paragraph position="9"> To illustrate, we use the Pinyin Jiang ZeMin (the correct Chinese name is 江泽民) as an example. This spelling is confirmed as a name because &quot;Jiang&quot; maps to five possible surnames, and &quot;ZeMin&quot; obeys the given-name format and has corresponding characters. Each surname character and all possible combinations of the given-name characters are considered, and probabilities are evaluated based on the database of name usage frequencies. 
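As a rough illustration of the ranking just described, the following Python sketch scores each candidate by P(Cs')*P(Cg'). The character table, the usage counts, the weight PAIR_WEIGHT and all function names are hypothetical and are not taken from the paper's database. The paper states only that pair and single-character estimates are smoothed, so the linear interpolation used here is an assumption.

```python
from itertools import product

# Hypothetical toy data: Pinyin syllable -> candidate characters.
PINYIN_TO_CHARS = {
    "jiang": ["江", "姜", "蒋"],
    "ze": ["泽", "则"],
    "min": ["民", "敏"],
}
# Hypothetical usage counts, invented for illustration only.
SURNAME_COUNTS = {"江": 50, "姜": 30, "蒋": 20}
GIVEN_CHAR_COUNTS = {"泽": 40, "则": 10, "民": 60, "敏": 40}
GIVEN_PAIR_COUNTS = {"泽民": 8, "泽敏": 2}

PAIR_WEIGHT = 0.5  # assumed interpolation weight between pair and singles


def relative_freq(counts, key):
    """Relative frequency of key in a count table (0.0 if unseen)."""
    total = sum(counts.values())
    return counts.get(key, 0) / total if total else 0.0


def given_name_score(chars):
    """Estimate P(Cg'): singles are treated as independent, and a
    two-character name is interpolated with its observed pair frequency."""
    singles = 1.0
    for ch in chars:
        singles *= relative_freq(GIVEN_CHAR_COUNTS, ch)
    if len(chars) == 1:
        return singles
    pair = relative_freq(GIVEN_PAIR_COUNTS, "".join(chars))
    return PAIR_WEIGHT * pair + (1.0 - PAIR_WEIGHT) * singles


def rank_candidates(surname_syllable, given_syllables, top_n=16):
    """Return up to top_n (name, score) pairs sorted by P(Cs')*P(Cg')."""
    surnames = [
        c for c in PINYIN_TO_CHARS.get(surname_syllable.lower(), [])
        if c in SURNAME_COUNTS
    ]
    # No surname character in the database: treat the string as not a name.
    options = [PINYIN_TO_CHARS.get(s.lower(), []) for s in given_syllables]
    scored = []
    for cs in surnames:
        for cg in product(*options):
            score = relative_freq(SURNAME_COUNTS, cs) * given_name_score(cg)
            scored.append((cs + "".join(cg), score))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


if __name__ == "__main__":
    for name, score in rank_candidates("Jiang", ["Ze", "Min"]):
        print(name, round(score, 4))
```

With these toy counts, rank_candidates("Jiang", ["Ze", "Min"]) enumerates twelve candidates and places 江泽民 first; the scores are unnormalized and only their ordering is meaningful.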
The top 16 candidates and estimated probabilities produced from our procedure are shown below: The probabilities are skewed because the first (correct) name has a large usage frequency in the training data. However, every candidate is a possible name irrespective of its probability because of the idiosyncrasies of name formation.</Paragraph> <Paragraph position="10"> Quite often, the names of places or organizations also sound like person names. These will also be translated (see the example in Section 4). A couple of notable failures are strings like 'So China', which our procedure decodes as the name 'So Chi-na', 'So' being a legitimate surname in the Wade-Giles convention. 'Hong Kong' also passes our test with candidates: GFFG29G0F G77GDBG0F G14G4FG0F etc. A 'stoplist' of such string patterns is employed to partially alleviate these errors.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Pinyin Name Resolution </SectionTitle> <Paragraph position="0"> Once candidate names for a Pinyin are available, one may output the top n ranked items as answers. However, selecting names based on probability may not be the best strategy. Quite often, people deliberately choose rare characters for naming purposes because they want to be differentiated from the usual run-of-the-mill names. Our strategy is to use IR techniques with a text collection to help in name selection. For cross-language retrieval, it is especially helpful to use the target retrieval collection for resolution. This ensures that a translated name exists in the collection for retrieval. For general applications, one could employ domain-relevant collections. Moreover, one can also use the occurrence frequency of the names in the collection to help narrow down the candidates: the higher the frequency, the more probable it is that the name is the intended one. 
This has the advantage that selection is tailored more to the application and is less dependent on the name character database of Section 2. When the collection is well chosen, this process can whittle down the candidates to just a few with good accuracy.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Experimental Studies </SectionTitle> <Paragraph position="0"> We performed two studies to demonstrate our Pinyin resolution strategy. The first repeats retrieval on some queries from the NTCIR-2 cross-language experiments to see how Pinyin name resolution affects effectiveness. The second uses BBN's IdentiFinder as a reference, and compares how well our procedures extract Pinyin names and translate them with respect to a reference set.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 CLIR with Pinyin Names </SectionTitle> <Paragraph position="0"> One of the NTCIR-2 cross-language retrieval experiments (Eguchi et al. 2001) consists of 50 English topics and a Chinese target collection of about 200 MB. The purpose is to retrieve relevant Chinese documents using English text (topics) as queries. The Chinese counterparts to the English topics were also given so that CLIR results can be compared with monolingual results. The original topics are lengthy; we limit our queries to a few words from the 'title' section of the topics. Three queries have Pinyin names and two contain non-person Pinyin entities that satisfy our Pinyin name detection format.</Paragraph> <Paragraph position="1"> On running these 'title' queries through our procedure, the Pinyin codes were identified, candidates were suggested, and the candidates were resolved using the target collection. Listed in Table 1 are the queries. 
The Pinyin name in each 'Original English' and 'Original Chinese' query is bolded.</Paragraph> <Paragraph position="2"> Under the column 'Selected Names with Occurrence Frequency' are the resolved Pinyin names in Chinese, together with their occurrence frequencies in the retrieval collection. As discussed in Section 3, these selections are narrowed down from a large number of candidates in the intermediate step.</Paragraph> <Paragraph position="3"> The Pinyin in Query 33 is for a kind of bean, while Query 44 has the name of a well-known mountain, but both satisfy our definition of a name pattern. It can be seen that, except for Query 46, the name with the largest occurrence frequency agrees with the one intended in the monolingual query. In Query 46, the given name 'Yo-yo' is non-standard Pinyin, with suggested candidates like ' ' or ' ', and there are no such entries in the collection. If it were spelt 'Youyou', the correct characters '友友' would be among the candidates and would be selected by the collection. When these Pinyin names with frequency >= 5 were added to our procedure combining MT software with dictionary translation (Kwok 2001), the initial retrieval results in Table 2 were obtained. Here we follow the TREC convention to evaluate retrieval using the measures RR (relevant documents in top</Paragraph> <Paragraph position="5"> retrieved).</Paragraph> <Paragraph position="6"> Substantial improvements were obtained for four of the queries when the names are correctly picked, and their results come closer to or even surpass the monolingual result. This demonstrates that our approach to Pinyin name resolution can work, but we need more queries of this type to confirm the effect. Query #15 has a very high Av.P of 0.3287 because dictionary translation brought in useful content words not present in the monolingual query like: (kidnapping), , (murder criminal case). These expand the query and combine synergistically with the Pinyin name to provide precision surpassing the monolingual result. 
As a candidate name, GA6G8D in Query #47 has very low probability compared to others because the character G8D (meaning 'mediocre') is rarely used in names. It was pulled out by its high occurrence frequency in the target collection. Thompson and Dozier (1997) have also shown that correctly indexing names in monolingual English retrieval leads to better retrieval.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Resolving Pinyin Names in Text </SectionTitle> <Paragraph position="0"> In another experiment we intended to test our Pinyin procedure with parallel collections that contain many paired names, but failed to locate one. We wanted to evaluate how well our extraction procedure works, and whether candidate suggestion can recover correct Chinese names. A pair of collections was downloaded from the People's Daily website (2001), the Year 2001 English version (~17MB) and the Chinese version (~70MB), as our test collections. A sampling shows that they have very different content. Our aim is to isolate Pinyin names from the English collection, and create a list of their Chinese counterparts. We can then compare our Pinyin extraction against the English list. We would also like to see how well our database suggests Chinese candidates for this fairly recent name set. The evaluation is more approximate than one using parallel corpora with many paired names. BBN's IdentiFinder, described in Weischedel et al. (1996), was employed to process both collections independently. When given English or Chinese texts, IdentiFinder can bracket entities of different types such as PERSON, LOCATION, ORGANIZATION, etc. for later extraction. PERSON entities were isolated and two unique person name lists were produced: 4840 in English and 47621 in Chinese. They include Pinyin, non-Chinese and Chinese person names. 
The Chinese list contains many entries with one character (such as a surname G1D), transliterated foreign names, and some with symbols. These we want to avoid. By capturing entries of length >=2 characters, without symbols, and having legitimate surnames, a filtered list of 23,863 Chinese entries was obtained. They were mapped into Pinyin and intersected with the English list. A total of 897 COMMON entries resulted, forming our reference set (Fig. 2). These are Chinese names obtainable by translating from the 4840-entry English list and which occur on the filtered list. The original English collection was next processed through our Pinyin identification procedure, and 1769 unique entries were detected that satisfy our criteria. Comparison with IdentiFinder's English list shows that 1467 (83%) are the same, and 302 (17%) are different.</Paragraph> <Paragraph position="1"> The non-overlap can be due to: i) non-person entities on our list that sound like names; ii) non-Chinese names on the IdentiFinder list; iii) legitimate Chinese names detected by one procedure and not the other; or iv) errors in either procedure. Candidate Chinese names were suggested for our 1769-entry Pinyin list, and afterwards resolved with the Chinese COMMON list. This tests how well our database suggests names for Pinyin. The result is shown in Table 3. We show suggestions of 1, 5, up to 50 candidates, and recall of the reference set improves steadily from 35.3% to 93.9% (missing 55 of those 897 in COMMON) at 50 suggested. This shows the difficulty of suggesting a correct name: only ~35% recall at top 1, ~68% at top 5. In general, a small 'top n' is not sufficient to recover a correct name translation, while using too many candidates leads to noise. 
Hence there is a need to resolve candidates on a relevant collection.</Paragraph> <Paragraph position="2"> We further compare the suggested Chinese names for the 1769 Pinyin against the filtered Chinese list (23863 entries) to see whether our Pinyin extraction can recover additional Chinese names not obtained by IdentiFinder (from the same English text). We found that at each suggestion level (Tables 3 and 4), more names were found by our Pinyin procedure that were missing in IdentiFinder: 30 at suggestion level 1, up to 174 (~19%) more names at the level of 50. These 174 are names in the filtered portion of the Chinese list but not included in COMMON because the English list from IdentiFinder does not have their corresponding Pinyin. The remaining (1769-1016=) 753 on our list could be non-person entities that sound like names, wrongly identified entries, or person names that do not exist in IdentiFinder's Chinese list. IdentiFinder may fail to extract some Chinese names as well. For example, some Pinyin names with 'An' as surname were missed. This study demonstrates the ability of our approach to locate Pinyin names in English text and translate them.</Paragraph> <Paragraph position="3"> A procedure to translate any Pinyin name into possible Chinese characters with probabilities based on usage frequencies is proposed.</Paragraph> <Paragraph position="4"> Candidates can further be resolved against a text collection to narrow down the possibilities. This leads to better CLIR results. For a recent English news collection, 83% of the Pinyin names identified agree with names found by BBN's IdentiFinder. Chinese name candidates for these Pinyin cover between 35.3% and 93.9% of a COMMON name set for the IdentiFinder names when the number of suggestions varies between 1 and 50. In addition, Chinese names not extracted by IdentiFinder can be located using our procedure. Pinyin is an official coding used in China and is becoming popular elsewhere. 
Names from other places such as Taiwan use different romanization conventions like Wade-Giles. We have made some provision for them, but plan to expand our coverage of these names more completely in the future.</Paragraph> <Paragraph position="5"> Some web search engines offer advanced techniques that allow users to input English key terms and display results from Chinese documents, selecting items that have the English term and its Chinese counterpart. These engines serve as giant bilingual dictionaries for entity translation. However, web pages usually contain current data and popular names only (like Ma Yo-yo). Lesser-known names (like Bai Xiao-yan) are not available. Our approach can suggest Chinese names for Pinyin even if web search fails, or if the relevant collection employed does not further resolve the suggested translations. For CLIR, our procedure ties translated names to the retrieval collection. We envisage that each of these approaches has its own advantages, and that employing both together may provide more accuracy in translating Pinyin names.</Paragraph> </Section> </Section> </Paper>