<?xml version="1.0" standalone="yes"?> <Paper uid="P95-1032"> <Title>A Pattern Matching Method for Finding Noun and Proper Noun Translations from Noisy Parallel Corpora</Title> <Section position="5" start_page="240" end_page="241" type="evalu"> <SectionTitle> 6 Results </SectionTitle> <Paragraph position="0"> The English half of the corpus has 5760 unique words, including 2779 nouns and proper nouns. Most of these words occurred only once. We carried out two sets of evaluations, first counting only the best-matched pairs, then counting the top three Chinese translations for an English word. The top-N candidate evaluation is useful because in a machine-aided translation system, we could propose a list of up to, say, ten candidate translations to help the translator. We obtained the evaluations of three human judges (E1-E3). Evaluator E1 is a native Cantonese speaker, E2 a Mandarin speaker, and E3 a speaker of both languages. The results are shown in Figure 6.</Paragraph> <Paragraph position="1"> The average accuracy for all evaluators for both sets is 73.1%. This is a considerable improvement over our previous algorithm (Fung & McKeown 1994), which found only 32 pairs of single-word translations. Our program also runs much faster than other lexicon-based alignment methods.</Paragraph> <Paragraph position="2"> We found that many of the mistaken translations resulted from insufficient data, suggesting that we should use a larger corpus in our future work. Tagging errors also caused some translation mistakes. English words with multiple senses also tend to be wrongly translated, at least in part (e.g., means). There is no difference between capital letters and small letters in Chinese, and no difference between singular and plural forms of the same term.</Paragraph> <Paragraph position="3"> This also led to some errors in the vector representation. 
The evaluators' knowledge of the language and familiarity with the domain also influenced the results.</Paragraph> <Paragraph position="4"> Apart from single-word to single-word translations such as Governor/~ and prosperity/~i~flC/~, we also found many single-word translations that show potential for being translated as compound domain-specific terms, such as the following: * finding Chinese words: Chinese texts do not have word boundaries such as the spaces in English; therefore, our text was tokenized into words by a statistical Chinese tokenizer (Fung & Wu 1994).</Paragraph> <Paragraph position="5"> Tokenizer errors caused some Chinese characters not to be grouped together as one word. Our program located some of these words. For example, Green was aligned to ,~j~,/~ and -~, which suggests that ,~j~ could be a single Chinese word. It is indeed the name for Green Paper, a government document.</Paragraph> <Paragraph position="6"> * compound noun translations: carbon could be translated as \]i~, and monoxide as ~. If carbon monoxide were translated separately, we would get ~ --~K4h . However, our algorithm found both carbon and monoxide to be most likely translated to the single Chinese word --~ 4h~, which is the correct translation for carbon monoxide.</Paragraph> <Paragraph position="7"> The words Legislative and Council were both matched to ~-C/r~, and similarly we can deduce that Legislative Council is a compound noun/collocation. Interestingly, Council is also matched to ~J, so we can deduce that ~-'r_~j should be a single Chinese word corresponding to Legislative Council.</Paragraph> <Paragraph position="8"> * slang: Some word pairs seem unlikely to be translations of each other, such as collusion and its first three candidates ~(pull), ~t~(cat), F~ (tail). 
Actually, pulling the cat's tail is Cantonese slang for collusion.</Paragraph> <Paragraph position="9"> The word gweilo is not a conventional English word and cannot be found in any dictionary, but it appeared eleven times in the text. It was matched to the Cantonese characters ~, ~, ~, and ~, which separately mean vulgar/folk, name/title, ghost, and male. ~ means the colloquial term gweilo. Gweilo in Cantonese is actually an idiom referring to a male westerner that originally had pejorative implications. This word reflects a certain cultural context and cannot simply be replaced by a word-to-word translation.</Paragraph> <Paragraph position="10"> * collocations: Some word pairs such as projects and ~(houses) are not direct translations.</Paragraph> <Paragraph position="11"> However, they are found to be constituent words of collocations: the Housing Projects (by the Hong Kong Government). Both Cross and Harbour are translated to 'd~Yff.(sea bottom), and then to Pi~:i(tunnel), not a very literal translation. Yet the correct translation for ~J-~ll~ is indeed the Cross Harbour Tunnel and not the Sea Bottom Tunnel.</Paragraph> <Paragraph position="12"> The words Hong and Kong are both translated into ~i4~, indicating that Hong Kong is a compound name.</Paragraph> <Paragraph position="13"> Basic and Law are both matched to ~:~2~, so we know the correct translation for ~2g~ is Basic Law, which is a compound noun.</Paragraph> <Paragraph position="14"> * proper names: In Hong Kong, there is a specific system for the transliteration of Chinese family names into English. Our algo-</Paragraph> </Section></Paper>