<?xml version="1.0" standalone="yes"?> <Paper uid="W00-1314"> <Title>Word Alignment of English-Chinese Bilingual Corpus Based on Chunks</Title> <Section position="4" start_page="110" end_page="113" type="metho"> <SectionTitle> 3 Alignment Algorithm </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="110" end_page="111" type="sub_section"> <SectionTitle> 3.1 Outline of Algorithm </SectionTitle> <Paragraph position="0"> For the procedure in this paper, the bilingual corpus has been aligned at the sentence level, the English texts have been tagged with POS tags, and the Chinese texts have been segmented and tagged with POS tags.</Paragraph> <Paragraph position="1"> We have available a bilingual lexicon which lists typical translations for many of the words in the corpus. We also have available a Chinese synonym dictionary. We identify the chunks of English sentences and then predict the chunk boundaries of Chinese sentences from the translation of every English chunk and from heuristic information, using the bilingual lexicon. Ambiguities in the Chinese chunk boundaries are resolved by the coterminous words in English chunks. After producing the word candidate sets by a statistical method, we calculate the translation relation probability between every word pair and select the best alignment form. The detailed algorithm for word alignment is given in table 1.</Paragraph> <Paragraph position="2"> Step 1: According to the definition of chunk in English, separate the English sentence into a few chunks and label them with order numbers from left to right.</Paragraph> <Paragraph position="3"> Step 2: Try to find the Chinese translation of every English chunk created in step 1 by the bilingual dictionary and the Chinese synonym dictionary. 
If the Chinese translation is found, then label the Chinese words with the same number used for the English chunk in step 1, … the whole corpus as a base for word alignment.</Paragraph> </Section> <Section position="2" start_page="111" end_page="111" type="sub_section"> <SectionTitle> 3.2 Chunk Identifying of English Sentence </SectionTitle> <Paragraph position="0"> Following Steven Abney (1991), there are two separate stages in the chunking parser: the chunker and the attacher. The chunker converts a stream of words into a stream of chunks, and the attacher converts the stream of chunks into a stream of sentences. Thus only the chunker is needed in this paper. It is a non-deterministic version of an LR parser. For details about the chunker and the grammars used, see Abney (1991). The chunks in one sentence are then labeled with order numbers from left to right.</Paragraph> </Section> <Section position="3" start_page="111" end_page="112" type="sub_section"> <SectionTitle> 3.3 Chunk Boundary Prediction of Chinese Sentence </SectionTitle> <Paragraph position="0"> We observe that when an English sentence is translated into Chinese, all the words in one English chunk tend to be translated as one block of coterminous Chinese words. The word order within these blocks also tends to follow that of the English chunk. Three examples are given in figure 1. The first sentence pair is an example sentence from Abney (1991). The second sentence pair is from a computer handbook. In these sentence pairs every English chunk can find an exactly corresponding Chinese chunk. In the third sentence pair only one English chunk cannot find an exactly corresponding Chinese chunk, because this sentence is taken from a story and the translation is not literal.</Paragraph> <Paragraph position="1"> In order to find the Chinese translation of every English chunk, we use the bilingual dictionary and the Chinese synonym dictionary to implement the matching. 
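The dictionary-based labeling just described can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy lexicon, the example sentences, and the function name are invented for the example, and real ambiguities would be resolved afterwards using the numbers of nearby Chinese words.

```python
# Minimal sketch of the chunk-labeling step of Section 3.3 (assumed,
# not the paper's code): every Chinese word collects the order numbers
# of all English chunks whose words can translate to it.

def label_chinese_words(english_chunks, chinese_words, lexicon):
    """Return {chinese_word_index: [english_chunk_numbers]}.
    Words labeled with two or more numbers are ambiguous and are
    disambiguated later using the labels of nearby Chinese words."""
    labels = {i: [] for i in range(len(chinese_words))}
    for chunk_no, chunk in enumerate(english_chunks, start=1):
        # All Chinese translations of any word in this English chunk.
        translations = set()
        for word in chunk:
            translations.update(lexicon.get(word, []))
        for i, cw in enumerate(chinese_words):
            if cw in translations:
                labels[i].append(chunk_no)
    return labels

# Hypothetical data for "[The bald man] [was sitting] [on his suitcase]".
lexicon = {"bald": ["秃头"], "man": ["男人"],
           "sitting": ["坐"], "suitcase": ["手提箱"]}
chunks = [["The", "bald", "man"], ["was", "sitting"], ["on", "his", "suitcase"]]
chinese = ["秃头", "男人", "坐", "在", "手提箱", "上"]
print(label_chinese_words(chunks, chinese, lexicon))
# → {0: [1], 1: [1], 2: [2], 3: [], 4: [3], 5: []}
```

Function words with no dictionary entry (在, 上) stay unlabeled here; in the paper they are handled by the POS-based heuristics applied at the final chunk-separation step.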
If the Chinese translation of any word within the English chunk is found, then label the Chinese word with the same number used for labeling the English chunk.</Paragraph> <Paragraph position="2"> If there are Chinese words that are labeled simultaneously with the numbers of two or more English chunks, we use the numbers of nearby Chinese words to disambiguate. For example, in figure 2 the first ambiguous Chinese word may correspond to English chunk 5 or chunk 7. Since we know that the words in one English chunk tend to be translated as one block of coterminous Chinese words, it is easy to decide that the first Chinese word corresponds to English chunk 7 and the second to English chunk 5. In the same way, we find that the remaining two ambiguous Chinese words correspond to English chunk 6 and chunk 8, respectively. In step 4 of figure 2, the Chinese words with the same label number are bracketed within one chunk. Finally, we separate the Chinese sentence into a few chunks by heuristic information based on POS tags (especially prepositions, conjunctions, and auxiliary words) and the grammatical knowledge-base of contemporary Chinese (Yu Shiwen, 1998).</Paragraph> <Paragraph position="3"> [The bald man] [was sitting] [on his suitcase]. [I gathered] [from what they said], [that an elder sister] [of his] [was coming] [to stay with them], [and that she…] …</Paragraph> <Paragraph position="4"> Step 1 English chunks with order number: [This product 1] [is designed 2] for [low-cost 3], [turnkey solutions 4] and [mission-critical applications 5] that [require 6] [a central application host 7] and [do not require 8] [networking 9]. 
Step 2 Label the translation of each English chunk with its order number</Paragraph> <Paragraph position="6"/> </Section> <Section position="4" start_page="112" end_page="113" type="sub_section"> <SectionTitle> 3.4 Calculation of Translation Relation </SectionTitle> <Paragraph position="0"> With the alignments at the chunk level of the whole corpus, we propose a Translation Relation Probability (TRP) to implement the word alignment. The translation relation probability of a word pair is given by the following equation:</Paragraph> <Paragraph position="2"> Where f_e is the frequency of the English word in the whole corpus; f_c is the frequency of the Chinese word in the whole corpus; f_ec is calculated by the following equation:</Paragraph> <Paragraph position="4"> (2) Where L_avg is the average number of words of all English chunks and all Chinese chunks related to the English word in the whole corpus; L_ei is the number of words of the English chunk in which the English candidate word co-occurs with the Chinese word; L_ci is the number of words of the Chinese chunk in which the English candidate word co-occurs with the Chinese word; N is the total number of chunks in which the English word co-occurs with the Chinese word; β_ec is a penalty value indicating a POS change between the English word and the Chinese word.</Paragraph> <Paragraph position="5"> By this equation we connect chunk length and POS change with the co-occurrence frequency. The shorter the chunks, the higher the translation relation probability. For example, a chunk pair composed of one English word and two Chinese words is more reliable than a chunk pair composed of four English words and four Chinese words. An example is given in figure 3. There are 5 possible alignment forms under consideration for this chunk pair, which includes three English words and three Chinese words. We then calculate the total TRP value for every possible alignment word pair in each alignment form by equation (1). 
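Equations (1) and (2) themselves are not reproduced in this extract, so the following sketch only illustrates the qualitative behavior the text describes: each co-occurrence contributes evidence weighted inversely by chunk length, scaled by the POS-change penalty β_ec and normalized by corpus frequencies. The formula, the weighting, and the function name are assumptions for illustration, not the paper's actual equations.

```python
# Illustrative sketch only: equations (1)-(2) are not shown in this
# extract, so this combines the same ingredients (co-occurrence
# frequency, chunk lengths, POS penalty beta_ec) in an assumed form.

def trp(f_e, f_c, cooccurrences, beta_ec=1.0):
    """cooccurrences: list of (english_chunk_len, chinese_chunk_len),
    one entry per chunk pair in which the two words co-occur.
    Shorter chunks give stronger evidence, so each co-occurrence is
    weighted inversely by the combined chunk length."""
    if not cooccurrences:
        return 0.0
    f_ec = sum(2.0 / (le + lc) for le, lc in cooccurrences)
    # Normalize by corpus frequencies; beta_ec penalizes a POS change.
    return beta_ec * 2.0 * f_ec / (f_e + f_c)

# A pair seen in one short chunk pair (1 English word, 2 Chinese words)
# outscores the same pair seen in a long one (4 and 4 words).
short = trp(f_e=3, f_c=3, cooccurrences=[(1, 2)])
long_ = trp(f_e=3, f_c=3, cooccurrences=[(4, 4)])
print(short > long_)  # → True
```

This reproduces the property stated above (shorter chunks yield a higher TRP) without claiming the authors' exact length term or value of β_ec.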
After we get the total TRP value for each alignment form, we choose the largest one.</Paragraph> <Paragraph position="6"> (Figure 3: the five candidate alignment forms for the chunk pair containing &quot;A floppy disk drive&quot;.)</Paragraph> </Section> </Section> <Section position="5" start_page="113" end_page="114" type="metho"> <SectionTitle> 4 Experimental Results </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="113" end_page="114" type="sub_section"> <SectionTitle> 4.1 System Architecture </SectionTitle> <Paragraph position="0"/> </Section> <Section position="2" start_page="114" end_page="114" type="sub_section"> <SectionTitle> 4.2 Experiment Results </SectionTitle> <Paragraph position="0"> We tested our system with an English-Chinese bilingual corpus which is part of a computer handbook (SCO Unix handbook). There are about 2,246 English sentences and 2,169 Chinese sentences in this handbook after filtering out noisy figures and tables. Finally we extracted 14,214 chunk pairs from the corpus. The accuracy of automatic chunk alignment is 85.7%. The accuracy of word alignment based on correctly aligned chunk pairs is 93.6%. The errors are mainly due to the following reasons: Chinese segmentation errors, stop-word noise, and POS tagging errors. The parameter β_ec used in equation (2) should be chosen from the training corpus. In table 2, the total TRP values for the example in figure 3 are shown. Alignment form D is the best.</Paragraph> </Section> </Section> </Paper>