<?xml version="1.0" standalone="yes"?>
<Paper uid="P00-1067">
  <Title>PENS: A Machine-aided English Writing System for Chinese Users</Title>
  <Section position="2" start_page="1" end_page="1" type="metho">
    <SectionTitle>
1 System Overview
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
1.1 System Architecture
</SectionTitle>
      <Paragraph position="0"> There are two modules in PENS. The first is called the spelling help. Given an English word, the spelling help performs two functions, 1) retrieving its synonym, antonym, and thesaurus; or 2) automatically giving the corresponding translation of Chinese words in the form of Chinese characters or pinyin. Statistical machine translation techniques are used for this translation, and therefore a Chinese-English bilingual dictionary (MRD), an English language model, and an English-Chinese word- translation model (TM) are needed. The English language model is a word trigram model, which consists of 247,238,396 trigrams, and the vocabulary used contains 58541 words. The MRD dictionary contains 115,200 Chinese entries as well as their corresponding English translations, and other information, such as part-of-speech, semantic classification, etc. The TM is trained from a word-aligned bilingual corpus, which occupies approximately 96,362 bilingual sentence pairs.</Paragraph>
      <Paragraph position="1"> The second module is an intelligent recommendation system. It employs an effective sentence retrieval algorithm on a large bilingual corpus. The input is a sequence of keywords or a short phrase given by users, and the output is limited pairs bilingual sentences expressing relevant meaning with users' query, or just a few pairs of bilingual sentences with syntactical relevance.</Paragraph>
    </Section>
    <Section position="2" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
1.2 Bilingual Corpus Construction
</SectionTitle>
      <Paragraph position="0"> We have collected bilingual texts extracted from World Wide Web bilingual sites, dictionaries, books, bilingual news and magazines, and product manuals. The size of the corpus is 96,362 sentence pairs. The corpus is used in the following three cases:  1) Act as translation memory to support the Intelligent Recommendation Function; 2) To be used to acquire English-Chinese translation model to support translation at word and phrase level; 3) To be used to extract bilingual terms to enrich the Chinese-English MRD;  To construct a sentence aligned bilingual corpus, we first use an alignment algorithm doing the automatic alignment and then the alignment result are corrected.</Paragraph>
      <Paragraph position="1"> There have been quite a number of recent papers on parallel text alignment. Lexically based techniques use extensive online bilingual lexicons to match sentences [Chen 93]. In contrast, statistical techniques require almost no prior knowledge and are based solely on the lengths of sentences, i.e. length-based alignment method. We use a novel method to incorporate both approaches [Liu, 95]. First, the rough result is obtained by using the length-based method. Then anchors are identified in the text to reduce the complexity. An anchor is defined as a block that consists of n successive sentences. Our experiments show best performance when n=3.</Paragraph>
      <Paragraph position="2"> Finally, a small, restricted set of lexical cues is applied to obtain for further improvement.</Paragraph>
    </Section>
    <Section position="3" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
1.3 Translation Model Training
</SectionTitle>
      <Paragraph position="0"> Chinese sentences must be segmented before word translation training, because written Chinese consists of a character stream without space between words. Therefore, we use a wordlist, which consists of 65502 words, in conjunction with an optimization procedure described in [Gao, 2000]. The bilingual training process employs a variant of the model in [Brown, 1993] and as such is based on an iterative EM (expectation-maximization) procedure for maximizing the likelihood of generating the English given the Chinese portion. The output of the training process is a set of potential English translations for each Chinese word, together with the probability estimate for each translation.</Paragraph>
    </Section>
    <Section position="4" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
1.4 Extraction of Bilingual
Domain-specific Terms
</SectionTitle>
      <Paragraph position="0"> A domain-specific term is defined as a string that consists of more than one successive word and has certain occurrences in a text collection within a specific domain. Such a string has a complete meaning and lexical boundaries in semantics; it might be a compound word, phrase or linguistic template. We use two steps to extract bilingual terms from sentence aligned corpus.</Paragraph>
      <Paragraph position="1"> First we extract Chinese monolingual terms from Chinese part of the corpus by a similar method described in [Chien, 1998], then we extract the English corresponding part by using the word alignment information. A candidate list of the Chinese-English bilingual terms can be obtained as the result. Then we will check the list and add the terms into the dictionary.</Paragraph>
    </Section>
  </Section>
  <Section position="3" start_page="1" end_page="21" type="metho">
    <SectionTitle>
2 Spelling Help
</SectionTitle>
    <Paragraph position="0"> The spelling help works on the word or phrase level. Given an English word or phrase, it performs two functions, 1) retrieving corresponding synonyms, antonyms, and thesaurus; and 2) automatically giving the corresponding translation of Chinese words in the form of Chinese characters or pinyin. We will focus our discussion on the latter function in the section.</Paragraph>
    <Paragraph position="1"> To use the latter function, the user may input Chinese characters or just input pinyin. It is not very convenient for Chinese users to input Chinese characters by an English keyboard.</Paragraph>
    <Paragraph position="2"> Furthermore the user must switch between English input model and Chinese input model time and again. These operations will interrupt his train of thought. To avoid this shortcoming, our system allows the user to input pinyin instead of Chinese characters. The pinyin can be translated into English word directly.</Paragraph>
    <Paragraph position="3"> Let us take a user scenario for an example to show how the spelling help works. Suppose that a user input a Chinese word &amp;quot;G1160G17e4&amp;quot; in the form of pinyin, say &amp;quot;wancheng&amp;quot;, as shown in figure1-1. PENS is able to detect whether a string is a pinyin string or an English string automatically. For a pinyin string, PENS tries to translate it into the corresponding English word or phrase directly. The mapping from pinyin to Chinese word is one-to-many, so does the mapping from Chinese word to English words. Therefore, for each pinyin string, there are alternative translations. PENS employs a statistical approach to determine the correct translation. PENS also displays the corresponding Chinese word or phrase for confirmation, as shown in figure 1-2.</Paragraph>
    <Paragraph position="4">  If the user is not satisfied with the English word determined by PENS, he can browse other candidates as well as their bilingual example sentences, and select a better one, as shown in</Paragraph>
    <Section position="1" start_page="1" end_page="21" type="sub_section">
      <SectionTitle>
2.1 Word Translation Algorithm Based on Statistical LM and TM
</SectionTitle>
      <Paragraph position="0"> basedonStatisticalLMandTM Suppose that a user input two English words,  , and then a pinyin string, say PY.ForPY, all candidate Chinese words are determined by looking up a Pinyin-Chinese dictionary. Then, a list of candidate English translations is obtained according to a MRD. These English translations are English words of their original form, while they should be of different forms in different contexts. We exploit morphology for this purpose, and expand each word to all possible forms. For instance, inflections of &amp;quot;go&amp;quot; may be &amp;quot;went&amp;quot;, and &amp;quot;gone&amp;quot;. In what follows, we will describe how to determine the proper translation among the  most proper translation of PY is the English word with the highest conditional probability among all leaf nodes, that is According to Bayes' law, the conditional probability is estimated by  For simplicity, we assume that a Chinese word doesn't depends on the translation context, so we can get the following approximate equation:</Paragraph>
      <Paragraph position="2"> ) is the translation model, and can be got from bilingual corpus, and P(PY  |CW</Paragraph>
      <Paragraph position="4"> is the polyphone model, here we suppose</Paragraph>
      <Paragraph position="6"> ) is the English trigram language model. To sum up, as indicated in (2-6), the spelling help find the most proper translation of PY by retrieving the English word with the highest conditional probability.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="21" end_page="12111" type="metho">
    <SectionTitle>
3 Intelligent Recommendation
</SectionTitle>
    <Paragraph position="0"> The intelligent recommendation works on the sentence level. When a user input a sequence of Chinese characters, the character string will be firstly segmented into one or more words. The segmented word string acts as the user query in IR. After query expansion, the intelligent recommendation employs an effective sentence retrieval algorithm on a large bilingual corpus, and retrieves a pair (or a set of pairs) of bilingual sentences related to the query. All the retrieved sentence pairs are ranked based on a scoring strategy.</Paragraph>
    <Section position="1" start_page="21" end_page="12111" type="sub_section">
      <SectionTitle>
3.1 Query Expansion
</SectionTitle>
      <Paragraph position="0"> Suppose that a user query is of the form CW</Paragraph>
      <Paragraph position="2"> each word of the queries based on a Chinese thesaurus, as shown below.</Paragraph>
      <Paragraph position="3">  We can obtain an expanded query by substituting a word in the query with its synonym. To avoid over-generation, we restrict that only one word is substituted at each time. Let us take the query &amp;quot;Gec4G4dc7G1b1cG1d70&amp;quot; for an example. The synonyms list is as follows:</Paragraph>
      <Paragraph position="5"> The query consists of two words. By substituting the first word, we get expanded queries, such as &amp;quot;Gec4G1b1cG1d70&amp;quot;G2c8&amp;quot;G4dc7G1b1cG1d70&amp;quot;G2c8&amp;quot;G4dc7Gaa1G1b1cG1d70&amp;quot;, etc, and by substituting the second word, we get other expanded queries, such as &amp;quot;Gec4G4dc7 G530G2afc&amp;quot;G2c8&amp;quot;Gec4G4dc7 G873G1b1c&amp;quot;G2c8&amp;quot;Gec4G4dc7G1172G1b1c&amp;quot;, etc.</Paragraph>
      <Paragraph position="6"> Then we select the expanded query, which is used for retrieving example sentence pairs, by estimating the mutual information of words with the query. It is indicated as follows</Paragraph>
      <Paragraph position="8"> is the jth synonym of the i-th Chinese word. In the above example, &amp;quot;G4dc7Gaa1 G1b1cG1d70&amp;quot;is selected. The selection well meets the common sense. Therefore, bilingual example sentences containing &amp;quot;G4dc7Gaa1G1b1cG1d70&amp;quot; will be retrieved as well.</Paragraph>
    </Section>
    <Section position="2" start_page="12111" end_page="12111" type="sub_section">
      <SectionTitle>
3.2 Ranking Algorithm
</SectionTitle>
      <Paragraph position="0"> The input of the ranking algorithm is a query Q, as described above, Q is a Chinese word string, as shown below  For each sentence, the relevance score is computed in two parts, 1) the bonus which represents the similarity of input query and the target sentence, and 2) the penalty,which represents the dissimilarity of input query and the target sentence.</Paragraph>
      <Paragraph position="1"> The bonus is computed by the following formula:</Paragraph>
      <Paragraph position="3"> is the weight of the jth word in query Q, which will be described later, tf ij is the number of the jth word occurring in sentence i, n is the number of the sentences in corpus, df</Paragraph>
      <Paragraph position="5"> of word in the ith sentence.</Paragraph>
      <Paragraph position="6"> The above formula contains only the algebraic similarities. To take the geometry similarity into consideration, we designed a penalty formula. The idea is that we use the editing distance to compute that geometry similarity.</Paragraph>
      <Paragraph position="8"> Suppose the matched word list between query Q and a sentence are represented as A and B  The editing distance is defined as the number of editing operation to revise B to A. The penalty will increase for each editing operation, but the score is different for different word category. For example, the penalty will be serious when operating a verb than operating a noun  We define the score and penalty for each kind of part-or-speech</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="12111" end_page="12111" type="metho">
    <SectionTitle>
4 Experimental Results &amp;
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="12111" end_page="12111" type="sub_section">
      <SectionTitle>
Evaluation
</SectionTitle>
      <Paragraph position="0"> In this section, we will report the primary experimental results on 1) word-level pinyin-English translation, and 2) example sentences retrieval.</Paragraph>
    </Section>
    <Section position="2" start_page="12111" end_page="12111" type="sub_section">
      <SectionTitle>
4.1 Word-level Pinyin-English
Translation
</SectionTitle>
      <Paragraph position="0"> Firstly, we built a testing set based on the word aligned bilingual corpus automatically.</Paragraph>
      <Paragraph position="1"> Suppose that there is a word-aligned bilingual  If we substitute an English word with the pinyin of the Chinese word which the English word is aligned to, we can get a testing example for word-level Pinyin-English translation. Since the user only cares about how to write content words, rather than function words, we should skip function words in the English sentence. In this example, suppose EW  ) The Chinese words and English words in brackets are standard answers to the pinyin. We can get the precision of translation by comparing the standard answers with the answers obtained  The standard testing set includes 1198 testing sentences, and all the pinyins are polysyllabic. The experimental result is shown in Figure 4-2.</Paragraph>
    </Section>
    <Section position="3" start_page="12111" end_page="12111" type="sub_section">
      <SectionTitle>
4.2 Example Sentence Retrieval
</SectionTitle>
      <Paragraph position="0"> We built a standard example sentences set which consists of 964 bilingual example sentence pairs. We also created 50 Chinese-phrase queries manually based on the set. Then we labelled every sentence with the 50 queries. For instance, let's say that the example sentence is G4aaG1e0dG1942G37beG13c5G2c58G41d7G1db9G2de8G304aG530G7ceG45aG34a7G418eG1c4(He drew the conclusion by building on his own investigation.) After labelling, the corresponding queries are &amp;quot;G41d7 G1db9G2de8G304a&amp;quot;, and &amp;quot;G530G7ce G34a7G418e&amp;quot;, that is, when a user input these queries, the above example sentence should be picked out.</Paragraph>
      <Paragraph position="1"> After we labelled all 964 sentences, we performed the sentence retrieval module on the sentence set, that is, PENS retrieved example sentences for each of the 50 queries. Therefore, for each query, we compared the sentence set retrieved by PENS with the sentence labelled manually, and evaluate the performance by estimating the precision and the recall.</Paragraph>
      <Paragraph position="2"> Let A denotes the number of sentences which is selected by both human and the machine, B denotes the number of sentences which is selected only by the machine, and C denotes the number of sentences which is selected only by human.</Paragraph>
      <Paragraph position="3"> The precision of the retrieval to query i, say Pi, is estimated by Pi = A / B and the recall Ri,is estimated by Ri = A/C. The average precision  The experimental results are P = 83.3%, and R = 55.7%. The user only cares if he could obtain a useful example sentence, and it is unnecessary for the system to find out all the relevant sentences in the bilingual sentence corpus.</Paragraph>
      <Paragraph position="4"> Therefore, example sentence retrieval in PENS is different from conventional text retrieval at this point.</Paragraph>
      <Paragraph position="5"> Conclusion In this paper, based on the comprehensive study of Chinese users requirements, we propose a unified approach to machine aided English writing system, which consists of two components: 1) a statistical approach to word spelling help, and 2) an information retrieval based approach to intelligent recommendation by providing suggestive example sentences. While the former works at the word or phrase level, the latter works at the sentence level. Both components work together in a unified way, and highly improve the productivity of English writing.</Paragraph>
      <Paragraph position="6"> We also develop a pilot system, namely PENS,wherewetrytofindanefficientwayin which human collaborate with computers.</Paragraph>
      <Paragraph position="7"> Although many components of PENS are under development, primary experiments on two standard testing sets have already shown very promising results.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML