<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1020">
  <Title>Extracting Word Sequence Correspondences with Support Vector Machines</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Extracting Word Sequence
Correspondences with SVMs
3.1 Outline
</SectionTitle>
    <Paragraph position="0"> The method proposed in this paper can obtain word sequence correspondences (translation pairs) in the parallel corpora which include Japanese and English sentences. It consists of the following three steps:  1. Make training samples which include positive samples as translation pairs and negative samples as non-translation pairs from the training corpora manually, and learn a translation model from these with SVMs.</Paragraph>
    <Paragraph position="1"> 2. Make a set of candidates of translation pairs which are pairs of phrases obtained by parsing both Japanese sentences and English sentences. null 3. Extract translation pairs from the candidates by inputting them to the translation model made in step 1.</Paragraph>
    <Paragraph position="2"> 3.2 Features for the Translation Model  To apply SVMs for extracting translation pairs, the candidates of the translation pairs must be converted into feature vectors. In our method, they are composed of the following features: 1. Features which use an existing translation dictionary. null (a) Bilingual word pairs in the translation dictionary which are included in the candidates of the translation pairs.</Paragraph>
    <Paragraph position="3"> (b) Bilingual word pairs in the translation dictionary which are co-occurred in the context in which the candidates appear.</Paragraph>
    <Paragraph position="4">  2. Features which use the number of words. (a) The number of words in Japanese phrases. (b) The number of words in English phrases. 3. Features which use the part-of-speech.</Paragraph>
    <Paragraph position="5"> (a) The ratios of appearance of noun, verb, adjective and adverb in Japanese phrases.</Paragraph>
    <Paragraph position="6"> (b) The ratios of appearance of noun, verb, adjective and adverb in English phrases.</Paragraph>
    <Paragraph position="7"> 4. Features which use constituent words.</Paragraph>
    <Paragraph position="8"> (a) Constituent words in Japanese phrases.</Paragraph>
    <Paragraph position="9"> (b) Constituent words in English phrases.</Paragraph>
    <Paragraph position="10"> 5. Features which use neighbor words.</Paragraph>
    <Paragraph position="11"> (a) Neighbor words which appear in Japanese phrases just before or after.</Paragraph>
    <Paragraph position="12"> (b) Neighbor words which appear in English  phrases just before or after.</Paragraph>
    <Paragraph position="13"> Two types of the features which use an existing translation dictionary are used because the improvement of accuracy can be expected by e ectively using existing knowledge in the features. For features (1a), words included in a candidate of the translation pair are looked up with the translation dictionary and the bilingual word pairs in the candidate become features. They are based on the idea that a translation pair would include many bilingual word pairs. Each bilingual word pair included in the dictionary is allocated to the dimension of the feature vectors. If a bilingual word pair appears in the candidate of translation pair, the value of the corresponding dimension of the vector is set to 1, and otherwise it is set to 0. For features (1b), all pairs of words which co-occurred with a candidate of the translation pair are looked up with the translation dictionary and the bilingual word pairs in the dictionary become features. They are based on the idea that the context of the words which appear in neighborhood looks like each other for the translation pairs although expressed in the two di erent languages (Kaji and Aizono, 1996). The candidates are converted into the feature vectors just like (1a). Features (2a) (2b) are based on the idea that there is a correlation in the number of constituent words of the phrases of both languages in the translation pair. The number of constituent words of each language is used for the feature vector.</Paragraph>
    <Paragraph position="14"> Features (3a) (3b) are based on the idea that there is a correlation in the ratio of content words (noun, verb, adjective and adverb) which appear in the phrases of both languages in a translation pair. The ratios of the numbers of noun, verb, adjective and adverb to the number of words of the phrases of each language are used for the feature vector.</Paragraph>
    <Paragraph position="15"> For features (4a) (4b), each content word (noun, verb, adjective and adverb) is allocated to the dimension of the feature vectors for each language. If a word appears in the candidate of translation pair, the value of the corresponding dimension of the vector is set to 1, and otherwise it is set to 0. For features (5a) (5b), each content words (noun, verb, adjective and adverb) is allocated to the dimension of the feature vectors for each language. If a word appears in the candidate of translation pair just before or after, the value of the corresponding dimension of the vector is set to 1, and otherwise it is set to 0.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Learning the Translation Model
</SectionTitle>
      <Paragraph position="0"> Training samples which include positive samples as the translation pairs and negative samples as the non-translation pairs are made from the training corpora manually, and are converted into the feature vectors by the method described in section 3.2.</Paragraph>
      <Paragraph position="1"> For supervise signals yi, each positive sample is assigned to +1 and each negative sample is assigned to 1. The translation model is learned from them by SVMs described in section 2. As a result, the optimal parameters for SVMs are obtained.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Making the Candidate of the Translation
Pairs
</SectionTitle>
      <Paragraph position="0"> A set of candidates of translation pairs is made from the combinations of phrases which are obtained by parsing both Japanese and English sentences. How to make the combinations does not require sentence alignments between both languages. Because the set grows too big for all the combinations, the phrases used for the combinations are limited in upper bound of the number of constituent words and only noun phrases and verb phrases.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.5 Extracting the Translation Pairs
</SectionTitle>
      <Paragraph position="0"> The candidates of the translation pairs are converted into the feature vectors with the method described in section 3.2. By inputting them to equation (8) with the optimal parameters obtained in section 3.3, +1 or 1 could be obtained as the output for each vector. If the output is+1, the candidate corresponding to the input vector is the translation pair, otherwise it is not the translation pair.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> To confirm the e ectiveness of the method described in section 3, we did the experiments where the English Business Letter Example Collection published from Nihon Keizai Shimbun Inc. are used as parallel corpora, which include Japanese and English sentences which are examples of business letters, and are marked up at translation pairs.</Paragraph>
    <Paragraph position="1"> As both training and test corpora, 1,000 sentences were used. The translation pairs which are already marked up in the corpora were corrected to the form described in section 3.4 to be used as the positive samples. Japanese sentences were parsed by KNP 1 and English sentences were parsed by Apple Pie Parser 2. The negative samples of the same number as the positive samples were randomly chosen from combinations of phrases which were made by parsing and of which the numbers of constituent words were below 8 words. As a result, 2,000 samples (1,000 positives and 1,000 negatives) for both training and test were prepared.</Paragraph>
    <Paragraph position="2"> The obtained samples must be converted into the feature vectors by the method described in section 3.2. For features (1a) (1b), 94,511 bilingual word pairs included in EDICT 3 were prepared. For features (4a) (4b) (5a) (5b), 1,009 Japanese words and 890 English words which appeared in the training corpora above 3 times were used. Therefore, the number of dimensions for the feature vectors was 94;511 2+1 2+4 2+1;009+890+1;009+890= 192;830.</Paragraph>
    <Paragraph position="3"> S V Mlight 4 was used for the learner and the classifier of SVMs. For the kernel function, the squared polynomial kernel (p = 2 in equation (13)) was used, and the error weight C was set to 0:01.</Paragraph>
    <Paragraph position="4"> The translation model was learned by the training samples and the translation pairs were extracted from the test samples by the method described in  recall rate when the number of the training samples are increased Table 1 shows the precision rate and the recall rate of the extracted translation pairs, and table 2 shows examples of the extracted translation pairs.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Discussion
</SectionTitle>
    <Paragraph position="0"> Figure 2 shows the transition in the precision rate and the recall rate when the number of the training samples are increased from 100 to 2,000 by every 100 samples. The recall rate rose according to the number of the training samples, and reaching the level-o in the precision rate since 1,300. Therefore, it suggests that the recall rate can be improved without lowering the precision rate too much by increasing the number of the training samples.</Paragraph>
    <Paragraph position="1"> Figure 3 shows that the transition in the precision rate and the recall rate when the number of the bilingual word pairs in the translation dictionary are increased from 0 to 90,000 by every 5,000 pairs. The precision rate rose almost linearly according to the number of the pairs, and reaching the level-o in the recall rate since 30,000. Therefore, it suggests that the precision rate can be improved without lowering the recall rate too much by increasing the number of the bilingual word pairs in the translation dictionary.</Paragraph>
    <Paragraph position="2"> Table 3 shows the precision rate and the recall rate when each kind of features described in section 3.2 was removed. The values in parentheses in the columns of the precision rate and the recall rate are  recall rate when the number of the bilingual word pairs in the translation dictionary are increased di erences with the values when all the features are used. The fall of the precision rate when the features which use the translation dictionary (1a) (1b) were removed and the fall of the recall rate when the features which use the number of words (2a) (2b) were removed were especially large.</Paragraph>
    <Paragraph position="3"> It is clear that feature (1a) (1b) could restrict the translation model most strongly in all features.</Paragraph>
    <Paragraph position="4"> Therefore, if feature (1a) (1b) were removed, it causes a good translation model not to be able to be learned only by the features of the remainder because of the weak constraints, wrong outputs increased, and the precision rate has fallen.</Paragraph>
    <Paragraph position="5"> Only features (2a) (2b) surely appear in all samples although some other features appeared in the training samples may not appear in the test samples.</Paragraph>
    <Paragraph position="6"> So, in the test samples, the importance of features (2a) (2b) are increased on the coverage of the samples relatively. Therefore, if features (2a) (2b) were removed, it causes the recall rate to fall because of the low coverage of the samples.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Related Works
</SectionTitle>
    <Paragraph position="0"> With di erence from our method, there have been researches which are based on the assumption of the sentence alignments for parallel corpora (Gale and Church, 1991; Kitamura and Matsumoto, 1996; Melamed, 1997). (Gale and Church, 1991) has used the 2 statistics as the correspondence level of the word pairs and has showed that it was more e ective than the mutual information. (Kitamura and Matsumoto, 1996) has used the Dice coe cient (Kay and R&amp;quot;oschesen, 1993) which was weighted by the logarithm of the frequency of the word pair as the  correspondence level of the word pairs. (Melamed, 1997) has proposed the Competitive Linking Algorithm for linking the word pairs and a method which calculates the optimized correspondence level of the word pairs by hill climbing.</Paragraph>
    <Paragraph position="1"> These methods could archive high accuracy because of the assumption of the sentence alignments for parallel corpora, but they have the problem with narrow applicable domains because there are not too many parallel corpora with sentence alignments at present. However, because our method does not require sentence alignments, it can be applied for wider applicable domains.</Paragraph>
    <Paragraph position="2"> Like our method, researches which are not based on the assumption of the sentence alignments for parallel corpora have been done (Kaji and Aizono, 1996; Tanaka and Iwasaki, 1996; Fung, 1997).</Paragraph>
    <Paragraph position="3"> They are based on the idea that the context of the words which appear in neighborhood looks like each other for the translation pairs although expressed in two di erent languages. (Kaji and Aizono, 1996) has proposed the correspondence level calculated by the size of intersection between co-occurrence sets with the word included in an existing translation dictionary. (Tanaka and Iwasaki, 1996) has proposed a method for obtaining the bilingual word pairs by optimizing the matrix of the translation probabilities so that the distance of the matrices of the probabilities of co-occurrences of words which appeared in each language might become small. (Fung, 1997) has calculated the vectors in which the weighted mutual information between the word in the corpora and the word included in an existing translation dictionary was an element, and has used these inner products as the correspondence level of word pairs.</Paragraph>
    <Paragraph position="4"> There is a common point between these method and ours on the idea that the context of the words which appear in neighborhood looks like each other for the translation pairs because features (1b) are based on the same idea. However, since our method caught extracting the translation pairs as the approach of the statistical machine learning, it could be expected to improve the performance by adding new features to the translation model. In addition, if learning the translation model for the training samples is done once with our method, the model need not be learned again for new samples although it needs the positive and negative samples for the training data. However, the methods introduced above must learn a new model again for new corpora. null (Sato and Nakanishi, 1998) has proposed a method for learning a probabilistic translation model with Maximum Entropy (ME) modeling which was the same approach of the statistical machine learning as SVMs, in which co-occurrence information and morphological information were used as features and has archived 58.25 % accuracy with 4,119 features. ME modeling might be similar to SVMs on using features for learning a model, but feature selection for ME modeling is more di cult because ME modeling is easier to cause over-fit for training samples than SVMs. In addition, ME modeling cannot learn dependencies between features, but SVMs can learn them automatically using a kernel function. Therefore, SVMs could learn more complex and e ective model than ME modeling.</Paragraph>
  </Section>
class="xml-element"></Paper>