<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1017">
  <Title>Splitting Input Sentence for Machine Translation Using Language Model with Sentence Similarity</Title>
  <Section position="3" start_page="1" end_page="1" type="metho">
    <SectionTitle>
2 Splitting Method
</SectionTitle>
    <Paragraph position="0"> We define the term sentence-splitting as the result of splitting a sentence. A sentence-splitting is expressed as a list of sub-sentences that are  portions of the original sentence. A sentence-splitting includes a portion or several portions. We use an N-gram Language Model (NLM) to generate sentence-splitting candidates, and we use the NLM and sentence similarity to select one of the candidates. The configuration of the method is shown in Figure 1.</Paragraph>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
2.1 Probability Based on N-gram Language
Model
</SectionTitle>
      <Paragraph position="0"> The probability of a sentence can be calculated by an NLM of a corpus. The probability of a sentence-splitting, Prob, is defined as the product of the probabilities of the sub-sentences in equation (1), where P is the probability of a sentence based on an NLM, S is a sentence-splitting, that is, a list of sub-sentences that are portions of a sentence, and P is applied to the sub-sentences.</Paragraph>
      <Paragraph position="2"> To judge whether a sentence is split at a position, we compare the probabilities of the sentence-splittings before and after splitting.</Paragraph>
      <Paragraph position="3"> When calculating the probability of a sentence including a sub-sentence, we put pseudo words at the head and tail of the sentence to evaluate the probabilities of the head word and the tail word.</Paragraph>
      <Paragraph position="4"> For example, the probability of the sentence, &amp;quot;This is a medium size jacket&amp;quot; based on a trigram language model is calculated as follows. Here, p(z  |x y) indicates the probability that z occurs after the sequence x y, and SOS and EOS indicate the pseudo words.</Paragraph>
      <Paragraph position="5"> P(this is a medium size jacket) =</Paragraph>
      <Paragraph position="7"> This causes a tendency for the probability of the sentence-splitting after adding a splitting position to be lower than that of the sentence-splitting before adding the splitting position. Therefore, when we find a position that makes the probability higher, it is plausible that the position divides the sentence into sub-sentences.</Paragraph>
    </Section>
    <Section position="2" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
2.2 Sentence Similarity
</SectionTitle>
      <Paragraph position="0"> An NLM suggests where we should split a sentence, by using the local clue of several words among the splitting position. To supplement it with a wider view, we introduce another clue based on similarity to sentences, for which translation knowledge is automatically acquired from a parallel corpus. It is reasonably expected that MT systems can correctly translate a sentence that is similar to a sentence in the training corpus.</Paragraph>
      <Paragraph position="1"> Here, the similarity between two sentences is defined using the edit-distance between word sequences. The edit-distance used here is extended to consider a semantic factor. The edit-distance is normalized between 0 and 1, and the similarity is 1 minus the edit-distance. The definition of the similarity is given in equation (2). In this equation, L is the word count of the corresponding sentence.</Paragraph>
      <Paragraph position="2"> I and D are the counts of insertions and deletions respectively. Substitutions are permitted only between content words of the same part of speech.</Paragraph>
      <Paragraph position="3"> Substitution is considered as the semantic distance between two substituted words, described as Sem, which is defined using a thesaurus and ranges from 0to1.Sem is the division of K (the level of the least common abstraction in the thesaurus of two words) by N (the height of the thesaurus) according to equation (3) (Sumita and Iida, 1991).</Paragraph>
      <Paragraph position="4">  , the similarity of a sentence-splitting to a corpus is defined as Sim in equation (4). In this equation, S is a sentence-splitting and C is a given corpus that is a set of sentences.</Paragraph>
      <Paragraph position="5"> Sim is a mean similarity of sub-sentences against the corpus weighted with the length of each subsentence. The similarity of a sentence including a sub-sentence to a corpus is the greatest similarity between the sentence and a sentence in the corpus.</Paragraph>
    </Section>
    <Section position="3" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
2.3 Generating Sentence-Splitting
Candidates
</SectionTitle>
      <Paragraph position="0"> To calculate Sim is similar to retrieving the most similar sentence from a corpus. The retrieval procedure can be efficiently implemented by the techniques of clustering (Cranias et al., 1997) or using A* search algorithm on word graphs (Doi et al., 2004). However, it still takes more cost to calculate Sim than Prob when the corpus is large.</Paragraph>
      <Paragraph position="1"> Therefore, in the splitting method, we first generate sentence-splitting candidates by Probalone.</Paragraph>
      <Paragraph position="2"> In the generating process, for a given sentence, the sentence itself is a candidate. For each sentence-splitting of two portions whose Probdoes not decrease, the generating process is recursively executed with one of the two portions and then with the other. The results of recursive execution are combined into candidates for the given sentence.</Paragraph>
      <Paragraph position="3"> Through this process, sentence-splittings whose Probs are lower than that of the original sentence, are filtered out.</Paragraph>
    </Section>
    <Section position="4" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
2.4 Selecting the Best Sentence-Splitting
</SectionTitle>
      <Paragraph position="0"> Next, among the candidates, we select the one with the highest score using not only Prob but also Sim. We use the product of Proband Simas the measure to select a sentence-splitting by. The measure is defined as Scorein equation (5), where l, ranging from 0 to 1, gives the weight of Sim.</Paragraph>
      <Paragraph position="1"> In particular, the method uses only Prob when l is 0, and the method generates candidates by Prob and selects a candidate by only Sim when l is 1.</Paragraph>
    </Section>
    <Section position="5" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
2.5 Example
</SectionTitle>
      <Paragraph position="0"> Here, we show an example of generating sentence-splitting candidates with Prob and selecting one by Score. For the input sentence, &amp;quot;This is a medium size jacket I think it's a good size for you try it on please&amp;quot;, there may be many candidates. Below, five candidates, whose Probare not less than that of the original sentence, are generated. A '|' indicates a splitting position. The left numbers indicate the ranking based on Prob. The</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="1" end_page="3" type="metho">
    <SectionTitle>
3 Experimental Conditions
</SectionTitle>
    <Paragraph position="0"> We evaluated the splitting method through experiments, whose conditions are as follows.</Paragraph>
    <Section position="1" start_page="1" end_page="3" type="sub_section">
      <SectionTitle>
3.1 MT Systems
</SectionTitle>
      <Paragraph position="0"> We investigated the splitting method using MT systems in English-to-Japanese translation, to determine what effect the method had on translation. We used two different EBMT systems as test beds. One of the systems was Hierarchical Phrase Alignment-based Translator (HPAT) (Imamura, 2002), whose unit of translation expression is a phrase. HPAT translates an input sentence by combining phrases. The HPAT system is equipped with another sentence splitting method based on parsing trees (Furuse et al., 1998). The other system was DP-match Driven transDucer</Paragraph>
      <Paragraph position="2"> ) (Sumita, 2001), whose unit of expression is a sentence. For both systems, translation knowledge is automatically acquired from a parallel corpus.</Paragraph>
    </Section>
    <Section position="2" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
3.2 Linguistic Resources
</SectionTitle>
      <Paragraph position="0"> We used Japanese-and-English parallel corpora, i.e., a Basic Travel Expression Corpus (BTEC) and a bilingual travel conversation corpus of Spoken Language (SLDB) for training, and English sentences in Machine-Translation-Aided bilingual Dialogues (MAD) for a test set (Takezawa and Kikui, 2003). BTEC is a collection of Japanese sentences and their English translations usually found in phrase-books for foreign tourists. The contents of SLDB are transcriptions of spoken dialogues between Japanese and English speakers through human interpreters. The Japanese and English parts of the corpora correspond to each other sentence-to-sentence. The dialogues of MAD took place between Japanese and English speakers through human typists and an experimental MT system.</Paragraph>
      <Paragraph position="1"> (Kikui et al., 2003) shows that BTEC and SLDB are both required for handling MAD-type tasks.</Paragraph>
      <Paragraph position="2"> Therefore, in order to translate test sentences in MAD, we merged the parallel corpora, 152,170 sentence pairs of BTEC and 72,365 of SLDB, into a training corpus for HPAT and D  . The English part of the training corpus was also used to make an NLM and to calculate similarities for the sentence-splitting method. The statistics of the training corpus are shown in Table 1. The perplexity in the table is word trigram perplexity. The test set of this experiment was 505 English sentences uttered by human speakers in MAD, including no sentences generated by the MT system. The average length of the sentences in the test set was 9.52 words per sentence. The word trigram perplexity of the test set against the training corpus was 63.66.</Paragraph>
      <Paragraph position="3"> We also used a thesaurus whose hierarchies are based on the Kadokawa Ruigo-shin-jiten (Ohno and Hamanishi, 1984) with 80,250 entries.</Paragraph>
    </Section>
    <Section position="3" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
3.3 Instantiation of the Method
</SectionTitle>
      <Paragraph position="0"> For the splitting method, the NLM was the word trigram model using Good-Turing discounting.</Paragraph>
      <Paragraph position="1"> The number of split portions was limited to 4 per sentence. The weight of Sim, l in equation (5) was assigned one of 5 values: 0, 1/2, 2/3, 3/4 or 1.</Paragraph>
    </Section>
    <Section position="4" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
3.4 Evaluation
</SectionTitle>
      <Paragraph position="0"> We compared translation quality under the conditions of with or without splitting. To evaluate translation quality, we used objective measures and a subjective measure as follows.</Paragraph>
      <Paragraph position="1"> The objective measures used were the BLEU score (Papineni et al., 2001), the NIST score (Doddington, 2002) and Multi-reference Word Error Rate (mWER) (Ueffing et al., 2002). They were calculated with the test set. Both BLEU and NIST compare the system output translation with a set of reference translations of the same source text by finding sequences of words in the reference translations that match those in the system output translation. Therefore, achieving higher scores by these measures means that the translation results can be regarded as being more adequate translations. mWER indicates the error rate based on the edit-distance between the system output and the reference translations. Therefore, achieving a lower score by mWER means that the translation results can be regarded as more adequate translations. The number of references was 15 for the three measures.</Paragraph>
      <Paragraph position="2"> In the subjective measure (SM), the translation results of the test set under different two conditions were evaluated by paired comparison.</Paragraph>
      <Paragraph position="3"> Sentence-by-sentence, a Japanese native speaker who had acquired a sufficient level of English, judged which result was better or that they were of the same quality. SM was calculated compared to a baseline. As in equation (6), the measure was the gain per sentence, where the gain was the number of won translations subtracted by the number of defeated translations as judged by the human</Paragraph>
    </Section>
    <Section position="5" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
4.1 Translation Quality
</SectionTitle>
      <Paragraph position="0"> Table 2 shows evaluations of the translation results of two MT systems, HPAT and D  , under six conditions. In 'original', the input sentences of the systems were the test set itself without any splitting. In the other conditions, the test set sentences were split using Probinto sentence-splitting candidates, and a sentence-splitting per input sentence was selected with Score. The weights of Prob and Sim in the definition of Score in equation (5) were varied from only Prob to only Sim. The baseline of SM was the original.</Paragraph>
      <Paragraph position="1"> The number of input sentences, which have multi-candidates generated with Prob, was 237, where the average and the maximum number of candidates were respectively 5.07 and 64. The average length of the 237 sentences was 12.79 words  per sentence. The word trigram perplexity of the set of the 237 sentences against the training corpus was 73.87.</Paragraph>
      <Paragraph position="2"> The table shows certain tendencies. The differences in the evaluation scores between the original and the cases with splitting are significant for both systems and especially for D  . Although the differences among the cases with splitting are not so significant, SM steadily increases when using Sim compared to using only Prob, by 3.2% for HPAT and by 2.4% for D  . Among objective measures, the NIST score corresponds well to SM.</Paragraph>
    </Section>
    <Section position="6" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
4.2 Effect of Selection Using Similarity
</SectionTitle>
      <Paragraph position="0"> Table 3 allows us to focus on the effect of Sim in the sentence-splitting selection. The table shows the evaluations on 237 sentences of the test set, where selection was required. In this table, the number of changes is the number of cases where a candidate other than the best candidate using Prob was selected. The table also shows the average and maximum Probranking of candidates which were not the best using Prob but were selected as the best using Score. The condition of 'IDEAL' is to select such a candidate that makes the mWER of its translation the best value in any candidate. In IDEAL, the selections are different between MT systems. The two values of the number of changes are for HPAT and for D  . The baseline of SM was the condition of using only Prob.</Paragraph>
      <Paragraph position="1"> From the table, we can extract certain tendencies. The number of changes is very small when using both Prob and Sim in the experiment. In these cases, the procedure selects the best candidates or the second candidates in the measure of Prob. Although the change is small when the weights of Prob and Sim are equal, SM shows that most of the changed translations become better, some remain even and none become worse.</Paragraph>
      <Paragraph position="2"> The heavier the weight of Sim is, the higher the SM score is. The NIST score also increases especially for D  when the weight of Sim increases.</Paragraph>
      <Paragraph position="3"> The IDEAL condition overcomes most of the conditions as was expected, except that the SM score and the NIST score of D  are worse than those in the condition using only Sim. For D</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="3" end_page="3" type="metho">
    <SectionTitle>
, the
</SectionTitle>
    <Paragraph position="0"> sentence-splitting selection with Sim is a match for the ideal selection.</Paragraph>
    <Paragraph position="1"> So far, we have observed that SM and NIST tend to correspond to each other, although SM and BLEU or SM and mWER do not. The NIST score uses information weights when comparing the result of an MT system and reference translations. We can infer that the translation of a sentencesplitting, which was judged as being superior to another by the human evaluator, is more informative than the other.</Paragraph>
    <Section position="1" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
4.3 Effect of Using Thesaurus
</SectionTitle>
      <Paragraph position="0"> Furthermore, we conducted an experiment without using a thesaurus in calculating Sim. In the definition of Sim, all semantic distances of Sem  a thesaurus (P indicates Proband S indicates Sim) were assumed to be equal to 0.5. Table 4 shows evaluations on the 237 sentences.</Paragraph>
      <Paragraph position="1"> Compared to Table 3, the SM score is worse when the weight of Sim in Score is small, and better when the weight of Sim is great. However, the difference between the conditions of using or not using a thesaurus is not so significant.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>