<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0311">
  <Title>Retrieving Meaning-equivalent Sentences for Example-based Rough Translation</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Difficulty in Example-based S2ST
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Translation Degradation by Input Length
</SectionTitle>
      <Paragraph position="0"> A major problem with machine translation, regardless of the translation method, is that performance drops rapidly as input sentences become longer. For EBMT, the longer an input sentence becomes, the fewer similar example sentences exist in the example corpus. Figure 1 shows how difficult long sentences are to translate in EBMT (Sumita, 2001). The EBMT system was given 591 test sentences and returned each result as either translated or untranslated.</Paragraph>
      <Paragraph position="1"> Untranslated means that no similar example sentence exists for the input. Although the EBMT system is equipped with a large example corpus (about 170K sentences), it often failed to translate long inputs.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Style Differences between Concise and Conversational
</SectionTitle>
      <Paragraph position="0"> The performance of example-based S2ST greatly depends on the example corpus. It is advantageous for an example corpus to be large and to have the same style as the input sentences. A corpus of texts transcribed from conversational speech is favorable for S2ST. Unfortunately, it is very difficult to prepare such an example corpus, since this task requires laborious work such as speech recording and speech transcription.</Paragraph>
      <Paragraph position="1"> Therefore, we cannot avoid using a written-style corpus, such as phrasebooks, to prepare a sufficiently large volume of examples. The texts in such a corpus are mostly grammatical and rarely contain unnecessary words. We call the style used in such a corpus &quot;concise&quot; and the style seen in conversational speech &quot;conversational.&quot; Table 1 shows the average numbers of words in the concise (Takezawa et al., 2002) and conversational (Takezawa, 1999) corpora. Sentences in conversational style are about 2.5 words longer than those in concise style in both English and Japanese. This is because conversational-style sentences contain unnecessary words or subordinate clauses, which assist the listener's comprehension and avoid giving the listener a curt impression.</Paragraph>
      <Paragraph position="2"> Table 2 shows the cross perplexity between the concise and conversational corpora (Takezawa et al., 2002). Perplexity is a metric for how well a language model derived from a training set matches a test set (Jurafsky and Martin, 2000). Cross perplexities between the concise and conversational corpora are much higher than the self-perplexity of either style. This result also illustrates the great difference between the two styles.</Paragraph>
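      <Paragraph position="3"> For reference, the following is a sketch of the standard textbook definition of perplexity. The notation (W, w_i, N, P) is introduced here for illustration and does not appear in the original text. For a test set W = w_1 ... w_N and a language model P:

```latex
PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}
```

Under this reading, cross perplexity is obtained when P is trained on one style and W is drawn from the other, and self-perplexity when both come from the same style.</Paragraph>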
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Meaning-equivalent Sentence
</SectionTitle>
    <Paragraph position="0"> Example-based S2ST has the difficulties described in Section 2 when it attempts to translate inputs exactly.</Paragraph>
    <Paragraph position="1"> Here, we set our translation goal to translating input sentences not exactly but roughly. We assume that a rough translation is useful enough for S2ST, since unimportant information rarely disturbs the progress of dialogs and can be recovered in the following dialog if needed. We call this translation strategy &quot;rough translation.&quot; We propose the &quot;meaning-equivalent sentence&quot; as a means of carrying out rough translation. It is defined as follows. Meaning-equivalent sentence (to an input sentence): a sentence that shares the main meaning with the input sentence despite lacking some unimportant information. It does not contain information additional to that in the input sentence.</Paragraph>
    <Paragraph position="2"> Information is subjectively recognized as unimportant mainly for one of two reasons: (1) it can be surmised from the general situation, or (2) it does not place a strong restriction on the main information.</Paragraph>
    <Paragraph position="3"> Information to be examined is written in bold. The information &quot;of me&quot; in (1) and &quot;around here&quot; in (3) can be surmised from the general situation, while the information &quot;of this painting&quot; in (2) and &quot;Chinese&quot; would not be surmised since each denotes a special object. The subordinate clauses in (4) and (5) are regarded as unimportant since they have little significance and can be omitted.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Basic Idea of Retrieval
</SectionTitle>
      <Paragraph position="0"> The retrieval of meaning-equivalent sentences depends on content words and basically does not depend on functional words. Independence from functional words makes retrieval robust to differences in style.</Paragraph>
      <Paragraph position="1"> However, functional words carry information that is important for sentence meaning: the case relations of content words, modality, and tense. The lack of case relation information is compensated for by the nature of the restricted domain. A restricted domain, such as that of S2ST, has a relatively small lexicon and a limited variety of meanings. Therefore, once the content words in an input are given, their relations are almost determined within the domain. Information on modality and tense is extracted from functional words and used in classifying the meaning of a sentence (described in Section 3.2.2).</Paragraph>
      <Paragraph position="2"> This retrieval method is similar to information retrieval in that content words are used as clues for retrieval (Frakes and Baeza-Yates, 1992). However, our task has two difficulties: (1) Retrieval is carried out not on documents but on single sentences, which reduces the effectiveness of word frequencies. (2) Differences in modality and tense between sentences have to be considered, since they play an important role in determining a sentence's communicative meaning.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Features for Retrieval
3.2.1 Content Words
</SectionTitle>
      <Paragraph position="0"> Words categorized as noun, adjective, adverb, or verb are recognized as content words. Interrogatives are also included. Words such as particles, auxiliary verbs, conjunctions, and interjections are recognized as functional words.</Paragraph>
      <Paragraph position="1"> We utilize a thesaurus to expand the coverage of the example corpus. We call the relation between two words that are the same &quot;identical&quot; and the relation between words that are synonymous in the given thesaurus &quot;synonymous.&quot; The meaning of a sentence is discriminated by its modality and tense, since these factors obviously determine meaning. We defined two modality groups and one tense group by examining our corpus. The modality groups are (&quot;request&quot;, &quot;desire&quot;, &quot;question&quot;, &quot;confirmation&quot;, &quot;others&quot;) and (&quot;negation&quot;, &quot;others&quot;). The tense group is (&quot;past&quot;, &quot;others&quot;). These modalities and tenses are distinguished by surface clues, mainly particles and auxiliary verbs. Table 3 shows some of the clues used for discriminating modalities in Japanese. Sentences having no clues are classified as &quot;others&quot;. Figure 3 shows sample sentences and their modality and tense; clues are underlined. (Footnote 2: Japanese content words are written in sans serif style and Japanese functional words in italic style.)</Paragraph>
      <Paragraph position="3"> Figure 3: Sample sentences and their modality and tense.
hoteru wo yoyaku shi tekudasai (Will you reserve this hotel?): request
hoteru wo yoyaku shi tai (I want to reserve this hotel.): desire
hoteru wo yoyaku shi mashi ta ka? (Did you reserve this hotel?): question, past
hoteru wo yoyaku shi tei masen (I do not reserve this hotel.): negation</Paragraph>
      <Paragraph position="4"> A speech act is a concept similar to modality in that it represents the speaker's intention. Two studies have introduced speech act information into their S2ST systems (Wahlster, 2000; Tanaka and Yokoo, 1999). Those studies and our method differ in the role that speech act information plays. In those systems its effect is small and is limited to generating the translation text: translations are refined by selecting proper expressions according to the detected speaker's intention. In our method, modality and tense are instead used as a condition for retrieval (Section 3.3).</Paragraph>
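      <Paragraph position="5"> The clue-based classification of modality and tense described in this section can be sketched as follows. This is a minimal illustration, not the authors' implementation: the clue tables contain only clues guessed from the Figure 3 sample sentences (the actual clue list in Table 3 is not reproduced here), the function name classify is hypothetical, and the input is assumed to be an already segmented list of functional words.

```python
# Illustrative clue tables: a hypothetical, partial mapping guessed from the
# Figure 3 sample sentences; the paper's actual clue list is larger.
MODALITY_CLUES = {"tekudasai": "request", "tai": "desire", "ka": "question"}
NEGATION_CLUES = {"masen"}
PAST_CLUES = {"ta"}


def classify(functional_words):
    """Return (modality, negation, tense) labels for a list of functional words."""
    modality = "others"  # sentences with no clue fall into "others"
    for word in functional_words:
        if word in MODALITY_CLUES:
            modality = MODALITY_CLUES[word]
            break
    negation = "negation" if NEGATION_CLUES.intersection(functional_words) else "others"
    tense = "past" if PAST_CLUES.intersection(functional_words) else "others"
    return modality, negation, tense


# "hoteru wo yoyaku shi mashi ta ka?" with its functional words pre-segmented.
print(classify(["wo", "mashi", "ta", "ka"]))  # ('question', 'others', 'past')
```

Keeping the two modality groups and the tense group as three separate labels mirrors the grouping in the text, so a sentence can be, for example, both a question and in the past tense.</Paragraph>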
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Retrieval and Ranking
</SectionTitle>
      <Paragraph position="0"> Sentences that satisfy the conditions below are recognized as meaning-equivalent sentences.</Paragraph>
      <Paragraph position="1">  1. It has the same modality and tense as the input sentence.</Paragraph>
      <Paragraph position="2"> 2. All of its content words are included (identical or synonymous) in the input sentence. That is, the set of content words of a meaning-equivalent sentence is a subset of that of the input.</Paragraph>
      <Paragraph position="3"> 3. At least one of its content words is included (identical) in the input sentence.</Paragraph>
      <Paragraph position="4"> If more than one sentence is retrieved, we must rank them to select the most similar one. We introduce a &quot;focus area&quot; in the ranking process to select sentences that are meaning-equivalent to the main clause of a complex sentence. We set the focus area as the last N words from the word list of an input sentence, where N denotes the number of content words in the meaning-equivalent sentence. This is because the main clause of a complex sentence tends to be placed at the end in Japanese.</Paragraph>
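      <Paragraph position="5"> The three retrieval conditions and the focus area can be sketched as below. This is an illustration under stated assumptions rather than the authors' code: the dictionary keys "content" and "label", the synonyms mapping, the function names, and the word segmentation of the example are all hypothetical, and the focus area is taken over the input's content-word list, which is one possible reading of &quot;word list&quot;.

```python
def is_meaning_equivalent(example, query, synonyms):
    """Check the three retrieval conditions for one example sentence.

    example, query: dicts with hypothetical keys "content" (list of content
    words) and "label" (a modality/tense label).
    synonyms: hypothetical dict mapping a word to the set of its synonyms.
    """
    query_words = set(query["content"])

    def covered(word):
        # Identical match, or synonymous with some content word of the input.
        return word in query_words or bool(
            synonyms.get(word, set()).intersection(query_words)
        )

    same_label = example["label"] == query["label"]                # condition 1
    subset = all(covered(w) for w in example["content"])           # condition 2
    identical = any(w in query_words for w in example["content"])  # condition 3
    return same_label and subset and identical


def focus_area(query_words, example):
    """Last N words of the input, where N is the example's content-word count."""
    n = len(example["content"])
    return query_words[-n:] if n > 0 else []


# Assumed segmentation of the stolen-baggage example; not taken from the paper.
query = {"content": ["gaishutsu", "kaban", "nusuma"], "label": "past"}
example = {"content": ["baggu", "nusuma"], "label": "past"}
synonyms = {"baggu": {"kaban"}}
print(is_meaning_equivalent(example, query, synonyms))  # True
print(focus_area(query["content"], example))            # ['kaban', 'nusuma']
```

Condition 3 is checked separately from condition 2 so that an example matching the input only through synonyms is rejected.</Paragraph>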
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Input
</SectionTitle>
      <Paragraph position="0"> gaishutsu shi teiru aida ni, (While I was out), kaban wo nusuma re mashi ta (my baggage was stolen.)</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Meaning-equivalent Sentence
</SectionTitle>
      <Paragraph position="0"> baggu wo nusuma re ta</Paragraph>
      <Paragraph position="2"> Retrieved sentences are ranked by the conditions described below, given in order of priority. If there is more than one sentence having the highest ... Content words in the focus area of the input are underlined and functional words are written in italic.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Experiment
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Test Data
</SectionTitle>
      <Paragraph position="0"> We used a bilingual corpus of travel conversation, which has Japanese sentences and their English translations (Takezawa et al., 2002). This corpus was sentence-aligned, and morphological analysis was performed on both languages with our morphological analysis tools. The bilingual corpus was divided into example data (Example) and test data (Concise) by extracting the test data randomly from the whole data set.</Paragraph>
      <Paragraph position="1"> In addition, we used a conversational speech corpus as another set of test data (Takezawa, 1999). This corpus contains dialogs between a traveler and a hotel receptionist. It is used to test robustness to style differences. We call this test corpus &quot;Conversational.&quot; From the three corpora, we use only sentences that include more than one content word. The statistics of the three corpora are shown in Table 4.</Paragraph>
      <Paragraph position="2"> The thesaurus used in the experiment was &quot;Kadokawa-Ruigo-Jisho&quot; (Ohno and Hamanishi, 1984). Each word has a semantic code consisting of three digits; that is, this thesaurus has three hierarchical levels. We defined &quot;synonymous&quot; words as those sharing exactly the same semantic code.</Paragraph>
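      <Paragraph position="3"> The distinction between &quot;identical&quot; and &quot;synonymous&quot; words can be sketched as follows. This is only an illustration: the three-digit codes below are invented placeholders and not actual thesaurus entries, the function name relation is hypothetical, and each word is assumed to have a single semantic code.

```python
# Hypothetical three-digit semantic codes; real codes come from the thesaurus.
SEMANTIC_CODE = {"kaban": "941", "baggu": "941", "hoteru": "883"}


def relation(word_a, word_b, codes=SEMANTIC_CODE):
    """Return "identical", "synonymous", or None for a pair of content words."""
    if word_a == word_b:
        return "identical"
    code_a, code_b = codes.get(word_a), codes.get(word_b)
    # Synonymous only when both words are listed and all three digits agree.
    if code_a is not None and code_a == code_b:
        return "synonymous"
    return None


print(relation("kaban", "baggu"))   # synonymous
print(relation("kaban", "hoteru"))  # None
```

Requiring the full code to match corresponds to the exact-code definition above; a looser variant could compare only the leading digits, but that is not what the text describes.</Paragraph>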
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Compared Retrieval Methods
</SectionTitle>
      <Paragraph position="0"> We compare the proposed method with two example-based retrieval methods to show its characteristics. The first method (Method-1) uses &quot;strict&quot; retrieval, which does not allow missing words in the input and takes functional words into account during retrieval. This method corresponds to the conventional EBMT method. The second method (Method-2) uses &quot;rough&quot; retrieval, which does allow missing words in the input but still takes functional words into account.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Evaluation Methodology
</SectionTitle>
      <Paragraph position="0"> Evaluation was carried out by judging whether retrieved sentences are meaning-equivalent to inputs. It must be noted that inputs and retrieved sentences are both in Japanese. We did not compare inputs and translations of retrieved sentences, since translation accuracy is a matter of the example corpus and does not concern our method.</Paragraph>
      <Paragraph position="1"> The sentence with the highest score among the retrieved sentences was taken and evaluated. Each such sentence was manually marked as meaning-equivalent or not by a native Japanese speaker. A meaning-equivalent sentence includes all important information in the input but may lack some unimportant information.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.4 Results
</SectionTitle>
      <Paragraph position="0"> Figure 5 shows the accuracy of the three methods on the concise and conversational style data. Accuracy is defined as the ratio of the number of inputs for which the retrieved sentence was judged meaning-equivalent to the total number of inputs. Inputs are classified into four types by their length in words.</Paragraph>
      <Paragraph position="1"> The performance of Method-1 reflects the narrow coverage and style-dependency of conventional EBMT. The longer input sentences become, the more steeply its performance degrades in both styles. The method can retrieve no similar sentence for inputs longer than eleven words in conversational style.</Paragraph>
      <Paragraph position="2"> Method-2 adopts a &quot;rough&quot; strategy in retrieval. It attains higher accuracy than Method-1, especially with longer inputs. This indicates the robustness of the rough retrieval strategy to longer inputs. However, the method still has an accuracy difference of about 15% between the two styles.</Paragraph>
      <Paragraph position="3"> The accuracy of the proposed method is better than that of Method-2, especially in conversational style. For longer inputs, its accuracy difference between the two styles (about 4%) is smaller than that of Method-2. This indicates the robustness of the proposed method to the differences between the two styles.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Related Work
5.1 EBMT
</SectionTitle>
    <Paragraph position="0"> The rough translation proposed in this paper is a type of EBMT (Sumita, 2001; Veale and Way, 1997; Carl, 1999; Brown, 2000). The basic idea of EBMT is that sentences similar to the inputs are retrieved from an example corpus and their translations become the basis of outputs.</Paragraph>
    <Paragraph position="1"> Here, let us consider the difference between our method and other EBMT methods by dividing similarity into a content-word part and a functional-word part. In the content-word part, our method and other EBMT methods are almost the same. Content words are important information in the similarity measure, and thesauri are utilized to extend lexical coverage. In the functional-word part, our method is characterized by disregarding functional words, while other EBMT methods still rely on them for the similarity measure. In our method, the lack of functional word information is compensated for by the semantically narrow variety in S2ST domains and the use of information on modality and tense.</Paragraph>
    <Paragraph position="2"> Consequently, our method gains robustness to input length and to the style differences between inputs and the example corpus.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Translation Memory
</SectionTitle>
      <Paragraph position="0"> Translation memory (TM) is aimed at retrieving informative translation examples from an example corpus. TM and our method share a retrieval strategy of rough, wide-coverage matching. However, recall is weighted more highly than precision in TM, while recall and precision should be considered equally in our method. To carry out wide-coverage retrieval, TM systems have relaxed various conditions on inputs: preserving only mono-grams and bi-grams of words or characters (Baldwin, 2001; Sato, 1992), removing functional words (Kumano et al., 2002; Wakita et al., 2000), and removing content words (Sumita and Tsutsumi, 1988). In our method, information on functional words is removed and information on modality and tense is introduced instead. Information on word order is also removed; instead, we preserve information on whether each word is located in the focus area.</Paragraph>
    </Section>
  </Section>
</Paper>