<?xml version="1.0" standalone="yes"?>
<Paper uid="C90-3055">
  <Title>A Machine Translation System for Foreign News in Satellite Broadcasting</Title>
  <Section position="2" start_page="0" end_page="0" type="abstr">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Since December 1986, NHK, the Japan Broadcasting Corporation, has been conducting two-channel direct satellite broadcasting using Japan's BS-2b broadcasting satellite. The two satellite channels now provide 24-hour nationwide TV broadcasting services. The core of the services on Channel 1 is &quot;World News,&quot; in which news from across the globe is broadcast.</Paragraph>
    <Paragraph position="1"> The languages spoken in NHK's World News are English, French, German, Italian, Russian, Korean, and Chinese. Urgent and important news is accompanied by simultaneous interpretation. In most cases, however, the service only superimposes Japanese subtitles on the TV screen. In practice, more than 50 bilingual translators prepare manuscripts by transcribing and translating the original news. All the work must be done in a limited time, even at midnight, because of the time difference between Japan and other countries.</Paragraph>
    <Paragraph position="2"> A machine translation system was introduced to make this daily work easier. As a first step, the English World News has been broadcast experimentally, for about 5 minutes a day, using the Japanese translation provided by the MT system. We believe this opens up a new possibility for machine translation in Japan [1].</Paragraph>
    <Paragraph position="3"> 2 Satellite Broadcasting and Machine Translation
Usually the generation of the subtitles proceeds as follows: 1) a bilingual translator prepares a manuscript by transcribing and translating the original news; 2) a supervisor examines the manuscript; and 3) an operator inputs the final manuscript into the processing equipment.</Paragraph>
    <Paragraph position="4"> Our MT system was introduced in step 1. First of all, the original news is heavily summarized in English, since a subtitle script is limited to at most 30 Japanese characters per display screen. Pre-editing is also carried out in this step to provide better input for the MT system. After post-editing, the final result is passed to step 2.</Paragraph>
    <Paragraph position="5"> The system is based on the STAR machine translation system [2] and basically uses the transfer method. The translation process can be divided into four main steps: morphological analysis, syntactic analysis, transfer, and generation. The morphological analysis identifies words as well as locally fixed sequences of words. In the syntactic analysis, all the possible surface structures for an input sentence are derived, and then the best candidates are chosen by using the &quot;weight mechanism&quot; described below.</Paragraph>
    <Paragraph position="6"> At present, the dictionary has about 100,000 entries, and the grammar has about 3,000 CFG-type rules. The system can translate a sentence of 11 words on average within 2 seconds on a 3-MIPS UNIX computer. Further characteristics of the system are discussed below.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Characteristics of News Sentences
</SectionTitle>
      <Paragraph position="0"> Examining a large body of English news, consisting of more than 3.5 million words from the World News and the basic news service of AP (Associated Press), we can summarize the linguistic properties of the news sentences as follows: 1) About 75,000 different words are used, and they are difficult to classify by news field.</Paragraph>
      <Paragraph position="1"> 2) Various types of proper nouns, such as names of people, nations, and locations, appear frequently. Personal names often appear with related words such as titles.</Paragraph>
      <Paragraph position="2"> 3) Many verbs that take human subjects are used. Among others, &quot;say,&quot; &quot;call,&quot; &quot;report,&quot; &quot;talk,&quot; &quot;ask,&quot; &quot;think,&quot; &quot;want,&quot; and &quot;feel&quot; are often found. 4) Many kinds of numerical expressions appear. Some of them are too complex to translate.</Paragraph>
      <Paragraph position="3"> 5) Colloquial expressions appear frequently.</Paragraph>
      <Paragraph position="4"> 3.2 Local Preprocessor for Proper Nouns
To handle characteristics 1) and 2) of the news sentences described above, our MT system has a preprocessing step called &quot;Local Context Translation&quot; (LOCT), which constitutes the second part of the morphological analysis. Its main role is to identify and translate various types of locally fixed sequences of words such as &quot;U.S. President George Bush.&quot; The LOCT can translate personal names with related words such as titles, identify undefined proper nouns, and select translation words. To analyze local patterns, the LOCT has a set of CFG rules, separate from the global analysis rules, as shown in Figure 1. With these rules, &quot;President&quot; can be translated differently into Japanese depending on the preceding word. R NAME picks up an undefined proper noun from the input text as a sequence of one capital letter followed by lowercase letters.</Paragraph>
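      <Paragraph position="5"> The proper-noun pattern described above can be illustrated with a small Python sketch. It is not the LOCT rule set itself: it only assumes that a candidate is one capital letter followed by lowercase letters and that the word is missing from the dictionary. The names NAME_PATTERN, find_undefined_names, toy_dictionary, and the sample sentence are hypothetical.

```python
import re

# Assumption: an undefined proper-noun candidate is one capital letter
# followed by one or more lowercase letters, as the text describes.
NAME_PATTERN = re.compile(r"[A-Z][a-z]+")


def find_undefined_names(tokens, dictionary):
    """Return tokens that look like proper nouns but are not in the dictionary."""
    return [
        tok for tok in tokens
        if NAME_PATTERN.fullmatch(tok) and tok.lower() not in dictionary
    ]


# Hypothetical toy dictionary and input, for illustration only.
toy_dictionary = {"the", "president", "met", "minister"}
tokens = "The President met Minister Tanaka".split()
print(find_undefined_names(tokens, toy_dictionary))  # ['Tanaka']
```

Capitalized words that the dictionary already knows, such as a sentence-initial article or a title, are filtered out, so only the unknown name remains as a candidate.</Paragraph>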
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Robust Processing of Undefined Words
</SectionTitle>
      <Paragraph position="0"> In addition to a large dictionary, our system has a powerful processor for undefined words to cover a wide range of news sentences. The processor estimates lexical entries for undefined words and passes them to the syntactic analyzer.</Paragraph>
      <Paragraph position="1"> The main processing functions are: 1) As explained in 3.2, the LOCT can estimate the grammatical values of undefined words by identifying local patterns.</Paragraph>
      <Paragraph position="2"> 2) Many English words have characteristic endings that correspond to grammatical values. For example, a word ending with &quot;-tion(s),&quot; &quot;-ly,&quot; or &quot;-able&quot; can be estimated to be a noun, an adverb, or an adjective, respectively.</Paragraph>
      <Paragraph position="3"> 3) The processor has some heuristic rules for a word starting with a capital letter, a short word, and a sequence of digits.</Paragraph>
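      <Paragraph position="4"> The ending-based estimation and the heuristics listed above might be sketched in Python as follows. This is a minimal illustration, not the system's actual rules: the three endings come from the text, while the category labels returned for capitalized words and digit sequences, the omission of the short-word heuristic, and all identifiers are assumptions.

```python
# Endings taken from the examples in the text; a real system would need
# many more. Longer endings are listed first so they are tried first.
SUFFIX_TO_CATEGORY = [
    ("tions", "noun"),
    ("tion", "noun"),
    ("able", "adjective"),
    ("ly", "adverb"),
]


def estimate_category(word):
    """Guess a grammatical category for an undefined word, or None."""
    lowered = word.lower()
    for suffix, category in SUFFIX_TO_CATEGORY:
        if lowered.endswith(suffix):
            return category
    if word.isdigit():
        return "numeral"      # assumed outcome for a sequence of digits
    if word[:1].isupper():
        return "proper noun"  # assumed outcome for a capitalized word
    return None               # no estimate available


for w in ["privatization", "reportedly", "negotiable", "Gdansk", "1990"]:
    print(w, estimate_category(w))
```

Because the endings are checked before capitalization, a capitalized name that happens to end in one of them would be misclassified; a fuller version would have to combine both cues.</Paragraph>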
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Structural Disambiguation by
Weighting Grammatical Rules
</SectionTitle>
      <Paragraph position="0"> The syntactic analysis consists of two steps.</Paragraph>
      <Paragraph position="1"> 1) All the possible surface structures for an input sentence are derived as an AND/OR graph [3].</Paragraph>
      <Paragraph position="2"> 2) The best candidates are extracted from this graph using our &quot;weight mechanism,&quot; which can be formulated as a search problem for an AND/OR graph having nodes with costs.</Paragraph>
      <Paragraph position="4"> Italicized numerals show the weights of words given by the lexicon. The weight of any other node is calculated from those of its daughter nodes and the corresponding rule. For example, the weight of the VP3 node is calculated by using the second rule in Figure 2 as follows: 2 (aux) + 3 (vt2) + 1.5 * 7 (NP2) + 4 (NPS) + 5 + 2 = 26.5. Among the three VP candidates in this example, VP2 is chosen as the best one, since it has the smallest weight. The weight represents some kind of &quot;incomprehensibility,&quot; &quot;complexity,&quot; or &quot;rareness&quot; of a word, a phrase, or a sentence. All the words and rules in our system have been assigned their own weights. Our experiments on machine translation in satellite broadcasting show that the best candidates are chosen for about 78% of the successfully analyzed World News sentences.</Paragraph>
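      <Paragraph position="5"> The worked example above can be reproduced with a short Python sketch. The exact form of the rule in Figure 2 is not available here, so the sketch assumes that a node weight is a weighted sum of the daughter weights plus additive terms from the rule, with a coefficient of 1.5 on NP2 only and the terms 5 and 2 treated as rule constants. The function names and the weights given for VP1 and VP2 are hypothetical.

```python
def node_weight(daughter_weights, coefficients, constants):
    """Weight of a parent node: weighted sum of daughters plus rule constants."""
    weighted = sum(c * w for c, w in zip(coefficients, daughter_weights))
    return weighted + sum(constants)


# Daughter weights from the worked example, in order: aux, vt2, NP2, NPS.
daughters = [2, 3, 7, 4]
# Assumed reading of the rule: only NP2 carries a coefficient of 1.5,
# and 5 and 2 are additive terms contributed by the rule.
coefficients = [1, 1, 1.5, 1]
constants = [5, 2]

vp3 = node_weight(daughters, coefficients, constants)
print(vp3)  # 26.5


def best_candidate(candidates):
    """Pick the name of the candidate with the smallest weight."""
    return min(candidates, key=candidates.get)


# The VP1 and VP2 weights are hypothetical placeholders; the text only
# states that VP2 has the smallest weight of the three candidates.
candidates = {"VP1": 31.0, "VP2": 24.0, "VP3": vp3}
print(best_candidate(candidates))  # VP2
```

Selecting the minimum over competing candidates at each OR node, bottom-up, is one simple way to realize the search over the AND/OR graph; the source does not say which search procedure the system actually uses.</Paragraph>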
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.5 Word Selection by Semantic Markers
</SectionTitle>
      <Paragraph position="0"> Semantic markers are employed for Japanese word selection. Their effectiveness has been shown particularly in the areas mentioned below.</Paragraph>
      <Paragraph position="1"> 1) Selection of a Japanese translation of &quot;they&quot;
The word &quot;they&quot; appears quite often in news sentences. It has two major Japanese translations, &quot;karera&quot; and &quot;sorera,&quot; which refer to objects with will and objects without will, respectively. Confusing the two severely degrades the Japanese translation.</Paragraph>
      <Paragraph position="2"> A simple strategy that uses semantic markers and verb characteristics can make a proper selection in many cases without pronoun analysis. As mentioned in 3.1, verbs that take human subjects are frequently used in news sentences. Meanwhile, verbs like &quot;melt&quot; take subject nouns that have no will. If &quot;karera&quot; has the marker [HIWILL] (objects with high will) and &quot;sorera&quot; has none, the translation of &quot;they&quot; is controlled by specifying the subject of a verb as [HIWILL] or leaving it unmarked.
2) Selection of translation words for basic verbs
Verbs frequently used in news sentences are basic and thus have various meanings. To select a proper Japanese translation of a basic verb, we have set some special markers for news sentences. One of them is [CRIMINAL], which is used to obtain a special translation of &quot;catch.&quot; Consider the sentence: &quot;The police caught the assailant, who has a history of mental illness.&quot; The word &quot;catch&quot; was successfully translated as &quot;taiho-suru&quot; (arrest), since &quot;assailant&quot; has the marker [CRIMINAL] and the translation description of &quot;catch&quot; defines its Japanese translation as &quot;taiho-suru&quot; when it takes an object noun that belongs to [CRIMINAL].</Paragraph>
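      <Paragraph position="3"> Both selection strategies can be sketched in Python with small lookup tables. Only the markers HIWILL and CRIMINAL, the words "karera," "sorera," "taiho-suru," "assailant," and the verbs "say," "think," and "melt" come from the text. The table layout, the entry for "ball," the fallback translation "tsukamaeru," the default of "sorera" for unknown verbs, and all function names are assumptions made for illustration.

```python
# Toy noun lexicon: nouns mapped to semantic markers. Only "assailant"
# with CRIMINAL comes from the text; "ball" is a hypothetical entry.
NOUN_MARKERS = {
    "assailant": {"CRIMINAL"},
    "ball": set(),
}

# Translation description for "catch": (required object marker, Japanese).
# The fallback "tsukamaeru" is an assumed default, not from the source.
CATCH_TRANSLATIONS = [
    ("CRIMINAL", "taiho-suru"),
    (None, "tsukamaeru"),
]


def translate_catch(object_noun):
    """Choose a Japanese verb for "catch" from the marker of its object."""
    markers = NOUN_MARKERS.get(object_noun, set())
    for required, japanese in CATCH_TRANSLATIONS:
        if required is None or required in markers:
            return japanese


# Verbs whose subject is specified as HIWILL select "karera"; verbs with
# no subject marker, and unknown verbs (an assumption), select "sorera".
VERB_SUBJECT_MARKER = {"say": "HIWILL", "think": "HIWILL", "melt": None}


def translate_they(verb):
    """Choose a Japanese pronoun for "they" from the verb it is subject of."""
    return "karera" if VERB_SUBJECT_MARKER.get(verb) == "HIWILL" else "sorera"


print(translate_catch("assailant"))  # taiho-suru
print(translate_they("melt"))        # sorera
```

Listing the marked translation before the unmarked fallback lets the most specific description win, which mirrors the way the CRIMINAL case overrides the ordinary reading of "catch" in the example sentence.</Paragraph>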
    </Section>
  </Section>
</Paper>