<?xml version="1.0" standalone="yes"?> <Paper uid="A94-1005"> <Title>Machine Translation of Sentences with Fixed Expressions</Title> <Section position="4" start_page="28" end_page="28" type="metho"> <SectionTitle> (3) Type III </SectionTitle> <Paragraph position="0"> Type III sentences have no features that make MT appropriate. Most of the general sentences are dealers' comments.</Paragraph> </Section> <Section position="5" start_page="28" end_page="29" type="metho"> <SectionTitle> 3 Outline of ENTS </SectionTitle> <Paragraph position="0"> ENTS consists of three translation methods, one for each type of economic sentence. ENTS processing follows the flow in Fig. 2.</Paragraph> <Paragraph position="1"> Figure 2: ENTS flow chart. (1) Process 1: Process 1 translates fixed sentences (Type I) using bilingual templates that directly handle fixed expressions. (2) Process 2: Process 2 translates Type II sentences using a conventional rule-based approach with grammatical rules tuned to economic sentences obtained from two days' worth of AP stories [Aizawa93]. The grammatical rules are built to reflect the features of fixed expressions. There are about 500 of these economics-specific grammatical rules, one fifth of the number of rules for general sentences. As a result, there are few ambiguities in syntactic structure.</Paragraph> <Paragraph position="2"> (3) Process 3: Process 3 translates the sentences not processed by Process 1 or Process 2. It is a rule-based MT with general-purpose grammatical rules.</Paragraph> </Section> <Section position="6" start_page="29" end_page="30" type="metho"> <SectionTitle> 4 A translation method of fixed sentence </SectionTitle> <Paragraph position="0"> Our translation method, STRA (a fixed Sentence TRAnslation method), uses bilingual templates in which the translation equivalents of fixed expressions are represented as variables. These templates are created using STRA data. 
That data is built automatically by DTRA (a Data production method for STRA) from fixed English sentences and their corresponding Japanese translations. The fixed English sentences are extracted automatically from a corpus by EXTRA (a fixed sentence EXTRAction method). CTRA (a Compound word TRAnslation method) plays the main role in both STRA and DTRA.</Paragraph> <Paragraph position="1"> Fig. 3 visually summarizes the translation system.</Paragraph> <Paragraph position="2"> [Figure 3: the translation system]</Paragraph> <Section position="1" start_page="29" end_page="29" type="sub_section"> <SectionTitle> 4.1 Compound word translation (CTRA) </SectionTitle> <Paragraph position="0"> The compound word translation module (CTRA) translates compound words in fixed expressions [Katoh91]. In STRA and DTRA, CTRA is the main processing unit, whereas in Processes 2 and 3 it is one step of the analysis. In the MT system used in Processes 2 and 3, the CTRA step occurs between morphological and syntactic analysis, as shown in Fig. 4.</Paragraph> <Paragraph position="2"> Figure 4: Our rule-based MT system with CTRA
CTRA extracts fixed expressions and assigns their translation equivalents, parts of speech, and semantic markers. For example, the fixed expressions in example 1-1 are processed as:
idiomatic expression | translation equivalent | part of speech | semantic marker
&quot;17.76 dollars per kilo&quot; | [Japanese text garbled in extraction] | noun | unit expression
&quot;5 cents&quot; | [Japanese text garbled in extraction] | noun | unit expression
In CTRA, English analysis is done by a CHART parser based on CFG rules that represent fixed expressions.</Paragraph> <Paragraph position="3"> Japanese generation, on the other hand, is not rule-based; it is performed by substituting the translation equivalents of the English words for variables in Japanese templates. Fig. 5 shows examples of CFG rules and their corresponding Japanese templates. 
Together, these CFG rules and Japanese templates are called CTRA data.</Paragraph> <Paragraph position="4"> English fixed expressions | Japanese templates
1: S --> UNTEXP | [#1#]
2: UNTEXP --> UNTEXP PER UNIT | [1#3##1#]
3: UNTEXP --> NUMEXP UNIT | [#1##2#]
4: UNIT --> &quot;dollar&quot;, &quot;cents&quot;, &quot;kilo&quot;, &quot;yen&quot;, etc. | [Japanese equivalents garbled in extraction]
5: PER --> &quot;per&quot;, &quot;a&quot; | []
6: NUMEXP --> &quot;1&quot;, &quot;12&quot;, etc. | [1], [12], ...
7: CMA --> &quot;,&quot; | [Japanese equivalent garbled in extraction]
8: UPDW --> &quot;up&quot;, &quot;down&quot; | [Japanese equivalents garbled in extraction]
9: CITY --> &quot;Kuala Lumpur&quot;, &quot;Tokyo&quot;, etc. | [Japanese equivalents garbled in extraction]</Paragraph> <Paragraph position="5"> (Here #i# denotes the translation equivalent of the i-th symbol on the right-hand side of the CFG rule, and [(null)] means the rule has no corresponding translation equivalent.) Figure 5: Sample CTRA data</Paragraph> </Section> <Section position="2" start_page="29" end_page="30" type="sub_section"> <SectionTitle> 4.2 Fixed sentence translation (STRA) </SectionTitle> <Paragraph position="0"> The fixed sentence translation module (STRA) is CTRA extended with additional data (called STRA data) so that it translates not only fixed expressions but also fixed sentences. STRA data is produced automatically, as described in the next section.</Paragraph> <Paragraph position="1"> The first CFG rule in Fig. 6 is an English template for example 1-1, and its corresponding translation with variables is a Japanese template. The CFG rules are based not on English grammar but on English sentence patterns; they simply represent the word order of a fixed sentence. 
For example, &quot;Malaysian tin closed at&quot;, which is treated as one phrase, does not normally correspond to a single grammatical category in English grammar.</Paragraph> <Paragraph position="2"> The STRA data is flexible in its ability to translate fixed sentences. For example, the STRA data shown in Fig. 6 and the CTRA data in Fig. 5 can translate : expressions, such as &quot;Kuala Lumpur&quot;, &quot;Tokyo&quot;, &quot;cents&quot; and &quot;yen&quot;, should be registered in CTRA data. These words are selected by hand, with reference to frequently appearing fixed expressions collected from corpora.</Paragraph> </Section> <Section position="3" start_page="30" end_page="30" type="sub_section"> <SectionTitle> 4.3 Data production for STRA (DTRA) </SectionTitle> <Paragraph position="0"> The data production module for STRA (DTRA) builds STRA data automatically from English fixed sentences and their Japanese equivalent sentences. In DTRA, CFG rules are constructed by transforming English fixed sentences, and Japanese templates are made by replacing fixed expressions in the Japanese equivalent sentences with variables. DTRA's algorithm is as follows:
In STEP 0, a fixed sentence w1...wn is translated into Japanese by hand.
STEP 1 collects candidates for variables in the Japanese sentence. Specifically, the fixed sentence is analyzed by CTRA, and the fixed expressions it contains are extracted as the symbols (pre-terminal or terminal symbols) used in non-active edges.
STEP 2 calculates the weights of the symbols with the algorithm shown in Fig. 7, so that an optimal set of fixed expressions can be selected. If the translation equivalent of a symbol exists in the Japanese equivalent sentence, its weight is defined according to the number of words in the edge; otherwise it is zero.
STEP 3 selects an optimal set of edges by using Dynamic Programming (DP) to find the maximum sum of weights between positions 0 and n. 
STEP 4 produces pre-terminal symbols for the word sequences not selected in STEP 3, and lines up the symbols in order of appearance to make CFG rules.
In STEP 5, the translation equivalent of each edge in the optimal set is replaced with a variable in the Japanese equivalent sentence to make Japanese templates.</Paragraph> <Paragraph position="1"> DTRA is illustrated by processing sentence 1-1.</Paragraph> <Paragraph position="2"> STEP 0 translates the sentence into Japanese by hand: [Japanese sentence garbled in extraction] The non-active edges obtained by CTRA in STEP 1 are shown in Fig. 8.</Paragraph> <Paragraph position="3"> STEP 2 calculates the weights of the non-active edges as shown in Fig. 9. For example, the weight of &quot;Kuala Lumpur&quot; is 9.</Paragraph> <Paragraph position="4"> In sentence 1-1, STEP 3 uses DP to find the maximum sum of weights between positions 0 and 16. In Fig. 9, the maximum is 108 and the optimal set of edges is {&quot;Kuala Lumpur&quot;, &quot;,&quot;, &quot;17.76 dollars per kilo&quot;, &quot;,&quot;, &quot;up&quot;, &quot;5 cents&quot;}.</Paragraph> <Paragraph position="5"> In STEP 4, the word sequences not selected in STEP 3 are automatically given pre-terminal symbols:
&quot;In&quot; : PAT1
&quot;Malaysian tin closed at&quot; : PAT2
Lining up these symbols in order gives the CFG rule:
S --> PAT1 CITY CMA PAT2 UNTEXP CMA UPDW UNTEXP
(symbol positions: 1 2 3 4 5 6 7 8)
The variables are defined as:</Paragraph> <Paragraph position="7"> STEP 5 replaces the translation equivalents of the selected edges in the Japanese sentence with variables: [#2# (...), #8##7# (...) #5# (...)] (the Japanese text between the variables is garbled in the extracted source)</Paragraph> </Section> <Section position="4" start_page="30" end_page="30" type="sub_section"> <SectionTitle> 4.4 Fixed sentence extraction (EXTRA) </SectionTitle> <Paragraph position="0"> A method of extracting fixed sentences (EXTRA) collects fixed sentences for DTRA from a corpus using the fixed pattern ratio (FPR) 
defined below.</Paragraph> <Paragraph position="1"> The first step in EXTRA is to extract the fixed-word sequences that appear most frequently in a corpus, ignoring differences between days of the week (e.g., Monday and Tuesday) and between digits (e.g., 123 and 1000). The fixed-word sequences include not only compound words such as &quot;[DIGIT] dollars per kilo&quot; (where [DIGIT] denotes digits) and &quot;on condition of anonymity&quot;, but also parts of fixed expressions, such as &quot;said in a&quot; and &quot;condition of anonymity&quot;. The fixed-word sequences are called &quot;fixed patterns&quot; and the compiled fixed patterns are called &quot;fixed pattern data&quot;.</Paragraph> <Paragraph position="2"> Using fixed patterns, FPR is defined as follows:
FPR = (sum of words in fixed sequences of a sentence) / (total number of words in the sentence)
FPD1, FPD2 and FPD3 are assumed to be in the fixed pattern data. Thus, the FPR of example 4-1 = 8/8 = 1.0, because 4-1 itself is FPD1.</Paragraph> <Paragraph position="3"> The FPR of example 4-2 = (4+3)/9 = 0.78, because 4-2 includes FPD2 and FPD3.</Paragraph> <Paragraph position="4"> Fixed sentences are defined as those with FPR values above a certain threshold. EXTRA analyzes each sentence in a corpus and extracts the sentences with sufficiently high FPR as fixed sentences.</Paragraph> </Section> </Section> </Paper>
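<!--
The FPR computation and threshold test of Section 4.4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names, the regex tokenizer, the [day] placeholder, the example sentences, and the threshold of 0.7 are all hypothetical; only the [DIGIT] placeholder and the sample patterns come from the text. The sketch also counts each word once when patterns overlap, which the paper does not specify.

```python
import re

WEEKDAYS = {"monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"}

# Matches bracketed placeholders, numbers, words, or single punctuation marks.
TOKEN_RE = re.compile(r"\[\w+\]|\d+(?:\.\d+)?|\w+|[^\w\s]")


def tokenize(text):
    """Lowercase tokens, with digits and weekday names mapped to placeholders."""
    tokens = []
    for raw in TOKEN_RE.findall(text):
        tok = raw.lower()
        if tok[0].isdigit():
            tok = "[digit]"      # 123 and 1000 are treated as the same token
        elif tok in WEEKDAYS:
            tok = "[day]"        # hypothetical placeholder for days of the week
        tokens.append(tok)
    return tokens


def fixed_pattern_ratio(sentence, fixed_patterns):
    """Fraction of the sentence's words covered by any fixed pattern."""
    words = tokenize(sentence)
    if not words:
        return 0.0
    covered = [False] * len(words)
    for pattern in fixed_patterns:
        pat = tokenize(pattern)
        n = len(pat)
        if n == 0:
            continue
        # Mark every position where the pattern occurs in the sentence.
        for i in range(len(words) - n + 1):
            if words[i:i + n] == pat:
                covered[i:i + n] = [True] * n
    return sum(covered) / len(words)


def extract_fixed_sentences(corpus, fixed_patterns, threshold=0.7):
    """Keep the sentences whose FPR is above the (assumed) threshold."""
    return [s for s in corpus
            if fixed_pattern_ratio(s, fixed_patterns) > threshold]


# Sample patterns quoted in Section 4.4; the sentences below are made up.
PATTERNS = ["[DIGIT] dollars per kilo", "on condition of anonymity", "said in a"]
CORPUS = ["Tin closed at 17.76 dollars per kilo", "12 dollars per kilo"]
FIXED = extract_fixed_sentences(CORPUS, PATTERNS)
```

With these inputs the first sentence has four of its seven tokens covered and falls below the assumed threshold, while the second is covered entirely and is kept. A real system would tune the threshold on the corpus.
-->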