<?xml version="1.0" standalone="yes"?>
<Paper uid="P97-1016">
  <Title>Ambiguity Resolution for Machine Translation of Telegraphic Messages I</Title>
  <Section position="5" start_page="120" end_page="121" type="metho">
    <SectionTitle>
LANGUAGE GENERATION (GENESIS)
</SectionTitle>
    <Paragraph position="0"> is directly derived from the parse tree andbecomes .the input to the generation system. The hierarchical structure of the parse tree is preserved in the semantic frame, and therefore a misparse of the input sentence leads to a mistranslation. Suppose that the sentence (1) is misparsed as an active rather than a passive sentence due to the omission of the verb was, and that the prepositional phrase 220 nm is misparsed as the direct object of the verb destroy. These instances of misunderstanding are reflected in the semantic frame. Since the semantic frame becomes the input to the generation system, the generation system produces the non-sensical Korean translation output, as in (2), as opposed to the sensible one, as in (3). 3 (2) TU-95-ka 220 hayli-lul pakoy-hayssta TU-95-NOM 220 nautical mile-OBJ destroyed (3) TU-95-ka 220 hayli-eyse pakoy-toyessta TU-95-NOM 220 nautical mile-LOC was destroyed Given that the generation of the semantic frame from the parse tree, and the generation of the translation output from the semantic frame, are quite straightforward in such a system, and that the flexibility of the semantic frame representation is well suited for multilingual machine translation, it would be more desirable to find a way of reducing the ambiguity of the input text to produce high quality translation output, rather than adjusting the translation process. In the sections below we discuss one such method in terms of grammar design and some of its side effects.x</Paragraph>
    <Section position="1" start_page="120" end_page="120" type="sub_section">
      <SectionTitle>
2.1 Lexicalization of Grammar Rules with
Semantic Categories
</SectionTitle>
      <Paragraph position="0"> In the domain of naval operational report messages (MUC-II messages hereafter), 4 (Sundheim, 1989), we find two types of ellipsis. First, top level categories such as subjects and the copula verb be are often omitted, as in (4).</Paragraph>
      <Paragraph position="1">  (4) Considered hostile act (= This was considered to be a hostile act).</Paragraph>
      <Paragraph position="2"> Second, many function words like prepositions and articles are omitted. Instances of preposition omission are given in (5), where z stands for Greenwich Mean Time (GMT).</Paragraph>
      <Paragraph position="3"> (5) a. Haylor hit by a torpedo and put out of action 8 hours (---- for 8 hours) b. All hostile recon aircraft outbound 1300 z (= at 1300 z)  If we try to parse sentences containing such omissions with the grammar where the rules are defined in terms of syntactic categories (i.e. part-of-speech), the syntactic ambiguity multiplies. 3In the examples, NOM stands for the nominative case marker, OBJ the object case marker, and LOC the locative postposition.</Paragraph>
      <Paragraph position="4"> 4MUC-II stands for the Second Message Understanding Conference. MUC-II messages were originally collected and prepared by NRaD(1989) to support DARPA-sponsored research in message understanding.</Paragraph>
      <Paragraph position="5"> To accommodate sentences like (5)a-b, the grammar needs to allow all instances of noun phrases (NP hereafter) to be ambiguous between an NP and a prepositional phrase (PP hereafter) where the preposition is omitted. Allowing an input where the copula verb be is omitted in the grammar causes the past tense form of a verb to be interpreted either as the main verb with the appropriate form of be omitted, as in (6)a, or as a reduced relative clause modifying the preceding noun, as in (6)b.</Paragraph>
      <Paragraph position="6"> (6) Aircraft launched at 1300 z ...</Paragraph>
      <Paragraph position="7"> a. Aircraft were launched at 1300 z ...</Paragraph>
      <Paragraph position="8"> b. Aircraft which were launched at 1300 z ...</Paragraph>
      <Paragraph position="9"> Such instances of ambiguity are usually resolved on the basis of the semantic information. However, relying on a semantic module for ambiguity resolution implies that the parser needs to produce all possible parses of the input text andcarry them along, thereby requiring a more complex understanding process. One way of reducing the ambiguity at an early stage of processing without relying on a semantic module is to incorporate domain/semantic knowledge into the grammar as follows: * Lexicalize grammar rules to delimit the lexical items which typically occur in phrases with omission; * Introduce semantic categories to capture the co-occurrence restrictions of lexical items.</Paragraph>
      <Paragraph position="10"> Some example grammar rules instantiating these ideas are given in (7).</Paragraph>
      <Paragraph position="12"> {at in near off on ...} NP  (7)a states that a locative prepositional phrase consists of a subset of prepositions and a noun phrase. In addition, there is a subcategory headless_PP which consists of a subset of noun phrases which typically occur in a locative prepositional phrase with the preposition omitted. The head nouns which typically occur in prepositional phrases with the preposition omission are nautical miles and yard. The rest of the rules can be read in a similar manner. And it is clear how such lexicalized rules with the semantic categories reduce the syntactic ambiguity of the input text.</Paragraph>
    </Section>
    <Section position="2" start_page="120" end_page="121" type="sub_section">
      <SectionTitle>
2.2 Drawbacks
</SectionTitle>
      <Paragraph position="0"> Whereas the language processing is very efficient when a system relies on a lexicalized semantic grammar, there are some drawbacks as well.</Paragraph>
      <Paragraph position="1"> * Since the grammar is domain and word specific, it is not easily ported to new constructions and new domains.</Paragraph>
      <Paragraph position="2"> * Since the vocabulary items are entered in the grammar as part of lexicalized grammar rules, if an input sentence contains words unknown to the grammar, parsing fails.</Paragraph>
      <Paragraph position="3"> These drawbacks are reflected in the performance evaluation of our machine translation system. After the system was developed on all the training data of the MUC-II corpus (640 sentences, 12 words/sentence average), the system was evaluated on the held-out test set of 111 sentences (hereafter TEST set). The results are shown in Table 1. The system was also evaluated on the data which were collected from an in-house experiment. For this experiment, the subjects were asked to study a number of MUC-II sentences, and create about 20 MUC-II-like sentences. These  Total No. of sentences 111 No. of sentences with no 66/111 (59.5%) unknown words No. of parsed sentences 23/66 (34.8%) No, of misparsed sentences 2/23 (8:7%)</Paragraph>
    </Section>
    <Section position="3" start_page="121" end_page="121" type="sub_section">
      <SectionTitle>
Semantic Grammar
</SectionTitle>
      <Paragraph position="0"> MUC-II-like sentences form data set TEST'. The results of the svstem evaluation on the data set TEST' are given in Table 2.</Paragraph>
      <Paragraph position="1"> &amp;quot; Table 1 shows that the grammar coverage for unseen data is about 35%, excluding the failures due to unknown words. Table 2 indicates that even for sentences constructed to be similar to the training data, the grammar coverage is about 43%, again excluding the parsing failures due to unknown words. The misparse 5 rate with respect to the total parsed sentences ranges between 8.7% and 14.6%, which is considered to be highly accurate.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="121" end_page="121" type="metho">
    <SectionTitle>
3 Incorporation of Syntactic Knowledge
</SectionTitle>
    <Paragraph position="0"> Considering the low parsing coverage of a semantic grammar which relies on domain specific knowledse, and the fact that the successful parsing of the input sentence ks a prerequisite for producing translation output, it is critical to improve the parsing coverage. Such a goal may be achieved by incorporating syntactic rules into the ~ammar while retaining lexical/semantic information to minim'ize the ambiguity of the input text. The question is: how much semantic and syntactic information is  necessary? We propose a solution, as in (8): (8) (a) Rules involving verbs and prepositions need to be lexicalized to resolve the prepositional phrase attachment ambiguity, cf. (Brill and Resnik, 1993).</Paragraph>
    <Paragraph position="1"> (b) Rules involving verbs need to be lexicalized to prevent misarSing due to an incorrect subcategorization.</Paragraph>
    <Paragraph position="2"> ) Domain specific expressions (e.g.z. nm in the MUC-II corpus) which frequently occur in phrases with omitted elements. need to be lexicalized. (d) Otherwise. relv on svntactic rules defined in terms of part- of-speech. &amp;quot; &amp;quot;  In this section, we discuss typical misparses for the syntactic grammar on experiments in the MUC-II corpus. We then illustrate how these misparses are corrected by lexicalizing the grammar rules for verbs, prepositions, and some domain-specific phrases.</Paragraph>
    <Section position="1" start_page="121" end_page="121" type="sub_section">
      <SectionTitle>
3.1 Typical Misparses Caused by Syntactic
Grammar
</SectionTitle>
      <Paragraph position="0"> The misparses we find in the MUC-II corpus, when tested on a syntactic grammar, are largely due to the three factors specified in (9).</Paragraph>
      <Paragraph position="1">  care. A number of the sentences we consider to be misparses are t svntacuc mksparses, but &amp;quot;semanucallv anomalous. Since we are interested in getting the accurate interpretation in the given context at the parsingstage, we consider parses which are semantically anomalous to be misparses.</Paragraph>
      <Paragraph position="2"> (9) i. Misparsing due to prepositional phrase attachment (hereafter PP-attachment) ambiguity ii. Misparsing due to incorrect verb subcategorizations iii. Misparsing due to the omission of a preposition, e.g. i,~10 z instead of at I~10 z Examples of misparses due to an incorrect verb subcategorization and a PP-attachment ambiguity are given in Figure 2 and Figure 3. respectively. An example of a misparse due to preposition omission is given in Figure 4.</Paragraph>
      <Paragraph position="3"> In Figure 2, the verb intercepted incorrectly subcategorizes for a finite complement clause.</Paragraph>
      <Paragraph position="4"> In Figure 3, the prepositional phrase with 12 rounds is u~ronglv attached to the noun phrase the contact, as opposed to the verb phrase vp_active, to which it properly belongs.</Paragraph>
      <Paragraph position="5"> Figure 4 shows that the prepositional phrase i,~i0 z with at omitted is misparsed as a part of the noun phrase expression hostile raid composition.</Paragraph>
    </Section>
    <Section position="2" start_page="121" end_page="121" type="sub_section">
      <SectionTitle>
3.2 Correcting Misparses by Lexicalizing Verbs, Prepositions, and Domain Specific Phrases
</SectionTitle>
      <Paragraph position="0"> Prepositions, and Domain Specific Phrases Providing the accurate subcategorization frame for the verb intercept by lexicalizing the higher level category &amp;quot;vp&amp;quot; ensures that it never takes a finite clause as its complement, leading to the correct parse, as in Figure 5. As for PP-attachment ambiguity, lexicalization of verbs and prepositions helps in identifying the proper attachment site of the prepositional phrase, cf. (t3rill and Resnik, 1993), as illustrated in Figure 6.</Paragraph>
      <Paragraph position="1"> Misparses due to omission are easily corrected by deploying lexicalized rules for the vocabulary items which occur in phrases with omitted elements. For the misparse illustrated in Figure 3, utilizing the lexicalized rules in (10) prevents IJI0 z from being analyzed as part of the subsequent noun phrase, as in Figure 7.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="121" end_page="125" type="metho">
    <SectionTitle>
4 Experimental Results
</SectionTitle>
    <Paragraph position="0"> In this section we report two types of experimental results. One is the parsing results on two sets of unseen data TEST and TEST' (discussed in Section 2) using the syntactic grammar defined purely in terms of part-of-speech. Tl~e other is the parsing results on the same sets of data using the grammar which combines lexicalized semantic grammar rules and syntactic grammar rules. The results are compared with respect to the parsing coverage and the misparse rate. These experimental results are also compared with the parsing results with respect to the lexicalized semantic grammar discussed in Section 2.</Paragraph>
    <Section position="1" start_page="121" end_page="125" type="sub_section">
      <SectionTitle>
4.1 Experimental Results on Data Set TEST
</SectionTitle>
      <Paragraph position="0"> &amp;quot;-Total .No. of sentences i iii I No. of parsed sentences i 84/ili (75.7%) ', \[.No. of misparsed sentences 24/84 (29%) i  rate of misparse (i.e. 29%) than the grammar which utilizes both syntactic and semantic categories (i.e. 10%). Comparing the evaluation results on the mixed grammar with those on the lexicalized semantic grammar discussed in Section 2, the parsing coverage of the mixed grammar is much higher (77%) than that of the semantic grammar (59.5%). In terms of misparse rate, both grammars perform equally well, i.e. around 9%. 6</Paragraph>
    </Section>
    <Section position="2" start_page="125" end_page="125" type="sub_section">
      <SectionTitle>
4.2 Experimental Results on Data Set TEST'
</SectionTitle>
      <Paragraph position="0"> Total No. of sentences I 281 I No. of sentences which parse 215/281 (76.5%) No. of misparsed sentences 60/215 (28%)  mar Evaluation results of the two types of grammar on the TEST' data, given in Table 5 and Table 6, are similar to those of the two types of ~ammar on the TEST data discussed above. To summarize, the grammar which combines syntactic rules and lexicalized semantic rules fares better than the syntactic lgrcal.mm, mar or the semantic grammar. Compared with a lex- lzed semantic grammar, this grammar achieves a higher parsing coverage without increasing the amount of ambiguity/misparsing. When compared with a syntactic grammar, this grammar achieves a lower degree of ambiguity/misparsing without decreasing the parsing rate.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="125" end_page="126" type="metho">
    <SectionTitle>
5 System Engineering
</SectionTitle>
    <Paragraph position="0"> An input to the parser driven by a grammar which utilizes both syntactic and lexicalized semantic rules consists of words (to be covered by lexicalized semantic rules) and parts-of-speech (to be covered by syntactic rules). To accommodate the part-of-speech input to the parser, the input sentence has to be part-of-speech tagged before parsing. To produce an adequate translation output from the input containing parts-of-speech, there has to be a mechanism by which parts-of-speech are used for parsing purposes, and the corresponding lexical items are used for the semantic frame representation.</Paragraph>
    <Section position="1" start_page="125" end_page="125" type="sub_section">
      <SectionTitle>
5.1 Integration of Rule-Based Part-of-Speech
Tagger
</SectionTitle>
      <Paragraph position="0"> To accommodate the part-of-speech input to the parser, we have integrated the rule-based part-of-speech tagger, (Brill, 1992), (Brill, 1995), as a preprocessor to the language understanding system TINA, as in Figure 8. An advantage of integrating a part-of-speech tagger over a lexicon containing part-of-speech information is that only the former can tag words which are new to the system, and provides a way of handling unknown words.</Paragraph>
      <Paragraph position="1"> While most stochastic taggers require a large amount of training data to achieve high rates of tagging accuracy, the rule-based eThe parsing coverage of the semantic grammar, i.e. 34.8%, is after discounting the parsing failure due to words unknown to the ~rammar. The reason why we do not give the statistics of the parsing failure due to unknown words for the syntactic and the mixed grammar is because the part-of-speech tagging process, which will be discussed in detail in Section 5, has the effect of handling unknown words, and therefore the problem does not arise.</Paragraph>
      <Paragraph position="2">  ger as a Preprocessor to the Language Understanding System null tagger achieves performance comparable to or higher than that of stochastic taggers, even with a training corpus of a modest size. Given that the size of our training corpus is fairly small (total 7716 words), a transformation-based tagger is wellsuited to our needs.</Paragraph>
      <Paragraph position="3"> The transformation-based part-of-speech tagger operates in two stages. Each word in the tagged training corpus has an entry in the lexicon consisting of a partially ordered list of tags, indicating the most likely tag for that word, and all other tags seen with that word (in no particular order). Every word is first assigned its most likely tag in isolation. Unknown words are first assumed to be nouns, and then cues based upon prefixes, suffixes, infixes, and adjacent word co-occurrences are used to upgrade the most likely tag. Secondly, after the most likely tag for each word is assigned, contextual transformations are used to improve the accuracy.</Paragraph>
      <Paragraph position="4"> We have evaluated the tagger performance on the TEST Data both before and after training on the MUC-II corpus. The results are given in Table 7. Tagging statistics 'before training' are based on the lexicon and rules acquired from the BROWN CORPUS and the WALL STREET JOURNAL CORPUS. Tag- ~ ing statistics 'after training' are divided into two categories, oth of which are based on the rules acquired from training data sets of the MUC-II corpus. The only difference between the two is that in one case (After Training I) we use a lexicon acquired from the MUC-II corpus, and in the other case (After Training II) we use a lexicon acquired from a combination of the BROWN CORPUS, the WALL STREET JOURNAL CORPUS, and the  up to 98% after training and using the combined lexicon, with an accuracy for unknown words ranging from 82 to 87%. These high rates of tagging accuracy are largely due to two factors:  (1) Combination of domain specific contextual rules obtained by  training the MUC-II corpus with general contextual rules obtained by training the WSJ corpus; And (2) Combination of the MUC-II lexicon with the lexicon for the WSJ corpus.</Paragraph>
    </Section>
    <Section position="2" start_page="125" end_page="126" type="sub_section">
      <SectionTitle>
5.2 Adaptation of the Understanding System
</SectionTitle>
      <Paragraph position="0"> The understanding system depicted in Figure 1 derives the semantic frame representation directly from the parse tree. The terminal symbols (i.e. words in general) in the parse tree are represented as vocabulary items in the semantic frame. Once we allow the parser to take part-of-speech as the input, the parts- of-speech (rather than actual words) will appear as the terminal symbols in the parse tree, and hence as the vocabulary items in the semantic frame representation. We adapted the system so that the part-of-speech tags are used for parsing, but are replaced with the original words in the final semantic frame. Generation can then proceed as usual. Figures 9 and (11) illustrate the parse tree and semantic frame produced by the adapted system for the input sentence 0819 z unknown contacts replied incorrectly.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML