<?xml version="1.0" standalone="yes"?>
<Paper uid="P96-1020">
  <Title>Pattern-Based Context-Free Grammars for Machine Translation</Title>
  <Section position="4" start_page="144" end_page="146" type="metho">
    <SectionTitle>
2 Pattern-Based Context-Free
Grammars
</SectionTitle>
    <Paragraph position="0"> A pattern-based context-free grammar (PCFG) consists of a set of translation patterns. A pattern is a pair of CFG rules, together with zero or more syntactic head and link constraints for nonterminal symbols. For example, the English-French translation pattern 4</Paragraph>
    <Paragraph position="2"> essentially describes a synchronized 5 pair consisting of a left-hand-side English CFG rule (called a source</Paragraph>
    <Paragraph position="4"> accompanied by the following constraints.</Paragraph>
    <Paragraph position="5"> 1. Head constraints: The nonterminal symbol V in the source rule must have the verb miss as a syntactic head. The symbol V in the target rule must have the verb manquer as a syntactic head.</Paragraph>
    <Paragraph position="6"> The head of symbol S in the source (target) rule is identical to the head of symbol V in the source (target) rule as they are co-indexed.</Paragraph>
    <Paragraph position="7">  discussion of the importance of combining the appropriate domain of locality and synchronization. 2. Link constraints: Nonterminal symbols in the source and target rules are linked if they are given the same index &quot;:i&quot;. Linked nonterminals must be derived from a sequence of synchronized pairs. Thus, the first NP (NP:1) in the source rule corresponds to the second NP (NP:1) in the target rule, the Vs in both rules correspond to each other, and the second NP (NP:3) in the source rule corresponds to the first NP (NP:3) in the target rule.</Paragraph>
    <Paragraph position="8"> The source and target rules are called the CFG skeleton of the pattern. The notion of a syntactic head is similar to that used in unification grammars, although the heads in our patterns are simply encoded as character strings rather than as complex feature structures. A head is typically introduced 6 in preterminal rules such as leave → V  V ← partir, where two verbs, &quot;leave&quot; and &quot;partir,&quot; are associated with the heads of the nonterminal symbol V. This is equivalently expressed as leave:1 → V:1  V:1 ← partir:1, which is physically implemented as an entry of an English-French lexicon.</Paragraph>
    <Paragraph position="9"> A set T of translation patterns is said to accept an input s iff there is a derivation sequence Q for s using the source CFG skeletons of T, and every head constraint associated with the CFG skeletons in Q is satisfied. Similarly, T is said to translate s iff there is a synchronized derivation sequence Q for s such that T accepts s, and every head and link constraint associated with the source and target CFG skeletons in Q is satisfied. The derivation Q then produces a translation t as the resulting sequence of terminal symbols included in the target CFG skeletons in Q. Translation of an input string s essentially consists of the following three steps: 1. Parsing s by using the source CFG skeletons. 2. Propagating link constraints from source to target CFG skeletons to build a target CFG derivation sequence. 3. Generating t from the target CFG derivation sequence. The third step is a trivial procedure once the target CFG derivation is obtained.</Paragraph>
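    <Paragraph> The third step can be illustrated with a minimal sketch. It assumes that the miss/manquer pattern has the form reconstructed from the prose description of its link constraints, and that the translations of the linked children are already available; the names Sym, Pattern, MISS, and generate are hypothetical and are not taken from the prototype.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sym:
    """One symbol of a CFG skeleton; index is the link index (None for terminals)."""
    name: str
    index: Optional[int] = None
    head: Optional[str] = None


@dataclass(frozen=True)
class Pattern:
    source_rhs: tuple
    target_rhs: tuple


# Assumed shape of the example pattern (reconstructed, not quoted):
# source NP:1 miss:V:2 NP:3, target NP:3 manquer:V:2 à NP:1.
MISS = Pattern(
    source_rhs=(Sym("NP", 1), Sym("V", 2, "miss"), Sym("NP", 3)),
    target_rhs=(Sym("NP", 3), Sym("V", 2, "manquer"), Sym("à"), Sym("NP", 1)),
)


def generate(pattern, translations):
    """Read off the target string, given translations of the linked children.

    translations maps a link index to the already generated child string.
    """
    out = []
    for sym in pattern.target_rhs:
        if sym.index is None:
            out.append(sym.name)               # terminal of the target rule
        else:
            out.append(translations[sym.index])  # linked nonterminal
    return " ".join(out)


print(generate(MISS, {1: "Marie", 2: "manque", 3: "Jean"}))
```

    The link indices alone determine the reordering of the two NPs, which is why generation needs no further search once the target derivation is fixed.</Paragraph>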
    <Paragraph position="10"> Theorem 1 Let T be a PCFG. Then, there exists a CFG GT such that for two languages L(T) and L(GT) accepted by T and GT, respectively, L(T) = L(GT) holds. That is, T accepts a sentence s iff GT  rule X → X1 ... Xk can only be constrained to have one of the heads in the RHS X1 ... Xk. Thus, monotonicity of head constraints holds throughout the parsing process.  2. For each nonterminal symbol X in T, GT includes a set of nonterminal symbols {X_w | w is either a terminal symbol in T or a special symbol ε}.</Paragraph>
    <Paragraph position="11"> 3. For each preterminal rule</Paragraph>
    <Paragraph position="13"> If Xj has no head constraint in the above rule, GT includes a set of (N + 1) rules, where Xhj above is replaced with X_w for every terminal symbol w and X_ε (Yhj will also be replaced if it is co-indexed with Xj). 8 Now, L(T) ⊆ L(GT) is obvious, since GT can simulate the derivation sequence in T with corresponding rules in GT. L(GT) ⊆ L(T) can be proven, by mathematical induction, from the fact that every valid derivation sequence of GT satisfies the head constraints of the corresponding rules in T.</Paragraph>
    <Paragraph position="14">  Proposition 1 Let a CFG G be a set of source CFG skeletons in T. Then, L(T) ⊆ L(G).</Paragraph>
    <Paragraph position="15"> Since a valid derivation sequence in T is always a valid derivation sequence in G, the proof is immediate. Similarly, we have Proposition 2 Let a CFG H be a subset of source CFG skeletons in T such that a source CFG skeleton k is in H iff k has no head constraints associated with it. Then, L(H) ⊆ L(T).</Paragraph>
    <Paragraph position="16"> 7 Head constraints are trivially satisfied or violated in preterminal rules. Hence, we assume, without loss of generality, that no head constraint is given in preterminal rules. We also assume that &quot;X → w&quot; implies &quot;X:1 → w:1&quot;.</Paragraph>
    <Paragraph position="17"> 8 Therefore, a single rule in T can be mapped to as many as (N + 1)^k rules in GT, where N is the number of terminal symbols in T. GT could be exponentially larger than T.</Paragraph>
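    <Paragraph> The size bound can be checked on a toy case. The sketch below is a simplification made for illustration: it annotates only the k right-hand-side nonterminals of one unconstrained rule and ignores the head of the left-hand side and co-indexing; expand_rule is a hypothetical name.

```python
from itertools import product


def expand_rule(lhs, rhs, terminals, epsilon="ε"):
    """Enumerate head-annotated copies of one rule without head constraints.

    Each RHS nonterminal X becomes X_w for every terminal w and for the
    special symbol epsilon, which gives (N + 1) ** k combinations.
    """
    heads = list(terminals) + [epsilon]
    return [
        (lhs, tuple(f"{x}_{h}" for x, h in zip(rhs, combo)))
        for combo in product(heads, repeat=len(rhs))
    ]


rules = expand_rule("S", ["NP", "VP"], ["miss", "leave", "year"])
print(len(rules))  # N = 3 terminals, k = 2 symbols
```

    With a realistic lexicon N is in the tens of thousands, so the equivalent CFG is of theoretical interest only.</Paragraph>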
    <Paragraph position="18"> Two CFGs G and H define the range of CFL L(T).</Paragraph>
    <Paragraph position="19"> These two CFGs can be used to measure the &quot;default&quot; translation quality, since idioms and collocational phrases are typically translated by patterns with head constraints.</Paragraph>
    <Paragraph position="20"> Theorem 2 Let a CFG G be a set of source CFG skeletons in T. Then, L(T) ⊂ L(G) is undecidable. Proof: The decision problem, L(T) ⊂ L(G), of two CFLs such that L(T) ⊆ L(G) is solvable iff L(T) = L(G) is solvable. This includes a known undecidable problem, L(T) = Σ*?, since we can choose a grammar U with L(U) = Σ*, nullify the entire set of rules in U by defining T to be a vacuous set {S:1 → a:S_b:1, S_b:1 → b:S_U:1} ∪ U (S_U and S are start symbols in U and T, respectively), and, finally, let T further include an arbitrary CFG F. L(G) = Σ* is obvious, since G has {S → S_b, S_b → S_U} ∪ U. Now, we have L(G) = L(T) iff L(F) = Σ*.</Paragraph>
    <Paragraph position="21">  Theorem 2 shows that the syntactic coverage of T is, in general, only computable by T itself, even though T is merely a CFL. This may pose a serious problem when a grammar writer wishes to know if there is a specific expression that is only acceptable by using at least one pattern with head constraints, for which the answer is &quot;no&quot; iff L(G) = L(T). One way to trivialize this problem is to let T include a pattern with a pair of pure CFG rules for every pattern with head constraints, which guarantees that L(H) = L(T) = L(G). In this case, we know that the coverage of &quot;default&quot; patterns is always identical to L(T).</Paragraph>
    <Paragraph position="22"> Although our &quot;patterns&quot; have no more theoretical descriptive power than CFG, they can provide considerably better descriptions of the domain of locality than ordinary CFG rules. For example,</Paragraph>
    <Paragraph position="24"> can handle such NP pairs as &quot;one year&quot; and &quot;un an,&quot; and &quot;more than two years&quot; and &quot;plus que deux ans,&quot; which would have to be covered by a large number of plain CFG rules. TAGs, on the other hand, are known to be &quot;mildly context-sensitive&quot; grammars, and they can capture a broader range of syntactic dependencies, such as cross-serial dependencies. The computational complexity of parsing for TAGs, however, is O(|G|n^6), which is far greater than that of CFG parsing. Moreover, defining a new STAG rule is not as easy for the users as just adding an entry into a dictionary, because each STAG rule has to be specified as a pair of tree structures. Our patterns, on the other hand, concentrate on specifying linear ordering of source and target constituents, and can be written by the users as easily as 9  9 By sacrificing linguistic accuracy for the description of syntactic structures.</Paragraph>
    <Paragraph position="25">  to leave * = de quitter * to be year:* old = d'avoir an:* Here, the wildcard &quot;*&quot; stands for an NP by default. The prepositions &quot;to&quot; and &quot;de&quot; are used to specify that the patterns are for VP pairs, and &quot;to be&quot; is used to show that the phrase is the BE-verb and its complement. A wildcard can be constrained with a head, as in &quot;house:*&quot; and &quot;maison:*&quot;. The internal representations of these patterns are as follows:</Paragraph>
    <Paragraph position="27"> These patterns can be associated with an explicit nonterminal symbol such as &quot;V:*&quot; or &quot;ADJP:*&quot; in addition to head constraints (e.g., &quot;leave:V:*&quot;). By defining a few such notations, these patterns can be successfully converted into the formal representations defined in this section. Many of the divergences (Dorr, 1993) in source and target language expressions are fairly collocational, and can be appropriately handled by using our patterns. Note the simplicity that results from using a notation in which users only have to specify the surface ordering of words and phrases. More powerful grammar formalisms would generally require either a structural description or complex feature structures.</Paragraph>
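    <Paragraph> How such a surface notation might be read can be sketched as follows. The triple format produced here is a hypothetical stand-in and not the internal representation used in the prototype, and the rule that an all-uppercase prefix such as &quot;V:*&quot; names a category rather than a head is an assumption made for this example.

```python
def parse_side(text, default_cat="NP"):
    """Turn one side of a surface pattern into (kind, category, head) triples."""
    items = []
    for token in text.split():
        if not token.endswith("*"):
            items.append(("terminal", None, token))
            continue
        parts = token.split(":")[:-1]          # drop the trailing "*"
        if not parts:
            items.append(("slot", default_cat, None))      # bare wildcard
        elif len(parts) == 1 and parts[0].isupper():
            items.append(("slot", parts[0], None))         # e.g. V:*
        elif len(parts) == 1:
            items.append(("slot", default_cat, parts[0]))  # e.g. year:*
        else:
            items.append(("slot", parts[1], parts[0]))     # e.g. leave:V:*
    return items


def parse_pattern(line):
    """Split a pattern at the first "=" and parse both sides."""
    source, target = (side.strip() for side in line.split("=", 1))
    return parse_side(source), parse_side(target)


print(parse_pattern("to be year:* old = d'avoir an:*"))
```

    Link indices are not assigned in this sketch; a fuller reader would pair the slots of the two sides before building the CFG skeletons.</Paragraph>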
  </Section>
  <Section position="5" start_page="146" end_page="147" type="metho">
    <SectionTitle>
3 The Translation Algorithm
</SectionTitle>
    <Paragraph position="0"> The parsing algorithm for translation patterns can be any of the known CFG parsing algorithms, including the CKY and Earley algorithms 10. At this stage, head and link constraints are ignored. It is easy to show that the number of target charts for a single source chart increases exponentially if we build target charts simultaneously with source charts. For example, the two patterns</Paragraph>
    <Paragraph position="2"> will generate the following 2^n synchronized pairs of charts for the sequence of (n+1) nonterminal symbols AAA...AB, for which no effective packing of the target charts is possible.</Paragraph>
    <Paragraph position="3">  (A (A... (A B))) with (A (A... (A B))) (A (A... (A B))) with ((A ... (A B)) A) (A (A... (A B))) with (((B A) A)... A)  Our strategy is thus to find a candidate set of source charts in polynomial time. We therefore apply heuristic measurements to identify the most promising patterns for generating translations. 10 Our prototype implementation was based on the Earley algorithm, since this does not require lexicalization of CFG rules.</Paragraph>
    <Paragraph position="4"> In this sense, the entire translation algorithm is not guaranteed to run in polynomial time. Practically, a timeout mechanism and a process for recovery from unsuccessful translation (e.g., applying the idea of fitted parse (Jensen and Heidorn, 1983) to target CFG rules) should be incorporated into the translation algorithm.</Paragraph>
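    <Paragraph> The exponential growth of target charts can be reproduced with a small enumeration. The sketch assumes two patterns that share one source skeleton combining A with its sister, and that differ only in whether the target side keeps A on the left or moves it to the right; this is a guess at the shape of the example, and target_trees is a hypothetical name.

```python
from itertools import product


def target_trees(n):
    """Enumerate target bracketings for the source string A...AB with n A's.

    Every A independently selects the pattern that keeps it on the left of
    its sister or the pattern that moves it to the right.
    """
    trees = []
    for choices in product(("left", "right"), repeat=n):
        tree = "B"
        for side in reversed(choices):        # attach the innermost A first
            tree = f"(A {tree})" if side == "left" else f"({tree} A)"
        trees.append(tree)
    return trees


for tree in target_trees(2):
    print(tree)
```

    All 2^n bracketings are distinct trees, so a single right-branching source chart cannot share substructure among its target charts.</Paragraph>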
    <Paragraph position="5"> Some restrictions on patterns must be imposed to avoid infinitely many ambiguities and arbitrarily long translations. The following patterns are therefore not allowed:  1. A → X  Y ← B  2. A → X  Y ← C1 ... B ... Ck if there is a cycle of synchronized derivation such that A → X ... → A and</Paragraph>
    <Paragraph position="7"> where A, B, X, and Y are nonterminal symbols with or without head and link constraints, and C's are either terminal or nonterminal symbols.</Paragraph>
    <Paragraph position="8"> The basic strategy for choosing a candidate derivation sequence from ambiguous parses is as follows. 11 A simplified view of the Earley algorithm (Earley, 1970) consists of three major components, predict(i), complete(i), and scan(i), which are called at each position i = 0, 1, ..., n in an input string I = s1 s2 ... sn. Predict(i) returns a set of currently applicable CFG rules at position i. Complete(i) combines inactive charts ending at i with active charts that look for the inactive charts at position i to produce a new collection of active and inactive charts.</Paragraph>
    <Paragraph position="9"> Scan(i) tries to combine inactive charts with the symbol s_{i+1} at position i. Complete(n) gives the set of possible parses for the input I.</Paragraph>
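    <Paragraph> The three components can be seen in a compact recognizer. This is a textbook-style sketch and not the prototype implementation: it works on source CFG skeletons only, ignores head and link constraints, assumes that no rule has an empty right-hand side, and uses a toy grammar invented for the example.

```python
def earley_recognize(grammar, start, words):
    """Return True iff words can be derived from start.

    An item is (lhs, rhs, dot, origin); chart[i] holds the items ending at i.
    grammar maps each nonterminal to a list of right-hand-side tuples.
    """
    n = len(words)
    chart = [set() for _ in range(n + 1)]
    for rhs in grammar[start]:
        chart[0].add((start, rhs, 0, 0))
    for i in range(n + 1):
        agenda = list(chart[i])
        while agenda:
            lhs, rhs, dot, origin = agenda.pop()
            new = []
            if dot == len(rhs):
                # complete: advance the items that were waiting for lhs
                new = [(a, r, d + 1, o) for (a, r, d, o) in chart[origin]
                       if d != len(r) and r[d] == lhs]
            elif rhs[dot] in grammar:
                # predict: expand the nonterminal after the dot
                new = [(rhs[dot], r, 0, i) for r in grammar[rhs[dot]]]
            elif i != n and words[i] == rhs[dot]:
                # scan: the next word matches the terminal after the dot
                chart[i + 1].add((lhs, rhs, dot + 1, origin))
            for item in new:
                if item not in chart[i]:
                    chart[i].add(item)
                    agenda.append(item)
    return any(a == start and d == len(r) and o == 0
               for (a, r, d, o) in chart[n])


TOY = {
    "S": [("NP", "VP")],
    "NP": [("he",), ("me",)],
    "VP": [("V", "NP"), ("VP", "ADVP")],
    "V": [("knows",)],
    "ADVP": [("well",)],
}
print(earley_recognize(TOY, "S", "he knows me well".split()))
```

    Completed items correspond to inactive charts and items with the dot before the end correspond to active charts; the left-recursive VP rule is handled without any grammar transformation.</Paragraph>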
    <Paragraph position="10"> Now, for every inactive chart associated with a nonterminal symbol X for a span of (i, j) (1 ≤ i, j ≤ n), there exists a set P of patterns with the source CFG skeleton ... → X. We can define the following ordering of patterns in P; this gives patterns with which we can use head and link constraints for building target charts and translations. These candidate patterns can be arranged and associated with the chart in the complete() procedure.</Paragraph>
    <Paragraph position="11"> 1. Prefer a pattern p with a source CFG skeleton X → X1 ... Xk over any other pattern q with the same source CFG skeleton X → X1 ... Xk, such that p has a head constraint h:Xi if q has h:Xi (i = 1,...,k). The pattern p is said to be more specific than q. For example, p = 11 This strategy is similar to that of transfer-driven MT (TDMT) (Furuse and Iida, 1994). TDMT, however, is based on a combination of declarative/procedural knowledge sources for MT, and no clear computational properties have been investigated.</Paragraph>
    <Paragraph position="12">  2. Prefer a pattern p with a source CFG skeleton to any pattern q that has fewer terminal symbols in the source CFG skeleton than p. For example, prefer &quot;take:V:1 a walk&quot; to &quot;take:V:1 NP&quot; if these patterns give the VP charts with the same span.</Paragraph>
    <Paragraph position="13">  3. Prefer a pattern p which does not violate any head constraint over those which violate a head constraint.</Paragraph>
    <Paragraph position="14"> 4. Prefer the shortest derivation sequence for each input substring. A pattern for a larger domain of locality tends to give a shorter derivation sequence.</Paragraph>
    <Paragraph position="15"> These preferences can be expressed as numeric values (cost) for patterns. 12 Thus, our strategy favors lexicalized (or head constrained) and collocational patterns, which is exactly what we are going to achieve with pattern-based MT. Selection of patterns in the derivation sequence accompanies the construction of a target chart. Link constraints are propagated from source to target derivation trees. This is basically a bottom-up procedure.</Paragraph>
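    <Paragraph> The first two preferences can be written down as simple tests on patterns. In this sketch a pattern is a plain dictionary with a skeleton of the form (LHS, RHS tuple) and a mapping from RHS positions to required heads; this representation and the function names are assumptions made for illustration.

```python
def more_specific(p, q):
    """Preference 1: p carries every head constraint of q on the same skeleton.

    The relation is reflexive: a pattern counts as more specific than itself.
    """
    if p["skeleton"] != q["skeleton"]:
        return False
    return all(p["heads"].get(i) == h for i, h in q["heads"].items())


def terminal_count(pattern, nonterminals):
    """Preference 2: number of terminal symbols in the source skeleton."""
    return sum(1 for sym in pattern["skeleton"][1] if sym not in nonterminals)


NONTERMINALS = {"VP", "V", "NP"}
take_walk = {"skeleton": ("VP", ("V", "a", "walk")), "heads": {0: "take"}}
take_np = {"skeleton": ("VP", ("V", "NP")), "heads": {0: "take"}}
plain_vp = {"skeleton": ("VP", ("V", "NP")), "heads": {}}

print(more_specific(take_np, plain_vp))            # head-constrained wins
print(terminal_count(take_walk, NONTERMINALS),
      terminal_count(take_np, NONTERMINALS))       # collocation wins
```

    Because the first test only compares patterns with identical skeletons, it yields a partial order; turning all four preferences into one numeric cost is what makes the candidates totally ordered.</Paragraph>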
    <Paragraph position="16"> Since the number M of distinct pairs (X,w), for a nonterminal symbol X and a subsequence w of input string s, is bounded by Kn^2, we can compute the m-best choice of pattern candidates for every inactive chart in time O(|T|Kn^3) as claimed by Maruyama (Maruyama, 1993), and Schabes and Waters (Schabes and Waters, 1995). Here, K is the number of distinct nonterminal symbols in T, and n is the size of the input string. Note that the head constraints associated with the source CFG rules can be incorporated in the parsing algorithm, since the number of triples (X,w,h), where h is a head of X, is bounded by Kn^3. We can modify the predict(), complete(), and scan() procedures to run in O(|T|Kn^4) while checking the source head constraints. Construction of the target charts, if possible, on the basis of the m best candidate patterns for each source chart takes O(Kn^2 m) time. Here, m can be larger than 2^n if we generate every possible translation.</Paragraph>
    <Paragraph position="17"> The reader should note critical differences between lexicalized grammar rules (in the sense of LTAG and TIG) and translation patterns when they are used for MT.</Paragraph>
    <Paragraph position="18"> Firstly, a pattern is not necessarily lexicalized. An economical way of organizing translation patterns is to include non-lexicalized patterns as &amp;quot;default&amp;quot; translation rules.</Paragraph>
    <Paragraph position="19"> 12 A similar preference can be defined for the target part of each pattern, but we found many counterexamples, where the number of nonterminal symbols shows no specificity of the patterns, in the target part of English-to-Japanese translation patterns. Therefore, only the head constraint violation in the target part is accounted for in our prototype.</Paragraph>
    <Paragraph position="20"> Secondly, lexicalization might increase the size of STAG grammars (in particular, compositional grammar rules such as ADJP NP → NP) considerably when a large number of phrasal variations (adjectives, verbs in present participle form, various numeric expressions, and so on), multiplied by the number of their translations, are associated with the ADJP part. The notion of structure sharing (Vijay-Shanker and Schabes, 1992) may have to be extended from lexical to phrasal structures, as well as from monolingual to bilingual structures.</Paragraph>
    <Paragraph position="21"> Thirdly, a translation pattern can omit the tree structure of a collocation, and leave it as just a sequence of terminal symbols. The simplicity of this helps users to add patterns easily, although precise description of syntactic dependencies is lost.</Paragraph>
  </Section>
  <Section position="6" start_page="147" end_page="148" type="metho">
    <SectionTitle>
4 Features and Agreements
</SectionTitle>
    <Paragraph position="0"> Translation patterns can be enhanced with unification and feature structures to give patterns additional power for describing gender, number, agreement, and so on. Since the descriptive power of unification-based grammars is considerably greater than that of CFG (Berwick, 1982), feature structures have to be restricted to maintain the efficiency of parsing and generation algorithms. Shieber and Schabes briefly discuss the issue (Shieber and Schabes, 1990). We can also extend translation patterns as follows: Each nonterminal node in a pattern can be associated with a fixed-length vector of binary features.</Paragraph>
    <Paragraph position="1"> This will enable us to specify such syntactic dependencies as agreement and subcategorization in patterns. Unification of binary features, however, is much simpler: unification of a feature-value pair succeeds only when the pair is either (0,0) or (1,1). Since the feature vector has a fixed length, unification of two feature vectors is performed in constant time. For example, the patterns 13</Paragraph>
    <Paragraph position="3"> are unifiable with transitive and intransitive verbs, respectively. We can also distinguish local and head features, as postulated in HPSG. A simplified version of verb subcategorization is then encoded as</Paragraph>
    <Paragraph position="5"> where &quot;-OBJ&quot; is a local feature for head VPs in LHSs, while &quot;+OBJ&quot; is a local feature for VPs in 13 Again, these patterns can be mapped to a weakly equivalent set of CFG rules. See GPSG (Gazdar, Pullum, and Sag, 1985) for more details.</Paragraph>
    <Paragraph position="6">  the RHSs. Unification of a local feature with +OBJ succeeds since it is not bound.</Paragraph>
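    <Paragraph> Unification over fixed-length binary feature vectors can be sketched in a few lines. The sketch assumes that an unbound feature is represented by None and unifies with either value; the feature order in the example (OBJ first, then a second, unnamed feature) is hypothetical.

```python
UNBOUND = None


def unify(a, b):
    """Unify two feature vectors of equal length; return None on failure.

    Each slot holds 0, 1, or UNBOUND. Two bound slots must agree, that is,
    the pair must be (0, 0) or (1, 1); an unbound slot takes the other value.
    """
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    result = []
    for x, y in zip(a, b):
        if x is UNBOUND:
            result.append(y)
        elif y is UNBOUND or x == y:
            result.append(x)
        else:
            return None
    return tuple(result)


transitive = (1, UNBOUND)      # +OBJ, second feature not bound
print(unify(transitive, (1, 0)))
print(unify(transitive, (0, 0)))
```

    The loop visits each slot once, so for a vector length fixed in advance the cost does not depend on the grammar or on the input.</Paragraph>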
    <Paragraph position="7"> Agreement on subjects (nominative NPs) and finite-form verbs (VPs, excluding the BE verb) is disjunctively specified as</Paragraph>
    <Paragraph position="9"> Here, *AGRS and *AGRV are a pair of aggregate unification specifiers that succeeds only when one of the above combinations of the feature values is unifiable.</Paragraph>
    <Paragraph position="10"> Another way to extend our grammar formalism is to associate weights with patterns. It is then possible to rank the matching patterns according to a linear ordering of the weights rather than the pairwise partial ordering of patterns described in the previous section. In our prototype system, each pattern has its original weight, and according to the preference measurement described in the previous section, a penalty is added to the weight to give the effective weight of the pattern in a particular context. Patterns with the least weight are to be chosen as the most preferred patterns.</Paragraph>
    <Paragraph position="11"> Numeric weights for patterns are extremely useful as a means of assigning higher priorities uniformly to user-defined patterns. Statistical training of patterns can also be incorporated to calculate such weights systematically (Fujisaki et al., 1989).</Paragraph>
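    <Paragraph> The weighting scheme amounts to a sort on effective weights. The numbers below are invented for the example, and the assumption that penalties are simply summed is ours; the text only states that a penalty is added to the original weight.

```python
def effective_weight(pattern, penalties):
    """Original weight plus the penalties incurred in the current context."""
    return pattern["weight"] + sum(penalties)


def rank(candidates):
    """Sort (pattern, penalties) pairs so that the least weight comes first."""
    return sorted(candidates, key=lambda c: effective_weight(c[0], c[1]))


candidates = [
    ({"name": "default VP", "weight": 1.0}, [0.5]),   # penalized as less specific
    ({"name": "user idiom", "weight": 1.0}, []),      # no penalty in this context
]
print([p["name"] for p, _ in rank(candidates)])
```

    Giving user-defined patterns a uniformly lower original weight is then enough to make them win against system patterns with equal penalties.</Paragraph>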
    <Paragraph position="12"> Figure 1 shows a sample translation of the input &quot;He knows me well,&quot; using the following patterns.  To simplify the example, let us assume that we have the following preterminal rules:</Paragraph>
    <Paragraph position="14"> Input: He knows me well  (inactive arcs [1 3] (d) V NP, [1 3] (e) V NP) [1 3] knows me ---&gt; (d), (e) VP (inactive arc [0 3] (a) NP VP, active arcs [1 3] (b) VP.well, [1 3] (c) VP.ADVP) [0 3] He knows me ---&gt; (a) S [3 4] well ---&gt; (j) ADVP, (k) ADVP (inactive arcs [1 4] (b) VP ADVP, [1 4] (c) VP ADVP) [1 4] knows me well ---&gt; (b), (c) VP (inactive arc [0 4] (a) NP VP)  In the above example, the Earley-based algorithm with source CFG rules is used in Phase 1. In Phase 2, head and link constraints are examined, and unification of feature structures is performed by using the charts obtained in Phase 1. Candidate patterns are ordered by their weights and preferences. Finally, in Phase 3, the target charts are built to generate translations based on the selected patterns.</Paragraph>
  </Section>
  <Section position="7" start_page="148" end_page="149" type="metho">
    <SectionTitle>
5 Integration of Bilingual Corpora
</SectionTitle>
    <Paragraph position="0"> Integration of translation patterns with translation examples, or bilingual corpora, is the most important extension of our framework. There is no discrete line between patterns and bilingual corpora. Rather, we can view them together as a uniform set of translation pairs with varying degrees of lexicalization. Sentence pairs in the corpora, however, should not be just added as patterns, since they are often redundant, and such additions contribute to neither acquisition nor refinement of non-sentential patterns.</Paragraph>
    <Paragraph position="1"> Therefore, we have been testing the integration method with the following steps. Let T be a set of translation patterns, B be a bilingual corpus, and (s,t) be a pair of source and target sentences.</Paragraph>
    <Paragraph position="2">  1. [Correct Translation] If T can translate s into t, do nothing.</Paragraph>
    <Paragraph position="3"> 2. [Competitive Situation] If T can translate s into t' (t ≠ t'), do the following: (a) [Lexicalization] If there is a paired derivation sequence Q of (s,t) in T, create a new pattern p' for a pattern p used in Q such that every nonterminal symbol X in p with no head constraint is associated with h:X in p', where the head h is instantiated in X of p. Add p' to T if it is not already there.</Paragraph>
    <Paragraph position="4"> Repeat the addition of such patterns, and assign low weights to them until the refined sequence Q becomes the most likely translation of s. For example, add leave:VP:1:+OBJ</Paragraph>
    <Paragraph position="6"> if the existing VP ADVP pattern does not give a correct translation.</Paragraph>
    <Paragraph position="7"> (b) [Addition of New Patterns] If there is no such paired derivation sequence, add specific patterns, if possible, for idioms and collocations that are missing in T, or add the pair (s,t) to T as a translation pattern. For example, add</Paragraph>
    <Paragraph position="9"> if the phrase &quot;leave it behind&quot; is not correctly translated.</Paragraph>
    <Paragraph position="10">  3. [Translation Failure] If T cannot translate s at all, add the pair (s,t) to T as a translation pattern.</Paragraph>
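    <Paragraph> The control flow of the three cases can be sketched as below. Since the scheme was only simulated manually, every callable here (translate, derive_pair, lexicalize) is a hypothetical placeholder supplied by the caller, and the sketch leaves out the weight assignment and the repeated refinement of case 2(a).

```python
def integrate(patterns, corpus, translate, derive_pair, lexicalize):
    """Apply the three cases to every (s, t) pair; return the enlarged set.

    translate(patterns, s)      -> best translation, or None on failure
    derive_pair(patterns, s, t) -> paired derivation of (s, t), or None
    lexicalize(derivation)      -> iterable of head-constrained patterns
    """
    patterns = set(patterns)
    for s, t in corpus:
        best = translate(patterns, s)
        if best == t:
            continue                                  # 1. correct translation
        if best is None:
            patterns.add((s, t))                      # 3. translation failure
            continue
        derivation = derive_pair(patterns, s, t)      # 2. competitive situation
        if derivation is not None:
            patterns.update(lexicalize(derivation))   # 2(a) lexicalization
        else:
            patterns.add((s, t))                      # 2(b) new pattern
    return patterns


def lookup(patterns, s):
    """Toy translator for the demonstration: exact match on stored pairs."""
    return dict(patterns).get(s)


result = integrate({("a", "x")}, [("a", "x"), ("b", "y")],
                   lookup, lambda p, s, t: None, lambda d: [])
print(sorted(result))
```

    In the toy run the first pair is already translated correctly and the second one fails, so only the second pair is added.</Paragraph>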
    <Paragraph position="11"> The grammar acquisition scheme described above has not yet been automated, but has been manually simulated for a set of 770 English-Japanese simple sentence pairs designed for use in MT system evaluation, which is available from JEIDA (the Japan Electronic Industry Development Association) (Japan Electronic Industry Development Association, 1995), including: #100: Any question will be welcomed.</Paragraph>
    <Paragraph position="12"> #200: He kept calm in the face of great danger.</Paragraph>
    <Paragraph position="13"> #300: He is what is called &quot;the man in the news&quot;.</Paragraph>
    <Paragraph position="14"> #400: Japan registered a trade deficit of $101 million, reflecting the country's economic sluggishness, according to government figures.</Paragraph>
    <Paragraph position="15"> #500: I also went to the beach 2 weeks earlier.</Paragraph>
    <Paragraph position="16"> At an early stage of grammar acquisition, [Addition of New Patterns] was primarily used to enrich the set T of patterns, and many sentences were unambiguously and correctly translated. At a later stage, however, JEIDA sentences usually gave several translations, and [Lexicalization] with careful assignment of weights was the most critical task. Although these sentences are intended to test a system's ability to translate one basic linguistic phenomenon in each simple sentence, the result was strong evidence for our claim. Over 90% of JEIDA sentences were correctly translated. Among the failures were: #95: I see some stamps on the desk.</Paragraph>
    <Paragraph position="17"> #171: He is for the suggestion, but I'm against it.</Paragraph>
    <Paragraph position="18"> #244: She made him an excellent wife.</Paragraph>
    <Paragraph position="19"> #660: He painted the walls and the floor white.</Paragraph>
    <Paragraph position="20"> Some (prepositional and sentential) attachment ambiguities need to be resolved on the basis of semantic information, and scoping of coordinated structures would have to be determined by using not only collocational patterns but also some measures of balance and similarities among constituents.</Paragraph>
  </Section>
</Paper>