<?xml version="1.0" standalone="yes"?> <Paper uid="N06-1032"> <Title>Grammatical Machine Translation</Title> <Section position="3" start_page="248" end_page="250" type="metho"> <SectionTitle> 2 Extracting F-Structure Snippets </SectionTitle> <Paragraph position="0"> Our method for extracting transfer rules for dependency structure snippets operates on the paired sentences of a sentence-aligned bilingual corpus. Similar to phrase-based SMT, our approach starts with an improved word alignment that is created by intersecting the alignment matrices for both translation directions and then refining the intersection by adding directly adjacent alignment points and alignment points that align previously unaligned words (see Och et al. (1999)). Next, source and target sentences are parsed using source and target LFG grammars to produce a set of possible f(unctional) dependency structures for each side (see Riezler et al. (2002) for the English grammar and parser; Butt et al. (2002) for German).</Paragraph> <Paragraph position="1"> The two f-structures that most preserve dependencies are selected for further consideration. Selecting the most similar instead of the most probable f-structures is advantageous for rule induction since it provides for higher coverage with simpler rules. In the third step, the many-to-many word alignment created in the first step is used to define many-to-many correspondences between the substructures of the f-structures selected in the second step. The parsing process maintains an association between words in the string and particular predicate features in the f-structure, and thus the predicates on the two sides are implicitly linked by virtue of the original word alignment. The word alignment is extended to f-structures by setting into correspondence the f-structure units that immediately contain linked predicates. These f-structure correspondences are the basis for hypothesizing candidate transfer rules.</Paragraph> <Paragraph position="2"> To illustrate, suppose our corpus contains the following aligned sentences (this example is taken from our experiments on German-to-English translation): Dafür bin ich zutiefst dankbar.</Paragraph> <Paragraph position="3"> I have a deep appreciation for that.</Paragraph> <Paragraph position="4"> Suppose further that we have created a many-to-many bi-directional word alignment indicating, for example, that Dafür is aligned with words 6 and 7 of the English sentence (for and that).</Paragraph> <Paragraph position="7"> This results in the links between the predicates of the source and target f-structures shown in Fig. 1. From these source-target f-structure alignments, transfer rules are extracted in two steps. In the first step, primitive transfer rules are extracted directly from the alignment of f-structure units. These include simple rules for mapping lexical predicates, such as PRED(%X1, ich) ==> PRED(%X1, I), and somewhat more complicated rules for mapping local f-structure configurations. For example, one rule derived from the alignment of the outermost f-structures maps any f-structure whose pred is sein to an f-structure with pred have, and in addition interprets the subj-to-subj link as an indication to map the subject of a source f-structure with this predicate into the subject of the target and the xcomp of the source into the object of the target. Features denoting number, person, type, etc. are not shown; variables %X denote f-structure values.</Paragraph>
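To make the rule format more concrete, the following is a minimal Python sketch of how the two kinds of primitive rules just described might be represented; the dictionary encoding of f-structures and the function names are illustrative assumptions, not the paper's actual transfer formalism.

    # Hypothetical encoding of the primitive transfer rules described above.
    # An f-structure is approximated as a nested dict; the %X variables of the
    # rule notation correspond to values that are carried over unchanged.

    def rule_ich(fs):
        """PRED(%X1, ich) ==> PRED(%X1, I): rewrite only the lexical predicate."""
        if fs.get("PRED") == "ich":
            return dict(fs, PRED="I")
        return None  # rule not applicable

    def rule_sein(fs, transfer):
        """Map an f-structure headed by 'sein' to one headed by 'have', sending
        SUBJ to SUBJ and XCOMP to OBJ, as the text describes."""
        if fs.get("PRED") != "sein":
            return None
        return {"PRED": "have",
                "SUBJ": transfer(fs["SUBJ"]),   # subj-to-subj link
                "OBJ": transfer(fs["XCOMP"])}   # source xcomp becomes target obj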
<Paragraph position="9"> A single source f-structure can also be mapped to a local configuration of several units on the target side; in this case, the single f-structure headed by dafür is mapped into one that corresponds to an English preposition+object f-structure.</Paragraph> <Paragraph position="11"> Transfer rules are required to operate only on contiguous units of the f-structure that are consistent with the word alignment. This transfer contiguity constraint states that (1) source and target f-structures are each connected, and (2) f-structures in the transfer source can only be aligned with f-structures in the transfer target, and vice versa.</Paragraph> <Paragraph position="12"> This constraint on f-structures is analogous to the constraint on contiguous and alignment-consistent phrases employed in phrase-based SMT. It prevents the extraction of a transfer rule that would translate dankbar directly into appreciation, since appreciation is also aligned to zutiefst and its f-structure would also have to be included in the transfer. Thus, the primitive transfer rule for these predicates must translate zutiefst dankbar as a unit into a deep appreciation.</Paragraph> <Paragraph position="14"> In the second step, rules for more complex mappings are created by combining primitive transfer rules that are adjacent in the source and target f-structures. For instance, we can combine the primitive transfer rule that maps sein to have with the primitive transfer rule that maps ich to I to produce a complex transfer rule covering both.</Paragraph> <Paragraph position="16"> In the worst case, there can be an exponential number of combinations of primitive transfer rules, so we only allow at most three primitive transfer rules to be combined. This produces O(n^2) transfer rules in the worst case, where n is the number of f-structures in the source.</Paragraph> <Paragraph position="17"> Other points where linguistic information comes into play are morphological stemming in f-structures and the optional filtering of f-structure phrases based on consistency of linguistic types. For example, the extraction of a phrase pair that translates zutiefst dankbar into a deep appreciation is valid in the string-based world, but would be prevented in the f-structure world because of the incompatibility of the types A and N for adjectival dankbar and nominal appreciation. Similarly, a transfer rule translating sein to have could be dispreferred because of a mismatch in the verbal types V/A and V/N. However, the transfer of sein zutiefst dankbar to have a deep appreciation is licensed by compatible head types V.</Paragraph> </Section>
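The transfer contiguity constraint lends itself to a simple check. The following Python sketch is an illustration under assumed representations (unit ids, an adjacency map over f-structure units, and a set of alignment links), not the authors' extraction code.

    # Sketch: contiguity and alignment-consistency check for a candidate rule.
    # `edges` maps each f-structure unit to the units it immediately contains or
    # is contained in; `links` is the set of aligned (source_unit, target_unit) pairs.

    def connected(units, edges):
        units = set(units)
        if not units:
            return False
        seen, stack = set(), [next(iter(units))]
        while stack:
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            stack.extend(v for v in edges.get(u, ()) if v in units)
        return seen == units

    def contiguous_and_consistent(src_units, tgt_units, src_edges, tgt_edges, links):
        # Condition (1): both sides are connected.
        if not (connected(src_units, src_edges) and connected(tgt_units, tgt_edges)):
            return False
        # Condition (2): no alignment link crosses the boundary of the candidate.
        src_units, tgt_units = set(src_units), set(tgt_units)
        for s, t in links:
            if (s in src_units) != (t in tgt_units):
                return False
        return True

Under this check, a candidate pairing dankbar with appreciation alone is rejected, because the link between zutiefst and appreciation has only its target side inside the candidate.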
<Section position="4" start_page="250" end_page="250" type="metho"> <SectionTitle> 3 Parsing-Transfer-Generation </SectionTitle> <Paragraph position="0"> We use LFG grammars, producing c(onstituent)-structures (trees) and f(unctional)-structures (attribute-value matrices) as output, for parsing source and target text (Riezler et al., 2002; Butt et al., 2002).</Paragraph> <Paragraph position="1"> To increase robustness, the standard grammar is augmented with a FRAGMENT grammar. This allows sentences that are outside the scope of the standard grammar to be parsed as well-formed chunks specified by the grammar, with unparsable tokens possibly interspersed. The correct parse is determined by a fewest-chunk method.</Paragraph> <Paragraph position="2"> Transfer converts source f-structures into target f-structures by non-deterministically applying all of the induced transfer rules in parallel. Each fact in the German f-structure must be transferred by exactly one transfer rule. For robustness, a default rule is included that transfers any fact as itself. Similar to parsing, transfer works on a chart. The chart has an edge for each combination of facts that have been transferred.</Paragraph> <Paragraph position="3"> When the chart is complete, the outputs of the transfer rules are unified to make sure they are consistent (for instance, that the transfer rules did not produce two determiners for the same noun). Selection of the most probable transfer output is done by beam decoding on the transfer chart.</Paragraph> <Paragraph position="4"> LFG grammars can be used bidirectionally for parsing and generation; thus the existing English grammar used for parsing the training data can also be used for generation of English translations.</Paragraph> <Paragraph position="5"> For in-coverage examples, the grammar specifies c-structures that differ in linear precedence of sub-trees for a given f-structure, and realizes the terminal yield according to morphological rules. In order to guarantee non-empty output for the overall translation system, the generation component has to be fault-tolerant in cases where the transfer system operates on a fragmentary parse, or produces invalid f-structures from valid input f-structures. For generation from unknown predicates, a default morphology is used to inflect the source stem correctly for English. For generation from unknown structures, a default grammar is used that allows any attribute to be generated in any order as any category, with optimality marks set so as to prefer the standard grammar over the default grammar.</Paragraph> </Section>
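To summarize how the components of this section fit together, here is a rough Python sketch of the parse-transfer-generate pipeline; the function parameters stand in for the LFG parser (with FRAGMENT fallback), the chart-based transfer component, the generator, and the scorer, and the n-best sizes are those reported later in Section 5. This is an illustration, not the system's actual interface.

    # Illustrative outline of the parse-transfer-generate pipeline (assumed API).
    def translate(sentence, parse, transfer, generate, score,
                  n_parses=1, n_transfers=10, n_strings=1000):
        candidates = []
        for source_fs in parse(sentence)[:n_parses]:
            # Transfer applies all rules in parallel on a chart; every fact is
            # covered by exactly one rule, with a default copy rule as fallback.
            for target_fs in transfer(source_fs, beam=20)[:n_transfers]:
                for string in generate(target_fs)[:n_strings]:
                    candidates.append((score(string, target_fs), string))
        # The real system guarantees non-empty output via default morphology and
        # a default generation grammar; the empty-string fallback is only a stub.
        return max(candidates)[1] if candidates else ""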
<Section position="5" start_page="250" end_page="250" type="metho"> <SectionTitle> 4 Statistical Models and Training </SectionTitle> <Paragraph position="0"> The statistical components of our system are modeled on the statistical components of the phrase-based system Pharaoh, described in Koehn et al. (2003) and Koehn (2004).</Paragraph> <Paragraph position="1"> Pharaoh integrates the following 8 statistical models: relative frequency of phrase translations in source-to-target and target-to-source direction, lexical weighting in source-to-target and target-to-source direction, phrase count, language model probability, word count, and distortion probability.</Paragraph> <Paragraph position="2"> Correspondingly, our system computes the following statistics for each translation:
1. log-probability of source-to-target transfer rules, where the probability r(e|f) of a rule that transfers source snippet f into target snippet e is estimated by the relative frequency r(e|f) = count(f ==> e) / Σ_e' count(f ==> e')
2. log-probability of target-to-source rules
3. log-probability of lexical translations from source to target snippets, estimated from Viterbi alignments â between source word positions i = 1, ..., n and target word positions j = 1, ..., m
4. log-probability of lexical translations from target to source snippets
5. number of transfer rules
6. number of transfer rules with frequency 1
7. number of default transfer rules (translating source features into themselves)
8. log-probability of strings of predicates from root to frontier of the target f-structure, estimated from predicate trigrams in English f-structures
9. number of predicates in the target f-structure
10. number of constituent movements during generation, based on the original order of the head predicates of the constituents (for example, AP[2] BP[3] CP[1] counts as two movements since the head predicate of CP moved from the first position to the third position)
11. number of generation repairs
12. log-probability of the target string as computed by a trigram language model
13. number of words in the target string
These statistics are combined into a log-linear model whose parameters are adjusted by minimum error rate training (Och, 2003).</Paragraph> </Section>
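As an illustration of how the first feature and the final combination might be computed, here is a hedged Python sketch; the inputs (a list of extracted (source snippet, target snippet) pairs and a dictionary of feature values) are assumptions for the example, not the system's data structures.

    from collections import Counter
    from math import log

    # Feature 1: relative-frequency estimate of r(e|f) for extracted rules.
    def rule_log_probs(extracted_pairs):
        pair_counts = Counter(extracted_pairs)            # counts of (f, e)
        source_counts = Counter(f for f, _ in extracted_pairs)
        return {(f, e): log(c / source_counts[f])
                for (f, e), c in pair_counts.items()}

    # Log-linear combination of the 13 statistics; the weights come from
    # minimum error rate training.
    def log_linear_score(features, weights):
        return sum(weights[name] * value for name, value in features.items())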
<Section position="6" start_page="250" end_page="252" type="metho"> <SectionTitle> 5 Experimental Evaluation </SectionTitle> <Paragraph position="0"> The setup for our experimental comparison is German-to-English translation on the Europarl parallel data set. For quick experimental turnaround we restricted our attention to sentences with 5 to 15 words, resulting in a training set of 163,141 sentences and a development set of 1,967 sentences. Final results are reported on the test set of 1,755 sentences of length 5-15 that was used in Koehn et al. (2003).</Paragraph> <Paragraph position="1"> To extract transfer rules, an improved bidirectional word alignment was created for the training data from the word alignment of IBM model 4 as implemented by GIZA++ (Och et al., 1999). Training sentences were parsed using German and English LFG grammars (Riezler et al., 2002; Butt et al., 2002). The grammars obtain 100% coverage on unseen data: 80% of the sentences are parsed as full parses; 20% receive FRAGMENT parses. Around 700,000 transfer rules were extracted from f-structure pairs chosen according to a dependency similarity measure. For language modeling, we used the trigram model of Stolcke (2002).</Paragraph> <Paragraph position="2"> When applied to translating unseen text, the system operates on n-best lists of parses, transferred f-structures, and generated strings. For minimum-error-rate training on the development set, and for translating the test set, we considered 1 German parse for each source sentence, 10 transferred f-structures for each source parse, and 1,000 generated strings for each transferred f-structure. Selection of the most probable translations proceeds in two steps: First, the most probable transferred f-structure is computed by a beam search on the transfer chart using the first 10 features described above. These features include tests on source and target f-structure snippets related via transfer rules (features 1-7) as well as language model and distortion features on the target c- and f-structures (features 8-10). In our experiments, the beam size was set to 20 hypotheses.</Paragraph> <Paragraph position="3"> The second step is based on features 11-13, which are computed on the strings that were actually generated from the selected n-best f-structures.</Paragraph> <Paragraph position="4"> We compared our system to IBM model 4 as produced by GIZA++ (Och et al., 1999) and a phrase-based SMT model as provided by Pharaoh (Koehn, 2004).</Paragraph> <Paragraph position="5"> The same improved word alignment matrix and the same training data were used for phrase extraction for phrase-based SMT as well as for transfer-rule extraction for LFG-based SMT. Minimum-error-rate training was done using Koehn's implementation of Och's (2003) minimum-error-rate model. To train the weights for phrase-based SMT we used the first 500 sentences of the development set; the weights of the LFG-based translator were adjusted on the 750 sentences that were in coverage of our grammars.</Paragraph> <Paragraph position="6"> For automatic evaluation, we use the NIST metric (Doddington, 2002) combined with the approximate randomization test (Noreen, 1989), providing the desired combination of a sensitive evaluation metric and an accurate significance test (see Riezler and Maxwell (2005)). In order to avoid a random assessment of statistical significance in our three-fold pairwise comparison, we reduce the per-comparison significance level to 0.01 so as to achieve a standard experimentwise significance level of 0.05 (see Cohen (1995)). Table 1 shows results for IBM model 4, phrase-based SMT, and LFG-based SMT, where examples that are in coverage of the LFG-based system are evaluated separately. Out of the 1,755 sentences of the test set, 44% were in coverage of the LFG grammars; for 51% the system had to resort to the FRAGMENT technique for parsing and/or repair techniques in generation; in 5% of the cases our system timed out. Since our grammars are not set up with punctuation in mind, punctuation is ignored in all evaluations reported below.</Paragraph>
[Table 1 caption (beginning not recovered): ... phrase-based SMT (P), and the LFG-based SMT (LFG) on the full test set and on in-coverage examples for LFG. Results in the same row that are not statistically significant from each other are marked with a [?].]
[Table 2 caption (beginning not recovered): ... translations of phrase-based SMT (P) or LFG-based SMT (LFG) under criteria of fluency/grammaticality and translational/semantic adequacy on 500 in-coverage examples. Ratings by judge 1 are shown in rows, for judge 2 in columns. Agreed-on examples are shown in boldface in the diagonals.]
<Paragraph position="9"> For in-coverage examples, the difference between NIST scores for the LFG system and the phrase-based system is statistically not significant. On the full set of test examples, the suboptimal quality on out-of-coverage examples overwhelms the quality achieved on in-coverage examples, resulting in a statistically not significant difference in NIST scores between the LFG system and IBM model 4.</Paragraph> <Paragraph position="10"> In order to discern the factors of grammaticality and translational adequacy, we conducted a manual evaluation on 500 randomly selected examples that were in coverage of the grammar-based generator.</Paragraph> <Paragraph position="11"> Two independent human judges were presented with the source sentence and the output of the phrase-based and LFG-based systems in a blind test. This was achieved by displaying the system outputs in random order. The judges were asked to indicate a preference for one system translation over the other, or whether they thought them to be of equal quality.</Paragraph> <Paragraph position="12"> These questions had to be answered separately under the criteria of grammaticality/fluency and translational/semantic adequacy. As shown in Table 2, both judges express a preference for the LFG system over the phrase-based system for both adequacy and grammaticality. If we just look at sentences where the judges agree, we see a net improvement on translational adequacy of 57 sentences, which is an improvement of 11.4% over the 500 sentences. If this were part of a hybrid system, this would amount to a 5% overall improvement in translational adequacy.</Paragraph> <Paragraph position="13"> Similarly, we see a net improvement on grammaticality of 77 sentences, which is an improvement of 15.4% over the 500 sentences, or 6.7% overall in a hybrid system. Result differences on agreed-on ratings are statistically significant, where significance was assessed by approximate randomization via stratified shuffling of the preferences between the systems (Noreen, 1989). Examples from the manual evaluation are shown in Fig. 2.</Paragraph> <Paragraph position="14"> Along the same lines, a further manual evaluation was conducted on 500 randomly selected examples that were out of coverage of the LFG-based grammars. Across the combined set of 1,000 in-coverage and out-of-coverage sentences, this resulted in an agreed-on preference for the phrase-based system in 204 cases and for the LFG-based system in 158 cases under the measure of translational adequacy.</Paragraph> <Paragraph position="15"> Under the grammaticality measure, the phrase-based system was preferred by both judges in 157 cases and the LFG-based system in 136 cases.</Paragraph> </Section>
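For readers unfamiliar with the significance test used above, the following Python sketch outlines a generic paired approximate randomization test in the spirit of Noreen (1989); it is not the authors' evaluation code, and the per-sentence scores it takes as input are an assumption.

    import random

    # Approximate randomization for a paired system comparison: shuffle which
    # system each paired score is attributed to and count how often the shuffled
    # difference is at least as large as the observed one.
    def approximate_randomization(scores_a, scores_b, trials=10000, seed=0):
        rng = random.Random(seed)
        observed = abs(sum(scores_a) - sum(scores_b))
        extreme = 0
        for _ in range(trials):
            diff = 0.0
            for a, b in zip(scores_a, scores_b):
                if rng.random() < 0.5:   # swap the pair's labels
                    a, b = b, a
                diff += a - b
            if abs(diff) >= observed:
                extreme += 1
        return (extreme + 1) / (trials + 1)   # estimated significance level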
<Section position="7" start_page="252" end_page="254" type="metho"> <SectionTitle> 6 Discussion </SectionTitle> <Paragraph position="0"> The evaluation of the LFG-based translator presented above shows promising results for examples that are in coverage of the employed LFG grammars.</Paragraph> <Paragraph position="1"> However, a back-off to robustness techniques in parsing and/or generation results in a considerable loss in translation quality. The high percentage of examples that fall out of coverage of the LFG-based system can partially be explained by the accumulation of errors in parsing the training data, where the source and target language parsers each produce FRAGMENT parses in 20% of the cases. Together with errors in rule extraction, this results in a large number of ill-formed transfer rules that force the generator to back off to robustness techniques.</Paragraph>
[Fig. 2: Examples from the manual evaluation, with source (src) sentences and human reference (ref) translations: preference of LFG over the phrase-based system under both adequacy and grammaticality (ex. 1-5), preference of the phrase-based system over LFG (ex. 6-10). All ratings are agreed on by both judges.
(1) src: in diesem fall werde ich meine verantwortung wahrnehmen / ref: then i will exercise my responsibility / LFG: in this case i accept my responsibility / P: in this case i shall my responsibilities
(2) src: die politische stabilität hängt ab von der besserung der lebensbedingungen / ref: political stability depends upon the improvement of living conditions / LFG: the political stability hinges on the recovery the conditions / P: the political stability is rejects the recovery of the living conditions
(3) src: und schliesslich muss dieser agentur eine kritische haltung gegenüber der kommission selbst erlaubt sein / ref: moreover the agency must be able to criticise the commission itself / LFG: and even to the commission a critical stance must finally be allowed this agency / P: finally this is a critical attitude towards the commission itself to be agency
(4) src: nach der ratifizierung werden co2 emissionen ihren preis haben / ref: after ratification co2 emission will have a price tag / LFG: carbon dioxide emissions have its price following the ratification / P: after the ratification co2 emissions are a price
(5) src: die lebensmittel müssen die sichere ernährung des menschen gewährleisten / ref: man's food must be safe to eat / LFG: food must guarantee the safe nutrition of the people / P: the people of the nutrition safe food must guarantee
(6) src: was wir morgen beschliessen werden ist letztlich material für das vermittlungsverfahren / ref: whatever we agree tomorrow will ultimately have to go into the conciliation procedure / LFG: one tomorrow we approved what is ultimately material for the conciliation procedure / P: what we decide tomorrow is ultimately material for the conciliation procedure
(7) src: die verwaltung muss künftig schneller reagieren können / ref: in future the administration must be able to react more quickly / LFG: more in future the administration must be able to react / P: the administration must be able to react more quickly
(8) src: das ist jetzt über 40 jahre her / ref: that was over 40 years ago / LFG: on 40 years ago it is now / P: that is now over 40 years ago
(9) src: das ist schon eine seltsame vorstellung von gleichheit / ref: a strange notion of equality / LFG: equality that is even a strange idea / P: this is already a strange idea of equality
(10) src: frau präsidentin ich beglückwünsche herrn nicholson zu seinem ausgezeichneten bericht / ref: madam president i congratulate mr nicholson on his excellent report / LFG: madam president i congratulate mister nicholson on his report excellented / P: madam president i congratulate mr nicholson for his excellent report]
<Paragraph position="2"> In applying the parse-transfer-generation pipeline to translating unseen text, parsing errors can cause erroneous transfer, which can result in generation errors. Similar effects can be observed for errors in translating in-coverage examples. Here, disambiguation errors in parsing and transfer propagate through the system, producing suboptimal translations. An error analysis on 100 suboptimal in-coverage examples from the development set showed that 69 suboptimal translations were due to transfer errors, 10 of which were due to errors in parsing.</Paragraph> <Paragraph position="3"> The discrepancy between NIST scores and manual preference rankings can be explained on the one hand by the suboptimal integration of transfer and generation in our system, which makes it infeasible to work with large n-best lists in training and application. Moreover, despite our use of minimum-error-rate training and n-gram language models, our system cannot be adjusted to maximize n-gram scores on reference translations in the same way as phrase-based systems, since statistical ordering models are employed in our framework after grammar-based generation, thus giving preference to grammaticality over similarity to reference translations.</Paragraph> </Section> </Paper>