<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0910"> <Title>Ivan.P.Bretan@telia.se Robert.H.Eklund@telia.se Mats.G.Wiren@telia.se</Title> <Section position="3" start_page="55" end_page="57" type="metho"> <SectionTitle> 3 Porting grammars and lexica between closely related languages </SectionTitle> <Paragraph position="0"> The original version of the Core Language Engine had a single language description for English, written by hand from scratch (Pulman, 1992; Rayner, 1994). Subsequently, language descriptions have been developed for Swedish (Gambäck and Rayner, 1992), French and Spanish (Rayner, Carter and Bouillon, 1995). In each of these cases, the new language description was created by manually editing the relevant files for the closest existing language.</Paragraph> <Paragraph position="1"> (The Swedish and French systems are modified versions of the original English one; the Spanish system is modified from the French one.) There are, however, some serious drawbacks to this approach. Firstly, it requires a considerable quantity of expert effort; secondly, there is no mechanism for keeping the resulting grammars in step with each other. Changes are often made to one grammar and not percolated to the other ones until concrete problems show up in test suites or demos. The net result is that the various grammars tend to drift steadily further apart.</Paragraph> <Paragraph position="2"> When we recently decided to create a language description for Danish, we thought it would be interesting to experiment with a more principled methodology, which explicitly attempts to address the problems mentioned above. The conditions appeared ideal: we were porting from Swedish, Swedish and Danish being an extremely closely related language pair. The basic principles we have attempted to observe are the following: * Whenever feasible, we have tried to arrange things so that the linguistic descriptions for the two languages consist of shared files. 
In particular, the grammar rule files for the two languages are shared. When required, rules or parts of rules specific to one language are placed inside macros whose expansion depends on the identity of the current language, so that the rule expands when loaded to an appropriate language-specific version.</Paragraph> <Paragraph position="3"> * When files cannot easily be shared (in particular, for the content-word lexica), we define the file for the new language in terms of declarations listing the explicit differences against the corresponding file for the old language. We have attempted to make the structure of these declarations as simple as possible, so that they can be written by linguists who lack prior familiarity with the system and its notation.</Paragraph> <Paragraph position="4"> Although we are uncertain how much generality to claim for the results (Swedish and Danish, as already noted, are exceptionally close), we found them encouraging. Four of the 175 existing Swedish grammar rules turned out to be inapplicable to Danish, and two had to be replaced by corresponding Danish rules. Five more rules had to be parameterized by language-specific macros. Some of the morphology rules needed to be rewritten, but this only required about two days of effort from a system specialist working together with a Danish linguist. The most significant piece of work, which we will now describe in more detail, concerned the lexicon.</Paragraph> <Paragraph position="5"> Our original intuition here was that the function-word lexicon and the paradigm macros (cf. Section 2) would be essentially the same between the two languages, except that the surface forms of function words would vary. To put it slightly differently, we anticipated that it would make sense as a first approximation to say that there was a one-to-one correspondence between Swedish and Danish function-words, and that their QLF representations could be left identical. 
This assumption does indeed appear to be borne out by the facts. The only complication we have come across so far concerns definite determiners: the feature-value assignments between the two languages need to differ slightly in order to handle the different rules in Swedish and Danish for determiner/noun agreement. This was handled, as with the grammar rules, by introducing a suitable call to a language-specific macro.</Paragraph> <Paragraph position="6"> With regard to content words, the situation is somewhat different. Since word choice in translation is frequently determined both by collocational and by semantic considerations, it does not make as much sense to insist on one-to-one correspondences and identical semantic representations. We consequently decided that content-words would have a language-dependent QLF representation, so as to make it possible to use our normal strategy of letting the Swedish-to-Danish translation rules in general be many-to-many, with collocational preferences filtering the space of possible transfers.</Paragraph> <Paragraph position="7"> The remarks above motivate the concrete lexicon-porting strategy which we now sketch. All work was carried out by Danish linguists who had a good knowledge of computational linguistics and Swedish, but no previous exposure to the system. The starting point was to write a set of word-to-word (WW) translation rules (cf. Section 2), which for each Swedish surface lexical item defined a set of possible Danish translations. The left-hand side of each WW rule specified a Swedish surface word-form and an associated grammatical category (verb, noun, etc.), and the right-hand side a possible Danish translation.</Paragraph> <Paragraph position="8"> An initial &quot;blank&quot; version of the rules was created automatically by machine analysis of a corpus; the left-hand side of the rule was filled in correctly, and a set of examples taken from the corpus was listed above. 
The linguist only needed to fill in the right-hand side appropriately with reference to the examples supplied.</Paragraph> <Paragraph position="9"> The next step was to use the word-to-word rules to induce a Danish lexicon. As a first approximation, we assumed that the possible grammatical (syntactic/semantic) categories of the word on the right-hand side of a WW rule would be the same as those of the word on its left-hand side. (Note that in general a word will have more than one lexical entry.)</Paragraph> <Paragraph position="10"> Thus lexicon entries could be copied across from Swedish to Danish with appropriate modifications.</Paragraph> <Paragraph position="11"> In the case of function-words, the entry is copied across with only the surface form changed. For content-words, the porting routines query the linguist for the additional information needed to transform each specific item as follows.</Paragraph> <Paragraph position="12"> If the left-hand (Swedish) word belongs to a lexical category subject to morphological inflection, the linguist is asked for the root form of the right-hand (Danish) word and its inflectional pattern. If the inflectional pattern is marked as wholly or partly irregular (e.g. with strong verbs), the linguist is also queried for the values of the relevant irregular inflections. All requests for lexical information are output in a single file at the end of the run, formatted for easy editing. This makes it possible for the linguist to process large numbers of information requests quickly and efficiently, and feed the revised declarations back into the porting process in an iterative fashion.</Paragraph> <Paragraph position="13"> One particularly attractive aspect of the scheme is that transfer rules are automatically generated as a byproduct of the porting process. 
Grammar rules and function-words are regarded as interlingual; thus for each QLF constant C involved in the definition of a grammar rule or a function-word definition, the system adds a transfer rule which maps C into itself.</Paragraph> <Paragraph position="14"> Content-words are not interlingual. However, since each target lexical entry L is created from a source counterpart L', it is trivial to create simultaneously a transfer rule which maps the source QLF constant associated with L' into the target QLF constant associated with L.</Paragraph> </Section> <Section position="4" start_page="57" end_page="57" type="metho"> <SectionTitle> 4 Transfer composition </SectionTitle> <Paragraph position="0"> The previous sections have hopefully conveyed some of the flavour of our translation framework, which conceptually can be thought of as half-way between transfer and interlingua. We would if possible like to move closer to the interlingual end; however, the problems touched on above mean that we do not see this as being a realistic short-term possibility. Meanwhile, we are stuck with the problem that dogs all multilingual transfer-based systems: the number of sets of transfer rules required increases quadratically in the number of system languages. Even three languages are enough to make the problem non-trivial.</Paragraph> <Paragraph position="1"> In a recent paper (Rayner et al., 1996), we described a novel approach to the problem which we have implemented within the SLT system. Exploiting the declarative nature of our transfer formalism, we compose (off-line) existing sets of rules for the language pairs L1 → L2 and L2 → L3, to create a new set of rules for L1 → L3. It is clear that this can be done for rules which map atomic constants into atomic constants. What is less obvious is that complex rules, recursively defined in terms of translation of their sub-constituents, can also be composed. 
The method used is based on program-transformation ideas taken from logic programming, and is described in detail in the earlier paper. Simple methods, described in the same paper, can also be used to compose an approximate transfer preference model for the new language-pair.</Paragraph> <Paragraph position="2"> The rule composition algorithm is not complete; we strongly suspect that, because of recursion effects, the problem of finding a complete set of composed transfer rules is undecidable. But in practice, the set of composed rules produced is good enough that it can be improved quickly to an acceptable level of performance. Our methodology for performing this task makes use of rationally constructed, balanced domain corpora to focus the effort on frequently occurring problems (Rayner, Carter and Bouillon, 1995). It involves making declarations to reduce the overgeneration of composed rules; adding hand-coded rules to fill coverage holes; and adjusting preferences. The details are reported in (Rayner et al., 1996).</Paragraph> </Section> <Section position="5" start_page="57" end_page="68" type="metho"> <SectionTitle> 5 Experiments </SectionTitle> <Paragraph position="0"> We will now present results for concrete experiments, where we applied the methods described above so as to rapidly construct translation systems for two new language pairs. 
All of the translation modules involved operate within the same Air Travel Inquiry (ATIS; Hemphill et al., 1990) domain as other versions of SLT, using a vocabulary of about 1 500 source-language stem entries, and have been integrated into the main SLT system to produce versions which can perform credible translation of spoken Swedish into French and spoken English into Danish respectively.</Paragraph> <Paragraph position="1"> and English → Swedish on unseen speech data</Paragraph> <Section position="1" start_page="68" end_page="68" type="sub_section"> <SectionTitle> 5.1 Swedish → English → French </SectionTitle> <Paragraph position="0"> This section describes an exercise which involved using transfer composition to construct a Swedish → French translation system by composing Swedish → English and English → French versions of the system. The total expert effort was about two person-weeks. We start by summarizing results, and then sketch the main points of the manual work needed to adjust the composed rule-sets.</Paragraph> <Paragraph position="1"> We used a corpus of 442 previously unseen spoken utterances, and processed the N-best lists output for them by the speech recognizer. The results are as given in Table 1; for comparison, we also give the results for English → Swedish, the language pair to which we have devoted the most effort (and which does not involve any transfer composition).</Paragraph> <Paragraph position="2"> Thus almost 30% (top row) of the translations produced were completely acceptable, with another 30% or so (rows 2-3) having only minor problems, giving a total of 60% that would probably be acceptable in practical use. A further 9% (rows 4-5) contained major errors but also some correct information, while nearly all the remaining 30% (bottom 3 rows) were clearly unacceptable, consisting either of nonsense or of a translation that made some sense but was wrong. 
The reasons for these 30% of outright failures, compared to only about 10% for English → Swedish, are first, that recognizer performance is slightly less good for Swedish than for English, owing to less training data being available; second, that Swedish and French differ more than English and Swedish do; third, that transfer rules for both the component pairs (Swedish → English and English → French) have had much less work devoted to them than English → Swedish; and last but not least, of course, that transfer composition is being used.</Paragraph> <Paragraph position="3"> When cleaning up the automatically composed Swedish → French rule-set, the task on which we spent most effort was that of limiting overgeneration of composed transfer rules. The second most important task was manual improvement of the composed transfer preference model. The methods used are described in more detail in (Rayner et al., 1996).</Paragraph> </Section> <Section position="2" start_page="68" end_page="68" type="sub_section"> <SectionTitle> 5.2 English → Swedish → Danish </SectionTitle> <Paragraph position="0"> This section briefly describes a second series of experiments, in which we converted an English → Swedish system into an English → Danish system using the methods described earlier. The total investment of system expert effort was again around two person-weeks.</Paragraph> <Paragraph position="1"> About half the effort was used to port the Swedish language description to Danish, employing the methods of Section 3. After this, we carried out two rounds of testing and bug-fixing on the Swedish → Danish translation task. For this, we used a Swedish representative corpus, containing 331 sentences representing 9 385 words from the original Swedish corpus. These tests uncovered a number of new problems resulting from previously unnoted divergences between the Swedish and Danish grammars. 
About half the problems disappeared after the addition of 20 or so small hand-coded adjustments to the morphology, function-word lexicon, transfer rules and transfer preferences.</Paragraph> <Paragraph position="2"> After the second round of bug-fixing, 95% of the Swedish sentences received a Danish translation, and 79% a fully acceptable translation. (When measuring results on representative corpora, we count coverage in terms of &quot;weighted scores&quot;. The weight assigned to a sentence is proportional to the number of words it represents in the original corpus: that is, its length in words times the number of sentences it represents.) Most of the translation errors that did occur were minor ones. Finally, we composed the English → Swedish and Swedish → Danish rules to create an English → Danish rule-set, and used this, after a day's editing by an expert, to test English → Danish translation using a representative text corpus (we will present results for unseen speech input at the workshop). Our results, using the same scheme as above, were as given in Table 2.</Paragraph> </Section> </Section> </Paper>