File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/94/h94-1029_metho.xml
Size: 20,646 bytes
Last Modified: 2025-10-06 14:13:47
<?xml version="1.0" standalone="yes"?> <Paper uid="H94-1029"> <Title>The Automatic Component of the LINGSTAT Machine-Aided Translation System*</Title> <Section position="3" start_page="0" end_page="166" type="metho"> <SectionTitle> 2. IMPLEMENTATION </SectionTitle> <Paragraph position="0"> Tokenization/de-inflection In LINGSTAT, &quot;tokenization&quot; refers to the process of breaking a source document into a sequence of root words tagged, if necessary, with inflection information. For most languages, the tokenizer is basically an engine that oversees the de-inflection of source words into root forms. For languages like Japanese, written without spaces, the tokenizer also has the job of segmenting the source.</Paragraph> <Paragraph position="1"> To segment Japanese, the LINGSTAT tokenizer uses a probabilistic dynamic programming algorithm to break up the character stream into the sequence of words that maximizes the product of word unigram probabilities, as supplied from a list of 300,000 words (a sketch of this procedure is given below). Inflected forms are recognized during tokenization by a de-inflector module. This module has a language-independent engine driven by a language-specific de-inflection table. (More details on the function of these components can be found in [1].) There have been two improvements in the tokenizer/de-inflector module in the newer versions of the system, made possible by the introduction of part of speech information into the word list. The first is an extra check on the validity of suggested de-inflections by demanding consistency between the inflection and the part of speech of the proposed root. This has cleanly eliminated a number of spurious de-inflections that were previously handled in a more ad-hoc fashion. The second improvement, motivated more by plans to move on to Spanish, is to stop the tokenizer from attempting to uniquely specify the de-inflection path (and now, part of speech) for each token it finds. As an example of the problem this addresses, consider the two de-inflections of the Spanish word ayudas: ayudas → ayuda (help, aid) and ayudas → ayudar (to help, to aid). The original tokenizer made a choice between the noun and verb de-inflection based on the unigram frequency of the root. The new tokenizer still finds all allowed possibilities, but now simply passes them to the parser, which is better equipped to resolve the ambiguity.</Paragraph>
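The segmentation step can be sketched as follows. This is only a minimal illustration of a unigram Viterbi segmenter, not LINGSTAT's actual code; the dictionary, the probability floor for unknown single characters, and the maximum word length are assumptions made for the example.

import math

def segment(text, unigram_prob, max_word_len=8):
    """Split an unspaced character stream into the word sequence that
    maximizes the product of word unigram probabilities (dynamic programming).
    unigram_prob maps known words to probabilities; unknown single
    characters get a small floor so a path always exists."""
    n = len(text)
    best = [float("-inf")] * (n + 1)   # best log-probability of a segmentation of text[:i]
    best[0] = 0.0
    back = [0] * (n + 1)               # start index of the last word on the best path
    floor = math.log(1e-10)
    for i in range(1, n + 1):
        for j in range(max(0, i - max_word_len), i):
            word = text[j:i]
            if word in unigram_prob:
                score = best[j] + math.log(unigram_prob[word])
            elif i - j == 1:
                score = best[j] + floor  # fall back to a single unknown character
            else:
                continue
            if score > best[i]:
                best[i], back[i] = score, j
    # recover the word sequence by following the back-pointers
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return list(reversed(words))

In LINGSTAT the word list and its unigram probabilities would come from the 300,000-word dictionary mentioned above; here they are supplied by the caller.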
<Section position="1" start_page="163" end_page="165" type="sub_section"> <SectionTitle> Parsing </SectionTitle> <Paragraph position="0"> The parser in LINGSTAT has two roles. In the interactive component, information about modifying phrases is extracted from the parse and presented to the user as an aid to understanding the structure of each Japanese sentence. In the automatic component, the parse is the basis for the rearrangement of the Japanese sentence into English.</Paragraph> <Paragraph position="1"> Because it is a long-term goal to have a system that can be quickly adapted to new domains and languages, a high priority is placed on developing parsing techniques that are capable of extracting some information automatically through training on new sources of text, thus minimizing the amount of human effort. In the current system, this has led to a two-stage parsing process. The first stage implements a coarse probabilistic context-free grammar of a few hundred human-supplied rules acting on parts of speech. Because of this coarseness, some parsing ambiguities remain to be resolved by the second-stage parser, which implements a simple, lexicalized, probabilistic context-free grammar trained on word co-occurrences in unlabeled Japanese sentences without human input.</Paragraph> <Paragraph position="2"> Context-free parser. The first-stage parse is done using a standard probabilistic context-free grammar acting on about 50 parts of speech. Any ambiguities in part of speech assignments or de-inflection paths passed by the tokenizer/de-inflector are resolved based on the probability of possible parses. The grammar is allowed to contain unitary and null productions, which impose an ordering on the summation over rules that takes place during training; because there are currently only a few hundred rules, this ordering is checked by hand. The grammar can be trained with either the Inside-Outside [2] or Viterbi [3] algorithm.</Paragraph> <Paragraph position="3"> It is essential that the parser return a parse, even a bad one, for subsequent processing. Therefore special, low-probability &quot;junk&quot; rules have been introduced to handle unanticipated constructions. These junk rules affect the generation of terminal symbols and take the following form: for each rule in which a non-terminal generates a particular terminal, a rule is added permitting the same non-terminal to generate any other terminal with a small probability. This allows the grammar to force the terminal string into a sequence that has a recognizable parse, but at a high enough cost that any parse without such coercion will be favored. One advantage of this approach is that the grammar can compensate for missing or mislabeled data. Consider the fragment the_det large_adv dog_noun, in which the adjective large has been mislabeled as an adverb. The junk rule permits the grammar to change its part of speech to something more appropriate provided no other sensible parse can be found.</Paragraph> <Paragraph position="4"> In principle, the probability of invoking the junk rule could be trained with the other rules in the grammar (the example above suggests that it might be advantageous to do so). Currently this is not being done, based on the observation that an invocation of the junk rule is more likely an indication of a deficiency in the grammar than a useful correction to the data.</Paragraph>
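The junk-rule mechanism lends itself to a small sketch. The table layout, the value of the junk probability, and the renormalization step below are assumptions for illustration; the paper does not say how the terminal-generating rules are actually stored.

def add_junk_rules(lexical_rules, pos_terminals, junk_prob=1e-6):
    """For every non-terminal that already generates some part-of-speech
    terminal, add a low-probability production to every other terminal,
    then renormalize.  This lets the parser coerce a mislabeled token
    (e.g. 'large' tagged adv instead of adj) into a part of speech that
    yields a sensible parse, at a cost high enough that parses needing
    no coercion are always preferred."""
    augmented = {}
    for nonterminal, probs in lexical_rules.items():
        new_probs = dict(probs)
        for t in pos_terminals:
            new_probs.setdefault(t, junk_prob)   # junk rule: tiny probability
        total = sum(new_probs.values())
        augmented[nonterminal] = {t: p / total for t, p in new_probs.items()}
    return augmented

# Hypothetical fragment of the terminal-generating rules: ADJP normally
# rewrites only to the terminal 'adj'; after augmentation it can also
# rewrite to 'adv', which rescues the fragment "the_det large_adv dog_noun".
rules = {"ADJP": {"adj": 1.0}, "ADVP": {"adv": 1.0}}
rules = add_junk_rules(rules, pos_terminals={"det", "adj", "adv", "noun"})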
<Paragraph position="5"> Lexicalized parser. The grammar implemented by the context-free parser is not fine enough to properly resolve certain kinds of ambiguity, such as the correct attachment of prepositional phrases or noun modifiers. These attachment problems are handled by a second parser, which does a top-down rescoring of certain probabilities computed in the first stage. Currently this rescoring is used to fine-tune attachments of particle phrases in Japanese sentences.</Paragraph> <Paragraph position="6"> The second parser makes use of a second probabilistic grammar, one whose basic elements are the words themselves, and whose data consist of the probabilities of each word in the vocabulary to be generated in the context of any other word. Like a bigram language model, these probabilities can be trained on word co-occurrences in unlabeled sentences, but unlike bigrams, the grammar can learn about associations between words in a sentence regardless of their separation.</Paragraph> <Paragraph position="7"> This very simple context-free grammar can be described as follows. To each word in the vocabulary we associate a terminal symbol w (the word itself) and a non-terminal symbol A_w. The grammar consists of the following two kinds of rules: A_{w1} → A_{w2} w2 A_{w2} A_{w1} (1a) and A_{w1} → φ (1b), where φ represents the null production. In addition, we introduce a sentence start symbol A_0 with the production A_0 → A_w w A_w (2). The probability of invoking a particular rule depends only on the word associated with the generating non-terminal and the terminal word in the production. The probabilities for (1a) and (1b) can therefore be written p(w1 → w2) and p(w1 → φ), respectively.</Paragraph> <Paragraph position="9"> There is no null production for the start symbol.</Paragraph> <Paragraph position="10"> Roughly speaking, this grammar generates a sentence in the following manner. The start symbol first generates some word in the sentence. This word then generates some number of words to its left and right, which in turn generate other words to their left and right. From the form of the grammar it can be deduced that these generations are &quot;local,&quot; in the sense that if w1 generates w2 on its right, w2 is not allowed to generate any word to the left of w1 (and similarly for w1 generating w2 on its left). The process continues in a cascading fashion until the whole sentence has been generated. The fertility of a particular word w (i.e., the number of words it will typically generate) is determined by the probability p(w → φ), as can be seen from examining the productions (1): a non-terminal A_w will continue to produce words through rule (1a) via tail recursion until rule (1b) is invoked.</Paragraph> <Paragraph position="11"> Although this grammar has the same type and number of parameters as a bigram model, here they have a very different interpretation: they measure the probability of one word to generate another anywhere in the sentence, subject only to the constraints imposed by the generation process described above. Thus an association between two words that might typically appear together, such as fast and car, will be recognized even if another word might occasionally intervene, such as red. Another feature is that words with the most predictive power in a sentence tend to generate words with less predictive power, which has the consequence that words like the tend to generate no words at all. This is an improvement over a bigram model in which the is required to select a succeeding word from a distribution that is essentially flat across a large portion of the vocabulary.</Paragraph>
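The generation process just described can be made concrete with a toy sampler. This is only an illustration: the probability tables p_gen (for p(w1 → w2)) and p_null (for p(w → φ)) are hypothetical, they are assumed to cover every word that can be drawn, and the ordering of multiple dependents on one side is simplified.

import random

def generate_side(word, p_gen, p_null):
    """Sample the words generated by `word` on one side.  Dependents are
    drawn with probability p(word -> w2) until the null production
    p(word -> phi) is chosen; this is the fertility mechanism.  Each
    dependent recursively generates its own left and right words, which
    stay local to it (they never cross `word`)."""
    out = []
    while random.random() > p_null[word]:
        choices, weights = zip(*p_gen[word].items())
        w2 = random.choices(choices, weights=weights)[0]
        out.extend(generate_side(w2, p_gen, p_null))   # w2's own left words
        out.append(w2)
        out.extend(generate_side(w2, p_gen, p_null))   # w2's own right words
    return out

def generate_sentence(p_start, p_gen, p_null):
    """The start symbol picks a head word, which then generates words to
    its left and right in the cascading fashion described in the text."""
    heads, weights = zip(*p_start.items())
    head = random.choices(heads, weights=weights)[0]
    left = generate_side(head, p_gen, p_null)
    right = generate_side(head, p_gen, p_null)
    return left + [head] + right

Words with a high null probability, such as the, generate nothing, while content words spawn their typical neighbors.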
<Paragraph position="12"> This grammar shares the appealing feature of n-gram models that its parameters can be trained on unlabeled text (consisting of whole sentences). In this case, however, the training procedure is iterative, a modification of the Inside-Outside algorithm that is of order N^4 in the sentence length. (The authors would like to thank Joshua Goodman for developing the N^4 procedure, a notable improvement over previous implementations.) The iteration starts from a flat distribution, with co-occurrences of words within sentences leading to enhanced probabilities for some words to generate others.</Paragraph> <Paragraph position="13"> The N^4 algorithm actually applies to a slightly different (but generatively equivalent) grammar than the one defined by rules (1) and (2). To implement this algorithm, we first replace rule (1a) by A_{w1} → A_{w2} w2 A_{w2} A_{w1} (w2 to the left of w1) and A_{w1} → A_{w1} A_{w2} w2 A_{w2} (w2 to the right of w1), where the probability of both rules is the same and given by p(w1 → w2). The only difference between this and rule (1a) is that when A_w generates multiple words to the right of w, they are generated right to left instead of left to right.</Paragraph> <Paragraph position="14"> As an example of how the N^4 dependence arises, consider the inside calculation for this model. For a sentence w_1 ... w_N, the quantities of interest for the inside pass are the probabilities I(A_{wi} → w_j ... w_{i-1}) for j &lt; i and I(A_{wi} → w_{i+1} ... w_j) for j &gt; i. These may be calculated recursively by a pair of formulae, (3a) for the left strings and (3b) for the right strings, in which the &quot;negative length&quot; string w_i ... w_{i-1} is understood to represent the null production φ. The above computations involve a double sum and are therefore of order N^2, and there are order N^2 probabilities I(A_{wi} → w_j ... w_{i-1}) and I(A_{wi} → w_{i+1} ... w_j), for a total of N^4. (For the Viterbi calculation, one simply selects the largest contribution from the right-hand side of equations (3a) and (3b) instead of doing the double sum.) It is important to note that despite the N^4 behavior, this grammar is in general faster than context-free parsing, which is computationally of order N^3. This is because the compute time for context-free parsing also includes a factor proportional to the number of rules in the grammar, which even in simple cases can be in the hundreds. There is no such factor in the computation for this lexicalized grammar--it is effectively replaced by another power of N, which is much smaller. (A sketch of such an inside pass is given below.)</Paragraph> <Paragraph position="19"> To see how the probabilities p(w1 → w2) converge, this model was run through ten iterations of training on approximately 100,000 sentences of ten words or less from the English half of the Canadian Hansard corpus. Some examples of these probabilities follow. As expected, the word the trains strongly to generate the null symbol φ. The token U. has a strong tendency to generate S. for obvious reasons; that it also generates agreement is a consequence of the frequent discussion in the corpus of the U. S. free trade agreement. This is an example of how the model will find associations between separated words that even a trigram model will not see. The distribution associated with tariffs arises from parliamentary debate on the general agreement on tariffs and trade.</Paragraph> <Paragraph position="20"> The simple grammar described above can be considered the starting point for a class of more complex models.</Paragraph> <Paragraph position="21"> One obvious extension is to train the probability distributions for generating to the left and right separately. This corresponds to implementing the grammar A^L_{w1} → A^L_{w2} w2 A^R_{w2} A^L_{w1} , A^L_{w1} → φ (4a) and A^R_{w1} → A^R_{w1} A^L_{w2} w2 A^R_{w2} , A^R_{w1} → φ (4b). Training this grammar on the same text as the original model yields left probabilities in which, again, the word the tends to generate a null; like most nouns, U. has learned to generate a the to its left, and the left distribution for tariffs includes only those words found typically on its left. The right probabilities for the same words are also consistent with the results from the original model.</Paragraph> </Section>
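Since the recursion formulae (3a) and (3b) are not reproduced in this copy of the paper, the sketch below is only a guess at their form: the boundary conventions, the table layout, and the treatment of the start symbol (p_start) are all assumptions, and the probability tables are hypothetical and assumed to cover every word in the sentence. It does, however, exhibit the N^4 behavior discussed above: O(N^2) table cells, each filled by a double sum.

def inside_pass(words, p_gen, p_null, p_start):
    # words: the sentence as a list of tokens, positions 0 .. n-1
    # p_gen[w1][w2] ~ p(w1 -> w2), p_null[w] ~ p(w -> phi), p_start[w] ~ start-symbol choice
    # L[i][j] = Pr(left non-terminal of word i generates words j .. i-1)
    # R[i][j] = Pr(right non-terminal of word i generates words i+1 .. j)
    n = len(words)
    L = [[0.0] * n for _ in range(n)]
    R = [[0.0] * n for _ in range(n)]
    for i in range(n):
        L[i][i] = p_null[words[i]]      # empty ("negative length") span carries the null production
        R[i][i] = p_null[words[i]]
    for span in range(1, n):            # fill both tables by increasing span length
        for i in range(n):
            j = i - span                # left span j .. i-1
            if j >= 0:
                L[i][j] = sum(p_gen[words[i]].get(words[m], 0.0)
                              * L[m][j] * R[m][k] * L[i][k + 1]
                              for m in range(j, i)        # position of the dependent word
                              for k in range(m, i))       # right edge of its subtree
            j = i + span                # right span i+1 .. j
            if j < n:
                R[i][j] = sum(p_gen[words[i]].get(words[m], 0.0)
                              * R[i][l] * L[m][l + 1] * R[m][j]
                              for m in range(i + 1, j + 1) # position of the dependent word
                              for l in range(i, m))        # right edge of the continuation
    # sentence probability: sum over the head word chosen by the start symbol
    return sum(p_start.get(words[h], 0.0) * L[h][0] * R[h][n - 1] for h in range(n))

For the Viterbi variant mentioned in the text, the accumulating sums would be replaced by running maxima.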
<Section position="2" start_page="165" end_page="166" type="sub_section"> <SectionTitle> Rearrangement </SectionTitle> <Paragraph position="0"> The next step in LINGSTAT's translation method is a transfer of the parse of each Japanese sentence into a corresponding English parse, giving an English word ordering. This is accomplished through the use of English rewrite rules encoded in the Japanese grammar. Through this encoding, each non-terminal in the Japanese grammar corresponds to a non-terminal in an implied English grammar. The rewrite process just consists of taking the Japanese parse and expanding it in this English grammar. As this expansion proceeds, Japanese constructs that are not translated (certain particles, for example) are removed, and tokens for English constructs not represented in the Japanese (such as articles) are introduced. Annotation/language model The Japanese words in the reordered sentence are annotated with (possibly several) candidate English glosses, supplied from an electronic dictionary compiled from various sources. Numbers are translated directly, and katakana tokens (which are usually borrowed foreign words) are transliterated into English. Tokens introduced in the rearrangement step are also glossed; the token indicating an English article is multiply glossed as the, a, an, and null (which expands to an empty word).</Paragraph> <Paragraph position="1"> Inflected Japanese words are glossed by first glossing the root, then applying an English version of the Japanese inflection to each candidate. This is made difficult by the poor correspondence between Japanese and English inflections: English is inflected for person and number, for example, while in Japanese there are inflections for such constructions as the causative, which require non-local changes in the corresponding English. Japanese inflections also often consist of multiple steps, which means that the English inflections must be compounded. For example, to inflect the verb to walk into the past desiderative involves the two-step transformation to walk → to want to walk → wanted to walk.</Paragraph> <Paragraph position="2"> This procedure can produce some unusual results when the number of inflection steps is greater than two.</Paragraph> <Paragraph position="3"> The final step in the translation process is to apply an English language model to select the best gloss from among the many candidates for each word. In the current system this is done with a trigram model, which makes the choices that maximize the average probability per word. The trigram model used was trained on the Wall Street Journal and so has a business bias, partially reflecting the bias of the evaluation texts.</Paragraph> </Section> </Section>
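The gloss-selection step can be sketched as a search over candidate glosses. The organization below (a Viterbi search over two-word histories) and the trigram_logprob scoring function are assumptions for illustration; the paper only states that a trigram model picks the gloss choices with the best probability.

def choose_glosses(candidates, trigram_logprob):
    """Pick one gloss per position so that the trigram score of the whole
    English string is maximized.  `candidates` is a list of lists of
    candidate glosses; a multiply glossed article token contributes
    ['the', 'a', 'an', ''] (the empty string is the null article).
    trigram_logprob(u, v, w) is an assumed log-probability scorer."""
    histories = {("<s>", "<s>"): (0.0, [])}       # (prev2, prev1) -> (score, glosses so far)
    for options in candidates:
        new_histories = {}
        for (u, v), (score, seq) in histories.items():
            for gloss in options:
                if gloss == "":                   # null article: emits nothing
                    key, s, new_seq = (u, v), score, seq
                else:
                    s = score + trigram_logprob(u, v, gloss)
                    key, new_seq = (v, gloss), seq + [gloss]
                if key not in new_histories or s > new_histories[key][0]:
                    new_histories[key] = (s, new_seq)
        histories = new_histories
    return max(histories.values(), key=lambda h: h[0])[1]

A fuller implementation would also normalize by length, since the text says the choices maximize the average probability per word.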
<Section position="4" start_page="166" end_page="166" type="metho"> <SectionTitle> 3. RESULTS </SectionTitle> <Paragraph position="0"> The January 1994 ARPA machine translation evaluation has recently been completed. In this test, Dragon used the same translators as in the May 1993 evaluation and provided them with essentially the same interface and online tools. The difference in this evaluation was that the translators were also provided with an automatically generated English translation of the Japanese document as a first draft. Manual and machine-assisted translation times were measured, and the automatic output was also submitted for separate evaluation.</Paragraph> <Paragraph position="1"> Preliminary timing results show a speedup by a factor of 2.4 in machine-assisted vs. manual translation. Because we were using the May 1993 translators, this result may be compared to the May 1993 result; it is essentially unchanged. This suggests that the draft translation was of no significant help to the translators in this evaluation, probably because the quality of the automatic output is not high enough to be relied upon.</Paragraph> <Paragraph position="2"> A quality measurement of the automatic output is not yet available, but we offer one example of a sample translation from the current system. For one corpus sentence the automatic output was: waazuhaimu dtumodaa of the America investment bank decided to sell off 4.9% of the shares of the same company to Mitsubishi Trust and Banking Corporation. Even this simple sentence demonstrates the large amount of rearrangement necessary to render the Japanese into English. This effort is not without errors; a correct translation shows that the word meaning same company was mishandled, as was the modifier of Wertheim Schroder: The American investment bank Wertheim Schroder has decided to sell 4.9% of its stock to the Mitsubishi Trust and Banking Corporation.</Paragraph> <Paragraph position="3"> This sentence is less complex than is typical in a Japanese newspaper article, and therefore LINGSTAT's performance in this case is not representative.</Paragraph> </Section> <Section position="5" start_page="166" end_page="167" type="metho"> <SectionTitle> 4. FUTURE PLANS </SectionTitle> <Paragraph position="0"> The steps that have the most effect on the quality of the final output translation (at least for Japanese) are the parser and gloss selection modules. The parser in particular is crucial, since it initiates a global rearrangement of the sentence into a sensible English order--a parsing mistake will often render a sentence unintelligible.</Paragraph> <Paragraph position="1"> The improvements contemplated for the parsing module include more hand work on the coarse context-free grammar to provide more accurate parses, and a general speedup to allow more extensive training. A faster parser would also allow the merging of the two grammars so that they could be trained simultaneously. Attempts to do this have so far resulted in an unacceptable increase in training and parsing time due to the complexity of the algorithm.</Paragraph> <Paragraph position="2"> The language model used to select glosses in the final translation step must be improved to have more global control. Common mistakes made by the current model include inconsistent glossing of a recurring word and virtually no notion of topic or domain (except on business subjects). Both of these problems are the result of using a language model, trigrams, that uses such restricted context.</Paragraph> <Paragraph position="3"> The newest version of the system must be ported to Spanish for the next evaluation, scheduled for June. This will require improvements to the Spanish dictionary and de-inflector, an update of the Spanish grammar from the older Spanish system, a lexicalized grammar trained on Spanish text, and Spanish rewrite rules.
We intend to use the parallel Spanish-English component of the UN data to provide gloss information.</Paragraph> </Section> </Paper>