The Automatic Component of the LINGSTAT 
Machine-Aided Translation System* 
Jonathan Yamron, James Cant, Anne Demedts, Taiko Dietzel, and Yoshiko Ito
Dragon Systems, Inc., 320 Nevada Street, Newton, MA 02160
ABSTRACT 
We present the newest implementation of the LINGSTAT
machine-aided translation system. The most significant
change from earlier versions is a new set of modules that
produce a draft translation of the document for the user to
refer to or modify. This paper describes these modules, with
special emphasis on an automatically trained lexicalized
grammar used in the parsing module. Some preliminary results
from the January 1994 ARPA evaluation are reported.
1. INTRODUCTION 
LINGSTAT is an interactive machine-aided translation
system designed to increase the productivity of a
translator. It is aimed both at experienced users whose goal
is high quality translation, and inexperienced users with
little knowledge of the source whose goal is simply to
extract information from foreign language text. (For an
introduction to the basic structure of LINGSTAT, see [1].)
The first problem to be studied is Japanese to English 
translation with an emphasis on text from the domain 
of mergers and acquisitions, although recent evaluations 
have included general newspaper text as well. Work is 
also progressing on a Spanish to English system. The 
approach described below represents the current state of 
the Japanese system, and will be applied with minimal 
changes to Spanish. 
Due to the special difficulties presented by the Japanese 
writing system, previous versions of LINGSTAT have 
focused on developing tools for the lexical analysis of 
Japanese (such as tokenization of the Japanese character 
stream, morphological analysis, and katakana transliteration),
and on providing the user access to lexical information
(such as pronunciations, glosses, and definitions)
via online lookup tools. In addition, a simple parser was 
incorporated to identify modifying phrases. No trans- 
lation of the document was provided. Instead, the user 
used the results of the above analyses and the online 
tools to construct a translation. 
*This work was sponsored by the Advanced Research Projects
Agency under contract number J-FBI-91-239.
In the newest version of LINGSTAT, the user is provided
with a draft translation of the source document.
For a source language similar to English, a starting point 
for such a draft might be a word-for-word translation, 
but because Japanese word order and sentence structure 
are so different from English, a more general framework 
has been constructed. The translation process in LING- 
STAT consists of the following steps: 
• Tokenization and morphological analysis 
• Parsing 
• Rearrangement of the source into English order 
• Annotation and selection of glosses via an English 
language model 
These modules are described in Section 2 below. Sec- 
tion 3 gives some preliminary results from the January 
1994 evaluation, and Section 4 discusses some plans for 
future improvements. 
2. IMPLEMENTATION 
Tokenization/de-inflection 
In LINGSTAT, "tokenization" refers to the process of 
breaking a source document into a sequence of root 
words tagged, if necessary, with inflection information. 
For most languages, the tokenizer is basically an en- 
gine that oversees the de-inflection of source words into 
root forms. For languages like Japanese, written without 
spaces, the tokenizer also has the job of segmenting the 
source. 
To segment Japanese, the LINGSTAT tokenizer uses a 
probabilistic dynamic programming algorithm to break 
up the character stream into the sequence of words that 
maximizes the product of word unigram probabilities, as 
supplied from a list of 300,000 words. Inflected forms are 
recognized during tokenization by a de-inflector module. 
This module has a language-independent engine driven 
by a language-specific de-inflection table. (More details 
on the function of these components can be found in [1].)
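The segmentation step described above can be illustrated with a short dynamic programming sketch. This is not the LINGSTAT implementation: the function name `segment`, the `max_word_len` limit, and the toy probabilities are assumptions made for illustration, and an unspaced English string stands in for Japanese text.

```python
import math

def segment(text, unigram_probs, max_word_len=8):
    """Split text into the word sequence with the highest product of unigram probabilities."""
    n = len(text)
    # best[i] = (log-probability, start of last word) for the best segmentation of text[:i]
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            prob = unigram_probs.get(text[start:end])
            if prob is None or best[start][0] == -math.inf:
                continue  # unknown word, or prefix cannot be segmented
            score = best[start][0] + math.log(prob)
            if score > best[end][0]:
                best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None  # no sequence of known words covers the whole string
    # Follow the backpointers from the end of the string to recover the words.
    words, pos = [], n
    while pos > 0:
        start = best[pos][1]
        words.append(text[start:pos])
        pos = start
    return words[::-1]

# Toy unigram probabilities, invented for illustration only.
toy_probs = {"the": 0.05, "them": 0.01, "me": 0.02,
             "men": 0.01, "end": 0.01, "mend": 0.005}
print(segment("themend", toy_probs))
```

Working in log space turns the product of unigram probabilities into a sum and avoids underflow on long character streams.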
There have been two improvements in the tokenizer/de-inflector
module in the newer versions of the system,
made possible by the introduction of part of speech in- 
formation into the word list. The first is an extra check 
on the validity of suggested de-inflections by demand- 
ing consistency between the inflection and the part of 
speech of the proposed root. This has cleanly eliminated 
a number of spurious de-inflections that were previously 
handled in a more ad-hoc fashion. The second improve- 
ment, motivated more by plans to move on to Spanish, is 
to stop the tokenizer from attempting to uniquely spec- 
ify the de-inflection path (and now, part of speech) for 
each token it finds. As an example of the problem this 
addresses, consider the two de-inflections of the Spanish 
word ayudas: 
ayudas → ayuda (help, aid)
ayudas → ayudar (to help, to aid)
The original tokenizer made a choice between the noun
and verb de-inflection based on the unigram frequency 
of the root. The new tokenizer still finds all allowed 
possibilities, but now simply passes them to the parser, 
which is better equipped to resolve the ambiguity. 
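A minimal sketch of a table-driven de-inflector that keeps every analysis passing the part-of-speech consistency check is given below. The table format, the tag names, and the two-entry word list are hypothetical and chosen only to reproduce the ayudas example; the actual LINGSTAT tables are not described at this level of detail.

```python
# Hypothetical de-inflection table:
# (suffix to strip, suffix to add, inflection tag, part of speech required of the root)
DEINFLECTION_TABLE = [
    ("s", "", "plural", "noun"),
    ("as", "ar", "2sg-present", "verb"),
]

# Hypothetical word list carrying part-of-speech information.
WORD_LIST = {"ayuda": {"noun"}, "ayudar": {"verb"}}

def deinflect(token):
    """Return every (root, part of speech, inflection) analysis allowed by the word list."""
    analyses = []
    # The uninflected reading, if the token is itself a listed root.
    for pos in sorted(WORD_LIST.get(token, ())):
        analyses.append((token, pos, None))
    for strip, add, tag, pos in DEINFLECTION_TABLE:
        if not token.endswith(strip):
            continue
        root = token[: len(token) - len(strip)] + add
        # Consistency check: the proposed root must exist with the
        # part of speech that this inflection requires.
        if pos in WORD_LIST.get(root, ()):
            analyses.append((root, pos, tag))
    return analyses

print(deinflect("ayudas"))
```

Both analyses are returned rather than one being selected, which mirrors the decision to leave the choice to the parser.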
Parsing 
The parser in LINGSTAT has two roles. In the inter- 
active component, information about modifying phrases 
is extracted from the parse and presented to the user as 
an aid to understanding the structure of each Japanese 
sentence. In the automatic component, the parse is the 
basis for the rearrangement of the Japanese sentence into 
English. 
Because it is a long-term goal to have a system that can 
be quickly adapted to new domains and languages, a
high priority is placed on developing parsing techniques 
that are capable of extracting some information auto- 
matically through training on new sources of text, thus 
minimizing the amount of human effort. In the cur- 
rent system, this has led to a two-stage parsing process. 
The first stage implements a coarse probabilistic context- 
free grammar of a few hundred human-supplied rules 
acting on parts of speech. Because of this coarseness, 
some parsing ambiguities remain to be resolved by the 
second-stage parser, which implements a simple, lexical- 
ized, probabilistic context-free grammar trained on word 
co-occurrences in unlabeled Japanese sentences without 
human input. 
Context-free parser. The first-stage parse is done using
a standard probabilistic context-free grammar acting
on about 50 parts of speech. Any ambiguities in part of 
speech assignments or de-inflection paths passed by the 
tokenizer/de-inflector are resolved based on the prob- 
ability of possible parses. The grammar is allowed to 
contain unitary and null productions, which impose an
ordering on the summation over rules that takes place
during training; because there are currently only a few
hundred rules, this ordering is checked by hand. The
grammar can be trained with either the Inside-Outside
[2] or Viterbi [3] algorithm.
It is essential that the parser return a parse, even a bad
one, for subsequent processing. Therefore special,
low-probability "junk" rules have been introduced to handle
unanticipated constructions. These junk rules affect the
generation of terminal symbols and take the following
form: for each rule in which a non-terminal generates a
particular terminal, a rule is added permitting the same
non-terminal to generate any other terminal with a small
probability. This allows the grammar to force the terminal
string into a sequence that has a recognizable parse,
but at a high enough cost that any parse without
such coercion will be favored. One advantage of this
approach is that the grammar can compensate for missing
or mislabeled data. Consider the fragment
the_det large_adv dog_noun
in which the adjective large has been mislabeled as an 
adverb. The junk rule permits the grammar to change its 
part of speech to something more appropriate provided 
no other sensible parse can be found. 
In principle, the probability of invoking the junk rule 
could be trained with the other rules in the grammar (the 
example above suggests that it might be advantageous 
to do so). Currently this is not being done, based on the 
observation that an invocation of the junk rule is more 
likely an indication of a deficiency in the grammar than 
a useful correction to the data. 
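The construction of the junk rules can be sketched as follows. The function name, the value of `junk_prob`, and the rescaling that keeps each distribution normalized are assumptions; the text above states only that the added rules carry a small, untrained probability.

```python
def add_junk_rules(lexical_rules, terminals, junk_prob=1e-6):
    """Add low-probability junk rules to non-terminal -> terminal rules.

    lexical_rules maps each non-terminal to {terminal: probability}.
    """
    augmented = {}
    for nonterminal, emissions in lexical_rules.items():
        others = [t for t in terminals if t not in emissions]
        # Assumption: scale the original rules down so the
        # distribution still sums to one after the junk rules are added.
        scale = 1.0 - junk_prob * len(others)
        new_emissions = {t: p * scale for t, p in emissions.items()}
        for t in others:
            new_emissions[t] = junk_prob  # junk rule: any other terminal
        augmented[nonterminal] = new_emissions
    return augmented

# Toy grammar fragment over part-of-speech terminals (illustrative only).
rules = {"DET": {"det": 1.0}, "ADJ": {"adj": 1.0}, "NOUN": {"noun": 1.0}}
terminals = ["det", "adj", "adv", "noun"]
augmented = add_junk_rules(rules, terminals, junk_prob=0.001)
print(augmented["ADJ"])
```

With these rules the non-terminal ADJ can absorb the mislabeled adverb in the fragment above, but only at the cost of a factor of `junk_prob`, so any parse that needs no coercion scores higher.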
Lexicalized parser. The grammar implemented by the
context-free parser is not fine enough to properly resolve 
certain kinds of ambiguity, such as the correct attach- 
ment of prepositional phrases or noun modifiers. These 
attachment problems are handled by a second parser, 
which does a top-down rescoring of certain probabilities 
computed in the first stage. Currently this rescoring 
is used to fine-tune attachments of particle phrases in 
Japanese sentences. 
The second parser makes use of a second probabilistic 
grammar, one whose basic elements are the words them- 
selves, and whose data consist of the probabilities of each 
word in the vocabulary to be generated in the context 
of any other word. Like a bigram language model, these 
probabilities can be trained on word co-occurrences in 
unlabeled sentences, but unlike bigrams, the grammar 
can learn about associations between words in a sentence 
regardless of their separation. 
This very simple context-free grammar can be described
as follows. To each word in the vocabulary we associate
a terminal symbol $w$ (the word itself) and a non-terminal
symbol $A_w$. The grammar consists of the following two
kinds of rules:
$$A_{w_1} \to A_{w_2}\, w_2\, A_{w_2}\, A_{w_1}, \qquad \text{(1a)}$$
$$A_{w_1} \to \phi, \qquad \text{(1b)}$$
where $\phi$ represents the null production. In addition, we
introduce a sentence start symbol $A_0$ with the production
$$A_0 \to A_w\, w\, A_w. \qquad \text{(2)}$$
The probability of invoking a particular rule depends
only on the word associated with the generating
non-terminal and the terminal word in the production. The
probabilities for (1a) and (1b) can therefore be written
$p(w_1 \to w_2)$ and $p(w_1 \to \phi)$, respectively, and these
satisfy
$$p(w_1 \to \phi) + \sum_{w_2} p(w_1 \to w_2) = 1.$$
For the start symbol, the probabilities are $p(0 \to w)$ and
satisfy
$$\sum_{w} p(0 \to w) = 1.$$
There is no null production for the start symbol.
Roughly speaking, this grammar generates a sentence in 
the following manner. The start symbol first generates 
some word in the sentence. This word then generates 
some number of words to its left and right, which in 
turn generate other words to their left and right. From 
the form of the grammar it can be deduced that these 
generations are "local," in the sense that if $w_1$ generates
$w_2$ on its right, $w_2$ is not allowed to generate any word
to the left of $w_1$ (and similarly for $w_1$ generating $w_2$ on
its left). The process continues in a cascading fashion
until the whole sentence has been generated. The fertility
of a particular word $w$ (i.e., the number of words it
will typically generate) is determined by the probability
$p(w \to \phi)$, as can be seen from examining the productions
(1): a non-terminal $A_w$ will continue to produce
words through rule (1a) via tail recursion until rule (1b)
is invoked.
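The generative process can be made concrete by sampling from rules (1a), (1b), and (2). The sketch below is illustrative only: the probability tables are invented, `None` stands in for the null production, and the tail recursion of rule (1a) is written as a loop.

```python
import random

NULL = None  # stands for the null production

def sample_sentence(start_probs, gen_probs, rng):
    """Sample a word sequence from rules (1a), (1b), and (2)."""

    def choose(dist):
        outcomes = list(dist)
        return rng.choices(outcomes, weights=[dist[o] for o in outcomes])[0]

    def expand(word):
        # One instance of the non-terminal A_word.
        out = []
        while True:
            child = choose(gen_probs[word])
            if child is NULL:
                return out  # rule (1b): stop generating
            # Rule (1a): A_w1 -> A_w2 w2 A_w2 A_w1; the trailing A_w1
            # is the next pass through this loop.
            out.extend(expand(child) + [child] + expand(child))

    head = choose(start_probs)  # rule (2): A_0 -> A_w w A_w
    return expand(head) + [head] + expand(head)

# Toy probabilities, invented for illustration only.
START = {"car": 1.0}
GEN = {
    "car": {NULL: 0.5, "fast": 0.25, "red": 0.25},
    "red": {NULL: 0.9, "fast": 0.1},
    "fast": {NULL: 1.0},
}
print(sample_sentence(START, GEN, random.Random(0)))
```

Each pass through the loop stops with probability $p(w \to \phi)$, so the number of words produced directly by one instance of $A_w$ is geometrically distributed with mean $(1 - p(w \to \phi)) / p(w \to \phi)$. In the toy table, each instance of the non-terminal for car generates one word on average, while fast generates none.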
Although this grammar has the same type and number
of parameters as a bigram model, here they have a very
different interpretation: they measure the probability of
one word to generate another anywhere in the sentence,
subject only to the constraints imposed by the generation
process described above. Thus an association between
two words that might typically appear together,
such as fast and car, will be recognized even if another
word might occasionally intervene, such as red. Another
feature is that words with the most predictive power in
a sentence tend to generate words with less predictive
power, which has the consequence that words like the
tend to generate no words at all. This is an improvement
over a bigram model in which the is required to
select a succeeding word from a distribution that is
essentially flat across a large portion of the vocabulary.
This grammar shares the appealing feature of n-gram
models that its parameters can be trained on unlabeled
text (consisting of whole sentences). In this case, however,
the training procedure is iterative, a modification
of the Inside-Outside algorithm that is of order $N^4$ in
the sentence length.¹ The iteration starts from a flat
distribution, with co-occurrences of words within sentences
leading to enhanced probabilities for some words
to generate others.
The $N^4$ algorithm actually applies to a slightly different
(but generatively equivalent) grammar than the one
defined by rules (1) and (2). To implement this algorithm,
we first replace rule (1a) by
$$A_{w_1} \to A_{w_2}\, w_2\, A_{w_2}\, A_{w_1} \quad (w_2 \text{ to the left of } w_1),$$
$$A_{w_1} \to A_{w_1}\, A_{w_2}\, w_2\, A_{w_2} \quad (w_2 \text{ to the right of } w_1),$$
where the probability of both rules is the same and given
by $p(w_1 \to w_2)$. The only difference between this and
rule (1a) is that when $A_w$ generates multiple words to
the right of $w$, they are generated right to left instead of
left to right.
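As an example of how the $N^4$ dependence arises, consider
the inside calculation for this model. For a sentence
$w_1 \ldots w_N$, the quantities of interest for the inside pass
are the probabilities $I(A_{w_i} \to w_j \ldots w_{i-1})$ for $j < i$ and
$I(A_{w_i} \to w_{i+1} \ldots w_j)$ for $j > i$. These may be calculated
recursively by the following formulae:
$$I(A_{w_i} \to w_j \ldots w_{i-1}) = \sum_{k=j}^{i-1} \sum_{l=k}^{i-1} p(w_i \to w_k)\, I(A_{w_k} \to w_j \ldots w_{k-1})\, I(A_{w_k} \to w_{k+1} \ldots w_l)\, I(A_{w_i} \to w_{l+1} \ldots w_{i-1}), \qquad \text{(3a)}$$
$$I(A_{w_i} \to w_{i+1} \ldots w_j) = \sum_{k=i+1}^{j} \sum_{l=i}^{k-1} p(w_i \to w_k)\, I(A_{w_k} \to w_{k+1} \ldots w_j)\, I(A_{w_k} \to w_{l+1} \ldots w_{k-1})\, I(A_{w_i} \to w_{i+1} \ldots w_l), \qquad \text{(3b)}$$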
where the "negative length" string $w_i \ldots w_{i-1}$ is understood
to represent the null production $\phi$. The recursion
is initialized by
$$I(A_{w_i} \to \phi) = p(w_i \to \phi).$$
¹The authors would like to thank Joshua Goodman for developing
the $N^4$ procedure, a notable improvement over previous
implementations.
The above computations involve a double sum and are
therefore of order $N^2$, and there are order $N^2$ probabilities
$I(A_{w_i} \to w_j \ldots w_{i-1})$ and $I(A_{w_i} \to w_{i+1} \ldots w_j)$, for
a total of $N^4$. (For the Viterbi calculation, one simply
selects the largest contribution from the right hand side
of equations (3a) and (3b) instead of doing the double
sum.)
It is important to note that despite the $N^4$ behavior, this
grammar is in general faster than context-free parsing,
which is computationally of order $N^3$. This is because
the compute time for context-free parsing also includes
a factor proportional to the number of rules in the grammar,
which even in simple cases can be in the hundreds.
There is no such factor in the computation for this
lexicalized grammar; it is effectively replaced by another
power of $N$, which is much smaller.
To see how the probabilities $p(w_1 \to w_2)$ converge, this
model was run through ten iterations of training on ap- 
proximately 100,000 sentences of ten words or less from 
the English half of the Canadian Hansard corpus. Some 
examples of these probabilities follow: 
the          U.                tariffs
.91 φ        .52 φ             .44 φ
             .26 S.            .09 agreement
             .14 agreement     .08 and
                               .08 general
                               .08 on
As expected, the word "the" trains strongly to generate the null
symbol φ. The token U. has a strong tendency to generate
S. for obvious reasons; that it also generates agreement 
is a consequence of the frequent discussion in the corpus 
of the U. S. free trade agreement. This is an example of 
how the model will find associations between separated 
words that even a trigram model will not see. The distri- 
bution associated with tariffs arises from parliamentary 
debate on the general agreement on tariffs and trade. 
The simple grammar described above can be considered 
the starting point for a class of more complex models. 
One obvious extension is to train the probability distributions
for generating to the left and right separately.
This corresponds to implementing the grammar
$$A^L_{w_1} \to A^L_{w_2}\, w_2\, A^R_{w_2}\, A^L_{w_1}, \qquad A^L_{w_1} \to \phi, \qquad \text{(4a)}$$
$$A^R_{w_1} \to A^R_{w_1}\, A^L_{w_2}\, w_2\, A^R_{w_2}, \qquad A^R_{w_1} \to \phi. \qquad \text{(4b)}$$
Training this grammar on the same text as the original 
model yields the left probabilities: 
the          U.           tariffs
.96 φ        .80 φ        .35 φ
             .10 the      .14 agreement
                          .12 general
                          .12 on
Again, the word "the" tends to generate a null. Like most nouns,
U. has learned to generate a "the" to its left, and the left
distribution for tariffs includes only those words found
typically on its left. The right probabilities for the same
words are:
the          U.                tariffs
.90 φ        .3? φ             .52 φ
             .37 S.            .18 and
             .19 agreement     .17 trade
             .07 free
These are also consistent with the results from the orig- 
inal model. 
Rearrangement 
The next step in LINGSTAT's translation method is 
a transfer of the parse of each Japanese sentence into 
a corresponding English parse, giving an English word 
ordering. This is accomplished through the use of 
English rewrite rules encoded in the Japanese gram- 
mar. Through this encoding, each non-terminal in the 
Japanese grammar corresponds to a non-terminal in an 
implied English grammar. The rewrite process just consists
of taking the Japanese parse and expanding in this
English grammar. As this expansion proceeds, Japanese 
constructs that are not translated (certain particles, for 
example) are removed, and tokens for English constructs 
not represented in the Japanese (such as articles) are in- 
troduced. 
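The rewrite step can be pictured as a recursive walk over the Japanese parse in which each rule carries an English ordering of its children. The sketch below is only a schematic: the rule inventory, the node encoding, and the `<article>` placeholder token are hypothetical and are not taken from the LINGSTAT grammar.

```python
# Each hypothetical Japanese rule is paired with an English rewrite: a list whose
# items are child indices (giving the English order) or strings naming inserted
# English tokens. Children that are not mentioned are dropped.
REWRITES = {
    ("S", ("NP", "wa", "VP")): [0, 2],   # drop the topic particle
    ("VP", ("NP", "wo", "V")): [2, 0],   # object-verb becomes verb-object
    ("NP", ("N",)): ["<article>", 0],    # insert an article token
}

def rearrange(node):
    """Return the leaves of a parse in English order.

    A node is either a leaf (label, word) or an internal node (label, [children]).
    """
    label, body = node
    if isinstance(body, str):
        return [body]
    key = (label, tuple(child[0] for child in body))
    # Rules without a rewrite keep the Japanese order.
    template = REWRITES.get(key, list(range(len(body))))
    out = []
    for item in template:
        if isinstance(item, int):
            out.extend(rearrange(body[item]))
        else:
            out.append(item)
    return out

parse = ("S", [
    ("NP", [("N", "Schroder")]),
    ("wa", "WA"),
    ("VP", [("NP", [("N", "stock")]), ("wo", "WO"), ("V", "sell")]),
])
print(rearrange(parse))
```

The inserted `<article>` tokens are left unresolved at this stage; choosing among the, a, an, and the empty word is deferred to the gloss selection step.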
Annotation/language model 
The Japanese words in the reordered sentence are annotated
with (possibly several) candidate English glosses,
supplied from an electronic dictionary compiled from
various sources. Numbers are translated directly, and
katakana tokens (which are usually borrowed foreign
words) are transliterated into English. Tokens intro- 
duced in the rearrangement step are also glossed; the 
token indicating an English article is multiply glossed as 
the, a, an, and null (which expands to an empty word). 
Inflected Japanese words are glossed by first glossing the 
root, then applying an English version of the Japanese 
inflection to each candidate. This is made difficult by the 
poor correspondence between Japanese and English in- 
flections: English is inflected for person and number, for 
example, while in Japanese there are inflections for such 
constructions as the causative, which require non-local 
changes in the corresponding English. Japanese inflections
also often consist of multiple steps, which means
that the English inflections must be compounded. For
example, to inflect the verb to walk into the past desiderative
involves the two step transformation,
to walk → to want to walk → wanted to walk.
This procedure can produce some unusual results when 
the number of inflection steps is greater than two. 
The final step in the translation process is to apply an 
English language model to select the best gloss from 
among the many candidates for each word. In the cur- 
rent system this is done with a trigram model, which 
makes the choices that maximize the average probabil- 
ity per word. The trigram model used was trained on 
Wall Street Journal and so has a business bias, partially 
reflecting the bias of the evaluation texts. 
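Gloss selection can be sketched as a search over one candidate per position, scored by a trigram model. The sketch below makes two assumptions that the text does not settle: "average probability per word" is read as the average log-probability (a geometric mean), and the search is exhaustive rather than the dynamic programming a practical system would use. The function `trigram_prob` and the toy probabilities are hypothetical.

```python
import itertools
import math

def select_glosses(candidates, trigram_prob):
    """Pick one gloss per position, maximizing the average log-probability per word.

    candidates: list of lists of glosses; the empty string is a null gloss.
    trigram_prob(u, v, w): hypothetical callable returning P(w | u, v) > 0.
    """
    best_score, best_words = -math.inf, None
    for choice in itertools.product(*candidates):
        words = [g for g in choice if g]  # null glosses expand to nothing
        if not words:
            continue
        history = ["<s>", "<s>"]
        logp = 0.0
        for w in words:
            logp += math.log(trigram_prob(history[-2], history[-1], w))
            history.append(w)
        score = logp / len(words)  # average per word
        if score > best_score:
            best_score, best_words = score, words
    return best_words

# Toy trigram table with a small floor for unseen events (illustrative only).
TOY = {
    ("<s>", "<s>", "the"): 0.5, ("<s>", "the", "bank"): 0.4,
    ("the", "bank", "decided"): 0.3, ("<s>", "<s>", "a"): 0.2,
    ("<s>", "a", "bank"): 0.1, ("a", "bank", "decided"): 0.3,
}

def toy_trigram(u, v, w):
    return TOY.get((u, v, w), 0.001)

candidates = [["the", "a", "an", ""], ["bank", "shore"], ["decided"]]
print(select_glosses(candidates, toy_trigram))
```

Normalizing by the number of words matters because null glosses change the length of the output; without it, shorter candidate sentences would be favored simply for containing fewer factors.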
3. RESULTS 
The January 1994 ARPA machine translation evaluation
has recently been completed. In this test, Dragon
used the same translators as in the May 1993 evaluation 
and provided them with essentially the same interface 
and online tools. The difference in this evaluation was 
that the translators were also provided an automatically
generated English translation of the Japanese document 
as a first draft. Manual and machine-assisted translation 
times were measured, and the automatic output was also 
submitted for separate evaluation. 
Preliminary timing results show a speedup by a factor of 
2.4 in machine-assisted vs. manual translation. Because 
we were using the May 1993 translators, this result may 
be compared to the May 1993 result; it is essentially 
unchanged. This suggests that the draft translation was 
of no significant help to the translators in this evaluation, 
probably because the quality of automatic output is not 
high enough to be relied upon. 
A quality measurement of the automatic output is not 
yet available, but we offer one example of a sample trans- 
lation from the current system. For the following cor- 
rectly glossed Japanese sentence, 
(Schroder) WA, (Mitsubishi Trust and Banking
Corporation) (to) (same company) NO (stock) NO
(4.9%) NO (sell) KOTO WO (decided)
LINGSTAT produced 
waazuhaimu shuroodaa of the America investment
bank decided to sell off 4.9% of the shares of the
same company to Mitsubishi Trust and Banking
Corporation
Even this simple sentence demonstrates the large amount 
of rearrangement necessary to render the Japanese into 
English. This effort is not without errors; a correct
translation shows that the word meaning same company was
mishandled, as was the modifier of Wertheim Schroder:
The American investment bank Wertheim Schroder
has decided to sell 4.9% of its stock to the Mitsubishi
Trust and Banking Corporation
This sentence is less complex than is typical in a 
Japanese newspaper article, and therefore LINGSTAT's 
performance in this case is not representative. 
4. FUTURE PLANS 
The steps that have the most effect on the quality of the 
final output translation (at least for Japanese) are the 
parser and gloss selection modules. The parser in particular
is crucial, since it initiates a global rearrangement
of the sentence into a sensible English order--a parsing 
mistake will often render a sentence unintelligible. 
The improvements contemplated for the parsing mod- 
ule include more hand work on the coarse context-free 
grammar to provide more accurate parses, and a gen- 
eral speedup to allow more extensive training. A faster 
parser would also allow the merging of the two grammars 
so that they could be trained simultaneously. Attempts 
to do this have so far resulted in an unacceptable increase 
in training and parsing time due to the complexity of the 
algorithm. 
The language model used to select glosses in the final 
translation step must be improved to have more global 
control. Common mistakes made by the current model 
include inconsistent glossing of a recurring word and vir- 
tually no notion of topic or domain (except on business 
subjects). Both of these problems are the result of us- 
ing a language model, trigrams, that uses such restricted 
context. 
The newest version of the system must be ported to 
Spanish for the next evaluation, scheduled for June. This 
will require improvements to the Spanish dictionary and 
de-inflector, an update of the Spanish grammar from the 
older Spanish system, a lexicalized grammar trained on 
Spanish text, and Spanish rewrite rules. We intend to 
use the parallel Spanish-English component of the UN 
data to provide gloss information. 
REFERENCES 
[1] J. Yamron, J. Baker, Paul Bamberg, Haakon Chevalier,
Taiko Dietzel, John Elder, Frank Kampmann, Mark
Mandel, Linda Manganaro, Todd Margolis, and Elizabeth
Steele, LINGSTAT: An Interactive, Machine-Aided
Translation System, Proceedings of the ARPA Human
Language Technology Workshop, March 1993.
[2] J.K. Baker, Trainable Grammars for Speech Recognition,
Speech Communication Papers for the 97th Meeting of
the Acoustical Society of America (D.H. Klatt and J.J.
Wolf, eds.), pp. 547-550, 1979.
[3] F. Jelinek, J.D. Lafferty, and R.L. Mercer, Basic
Methods of Probabilistic Context-Free Grammars, in
Speech Recognition and Understanding: Recent Advances,
Trends, and Applications, P. Laface and R. De Mori,
eds., Springer-Verlag, Series F: Computer and Systems
Sciences, vol. 75, 1992.
