Unit Completion for a Computer-aided Translation 
System 
Philippe Langlais, George Foster and Guy Lapalme 
RALI / DIRO 
Universit6 de Montreal 
C.P. 6128, succursale Centre-ville 
Montral (Qubec), Canada, H3C 3J7 
{ f elipe,f o ster, lapalme } @iro. umontreal, ca 
Typing 
Abstract 
This work is in the context of TRANSTYPE, a sys- 
tem that observes its user as he or she types a trans- 
lation and repeatedly suggests completions for the 
text already entered. The user may either accept, 
modify, or ignore these suggestions. We describe the 
design, implementation, and performance of a pro- 
totype which suggests completions of units of texts 
that are longer than one word. 
1 Introduction 
TRANSTYPE is part of a project set up to explore 
an appealing solution to Interactive Machine Trans- 
lation (IMT). In constrast to classical IMT systems, 
where the user's role consists mainly of assisting the 
computer to analyse the source text (by answering 
questions about word sense, ellipses, phrasal attach- 
ments, etc), in TRANSTYPE the interaction is direct- 
ly concerned with establishing the target text. 
Our interactive translation system works as fol- 
lows: a translator selects a sentence and begins typ- 
ing its translation. After each character typed by 
the translator, the system displays a proposed com- 
pletion, which may either be accepted using a spe- 
cial key or rejected by continuing to type. Thus 
the translator remains in control of the translation 
process and the machine must continually adapt it- 
s suggestions in response to his or her input. We 
are currently undertaking a study to measure the 
extent to which our word-completion prototype can 
improve translator productivity. The conclusions of 
this study will be presented elsewhere. 
The first version of TrtANSTYPE (Foster et al., 
1997) only proposed completions for the current 
word. This paper deals with predictions which ex- 
tend to the next several words in the text. The po- 
tential gain from multiple-word predictions can be 
appreciated in the one-sentence translation task re- 
ported in table 1, where a hypothetical user saves 
over 60% of the keystrokes needed to produce a 
translation in a word completion scenario, and about 
85% in a "unit" completion scenario. 
In all the figures that follow, we use different fonts 
to differentiate the various input and output: italics 
are used for the source text, sans-serif for characters 
typed by the user and typewriter-like for charac- 
ters completed by the system. 
The first few lines of the table 1 give an idea of 
how TransType functions. Let us assume the unit s- 
cenario (see column 2 of the table) and suppose that 
the user wants to produce the sentence "Ce projet 
de loi est examin~ ~ la chambre des communes" as a 
translation for the source sentence "This bill is ex- 
amined in the house of commons". The first hypoth- 
esis that the system produces before the user enters 
a character is loi (law). As this is not a good guess 
from TRANSTYPE the user types the first character 
(c) of the words he or she wants as a translation. 
Taking this new input into account, TRANSTYPE 
then modifies its proposal so that it is compatible 
whith what the translator has typed. It suggests 
the desired sequence ce projet de Ioi, which the user 
can simply validate by typing a dedicated key. Con- 
tinuing in this way, the user and TRANSTYPE alter- 
nately contribute to the final translation. A screen 
copy of this prototype is provided in figure 1. 
2 The Core Engine 
The core of TRANSTYPE is a completion engine 
which comprises two main parts: an evaluator which 
assigns probabilistic scores to completion hypotheses 
and a generator which uses the evaluation function 
to select the best candidate for completion. 
2.1 The Evaluator 
The evaluator is a function p(t\[t', s) which assigns to 
each target-text unit t an estimate of its probability 
given a source text s and the tokens t' which precede 
t in the current translation of s. 1 Our approach to 
modeling this distribution is based to a large extent 
on that of the IBM group (Brown et al., 1993), but 
it differs in one significant aspect: whereas the IB- 
M model involves a "noisy channel" decomposition, 
we use a linear combination of separate prediction- 
s from a  model p(tlt ~) and a translation 
model p(tls ). Although the noisy channel technique 
1We assume the existence of a deterministic procedure for 
tokenizing the target text. 
135 
This bill is examined in the house of commons 
word-completion task unit-completion task 
ce 
projet 
de 
Ioi 
est 
examin~ 
chambre 
des 
communes 
preL completions 
ce+ /loi • C/' 
p+ /est• p/rojet 
d+ /trbs • d/e 
I+ /t=~s • I/oi 
e+ /de • e/st 
e+ /en • e/xamin6 
~+ /par • ~/ 1~ 
+ /chambre 
de+ /co,~unes • d/e 
+ /communes 
• de/s 
pref. completions 
c-l- /loJ. • c/e pro jet de 1oi 
e+ /de • e/st 
ex+ /~ la chambre des communes. 
+ /b la chambre des con~unes 
e/n • ex/min~ 
Table 1: A one-sentence session illustrating the word- and unit-completion tasks. The first column indicates 
the target words the user is expected to produce. The next two columns indicate respectively the prefixes 
typed by the user and the completions proposed by the system in a word-completion task. The last two 
columns provide the same information for the unit-completion task. The total number of keystrokes for 
both tasks is reported in the last line. + indicates the acceptance key typed by the user. A completion is 
denoted by a/13 where a is the typed prefix and 13 the completed part. Completions for different prefixes 
are separated by •. 
is powerful, it has the disadvantage that p(slt' , t) is 
more expensive to compute than p(tls ) when using 
IBM-style translation models. Since speed is cru- 
cial for our application, we chose to forego the noisy 
channel approach in the work described here. Our 
linear combination model is described as follows: 
pCtlt',s) = pCtlt') a(t',s) + pCtls) \[1 - exit',s)\] (1) 
• ~ • • v J 
 translation 
where a(t', s) E \[0, 1\] are context-dependent inter- 
polation coefficients. For example, the translation 
model could have a higher weight at the start of a 
sentence but the contribution of the  mod- 
el might become more important in the middle or 
the end of the sentence• A study of the weightings 
for these two models is described elsewhere• In the 
work described here we did not use the contribution 
of the  model (that is, a(t', s) = O, V t', s). 
Techniques for weakening the independence as- 
sumptions made by the IBM models 1 and 2 have 
been proposed in recent work (Brown et al., 1993; 
Berger et al., 1996; Och and Weber, 98; Wang and 
Waibel, 98; Wu and Wong, 98). These studies report 
improvements on some specific tasks (task-oriented 
limited vocabulary) which by nature are very differ- 
ent from the task TRANSTYPE is devoted to. Fur- 
thermore, the underlying decoding strategies are too 
time consuming for our application• We therefore 
use a translation model based on the simple linear in- 
terpolation given in equation 2 which combines pre- 
dictions of two translation models -- Ms and M~ -- 
both based on IBM-like model 2(Brown et al., 1993). 
Ms was trained on single words and Mu, described 
in section 3, was trained on both words and units. 
-- _ (2) 
word unit 
where Ps and Pu stand for the probabilities given re- 
spectively by Ms and M~. G(s) represents the new 
sequence of tokens obtained after grouping the to- 
kens of s into units. The grouping operator G is 
illustrated in table 2 and is described in section 3. 
2.2 The Generator 
The task of the generator is to identify units that 
match the current prefix typed by the user, and pick 
the best candidate according to the evaluator. Due 
to time considerations, the generator introduces a 
division of the target vocabulary into two parts: a 
small active component whose contents are always 
searched for a match to the current prefix, and a 
much larger passive part over (380,000 word form- 
s) which comes into play only when no candidates 
are found in the active vocabulary. The active part 
is computed dynamically when a new sentence is s- 
elected by the translator. It is composed of a few 
entities (tokens and units) that are likely to appear 
in the translation. It is a union of the best can- 
didates provided by each model Ms and M~ over 
the set of all possible target tokens (resp. units) 
that have a non-null translation probability of being 
translated by any of the current source tokens (resp. 
units). Table 2 shows the 10 most likely tokens and 
units in the active vocabulary for an example source 
sentence. 
136 
that. is • what. the . prime, minister . said 
• and. i • have. outlined• what. has. 
happened . since• then.. 
c' - est. ce-que, le- premier - ministre, a- 
dit.,.et.j', ai. r4sum4- ce. qui.s'- est- 
produit - depuis • . 
g(s) that is what • the prime minister said • , and i 
• have . outlined • what has happened • since 
then • . 
As 
A~ 
• • • est • ce • ministre • que. et • a • premier 
lie 
ce qui s' est produit • et je - c' est ce que. voil~ 
ce que • qu' est - c' est •, et • le premier ministre 
disait 
Table 2: Role of the generator for a sample pair of 
sentences (t is the translation of s in our corpus). 
G(s) is the sequence of source tokens recasted by 
the grouping operator G. A8 indicates the 10 best 
tokens according to the word model, Au the 10 best 
units according to the unit model. 
3 Modeling Unit Associations 
Automatically identifying which source words or 
groups of words will give rise to which target words 
or groups of words is a fundamental problem which 
remains open. In this work, we decided to proceed 
in two steps: a) monolingually identifying groups of 
words that would be better handled as units in a giv- 
en context, and b) mapping the resulting source and 
target units. To train our unit models, we used a 
segment of the Hansard corpus consisting of 15,377 
pairs of sentences, totaling 278,127 english token- 
s (13,543 forms) and 292,865 french tokens (16,399 
forms). 
3.1 Finding Monolingual Units 
Finding relevant units in a text has been explored in 
many areas of natural  processing. Our ap- 
proach relies on distributional and frequency statis- 
tics computed on each sequence of words found in a 
training corpus. For sake of efficiency, we used the 
suffix array technique to get a compact representa- 
tion of our training corpus. This method allows the 
efficient retrieval of arbitrary length n-grams (Nagao 
and Mori, 94; Haruno et al., 96; Ikehara et al., 96; 
Shimohata et al., 1997; Russell, 1998). 
The literature abounds in measures that can help 
to decide whether words that co-occur are linguisti- 
cally significant or not. In this work, the strength of 
association of a sequence of words w\[ = wl,..., wn 
is computed by two measures: a likelihood-based one 
p(w'~) (where g is the likelihood ratio given in (Dun- 
ning, 93)) and an entropy-based one e(w'~) (Shimo- 
hata et al., 1997). Letting T stand for the training 
text and m a token: 
p(w~) = argming(w~, uS1 ) (3) ie\]l,n\[ 
e(w'~) = 0.5x +k 
~rnlw,~meT h ( Ireq(w'~ m) k Ir~q(wT) \] 
Intuitively, the first measurement accounts for the 
fact that parts of a sequence of words that should 
be considered as a whole should not appear often by 
themselves. The second one reflects the fact that a 
salient unit should appear in various contexts (i.e. 
should have a high entropy score). 
We implemented a cascade filtering strategy based 
on the likelihood score p, the frequency f, the length 
l and the entropy value e of the sequences. A 
first filter (.~"1 (lmin, fmin, Pmin, emin)) removes any 
sequence s for which l(s) < lmin or p(s) < Pmin 
or e(s) < e,nin or f(s) < fmin. A second filter 
(~'2) removes sequences that are included in pre- 
ferred ones. In terms of sequence reduction, apply- 
ing ~1 (2, 2, 5.0, 0.2) on the 81,974 English sequences 
of at least two tokens seen at least twice in our train- 
ing corpus, less than 50% of them (39,093) were fil- 
tered: 17,063 (21%) were removed because of their 
low entropy value, 25,818 (31%) because of their low 
likelihood value. 
3.2 Mapping 
Mapping the identified units (tokens or sequences) to 
their equivalents in the other  was achieved 
by training a new translation model (IBM 2) us- 
ing the EM algorithm as described in (Brown et al., 
1993). This required grouping the tokens in our 
training corpus into sequences, on the basis of the 
unit lexicons identified in the previous step (we will 
refer to the results of this grouping as the sequence- 
based corpus). To deal with overlapping possibilities, 
we used a dynamic programming scheme which opti- 
mized a criterion C given by equation 4 over a set S 
of all units collected for a given  plus all sin- 
gle words. G(w~) is obtained by returning the path 
that maximized B(n). We investigated several C- 
criteria and we found C~--a length-based measurc 
to be the most satisfactory. Table 2 shows an output 
of the grouping function. 
Oil i=o 
B(i) = argmax 
/~\[1,i\[ ,w~_les ) 
+ 
B(i - I - 1) 
(4) 
0 ifj<=i with: Cl(w~)= j--i + l else 
137 
source unit (s) 
we have 1748 
we must 720 
this bill 640 
people of canada 282 
mr. speaker : 269 
what is happening 190 
of course , 178 
is it the pleasure of the house to 14 
adopt the 
the world 
child care 
the free trade agreement 
post-secondary education 
the first time 
the canadian aviation safety board 
the next five years 
the people of china 
f(8) target units (\[a,p\]) 
\[nous,0.49\] \[avons,0.41\] \[, nous avons,0.07\] 
\[nous devons,0.61\] \[il rant,0.19\] \[nous,0.14\] 
\[ce projet de 1oi,0.35\] \[projet de loi .,0.21\] \[projet de loi,0.18\] 
\[les canadiens,0.26\] \[des canadiens,0.21\] \[la population,0.07\] 
\[m. le prdsident :,0.80\] \[a,0.07\] \[h la,0.06\] 
Ice qui se passe,0.21\] Ice qui se,0.16\] \[et,0.15\] 
\[dvidemment ,0.26\] \[naturellement,0.08\] \[bien stir,0.08\] 
\[plait-il h la chambre d' adopter,0.49\] \[la motion ?,0.42\] \[motion 
?,0.04\] 
201 \[le monde,O.46\] \[du monde,O.33\] lie monde entier,O.19\] 
86 lies garderies,O.59\] \[la garde d' enfants,O.23\] \[des services de 
garde d' enfants,O.13\] 
75 \[1' accord de libre-dchange,O.96\] \[la ddcision du gatt,O.04\] 
66 \[1' euseignement postsecondaize,O.75\] \[1' dducation postsec- 
ondaire,O.15\] \[des fonds,O.06\] 
62 \[la premiere fois,l.00\] 
36 lie bureau canadien de la s~urit~ adrienne,O.55\] \[du bureau cana- 
dien de la sdcurit~ adrienne,O.31\] \[1' un,O.14\] 
26 \[au cours des cinq prochaines ann~es,O.53\] \[cinq prochaines an- 
ndes,O.27\] \[25 milliards de d ollars,O.lO\] 
17 \[le peuple chinois,0.38\] \[la population chinoise,0.25\] \[les chi- 
nois,O.13\] 
Table 3: Bilingual associations. The first column indicates a source unit, the second one its frequency in the 
training corpus. The third column reports its 3-best ranked target associations (a being a token or a unit, 
p being the translation probability). The second half of the table reports NP-associations obtained after the 
filter described in the text. 
We investigated three ways of estimating the pa- 
rameters of the unit model. In the first one, El, 
the translation parameters are estimated by apply- 
ing the EM algorithm in a straightforward fashion 
over all entities (tokens and units) present at least 
twice in the sequence-based corpus 2. The two next 
methods filter the probabilities obtained with the Ez 
method. In E2, all probabilities p(tls ) are set to 0 
whenever s is a token (not a unit), thus forcing the 
model to contain only associations between source 
units and target entities (tokens or units). In E3 
any parameter of the model that involves a token 
is removed (that is, p(tls ) = 0 if t or s is a token). 
The resulting model will thus contain only unit as- 
sociations. In both cases, the final probabilities are 
renormalized. Table 3 shows a few entries from a 
unit model (Mu) obtained after 15 iterations of the 
EM-algorithm on a sequence corpus resulting from 
the application of the length-grouping criterion (dr) 
over a lexicon of units whose likelihood score is above 
5.0. The probabilities have been obtained by appli- 
cation of the method E2. 
We found many partially correct associations 
Cover the years/au fils des, we have/nous, etc) that 
illustrate the weakness of decoupling the unit iden- 
tification from the mapping problem. In most cas- 
2The entities seen only once are mapped to a special "un- 
known" word 
es however, these associations have a lower proba- 
bility than the good ones. We also found few er- 
ratic associations (the first time/e'dtait, some hon. 
members/t, etc) due to distributional artifacts. It is 
also interesting to note that the good associations 
we found are not necessary compositional in nature 
(we must/il Iaut, people of canada/les canadiens, of 
eourse/6videmment, etc). 
3.3 Filtering 
One way to increase the precision of the mapping 
process is to impose some linguistic constraints on 
the sequences such as simple noun-phrase contraints 
(Ganssier, 1995; Kupiec, 1993; hua Chen and Chen, 
94; Fung, 1995; Evans and Zhai, 1996). It is also 
possible to focus on non-compositional compounds, 
a key point in bilingual applications (Su et al., 1994; 
Melamed, 1997; Lin, 99). Another interesting ap- 
proach is to restrict sequences to those that do not 
cross constituent boundary patterns (Wu, 1995; Fu- 
ruse and Iida, 96). In this study, we filtered for po- 
tential sequences that are likely to be noun phrases, 
using simple regular expressions over the associated 
part-of-speech tags. An excerpt of the association 
probabilities of a unit model trained considering on- 
ly the NP-sequences is given in table 3. Applying 
this filter (referred to as JrNp in the following) to the 
39,093 english sequences still surviving after previ- 
ous filters ~'1 and ~'2 removes 35,939 of them (92%). 
138 
model spared ok good nu u 
1 baseline - model 1 48.98 0 0 747 0 
2 baseline - model 2 51.83 0 0 747 0 
3 E1 + ~'1(2, 2, 0, 0.2) 50.98 527 1702 5 626 
4 E1+~'1(2,2,5,0.2) 51.61 596 2149 5 658 
5 E1 + ~-~ (2, 2, 5, 0.2) + 9r2 51.72 633 2265 5 657 
6 E2 + ~'~(2,2,0,0.2) 51.39 514 1551 43 578 
7 £2 + ~-~ (2, 2, 5, 0.2) 51.99 470 1889 46 614 
8 E2 + ~'~(2,2,5,0.2) + ~'2 52.12 493 1951 46 606 
9 E3 + ~-1(2, 2, 0, 0.2) 51.07 577 1699 43 588 
10 E2 + ~-1(2, 2, 5, 0.2) 51.47 629 2124 46 618 
11 E2+~'~(2,2,5,0.2)+~'2 51.68 665 2209 46 615 
12 ~1 -}- .~1 (2, 2, 5, 0.2) -}- .~2 -}- ~:NP 52.83 416 1302 4 564 
13 E2 + ~'1(2, 2, 5, 0.2) + ~NP 53.12 439 1031 228 425 
14 £2 + ~'~ (2, 2, 5, 0.2) + 5r2 + ~'NP 53.16 458 1052 199 439 
15 ~3 -{- ~ : 0.4 -}- ~-1(2, 2, 5, 0.2) 4- .~NP 53.22 495 1031 228 425 
Table 4: Completion results of several translation models, spared: theoretical proportion of characters 
saved; ok: number of target units accepted by the user; good: number of target units that matched the 
expected whether they were proposed or not; nu: number of sentences for which no target unit was found 
by the translation model; u: number of sentences for which at least one helpful unit has been found by the 
model, but not necessarily proposed. 
More than half of the 3,154 remaining NP-sequences 
contain only two words. 
4 Results 
We collected completion results on a test corpus 
of 747 sentences (13,386 english tokens and 14,506 
french ones) taken from the Hansard corpus. These 
sentences have been selected randomly among sen- 
tences that have not been used for the training. 
Around 18% of the source and target words are not 
known by the translation model. 
The baseline models (line 1 and 2) are obtained 
without any unit model (i.e. /~ = 1 in equation 2). 
The first one is obtained with an IBM-like model 1 
while the second is an IBM-like model 2. We observe 
that for the pair of s we considered, model 
2 improves the amount of saved keystrokes of almost 
3% compared to model 1. Therefore we made use of 
alignment probabilities for the other models. 
The three next blocks in table 4 show how the 
parameter estimation method affects performance. 
Training models under the C1 method gives the worst 
results. This results from the fact that the word- 
to-word probabilities trained on the sequence based 
corpus (predicted by Mu in equation 2) are less ac- 
curate than the ones learned from the token based 
corpus. The reason is simply that there are less oc- 
currences of each token, especially if many units are 
identified by the grouping operator. 
In methods C2 and C3, the unit model of equation 
2 only makes predictions pu(tls ) when s is a source u- 
nit, thus lowering the noise compared to method £1. 
We also observe in these three blocks the influence 
of sequence filtering: the more we filter, the better 
the results. This holds true for all estimation meth- 
ods tried. In the fifth block of table 4 we observe 
the positive influence of the NP-filtering, especially 
when using the third estimation method. 
The best combination we found is reported in line 
15. It outperforms the baseline by around 1.5%. 
This model has been obtained by retaining all se- 
quences seen at least two times in the training cor- 
pus for which the likelihood test value was above 5 
and the entropy score above 0.2 (5rl (2, 2, 5, 0.2)). In 
terms of the coverage of this unit model, it is in- 
teresting to note that among the 747 sentences of 
the test session, there were 228 for which the model 
did not propose any units at all. For 425 of the re- 
maining sentences, the model proposed at least one 
helpful (good or partially good) unit. The active vo- 
cabulary for these sentences contained an average of 
around 2.5 good units per sentence, of which only 
half (495) were proposed during the session. The 
fact that this model outperforms others despite it- 
s relatively poor coverage (compared to the others) 
may be explained by the fact that it also removes 
part of the noise introduced by decoupling the i- 
dentification of the salient units from the training 
procedure. Furthermore, as we mentionned earlier, 
the more we filter, the less the grouping scheeme 
presented in equation 4 remains necessary, thus re- 
ducing a possible source of noise. 
The fact that this model outperforms others, de- 
spite its relatively poor coverage, is due to the fact 
139 
Eichler C)ptlons 
l am pleased to t~lce ]part in this debate tod W . 
Using rod W "s technologies, it is possible for all C~m~dia~s to 
register their votes on isstles of public spending and public 
I)orro~ving. 
II me falt plalelr de prendre la parole auJourd'hui dana le cadre de ¢e 
d~bat. 
Gr~ice & la technologle moderne, toue lea Canadlen= peuvent 6e 
prononcer sur le=; question= de d6pen=e== et d" ernprunta de I" I~tat . 
Notre p 
Figure 1: Example of an interaction in TRANSTYPE with the source text in the top half of the screen. The 
target text is typed in the bottom half with suggestions given by the menu at the insertion point. 
that it also removes part of the noise that is intro- 
duced by dissociating the identification of the salient 
units from the training procedure. ~rthermore, as 
we mentioned earlier, the more we filter, the less the 
grouping scheme presented in equation 4 remains 
necessary, thus further reducing an other possible 
source of noise. 
5 Conclusion 
We have described a prototype system called 
TRANSTYPE which embodies an innovative ap- 
proach to interactive machine translation in which 
the interaction is directly concerned with establish- 
ing the target text. We proposed and tested a mech- 
anism to enhance TRANSTYPE by having it predic- 
t sequences of words rather than just completions 
for the current word. The results show a modest 
improvement in prediction performance which will 
serve as a baseline for our future investigations. One 
obvious direction for future research is to revise our 
current strategy of decoupling the selection of units 
from their bilingual context. 
Acknowlegments 
TRANSTYPE is a project funded by the Natural Sci- 
ences and Engineering Research Council of Canada. 
We are undebted to Elliott Macklovitch and Pierre 
Isabelle for the fruitful orientations they gave to this 
work. 

References 

Adam L. Berger, Stephen A. Della Pietra, and Vincent J. Della Pietra. 1996. A maximum entropy approach to natural  processing. Computational Linguistics, 22(1):39-71. 

Peter F. Brown, Stephen A. Della Pietra, Vincent Della J. Pietra, and Robert L. Mercer. 1993. The mathematics of machine trmaslation: Parameter estimation. Computational Linguistics, 19(2):263-312, June. 

Ted Dunning. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61-74. 

David A. Evans and Chengxiang Zhai. 1996. Noun-phrase analysis in unrestricted text for information retrieval. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 17-24, Santa Cruz, California. 

George Foster, Pierre Isabelle, and Pierre Plamondon. 1997. Target-text Mediated Interactive Machine Translation. Machine Translation, 12:175-194. 

Pascale Fung. 1995. A pattern matching method for finding noun and proper noun translations from noisy parallel corpora. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 236-243, Cambridge, Massachusetts. 

Osamu Furuse and Hitoshi Iida. 1996. Incremental translation utilizing constituent boundray patterns. In Proceedings of the 16th International Conference On Computational Linguistics, pages 412-417, Copenhagen, Denmark. 

Eric Gaussier. 1995. Modles statistiques et patrons morphosyntaxiques pour l'extraction de lcxiques bilingues. Ph.D. thesis, Universit de Paris 7, janvier. 

Masahiko Haruno, Satoru Ikehara, and Takefumi Yamazaki. 1996. Learning bilingual collocations by word-level sorting. In Proceedings of the 16th International Conference On Computational Linguistics, pages 525-530, Copenhagen, Denmark. 

Kuang hua Chen and Hsin-Hsi Chen. 1994. Extracting noun phrases from large-scale texts: A hybrid approach and its automatic evaluation. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 234-241, Las Cruces, New Mexico. 

Satoru Ikehara, Satoshi Shirai, and Hajine Uchino. 1996. A statistical method for extracting uninterrupted and interrupted collocations from very large corpora. In Proceedings of the 16th International Conference On Computational Linguistics, pages 574-579, Copenhagen, Denmark. 

Julian Kupiec. 1993. An algorithm for finding noun phrase correspondences in bilingual corpora. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pages 17-22, Colombus, Ohio. 

Dekang Lin. 1999. Automatic identification of non-compositional phrases. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 317-324, College Park, Maryland. 

I. Dan Melamed. 1997. Automatic discovery of non-compositional coumpounds in parallel data. In Proceedings of the 2nd Conference on Empirical Methods in Natural Language Processing, pages 97-108, Providence, RI, August, lst-2nd. 

Makoto Nagao and Shinsuke Mori. 1994. A new method of n-gram statistics for large number of n and automatic extraction of words and phrases from large text data of japanese. In Proceedings of the 16th International Conference On Computational Linguistics, volume 1, pages 611-615, Copenhagen, Denmark. 

Franz Josef Och and Hans Weber. 1998. Improving statistical natural  translation with categories and rules. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics, pages 985-989, Montreal, Canada. 

Graham Russell. 1998. Identification of salient token sequences. Internal report, RALI, University of Montreal, Canada. 

Sayori Shimohata, Toshiyuki Sugio, and Junji Nagata. 1997. Retrieving collocations by cooccurrences and word order constraints. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, pages 476-481, Madrid Spain. 

Keh-Yih Su, Ming-Wen Wu, and Jing-Shin Chang. 1994. A corpus-based approach to automatic compound extraction. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 242-247, Las Cruces, New Mexico. 

Ye-Yi Wang and Alex Waibel. 1998. Modeling with structures in statistical machine translation. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics, volume 2, pages 1357-1363, Montreal, Canada. 

Dekai Wu and Hongsing Wong. 98. Machine translation with a stochastic grammatical channel. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics, pages 1408-1414, Montreal, Canada. 

Dekai Wu. 1995. Stochastic inversion transduction grammars, with application to segmentation, bracketing, and alignment of parallel corpora. In Proceedings of the International Joint Conference on Artificial Intelligence, volume 2, pages 1328-1335, Montreal, Canada. 
