Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language
Processing (HLT/EMNLP), pages 193–200, Vancouver, October 2005. c©2005 Association for Computational Linguistics
Predicting Sentences using N-Gram Language Models
Steffen Bickel, Peter Haider, and Tobias Scheffer
Humboldt-Universit¨at zu Berlin
Department of Computer Science
Unter den Linden 6, 10099 Berlin, Germany
{bickel, haider, scheffer}@informatik.hu-berlin.de
Abstract
We explore the benefit that users in sev-
eral application areas can experience from
a “tab-complete” editing assistance func-
tion. We develop an evaluation metric
and adapt N-gram language models to
the problem of predicting the subsequent
words, given an initial text fragment. Us-
ing an instance-based method as base-
line, we empirically study the predictabil-
ity of call-center emails, personal emails,
weather reports, and cooking recipes.
1 Introduction
Prediction of user behavior is a basis for the con-
struction of assistance systems; it has therefore been
investigated in diverse application areas. Previous
studies have shed light on the predictability of the
next unix command that a user will enter (Motoda
and Yoshida, 1997; Davison and Hirsch, 1998), the
next keystrokes on a small input device such as a
PDA (Darragh and Witten, 1992), and of the trans-
lation that a human translator will choose for a given
foreign sentence (Nepveu et al., 2004).
We address the problem of predicting the subse-
quent words, given an initial fragment of text. This
problem is motivated by the perspective of assis-
tance systems for repetitive tasks such as answer-
ing emails in call centers or letters in an adminis-
trative environment. Both instance-based learning
and N-gram models can conjecture completions of
sentences. The use of N-gram models requires the
application of the Viterbi principle to this particular
decoding problem.
Quantifying the benefit of editing assistance to a
user is challenging because it depends not only on
an observed distribution over documents, but also
on the reading and writing speed, personal prefer-
ence, and training status of the user. We develop
an evaluation metric and protocol that is practical,
intuitive, and independent of the user-specific trade-
off between keystroke savings and time lost due to
distractions. We experiment on corpora of service-
center emails, personal emails of an Enron execu-
tive, weather reports, and cooking recipes.
The rest of this paper is organized as follows.
We review related work in Section 2. In Section 3,
we discuss the problem setting and derive appropri-
ate performance metrics. We develop the N-gram-
based completion method in Section 4. In Section 5,
we discuss empirical results. Section 6 concludes.
2 Related Work
Shannon (1951) analyzed the predictability of se-
quences of letters. He found that written English
has a high degree of redundancy. Based on this find-
ing, it is natural to ask whether users can be sup-
ported in the process of writing text by systems that
predict the intended next keystrokes, words, or sen-
tences. Darragh and Witten (1992) have developed
an interactive keyboard that uses the sequence of
past keystrokes to predict the most likely succeed-
ing keystrokes. Clearly, in an unconstrained applica-
tion context, keystrokes can only be predicted with
limited accuracy. In the specific context of entering
URLs, completion predictions are commonly pro-
193
vided by web browsers (Debevc et al., 1997).
Motoda and Yoshida (1997) and Davison and
Hirsch (1998) developed a Unix shell which pre-
dicts the command stubs that a user is most likely
to enter, given the current history of entered com-
mands. Korvemaker and Greiner (2000) have de-
veloped this idea into a system which predicts en-
tire command lines. The Unix command predic-
tion problem has also been addressed by Jacobs and
Blockeel (2001) who infer macros from frequent
command sequences and predict the next command
using variable memory Markov models (Jacobs and
Blockeel, 2003).
In the context of natural language, several typ-
ing assistance tools for apraxic (Garay-Vitoria and
Abascal, 2004; Zagler and Beck, 2002) and dyslexic
(Magnuson and Hunnicutt, 2002) persons have been
developed. These tools provide the user with a list of
possible word completions to select from. For these
users, scanning and selecting from lists of proposed
words is usually more efficient than typing. By con-
trast, scanning and selecting from many displayed
options can slow down skilled writers (Langlais et
al., 2002; Magnuson and Hunnicutt, 2002).
Assistance tools have furthermore been developed
for translators. Computer aided translation systems
combine a translation and a language model in order
to provide a (human) translator with a list of sug-
gestions (Langlais et al., 2000; Langlais et al., 2004;
Nepveu et al., 2004). Foster et al. (2002) introduce
a model that adapts to a user’s typing speed in or-
der to achieve a better trade-off between distractions
and keystroke savings. Grabski and Scheffer (2004)
have previously developed an indexing method that
efficiently retrieves the sentence from a collection
that is most similar to a given initial fragment.
3 Problem Setting and Evaluation
Given an initial text fragment, a predictor that solves
the sentence completion problem has to conjecture
as much of the sentence that the user currently in-
tends to write, as is possible with high confidence—
preferably, but not necessarily, the entire remainder.
The perceived benefit of an assistance system is
highly subjective, because it depends on the expen-
diture of time for scanning and deciding on sug-
gestions, and on the time saved due to helpful as-
sistance. The user-specific benefit is influenced by
quantitative factors that we can measure. We con-
struct a system of two conflicting performance indi-
cators: our definition of precision quantifies the in-
verse risk of unnecessary distractions, our definition
of recall quantifies the rate of keystroke savings.
For a given sentence fragment, a completion
method may – but need not – cast a completion con-
jecture. Whether the method suggests a completion,
and how many words are suggested, will typically
be controlled by a confidence threshold. We con-
sider the entire conjecture to be falsely positive if at
least one word is wrong. This harsh view reflects
previous results which indicate that selecting, and
then editing, a suggested sentence often takes longer
than writing that sentence from scratch (Langlais et
al., 2000). In a conjecture that is entirely accepted
by the user, the entire string is a true positive. A
conjecture may contain only a part of the remaining
sentence and therefore the recall, which refers to the
length of the missing part of the current sentence,
may be smaller than 1.
For a given test collection, precision and recall
are defined in Equations 1 and 2. Recall equals
the fraction of saved keystrokes (disregarding the
interface-dependent single keystroke that is most
likely required to accept a suggestion); precision is
the ratio of characters that the users have to scan
for each character they accept. Varying the confi-
dence threshold of a sentence completion method re-
sults in a precision recall curve that characterizes the
system-specific trade-off between keystroke savings
and unnecessary distractions.
Precision =
summationtext
accepted completions string lengthsummationtext
suggested completions string length
(1)
Recall =
summationtext
accepted completions string lengthsummationtext
all queries length of missing part
(2)
4 Algorithms for Sentence Completion
In this section, we derive our solution to the sen-
tence completion problem based on linear interpola-
tion of N-gram models. We derive a k best Viterbi
decoding algorithm with a confidence-based stop-
ping criterion which conjectures the words that most
likely succeed an initial fragment. Additionally, we
194
briefly discuss an instance-based method that pro-
vides an alternative approach and baseline for our
experiments.
In order to solve the sentence completion problem
with an N-gram model, we need to find the most
likely word sequence wt+1,...,wt+T given a word
N-gram model and an initial sequence w1,...,wt
(Equation 3). Equation 4 factorizes the joint proba-
bility of the missing words; the N-th order Markov
assumption that underlies the N-gram model simpli-
fies this expression in Equation 5.
argmax
wt+1,...,wt+T
P(wt+1,...,wt+T|w1,...,wt) (3)
= argmax
wt+1,...,wt+T
Tproductdisplay
j=1
P(wt+j|w1,...,wt+j−1) (4)
= argmax
Tproductdisplay
j=1
P(wt+j|wt+j−N+1,...,wt+j−1) (5)
The individual factors of Equation 5 are provided by
the model. The Markov order N has to balance suffi-
cient context information and sparsity of the training
data. A standard solution is to use a weighted linear
mixture of N-gram models, 1 ≤ n ≤ N, (Brown et
al., 1992). We use an EM algorithm to select mixing
weights that maximize the generation probability of
a tuning set of sentences that have not been used for
training.
We are left with the following questions: (a)
how can we decode the most likely completion effi-
ciently; and (b) how many words should we predict?
4.1 Efficient Prediction
We have to address the problem of finding the
most likely completion, argmaxwt+1,...,wt+T
P(wt+1,...,wt+T|w1,...,wt) efficiently, even
though the size of the search space grows exponen-
tially in the number of predicted words.
We will now identify the recursive structure in
Equation 3; this will lead us to a Viterbi al-
gorithm that retrieves the most likely word se-
quence. We first define an auxiliary variable
δt,s(wprime1,...,wprimeN|wt−N+2,...,wt) in Equation 6; it
quantifies the greatest possible probability over all
arbitrary word sequences wt+1,...,wt+s, followed
by the word sequence wt+s+1 = wprime1,...,wt+s+N =
wprimeN, conditioned on the initial word sequence
wt−N+2,...,wt.
In Equation 7, we factorize the last transition and
utilize the N-th order Markov assumption. In Equa-
tion 8, we split the maximization and introduce a
new random variable wprime0 for wt+s. We can now refer
to the definition of δ and see the recursion in Equa-
tion 9: δt,s depends only on δt,s−1 and the N-gram
model probability P(wprimeN|wprime1,...,wprimeN−1).
δt,s(wprime1,...,wprimeN|wt−N+2,...,wt) (6)
= max
wt+1,...,wt+s
P(wt+1,...,wt+s,wt+s+1 = wprime1,
...,wt+s+N = wprimeN|wt−N+2,...,wt)
= max
wt+1,...,wt+s
P(wprimeN|wprime1,...,wprimeN−1) (7)
P(wt+1,...,wt+s,wt+s+1 = wprime1,
...,wt+s+N−1 = wprimeN−1|wt−N+2,...,wt)
= max
wprime0
max
wt+1,...,wt+s−1
P(wprimeN|wprime1,...,wprimeN−1) (8)
P(wt+1,...,wt+s−1,wt+s = wprime0,
...,wt+s+N−1 = wprimeN−1|wt−N+2,...,wt)
= max
wprime0
P(wprimeN|wprime1,...,wprimeN−1)
δt,s−1(wprime0,...,wprimeN−1|wt+N−2,...,wt)
(9)
Exploiting the N-th order Markov assumption,
we can now express our target probability (Equation
3) in terms of δ in Equation 10.
max
wt+1,...,wt+T
P(wt+1,...,wt+T|wt−N+2,...,wt) (10)
= max
wprime1,...,wprimeN
δt,T−N(wprime1,...,wprimeN|wt−N+2,...,wt)
The last N words in the most likely sequence
are simply the argmaxwprime1,...,wprime
N
δt,T−N(wprime1,...,wprimeN|
wt−N+2,...,wt). In order to collect the preceding
most likely words, we define an auxiliary variable Ψ
in Equation 11 that can be determined in Equation
12. We have now found a Viterbi algorithm that is
linear in T, the completion length.
Ψt,s(wprime1,...,wprimeN|wt−N+2,...,wt) (11)
= argmax
wt+s
max
wt+1,...,wt+s−1
P(wt+1,...,wt+s,wt+s+1 = wprime1,...,
wt+s+N = wprimeN|wt−N+2,...,wt)
= argmax
wprime0
δt,s−1(wprime0,...,wprimeN−1|wt−N+2,...,wt)
P(wprimeN|wprime1,...,wprimeN−1) (12)
The Viterbi algorithm starts with the most recently
entered word wt and moves iteratively into the fu-
ture. When the N-th token in the highest scored δ is
a period, then we can stop as our goal is only to pre-
dict (parts of) the current sentence. However, since
195
there is no guarantee that a period will eventually
become the most likely token, we use an absolute
confidence threshold as additional criterion: when
the highest δ score is below a threshold θ, we stop
the Viterbi search and fix T.
In each step, Viterbi stores and updates
|vocabulary size|N many δ values—unfeasibly
many except for very small N. Therefore, in Table
1 we develop a Viterbi beam search algorithm
which is linear in T and in the beam width. Beam
search cannot be guaranteed to always find the
most likely word sequence: When the globally
most likely sequence w∗t+1,...,w∗t+T has an initial
subsequence w∗t+1,...,w∗t+s which is not among
the k most likely sequences of length s, then that
optimal sequence is not found.
Table 1: Sentence completion with Viterbi beam
search algorithm.
Input: N-gram language model, initial sentence fragment
w1,...,wt, beam width k, confidence threshold θ.
1. Viterbi initialization:
Let δt,−N(wt−N+1,...,wt|wt−N+1,...,wt) = 1;
let s = −N +1;
beam(s − 1) = {δt,−N(wt−N+1,...,wt|wt−N+1,
...,wt)}.
2. Do Viterbi recursion until break:
(a) For all δt,s−1(wprime0,...,wprimeN−1|...) in
beam(s − 1), for all wN in vocabulary, store
δt,s(wprime1,...,wprimeN|...) (Equation 9) in beam(s)
and calculate Ψt,s(wprime1,...,wprimeN|...) (Equation
12).
(b) If argmaxwN maxwprime1,...,wprime
N−1
δt,s(wprime1,...,wprimeN|...) = period then break.
(c) If maxδt,s(wprime1,...,wprimeN|wt−N+1,...,wt) < θ
then decrement s; break.
(d) Prune all but the best k elements in beam(s).
(e) Increment s.
3. Let T = s+N. Collect words by path backtracking:
(w∗t+T−N+1,...,w∗t+T)
= argmaxδt,T−N(wprime1,...,wprimeN|...).
For s = T −N ...1:
w∗t+s = Ψt,s(w∗t+s+1,...,w∗t+s+N|
wt−N+1,...,wt).
Return w∗t+1,...,w∗t+T .
4.2 Instance-based Sentence Completion
An alternative approach to sentence completion
based on N-gram models is to retrieve, from the
training collection, the sentence that starts most sim-
ilarly, and use its remainder as a completion hypoth-
esis. The cosine similarity of the TFIDF representa-
tion of the initial fragment to be completed, and an
equally long fragment of each sentence in the train-
ing collection gives both a selection criterion for the
nearest neighbor and a confidence measure that can
be compared against a threshold in order to achieve
a desired precision recall balance.
A straightforward implementation of this near-
est neighbor approach becomes infeasible when the
training collection is large because too many train-
ing sentences have to be processed. Grabski and
Scheffer (2004) have developed an indexing struc-
ture that retrieves the most similar (using cosine sim-
ilarity) sentence fragment in sub-linear time. We use
their implementation of the instance-based method
in our experimentation.
5 Empirical Studies
we investigate the following questions. (a) How
does sentence completion with N-gram models
compare to the instance-based method, both in terms
of precision/recall and computing time? (b) How
well can N-gram models complete sentences from
collections with diverse properties?
Table 2 gives an overview of the four document
collections that we use for experimentation. The
first collection has been provided by a large online
store and contains emails sent by the service center
in reply to customer requests (Grabski and Scheffer,
2004). The second collection is an excerpt of the
recently disclosed email correspondence of Enron’s
management staff (Klimt and Yang, 2004). We use
3189 personal emails sent by Enron executive Jeff
Dasovich; he is the individual who sent the largest
number of messages within the recording period.
The third collection contains textual daily weather
reports for five years from a weather report provider
on the Internet. Each report comprises about 20
sentences. The last collection contains about 4000
cooking recipes; this corpus serves as an example of
a set of thematically related documents that might be
found on a personal computer.
We reserve 1000 sentences of each data set for
testing. As described in Section 4, we split the
remaining sentences in training (75%) and tuning
196
Table 2: Evaluation data collections.
Name Language #Sentences Entropy
service center German 7094 1.41
Enron emails English 16363 7.17
weather reports German 30053 4.67
cooking recipes German 76377 4.14
(25%) sets. We mix N-gram models up to an order
of five and estimate the interpolation weights (Sec-
tion 4). The resulting weights are displayed in Fig-
ure 1. In Table 2, we also display the entropy of the
collections based on the interpolated 5-gram model.
This corresponds to the average number of bits that
are needed to code each word given the preceding
four words. This is a measure of the intrinsic redun-
dancy of the collection and thus of the predictability.
11
11
22
22
3 3
33 44
55
5
4
0% 20%40%60%80%100%
cooking recipes
weather reports
Enron emails
service center
Figure 1: N-gram interpolation weights.
Our evaluation protocol is as follows. The beam
width parameter k is set to 20. We randomly draw
1000 sentences and, within each sentence, a posi-
tion at which we split it into initial fragment and
remainder to be predicted. A human evaluator is
presented both, the actual sentence from the collec-
tion and the initial fragment plus current comple-
tion conjecture. For each initial fragment, we first
cast the most likely single word prediction and ask
the human evaluator to judge whether they would
accept this prediction (without any changes), given
that they intend to write the actual sentence. We in-
crease the length of the prediction string by one ad-
ditional word and recur, until we reach a period or
exceed the prediction length of 20 words.
For each judged prediction length, we record the
confidence measure that would lead to that predic-
tion. With this information we can determine the
results for all possible threshold values of θ. To save
evaluation time, we consider all predictions that are
identical to the actual sentence as correct and skip
those predictions in the manual evaluation.
We will now study how the N-gram method com-
pares to the instance-based method. Figure 2 com-
pares the precision recall curves of the two meth-
ods. Note that the maximum possible recall is typi-
cally much smaller than 1: recall is a measure of the
keystroke savings, a value of 1 indicates that the user
saves all keystrokes. Even for a confidence thresh-
old of 0, a recall of 1 is usually not achievable.
Some of the precision recall curves have a con-
cave shape. Decreasing the threshold value in-
creases the number of predicted words, but it also
increases the risk of at least one word being wrong.
In this case, the entire sentence counts as an incor-
rect prediction, causing a decrease in both, precision
and recall. Therefore – unlike in the standard in-
formation retrieval setting – recall does not increase
monotonically when the threshold is reduced.
For three out of four data collections, the instance-
based learning method achieves the highest max-
imum recall (whenever this method casts a con-
jecture, the entire remainder of the sentence is
predicted—at a low precision), but for nearly all
recall levels the N-gram model achieves a much
higher precision. For practical applications, a high
precision is needed in order to avoid distracting,
wrong predictions. Varying the threshold, the N-
gram model can be tuned to a wide range of different
precision recall trade-offs (in three cases, precision
can even reach 1), whereas the confidence threshold
of the instance-based method has little influence on
precision and recall.
We determine the standard error of the precision
for the point of maximum F1-measure. For all data
collections and both methods the standard error is
below 0.016. Correct and incorrect prediction ex-
amples are provided in Table 3 for the service center
data set, translated from German into English. The
confidence threshold is adjusted to the value of max-
imum F1-measure. In two of these cases, the predic-
tion nicely stops at fairly specific terms.
How do precision and recall depend on the string
length of the initial fragment and the string length
of the completion cast by the systems? Figure 3
shows the relationship between the length of the ini-
tial fragment and precision and recall. The perfor-
mance of the instance-based method depends cru-
cially on a long initial fragment. By contrast, when
197
 0.5
 0.6
 0.7
 0.8
 0.9
 1
 0  0.2  0.4  0.6
Precision
Recall
service center
N-graminstance-based
 0
 0.2
 0.4
 0.6
 0.8
 0  0.01 0.02 0.03 0.04 0.05
Precision
Recall
Enron emails
N-graminstance-based
 0
 0.2
 0.4
 0.6
 0.8
 1
 0  0.02  0.04  0.06
Precision
Recall
weather reports
N-graminstance-based
 0
 0.2
 0.4
 0.6
 0.8
 1
 0  0.05  0.1  0.15
Precision
Recall
cooking recipes
N-graminstance-based
Figure 2: Precision recall curves for N-gram and instance-based methods of sentence completion.
Table 3: Prediction examples for service center data.
Initial fragment (bold face) and intended, missing part Prediction
Please complete your address. your address.
Kindly excuse the incomplete shipment. excuse the
Our supplier notified us that the pants are undeliverable. notified us that the
The mentioned order is not in our system. not in our system.
We recommend that you write down your login name and password. that you write down your login name and password.
The value will be accounted for in your invoice. be accounted for in your invoice.
Please excuse the delay. delay.
Please excuse our mistake. the delay.
If this is not the case give us a short notice. us your address and customer id.
the fragment length exceeds four with the N-gram
model, then this length and the accuracy are nearly
independent; the model considers no more than the
last four words in the fragment.
Figure 4 details the relation between string length
of the prediction and precision/recall. We see that
we can reach a constantly high precision over the en-
tire range of prediction lengths for the service center
data with the N-gram model. For the other collec-
tions, the maximum prediction length is 3 or 5 words
in comparison to much longer predictions cast by the
nearest neighbor method. But in these cases, longer
predictions result in lower precision.
How do instance-based learning and N-gram
completion compare in terms of computation time?
The Viterbi beam search decoder is linear in the pre-
diction length. The index-based retrieval algorithm
is constant in the prediction length (except for the fi-
nal step of displaying the string which is linear but
can be neglected). This is reflected in Figure 5 (left)
which also shows that the absolute decoding time
of both methods is on the order of few milliseconds
on a PC. Figure 5 (right) shows how prediction time
grows with the training set size.
We experiment on four text collections with di-
verse properties. The N-gram model performs re-
markably on the service center email collection.
Users can save 60% of their keystrokes with 85%
of all suggestions being accepted by the users, or
save 40% keystrokes at a precision of over 95%. For
cooking recipes, users can save 8% keystrokes at
60% precision or 5% at 80% precision. For weather
reports, keystroke savings are 2% at 70% correct
suggestions or 0.8% at 80%. Finally, Jeff Dasovich
of Enron can enjoy only a marginal benefit: below
1% of keystrokes are saved at 60% entirely accept-
able suggestions, or 0.2% at 80% precision.
How do these performance results correlate with
properties of the model and text collections? In Fig-
ure 1, we see that the mixture weights of the higher
order N-gram models are greatest for the service
center mails, smaller for the recipes, even smaller
for the weather reports and smallest for Enron. With
50% of the mixture weights allocated to the 1-gram
model, for the Enron collection the N-gram comple-
tion method can often only guess words with high
prior probability. From Table 2, we can further-
more see that the entropy of the text collection is
inversely proportional to the model’s ability to solve
the sentence completion problem. With an entropy
198
 0.4
 0.6
 0.8
 1
 0  2  4  6  8  10 12 14 16 18 20
Precision
Query length
service center
N-graminstance-based
 0
 0.2
 0.4
 0.6
 0.8
 1
 2  4  6  8  10
Precision
Query length
Enron emails
N-graminstance-based
 0
 0.2
 0.4
 0.6
 0  2  4  6  8  10 12 14 16 18 20
Precision
Query length
weather report
N-graminstance-based
 0
 0.2
 0.4
 0.6
 0.8
 1
 0  2  4  6  8  10 12 14 16 18 20
Precision
Query length
cooking recipes
N-graminstance-based
 0.3
 0.4
 0.5
 0.6
 0.7
 0.8
 0  2  4  6  8  10 12 14 16 18 20
Recall
Query length
service center
N-graminstance-based
 0
 0.05
 0.1
 2  4  6  8  10
Recall
Query length
Enron emails
N-graminstance-based
 0
 0.05
 0.1
 0.15
 0  2  4  6  8  10 12 14 16 18 20
Recall
Query length
weather report
N-graminstance-based
Figure 3: Precision and recall dependent on string length of initial fragment (words).
 0.4
 0.5
 0.6
 0.7
 0.8
 0.9
 0  2  4  6  8  10 12 14 16 18 20
Precision
Prediction length
service center
N-graminstance-based
 0
 0.2
 0.4
 0.6
 0.8
 2  4  6  8  10
Precision
Prediction length
Enron emails
N-graminstance-based
 0
 0.1
 0.2
 0.3
 0.4
 0  2  4  6  8  10 12 14 16 18 20
Precision
Prediction length
weather report
N-graminstance-based
 0
 0.2
 0.4
 0.6
 0.8
 1
 0  2  4  6  8  10 12 14 16 18 20
Precision
Prediction length
cooking recipes
N-graminstance-based
 0.5
 0.6
 0.7
 0.8
 0.9
 0  2  4  6  8  10 12 14 16 18 20
Recall
Prediction length
service center
N-graminstance-based
 0
 0.2
 0.4
 0.6
 2  4  6  8  10
Recall
Prediction length
Enron emails
N-graminstance-based
 0.05
 0.1
 0.15
 0  2  4  6  8  10 12 14 16 18 20
Recall
Prediction length
weather report
N-graminstance-based
 0
 0.2
 0.4
 0.6
 0.8
 1
 0  2  4  6  8  10 12 14 16 18 20
Recall
Prediction length
cooking recipes
N-graminstance-based
Figure 4: Precision and recall dependent on prediction string length (words).
of only 1.41, service center emails are excellently
predictable; by contrast, Jeff Dasovich’s personal
emails have an entropy of 7.17 and are almost as
unpredictable as Enron’s share price.
6 Conclusion
We discussed the problem of predicting how a user
will complete a sentence. We find precision (the
number of suggested characters that the user has to
read for every character that is accepted) and recall
(the rate of keystroke savings) to be appropriate per-
formance metrics. We developed a sentence com-
pletion method based on N-gram language models.
We derived a k best Viterbi beam search decoder.
Our experiments lead to the following conclusions:
(a) The N-gram based completion method has a
better precision recall profile than index-based re-
trieval of the most similar sentence. It can be tuned
to a wide range of trade-offs, a high precision can
be obtained. The execution time of the Viterbi beam
search decoder is in the order of few milliseconds.
(b) Whether sentence completion is helpful
strongly depends on the diversity of the document
collection as, for instance, measured by the entropy.
For service center emails, a keystroke saving of 60%
can be achieved at 85% acceptable suggestions; by
contrast, only a marginal keystroke saving of 0.2%
can be achieved for Jeff Dasovich’s personal emails
at 80% acceptable suggestions. A modest but signif-
icant benefit can be observed for thematically related
documents: weather reports and cooking recipes.
199
 0
 2
 4
 6
 8
 10
 1  2  3  4  5  6  7  8  9  10 11
Prediction time - ms
Prediction length
service center
n-graminstance-based
 10
 20
 30
 40
 50
 1  2  3
Prediction time - ms
Prediction length
weather reports
n-graminstance-based
 0
 0.5
 1
 1.5
 10 20 30 40 50 60 70 80 90 100
Prediction time - ms
Training set size in %
service center
n-graminstance-based
 0
 10
 20
 30
 40
 10  20 30  40 50  60 70  80 90 100
Prediction time - ms
Training set size in %
weather report
n-graminstance-based
Figure 5: Prediction time dependent on prediction length in words (left) and prediction time dependent on
training set size (right) for service center and weather report collections.
Acknowledgment
This work has been supported by the German Sci-
ence Foundation DFG under grant SCHE540/10.
References
P. Brown, S. Della Pietra, V. Della Pietra, J. Lai, and
R. Mercer. 1992. An estimate of an upper bound
for the entropy of english. Computational Linguistics,
18(2):31–40.
J. Darragh and I. Witten. 1992. The Reactive Keyboard.
Cambridge University Press.
B. Davison and H. Hirsch. 1998. Predicting sequences of
user actions. In Proceedings of the AAAI/ICML Work-
shop on Predicting the Future: AI Approaches to Time
Series Analysis.
M. Debevc, B. Meyer, and R. Svecko. 1997. An adap-
tive short list for documents on the world wide web. In
Proceedings of the International Conference on Intel-
ligent User Interfaces.
G. Foster, P. Langlais, and G. Lapalme. 2002. User-
friendly text prediction for translators. In Proceedings
of the Conference on Empirical Methods in Natural
Language Processing.
N. Garay-Vitoria and J. Abascal. 2004. A comparison of
prediction techniques to enhance the communication
of people with disabilities. In Proceedings of the 8th
ERCIM Workshop User Interfaces For All.
K. Grabski and T. Scheffer. 2004. Sentence completion.
In Proceedings of the ACM SIGIR Conference on In-
formation Retrieval.
N. Jacobs and H. Blockeel. 2001. The learning shell:
automated macro induction. In Proceedings of the In-
ternational Conference on User Modelling.
N. Jacobs and H. Blockeel. 2003. Sequence predic-
tion with mixed order Markov chains. In Proceedings
of the Belgian/Dutch Conference on Artificial Intelli-
gence.
B. Klimt and Y. Yang. 2004. The Enron corpus: A new
dataset for email classification research. In Proceed-
ings of the European Conference on Machine Learn-
ing.
B. Korvemaker and R. Greiner. 2000. Predicting Unix
command lines: adjusting to user patterns. In Pro-
ceedings of the National Conference on Artificial In-
telligence.
P. Langlais, G. Foster, and G. Lapalme. 2000. Unit com-
pletion for a computer-aided translation typing system.
Machine Translation, 15:267–294.
P. Langlais, M. Loranger, and G. Lapalme. 2002. Trans-
lators at work with transtype: Resource and evalua-
tion. In Proceedings of the International Conference
on Language Resources and Evaluation.
P. Langlais, G. Lapalme, and M. Loranger. 2004.
Transtype: Development-evaluation cycles to boost
translator’s productivity. Machine Translation (Spe-
cial Issue on Embedded Machine Translation Systems,
17(17):77–98.
T. Magnuson and S. Hunnicutt. 2002. Measuring the ef-
fectiveness of word prediction: The advantage of long-
term use. Technical Report TMH-QPSR Volume 43,
Speech, Music and Hearing, KTH, Stockholm, Swe-
den.
H. Motoda and K. Yoshida. 1997. Machine learning
techniques to make computers easier to use. In Pro-
ceedings of the Fifteenth International Joint Confer-
ence on Artificial Intelligence.
L. Nepveu, G. Lapalme, P. Langlais, and G. Foster. 2004.
Adaptive language and translation models for interac-
tive machine translation. In Proceedings of the Con-
ference on Empirical Methods in Natural Language
Processing.
C. Shannon. 1951. Prediction and entropy of printed
english. In Bell Systems Technical Journal, 30, 50-64.
W. Zagler and C. Beck. 2002. FASTY - faster typing
for disabled persons. In Proceedings of the European
Conference on Medical and Biological Engineering.
200
