Updating an NLP System to Fit New Domains: an empirical study on the
sentence segmentation problem
Tong Zhang, Fred Damerau, and David Johnson
IBM T.J. Watson Research Center
Yorktown Heights
New York, 10598, USA
tzhang@watson.ibm.com, damerau@watson.ibm.com, dejohns@us.ibm.com
Abstract
Statistical machine learning algorithms have
been successfully applied to many natural lan-
guage processing (NLP) problems. Compared
to manually constructed systems, statistical
NLP systems are often easier to develop and
maintain since only annotated training text is
required. From annotated data, the underlying
statistical algorithm can build a model so that
annotations for future data can be predicted.
However, the performance of a statistical sys-
tem can also depend heavily on the character-
istics of the training data. If we apply such
a system to text with characteristics different
from those of the training data, then performance
degradation will occur. In this paper, we ex-
amine this issue empirically using the sentence
boundary detection problem. We propose and
compare several methods that can be used to
update a statistical NLP system when moving
to a different domain.
1 Introduction
An important issue for a statistical machine learning
based NLP system is that its performance can depend
heavily on the characteristics of the training data used to
build the system. Consequently if we train a system on
some data but apply it to other data with different charac-
teristics, then the system’s performance can degrade sig-
nificantly. It is therefore natural to investigate the follow-
ing related issues:
• How to detect a change in the underlying data
  characteristics, and how to estimate the corresponding
  degradation in system performance.
• If performance degradation is detected, how to update
  a statistical system to improve its performance with as
  little human effort as possible.
This paper investigates some methodological and prac-
tical aspects of the above issues. Although ideally such
a study would include as many different statistical algo-
rithms as possible, and as many different linguistic prob-
lems as possible (so that a very general conclusion might
be drawn), in reality such an undertaking is not only diffi-
cult to carry out, but also can hide essential observations
and obscure important effects that may depend on many
variables. An alternative is to study a relatively simple
and well-understood problem to try to gain understand-
ing of the fundamental issues. Causal effects and essen-
tial observations can be more easily isolated and identi-
fied from simple problems since there are fewer variables
that can affect the outcome of the experiments.
In this paper, we take the second approach and focus on
a specific problem using a specific underlying statistical
algorithm. However, we try to use only some fundamen-
tal properties of the algorithm so that our methods are
readily applicable to other systems with similar proper-
ties. Specifically, we use the sentence boundary detection
problem to perform experiments since not only is it rel-
atively simple and well-understood, but it also provides
the basis for other more advanced linguistic problems.
Our hope is that some characteristics of this problem are
universal to language processing so that they can be gen-
eralized to more complicated linguistic tasks. In this pa-
per we use the generalized Winnow method (Zhang et al.,
2002) for all experiments. Applied to text chunking, this
method resulted in state of the art performance. It is thus
reasonable to conjecture that it is also suitable to other
linguistic problems including sentence segmentation.
Although issues addressed in this paper are very im-
portant for practical applications, there have only been
limited studies on this topic in the existing literature.
In speech processing, various adaptation techniques have
been proposed for language modeling. However, the
language modeling problem is essentially unsupervised
(density estimation) in the sense that it does not require
any annotation. Therefore techniques developed there
cannot be applied directly to our problems. Motivated by
adaptive language modeling, transformation-based adaptation
techniques have also been proposed for certain super-
vised learning tasks (Gales and Woodland, 1996). How-
ever, typically they only considered very specific statisti-
cal models where the idea is to fit certain transformation
parameters. In particular, they considered neither the main
issues investigated in this paper nor generally applicable
supervised adaptation methodologies such as those we
propose. In fact, it would be very difficult to extend their
methods to natural language processing problems that use
different statistical models. The adaptation idea in (Gales
and Woodland, 1996) is also closely related to the idea of
combining supervised and unsupervised learning in the
same domain (Merialdo, 1994). In machine learning, this
is often referred to as semi-supervised learning or learn-
ing with unlabeled data. Such methods are not always
reliable and can often fail (Zhang and Oles, 2000). Although
potentially useful for small distributional parameter shifts,
they cannot recover labels for examples not (or inadequately)
represented in the old training data. In such cases, it is
necessary to use supervised adaptation methods, which we
study in this paper. Another related idea is the so-called
active learning paradigm (Lewis and Catlett, 1994;
Zhang and Oles, 2000), which selectively annotates the
most informative data (from the same domain) so as to re-
duce the total number of annotations required to achieve
a certain level of accuracy. See (Tang et al., 2002; Steed-
man et al., 2003) for related studies in statistical natural
language parsing.
2 Generalized Winnow for Sentence
Boundary Detection
For the purpose of this paper, we consider the following
form of the sentence boundary detection problem: to de-
termine for each period “.” whether it denotes a sentence
boundary or not (most non-sentence boundary cases oc-
cur in abbreviations). Although other symbols such as
“?” and “!” may also denote sentence boundaries, they
occur relatively rarely and when they occur, are easy to
determine. There are a number of special situations, for
example three (or more) periods denoting an omission, where
we classify only the third period as an end-of-sentence
marker. The treatment of these special situations is not
important for the purpose of this paper.
The above formulation of the sentence segmentation
problem can be treated as a binary classification prob-
lem. One method that has been successfully applied to a
number of linguistic problems is the Winnow algorithm
(Littlestone, 1988; Khardon et al., 1999). However, a
drawback of this method is that the algorithm does not
necessarily converge for data that are not linearly separa-
ble. A generalization was recently proposed, and applied
to the text chunking problem (Zhang et al., 2002), where
it was shown that this generalization can indeed improve
the performance of Winnow.
Applying the generalized Winnow algorithm to the sentence
boundary detection problem is straightforward since the
method solves a binary classification problem directly. In
the following, we briefly review this algorithm and the
properties useful in our study.
Consider the binary classification problem: to determine a
label $y \in \{-1, 1\}$ associated with an input vector $x$.
A useful method for solving this problem is through linear
discriminant functions, which consist of linear combinations
of components of the input vector. Specifically, we seek a
weight vector $w$ and a threshold $\theta$ with the following
decision rule: if $w^T x < \theta$ we predict that the label
$y = -1$, and if $w^T x \geq \theta$, we predict that the
label $y = 1$. We denote by $d$ the dimension of the weight
vector $w$, which equals the dimension of the input vector
$x$. The weight $w$ and threshold $\theta$ can be computed
from the generalized Winnow method, which is based on the
following optimization problem:
\[
(w, \theta) = \arg\min_{w, \theta} \Bigg[
  \sum_{j=1}^{d} w_j^+ \ln \frac{w_j^+}{e\mu}
  + \theta^+ \ln \frac{\theta^+}{e\mu}
  + \sum_{j=1}^{d} w_j^- \ln \frac{w_j^-}{e\mu}
  + \theta^- \ln \frac{\theta^-}{e\mu}
  + c \sum_{i=1}^{n} f\big((w^T x^i - \theta) y^i\big)
\Bigg], \qquad (1)
\]
s.t. $w = w^+ - w^-$, $\theta = \theta^+ - \theta^-$,
where
\[
f(v) =
\begin{cases}
  -2v & v < -1 \\
  \frac{1}{2}(v - 1)^2 & v \in [-1, 1] \\
  0 & v > 1.
\end{cases}
\]
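As a small illustration of the loss term in (1), the piecewise function $f$ can be written directly in Python. This is only a sketch: the function name `winnow_loss` is chosen here for illustration and does not come from the paper.

```python
def winnow_loss(v: float) -> float:
    """Piecewise loss f(v) from problem (1).

    v is the margin (w^T x - theta) * y of one training example.
    The name `winnow_loss` is hypothetical, used only for this sketch.
    """
    if v < -1.0:
        return -2.0 * v              # linear penalty for badly wrong predictions
    if v <= 1.0:
        return 0.5 * (v - 1.0) ** 2  # quadratic penalty inside [-1, 1]
    return 0.0                       # no penalty once the margin exceeds 1
```

The two branches agree at $v = -1$ (both give 2) and at $v = 1$ (both give 0), so the loss is continuous.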
The numerical method which we use to solve this prob-
lem, as presented in Algorithm 1, is based on a dual for-
mulation of the above problem. See (Zhang et al., 2002)
for detailed derivation of the algorithm and its relation-
ship with the standard Winnow.
In all experiments, we use the same parameters suggested in
(Zhang et al., 2002) for the text chunking problem:
$K = 40$, $\mu = 0.1$, $\eta = 0.01$, and $c = 0.1$. The
above parameter choices may not be optimal for sentence
segmentation. However since the purpose of this paper is
not to demonstrate the best possible sentence segmenta-
tion system using this approach, we shall simply fix these
parameters for all experiments.
Algorithm 1 (Generalized Winnow)
input: training data $(x^1, y^1), \ldots, (x^n, y^n)$
output: weight vector $w$ and threshold $\theta$
let $\alpha_i = 0$ ($i = 1, \ldots, n$)
let $w_j^+ = w_j^- = \mu$ ($j = 1, \ldots, d$)
let $\theta^+ = \theta^- = \mu$
for $k = 1, \ldots, K$
  for $i = 1, \ldots, n$
    $p = (w^+ - w^-)^T x^i y^i - (\theta^+ - \theta^-) y^i$
    $\Delta\alpha_i = \max(\min(2c - \alpha_i, \eta(\frac{c - \alpha_i}{c} - p)), -\alpha_i)$
    $w_j^+ = w_j^+ \exp(\Delta\alpha_i x_j^i y^i)$ ($j = 1, \ldots, d$)
    $w_j^- = w_j^- \exp(-\Delta\alpha_i x_j^i y^i)$ ($j = 1, \ldots, d$)
    $\theta^+ = \theta^+ \exp(-\Delta\alpha_i y^i)$
    $\theta^- = \theta^- \exp(\Delta\alpha_i y^i)$
    $\alpha_i = \alpha_i + \Delta\alpha_i$
  end
end
let $w = w^+ - w^-$
let $\theta = \theta^+ - \theta^-$
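The steps of Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the function names `train_generalized_winnow` and `predict` are hypothetical, and dense Python lists are assumed for the feature vectors, which would be too slow for the data sizes used in the paper. The default parameter values are the ones quoted above for all experiments.

```python
import math

def train_generalized_winnow(X, y, K=40, mu=0.1, eta=0.01, c=0.1):
    """Sketch of Algorithm 1. X: list of feature vectors, y: labels in {-1, +1}."""
    n, d = len(X), len(X[0])
    alpha = [0.0] * n                  # dual variables alpha_i
    w_pos, w_neg = [mu] * d, [mu] * d  # w^+ and w^-
    th_pos = th_neg = mu               # theta^+ and theta^-
    for _ in range(K):
        for i in range(n):
            xi, yi = X[i], y[i]
            score = sum((w_pos[j] - w_neg[j]) * xi[j] for j in range(d))
            p = score * yi - (th_pos - th_neg) * yi
            delta = max(min(2 * c - alpha[i],
                            eta * ((c - alpha[i]) / c - p)),
                        -alpha[i])
            for j in range(d):
                if xi[j] != 0:         # exp(0) = 1, so zero features are skipped
                    w_pos[j] *= math.exp(delta * xi[j] * yi)
                    w_neg[j] *= math.exp(-delta * xi[j] * yi)
            th_pos *= math.exp(-delta * yi)
            th_neg *= math.exp(delta * yi)
            alpha[i] += delta
    w = [w_pos[j] - w_neg[j] for j in range(d)]
    theta = th_pos - th_neg
    return w, theta

def predict(w, theta, x):
    """Decision rule: label 1 if w^T x >= theta, otherwise -1."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= theta else -1
```

For example, training on the toy data `X = [[1, 0], [0, 1]]` with `y = [1, -1]` yields a positive weight for the first component and a negative weight for the second.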
It was shown in (Zhang et al., 2002) that if $(w, \theta)$ is
obtained from Algorithm 1, then it also approximately
minimizes $E_x (2P(y = 1|x) - 1 - T(w^T x - \theta))^2$,
where $P(y = 1|x)$ denotes the conditional probability of
$y = 1$ at a data point $x$. Here we have used $T(p)$ to
denote the truncation of $p$ onto $[-1, 1]$:
$T(p) = \min(1, \max(-1, p))$. This observation implies that
the quantity $(T(w^T x - \theta) + 1)/2$ can be regarded as an
estimate for the in-class conditional probability. As we will
see, this property will be very useful for our purposes.
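The probability estimate described above amounts to truncating the classifier output to $[-1, 1]$ and rescaling it to $[0, 1]$. A minimal sketch follows, where the function names `truncate` and `estimate_positive_probability` are chosen here for illustration only:

```python
def truncate(p: float) -> float:
    """T(p): clip p onto the interval [-1, 1]."""
    return min(1.0, max(-1.0, p))

def estimate_positive_probability(w, theta, x) -> float:
    """Estimate P(y = 1 | x) as (T(w^T x - theta) + 1) / 2."""
    score = sum(wj * xj for wj, xj in zip(w, x)) - theta
    return (truncate(score) + 1.0) / 2.0
```

An output of 0 maps to a probability of 0.5, and any output at or beyond 1 or -1 maps to 1 or 0 respectively.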
For each period in the text, we construct a feature vector
$x$ as the input to the generalized Winnow algorithm, and use
its prediction to determine whether the period denotes a
sentence boundary or not. In order to construct $x$,
we consider linguistic features surrounding the period, as
listed in Table 1. Since the feature construction routine
is written in the Java language, “type of character” fea-
tures correspond to the Java character types, which can
be found in any standard Java manual. We picked these
features by looking at features used previously, as well
as adding some of our own which we thought might be
useful. However, we have not examined which features
are actually important to the algorithm (for example, by
looking at the size of the weights), and which features are
not.
We use an encoding scheme similar to that of (Zhang
et al., 2002). For each data point, the associated features
are encoded as a binary vector $x$. Each component of $x$
corresponds to a possible feature value $v$ of a feature $f$
in Table 1. The value of the component corresponds to a test
which has value one if the corresponding feature $f$ has
value $v$, or value zero if the corresponding feature $f$ has
another feature value.
token before the period
token after the period
character to the right
type of character to the right
character to the left
type of character to the left
character to the right of blank after word
type of character to the right of blank after word
character left of first character of word
type of character left of first character of word
first character of the preceding word
type of first character of the preceding word
length of preceding word
distance to previous period
Table 1: Linguistic Features
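The binary encoding described above gives every (feature, value) pair seen in the training data its own vector component. The sketch below illustrates this; the function names and the feature names such as `token_before` are hypothetical, and the paper's actual feature extraction is written in Java.

```python
def build_feature_index(examples):
    """Map each (feature name, feature value) pair to a vector position."""
    index = {}
    for feats in examples:
        for name, value in feats.items():
            index.setdefault((name, value), len(index))
    return index

def encode(feats, index):
    """Binary vector: component is 1 if the feature takes that value, else 0."""
    x = [0] * len(index)
    for name, value in feats.items():
        pos = index.get((name, value))
        if pos is not None:      # values never seen in training are ignored
            x[pos] = 1
    return x
```

In this sketch a feature value that never occurred in the training data leaves all of its feature's components at zero, which is one way previously unseen examples from a new domain can end up poorly represented.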
The features presented here may not be optimal. In
particular, unlike (Zhang et al., 2002), we do not use
higher order features (for example, combinations of the
above features). However, this list of features has already
given good performance, comparing favorably with pre-
vious approaches (see (Reynar and Ratnaparkhi, 1997;
Mikheev, 2000) and references therein).
The standard evaluation data is the Wall Street Journal
(WSJ) treebank. Based on our processing scheme, the
training set contains about seventy-four thousand periods,
and the test set contains about thirteen thousand periods.
If we train on the training set, and test on the test set,
the accuracy is 99.7%. Another data set which has been
annotated is the Brown corpus. If we train on the WSJ
training set, and test on the Brown corpus, the accuracy
is 99.2%. The error rate is nearly three times larger
(0.8% versus 0.3%).
3 Experimental Design and System Update
Methods
In our study of system behavior under domain changes,
we have also used manually constructed rules to filter out
some of the periods. The specific rules we used are:
• If a period terminates a non-capitalized word, and is
  followed by a blank and a capitalized word, then we
  predict that it is a sentence boundary.
• If a period is both preceded and followed by
  alphanumeric characters, then we predict that it is not a
  sentence boundary.
The above rules achieve error rates of less than 0.1%
on both the WSJ and Brown datasets, which is sufficient
for our purpose. Note that we did not try to make the
above rules as accurate as possible. For example, the first
rule will misclassify situations such as “A vs. B”.
Eliminating such mistakes is not essential for the purpose
of this study.
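The two filtering rules can be sketched as a single function that either decides a period or leaves it for the statistical classifier. This is an illustrative reading of the rules, not the authors' code: the name `rule_filter` is hypothetical, and "non-capitalized word" is interpreted here as a word whose first character is not upper case.

```python
def rule_filter(text: str, i: int):
    """Classify the period at text[i] with the two manual rules.

    Returns True (sentence boundary), False (not a boundary),
    or None when neither rule applies.
    """
    if text[i] != ".":
        raise ValueError("text[i] must be a period")
    prev_ch = text[i - 1] if i > 0 else ""
    next_ch = text[i + 1] if i + 1 < len(text) else ""
    # Rule 2: alphanumeric characters on both sides, e.g. "3.5".
    if prev_ch.isalnum() and next_ch.isalnum():
        return False
    # Rule 1: non-capitalized word, then a blank and a capitalized word.
    start = i
    while start > 0 and not text[start - 1].isspace():
        start -= 1
    word = text[start:i]
    after = text[i + 1:i + 3]
    if (word and not word[0].isupper()
            and len(after) == 2 and after[0] == " " and after[1].isupper()):
        return True
    return None
```

As noted above, the first rule fires on "A vs. B" and wrongly reports a boundary, while a period after a capitalized abbreviation such as "Mr." is left undecided.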
All of our experiments are performed and reported on
the remaining periods that are not filtered out by the
above manual rules. In this study, the filtering scheme
serves two purposes. The first purpose is to magnify the
errors. Roughly speaking, the rules will classify more
than half of the periods. These periods are also relatively
easy to classify using a statistical classifier. Therefore
the error rate on the remaining periods is more than dou-
bled. Since the sentence boundary detection problem has
a relatively small error rate, this magnification effect is
useful for comparing different algorithms. The second
purpose is to reduce our manual labeling effort. In this
study, we used a number of datasets that are not annotated.
Therefore, for experimentation purposes, we had to label
each period manually.
After filtering, the WSJ training set contains about
twenty seven thousand data points, and the test set con-
tains about five thousand data points. The Brown corpus
contains about seventeen thousand data points. In addi-
tion, we also manually labeled the following data:
• Reuters: This is a standard dataset for text
  categorization, available from
  http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html.
  We only use the test data in the ModApte split, which
  contains about eight thousand periods after filtering.
• MedLine: Medical abstracts with about seven thousand
  periods, available from
  www1.ics.uci.edu/~mlearn/MLRepository.html.
It is perhaps not surprising that a sentence boundary
classifier trained on WSJ does not perform nearly as well
on some of the other data sets. However it is useful to ex-
amine the source of these extra errors. We observed that
most of the errors are clearly caused by the fact that other
domains contain examples that are not represented in the
WSJ training set. There are two sources for these pre-
viously unseen examples: 1. change of writing style; 2.
new linguistic expressions. For example, quote marks are
represented as two single quote (or back quote) characters
in WSJ, but typically as one double quote character else-
where. In some data sets such as Reuters, phrases such
as “U.S. Economy” or “U.S. Dollar” frequently have the
word after the country name capitalized (they also appear
in lower case sometimes, in the same data). The above
can be considered as a change of writing style. In some
other cases, new expressions may occur. For example, in
the MedLine data, new expressions such as “4 degrees C.”
are used to indicate temperature, and expressions such as
“Bioch. Biophys. Res. Commun. 251, 744-747” are used
for citations. In addition, new acronyms and even formu-
las containing tokens ending with periods occur in such
domains.
It is clear that the majority of errors are caused by
data that are not represented in the training set. This
fact suggests that when we apply a statistical system to a
new domain, we need to check whether the domain con-
tains a significant number of previously unseen examples
which may cause performance deterioration. This can
be achieved by measuring the similarity of the new test
domain to the training domain. One way is to compute
statistics on the training domain, and compare them to
statistics computed on the new test domain; another way
is to calculate a properly defined distance between the test
data and the training data. However, it is not immediately
obvious what data statistics are important for determin-
ing classification performance. Similarly it is not clear
what distance metric would be good to use. To avoid
such difficulties, in this paper we assume that the clas-
sifier itself can provide a confidence measure for each
prediction, and we use this information to estimate the
classifier’s performance.
As we have mentioned earlier, the generalized Winnow
method approximately minimizes the quantity
$E_x (2P(y = 1|x) - 1 - T(w^T x - \theta))^2$. It is thus
natural to use $(T(w^T x - \theta) + 1)/2$ as an estimate of
the conditional probability $P(y = 1|x)$. From simple
algebra, we obtain an estimate of the classification error
as $E_x (1 - |T(w^T x - \theta)|)/2$. Since
$T(w^T x - \theta)$ yields only an approximation of the
conditional probability, this estimate may not be entirely
accurate. However, one would
expect it to give a reasonably indicative measure of the
classification performance. In Table 2, we compare the
true classification accuracy from the annotated test data
to the estimated accuracy using this method. It clearly
shows that this estimate indeed correlates very well with
the true classification performance. Note that this esti-
mate does not require knowing the true labels of the data.
Therefore we are able to detect the potential performance
degradation of the classifier on a new domain using this
metric without the ground truth information.
accuracy    WSJ    Brown   Reuters   MedLine
true        99.3   97.7    93.0      94.8
estimated   98.6   98.2    93.3      96.4
Table 2: True and estimated accuracy
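The label-free accuracy estimate can be sketched as follows. The sketch assumes that the per-example error is taken as the smaller of the two estimated class probabilities, which equals $(1 - |T(s)|)/2$ for a raw classifier output $s = w^T x - \theta$. The function name `estimated_accuracy` is hypothetical.

```python
def estimated_accuracy(scores) -> float:
    """Estimate accuracy on unlabeled data from raw outputs s = w^T x - theta."""
    scores = list(scores)
    if not scores:
        raise ValueError("need at least one score")
    total_error = 0.0
    for s in scores:
        t = min(1.0, max(-1.0, s))          # truncation T(s)
        total_error += (1.0 - abs(t)) / 2.0  # estimated error at this point
    return 1.0 - total_error / len(scores)
```

Outputs far from zero count as confident predictions and push the estimate toward 1, while outputs near zero each contribute an estimated error close to 0.5. A noticeably lower estimate on a new domain than on the training domain signals possible degradation.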
As pointed out before, a major source of error for a
new application domain comes from data that are not
represented in the training set. If we can identify those
data, then a natural way to enhance the underlying classi-
fier’s performance would be to include them in the train-
ing data, and then retrain. However, a human is required
to obtain labels for the new data, but our goal is to reduce
the human labeling effort as much as possible. Therefore
we examine the potential of using the classifier to deter-
mine which part of the data it has difficulty with, and then
ask a human to label that part. If the underlying classi-
fier can provide confidence information, then it is natu-
ral to assume that confidence for unseen data will likely
be low. Therefore for labeling purposes, one can choose
data from the new domain for which the confidence is
low. This idea is very similar to certain methods used
in active learning. In particular a confidence-based sam-
ple selection scheme was proposed in (Lewis and Catlett,
1994). One potential problem for this approach is that by
choosing data with lower confidence levels, noisy data
that are difficult to classify tend to be chosen; another
problem is that it tends to choose similar data multiple
times. However, in this paper we do not investigate meth-
ods that solve these issues.
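The simplest form of confidence-based selection can be sketched as below. The paper does not give a formula for the confidence in this passage, so the sketch assumes it is the magnitude of the truncated classifier output, $|T(w^T x - \theta)|$. The function name `select_least_confident` is hypothetical.

```python
def select_least_confident(scores, k: int):
    """Return indices of the k examples with the lowest confidence.

    scores: raw outputs w^T x - theta of the old classifier on
    unlabeled data from the new domain.
    """
    def confidence(s: float) -> float:
        return abs(min(1.0, max(-1.0, s)))

    order = sorted(range(len(scores)), key=lambda i: confidence(scores[i]))
    return order[:k]
```

This plain version has exactly the two weaknesses mentioned above: it does nothing to avoid inherently noisy examples, and it may return several near-identical examples.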
For baseline comparison, we consider the classifier ob-
tained from the old training data (see Table 3), as well as
classifiers trained on random samples from the new do-
main (see Table 4). In this study, we explore the follow-
ing three ideas to improve the performance:
• Data balancing: Merge labeled data from the new
  domain with the existing training data from the old
  domain; we also balance their relative proportion so
  that the effect of one domain does not dominate the
  other.
• Feature augmentation: Use the old classifier (first
  level classifier) to create new features for the data,
  and then train another classifier (second level
  classifier) with augmented features (on newly labeled data
  from the new domain).
• Confidence-based data selection: Instead of random
  sampling, select data from the new domain with the
  lowest confidence based on the old classifier.
One may combine the above ideas. In particular, we will
compare the following methods in this study:
• Random: Randomly selected data from the new domain.
• Balanced: Use the WSJ training set + randomly selected
  data from the new domain. However, we super-sample the
  randomly selected data so that the effective sample size
  is $\beta$ times that of the WSJ training set, where
  $\beta$ is a balancing factor.
• Augmented (Random): Use the default classifier
  output to form additional features. Then train a
  second level classifier on randomly selected data
  from the new domain, with these additional features.
  In our experiments, four binary features are added;
  they correspond to the tests $s > 1$, $s > 0$, $s < 0$,
  and $s < -1$ (where $s = w^T x - \theta$ is the output of
  the first level classifier).
• Augmented-Balanced: As indicated, use additional
  features as well as the original WSJ training set for
  the second level classifier.
• Confidence-Balanced: Instead of random sampling
  from the new domain, choose the least confident
  data (which are more likely to provide new
  information), and then balance with the WSJ training set.
• Augmented-Confidence-Balanced: This method is
  similar to Augmented-Balanced. However, we label
  the least confident data instead of random sampling.
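The feature augmentation step used by the Augmented methods can be sketched as follows. Here `s` stands for the first level classifier output $w^T x - \theta$, and the function name `augment` is hypothetical.

```python
def augment(x, s: float):
    """Append four binary features derived from the first level output s.

    The added components are the tests s > 1, s > 0, s < 0 and s < -1.
    """
    return list(x) + [int(s > 1), int(s > 0), int(s < 0), int(s < -1)]
```

The second level classifier is then trained on the augmented vectors, so it can learn when to trust the old classifier and when to override it.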
4 Experimental Results
We carried out experiments on the Brown, Reuters, and
MedLine datasets. We randomly partition each dataset
into training and testing. All methods are trained using
only information from the training set, and their
performance is evaluated on the test set. Each test set
contains 4000 randomly selected data points. This sample
size is chosen to make sure that an estimated accuracy based
on these empirical samples will be reasonably close to the
true accuracy. For a binary classifier, the standard
deviation between the empirical mean $\hat{a}$ with a sample
size $m = 4000$, and the true mean $\bar{a}$, is
$\sqrt{\bar{a}(1 - \bar{a})/m}$. Since
$\hat{a} \approx \bar{a}$, we can replace $\bar{a}$ by
$\hat{a}$. Now, if $\hat{a} \geq 0.9$, then the error is
less than 0.5%; if $\hat{a} \geq 0.98$, then the standard
deviation is no more than about 0.2%. From the experiments,
we see that the accuracy of all algorithms will be
improved to about 0.98 for all three datasets. Therefore
the test set size we have is sufficiently large to
distinguish a difference of 0.5% with reasonable confidence.
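The standard deviation figures quoted above follow from the binomial formula and can be checked with a few lines of Python. The function name `accuracy_std` is chosen here for illustration.

```python
import math

def accuracy_std(acc: float, m: int) -> float:
    """Standard deviation of an accuracy estimate from m test samples."""
    return math.sqrt(acc * (1.0 - acc) / m)

# With 4000 test points:
std_at_90 = accuracy_std(0.90, 4000)  # about 0.0047, i.e. below 0.5%
std_at_98 = accuracy_std(0.98, 4000)  # about 0.0022, i.e. roughly 0.2%
```

Since the standard deviation shrinks as accuracy approaches 1, the 0.5% bound at 0.9 accuracy is the conservative case.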
Table 3 lists the test set performance of classifiers
trained on the WSJ training set (denoted by WSJ), the
training set from the same domain (that is, Brown,
Reuters, and MedLine respectively for the corresponding
test sets), denoted by Self, and their combination. This
indicates upper limits on what can be achieved using the
corresponding training set information. It is also
interesting to see that the combination does not necessarily
improve the performance. We compare different updating
schemes based on the number of new labels required
from the new domain. For this purpose, we use the
following numbers of labeled instances: 100, 200, 400, 800,
and 1600, corresponding to the “new data” column in the
tables. For all experiments, if a specific result requires
random sampling, then five different random runs were
performed, and the corresponding result is reported in the
format of “mean ± std. dev.” over the five runs.
trainset   Brown   Reuters   MedLine
WSJ        97.5    93.1      94.6
Self       99.1    98.4      98.2
WSJ+Self   98.9    98.9      97.9
Table 3: baseline accuracy
Table 4 contains the performance of classifiers trained
on randomly selected data from the new domain alone. It
is interesting to observe that even with a relatively small
number of training examples, the corresponding classi-
fiers can out-perform those obtained from the default
WSJ training set, which contains a significantly larger
amount of data. Clearly this indicates that in some NLP
applications, using data with the right characteristics can
be more important than using more data. This also pro-
vides strong evidence that one should update a classifier
if the underlying domain is different from the training do-
main.
new data   Brown        Reuters      MedLine
100        94.5 ± 0.9   94.8 ± 1.4   93.2 ± 1.1
200        94.8 ± 1.2   95.8 ± 0.9   95.5 ± 0.6
400        96.8 ± 0.3   96.8 ± 0.4   96.6 ± 0.4
800        97.2 ± 0.5   97.6 ± 0.1   97.2 ± 0.2
1600       97.9 ± 0.1   98.0 ± 0.1   97.8 ± 0.2
Table 4: Random Selection
Table 5 contains the results of using the balancing idea.
With the same amount of newly labeled data, the
improvement over the random method is significant (for
example, with 100 new labels the Reuters accuracy rises
from 94.8 to 97.1). This
shows that even though the domain has changed, training
data from the old domain are still very useful. Observe
that not only is the average performance improved, but
the variance is also reduced. Note that in this table, we
have fixed $\beta = 0.5$. The performance with different
$\beta$ values on the MedLine dataset is reported in Table 6.
It shows that different choices of $\beta$ make relatively small
differences in accuracy. At this point, it is interesting to
check whether the estimated accuracy (using the method
described for Table 2) reflects the change in performance
improvement. The result is given in Table 7. Clearly the
method we propose still leads to reasonable estimates.
new data   Brown        Reuters      MedLine
100        97.7 ± 0.1   97.1 ± 0.6   95.8 ± 0.4
200        97.9 ± 0.2   97.7 ± 0.3   96.6 ± 0.3
400        97.9 ± 0.1   98.1 ± 0.3   97.2 ± 0.2
800        98.1 ± 0.2   98.3 ± 0.3   97.6 ± 0.2
1600       98.4 ± 0.1   98.7 ± 0.1   97.9 ± 0.1
Table 5: Balanced ($\beta = 0.5$)
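One way to realize the balancing scheme is to replicate the newly labeled examples until their count is about $\beta$ times the size of the old training set. The sketch below takes that approach; the paper only says the new data are super-sampled, so replication by whole copies is an assumption, and the function name `balance` is hypothetical.

```python
def balance(old_data, new_data, beta: float = 0.5):
    """Merge old and new training data, replicating the new data.

    After merging, the new-domain examples number about
    beta * len(old_data).
    """
    old_data, new_data = list(old_data), list(new_data)
    if not new_data:
        return old_data
    target = round(beta * len(old_data))
    reps, rem = divmod(target, len(new_data))
    return old_data + new_data * reps + new_data[:rem]
```

For a learner that accepts per-example weights, giving each new example a weight of $\beta$ times the old set size divided by the new set size would have a similar effect without copying data.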
new data   $\beta$ = 1/8   1/4    1/2    1      2
100        96.0            95.7   95.8   96.0   94.9
200        96.3            96.5   96.6   96.3   96.6
400        96.8            97.0   97.2   97.1   96.8
800        97.3            97.5   97.6   97.5   97.4
1600       97.4            97.8   97.9   98.0   97.7
Table 6: Effect of $\beta$ on MedLine using the balancing
scheme
accuracy    Brown   Reuters   MedLine
true        98.1    98.3      97.6
estimated   98.4    97.9      98.2
Table 7: True and estimated accuracy (balancing scheme
with 800 samples and $\beta = 0.5$)
Table 8 and Table 9 report the performance using
augmented features, either with the random sampling
scheme, or with the balancing scheme. It can be seen that
with feature augmentation, the random sampling and the
balancing schemes perform similarly. Although the fea-
ture augmentation method does not improve the overall
performance (compared with balancing scheme alone),
one advantage is that we do not have to rely on the old
training data any more. In principle, one may even use
a two-level classification scheme: use the old classifier if
it gives a high confidence; use the new classifier trained
on the new domain otherwise. However, we have not ex-
plored such combinations.
new data   Brown        Reuters      MedLine
100        97.5 ± 0.0   97.7 ± 0.2   95.5 ± 1.0
200        97.6 ± 0.1   97.6 ± 0.3   95.9 ± 0.8
400        97.7 ± 0.1   97.8 ± 0.2   97.0 ± 0.9
800        97.8 ± 0.1   98.1 ± 0.4   97.6 ± 0.3
1600       98.1 ± 0.1   98.3 ± 0.3   97.9 ± 0.1
Table 8: Augmented (Random)
Table 10 and Table 11 report the performance using
confidence-based data selection instead of random sampling.
This method helps to some extent, but not as much
as we originally expected. However, we have only used
the simplest version of this method, which is susceptible
to the two problems mentioned earlier: it tends (a) to
select data that are inherently hard to classify, and (b)
to select redundant data. Both problems can be avoided
with a more elaborate implementation, but we have not
explored this. Another possible reason that
confidence-based sample selection does not yield a significant
performance improvement is that, for our examples,
the performance is already quite good even with a small
number of new samples.
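A minimal sketch of the simplest form of confidence-based selection is
given below. It is illustrative only: the function name
select_least_confident, the confidence callback, and the budget
parameter are assumptions for this example, and the selection rule
(take the least confident samples) is one plain reading of the method
rather than the exact procedure used in the experiments.

```python
from typing import Callable, List, Sequence


def select_least_confident(samples: Sequence[dict],
                           confidence: Callable[[dict], float],
                           budget: int) -> List[int]:
    """Return indices of the `budget` samples with the lowest confidence.

    `confidence` is assumed to return the current classifier's
    confidence in its own prediction for a sample.
    """
    # sorted() is stable, so ties keep their original order.
    ranked = sorted(range(len(samples)),
                    key=lambda i: confidence(samples[i]))
    return ranked[:budget]
```

This naive ranking exhibits both problems noted above: inherently
ambiguous examples always rank first, and near-duplicate examples
receive near-identical scores and are selected together.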
new data   Brown   Reuters   MedLine
[rows for 100, 200, 400, 800, and 1600 new samples; cell values are illegible in the source]
Table 9: Augmented + Balanced
new data   Brown   Reuters   MedLine
[rows for 100, 200, 400, 800, and 1600 new samples; cell values are illegible in the source]
Table 10: Confidence + Balanced
5 Conclusion
In this paper, we studied the problem of updating a statistical
system to fit a domain whose characteristics differ
from those of the training data. Without updating,
performance will typically deteriorate, perhaps quite drastically.
We used the sentence boundary detection problem to
compare a few different updating methods. This provides
useful insights into the potential value of various ideas.
In particular, we have made the following observations:
1. An NLP system trained on one data set can perform
poorly on another because there can be new examples
not adequately represented in the old training set.
2. It is possible to estimate the degree of system performance
degradation, and to determine whether it is necessary to
perform a system update.
3. When updating a classifier to
fit a new domain, even a small amount of newly labeled
data can significantly improve the performance (also, the
right training data characteristics can be more important
than the quantity of training data).
4. Combining the old
training data with the newly labeled data in an appropriate
way (e.g., by balancing or feature augmentation) can
be effective.
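As a rough illustration of observation 4, one way to balance a large old
training set against a small newly labeled set is to reweight the
samples so that the new data carries a fixed share of the total training
weight. The sketch below is an assumption made for illustration: the
function name balanced_weights and the new_share parameter are
hypothetical, and the balancing scheme used in the experiments may
differ in its details.

```python
from typing import Tuple


def balanced_weights(n_old: int, n_new: int,
                     new_share: float = 0.5) -> Tuple[float, float]:
    """Per-sample weights so new data carries `new_share` of total weight.

    Weights are normalised so that they sum to n_old + n_new.
    """
    if n_old <= 0 or n_new <= 0:
        raise ValueError("both data sets must be non-empty")
    if not 0.0 < new_share < 1.0:
        raise ValueError("new_share must lie strictly between 0 and 1")
    total = n_old + n_new
    w_old = (1.0 - new_share) * total / n_old
    w_new = new_share * total / n_new
    return w_old, w_new
```

With 9000 old and 1000 new samples and an equal share, each new sample
receives nine times the weight of an old one, so the small new set is
not swamped by the old data.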
Although the sentence segmentation problem considered
in this paper is relatively simple, we are currently
investigating other problems. We anticipate that the
observations from this study can be applied to more complicated
NLP tasks.

new data   Brown   Reuters   MedLine
[rows for 100, 200, 400, 800, and 1600 new samples; cell values are illegible in the source]
Table 11: Augmented + Confidence + Balanced

