Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL, pages 455–462,
New York, June 2006. ©2006 Association for Computational Linguistics
Paraphrasing for Automatic Evaluation
David Kauchak
Department of Computer Science
University of California, San Diego
dkauchak@cs.ucsd.edu
Regina Barzilay
CSAIL
Massachusetts Institute of Technology
regina@csail.mit.edu
Abstract
This paper studies the impact of para-
phrases on the accuracy of automatic eval-
uation. Given a reference sentence and a
machine-generated sentence, we seek to
find a paraphrase of the reference sentence
that is closer in wording to the machine
output than the original reference.
We apply our paraphrasing method in the
context of machine translation evaluation.
Our experiments show that the use of
a paraphrased synthetic reference refines
the accuracy of automatic evaluation. We
also found a strong connection between
the quality of automatic paraphrases as
judged by humans and their contribution
to automatic evaluation.
1 Introduction
The use of automatic methods for evaluating
machine-generated text is quickly becoming main-
stream in natural language processing. The most
notable examples in this category include measures
such as BLEU and ROUGE, which drive research
in the machine translation and text summarization
communities. These methods assess the quality of
a machine-generated output by considering its simi-
larity to a reference text written by a human. Ideally,
the similarity would reflect the semantic proximity
between the two. In practice, this comparison breaks
down to n-gram overlap between the reference and
the machine output.
1a. However, Israel’s reply failed to completely
clear the U.S. suspicions.
1b. However, Israeli answer unable to fully
remove the doubts.
Table 1: A reference sentence and corresponding
machine translation from the NIST 2004 MT eval-
uation.
Consider the human-written translation and the
machine translation of the same Chinese sentence
shown in Table 1. While the two translations con-
vey the same meaning, they share only auxiliary
words. Clearly, any measure based on word over-
lap will penalize a system for generating such a sen-
tence. The question is whether such cases are com-
mon phenomena or infrequent exceptions. Empiri-
cal evidence supports the former. Analyzing 10,728
reference translation pairs¹ used in the NIST 2004
machine translation evaluation, we found that only
21 (less than 0.2%) of them are identical. Moreover,
60% of the pairs differ in at least 11 words. These
statistics suggest that without accounting for para-
phrases, automatic evaluation measures may never
reach the accuracy of human evaluation.
As a solution to this problem, researchers use
multiple references to refine automatic evaluation.
Papineni et al. (2002) show that expanding the
number of references reduces the gap between automatic
and human evaluation. However, very few
human-annotated sets are augmented with multiple
references, and those that are available are relatively
small in size. Moreover, access to several references
does not guarantee that the references will include
the same words that appear in machine-generated
sentences.
¹ Each pair included different translations of the same sentence,
produced by two human translators.
In this paper, we explore the use of paraphrasing
methods for refinement of automatic evaluation
techniques. Given a reference sentence and a
machine-generated sentence, we seek to find a paraphrase
of the reference sentence that is closer in
wording to the machine output than the original reference.
For instance, given the pair of sentences in
Table 1, we automatically transform the reference
sentence (1a) into
However, Israel’s answer failed to com-
pletely remove the U.S. suspicions.
Thus, among many possible paraphrases of the
reference, we are interested only in those that use
words appearing in the system output. Our paraphrasing
algorithm is based on the substitute-in-context
strategy. First, the algorithm identifies pairs of
words from the reference and the system output that
could potentially form paraphrases. We select these
candidates using existing lexico-semantic resources
such as WordNet. Next, the algorithm tests whether
the candidate paraphrase is admissible in the context
of the reference sentence. Since even synonyms
cannot be substituted in every context (Edmonds and
Hirst, 2002), this filtering step is necessary. We predict
whether a word is appropriate in a new context
by analyzing its distributional properties in a large
body of text. Finally, paraphrases that pass the filtering
stage are used to rewrite the reference sentence.
We apply our paraphrasing method in the context
of machine translation evaluation. Using this strat-
egy, we generate a new sentence for every pair of
human and machine translated sentences. This syn-
thetic reference then replaces the original human ref-
erence in automatic evaluation.
The key findings of our work are as follows:
(1) Automatically generated paraphrases im-
prove the accuracy of the automatic evaluation
methods. Our experiments show that evaluation
based on paraphrased references gives a better ap-
proximation of human judgments than evaluation
that uses original references.
(2) The quality of automatic paraphrases de-
termines their contribution to automatic evalua-
tion. By analyzing several paraphrasing resources,
we found that the accuracy and coverage of a para-
phrasing method correlate with its utility for auto-
matic MT evaluation.
Our results suggest that researchers may find it
useful to augment standard measures such as BLEU
and ROUGE with paraphrasing information, thereby
taking more semantic knowledge into account.
In the following section, we provide an overview
of existing work on automatic paraphrasing. We
then describe our paraphrasing algorithm and ex-
plain how it can be used in an automatic evaluation
setting. Next, we present our experimental frame-
work and data and conclude by presenting and dis-
cussing our results.
2 Related Work
Automatic Paraphrasing and Entailment. Our
work is closely related to research in automatic paraphrasing,
in particular, to sentence level paraphrasing
(Barzilay and Lee, 2003; Pang et al., 2003; Quirk
et al., 2004). Most of these approaches learn paraphrases
from parallel or comparable monolingual
corpora. Instances of such corpora include multiple
English translations of the same source text writ-
ten in a foreign language, and different news arti-
cles about the same event. For example, Pang et
al. (2003) expand a set of reference translations us-
ing syntactic alignment, and generate new reference
sentences that could be used in automatic evaluation.
Our approach differs from traditional work on au-
tomatic paraphrasing in goal and methodology. Un-
like previous approaches, we are not aiming to pro-
duce any paraphrase of a given sentence since para-
phrases induced from a parallel corpus do not nec-
essarily produce a rewriting that makes a reference
closer to the system output. Thus, we focus on
words that appear in the system output and aim to
determine whether they can be used to rewrite a ref-
erence sentence.
Our work also has interesting connections with
research on automatic textual entailment (Dagan et
al., 2005), where the goal is to determine whether
a given sentence can be inferred from text. While
we are not assessing an inference relation between
a reference and a system output, the two tasks
face similar challenges. Methods for entailment
recognition extensively rely on lexico-semantic resources
(Haghighi et al., 2005; Harabagiu et al.,
2001), and we believe that our method for contextual
substitution can be beneficial in that context.
Automatic Evaluation Measures. A variety of automatic
evaluation methods have been recently proposed
in the machine translation community (NIST,
2002; Melamed et al., 2003; Papineni et al., 2002).
All these metrics compute n-gram overlap between
a reference and a system output, but measure the
overlap in different ways. Our method for reference
paraphrasing can be combined with any of these
metrics. In this paper, we report experiments with
BLEU due to its wide use in the machine translation
community.
Recently, researchers have explored additional
knowledge sources that could enhance automatic
evaluation. Examples of such knowledge sources in-
clude stemming and TF-IDF weighting (Babych and
Hartley, 2004; Banerjee and Lavie, 2005). Our work
complements these approaches: we focus on the im-
pact of paraphrases, and study their contribution to
the accuracy of automatic evaluation.
3 Methods
The input to our method consists of a reference sentence
$R = r_1 \ldots r_m$ and a system-generated sentence
$W = w_1 \ldots w_p$, whose words form the sets $\mathcal{R}$
and $\mathcal{W}$, respectively. The output of the model is a
synthetic reference sentence $S_{RW}$ that preserves the
meaning of $R$ and has maximal word overlap with
$W$. We generate such a sentence by substituting
words from $R$ with contextually equivalent words
from $W$.
Our algorithm first selects pairs of candidate word
paraphrases, and then checks the likelihood of their
substitution in the context of the reference sentence.
Candidate Selection. We assume that words from
the reference sentence that already occur in the
system-generated sentence should not be considered
for substitution. Therefore, we focus on unmatched
pairs of the form $\{(r, w) \mid r \in \mathcal{R} - \mathcal{W},\ w \in \mathcal{W} - \mathcal{R}\}$.
From this pool, we select candidate pairs whose
members exhibit high semantic proximity. In our
experiments we compute semantic similarity using
WordNet, a large-scale lexico-semantic resource
employed in many NLP applications for similar purposes.
We consider a pair as a substitution candidate
if its members are synonyms in WordNet.

2a. It is hard to believe that such tremendous
changes have taken place for those people and
lands that I have never stopped missing while
living abroad.
2b. For someone born here but has been
sentimentally attached to a foreign country
far from home, it is difficult to believe
this kind of changes.
Table 2: A reference sentence and a corresponding
machine translation. Candidate paraphrases are in
bold.

Applying this step to the two sentences in Table 2,
we obtain two candidate pairs (home, place) and
(difficult, hard).
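The candidate selection step can be illustrated with a minimal Python sketch. It is not the authors' implementation: `TOY_SYNONYMS` is a hypothetical hand-written table standing in for a WordNet synonym lookup, the function name `candidate_pairs` is invented for illustration, and the sentences of Table 2 are lowercased and split on whitespace. Pairs are returned in (r, w) order, with r taken from the reference and w from the system output.

```python
from typing import FrozenSet, List, Set, Tuple

# Hypothetical stand-in for a WordNet synonym lookup (not the authors' resource):
# each frozenset is one unordered synonym pair.
TOY_SYNONYMS: Set[FrozenSet[str]] = {
    frozenset({"hard", "difficult"}),
    frozenset({"place", "home"}),
}


def candidate_pairs(
    reference: List[str],
    system: List[str],
    synonyms: Set[FrozenSet[str]] = TOY_SYNONYMS,
) -> List[Tuple[str, str]]:
    """Return sorted (r, w) pairs: r occurs only in the reference, w only in the system output."""
    ref_only = set(reference) - set(system)   # words of the reference missing from the output
    sys_only = set(system) - set(reference)   # words of the output missing from the reference
    return sorted(
        (r, w)
        for r in ref_only
        for w in sys_only
        if frozenset({r, w}) in synonyms
    )


# The two sentences of Table 2, lowercased and whitespace-tokenized.
reference_2a = (
    "it is hard to believe that such tremendous changes have taken place "
    "for those people and lands that i have never stopped missing while living abroad"
).split()
system_2b = (
    "for someone born here but has been sentimentally attached to a foreign "
    "country far from home it is difficult to believe this kind of changes"
).split()

pairs = candidate_pairs(reference_2a, system_2b)
print(pairs)
```

Words shared by both sentences (for example "believe" and "changes") never enter the pool, which keeps the number of synonym lookups small.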
Contextual Substitution. The next step is to determine,
for each candidate pair $(r_i, w_j)$, whether
$w_j$ is a valid substitution for $r_i$ in the context
$r_1 \ldots r_{i-1} \star r_{i+1} \ldots r_m$, where $\star$ marks the
position of the replaced word. This filtering step is essential
because synonyms are not universally substitutable².
Consider the candidate pair (home, place)
from our example (see Table 2). Words home and
place are paraphrases in the sense of “habitat”, but
in the reference sentence “place” occurs in a different
sense, being part of the collocation “take place”.
In this case, the pair (home, place) cannot be used
to rewrite the reference sentence.
We formulate contextual substitution as a
binary classification task: given a context
$r_1 \ldots r_{i-1} \star r_{i+1} \ldots r_m$ in which position $i$ is
left open, we aim to predict whether
$w_j$ can occur in this context at position $i$. For
each candidate word $w_j$ we train a classifier that
models contextual preferences of $w_j$. To train such
a classifier, we collect a large corpus of sentences
that contain the word $w_j$ and an equal number of
randomly extracted sentences that do not contain
this word. The former category forms positive
instances, while the latter represents the negative.
For the negative examples, a random position in
a sentence is selected for extracting the context.
This corpus is acquired automatically, and does not
require any manual annotations.
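A small Python sketch of how such a training set might be assembled is given below. It is an illustration under stated assumptions, not the authors' pipeline: the names `SLOT`, `blank` and `build_training_set` are hypothetical, only the first occurrence of the target word in a sentence is used, and the corpus is a five-sentence toy list rather than Gigaword.

```python
import random
from typing import List, Tuple

SLOT = "<SLOT>"  # hypothetical marker for the blanked position
Instance = Tuple[List[str], int, int]  # (context tokens, slot index, label)


def blank(tokens: List[str], i: int) -> List[str]:
    """Copy the sentence with position i replaced by the slot marker."""
    return tokens[:i] + [SLOT] + tokens[i + 1:]


def build_training_set(corpus: List[List[str]], target: str, seed: int = 0) -> List[Instance]:
    """Positives: sentences containing `target`; negatives: an equal number without it."""
    rng = random.Random(seed)
    positives: List[Instance] = []
    without_target: List[List[str]] = []
    for sent in corpus:
        if target in sent:
            i = sent.index(target)  # first occurrence only, for simplicity
            positives.append((blank(sent, i), i, 1))
        elif sent:
            without_target.append(sent)
    k = min(len(positives), len(without_target))  # keep the two classes balanced
    negatives: List[Instance] = []
    for sent in rng.sample(without_target, k):
        i = rng.randrange(len(sent))  # random position for a negative context
        negatives.append((blank(sent, i), i, 0))
    return positives[:k] + negatives


toy_corpus = [s.split() for s in [
    "she walked home after work",
    "they stayed far from home",
    "the meeting will take place tomorrow",
    "prices rose sharply last year",
    "it is hard to believe",
]]
training_set = build_training_set(toy_corpus, "home")
```

No manual labels are involved: the label follows directly from whether the sentence contained the target word.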
² This can explain why previous attempts to use WordNet for
generating sentence-level paraphrases (Barzilay and Lee, 2003;
Quirk et al., 2004) were unsuccessful.
We represent context by n-grams and local collocations,
features typically used in supervised
word sense disambiguation. Both n-grams and
collocations exclude the word $w_j$. An n-gram
is a sequence of n adjacent words appearing in
$r_1 \ldots r_{i-1} \star r_{i+1} \ldots r_m$. A local collocation also
takes into account the position of an n-gram with
respect to the target word. To compute local collocations
for a word at position $i$, we extract all n-grams
($n = 1 \ldots 4$) beginning at position $i - 2$ and ending
at position $i + 2$. To make these position dependent,
we prefix each of them with its length and starting
position.
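The feature description above leaves some details open, so the following Python sketch shows only one possible reading. It assumes that the target position is removed before n-grams are formed (so the words on either side of the slot become adjacent), that the window is truncated at sentence boundaries, and that a collocation is encoded as `length|start offset|words`. The function names and the encoding are hypothetical.

```python
from typing import List, Set


def context_ngrams(tokens: List[str], i: int, n: int) -> Set[str]:
    """Position-independent n-grams of the sentence with slot i removed."""
    context = tokens[:i] + tokens[i + 1:]
    return {" ".join(context[k:k + n]) for k in range(len(context) - n + 1)}


def local_collocations(tokens: List[str], i: int, window: int = 2, max_n: int = 4) -> Set[str]:
    """N-grams (n = 1..max_n) inside the window around slot i, tagged with length and start offset."""
    # Offsets of the neighbours of slot i that actually exist in the sentence.
    offsets = [d for d in range(-window, window + 1)
               if d != 0 and 0 <= i + d < len(tokens)]
    features: Set[str] = set()
    for n in range(1, max_n + 1):
        for k in range(len(offsets) - n + 1):
            span = offsets[k:k + n]
            words = " ".join(tokens[i + d] for d in span)
            features.add(f"{n}|{span[0]:+d}|{words}")
    return features


tokens = "it is hard to believe".split()
features = local_collocations(tokens, 2)   # slot at "hard"
bigrams = context_ngrams(tokens, 2, 2)
```

With two neighbours on each side, the window yields four unigrams, three bigrams, two trigrams and one 4-gram, each distinguished by where it starts relative to the slot.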
Once the classifier³ for $w_j$ is trained, we apply
it to the context $r_1 \ldots r_{i-1} \star r_{i+1} \ldots r_m$. For
positive predictions, we rewrite the string as
$r_1 \ldots r_{i-1} w_j r_{i+1} \ldots r_m$. In this formulation, all
substitutions are tested independently.
For the example from Table 2, only the pair
(difficult, hard) passes this filter, and thus the system
produces the following synthetic reference:
It is difficult to believe that such tremendous
changes have taken place for those people
and lands that I have never stopped missing
while living abroad.
The synthetic reference keeps the meaning of the
original reference, but has a higher word overlap
with the system output.
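The rewriting step can be sketched in Python as follows. This is a sketch under assumptions: `toy_accepts` is a hypothetical hand-written rule standing in for the trained per-word classifiers, and `synthesize_reference` is an illustrative name. Every substitution is tested against the original reference, which mirrors the statement that substitutions are tested independently.

```python
from typing import Callable, List, Tuple

# accepts(w, tokens, i) -> True if w may fill position i of the reference.
Accepts = Callable[[str, List[str], int], bool]


def synthesize_reference(
    reference: List[str],
    candidates: List[Tuple[str, str]],
    accepts: Accepts,
) -> List[str]:
    """Replace reference word r by system word w wherever the filter accepts w."""
    out = list(reference)
    for r, w in candidates:
        for i, token in enumerate(reference):
            # Each test sees the original reference, so substitutions stay independent.
            if token == r and accepts(w, reference, i):
                out[i] = w
    return out


def toy_accepts(w: str, tokens: List[str], i: int) -> bool:
    """Hypothetical rule in place of a trained classifier: reject slots after 'taken'."""
    return not (i > 0 and tokens[i - 1] == "taken")


reference_2a = (
    "it is hard to believe that such tremendous changes have taken place "
    "for those people and lands that i have never stopped missing while living abroad"
).split()
synthetic = synthesize_reference(
    reference_2a,
    [("hard", "difficult"), ("place", "home")],
    toy_accepts,
)
print(" ".join(synthetic))
```

In this toy run "hard" is rewritten to "difficult", while "place" is left alone because the rule rejects the slot inside "taken place".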
One of the implications of this design is the need
to develop a large number of classifiers to test contextual
substitutions. For each word to be inserted
into a reference sentence, we need to train a separate
classifier. In practice, this requirement is not a
significant burden. The training is done off-line and
only once, and testing for contextual substitution is
instantaneous. Moreover, the first filtering step effectively
reduces the number of potential candidates.
For example, to apply this approach to the 71,520
sentence pairs from the MT evaluation set (described
in Section 4.1.2), we had to train 2,380 classifiers.
We also discovered that the key to the success of
this approach is the size of the corpus used for training
contextual classifiers. We derived training corpora
from the English Gigaword corpus, and the average
size of a corpus for one classifier is 255,000
sentences. We do not attempt to substitute any words
that have fewer than 10,000 appearances in the Gigaword
corpus.
³ In our experiments, we used the publicly available BoosTexter
classifier (Schapire and Singer, 2000) for this task.
4 Experiments
Our primary goal is to investigate the impact of
machine-generated paraphrases on the accuracy of
automatic evaluation. We focus on automatic evalu-
ation of machine translation due to the availability of
human annotated data in that domain. The hypothesis
is that by using a synthetic reference translation,
automatic measures better approximate human
evaluation. In section 4.2, we test this hypothesis
by comparing the performance of BLEU scores with
and without synthetic references.
Our secondary goal is to study the relationship
between the quality of paraphrases and their con-
tribution to the performance of automatic machine
translation evaluation. In section 4.3, we present a
manual evaluation of several paraphrasing methods
and show a close connection between intrinsic and
extrinsic assessments of these methods.
4.1 Experimental Set-Up
We begin by describing relevant background infor-
mation, including the BLEU evaluation method, the
test data set, and the alternative paraphrasing meth-
ods considered in our experiments.
4.1.1 BLEU
BLEU is the basic evaluation measure that we use
in our experiments. It is the geometric average of
the n-gram precisions of candidate sentences with
respect to the corresponding reference sentences,
times a brevity penalty. The BLEU score is com-
puted as follows:
\[
\mathrm{BLEU} = \mathrm{BP} \cdot \sqrt[4]{\prod_{n=1}^{4} p_n}
\]
\[
\mathrm{BP} = \min\left(1, e^{1 - r/c}\right),
\]
where $p_n$ is the n-gram precision, c is the cardinality
of the set of candidate sentences and r is the size of
the smallest set of reference sentences.
To augment BLEU evaluation with paraphrasing
information, we substitute each reference with the
corresponding synthetic reference.
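As an illustration, the score can be computed with a short Python sketch. Several points are assumptions rather than details taken from this paper: c and r are read as the total candidate and reference lengths in words, following Papineni et al. (2002); n-gram counts are clipped against a single reference per segment; and a zero precision yields a score of zero. The sentences are those of Table 1 together with the synthetic reference given in the introduction, lowercased, and the comparison uses unigrams only (`max_n=1`) because a single short sentence rarely shares any 4-gram with its reference.

```python
import math
from collections import Counter
from typing import List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    return Counter(tuple(tokens[k:k + n]) for k in range(len(tokens) - n + 1))


def bleu(candidates: List[List[str]], references: List[List[str]], max_n: int = 4) -> float:
    """Corpus-level BLEU with one reference per candidate segment."""
    c = sum(len(x) for x in candidates)   # total candidate length (assumed reading of c)
    r = sum(len(x) for x in references)   # total reference length (assumed reading of r)
    if c == 0:
        return 0.0
    log_mean = 0.0
    for n in range(1, max_n + 1):
        matched = total = 0
        for cand, ref in zip(candidates, references):
            cand_counts, ref_counts = ngram_counts(cand, n), ngram_counts(ref, n)
            # Clip each candidate n-gram count by its count in the reference.
            matched += sum(min(cnt, ref_counts[g]) for g, cnt in cand_counts.items())
            total += sum(cand_counts.values())
        if matched == 0:
            return 0.0  # geometric mean is zero when any precision is zero
        log_mean += math.log(matched / total) / max_n
    brevity_penalty = min(1.0, math.exp(1 - r / c))
    return brevity_penalty * math.exp(log_mean)


system_1b = "however israeli answer unable to fully remove the doubts".split()
reference_1a = "however israel's reply failed to completely clear the u.s. suspicions".split()
synthetic_1a = "however israel's answer failed to completely remove the u.s. suspicions".split()

score_original = bleu([system_1b], [reference_1a], max_n=1)
score_synthetic = bleu([system_1b], [synthetic_1a], max_n=1)
```

Swapping in the synthetic reference raises the number of matched unigrams from three to five, so `score_synthetic` exceeds `score_original` while the brevity penalty stays the same.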
4.1.2 Data
We use the Chinese portion of the 2004 NIST
MT dataset. This portion contains 200 Chinese doc-
uments, subdivided into a total of 1788 segments.
Each segment is translated by ten machine translation
systems and by four human translators. A quarter
of the machine-translated segments are scored by
human evaluators on a one-to-five scale along two
dimensions: adequacy and fluency. We use only adequacy
scores, which measure how well content is
preserved in the translation.
4.1.3 Alternative Paraphrasing Techniques
To investigate the effect of paraphrase quality on
automatic evaluation, we consider two alternative
paraphrasing resources: Latent Semantic Analysis
(LSA), and Brown clustering (Brown et al., 1992).
These techniques are widely used in NLP applica-
tions, including language modeling, information ex-
traction, and dialogue processing (Haghighi et al.,
2005; Serafin and Eugenio, 2004; Miller et al.,
2004). Both techniques are based on distributional
similarity. The Brown clustering is computed by
considering mutual information between adjacent
words. LSA is a dimensionality reduction technique
that projects a word co-occurrence matrix to lower
dimensions. This lower dimensional representation
is then used with standard similarity measures to
cluster the data. Two words are considered to be a
paraphrase pair if they appear in the same cluster.
We construct 1000 clusters employing the Brown
method on 112 million words from the North Amer-
ican New York Times corpus. We keep the top 20
most frequent words for each cluster as paraphrases.
To generate LSA paraphrases, we used the Infomap
software⁴ on a 34 million word collection of articles
from the American News Text corpus. We used
the default parameter settings: a 20,000 word vocabulary,
the 1000 most frequent words (minus a stoplist)
for features, a 15 word context window on either
side of a word, a 100 feature reduced representation,
and the 20 most similar words as paraphrases.
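The cluster-based paraphrase test can be sketched in Python as below. The cluster assignments and word frequencies are invented toy values, not output of the Brown algorithm or Infomap, and the function names are hypothetical. The sketch keeps the k most frequent words of each cluster (k = 20 in the text, k = 2 in the toy run) and treats two words as a paraphrase pair only if both survive in the same cluster.

```python
from collections import defaultdict
from typing import Dict, Set


def top_words_per_cluster(cluster_of: Dict[str, int], freq: Dict[str, int], k: int = 20) -> Dict[int, Set[str]]:
    """Keep the k most frequent words of every cluster (ties broken alphabetically)."""
    by_cluster = defaultdict(list)
    for word, cluster_id in cluster_of.items():
        by_cluster[cluster_id].append(word)
    return {
        cluster_id: set(sorted(words, key=lambda w: (-freq.get(w, 0), w))[:k])
        for cluster_id, words in by_cluster.items()
    }


def is_paraphrase_pair(a: str, b: str, cluster_of: Dict[str, int], kept: Dict[int, Set[str]]) -> bool:
    """True if a and b share a cluster and both are among its retained words."""
    cluster_id = cluster_of.get(a)
    if cluster_id is None or cluster_id != cluster_of.get(b):
        return False
    return a in kept[cluster_id] and b in kept[cluster_id]


# Toy data (hypothetical cluster ids and counts).
toy_clusters = {"edition": 0, "version": 0, "issue": 0, "internet": 1, "web": 1}
toy_freq = {"edition": 50, "version": 40, "issue": 5, "internet": 60, "web": 30}
kept = top_words_per_cluster(toy_clusters, toy_freq, k=2)
```

Unlike WordNet synonymy, cluster membership reflects only distributional similarity, so the retained pairs can be much noisier.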
While we experimented with several parameter
settings for LSA and Brown methods, we do not
claim that the selected settings are necessarily optimal.
However, these methods present sensible comparison
points for understanding the relationship between
paraphrase quality and its impact on automatic
evaluation.
Table 3 shows synthetic references produced by
the different paraphrasing methods.
⁴ http://infomap-nlp.sourceforge.net

Method      1 reference   2 references
BLEU        0.9657        0.9743
WordNet     0.9674        0.9763
ContextWN   0.9677        0.9764
LSA         0.9652        0.9736
Brown       0.9662        0.9744
Table 4: Pearson adequacy correlation scores for
rewriting using one and two references, averaged
over ten runs.

Method      vs. BLEU   vs. ContextWN
WordNet     ◁◁         △△
ContextWN   ◁◁         -
LSA         X          △△
Brown       ◁◁         △
Table 5: Paired t-test significance for all methods
compared to BLEU as well as our method for one
reference. Two triangles indicate significance at the
99% confidence level, one triangle at the 95% confidence
level and X not significant. Triangles point
towards the better method.
4.2 Impact of Paraphrases on Machine
Translation Evaluation
The standard way to analyze the performance of an
evaluation metric in machine translation is to com-
pute the Pearson correlation between the automatic
metric and human scores (Papineni et al., 2002;
Koehn, 2004; Lin and Och, 2004; Stent et al., 2005).
Pearson correlation estimates how linearly depen-
dent two sets of values are. The Pearson correlation
values range from 1, when the scores are perfectly
linearly correlated, to -1, in the case of inversely cor-
related scores.
Reference:  The monthly magazine “Choices” has won the deep trust of the residents. The current
            Internet edition of “Choices” will give full play to its functions and will help
            consumers get quick access to market information.
System:     The public has a lot of faith in the “Choice” monthly magazine and the Council is now
            working on a web version. This will enhance the magazine’s function and help consumer
            to acquire more up-to-date market information.
WordNet:    The monthly magazine “Choices” has won the deep faith of the residents. The current
            Internet version of “Choices” will give full play to its functions and will help
            consumers acquire quick access to market information.
ContextWN:  The monthly magazine “Choices” has won the deep trust of the residents. The current
            Internet version of “Choices” will give full play to its functions and will help
            consumers acquire quick access to market information.
LSA:        The monthly magazine “Choice” has won the deep trust of the residents. The current
            web edition of “Choice” will give full play to its functions and will help
            consumer get quick access to market information.
Brown:      The monthly magazine “Choices” has won the deep trust of the residents. The current
            Internet version of “Choices” will give full play to its functions and will help
            consumers get quick access to market information.
Table 3: Sample of paraphrasings produced by each method based on the corresponding system translation.
Paraphrased words are in bold and filtered words underlined.

To calculate the Pearson correlation, we create
a document by concatenating 300 segments. This
strategy is commonly used in MT evaluation, because
of BLEU’s well-known problems with documents
of small size (Papineni et al., 2002; Koehn,
2004). For each of the ten MT system translations,
the evaluation metric score is calculated on the document
and the corresponding human adequacy score
is calculated as the average human score over the
segments. The Pearson correlation is calculated over
these ten pairs (Papineni et al., 2002; Stent et al.,
2005). This process is repeated for ten different
documents created by the same process. Finally, a
paired t-test is calculated over these ten different correlation
scores to compute statistical significance.
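The two statistics used in this procedure can be written out in a few lines of Python. This is a generic sketch of the textbook formulas, not the authors' evaluation script; the function names are illustrative and the numbers in the usage lines are made up. `pearson` would take the ten metric scores and ten averaged human scores for one document, and `paired_t` would take the ten per-document correlations of two metrics.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def paired_t(a: Sequence[float], b: Sequence[float]) -> float:
    """Paired t statistic for matched samples a and b (degrees of freedom: len(a) - 1)."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


# Made-up numbers, only to exercise the functions.
r_up = pearson([1, 2, 3], [2, 4, 6])
r_down = pearson([1, 2, 3], [3, 2, 1])
t_stat = paired_t([5, 7, 9], [4, 5, 8])
```

The t statistic is then compared against the critical value of the t distribution with nine degrees of freedom when ten documents are used.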
Table 4 shows Pearson correlation scores for
BLEU and the four paraphrased augmentations,
averaged over ten runs.⁵ In all ten tests, our
method based on contextual rewriting (ContextWN)
improves the correlation with human scores over
BLEU. Moreover, in nine out of ten tests ContextWN
outperforms the method based on WordNet.
The results of statistical significance testing are summarized
in Table 5. All the paraphrasing methods
except LSA exhibit higher correlation with human
scores than plain BLEU. Our method significantly
outperforms BLEU and all the other paraphrase-based
metrics. This consistent improvement confirms
the importance of contextual filtering.
⁵ Depending on the experimental setup, correlation values
can vary widely. Our scores fall within the range of previous
researchers (Papineni et al., 2002; Lin and Och, 2004).
The third column in Table 4 shows that auto-
matic paraphrasing continues to improve correlation
scores even when two human references are para-
phrased using our method.
4.3 Evaluation of Paraphrase Quality
In the last section, we saw significant variations
in MT evaluation performance when different paraphrasing
methods were used to generate a synthetic
reference. In this section, we examine the correlation
between the quality of automatically generated
paraphrases and their contribution to automatic evaluation.
We analyze how the substitution frequency
and the accuracy of those substitutions contribute
to a method’s performance.
We compute the substitution frequency of an automatic
paraphrasing method by counting the number
of words it rewrites in a set of reference sentences.
Table 6 shows the substitution frequency and
the corresponding BLEU score. The substitution
frequency varies greatly across different methods:
LSA is by far the most prolific rewriter, while Brown
produces very few substitutions. As expected, the
more paraphrases identified, the higher the BLEU
score for the method. However, this increase does
not translate into better evaluation performance. For
instance, our contextual filtering method removes
approximately a quarter of the paraphrases suggested
by WordNet and yields a better evaluation
measure. These results suggest that the substitution
frequency cannot predict the utility value of the
paraphrasing method.

Method      Score    Substitutions
BLEU        0.0913   -
WordNet     0.0969   994
ContextWN   0.0962   742
LSA         0.0992   2080
Brown       0.0921   117
Table 6: Scores and the number of substitutions
made for all 1788 segments, averaged over the different
MT system translations.

Method      Judge 1 accuracy   Judge 2 accuracy   Kappa
WordNet     63.5%              62.5%              0.74
ContextWN   75%                76.0%              0.69
LSA         30%                31.5%              0.73
Brown       56%                56%                0.72
Table 7: Accuracy scores by two human judges as
well as the Kappa coefficient of agreement.
Accuracy measures the correctness of the pro-
posed substitutions in the context of a reference sen-
tence. To evaluate the accuracy of different para-
phrasing methods, we randomly extracted 200 para-
phrasing examples from each method. A paraphrase
example consists of a reference sentence, a refer-
ence word to be paraphrased and a proposed para-
phrase of that reference (that actually occurred in a
corresponding system translation). The judge was
instructed to mark a substitution as correct only if
the substitution was both semantically and grammat-
ically correct in the context of the original reference
sentence.
               negative   positive
filtered       40         27
non-filtered   33         100
Table 8: Confusion matrix for the context filtering
method on a random sample of 200 examples labeled
by the first judge.

Paraphrases produced by the four methods were
judged by two native English speakers. The pairs
were presented in random order, and the judges were
not told which system produced a given pair. We
employ a commonly used measure, Kappa, to assess
agreement between the judges. We found that
on all the four sets the Kappa value was around 0.7,
which corresponds to substantial agreement (Landis
and Koch, 1977).
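For readers unfamiliar with the agreement measure, here is a minimal Python sketch of Cohen's kappa for two judges. The text does not say which variant of Kappa was used, so Cohen's formulation is an assumption, and the labels in the usage lines are made up rather than taken from the annotation data.

```python
from typing import Hashable, Sequence


def cohen_kappa(judge_a: Sequence[Hashable], judge_b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two judges labelling the same items."""
    n = len(judge_a)
    observed = sum(x == y for x, y in zip(judge_a, judge_b)) / n
    labels = set(judge_a) | set(judge_b)
    # Agreement expected if each judge labelled at random with their own label rates.
    expected = sum(
        (list(judge_a).count(label) / n) * (list(judge_b).count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        return 1.0  # both judges used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up judgments: 1 = substitution correct, 0 = incorrect.
kappa_partial = cohen_kappa([1, 1, 0, 0], [1, 1, 0, 1])
kappa_perfect = cohen_kappa([1, 0, 1, 0], [1, 0, 1, 0])
```

On the Landis and Koch (1977) scale, values between 0.61 and 0.80 are labelled substantial agreement.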
As Table 7 shows, the ranking of the different
paraphrasing methods by accuracy mirrors
the ranking of the corresponding MT evaluation
methods shown in Table 4. The paraphrasing
method with the highest accuracy, ContextWN, contributes
most significantly to the evaluation performance
of BLEU. Interestingly, even methods with
moderate accuracy, e.g., 63% for WordNet, have a
positive influence on the BLEU metric. At the same
time, poor paraphrasing accuracy, such as LSA with
30%, does hurt the performance of automatic evaluation.
To further understand the contribution of contextual
filtering, we compare the substitutions made by
WordNet and ContextWN on the same set of sentences.
Among the 200 paraphrases proposed by
WordNet, 73 (36.5%) were identified as incorrect by
human judges. As the confusion matrix in Table 8
shows, 40 (54.8%) of these were eliminated during the
filtering step. At the same time, the filtering erroneously
eliminates 27 positive examples (21%). Even at this
level of false negatives, the filtering has an overall
positive effect.
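The rates discussed here follow directly from the four cells of Table 8, as the short Python calculation below shows. The variable names are illustrative; the counts are those of the table. The last two lines add a derived quantity not stated in the text: the share of correct substitutions before and after filtering.

```python
# Cells of Table 8 (first judge, 200 WordNet substitutions).
filtered = {"negative": 40, "positive": 27}       # removed by the context filter
non_filtered = {"negative": 33, "positive": 100}  # kept by the context filter

negatives = filtered["negative"] + non_filtered["negative"]   # incorrect substitutions
positives = filtered["positive"] + non_filtered["positive"]   # correct substitutions
total = negatives + positives

share_incorrect = negatives / total                  # incorrect WordNet proposals
caught = filtered["negative"] / negatives            # incorrect proposals removed
lost = filtered["positive"] / positives              # correct proposals removed in error

accuracy_before = positives / total
accuracy_after = non_filtered["positive"] / (non_filtered["negative"] + non_filtered["positive"])

for name, value in [("incorrect", share_incorrect), ("caught", caught), ("lost", lost),
                    ("accuracy before", accuracy_before), ("accuracy after", accuracy_after)]:
    print(f"{name}: {100 * value:.1f}%")
```

The accuracy among retained substitutions, 100 of 133, is close to the 75% reported for ContextWN by the first judge in Table 7, which is one way to see the net benefit of filtering despite the lost positives.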
5 Conclusion and Future Work
This paper presents a comprehensive study of the
impact of paraphrases on the accuracy of automatic
evaluation. We found a strong connection between
the quality of automatic paraphrases as judged by
humans and their contribution to automatic evalua-
tion. These results have two important implications:
(1) refining standard measures such as BLEU with
paraphrase information moves the automatic evaluation
closer to human evaluation and (2) applying
paraphrases to MT evaluation provides a task-based
assessment for paraphrasing accuracy.
We also introduce a novel paraphrasing method
based on contextual substitution. By posing the
paraphrasing problem as a discriminative task, we
can incorporate a wide range of features that im-
prove the paraphrasing accuracy. Our experiments
show improvement of the accuracy of WordNet
paraphrasing and we believe that this method can
similarly benefit other approaches that use lexico-semantic
resources to obtain paraphrases.
Our ultimate goal is to develop a contextual filtering
method that does not require candidate selection
based on a lexico-semantic resource. One source of
possible improvement lies in exploring more power-
ful learning frameworks and more sophisticated lin-
guistic representations. Incorporating syntactic de-
pendencies and class-based features into the context
representation could also increase the accuracy and
the coverage of the method. Our current method
only implements rewriting at the word level. In the
future, we would like to incorporate substitutions at
the level of phrases and syntactic trees.
Acknowledgments
The authors acknowledge the support of the Na-
tional Science Foundation (Barzilay; CAREER
grant IIS-0448168) and DARPA (Kauchak; grant
HR0011-06-C-0023). Thanks to Michael Collins,
Charles Elkan, Yoong Keok Lee, Philip Koehn, Igor
Malioutov, Ben Snyder and the anonymous review-
ers for helpful comments and suggestions. Any
opinions, findings and conclusions expressed in this
material are those of the author(s) and do not necessarily
reflect the views of DARPA or NSF.
References
B. Babych, A. Hartley. 2004. Extending the BLEU
evaluation method with frequency weightings. In Proceedings
of the ACL, 621–628.
S. Banerjee, A. Lavie. 2005. METEOR: An automatic
metric for MT evaluation with improved correlation
with human judgments. In Proceedings of the ACL
Workshop on Intrinsic and Extrinsic Evaluation Measures
for MT and/or Summarization, 65–72.
R. Barzilay, L. Lee. 2003. Learning to paraphrase: An
unsupervised approach using multiple-sequence alignment.
In Proceedings of NAACL-HLT, 16–23.
P. F. Brown, P. V. deSouza, R. L. Mercer. 1992. Class-based
n-gram models of natural language. Computational
Linguistics, 18:467–479.
I. Dagan, O. Glickman, B. Magnini, eds. 2005. The PASCAL
recognizing textual entailment challenge, 2005.
P. Edmonds, G. Hirst. 2002. Near synonymy and lexical
choice. Computational Linguistics, 28(2):105–144.
A. Haghighi, A. Ng, C. Manning. 2005. Robust textual
inference via graph matching. In Proceedings of
NAACL-HLT, 387–394.
S. Harabagiu, D. Moldovan, M. Pasca, R. Mihalcea,
M. Surdeanu, R. Bunescu, R. Girju, V. Rus,
P. Morarescu. 2001. The role of lexico-semantic feedback
in open-domain textual question-answering. In
Proceedings of ACL, 274–291.
P. Koehn. 2004. Statistical significance tests for machine
translation evaluation. In Proceedings of EMNLP,
388–395.
J. R. Landis, G. G. Koch. 1977. The measurement of
observer agreement for categorical data. Biometrics,
33:159–174.
C. Lin, F. Och. 2004. ORANGE: a method for evaluating
automatic evaluation metrics for machine translation.
In Proceedings of COLING, 501–507.
I. D. Melamed, R. Green, J. P. Turian. 2003. Precision
and recall of machine translation. In Proceedings of
NAACL-HLT, 61–63.
S. Miller, J. Guinness, A. Zamanian. 2004. Name tagging
with word clusters and discriminative training. In
Proceedings of HLT-NAACL, 337–342.
NIST. 2002. Automatic evaluation of machine translation
quality using n-gram co-occurrence statistics,
2002.
B. Pang, K. Knight, D. Marcu. 2003. Syntax-based
alignment of multiple translations: Extracting paraphrases
and generating new sentences. In Proceedings
of NAACL-HLT, 102–109.
K. Papineni, S. Roukos, T. Ward, W. Zhu. 2002. BLEU:
a method for automatic evaluation of machine translation.
In Proceedings of the ACL, 311–318.
C. Quirk, C. Brockett, W. Dolan. 2004. Monolingual
machine translation for paraphrase generation. In Proceedings
of EMNLP, 142–149.
R. E. Schapire, Y. Singer. 2000. Boostexter: A boosting-based
system for text categorization. Machine Learning,
39(2/3):135–168.
R. Serafin, B. D. Eugenio. 2004. FLSA: Extending latent
semantic analysis with features for dialogue act
classification. In Proceedings of the ACL, 692–699.
A. Stent, M. Marge, M. Singhai. 2005. Evaluating evaluation
methods for generation in the presence of variation.
In Proceedings of CICLING, 341–351.
