Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language
Processing (HLT/EMNLP), pages 740–747, Vancouver, October 2005. ©2005 Association for Computational Linguistics
BLANC1: Learning Evaluation Metrics for MT
Lucian Vlad Lita and Monica Rogati and Alon Lavie
Carnegie Mellon University
{llita,mrogati,alavie}@cs.cmu.edu
Abstract
We introduce BLANC, a family of dy-
namic, trainable evaluation metrics for ma-
chine translation. Flexible, parametrized
models can be learned from past data and
automatically optimized to correlate well
with human judgments for different cri-
teria (e.g. adequacy, fluency) using dif-
ferent correlation measures. Towards this
end, we discuss ACS (all common skip-
ngrams), a practical algorithm with train-
able parameters that estimates reference-
candidate translation overlap by comput-
ing a weighted sum of all common skip-
ngrams in polynomial time. We show that
the BLEU and ROUGE metric families are
special cases of BLANC, and we compare
correlations with human judgments across
these three metric families. We analyze the
algorithmic complexity of ACS and argue
that it is more powerful in modeling both
local meaning and sentence-level structure,
while offering the same practicality as the
established algorithms it generalizes.
1 Introduction
Although recent MT evaluation methods show
promising correlations to human judgments in terms
of adequacy and fluency, there is still considerable
room for improvement (Culy and Riehemann, 2003).
Most of these studies have been performed at a sys-
tem level and have not investigated metric robust-
ness at a lower granularity. Moreover, even though
the emphasis on adequacy vs. fluency is application-
dependent, automatic evaluation metrics do not dis-
tinguish between the need to optimize correlation
with regard to one or the other.
Automatic evaluation metrics for machine translation
face two important challenges: the lack of powerful
features to capture both sentence level structure and
local meaning, and the difficulty of designing good
functions for combining these features into meaning-
ful quality estimation algorithms.
In this paper, we introduce BLANC1, an automatic
MT evaluation metric family that is a generaliza-
tion of popular and successful metric families cur-
rently used in the MT community (BLEU, ROUGE, F-
measure etc.). We describe an efficient, polynomial-
time algorithm for BLANC, and show how it can be
optimized to target adequacy, fluency or any other
criterion. We compare our metric’s performance
with traditional and recent automatic evaluation met-
rics. We also describe the parameter conditions under
which BLANC can emulate them.
Throughout the remainder of this paper, we dis-
tinguish between two components of automatic MT
evaluation: the statistics computed on candidate
and reference translations and the function used in
defining evaluation metrics and generating transla-
tion scores. Commonly used statistics include bag-
of-words overlap, edit distance, longest common sub-
sequence, ngram overlap, and skip-bigram overlap.
Preferred functions are various combinations of precision
and recall (Soricut and Brill, 2004), including
weighted precision and F-measures (Van-Rijsbergen,
1979).
1 Since existing evaluation metrics (e.g. BLEU, ROUGE) are
special cases of our metric family, it is only natural to name it
Broad Learning and Adaptation for Numeric Criteria (BLANC):
white light contains light of all frequencies.
BLANC implements a practical algorithm with
learnable parameters for automatic MT evaluation
which estimates the reference-candidate translation
overlap by computing a weighted sum of common
subsequences (also known as skip-ngrams). Com-
mon skip-ngrams are sequences of words in their
sentence order that are found both in the reference
and candidate translations. By generalizing and sep-
arating the overlap statistics from the function used
to combine them, and by identifying the latter as a
learnable component, BLANC subsumes the ngram
based evaluation metrics as special cases and can
better reflect the need of end applications for
adequacy/fluency tradeoffs.
1.1 Related Work
Initial work in evaluating translation quality focused
on edit distance-based metrics (Su et al., 1992; Akiba
et al., 2001). In the MT context, edit distance
(Levenshtein, 1965) is the number of word insertions,
deletions and substitutions necessary to transform
a candidate translation into a reference translation.
Another evaluation metric based on edit distance
is the Word Error Rate (Niessen et al., 2000)
which computes the normalized edit distance. BLEU
is a weighted precision evaluation metric introduced
by IBM (Papineni et al., 2001). BLEU and its exten-
sions/variants (e.g. NIST (Doddington, 2002)) have
become de-facto standards in the MT community and
are consistently being used for system optimization
and tuning. These methods rely on local features
and do not explicitly capture sentence-level features,
although implicitly longer n-gram matches are re-
warded in BLEU. The General Text Matcher (GTM)
(Turian et al., 2003) is another MT evaluation method
that rewards longer ngrams instead of assigning them
equal weight.
Lin and Och (2004) recently proposed a set of
metrics (ROUGE) for MT evaluation. ROUGE-L is a
longest common subsequence (LCS) based automatic
evaluation metric for MT. The intuition behind it is
that long common subsequences reflect a large over-
lap between a candidate translation and a reference
translation. ROUGE-W is also based on LCS, but
assigns higher weights to sequences that have fewer
gaps. However, these metrics still do not distinguish
among translations with the same LCS but different
number of shorter sized subsequences, also indica-
tive of overlap. ROUGE-S attempts to correct this
problem by combining the precision/recall of skip-
bigrams of the reference and candidate translations.
However, by using skip-ngrams with n ≥ 2, we might
be able to capture more information encoded in the
higher level sentence structure. With BLANC, we
propose a way to exploit local contiguity in a man-
ner similar to BLEU and also higher level structure
similar to ROUGE type metrics.
2 Approach
We have designed an algorithm that can perform a
full overlap search over variable-size, non-contiguous
word sequences (skip-ngrams) efficiently. At first
glance, in order to perform this search, one has to
first exhaustively generate all skip-ngrams in the can-
didate and reference segments and then assess the
overlap. This approach is highly prohibitive since the
number of possible sequences is exponential in the
number of words in the sentence. Our algorithm –
ACS (all common skip-ngrams) – directly constructs
the set of overlapping skip-ngrams through incremen-
tal composition of word-level matches. With ACS,
we can reduce computation complexity to a fifth de-
gree polynomial in the number of words.
Through the ACS algorithm, BLANC is not limited
only to counting skip-ngram overlap: the contribu-
tion of different skip-ngrams to the overall score is
based on a set of features. ACS computes the over-
lap between two segments of text and also allows
local and global features to be computed during the
overlap search. These local and global features are
subsequently used to train evaluation models within
the BLANC family. We introduce below several sim-
ple skip-ngram-based features and show that special-
case parameter settings for these features emulate the
computation of existing ngram-based metrics. In or-
der to define the relative significance of a particular
skip-ngram found by the ACS algorithm, we employ
an exponential model for feature integration.
2.1 Weighted Skip-Ngrams
We define skip-ngrams as sequences of n words taken
in sentence order, allowing for arbitrary gaps. In the
algorithms literature, skip-ngrams are known as
subsequences. As special cases, skip-ngrams with n=2 are
referred to as skip-bigrams, and skip-ngrams with no
gaps between the words are simply ngrams. A sentence
S of size |S| has $C(|S|,n) = \frac{|S|!}{(|S|-n)!\,n!}$
skip-ngrams.
For example, the sentence “To be or not to be” has
C(6,2) = 15 corresponding skip-bigrams including
“be or”, “to to”, and three occurrences of “to be”.
It also has C(6,4) = 15 corresponding skip-4grams
(n = 4) including “to be to be” and “to or not to”.
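The counts in this example can be checked by brute force. The following is a minimal sketch, not part of the original method: it enumerates skip-ngrams by choosing word positions with `itertools.combinations`, and the helper name `skip_ngrams` is hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import comb

def skip_ngrams(sentence, n):
    """Return every skip-ngram of size n, one per choice of n word positions."""
    words = sentence.lower().split()
    # combinations() keeps the words in sentence order.
    return [" ".join(choice) for choice in combinations(words, n)]

bigrams = skip_ngrams("To be or not to be", 2)
bigram_counts = Counter(bigrams)
fourgrams = skip_ngrams("To be or not to be", 4)

print(len(bigrams), comb(6, 2))   # number of skip-bigrams and C(6,2)
print(bigram_counts["to be"])     # occurrences of "to be"
print(len(fourgrams))             # number of skip-4grams
```

Exhaustive enumeration like this is only practical for short sentences, since the number of skip-ngrams grows exponentially with sentence length.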
Consider the following sample reference and can-
didate translations:
R0: machine translated text is evaluated automatically
K1: machine translated stories are chosen automatically
K2: machine and human together can forge a friendship that
cannot be translated into words automatically
K3: machine code is being translated automatically
The skip-ngram “machine translated automati-
cally” appears in both the reference R0 and all candi-
date translations. Arguably, a skip-bigram that con-
tains few gaps is likely to capture local structure
or meaning. At the same time, skip-ngrams spread
across a sentence are also very useful since they may
capture part of the high level sentence structure.
We define a weighting feature function for skip-
ngrams that estimates how likely they are to capture
local meaning and sentence structure. The weighting
function ϕ for a skip-ngram w1 ..wn is defined as:
$$ \phi(w_1..w_n) = e^{-\alpha \cdot G(w_1..w_n)} \qquad (1) $$
where α ≥ 0 is a decay parameter and G(w1..wn)
measures the overall gap of the skip-ngram w1..wn in
a specific sentence. This overall skip-ngram weight
can be decomposed into the weights of its constituent
skip-bigrams:
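$$\begin{aligned}
\phi(w_1..w_n) &= e^{-\alpha \cdot G(w_1,..,w_n)} && (2) \\
&= e^{-\alpha \cdot \sum_{i=1}^{n-1} G(w_i, w_{i+1})} \\
&= \prod_{i=1}^{n-1} \phi(w_i\, w_{i+1}) && (3)
\end{aligned}$$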
In equation 2, G(wi, wi+1) is the number of words
between wi and wi+1 in the sentence. In the example
above, the skip-ngram “machine translated automatically”
has weight $e^{-3\alpha}$ for sentence K1 (three words are
skipped between “translated” and “automatically”) and weight
$e^{-12\alpha}$ for sentence K2 (ten and two words are skipped).
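The gap arithmetic for K1 and K2 can be reproduced with a short sketch. It assumes each word of the skip-ngram occurs only once in the sentence, so that `list.index` finds its position. The helper names and the value of `alpha` are illustrative and not taken from the paper.

```python
import math

def bigram_gaps(words, ngram):
    """Words skipped between successive skip-ngram words (each word assumed unique)."""
    positions = [words.index(w) for w in ngram]
    return [b - a - 1 for a, b in zip(positions, positions[1:])]

def phi(gaps, alpha):
    """Equation 1: exp(-alpha * total gap)."""
    return math.exp(-alpha * sum(gaps))

ngram = ["machine", "translated", "automatically"]
K1 = "machine translated stories are chosen automatically".split()
K2 = ("machine and human together can forge a friendship that "
      "cannot be translated into words automatically").split()

gaps_k1 = bigram_gaps(K1, ngram)   # [0, 3]
gaps_k2 = bigram_gaps(K2, ngram)   # [10, 2]
alpha = 0.1                        # illustrative value, not from the paper
weight_k2 = phi(gaps_k2, alpha)    # exp(-alpha * 12)
# Equation 3: the same weight written as a product of skip-bigram weights.
product_k2 = math.prod(phi([g], alpha) for g in gaps_k2)
```

With `alpha = 0` every skip-ngram receives weight 1, which corresponds to the setting where gap sizes are irrelevant.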
In our initial experiments the gap G has been ex-
pressed as a linear function, but different families of
functions can be explored and their corresponding
parameters learned. The parameter α dictates the
behavior of the weighting function. When α = 0, ϕ
equals $e^0 = 1$, rendering gap sizes irrelevant. In this
case, skip-ngrams are given the same weight as con-
tiguous ngrams. When α is very large, ϕ approaches
0 if there are any gaps in the skip-ngram and is 1 if
there are no gaps. This setting has the effect of con-
sidering only contiguous ngrams and discarding all
skip-ngrams with gaps.
In the above example, although the skip-ngram
“machine translated automatically” has the same
cumulative gap in both K1 and K3, the occurrence in
K1 has a gap distribution that more closely reflects
that of the reference skip-ngram in R0. To model gap
distribution differences between two occurrences of a
skip-ngram, we define a piece-wise distance function
δXY between two sentences X and Y. For two successive
words in the skip-ngram, the distance function is
defined as:
$$ \delta_{XY}(w_1 w_2) = e^{-\beta \cdot |G_X(w_1,w_2) - G_Y(w_1,w_2)|} \qquad (4) $$
where β ≥ 0 is a decay parameter. Intuitively, the
β parameter is used to reward better aligned skip-
ngrams. Similar to the ϕ function, the overall δXY
distance between two occurrences of a skip-ngram
with n > 1 is:
$$ \delta_{XY}(w_1..w_n) = \prod_{i=1}^{n-1} \delta_{XY}(w_i w_{i+1}) \qquad (5) $$
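As an illustration of equations 4 and 5, the sketch below compares the occurrences of “machine translated automatically” in K1 and K3 against the reference R0. It assumes each skip-ngram word occurs once per sentence. The function names and the value of `beta` are illustrative.

```python
import math

def bigram_gaps(words, ngram):
    """Words skipped between successive skip-ngram words (each word assumed unique)."""
    positions = [words.index(w) for w in ngram]
    return [b - a - 1 for a, b in zip(positions, positions[1:])]

def delta(gaps_x, gaps_y, beta):
    """Equation 5: product of exp(-beta * |gap difference|) over successive word pairs."""
    return math.prod(math.exp(-beta * abs(gx - gy))
                     for gx, gy in zip(gaps_x, gaps_y))

ngram = ["machine", "translated", "automatically"]
R0 = "machine translated text is evaluated automatically".split()
K1 = "machine translated stories are chosen automatically".split()
K3 = "machine code is being translated automatically".split()

beta = 0.5  # illustrative value, not from the paper
d_k1 = delta(bigram_gaps(R0, ngram), bigram_gaps(K1, ngram), beta)  # gaps [0,3] vs [0,3]
d_k3 = delta(bigram_gaps(R0, ngram), bigram_gaps(K3, ngram), beta)  # gaps [0,3] vs [3,0]
```

K1 and K3 have the same cumulative gap of 3, yet only K1 receives the maximal distance score of 1, because its gaps fall in the same places as in R0.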
Note that equation 5 takes into account pairs of
skip-ngram occurrences that skip in different places, since
multiplying the exponentials sums the piecewise gap
differences in the exponent. Finally, using an exponential
model, we assign an overall score to the matched
skip-ngram. The skip-ngram scoring function Sxy allows
independent features to be incorporated into the
overall score:
$$ S_{xy}(w_i..w_k) = \phi(w_i..w_k) \cdot \delta_{xy}(w_i..w_k) \cdot e^{\lambda_1 f_1(w_i..w_k)} \cdot \ldots \cdot e^{\lambda_h f_h(w_i..w_k)} \qquad (6) $$
where features f1..fh can be functions based on the
syntactic, semantic, lexical or morphological aspects
of the skip-ngram. Note that different models for
combining skip-ngram features can be used in con-
junction with ACS.
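Equation 6 is a log-linear combination, which can be written in a few lines. In this sketch the feature values and the weights λ are placeholders: the paper does not fix a particular feature set, so the single feature used below is purely hypothetical.

```python
import math

def skip_ngram_score(phi, delta, feature_values=(), lambdas=()):
    """Equation 6: phi * delta * exp(lambda_1 * f_1) * ... * exp(lambda_h * f_h)."""
    extra = math.exp(sum(l * f for l, f in zip(lambdas, feature_values)))
    return phi * delta * extra

# No additional features: the score reduces to phi * delta.
base = skip_ngram_score(phi=0.5, delta=0.8)

# One hypothetical feature with value 1.0 and weight 0.7.
boosted = skip_ngram_score(0.5, 0.8, feature_values=[1.0], lambdas=[0.7])
```

Because every factor is an exponential, the logarithm of the score is linear in α, β and the λ weights, which keeps the model easy to train.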
2.2 Multiple References
In BLANC we incorporate multiple references in a
manner similar to the ROUGE metric family. We
compute the precision and recall of skip-ngrams of each
size for individual references. Based on these, we
combine the maximum precision and maximum re-
call of the candidate translation obtained using all
reference translations and use them to compute an ag-
gregate F-measure.
The F-measure parameter βF is modeled by
BLANC. In our experiments we optimized βF indi-
vidually for fluency and adequacy.
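The aggregation over references can be sketched as follows. The paper does not spell out the F-measure formula, so this sketch assumes the standard van Rijsbergen form, in which a larger βF gives more weight to recall. The function names and the example numbers are illustrative.

```python
def f_measure(precision, recall, beta_f=1.0):
    """Standard F-measure (assumed form); beta_f > 1 favors recall."""
    if precision == 0.0 or recall == 0.0:
        return 0.0
    b2 = beta_f ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def multi_reference_f(per_reference, beta_f=1.0):
    """per_reference: list of (precision, recall) pairs, one per reference."""
    best_precision = max(p for p, _ in per_reference)
    best_recall = max(r for _, r in per_reference)
    return f_measure(best_precision, best_recall, beta_f)

# Two hypothetical references for one candidate translation.
scores = [(0.5, 0.4), (0.3, 0.6)]
f1 = multi_reference_f(scores, beta_f=1.0)   # uses P = 0.5 and R = 0.6
f3 = multi_reference_f(scores, beta_f=3.0)   # pulled toward the recall of 0.6
```

Note that the maximum precision and the maximum recall may come from different references, as in this example.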
2.3 The ACS Algorithm
We present a practical algorithm for extracting All
Common Skip-ngrams (ACS) of any size that appear
in the candidate and reference translations. For clar-
ity purposes, we present the ACS algorithm as it
relates to the MT problem: find all common skip-
ngrams (ACS) of any size in two sentences X and Y :
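$$\begin{aligned}
wSKIP &\leftarrow Acs(\delta, \phi, X, Y) && (7) \\
&= \{ wSKIP_1, \ldots, wSKIP_{\min(|X|,|Y|)} \} && (8)
\end{aligned}$$
where $wSKIP_n$ is the set of all skip-ngrams of size n
and is defined as:
$$ wSKIP_n = \{ \text{“}w_1..w_n\text{”} \mid w_i \in X,\ w_i \in Y,\ \forall i \in [1..n] \text{ and } w_i \prec w_j,\ \forall i < j \in [1..n] \} $$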
Given two sentences X and Y we observe a match
(w,x,y) if word w is found in sentence X at index x
and in sentence Y at index y:
$$ (w,x,y) \equiv \{ 1 \le x \le |X|,\ 1 \le y \le |Y|,\ w \in V, \text{ and } X[x] = Y[y] = w \} \qquad (9) $$
where V is the vocabulary with a finite set of words.
In the following subsections, we present the three
steps of the ACS algorithm:
1. identify all matches – find matches and generate
corresponding nodes in the dependency graph
2. generate dependencies – construct edges ac-
cording to pairwise match dependencies
3. propagate common subsequences – count
all common skip-ngrams using corresponding
weights and distances
We use the following example to illustrate the
intermediate steps of ACS.
X. “to be or not to be”
Y. “to exist or not be”
2.3.1 Step 1: Identify All Matches
In this step we identify all word matches (w,x,y)
in sentences X and Y . Using the example above, the
intermediate inputs and outputs of this step are:
Input: X. “to be or not to be”
Y. “to exist or not be”
Output: (to,1,1); (to,5,1); (or,3,3); (be,2,5); . . .
For each match we create a corresponding node N
in a dependency graph. With each node we associate
the actual word matched and its corresponding index
positions in both sentences.
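Step 1 can be sketched in a few lines. The function below is an illustrative implementation, not the authors' code: it indexes the positions of each word in Y and then scans X, using 1-based positions to agree with the example output above.

```python
def find_matches(x_sentence, y_sentence):
    """Return all matches (word, x, y) with 1-based positions in X and Y."""
    x_words = x_sentence.split()
    y_words = y_sentence.split()
    # Map each word of Y to the list of positions where it occurs.
    positions_in_y = {}
    for y, word in enumerate(y_words, start=1):
        positions_in_y.setdefault(word, []).append(y)
    return [(word, x, y)
            for x, word in enumerate(x_words, start=1)
            for y in positions_in_y.get(word, [])]

matches = find_matches("to be or not to be", "to exist or not be")
for match in matches:
    print(match)
```

Each returned triple corresponds to one node of the dependency graph.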
2.3.2 Step 2: Generate Dependencies
A dependency N1 → N2 occurs when the two
corresponding matches (w1,x1,y1) and (w2,x2,y2)
can form a valid common skip-bigram: i.e. when
x1 < x2 and y1 < y2. Note that the matches can
cover identical words, but their indices cannot be the
same (x1 ≠ x2 and y1 ≠ y2) since a skip-bigram
requires two different word matches.
In order to facilitate the generation of all common
subsequences, the graph is populated with the
appropriate dependency edges:
for each node N in DAG
    for each node M ≠ N in DAG
        if N(x) < M(x) and N(y) < M(y)
            create edge E: N→M
            compute δXY (E)
            compute ϕ(E)
This step incorporates the concepts of skip-ngram
weight and distance into the graph. With each edge
E : N1→N2 we associate step-wise weight and dis-
tance information for the corresponding skip-bigram
formed by matches (w1,x1,y1) and (w2,x2,y2).
Note that rather than counting all skip-ngrams,
which would be exponential in the worst case sce-
nario, we only construct a structure of match depen-
dencies (i.e. skip-bigrams). As in dynamic program-
ming, in order to avoid exponential complexity, we
compute individual skip-ngram scores only once.
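A minimal sketch of Step 2 on the running example is given below. It assumes the matches from Step 1 are available as (word, x, y) triples and stores edges as adjacency lists keyed by match index. The representation is an illustrative choice, and the weight and distance computations are omitted.

```python
def build_edges(matches):
    """Create edge i -> j when match i precedes match j in both sentences."""
    edges = {i: [] for i in range(len(matches))}
    for i, (_, x1, y1) in enumerate(matches):
        for j, (_, x2, y2) in enumerate(matches):
            if x1 < x2 and y1 < y2:   # valid common skip-bigram
                edges[i].append(j)
    return edges

# Matches for X = "to be or not to be" and Y = "to exist or not be".
matches = [("to", 1, 1), ("be", 2, 5), ("or", 3, 3),
           ("not", 4, 4), ("to", 5, 1), ("be", 6, 5)]
edges = build_edges(matches)
edge_count = sum(len(targets) for targets in edges.values())
```

Each edge is one common skip-bigram occurrence, so `edge_count` is also the unweighted number of common skip-bigrams for this sentence pair.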
2.3.3 Step 3: Propagate Common Subsequences
In this last step, the ACS algorithm counts all com-
mon skip-ngrams using corresponding weights and
distances. In the general case, this step is equivalent
to measuring the overlap of the two sentences X
and Y . As a special case, if no features are used, the
ACS algorithm is equivalent to counting the number
of common skip-ngrams regardless of gap sizes.
// depth first search (DFS)
for each node N in DAG
    compute node N’s depth
// initialize skip-ngram counts
for each node N in DAG
    vN[1] ← 1
    for i=2 to LCS(X,Y)
        vN[i] ← 0
// compute ngram counts
for d=1 to MAXDEPTH
    for each node N of depth d in DAG
        for each edge E: N→M
            // skip-ngrams ending at N have at most d words
            for i=2 to d+1
                vM[i] += Sxy(δ(E), ϕ(E), vN[i-1])
After algorithm ACS is run, the number of skip-
ngrams (weighted skip-ngram score) of size k is sim-
ply the sum of the number of skip-ngrams of size k
ending in each node N’s corresponding match:
$$ wSKIP_k = \sum_{N_i \in DAG} v_{N_i}[k] \qquad (10) $$
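The three steps can be combined into a compact sketch. This is one possible reading of the algorithm rather than the authors' implementation. It makes three assumptions. Matches are processed in order of their position in X, which is a valid topological order of the graph. Each edge contributes a multiplicative weight to the propagated counts. That weight is supplied by a caller-provided `edge_weight` function, which defaults to 1 (no features).

```python
import math

def acs_counts(x_words, y_words, edge_weight=lambda m1, m2: 1.0):
    """Weighted number of common skip-ngrams of each size k (index k of the result)."""
    # Step 1: matches as 0-based (x, y) pairs; sorting by x gives a topological order.
    matches = sorted((x, y) for x, wx in enumerate(x_words)
                     for y, wy in enumerate(y_words) if wx == wy)
    max_size = min(len(x_words), len(y_words))
    # v[j][k]: weighted count of common skip-ngrams of size k ending at match j.
    v = [[0.0] * (max_size + 1) for _ in matches]
    for j, (x2, y2) in enumerate(matches):
        v[j][1] = 1.0
        # Steps 2 and 3 fused: follow every edge i -> j and propagate counts.
        for i, (x1, y1) in enumerate(matches[:j]):
            if x1 < x2 and y1 < y2:
                w = edge_weight((x1, y1), (x2, y2))
                for k in range(2, max_size + 1):
                    v[j][k] += w * v[i][k - 1]
    # Equation 10: sum the per-node counts for every size k.
    return [sum(row[k] for row in v) for k in range(max_size + 1)]

X = "to be or not to be".split()
Y = "to exist or not be".split()
counts = acs_counts(X, Y)   # unweighted: plain common skip-ngram counts

def delta_weight(m1, m2, beta=1.0):
    """Distance feature of equation 4 for one edge (beta value is illustrative)."""
    gap_x = m2[0] - m1[0] - 1
    gap_y = m2[1] - m1[1] - 1
    return math.exp(-beta * abs(gap_x - gap_y))

weighted = acs_counts(X, Y, edge_weight=delta_weight)
```

For this sentence pair the unweighted counts are 6 matches, 8 common skip-bigrams, 4 common skip-trigrams and 1 common skip-4gram (“to or not be”). Each count is accumulated once per edge and size, which is what keeps the computation polynomial.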
2.3.4 ACS Complexity and Feasibility
In the worst case scenario, both sentences X and Y
are composed of exactly the same repeated word: X
= “w w w w .. ” and Y = “w w w w ..”. We let m =|X|
and n = |Y|. In this case, the number of matches is
M = n·m. Therefore, Step 1 has worst case time
and space complexity of O(m·n). However, em-
pirical data suggest that there are far fewer matches
than in the worst-case scenario and the actual space
requirements are drastically reduced. Even in the
worst-case scenario, if we assume that sentences
are shorter than 100 words, the number of nodes
in the DAG would only be 10,000. Step 2 of the
algorithm consists of creating edges in the dependency
graph. In the worst case scenario, the number of
directed edges is $O(M^2)$, and if the sentences
are uniformly composed of the same repeated
word as seen above, the worst-case time and space
complexity is $m(m+1)/2 \cdot n(n+1)/2 = O(m^2 n^2)$.
In Step 3 of the algorithm, the DFS complexity for
computing node depths is O(M) and the complexity
of LCS(X,Y) is O(m·n). The dominant step
is the propagation of common subsequences (skip-ngram
counts). Let l be the size of the LCS. The upper
bound on the size of the longest common subsequence
is min(|X|,|Y|) = min(m,n). In the worst
case scenario, for each node we propagate l count values
(the size of vector v) to all other nodes in the
DAG. Therefore, the time complexity for Step 3 is
$O(M^2 \cdot l) = O(m^2 n^2 l)$ (fifth degree polynomial).
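To give a sense of scale, the worst-case bounds above can be evaluated for concrete sentence lengths. The sketch below simply plugs numbers into the formulas from this section. The function name is illustrative.

```python
def worst_case_bounds(m, n):
    """Worst-case sizes for two sentences made of one repeated word."""
    matches = m * n                                   # Step 1: nodes in the DAG
    edges = (m * (m + 1) // 2) * (n * (n + 1) // 2)   # Step 2: edge bound from the text
    lcs = min(m, n)                                   # longest possible skip-ngram
    propagation = matches ** 2 * lcs                  # Step 3: M^2 * l count updates
    return matches, edges, propagation

print(worst_case_bounds(50, 50))
print(worst_case_bounds(100, 100))
```

For two 100-word sentences this gives 10,000 nodes, about 2.6 × 10^7 edges and 10^10 count updates as the upper bound for Step 3.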
3 BLANC as a Generalization of BLEU and
ROUGE
Due to its parametric nature, the All Common Skip-ngrams
(ACS) algorithm can emulate the ngram computation
of several popular MT evaluation metrics. The
weighting function ϕ allows skip-ngrams with differ-
ent gap sizes to be assigned different weights. Param-
eter α controls the shape of the weighting function.
In one extreme scenario, if we allow α to take
very large values, the net effect is that all contiguous
ngrams of any size will have corresponding weights
of $e^0 = 1$ while all other skip-ngrams will have
weights that approach zero. In this case, the distance
function will only apply to contiguous ngrams which
have the same size and no gaps. Therefore, the distance
function will also be 1. The overall result is
that the ACS algorithm collects contiguous common
ngram counts for all ngram sizes. This is equivalent
to computing the ngram overlap between two sentences,
which is the ngram computation performed by the
BLEU metric. In addition to computing ngram overlap,
BLEU incorporates a thresholding
(clipping) on ngram counts based on reference translations,
as well as a brevity penalty which makes sure
the machine-produced translations are not too short.
In BLANC, this is replaced by the standard F-measure,
which has been shown to work well in MT
evaluation (Turian et al., 2003).
Another scenario consists of setting the α and β
parameters to 0. In this case, all skip-ngrams are as-
signed the same weight value of 1 and skip-ngram
matches are also assigned the same distance value of
1 regardless of gap sizes and differences in gap sizes.
This renders all skip-ngrams equivalent and the ACS
algorithm is reduced to counting the skip-ngram over-
lap between two sentences. Using these counts, pre-
cision and recall-based metrics such as the F-measure
can be computed. If we set the α and β parameters to
zero, disregard redundant matches, and compute
the ACS only for skip-ngrams of size 2, the ACS algorithm
is equivalent to the ROUGE-S metric (Lin and
Och, 2004). This case represents a specific parameter
setting in the ACS skip-ngram computation.
[Figure 1: Empirical and theoretical behavior of ACS on 2003 machine translation evaluation data (semilog scale). Panels: sentence length histograms (#sentences vs. sentence length) for Arabic 2003 and Chinese 2003; ACS #Matches (average number of matches), ACS #Edges (average number of edges) and ACS #Feature Calls (average total), each as a function of sentence length for Arabic, Chinese and the worst case.]
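The skip-bigram special case described above can be sketched directly. The code below assumes that “disregarding redundant matches” means clipping each skip-bigram to the smaller of its two counts. That is one plausible reading, not a definition given in the text. Precision and recall are normalized by the total number of skip-bigrams in the candidate and reference.

```python
from collections import Counter
from itertools import combinations
from math import comb

def skip_bigram_overlap(reference, candidate):
    """Clipped common skip-bigram count with precision and recall."""
    ref_words, cand_words = reference.split(), candidate.split()
    ref = Counter(combinations(ref_words, 2))
    cand = Counter(combinations(cand_words, 2))
    common = sum((ref & cand).values())        # multiset intersection (clipping)
    total_ref = comb(len(ref_words), 2) or 1   # avoid division by zero
    total_cand = comb(len(cand_words), 2) or 1
    return common, common / total_cand, common / total_ref

R0 = "machine translated text is evaluated automatically"
K1 = "machine translated stories are chosen automatically"
common, precision, recall = skip_bigram_overlap(R0, K1)
```

For R0 and K1 the common skip-bigrams are the three pairs that can be formed from “machine”, “translated” and “automatically”.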
The longest common subsequence statistic has also
been successfully used for automatic machine translation
evaluation in the ROUGE-L (Lin and Och,
2004) algorithm. In BLANC, if we set both α and
β parameters to zero, the net result is a set of skip-ngram
(common subsequence) overlap counts for all
skip-ngram sizes. Although dynamic programming
or suffix trees can be used to compute the LCS much
faster, under this parameter setting the ACS algorithm
can also produce the length of the longest common
subsequence, as the largest size k with a nonzero count:
$$ LCS(X,Y) \leftarrow \arg\max_{k} \; Acs(wSKIP_k) > 0 $$
where Acs(wSKIPk) is the number of common
skip-ngrams (common subsequences) of size k produced by
the ACS algorithm.
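This relation can be checked on a small example. The sketch below uses an unweighted count propagation over matches (α = β = 0) and compares the largest size with a nonzero count to the classic LCS dynamic program. All function names are illustrative, and the counting routine is a simplified stand-in for ACS.

```python
def common_skip_ngram_counts(x, y):
    """Unweighted number of common skip-ngrams of each size k (index k)."""
    matches = sorted((i, j) for i, a in enumerate(x)
                     for j, b in enumerate(y) if a == b)
    size = min(len(x), len(y))
    v = [[0] * (size + 1) for _ in matches]
    for t, (i2, j2) in enumerate(matches):
        v[t][1] = 1
        for s in range(t):
            i1, j1 = matches[s]
            if i1 < i2 and j1 < j2:
                for k in range(2, size + 1):
                    v[t][k] += v[s][k - 1]
    return [sum(row[k] for row in v) for k in range(size + 1)]

def lcs_length_from_counts(x, y):
    """Largest skip-ngram size with a nonzero common count."""
    counts = common_skip_ngram_counts(x, y)
    return max((k for k, c in enumerate(counts) if c > 0), default=0)

def lcs_length_classic(x, y):
    """Textbook O(m*n) dynamic program for the LCS length."""
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, a in enumerate(x, 1):
        for j, b in enumerate(y, 1):
            if a == b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(x)][len(y)]

X = "to be or not to be".split()
Y = "to exist or not be".split()
print(lcs_length_from_counts(X, Y), lcs_length_classic(X, Y))
```

Both routines agree on this example, where the longest common subsequence is “to or not be”.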
ROUGE-W (Lin and Och, 2004) relies on a
weighted version of the longest common subse-
quence, under which longer contiguous subsequences
are assigned a higher weight than subsequences that
incorporate gaps. ROUGE-W uses the polynomial
function $x^a$ in the weighted LCS computation. This
setting can also be simulated by BLANC by adjusting
the parameters α to reward tighter skip-ngrams and β
to assign a very high score to similar size gaps. In-
tuitively, α is used to reward skip-ngrams that have
smaller gaps, while β is used to reward better aligned
skip-ngram overlap.
4 Scalability & Data Exploration
In Figure 1 we show the theoretical and empirical
behavior of the ACS algorithm on the 2003
TIDES machine translation evaluation data for Arabic
and Chinese. The sentence length distribution is
similar for the two languages: only a very
small number of text segments have more than 50
tokens. We show the ACS graph size in the worst
case scenario, and the empirical average number of
matches for both languages as a function of sentence
length. We also show (on a log scale) the upper bound
on time/space complexity in terms of total number
of feature computations. Even though the worst-case
scenario is tractable (polynomial), the empirical
amount of computation is considerably smaller and
grows as a polynomial of lower degree. In Figure 1,
sentence length is the average of the reference and
candidate lengths.
Finally, we also show the total number of feature
computations involved in performing a full overlap
search and computing a numeric score for the
reference-candidate translation pair. We have experimented
with the ACS algorithm using a worst-case
scenario where all words are exactly the same in a
fifty-word reference translation and a fifty-word candidate
translation. In practice, when considering real sentences,
the number of matches is very small. The
algorithm takes less than two seconds on a low-end
desktop system for the worst-case
scenario, and less than a second for all candidate-reference
pairs in the TIDES 2003 dataset. This result
renders the ACS algorithm very practical for
automatic MT evaluation.
5 Experiments & Results
In the dynamic metric BLANC, we have implemented
the ACS algorithm using several parameters includ-
ing the aggregate gap size α, the displacement feature
β, a parameter for regulating skip-ngram size contri-
bution, and the F-measure βF parameter.
Until recently, most experiments that evaluate the correlation
of automatic metrics with human judgments have
been performed at a system level. In such experi-
ments, human judgments are aggregated across sen-
tences for each MT system and compared to aggre-
gate scores for automatic metrics. While high scor-
ing metrics in this setting are useful for understand-
ing relative system performance, not all of them are
robust enough for evaluating the quality of machine
translation output at a lower granularity. Sentence-
level translation quality estimation is very useful
when MT is used as a component in a pipeline of text-
processing applications (e.g. question answering).
The fact that current automatic MT evaluation metrics,
including BLANC, do not correlate well with human
judgments at the sentence level does not mean
we should ignore this need and focus only on system
level evaluation. On the contrary, further research is
required to improve these metrics. Due to its trainable
nature, and by allowing additional features to be
incorporated into its model, BLANC has the potential
to address this issue.
For comparison purposes with previous literature,
we have also performed experiments at system level
for Arabic. The datasets used consist of the MT
outputs from all systems available through the
TIDES 2003 evaluation (663 sentences) for training
and the TIDES 2004 evaluation (1353 sentences) for
testing.
We compare (Table 1) the performance of BLANC
on Arabic translation output with the performance
of more established evaluation metrics: BLEU and
NIST, and also with more recent metrics: ROUGE-
L and ROUGE-S (using an unlimited size skip win-
dow), which have been shown to correlate well with
human judgments at system level – as confirmed by
our results. We have performed experiments in which
case information is preserved as well as experiments
that ignore case information. Since the results are
very similar, we only show here experiments under
the former condition. In order to maintain consis-
tency, when using any metric we apply the same pre-
processing provided by the MTEval script. When
computing the correlation between metrics and hu-
man judgments, we only keep strictly positive scores.
While this is not fully equivalent to BLEU smooth-
ing, it partially mitigates the same problem of zero
count ngrams for short sentences. In future work we
plan to implement smoothing for all metrics, includ-
ing BLANC.
We train BLANC separately for adequacy and flu-
ency, as well as for system level and segment level
correlation with human judgments. The BLANC pa-
rameters are currently trained using a simple hill-
climbing procedure and using several starting points
in order to decrease the chance of reaching a local
maximum.
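The training loop can be illustrated with a generic sketch. The paper does not describe its hill-climbing procedure in detail, so the coordinate-wise search with step halving below is an assumption, as are all function names. In actual training the objective would be the correlation between BLANC scores and human judgments as a function of the parameters (α, β, βF and so on). Here a toy one-parameter objective with two maxima stands in for it.

```python
def hill_climb(objective, start, step=0.5, min_step=1e-3):
    """Coordinate-wise hill climbing: keep any single-parameter step that improves."""
    point, best = list(start), objective(start)
    while step >= min_step:
        improved = False
        for i in range(len(point)):
            for direction in (step, -step):
                trial = point[:]
                trial[i] += direction
                score = objective(trial)
                if score > best:
                    point, best, improved = trial, score, True
        if not improved:
            step /= 2   # refine the search around the current point
    return point, best

def best_of_restarts(objective, starts):
    """Run hill climbing from several starting points and keep the best result."""
    return max((hill_climb(objective, s) for s in starts), key=lambda r: r[1])

# Toy objective (hypothetical): local maximum near -0.96, global maximum near 1.04.
def toy_objective(params):
    a = params[0]
    return -(a * a - 1) ** 2 + 0.3 * a

local_point, local_score = hill_climb(toy_objective, [-2.0])
best_point, best_score = best_of_restarts(toy_objective, [[-2.0], [2.0]])
```

Starting from -2.0 alone ends in the weaker maximum, while adding a second starting point recovers the better one, which is the motivation for using several starting points.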
BLANC proves to be robust across criteria and
granularity levels. As expected, different parameter
values of BLANC optimize different criteria (e.g. ad-
equacy and fluency). We have observed that train-
ing BLANC for adequacy results in more bias to-
wards recall (βF =3) compared to training it for flu-
ency (βF =2). This confirms our intuition that a dy-
namic, parametric metric is justified for automatic
evaluation.
6 Conclusions & Future Work
In previous sections we have defined simple distance
functions. More complex functions can also be incor-
porated in ACS. Skip-ngrams in the candidate sen-
tence might be rewarded if they contain fewer gaps in
the candidate sentence and penalized if they contain
more. Different distance functions could also be used
in ACS, including functions based on surface-form
features and part-of-speech features.
Tides 2003 Arabic
             System Level          Segment Level
Method       Adequacy   Fluency    Adequacy   Fluency
BLEU         0.950      0.934      0.382      0.286
NIST         0.962      0.939      0.439      0.304
ROUGE-L      0.974      0.926      0.440      0.328
ROUGE-S      0.949      0.935      0.360      0.328
BLANC        0.988      0.979      0.492      0.391

Tides 2004 Arabic
             System Level          Segment Level
Method       Adequacy   Fluency    Adequacy   Fluency
BLEU         0.978      0.994      0.446      0.337
NIST         0.987      0.952      0.529      0.358
ROUGE-L      0.981      0.985      0.538      0.412
ROUGE-S      0.937      0.980      0.367      0.408
BLANC        0.982      0.994      0.565      0.438

Table 1: Pearson correlation of several metrics with human judgments at system level and segment level for fluency and adequacy.
Most of the established MT evaluation methods are
static functions according to which automatic evalu-
ation scores are computed. In this paper, we have
laid the foundation for a more flexible, parametric ap-
proach that can be trained using existing MT data and
that can be optimized for highest agreement with hu-
man assessors, for different criteria.
We have introduced ACS, a practical algorithm
with learnable parameters for automatic MT evaluation,
and shown that the ngram computation of popular
evaluation methods can be emulated by ACS through
different parameter settings. We have computed time
and space bounds for the ACS algorithm and argued
that while it is more powerful in modeling local and
sentence structure, it offers the same practicality as
established algorithms.
In our experiments, we trained and tested BLANC
on data from consecutive years, and therefore tai-
lored the metric for two different operating points
in MT system performance. In this paper we show
that BLANC correlates well with human judgments
when trained on previous year data, at both the sentence
and system level.
In the future, we plan to investigate the stability
and performance of BLANC and also apply it to auto-
matic summarization evaluation. We plan to optimize
the BLANC parameters for different criteria in addi-
tion to incorporating syntactic and semantic features
(e.g. ngrams, word classes, part-of-speech).
Looking beyond the BLANC metric, this paper
makes the case for the need to shift to trained, dy-
namic evaluation metrics which can adapt to individ-
ual optimization criteria and correlation functions.
We plan to make available an implementation of
BLANC at http://www.cs.cmu.edu/ llita/blanc.
References
Y. Akiba, K. Imamura, and E. Sumita. 2001. Using
multiple edit distances to automatically rank machine
translation output. MT Summit VIII.
C. Culy and S.Z. Riehemann. 2003. The limits of n-gram
translation evaluation metrics. MT Summit IX.
G. Doddington. 2002. Automatic evaluation of machine
translation quality using n-gram co-occurrence
statistics. Human Language Technology Conference
(HLT).
V.I. Levenshtein. 1965. Binary codes capable of correcting
deletions, insertions, and reversals. Doklady
Akademii Nauk SSSR.
C.Y. Lin and F.J. Och. 2004. Automatic evaluation of
machine translation quality using longest common
subsequence and skip bigram statistics. ACL.
S. Niessen, F.J. Och, G. Leusch, and H. Ney. 2000. An
evaluation tool for machine translation: Fast evaluation
for MT research. LREC.
K. Papineni, S. Roukos, T. Ward, and W.J. Zhu. 2001.
BLEU: a method for automatic evaluation of machine
translation. IBM Research Report.
R. Soricut and E. Brill. 2004. A unified framework for
automatic evaluation using n-gram co-occurrence
statistics. ACL.
K.Y. Su, M.W. Wu, and J.S. Chang. 1992. A new quantitative
quality measure for machine translation systems.
COLING.
J.P. Turian, L. Shen, and I.D. Melamed. 2003. Evaluation
of machine translation and its evaluation. MT Summit
IX.
C.J. Van-Rijsbergen. 1979. Information retrieval.
