Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language
Processing (HLT/EMNLP), pages 779–786, Vancouver, October 2005. ©2005 Association for Computational Linguistics
The Hiero Machine Translation System:
Extensions, Evaluation, and Analysis
David Chiang, Adam Lopez, Nitin Madnani, Christof Monz, Philip Resnik, Michael Subotin
Institute for Advanced Computer Studies (UMIACS)
University of Maryland, College Park, MD 20742, USA
{dchiang,alopez,nmadnani,christof,resnik,msubotin}@umiacs.umd.edu
Abstract
Hierarchical organization is a well known prop-
erty of language, and yet the notion of hierarchi-
cal structure has been largely absent from the best
performing machine translation systems in recent
community-wide evaluations. In this paper, we dis-
cuss a new hierarchical phrase-based statistical ma-
chine translation system (Chiang, 2005), present-
ing recent extensions to the original proposal, new
evaluation results in a community-wide evaluation,
and a novel technique for fine-grained comparative
analysis of MT systems.
1 Introduction
Hierarchical organization is a well known property of language,
and yet the notion of hierarchical structure has, for the last
several years, been absent from the best performing machine
translation systems in community-wide evaluations. Statistical
phrase-based models (e.g., Och and Ney, 2004; Koehn et al., 2003;
Marcu and Wong, 2002) characterize a source sentence $f$ as a flat
partition of non-overlapping subsequences, or “phrases”,
$\bar{f}_1 \cdots \bar{f}_J$, and the process of translation
involves selecting target phrases $\bar{e}_i$ corresponding to the
$\bar{f}_j$ and modifying their sequential order. The need for some way
to model aspects of syntactic behavior, such as the
tendency of constituents to move together as a unit,
is widely recognized—the role of syntactic units is
well attested in recent systematic studies of trans-
lation (Fox, 2002; Hwa et al., 2002; Koehn and
Knight, 2003), and their absence in phrase-based
models is quite evident when looking at MT system
output. Nonetheless, attempts to incorporate richer
linguistic features have generally met with little suc-
cess (Och et al., 2004a).
Chiang (2005) introduces Hiero, a hierarchical
phrase-based model for statistical machine transla-
tion. Hiero extends the standard, non-hierarchical
notion of “phrases” to include nonterminal symbols,
which permits it to capture both word-level and
phrase-level reorderings within the same framework.
The model has the formal structure of a synchronous
CFG, but it does not make any commitment to a
linguistically relevant analysis, and it does not re-
quire syntactically annotated training data. Chiang
(2005) reported significant performance improve-
ments in Chinese-English translation as compared
with Pharaoh, a state-of-the-art phrase-based system
(Koehn, 2004).
In Section 2, we review the essential elements
of Hiero. In Section 3 we describe extensions to
this system, including new features involving named
entities and numbers and support for a fourfold
scale-up in training set size. Section 4 presents new
evaluation results for Chinese-English as well as
Arabic-English translation, obtained in the context
of the 2005 NIST MT Eval exercise. In Section 5, we
introduce a novel technique for fine-grained com-
parative analysis of MT systems, which we em-
ploy in analyzing differences between Hiero’s and
Pharaoh’s translations.
2 Hiero
Hiero consists of a stochastic synchronous CFG, whose
productions are extracted automatically from unannotated
parallel texts and whose rule probabilities form a log-linear
model learned by minimum-error-rate training, and a modified
CKY beam-search decoder (similar to that of Wu (1996)). We
describe these components in brief below.
S → 〈S[1] X[2], S[1] X[2]〉
S → 〈X[1], X[1]〉
X → 〈yu X[1] you X[2], have X[2] with X[1]〉
X → 〈X[1] de X[2], the X[2] that X[1]〉
X → 〈X[1] zhiyi, one of X[1]〉
X → 〈Aozhou, Australia〉
X → 〈shi, is〉
X → 〈shaoshu guojia, few countries〉
X → 〈bangjiao, diplomatic relations〉
X → 〈Bei Han, North Korea〉
Figure 1: Example synchronous CFG
2.1 Grammar
A synchronous CFG or syntax-directed transduction
grammar (Lewis and Stearns, 1968) consists of pairs
of CFG rules with aligned nonterminal symbols. We
denote this alignment by coindexation with boxed
numbers (Figure 1). A derivation starts with a pair
of aligned start symbols, and proceeds by rewrit-
ing pairs of aligned nonterminal symbols using the
paired rules (Figure 2).
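The paired rewriting can be illustrated with a small Python sketch. It assumes a hypothetical encoding in which each side of a rule is a list whose items are either terminal words or integer indices standing for aligned nonterminals; the distinction between the symbols S and X is ignored for brevity. It completes the derivation of Figure 2 using the lexical rules of Figure 1, and is an illustration only, not the Hiero decoder.

```python
from typing import Dict, List, Tuple, Union

Token = Union[str, int]                 # a terminal word, or the index of an aligned nonterminal
Rule = Tuple[List[Token], List[Token]]  # (source side, target side)
Pair = Tuple[List[str], List[str]]      # a derived (source words, target words) pair


def realize(rule: Rule, children: Dict[int, Pair]) -> Pair:
    """Substitute already-derived child pairs into both sides of one rule."""
    src_side, tgt_side = rule
    src: List[str] = []
    for tok in src_side:
        src.extend(children[tok][0] if isinstance(tok, int) else [tok])
    tgt: List[str] = []
    for tok in tgt_side:
        tgt.extend(children[tok][1] if isinstance(tok, int) else [tok])
    return src, tgt


def lexical(source: str, target: str) -> Pair:
    """A purely lexical rule such as X -> <Aozhou, Australia>."""
    return source.split(), target.split()


# Rules taken from Figure 1 (S/X labels dropped in this encoding).
GLUE: Rule = ([1, 2], [1, 2])
YU_YOU: Rule = (["yu", 1, "you", 2], ["have", 2, "with", 1])
DE: Rule = ([1, "de", 2], ["the", 2, "that", 1])
ZHIYI: Rule = ([1, "zhiyi"], ["one", "of", 1])

relations = realize(YU_YOU, {1: lexical("Bei Han", "North Korea"),
                             2: lexical("bangjiao", "diplomatic relations")})
clause = realize(DE, {1: relations, 2: lexical("shaoshu guojia", "few countries")})
one_of = realize(ZHIYI, {1: clause})
sentence = realize(GLUE, {1: lexical("Aozhou", "Australia"), 2: lexical("shi", "is")})
sentence = realize(GLUE, {1: sentence, 2: one_of})

source, target = " ".join(sentence[0]), " ".join(sentence[1])
print(source)
print(target)
```

Because the same index selects the same child on both sides, a single rule application reorders arbitrarily long sub-phrases on the target side.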
Training begins with phrase pairs, obtained as by
Och, Koehn, and others: GIZA++ (Och and Ney,
2000) is used to obtain one-to-many word align-
ments in both directions, which are combined into a
single set of refined alignments using the “final-and”
method of Koehn et al. (2003); then those pairs of
substrings that are exclusively aligned to each other
are extracted as phrase pairs.
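The extraction criterion can be sketched as follows. This is a simplified illustration that assumes the symmetrized word alignment is already available as a set of (source index, target index) links; the function name, the `max_len` parameter and the alignment in the example are hypothetical, and the sketch does not extend phrases over unaligned boundary words.

```python
def extract_phrase_pairs(src, tgt, alignment, max_len=10):
    """Return phrase pairs whose words are aligned only to each other."""
    pairs = set()
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to any word of the source span.
            linked = [j for (i, j) in alignment if i1 <= i <= i2]
            if not linked:
                continue
            j1, j2 = min(linked), max(linked)
            if j2 - j1 + 1 > max_len:
                continue
            # Reject the span if a target word inside it links outside the source span.
            if any(j1 <= j <= j2 and not (i1 <= i <= i2) for (i, j) in alignment):
                continue
            pairs.add((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs


src = "yu Bei Han you bangjiao".split()
tgt = "have diplomatic relations with North Korea".split()
# Hypothetical alignment, given as (source index, target index) links.
alignment = {(0, 3), (1, 4), (2, 5), (3, 0), (4, 1), (4, 2)}
pairs = extract_phrase_pairs(src, tgt, alignment)
```

In this example "Bei Han you" yields no phrase pair, because the smallest target span covering its links also contains "with", which is aligned to "yu" outside the source span.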
Then, synchronous CFG rules are constructed out of the initial
phrase pairs by subtraction: every phrase pair
$\langle \bar{f}, \bar{e} \rangle$ becomes a rule
$X \to \langle \bar{f}, \bar{e} \rangle$, and a phrase pair
$\langle \bar{f}, \bar{e} \rangle$ can be subtracted from a rule
$X \to \langle \gamma_1 \bar{f} \gamma_2, \alpha_1 \bar{e} \alpha_2 \rangle$
to form a new rule
$X \to \langle \gamma_1 X_{\boxed{i}} \gamma_2, \alpha_1 X_{\boxed{i}} \alpha_2 \rangle$,
where $i$ is an index not already used. Various filters are also
applied to reduce the number of extracted rules. Since one of these
filters restricts the number of nonterminal symbols to two, our
extracted grammar is equivalent to an inversion transduction
grammar (Wu, 1997).
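A minimal sketch of the subtraction step is given below. It makes simplifying assumptions that are not part of the original system: rules are pairs of token lists, a nonterminal is written as the hypothetical token "X[i]", and a sub-phrase is located by its first string occurrence rather than by its span in the training sentence. The filters mentioned above are omitted.

```python
def find_sub(seq, sub):
    """Index of the first occurrence of list `sub` inside list `seq`, or None."""
    for k in range(len(seq) - len(sub) + 1):
        if seq[k:k + len(sub)] == sub:
            return k
    return None


def subtract(rule, phrase, index):
    """Replace phrase pair (f, e) inside rule (src, tgt) by the nonterminal X[index]."""
    (src, tgt), (f, e) = rule, phrase
    a, b = find_sub(src, f), find_sub(tgt, e)
    if a is None or b is None:
        return None  # the phrase pair does not occur in this rule
    nt = f"X[{index}]"
    return (src[:a] + [nt] + src[a + len(f):],
            tgt[:b] + [nt] + tgt[b + len(e):])


initial = ("yu Bei Han you bangjiao".split(),
           "have diplomatic relations with North Korea".split())
step1 = subtract(initial, ("Bei Han".split(), "North Korea".split()), 1)
step2 = subtract(step1, (["bangjiao"], "diplomatic relations".split()), 2)
print(" ".join(step2[0]), "|||", " ".join(step2[1]))
```

Two subtractions from the initial phrase pair recover the third rule of Figure 1, with the two nonterminals appearing in opposite orders on the two sides.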
2.2 Model
The model is a log-linear model (Och and Ney,
2002) over synchronous CFG derivations. The
weight of a derivation is $P_{LM}(e)^{\lambda_{LM}}$, the weighted
language model probability, multiplied by the product of the
weights of the rules used in the derivation. The weight of each
rule is, in turn:
(1) $w(X \to \langle \gamma, \alpha \rangle) = \prod_i \phi_i(X \to \langle \gamma, \alpha \rangle)^{\lambda_i}$
where the φi are features defined on rules. The ba-
sic model uses the following features, analogous to
Pharaoh’s default feature set:
• $P(\gamma \mid \alpha)$ and $P(\alpha \mid \gamma)$;
• the lexical weights $P_w(\gamma \mid \alpha)$ and $P_w(\alpha \mid \gamma)$
(Koehn et al., 2003);1
• a phrase penalty $\exp(1)$;
• a word penalty $\exp(l)$, where $l$ is the number of
terminals in $\alpha$.
The exceptions to the above are the two “glue”
rules, which are the rules with left-hand side S in
Figure 1. The second has weight one, and the first
has weight $w(S \to \langle S_{\boxed{1}} X_{\boxed{2}}, S_{\boxed{1}} X_{\boxed{2}} \rangle) = \exp(-\lambda_g)$,
the idea being that parameter $\lambda_g$ controls the model’s
preference for hierarchical phrases over serial com-
bination of phrases.
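The weight computation of Equation (1) can be written out in a few lines of Python. The sketch below assumes that feature values and weights are stored in dictionaries keyed by hypothetical feature names; all numbers are invented for illustration and are not taken from the trained model.

```python
import math


def rule_weight(features, lambdas):
    """Equation (1): product over features of phi_i raised to lambda_i."""
    return math.prod(phi ** lambdas[name] for name, phi in features.items())


def glue_weight(lambda_g):
    """Weight of the glue rule S -> <S[1] X[2], S[1] X[2]>."""
    return math.exp(-lambda_g)


def derivation_weight(lm_prob, lambda_lm, rule_weights):
    """Weighted language model probability times the weights of the rules used."""
    return (lm_prob ** lambda_lm) * math.prod(rule_weights)


# Invented feature values for one rule whose target side has two terminals.
features = {
    "p_gamma_given_alpha": 0.5,
    "p_alpha_given_gamma": 0.25,
    "lex_gamma_given_alpha": 0.4,
    "lex_alpha_given_gamma": 0.2,
    "phrase_penalty": math.exp(1),
    "word_penalty": math.exp(2),
}
lambdas = {name: 1.0 for name in features}  # invented weights
w = rule_weight(features, lambdas)
total = derivation_weight(1e-4, 1.0, [w, glue_weight(0.5)])
```

In log space the same rule weight is the dot product of the weight vector with the log feature values, which is the form usually used during decoding and minimum-error-rate training.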
Phrase translation probabilities are estimated by
relative-frequency estimation. Since the extraction
process does not generate a unique derivation for
each training sentence pair, a distribution over pos-
sible derivations is hypothesized, which gives uni-
form weight to all initial phrases extracted from a
sentence pair and uniform weight to all rules formed
out of an initial phrase. This distribution is then used
to estimate the phrase translation probabilities.
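One way to read this estimation procedure is sketched below. The sketch assumes that each initial phrase contributes one unit of count, shared equally among the rules formed from it, and that $P(\gamma \mid \alpha)$ is the resulting fractional count normalized over all source sides seen with the same target side. The data layout and the example rules (including the source word "wei") are hypothetical.

```python
from collections import defaultdict


def estimate_p_gamma_given_alpha(rules_by_initial_phrase):
    """Relative-frequency estimate of P(gamma | alpha) from fractional rule counts."""
    counts = defaultdict(float)
    for rules in rules_by_initial_phrase:
        share = 1.0 / len(rules)  # uniform weight over rules from one initial phrase
        for gamma, alpha in rules:
            counts[(gamma, alpha)] += share
    alpha_totals = defaultdict(float)
    for (gamma, alpha), c in counts.items():
        alpha_totals[alpha] += c
    return {(gamma, alpha): c / alpha_totals[alpha]
            for (gamma, alpha), c in counts.items()}


# Each inner list holds the rules formed out of one initial phrase occurrence.
observed = [
    [("shi", "is")],
    [("shi", "is")],
    [("wei", "is")],
    [("Aozhou shi", "Australia is"), ("X[1] shi", "X[1] is")],
]
p = estimate_p_gamma_given_alpha(observed)
```

The reverse direction $P(\alpha \mid \gamma)$ would be obtained the same way by normalizing over target sides for each source side.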
The lexical-weighting features are estimated us-
ing a method similar to that of Koehn et al. (2003).
The language model is a trigram model with mod-
ified Kneser-Ney smoothing (Chen and Goodman,
1998), trained using the SRI-LM toolkit (Stolcke,
2002).
1This feature uses word alignment information, which is discarded
in the final grammar. If a rule occurs in training with
more than one possible word alignment, Koehn et al. take the
maximum lexical weight; Hiero uses a weighted average.
〈S[1], S[1]〉 ⇒ 〈S[2] X[3], S[2] X[3]〉
⇒ 〈S[4] X[5] X[3], S[4] X[5] X[3]〉
⇒ 〈X[6] X[5] X[3], X[6] X[5] X[3]〉
⇒ 〈Aozhou X[5] X[3], Australia X[5] X[3]〉
⇒ 〈Aozhou shi X[3], Australia is X[3]〉
⇒ 〈Aozhou shi X[7] zhiyi, Australia is one of X[7]〉
⇒ 〈Aozhou shi X[8] de X[9] zhiyi, Australia is one of the X[9] that X[8]〉
⇒ 〈Aozhou shi yu X[1] you X[2] de X[9] zhiyi, Australia is one of the X[9] that have X[2] with X[1]〉
Figure 2: Example partial derivation of a synchronous CFG.
The feature weights are learned by maximizing
the BLEU score (Papineni et al., 2002) on held-out
data, using minimum-error-rate training (Och, 2003)
as implemented by Koehn. The implementation was
slightly modified to ensure that the BLEU scoring
matches NIST’s definition and that hypotheses in
the n-best lists are merged when they have the same
translation and the same feature vector.
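The merging step might look like the following sketch, which assumes an n-best list represented as (translation, feature vector) pairs. That representation is an assumption made for illustration and is not the data format of the training implementation.

```python
def merge_nbest(nbest):
    """Drop hypotheses that repeat an earlier translation with the same feature vector."""
    seen = set()
    merged = []
    for translation, features in nbest:
        key = (translation, tuple(features))
        if key not in seen:
            seen.add(key)
            merged.append((translation, list(features)))
    return merged


nbest = [
    ("the statement said that", [1.0, 2.0]),
    ("the statement said that", [1.0, 2.0]),   # exact duplicate, merged away
    ("said a statement that", [0.5, 2.0]),
    ("the statement said that", [1.0, 3.0]),   # same string, different features, kept
]
merged = merge_nbest(nbest)
```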
3 Extensions
In this section we describe our extensions to the base
Hiero system that improve its performance signif-
icantly. First, we describe the addition of two new
features to the Chinese model, in a manner similar
to that of Och et al. (2004b); then we describe how
we scaled the system up to a much larger training
set.
3.1 New features
The LDC Chinese-English named entity lists (900k
entries) are a potentially valuable resource, but
previous experiments have suggested that simply
adding them to the training data does not help
(Vogel et al., 2003). Instead, we placed them in
a supplementary phrase-translation table, giving greater weight
to phrases that occurred less frequently in the primary training
data. For each entry $\langle f, \{e_1, \ldots, e_n\} \rangle$, we
counted the number of times $c(f)$ that $f$ appeared in the primary
training data, and assigned the entry the weight $\frac{1}{c(f)+1}$,
which was then distributed evenly among the supplementary phrase
pairs $\{\langle f, e_i \rangle\}$. We then created a new
model feature for named entities. When one of these
supplementary phrase pairs was used in transla-
tion, its feature value for the named-entity feature
was the weight defined above, and its value in the
other phrase-translation and lexical-weighting fea-
tures was zero. Since these scores belonged to a sep-
arate feature from the primary translation probabili-
ties, they could be reweighted independently during
minimum-error-rate training.
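The weighting scheme for the supplementary table can be sketched as follows. The entries, counts and function name are invented for illustration; only the formula, a weight of 1/(c(f)+1) per entry shared evenly among its translations, comes from the description above.

```python
def supplementary_table(entries, primary_counts):
    """Map each supplementary phrase pair (f, e) to its named-entity feature value."""
    table = {}
    for f, translations in entries.items():
        entry_weight = 1.0 / (primary_counts.get(f, 0) + 1)
        share = entry_weight / len(translations)  # distributed evenly over e_1..e_n
        for e in translations:
            table[(f, e)] = share
    return table


# Invented named-entity entries and primary-corpus counts c(f).
entries = {"Aozhou": ["Australia"], "Bei Han": ["North Korea", "DPRK"]}
primary_counts = {"Aozhou": 3}
table = supplementary_table(entries, primary_counts)
```

An entry never seen in the primary data gets the full weight of 1, while an entry seen three times gets 1/4, so the list matters most where the primary phrase table has the least evidence.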
Similarly, to process Chinese numbers and dates,
we wrote a rule-based Chinese number/date transla-
tor, and created a new model feature for it. Again,
the weight given to this module was optimized
during minimum-error-rate training. In some cases
we wrote the rules to provide multiple uniformly-
weighted English translations for a Chinese phrase
(for example, 八日 (bari) could become “the 8th” or
“on the 8th”), allowing the language model to decide
between the options.
3.2 Scaling up training
Chiang (2005) reports on experiments in Chinese-
English translation using a model trained on
7.2M+9.2M words of parallel data.2 For the NIST
MT Eval 2005 large training condition, consider-
ably more data than this is allowable. We chose
to use only newswire data, plus data from Sino-
rama, a Taiwanese news magazine.3 This amounts
to almost 30M+30M words. Scaling to this set re-
quired reducing the initial limit on phrase lengths,
previously fixed at 10, to avoid explosive growth of
2Here and below, the notation “X + Y words” denotes X
words of foreign text and Y words of English text.
3From Sinorama, only data from 1991 and later were used,
as articles prior to that were translated quite loosely.
the extracted grammar. However, since longer initial
phrases can be beneficial for translation accuracy,
we adopted a variable length limit: 10 for the FBIS
corpus and other mainland newswire sources, and 7
for the HK News corpus and Sinorama. (During de-
coding, limits of up to 15 were sometimes used; in
principle these limits should all be the same, but in
practice it is preferable to tune them separately.)
For Arabic-English translation, we used the ba-
sic Hiero model, without special features for named
entities or numbers/dates. We again used only the
newswire portions of the allowable training data; we
also excluded the Ummah data, as the translations
were found to be quite loose. Since this amounted
to only about 1.5M+1.5M words, we used a higher
initial phrase limit of 15 during both training and decoding.
4 Evaluation
Table 1 shows the performance of several systems
on NIST MT Eval 2003 Chinese test data: Pharaoh
(2004 version), trained only on the FBIS data; Hi-
ero, with various combinations of the new features
and the larger training data.4 This table also shows
Hiero’s performance on the NIST 2005 MT evalua-
tion task.5 The metric here is case-sensitive BLEU.6
Table 2 shows the performance of two systems
on Arabic in the NIST 2005 MT Evaluation task:
DC, a phrase-based decoder for a model trained by
Pharaoh, and Hiero.
5 Analysis
Over the last few years, several automatic metrics
for machine translation evaluation have been intro-
duced, largely to reduce the human cost of itera-
tive system evaluation during the development cy-
cle (Lin and Och, 2004; Melamed et al., 2003; Pap-
ineni et al., 2002). All are predicated on the concept
4The third line, corresponding to the model without new features
trained on the larger data, may be slightly depressed because
the feature weights from the fourth line were used instead
of doing minimum-error-rate training specially for this model.
5Full results are available at
http://www.nist.gov/speech/tests/summaries/2005/mt05.htm. For this test, a
phrase length limit of 15 was used during decoding.
6For this task, the translation output was uppercased using
the SRI-LM toolkit: essentially, it was decoded again using
an HMM whose states and transitions are a trigram language
model of cased English, and whose emission probabilities are
reversed, i.e., probability of cased word given lowercased word.
System   Features      Train  MT03   MT05
Pharaoh  standard      FBIS   0.268
Hiero    standard      FBIS   0.288
Hiero    standard      full   0.329
Hiero    +nums, names  full   0.339  0.300
Table 1: Chinese results. (BLEU-4; MT03 case-insensitive, MT05 case-sensitive)
System  Train  MT05
DC      full   0.399
Hiero   full   0.450
Table 2: Arabic results. (BLEU-4; MT03 case-insensitive, MT05 scores case-sensitive.)
of n-gram matching between the sentence hypothe-
sized by the translation system and one or more ref-
erence translations—that is, human translations for
the test sentence. Although the motivations and for-
mulae underlying these metrics are all different, ul-
timately they all produce a single number represent-
ing the “goodness” of the MT system output over a
set of reference documents. This facility is valuable
in determining whether a given system modification
has a positive impact on overall translation perfor-
mance. However, the metrics are all holistic. They
provide no insight into the specific competencies or
weaknesses of one system relative to another.
Ideally, we would like to use automatic methods
to provide immediate diagnostic information about
the translation output—what the system does well,
and what it does poorly. At the most general level,
we want to know how our system performs on the
two most basic problems in translation: word translation
and reordering. Unigram precision and recall
statistics tell us something about the performance of
an MT system’s internal translation dictionaries, but
nothing about reordering. It is thought that higher-order
n-grams correlate with the reordering accuracy
of MT systems, but this is again a holistic metric.
What we would really like to know is how well the
system is able to capture systematic reordering pat-
terns in the input, which ones it is successful with,
and which ones it has difficulty with. Word n-grams
are little help here: they are too many, too sparse, and
it is difficult to discern general patterns from them.
5.1 A New Analysis Method
In developing a new analysis method, we are moti-
vated in part by recent studies suggesting that word
reorderings follow general patterns with respect to
syntax, although there remains a high degree of flex-
ibility (Fox, 2002; Hwa et al., 2002). This suggests
that in a comparative analysis of two MT systems, it
may be useful to look for syntactic patterns that one
system captures well in the target language and the
other does not, using a syntax based metric.
We propose to summarize reordering patterns us-
ing part-of-speech sequences. Unfortunately, recent
work has shown that applying statistical parsers to
ungrammatical MT output is unreliable at best, with
the parser often assigning unreasonable probabili-
ties and incongruent structure (Yamada and Knight,
2002; Och et al., 2004a). Anticipating that this
would be equally problematic for part-of-speech
tagging, we make the conservative choice to apply
annotation only to the reference corpus. Word n-
gram correspondences with a reference translation
are used to infer the part-of-speech tags for words in
the system output.
First, we tagged the reference corpus with parts of speech. We
used MXPOST (Ratnaparkhi, 1996), and in order to discover more
general patterns, we mapped the tag set down after tagging, e.g.
NN, NNP, NNPS and NNS all map to NN. Second, we computed the
frequency $\mathrm{freq}(t_i \ldots t_j)$ of every possible tag
sequence $t_i \ldots t_j$ in the reference corpus. Third, we
computed the correspondence between each hypothesis sentence and
each of its corresponding reference sentences using an
approximation to maximum matching (Melamed et al., 2003). This
algorithm provides a list of runs, or contiguous sequences of
words $e_i \ldots e_j$ in the reference that are also present in
the hypothesis. (Note that runs are order-sensitive.) Fourth, for
each recalled n-gram $e_i \ldots e_j$, we looked up the associated
tag sequence $t_i \ldots t_j$ and incremented a counter
$\mathrm{recalled}(t_i \ldots t_j)$. Finally, we computed the
recall of tag patterns,
$R(t_i \ldots t_j) = \mathrm{recalled}(t_i \ldots t_j) / \mathrm{freq}(t_i \ldots t_j)$,
for all patterns in the corpus.
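The procedure can be sketched compactly in Python under several simplifying assumptions. The reference is taken to be already tagged, there is one reference per hypothesis, matching is case-insensitive, and a reference n-gram counts as recalled whenever it occurs contiguously in the hypothesis. That last rule stands in for the maximum-matching approximation of Melamed et al. (2003), which is not reproduced here. The function names, the `COARSE` table and the `max_n` limit are hypothetical.

```python
from collections import Counter

# Coarsening of the tag set after tagging (only the mapping named in the text).
COARSE = {"NNP": "NN", "NNPS": "NN", "NNS": "NN"}


def ngrams(seq, max_n):
    """Yield (start index, n-gram tuple) for every n-gram of length 1..max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(seq) - n + 1):
            yield i, tuple(seq[i:i + n])


def tag_recall(references, hypotheses, max_n=4):
    """references: lists of (word, tag) pairs; hypotheses: lists of words."""
    freq, recalled = Counter(), Counter()
    for ref, hyp in zip(references, hypotheses):
        words = [w.lower() for w, _ in ref]
        tags = [COARSE.get(t, t) for _, t in ref]
        hyp_grams = {g for _, g in ngrams([w.lower() for w in hyp], max_n)}
        for i, gram in ngrams(words, max_n):
            pattern = tuple(tags[i:i + len(gram)])
            freq[pattern] += 1
            if gram in hyp_grams:  # simplified notion of a recalled n-gram
                recalled[pattern] += 1
    return {p: recalled[p] / freq[p] for p in freq}


reference = [[("the", "DT"), ("first", "JJ"), ("half", "NN"),
              ("of", "IN"), ("this", "DT"), ("year", "NN")]]
hiero = tag_recall(reference, ["the first half of this year".split()])
pharaoh = tag_recall(reference, ["the first half of".split()])
```

On example (4) of Section 5.2, this sketch gives the pattern IN DT NN a recall of 1.0 for the Hiero output and 0.0 for the Pharaoh output, while DT JJ NN is recalled by both.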
By examining examples of these tag sequences in
the reference corpus and their hypothesized trans-
lations, we expect to gain some insight into the
comparative strengths and weaknesses of the MT
systems’ reordering models. (An interactive plat-
form for this analysis is demonstrated by Lopez and
Resnik (2005).)
5.2 Chinese
We performed tag sequence analysis on the Hiero
and Pharaoh systems trained on the FBIS data only.
Table 3 shows those n-grams for which Hiero and
Pharaoh’s recall differed significantly (p < 0.01).
The numbers shown are the ratio of Hiero’s recall
to Pharaoh’s. Note that the n-grams on which Hi-
ero had better recall are dominated by fragments of
prepositional phrases (in the Penn Treebank tagset,
prepositions are tagged IN or TO).
Our hypothesis is that Hiero produces English PPs
better because many of them are translated from
Chinese phrases which have an NP modifying an NP
to its right, often connected with the particle 的 (de).
These are often translated into English as PPs, which
modify the NP to the left. A correct translation, then,
would have to reorder the two NPs. Notice in the ta-
ble that Hiero recalls proportionally more n-grams
as n increases, corroborating the intuition that Hiero
should be better at longer-distance reorderings.
Investigating this hypothesis qualitatively, we in-
spected the first five occurrences of the n-grams of
the first type on the list (JJ NN IN DT NN). Of
these, we omit one example because both systems
recalled the n-gram correctly, and one because they
differed only in lexical choice (Hiero matched the
5-gram with one reference sentence, Pharaoh with
zero). The other three examples are shown below (H
= Hiero, P = Pharaoh):
(2) Gloss: UN security council of five permanent member countries-all
R1. five permanent members of the UN Security Council
H. the five permanent members of the un security council
P. the united nations security council permanent members of the five countries
10.00 JJ NN IN DT NN
7.00 IN NN TO
5.50 IN DT NN NN PU NN
5.50 IN DT NN NN PU NN NN
4.50 NN JJ NN PU
4.50 NN IN DT JJ
4.00 VB CD IN DT
3.67 IN DT NN NN PU
3.50 NN IN DT NN NN
3.30 NN IN DT NN
3.14 DT NN IN DT NN
3.00 IN DT NN PU
2.50 NN TO NN
2.03 DT JJ NN IN
1.95 IN NN PU
1.77 IN NN CD
1.74 DT NN IN NN
1.70 JJ NN IN
1.55 VB IN DT
1.46 NN IN NN
1.46 DT NN PU
1.44 IN DT JJ
1.42 NN IN DT
1.41 IN DT NN
1.37 PU CC
1.34 IN CD
1.32 JJ NN PU
1.30 IN NN
1.29 NN IN
1.18 NN PU
1.09 CD
1.07 VB
1.06 NN NN
1.06 IN
1.05 NN
0.61 RB CD
0.21 TO VB PR
0.18 PU RB CD
0.12 NN CD TO NN
0.12 CD TO NN
Table 3: Chinese-English POS n-grams on which Hiero and Pharaoh had significantly different recall, ar-
ranged by recall ratio. Ratio > 1 indicates tag sequences that Hiero matched more frequently.
(3) Gloss: Iraq crisis of most new development
R1. the latest development on the Iraqi crisis
H. the latest development on the Iraqi crisis
P. on the iraqi crisis, the latest development
(4) Gloss: this-year upper half-year
R1. the first half of this year
H. the first half of this year
P. the first half of
All three of these examples involve an NP modifying
an NP to its right; two with the particle 的 (de)
and one without. In all three cases Hiero reorders the
NPs correctly; Pharaoh preserves the Chinese word
order in two cases, but in the third, for reasons not
understood, drops the modifying NP.
The n-grams on which Hiero did worse than
Pharaoh mostly involved numbers; here a pattern is
not as easily discernible, but there are several cases
where Hiero makes errors in translating numbers
(neither system in this comparison used the dedicated
number translator). For the n-gram TO VB PR,
it seems Hiero has a tendency to delete possessive
pronouns (PR, abbreviated from PRP$).
5.3 Arabic
Initial inspection of the n-grams on which Hiero
showed significantly higher recall in the Arabic-
English task suggested that here, too, better trans-
lation of nominal phrases may be at play. We in-
vestigated this conjecture further by examining sev-
eral n-gram sets with the highest recall ratios. Some
of them on closer inspection turned out to conflate
different structural patterns, and provided little in-
terpretable information. However, the 8 sentences
in the n-gram list IN DT JJ JJ showed a degree of
structural consistency. The list contained 6 instances
where Hiero performed better in translating a com-
plex NP or PP, one instance in which DC performed
better in translating a complex PP, and one case in
which they both performed equally poorly. Below
we show two examples of phrases on which Hiero
performed better, and the one example on which its
hierarchical approach produced undesirable results
(H = Hiero, D = DC).
(5) Al  wjwd     Al  EskrY    Al  AmYrkY   fY Al  mnTqp
    the presence the military the American in the region
R1. the American military presence in the region
H. the american military presence in the region
D. the military presence in the region
(6) AltY  tSnEhA            Al  $rkp    Al  kwrYp  Al  jnwbYp
    which manufactures-them the company the Korean the Southern
R1. which are manufactured by the South Korean company
H. which are manufactured by the south korean company
D. which are manufactured by the company , the south korean
8.00 WR DT NN
8.00 PR NN IN DT
7.00 DT PU
6.00 DT NN NN PO
5.00 IN DT JJ JJ
4.67 DT NN IN VB
2.89 NN NN NN VB
2.73 PR VB IN
2.56 NN PU WD VB
2.45 JJ CC JJ NN
2.38 DT JJ JJ NN
2.08 CC JJ NN
2.01 PR VB
2.00 TO DT NN NN
1.80 NN PU WD
1.80 NN IN DT JJ NN
1.77 NN IN DT JJ
1.76 JJ JJ NN
1.74 VB CD
1.68 NN NN VB
1.46 JJ NN NN
1.43 JJ JJ
1.35 IN DT JJ
1.24 VB IN
1.21 NN VB
1.20 NN IN DT
1.17 PR
1.10 JJ NN
1.08 NN NN
1.07 IN DT
1.02 NN
0.47 NN CD PU CD NN NN
0.47 NN CD PU CD NN NN NN
0.47 NN CD PU CD NN NN NN PU
0.45 NN CD PU CD NN
0.29 NN CD NN
0.27 NN CD NN CD
0.09 NN CD NN PU
Table 4: Arabic-English POS n-grams on which Hiero and DC had significantly different recall, arranged by
recall ratio. Ratio > 1 indicates tag sequences that Hiero matched more frequently.
(7) swq    Al  EqArAt      fY Akbr    mdYnp SnAEYp     SYnYp   $AnghAY
    market the real-estate in largest city  industrial Chinese Shanghai
R2. The real estate market in the largest Chinese industrial city , Shanghai
H. chinese real estate market in the largest industrial city shanghai
D. real estate market in the largest chinese industrial city shanghai
In the last example we see that Hiero mistakenly
identified the adjective “Chinese” as modifying the
highest head of the first NP in the apposition.
The style of Arabic newswire tends strongly to-
wards the verb-initial word order in the main clause.
Based on our inspection of the n-gram collection NN
VB, we were also able to note that Hiero performed
noticeably better in reordering the subject and main
verb to produce idiomatic English translations. Al-
though in this set the differences in the recall for the
NN VB bigram were influenced by many different
translation issues, reordering the subject and main
verbs was the only structural pattern that recurred
consistently throughout the set, appearing in 8 of the
29 relevant sentences.
(8) wqAl     Al  bYAn      An
    and-said the statement that
R1. The statement said
H. the statement said that
D. said a statement that
(9) AEln      ms&wl    fY Al  Amm     Al  mtHdp  An
    announced official in the nations the united that
R1. A United Nations official announced that
H. the united nations official announced that
D. an official in the united nations that
Looking at the bottom of the list, we find more
examples of how Hiero’s reordering behavior some-
times backfires. These n-grams seem primarily to be
parts of bylines, where Hiero has a tendency to reformat
the date, whereas DC keeps the original format,
matching more often.
(10) mAnYlA 26 YnAYr
     Manila 26 January
R3. Manila 26 January
H. manila , january 26
D. manila 26 january
6 Conclusions
The work reported in this paper extends the origi-
nal treatment of Hiero (Chiang, 2005) by evaluat-
ing an improved version in a community-wide exer-
cise for Chinese-English and Arabic-English trans-
lation, and by introducing a novel analysis tech-
nique for comparing MT systems’ output. The eval-
uation results provide strong evidence that the ap-
proach gains performance from its hierarchical ex-
tensions to phrase-based translation. The analysis
of part-of-speech tag sequences provides a way to
perform finer-grained comparison of system output,
pinpointing phenomena for which the systems differ
significantly.
Acknowledgements
We would like to thank Philipp Koehn for the use of
the Pharaoh system. This research was supported in
part by ONR MURI Contract FCPO.810548265 and
Department of Defense contract RD-02-5700.
References
Stanley F. Chen and Joshua Goodman. 1998. An empirical
study of smoothing techniques for language modeling.
Technical Report TR-10-98, Harvard University
Center for Research in Computing Technology.
David Chiang. 2005. A hierarchical phrase-based model
for statistical machine translation. In Proceedings of
the 43rd Annual Meeting of the ACL, pages 263–270.
Heidi J. Fox. 2002. Phrasal cohesion and statistical ma-
chine translation. In Proceedings of EMNLP 2002,
pages 304–311.
Rebecca Hwa, Philip Resnik, Amy Weinberg, and Okan
Kolak. 2002. Evaluating translational correspondence
using annotation projection. In Proceedings of the
40th Annual Meeting of the ACL, pages 392–399.
Philipp Koehn and Kevin Knight. 2003. Feature-rich
statistical translation of noun phrases. In Proceedings
of the 41st Annual Meeting of the ACL, pages 311–318.
Philipp Koehn, Franz Josef Och, and Daniel Marcu.
2003. Statistical phrase-based translation. In Proceedings
of HLT-NAACL 2003, pages 127–133.
Philipp Koehn. 2004. Pharaoh: a beam search decoder
for phrase-based statistical machine translation mod-
els. In Proceedings of AMTA 2004, pages 115–124.
P. M. Lewis II and R. E. Stearns. 1968. Syntax-directed
transduction. Journal of the ACM, 15:465–488.
Chin-Yew Lin and Franz Josef Och. 2004. Automatic
evaluation of machine translation quality using longest
common subsequence and skip-bigram statistics. In
Proceedings of the 42nd Annual Meeting of the ACL,
pages 606–613.
Adam Lopez and Philip Resnik. 2005. Pattern visualiza-
tion for machine translation output. In Proceedings of
HLT/EMNLP 2005. Demonstration session.
Daniel Marcu and William Wong. 2002. A phrase-based,
joint probability model for statistical machine transla-
tion. In Proceedings of EMNLP 2002, pages 133–139.
I. Dan Melamed, Ryan Green, and Joseph P. Turian.
2003. Precision and recall of machine translation. In
Proceedings of HLT-NAACL 2003, pages 61–63.
Franz Josef Och and Hermann Ney. 2000. Improved sta-
tistical alignment models. In Proceedings of the 38th
Annual Meeting of the ACL, pages 440–447.
Franz Josef Och and Hermann Ney. 2002. Discrimina-
tive training and maximum entropy models for statis-
tical machine translation. In Proceedings of the 40th
Annual Meeting of the ACL, pages 295–302.
Franz Josef Och and Hermann Ney. 2004. The align-
ment template approach to statistical machine transla-
tion. Computational Linguistics, 30:417–449.
Franz Josef Och, Daniel Gildea, Sanjeev Khudanpur,
Anoop Sarkar, Kenji Yamada, Alex Fraser, Shankar
Kumar, Libin Shen, David Smith, Katherine Eng,
Viren Jain, Zhen Jin, and Dragomir Radev. 2004a. A
smorgasbord of features for statistical machine trans-
lation. In Proceedings of HLT-NAACL 2004.
Franz Josef Och, Ignacio Thayer, Daniel Marcu, Kevin
Knight, Dragos Stefan Munteanu, Quamrul Tipu,
Michel Galley, and Mark Hopkins. 2004b. Arabic and
Chinese MT at USC/ISI. Presentation given at NIST
Machine Translation Evaluation Workshop.
Franz Josef Och. 2003. Minimum error rate training in
statistical machine translation. In Proceedings of the
41st Annual Meeting of the ACL, pages 160–167.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing
Zhu. 2002. BLEU: a method for automatic evaluation
of machine translation. In Proceedings of the 40th
Annual Meeting of the ACL, pages 311–318.
Adwait Ratnaparkhi. 1996. A maximum-entropy model
for part-of-speech tagging. In Proceedings of EMNLP,
pages 133–142.
Andreas Stolcke. 2002. SRILM – an extensible lan-
guage modeling toolkit. In Proceedings of the Inter-
national Conference on Spoken Language Processing,
volume 2, pages 901–904.
Stephan Vogel, Ying Zhang, Fei Huang, Alicia Trib-
ble, Ashish Venugopal, Bing Zhao, and Alex Waibel.
2003. The CMU statistical machine translation sys-
tem. In Proceedings of MT-Summit IX, pages 402–
409.
Dekai Wu. 1996. A polynomial-time algorithm for statistical
machine translation. In Proceedings of the 34th
Annual Meeting of the ACL, pages 152–158.
Dekai Wu. 1997. Stochastic inversion transduction
grammars and bilingual parsing of parallel corpora.
Computational Linguistics, 23:377–404.
Kenji Yamada and Kevin Knight. 2002. A decoder for
syntax-based statistical MT. In Proceedings of the
40th Annual Meeting of the ACL, pages 303–310.
