Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP 2006), pages 62–69,
Sydney, July 2006. c©2006 Association for Computational Linguistics
The impact of parse quality on syntactically-informed statistical machine
translation
Chris Quirk and Simon Corston-Oliver
Microsoft Research
One Microsoft Way
Redmond, WA 98052 USA
{chrisq,simonco}@microsoft.com
Abstract
We investigate the impact of parse quality
on a syntactically-informed statistical ma-
chine translation system applied to techni-
cal text. We vary parse quality by vary-
ing the amount of data used to train the
parser. As the amount of data increases,
parse quality improves, leading to im-
provements in machine translation output
and results that significantly outperform a
state-of-the-art phrasal baseline.
1 Introduction
The current study is a response to a question
that proponents of syntactically-informed machine
translation frequently encounter: How sensitive is
a syntactically-informed machine translation sys-
tem to the quality of the input syntactic analysis?
It has been shown that phrasal machine translation
systems are not affected by the quality of the in-
put word alignments (Koehn et al., 2003). This
finding has generally been cast in favorable terms:
such systems are robust to poor quality word align-
ment. A less favorable interpretation of these re-
sults might be to conclude that phrasal statistical
machine translation (SMT) systems do not stand
to benefit from improvements in word alignment.
In a similar vein, one might ask whether con-
temporary syntactically-informed machine trans-
lation systems would benefit from improvements
in parse accuracy. One possibility is that cur-
rent syntactically-informed SMT systems are de-
riving only limited value from the syntactic anal-
yses, and would therefore not benefit from im-
proved analyses. Another possibility is that syn-
tactic analysis does indeed contain valuable infor-
mation that could be exploited by machine learn-
ing techniques, but that current parsers are not of
sufficient quality to be of use in SMT.
With these questions and concerns, let us be-
gin. Following some background discussion we
describe a set of experiments intended to elucidate
the impact of parse quality on SMT.
2 Background
We trained statistical machine translation systems
on technical text. In the following sections we
provide background on the data used for training,
the dependency parsing framework used to pro-
duce treelets, the treelet translation framework and
salient characteristics of the target languages.
2.1 Dependency parsing
Dependency analysis is an alternative to con-
stituency analysis (Tesni`ere, 1959; Melˇcuk, 1988).
In a dependency analysis of syntax, words di-
rectly modify other words, with no intervening
non-lexical nodes. We use the terms child node
and parent node to denote the tokens in a depen-
dency relation. Each child has a single parent, with
the lexical root of the sentence dependent on a syn-
thetic ROOT node.
We use the parsing approach described in
(Corston-Oliver et al., 2006). The parser is trained
on dependencies extracted from the English Penn
Treebank version 3.0 (Marcus et al., 1993) by
using the head-percolation rules of (Yamada and
Matsumoto, 2003).
Given a sentence x, the goal of the parser is to
find the highest-scoring parse ˆy among all possible
parses y ∈Y :
ˆy = argmaxy∈Y s(x,y) (1)
The score of a given parse y is the sum of the
62
scores of all its dependency links (i, j)∈ y:
s(x,y)= ∑
(i,j)∈y
d(i, j)= ∑
(i,j)∈y
w·f(i, j) (2)
where the link (i, j) indicates a parent-child de-
pendency between the token at position i and the
token at position j. The score d(i, j) of each de-
pendency link (i, j) is further decomposed as the
weighted sum of its features f(i, j).
The feature vector f(i, j) computed for each
possible parent-child dependency includes the
part-of-speech (POS), lexeme and stem of the par-
ent and child tokens, the POS of tokens adjacent
to the child and parent, and the POS of each to-
ken that intervenes between the parent and child.
Various combinations of these features are used,
for example a new feature is created that combines
the POS of the parent, lexeme of the parent, POS
of the child and lexeme of the child. Each feature
is also conjoined with the direction and distance
of the parent, e.g. does the child precede or follow
the parent, and how many tokens intervene?
To set the weight vector w, we train twenty
averaged perceptrons (Collins, 2002) on different
shuffles of data drawn from sections 02–21 of the
Penn Treebank. The averaged perceptrons are then
combined to form a Bayes Point Machine (Her-
brich et al., 2001; Harrington et al., 2003), result-
ing in a linear classifier that is competitive with
wide margin techniques.
To find the optimal parse given the weight vec-
tor w and feature vector f(i, j) we use the decoder
described in (Eisner, 1996).
2.2 Treelet translation
For syntactically-informed translation, we fol-
low the treelet translation approach described
in (Quirk et al., 2005). In this approach, trans-
lation is guided by treelet translation pairs. Here,
a treelet is a connected subgraph of a dependency
tree. A treelet translation pair consists of a source
treelet S, a target treelet T , and a word alignment
A ⊂ S×T such that for all s ∈ S, there exists a
unique t ∈T such that (s,t)∈A, and if t is the root
of T , there is a unique s ∈ S such that (s,t)∈ A.
Translation of a sentence begins by parsing
that sentence into a dependency representation.
This dependency graph is partitioned into treelets;
like (Koehn et al., 2003), we assume a uniform
probability distribution over all partitions. Each
source treelet is matched to a treelet translation
pair; together, the target language treelets in those
treelet translation pairs will form the target trans-
lation. Next the target language treelets are joined
to form a single tree: the parent of the root of each
treelet is dictated by the source. Let tr be the root
of the target language treelet, and sr be the source
node aligned to it. If sr is the root of the source
sentence, then tr is made the root of the target lan-
guage tree. Otherwise let sp be the parent of sr,
and tp be the target node aligned to sp: tr is at-
tached to tp. Finally the ordering of all the nodes
is determined, and the target tree is specified, and
the target sentence is produced by reading off the
labels of the nodes in order.
Translations are scored according to a log-linear
combination of feature functions, each scoring dif-
ferent aspects of the translation process. We use a
beam search decoder to find the best translation T∗
according to the log-linear combination of models:
T∗ = argmaxT
braceleftBigg
∑
f∈F
λ f f(S,T,A)
bracerightBigg
(3)
The models include inverted and direct channel
models estimated by relative frequency, lexical
weighting channel models following (Vogel et al.,
2003), a trigram target language model using mod-
ified Kneser-Ney smoothing (Goodman, 2001),
an order model following (Quirk et al., 2005),
and word count and phrase count functions. The
weights for these models are determined using the
method described in (Och, 2003).
To estimate the models and extract the treelets,
we begin from a parallel corpus. First the cor-
pus is word-aligned using GIZA++ (Och and Ney,
2000), then the source sentence are parsed, and
finally dependencies are projected onto the target
side following the heuristics described in (Quirk et
al., 2005). This word aligned parallel dependency
tree corpus provides training material for an order
model and a target language tree-based language
model. We also extract treelet translation pairs
from this parallel corpus. To limit the combina-
torial explosion of treelets, we only gather treelets
that contain at most four words and at most two
gaps in the surface string. This limits the number
of mappings to be O(n3) in the worst case, where
n is the number of nodes in the dependency tree.
2.3 Language pairs
In the present paper we focus on English-to-
German and English-to-Japanese machine transla-
63
you can set this property using Visual Basic
Sie können diese Eigenschaft auch mit Visual Basic festlegen
Figure 1: Example German-English and Japanese-English sentence pairs, with word alignments.
tion. Both German and Japanese differ markedly
from English in ways that we believe illumi-
nate well the strengths of a syntactically-informed
SMT system. We provide a brief sketch of the lin-
guistic characteristics of German and Japanese rel-
evant to the present study.
2.3.1 German
Although English and German are closely re-
lated – they both belong to the western branch of
the Germanic family of Indo-European languages
– the languages differ typologically in ways that
are especially problematic for current approaches
to statistical machine translation as we shall now
illustrate. We believe that these typological differ-
ences make English-to-German machine transla-
tion a fertile test bed for syntax-based SMT.
German has richer inflectional morphology than
English, with obligatory marking of case, num-
ber and lexical gender on nominal elements and
person, number, tense and mood on verbal ele-
ments. This morphological complexity, combined
with pervasive, productive noun compounding is
problematic for current approaches to word align-
ment (Corston-Oliver and Gamon, 2004).
Equally problematic for machine translation is
the issue of word order. The position of verbs is
strongly determined by clause type. For exam-
ple, in main clauses in declarative sentences, finite
verbs occur as the second constituent of the sen-
tence, but certain non-finite verb forms occur in fi-
nal position. In Figure 1, for example, the English
“can” aligns with German “k¨onnen” in second po-
sition and “set” aligns with German “festlegen” in
final position.
Aside from verbs, German is usually charac-
terized as a “free word-order” language: major
constituents of the sentence may occur in various
orders, so-called “separable prefixes” may occur
bound to the verb or may detach and occur at a
considerable distance from the verb on which they
depend, and extraposition of various kinds of sub-
ordinate clause is common. In the case of extrapo-
sition, for example, more than one third of relative
clauses in human-translated German technical text
are extraposed. For comparable English text the
figure is considerably less than one percent (Ga-
mon et al., 2002).
2.3.2 Japanese
Word order in Japanese is rather different from
English. English has the canonical constituent or-
der subject-verb-object, whereas Japanese prefers
subject-object-verb order. Prepositional phrases
in English generally correspond to postpositional
phrases in Japanese. Japanese noun phrases are
strictly head-final whereas English noun phrases
allow postmodifiers such as prepositional phrases,
relative clauses and adjectives. Japanese has lit-
tle nominal morphology and does not obligatorily
mark number, gender or definiteness. Verbal mor-
phology in Japanese is complex with morphologi-
cal marking of tense, mood, and politeness. Top-
icalization and subjectless clauses are pervasive,
and problematic for current SMT approaches.
The Japanese sentence in Figure 1 illustrates
several of these typological differences. The
sentence-initial imperative verb “move” in the En-
glish corresponds to a sentence-final verb in the
Japanese. The Japanese translation of the object
noun phrase “the camera slider switch” precedes
the verb in Japanese. The English preposition “to”
aligns to a postposition in Japanese.
3 Experiments
Our goal in the current paper is to measure the
impact of parse quality on syntactically-informed
statistical machine translation. One method for
producing parsers of varying quality might be to
train a parser and then to transform its output, e.g.
64
by replacing the parser’s selection of the parent for
certain tokens with different nodes.
Rather than randomly adding noise to the
parses, we decided to vary the quality in ways that
more closely mimic the situation that confronts us
as we develop machine translation systems. An-
notating data for POS requires considerably less
human time and expertise than annotating syntac-
tic relations. We therefore used an automatic POS
tagger (Toutanova et al., 2003) trained on the com-
plete training section of the Penn Treebank (sec-
tions 02–21). Annotating syntactic dependencies
is time consuming and requires considerable lin-
guistic expertise.1 We can well imagine annotat-
ing syntactic dependencies in order to develop a
machine translation system by annotating first a
small quantity of data, training a parser, training a
system that uses the parses produced by that parser
and assessing the quality of the machine transla-
tion output. Having assessed the quality of the out-
put, one might annotate additional data and train
systems until it appears that the quality of the ma-
chine translation output is no longer improving.
We therefore produced parsers of varying quality
by training on the first n sentences of sections 02–
21 of the Penn Treebank, where n ranged from 250
to 39,892 (the complete training section). At train-
ing time, the gold-standard POS tags were used.
For parser evaluation and for the machine transla-
tion experiments reported here, we used an auto-
matic POS tagger (Toutanova et al., 2003) trained
on sections 02–21 of the Penn Treebank.
We trained English-to-German and English-to-
Japanese treelet translation systems on approxi-
mately 500,000 manually aligned sentence pairs
drawn from technical computer documentation.
The sentence pairs consisted of the English source
sentence and a human-translation of that sentence.
Table 1 summarizes the characteristics of this data.
Note that German vocabulary and singleton counts
are slightly more than double the corresponding
English counts due to complex morphology and
pervasive compounding (see section 2.3.1).
3.1 Parser accuracy
To evaluate the accuracy of the parsers trained on
different samples of sentences we used the tradi-
1Various people have suggested to us that the linguistic
expertise required to annotate syntactic dependencies is less
than the expertise required to apply a formal theory of con-
stituency like the one that informs the Penn Treebank. We
tend to agree, but have not put this claim to the test.
75%
80%
85%
90%
95%
0 10,000 20,000 30,000 40,000
Sample size
De
pe
nd
en
cy
 ac
cu
rac
y.
PTB Section 23
Technical text
Figure 2: Unlabeled dependency accuracy of
parsers trained on different numbers of sentences.
The graph compares accuracy on the blind test sec-
tion of the Penn Treebank to accuracy on a set of
250 sentences drawn from technical text. Punctu-
ation tokens are excluded from the measurement
of dependency accuracy.
tional blind test section of the Penn Treebank (sec-
tion 23). As is well-known in the parsing commu-
nity, parse quality degrades when a parser trained
on the Wall Street Journal text in the Penn Tree-
bank is applied to a different genre or semantic do-
main. Since the technical materials that we were
training the translation system on differ from the
Wall Street Journal in lexicon and syntax, we an-
notated a set of 250 sentences of technical material
to use in evaluating the parser. Each of the authors
independently annotated the same set of 250 sen-
tences. The annotation took less than six hours for
each author to complete. Inter-annotator agree-
ment excluding punctuation was 91.8%. Differ-
ences in annotation were resolved by discussion,
and the resulting set of annotations was used to
evaluate the parsers.
Figure 2 shows the accuracy of parsers trained
on samples of various sizes, excluding punctua-
tion tokens from the evaluation, as is customary
in evaluating dependency parsers. When mea-
sured against section 23 of the Penn Treebank,
the section traditionally used for blind evaluation,
the parsers range in accuracy from 77.8% when
trained on 250 sentences to 90.8% when trained
on all of sections 02–21. As expected, parse accu-
racy degrades when measured on text that differs
greatly from the training text. A parser trained on
250 Penn Treebank sentences has a dependency
65
English German English Japanese
Training Sentences 515,318 500,000
Words 7,292,903 8,112,831 7,909,198 9,379,240
Vocabulary 59,473 134,829 66,731 68,048
Singletons 30,452 66,724 50,381 52,911
Test Sentences 2,000 2,000
Words 28,845 31,996 30,616 45,744
Table 1: Parallel data characteristics
accuracy of 76.6% on the technical text. A parser
trained on the complete Penn Treebank training
section has a dependency accuracy of 84.3% on
the technical text.
Since the parsers make extensive use of lexi-
cal features, it is not surprising that the perfor-
mance on the two corpora should be so similar
with only 250 training sentences; there were not
sufficient instances of each lexical item to train re-
liable weights or lexical features. As the amount
of training data increases, the parsers are able to
learn interesting facts about specific lexical items,
leading to improved accuracy on the Penn Tree-
bank. Many of the lexical items that occur in the
Penn Treebank, however, occur infrequently or not
at all in the technical materials so the lexical infor-
mation is of little benefit. This reflects the mis-
match of content. The Wall Street Journal articles
in the Penn Treebank concern such topics as world
affairs and the policies of the Reagan administra-
tion; these topics are absent in the technical mate-
rials. Conversely, the Wall Street Journal articles
contain no discussion of such topics as the intrica-
cies of SQL database queries.
3.2 Translation quality
Table 2 presents the impact of parse quality on a
treelet translation system, measured using BLEU
(Papineni et al., 2002). Since our main goal is to
investigate the impact of parser accuracy on trans-
lation quality, we have varied the parser training
data, but have held the MT training data, part-of-
speech-tagger, and all other factors constant. We
observe an upward trend in BLEU score as more
training data is made available to the parser; the
trend is even clearer in Japanese.2 As a baseline,
we include right-branching dependency trees, i.e.,
trees in which the parent of each word is its left
2This is particularly encouraging since various people
have remarked to us that syntax-based SMT systems may
be disadvantaged under n-gram scoring techniques such as
BLEU.
EG EJ
Phrasal decoder 31.7±1.2 32.9±0.9
Treelet decoder
Right-branching 31.4±1.3 28.0±0.7
250 sentences 32.8±1.4 34.1±0.9
2,500 sentences 33.0±1.4 34.6±1.0
25,000 sentences 33.7±1.5 35.7±0.9
39,892 sentences 33.6±1.5 36.0±1.0
Table 2: BLEU score vs. decoder and parser vari-
ants. Here sentences refer to the amount of parser
training data, not MT training data.
neighbor and the root of a sentence is the first
word. With this analysis, treelets are simply sub-
sequences of the sentence, and therefore are very
similar to the phrases of Phrasal SMT. In English-
to-German, this result produces results very com-
parable to a phrasal SMT system (Koehn et al.,
2003) trained on the same data. For English-to-
Japanese, however, this baseline performs much
worse than a phrasal SMT system. Although
phrases and treelets should be nearly identical
under this scenario, the decoding constraints are
somewhat different: the treelet decoder assumes
phrasal cohesion during translation. This con-
straint may account for the drop in quality.
Since the confidence intervals for many pairs
overlap, we ran pairwise tests for each system to
determine which differences were significant at
the p < 0.05 level using the bootstrap method de-
scribed in (Zhang and Vogel, 2004); Table 3 sum-
marizes this comparison. Neither language pair
achieves a statistically significant improvement
from increasing the training data from 25,000
pairs to the full training set; this is not surprising
since the increase in parse accuracy is quite small
(90.2% to 90.8% on Wall Street Journal text).
To further understand what differences in de-
pendency analysis were affecting translation qual-
ity, we compared a treelet translation system that
66
Pharaoh Right-branching 250 2,500 25,000 39,892
Pharaoh ∼ > > > >
Right-branching > > > >
250 ∼ > >
2,500 > >
25,000 ∼
(a) English-German
Pharaoh Right-branching 250 2,500 25,000 39,892
Pharaoh < ∼ > > >
Right-branching > > > >
250 > > >
2,500 > >
25,000 ∼
(b) English-Japanese
Table 3: Pairwise statistical significance tests. > indicates that the system on the top is significantly better
than the system on the left; < indicates that the system on top is significantly worse than the system on
the left; ∼ indicates that difference between the two systems is not statistically significant.
32
33
34
35
36
37
100 1000 10000 100000
Parser training sentences
BL
EU
 sc
ore
Japanese
German
Figure 3: BLEU score vs. number of sentences
used to train the dependency parser
used a parser trained on 250 Penn Treebank sen-
tences to a treelet translation system that used
a parser trained on 39,892 Treebank sentences.
From the test data, we selected 250 sentences
where these two parsers produced different anal-
yses. A native speaker of German categorized the
differences in machine translation output as either
improvements or regressions. We then examined
and categorized the differences in the dependency
analyses. Table 4 summarizes the results of this
comparison. Note that this table simply identifies
correlations between parse changes and translation
changes; it does not attempt to identify a causal
link. In the analysis, we borrow the term “NP
[Noun Phrase] identification” from constituency
analysis to describe the identification of depen-
dency treelets spanning complete noun phrases.
There were 141 sentences for which the ma-
chine translated output improved, 71 sentences for
which the output regressed and 38 sentences for
which the output was identical. Improvements in
the attachment of prepositions, adverbs, gerunds
and dependent verbs were common amongst im-
proved translations, but rare amongst regressed
translations. Correct identification of the depen-
dent of a preposition3 was also much more com-
mon amongst improvements.
Certain changes, such as improved root identifi-
cation and final punctuation attachment, were very
common across the corpus. Therefore their com-
mon occurrence amongst regressions is not very
surprising. It was often the case that improve-
ments in root identification or final punctuation at-
tachment were offset by regressions elsewhere in
the same sentence.
Improvements in the parsers are cases where
the syntactic analysis more closely resembles the
analysis of dependency structure that results from
applying Yamada and Matsumoto’s head-finding
rules to the Penn Treebank. Figure 4 shows dif-
ferent parses produced by parsers trained on dif-
3In terms of constituency analysis, a prepositional phrase
should consist of a preposition governing a single noun
phrase
67
Y o u c a n m a n ip u la te M ic r o s o f t A c c e s s o b je c t s fr o m a n o t h e r a p p lic a t io n th a t a ls o s u p p o r t s a u to m a tio n .R O O T
Y o u c a n m a n ip u la te M ic r o s o f t A c c e s s o b je c t s fr o m a n o t h e r a p p lic a t io n th a t a ls o s u p p o r t s a u to m a tio n .R O O T
( a )  D e p e n d e n c y  a n a ly s is  p r o d u c e d  b y  p a r s e r  t r a in e d  o n  2 5 0  W a ll S tr e e t  J o u r n a l s e n t e n c e s .
( b )  D e p e n d e n c y  a n a ly s is  p r o d u c e d  b y  p a r s e r  tr a in e d  o n  3 9 , 8 9 2  W a ll S tr e e t J o u r n a l s e n t e n c e s .
Figure 4: Parses produced by parsers trained on different numbers of sentences.
ferent numbers of sentences. The parser trained
on 250 sentences incorrectly attaches the prepo-
sition “from” as a dependent of the noun “ob-
jects” whereas the parser trained on the complete
Penn Treebank training section correctly attaches
the preposition as a dependent of the verb “ma-
nipulate”. These two parsers also yield different
analyses of the phrase “Microsoft Access objects”.
In parse (a), “objects” governs “Office” and “Of-
fice” in turn governs “Microsoft”. This analy-
sis is linguistically well-motivated, and makes a
treelet spanning “Microsoft Office” available to
the treelet translation system. In parse (b), the
parser has analyzed this phrase so that “objects”
directly governs “Microsoft” and “Office”. The
analysis more closely reflects the flat branching
structure of the Penn Treebank but obscures the
affinity of “Microsoft” and “Office”.
An additional measure of parse utility for MT
is the amount of translation material that can be
extracted from a parallel corpus. We increased the
parser training data from 250 sentences to 39,986
sentences, but held the number of aligned sentence
pairs used train other modules constant. The count
of treelet translation pairs occurring at least twice
in the English-German parallel corpus grew from
1,895,007 to 2,010,451.
4 Conclusions
We return now to the questions and concerns
raised in the introduction. First, is a treelet SMT
system sensitive to parse quality? We have shown
that such a system is sensitive to the quality of
Error category Regress Improve
Attachment of prep 1% 22%
Root identification 13% 28%
Final punctuation 18% 30%
Coordination 6% 16%
Dependent verbs 14% 32%
Arguments of verb 6% 15%
NP identification 24% 33%
Dependent of prep 0% 7%
Other attachment 3% 22%
Table 4: Error analysis, showing percentage of
regressed and improved translations exhibiting a
parse improvement in each specified category
the input syntactic analyses. With the less accu-
rate parsers that result from training on extremely
small numbers of sentences, performance is com-
parable to state-of-the-art phrasal SMT systems.
As the amount of data used to train the parser in-
creases, both English-to-German and English-to-
Japanese treelet SMT improve, and produce re-
sults that are statistically significantly better than
the phrasal baseline.
In the introduction we mentioned the concern
that others have raised when we have presented
our research: syntax might contain valuable infor-
mation but current parsers might not be of suffi-
cient quality. It is certainly true that the accuracy
of the best parser used here falls well short of what
we might hope for. A parser that achieves 90.8%
dependency accuracy when trained on the Penn
Treebank Wall Street Journal corpus and evalu-
68
ated on comparable text degrades to 84.3% accu-
racy when evaluated on technical text. Despite the
degradation in parse accuracy caused by the dra-
matic differences between the Wall Street Journal
text and the technical articles, the treelet SMT sys-
tem was able to extract useful patterns. Research
on syntactically-informed SMT is not impeded by
the accuracy of contemporary parsers.
One significant finding is that as few as 250
sentences suffice to train a dependency parser for
use in the treelet SMT framework. To date our
research has focused on translation from English
to other languages. One concern in applying the
treelet SMT framework to translation from lan-
guages other than English has been the expense
of data annotation: would we require 40,000 sen-
tences annotated for syntactic dependencies, i.e.,
an amount comparable to the Penn Treebank, in
order to train a parser that was sufficiently accu-
rate to achieve the machine translation quality that
we have seen when translating from English? The
current study gives hope that source languages can
be added with relatively modest investments in
data annotation. As more data is annotated with
syntactic dependencies and more accurate parsers
are trained, we would hope to see similar improve-
ments in machine translation output.
We challenge others who are conducting re-
search on syntactically-informed SMT to verify
whether or to what extent their systems are sen-
sitive to parse quality.

References
M. Collins. 2002. Discriminative training meth-
ods for hidden markov models: Theory and exper-
iments with perceptron algorithms. In Proceedings
of EMNLP.
Simon Corston-Oliver and Michael Gamon. 2004.
Normalizing German and English inflectional mor-
phology to improve statistical word alignment. In
R. E. Frederking and K. B. Taylor, editors, Machine
translation: From real users to research. Springer
Verlag.
Simon Corston-Oliver, Anthony Aue, Kevin Duh, and
Eric Ringger. 2006. Multilingual dependency pars-
ing using Bayes Point Machines. In Proceedings of
HLT/NAACL.
Jason M. Eisner. 1996. Three new probabilistic mod-
els for dependency parsing: An exploration. In Pro-
ceedings of COLING, pages 340–345.
Michael Gamon, Eric Ringger, Zhu Zhang, Robert
Moore, and Simon Corston-Oliver. 2002. Extrapo-
sition: A case study in German sentence realization.
In Proceedings of COLING, pages 301–307.
Joshua Goodman. 2001. A bit of progress in lan-
guage modeling, extended version. Technical Re-
port MSR-TR-2001-72, Microsoft Research.
Edward Harrington, Ralf Herbrich, Jyrki Kivinen,
John C. Platt, and Robert C. Williamson. 2003. On-
line bayes point machines. In Proc. 7th Pacific-Asia
Conference on Knowledge Discovery and Data Min-
ing, pages 241–252.
Ralf Herbrich, Thore Graepel, and Colin Campbell.
2001. Bayes Point Machines. Journal of Machine
Learning Research, pages 245–278.
Philipp Koehn, Franz Josef Och, and Daniel Marcu.
2003. Statistical phrase-based translation. In Pro-
ceedings of HLT/NAACL.
M. Marcus, B. Santorini, and M. Marcinkiewicz.
1993. Building a large annotated corpus of en-
glish: The Penn Treebank. Computational Linguis-
tics, 19(2):313–330.
Igor A. Melˇcuk. 1988. Dependency Syntax: Theory
and Practice. State University of New York Press.
Franz Josef Och and Hermann Ney. 2000. Improved
statistical alignment models. In Proceedings of the
ACL, pages 440–447, Hongkong, China, October.
Franz Josef Och. 2003. Minimum error rate training
in statistical machine translation. In Proceedings of
the ACL.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-
Jing Zhu. 2002. Bleu: a method for automatic eval-
uation of machine translation. In Proceedings of the
ACL, pages 311–318, Philadelpha, Pennsylvania.
Chris Quirk, Arul Menezes, and Colin Cherry. 2005.
Dependency treelet translation: Syntactically in-
formed phrasal SMT. In Proceedings of the ACL.
Lucien Tesni`ere. 1959. ´El´ements de syntaxe struc-
turale. Librairie C. Klincksieck.
Kristina Toutanova, Dan Klein, Christopher D. Man-
ning, and Yoram Singer. 2003. Feature-rich part-of-
speech tagging with a cyclic dependency network.
In Proceedings of HLT/EMNLP, pages 252–259.
Stephan Vogel, Ying Zhang, Fei Huang, Alicia Tribble,
Ashish Venugopal, Bing Zhao, and Alex Waibel.
2003. The CMU statistical machine translation sys-
tem. In Proceedings of the MT Summit.
Hiroyasu Yamada and Yuji Matsumoto. 2003. Statis-
tical dependency analysis with support vector ma-
chines. In Proceedings of IWPT, pages 195–206.
Ying Zhang and Stephan Vogel. 2004. Measuring con-
fidence intervals for mt evaluation metrics. In Pro-
ceedings of TMI.
