Reranking Translation Hypotheses Using Structural Properties
Saša Hasan, Oliver Bender, Hermann Ney
Chair of Computer Science VI
RWTH Aachen University
D-52056 Aachen, Germany
{hasan,bender,ney}@cs.rwth-aachen.de
Abstract
We investigate methods that add syntac-
tically motivated features to a statistical
machine translation system in a reranking
framework. The goal is to analyze whether
shallow parsing techniques help in iden-
tifying ungrammatical hypotheses. We
show that improvements are possible by
utilizing supertagging, lightweight depen-
dency analysis, a link grammar parser and
a maximum-entropy based chunk parser.
Adding features to n-best lists and dis-
criminatively training the system on a de-
velopment set increases the BLEU score
up to 0.7% on the test set.
1 Introduction
Statistically driven machine translation systems
are currently the dominant type of system in the
MT community. Though much better than tradi-
tional rule-based approaches, these systems still
make a lot of errors that seem, at least from a hu-
man point of view, illogical.
The main purpose of this paper is to investigate
a means of identifying ungrammatical hypotheses
from the output of a machine translation system
by using grammatical knowledge that expresses
syntactic dependencies of words or word groups.
We introduce several methods that try to establish
this kind of linkage between the words of a hy-
pothesis and, thus, determine its well-formedness,
or “fluency”. We perform rescoring experiments
that rerank n-best lists according to the presented
framework.
To assess the well-formedness of a sentence, we use
supertagging (Bangalore and Joshi, 1999) with lightweight
dependency analysis (LDA)1 (Bangalore, 2000), link grammars
(Sleator and Temperley, 1993) and a maximum-entropy (ME)
based chunk parser (Bender et al., 2003). The first two
approaches explicitly model the syntactic dependencies
between words.
Each hypothesis that contains irregularities, such
as broken linkages or non-satisfied dependencies,
should be penalized or rejected accordingly. For
the ME chunker, the idea is to train n-gram mod-
els on the chunk or POS sequences and directly
use the log-probability as feature score.
In general, these concepts and the underlying
programs should be robust and fast in order to be
able to cope with large amounts of data (as is the
case for n-best lists). The experiments presented
show a small though consistent improvement in
terms of the automatic evaluation measures used.
BLEU score improvements, for instance, lie in the
range from 0.3 to 0.7% on the test set.
In the following, Section 2 gives an overview
on related work in this domain. In Section 3
we review our general approach to statistical ma-
chine translation (SMT) and introduce the main
methodologies used for deriving syntactic depen-
dencies on words or word groups, namely su-
pertagging/LDA, link grammars and ME chunk-
ing. The corpora and the experiments are dis-
cussed in Section 4. The paper is concluded in
Section 5.
2 Related work
In (Och et al., 2004), the effects of integrating
syntactic structure into a state-of-the-art statistical
machine translation system are investigated. The
approach is similar to the approach presented here:
first, a word graph is generated using the baseline
SMT system and n-best lists are extracted accordingly;
then additional feature functions representing
syntactic knowledge are added and the corresponding
scaling factors are trained discriminatively on a
development n-best list.
1 In the context of this work, the term LDA is not to be
confused with linear discriminant analysis.
Och and colleagues investigated a large number
of different feature functions. These range from
simple syntactic features, such as the IBM model 1
score, through shallow parsing techniques to more
complex methods using grammars and intricate
parsing procedures. The results were
rather disappointing. Only one of the simplest
models, i.e. the implicit syntactic feature derived
from IBM model 1 score, yielded consistent and
significant improvements. All other methods had
only a very small effect on the overall perfor-
mance.
3 Framework
In the following sections, the theoretical frame-
work of statistical machine translation using a di-
rect approach is reviewed. We introduce the su-
pertagging and lightweight dependency analysis
approach, link grammars and maximum-entropy
based chunking technique.
3.1 Direct approach to SMT
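In statistical machine translation, the best translation
$\hat{e}_1^{\hat{I}} = \hat{e}_1 \ldots \hat{e}_i \ldots \hat{e}_{\hat{I}}$
of source words $f_1^J = f_1 \ldots f_j \ldots f_J$ is obtained
by maximizing the conditional probability

$$
\begin{aligned}
\hat{e}_1^{\hat{I}} &= \operatorname*{argmax}_{I, e_1^I} \left\{ Pr(e_1^I \mid f_1^J) \right\} \\
&= \operatorname*{argmax}_{I, e_1^I} \left\{ Pr(f_1^J \mid e_1^I) \cdot Pr(e_1^I) \right\}
\end{aligned}
\tag{1}
$$
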
using Bayes decision rule. The first probability
on the right-hand side of the equation denotes the
translation model whereas the second is the target
language model.
An alternative to this classical source-channel
approach is the direct modeling of the posterior
probability $Pr(e_1^I \mid f_1^J)$, which is utilized here.
Using a log-linear model (Och and Ney, 2002), we
obtain

$$
Pr(e_1^I \mid f_1^J) =
\frac{\exp\left( \sum_{m=1}^{M} \lambda_m h_m(e_1^I, f_1^J) \right)}
     {\sum_{{e'}_1^{I'}} \exp\left( \sum_{m=1}^{M} \lambda_m h_m({e'}_1^{I'}, f_1^J) \right)},
\tag{2}
$$

where $\lambda_m$ are the scaling factors of the models
denoted by feature functions $h_m(\cdot)$. The denominator
represents a normalization factor that depends
only on the source sentence $f_1^J$. Therefore, we can
omit it during the search process, leading to the
following decision rule:

$$
\hat{e}_1^{\hat{I}} = \operatorname*{argmax}_{I, e_1^I}
\left\{ \sum_{m=1}^{M} \lambda_m h_m(e_1^I, f_1^J) \right\}
\tag{3}
$$

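Applied to an n-best list, the decision rule of Eq. 3 reduces to a weighted sum per hypothesis followed by an argmax. The following is a minimal sketch of that step, not the system's actual implementation; the feature names (`tm`, `lm`, `linkage`), their values and the weights are invented for illustration.

```python
from typing import Dict, List, Tuple

# One n-best entry: the hypothesis string and its feature scores h_m.
Entry = Tuple[str, Dict[str, float]]


def rerank(nbest: List[Entry], weights: Dict[str, float]) -> str:
    """Return the hypothesis that maximizes sum_m lambda_m * h_m (Eq. 3)."""

    def score(features: Dict[str, float]) -> float:
        # Features without a scaling factor contribute nothing.
        return sum(weights.get(name, 0.0) * value
                   for name, value in features.items())

    best_hyp, _ = max(nbest, key=lambda entry: score(entry[1]))
    return best_hyp


# Toy n-best list; all names and numbers below are hypothetical.
nbest = [
    ("any messages for me ?",
     {"tm": -4.0, "lm": -6.0, "linkage": 0.0}),
    ("do you have any messages for me ?",
     {"tm": -4.5, "lm": -6.2, "linkage": 1.0}),
]
weights = {"tm": 1.0, "lm": 1.0, "linkage": 1.5}
best = rerank(nbest, weights)
```

With these toy numbers the second hypothesis scores -9.2 against -10.0 for the first, so the added syntactic feature changes the ranking; setting its weight to zero restores the baseline order.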
This approach is a generalization of the source-channel
approach. It has the advantage that additional
models $h(\cdot)$ can be easily integrated into
the overall system. The model scaling factors
$\lambda_1^M$ are trained according to the maximum
entropy principle, e.g., using the GIS algorithm.
Alternatively, one can train them with respect to
the final translation quality measured by an error
criterion (Och, 2003). For the results reported
in this paper, we optimized the scaling factors
with respect to a linear interpolation of word error
rate (WER), position-independent word error rate
(PER), BLEU and NIST score using the Downhill
Simplex algorithm (Press et al., 2002).
3.2 Supertagging/LDA
Supertagging (Bangalore and Joshi, 1999) uses the
Lexicalized Tree Adjoining Grammar formalism
(LTAG) (XTAG Research Group, 2001). Tree Adjoining
Grammars incorporate a tree-rewriting formalism
using elementary trees that can be combined
by two operations, namely substitution and
adjunction, to derive more complex tree structures
of the sentence considered. Lexicalization allows
us to associate each elementary tree with a lexical
item called the anchor. In LTAGs, every elementary
tree has such a lexical anchor, also called head
word. It is possible that there is more than one
elementary structure associated with a lexical item,
as is the case, e.g., for verbs with different
subcategorization frames.
The elementary structures, called initial and
auxiliary trees, hold all dependent elements within
the same structure, thus imposing constraints on
the lexical anchors in a local context. Basically,
supertagging is very similar to part-of-speech tag-
ging. Instead of POS tags, richer descriptions,
namely the elementary structures of LTAGs, are
annotated to the words of a sentence. For this purpose,
they are called supertags in order to distinguish
them from ordinary POS tags. The result
is an “almost parse” because of the dependencies
coded within the supertags. Usually, a lexical item
can have many supertags, depending on the various
contexts it appears in. Therefore, the local ambiguity
is larger than for the case of POS tags. An
LTAG parser for this scenario can be very slow, i.e.
its computational complexity is in $O(n^6)$, because
of the large number of supertags, i.e. elementary
trees, that have to be examined during a parse. In
order to speed up the parsing process, we can apply
n-gram models on a supertag basis in order to
filter out incompatible descriptions and thus improve
the performance of the parser. In (Bangalore
and Joshi, 1999), a trigram supertagger with
smoothing and back-off is reported that achieves
an accuracy of 92.2% when trained on one million
running words.
[Figure 1 (derivation tree; nodes: was[α2], food[α1], delicious[α3], the[β1], very[β2])]
Figure 1: LDA: example of a derivation tree, β
nodes are the result of the adjunction operation on
auxiliary trees, α nodes of substitution on initial
trees.
There is another aspect to the dependencies
coded in the elementary structures. We can use
them to actually derive a shallow parse of the sen-
tence in linear time. The procedure is presented
in (Bangalore, 2000) and is called lightweight de-
pendency analysis. The concept is comparable to
chunking. The lightweight dependency analyzer
(LDA) finds the arguments for the encoded depen-
dency requirements. There exist two types of slots
that can be filled. On the one hand, nodes marked
for substitution (in α-trees) have to be filled by the
complements of the lexical anchor. On the other
hand, the foot nodes (i.e. nodes marked for adjunction
in β-trees) take words that are being modified
by the supertag. Figure 1 shows a tree derived by
LDA on the sentence the food was very delicious
from the C-Star’03 corpus (cf. Section 4.1).
The supertagging and LDA tools are available
from the XTAG research group website.2
As features for the reranking experiments, we choose:
• Supertagger output: directly use the log-likelihoods
  as feature score. This did not improve performance
  significantly, so the model was discarded from the
  final system.
• LDA output:
  – dependency coverage: determine the
    number of covered elements, i.e. where
    the dependency slots are filled to the left
    and right
  – separate features for the number of modifiers
    and complements determined by
    the LDA
2 http://www.cis.upenn.edu/~xtag/
[Figure 2 (linkage diagram over the sentence “the food was very delicious” with links labeled D, S, P and EA)]
Figure 2: Link grammar: example of a valid linkage
satisfying all constraints.
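The LDA features listed above can be sketched as simple counts over the dependency slots of a hypothesis. The `Slot` representation below is a hypothetical stand-in for the LDA tool's output, whose actual format is not described here, and counting a word without slots as covered is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Slot:
    """Hypothetical view of one dependency requirement of a supertag."""
    kind: str               # "complement" (substitution) or "modifier" (adjunction)
    filler: Optional[int]   # index of the word filling the slot, None if unfilled


def lda_features(slots_per_word: List[List[Slot]]) -> Dict[str, int]:
    """Count covered words, filled modifier slots and filled complement slots."""
    filled = [s for slots in slots_per_word for s in slots
              if s.filler is not None]
    return {
        # A word counts as covered if none of its slots is left unfilled
        # (assumption: words without any slot are trivially covered).
        "coverage": sum(all(s.filler is not None for s in slots)
                        for slots in slots_per_word),
        "modifiers": sum(s.kind == "modifier" for s in filled),
        "complements": sum(s.kind == "complement" for s in filled),
    }


# "the food was very delicious", following the structure of Figure 1.
example = [
    [Slot("modifier", 1)],                            # the  -> food
    [],                                               # food
    [Slot("complement", 1), Slot("complement", 4)],   # was  -> food, delicious
    [Slot("modifier", 4)],                            # very -> delicious
    [],                                               # delicious
]
features = lda_features(example)
```

Each of the three counts would enter the log-linear model as a separate feature with its own scaling factor.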
3.3 Link grammar
Similar to the ideas presented in the previous sec-
tion, link grammars also explicitly code depen-
dencies between words (Sleator and Temperley,
1993). These dependencies are called links which
reflect the local requirements of each word. Sev-
eral constraints have to be satisfied within the link
grammar formalism to derive correct linkages, i.e.
sets of links, of a sequence of words:
1. Planarity: links are not allowed to cross each
other
2. Connectivity: links suffice to connect all
words of a sentence
3. Satisfaction: linking requirements of each
word are satisfied
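The first two constraints are purely structural and can be checked on a set of links alone. The sketch below assumes a linkage is given as pairs of word positions; it does not call the link grammar parser, and it leaves out the satisfaction constraint because that requires the grammar dictionary.

```python
from typing import Dict, List, Set, Tuple

Link = Tuple[int, int]  # a link between two word positions


def is_planar(links: List[Link]) -> bool:
    """Planarity: no two links cross when drawn above the sentence."""
    spans = [tuple(sorted(link)) for link in links]
    for a, b in spans:
        for c, d in spans:
            if a < c < b < d:   # (c, d) starts inside (a, b) and ends outside
                return False
    return True


def is_connected(n_words: int, links: List[Link]) -> bool:
    """Connectivity: the links join all words into one component."""
    if n_words == 0:
        return True
    neighbours: Dict[int, Set[int]] = {i: set() for i in range(n_words)}
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, stack = {0}, [0]
    while stack:                 # depth-first traversal from the first word
        node = stack.pop()
        for nxt in neighbours[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen) == n_words


# "the food was very delicious": D(the, food), S(food, was),
# P(was, delicious), EA(very, delicious), as in Figure 2.
links = [(0, 1), (1, 2), (2, 4), (3, 4)]
valid = is_planar(links) and is_connected(5, links)
```

For the example linkage both checks succeed, whereas two links such as (0, 2) and (1, 3) would violate planarity.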
An example of a valid linkage is shown in Fig-
ure 2. The link grammar parser that we use is
freely available from the authors’ website.3 Sim-
ilar to LTAG, the link grammar formalism is lex-
icalized which allows for enhancing the methods
with probabilistic n-gram models (as is also the
case for supertagging). In (Lafferty et al., 1992),
the link grammar is used to derive a new class of
language models that, in comparison to traditional
n-gram LMs, incorporate capabilities for expressing
long-range dependencies between words.
3 http://www.link.cs.cmu.edu/link/
[NP the food ] [VP was] [ADJP very delicious]
the/DT food/NN was/VBD very/RB delicious/JJ
Figure 3: Chunking and POS tagging: a tag next
to the opening bracket denotes the type of chunk,
whereas the corresponding POS tag is given after
the word.
The link grammar dictionary that specifies the
words and their corresponding valid links currently
holds approximately 60000 entries and handles
a wide variety of phenomena in English. It is
derived from newspaper texts.
Within our reranking framework, we use link
grammar features that express a possible well-formedness
of the translation hypothesis. The simplest
feature is a binary one stating whether the
link grammar parser could derive a complete link-
age or not, which should be a strong indicator of
a syntactically correct sentence. Additionally, we
added a normalized cost of the matching process
which turned out not to be very helpful for rescor-
ing, so it was discarded.
3.4 ME chunking
Like the methods described in the two preced-
ing sections, text chunking consists of dividing a
text into syntactically correlated non-overlapping
groups of words. Figure 3 shows again our ex-
ample sentence illustrating this task. Chunks are
represented as groups of words between square
brackets. We employ the 11 chunk types as defined
for the CoNLL-2000 shared task (Tjong Kim
Sang and Buchholz, 2000).
For the experiments, we apply a maximum-
entropy based tagger which has been successfully
evaluated on natural language understanding and
named entity recognition (Bender et al., 2003).
Within this tool, we directly factorize the poste-
rior probability and determine the corresponding
chunk tag for each word of an input sequence. We
assume that the decisions depend only on a limited
window $e_{i-2}^{i+2} = e_{i-2} \ldots e_{i+2}$ around the current
word $e_i$ and on the two predecessor chunk tags
$c_{i-2}^{i-1}$. In addition, part-of-speech (POS) tags $g_1^I$
are assigned and incorporated into the model (cf.
Figure 3). Thus, we obtain the following second-order
model:

$$
\begin{aligned}
Pr(c_1^I \mid e_1^I, g_1^I)
&= \prod_{i=1}^{I} Pr(c_i \mid c_1^{i-1}, e_1^I, g_1^I) \qquad (4) \\
&= \prod_{i=1}^{I} p(c_i \mid c_{i-2}^{i-1}, e_{i-2}^{i+2}, g_{i-2}^{i+2}), \qquad (5)
\end{aligned}
$$

where the step from Eq. 4 to 5 reflects our model
assumptions.
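The conditioning context of Eq. 5 can be made concrete with a small helper that collects, for position i, the five-word window, the matching POS window and the two preceding chunk tags. This is only an illustrative sketch: the padding symbol and the IOB-style chunk tags are assumptions, and the function does not reproduce the tagger's actual feature functions.

```python
from typing import Dict, List, Sequence, Tuple


def context(words: Sequence[str], pos_tags: Sequence[str],
            chunk_history: Sequence[str], i: int,
            pad: str = "<pad>") -> Dict[str, Tuple[str, ...]]:
    """Collect the context that p(c_i | ...) in Eq. 5 conditions on.

    chunk_history holds the chunk tags already assigned to positions 0..i-1.
    """

    def window(seq: Sequence[str], lo: int, hi: int) -> Tuple[str, ...]:
        # Positions outside the sequence are replaced by a padding symbol.
        return tuple(seq[k] if 0 <= k < len(seq) else pad
                     for k in range(lo, hi + 1))

    return {
        "words": window(words, i - 2, i + 2),               # e_{i-2}^{i+2}
        "pos": window(pos_tags, i - 2, i + 2),              # g_{i-2}^{i+2}
        "prev_chunks": window(chunk_history, i - 2, i - 1), # c_{i-2}^{i-1}
    }


words = ["the", "food", "was", "very", "delicious"]
pos_tags = ["DT", "NN", "VBD", "RB", "JJ"]
# Context for "was" after the first two words received (assumed) IOB tags.
ctx = context(words, pos_tags, ["B-NP", "I-NP"], 2)
```

A tagger would evaluate its binary feature functions on such a context for every candidate chunk tag and pick the sequence with the highest product of local probabilities.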
Furthermore, we have implemented a set of bi-
nary valued feature functions for our system, in-
cluding lexical, word and transition features, prior
features, and compound features, cf. (Bender et
al., 2003). We run simple count-based feature
reduction and train the model parameters using
the Generalized Iterative Scaling (GIS) algorithm
(Darroch and Ratcliff, 1972). In practice, the
training procedure tends to result in an overfitted
model. To avoid this, a smoothing method is ap-
plied where a Gaussian prior on the parameters is
assumed (Chen and Rosenfeld, 1999).
Within our reranking framework, we first use
the ME based tagger to produce the POS and
chunk sequences for the different n-best list hy-
potheses. Given several n-gram models trained on
the WSJ corpus for both POS and chunk models,
we then rescore the n-best hypotheses and simply
use the log-probabilities as additional features. In
order to adapt our system to the characteristics of
the data used, we build POS and chunk n-gram
models on the training corpus part. These domain-
specific models are also added to the n-best lists.
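The rescoring step described above amounts to scoring each hypothesis' tag sequence with an n-gram model and using the log-probability as a feature. The sketch below trains a count-based model on toy chunk sequences; the add-one smoothing and the sentence boundary symbols are assumptions made for the example, since the smoothing actually used is not specified here.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Model = Tuple[Counter, Counter, int]


def pad(seq: Sequence[str], n: int) -> List[str]:
    """Add sentence boundary symbols so every tag has a full history."""
    return ["<s>"] * (n - 1) + list(seq) + ["</s>"]


def train_ngram(sequences: List[Sequence[str]], n: int) -> Model:
    """Collect n-gram and history counts from tag sequences."""
    ngrams, histories, vocab = Counter(), Counter(), set()
    for seq in sequences:
        padded = pad(seq, n)
        vocab.update(padded)
        for i in range(n - 1, len(padded)):
            history = tuple(padded[i - n + 1:i])
            ngrams[history + (padded[i],)] += 1
            histories[history] += 1
    return ngrams, histories, len(vocab)


def log_prob(seq: Sequence[str], model: Model, n: int) -> float:
    """Log-probability of a tag sequence, with add-one smoothing (assumed)."""
    ngrams, histories, vocab_size = model
    padded = pad(seq, n)
    total = 0.0
    for i in range(n - 1, len(padded)):
        history = tuple(padded[i - n + 1:i])
        count = ngrams[history + (padded[i],)]
        total += math.log((count + 1) / (histories[history] + vocab_size))
    return total


# Toy training data: chunk sequences of two well-formed sentences.
model = train_ngram([["NP", "VP", "ADJP"], ["NP", "VP", "NP"]], n=2)
good = log_prob(["NP", "VP", "ADJP"], model, n=2)
odd = log_prob(["ADJP", "ADJP", "NP"], model, n=2)
```

The same code applies unchanged to POS sequences; one model trained on out-of-domain data and one on the in-domain training corpus would then give two separate feature scores per hypothesis.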
The ME chunking approach does not model ex-
plicit syntactic linkages of words. Instead, it in-
corporates a statistical framework to exploit valid
and syntactically coherent groups of words by ad-
ditionally looking at the word classes.
4 Experiments
For the experiments, we use the translation sys-
tem described in (Zens et al., 2005). Our phrase-
based decoder uses several models during search
that are interpolated in a log-linear way (as ex-
pressed in Eq. 3), such as phrase-based translation
models, word-based lexicon models, a language,
deletion and simple reordering model and word
and phrase penalties. A word graph containing
the most likely translation hypotheses is generated
during the search process. Out of this compact
representation, we extract n-best lists as described
in (Zens and Ney, 2005). These n-best lists serve
as a starting point for our experiments. The methods
presented in Section 3 produce scores that are
used as additional features for the n-best lists.

                                         Supplied Data Track
                                    Arabic  Chinese Japanese  English
Train      Sentences                           20000
           Running Words            180075   176199   198453   189927
           Vocabulary                15371     8687     9277     6870
           Singletons                 8319     4006     4431     2888
C-Star’03  Sentences                            506
           Running Words              3552     3630     4130     3823
           OOVs (Running Words)        133      114       61       65
IWSLT’04   Sentences                            500
           Running Words              3597     3681     4131     3837
           OOVs (Running Words)        142       83       71       58
Table 1: Corpus statistics after preprocessing.
4.1 Corpora
The experiments are carried out on a subset
of the Basic Travel Expression Corpus (BTEC)
(Takezawa et al., 2002), as it is used for the supplied
data track condition of the IWSLT evaluation
campaign. BTEC is a multilingual speech corpus
which contains tourism-related sentences similar
to those that are found in phrase books. For the
supplied data track, the training corpus contains
20000 sentences. Two test sets, C-Star’03 and
IWSLT’04, are available for the language pairs
Arabic-English, Chinese-English and Japanese-
English.
The corpus statistics are shown in Table 1. The
average source sentence length is between seven
and eight words for all languages. So the task is
rather limited and very domain-specific. The ad-
vantage is that many different reranking experi-
ments with varying feature function settings can
be carried out easily and quickly in order to ana-
lyze the effects of the different models.
In the following, we use the C-Star’03 set for
development and tuning of the system’s parame-
ters. After that, the IWSLT’04 set is used as a
blind test set in order to measure the performance
of the models.
4.2 Rescoring experiments
The use of n-best lists in machine translation has
several advantages. It alleviates the effects of the
huge search space which is represented in word
graphs by using a compact excerpt of the n best
hypotheses generated by the system. Especially
for limited domain tasks, the size of the n-best list
can be rather small but still yield good oracle error
rates. Empirically, n-best lists should have an
appropriate size such that the oracle error rate, i.e.
the error rate of the best hypothesis with respect to
an error measure (such as WER or PER), is approximately
half the baseline error rate of the system.
N-best lists are suitable for easily applying several
rescoring techniques since the hypotheses are already
fully generated. In comparison, word graph
rescoring techniques need specialized tools which
can traverse the graph accordingly. Since a node
within a word graph allows for many histories, one
can only apply local rescoring techniques, whereas
for n-best lists, techniques can be used that consider
properties of the whole sentence.
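The oracle error rate used to choose the n-best list size can be illustrated for WER on a single sentence. The sketch assumes whitespace tokenization and one reference translation per sentence; a corpus-level figure would sum the errors and reference lengths over all sentences, and the mWER reported in the result tables additionally uses multiple references.

```python
from typing import List, Sequence


def edit_distance(hyp: Sequence[str], ref: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(prev[j] + 1,               # extra word in hypothesis
                           cur[j - 1] + 1,            # reference word missing
                           prev[j - 1] + (h != r)))   # match or substitution
        prev = cur
    return prev[-1]


def wer(hyp: str, ref: str) -> float:
    """Word error rate of one hypothesis against one reference."""
    ref_words = ref.split()
    return edit_distance(hyp.split(), ref_words) / len(ref_words)


def oracle_wer(nbest: List[str], ref: str) -> float:
    """Error rate of the best hypothesis contained in the n-best list."""
    return min(wer(hyp, ref) for hyp in nbest)


nbest = ["any messages for me ?", "do you have any messages for me ?"]
ref = "do you have any messages for me ?"
single_best = wer(nbest[0], ref)   # error rate of the first-ranked hypothesis
oracle = oracle_wer(nbest, ref)    # lower bound reachable by reranking
```

Comparing the two values shows the headroom a reranker has on a given list: here the first-ranked hypothesis misses three of eight reference words, while the list contains an error-free candidate.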
For the Chinese-English and Arabic-English
task, we set the n-best list size to n = 1500. For
Japanese-English, n = 1000 produces oracle er-
ror rates that are deemed to be sufficiently low,
namely 17.7% and 14.8% for WER and PER, re-
spectively. The single-best output for Japanese-
English has a word error rate of 33.3% and
position-independent word error rate of 25.9%.
For the experiments, we add additional fea-
tures to the initial models of our decoder that have
shown to be particularly useful in the past, such as
IBM model 1 score, a clustered language model
score and a word penalty that prevents the hypotheses
from becoming too short. A detailed definition
of these additional features is given in (Zens
et al., 2005). Thus, the baseline we start with is
already a very strong one. The log-linear interpolation
weights $\lambda_m$ from Eq. 3 are directly optimized
using the Downhill Simplex algorithm on a
linear combination of WER (word error rate), PER
(position-independent word error rate), NIST and
BLEU score.

Chinese → English, C-Star’03              NIST  BLEU[%]  mWER[%]  mPER[%]
Baseline                                  8.17     46.2     48.6     41.4
with supertagging/LDA                     8.29     46.5     48.4     41.0
with link grammar                         8.43     45.6     47.9     41.1
with supertagging/LDA + link grammar      8.22     47.5     47.7     40.8
with ME chunker                           8.65     47.3     47.4     40.4
with all models                           8.42     47.0     47.4     40.5

Chinese → English, IWSLT’04               NIST  BLEU[%]  mWER[%]  mPER[%]
Baseline                                  8.67     45.5     49.1     39.8
with supertagging/LDA                     8.68     45.4     49.8     40.3
with link grammar                         8.81     45.0     49.0     40.2
with supertagging/LDA + link grammar      8.56     46.0     49.1     40.6
with ME chunker                           9.00     44.6     49.3     40.6
with all models                           8.89     46.2     48.1     39.6
Table 2: Effect of successively adding syntactic features to the Chinese-English n-best list for C-Star’03
(development set) and IWSLT’04 (test set).

BASE  Any messages for me?
RESC  Do you have any messages for me?
REFE  Do you have any messages for me?

BASE  She, not yet?
RESC  She has not come yet?
REFE  Lenny, she has not come in?

BASE  How much is it to the?
RESC  How much is it to the local call?
REFE  How much is it to the city centre?

BASE  This blot or.
RESC  This is not clean.
REFE  This still is not clean.
Table 3: Translation examples for the Chinese-English
test set (IWSLT’04): baseline system
(BASE) vs. rescored hypotheses (RESC) and reference
translation (REFE).
In Table 2, we show the effect of adding the
presented features successively to the baseline.
Separate entries for experiments using supertag-
ging/LDA and link grammars show that a combi-
nation of these syntactic approaches always yields
some gain in translation quality (regarding BLEU
score). The performance of the maximum-entropy
based chunking is comparable. A combination of
all three models still yields a small improvement.
Table 3 shows some examples for the Chinese-English
test set. The rescored translations are syntactically
coherent, though semantic correctness
cannot be guaranteed. On the test data, we achieve
an overall improvement of 0.7%, 0.5% and 0.3%
in BLEU score for Chinese-English, Japanese-
English and Arabic-English, respectively (cf. Ta-
bles 4 and 5).
4.3 Discussion
From the tables, it can be seen that the use of
syntactically motivated feature functions within
a reranking concept helps to slightly reduce the
number of translation errors of the overall trans-
lation system. Although the improvement on the
IWSLT’04 set is only moderate, the results are
nevertheless comparable to or better than those of
(Och et al., 2004), where, starting from an IBM
model 1 baseline, an additional improvement of
only 0.4% BLEU was achieved using more complex
methods.
For the maximum-entropy based chunking ap-
proach, n-grams with n = 4 work best for the
chunker that is trained on WSJ data. The domain-
specific rescoring model which results from the
chunker being trained on the BTEC corpora turns
out to prefer higher order n-grams, with n = 6 or
more. This might be an indicator of the domain-
specific rescoring model successfully capturing
more local context.
The training of the other models, i.e. supertagging/LDA
and link grammar, is also performed on
out-of-domain data. Thus, further improvements
should be possible if the models were adapted to
the BTEC domain. This would require the preparation
of an annotated corpus for the supertagger
and a specialized link grammar, which are both
time-consuming tasks.

Japanese → English, C-Star’03             NIST  BLEU[%]  mWER[%]  mPER[%]
Baseline                                  9.09     57.8     31.3     25.0
with supertagging/LDA                     9.13     57.8     31.3     24.8
with link grammar                         9.46     57.6     31.9     25.3
with supertagging/LDA + link grammar      9.24     58.2     31.0     24.8
with ME chunker                           9.31     58.7     30.9     24.4
with all models                           9.21     58.9     30.5     24.3

Japanese → English, IWSLT’04              NIST  BLEU[%]  mWER[%]  mPER[%]
Baseline                                  9.22     54.7     34.1     25.5
with supertagging/LDA                     9.27     54.8     34.2     25.6
with link grammar                         9.37     54.9     34.3     25.9
with supertagging/LDA + link grammar      9.30     55.0     34.0     25.6
with ME chunker                           9.27     55.0     34.2     25.5
with all models                           9.27     55.2     33.9     25.5
Table 4: Effect of successively adding syntactic features to the Japanese-English n-best list for C-Star’03
(development set) and IWSLT’04 (test set).

Arabic → English, C-Star’03               NIST  BLEU[%]  mWER[%]  mPER[%]
Baseline                                 10.18     64.3     23.9     20.6
with supertagging/LDA                    10.13     64.6     23.4     20.1
with link grammar                        10.06     64.7     23.4     20.3
with supertagging/LDA + link grammar     10.20     65.0     23.2     20.2
with ME chunker                          10.11     65.1     23.0     19.9
with all models                          10.23     65.2     23.0     19.9

Arabic → English, IWSLT’04                NIST  BLEU[%]  mWER[%]  mPER[%]
Baseline                                  9.75     59.8     26.1     21.9
with supertagging/LDA                     9.77     60.5     25.6     21.5
with link grammar                         9.74     60.5     25.9     21.7
with supertagging/LDA + link grammar      9.86     60.8     26.0     21.6
with ME chunker                           9.71     59.9     25.9     21.8
with all models                           9.84     60.1     26.4     21.9
Table 5: Effect of successively adding syntactic features to the Arabic-English n-best list for C-Star’03
(development set) and IWSLT’04 (test set).
The syntactically motivated methods (supertag-
ging/LDA and link grammars) perform similarly
to the maximum-entropy based chunker. It seems
that both approaches successfully exploit structural
properties of language. However, one outlier
is ME chunking on the Chinese-English test data,
where we observe a lower BLEU but a larger NIST
score. For Arabic-English, the combination of all
methods does not seem to generalize well on the
test set. In that case, the combination of
supertagging/LDA and link grammar outperforms the
ME chunker: the overall improvement is 1% absolute
in terms of BLEU score.
5 Conclusion
We added syntactically motivated features to a sta-
tistical machine translation system in a rerank-
ing framework. The goal was to analyze whether
shallow parsing techniques help in identifying un-
grammatical hypotheses. We showed that some
improvements are possible by utilizing supertag-
ging, lightweight dependency analysis, a link
grammar parser and a maximum-entropy based
chunk parser. Adding features to n-best lists and
discriminatively training the system on a develop-
ment set helped to gain up to 0.7% in BLEU score
on the test set.
Future work could include developing an
adapted LTAG for the BTEC domain or incor-
porating n-gram models into the link grammar
concept in order to derive a long-range language
model (Lafferty et al., 1992). However, we feel
that the current improvements are not significant
enough to justify these efforts. Additionally, we
will apply these reranking methods to larger cor-
pora in order to study the effects on longer sen-
tences from more complex domains.
Acknowledgments
This work has been partly funded by the
European Union under the integrated project
TC-Star (Technology and Corpora for Speech
to Speech Translation, IST-2002-FP6-506738,
http://www.tc-star.org), and by the R&D project
TRAMES managed by Bertin Technologies as
prime contractor and operated by the French DGA
(Délégation Générale pour l’Armement).

References
Srinivas Bangalore and Aravind K. Joshi. 1999. Su-
pertagging: An approach to almost parsing. Com-
putational Linguistics, 25(2):237–265.
Srinivas Bangalore. 2000. A lightweight dependency
analyzer for partial parsing. Computational Linguistics,
6(2):113–138.
Oliver Bender, Klaus Macherey, Franz Josef Och, and
Hermann Ney. 2003. Comparison of alignment
templates and maximum entropy models for natural
language understanding. In EACL03: 10th Conf. of
the Europ. Chapter of the Association for Computa-
tional Linguistics, pages 11–18, Budapest, Hungary,
April.
Stanley F. Chen and Ronald Rosenfeld. 1999. A Gaussian
prior for smoothing maximum entropy models.
Technical Report CMU-CS-99-108, Carnegie Mellon
University, Pittsburgh, PA.
J. N. Darroch and D. Ratcliff. 1972. Generalized iter-
ative scaling for log-linear models. Annals of Math-
ematical Statistics, 43:1470–1480.
John Lafferty, Daniel Sleator, and Davy Temperley.
1992. Grammatical trigrams: A probabilistic model
of link grammar. In Proc. of the AAAI Fall Sympo-
sium on Probabilistic Approaches to Natural Lan-
guage, pages 89–97, Cambridge, MA.
Franz Josef Och and Hermann Ney. 2002. Discrimina-
tive training and maximum entropy models for sta-
tistical machine translation. In Proc. of the 40th An-
nual Meeting of the Association for Computational
Linguistics (ACL), pages 295–302, Philadelphia, PA,
July.
Franz Josef Och, Daniel Gildea, Sanjeev Khudanpur,
Anoop Sarkar, Kenji Yamada, Alex Fraser, Shankar
Kumar, Libin Shen, David Smith, Katherine Eng,
Viren Jain, Zhen Jin, and Dragomir Radev. 2004.
A smorgasbord of features for statistical machine
translation. In Proc. 2004 Meeting of the North
American chapter of the Association for Compu-
tational Linguistics (HLT-NAACL), pages 161–168,
Boston, MA.
Franz Josef Och. 2003. Minimum error rate training
in statistical machine translation. In Proc. of the
41st Annual Meeting of the Association for Computational
Linguistics (ACL), pages 160–167, Sapporo,
Japan, July.
William H. Press, Saul A. Teukolsky, William T. Vet-
terling, and Brian P. Flannery. 2002. Numerical
Recipes in C++. Cambridge University Press, Cam-
bridge, UK.
Daniel Sleator and Davy Temperley. 1993. Parsing
English with a link grammar. In Third International
Workshop on Parsing Technologies, Tilburg/Durbuy,
The Netherlands/Belgium, August.
Toshiyuki Takezawa, Eiichiro Sumita, F. Sugaya,
H. Yamamoto, and S. Yamamoto. 2002. Toward
a broad-coverage bilingual corpus for speech trans-
lation of travel conversations in the real world. In
Proc. of the Third Int. Conf. on Language Resources
and Evaluation (LREC), pages 147–152, Las Pal-
mas, Spain, May.
Erik F. Tjong Kim Sang and Sabine Buchholz.
2000. Introduction to the CoNLL-2000 shared
task: Chunking. In Proceedings of CoNLL-2000
and LLL-2000, pages 127–132, Lisbon, Portugal,
September.
XTAG Research Group. 2001. A Lexicalized Tree
Adjoining Grammar for English. Technical Report
IRCS-01-03, IRCS, University of Pennsylvania,
Philadelphia, PA, USA.
Richard Zens and Hermann Ney. 2005. Word graphs
for statistical machine translation. In 43rd Annual
Meeting of the Assoc. for Computational Linguis-
tics: Proc. Workshop on Building and Using Par-
allel Texts: Data-Driven Machine Translation and
Beyond, pages 191–198, Ann Arbor, MI, June.
Richard Zens, Oliver Bender, Saša Hasan, Shahram
Khadivi, Evgeny Matusov, Jia Xu, Yuqi Zhang, and
Hermann Ney. 2005. The RWTH phrase-based
statistical machine translation system. In Proceed-
ings of the International Workshop on Spoken Lan-
guage Translation (IWSLT), pages 155–162, Pitts-
burgh, PA, October.
