Proceedings of the 43rd Annual Meeting of the ACL, pages 549–556,
Ann Arbor, June 2005. ©2005 Association for Computational Linguistics
Context-dependent SMT Model using Bilingual Verb-Noun Collocation
Young-Sook Hwang
ATR SLT Research Labs
2-2-2 Hikaridai Seika-cho
Soraku-gun Kyoto, 619-0288, JAPAN
youngsook.hwang@atr.jp
Yutaka Sasaki
ATR SLT Research Labs
2-2-2 Hikaridai Seika-cho
Soraku-gun Kyoto, 619-0288, JAPAN
yutaka.sasaki@atr.jp
Abstract
In this paper, we propose a new context-
dependent SMT model that is tightly cou-
pled with a language model. It is de-
signed to decrease the translation ambi-
guities and efficiently search for an opti-
mal hypothesis by reducing the hypothe-
sis search space. It works through recipro-
cal incorporation between source and tar-
get context: a source word is determined
by the context of previous and correspond-
ing target words and the next target word
is predicted by the pair consisting of the
previous target word and its correspond-
ing source word. In order to alleviate
the data sparseness in chunk-based trans-
lation, we take a stepwise back-off trans-
lation strategy. Moreover, in order to ob-
tain more semantically plausible transla-
tion results, we use bilingual verb-noun
collocations; these are automatically ex-
tracted by using chunk alignment and a
monolingual dependency parser. As a case
study, we experimented on the language
pair of Japanese and Korean. As a result,
we could not only reduce the search space
but also improve the performance.
1 Introduction
For decades, many research efforts have contributed
to the advance of statistical machine translation.
Recently, various works have improved the quality
of statistical machine translation systems by using
phrase translation (Koehn et al., 2003; Marcu and Wong,
2002; Och et al., 1999; Och and Ney, 2000; Zens
and Ney, 2004). Most of the phrase-based translation
models have adopted the noisy-channel-based IBM-style
models (Brown et al., 1993):
$$\hat{e}_1^I = \operatorname*{argmax}_{e_1^I} \Pr(f_1^J \mid e_1^I)\,\Pr(e_1^I) \tag{1}$$
In these models, we have two types of knowledge:
the translation model, $\Pr(f_1^J \mid e_1^I)$, and the language model,
$\Pr(e_1^I)$. The translation model links the source language
sentence to the target language sentence. The
language model describes the well-formedness of
the target language sentence and might play a role
in restricting hypothesis expansion during decoding.
To recover the word order difference between two
languages, the IBM-style model also allows reordering to be
modeled by introducing a relative distortion probability
distribution. However, in spite of using such a language
model and a distortion model, the translation outputs
may not be fluent or may even be nonsense.
To make things worse, the hypothesis search
space is far too large for an exhaustive search. If
arbitrary reorderings are allowed, the search problem
is NP-complete (Knight, 1999). According
to a previous analysis (Koehn, 2004) of how
many hypotheses are generated during an exhaustive
search using the IBM models, the upper bound for
the number of states is estimated by $N \simeq 2^J |V_e|^2 J$,
where $J$ is the number of source words and $|V_e|$ is
the size of the target vocabulary. Even though the
number of possible translations of the last two words
is much smaller than $|V_e|^2$, further improvement
is still needed. The main concern is the exponential
explosion from the possible configurations
of source words covered by a hypothesis. In order
to reduce the number of possible configurations of
source words, decoding algorithms based on A* as
well as the beam search algorithm have been proposed
(Koehn, 2004; Och et al., 2001). Both used heuristics for
pruning implausible hypotheses.
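To give a feel for how quickly this bound grows, the following minimal sketch evaluates $N \simeq 2^J |V_e|^2 J$ for a few sentence lengths. The function name `state_upper_bound` is a hypothetical helper introduced here, and the vocabulary size 15,726 is simply the Korean training vocabulary size reported in Table 2, used as an illustrative value of $|V_e|$.

```python
def state_upper_bound(num_source_words: int, target_vocab_size: int) -> int:
    """Upper bound N ~ 2^J * |V_e|^2 * J on the number of search states."""
    J = num_source_words
    return (2 ** J) * (target_vocab_size ** 2) * J


if __name__ == "__main__":
    # Illustrative only: |V_e| = 15,726 is the Korean vocabulary size in Table 2.
    for J in (5, 10, 20):
        print(J, state_upper_bound(J, 15_726))
```

The $2^J$ factor, which counts the possible coverage configurations of the source words, dominates as $J$ grows; this is the term that the approach described next tries to reduce.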
Our approach to this problem examines the pos-
sibility of utilizing context information in a given
language pair. Under a given target context, the cor-
responding source word of a given target word is al-
most deterministic. Conversely, if a translation pair
is given, then the related target or source context is
predictable. This implies that if we consider bilingual
context information in a given language pair
during decoding, we can reduce the computational
complexity of the hypothesis search; specifically, we
can reduce the possible configurations of source
words as well as the number of possible target translations.
In this study, we present a statistical machine
translation model as an alternative to the classical
IBM-style model. This model is tightly coupled
with a target language model and utilizes bilingual
context information. It is designed not only to reduce
the hypothesis search space by decreasing the
translation ambiguities but also to improve translation
performance. It works through reciprocal incorporation
between source and target context: source
words are determined by the context of previous
and corresponding target words, and the next target
words are predicted by the current translation pair.
Accordingly, unlike IBM-style models, we do not need
a separate distortion model or language model.
Under this framework, we propose a chunk-based
translation model for more grammatical, fluent and
accurate output. In order to alleviate the data sparse-
ness problem in chunk-based translation, we use a
stepwise back-off method in the order of a chunk,
sub-parts of the chunk, and word level. Moreover,
to deal with long-distance dependency, we utilize
verb-noun collocations that are automatically
extracted by using chunk alignment and a monolingual
dependency parser.
As a case study, we developed a Japanese-to-
Korean translation model and performed some ex-
periments on the BTEC corpus.
2 Overview of Translation Model
The goal of machine translation is to transfer the
meaning of a source language sentence, $f_1^J = f_1 \ldots f_J$,
into a target language sentence, $e_1^I = e_1 \ldots e_I$.
In most types of statistical machine translation,
the conditional probability $\Pr(e_1^I \mid f_1^J)$ is used to
describe the correspondence between two sentences.
This model is used directly for translation by solving
the following maximization problem:
\begin{align}
\hat{e}_1^I &= \operatorname*{argmax}_{e_1^I} \Pr(e_1^I \mid f_1^J) \tag{2} \\
&= \operatorname*{argmax}_{e_1^I} \frac{\Pr(e_1^I, f_1^J)}{\Pr(f_1^J)} \tag{3} \\
&= \operatorname*{argmax}_{e_1^I} \Pr(e_1^I, f_1^J) \tag{4}
\end{align}
Since a source language sentence is given and the
probability $\Pr(f_1^J)$ is the same for all possible
corresponding target sentences, we can ignore the
denominator in equation (3). As a result, the joint probability
model can be used to describe the correspondence
between two sentences. We apply the Markov
chain rule to the joint probability model and obtain
the following decomposed model:
$$\Pr(e_1^I, f_1^J) \simeq \prod_{i=1}^{I} \Pr(f_{a_i} \mid e_i, e_{i-1})\,\Pr(e_i \mid e_{i-1}, f_{a_{i-1}}) \tag{5}$$
where $a_i$ is the index of the source word that is
aligned to the word $e_i$ under the assumption of a
fixed one-to-one alignment. In this model, we have
two probabilities:
• source word prediction probability under a
given target language context, $\Pr(f_{a_i} \mid e_{i-1}, e_i)$
• target word prediction probability under the
preceding translation pair, $\Pr(e_i \mid e_{i-1}, f_{a_{i-1}})$
The probability of target word prediction is used for
selecting the target word that follows the previous
target words. In order to make this more determin-
istic, we use bilingual context, i.e. the translation
pair of the preceding target word. For a given target
word, the corresponding source word is predicted by
source word prediction probability based on the cur-
rent and preceding target words.
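The decomposition in Eq. (5) can be sketched in a few lines of code. The snippet below is only an illustration under stated assumptions: the function name `joint_log_prob`, the dictionary-based probability tables `p_src` and `p_tgt`, and the sentence-start marker `bos` are all hypothetical, and the alignment is assumed to be fixed and one-to-one as in the text.

```python
import math


def joint_log_prob(targets, sources, alignment, p_src, p_tgt, bos="<s>"):
    """Log of Eq. (5) for a fixed one-to-one alignment.

    targets:   target words e_1 .. e_I
    sources:   source words f_1 .. f_J
    alignment: alignment[i] is the index a_i (into sources) of the word aligned to e_i
    p_src:     p_src[(f, e_i, e_prev)]      ~ Pr(f_{a_i} | e_i, e_{i-1})
    p_tgt:     p_tgt[(e_i, e_prev, f_prev)] ~ Pr(e_i | e_{i-1}, f_{a_{i-1}})
    """
    log_p = 0.0
    e_prev, f_prev = bos, bos  # assumed start-of-sentence context
    for i, e in enumerate(targets):
        f = sources[alignment[i]]
        log_p += math.log(p_src[(f, e, e_prev)])      # source word prediction
        log_p += math.log(p_tgt[(e, e_prev, f_prev)])  # target word prediction
        e_prev, f_prev = e, f
    return log_p
```

Because the product runs over target positions in order, the target word order is fixed by the hypothesis itself, and the source positions are visited in whatever order the alignment dictates.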
Since a target and a source word are predicted
through reciprocal incorporation between source
and target context from the beginning of a target
sentence, the word order in the target sentence is
automatically determined and the number of pos-
sible configurations of source words is decreased.
Thus, we do not need to perform any computation
for word re-ordering. Moreover, since correspon-
dences are provided based on bilingual contextual
evidence, translation ambiguities can be decreased.
As a result, the proposed model is expected to re-
duce computational complexity during the decoding
as well as improve performance.
Furthermore, since a word-based translation ap-
proach is often incapable of handling complicated
expressions such as idiomatic expressions or
complicated verb phrases, it often outputs nonsense
translations. To avoid nonsense translations and to
increase explanatory power, we incorporate struc-
tural aspects of the language into the chunk-based
translation model. In our model, one source chunk
is translated by exactly one target chunk, i.e., one-
to-one chunk alignment. Thus we obtain:
$$\tilde{e}_1^K = \operatorname*{argmax}_{\tilde{e}_1^K} \Pr(\tilde{e}_1^K, \tilde{f}_1^K) \tag{6}$$
$$\Pr(\tilde{e}_1^K, \tilde{f}_1^K) \simeq \prod_{i=1}^{K} \Pr(\tilde{f}_{a_i} \mid \tilde{e}_i, \tilde{e}_{i-1})\,\Pr(\tilde{e}_i \mid \tilde{e}_{i-1}, \tilde{f}_{a_{i-1}}) \tag{7}$$
where $K$ is the number of chunks in a source and a
target sentence.
3 Chunk-based J/K Translation Model
with Back-Off
With the translation framework described above, we
built a chunk-based J/K translation model as a case
study. Since a chunk-based translation model causes
severe data sparseness, it is often impossible to ob-
tain any translation of a given source chunk. In order
to alleviate this problem, we apply back-off translation
models while taking linguistic characteristics
into consideration.
Japanese and Korean are a very close language pair.
Both are agglutinative and inflected languages in the
word formation of a bunsetsu and an eojeol. A bunsetsu/eojeol
consists of two sub-parts: the head part
composed of content words and the tail part com-
posed of functional words agglutinated at the end of
the head part. The head part is related to the mean-
ing of a given segment, while the tail part indicates
a grammatical role of the head in a given sentence.
By putting this linguistic knowledge to practical
use, we build a head-tail based translation model
as a back-off version of the chunk-based translation
model. We place several constraints on this head-tail
based translation model as follows:
• The head of a given source chunk corresponds
to the head of a target chunk. The tail of the
source chunk corresponds to the tail of a target
chunk. If a chunk does not have a tail part, we
assign NUL to the tail of the chunk.
• The head of a given chunk follows the tail of the
preceding chunk and the tail follows the head of
the given chunk.
The constraints are designed to maintain the struc-
tural consistency of a chunk. Under these con-
straints, the head-tail based translation can be for-
mulated as the following equation:
\begin{align}
&\Pr(\tilde{f}_{a_i} \mid \tilde{e}_i, \tilde{e}_{i-1})\,\Pr(\tilde{e}_i \mid \tilde{e}_{i-1}, \tilde{f}_{a_{i-1}}) = \tag{8} \\
&\quad \Pr(\tilde{f}^{h}_{a_i} \mid \tilde{e}^{h}_i, \tilde{e}^{t}_{i-1})\,\Pr(\tilde{e}^{h}_i \mid \tilde{e}^{t}_{i-1}, \tilde{f}^{t}_{a_{i-1}}) \notag \\
&\quad \Pr(\tilde{f}^{t}_{a_i} \mid \tilde{e}^{t}_i, \tilde{e}^{h}_i)\,\Pr(\tilde{e}^{t}_i \mid \tilde{e}^{h}_i, \tilde{f}^{h}_{a_i}) \notag
\end{align}
where $\tilde{e}^{h}_i$ denotes the head of the $i^{th}$
chunk and $\tilde{e}^{t}_i$ means the tail of the chunk.
In the worst case, even the head-tail based model
may fail to obtain translations. In this case, we
back it off into a word-based translation model. In
the word-based translation model, the constraints
on the head-tail based translation model are not ap-
plied. The concept of the chunk-based J/K transla-
tion framework with back-off scheme can be sum-
marized as follows:
1. Input a dependency-parsed sentence at the
chunk level,
2. Apply the chunk-based translation model to the
given sentence,
3. If one of the chunks does not have any corresponding
translation:
   • divide the failed chunk into a head and a
   tail part,
   • back off the translation into the head-tail
   based translation model,
   • if the head or tail does not have any corresponding
   translation, apply a word-based
   translation model to the chunk.
Here, the back-off model is applied only to the part
that failed to get translation candidates.
Figure 1: An example of (a) chunk alignment for chunk-based, head-tail based translation and (b) bilingual
verb-noun collocation by using the chunk alignment and a monolingual dependency parser
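The stepwise back-off scheme can be sketched as a lookup that tries progressively smaller units. This is a minimal illustration under assumptions, not the system's actual implementation: the function `translate_chunk`, the four translation tables, and the representation of heads, tails and translations as tuples of words are all hypothetical, and the word-level step simply concatenates word translations in source order.

```python
from itertools import product

NUL = ()  # empty tail, used when a chunk has no tail part


def translate_chunk(head, tail, chunk_table, head_table, tail_table, word_table):
    """Return (level, candidates) for one source chunk given as word tuples."""
    whole = head + tail
    # Step 1: whole-chunk translation.
    if whole in chunk_table:
        return "chunk", list(chunk_table[whole])
    # Step 2: head-tail back-off; head maps to head, tail maps to tail.
    heads = head_table.get(head, [])
    tails = tail_table.get(tail, []) if tail else [NUL]
    if heads and tails:
        return "head-tail", [h + t for h, t in product(heads, tails)]
    # Step 3: word-level back-off for the whole chunk.
    per_word = [word_table.get(w, []) for w in whole]
    if all(per_word):
        return "word", [sum(combo, ()) for combo in product(*per_word)]
    return "none", []
```

A decoder would call such a lookup once per source chunk and fall through to the smaller units only for chunks that fail at the larger one.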
3.1 Learning Chunk-based Translation
We learn chunk alignments from a corpus that has
been word-aligned by a training toolkit for word-based
translation models: the GIZA++ (Och and
Ney, 2000) toolkit for the IBM models (Brown
et al., 1993). For aligning chunk pairs, we consider
word (bunsetsu/eojeol) sequences to be chunks
if they are in an immediate dependency relationship
in a dependency tree. To identify chunks, we use
a word-aligned corpus, in which source language
sentences are annotated with dependency parse trees
by a dependency parser (Kudo and Matsumoto, 2002) and
target language sentences are annotated with POS tags
by a part-of-speech tagger (Rim, 2003). If a sequence
of target words is aligned with the words in
a single source chunk, the target word sequence is
regarded as one chunk corresponding to the given
source chunk. By applying this method to the cor-
pus, we obtain a word- and chunk-aligned corpus
(see Figure 1).
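The projection of source chunks onto the target side can be illustrated with a small sketch. The function `project_chunks` and its input format are assumptions made here for illustration; in particular, requiring the aligned target words to be contiguous and linked only to words inside the chunk is one possible reading of the rule above, not a detail stated in the text.

```python
def project_chunks(source_chunks, links):
    """Project source chunks onto the target side through word alignment links.

    source_chunks: list of lists of source word indices, one list per chunk
    links:         set of (source_index, target_index) word alignment links
    Returns, per source chunk, the sorted target indices forming the aligned
    target chunk, or None when no clean target chunk can be formed.
    """
    sources_of = {}
    for s, t in links:
        sources_of.setdefault(t, set()).add(s)
    result = []
    for chunk in source_chunks:
        members = set(chunk)
        # Target words whose links all fall inside this source chunk.
        tgt = sorted(t for t, srcs in sources_of.items() if srcs <= members)
        contiguous = bool(tgt) and tgt == list(range(tgt[0], tgt[-1] + 1))
        result.append(tgt if contiguous else None)
    return result
```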
From the aligned corpus, we directly estimate
the phrase translation probabilities, $\Pr(\tilde{f} \mid \tilde{e})$,
and the model parameters, $\Pr(\tilde{f}_{a_i} \mid \tilde{e}_i, \tilde{e}_{i-1})$ and
$\Pr(\tilde{e}_i \mid \tilde{e}_{i-1}, \tilde{f}_{a_{i-1}})$. These estimates are
based on relative frequencies.
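Relative-frequency estimation amounts to dividing the count of each (outcome, context) event by the count of its context. The sketch below assumes the aligned corpus has already been turned into a list of such events; the function name `relative_frequency` and the event format are hypothetical, and no smoothing is applied.

```python
from collections import Counter


def relative_frequency(events):
    """Estimate Pr(outcome | context) from (outcome, context) events."""
    events = list(events)
    joint = Counter(events)
    context = Counter(c for _, c in events)
    return {(o, c): n / context[c] for (o, c), n in joint.items()}


# Example: events for Pr(f_{a_i} | e_i, e_{i-1}), with context = (e_i, e_prev).
toy_events = [
    ("f1", ("e1", "e0")),
    ("f1", ("e1", "e0")),
    ("f2", ("e1", "e0")),
    ("f3", ("e2", "e1")),
]
toy_model = relative_frequency(toy_events)
```

The same helper would serve for the target prediction probability by using events of the form (e_i, (e_prev, f_prev)).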
3.2 Decoding
For efficient decoding, we implement a multi-stack
decoder and a beam search with the A* algorithm. At
each search level, the beam search moves through at
most the $n$ best translation candidates, and a multi-stack
is used for partial translations according to the translation
cardinality. The output sentence is generated
from left to right in the form of partial translations.
Initially, we get $n$ translation candidates for each
source chunk with the beam size $n$. Every possible
translation is sorted according to its translation probability.
We start the decoding with the initialized
beams and initial stack $S_0$, the top of which has the
information of the initial hypothesis,
$\langle \tilde{e}_0 = \$, \tilde{f}_0 = \$ \rangle$.
The decoding algorithm is described in Table 1.
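A stripped-down version of such a multi-stack search can be sketched as follows. This is a simplified illustration under assumptions and not the decoder described here: it keeps one stack per number of covered source chunks and a fixed beam, but it omits the head-tail consistency check, the backward (A*) score, and hypothesis recombination. The callback `pair_log_prob`, which is assumed to return the log of the per-step factor of Eq. (7), and the `options` table are hypothetical names.

```python
def decode(source_chunks, options, pair_log_prob, beam_size=5):
    """Left-to-right multi-stack beam search over chunk translation pairs.

    source_chunks: list of source chunks (any hashable representation)
    options:       options[j] = candidate target chunks for source chunk j
    pair_log_prob: function (e, f, e_prev, f_prev) -> log score of one step
    Returns a list of (translation, score), best first.
    """
    K = len(source_chunks)
    stacks = [[] for _ in range(K + 1)]
    # Hypothesis: (score, covered source indices, (e_prev, f_prev), output).
    stacks[0].append((0.0, frozenset(), ("$", "$"), ()))
    for i in range(1, K + 1):
        for score, covered, (e_prev, f_prev), out in stacks[i - 1]:
            for j, f in enumerate(source_chunks):
                if j in covered:
                    continue
                for e in options[j]:
                    s = score + pair_log_prob(e, f, e_prev, f_prev)
                    stacks[i].append((s, covered | {j}, (e, f), out + (e,)))
        stacks[i].sort(key=lambda h: h[0], reverse=True)
        stacks[i] = stacks[i][:beam_size]  # simple histogram pruning
    return [(" ".join(out), s) for s, _, _, out in stacks[K]]
```

Since each extension picks the next target chunk together with its source chunk, the target order falls out of the search itself and no separate reordering step is needed.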
In the decoding algorithm, estimating the backward
score with full context consideration is so complicated that
the computational complexity becomes too high.
Thus, in order to simplify this problem,
we assume context independence for
the backward score estimation only. The backward score
is estimated by the translation probability and lan-
guage model score of the uncovered segments. For
each uncovered segment, we select the best transla-
tion with the highest score by multiplying the trans-
lation probability of the segment by its language
model score. The translation probability and lan-
guage model score are computed without giving
consideration to context.
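The context-independent backward score can be sketched as below. The function `backward_log_score` and the `candidates` table are hypothetical; combining the per-segment best scores by multiplication (a sum in log space) is an assumption made for this illustration, since the text only states how the best translation of each uncovered segment is selected.

```python
import math


def backward_log_score(uncovered, candidates):
    """Context-independent estimate of the score of the uncovered segments.

    uncovered:  iterable of uncovered source segments
    candidates: candidates[segment] = list of (translation_prob, lm_prob)
    """
    total = 0.0
    for segment in uncovered:
        # Best translation of this segment: highest TM probability x LM score.
        best = max(tm * lm for tm, lm in candidates[segment])
        total += math.log(best)
    return total
```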
1. Push the initial hypothesis $\langle \tilde{e}_0 = \$, \tilde{f}_0 = \$ \rangle$ on the initial
   stack $S_0$
2. for i = 1 to K
   • Pop the previous state information of $\langle \tilde{e}_{i-1}, \tilde{f}_{a_{i-1}} \rangle$
     from stack $S_{i-1}$
   • Get next target $\tilde{e}_i$ and corresponding source $\tilde{f}_{a_i}$
   • for all pairs of $\langle \tilde{e}_i, \tilde{f}_{a_i} \rangle$
     – Check the head-tail consistency
     – Mark the source segment as a covered one
     – Estimate forward and backward score
     – Push the state of pair $\langle \tilde{e}_i, \tilde{f}_{a_i} \rangle$ onto stack $S_i$
   • Sort all translations on stack $S_i$ by the scores
   • Prune the hypotheses
3. while (stack $S_K$ is not empty)
   • Pop the state of the pair $\langle \tilde{e}_K, \tilde{f}_{a_K} \rangle$
   • Compose translation output, $\langle \tilde{e}_1 \ldots \tilde{e}_K \rangle$
4. Output the best $N$ translations
Table 1: A* multi-stack decoding algorithm
After estimating the forward and backward score
of each partial translation on stack $S_i$, we try to
prune the hypotheses. In pruning, we first sort the
partial translations on stack $S_i$ according to their
scores. If the gradient of scores steeply decreases
over the given threshold at the $k^{th}$ translation, we
prune the translations of lower scores than the $k^{th}$
one. Moreover, if the number of filtered translations
is larger than $N$, we only take the top $N$ translations.
As a final translation, we output the single
best translation.
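The pruning rule can be sketched as follows. The function `prune` and its parameters are hypothetical, and the "gradient" is interpreted here as the score difference between consecutive hypotheses in the sorted stack, which is one plausible reading rather than a detail given in the text.

```python
def prune(hypotheses, drop_threshold, max_size):
    """Prune a stack of (score, state) hypotheses.

    Hypotheses are sorted by score; the list is cut at the first position
    where the score drops by more than drop_threshold, and at most max_size
    hypotheses are kept.
    """
    ranked = sorted(hypotheses, key=lambda h: h[0], reverse=True)
    kept = ranked
    for k in range(1, len(ranked)):
        if ranked[k - 1][0] - ranked[k][0] > drop_threshold:
            kept = ranked[:k]  # keep the first k, drop everything below
            break
    return kept[:max_size]
```

Cutting at a steep drop removes clearly implausible hypotheses even when the stack is small, while the size limit bounds the work per stack when many hypotheses score similarly.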
4 Resolving Long-distance Dependency
Since most of the current translation models take
only the local context into account, they cannot
account for long-distance dependency. This often
causes syntactically or semantically incorrect translations
to be output. In this section, we describe
how this problem can be solved. For handling the
long-distance dependency problem, we utilize bilin-
gual verb-noun collocations that are automatically
acquired from the chunk-aligned bilingual corpora.
4.1 Automatic Extraction of Bilingual
Verb-Noun Collocation (BiVN)
To automatically extract the bilingual verb-noun
collocations, we utilize a monolingual dependency
parser and the chunk alignment result. The basic
concept is the same as that used in (Hwang et al.,
2004): bilingual dependency parses are obtained by
sharing the dependency relations of a monolingual
dependency parser among the aligned chunks. Then
bilingual verb sub-categorization patterns are ac-
quired by navigating the bilingual dependency trees.
A verb sub-categorization is the collocation of a verb
and all of its argument/adjunct nouns, i.e. verb-noun
collocation (see Figure 1).
To acquire more reliable and general knowledge,
we apply the following filtering method with a statistical
$\chi^2$ test and a unification operation:
• step 1. Select the reliable translation correspondences
from all of the alignment pairs by a
$\chi^2$ test at a probability level of $\alpha_1$.
• step 2. Select reliable bilingual verb-noun
collocations BiVN by unification and a $\chi^2$ test
at a probability level of $\alpha_2$. Here, we assume
that two bilingual pairs, $\langle v_f . v_e \rangle$ and $\langle n_f . n_e \rangle$,
are unifiable into a frame $\langle v_f . v_e, n_f . n_e \rangle$ iff
both of them are reliable pairs filtered in step 1
and they share the same verb pair $\langle v_f . v_e \rangle$.
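The two filtering steps can be illustrated with a small sketch. Everything here is an assumption made for illustration: the text does not specify how the $\chi^2$ statistic is computed, so the code uses the standard Pearson statistic on a 2x2 co-occurrence table with one degree of freedom, and the names `chi_square_2x2`, `reliable_pairs` and `unify` are hypothetical. The second $\chi^2$ test of step 2 is omitted for brevity.

```python
# Critical values of the chi-square distribution with one degree of freedom.
CRITICAL_DF1 = {0.05: 3.841, 0.1: 2.706}


def chi_square_2x2(both, src_only, tgt_only, neither):
    """Pearson chi-square statistic for a 2x2 co-occurrence table."""
    a, b, c, d = both, src_only, tgt_only, neither
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def reliable_pairs(counts, alpha=0.05):
    """Step 1: keep pairs whose association passes the test at level alpha."""
    return {pair for pair, table in counts.items()
            if chi_square_2x2(*table) > CRITICAL_DF1[alpha]}


def unify(observed_frames, reliable):
    """Step 2 (unification only): group reliable noun pairs under a verb pair."""
    frames = {}
    for verb_pair, noun_pair in observed_frames:
        if verb_pair in reliable and noun_pair in reliable:
            frames.setdefault(verb_pair, set()).add(noun_pair)
    return frames
```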
4.2 Application of BiVN
The acquired BiVN is used to evaluate the bilingual
correspondence of a verb-noun pair dependent on
each other and to select the correct translation. It
can be applied to any verb-noun pair regardless of
the distance between them in a sentence. Moreover,
since the verb-noun relation in BiVN is bilingual
knowledge, the sense of each corresponding verb
and noun can be almost completely disambiguated
by each other.
In our translation system, we apply this BiVN
during decoding as follows:
1. Pivot verbs and their dependents in a given
dependency-parsed source sentence.
2. When extending a hypothesis, if one of the pivoted
verb and noun pairs is covered and its corresponding
translation pair is in BiVN, we give a
positive weight $\beta > 1$ to the hypothesis.
$$\psi(BiVN_i) = \begin{cases} 1 & \text{if } BiVN_i \in BiVN \\ 0 & \text{otherwise} \end{cases} \tag{9}$$
where $BiVN_i = \langle v_f . v_e, n_f . n_e \rangle$ and $\psi(BiVN_i)$
is a function that indicates whether the bilingual
translation pair is in BiVN. By adding the weight
of the $\psi(BiVN_i)$ function, we refine our model as
follows:
$$\tilde{e}_1^K \simeq \operatorname*{argmax}_{\tilde{e}_1^K} \prod_{i=1}^{K} \Pr(\tilde{f}_{a_i} \mid \tilde{e}_i, \tilde{e}_{i-1})\,\Pr(\tilde{e}_i \mid \tilde{e}_{i-1}, \tilde{f}_{a_{i-1}})\,\beta^{VN(f_{a_i})\,\psi(BiVN_i)} \tag{10}$$
where $VN(f_{a_i})$ is a function indicating whether the
pair of a verb and its argument $\langle v_f, n_f \rangle$ is covered
with $v_f = f_{a_i}$ or $n_f = f_{a_i}$, and
$BiVN_i = \langle v_f . v_e, n_f . n_e \rangle$ is a bilingual translation pair in the
hypothesis.
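In log space, the extra factor in Eq. (10) becomes an additive bonus of $VN(f_{a_i})\,\psi(BiVN_i)\log\beta$. The sketch below shows this under assumptions: the names `psi` and `rescore`, the representation of a frame as a pair of (source, target) tuples, and the default value `beta=2.0` are hypothetical, since the text only requires $\beta > 1$.

```python
import math


def psi(frame, bivn):
    """Eq. (9): 1 if the bilingual verb-noun frame is in BiVN, else 0."""
    return 1 if frame in bivn else 0


def rescore(log_prob, vn_covered, frame, bivn, beta=2.0):
    """Add the log of beta ** (VN * psi) to a hypothesis log score.

    vn_covered: 1 if the current source chunk completes a pivoted
                verb-noun pair, else 0 (the VN indicator of Eq. (10))
    frame:      ((v_f, v_e), (n_f, n_e)) as translated in the hypothesis
    """
    return log_prob + vn_covered * psi(frame, bivn) * math.log(beta)
```

Because the bonus depends only on whether the verb and noun are both covered, it applies regardless of how far apart the two words are in the sentence.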
5 Experiments
5.1 Corpus
The corpus for the experiment was extracted from
the Basic Travel Expression Corpus (BTEC), a col-
lection of conversational travel phrases for Japanese
and Korean (see Table 2). The entire corpus was
split into two parts: 162,320 parallel sentences for
training and 10,150 sentences for testing. The Japanese
sentences were automatically dependency-parsed by
CaboCha (Kudo and Matsumoto, 2002) and the Korean
sentences were automatically POS-tagged by KUTagger
(Rim, 2003).
5.2 Translation Systems
Four translation systems were implemented for
evaluation: 1) Word-based IBM-style SMT System
(WBIBM), 2) Chunk-based IBM-style SMT System
(CBIBM), 3) Word-based LM tightly Coupled
SMT System (WBLMC), and 4) Chunk-based LM
tightly Coupled SMT System (CBLMC). To examine
the effect of BiVN, BiVN was optionally used
for each system.
The word-based IBM-style (WBIBM) system¹
consisted of a word translation model and a bi-gram
language model. The bi-gram language
model was generated by using the CMU LM toolkit
(Clarkson and Rosenfeld, 1997). Instead of using a
fertility model, we allowed a multi-word target of
a given source word if it aligned with more than
one word. We did not use any distortion model for
word re-ordering. We used a log-linear model
$\Pr(e \mid f) = \exp\left(\sum_i \lambda_i h_i(e, f)\right)$ for weighting the
language model and the translation model. For decoding,
we used a multi-stack decoder based on the
A* algorithm, which is almost the same as that described
in Section 3. The difference is the use of
the language model for controlling the generation of
target translations.
¹ In this experiment, a word denotes a morpheme.
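As an illustration of the log-linear weighting, the sketch below combines feature values with weights $\lambda_i$. It assumes, hypothetically, that the features are the log translation-model and log language-model probabilities; the function name `log_linear_score` is introduced here and no normalization over candidate translations is applied.

```python
import math


def log_linear_score(features, weights):
    """Unnormalized log-linear score exp(sum_i lambda_i * h_i(e, f))."""
    return math.exp(sum(w * h for w, h in zip(weights, features)))


# Assumed features: h_1 = log translation probability, h_2 = log LM probability.
toy_features = [math.log(0.5), math.log(0.2)]
equal = log_linear_score(toy_features, [1.0, 1.0])      # plain product TM x LM
lm_heavy = log_linear_score(toy_features, [1.0, 2.0])   # LM counted twice
```

With all weights equal to one, the score reduces to the plain product of the translation and language model probabilities used in Eq. (1).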
The chunk-based IBM-style (CBIBM) system
consisted of a chunk translation model and a bi-
gram language model. To alleviate the data sparse-
ness problem of the chunk translation model, we ap-
plied the back-off method at the head-tail or mor-
pheme level. The remaining conditions are the same
as those for WBIBM.
The word-based LM tightly coupled (WBLMC)
system was implemented for comparison with the
chunk-based systems. Except for setting the transla-
tion unit as a morpheme, the other conditions are the
same as those for the proposed chunk-based transla-
tion system.
The chunk-based LM tightly coupled (CBLMC)
system is the proposed translation system. A bi-
gram language model was used for estimating the
backward score.
5.3 Evaluation
Translation evaluations were carried out on 510 sen-
tences selected randomly from the test set. The met-
rics for the evaluations are as follows:
PER (Position-independent WER), which penalizes
errors without considering positional disfluencies
(Niesen et al., 2000).
mWER (multi-reference Word Error Rate), which is
based on the minimum edit distance between
the target sentence and the sentences in the reference
set (Niesen et al., 2000).
BLEU, which is the ratio of the n-grams in
the translation results that are found in the reference
translations, with a penalty for too-short
sentences (Papineni et al., 2001).
NIST, which is a weighted n-gram precision in
combination with a penalty for too-short sentences.
For this evaluation, we made 10 multiple references
available. We computed all of the above criteria with
respect to these multiple references.
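The two error-rate metrics can be sketched as follows. These are common formulations chosen for illustration and may differ in detail from the evaluation tool actually used: `mwer` takes the minimum length-normalized edit distance over the references, and `per` uses a bag-of-words error count that ignores word order. All function names are hypothetical.

```python
from collections import Counter


def edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(prev[j] + 1,              # delete h
                           cur[j - 1] + 1,           # insert r
                           prev[j - 1] + (h != r)))  # substitute or match
        prev = cur
    return prev[-1]


def mwer(hyp, refs):
    """Multi-reference WER: best normalized edit distance over references."""
    return min(edit_distance(hyp, r) / len(r) for r in refs)


def per(hyp, refs):
    """Position-independent error rate over the closest reference."""
    def one(ref):
        matches = sum((Counter(hyp) & Counter(ref)).values())
        return (max(len(hyp), len(ref)) - matches) / len(ref)
    return min(one(r) for r in refs)
```

For a hypothesis that contains the right words in the wrong order, `per` is zero while `mwer` is not, which is why the two metrics are reported side by side.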
                             Training                  Test
                       Japanese     Korean      Japanese    Korean
# of sentences               162,320                  10,150
# of total morphemes   1,153,954    1,179,753   74,366      76,540
# of bunsetsu/eojeol     448,438      587,503   28,882      38,386
vocabulary size           15,682       15,726    5,144       4,594
Table 2: Statistics of Basic Travel Expression Corpus
         PER               mWER              BLEU              NIST
WBIBM    0.3415 / 0.3318   0.3668 / 0.3591   0.5747 / 0.5837   6.9075 / 7.1110
WBLMC    0.2667 / 0.2666   0.2998 / 0.2994   0.5681 / 0.5690   9.0149 / 9.0360
CBIBM    0.2677 / 0.2383   0.2992 / 0.2700   0.6347 / 0.6741   8.0900 / 8.6981
CBLMC    0.1954 / 0.1896   0.2176 / 0.2129   0.7060 / 0.7166   9.9167 / 10.027
Table 3: Evaluation Results of Translation Systems: without BiVN / with BiVN
WBIBM             WBLMC             CBIBM             CBLMC
0.8110 / 0.8330   2.5585 / 2.5547   0.3345 / 0.3399   0.9039 / 0.9052
Table 4: Translation Speed of Each Translation System (sec./sentence): without BiVN / with BiVN
5.4 Analysis and Discussion
Table 3 shows the performance evaluation of each
system. CBLMC outperformed CBIBM on all
evaluation criteria. WBLMC showed much better
performance than WBIBM on all of the evaluation
criteria except for the BLEU score. The interesting
point is that the performance of WBLMC is close to
that of CBIBM in PER and mWER. The BLEU score
of WBLMC is lower than that of CBIBM, but the
NIST score of WBLMC is much better than that of
CBIBM.
The reason the proposed model provided better
performance than the IBM-style models is that
the use of contextual information in CBLMC and
WBLMC enabled the system to reduce the translation
ambiguities, which not only reduced the computational
complexity during decoding, but also made
the translation accurate and deterministic. In addition,
chunk-based translation systems outperformed
word-based systems. This is also strong evidence of
the advantage of contextual information.
To evaluate the effectiveness of bilingual verb-noun
collocations, we used the BiVN filtered with
$\alpha_1 = .05$, $\alpha_2 = .1$, where coverage is 64.86%
on the test set and average ambiguity is 2.99. We
suffered a slight loss in speed by using the
BiVN (see Table 4), but we could improve performance
in all of the translation systems (see Table
3). In particular, the performance improvement in
CBIBM with BiVN was remarkable. This is a pos-
itive sign that the BiVN is useful for handling the
problem of long-distance dependency. From this re-
sult, we believe that if we increased the coverage of
BiVN and its accuracy, we could improve the per-
formance much more.
Table 4 shows the translation speed of each system.
For the evaluation of processing time, we used
the same machine, with a Xeon 2.8 GHz CPU and
4 GB of memory, and measured the time at the
best-performing setting of each system. The chunk-based
translation systems are much faster than the word-based
systems. This may be because the translation ambiguities
of the chunk-based models are lower than
those of the word-based models. However, the processing
speed of the IBM-style models is faster than
that of the proposed model. This tendency can be analyzed
from two viewpoints: the decoding algorithm and the DB
system for parameter retrieval. Theoretically, the
computational complexity of the proposed model is
lower than that of the IBM models. The use of a
sorting and pruning algorithm for partial translations
provides shorter search times in all systems. Since
the number of parameters for the proposed model is
much larger than for the IBM-style models, it took a
longer time to retrieve parameters. To decrease the
processing time, we need to construct a more efficient
DB system.
6 Conclusion
In this paper, we proposed a new chunk-based statis-
tical machine translation model that is tightly cou-
pled with a language model. In order to alleviate
the data sparseness in chunk-based translation, we
applied the back-off translation method at the head-
tail and morpheme levels. Moreover, in order to
get more semantically plausible translation results
by considering long-distance dependency, we uti-
lized verb-noun collocations which were automat-
ically extracted by using chunk alignment and a
monolingual dependency parser. As a case study,
we experimented on the language pair of Japanese
and Korean. Experimental results showed that the
proposed translation model is very effective in im-
proving performance. The use of bilingual verb-
noun collocations is also useful for improving the
performance.
However, we still face the problems of data
sparseness and the low coverage of bilingual verb-noun
collocations. In the near future, we will try to
solve the data sparseness problem and to increase the
coverage and accuracy of verb-noun collocations.
References
Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della
Pietra, and R. L. Mercer. 1993. The mathematics of
statistical machine translation: Parameter estimation,
Computational Linguistics, 19(2):263-311.
P.R. Clarkson and R. Rosenfeld. 1997. Statistical Language
Modeling Using the CMU-Cambridge Toolkit,
Proc. of ESCA Eurospeech.
Young-Sook Hwang, Kyonghee Paik, and Yutaka Sasaki.
2004. Bilingual Knowledge Extraction Using Chunk
Alignment, Proc. of the 18th Pacific Asia Conference
on Language, Information and Computation
(PACLIC-18), pp. 127-137, Tokyo.
Kevin Knight. 1999. Decoding Complexity in Word-Replacement
Translation Models, Computational Linguistics,
Squibs Discussion, 25(4).
Philipp Koehn, Franz Josef Och, and Daniel Marcu.
2003. Statistical Phrase-Based Translation, Proc.
of the Human Language Technology Conference
(HLT/NAACL).
Philipp Koehn. 2004. Pharaoh: a Beam Search Decoder
for Phrase-Based Statistical Machine Translation
Models, Proc. of AMTA’04.
Taku Kudo and Yuji Matsumoto. 2002. Japanese Dependency
Analysis using Cascaded Chunking, Proc. of
CoNLL-2002.
Daniel Marcu and William Wong. 2002. A phrase-based,
joint probability model for statistical machine translation,
Proc. of EMNLP.
Sonja Niesen, Franz Josef Och, Gregor Leusch, and Hermann
Ney. 2000. An Evaluation Tool for Machine Translation:
Fast Evaluation for MT Research, Proc. of the
2nd International Conference on Language Resources
and Evaluation, pp. 39-45, Athens, Greece.
Franz Josef Och, Christoph Tillmann, and Hermann Ney.
1999. Improved alignment models for statistical machine
translation, Proc. of EMNLP/WVLC.
Franz Josef Och and Hermann Ney. 2000. Improved Statistical
Alignment Models, Proc. of the 38th Annual
Meeting of the Association for Computational Linguistics,
pp. 440-447, Hong Kong, China.
Franz Josef Och, Nicola Ueffing, and Hermann Ney. 2001.
An Efficient A* Search Algorithm for Statistical Machine
Translation, Data-Driven Machine Translation
Workshop, pp. 55-62, Toulouse, France.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing
Zhu. 2001. Bleu: a method for automatic evaluation
of machine translation, IBM Research Report,
RC22176.
Toshiyuki Takezawa, Eiichiro Sumita, Fumiaki Sugaya,
Hirofumi Yamamoto, and Seiichi Yamamoto. 2002.
Toward a broad-coverage bilingual corpus for speech
translation of travel conversations in the real world,
Proc. of LREC 2002, pp. 147-152, Spain.
Richard Zens and Hermann Ney. 2004. Improvements
in Phrase-Based Statistical Machine Translation,
Proc. of the Human Language Technology Conference
(HLT-NAACL), Boston, MA, pp. 257-264.
Hae-Chang Rim. 2003. Korean Morphological Analyzer
and Part-of-Speech Tagger, Technical Report, NLP
Lab., Dept. of Computer Science and Engineering, Korea
University.
