Plaesarn: Machine-Aided Translation Tool for English-to-Thai
Prachya Boonkwan and Asanee Kawtrakul
Specialty Research Unit of Natural Language Processing
and Intelligent Information System Technology
Department of Computer Engineering
Kasetsart University
Email: ak@vivaldi.cpe.ku.ac.th
Abstract
English-Thai MT systems are currently limited
by incomplete vocabularies and translation
knowledge. Users must consequently accept a
single translation result that is sometimes
semantically divergent or ungrammatical. For
this reason, we propose novel Internet-based
translation assistant software to facilitate
document translation from English to Thai. In
this project, we use the structural transfer
model as the translation mechanism. The project
differs from current English-Thai MT systems in
that it lets users manually select the most
appropriate translation from all candidates and
manually teach the system new translation rules
when necessary. With this model, we address
four translation problems: lexicon
rearrangement, structural ambiguity, phrase
translation, and classifier generation. Finally,
we evaluated the system on 322 randomly selected
sentences from the Future Magazine bilingual
corpus. The system yielded 59.87% translation
accuracy in the worst case and 83.08% in the
best case, given the 90.1% average precision of
the parser.
Introduction
Thai readers should not be limited to
information written in Thai; they should also
have access to the considerably large amount of
information from foreign sources. However,
insufficient basic language knowledge, a result
of inadequate distribution in the past, is the
major obstacle to comprehending that
information. There are presently several
English-Thai MT systems, for instance Parsit
(Sornlertlamvanich, 2000), Plae Thai, and
AgentDict. The first applies the semantic
transfer model with a methodology similar to
lexical functional grammar (Kaplan et al., 1989)
and was developed for public use. The latter two
implicitly apply the direct transfer model and
are intended for commercial use. Nonetheless,
because their vocabularies and translation rules
are limited, users must accept a single
translation result that is occasionally
semantically divergent or ungrammatical.
For this reason, we initiated this project to
ease the language problem of Thai people. In
this project, we develop a semi-automatic
translation system that assists them in
translating English documents into Thai. In this
paper, the term semi-automatic translation means
sentence translation in which the user
interactively resolves structural and semantic
ambiguities during translation. In addition to
manual disambiguation, we provide a simple
statistical disambiguation that pre-selects the
most probable translation for each source
language sentence. Automatic semantic
disambiguation can thus be left out of this
approach.
1 Translation Approaches
We can classify current translation approaches
into three major models as follows—structural
transfer, semantic transfer, and lexical transfer
(Trujillo, 1999).
• Structural transfer: this methodology depends
heavily on syntactic analysis (that is,
grammar). Translation transfers the source
language structures into the target language.
The method rests on the assumption that every
language in the world uses syntactic structure
to represent the meaning of sentences.
• Semantic transfer: this methodology depends
heavily on semantic analysis (that is, meaning).
This model applies syntactic analysis as well.
In contrast to structural transfer, a source
language sentence is not translated directly
into the target language; it is first translated
into a semantic representation (usually referred
to as an interlingua) and afterwards into the
target language. The method rests on the
assumption that every language in the world
describes the same world; hence, there exists a
semantic representation common to every
language.
• Lexical transfer: this methodology depends
heavily on lexicon ordering patterns.
Translation occurs at the level of morphemes:
the process transfers a set of morphemes in the
source language into a set of morphemes in the
target language.
In this project, we decided to use the
structural transfer approach, since it is more
appropriate for rapid development. In addition,
a semantic representation that covers every
language is still under research.
2 Relevant Problems and Their
Solutions
2.1 Structural Ambiguity
Because natural languages are ambiguous, a
sentence may be translated or interpreted in
many senses. An example of structural ambiguity
is “I saw a girl in the park with a telescope.”
This sentence can be grammatically interpreted
in four senses as follows.
• I saw a girl, to whom a telescope belonged,
who was in the park.
• I used a telescope to see a girl, who was in
the park.
• I was in the park and seeing a girl, to whom
a telescope belonged.
• I was in the park and using a telescope to
see a girl.
Furthermore, an example of word-sense am-
biguity is “I live near the bank.” The noun bank
can be semantically interpreted into at least two
senses as follows.
• n. a financial institution that accepts
deposits and channels the money into lending
activities
• n. sloping land (especially the slope beside
a body of water)
In order to resolve structural ambiguity, we
apply the concept of the statistical machine
translation approach (Brown et al., 1990). We
use the Maximum-Entropy-Inspired Parser
(Charniak, 1999), also called the Charniak
Parser, to analyze and determine the appropriate
grammatical structure of an English sentence.
According to Charniak (1999), the parser uses
the Penn Tree Bank tag set (Marcus et al.,
1994), abbreviated PTB, as its grammatical
structure representation, and it yielded 90.1%
average precision for sentences of length 40 or
less and 89.5% for sentences of length 100 or
less. Moreover, to resolve word-sense ambiguity,
we attach a numerical statistic to each
translation rule (including lexical transfer
rules) to help select the best translation parse
tree from all candidates (Charniak, 1997).
Section 3.4 describes the method and the tool
for doing so.
2.2 Phrase Translation
A phrase is a word-ordering pattern whose words
cannot be translated separately. An example is
the translation of the verb to be, which depends
on the context: to be followed by a noun phrase
is translated to เป็น /penm/, followed by a
prepositional phrase to อยู่ /yuul/, in
progressive tenses to กำลัง /kammlangm/, in the
passive voice to ถูก /thuukl/, and followed by an
adjectival phrase it is omitted from the
translation. Another example is the verbal
phrase to look for something. It must be
translated to มองหา /mɔɔngmhaar/, not to
มองสำหรับ /mɔɔngm samrrabl/, which results from
translating look to มอง /mɔɔngm/ and for to
สำหรับ /samrrabl/ separately.
From empirical observation, we found that parse
trees in the PTB tag set are rather difficult to
translate into Thai. We hence implement the
parse tree modification process in order to
reduce the complexity of the transformation
process (Trujillo, 1999). In this process, the
heads of the tree are recursively modified so as
to facilitate phrase translation. Table 1 shows
a portion of the parse tree modification rules
in parenthesis format.
With the modified trees of Table 1, we can more
easily compose the rules in Table 2 to translate
the verb to be and the phrasal verb look for
something.
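The rewriting listed in Table 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the system's implementation: the tuple-based tree representation and the names `node`, `show`, and `modify` are assumptions introduced here, the bottom-up traversal order is assumed, and the verb is assumed to appear already as the lemma (be), as it does in Table 1.

```python
# A tree is (label, children); a word such as "be" is a node without children.
def node(label, *children):
    return (label, list(children))


def show(tree):
    """Render a tree in the parenthesis format used in Table 1."""
    label, children = tree
    return "(" + " ".join([label] + [show(c) for c in children]) + ")"


def modify(tree):
    """Apply the be and look-for rewrites bottom-up (assumed traversal order)."""
    label, children = tree
    children = [modify(c) for c in children]
    if label == "VP" and len(children) == 2:
        first, second = children
        # (VP (AUX (be)) X): lift "be" out of AUX.
        if first[0] == "AUX" and first[1] and first[1][0][0] == "be":
            be = first[1][0]
            if second[0] == "VP" and second[1] and second[1][0][0] in ("VBG", "VBN"):
                # (VP (AUX (be)) (VP (VBG) *)) -> (VP (be) (VBG) *)
                return ("VP", [be] + second[1])
            if second[0] in ("NP", "PP", "ADJP"):
                return ("VP", [be, second])
        # (VP (VBP (look)) (PP (IN (for)) (NP))) -> (VP (look) (for) (NP))
        if (first[0] == "VBP" and first[1] and first[1][0][0] == "look"
                and second[0] == "PP" and len(second[1]) == 2
                and second[1][0][0] == "IN" and second[1][0][1]
                and second[1][0][1][0][0] == "for"):
            return ("VP", [first[1][0], second[1][0][1][0], second[1][1]])
    return (label, children)


progressive = node("VP", node("AUX", node("be")),
                   node("VP", node("VBG"), node("NP")))
look_for = node("VP", node("VBP", node("look")),
                node("PP", node("IN", node("for")), node("NP")))
print(show(modify(progressive)))  # (VP (be) (VBG) (NP))
print(show(modify(look_for)))     # (VP (look) (for) (NP))
```

After flattening, the verb and its complement are siblings under one VP, so a single one-level rule can match them together.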
Table 1: Rules for the parse tree modification process
Original PTB                                Modified
(VP (AUX (be)) (NP))                        (VP (be) (NP))
(VP (AUX (be)) (PP))                        (VP (be) (PP))
(VP (AUX (be)) (VP (VBG) *))                (VP (be) (VBG) *)
(VP (AUX (be)) (VP (VBN) *))                (VP (be) (VBN) *)
(VP (AUX (be)) (ADJP))                      (VP (be) (ADJP))
(VP (VBP (look)) (PP (IN (for)) (NP)))      (VP (look) (for) (NP))
Table 2: Rules to translate the verb to be and the verbal phrase look for something
English Rules         Thai Rules
VP → be NP            VP → เป็น NP
VP → be PP            VP → อยู่ PP
VP → be VBG           VP → กำลัง VBG
VP → be VBN           VP → ถูก VBN
VP → be ADJP          VP → ADJP
VP → look for NP      VP → มองหา NP
2.3 Lexicon Rearrangement
In English, a core noun can normally be modified
in two ways: by putting modifiers in front of it
or behind it. We focus on the first case in this
paper. The problem occurs when we translate a
sequence of nouns and a sequence of adjectives:
the sequence of nouns is translated backwards,
while the sequence of adjectives is translated
forwards. For example, “she is a beautiful
diligent slim laboratory member” is translated
to เธอเป็นสมาชิกแล็บที่สวยขยันผอม /thəəm penm
salmaamchikh thiif suayr khalyanr phɔɔmr/. The
word she is translated to เธอ, is to เป็น, member
to สมาชิก, laboratory to แล็บ, beautiful to สวย,
diligent to ขยัน, and slim to ผอม.
To solve this problem, we first group nouns and
adjectives into two groups, NNS and ADJS, and
then apply a number of structural transfer
rules. Table 3 shows a portion of the transfer
rules.
Table 3: A portion of structural transfer rules to solve the lexicon reordering
English Rules         Thai Rules
NP → ADJS NNS         NP → NNS ADJS
ADJS → adj            ADJS → adj
ADJS → adj ADJS       ADJS → adj ADJS
NNS → nn              NNS → nn
NNS → nn NNS          NNS → NNS nn
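The effect of the Table 3 rules can be illustrated with a short Python sketch. It is only an illustration under assumptions: the function names are hypothetical, the words are English glosses rather than Thai, and the linking word that appears between the nouns and the adjectives in the example sentence is not modelled.

```python
def transfer_adjs(adjs):
    # ADJS -> adj | adj ADJS keeps the same order on the Thai side.
    if len(adjs) <= 1:
        return list(adjs)
    return [adjs[0]] + transfer_adjs(adjs[1:])


def transfer_nns(nouns):
    # NNS -> nn NNS becomes NNS -> NNS nn, which reverses the noun sequence.
    if len(nouns) <= 1:
        return list(nouns)
    return transfer_nns(nouns[1:]) + [nouns[0]]


def transfer_np(adjs, nouns):
    # NP -> ADJS NNS becomes NP -> NNS ADJS.
    return transfer_nns(nouns) + transfer_adjs(adjs)


print(transfer_np(["beautiful", "diligent", "slim"], ["laboratory", "member"]))
# ['member', 'laboratory', 'beautiful', 'diligent', 'slim']
```

The recursion mirrors the grammar rules one to one, so each function corresponds to one non-terminal in Table 3.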
2.4 Classifier Generation
A vital linguistic divergence between English
and Thai is the classifier that corresponds to a
head noun (Lamduan, 1983). In English,
classifiers are never used to express the number
or definiteness of a noun. In Thai, by contrast,
classifiers are generally used: for example, in
English a number precedes the noun phrase,
whereas in Thai the number together with a
classifier follows it.
In order to generate a classifier, we develop
the classifier matching algorithm. From
empirical observation, the head noun of a noun
phrase always determines the classifier. For
example, suppose the rules in Table 4 are stored
in the linguistic knowledge base.
Table 4: An example of rules for classifier generation
Head Noun             Classifier
รถ /rothh/            คัน /khanm/
รถไฟ /rothhfaim/      ขบวน /khalbuanm/
Figure 1: System Overview
With these rules, “รถไฟเหาะตีลังกา /rothhfaim hɔl
tiimlangmkaam/ 3 <cl>” and “รถยนต์ /rothhyonm/ 4
<cl>” are revised to “รถไฟเหาะตีลังกา 3 ขบวน”
(three roller coasters) and “รถยนต์ 4 คัน” (four
automobiles), respectively. If no rule matches
the noun phrase, its head noun is used as the
classifier (Lamduan, 1983). For example, ประเทศ
/praltheesf/ is a Thai word that has no
corresponding classifier. To specify a number
for it, we say “ประเทศพัฒนาแล้ว 3 ประเทศ”
/praltheesf phathhthahnaam lɛɛwh/ (three
developed countries).
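The classifier matching described in this section can be sketched as follows. This is a hedged illustration, not the system's code: the function names are hypothetical, the rule table holds only the two head nouns from Table 4 (รถ "rot" with คัน "khan", รถไฟ "rotfai" with ขบวน "khabuan"), and matching the longest rule that is a prefix of the noun phrase is an assumption based on Thai noun phrases being head-initial.

```python
# Head noun -> classifier, taken from Table 4.
CLASSIFIER_RULES = {
    "รถ": "คัน",      # rot (vehicle) -> khan
    "รถไฟ": "ขบวน",   # rotfai (train) -> khabuan
}


def choose_classifier(noun_phrase, head_noun, rules):
    """Pick the classifier of the longest rule that prefixes the noun phrase."""
    matches = [head for head in rules if noun_phrase.startswith(head)]
    if matches:
        return rules[max(matches, key=len)]
    # No rule matches: the head noun itself serves as the classifier.
    return head_noun


def attach_classifier(noun_phrase, head_noun, number, rules):
    """Build 'noun phrase + number + classifier', the Thai word order."""
    return f"{noun_phrase} {number} {choose_classifier(noun_phrase, head_noun, rules)}"


print(attach_classifier("รถไฟเหาะตีลังกา", "รถไฟ", 3, CLASSIFIER_RULES))
print(attach_classifier("รถยนต์", "รถ", 4, CLASSIFIER_RULES))
print(attach_classifier("ประเทศพัฒนาแล้ว", "ประเทศ", 3, CLASSIFIER_RULES))
```

Preferring the longest match keeps the more specific rule for รถไฟ from being shadowed by the shorter rule for รถ.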
3 System Overview
As illustrated in Figure 1, the system comprises
four principal components: syntactic analysis,
structural transformation, sentence generation,
and linguistic knowledge acquisition.
3.1 Syntactic Analysis and Parse-Tree
Modification
In this process, we analyze each sentence of the
source documents with the Charniak Parser and
transform it into a parse tree.
The first step is sentence boundary
identification. In this step, we require users
to manually mark sentence boundaries by
inserting a new-line character between
sentences.
The next step is sentence parsing. We analyze
the surface structure of a sentence with the
Charniak Parser. The original Charniak Parser,
however, takes a long time to initialize because
it loads a considerably large database.
Consequently, we patched it to run as a
client-server program so as to eliminate this
start-up time.
As stated earlier, because parse trees generated
by the Charniak Parser are quite complicated to
translate into Thai, we apply the parse tree
modification process (see Section 2.2).
3.2 Structural Transformation
This process recursively transforms the source
language parse tree into a set of corresponding
Thai translation parse trees with their
probabilities. Because transferring a
PTB-formatted parse tree directly into Thai is
complex, the parse tree modification process
(see Section 2.2) is performed before the
transformation. The transformation relies on the
transformation rules in the linguistic knowledge
base.
A single step of the transformation process
matches the root node and its single-depth child
nodes against the transformation rules and
returns a set of transformation productions. As
stated earlier, we embed a probability in each
rule. The probability of a parse tree $\tau$ is
given by the equation
\[
P(\tau) = \phi(c_{\tau}, c_{\tau_1}, c_{\tau_2}, c_{\tau_3}, \ldots, c_{\tau_n}) \prod_{k=1}^{n} P(\tau_k)
\]
where $\tau_k$ is the $k$-th subtree of the
parse tree $\tau$, $n$ is the number of its
subtrees, $c_{\tau}$ represents the constituent
of the tree $\tau$, and $\phi$ is a probability
relation that maps the constituents of the root
and its single-depth children to a probability
value.
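The recursive definition above can be sketched in Python as follows. This is an illustration under stated assumptions: the probability relation is stored as a dictionary keyed by the root constituent and the tuple of child constituents, a word without subtrees is assumed to contribute probability 1, and the rule names and probability values are invented for the example.

```python
def tree_probability(tree, relation):
    """Probability of a tree: the rule probability at the root times the
    probabilities of all subtrees, computed recursively."""
    constituent, subtrees = tree
    if not subtrees:
        return 1.0  # assumption: a bare word adds no further factor
    children = tuple(child[0] for child in subtrees)
    probability = relation[(constituent, children)]
    for subtree in subtrees:
        probability *= tree_probability(subtree, relation)
    return probability


def most_probable(trees, relation):
    """Pre-select the candidate translation tree with the highest probability."""
    return max(trees, key=lambda tree: tree_probability(tree, relation))


# Hypothetical rule probabilities for two competing translations of "be NP".
RELATION = {
    ("VP", ("pen", "NP")): 0.75,
    ("VP", ("yuu", "NP")): 0.25,
    ("NP", ("nn",)): 0.5,
}
candidate_1 = ("VP", [("pen", []), ("NP", [("nn", [])])])
candidate_2 = ("VP", [("yuu", []), ("NP", [("nn", [])])])

print(tree_probability(candidate_1, RELATION))  # 0.375
print(most_probable([candidate_1, candidate_2], RELATION) is candidate_1)  # True
```

Ranking candidates this way corresponds to the pre-selection of the most probable translation, which the user can still override.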
3.3 Sentence Generation
This process generates a target language
sentence from the parse tree. This stage also
relies on the linguistic knowledge base. An
additional step is noun classifier generation,
for which we apply the classifier matching
algorithm (see Section 2.4). Finally, the system
shows the most probable translation and lets the
users change each choice if they wish.
3.4 Linguistic Knowledge Acquisition
We provide a tool for manually training new
translation knowledge. Currently, it comprises
the translation rule learner and the
English-Thai unknown word aligner (Kampanya et
al., 2002).
In this module, the translation rule learner
takes a document and analyzes it into a set of
parse trees. Afterwards, the users manually
teach it rules for grammatically translating a
certain tree from the source language into the
target language, written in Backus-Naur Form
(Lewis and Papadimitriou, 1998), abbreviated
BNF. The module then determines whether the rule
has been trained before. If so, it raises the
probability of that rule. If not, it adds the
rule to the knowledge base.
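The add-or-reinforce behaviour of the rule learner can be sketched as below. The paper does not say how a probability is raised, so the relative-frequency estimate used here is an assumption, as are the class name `RuleBase`, its methods, and the plain-string rule notation.

```python
from collections import defaultdict


class RuleBase:
    """Toy knowledge base: counts how often each Thai rule was taught
    for a given English rule (hypothetical structure)."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, english_rule, thai_rule):
        """Add a new rule or reinforce a known one.
        Returns True when the rule had been trained before."""
        known = thai_rule in self.counts[english_rule]
        self.counts[english_rule][thai_rule] += 1
        return known

    def probability(self, english_rule, thai_rule):
        """Relative frequency of a Thai rule among all rules taught
        for the same English rule (assumed estimate)."""
        seen = self.counts.get(english_rule, {})
        total = sum(seen.values())
        return seen.get(thai_rule, 0) / total if total else 0.0


kb = RuleBase()
kb.train("NP -> ADJS NNS", "NP -> NNS ADJS")   # new rule: added
kb.train("NP -> ADJS NNS", "NP -> ADJS NNS")   # competing rule: added
kb.train("NP -> ADJS NNS", "NP -> NNS ADJS")   # re-trained: probability rises
print(kb.probability("NP -> ADJS NNS", "NP -> NNS ADJS"))
```

With counts as the stored statistic, re-training a rule raises its probability and lowers that of its competitors without any separate normalization step.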
Moreover, the aligner is utilized to automat-
ically update the bilingual dictionary. For our
future work, we intend to develop a system to
automatically learn new translation rules from
our corpora.
4 Evaluation
We evaluated the system on the Future Magazine
bilingual corpus. We divided the evaluation into
two environments: with a restricted knowledge
base and with an increasing knowledge base. Each
environment is further divided into two
conditions: with parsing errors and without
parsing errors.
In the evaluation, we randomly selected 322
sentences from the corpus. To keep the task
manageable and to facilitate performance
measurement, we classify each translation result
into one of three categories: exact (the same as
in the corpus), moderate (understandable
result), and incomprehensible (obviously
non-understandable result). Table 5 shows the
evaluation results.
Table 5: Evaluation Results (in percentages)
Column A represents evaluation with a restricted knowledge
base and with parsing errors, B with a restricted knowledge
base but without parsing errors, C with an increasing
knowledge base but with parsing errors, and D with an
increasing knowledge base and without parsing errors.
Categories          A       B       C       D
Exact               3.97    4.41    4.97    5.52
Moderate            55.90   62.04   69.88   77.56
Incomprehensible    40.13   33.55   25.15   16.92
Accuracy            59.87   66.45   74.85   83.08
In this evaluation, we consider the results in
the exact and moderate categories to be
reasonable translations, so the accuracy is the
sum of these two categories. Moreover, we
consider the evaluation with a restricted
knowledge base and with parsing errors to be the
worst-case performance, and the evaluation with
an increasing knowledge base and without parsing
errors to be the best-case performance.
Under these constraints, we found that the
system yielded a translation accuracy of 59.87%
in the worst case and 83.08% in the best case.
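As a quick check of the arithmetic behind Table 5, the accuracy in each column is the sum of the exact and moderate percentages. The sketch below recomputes it from the table values; the dictionary layout and the function name are assumptions made for illustration.

```python
# Percentages from Table 5: (exact, moderate, incomprehensible) per column.
RESULTS = {
    "A": (3.97, 55.90, 40.13),
    "B": (4.41, 62.04, 33.55),
    "C": (4.97, 69.88, 25.15),
    "D": (5.52, 77.56, 16.92),
}


def accuracy(exact, moderate, incomprehensible):
    """Accuracy counts exact and moderate results as reasonable translations."""
    total = exact + moderate + incomprehensible
    if abs(total - 100.0) > 0.01:
        raise ValueError("the three categories should sum to 100%")
    return exact + moderate


for column, row in RESULTS.items():
    print(column, f"{accuracy(*row):.2f}")
```

Column A gives the worst-case figure and column D the best-case figure reported in the text.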
5 Conclusions
In this paper, we propose novel Internet-based
translation assistant software to facilitate
document translation from English to Thai. We
use the structural transfer model as the
translation mechanism. This project differs from
current MT systems in that users can manually
select the most appropriate translation and can,
in addition, teach the system new translation
knowledge when necessary.
The four translation problems (lexicon
rearrangement, structural ambiguity, phrase
translation, and classifier generation) are
addressed with different methods. To resolve the
lexicon rearrangement problem, we compose a
number of structural transfer rules. For
structural ambiguity, we apply a statistical
method by embedding a probability value in each
transfer rule. To reduce the complexity of
phrase translation, we develop the parse tree
modification process, which modifies some tree
structures so that translation rules are easier
to compose. Finally, to resolve the classifier
generation problem, we define the classifier
matching algorithm, which matches the longest
head noun to the appropriate classifier.
In the evaluation, we tested the system on the
Future Magazine bilingual corpus in two
environments: with a restricted knowledge base
and with an increasing knowledge base. The
system yielded a translation accuracy of 59.87%
in the worst case and 83.08% in the best case.

References
Peter F. Brown, John Cocke, Stephen Della
Pietra, Vincent J. Della Pietra, Frederick Je-
linek, John D. Lafferty, Robert L. Mercer,
and Paul S. Roossin. 1990. A statistical
approach to machine translation. Computa-
tional Linguistics, 16(2):79–85.
Eugene Charniak. 1997. Statistical parsing
with a context-free grammar and word statis-
tics. In AAAI/IAAI, pages 598–603.
Eugene Charniak. 1999. A maximum-entropy-
inspired parser. Technical Report CS-99-12,
Brown Laboratory for Natural Language Pro-
cessing.
Nithiwat Kampanya, Prachya Boonkwan, and
Asanee Kawtrakul. 2002. Bilingual unknown
word alignment tool for English-Thai. In
Proceedings of the SNLP-Oriental COCOSDA,
Specialty Research Unit of Natural Language
Processing and Intelligent Information System
Technology, Kasetsart University, Bangkok,
Thailand.
Ronald M. Kaplan, Klaus Netter, Jürgen
Wedekind, and Annie Zaenen. 1989. Trans-
lation by structural correspondences. In Pro-
ceedings of the 4th. Annual Meeting of the
European Chapter of the Association for
Computational Linguistics, pages 272–281,
UMIST, Manchester, England.
Somchai Lamduan, 1983. Thai Grammar (in
Thai), chapter 4: Parts of Speech, pages 128–
131. Odeon Store Publisher.
Harry R. Lewis and Christos H. Papadimitriou,
1998. Elements of the Theory of Compu-
tation, chapter 3: Context-Free Grammar.
Prentice-Hall International Inc.
Mitchell P. Marcus, Beatrice Santorini, and
Mary Ann Marcinkiewicz. 1994. Building
a large annotated corpus of English: The
Penn Treebank. Computational Linguistics,
19(2):313–330.
Virach Sornlertlamvanich. 2000. The state of
the art in Thai language processing.
Arturo Trujillo, 1999. Translation Engines:
Techniques for Machine Translation, chapter
6: Transfer MT, pages 121–166. Springer-
Verlag (London Limited).
