A Model of Competence for Corpus-Based Machine Translation 
Michael Carl 
Institut fiir Angewandte Informationsforschung, 
Martin-Luther-Strafle 14, 
66111 Saarbrficken, Germany, 
carl@iai.uni-sb.de 
Abstract 
In this paper I claborate a model of colnpetencc 
for corpus-based machine translation (CBMT) along 
the lines of the representations used in the transla- 
tion system. Representations in CBMT-systems can 
be rich or austere, molecular or holistic and they can 
be fine-grained or coarse-grained. The paper shows 
that different CBMT architectures are required de- 
pendent on whether a better translation quality or 
a broader coverage is preferred according to Boitct 
(1999)'s formula: "Coverage * Quality = K". 
1 Introduction 
In the machine translation (MT) literature, it has 
often been argued that translations of natural lan- 
guage texts are valid if and only if the source lan- 
guage text and the target  text have the 
same meaning cf. e.g. (Nagao, 1989). If we assume 
that MT systems produce meaningflfl translations 
to a certain extent, wc must assmne that such sys- 
tems have a notion of the source text meaning to a 
similar extent. Hence, the translation algorithm to- 
gether with the data it uses encode a formal model of 
meaning. Despite 50 years of intense research, there 
is no existing system that could map arbitrary input 
texts onto meaning-equivalent output texts. How is 
that possible? 
According to (Dummett, 1975) a theory of mean- 
lug is a theory of understanding: having a theory 
of meaning means that one has a theory of under- 
standing. In linguistic research, texts are described 
on a number of levels and dimensions each contribut- 
ing to its understanding and hence to its memfing. 
l~raditionally, the main focus has been on semantic 
aspects. In this research it is assumed that know- 
ing the propositional structure of a text means to 
understand it. Under the same premise, research in 
M.q? has focused on semantic aspects assmning that 
texts have the same meaning if they are semantically 
equivalent. 
Recent research in corpus-based MT has differ- 
ent premisses. Corpus-Based Machine Translation 
(CBMT) systems make use of a set of reference 
translations on which the translation of a new text 
is based. In CBMT-systems, it is assumed that 
the reference translations given to the system in a 
training phase have equivalence meanings. Accord- 
ing to their intelligence, these systems try to fig- 
urc out of what the meaning invariance consists in 
the reference text and learn an appropriate source 
/target  mapping mechanism. A 
translation can only be generated if an appropriate 
example translation is available in the reference text. 
An interesting question in CBMT systems is thus: 
what theory of meaning should the learning pro- 
cess implement in order to generate an appropriate 
understanding of the source text such that it can 
be mapped iuto a meaning equivalent target text? 
Dulmnett (Dummett, 1975) suggests a distinction 
of theories of meaning along the following lines: 
* In a rich theory of meaning, the knowledge of 
the concepts is achieved by knowing the features 
of these concepts. An ausle'ce theory merely re- 
lies upon simple recognition of the shape of the 
concepts. A rich theory can justify the use of a 
concept by means of the characteristic features 
of that concept, whereas an austere theory can 
justify the use of a concept merely by enmner- 
ating all occurrences of the use of that concept. 
. A moh'.euIar theory of meaning derives the 
understanding of an expression from a finite 
number of axioms. A holistic theory, in con- 
trast, derives the understanding of an expres- 
sion through its distinction from all other ex- 
pressions in that . A molecular theory, 
therefore, provides criteria to associate a cer- 
tain meaning to a sentence and can explain the 
concepts used in the . In a holistic the- 
ory nothing is specified about the knowledge of 
the  other than in global constraints 
related to the  as a whole. 
In addition, the granularity of concepts seems cru- 
cial for CBMT implementations. 
* A fine-grained theory of meaning derives con- 
cepts from single morphemes or separable words 
of the , whereas in a coar~'e-qrained 
997 
theory of meaning, concepts are obtained from 
morpheme clusters. In a fine-grained theory of 
meaning, complex concepts can be created by 
hierarchical composition of their components, 
whereas in a coarse-grained theory of meaning, 
complex meanings can only be achieved through 
a concatenation of concept sequences. 
The next three sections discuss the dichotomies of 
theories of nleaning, rich 'vs. auz~ere, molecular vs. 
holis*ic and coarse-grained vs. fine-grained where a 
few CBMT systems are classified according to the 
terminology introduced. This leads to a model of 
competence for CBMT. It appears that translation 
systems can either be designed to have a broad cov- 
erage or a high quali@. 
2 Rich vs. Austere CBMT 
A common characteristic of all CBMT systems is 
that the understanding of the translation task is de- 
rived fronl the understanding of the reference trans- 
lations. The inferred translation knowledge is used 
in the translation phase to generate new transla- 
tions. 
Collins (1998) distinguishes between Memory- 
Based MT, i.e. menlory heavy, linguistic light and 
Example-Based MT i.e. memory light and linguistic 
heavy. While the former systems implement an aus- 
tere theory of meaning, the latter make use of rich 
representations. 
The most superficial theory of understanding 
is implenlented in purely menlory-based MT ap- 
proaches where learning takes place only by extend- 
ing the reference text. No abstraction or generaliza- 
tion of the reference examples takes place. 
Translation Memories (TMs) are such purely 
memory based MT-systems. A TM e.g. TRADOS's 
Translator's Workbench (Heyn, 1996), and STAR's 
TRANSIT calculates the graphenfic similarity of the 
input text and the source side of the reference trans- 
lations and return the target string of the nlost sim- 
ilar translation examples as output. TMs make use 
of a set of reference translation examples and a (k- 
nn) retrieval algorithm. They iulplement an austere 
theory of nleaning because they cannot justify the 
use of a word other than by looking up all contexts 
in which the word occurs. They can, however, enu- 
merate all occurrences of a word in the reference 
text. 
The TM distributed by ZERES (Zer, 1997) follows 
a richer approach. The reference translations and 
the input sentence to be translated are lemmatized 
and part-of-speech tagged. The source  sen- 
tence is nlapped against the reference translations 
on a surface string level, on a lemma level and on 
a part-of-speech level. Those example translations 
which show greatest similarity to the input sentence 
with respect to the three levels of description are 
returned as the best available translation. 
Example Based Machine Translation (EBMT) 
systems (Sato and Nagao, 1990; Collins, 1998; 
Gilvenir and Cicekli, 1998; Carl, 1999; Brown, 1997) 
are richer systems. Translation examples are stored 
as feature and tree structures. Translation tenlplates 
are generated which contain - SOuletinles weighted 
- connections in those positions where the source 
 and the target  equivalences are 
strong. In the translation phase, a multi-layered 
mapping from the source  into the target 
 takes place on the level of templates and 
on the level of fillers. 
The ReVerb EBMT system (Collins, 1998) per- 
forms sub-sentential chunking and seeks to link con- 
stituents with the same function in the source and 
the target . A source  subject is 
translated as a target  subject and a source 
 object as a target  object. In case 
there is no appropriate translation template avail- 
able, single words can be replaced as well, at the 
expense of translation quality. 
The EBMT approach described in (Giivenir and 
Cicekli, 1998) makes use of morphological knowl- 
edge and relies on word stems as a basis for trans- 
lation. Translation templates are generalized fronl 
aligned sentences by substituting differences in sen- 
tence pairs with variables aud leaving the identical 
substrings unsubstituted. An iterative application 
of this nlethod generates translation examples and 
translation templates which serve as the basis for 
an example based MT system. An understanding 
consists of extraction of compositionally translatable 
substriugs and the generation of translation tem- 
plates. 
A similar approach is followed in EDGAR (Carl, 
1999). Sentences are morphologically analyzed and 
translation templates are decorated with features. 
Fillers in translation template slots are constrained 
to unify with these features. In addition to this, 
a shallow linguistic formalism is used to percolate 
features in derivation trees. 
Sato and Nagao (1990) proposed still richer repre- 
sentations where syntactically analyzed phrases and 
sentences are stored in a database. In the translation 
phase, most similar derivation trees are retrieved 
from the database and a target  deriva- 
tion tree is conlposed fronl the translated parts. By 
means of a thesaurus semantically similar lexical 
items may be exchanged in the derivation trees. 
Statistics based MT (SBMT) approaches imple- 
ment austere theories of lncaning. For instance, in 
Brown et al. (1990) a couple of models are pre- 
sented starting with simple stochastic translation 
models getting incrementally more complex and rich 
by introducing more random variables. No linguistic 
998 
Richness of Representation Richness of Representation Granularity of Representation 
sere 
syrt 
mot 
gra 
• 1 8e?T~ 
o2, 3 • G syn 
$4 TY~Or 
"5 *r "S,9 gra 
molecular nfixed holistic 
Atomicity of Representation 
Ol 
O2,3 06 
04 
08 o7 09,5 
word phrase sentence 
Granularity of Representation 
sent. 
phras, 
word 
O4,5 $9 
o6,7 
qb 1 o2,3 e8 
molecular mixed holistic 
Atonricity of Representation 
el: Sato and Nagao (1990) 
o4: ZERES Zer (1997) 
or: Brown (1997) 
• 2: EDGAR Carl (1999) 
• 5: TRADOS Heyn (1996) 
• s: Brown et al. (1990) 
• 3: GEBMT Gfivenir and Cicekli (1998) 
• 6: ReVerb Collins (1998) 
• 9: McLean (1992) 
\]figure 1: Atomicity, Granularity and Richness of CBMT 
analyses are taken into account in these approaches. 
However, in further research the authors plan to 
integrate linguistic knowledge such as inflectional 
analysis of verbs, nouns and adjectives. 
McLean (McLean, 1992) has proposed an austere 
approach where lie uses neural networks (NN) to 
translate surface strings from English to French. His 
approach functions similar to TM where the NN is 
u,;ed to classify the sequences of surface word forms 
according to the examples given in the reference 
translations. On a small set of examples hc shows 
that NN can successfully be applied for MT. 
3 Molecular vs. Holistic CBMT 
As discussed in the previous section, all CBMT sys- 
tems make use of sonle text dimensions in order to 
map a source  text into the target . 
TMs, for instance, rely on the set of graphenfical 
symbols i.e. the ASCII set. Richer systems use lcx- 
ical, morphological, syntactic and/or semantic de- 
scriptions. The degree to which the set of descrip- 
tions is independent from the reference translations 
determines the molecularity of the theory. The more 
the descriptions are learned from and thus depend 
on the reference translations the more the system 
becomes holistic. Learning descriptions from refer- 
cnce translations makes the system more robust and 
easy to adjust to a new text domain. 
SBMT approaches e.g. (Brown et al., 1990) have 
a purely holistic view on s. Every sentence 
of one  is considered to bca possible trans- 
lation of any sentence in the other . No 
account is given for the equivalence of the source 
 meaning and the target  meaning 
other than by means of global considerations con- 
cenfing frequencies of occurrence in the reference 
text. In order to compute the most probable trails- 
lations, each pair of items of the source  and 
the target  is associated with a certain prob- 
ability. This prior probability is derived from the 
reference text. In the translation phase, several tar- 
get  sequences are considered and the one 
with the highest posterior probability is then taken 
to be the translation of the source  string. 
Similarly, neural network based CBMT systems 
(McLean, 1992) are holistic approaches. The train- 
ing of the weights and the nfinimization of the classi- 
fication error relies oil the reference text as a whole. 
Temptations to extract rules from the trained neu- 
ral networks seek to isolate and make explicit aspects 
on how the net successfully classifies new sequences. 
The training process, however, remains holistic. 
TMs implement the molecular CBMT approach as 
they rely on a static distance metric which is inde- 
pendent from the size and content of the case base. 
TMs are molecular because they rely on a fixed and 
limited set of graphic symbols. Adding further ex- 
ample translations to the data base does not increase 
the set of the graphic symbols nor does it modify the 
distance metric. Learning capacities in TMs are triv- 
ial as their only way to learn is through extension of 
the example base. 
The translation templates generated by Giivenir 
and Cicekli (1998), for instance, differ according to 
the similarities and dissinfilarities found in the ref- 
erence text. Translation templates in this system 
thus reflect holistic aspects of the example transla- 
tions. The way in which morphological analyses is 
processed is, however, independent front the transla- 
tion examples and is thus a molecular aspect in the 
system. 
Similarly, the ReVerb EBMT system (Collins, 
1998) makes use of holistic components. The ref- 
erence text is part-of-speech tagged. The length of 
translation segments as well as their most likely lift- 
tim and final words arc calculated based on proba- 
999 
Expected Coverage of the System 
high 
low 
% 
low high 
Expected 2¥anslation Quality 
• 1: Sato and Nagao (1990) 
• 2: Carl (1999) 
• 3: Giivenir and Cicekli (1998) 
e4: Zer (1997) 
e~: Heyn (1996) 
• 6: Collins (1998) 
• ?: Brown (1997) 
• s: Brown et al. (1990) 
• 9: McLean (1992) 
Figure 2: A Model of Competence for CBMT 
bilities found in the reference text. 
4 Coarse vs. Fine Graining CBMT 
One task that all MT systems perform is to segment 
the text to be translated into translation units which 
-- to a certain extent -- can be translated indepen- 
dently. The ways in which segmentation takes place 
and how the translated segments are joined together 
in the target  are different in each MT sys- 
tem. 
In (Collins, 1998) segmentation takes place on a 
phrasal level. Due to the lack of a rich morphological 
representation, agreement cannot always be granted 
in the target  when translating single words 
from English to German. Reliable translation can- 
not be guaranteed when phrases in the target lan- 
guage - or parts of it - are moved from one position 
(e.g. the object position) into another one (e.g. a 
subject position). 
In (Giivenir and Cicekli, 1998), this situation is 
even more problematic because there are no restric- 
tions on possible fillers of translation template slots. 
Thus, a slot which has originally been filled with an 
object can, in the translation process, even accom- 
modate an adverb or the subject. 
SBMT approaches perform fine-grained segmen- 
tation. Brown et al. (1990) segment the input 
sentences into words where for each source-target 
 word pair translation probabilities, fertil- 
ity probabilities, alignment probabilities etc. are 
computed. Coarse-grained segmentation are unre- 
alistic because sequences of 3 or more words (so- 
called n-grams) occur very rarely for n > 3 even ill 
huge learning corpora 1. Statistical (and probabilis- 
tic) systems rely on word frequencies found in texts 
and usually cannot extrapolate from a very small 
number of word occurrences. A statistical  
1 Brown et al. (1990) uses the Hansard French-English text 
containing several million words. 
model assigns to each n-gram a probability which 
enables the system to generate the most likely tar- 
get  strings. 
5 A Competence Model for CBMT 
A competence model is presented as two indepen- 
dent parameters, i.e. Coverage and Quality (see Fig- 
ure 2). 
• Coverage of the system refers to the extent to 
which a variety of source  texts can be 
translated. A system has a high coverage if a 
great variety of texts can be translated. A low- 
coverage system can translate only restricted 
texts of a certain domain with limited ternfi- 
nology and linguistic structures. 
• Quality refers to the degree to which an MT 
system produces successful translations. A sys- 
tem has a low quality if the produced transla- 
tions are not even informative in the sense that 
a user cannot understand what the source text 
is about. A high quality MT-system produces 
user-oriented and correct translations with re- 
spect to text type, terminological preferences, 
personal style, etc. 
An MT systenr with low coverage and low quality 
is completely uninteresting. Such a system comes 
close to a randonr number generator as it translates 
few texts in an unpredictable way. 
An MT system with high coverage and "not-too- 
bad" quality can be useful in a Web-application 
where a great variety of texts are to be translated 
for occasional users which want to grasp the basic 
ideas of a foreign text. On the other hand a system 
with high quality and restricted coverage might be 
useful for in-house MT-applications or a controlled 
. 
An MT system with high coverage and high qual- 
ity would translate any type of text to everyone's 
1000 
satisfaction, lIowever, as one can expect, such a 
system seems to bc not feasible. 
Boitct (1999) proposes "the (tentative) formula: 
Coverage • Quality -= K "where K depends on the 
MT technology and the amount of work encoded in 
the system. The question, then, is when is the max~ 
imum K possible and how nluch work do we want 
to invest for what purpose. Moreover a given K can 
mean high coverage and low quality, or it can mean 
the reverse. 
The expected quality of a CBMT system increases 
when segmenting more coarsely the input text. Con- 
sequently, a low coverage must bc expected due to 
the combinatorial explosion of the number of longer 
(:hunks. in order for a fine-grailfing system to gencr- 
z~te at least informative translations, further knowl- 
edge resources need be considered. These knowledge 
resources may be either pre-defined and molecular or 
they can be derived fronl reference translations and 
holistic. 
TMs focus on the quality of translations. Only 
large clusters of nlcaning entities are translated into 
the target  in the hope that such clus- 
ters will not interfere with the context from which 
they are taken. Broader coverage can be achieved 
through finer grained segmentation of the input into 
phrases or single terms. Systems which finely seg- 
ment texts use rich representation s in or- 
der to adapt the translation units to the target lan- 
guage context or, as in the case of SBMT systems, 
use holistic derived constraints. 
What can bc learned and what should be learned 
from the reference text, how to represent the in- 
ferred knowledge, how to combine it with pre-defincd 
knowledge and the impact of difl'erent settings on the 
constant K in the formula of Boitet (1999) are all 
still open question for CBMT-design. 
6 Conclusion 
Machine Tr'anslation (MT) is a lneaning preserving 
raapping from a source  text into a target 
 text. In order to enable a computer system 
to perform such a mapping, it is provided with a 
formalized theory of meaning. 
Theories of meaning are characterized by three di- 
chotomies: they call be holistic or molecular, aus- 
tere or rich and they can be fine-grained or coarsc- 
Muiucd. 
A number of CBMT systems - translation mem- 
ories, example-based and statistical-based nlachine 
translation systenls - arc examined with respect to 
these dichotomies. Ill a system that uses a rich the- 
ory of meaning, complex representations arc com- 
puted including morphological, syntactical and se- 
rnantical representations, while with an austere the- 
ory the system relics on the mere graphcnfic surface 
form of the text. In a holistie implementation mean- 
ing descriptions are derived from reference transla- 
tions while in a molecular approach the meaning dc- 
scriptions are obtained from a finite set of prede- 
fined features. In a fine-grained theory, the minimal 
length of a translation unit is equivalent to a mor- 
1)heme while in a coarse-grained theory this amounts 
to a morphenle cluster, a phrase or a sentence. 
According to the implemented theory of meaning, 
one can expect to obtain high quality translations or 
a good covera.qc of the CBMT system. 
The more the system makes use of coarse-grained 
translation units, the higher is tlle expected trans- 
lation quality. The more the theory uses rich repre- 
sentations thc more the system may achieve broad 
coverage. CBMT systems can be tuned to achieve 
cither of the two goals. 

References 

Christian Boitet. 1999. A research perspective 
on how to democratize machine translation and 
translation aides aiming at high quality final out- 
put. In MT-Summit '99. 

Peter F. Brown, 3. Cockc, Stephen A. Della Pietra, 
Vincent 3. Della Pietra, F. Jelinek, Mercer Robert 
L., and Roossiu P.S. 1990. A statistical approach 
to machine translation. Computational Linguis- 
tics, 16:79-85. 

Ralf D. Brown. 1997. Automated Dictionary Ex- 
traction for "Knowledge-Free" Example-Based 
Translation. In TMI-97, pages 111-118. 

Michael Carl. 1999. Inducing Translation Templates 
for Example-Based Machine Translation. hi MT- 
Summit VII: 

Br6na Collins. 1998. Example-Based Machine 
Tra'nMation: An Adaptation-Guided Retrieval Ap- 
pTvach. Ph.D. thesis, Trinity College, Dublin. 

Michael Dummett. 1975. What is a Theory of 
Meaning? In Mind and Language. Oxford Uni- 
versity Press, Oxford. 

Halil Altay Giivenir and Ilyas Cicekli. 1998. Learn- 
ing Translation Templates from Examples. Infor- 
mation Systems, 23(6):353-363. 

Matthias Hcyn. 1996. Integrating machine trans- 
lation into translation memory systems. In Eu- 
ropean Asscociation for Machine Translation - 
Workshop Proceedings, pages 111-123, ISSCO~ 
Geneva. 

Ian J. McLean. 1992. Example-Based Machine 
Translation using Connectionist Matching. In 
TMI-92. 

Makoto Nagao. 1989. Machine Translation ftbw Far 
Can It Go. Oxford University Press, Oxford. 

S. Sato and M. Nagao. 1990. Towards memory- 
based translation. In COLING-90. 

Zeres GmbH, Bochunl, Germany, 1997. ZERE- 
STRANS Bcnutzcrhandbuch. 
