Universal Grammar and Lexis for Quick Ramp-Up of MT 
Systems 
Abstract 
This paper introduces Boas, a semi-automatic 
knowledge elicitation system that guides a team of 
two people through the process of developing the 
static knowledge sources for a moderate-quality, 
broad-coverage MT system from any "low-den- 
sity" language into English in about six months. 
The paper focuses on some issues in the elicitation 
of descriptive knowledge in Boas and also the issue 
of the principled reuse of pre-existing resources, 
such as a lexicon, an ontology, and an English gen- 
eration module, among others, made possible by 
the fact that the client MT system is developed for 
a single target language. 
1. Introduction: The Boas Project 
This paper presents Boas, a semi-automatic knowl- 
edge elicitation system that guides a team of two 
people through the process of developing static 
knowledge sources for a moderate-quality, broad- 
coverage MT system from any "low-density"l lan- 
guage into English in about six months. Boas con- 
tains knowledge about human language and means 
of realization of its phenomena in a number of spe- 
cific languages and is, thus, a kind of a "linguist in 
the box" that helps non-professional acquirers with 
the task, whose complexity is legendary. 2 
Sergei Nirenburg and Victor Raskin 
Computing Research Laboratory 
New Mexico State University 
Las Cruces, N.M. 88003, U.S.A. 
{ sergei, raskin } @ crl.nmsu.edu 
dictated by the amount of language work which can 
be carded out, given the resources available. The 
rules of the game specifically exclude linguists and 
MT developers from the acquisition team. Under 
such conditions, the only sensible course of action 
is to attempt to collect as much knowledge about as 
many languages as possible in advance and include 
it in the elicitation system itself. 
Section 2 below is devoted to defining the format 
of the descriptive language knowledge to be elic- 
ited from the acquirers through Boas. The descrip- 
tive language knowledge, which we address in this 
paper, is, later in the process of Boas operation, 
converted into operational knowledge capable of 
supporting the processes of source language analy- 
sis and source-target transfer. In Section 3, we dis- 
cuss how work on ontological semantics in MT can 
contribute to Boas in a situation of a single target 
language, English. In Section 4, we address the 
procedure for descriptive language knowledge 
acquisition in Boas, both in terms of resources cre- 
ated and reused and in terms of the actual elicita- 
tion techniques, differentiating between the 
acquisition of grammatical and lexical parameters. 
The knowledge about language elicited by Boas 
from the acquirers aims to support MT output qual- 
ity which is roughly commensurate with the out- 
puts of the better commercial systems, such as 
Systran. These relatively modest expectations are 
1 "Density" refers roughly to the amount of effort having been 
previously expended in the field on computational descriptions 
of particular languages, resulting in the creation of a variety of 
machine-tractable resources--text corpora, grammars, lexi. 
cons, analyzers, etc. Thus, Spanish will most probably count a ; 
"high-density" while, say, Tagalog will not. 
2. Defining Parameters for Boas 
The descriptive knowledge about the source lan- 
guage is a set of statements about morphological, 
syntactic, and lexical properties (parameters) of a 
language, listed together with their values and real- 
ization options. Data about each parameter 
includes the language, the name of the parameter, 
the list of entities to which this parameter applies 
(its domain) and the list of parameter values (its 
2 We have introduced Boas and discussed some per- 
tinent theoretical issues in Nirenburg (1998). In this 
paper, we focus on the more practical aspects of 
Boas implementation. 
975 
range). Moreover, parameter values have an associ- 
ated set of realization options in each language. For 
instance, the parameter of gender in Ukrainian is 
described as follows: 
language: Ukrainian 
parameter: gender 
domain: nouns, adjectives, possessives (head agree- 
ment), verbs in past tense 
range (parameter values): masculine, feminine, neuter 
realization: \[gender markers in lexicon for nouns; 
inflection paradigms for adjectives, possessives and 
verbs in past tense\] 
For comparison, the Hebrew gender is described 
differently: 
language: Hebrew 
parameter: gender 
domain: nouns, adjectives, possessive (non-first-person 
possessor agreement), finite verbs 
range (parameter values): masculine, feminine 
realization: \[gender markers in lexicon for nouns; gen- 
der inflection paradigms for adjectives, possessives and 
verbs\] 
Instead of discovering parameters from scratch for 
each language, it is preferable, in order to ensure 
uniformity and systematicity of Boas operation, to 
come up with a complete list of all possible param- 
eters in natural languages, with complete lists of 
their possible values attached. The attainability of 
such a resource becomes then a central issue. 
The terms 'parameter' and 'value' are used in our 
task in the same sense as in the school of theoreti- 
cal syntactic thought consecutively known as gov- 
ernment and binding (Chomsky 1981), principles 
and parameters (Chomsky 1986) and the minimal- 
ist position (Chomsky 1995). The theory postulates 
a small number of general principles defining the 
innate human language faculty and a larger number 
of language parameters, which implement these 
principles by selecting concrete values for particu- 
lar languages. The complete set of such parameters 
and values constitutes a universal grammar (UG)-- 
see also Culikover (1997), Lightfoot (1991) an( 
Webelhuth (1992). 
Unfortunately, work within this approach has no~ 
stressed the descriptive task of creating a compre- 
hensive inventory of universal grammar parame- 
ters or even those for particular languages o\] 
language families. For Project Boas, it means tha~. 
both the nature of the parameters it would be using 
and their inventory has to be developed in-house. 
In order to define a set of parameters for Boas, it is 
essential to distinguish among the language phe- 
nomena that should be accorded the status of 
parameter and those that should be understood as 
parameter values or their realizations. Still other 
phenomena may remain, at least for the task at 
hand, outside the parameter system. We believe, 
with Dorr (1993), that parameters may be under- 
stood as building blocks of an interlingua in MT. 
We reserve judgment about whether every compo- 
nent of an interlingua is by definition parametric 3. 
Thus, the parameter "lexical category" has a range 
of values { V, N, Adj, Adv, ... }. Any of these values 
may itself be considered a parameter. If viewed 
within a single language, their values are, ulti- 
mately, all words in the language which belong to 
the respective lexical categories. The realizations 
of these values are the specific forms of these 
words, which appear in text decorated with realiza- 
tions of appropriate values of such morphological 
parameters as NUMBER, GENDER, CASE, etc. 
An example of a syntactic parameter is HEAD-MOD- 
IFIER DEPENDENCY, whose values include such 
pairs as "head: noun; modifier: adjective; .... head: 
verb; modifier: adverb," "head: noun; modifier: 
relative clause" and others. Realization options for 
these values involve word or constituent order 
rules (for instance, post- or pre-posing) and agree- 
ment rules. 
Lexical parameters are viewed as language-inde- 
pendent lexical meanings (ontological concepts), 
such as TABLEFuRNITUR E. The values of this parame- 
ter are the word senses corresponding to this onto- 
logical concept across the inventory of languages. 
The realizations for these values are the words or 
phrases that express this meaning in each language, 
with a possibility of a lexical gap (a null value) 
3 Thus, for instance, a morphological analyzer for 
Turkish uses information that does not have to be 
expressed parametrically, such as data about nomi- 
nal suffixes, which one needs to know in order to 
recognize a noun form but which do not correspond 
to any parametric value that needs to be expressed 
in English; similarly, a Russian verbal prefix may 
help determine the aspect value but does not realize 
a distinct parametric value of its own. 
976 
included. Sense: furniture 
3. Translation Environment Supported by Boas 
The single-target-language (English) environment 
which Boas serves allows for simplification of both 
system implementation and the acquisition process 
compared to the case of multiple SLs and TLs. 
First, only one text synthesis module needs to be 
built. Second, many fewer transfer components 
(bilingual lexicons, transduction tables for closed- 
class lexical items, feature and structure transfer 
tables) are needed. In fact, this situation almost 
licenses the transfer approach, as the combinatorial 
argument for interlingual MT is weaker here than 
in the case of multiple TLs (see, however, below 
and fn. 3). Third, it appears that knowledge acqui- 
sition for a new SL may be aided by the presence 
of a number of resources already developed for the 
TL. 
These resources include a) the vocabulary of the 
generation lexicon which can serve as the list of 
lexical parameters for compiling the bilingual dic- 
tionary; b) a world model (ontology) providing the 
terms in which the senses of the English words and 
phrases are expressed (Boas uses the ontology 
from the Mikrokosmos project at NMSU CRL-- 
see Mahesh and Nirenburg 1995); c) the structure 
and term definitions from the text meaning repre- 
sentation in Mikrokosmos (see, for instance, Ony- 
shkevych and Nirenburg 1995), to help guide 
parameter elicitation; d) the set of English closed- 
class lexical items and morphemes; e) English 
grammar used in text synthesis, which provides the 
TL side of structural transfer rules in the runtime 
MT system (see Figure 1 above); and f) a set of 
"ecological" parameters and their realizations for 
English. While a complete description of the use of 
all of the above resources is beyond the scope of 
this paper, we will give a few brief illustrations. 
The list of English word senses seeds the acquisi- 
tion of the SL lexicon. The acquirer first simply 
translates all the word senses into SL and then adds 
SL features to the corresponding entries as needed. 
The result is an SL-TL transfer dictionary which 
also serves as the lexicon for SL analysis. The 
acquirer gets a lemma with all its senses: 
Entry: table-n 1 
POS: noun 
Entry: table-n2 
POS: noun 
Sense: diagram 
and produces the following SL lexicon entries (the 
example is in Hebrew): 
Entry: shulxan-n 1 
POS: noun 
Gender: m (plural -ot) 
Sense: table-nl 
Entry: tavla-nl 
POS: noun 
Gender: f 
Sense: table-n2. 
In the examples, the senses are conveniently 
explained not in any specially designed lexicon/ 
ontology notation, but rather through translation 
into English. Because each English translation is 
the entry head for a sense which is already 
explained in an ontology-based semantic metalan- 
guage in the already existing Mikrokosmos lexi- 
con, Expedition can benefit from richer semantic 
information than that acquired using Boas.We use 
the Mikrokosmos ontology as a search space to 
support word sense disambiguation. The method 
(suggested by Jim Cowie) depends on the bilingual 
dictionary of the kind illustrated above. Coarse 
grain-size lexical mappings of TL word senses to 
ontological concepts are established (for instance, 
chihuahua and poodle may be both linked to the 
ontological concept DOG). The system, thus, knows 
that both chihuahuas and poodles have four legs, 
are carnivorous, domesticated, etc. 
The disambiguation method uses such ontological 
constraints by computing a distance in the ontolog- 
ical space between ambiguous word senses on the 
one hand and the senses of other words in their 
context. SL syntactic information helps to guide 
the disambiguation process by providing additional 
constraints. Thus, closeness between senses of 
words belonging to the same syntactic unit is 
weighed more heavily than that across unit bound- 
aries. 
The acquisition of the complete list of parameters 
in the single-TL environment is facilitated not only 
by the availability of the initial set of lexical 
parameters but also by the prominence of the syn- 
977 
tactic and morphological parameters activated in 
English. Thus, for morphology and syntax, the 
existence of such comprehensive grammars of 
English as Quirk et al. (1985) allows a quick 
round-up of the major parameters. One cannot 
always limit oneself, however, to TL-induced 
acquisition as we have demonstrated in the previ- 
ous section on the example of GENDER in English. 
4. Source Language Knowledge Acquisition 
Acquisition of descriptive knowledge about a lan- 
guage consists in Boas of a set of elicitation "epi- 
sodes." The episodes have been clustered, very 
unevenly, into six large classes, namely, morphol- 
ogy, closed-class items, open-class items, syntax, 
transfer features, and ecology. Each episode is an 
HTML document, accessible through the standard 
Web browsers. Each page deals with one parameter 
and elicits information on its values present in the 
source language as well as the realizations of these 
values. It is morphology which seems to require 
the greatest number of parametric episodes, though 
the total is not very high: verbs, around 30 episodes 
for the finite forms, and about 40 for the non-finite 
forms; nouns, around 20; adverbs and adjectives, 
under 5. Morphology does include these four sec- 
tions. 
Closed-class items are pronouns, temporal rela- 
tions, spatial relations, and case-like relations, e.g., 
prepositional phrases (the morphological case is, of 
course, handled in the noun section of the morphol- 
ogy class). Each closed-class page deals with one 
English closed-class item in one appropriate sense 
and elicits all the possible translations of that item 
into the source language (or, more accurately, all 
possible expressions in the source language which 
may be translated into English with this item in this 
sense), with the complete morphological and syn- 
tactic information on each such translation. 
Because there are, roughly, 200 closed-items in 
English (and many other languages), this class 
requires the greatest number of Web pages but they 
are mot parametric and quite straightforward. 
Open-class items are acquired lexically, with the 
help of, essentially, one huge standard elicitation 
episode/Web page. Lexical acquisition proceeds as 
described in Section 3 and further aided by a spe- 
cial resource created for Boas/Expedition: continu- 
ing our work on significantly reducing the number 
of different senses in a lexicon entry by combining 
related senses in MRDs (see Nirenburg et al. 1995) 
and, more rarely, deleting the marginal ones, we 
have manually reduced a combined (Mikrokosmos 
and other sources) English lexicon of about 28,000 
words to about 40,000 word senses, each of which 
serves as a lexical parameter for SL acquisition. In 
addition, frequency analyses of SL corpora will 
provide the requirements for adding lexical param- 
eters from SL, not just TL. 
Syntax will have rather few elicitation episodes 
since much of it will be collected automatically 
from a large corpus, pre-tagged morphologically 
by Boas--automatically. 
The elicitation pages in transfer features class will 
deal with non-standard transfer correpondences, 
and ecology with proper names, punctuation, stan- 
dard acronyms for numbers, and other print con- 
ventions of the source language. It is unlikely to be 
a very numerous class and it is, of course, largely 
non-parametric. 
It is the morphology class which has necessitated 
the heaviest use of and most remedial effort on 
parameter inventories. We have largely expanded 
the inventory of parameters, previously acquired in 
the PROPERTY branch of the Mikrokosmos ontol- 
ogy: most of the "grammatical" meanings, realized 
in any one of the Mikrokosmos languages, such as 
English, Spanish, and Chinese, are already 
recorded and systematized there. We have also had 
to compile what we hope to turn out to be the most 
complete list of both parameters and their values, 
such as noun case (around 30 values), verb mood 
(about a dozen), verb aspect (about two dozen), 
etc. 
A standard morphological episode elicits the val- 
ues for a parameter which the user has already 
marked as present in the source language on the 
previous Web page. The moment the box for that 
parameter was checked there, the user is taken to 
the values page, where Boas offers a complete list 
of existing values for that parameter and requests 
that the user select all that apply. 
Two additional factors deserve a special mention. 
First, each elicitation episode is supported with 
context-sensitive online help, which can be also 
accessed as a complete morphological, syntactic, 
closed-class, etc. tutorial. This tutorial, as far as we 
978 
know, is the only available sketch of universal 
grammar. Secondly, each parameter and value 
choice provides for the selection of "other" 
unlisted values, and great care is taken to assist the 
user in naming the parameter or value as well as 
determining the appropriate values for each user- 
introduced parameter with the appropriate realiza- 
tions. 
At the conclusion of each elicitation cycle, such as 
nouns or verb finite forms, all the elicited informa- 
tion is presented to the user for checking, correct- 
ing, and editing in the form of a paradigm table, 
which id the Cartesian product of all the estab- 
lished parameters and values. The user is also 
guided through the parts of the source language 
grammar which deals with exceptional paradigms. 
It should be also noted that, in open-class acquisi- 
tion, the paradigms for each acquired source lan- 
guage item will be assigned to one of the already 
established types or, alternatively, a new excep- 
tional type will be added, if necessary. 
The most difficult issues in acquisition involve the 
transcategorial realization of values, such as the 
signalling of a noun case in the verb or non-stan- 
dard clitics, or the lexical realizations in SL of 
grammatical parameters in TL, such as the possible 
absence of continuous tenses in a SL and the 
choice of a grammatical realization of such lexical 
values as "right now" in the SL as the present con- 
tinuous marker on the corresponding verb. Interest- 
ingly also, clitics and similar morphological 
"complications" of source languages are unlikely 
to present a significant problem either in elicitation 
or in transfer, primarily because of the single target 
language environment in Boas and ensuing lack of 
necessity to generate (rather than just to analyze) 
much morphological complexity. 
5. Conclusion: Computational Field 
Linguistics? 
Boas exemplifies the broad-coverage descriptive 
approach to NLP (see, for instance, Nirenburg and 
Raskin 1996) and adds to it a complementary new 
commitment to developing and using automated 
field-linguistic methodology (cf. Nirenburg 1998). 
This goes hand in hand with the evolving reorienta- 
tion of theoretical linguistics from selective theo- 
rizing, in terms of prevalent atomistic rule 
postulation and testing, back to the primary goal of 
linguistics, which is a theory-based language 
description. 
A full evaluation of Boas, that is, the development 
of the first actual SL to English MT system over a 
six-month time interval, will take place within the 
next two years. 
Acknowledgments 
The research reported in this paper was sup- 
ported by Contract MDA904-92-C-5189 with the 
U.S. Department of Defense. Victor Raskin is 
grateful to Purdue University for permitting him to 
consult CRL/NMSU. 

References 
Chomsky, N. 1981. Lectures on Government and 
Binding. Dordrecht: Foris. 
Cbomsky, N. 1986. Knowledge of Language: Its Na- 
ture, Origin, and Use. New York: Praeger. 
Chomsky, N. 1995. The Minimalist Program. Cam- 
bridge, MA: MIT Press. 
Comrie, B., and N. Smith 1977. Lingua Descriptive 
Studies: Questionnaire. Lingua 42:1, pp. 1-72. 
Culikover, P. W. 1997. Principles and Parameters. An 
Introduction to Syntactic Theory. Oxford 
University Press. 
Dorr, B. 1993. Interlingual Machine Translation: A Pa- 
rametrized Approach. Artificial Intelligence 63, 
429-92. 
Lightfoot, D. 1991. How to Set Parameters. Cam- 
bridge, MA: MIT Press 
Mahesh, K., and S. Nirenburg 1995. Semantic Classifi- 
cation for Practical Natural Language Process- 
ing. In: Proceedings of the Sixth ASlS SIG/ 
CR Classification Research Workshop: An 
Interdisciplinary Meeting. Chicago, IL. 
Nirenburg, S. 1998. Project Boas: "A Linguist in the 
Box" as a Multi-Purpose Language Resource. 
In: Proceedings of The First Lexical 
Resources and Evaluation Conference. 
Granada, Spain. 
Nirenburg, S., and V. Raskin 19961 Ten Choices in Lexi- 
cal Semantics. MCCS-96-304, Las Cruces, 
N.M.: NMSU CRL. 
Nirenburg, S,, V. Raskin, and B. Onyshkevych 1995. 
Apologiae Ontologiae. TMI '95, Leuven. 
Onyshkevych, B., and S. Nirenburg. 1995. "A Lexicon 
for Knowledge-Based MT." Machine 
Translation, 10:1-2, pp. 5-57. 
Quirk, R., S. Greenbaum, G. Leech, and J. Svartvik 
1985. A Comprehensive Grammar of the 
English Language. London: Longman. 
Webelhuth, G. 1992. Principles and Parameters of 
Syntactic Saturation. New York and Oxford: 
Oxford University Press. 
