Acquisii:ion of a Language Computationall Model for NItA" 
Svetlana St IERI'~MI{TYI!VA 
Computing Research l,aboratory 
New Mexico State University 
Las Cruces, NM, USA ,~8003 
lana(a;crl.nmsu.edu 
Sergei NIP, ENBURG 
(?omputing Research I+aboratory 
New Mexico State University 
l.as Cruces, NM, USA 88003 
sergei@crl.nmsu.edu 
Aiistrael 
This paper describes art approach to actively 
acquire a  computational model. 
The purpose of this acquisition is rapid 
development of NLP systems. The model is 
created with the syntax module of the Boas 
knowledge elicitation system for a quick 
ramp tip of a standard transfer-.based 
machine translation system from \[, into 
English. 
\[ ntroduetlon 
Resource acquisition for NI.P systems is a well. 
known bottleneck in  engineering. It 
would be a clear advantage to have a 
methodology that could provide a nmch cheaper 
way of NI,I" resources acquisition. The 
methodology should be universal in the sense 
that it could be applied to any  and 
require no skilled labour of lm)fossionals. Our 
approach attempts just that. 
We describe it on the example of the syntax 
module of the Boas knowledge elicitation 
system tbr a quick ramp tip of a standard 
transfer-based machine tran<;laliori system from 
any langnage into English (Nirenburg 1998). 
This work is a parl of an ongoing project 
devoted to the creation of resources tbr NI,P by 
eliciting knowledge \[i-on-i intbrnlanis. 
1 Other Work on Synta)~ Acquisition 
i'\]xperinlents in "single -.slop" automatic 
acquisitioil of knowledge have been amoni~ lhe 
most lhshional)le topics in NI,I ) over the past 
decade. ()no can mention work on automalic 
acquisition of phrase structure usim3 distribution 
analyst,,; 0h-ill ci al 1990). "\[hc problerns with 
the current fully automatic corpus-based 
approaches include difficulties of maintaining 
any system based on them, due to the 
opaqueness of the method and the data to the 
 engineer. At the present time, the most 
promising NLP systems include elements of 
both corpus-based and human knowledge-based 
methods. One example is acquisition of Twisted 
Pair Grannnar (Jones and ttavrilla 1998) for a 
pair of English and a source  (SL). 
Another example of a mixture of corpus-based 
and human knowledge-based methods is a 
system to generate a I,exicalized Tree-Adjoining 
Gramn-iar (F. Xia et al. 1999) automatically from 
all abstract specification of a . Grossly 
sin-lplil~/ing and generalizing due to lack of 
space, one can state that these experiments are 
seldon-i comprehensive in coverage and their 
results ate not yet directly useful iri 
comprehensive applications, such as MT. 
7 Al:quisitim~ of Syntax in Boas 
2.1 Miethodolo~ies for Selection of Syntax 
Parameters 
In general, tile issue of the selection of 
parameters tbr grmnmar acquisition is one of the 
main problems tbr which there is rio single 
answer. Parameters applicable to more than one 
 are studied m the field of  
universals as well as lhe principles-and- 
parameters ap\[)roach (Chomsky 1981) arid its 
successors ((Tholnsky 1995). Widely devised as 
the ba:ds of universal granlmar, the principles-. 
and--parameters approach has Ibcused on the 
uiliversaliiy of coitaill I()rn-ial grammatical i-riles 
within thai particular approach rather on iho 
sub~tarllive and exhaustive lisl of universal 
parameters., a subset of which is applicable to 
each natural hm,<.~uage., along with lhcii ° 
llll 
corresponding sets of values, such as a 
parameter set of nominal cases. In some other 
approaches, parameters and parameter values are 
either not sought out or are expected to be 
obtained automatically (e <, Brown et al. 1990; 
Goldstein 1998), and, while holding promise for 
the tiittire as a potential component of an 
elicitation system, cannot, at this time, lbnn the 
basis of an entire system of this kind. 
lit order to ensure uniformity and systematicity 
of operation of a  knowledge elicitation 
system, such as Boas, it is desirable to come tip 
with a comprehensive list of all possible 
parameters in natural lalguages and, for each 
such parameter, to create a cumulative list of its 
possible values in all the s that Boas 
can expect as SLs. Three basic methodological 
approaches are used in Boas. 
Expectation-driven methodology: covering the 
material by collecting cross-linguistic 
information on lexical and grammatical 
parameters, including their possible values and 
realizations, and asking the user to choose what 
holds in SL; while it is beyond the means of the 
current prqiect to check all extant s fbr 
possible new parameters~ we have included 
infomlation from 25 s. 
Goal-driven methodology: in the spirit of the 
"demand-side" approach to NLP (Nirenburg 
1996) Boas was tailored lbr elicitation of Mr 
relevant parameters rather than any syntactic 
parameters that can be postulated. A parameter 
was considered to be relevant if it was necessary 
tbr the parser and the generator used in MT in 
the Expedition project (http:/tcrl,NMSU.Edu/ 
expeditiorl/). 
The parser used is a heuristic clause chunker 
developed at NMSU CP,\[, which replaces the 
complex system of phrase structure rules in a 
traditional '2 erammar and uses  specific 
information, among thent word order (SVO vs. 
SOV), clause element (sul!ject, o\[!iect, etc.) 
marking, agreement marking, nouil phrase 
structure pattern, position of a head. 
l)ata-driven methodology: prtmlpiillg the user 
by English words and phrases and requesting 
translatioris or othcr rcnderin,,s in SI.; data- 
driven acquisition is the first choice, wherever 
l'easible, because it is the easiest type of work 
lbr the userst; In Boas, data-driven acquisition is 
guided by the resident English knowledge 
sources. 
2.2 Types of Syntax Parameters in Boas 
The parameters which are elicited through the 
syntax module of Boas include 2 what we call 
diagnostic and restricting parameters. 
Diagnostic parameters are those whose values 
help determine clause structure lbr correct 
structural transfer and translation of clause 
constituents. For example, in s which 
use grammatical case, the subject is usually 
marked by the nominative, ergative or absolutive 
case; direct objects are usually marked by the 
accusative case, etc. \]he list of the currently 
used diagnostic parameters in Boas includes: 
bask sentence structure parameters: word 
order preferences, grammatical fimctions 
(subject marking direct object marking, indirect 
ol:tiect marking, complement marking, adverbial 
rnarking, verb marking), clause element 
agreement marking, clause boundary marking, 
and bask noun phrase structure parameters: 
POS patterns with head marking, phrase 
boundary marking, noun phrase component 
agreement 
Restricting parameters determine the scope of 
usage of diagnostic parameters. Some of the 
diagnostic paralneter values can only occur 
simultaneously with certain restricting parameter 
vatues. For exainple, in s with the 
ergative construction the case of grammatical 
subject is restricted by the tense and aspect of 
the main verb (Mel'chuk 1998). 
t l{emember: they are not stipposed to be trained 
linguists but are expected to be able to translate 
between the source  and \['nglish. 
2Such iraditionally naolphological paramctcr,~ >; part. 
of speech, number, gender, w)ice, aspect, etc. arc 
elicited l 7 the naorphological module of Boas and arc 
prerequisites \[Bl the syntax module. 
1112 
2°3 The Flicitatior~ Procedure 
PrereqnisRes fl~r syntax elicitationo l)ata that 
drives syntax elicitation is obtained at earlier 
stages of elicitation, namely morphology o- 
parameters :-;UCll as Part of speech, (lender, 
Number, Person, Voice, Aspect, etc., as well as 
value sets tbr those parameters; lexieal 
acquisition of a small SL-English le×ieon to 
help work with the examples; the entries in the 
dictiotmry contain all the word forum and feature 
vahies of a SL lexeine and its English 
equivalentS° amt a very small corpus of 
carefllliy preselected and pretagged English 
noun phrases and sentences, used as examples° 
The inventory of tags and represeritation 
format. The tags for NPs include head and 
parameter values: The parameter (feature) set 
consists of Part of speech, Case, Number, 
Gender, Animacy arid I)efiniteness (the values 
of the latter two may pose restrictions on 
agreement of NP components). Every NP is 
represented in tile Boas knowledge base in the 
fbmi era typed feature structure as illustrated by 
the following example (the sign "#" inarks the 
head): 
\["a good #boy"-: \[struct.ure:nouz_,--phrase\] 
\[ "a"- \[pos :determiner, 
number:sizlgular, root::"a"\]\] 
\["good"= \[pos : adjective, 
root : "good" \] \] 
\["boy"- \[pos:noun,case:nominative 4, 
number : s ingu\]_ar, animacy : anilna te, 
root: "boy", head:l\]\]\] 
Two kinds of tags are used for sentence 
taggirtg tags that t-efi:r to the whole seutence 
and tags for clause elen~ents. Sentences are 
assigned yah.los of such restricting parameters as; 
3We inchlde hl the prerequisite knowledge as much 
overtly listed linguistic information as possiMe, to 
avoid the necessity of atmmmtic morphological 
analysis and generation which caililot guar'_iiltec abso-- 
\[utcly correct results. This is possible title tO a Sll/all 
size of the Icxhson used for syntax exarnples. 
'<As we rise i:i set o\[" t-lnglish NPs out of context, we 
believe tl-lat every phra,'~c will be understood as being 
hi tile noininative case. 
"clause type," "°voice," "tense" and "aspect". 
(Ganse elements are tagged with the vahie of the 
diagnostic paraineter "'syntactic functiotf' and 
wllues of tile restricting parameters "chtuse 
element realizatiol<" "animacy" and 
"definiteness". Clause elements also inherit 
sontellce lags. Senloncos are tagged in Boas as 
shown by the following exatnple (the 17.)im of 
representation is ;l typed feature structure): 
\["the boy give<~ a book to his teacher":: 
\[structure:sentence, form:affirmative,el 
ause-type:main, voice: active 
tense:present, aspect:indefinite\] 
\["the boy"= if unction:subject, 
realization:noun-phrase, 
animacy : animate, 
definiteness:definite, head- 
root : "boy" \] \] 
\["gives"= \[function:verb, 
realization:verb, head-root: "give"\]\] 
\["a book"= \[function:direct-object, 
realization : noun-phrase, 
animacy : inanimate, 
definiteness:indefinite, head- 
root : "boo\]<" \] \] 
\["t:o his teacher"- 
\[ function : indirect-obj ect, 
realization:prepositional-phrase, 
animacy : animate, 
definiteness:definite, head- 
root : "teacher" \] \] \] 
Following tile expectation-driven methodology 
tile sets (if pretagged noun phrases and sentences 
are sclected to cover many though, admittedly, 
not all expected cotnbinations of parameter 
wihles for every phrase or sentence. The 
fbllowing two examples fiirther illustrate the 
Boas elicitation procedure. 
Noun phrase pattern eiieitation. The user i~ 
given a short deiinition of a noun phrase and 
asked to translate a given English phraso~ for 
example "a Xood t~r)l' '" into S|. using tile words 
given in a small lexicon of selccled SI, lexical 
items translated Ii'om t'nglisil. In case of the 
Russian hmguage tile resuh would be: a good boy 
1113 
---> horoshij malchik. Next, Boas atitomatically 
looks tip every input SL woM in the lexicon and 
assigns part of speech and feature vahie tags to 
all the components of SL noun phrases. English 
translations of SL words help record the 
comparative order of noun phrase pattern 
constituents in SL and English and automatically 
assigns the head marker to that element of the 
SL noun phrase which is the mmslation of the 
English head. This is the final result of SL noun 
phrase pattern elicitation tbr a given English 
phrase. It includes a SL noun phrase pattern to 
be used in an MT parser and a pattern transfer 
inlbnnation for an English generator. Possible 
ambiguities, i.e., multiple sets of feature values 
for one word is resolved actively. 1he module 
can also actively check correctness of noun 
phrase translations. 
Clause structure elicitation includes order of 
the words, subject markers (diagnostic feature 
values or particles), direct object markers, verb 
markers, and clause element agreement. Just like 
in the case of noun phrases, the user is asked to 
translate a given English phrase into SL using 
the words given in the lexicon. For the English 
sentence used in the example above the Russian 
translation will be: 
the boy gives a book to his teacher --- 
> malchik daet knigu uchitelju 
As soon as this is done, Boas presents the user 
with English phrases corresponding to clause 
elements of the translated sentence, so that for 
every English-SL pair of sentences the user 
types in (or drags from the sentence translation) 
corresponding SL phrases, thus aligning clause 
elements.After the ractive alignment is done, the 
system automatically: 
® transfers the clause element tags fiom 
English to SL 5. 
* nmrks the heads of every SI, chmse 
elernent, and 
o assigns feature values to the heads of 
clause elements. 
STiffs proved to be working in our experiment with I 1 
langtmgcs, such as French, Spanish, German, Rus- 
Si;:ill, tJkiliiili.~tll. Scrbo-Croatian, Chinese, l>crsiurl, 
Turkish, Arabic, and \[ lindi. 
assigns sentence restricting parameter 
values (clause type, voice, tense and 
aspect, the last three are ligature values 
of the verb). 
In the case of assignment of multiple sets of 
feature values the user is asked to disambiguate . 
them. As a result, every SL clause element is 
now tagged with certain values of diagnostic and 
restricting tags. The system stores these results 
as mternal knowledge represenmtion, in the fi, mn 
of a feature structure, for further processing. For 
example, tbr the above English-Russian 
sentence pair the mediate results (not shown to 
the user) will be: 
\["malchik daet knigu 
uchitelju":\[ structure:sentence, 
form: affirmative, clause- 
type :main, voice : active, 
tense :present, 
aspect : imper fective \] 
\[ "malchik'= 
\[function: subject, realization : noun-- 
phrase, animacy:animate, 
head-l, root:'malchik', 
case:nominative, number:singular, 
gender:masculir~e, person:third\]\] 
\["dae~'= \[function:verb, 
realization:verb, head-root:"davat'" 
,number:singular, 
person:third\]\] 
\["kniqu"- \[function:direct-object, 
realization:noun-phrase, 
animacy:inanimate, head-root:'kniga', 
case:accusative, number:singular, 
gender:feminine, person:third\]\] 
\["uchitelju"= \[function:indirect- 
object, realization:noun phrase, 
animacy:animate, head-root:"uchitel'", 
case:dative, number:singular, 
gender:masculine, person:third\]\]\] 
This data is fiu-ther automatically processed to 
obtain tile kind of knowledge which can be tised 
in tile parser or generator, that is, rules (not seen 
by the user), where the t l,,"~ht-hand side centares 
a diagnostic parameter value (word oMer, clause 
element marking, agreement marking, etc.) and 
1114 
the lefi-lmnd side contains the vahtes of 
restricting parameters which condition the use of 
the COiTesponding diagnostic parameter valtte. A 
sample rule for the Russian example above isas 
lbllows: 
DirectObjectMarkerl= SL.Ru\].e\[ 
!hs: SentenceForm\[affirmative\] 
ClauseType\[main\] 
Voice\[active\] 
Tense\[present\] 
Aspect\[imperfective\] 
Subject\[realization:noun-phrase 
animacy:animate\] 
DirectObject\[realization:noun. 
phrase animacy:inanimate\], 
rhs:<:SLDirectObjectMarker\[case:accus 
ative\] :>\]; 
These results are presented to the user for 
approval in a readable form° In Russian these 
rifles mean the tbllowing: 
in the a././b+mative sentence, mai/7 claltse, active 
voice, present tense, when the xuO/ect is realized 
as NP mid animate and direct c:/?/ect i:+" r'caliT.;ed 
as NI" and itumimcite, 
+ word order is SV(); 
® subject is in nominative case; 
* direct object is in accusative case; 
subject agrees with verb in number and 
person. 
After all the sentence translations are processed 
in this way, the rules with the same right-hand 
side are automatically combined. At the next 
stage of processing the set of values tbr every 
restricting parameter in the right-hand side of the 
combined rule is checked on completeness. This 
means that in Rttssian in the affinnative main 
clause the prelbrred word order is SVO. The 
final result:; are presented l+or the ttser lbr 
al->t-woval or editing. 
Conclusion 
Boas i+; implemented as a WWW-based Ihce, 
using IHMI+, Java Scripts and Purl. \]ks of 
November 1999, the coverage of Boas inchtdes 
the elicitation of inflectional moq~hology, 
moq'~hotactics+, opcn.-chms and closed-.class 
lexical items. Work on tokenization and proper 
names, syntax and feature and syntactic transfer + 
is under way. Initial experiments have been 
completed on producing operational knowledge 
from the declarative knowledge elicited through 
Boas. Testing and ewduation of the sysem have 
been platmed, and its results will be reported 
separately. 
Acknowledgments 
Research for this paper was supported in part by 
Contract MDA904-97-C-3976 from the US 
Department of Defense. Thanks to Jim Cowie 
and R6mi Zajac lbr many fi-uitful discussions of 
the issues related both to Boas proper and to the 
MT environment in which it operates. 

References 

Brill, E., D Magerman, M Marcus and B Santorini. 
(1990) Deducing Linguistic Structure from the 
Statistics of Large Corpora. Proceedings of the 
29th Annual Meeting of the Association for 
Computational Linguistics. Berkeley. CA. 

Brown, P., J+ Cocke, 5. Della Pietra, V. Delhi Pietra, 
F. Jelinek, J.D. l+afferty, P,.\[.. Mercer and P.S. 
Roossin. 1990. A statistical approach to machine 
translation. Computational 1.ingttistics, 16: 79-85. 

Chomsky, N. 1981. \[.ecturcs on Government and 
Binding. Dordrecht: Foris. 

Chomsky, N. 1995. The Minimalist Program. 
Cambridge, MA: Mrr Press. 

Goldsmith, J. 1998. Unsupervised l~carning of the 
Morphology of a NatLtral Language. http://humani- 
tics.uchicago.edu/facuhy/gohtsnlith/Atttonaorpholo 
gy/Papcr.doc 

Jones, D. and R.Hawilla. 1998. Twisted Pair 
(\]rammar: Support for Rapid Development of 
Machine Translation fin lmw Density l.anguagcs. 
AMTA'gg. 

Mcl'cuk I. 1988. Dependency Syntax: Theory and 
Practice. State University of New York 1Press, 
Albany. 

Nircnbt,rg, Scrgci 1996. Supply-side and demand- 
side Icxical semantics. Introduction to the 
Workshop on thcadth and Depth of Semantic 
LcxJcollS at AC|f196. 

Xia, Fei, M. Pahner, and K.Vijay-Shankcr. 1999. 
Towards SCllli-atltonlatic (hammar l)evelopmcnt 
Proceedings of tile Natnral |,angnl.{e Processing 
Pacific Rim Symposium. Bc(jing, China. 
