Modeling the language assessment process and result: 
Proposed architecture for automatic oral proficiency assessment 
Gina-Anne Levow and Mari Broman Olsen 
University of Maryland Institute for Advanced Computer Studies 
College Park, MD 20742 
{gina,molsen}@umiacs.umd.edu 
Abstract 
We outline challenges for modeling human lan- 
guage assessment in automatic systems, both in 
terms of the process and the reliability of the re- 
sult. We propose an architecture for a system to 
evaluate learners of Spanish via the Computer- 
ized Oral Proficiency Instrument, to determine 
whether they have 'reached' or 'not reached' the 
Intermediate Low level of proficiency, according 
to the American Council on the Teaching of For- 
eign Languages (ACTFL) Speaking Proficiency 
Guidelines. Our system divides the acoustic 
and non-acoustic features, incorporating human 
process modeling where permitted by the tech- 
nology and required by the domain. We suggest 
machine learning techniques applied to this type 
of system permit insight into yet unarticulated 
aspects of the human rating process. 
1 Introduction 
Computer-mediated language assessment ap- 
peals to educators and language evaluators be- 
cause it has the potential for making language 
assessment widely available with minimal hu- 
man effort and limited expense. Fairly robust 
results (n '~ 0.8) have been achieved in the com- 
mercial domain modeling the human rater re- 
sults, with both the Electronic Essay Rater (e- 
rater) system for written essay scoring (Burstein 
et al., 1998), and the PhonePass pronunciation 
assessment (Ordinate, 1998). 
There are at least three reasons why it is 
not possible to model the human rating pro- 
cess. First, there is a mismatch between what 
the technology is able to handle and what peo- 
ple manipulate, especially in the assessment 
of speech features. Second, we lack a well- 
articulated model of the human process, often 
characterized as holistic. Certain assessment 
features have been identified, but their rela- 
tive importance is not clear. Furthermore, un- 
like automatic assessments, human raters of oral 
proficiency exams are trained to focus on com- 
petencies, which are difficult to enumerate. In 
contrast, automatic assessments of spoken lan- 
guage fluency typically use some type of error 
counting, comparing duration, silence, speaking 
rate and pronunciation mismatches with native 
speaker models. 
There is, therefore, a basic tension within 
the field of computer-mediated language assess- 
ment, between modeling the assessment pro- 
cess of human raters or achieving comparable, 
consistent assessments, perhaps through differ- 
ent means. Neither extreme is entirely satisfac- 
tory. A spoken assessment system that achieves 
human-comparable performance based only, for 
example, on the proportion of silence in an ut- 
terance would seem not to be capturing a num- 
ber of critical elements of language competence, 
regardless of how accurate the assessments are. 
Such a system would also be severely limited 
in its ability to provide constructive feedback 
to language learners or teachers. The e-rater 
system has received similar criticism for basing 
essay assessments on a number of largely lexical 
features, rather than on a deeper, more human- 
style rating process. 
Thirdly, however, even if we could articulate 
and model human performance, it is not clear 
that we want to model all aspects of the hu- 
man rating process. For example, human per- 
formance varies due to fatigue. Transcribers of- 
ten inadvertently correct examinees' errors of 
omitted or incorrect articles, conjugations, or 
affixes. These mistakes are a natural effect of 
a cooperative listener; however, they result in 
an over-optimistic assessment of the speaker's 
actual proficiency. We arguably do not wish to 
build this sort of cooperation into an automated 
24 
assessment system, though it is likely desirable 
for other sorts of human-computer interaction 
systems. 
Furthermore, if we focus on modeling hu- 
man processes we may end up underutillzing the 
technology. Balancing human-derived features 
with machine learning techniques may actually 
allow us to discuss more about the human rat- 
ing process by making the entire process avail- 
able for inspection and evaluation. For exam- 
ple, if we are able to articulate human rating 
features, machine learning techniques may al- 
low us to 'learn' the relative weighting of these 
features for a particular assessment value. 
2 Modeling the rater 
2.1 Inference gz Inductive Bias 
Research in machine learning has demonstrated 
the need for some form of inductive bias, to limit 
the space of possible hypotheses the learning 
system can infer. In simple example-based con- 
cept learning, concepts are often restricted to 
certain classes of Boolean combinations, such as 
conjuncts of disjuncts, in order to make learning 
tractable. Recent research in automatic induc- 
tion of context-free grammars, a topic of more 
direct interest to language learning and assess- 
ment, also attests to the importance of structur- 
ing the class of grammars that can be induced 
from a data set. For instance Pereira and Sch- 
abes (1992) demonstrate that a grammar learn- 
ing algorithm with a simple constraint on binary 
branching (CNF) achieves less than 40% accu- 
racy after training on an unbracketed corpus. 
Two alternatives achieve comparable in- 
creases in grammatical accuracy. Training on 
partially bracketed corpora - providing more su- 
pervision and a restriction on allowable gram- 
mars - improves to better than 90%. (DeMar- 
cken, 1995) finds that requiring binary branch- 
ing, as well as headedness and head projection 
restrictions on the acquirable grammar, leads 
to similar improvements. These results argue 
strongly that simply presenting raw text or fea- 
ture sequences to a machine learning program to 
build an automatic rating system for language 
assessment is of limited utility. Results will be 
poorer and require substantially more training 
data than if some knowledge of the task or clas- 
sifter end-state based in human knowledge and 
linguistic theory is applied to guide the search 
for classifiers. 
2.2 Encoding Linguistic Knowledge 
Why, then, if it is necessary to encode human 
knowledge in order to make machine learning 
practical, do we not simply encode each piece of 
the relevant assessment knowledge from the per- 
son to the machine? Here again parallels with 
other areas of Natural Language Processing 
(NLP) and Artificial Intelligence (AI) provide 
guidance. While both rule-based, hand-crafted 
grammars and expert systems have played a 
useful role, they require substantial labor to 
construct and become progressively more diffi- 
cult to maintain as the number of rules and rule 
interactions increases. Furthermore, this labor 
is not transferable to a new (sub-)language or 
topic and is difficult to encode in a way that 
allows for graceful degradation. 
Another challenge for primarily hand-crafted 
approaches is identifying relevant features and 
their relative importance. As is often noted, 
human assessment of language proficiency is 
largely holistic. Even skilled raters have diffi- 
culty identifying and quantifying those features 
used and their weights in determining an as- 
sessment. Finally, even when identifiable, these 
features may not be directly available to a com- 
puter system. For instance, in phonology, hu- 
man listeners perceive categorical distinctions 
between phonemes (Eimas et al., 1971; Thi- 
bodeau and Sussman, 1979) whereas acoustic 
measures vary continuously. 
We appeal to machine learning techniques in 
the acoustic module, as well as in the pool- 
ing of information from both acoustic and non- 
acoustic features. 
3 Domain: The Computerized Oral 
Proficiency Instrument 
The Center for Applied Linguistics in Washing- 
ton, D.C. (CAL) has developed or assisted in 
developing simulated oral proficiency interview 
(SOPI) tests for a variety of languages, recently 
adapting them to a computer-administered for- 
mat, the COPI. Scoring at present is done en- 
tirely by human raters. The Spanish version 
of the COPI is in the beta-test phase; Chinese 
and Arabic versions are under development. All 
focus on assessing proficiency at the Intermedi- 
ate Low level, defined by the American Council 
on the Teaching of Foreign Languages (ACTFL) 
25 
Speaking Proficiency Guidelines (ACT, 1986), 
a common standard for passing at many high 
schools. We focus on Spanish, since we will have 
access to reed data. Our goal is to develop a sys- 
tem with a high interannotator agreement with 
human raters, such that it can replace one of the 
two or three raters required for oral proficiency 
interview scoring. 
With respect to non-acoustic features, our 
domain is tractable, for current natured lan- 
guage processing techniques, since the input 
is expected to be (at best) sentences, perhaps 
only phrases and words at Intermediate Low. 
Although the tasks and the language at this 
level are relatively simple, the domain varies 
enough to be interesting from a research stand- 
point: enumerating items in a picture, leav- 
ing a answering machine message, requesting a 
car rental, giving a sequence of directions, and 
describing one's family, among others. These 
tasks elicit a more varied, though still topically 
constrained, vocabulary. They also allow the 
assessment of the speaker's grasp of target lan- 
guage syntax, and, in the more advanced tasks, 
discourse structure and transitions. The COPI, 
therefore, provides a natured domain for rating 
non-native speech on both acoustic and non- 
acoustic features. These subsystems differ in 
terms of how amenable they are to machine 
modeling of the human process, as outlined be- 
low. 
4 Acoustic Features: The Speech 
Recognition Process 
In the last two decades significant advances have 
been made in the field of automatic speech 
recognition (SR), both in commercial and re- 
search domains. Recently, research interest 
in recognizing non-native speech has increased, 
providing direct comparisons of recognition ac- 
curacy for non-native speakers at different lev- 
els of proficiency (Byrne et ed., 1998). Tomokiyo 
(p.c.), in experiments with the JANUS (Waibel 
et ed., 1992) speech system developed by 
Carnegie Mellon University, reports that sys- 
tems with recognition accuracies of 85% for 
native speech perform at 40-50% for high flu- 
ency L2 learners (German, Tomokiyo, p.c.) 
and 30% for medium fluency speech (Japanese, 
Tomokiyo, p.c.). 
However, the current speech recognition tech- 
nology makes little or no effort to model the 
human auditory or speech understanding pro- 
cess. Furthermore, standard SR approaches 
to speaker adaptation rely on relatively large 
amounts (20-30 minutes) of fixed, recorded 
speech (Jecker, 1998) to modify the underly- 
ing model, say in the case of accented speech, 
again unlike human listeners. 
While a complete reengineering of speech 
recognition is beyond the scope of our current 
project, we do attempt to model the human as- 
sessor's approach to understanding non-native 
speech. The SR system allows us two points of 
access through which linguistic knowledge of L2 
phonology and grammar can be applied to im- 
prove recognition: the lexicon and the speech 
recognizer grammar. 
4.1 Lexicon: Transfer-model-based 
Phonological-Adaptation 
Since we have too little data for conventional 
speaker adaptation (less than 5 minutes of 
speech per examinee), we require a principled 
way of adapting an L1 or L2 recognizer model 
to non-native learner's speech that places less 
reliance upon recorded training data. We know 
that the pronunciation reflects native language 
influence, most notably at early stages (Novice 
and Intermediate), with which we are primar- 
ily concerned. Following the L2 transfer acqui- 
sition model, we assume that the L2 speaker, 
in attempting to produce target language ut- 
terances, will be influenced by L1 phonology 
and phonotactics. Thus, rather than being ran- 
dom divergences from TL pronunciation, errors 
should be closer to L1 phonetic realizations. 
To model these constraints, we will employ 
two distinct speech recognizers that can be em- 
ployed to recognize L2 speech, produced by 
adaptations specific to Target Language (TL) 
and Source Language (SL). We propose to 
use language identification technology to arbi- 
trate between the two sets of recognizer results, 
based on a sample of speech, either counting 
to 20, or a short read text. Since we need to 
choose between an underlying TL phonologi- 
ced model and one based on the SL, we will 
make the selection based on the language iden- 
tification decision as to the apparent phonolog- 
ical identity of the sample as SL or TL, based 
on the sample's phonological and acoustic fea- 
tures (Berkling and Barnard, 1994; Hazen and 
26 
Zue, 1994; Kadambe and Hieronymus, 1994; 
Muthusamy, 1993). Parameterizing phonetic 
expectation based on a short sample of speech 
(Ladefoged and Broadbent, 1957) or expecta- 
tions in context (Ohala and Feder, 1994) mirrors 
what people do in speech processing generally, 
independent of the rating context. 
4.2 An acoustic grammar: Modeling 
the process 
Modeling the grammar of a Novice or Inter- 
mediate level L2 speaker for use by a speech 
recognizer is a challenging task. As noted in 
the ACTFL guidelines, these speakers are fre- 
quently inaccurate. However, to use the con- 
tent of the speech in the assessment, we need 
to model human raters, who recognize even er- 
rorful speech as accurately and completely as 
possible. Speech recognizers work most effec- 
tively when perplexity is low, as is the case 
when the grammar and vocabulary are highly 
constrained. However, speech recognizers also 
recognize what they are told to expect, of- 
ten accepting and misrecognizing utterances 
when presented with out-of-vocabulary or out- 
of-grammar input. We must balance these con- 
flicting demands. 
We will take advantage of the fact that this 
task is being performed off-line and thus can tol- 
erate recognizer speeds several times real-time. 
We propose a multi-pass recognition process 
with step-wise relaxation of grammatical con- 
straints. The relaxed grammar specifies a noun 
phrase with determiner and optional adjective 
phrase but relaxes the target language restric- 
tions on gender and number agreement among 
determiner, noun, and adjective and on posi- 
tion of adjective. Similar relaxations can be ap- 
plied to other major constructions, such as verbs 
and verbal conjugations, to pass, without "cor- 
recting", utterances with small target language 
inaccuracies. For those who would not reach 
such a level, and for tasks in which sentence- 
level structure is not expected, we must relax 
the grammar still further, relying on rejection at 
the first pass grammar to choose grammars ap- 
propriately. Successive relaxation of the gram- 
mar model will allow us to balance the need to 
reduce perplexity as much as possible with the 
need to avoid over-predicting and thereby cor- 
recting the learner's speech. 
4.3 Acoustic features: Modeling the 
result 
Research in the area of pronunciation scoring 
(Rypa, 1996; Franco et al., 1997; Ehsani and 
Knodt, 1998) has developed both direct and 
indirect measures of speech quality and pro- 
nunciation accuracy, none of which seem to 
model human raters at any level. The direct 
measures include calculations of phoneme er- 
ror rate, computed as divergence from native 
speaker model standards, and number of incor- 
rectly pronounced phonemes. The indirect mea- 
sures attempt to capture some notion of flu- 
ency and include speaking rate, number and 
length of pauses or silences, and total utterance 
length. Analogous measures should prove useful 
in the current assessment of spoken proficiency. 
In addition, one could include, as a baseline, 
a human-scored measure of perceived accent or 
lack of fluency. A final measure of acoustic qual- 
ity could be taken from the language identifica- 
tion process used in the arbitration phase, as to 
whether the utterance was more characteristic 
of the source or target language. In our samples 
of Intermediate Low passing speech we iden- 
tify, for example, large proportions of silence 
to speech both between and within sentences. 
Some sentences are more than 50% silence. 
5 Natural Language Understanding: 
Linguistic Features Assessment 
In the non-acoustic features, we have a fairly 
explicit notion of generative competence and a 
reasonable way of encoding syntax in terms of 
Context-Free Grammars (CFGs) and semantics 
via Lexical Conceptual Structures (LCSs). We 
do not know, however, the relative importance 
of different aspects of this competence in deter- 
mining reached/not reached for particular lev- 
els in an assessment task. Therefore, we apply 
machine learning techniques to pool the human- 
identified features, generating a machine-based 
model of process which is fully explicit and 
amenable to evaluation. 
The e-rater system, deployed by the Educa- 
tional Testing Service (ETS) incorporates more 
than 60 variables based on properties used by 
human raters and divided into syntactic, rhetor- 
ical and topical content categories. Although 
the features deal with suprasentential structure, 
the reported variables (Burstein et al., 1998) 
27 
are identified via lexical information and shal- 
low constituent parsing, arguably not modeling 
the human process. 
We attempt to model the features based on 
a deeper analysis of the structure of the text at 
various levels. We propose to parallel the archi- 
tecture of the Military Language Tutoring sys- 
tem (MILT), developed jointly by the Univer- 
sity of Maryland and Micro Analysis and Design 
corporation under army sponsorship. MILT 
provides a robust model of errors from English 
speakers learning Spanish and Arabic, identify- 
ing lexical and syntactic characteristics of short 
texts, as well as low-level semantic features, a 
prerequisite for more sophisticated inferencing 
(Dorr et al., 1995; Weinberg et al., 1995). At a 
minimum, the system will provide linguistically 
principled feedback on errors of various types, 
rather than providing system error messages, or 
crashing on imperfect input. 
Our work with MILT and the COPI beta- 
test data suggests that relevant features may 
be found in each of four main areas of spoken 
language processing: acoustic, lexical, syntac- 
tic/semantic ' and discourse. In order to au- 
tomate the assessment stage of the oral profi- 
ciency exam, we must identify features of the 
L2 examinees' utterances that are correlated 
with different ratings and that can be extracted 
automatically. If we divide language up into 
separate components, we can describe a wide 
range of variation within a bounded set of pa- 
rameters within these components. We can 
therefore build a cross-linguistically valid meta- 
interpreter with the properties we desire (com- 
pactness, robustness and extensibility). This 
makes both engineering and linguistic sense. 
Our system treats the constraints as submod- 
ules, able to be turned on or off, at the instruc- 
tor's choice, made based on, e.g., what is learned 
early, and the level of correction desired. The 
MILT-style architecture allows us to make use of 
the University of Maryland's other parsing and 
lexicon resources, including large scale lexica in 
Spanish and English. 
5.1 Lexical features 
One would expect command of vocabulary to 
be a natural component of a language learner's 
proficiency in the taxget language. A variety 
of automatically extractable measures provide 
candidate features for assessing the examinee's 
lexical proficiency. In addition, the structure of 
the tasks in the examination allows for testing 
of extent of lexical knowledge in restricted com- 
mon topics. For instance, the student may be 
asked to count to twenty in the target language 
or to enumerate the items in a pictured context, 
such as a classroom scene. Within these tasks 
one can test for the presence and number of spe- 
cific desired vocabulary items, yielding another 
measure of lexical knowledge. 
Simple measures with numerical values in- 
clude number of words in the speech sample and 
number of distinct words. In addition, exami- 
nees at this level frequently rely on vocabulary 
items from English in their answers. 
A deeper type of knowledge may be captured 
by the lexicon in Lexical Conceptual Structure 
(LCS) (Dorr, 1993b; Dorr, 1993a; Jackendoff, 
1983). The LCS is an interlingual framework for 
representing semantic elements that have syn- 
tactic reflexes3 LCSs have been ported from 
English into a variety of languages, including 
Spanish, requiring a minimum of adaptation in 
even unrelated languages (e.g. Chinese (Olsen 
et al., 1998)). The representation indicates 
the argument-taking properties of verbs (hit re- 
quires an object; smile does not), selectional 
constraints (the subject of fear and the object of 
frighten are animate), thematic information of 
arguments (the subject of frighten is an agent; 
the object is a patient) and classification in- 
formation of verbs (motion verbs like go are 
conceptually distinct from psychological verbs 
like fear/frighten; run is a more specific type of 
motion verb than go). Each information type 
is modularly represented and therefore may be 
separately analyzed and scored. 
5.2 Syntactic features 
We adopt a generative approach to grammar 
(Government and Binding, Principles and Pa- 
rameter, Minimalism) principles. In these mod- 
els, differences in the surface structure of lan- 
guages can be reduced to a small number of 
modules and parameters. For example, al- 
though Spanish and English both have subject- 
verb-object (SVO) word order, the relative or- 
dering of many nouns and adjectives differs (the 
1That is, the LCS does not represent non-syntactic 
aspects of 'meaning', including metaphor and pragmat- 
iC$. 
28 
"head parameter"). 2 In English the adjective 
precedes the noun, whereas Spanish adjectives 
of nationality, color and shape regularly fol- 
low nouns (Whitley, 1986)\[pp. 241-2\]. The 
MILT architecture allows us both to enumer- 
ate errors of these types, and parse data that 
includes such errors. We will also consider mea- 
sures of number and form of distinct construc- 
tion types, both attempted and correctly com- 
pleted. Such constructs could include simple 
declaratives (subject, verb, and one argument), 
noun phrases with both determiner and adjec- 
tive, with correct agreement and word order, 
questions, and multi-clause sentences. 
5.3 Semantic features 
Like the syntactic information, lexical informa- 
tion can be used modularly to assess a variety 
of properties of examinees' speech. The Lexical 
Conceptual Structure (LCS) allows principles of 
robustness, flexibility and modularity to apply 
to the semantic component of the proposed sys- 
tem. The LCS serves as the basis of several 
different applications, including machine trans- 
lation and information retrieval as well as for- 
eign language tutoring. The LCS is considered 
a subset of mental representation, that is, the 
language of mental representation as realized in 
language (Dorr et al., 1995). Event types such 
as event and state, are represented in primitives 
such as GO, STAY, BE, GO-EXT and ORI- 
ENT, used in spatial and other 'fields'. As such, 
it allows potential modeling of human rater pro- 
cesses. 
The LCS allows various syntactic forms to 
have the same semantic representation, e.g. 
Walk to the table and pick up the book, Go to 
the table and remove the book, or Retrieve the 
book from the table. COPI examinees are also 
expected to express similar information in dif- 
ferent ways. We propose to use the LCS struc- 
ture to handle and potentially enumerate com- 
petence in this type of variation. 
Stored LCS representations may also handle 
hierarchical relations among verbs, and diver- 
gences in the expression of elements of meaning, 
sometimes reflecting native language word or- 
der. The modularity of the system allows us to 
tease the semantic and syntactic features apart, 
2Other parameters deal with case, theta-role assign- 
ment, binding, and bounding. 
giving credit for the semantic expression, but 
identifying divergences from the target L2. 
5.4 Discourse features 
Since the ability to productively combine words 
in phrase and sentence structures separates the 
Intermediate learner from the Novice, features 
that capture this capability should prove use- 
ful in semi-automatic assessment of Intermedi- 
ate Low level proficiency, our target level. Ac- 
cording to the ACTFL Speaking Proficiency 
Guidelines, Intermediate Low examinees begin 
to compose sentences; full discourses do not 
emerge until later levels of competence. Nev- 
ertheless, we want both to give credit for any 
discourse-level features that surface, as well as 
to provide a slot for such features, to allow scal- 
ability to more complex tasks and higher levels 
of competence. We will therefore develop dis- 
course and dialog models, with the appropriate 
and measurable characteristics. Many of these 
can be lexically or syntactically identified, as 
the ETS GMAT research shows (Burstein et al., 
1998). Our data might include uses of discourse 
connectives (entonces 'then; in that case' pero 
'but'; es que 'the fact is'; cuando 'when'), other 
subordinating structures (Yo creo que 'I think 
that') and use of pronouns instead of repeated 
nouns. The discourse measures can easily be 
expanded to cover additional, more advanced 
constructions that are lexically signaled, such 
as the use of subordination or paragraph-level 
structures. 
6 Machine learning 
While the above features capture some measure 
of target language speaking proficiency, it is dif- 
ficult to determine a priori which features or 
groups of features will be most useful in making 
an accurate assessment. In this work, human 
assessor ratings for those trained on the Speak- 
ing Proficiency Guidelines will be used as the 
"Gold Standard" for determining accuracy of 
automatic assessment. We plan to apply ma- 
chine learning techniques to determine the rel- 
ative importance of different feature values in 
rating a speech sample. 
The assessment phase goes beyond the cur- 
rent work in test scoring, combining recogni- 
tion of acoustic features, such as the Auto- 
matic Spoken Language Assessment by Tele- 
phone (ASLAT) or PhonePass (Ordinate, 1998) 
29 
with aspects of the syntactic, discourse, and se- 
mantic factors, as in e-rater. Our goal is to 
have the automatic scoring system mirror the 
outcome of raters trained in the ACTFL Guide- 
lines (ACT, 1986), to determine whether exam- 
inees did or did not reach the Intermediate Low 
level. We also aim to make the process of feature 
weighting transparent, so that we can determine 
whether the system provides an adequate model 
of the human rating process. We will evalu- 
ate quantitatively the extent to which machine 
classification agrees with human raters on both 
acoustic and non-acoustic properties alone and 
separately. We will also evaluate the process 
qualitatively with human raters. 
We plan to exploit the natural structuring 
of the data features through decision trees or 
a small hierarchical "mixture-of-experts"- type 
model (Quinlan, 1993; Jacobs et al., 1991; Jor- 
dan and Jacobs, 1992). Intuitively, the lat- 
ter approach creates experts (machine-learning 
trained classifiers) for each group of features 
(acoustic, lexical, and so on). The correct 
way of combining these experts is then ac- 
quired though similar machine learning tech- 
niques. The organization of the classifier allows 
the machine learning technique at each stage 
of the hierarchy to consider fewer features, and 
thus, due to the branching structure of tree 
classifier, dramatically fewer classifier configu- 
rations. 
Decision tree type classifiers have an addi- 
tional advantage: unlike neural network or near- 
est neighbor classifiers, they are easily inter- 
pretable by humans. The trees can be rewritten 
trivially as sequences of if-then rules leading to 
a certain classification. For instance, in the as- 
sessment task, one might hypothesize a rule of 
the form: IF silence > 20% of utterance, THEN 
Intermediate Low NOT REACHED. It is thus 
possible to have human raters analyze how well 
the rules agree with their own intuitions about 
scoring and to determine which automatic fea- 
tures play the most important role in assess- 
ment. 
7 Conclusions 
We have outlined challenges for modeling the 
human rating task, both in terms of process 
and result. In the domain of acoustic features 
and speech recognition, we suggest the technol- 
ogy currently does not permit complete mod- 
eling of the rating process. Nevertheless, the 
paucity of data in our domain requires us to 
adopt the transfer model of speech, which per- 
mits automatic adaptation to errorful speech. 
In addition, our recognizer incorporates a re- 
laxed grammar, permitting input to vary from 
the target language at the lexical and syntactic 
levels. These adaptations allow us to model hu- 
man perception and processing of (non-native) 
speech, as required by our task. In the non- 
acoustic domain, we also adopt machine learn- 
ing techniques to pool the relevant human- 
identified features. As a result, we can learn 
more about the feature-weighting in the process 
of tuning to an appropriate level of reliability 
with the results of human raters. 

References 
1986. Proficiency guidelines. American Council 
for the Teaching of Foreign Languages. 

Kay M. Berkling and Etienne Barnard. 1994. 
Language identification of six languages 
based on a common set of broad phonemes. 
In Proceedings of ICSLP \[ICS9~\], pages 1891-1894. 

Jill Burstein, Karen Kukich, Susanne Wolff, Chi 
Lu, Martin Chodorow, Lisa Braden-Harder, 
and Mary Dee Harris. 1998. Automated scor- 
ing using a hybrid feature identification tech- 
nique. In ACL/COLING 98, pages 206-210, 
Montreal, Canada, August 10-14. 

William Byrne, Eva Knodt, Sanjeev Khudan- 
pur, and Jared Bernstein. 1998. Is auto- 
matic speech recognition ready for non-native 
speech? a data collection effort and ini- 
tim experiments in modeling conversational 
hispanic english. In Proceedings of STILL 
(Speech Technology in Language Learning), 
Marholmen, Sweden. European Speech Com- 
munication Association, May. 

Carl DeMarcken. 1995. Lexical heads, phrase 
structure, and the induction of grammar. In 
Proceedings of Third Workshop on Very Large 
Corpora, Cambridge, MA. 

Bonnie J. Dorr, Jim Hendler, Scott Blanksteen, 
and Barrie MigdMoff. 1995. Use of LCS and 
Discourse for Intelligent Tutoring: On Be- 
yond Syntax. In Melissa Holland, Jondthan 
Kaplan, and Michelle Sams, editors, Intelli- 
gent Language Tutors: Balancing Theory and 
Technology, pages 289-309. Lawrence Erl- 
baum Associates, Hillsdale, NJ. 

Bonnie J. Dorr. 1993a. Interlingual Machine 
Translation: a Parameterized Approach. Ar- 
tificial Intelligence, 63(1&2):429-492. 

Bonnie J. Dorr. 1993b. Machine Translation: 
A View from the Lexicon. The MIT Press, 
Cambridge, MA. 

Farzad Ehsani and Eva Knodt. 1998. Speech 
technology in computer-aided language learn- 
ing: Strengths and limitations of a new CALL 
paradigm. Language Learning ~ Technology, 
2(1), July. 

Peter D. Eimas, Einar R. Siqueland, Pe- 
ter Jusczyk, and James Vigorito. 1971. 
Speech perception in infants. Science, 
171(3968):303-306, January. 

H. Franco, L. Neumeyer, Y. Kim, and O. Ro- 
hen. 1997. Automatic pronunciation scoring 
for language instruction. In Proceedings of 
ICASSP, pages 1471-1474, April. 

Timothy J. Hazen and Victor W. Zue. 1994. 
Recent improvements in an approach to 
segment-based automatic language identifica- 
tion. In Proceedings of ICSLP \[ICS94\], pages 
1883-1886. 

Ray Jackendoff. 1983. Semantics and Cogni- 
tion. The MIT Press, Cambridge, MA. 

R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and 
G. E. Hinton. 1991. Adaptive mixtures of lo- 
cal experts. Neural Computation, 3(1):79-87. 

D. Jecker. 1998. Speech recognition - perfor- 
mance tests. PC Magazine, 17, March. 

M. I. Jordan and R. A. Jacobs. 1992. Hierar- 
chies of adaptive experts. In Nips4. 

Shubha Kadambe and James L. Hieronymus. 
1994. Spontaneous speech language identifi- 
cation with a knowledge of linguistics. In Pro- 
ceedings of ICSLP \[ICS94\], pages 1879-1882. 

P. Ladefoged and D.E. Broadbent. 1957. Infor- 
mation conveyed by vowels. Journal of the 
Acoustical Society of America, 29:98-104. 

Yeshwant K. Muthusamy. 1993. A Segmen- 
tal Approach to Automatic Language Identi- 
fication. Ph.D. thesis, Oregon Graduate In- 
stitute of Science & Technology, P.O. Box 
91000, Portland, OR 97291-1000. 

J.J. Ohala and D. Feder. 1994. Listeners' 
normalization of vowel quality is influenced 
by restored consonantal context. Phonetica, 
51:111-118. 

Marl Broman Olsen, Bonnie J. Dorr, and 
Scott C. Thomas. 1998. Enhancing Auto- 
matic Acquisition of Thematic Structure in 
a Large-Scale Lexicon for Mandarin Chinese. 
In Proceedings of the Third Conference of 
the Association for Machine Translation in 
the Americas, AMTA-98, in Lecture Notes 
in Artificial Intelligence, 1529, pages 41-50, 
Langhorne, PA, October 28-31. 

Ordinate. 1998. The PhonePass Test. Tech- 
nical report, Ordinate Corporation, Menlo 
Park, CA, January. 

Fernando Pereira and Yves Schabes. 1992. 
Inside-outside reestimation from partially 
bracket corpora. In Proceedings of the 
30th Annual Meeting of the Association for 
Computational Linguistics, pages 128-135, 
Newark, DE. 

J. R. Quinlan. 1993. C4. 5: Programs for Ma- 
chine Learning. Morgan Kaufmann, San Ma- 
teo, CA. 

M. Rypa. 1996. VILTS: The voice interactive 
language training system. In Proceedings of 
CALICO, July. 

Linda M. Thibodeau and Harvey M. Sussman. 
1979. Performance on a test of categorical 
perception of speech in normal and com- 
munication disordered children. Phonetics, 
7(4):375-391, October. 

A. Waibel, A.N. Jain, A. McNair, J. Tebel- 
sis, L. Osterholtz, H. Salto, O. Schmid- 
bauer, T. Sloboda, and M. Woszczyna. 1992. 
JANUS: Speech-to-Speech Translation Using 
Connectionist and Non-Connectionist Tech- 
niques. In J.E. Moody, S.J. Hanson, and R.P. 
Lippman, editors, Advances in Neural Infor- 
mation Processing Systems 4. Morgan Kauf- 
mann. 

Amy Weinberg, Joseph Garman, Jeffery Mar- 
tin, and Paola Merlo. 1995. Principle-Based 
Parser for Foreign Language Training in 
German and Arabic. In Melissa Holland, 
Jonathan Kaplan, and Michelle Sams, ed- 
itors, Intelligent Language Tutors: Theory 
Shaping Technology, pages 23-44. Lawrence 
Erlbaum Associates, Hillsdale, NJ. 

Stanley M. Whitley. 1986. Spanish/English 
Contrasts: A Course in Spanish Linguistics. 
Georgetown University Press, Washington, 
D.C. 
