Experiments in Evaluating 
Interactive Spoken Language Systems 1 
Joseph Polifroni, Lynette Hirschman, Stephanie Seneff, and Victor Zue 
Spoken Language Systems Group 
Laboratory for Computer Science 
Massachusetts Institute of Technology 
Cambridge, Massachusetts 02139 
ABSTRACT 
As the DARPA spoken language community moves to- 
wards developing useful systems for interactive problem solv- 
ing, we must explore alternative evaluation procedures that 
measure whether these systems aid people in solving problems 
within the task domain. In this paper, we describe several 
experiments exploring new evaluation procedures. To look at 
end-to-end evaluation, we modified our data collection pro- 
cedure slightly in order to experiment with several objective 
task completion measures. We found that the task completion 
time is well correlated with the number of queries used. We 
also explored log file evaluation, where evaluators were asked 
to judge the clarity of the query and the correctness of the re- 
sponse based on examination of the log file. Our results show 
that seven evaluators were unanimous on more than 80% of 
the queries, and that at least 6 out of 7 evaluators agreed over 
90% of the time. Finally, we applied these new procedures to 
compare two systems, one system requiring a complete parse 
and the other using the more flexible robust parsing mecha- 
nism. We found that these metrics could distinguish between 
these systems: there were significant differences in ability to 
complete the task, number of queries required to complete the 
task, and score (as computed through a log file evaluation) 
between the robust and the non-robust modes. 
INTRODUCTION 
For the first two years of the DARPA Spoken Lan- 
guage Program, common evaluation in the ATIS domain 
has been performed solely with the Common Answer 
Specification (CAS) protocol \[4\], whereby a system's per- 
formance is determined by comparing its output, ex- 
pressed as a set of database tuples, with one or more 
predetermined reference answers \[1\]. The CAS protocol 
has the advantage that system evaluation can be car- 
ried out automatically, once the principles for generating 
the reference answers have been established and a cor- 
pus has been annotated accordingly. Since direct com- 
parison across systems can be performed relatively easily 
with this procedure, we have been able to achieve cross 
fertilization of research ideas, leading to rapid research 
progress. 
1This research was supported by DARPA under Contract 
N000\] 4-89-J-1332, monitored through the Office of Naval Research. 
28 
QUERY 1: 
RESPONSE I: 
QUERY 2: 
RESPONSE 2: 
PLEASE LIST THE FLIGHT FROM PITTSBURGH 
TO BALTIMORE THAT WILL HE MADE BY SIX 
SEAT AIRPLANE ON JUNE TWENTIETH 
There are no flights from pittsburgh to 
Baltimore leaving before 6:00 a.m. on 
Saturday June 20. 
LIST THE NUMBER OF FIRST CLASS FLIGHTS 
AVAILABLE ON DELTA AIRLINES 
279 
Table 1: Two examples of queries from the February '92 
Evaluation for which the CAS evaluation gives a misleading 
assessment of the system's ability to understand. 
However, CAS evaluation is not without its share 
of shortcomings. One unfortunate outcome of evaluat- 
ing systems based on answers retrieved from the flight 
database is that we cannot separately assess the system's 
understanding of the query from its ability to retrieve 
the correct information and to conform to the prescribed 
output format. In the best case, researchers may end up 
spending a considerable amount of time worrying about 
the form and substance of the answer, which has little 
to do with language understanding. In the worst case, 
the results of the evaluation may be down right mislead- 
ing. Consider the two examples from the February 1992 
test-set shown in Figure 1. For Query 1, the system 
misunderstood the phrase "by six" as meaning "before 
6:00 a.m." Nonetheless, the answer is judged correct, 
because both the hypothesized and reference answers are 
the NULL set, i.e., no flights satisfy the set of constraints. 
For Query 2, the system found 279 flights, but the cor- 
rect answer is 278. The erroneous extra flight is the one 
connecting flight in the database shared by two airlines, 
Delta and USAIR. 
Another shortcoming of the present evaluation pro- 
cedure is that it has no place for interactive dialogue. 
In a realistic application, the user and the computer are 
often partners in problem solving, in which the final so- 
lution may be best obtained by allowing both sides to 
take the initiative in the conversation. Since the hu- 
man/computer dialogue can vary widely from system to 
system, it is impossible to use the data collected from one 
system to evaluate another system without making avail- 
able the computer's half of the conversation. Even then, 
the system being tested becomes an observer analyzing 
two sides of a conversation rather than a participant. 
II Measurements I\[ Mean \[Std. Dev. I\[ 
Total ~ of Queries Used 4.8 1.6 
# of Queries with Error Messages 1.0 1.4 
Time to Completion (S.) 166.1 66.0 
To be sure, the current evaluation protocol has served 
the community well. The refinements made during the 
last year have significantly improved its ability to pro- 
vide an objective benchmark. However, as we continue 
to press forward in developing useful spoken language 
systems that can help us solve problems, we must cor- 
respondingly expand the battery of evaluation protocols 
to measure the effectiveness of these systems in accom- 
plishing specific tasks. 
At the March 1991 meeting of the SLS Coordinating 
Committee, a working group was formed with the specific 
goal of exploring methodologies that will help us evaluate 
if, and how welt, a spoken language system accomplishes 
its task in the ATIS domain. The consensus of the work- 
ing group was that, while we may not have a clear idea 
about how to evaluate overall system performance, it is 
appropriate to conduct experiments in order to gain ex- 
perience. The purpose of this paper is to describe three 
experiments conducted at MIT over the past few months 
related to this issue. These experiments explored a num- 
ber of objective and subjective evaluation metrics, and 
found some of them to be potentially helpful in deter- 
mining overall system performance and usefulness. 
END-TO-END EVALUATION 
In order to carry out end-to-end evaluation, i.e., eval- 
uation of overall task completion effectiveness, we must 
be able to determine precisely the task being solved, the 
correct answer(s), and when thesubject is done. Once 
these factors have been specified, we can then compute 
some candidate measures and see if any of them are 
appropriate for characterizing end-to-end system perfor- 
mance. 
While true measures of system performance will re- 
quire a (near) real-time spoken language system, we felt 
that some preliminary experiments could be conducted 
within the context of our ATIS data collection effort \[3,2\]. 
In our data collection paradigm, a typist types in the 
subject's queries verbatim, after removing disfluencies. 
All subsequent processing is done automatically by the 
system. To collect data for end-to-end evaluation, we 
modified our standard data collection procedure slightly, 
by adding a specific scenario which has a unique answer. 
For this scenario, the subjects were asked to report the 
answer explicitly. 
As a preliminary experiment, we used two simple sce- 
narios. In one of them, subjects were asked to determine 
Table 2: Objective end-to-end measures. 
the type of aircraft used on a flight from Philadelphia to 
Denver that makes a stop in Atlanta and serves break- 
fast. Subjects were asked to end the scenario by saying 
"End scenario. The answer is" followed by a statement 
of the answer, e.g., "End scenario. The answer is Boe- 
ing 727." From the log files associated with the session 
scenario, we computed a number of objective measures, 
including the success of task completion, task completion 
time, the number of successful and the number of unsuc- 
cessful queries (producing a "no answer" message) 2. 
We collected data from 29 subjects and analyzed the 
data from 24 subjects 3. All subjects were able to com- 
plete the task, and statistics on some of the objective 
measures are shown in Table 2. 
Figure 1 displays scatter plots of the number of queries 
used by each subject as a function of the task completion 
time. A least-square fit of the data is superimposed. The 
number of queries used is well correlated with the task 
completion time (R = 0.84), suggesting that this measure 
may be appropriate for quantifying the usefulness of sys- 
tems, at least within the context of our experiment. Also 
plotted are the number of queries that generated a "no 
answer" message. The correlation of this measure with 
task completion time is not as good (R = 0.66), possibly 
due to subjects' different problem solving strategies and 
abilities. 
LOG FILE EVALUATION 
We also conducted a different set of experiments to 
explore subject-based evaluation metrics. Specifically, we 
extracted from the log files pairs of subject queries and 
system responses in sequence, and asked evaluators to 
judge the clarity of the query (i.e., clear, unclear, or un- 
intelligible) and the correctness of the response (correct, 
partially correct, incorrect, or "system generated an error 
message"). A program was written to enable evaluators 
to enter their answers on-line, and the results were tab- 
ulated automatically. We used seven evaluators for this 
experiment, all people from within our group. Four peo- 
ple had detailed knowledge of the system and the desig- 
2The system generates a range of diagnostic messages, reporting 
that it cannot parse, or that it cannot formulate a retrieval query, 
etc. 
3Data from the remaining subjects were not analyzed, since they 
have been designated by NIST as test material. 
29 
10' 
2, 
• # of queries • 
n # of error messages 
• lid •• 
@11 \[\] 
• f'.: . . 
i00 200 300 400 
Time to Completion 
Figure 1: Relationship beween task completion time and the 
total number of queries used, and the number of queries that 
generated a "no answer" message. 
nated correct reference answers. Three of the evaluators 
were familiar with the ATIS system capabilities, but did 
not have a detailed knowledge of what constituted a cor- 
rect reference answer for the comparator. Our anedyses, 
based on data from 7 evaluators, indicate that 82% of the 
time there was unanimous agreement among the evalua- 
tors, and there were 1 or fewer disagreements 92% of the 
time. ii! 
A2°: 
0 I 2 3 4 
Number of Evaluators Who Disagree 
Figure 2: Consistency of the 7 evaluators' answers during 
log file evaluation. The data are based on 115 query/answer 
pairs. 
These results suggest that reasonable agreement is 
possible using humans to evaluate log files. Such on-line 
evaluation is also quite cost effective; the evaluators were 
each able to check the 115 query/answer pairs in 30-45 
minutes. 
SYSTEM COMPARISON 
EXPERIMENT 
Building on the results of the pilot experiments on 
end-to-end and log file evaluation, we designed an exper- 
iment to test whether these metrics would be useful in 
distinguishing the performance of two systems on a more 
complex set of tasks. 
30 
Experimental Design 
We decided to compare the performance of two MIT 
systems: the "full parse" system and the "robust parse" 
system \[5\]. These two systems contrast a conservative 
approach that only answers when it is confident (the full 
parse system) against a more aggressive approach that is 
willing to make mistakes by answering much more often, 
based on partial understanding (the robust parse sys- 
tem). These systems had very measurably different per- 
formance in terms of the CAS metric, and our hypothesis 
was that the metrics would show that the robust-parsing 
system outperformed the full-parsing system. To try to 
capture a broader range of user behavior, we decided to 
vary the difficulty of the scenarios; we used two pairs of 
scenarios, where each pair consist of an "easy" scenario 
followed by a "hard" scenario. The scenarios were chosen 
to have a single correct answer. The easy scenarios were 
scenarios adapted from our previous data collection and 
could be solved with around three queries. The more dif- 
ficult scenarios were constructed to require more queries 
(10-15) to solve them. The four scenarios are shown in 
Table 3. 
The experiment used a within-subject design, with 
each subject using both systems. In order to neutral- 
ize the effects of the individual scenarios and the order 
of scenarios, all subjects were presented with the same 
scenarios, in the same order. We alternated sessions in 
which the robust parser was used for scenarios one and 
two with sessions in which the robust parser was used for 
scenarios three and four. Subjects were given no prior 
training or warm-up exercises. 
As of this writing, we have collected data from fifteen 
subjects. Eight of these subjects used a version of the 
system with the robust parser turned on for the first two 
scenarios and turned off for the second two; seven used 
the opposite configuration of full-parsing followed by ro- 
bust parsing. All but two of the subjects had not used 
the system before. 
We used our standard subject instructions, slightly 
modified to inform the subject that s/he would be us- 
ing two distinct systems. The subjects were drawn from 
the same pool as in our previous data collection efforts, 
namely MIT students and Staff. Each subject was given a 
$10 gift certificate for a local store. The subjects were not 
given any special incentive for getting correct answers, 
nor were they told that they would be timed. Each sub- 
ject was asked to fill out a version of our debriefing ques- 
tionnaire, slightly modified to include a specific question 
asking the subject which system s/he had preferred. 
We found that writing the scenarios was tricky, and 
we had to iterate several times on the wording of the 
scenario descriptions; in particular, we found that it was 
difficult to elicit the desired answer. Even when we al- 
tered the instructions to remind the subjects of what 
1. Find a flight from Philadelphia to Dallas that makes a stop in Atlanta. The flight should serve breakfast. Identify the 
type of aircraft that is used on the flight to Dallas. (Information requested: aircraft type) 
2. You want to fly from Boston to San Francisco on the last weekend in April (Saturday, April 25 or Sunday, April 26). 
You'd like to return to Boston on the following Wednesday in the evening, if possible. Your main concern is that all 
flights be on Continental since you are trying to maximize your frequent flyer miles. Identify one flight in each direction 
(by number) that you can take. (Information requested: flight number) 
3. Find a flight from Atlanta to Baltimore. The flight should be on a Boeing 757 and arrive around 7:00 P.M. Identify the 
flight (by number) and what meal is served on this flight. (Information requested: flight number, meal type) 
4. You live in Pittsburgh. You want to combine a vacation trip to Atlanta with business and take a friend along. You 
will receive a fixed travel allowance, based on a first-class ticket. Identify a coach class fare (dollar amount) that comes 
closest to allowing you to cover the expenses of both you and your friend based on the regular first class fare. Choose 
a date within the next seven days and make sure the fare does not have a restriction that disallows this. (Information 
requested: fare amount (in dollars) for coach class fare) 
Table 3: The four scenarios used by subjects in the second MIT end-to-end experiment. 
kind of answer they should provide, subjects did not al- 
ways read or follow the scenarios carefully. We wanted 
to avoid prompting the subjects with phrases that we 
knew the system understood. We therefore tried to word 
the scenarios in such a way that subjects would not be 
able to read from their instructions verbatim and ob- 
tain a response from the system. We also wanted to 
see what problem-solving strategies subjects would use 
when various options were presented to them, only one 
of which could solve their scenario. In Scenario 2, for ex- 
ample, there are no Continental flights on the Saturday 
or Wednesday evening in question. There are, however, 
Continental flights on Sunday and on Wednesday during 
the day. 
Results and Analyses 
From the collected data, we made a number of mea- 
surements for each scenario, and examined how the two 
systems differed in terms of these measures. The mea- 
surements that we computed are: 
• Scenario completion time; 
• Existence of a reported solution; 
• Correctness of the reported solution; 
• Number of queries; 
• Number of queries answered; number resulting in a 
"no answer" message from the system; 
• Logfile evaluation metrics, including queries judged 
to be correctly answered, incorrectly answered, par- 
tially correct, and out of domain(class X); also score, 
defined as % Correct - % Incorrect; 
• User satisfaction from debriefing questionnaire. 
Table 4 summarizes some of the results comparing the 
two systems across all scenarios. For the remainder of 
this section, we will try to analyze these results and reach 
some tentative conclusions. 
31 
Task Completion The first column of Table 4 shows 
that the subjects were able to provide an answer in allthe 
scenarios when the system was in robust mode, whereas 
only 83% of the scenarios were completed in non-robust 
mode. Interestingly, a detailed examination of the data 
shows that, for the 5 cases in the non-robust mode when 
users gave up, there was never an incorrectly answered 
query, but the number of unanswered queries was ex- 
tremely high. From a problem-solving standpoint, we can 
tentatively conclude that a system that takes chances and 
answers more queries seems to be more successful than a 
more conservative one. 
Finding the Correct Solution Our experimental pa- 
radigm allowed us to determine automatically, by pro- 
cessing the log files, whether the subject solved the sce- 
nario correctly, incorrectly, or not at all. A much larger 
percentage of the scenarios were correctly answered with 
the robust system than with the non-robust system (90% 
vs. 70%). Measured in terms of the percent of scenar- 
ios correctly solved, the robust system outperformed the 
non-robust system in all scenarios. 
Task Completion Time The task completion time is 
summarized in the third column of Table 4. The results 
are somewhat inconclusive, due to a number of factors. 
Although we were interested in assessing how long it took 
to solve a scenario, we did not inform our subjects of this. 
In part, this was because we didn't want to add more 
stress to the situation. More than one subject inexpli- 
cably cleared the history after having nearly solved the 
scenario, and then essentially repeated, sometimes ver- 
batim, the same series of questions. Had they thought 
there was a time constraint, they probably would not 
have done this. We suspect that because subjects were 
not encouraged to proceed quickly, it is difficult to draw 
any conclusions from the results on time-to-completion. 
Another insidious factor was background network traffic 
and machine load, factors that would contribute to vari- 
ations in time-to-completion which we did not control for 
[ Scenario System % of Scenarios Solution \[ Completion Number w/Solution Correct \[ Time(s) 
1 Robust 100 100 
1 Full 86 71 
2 Robust 100 88 
2 Full 86 86 
3 Robust 100 100 
3 Full 88 88 
4 Robust 100 71 
4 Full 75 38 
All Robust 100 90 
ALl FuU 83 70 
Number of 
Queries 
% of Queries 
Correct 
% of Queries 
Incorrect 
215 4.4 94 0 
215 4.7 70 0 
478 8.6 66 25 
483 10.6 39 4 
199 4.4 82 15 
376 8.0 42 0 
719 11.7 71 22 
643 9.8 51 0 
399 
434 
75 
48 
7.2 
8.3 
18 
1 
% of Queries \] DARPA 
No Answer \] Score 
6 94 
30 70 
8 41 
56 35 
3 68 
58 42 
6 49 
49 51 
6 57 
51 47 
Table 4: Mean metrics for robust and full parse systems, shown by scenario 
in these experiments. 
The next column of the same table shows the average 
number of queries for each scenario. Since these numbers 
appear to be well correlated with task completion time, 
they suffer from some of the same deficiencies. 
Log File Score In order to measure the number of 
queries correctly answered by the system, two system de- 
velopers independently examined each query/answer pair 
and judged the answer as correct, partially correct, incor- 
rect, or unanswered, based on the evaluation program de- 
veloped for the logfile evaluation. The system developers 
were in complete agreement 92% of the time. The cases 
of disagreement were examined to reach a compromise 
rating. This provided a quick and reasonably accurate 
way to assess whether the subjects received the informa- 
tion they asked for. The percentages of queries correctly 
answered, incorrectly answered, and unanswered, and the 
resulting DARPA score (i.e., % correct - % incorrect) are 
shown in the last four columns of Table 4. 
Although not shown in Table 4, the overall ratio of 
correctly answered queries to those producing no an- 
swer was an order of magnitude higher for the robust 
parser (148:13) than for the non-robust parser (118:125). 
This was associated with an order-of-magnitude increase 
in the number of incorrect answers: 32 vs. 3 for the 
non-robust parser. However, the percentage of "no an- 
swer" queries seemed to be more critical in determining 
whether a subject succeeded with a scenario than the 
percentage of incorrect queries. 
Debriefing Questionnaire Each subject received a de- 
briefing questionnaire, which included a question asking 
for a comparison of the two systems used. Unfortunately, 
data were not obtained from the first five subjects. Of 
the ten subjects that responded, five preferred the ro- 
bust system, one preferred the non-robust system, and 
the remaining ones expressed no preference. 
Difficulty of Scenarios There was considerable vari- 
ability among the scenarios in terms of difficulty. Sce- 
nario 4 turned out to be by far the most difficult one to 
32 
solve, with only a little over half of the sessions being 
successfully completed 4. Subjects were asked to "choose 
a date within the next week" and to be sure that the 
restrictions on their fare were acceptable. We intention- 
ally did not expand the system to understand the phrase 
"within the next week" to mean "no seven-day advance 
purchase requirement," but instead required the user to 
determine that information through some other means. 
Also in Scenario 4, there were no available first class fares 
that would exactly cover two coach class fares. Scenarios 
2 and 4 were intended to be more difficult than 1 and 
3, and indeed they collectively had a substantially lower 
percentage of correct query answers than the other two 
scenarios, reflecting the fact that subjects were groping 
for ways to ask for information that the system would be 
able to interpret. 
There was a wide variation across subjects in their 
ability to solve a given scenario, and in fact, subjects de- 
viated substantially from our expectations. Several sub- 
jects did not read the instructions carefully and ignored 
or misinterpreted key restrictions in the scenario. For in- 
stance, one subject thought the "within the next week" 
requirement in Scenario 4 meant that he should return 
within a week of his departure. Some subjects had a 
weak knowledge of air travel; one subject assumed that 
the return trip would be on the same flight as the forward 
leg, an assumption which caused considerable confusion 
for the system. 
The full parser and robust parser showed different 
strengths and weaknesses in specific scenarios. For ex- 
ample, in Scenario 3, the full parser often could not parse 
the expression "Boeing 757", but the robust parser had 
no trouble. This accounts in part for the large "win" of 
the robust parser in this scenario. Conversely, in Sce- 
nario 4, the robust parser misinterpreted expressions of 
the type "about two hundred dollars", treating "about 
two" as a time expression. This led the conversation 
badly astray in these cases, and perhaps accounts for the 
4 The other three scenarios were solved successfully on average 
nearly 90% of the time. 
fact that subjects took more time solving the scenario in 
robust mode. The lesson here is that different scenarios 
may find different holes in the systems under compari- 
son, thus making the comparison extremely sensitive to 
the exact choice and wording of the scenarios. 
Performance Comparison The robust parser performed 
better than the non-robust parser on all measures for all 
scenarios except in Scenario 4. In Scenario 4, the per- 
centage of sessions resulting in a correct solution favored 
robust parsing by a large margin (71% vs. 38%), but 
the robust parser had a longer time to completion and 
more queries to completion than the non-robust system, 
as well as a worse DARPA score (51% to 49%). The ro- 
bust parser gave a greater percentage of correct answers 
(71% vs. 51%), but its incorrect answers were signif- 
icant enough (22% to 0%) to reverse the outcome for 
the DARPA score. Thus DARPA score seems to be cor- 
related with time to completion, but percent of correct 
answers seems to be correlated with getting a correct so- 
lution. 
We feel that the data for Scenario 4, when used to 
make comparisons between the robust and non-robust 
parser, are anomalous for several reasons. The scenario 
itself confused subjects, some of whom incorrectly as- 
sumed that the correct fare was one which was exactly 
one-half of the first class fare. Furthermore, fare restric- 
tions are not as familiar to subjects as we previously as- 
sumed, leading to lengthy interactions with the system. 
These difficulties led to differences in performance across 
systems that we feel are not necessarily linked directly to 
the systems themselves but rather to the nature of the 
scenario being solved. In summary, our data show the 
following salient trends: 
1. Subjects were always able to complete the scenario 
for the robust system. 
2. Successful task completion distinguished the two 
systems: full parse system succeeded 70% of the 
time, compared with 90% for the robust system. 
3. Percent of correctly answered queries followed the 
same trend as completion time and number of over- 
all queries; these may provide a rough measure of 
task difficulty. 
4. Scores for the performance on individual queries 
were not necessarily consistent with overall success 
in solving the problem. 
5. Users expressed a preference for the robust system. 
CONCLUSIONS 
The results of these experiments are very encourag- 
ing. We believe that it is possible to define metrics that 
measure the performance of interactive systems in the 
context of interactive problem solving. We have had con- 
siderable success in designing end-to-end task completion 
tests. We have shown that it is possible to design such 
scenarios, that the subjects can successfully perform the 
33 
designated task in most cases, and that we can define ob- 
jective metrics, including time to task completion, num- 
ber of queries, and number of system non-responses. In 
addition, these metrics appear to be correlated. To as- 
sess correctness of system response, we have shown that 
evaluators can produce better than 90% agreement eval- 
uating the correctness of response based on examination 
of query/answer pairs from the log file. We have im- 
plemented an interactive tool to support this evaluation, 
and have used it in two separate experiments. Finally, 
we demonstrated the utility of these metrics in charac- 
terizing two systems. There was good correspondence 
between how effective the system was in helping the user 
arrive at a correct answer for a given task, and metrics 
such as time to task completion, number of queries, and 
percent of correctly answered queries (based on log file 
evaluation). These metrics also indicated that system 
behavior may not be uniform over a range of scenarios 
- the robust parsing system performed better on three 
scenarios, but had a worse DARPA score on the fourth 
(and probably most difficult) scenario. Based on these 
experiments, we believe that these metrics provide the 
basis for evaluating spoken language systems in a realis- 
tic interactive problem solving context. 
ACKNOWLEDGEMENTS 
.~Ve would like to thank several people who made sig- 
nificant contributions to this work. David Goodine de- 
signed and implemented the interactive log file evalua- 
tion interface which greatly facilitated running the log 
file evaluation experiments. Christie Winterton recruited 
the subjects for the various data collection experiments 
and served as the wizard/transcriber for many of the sub- 
jects in our end-to-end evaluation experiment. We would 
also like to thank Nancy Daly, James Glass, Rob Kassel 
and Victoria Palay for serving as evaluators in the log 
file evaluation. 
REFERENCES 
\[1\] Bates, M., Boisen, S., and Makhoul, J., "Developing an 
Evaluation Methodology for Spoken Language Systems," 
Proc. DARPA Speech and Natural Language Workshop, 
pp. 102-108, June, 1990. 
\[2\] MADCOW, "Multi-Site Data Collection for a Spoken 
Language Corpus," MADCOW, These Proceedings. 
\[3\] Polifroni, J. Seneff, S., and Zue, V., "Collection of 
Spontaneous Speech for the sc Atis Domain and Com- 
parative Analyses of Data Collected at MIT and TI," 
Proc. DARPA Speech and Natural Language Workshop, 
pp.360-365, February 1991. 
\[4\] Ramshaw, L. A. and S. Boisen, "An SLS Answer Com- 
parator," SLS Note 7, BBN Systems and Technologies 
Corporation, Cambridge, MA, May 1990. 
\[5\] Seneff, S., "A Relaxation Method for Understanding 
Spontaneous Speech Utterances," These Proceedings. 
