Evaluation of a Practical Interlingua 
for Task-Oriented Dialogue 
Lori Levin, Donna Gates, Alon Lavie, Fabio Pianesi, 
Dorcas Wallace, Taro Watanabe, Monika Woszczyna 
Language Technologies Institute, Carnegie Mellon University and 
ITC-irst, Trento, Italy
Internet: lsl@cs.cmu.edu
Abstract 
IF (Interchange Format), the interlingua used by 
the C-STAR consortium, is a speech-act based in- 
terlingua for task-oriented dialogue. IF was de- 
signed as a practical interlingua that could strike 
a balance between expressivity and simplicity. If 
it is too simple, components of meaning will be 
lost and coverage of unseen data will be low. On 
the other hand, if it is too complex, it cannot be 
used with a high degree of consistency by collab- 
orators on different continents. In this paper, we 
suggest methods for evaluating the coverage of IF 
and the consistency with which it was used in the 
C-STAR consortium. 
Introduction 
IF (Interchange Format) is an interlingua used
by the C-STAR consortium (http://www.c-star.org)
for task-oriented dialogues. Because it is used in
five different countries for six different languages,
it had to achieve a careful balance between being
expressive enough and being simple enough to be
used consistently.
If it was not expressive enough, components of 
meaning would be lost and coverage of unseen data 
would be low. On the other hand, if it was not sim-
ple enough, different system developers would use 
it inconsistently and the wrong meanings would be 
translated. IF is described in our previous papers 
([PT98, LGLW98, LLW+]).
For this paper, we have proposed methods for 
evaluating the coverage of IF and the degree to 
which it can be used consistently across C-STAR 
sites. Coverage was measured by having human IF 
specialists annotate unseen data. Consistency was 
measured by two means. The first was inter-coder 
agreement among IF specialists at Carnegie Mellon
University and ITC-irst (Centro per la ricerca
scientifica e tecnologica). The second, less direct
method, was a cross-site end-to-end evaluation of 
English-to-Italian translation where the English- 
to-IF analysis grammars were written at CMU and 
IF-to-Italian generation was developed at IRST. If 
the English and Italian grammar writers did not 
agree on the meaning of the IF, wrong transla- 
tions will be produced. In this way, the cross-site 
evaluation can be an indirect indicator of whether 
the CMU and IRST IF specialists agreed on the 
meaning of IF representations. For comparison, 
we also present within-site end-to-end evaluations 
of English-to-German, English-to-Japanese, and 
English-to-IF-to-English, where all of the analysis 
and generation grammars were written at CMU. 
The Interchange Format 
Because we are working with task-oriented dia- 
logues, adequate rendering of the speech act in the 
target language often overshadows the need for lit-
eral translation of the words. IF is therefore based 
on domain actions (DAs), which consist of
speech acts plus domain-specific concepts. An ex- 
ample of a DA is give-information+price+room 
(giving information about the price of a room). 
DAs are composed from 45 general speech acts 
(e.g., acknowledge, give-information, accept) 
and about 96 domain-specific concepts (e.g.,
price, temporal, room, flight, availability). 
In addition to the DA, IF representations can con- 
tain arguments such as room-type, destination, 
and price. There are about 119 argument types. 
In the following example, the DA consists 
of a speaker tag (a: for agent), the speech- 
act give-information, and two main concepts, 
+price and +room. The DA is followed by a list 
of arguments: room-type= and price=. The ar- 
guments have values that represent information
for the type of room (double) and the cost, repre-
sented with the complex argument price=, which
has its own arguments quantity=, currency=, and
per-unit=. This IF representation is neutral be-
tween sentences that have different verbs, sub-
jects, and objects, such as A double room costs 150
dollars a night, The price of a double room is 150
dollars a night, and A double room is 150 dollars
a night.²

  Cumulative
   Coverage   Percent   Count   DA
     15.7      15.7      652    acknowledge
     19.8       4.1      172    affirm
     23.3       3.4      143    thank
     26.0       2.7      113    introduce-self
     28.0       2.0       85    give-information+price
     30.1       2.0       85    greeting
     31.9       1.9       78    give-information+temporal
     33.7       1.8       75    give-information+numeral
     35.5       1.8       73    give-information+price+room
     37.2       1.7       70    request-information+payment
     38.8       1.6       66    give-information+payment
     40.3       1.5       64    give-inform+features+room
     41.7       1.4       60    give-inform+availability+room
     43.2       1.4       60    accept
     44.5       1.3       56    give-information+personal-data
     45.8       1.3       52    req-act+reserv+features+room
     46.9       1.2       48    req-verif-give-inform+numeral
     48.0       1.1       46    offer+help
     49.1       1.1       44    apologize
     50.1       1.0       42    request-inform+personal-data
     NA         --        244   no-tag

Figure 1: Coverage of top 20 DAs and no-tag in
development data
AGENT: ''a double room costs $150 a night.''
a:give-information+price+room
 (room-type=double,
  price=(quantity=150,
         currency=dollar,
         per-unit=night))
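A minimal sketch (not the C-STAR implementation) of how the textual IF notation shown above can be parsed into a nested structure. It assumes the shape illustrated in the example: a speaker tag, a speech act with plus-joined concepts, and a parenthesized argument list whose values may themselves be parenthesized argument lists.

```python
def parse_args(s):
    """Parse 'room-type=double, price=(quantity=150, ...)' recursively."""
    args, items, depth, start = {}, [], 0, 0
    for i, ch in enumerate(s):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:   # split only at top-level commas
            items.append(s[start:i])
            start = i + 1
    items.append(s[start:])
    for item in items:
        item = item.strip()
        if not item:
            continue
        name, _, value = item.partition("=")
        if value.startswith("("):        # complex argument: recurse
            args[name] = parse_args(value[1:-1])
        else:
            args[name] = value
    return args

def parse_if(text):
    """Split an IF expression into speaker, speech act, concepts, args."""
    head, _, arg_part = text.partition("(")
    speaker, _, da = head.strip().partition(":")
    speech_act, *concepts = da.split("+")
    arg_part = arg_part.strip()
    if arg_part.endswith(")"):           # drop the argument list's closer
        arg_part = arg_part[:-1]
    return {"speaker": speaker, "speech-act": speech_act,
            "concepts": concepts, "args": parse_args(arg_part)}

rep = parse_if("a:give-information+price+room "
               "(room-type=double, price=(quantity=150, "
               "currency=dollar, per-unit=night))")
print(rep["speech-act"], rep["concepts"])  # give-information ['price', 'room']
```

The function names and the dictionary layout are illustrative choices; the C-STAR components use their own internal representations.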
Coverage and Distribution of 
Dialogue Acts 
In this section, we address the coverage of IF for 
task-oriented dialogues about travel planning. We 
want to know whether a very simple interlingua 
like IF can have good coverage. We are using a 
rather subjective measure of coverage: IF experts 
hand-tagged unseen data with IF representations 
and counted the percentage of utterances to which 
no IF could be assigned. (When they tagged the 
unseen data, they were not told that the IF was 
being tested for coverage. The tagging was done 
for system development purposes.) Our end-to- 
end evaluation described in the following sections 
can be taken as a less subjective measure of
coverage.

² When we add anaphora resolution, we will need
to know whether a verb (cost) or a noun (price) was
used. This will be an issue in our new project,
NESPOLE! (http://nespole.itc.it/).

  Cumulative
   Coverage   Percent   Count   Speech Act
     30.1      30.1     1250    give-information
     45.8      15.7      655    acknowledge
     57.7      11.9      498    request-information
     62.7       5.0      209    request-verification-give-inform
     67.6       4.9      203    request-action
     71.7       4.1      172    affirm
     75.1       3.4      143    thank
     77.9       2.7      113    introduce-self
     80.2       2.4       98    offer
     82.4       2.1       89    accept
     84.4       2.0       85    greeting
     85.7       1.3       55    suggest
     86.8       1.1       44    apologize
     87.8       1.0       41    closing
     88.5       0.8       32    negate-give-information
     89.2       0.6       27    delay-action
     89.8       0.6       25    introduce-topic
     90.2       0.5       19    please-wait
     90.6       0.4       15    reject
     91.0       0.4       15    request-suggestion

Figure 2: Coverage of speech acts in development
data

However, the score of an end-to-end evalu-
ation encompasses grammar coverage problems as 
well as IF coverage problems. 
The development portion of the coverage ex- 
periment proceeded as follows. Over a period of 
two years, a database of travel planning dialogues 
was collected by C-STAR partners in the U.S., 
Italy, and Korea. The dialogues were role-playing 
dialogues between a person pretending to be a 
traveller and a person pretending to be a travel 
agent. For the English and Italian dialogues, the 
traveller and agent were talking face-to-face in the 
same language -- both speaking English or both
speaking Italian. The Korean dialogues were also 
role playing dialogues, but one participant was 
speaking Korean and the other was speaking En- 
glish. From these dialogues, only the Korean ut- 
terances are included in the database. Each utter- 
ance in the database is annotated with an English 
translation and an IF representation. Table 1 sum- 
marizes the amount of data in each language. The
English, Italian, and Korean data was used for IF 
development. 
The development database contains over 4000 
dialogue act units, which are covered by a total of 
about 542 distinct DAs (346 agent DAs and 278 
client DAs). Figures 1 and 2 show the cumulative 
coverage of the top twenty DAs and speech acts
in the development data. Figure 1 also shows the 
percentage of no-tag utterances (the ones we de- 
cided not to cover) in the development data. The 
first column shows the percent of the development 
data that is covered cumulatively by the DAs or
speech acts from the top of the table to the cur- 
rent line. For example, acknowledge and affirm
together account for 19.8 percent of the data. The 
Language(s)        Type of Dialogue              Number of DA Units

Development Data:
English            monolingual                         2698
Italian            monolingual                         1142
Korean-English     bilingual (only Korean
                   utterances are included)

Test Data:
Japanese-English   bilingual (Japanese and
                   English utterances are
                   included)                           6069

Table 1: The IF Database
  Cumulative
   Coverage   Percent   Count   DA
     --         4.6      263    no-tag
     15.6      15.6      885    acknowledge
     20.2       4.6      260    thank
     23.7       3.5      200    introduce-self
     27.0       3.4      191    affirm
     29.7       2.7      153    apologize
     32.3       2.6      147    greeting
     34.6       2.3      128    closing
     36.3       1.7       98    give-information+personal-data
     38.0       1.7       95    give-information+temporal
     39.5       1.6       89    give-information+price
     41.1       1.5       88    please-wait
     42.5       1.4       82    give-inform+telephone-number
     43.8       1.3       75    give-information+features+room
     45.0       1.1       65    request-inform+personal-data
     46.0       1.0       59    give-inform+temporal+arrival
     47.0       1.0       55    accept
     48.0       1.0       55    give-inform+availability+room
     48.9       1.0       55    give-information+price+room
     49.8       0.9       50    verify
     50.7       0.9       49    request-inform+temporal+arrival

Figure 3: Coverage of top 20 DAs and no-tag in
test data
  Cumulative
   Coverage   Percent   Count   Speech Act
     25.6      25.6     1454    give-information
     41.7      16.1      916    acknowledge
     53.6      11.9      677    request-information
     58.2       4.6      260    thank
     62.0       3.7      213    request-verification-give-inform
     65.5       3.5      200    introduce-self
     68.8       3.4      191    affirm
     72.0       3.2      181    request-action
     74.8       2.8      159    accept
     77.5       2.7      153    apologize
     80.1       2.6      147    greeting
     82.4       2.3      130    closing
     84.4       2.1      117    suggest
     86.3       1.8      104    verify-give-information
     87.9       1.7       94    offer
     89.5       1.5       88    please-wait
     90.6       1.1       65    negate-give-information
     91.5       0.9       50    verify
     92.0       0.5       30    negate
     92.5       0.5       26    request-affirmation

Figure 4: Coverage of top 20 speech acts in test
data
second column shows the percent of the develop- 
ment data covered by each DA or speech act. The 
third column shows the number of times each DA 
or speech act occurs in the development data. 
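The cumulative-coverage columns described above can be derived mechanically from raw counts. The following sketch shows one way to do this; the counts are toy values, not the paper's data.

```python
from collections import Counter

def coverage_table(counts, top_n=20):
    """Return (cumulative %, %, count, label) rows, most frequent first,
    mirroring the three numeric columns of Figures 1-4."""
    total = sum(counts.values())
    rows, cumulative = [], 0.0
    for label, count in Counter(counts).most_common(top_n):
        pct = 100.0 * count / total          # share of all DA units
        cumulative += pct                    # running coverage
        rows.append((round(cumulative, 1), round(pct, 1), count, label))
    return rows

# Toy DA counts (hypothetical):
toy = {"acknowledge": 652, "affirm": 172, "thank": 143, "greeting": 85}
for row in coverage_table(toy):
    print(row)
```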
The evaluation portion of the coverage ex- 
periment was carried out on 124 dialogues (6069 
dialogue act units) that were collected at ATR, 
Japan. One participant in each dialogue was 
speaking Japanese and the other was speaking En- 
glish. Both Japanese and English utterances are 
included in the data. The 124 Japanese-English 
dialogues were not examined closely by system de- 
velopers during IF development. After the IF de- 
sign was finalized and frozen in Summer 1999, the 
Japanese-English data was tagged with IFs. No 
further IF development took place at this point 
except that values for arguments were added. For 
example, Miyako could be added as a hotel name, 
but no new speech acts, concepts, or argument 
types could be added. Sentences were tagged as 
no-tag if the IF did not cover them. 
Figures 3 and 4 show the cumulative cover- 
age of the top twenty DAs and speech acts in the 
Japanese-English data, including the percent of 
no-tag sentences. 
Notice that the percentage of no-tag was 
lower in our test data than in our development 
data. This is because the role playing instructions 
for the test data were more restrictive than the 
role playing instructions for the development data. 
Figures 1 and 3 show that slightly more of the test 
data is covered by slightly fewer DAs. 
Cross-Site Reliability of IF 
Representations 
In this section we attempt to measure how reliably 
IF is used by researchers at different sites. Recall 
that one of the design criteria of IF was consis- 
tency of use by researchers who are separated by 
oceans. This criterion limits the complexity of IF. 
Two measures of consistency are used - inter-coder 
agreement and a cross-site end-to-end evaluation. 
Inter-Coder Agreement: Inter-coder agree- 
ment is a direct measure of consistency among 
                   Percent Agreement
Speech-act              82.14
Dialogue-act            65.48
Concept lists           88.00
Argument lists          85.79

Table 2: Inter-coder Agreement between CMU
and IRST
C-STAR partners. We used 84 DA units from 
the Japanese-English data described above. The 
84 DA units consisted of some coherent dialogue 
fragments and some isolated sentences. The
data was coded at CMU and at IRST. We counted
agreement on the components of the IF separately.
Table 2 shows agreement on speech acts, dialogue 
acts (speech act plus concepts), concepts, and ar- 
guments. The results are reported in Table 2 in 
terms of percent agreement. Further work might 
include some other calculation of agreement such 
as Kappa or precision and recall of the coders 
against each other. Figure 5 shows a fragment of 
a dialogue coded by CMU and IRST. The coders 
disagreed on the IF of the middle sentence, I'd like a twin
room please. One coded it as an acceptance of a 
twin room, the other coded it as a preference for 
a twin room. 
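The two measures discussed here, percent agreement (as in Table 2) and the Kappa statistic mentioned as possible further work, can be sketched as follows. The codings below are hypothetical examples, not the evaluation data.

```python
def percent_agreement(coder_a, coder_b):
    """Exact-match agreement between two coders on the same units."""
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return 100.0 * matches / len(coder_a)

def cohen_kappa(coder_a, coder_b):
    """Chance-corrected agreement between two coders (Cohen's Kappa)."""
    n = len(coder_a)
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    labels = set(coder_a) | set(coder_b)
    # Expected chance agreement from each coder's label distribution
    p_e = sum((coder_a.count(l) / n) * (coder_b.count(l) / n)
              for l in labels)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical speech-act codings for five DA units:
cmu  = ["acknowledge", "accept", "thank", "greeting", "affirm"]
irst = ["acknowledge", "give-information+preference", "thank",
        "greeting", "affirm"]
print(percent_agreement(cmu, irst))  # 80.0
```

Agreement on concept and argument lists would additionally need a convention for comparing sets rather than atomic labels.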
Cross-Site Evaluation: As an approximate and 
indirect measure of consistency, we have compared 
intra-site end-to-end evaluation with cross-site 
end-to-end evaluation. An end-to-end evaluation 
includes an analyzer, which maps the source lan- 
guage input into IF and a generator, which maps 
IF into target language sentences. The intra-site
evaluation was carried out on English-German, 
English-Japanese, and English-IF-English trans- 
lation. The English analyzer and the German, 
Japanese, and English generators were all writ- 
ten at CMU by IF experts who worked closely 
with each other. The cross-site evaluation was car- 
ried out on English-Italian translation, involving 
an English analyzer written at CMU and an Ital- 
ian generator written at IRST. The IF experts at 
CMU and IRST were in occasional contact with 
each other by email, and met in person two or 
three times between 1997 and 1999. 
A number of factors contribute to the success 
of an inter-site evaluation, just one of which is that 
the sites used IF consistently with each other. An- 
other factor is that the two sites used similar de- 
velopment data and have approximately the same 
coverage. If the inter-site evaluation results are 
about as good as the intra-site results, we can con- 
clude that all factors are handled acceptably, in- 
cluding consistency of IF usage. If the inter-site 
results are worse than the intra-site results, con- 
sistency of IF use or some other factor may be 
to blame. Before conducting this evaluation, we 
already knew that there was some degree of cross- 
site consistency in IF usage because we conducted 
successful inter-continental demos with speech 
translation and video conferencing in Summer 
1999. (The demos and some of the press coverage 
are reported on the C-STAR web site.) The de- 
mos included dialogues in English-Italian, English- 
German, English-Japanese, English-Korean, and 
English-French. At a later date, an Italian-Korean 
demo was produced with no additional work, thus 
illustrating the often-cited advantage of an inter-
lingual approach in a multi-lingual situation. The 
end-to-end evaluation reported here goes beyond 
the demo situation to include data that was un- 
seen by system developers. 
Evaluation Data: The Summer 1999 intra-site 
evaluation was conducted on about 130 utterances 
from a CMU user study. The traveller was played 
by a second-time user -- someone who had partici-
pated in one previous user study, but had no other 
experience with our MT system. The travel agent 
was played by a system developer. Both people 
were speaking English, but they were in different 
rooms, and their utterances were paraphrased us- 
ing IF. The end-to-end procedure was that (1) an 
English utterance was spoken and decoded by the 
JANUS speech recognizer, (2) the output of the rec- 
ognizer was parsed into an IF representation, and 
(3) a different English utterance (supposedly with 
the same meaning) was generated from the IF rep- 
resentation. The speakers had no other means of 
communication with each other. 
In order to evaluate English-German and 
English-Japanese translation, the IFs of the 130 
test sentences were fed into German and Japanese 
generation components at CMU. The data used in 
the evaluation was unseen by system developers 
at the time of the evaluation. For English-Italian 
translation, the IF representations produced by 
the English analysis component were sent to IRST 
to be generated in Italian. 
Evaluation Scoring: In order to score the eval- 
uation, input and output sentences were compared 
by bilingual people, or monolingual people in the 
case of English-IF-English evaluation. A score of 
ok is assigned if the target language utterance is
comprehensible and no components of meaning are 
deleted, added, or changed by the translation. A
We have singles, and twins and also Japanese rooms available on the eleventh.
CMU   a:give-information+availability+room
        (room-type=(single & twin & japanese_style), time=md11)
IRST  a:give-information+availability+room
        (room-type=(single & twin & japanese_style), time=md11)

I'd like a twin room, please.
CMU   c:accept+features+room (room-type=twin)
IRST  c:give-information+preference+features+room (room-type=twin)

A twin room is fourteen thousand yen.
CMU   a:give-information+price+room
        (room-type=twin, price=(currency=yen, quantity=14000))
IRST  a:give-information+price+room
        (room-type=twin, price=(currency=yen, quantity=14000))

Figure 5: Examples of IF coding from CMU and IRST
     Method             Output Language   OK+Perfect   Perfect   Grader   No. of Graders
 1   Recognition only   English              78 %        62 %     CMU          3
 2   Transcription      English              74 %        54 %     CMU          3
 3   Recognition        English              59 %        42 %     CMU          3
 4   Transcription      Japanese             77 %        59 %     CMU          2
 5   Recognition        Japanese             62 %        45 %     CMU          2
 6   Transcription      German               70 %        39 %     CMU          2
 7   Recognition        German               58 %        34 %     CMU          2
 8   Transcription      German               67 %        43 %     IRST         2
 9   Recognition        German               59 %        36 %     IRST         2
10   Transcription      Italian              73 %        51 %     IRST         6
11   Recognition        Italian              61 %        42 %     IRST         6

Figure 6: Translation Grades for English to English, Japanese, German, and Italian
score of perfect is assigned if, in addition to the 
previous criteria, the translation is fluent in the 
target language. A score of bad is assigned if the
target language sentence is incomprehensible or
some element of meaning has been added, deleted, 
or changed. The evaluation procedure is described 
in detail in [GLL+96]. In Figure 6, acceptable is
the sum of the perfect and ok scores.³
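The scoring rule just described can be sketched as a small aggregation step: each utterance receives one of the three grades, and acceptable is the sum of perfect and ok. The grades below are hypothetical, not the Figure 6 data.

```python
from collections import Counter

def summarize_grades(grades):
    """Return the Perfect and OK+Perfect (acceptable) percentages
    from a list of per-utterance grades: 'perfect', 'ok', or 'bad'."""
    counts = Counter(grades)
    total = len(grades)
    perfect = 100.0 * counts["perfect"] / total
    acceptable = 100.0 * (counts["perfect"] + counts["ok"]) / total
    return {"perfect": round(perfect), "acceptable": round(acceptable)}

# Hypothetical grades for ten translated utterances:
grades = ["perfect"] * 5 + ["ok"] * 2 + ["bad"] * 3
print(summarize_grades(grades))  # {'perfect': 50, 'acceptable': 70}
```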
Figure 6 shows the results of the intra-site 
and inter-site evaluations. The first row grades 
the speech recognition output against a human- 
produced transcript of what was said. This gives 
us a ceiling for how well we could do if trans- 
lation were perfect, given speech recognition er- 
rors. Rows 2 through 7 show the results of the 
intra-site evaluation. All analyzers and genera- 
tors were written at CMU, and the results were 
graded by CMU researchers. (The German re- 
sults are lower than the English and Japanese
results because less time was spent on gram-
mar development.) Rows 8 and 9 report on CMU's 
intra-site evaluation of English-German transla-
tion (the same system as in Rows 6 and 7), but
the results were graded by researchers at IRST.
Comparing Rows 6 and 7 with Rows 8 and 9, we
can check that CMU and IRST graders were us-
ing roughly the same grading criteria: a difference
of up to ten percent among graders is normal in
our experience. Rows 10 and 11 show the results
of the inter-site English-Italian evaluation. The
CMU English analyzer produced IF representa-
tions which were sent to IRST and were fed into
IRST's Italian generator. The results were graded
by IRST researchers.

³ In another paper ([LBL+00]), we describe a task-
based evaluation which focuses on the success of com-
municative goals and how long it takes to achieve them.
Conclusions drawn from the inter-site evaluation: 
Since the inter-site evaluation results are compa- 
rable to the intra-site results, we conclude that re- 
searchers at IRST and CMU are using IF at least 
as consistently as researchers within CMU. 
Future Plans 
In the next phase of C-STAR, we will cover de- 
scriptive sentences (e.g., The castle was built in 
the thirteenth century and someone was impris- 
oned in the tower) as well as task-oriented sen- 
tences. Descriptive sentences will be represented 
in a more traditional frame-based interlingua fo- 
cusing on lexical meaning and grammatical fea- 
tures of the sentences. We are working on disam- 
biguating literal from task-oriented meanings in 
context. For example, That's great could be an ac-
ceptance (like I'll take it; task-oriented) or could
just express appreciation. Sentences may also con-
tain a combination of task-oriented (e.g., Can you
tell me) and descriptive (how long the castle has 
been standing) components. 

References 
[GLL+96]  Donna Gates, A. Lavie, L. Levin, A. Waibel,
          M. Gavaldà, L. Mayfield, M. Woszczyna, and
          P. Zhan. End-to-End Evaluation in JANUS: A
          Speech-to-Speech Translation System. In Pro-
          ceedings of ECAI-96, Budapest, Hungary, 1996.

[LBL+00]  Lori Levin, Boris Bartlog, Ariadna Font Llitjos,
          Donna Gates, Alon Lavie, Dorcas Wallace, Taro
          Watanabe, and Monika Woszczyna. Lessons
          Learned from a Task-Based Evaluation of
          Speech-to-Speech MT. In Proceedings of LREC
          2000, Athens, Greece, June 2000. To appear.

[LGLW98]  Lori Levin, D. Gates, A. Lavie, and A. Waibel.
          An Interlingua Based on Domain Actions for
          Machine Translation of Task-Oriented Dia-
          logues. In Proceedings of the International Con-
          ference on Spoken Language Processing (IC-
          SLP'98), Sydney, Australia, 1998.

[LLW+]    Lori Levin, A. Lavie, M. Woszczyna, D. Gates,
          M. Gavaldà, D. Koll, and A. Waibel. The
          Janus-III Translation System. Machine Trans-
          lation. To appear.

[PT98]    Fabio Pianesi and Lucia Tovena. Using the In-
          terchange Format for Encoding Spoken Dia-
          logue. In Proceedings of the SIG-IL Workshop,
          1998.