A Minimum Message Length Approach for Argument Interpretation
Ingrid Zukerman and Sarah George
School of Computer Science and Software Engineering
Monash University
Clayton, VICTORIA 3800, AUSTRALIA
email: a0 ingrid,sarahg a1 @csse.monash.edu.au
Abstract
We describe a mechanism which receives
as input a segmented argument composed
of NL sentences, and generates an inter-
pretation. Our mechanism relies on the
Minimum Message Length Principle for
the selection of an interpretation among
candidate options. This enables our mech-
anism to cope with noisy input in terms
of wording, beliefs and argument struc-
ture; and reduces its reliance on a partic-
ular knowledge representation. The per-
formance of our system was evaluated by
distorting automatically generated argu-
ments, and passing them to the system
for interpretation. In 75% of the cases,
the interpretations produced by the system
matched precisely or almost-precisely the
representation of the original arguments.
1 Introduction
Discourse interpretation is at the cornerstone of
human-computer communication, and an essential
component of any dialogue system. In order to pro-
duce an interpretation from a user’s NL utterances,
the concepts referenced by the user’s words must
be identified, the propositions built using these con-
cepts must be understood, and the relations between
these propositions must be determined. Each of
these tasks is fraught with uncertainty.
In this paper, we focus on the interpretation of ar-
gumentative discourse, which is composed of impli-
cations. We present a mechanism for the interpre-
tation of NL arguments which is based on the ap-
plication of the Minimum Message Length (MML)
Principle for the evaluation of candidate interpreta-
tions (Wallace and Boulton, 1968). The MML prin-
ciple provides a uniform and incremental framework
for combining the uncertainty arising from differ-
ent stages of the interpretation process. This enables
our mechanism to cope with noisy input in terms of
wording, beliefs and argument structure, and to fac-
tor out the elements of an interpretation which rely
on a particular knowledge representation.
Our interpretation mechanism is embedded
in a web-based argumentation system called
BIAS (Bayesian Interactive Argumentation Sys-
tem). BIAS uses Bayesian Networks (BNs) (Pearl,
1988) as its knowledge representation and reasoning
formalism. It is designed to be a comprehensive ar-
gumentation system which will eventually engage in
an unrestricted interaction with users. However, the
current version of BIAS performs two activities: it
generates its own arguments (from a BN) and inter-
prets users’ arguments (generating a Bayesian sub-
net as an interpretation of these arguments). In this
paper we focus on the interpretation task.
Figure 1(a) shows a simple argument given by a
user, and Figure 1(d) shows a subset of a BN which
contains the preferred interpretation of the user’s ar-
gument; the nodes corresponding to the user’s in-
put are shaded. The user’s argument is obtained
through a web interface (the uncertainty value of
the consequent is entered using a drop-down menu).
In this example, the user’s input differs structurally
from the system’s interpretation, the belief value for
the consequent differs from that in the domain BN,
and the wording of the statements differs from the
canonical wording of the BN nodes. Still, the sys-
tem found a reasonable interpretation in the context
of its domain model.
The results obtained in this informal trial are val-
idated by our automated evaluation. This evalua-
     Philadelphia, July 2002, pp. 211-220.  Association for Computational Linguistics.
                  Proceedings of the Third SIGdial Workshop on Discourse and Dialogue,
ML(IG      |IG          )Usr SysInt
ML(UArg | IG     )Usr
ML(IG          |SysInt) = 0SysInt
(d) SysInt (Top 
     Candidate) N reported argument
N heard argument
G argued with B
G and B were enemies
G was in garden at 11
G was in garden
at time of death
G had motiveG had opportunity
G murdered B
N reported argument
N heard argument
G argued with B
G and B were enemies
G was in garden at 11
G was in garden
at time of death
G had motiveG had opportunity
G murdered B
(c) IG        for best 
     SysIntSysInt
(b) Top-ranked IG
N reported argumentG was in garden at 11
G murdered B
(a) User’s original argument (UArg)
Usr
The neighbour reported a             argument
  between Mr Green and Mr Body last week
Mr Green was         in the garden at 11
Mr Body was murdered by Mr Green
AND
-> [likely]
heated
seen
Figure 1: Interpretation and MML evaluation
tion, which assesses baseline performance, consists
of passing distorted versions of the system’s argu-
ments back to the system for interpretation. In 75%
of the cases, the interpretations produced by the sys-
tem matched the original arguments (in BN form)
precisely or almost-precisely.
In the next section, we review related research.
We then describe the application of the MML crite-
rion to the evaluation of interpretations. In Section 4,
we outline the argument interpretation process. The
results of our evaluation are reported in Section 5,
followed by concluding remarks.
2 Related Research
Our research integrates plan recognition for dis-
course understanding with the application of the
MML principle (Wallace and Boulton, 1968).
The system described in (Carberry and Lambert,
1999) recognized a user’s intentions during expert-
consultation dialogues. This system considered sev-
eral knowledge sources for discourse understanding.
It used plan libraries as its main knowledge rep-
resentation formalism, and handled short conversa-
tional turns. In contrast, our system relies on BNs
and handles unrestricted arguments.
BNs have been used in several systems that per-
form plan recognition for discourse understanding,
e.g., (Charniak and Goldman, 1993; Horvitz and
Paek, 1999; Zukerman, 2001). Charniak and Gold-
man’s system handled complex narratives, using a
BN and marker passing for plan recognition. It au-
tomatically built and incrementally extended a BN
from propositions read in a story, so that the BN
represented hypotheses that became plausible as the
story unfolded. Marker passing was used to restrict
the nodes included in the BN. In contrast, we use do-
main knowledge to constrain our understanding of
the propositions in a user’s argument, and apply the
MML principle to select a plausible interpretation.
Like Carberry and Lambert’s system, both
Horvitz and Paek’s system and Zukerman’s handled
short dialogue contributions. Horvitz and Paek used
BNs at different levels of an abstraction hierarchy
to infer a user’s goal in information-seeking inter-
actions with a Bayesian Receptionist. In addition,
they used decision-theoretic strategies to guide the
progress of the dialogue. We expect to use such
strategies when our system engages in a full dia-
logue with the user. In previous work, Zukerman
used a domain model and user model represented as
a BN, together with linguistic and attentional infor-
mation, to infer a user’s goal from a short-form re-
joinder. However, the combination of these knowl-
edge sources was based on heuristics.
The approach presented in this paper extends our
previous work in that (1) it handles input of unre-
stricted length, (2) it offers a principled technique
for selecting between alternative interpretations of a
user’s discourse, and (3) it handles discrepancies be-
tween the user’s input and the system’s expectations
at all levels (wording, beliefs and inferences). Fur-
ther, this approach makes no assumptions regarding
the synchronization between the user’s beliefs and
the system’s beliefs (but it assumes that the system
is a domain expert). Finally, this approach may be
extended to incorporate various aspects of discourse
and dialogue, such as information pertaining to the
dialogue history and user modeling information.
The MML principle is a model-selection tech-
nique which applies information-theoretic crite-
ria to trade data fit against model complex-
ity (a glossary of model-selection techniques
appears in http://www-white.media.mit.edu/
a0 tpminka/statlearn/glossary). MML has been
used in a variety of applications, e.g., in NL it was
used for lexical selection in speech understanding
(Thomas et al., 1997). In this paper, we demonstrate
its applicability to a higher-level NL task.
3 Argument Interpretation Using MML
The MML criterion implements Occam’s Razor,
which may be stated as follows: “If you have two
theories which both explain the observed facts, then
you should use the simplest until more evidence
comes along”. According to the MML criterion,
we imagine sending to a receiver a message that de-
scribes a user’s NL argument, and we want to send
the shortest possible message.1 This message corre-
sponds to the simplest interpretation of a user’s argu-
ment. We postulate that this interpretation is likely
to be a reasonable interpretation (although not nec-
essarily the intended one).
A message that encodes an NL argument in terms
of an interpretation is composed of two parts: (1) in-
structions for building the interpretation, and (2) in-
structions for rebuilding the original argument from
this interpretation. These two parts balance the need
for a concise interpretation (Part 1) with the need for
an interpretation that matches closely the user’s ut-
terances (Part 2). For instance, the message for a
concise interpretation that does not match well the
original argument will have a short first part but a
long second part. In contrast, a more complex in-
terpretation which better matches the original argu-
ment may yield a message that is shorter overall,
with a longer first portion, but a shorter second por-
tion. Thus, the message describing the interpretation
(BN) which best matches the user’s intent will be
among the messages with a short length (hopefully
the shortest). Further, a message which encodes an
NL argument in terms of a reasonable interpretation
will be shorter than the message which transmits the
words of the argument directly. This is because an
interpretation which comprises the nodes and links
in a Bayesian subnet (Part 1 of the message) is much
1It is worth noting that the sender and the receiver are theo-
retical constructs of the MML theory, which are internal to the
system and are not to be confused with the system and the user.
The concept of a receiver which is different from the sender en-
sures that the message constructed by the sender to represent a
user’s argument does not make unwarranted assumptions.
more compact than a sequence of words which iden-
tifies these nodes and links. If this interpretation is
reasonable (i.e., the user’s argument is close to this
interpretation), then the encoding of the discrepan-
cies between the user’s argument and the interpre-
tation (Part 2 of the message) will not significantly
increase the length of the message.
In order to find the interpretation with the shortest
message length, we compare the message lengths of
candidate interpretations. These candidates are ob-
tained as described in Section 4.
3.1 MML Encoding
The MML criterion is derived from Bayes Theorem:
Pra1a3a2a5a4a7a6a9a8a11a10 Pra1a3a6a9a8a13a12 Pra1a3a2a9a14a6a15a8 , wherea2 is the data
and a6 is a hypothesis which explains the data.
An optimal code for an event a16 with probability
Pra1a16 a8 has message length MLa1a16 a8a17a10a19a18a21a20a23a22a25a24a27a26 Pra1a16 a8
(measured in bits). Hence, the message length for
the data and a hypothesis is:
MLa1a3a2a5a4a7a6a15a8a11a10 MLa1a3a6a15a8a29a28 MLa1a3a2a15a14a6a9a8a31a30
The hypothesis for which MLa1a3a2a32a4a7a6a15a8 is minimal is
considered the best hypothesis.
Now, in our context, UArg contains the user’s ar-
gument, and SysInt an interpretation generated by
our system. Thus, we are looking for the SysInt
which yields the shortest message length for
MLa1UArga4 SysInta8a25a10 MLa1SysInta8a29a28 MLa1UArga14SysInta8
The first part of the message describes the in-
terpretation, and the second part describes how
to reconstruct the argument from the interpreta-
tion. To calculate the second part, we rely on
an intermediate representation called Implication
Graph (IG). An Implication Graph is a graphi-
cal representation of an argument, which repre-
sents a basic “understanding” of the argument.
It is composed of simple implications of the
form Antecedenta33 Antecedenta26 a30a34a30a34a30 Antecedenta35a37a36
Consequent (where a36 indicates that the antecedents
imply the consequent, without distinguishing be-
tween causal and evidential implications). a38a40a39 Usr
represents an understanding of the user’s argument.
It contains propositions from the underlying repre-
sentation, but retains the structure of the user’s ar-
gument. a38a40a39 SysInt represents an understanding of a
candidate interpretation. It is directly obtained from
SysInt, but it differs from SysInt in that all its arcs
point towards a goal node and head-to-head evi-
dence nodes are represented as antecedents of an im-
plication, while SysInt is a general Bayesian subnet.
Since both a38a40a39 Usr and a38a27a39 SysInt use domain proposi-
tions and have the same type of representation, they
can be compared with relative ease.
Figure 1 illustrates the interpretation of a short ar-
gument presented by a user, and the calculation of
the message length of the interpretation. The inter-
pretation process obtains a38a40a39 Usr from the user’s in-
put, and SysInt from a38a27a39 Usr (left-hand side of Fig-
ure 1). If a sentence in UArg matches more than
one domain proposition, the system generates more
than onea38a27a39 Usr from UArg (Section 4.1). Eacha38a27a39 Usr
may in turn yield more than one SysInt. This hap-
pens when the underlying representation has sev-
eral ways of connecting between the nodes in a38a40a39 Usr
(Section 4.2). The message length calculation goes
from SysInt to UArg through the intermediate rep-
resentations a38a40a39 SysInt and a38a40a39 Usr (right-hand side of
Figure 1). This calculation takes advantage of the
fact that there can be only onea38a40a39 Usr for each UArg–
SysInt combination. Hence,
Pra1 UArga4 SysInta8 a10 Pra1 UArga0 a38a27a39 Usra0 SysInta8
a10 Pra1 UArga14
a38a40a39 Usr
a0 SysInt
a8 a12
Pra1a38a27a39 Usr
a14SysInta8 a12 Pra1 SysInta8
cond. ind.a10 Pra1 UArga14
a38a40a39 Usr
a8 a12
Pra1a38a27a39 Usra14SysInta8 a12 Pra1 SysInta8
Thus, the length of the message required to trans-
mit the user’s argument and an interpretation is
MLa1 UArga4 SysInta8 a10 MLa1 UArga14a38a40a39 Usra8a29a28
MLa1a38a40a39 Usra14SysInta8a29a28
MLa1 SysInta8 (1)
That is, for each candidate interpretation, we cal-
culate the length of the message which conveys:
a1 SysInt – the interpretation,
a1
a38a40a39 Usr
a14SysInt – how to obtain the belief and struc-
ture of a38a40a39 Usr from SysInt,2 and
a1 UArg
a14
a38a27a39 Usr – how to obtain the sentences in UArg
from the corresponding propositions in a38a27a39 Usr.
The interpretation which yields the shortest message
is selected (the message-length equations for each
component are summarized in Table 1).
2We use a2a4a3 SysInt for this calculation, rather than SysInt.
This does not affect the message length because the receiver
can obtain a2a4a3 SysInt directly from SysInt.
Throughout the remainder of this section, we de-
scribe the calculation of the components of Equa-
tion 1, and illustrate this calculation using the simple
example in Figure 2 (the message length calculation
for our example is summarized in Table 2).
UArg: a5a7a6 Usr:
Mr Body and Mr Green argued
a8 Mr Green had a motive to
kill Mr Body
G argued with B
G had motive
a5a7a6 SysInt: SysInt:
G argued with B
G had motive
G and B were enemies
G argued with B
G had motive
G and B were enemies
Figure 2: Simple Argument and Interpretation
3.2 Calculating MLa1 SysInta8
In order to transmit SysInt, we simply send its propo-
sitions and the relations between them. A standard
MML assumption is that the sender and receiver
share domain knowledge (recall that the receiver is
not the user, but is a construct of the MML theory).
Hence, one way to send SysInt consists of transmit-
ting how SysInt is extracted from the domain rep-
resentation. This involves selecting its propositions
from those in the domain, and then choosing which
of the possible relations between these propositions
are included in the interpretation. In the case of a
BN, the propositions are represented as nodes, and
the relations between propositions as arcs. Thus the
message length for SysInt in the context of a BN is
a20a23a22a25a24 a26 a1 # nodes(SysInt)a8 a28 a20a23a22a25a24 a26 a1 # arcs(SysInt)a8a28
a20a23a22a25a24 a26 C# nodes(domainBN)
# nodes(SysInt)
a28 a20a23a22a25a24 a26 C# incident arcs(SysInt)
# arcs(SysInt)
(2)
For the example in Figure 2, in order to transmit
SysInt we must choose 3 nodes from the 82 nodes
in the BN which represents our murder scenario (the
Bayesian subnet in Figure 1(d) is a fragment of this
BN). We must then select 2 arcs from the 3 arcs that
connect these nodes. This yields a message of length
a20a23a22a25a24a13a26a10a9 a28 a20a23a22a25a24a13a26a10a11 a28 a20a22a25a24 a26 C
a12
a26
a13
a28 a20a23a22a25a24a40a26 C
a13
a26 a10
a14 a30a16a15 a28 a14 a28 a14 a15a40a30a18a17 a28 a14 a30a16a15 a10a19a11a21a20a27a30a16a15 bits.
3.3 Calculating MLa1 IGUsra14SysInta8
The message which describes a38a27a39 Usr in terms of
SysInt (or rather in terms of a38a27a39 SysInt) conveys how
a38a40a39 Usr differs from the system’s interpretation in two
respects: (1) belief, and (2) argument structure.
3.3.1 Belief differences
For each proposition a0 in botha38a27a39 SysInt anda38a40a39 Usr,
we transmit any discrepancy between the belief
stated by the user and the system’s belief in this
proposition (propositions that appear in only one IG
are handled by the message component which de-
scribes structural differences). The length of the
message required to convey this information is
a1
a2a4a3a6a5a8a7
Usra9
a5a8a7
SysInt
MLa1a11a10a13a12a15a14 a1a0 a0 a38a40a39 Usra8a34a14a10a13a12a15a14 a1a0 a0 a38a40a39 SysInta8a8
where a10a13a12a15a14 a1a0 a0 a38a40a39a17a16 a8 is the belief in proposition a0
in a38a40a39a17a16 . Assuming an optimal message encoding,
we obtain
a1
a2a4a3a6a5a8a7
Usra9
a5a8a7
SysInt
a18 a20a23a22a25a24 a26 Pra1a11a10a13a12a15a14 a1a0
a0
a38a40a39 Usr
a8a34a14a10a13a12a15a14 a1a0
a0
a38a40a39 SysInt
a8a8
(3)
which expresses discrepancies in belief as a proba-
bility that the user will hold a particular belief in a
proposition, given the belief held by the system in
this proposition.
Since our system interacts with people, we use
linguistic categories of probability that people find
acceptable (similar to those used in Elsaesser, 1987)
instead of precise probabilities. Our 7 categories are:
a18
VeryUnlikely, Unlikely, ALittleUnlikely, EvenChance, ALit-
tleLikely, Likely, VeryLikelya19 . This yields the following
approximation of Equation 3:
a1
a2a20a3a21a5a8a7
Usra9
a5a8a7
SysInt
a18 a20a23a22a25a24 a26 Pra1a11a10a22a14a24a23a26a25a34a1a0
a0
a38a27a39 Usr
a8a34a14a10a22a14a24a23a26a25a34a1a0
a0
a38a40a39 SysInt
a8a8
(4)
where a10a27a14a24a23a17a25 a1a0 a0 a38a40a39a17a16 a8 is the category for the belief
in node a0 in a38a40a39a17a16 .
In the absence of statistical information about dis-
crepancies between user beliefs and system beliefs,
we have devised a probability function as follows:
Pra1a11a10a22a14a24a23a26a25a34a1a0 a0 a38a27a39 Usr
a8a34a14a10a22a14a24a23a26a25a34a1a0
a0
a38a27a39 SysInt
a8a8a17a10
a28
a12 a11
a2a30a29a15a31a33a32a35a34a11a36
a33
a36a38a37a39a41a40a42a32a35a34a11a43a44a2a30a45a5a8a7
Usra46
a36a47a39a41a40a42a32a35a34a11a43a44a2a30a45a5a8a7
SysInta46
a37
(5)
where a28 is a normalizing constant, and NumCt is
the number of belief categories (=7). This function
yields a maximum probability when the user’s be-
lief in node a0 agrees with the system’s belief. This
probability gets halved (adding 1 bit to the length of
the message) for each increment or decrement in be-
lief category. For instance, if both the user and the
system believe that node a0 is Likely, Equation 5 will
yield a probability of a28 a12 a11a49a48 a36 a33 a36a51a50 a10 a15a7a17 a28 . In con-
trast, if the user believed that this node has only an
EvenChance, then the probability of this belief given
the system’s belief would be a28 a12 a11 a48 a36 a33 a36
a26
a10
a14
a15
a28 .
3.3.2 Structural differences
The message which transmits the structural dis-
crepancies between a38a27a39 SysInt and a38a40a39 Usr describes the
structural operations required to transform a38a40a39 SysInt
into a38a40a39 Usr. These operations are: node insertions
and deletions, and arc insertions and deletions. A
node is inserted in a38a40a39 SysInt when the system can-
not reconcile a proposition in the user’s argument
with any proposition in its domain representation.
In this case, the system proposes a special Escape
(wild card) node. Note that the system does not pre-
sume to understand this proposition, but still hopes
to achieve some understanding of the argument as a
whole. Similarly, an arc is inserted when the user
mentions a relationship which does not appear in
a38a40a39 SysInt. An arc (node) is deleted when the corre-
sponding relation (proposition) appears in a38a27a39 SysInt,
but is omitted from a38a40a39 Usr. When a node is deleted,
all the arcs incident upon it are rerouted to connect
its antecedents directly to its consequent. This op-
eration, which models a small inferential leap, pre-
serves the structure of the implication around the
deleted node. If the arcs so rerouted are inconsis-
tent with a38a40a39 Usr they will be deleted separately.
For each of these operations, the message an-
nounces how many times the operation was per-
formed (e.g., how many nodes were deleted) and
then provides sufficient information to enable the
message receiver to identify the targets of the op-
eration (e.g., which nodes were deleted). Thus, the
length of the message which describes the structural
operations required to transform a38a27a39 SysInt into a38a40a39 Usr
comprises the following components:
MLa1a38a40a39 Usra14a38a40a39 SysInta8 a10
MLa1 node insertionsa8 a28 MLa1 node deletionsa8 a28
MLa1 arc insertionsa8a29a28 MLa1 arc deletionsa8 (6)
a1 Node insertions = number of inserted nodes plus
the penalty for each insertion. Since a node
is inserted when no proposition in the domain
matches a user’s statement, we use an insertion
penalty equal to a52a54a53 – the probability-like score
of the worst acceptable word-match between the
user’s statement and a proposition (Section 4.1).
Thus the message length for node insertions is
a20a23a22a25a24 a26 a1 # nodes insa8a28 # nodes insa12 a1a18 a20a23a22a25a24 a26
a52a41a53
a8 (7)
a1 Node deletions = number of deleted nodes plus
their designations. To designate the nodes to be
deleted, we select them from the nodes in SysInt
(or a38a40a39 SysInt):
a20a23a22a25a24a13a26a25a1 # nodes dela8 a28 a20a23a22a25a24 a26 C# nodes(
a2a4a3
SysInt)
# nodes del (8)
a1 Arc insertions = number of inserted arcs plus
their designations plus the direction of each arc.
(This component also describes the arcs incident
upon newly inserted nodes.) To designate an arc,
we need a pair of nodes (head and tail). However,
some nodes in a38a40a39 SysInt are already connected by
arcs, which must be subtracted from the total
number of arcs that can be inserted, yielding
# poss arc ins a10 C# nodes(
a2a4a3
SysInt)+# nodes insa26
a18 # arcs(
a38a40a39 SysInt)
We also need to send 1 extra bit per inserted arc
to convey its direction. Hence, the length of the
message that conveys arc insertions is:
a20a23a22a25a24 a26 a1 # arcs insa8a28 a20a23a22a25a24 a26 C# poss arc ins
# arcs ins
a28 # arcs ins
(9)
a1 Arc deletions = number of deleted arcs plus their
designations.
a20a23a22a25a24a13a26 a1 # arcs dela8a29a28 a20a22a25a24 a26 C# arcs(
a2a4a3
SysInt)
# arcs del (10)
For the example in Figure 2, a38a40a39 SysInt and a38a40a39 Usr
differ in the node [B and G were enemies] and
the arcs incident upon it. In order to transmit that
this node should be deleted from a38a27a39 SysInt, we must
select it from the 3 nodes comprising a38a27a39 SysInt. The
length of the message that conveys this information
is: a20a23a22a25a24 a26 a14 a28a32a20a23a22a25a24 a26 Ca13
a33
a10 a14 a30a16a15 bits (the automatic rerout-
ing of the arcs incident upon the deleted node yields
a38a40a39 Usr at no additional cost).
3.4 Calculating ML(UArga14IGUsr)
The user’s argument is structurally equivalent to
a38a40a39 Usr. Hence, in order to transmit UArg in terms of
a38a40a39 Usr we only need to transmit how each statement
in UArg differs from the canonical statement gener-
ated for the matching node in a38a40a39 Usr (Section 4.1).
The length of the message which conveys this infor-
mation is
a1
a2a4a3a6a5a8a7
Usr
MLa1 Sentencea2 in UArga14a0 a8
Table 1: Summary of Message Length Calculation
MLa1 UArga4 SysInta8 Equation 1
MLa1 SysInta8 Equation 2
MLa1a38a40a39 Usra14SysInta8
belief operations Equations 4, 5
structural operations Equations 6, 7, 8, 9, 10
MLa1 UArga14a38a40a39 Usra8 Equation 11
Table 2: Summary of Message Length Calculation
for the Simple Argument
MLa1 SysInta8 20.6 bits
MLa1a38a40a39 Usra14SysInta8
belief operations (no beliefs stated) 0.0 bits
structural operations 1.6 bits
MLa1 UArga14a38a40a39 Usra8 65.6 bits
MLa1 UArga4 SysInta8 87.8 bits
where Sentencea2 in UArg is the user’s sentence
which matches the proposition for node a0 ina38a40a39 Usr.
Assuming an optimal message encoding, we obtain
a1
a2a20a3a21a5a8a7
Usr
a18 a20a22a25a24a13a26 Pra1 Sentence
a2 in UArg
a14a0 a8 (11)
We approximate Pra1 Sentencea2 in UArga14a0 a8 using
the score returned by the comparison function de-
scribed in Section 4.1. For the example in Fig-
ure 2, the discrepancy between the canonical sen-
tences “Mr Body argued with Mr Green” and “Mr
Green had a motive to murder Mr Body” and the
corresponding user sentences yields a message of
length 33.6 bits + 32 bits respectively (=65.6 bits).
4 Interpreting Arguments
Our system generates candidate interpretations for a
user’s argument by first postulating propositions that
match the user’s sentences, and then finding differ-
ent ways to connect these propositions – each variant
is a candidate interpretation.
4.1 Postulating propositions
We currently use a naive approach for postulating
propositions. For each user sentence a0 Usr we gen-
erate candidate propositions as follows. For each
node a0 in the domain, the system proposes one or
more canonical sentences a0 a2 (produced by a simple
English generator). This sentence is compared to
a0
Usr, yielding a match-score for the pair (
a0
Usr,
a0 ).
When a match-score is above a threshold a52 a53 , we
have found a candidate interpretation for a0 Usr.3 For
example, the proposition [G was in garden at 11] in
Figure 1(b) is a plausible interpretation of the input
sentence “Mr Green was seen in the garden at 11” in
Figure 1(a). Some sentences may have no proposi-
tions with match-scores above a52 a53 . This does not
automatically invalidate the user’s argument, as it
may still be possible to interpret the argument as a
whole, even if a few sentences are not understood
(Section 3.3).
The match-score for a user sentence a0 Usr and a
proposition a0 – a number in the [0,1] range – is
scaled from a weighted sum of individual word-
match scores that relate words in a0 Usr with words
in a0 a2 . Inserted or deleted words are given a fixed
penalty.
The goodness of a word-match depends on the
following factors: (1) level of synonymy – the per-
centage of synonyms the words have in common (ac-
cording to WordNet, Miller et al., 1990); (2) posi-
tion in sentence (expressed as a fraction, e.g., “1/3
of the way through the sentence”); and (3) relation
tags – SUBJ/OBJ tags as well as parts-of-speech such
as NOUN, VERB, etc (obtained using the MINIPAR
parser, Lin 1998). That is, the a0 th word in sentence
a0
a2 ,
a1a3a2
a45a4a6a5 , matches perfectly the
a7 th word in the
user’s sentence, a1a9a8 a45a4 Usr, if both words are exactly
the same, they are in the same sentence position,
and they have the same relation tag. The match-
score between a1a3a2 a45a4a6a5 and a1a9a8 a45a4 Usr is reduced if their
level of synonymy is less than 100%, or if there are
discrepancies in their relation tags or their sentence
positions. For instance, consider the canonical sen-
tence “Mr Green murdered Mr Body” and the user
sentences “Mr Body was murdered by Mr Green”
and “Mr Green murdered Ms Scarlet”. The first user
sentence has a higher score than the second one.
This is because the mismatch between the canoni-
cal sentence and the first user sentence is merely due
to non-content words and word positions, while the
mismatch between the canonical sentence and the
second user sentence is due to the discrepancy be-
tween the objects of the sentences.
3This step of the matching process is concerned only with
identifying the nodes that best match a user’s sentences. Words
indicating negation provide further (heuristic-based) informa-
tion about whether the user intended the positive version of a
node (e.g., “Mr Green murdered Mr Body”) or the negative ver-
sion (e.g., “Mr Green didn’t murder Mr Body”). This informa-
tion is used when calculating the user’s belief in a node.
Upon completion of this process, the match-
scores between a user sentence and its candidate
propositions are normalized, and the result used to
approximate Pra1 a0 Usra14a0 a8 , which is required for the
MML evaluation (Section 3.4).4
At first glance, this process may appear unwieldy,
as it compares each of the user’s sentences with each
proposition in the knowledge base. However, since
the complexity of this process is linear for each input
sentence, and our informal trials indicate that most
user arguments have less than 10 propositions, re-
sponse time will not be compromised even for large
BNs. Specifically, the response time on our 82-node
BN is perceived as instantaneous.
4.2 Connecting the propositions
The above process may match more than one node
to each of the user’s sentences. Hence, we first gen-
erate the a38a40a39 Usrs which are consistent with the user’s
argument. For instance, the sentence “Mr Green was
seen in the garden at 11” in Figure 1(a) matches both
[G was in garden at 11] and [N saw G in garden] (but
the former has a higher probability). If each of the
other input sentences in Figure 1(a) matches only
one proposition, two IGs which match the user’s in-
put will be generated – one for each of the above
alternatives.
Figure 3 illustrates the remainder of the
interpretation-generation process with respect
to one a38a40a39 Usr. This process consists of finding con-
nections within the BN between the nodes in a38a40a39 Usr;
eliminating superfluous BN nodes; and generating
sub-graphs of the resulting graph, such that all the
nodes in a38a27a39 Usr are connected (Figures 3(b), 3(c)
and 3(d), respectively). The connections between
the nodes in a38a40a39 Usr are found by applying a small
number of inferences from these nodes (spreading
outward in the BN). Currently, we apply two rounds
of inferences, as they enable the system to produce
“sensible” interpretations for arguments with small
inferential leaps. These are arguments whose nodes
are separated by at most four nodes in the system’s
BN, e.g., nodes b and c in Figure 3(d).5 If upon
completion of this process, some nodes are still
4We are currently implementing a more principled model for
sentence comparison which yields more accurate probabilities.
5Intuitively, one round of inferences would miss out on plau-
sible interpretations, while three rounds of inferences would
allow too many alternative interpretations. Our choice of two
rounds of inferences will be validated during trials with users.
(a) User’s Original Argument (b) Expand twice from the
    users nodes. Produces one
    or more node "clusters"
(c) Eliminate nodes that
    aren’t in a sortest path
(d) Candidates are all the subgraphs of (c) that connect the user’s nodes.
    (Only some of the 9 candidates illustrated)
a
b
c
a
d
e
b
f
c
i
j
m
n
p
s
tx
a
d
e
b
f
c
n
tx
a
d
e
b
f
c
m
n
p
s
tx
a
b
f
c
i
j
m p
s
b
f
c
i
j
n
tx
a
a
d
e
b
f
c
i
j
k
l
m
n
o
pq
s
tv x
Figure 3: Argument interpretation process
unconnected, the system rejects the current a38a27a39 Usr.
This process is currently implemented in the context
of a BN. However, any representation that supports
the generation of a connected argument involving a
given set of propositions would be appropriate.
5 Evaluation
Our evaluation consisted of an automated experi-
ment where the system interpreted noisy versions of
its own arguments. These arguments were generated
from different sub-nets of its domain BN, and they
were distorted at the BN level and at the NL level.
At the BN level, we changed the beliefs in the nodes,
and we inserted and deleted nodes and arcs. At the
NL level, we distorted the wording of the proposi-
tions in the resultant arguments. All these distor-
tions were performed for BNs of different sizes (3,
5, 7 and 9 arcs). Our measure of performance is the
edit-distance between the original BN used to gener-
ate an argument, and the BN produced as the inter-
pretation of this argument. That is, we counted the
number of differences between the source BN and
the interpretation. For instance, two BNs that differ
by one arc have an edit-distance of 2 (one addition
and one deletion), while a perfect match has an edit-
distance of 0.
Overall, our results were as follows. Our system
produced an interpretation in 86% of the 5400 tri-
als. In 75% of the 5400 cases, the generated inter-
pretations had an edit-distance of 3 or less from the
original BN, and in 50% of the cases, the interpre-
tations matched perfectly the original BN. Figure 4
depicts the frequency of edit distances for the differ-
ent BN sizes under all noise conditions. We plotted
edit-distances of 0, a30a34a30a34a30, 9 and a0a2a1 , plus the category
NI, which stands for “No Interpretation”. As shown
in Figure 4, the 0 edit-distance has the highest fre-
quency, and performance deteriorates as BN size in-
creases. Nonetheless, for BNs of 7 arcs or less, the
vast majority of the interpretations have an edit dis-
tance of 3 or less. Only for BNs of 9 arcs the num-
ber of NIs exceeds the number of perfect matches.
Figure 5 provides a different view of these results.
It displays edit-distance as a percentage of the pos-
sible changes for a BN of a particular size (the x-
axis is divided into buckets of 10%). For example, if
a selected interpretation differs from its source-BN
by the insertion of one arc, the percent-edit-distance
will be a14 a20 a20 a12 a33a26 a2a4a3
a33
, where a0 is the number of arcs
in the source-BN.6 The results shown in Figure 5 are
consistent with the previous results, with the vast
majority of the edits being in the [0,10)% bucket.
That is, most of the interpretations are within 10%
of their source-BNs.
We also tested each kind of noise separately,
6A BN of
a5 arcs has a maximum of a5 +1 nodes, yielding a
maximum of a6a7a5a9a8a11a10 edits to create the BN.
Figure 4: Frequency of edit-distances for all noise
conditions (5400 trials)
Figure 5: Frequency of edit-distances as percent of
maximum edits for all noise conditions (5400 trials)
maintaining the other kinds of noise at 0%. All the
distortions were between 0 and 40%. We performed
1560 trials for word noise, arc noise and node in-
sertions, and 2040 trials for belief noise, which war-
ranted additional observations. Figures 6, 7 and 8
show the recognition accuracy of our system (in
terms of average edit distance) as a function of arc
noise, belief noise and word noise percentages, re-
spectively. The performance for the different BN
sizes (in arcs) is also shown. Our system’s perfor-
mance for node insertions is similar to that obtained
for belief noise (the graph was not included owing
to space limitations). Our results show that the two
main factors that affect recognition performance are
BN size and word noise, while the average edit dis-
tance remains stable for belief and arc noise, as well
as for node insertions (the only exception occurs for
40% arc noise and size 9 BNs). Specifically, for arc
Figure 6: Effect of arc noise on performance (1560
trials)
Figure 7: Effect of belief noise on performance
(2040 trials)
noise, belief noise and node insertions, the average
edit distance was 3 or less for all noise percentages,
while for word noise, the average edit distance was
higher for several word-noise and BN-size combina-
tions. Further, performance deteriorated as the per-
centage of word noise increased.
The impact of word noise on performance rein-
forces our intention to implement a more principled
sentence comparison procedure (Section 4.1), with
the expectation that it will improve this aspect of our
system’s performance.
6 Conclusion
We have offered a mechanism which produces in-
terpretations of segmented NL arguments. Our ap-
plication of the MML principle enables our system
to handle noisy conditions in terms of wording, be-
liefs and argument structure, and allows us to isolate
Figure 8: Effect of word noise on performance
(1560 trials)
the effect of the underlying knowledge representa-
tion on the interpretation process. The results of our
automated evaluation were encouraging, with inter-
pretations that match perfectly or almost-perfectly
the source-BN being generated in 75% of the cases
under all noise conditions.
Our system has the following limitations:
a1 The interpretations generated by our system are
in terms of the propositions and relations known
by the system. However, the MML Principle it-
self addresses this limitation (at least partially),
as the length of a message is a quantitative mea-
sure for determining whether an interpretation is
likely to reflect the user’s intentions.
a1 Our mechanism does not infer an implicit goal
proposition, nor does it infer discourse relations
from free-form discourse. At present, this limita-
tion is circumvented by forcing the user to state
the goal proposition of the argument, and to in-
dicate clearly the antecedents and consequents of
the implications in his/her argument (this is done
by means of a web-based interface).
a1 Our argument-interpretation mechanism has
been tested on one knowledge representation
only – BNs.
a1 It is unclear whether arguments produced by au-
tomatically distorting our system’s arguments are
representative of arguments generated by people.
Further trials with real users will be conducted to
ascertain this fact.
a1 The system’s performance deteriorates for large
BNs (9 nodes). However, it is unclear whether
this will affect the use of the system in practice.
Despite these limitations, we are hopeful about
the potential of this approach to address the dis-
course interpretation challenge.
Acknowledgments
This research was supported in part by Australian
Research Council grant A49927212.

References
Sandra Carberry and Lynn Lambert. 1999. A process
model for recognizing communicative acts and model-
ing negotiation subdialogues. Computational Linguis-
tics, 25(1):1–53.
Eugene Charniak and Robert P. Goldman. 1993. A
Bayesian model of plan recognition. Artificial Intel-
ligence, 64(1):50–56.
Christopher Elsaesser. 1987. Explanation of probabilis-
tic inference for decision support systems. In Proceed-
ings of the AAAI-87 Workshop on Uncertainty in Artifi-
cial Intelligence, pages 394–403, Seattle, Washington.
Eric Horvitz and Tim Paek. 1999. A computational ar-
chitecture for conversation. In UM99 – Proceedings of
the Seventh International Conference on User Model-
ing, pages 201–210, Banff, Canada.
Dekang Lin. 1998. Dependency-based evaluation of
MINIPAR. In Workshop on the Evaluation of Parsing
Systems, Granada, Spain.
George Miller, Richard Beckwith, Christiane Fellbaum,
Derek Gross, and Katherine Miller. 1990. Introduc-
tion to WordNet: An on-line lexical database. Journal
of Lexicography, 3(4):235–244.
Judea Pearl. 1988. Probabilistic Reasoning in Intelligent
Systems. Morgan Kaufmann Publishers, San Mateo,
California.
Ian Thomas, Ingrid Zukerman, Jonathan Oliver, David W.
Albrecht, and Bhavani Raskutti. 1997. Lexical ac-
cess for speech understanding using Minimum Mes-
sage Length encoding. In UAI97 – Proceedings of the
Thirteenth Conference on Uncertainty in Artificial In-
telligence, pages 464–471. Morgan Kaufmann.
C.S. Wallace and D.M. Boulton. 1968. An informa-
tion measure for classification. The Computer Jour-
nal, 11:185–194.
Ingrid Zukerman. 2001. An integrated approach for gen-
erating arguments and rebuttals and understanding re-
joinders. In UM01 – Proceedings of the Eighth Inter-
national Conference on User Modeling, pages 84–94,
Sonthofen, Germany.
