A Minimum Message Length Approach for Argument Interpretation

3 Argument Interpretation Using MML

The MML criterion implements Occam's Razor, which may be stated as follows: "If you have two theories which both explain the observed facts, then you should use the simplest until more evidence comes along". According to the MML criterion, we imagine sending to a receiver a message that describes a user's NL argument, and we want to send the shortest possible message.[1] This message corresponds to the simplest interpretation of the user's argument. We postulate that this interpretation is likely to be a reasonable interpretation (although not necessarily the intended one).

[1] It is worth noting that the sender and the receiver are theoretical constructs of the MML theory, which are internal to the system and are not to be confused with the system and the user. The concept of a receiver which is different from the sender ensures that the message constructed by the sender to represent a user's argument does not make unwarranted assumptions.

A message that encodes an NL argument in terms of an interpretation is composed of two parts: (1) instructions for building the interpretation, and (2) instructions for rebuilding the original argument from this interpretation. These two parts balance the need for a concise interpretation (Part 1) against the need for an interpretation that closely matches the user's utterances (Part 2). For instance, the message for a concise interpretation that matches the original argument poorly will have a short first part but a long second part. In contrast, a more complex interpretation that matches the original argument better may yield a message that is shorter overall, with a longer first part but a shorter second part. Thus, the message describing the interpretation (Bayesian subnet) which best matches the user's intent will be among the messages with a short length (hopefully the shortest). Further, a message which encodes an NL argument in terms of a reasonable interpretation will be shorter than the message which transmits the words of the argument directly. This is because an interpretation which comprises the nodes and links of a Bayesian subnet (Part 1 of the message) is much more compact than a sequence of words which identifies these nodes and links. If this interpretation is reasonable (i.e., the user's argument is close to it), then the encoding of the discrepancies between the user's argument and the interpretation (Part 2 of the message) will not significantly increase the length of the message.

In order to find the interpretation with the shortest message length, we compare the message lengths of candidate interpretations.
These candidates are obtained as described in Section 4.

3.1 MML Encoding

The MML criterion is derived from Bayes' Theorem:

    Pr(D & H) = Pr(H) × Pr(D | H),

where D is the data and H is a hypothesis which explains the data. An optimal code for an event E with probability Pr(E) has message length ML(E) = −log2 Pr(E) (measured in bits). Hence, the message length for the data and a hypothesis is

    ML(D & H) = ML(H) + ML(D | H).

The hypothesis for which ML(D & H) is minimal is considered the best hypothesis.
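To make the two-part trade-off concrete, here is a minimal Python sketch. The probabilities are toy values, and the names message_length and best_hypothesis are ours rather than the paper's; the sketch only shows how a better-fitting but more complex interpretation can win on total message length.

    import math

    def message_length(prob):
        """Optimal code length, in bits, for an event with probability prob."""
        return -math.log2(prob)

    def best_hypothesis(candidates):
        """Select the hypothesis minimising ML(H) + ML(D|H).
        candidates: list of (label, Pr(H), Pr(D|H)) triples."""
        return min(candidates,
                   key=lambda c: message_length(c[1]) + message_length(c[2]))

    # A concise interpretation that fits the argument poorly (short part 1,
    # long part 2) versus a more complex one that fits well.
    candidates = [("concise, poor fit", 0.200, 0.001),
                  ("complex, good fit", 0.050, 0.300)]
    for label, p_h, p_dh in candidates:
        print(label, round(message_length(p_h) + message_length(p_dh), 1), "bits")
    print("selected:", best_hypothesis(candidates)[0])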
Now, in our context, UArg contains the user's argument, and SysInt is an interpretation generated by our system. Thus, we are looking for the SysInt which yields the shortest message length for

    ML(UArg & SysInt) = ML(SysInt) + ML(UArg | SysInt).

The first part of the message describes the interpretation, and the second part describes how to reconstruct the argument from the interpretation. To calculate the second part, we rely on an intermediate representation called an Implication Graph (IG). An Implication Graph is a graphical representation of an argument, which represents a basic "understanding" of the argument. It is composed of simple implications of the form

    Antecedent_1, Antecedent_2, ..., Antecedent_n → Consequent

(where → indicates that the antecedents imply the consequent, without distinguishing between causal and evidential implications). IG_Usr represents an understanding of the user's argument. It contains propositions from the underlying representation, but retains the structure of the user's argument. IG_SysInt represents an understanding of a candidate interpretation. It is obtained directly from SysInt, but it differs from SysInt in that all its arcs point towards a goal node and head-to-head evidence nodes are represented as antecedents of an implication, while SysInt is a general Bayesian subnet. Since both IG_Usr and IG_SysInt use domain propositions and have the same type of representation, they can be compared with relative ease.

Figure 1 illustrates the interpretation of a short argument presented by a user, and the calculation of the message length of this interpretation. The interpretation process obtains IG_Usr from the user's input, and SysInt from IG_Usr (left-hand side of Figure 1). If a sentence in UArg matches more than one domain proposition, the system generates more than one IG_Usr from UArg (Section 4.1). Each IG_Usr may in turn yield more than one SysInt. This happens when the underlying representation has several ways of connecting the nodes in IG_Usr (Section 4.2). The message length calculation goes from SysInt to UArg through the intermediate representations IG_SysInt and IG_Usr (right-hand side of Figure 1). This calculation takes advantage of the fact that there can be only one IG_Usr for each UArg-SysInt combination. Hence,

    Pr(UArg & SysInt) = Pr(SysInt) × Pr(IG_Usr | SysInt) × Pr(UArg | IG_Usr).

Thus, the length of the message required to transmit the user's argument and an interpretation is

    ML(UArg & SysInt) = ML(SysInt) + ML(IG_Usr | SysInt) + ML(UArg | IG_Usr).    (1)

That is, for each candidate interpretation, we calculate the length of the message which conveys: (1) SysInt; (2) how to obtain IG_Usr from SysInt;[2] and (3) how to obtain the sentences in UArg from the corresponding propositions in IG_Usr. The interpretation which yields the shortest message is selected (the message-length equations for each component are summarized in Table 1).

[2] We use IG_SysInt for this calculation, rather than SysInt. This does not affect the message length because the receiver can obtain IG_SysInt directly from SysInt.

Throughout the remainder of this section, we describe the calculation of the components of Equation 1, and illustrate this calculation using the simple example in Figure 2 (the message length calculation for this example is summarized in Table 2).

[Figure 2: a sample user argument (UArg) and the corresponding IG_Usr.]

3.2 Calculating ML(SysInt)

In order to transmit SysInt, we simply send its propositions and the relations between them. A standard MML assumption is that the sender and receiver share domain knowledge (recall that the receiver is not the user, but a construct of the MML theory). Hence, one way to send SysInt consists of transmitting how SysInt is extracted from the domain representation. This involves selecting its propositions from those in the domain, and then choosing which of the possible relations between these propositions are included in the interpretation. In the case of a BN, the propositions are represented as nodes, and the relations between propositions as arcs. Thus the message length for SysInt in the context of a BN is

    ML(SysInt) = log2 C(#nodes in domain BN, #nodes in SysInt) + log2 C(#arcs connecting the chosen nodes, #arcs in SysInt).    (2)

For the example in Figure 2, in order to transmit SysInt we must choose 3 nodes from the 82 nodes in the BN which represents our murder scenario (the Bayesian subnet in Figure 1(d) is a fragment of this BN). We must then select 2 arcs from the 3 arcs that connect these nodes. This yields a message of length

    log2 C(82,3) + log2 C(3,2) ≈ 16.4 + 1.6 = 18.0 bits.
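A short Python sketch of Equation 2 reproduces this arithmetic; ml_sysint is our name for the calculation, and math.comb supplies the binomial coefficient:

    import math

    def ml_sysint(domain_nodes, subnet_nodes, candidate_arcs, subnet_arcs):
        """Equation 2: bits to select the subnet's nodes from the domain BN,
        then its arcs from the arcs available between the chosen nodes."""
        return (math.log2(math.comb(domain_nodes, subnet_nodes)) +
                math.log2(math.comb(candidate_arcs, subnet_arcs)))

    # Figure 2 example: choose 3 of 82 nodes, then 2 of the 3 connecting arcs.
    print(round(ml_sysint(82, 3, 3, 2), 1), "bits")   # ~18.0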
3.3 Calculating ML(IG_Usr | SysInt)

The message which describes IG_Usr in terms of SysInt (or rather in terms of IG_SysInt) conveys how IG_Usr differs from the system's interpretation in two respects: (1) belief, and (2) argument structure.

For each proposition p in both IG_SysInt and IG_Usr, we transmit any discrepancy between the belief stated by the user and the system's belief in this proposition (propositions that appear in only one IG are handled by the message component which describes structural differences). The length of the message required to convey this information is

    ML(beliefs in IG_Usr | IG_SysInt) = Σ_{p ∈ IG_Usr ∩ IG_SysInt} −log2 Pr(Bel_Usr(p) | Bel_Sys(p)),    (3)

which expresses discrepancies in belief as the probability that the user holds a particular belief in a proposition, given the belief held by the system in this proposition.

Since our system interacts with people, we use linguistic categories of probability that people find acceptable (similar to those used in Elsaesser, 1987) instead of precise probabilities. Our 7 categories are: VeryUnlikely, Unlikely, ALittleUnlikely, EvenChance, ALittleLikely, Likely, VeryLikely. This yields the following approximation of Equation 3:

    ML(beliefs in IG_Usr | IG_SysInt) = Σ_{p ∈ IG_Usr ∩ IG_SysInt} −log2 Pr(Cat_Usr(p) | Cat_Sys(p)),    (4)

where Cat_Usr(p) and Cat_Sys(p) are the categories for the belief in node p in IG_Usr and IG_SysInt, respectively.

In the absence of statistical information about discrepancies between user beliefs and system beliefs, we have devised a probability function as follows:

    Pr(Cat_Usr(p) | Cat_Sys(p)) = c × 2^−|Cat_Usr(p) − Cat_Sys(p)|,    (5)

where c is a normalizing constant, the categories are numbered 1, ..., NumCt, and NumCt is the number of belief categories (= 7). This function yields a maximum probability when the user's belief in node p agrees with the system's belief. This probability is halved (adding 1 bit to the length of the message) for each increment or decrement in belief category. For instance, if both the user and the system believe that node p is Likely, Equation 5 yields a probability of c × 2^0 = c. In contrast, if the user believed that this node has only an EvenChance, then the probability of this belief given the system's belief would be c × 2^−2 = c/4.
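The following sketch implements Equations 4 and 5 for a single proposition. The text does not spell out how the normalizing constant c is computed; here we assume it normalizes Equation 5 over the 7 categories for a given system belief, and the function names are ours.

    import math

    CATEGORIES = ["VeryUnlikely", "Unlikely", "ALittleUnlikely", "EvenChance",
                  "ALittleLikely", "Likely", "VeryLikely"]

    def belief_prob(user_cat, sys_cat):
        """Equation 5: the probability halves for each category step between
        the user's and the system's belief; c is a normalizing constant
        (assumed here to normalize over all 7 categories)."""
        i = CATEGORIES.index(sys_cat)
        c = 1.0 / sum(2.0 ** -abs(j - i) for j in range(len(CATEGORIES)))
        return c * 2.0 ** -abs(CATEGORIES.index(user_cat) - i)

    def belief_ml(user_cat, sys_cat):
        """Contribution of one shared proposition to Equation 4, in bits."""
        return -math.log2(belief_prob(user_cat, sys_cat))

    # Agreement costs the fewest bits; each category step adds exactly 1 bit.
    print(round(belief_ml("Likely", "Likely"), 2))       # -log2(c)
    print(round(belief_ml("EvenChance", "Likely"), 2))   # -log2(c) + 2 bits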
To designate the nodes to be deleted, we select them from the nodes in SysInt</Paragraph> <Paragraph position="20"> their designations plus the direction of each arc.</Paragraph> <Paragraph position="21"> (This component also describes the arcs incident upon newly inserted nodes.) To designate an arc, we need a pair of nodes (head and tail). However, some nodes in a38a40a39 SysInt are already connected by arcs, which must be subtracted from the total number of arcs that can be inserted, yielding # poss arc ins a10 C# nodes(</Paragraph> <Paragraph position="23"> We also need to send 1 extra bit per inserted arc to convey its direction. Hence, the length of the message that conveys arc insertions is:</Paragraph> <Paragraph position="25"> For the example in Figure 2, a38a40a39 SysInt and a38a40a39 Usr differ in the node [B and G were enemies] and the arcs incident upon it. In order to transmit that this node should be deleted from a38a27a39 SysInt, we must select it from the 3 nodes comprising a38a27a39 SysInt. The length of the message that conveys this information is: a20a23a22a25a24 a26 a14 a28a32a20a23a22a25a24 a26 Ca13</Paragraph> <Paragraph position="27"> ing of the arcs incident upon the deleted node yields a38a40a39 Usr at no additional cost).</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.4 Calculating ML(UArga14IGUsr) </SectionTitle> <Paragraph position="0"> The user's argument is structurally equivalent to a38a40a39 Usr. Hence, in order to transmit UArg in terms of a38a40a39 Usr we only need to transmit how each statement in UArg differs from the canonical statement generated for the matching node in a38a40a39 Usr (Section 4.1). The length of the message which conveys this information is</Paragraph> <Paragraph position="2"> the score returned by the comparison function described in Section 4.1. For the example in Figure 2, the discrepancy between the canonical sentences &quot;Mr Body argued with Mr Green&quot; and &quot;Mr Green had a motive to murder Mr Body&quot; and the corresponding user sentences yields a message of length 33.6 bits + 32 bits respectively (=65.6 bits).</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Interpreting Arguments </SectionTitle> <Paragraph position="0"> Our system generates candidate interpretations for a user's argument by first postulating propositions that match the user's sentences, and then finding different ways to connect these propositions - each variant is a candidate interpretation.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Postulating propositions </SectionTitle> <Paragraph position="0"> We currently use a naive approach for postulating propositions. For each user sentence a0 Usr we generate candidate propositions as follows. For each node a0 in the domain, the system proposes one or more canonical sentences a0 a2 (produced by a simple English generator). This sentence is compared to Usr, yielding a match-score for the pair (</Paragraph> <Paragraph position="2"> When a match-score is above a threshold a52 a53 , we have found a candidate interpretation for a0 Usr.3 For example, the proposition [G was in garden at 11] in Figure 1(b) is a plausible interpretation of the input sentence &quot;Mr Green was seen in the garden at 11&quot; in Figure 1(a). Some sentences may have no propositions with match-scores above a52 a53 . 
3.4 Calculating ML(UArg | IG_Usr)

The user's argument is structurally equivalent to IG_Usr. Hence, in order to transmit UArg in terms of IG_Usr we only need to transmit how each statement in UArg differs from the canonical statement generated for the matching node in IG_Usr (Section 4.1). The length of the message which conveys this information is

    ML(UArg | IG_Usr) = Σ_{S ∈ UArg} −log2 Pr(S | p_S),

where Pr(S | p_S) is approximated by the score returned by the comparison function described in Section 4.1. For the example in Figure 2, the discrepancies between the canonical sentences "Mr Body argued with Mr Green" and "Mr Green had a motive to murder Mr Body" and the corresponding user sentences yield messages of length 33.6 bits and 32 bits respectively (= 65.6 bits).

4 Interpreting Arguments

Our system generates candidate interpretations for a user's argument by first postulating propositions that match the user's sentences, and then finding different ways to connect these propositions; each variant is a candidate interpretation.

4.1 Postulating propositions

We currently use a naive approach for postulating propositions. For each user sentence S_Usr we generate candidate propositions as follows. For each node p in the domain, the system proposes one or more canonical sentences S_p (produced by a simple English generator). Each such sentence is compared to S_Usr, yielding a match-score for the pair (S_p, S_Usr). When a match-score is above a threshold P_w, we have found a candidate interpretation for S_Usr.[3] For example, the proposition [G was in garden at 11] in Figure 1(b) is a plausible interpretation of the input sentence "Mr Green was seen in the garden at 11" in Figure 1(a). Some sentences may have no propositions with match-scores above P_w. This does not automatically invalidate the user's argument, as it may still be possible to interpret the argument as a whole even if a few sentences are not understood (Section 3.3).

The match-score for a user sentence S_Usr and a proposition p, a number in the [0,1] range, is scaled from a weighted sum of individual word-match scores that relate words in S_Usr with words in S_p. Inserted or deleted words are given a fixed penalty. The goodness of a word-match depends on the following factors: (1) level of synonymy: the percentage of synonyms the words have in common (according to WordNet; Miller et al., 1990); (2) position in the sentence (expressed as a fraction, e.g., "1/3 of the way through the sentence"); and (3) relation tags: SUBJ/OBJ tags as well as parts of speech such as NOUN, VERB, etc. (obtained using the MINIPAR parser; Lin, 1998). That is, the i-th word in a canonical sentence, w_i^p, best matches the j-th word in the user's sentence, w_j^Usr, if both words are exactly the same, they are in the same sentence position, and they have the same relation tag. The match-score between w_i^p and w_j^Usr is reduced if their level of synonymy is less than 100%, or if there are discrepancies in their relation tags or their sentence positions. For instance, consider the canonical sentence "Mr Green murdered Mr Body" and the user sentences "Mr Body was murdered by Mr Green" and "Mr Green murdered Ms Scarlet". The first user sentence has a higher score than the second one. This is because the mismatch between the canonical sentence and the first user sentence is merely due to non-content words and word positions, while the mismatch between the canonical sentence and the second user sentence is due to the discrepancy between the objects of the sentences.

[3] This step of the matching process is concerned only with identifying the nodes that best match a user's sentences. Words indicating negation provide further (heuristic-based) information about whether the user intended the positive version of a node (e.g., "Mr Green murdered Mr Body") or the negative version (e.g., "Mr Green didn't murder Mr Body"). This information is used when calculating the user's belief in a node.

Upon completion of this process, the match-scores between a user sentence and its candidate propositions are normalized, and the result is used to approximate Pr(S_Usr | p), which is required for the MML evaluation (Section 3.4).[4]

At first glance, this process may appear unwieldy, as it compares each of the user's sentences with each proposition in the knowledge base. However, since the complexity of this process is linear in the number of propositions for each input sentence, and our informal trials indicate that most user arguments have fewer than 10 propositions, response time is not compromised even for large BNs. Specifically, the response time on our 82-node BN is perceived as instantaneous.
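The sketch below mimics the sentence comparison at a very coarse grain. The tag penalty, position penalty, and insertion/deletion penalty values are illustrative guesses rather than the paper's weights, and real synonymy levels and relation tags would come from WordNet and MINIPAR rather than being passed in by hand.

    def word_match(w_can, w_usr, pos_can, pos_usr, synonymy, same_tag):
        """Score one word pair: perfect when identical words share position
        and relation tag; reduced for lower synonymy, tag or position drift."""
        score = 1.0 if w_can == w_usr else synonymy     # synonymy in [0,1]
        if not same_tag:
            score *= 0.5                                # illustrative penalty
        score *= 1.0 - abs(pos_can - pos_usr)           # position discrepancy
        return score

    def match_score(pairs, ins_or_del=0, word_penalty=0.1):
        """Combine word-match scores, with a fixed penalty per inserted or
        deleted word; the result stays in the [0,1] range."""
        if not pairs:
            return 0.0
        avg = sum(word_match(*p) for p in pairs) / len(pairs)
        return max(0.0, avg - word_penalty * ins_or_del)

    # "Mr Green murdered Mr Body" vs "Mr Green murdered Ms Scarlet": the
    # object mismatch (zero synonymy) drags the score down.
    pairs = [("green", "green", 0.2, 0.2, 1.0, True),
             ("murdered", "murdered", 0.5, 0.5, 1.0, True),
             ("body", "scarlet", 0.9, 0.9, 0.0, True)]
    print(round(match_score(pairs), 2))   # 0.67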
4.2 Connecting the propositions

The above process may match more than one node to each of the user's sentences. Hence, we first generate the IG_Usrs which are consistent with the user's argument. For instance, the sentence "Mr Green was seen in the garden at 11" in Figure 1(a) matches both [G was in garden at 11] and [N saw G in garden] (but the former has a higher probability). If each of the other input sentences in Figure 1(a) matches only one proposition, two IGs which match the user's input will be generated, one for each of the above alternatives.

Figure 3 illustrates the remainder of the interpretation-generation process with respect to one IG_Usr. This process consists of finding connections within the BN between the nodes in IG_Usr; eliminating superfluous BN nodes; and generating sub-graphs of the resulting graph, such that all the nodes in IG_Usr are connected (Figures 3(b), 3(c) and 3(d), respectively). The connections between the nodes in IG_Usr are found by applying a small number of inferences from these nodes (spreading outward in the BN). Currently, we apply two rounds of inferences, as they enable the system to produce "sensible" interpretations for arguments with small inferential leaps. These are arguments whose nodes are separated by at most four nodes in the system's BN, e.g., nodes b and c in Figure 3(d).[5] If, upon completion of this process, some nodes are still unconnected, the system rejects the current IG_Usr. This process is currently implemented in the context of a BN. However, any representation that supports the generation of a connected argument involving a given set of propositions would be appropriate.

[Figure 3: (a) the user's original argument; (b) expand twice from the user's nodes, producing one or more node "clusters"; (c) eliminate nodes that are not on a shortest path; (d) the candidates are all the subgraphs of (c) that connect the user's nodes.]

[4] We are currently implementing a more principled model for sentence comparison which yields more accurate probabilities.

[5] Intuitively, one round of inferences would miss out on plausible interpretations, while three rounds of inferences would allow too many alternative interpretations. Our choice of two rounds of inferences will be validated during trials with users.
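A minimal sketch of the expansion step: breadth-first spreading for two rounds from the user's nodes over an undirected view of the BN. The pruning of nodes off the shortest paths and the enumeration of connecting subgraphs (Figures 3(c) and 3(d)) are omitted, and the names are ours.

    from collections import deque

    def expand(adj, seeds, rounds=2):
        """Collect the BN nodes reachable from the user's nodes within the
        given number of inference rounds (spreading outward)."""
        depth = {s: 0 for s in seeds}
        frontier = deque(seeds)
        while frontier:
            node = frontier.popleft()
            if depth[node] == rounds:
                continue
            for nbr in adj.get(node, ()):
                if nbr not in depth:
                    depth[nbr] = depth[node] + 1
                    frontier.append(nbr)
        return set(depth)

    # Toy BN: the user's nodes b and c are two intermediate nodes apart
    # (x, y), so two rounds of expansion suffice to connect them.
    adj = {"b": ["x"], "x": ["b", "y"], "y": ["x", "c"], "c": ["y"]}
    print(sorted(expand(adj, ["b", "c"])))   # ['b', 'c', 'x', 'y']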
5 Evaluation

Our evaluation consisted of an automated experiment in which the system interpreted noisy versions of its own arguments. These arguments were generated from different sub-nets of its domain BN, and they were distorted at the BN level and at the NL level. At the BN level, we changed the beliefs in the nodes, and we inserted and deleted nodes and arcs. At the NL level, we distorted the wording of the propositions in the resultant arguments. All these distortions were performed for BNs of different sizes (3, 5, 7 and 9 arcs). Our measure of performance is the edit-distance between the original BN used to generate an argument and the BN produced as the interpretation of this argument. That is, we count the number of differences between the source BN and the interpretation. For instance, two BNs that differ by one arc have an edit-distance of 2 (one addition and one deletion), while a perfect match has an edit-distance of 0.

Overall, our results were as follows. Our system produced an interpretation in 86% of the 5400 trials. In 75% of the 5400 cases, the generated interpretations had an edit-distance of 3 or less from the original BN, and in 50% of the cases, the interpretations matched the original BN perfectly. Figure 4 depicts the frequency of edit-distances for the different BN sizes under all noise conditions. We plotted edit-distances of 0, ..., 9 and 10 or more, plus the category NI, which stands for "No Interpretation". As shown in Figure 4, the 0 edit-distance has the highest frequency, and performance deteriorates as BN size increases. Nonetheless, for BNs of 7 arcs or less, the vast majority of the interpretations have an edit-distance of 3 or less. Only for BNs of 9 arcs does the number of NIs exceed the number of perfect matches.

Figure 5 provides a different view of these results. It displays edit-distance as a percentage of the maximum number of possible edits for a BN of a particular size (the x-axis is divided into buckets of 10%). For example, if a selected interpretation differs from its source BN only by the insertion of one arc, the percent-edit-distance is 100 × 1 divided by the maximum number of edits for a source BN with that number of arcs. The results shown in Figure 5 are consistent with the previous results, with the vast majority of the edits being in the [0,10)% bucket. That is, most of the interpretations are within 10% of their source BNs.

[Figure 5: percent-edit-distance relative to the maximum number of edits, for all noise conditions (5400 trials).]

We also tested each kind of noise separately, maintaining the other kinds of noise at 0%. All the distortions were between 0 and 40%. We performed 1560 trials for word noise, arc noise and node insertions, and 2040 trials for belief noise, which warranted additional observations. Figures 6, 7 and 8 show the recognition accuracy of our system (in terms of average edit-distance) as a function of arc-noise, belief-noise and word-noise percentages, respectively. The performance for the different BN sizes (in arcs) is also shown. Our system's performance for node insertions is similar to that obtained for belief noise (the graph was not included owing to space limitations). Our results show that the two main factors that affect recognition performance are BN size and word noise, while the average edit-distance remains stable for belief noise and arc noise, as well as for node insertions (the only exception occurs for 40% arc noise and size-9 BNs). Specifically, for arc noise, belief noise and node insertions, the average edit-distance was 3 or less for all noise percentages, while for word noise the average edit-distance was higher for several word-noise and BN-size combinations. Further, performance deteriorated as the percentage of word noise increased.

The impact of word noise on performance reinforces our intention to implement a more principled sentence-comparison procedure (Section 4.1), with the expectation that it will improve this aspect of our system's performance.
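Reading the edit-distance measure off the example given at the start of this section (a single differing arc counts as one deletion plus one addition), it can be sketched as the size of the symmetric difference between the node and arc sets; edit_distance is our name for it, not the paper's.

    def edit_distance(bn_a, bn_b):
        """Differences between two BNs, each given as (node set, arc set):
        every node or arc present in only one of them counts as one edit."""
        nodes_a, arcs_a = bn_a
        nodes_b, arcs_b = bn_b
        return len(nodes_a ^ nodes_b) + len(arcs_a ^ arcs_b)

    # Two subnets that differ by one arc: edit-distance 2.
    src = ({"a", "b", "c"}, {("a", "b"), ("b", "c")})
    interp = ({"a", "b", "c"}, {("a", "b"), ("a", "c")})
    print(edit_distance(src, interp))   # 2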