<?xml version="1.0" standalone="yes"?> <Paper uid="P89-1031"> <Title>EVALUATING DISCOURSE PROCESSING ALGORITHMS</Title> <Section position="4" start_page="251" end_page="254" type="metho"> <SectionTitle> 2 Quantitative Evaluation-Black Box 2.1 The Algorithms </SectionTitle> <Paragraph position="0"> When embarking on such a comparison, it would be convenient to assume that the inputs to the algorithms are identical and compare their outputs. Unfortunately, since researchers do not even agree on which phenomena can be explained syntactically and which semantically, the boundaries between modules are rarely the same in NL systems. In this case the BFP centering algorithm and the Hobbs algorithm both make ASSUMPTIONS about other system components. These are, in some sense, a further specification of the operation of the algorithms that must be made in order to hand-simulate them.</Paragraph> <Paragraph position="1"> There are two major sets of assumptions, based on discourse segmentation and syntactic representation.</Paragraph> <Paragraph position="2"> We attempt to make these explicit for each algorithm and pinpoint where the algorithms might behave differently were these assumptions not well-founded.</Paragraph> <Paragraph position="3"> In addition, there may be a number of UNDERSPECIFICATIONS in the descriptions of the algorithms. These often arise because theories that attempt to categorize naturally occurring data, and algorithms based on them, will always be prey to previously unencountered examples. For example, since the BFP salience hierarchy for discourse entities is based on grammatical relation, an implicit assumption is that an utterance only has one subject. However, the novel Wheels has many examples of reported dialogue such as She continued, unperturbed, &quot;Mr. Vale quotes the Bible about air pollution.&quot; One might wonder whether the subject is She or Mr. Vale. In some cases, the algorithm might need to be further specified in order to be able to process any of the data, whereas in others an underspecification may just highlight where the algorithm needs to be modified (see section 3.2). In general we count underspecifications as failures.</Paragraph> <Paragraph position="4"> Finally, it may not be clear what the DEFINITION OF SUCCESS is. In particular, it is not clear what to do in those cases where an algorithm produces multiple or partial interpretations. In this situation a system might flag the utterance as ambiguous and draw in support from other discourse components. This arises in the present analysis for two reasons: (1) the constraints given by [GJW86] do not always allow one to choose a preferred interpretation, and (2) the BFP algorithm proposes equally ranked interpretations in parallel. This doesn't happen with the Hobbs algorithm, because it proposes interpretations sequentially, one at a time. We chose to count as a failure those situations in which the BFP algorithm only reduces the number of possible interpretations, but the Hobbs algorithm stops with a correct interpretation. This ignores the fact that Hobbs may have rejected a number of interpretations before stopping.</Paragraph> <Paragraph position="5"> We also have not needed to make a decision on how to score an algorithm that only finds one interpretation for an utterance that humans find ambiguous.</Paragraph>
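To make the scoring scheme just described concrete, the following is a minimal sketch of the black-box scoring loop, in Python. The record format (a gold antecedent per pronoun, a set of proposed antecedents per algorithm) is an illustrative assumption of ours, not part of either algorithm.

```python
# A minimal sketch of the scoring scheme described above; the data
# representation is an illustrative assumption.

def score(examples, algorithm):
    """examples: list of (pronoun_context, gold_antecedent) pairs.
    algorithm: callable returning the set of antecedents it proposes
    for a pronoun. An ambiguous output (several equally ranked
    survivors) counts as a failure, as does an underspecification
    (no output at all)."""
    successes = 0
    for context, gold in examples:
        proposals = algorithm(context)
        if len(proposals) == 1 and gold in proposals:
            successes += 1              # unique and correct
        # multiple, empty, or wrong proposals all count as failures
    return successes
```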
<Paragraph position="6"> The centering algorithm as defined by Brennan, Friedman and Pollard (the BFP algorithm) is derived from a set of rules and constraints put forth by Grosz, Joshi and Weinstein [GJW83, GJW86]. We shall not reproduce this algorithm here (see [BFP87]). There are two main structures in the centering algorithm: the CB, the BACKWARD LOOKING CENTER, which is what the discourse is 'about', and an ordered list, CF, of FORWARD LOOKING CENTERS, which are the discourse entities available to the next utterance for pronominalization. The centering framework predicts that in a locally coherent stretch of dialogue, speakers will prefer to CONTINUE talking about the same discourse entity, that the CB will be the highest ranked entity of the previous utterance's forward centers that is realized in the current utterance, and that if anything is pronominalized the CB must be.</Paragraph> <Paragraph position="7"> In the centering framework, the order of the forward-centers list is intended to reflect the salience of discourse entities. The BFP algorithm orders this list by the grammatical relation of the complements of the main verb, i.e. first the subject, then object, then indirect object, then other subcategorized-for complements, then noun phrases found in adjunct clauses. This captures the intuition that subjects are more salient than other discourse entities.</Paragraph>
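As an illustration, this ordering of the forward-centers list amounts to a sort over grammatical roles. The role labels and the mention representation below are our own assumptions about how a host system might mark realizations; the ranking itself follows the ordering just described.

```python
# Sketch of the BFP forward-centers ordering: rank the discourse
# entities of an utterance by the grammatical relation through which
# they were realized. Role labels are illustrative assumptions.

RANK = {"subject": 0, "object": 1, "indirect_object": 2,
        "other_complement": 3, "adjunct": 4}

def forward_centers(mentions):
    """mentions: list of (entity, grammatical_role) pairs for one
    utterance. Returns the Cf list, most salient entity first."""
    return [entity for entity, role in
            sorted(mentions, key=lambda m: RANK[m[1]])]

# Example: subjects outrank everything else.
cf = forward_centers([("the pan", "adjunct"), ("the pump", "subject"),
                      ("the water", "object")])
# cf == ["the pump", "the water", "the pan"]
```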
<Paragraph position="8"> The BFP algorithm added linguistic constraints on CONTRA-INDEXING to the centering framework.</Paragraph> <Paragraph position="9"> These constraints are exemplified by the fact that, in the sentence he likes him, the entity cospecified by he cannot be the same as that cospecified by him. We say that he and him are CONTRA-INDEXED. The BFP algorithm depends on semantic processing to precompute these constraints, since they are derived from the syntactic structure and depend on some notion of c-command [Rei76]. The other assumption that is dependent on syntax is that the representations of discourse entities can be marked with the grammatical function through which they were realized, e.g. subject.</Paragraph> <Paragraph position="10"> The BFP algorithm assumes that some other mechanism can structure both written texts and task-oriented dialogues into hierarchical segments. The present concern is not with whether there might be a grammar of discourse that determines this structure, or whether it is derived from the cues that cooperative speakers give hearers to aid in processing. Since centering is a local phenomenon and is intended to operate within a segment, we needed to deduce a segmental structure in order to analyse the data. Speakers' intentions, task structure, cue words like O.K. now..., intonational properties of utterances, coherence relations, the scoping of modal operators, and mechanisms for shifting control between discourse participants have all been proposed as ways of determining discourse segmentation [Gro77, GS86, Rei85, PH87, HL87, Hob78, Hob85, Rob88, WS88].</Paragraph> <Paragraph position="11"> Here, we use a combination of orthography, anaphora distribution, cue words and task structure. The rules are: * In published texts, a paragraph is a new segment unless the first sentence has a pronoun in subject position or a pronoun where none of the preceding sentence-internal noun phrases match its syntactic features.</Paragraph> <Paragraph position="12"> * In the task-oriented dialogues, the action PICK-UP marks task boundaries and hence segment boundaries. Cue words like next, then, and now also mark segment boundaries. These will usually co-occur, but either one is sufficient for marking a segment boundary.</Paragraph> <Paragraph position="13"> BFP never state that cospecifiers for pronouns within the same segment are preferred over those in previous segments, but this is an implicit assumption, since this line of research is derived from Sidner's work on local focusing. Segment-initial utterances are therefore the only situation where the BFP algorithm will prefer a within-sentence noun phrase as the cospecifier of a pronoun.</Paragraph> <Paragraph position="14"> The Hobbs algorithm is based on searching for a pronoun's co-specifier in the syntactic parse tree of input sentences [Hob76b]. We reproduce this algorithm in full in the appendix along with an example. The Hobbs algorithm operates on one sentence at a time, but the structure of previous sentences in the discourse is available. It is stated in terms of searches on parse trees, conducted in a left-to-right, breadth-first manner. When looking for a pronoun's antecedent within a sentence, it will go sequentially further and further up the tree to the left of the pronoun, and, that failing, will look in the previous sentence. Hobbs does not assume a segmentation of discourse structure in this algorithm; the algorithm will go back arbitrarily far in the text to find an antecedent. In more recent work, Hobbs uses the notion of COHERENCE RELATIONS to structure the discourse [HM87].</Paragraph> <Paragraph position="15"> The order in which Hobbs' algorithm traverses the parse tree is the closest thing in his framework to predictions about which discourse entities are salient. In the main it prefers co-specifiers for pronouns that are within the same sentence, and also ones that are closer to the pronoun in the sentence. This amounts to a claim that different discourse entities are salient, depending on the position of a pronoun in a sentence. When seeking an intersentential cospecification, the Hobbs algorithm searches the parse tree of the previous utterance breadth-first, from left to right. This predicts that entities realized in subject position are more salient, since even if an adjunct clause linearly precedes the main subject, any noun phrases within it will be deeper in the parse tree. This also means that objects and indirect objects will be among the first possible antecedents found, and in general that the depth of syntactic embedding is an important determiner of discourse prominence.</Paragraph> <Paragraph position="16"> Turning to the assumptions about syntax, we note that Hobbs assumes that one can produce the correct syntactic structure for an utterance, with all adjunct phrases attached at the proper point of the parse tree. In addition, in order to obey linguistic constraints on coreference, the algorithm depends on the existence of an N parse tree node, which denotes a noun phrase without its determiner (see the example in the Appendix). The Hobbs algorithm procedurally encodes contra-indexing constraints by skipping over NP nodes whose N node dominates the part of the parse tree in which the pronoun is found, which means that he cannot guarantee that two contra-indexed pronouns will not choose the same NP as a co-specifier.</Paragraph>
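The intersentential portion of this search can be sketched as a breadth-first, left-to-right enumeration of NP nodes. The tree encoding below is an illustrative assumption, and the sketch covers only this one step, not the full algorithm reproduced in the appendix.

```python
# Sketch of the breadth-first, left-to-right search that gives the
# Hobbs algorithm its salience predictions. Trees are (label, children)
# tuples; this shows only the intersentential step, in which the
# previous sentence's tree is searched and the first NP found is
# proposed as the antecedent.
from collections import deque

def bfs_nps(tree):
    """Yield NP nodes of a parse tree breadth-first, left to right,
    so shallower NPs (e.g. the subject) are proposed before more
    deeply embedded ones."""
    queue = deque([tree])
    while queue:
        label, children = queue.popleft()
        if label == "NP":
            yield (label, children)
        queue.extend(children)

# "The pump is ready": the subject NP surfaces first.
prev = ("S", [("NP", [("N", [])]),            # the pump
              ("VP", [("V", []), ("AP", [])])])
first_np = next(bfs_nps(prev))
```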
<Paragraph position="17"> Hobbs also assumes that his algorithm can somehow collect discourse entities mentioned alone into sets as co-specifiers of plural anaphors. Hobbs discusses at length other assumptions that he makes about the capabilities of an interpretive process that operates before the algorithm [Hob76b]. This includes such things as being able to recover syntactically recoverable omitted text, such as elided verb phrases, and the identities of the speakers and hearers in a dialogue.</Paragraph> <Paragraph position="18"> A major component of any discourse algorithm is the prediction of which entities are salient, even though all the factors that contribute to the salience of a discourse entity have not been identified [Pri81, Pri85, BF83, HTD86]. So an obvious question is when the two algorithms actually make different predictions.</Paragraph> <Paragraph position="19"> The main difference is that the choice of a co-specifier for a pronoun in the Hobbs algorithm depends in part on the position of that pronoun in the sentence. In the centering framework, no matter what criteria one uses to order the forward-centers list, pronouns take the most salient entities as antecedents, irrespective of that pronoun's position. Hobbs' ordering of entities from a previous utterance varies from BFP in that possessors come before case-marked objects and indirect objects; there may be some other differences as well, but none of them were relevant to the analysis that follows.</Paragraph> <Paragraph position="20"> The effects of some of the assumptions are measurable, and we will attempt to specify exactly what these effects are; however, some are not, e.g. we cannot measure the effect of Hobbs' syntax assumption, since it is difficult to say how likely one is to get the wrong parse. We adopt the set collection assumption for both algorithms, as well as the ability to recover the identity of speakers and hearers in dialogue.</Paragraph> <Section position="1" start_page="253" end_page="254" type="sub_section"> <SectionTitle> 2.2 Quantitative Results of the Algorithms </SectionTitle> <Paragraph position="0"> The texts on which the algorithms are analysed are the first chapter of Arthur Hailey's novel Wheels, and the July 7, 1975 edition of Newsweek. The sentences in Wheels are short and simple, with long sequences consisting of reported conversation, so it is similar to a conversational text. The articles from Newsweek are typical of journalistic writing. For each text, the first 100 occurrences of singular and plural third-person pronouns were used to test the performance of the algorithms. The task dialogues contain a total of 81 uses of it and no other pronouns except for I and you. In the figures below, note that possessives like his are counted along with he, and that accusatives like him and her are counted as he and she. (Hobbs reports the examples his algorithm fails on in [Hob76b, Hob76a]; the numbers reported here vary slightly from those, probably due to a discrepancy in exactly what the data set consisted of.)</Paragraph> <Paragraph position="1"> [Figure 1: Performance of the two algorithms on Wheels, Newsweek and the Task Dialogues] We performed three analyses on the quantitative results. A comparison of the two algorithms on each data set individually, and an overall analysis on the three data sets combined, revealed no significant differences in the performance of the two algorithms (χ² = 3.25, not significant). In addition, for each algorithm alone, we tested whether there were significant differences in performance for different textual types. Both of the algorithms performed significantly worse on the task dialogues (χ² = 22.05 for Hobbs, χ² = 21.55 for BFP, p < 0.05).</Paragraph>
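For reference, the χ² statistics quoted here are computed from 2×2 success/failure contingency tables. The sketch below shows the computation with made-up counts; the actual counts appear in Figure 1.

```python
# Minimal sketch of the 2x2 chi-squared test behind the figures quoted
# above; the counts used here are illustrative, not the actual data.

def chi_squared(table):
    """table: [[a, b], [c, d]] of success/failure counts for two
    conditions. Returns the Pearson chi-squared statistic (df = 1)."""
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    n = sum(row)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

# e.g. Hobbs vs. BFP successes/failures on one text (made-up counts);
# compare the statistic against 3.84, the df = 1 critical value at p = .05.
stat = chi_squared([[88, 12], [90, 10]])
```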
<Paragraph position="2"> We might wonder with what confidence we should view these numbers. A significant factor that must be considered is the contribution of FALSE POSITIVES and ERROR CHAINING. A FALSE POSITIVE is when an algorithm gets the right answer for the wrong reason. A very simple example of this phenomenon is illustrated by this sequence from one of the task dialogues. Exp1: Now put IT in the pan of water.</Paragraph> <Paragraph position="3"> The first it in Exp1 refers to the pump. The Hobbs algorithm gets the right antecedent for it in Exp3, which is the little handle, but then fails on it in Exp4, whereas the BFP algorithm has the pump centered at Exp1 and continues to select that as the antecedent for it throughout the text. This means BFP gets the wrong co-specifier in Exp3, but this error allows it to get the correct co-specifier in Exp4.</Paragraph> <Paragraph position="4"> Another type of false positive example is &quot;Everybody and HIS brother suddenly wants to be the President's friend,&quot; said one aide. Hobbs gets this correct as long as one is willing to accept that Everybody is really the antecedent of his. It seems to me that this might be an idiomatic use.</Paragraph> <Paragraph position="5"> ERROR CHAINING refers to the fact that once an algorithm makes an error, other errors can result. Consider: Cli1: Sorry, no luck.</Paragraph> <Paragraph position="6"> Exp1: I bet IT's the stupid red thing.</Paragraph> <Paragraph position="7"> Exp2: Take IT out.</Paragraph> <Paragraph position="8"> Cli2: Ok. IT is stuck.</Paragraph> <Paragraph position="9"> In this example, once an algorithm fails at Exp1 it will fail on Exp2 and Cli2 as well, since the choices of a co-specifier in the following utterances are dependent on the choice in Exp1.</Paragraph> <Paragraph position="10"> It isn't possible to measure the effect of false positives, since in some sense they are subjective judgements. However, one can and should measure the effects of error chaining: reporting numbers corrected for error chaining would be misleading, but if the error that produced the chain can be corrected, then the algorithm might show a significant improvement. In this analysis, error chains contributed 22 failures to Hobbs' algorithm and 19 failures to BFP.</Paragraph>
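The error-chain accounting can be made mechanical if one records, for each failure, the earlier resolution it depended on. The dependency mapping below is a hypothetical representation of that record, not something either algorithm maintains.

```python
# Sketch of error-chain accounting: a failure whose co-specifier choice
# depends on an utterance where the algorithm already failed is counted
# as a chained error, not an independent one.

def split_failures(failures, depends_on):
    """failures: set of utterance ids on which the algorithm failed.
    depends_on: utterance id -> id of the earlier resolution it relies
    on (or None). Returns (independent, chained) failure counts."""
    chained = {u for u in failures
               if depends_on.get(u) in failures}
    return len(failures) - len(chained), len(chained)

# In the Cli1/Exp1/Exp2/Cli2 sequence above, a failure at Exp1
# propagates to Exp2 and Cli2:
ind, chain = split_failures({"Exp1", "Exp2", "Cli2"},
                            {"Exp2": "Exp1", "Cli2": "Exp1"})
# ind == 1, chain == 2
```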
</Section> </Section> <Section position="5" start_page="254" end_page="257" type="metho"> <SectionTitle> 3 Qualitative Evaluation-Glass Box </SectionTitle> <Paragraph position="0"> The numbers presented in the previous section are intuitively unsatisfying. They tell us nothing about what makes the algorithms more or less general, or how they might be improved. In addition, given the assumptions that we needed to make in order to produce them, one might wonder to what extent the data is a result of these assumptions. Figure 1 also fails to indicate whether the two algorithms missed the same examples or are covering a different set of phenomena, i.e. what the relative distribution of the successes and failures is. But having done the hand-simulation in order to produce such numbers, all of this information is available. In this section we will first discuss the relative importance of various factors that go into producing the numbers above, then discuss whether the algorithms can be modified, since the flexibility of a framework in allowing one to make modifications is an important dimension of evaluation.</Paragraph> <Section position="1" start_page="254" end_page="256" type="sub_section"> <SectionTitle> 3.1 Distributions </SectionTitle> <Paragraph position="0"> Figures 2, 3 and 4 show, for each pronominal category, the distribution of successes and failures for both algorithms.</Paragraph> <Paragraph position="1"> Since the main purpose of evaluation must be to improve the theory that we are evaluating, the most interesting cases are the ones on which the algorithms' performance varies and those that neither algorithm gets correct. We discuss these below. In the Wheels data, 4 examples rest on the assumption that the identities of speakers and hearers are recoverable. For example, in The GM president smiled. &quot;Except Henry will be damned forceful and the papers won't print all HIS language.&quot;, getting the his correct here depends on knowing that it is the GM president speaking. Only 4 examples rest on being able to produce collections of discourse entities, and 2 of these occurred with an explicit instruction to the hearer to produce such a collection by using the phrase them both.</Paragraph> <Paragraph position="2"> There are 21 cases that Hobbs gets that BFP doesn't, and of these a few classes stand out. In every case the relevant factor is Hobbs' preference for intrasentential co-specifiers.</Paragraph> <Paragraph position="3"> One class (n = 3) is exemplified by Put the little black ring into the large blue CAP with the hole in IT. All three involved using the preposition with in a descriptive adjunct on a noun phrase. It may be that with-adjuncts are common in visual descriptions, since they were only found in our data in the task dialogues, and a quick inspection of Grosz's task-oriented dialogues revealed some as well [Deu74]. Another class (n = 7) are possessives. In some cases the possessive co-specified with the subject of the sentence, e.g. The SENATE took time from ITS paralyzing New Hampshire election debate to vote agreement, and in others it was within a relative clause and co-specified with the subject of that clause, e.g. The auto industry should be able to produce a totally safe, defect-free CAR that doesn't pollute ITS environment.</Paragraph> <Paragraph position="4"> Other cases seem to be syntactically marked subject matching with constructions that link two S clauses (n = 8). These are uses of more-than, as in but Chamberlain grossed about $8.3 million more than HE could have made by selling on the home front.</Paragraph> <Paragraph position="5"> There are also S-if-S cases, as in Mondale said: &quot;I think THE MAFIA would be broke if IT conducted all its business that way.&quot; We also have subject matching in AS-AS examples, as in ... and the resulting EXPOSURE to daylight has become as uncomfortable as IT was unaccustomed, as well as in sentential complements, such as But another liberal, Minnesota's Walter MONDALE, said HE had found a lot of incompetence in the agency's operations.
The fact that quite a few of these are also marked with But may be significant.</Paragraph> <Paragraph position="6"> In terms of the possible effects that we noted earlier, the DEFINITION OF SUCCESS (see section 2.1) favors Hobbs (n = 2). Consider: K: Next take the red piece that is the smallest and insert it into the hole in the side of the large plastic tube. IT goes in the hole nearest the end with the engravings on IT.</Paragraph> <Paragraph position="7"> The Hobbs algorithm will correctly choose the end as the antecedent for the second it. The BFP algorithm, on the other hand, will get two interpretations, one in which the second it co-specifies the red piece and one in which it co-specifies the end. They are both CONTINUING interpretations, since the first it co-specifies the CB, but the constraints don't make a choice.</Paragraph> <Paragraph position="8"> All of the examples on which BFP succeeds and Hobbs fails have to do with extended discussion of one discourse entity. For instance: Exp1: Now take the blue cap with the two prongs sticking out (CB = blue cap) Exp2: and fit the little piece of pink plastic on IT. Ok? (CB = blue cap) Cli1: ok.</Paragraph> <Paragraph position="9"> Exp3: Insert the rubber ring into that blue cap.</Paragraph> <Paragraph position="10"> (CB = blue cap) Exp4: Now screw IT onto the cylinder.</Paragraph> <Paragraph position="11"> On this example, Hobbs fails by choosing the co-specifier of it in Exp4 to be the rubber ring, even though the whole segment has been about the blue cap.</Paragraph> <Paragraph position="12"> Another example, from the novel Wheels, is given below. On this one Hobbs gets the first use of he but then misses the next four, as a result of missing the second one by choosing a housekeeper as the co-specifier for HIS.</Paragraph> <Paragraph position="13"> ...An executive vice-president of Ford was preparing to leave for Detroit Metropolitan Airport. HE had already breakfasted, alone. A housekeeper had brought a tray to HIS desk in the softly lighted study where, since 5 a.m., HE had been alternately reading memoranda (mostly on special blue stationery which Ford vice-presidents used in implementing policy) and dictating crisp instructions into a recording machine. HE had scarcely looked up, either as the mail arrived, or while eating, as HE accomplished in an hour what would have taken...</Paragraph> <Paragraph position="14"> Since an executive vice-president is centered in the first sentence, and continued in each following sentence, the BFP algorithm will correctly choose the co-specifier.</Paragraph> <Paragraph position="15"> Among the examples that neither algorithm gets correctly are 20 examples from the task dialogues of it referring to the global focus, the pump. In 15 cases, these shifts to global focus are marked syntactically with a cue word such as Now; they are not marked in the other 5 cases. Presumably they are felicitous since the pump is visually salient. Besides the global focus cases, pronominal references to entities that were not linguistically introduced are rare. The only other example is an implicit reference to 'the problem' of the pump not working: Cli1: Sorry, no luck.</Paragraph> <Paragraph position="16"> Exp1: I bet IT's the stupid red thing.</Paragraph> <Paragraph position="17"> We have only two examples of sentential or VP anaphora altogether, such as &quot;Madam Chairwoman,&quot; said Colby at last, &quot;I am trying to run a secret intelligence service.&quot; IT was a forlorn hope. Neither the Hobbs algorithm nor BFP attempts to cover these examples.
Three of the examples are uses of it that seem to be lexicalized with certain verbs, e.g. They hit IT off real well. One can imagine these being treated as phrasal lexical items, and therefore not handled by an anaphoric processing component [AS89].</Paragraph> <Paragraph position="18"> Most of the interchanges in the task dialogues consist of the client responding to commands with cues such as O.K. or Ready to let the expert know when they have completed a task. When both parties contribute discourse entities to the common ground, both algorithms may fail (n = 4).</Paragraph> <Paragraph position="19"> Consider: Exp1: Now we have a little red piece left Exp2: and I don't know what to do with IT.</Paragraph> <Paragraph position="20"> Cli1: Well, there is a hole in the green plunger inside the cylinder.</Paragraph> <Paragraph position="21"> Exp3: I don't think IT goes in THERE.</Paragraph> <Paragraph position="22"> Exp4: I think IT may belong in the blue cap onto which you put the pink piece of plastic.</Paragraph> <Paragraph position="23"> In Exp3, one might claim that it and there are contra-indexed, and that there can be properly resolved to a hole, so that it cannot be any of the noun phrases in the prepositional phrases that modify a hole, but whether any theory of contra-indexing actually gives us this is questionable.</Paragraph> <Paragraph position="24"> The main factor seems to be that even though Exp1 is not syntactically a question, the little red piece is the focus of a question, and as such is in focus despite the fact that the there-insertion construction in Cli1 supposedly focuses a hole in the green plunger [Sid79]. These examples suggest that a questioned entity remains focused until the point in the dialogue at which the question is resolved. The fact that well has been noted as a marker of response to questions supports this analysis [Sch87]. Thus the relevant factor here may be the switching of control among discourse participants [WS88]. These mixed-initiative features make these sequences inherently different from text.</Paragraph> </Section> <Section position="2" start_page="256" end_page="257" type="sub_section"> <SectionTitle> 3.2 Modifiability </SectionTitle> <Paragraph position="0"> Task structure in the pump dialogues is an important factor, especially as it relates to the use of global focus.</Paragraph> <Paragraph position="1"> Twenty of the cases on which both algorithms fail are references to the pump, which is the global focus. We can include a global focus in the centering framework, as a separate notion from the current CB. This means that in the 15 out of 20 cases where the shift to global focus is identifiably marked with a cue word such as now, the segment rules will allow BFP to get the global focus examples.</Paragraph> <Paragraph position="2"> BFP can add the VP and the S onto the end of the forward-centers list, as Sidner does in her algorithm for local focusing [Sid79]. This lets BFP get the two examples of event anaphora. Hobbs discusses the fact that his algorithm cannot be modified to get event anaphora in [Hob76b].</Paragraph>
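A sketch of these two modifications, the global-focus fallback and the extension of the forward-centers list, is given below. The cue-word detector and the per-utterance event entity are assumed to be supplied by the host system; the data representation is a hypothetical one.

```python
# Sketch of the two centering modifications described above: (1) fall
# back to a separately maintained global focus at cue-word-marked
# segment boundaries, and (2) append the event (VP/S) entity to the
# end of the forward-centers list, as Sidner does.

CUE_WORDS = {"now", "next", "then"}

def extend_cf(cf, event_entity):
    """Make the S/VP event available for event anaphora, ranked last."""
    return cf + [event_entity]

def candidate_antecedents(cf_prev, global_focus, first_word):
    """At a cue-marked segment boundary, consider the global focus
    (here, the pump) ahead of the previous utterance's Cf list."""
    if first_word.lower() in CUE_WORDS:
        return [global_focus] + cf_prev
    return cf_prev
```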
<Paragraph position="3"> Another interesting fact is that in every case in which Hobbs' algorithm gets the correct co-specifier and BFP does not, the relevant factor is Hobbs' preference for intrasentential co-specifiers. One view of these cases may be that they are not discourse anaphora, but there seems to be no principled way to make this distinction. However, Carter has proposed some extensions to Sidner's algorithm for local focusing that seem to be relevant here (chap. 6, [Car87]). He argues that intrasentential candidates (ISCs) should be preferred over candidates from the previous utterance ONLY in the cases where no discourse center has been established or the discourse center is rejected for syntactic or selectional reasons. He then uses the Hobbs algorithm to produce an ordering of these ISCs. This is compatible with the centering framework, since it is underspecified as to whether one should always choose to establish a discourse center with a co-specifier from a previous utterance. If we adopt Carter's rule into the centering framework, we find that of the 21 cases that Hobbs gets and BFP doesn't, in 7 cases there is no discourse center established, and in another 4 the current center can be rejected on the basis of syntactic or sortal information. Of these, Carter's rule clearly gets 5, and another 3 seem to rest on whether one might want to establish a discourse entity from a previous utterance. Since the addition of this constraint does not allow BFP to get any examples that neither algorithm got, it seems that this combination is a way of making the best out of both algorithms.</Paragraph> <Paragraph position="4"> The addition of these modifications changes the quantitative results. See Figure 5.</Paragraph> <Paragraph position="5"> [Figure 5: Performance after modifications, for Wheels, Newsweek and the Task Dialogues] However, the statistical analyses still show that there is no significant difference in the performance of the algorithms in general. It is also still the case that the performance of each algorithm varies significantly depending on the data. The only significant difference as a result of the modifications is that the BFP algorithm now performs significantly better on the pump dialogues alone (χ² = 4.31, p < .05).</Paragraph> </Section> </Section> </Paper>