<?xml version="1.0" standalone="yes"?> <Paper uid="P93-1021"> <Title>A LANGUAGE-INDEPENDENT ANAPHORA RESOLUTION SYSTEM FOR UNDERSTANDING MULTILINGUAL TEXTS</Title> <Section position="5" start_page="159" end_page="161" type="evalu"> <SectionTitle> 3 Evaluating and Training the </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="159" end_page="159" type="sub_section"> <SectionTitle> Discourse Module </SectionTitle> <Paragraph position="0"> In order to choose the most effective KS's for a particular phenomenon, as well as to debug and track the progress of the discourse module, we must be able to evaluate the performance of discourse processing. To perform objective evaluation, we compare the results of running our discourse module over a corpus with a set of manually created discourse tags. Examples of discourse-tagged text are shown in Figure 7. The metrics we use for evaluation are detailed in Figure 8.</Paragraph> </Section> <Section position="2" start_page="159" end_page="159" type="sub_section"> <SectionTitle> 3.1 Evaluating the Discourse Module </SectionTitle> <Paragraph position="0"> We evaluate overall performance by calculating the recall and precision of anaphora resolution results. The higher these measures are, the better the discourse module is working. In addition, we evaluate discourse performance over new texts using black-box evaluation (e.g., scoring the results of a data extraction task). To calculate a generator's failure rate, a filter's false positive rate, and an orderer's effectiveness, the algorithms in Figure 9 are used. 3</Paragraph> </Section> <Section position="3" start_page="159" end_page="161" type="sub_section"> <SectionTitle> 3.2 Choosing Main Strategies </SectionTitle> <Paragraph position="0"> The uniqueness of our approach to discourse analysis is also shown by the fact that our discourse module can be trained for a particular domain, similar to the way grammars have been trained (cf. 
Black \[4\]). (Footnote 3: &quot;The remaining antecedent hypotheses&quot; are the hypotheses left after all the filters are applied for an anaphor. Figure 9: for each discourse phenomenon, given anaphor and antecedent pairs in the corpus, (a) calculate how often the generator fails to generate the antecedent; (b) for each filter, calculate how often it incorrectly eliminates the antecedent; and (c) for each anaphor exhibiting the phenomenon, given its remaining antecedent hypotheses, test whether each applicable orderer chooses the correct antecedent as the best hypothesis.) As Walker \[18\] reports, different discourse algorithms (i.e., Brennan, Friedman and Pollard's centering approach \[5\] vs. Hobbs' algorithm \[12\]) perform differently on different types of data. This suggests that different sets of KS's are suitable for different domains.</Paragraph> <Paragraph position="1"> In order to determine, for each discourse phenomenon, the most effective combination of generators, filters, and orderers, we evaluate the overall performance of the discourse module (cf. Section 3.1) at different rate settings. We measure particular generators, filters, and orderers for different phenomena to identify promising strategies. We try to minimize the failure rate and the false positive rate, while also minimizing the average number of hypotheses the generator suggests and maximizing the number of hypotheses the filters eliminate. As for orderers, those with the highest effectiveness measures are chosen for each phenomenon. 
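The three Figure 9 measurements can be sketched in Python; the function names, the shape of the corpus data, and the generator/filter/orderer callables are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of the Figure 9 evaluation algorithms.
# `pairs` is assumed to be a list of (anaphor, true_antecedent) tuples
# drawn from a discourse-tagged corpus.

def generator_failure_rate(pairs, generate):
    """Fraction of anaphors whose true antecedent the generator misses."""
    misses = sum(1 for anaphor, antecedent in pairs
                 if antecedent not in generate(anaphor))
    return misses / len(pairs)

def filter_false_positive_rate(pairs, generate, keep):
    """Fraction of anaphors whose true antecedent the filter wrongly drops."""
    drops = sum(1 for anaphor, antecedent in pairs
                if antecedent in generate(anaphor)
                and not keep(anaphor, antecedent))
    return drops / len(pairs)

def orderer_effectiveness(pairs, remaining, order):
    """Fraction of anaphors for which the orderer ranks the true
    antecedent as the best remaining hypothesis."""
    hits = 0
    for anaphor, antecedent in pairs:
        ranked = order(remaining(anaphor))
        if ranked and ranked[0] == antecedent:
            hits += 1
    return hits / len(pairs)
```

Under this reading, training (Section 3.2) amounts to sweeping these rates over candidate generator/filter/orderer combinations and keeping the combination with the best overall recall and precision.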
The discourse module is &quot;trained&quot; until a set of rate settings is found at which the overall performance of the discourse module is highest.</Paragraph> <Paragraph position="2"> Our approach is more general than that of Dagan and Itai \[7\], who report on training their anaphora resolution component so that &quot;it&quot; can be resolved to its correct antecedent using statistical data on lexical relations derived from large corpora. We will certainly incorporate such statistical data into our discourse KS's.</Paragraph> </Section> <Section position="4" start_page="161" end_page="161" type="sub_section"> <SectionTitle> 3.3 Determining Backup Strategies </SectionTitle> <Paragraph position="0"> If the main strategy for resolving a particular anaphor fails, a backup strategy that includes either a new set of filters or a new generator is attempted. Since backup strategies are employed only when the main strategy does not return a hypothesis, a backup strategy will either contain fewer filters than the main strategy or it will employ a generator that returns more hypotheses.</Paragraph> <Paragraph position="1"> If the generator has a non-zero failure rate 4, a new generator with more generating capability is chosen from the generator tree in the knowledge source KB as a backup strategy. Filters that occur in the main strategy but have false positive rates above a certain threshold are not included in the backup strategy.</Paragraph> </Section> </Section> </Paper>
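The backup-strategy construction described in Section 3.3 can be sketched as follows; the strategy record, the generator-tree lookup, and the 0.1 threshold are illustrative assumptions, not the paper's actual knowledge-source KB.

```python
# Hedged sketch: build a backup strategy from a main strategy by
# (a) climbing the generator tree when the main generator can fail, and
# (b) dropping filters whose false-positive rate exceeds a threshold.

def backup_strategy(main, parent_generator, fp_rate, threshold=0.1):
    """`main` is a dict with "generator", "filters", "failure_rate";
    `parent_generator` maps a generator to a more general one in the
    generator tree; `fp_rate` maps each filter to its measured
    false-positive rate on the training corpus."""
    generator = main["generator"]
    if main["failure_rate"] > 0.0:
        # the main generator sometimes misses the true antecedent,
        # so fall back to one with more generating capability
        generator = parent_generator(generator)
    filters = [f for f in main["filters"] if fp_rate[f] <= threshold]
    return {"generator": generator, "filters": filters}
```

By construction this backup either generates more hypotheses or filters out fewer of them, which matches the paper's requirement that it only fire when the main strategy returns no hypothesis.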