<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-2027">
  <Title>Automatic Creation of Domain Templates</Title>
  <Section position="10" start_page="210" end_page="212" type="evalu">
    <SectionTitle>
6 Evaluation
</SectionTitle>
    <Paragraph position="0"> The task we deal with is new and there is no well-defined and standardized evaluation procedure for it. Sudo et al. (2003) evaluated how well their IE patterns captured named entities of three pre-defined types. We are interested in evaluating how well we capture the major actions as well as their constituent parts.</Paragraph>
    <Paragraph position="1"> There is no set of domain templates built according to a single set of principles against which we could compare our automatically created templates. Thus, we need to create a gold standard. In Section 6.1, we describe how the gold standard is created. Then, in Section 6.2, we evaluate the quality of the automatically created templates by extracting clauses corresponding to the templates and verifying how many of the questions in the gold standard are answered by the extracted clauses.</Paragraph>
    <Section position="1" start_page="210" end_page="211" type="sub_section">
      <SectionTitle>
6.1 Stage 1. Information Included into Templates: Interannotator Agreement
</SectionTitle>
      <Paragraph position="0"> To create a gold standard, we asked people to create a list of questions indicating what is important for the domain description. Our decision to aim for lists of questions rather than for the templates themselves is based on the following considerations: first, not all of our subjects are familiar with the field of IE and thus do not necessarily know what an IE template is; second, our goal for this evaluation is to estimate interannotator agreement on capturing the important aspects of the domain, not on how well the subjects agree on the template structure.</Paragraph>
      <Paragraph position="1"> We asked our subjects to think of their experience of reading newswire articles about various domains. Based on what they remembered from this experience, we asked them to come up with at most 20 questions about a particular domain, covering the information they would look for in an unseen news article about a new event in that domain. We did not give them any input information about the domain, but allowed them to use any sources to learn more about it. We had ten subjects, each of whom created one list of questions for one of the four domains under analysis.</Paragraph>
      <Paragraph position="2"> Thus, for the earthquake and terrorist attack domains we got two lists of questions; for the airplane crash and presidential election domains we got three lists of questions.</Paragraph>
      <Paragraph position="3"> After the question lists were created, we studied the agreement among annotators on what information they consider important for the domain and thus should be included in the template. We matched the questions created by different annotators for the same domain. For some questions we had to make a judgement call on whether they matched. For example, one annotator's question for the earthquake domain was: Did the earthquake occur in a well-known area for earthquakes (e.g. along the San Andreas fault), or in an unexpected location? We matched this question to the following three questions created by the other annotator: What is the geological localization? Is it near a fault line? Is it near volcanoes? The degree of interannotator agreement is usually estimated with the Kappa statistic. For this task, however, Kappa cannot be used, as it requires knowledge of the expected or chance agreement, which is not applicable here (Fleiss et al., 1981). To measure interannotator agreement we therefore use the Jaccard metric, which does not require knowledge of the expected or chance agreement.</Paragraph>
      <Paragraph position="4"> Table 2 shows the values of the Jaccard metric for interannotator agreement calculated for all four domains. The Jaccard metric values are calculated as $J(QS_d^i, QS_d^j) = \frac{|QS_d^i \cap QS_d^j|}{|QS_d^i \cup QS_d^j|}$, where $QS_d^i$ and $QS_d^j$ are the sets of questions created by subjects i and j for domain d. For the airplane crash and presidential election domains we averaged the three pairwise Jaccard metric values.</Paragraph>
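      <Paragraph> As an illustration, the following is a minimal sketch (not the paper's implementation) of the pairwise Jaccard computation, assuming each annotator's questions have already been manually matched and mapped to shared question identifiers; the example data are hypothetical:
def jaccard(questions_i, questions_j):
    """Jaccard coefficient between two annotators' question sets."""
    qs_i, qs_j = set(questions_i), set(questions_j)
    if not qs_i and not qs_j:
        return 0.0
    return len(qs_i.intersection(qs_j)) / len(qs_i.union(qs_j))

# Hypothetical example: two annotators' question topics for the earthquake domain.
annotator_1 = {"magnitude", "epicenter", "casualties", "damage"}
annotator_2 = {"magnitude", "epicenter", "casualties", "aid"}
print(jaccard(annotator_1, annotator_2))  # 3 shared / 5 total = 0.6
For domains with three annotators, the three pairwise values are averaged, as described above.</Paragraph>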
      <Paragraph position="7"> The scores in Table 2 show that for some domains the agreement is quite high (e.g., earthquake), while for other domains (e.g., presidential election) it is twice as low. This difference in scores can be explained by the complexity of the domains and by differences in how the subjects understand them. The scores for the presidential election domain are predictably low, as the role of the president varies greatly across countries: in some countries the president is the head of the government with a lot of power, while in others the president is merely a ceremonial figure. In some countries the president is elected by popular vote, while in others the president is elected by parliament. These variations in the domain lead the subjects to be interested in different aspects of the domain. Another issue that might influence interannotator agreement is how the presidential election process is spread out in time. For example, one of our subjects was clearly interested in the pre-voting situation, such as debates between the candidates, while another subject was interested only in the outcome of the election.</Paragraph>
      <Paragraph position="8"> For the terrorist attack domain we also compared the lists of questions we got from our subjects with the terrorist attack template created by experts for the MUC competition. In this template we treated every slot as a separate question, excluding the first two slots, which capture information about the text from which the template fillers were extracted rather than about the domain. The results of this comparison are included in Table 2.</Paragraph>
      <Paragraph position="9"> Differences in domain complexity have been studied by IE researchers. Bagga (1997) suggests a classification methodology for predicting the syntactic complexity of domain-related facts. Huttunen et al. (2002) analyze how the component sub-events of a domain are linked together and discuss the factors which contribute to domain complexity.</Paragraph>
    </Section>
    <Section position="2" start_page="211" end_page="212" type="sub_section">
      <SectionTitle>
6.2 Stage 2. Quality of the Automatically Created Templates
</SectionTitle>
      <Paragraph position="0"> In Section 6.1 we showed that not all domains are equal: for some domains it is much easier to reach a consensus about which slots should be present in the domain template than for others.</Paragraph>
      <Paragraph position="1"> In this section we describe the evaluation of the four automatically created templates.</Paragraph>
      <Paragraph position="2"> Automatically created templates consist of slot structures and are not easily readable by human annotators. Thus, instead of evaluating the template quality directly, we evaluate the clauses extracted according to the created templates and check whether these clauses contain the answers to the questions created by the subjects during the first stage of the evaluation. We extract the clauses corresponding to the test instances according to the following procedure: 1. Identify all the simple clauses in the documents corresponding to a particular test instance (the respective TDT topic). For example, for the sentence "Her husband, Robert, survived Thursday's explosion in a Yemeni harbor that killed at least six crew members and injured 35." only one part is output: "that killed at least six crew members and injured 35".</Paragraph>
      <Paragraph position="3"> 2. For every domain template slot, check all the simple clauses in the instance (TDT topic) under analysis. Find the shortest clause (or sequence of clauses) which includes both the verb and the other words extracted for this slot, in their respective order. Add this clause to the list of extracted clauses unless it has already been added to the list.</Paragraph>
      <Paragraph position="4"> 3. Keep adding clauses to the list of extracted clauses until all the template slots have been analyzed or the size of the list exceeds 20 clauses.</Paragraph>
      <Paragraph position="5"> The key step in the above algorithm is Step 2. By choosing the shortest simple clause or sequence of simple clauses corresponding to a particular template slot, we reduce the possibility of adding more information to the output than is necessary to cover each particular slot.</Paragraph>
      <Paragraph position="6"> In Step 3 we keep only the first twenty clauses, so that the output which potentially contains the answers to the questions of interest is no longer than the number of questions provided by each subject. The templates are created from the slot structures extracted for the top 50 verbs. The higher the estimated score of a verb for the domain (Eq. 1), the closer to the top of the template the slot structure corresponding to this verb is placed. We assume that the important information is more likely to be covered by the slot structures placed near the top of the template.</Paragraph>
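      <Paragraph> The following is a minimal sketch of the clause selection in Steps 2 and 3. It is not the paper's implementation: the helper names and data structures are hypothetical, and, for simplicity, it considers single clauses only rather than sequences of clauses:
def contains_in_order(clause_tokens, slot_words):
    """True if every slot word occurs in the clause, in the slot's original order."""
    pos = 0
    for token in clause_tokens:
        if pos != len(slot_words) and token == slot_words[pos]:
            pos += 1
    return pos == len(slot_words)

def extract_clauses(template_slots, simple_clauses, max_clauses=20):
    """template_slots is assumed to be ordered by the verb's domain score (Eq. 1);
    simple_clauses is the list of simple clauses for one TDT topic."""
    extracted = []
    for slot in template_slots:
        # The slot's verb plus the other words extracted for this slot (Step 2).
        slot_words = [slot["verb"]] + slot["words"]
        candidates = [c for c in simple_clauses
                      if contains_in_order(c.split(), slot_words)]
        if not candidates:
            continue
        # Keep the shortest matching clause, avoiding duplicates.
        shortest = min(candidates, key=lambda c: len(c.split()))
        if shortest not in extracted:
            extracted.append(shortest)
        # Step 3: stop once the list reaches the clause limit.
        if len(extracted) >= max_clauses:
            break
    return extracted</Paragraph>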
      <Paragraph position="7"> The evaluation results for the automatically created templates are presented in Figure 1. We calculate the average percentage of questions covered by the outputs created according to the domain templates. For every domain, we present the percentage of covered questions separately for each annotator and for the intersection of the annotators' questions (Section 6.1).</Paragraph>
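      <Paragraph> As an illustration of how the percentages in Figure 1 can be computed, the following is a minimal sketch with hypothetical data, assuming the per-question judgements (whether some extracted clause answers the question) have already been made; coverage is the percentage of an annotator's questions that are answered, averaged over the topics of a domain:
def coverage(answered):
    """answered maps each question to True if some extracted clause answers it."""
    return 100.0 * sum(answered.values()) / len(answered)

per_topic = [
    {"q1": True, "q2": True, "q3": False},   # topic 1
    {"q1": True, "q2": False, "q3": True},   # topic 2
]
average = sum(coverage(t) for t in per_topic) / len(per_topic)
print(round(average, 1))  # 66.7</Paragraph>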
      <Paragraph position="8"> For the questions common to all the annotators we capture about 70% of the answers for three out of four domains. After studying the results we noticed that for the earthquake domain some questions did not result in a template slot and thus could not be covered by the extracted clauses. Here are two such questions: Is it near a fault line? Is it near volcanoes? According to the template creation procedure, which is centered around verbs, the chances that the extracted clauses would contain answers to these questions are low. Indeed, only one of the three sentence sets extracted for the three TDT earthquake topics contains an answer to one of these questions.</Paragraph>
      <Paragraph position="9"> The poor results for the presidential election domain could be predicted from the Jaccard metric value for interannotator agreement (Table 2). There is considerable discrepancy in the questions created by the human annotators, which can be attributed to the great variation in the presidential election domain itself. It must also be noted that most of the questions created for the presidential election domain clearly refer to a democratic election procedure, while some of the TDT topics categorized as Elections were about either election fraud or an opposition taking over power without the formal resignation of the previous president.</Paragraph>
      <Paragraph position="10"> Overall, this evaluation shows that, using automatically created domain templates, we extract sentences which contain a substantial part of the important information expressed in the questions for that domain. For domains with little diversity, our coverage can be significantly higher.</Paragraph>
    </Section>
  </Section>
</Paper>