<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-0106">
  <Title>Analyzing the Reading Comprehension Task</Title>
  <Section position="3" start_page="0" end_page="36" type="metho">
    <SectionTitle>
2 The Complexity of Extracting a
Fact From Text
</SectionTitle>
    <Paragraph position="0"> Any text document is a collection of facts (information). These facts may be explicitly or implicitly stated in the text. In addition, there are &amp;quot;easy&amp;quot; facts which may be found in a single sentence (example: the name of a city) as well as &amp;quot;difficult&amp;quot; facts which are spread across several sentences (example: the reason for a particular event).</Paragraph>
    <Paragraph position="1"> For a computer system to be able to process text documents in applications like information extrac- null tion (IE), question answering, and reading comprehension, it has to have the ability to extract facts from text. Obviously, the performance of the system will depend upon the type of fact it has to extract: explicit or implicit, easy or difficult, etc. (by no means is this list complete). In addition, the performance of such systems varies greatly depending on various additional factors including known vocabulary, sentence length, the amount of training, quality of parsing, etc. Despite the great variations in the performances of such systems, it has been hypothesized that there are facts that are simply harder to extract than others (Hirschman, 1992).</Paragraph>
    <Paragraph position="2"> In this section we describe a method for estimating the complexity of extracting a fact from text. The proposed model was initially used to analyze the information extraction task (Bagga and Biermann, 1997). In addition to verifying Hirschman's hypothesis, the model also provided us with a framework for analyzing and understanding the performance of several IE systems (Bagga and Biermann, 1998). We have also proposed using this model to analyze the complexity of the QA task Which is related to both the IE, and the reading comprehension tasks (Bagga et al., 1999). The remainder of this section describes the model in detail, and provides a sample application of the model to an IE task. In the following section, we discuss how this model can be used to analyze the reading comprehension task.</Paragraph>
    <Section position="1" start_page="35" end_page="35" type="sub_section">
      <SectionTitle>
2.1 Definitions
</SectionTitle>
      <Paragraph position="0"> Network: A network consists of a collection of nodes interconnected by an accompanying set of arcs. Each node denotes an object and each arc represents a binary relation between the objects. (Hendrix, 1979) A Partial Network: A partial network is a collection of nodes interconnected by an accompanying set of arcs where the collection of nodes is a subset of a collection of nodes forming a network, and the accompanying set of arcs is a subset of the se.t of arcs accompanying the set of nodes which form the network.</Paragraph>
      <Paragraph position="1"> of</Paragraph>
      <Paragraph position="3"> Figure 1 shows a sample network for the following piece of text: &amp;quot;The Extraditables,&amp;quot; or the Armed Branch of the Medellin Cartel have claimed responsibility for the murder of two employees of Bogota's daily E1 Espectador on Nov 15.</Paragraph>
      <Paragraph position="4"> The murders took place in Medellin.</Paragraph>
    </Section>
    <Section position="2" start_page="35" end_page="36" type="sub_section">
      <SectionTitle>
2.2 The Level of A Fact
</SectionTitle>
      <Paragraph position="0"> The level of a fact, F, in a piece of text is defined by the following algorithm:  1. Build a network, S, for the piece of text. 2. Identify the nodes that are relevant to the fact, F. Suppose {xl,x~,...,Xn} are the nodes relevant to F. Let s be the partial network con- null sisting of the set of nodes {xl, x~,..., x~} interconnected by the set of arcs {tl, t2,..., tk}. We define the level of the fact, F, with respect to the network, S to be equal to k, the number of arcs linking the nodes which comprise the fact Fins.</Paragraph>
      <Paragraph position="1">  Given the definition of the level of a fact, the following observations can be made: * The level of a fact is related to the concept of &amp;quot;semantic vicinity&amp;quot; defined by Schubert et. al. (Schubert and others, 1979). The semantic vicinity of a node in a network consists of the nodes and the arcs reachable from that node by traversing a small number of arcs. The fundamental assumption used here is that &amp;quot;the knowledge required to perform an intellectual task generally lies in the semantic vicinity of the concepts involved in the task&amp;quot; (Schubert and others, 1979).</Paragraph>
      <Paragraph position="2"> The level of a fact is equal to the number of arcs that one needs to traverse to reach all the concepts (nodes) which comprise the fact of interest. null  * A level-0 fact consists of a single node (i.e. no transitions) in a network.</Paragraph>
      <Paragraph position="3"> * A level-k fact is a union of k level-1 facts: * Conjunctions/disjunctions increase the level of a fact.</Paragraph>
      <Paragraph position="4"> * A higher level fact is likely to be harder to extract than a lower level fact.</Paragraph>
      <Paragraph position="5"> * A fact appearing at one level in a piece of text may appear at some other level in the same  piece of text.</Paragraph>
      <Paragraph position="6"> * The level of a fact in a piece of text depends on the granularity of the network constructed for that piece of text. Therefore, the level of a fact with respect to a network built at the word level (i.e. words represent objects and the relationships between the objects) will be greater than the level of a fact with respect to a network built at the phrase level (i.e. noun groups represent objects while verb groups and preposition groups represent the relationships between the objects).</Paragraph>
      <Paragraph position="7">  Let S be the network shown in Figure 1. S has been built at the phrase level.</Paragraph>
      <Paragraph position="8">  We define the type o/attack in the network to be an attack designator such as &amp;quot;murder, .... bombing,&amp;quot; or &amp;quot;assassination&amp;quot; with one modifier giving the victim, perpetrator, date, location, or other information.</Paragraph>
      <Paragraph position="9"> In this case the type of attack fact is composed of the &amp;quot;the murder&amp;quot; and the &amp;quot;two employees&amp;quot; nodes and their connector. This makes the type of attack a level-1 fact.</Paragraph>
      <Paragraph position="10"> The type of attack could appear as a level-0 fact as in &amp;quot;the Medellin bombing&amp;quot; (assuming that the network is built at the phrase level) because in this case both the attack designator (bombing) and the modifier (Medellin) occur in the same node. The type of attack fact occurs as a level-2 fact in the following sentence (once again assuming that the network is built at the phrase level): &amp;quot;10 people were killed in the offensive which included several bombings.&amp;quot; In this case there is no direct connector between the attack designator (several bombings) and its modifier (10 people). They are connected by the intermediatory &amp;quot;the offensive&amp;quot; node; thereby making the type of attack a level-2 fact. The type of attack can also appear at higher levels.</Paragraph>
      <Paragraph position="11"> * In S, the date of the murder of the two employees is an example of a level-2 fact.</Paragraph>
      <Paragraph position="12"> This is because the attack designator (the tourder) along with its modifier (two employees) account for one level and the arc to &amp;quot;Nov 15&amp;quot; accounts for the second level.</Paragraph>
      <Paragraph position="13"> The date of the attack, in this case, is not a level-1 fact (because of the two nodes &amp;quot;the tourder&amp;quot; and &amp;quot;Nov 15&amp;quot;) because the phrase &amp;quot;the murder on Nov 15&amp;quot; does not tell one that an attack actually took place. The article could have been talking about a seminar on murders that took place on Nov 15 and not about the murder of two employees which took place then.</Paragraph>
      <Paragraph position="14"> * In S, the location of the murder of the two employees is an example of a level-2 fact.</Paragraph>
      <Paragraph position="15"> The exact same argument as the date of the murder of' the two employees applies here.</Paragraph>
      <Paragraph position="16"> * The complete information, in S, about the victiros is an example of a level-2 fact because to know that two employees of Bogota's Daily E1 Espectador were victims, one has to know that they were murdered. The attack designator (the murder) with its modifier (two employees) accounts for one level, while the connector between &amp;quot;two employees&amp;quot; and &amp;quot;Bogota's Daily E1 Espectador&amp;quot; accounts for the other.</Paragraph>
    </Section>
    <Section position="3" start_page="36" end_page="36" type="sub_section">
      <SectionTitle>
2.3 Building the Networks
</SectionTitle>
      <Paragraph position="0"> As mentioned earlier, the level of a fact for a piece of text depends on the network constructed for the text. Since there is no unique network corresponding to a piece of text, care has to be taken so that the networks are built consistently.</Paragraph>
      <Paragraph position="1"> We used the following algorithm to build the networks: null  1. Every article was broken up into a non-overlapping sequence of noun groups (NGs), verb groups (VGs), and preposition groups (PGs). The rules employed to identify the NGs, VGs, and PGs were almost the same as the ones employed by SRI's FASTUS system 1.</Paragraph>
      <Paragraph position="2"> 2. The nodes of the network consisted of the NGs while the transitions between the nodes consisted of the VGs and the PGs.</Paragraph>
      <Paragraph position="3"> 3. Identification of coreferent nodes and preposi null tional phrase attachments were done manually. The networks are built based largely upon the syntactic structure of the text contained in the articles. However, there is some semantics encoded into the networks because identification of coreferent nodes and preposition phrase attachments are done manually. null Obviously, if one were to employ a different algorithm for building the networks, one would get different numbers for the level of a fact. But, if the algorithm were employed consistently across all the facts of interest and across all articles in a domain, the numbers on the level of a fact would be consistently different and one would still be able to analyze the relative complexity of extracting that fact from a piece of text in the domain.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="36" end_page="37" type="metho">
    <SectionTitle>
3 Example: Analyzing the
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="36" end_page="37" type="sub_section">
      <SectionTitle>
Extraction Task
</SectionTitle>
      <Paragraph position="0"> In order to validate our model of complexity we applied it to the Information Extraction (IE) task, or the Message Understanding task (DAR, 1991), (DAR, 1992), (ARP, 1993), (DAR, 1995), (DAR, 1998). The goal of an IE task is to extract pre-specified facts from text and fill in predefined templates containing labeled slots.</Paragraph>
      <Paragraph position="1"> We analyzed the complexity of the task used for the Fourth Message Understanding Conference (MUC-4) (DAR, 1992). In this task, the participangs were asked to extract the following facts from articles describing terrorist activities in Latin America: null  We analyzed a set of 100 articles from the MUC-4 domain each of which reported one or more terrorist attacks. Figure 2 shows the level distribution for each of the five facts. A closer analysis of the figure shows that the &amp;quot;type of attack&amp;quot; fact is the easiest to extract while the &amp;quot;perpetrator&amp;quot; fact is the hardest (the curve peaks at level-2 for this fact). In addition, Figure 3 shows the level distribution of the five facts combined. This figure gives some indication of the complexity of the MUC-4 task because it shows that almost 50% of the MUC-4 facts occur at level-1. The expected level of the five facts in the MUC-4 domain was 1.74 (this is simply the weighted average of the level distributions of the facts). We define this number to be the Task Complexity for the MUC-4 task.</Paragraph>
      <Paragraph position="2"> Therefore, the MUC-4 task can now be compared to, say, the MUC-5 task by comparing their Task Complexities. In fact, we computed the Task Complexity of the MUC-5 task and discovered that it was equal to 2.5. In comparison, an analysis, using more &amp;quot;superficial&amp;quot; features, done by Beth Sundheim, shows that the nature of the MUC-5 EJV task is approximately twice as hard as the nature of the MUC-4 task (Sundheim, 1993). The features used in the study included vocabulary size, the average number of words per sentence, and the average number of sentences per article. More details about this analysis can be found in (Bagga and Biermann, 1998).</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="37" end_page="39" type="metho">
    <SectionTitle>
4 Analyzing the Reading
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="37" end_page="38" type="sub_section">
      <SectionTitle>
Comprehension Task
</SectionTitle>
      <Paragraph position="0"> The reading comprehension task differs from the QA task in the following way: while the goal of the QA task is to find answers for a set of questions from a collection of documents, the goal of the reading comprehension task is to find answers to a set of questions from a single related document. Since the QA task involves extracting answers from a collection of documents, the complexity of this task depends on the expected level of occurrence of the answers of the questions. While it is theoretically possible to compute the average level of any fact in the entire  document collection, it is not humanly possible to analyze every document in such large collections to compute this. For example, the TREC collection used for the QA track is approximately 5GB. However, since the reading comprehension task involves extracting the answers from a single document, it is possible to analyze the document itself in addition to computing the level of the occurrence of each answer. Therefore, the results presented in this paper will provide both these values.</Paragraph>
    </Section>
    <Section position="2" start_page="38" end_page="38" type="sub_section">
      <SectionTitle>
4.1 Analysis and Results
</SectionTitle>
      <Paragraph position="0"> We analyzed a set of five reading comprehension tests offered by the English Language Center at the University of Victoria in Canada 2. These five tests are listed in increasing order of difficulty and are classified by the Center as: Basic, Basic-Intermediate, Intermediate, Intermediate-Advanced, and Advanced. For each of these tests, we calculated the level number of each sentence in the text, and the level number of the sentences containing the answers to each question for every test. In addition, we also calculated the number of coreferences present in each sentence in the texts, and the corresponding number in the sentences containing each answer. It should be noted that we were forced to calculate the level number of the sentences containing the answer as opposed to calculating the level number of the answer itself because several questions had only true/false answers. Since there was no way to compute the level numbers of true/false answers, we decided to calculate the level numbers of the sentences containing the answers in order to be consistent. For true/false answers this implied analyzing all the sentences which help determine the truth value of the question.</Paragraph>
      <Paragraph position="1"> Figure 4 shows for each text, the number of sentences in the text, the average level number of a sentence, the average number of coreferences per sentence, the number of questions corresponding to the test, the average level number of each answer, and the average number of coreferences per answer.</Paragraph>
      <Paragraph position="2"> The results shown in Figure 4 are consistent with the model. The figure shows that as the difficulty level of the tests increase, so do the corresponding level numbers per sentence, and the answers. One 2 http://web2.uvcs.uvic.ca/elc/studyzone/index.htm conclusion that we can draw from the numbers is that the Basic-Intermediate test, based upon the analysis, is slightly more easy than the Basic test.</Paragraph>
      <Paragraph position="3"> We will address this issue in the next section.</Paragraph>
      <Paragraph position="4"> The numbers of coreferences, surprisingly, do no increase with the difficulty of the tests. However, a closer look at the types of coreference shows that while most of the coreferences in the first two tests (Basic, and Basic-Intermediate) are simple pronominal coreferences (he, she, it, etc.), the coreferences used in the last two tests (Intermediate-Advanced, and Advanced) require more knowledge to process.</Paragraph>
      <Paragraph position="5"> Some examples include marijuana coreferent with the drug, hemp with the pant, etc. Not being able to capture the complexity of the coreferences is one, among several, shortcomings of this model.</Paragraph>
    </Section>
    <Section position="3" start_page="38" end_page="39" type="sub_section">
      <SectionTitle>
4.2 A Comparison with Qanda
</SectionTitle>
      <Paragraph position="0"> MITRE 3 ran its Qanda reading comprehension system on the five tests analyzed in the previous section. However, instead of producing a single answer for each question, Qanda produces a list of answers listed in decreasing order of confidence. The rest of this section describes an evaluation of Qanda's performance on the five tests and a comparison with the analysis done in the previous section.</Paragraph>
      <Paragraph position="1"> In order to evaluate Qanda's performance on the five tests we decided to use the Mean Reciprocal Answer Rank (MRAR) technique which was used for evaluating question-answering systems at TREC-8 (Singhal, 1999). For each answer, this techniques assigns a score between 0 and 1 depending on its rank in the list of answers output. The score for answer, i, is computed as:  Scorel = rank of answeri If no correct answer is found in the list, a score of 0 is assigned. Therefore, MRAR for a reading comprehension test is the sum of the scores for answers corresponding to each question for that test.</Paragraph>
      <Paragraph position="2"> Figure 5 summarizes Qanda's results for the five tests. The figure shows, for each test, the number of questions, the cumulative MRAR for all answers for the test, and the average MRAR per answer.</Paragraph>
      <Paragraph position="3"> 3We would like to thank Marc Light and Eric Breck for their help with running Qanda on our data.</Paragraph>
      <Paragraph position="4">  The results from Qanda are more or less consistent with the analysis done earlier. Except for the Advanced test, the average Mean Reciprocal Answer Rank is consistent with the average number of levels per sentence (from Figure 4). It should be pointed out that the system performed significantly better on the Basic-Intermediate Test compared to the Basic test consistent with the numbers in Figure 4. However, contrary to expectation, Qanda performed exceedingly well on the Advanced test answering 7 out of the 10 questions with answers whose rank is 1 (i.e. the first answer among the list of possible answers for each question is the correct one). We are currently consulting the developers of the system for conducting an analysis of the performance on this test in more detail.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="39" end_page="39" type="metho">
    <SectionTitle>
5 Shortcomings
</SectionTitle>
    <Paragraph position="0"> This measure is just the beginning of a search for useful complexity measures. Although the measure is a big step up from the measures used earlier, it has a number of shortcomings. The main shortcoming is the ambiguity regarding the selection of nodes from the network regarding the fact of interest. Consider the following sentence: &amp;quot;This is a report from the Straits of Taiwan ......... Yesterday, China test fired a missile.&amp;quot; Suppose we are interested in the location of the launch of the missile. The ambiguity here arises from the fact that the article does not explicitly mention that the missile was launched in the Straits of Taiwan. The decision to infer that fact from the information present depends upon the person building the network.</Paragraph>
    <Paragraph position="1"> In addition, the measure does not account for the following factors (the list is not complete): coreference: If the extraction of a fact requires the resolution of several coreferences, it is clearly more difficult than an extraction which does not. In addition, the degree of difficulty of resolving coreferences itself varies from simple exact matches~ and pronominal coreferences, to ones that require external world knowledge.</Paragraph>
    <Paragraph position="2"> frequency of answers: The frequency of occurrence of facts in a collection of documents has an impact on the performance of systems.</Paragraph>
    <Paragraph position="3"> occurrence of multiple (similar) facts: Clearly, if several similar facts are present in the same article, the systems will find it harder to extract the correct fact.</Paragraph>
    <Paragraph position="4"> vocabulary size: Unknown words present some problems to systems making it harder for them to perform well.</Paragraph>
    <Paragraph position="5"> On the other hand, no measure can take into account all possible features in natural language. Consider the following example. In an article, suppose one initially encounters a series of statements that obliquely imply that the following statement is false. Then the statement is given: &amp;quot;Bill Clinton visited Taiwan last week.&amp;quot; Processing such discourse requires an ability to perfectly understand the initial series of statements before the truth value of tlie last statement can be properly evaluated. Such complete understanding is beyond the state of the art and is likely to remain so for many years.</Paragraph>
    <Paragraph position="6"> Despite these shortcomings, the current measure does quantify complexity on one very important dimension, namely the number of clauses (or phrases) required to specify a fact. For the short term it appears to be the best available vehicle for understanding the complexity of extracting a fact.</Paragraph>
  </Section>
class="xml-element"></Paper>