<?xml version="1.0" standalone="yes"?> <Paper uid="P99-1042"> <Title>Deep Read: A Reading Comprehension System</Title> <Section position="5" start_page="326" end_page="328" type="metho"> <SectionTitle> 3 System Architecture </SectionTitle> <Paragraph position="0"> The process of taking short-answer reading comprehension tests can be broken down into the following subtasks: * Extraction of the information content of the question.</Paragraph> <Paragraph position="1"> * Extraction of the information content of the document.</Paragraph> <Paragraph position="2"> * Searching for the information requested in the question against the information in the document. A crucial component of all three of these subtasks is the representation of information in text. Because our goal in designing our system was to explore the difficulty of various reading comprehension exams and to measure baseline performance, we tried to keep this initial implementation as simple as possible.</Paragraph> <Section position="1" start_page="327" end_page="327" type="sub_section"> <SectionTitle> 3.1 Bag-of-Words Approach </SectionTitle> <Paragraph position="0"> Our system represents the information content of a sentence (both question and text sentences) as the set of words in the sentence.</Paragraph> <Paragraph position="1"> The word sets are considered to have no structure or order and contain unique elements. For example, the representation for (1a) is the set in (1b).</Paragraph> <Paragraph position="2"> 1a (Sentence): By giving it 6,457 of his books, Thomas Jefferson helped get it started. 1b (Bag): {6,457 books by get giving helped his it Jefferson of started Thomas} Extraction of information content from text, both in documents and questions, then consists of tokenizing words and determining sentence boundary punctuation. For English written text, both of these tasks are relatively easy although not trivial--see Palmer and Hearst (1997).</Paragraph> <Paragraph position="3"> The search subtask consists of finding the best match between the word set representing the question and the sets representing sentences in the document. Our system measures the match by the size of the intersection of the two word sets. For example, the question in (2a) would receive an intersection score of 1 against sentence (1a) because of the mutual set element books.</Paragraph> <Paragraph position="4"> When multiple sentences in the document have the same intersection score, we additionally prefer sentences that first match on longer words and, second, occur earlier in the document.</Paragraph> </Section> <Section position="2" start_page="327" end_page="328" type="sub_section"> <SectionTitle> 3.2 Normalizations and Extensions of the Word Sets </SectionTitle> <Paragraph position="0"> In this section, we describe extensions to the extraction approach described above. In the next section we will discuss the performance benefits of these extensions.</Paragraph> <Paragraph position="1"> The most straightforward extension is to remove function or stop words, such as the, of, a, etc., from the word sets, reasoning that they offer little semantic information and only muddle the signal from the more contentful words.</Paragraph> <Paragraph position="2"> Similarly, one can use stemming to remove inflectional affixes from the words: such normalization might increase the signal from contentful words. For example, the intersection between (1b) and (2b) would include give if inflection were removed from gave and giving.</Paragraph> <Paragraph position="3"> We used a stemmer described by Abney (1997).</Paragraph>
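To make the bag-of-words matching concrete, the following Python sketch builds word-set representations, optionally applies the stop-word and stemming normalizations just described, and ranks document sentences by intersection with the question, breaking ties in favor of longer matching words and earlier position. It is an illustrative reconstruction, not the system's code: the stop-word list, the crude suffix stemmer (a stand-in for the Abney stemmer), and the toy example data are all assumptions.

```python
# Illustrative sketch of the bag-of-words matcher of Sections 3.1-3.2.
# The stop-word list, the crude suffix stemmer, and the example data are
# assumptions made for demonstration only.
import re

STOP_WORDS = {"the", "of", "a", "an", "to", "by", "it", "his", "in", "and"}

def tokenize(sentence):
    # Lowercase; keep numbers such as 6,457 intact, strip other punctuation.
    return re.findall(r"\d[\d,]*|[a-z']+", sentence.lower())

def crude_stem(word):
    # Very rough inflection stripping; a real system would use a proper stemmer.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def bag(sentence, remove_stop_words=False, stem=False):
    words = tokenize(sentence)
    if remove_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    if stem:
        words = [crude_stem(w) for w in words]
    return set(words)

def best_sentence(question, sentences, **options):
    # Rank sentences by intersection size; break ties by preferring longer
    # matching words, then earlier position in the document.
    q_bag = bag(question, **options)
    def key(indexed):
        position, sentence = indexed
        overlap = q_bag & bag(sentence, **options)
        longest_match = max((len(w) for w in overlap), default=0)
        return (-len(overlap), -longest_match, position)
    return min(enumerate(sentences), key=key)[1]

story = ["Thomas Jefferson helped start the Library of Congress.",
         "He gave 6,457 of his books to the new library."]
print(best_sentence("Who gave books to the library?", story, stem=True))
# -> "He gave 6,457 of his books to the new library."
```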
<Paragraph position="4"> A different type of extension is suggested by the fact that who questions are likely to be answered with words that denote people or organizations. Similarly, when and where questions are likely to be answered with words denoting temporal and locational information, respectively. By using name taggers to identify person, location, and temporal information, we can add semantic class symbols to the question word sets marking the type of the question and then add corresponding class symbols to the word sets whose sentences contain phrases denoting the proper type of entity.</Paragraph> <Paragraph position="5"> For example, due to the name Thomas Jefferson, the word set in (1b) would be extended by :PERSON, as would the word set (2b) because it is a who question. This would increase the matching score by one. The system makes use of the Alembic automated named entity system (Vilain and Day 1996) for finding named entities. In a similar vein, we also created a simple common noun classification module using WordNet (Miller 1990). It works by looking up all nouns in the text and adding person or location classes if any of a noun's senses is subsumed by the appropriate WordNet class. We also created a filtering module that ranks sentences higher if they contain the appropriate class identifier, even though they may have fewer matching words; e.g., if the bag representation of a sentence does not contain :PERSON, it is ranked lower as an answer to a who question than sentences that do contain :PERSON.</Paragraph> <Paragraph position="6"> Finally, the system contains an extension which substitutes the referent of personal pronouns for the pronoun in the bag representation. For example, if the system were to choose the sentence He gave books to the library, the answer returned and scored would be Thomas Jefferson gave books to the library, if He were resolved to Thomas Jefferson. The current system uses a very simplistic pronoun resolution heuristic, which resolves personal pronouns to the most recent prior named person.</Paragraph> </Section> </Section>
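The semantic-class and filtering extensions above can be sketched as follows. This is an illustrative reconstruction, not the system's interface: tag_entities is a hypothetical stand-in for the Alembic tagger and the WordNet common noun module, the question-type mapping is assumed, and only the class symbols follow the paper's :PERSON-style notation.

```python
# Illustrative sketch of the semantic class symbols and filtering module of
# Section 3.2. tag_entities() is a hypothetical stand-in for a named entity
# tagger plus a WordNet common noun classifier.
QUESTION_CLASS = {"who": ":PERSON", "when": ":TIME", "where": ":LOCATION"}

def tag_entities(sentence):
    # Placeholder: a real implementation would call a named entity tagger
    # and a WordNet lookup; here a tiny gazetteer illustrates the idea.
    classes = set()
    if "thomas jefferson" in sentence.lower():
        classes.add(":PERSON")
    return classes

def question_class(question):
    return QUESTION_CLASS.get(question.lower().split()[0])

def extend_bags(question, question_bag, sentence, sentence_bag):
    # Mark the question with the class it asks for, and mark the sentence
    # with the classes of entities it mentions; matching symbols then add
    # one to the intersection score.
    q_class = question_class(question)
    if q_class:
        question_bag = question_bag | {q_class}
    return question_bag, sentence_bag | tag_entities(sentence)

def filter_rank(question, candidates):
    # candidates: list of (intersection_score, sentence_bag, position).
    # Sentences containing the required class symbol outrank those that do
    # not, even if they have slightly fewer matching words.
    q_class = question_class(question)
    def key(item):
        score, sentence_bag, position = item
        missing_class = bool(q_class) and q_class not in sentence_bag
        return (missing_class, -score, position)
    return sorted(candidates, key=key)
```

For a who question, a sentence tagged with :PERSON thus both gains a point in the raw intersection and is promoted by the filter over untagged sentences.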
<Section position="6" start_page="328" end_page="329" type="metho"> <SectionTitle> 4 Experimental Results </SectionTitle> <Paragraph position="0"> Our modular architecture and automated scoring metrics have allowed us to explore the effect of various linguistic sources of information on overall system performance. We report here on three sets of findings: the value added from the various linguistic modules, the question-specific results, and an assessment of the difficulty of the reading comprehension task.</Paragraph> <Section position="1" start_page="328" end_page="329" type="sub_section"> <SectionTitle> 4.1 Effectiveness of Linguistic Modules </SectionTitle> <Paragraph position="0"> We were able to measure the effect of various linguistic techniques, both singly and in combination with each other, as shown in Figure 3 and Table 1. The individual modules are indicated as follows: Name is the Alembic named entity tagger described above. NameHum is hand-tagged named entity annotation. Stem is Abney's automatic stemming algorithm. Filt is the filtering module. Pro is automatic name and personal pronoun coreference. ProHum is hand-tagged, full reference resolution. Sem is the WordNet-based common noun semantic classification.</Paragraph> <Paragraph position="1"> We computed significance using the non-parametric significance test described by Noreen (1989). The following performance improvements on the AnsWdRecall metric were statistically significant at the 95% confidence level: Base vs. NameStem, NameStem vs. FiltNameHumStem, and FiltNameHumStem vs. FiltProHumNameHumStem. The other adjacent performance differences in Figure 3 are suggestive, but not statistically significant. Removing stop words seemed to hurt overall performance slightly--it is not shown here.</Paragraph> <Paragraph position="2"> Stemming, on the other hand, produced a small but fairly consistent improvement. We compared these results to perfect stemming, which made little difference, leading us to conclude that our automated stemming module worked well enough.</Paragraph> <Paragraph position="3"> Name identification provided consistent gains. The Alembic name tagger was developed for newswire text and used here with no modifications. We created hand-tagged named entity data, which allowed us to measure the performance of Alembic: the accuracy (F-measure) was 76.5; see Chinchor and Sundheim (1993) for a description of the standard MUC scoring metric. The hand-tagged data also allowed us to simulate perfect tagging and thus to determine how much we might gain by tuning the name tagger to this domain. As the results indicate, there would be little gain from improved name tagging. However, some modules that seemed to have little effect with automatic name tagging provided small gains with perfect name tagging, specifically WordNet common noun semantics and automatic pronoun resolution. When used in combination with the filtering module, these also seemed to help.</Paragraph> <Paragraph position="4"> Similarly, the hand-tagged reference resolution data allowed us to evaluate automatic coreference resolution. The latter was a combination of name coreference, as determined by Alembic, and a heuristic resolution of personal pronouns to the most recent prior named person. Using the MUC coreference scoring algorithm (see Vilain et al. 1995), this had a precision of 77% and a recall of 18%.3 The use of full, hand-tagged reference resolution caused a substantial increase in the AnsWdRecall metric. This was because the system substitutes the antecedent for all referring expressions, improving the word-based measure. It did not, however, provide an increase in the sentence-based measures.</Paragraph> <Paragraph position="5"> Finally, we plan to do similar human labeling experiments for semantic class identification, to determine the potential effect of this knowledge source.</Paragraph> </Section>
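The significance test in Section 4.1 is attributed to Noreen (1989), but the specific variant is not named in this excerpt. A common choice for paired comparisons of per-question scores, and the assumption behind the sketch below, is approximate randomization; all names in the sketch are illustrative.

```python
# Hedged sketch: approximate randomization test for a paired comparison of
# two systems' per-question scores (e.g., AnsWdRecall per question).
# Noreen (1989) describes several computer-intensive tests; which exact
# variant Deep Read used is not stated here, so this is an assumption.
import random

def approximate_randomization(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided p-value for the difference in mean per-question scores."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    observed = abs(sum(scores_a) - sum(scores_b)) / len(scores_a)
    at_least_as_extreme = 0
    for _ in range(trials):
        total_a = total_b = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:  # randomly swap the paired scores
                a, b = b, a
            total_a += a
            total_b += b
        if abs(total_a - total_b) / len(scores_a) >= observed:
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (trials + 1)

# A difference is "significant at the 95% confidence level" if p < 0.05.
```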
<Section position="2" start_page="329" end_page="329" type="sub_section"> <SectionTitle> 4.2 Question-Specific Analysis </SectionTitle> <Paragraph position="0"> Our results reveal that different question types behave very differently, as shown in Figure 4. Why questions are by far the hardest (performance around 20%) because they require understanding of rhetorical structure and because answers tend to be whole clauses (often occurring as stand-alone sentences) rather than phrases embedded in a context that matches the query closely. On the other hand, who and when queries benefit from reliable person, name, and time extraction. Who questions seem to benefit most dramatically from perfect name tagging combined with filtering and pronoun resolution.</Paragraph> <Paragraph position="1"> What questions show relatively little benefit from the various linguistic techniques, probably because there are many types of what questions, most of which are not answered by a person, time, or place. Finally, where question results are quite variable, perhaps because location expressions often do not include specific place names.</Paragraph> <Paragraph position="2"> 3 The low recall is attributable to the fact that the heuristic assigned antecedents only for names and pronouns, and completely ignored definite noun phrases and plural pronouns.</Paragraph> </Section> <Section position="3" start_page="329" end_page="329" type="sub_section"> <SectionTitle> 4.3 Task Difficulty </SectionTitle> <Paragraph position="0"> These results indicate that the sample tests are an appropriate and challenging task. The simple techniques described above provide a system that finds the correct answer sentence almost 40% of the time. This is much better than chance, which would yield an average score of about 4-5% on the sentence metrics, given an average document length of 20 sentences. Simple linguistic techniques enhance the baseline system score from the low 30% range to almost 40% in all three metrics. However, capturing the remaining 60% will clearly require more sophisticated syntactic, semantic, and world knowledge sources.</Paragraph> </Section> </Section> <Section position="7" start_page="329" end_page="331" type="metho"> <SectionTitle> 5 Future Directions </SectionTitle> <Paragraph position="0"> Our pilot study has shown that reading comprehension is an appropriate task, providing a reasonable starting level: it is tractable but not trivial. Our next steps include: * Application of these techniques to a standardized multiple-choice reading comprehension test. This will require some minor changes in strategy. For example, in preliminary experiments, our system chose the answer that had the highest sentence matching score when composed with the question (a strategy sketched after this list). This gave us a score of 45% on a small multiple-choice test set. Such tests require us to deal with a wider variety of question types, e.g., What is this story about? This will also provide an opportunity to look at rejection measures, since many tests penalize for random guessing.</Paragraph> <Paragraph position="1"> * Moving from whole sentence retrieval towards answer phrase retrieval. This will allow us to improve answer word precision, which provides a good measure of how much extraneous material we are still returning.</Paragraph> <Paragraph position="2"> * Adding new linguistic knowledge sources. We need to perform further hand annotation experiments to determine the effectiveness of semantic class identification and lexical semantics.</Paragraph> <Paragraph position="3"> * Encoding more semantic information in our representation for both question and document sentences. This information could be derived from syntactic analysis, including noun chunks, verb chunks, and clause groupings.</Paragraph> <Paragraph position="4"> * Cooperation with educational testing and content providers. We hope to work together with one or more major publishers. This will provide the research community with a richer collection of training and test material, while also providing educational testing groups with novel ways of checking and benchmarking their tests.</Paragraph>
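The multiple-choice strategy mentioned in the first bullet above can be sketched as follows. This is an assumption-laden illustration, not the system's actual code: composing a choice with the question is modeled here as the union of their word bags, and the function names are hypothetical.

```python
# Illustrative sketch of the multiple-choice strategy described above:
# pick the answer choice that, composed with the question, best matches
# some sentence in the story. "Composed" is assumed here to mean the union
# of the two word bags; names are hypothetical.
def intersection_score(query_bag, sentence_bag):
    return len(query_bag & sentence_bag)

def choose_answer(question_bag, choice_bags, story_sentence_bags):
    # Return the index of the best-scoring answer choice.
    def best_match(choice_bag):
        composed = question_bag | choice_bag
        return max(intersection_score(composed, s) for s in story_sentence_bags)
    scores = [best_match(c) for c in choice_bags]
    return max(range(len(scores)), key=scores.__getitem__)
```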
</Section> </Paper>