<?xml version="1.0" standalone="yes"?> <Paper uid="P05-1026"> <Title>Experiments with Interactive Question-Answering</Title> <Section position="4" start_page="206" end_page="206" type="metho"> <SectionTitle> GENERAL BACKGROUND </SectionTitle> <Paragraph position="0"> Serving as a background to the scenarios, the following list contains subject areas that may be relevant to the scenarios under examination, and it is provided to assist the analyst in generating questions. 1) Country Profile 2) Government: Type of, Leadership, Relations 3) Military Operations: Army, Navy, Air Force, Leaders, Capabilities, Intentions 4) Allies/Partners: Coalition Forces 5) Weapons: Chemical, Biological, Materials, Stockpiles, Facilities, Access, Research Efforts, Scientists 6) Citizens: Population, Growth Rate, Education 7) Industrial: Major Industries, Exports, Power Sources 8) Economics: Gross Domestic Product, Growth Rate, Imports 9) Threat Perception: Border and Surrounding States, International, Terrorist Groups 10) Behaviour: Threats, Invasions, Sponsorship and Harboring of Bad Actors 11) Transportation Infrastructure: Kilometers of Road, Rail, Air Runways, Harbors and Ports, Rivers 12) Beliefs: Ideology, Goals, Intentions 13) Leadership: 14) Behaviour: Threats to use WMDs, Actual Usage, Sophistication of Attack, Anecdotal or Simultaneous 15) Weapons: Chemical, Biological, Materials, Stockpiles, Facilities, Access SCENARIO: Assessment of Egypt's Biological Weapons As terrorist activity in Egypt increases, the Commander of the United States Army believes a better understanding of Egypt's military capabilities is needed. Egypt's biological weapons database needs to be updated to correspond with the Commander's request. Focus your investigation on Egypt's access to old technology, assistance received from the Soviet Union for development of their pharmaceutical infrastructure, production of toxins and BW agents, stockpiles, exportation of these materials and development technology to Middle Eastern countries, and the effect that this information will have on the United States and Coalition Forces in the Middle East.</Paragraph> <Paragraph position="1"> Please incorporate any other related information to your report.</Paragraph> </Section> <Section position="5" start_page="206" end_page="208" type="metho"> <SectionTitle> 3 Modeling the Dialogue Topic </SectionTitle> <Paragraph position="0"> Our experiments in interactive Q/A were based on several scenarios that were presented to us as part of the ARDA Metrics Challenge Dialogue Workshop. Figure 2 illustrates one of these scenarios. Note that the general background consists of a list of subject areas, whereas the scenario is a narration in which several sub-topics are identified (e.g. production of toxins or exportation of materials). The creation of scenarios for interactive Q/A requires several different types of domain-specific knowledge and a level of operational expertise not available to most system developers. In addition to identifying a particular domain of interest, scenarios must specify the set of relevant actors, outcomes, and related topics that are expected to operate within the domain of interest, the salient associations that may exist between entities and events in the scenario, and the specific timeframe and location that bound the scenario in space and time.
In addition, real-world scenarios need to identify certain operational parameters, such as the identity of the scenario's sponsor (i.e. the organization sponsoring the research) and audience (i.e. the organization receiving the information), as well as a series of evidence conditions which specify how much verification a piece of information must be subjected to before it can be accepted as fact. We assume that the set of sub-topics mentioned in the general background and the scenario can be used together to define a topic structure that will govern future interactions with the Q/A system. In order to model this structure, the topic representation that we create considers separate topic signatures for each sub-topic.</Paragraph> <Paragraph position="1"> The notion of topic signatures was first introduced in (Lin and Hovy, 2000). For each sub-topic in a scenario, given (a) documents relevant to the sub-topic and (b) documents not relevant to the sub-topic, a statistical method based on the likelihood ratio is used to discover a weighted list of the most topic-specific concepts, known as the topic signature. Later work by (Harabagiu, 2004) demonstrated that topic signatures can be further enhanced by discovering the most relevant relations that exist between pairs of concepts. However, both of these types of topic representations are limited by the fact that they require the identification of topic-relevant documents prior to the discovery of the topic signatures. In our experiments, we were only presented with a set of documents relevant to a particular scenario; no further relevance information was provided for individual subject areas or sub-topics.</Paragraph>
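The likelihood-ratio weighting step can be sketched as follows, assuming the relevant/non-relevant document split is already given; the tokenization and the cutoff below are illustrative choices, not the implementation used by (Lin and Hovy, 2000) or by our system:

```python
import math
from collections import Counter

def _ll(k, n, p):
    # Binomial log-likelihood log L(k; n, p), clamped away from p = 0 or p = 1.
    p = min(max(p, 1e-12), 1 - 1e-12)
    return k * math.log(p) + (n - k) * math.log(1 - p)

def topic_signature(relevant_docs, non_relevant_docs, top_n=50):
    """Rank terms by -2 log(lambda): the likelihood-ratio statistic comparing each
    term's rate in the relevant documents against its rate in the rest."""
    rel = Counter(t for d in relevant_docs for t in d.lower().split())
    non = Counter(t for d in non_relevant_docs for t in d.lower().split())
    n1, n2 = sum(rel.values()) or 1, sum(non.values()) or 1
    scored = []
    for term, k1 in rel.items():
        k2 = non.get(term, 0)
        p_all, p1, p2 = (k1 + k2) / (n1 + n2), k1 / n1, k2 / n2
        llr = 2 * (_ll(k1, n1, p1) + _ll(k2, n2, p2)
                   - _ll(k1, n1, p_all) - _ll(k2, n2, p_all))
        scored.append((term, llr))
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_n]
```

Terms with the highest statistic form the weighted concept list that constitutes the signature.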
<Paragraph position="2"> In order to solve the problem of finding relevant documents for each subtopic, we considered four different approaches: - Approach 1: All documents in the CNS collection were initially clustered using K-Nearest Neighbor (KNN) clustering (Dudani, 1976). Each cluster that contained at least one keyword that described the sub-topic was deemed relevant to the topic.</Paragraph> <Paragraph position="4"> - Approach 2: Since individual documents may contain discourse segments pertaining to different sub-topics, we first used TextTiling (Hearst, 1994) to automatically segment all of the documents in the CNS collection into individual text tiles. These individual discourse segments then served as input to the KNN clustering algorithm described in Approach 1.</Paragraph> <Paragraph position="5"> - Approach 3: In this approach, relevant documents were discovered simultaneously with the discovery of topic signatures. First, we associated a binary seed relation r_i with each sub-topic T_i. (Seed relations were created both by hand and using the method presented in (Harabagiu, 2004).) Since seed relations are by definition relevant to a particular subtopic, they can be used to determine a binary partition of the document collection C into (1) a relevant set of documents R_i (that is, the documents relevant to relation r_i) and (2) a set of non-relevant documents C - R_i. Inspired by the method presented in (Yangarber et al., 2000), a topic signature (as calculated by (Harabagiu, 2004)) is then produced for the set of documents in R_i.</Paragraph> <Paragraph position="6"> For each subtopic T_i defined as part of the dialogue scenario, the documents relevant to a corresponding seed relation r_i are added to R_i iff the relation r_i meets the density criterion defined in (Yangarber et al., 2000) over H, the set of documents where r_i is recognized. When H is added to R_i, a new topic signature is calculated for R_i. Relations extracted from the new topic signature can then be used to determine a new document partition by re-iterating the discovery of the topic signature and of the documents relevant to each subtopic.</Paragraph> <Paragraph position="9"> - Approach 4: Approach 4 implements the technique described in Approach 3, but operates at the level of discourse segments (or text tiles) rather than at the level of full documents. As with Approach 2, segments were produced using the TextTiling algorithm.</Paragraph> <Paragraph position="10"> In modeling the dialogue scenarios, we considered three types of topic-relevant relations: (1) structural relations, which represent hypernymy or meronymy relations between topic-relevant concepts, (2) definition relations, which uncover the characteristic properties of a concept, and (3) extraction relations, which model the most relevant events or states associated with a sub-topic. Although structural relations and definition relations are discovered reliably using patterns available from our Q/A system (Harabagiu et al., 2003), we found only extraction relations to be useful in determining the set of documents relevant to a subtopic. Structural relations were available from concept ontologies implemented in the Q/A system. The definition relations were identified by patterns used for processing definition questions.</Paragraph> <Paragraph position="11"> Extraction relations are discovered by processing documents in order to identify three types of relations: (1) syntactic attachment relations (including subject-verb, object-verb, and verb-PP relations), (2) predicate-argument relations, and (3) salience-based relations that can be used to encode long-distance dependencies between topic-relevant concepts. (Salience-based relations are discovered using a technique first reported in (Harabagiu, 2004) which approximates a Centering Theory-style approach (Kameyama, 1997) to the resolution of coreference.)</Paragraph> <Paragraph position="12"> We made the extraction relations associated with each topic signature more general (a) by replacing words with their (morphological) root form (e.g. wounded with wound, weapons with weapon), (b) by replacing lexemes with their subsuming category from an ontology of 100,000 words (e.g. truck is replaced by VEHICLE, ARTIFACT, or OBJECT), and (c) by replacing each name with its name class (e.g. Egypt with COUNTRY). Figure 3 illustrates the resulting topic signature for the sub-topic Egypt's production of toxins and BW agents from the scenario illustrated in Figure 2.</Paragraph>
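A rough illustration of generalization steps (a)-(c) is sketched below; the lemmatizer, ontology lookup, and name classifier are hypothetical stand-ins for the system's actual resources:

```python
# Hypothetical resources standing in for the system's lemmatizer, 100,000-word
# ontology, and named-entity classifier.
LEMMAS = {"wounded": "wound", "weapons": "weapon"}
ONTOLOGY = {"truck": "VEHICLE"}          # could equally map to ARTIFACT or OBJECT
NAME_CLASSES = {"Egypt": "COUNTRY"}

def generalize_term(term: str) -> str:
    """Apply steps (a)-(c): name class first, then ontology category, then root form."""
    if term in NAME_CLASSES:                             # (c) names -> name classes
        return NAME_CLASSES[term]
    if term.lower() in ONTOLOGY:                         # (b) lexemes -> subsuming category
        return ONTOLOGY[term.lower()]
    return LEMMAS.get(term.lower(), term.lower())        # (a) morphological root

def generalize_relation(relation):
    """A relation is modeled here as a (predicate, argument) pair of strings."""
    pred, arg = relation
    return generalize_term(pred), generalize_term(arg)

print(generalize_relation(("produces", "weapons")))   # ('produces', 'weapon')
print(generalize_relation(("exported", "Egypt")))     # ('exported', 'COUNTRY')
```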
<Paragraph position="14"> Once extraction relations were obtained for a particular set of documents, the resulting set of relations was ranked according to a method proposed in (Yangarber, 2003). Under this approach, the score associated with each relation r is computed from |H|, the cardinality of the set of documents in which the relation is identified, and from Sup(r), the support associated with the relation r. Sup(r) is defined as the sum of the relevance of each document in H: Sup(r) = sum of Rel(d) over all documents d in H. The relevance Rel(d) of a document that contains a topic-significant relation is defined relative to the subtopic under consideration and takes into account every other subtopic T_j.</Paragraph> <Paragraph position="17"> We use a different learner for each subtopic in order to train simultaneously on each iteration. (Initially, the topic signature contains only the seed relation; additional relations can be added with each iteration, and the calculation continues to iterate until there are no more relations that can be added to the overall topic signature.) When the precision of a relation with respect to a subtopic T_i is computed, it takes into account the negative evidence of its relevance to any other subtopic T_j. If Prec_i(r) falls below a threshold, the relation is not included in the topic signature; the remaining relations are ranked by their score Score_i(r).</Paragraph> <Paragraph position="19"> Representing topics in terms of relevant concepts and relations is important for the processing of questions asked within the context of a given topic. For interactive Q/A, however, the ideal topic-structured representation would be in the form of question-answer pairs (QUABs) that model the individual segments of the scenario. We have currently created two sets of QUABs: a handcrafted set and an automatically-generated set. For the manually-created set of QUABs, 4 linguists manually generated a total of 3210 question-answer pairs for the 8 dialogue scenarios considered in our experiments.</Paragraph> <Paragraph position="20"> In a separate effort, we devised a process for automatically populating the QUAB for each scenario.</Paragraph> <Paragraph position="21"> In order to generate question-answer pairs for each subtopic, we first identified relevant text passages in the document collection to serve as &quot;answers&quot; and then generated individual questions that could be answered by each answer passage.</Paragraph> <Paragraph position="23"> - Answer Identification: We defined an answer passage as a contiguous sequence of sentences with a positive answer rank and a passage price of at most 4. To select answer passages for each sub-topic T_i, we calculate an answer rank, rank(A), that sums the scores of each relation from the topic signature identified within the same text window. Initially, the text window is set to one sentence. (If the sentence is part of a quote, however, the text window is immediately expanded to encompass the entire sentence that contains the quote.)
Each passage with rank(A) > 0 is then considered to be a candidate answer passage.</Paragraph> <Paragraph position="26"> The text window of each candidate answer passage is then expanded to include the following sentence. If the answer rank does not increase with the addition of the succeeding sentence, then the price p of the candidate answer passage is incremented by 1; otherwise, it is decremented by 1. The text window of each candidate answer passage continues to expand until p reaches 4. Before the ranked list of candidate answers can be considered by the Question Generation module, answer passages with a positive price p are stripped of their last p sentences.</Paragraph>
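To make this selection procedure concrete, here is a minimal sketch of the rank-and-price window expansion; the relation matcher and data layout are simplified stand-ins rather than the system's implementation:

```python
def matches(relation, text):
    # Stand-in relation matcher: a relation is modeled as a (predicate, argument)
    # pair of generalized terms; here we only check that both occur in the window.
    pred, arg = relation
    return pred in text and arg in text

def passage_rank(sentences, signature):
    """Sum the topic-signature scores of the relations matched in the window."""
    window_text = " ".join(sentences).lower()
    return sum(score for rel, score in signature.items() if matches(rel, window_text))

def select_answer_passage(sentences, start, signature, max_price=4):
    """Grow a candidate passage from sentence `start`, tracking the price p:
    p increases when an added sentence does not improve the rank and decreases
    when it does; expansion stops when p reaches max_price (or the text ends),
    and a passage with positive final price is stripped of its last p sentences."""
    window = [sentences[start]]
    rank = passage_rank(window, signature)
    if rank <= 0:
        return None                      # only positive-rank passages are candidates
    price, i = 0, start + 1
    while price < max_price and i < len(sentences):
        new_rank = passage_rank(window + [sentences[i]], signature)
        price += 1 if new_rank <= rank else -1
        window.append(sentences[i])
        rank = new_rank
        i += 1
    return window[:-price] if price > 0 else window
```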
</Section> <Section position="6" start_page="208" end_page="209" type="metho"> <SectionTitle> ANSWER </SectionTitle> <Paragraph position="0"> In the early 1970s, Egyptian President Anwar Sadat validates that Egypt has a BW stockpile.</Paragraph> <Paragraph position="1"> - Question Generation: In order to automatically generate questions from answer passages, we considered the following two problems: - Problem 1: Every word in an answer passage can refer to an entity, a relation, or an event. In order for question generation to be successful, we must determine whether a particular reference is &quot;interesting&quot; enough to the scenario that it deserves to be mentioned in a topic-relevant question. For example, Figure 4 illustrates an answer that includes two predicates and four entities. In this case, four types of reference are used to associate these linguistic objects with other related objects: (a) definitional reference, used to link entity (E1) &quot;Anwar Sadat&quot; to a corresponding attribute &quot;Egyptian President&quot;, (b) metonymic reference, since (E1) can be coerced into (E2), (c) part-whole reference, since &quot;BW stockpiles&quot; (E4) necessarily imply the existence of a &quot;BW program&quot; (E5), and (d) relational reference, since validating is subsumed as part of the meaning of declaring (as determined by WordNet glosses), while admitting can be defined in terms of declaring, as in declaring [to be true].</Paragraph> </Section> <Section position="7" start_page="209" end_page="211" type="metho"> <SectionTitle> ANSWER </SectionTitle> <Paragraph position="0"> - Problem 2: We have found that the identification of the association between a candidate answer and a question depends on (a) the recognition of predicates and entities, based on the output of both a named entity recognizer and a semantic parser (Surdeanu et al., 2003), and their structuring into predicate-argument frames, (b) the resolution of reference (addressed in Problem 1), and (c) the recognition of implicit relations between the predications stated in the answer.</Paragraph> <Paragraph position="1"> Some of these implicit relations are referential, as is the relation between the two predicates illustrated in Figure 4. A special case of implicit relations are the causal relations. Figure 5 illustrates an answer in which a causal relation exists and is marked by the cue phrase because. Predicates like those in Figure 5 can be phrasal or negative. Causality is established between two of these predicates, as they are the ones that ultimately determine the selection of the answer. One of these predicates can be substituted by its nominalization; since the argument of the original predicate is BW, the same argument is transferred to the nominalization. The causality implied by the answer from Figure 5 has two components: (1) the effect (i.e. the nominalized predicate) and (2) the result, which eliminates the semantic effect of the negative polarity item never by implying the predicate obstacle. The questions that are generated are based on question patterns associated with causal relations and therefore allow different degrees of specificity for the resultative, i.e. obstacle or deterrent.</Paragraph> <Paragraph position="4"> We generated several questions for each answer passage. Questions were generated based on patterns that were acquired to model interrogations using relations between predicates and their arguments. Such interrogations are based on (1) associations between the answer type (e.g. DATE) and the question stem (e.g. &quot;when&quot;) and (2) the relation between the predicates, the question stem, and the words that determine the answer type (Narayanan and Harabagiu, 2004). In order to obtain these predicate-argument patterns, we used 30% (approximately 1500 questions) of the handcrafted question-answer pairs, selected at random from each of the 8 dialogue scenarios. As Figures 4 and 5 illustrate, we used patterns based on (a) embedded predicates and (b) causal or counterfactual predicates.</Paragraph> <Paragraph position="5"> 4 Managing Interactive Q/A Dialogues As illustrated in Figure 1, the main idea behind managing dialogues in which interactions with the Q/A system occur is the notion of predictions, i.e. proposing to the user a small set of questions that tackle the same subject as her question (as illustrated in Table 1). The advantage is that the user can follow up with one of the pre-processed questions, which has a correct answer and resides in one of the QUABs. This enhances the effectiveness of the dialogue. It may also improve its efficiency, i.e. reduce the number of questions that need to be asked, provided the QUABs have good coverage of the subject areas of the scenario.</Paragraph> <Paragraph position="6"> Moreover, complex questions, which generally are not processed with high accuracy by current state-of-the-art Q/A systems, are associated with predictive questions that represent decompositions based on similarities between the predicates and arguments of the original question and those of the predicted questions.</Paragraph> <Paragraph position="7"> The selection of the questions from the QUABs that are proposed for each user question is based on a similarity metric that ranks the QUAB questions.</Paragraph>
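In outline, this proposal step works as sketched below; the data structures and the trivial word-overlap metric are placeholders for illustration only, not FERRET's actual components:

```python
from typing import Callable, List, Tuple

# A QUAB entry pairs a pre-processed question with its vetted answer.
QuabEntry = Tuple[str, str]

def propose_continuations(user_question: str,
                          quab: List[QuabEntry],
                          similarity: Callable[[str, str], float],
                          top_k: int = 10) -> List[QuabEntry]:
    """Rank QUAB question-answer pairs against the user question and return the
    top-k as predicted follow-up questions (shown alongside the Q/A system's
    own answer to the user question)."""
    ranked = sorted(quab,
                    key=lambda entry: similarity(user_question, entry[0]),
                    reverse=True)
    return ranked[:top_k]

# Example with a trivial word-overlap similarity standing in for Metrics 1-7.
def overlap(q1: str, q2: str) -> float:
    t1, t2 = set(q1.lower().split()), set(q2.lower().split())
    return len(t1 & t2) / max(len(t1 | t2), 1)

quab = [("Where are Iran's stockpiles of CW?", "..."),
        ("What CW does Iran produce?", "...")]
print(propose_continuations("Where are Iran's CW facilities located?", quab, overlap, top_k=1))
```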
<Paragraph position="8"> To compute the similarity metric, we have experimented with seven different metrics. The first four metrics were introduced in (Lytinen and Tomuro, 2002).</Paragraph> <Paragraph position="9"> - Similarity Metric 1 assigns each term t_i a weight based on N, the number of questions in the QUAB, n_i, the number of questions containing t_i, and tf_i, the number of times t_i appears in the question. This allows the user question and any QUAB question to be transformed into two weighted term vectors; the term vector similarity is then used to compute the similarity between the user question and any question from the QUAB.</Paragraph> <Paragraph position="10"> - Similarity Metric 2 counts the user question terms that appear in the QUAB question. It is obtained by finding the intersection of the terms in the term vectors of the two questions.</Paragraph> <Paragraph position="11"> - Similarity Metric 3 is based on semantic information available from WordNet. It involves finding the minimum path between WordNet concepts: given two terms t_1 and t_2, each with n and m WordNet senses, the path is computed between pairs of senses and the minimum is retained.</Paragraph> <Paragraph position="12"> - Similarity Metric 4 is based on question type similarity. Instead of using the question class, determined by its stem, we used the answer type expected by the question for matching whenever we could recognize it. As a back-off only, we used a question type similarity based on a matrix akin to the one reported in (Lytinen and Tomuro, 2002).</Paragraph> <Paragraph position="13"> - Similarity Metric 5 is based on question concepts rather than question terms. In order to translate question terms into concepts, we replaced (a) question stems (i.e. a WH-word + NP construction) with expected answer types (taken from the answer type hierarchy employed by FERRET's Q/A system) and (b) named entities with their corresponding classes. Remaining nouns and verbs were replaced with their WordNet semantic classes. Each concept was then associated with a weight: concepts derived from named entity classes were weighted heavier than concepts from answer types, which were in turn weighted heavier than concepts taken from WordNet classes. Similarity was then computed across &quot;matching&quot; concepts; the resultant similarity score was based on three variables.</Paragraph> <Paragraph position="14"> - Similarity Metric 6 clusters the QUAB questions based on their mapping to a vector of important concepts in the QUAB. The clustering was done using the K-Nearest Neighbor (KNN) method (Dudani, 1976). Instead of measuring the similarity between the user question and each question in the QUAB, similarities are computed only between the user question and the centroid of each cluster.</Paragraph> <Paragraph position="15"> - Similarity Metric 7 was derived from the results of Similarity Metrics 5 and 6 above. In this case, if the QUAB question that was deemed to be most similar to a user question under Similarity Metric 5 is contained in the cluster of QUAB questions deemed to be most similar to that user question under Similarity Metric 6, then the QUAB question receives a cluster adjustment score in order to boost its ranking within its QUAB cluster. We calculate the cluster adjustment score as a function of the difference in rank between the centroid of the cluster and the previous rank of the QUAB question.</Paragraph> <Paragraph position="20"> In the currently-implemented version of FERRET, we used Similarity Metric 5 to automatically identify the set of 10 QUAB questions that were most similar to a user's question. These question-and-answer pairs were then returned to the user - along with answers from FERRET's automatic Q/A system - as potential continuations of the Q/A dialogue. We used the remaining 6 similarity metrics described in this section to manually assess the impact of similarity on a Q/A dialogue.</Paragraph>
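A rough sketch of the concept-based matching behind Similarity Metric 5 follows; the lexicons, class names, and weights are illustrative stand-ins rather than FERRET's actual resources, and the score shown is a simple weighted overlap rather than the system's three-variable score:

```python
# Stand-in lexicons for answer types, name classes, and WordNet-style classes.
ANSWER_TYPES = {"where": "LOCATION", "when": "DATE", "who": "PERSON", "what": "THING"}
NAME_CLASSES = {"iran": "COUNTRY", "iran's": "COUNTRY", "egypt": "COUNTRY"}
WORDNET_CLASSES = {"facilities": "ARTIFACT", "stockpiles": "ARTIFACT",
                   "produce": "CREATE", "located": "BE"}
WEIGHTS = {"name": 3.0, "answer_type": 2.0, "wordnet": 1.0}   # assumed ordering only

def question_concepts(question):
    """Map question terms to weighted concepts: answer type for the stem,
    name classes for entities, WordNet-style classes for remaining nouns/verbs."""
    concepts = {}
    for i, tok in enumerate(question.lower().rstrip("?").split()):
        if i == 0 and tok in ANSWER_TYPES:
            concepts[ANSWER_TYPES[tok]] = WEIGHTS["answer_type"]
        elif tok in NAME_CLASSES:
            concepts[NAME_CLASSES[tok]] = WEIGHTS["name"]
        elif tok in WORDNET_CLASSES:
            concepts[WORDNET_CLASSES[tok]] = WEIGHTS["wordnet"]
    return concepts

def concept_similarity(q1, q2):
    """Weighted overlap of the two questions' concept sets."""
    c1, c2 = question_concepts(q1), question_concepts(q2)
    shared = set(c1) & set(c2)
    total = sum(c1.values()) + sum(c2.values())
    return 2 * sum(min(c1[c], c2[c]) for c in shared) / total if total else 0.0

print(concept_similarity("Where are Iran's CW facilities located?",
                         "Where are Iran's stockpiles of CW?"))   # ~0.92
```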
</Section> <Section position="8" start_page="211" end_page="213" type="metho"> <SectionTitle> 5 Experiments with Interactive Q/A Dialogues </SectionTitle> <Paragraph position="0"> To date, we have used FERRET to produce over 90 Q/A dialogues with human users. Figure 6 illustrates three turns from a real dialogue with a human user investigating Iran's chemical weapons program. [Figure 6 excerpt: Q1: Does Iran have an indigenous CW program? Suggested QUAB questions: (1b) Has the plant at Qazvin been linked to CW production? (1c) What CW does Iran produce? (1a) How did Iran start its CW program? Q2: Where are Iran's CW facilities located? Suggested QUAB questions: (2a) What factories in Iran could produce CW? (2b) Where are Iran's stockpiles of CW? (2c) Where has Iran bought equipment to produce CW? Q3: What is Iran's goal for its CW program? Suggested QUAB questions: (3a) What motivated Iran to expand its chemical weapons program? (3b) How do CW figure into Iran's long-term strategic plan? (3c) What are Iran's future CW plans? Answers: (A1) Although Iran is making a concerted effort to attain an independent production capability for all aspects of chemical weapons program, it remains dependent on foreign sources for chemical warfare-related technologies. (A2) According to several sources, Iran's primary suspected chemical weapons production facility is located in the city of Damghan. (A3) In their pursuit of regional hegemony, Iran and Iraq probably regard CW weapons and missiles as necessary to support their political and military objectives. Possession of chemical weapons would likely lead to increased intimidation of their Gulf neighbors, as well as increased willingness to confront the United States.] As can be seen, coherence can be established between the user's questions and the system's answers (e.g. Q3 is related to both A1 and A3) as well as between the QUABs and the user's follow-up questions (e.g. QUAB (1b) is more related to Q2 than either Q1 or A1). Coherence alone is not sufficient to analyze the quality of interactions, however.</Paragraph> <Paragraph position="3"> In order to better understand interactive Q/A dialogues, we have conducted three sets of experiments with human users of FERRET. In these experiments, users were allotted two hours to interact with FERRET to gather information requested by a dialogue scenario similar to the one presented in Figure 2. In Experiment 1 (E1), 8 U.S. Navy Reserve (USNR) intelligence analysts used FERRET to research 8 different scenarios related to chemical and biological weapons. Experiment 2 and Experiment 3 considered several of the same scenarios addressed in E1: E2 included 24 mixed teams of analysts and novice users working with 2 scenarios, while E3 featured 4 USNR analysts working with 6 of the original 8 scenarios. (Details for each experiment are provided in Table 2.) Users were also given a task to focus their research; in E1 and E3, users prepared a short report detailing their findings; in E2, users were given a list of &quot;challenge&quot; questions to answer.</Paragraph> <Paragraph position="4"> In E1 and E2, users had access to a total of 3210 QUAB questions that had been hand-created by developers for the 8 dialogue scenarios. (Table 3 provides totals for each scenario.)
In E3, users performed research with a version of FERRET that included no QUABs at all.</Paragraph> <Paragraph position="5"> We have evaluated FERRET by measuring efficiency, effectiveness, and user satisfaction. Efficiency: FERRET's QUAB collection enabled users in our experiments to find more relevant information by asking fewer questions. When manually-created QUABs were available (E1 and E2), users submitted an average of 12.25 questions each session. When no QUABs were available (E3), users entered an average of 44.5 questions per session. Table 4 lists the number of QUAB question-answer pairs selected by users and the number of user questions entered during the 8 scenarios considered in E1. In E2, freed from the task of writing a research report, users asked significantly (p < 0.05) fewer questions and selected fewer QUABs than they did in E1. (See Table 5.)</Paragraph> <Paragraph position="6"> Effectiveness: QUAB question-answer pairs also improved the overall accuracy of the answers returned by FERRET. To measure the effectiveness of a Q/A dialogue, human annotators performed a post-hoc analysis of how relevant the QUAB pairs returned by FERRET were to each question entered by a user: each QUAB pair returned was graded as &quot;relevant&quot; or &quot;irrelevant&quot; to a user question in a forced-choice task. Aggregate relevance scores were used to calculate (1) the percentage of relevant QUAB pairs returned and (2) the mean reciprocal rank (MRR) for each user question. MRR is defined as (1/N) multiplied by the sum over the N user queries of 1/r_i, where r_i is the lowest rank of any relevant answer for the i-th user query. (We chose MRR as our scoring metric because it reflects the fact that a user is most likely to examine the first few answers from any system, but that all correct answers returned by the system have some value, because users will sometimes examine a very large list of query results.) Table 6 describes the performance of FERRET when each of the 7 similarity measures presented in Section 4 is used to return QUAB pairs in response to a query.</Paragraph> <Paragraph position="7"> When only answers from FERRET's automatic Q/A system were available to users, only 15.7% of system responses were deemed relevant to a user's query. In contrast, when manually-generated QUAB pairs were introduced, as many as 84% of the system's responses were deemed relevant. The results listed in Table 6 show that the best metric is Similarity Metric 5. These results suggest that the selection of relevant questions depends on sophisticated similarity measures that rely on conceptual hierarchies and semantic recognizers.</Paragraph> <Paragraph position="8"> We evaluated the quality of each of the four sets of automatically-generated QUABs in a similar fashion. For each question submitted by a user in E1, E2, and E3, we collected the top 5 QUAB question-answer pairs (as determined by Similarity Metric 5) that FERRET returned. As with the manually-generated QUABs, the automatically-generated pairs were submitted to human assessors, who annotated each as &quot;relevant&quot; or &quot;irrelevant&quot; to the user's query. Aggregate scores are presented in Table 7.</Paragraph>
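For concreteness, a small sketch of the MRR computation used in this evaluation, assuming one ranked list of relevance judgments per user question:

```python
def mean_reciprocal_rank(judgments):
    """judgments: one list per user question, where each inner list holds booleans
    marking whether the response at that rank was judged relevant.
    MRR = (1/N) * sum_i 1/r_i, with r_i the rank of the first relevant response
    (a question with no relevant response contributes 0)."""
    total = 0.0
    for ranked in judgments:
        for rank, relevant in enumerate(ranked, start=1):
            if relevant:
                total += 1.0 / rank
                break
    return total / len(judgments) if judgments else 0.0

# Two questions: first relevant response at rank 2 and at rank 1, respectively.
print(mean_reciprocal_rank([[False, True, False], [True, False]]))  # 0.75
```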
<Section position="1" start_page="213" end_page="213" type="sub_section"> <SectionTitle> User Satisfaction </SectionTitle> <Paragraph position="0"> Users were consistently satisfied with their interactions with FERRET. In all three experiments, respondents reported that FERRET (1) gave meaningful answers, (2) provided useful suggestions, (3) helped answer specific questions, and (4) promoted their general understanding of the issues considered in the scenario. Complete results of this study are presented in Table 8.</Paragraph> </Section> </Section> </Paper>