<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1809">
<Title>Incorporating User Models in Question Answering to Improve Readability</Title>
<Section position="6" start_page="0" end_page="0" type="evalu"> <SectionTitle> 4 Results </SectionTitle>
<Paragraph position="0"> We report the results of running our system on a range of queries, which include factoid/simple, complex and controversial questions.9
9 Notice that this partition is not to be interpreted as a methodological division, as we currently approach complex and controversial answers in the same way.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Simple answer </SectionTitle>
<Paragraph position="0"> As an example of a simple query, we present the results for &quot;Who painted the Sistine Chapel?&quot;, for which the system returned the following passages:
--UMgood: &quot;Sistine Chapel (sis-teen). A chapel adjoining Saint Peter's Basilica, noted for the frescoes of biblical subject painted by Michelangelo on its walls and ceilings.&quot;
--UMmed: &quot;In all Michelangelo painted more than 300 different figures on the Sistine Chapel ceiling.&quot;
--UMpoor: &quot;My name is Jacopo L'Indaco and I was an assistant to Michelangelo when he painted the Sistine Chapel.&quot;
To obtain the above answers, the system was run three times with different values for the reading level parameter in the UM, as defined in 3.2.3. As we can see, the correct information is present in all cases, although not always explicitly, as in the first two passages. This is because our current semantic similarity metric only operates at word level. In this example, all sentences containing &quot;painted&quot;, &quot;Sistine&quot; and &quot;Chapel&quot; obtain a distance of 0 to the query, regardless of their formulation. Also notice how the difference in language complexity is clearly discernible across the answers.</Paragraph> </Section>
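To make the word-level behaviour described in 4.1 concrete, the following is a minimal sketch of a distance that is 0 whenever every content word of the query appears in a candidate sentence; it is illustrative only, and the tokenisation and stopword list are assumptions rather than the authors' implementation.

# Minimal sketch of a word-level distance of the kind described in 4.1:
# a candidate sentence containing every content word of the query gets
# distance 0, however it is phrased. Illustrative only; tokenisation
# and the stopword list are assumptions, not the authors' code.

STOPWORDS = {"who", "the", "of", "a", "an", "did", "was", "is", "in", "on"}

def content_words(text):
    tokens = (w.strip("?.,!;:").lower() for w in text.split())
    return {t for t in tokens if t and t not in STOPWORDS}

def word_level_distance(query, sentence):
    # number of query content words missing from the sentence
    return len(content_words(query) - content_words(sentence))

query = "Who painted the Sistine Chapel?"
passage = "In all Michelangelo painted more than 300 different figures on the Sistine Chapel ceiling."
print(word_level_distance(query, passage))  # -> 0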
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Complex answer </SectionTitle>
<Paragraph position="0"> We illustrate the results of our system with the query &quot;Definition of metaphor&quot;, which relates to a difficult concept.</Paragraph>
<Paragraph position="1"> Our top results, highlighted in boldface within their containing passages, are:
--UMgood: &quot;Definition of Metaphor. Metaphor is a process by which we conceive &quot;one thing in terms of another, and its primary function is understanding&quot; (Lakoff and Johnson, Metaphors We Live By 36). That is, we often do (and must) talk about things that do not physically exist, that we cannot reach out and touch or take apart with our hands to see what is inside and how it works: love, time, anger, life. [. . . ] We make them tangible by, in a process of thought, applying models of physical things to them: love is falling; argument is war.&quot;
--UMmed: &quot;Definition: A metaphor is a figure of speech that compares unlike things by saying that one thing is the other. Often metaphors are simple comparisons, but they can be extended so that different aspects of the things compared are treated separately.&quot;
--UMpoor: we did not find any document exactly matching the reading level, so we returned the results for UMmed (see above).</Paragraph>
<Paragraph position="2"> As before, there is a clear difference in the language complexity of the two answers.</Paragraph> </Section>
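The UMpoor case above also shows the fallback behaviour of the system: when no retrieved passage matches the requested reading level, results from the closest available level are returned instead. A minimal sketch of such a selection step follows; the level ordering and data layout are assumptions, not the actual YourQA code.

# Illustrative sketch of reading-level selection with a fallback, as in
# the UMpoor case above. The level ordering and data layout are
# assumptions, not the actual YourQA code.

LEVELS = ["poor", "med", "good"]  # assumed ordering from lowest to highest

def select_by_reading_level(passages, requested):
    """passages: list of (estimated_level, text) pairs."""
    by_level = {}
    for level, text in passages:
        by_level.setdefault(level, []).append(text)
    if requested in by_level:
        return by_level[requested]
    # No exact match: fall back to the nearest level that has results.
    idx = LEVELS.index(requested)
    nearest = min(by_level, key=lambda l: abs(LEVELS.index(l) - idx), default=None)
    return by_level[nearest] if nearest is not None else []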
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Controversial answer </SectionTitle>
<Paragraph position="0"> We illustrate this case using the query &quot;Who wrote the Iliad?&quot;, which is a subject of debate: the existence of a single author, as well as the historicity of the poet Homer, are under discussion. These are among our top results:
--UMgood: &quot;Most Classicists would agree that, whether there was ever such a composer as &quot;Homer&quot; or not, the Homeric poems are the product of an oral tradition, a generations-old technique that was the collective inheritance of many singer-poets, aoidoi. [. . . ] Could the Iliad and Odyssey have been oral-formulaic poems, composed on the spot by the poet using a collection of memorized traditional verses and phases?&quot;
--UMmed: &quot;No reliable ancient evidence for Homer earliest traditions involve conjecture (e.g. conflicting claims to be his place of origin) and legend (e.g. Homer as son of river-god). General ancient assumption that same poet wrote Iliad and Odyssey (and possibly other poems) questioned by many modern scholars: differences explained biographically in ancient world (e g wrote Od. in old age); but similarities could be due to imitation.&quot;
--UMpoor: &quot;Homer wrote The Iliad and The Odyssey (at least, supposedly a blind bard named &quot;Homer&quot; did).&quot;
In this case the problem of the attribution of the Iliad is made clearly visible: in all three results, the document passages provide a context which helps to explain the controversy at different levels of difficulty.</Paragraph> </Section> </Section>
<Section position="7" start_page="0" end_page="0" type="evalu"> <SectionTitle> 5 Evaluation </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Methodology </SectionTitle>
<Paragraph position="0"> Our system is not a QA system in the strict sense, as it does not single out one correct answer phrase. Its key objective is to improve user satisfaction with its adaptive results, which should be better suited to the user's reading level. A user-centred evaluation methodology that assesses how the system meets individual information needs is therefore more appropriate for YourQA than TREC-QA metrics.</Paragraph>
<Paragraph position="1"> We draw our evaluation guidelines from (Su, 2003), which proposes a comprehensive search engine evaluation model. We define the following metrics (see Table 1):
1. Relevance: * strict precision (P1): the ratio between the number of results rated as relevant and all the returned results; * loose precision (P2): the ratio between the number of results rated as relevant or partially relevant and all the returned results.</Paragraph>
<Paragraph position="2"> 2. User satisfaction: a 7-point Likert scale10 is used to assess satisfaction with: * loose precision of results (S1); * query success (S2).
10 This measure, ranging from 1 = &quot;extremely unsatisfactory&quot; to 7 = &quot;extremely satisfactory&quot;, is particularly suitable to assess the degree to which the system meets the user's search needs. It was reported in (Su, 1991) as the best single measure for information retrieval among 20 tested.</Paragraph>
<Paragraph position="3"> 3. Reading level accuracy (Ar). This metric was not present in (Su, 2003) and has been introduced to assess the reading level estimation. Given the set R of results returned by the system for a reading level r, it is the ratio between the number of documents ∈ R rated by the users as suitable for r, and |R|. We compute Ar for each reading level.</Paragraph>
<Paragraph position="4"> 4. Overall utility (U): the search session as a whole is assessed via a 7-point Likert scale.
We have discarded those metrics proposed by (Su, 2003) that are linked to technical aspects of search engines (e.g. connectivity) or to response time, which has not been considered an issue at the present stage. We also exclude metrics relating to the user interface, which are not relevant for this study.</Paragraph> </Section>
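For concreteness, the relevance and reading-level accuracy metrics defined above can be computed from per-result judgements as in the following sketch; the rating labels and data layout are assumptions for illustration, not the evaluation scripts actually used.

# Minimal sketch of the metrics defined in 5.1. The rating labels
# ("relevant", "partial", "irrelevant") and the data layout are
# assumptions for illustration, not the evaluation scripts actually used.

def strict_precision(ratings):   # P1
    return sum(r == "relevant" for r in ratings) / len(ratings)

def loose_precision(ratings):    # P2
    return sum(r in ("relevant", "partial") for r in ratings) / len(ratings)

def reading_level_accuracy(judgements):  # Ar for one reading level r
    """judgements: one boolean per result in R, True if the result was
    rated by the users as suitable for the reading level r."""
    return sum(judgements) / len(judgements)

ratings = ["relevant", "partial", "irrelevant", "relevant"]
print(strict_precision(ratings))   # -> 0.5
print(loose_precision(ratings))    # -> 0.75
print(reading_level_accuracy([True, True, False, True]))  # -> 0.75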
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Evaluation results </SectionTitle>
<Paragraph position="0"> We performed our evaluation by running 24 queries (partly reported in Table 3) on both Google and YourQA.11 The results - i.e. snippets from the Google result page and passages returned by YourQA - were given to 20 evaluators. These were aged between 16 and 52, all having a self-assessed good or medium English reading level.
11 To make the two systems more comparable, we turned off query expansion and only submitted the original question sentence.</Paragraph>
<Paragraph position="1"> They came from various backgrounds (University students/graduates, professionals, high school) and mother-tongues. Evaluators filled in a questionnaire assessing the relevance of each passage, the success and result readability of the single queries, and the overall utility of the system; values were thus computed for the metrics in Table 1.</Paragraph>
<Paragraph position="2"> The precision results (see Table 2) for the whole search session were computed by averaging the values obtained for the 20 queries. Although quite close, they show a 10-15% difference in favour of the YourQA system for both strict precision (P1) and loose precision (P2). This suggests that the coarse semantic processing applied and the visualisation of the context contribute to producing more relevant passages.</Paragraph>
<Paragraph position="3"> After each query, we asked evaluators the following questions: &quot;How would you rate the ratio of relevant/partly relevant results returned?&quot; (assessing S1) and &quot;How would you rate the success of this search?&quot; (assessing S2). Table 2 shows a higher level of satisfaction attributed to the YourQA system in both cases.</Paragraph>
<Paragraph position="4"> 5.2.3 Reading level accuracy Adaptivity to the users' reading level is the distinguishing feature of the YourQA system: we were thus particularly interested in its performance in this respect. Table 3 shows that, altogether, evaluators found our results appropriate for the reading levels to which they were assigned.</Paragraph>
<Paragraph position="5"> The accuracy tended to decrease (from 94% to 72%) with the level: this was predictable, as it is more constraining to conform to a lower reading level than to a higher one. However, this also suggests that our estimation of document difficulty was perhaps too &quot;optimistic&quot;: we are currently working with better quality training data, which allows us to obtain more accurate language models.</Paragraph>
<Paragraph position="6"> Table 3 (excerpt):
Query | Ag | Am | Ap
Who painted the Sistine Chapel? | 0.85 | 0.72 | 0.79
Who was the first American in space? | 0.94 | 0.80 | 0.72
Who was Achilles' best friend? | 1.00 | 0.98 | 0.79
When did the Romans invade Britain? | 0.87 | 0.74 | 0.82
At the end of the whole search session, users answered the question: &quot;Overall, how was this search session?&quot; relating to their search experience with Google and the YourQA system. The values obtained for U in Table 2 show a clear preference (a difference of ≃ 1 on the 7-point scale) of the users for YourQA, which is very positive considering that it represents their general judgement on the system.</Paragraph> </Section> </Section> </Paper>