<?xml version="1.0" standalone="yes"?>
<Paper uid="W01-0905">
  <Title>Two levels of evaluation in a complex NL system</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 System architecture
</SectionTitle>
    <Paragraph position="0"> Figure 1 shows the architecture of the QALC system, made of five separate modules: Question analysis, Search engine, Re-indexing and selection of documents, Named entity recognition, and Question/sentence pairing.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Question analysis
</SectionTitle>
      <Paragraph position="0"> Question analysis is performed in order to assign features to questions and use these features for the matching measurement between a question and potential answer sentences. It relies on a shallow parser which spots discriminating patterns and assigns categories to a question.</Paragraph>
      <Paragraph position="1"> The categories correspond to the types of named entities that are likely to constitute the answer to this question. Named entities receive one of the following types: person, organisation, location (city or place), number (a time expression or a number expression). For example the pattern how far yields to the answer type length: Question: How far away is the moon? Answer type: LENGTH Answer within the document : With a &lt;b_numex_TYPE=&amp;quot;NUMBER&amp;quot;&gt; 28 &lt;e_numex&gt; -power telescope you can see it on the moon &lt;b_numex_TYPE=&amp;quot;LENGTH&amp;quot;&gt; 250,000 miles &lt;e_numex&gt; away.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Selection of relevant documents
</SectionTitle>
      <Paragraph position="0"> The second module is a classic search engine, giving, for each question, a ranked list of documents, each of which could contain the answer.</Paragraph>
      <Paragraph position="1"> This set of documents is then processed by a third module, made of FASTR (Jacquemin, 1999), a shallow transformational natural language analyser and of a ranker. This module can select, among documents found by the search engine, a subset that satisfies more refined criteria. FASTR improves things because it indexes documents with a set of terms, including not only the (simple or compound) words of the initial question, but also their morphological, syntactic and semantic variants. Each index is given a weight all the higher as it is close to the original word in the question, or as it is significant. For instance, original terms are considered more reliable than semantic variants, and proper names are considered more significant than nouns. Then, documents are ordered according to the number and the quality of the terms they contain. An analysis of the weight graph of the indexed documents enables the system to select a relevant subpart of those documents, whose size varies along the questions. Thus, when the curve presents a high negative slope, the system only select documents before the fall, otherwise a fixed threshold is used.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Named entity recognition
</SectionTitle>
      <Paragraph position="0"> The fourth module tags named entities in documents selected by the third one. Named entities are recognized through a combination of lexico-syntactic patterns and significantly large lexical data. The three lists used for lexical lookup are CELEX (1998), a lexicon of 160,595 inflected words with associated lemma and syntactic category, a list of 8,070 first names (6,763 of which are from the CLR (1998) archive) and a list of 211,587 family names also from the CLR archive.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.4 Question-sentence pairing
</SectionTitle>
      <Paragraph position="0"> The fifth module evaluates each sentence in the ranker-selected documents, using a similarity measure between, on one side, terms and named entities in the sentence, and on the other side, words in the questions and expected answer type. To do so, it uses the results of the question parser, and the named entity tagger, along with a frequency-weighted vocabulary of the TREC corpus.</Paragraph>
      <Paragraph position="1"> The QALC system proposes long and short answers. Concerning the short ones, the system focuses on parts of sentences that contain the expected named entity tags, when they are available, or on the larger subpart without any terms.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="1" type="metho">
    <SectionTitle>
3 Search engine evaluation
</SectionTitle>
    <Paragraph position="0"> The second module of the QALC system deals with the selection, through a search engine, of documents that may contain an answer to a given question from the whole TREC corpus (whose size is about 3 gigabytes).</Paragraph>
    <Paragraph position="1"> We tested three search engines with the 200 questions that were proposed at the TREC8 QA track. The first one is Zprise, a vectorial search engine developed by NIST. The second is Indexal (de Loupy et al 1998), a pseudo-boolean search engine developed by Bertin Technologies  . The third search engine is ATT whose results to the TREC questions are provided by NIST in the form of ranked lists of the top 1000 documents retrieved for each question. We based our search engine tests on  We are grateful to Bertin Technologies for providing us with the outputs of Indexal on the TREC collection for the TREC8-QA and TREC9-QA question set.</Paragraph>
    <Paragraph position="2"> the list of relevant documents extracted from the list of correct answers provided by TREC organizers.</Paragraph>
    <Paragraph position="3"> Since a search engine produces a large ranked list of relevant documents, we had to define the number of documents to retain for further processing. Indeed, having too many documents leads to a question processing time that is too long, but conversely, having too few documents reduces the possibility of obtaining the correct answer. The other goal of the tests obviously was to determine the best search engine, that is to say the one that gives the highest number of relevant documents.</Paragraph>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
3.1 Document selection threshold
</SectionTitle>
      <Paragraph position="0"> In order to determine the best selection threshold, we carried out four different tests with the Zprise search engine. We ran Zprise for the 200 questions and then compared the number of relevant documents respectively in the top 50, 100, 200, and 500 retrieved documents. Table 1 shows the test results.</Paragraph>
      <Paragraph position="1">  without relevant documents retrieved for different thresholds According to Table 1, the improvement of the search engine results tends to decrease beyond the threshold of 200 documents. The top 200 ranked documents thus seem to offer the best trade-off between the number of documents in which the answer may be found and the question processing time.</Paragraph>
    </Section>
    <Section position="2" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
3.2 Evaluation
</SectionTitle>
      <Paragraph position="0"> We compared the results given by the three search engines for a threshold of 200 documents. Table 2 gives the tests results.</Paragraph>
      <Paragraph position="1">  Indexal, Zprise and ATT search engines All three search engines perform quite well. Nevertheless, the ATT search engine revealed itself the most efficient according to the following two criteria: the lowest number of questions for which no relevant document was retrieved, and the most relevant documents retrieved for all the 200 questions. Both criteria are important. First, it is most essential to obtain relevant documents for as many questions as possible. But the number of relevant documents for each question also counts, since having more sentences containing the answer implies a greater probability to actually find it.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="1" end_page="1" type="metho">
    <SectionTitle>
4 Document ranking evaluation
</SectionTitle>
    <Paragraph position="0"> As the processing of 200 documents by the following Natural Language Processing (NLP) modules still was too time-consuming, we needed an additional stronger selection. The selection of relevant documents performed by the re-indexing and selection module relies on an NLP-based indexing composed of both single-word and phrase indices, and linguistic links between the occurrences and the original terms. The original terms are extracted from the questions. The tool used for extracting text sequences that correspond to occurrences or variants of these terms is FASTR (Jacquemin, 1999). The ranking of the documents relies on a weighted combination of the terms and variants extracted from the documents. The use of multi-words and variants for document weighting makes a finer ranking possible.</Paragraph>
    <Paragraph position="1"> The principle of the selection is the following: when there is a sharp drop of the documents weight curve after a given rank, we keep only those documents which occur before the drop. Otherwise, we arbitrarily keep the first 100.</Paragraph>
    <Paragraph position="2"> In order to evaluate the efficiency of the ranking process, we proceeded to several measures. First, we apply our system on the material given for the TREC8 evaluation, one time with the ranking process, and another time without this process. 200 documents were retained for each of the 200 questions. The system was scored by 0.463 in the first case, and by 0.452 in the second case. These results show that document selection slightly improves the final score while much reducing the amount of text to process.</Paragraph>
    <Paragraph position="3"> However, a second measurement gave us more details about how things are improved.</Paragraph>
    <Paragraph position="4"> Indeed, when we compare the list of relevant documents selected by the search engine with the list of ranker-selected ones, we find that the ranker loses relevant documents. For thirteen questions among the 200 in the test, the ranker did not consider relevant documents selected by the search engine. What happens is: the global score improves, because found answers rank higher, but the number of found answers remains the same.</Paragraph>
    <Paragraph position="5"> The interest to perform such a selection is also illustrated by the results given Table 3, computed on the TREC9 results.</Paragraph>
    <Paragraph position="6">  We see that the selection process discards documents for 50% of the questions: 340 questions are processed from less than 100 documents. For those 340 questions, the average number of selected documents is 37. The document set retrieved for those questions has a weight curve with a sharp drop. QALC finds more often the correct answer and in a better position for these 340 questions than for the 342 remaining ones. These results are very interesting when applying such time-consuming processes as named-entities recognition and question/sentence matching. Document selection will also enable us to apply further sentence syntactic analysis.</Paragraph>
  </Section>
  <Section position="6" start_page="1" end_page="3" type="metho">
    <SectionTitle>
5 Question-sentence pairing evaluation
</SectionTitle>
    <Paragraph position="0"> We sent to TREC9 two runs which gave answers of 250 characters length, and one run which gave answers of 50 characters length. The first and the last runs used ATT as search engine, and the second one, Indexal. Results are consistent with our previous analysis (see Section 3.2). Indeed, the run with ATT search engine gives slightly better results (0.407 strict)  than those obtained with the Indexal search engine (0.375 strict). Table 4 sums up the number of answers found by our two runs.</Paragraph>
    <Paragraph position="1">  by rank, for the two runs at 250 characters The score of the run with answers of 50 characters length was not encouraging, amounting only 0.178, with 183 correct answers retrieved  .</Paragraph>
    <Section position="1" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
5.1 Long answers
</SectionTitle>
      <Paragraph position="0"> From results of the evaluation concerning document ranking, we see that the performance level of the question-sentence matcher depends partly on the set of sentences it has parsed, and not only on the presence, or absence, of the answer within these sentences. In other words, we do not find the answer each time it is in the set of selected sentences, but we find it easily if there are few documents (and then few sentences) selected. That is because similarity  With this score, the QALC system was ranked 6th among 25 participants at TREC9 QA task for answers with 250 characters length.</Paragraph>
      <Paragraph position="1">  With this score, the QALC system was ranked 19th among 24 participants at TREC9 QA task for answers with 50 characters length.</Paragraph>
      <Paragraph position="2"> assessment relies upon a small number of criteria, which are found to be insufficiently discriminant. Therefore, several sentences obtain the same mark, in which case, the rank of the correct answer depends on the order in which sentences are encountered.</Paragraph>
      <Paragraph position="3"> This is something we cannot yet manage, so we evaluated the matchers performance, without any regard to the side effect induced by document processing order. As remarked in 3.2, search engines perform well. In particular, ATT retains relevant documents, namely, those that yield good answers, for 97 percent of the questions. The ranker, while improving the final score, loses some questions. After it stepped in, the system retains relevant documents for 90% of the questions. The matcher finds a relevant document in the first five answers for 74% of the questions, but answers only 62% of them correctly. Finding the right document is but one step, knowing where to look inside it is no obvious task.</Paragraph>
    </Section>
    <Section position="2" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
5.2 Short answers
</SectionTitle>
      <Paragraph position="0"> A short answer is selectively extracted from a long one. We submitted this short answer selector (under 50 characters) to evaluation looking for the impact of the expected answer type. Among TREC questions, some expect an answer consisting of a named entity: for instance a date, a personal or business name. In such cases, assigning a type to the answer is rather simple, although it implies the need of a good named entity recognizer. Answers to other questions (why questions for instance, or some sort of what questions), however, will consist of a noun or sentence. Finding its type is more complex, and is not done very often.</Paragraph>
      <Paragraph position="1"> Some systems, like FALCON (Harabagiu et al 2000) use Wordnet word class hierarchies to assign types to answers. Among 682 answers in TREC9, 57.5% were analysed by our system as named-entity questions, while others received no type assignment. Among answers from our best 250-character run, 62.7% were about named entities. However, our run for shorter answers, yielding a more modest score, gives 84% of named-entities answers. In our system answer type assignment is of surprisingly small import, where longer answers are concerned.</Paragraph>
      <Paragraph position="2"> However, it does modify the selecting process, when the answer is extracted from a longer sentence.</Paragraph>
      <Paragraph position="3"> Such evaluations help us to see more clearly where our next efforts should be directed.</Paragraph>
      <Paragraph position="4"> Having more criteria in the similarity measurement would, in particular, be a source of improvement.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="3" end_page="5" type="metho">
    <SectionTitle>
6 Discussion
</SectionTitle>
    <Paragraph position="0"> We presented quantitative evaluations. But since we feel that evaluations should contribute to improvements of the system, more qualitative and local ones also appear interesting.</Paragraph>
    <Paragraph position="1"> TREC organizers send us, along with run results, statistics about how many runs found the correct answer, and at which rank. Such statistics are useful in many ways. Particularly, they provide a characterisation of a posteriori difficult questions. Knowing that a question is a difficult one is certainly relevant when trying to answer it. Concerning this problem, de Loupy and Bellot (2000) proposed an interesting set of criteria to recognize a priori difficult questions. They use word frequency, multi-words, polysemy (a source of noise) and synonymy (a source of silence). They argue that an intelligent system could even insist that a question be rephrased when it is too difficult.</Paragraph>
    <Paragraph position="2"> While their approach is indeed quite promising, we consider that their notion of a priori difficulty should be complemented by the notion of a posteriori difficulty we mentioned: the two upcoming examples of queries show that a question may seem harmless at first sight, even using de Loupy and Bellots criteria, and still create problems for most systems.</Paragraph>
    <Paragraph position="3"> From these statistics, we also found disparities between our system and others for certain questions. At times, it finds a good answer where most others fail and obviously the reverse also happens. This is the case in the two following examples. The first one concerns an interesting issue in a QA system that is the determination of which terms from the question are to be selected for the question-answer pairing. This is particularly important when the question has few words. For instance, to the question How far away is the moon?, our term extractor kept not only moon(NN), but also away(RB). Moreover, our question parser knows that how far is an interrogative phrase yielding a LENGTH type for the answer. This leads our system to retrieve the correct answer: With a 28-power telescope, you can see it on the moon 250,000 miles away  .</Paragraph>
    <Paragraph position="4"> The second example concerns the relative weight of the terms within the question. When a proper noun is present, it must be found in the answer, hence an important weight for it. Look at the question Who manufactures the software, PhotoShop?. The term extractor kept software(NN), PhotoShop(NP), and manufacture(VBZ) as terms to be matched, but the matcher assigns equal weights to them, so we could not find the answer  . Later, we modified these weights, and the problem was solved.</Paragraph>
    <Paragraph position="5"> Indeed, evaluation corpus seems to be difficult to build. Apart from the problem of the question difficulty level, question type distribution may also vary from a corpus to another. For instance, we note that TREC8 proposed much more questions with named entity answer type (about 80%) than TREC9 (about 60%). Thus, some participants who trains their systems on the TREC8 corpus were somehow disapointed by their results at TREC9 with regards with their training results (Scott and Gaizauskas, 2000).</Paragraph>
    <Paragraph position="6"> However, it is generally hard to predict what will happen if we modify the system. A local improvement can result in a loss of performance for other contexts. Although the systems complexity cannot be reduced to just two levels (a local one and a global one), this can be an efficient step in the design of improvements to the whole system via local adjustments. But this is a very frequent situation in engineering tasks.</Paragraph>
  </Section>
class="xml-element"></Paper>