<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0503">
  <Title>The Problem of Precision in Restricted-Domain Question-Answering. Some Proposed Methods of Improvement</Title>
  <Section position="4" start_page="2" end_page="3" type="metho">
    <SectionTitle>
4 Improving Precision by Re-ranking
Candidates
</SectionTitle>
    <Paragraph position="0"> We experimented with two methods of reranking, one with a strongly specific terminological set, and one with a good document characterization.</Paragraph>
    <Section position="1" start_page="2" end_page="3" type="sub_section">
      <SectionTitle>
4.1 Re-ranking using specific vocabulary
</SectionTitle>
      <Paragraph position="0"> In the first experiment, we noted that the names of specific Bell services, such as 'Business Internet Dial', 'Web Live Voice', etc., could be used as a relevance characterizing information, because they occurred very often in almost every document and question, and a service was often presented or mentioned in only one or a few documents, making these terms very discriminating. To have a generic concept, let's call these names 'special terms'.</Paragraph>
      <Paragraph position="1"> Luckily, these special terms occurred normally in capital letters, and could be automatically extracted easily. After a manual filtering, we obtained more than 450 special terms.</Paragraph>
      <Paragraph position="2"> We designed a new scoring system which raises the score of the candidates containing occurrences of special terms found in the corresponding question, as follows:</Paragraph>
      <Paragraph position="4"> Thus, the score of candidate i in the ranked list returned by Okapi depends on: (i) The original Okapi_score given by Okapi, weighted by some integer value OW. (ii) A Term_score that measures the importance of common occurrences of special terms, and, with less emphasis, other noun phrases and open-class words, in the question and the candidate. It is weighted by some integer value RC[i] (for rank coefficient) that represents the role of the relative ranking of Okapi. (iii) A document coefficient DC that indicates the relative importance of a candidate i coming or not coming from a document which contains at least a special term occurring in the question. DC is thus represented by a 2-value pair; e.g., the pair (1, 0) corresponds to the extreme case of keeping only candidates coming from a document which contains at least one special term in the question, and throwing out all others. We ran the system with 20 different values of DC, 50 of RC, and OW from 0 to 60, on the training question set. See (Doan-Nguyen and Kosseim, 2004) for a detailed explanation of how formula (1) was derived, and how to design the values of DC, RC, and OW.</Paragraph>
      <Paragraph position="5"> Formula (1) gave very good improvements on the training set (Table 3), but just modest results when running the system with optimal training parameters on the test set (Table 4). Note: [?]Q(n) =  Okapi allows one to give it a list of phrases as indices, in addition to indices automatically created from single words. In fact, the results in Tables 1 and 2 correspond to this kind of indexing, in which we provided Okapi with the list of special terms. These results are much better than those of standard indexing, i.e. without the special term list.</Paragraph>
    </Section>
    <Section position="2" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
4.2 Re-ranking with a better document characterization
</SectionTitle>
      <Paragraph position="0"> characterization In formula (1), the coefficient DC represents an estimate of the relevance of a document to a question based only on special terms; it cannot help when the question and document do not contain special terms. To find another document characterization which can complement this, we tried to map the documents into a system of concepts. Each document says things about a set of concepts, and a concept is discussed in a set of documents. Building such a concept system seems feasible within closed-domain applications, because the domain of the document collection is pre-defined, the number of documents is in a controlled range, and the documents are often already classified topically, e.g. by their creator. If no such classification existed, one can use techniques of building hierarchies of clusters (e.g. those summarized in (Kowalski, 1997)).</Paragraph>
      <Paragraph position="1"> We used the original document classification of Bell Canada, represented in the web page URLs, as the basis for constructing the concept hierarchy and the mapping between it and the document collection. Below is a small excerpt from the  In general, a leaf node concept corresponds to one or very few documents talking about it. A parent concept corresponds to the union of documents of its child concepts. Note that although many concepts coincide in fact with a special term, e.g. 'First Rate', many others are not special terms, e.g. 'phone', 'wireless', 'long distance', etc.</Paragraph>
      <Paragraph position="2"> The use of the concept hierarchy in the QA system was based on the following assumption: A question can be well understood only when we can recognize the concepts implicit in it. For example, the concepts in the question: It seems that the First Rate Plan is only good if most of my calls are in the evenings or weekends. If so, is there another plan for long distance calls anytime during the day? include Personal-Phone-LongDistance and Personal-Phone-LongDistance-FirstRate.</Paragraph>
      <Paragraph position="3"> Once the concepts are recognized, it is easy to determine a small set of documents relevant to these concepts, and carry out the search of answers in this set.</Paragraph>
      <Paragraph position="4"> To map a question to the concept hierarchy, we postulated that the question should contain words expressing the concepts. These words may be those constituting the concepts, e.g., 'long', 'distance', 'first', 'rate', etc., or synonyms/near synonyms of them, e.g., 'telephone' to 'phone'; 'mobile', 'cellphone' to 'wireless'. For every concept, we built a bag of words which make up the concept, e.g., the bag of words for Personal-Phone-LongDistance-FirstRate is {'personal', 'phone', 'long', 'distance', 'first', 'rate'}. We also built manually a small lexicon of (near) synonyms as mentioned above.</Paragraph>
      <Paragraph position="5"> Now, a question will be analyzed into separate words (stop words removed), and we look for concepts whose bags of words have elements in common with them. (Here we used the Porter stemmed form of words in comparison, and also counted cases of synonyms/near synonyms.) A concept is judged more relevant to a question if: (i) its bag of words has more elements in common with the question's set of words; (ii) the quotient of the size of the common subset mentioned in (i) over the size of the entire bag of words is larger; and (iii) the question contains more occurrences of words in that subset.</Paragraph>
      <Paragraph position="6"> From the relevant concept set, it is straightforward to derive the relevant document set for a given question. The documents will be ranked according to the order of the deriving concepts. (If a document is derived from several concepts, the highest rank will be used.) As for the coverage of the mapping, there were only 4 questions in the training set and 6 in the test set (7% of the entire question set) having an empty relevant document set. In fact, these questions seemed to need a context to be understood, e.g., a question like 'What does Dot org mean?' should be posed in a conversation about Internet services.</Paragraph>
      <Paragraph position="7"> Now the score of a candidate is calculated by:  (2) Score_of_candidate[i] = (CC + DC) x (OW x Okapi_score + RC[i] x Term_score + 1)  The value of CC (concept-related coefficient) depends on the document that provides the candidate. CC should be high if the rank of the document is high, e.g. CC=1 if rank=1, CC=0.9 if rank=2, CC=0.8 if rank=3, etc. If the document does not occur in the concept-derived list, its CC should be very small, e.g. 0. The sum (CC + DC) represents a combination of the two kinds of document characterization. We ran the system with 15 different values of the CC vector, with CC for rank 1 varying from 0 to 7, and CC for other ranks decreasing accordingly. Values for other coefficients are the same as in the previous experiment using formula (1). Results (Tables 5 and 6) are uniformly better than those of formula (1). Good improvements show that the approach is appropriate and effective.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="3" end_page="3" type="metho">
    <SectionTitle>
5 Two-Level Candidate Searching
</SectionTitle>
    <Paragraph position="0"> As the mapping in the previous section seems to be able to point out the documents relevant to a given question with a high precision, we tried to see how to combine it with the IR engine Okapi. In the previous experiments, the entire document collection was indexed by Okapi. Now indexing will be carried out separately for each question: only the document subset returned by the mapping, which usually contains no more than 20 documents, is indexed, and Okapi will search for candidate answers for the question only in this subset. We hoped that Okapi could achieve higher precision in working with a much smaller document set. This strategy can be considered as a kind of two-level candidate searching.</Paragraph>
    <Paragraph position="1">  with re-ranking on the test set.</Paragraph>
    <Paragraph position="2"> Results show that Okapi did not do better in this case than when it worked with the entire document collection (compare MO Q(n) in Tables 7 and 8 with Q(n) in Tables 1 and 2. MO means 'mappingthen-Okapi'). We then applied formula (2) to rearrange the candidate list as in the previous section. Although results on the training set (Table 7) are generally better than those of the previous section, results on the test set (Table 8) are worse, which leads to an unfavorable conclusion for this method. (Note that [?]Q(n) and %[?]Q(n) are always comparisons of the new Q(n) with the original Okapi Q(n) in Tables 1 and 2.)</Paragraph>
  </Section>
  <Section position="6" start_page="3" end_page="4" type="metho">
    <SectionTitle>
6 Re-implementing the IR engine
</SectionTitle>
    <Paragraph position="0"> The precision of the question-document mapping was good, but the performance of the two-level system based on Okapi in the previous section was not very persuasive. This led us back to the first approach mentioned in Section 3, i.e. replacing Okapi by another IR engine. We would not look for another generic engine because it was not interesting theoretically, but would instead implement a two-level engine using the question-document mapping. As already known, the mapping returns just a small set of relevant documents for a given question; the new engine will search for candidate answers in this set. If the document set is empty, the system takes the candidates proposed by Okapi as results (&amp;quot;Okapi as Last Resort&amp;quot;).</Paragraph>
    <Paragraph position="1"> We implemented just a simple IR engine. First the question is analyzed into separate words (stop words removed). For every document in the set returned by the question-document mapping, the system scores each paragraph by counting in this paragraph the number of occurrences of words which also appear in the question (using the stemmed form of words). Here 'paragraph' means a block of text separated by one newline, not two as in Okapi sense. Note that texts in the Bell Canada collection contain a lot of short and empty paragraphs. The candidate passage is extracted by taking the five consecutive paragraphs which have the highest score sum. However, if the document is &amp;quot;small&amp;quot;, i.e. contains less than 2000 characters, the entire document is taken as the candidate and its score is the sum of scores of all paragraphs.</Paragraph>
    <Paragraph position="2"> This choice seemed unfair to previous experiments because about 60% of the collection are such small documents. However, we decided to have a more realistic notion of answer candidates which reflects the nature of the collection and of our current task: in fact, those small documents are often dedicated to a very specific topic, and it seems necessary to present its contents in its entirety to any related question for reasons of understandability, or because of important additional information in the document. Also, a size of 2000 characters (which are normally 70% of a page) seems acceptable for a human judgement in the scenario of semi-automatic systems.</Paragraph>
    <Paragraph position="3">  Let's call the score calculated as above Occurrence_score. We also considered the role of the rank of the document in the list returned by the question-document mapping. The final score formula is as follows:  The portion (21 - Document_Rank) guarantees that high-rank documents contribute high scores. That portion is always positive because we retained no more than 20 documents for every question. RC is a coefficient representing the importance of the document rank. Due to time limit - judgement of candidates has to be done manually and is very time consuming, we carried out the experiment with only RC=0, 1, 1.5, and 2, and achieved the best results with RC=1.5.</Paragraph>
    <Paragraph position="4"> Results (Tables 9 and 10) show that except the case of n=1 in the test set, the new system performs well in precision. This might be explained partly because it tolerates larger candidates than previous experiments. However what is interesting here is that the engine is very simple but efficient because it does searching on a well selected and very small document subset.</Paragraph>
    <Paragraph position="5">  In fact, candidates returned by Okapi are not uniform in length. Some are very short (e.g. one line), some are very long (more than 2000 characters).</Paragraph>
  </Section>
  <Section position="7" start_page="4" end_page="4" type="metho">
    <SectionTitle>
7 Second Approach Revisited: Extending Answer Candidates
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
Answer Candidates
</SectionTitle>
      <Paragraph position="0"> The previous experiment has shown that extending the size of answer candidates can greatly ease the task. This can be considered as another method belonging to the second approach - that of improving precision performance by improving the results returned by the IR engine. To be fair, it may be necessary to see how precision performance will be improved if this extending is used in other experiments. We did two small experiments. In the first one, any candidates returned by Okapi (cf.</Paragraph>
      <Paragraph position="1"> Tables 1 and 2) which came from a document of less than 2000 characters were extended into the entire document. Table 11 shows that improvements are not as good as those obtained by other methods.</Paragraph>
      <Paragraph position="2">  on the training set (A) and test set (B).</Paragraph>
      <Paragraph position="3"> In the second experiment, we similarly extended candidates returned by the two-level search process &amp;quot;mapping-then-Okapi&amp;quot; in Section 5. Improvements (Table 12) seem comparable to those of the experiment in Section 5 (Tables 7 and 8), but less good than those of experiments in Sections 4.2 and 6. The two experiments of this section suggest that extending candidates helps improve the precision, but not so much unless it is combined with other methods. We have not yet, however, carried out experiments of combining candidate extending with re-ranking.</Paragraph>
  </Section>
class="xml-element"></Paper>