<?xml version="1.0" standalone="yes"?>
<Paper uid="H05-1121">
  <Title>Query Expansion with the Minimum User Feedback by Transductive Learning</Title>
  <Section position="5" start_page="963" end_page="964" type="metho">
    <SectionTitle>
2 Basic Methods
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="963" end_page="964" type="sub_section">
      <SectionTitle>
2.1 Query Expansion
</SectionTitle>
      <Paragraph position="0"> So far, many query expansion techniques have been proposed. While some techniques focus on the domain specific search which prepares expansion terms in advance using some domain specific training documents (Flake and et al, 2002; Oyama and et al, 2001), most of techniques are based on relevance feedback which is given automatically or manually.</Paragraph>
      <Paragraph position="1"> In this technique, expansion terms are selected from relevant documents by a scoring function. The Robertson's wpq method (Ruthven, 2003) is often used as such a scoring function in many researches (Yu and et al, 2003; Lam-Adesina and Jones, 2001).</Paragraph>
      <Paragraph position="2"> We also use it as our basic scoring function. It calculates the score of each term by the following formula. null</Paragraph>
      <Paragraph position="4"> where rt is the number of seen relevant documents containing term t. nt is the number of documents containing t. R is the number of seen relevant documents for a query. N is the number of documents in the collection. The second term of this formula is called the Robertson/Spark Jones weight (Robertson, 1990) which is the core of the term weighting function in the Okapi system (Robertson, 1997).</Paragraph>
      <Paragraph position="5"> This formula is originated in the following formula. null</Paragraph>
      <Paragraph position="7"> (2) where pt is the probability that a term t appears in relevant documents. qt is the probability that a term t appears in non-relevant documents. We can easily notice that it is very important how the two probability of pt and qt should be estimated. The first formula estimates pt with rtR and qt with Nt[?]RtN[?]R . For the good estimation of pt and qt, plenty of relevant document is necessary. Although pseudo feedback which automatically assumes top n documents as relevant is one method and is often used, its performance heavily depends on the quality of an initial search. As we show later, pseudo feedback has limited performance.</Paragraph>
      <Paragraph position="8"> We here consider a query expansion technique which uses manual feedback. It is no wonder  manual feedback shows excellent and stable performance if enough relevant documents are available, hence the challenge is how it keeps high performance with less amount of manual relevance judgment. In particular, we restrict the manual judgment to the minimum amount, namely only a relevant document and a non-relevant document. In this assumption, the problem is how to find more relevant documents based on a relevant document and a non-relevant document. We use transductive learning technique which is suitable for the learning problem where there is small training examples.</Paragraph>
    </Section>
    <Section position="2" start_page="964" end_page="964" type="sub_section">
      <SectionTitle>
2.2 Transductive Learning
</SectionTitle>
      <Paragraph position="0"> Transductive learning is a machine learning technique based on the transduction which directly derives the classification labels of test data without making any approximating function from training data (Vapnik, 1998). Because it does not need to make approximating function, it works well even if the number of training data is small.</Paragraph>
      <Paragraph position="1"> The learning task is defined on a data set X of n points. X consists of training data set L = (x1,x2,...,xl) and test data set U = (xl+1,xl+2,...,xl+u); typically l lessmuch u. The purpose of the learning is to assign a label to each data point in U under the condition that the label of each data point in L are given.</Paragraph>
      <Paragraph position="2"> Recently, transductive learning or semi-supervised learning is becoming an attractive subject in the machine learning field. Several algorithms have been proposed so far (Joachims, 1999; Zhu and et al., 2003; Blum and et al., 2004) and they show the advantage of this approach in various learning tasks. In order to apply transductive learning to our query expansion, we select an algorithm called &amp;quot;Spectral Graph Transducer (SGT)&amp;quot; (Joachims, 2003), which is one of the state of the art and the best transductive learning algorithms. SGT formalizes the problem of assigning labels to U with an optimization problem of the constrained ratiocut.</Paragraph>
      <Paragraph position="3"> By solving the relaxed problem, it produces an approximation to the original solution.</Paragraph>
      <Paragraph position="4"> When applying SGT to query expansion, X corresponds to a set of top n ranked documents in a hit-list. X does not corresponds to a whole document collection because the number of documents in a collection is too huge1 for any learning system to process. L corresponds to two documents with manual judgments, a relevant document and a non-relevant document. Furthermore, U corresponds to the documents of X [?] L whose relevancy is unknown. SGT is used to produce the relevancy of documents in U. SGT actually assigns values around g+ [?] th for documents possibly being relevant and g[?] [?]th for documents possibly being non-relevant. g+ = +</Paragraph>
      <Paragraph position="6"> documents in X. We cannot know the true value of fp in advance, thus we have to estimate its approximation value before applying SGT.</Paragraph>
      <Paragraph position="7"> According to Joachims, parameter k (the number of k-nearest points of a data x) and d (the number of eigen values to ...) give large influence to SGT's learning performance. Of course those two parameters should be set carefully. However, besides them, fp is much more important for our task because it controls the learning performance. Since extremely small L (actually |L |= 2 is our setting) give no information to estimate the true value of fp, we do not strain to estimate its single approximation value but propose a new method to utilize the results of learning with some promising fp. We describe the method in the next section.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="964" end_page="965" type="metho">
    <SectionTitle>
3 Parameter Estimations based on Multiple SGT Predictions
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="964" end_page="965" type="sub_section">
      <SectionTitle>
3.1 Sampling for Fraction of Positive Examples
</SectionTitle>
      <Paragraph position="0"> SGT prepares 2 estimation methods to set fp automatically. One is to estimate from the fraction of positive examples in training examples. This method is not suitable for our task because fp is always fixed to 0.5 by this method if the number of training examples changes despite the number of relevant documents is small in many practical situations. The other is to estimate with a heuristic that the difference between a setting of fp and the fraction of positive examples actually assigned by SGT should be as small as possible. The procedure provided by SGT starts from fp = 0.5 and the next fp is set to the fraction of documents assigned as relevant in the previous SGT trial. It repeats until fp changes</Paragraph>
      <Paragraph position="2"> five times or the difference converges less than 0.01.</Paragraph>
      <Paragraph position="3"> This method is neither works well because the convergence is not guaranteed at all.</Paragraph>
      <Paragraph position="4"> Presetting of fp is primarily very difficult problem and consequently we take another approach which laps the predictions of multiple SGT trials with some sampled fp instead of setting a single fp. This approach leads to represent a relevant document by not a binary value but a real value between 0 and 1. The sampling procedure for fp is illustrated in Figure 1.</Paragraph>
      <Paragraph position="5"> In this procedure, sampling interval changes according to the number of training examples. In our preliminary test, the number of sampling points should be around 10. However this number is adhoc one, thus we may need another value for another corpus.</Paragraph>
    </Section>
    <Section position="2" start_page="965" end_page="965" type="sub_section">
      <SectionTitle>
3.2 Modified estimations for p_t and q_t
</SectionTitle>
      <Paragraph position="0"> Once we get a set of sampling points S = {fip :</Paragraph>
      <Paragraph position="2"> each resultant of prediction to calculate pt and qt as follows.</Paragraph>
      <Paragraph position="4"> Here, Ri is the number of documents which SGT predicts as relevant with ith value of fip, and rit is the number of documents in Ri where a term t appears. In each trial, SGT predicts the relevancy of documents by binary value of 1 (for relevant) and 0 (for non-relevant), yet by lapping multiple resultant of predictions, the binary prediction value changes to a real value which can represents the relevancy of documents in more detail. The main merit of this approach in comparison with fixing fp to a single value, it can differentiate a value of pt if Ntr is small.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="965" end_page="966" type="metho">
    <SectionTitle>
4 Expansion Procedures
</SectionTitle>
    <Paragraph position="0"> We here explain a whole procedure of our query expansion method step by step.</Paragraph>
    <Paragraph position="1">  1. Initial Search: A retrieval starts by inputting a query for a topic to an IR system.</Paragraph>
    <Paragraph position="2"> 2. Relevance Judgment for Documents in a Hit-List: The IR system returns a hit-list for  the initial query. Then the hit-list is scanned to check whether each document is relevant or non-relevant in descending order of the ranking. In our assumption, this reviewing process terminates when a relevant document and a non-relevant one are found.</Paragraph>
    <Paragraph position="3"> 3. Finding more relevant documents by transductive learning: Because only two judged documents are too few to estimate pt and qt correctly, our query expansion tries to increase the number of relevant documents for the wpq formula using the SGT transductive learning algorithm. As shown in Figure2, SGT assigns a value of the possibility to be relevant for the topic to each document with no relevance judgment (documents under the dashed line in the Fig) based on two judged documents (documents above the dashed line in the Figure).  1. Document 1 2. Document 0 3. Document ? 4. Document ?  4. Selecting terms to expand the initial query:  Our query expansion method calculates the score of each term appearing in relevant documents (including documents judged as relevant by SGT) using wpq formula, and then selects a certain number of expansion terms according to the ranking of the score. Selected terms are added to the initial query. Thus an expanded query consists of the initial terms and added terms.</Paragraph>
  </Section>
  <Section position="8" start_page="966" end_page="966" type="metho">
    <SectionTitle>
5. The Next Search with an expanded query:
</SectionTitle>
    <Paragraph position="0"> The expanded query is inputted to the IR system and a new hit-list will be returned. One cycle of query expansion finishes at this step.</Paragraph>
    <Paragraph position="1"> In the above procedures, we naturally introduced transductive learning into query expansion as the effective way in order to automatically find some relevant documents. Thus we do not need to modify a basic query expansion procedure and can fully utilize the potential power of the basic query expansion.</Paragraph>
    <Paragraph position="2"> The computational cost of transductive learning is not so much. Actually transductive learning takes a few seconds to label 100 unlabeled documents and query expansion with all the labeled documents also takes a few seconds. Thus our system can expand queries sufficiently quick in practical applications.</Paragraph>
  </Section>
class="xml-element"></Paper>