<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-2005">
  <Title>Term Distillation in Patent Retrieval</Title>
  <Section position="4" start_page="1" end_page="1" type="metho">
    <SectionTitle>
2. Initial retrieval
</SectionTitle>
    <Paragraph position="0"> Each query term is submitted one byoneto the ranking search module, which assigns a weight to the term andscores documents includingit. Retrieved documents are merged and sorted on the score in the descending order.</Paragraph>
  </Section>
  <Section position="5" start_page="1" end_page="1" type="metho">
    <SectionTitle>
3. Seed document selection
</SectionTitle>
    <Paragraph position="0"> As a result of the initial retrieval, top ranked documents are assumed to be pseudo-relevant to the query and selected as a \seed&amp;quot; of query expansion. The maximum number of seed documents is ten.</Paragraph>
  </Section>
  <Section position="6" start_page="1" end_page="1" type="metho">
    <SectionTitle>
4. Query expansion
</SectionTitle>
    <Paragraph position="0"> Candidates of expansion terms are extracted from the seed documents by pattern matching as in the query term extraction mentioned above.</Paragraph>
    <Paragraph position="1"> Phrasal terms are not used for query expansion because phrasal terms may be less eectivetoimprove recall and risky in case of pseudo-relevance feedback.</Paragraph>
    <Paragraph position="2"> The weight of initial query term is re-calculated with the Robertson/Spark-Jones formula (Robertson and Sparck-Jones, 1976) if the term is found in the candidate pool.</Paragraph>
    <Paragraph position="3"> The candidates are ranked on the Robertson's SelectionValue (Robertson, 1990) and top-ranked terms are selected as expansion terms.</Paragraph>
  </Section>
  <Section position="7" start_page="1" end_page="1" type="metho">
    <SectionTitle>
5. Final retrieval
</SectionTitle>
    <Paragraph position="0"> Each query and expansion term is submitted one by one to the rankingsearch module as in the initial retrieval.</Paragraph>
  </Section>
  <Section position="8" start_page="1" end_page="1" type="metho">
    <SectionTitle>
3 Term distillation
</SectionTitle>
    <Paragraph position="0"> In cross-database retrieval, the domain of queries (news article) diers from that of the retrievaltarget (patent) inthe distributionof term occurrences. This causes incorrect term weighting in the retrieval system which assigns to each term a retrieval weight based on the distribution of term occurrences. Moreover, the terms which mightbegiven an incorrect weight are too many to be collected in a stop word dictionary.</Paragraph>
    <Paragraph position="1"> For these reasons, we nd it necessary to have a query term selection stage specially designed for cross-database retrieval. We dene \term distillation&amp;quot; as a general framework for the query term selection.</Paragraph>
    <Paragraph position="2"> More specically, the termdistillationconsists of the following steps :  1. Extraction of query term candidates Candidates of query terms are extracted from the query string (news articles) and pooled.</Paragraph>
    <Paragraph position="3"> 2. Assignment of TDV (Term Distillation Value) Each candidate in the pool is givenaTDV which represents \goodness&amp;quot; of the term to retrieve documents in the target domain.</Paragraph>
    <Paragraph position="4"> 3. Selection of query terms  The candidates are ranked on the TDVand top-ranked n terms are selected as query terms, where n is an unknown constant and treated as a tuning parameter for fullautomatic retrieval.</Paragraph>
    <Paragraph position="5"> The term distillation seems appropriate to avoid falling foul of the \curse of dimensionality&amp;quot; (Robertson, 1990) incase that a given query is very lengthy.</Paragraph>
    <Paragraph position="6"> In what follows in this section, we explain a generic model to dene the TDV. Thereafter some instances of the model whichembody the term distillation are introduced.</Paragraph>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
3.1 Generic Model
</SectionTitle>
      <Paragraph position="0"> In order to dene the TDV, we give a generic model with the following formula.</Paragraph>
      <Paragraph position="2"> where QV and TV represent the importance of the term in the query and the target domain respectively. QV seems to be commonly used for query term extraction in ordinary retrieval systems, however, TV is newly introduced for cross-database retrieval. A combination of QV and TV embodies a term distillation method.</Paragraph>
      <Paragraph position="3"> We instance them separately as bellow.</Paragraph>
    </Section>
    <Section position="2" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
3.2 Instances of TV
</SectionTitle>
      <Paragraph position="0"> We give some instances of TV using two probabilities p and q, where p is a probability that the term occurs in the target domain and q is a probability that the term occurs in the query domain. Because the estimation method of p and q is independent on the instances of TV,it is explained later. Weshoweach instance of TV with the id-tag as follows:</Paragraph>
      <Paragraph position="2"> where and are unknown constants.</Paragraph>
      <Paragraph position="4"/>
    </Section>
    <Section position="3" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
3.3 Instances of QV
</SectionTitle>
      <Paragraph position="0"> We showeach instance of QV with the id-tag as follows:</Paragraph>
      <Paragraph position="2"> where tf isthe within-queryterm frequency and is an unknown constant.</Paragraph>
      <Paragraph position="4"> where weight is the retrieval weight given by the retrieval system.</Paragraph>
      <Paragraph position="6"/>
    </Section>
  </Section>
  <Section position="9" start_page="1" end_page="1" type="metho">
    <SectionTitle>
4 Experiments on term distillation
</SectionTitle>
    <Paragraph position="0"> Using the NTCIR-3 patent retrieval test collection, we conducted experiments to evaluate the eect of term distillation.</Paragraph>
    <Paragraph position="1"> For query construction, weusedonlynewsarticle elds in the 31 topics for the formal run. The number of query terms selected by term distillation was just eight in each topic. As described in the section 2, retrieval was fullautomatically executed with pseudo-relevance feedback.</Paragraph>
    <Paragraph position="2"> The evaluation results for some combinations of QV and TV are summarizedinTable 1, where the documents judged to be \A&amp;quot; were taken as relevant ones. The combinations were selected on the results in our preliminary experiments.</Paragraph>
    <Paragraph position="3"> Each of \t&amp;quot;, \i&amp;quot;, \a&amp;quot; and \w&amp;quot; in the columns \p&amp;quot; or \q&amp;quot; represents a certain method for estimation of the probability p or q as follows : t : estimate p by the probability that the term occurs in titles of patents. More specically</Paragraph>
    <Paragraph position="5"> is the number of patent titles includingthe term and N p is the number of patents in the NTCIR-3 collection. i : estimate q by the probability that the term occurs in news articles. More specically</Paragraph>
    <Paragraph position="7"> is the number of articles including the term and N i is the number of news articles inthe IREX collection ('98-'99 MAINICHI news article).</Paragraph>
    <Paragraph position="8"> a : estimate p by the probability that the term occurs in abstracts of patents. More specif- null is the number of patent abstracts in which the term occurs. w : estimate q by the probability that the term occurs in the whole patent. More specif-</Paragraph>
    <Paragraph position="10"> of patents in which the term occurs. We tried to approximate the dierence in term statistics between patents and news articles using the conbination of &amp;quot;a&amp;quot; and &amp;quot;w&amp;quot; in the term distillation.</Paragraph>
    <Paragraph position="11"> In Table 1, the combination of QV2 and TV0 corresponds to query term extraction without  tion, retrieval performances are improved using instances of TV except for TV7. This means the term distillation produces a positive eect. The best performance in the table is produced by the combination of QV2 (raw term frequency) and TV4 (BIM).</Paragraph>
    <Paragraph position="12"> While the combination of \a&amp;quot; and \w&amp;quot; for estimation of probabilities p and q has the virtue in that the estimation requires only target document collection, the performance is poor in comparison with the combination of \t&amp;quot; and \i&amp;quot;. Although the instances of QV can be compared each other by focusing on TV3, it is unclear whether QV5 is superiorto QV2. We think it is necessary to proceed to the evaluation including the other combinations of TV and QV .</Paragraph>
  </Section>
  <Section position="10" start_page="1" end_page="1" type="metho">
    <SectionTitle>
5 Results in NTCIR-3 patent task
</SectionTitle>
    <Paragraph position="0"> We submitted four mandatory runs. The evaluation results of our submitted runs are summarizedinTable 2, where the documents judged to be \A&amp;quot; were taken as relevant ones.</Paragraph>
    <Paragraph position="1"> These runs were automatically produced using both article and supplement elds, where each supplement eld includes a short descriptionon thecontent of the newsarticle. Termdistillation using TV3 (Bayes classication model) and query expansion by pseudo-relevance feedbackwere applied to all runs.</Paragraph>
    <Paragraph position="2"> The retrieval performances are remarkable among all submitted runs. However, the eect  of term distillation is somewhat unclear, comparing with the run with only supplementelds in Table 3 (the average precision is 0.2712). We think supplement elds supply enough terms so that it is dicult to evaluate the performance of cross-database retrieval in the mandatory runs.</Paragraph>
  </Section>
class="xml-element"></Paper>