<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1069">
  <Title>Document Re-ranking Based on Automatically Acquired Key Terms in Chinese Information Retrieval</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Overview of Document Reordering in Chinese IR
</SectionTitle>
    <Paragraph position="0"> For Chinese IR, many retrieval models, indexing strategies and query expansion strategies have been studied and successfully applied. Single Chinese characters, bi-grams, n-grams (n&gt;2) and words are the most commonly used indexing units. (Li. P. 1999) presents many results on the effectiveness of the single Chinese character as an indexing unit and on how to improve it. (K.L. Kwok. 1997) compares three kinds of indexing units (single characters, bi-grams and short words), reporting that single-character indexing is good but not sufficiently competitive, while bi-gram indexing works surprisingly well and is as good as short-word indexing in precision. (J.Y. Nie, J. Gao, J. Zhang and M. Zhou. 2000) suggests that word indexing and bi-gram indexing achieve comparable performance, but that once time and space factors are considered it is preferable to use words (and characters) as indexes; it also suggests that combining the longest-matching algorithm with single characters is a good method for Chinese, and that performance can be further improved with unknown-word detection. Many other papers in the literature (Palmer, D. and Burger, J, 1997; Chien, L.F, 1995) draw similar conclusions. Although opinions still differ on whether the bi-gram or the word is the best indexing unit, bi-grams and words are considered the two most important indexing units in Chinese IR, and they are used in many reported Chinese IR systems and experiments.</Paragraph>
    <Paragraph position="1"> There are two main kinds of retrieval models: the Vector Space Model (G. Salton and M. McGill, 1983) and Probabilistic Retrieval (N. Fuhr, 1992). Both are used in many experiments and applications.</Paragraph>
    <Paragraph position="2"> For query expansion, almost all of the proposed strategies make use of the top N documents of the initial retrieval. Generally, a query expansion strategy selects M indexing units (M&lt;50) from the top N (N&lt;25) initially ranked documents according to some measure and adds these M indexing units to the original query to form a new query. This process assumes that the top N documents are relevant to the original query, but in practice the assumption is not always true. The Okapi approach (S.E. Robertson and S. Walker, 2001) supposes that the top R documents are relevant to the query and selects N indexing units from those top R documents to form a new query, for example with R=10 and N=25. (M. Mitra., Amit. S. and Chris. B, 1998) ran experiments on different query topics and reported that the effectiveness of query expansion mainly depends on the precision of the top N ranked documents. If the top N ranked documents are highly relevant to the original query, query expansion can improve the final result; but if they are less relevant, query expansion cannot improve the final result and may even reduce its precision. These studies conclude that whether query expansion succeeds mainly depends on the quality of the top N ranked documents in the initial retrieval.</Paragraph>
    <Paragraph position="3"> The precision of the top N documents in the initial ranking depends on both the indexing unit and the retrieval model, and mainly on the indexing unit. As discussed above, bi-grams and words are the most effective indexing units in Chinese IR.</Paragraph>
    <Paragraph position="4"> Other efforts have been made to improve the precision of the top N documents. (Qu. Y, 2002) proposed a method to re-rank the initially retrieved documents using an individual thesaurus, but the thesaurus must be constructed manually and depends on each query topic.</Paragraph>
    <Paragraph position="5"> In this paper, we propose a new method to improve the precision of the top N ranked documents by reordering the top M (M &gt; N and M &lt; 1000) documents of the initial retrieval. To reorder documents, we try to find long terms (more than 2 Chinese characters) that generally represent complete concepts in the query and the documents; we then use these long terms to re-weight the top M initially ranked documents and reorder them by the re-weighted value. We adopt a two-stage approach to acquire such long terms: first, we acquire Global Key Terms from the whole document set; second, we use the Global Key Terms to acquire Local Key Terms in a query or a document. Having acquired the Local Key Terms, we use them to re-weight the top M initially ranked documents. Figure 2 illustrates the processes of an IR system that integrates this new method.</Paragraph>
    <Paragraph position="7"/>
  </Section>
  <Section position="4" start_page="0" end_page="133" type="metho">
    <SectionTitle>
3 Global/Local Key Term Extraction
</SectionTitle>
    <Paragraph position="0"> Global/Local Key Term extraction raises the question of what a key term is.</Paragraph>
    <Paragraph position="1"> Intuitively, key terms in a document are conceptual terms that are prominent in the document and play a main role in discriminating it from other documents. In other words, a key term in a document represents part of the document's content. From the point of view of conventional linguistic studies, Key Terms may be NPs, NP phrases, or certain kinds of VPs and adjectives that represent specific concepts in the representation of document content.</Paragraph>
    <Paragraph position="2"> We define two kinds of Key Terms: Global Key Terms which are acquired from the whole document set and Local Key Terms which are acquired from a single document or a query.</Paragraph>
    <Paragraph position="3"> We adopt a two-stage approach to automatically acquire Global Key Terms and Local Key Terms. In the first stage, we acquire Global Key Terms from the document set using a seeding-and-expansion method. In the second stage, we use the acquired Global Key Terms to find Local Key Terms in a single document or a query.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Global Key Terms
</SectionTitle>
      <Paragraph position="0"> Global Key Terms are terms extracted from the whole document set; they can be regarded as representing the main concepts of the document set.</Paragraph>
      <Paragraph position="1"> Although Global Key Terms are difficult to define precisely, we give some assumptions about what a Global Key Term is. Before these assumptions, we first define Seed and Key Term in a document (or document cluster) d.</Paragraph>
      <Paragraph position="2"> The concept of a Seed reflects the prominence of a Chinese character in a document (or document cluster). Suppose r is the reference document set (including the document set itself and another large statistical document collection), d is a document (or a document set), and w is an individual Chinese character in d. Let Pr(w) and Pd(w) be the probabilities of w occurring in r and d respectively. We adopt (1), the relative probability or salience of w in d with respect to r (Schutze. 1998), as the criterion for evaluating a Seed.</Paragraph>
      <Paragraph position="4"> We call w a Seed if Pd(w) / Pr(w) &gt;= δ (δ &gt; 1).</Paragraph>
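The Seed criterion can be sketched directly from this definition. A minimal sketch in Python, assuming documents are given as character sequences; the threshold value used below is illustrative, not the paper's setting:

```python
from collections import Counter

def find_seeds(doc_chars, ref_chars, delta=2.0):
    """Return characters whose salience Pd(w)/Pr(w) reaches the threshold delta.

    doc_chars: characters of the document (or cluster) d.
    ref_chars: characters of the reference collection r.
    delta: salience threshold (> 1); 2.0 is an illustrative choice.
    """
    doc_freq = Counter(doc_chars)
    ref_freq = Counter(ref_chars)
    doc_total = sum(doc_freq.values())
    ref_total = sum(ref_freq.values())
    seeds = set()
    for w, f in doc_freq.items():
        p_d = f / doc_total
        p_r = ref_freq.get(w, 0) / ref_total
        # w is a Seed if it is delta times more probable in d than in r
        if p_r > 0 and p_d / p_r >= delta:
            seeds.add(w)
    return seeds
```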
      <Paragraph position="5"> Now we give the assumptions about Key Terms in a document d.</Paragraph>
      <Paragraph position="6"> i) a Key Term contains at least one Seed. ii) a Key Term occurs at least N (N&gt;1) times in d. iii) the length of a Key Term is less than L (L&lt;30). iv) a maximal character string meeting i), ii) and iii) is a Key Term.</Paragraph>
      <Paragraph position="7"> v) for a Key Term, a real maximal substring meeting i), ii) and iii), without counting its occurrences inside the Key Terms that contain it, is also a Key Term. Here a maximal character string meeting i), ii) and iii) is an adjacent Chinese character string meeting i), ii) and iii) such that no longer Chinese character string containing it meets i), ii) and iii). A real maximal substring meeting i), ii) and iii) is a proper substring meeting i), ii) and iii) such that no longer proper substring containing it meets i), ii) and iii). We use a seeding-and-expansion-based statistical strategy to acquire Key Terms in a document (or document cluster): we first identify the seeds of a Key Term, then expand from them to obtain the whole Key Term.</Paragraph>
      <Paragraph position="8"> Fig. 3 describes the procedure to extract Key Terms from a document (or document cluster) d.</Paragraph>
      <Paragraph position="9"> let Fd(t) represent the frequency of t in d; let N be a given threshold (N&gt;1);</Paragraph>
      <Paragraph position="11"> return K as the Key Terms in document d; Fig. 3 Key Term Extraction from document d. To acquire Global Key Terms, we first roughly cluster the whole document set r into K (K&lt;2000) document clusters; we then regard each document cluster as one large document, apply the proposed Key Term Extraction algorithm (see Fig. 3) to each cluster, and obtain the Key Terms of each cluster. All the Key Terms from all document clusters together form the Global Key Terms.</Paragraph>
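Fig. 3 is given only in outline in this version of the text. A minimal sketch of the seeding-and-expansion idea, under assumptions i)-iv) above, can enumerate seed-containing substrings and keep the maximal ones; this brute-force enumeration is a simplification standing in for the paper's expansion procedure, and the parameter values are illustrative:

```python
from collections import Counter

def extract_key_terms(text, seeds, min_freq=2, max_len=8):
    """Sketch of Key Term extraction from a document (or cluster) text.

    A candidate is a substring that contains at least one Seed (i),
    occurs at least min_freq times (ii), and is shorter than max_len
    (iii). Only maximal candidates are kept (iv).
    """
    counts = Counter()
    n = len(text)
    for i in range(n):
        for j in range(i + 1, min(i + max_len, n) + 1):
            counts[text[i:j]] += 1
    candidates = {t for t, f in counts.items()
                  if f >= min_freq and any(c in seeds for c in t)}
    # maximal: no longer candidate strictly contains the string
    return {t for t in candidates
            if not any(t != u and t in u for u in candidates)}
```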
      <Paragraph position="12"> There are many approaches to clustering a document set; K-Means and hierarchical clustering are the two most commonly used. In our algorithm we do not need a complicated clustering approach, because we only need to roughly cluster the document set r into K clusters. We use a simple K-Means approach: first, we randomly pick 10*K documents from the document set r; second, we use K-Means to cluster these 10*K documents into K clusters; finally, we insert every other document into one of the K clusters.</Paragraph>
      <Paragraph position="13"> Fig. 4 describes the general process to cluster document set r into K document clusters.</Paragraph>
      <Paragraph position="14"> let K be the number of document clusters to get; T &lt;- 10*K documents randomly picked from r; cluster T into K clusters {Kj} using K-Means; for any document d in {r-T} { Ki &lt;- the document cluster with the maximal similarity to d; insert d into document cluster Ki; } return K document clusters {Kj | 1&lt;=j&lt;=K}; Fig. 4 Cluster document set r into K clusters. Fig. 5 describes the procedure for acquiring Global Key Terms from document set r.</Paragraph>
      <Paragraph position="15"> roughly cluster document set r to K document clusters</Paragraph>
      <Paragraph position="17"> return G as the Global Key Terms in document set r; Fig. 5 Global Key Terms Acquisition. During Global Key Term acquisition, the frequency of each Global Key Term is also recorded for later use in identifying Local Key Terms - the terms of a single document or query.</Paragraph>
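The rough clustering step of Fig. 4 can be sketched as follows. The sparse-vector document representation, the cosine similarity, the centroid update, and the fixed iteration count are all illustrative assumptions, not the paper's exact settings:

```python
import math
import random

def cosine(u, v):
    """Cosine similarity of two sparse vectors (dicts term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rough_cluster(docs, k, seed=0):
    """Rough clustering as in Fig. 4: run K-Means on a 10*k random
    sample, then assign every document to its most similar centroid.

    docs: list of sparse vectors. The term-wise-mean centroid update
    and the 5 iterations are simplifications for the sketch.
    """
    rng = random.Random(seed)
    sample = rng.sample(docs, min(10 * k, len(docs)))
    centroids = rng.sample(sample, k)
    for _ in range(5):
        clusters = [[] for _ in range(k)]
        for d in sample:
            best = max(range(k), key=lambda j: cosine(d, centroids[j]))
            clusters[best].append(d)
        for j, members in enumerate(clusters):
            if members:
                merged = {}
                for d in members:
                    for t, w in d.items():
                        merged[t] = merged.get(t, 0.0) + w / len(members)
                centroids[j] = merged
    # final assignment of all documents, including the sample
    assign = [[] for _ in range(k)]
    for d in docs:
        best = max(range(k), key=lambda j: cosine(d, centroids[j]))
        assign[best].append(d)
    return assign
```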
    </Section>
    <Section position="2" start_page="0" end_page="133" type="sub_section">
      <SectionTitle>
3.2 Local Key Terms
</SectionTitle>
      <Paragraph position="0"> Unlike Global Key Terms, Local Key Terms are not extracted from a single document or query by the Key Term extraction algorithm; they are identified based on the Global Key Terms and their frequencies.</Paragraph>
      <Paragraph position="1"> Fig. 6 describes the procedure for acquiring Local Key Terms from a single document or query d.</Paragraph>
      <Paragraph position="2"> Given thresholds M (M&gt;10) and N (N&gt;100), and document d;</Paragraph>
      <Paragraph position="4"> collect the Global Key Terms occurring in d and their frequencies in document set r into S = {&lt;c, tf&gt;}; for all &lt;c, tf&gt; in S</Paragraph>
      <Paragraph position="6"> remove &lt;max-t, max-tf&gt; from S; if max-t occurs in d</Paragraph>
      <Paragraph position="8"> remove all occurrences of max-t in d; for all &lt;b, tf-b&gt; in S where b is a substring of max-t: if tf-b &lt; max-tf, remove &lt;b, tf-b&gt; from S;</Paragraph>
      <Paragraph position="10"> return L as the Local Key Terms in document d; Fig. 6 Local Key Terms Acquisition. Following are some examples of Global Key</Paragraph>
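The greedy procedure of Fig. 6 can be sketched as follows, reading the listing as: repeatedly take the globally most frequent Key Term still present in d, record it, blank out its occurrences, and drop its less frequent substrings from the candidate pool. The function and variable names are illustrative:

```python
def local_key_terms(doc, global_terms):
    """Identify Local Key Terms in a document or query (Fig. 6 sketch).

    doc: document text.
    global_terms: dict mapping each Global Key Term to its frequency
    in the document set r (assumed precomputed, as in Section 3.1).
    """
    # S: Global Key Terms occurring in d, with their global frequencies
    pool = {t: f for t, f in global_terms.items() if t in doc}
    local = []
    text = doc
    while pool:
        term = max(pool, key=pool.get)   # globally most frequent term
        freq = pool.pop(term)
        if term in text:
            local.append(term)
            text = text.replace(term, "\x00")  # remove its occurrences
            # drop less frequent substrings of the accepted term
            for sub in [t for t in pool if t in term and pool[t] < freq]:
                del pool[sub]
    return local
```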
    </Section>
  </Section>
  <Section position="5" start_page="133" end_page="133" type="metho">
    <SectionTitle>
4 Document Reordering
</SectionTitle>
    <Paragraph position="0"> After we have acquired the Global Key Terms of the document set and the Local Key Terms of every document and query, we use them to reorder the top M (M&lt;=1000) documents in the initial ranking. Suppose q is a query,  documents in the initial ranking, where w(t) is the weight assigned to Local Key Term t. w(t) can be assigned different values by different measures, for example: i) w(t) = the length of t; ii) w(t) = the number of Chinese characters in t; iii) w(t) = the square root of the length of t; iv) w(t) = the square root of the number of Chinese characters in t (default). for each document d in the top M ranked documents { sim &lt;- similarity value between d and q;</Paragraph>
    <Paragraph position="2"> set sim as the new similarity between d and q }; reorder the top M documents by their new similarity</Paragraph>
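A sketch of this reordering step follows. The exact formula combining w(t) with the original similarity is not fully recoverable from this version of the text, so adding the summed weights of the Local Key Terms shared by query and document to the cosine score is an assumption; the default weight is option iv) above:

```python
import math

def reorder(query_terms, ranked, doc_terms,
            weight=lambda t: math.sqrt(len(t))):
    """Re-weight and reorder the top-M documents (Section 4 sketch).

    query_terms: set of Local Key Terms of the query q.
    ranked: list of (doc_id, sim) pairs from the initial retrieval.
    doc_terms: dict doc_id -> set of Local Key Terms of that document.
    weight: w(t); default is the square root of the term's length.
    Adding the summed weights to sim is an assumed combination rule.
    """
    rescored = []
    for doc_id, sim in ranked:
        shared = query_terms & doc_terms.get(doc_id, set())
        rescored.append((doc_id, sim + sum(weight(t) for t in shared)))
    rescored.sort(key=lambda p: p[1], reverse=True)
    return rescored
```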
  </Section>
  <Section position="6" start_page="133" end_page="133" type="metho">
    <SectionTitle>
5 Experiment &amp; Evaluation
</SectionTitle>
    <Paragraph position="0"> We use the Chinese document sets CIRB011 (132,173 documents) and CIRB20 (249,508 documents) and the D-run query topic set (42 topics) of the CLIR task at NTCIR3 (see http://research.nii.ac.jp/ntcir-ws3/work-en.html for more information) to evaluate our proposed method. We use the vector space model as our retrieval model and the cosine measure for the similarity between document and query. As indexing units, we use bi-grams and words respectively. To measure the effectiveness of IR, we use the task's two relevance measures: relax-relevant and rigid-relevant. A document is rigid-relevant if it is highly relevant or relevant to a query, and relax-relevant if it is highly relevant, relevant, or partially relevant to a query.</Paragraph>
    <Paragraph position="1"> We also use PreAt10 and PreAt100 to denote the precision of the top 10 and top 100 ranked documents.</Paragraph>
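PreAt10 and PreAt100 are instances of precision at k, which can be computed as:

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """PreAtK: fraction of the top-k ranked documents that are relevant."""
    top = ranked_ids[:k]
    return sum(1 for d in top if d in relevant_ids) / k
```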
    <Paragraph position="2"> When we use our proposed method and algorithm to extract Global Key Terms from document set r, we set the algorithm parameters as follows:  acquire Local Key Terms. (Fig. 6) Table 1 lists the normal results and the enhanced results based on bi-gram indexing. The enhanced results are obtained by applying our method. PreAt10 is the average over the 42 queries of the top-10 precision, and PreAt100 is the average over the 42 queries of the top-100 precision.</Paragraph>
    <Paragraph position="3"> Column 2 (Normal) shows the precision of normal retrieval, column 3 (Enhanced) shows the precision using our proposed approach, and column 4 (Ratio) shows the ratio of column 3 to column 2. Table 2 lists the normal and enhanced results based on word indexing.</Paragraph>
    <Paragraph position="4">  From Table 1 we can see that, with bi-grams as indexing units, our proposed method improves PreAt10 by 11% (from 0.3642 to 0.4052) in the relax relevance measure and by 11% (from 0.2595 to 0.2871) in the rigid relevance measure. Even at the PreAt100 level, our method improves the results by 2% and 4% in the relax and rigid relevance measures respectively. Fig. 8 displays the PreAt10 value of each query in the relax relevance measure based on bi-gram indexing, where the red lines represent the precision enhanced by our method and the black lines the normal precision.</Paragraph>
    <Paragraph position="5"> Among the 42 query topics, only 5 queries have enhanced precisions worse than their normal precisions; the precisions of the other 37 queries are all improved.</Paragraph>
    <Paragraph position="6"> From Table 2, using words as indexing units (we use a dictionary containing 80,000 Chinese items to segment Chinese documents and queries), our method improves PreAt10 by 10% (from 0.3761 to 0.4119) in the relax relevance measure and by 10% (from 0.269 to 0.2952) in the rigid relevance measure.</Paragraph>
    <Paragraph position="7"> Even at the PreAt100 level, our method improves the results by 3% and 5% in the rigid and relax relevance measures respectively.</Paragraph>
    <Paragraph position="8">  In our experiments, with the most important and effective Chinese indexing units, bi-grams and words, our proposed method improves the average precision of all queries at the top-10 level by about 10%. The intuition behind the method is that, in most cases, proper long terms carry more information (position and Chinese character dependence), and such information helps to focus on relevant documents. Our experiments also show that improper long terms may decrease the precision of the top documents, so it is very important to extract the right terms from documents and queries.</Paragraph>
  </Section>
</Paper>