<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1116">
  <Title>Extraction of User Preferences from a Few Positive Documents</Title>
  <Section position="4" start_page="1" end_page="222" type="metho">
    <SectionTitle>
3 Extraction of user preferences
</SectionTitle>
    <Paragraph position="0"> User preferences are extracted from a few example documents through two steps: a) the first step generates a set of keywords called IRKs (Initial Representative Keywords) which corresponds to the initial user query in the relevance feedback techniques of IR and b) these IRKs are expanded and reweighted by a relevance feedback technique.</Paragraph>
    <Paragraph position="1"> It is very important to select IRKs that reflect the user's preferences well from the example or training documents (the set of documents judged relevant by the user), because we have to calculate the term co-occurrence similarity between these IRKs and the candidate terms within each example document.</Paragraph>
    <Paragraph position="2"> Three factors of a term (term frequency, document frequency within positive examples, and IDF) are used to calculate the importance of a specific term.</Paragraph>
    <Paragraph position="3"> Since these factors essentially have inexact and uncertain characteristics, we combine them by fuzzy inference instead of a simple equation.</Paragraph>
    <Paragraph position="4"> The IRKs are selected based on the criterion that each example document has at least one IRK. After selecting the IRKs, we perform a term modification process based on the term co-occurrence similarity between these IRKs and the candidate terms. The Rocchio and Widrow-Hoff algorithms do not consider the term co-occurrence relationships within training documents.</Paragraph>
    <Paragraph position="5"> In contrast, we regard the term co-occurrence relationship as the key factor for calculating the importance of terms, under the assumption that the IRKs reflect the user's preferences well.</Paragraph>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
3.1 Calculation of the Representativeness of
Terms through Fuzzy Inference
</SectionTitle>
      <Paragraph position="0"> The given positive examples are transformed into a set of candidate terms by eliminating stopwords and stemming with Porter's algorithm.</Paragraph>
      <Paragraph position="1"> The TF, DF, and IDF of each term are calculated on this set and used as the inputs of fuzzy inference. We now explain these three input variables. The TF (Term Frequency) is the frequency of a specific term not in a single document but in the set of documents; it is calculated by dividing the total number of occurrences of the term in the document set by the number of documents in the set that contain the term. It needs to be normalized before being used in fuzzy inference. The following shows the normalized term frequency (NTF).</Paragraph>
      <Paragraph position="3"> The DF (Document Frequency) represents the frequency of documents having a specific term within the example documents. The normalized document frequency, NDF, is defined in equation (4).</Paragraph>
      <Paragraph position="5"> The IDF (Inverse Document Frequency) represents the inverse document frequency of a specific term over the entire document collection, not just the example documents. The normalized inverse document frequency, NIDF, is defined as follows:</Paragraph>
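      <Paragraph> As a concrete illustration of how the three fuzzy inputs can be obtained, the following sketch computes TF, DF, and IDF over a set of stemmed, stopword-filtered example documents. Since the paper's normalization equations are not reproduced in this extraction, the max-based and ratio-based normalizations below are assumptions for illustration only.

import math
from collections import Counter

def term_statistics(example_docs, collection_size, collection_doc_freq):
    # example_docs        : list of token lists (already stopword-filtered and stemmed)
    # collection_size     : number of documents in the whole collection (N)
    # collection_doc_freq : dict mapping a term to the number of collection documents containing it
    num_docs = len(example_docs)
    total_occurrences = Counter()   # occurrences of each term over all example documents
    doc_freq = Counter()            # number of example documents containing each term
    for doc in example_docs:
        counts = Counter(doc)
        total_occurrences.update(counts)
        doc_freq.update(counts.keys())

    # TF over the document set: total occurrences of the term divided by the
    # number of example documents that contain it (as described above).
    tf = {t: total_occurrences[t] / doc_freq[t] for t in total_occurrences}
    idf = {t: math.log(collection_size / collection_doc_freq.get(t, 1)) for t in total_occurrences}

    # Assumed normalizations: divide by the maximum value (TF, IDF) or by the
    # number of example documents (DF), so that every input lies in [0, 1].
    max_tf, max_idf = max(tf.values()), max(idf.values())
    ntf = {t: tf[t] / max_tf for t in tf}
    ndf = {t: doc_freq[t] / num_docs for t in doc_freq}
    nidf = {t: idf[t] / max_idf for t in idf}
    return ntf, ndf, nidf
</Paragraph>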
      <Paragraph position="7"> Figure 1 shows the membership functions of the input/output variables - 3 inputs (NTF, NDF, NIDF) and 1 output (TW) - used in our method. As shown in Figure 1(a), the NTF variable has { S(Small), L(Large) } as linguistic labels (or terms), while the NDF and NIDF variables have { S(Small), M(Middle), L(Large) }. The fuzzy output variable TW (Term Weight), which represents the importance of a term, has six linguistic labels, as shown in Figure 1(b).</Paragraph>
      <Paragraph position="8"> Eighteen fuzzy rules are used to infer the term weight (TW). The rules are constructed based on the intuition that important or representative terms occur across many positive example documents but not in general documents, i.e., their NDF and NIDF are very high. As shown in Table 1, the TW of a term is Z in most cases, regardless of its NDF and NTF, if its NIDF is S, because such a term may occur frequently in any document and thus its NDF and NTF can be high. When the NDF of a term is high and its NIDF is also high, the term is considered a representative keyword and the output value is between X and XX. The other rules were set similarly.</Paragraph>
      <Paragraph position="10"> We obtain the term weight TW through the following procedure. The output, however, is a fuzzy set and thus has to be converted to a crisp value; in this paper, the center of gravity method is used to defuzzify the output (Lee, 1990).</Paragraph>
      <Paragraph position="11"> * Apply the NTF, NDF, and NIDF fuzzy values to the antecedent portions of 18 fuzzy rules.</Paragraph>
      <Paragraph position="12"> * Find the minimum value among the membership degrees of the three input fuzzy values. * Classify the resulting 18 membership degrees into 6 groups according to the fuzzy output variable TW.</Paragraph>
      <Paragraph position="13"> * Calculate the maximum output value for each group and then generate 6 output values.</Paragraph>
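      <Paragraph> To make these four steps concrete, the sketch below runs a min-based (Mamdani-style) inference over a handful of illustrative rules and defuzzifies with a centre-of-gravity approximation over singleton output centres. The membership functions, the rule list, and the output centres are placeholders only; they are not the actual Figure 1 membership functions or the 18 rules of Table 1.

def triangular(x, a, b, c):
    # Triangular membership function peaking at b (an illustrative shape).
    return max(0.0, min((x - a) / (b - a), (c - x) / (c - b)))

# Illustrative linguistic labels for the inputs; NTF uses only S and L,
# while NDF and NIDF may also use M.
INPUT_SETS = {
    "S": lambda x: triangular(x, -0.5, 0.0, 0.5),
    "M": lambda x: triangular(x, 0.0, 0.5, 1.0),
    "L": lambda x: triangular(x, 0.5, 1.0, 1.5),
}
# Illustrative centres for the six output labels of TW, used for defuzzification.
OUTPUT_CENTRES = {"Z": 0.0, "S": 0.2, "M": 0.4, "L": 0.6, "X": 0.8, "XX": 1.0}

# A few illustrative rules (labels for NTF, NDF, NIDF mapped to a TW label).
RULES = [
    (("S", "S", "S"), "Z"),
    (("L", "L", "S"), "Z"),    # NIDF small: TW is Z regardless of NTF and NDF
    (("L", "L", "L"), "XX"),   # high NDF and high NIDF: representative keyword
    (("S", "L", "L"), "X"),
]

def term_weight(ntf, ndf, nidf):
    # Steps 1-2: the firing strength of a rule is the minimum membership degree
    # of its three antecedents. Step 3: keep the maximum strength per output
    # label. Step 4: defuzzify the group values by a centre-of-gravity
    # approximation over the singleton output centres.
    strengths = {}
    for (l_ntf, l_ndf, l_nidf), out in RULES:
        s = min(INPUT_SETS[l_ntf](ntf), INPUT_SETS[l_ndf](ndf), INPUT_SETS[l_nidf](nidf))
        strengths[out] = max(strengths.get(out, 0.0), s)
    num = sum(s * OUTPUT_CENTRES[lab] for lab, s in strengths.items())
    den = sum(strengths.values())
    return num / den if den else 0.0
</Paragraph>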
    </Section>
    <Section position="2" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
3.2 Selection of Initial Representative Keywords
</SectionTitle>
      <Paragraph position="0"> After the term weights of the candidate terms are calculated through fuzzy inference, some candidate terms are selected as IRKs based on their weights, with the constraint that each example document should contain at least one IRK. The algorithm for selecting IRKs is given in Figure 2. Let us consider the following example to understand our selection procedure.</Paragraph>
      <Paragraph position="1"> i) An example document set, DS, is composed of documents d1, d2, d3, d4, d5, and d6. Each document contains the following terms:</Paragraph>
      <Paragraph position="3"> If we apply the algorithm in Figure 2 to this example, the temporary variables in lines 2, 3, and 4 are first initialized. The statement block from line 5 to line 14 is executed repeatedly until at least one IRK has been extracted from every example document in DS. Let us assume that the documents in the example document set are processed in sequence. After the first pass over the block (lines 5 to 14), ITS contains only the term "a". There is no change in ITS after the second pass, because the term "a" has already been included in ITS. After d3 is processed in the third pass, the term "d" is newly added, so ITS is now {a, d}. After d4, d5, and d6 are processed in turn, nothing, the term "b", and the term "e" are added to ITS, respectively. Therefore the algorithm returns ITS as the set of terms {a, b, d, e}. This shows that the algorithm in Figure 2 works according to our constraint.</Paragraph>
      <Paragraph position="4"> Input: DS (Example Documents Set), TS (Candidate Terms Set)
1] Procedure get_ITS(DS, TS)
2] ITS: Initial Representative Terms Set, initialized to empty.
3] TS': Temporary Terms Set, initialized to TS.</Paragraph>
      <Paragraph position="5"> 4] d, t: Document and Term element respectively.</Paragraph>
      <Paragraph position="6"> 5] Repeat 6] Select a document element as d from DS.</Paragraph>
      <Paragraph position="7"> 7] Repeat 8] Select the highest element as t in TS' according to the weight.</Paragraph>
      <Paragraph position="8"> 9] If t appears in d and is not a member of ITS Then Add t to ITS.</Paragraph>
      <Paragraph position="9"> 10] Remove t from TS'.</Paragraph>
      <Paragraph position="10"> 11] Until t appears in d.</Paragraph>
      <Paragraph position="11"> 12] Remove d from DS.</Paragraph>
      <Paragraph position="12"> 13] Assign TS to TS'.</Paragraph>
      <Paragraph position="13"> 14] Until DS is empty.</Paragraph>
      <Paragraph position="14"> 15] Return ITS.</Paragraph>
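      <Paragraph> For reference, the Figure 2 procedure can be transcribed directly into Python as below. The document contents sketched in the usage comment are hypothetical, since the original listing of d1-d6 is not preserved in this extraction.

def get_ITS(DS, TS, weight):
    # DS     : list of documents, each represented as a set of candidate terms
    # TS     : candidate terms set
    # weight : dict mapping each term to the weight obtained by fuzzy inference (Section 3.1)
    ITS = []                                   # line 2
    remaining_docs = list(DS)
    while remaining_docs:                      # lines 5-14: until every document is covered
        d = remaining_docs.pop(0)              # lines 6 and 12
        TS_prime = sorted(TS, key=lambda t: weight[t], reverse=True)   # lines 3 and 13
        while TS_prime:                        # lines 7-11
            t = TS_prime.pop(0)                # lines 8 and 10: take the highest-weight term
            if t in d:                         # line 11: stop once a term of d has been reached
                if t not in ITS:               # line 9
                    ITS.append(t)
                break
    return ITS                                 # line 15

# Hypothetical usage consistent with the walk-through above: with documents d1-d6
# and term weights chosen so that the first matching term is "a" for d1 and d2,
# "d" for d3, an already selected term for d4, "b" for d5, and "e" for d6,
# the call returns ["a", "d", "b", "e"].
</Paragraph>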
    </Section>
    <Section position="3" start_page="1" end_page="222" type="sub_section">
      <SectionTitle>
3.3 Automatic Expansion and Reweighting of
IRKs
</SectionTitle>
      <Paragraph position="0"> After the IRKs are selected, additional terms to be expanded are selected in the order of their weights, which are calculated by the method in Section 3.1.</Paragraph>
      <Paragraph position="1"> Let us assume that 5 terms are used to represent a user's preference and the number of IRKs is 3.</Paragraph>
      <Paragraph position="2"> Then, the 2 non-IRK terms with the highest weights are additionally selected. The IRKs and these terms constitute the final representative keywords (FRKs) and are reweighted by considering their co-occurrence similarity with the IRKs. To this end, the relevance degrees of the FRKs in every document are calculated with equation (6). Each positive example document represents content the user prefers; in other words, each document tends to contain general, specific, or partial contents. We regard the IRKs as the essential terms of the given positive examples, so the possibility that related terms, e.g., synonyms or collocated terms, occur together with these IRKs in the same document set increases.</Paragraph>
      <Paragraph position="4"> In equation (6), RD_ik is the relevance degree between the IRKs and the candidate term t_i in document d_k, n is the number of IRKs, and p is a control parameter. In our experiments, p is set to 10. RD_ik is treated as 0 if it has a negative value. For example, let K be a set of IRKs consisting of the terms k1, k2, and k3, whose frequencies in document d1 are 4, 3, and 1, respectively; the relevance degree of a candidate term t_i in d1 then depends on how close its frequency is to these values.</Paragraph>
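      <Paragraph> Equation (6) itself is not preserved in this extraction. One candidate form consistent with the description that follows (inverse proportionality to the summed term frequency differences, a control parameter p, and clipping of negative values to 0) is the sketch below; this is an assumption, not the paper's exact formula.

RD_{ik} = \max\left( 0,\; 1 - \frac{1}{n \cdot p} \sum_{j=1}^{n} \left| tf(k_j, d_k) - tf(t_i, d_k) \right| \right)

Here tf(x, d_k) denotes the frequency of term x in document d_k, k_1, ..., k_n are the IRKs, and p is the control parameter (set to 10 in the experiments).</Paragraph>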
      <Paragraph position="8"> The relevance degree is inversely proportional to the sum of the term frequency differences between the initial representative terms and the candidate term. So, the higher the value of RD is, the more similar the co-occurrence is; that is, the equation appropriately reflects the co-occurrence similarity between the initial representative terms and a candidate term. After calculating the relevance degree of a candidate term, the weight of the term in the set of example documents is determined by equation (7).</Paragraph>
      <Paragraph position="10"> In equation (7), the normalization term is the number of example documents.</Paragraph>
      <Paragraph position="11"> Equation (7) is a modification of Rocchio's formula in Section 2. Unlike that formula, we additionally use the relevance degree between the initial representative terms and a candidate term. Let us assume that the IDF value of the candidate term t_i is 1.0 and that it occurs 3, 2, and 1 times in documents d1, d2, and d3, respectively. If the relevance degrees for the three documents are also assumed to be 0.3, 0.5, and 0.7, respectively, then the weight of the candidate term t_i is calculated as below.</Paragraph>
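      <Paragraph> Since the body of equation (7) is not preserved in this extraction, the value below is only a sketch under the assumption that the weight averages the product of relevance degree, term frequency, and IDF over the example documents:

w_i \approx \frac{1}{3} \sum_{k=1}^{3} RD_{ik} \cdot tf_{ik} \cdot idf_i = \frac{0.3 \cdot 3 + 0.5 \cdot 2 + 0.7 \cdot 1}{3} \cdot 1.0 \approx 0.87
</Paragraph>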
      <Paragraph position="12"> Instead of using the weight obtained by fuzzy inference, the initial weight w_i is recalculated by equation (9) if the term is in the IRKs, and is set to 0 otherwise. Equation (9) is the one introduced to assign a weight to an initial query term in IR systems based on the vector space model (Baeza-Yates and Ribeiro-Neto, 1999).</Paragraph>
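      <Paragraph> Equation (9) is likewise not reproduced in this extraction. The standard initial query-term weight in the vector space model described by Baeza-Yates and Ribeiro-Neto (1999), to which the text refers, has the form

w_i = \left( 0.5 + \frac{0.5 \, freq_i}{\max_l freq_l} \right) \cdot \log \frac{N}{n_i}

where freq_i is the frequency of term t_i, n_i is the number of documents in which t_i appears, and N is the total number of documents; whether the paper adopts exactly this form is an assumption.</Paragraph>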
      <Paragraph position="15"> In equation (9), n_i is the number of documents in which the term appears and N is the total number of documents. Let K = {t1, t3, t4} be the set of IRKs, WK = {3.0, 2.0, 1.0} be the set of their weights calculated by equation (9), T = {t1, t2, t3, t4, t5} be the set of FRKs, and WT = {5.0, 4.0, 3.0, 2.0, 1.0} be their weights obtained through equation (7). Then we can get the final weights of the FRKs, {8.0, 4.0, 5.0, 3.0, 1.0}.</Paragraph>
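      <Paragraph> The final combination in this example can be reproduced in a few lines under the assumption, consistent with the numbers above, that the final weight of an FRK is the element-wise sum of its equation (7) weight and its equation (9) weight (0 for non-IRK terms).

WT = {"t1": 5.0, "t2": 4.0, "t3": 3.0, "t4": 2.0, "t5": 1.0}   # weights from equation (7)
WK = {"t1": 3.0, "t3": 2.0, "t4": 1.0}                          # IRK weights from equation (9)
final = {t: WT[t] + WK.get(t, 0.0) for t in WT}
print(final)   # {'t1': 8.0, 't2': 4.0, 't3': 5.0, 't4': 3.0, 't5': 1.0}
</Paragraph>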
    </Section>
  </Section>
</Paper>