<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-2024">
  <Title>Information Retrieval Capable of Visualization and High Precision</Title>
  <Section position="3" start_page="138" end_page="140" type="metho">
    <SectionTitle>
2 Self-organizing documentary maps and ranking related documents
</SectionTitle>
    <Paragraph position="0"> A SOM can be visualized as a two-dimensional array of nodes onto which high-dimensional input vectors can be mapped in an orderly manner through a learning process. [Footnote 2: For a specific query, the other queries and documents in the map are considered to be irrelevant (i.e., documents unrelated to that query). The map is therefore equivalent to a map consisting of one query and its related and unrelated documents, which is what will be adopted in the practical IR system that we aim to develop.]</Paragraph>
    <Paragraph position="1"> After the learning, a meaningful nonlinear coordinate system for the different input features is created over the network.</Paragraph>
    <Paragraph position="2"> This learning process is competitive and unsupervised and is called a self-organizing process.</Paragraph>
    <Paragraph position="3"> Self-organizing documentary maps are ones in which given queries and all related documents in the collection are mapped in order of similarity, i.e., queries and documents with similar content are mapped to (or best-matched by) nodes that are topographically close to one another, and those with dissimilar content are mapped to nodes that are topographically far apart. Ranking is the procedure of ranking documents related to each query from the map by calculating the Euclidean distances between the points of the queries and the points of the documents in the map and choosing the N closest documents as the retrieval result.</Paragraph>
    <Section position="1" start_page="138" end_page="138" type="sub_section">
      <SectionTitle>
2.1 Data
</SectionTitle>
      <Paragraph position="0"> The queries are those used in a dry run of the 1999 IREX contest and the documents relating to the queries are original Japanese newspaper articles used in the contest as the correct answers. In this study, only nouns (including Japanese verbal nouns) were selected for use.</Paragraph>
    </Section>
    <Section position="2" start_page="138" end_page="140" type="sub_section">
      <SectionTitle>
2.2 Data coding
</SectionTitle>
      <Paragraph position="0"> Suppose we have a set of queries:</Paragraph>
      <Paragraph position="2"> where q is the total number of queries, and a set of documents: A = fAi j (i = 1;C/C/C/ ;q;j = 1;C/C/C/ ;ai)g; (2) where ai is the total number of documents related to Q i. For simplicity, where there is no need to distinguish between queries and documents, we use the same term &amp;quot;documents&amp;quot; and the same notation Di to represent either a query Q i or a document Ai j. That is, we define a new set</Paragraph>
      <Paragraph position="4"> which includes all queries and documents. Here, d is the total number of queries and documents,</Paragraph>
      <Paragraph position="6"> Each document, Di, can then be defined by the set of nouns it contains as</Paragraph>
      <Paragraph position="8"> where noun(i)k (k = 1;C/C/C/ ;ni) are all different nouns in the document Di and w(i)k is a weight representing the importance of noun(i)k (k = 1;C/C/C/ ;ni) in document Di. The weights are computed by their tf or tfidf values. That is,</Paragraph>
      <Paragraph position="10"> In the case of using tf, the weights are normalized such that</Paragraph>
      <Paragraph position="12"> Also, when using the Japanese thesaurus, Bunrui Goi Hyou (The National Institute for Japanese Language, 1964) (BGH for short), synonymous nouns in the queries are added to the sets of nouns from the queries shown in Eq. (5) and their weights are set to be the same as those of the original nouns.</Paragraph>
      <Paragraph position="13"> Suppose we have a correlative matrix whose element dij is some metric of correlation, or a similarity distance, between the documents Di and Dj; i.e., the smaller the dij, the more similar the two documents. We can then code document Di with the elements in the i-th row of the correlative matrix as</Paragraph>
      <Paragraph position="15"> The V(Di) 2 &lt;d is the input to the SOM. Therefore, the method to compute the similarity distance dij is the key to creating the maps. Note that the individual dij of vector V(Di) only reflects the relationships between a pair of documents when they are considered independently.</Paragraph>
      <Paragraph position="16"> To establish the relationships between the document Di and all other documents, representations such as vector V(Di) are required. Even if we have these high-dimensional vectors for all the documents, it is still difficult to establish their global relationships. We therefore need to use an SOM to reveal the relationships between these high-dimensional vectors and represent them two-dimensionally. In other words, the role of the SOM is merely to self-organize vectors; the quality of the maps created depends on the vectors provided.</Paragraph>
      <Paragraph position="17"> In computing the similarity distance dij between documents, we take two factors into account: (1) the larger the number of common nouns in two documents, the more similar the two documents should be (i.e., the shorter the similarity distance); (2) the distance between any two queries should be based on their application to the IR processing; i.e., by considering the procedure used to rank the documents relating to each query from the map. For this reason, the documentsimilarity distance between queries should be set to the largest value. To satisfy these two factors, dij is calculated as follows:  where jDij and jDjj are values (the numbers of elements) of sets of documents Di and Dj defined by Eq. (5) and jCijj is the value of the intersection Cij of the two sets Di and Dj. jCijj is therefore some metric of document similarity (the inverse of the similarity distance dij) between documents Di and Dj which is normalized by jDij+jDjj!jCijj. Before describing the methods for computing them, we first rewrite the definition of documents given by Eq. (5) for Di and Dj as follows.</Paragraph>
      <Paragraph position="19"> where ck (k = 1;C/C/C/ ;l) are the common nouns of documents Di and Dj and n(i)k (k = 1;C/C/C/ ;mi) and n(j)k (k = 1;C/C/C/ ;mj) are nouns of documents Di and Dj which differ from each other. By comparing Eq. (5) and Eqs. (10) and (11), we know  that l+mi +mj = ni +nj. Thus, jDij (or jDjj) of Eq. (9) can be calculated as follows.</Paragraph>
      <Paragraph position="21"> For calculating jCijj, on the other hand, since the weights (of either common or different nouns) generally differ between two documents, we devised four methods which are expressed as follows. null</Paragraph>
      <Paragraph position="23"> Note that we need not consider the case where both are queries for calculating jCijj because this has been considered independently as shown by Eq. (9).</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="140" end_page="142" type="metho">
    <SectionTitle>
3 Experimental Results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="140" end_page="140" type="sub_section">
      <SectionTitle>
3.1 Data
</SectionTitle>
      <Paragraph position="0"> Six queries Q i (i = 1;C/C/C/ ;q, q = 6) and 433 documents Ai j (i = 1;C/C/C/ ;q, q = 6, j =  of the 1999 IREX contest were used for our experiments. The distribution of these documents, i.e., the number ai (i = 1;C/C/C/ ;q, q = 6) of documents related to each query, is shown in Table 1. It should be noted that since the proposed IR approach will be slotted into a practical IR system in the second phase in which a small number (say below 1,000, or even below 500) of the related documents should have been collected, this experimental scale is definitely a practical one.</Paragraph>
    </Section>
    <Section position="2" start_page="140" end_page="140" type="sub_section">
      <SectionTitle>
3.2 SOM
</SectionTitle>
      <Paragraph position="0"> We used a SOM of a 40PS40 two-dimensional array. Since the total number d of queries and documents to be mapped was 439, i.e., d = q +P  i=1 ai = 439, the number of dimensions of input n was 439. In the ordering phase, the number of learning steps T was set at 10,000, the initial value of the learning rate fi(0) at 0.1, and the initial radius of the neighborhood (0) at 30. In the fine adjustment phase, T was set at 15,000, fi(0) at 0.01, and (0) at 5. The initial reference vectors mi(0) consisted of random values between 0 and 1.0.</Paragraph>
    </Section>
    <Section position="3" start_page="140" end_page="142" type="sub_section">
      <SectionTitle>
3.3 Results
</SectionTitle>
      <Paragraph position="0"> We first performed a preliminary experiment and analysis to determine which of the four methods was the optimal one for calculating jCijj shown in Eqs. (13)-(16). Table 2 shows the IR precision, i.e., the precision of the ranking results obtained from the self-organized documentary maps created using the four methods. The IR precision was calculated by follows.</Paragraph>
      <Paragraph position="2"> where q is the total number of queries, # means number, and ai is the total number of documents related to Q i as shown in Table 1.</Paragraph>
      <Paragraph position="3"> In the case of using tf values as weights of nouns, method B obviously did not work. Al- null though the similarity between queries was mandatorily set to the largest value, all six queries were mapped in almost the same position, thus producing the poorest result. We consider the reason for this was as follows. In general, the number of words in a query is much smaller than the number of words in the documents, and the number of queries is much smaller than the number of documents collected. As described in section 2, each query was defined by a vector consisting of all similarities between the query and five other queries and all documents in the collection. We think that using the average weights of words appearing in the queries and documents to calculate the similarities between queries and documents, as in method B, tends to produce similar vectors for the queries. All of these query vectors are then mapped to almost the same position. With coding method A, because the larger of the two weights of a query and a document is used, the same problem could also arise in practice. There were no essential differences between coding methods C and D, which were almost equally precise. Neither of these methods have the shortcomings described above for methods A and B. However, when tfidf values were used as the weights of the nouns, even methods A and B worked quite well. Therefore, if we use tfidf values as the weights of the nouns, we may use either of the four methods. Based on this analysis and the preliminary experimental result that method C and D had highest precisions in the cases of using tf and tfidf values as weights of the nouns, respectively, we used methods C and D for calculating jCijj in all the remaining experiments.</Paragraph>
      <Paragraph position="4"> Table 3 shows the IR precision obtained using various methods. From this table we can see that the proposed method in the case of SOM (w=tfidf, C), i.e., using method C for calculating jCijj, using tfidf values as the weights of nouns, and not using the Japanese thesaurus (BGH), in the case of SOM (w=tfidf, D), i.e., using method D, using tfidf values, and not using the BGH, and in  the case of SOM (w=tfidf, C, BGH), i.e., using method C, using tfidf values, and using the BGH produced the highest, second highest, and third highest precision, respectively, of all the methods including the conventional TFIDF method. When the BGH was used, however, the IR precision of the proposed method dropped inversely, whereas that of the conventional TFIDF improved. The lower precision of the proposed method when using BGH might be due to the calculation of the denominator of Eq. (9); this will be investigated in future study.</Paragraph>
      <Paragraph position="5"> Table 4 shows the IR precision obtained using various methods when the retrieval process is focused on the top N related documents. From this table we can see that the IR precision of the proposed method, no matter whether the BGH was used or not, or whether method C or D was used for calculating jCijj, was much higher than that of the conventional TFIDF method when the process was focused on retrieving the most relevant documents. This result demonstrated that the proposed method might be especially useful for picking highly relevant documents, thus greatly improving the precision of IR.</Paragraph>
      <Paragraph position="6"> Figure 1 shows the left-top area of a self-organized documentary map obtained using the proposed method in the case of SOM (w=tfidf, D)3. From this map, we can see that query Q 4 3Note that the map obtained using the proposed method in the case of SOM (w=tfidf, C), which had the highest IR precision, was better than this.</Paragraph>
      <Paragraph position="7">  and its related documents A4 / (where * denotes an Arabic numeral), Q 2 and its related documents A2 / were mapped in positions near each other. Similar results were obtained for the other queries which were not mapped in the area of the figure. This map provides visible and continuous retrieval results in which all queries and documents are placed in topological order according to their similarities. The map provides an easy way of finding documents related to queries and also shows the relationships between documents with regard to the same query and even the relationships between documents across different queries. Finally, it should be noted that each map that consists of 400 to 500 documents was obtained in 10 minutes by using a personal computer with a</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="142" end_page="142" type="metho">
    <SectionTitle>
4 Conclusion
</SectionTitle>
    <Paragraph position="0"> This paper described a neural-network based self-organizing approach that enables information retrieval to be visualized while improving its precision. This approach has a practical use by slotting it into a practical IR system as the second-phase processor. Computer experiments of practical scale showed that two-dimensional documentary maps in which queries and documents are mapped in topological order according to their similarities can be created and that the ranking of the results retrieved using the created maps is better than that produced using a conventional TFIDF method. Furthermore, the precision of the proposed method was much higher than that of the conventional TFIDF method when the process was focused on retrieving the most relevant documents, suggesting that the proposed method might be especially suited to information retrieval tasks in which precision is more important than recall.</Paragraph>
    <Paragraph position="1"> In future work, we first plan to re-confirm the effectiveness of using the BGH and to further improve the IR accuracy of the proposed method.</Paragraph>
    <Paragraph position="2"> We will then begin developing a practical IR system capable of visualization and high precision using a two-phase IR procedure. In the first phase, a large number of related documents are gathered from newspapers or websites in response to a query presented using conventional IR; the second phase involves visualization of the retrieval results and picking the most relevant results.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML