<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1104">
  <Title>A Differential LSI Method for Document Classification</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Main Algorithm
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Basic Concepts
</SectionTitle>
      <Paragraph position="0"> A term is defined as a word or a phrase that appears at least in two documents. We exclude the so-called stop words such as &amp;quot;a&amp;quot;, &amp;quot;the&amp;quot; , &amp;quot;of&amp;quot; and so forth. Suppose we select and list the terms that appear in the documents asa0a2a1a4a3a5a0a7a6a8a3a4a9a4a9a4a9a10a3a5a0a12a11 . For each document a13 in the collection, we assign each of the terms with a real vector a14a16a15 a1a18a17a19a3a15 a6a20a17a19a3a4a9a4a9a4a9a21a3  a17 is the local weighting of the term a0a25 in the document indicating the significance of the term in the document, while  a25 is a global weight of all the documents, which is a parameter indicating the importance of the term in representing the documents. Local weights could be either raw occurrence counts, boolean, or logarithms of occurrence counts. Global ones could be no weighting (uniform), domain specific, or entropy weighting. Both of the local and global weights are thoroughly studied in the literatures (Raghavan and Wong, 1986; Luhn, 1958; van Rijsbergen, 1979; Salton, 1983; Salton, 1988; Lee et al., 1997), and will not be discussed further in this paper. An example will be given below:  ,a65a19a25 is the total number of times that term a0a25 appears in the collection, a41a66a25a17 the number of times the term a0a25 appears in the document a13 , and a50 the number of documents in the collection. The document vector a14a16a15 a1a18a17 a3a15 a6a20a17 a3a4a9a4a9a4a9a21a3a15 a11a22a17 a23 can be normalized as a14a68a67 a1a18a17a69a3a67a6a20a17a49a3a4a9a4a9a4a9a10a3a67a11a70a17a43a23 by the following formula:</Paragraph>
      <Paragraph position="2"> a11 a23 of a cluster can be calculated in terms of the normalized vector as</Paragraph>
      <Paragraph position="4"> is a mean vector of the member documents in the cluster which are normalized as a0 a1a10a3a0 a6a8a3a4a9a4a9a4a9a10a3a0 a77 ; i.e.,</Paragraph>
      <Paragraph position="6"> take a79 itself as a normalized vector of the cluster.</Paragraph>
      <Paragraph position="7"> A differential document vector is defined as a0a55a25 a44</Paragraph>
      <Paragraph position="9"> tors satisfying some criteria as given above.</Paragraph>
      <Paragraph position="10"> A differential intra-document vector a1a3a2 is the differential document vector defined as a0 a25 a44a4a0 a17 , where a0 a25 and a0 a17 are two normalized document vectors of same cluster.</Paragraph>
      <Paragraph position="11"> A differential extra-document vector a1a6a5 is the differential document vector defined as a0 a25 a44a7a0 a17 , where a0 a25 and a0 a17 are two normalized document vectors of different clusters.</Paragraph>
      <Paragraph position="12"> The differential term by intra- and extra-document matrices a1 a2 and a1a8a5 are respectively defined as a matrix, each column of which comprise a differential intra- and extra- document vector respectively. null</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 The Posteriori Model
</SectionTitle>
      <Paragraph position="0"> Any differential term by document a9 -by-a50 matrix of a1 , say, of rank a10a4a11a13a12 a27a15a14a17a16a19a18 a14a20a9 a3a50 a23 , whether it is a differential term by intra-document matrix a1 a2 or a differential term by extra-document matrix a1a3a5 can be decomposed by SVD into a product of three matrices: a1 a27a22a21a24a23a26a25a42a85 , such that a21 (left singular matrix) and a25 (right singular matrix) are an a9 -bya12 and a12 -by-a50 unitary matrices respectively with the first a10 columns of U and V being the eigenvectors of</Paragraph>
      <Paragraph position="2"> where a27 a25 are nonnegtive square roots of eigen values of a1a3a1 a85 , a27a2a25a32a31a34a33 for a35a36a11a37a10 and a27a57a25 a27 a33 for a35a26a31a37a10 . The diagonal elements of a23 are sorted in the decreasing order of magnitude. To obtain a new reduced matrix a23 a77 , we simply keep the k-by-k leftmost-upper corner matrix (a38a40a39a37a10 ) of a23 , deleting other terms; we similarly obtain the two new matrices a21 a77 and a25 a77 by keeping the left most a38 columns of a21 and a25 respectively. The product of a21 a77 , a23 a77 and</Paragraph>
      <Paragraph position="4"> proximately equals to a1 .</Paragraph>
      <Paragraph position="5"> How we choose an appropriate value of a38 , a reduced degree of dimension from the original matrix, depends on the type of applications. Generally we choose a38a40a41 a39a42a33a43a33 for a39a42a33a43a33a43a33a3a11  a11a45a44a46a33a43a33a43a33 , and the corresponding a38 is normally smaller for the differential term by intra-document matrix than that for the differential term by extra- document matrix, because the differential term by extra-document matrix normally has more columns than the differential term by intra-document matrix has.</Paragraph>
      <Paragraph position="6"> Each of differential document vector a12 could find a projection on the a38 dimensional fact space spanned by the a38 columns of a21 a77 . The projection can easily be obtained by a21 a85a77 a12 .</Paragraph>
      <Paragraph position="7"> Noting that the mean a47 a48 of the differential intra(extra-) document vectors are approximately a33 , we may assume that the differential vectors formed follows a high-dimensional Gaussian distribution so that the likelihood of any differential vector a48 will be given by</Paragraph>
      <Paragraph position="9"> a48 , and a65 is the covariance of the distribution computed from the training set ex-</Paragraph>
      <Paragraph position="11"> where a72 a27a75a21 a85 a48 a27 a14a72 a1a4a3 a72 a6a8a3a4a9a4a9a4a9a10a3 a72 a62 a23a85 .</Paragraph>
      <Paragraph position="12"> Because a23 is a diagonal matrix,a65 a14a48 a23 can be represented by a simpler form as: a65a62a14a48 a23 a27 a50 a84a77a76</Paragraph>
      <Paragraph position="14"> . In practice, a27 a25 (a35a81a31a7a38 ) could be estimated by fitting a function (say, a39</Paragraph>
      <Paragraph position="16"> , and a10 is the rank of matrix a1 . In practice, a79 may be chosen as a27</Paragraph>
      <Paragraph position="18"> describes the projection of a48 onto the DLSI space, while a108a24a14  a23 approximates the distance from  to DLSI space.</Paragraph>
      <Paragraph position="19"> When both a49 a14a48a51a50a1 a2 a23 and a49 a14a48a86a50a1a8a5 a23 are computed, the Baysian posteriori function can be computed as:</Paragraph>
      <Paragraph position="21"> where a49 a14a71a1 a2 a23 is set to a39 a72 a50a1a0 where a50a1a0 is the number of clusters in the database 1 while a49 a14a71a1 a5 a23 is set to</Paragraph>
      <Paragraph position="23"/>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Algorithm
</SectionTitle>
      <Paragraph position="0"> 1. By preprocessing documents, identify terms either of the word and noun phrase from stop words.</Paragraph>
      <Paragraph position="1"> 2. Construct the system terms by setting up the term list as well as the global weights.</Paragraph>
      <Paragraph position="2"> 3. Normalize the document vectors of all the collected documents, as well as the centroid vectors of each cluster.</Paragraph>
      <Paragraph position="3"> 4. Construct the differential term by intra-</Paragraph>
      <Paragraph position="5"> followed by the composition of a1 a2a7a6a77 a4 a27</Paragraph>
      <Paragraph position="7"> giving an approximate a1 a2 in terms of an appropriate a38 a4 , then evaluate the likelihood function:</Paragraph>
      <Paragraph position="9"/>
      <Paragraph position="11"> ciently large.</Paragraph>
      <Paragraph position="12"> 6. Construct the term by extra- document matrix</Paragraph>
      <Paragraph position="14"> , such that each of its column is an extra- differential document vector.</Paragraph>
      <Paragraph position="15"> 7. Decompose a1 a5 , by exploiting the SVD algorithm, into a1 a5 a27 a21 a5 a23 a5 a25 a85</Paragraph>
      <Paragraph position="17"> and a79 a24 to a27  terms as well as their frequencies of occurrence in the document, so that a normalized document vector a27 is obtained for the document from equation (1).</Paragraph>
      <Paragraph position="18"> For each of the clusters of the data base, repeat the procedure of item 2-4 below.</Paragraph>
      <Paragraph position="19">  2. Using the document to be classified, construct a differential document vector a48 a27 a27a28a44 a79 , where a79 is the normalized vector giving the center or centroid of the cluster.</Paragraph>
      <Paragraph position="20"> 3. Calculate the intra-document likelihood func-</Paragraph>
      <Paragraph position="22"> 4. Calculate the Bayesian posteriori probability function a49 a14a71a1 a2 a50a48 a23 .</Paragraph>
      <Paragraph position="23"> 5. Select the cluster having a largest a49 a14a71a1 a2 a50a48 a23 as the recall candidate.</Paragraph>
      <Paragraph position="24"> 3 Examples and Comparison</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Problem Description
</SectionTitle>
      <Paragraph position="0"> We demonstrate our algorithm by means of numerical examples below. Suppose we have the following  to Computer related field, a79 a6 to Mathematics, a79 a0 to Physics, anda79 a2 to Chemical Science. We will show, as an example, below how we will set up the classifier to classify the following new document: a27 : &amp;quot;The Elements of Computing Science.&amp;quot; We should note that a conventional matching method of &amp;quot;common&amp;quot; words does not work in this example, because the words &amp;quot;compute&amp;quot; and, &amp;quot;science&amp;quot; in the new document appear in a79 a1 and a79 a2 separately, while the word &amp;quot;elements&amp;quot; occur in both  a6 and a79a12a0 simultaneously, giving no indication on the appropriate candidate of classification simply by counting the &amp;quot;common&amp;quot; words among documents. We will now set up the DLSI-based classifier and LSI-based classifier for this example.</Paragraph>
      <Paragraph position="1"> First, we can easily set up the document vectors of the database giving the term by document matrix by simply counting the frequency of occurrences; then we could further obtain the normalized form as in  The document vector for the new document a27 is given by: a14a71a33</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 DLSI Space-Based Classifier
</SectionTitle>
      <Paragraph position="0"> The normalized form of the centroid of each cluster is shown in Table 2.</Paragraph>
      <Paragraph position="1"> Following the procedure of the previous section, it is easy to construct both the differential term by intra-document matrix and the differential term by extra-document matrix. Let us denote the differential term by intra-document matrix by a1  a44 to test the classifier. Now using equations (3), (4) and (5), we can calculate the a49 a14a48a51a50a1 a2 a23 , a49 a14a48a51a50a1 a5 a23 and finally a49 a14a71a1 a2 a50a48 a23 for each differential document vector a48 a27 a27 a44 a79 a25 (a35  a3a4a3 ) as shown in Table 3. The a79 a25 having a largest a49 a14a71a1 a2 a50a27 a44 a79 a25a23 is chosen as the cluster to which the new document a27 belongs. Because both  . The last row of Table 3 clearly shows that Cluster a79 a6 , that is, &amp;quot;Mathematics&amp;quot; is the best possibility regardless of the parameters a38 a4 a27 a38 a24 a27 a39 or a38 a4 a27 a38 a24 a27 a44 chosen, showing the robustness of the computation.</Paragraph>
    </Section>
    <Section position="6" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 LSI Space-Based Classifier
</SectionTitle>
      <Paragraph position="0"> As we have already explained in Introduction, the LSI based-classifier works as follows: First, employ an SVD algorithm on the term by document matrix to set up an LSI space, then the classification is completed within the LSI space.</Paragraph>
      <Paragraph position="1"> Using the LSI-based classifier, our experiment show that, it will return a79 a0 , namely &amp;quot;Physics&amp;quot;, as the most likely cluster to which the document a27 belongs. This is obviously a wrong result.</Paragraph>
    </Section>
    <Section position="7" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Conclusion of the Example
</SectionTitle>
      <Paragraph position="0"> For this simple example, the DLSI space-based approach finds the most reasonable cluster for the document &amp;quot;The elements of computing science&amp;quot;, while the LSI approach fails to do so.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>