<?xml version="1.0" standalone="yes"?>
<Paper uid="X96-1033">
  <Title>A SIMPLE PROBABILISTIC APPROACH TO CLASSIFICATION AND ROUTING</Title>
  <Section position="4" start_page="0" end_page="169" type="metho">
    <SectionTitle>
2.2. The Mathematical Model
</SectionTitle>
    <Paragraph position="0"> The mathematical model used here is to represent each category as a multinomial distribution. Parameters are estimated from the frequency of certain sets of words and phrases (the'distinguishing word sets') found in the training collections.</Paragraph>
    <Paragraph position="1"> Previous results (Guthrie et al 1994) indicate that the simple statistical technique of the maximum likelihood ratio test would, under certain conditions, give rise to an excellent classification scheme for documents. Previous theoretical results were verified using two classes of documents, and excellent recall and precision scores were achieved for distinguishing topics (previous tests were conducted in both Japanese and English). In this paper we both extend the classification scheme to include any number of topics and modify the scheme to also perform routing.</Paragraph>
    <Paragraph position="2"> In modeling a class of text, our technique requires that we identify a set of key concepts, or distinguishing words and phrases. The intuition is given in the example above, but in this work we want to automate the process of choosing word sets in a way that results in sets of 'distinguishing concepts'.</Paragraph>
    <Paragraph position="3"> In (Guthrie et al 1994), it was shown that if the probabilities of the distinguishing word sets in each of the classes is known, we can predict the probability of correct classification. Our goal eventually is to define an algorithm for choosing 'distinmlishing word sets' in an optimal way; i.e. a way that will maximize the probability of correct classification. The method we use now (described in section 4.1.) is empirical, but allows us to guarantee excellent classification results.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3. Common Approaches
</SectionTitle>
      <Paragraph position="0"> Schemes for classification and routing all teild to follow a particular paradigm:  1. Represent each class (or topic or profile or bucket) as a numerical object.</Paragraph>
      <Paragraph position="1"> 2. Represent each new document that arrives as a numerical object.</Paragraph>
      <Paragraph position="2"> 3. Measure the 'similarity' between the new document and each of the classes.</Paragraph>
      <Paragraph position="3"> 4. For Classification - Place the new document  in the category corresponding to the class (or bucket or prc~'fle) to which it is most similar. For Routing - Rank the document in the class using some function of the similarity measure.</Paragraph>
      <Paragraph position="4"> AlthonPSh many similarity measures have been studied, two of them seem to have gained popularity in the recent literature: the Cosine and tf.idf measures. The Cosine measure is used when a document is represented as a multi-dimensional vector, and a document is defreed as more similar to Class 1 than Class 2 if its corresponding vector is closer to that of Class 1 than to that of Class 2. In ff.idf a document is more similar to Class 1 than Class 2 if more terms match the Class 1 terms than do the Class 2 terms. In our work a document is more similar to Class 1 than Class 2 if the probability of it belonging to Class 1 is greater than the probability of it belonging to Class 2.</Paragraph>
      <Paragraph position="5"> In choosing a representation of a class or a representation of a document, much of the current research in classification and routing is focused on choosing the best set of terms (in our case, we call them Distinguishing Terms) to represent it. Many systems start with prevalent but not common (so that words such as 'the' and 'to' are not used) words and phrases in the class training set. The training set may be as small as the initial query which defined the class or as large as all of the documents which are available which are deemed to be relevant to the class. If this set of terms is too small, feedback is generally employed in which the full corpus of documents to be classified and routed is compared to the set, prevalent words and phrases from highly ranked retrieved documents are added to the set, and the full corpus is run again against the larger set of terms.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="168" type="sub_section">
      <SectionTitle>
2.4. Probabilistic Classification Approach
Using Multinomial Distribution
</SectionTitle>
      <Paragraph position="0"> A probabilistic method for classification was proposed by Guthrle and Walker \[1\], which assumed each class was distributed by the multinomial distribution.</Paragraph>
      <Paragraph position="1"> Elementary statistics tells us that a maximum likelihood ratio test is the best way to calculate the probability that a set of outcomes was produced by a given input. In the example below, we assume a multinomial distribution for our dice and fred the largest conditional probability of getting a certain output given a certain input. For ex- null ample, consider the set of outcomes produced by rolling one of two single six-sided dice. One of the dice is fair and one is loaded to be more likely to give a '6' outcome. Let us assign the expected probabilities for the outcomes for each of the two dice.</Paragraph>
      <Paragraph position="2">  Using the multinomiai distribution, we may calculate which is the more likely die to have produced each of the outputs. The multinomial equation is shown below, for the case of 6 possible outcomes.</Paragraph>
      <Paragraph position="4"> Using the probabilities assigned to each die for Pl through P6, and the number of times each outcome occurred for nl through n6, and the total number of outcomes for n, the following probabilities of producing each output given that a particular die was used are calculated. null Output Fair Die Loaded Die set 1 3.46 x 10 -4 1.33 x 10- 7 set 2 4.09 x 10- 6 5.25 x 10- 4 set 3 7.07 x 10- s 4.71 x 10- 5 Table 2.3-3. Probability of Output The most likely die to produce each output is the one with the maximum probability. We can see that these probabilities are an excellent measure for determining which of the dice was more likely to be used to generate each of the sets of outcomes. Set 1, which has a fairly uniform distribution, is much more likely to have been created with the fair die than the loaded one. Set 2, which has nearly half of the outcomes as '6', is much more likely to have been created with the loaded die than the fair one. Set 3 does not have an obvious distribution. It has more '6' outccanes than would be expected with the fair die. but not as many as would be expected with the loaded die. As it turns out, it is just slightly more likely that the fair die was used to generate set 3.</Paragraph>
      <Paragraph position="5"> Applying this approach to the document classification problem, we may define the outcomes to be the sets of Distinguish Terms which deAr'me the classes. The expected probabilities are then the sum of the frequencies of the Distinguishing Terms in each of the classes divided by the training set lengths. The outputs are the counts of how many of the Distinguishing Terms from each class are evident in a document. Since to create a multinc~nial distribution all possible outcomes must be accounted for, an additional count is kept of all of the words in a document are not members of any of the Disfinguishing Term sets. The expected probability for this set of words is 1.0 minus the sum of the probabilities of all of the Distinguishing Terms in the Iraining set.</Paragraph>
    </Section>
    <Section position="3" start_page="168" end_page="169" type="sub_section">
      <SectionTitle>
2.5. Probabilistic Routing Approach Using Multinomial Distribution
</SectionTitle>
      <Paragraph position="0"> Expa~dinoC/ this approach to the routing problem, we want to fred the most likely class given the probabilities of the outputs. This can be calculated with Bayes' Theoreln, using the assumption that all classes have equally likely occurrences.</Paragraph>
      <Paragraph position="2"> Continuing the example with the fair and the loaded die, the sets are assigned probabilities that they belong to each of the classes given the fact that they have a certain set of outcomes. This would result in the following probabilities.</Paragraph>
      <Paragraph position="3">  Sorting these probabilities, we get the expected resuits; set 1 is the output most likely to have been created with the fair die and set 2 the least, and set 2 is the output most likely to have been created with the loaded die and set 1 the least.</Paragraph>
      <Paragraph position="4"> Comparing these routing results to the classification results, the question may be raised why the probability that a set is from a class needs to be calculated. Ranking with the probability of getting the outputs (Table 2.3-3) would have given the same ranking. But now consider the case in which set 3 was ten times larger, as shown in the table below.</Paragraph>
      <Paragraph position="5">  Our expectation is still that set 3 should be ranked in the middle, between sets 1 and 2 for each die. Calculating the probabilities of getting these outputs, we get the following table.</Paragraph>
      <Paragraph position="6"> Output Fair Die Loaded Die set 1 3.46 x 10 --4 1.33 x 10 -7 set 2 4.09 x 10- 6 5.25 x 10- 4 set 3 1.96 x 10-16 3.39 x 10-18 Table 2.3-6. Probability of Output Using these probabilities directly for ranking would place set 3 on the bottom of each list, which does not agree with intuition. Note that this problem is the same problem that document retrieval systems have with doeuments of varying lengths; longer documents are ranked lower than they should be. But now we take the second step of calculating the probability that an output is in a class.</Paragraph>
      <Paragraph position="7">  We can see that now the rankings are as we expect; set 1 is the output most likely to have been created with the fair die and set 2 the least, and set 2 is the output most likely to have been created with the loaded die and set 1 the least. So using this multinomial distribution to rank documents is less likely to be adversely affected by varying document lengths.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="169" end_page="170" type="metho">
    <SectionTitle>
3. APPROACH
</SectionTitle>
    <Paragraph position="0"> Below is a description of the different approaches implemented for calculating the match between a document and a class profile. The class scores are then compared to each other to determine the classification and routing results.</Paragraph>
    <Section position="1" start_page="169" end_page="169" type="sub_section">
      <SectionTitle>
3.1. Class Scoring Techniques
</SectionTitle>
      <Paragraph position="0"> ~.idf The weight associated with each term in the training set is the log of the number of classes divided by the number of classes which contain the term.</Paragraph>
      <Paragraph position="1"> The class score is calculated by the following equation \[2\]. This equation has been modified from the reference by dividing by the sum over the class of the term weights, to normalize the results when Distinguishing Term sets are used which have different lengths.</Paragraph>
      <Paragraph position="2">  The weight associated with each term in the training set is calculated by the following equation \[ 1\]. weight = log number of classes with term + 1 The class score is calculated by the following equation \[ll.</Paragraph>
      <Paragraph position="4"/>
    </Section>
    <Section position="2" start_page="169" end_page="170" type="sub_section">
      <SectionTitle>
Multinomial Distribution
</SectionTitle>
      <Paragraph position="0"> A number of weights are associated with each term in the training set. A weight is calculated for each of the classes for each term, and the weight is the probability of the term occurrence in the class. This is approximated by taking the frequency of the term occurrence in the training set divided by the size of the training set.</Paragraph>
      <Paragraph position="1"> The weights for all of the Distinguishing Terms in a set are combined into a single value, called the set weight.</Paragraph>
      <Paragraph position="2"> An additional weight is calculated, which is necessary for the multinomial distribution. This is the probability that a term is not a Distinguiqhlng Term, and is calculated as 1.0 minus the sum of the probabilities of all of the Distinguishing Terms in the training set. Since the class scores calculated with this approach are exceedingly small, the log of the probability equation is used to avoid computational difficulties.</Paragraph>
      <Paragraph position="3">  The class score is calculated by the following equation \[3\].</Paragraph>
      <Paragraph position="5"> For routing, the score is the probability for each class calculated given the words in the document. This is done with the following equafiou for each class.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="170" end_page="170" type="metho">
    <SectionTitle>
SMART
</SectionTitle>
    <Paragraph position="0"> The SMART program independently calculates the scores for the Distinguishing Terms and for the document based upon the word frequencies in the entire collection available for classif'mation and routing, and takes the score as the sum of the products of the Distinguisking Term and document weights. A variety of weighting schemes are possible, and a common oue is called 'lnc.ltc'. The weight associated with each term in the Distingui.qhing Term set is calculated by the following equation \[6\].</Paragraph>
    <Paragraph position="2"> For classification the document is classified into the class wMch has the maximum score.</Paragraph>
    <Section position="1" start_page="170" end_page="170" type="sub_section">
      <SectionTitle>
Routing
</SectionTitle>
      <Paragraph position="0"> In routing the top ranked documents for each class are returned. For the tf.idf, Cosine, and SMART methods the class score is used to rank the documents, for the Multinomial Distribution method the routing score is used.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="170" end_page="171" type="metho">
    <SectionTitle>
4. IMPLEMENTATION
</SectionTitle>
    <Paragraph position="0"> The following methods were used to determine the Distingui,qhing Terms, calculate the weights associated with those terms, and to compare documents to the DistinguLqhing Terms to get class scores and classification and routing determinations.</Paragraph>
    <Section position="1" start_page="170" end_page="171" type="sub_section">
      <SectionTitle>
4.1. Selection of Distinguishing Terms
</SectionTitle>
      <Paragraph position="0"> Each class has a set of Distinguishing Terms, which are those individual terms which occur more often in the class than in other classes, and which can be used to distinguish the class from the other classes. The better this set of Distinguishing Terms is, the better the results will be for routing and classification.</Paragraph>
      <Paragraph position="1"> The Distinguishing Terms are found by processing a training set of documents which are representative of the class. This training set must be of a sufficient size to produce good statistics of the terms in the class and the frequencies of the terms.</Paragraph>
      <Paragraph position="2"> In each document, the header information up to the headline is removed. This eliminates the class and source information which is added by the collection agent, which would bias the word set. The remaining words are separated at blank spaces onto individual lines, and stemming is performed to remove embedded SGML syntax, possessives, punctuation, and some suffixes (see Appendix A).</Paragraph>
      <Paragraph position="3"> The words are then counted and sorted by frequency, and the word probability in the class is calculated by dividing the frequency by the number of words in the training set.</Paragraph>
      <Paragraph position="4"> At this point the Distinguishing Terms for each class can be chosen. For this report, three different methods were implemented and experimented with.</Paragraph>
      <Paragraph position="5">  1. Use all of the words in the training set.</Paragraph>
      <Paragraph position="6"> 2. Use the high frequency words in each list which are not the high fiequency words in any other list, by selecting the words which  .</Paragraph>
      <Paragraph position="7"> are in the highest so many on the list and not in the highest so many on any other list.</Paragraph>
      <Paragraph position="8"> Use the high frequency words in each list which occur with low frequency on all of the other lists, by selecting only the words which occur more often in one list than in all other lists combined, until enough words have been chosen.</Paragraph>
    </Section>
    <Section position="2" start_page="171" end_page="171" type="sub_section">
      <SectionTitle>
4.2. Calculation of Term Weights
</SectionTitle>
      <Paragraph position="0"> Each of the selection methods requires a weight to be calculated for each Distinguishing Term. The tf.idf and Cosine methods all calculate the weight using the number of classes which contain the term, while the Multinomial Distribution method calculates the weight using the term probabilities.</Paragraph>
    </Section>
    <Section position="3" start_page="171" end_page="171" type="sub_section">
      <SectionTitle>
4.3. Document Classification
</SectionTitle>
      <Paragraph position="0"> Each document to he classified is processed the same as the training sets are up to the selection of Distinguishing Terms; the header information is removed, remaining words are separated at blank spaces onto individual flues, and stemming is performed to remove embedded SGML syntax, possessives, punctuation, and many suffixes. The words are then counted and sorted by frequency.</Paragraph>
      <Paragraph position="1"> The document words are compared to each of the Distinguishing Terms sets, and a class score is calculated according to the selection method being used. For classification, the document is classified into the class which has the maximum score.</Paragraph>
      <Paragraph position="2"> For routing, the routing score is calculated from the class scores. Mter all of the documents have been class/fled the routing scores are sorted, with the highest ranking documents being those which are the most like the class profile than any other profde.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="171" end_page="172" type="metho">
    <SectionTitle>
5. EXAMPLE SELECTION OF DISTINGUISHING WORDS AND WEIGHTS
</SectionTitle>
    <Paragraph position="0"> To help illustrate the procedure, a small example is described. Consider two different classes, each represented by a training set. Each training set consists of a single document. Class 1 is 'Nursery Rhymes', represented with 'Mary Had a Little Lamb', and Class 2 is 'U.S. Documents', represented with the 'The Pledge of Allegiance'. These documents are shown below.</Paragraph>
    <Paragraph position="1">  Mter removing the header material, separating the words, stemming, sorting by frequency, and calculating the probabilities, the following lists would result. Norice that the stemming does not always work perfectly; 'united' is shortened to 'unite', but 'followed' is shortened to 'foUowe'. Overall, though, the stemming works much more often than it fails.</Paragraph>
    <Paragraph position="2">  The Distinguishing Terms are then chosen, by one of three methods. The first is to choose all of the words in each fist. The second is to select the words which are in the highest so many on each fist and not in the highest so many on the other fist. For this example, let us choose the words that are in the top 15 on each list and not in the top 10 on the other fist. This would produce the following lists. The words 'the' and 'to' were eliminated from each list.</Paragraph>
    <Paragraph position="3">  The third way to choose Distinguishing Terms is to select only the words which occur more often in one list than in all other lists combined until enough words have been chosen. For this example, let us choose words which occur more often in one list than in the other list until the sum of the probabilities of the chosen words is at least 40%. This would produce the following fists.</Paragraph>
  </Section>
  <Section position="9" start_page="172" end_page="173" type="metho">
    <SectionTitle>
0.03922 LITTLE
</SectionTitle>
    <Paragraph position="0"> Table 5-3. Most Likely Words Then the weight for each word is calculated. This is done here for each selection method for the last set of distinguishing words.</Paragraph>
    <Section position="1" start_page="172" end_page="173" type="sub_section">
      <SectionTitle>
SMART
</SectionTitle>
      <Paragraph position="0"> Weights are not kept from the training set, only the fist of words is kept. New weights are calculated from the corpus of documents to be classified and routed. But making the assumption that the training set and the corpus have the same distribution of words, the following weights wonld be calculated.</Paragraph>
    </Section>
  </Section>
  <Section position="10" start_page="173" end_page="173" type="metho">
    <SectionTitle>
6. TESTING
</SectionTitle>
    <Paragraph position="0"> The methods were tested against a small set of available documents. These were FBIS documents from June and July of 1991 on four different topics.</Paragraph>
    <Section position="1" start_page="173" end_page="173" type="sub_section">
      <SectionTitle>
6.1. Selection of Distinguishing Terms
</SectionTitle>
      <Paragraph position="0"> Ten documents randomly chosen from each class were used as training. These training documents were then eliminated from the set of documents to be classified. The following table shows some information about the training documents.</Paragraph>
      <Paragraph position="1">  Set 1 contained editorials from Vietnam. Some extremely short documents were included which were no longer than the header information (which was stripped before use), the rifle, author and source, and a note that the article was in Viemamese and had not been translated. Many of the high frequency words were political or economic.</Paragraph>
      <Paragraph position="2"> Set 2 contained abstracts from Japanese technical papers. Many of the high frequency words were technological or were Japanese locations and companies. Set 3 contained articles about arms control from all over the world. Many of the high frequency words were location, military, or negoriarion related.</Paragraph>
      <Paragraph position="3"> Set 4 contained articles from the Soviet Union about various military affairs, including those in other countries. Many of the high frequency words were Soviet Union locations or military related.</Paragraph>
      <Paragraph position="4"> After experimenting with the Distinguishing Term selection methods, it was found that using the most frequent 300 words which were not the most frequent 300 words in any other class worked best for the ff.idf method. The Cosine method worked best when the Distinguishing Terms for each class were the words which were more likely to be in the class than in the sum of the rest of the classes, until the sum of the probabilities of the chosen words was at least 20%. The Multinomial Distribution method works best if the Distingalishlng Terms for each class are more lilfely to be in the class than in another class, so the method which worked best was to choose the words which occur more often in one list than in all other lists combined until the sum of the probabilities of the chosen words was at least 25%.</Paragraph>
    </Section>
  </Section>
  <Section position="11" start_page="173" end_page="174" type="metho">
    <SectionTitle>
6.2. Results for Classification
</SectionTitle>
    <Paragraph position="0"> Topics 3 and 4 had a significant overlap in distinguiqhing words, and this created the most difficulty in choosing the proper class. For example, one topic 4 document described arms control efforts in France, and this was always misclassified as topic 3.</Paragraph>
    <Paragraph position="1"> The following charts show the classifmation precision and recall for each of the classes. The ff.idf method gave the poorest results, while the SMART. Cosine, and Mulrinomial Distribution methods produced better results. null  Simplifying the charts to a single number F measure (average of precision plus recall) gives the following comparison.</Paragraph>
    <Paragraph position="2">  Simplifying the clam to a single number measure (area under the curve) gives the following comparison.</Paragraph>
  </Section>
class="xml-element"></Paper>