<?xml version="1.0" standalone="yes"?> <Paper uid="H90-1057"> <Title>Representation Quality in Text Classification:</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> Amherst~ MA 01003 </SectionTitle> <Paragraph position="0"> lewis @cs. umass, e du</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> ABSTRACT </SectionTitle> <Paragraph position="0"> The way in which text is represented has a strong impact on the performance of text classification (retrieval and categorization) systems. We discuss the operation of text classification systems, introduce a theoretical model of how text representation impacts their performance, and describe how the performance of text classification systems is evaluated. We then present the results of an experiment on improving text representation quahty, as well as an analysis of the results and the directions they suggest for future research.</Paragraph> </Section> <Section position="3" start_page="0" end_page="289" type="metho"> <SectionTitle> 1 The Task of Text Classification </SectionTitle> <Paragraph position="0"> Text-based systems can be broadly classified into classification systems and comprehension systems. Text classification systems include traditional information retrieval (IR) systems, which retrieve texts in response to a user query, as well as categorization systems, which assign texts to one or more of a fixed set of categories. Text comprehension systems go beyond classification to transform text in some way, such as producing summaries, answering questions, or extracting data.</Paragraph> <Paragraph position="1"> Text classification systems can be viewed as computing a function from documents to one or more class values. Most commercial text retrieval systems require users to enter such a function directly in the form of a boolean query. For example, the query (language OR speech) AND A U = Smith specifies a 1-ary 2-valued (boolean) function that takes on the value TRUE for documents that are authored by Smith and contain the word language or the word speech. In statistical IR systems, which have long been investigated by researchers and are beginning to reach the marketplace, the user typically enters a natural language query, such as Show me uses of speech recogni$ion.</Paragraph> <Paragraph position="2"> The assumption is made that the attributes (content words, in this case) used in the query will be strongly associated with documents that should be retrieved. A statistical IR system uses these attributes to construct a classification function, such as: f (x) .~ Cl ~shaw Jr C2~3ttse s -Jr C3Y3speech -~- C4~recog~zitio~r t This function assumes that there is an attribute corresponding to each word, and that attribute takes on some value for each document, such as the number of occurrences of the word in the document. The coefficients c~ indicate the weight given to each attribute. The function produces a numeric score for each document, and these scores can be used to determine which documents to retrieve or, more usefully, to display documents to the user in ranked order: Speech Recognition Applications 0.88 Jones Gives Speech at Trade Show 0.65 Speech and Speech Based Systems 0.57 Most methods for deriving classification functions from natural language queries use statistics of word occurrences to set the coefficients of a linear discriminant function \[5,20\]. 
<Paragraph position="3"> Text categorization systems can also be viewed as computing a function defined over documents, in this case a k-ary function, where k is the number of categories into which documents can be sorted. Rather than deriving this function from a natural language query, it is typically constructed directly by experts \[28\], perhaps using a complex pattern matching language \[12\]. Alternately, the function may be induced by machine learning techniques from large numbers of previously categorized documents \[17,11,2\].</Paragraph>
<Section position="1" start_page="0" end_page="288" type="sub_section"> <SectionTitle> 1.1 Text Representation and the Concept Learning Model </SectionTitle> <Paragraph position="0"> Any text classification function assumes a particular representation of documents. With the exception of a few experimental knowledge-based IR systems \[15\], these text representations map documents into vectors of attribute values, usually boolean or numeric. For example, the document title &quot;Speech and Speech Based Systems&quot; might be represented as (F, F, F, T, F, F, T, F, T, F, F, F, ...) in a system which uses boolean attribute values and omits common function words (such as and) from the text representation. The T's correspond to the words speech, based, and systems. The same title might be represented as (0, 0, 0, 1.0, 0, 0, 0.5, 0, 0.5, 0, 0, 0, ...) in a statistical retrieval system where each attribute is given a weight equal to the number of occurrences of the word in the document, divided by the number of occurrences of the most frequent word in the document. Information retrieval researchers have experimented with a wide range of text representations, including variations on words from the original text, manually assigned keywords, citation and publication information, and structures produced by NLP analysis \[15\]. Besides this empirical work, there have also been a few attempts to theoretically characterize the properties of different representations and relate them to retrieval system performance. The most notable of these attempts is Salton's term discrimination model \[19\], which says that a good text attribute is one that increases the average distance between all pairs of document vectors.</Paragraph>
<Paragraph position="1"> However, none of the proposed models of text representation quality addresses the following anomaly: since most text representations have very high dimensionality (a large number of attributes), there is usually a legal classification function that will produce any desired partition of the document collection. This means that essentially all proposed text representations have the same upper bound on performance. Therefore, in order to understand why one text representation is better than another, we need to take into consideration the limited ability of both humans and machine learning algorithms to produce classification functions.</Paragraph>
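A minimal sketch of the two attribute-value representations described above, assuming a small invented vocabulary and stopword list; real systems index far more words.

```python
# Sketch of the boolean and weighted vector representations described above,
# over a tiny invented vocabulary; real systems index thousands of words.
vocabulary = ["language", "query", "retrieval", "speech", "trade",
              "jones", "based", "show", "systems", "recognition"]
stopwords = {"and"}

title = "Speech and Speech Based Systems"
tokens = [w.lower() for w in title.split() if w.lower() not in stopwords]

# Boolean representation: True if the word occurs in the text.
boolean_vector = [w in tokens for w in vocabulary]

# Weighted representation: occurrences divided by the count of the
# most frequent word in the document.
max_count = max(tokens.count(w) for w in set(tokens))
weighted_vector = [tokens.count(w) / max_count for w in vocabulary]

print(boolean_vector)
print(weighted_vector)
```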
<Paragraph position="2"> The concept learning model of text classification \[14\] assumes that both machine production of classification functions (as in translation of natural language queries and relevance feedback) and human production of classification functions (as in user querying or expert construction of categorization systems) can usefully be viewed as machine learning. Whether this is a useful model of human production of classification functions is a question for experiment. If it is, a wide range of theoretical results and practical techniques from machine learning, pattern recognition, and statistics will take on new significance for text classification systems.</Paragraph>
<Paragraph position="3"> We survey a variety of representations from the standpoint of the concept learning model in \[15\]. We are currently conducting several experiments to test the predictions of the model \[14\]. One such experiment is described in Section 2 of this paper. First, however, we discuss how text classification systems are evaluated.</Paragraph> </Section>
<Section position="2" start_page="288" end_page="289" type="sub_section"> <SectionTitle> 1.2 Evaluation of Text Classification Systems </SectionTitle> <Paragraph position="0"> We have referred several times to the &quot;performance&quot; of text classification systems, so we should say something about how performance is measured. Retrieval systems are typically evaluated using test collections \[24\]. A test collection consists of, at minimum, a set of documents, a set of sample user queries, and a set of relevance judgments. The relevance judgments tell which documents are relevant (i.e. should be retrieved) for each query.</Paragraph>
<Paragraph position="1"> The retrieval system can be applied to each query in turn, producing either a set of retrieved documents or a ranking of all documents in the order in which they would be retrieved.</Paragraph>
<Paragraph position="2"> Two performance figures can be computed for a set of retrieved documents. Recall is the percentage of all relevant documents which show up in the retrieved set, while precision is the percentage of documents in the retrieved set which are actually relevant. Recall and precision figures can be averaged over the group of queries, or the recall-precision pair for each query can be plotted on a scatterplot.</Paragraph>
<Paragraph position="3"> For systems which produce a ranking rather than a single retrieved set, there is a recall and precision figure corresponding to each point in the ranking. The average performance for a set of queries can be displayed in terms of average precision levels at various recall levels (as in Table 1), or the averages at various points can be graphed as a recall-precision curve. Both methods display, for a particular technique, how much precision must be sacrificed to reach a particular recall level.</Paragraph>
<Paragraph position="4"> A single performance figure which is often used to compare systems is the average precision at 10 standard recall levels (again as in Table 1), which is an approximation to the area under the recall-precision curve. A difference of 5% in these figures is traditionally called noticeable and 10% is considered material \[22\]. Other single figures of merit have also been proposed \[27\].</Paragraph>
<Paragraph position="5"> A large number of test collections have been used in IR research, with some being widely distributed and used by many researchers. The superiority of a new technique is not widely accepted until it has been demonstrated on several test collections. Test collections range in size from a few hundred to a few tens of thousands of documents, with anywhere from 20 to a few hundred queries.</Paragraph>
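The following sketch computes recall and precision from a ranked retrieval run and averages interpolated precision at 10 standard recall levels. It follows common IR practice; the exact interpolation used in the cited studies may differ.

```python
# Sketch of recall/precision evaluation for a ranked retrieval run.
# The interpolation at 10 standard recall levels follows common IR
# practice; the cited studies may differ in detail.
def precision_at_recall_levels(ranking, relevant, levels=10):
    total_relevant = len(relevant)
    found = 0
    points = []  # (recall, precision) after each retrieved document
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            found += 1
        points.append((found / total_relevant, found / rank))
    results = []
    for i in range(1, levels + 1):
        level = i / levels
        # interpolated precision: best precision at or beyond this recall level
        precisions = [p for r, p in points if r >= level]
        results.append(max(precisions) if precisions else 0.0)
    return results

ranking = ["d3", "d7", "d1", "d9", "d2", "d5"]
relevant = {"d3", "d1", "d5"}
levels = precision_at_recall_levels(ranking, relevant)
print(levels)
print("average precision over 10 levels:", sum(levels) / len(levels))
```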
<Paragraph position="6"> Results on the smaller collections have often turned out to be unreliable, so the larger collections are preferred.</Paragraph>
<Paragraph position="7"> Evaluation is still a research issue in IR. The exhaustive relevance judgments assumed for traditional test collections are not possible with larger collections, nor when evaluating highly interactive retrieval systems \[6\].</Paragraph>
<Paragraph position="8"> For more on evaluation in IR, the reader is referred to Sparck Jones's excellent collection on the subject \[25\].</Paragraph>
<Paragraph position="9"> Evaluation of text categorization systems also needs more attention. One approach is to treat each category as a query and compute average recall and precision across categories \[12\], but other approaches are possible \[2\] and no standard has yet been agreed upon.</Paragraph> </Section> </Section>
<Section position="4" start_page="289" end_page="292" type="metho"> <SectionTitle> 2 An Experiment on Improving Text Representation </SectionTitle> <Paragraph position="0"> One method of improving text representation that has seen considerable recent attention is the use of syntactic parsing to create indexing phrases. These syntactic phrases are single attributes corresponding to pairs of words in one of several specified syntactic relationships in the original text (e.g. verb and head noun of subject, noun and modifying adjective, etc.). For instance, the document title Jones Gives Speech at Trade Show might be represented not just by the attributes Jones, gives, speech, trade, show but also by the attributes <Jones gives>, <gives speech>, <speech show>, <trade show>.</Paragraph>
<Paragraph position="1"> Previous experiments have shown only small retrieval performance improvements from the use of syntactic phrases. Syntactic phrases are desirable text attributes since they are less ambiguous than words and have narrower meanings. On the other hand, their statistical properties are inferior to those of words. In particular, the large number of different phrases and the low frequency of occurrence of individual phrases make it hard to estimate the relative frequency of occurrence of phrases, as is necessary for statistical retrieval methods. Furthermore, a syntactic phrase representation is highly redundant (there are large numbers of phrases with essentially the same meaning) and noisy (since redundant phrases are not assigned to the same set of documents).</Paragraph>
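The sketch below forms word-pair attributes for a title, as discussed above. Since no syntactic parser is included here, adjacent content words stand in for syntactically related pairs, so the output only approximates the phrases a parser would produce.

```python
# Sketch of forming word-pair attributes for a title. The paper derives
# pairs from a syntactic parse; lacking a parser here, adjacent content
# words stand in for syntactically related pairs, so this is only an
# approximation of the representation described above.
STOPWORDS = {"at", "the", "of", "and", "a", "in"}

def pair_attributes(text):
    words = [w.lower() for w in text.split() if w.lower() not in STOPWORDS]
    singles = set(words)
    # unordered pairs of neighbouring content words
    pairs = {tuple(sorted(p)) for p in zip(words, words[1:])}
    return singles, pairs

singles, pairs = pair_attributes("Jones Gives Speech at Trade Show")
print(sorted(singles))
print(sorted(pairs))
```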
<Section position="2" start_page="289" end_page="289" type="sub_section"> <SectionTitle> 2.1 Clustering of Syntactic Phrases </SectionTitle> <Paragraph position="0"> The concept learning model predicts that if the statistical properties of syntactic phrases could be corrected, without degrading their desirable semantic properties, then the quality of this form of representation would be improved. A number of dimensionality reduction techniques from pattern recognition would potentially have this effect \[13\]. One approach is to use cluster analysis \[1\] to recognize groups of redundant attributes and replace them with a single attribute.</Paragraph>
<Paragraph position="1"> We recently conducted a preliminary experiment testing this approach. The titles and abstracts of the 3204 documents in the CACM-3204 test collection \[9\] were syntactically parsed and phrases extracted. Each phrase corresponded to a pair of content words in a direct grammatical relation. Words were stemmed \[18\] and the original relationship between the words was not stored. (The words are unordered in the phrases.) Phrases were clustered using a nearest neighbor clustering technique, with similarity between phrases based on their tendency to occur in documents assigned to the same Computing Reviews categories. (Only 1425 of the 3204 CACM documents had Computing Reviews categories assigned, so only phrases that appeared in these documents were clustered.) Each of the 6922 phrases which occurred in two or more documents was used as the seed for a cluster, so 6922 clusters were formed. A variety of thresholds on cluster size and minimum similarity were explored. Document scores were computed using the formulae for word and phrase weights used in Fagan's study of phrasal indexing \[8\] and Crouch's work on cluster indexing \[7\].</Paragraph>
<Paragraph position="2"> Precision figures at 10 recall levels are shown in Table 1 for words, phrases combined with words, and clusters combined with words. While phrase clusters did improve performance, as is not always the case with clusters of individual words, the hypothesis that phrase clusters would be better identifiers than individual phrases was not supported. A number of variations on the criteria for membership in a cluster were tried, but none were found to give significantly better results. In the next section we discuss a number of possible causes for the observed performance levels.</Paragraph> </Section>
<Section position="3" start_page="289" end_page="292" type="sub_section"> <SectionTitle> 2.2 Analysis </SectionTitle> <Paragraph position="0"> Can we conclude from Table 1 that clustering of syntactic phrases is not a useful technique for information retrieval? No--the generation of performance figures is only the beginning of the analysis of a text classification technique. Controlling for all variables that might affect performance is usually impossible, due to the complexity of the techniques used and the richness and variety of the texts which might be input to these systems. Further analysis, and usually further experiment, is necessary before strong conclusions can be reached.</Paragraph>
<Paragraph position="1"> In this section we examine a range of possible reasons for the failure of syntactic phrase clusters to significantly improve retrieval performance. Our goal is to discover what the most significant influences on the performance of syntactic phrase clusters were, and so suggest what direction this research should take in the future.</Paragraph>
<Paragraph position="2"> The first possibility to consider is that there is nothing wrong with the clusters themselves, but only with how we used them. In other words, the coefficients of the classification functions derived from queries, or the numeric values assigned to the cluster attributes, might have been inappropriate. There is some merit in this suggestion, since the cluster and phrase weighting methods currently used are heuristic, and are based on experiments on relatively few collections. More theoretically sound methods of phrase and cluster weighting are being investigated \[6,26\].</Paragraph>
<Paragraph position="3"> On the other hand, scoring is unlikely to be the only problem. Simply examining a random selection of clusters (the seed member for each is underlined) shows they leave much to be desired as content indicators. We therefore need to consider reasons why the clusters formed were inappropriate.</Paragraph>
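For reference, here is an illustrative sketch of seeded nearest neighbor clustering of the kind described in Section 2.1: every phrase occurring at least twice seeds a cluster, and phrases sufficiently similar to the seed join it. The similarity function, thresholds, and phrase counts are placeholders, not the settings used in the experiment.

```python
# Illustrative sketch of seeded nearest-neighbor clustering of the kind
# described in Section 2.1: every phrase seen at least twice seeds a
# cluster, and phrases whose similarity to the seed exceeds a threshold
# join it. The similarity function and thresholds here are placeholders.
def seeded_clusters(phrases, similarity, min_count=2, threshold=0.7, max_size=10):
    clusters = {}
    seeds = [p for p, count in phrases.items() if count >= min_count]
    for seed in seeds:
        members = [p for p in phrases
                   if p != seed and similarity(seed, p) >= threshold]
        clusters[seed] = [seed] + members[:max_size - 1]
    return clusters

phrase_counts = {("speech", "recognit"): 3, ("speech", "understand"): 2,
                 ("pars", "algorithm"): 2, ("recognit", "system"): 1}

def word_overlap(p, q):
    # toy similarity: fraction of shared (stemmed) words
    return len(set(p) & set(q)) / len(set(p) | set(q))

print(seeded_clusters(phrase_counts, word_overlap, threshold=0.3))
```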
<Paragraph position="5"> The simplest explanation for the low quality of clusters is that not enough text was used in forming them. Table 2 gives considerable evidence that this is the case. The majority of occurrences of phrases were of phrases that occurred only once, and only 17.6% of distinct phrases occurred two or more times. We restricted cluster formation to phrases that occurred at least twice, and most of these phrases occurred exactly twice. This means that we were trying to group phrases based on the similarity of distributions estimated from very little data. Church \[3\] and others have stressed the need for large amounts of data in studying statistical properties of words, and this is even more necessary when studying phrases, with their lower frequency of occurrence.</Paragraph>
<Paragraph position="6"> Another statistical issue arises in the calculation of similarities between phrases. We associated with each phrase a vector of values of the form $n_{pc} / \sum_{q} n_{qc}$, where $n_{pc}$ is the number of occurrences of phrase $p$ in documents assigned to Computing Reviews category $c$, and the denominator is the total number of occurrences of all phrases in category $c$. This is the maximum likelihood estimator of the probability that a randomly selected phrase from documents in the category will be the given phrase. Similarity between phrases was computed by applying the cosine correlation \[1\] to these vectors. Problems with the maximum likelihood estimator for small samples are well known \[10,4\], so it is possible that clustering will be improved by the use of better estimators.</Paragraph>
<Paragraph position="7"> Another question is whether the clustering method used might be inappropriate. Previous research in IR has not found large differences between different methods for clustering words, and all clustering methods are likely to be affected by the other problems described in this section, so experimenting with different clustering methods probably deserves lower priority than addressing the other problems discussed.</Paragraph>
<Paragraph position="8"> A final issue is raised by the fact that using clusters and phrases together (see Table 3) produced performance superior to using either clusters or phrases alone. One way of interpreting this is that the seed phrase of a cluster is a better piece of evidence for the presence of the cluster than are the other cluster members. This raises the possibility that explicit clusters should not be formed at all, but rather that every phrase be considered good evidence for its own presence, and somewhat less good evidence for the presence of phrases with similar distributions. Again, investigating this is not likely to deserve high priority.</Paragraph>
<Paragraph position="9"> Another set of factors potentially affecting the performance of phrase clustering is the phrases themselves. Our syntactic parsing is by no means perfect, and incorrectly produced phrases could both cause bad matches between queries and documents and interfere with the distributional estimates that clustering is based on.</Paragraph>
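A small sketch of the phrase similarity computation described above: each phrase is given a vector of maximum likelihood estimates $n_{pc} / \sum_{q} n_{qc}$ over Computing Reviews categories, and similarity is the cosine correlation between these vectors. The category names and counts are invented for illustration.

```python
# Sketch of the phrase-similarity computation described above: each phrase
# gets a vector of maximum likelihood estimates n_pc / sum_q n_qc over
# Computing Reviews categories, and similarity is the cosine correlation
# between these vectors. The counts below are invented for illustration.
from math import sqrt

# phrase -> {category: number of occurrences of the phrase in that category}
counts = {
    ("speech", "recognit"): {"AI": 4, "Hardware": 0, "Theory": 1},
    ("pattern", "match"):   {"AI": 3, "Hardware": 1, "Theory": 2},
    ("memori", "alloc"):    {"AI": 0, "Hardware": 5, "Theory": 1},
}

categories = ["AI", "Hardware", "Theory"]
# total occurrences of all phrases in each category (the denominator)
totals = {c: sum(v[c] for v in counts.values()) for c in categories}

def category_vector(phrase):
    return [counts[phrase][c] / totals[c] for c in categories]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

p, q = ("speech", "recognit"), ("pattern", "match")
print(cosine(category_vector(p), category_vector(q)))
```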
<Paragraph position="10"> It is difficult to gauge directly the latter effect, but we can measure whether syntactically malformed phrases seem to be significantly worse content identifiers than syntactically correct ones. To determine this we found all matches between queries and relevant documents on syntactic phrases. We examined the original query text to see whether the phrase was correctly formed or whether it was the result of a parsing error, and did the same for the phrase occurrence in the document. We then gathered the same data for about 20% of the matches (randomly selected) between queries and nonrelevant documents. The results are shown in Table 4. We see that for both relevant and nonrelevant documents, the majority of matches are on syntactically correct phrases. The proportion of invalid matches is somewhat higher for nonrelevant documents, but the relatively small difference suggests that syntactically malformed phrases are not a primary problem.</Paragraph>
<Paragraph position="11"> In proposing the clustering of syntactic phrases, we argued that the semantic properties of individual phrases were good, and only their statistical properties needed improving. This clearly was not completely true, since phrases such as paper gives (from sentences such as This paper gives results on...) are clearly very bad indicators of a document's content.</Paragraph>
<Paragraph position="12"> We believed, however, that such phrases would tend to cluster together, and that none of the phrases in these clusters would match query phrases. Unfortunately, almost the opposite happened. While we did not gather statistics, it appeared that these bad phrases, with their relatively flat distribution, proved to be similar to many other phrases and so were included in many otherwise coherent clusters.</Paragraph>
<Paragraph position="13"> Some of the low quality phrases had fairly high frequency. Since IR research on clustering of individual words has shown omitting high frequency words from clusters to be useful, we experimented with omitting high frequency phrases from clustering. This actually degraded performance. Either frequency is less correlated with attribute quality for phrases than for words, or our sample was too small for reliable frequency estimates, or both.</Paragraph>
<Paragraph position="14"> Fagan, who did the most comprehensive study \[8\] of phrasal indexing to date, used a number of techniques to screen out low quality phrases. For instance, he only formed phrases which contained a head noun and one of its modifiers, while we formed phrases from all pairs of syntactically connected content words. Since many of our low quality phrases resulted from main verb / argument combinations, we will reconsider this choice.</Paragraph>
<Paragraph position="15"> Fagan also maintained a number of lists of semantically general content words that were to be omitted from phrases, and which triggered special purpose phrase formation rules. We chose not to replicate this technique, due to the modifications required to our phrase generator, and our misgivings about a technique that might require a separate list of exemption words for each corpus.</Paragraph>
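The kind of bookkeeping behind the match analysis above can be sketched as follows; the match records are invented, and the actual tabulation for Table 4 may have been organized differently.

```python
# Sketch of the kind of tally behind the match analysis above: each
# query/document phrase match is labelled as syntactically well formed or
# a parsing error, separately for relevant and nonrelevant documents.
# The match records are invented for illustration.
from collections import Counter

# (document relevant?, query phrase well formed?, document phrase well formed?)
matches = [
    (True, True, True), (True, True, True), (True, False, True),
    (False, True, True), (False, True, False), (False, False, False),
]

table = Counter()
for relevant, query_ok, doc_ok in matches:
    group = "relevant" if relevant else "nonrelevant"
    kind = "valid" if (query_ok and doc_ok) else "malformed"
    table[(group, kind)] += 1

for group in ("relevant", "nonrelevant"):
    valid = table[(group, "valid")]
    total = valid + table[(group, "malformed")]
    print(f"{group}: {valid}/{total} matches on well-formed phrases")
```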
<Paragraph position="16"> We did, however, conduct a simpler experiment which suggests that distinguishing between phrases of varying quality will be important. We had a student from our lab who was not working on the phrase clustering experiments identify, for each CACM query, a set of pairs of words he felt to be good content identifiers. We then treated these pairs of words just as if they had been the set of syntactic phrases produced from the query.</Paragraph>
<Paragraph position="17"> This gave the results shown in Table 5. As can be seen, retrieval performance was considerably improved, even though the phrases assigned to documents and to clusters did not change. (More results on eliciting good identifiers from users are discussed in \[6\].) Given this evidence that not all syntactic phrases were equally desirable identifiers, we tried one more experiment. We have mentioned that many poor phrases had relatively flat distributions across the Computing Reviews categories. Potentially this very flatness might be used to detect and screen out these low quality phrases. To test this belief, we ranked all phrases which occurred in 8 or more documents by the similarity of their Computing Reviews vectors to that of a hypothetical phrase with an even distribution across all categories.</Paragraph>
<Paragraph position="18"> The top-ranked phrases, i.e. those with the flattest distributions, are found in Table 6. Unfortunately, while some of these phrases are bad identifiers, others are reasonably good. More apparent is a strong correlation between flatness of distribution and occurrence in a large number of documents. This suggests that once again we are being tripped up by small sample estimation problems, this time manifesting themselves as disproportionately skewed distributions of low frequency phrases. The use of better estimators may help this technique, but once again a larger corpus is clearly needed.</Paragraph> </Section> </Section>
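A sketch of the flatness screen described above: phrases are ranked by the cosine similarity of their category vectors to a hypothetical phrase with an even distribution across all categories. The vectors are invented; in the experiment they were estimated from the corpus.

```python
# Sketch of the flatness screen described above: phrases are ranked by the
# cosine similarity of their Computing Reviews category vectors to a
# hypothetical phrase distributed evenly across all categories. The
# vectors here are invented; in the experiment they came from the corpus.
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

category_vectors = {
    ("paper", "give"):      [0.20, 0.18, 0.22, 0.20, 0.20],  # flat distribution
    ("speech", "recognit"): [0.70, 0.05, 0.10, 0.05, 0.10],  # peaked distribution
    ("program", "languag"): [0.30, 0.25, 0.15, 0.15, 0.15],
}

uniform = [1 / 5] * 5  # hypothetical evenly distributed phrase

flatness_ranking = sorted(category_vectors,
                          key=lambda p: cosine(category_vectors[p], uniform),
                          reverse=True)
for phrase in flatness_ranking:
    print(phrase, round(cosine(category_vectors[phrase], uniform), 3))
```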
<Section position="5" start_page="292" end_page="292" type="metho"> <SectionTitle> 3 Future Work </SectionTitle> <Paragraph position="0"> The fact that phrase clusters provided small improvements in performance is encouraging, but the clearest conclusion from the above analysis is that syntactic phrase clustering needs to be tried on much larger corpora. This poses some problems for evaluation, since the CACM collection is one of the larger of the currently available IR test collections. The need for larger IR test collections is widely recognized, and methods for their construction are being investigated. In the meantime, we are pursuing two other approaches for experimenting with phrase clustering. The first is to form clusters on a corpus different from the one on which the retrieval experiments are performed. If the content and style of the texts are similar enough, the clusters should still be usable. To this end, we have obtained a collection of approximately 167,000 MEDLINE records (including abstracts and titles, but no queries or relevance judgments) to be used in forming clusters. The clusters will be tested on two IR test collections which, while much smaller, are also based on MEDLINE records.</Paragraph>
<Paragraph position="1"> A second approach is to experiment with text categorization, rather than text retrieval, since large collections of categorized text are available. The same large MEDLINE subset described above can be used for this kind of experiment, and we have also obtained the training and test data (roughly 30,000 newswire stories) used in building the CONSTRUE text categorization system \[12\].</Paragraph>
<Paragraph position="2"> Besides the need for repeating the above experiments with more text, our analysis also suggests that some method of screening out low quality phrases is needed.</Paragraph>
<Paragraph position="3"> We plan to experiment first with restricting phrases to nouns plus modifiers, as Fagan did, and with screening out phrases based on flatness of distribution, using more text and better small sample estimators. Improving the syntactic parsing method does not seem to be an immediate need.</Paragraph> </Section> </Paper>