<?xml version="1.0" standalone="yes"?>
<Paper uid="H91-1061">
  <Title>Evaluating Text Categorization I</Title>
  <Section position="3" start_page="0" end_page="313" type="metho">
    <SectionTitle>
EFFECTIVENESS MEASURES
</SectionTitle>
    <Paragraph position="0"> While a number of different effectiveness measures have been used in evaluating text categorization in the past, almost all have been based on the same model of decision making by the categorization system. I begin by discussing this contingency table model, which motivates a small number of simple and widely used effectiveness measures. Complexities arise, however, in how to compute and interpret these measures in the context of a text categorization experiment.</Paragraph>
    <Paragraph position="1"> The bulk of the discussion concerns these complexities.</Paragraph>
    <Section position="1" start_page="312" end_page="313" type="sub_section">
      <SectionTitle>
The Contingency Table
</SectionTitle>
      <Paragraph position="0"> Consider a system that is required to make n binary decisions, each of which has exactly one correct answer (either Yes or No). The result of n such decisions can be summarized in a contingency table, as shown in Table 1. Each entry in the table specifies the number of decisions of the specified type. For instance, a is the number of times the system decided Yes, and Yes was in fact the correct answer.</Paragraph>
      <Paragraph position="1"> Given the contingency table, three important measures of the system's effectiveness are:  (1) recall = ~/(~ + c) (2) precision = ~/(~ + b) (3) fallout = ~/(~ + d)  Measures equivalent to recall and fallout made their first appearance in signal detection theory \[Swe64\], where they play a central role. Recall and precision are ubiquitous in information retrieval, where they measure the proportion of relevant documents retrieved and the proportion of retrieved documents which are relevant, respectively. Fallout measures the proportion of nonrelevant documents which are retrieved, and has also seen considerable use.</Paragraph>
      <Paragraph position="2"> A decision maker can achieve very high recall by rarely deciding No, or very high precision (and low fallout) by rarely deciding Yes. For this reason either recall and precision, or recall and fallout, are necessary to ensure a non-trivial evaluation of a decision maker's effectiveness under the above model.</Paragraph>
      <Paragraph position="3"> Another measure sometimes used in categorization experiments is overlap: (4) overlap = a/(a + b + c) This measure is symmetric with respect to b and c, and so is sometimes used to measure how much two categorizations are alike without defining one or the other to be correct. null It is appropriate at this point to mention some of the limitations of the contingency table model. It does not take into account the possibility that different errors have different costs; doing so requires more general decision theoretic models. The contingency table also requires all decisions to be binary. It may be desirable for category assignments to be weighted rather than binary, and we will discuss later one approach to evaluation in this case.</Paragraph>
      <Paragraph position="4"> Defining Decisions and Averaging Effectiveness null The contingency table model presented above is applicable to a wide range of decision making situations. In this section, I will first consider how query-driven text retrieval has been evaluated under this model, and then consider how text categorization can be evaluated under the same model. In both cases it will be necessary to interpret the system's behavior as a set of binary decisions.</Paragraph>
      <Paragraph position="5"> In a query-driven retrieval systems, the basic decision is whether or not to retrieve a particular document for a particular query. For a set of q queries and d documents a total of n = qd decisions are made. Given those qd decisions, two ways of computing effectiveness are available. Microaueraging considers all qd decisions as a single group and computes recall, precision, fallout, or overlap as defined above. Macroaveraging computes these effectiveness measures separately for the set of d documents associated with each query, and then computes the mean of the resulting q effectiveness values.</Paragraph>
      <Paragraph position="6"> Macroaveraging has been favored in evaluating query-driven retrieval, partly because it gives equal weight to each user query. A microaveraged recall measurement, for instance, would be disproportionately affected by recall performance on queries from users who desired large numbers of documents.</Paragraph>
      <Paragraph position="7"> An obvious analogy exists between categories in a text categorization system and queries in a text retrieval system. The most common view taken of categorization is that an assignment decision is made for each category/document pair. A categorization experiment will compare the categorization decisions made by a computer system with some standard of correctness, usually human category assignment. In contrast to evaluations of query-driven retrieval, evaluations of categorization have usually used microaveraging rather than macroaveraglng. Many ad hoc variants of both forms of averaging have also been used.</Paragraph>
      <Paragraph position="8"> Whether microaveraging or macroaveraging is more informative depends on the purpose for the categorization. For instance, if categorization is used to index documents for text retrieval, and each category appears in user queries at about the same frequency it appears in documents, then  rlzer X on One Document microaveraging seems very appropriate. On the other hand, if categorization were used to route documents to divisions of a company, with each division viewed as being equally important, then macroaveraging would be more informative. The choice will often not be clearcut. I assume microaveraging in the following discussion unless otherwise mentioned.  Precision and fallout both measure (in roughly inverse ways) the tendency of the categorizer to assign incorrect categories. However, in doing so they capture different properties of the categorization.</Paragraph>
      <Paragraph position="9"> In the context of query-driven retrieval, S alton has pointed out how systems which maintain constant precision react differently to increasing numbers of documents than those which maintain constant fallout \[Sal72\]. Similar effects can arise for categorizers as the number or nature of categories changes.</Paragraph>
      <Paragraph position="10"> Table 2 shows the hypothetical performance of categorizer X as the category set is expanded to include new topics. Decreasing fallout suggests that the categorizer X incorrectly assigns categories in proportion to the number of correct categories to be assigned. A different categorizer, Y, might show the pattern in Table 3, suggesting categories are incorrectly assigned in proportion to the total number of incorrect categories (or in proportion to the total number of all categories).</Paragraph>
      <Paragraph position="11"> In extreme cases a system could actually improve on precision while worsening on fallout, or vice versa. Having both measures, plus recall, available is useful in quickly appraising a method's behavior under changing circumstances.  The basic tools of microaveraging and macroaveraging can be applied to arbitrary subsets of categorization decisions. Subsets of decisions can be defined in terms of sub-</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="313" end_page="315" type="metho">
    <SectionTitle>
[Tables 2 and 3: hypothetical performance of categorizers X and Y on one document as the category set expands; the table contents are not recoverable from this extraction.]
</SectionTitle>
    <Paragraph position="0"> rlzer Y on One Document sets of categories, subsets of documents, or gradations in the correctness standard.</Paragraph>
    <Paragraph position="1"> Categories can be partitioned by importance, frequency, similarity of meaning, or strategy used in assigning them. Presenting effectiveness measures averaged over category groups defined by frequency in the training set would be extremely informative, but does not appear to have been done in any published study. If the number of categories is small enough, effectiveness can be presented separately for each category \[HKC88\].</Paragraph>
    <Paragraph position="2"> Subsets of the set of test documents can be defined as well, particularly if the behavior of the system on texts of different kinds is of interest. Maron grouped documents on * the basis of the amount of evidence they provided for making a categorization decision, and showed that effectiveness increased in proportion to the amount of evidence \[lVlar61\]. Finally, it is sometimes appropriate to partition results by degree of correctness of a category/document pair. While the contingency table model assumes that an assignment decision is either correct or incorrect, the standard they are being tested against may actually have gradations of correctness. The model can still be used if gradations are partitioned into two disjoint classes, for instance correct and marginal being considered correct, and ineffective and incorrect being considered incorrect. In this circumstance, it may be desirable to present results under several plausible partitions.</Paragraph>
    <Paragraph position="3"> The appropriate partitions to make will depend on many factors that cannot be anticipated here. A crucial point to stress, however, is that care should be taken to partition supporting data on the task and system in the same fashion \[Lew91\]. For instance, if effectiveness measures are presented for subsets of documents, then statistics such as average number of words per document, etc. should be given for the same groups of documents.</Paragraph>
    <Section position="1" start_page="313" end_page="314" type="sub_section">
      <SectionTitle>
Arithmetic Anomalies
</SectionTitle>
      <Paragraph position="0"> The above discussion assumed that computing the effectiveness measures is always straightforward. Referring to equations (1) to (3) shows that 0 denominators arise when there exist no correct category assignments, no incorrect category assignments, or when the system never assigns a category. All these situations are extremely unlikely when microaveraging is used, but are quite possible under macroaveraging.</Paragraph>
      <Paragraph position="1"> For evaluating query-drlven retrieval, Tague suggests either treating 0/0 as 1.0 or throwing out the query, but says neither solution is entirely satisfactory. For a categorization system, we also have the option of partitioning the categories and macroaveraging only over the categories for which these anomalies don't arise. As discussed above, the  same partitioning should be used for any background data presented on the testset and task.</Paragraph>
      <Paragraph position="2"> One Category or Many? Evaluations of systems which assign multiple categories to a document have often been flawed, particularly for categorizers which use statistical techniques. For instance, some of the results in \[Mar61\] and ~KW75\] were obtained under assumptions equivalent to the categorizer knowing in advance how many correct categories each test document has. This knowledge is not available in an operational setting. Better attempts to both produce and evaluate multiple category assignments are found in work by Fuhr and Knorz, and by Field. Field uses the strategy of assigning the top k categories to a document, but unlike the above studies does this without knowledge of the proper number of categories for any particular document. He then plots the recall value achieved for variations in the number of categories assigned \[Fie75\]. Fuhr and Knorz plot a curve showing tradeoff between recall and precision as a category assignment threshold varies \[FK84\].</Paragraph>
      <Paragraph position="3"> When categories are completely disjoint and a categorizer always assigns exactly 1 of the M categories to a text, we really have a single M-ary decision, rather than M binary decisions. The contingency table model provides one way of summarizing M-ary decision effectiveness, but other approaches, such as confusion matrices \[Swe64\], may be more revealing.</Paragraph>
      <Paragraph position="4">  The effectiveness measures described above require that correct categorizations are known for a set of test documents. In cases where an automated categorizer is being developed to replace or aid manual categorization, categorizations from the operational human system may be used as the standard. Otherwise, it may be necessary to have human indexers categorize some texts specifically for the purposes of the experiment.</Paragraph>
      <Paragraph position="5"> Many studies have found that even professional bibliographic indexers disagree on a substantial proportion of categorization decisions \[Bor64, Fie75, HZ80\]. This calls into question the validity of human category assignment as a standard against which to judge mechanical assignment.</Paragraph>
      <Paragraph position="6"> One approach to this problem has been to have an especially careful indexing done \[Fie75, HZS0\]. Sometimes evaluation is done against several indexings \[Fie75, HKC88\].</Paragraph>
      <Paragraph position="7"> Another approach is to accept that there will always be some degree of inconsistency in human categorization, and that this imposes an upper limit on the effectiveness of machine categorization. The degree of consistency between several human indexers can be measured, typically using overlap, as defined in Equation (4), or some variant of this. How measures of consistency between human indexers might best aid the interpretation of machine categorization effectiveness is unclear. Overlap between the machineassigned categories and each human indexers' categories can be measured and compared to overlap among humans. It is less clear how to interpret recall, precision, or fallout in the presence of a known level of inconsistency.</Paragraph>
      <Paragraph position="8"> The possibility also exists that machine categorization could be better than human categorization, making consistency with human categorization a questionable measure under any circumstance. Indirect evaluation, discussed in the next section, is the best way to address this possibility.</Paragraph>
    </Section>
    <Section position="2" start_page="314" end_page="315" type="sub_section">
      <SectionTitle>
Indirect Evaluation
</SectionTitle>
      <Paragraph position="0"> The output of a text categorization system is often used by another system in performing text retrieval, text extraction, or some other task. When this is the case, it is possible to evaluate the categorization indirectly, by measuring the performance of the system which uses the categorization.</Paragraph>
      <Paragraph position="1"> This indirect evaluation of the categorization can be an important complement to direct evaluation, particularly when multiple categorizations are available to be compared.</Paragraph>
      <Paragraph position="2"> How an indirect evaluation is done depends on the kind of system using the categorized text. Most categorizers have been intended to index documents for query-driven text retrieval. Despite this, there have been surprisingly few studies \[Hat82, FK84\] comparing text retrieval performance under different automatic category assignments.</Paragraph>
      <Paragraph position="3"> The focus on manual categorization as a standard appears to have led categorization researchers to ignore some promising research directions. For instance, I know of no study that has evaluated weighted assignment of categories to documents, despite early recognition of the potential of this technique \[Mar61\] and the strong evidence that weighting free text terms in documents improves retrieval performance \[Sa186\].</Paragraph>
      <Paragraph position="4"> Categorization of documents may be desired for other purposes than supporting query-driven retrieval. Separation of a text stream by category may allowing packaging of the text stream as different products \[Hay90\]. Some comparison of average retrieval effectiveness across text streams might be an appropriate measure in this case.</Paragraph>
      <Paragraph position="5"> Categorization may also be used to select a subset of texts for more sophisticated processing, such as extraction of information or question answering \[JR90\]. Evaluating the quality of the extracted information may give some insight into categorization performance, though the connection can be distant here.</Paragraph>
      <Paragraph position="6">  There are drawbacks to indirect evaluation, of course.</Paragraph>
      <Paragraph position="7"> Tague questions why any particular set of queries should serve as a test of an indexing. Cleaxly, if a categorization is to be evaluated by text retrieval performance, the query set needs to be as large as possible, and representative of the actual usage the system will experience. When categorization is used as a component in a complex language understanding system, that system itself may be difficult to evaluate \[21190\] or differences in categorization quality may be hard to discern from overall system behavior. A single categorization may also be intended to serve several purposes, some possibly not yet defined. Using both direct and indirect evaluation will be the best approach, when practical. null Other Issues The evaluation of natural language processing (NLP) systems is an area of active research \[PF90\], and a great deal remains to be learned. Much more could be said even about evaluating categorization systems. In particular, I have focused entirely on numerical measures. Carefully chosen examples, examined in detail, can also be quite revealing \[HKC88\]. However, the numerical measures described above provide a useful standard for understanding the differences between methods under a variety of conditions.</Paragraph>
      <Paragraph position="8"> Comparison between categorization methods would be aided by the use of common testsets, something which has rarely been done. (An exception is \[BB64\].) Development of standard collections would be an important first step to better understanding of text categorization.</Paragraph>
      <Paragraph position="9"> Categorization is an important facet of many kinds of text processing systems. The effectiveness measures defined above may be useful for evaluating some aspects of these systems. In the next section we consider the evaluation of text extraction systems from this standpoint.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="315" end_page="317" type="metho">
    <SectionTitle>
IMPLICATIONS FOR EVALUATING
TEXT EXTRACTION
</SectionTitle>
    <Paragraph position="0"> Systems for test e.ztraction generate formatted data from natural language text. Some forms of extraction, for instance specifying the highest level of action in a naval report \[Sun89\], are in fact categorization decisions. Other forms of extraction are very different, and do not fit well into the contingency table model.</Paragraph>
    <Paragraph position="1"> In the following I briefly consider evaluation of text extraction systems using the effectiveness measures described for categorization. Two perspectives are taken--one focusing on the type of data extracted and the other focusing on the purpose for which extraction is done.</Paragraph>
    <Section position="1" start_page="315" end_page="316" type="sub_section">
      <SectionTitle>
Types of Extracted Data
</SectionTitle>
      <Paragraph position="0"> Extracted data can include binary or M-axy categorizations, quantities, and templates or database records with atomic or structured fillers \[Sun89, McC90, Hal90\]. The number of desired records per text may depend on text content, and cross references between fillers of record fields may be required.</Paragraph>
      <Paragraph position="1"> Using the effectiveness measures described above requires interpreting the system output in terms of a set of binary decisions which can be either correct or incorrect. The measures become less meaningful as the extraction task becomes less a matter of making isolated decisions with easily defined correctness, and more a matter of generating a legal expression from some potentially infinite language.</Paragraph>
      <Paragraph position="2"> Binary data, either as the sole output of extraction or as the filler of a fixed subpart of a larger structure, fits easily into the contingency table model of evaluation. This includes the case where a slot can have 0 or more fillers from a fixed set of possible fillers. Each pair of the form (slot, possible filler) can be treated as a category in the categorization model. Micro- or macroaveraging across slot/filler pairs for a single slot or for all slots in a template can be done. The situation where exactly one of a fixed set of M fillers must fill a slot is an M-ary decision, as mentioned above for categorization.</Paragraph>
      <Paragraph position="3"> Another common extraction task is to recognize all human names in a piece of text, and produce a canonical string for each name as part of the extracted data. Effectiveness measures from categorization begin to break down here.</Paragraph>
      <Paragraph position="4"> Treating the assignment of each possible canonical name as a binary decision is likely to be uninformative, given the very large set of legal names. (And is impossible if instead of a fixed set of canonical names there axe rules defining an unbounded number of them.) The situation is even more difficult when arbitrary strings may be slot fillers.</Paragraph>
      <Paragraph position="5"> The MUC-3 evaluation \[HalP0\] has taken the approach of retaining the contingency table measures but redefining the set of possible decisions. Rather than taking the cross-product of the set of all fillers and the set of all documents, the set of decisions is implicitly considered to be the union of all correct string/document assignments and all system-produced string/document assignments. This is equivalent to setting cell d of the contingency table to 0, while retaining the others. Fallout is thus eliminated as a measure but recall, precision, and overlap can still be computed. A scheme for assigning partial credit is also used.</Paragraph>
      <Paragraph position="6"> While this approach has been quite useful, it may not be ideal. Two processes are being evaluated at once-recognition of an extractable concept, and selection of a string (canonical or arbitrary) to represent that concept.</Paragraph>
      <Paragraph position="7"> It may be preferable, for instance, to evaluate these processes separately. This approach also requires subtle human  judgments of the relative correctness of various strings that might be extracted. Finally, when comparing systems using this approach, the underlying decision spaces may be different for each system, making interpreting the effectiveness measures more diffcult.</Paragraph>
      <Paragraph position="8"> When a system goes beyond string fills to filling slots with arbitrary structures, the contingency table model becomes very difficult to apply. At best there may be some hopes of capturing some parts of the task in this way, such as getting the right category of structure in a slot. More research on evaluation is clearly needed here.</Paragraph>
    </Section>
    <Section position="2" start_page="316" end_page="317" type="sub_section">
      <SectionTitle>
Purposes for Extracted Data
</SectionTitle>
      <Paragraph position="0"> The data type of extracted information affects what effectiveness measures can be computed. Even more important, however, is the purpose for which information is being extracted. This issue has been given surprisingly little attention in published discussions of text extraction systems.</Paragraph>
      <Paragraph position="1"> In the following, I give three examples to suggest that explicit consideration of how extracted data will be used is crucial in choosing appropriate effectiveness measures.</Paragraph>
      <Paragraph position="2"> Statistical Analysis of Real-World Events A data-base of extracted information may be meant to support queries about real-world events described in the texts. An analyst might want to check for correlations between numbers of naval equipment failures and servicing in certain ports, or list the countries where plastic explosives have been used in terrorist bombings, to give examples.</Paragraph>
      <Paragraph position="3"> Accurate answers to questions about numbers of events depend on recognizing when multiple event references in the same or in different documents in fact refer to a single real world event, and on proper handling of phenomena such as plurals, numbers, and quantification. High precision and low fallout may be favored over high recall. If it is expected that the same event will be described by multiple sources, a single failure to recognize it may not be important. Evaluation might focus on effectiveness in extracting details necessary to uniquely identify each event. On the other hand, if support of arbitrary existence queries (Has plastic ee~plosive been used...) is important, then recall for all recognizable details of events may be the most important thing to evaluate.</Paragraph>
      <Paragraph position="4"> The degree of connection between reports of events and actual events will vary from reliable (intra-agency traffc) to dubious (political propaganda). This makes it likely that the extraction system will at best be an aid to a human analyst, who will need to make judgment calls on the tellability of textual descriptions. The most useful evaluation may be of the analyst's performance with and without the extraction system.</Paragraph>
      <Paragraph position="5"> Content Analysis Content a,alysi8 has been defined in many different ways (\[Ho169\], pp. 2-3) but here I focus particularly on the analysis of texts to gain insight into the motivations and plans of the texts' authors. In its simplest form content analysis involves counting the number of occurrences of members of particular linguistic classes. For instance, one might count how often words with positive or negative connotations are used in referring to a neighboring country. The great potential of the computer to aid with the drudgery of analyzing large corpora of text has long been recognized, as has the potential for NLP to improve the effectiveness of this process.</Paragraph>
      <Paragraph position="6"> In content analysis, faithfulness to the text rather than faithfulness to the world may be the primary concern. Of particular importance is that the number of instances of a particular linguistic item extracted is not a~ected by extraneous variables. Consider a comparison of the number of references to a particular border skirmish in political broadcasts from two countries. In this case, one would want confidence that extraction effectiveness was about the same for texts from the two countries and was not affected by, for instance, differing capitalization conventions. The absolute level of e/Fectiveness might be a lesser concern.</Paragraph>
      <Paragraph position="7"> Indexing for Query-Drlven Text Retrieval In this case, the extracted data is used only indirectly. An analyst will use either a text retrieval system or a conventional database system to retrieve documents indexed by extracted data. The analyst may want the documents for any of a number of purposes, including the ones described above. The difference is that extracted information participates in the analysis only to the extent of influencing which documents the analyst sees. No numeric values are derived directly from the extracted data.</Paragraph>
      <Paragraph position="8"> In evaluating formatted data extracted for this purpose, a number of results from information retrieval research are important to consider. One is the fact, mentioned earlier, that document-specific weighting of indexing units is likely to substantially increase performance. Since NLP systems can potentially use many sources of evidence in deciding whether to extract a particular piece of information, there is a rich opportunity for such weighting.</Paragraph>
      <Paragraph position="9"> Another lesson from IK research is that people find it very difficult to judge the quality of indexing in the absence of retrieval data. Strongly held intuitions about the relative effectiveness of indexing languages and indexing methods for supporting document retrieval have often been shown by experiment to be incorrect \[Spaglb\]. If the primary purpose of extracted information is to support querying, then indirect evaluation, i.e. testing with actual queries, is very important.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML