<?xml version="1.0" standalone="yes"?>
<Paper uid="M91-1035">
  <Title>PERQUIN THREATING TROOP VELIZ</Title>
  <Section position="1" start_page="0" end_page="245" type="metho">
    <SectionTitle>
INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> The data extraction systems studied in the MUC-3 evaluation perform a variety of subtasks in fillin g out templates. Some of these tasks are quite complex, and seem to require a system to represen t the structure of a text in some detail to perform the task successfully . Capturing reference relation s between slot fillers, distinguishing between historic and recent events, and many other subtask s appear to have this character .</Paragraph>
    <Paragraph position="1"> Other of the MUC-3 subtasks, however, appear amenable to simpler techniques . In particular, whenever a slot filling task involves making choices among a relatively small set of prespecifie d alternatives, the potential exists for modeling this task as text categorization [10] . Text categorization systems treat a document as a single unit, and make a decision to assign each document to zero , one or more of a fixed set of categories .</Paragraph>
    <Paragraph position="2"> Text categorization techniques have played a role in previous text extraction systems, both for screening out documents which it is not desirable to process, and for directing documents t o category-specific extraction routines [2, 9, 15] . The role of categorization in these systems was relatively limited, however . Analyses of the behavior of these systems have given relatively littl e attention to how categorization operated or to what factors influenced categorization performance .</Paragraph>
    <Paragraph position="3"> Nor had performing data extraction solely by categorization techniques seen much attention before MUC-3.</Paragraph>
    <Paragraph position="4"> The study reported here is an initial exploration of the issues involved in using categorization a s part of the extraction process. As such it has two main goals. One goal is provide a measure of th e difficulty of the extraction task being performed by the MUC-3 systems . By evaluating the degree to which various text to template mappings can be captured statistically as word to filler mappings, we gain a sense of where the linguistic and domain knowledge embedded in the NLP systems i s having the most effect .</Paragraph>
    <Paragraph position="5"> A second goal is to identify aspects of the extraction task that categorization might plausibl y aid, or even take primary responsibility for . With respect to this second goal, this paper provides a n additional data point to be considered along with the results of several official MUC-3 sites which made greater or lesser use of categorization techniques .</Paragraph>
    <Paragraph position="6"> In order to most clearly discover what characteristics of the extraction task impact categorization , we ran an experiment using an &amp;quot;off-the-shelf&amp;quot; categorization system, with access to no additiona l language processing capability or knowledge base . The technique investigated was a well-understood , domain independent statistical procedure . It should be pointed out that the method used is onl y one of a variety of methods, some quite knowledge intensive, that have been used in previous wor k on text categorization [12].</Paragraph>
    <Paragraph position="7"> The results presented here are not official MUC-3 results, and should be considered tentative .</Paragraph>
    <Paragraph position="8"> This paper is only a small step toward an understanding of the role of categorization methods i n data extraction.</Paragraph>
    <Paragraph position="9">  VIEWING MUC-3 AS TEXT CATEGORIZATIO N For a data extraction task to be viewed as text categorization, it must involve choosing among a finite, preferably small, set of categories that a document might be assigned to . We therefore nee d to consider which MUC-3 subtasks might fit this description .</Paragraph>
    <Paragraph position="10"> The MUC-3 task can be broken into two main parts: deciding how many templates should b e generated for a document, and filling the slots within the generated templates . Deciding the numbe r of templates could conceivably be modeled as categorization, given some upper bound on the numbe r of allowed templates . (A more profitable view, if one wanted to use machine learning technique s for this subtask, might be to use multiple regression or some similar technique .) We chose not t o address this task directly, though indirectly it is dealt with by the rules we use to generate templates from a binary categorization .</Paragraph>
    <Paragraph position="11"> Many of the slot filling tasks also could not be modeled as classifying a document with respec t to a fixed set of categories . For instance, all slot filling tasks that involve extracting strings from the text allow a range of answers that cannot be defined in advance . One might force string filling into a categorization mode by viewing the task as one of deciding, for each substring in a document , whether that substring falls into the class of fillers of a particular slot . In this paper, however, we chose to consider only the classification of entire document texts into categories .</Paragraph>
    <Paragraph position="12"> The slots that appeared most amenable to categorization techniques were the set fill slots. The following slots were considered to be set fill slots in the MUC-3 documentation :</Paragraph>
  </Section>
  <Section position="2" start_page="245" end_page="245" type="metho">
    <SectionTitle>
0. MESSAGE ID
3. TYPE OF INCIDENT
4. CATEGORY OF INCIDENT
</SectionTitle>
    <Paragraph position="0"> 7. PERPETRATOR: CONFIDENCE 10 . PHYSICAL TARGET : TYPE(S) 13. HUMAN TARGET: TYPE(S) 14. TARGET: FOREIGN NATION(S) 15. INSTRUMENT: TYPE(S) 16. LOCATION OF INCIDENT 17. EFFECT ON PHYSICAL TARGET(S ) 18. EFFECT ON HUMAN TARGET(S)  Of these, the MESSAGE-ID slot is assigned straightforwardly without reference to the articl e content. The foreign nation (14) and location (16) slots, are set fill slots only under the assumption that the set of relevant locations is known and fixed . This is appropriate for the MUC-3 evaluation , but might be problematical in an operational setting . In any case, filling of these slots is to such a large extent a function of recognizing particular names in the text that it appeared categorizatio n would have little to say about these slots .</Paragraph>
    <Paragraph position="1"> This left eight slots (3, 4, 7, 10, 13, 15, 17, and 18) which could plausibly be treated by categorization. Each of these slots has a finite set of possible fillers . The decision to fill a particular slot for a particular document can therefore be modeled as deciding whether the document fits int o one or more of the 88 categories defined by combinations of these slots and their fillers . Even here, however, several factors, such as the necessity of distributing slot fillers among multiple templates , cause filling these slots to depart from a traditional categorization model. These issues are discusse d in the next section.</Paragraph>
  </Section>
  <Section position="3" start_page="245" end_page="249" type="metho">
    <SectionTitle>
FILLING SET FILL SLOTS WITH A STATISTICAL CATEGORIZER
</SectionTitle>
    <Paragraph position="0"> To investigate the use of categorization techniques for filling set fill slots, we used the Maxcat syste m [11] . This is a suite of C programs and Unix shell scripts supporting machine learning using large,  sparse feature sets, such as are found in text retrieval and text categorization. Datasets are store d using sparsely nondefault matrices . A variety of programs for structural and arithmetic manipulatio n of these matrices are included, and several machine learning algorithms have been implemented using them.</Paragraph>
    <Paragraph position="1"> In the following we discuss the processing of MUC-3 training and test documents, and ho w categorization and template generation were performed.</Paragraph>
    <Paragraph position="2"> Modeling and Preparing the MUC-3 Dat a The training data available for the MUC-3 task was a set of 1400 document texts (1300 origina l training documents and the interim TST1 testset of 100 documents), along with one or more completed templates for each document . The 100 documents and corresponding templates from th e second MUC-3 testset (TST2) were used for testing .</Paragraph>
    <Paragraph position="3"> Maxcat requires all data it operates on to be represented in the form of feature vectors . We therefore viewed the complete MUC-3 data set as consisting of two sets of 1500 feature vectors, tw o vectors for each document . One set of features, the predictor features, specified the presence o r absence of each English word from the corpus in each document . We defined a word as a sequenc e of alphanumeric characters separated by whitespace or punctuation . By this definition, there were a total of 16,755 distinct word types in the 1500 documents, and thus 16,755 predictor features . Most predictor features took on a value of 0 (not present) for most documents, and the Maxcat syste m took advantage of this by storing feature vectors in inverted file form. Information on the numbe r of occurrences and positions of words was not used in this experiment .</Paragraph>
    <Paragraph position="4"> The second set of feature vectors specified the presence or absence of each of the 88 slot/fixed se t filler combinations in the templates for each document . As with words, information about multiple occurrences of a filler, in one or multiple templates for a document, was omitted since the particular categorization method used did not take make use of this information. The program which converted the training templates into feature vectors uncovered approximately 100 errors in these templates . Most of these were obviously typographical errors in filler names, and we corrected these by han d before training.</Paragraph>
    <Paragraph position="5"> It should be noted that while the separation between training and test documents was stric t from the standpoint of the machine learning and evaluation procedures, it was not maintaine d implementationally . The entire set of training and test documents were processed together for th e purpose of defining the complete set of word features that would be used . Strict isolation, to official MUC-3 standards, could have been handled with the Maxcat software, given some additional codin g and a loss in efficiency.</Paragraph>
    <Paragraph position="7"> as performed by a Bayesian classification strategy previously investigated for tex t categorization [7, 13] as well as for text retrieval [18] . The strategy treats the categorization proble m as one of estimating P(F, = 11Dm), i.e. the conditional probability the slot/filler pair F1 should be assigned given a particular document description, An . The formula used, suggested by Fuhr [4] for text retrieval, estimates P(Fi = 1ID,,,) by:</Paragraph>
    <Paragraph position="9"> the document. P(Wi = 1) is the prior probability that word Wi is an appropriate indexing term fo r the document, while P(Wi = 0), which is 1 -- P(Wi = 1), is the prior probability that it is not a n appropriate indexing term for the document . P(Wi = 1IF3 = 1) is the conditional probability W; should be an indexing term for the document given that F, appears in a template for the document , and similarly for P(Wi = OIF, = 1).</Paragraph>
    <Paragraph position="11"> that the document has the description An . For purposes of this experiment, we let P(Wi = 1IDm )  be 1.0 if Wi had one or more occurrences in the document, and 0 .0 if it did not. However, better performance is likely to be obtainable with nonbinary estimates of P(Wi = 11Dm) [1, 16, 17] . All other probabilities were estimated from the 1400 training documents . The expected likelihoo d estimate was used to partially compensate for small sample errors [6] .</Paragraph>
    <Paragraph position="12"> The main categorization program took as its input a set of matrix files storing the above probabilities, along with a matrix specifying the word occurrences in the 100 test documents . It produce d a matrix of estimates of P(F5 = 11D,n ) for each of the 100 test documents and each of the 88 slo t filler pairs 1'i . The problem then remained of making a yes/no decision for each F,, An pair on the basis of these 8800 probability estimates .</Paragraph>
    <Paragraph position="13"> The obvious strategy of setting a single threshold on the estimated probabilities and assignin g to each document all slot/filler pairs which meet this threshold will not, unfortunately, work . The problem is that the probability estimates produced by the above formula vary widely between fillers . This is due to the different predictor features used for each filler, and the fact that the assumptions about independence of features made by the above model are not met in reality .</Paragraph>
    <Paragraph position="14"> The result is that a different threshold is necessary for each category . Tying these thresholds to a single parameter that can be varied for a system is a difficult problem and has not receive d much attention in previous research on text categorization. For the current experiment, we used th e following ad hoc strategy. The categorization software multiplies the estimate of P(Fi = 1) from the training corpus by 100 (the number of test documents) and by a constant k. The resulting product provides an estimate, N(F' = 1), of the number of test documents that Fj should be assigned to .</Paragraph>
    <Paragraph position="15"> For each F,, the software assigned the slot/filler pair F~ to the N(Fi = 1) test documents with th e highest estimated values of P(Fi = 11Dm) . The parameter k controls the tradeoff between recal l and precision, with higher values of k leading to more slot/filler pair assignments, higher recall and , usually, lower precision.</Paragraph>
    <Paragraph position="16"> As implemented, this strategy requires processing test documents in batch mode . An incremental alternative would be to apply the classification function to the original training set and check th e relative position of the test document's score within the set of scores it produces on the training set . Even better would be to explicitly model the distribution of scores on the training set and derive a decision theoretic assignment rule with a cost parameter .</Paragraph>
    <Paragraph position="17"> Feature Selectio n Recall that, by our very simple definition of a word, the MUC-3 corpus contained 16,755 distinct word types. In theory, all of these words could be used as features for each categorization, with th e actual quality of each feature being compensated for in the conditional probability estimates . In practice, the &amp;quot;curse of dimensionality&amp;quot; [3] limits the number of features that can be used effectively b y any statistical estimation method . The more parameters which need to be estimated by a learnin g procedure, the more samples which the procedure must see in order to avoid fitting noise in th e data. The relationship between the appropriate number of features and the appropriate number of samples varies with feature quality, the nature and amount of noise in the data, appropriatenes s of the statistical model being fit, and other concerns . For text categorization problems, Fuhr has suggested that 50-100 samples per parameter ,are necessary [5] .</Paragraph>
    <Paragraph position="18"> With only 1400 training documents, it is clearly inappropriate to use 16,755 features. Maron [13] and Hamill [7] suggested, but did not try, information theoretic measures for selecting a subset of words to be used in predicting each category. Such measures are widely used for feature selection i n pattern recognition and machine learning . We chose to use I(Wi ; F5 ), the system mutual informatio n [8] between assignment or nonassignment of Fi and assignment or nonassignment of Wi , as a feature</Paragraph>
    <Paragraph position="20"> Notice that system mutual information is symmetric with respect to F5 and Wi . However, th e above formula is sometimes rewritten to emphasize the information gained about the feature w e desire to predict (F' in our case) based on the feature we can observe (W,*) :</Paragraph>
    <Paragraph position="22"> a=o, 1 In the latter form the measure has been referred to as information gain [14] . For each category (slot/filler pair), the words with the top d values on mutual information wer e selected as features . Differing values of d were investigated as described below. It should be noted that information-based feature selection itself requires estimating probabilities, and so also is affected by the curse of dimensionality. While we have 1400 samples to estimate each of the mutual information values, if the filler and/or the word was very infrequent in the training set we will have few or no positive instances available for the estimation, leading to inaccurate value s of mutual information . There's little to be done about low frequency fillers, but we did investigat e omitting low frequency words from consideration as features, as described below.</Paragraph>
    <Paragraph position="23"> Binary Categorization and Template Generatio n For each test document, the categorization procedure outputs a length 88 binary vector indicatin g whether each slot/filler pair should be assigned to that document . It is then necessary to map from such a vector into one or more templates. This was done by a program that examined the 8 8 categorization decisions and did the following :  1. If all 88 categorization decisions were &amp;quot;NO&amp;quot;, it considered the document irrelevant and generated an single irrelevant template (all slots filled with &amp;quot;*&amp;quot;) .</Paragraph>
    <Paragraph position="24"> 2. If there were one or more positive categorization decisions, but no positive categorizatio n decisions for the TYPE OF INCIDENT slot, a single ATTACK template was generated, an d all other slots in the template were filled according to the positive categorization decisions. 3. If there were one or more positive categorization decisions for TYPE OF INCIDENT, we  generated one template for each TYPE OF INCIDENT filler for which there was a positive decision . Each of these templates got a copy of every non-TYPE OF INCIDENT filler fo r which there was a positive categorization decision .</Paragraph>
    <Paragraph position="25"> In addition to the above procedure, we encoded the rules for legal fillers from the MUC-3 documentation into the program. If a filler suggested by a positive categorization was in fact not legal for the type of incident of a template, the program omitted it from the template . No other domain knowledge or reasoning was used in generating the templates from the binary categorization . Since the categorizer output only binary categorization decisions, cross references were neve r produced in templates, nor was more than one copy of the same filler produced in any slot . This constraint imposed an upper bound of 50% on the system's recall for most slots, since most slot s allowed only partial credit in the absence of cross references .</Paragraph>
    <Section position="1" start_page="247" end_page="249" type="sub_section">
      <SectionTitle>
Evaluation
</SectionTitle>
      <Paragraph position="0"> Two methods of evaluation were used. One was a simple Unix shell script, ebc0, which compute s microaveraged recall, precision, fallout, and overlap measures for a binary assignment of categorie s to documents [10]. It operates on two matrices specifying, respectively, the binary categorizatio n decisions made by the categorizer, and the correct binary categorization decisions extracted fro m the key templates . Since this program was fast and did not require human intervention it was used to evaluate all the experiments . However, it overestimates performance on the MUC-3 task, since i t does not take into account multiple copies of fillers in the key templates .</Paragraph>
      <Paragraph position="1"> The other form of evaluation was, of course, to generate the MUC-3 templates as described abov e and evaluate them using the MUC-3 scoring program. This was done for only a few of the runs , due to the time required for interactive scoring . In interactive scoring, I allowed partial matches  program SFO rows for *'ed entries appear in Table 2. Complete MUC-3 scoring output for the +'e d entry appears in Table 3.</Paragraph>
      <Paragraph position="2"> only when the categorizer produced exactly the same filler as one of the fillers in the key template , but omitted a cross-reference string . Note that the maximum possible recall on most slots was 50% (partial credit on every filler) since they required a cross-reference for full credit .</Paragraph>
      <Paragraph position="3"> Matches were allowed against optional and alternative fills whenever possible . If the same fill occurred multiple times in the key template, the categorizer's fill was allowed to match each of thes e occurrences . Partial credit was not allowed in any other circumstance, and interactive full credi t was not used.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="249" end_page="249" type="metho">
    <SectionTitle>
RESULTS
</SectionTitle>
    <Paragraph position="0"> Figure 1 shows microaveraged recall and precision results from ebc0 for 60 parameter combinations .</Paragraph>
    <Paragraph position="1"> Five variations on the number of features were explored, as well as 12 variations on the parameter k that trades off recall and precision . An additional 60 runs were made requiring that a word occur i n at least 4 training documents to be a feature, vs . requiring it occur in at least 2 training documents .</Paragraph>
    <Paragraph position="2"> The resulting differences were insignificant, however, so only the cases where at least 2 documen t occurrences were required are reported in Table 1 .</Paragraph>
    <Paragraph position="3"> For certain of the parameter settings with superior tradeoffs between recall and precision, we generated a template file from the binary categorization and evaluated the resulting categorization s using the MUC-3 scoring program . (The latest version of scoring program available June 5, 199 1 was used.) The scoring program's summary for all set fill slots (the SFO row) for these paramete r settings are presented in Table 2, and their recall precision points in Table 1 are marked with a * .</Paragraph>
    <Paragraph position="4"> As we can see, performance on the MUC-3 evaluation measures is lower than is estimated by ebc0 .</Paragraph>
    <Paragraph position="5"> This is due to the limitation to partial credit, presence of multiple templates with the same incident type, and the inclusion of location slots in the MUC-3 set fill summary .</Paragraph>
    <Paragraph position="6"> An example of the complete MUC-3 scoring output for parameter settings d = 10 and k = 2.0 is shown in Table 3 .</Paragraph>
  </Section>
  <Section position="5" start_page="249" end_page="250" type="metho">
    <SectionTitle>
DISCUSSION
</SectionTitle>
    <Paragraph position="0"> The results of the above experiments have not yet been analyzed in detail . In this section I attempt , however, to discuss some of the more noticeable results . I particularly examine the performance of feature selection .</Paragraph>
    <Paragraph position="2"/>
  </Section>
  <Section position="6" start_page="250" end_page="250" type="metho">
    <SectionTitle>
Overall Results
</SectionTitle>
    <Paragraph position="0"> The proportional assignment strategy was successful, in that varying lc allowed a smooth tradeoff to be made between recall and precision . The break even point between recall and precision wa s roughly 50/50 when evaluated under a binary categorization model and around 35/35 under the MUC-3 scoring for set fill slots.</Paragraph>
    <Paragraph position="1"> Performance varied considerably between different set fill slots as can be seen, for example, i n Table 3 . Not surprisingly, performance was best for the two slots, TYPE OF INCIDENT an d</Paragraph>
  </Section>
  <Section position="7" start_page="250" end_page="252" type="metho">
    <SectionTitle>
CATEGORY OF INCIDENT, which are tied most closely to the overall content of the article rather
</SectionTitle>
    <Paragraph position="0"> than to single references within the article . Performance for the other six set fill slots was roughl y equivalent and fairly low, with break even points between 10 and 30 .</Paragraph>
    <Paragraph position="1"> Also not surprisingly, recall and precision for a filler was strongly related to the frequency o f the filler on the training set . Fillers which had few occurrences on the training set were, of course , assigned infrequently under our categorization strategy, but when they were assigned they wer e usually incorrect .</Paragraph>
    <Section position="1" start_page="250" end_page="252" type="sub_section">
      <SectionTitle>
Feature Selection
</SectionTitle>
      <Paragraph position="0"> Feature selection by the mutual information measure worked surprisingly well, considering the challenge of choosing from such a large set of potential features, most of which had relatively fe w positive instances . As expected, feature selection was much more accurate for fillers which occurre d frequently on the training set than for the more rare fillers . In Table 4 we list, in alphabetical order , the top 10 features selected for the four most frequent fillers, as well as for two fillers that had only a single appearance apiece on the training set . Each word was required to occur in at least 2 trainin g documents to be a feature.</Paragraph>
      <Paragraph position="1"> The selected features show a number of interesting properties of the feature selection algorithm .</Paragraph>
      <Paragraph position="2"> The features selected for BOMBING (an incident type filler) fit one's intuitions of what good feature s should look like . They are words that are associated with bombing and, in a semantic analysi s system, would undoubtedly be mapped to concepts closely related to bombing .</Paragraph>
      <Paragraph position="3"> The features selected for MURDER (another incident type filler) are a mixture of words whic h are in general related to the concept of murder, and words connected to a particular murder that is widely discussed in the MUC corpus . The influence of this incident, the murder of 6 Jesuit priests , is even more evident in the features selected for CIVILIAN (a human target type filler) . Since we would not expect these features to be good predictors for future articles, this suggests a need t o consider the composition of training sets carefully when machine learning or statistical methods ar e used in text processing.</Paragraph>
      <Paragraph position="4"> Another interesting phenomenon shows up in the features selected for TERRORIST ACT ( a category of incident filler) . We see some contentful words, but also a number of grammatical functio n words (WE, ALL, CAN) . These clearly result from stylistic characteristics of a large class of article s reporting terrorist acts, rather than to the content of those articles . A &amp;quot;stoplist&amp;quot; is often used to screen out function words in text retrieval systems, but in the MUC-3 domain the use of such stylisti c clues may in fact be helpful .</Paragraph>
      <Paragraph position="5"> For fillers such as KIDNAPPING THREAT and WATER with only one document occurrence , features were selected that occurred in that single document, and one other. These are poor predictors of the fillers, since their single cooccurrence is often an accident . Most fillers fell between th e two extremes presented in the table, and received a mixture of good and poor features .</Paragraph>
      <Paragraph position="6"> The number of features used, d, was a smaller influence on performance than expected . As Table 1 shows, performance was relatively constant across a range of feature set sizes from 5 to 80 .</Paragraph>
      <Paragraph position="7"> However, if we look at Table 5, which gives only those recall precision points from Table 1 whic h are not strictly dominated by some other point, a bit of a trend becomes clear . We can see that th e optimal number of features increases as the category assignment parameter increases. This means , not surprisingly, that more features should be used if higher recall is desired .</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>