<?xml version="1.0" standalone="yes"?> <Paper uid="P06-2022"> <Title>Automatically Extracting Nominal Mentions of Events with a Bootstrapped Probabilistic Classifier*</Title> <Section position="4" start_page="168" end_page="169" type="metho"> <SectionTitle> 2 Weakly-supervised, simultaneous </SectionTitle> <Paragraph position="0"> lexical acquisition and disambiguation In this section we present a computational method that learns the distribution of context patterns that correlate with event vs. non-event mentions based on unambiguous seeds. Using these seeds we build two Bayesian probabilistic generative models of the data, one for non-event nominals and the other for event nominals. A classifier is then constructed by comparing the probability of a candidate instance under each model, with the winning model determining the classification. In Section 3 we show that this classifier's coverage can be increased beyond the initial labeled seed set by automatically selecting additional seeds from a very large unlabeled, parsed corpus.</Paragraph> <Paragraph position="1"> The technique proceeds as follows. First, two lexicons of seed terms are created by hand. One lexicon includes nominal terms that are highly likely to unambiguously denote events; the other includes nominal terms that are highly likely to unambiguously denote anything other than events.</Paragraph> <Paragraph position="2"> Then, a very large corpus (>150K documents) is parsed using a broad-coverage dependency parser to extract all instantiations of a core set of semantic dependency relations, including verb-logical subject, verb-logical object, subject-nominal predicate, noun phrase-appositive-modifier, etc.</Paragraph> <Paragraph position="3"> Format of data: Each instantiation is in the form of a dependency triple, (wa,R,wb), where R is the relation type and where each argument is represented just by its syntactic head, wn. Each partial instantiation of the relation--i.e. either wa or wb is treated as a wild card ∗ that can be filled by any term--becomes a feature in the model. For every common noun term in the corpus that appears with at least one feature (including each entry in the seed lexicons), the number of times it appears with each feature is tabulated and stored in a matrix of counts. Each column of the matrix represents a feature, e.g. (occur,Verb-Subj,∗); each row represents an individual term, e.g. murder; and each entry is the number of times the term appeared with the feature in the corpus, i.e. as the instantiation of ∗. For each row, if the corresponding term appears in a lexicon it is given that designation, i.e. EVENT or NONEVENT; if it does not appear in either lexicon, it is left unlabeled.</Paragraph>
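The tabulation just described can be illustrated with a minimal sketch (the triple representation, the wildcard symbol, and all function names are assumptions made for illustration, not the authors' code); the restriction to common-noun heads and the feature pruning used later in Section 3.1 are omitted:

```python
from collections import defaultdict

WILDCARD = "*"  # stands in for the open slot of a partially instantiated relation

def build_count_matrix(triples, event_seeds, nonevent_seeds):
    """Tabulate, for every term, how often it fills the wildcard slot of each
    partially instantiated dependency pattern, e.g. (occur, Verb-Subj, *).
    `triples` is an iterable of (w_a, relation, w_b) dependency triples whose
    arguments have already been reduced to their syntactic heads."""
    counts = defaultdict(lambda: defaultdict(int))   # term -> {feature: count}
    for w_a, rel, w_b in triples:
        counts[w_b][(w_a, rel, WILDCARD)] += 1       # w_b instantiates (w_a, R, *)
        counts[w_a][(WILDCARD, rel, w_b)] += 1       # w_a instantiates (*, R, w_b)

    # Rows whose term appears in a seed lexicon receive that designation;
    # all other rows are left unlabeled.
    labels = {}
    for term in counts:
        if term in event_seeds:
            labels[term] = "EVENT"
        elif term in nonevent_seeds:
            labels[term] = "NONEVENT"
        else:
            labels[term] = None
    return counts, labels
```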
<Paragraph position="4"> Probabilistic model: Here we present the details of the EVENT model--the computations for the NONEVENT model are identical. The probability model is built using a set of seed words labeled as EVENTs and is designed to address two desiderata: (I) the EVENT model should assign high probability to an unlabeled vector, v, if its features (as recorded in the count matrix) are similar to the vectors of the EVENT seeds; (II) each seed term s should contribute to the model in proportion to its prevalence in the training data (footnote 3: prevalence is measured as the number of times a seed was seen with any feature in the training corpus, because the indexing tool used to calculate counts does not keep track of which instances appeared simultaneously with more than one feature; we do not expect this artifact to dramatically change the relative seed frequencies in our model). These desiderata can be incorporated naturally into a mixture model formalism, where there are as many components in the mixture model as there are EVENT seed terms. Desideratum (I) is addressed by having each component of the mixture model assign a multinomial probability to the vector, v. For the ith mixture component built around the ith seed, s(i), the probability is</Paragraph> <Paragraph position="5"> $p(\mathbf{v} \mid s^{(i)}) = \prod_{f \in F} \big(s^{(i)}_f\big)^{v_f}$,</Paragraph> <Paragraph position="6"> where $s^{(i)}_f$ is defined as the proportion of times the seed was seen with feature $f$ compared to the number of times the seed was seen with any feature $f' \in F$. Thus $s^{(i)}_f$ is simply the $(i,f)$th entry in a row-sum normalized count matrix,</Paragraph> <Paragraph position="7"> $s^{(i)}_f = \frac{\mathrm{count}(i,f)}{\sum_{f' \in F} \mathrm{count}(i,f')}$.</Paragraph> <Paragraph position="8"> Desideratum (II) is realized using a mixture density by forming a weighted mixture of the above multinomial distributions from all the provided seeds $i \in E$. The weighting of the ith component is fixed to be the ratio of the number of occurrences of the ith EVENT seed, denoted $|s^{(i)}|$, to the total number of occurrences of all event seed words. This gives more weight to more prevalent seed words:</Paragraph> <Paragraph position="9"> $\pi_i = \frac{|s^{(i)}|}{\sum_{j \in E} |s^{(j)}|}$.</Paragraph> <Paragraph position="10"> The EVENT generative probability is then:</Paragraph> <Paragraph position="11"> $p(\mathbf{v} \mid \mathrm{EVENT}) = \sum_{i \in E} \pi_i \, p(\mathbf{v} \mid s^{(i)}) = \sum_{i \in E} \pi_i \prod_{f \in F} \big(s^{(i)}_f\big)^{v_f}$.</Paragraph> <Paragraph position="12"> An example of the calculation for a model with just two event seeds and three features is given in Figure 1.</Paragraph> <Paragraph position="13"> [Figure 1: calculation of the probability of an unlabeled instance v under the event distribution composed of two event seeds s(1) and s(2). Feature counts over f1, f2, f3: event seed vector s(1) = (3, 1, 8); event seed vector s(2) = (4, 6, 1); unlabeled mention vector v = (2, 0, 7).]</Paragraph> <Paragraph position="15"> A NONEVENT mixture model is built in the same way from the non-event seeds, and a corresponding probability p(v|NONEVENT) is computed. The following</Paragraph> <Paragraph position="16"> $d(\mathbf{v}) = \log p(\mathbf{v} \mid \mathrm{EVENT}) - \log p(\mathbf{v} \mid \mathrm{NONEVENT})$</Paragraph> <Paragraph position="17"> is then calculated. An instance v encoded as the vector v is labeled as EVENT or NONEVENT by examining the sign of d(v). A positive difference d(v) classifies v as EVENT; a negative value of d(v) classifies v as NONEVENT. Should d(v) = 0, the classifier is considered undecided and abstains.</Paragraph> <Paragraph position="18"> Each test instance is composed of a term and the dependency triples it appears with in context in the test document. Therefore, an instance can be classified by (i: word): Find the unlabeled feature vector in the training data corresponding to the term and apply the classifier to that vector, i.e. classify the instance based on the term's behavior summed across many occurrences in the training corpus; (ii: context): Classify the instance based only on its immediate test context vector; or (iii: word+context): For each model, multiply the probability information from the word vector (=i) and the context vector (=ii). In our experiments, all terms in the test corpus appeared at least once (80% appearing at least 500 times) in the training corpus, so there were no cases of unseen terms, which is not surprising with a training set 1,800 times larger than the test set. However, the ability to label an instance based only on its immediate context means that there is a backoff method in the case of unseen terms.</Paragraph>
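To make the scoring and the sign test on d(v) concrete, here is a minimal sketch following the reconstruction above; computing in log space, the absence of smoothing, and all function names are assumptions of this sketch rather than details given by the authors:

```python
import math

def normalize(counts):
    """Row-sum normalize a seed's feature counts: s_f = count(s, f) / sum of its counts."""
    total = float(sum(counts.values()))
    return {f: c / total for f, c in counts.items()}

def build_model(seed_count_rows):
    """Per-seed multinomial parameters plus mixture weights proportional to |s(i)|."""
    sizes = [sum(row.values()) for row in seed_count_rows]
    total = float(sum(sizes))
    return [normalize(row) for row in seed_count_rows], [size / total for size in sizes]

def log_mixture_prob(v, seed_vectors, seed_weights):
    """log p(v | class) under the weighted mixture of per-seed multinomials."""
    component_logs = []
    for s, weight in zip(seed_vectors, seed_weights):
        lp = math.log(weight)
        for f, v_f in v.items():
            if v_f == 0:
                continue
            s_f = s.get(f, 0.0)
            if s_f == 0.0:                  # this component assigns probability zero
                lp = float("-inf")
                break
            lp += v_f * math.log(s_f)
        component_logs.append(lp)
    m = max(component_logs)
    if m == float("-inf"):
        return m                            # every component rules the vector out
    return m + math.log(sum(math.exp(l - m) for l in component_logs))

def classify(v, event_model, nonevent_model):
    """Sign test on d(v) = log p(v|EVENT) - log p(v|NONEVENT); abstain when d(v) = 0."""
    d = log_mixture_prob(v, *event_model) - log_mixture_prob(v, *nonevent_model)
    if d > 0:
        return "EVENT"
    if d < 0:
        return "NONEVENT"
    return "UNDECIDED"
```

The word, context, and word+context variants then differ only in which count vector v is scored: the term's corpus-wide vector, the vector of triples observed in the test context, or both (adding the two log probabilities under each model).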
</Section> <Section position="5" start_page="169" end_page="173" type="metho"> <SectionTitle> 3 Experimental Results </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="169" end_page="170" type="sub_section"> <SectionTitle> 3.1 Training, test, and seed word data </SectionTitle> <Paragraph position="0"> In order to train and test the model, we created two corpora and a lexicon of event and non-event seeds. The training corpus consisted of 156,000 newswire documents, ≈100 million words, from the Foreign Broadcast Information Service, Lexis Nexis, and other online news archives. The corpus was parsed using Janya's information extraction application, Semantex, which creates both shallow, non-recursive parsing structures and dependency links, and all (wi,R,wj) statistics were extracted as described in Section 2. From the 1.9 million patterns (wi,R,∗) and (∗,R,wj) extracted from the corpus, the 48,353 that appeared more than 300 times were retained as features.</Paragraph> <Paragraph position="1"> The test corpus was composed of 77 additional documents (≈56K words), overlapping in time and content with the training set but not included in it.</Paragraph> <Paragraph position="2"> These were annotated by hand to mark event nominals. Specifically, every referential noun phrase headed by a non-proper noun was considered for whether it denoted an achievement, accomplishment, activity, or process (Parsons, 1990). Noun heads denoting any of these were marked as EVENT, and all others were left unmarked.</Paragraph> <Paragraph position="3"> All documents were first marked by a junior annotator, and then a non-blind second pass was performed by a senior annotator (the first author). Several semantic classes were difficult to annotate because they are particularly prone to coactivation, including terms denoting financial acts, legal acts, speech acts, and economic processes. In addition, for terms like mission, plan, duty, tactic, and policy, it can be unclear whether they are hyponyms of EVENT or of another abstract concept. In every case, however, the mention was labeled as an event or non-event depending on whether its use in that context appeared to be more or less event-like, respectively. Tests for the &quot;event-y&quot;ness of the context included whether an unambiguous word would be an acceptable substitute there (e.g. funds [=only non-event] for expenditure [either]).</Paragraph> <Paragraph position="4"> To create the test data, the annotated documents were also parsed to automatically extract all common-noun-headed NPs and the dependency triples they instantiate. Those with heads that aligned with the offsets of an event annotation were labeled as events; the remainder were labeled as non-events. Because of parsing errors, about 10% of annotated event instances were lost, that is, they remained unlabeled or were labeled as non-events.</Paragraph> <Paragraph position="5"> So, our results are based on the set of recoverable event nominals, a subset of all common-noun-headed NPs that were extracted. In the test corpus there were 9,381 candidate instances, 1,579 (17%) events and 7,802 (83%) non-events.</Paragraph> <Paragraph position="6"> There were 2,319 unique term types; of these, 167 types (7%) appeared both as event tokens and non-event tokens.
Some sample ambiguous terms include: behavior, attempt, settlement, deal, violation, progress, sermon, expenditure.</Paragraph> <Paragraph position="7"> We constructed two lexicons of nominals to use as the seed terms. For events, we created a list of 95 terms, such as election, war, assassination, dismissal, primarily based on introspection combined with some checks on individual terms in WordNet and other dictionaries and using Google searches to judge how &quot;event-y&quot; the term was.</Paragraph> <Paragraph position="8"> To create a list of non-events, we used WordNet and the British National Corpus. First, from the set of all lexemes that appear in only one synset in WordNet, all nouns were extracted along with the topmost hypernym they appear under. From these we retained those that both appeared on a lemmatized frequency list of the 6,318 words with more than 800 occurrences in the whole 100M-word BNC (Kilgarriff, 1997) and had one of the hypernyms GROUP, PSYCHOLOGICAL, ENTITY, or POSSESSION. Select terms from the categories STATE and PHENOMENON were also retained as non-event seeds. Examples of the 295 non-event seeds are corpse, electronics, bureaucracy, airport, cattle.</Paragraph> <Paragraph position="9"> Of the 9,381 test instances, 641 (6.8%) had a term that belonged to the seed lists. With respect to types, 137 (5.9%) of the 2,319 term types in the test data also appeared on the seed lists.</Paragraph> </Section> <Section position="2" start_page="170" end_page="173" type="sub_section"> <SectionTitle> 3.2 Experiments </SectionTitle> <Paragraph position="0"> Experiments were performed to investigate the performance of our models, both when using the original seed lists and when varying the content of the seed lists using a bootstrapping technique that relies on the probabilistic framework of the model. A 1,000-instance subset of the 9,381 test data instances was used as a validation set; the remaining 8,381 were used as evaluation data, on which we report all results (with the exception of Table 3, which is on the full test set).</Paragraph> <Paragraph position="1"> EXP1: Results using original seed sets Probabilistic models for non-events and events were built from the full lists of 295 non-event and 95 event seeds, respectively, as described above.</Paragraph> <Paragraph position="2"> Table 1 (top half: original seed set) shows the results over the 8,381 evaluation data instances when using the three classification methods described above: (i) word, (ii) context, and (iii) word+context. The first row (ALL) reports scores where all undecided responses are marked as incorrect; in the second row, the undecided answers (d = 0) are left out of the total, so the number of correct answers stays the same, but the percentage of correct answers increases (footnote 4: Att(%) does not change with bootstrapping, an artifact of the sparsity of certain feature vectors in the training and test data, and not of the model's constituent seeds). Scores are measured in terms of accuracy on the EVENT instances, accuracy on the NONEVENT instances, TOTAL accuracy across all instances, and the simple AVERAGE of the accuracies on non-events and events (last column). The AVERAGE score assumes that performance on non-events and events is equally important to us.</Paragraph>
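The scoring just described can be written out as a small sketch; the label strings, the treatment of abstentions, and the Att(%) definition are assumptions for illustration:

```python
def score(predictions, gold, drop_undecided=False):
    """Accuracy on EVENT and NONEVENT instances, TOTAL accuracy, their simple
    AVERAGE, and the fraction of instances the classifier decided (Att%).
    With drop_undecided=False abstentions count as incorrect (the ALL row);
    with drop_undecided=True they are left out of the totals."""
    correct = {"EVENT": 0, "NONEVENT": 0}
    total = {"EVENT": 0, "NONEVENT": 0}
    decided = sum(1 for p in predictions if p != "UNDECIDED")
    for pred, label in zip(predictions, gold):
        if drop_undecided and pred == "UNDECIDED":
            continue
        total[label] += 1
        if pred == label:
            correct[label] += 1
    event_acc = correct["EVENT"] / total["EVENT"]
    nonevent_acc = correct["NONEVENT"] / total["NONEVENT"]
    return {
        "EVENT": event_acc,
        "NONEVENT": nonevent_acc,
        "TOTAL": (correct["EVENT"] + correct["NONEVENT"]) / (total["EVENT"] + total["NONEVENT"]),
        "AVERAGE": (event_acc + nonevent_acc) / 2,
        "Att%": decided / len(gold),
    }
```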
<Paragraph position="3"> From EXP1, we see that the behavior of a term across an entire corpus is a better source of information about whether a particular instance of that term refers to an event than its immediate context.</Paragraph> <Paragraph position="4"> We can further infer that this is because the immediate context only provides definitive evidence for the models in 63.0% of cases; when the context model is not penalized for indecision, its accuracy improves considerably. Nonetheless, in combination with the word model, immediate context does not appear to provide much additional information over the word alone. In other words, based only on a term's distribution in the past, one can make a reasonable prediction about how it will be used when it is seen again. Consequently, it seems that a well-constructed, i.e. domain-customized, lexicon can classify nearly as well as a method that also takes context into account.</Paragraph> <Paragraph position="5"> EXP2: Results on ACE 2005 event data In addition to using the data set created specifically for this project, we also used a subset of the annotated training data created for the ACE 2005 Event Detection and Recognition (VDR) task. Because only event mentions of specific types are marked in the ACE data, only recall of ACE event nominals can be measured, rather than overall recall of event nominals and accuracy on non-event nominals. Results on the 1,934 nominal mentions of events (omitting cases of d = 0) are shown in Table 2. [Table 2: accuracy and %attempted for our classifiers and LEX 1.] The performance of the hand-crafted Lexicon 1 on the ACE data, described in Section 3.3 below, is also included.</Paragraph> <Paragraph position="6"> The fact that our method performs somewhat better on the ACE data than on our own data, while the lexicon approach performs worse (7 points higher vs. 3 points lower, respectively), can likely be explained by the fact that in creating our introspective seed set for events, we consulted the annotation manual for ACE event types and attempted to include in our list any unambiguous seed terms that fit those types.</Paragraph> <Paragraph position="7"> EXP3: Increasing the seed set via bootstrapping There are over 2,300 unlabeled vectors in the training data that correspond to the words that appear as lexical heads in the test data. These unlabeled training vectors can be powerfully leveraged using a simple bootstrapping algorithm to improve the individual models for non-events and events, as follows. Step 1: For each vector v in the unlabeled portion of the training data, row-sum normalize it to produce ~v and compute a normalized measure of confidence of the algorithm's prediction, given by the magnitude of d(~v). Step 2: Add those vectors most confidently classified as either non-events or events to the seed set for non-events or events, according to the sign of d(~v). Step 3: Recalculate the model based on the new seed lists.</Paragraph> <Paragraph position="8"> Step 4: Repeat Steps 1-3 until either no more unlabeled vectors remain or the validation accuracy no longer increases.</Paragraph>
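The four steps can be sketched as follows; the model builder and scorer are passed in as functions (for example, build_model and log_mixture_prob from the sketch in Section 2), the per-iteration additions follow the (50, 16) schedule described in the next paragraph, and the validation-based stopping test is only indicated in a comment. All names are illustrative, not the authors' code:

```python
def row_normalize(vec):
    total = float(sum(vec.values()))
    return {f: c / total for f, c in vec.items()}

def bootstrap(event_seeds, nonevent_seeds, unlabeled, build_model, log_prob,
              per_iter=(50, 16), max_iters=15):
    """Grow both seed sets by self-labeling the most confidently classified
    unlabeled training vectors and rebuilding the two models (Steps 1-4).
    The seed arguments and `unlabeled` map terms to their count vectors."""
    n_nonevent, n_event = per_iter
    for _ in range(max_iters):                    # Step 4: iterate (early stopping on
        ev_model = build_model(list(event_seeds.values()))      # validation accuracy is
        ne_model = build_model(list(nonevent_seeds.values()))   # not shown); Step 3
        scored = []                                             # happens here on re-entry
        for term, vec in unlabeled.items():
            v = row_normalize(vec)                # Step 1: normalize, then score
            d = log_prob(v, *ev_model) - log_prob(v, *ne_model)
            scored.append((abs(d), d, term))
        scored.sort(key=lambda x: x[0], reverse=True)
        added_ev = added_ne = 0                   # Step 2: most confident first,
        for _, d, term in scored:                 # keeping the seed-set ratio fixed
            if d > 0 and added_ev < n_event:
                event_seeds[term] = unlabeled[term]
                added_ev += 1
            elif d < 0 and added_ne < n_nonevent:
                nonevent_seeds[term] = unlabeled[term]
                added_ne += 1
            if added_ev == n_event and added_ne == n_nonevent:
                break
        for term in list(event_seeds) + list(nonevent_seeds):
            unlabeled.pop(term, None)
        if not unlabeled:                         # or stop when no vectors remain
            break
    return event_seeds, nonevent_seeds
```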
<Paragraph position="9"> In our experiments we added vectors to each model such that the ratio of the sizes of the seed sets remained constant, i.e. 50 non-events and 16 events were added at each iteration. Using our validation set, we determined that the bootstrapping should stop after 15 iterations (despite continuing for 21 iterations), at which point the average accuracy leveled out and then began to drop. After 15 iterations the seed set is of size (295,95) + (50,16) x 15 = (1045,335). Figure 2 shows the change in the accuracy of the model as it is bootstrapped through 15 iterations. [Figure 2: symbols on the left denote the classifier built from the initial (295,95) seeds; the bold (disconnected) symbols at the right are LEX 1.]</Paragraph> <Paragraph position="10"> TOTAL accuracy improves with bootstrapping, despite EVENT accuracy decreasing, because the test data is heavily populated with non-events, whose accuracy increases substantially. The AVERAGE accuracy also increases, which shows that bootstrapping is doing more than simply shifting the bias of the classifier to the majority class. The figure also shows that the final bootstrapped classifier comfortably outperforms Lexicon 1, which is impressive because the lexicon contains at least 13 times more terms than the seed lists.</Paragraph> <Paragraph position="11"> EXP4: Bootstrapping with a reduced number of seeds The sizes of the original seed lists were chosen somewhat arbitrarily. In order to determine whether similar performance could be obtained using fewer seeds, i.e. less human effort, we experimented with reducing the size of the seed lexicons used to initialize the bootstrapping.</Paragraph> <Paragraph position="12"> To do this, we randomly selected a fixed fraction, f%, of the (295,95) available event and non-event seeds, and built a classifier from this subset of seeds (and discarded the remaining seeds). We then bootstrapped the classifier's models using the 4-step procedure described above, using candidate seed vectors from the unlabeled training corpus, and incrementing the number of seeds until the classifier consisted of (295,95) seeds.</Paragraph> <Paragraph position="13"> We then performed 15 additional bootstrapping iterations, each adding (50,16) seeds. Since the seeds making up the initial classifier are chosen stochastically, we repeated this entire process 10 times and report in Figures 3(a) and 3(b) the mean of the total and average accuracies over these 10 folds, respectively. Both plots have five traces, with each trace corresponding to the fraction f = (20,40,60,80,100)% of labeled seeds used to build the initial models. As a point of reference, note that initializing with 100% of the seed lexicon corresponds to the first point of the traces in Figure 2 (where the x-axis is marked with f = 100%).</Paragraph> <Paragraph position="14"> Interestingly, there is no discernible difference in accuracy (total or average) for fractions f greater than 20%. However, upon bootstrapping we note the following trends. First, Figure 3(b) shows that using a larger initial seed set increases the maximum achievable accuracy, but this maximum occurs after a greater number of bootstrapping iterations; indeed the maximum for 100% is achieved at 15 (or more) iterations. This reflects the difference in rigidity of the initial models, with smaller initial models more easily misled by the seeds added by bootstrapping.
Second, the final accuracies (total and average) are correlated with the initial seed set size, which is intuitively satisfying. Third, it appears from Figure 3(a) that the total accuracy at the model size (295,95) (or 100%) is in fact anti-correlated with the size of the initial seed set, with 20% performing best. This is correct, but it highlights the sometimes misleading interpretation of the total accuracy: in this case the model is defaulting to classifying anything as a non-event (the majority class), and has a considerably impoverished event model.</Paragraph> <Paragraph position="15"> If one wants to do as well as Lexicon 1 after 15 iterations of bootstrapping, then one needs an initial seed set of size at least 60%. An alternative is to perform fewer iterations, but here we see that using 100% of the seeds comfortably achieves the highest total and average accuracies anyway.</Paragraph> </Section> <Section position="3" start_page="173" end_page="173" type="sub_section"> <SectionTitle> 3.3 Comparison with existing lexicons </SectionTitle> <Paragraph position="0"> In order to compare our weakly-supervised probabilistic method with a lexical-lookup method based on very large hand-created lexical resources, we created three lexicons of event terms, which were used as very simple classifiers of the test data. If the test instance term belongs to the lexicon, it is labeled EVENT; otherwise, it is labeled NONEVENT. The results on the full test set using these lexicons are shown in Table 3. [Table 3: percentage of correct classifications on the full test set.]</Paragraph> <Paragraph position="1"> Lex 1: 5,435 entries from NomLex (Macleod et al., 1998), FrameNet (Baker et al., 1998), CELEX (CEL, 1993), and Timebank (Day et al., 2003).</Paragraph> <Paragraph position="2"> Lex 2: 13,659 entries from the WordNet 2.0 hypernym classes EVENT, ACT, PROCESS, COGNITIVE PROCESS, and COMMUNICATION, combined with Lex 1.</Paragraph> <Paragraph position="3"> Lex 3: A combination of pre-existing lexicons in the information extraction application, drawn from WordNet, the Oxford Advanced Learner's Dictionary, etc.</Paragraph> <Paragraph position="4"> As shown in Tables 1 and 3, the relatively knowledge-poor method developed here, using around 400 seeds, performs well compared to the use of the much larger lexicons. For the task of detecting nominal events, using Lexicon 1 might be the quickest practical solution. In terms of extensibility to other semantic classes, domains, or languages lacking appropriate existing lexical resources, however, the advantage of our trainable method is clear. The primary requirements of this method are a dependency parser and a system user-developer who can provide a set of seeds for a class of interest and its complement. It should be possible in the next few years to create a dependency parser for a language with no existing linguistic resources (Klein and Manning, 2002). Rather than having to spend the considerable person-years it takes to create resources like FrameNet, CELEX, and WordNet, a better alternative will be to use weakly-supervised semantic labelers like the one described here.</Paragraph> </Section> </Section> </Paper>