<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0201">
  <Title>Cross-Document Event Coreference: Annotations, Experiments, and Observations Amit Bagga General Electric Company CRD</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Differences between Cross
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Document Event Reference and IE
and TDT
</SectionTitle>
      <Paragraph position="0"> Before proceeding further, it should be emphasized that cross-document event reference is a distinct goal from Information Extraction (IE) and Topic Detection and Tracking (TDT).</Paragraph>
      <Paragraph position="1"> Our approach differs from both IE and TDT in that it takes a very abstract definition of an event as a starting place, for instance the initial set of documents for resignation events consists of documents that have &amp;quot;resign&amp;quot; as a sub-string. This is even less information than information retrieval evaluations like TREC. IE takes as an event description large hand built event recognizers that are typically finite state machines. TDT starts with rather verbose descriptions of events. In addition to differences in what these technologies take as input to describe the event, the goal of the technologies differ as well.</Paragraph>
      <Paragraph position="2"> Information Extraction focuses on mapping from free text into structured data formats like database entries. Two separate instances of an event in two documents would be mapped into the database structures without consideration whether they were the same event or not. In fact, cross-document event tracking could well help information extraction systems by identifying sets of documents that describe the same event, and giving the patterns multiple chances to find a match.</Paragraph>
      <Paragraph position="3"> Topic Detection and Tracking seeks to classify a stream of documents into &amp;quot;bins&amp;quot; based on a description of the bins. Looking at the tasks from the TDT2 evaluation, there are examples that are more general and tasks that are more specific than our annotation. For example, the topic &amp;quot;Asian bailouts by the IMF&amp;quot; clusters documents into the same bin irrespective of which country is being bailed out. Our approach would try to more finely individuate the documents by distinguishing between countrieS and times. Another TDT topic involved the Texas Cat-John Perry, of Weston Golf Club, announced his resignation yesterday. He was the President of the Massachusetts Golf Association. During his two years in ofrice, Perry guided the MGA into a closer relationship with the Women's Golf Association of Massachusetts.</Paragraph>
      <Paragraph position="4"> Oliver &amp;quot;Biff&amp;quot; Kelly of Weymouth succeeds John Perry as president of the Massachusetts Golf Association. &amp;quot;We will have continued growth in the future,&amp;quot; said Kelly, who will serve for two years. &amp;quot;There's been a lot of changes and there will be continued changes as we head into the year 2000.&amp;quot;  tlemen's Association lawsuit against Oprah Winfrey.</Paragraph>
      <Paragraph position="5"> Given &amp;quot;lawsuits&amp;quot; as an event, we would seek to put documents mentioning that lawsuit into the same equivalent class, but would also form equivalence classes of for other lawsuits. In addition, our eventual goal is to provide generic cross-document coreference for all entities/events in a document i.e. we want to resolve cross-docuemtn coreferences for all entities and events mentioned in a document. This goal is significantly different from TDT's goal of classifying a stream of documents into &amp;quot;bins&amp;quot;.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Cross-Document Coreference for
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Individuals
</SectionTitle>
      <Paragraph position="0"> The primary technology that drives this research is cross-document coreference. Until recently, cross-document coreference had been thought to be a hard problem to solve (Grishman, 94). However, preliminary results in (Bagga, 98a) and (Bagga, 98b) show that high quality cross-document coreference is achievable.</Paragraph>
      <Paragraph position="1"> Figure 1 shows the architecture of the cross-document system built. Details about each of the main steps of the cross-document coreference algorithm are given below.</Paragraph>
      <Paragraph position="2"> * First, for each article, the within document coreference module of the University of Pennsylvania's CAMP system is run on that article. It produces coreference chains for all the entities mentioned in the article. For example, consider the two extracts in Figures 2 and 4. The coreference chains output by CAMP for the two extracts are shown in Figures 3 and 5.</Paragraph>
      <Paragraph position="3"> * Next, for the coreference chain of interest within each article (for example, the coreference chain</Paragraph>
      <Paragraph position="5"> that contains &amp;quot;John Perry&amp;quot;), the Sentence Extractor module extracts all the sentences that contain the noun phrases which form the coreference chain. In other words, the SentenceExtractor module produces a &amp;quot;summary&amp;quot; of the article with respect to the entity of interest. These summaries are a special case of the query sensitive techniques being developed at Penn using CAMP. Therefore, for doc.36 (Figure 2), since at least one of the three noun phrases (&amp;quot;John Perry,&amp;quot; &amp;quot;he,&amp;quot; and &amp;quot;Perry&amp;quot;) in the coreference chain of interest appears in each of the three sentences in the extract, the summary produced by SentenceExtractor is the extract itself. On the other hand, the summary produced by SentenceExtractor for the coreference chain of interest in doc.38 is only the first sentence of the extract because the only element of the coreference chain appears in this sentence.</Paragraph>
      <Paragraph position="6"> Finally, for each article, the VSM-Disambiguate module uses the summary extracted by the SentenceExtractor and computes its similarity with the summaries extracted from each of the other articles. The VSM-Disambiguate module uses a standard vector space model (used widely in information retrieval) (Salton, 89) to compute the similarities between the summaries. Summaries having similarity above a certain threshold are considered to be regarding the same entity.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Scoring
</SectionTitle>
      <Paragraph position="0"> In order to score the cross-document coreference chains output by the system, we had to map the cross-document coreference scoring problem to a within-document coreference scoring problem. This was done by creating a meta document consisting of the file names of each of the documents that the system was run on. Assuming that each of the doc-</Paragraph>
      <Paragraph position="2"> uments in the data sets was about a single entity, or about a single event, the cross-document coreference chains produced by the system could now be evaluated by scoring the corresponding within-document coreference chains in the meta document.</Paragraph>
      <Paragraph position="3"> We used two different scoring algorithms for scoring the output. The first was the standard algorithm for within-document coreference chains which was used for the evaluation of the systems participating in the MUC-6 and the MUC-7 coreference tasks. This algorithm computes precision and recall statistics by looking at the number of links identified by a system compared to the links in an answer key.</Paragraph>
      <Paragraph position="4"> The shortcomings of the MUC scoring algorithm when used for the cross-document coreference task forced us to develop a second algorithm - the B-CUBED algorithm - which is described in detail below. Full details about both these algorithms (including the shortcoming of the MUC scoring algorithm) can be found in (Bagga, 98).</Paragraph>
      <Paragraph position="5">  For an entity, i, we define the precision and recall with respect to that entity in Figure 6.</Paragraph>
      <Paragraph position="6"> The final precision and recall numbers are computed by the following two formulae:</Paragraph>
      <Paragraph position="8"> where N is the number of entities in the document, and wi is the weight assigned to entity i in the document. For the results discussed in this paper, equal weights were assigned to each entity in the meta document. In other words, wi = -~ for all i.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="3" type="metho">
    <SectionTitle>
5 Cross-Document Coreference for Events
</SectionTitle>
    <Paragraph position="0"> Events In order to extend our systems, as described earlier, so that it was able to handle events, we needed t o figure out a method to capture all the information about an event in a document. Previously, with named entities, it was possible to use the within-document coreference chain regarding the entity to extract a &amp;quot;summary&amp;quot; with respect to that entity. However, since CAMP does not annotate within-document coreference chains for events, it was not possible to use the same approach.</Paragraph>
    <Paragraph position="1"> The updated version of the system builds &amp;quot;summaries&amp;quot; with respect to the event of interest by extracting all the sentences in the article that contain either the verb describing the event or one of its nominalizations. Currently, sentences that contain synonyms of the verb are not extracted. However', we did conduct an experiment (described later in the paper) where the system extracted sentences con- null taining one of three pre-specified synonyms to the verb.</Paragraph>
    <Paragraph position="2"> The new version of the system was tested on several data sets.</Paragraph>
    <Section position="1" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
5.1 Analysis of Data
</SectionTitle>
      <Paragraph position="0"> for the experiments described later in the paper. In the figure, Column 1 shows the number of articles in the data set. The second column shows the average number of sentences in the summary for the entity/event of interest constructed for each article.</Paragraph>
      <Paragraph position="1"> Column 3 shows, for each summary, the average number of words that were found in at least one other summary (in the same data set). The conditions when measuring the overlap should be noted here: * the summaries are filtered for stop words * all within-document coreference chains passing through the summaries are expanded and the resulting additional noun phrases are attached to the summaries The fourth column shows for each such overlapping word, the average number of summaries (in the same data set) that it is found in. Column 5 which is the product of the numbers in Columns 3 and 4 shows, for each summary, the average number of summaries, in the data set, it shares a word with (the amount of overlap). We hypothesize here that the higher the amount of overlap, the higher is the ambiguity in the domain. We will return to this hypothesis later in  onage&amp;quot; data sets are remarkably similar. They have very similar numbers for the number of sentences per summary, the average number of overlapping words per summary, and the average number of summaries that each of the overlapping words occur in. A closer look at several of the summaries from each data set yielded the following properties that the two data sets shared: * The summaries usually consisted of a single sentence from the article.</Paragraph>
      <Paragraph position="2"> * The &amp;quot;players&amp;quot; involved in the events (people, places, companies, positions, etc.) were usually referenced in the sentences which were in the summaries.</Paragraph>
      <Paragraph position="3"> However, the &amp;quot;election&amp;quot; data set is very different from the other two sets. This data set has almost twice as many sentences per summary (2.38). In addition, the number of overlapping words in each summary is also comparatively high although the average number of summaries that an overlapping words occurs in is similar to that of the other two data sets. But, &amp;quot;elections&amp;quot; has a very high overlap  number (22.41) which is about 30% more than the other data sets. From our hypothesis it follows that this data set is comparatively much more ambiguous; a fact which is verified later in the paper.</Paragraph>
      <Paragraph position="4"> Assuming our hypothesis is true, the overlap number also gives an indication of the optimal threshold which, when chosen, will result in the best precision and recall numbers for the data set. It seems a reasonable conjecture that the optimal threshold varies inversely with the overlap number i.e. the higher the overlap number, the higher the ambiguity, and lower the optimal threshold.</Paragraph>
    </Section>
    <Section position="2" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
5.2 Experiments and Results
</SectionTitle>
      <Paragraph position="0"> We tested our cross-document coreference system on several data sets. The goal was to identify cross-document coreference chains about the same event.</Paragraph>
      <Paragraph position="1"> Figures 8 - 15 shows the results from the experiments we conducted. For each experiment conducted, the following conditions hold:  * Figure 7 shows, for each data set, the number of articles chosen for the experiment.</Paragraph>
      <Paragraph position="2"> * All of the articles in the data sets were chosen randomly from the 1996 and 1997 editions of the New York Times. The sole criterion used when choosing an article was the presence/ absence of the event of interest in the data set. For example, an article containing the word &amp;quot;election&amp;quot; would be put in the elections data set.</Paragraph>
      <Paragraph position="3"> * The answer keys for each data set were constructed manually, although scoring was automated. null Figure 16 shows for each data set, the optimal threshold, and the best precision, recall, and F-Measure obtained at that threshold.</Paragraph>
    </Section>
    <Section position="3" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
5.3 Analysis of Results
</SectionTitle>
      <Paragraph position="0"> We had mentioned earlier that we expected the optimal threshold value to vary inversely with the over- null lap number. Figure 16 verifies this - the optimal thresholds decline for the &amp;quot;espionage&amp;quot;, &amp;quot;resign&amp;quot;, and the &amp;quot;election&amp;quot; data sets (which have increasing overlap numbers). In addition, the results for the &amp;quot;election&amp;quot; data set also verify our hypothesis that data sets with large overlap numbers are more ambiguous.</Paragraph>
      <Paragraph position="1"> There are several different factors which can affect the performance of the system. We describe some of the more important ones below.</Paragraph>
      <Paragraph position="2"> expansion of coreference chains: Expanding the coreference chains that pass through the sentences contained in a summary and appending the coreferent noun phrases to the summary results in approximately a 5 point increase in F-Measure for each data set.</Paragraph>
      <Paragraph position="3"> use of synonyms: For the &amp;quot;election&amp;quot; data set, the use of three synonyms (poll, vote, and campaign) to extract additional sentences for the summaries helped in increasing the performance  of the system by 3 F-measure points. The resulting increase in performance implies that the sentences containing the term &amp;quot;election&amp;quot; did not contain sufficient information for disambiguating all the elections. Some of the disambiguation information (example: the &amp;quot;players&amp;quot; involved in the event) was mentioned in the additional sentences. This also strengthens our observation that this data set is more comparatively more ambiguous.</Paragraph>
      <Paragraph position="4"> presence of a single, large coreference chain: The presence of a single, large cross-document coreference chain in the test set affects the performance of a system with respect to the scoring algorithm used. For example, the &amp;quot;election&amp;quot; data set consisted of a very large coreference chain - the coreference chain consisting of articles regarding the 1996 US General (Congressional and Presidential) elections. This chain consisted of 36 of the 73 links in the data set. The B-CUBED algorithm penalizes systems severely for precision and recall errors in such a scenario. The difference in the results reported by the two scoring algorithms for this data set is glaring. The MUC scorer reports a 71 point F-Measure while the B-CUBED scorer reports only a 43 point F-Measure.</Paragraph>
    </Section>
    <Section position="4" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
5.4 The "election" Data Set
</SectionTitle>
      <Paragraph position="0"> Since the results for the &amp;quot;election&amp;quot; data set were significantly lower than other results, we decided to analyze this data set in more detail. The following factors makes this data set harder to deal with: presence of sub-events: The presence of sub-events that correspond to a single event makes the task harder. The &amp;quot;election&amp;quot; data set often mentioned election events which consisted of more than one actual election. For example, the data set contained articles which mentioned the 1996 US General Elections which comprised of the US Congressional elections and the US Presidential elections. In addition, there were articles which only mentioned the sub-elections without mentioning the 'more general event.</Paragraph>
      <Paragraph position="1"> &amp;quot;players&amp;quot; are the same: Elections is one event where the players involved are often the same.</Paragraph>
      <Paragraph position="2"> For example, elections are about the same positions, in the same places, and very often involving the same people making the task very ambiguous. Very often the only disambiguating factor is the year (temporal information) of the election and this too has to be inferred. For example, articles will mention an election in the following ways: &amp;quot;the upcoming November elections,&amp;quot; &amp;quot;next years elections,&amp;quot; &amp;quot;last fall's elections,&amp;quot; etc.</Paragraph>
      <Paragraph position="3"> descriptions are very similar: Another very important factor that makes the &amp;quot;elections&amp;quot; task harder is the fact that most election issues (across elections in different countries) are very similar. For example: crime rates, inflation, unemployment, etc.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="3" end_page="6" type="metho">
    <SectionTitle>
6 Interannotator Agreement
</SectionTitle>
    <Paragraph position="0"> When comparing machine performance against a human annotation, it is important to understand how consistently two humans can perform the same task.</Paragraph>
    <Paragraph position="1"> If people cannot replicate one other, then there may be serious problems with the task definition that question the wisdom of developing automated methods for the task.</Paragraph>
    <Paragraph position="2"> Both authors independently annotated the &amp;quot;elections&amp;quot; data set with no agreed upon annotation standard in contrast to how data sets were annotated in the MUC-6/7 coreference task. Instead, we used  whatever mutual understanding we had on what the goal of our annotation was from phone calls over the course of a few months. We did not develop an annotation standard because we have not considered a sufficiently broad range of events to write down necessary and sufficient conditions for event coreference. For now our understanding is: Any two events are in the same equivalence class if they are of the same generic class, ie &amp;quot;elections&amp;quot; or &amp;quot;resignations&amp;quot;, and the principle actors, entities, and times are the same.</Paragraph>
    <Paragraph position="3"> This definition does not cover the specificity of event descriptions, i.e. the difference between the general November 96 elections and a particular election in a district (at the same time). We left this decision up to human judgment rather than trying to codify the decision at this early stage.</Paragraph>
    <Paragraph position="4"> Interannotator agreement was evaluated in two phases, a completely independent phase and a consensus phase where we compared annotations and corrected obvious errors and attentional lapses but allowed differences of opinion when there was room for judgment. The results for the completely independent annotation were 87% precision and 87% recall as determined by treating one annotation as truth and the other as a systems output with the MUC scorer. Perfect agreement between the annotators would result in 100% precision and recall. These results are quite high given the lack of a clear annotation standard in combination with the ambiguity of the task.</Paragraph>
    <Paragraph position="5"> After adjudication, the agreement increased significantly to 95% precision and recall which indicates that there was genuine disagreement for 5% of the links found across two annotators. Using the B-CUBED scorer the results were 80% for the independent case and 93% for the consensus phase. These figures establish an upper bound on possible machine performance and suggest that cross document event coreference is a fairly natural phenomenon for people to recognize.</Paragraph>
  </Section>
  <Section position="7" start_page="6" end_page="6" type="metho">
    <SectionTitle>
7 Future Research
</SectionTitle>
    <Paragraph position="0"> The goal of this research has been to gain experience in cross document reference across a range of entities/events. We have focused on simple techniques (the vector space model) over rich data structures (within document coreference annotated text) as a means to better understanding of where to further explore the phenomenon.</Paragraph>
    <Paragraph position="1"> It is worth exploring alternatives to the vector space model since there are areas where it could be improved. One possibility would be to explicitly identify the individuating factors of events, i.e. the &amp;quot;players&amp;quot; of an event, and then individuate by comparing these factors. This would be particularly helpful when there is only one individuating factor like a date that differentiates two events.</Paragraph>
    <Paragraph position="2"> The benefit of cross document entity reference centers around nove.1 interfaces to large data collections, so we are focusing on potential applications that include link visualization (Bagga, 98c), question answering, and multi-document summarization.</Paragraph>
  </Section>
class="xml-element"></Paper>