<?xml version="1.0" standalone="yes"?> <Paper uid="P98-1012"> <Title>Entity-Based Cross-Document Coreferencing Using the Vector Space Model</Title>
<Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Cross-Document Coreference: The Problem </SectionTitle>
<Paragraph position="0"> Cross-document coreference is a distinct technology from Named Entity recognizers like IsoQuest's NetOwl and IBM's Textract because it attempts to determine whether name matches actually denote the same individual (not all John Smiths are the same).</Paragraph>
<Paragraph position="1"> Neither NetOwl nor Textract has mechanisms which try to keep same-named individuals distinct if they are different people.</Paragraph>
<Paragraph position="2"> Cross-document coreference also differs in substantial ways from within-document coreference.</Paragraph>
<Paragraph position="3"> Within a document there is a certain amount of consistency which cannot be expected across documents. In addition, the problems encountered during within-document coreference are compounded when looking for coreferences across documents because the underlying principles of linguistics and discourse context no longer apply across documents. Because the underlying assumptions in cross-document coreference are so distinct, the problem requires novel approaches.</Paragraph> </Section>
<Section position="4" start_page="0" end_page="80" type="metho"> <SectionTitle> 3 Architecture and the Methodology </SectionTitle>
<Paragraph position="0"> Figure 1 shows the architecture of the cross-document system we developed. The system is built upon the University of Pennsylvania's within-document coreference system, CAMP, which participated in the Seventh Message Understanding Conference (MUC-7) within-document coreference task (MUC7 1998).</Paragraph>
<Paragraph position="1"> Our system takes as input the coreference-processed documents output by CAMP. It then passes these documents through the SentenceExtractor module, which extracts, for each document, all the sentences relevant to a particular entity of interest. The VSM-Disambiguate module then uses a vector space model algorithm to compute similarities between the sentences extracted for each pair of documents.</Paragraph>
<Paragraph position="3"> Figure 2 (extract from doc.36; opening words missing in this version): [...] announced his resignation yesterday. He was the President of the Massachusetts Golf Association. During his two years in office, Perry guided the MGA into a closer relationship with the Women's Golf Association of Massachusetts.</Paragraph>
<Paragraph position="4"> Figure 4 (extract from doc.38): Oliver &quot;Biff&quot; Kelly of Weymouth succeeds John Perry as president of the Massachusetts Golf Association. &quot;We will have continued growth in the future,&quot; said Kelly, who will serve for two years. &quot;There's been a lot of changes and there will be continued changes as we head into the year 2000.&quot;
Details about each of the main steps of the cross-document coreference algorithm are given below.
* First, for each article, CAMP is run on the article. It produces coreference chains for all the entities mentioned in the article.
For example, consider the two extracts in Figures 2 and 4.</Paragraph>
<Paragraph position="5"> The coreference chains output by CAMP for the two extracts are shown in Figures 3 and 5.</Paragraph>
<Paragraph position="7"> * Next, for the coreference chain of interest within each article (for example, the coreference chain that contains &quot;John Perry&quot;), the SentenceExtractor module extracts all the sentences that contain the noun phrases which form the coreference chain. In other words, the SentenceExtractor module produces a &quot;summary&quot; of the article with respect to the entity of interest. These summaries are a special case of the query-sensitive techniques being developed at Penn using CAMP. Therefore, for doc.36 (Figure 2), since at least one of the three noun phrases (&quot;John Perry,&quot; &quot;he,&quot; and &quot;Perry&quot;) in the coreference chain of interest appears in each of the three sentences in the extract, the summary produced by SentenceExtractor is the extract itself. On the other hand, the summary produced by SentenceExtractor for the coreference chain of interest in doc.38 is only the first sentence of the extract, because the only element of the coreference chain appears in this sentence.</Paragraph>
<Paragraph position="8"> * For each article, the VSM-Disambiguate module uses the summary extracted by the SentenceExtractor and computes its similarity with the summaries extracted from each of the other articles. Summaries whose similarity exceeds a certain threshold are considered to be about the same entity.</Paragraph> </Section>
<Section position="5" start_page="80" end_page="80" type="metho"> <SectionTitle> 4 University of Pennsylvania's CAMP System </SectionTitle>
<Paragraph position="0"> The University of Pennsylvania's CAMP system resolves within-document coreferences for several different classes, including pronouns and proper names (Baldwin 95). It ranked among the top systems in the coreference task during the MUC-6 and the MUC-7 evaluations.</Paragraph>
<Paragraph position="1"> The coreference chains output by CAMP enable us to gather all the information about the entity of interest in an article. This information about the entity is gathered by the SentenceExtractor module and is used by the VSM-Disambiguate module for disambiguation purposes. Consider the extract for doc.36 shown in Figure 2. We are able to include the fact that the John Perry mentioned in this article was the president of the Massachusetts Golf Association only because CAMP recognized that the &quot;he&quot; in the second sentence is coreferent with &quot;John Perry&quot; in the first. And it is this fact which actually helps VSM-Disambiguate decide that the two John Perrys in doc.36 and doc.38 are the same person.</Paragraph> </Section>
<Section position="6" start_page="80" end_page="80" type="metho"> <SectionTitle> 5 The Vector Space Model </SectionTitle>
<Paragraph position="0"> The vector space model used for disambiguating entities across documents is the standard vector space model used widely in information retrieval (Salton 89). In this model, each summary extracted by the SentenceExtractor module is stored as a vector of terms. The terms in the vector are in their morphological root form and are filtered for stop-words (words that have no information content, like a, the, of, an, ...).
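As an illustration of this preprocessing step, the following is a minimal Python sketch (not the authors' implementation) of how an extracted summary might be turned into a term vector. The tiny stop-word list and the naive suffix stripper are stand-ins for the stop-word filtering and morphological analysis actually used; all function and variable names here are ours.

# A minimal sketch of summary-to-term-vector preprocessing: lowercase tokens,
# a small illustrative stop-word list, and a crude suffix stripper standing in
# for true morphological rooting.
import re
from collections import Counter

STOP_WORDS = {"a", "an", "the", "of", "in", "and", "to", "was", "he", "his"}  # illustrative only

def root_form(token):
    # Naive stand-in for morphological analysis: strip a few common suffixes.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def term_vector(summary_text):
    tokens = re.findall(r"[a-z]+", summary_text.lower())
    terms = [root_form(t) for t in tokens if t not in STOP_WORDS]
    return Counter(terms)  # raw term frequencies (tf); weighting comes later

# term_vector("He was the President of the Massachusetts Golf Association.")
# -> counts for 'president', 'massachusett', 'golf', 'association'
# (the crude stemmer visibly over-strips; a real morphological analyzer would not)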
If S_1 and S_2 are the vectors for the two summaries extracted from documents D_1 and D_2, then their similarity is computed as:
Sim(S_1, S_2) = \sum_{\text{common terms } t_j} w_{1j} \times w_{2j}
where t_j is a term present in both S_1 and S_2, w_{1j} is the weight of the term t_j in S_1, and w_{2j} is the weight of t_j in S_2.</Paragraph>
<Paragraph position="1"> The weight of a term t_j in the vector S_i for a summary is given by:</Paragraph>
<Paragraph position="2"> w_{ij} = \frac{tf \times \log(N/df)}{\sqrt{s_{i1}^2 + s_{i2}^2 + \cdots + s_{in}^2}} </Paragraph>
<Paragraph position="3"> where tf is the frequency of the term t_j in the summary, N is the total number of documents in the collection being examined, and df is the number of documents in the collection in which the term t_j occurs. \sqrt{s_{i1}^2 + s_{i2}^2 + \cdots + s_{in}^2} is the cosine normalization factor and is equal to the Euclidean length of the vector S_i.</Paragraph>
<Paragraph position="4"> The VSM-Disambiguate module, for each summary S_i, computes the similarity of that summary with each of the other summaries. If the computed similarity is above a pre-defined threshold, then the entities of interest in the two summaries are considered to be coreferent.</Paragraph> </Section>
<Section position="7" start_page="80" end_page="80" type="metho"> <SectionTitle> 6 Experiments </SectionTitle>
<Paragraph position="0"> The cross-document coreference system was tested on a highly ambiguous test set which consisted of 197 articles from 1996 and 1997 editions of the New York Times. The sole criterion for including an article in the test set was the presence of a string in the article matching the &quot;/John.*?Smith/&quot; regular expression. In other words, all of the articles either contained the name John Smith or contained some variation with a middle initial/name. The system did not use any New York Times data for training purposes. The answer keys regarding the cross-document chains were manually created, but the scoring was completely automated.</Paragraph>
<Section position="1" start_page="80" end_page="80" type="sub_section"> <SectionTitle> 6.1 Analysis of the Data </SectionTitle>
<Paragraph position="0"> There were 35 different John Smiths mentioned in the articles. Of these, 24 were mentioned in only one article each. The other 173 articles were about the 11 remaining John Smiths. The background of these John Smiths, and the number of articles pertaining to each, varied greatly. Descriptions of a few of the John Smiths are: Chairman and CEO of General Motors; assistant track coach at UCLA; the legendary explorer; the main character in Disney's Pocahontas; and the former leader of Britain's Labour Party.</Paragraph> </Section> </Section>
<Section position="8" start_page="80" end_page="83" type="metho"> <SectionTitle> 7 Scoring the Output </SectionTitle>
<Paragraph position="0"> In order to score the cross-document coreference chains output by the system, we had to map the cross-document coreference scoring problem to a within-document coreference scoring problem. This was done by creating a meta document consisting of the file names of each of the documents that the system was run on. Assuming that each of the documents in the data set was about a single John Smith, the cross-document coreference chains produced by the system could then be evaluated by scoring the corresponding within-document coreference chains in the meta document.</Paragraph>
<Paragraph position="1"> We used two different scoring algorithms for scoring the output.
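Before describing the two algorithms, here is a hypothetical sketch of the meta-document mapping just described. The data structures and names are ours, chosen only to illustrate the idea of treating each file name as a mention in a single meta document so that a within-document scorer can be applied.

# Hypothetical sketch: each file name becomes a "mention" in one meta document,
# and every cross-document chain (a set of file names about one John Smith)
# becomes a within-document coreference chain over those mentions.

def build_meta_document(all_files, cross_doc_chains):
    """Return (mentions, chains) for the meta document.

    all_files        -- list of file names the system was run on
    cross_doc_chains -- list of sets of file names, one set per John Smith
    """
    mention_id = {fname: i for i, fname in enumerate(all_files)}
    meta_chains = [sorted(mention_id[f] for f in chain) for chain in cross_doc_chains]
    return list(mention_id.values()), meta_chains

# build_meta_document(["doc.36", "doc.38", "doc.40"], [{"doc.36", "doc.38"}, {"doc.40"}])
# -> ([0, 1, 2], [[0, 1], [2]])
# Both the system output and the manually created key are mapped this way, and
# any within-document scorer (MUC or B-CUBED) can then compare the two.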
The first was the standard algorithm for within-document coreference chains which was used for the evaluation of the systems participating in the MUC-6 and the MUC-7 coreference tasks.</Paragraph>
<Paragraph position="2"> The shortcomings of the MUC scoring algorithm when used for the cross-document coreference task forced us to develop a second algorithm.</Paragraph>
<Paragraph position="3"> Details about both these algorithms follow.</Paragraph>
<Section position="1" start_page="81" end_page="81" type="sub_section"> <SectionTitle> 7.1 The MUC Coreference Scoring Algorithm </SectionTitle>
<Paragraph position="0"> The MUC algorithm computes precision and recall statistics by looking at the number of links identified by a system compared to the links in an answer key. In the model-theoretic description of the algorithm that follows, the term &quot;key&quot; refers to the manually annotated coreference chains (the truth) while the term &quot;response&quot; refers to the coreference chains output by a system. An equivalence set is the transitive closure of a coreference chain. The algorithm, developed by (Vilain 95), computes recall in the following way.</Paragraph>
<Paragraph position="1"> First, let S be an equivalence set generated by the key, and let R1 ... Rm be the equivalence classes generated by the response. Then we define the following functions over S:
* p(S) is a partition of S relative to the response. Each subset of S in the partition is formed by intersecting S and those response sets Ri that overlap S. Note that the equivalence classes defined by the response may include implicit singleton sets - these correspond to elements that are mentioned in the key but not in the response. For example, say the key generates the equivalence class S = {A B C D}, and the response is simply <A-B>. The relative partition p(S) is then {A B}, {C}, and {D}.</Paragraph>
<Paragraph position="2"> * c(S) is the minimal number of &quot;correct&quot; links necessary to generate the equivalence class S. It is clear that c(S) is one less than the cardinality of S, i.e., c(S) = |S| - 1.</Paragraph>
<Paragraph position="3"> * m(S) is the number of &quot;missing&quot; links in the response relative to the key set S. As noted above, this is the number of links necessary to fully reunite the subsets of the partition p(S), which is one less than the number of subsets, i.e., m(S) = |p(S)| - 1.</Paragraph>
<Paragraph position="5"> Looking in isolation at a single equivalence class in the key, the recall error for that class is just the number of missing links divided by the number of correct links, i.e., \frac{m(S)}{c(S)}. Recall in turn is \frac{c(S) - m(S)}{c(S)}, which equals \frac{(|S| - 1) - (|p(S)| - 1)}{|S| - 1}.</Paragraph>
<Paragraph position="7"> The whole expression can now be simplified to \frac{|S| - |p(S)|}{|S| - 1}. Precision is computed by switching the roles of the key and the response in the above formulation.</Paragraph> </Section>
<Section position="2" start_page="81" end_page="82" type="sub_section"> <SectionTitle> 7.2 Shortcomings of the MUC Scoring Algorithm </SectionTitle>
<Paragraph position="0"> While the (Vilain 95) algorithm provides intuitive results for coreference scoring, it does not work as well in the context of evaluating cross-document coreference. There are two main reasons.</Paragraph>
<Paragraph position="1"> 1. The algorithm does not give any credit for separating out singletons (entities that occur in chains consisting of only one element, the entity itself) from the other chains which have been identified. This follows from the convention in coreference annotation of not identifying those entities that are markable as possibly coreferent with other entities in the text.
Rather, entities are only marked as being coreferent if they actually are coreferent with other entities in the text. This shortcoming could easily be overcome with different annotation conventions and minor changes to the algorithm, but it is worth noting.</Paragraph>
<Paragraph position="2"> 2. All errors are considered to be equal. The MUC scoring algorithm penalizes the precision numbers equally for all types of errors. It is our position that, for certain tasks, some coreference errors do more damage than others.</Paragraph>
<Paragraph position="3"> Consider the following examples: suppose the truth contains two large coreference chains and one small one (Figure 6), and suppose Figures 7 and 8 show two different responses. We will explore two different precision errors. The first error connects one of the large coreference chains with the small one (Figure 7). The second error connects the two large coreference chains with an errant coreference link (Figure 8). It is our position that the second error is more damaging because, compared to the first error, it makes many more entities coreferent that should not be. This distinction is not reflected in the (Vilain 95) scorer, which scores both responses as having a precision of 90% (Figure 9).</Paragraph> </Section>
<Section position="3" start_page="82" end_page="82" type="sub_section"> <SectionTitle> 7.3 Our B-CUBED Scoring Algorithm </SectionTitle>
<Paragraph position="0"> Imagine a scenario where a user retrieves a collection of articles about John Smith, finds a single article about the particular John Smith of interest, and wants to see all the other articles about that individual. In commercial systems with news data, precision is typically the desired goal in such settings.</Paragraph>
<Paragraph position="1"> As a result, we wanted to model the accuracy of the system on a per-document basis and then build a more global score based on the sum of the user's experiences.</Paragraph>
<Paragraph position="2"> Consider the case where the user selects document 6 in Figure 8. This is a good outcome, with all the relevant documents being found by the system and no extraneous documents. If the user selected document 1, then there are 5 irrelevant documents in the system's output, and precision is quite low.</Paragraph>
<Paragraph position="3"> The goal of our scoring algorithm, then, is to model the average precision and recall a user experiences when looking for more documents about the same person after selecting a single document.</Paragraph>
<Paragraph position="4"> Instead of looking at the links produced by a system, our algorithm looks at the presence/absence of entities in the chains produced. Therefore, we compute the precision and recall numbers for each entity in the document. The numbers computed with respect to each entity in the document are then combined to produce final precision and recall numbers for the entire output.</Paragraph>
<Paragraph position="5"> For an entity i, we define the precision and recall with respect to that entity in Figure 10.</Paragraph>
<Paragraph position="6"> The final precision and recall numbers are computed by the following two formulae:
Final Precision = \sum_{i=1}^{N} w_i \times Precision_i
Final Recall = \sum_{i=1}^{N} w_i \times Recall_i
where N is the number of entities in the document, and w_i is the weight assigned to entity i in the document. For all the examples and experiments in this paper we assign equal weights to each entity, i.e., w_i = 1/N. We have also looked at the possibility of using other weighting schemes.
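The following Python sketch shows the B-CUBED computation with equal weights w_i = 1/N. The specific chain memberships for Figures 6-8 are not given in this excerpt; the sets below are hypothetical contents chosen to be consistent with the worked example and final numbers reported below (an entity-6 precision of 2/7 for the Figure 7 response, and final precisions of roughly 76% and 58%).

# A sketch of the B-CUBED computation described above, with equal weights
# w_i = 1/N. The chain memberships for Figures 6-8 are assumed, not quoted
# from the paper.

def b_cubed(truth_chains, response_chains):
    # Map each entity to the chain (as a set) that contains it.
    truth = {e: set(c) for c in truth_chains for e in c}
    output = {e: set(c) for c in response_chains for e in c}
    entities = sorted(truth)
    n = len(entities)
    precision = recall = 0.0
    for e in entities:
        correct = len(truth[e] & output[e])          # correct elements in the output chain for e
        precision += (correct / len(output[e])) / n  # weight w_i = 1/N
        recall += (correct / len(truth[e])) / n
    return precision, recall

truth = [{1, 2, 3, 4, 5}, {6, 7}, {8, 9, 10, 11, 12}]        # Figure 6 (assumed contents)
response_fig7 = [{1, 2, 3, 4, 5, 6, 7}, {8, 9, 10, 11, 12}]  # a large chain linked with the small one
response_fig8 = [{1, 2, 3, 4, 5, 8, 9, 10, 11, 12}, {6, 7}]  # the two large chains linked

print(b_cubed(truth, response_fig7))  # ~ (0.76, 1.0)
print(b_cubed(truth, response_fig8))  # ~ (0.58, 1.0)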
Further details about the B-CUBED algorithm, including a model-theoretic version of the algorithm, can be found in (Bagga 98a).</Paragraph>
<Paragraph position="7"> Consider the response shown in Figure 7. Using the B-CUBED algorithm, the precision for entity-6 in the document equals 2/7 because the chain output for the entity contains 7 elements, 2 of which are correct, namely {6,7}. The recall for entity-6, however, is 2/2 because the chain output for the entity has 2 correct elements in it and the &quot;truth&quot; chain for the entity contains only those 2 elements. Figure 9 shows the final precision and recall numbers computed by the B-CUBED algorithm for the examples shown in Figures 7 and 8. The figure also shows the precision and recall numbers for each entity (ordered by entity number).</Paragraph> </Section>
<Section position="4" start_page="82" end_page="83" type="sub_section"> <SectionTitle> 7.4 Overcoming the Shortcomings of the MUC Algorithm </SectionTitle>
Figure 10: Precision_i = (number of correct elements in the output chain containing entity_i) / (number of elements in the output chain containing entity_i); Recall_i = (number of correct elements in the output chain containing entity_i) / (number of elements in the truth chain containing entity_i)
<Paragraph position="0"> The B-CUBED algorithm does overcome the two main shortcomings of the MUC scoring algorithm discussed earlier. It implicitly overcomes the first shortcoming of the MUC-6 algorithm by calculating the precision and recall numbers for each entity in the document (irrespective of whether an entity is part of a coreference chain). Consider the responses shown in Figures 7 and 8. We mentioned earlier that the error of linking the two large chains in the second response is more damaging than the error of linking one of the large chains with the smaller chain in the first response. Our scoring algorithm takes this into account and computes a final precision of 76% for the first response (Figure 7) and 58% for the second response (Figure 8). In comparison, the MUC algorithm computes a precision of 90% for both responses (Figure 9).</Paragraph> </Section> </Section> </Paper>