File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/relat/06/w06-3804_relat.xml
Size: 6,211 bytes
Last Modified: 2025-10-06 14:15:56
<?xml version="1.0" standalone="yes"?> <Paper uid="W06-3804"> <Title>Measuring Aboutness of an Entity in a Text</Title> <Section position="4" start_page="26" end_page="27" type="relat"> <SectionTitle> 3 Experiments and Results </SectionTitle> <Paragraph position="0"> For learning person-related words we used a training corpus consisting of biographical texts of persons obtained from the Web (from http://www.biography.com) and biographical and non-biographical texts from DUC-2002 and DUC-2003. To consider a term biography-related, we set a likelihood ratio threshold such that the hypothesis of independence can be rejected at a significance level of less than 0.0025, ensuring that the selected terms are really biography-related.</Paragraph> <Paragraph position="1"> In order to evaluate the aboutness computation, we considered five input queries consisting of a proper person name phrase (&quot;Dan Quayle&quot; (D), &quot;Hillary Clinton&quot; (H), &quot;Napoleon&quot; (N), &quot;Sadam Hussein&quot; (S) and &quot;Sharon Stone&quot; (ST)) and downloaded 5 texts from the Web for each query (each text contains at least one exact match with the input query). Two persons were asked to rank the texts according to relevancy, as if they were searching for biographical information on the input person (100% agreement was obtained). Two aspects are important in determining relevancy: a text should really contain biographical information about the input person, and it should do so almost exclusively, so that the reader does not lose time with other information. For each query, at least one of the texts is a biographical text and one of the texts only marginally mentions the person in question. 
All texts except for the biography texts speak about other persons, and pronouns are abundantly used.</Paragraph> <Paragraph position="2"> The &quot;Hillary Clinton&quot; texts mention few persons other than Hillary, in contrast with the &quot;Dan Quayle&quot;, &quot;Napoleon&quot; and &quot;Sadam Hussein&quot; texts. The &quot;Hillary Clinton&quot; texts are in general quite relevant for this first lady. For &quot;Napoleon&quot; there is one biographical text on Napoleon's surgeon that mentions Napoleon only marginally. The &quot;Dan Quayle&quot; texts contain a lot of direct speech. For &quot;Sharon Stone&quot;, 4 out of the 5 texts describe a movie in which this actress played a role, and are thus only marginally relevant to a request for biographical data on the actress.</Paragraph> <Paragraph position="3"> Then we ranked the texts based on the TF, TFCOREF, TFREF, TFCOREFREF and LM scores and computed the congruence of each ranking with the manual ranking [equation missing in the source], where n is the number of items in the 2 rankings and r_{x,i} and r_{m,i} denote the position of the i-th item in the computed ranking and the manual ranking,</Paragraph> <Paragraph position="5"> respectively. Table 1 shows the results.</Paragraph> <Paragraph position="6"> 4 Discussion of the Results and Related Research From our limited experiments we can draw the following conclusions. As expected, erroneous coreference resolution worsens the results compared to the TF baseline. In one of the &quot;Napoleon&quot; texts, one mention of Napoleon and one mention of the name of his surgeon cause a large number of pronouns in the text to be wrongly resolved. They all refer to the surgeon, but the system considers them as referring to Napoleon, so that the ranking of this text is completely inverted compared to the ideal one. Adding other reference information gives mixed results. 
Computing the principal eigenvector of a text's link matrix, which represents the reference relationships between entities, provides a natural way of ranking the texts with regard to the person entity. This can be explained as follows. Decomposition into eigenvectors breaks down the original relationships into linearly independent components. Sorting them according to their corresponding eigenvalues orders the components from the most to the least important information. By keeping only the principal eigenvector, we retain the most important information, which best distinguishes it from other information, and ignore marginal information. In this way we hope to smooth out some of the noise that is generated when building the links. On the other hand, when wrongly detected relationships are dominant, they will be reinforced (as is the case in the &quot;Napoleon&quot; text). Although an aboutness score is normalized by the sum of a text's entity scores, the effect of this normalization and the behavior of eigenvectors for texts of different lengths should be studied.</Paragraph> <Paragraph position="7"> The work is inspired by link analysis algorithms such as HITS, which uses theories of spectral partitioning of a graph for detecting authoritative pages in a graph of hyperlinked pages (Kleinberg 1998).</Paragraph> <Paragraph position="8"> Analogously, Zha (2002) detects terms and sentences with a high salience in a text and uses these for summarization. The graph here is made of linked term and sentence nodes. Other work on text summarization computes centrality on graphs (Erkan and Radev 2004; Mihalcea and Tarau 2004). We use a linguistic motivation for linking terms in texts, founded in reference relationships such as coreference and reference by biographical terms in certain syntactic constructs. 
Intuitively, an important entity is linked to many referents; the more important the referents are, the more important the entity is. Although latent semantic indexing (LSI) is also used to detect the main topics in a set of documents or sentences, it does not explicitly model the weights of the edges between entities.</Paragraph> <Paragraph position="9"> Our implementation aims at measuring the aboutness of an entity from a biographical viewpoint. One can easily focus on other viewpoints by choosing different terms to enter into a reference relationship with the input entity (e.g., computing the aboutness of an input animal name with regard to its reproductive activities).</Paragraph> </Section> </Paper>
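
The principal-eigenvector ranking discussed in the text can be illustrated with a minimal sketch. It is not the authors' implementation: the edge format, the function names (`build_link_matrix`, `principal_eigenvector`, `aboutness_scores`) and the toy link weights are hypothetical, and power iteration is only one possible way to obtain the eigenvector. The sketch assumes an undirected, non-negative link matrix in which every node counts as an entity, and it normalizes the scores by their sum, as the text describes for the aboutness score.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical edge format: (node_a, node_b, non-negative link weight).
Edge = Tuple[str, str, float]


def build_link_matrix(edges: Iterable[Edge]) -> Tuple[List[str], List[List[float]]]:
    """Build a symmetric weighted link matrix from undirected reference links."""
    edge_list = list(edges)
    nodes = sorted({name for a, b, _ in edge_list for name in (a, b)})
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0.0] * len(nodes) for _ in nodes]
    for a, b, weight in edge_list:
        matrix[index[a]][index[b]] += weight
        matrix[index[b]][index[a]] += weight
    return nodes, matrix


def principal_eigenvector(matrix: List[List[float]],
                          iterations: int = 500,
                          tol: float = 1e-12) -> List[float]:
    """Approximate the principal eigenvector by power iteration.

    Each step multiplies by (matrix + identity). The shift keeps the
    eigenvectors unchanged but prevents the oscillation that plain power
    iteration shows on bipartite link structures. Scores are rescaled to
    sum to 1 after every step.
    """
    n = len(matrix)
    if n == 0:
        return []
    scores = [1.0 / n] * n
    for _ in range(iterations):
        updated = [scores[i] + sum(matrix[i][j] * scores[j] for j in range(n))
                   for i in range(n)]
        total = sum(updated)
        updated = [value / total for value in updated]
        change = max(abs(u - s) for u, s in zip(updated, scores))
        scores = updated
        if change < tol:
            break
    return scores


def aboutness_scores(edges: Iterable[Edge]) -> Dict[str, float]:
    """Map every node to its sum-normalized principal-eigenvector score."""
    nodes, matrix = build_link_matrix(edges)
    return dict(zip(nodes, principal_eigenvector(matrix)))


# Toy, made-up links: the query person is tied to a pronoun and a
# biographical term more strongly than another person in the same text.
toy_links = [
    ("query_person", "he", 3.0),
    ("query_person", "born", 2.0),
    ("other_person", "born", 1.0),
]
toy_scores = aboutness_scores(toy_links)
```

On this toy input `query_person` receives a higher score than `other_person`, because it carries the heavier links. The same mechanism shows the weakness noted in the discussion: if the pronoun links were wrongly attached to the other person, those erroneous links would dominate and be reinforced.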