<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3804">
<Title>Measuring Aboutness of an Entity in a Text</Title>
<Section position="3" start_page="0" end_page="26" type="intro">
<SectionTitle> 2 Methods </SectionTitle>
<Paragraph position="0"> Our approach involves the detection of entities and their noun phrase coreferents, the generation of terms that are correlated with biographical information, the detection of references between entities, and the computation of the aboutness score. As linguistic resources we used the LT-POS tagger developed at the University of Edinburgh and the Charniak parser developed at Brown University.</Paragraph>
<Section position="1" start_page="25" end_page="25" type="sub_section">
<SectionTitle> 2.1 Noun Phrase Coreference Resolution </SectionTitle>
<Paragraph position="0"> Coreference resolution focuses on detecting &quot;identity&quot; relationships between noun phrases (i.e., not on is-a or whole/part links). It is natural to view coreferencing as a partitioning or clustering of the set of entities. The idea is to group coreferents into the same cluster, which is accomplished in two steps: 1) detection of the entities and extraction of their feature sets; 2) clustering of the entities. For the first subtask we use the same set of features as in Cardie and Wagstaff (1999). For the second step we use the progressive fuzzy clustering algorithm described in Angheluta et al. (2004).</Paragraph>
</Section>
<Section position="2" start_page="25" end_page="25" type="sub_section">
<SectionTitle> 2.2 Learning Biographical Terms </SectionTitle>
<Paragraph position="0"> We learn a term's biographical value as the correlation of the term with texts of a biographical nature.</Paragraph>
<Paragraph position="1"> There are different ways of learning associations present in corpora (e.g., the mutual information statistic or the chi-square statistic). We use the likelihood ratio for a binomial distribution (Dunning 1993), which tests the hypothesis that the term occurs independently of texts of a biographical nature, given a large corpus of biographical and non-biographical texts. To consider a term biography-related, we set a likelihood ratio threshold such that the independence hypothesis can be rejected at a certain significance level.</Paragraph>
</Section>
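The following Python sketch illustrates the kind of likelihood ratio test referred to in Section 2.2 (Dunning 1993). It is not the authors' code; the function names, the corpus counts, and the 0.001-level threshold are illustrative assumptions.

from math import log

def _log_l(k, n, p):
    """Log-likelihood of k occurrences in n trials under a binomial with rate p."""
    p = min(max(p, 1e-12), 1 - 1e-12)  # guard against log(0)
    return k * log(p) + (n - k) * log(1 - p)

def llr(k_bio, n_bio, k_non, n_non):
    """Likelihood ratio statistic for 'the term occurs independently of text type'."""
    p_bio = k_bio / n_bio                      # term rate in biographical texts
    p_non = k_non / n_non                      # term rate in non-biographical texts
    p_all = (k_bio + k_non) / (n_bio + n_non)  # pooled rate under independence
    return 2 * (_log_l(k_bio, n_bio, p_bio) + _log_l(k_non, n_non, p_non)
                - _log_l(k_bio, n_bio, p_all) - _log_l(k_non, n_non, p_all))

# Hypothetical counts: the term appears 120 times in 50,000 tokens of
# biographical text and 40 times in 200,000 tokens of other text.
score = llr(120, 50_000, 40, 200_000)

# The statistic is asymptotically chi-square with 1 degree of freedom, so a
# threshold of 10.83 corresponds to rejecting independence at the 0.001 level.
is_biographical_term = score > 10.83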
<Section position="3" start_page="25" end_page="25" type="sub_section">
<SectionTitle> 2.3 Reference Detection between Entities </SectionTitle>
<Paragraph position="0"> We assume that the syntactic relationships between entities (proper or common nouns) in a text give us information on their semantic reference status. In our simple experiment, we consider reference relationships found within a single sentence, and more specifically we take into account relationships between two noun phrase entities. The analysis requires that the sentences be syntactically analyzed (parsed). The following syntactic relationships are detected in the parse tree of each sentence: 1) Subject-object: An object refers to the subject (e.g., in the sentence He eats an apple, an apple refers to He). This relationship type also covers prepositional phrases that are the argument of a verb (e.g., in the sentence He goes to Hollywood, Hollywood refers to He). The relationship holds between the heads of the respective noun phrases when other nouns modify them.</Paragraph>
<Paragraph position="1"> 2) NP-PP{NP}: A noun phrase is modified by a prepositional noun phrase: the head of the prepositional noun phrase refers to the head of the dominant noun phrase (e.g., in the chunk The nominee for presidency, presidency refers to The nominee). 3) NP-NP: A noun phrase modifies another noun phrase: the head of the modifying noun phrase refers to the head of the dominant noun phrase (e.g., in the chunk Dan Quayle's sister, Dan Quayle refers to sister; in the chunk sugar factory, sugar refers to factory).</Paragraph>
<Paragraph position="2"> When a sentence is composed of several subclauses and one of the components of the first two relationships takes the form of a subclause, the first noun phrase of the subclause is considered. When computing a reference relation with an entity term, we only consider the biographical terms obtained as described in Section 2.2.</Paragraph>
</Section>
<Section position="4" start_page="25" end_page="26" type="sub_section">
<SectionTitle> 2.4 Computing the Aboutness Score </SectionTitle>
<Paragraph position="0"> The aboutness of a document text D for the input entity E is computed as follows:</Paragraph>
<Paragraph position="2"> entity_score is zero when E does not occur in D.</Paragraph>
<Paragraph position="3"> Otherwise we compute the entity score as follows.</Paragraph>
<Paragraph position="4"> We represent D as a graph, where nodes represent the entities as mentioned in the text and the weights of the connections represent the reference score (in our experiments set to 1 when the entities are coreferents and 0.5 when the entities are other referents). The values 1 and 0.5 were selected ad hoc.</Paragraph>
<Paragraph position="5"> Future fine-tuning of the weights of the edges of the discourse graph based on discourse features could be explored (cf. Givon 2001). The edge values are stored in a link matrix A. The authority of an entity is computed from the values of the principal eigenvector of A^T A (cf. Kleinberg 1998); in the results below this approach is referred to as LM. In this way we compute the authority of each entity in a text.</Paragraph>
<Paragraph position="6"> We implemented four other entity scores: the term frequency (TF), the term frequency augmented with noun phrase coreference information (TFCOREF), the term frequency augmented with reference information weighted by 0.5 (TFREF), and the term frequency augmented with both coreference and reference information (TFCOREFREF). The purpose is not that the scoring functions be mutually comparable, but that the document ranking produced by each of them can be compared against an ideal ranking built by humans.</Paragraph>
</Section>
</Section>
</Paper>
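As an illustration of the link-matrix (LM) score described in Section 2.4, the Python sketch below stores hypothetical reference weights between entities in a link matrix A (1 for coreference links, 0.5 for other reference links, as in the text) and reads each entity's authority off the principal eigenvector of A^T A via power iteration (cf. Kleinberg 1998). The entity names, the edge list, and the link directions are illustrative assumptions, not the authors' implementation.

import numpy as np

entities = ["E1", "E2", "E3"]
index = {name: i for i, name in enumerate(entities)}

# (source, target, weight): 1.0 = coreferent mention, 0.5 = other reference
edges = [("E2", "E1", 1.0), ("E3", "E1", 0.5)]

A = np.zeros((len(entities), len(entities)))
for src, tgt, weight in edges:
    A[index[src], index[tgt]] = weight

def authority(A, iterations=50):
    """Principal eigenvector of A^T A via power iteration (HITS-style authority)."""
    M = A.T @ A
    v = np.ones(A.shape[1])
    for _ in range(iterations):
        v = M @ v
        norm = np.linalg.norm(v)
        if norm == 0:  # a graph without links: every entity gets authority zero
            return v
        v = v / norm
    return v

scores = authority(A)
entity_score = scores[index["E1"]]  # authority of the input entity, here assumed to be E1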