<?xml version="1.0" standalone="yes"?> <Paper uid="P05-2018"> <Title>Centrality Measures in Text Mining: Prediction of Noun Phrases that Appear in Abstracts</Title>
<Section position="3" start_page="103" end_page="103" type="metho"> <SectionTitle> 2 Centrality Measures </SectionTitle>
<Paragraph position="0"> Social network analysis studies linkages among social entities and the implications of these linkages. The social entities are called actors. A social network is composed of a set of actors and the relation or relations defined on them (Wasserman and Faust, 1994). Graph theory has been used in social network analysis to identify the actors who exert the most influence on a social network. A social network can be represented by a graph, with the actors denoted by nodes and the relations by edges or links. To determine which actors are prominent, a measure called centrality is introduced. In practice, four types of centrality are often used.</Paragraph>
<Paragraph position="1"> Degree centrality measures how many direct connections a node has to other nodes in a network. Since this measure depends on the size of the network, a standardized version is used when it is necessary to compare centrality across networks of different sizes.</Paragraph>
<Paragraph position="2"> $C_D(i) = \frac{d(i)}{u - 1}$, where $d(i)$ is the degree of node $i$ in a network and $u$ is the number of nodes in that network.</Paragraph>
<Paragraph position="3"> Closeness centrality focuses on the distances from an actor to all other nodes in the network.</Paragraph>
<Paragraph position="4"> $C_C(i) = \left[ \sum_{j=1}^{u} d(i, j) \right]^{-1}$, where $d(i, j)$ is the shortest distance between nodes $i$ and $j$.</Paragraph>
<Paragraph position="5"> Betweenness centrality emphasizes that for an actor to be central, it must reside on many geodesics of other nodes so that it can control the interactions between them.</Paragraph>
<Paragraph position="6"> $C_B(i) = \sum_{j < k} \frac{g_{jk}(i)}{g_{jk}}$, where $g_{jk}$ is the number of geodesics linking nodes $j$ and $k$, and $g_{jk}(i)$ is the number of those geodesics that contain node $i$.</Paragraph>
<Paragraph position="7"> Betweenness centrality is widely used because of its generality. This measure assumes that information flow between two nodes will be along the geodesics between them. Nevertheless, &quot;It is quite possible that information will take a more circuitous route either by random communication or [by being] channeled through many intermediaries in order to 'hide' or 'shield' information&quot; (Stephenson and Zelen, 1989).</Paragraph>
<Paragraph position="8"> Stephenson and Zelen (1989) developed information centrality, which generalizes betweenness centrality. It focuses on the information contained in all paths originating from a specific actor. The calculation of the information centrality of a node is given in the Appendix.</Paragraph>
<Paragraph position="9"> Recently, centrality measures have started to gain attention from researchers in text processing. Corman et al. (2002) use vectors consisting of NPs to represent texts and thereby analyze the mutual relevance of two texts. The values of the elements in a vector are determined by the betweenness centrality of the NPs in the text being analyzed. Erkan and Radev (2004) use the PageRank method, an application of the centrality concept to the Web, to determine central sentences in a cluster for summarization. Vanderwende et al. (2004) also use the PageRank method to pick prominent triples, i.e.
(node i, relation, node j), and then use the triples to generate event-centric summaries.</Paragraph> </Section>
<Section position="4" start_page="103" end_page="104" type="metho"> <SectionTitle> 3 NP Networks </SectionTitle>
<Paragraph position="0"> To construct a network for the NPs in a text, we try two ways of modeling the relation between them.</Paragraph>
<Paragraph position="1"> One is at the sentence level: if two noun phrases are sequentially parsed out from the same sentence, a link is added between them. The other is at the document level: we simply add a link between every pair of noun phrases that are parsed out in succession. The difference between the two is that the network constructed at the sentence level ignores certain connections between sentences.</Paragraph>
<Paragraph position="2"> We process a text document in four steps.</Paragraph>
<Paragraph position="3"> First, the text is tokenized and stored in an internal representation with structural information. Second, the tokenized text is tagged by a POS tagger based on the Brill tagging algorithm. (The POS tagger we used can be obtained from http://web.media.mit.edu/~hugo/montytagger/.)</Paragraph>
<Paragraph position="4"> Third, the NPs in the document are parsed according to 35 parsing rules, as shown in Figure 1. If a new noun phrase is found, a new node is formed and added to the network. If the noun phrase already exists in the network, the node containing it is identified. A link is added between two nodes if they are parsed out sequentially, for the network formed at the document level, or sequentially in the same sentence, for the network formed at the sentence level.</Paragraph>
<Paragraph position="5"> Finally, after the text document has been processed, the centrality of each node in the network is updated.</Paragraph> </Section>
<Section position="5" start_page="104" end_page="106" type="metho"> <SectionTitle> 4 Predicting NPs Occurring in Abstracts </SectionTitle>
<Paragraph position="0"> In this paper, we refer to the NPs that occur in both a text document and its corresponding abstract as Co-occurring NPs (CNPs).</Paragraph>
<Section position="1" start_page="104" end_page="104" type="sub_section"> <SectionTitle> 4.1 CMP-LG Corpus </SectionTitle>
<Paragraph position="0"> In our experiment, a corpus of 183 documents was used. The documents are from the Computation and Language collection and have been marked up in XML with tags providing basic information about each document, such as title, author, abstract, body, sections, etc. This corpus is a part of the TIPSTER Text Summarization Evaluation Conference (SUMMAC) effort, acting as a general resource for the information retrieval, extraction, and summarization communities. We excluded five documents from this corpus that do not have abstracts.</Paragraph> </Section>
<Section position="2" start_page="104" end_page="105" type="sub_section"> <SectionTitle> 4.2 Using Noun Phrase Centrality Heuristics </SectionTitle>
<Paragraph position="0"> We assume that a noun phrase with high centrality is more likely to be a central topic of a document than one with low centrality. Given this assumption, we performed an experiment in which the NPs with the highest centralities are retrieved and compared with the actual NPs in the abstracts. A sketch of this procedure is given below.</Paragraph>
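The following is a minimal sketch of the network construction and ranking described above, assuming Python with the networkx library; it is illustrative, not the authors' implementation. The extract_nps function is a hypothetical stand-in for the real pipeline (Brill-style POS tagging plus the 35 NP parsing rules of Figure 1).

import networkx as nx

def extract_nps(sentence):
    # Hypothetical stand-in for the paper's tagger and parsing rules:
    # here we simply treat capitalized tokens as NPs for demonstration.
    return [w.strip(".,") for w in sentence.split() if w[:1].isupper()]

def build_np_network(sentences, level="document"):
    # Link NPs that are parsed out in succession. level="sentence"
    # resets at sentence boundaries, so only NPs within the same
    # sentence are linked; level="document" also links across sentences.
    g = nx.Graph()
    prev = None
    for sent in sentences:
        if level == "sentence":
            prev = None
        for np_ in extract_nps(sent):
            g.add_node(np_)
            if prev is not None and prev != np_:
                g.add_edge(prev, np_)
            prev = np_
    return g

def top_nps(g, k, measure="degree"):
    # Rank nodes by a centrality measure and return the k most central.
    scorers = {
        "degree": nx.degree_centrality,
        "closeness": nx.closeness_centrality,
        "betweenness": nx.betweenness_centrality,
    }
    scores = scorers[measure](g)
    return sorted(scores, key=scores.get, reverse=True)[:k]

Note that a sentence-level network can be disconnected, which is why closeness, betweenness, and information centrality are computed within subgraphs in the experiments reported below.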
<Paragraph position="1"> To evaluate this method, we use Precision, which measures the fraction of true CNPs among all predicted CNPs, and Recall, which measures the fraction of correctly predicted CNPs among all CNPs.</Paragraph>
<Paragraph position="2"> After establishing the NP network for a document and ranking the nodes according to their centralities, we must decide how many NPs should be retrieved. This number should not be too big; otherwise Precision will be very low, although Recall will be higher. If the number is very small, Recall will decrease correspondingly. We adopted a compound metric, the F-measure, to balance the selection: $F = \frac{2PR}{P + R}$, where $P$ is Precision and $R$ is Recall. Based on our study of 178 documents in the CMP-LG corpus, we find that the number of CNPs is roughly proportional to the number of NPs in the abstract. We obtain a linear regression model for the data shown in Figure 2 and use this model to calculate the number of nodes we should retrieve from the NP network, given the number of NPs in the abstract known a priori. One could argue that the number of abstract NPs is unknown a priori and thus the proposed method is of limited use. However, the user can provide an estimate based on the desired number of words in the summary; here we adopt the same approach, asking the user to provide a limit on the number of NPs in the summary. In our experiment, we used the actual number of NPs in each author's abstract.</Paragraph>
<Paragraph position="3"> Our experiment results are shown in Figures 3(a) and 3(b). In 3(a), the NP network is formed at the sentence level. In this case, it is possible that the graph is composed of disconnected subgraphs. In such cases, we calculate the closeness centrality (cc), betweenness centrality (bc), and information centrality (ic) within the subgraphs, while the degree centrality (dc) is still computed for the overall graph. In 3(b), the network is constructed at the document level; therefore, every node is guaranteed to be reachable from every other node.</Paragraph>
<Paragraph position="4"> Figure 3(a) shows that the simplest centrality measure, dc, performs best, with Precision, Recall, and F-measure all greater than 0.2, roughly twice the scores of bc and almost ten times those of cc and ic.</Paragraph>
<Paragraph position="5"> In Figure 3(b), however, all four measures are around 0.25 on all three evaluation metrics. This result suggests that when we choose a centrality measure to represent the prominence of an NP in a text, not only the kind of centrality matters but also the way the NP network is formed.</Paragraph>
<Paragraph position="6"> Overall, the centrality heuristic by itself does not achieve impressive scores.</Paragraph>
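The retrieval cutoff and scoring logic above can be summarized in a short sketch (again illustrative; the slope and intercept are placeholders, not the fitted coefficients from Figure 2):

def num_to_retrieve(num_abstract_nps, slope=1.0, intercept=0.0):
    # k is predicted from the roughly linear relation between the number
    # of NPs in the abstract and the number of CNPs; slope and intercept
    # are hypothetical values, not the regression fit from the paper.
    return max(1, round(slope * num_abstract_nps + intercept))

def evaluate(predicted, actual_cnps):
    # Precision: fraction of predictions that are true CNPs.
    # Recall: fraction of all CNPs that were predicted.
    hits = set(predicted) & set(actual_cnps)
    p = len(hits) / len(predicted) if predicted else 0.0
    r = len(hits) / len(actual_cnps) if actual_cnps else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f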
<Paragraph position="7"> We will see in the next section that using decision trees is a much better way to perform the predictions, when centrality is used together with other text features.</Paragraph> </Section>
<Section position="4" start_page="105" end_page="106" type="sub_section"> <SectionTitle> 4.3 Using Decision Trees </SectionTitle>
<Paragraph position="0"> We obtain the following features for all NPs in a document from the CMP-LG corpus: Position: the order in which an NP appears in the text, normalized by the total number of NPs.</Paragraph>
<Paragraph position="1"> Article: three classes are defined for this attribute: INDEfinite (contains a or an), DEFInite (contains the), and NONE (all others).</Paragraph>
<Paragraph position="2"> Degree centrality: obtained from the NP network. Closeness centrality: obtained from the NP network. Betweenness centrality: obtained from the NP network. Information centrality: obtained from the NP network. Head noun POS tag: the head noun is the last word in the NP; its POS tag is used here.</Paragraph>
<Paragraph position="3"> Proper name: whether the NP is a proper name, determined by looking at the POS tags of all words in the NP. Number: whether the NP is just a number.</Paragraph>
<Paragraph position="4"> Frequency: how many times an NP occurs in a text, normalized by the maximum frequency in that text.</Paragraph>
<Paragraph position="5"> In abstract: whether the NP appears in the author-provided abstract. This attribute is the target for the decision trees to classify.</Paragraph>
<Paragraph position="6"> In order to learn which types of centrality measures help to improve the accuracy of the predictions, and to see whether centrality measures are better than term frequency, we experiment with six groups of feature sets and compare their performance. The six groups are: All: including all features above.</Paragraph>
<Paragraph position="7"> DC: including only the degree centrality measure and the other non-centrality features except Frequency.</Paragraph>
<Paragraph position="8"> CC: same as DC, except using closeness centrality instead of degree centrality.</Paragraph>
<Paragraph position="9"> BC: same as DC, except using betweenness centrality instead of degree centrality.</Paragraph>
<Paragraph position="10"> IC: same as DC, except using information centrality instead of degree centrality.</Paragraph>
<Paragraph position="11"> FQ: including Frequency and all other non-centrality features.</Paragraph>
<Paragraph position="12"> The 178 documents generated more than 100,000 training records, of which only a very small portion (2.6%) belongs to the positive class. When a decision tree algorithm is used on such imbalanced data, it is very common for the majority class to be favored (Japkowicz, 2000; Kubat and Matwin, 1997). One way to reduce this unfair preference is to boost the weak class, e.g., by replicating instances of the minority class (Kubat and Matwin, 1997; Chawla et al., 2000). In our experiments, the 178 documents were arbitrarily divided into three roughly equal groups, generating 36,157, 37,600, and 34,691 records, respectively. After class balancing, the record counts increased to 40,109, 42,210, and 38,499. The three data sets were then run through the decision tree algorithm YaDT (Yet another Decision Tree builder), which is much more efficient than C4.5 (Ruggieri, 2004), with 10-fold cross-validation.</Paragraph>
<Paragraph position="13"> The experiment results of using YaDT with the three data sets and six feature groups to predict the CNPs are shown in Table 1.</Paragraph>
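A minimal sketch of this training setup, assuming scikit-learn's DecisionTreeClassifier as a stand-in for YaDT and synthetic records in place of the real NP features; the replication factor is likewise an assumption chosen for illustration.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def balance_by_replication(X, y, factor=4):
    # Boost the minority (positive) class by replicating its records,
    # in the spirit of Kubat and Matwin (1997); factor is an assumption.
    pos = np.flatnonzero(y == 1)
    reps = np.repeat(pos, factor - 1)
    return np.vstack([X, X[reps]]), np.concatenate([y, y[reps]])

# Synthetic data standing in for the NP feature records; about 2.6% of
# the records belong to the positive (in-abstract) class, as in the corpus.
rng = np.random.default_rng(0)
X = rng.random((5000, 10))
y = (rng.random(5000) < 0.026).astype(int)

X_bal, y_bal = balance_by_replication(X, y)
clf = DecisionTreeClassifier()
print(cross_val_score(clf, X_bal, y_bal, cv=10, scoring="f1").mean())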
<Paragraph position="14"> The mean values of the three metrics are also shown in Figures 4(a) and 4(b). Decision trees achieve much higher scores than the centrality heuristics alone. Together with the other text features, DC, CC, BC, and IC obtain scores over 0.7 on all three metrics, which are comparable to the scores obtained with FQ. Moreover, when all the features are used, decision trees achieve over 0.8 in Precision and over 0.95 in Recall; the F-measure is as high as 0.88. To see whether the F-measure of All is statistically better than that of the other settings, we ran t-tests comparing the F-measure values obtained in the 10-fold cross-validation on the three data sets. The results show that the mean F-measure of All is significantly higher (p-value = 0.000) than that of the other settings.</Paragraph>
<Paragraph position="15"> Unlike the experiments that use the centrality heuristic by itself, almost no obvious distinctions can be observed when comparing the performance of YaDT with the NP network formed in the two ways. (The YaDT software can be obtained from http://www.di.unipi.it/~ruggieri/software.html.) A sketch of the significance test follows.</Paragraph> </Section> </Section> </Paper>
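A minimal sketch of the t-test comparison, using scipy as a stand-in for whatever statistics package the authors used; the per-fold F-measure values are illustrative placeholders, not the paper's results.

from scipy import stats

# Illustrative per-fold F-measure values for the All and FQ settings.
f_all = [0.88, 0.87, 0.89, 0.88, 0.86, 0.90, 0.88, 0.87, 0.89, 0.88]
f_fq  = [0.72, 0.71, 0.73, 0.70, 0.74, 0.72, 0.71, 0.73, 0.72, 0.71]

t, p = stats.ttest_ind(f_all, f_fq)
print(f"t = {t:.2f}, p = {p:.4f}")  # a small p favors All over FQ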