<?xml version="1.0" standalone="yes"?> <Paper uid="P05-2018"> <Title>Centrality Measures in Text Mining: Prediction of Noun Phrases that Appear in Abstracts</Title> <Section position="2" start_page="0" end_page="103" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Research on text summarization, information retrieval, and information extraction often faces the question of how to determine which words in a text are more significant than others. Normally we consider only content words, i.e., open-class words. Non-content words, or stop words (called function words in natural language processing), convey little semantic content and are therefore excluded, even though they often appear more frequently than content words. A content word is usually treated as a term, although a term can also be a phrase. Its significance is often indicated by Term Frequency (TF) and Inverse Document Frequency (IDF). The use of TF stems from &quot;the simple notion that terms which occur frequently in a document may reflect its meaning more strongly than terms that occur less frequently&quot; (Jurafsky and Martin, 2000). In contrast, IDF assigns smaller weights to terms that are contained in more documents. That is simply because &quot;the more documents having the term, the less useful the term is in discriminating those documents having it from those not having it&quot; (Yu and Meng, 1998).</Paragraph> <Paragraph position="1"> TF and IDF are also used in automatic text summarization. In this setting, TF is more often used alone than together with IDF, since terms are not being used to distinguish one document from another. Automatic text summarization seeks to produce a text that is much shorter than the document(s) being summarized and that can serve as a surrogate for the full text. 
Thus, for extractive summaries, i.e., summaries composed of sentences taken verbatim from the text being summarized, we try to find the terms that are most likely to be included in the summary.</Paragraph> <Paragraph position="2"> The overall goal of our research is to build a machine learning framework for automatic text summarization. This framework will learn the relationship between text documents and their corresponding human-written abstracts. At the current stage, the framework learns a sentence ranking function and uses it to produce extractive summaries. It is important to find a set of features that captures most of the information in a sentence, so that the machine learning mechanism can use them to produce a ranking function. The next stage in our research will be to use the framework to generate abstractive summaries, i.e., summaries that do not use sentences from the input text verbatim. Therefore, it is important to know which terms should be included in the summary.</Paragraph> <Paragraph position="3"> In this paper we present an approach that uses social network analysis techniques to find terms, specifically noun phrases (NPs) in our experiments, that occur in the human-written abstracts. We show that centrality measures increase the prediction accuracy. Two ways of constructing the noun phrase network are compared. Conclusions and future work are discussed at the end.</Paragraph> </Section></Paper>
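As an aside, the TF-IDF weighting described in the introduction can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the toy tokenized documents are invented, and the raw-count TF with the standard log(N / df) IDF is one common formulation among several.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for each term in each tokenized document.

    TF is the raw count of a term in a document; IDF down-weights terms
    that occur in many documents, using idf(t) = log(N / df(t)).
    """
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

# Hypothetical toy corpus for illustration only.
docs = [
    ["centrality", "measures", "in", "text", "mining"],
    ["text", "summarization", "uses", "centrality"],
    ["noun", "phrases", "in", "abstracts"],
]
w = tf_idf(docs)
# "in" appears in 2 of the 3 documents, so its IDF (and hence its
# weight) is lower than that of "mining", which appears in only one.
```

A term that occurs in every document gets IDF log(1) = 0, which is exactly the discriminative-power intuition quoted from Yu and Meng (1998).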