<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1805"> <Title>A Language Model Approach to Keyphrase Extraction</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Phraseness and informativeness </SectionTitle> <Paragraph position="0"> The word keyphrase implies two features: phraseness and informativeness.</Paragraph> <Paragraph position="1"> Phraseness is a somewhat abstract notion which describes the degree to which a given word sequence is considered to be a phrase. In general, phraseness is defined by the user, who has his own criteria for the target application. For instance, one user might want only noun phrases while another user might be interested only in phrases describing a certain set of products. Although there is no single definition of the term phrase, in this paper, we focus on collocation or cohesion of consecutive words.</Paragraph> <Paragraph position="2"> Informativeness refers to how well a phrase captures or illustrates the key ideas in a set of documents. Because informativeness is defined with respect to background information and new knowledge, users will have different perceptions of informativeness. In our calculations, we make use of the relationship between foreground and background corpora to formalize the notion of informativeness.</Paragraph> <Paragraph position="3"> The target document set from which representative keyphrases are extracted is called the foreground corpus. The document set to which this target set is compared is called the background corpus. 
For example, comparing a foreground corpus of the current week's news against a background corpus of an entire news article archive shows that certain phrases, like &quot;press conference,&quot; are typical of news stories in general and do not capture the particulars of current events in the way that &quot;national museum of antiquities&quot; does.</Paragraph> <Paragraph position="4"> Other examples of foreground and background corpora include: a web site for a certain company and web data in general; a newsgroup and the whole Usenet archive; and research papers of a certain conference and research papers in general.</Paragraph> <Paragraph position="5"> To obtain a ranked keyphrase list, we need to combine phraseness and informativeness into a single score. A sequence of words can be a good phrase but not an informative one, like the expression &quot;in spite of.&quot; Conversely, a word sequence can be informative for a particular domain but not a phrase; &quot;toyota, honda, ford&quot; is an example of a non-phrase sequence of informative words in a hybrid car domain.</Paragraph> <Paragraph position="6"> The algorithm we propose for keyphrase finding requires that a keyphrase score well on both phraseness and informativeness.</Paragraph> </Section> </Paper>