<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1713">
  <Title>News-Oriented Automatic Chinese Keyword Indexing</Title>
  <Section position="3" start_page="2" end_page="2" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> With more and more information flowing into our lives, it is important to help people find the most relevant information in as little time as possible. Keywords are a good solution: they give a brief summary of a document's content.</Paragraph>
    <Paragraph position="1"> With keywords, people can quickly find the documents they are most interested in and read those carefully.</Paragraph>
    <Paragraph position="2"> That saves a lot of time. In addition, keywords are also useful for research in information retrieval, text clustering, and topic search [Frank 1999]. Manually indexing keywords is costly. Thus, automatically indexing keywords from text is of great interest.</Paragraph>
    <Paragraph position="3"> News is the domain that people pay the most attention to. Unfortunately, only a small fraction of documents in this field have keywords. However, compared to unrestricted text, news articles are relatively easy to extract keywords from, because they have the following characteristics. First, a news document is usually short, and, as a rule, only important words or phrases repeat. Second, the purpose of a news article is to describe an event or a thing for readers, so such articles usually place more emphasis on named entities such as persons, places, organizations, and so on.</Paragraph>
    <Paragraph position="4"> Last, important content often appears first in the title or in the front part of the text, especially the first paragraph or the first sentence of every paragraph. These characteristics will help us in keyword indexing.</Paragraph>
    <Paragraph position="5"> Several methods have been proposed for extracting English keywords from text. For example, Witten [1999] adopted Naive Bayes techniques, and Turney [1999] combined decision trees and genetic algorithms in his system. These systems achieved satisfying results. However, they need a large number of training documents with keywords, which is exactly what we lack. For the Chinese language, some researchers adopt the PAT tree structure and make use of mutual information to obtain keywords [Chien 1997, Yang 2002]. Unfortunately, constructing a PAT tree costs a lot of space and time. In this paper, taking into account the characteristics of news articles and the currently available resources and techniques, we introduce a simple procedure to index keywords from text. Section 2 describes the architecture of the whole system. In Section 3, we introduce every module in detail, including how to obtain candidate keywords, how to filter out meaningless items, and how to score keyword candidates according to their feature values. In Section 4, experimental results are given and analyzed. Finally, we end with the conclusion.</Paragraph>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
System Overview
</SectionTitle>
      <Paragraph position="0"> Keyword indexing can also be called keyword extraction. In our conception, a keyword is not restricted to a single word. Here, a keyword can be any Chinese character string, which may consist of more than one Chinese word. These character strings summarize the content of the document they appear in.</Paragraph>
      <Paragraph position="1"> For the task of keyword indexing, our system is composed of three modules.</Paragraph>
      <Paragraph position="2"> As shown in figure 1, the first module recognizes Chinese character strings according to their frequency and picks out the named entities in the text as candidate keywords. The second module is a filter that removes all meaningless character strings from the candidate set. The third module is a selector, which evaluates every candidate according to its feature values and chooses from the candidate set those keywords with higher scores. The higher the score of a character string, the more of the article's content it covers.</Paragraph>
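The three-module architecture above can be sketched as a simple pipeline. This is a minimal illustration of the control flow only; the three callables are stand-ins for the recognizer, filter, and selector modules, and the `top_k` cutoff is our assumption, not a detail from the paper:

```python
def index_keywords(text, recognize, filter_items, score, top_k=5):
    """Three-module pipeline: recognize candidate strings, filter
    meaningless items, then score candidates and select the top ones.
    `recognize`, `filter_items`, and `score` are placeholder callables."""
    candidates = recognize(text)
    candidates = filter_items(candidates)
    # Higher score means the string covers more of the article's content.
    ranked = sorted(candidates, key=score, reverse=True)
    return ranked[:top_k]

# Toy demonstration with whitespace tokenization and length as the score:
demo = index_keywords(
    "aa b ccc dd",
    recognize=lambda t: t.split(),
    filter_items=lambda c: [w for w in c if len(w) > 1],
    score=len,
)
print(demo)
```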
      <Paragraph position="3"> In our system, there are three kinds of lexicons.</Paragraph>
      <Paragraph position="4"> The lexicon of proper nouns is used to recognize named entities. The general lexicon contains Chinese words in common use and is used for segmentation and POS tagging of the text. The lexicon of content words is used to expand the set of keywords. They are introduced in detail in the following section.</Paragraph>
      <Paragraph position="5"> A document can be seen as a set of character strings, each with its frequency in the document. In general, character strings that occur several times reflect the topic of the document, so we take them as keyword candidates. In addition, named entities, such as person names, place names, organizations, transliterated terms, person titles, and so on, are usually very important for the document regardless of their frequency. They are also picked out from the text by the named-entity recognizer and passed to the filter module together with the other character strings.</Paragraph>
      <Paragraph position="6"> Unlike English, Chinese sentences have no explicit word boundaries, which makes it especially difficult to tell whether a character string consists of one word or more than one. Due to this characteristic, we do not use a dictionary, but obtain character strings solely from frequency statistics. Considering the length of news documents, we set the frequency threshold for Chinese character strings to 2.</Paragraph>
      <Paragraph position="7"> Suppose that a character string c has frequency f(c). Then c is selected as a candidate keyword if f(c) equals or exceeds 2. That is, only a character string that occurs two or more times can be selected as a candidate keyword.</Paragraph>
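The dictionary-free, frequency-only candidate extraction can be sketched as follows. This is a minimal illustration; the enumeration of substrings of length 2 to `max_len` is our assumption (the paper does not state a length bound), and Latin letters stand in for Chinese characters in the demo:

```python
from collections import Counter

def candidate_strings(text, threshold=2, max_len=6):
    """Collect character strings occurring at least `threshold` times.

    Every substring of length 2..max_len is counted; no dictionary
    is consulted, matching the frequency-only strategy described above.
    """
    counts = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return {s: f for s, f in counts.items() if f >= threshold}

# Toy example: "abc" repeats three times, so it survives the threshold.
print(candidate_strings("abcxabcyabc"))
```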
      <Paragraph position="8"> There are two kinds of named entities. The first kind have compositional rules, mainly Chinese person names and transliterated foreign terms. They can be recognized with combined statistical and rule-based methods. Chinese names are composed of a family name and a given name, each 1 or 2 Chinese characters long. Furthermore, there is a relatively stable set of family names, which often provides the anchor for finding a name.</Paragraph>
      <Paragraph position="9"> For foreign terms, there is a relatively stable set of Chinese characters that are generally used as transliteration characters. Due to space limitations, we do not describe the recognition process in detail here. The other kind of named entities consists mainly of proper nouns representing names of places, organizations, person titles, etc. They often occur in news documents but have no compositional rules.</Paragraph>
      <Paragraph position="10"> Thus, we collect such words in our proper-noun lexicon, and the module finds these named entities by looking them up in this lexicon.</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.2 Filter Module
</SectionTitle>
      <Paragraph position="0"> So far, Chinese character strings have been generated only through frequency statistics. Thus, some of them stand out merely because of simple repetition and are probably not meaningful units of language. We need to filter out these meaningless items. As shown in figure 1, the filter module contains four kinds of filters. They work as follows.</Paragraph>
      <Paragraph position="1"> (1) Filter of Overlapped and Dependent Items. For two character strings S1 and S2, with S1 a substring of S2: if the frequency of S1 equals that of S2, then S1 is overlapped by S2. More generally, we can set a threshold t d for f(S1)-f(S2), where the function f(.) gives the frequency of a character string. If f(S1)-f(S2) is less than t d , then the string S1 is dependent on S2.</Paragraph>
      <Paragraph position="2"> Here, overlapped and dependent substrings are removed from the candidate set.</Paragraph>
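The overlapped/dependent filter can be sketched directly from the definition above. A minimal sketch, assuming candidates arrive as a string-to-frequency mapping; the default threshold value is our choice for illustration:

```python
def filter_dependent(cands, t_d=1):
    """Drop overlapped and dependent substrings from {string: frequency}.

    S1 is dependent on S2 when S1 is a substring of S2 and
    f(S1) - f(S2) falls below the threshold t_d; with t_d = 1 this
    reduces to the overlapped case, where the frequencies are equal.
    """
    kept = {}
    for s1, f1 in cands.items():
        dependent = any(
            s1 != s2 and s1 in s2 and t_d > f1 - f2
            for s2, f2 in cands.items()
        )
        if not dependent:
            kept[s1] = f1
    return kept

# "ab" always co-occurs with "abc" (same frequency), so it is dropped;
# "abc" occurs more often than "abcd", so it is kept.
print(filter_dependent({"ab": 3, "abc": 3, "abcd": 1}))
```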
      <Paragraph position="3"> (2) Filter of Items with Punctuation and Function Words. The recognizer module treats all symbols in the text equally, Chinese characters, punctuation, and so on. Thus, when frequency statistics are computed, a candidate character string may contain punctuation marks and function words such as '. ', ' , ', 'Liao ', 'Zhao ', etc. These punctuation marks and function words usually occur at the head or tail of a character string. It is evident that such character strings cannot serve as keywords of an article, and they should be deleted from the candidate set.</Paragraph>
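The head/tail trimming step can be sketched as below. The punctuation and function-word sets are tiny illustrative placeholders (the paper's actual lists are not given), and this sketch only handles single-character function words:

```python
# Illustrative placeholders; the paper's real lists are larger.
PUNCT = set("，。、！？；：")
FUNCTION_WORDS = {"了", "着", "的"}  # single-character function words only

def trim_candidate(s):
    """Strip punctuation and function words from the head and tail of a
    candidate string; return None if nothing meaningful remains."""
    junk = "".join(PUNCT | FUNCTION_WORDS)
    trimmed = s.strip(junk)
    return trimmed if trimmed else None

# Leading comma and trailing aspect particle are trimmed away:
print(trim_candidate("，经济发展了"))
```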
    </Section>
  </Section>
</Paper>