File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/evalu/03/w03-1713_evalu.xml
Size: 3,189 bytes
Last Modified: 2025-10-06 13:59:05
<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1713">
<Title>News-Oriented Automatic Chinese Keyword Indexing</Title>
<Section position="6" start_page="2" end_page="2" type="evalu">
<SectionTitle> 4 Experimental Results and Analysis </SectionTitle>
<Paragraph position="0"> We selected 37 news articles from China Daily as test material; experts had manually extracted keywords from each of them. The set comprises 23 articles on national politics, 10 on international politics, and 4 on sports. We automatically extracted keywords from these articles and evaluated the results with the standard measures of precision and recall, which are defined as follows:
$$P = \frac{\text{number of genuine keywords recognized}}{\text{number of automatically indexed keywords}}, \qquad R = \frac{\text{number of genuine keywords recognized}}{\text{number of manually indexed keywords}}$$
where P represents precision and R represents recall. In general, the two measures trade off against each other within a single system: when precision is higher, recall tends to be lower, and conversely, when recall is improved, precision tends to decrease. Table 2 presents our experimental results. The first three rows give the measures for each category of article, with the number of articles in parentheses. The fourth row gives the average for our system. For comparison, the last row lists the results of Chien's [1997] PAT-tree-based method as reported in his experiments. The table shows that Chien's system places more emphasis on precision.</Paragraph>
<Paragraph position="1"> We, however, prefer to improve recall, provided that precision and recall remain reasonably balanced.</Paragraph>
<Paragraph position="2"> Lower precision may mean that more noise is introduced into the set of candidate keywords. However, the segmentation and POS tagging tools we adopted can verify whether a candidate character string is a meaningful unit, and we found that the noise that remains is more or less relevant to the content of the article, so precision is less of a concern. We therefore aim to generate more keywords automatically, as long as the number of noise words stays acceptable. It should be pointed out that keyword extraction from text has not yet produced satisfactory results [Chien, 1997]. Some extracted keywords have the same meaning as manually extracted ones but are counted as different because one or two characters do not match. According to our analysis of the experimental results, although only 46% of the extracted keywords appear in the set of manual keywords, the rest are also relevant to the text and suit the needs of information retrieval. At the same time, about 52% of the manual keywords are generated by the automatic indexing method; for most of the remaining ones, a substitute can often be found among the automatically generated keywords.</Paragraph>
<Paragraph position="4"> Most of the missed keywords occur only once in the text, and they are mostly proper nouns: names of places and organizations, or titles of persons. This shows that we need to further improve our techniques for recognizing proper nouns.</Paragraph>
</Section>
</Paper>
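As a minimal sketch of the precision and recall measures used in Section 4, the following Python computes both values for a single article. The function name `precision_recall` and the sample keyword lists are hypothetical illustrations, not taken from the paper. The paper does not state how a "genuine keyword recognized" is matched, so exact string matching is assumed here.

```python
from typing import Iterable, Tuple


def precision_recall(extracted: Iterable[str], manual: Iterable[str]) -> Tuple[float, float]:
    """Return (P, R) for one article, assuming exact string matching."""
    extracted_set = set(extracted)
    manual_set = set(manual)
    # "Genuine keywords recognized": extracted keywords that also occur in the manual set.
    genuine = extracted_set.intersection(manual_set)
    # Guard against empty sets so the ratios are always defined.
    precision = len(genuine) / len(extracted_set) if extracted_set else 0.0
    recall = len(genuine) / len(manual_set) if manual_set else 0.0
    return precision, recall


# Hypothetical keyword lists for a single article (illustrative only).
extracted_keywords = ["economy", "trade talks", "Beijing", "tariff", "delegation"]
manual_keywords = ["economy", "trade talks", "WTO", "tariff"]

p, r = precision_recall(extracted_keywords, manual_keywords)
print(f"P = {p:.2f}, R = {r:.2f}")  # prints "P = 0.60, R = 0.75"
```

Under exact matching, a candidate that differs from a manual keyword by one or two characters counts as a miss. That is the effect the section describes when it notes that keywords with the same meaning are often scored as different.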