File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/03/p03-2022_intro.xml
Size: 3,058 bytes
Last Modified: 2025-10-06 14:01:49
<?xml version="1.0" standalone="yes"?> <Paper uid="P03-2022"> <Title>An Evaluation Method of Words Tendency using Decision Tree</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1. INTRODUCTION </SectionTitle> <Paragraph position="0"> Large electronic text collections have recently become common, and computers are widely used to process and analyze them.</Paragraph> <Paragraph position="1"> Determining important keywords is crucial to successful modern Information Retrieval (IR). The frequency of some words in texts changes over time (time-series variation), and such words are commonly associated with a particular period (e.g.</Paragraph> <Paragraph position="2"> &quot;influenza&quot; is more common in winter). According to Hisano (2000), the frequency of some Chinese characters (Kanji) appearing in newspaper reports changes over time. Ohkubo et al. (1998) proposed a method to estimate the information that users might need by analyzing login data on a WWW search engine. Their method confirmed that the word groups associated with search words change according to the time at which the search is performed. Some words have a frequency of use that changes over time, and those words often attract the attention of users in a particular period. Such words are often directly connected with the main subject of the text, and can be considered keywords that express important characteristics of the text.</Paragraph> <Paragraph position="3"> Traditional text processing methods (Fukumoto, Suzuki & Fukumoto, 1996; Hara, Nakajima & Kitani, 1997; Haruo, 1991; Sagara & Watanabe, 1998) and text search techniques (Liman, 1996; Swerts & Ostendorf, 1995) do not consider changes in word frequency over time. Therefore, such methods cannot correctly determine the importance of words in a given period (e.g. one year). 
If the change of word frequencies with time-series variation is considered, the importance of words can be determined more accurately, especially when searching for similar texts.</Paragraph> <Paragraph position="4"> This paper presents a new method for automatically estimating stability classes, which index the popularity of words over time, based on frequency changes in past text data. To estimate quantitatively the frequency change in the time-series variation of words in each class, the method defines four attributes (proper noun attribute, slope of the regression line, intercept of the regression line, and correlation coefficient) that are extracted automatically from past text data. The extracted data are classified manually (by humans) into three stability classes. The Decision Tree (DT) learning algorithm C4.5 (Quinlan, 1993; Weiss & Kulikowski, 1991; Honda, Mochizuki, Ho & Okumura, 1997; Passonneau & Litman, 1997; Okumura, Haraguchi & Mochizuki, 1999) uses these data as training data.</Paragraph> <Paragraph position="5"> Finally, the DT automatically determines the stability classes of the input analysis data (test data).</Paragraph> </Section> </Paper>