File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/03/p03-2022_metho.xml
Size: 12,003 bytes
Last Modified: 2025-10-06 14:08:21
<?xml version="1.0" standalone="yes"?> <Paper uid="P03-2022"> <Title>An Evaluation Method of Words Tendency using Decision Tree</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2. POPULARITY OF WORDS CONSIDERING TIME-SERIES VARIATION </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Stability Classes of the Words: </SectionTitle> <Paragraph position="0"> To judge the popularity of words from the change in their frequency over time, we define three stability classes as follows: (1) Increasing Class: &quot;The class that has an increasing frequency with time-series variation&quot;; (2) Relatively Constant Class: &quot;The class that has a stable frequency with time-series variation&quot;; (3) Decreasing Class: &quot;The class that has a decreasing frequency with time-series variation&quot;.</Paragraph> <Paragraph position="1"> We call these classes stability classes. The words belonging to each class are called increasing-words, relatively constant-words, and decreasing-words, respectively.</Paragraph> <Paragraph position="2"> Table 1 shows a sample of words classified into each stability class according to their frequency change with time-series variation. For example, the names of the baseball players &quot;Sammy-Sosa&quot; and &quot;McGwire&quot; are included in the increasing class because their frequencies increase with time-series variation. The names of the baseball teams &quot;New-York-Mets&quot; and &quot;Texas-Rangers&quot; are included in the relatively constant class because their frequencies are relatively stable with time-series variation. 
The names of the baseball players &quot;Hank-Aaron&quot; and &quot;Nap Lajoie&quot; are included in the decreasing class because their frequencies decrease with time-series variation.</Paragraph> <Paragraph position="3"> The stability class of a word is decided by the change of its frequency with time-series variation. To determine this change, texts are grouped by a given period (one year) and the frequency of each word in each group is counted. To absorb the influence of the differing number of texts in each group and to judge the change over time more correctly, each frequency is normalized by dividing it by the total frequency of the words in its group.</Paragraph> <Paragraph position="4"> Table 1: Sample of Classified Words. In this paper, five attributes are defined to decide the stability classes, and word data that have been divided into classes beforehand are input into the decision tree (DT) learning algorithm C4.5 as learning data. We then use the obtained DT to decide automatically the stability classes of increasing words. The next section describes the attributes used in DT learning to judge the stability classes.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3. ATTRIBUTES USED IN JUDGING THE STABILITY CLASS </SectionTitle> <Paragraph position="0"> To capture the characteristics of the change of a word's frequency quantitatively, the following attributes are defined. 
The value of each attribute defined here is used as input data for the DT described in Section 4.</Paragraph> <Paragraph position="1"> 1) Proper Nouns Attributes (pna) 2) Slope of the regression line (a) 3) Slice (intercept) of the regression line (b) 4) Correlation coefficient (r)</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Proper Nouns Attributes (pna) </SectionTitle> <Paragraph position="0"> In this paper, we selected only three kinds of proper nouns attributes, &quot;Player-name&quot;, &quot;Organization-name&quot;, and &quot;Team-name&quot;, to study the influence of time-series variation and to obtain the characteristics of the increasing and decreasing stability classes. We also used &quot;Ordinary-nouns&quot; (e.g. &quot;ball&quot;, &quot;coach&quot;, &quot;homerun&quot;) for the relatively constant class. Analyzing these entity types makes the characteristics of the stability classes easier to determine and more accurate.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 The Slope and the Slice of Regression Straight Line (a &amp; b): </SectionTitle> <Paragraph position="0"> Regression analysis is a statistical method that approximates the change of sample values with a straight line in two-dimensional rectangular coordinates; this approximating line is called a regression straight line (Gonick &amp; Smith, 1993).</Paragraph> <Paragraph position="1"> In this process we take the standard years $x_i$ as the horizontal axis</Paragraph> <Paragraph position="3"> and the frequencies $y_i$ of the words as the vertical axis. 
The slope a and the slice (intercept) b of the equation y = ax + b can be calculated by the following formulas: $$a = \frac{\sum_{i}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i}(x_i-\bar{x})^{2}}, \qquad b = \bar{y} - a\,\bar{x}$$</Paragraph> <Paragraph position="5"> where $\bar{x}$ and $\bar{y}$ are the average values of $x_i$</Paragraph> <Paragraph position="7"> and $y_i$, respectively.</Paragraph> <Paragraph position="8"> By obtaining the intersection of the regression line with the current time period in rectangular coordinates, it is possible to estimate the current frequency of a word. The slope of the regression line indicates the stability class of the word. In addition, the slice (intercept) of the regression line can be used to estimate the difference in frequency between word groups in the same stability class. For example, among words in the same stability class (relatively constant), those with regression line (1) in Fig. 5 have a higher frequency in every period than those with line (2), and the slice of line (1) is also higher than that of line (2). We can therefore decide that the words of regression line (1) are more important than the words of regression line (2), even though all these words are in the same class.</Paragraph> <Paragraph position="10"/> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4. ESTIMATION </SectionTitle> <Paragraph position="0"> In order to confirm the effectiveness of our method, an experiment is designed to study the effect of learning period lengths and all attributes on the distribution</Paragraph> <Paragraph position="2"> Words group in a Similar Class.</Paragraph> <Paragraph position="3"> By obtaining the intersection of the regression line with the current time period in rectangular coordinates, the slope of the regression line can be used to estimate the stability classes of the words. 
For example, when the stability class is relatively constant, the regression line is close to horizontal and its slope is close to 0. When the stability class is increasing, the slope is positive, and when the stability class is decreasing, the slope is negative.</Paragraph> <Paragraph position="4"> In addition, the slice (intercept) of the regression line can be used to estimate the difference in frequency between word groups in the same stability class. For example, among words in the same stability class (relatively constant), those with regression line (1) in Fig. 1 have a higher frequency in every period than those with line (2), and the slice of line (1) is also higher than that of line (2). We can therefore decide that the words of regression line (1) are more important than the words of regression line (2), even though all these words are in the same class.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3. Correlation Coefficient (r) </SectionTitle> <Paragraph position="0"> The correlation coefficient is used to judge the reliability of the regression line. Although the stability classes of words are estimated from the slope and slice (intercept) of the regression line, some words with the same regression line have different degrees of scattering, depending on how their frequencies are arranged in rectangular coordinates, as shown in Fig. 2.</Paragraph> <Paragraph position="1"> In such a case, assigning these different groups of words to the same stability class raises a reliability problem.</Paragraph> <Paragraph position="2"> So, to judge the reliability of a regression line derived from scattered frequencies, we use a correlation coefficient, which shows the degree of scattering of the word frequencies in rectangular coordinates. 
The correlation coefficient is also a statistical measure (Gonick &amp; Smith, 1993), and it is calculated as follows: $$r = \operatorname{sign}(a)\sqrt{\frac{\sum_{i}(\hat{y}_i-\bar{y})^{2}}{\sum_{i}(y_i-\bar{y})^{2}}}$$ In the above formula, $\hat{y}_i$ are the predicted values determined by the regression line, $\bar{y}$ is the average of the observed frequencies $y_i$, and a is the slope of the regression line.</Paragraph> <Paragraph position="4"> When the absolute value of the correlation coefficient r approaches 1, the frequencies are concentrated around the regression line; when it approaches 0, the frequencies are scattered irregularly around the regression line.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Experimental Data: </SectionTitle> <Paragraph position="0"> The sports section of CNN newspapers (1997-2000) was used as the experimental data collection, because of the uniqueness of the words in this field and their tendency to change with time-series variation. A specific sub-field of sports, &quot;professional baseball&quot;, was chosen because it is reported on steadily and frequently every year, and it is relatively easy to determine how word frequencies are affected by time-series variation. Words identified with four kinds of proper nouns attributes, &quot;Player-name&quot;, &quot;Organization-name&quot;, &quot;Team-name&quot;, and &quot;Ordinary-nouns&quot;, were extracted from the selected reports, and the normalized frequency of the selected words in each year was obtained. Stability classes were then assigned to these words manually (by a human).</Paragraph> <Paragraph position="1"> The data is divided into two groups: the reports of the years 1997-1999 are used as DT learning data, and the reports of the years 1997-2000, which are completely different data from the learning data, are used as test data. For both data sets, the attributes are obtained from the change of word frequency with time-series variation in the respective period. 
The data of extracted words is shown in Table 2.</Paragraph> <Paragraph position="2"> To measure how accurately the DT evaluates words automatically, we measured the Precision (P) and Recall (R) rates as follows:</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Relation Between Learning Period and Classification Precision: </SectionTitle> <Paragraph position="0"> In this section, we show the effect of using the longest period M and the shortest period N of learning data on the distribution of P and R. When the period of learning data is longer (M), the number of words increases and the characteristics of the relatively constant and decreasing stability classes become more obvious, so their classification becomes clearer and, as a result, P and R become higher. However, when the short learning period (N) is used, P and R decrease. The comparison results for the longest and shortest periods are shown in Table 3.</Paragraph> </Section> </Section> </Paper>
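As an illustration of the attributes of Sections 3.2 and 3.3 and of the evaluation measures of Section 4.1, the following minimal Python sketch computes the slope a, the slice (intercept) b, and the correlation coefficient r for one word's yearly frequencies, and a per-class precision and recall. It is a sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, r is taken as the sign of the slope times the square root of explained over total variation, and precision and recall use the standard definitions (correct assignments divided by the number of automatic assignments, and by the number of manual assignments, respectively).

```python
import math
from typing import Sequence, Tuple


def regression_attributes(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float]:
    """Return (a, b, r) for the least-squares line y = a*x + b.

    x holds the period indices (e.g. years) and y the normalized
    frequencies of one word; at least two distinct x values are assumed.
    """
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    syy = sum((yi - y_mean) ** 2 for yi in y)
    a = sxy / sxx            # slope
    b = y_mean - a * x_mean  # slice (intercept)
    if syy == 0:
        # Perfectly flat series: r is undefined; reporting 0.0 is an assumption.
        return a, b, 0.0
    y_hat = [a * xi + b for xi in x]                # values predicted by the line
    ssr = sum((yh - y_mean) ** 2 for yh in y_hat)   # explained variation
    r = math.copysign(math.sqrt(ssr / syy), a)      # r carries the sign of the slope
    return a, b, r


def precision_recall(predicted: Sequence[str], gold: Sequence[str], cls: str) -> Tuple[float, float]:
    """Precision and recall of one stability class (standard definitions, assumed)."""
    correct = sum(1 for p, g in zip(predicted, gold) if p == cls and g == cls)
    n_pred = sum(1 for p in predicted if p == cls)  # assigned to cls automatically
    n_gold = sum(1 for g in gold if g == cls)       # assigned to cls manually
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical normalized frequencies of one word over four yearly periods.
    a, b, r = regression_attributes([1, 2, 3, 4], [0.10, 0.14, 0.19, 0.22])
    print("slope:", a, "slice:", b, "r:", r)
    # Hypothetical automatic and manual class labels for four words.
    auto = ["increasing", "increasing", "decreasing", "constant"]
    manual = ["increasing", "constant", "decreasing", "constant"]
    print(precision_recall(auto, manual, "increasing"))
```

A positive slope with |r| close to 1 would correspond to a reliably increasing word, and a slope near 0 to a relatively constant one; the thresholds that separate the classes are left to the DT and are not fixed here.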