<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-1006">
  <Title>A METHOD FOR IMPROVING AUTOMATIC WORD CATEGORIZATION</Title>
  <Section position="6" start_page="44" end_page="44" type="evalu">
    <SectionTitle>
4 Results
</SectionTitle>
    <Paragraph position="0"> The algorithm is tested on a corpus built from on-line novels collected from the web page of &quot;Book Stacks Unlimited, Inc.&quot; The corpus consists of twelve free on-line novels totalling about 1,700,000 words. The corpus is passed through a filtering process in which special words and useless characters and words are filtered out and the frequencies of the remaining words are collected. The thousand most frequent words are then selected and sent to the clustering process described in the previous sections. These thousand words account for 70.4% of the whole corpus. The percentage rises to about 77% if the next thousand most frequent words are added to the lexicon space. The ten most frequent words in the corpus and their frequencies are presented in Table 1.</Paragraph>
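The coverage figures above (70.4% for the first thousand words, about 77% for two thousand) follow from a simple frequency count. The following is a minimal Python sketch of that computation, not the authors' implementation: the function names `word_frequencies` and `top_k_coverage`, the lower-casing step, and the tokenization rule are all assumptions made for illustration, since the paper does not specify its filter in detail.

```python
from collections import Counter
import re

def word_frequencies(text):
    """Count tokens after lower-casing; the token pattern is an assumed filter."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

def top_k_coverage(freqs, k):
    """Return the k most frequent words and the share of all tokens they cover."""
    total = sum(freqs.values())
    top = freqs.most_common(k)
    covered = sum(count for _, count in top)
    return [word for word, _ in top], (covered / total if total else 0.0)

# Toy input standing in for the novel corpus (hypothetical data).
sample = "the cat and the dog and the bird"
lexicon, coverage = top_k_coverage(word_frequencies(sample), 2)
print(lexicon, round(coverage, 3))
```

On the real corpus, calling `top_k_coverage` with k = 1000 and k = 2000 would give the two percentages reported in the text, provided the same filtering is applied.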
    <Paragraph position="1"> The clustering process builds a tree with words on the leaves and clusters on the inner nodes. The root node denotes the largest class, which contains the whole lexicon space. The number of leaves, that is, the number of clusters formed in the initial step, is 60. The depth of the tree is 8.</Paragraph>
    <Paragraph position="2"> Leaves first appear at the 5th level and are mainly located at the 5th and 6th levels. The number of nodes connecting the initial clusters is 18, so on average about three clusters are combined in the second step. Table 2 displays two results from the clustering tree. The first collects a set of nouns from the lexicon space. The second, however, is somewhat ill-structured: two preposition clusters, two adjective clusters and a verb cluster are combined into one.</Paragraph>
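The tree statistics quoted above (number of leaves, depth, leaf levels, and the average of about three clusters per connecting node, i.e. 60/18) can be read off any cluster tree. The sketch below is illustrative only and rests on several assumptions: the tree is encoded as nested Python lists with strings as leaves, the root is counted as level 1, and a "connecting node" is taken to be an inner node with at least one leaf as a direct child. The toy tree and its labels are hypothetical and are not taken from the paper.

```python
def leaf_depths(node, depth=1):
    """Yield the level of every leaf; the root is assumed to be level 1."""
    if isinstance(node, str):
        yield depth
    else:
        for child in node:
            yield from leaf_depths(child, depth + 1)

def connecting_nodes(node):
    """Count inner nodes that have at least one leaf as a direct child."""
    if isinstance(node, str):
        return 0
    own = 1 if any(isinstance(child, str) for child in node) else 0
    return own + sum(connecting_nodes(child) for child in node)

# Hypothetical toy tree: strings are initial clusters, lists are inner nodes.
toy_tree = [["c1", "c2", "c3"], [["c4", "c5"], "c6"]]
depths = list(leaf_depths(toy_tree))
print(len(depths), max(depths), connecting_nodes(toy_tree))

# The average reported in the text: 60 initial clusters over 18 connecting nodes.
print(round(60 / 18, 2))
```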
    <Paragraph position="3"> Some linguistic categories inferred by the algorithm are listed below:
* prepositions(1): by with in to and of
* prepositions(2): from on at for
* prepositions(3): must might will should could would may
* determiners(1): your its our these some this my her all any no
* prepositions(4): between among against through under upon over about
* adjectives(1): large young small good long
* nouns(1): spirit body son head power age character death sense part case state
* verbs(1): exclaimed answered cried says knew felt said or is was saw did asked gave took made thought either told whether replied because though how repeated open remained lived died lay does why
* verbs(2): shouted wrote showed spoke makes dropped struck laid kept held raised led carried sent brought rose drove threw drew shook talked yourself listened wished meant ought seem seems seemed tried wanted began used continued returned appeared comes knows liked loved
* adjectives(2): sad wonderful special fresh serious particular painful terrible pleasant happy easy hard sweet  getting hearing knowing finding drawing leaving giving taking making having being seeing doing
* nouns(3): streets village window evening morning night middle rest end road sun garden table room ground door church world name people city year day time house country way place fact river next earth
The ill-placed members in the clusters are shown above using bold font. The clusters represent the linguistic categories with a high success rate (≈ 91%). Some semantic relations in the clusters can also be observed. Group nouns(2) is a good example of such a semantic relation between the words in a cluster.</Paragraph>
  </Section>
</Paper>