<?xml version="1.0" standalone="yes"?> <Paper uid="W98-1214"> <Title>CHOOSING A DISTANCE METRIC FOR AUTOMATIC WORD CATEGORIZATION</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Statistical natural language processing is a challenging area in the field of computational natural language learning. Researchers in this field approach language acquisition by viewing learning as the development of a generative, stochastic model of language and the application of that model in practice (Marcken, 1996).</Paragraph> <Paragraph position="1"> Automatic word categorization is an important application area of statistical natural language processing in which the process is unsupervised and works on n-gram statistics to determine the categories of words. Research in this area indicates that it is possible to determine the structure of a natural language by examining the regularities in the statistics of the language (Finch, 1993). It is possible to construct a bottom-up unsupervised algorithm for the categorization process. In our paper &quot;A Method for Improving Automatic Word Categorization&quot; (Korkmaz &amp; Üçoluk, 1997), such a method, using a modified greedy-type algorithm supported by notions from fuzzy logic, was proposed. The distance metric used to measure the similarity of linguistic elements in that research is the Manhattan metric. This metric is based on the absolute difference between the corresponding values of vector components. In our case, the components of the vectors correspond to the bigram statistics of words. However, words from the same linguistic category in natural language may have totally different frequencies. 
A distance metric based only on absolute differences may therefore be poorly suited to the linguistic categorization process.</Paragraph> <Paragraph position="2"> In this paper, various distance metrics are analyzed with the same algorithm in order to find the one most suitable for linguistic elements. The results obtained with the different metrics are compared.</Paragraph> <Paragraph position="3"> The organization of this paper is as follows. First, related work in the area of word categorization is presented in section 2. Then a general description of the categorization process and our proposed algorithm is given in section 3, which is followed by a presentation of the different distance metrics that can be used with the algorithm. In section 5 the results of the experiments and the comparisons between the metrics are given. We discuss the relevance of the results and conclude in the last section.</Paragraph> </Section> </Paper>
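<!--
Illustrative note: the introduction describes the Manhattan metric as the sum of
absolute differences between corresponding vector components, and observes that
words of the same category can differ widely in frequency. The sketch below is
one minimal way to see that effect. The function name manhattan_distance, the
two context words, and all counts are assumptions invented for illustration;
they are not taken from the paper or its experiments.

```python
def manhattan_distance(u, v):
    """Sum of absolute differences between corresponding components."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(u, v))


# Hypothetical bigram counts: how often each word is followed by
# ("is", "the"). The numbers are invented, not measured from a corpus.
frequent_noun = [40, 2]
rare_noun = [4, 0]
rare_verb = [0, 5]

# Two nouns with similar usage patterns but very different frequencies.
same_category = manhattan_distance(frequent_noun, rare_noun)

# A rare noun and a rare verb with different usage patterns.
cross_category = manhattan_distance(rare_noun, rare_verb)

print(same_category, cross_category)
```

With these invented counts the two nouns end up farther apart than the rare noun
and the rare verb, which is the kind of frequency effect the introduction
warns about when only absolute differences are used.
-->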