<?xml version="1.0" standalone="yes"?> <Paper uid="W98-1214"> <Title>CHOOSING A DISTANCE METRIC FOR AUTOMATIC WORD CATEGORIZATION</Title> <Section position="3" start_page="0" end_page="0" type="relat"> <SectionTitle> 2 Related Work </SectionTitle> <Paragraph position="0"> Unigram and bigram statistics are usually used for automatic word categorization. Bigram statistics have been used to determine the weight matrix of a neural network (Finch, 1992). Bigrams have also been used with a greedy algorithm to form hierarchical clusters of words (Brown, 1992).</Paragraph> <Paragraph position="1"> Genetic algorithms have also been successfully used for the categorization process (Lankhorst, 1994). Lankhorst uses genetic algorithms to determine the members of predetermined classes. The drawback of his work is that the number of classes is fixed before run time, and the genetic algorithm only searches for the membership of those classes.</Paragraph> <Paragraph position="2"> Emin Erkan Korkmaz and Göktürk Üçoluk (1998) Choosing A Distance Metric for Automatic Word Categorization. In D.M.W. Powers (ed.) NeMLaP3/CoNLL98: New Methods in Language Processing and Computational Natural Language Learning, ACL, pp 111-120.</Paragraph> <Paragraph position="3"> McMahon and Smith also use the bigram statistics of a corpus to find hierarchical clusters (McMahon, 1996). However, instead of using a greedy algorithm, they use a top-down approach to form the clusters. First, the system divides the initial set containing all the words to be clustered into two parts; the process then continues iteratively on the resulting clusters.</Paragraph> <Paragraph position="4"> Statistical NLP methods have also been used together with other NLP methods. 
Wilms (Wilms, 1995) uses corpus-based techniques together with knowledge-based techniques in order to induce a lexical sublanguage grammar. Machine translation is another area where knowledge bases and statistics are integrated. Knight et al. (Knight, 1994) aim to scale up grammar-based, knowledge-based MT techniques by means of statistical methods.</Paragraph> </Section> </Paper>