File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/00/c00-2115_abstr.xml
Size: 3,486 bytes
Last Modified: 2025-10-06 13:41:35
<?xml version="1.0" standalone="yes"?> <Paper uid="C00-2115"> <Title>Extracting semantic clusters from the alignment of definitions</Title> <Section position="1" start_page="0" end_page="0" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> Through the alignment of definitions from two or more different sources, it is possible to retrieve pairs of words that can be used interchangeably in the same sentence without changing the meaning of the concept. As lexicographic work exploits common defining schemes, such as genus and differentia, a concept is similarly defined by different dictionaries. The difference in words used between two lexicographic sources lets us extend the lexical knowledge base, so that clustering is available through merging two or more dictionaries into a single database and then using an appropriate alignment technique. Since alignment starts from the same entry of two dictionaries, clustering is faster than any other technique.</Paragraph> <Paragraph position="1"> The algorithm introduced here is analogy-based, and starts from calculating the Levenshtein distance, which is a variation of the edit distance, and allows us to align the definitions. As a measure of similarity, the concept of longest collocation couple is introduced, which is the basis of clustering similar words. The process iterates, replacing similar pairs of words in the definitions until no new clusters are found.</Paragraph> <Paragraph position="2"> Introduction Clustering methods to identify semantically similar words are usually divided into relation-based and distribution-based approaches [Hirawaka, Xu and Haase 1996]. 
Relation-based clustering methods rely on the relations in a semantic network or ontology to judge the similarity between two concepts, either by measuring the shortest length that connects two concepts in the hierarchical net [Agirre and Rigau 1996], or by comparing the information content shared by the members under the same cluster [Morris and Hirst 1991, Resnik 1997]. However, even though these ontologies describe a huge number of members for a cluster, only a few words of a category may be interchangeable in the same context and thus usable as members of the same cluster. This means that not all words in a category are necessary.</Paragraph> <Paragraph position="3"> Conversely, distribution-based clustering methods depend on pure statistical analysis of the lexical occurrences in running texts. A major drawback is that distribution-based methods require us to process a large amount of data in order to get more reliable results. Moreover, the use of large corpora is not always practical, due to economic, time, or capability constraints. Gao [1997] states that the problem for statistical alignment algorithms, such as those based on the facts described by Gale and Church [1991], is the low frequency of words that occur in parallel corpora. Without large corpora, results rest on low-frequency words, which are quite unrepresentative for clustering.</Paragraph> <Paragraph position="4"> From a methodological point of view, there is, in addition to the above two approaches, a little-known approach called the analogy-based approach. This employs an inferential process and is used in computational linguistics and artificial intelligence as an alternative to current rule-based linguistic models.</Paragraph> </Section> </Paper>
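<!--
A minimal sketch of the alignment step mentioned in the abstract, under stated assumptions: definitions are split into whitespace-separated words, every insertion, deletion and substitution costs 1, and words that end up substituted for each other are treated as candidate interchangeable pairs. The function name align_definitions and the two sample definitions are hypothetical and do not come from the paper.

```python
def align_definitions(a, b):
    """Word-level Levenshtein alignment of two token lists.

    Returns (distance, pairs), where pairs holds the (word_a, word_b)
    substitutions found along one optimal alignment path.
    """
    n, m = len(a), len(b)
    # dist[i][j] = edit distance between a[:i] and b[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete a word of a
                dist[i][j - 1] + 1,         # insert a word of b
                dist[i - 1][j - 1] + cost,  # match or substitute
            )

    # Walk back from the bottom-right cell and collect substitutions.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        cost = 0 if a[i - 1] == b[j - 1] else 1
        if dist[i][j] == dist[i - 1][j - 1] + cost:
            if cost:
                pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif dist[i][j] == dist[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return dist[n][m], pairs


if __name__ == "__main__":
    # Hypothetical definitions of the same entry in two dictionaries.
    def_a = "a small domesticated feline animal".split()
    def_b = "a little domesticated feline mammal".split()
    print(align_definitions(def_a, def_b))
```

The sketch covers only the distance computation and the extraction of substituted words. The longest collocation couple measure and the iterative replacement of similar pairs are not reproduced, since the abstract does not define them.
-->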