File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/00/c00-2115_metho.xml
Size: 18,374 bytes
Last Modified: 2025-10-06 14:07:14
<?xml version="1.0" standalone="yes"?> <Paper uid="C00-2115"> <Title>Extracting semantic clusters from the alignment of definitions</Title> <Section position="2" start_page="0" end_page="795" type="metho"> <SectionTitle> 1 Analogy-based clustering </SectionTitle> <Paragraph position="0"> Jones \[11996\] suggests corpus alignment as a feasible analogy-based approach, ill order to align two sentences in tile same language, Waterman \[1996\] uses a technique for measuring tile similarity between lexical strings, named edit distance. This matches tile words of two sentences in linear order and determines their correspondence. For example, given the followiug two definitions for alkalimeter: * An apparatus for determining the concentration o1' alkalis in solution \[CED\] * An instrument for ascertaining the amount of alkali in a solution \[OED2\] Alignment may identil'y which words in these definitions are equivalents of each other. A quick observation of the sentences lets us identify three pairs of words: (apparatus, instrument), (determining, ascertaining) and (concentration, amount).</Paragraph> <Paragraph position="1"> The appeal of using definitions as corpora for alignment is l'ounded on two reasons. Firstly, dictionaries contain all necessary information as a knowledge base for extracting keywords \[Boguraev and Pustejovsky 1996\]. Secondly, it is much easier to find the sentences for aligning, since definitions are distinguished by entry headword.</Paragraph> <Paragraph position="2"> Taking into account Waterman's studies, we propose an analogy-based method to identify automatically semantic clusters. The difference in words used between two or more lexicographic definitions enables us to infer paradigms by merging the dictionary definitions into a single database and then using our own alignment technique.</Paragraph> </Section> <Section position="3" start_page="795" end_page="798" type="metho"> <SectionTitle> 2 Clustering algorithm </SectionTitle> <Paragraph position="0"> Tile overall structure of the clustering algorithn\] is shown in figure l, and its description is given below.</Paragraph> <Section position="1" start_page="795" end_page="796" type="sub_section"> <SectionTitle> 2.1 Processing definitions </SectionTitle> <Paragraph position="0"> Our algorithms are used in an overall system called &quot;onomasiological search system&quot; (OSS), whose aim is to allow the user to find terms by giving a description o1' a concept. Lexicographic and terminological definitions constitute the main lexical resources. Our algorithms cluster words that are used in the same context, thus operate on pairs of definitions for a same entry word, drawn fi'om two different dictionaries. If dictionary 1 does not have an entry word that exists in dictionary J, then this entry word is omitted from consideration. In order to balance the number of strings when an entry word in the dictionary 1 has two or more senses, the entry word in dictionary J is repeated as many times as necessary to equal the number of senses of dictionary I. We thus derive two files I and J containing an equal number of strings S~ and S 2, respectively. Each string consists of an entry term followed by its definition, the definition giving only one sense of the entry term. For each string S 1 there is a string S 2.</Paragraph> <Paragraph position="1"> Our experiments focus on 314 terms for measuring instruments extracted with their definitions from CED \[199411 and OED2 \[1994\], resulting in 387 strings from each dictionary.</Paragraph> <Paragraph position="2"> match S t and S 2 /S 1 and S 2</Paragraph> <Paragraph position="4"> The strings consist ()1' the entry term and the definition, so that etymology, part of speech, inl'lected t'orms ol' the cntry term, examples and other inl'ormation were deleted. Subject-field labels, such as 'astronomy' and 'meteorology', were preserved, either in full or slightly abbreviated, as they are helpful to resolve which sense o1' a word to choose, and usually constitute a l'undamental property of the concept.</Paragraph> <Paragraph position="5"> It should be noted that none of the 387 strings suffered any additional transformation, apart l'rom a few cases in order to complete a del'inition when it had been broken in two pm'ts by the dictionary editor, such as when a core meaning appears just once at the beginning of several subsequent senses. Althongh some abbreviations ('U.S.A.'), initials of proper names ('C.T.R. Wilson') and possessives ( :un s rays') will come out as two or more words al'ter deleting punctuation marks and therefore can alter the efficiency el' the algorithm, they were preserved to observe their effect.</Paragraph> </Section> <Section position="2" start_page="796" end_page="796" type="sub_section"> <SectionTitle> 2.2 Aligning definitions </SectionTitle> <Paragraph position="0"> In order to compare two strings of woMs, we use the Levenshtein distance \[Levenshtein 196611, a similar method to the edit distance. This method measures the edit transl'ormations that change one string into other. The Levenshtein distance arrangcs the strings in a matrix, with the words el' Sj heading the columns and those of S 2 heading the rows. A null word is inserted at the beginning of each string S~ and S 2, in position i=0,.j=0. The matrix is filled with the costs of insertion, deletion and substitution using the Where the cost of insertion. D~,,.,(), is 1. and the cost of substitution. D,,i,(), is 0 or 1, according to whether a~ and bj differ or not.</Paragraph> <Paragraph position="1"> Our experimental results have shown that the application of the Levenshtein distance using stem forms gives better matches than nsing full forms. Therefore, we shall fill the matrix with the cost for the stem l'orms, although the strings preserve the fnll forms both l'or the following steps and in the output table. We used the stmnming algorithm or' Porter \[1980\], which removes endings l'ronl words.</Paragraph> <Paragraph position="2"> Building on the Levenshtein distance, Wagner and Fisher \[1974\] propose a dynamic t~rogramming method to align the elements of two strings. Their procedure to return the ordered pairs of the alignment starts with the last cell of the matrix with cost\[n\]\[m\] and works back until either i or j equals 0, according to which o1' its neighbours a cell was derived l'rom. I1' it is derived either from the previous horizontal or vertical cell (\[i-l\]\[j\] or \[i\]lj-l\] respectively) then the difference in cost is.just 1, otherwise it is derived l'rom the diagonal.</Paragraph> </Section> <Section position="3" start_page="796" end_page="796" type="sub_section"> <SectionTitle> 2.3 Extracting triplets </SectionTitle> <Paragraph position="0"> The alignment gives us a list of triplets formed by ~.ll, J,l~, cost\[i\]\[j\]), in decreasing order according to cost\[i\]\[jl, where./.)' I, and ./\]~ arc full forms from the strings S~ and S e, respectively.</Paragraph> <Paragraph position="1"> There are three possible pairings of words: &quot;Equal couple&quot; is defined as the pair (1-\[i, .ffj) of full forms such that the corresponding stem forms are equal (,s.'/' I = 4)&quot; &quot;Matched couple&quot; is a pair (/.)~i, .Oj) such that .sf~ # .ff~. This couple represents a potential pair ot' similar words.</Paragraph> <Paragraph position="2"> &quot;Null couple&quot; is a pair (.g, .g) such that ,s:/I ()r 4 is missing.</Paragraph> <Paragraph position="3"> With respect to the Levcnshtein distance, the equal couple means these words do not need any change to make both equal, while for the matched couple we shall replace one word with the other progressively, and for the null couple we must either insert one word into the given string or delete it from the given string.</Paragraph> <Paragraph position="4"> The purpose of clustering is to match different pairs of words (matched couples), thus neither pairs of equal words (equal couples) nor pairs with a null word (null couples) are relevant.</Paragraph> </Section> <Section position="4" start_page="796" end_page="797" type="sub_section"> <SectionTitle> 2.4 Measuring similarity </SectionTitle> <Paragraph position="0"> As a measure of the similarity between a matched couple, we quantify the surrounding equal couples above and below it. This concept is similar to the &quot;longest common subsequence&quot; of two strings suggested by Wagner and Fisher \[1974\], which is del'ined as the common subsequence of two strings having maximal length, although in our case both strings differ by the single matched couple. By analogy, we use longest collocation couple, henceforth abbreviated lcc, since we refer to couples instead of a single string. Besides, the word &quot;collocation&quot; is more representative for a pair o1' words and their neighbourhood, being the core of two longest common subsequences. We define longest collocation couple as the maximal sequence of pairs of words formed by equal couples surrounding a matched couple.</Paragraph> <Paragraph position="1"> Given the alignment of the strings S~ and S 2 consisting of a list of triplets formed by (ffi., ff, cost\[ill/\]), in decreasing order according to cost\[i\]\[j\], where.ff I, and fl~ are, respectively, full fomas l'rom S~ and $2, the lcc is the longest consecutive sequence of triplets (~i., f~, cost\[i\]\[j\]) formed by one matched couple, such that it meets 3 conditions: the last triplet.</Paragraph> <Paragraph position="2"> By these conditions, only the matched couple becomes the core el' a Icc: we constrain a matched couple 1o be between two or more equal couples, and eliminate the possibility that the matched couple appears at the beginning or end o1' a phrase.</Paragraph> <Paragraph position="3"> As a result, we get a new triplet Off, .\[f~, Icco), where (If, J\[~) is the matched couple and lcc,a is the length of the longest collocation couple. As an example, for the definitions of &quot;dynameter&quot; in table 1, there is only one matched couple, &quot;determining-measuring&quot;, whose lcc is 9 (the extent o1' the Icc is indicated by arrows).</Paragraph> <Paragraph position="4"> Ranking all triplets found by lcc in decreasing order, we observe that the greater the value o1' lcc, the greater the similarity between the words of the matched couple.</Paragraph> </Section> <Section position="5" start_page="797" end_page="797" type="sub_section"> <SectionTitle> 2.5 Removing flmetion words </SectionTitle> <Paragraph position="0"> So far, function words and other noise words will also be clustered by our algorithms, in general, such words interfere in the identification of clusters and can give more wrong than good results. We use a stoplist to automatically identify any pair of words where a non-relevant word appears and exclude it, on the grounds that they are not very useful words for clustering. Thus, when the program comes across a matched pair of different words in a context and il' that matched pair contains a word from the stoplist, then the pair is rejected.</Paragraph> <Paragraph position="1"> Essentially, this is the same thing as using a tagger and looking at the tags as well as the words, since one would not want to choose a noun pairing with a determiner or a relative.</Paragraph> <Paragraph position="2"> By inspection, we observe that, after stoplist discrimination, the best potential clusters are found at higher values ot' Icc. Our experimental results show us that a length of lcc equal to 5 is a reliable threshold. Although there are also good matches for values equal to 4 and 3, the majority of these are duplicates of higher values.</Paragraph> </Section> <Section position="6" start_page="797" end_page="798" type="sub_section"> <SectionTitle> 2.6 Clustering </SectionTitle> <Paragraph position="0"> We introduce the terln binding to represent a candidate cluster, i.e. two words that may be used in the same context without changing the meaning o1' a definition. A binding is a matched couple (J.l~, .\[/'9 formed by the full forms .\[f~ and ft;, after stoplist discrimination, drawn t'rom the strings S, and S~, respectively, in such a way that the stem forms are equivalent, in a determined context, according to a determined threshold'.</Paragraph> <Paragraph position="1"> The threshold associated with a binding is the length of the lcc, and we consider only bindings of matched couples where lcc >_ 5.</Paragraph> <Paragraph position="2"> Each binding can be considered as an initial cluster. Clusters represent sets o1' words that are used with the same meaning in particular contexts. In a consecutive sequence of bindings, it may happen that a stem form occurs in two or more dilTerent bindings. In this case, one can cluster all bindings with a common stem form according to the transitive property.</Paragraph> <Paragraph position="3"> in order to cluster bindings, we use an algorithm consisting o1' three loops. First, it assigns a cluster number to each binding, so those bindings with a common word have the same cluster number. Secondly, it clusters bindings with the same cluster number, but removes duplicate stem forms in tile same cluster.</Paragraph> <Paragraph position="4"> Thirdly, it checks if it is possible to inerge new clusters with those of previous cycles. This process will typically result in a set of overlapping clusters, reflecting the natm'al state where concepts may belong to more than one conceptual class.</Paragraph> </Section> <Section position="7" start_page="798" end_page="798" type="sub_section"> <SectionTitle> 2.7 Cycling </SectionTitle> <Paragraph position="0"> As bindings represent pairs of words such that the stem forms can be substituted in a particular context without changing the meaning, sJi = aJ~ we can replace any of the full formsfPS with the full l'orms ffj according to each binding, so that the corresponding definition preserves the same meaning. After substituting bindings, we observe that several pairs of words will now typically present a high lcc score, even those pairs of words which initially did not yield matches with any word. It is then advantageous to replace thus the bindings in the definitions alld to repeat the entire process until no new clusters are found. The first cycle runs from the reading o1' definitkms up to merging of clusters.</Paragraph> <Paragraph position="1"> All subsequent cycles will start by replacing retained bindings in the definitions, thus each subsequent cycle works with new data.</Paragraph> </Section> </Section> <Section position="4" start_page="798" end_page="798" type="metho"> <SectionTitle> 3 Experimental results </SectionTitle> <Paragraph position="0"> The current clustering algorithm was developed by analysing definitions on the following basis: * Language dictionaries. The use of language dictionaries has been preferred because there are enough to extract data from. As they are in machine-readable form, it is possible to copy definitions, avoiding likely mistakes while typewriting.</Paragraph> <Paragraph position="1"> * Corpus on 314 &quot;measuring instruments&quot;. This domain has the advantage that it is easy to search for the terms that correspond to it, as they usually end in &quot;-meter&quot;, &quot;-scope&quot; or &quot;-graph&quot;. As a conscqueuce o1' applying the clustering program to the 387 strings, it is evident that the maiority of clusters were related to &quot;measure&quot; and &quot;instrument&quot;. * Alignment of two strings. We have shown that two sources of data (pairs of del'inition) are sufficient for clustering to yield good results.</Paragraph> <Paragraph position="2"> * No manipulation el'data. After ktentification of the term and the definitions, these were truncated to 200 characters and punctuation marks were removed. No words in definitions were replaced or moved, to &quot;tidy up&quot; the data, before being submitted to the main process.</Paragraph> <Paragraph position="3"> * Stemming algorithm. The stemmer algorithm presents both overstemming and understemming, but nevertheless the clustering program yiekts good results.</Paragraph> <Paragraph position="4"> * Stoplist discrimination. The stoplist has been used as a tagger, i.e. as a filter to avoid matching words with dil'ferent parts ot' speech.</Paragraph> <Paragraph position="5"> * Bindings for Ice _> 5. The best clusters have been observed for bindings with lcc> 5, and the results presented m'e good.</Paragraph> <Paragraph position="6"> Table 2 presents some cluster results after two cycles of the clustering procedure starting from the Levcnshtein distance. In addition to these clusters, 14 other clusters of two or three elements were obtained.</Paragraph> <Paragraph position="7"> I. apparalus inslrumcnt telescope 2. analyse ascerlaining determining estimating location measuring recording lakins testing 3. amotmt concenlration intensity percentage proportion rate salinity strength Table 2 Cluster results for &quot;measuring illstFunlents&quot; The procedure then stops, as no more matched words with lcc _> 5 have been found for our data. The following sections analyse variations of these considerations.</Paragraph> <Section position="1" start_page="798" end_page="798" type="sub_section"> <SectionTitle> 3.1 Using multiple resources </SectionTitle> <Paragraph position="0"> General language dictionaries present the advantage of using well-established lexicographic criteria to normalise definitions. These criteria, as for example the use of analytical definitious by genus and differentia, have been nowadays implemented by terminological or specialiscd dictionaries, with the addition of a richer vocabulary and the identification of properties that are not always considered relevant in other resources.</Paragraph> <Paragraph position="1"> Unfortunately, these are more oriented to a specific domain, so that it is sometimes necessary to search in two or more resources to compile the data.</Paragraph> <Paragraph position="2"> We used many online lexical resources, some of them available on the lnternct. This allowed us to easily use different databases to extract</Paragraph> </Section> </Section> class="xml-element"></Paper>