<?xml version="1.0" standalone="yes"?> <Paper uid="H93-1053"> <Title>Augmenting Lexicons Automatically: Clustering Semantically Related Adjectives</Title> <Section position="3" start_page="0" end_page="273" type="intro"> <SectionTitle> 2. ALGORITHM </SectionTitle> <Paragraph position="0"> Our algorithm is based on two sources of linguistic data: data that help establish that two adjectives are related, and data that indicate that two adjectives are unrelated. We extract adjective-noun pairs that occur in a modification relation in order to identify the distribution of nouns an adjective modifies and, ultimately, determine which adjectives it is related to. This is based on the expectation that adjectives describing the same property tend to modify the same set of nouns. For example, temperature is normally defined for physical objects and we can expect to find that adjectives conveying different values of temperature will all modify physical objects. Therefore, our algorithm finds the distribution of nouns that each adjective modifies and categorizes adjectives as similar if they have similar distributions. null Second, we use adjective-adjective pairs occurring as pre-modifiers within the same NP as a strong indication that the two adjectives do not belong in the same group. There are three cases: I. If both adjectives modify the head noun and the two adjectives are antithetical, the NP would be self-contradictory, as in the scalar sequence hot cold or the non-scalar red black.</Paragraph> <Paragraph position="1"> 2. For non-antithetical scalar adjectives which both modify the head noun, the NP would violate the Gricean maxim of Manner \[1\] since the same information is conveyed by the strongest of the two adjectives (e.g.</Paragraph> <Paragraph position="2"> hot warm).</Paragraph> <Paragraph position="3"> 3. Finally, if one adjective modifies the other, the modifying adjective has to qualify the modified one in a different dimension. For example, in light blue shirt, blue is a value of the property color, while light indicates the shade*.</Paragraph> <Paragraph position="4"> The use of linguistic data, in addition to statistical measures, is a unique property of our work and significantly improves the accuracy of our results. One other published model for grouping semantically related words \[5\], is based on a statistical model of bigrams and trigrams and produces word groups using no linguistic knowledge, but no evaluation of the results is performed.</Paragraph> <Paragraph position="5"> Our method works in three stages. First, we extract linguistic data from the parsed corpus in the form of syntactically related word pairs; in the second stage, we compute a measure of similarity between any two adjectives based on the information gathered in stage one; and in the last stage, we cluster the adjectives into groups according to the similarity measure, so that adjectives with a high degree of similarity fall in the same cluster (and, consequently, adjectives with a low degree of similarity fall in different clusters).</Paragraph> <Paragraph position="6"> 2.1. Stage One: Extracting Word Pairs During the first stage, the system extracts adjective-noun and adjective-adjective pairs from the corpus. To determine the syntactic category of each word, and identify the NP boundaries and the syntactic relations between each word, we used the Fidditch parser \[6\]**. For each NP, we then determine its minimal NP, that part of an NP consisting of the head noun and its adjectival pre-modifiers. 
<Paragraph position="3"> The resulting adjective-adjective and adjective-noun pairs are filtered by a morphology component, which removes pairs that contain erroneous information (such as mistyped words, proper names, and closed-class words that may be mistakenly classified as adjectives, e.g. possessive pronouns). This component also reduces the number of different pairs without losing information by transforming words to an equivalent base form (e.g. plural nouns are converted to singular), so that the expected and actual frequencies of each pair are higher. Stage one then produces as output a simple list of adjective-adjective pairs that occurred within the same minimal NP and a table with the observed frequencies of every adjective-noun combination. Each row in the table contains the frequencies of modified nouns for a given adjective.</Paragraph> </Section>
<Section position="2" start_page="272" end_page="273" type="sub_section"> <SectionTitle> 2.2. Stage Two: Computing Similarities Between Adjectives </SectionTitle>
<Paragraph position="0"> This stage processes the output of stage one, producing a measure of similarity for each possible pair of adjectives. The adjective-noun frequency table is processed first; for each possible pair of adjectives in the table, we compare the two distributions of nouns.</Paragraph>
<Paragraph position="1"> We use a robust non-parametric method to compute the similarity between the modified-noun distributions for any two adjectives, namely Kendall's τ coefficient [7] for two random variables with paired observations. In our case, the two random variables are the two adjectives we are comparing, and each paired observation is their frequency of co-occurrence with a given noun. Kendall's τ compares the two variables by repeatedly comparing two pairs of their corresponding observations. Formally, if (Xi, Yi) and (Xj, Yj) are two pairs of observations for the adjectives X and Y on the nouns i and j respectively, we call these pairs concordant if Xi > Xj and Yi > Yj or if Xi < Xj and Yi < Yj; otherwise they are discordant.*** If the distributions for the two adjectives are similar, we expect a large number of concordances and a small number of discordances.</Paragraph>
<Paragraph position="2"> ***We discard pairs of observations where Xi = Xj or Yi = Yj.</Paragraph>
<Paragraph position="3"> Kendall's τ is defined as $\tau = P_c - P_d$, where $P_c$ and $P_d$ are the probabilities of observing a concordance or a discordance respectively. τ ranges from -1 to +1, with +1 indicating complete concordance, -1 complete discordance, and 0 no correlation between X and Y.</Paragraph>
<Paragraph position="4"> An unbiased estimator of τ is the statistic $T = (C - Q) / \binom{n}{2}$, where n is the number of paired observations in the sample and C and Q are the numbers of observed concordances and discordances respectively [8]. We compute T for each pair of adjectives, adjusting for possible ties in the values of each variable.</Paragraph>
<Paragraph position="5"> We determine concordances and discordances by sorting the pairs of observations (noun frequencies) on one of the variables (adjectives) and computing how many of the $\binom{n}{2}$ pairs of paired observations agree or disagree with the expected order on the other adjective. We normalize the result to the range 0 to 1 using a simple linear transformation.</Paragraph>
<Paragraph position="6"> After the similarities have been computed for every pair of adjectives, we utilize the knowledge offered by the observed adjective-adjective pairs: we know that the adjectives appearing in any such pair cannot be part of the same group, so we set their similarity to 0, overriding the similarity produced by τ.</Paragraph>
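<Paragraph position="7"> The sketch below mirrors this stage under the assumptions of the previous listing. It compares all $\binom{n}{2}$ pairs of paired observations directly instead of sorting, dividing by the number of untied pairs is only one plausible reading of "adjusting for possible ties", and the neutral score of 0.5 for adjective pairs with no untied observations is likewise an assumption.

from itertools import combinations

def kendall_similarity(x_freq, y_freq, nouns):
    """Normalized concordance between two adjectives' modified-noun
    distributions; x_freq and y_freq map nouns to frequencies."""
    observations = [(x_freq.get(n, 0), y_freq.get(n, 0)) for n in nouns]
    concordant = discordant = 0
    for (xi, yi), (xj, yj) in combinations(observations, 2):
        if xi == xj or yi == yj:      # discard ties (see footnote above)
            continue
        if (xi > xj) == (yi > yj):
            concordant += 1
        else:
            discordant += 1
    compared = concordant + discordant
    if compared == 0:
        return 0.5                    # no evidence either way: neutral score
    t = (concordant - discordant) / compared  # tie-adjusted estimate of tau
    return (t + 1.0) / 2.0            # linear map from [-1, +1] to [0, 1]

def similarity(a, b, frequencies, incompatible, nouns):
    """Final similarity: the linguistic evidence overrides the statistic."""
    if frozenset((a, b)) in incompatible:
        return 0.0                    # observed as co-modifiers in one NP
    x_freq = {n: f for (adj, n), f in frequencies.items() if adj == a}
    y_freq = {n: f for (adj, n), f in frequencies.items() if adj == b}
    return kendall_similarity(x_freq, y_freq, nouns)
</Paragraph>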
</Section>
<Section position="3" type="sub_section"> <SectionTitle> 2.3. Stage Three: Clustering The Adjectives </SectionTitle>
<Paragraph position="0"> In stage three we first convert the similarities to dissimilarities and then apply a non-hierarchical clustering algorithm. Such algorithms are in general stronger than hierarchical methods [9]. The number of clusters produced is an input parameter. We define dissimilarity as (1 - similarity), with the additional provision that pairs of adjectives with similarity 0 are given a dissimilarity value higher than 1. This ensures that these adjectives will never be placed in the same cluster; recall that they were determined to be definitively dissimilar on the basis of linguistic data. The algorithm uses the exchange method [10], since the more commonly used K-means method [9] is not applicable: the K-means method, like all centroid methods, requires the measure d between the clustered objects to be a distance, which means, among other conditions, that the triangle inequality holds for any three objects x, y, and z. However, this inequality does not necessarily hold for our dissimilarity measure. If the adjectives x and y were observed in the same minimal NP, their dissimilarity is quite large. If neither z and x nor z and y were found in the same minimal NP, then it is quite possible that the sum of their dissimilarities is less than the dissimilarity between x and y.</Paragraph>
<Paragraph position="1"> The algorithm tries to produce a partition of the set of adjectives in such a way that adjectives with high dissimilarities are placed in different clusters. This is accomplished by minimizing an objective function Φ which scores a partition P. The objective function we use is $\Phi(P) = \sum_{C \in P} \sum_{\{x, y\} \subseteq C} d(x, y)$.</Paragraph>
<Paragraph position="2"> The algorithm starts by producing a random partition of the adjectives, computing its Φ value, and then computing for each adjective the improvement in Φ for every cluster where it can be moved; if there is at least one move for an adjective that leads to an overall improvement of Φ, then the adjective is moved to the cluster that yields the best improvement and the next adjective is considered. This procedure is repeated until no more moves lead to an improvement of Φ.</Paragraph>
<Paragraph position="3"> This is a hill-climbing method and therefore is guaranteed to converge, but it may lead to a local minimum of Φ, inferior to the global minimum that corresponds to the optimal solution. To alleviate this problem, the partitioning algorithm is called repeatedly with different random starting partitions, and the best solution found in these runs is kept. It should be noted that the problem of computing the optimal solution is NP-complete, as a generalization of the basic NP-complete clustering problem [11].</Paragraph>
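<Paragraph position="4"> A compact sketch of this stage follows, again under stated assumptions rather than as the paper's implementation: the within-cluster sum above is taken at face value for Φ, the value 2.0 merely stands for any dissimilarity above 1, the number of restarts is arbitrary, and d is assumed to be a symmetric table of dissimilarities with the cluster count k an input of at least 2.

import random
from itertools import combinations

def dissimilarity(sim):
    """1 - similarity, with incompatible pairs (similarity 0) pushed above 1."""
    return 2.0 if sim == 0.0 else 1.0 - sim

def phi(clusters, d):
    """Objective: total dissimilarity between co-clustered adjectives.
    d is symmetric: d[a][b] == d[b][a] (an assumption about its construction)."""
    return sum(d[a][b] for c in clusters for a, b in combinations(sorted(c), 2))

def exchange_cluster(adjectives, d, k, restarts=20, seed=0):
    """Exchange-method partitioning into k clusters, with random restarts
    to escape local minima of phi."""
    rng = random.Random(seed)
    best, best_phi = None, float("inf")
    for _ in range(restarts):
        clusters = [set() for _ in range(k)]
        for adj in adjectives:                    # random starting partition
            clusters[rng.randrange(k)].add(adj)
        improved = True
        while improved:                           # hill-climb until no move helps
            improved = False
            for adj in adjectives:
                home = next(c for c in clusters if adj in c)
                def cost(c):                      # adj's dissimilarity to members
                    return sum(d[adj][other] for other in c if other != adj)
                target = min((c for c in clusters if c is not home), key=cost)
                if cost(target) < cost(home):     # the move strictly lowers phi
                    home.remove(adj)
                    target.add(adj)
                    improved = True
        score = phi(clusters, d)
        if score < best_phi:                      # keep the best of all restarts
            best, best_phi = clusters, score
    return best
</Paragraph> </Section> </Section> </Paper>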