<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1113">
<Title>Towards Unsupervised Extraction of Verb Paradigms from Large Corpora</Title>
<Section position="4" start_page="111" end_page="111" type="metho">
<SectionTitle> 3 Cluster Analysis </SectionTitle>
<Paragraph position="0"> For each verb we extracted frequency counts for left and right bigrams, called the left and right contexts, respectively. A similarity matrix for left and right contexts was created by calculating the relative entropy, or Kullback-Leibler (KL) distance, between the vectors of context frequency counts for each pair of verbs. The KL distance is used (Cover & Thomas, 1991) to measure the similarity between two distributions of word bigrams; for the moment, we added a small constant to smooth over zero frequencies. Because the distance between verb_i and verb_j is not in general equal to the distance between verb_j and verb_i, the KL distances between each pair of verbs are added to produce a symmetric matrix. We tested other measures of distance between words. The total divergence to the average, based on the KL distance (Dagan et al., 1994), produced comparable results, but the cosine measure (Schuetze, 1993) produced significantly poorer results. We conclude that entropy methods produce more reliable estimates of the probability distributions for sparse data (Ratnaparkhi, 1997).</Paragraph>
<Paragraph position="1"> The similarity matrices for left and right contexts are analyzed by a hierarchical clustering algorithm for compact clusters. The use of a &quot;hard&quot; instead of a &quot;soft&quot; clustering algorithm is justified by the observation (Pinker, 1984) that the verbs do not belong to more than one inflectional category or lemma (the only exception in our data is &quot;'s&quot;, which belongs to the lemmas BE and HAVE). A hierarchical clustering algorithm (Seber, 1984) builds the tree from the bottom up using an agglomerative method that proceeds by a series of successive fusions of the N objects into clusters. The compactness of the resulting cluster is used as the primary criterion for membership. This method of complete linkage, also known as farthest neighbor, defines the distance between two clusters in terms of the largest dissimilarity between a member of cluster L1 and a member of cluster L2. We determined experimentally on the development corpus that this algorithm produced the best clusters for our data.</Paragraph>
<Paragraph position="2"> Figures 1 and 2 show portions of the cluster trees for left and right contexts. The scales at the top of the figures indicate the height at which the cluster is attached to the tree, in the arbitrary units of the distance metric. The left context tree in Figure 1 shows large, nearly pure clusters for the VBD and VBZ inflectional categories. The right context tree in Figure 2 has smaller clusters for regular and irregular verb lemmas. Note that some, but not all, of the forms of BE form a single cluster.</Paragraph>
<Paragraph position="3"> To turn the cluster tree into a classification, we need to determine the height at which to terminate the clustering process. A cut point is a height in the tree that defines the cluster membership. A high cut point will produce larger clusters that may include more than one category. A low cut point will produce more single-category clusters, but more items will remain unclustered. Selecting the optimum height at which to make the cut is known as the cutting problem.</Paragraph>
<Paragraph position="4"> A supervised method for cutting the cluster tree is one that takes into account the known classification of the items. We look at supervised methods for cutting the tree in order to evaluate the success of our proposed unsupervised method.</Paragraph>
</Section>
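To make the distance computation above concrete, here is a minimal Python sketch (not the authors' code): it assumes the bigram context counts have already been gathered into a hypothetical mapping context_counts from each verb to a count vector over a shared context vocabulary, and the smoothing constant is purely illustrative.

import numpy as np

def kl(p, q):
    # Relative entropy D(p || q) between two discrete distributions.
    return float(np.sum(p * np.log(p / q)))

def kl_distance_matrix(context_counts, smooth=0.5):
    # context_counts: dict mapping each verb to a vector of bigram
    # context counts over a shared context vocabulary (assumed input).
    verbs = sorted(context_counts)
    probs = {}
    for v in verbs:
        counts = np.asarray(context_counts[v], dtype=float) + smooth
        probs[v] = counts / counts.sum()   # smooth zero frequencies, normalize
    n = len(verbs)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # KL is asymmetric, so the two directions are added to give
            # a symmetric verb-by-verb distance matrix.
            d = kl(probs[verbs[i]], probs[verbs[j]]) + \
                kl(probs[verbs[j]], probs[verbs[i]])
            dist[i, j] = dist[j, i] = d
    return verbs, dist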
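The clustering step could look like the following sketch, which leans on SciPy's agglomerative routines rather than anything specified in the paper; the function cluster_at_height and its inputs (a verb list plus the symmetric distance matrix discussed above) are assumptions of the sketch.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_at_height(verbs, dist, height):
    # Complete linkage (farthest neighbor): the distance between two
    # clusters is the largest dissimilarity between their members.
    condensed = squareform(np.asarray(dist, dtype=float), checks=False)
    tree = linkage(condensed, method='complete')
    # Cutting the tree at `height` assigns every verb to exactly one
    # ("hard") cluster.
    labels = fcluster(tree, t=height, criterion='distance')
    clusters = {}
    for verb, label in zip(verbs, labels):
        clusters.setdefault(label, []).append(verb)
    return clusters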
<Section position="5" start_page="111" end_page="113" type="metho">
<SectionTitle> 4 Supervised Methods </SectionTitle>
<Paragraph position="0"> For supervised methods the distribution of categories C in clusters L at a given height T of the tree is represented by the notation in Table 2. For a given cluster tree, the total number of categories C, the distribution of items in categories n_c, and the total number of items N are constant across heights. Only the values for L, m_l, and f_cl will vary for each height T.</Paragraph>
<Paragraph position="1"> Table 2: Notation.
C is the number of categories
L is the number of clusters
m_l is the number of instances in cluster l
f_cl is the number of instances of category c in cluster l
N is the total number of instances for cut T
n_c is the number of instances in category c</Paragraph>
<Paragraph position="2"> There are several methods for choosing a cut point in a hierarchical cluster analysis. We investigated three supervised methods, two based on information content, and a third based on counting the number of correctly classified items.</Paragraph>
<Section position="1" start_page="112" end_page="113" type="sub_section">
<SectionTitle> 4.1 Gain Ratio </SectionTitle>
<Paragraph position="0"> Information gain (Russell & Norvig, 1995) and gain ratio (Quinlan, 1990) were developed as metrics for automatic learning of decision trees. Their application to the cutting problem for cluster analysis is straightforward. Information gain is a measure of mutual information, the reduction in uncertainty of a random variable given another random variable (Cover & Thomas, 1991). Let C be a random variable describing categories and L another random variable describing clusters, with probability mass functions p(c) and q(l), respectively. The entropy for categories H(C) is defined by
H(C) = -\sum_c p(c) \log p(c),
where p(c) = n_c/N in the notation of Table 2. The average entropy of the categories within clusters, which is the conditional entropy of categories given clusters, is defined by
H(C|L) = -\sum_l q(l) \sum_c p(c|l) \log p(c|l),
where q(l) = m_l/N and p(c|l) = f_cl/m_l in our notation.</Paragraph>
<Paragraph position="1"> Information gain and mutual information are equivalent:
I(C; L) = H(C) - H(C|L).
Information gain increases as the mixture of categories in a cluster decreases. The purer the cluster, the greater the gain. If we measure information gain for each height T, T = 1, ..., 40 of the cluster tree, the optimum cut is at the height with the maximum information gain.</Paragraph>
<Paragraph position="2"> Information gain, however, cannot be used directly for our application because, as is well known, the gain function favors many small clusters, such as those found at the bottom of a hierarchical cluster tree. Quinlan (1990) proposed the gain ratio to correct for this. Let H(L) be the entropy for clusters defined, as above, by
H(L) = -\sum_l q(l) \log q(l).
Then the gain ratio is defined by
gain ratio = I(C; L) / H(L).
The gain ratio corrects the mutual information between categories and clusters by the entropy of the clusters.</Paragraph>
</Section>
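As a rough illustration of how these cut criteria could be computed, the sketch below works directly from a category-by-cluster table of f_cl counts in the notation of Table 2; the array layout, the choice of log base, and the handling of the degenerate H(L) = H(C) case are assumptions of the sketch, not details from the paper.

import numpy as np

def entropy(p):
    # Entropy of a discrete distribution; the log base only rescales the
    # scores and does not change which height maximizes them.
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def cut_scores(f):
    # f: C-by-L array of counts, f[c, l] = number of instances of
    # category c in cluster l at one cut height (Table 2 notation).
    f = np.asarray(f, dtype=float)
    N = f.sum()
    p_c = f.sum(axis=1) / N            # p(c) = n_c / N
    q_l = f.sum(axis=0) / N            # q(l) = m_l / N
    h_c = entropy(p_c)                 # H(C)
    h_l = entropy(q_l)                 # H(L)
    # H(C|L): average entropy of the categories within clusters.
    h_c_given_l = sum(q_l[l] * entropy(f[:, l] / f[:, l].sum())
                      for l in range(f.shape[1]) if f[:, l].sum() > 0)
    gain = h_c - h_c_given_l           # I(C; L) = H(C) - H(C|L)
    return {
        'information_gain': gain,
        'gain_ratio': gain / h_l if h_l > 0 else 0.0,
        # The revised ratio is undefined when H(L) equals H(C); treat that
        # case as maximal, since the two entropies are then "most equal".
        'revised_ratio': gain / (h_l - h_c) if h_l != h_c else float('inf'),
    }

Evaluating these scores at each candidate height and keeping the height with the maximum value of the chosen score would then give the supervised cut point.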
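A companion sketch of the percent-correct criterion, using the same hypothetical C-by-L count table; the treatment of singleton clusters follows one plausible reading of the rule described in Section 4.3 below.

import numpy as np

def percent_correct(f):
    # f: C-by-L array of counts, f[c, l] = instances of category c in
    # cluster l at one cut height (same layout as Table 2).
    f = np.asarray(f, dtype=float)
    n_c = f.sum(axis=1)                       # category sizes
    correct = 0.0
    for l in range(f.shape[1]):
        m_l = f[:, l].sum()
        if m_l == 0:
            continue
        top = int(np.argmax(f[:, l]))         # majority category of cluster l
        if m_l == 1:
            # Singleton cluster: its single item counts as correct only
            # when its category is also a singleton.
            correct += 1.0 if n_c[top] == 1 else 0.0
        else:
            correct += f[top, l]
    return correct / f.sum()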
<Section position="2" start_page="113" end_page="113" type="sub_section">
<SectionTitle> 4.2 Revised Gain Ratio </SectionTitle>
<Paragraph position="0"> We found that the gain ratio still sometimes maximizes at the lowest height in the tree, thus failing to indicate an optimum cut point. We experimented with the revised version of the gain ratio, shown below, which sometimes overcomes this difficulty:
revised ratio = I(C; L) / (H(L) - H(C)).
The number and composition of the clusters, and hence H(L), change for each cut of the cluster tree, but the entropy of the categories, H(C), is constant. This function maximizes when these two entropies are most nearly equal. Figure 3 shows the relationship between entropy and mutual information and the quantities defined for the gain ratio and revised ratio.</Paragraph>
</Section>
<Section position="3" start_page="113" end_page="113" type="sub_section">
<SectionTitle> 4.3 Percent Correct </SectionTitle>
<Paragraph position="0"> Another method for determining the cut point is to count the number of items correctly classified for each cut of the tree. The number of correct items for a given cluster is equal to the number of items in the category with the maximum value for that cluster. For singleton clusters, an item is counted as correctly categorized if the category is also a singleton. The percent correct is the total number of correct items divided by the total number of items. This value is useful for comparing results between cluster trees as well as for finding the cut point.</Paragraph>
</Section>
</Section>
<Section position="6" start_page="113" end_page="114" type="metho">
<SectionTitle> 5 Unsupervised Method </SectionTitle>
<Paragraph position="0"> An unsupervised method that worked well was to select a cut point that maximizes the number of clusters formed within a specified size range. Let s be an estimate of the size of the clusters and r < 1 the range. The algorithm counts the number of clusters at height T whose size m_l falls within the interval
s(1 - r) <= m_l <= s(1 + r).
The optimum cut point is at the height that has the most clusters in this interval.</Paragraph>
<Paragraph position="1"> The value of r = .8 was constant for both right and left cluster trees. For right contexts, where we expected many small clusters, s = 8, giving a size range of 2 to 14 items in a cluster. For left contexts, where we expected a few large clusters, s = 100, giving a size range of 20 to 180. The expected cluster size is the only assumption we make to adjust the cut point given the disparity in the expected number of categories for left and right contexts. A fully unsupervised method would not make an assumption about expected size.</Paragraph>
</Section>
</Paper>
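A minimal sketch of the size-range criterion from Section 5; the helper cluster_sizes_at, assumed to return the list of cluster sizes obtained by cutting the tree at a given height, is a hypothetical stand-in for whatever clustering interface is in use.

def best_cut(heights, cluster_sizes_at, s, r=0.8):
    # Count, at each candidate height, the clusters whose size m_l falls
    # inside s(1 - r) <= m_l <= s(1 + r), and keep the height with the
    # most such clusters.
    lo, hi = s * (1 - r), s * (1 + r)
    best_height, best_count = None, -1
    for h in heights:
        sizes = cluster_sizes_at(h)
        in_range = sum(1 for m in sizes if lo <= m <= hi)
        if in_range > best_count:
            best_height, best_count = h, in_range
    return best_height

# Settings reported in the text: r = 0.8 throughout; s = 8 for right
# contexts (roughly 2 to 14 items) and s = 100 for left contexts (20 to 180).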