<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-0506">
<Title>Taxonomy Learning using Term Specificity and Similarity</Title>
<Section position="5" start_page="41" end_page="43" type="metho">
<SectionTitle>2 Term Specificity</SectionTitle>
<Paragraph position="0">Specificity is the degree of detailed information that an object carries about a given target object. For example, if an encyclopedia contains detailed information about the IT domain, it is an 'IT-specific encyclopedia'. In this sense, specificity is a function from an object and a target object to a real number. Term specificity has traditionally been used in information retrieval systems to weight index terms in documents (S. Jones, 1972; Aizawa, 2003; Wong & Yao, 1992); in the information retrieval context, it is a function of index terms and documents. In the taxonomy learning context, by contrast, term specificity is a function of terms and target domains (Ryu & Choi, 2005). Term specificity with respect to a domain is quantified as a positive real number, as shown in Eq. (1):

  $Spec(t \mid D) \in \mathbb{R}^{+}$   (1)

where t is a term and Spec(t|D) is the specificity of t in a given domain D. In this paper we simply write Spec(t) instead of Spec(t|D), assuming a particular domain D.</Paragraph>
<Paragraph position="1">Before describing methods for measuring term specificity, we need to understand the relation between domain concepts and the way they are lexicalized. Domain-specific concepts can be distinguished by a set of what we call 'characteristics'. More specific concepts are created by adding characteristics to the characteristic sets of existing concepts: given two concepts C1 and C2, the more specific concept C2 is created by adding new characteristics to the characteristic set of C1 (ISO, 2000). When domain-specific concepts are lexicalized as terms, their word formation falls into two categories based on the composition of component words. In the first category, new terms are created by adding modifiers to existing terms. Figure 2 shows a subtree of the finance ontology; for example, 'current asset' was created by adding the modifier 'current' to its hypernym 'asset'. In this case, inside information is good evidence for the characteristics. In the second category, new terms are created independently of existing terms; for example, 'cash', 'inventory' and 'receivable' share no common words with their hypernyms 'current asset' and 'asset'. In this case, outside information is used to differentiate the characteristics of the terms.</Paragraph>
<Paragraph position="2">Many kinds of inside and outside information can be used to measure term specificity. The distributions of adjective-term relations and of verb-argument dependency relations are collocation-based statistics. The adjective-term distribution reflects the idea that specific nouns are rarely modified in text, while general nouns are frequently modified. This feature has been used to measure the specificity of nouns (Caraballo, 1999; Ryu & Choi, 2005) and to build a taxonomy of Japanese nouns (Yamamoto et al., 2005). The inverse specificity of a term can be measured by the entropy of its adjectives, as shown in Eq. (2):

  $Spec_{adj}(t)^{-1} = -\sum_{adj} P(adj \mid t) \log P(adj \mid t)$   (2)

where P(adj|t), the probability that adj modifies t, is estimated as freq(adj,t)/freq(t). The entropy is the average information quantity over all (adj, t) pairs for term t. Specific terms have low entropy, because their adjective distributions are simple.</Paragraph>
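To make Eq. (2) concrete, the following sketch computes the adjective-entropy measure from (adjective, term) co-occurrence counts. The counting interface and the inversion of entropy into a specificity score are assumptions made for illustration, not part of the paper.

```python
import math
from collections import Counter

def adjective_entropy_specificity(adj_counts):
    """Inverse specificity of a term as the entropy of its adjective modifiers (Eq. 2).

    adj_counts: Counter mapping adjective -> freq(adj, t) for a single term t.
    Returns (entropy, specificity); specificity = 1/entropy is one possible inversion.
    """
    # P(adj | t) = freq(adj, t) / freq(t); here freq(t) is approximated by the
    # total count of modified occurrences of t, so the probabilities sum to 1.
    total = sum(adj_counts.values())
    if total == 0:
        return None, None                      # no adjective evidence for this term
    entropy = 0.0
    for count in adj_counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy, (1.0 / entropy if entropy > 0 else float("inf"))

# A general term modified by many adjectives vs. a rarely modified specific term.
print(adjective_entropy_specificity(Counter({"current": 5, "fixed": 4, "net": 3, "total": 2})))
print(adjective_entropy_specificity(Counter({"estimated": 6, "new": 1})))
```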
<Paragraph position="3">For the verb-argument distribution, we assume that domain-specific terms co-occur with a few selected verbs that represent their special characteristics, while general terms are associated with many verbs. Under this assumption, we make use of syntactic dependencies between verbs appearing in the corpus and their arguments, such as subjects and objects. For example, 'inventory' (a list of goods and materials held available in stock; http://en.wikipedia.org/wiki/Inventory) in Figure 2 tends to appear as the object of specific verbs like 'increase' and 'reduce'. This feature was used in (Cimiano et al., 2005) to learn concept hierarchies. The inverse specificity of a term can be measured by the entropy of its verb-argument relations, as in Eq. (3):

  $Spec_{varg}(t)^{-1} = -\sum_{v_{arg}} P(v_{arg} \mid t) \log P(v_{arg} \mid t)$   (3)

where P(v_arg|t) is estimated as freq(t, v_arg)/freq(t). The entropy is the average information quantity over all (t, v_arg) pairs for term t.</Paragraph>
<Paragraph position="4">The conditional probability of term co-occurrence in documents was used in (Sanderson & Croft, 1999) to build a term taxonomy. This statistic is based on the assumption that, for two terms t_i and t_j, t_i is said to subsume t_j if the following two conditions hold:

  $P(t_i \mid t_j) = 1$ and $P(t_j \mid t_i) < 1$   (4)

If t_i subsumes t_j, then t_i can be a parent of t_j in the taxonomy. Although a good number of term pairs adhere to these two subsumption conditions, many pairs just fail to be included because of a few occurrences of the subsumed term t_j without t_i. The conditions are therefore relaxed, following Sanderson and Croft, and the subsume function is defined as in Eq. (5):

  $subsume(t_i, t_j) = 1$ if $P(t_i \mid t_j) \geq 0.8$ and $P(t_j \mid t_i) < 1$, and 0 otherwise   (5)

We apply this function to calculate term specificity as shown in Eq. (6), where a term is specific when it is subsumed by most of the other terms; the specificity of t is determined by the ratio of terms that subsume t over all co-occurring terms:

  $Spec_{coldoc}(t) = \frac{1}{n} \sum_{i=1}^{n} subsume(t_i, t)$   (6)

where n is the number of terms co-occurring with t.</Paragraph>
<Paragraph position="5">Finally, inside-word information is important for computing the specificity of multiword terms. Consider a term t that consists of two words, t = w_1 w_2; the component words have their own characteristics, and these characteristics are summed up into the characteristics of t. Mutual information is used to estimate the association between a term and its component words. Let P(t_i) and P(w_j) be the probabilities of observing t_i and w_j independently, and P(t_i, w_j) the probability of observing them together. The mutual information represents the reduction of uncertainty about t_i when w_j is observed, and the summed mutual information between t_i and its component word set W, as in Eq. (7), is the total reduction of uncertainty about t_i:

  $Spec_{in}(t_i) = \sum_{w_j \in W} \log \frac{P(t_i, w_j)}{P(t_i)\,P(w_j)}$   (7)

For example, 'debenture bond' is a more specific concept than 'financial product'. Intuitively, 'debenture' is highly associated with 'debenture bond', compared with 'bond' to 'debenture bond' or 'financial' and 'product' to 'financial product'.</Paragraph>
</Section>
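As an illustration of the inside-word measure in Eq. (7), the sketch below sums pointwise mutual information between a multiword term and each of its component words. The probability dictionaries and their estimation scheme are assumptions for the example, not the paper's implementation.

```python
import math

def inside_word_specificity(term, term_prob, word_prob, joint_prob):
    """Summed mutual information between a term and its component words (Eq. 7).

    term_prob:  dict term -> P(t)
    word_prob:  dict word -> P(w)
    joint_prob: dict (term, word) -> P(t, w), the probability of observing the word
                as a component of the term (estimation scheme assumed here).
    """
    total_mi = 0.0
    for word in term.split():                           # W: component words of t
        p_t, p_w = term_prob[term], word_prob[word]
        p_tw = joint_prob[(term, word)]
        total_mi += math.log2(p_tw / (p_t * p_w))       # MI(t, w)
    return total_mi
```

In line with the example above, the contribution of 'debenture' to 'debenture bond' would be large, because 'debenture' occurs almost exclusively inside that term.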
<Section position="7" start_page="43" end_page="43" type="metho">
<SectionTitle>3 Term Similarity</SectionTitle>
<Paragraph position="0">We evaluate four statistical and lexical features related to taxonomy learning from the viewpoint of term similarity. Three of the statistical features have been used in previous taxonomy learning research.</Paragraph>
<Paragraph position="1">(Sanderson & Croft, 1999) used the conditional probability of terms co-occurring in the same document, as shown in Eq. (4), in their taxonomy learning process. This feature can also be used to measure the similarity of terms: if two terms co-occur in common documents, they are semantically similar to each other. Based on this assumption, we can calculate term similarity by comparing the frequency with which t_i and t_j co-occur against the frequencies with which t_i and t_j occur independently, as in Eq. (8):

  $Sim_{coldoc}(t_i, t_j) = \frac{df(t_i, t_j)}{df(t_i) + df(t_j) - df(t_i, t_j)}$   (8)

where df(t) is the number of documents in which t occurs and df(t_i, t_j) is the number of documents in which both terms occur.</Paragraph>
<Paragraph position="2">(Yamamoto et al., 2005) used adjective patterns to build characteristic vectors for terms in the Complementary Similarity Measure (CSM). Although CSM was initially designed to extract superordinate-subordinate relations, it is a similarity measure in itself. They proposed two CSM measures: one for binary images, in which the values in feature vectors are 0 or 1, and one for gray-scale images, in which the values range from 0 to 1. We adopt the gray-scale measure in our similarity calculation, because it showed better performance in their experiments.</Paragraph>
<Paragraph position="3">(Cimiano et al., 2005) applied Formal Concept Analysis (FCA) to extract taxonomies from a text corpus. They modeled the context of a term as a vector of syntactic dependencies. Similarity based on verb-argument dependencies is calculated with the cosine measure, as in Eq. (9):

  $Sim_{varg}(t_i, t_j) = \frac{\vec{v}_{t_i} \cdot \vec{v}_{t_j}}{\lVert \vec{v}_{t_i} \rVert\, \lVert \vec{v}_{t_j} \rVert}$   (9)

where the vectors count the verb-argument relations in which t_i and t_j appear in the corpus one or more times.</Paragraph>
<Paragraph position="4">The last similarity measure is based on the inside information of terms. Because many domain terms are multiword terms, component words are clues for term similarity: if two terms share many common words, they share common characteristics in the given domain. For example, the four terms 'asset', 'current asset', 'fixed asset' and 'intangible asset' share characteristics related to 'asset', as in Figure 2. This similarity measure is shown in Eq. (10):

  $Sim_{in}(t_i, t_j) = \frac{2 \times com(t_i, t_j)}{|t_i| + |t_j|}$   (10)

where com(t_i, t_j) is the common word count of t_i and t_j and |t| is the number of words in t. Because the common word count is zero for most term pairs, it is difficult to obtain reliable similarity values for all possible term pairs.</Paragraph>
</Section>
<Section position="8" start_page="43" end_page="43" type="metho">
<SectionTitle>4 Taxonomy Learning Process</SectionTitle>
<Paragraph position="0">We model the taxonomy learning process as the sequential insertion of new terms into the current taxonomy. The taxonomy starts empty and grows into a rich taxonomic structure as terms are repeatedly inserted, as depicted in Figure 3.</Paragraph>
<Paragraph position="1">Terms to be inserted are sorted by their specificity values. Inserting terms in increasing order of specificity is natural, because the taxonomy then grows from top to bottom as insertion proceeds. For each new term, terms already in the taxonomy that are similar to it are selected as candidate hypernyms, and the hypernym is chosen from the taxonomy using term specificity and similarity (Figure 3).</Paragraph>
</Section>
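A minimal sketch of the insertion loop described in Section 4, assuming hypothetical specificity and similarity callables and a parent-pointer taxonomy. The rule of attaching each term to its most similar, already inserted (hence less specific) term, and the sim_threshold parameter, are simplifications for illustration rather than the paper's exact candidate-hypernym selection.

```python
def build_taxonomy(terms, specificity, similarity, sim_threshold=0.3):
    """Insert terms into an initially empty taxonomy in increasing specificity order.

    specificity: callable term -> float (a Section 2 measure)
    similarity:  callable (term, term) -> float (a Section 3 measure)
    Returns a dict mapping each term to its chosen hypernym (None for root-level terms).
    """
    parent = {}
    for term in sorted(terms, key=specificity):           # least specific first
        # Candidate hypernyms: terms already inserted, i.e. less specific than `term`.
        candidates = [(similarity(term, t), t) for t in parent]
        candidates = [(s, t) for s, t in candidates if s >= sim_threshold]
        if candidates:
            _, best = max(candidates)                      # attach under the most similar
            parent[term] = best
        else:
            parent[term] = None                            # no similar term yet: root level
    return parent
```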
<Section position="9" start_page="43" end_page="45" type="metho">
<SectionTitle>5 Experiment and Evaluation</SectionTitle>
<Paragraph position="0">We applied our taxonomy learning method to the set of terms of an existing taxonomy. We removed all relations from the taxonomy and then built new taxonomic relations among the terms; the learned taxonomy was compared with the original taxonomy. Our experiment consists of four steps: first, we calculated term specificity using the specificity measures discussed in Section 2; second, we calculated term similarity using the similarity measures described in Section 3; third, we applied the best specificity and similarity features in our taxonomy building process; and finally, we evaluated our method and compared it with other taxonomy learning methods.</Paragraph>
<Paragraph position="1">The finance ontology developed within the GETESS project (Staab et al., 1999) was used in our experiment. We slightly modified the original ontology: we unified different expressions of the same concept into one expression (for example, 'cd-rom drive' and 'cdrom drive' were unified as 'cd-rom drive', because the former is the more usual expression), and we removed terms that are not descendants of the 'root' node, so that the taxonomy has a single root node. The taxonomy consists of 1,819 nodes in total, 1,130 of them distinct. The maximum and average depths are 15 and 5.5 respectively, and the maximum and average numbers of child nodes are 32 and 3.5 respectively. The ontology can be downloaded at http://www.aifb.unikarlsruhe.de/WBS/pci/FinanceGoldStandard.isa; P. Cimiano and his colleagues added English labels to the originally German-labeled nodes (Cimiano et al., 2005).</Paragraph>
<Paragraph position="2">We used the Reuters-21578 corpus, over 3.1 million words in the title and body fields. We parsed the corpus with the Connexor functional dependency parser and extracted various statistics: term frequencies, adjective distributions, document co-occurrence frequencies, and verb-argument distributions.</Paragraph>
<Section position="1" start_page="43" end_page="45" type="sub_section">
<SectionTitle>5.1 Term Specificity</SectionTitle>
<Paragraph position="0">Term specificity was evaluated on three criteria: recall, precision and F-measure. Recall is the fraction of terms whose specificity can be computed by the given measuring method. Precision is the fraction of valid relations with correct specificity values, where (p, c) is a parent-child relation in the original taxonomy, a relation is valid when the specificity of both terms can be measured by the given method, and a valid relation is correct when the specificity of the child term c is larger than that of the parent term p. F-measure combines precision and recall into a single overall measure as their harmonic mean.</Paragraph>
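The evaluation criteria just defined can be stated compactly in code; the input structures (reference parent-child pairs and a dict of measured specificity values) are assumed for illustration.

```python
def evaluate_specificity(relations, spec):
    """Precision, recall and F-measure of a specificity measure (Section 5.1).

    relations: list of (parent, child) pairs from the reference taxonomy.
    spec:      dict mapping term -> specificity value; terms the measure could not
               handle are simply absent from the dict.
    """
    all_terms = {t for pair in relations for t in pair}
    measured = set(spec) & all_terms
    valid = [(p, c) for p, c in relations if p in spec and c in spec]
    correct = [(p, c) for p, c in valid if spec[c] > spec[p]]   # child more specific

    recall = len(measured) / len(all_terms)                     # terms with a value
    precision = len(correct) / len(valid) if valid else 0.0     # correct / valid relations
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```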
<Paragraph position="1">We tested the four specificity measuring methods discussed in Section 2 and compared their results. Spec_adj showed the highest precision, as we anticipated: because domain-specific terms carry sufficient information in themselves, they are rarely modified by other words in real text. However, Spec_adj also showed the lowest recall, owing to the data sparseness problem; as mentioned above, it is hard to collect sufficient adjectives for domain-specific terms from text. Spec_varg showed the lowest precision, which indicates that the distribution of verb-argument relations is less correlated with term specificity. Spec_in showed the highest recall because, unlike the other methods, it measures term specificity from component words. Spec_coldoc showed comparable precision and recall.</Paragraph>
<Paragraph position="2">We harmonized Spec_in and Spec_adj into Spec_in/adj, as described in (Ryu & Choi, 2005), to take advantage of both inside and outside information; the harmonic mean of the two specificity values is used in the Spec_in/adj method. Spec_in/adj showed the highest F-measure, because its precision was higher than that of Spec_in and its recall was equal to that of Spec_in.</Paragraph>
</Section>
<Section position="2" start_page="45" end_page="46" type="sub_section">
<SectionTitle>5.2 Term Similarity</SectionTitle>
<Paragraph position="0">We evaluated the similarity measures by comparing them with a taxonomy-based similarity measure. (Budanitsky & Hirst, 2006) calculated correlation coefficients (CC) between human similarity ratings and five WordNet-based similarity measures. Among the five computational measures, (Leacock & Chodorow, 1998)'s method showed the highest correlation coefficient, although all of the measures showed similar values, ranging from 0.74 to 0.85. This result means that taxonomy-based similarity is highly correlated with human similarity ratings, so we can evaluate our similarity measures indirectly by comparing them with a taxonomy-based similarity measure instead of directly with human ratings: if a similarity measure is sound, the calculated similarities will be highly correlated with the taxonomy-based similarity. Leacock and Chodorow proposed the following formula for the scaled semantic similarity between terms t_i and t_j:

  $Sim_{LC}(t_i, t_j) = -\log \frac{len(t_i, t_j)}{2 \times depth_{max}}$

where the denominator contains the maximum depth of the given taxonomy and len(t_i, t_j) is the length of the shortest path between t_i and t_j in the taxonomy.</Paragraph>
<Paragraph position="1">Besides the CC with the taxonomy-based similarity measure, the recall of a similarity measure is also an important evaluation factor. We define the recall of a similarity measure, R_Sim, as the fraction of term pairs that receive similarity values under the given measure, as in Eq. (13):

  $R_{Sim} = \frac{|\{(t_i, t_j) : Sim(t_i, t_j) \text{ is defined}\}|}{|\{(t_i, t_j)\}|}$   (13)

As in Section 5.1, an F-measure combines the two criteria into a single measure of precision and recall.</Paragraph>
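A small sketch of the Leacock-Chodorow reference similarity used in this evaluation, assuming the taxonomy is given as a child-to-parent dict and that path length is counted in edges through the lowest common ancestor.

```python
import math

def lc_similarity(t1, t2, parent, max_depth):
    """Leacock-Chodorow scaled similarity: -log(len(t1, t2) / (2 * max_depth))."""
    def path_to_root(t):
        path = [t]
        while parent.get(t) is not None:
            t = parent[t]
            path.append(t)
        return path

    dist_from_t1 = {node: i for i, node in enumerate(path_to_root(t1))}
    for j, node in enumerate(path_to_root(t2)):
        if node in dist_from_t1:                   # lowest common ancestor found
            length = dist_from_t1[node] + j        # edges t1 -> LCA plus LCA -> t2
            return -math.log(max(length, 1) / (2 * max_depth))
    return None                                    # terms are in disconnected subtrees

# Example with a fragment of the finance subtree of Figure 2 (parent pointers assumed);
# max_depth=15 is the maximum depth reported for the finance ontology.
taxonomy = {"current asset": "asset", "fixed asset": "asset", "cash": "current asset"}
print(lc_similarity("asset", "current asset", taxonomy, max_depth=15))
```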
<Paragraph position="2">We calculated term similarity between all possible term pairs in the finance ontology using the measures described in Section 3. Additionally, we introduced a new similarity measure, Sim_in/varg, which combines Sim_in and Sim_varg, and compared it with the results of the other measures.</Paragraph>
<Paragraph position="3">Figure 5 shows how CC and recall vary for the five similarity measures as the similarity threshold changes from 0.0 to 1.0; in the ideal case, the threshold is directly proportional to CC and inversely proportional to recall. We normalized all similarity values to [0.0, 1.0] for each measure. CC grows as the threshold increases for Sim_coldoc and Sim_varg, as we expected. The CC of the CSM measure, Sim_csm, increased with the threshold but decreased once the threshold exceeded 0.6. For example, the two terms 'asset' and 'current asset' are very similar according to the Sim_LC measure, because the edge count between them in the finance ontology is one; however, the former can be modified by many adjectives, such as 'intangible', 'tangible', 'new' and 'estimated', while the latter is rarely modified by other adjectives in the corpus, because it was already extended from 'asset' by adding the adjective 'current'. Therefore, semantically similar terms do not always have similar adjective distributions.</Paragraph>
<Paragraph position="4">The CC between Sim_in and Sim_LC was high at low thresholds but dropped as the threshold increased. Similarity values above 0.6 are insignificant, because it is hard to exceed 0.6 with Eq. (10): for example, the similarity between 'executive board meeting' and 'board meeting' is 0.8, the maximum similarity in our test set, and the average inside-word similarity is 0.41.</Paragraph>
<Paragraph position="5">Sim_varg showed higher recall than the other measures, which means that verb-argument relations are more abundant in the corpus than the other features. Sim_in showed the lowest recall, because Eq. (10) yields valid similarity values for only a small portion of term pairs. Sim_varg also showed a higher F-measure when the threshold is over 0.2; this result illustrates that verb-argument relations are an adequate feature for similarity calculation. Sim_in/varg showed the highest F-measure.</Paragraph>
</Section>
<Section position="3" start_page="46" end_page="46" type="sub_section">
<SectionTitle>5.3 Taxonomy learning</SectionTitle>
<Paragraph position="0">In order to evaluate our approach, we need to assess how well the automatically learned taxonomies reflect the given domain. This goodness is evaluated by the similarity of the automatically learned taxonomy to a reference taxonomy. We used the ontology evaluation method of (Cimiano et al., 2005), in which the lexical recall (LR_Tax) and precision (P_Tax) of a learned taxonomy are defined based on the notion of taxonomic overlap. LR_Tax is defined as the ratio of the number of terms shared by the learned taxonomy and the reference taxonomy to the number of terms in the reference taxonomy. P_Tax is defined as the ratio of the taxonomic overlap of the learned taxonomy to the reference taxonomy, and F_Tax combines the two as their harmonic mean.</Paragraph>
</Section>
</Section>
</Paper>