<?xml version="1.0" standalone="yes"?> <Paper uid="W02-1408"> <Title>Automatic Discovery of Term Similarities Using Pattern Mining</Title>
<Section position="2" start_page="0" end_page="1" type="metho"> <SectionTitle> 1 Terminology Management </SectionTitle>
<Paragraph position="0"> Since a vast amount of knowledge still remains unexplored, several systems have been proposed to help scientists acquire relevant knowledge from the scientific literature. For example, GENIES (Friedman et al. (2001)) uses a semantic grammar and substantial syntactic knowledge in order to extract comprehensive information about signal-transduction pathways. Some of these systems are terminology-based, since technical terms semantically characterise documents and therefore represent a starting point for knowledge acquisition tasks. For example, Mima et al. (2002) introduce TIMS, a terminology-based knowledge acquisition system, which integrates automatic term recognition, term variation management, context-based automatic term clustering, ontology-based inference, and intelligent tag information retrieval. The system's aim is to provide efficient access to, and integration of, heterogeneous biological textual data and databases.</Paragraph>
<Paragraph position="1"> There are numerous approaches to automatic term recognition (ATR). Some methods (Bourigault (1992), Ananiadou (1994)) rely purely on linguistic information, namely morpho-syntactic features of term candidates. More recently, hybrid approaches combining linguistic and statistical knowledge have become increasingly popular (Frantzi et al. (2000), Nakagawa et al. (1998)).</Paragraph>
<Paragraph position="2"> There is a range of clustering and classification approaches based on statistical measures of word co-occurrence (e.g. Ushioda (1996)) or on syntactic information derived from corpora (e.g. Grefenstette (1994)). However, few of them deal with term clustering: Maynard and Ananiadou (2000) present a method that uses manually defined semantic frames for specific classes, Hatzivassiloglou et al. (2001) use machine learning techniques to disambiguate the names of proteins, genes and RNAs, while Friedman et al. (2001) describe the extraction of specific molecular pathways from journal articles.</Paragraph>
<Paragraph position="3"> In our previous work, we developed ATRACT (Automatic Term Recognition and Clustering for Terms), an integrated knowledge mining system for the domain of molecular biology (Mima et al. (2001)). ATRACT is part of the ongoing BioPath project (a Eureka-funded project, coordinated by LION BioScience (http://www.lionbioscience.com) and funded by the German Ministry of Research), and its main aim is to facilitate efficient expert-computer interaction during term-based knowledge acquisition; for the evaluation of the ATR and ATC methods incorporated in ATRACT, see Mima et al. (2001). Term management is based on the integration of automatic term recognition and automatic term clustering (ATC). ATR is based on the C/NC-value method (Frantzi et al. (2000)), a hybrid approach combining linguistic knowledge (term formation patterns) and statistical knowledge (term length, frequency of occurrence, etc.).</Paragraph>
<Paragraph position="4"> The extension of the method handles orthographic, morphological and syntactic term variants, as well as acronym recognition, as an integral part of the ATR process (Nenadic et al. (2002a)), ensuring that all occurrences of a term are taken into account. The ATC method is based on Ushioda's AMI (Average Mutual Information) hierarchical clustering method (Ushioda (1996)): co-occurrence-based term similarities are used as input, and a dendrogram of terms is generated.</Paragraph> </Section>
<Section position="4" start_page="2" end_page="21" type="metho"> <SectionTitle> 2 Term Similarity Measures </SectionTitle>
<Paragraph position="0"> In this section we introduce a novel hybrid method for measuring term similarity. Our method incorporates three types of similarity measures, namely contextual, lexical and syntactical similarity. We use a linear combination of the three similarities in order to estimate the similarity between terms. In the following subsections we describe each of the three similarity measures.</Paragraph>
<Section position="1" start_page="2" end_page="3" type="sub_section"> <SectionTitle> 2.1 Contextual Similarity </SectionTitle>
<Paragraph position="0"> Determining the similarity of terms based on their contexts is a standard approach, founded on the hypothesis that similar terms tend to appear in similar contexts. Contextual similarity, however, may be determined in a number of ways, depending on how the context is defined. For example, some approaches consider only terms that appear in close proximity to each other (Maynard and Ananiadou (2000)), while in other approaches grammatical roles such as object or subject are taken into account (Grefenstette (1994)).</Paragraph>
<Paragraph position="1"> Our approach to contextual similarity is based on automatic pattern mining. The aim is to automatically identify and learn the most important context patterns in which terms appear. A context pattern (CP) is a generalised regular expression that corresponds to either the left or the right context of a term; left and right contexts are treated separately. The following example shows a sample left context pattern of the term high affinity: V:bind TERM:rxr_heterodimers PREP:with</Paragraph>
<Paragraph position="2"> Let us now describe the process of constructing CPs and determining their importance. First, we collect concordances for all automatically recognised terms. Context constituents which we consider important for discriminating terms (e.g. noun and verb phrases, prepositions, and terms themselves) are identified by a tagger and by appropriate local grammars, which define syntactic phrases (e.g. NPs, VPs). The grammatical and lexical information attached to the context constituents is used to construct CPs. In the simplest case, contexts are mapped into the syntactic categories of their constituents. However, the lemmatised form of each of the syntactic categories can be used as well. For example, when encountered in a context, the preposition with can either be mapped to its POS tag, i.e. PREP, or the lemma can be added, in which case we have an instantiated chunk: PREP:with.</Paragraph>
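<Paragraph position="3"> To make the construction step concrete, the following minimal sketch (in Python) maps a tagged context, represented as (category, lemma) pairs, into a CP string. The category inventory, the instantiation policy and the helper names are illustrative assumptions, not the actual implementation. </Paragraph>
<Paragraph position="4">
# Sketch of context-pattern (CP) construction.
# Assumption: contexts arrive as (category, lemma) pairs produced by a
# tagger and local grammars; the category names and the sets below are
# illustrative choices, not the paper's exact inventory.

REMOVE = {"ADJ", "ADV", "DET", "LINK"}   # categories dropped during normalisation
INSTANTIATE = {"TERM", "V", "PREP"}      # categories kept together with their lemma

def normalise_cp(constituents):
    """Map a list of (category, lemma) pairs to a normalised CP string."""
    chunks = []
    for category, lemma in constituents:
        if category in REMOVE:
            continue                               # not useful for discriminating terms
        if category in INSTANTIATE:
            chunks.append(category + ":" + lemma)  # instantiated chunk, e.g. PREP:with
        else:
            chunks.append(category)                # bare syntactic category
    return " ".join(chunks)

# Left context of the term "high affinity" from the example above:
left = [("V", "bind"), ("TERM", "rxr_heterodimers"), ("PREP", "with")]
print(normalise_cp(left))   # -> V:bind TERM:rxr_heterodimers PREP:with
</Paragraph>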
<Paragraph position="5"> Further, some of the syntactic categories can be removed from the context patterns, as not all syntactic categories are equally significant in providing useful contextual information (Maynard and Ananiadou (2000)). Such CPs are regarded as normalised CPs. In our approach, one can define which categories to instantiate and which to remove. In the examples provided later in the paper (Section 3) we decided to remove the following categories: adjectives (that are not part of a term), adverbs, determiners and so-called linking words (e.g. however, moreover, etc.). Also, we instantiated terms and either verbs or prepositions, as these categories are significant for discriminating terms.</Paragraph>
<Paragraph position="6"> Once we have normalised CPs, we calculate the values of a measure called CP-value in order to estimate the importance of the CPs. CP-value is defined similarly to the C/NC-value for terms (Frantzi et al. (2000)). It assesses a CP p according to its total frequency f(p), its length |p| (the number of its constituents) and the frequency of its occurrence within other CPs (|Tp|, where Tp is the set of all CPs that contain p): CP-value(p) = log2|p| f(p) if p is not contained in other CPs, and CP-value(p) = log2|p| (f(p) - (1/|Tp|) Σ_{q∈Tp} f(q)) otherwise. The CPs whose CP-value is above a chosen threshold are deemed important. Note that these patterns are domain-specific and that they are automatically extracted from a domain-specific corpus. Tables 1 and 2 show samples of significant left context patterns extracted from a MEDLINE corpus (only terms and prepositions are instantiated).</Paragraph>
<Paragraph position="7"> At this point, each term is associated with a set of the most characteristic patterns in which it occurs. We treat CPs as term features, and we use a feature contrast model (Santini and Jain (1999)) to calculate the similarity between terms as a function of both their common and their distinctive features. Let us now formally define the contextual similarity measure. Let P1 and P2 be the sets of significant CPs associated with the terms t1 and t2, respectively. Then, the contextual similarity (CS) between t1 and t2 corresponds to the ratio between the number of common and distinctive contexts: CS(t1, t2) = |P1 ∩ P2| / (|P1 ∩ P2| + |P1 \ P2| + |P2 \ P1|).</Paragraph> </Section>
<Section position="2" start_page="3" end_page="4" type="sub_section"> <SectionTitle> 2.2 Lexical Similarity </SectionTitle>
<Paragraph position="0"> We also examine the lexical similarity between the words that constitute terms. For example, if terms share the same head, they are assumed to be semantically related (e.g. terms sharing the head receptor). Further, if one of such terms has additional modifiers, this may indicate concept specialisation (e.g. nuclear receptor and orphan nuclear receptor). Bearing that in mind, we base the definition of lexical similarity on the presence of a common head and/or common modifier(s). Let H1 and H2 be the sets containing the stems of the heads of the terms t1 and t2, and M1 and M2 the sets of the stems of their modifiers; then their lexical similarity (LS) is calculated according to the following formula: LS(t1, t2) = (a |H1 ∩ H2| + b |M1 ∩ M2|) / (a |H1 ∪ H2| + b |M1 ∪ M2|), where a and b are weights such that a > b, since we give higher priority to shared heads than to shared modifiers. Note that the lexical similarity between two different terms can have a positive value only if at least one of them is a multiword term.</Paragraph>
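<Paragraph position="1"> As an illustration of the two measures just defined, the following sketch computes CS as the ratio of common to common-plus-distinctive context patterns, and LS from the overlap of head and modifier stems. The set-based representation and the example weights are our assumptions; only the formulas themselves come from the definitions above. </Paragraph>
<Paragraph position="2">
# Sketch of the contextual (CS) and lexical (LS) similarity measures.
# Assumption: significant CPs and head/modifier stems are given as Python
# sets; the example weights a=2, b=1 merely satisfy the constraint a > b.

def cs(patterns1, patterns2):
    """Ratio of common to common-plus-distinctive context patterns."""
    common = len(patterns1 & patterns2)
    distinctive = len(patterns1 ^ patterns2)   # symmetric difference
    total = common + distinctive
    return common / total if total else 0.0

def ls(heads1, mods1, heads2, mods2, a=2.0, b=1.0):
    """Weighted overlap of head and modifier stems (a > b favours shared heads)."""
    num = a * len(heads1 & heads2) + b * len(mods1 & mods2)
    den = a * len(heads1 | heads2) + b * len(mods1 | mods2)
    return num / den if den else 0.0

# Two terms sharing one of two CPs:
print(cs({"V:bind PREP:with"}, {"V:bind PREP:with", "PREP:of"}))          # 0.5
# nuclear receptor vs. orphan nuclear receptor: shared head and modifier,
# plus one extra modifier signalling concept specialisation
print(ls({"receptor"}, {"nuclear"}, {"receptor"}, {"nuclear", "orphan"}))  # 0.75
</Paragraph>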
<Paragraph position="3"> Also, when calculating the lexical similarity between terms that are represented by corresponding acronyms, we use normalised expanded forms (for our approach to acronym acquisition and term normalisation, see Nenadic et al. (2002)).</Paragraph> </Section>
<Section position="3" start_page="4" end_page="21" type="sub_section"> <SectionTitle> 2.3 Syntactical Similarity </SectionTitle>
<Paragraph position="0"> By analysing the distribution of similar terms in corpora, we observed that some general (i.e. domain-independent) lexico-syntactic patterns indicate functional similarity between terms. For instance, the following example: ... steroid receptors such as estrogen receptor, glucocorticoid receptor, and progesterone receptor. suggests that all the terms involved are highly correlated, since they appear in an enumeration (represented by the such-as pattern), which indicates their similarity (based on the is_a relationship). Some of these patterns have previously been used to discover hyponym relations between words (Hearst (1992)). We generalised the approach by taking into account patterns in which the terms are used concurrently within the same context. We hypothesise that the parallel usage of terms within the same context, as a specific type of co-occurrence, indicates their functional similarity. Namely, all the terms within a parallel structure have the same syntactic function within the sentence (e.g. object or subject) and are used in combination with the same verb or preposition. This fact is used as an indicator of their semantic similarity.</Paragraph>
<Paragraph position="1"> In our approach, several types of lexico-syntactic patterns are considered: enumeration expressions, coordination, apposition, and anaphora. However, we currently do not discriminate between the different similarity relationships among terms (which are represented by the different patterns); instead, we consider terms appearing in the same syntactical roles to be highly semantically correlated.</Paragraph>
<Paragraph position="2"> A sample of enumeration patterns is shown in Table 3. Manually defined patterns are applied as syntactic filters in order to retrieve sets of similar terms; these patterns provide relatively good recall and precision. We also used coordination patterns (Klavans et al. (1997)) as another type of parallel syntactic structure. Two types of argument coordination and two types of head coordination patterns were considered (see Table 4). However, not all the sequences that match the coordination patterns are coordinated structures (see Table 5); therefore, these patterns provide relatively good recall, but not high precision, if one wants to retrieve the terms involved in such expressions. Nevertheless, both term coordination and (nominal) conjunction of terms indicate their similarity. Based on the co-occurrence of terms in these parallel lexico-syntactic patterns, we define the syntactical similarity (SS) measure for a pair of terms as 1 if the two terms appear together in any of the patterns, and 0 otherwise.</Paragraph>
<Paragraph position="3"> Non-terminal syntactic categories are given in angle brackets. The non-terminal <&> denotes a conjunctive word sequence, i.e. the following regular expression: (as well as)|(and[/or])|(or[/and]). Special characters (, ), [, ], | and * have the usual interpretation in regular expression notation.</Paragraph>
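<Paragraph position="4"> As a minimal sketch of how such a pattern can be applied as a syntactic filter, the following Python fragment matches a such-as enumeration. The assumption that recognised terms are pre-marked as <TERM>...</TERM> tokens, and the exact tokenisation, are ours and not part of the original implementation. </Paragraph>
<Paragraph position="5">
import re

# Sketch: a "such as" enumeration pattern applied as a syntactic filter.
# Assumption: recognised terms are pre-marked as <TERM>...</TERM>; the
# conjunctive alternatives mirror the <&> non-terminal described above.
TERM = r"<TERM>(.+?)</TERM>"
CONJ = r"(?:as well as|and(?:/or)?|or(?:/and)?)"
ENUM = re.compile(TERM + r" such as " + TERM +
                  r"(?:, " + TERM + r")*" +
                  r"(?:,? " + CONJ + r" " + TERM + r")?")

sent = ("<TERM>steroid receptors</TERM> such as <TERM>estrogen receptor</TERM>, "
        "<TERM>glucocorticoid receptor</TERM>, and <TERM>progesterone receptor</TERM>")
match = ENUM.search(sent)
if match:
    # every term in the matched enumeration is treated as similar to the others
    print(re.findall(TERM, match.group(0)))
</Paragraph>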
<Paragraph position="6"> In the experiments that we performed, the precision of expanding terms from coordinated structures was 70%.</Paragraph> </Section>
<Section position="4" start_page="21" end_page="21" type="sub_section"> <SectionTitle> 2.4 Hybrid CLS Similarity </SectionTitle>
<Paragraph position="0"> None of the similarities introduced so far is sufficient on its own to reliably estimate the similarity between two arbitrary terms. For example, if a term appears infrequently or only within very specific CPs, the small number of its significant CPs will affect its contextual similarity to other terms. Further, there are concepts that have idiosyncratic names (e.g. a protein named Bride of sevenless), which therefore cannot be classified by relying exclusively on lexical similarity. Our experiments also show that syntactical similarity provides high precision, but low recall, when used on its own, as not all terms appear in parallel lexico-syntactic expressions. Therefore, we introduce a hybrid term similarity measure, called the CLS similarity, as a linear combination of the three similarity measures: CLS(t1, t2) = α CS(t1, t2) + β LS(t1, t2) + γ SS(t1, t2). The choice of the weights α, β and γ in this formula is not a trivial problem. In our preliminary experiments (Section 3) we used manually chosen values. However, the parameters have also been fine-tuned automatically by a supervised learning method based on a genetic algorithm approach (Spasic et al. (2002)); a domain-specific ontology has been used to evaluate the generated similarity measures and to set the direction of their convergence. The differences between the results obtained with the various parameter values are presented in the following section.</Paragraph> </Section> </Section>
<Section position="5" start_page="21" end_page="21" type="metho"> <SectionTitle> 3 Results, Evaluation and Discussion </SectionTitle>
<Paragraph position="0"> The CLS measure was tested on a corpus of 2008 abstracts retrieved from the MEDLINE database (MEDLINE (2002)) with manually chosen values of 0.3, 0.3 and 0.4 for α, β and γ respectively. Random samples of the results were evaluated by a domain expert, and the combined measure proved to be a good indicator of semantic similarity. Table 6 shows the similarity of the term retinoic acid receptor to a number of terms. The examples point out the importance of combining the different types of term similarity. For instance, the low value of the contextual similarity for retinoid X receptor (caused by the relatively low frequency of the term's occurrences in the corpus) is balanced out by the other two similarity values, thus correctly indicating it as a term similar to retinoic acid receptor. Similarly, the high value of the contextual similarity for signal transduction pathway is neutralised by the other two similarity values, hence preventing it from being labelled as similar to retinoic acid receptor.</Paragraph>
<Paragraph position="1"> Table 6: Similarity between retinoic acid receptor and other terms.</Paragraph>
<Paragraph position="2"> The combined measure also proved to be consistent in the sense that similar terms share the same &quot;friends&quot; (Maynard and Ananiadou (2000)). For example, the similarity values of the two similar terms glucocorticoid receptor and estrogen receptor (the value of their mutual similarity is 0.68) with respect to other terms are mostly very close (Table 7).</Paragraph>
<Paragraph position="3"> Table 7: Similarity of glucocorticoid receptor and estrogen receptor to other terms.</Paragraph>
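<Paragraph position="4"> To make the linear combination concrete, here is a minimal sketch of the CLS measure using the manually chosen weights reported above; the function signature and the input similarity values are illustrative assumptions. </Paragraph>
<Paragraph position="5">
# Sketch of the hybrid CLS measure: a weighted linear combination of the
# contextual (CS), lexical (LS) and syntactical (SS) similarities.
# The default weights are the manually chosen values from Section 3;
# the input similarity values below are illustrative, not from Table 6.

def cls_similarity(cs_value, ls_value, ss_value,
                   alpha=0.3, beta=0.3, gamma=0.4):
    """CLS(t1, t2) = alpha*CS + beta*LS + gamma*SS."""
    return alpha * cs_value + beta * ls_value + gamma * ss_value

# A pair with low contextual similarity can still score highly when the
# lexical and syntactical evidence agree (cf. retinoid X receptor above):
print(cls_similarity(cs_value=0.1, ls_value=0.8, ss_value=1.0))   # about 0.67
</Paragraph>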
<Paragraph position="6"> The supervised learning of the parameters resulted in the values 0.13, 0.81 and 0.06 for α, β and γ respectively (see Spasic et al. (2002)). The measure with these values showed a higher degree of stability relative to the ontology-based similarity measure. Note that lexical similarity appears to be the most important, while syntactical similarity appears to be insignificant. The ontology used as a seed for learning term similarities contained well-structured, standardised and preferred terms, which resulted in promoting lexical similarity as the most significant. On the other hand, the SS similarity is corpus-dependent: the size of the corpus and the frequency with which the concurrent lexico-syntactic patterns are realised in it affect the syntactical similarity. In the training corpus such patterns occurred infrequently relative to the number of terms, which indicates that a bigger corpus is needed in the training phase. In order to increase the number of concurrent patterns, we also aim to include additional patterns that describe appositions, and to implement procedures for the resolution of co-referring terms. We also plan to experiment with parametrising the values of the syntactical similarity depending on the number and type of patterns in which two terms appear simultaneously.</Paragraph>
<Paragraph position="7"> The main purpose of discovering term similarities is to produce a similarity matrix that is used to identify term clusters. In Nenadic et al. (2002b) we present some preliminary results on term clustering using the hybrid CLS term similarity measure. Two different methods (namely the nearest neighbour method and Ward's method) have been used, and both achieved around 70% precision in clustering semantically similar terms.</Paragraph>
<Section position="1" start_page="21" end_page="21" type="sub_section"> <SectionTitle> Conclusions and Further Research </SectionTitle>
<Paragraph position="0"> In this paper we have presented a novel method for the automatic discovery of term similarities. The method is based on a combination of contextual, lexical and syntactical similarities between terms. Lexical similarity exposes the resemblance between the words that constitute terms, while syntactical similarity is based on mutual co-occurrence in parallel lexico-syntactic patterns. Contextual similarity is based on the automatic discovery of significant contexts through context pattern mining. Although the approach is domain-independent and knowledge-poor, the automatically collected patterns are domain-specific and identify the significant contexts in which terms tend to appear. However, in order to learn domain-appropriate term similarity parameters, we need to customise the method by incorporating domain-specific knowledge; for example, we have used an ontology to represent such knowledge.</Paragraph>
<Paragraph position="1"> The preliminary results in the domain of molecular biology have shown that the measure is a good indicator of semantic similarity between terms. Furthermore, the similarity measure is consistent in assigning weights: similar terms tend to share the same friends, i.e. the sets of terms to which they are similar overlap significantly.
These results are encouraging, as terms are grouped reliably according to their contextual, syntactical and lexical similarities.</Paragraph>
<Paragraph position="2"> Besides term clustering (presented in Nenadic et al. (2002b)), the similarity measure can be used for several term-oriented knowledge management tasks. Our future work will focus on term classification and on the consistent population and updating of ontologies. However, this requires the identification of specific term relationships in order to direct the placement of terms in a hierarchy. Further, term similarities can also be used for term sense disambiguation, which is essential for resolving the terminological confusion that occurs in many domains.</Paragraph> </Section> </Section> </Paper>