<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1708">
  <Title>The problem of ontology alignment on the web: a first report</Title>
  <Section position="4" start_page="51" end_page="53" type="metho">
    <SectionTitle>
2 General architecture
</SectionTitle>
    <Paragraph position="0"> The instance based approach we propose uses NLP techniques to compute matching scores based on the documents classified under the nodes of ontologies. There is no assumption on the structural properties of the ontologies to be compared: they can be any kind of graph representable in OWL. The instance documents are assumed to be text documents (plain text or HTML).</Paragraph>
    <Paragraph position="1"> The matching process starts from a pair of ontologies to be aligned. The two ontologies are traversed and, for each node having at least one instance, the system computes a signature based on the instance documents. Then, the signatures associated to the nodes of the two ontologies are compared pairwise, and a similarity score for each pair is generated. This score could then be used to estimate the likelihood of a match between a pair of nodes, under the assumption that the semantics of a node corresponds to the semantics of the instance documents classified under that node.</Paragraph>
    <Paragraph position="2"> Figure 1 shows the architecture of our system.</Paragraph>
    <Paragraph position="3"> The two main issues to be addressed are (1) the representation of signatures and (2) the definition of a suitable comparison metric between signatures. For a long time, the Information Retrieval community has successfully adopted a &amp;quot;bag of words&amp;quot; approach to effectively represent and compare text documents. We start from there to define a general signature structure and a metric to compare signatures.</Paragraph>
    <Paragraph position="4"> A signature is defined as a function S : K → R+, mapping a finite set of keys (which can be complex objects) to positive real values. With a signature of that form, we can use the cosine similarity metric to score the similarity between two signatures:</Paragraph>
    <Paragraph position="5"> sim(S1, S2) = (sum_k S1(k) S2(k)) / (sqrt(sum_k S1(k)^2) sqrt(sum_k S2(k)^2))</Paragraph>
    <Paragraph position="6"> The cosine similarity formula produces a value in the range [0, 1]. The meaning of that value depends on the algorithm used to build the signature. In particular, there is no predefined threshold that can be used to discriminate matches from non-matches. However, such a threshold could be computed a-posteriori from a statistical analysis of experimental results.</Paragraph>
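As a concrete illustration, the metric just described can be sketched in Python (the paper's actual system is implemented in Java; this is a minimal sketch of the comparison step only, representing a signature as a dict from keys to positive reals):

```python
import math

def cosine_similarity(s1, s2):
    """Cosine similarity between two signatures, each a dict mapping
    keys (tokens, synsets, ...) to positive real values."""
    dot = sum(v * s2.get(k, 0.0) for k, v in s1.items())
    norm1 = math.sqrt(sum(v * v for v in s1.values()))
    norm2 = math.sqrt(sum(v * v for v in s2.values()))
    # An empty signature has no direction; treat it as similarity 0.
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)
```

Identical signatures score 1.0 and signatures with disjoint key sets score 0.0, matching the [0, 1] range stated above.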
    <Section position="1" start_page="51" end_page="53" type="sub_section">
      <SectionTitle>
2.1 Signature generation algorithms
</SectionTitle>
      <Paragraph position="0"> For our experiments, we defined and implemented four algorithms to generate signatures. The four algorithms make use of text and language processing techniques of increasing complexity.</Paragraph>
      <Paragraph position="1"> 2.1.1 Algorithm 1: Baseline signature The baseline algorithm performs a very simple sequence of text processing, schematically represented in Figure 2.</Paragraph>
      <Paragraph position="2"> HTML tags are first removed from the instance documents. Then, the texts are tokenized and punctuation is removed. Everything is then converted to lowercase. Finally, the tokens are grouped and counted. The final signature has the form of a mapping table token → frequency.</Paragraph>
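The baseline pipeline above can be sketched as follows; the regex-based HTML stripping and tokenization are simplifications assumed here, not necessarily what the actual Java implementation used:

```python
import re
from collections import Counter

def baseline_signature(html_text):
    """Baseline signature: strip HTML, tokenize, lowercase, count.
    Returns a mapping token -> frequency."""
    # Crude tag removal (a stand-in for a real HTML parser).
    text = re.sub(r"<[^>]+>", " ", html_text)
    # Tokenizing on alphabetic runs also discards punctuation.
    tokens = re.findall(r"[a-zA-Z]+", text.lower())
    return Counter(tokens)
```

Note that, as discussed next, this signature keeps determiners, prepositions, and other noise words.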
      <Paragraph position="3"> The main problem we expected with this method is the presence of a lot of noise. In fact, many &amp;quot;irrelevant&amp;quot; words, like determiners, prepositions, and so on, are added to the final signature.  To cope with the problem of excessive noise, people in IR often use fixed lists of stop words to be removed from the texts. Instead, we introduced a syntax based filter in our chain of processing. The main assumption is that nouns are the words that carry most of the meaning for our kind of document comparison. Thus, we introduced a part-of-speech tagger right after the tokenization module (Figure 3). The results of the tagger are used to discard everything but nouns from the input documents. The part-of-speech tagger we used, QTAG 3.1 (Tufis and Mason, 1998), readily available on the web as a Java library, is a Hidden Markov Model based statistical tagger.</Paragraph>
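The noun-filtering step can be sketched as below. The Penn Treebank noun tags (NN, NNS, NNP, NNPS) are an assumption for illustration; QTAG's actual tagset may differ:

```python
from collections import Counter

def noun_signature(tagged_tokens):
    """Keep only nouns from POS-tagged input and count them.
    `tagged_tokens` is a list of (token, tag) pairs, as a tagger
    like QTAG would produce; Penn-style noun tags are assumed."""
    return Counter(token.lower()
                   for token, tag in tagged_tokens
                   if tag.startswith("NN"))
```

In the real pipeline the (token, tag) pairs come straight from the tagger inserted after the tokenization module.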
      <Paragraph position="4"> The problems we expected with this approach are related to the high specialization of words in natural language. Different nouns can bear similar meaning, but our system would treat them as if they were completely unrelated words. For example, the words &amp;quot;apple&amp;quot; and &amp;quot;orange&amp;quot; are semantically closer than &amp;quot;apple&amp;quot; and &amp;quot;chair,&amp;quot; but a purely syntactic approach would not make any difference between these two pairs. Also, the current method does not include morphological processing, so different inflections of the same word, such as &amp;quot;apple&amp;quot; and &amp;quot;apples,&amp;quot; are treated as distinct words. In further experiments, we also considered verbs, another syntactic category of words bearing a lot of semantics in natural language. We computed signatures with verbs only, and with verbs and nouns together. In both cases, however, the  performance of the system was worse. Thus, we will not consider verbs in the rest of the paper.</Paragraph>
      <Paragraph position="5"> 2.1.3 Algorithm 3: WordNet signature To address the limitations stated above, we used the WordNet lexical resource (Miller et al., 1990).</Paragraph>
      <Paragraph position="6"> WordNet is a dictionary where words are linked together by semantic relationships. In Word-Net, words are grouped into synsets, i.e., sets of synonyms. Each synset can have links to other synsets. These links represent semantic relationships like hypernymy, hyponymy, and so on.</Paragraph>
      <Paragraph position="7"> In our approach, after the extraction of nouns and their grouping, each noun is looked up on WordNet (Figure 4). The synsets to which the noun belongs are added to the final signature in place of the noun itself. The signature can also be enriched with the hypernyms of these synsets, up to a specified level. The final signature has the form of a mapping synset → value, where value is a weighted sum of all the synsets found.</Paragraph>
      <Paragraph position="8"> Two important parameters of this method are related to the hypernym expansion process mentioned above. The first parameter is the maximum level of hypernyms to be added to the signature (hypernym level). A hypernym level value of 0 would make the algorithm add only the synsets of a word, without any hypernym, to the signature. A value of 1 would cause the algorithm to add also their parents in the hypernym hierarchy to the signature. With higher values, all the ancestors up to the specified level are added. The second parameter, hypernym factor, specifies the damping of the weight of the hypernyms in the expansion process.</Paragraph>
      <Paragraph position="9"> Our algorithm exponentially dampens the hypernyms, i.e., the weight of a hypernym decreases exponentially as its level increases. The hypernym factor is the base of the exponential function.</Paragraph>
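The weighting scheme can be sketched as follows. The SYNSETS and HYPERNYMS tables are toy stand-ins for real WordNet lookups, and dividing the weight by the hypernym factor at each step up is one reading, consistent with the description above, of how the exponential damping is applied:

```python
from collections import defaultdict

# Toy stand-ins for WordNet; a real implementation queries the
# WordNet database for senses and hypernym links instead.
SYNSETS = {"apple": ["apple.n.01", "apple_tree.n.01"],
           "orange": ["orange.n.01"]}
HYPERNYMS = {"apple.n.01": "fruit.n.01",
             "orange.n.01": "fruit.n.01",
             "apple_tree.n.01": "tree.n.01"}

def wordnet_signature(nouns, hypernym_level=1, hypernym_factor=2.0):
    """Signature mapping synset -> weighted value, with hypernym
    expansion up to `hypernym_level` and exponential damping."""
    sig = defaultdict(float)
    for noun in nouns:
        senses = SYNSETS.get(noun, [])
        if not senses:
            continue  # noun not covered by the lexicon
        # Each sense weighted inversely to the number of senses.
        base = 1.0 / len(senses)
        for syn in senses:
            weight, level = base, 0
            while syn is not None and level <= hypernym_level:
                sig[syn] += weight
                weight /= hypernym_factor  # exponential damping
                syn = HYPERNYMS.get(syn)
                level += 1
    return dict(sig)
```

With hypernym level 0 only the synsets of the word itself are added; with level 1 their parents are added at half weight (for a hypernym factor of 2), and so on.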
      <Paragraph position="10"> In general, a noun can have more than one sense, e.g., &amp;quot;apple&amp;quot; can be either a fruit or a tree. This is reflected in WordNet by the fact that a noun can belong to multiple synsets. With the current approach, the system cannot decide which  sense is the most appropriate, so all the senses of a word are added to the final signature, with a weight inversely proportional to the number of possible senses of that word. This fact potentially introduces semantic noise in the signature, because many irrelevant senses might be added to the signature itself.</Paragraph>
      <Paragraph position="11"> Another limitation is that a portion of the nouns in the source texts cannot be located in WordNet (see Figure 6). Thus, we also tried a variation (algorithm 3+2) that falls back on to the bare lexical form of a noun if it cannot be found in Word-Net. This variation, however, resulted in a slight decrease of performance.</Paragraph>
      <Paragraph position="12"> The problem of having multiple senses for each word calls for the adoption of word sense disambiguation techniques. Thus, we implemented a word sense disambiguator algorithm, and we inserted it into the signature generation pipeline (Figure 5). For each noun in the input documents, the disambiguator takes into account a specified number of context words, i.e., nouns preceding and/or following the target word. The algorithm computes a measure of the semantic distance between the possible senses of the target word and the senses of each of its context words, pairwise. A sense for the target word is chosen such that the total distance to its context is minimized. The semantic distance between two synsets is defined here as the minimum number of hops in the WordNet hypernym hierarchy connecting the two synsets. This definition allows for a relatively straightforward computation of the semantic distance using WordNet. Other more sophisticated definitions of semantic distance can be found in (Patwardhan et al., 2003). The word sense disambiguation algorithm we implemented is certainly simpler than others proposed in the literature, but we used it to see whether a method that is relatively simple to implement could still help.</Paragraph>
      <Paragraph position="13"> The overall parameters for this signature creation algorithm are the same as the WordNet signature algorithm, plus two additional parameters for the word sense disambiguator: left context length and right context length. They represent respectively how many nouns before and after the target should be taken into account by the disambiguator. If those two parameters are both set to zero, then no context is provided, and the first possible sense is chosen. Notice that even in this case the behaviour of this signature generation algorithm is different from the previous one. In a WordNet signature, every possible sense for a word is inserted, whereas in a WordNet disambiguated signature only one sense is added.</Paragraph>
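The disambiguation strategy can be sketched as below. The toy hypernym table and the way per-context-word distances are aggregated (minimum over each context word's senses, summed over context words) are assumptions not spelled out in the text:

```python
def hop_distance(syn1, syn2, hypernyms):
    """Minimum number of hops between two synsets in a (single-parent)
    hypernym hierarchy, via their lowest common ancestor."""
    def ancestor_depths(syn):
        depths, d = {}, 0
        while syn is not None:
            depths[syn] = d
            d += 1
            syn = hypernyms.get(syn)
        return depths
    d1, d2 = ancestor_depths(syn1), ancestor_depths(syn2)
    common = set(d1) & set(d2)
    if not common:
        return float("inf")  # no connecting path in the hierarchy
    return min(d1[s] + d2[s] for s in common)

def disambiguate(target_senses, context_senses, hypernyms):
    """Pick the sense of the target noun minimizing the total
    distance to its context words.  `context_senses` is a list of
    sense lists, one per context word."""
    def total_distance(sense):
        return sum(min(hop_distance(sense, cs, hypernyms)
                       for cs in senses)
                   for senses in context_senses if senses)
    return min(target_senses, key=total_distance)
```

For example, with "pear" as context, the fruit sense of "apple" wins over the tree sense because it is two hops away via their shared hypernym.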
    </Section>
  </Section>
  <Section position="5" start_page="53" end_page="54" type="metho">
    <SectionTitle>
3 Experimental setting
</SectionTitle>
    <Paragraph position="0"> All the algorithms described in the previous section have been fully implemented in a coherent and extensible framework using the Java programming language, and evaluation experiments have been run. This section describes how the experiments have been conducted.</Paragraph>
    <Section position="1" start_page="53" end_page="54" type="sub_section">
      <SectionTitle>
3.1 Test data
</SectionTitle>
      <Paragraph position="0"> The evaluation of ontology matching approaches is usually made difficult by the scarceness of test ontologies readily available in the community.</Paragraph>
      <Paragraph position="1"> This problem is even worse for instance based approaches, because the test ontologies need also to be &amp;quot;filled&amp;quot; with instance documents. Also, we wanted to test our algorithms with &amp;quot;real world&amp;quot; data, rather than toy examples.</Paragraph>
      <Paragraph position="2"> We were able to collect suitable test data starting from the ontologies published by the Ontology Alignment Evaluation Initiative 2005 (Euzenat et al., 2005). A section of their data contained an OWL representation of fragments of the Google, Yahoo, and LookSmart web directories. We &amp;quot;reverse engineered&amp;quot; some of these fragments, in order to reconstruct two consistent trees, one representing part of the Google directory structure, the other representing part of the LookSmart hierarchy. The leaf nodes of these trees were filled with instances downloaded from the web pages classified by the appropriate directories. With this method, we were able to fill 7 nodes of each ontology with 10 documents per node, for a total of 140 documents. Each document came from a distinct web page, so there was no overlap in the data to be compared. A graphical representation of our two test ontologies, source and target, is shown in Figure 6. The darker outlined nodes are those filled with instance documents. For the sake of readability, the names of the nodes corresponding to real matches are the same. Of course, this information is not used by our algorithms, which adopt a purely instance based approach. Figure 6 also reports the size of the instance documents associated to each node: total number of words, noun tokens, nouns, and nouns covered by WordNet.</Paragraph>
    </Section>
    <Section position="2" start_page="54" end_page="54" type="sub_section">
      <SectionTitle>
3.2 Parameters
</SectionTitle>
      <Paragraph position="0"> The experiments have been run with several combinations of the relevant parameters: number of instance documents per node (5 or 10), algorithm (1 to 4), extracted parts of speech (nouns, verbs, or both), hypernym level (an integer greater than or equal to zero), hypernym factor (a real number), and context length (an integer greater than or equal to zero). Not all of the parameters are applicable to every algorithm. The total number of runs was 90.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="54" end_page="54" type="metho">
    <SectionTitle>
4 Results
</SectionTitle>
    <Paragraph position="0"> Each run of the system with our test ontologies produced a set of 49 values, representing the matching score of every pair of nodes containing instances across the two ontologies. Selected examples of these results are shown in Tables 1, 2, 3, and 4. In the experiments shown in those tables, 10 instance documents for each node were used to compute the signatures. Nodes that actually match (identified by the same label, e.g., &amp;quot;Canada&amp;quot; and &amp;quot;Canada&amp;quot;) should show high similarity scores, whereas nodes that do not match (e.g., &amp;quot;Canada&amp;quot; and &amp;quot;Dendrochronology&amp;quot;), should have low scores. Better algorithms would have higher scores for matching nodes, and lower score for non-matching ones. Notice that the two nodes &amp;quot;Egypt&amp;quot; and &amp;quot;Pyramid Theories,&amp;quot; although intuitively related, have documents that take different perspectives on the subject. So, the algorithms correctly identify the nodes as being different.</Paragraph>
    <Paragraph position="1"> Looking at the results in this form makes it difficult to precisely assess the quality of the algorithms. To do so, a statistical analysis has to be performed. For each table of results, let us partition the scores in two distinct sets: set A, containing the scores of the pairs of nodes that actually match, and set B, containing the scores of the non-matching pairs. With our test data, we would have 6 values in set A and 43 values in set B. Then, let us compute the average and standard deviation of the values included in each set. The average of A represents the expected score that the system would assign to a match; likewise, the average of B is the expected score of a non-match. We define the following measure to compare the performance of our matching algorithms, inspired by &amp;quot;effect size&amp;quot; from (VanLehn et al., 2005): discrimination size = (avg(A) - avg(B)) / (stdev(A) + stdev(B)). Higher discrimination values mean that the scores assigned to matches and non-matches are further &amp;quot;apart,&amp;quot; making it possible to use those scores to make more reliable decisions about the matching degree of pairs of nodes.</Paragraph>
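The measure can be computed directly from the two score sets; a minimal Python sketch:

```python
import statistics

def discrimination_size(match_scores, nonmatch_scores):
    """Discrimination size: (avg(A) - avg(B)) / (stdev(A) + stdev(B)),
    where A holds the scores of true matches and B the scores of
    non-matching pairs.  Both sets need at least two values for the
    sample standard deviation to be defined."""
    numerator = (statistics.mean(match_scores)
                 - statistics.mean(nonmatch_scores))
    denominator = (statistics.stdev(match_scores)
                   + statistics.stdev(nonmatch_scores))
    return numerator / denominator
```

With the test data above, A would hold the 6 matching-pair scores and B the 43 non-matching scores of one results table.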
    <Paragraph position="2"> Table 5 shows the values of discrimination size (last column) out of selected results from our experiments. The algorithm used is reported in the first column, and the values of the other relevant parameters are indicated in other columns. We can make the following observations.</Paragraph>
    <Paragraph position="3"> * Algorithms 2, 3, and 4 generally outperform the baseline (algorithm 1).</Paragraph>
    <Paragraph position="4"> * Algorithm 2 (Noun signature), which still uses a fairly simple and purely syntactical technique, shows a substantial improvement.</Paragraph>
    <Paragraph position="5"> * Algorithm 3 (WordNet signature), which introduces some additional level of semantics, has even better performance.</Paragraph>
    <Paragraph position="6"> * In algorithms 3 and 4, hypernym expansion looks detrimental to performance. In fact, the best results are obtained with hypernym level equal to zero (no hypernym expansion).</Paragraph>
    <Paragraph position="7"> * The word sense disambiguator implemented in algorithm 4 does not help. Even though disambiguating with some limited context (1 word before and 1 word after) provides slightly better results than choosing the first available sense for a word (context length equal to zero), the overall results are worse than adding all the possible senses to the signature (algorithm 3).</Paragraph>
    <Paragraph position="8"> * Using only 5 documents per node significantly degrades the performance of all the algorithms (see the last 5 lines of the table).</Paragraph>
  </Section>
  <Section position="7" start_page="54" end_page="57" type="metho">
    <SectionTitle>
5 Conclusions and future work
</SectionTitle>
    <Paragraph position="0"> The results of our experiments point out several research questions and directions for future work,  some more specific and some more general. As regards the more specific issues, * Algorithm 2 does not perform morphological processing, whereas Algorithm 3 does. How much of the improved effectiveness of Algorithm 3 is due to this fact? To answer this question, Algorithm 2 could be enhanced to include a morphological processor.</Paragraph>
    <Paragraph position="1"> * The effectiveness of Algorithms 3 and 4 may be hindered by the fact that many words are not yet included in the WordNet database (see Figure 6). Falling back on to Algorithm 2 proved not to be a solution. The impact of the incompleteness of the lexical resource should be investigated and assessed more precisely.</Paragraph>
    <Paragraph position="2"> Another avenue of research may be to exploit different thesauri, such as the ones automatically derived as in (Curran and Moens, 2002). * The performance of Algorithm 4 might be improved by using more sophisticated word sense disambiguation methods. It would also be interesting to explore the application of the unsupervised method described in (Mc-Carthy et al., 2004).</Paragraph>
    <Paragraph position="3"> As regards our long term plans, first, structural properties of the ontologies could potentially be exploited for the computation of node signatures. This kind of enhancement would make our system move from a purely instance based approach to a combined hybrid approach based on schema and instances.</Paragraph>
    <Paragraph position="4"> More fundamentally, we need to address the lack of appropriate, domain specific resources that can support the training of algorithms and models appropriate for the task at hand. WordNet is a very general lexicon that does not support domain specific vocabulary, such as that used in geosciences or in medicine or simply that contained in a subontology that users may define according to their interests. Of course, we do not want to develop by hand domain specific resources that we have to change each time a new domain arises.</Paragraph>
    <Paragraph position="5"> The crucial research issue is how to exploit extremely scarce resources to build efficient and effective models. The issue of scarce resources makes it impossible to use methods that are successful at discriminating documents based on the words they contain but that need large corpora for training, for example Latent Semantic Analysis (Landauer et al., 1998). The experiments described in this paper could be seen as providing a bootstrapped model (Riloff and Jones, 1999; Ng and Cardie, 2003)--in ML, bootstrapping requires seeding the classifier with a small number of well chosen target examples. We could develop a web spider, based on the work described in this paper, to automatically retrieve larger amounts of training and test data, which in turn could be processed with more sophisticated NLP techniques.</Paragraph>
  </Section>
  <Section position="8" start_page="57" end_page="57" type="metho">
    <SectionTitle>
Acknowledgements
</SectionTitle>
    <Paragraph position="0"> This work was partially supported by NSF Awards IIS-0133123, IIS-0326284, IIS-0513553, and</Paragraph>
  </Section>
</Paper>