<?xml version="1.0" standalone="yes"?> <Paper uid="P05-1016"> <Title>Inducing Ontological Co-occurrence Vectors</Title> <Section position="3" start_page="125" end_page="125" type="metho"> <SectionTitle> 2 Relevant work </SectionTitle> <Paragraph position="0"> Our framework aims at enriching WordNet-like ontologies with syntactic features derived from a non-annotated corpus. Others have also made significant additions to WordNet. For example, in eXtended WordNet (Harabagiu et al. 1999), the rich glosses in WordNet are enriched by disambiguating the nouns, verbs, adverbs, and adjectives with synsets. Another work has enriched WordNet synsets with topically related words extracted from the Web (Agirre et al. 2001). While this method takes advantage of the redundancy of the web, our source of information is a local document collection, which opens the possibility for domain specific applications.</Paragraph> <Paragraph position="1"> Distributional approaches to building semantic repositories have shown remarkable power. The underlying assumption, called the Distributional Hypothesis (Harris 1985), links the semantics of words to their lexical and syntactic behavior. The hypothesis states that words that occur in the same contexts tend to have similar meaning. Researchers have mostly looked at representing words by their surrounding words (Lund and Burgess 1996) and by their syntactical contexts (Hindle 1990; Lin 1998). However, these representations do not distinguish between the different senses of words. Our framework utilizes these principles and representations to induce disambiguated feature vectors. We describe these representations further in Section 3.</Paragraph> <Paragraph position="2"> In supervised word sense disambiguation, senses are commonly represented by their surrounding words in a sense-tagged corpus (Gale et al. 1991). 
If we had a large collection of sense-tagged text, then we could extract disambiguated feature vectors by collecting co-occurrence features for each word sense. However, since there is little sense-tagged text available, the feature vectors for a random WordNet concept would be very sparse. In our framework, feature vectors are induced from much larger untagged corpora (currently 3GB of newspaper text).</Paragraph> <Paragraph position="3"> Another approach to building semantic repositories is to collect and merge existing ontologies. Attempts to automate the merging process have not been particularly successful (Knight and Luk 1994; Hovy 1998; Noy and Musen 1999). The principal problems of partial and unbalanced coverage and of inconsistencies between ontologies continue to hamper these approaches.</Paragraph> </Section> <Section position="4" start_page="125" end_page="126" type="metho"> <SectionTitle> 3 Resources </SectionTitle> <Paragraph position="0"> The framework we present in Section 4 propagates any type of lexical feature up an ontology.</Paragraph> <Paragraph position="1"> In previous work, lexicals have often been represented by proximity and syntactic features. Consider the following sentence: The tsunami left a trail of horror.</Paragraph> <Paragraph position="2"> In a proximity approach, a word is represented by a window of words surrounding it. For the above sentence, a window of size 1 would yield two features (-1:the and +1:left) for the word tsunami. In a syntactic approach, more linguistically rich features are extracted by using each grammatical relation in which a word is involved (e.g. the features for tsunami are determiner:the and subject-of:leave).</Paragraph> <Paragraph position="3"> For the purposes of this work, we consider the propagation of syntactic features. We used Minipar (Lin 1994), a broad coverage parser, to analyze text. 
We collected the statistics on the grammatical relations (contexts) output by Minipar and used these as the feature vectors. Following Lin (1998), we measure each feature f for a word e not by its frequency but by its pointwise mutual information, $mi_{ef} = \log \frac{P(e, f)}{P(e)\,P(f)}$, with the probabilities estimated from corpus counts.</Paragraph> <Paragraph position="5"/> </Section> <Section position="5" start_page="126" end_page="128" type="metho"> <SectionTitle> 4 Inducing ontological features </SectionTitle> <Paragraph position="0"> The resource described in the previous section yields lexical feature vectors for each word in a corpus. We term these vectors lexical because they are collected by looking only at the lexicals in the text (i.e. no sense information is used). We use the term ontological feature vector to refer to a feature vector whose features are for a particular sense of the word.</Paragraph> <Paragraph position="1"> In this section, we describe our framework for inducing ontological feature vectors for each node of an ontology. Our approach employs two phases. A divide-and-conquer algorithm first propagates syntactic features to each node in the ontology. A final sweep over the ontology, which we call the Coup phase, disambiguates the feature vectors of lexicals (leaf nodes) in the ontology.</Paragraph> <Section position="1" start_page="126" end_page="127" type="sub_section"> <SectionTitle> 4.1 Divide-and-conquer phase </SectionTitle> <Paragraph position="0"> In the first phase of the algorithm, we propagate features up the ontology in a bottom-up approach.</Paragraph> <Paragraph position="1"> Figure 1 gives an overview of this phase.</Paragraph> <Paragraph position="2"> The termination condition of the recursion is met when the algorithm processes a leaf node.</Paragraph> <Paragraph position="3"> The feature vector that is assigned to this node is an exact copy of the lexical feature vector for that leaf (obtained from a large corpus as described in Section 3). 
For example, for the two leaf nodes labeled chair in Figure 2, we assign to both the same ambiguous lexical feature vector, an excerpt of which is shown in Figure 3.</Paragraph> <Paragraph position="4"> When the recursion meets a non-leaf node, like chairwoman in Figure 2, the algorithm first recursively applies itself to each of the node's children. Then, the algorithm selects those features common to its children to propagate up to its own ontological feature vector. The assumption here is that features of other senses of polysemous words will not be propagated since they will not be common across the children. Below, we describe the two methods we used to propagate features: Shared and Committee.</Paragraph> <Paragraph position="5"> Shared propagation algorithm The first technique for propagating features to a concept node n from its children C is the simplest and scored best in our evaluation (see Section 5.2).</Paragraph> <Paragraph position="6"> Figure 1. Input: A node n and a corpus C. Step 1: Termination condition: If n is a leaf node then assign to n its lexical feature vector as described in Section 3.</Paragraph> <Paragraph position="7"> Step 2: Recursion Step: For each child c of n, recurse on c and C.</Paragraph> <Paragraph position="8"> Assign a feature vector to n by propagating features from its children.</Paragraph> <Paragraph position="9"> Output: A feature vector assigned to each node of the tree rooted by n.</Paragraph> <Paragraph position="10"> Figure 3. Excerpt of the lexical feature vector for the word chair. Grammatical relations are in italics (conjunction and nominal-subject). The first column of numbers gives frequency counts and the other gives mutual information scores. In bold are the features that intersect with the induced ontological feature vector for the parent concept of chair's chairwoman sense.</Paragraph> <Paragraph position="11"> The goal is that the feature vector for n represents the general grammatical behavior that its children will have. 
For example, for the concept node furniture in Figure 2, we would like to assign features like object-of:clean since most types of furniture can be cleaned. However, even though you can eat on a table, we do not want the feature on:eat for the furniture concept since we do not eat on mirrors or beds.</Paragraph> <Paragraph position="12"> In the Shared propagation algorithm, we propagate only those features that are shared by at least t children. We set t = min(3, |C|) experimentally.</Paragraph> <Paragraph position="13"> The frequency of a propagated feature is obtained by taking a weighted sum of the frequency of the feature across its children. Let $f_i$</Paragraph> <Paragraph position="15"> be the frequency of the feature for child i, let $c_i$</Paragraph> <Paragraph position="17"> be the total frequency of child i, and let N be the total frequency of all children. Then, the frequency f of the propagated feature is given by: $f = \sum_{i} f_i \cdot \frac{c_i}{N}$ (1)</Paragraph> <Paragraph position="19"> Committee propagation algorithm The second propagation algorithm finds a set of representative children from which to propagate features. Pantel and Lin (2002) describe an algorithm, called Clustering By Committee (CBC), which discovers clusters of words according to their meanings in text. The key to CBC is finding for each class a set of representative elements, called a committee, which most unambiguously describe the members of the class. For example, for the color concept, CBC discovers the following committee members: purple, pink, yellow, mauve, turquoise, beige, fuchsia. Words like orange and violet are avoided because they are polysemous. For a given concept c, we build a committee by clustering its children according to their similarity and then keeping the largest and most interconnected cluster (see Pantel and Lin (2002) for details).</Paragraph> <Paragraph position="20"> The propagated features are then those that are shared by at least two committee members. 
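As a minimal sketch of the two propagation rules described in this section, the following Python function keeps the features shared by at least t of the contributing vectors and computes their weighted frequencies. It assumes that the weighted sum takes the form f = sum over i of f_i * c_i / N, which is one reading of the description above. The names propagate and shared_threshold and the dictionary representation of feature vectors are hypothetical and serve only as illustration.

```python
from collections import Counter


def shared_threshold(children):
    """Threshold t = min(3, |C|) used by Shared propagation."""
    return min(3, len(children))


def propagate(children, t):
    """Propagate features shared by at least t children.

    children: list of dicts mapping a feature name to its frequency.
    Returns a dict mapping each propagated feature to its weighted frequency,
    assuming the weighted sum f = sum_i f_i * c_i / N.
    """
    totals = [sum(vec.values()) for vec in children]  # c_i for each child
    grand_total = sum(totals)                         # N
    if grand_total == 0:
        return {}
    # Number of children in which each feature occurs.
    support = Counter(f for vec in children for f in vec)
    propagated = {}
    for feature, n_children in support.items():
        if n_children >= t:
            propagated[feature] = sum(
                vec.get(feature, 0) * c_i / grand_total
                for vec, c_i in zip(children, totals)
            )
    return propagated
```

Under these assumptions, Shared propagation corresponds to propagate(children, shared_threshold(children)), and Committee propagation corresponds to passing only the committee members with t = 2.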
The frequency of a propagated feature is obtained using Eq. 1, where the children i are chosen only among the committee members.</Paragraph> <Paragraph position="21"> Generating committees using CBC works best for classes with many members. In its original application (Pantel and Lin 2002), CBC discovered a flat list of coarse concepts. In the finer-grained concept hierarchy of WordNet, there are far fewer children for each concept so we expect to have more difficulty finding committees.</Paragraph> </Section> <Section position="2" start_page="127" end_page="128" type="sub_section"> <SectionTitle> 4.2 Coup phase </SectionTitle> <Paragraph position="0"> At the end of the Divide-and-conquer phase, the non-leaf nodes of the ontology contain disambiguated features. We assume that polysemous words that are similar in one sense will be dissimilar in their other senses. Under the distributional hypothesis, similar words occur in the same grammatical contexts and dissimilar words occur in different grammatical contexts. We expect then that most features that are shared between such words will be the grammatical contexts of their similar sense. Hence, mostly disambiguated features are propagated up the ontology in the Divide-and-conquer phase. However, the feature vectors for the leaf nodes remain ambiguous (e.g. the feature vectors for both leaf nodes labeled chair in Figure 2 are identical). In this phase of the algorithm, leaf node feature vectors are disambiguated by looking at the parents of their other senses. Leaf nodes that are unambiguous in the ontology will have unambiguous feature vectors. For ambiguous leaf nodes (i.e. leaf nodes that have more than one concept parent), we apply the algorithm described in Figure 4. 
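The removal rule of Figure 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: feature vectors are represented as dictionaries, the function name coup is hypothetical, and the feature names used when calling it would be invented for the example rather than taken from Figure 3.

```python
def coup(leaf_features, own_parent, other_parents):
    """Disambiguate one sense of a polysemous leaf node.

    leaf_features: dict mapping feature name to weight for the ambiguous leaf.
    own_parent: collection of features of this sense's parent concept.
    other_parents: list of feature collections, one per parent concept of
        the leaf's other senses.
    A feature is removed when it occurs in some other sense's parent concept
    but not in this sense's own parent concept.
    """
    own = set(own_parent)
    foreign = set()
    for parent in other_parents:
        foreign |= set(parent)
    return {
        feature: weight
        for feature, weight in leaf_features.items()
        if feature in own or feature not in foreign
    }
```

Features that appear in neither parent vector are kept, since the rule only removes evidence that points to a different sense.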
Given a polysemous leaf node n, we remove from its ambiguous feature vector those features that intersect with the ontological feature vector of any of its other senses' parent concepts but that are not in its own parent's ontological feature vector. For example, consider the furniture sense of the leaf node chair in Figure 2. After the Divide-and-conquer phase, the node chair is assigned the ambiguous lexical feature vector shown in Figure 3. Suppose that chair has only one other sense in WordNet, which is the chairwoman sense illustrated in Figure 2. The features in bold in Figure 3 represent those features of chair that intersect with the ontological feature vector of chairwoman. In the Coup phase of our system, we remove these bold features from the furniture sense leaf node chair. What remains are features like &quot;chair and sofa&quot;, &quot;chair and cushion&quot;, &quot;Ottoman is a chair&quot;, and &quot;recliner is a chair&quot;. Similarly, for the chairwoman sense of chair, we remove those features that intersect with the ontological feature vector of the chair concept (the parent of the other chair leaf node).</Paragraph> <Paragraph position="1"> Figure 4. Input: A node n and the enriched ontology O output from the algorithm in Figure 1.</Paragraph> <Paragraph position="2"> Step 1: If n is not a leaf node then return.</Paragraph> <Paragraph position="3"> Step 2: Remove from n's feature vector all features that intersect with the feature vector of any of n's other senses' parent concepts, but are not in n's parent concept feature vector.</Paragraph> <Paragraph position="4"> Output: A disambiguated feature vector for each leaf node. Note: By disambiguated features, we mean that the features are co-occurrences with a particular sense of a word; the features themselves are not sense-tagged.</Paragraph> <Paragraph position="5"> As shown at the beginning of this section, concept node feature vectors are mostly unambiguous after the Divide-and-conquer phase. 
However, the Divide-and-conquer phase may be repeated after the Coup phase using a different termination condition. Instead of assigning to leaf nodes ambiguous lexical feature vectors, we use the leaf node feature vectors from the Coup phase. In our experiments, we did not see any significant performance difference by skipping this extra Divide-and-conquer step.</Paragraph> </Section> </Section> <Section position="6" start_page="128" end_page="130" type="metho"> <SectionTitle> 5 Experimental results </SectionTitle> <Paragraph position="0"> In this section, we provide a quantitative and qualitative evaluation of our framework.</Paragraph> <Section position="1" start_page="128" end_page="128" type="sub_section"> <SectionTitle> 5.1 Experimental Setup </SectionTitle> <Paragraph position="0"> We used Minipar (Lin 1994), a broad coverage parser, to parse two 3GB corpora (TREC-9 and TREC-2002). We collected the frequency counts of the grammatical relations (contexts) output by Minipar and used these to construct the lexical feature vectors as described in Section 3.</Paragraph> <Paragraph position="1"> WordNet 2.0 served as our testing ontology.</Paragraph> <Paragraph position="2"> Using the algorithm presented in Section 4, we induced ontological feature vectors for the noun nodes in WordNet using the lexical co-occurrence features from the TREC-2002 corpus. Due to memory limitations, we were only able to propagate features to one quarter of the ontology. We experimented with both the Shared and Committee propagation models described in Section 4.1.</Paragraph> </Section> <Section position="2" start_page="128" end_page="129" type="sub_section"> <SectionTitle> 5.2 Quantitative evaluation </SectionTitle> <Paragraph position="0"> To evaluate the resulting ontological feature vectors, we considered the task of attaching new nodes into the ontology. 
To evaluate this automatically, we randomly extracted a set of 1000 noun leaf nodes from the ontology and accumulated lexical feature vectors for them using the TREC-9 corpus (a different corpus from the one used to propagate features, but of the same genre). We experimented with two test sets: (1) Full: the 424 of the 1000 random nodes that existed in the TREC-9 corpus; and (2) Subset: the subset of Full where only nodes that do not have concept siblings are kept (380 nodes). For each random node, we computed the similarity of the node with each concept node in the ontology by computing the cosine of the angle (Salton and McGill 1983) between the lexical feature vector of the random node $e_i$</Paragraph> <Paragraph position="2"> and the ontological feature vector of each concept node $e_j$, that is, the dot product of the two vectors divided by the product of their norms.</Paragraph> <Paragraph position="4"> We only kept those similar nodes that had a similarity above a threshold s. We set s = 0.1 experimentally.</Paragraph> <Paragraph position="5"> Top-K accuracy We collected the top-K most similar concept nodes (attachment points) for each node in the test sets and computed the accuracy of finding a correct attachment point in the top-K list. Table 1 shows the result.</Paragraph> <Paragraph position="6"> We expected the algorithm to perform better on the Subset data set since only concepts that have exclusively lexical children must be considered for attachment. In the Full data set, the algorithm must consider each concept in the ontology as a potential attachment point. However, considering the top-5 best attachments, the algorithm performed equally well on both data sets.</Paragraph> <Paragraph position="7"> The Shared propagation algorithm consistently performed slightly better than the Committee method. As described in Section 4.1, committee building works best for concepts with many children. Since many nodes in WordNet have few direct children, the Shared propagation method is more appropriate. 
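The attachment procedure evaluated in this section can be sketched as follows. The sketch assumes that feature vectors are sparse dictionaries from feature names to weights; the function names cosine and top_k_attachments are hypothetical, and the default threshold mirrors the value s = 0.1 reported above.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse feature vectors (dicts)."""
    dot = sum(weight * v[feature] for feature, weight in u.items() if feature in v)
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_attachments(lexical_vec, concept_vecs, k, threshold=0.1):
    """Return up to k concept names ranked by similarity to lexical_vec.

    concept_vecs: dict mapping a concept name to its ontological feature vector.
    Only concepts whose similarity exceeds the threshold are kept.
    """
    scored = [(cosine(lexical_vec, vec), name) for name, vec in concept_vecs.items()]
    kept = [(score, name) for score, name in scored if score > threshold]
    # Highest similarity first; ties are broken alphabetically for determinism.
    kept.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in kept[:k]]
```

Top-K accuracy then amounts to checking, for each test node, whether any of the returned names is one of its true parent concepts.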
One possible extension of the Committee propagation algorithm is to find committee members from the full list of descendants of a node rather than only its immediate children.</Paragraph> </Section> <Section position="3" start_page="129" end_page="130" type="sub_section"> <SectionTitle> Precision and Recall </SectionTitle> <Paragraph position="0"> We computed the precision and recall of our system on varying numbers of returned attachments.</Paragraph> <Paragraph position="1"> Figure 5 and Figure 6 show the attachment precision and recall of our system when the maximum number of returned attachments ranges between 1 and 5. In Figure 5, we see that the Shared propagation method has better precision than the Committee method. Both methods perform similarly on recall. The recall of the system increases most dramatically when returning two attachments, without much loss of precision. The low recall when returning only one attachment is due both to system errors and to the fact that many nodes in the hierarchy are polysemous. In the next section, we discuss further experiments on polysemous nodes. Figure 6 illustrates the large difference in both precision and recall when using the simpler Subset data set. All 95% confidence bounds in Figure 5 and Figure 6 range between ±2.8% and ±5.3%.</Paragraph> <Paragraph position="2"> Polysemous nodes Of the nodes in the Full data set, 84 are polysemous (they are attached to more than one concept node in the ontology). On average, these nodes have 2.6 senses for a total of 219 senses. Figure 7 compares the precision and recall of the system on all nodes in the Full data set vs. the 84 polysemous nodes. The 95% confidence intervals range between ±3.8% and ±5.0% for the Full data set and between ±1.2% and ±9.4% for the polysemous nodes. 
The precision on the polysemous nodes is consistently better since these have more possible correct attachments.</Paragraph> <Paragraph position="3"> Clearly, when the system returns at most one or two attachments, the recall on the polysemous nodes is lower than on the Full set. However, it is interesting to note that recall on the polysemous nodes equals the recall on the Full set after K=3.</Paragraph> </Section> <Section position="4" start_page="130" end_page="130" type="sub_section"> <SectionTitle> 5.3 Qualitative evaluation </SectionTitle> <Paragraph position="0"> Inspection of errors revealed that the system often makes plausible attachments. Table 2 shows some example errors generated by our system.</Paragraph> <Paragraph position="1"> The system attached the word arsenic to the concept trioxide, which is the parent of the correct attachment.</Paragraph> <Paragraph position="2"> The system results may be useful to help validate the ontology. For example, the system attached the word law to the regulation (as an organic process) and ordinance (legislative act) concepts. According to WordNet, law has seven possible attachment points, none of which is a legislative act. Hence, the system has found that in the TREC-9 corpus, the word law has a sense of legislative act. Similarly, the system discovered the symptom sense of vomiting.</Paragraph> <Paragraph position="3"> The system discovered a potential anomaly in WordNet with the word slob. The system classified slob as follows: fool → simpleton → someone, whereas WordNet classifies it as: vulgarian → unpleasant person → unwelcome person → someone. This output could be used to verify whether fool should link into the unpleasant person subtree. Capitalization is not very trustworthy in large collections of text. One of our design decisions was to ignore the case of words in our corpus, which in turn caused some errors since WordNet is case-sensitive. 
For example, the lexical node Munch (Norwegian artist) was attached to the munch concept (food) by error because our system accumulated all features of the word Munch in text regardless of its capitalization.</Paragraph> </Section> </Section> <Section position="7" start_page="130" end_page="131" type="metho"> <SectionTitle> 6 Discussion </SectionTitle> <Paragraph position="0"> One question that remains unanswered is how clean an ontology must be in order for our methodology to work. Since the structure of the ontology guides the propagation of features, a very noisy ontology will result in noisy feature vectors. However, the framework is tolerant to some amount of noise and can in fact be used to correct some errors (as shown in Section 5.3).</Paragraph> <Paragraph position="1"> We showed in Section 1 how our framework can be used to disambiguate lexical-semantic resources like hyponym lists, verb relations, and unknown words or terms. Other avenues of future work include: Adapting/extending existing ontologies It takes a large amount of time to build resources like WordNet. However, adapting existing resources to a new corpus might be possible using our framework. Once we have enriched the ontology with features from a corpus, we can rearrange the ontological structure according to the inter-conceptual similarity of nodes. For example, consider the word computer in WordNet, which has two senses: a) a machine; and b) a person who calculates. In a computer science corpus, sense b) occurs very infrequently and possibly a new sense of computer (e.g. a processing chip) occurs. A system could potentially remove sense b) since the similarity of the other children of b) and computer is very low. It could also uncover the new processing chip sense by finding a high similarity between computer and the processing chip concept.</Paragraph> <Paragraph position="2"> Validating ontologies This is a holy grail problem in the knowledge representation community. 
As a small step, our framework can be used to flag potential anomalies to the knowledge engineer.</Paragraph> <Paragraph position="3"> What makes a chair different from a recliner? Given an enriched ontology, we can remove from the feature vectors of chair and recliner those features that occur in their parent furniture concept. The features that remain describe their different syntactic behaviors in text.</Paragraph> <Paragraph position="4"> Figure 7. Attachment precision and recall on the Full set vs. the polysemous nodes in the Full set when the system returns at most K attachments.</Paragraph> </Section> </Paper>