<?xml version="1.0" standalone="yes"?>
<Paper uid="E06-1003">
  <Title>Weakly Supervised Approaches for Ontology Population</Title>
  <Section position="3" start_page="17" end_page="17" type="metho">
    <Paragraph position="0"> Section 4 introduces Syntactic Network, a formalism used for the representation of syntactic information and exploited in both the Class-Word and the Class-Example approaches. Section 5 reports on the experimental settings and results obtained, and discusses the three approaches. Section 6 concludes the paper and suggests directions for future work.</Paragraph>
  </Section>
  <Section position="4" start_page="17" end_page="17" type="metho">
    <SectionTitle>
2 Related Work
</SectionTitle>
    <Paragraph position="0"> There are two main paradigms distinguishing Ontology Population approaches. In the first one Ontology Population is performed using patterns (Hearst, 1998) or relying on the structure of terms (Velardi et al., 2005). In the second paradigm the task is addressed using contextual features (Cimiano and V&amp;quot;olker, 2005).</Paragraph>
    <Paragraph position="1"> Pattern-based approaches search for phrases which explicitly show that there is an &amp;quot;is-a&amp;quot; relation between two words, e.g. &amp;quot;the ant is an insect&amp;quot; or &amp;quot;ants and other insects&amp;quot;. However, such phrases do not appear frequently in a text corpus. Forthisreason, someapproachesusetheWeb (Schlobach et al., 2004). (Velardi et al., 2005) experimented several head-matching heuristics according to which if a term1 is in the head of term2, then there is an &amp;quot;is-a&amp;quot; relation between them: For example &amp;quot;Christmas tree&amp;quot; is a kind of &amp;quot;tree&amp;quot;.</Paragraph>
    <Paragraph position="2"> Context feature approaches use a corpus to extract features from the context in which a semantic class tends to appear. Contextual features may be superficial (Fleischman and Hovy, 2002) or syntactic (Lin, 1998a), (Almuhareb and Poesio, 2004). Comparative evaluation in (Cimiano and V&amp;quot;olker, 2005) shows that syntactic features lead to better performance. Feature weights can be calculated either by Machine Learning algorithms (Fleischman and Hovy, 2002) or by statistical measures, like Point Wise Mutual Information or the Jaccard coefficient (Lin, 1998a).</Paragraph>
    <Paragraph position="3"> A hybrid approach using both pattern-based, term structure, and contextual feature methods is presented in (Cimiano et al., 2005).</Paragraph>
    <Paragraph position="4"> State-of-the-art approaches may be divided in two classes, according to different use of training data: Unsupervised approaches (see (Cimiano et al., 2005) for details) and supervised approaches which use manually tagged training data, e.g. (Fleischman and Hovy, 2002). While state-of-the-art unsupervised methods have low performance, supervised approaches reach higher accuracy, but require the manual construction of a training set, which impedes them from large scale applications.</Paragraph>
  </Section>
  <Section position="5" start_page="17" end_page="19" type="metho">
    <SectionTitle>
3 Weakly supervised approaches for Ontology Population
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="17" end_page="18" type="sub_section">
      <SectionTitle>
Ontology Population
</SectionTitle>
      <Paragraph position="0"> In this Section we present three Ontology Population approaches. Two of them are unsupervised:  (i) a pattern-based approach described in (Hearst, 1998), which we refer to as Class-Pattern and (ii) a feature similarity method reported in (Cimiano and V&amp;quot;olker, 2005) to which we will refer as Class-Word. Finally, we describe a new weakly supervised approach for ontology population which accepts as a training data lists of instances for each class under consideration. This method we call Class-Example.</Paragraph>
    </Section>
    <Section position="2" start_page="18" end_page="18" type="sub_section">
      <SectionTitle>
3.1 Class-Pattern approach
</SectionTitle>
      <Paragraph position="0"> This approach was described first in (Hearst, 1998). The main idea is that if a term t belongs to a class c, then in a text corpus we may expect the occurrence of phrases like such c as t,.... In our experiments for ontology population we used the patterns described in the Hearst's paper plus the pattern t is (a  |the) c:  1. t is (a  |the) c 2. such c as t 3. such c as (NP,)*, (and  |or) t 4. t (,NP)* (and  |or) other c 5. c, (especially  |including) (NP, )* t</Paragraph>
      <Paragraph position="1"> For each instance t from the test set and for each concept c, we instantiated the patterns and searched for them in the corpus. If a pattern instantiated with a concept c and a term t appears in the corpus, then we assume that t belongs to c.</Paragraph>
      <Paragraph position="1"> For example, if the term to be classified is &amp;quot;Etna&amp;quot; and the concept is &amp;quot;mountain&amp;quot;, one of the instantiated patterns will be &amp;quot;mountains such as Etna&amp;quot;; if this pattern is found in the text, then &amp;quot;Etna&amp;quot; is considered to be a &amp;quot;mountain&amp;quot;. If the algorithm assigns a term to several categories, we choose the one which co-occurs most often with the term.</Paragraph>
    </Section>
    <Section position="4" start_page="18" end_page="19" type="sub_section">
      <SectionTitle>
3.2 Class-Word approach
</SectionTitle>
      <Paragraph position="0"> (Cimiano and V&amp;quot;olker, 2005) describes an unsupervised approach for ontology population based on vector-feature similarity between each concept c and a term to be classified t. For example, in order to conclude how much &amp;quot;Etna&amp;quot; is an appropriate instance of the class &amp;quot;mountain&amp;quot;, this method finds the feature-vector similarity between the word &amp;quot;Etna&amp;quot; and the word &amp;quot;mountain&amp;quot;. Each instance from the test set T is assigned to one of the classes in the set C. Features are collected from Corpus and the classification algorithm on</Paragraph>
      <Paragraph position="2"> figure 1 is applied. The problem with this approach is that the context distribution of a name (e.g. &amp;quot;Etna&amp;quot;) is sometimes different than the context distribution of the class word (e.g. &amp;quot;mountain&amp;quot;). Moreover, a single word provides a limited quantity of contextual data.</Paragraph>
      <Paragraph position="3"> In this algorithm the context vectors vt and vc are feature vectors whose elements represent weighted context features from Corpus of the term t (e.g. &amp;quot;Etna&amp;quot;) or the concept word c (e.g. &amp;quot;mountain&amp;quot;). Cimiano and V&amp;quot;olker evaluate different context features and prove that syntactic features work best. Therefore, in our experimental settings we considered only such features extracted from a corpus parsed with a dependency parser. Unlike the original approach which relies on pseudo-syntactic features, we used features extracted from dependency parse trees. Moreover, we used virtually all the words connected syntactically to a term, not only the modifiers. A syntactic feature is a pair: (word, syntactic relation) (Lin, 1998a). We use two feature types: First order features, which are directly connected to the training or test examples in the dependency parse trees of Corpus; second order features, which are connected to the training or test instances indirectly byskipping one word(the verb)in the dependency tree. As an example, let's consider two sentences: &amp;quot;Edison invented the phonograph&amp;quot; and &amp;quot;Edison created the phonograph&amp;quot;. If &amp;quot;Edison&amp;quot; is a name to be classified, then two first order features of this name exist - (&amp;quot;invent&amp;quot;, subject-of) and (&amp;quot;create&amp;quot;, subject-of). One second order feature can be extracted - (&amp;quot;phonograph&amp;quot;, object-of+subject); it co-occurs two times with the word &amp;quot;Edison&amp;quot;. In our experiments second order features are considered only those words which are governed by the same verb whose subject is the name which is a training  or test instance (in this example &amp;quot;Edison&amp;quot;).</Paragraph>
    </Section>
    <Section position="5" start_page="19" end_page="19" type="sub_section">
      <SectionTitle>
3.3 Weakly Supervised Class-Example
Approach
</SectionTitle>
      <Paragraph position="0"> The approach we put forward here uses the same processing stages as the one presented in Figure 1 and relies on syntactic features extracted from a corpus. However, the Class-Example algorithm receives as an additional input parameter the sets of training examples for each class c [?] C. These training sets are simple lists of instances (i.e. terms denoting Named Entities), without context, and can be acquired automatically or semi-automatically from an existing ontology or gazetteer. To facilitate their acquisition, the Class-Example approach imposes no restrictions to the training examples - they can be ambiguous and have different frequencies. However, they have to appear in Corpus (in our experimental settings - at least twice). For example, for the class &amp;quot;mountain&amp;quot; training examples are: &amp;quot;Everest&amp;quot;, &amp;quot;Mauna Loa&amp;quot;, etc.</Paragraph>
      <Paragraph position="1"> The algorithm learns from each training set Train(c) asinglefeaturevectorvc, calledthesyntactic model of the class. Therefore, in our algorithm, the statement</Paragraph>
      <Paragraph position="3"> in Figure 1 is substituted with vc = getSyntacticModel(Train(c),Corpus). For each class c, a set of syntactic features F(c) are collected by finding the union of the features extracted from each occurrence in the corpus of each training instance in Train(c). Next, the feature vector vc is constructed: If a feature is not present in F(c), then its corresponding coordinate in vc has value 0; otherwise, it has a value equal to the feature weight.</Paragraph>
      <Paragraph position="4"> The weight of a feature fc [?] F(c) is calculated in three steps:  1. First, the co-occurrence of fc with the train-</Paragraph>
      <Paragraph position="6"> where P(fc,t) is the probability that feature fc co-occurs with t, P(fc) and P(t) are the probabilities that fc and t appear in the corpus, a = 14 for syntactic features with lexical element noun and a = 1 for all the other syntactic features. The a parameter reflects  thelinguisticintuitionthatnounsaremoreinformative than verbs and adjectives which in most cases represent generic predicates. The values of a were automatically learned from the training data.</Paragraph>
      <Paragraph position="7"> 2. We normalize the feature weights, since we observed that they vary a lot between different classes: for each class c we find the feature with maximal weight and denote its</Paragraph>
      <Paragraph position="9"> Next, the weight of each feature fc [?] F(c) is normalized by dividing it with mxW(c):</Paragraph>
      <Paragraph position="11"> 3. To obtain the final weight of fc, we divide weightN(fc) by the number of classes in  which this feature appears. This is motivated by the intuition that a feature which appears in the syntactic models of many classes is not a good class predictor.</Paragraph>
      <Paragraph position="13"> where|Classes(fc)|is the number of classes for whichfc is present in the syntactic model.</Paragraph>
      <Paragraph position="14"> As shown in Figure 1, the classification uses a similarity function sim(vt,vc) whose arguments are the feature vector of a term vt and the feature vector vc for a class c. We defined the similarity function as the dot product of the two feature vectors: sim(vt,vc) = vc.vt. Vectors vt are binary (i.e. the feature value is 1 if the feature is present and, 0-otherwise), while the features in the syntactic model vectors vc receive weights according to the approach described in this Section.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="19" end_page="21" type="metho">
    <SectionTitle>
4 Representing Syntactic Information
</SectionTitle>
    <Paragraph position="0"> Since both the Class-Word and the Class-Example methods work with syntactic features, the main source of information is a syntactically parsed corpus. We parsed about half a gigabyte of a news corpus with MiniPar (Lin, 1998b). It is a statistically based dependency parser which is reported to reach 89% precision and 82% recall on press reportage texts. MiniPar generates syntactic dependency structures - directed labeled graphs whose  vertices represent words and the edges between them represent syntactic relations like subject, object, modifier, etc. Examples for two dependency structures - g1 and g2, are shown in Figure 2: They represent the sentences &amp;quot;John loves Mary&amp;quot; and &amp;quot;John loves Jane&amp;quot;; labels s and o on their edges stand for subject and object respectively.</Paragraph>
    <Paragraph position="1"> The syntactic structures generated by MiniPar are dendroid(tree-like), butstillcyclesappearinsome cases.</Paragraph>
    <Paragraph position="2"> In order to extract information from the parsed corpus, we had to choose a model for representing dependency trees which allows to search efficiently for syntactic structures and to calculate their frequencies. Building a classic index at word levelwasnotanoption,sincewehavetosearchfor syntactic structures, not words. On the other hand, indexing syntactic relations (i.e. word pair and the relation between the words) would be useful, but still does not resolve the problem, since in many cases we search for more complex structures than a relation between two words: for example, when we have to find which words are syntactically related to a Named Entity composed by two words, we have to search for syntactic structures which consists of three vertices and two edges.</Paragraph>
    <Paragraph position="3"> In order to trace efficiently more complex structures in the corpus, we put forward a model for representation of a set of labeled graphs, called Syntactic Network (SyntNet for short). The model is inspired by a model presented earlier in (Szpektor et al., 2004), however our model allows more efficient construction of the representation. The scope of SyntNet is to represent a set of labeled graphs through one aggregate structure in which the isomorphic sub-structures overlap. When SyntNet represents a syntactically parsed text corpus, its vertices are labeled with words from the text while edges represent syntactic relations from the corpus and are labeled accordingly.</Paragraph>
    <Paragraph position="4"> An example is shown in Figure 2, where two syntactic graphs, g1 and g2, are merged into one aggregate representation SyntNet(g1,g2).</Paragraph>
    <Paragraph position="5"> Vertices labeled with equal words in g1 and g2 are merged into one generalizing vertex in SyntNet(g1,g2). For example, the vertices with labelJohning1 andg2 aremergedintoonevertex John in SyntNet(g1,g2).</Paragraph>
    <Paragraph position="6"> Edges are merged in a similar way: (loves,John) [?] g1 and (loves,John) [?] g2 are represented through one edge (loves,John) in SyntNet(g1,g2).</Paragraph>
    <Paragraph position="7"> Each vertex in g1 and g2 is labeled additionally with a numerical index which is unique for the graph set. Numbers on vertices in SyntNet(g1,g2) show which vertices from g1 and g2 are merged in the corresponding SyntNet vertices. For example, vertex loves [?] SyntNet(g1,g2) has a set {1,4} which means that vertices 1 and 4 are merged in it. In a similar way the edge (loves,John) [?] SyntNet(g1,g2) is labeled with two pairs of indices (4,5) and (1,2), which shows that it represents two edges: the edge between vertices 4 and 5 and the edge between 1 and 2.</Paragraph>
    <Paragraph position="8"> Two properties of SyntNet are important: first isomorphic sub-structures from all the graphs represented via a SyntNet are mapped into one structure. This allows us to easily find all the occurrences of multiword terms and named entities. Second, using the numerical indices on vertices and edges, we can efficiently calculate which structures are connected syntactically to the training and test terms. As an example, let's try to calculate in which constructions the word &amp;quot;Mary&amp;quot; appears considering SyntNet in Figure 2. First, in SyntNet we can directly observe that there is the relation loves-Mary labeled with the pair 1 - 3 - therefore this relation appears once in the corpus. Next, tracing the numerical indices on the vertices and edges we can find a path from &amp;quot;Mary&amp;quot; to &amp;quot;John&amp;quot; through &amp;quot;loves&amp;quot;. The path passes through the following numerical indices: 3 - 1 - 2: this means that there is one appearance of the structure  &amp;quot;JohnlovesMary&amp;quot;inthecorpus,spanningthrough vertices 1,2, and 3. Such a path through the numerical indices cannot be found between &amp;quot;Mary&amp;quot; and &amp;quot;Jane&amp;quot; which means that they do not appear in the same syntactic construction in the corpus.</Paragraph>
    <Paragraph position="9"> SyntNet is built incrementally in a straightforward manner: Each new vertex or edge added to the network is merged with the identical vertex or edge, if such already existsin SyntNet. Otherwise, a new vertex or edge is added to the network. The time necessary for building a SyntNet is proportional to the number of the vertices and the edges in the represented graphs (and does not otherwise depend on their complexity).</Paragraph>
    <Paragraph position="10"> The efficiency of the SyntNet model when representing and searching for labeled structures makes it very appropriate for the representation of a syntactically parsed corpus. We used the properties of SyntNet in order to trace efficiently the occurrences of Named Entities in the parsed corpus, to calculate their frequencies, to find the syntactic features which co-occur with these Named Entities, as well as the frequencies of these cooccurrences. Moreover, the SyntNet model allowed us to extract more complex, second order syntactic features which are connected indirectly to the terms in the training and the test set.</Paragraph>
  </Section>
  <Section position="7" start_page="21" end_page="22" type="metho">
    <SectionTitle>
5 Experimental settings and results
</SectionTitle>
    <Paragraph position="0"> We have evaluated all the three approaches described in Section 3. The same evaluation settings were used for the three experiments. The source of features was a news corpus of about half a gigabyte. The corpus was parsed with MiniPar and a Syntactic Network representation was built from thedependencyparsetreesproducedbytheparser.</Paragraph>
    <Paragraph position="1"> Syntactic features were extracted from this SyntNet. null We considered two high-level Named Entity categories: Locations and Persons. For each of them five fine-grained sub-classes were taken into consideration. For locations: mountain, lake, river, city, and country; for persons: statesman, writer, athlete, actor, and inventor.</Paragraph>
    <Paragraph position="2"> For each class under consideration we created a test set of Named Entities using WordNet 2.0 and Internet sites like Wikipedia. For the Class-Example approach we also provided training data using the same resources. WordNet was the primarydatasourcefortrainingandtestdata. Theexamples from it were extracted automatically. We  proach.</Paragraph>
    <Paragraph position="3"> used Internet to get additional examples for some classes. To do this, we created automatic text extraction scripts for Web pages and manually filtered their output when it was necessary.</Paragraph>
    <Paragraph position="4"> Totally, the test data comprised 280 Named Entities which were not ambiguous and appeared at least twice in the corpus.</Paragraph>
    <Paragraph position="5"> For the Class-Example approach we provided a training set of 1194 names. The only requirement to the names in the training set was that they appear at least twice in the parsed corpus. They were allowed to be ambiguous and no manual post-processing or filtering was carried out on this data.</Paragraph>
    <Paragraph position="6"> For both context feature approaches (i.e. Class-Word and Class-Example), we used the same type of syntactic features and the same classification function, namely the one described in Section 3.3. This was done in order to compare better the approaches. null Results from the comparative evaluation are shown in Table 2. For each approach we measured macro average precision, macro average recall, macro average F-measure and micro average F; for Class-Word and Class-Example micro F is equaltotheoverallaccuracy, i.e. thepercentofthe instances classified correctly. The first row shows  the results obtained with superficial patterns. The second row presents the results from the Class-Word approach. The third row shows the results of our Class-Example method. The fourth line presents the results for the same approach but using second-order features for the person category. The Class-Pattern approach showed low performance, similar to the random classification, for which macro and micro F=10%. Patterns succeeded to classify correctly only instances of the classes &amp;quot;river&amp;quot; and &amp;quot;city&amp;quot;. For the class &amp;quot;city&amp;quot; the patterns reached precision of 100% and recall 65%; for the class &amp;quot;river&amp;quot; precision was high (i.e. 75%), but recall was 15%.</Paragraph>
    <Paragraph position="7"> The Class-Word approach showed significantly betterperformance(macroF=33%, microF=42%) than the Class-Pattern approach.</Paragraph>
    <Paragraph position="8"> The performance of the Class-Example (62% macro F and 65%-68% micro F) is much higher than the performance of Class-Word (29% increase in macro F and 23% in micro F). The last row of the table shows that second-order syntactic features augment further the performance of the Class-Example method in terms of micro average F (68% vs. 65%).</Paragraph>
    <Paragraph position="9"> A more detailed evaluation of the Class-Example approach is shown in Table 1. In this table we show the performance of the approach without the second-order features. Results vary between different classes: The highest F is measured for the class &amp;quot;country&amp;quot; - 89% and the lowest is for the class &amp;quot;inventor&amp;quot; - 18%. However, the class &amp;quot;inventor&amp;quot; is an exception - for all the other classes the F measure is over 50%. Another difference may be observed between the Location and Person classes: Our approach performs significantly better for the locations (68% vs. 57% macro F and 78% vs. 57% micro F). Although different classes had different number of training examples, we observed that the performance for a class does not depend on the size of its training set. Wethink,thatthevariationinperformancebetween categories is due to the different specificity of their textual contexts. As a consequence, some classes tend to co-occur with more specific syntactic features, while for other classes this is not true.</Paragraph>
    <Paragraph position="10"> Additionally, we measured the performance of our approach considering only the macrocategories &amp;quot;Location&amp;quot; and &amp;quot;Person&amp;quot;. For this purpose we did not run another experiment, we rather used the results from the fine-grained classification and grouped the already obtained classes. Results are shown in the last two rows of table 1: It turns out that the Class-Example method makes very well the difference between &amp;quot;location&amp;quot; and &amp;quot;person&amp;quot; - 90% of the test instances were classified correctly between these categories.</Paragraph>
  </Section>
class="xml-element"></Paper>