<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0211">
  <Title>Towards a Bootstrapping Framework for Corpus Semantic Tagging</Title>
  <Section position="4" start_page="0" end_page="69" type="metho">
    <SectionTitle>
2 Semantically-driven Induction of
Lexical Information
</SectionTitle>
    <Paragraph position="0"> Several phenomena have been (more or less successfully) modeled in the LA literature:  * Acquisition of word taxonomies from a corpus by means of syntactic (Hindle,1990) (Pereira et al.,1993) as well semantic (Basili et al.,1993a,1996) evidence * Probability driven PP-disambignation (Hindle and Rooths,1993), (Basili et ai.,1993c), (Brill and Resnik,1994) ,(Resnik,1995), (Frank,1995), (Collins and Brooks,1995). Some of these methods rely on semantic classes in order to improve robustness.</Paragraph>
    <Paragraph position="1"> * Verb Argument Structure derivation. Many selectional constraints in argumental information have a semantic nature (e.g. +- animate), like in</Paragraph>
    <Paragraph position="3"> Semantic tagging is thus crucial to all the above activities. We propose the following strategy:  1. Tune a predefined (general) classificatory framework using as source an untagged corpus 2. Tag the corpus within the defined model eventually adjusting some of the tuning choices; 3. Use tagged (i.e. semantically typed) contexts to derive a variety of lexical information (e.g. verb argument structures, PP-disambignation rules, ooo) The design of the overall process requires a set of modeling principles: 1. to focus on the suitable tag system 2. to customize the classification to a corpus 3. to tag the corpus correspondingly.</Paragraph>
    <Paragraph position="4"> 3 Tuning a Classification FrAmework to a Domain  The wide-spectrum classification adopted within WordNet is very useful on a purely linguistic ground, but creates unacceptable noise in NLP applications. In a corpus on Remote Sensing (RSD) 2, for example, we computed an average ambiguity of 4,76 senses (i.e. Wordnet syasets). Table 1 counts the WN synsets of some of the most ambiguous verbs found in our RSD corpus.</Paragraph>
    <Paragraph position="5"> Several problems are tackled when a domain driven approach is used. First, ambiguity of words 2The t11.1ng phase has been evaluated over different corpora but results will be discussed over a collection -of publications on Remote Sensing, sized about 350.000 words.</Paragraph>
    <Paragraph position="6"> Table h RSD verbs with the highest initial polysemy</Paragraph>
    <Paragraph position="8"> is reduced in a specific domain, and enumeration of all their senses is unnecessary. Second, some words function as sense primers for others. Third, raw contexts of words provide a significant bundle of information able to guide disambignation. Applying semantic disambignation as soon as possible is useful to improve later LA and other linguistic tasks. Our aim is thus to provide a systematic bootstrapping framework in order to: * Assign sense tags to words * Induce class-based models from the source corpus null * Use the class-based modesl (that have a semantic nature) within a NLP application.</Paragraph>
    <Paragraph position="9"> The implemented system, called GODoT (General purpose Ontology Disambignation and Tuning), has two main components: a classifier C-GODoT, that tunes WordNet to a given domain, and WSD-GODoT that locally disambignates and tags the source corpus contexts. The Lexical Knowledge base (i.e. WordNet) and the (POS tagged) source corpus axe used to select relevant words in each semantic class. The resulting classification is more specific to the sublanguage as the exhaustive enumeration of general-purpose word senses has been tackled, and potential new senses have been introduced. The tuned hierarchy is then used to guide the disambiguation task over local contexts thus producing a final sense tagged corpus from the source data.</Paragraph>
    <Paragraph position="10"> Class-based models can be derived according to the tags appropriate in the corpus and used to derive lexical information according to generalized collocations. null</Paragraph>
    <Section position="1" start_page="66" end_page="67" type="sub_section">
      <SectionTitle>
3.1 A semantic tag system for nouns and verbs
</SectionTitle>
      <Paragraph position="0"> verbs Experimentally it has been observed that sense definitions in dictionaries might not capture the domain specific use of a verb (Basili et al, 1995). This strongly motivated our approach mainly based on the assumption that the corpus itself, rather than dictionary definitions, could be used to derive disambiguation hints. One such approach is undertaken in  (Yarowsky 1992), which inspired our tuning method, although objectives and methods of our classifier (C-GODoT) are slightly different.</Paragraph>
      <Paragraph position="1"> First, the aim is to tune an e~isting word hierarchy to an application domain, rather than selecting the best category for a word occurring in a context. Second, since the training is performed on an unbalanced corpus (and also for verbs, that notoriously exhibit more fuzzy contexts), we introduced local techniques to reduce spurious contexts and improve reliability.</Paragraph>
      <Paragraph position="2"> Third, since we expect also domain-specific senses for a word, during the classification phase we do not make any initial hypothesis on the subset of consistent categories of a word.</Paragraph>
      <Paragraph position="3"> Finally, we consider globally all the contexts in which a given word is encountered in a corpus, and compute a (domain-specific) probability distribution over its expected senses (i.e. hierarchy nodes) A domain specific semantics is obtained through the selection of the suitable high level synsets in the Wordnet hierarchy. A different methodological choice is required for verbs and nouns. As, Word-Net hyperonimy hierarchy is rather bushy and disomogeneous, we considered inappropriate, as initial classification, the WordNet lot~est level synsets. A more efficient choice is selecting the topmost synsets, called unique beginners, thus eliminating branches of the hierarchy, rather than leaves. This is reasonable for nouns, (only 25 unique beginners), but it seems still inappropriate for verbs, that have hundreds of unique beginners (about 208). We hence decided to adopt as initial classification for verbs the 15 semanticaliy distinct categories (verb semantic fields) in WordNet. The average ambiguity of verbs among these categories is 3.5 for our sample in the RSD. A similar value is the ambiguity of nouns in the set of their unique beginners. The first columns in Tables 2 and 3 report the semantic classes for nouns and verbs.</Paragraph>
    </Section>
    <Section position="2" start_page="67" end_page="68" type="sub_section">
      <SectionTitle>
3.2 Tuning verbs and nouns
</SectionTitle>
      <Paragraph position="0"> Given the above reference tag system, our method works as follows: * Step 1. Select the most typical words in each category; * Step 2. Acquire the collective contexts of these words and use them as a (distributional) description of each category; * Step 3. Use the distributional descriptions to evaluate the (corpus-dependent) membership of each word to the different categories.</Paragraph>
      <Paragraph position="1">  Step 1 is carried out detecting the more significant (and less ambiguous) words in any class (semantic fields of verbs and unique beginners for nouns): any of these sets is called kernel of the corresponding class. Rather than training the classifier on all the verbs or noun in the learning corpus, we select only a subset of prototypical words for each category. We call these words w the salient words of a category C. We define the typicality Tw(C) of w in C, as:</Paragraph>
      <Paragraph position="3"> where: N~ is the total number of synsets of a word w, i.e. all the WordNet synonymy sets including w. N .c is the number of synsets of w that belong to the semantic category C, i.e. synsets indexed with C in WordNet. null The typicality depends only on WordNet. A typical verb for a category C is one that is either non ambiguously assigned to C in WordNet, or that has most of its senses (syneets) in C.</Paragraph>
      <Paragraph position="4"> The synonymy Sto of w in C, i.e. the degree of synonymy showed by words other than w in the synsets of the class C in which w appears, is modeled by the following ratio:</Paragraph>
      <Paragraph position="6"> where: O~ is the number of words in the corpus that appear in at least one of the synsets of w.</Paragraph>
      <Paragraph position="7"> Ow,c is the number of words in the corpus appearing in at least one of the synsets of w, that belong to C. The synonymy depends both on WordNet and on the corpus. A verb with a high degree of synonymy in C is one with a high number of synonyms in the corpus, with reference to a specific sense (synset) belonging to C. Salient verbs for C are frequent, typical, and with a high synonymy in C. The salient words to, for a semantic category C, are thus identified maximizing the following function, that we call SCOre:</Paragraph>
      <Paragraph position="9"> where OAw are the absolute occurrences of w in the corpus. The value of Score depends both on the corpus and on WordNet. OAw depends obviously on the corpus.</Paragraph>
      <Paragraph position="10"> The kernel of a category kernel(C), is the set of salient verbs w with a &amp;quot;high&amp;quot; Scorew(C). In Table 2 and 3 the kernel words for both noun and verb classes are reported. The typicality of the words in the Remote Sensing domain is captured (in the tables some highest relevance words in the classes are reported). This is exactly what is needed as a semantic domain bias of the later classification process. null Step 2 uses the kernel words to build (as in (Yarowsky,1992)) a probabilistic model of a class: distributions of class relevance of the surrounding terms in typical contexts for each class are built. In Step 3 a words (verb or noun) is assigned to a class according to the contexts in which it appears: collective contexts are used contemporarily, as what matters here is domain specific class membership and not contextual sense disambiguation.</Paragraph>
      <Paragraph position="11"> Many contexts may cooperate to trigger a given class and several classifications may arise when different contexts suggest independent classes. For a given verb or noun w, and for each category C, we evaluate the following function, that we call Domain Sense</Paragraph>
      <Paragraph position="13"> where k's are the contexts of w, and w I is a generic word in k.</Paragraph>
      <Paragraph position="14"> In (5), Pr(C) is the (not uniform) probability of a class C, given by the ratio between the number of collective contexts for C 3 and the total number of collective contexts.</Paragraph>
      <Paragraph position="15"> The t~ning phase has been evaluated over the RSD corpus, and the resulting average ambiguity of a representative sample of 826 RSD verbs is 2.2, while the corresponding initial WordNet ambiguity was 3.5. For the intrinsic difficulty of deciding the proper domain classes for verbs we designed two tests. In the first ambiguous verbs in WordNet have been evaluated: the automatic classification is compared with the WordNet initial description. A recall (shared classes) of 41% denotes a very high compression (i.e. reduction in the number of senses) with a corresponding precision of 82% that indicate a good agreement between WordNet and the system classifications: many classes are pruned out (lower recall) but most of the remaining ones axe among the initial ones. A second test has been carried out on WordNet unambiguous verbs (e.g. fie.z, convoy, ... ). For such verbs a recall of 91% is obtained over their unique (and confirmed) senses. These results show that tuning a classification using word contexts Sthose collected around the kernel verbs of C is enough precise to be used in a semantic bootstrapping perspective and by its nature it can be used on a large scale.</Paragraph>
    </Section>
    <Section position="3" start_page="68" end_page="69" type="sub_section">
      <SectionTitle>
3.3 Tagging verbs and nouns in a corpus
</SectionTitle>
      <Paragraph position="0"> After the tuning phase local tagging is obtained in a similar fashion: given a context k for a word w and the set of the proposed classes {C1,C2,...Cn) for w, a tag C E (C1,C2,...Cn} is assigned tow in k itf adherence of k to the probabilistic model of C is over a given threshold and it is maximal.</Paragraph>
      <Paragraph position="1"> The WSD algorithm (WSD-GODoT) can be sketched as follows: I. Let k be a context of a noun/verb to in the source corpus and {Ci,C2, ...,C,} be the set of domain specific classifications of w, as they have been pre-selected by C-GODoT; 2. For each class Ci, the normalized contextual sense, NCS, is given by:</Paragraph>
      <Paragraph position="3"> where Y(k, Ci) is defined as in (5), and #c,, ac~ are the mean and standard deviation of the  Dsense(w, Ci) over the set of kernel words w in Ci. 3. The sense C that to assumes in the context k is expressed by:</Paragraph>
      <Paragraph position="5"> Experimentation has been carried out over set of 1,000 disambiguated contexts of about 97 verbs randomly extracted fzom RSD. All these 97 verbs where ambiguous, with an average of 2.3 semantic classes per verb persisting ambiguity, even after the semantic tuning phase. Recall and Precision have been measured against a manual classification carried out by three human judges (about 70% cases received the same tag by all the judges, this suggesting a certain complexity ot the task). In 98.74% of cases the tagging system selected one tag. A recall of 85.97% has been obtained. Precision is of about 62.19%.</Paragraph>
      <Paragraph position="6"> Comparing these figures with related works is very diflqcult, due to the differences in the underlying semantic type systems and mainly to the variety of information used by the different methods. (McRoy,1992) (and recently (Wilks and Stevenson,1997) described a word sense disambiguation methods based on multiple models, acting over different linguistic levels (e.g. MRD senses, POS tags,  corpus contexts). Our methodology is less demanding from the point of view of the required source information and possibly should be compared against one only of the levels mentioned in these works.</Paragraph>
      <Paragraph position="7"> (Resnik,1995) reports a human precision of about 67% but on a noun disambiguation task carried out at the level of true WordNet ,,tenses (i.e. synsets): this task seems fairly more complex than ours as we estimated an average of 2.9 synsets per noun on a set of 100 nouns of the RSD. However one of the resuits of our method is also to eliminate most of these senses from the hierarchy, during the tuning phase, so that precision of the two method cannot be directly compared. Exhaustive experimental data on nouns are not yet available. However the significant results obtained for verbs are important, as several authors (e.g. (Yarowsky,1992)) report verb as a category that is more problematic than noun for context driven classification tasks.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="69" end_page="71" type="metho">
    <SectionTitle>
4 Discussion
</SectionTitle>
    <Paragraph position="0"> The relevance of word classes for a variety of lexical acquisition tasks has been described in several works. In (Brown et al.,1993) class-based language models for text processing are described. Classes are derived by pure collocational analysis of corpora.</Paragraph>
    <Paragraph position="1"> Approaches of this type aim to improve the statis~ tical significance of probability estimations, taclde the data sparseness problems and reduce the number of the model parameters. The derived clusters are very interesting but are not amenable for a direct linguistic analysis. Difficulties in interpreting data derived from numerical cluster analysis emerge also in other studies (e.g. (Pereira et al.,1993)) where additional work is required to assign a suitable me~nlug to groups of words. The essential difficulty in separating word senses, when conflating data are derived from distinct senses, is due to the fact that simple collocations are often the surface results of independent linguistic phenomena. Collocationally derived lexical constraints (as in the strong tea vs.</Paragraph>
    <Paragraph position="2"> powerful tea example given in (Smadja,1989)) may be very different from other types of relations, like verb-argument relations. In this case, in fact a statistical significant relationship is not to be detected betwen verb and its lexical arguments, but between the verb and a whole class of words that play, in fact, the role of such arguments. For example, in the RSD corpus the verb catalogue appears 33 times. It takes as a direct object the word information only once, that is an evidence too small to support any probabilistic induction. Information indeed is a typical abstraction that can be catalogued. There is no hope for any inductive method making use of simple lex- null ical collocations instead of class based collocations (e.g. abstraction) to acquire enogh evidence of most of the phenomena.</Paragraph>
    <Paragraph position="3"> Class methods based on taxonomic information may provide more comprehensive information for a larger number of lexical acquisition tasks. In PP-disambiguation tasks several works based on bi-gram statistics collected over syntactic data (e.g. Hindle and Rooths,1993) show evident limitations in coverage and efficacy to deal with complex forms.</Paragraph>
    <Paragraph position="4"> In (Franz,1995) weak performances are reported for ambiguities with more that two attachment sites.</Paragraph>
    <Paragraph position="5"> These last are very frequent in a language like Italian where prepositional phrases play a role similar to English compounds. Class-based approaches (e.g.</Paragraph>
    <Paragraph position="6"> (Basili et al.,1993) and (Brill and Resnik, 1994) are more promising: the implied clustering also tackles the data sparseness difficulties, but mainly they produce selectional constraints that have a direct semantic interpretation. Smaller training data set can be used and also unknown collocates are deal with, if they are able to trigger the proper semantic generalizations. null The method proposed in this paper suggests and provides evidences that processing a corpus, first, to tune a general purpose taxonomy to the underlying domain and, then, sense disambiguating word occurrences according to the derived semantic classification is feasible. The reference information (i.e. the Wordnet taxonomy) is a well-known sharable resource with an explicit semantics (i.e. the hyperonimy/hyponimy hierarchy): this has a beneficial effect on the possibility to extract further lexical phenomenon (e.g. PP disambiguation rules) with a direct semantic interpretation. Let for example: Future Earth observation satellite systems for worldudde high resolution observation purposes require satellites in low Earth orbits, supplemented by geostationary relay satellites to ensure intermediate data transmission from LEO to ground.</Paragraph>
    <Paragraph position="7"> be a potential source document, taken form our RSD domain. Given a preliminary customization of the Wordnet hirerachy, according to the set of kernel verbs and nouns exemplified in Tables 2 and 3, the described methods allow to apply local semantic taggin to the set of verbs and nouns in the document. Some vrbs/nouns are no longer ambiguous in the domain: their unique tag is retained. For the remaining ambiguous words the local disambiguation model is applied (by (7). The tagged version of the source document results as follows: Future Earth/LO observation/AC satellite/OB systems/CO for worldwide high resolution/AT observa-</Paragraph>
    <Paragraph position="9"> ary relay/OB satellites/OB to ensure/CG interme.</Paragraph>
    <Paragraph position="10"> diate data/GR transmission/AC from LEO/AR to ground/LOC. 4 The data now available for any lexical acquisition techniques are not only bigrams or trigrams, or syntactic collocations (like those derived by a robust parser (as in (Grishman and Sterling,1994) or (Basili et al, 1994)) but also disambignated semantic tags for co-occurring words. For example, for the verb require, we extract the following syntactic collocations from the source document:</Paragraph>
    <Paragraph position="12"> These data support several inductions. First, semantic tags allow to cluster togheter source syntactic collocations according to similar classifications. Other occurrence of the verb require, as they have been found in the RSD corpus are:</Paragraph>
    <Paragraph position="14"> When arguments are assigned with the same tags (e.g. OB for the direct objets) basic instances can be generalized into selectional rules: a typical structure induced from the reported instances is thus</Paragraph>
    <Paragraph position="16"> where explicit semantic selectional restrictions (+OB) for syntactic arguments (e.g. Obj) are expressed. A method for deriving a verb subcategorization lexicon from a corpus, according to an example based learning technique applied to robust parsing data is described in (Basili et al,forthcoming).</Paragraph>
    <Paragraph position="17"> Availability of explicit semantic tags (like OB) allows to derive semantic selectional constraints as in (8). Further induction would allow to assign thematic descriptions to arguments in order to extend  (8) in:</Paragraph>
    <Paragraph position="19"> Previous work on the acquisition of high level semantic relations is described in (Basili et al.,1993b), where the feasibility of the derivation of lexical semantic relations from several corpora and domains  has been studied. Interesting results on applicability of semantic filtering to synt~tic data, for the purpose of acquiring verb argument information is reported in (Dorr and Jones,1996). Semantic information greatly improve the precision of a verb syntactic classification.</Paragraph>
    <Paragraph position="20"> The proposed tag system (e.g. Wordnet high level classes) has several advantages. First it puts some limit to enumeration of word senses, thus keeping limited the search space of any generalization process. Learning methods are usually search algorithms through concept spaces. The larger is the set of basic classes, the larger is the size of the search space. It is questionalble how expressive is the resuiting tag system. Previous research in ARIOSTO (Basili et al,1996a) demonstrated the feasibility of complex corpus driven acquisition based on high level semantic classes for a variety of lexical phenomena. A naive semantic type system allows a number of lexical phenomena to be captured with a minimal human intervention. As an example acquisition of verb hierarchies according to verb thematic description is described in (Basili et al.,1996b). Whenever an early tuning of the potetial semantic classes of a given verb in a corpus has been applied and local disambiguation has been carried out as corpus semantic annotation, more precise verb clustering can be applied: * first, local ambiguities have been removed during corpus tagging, second, clustering is applied with an intraclasses strategy and not over the whole set of verbs. First, a set of thematic verb instances from source sentences are collected for each given semantic class, so that social verbs are taken separate from change or cognition verbs.</Paragraph>
    <Paragraph position="21"> Then, separate hirarchies can be generated for each semantic class, in order to have a fully domain driven taxonomic description within general classes , e.g. social, for which a general agreement exists. Later reasoning processes could thus exploit general primitives augmented with domain specific lexico-semantic phenomena. null</Paragraph>
  </Section>
class="xml-element"></Paper>