<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0205">
  <Title>Adaptable Semantic Lexicon with Systematic Pol-</Title>
  <Section position="2" start_page="25" end_page="29" type="metho">
    <SectionTitle>
2 CORELEX: A Semantic Lexicon
with Systematic Polysemous
Classes
</SectionTitle>
    <Paragraph position="0"> In this section I describe the structure and content of a lexicon (CORELEX) that builds on the assumptions about lexical semantics and discourse outlined above. More specifically, it is to be 'structured in such a way that it reflects the lexical semantics of a language in systematic and predictable ways' (Pustejovsky, Boguraev, and Johnston, 1995). This assumption is fundamentally different from the design philosophies behind existing lexical semantic resources like WORDNET that do not account for any regularities between senses. For instance, WORD-NET assigns to the noun book the following senses: the content that is being communicated (communicatiofl) and the medium of communication (artifact). More accurately, book should be assigned a qualia structure which implies both of these interpretations and connects them to each of the more specific senses that WORDNET assigns: that is, facts, drama and a journal can be part-of the content of a book; a section is part-of both the content and the medium; publication, production and recording are all events in which both the content and the medium aspects of a book can be involved.</Paragraph>
    <Paragraph position="1"> An important advantage of the CORELEX approach is more consistency among the assignments of lexical semantic structure. Consider the senses that WORDNET assigns to door, gate and window:  At the top of the WORDNET hierarchy these seven senses can be reduced to two unrelated 'basic senses':  window and gate Obviously these are similar words, something which is not expressed in the WORDNET sense assignments. In the CORELEX approach, these nouns are given the same semantic type, which is underspecifled for any specific 'sense' but assigns them consistently with the same basic lexical semantic structure that expresses the regularities between all of their interpretations.</Paragraph>
    <Paragraph position="2"> However, despite its shortcomings WORDNET is a vast resource of lexical semantic knowledge that can</Paragraph>
    <Paragraph position="4"> be mined, restructured and extended, which makes it a good starting point for the construction of CORELEX. The next sections describe how systematic polysem0us classes and underspecified semantic types can be derived from WORDNET. In this paper I only consider classes of noun,s, but the process described here can also be applied to other parts of speech.</Paragraph>
    <Section position="1" start_page="26" end_page="27" type="sub_section">
      <SectionTitle>
2.1 Systematic polysemous classes
</SectionTitle>
      <Paragraph position="0"> We can arrive at classes of systematically polysemous lexical items by investigating which items share the same senses and are thus polysemous in the same way. This comparison is done at the top levels of the WORDNET hierarchy. WORDNET does not have an explicit level structure, but for the purpose of this research one can distinguish a set of 32 =basic senses' that partly coincides with, but is not based directly on WORDNET'S list of 26 'top types': act (act), agent (agt), animal (~.m), artifact (art), attribute (air), blunder (bln), cell (cel), chemical (chm), communication (corn), event (evl;), food (rod), form (frm), group_biological (grb), group (grp), group_social (grs), h-m~n (hum), llnear_measure (1me), location (loc), 1ocation_geographical (log), measure (mea), natural_object (nat), phenomenon (p\]m), plant (plt), possession (pos), part (prt), psychological (psy), quantity_definite (qud), quantity_indefinite (qui), relation (re1), space (spc), state (sta), time (tree) Figure 3 shows their distribution among noun stems in the BROWN corpus. For instance there are 2550 different noun stems (with 49,824 instances) that have each 2 out of the 32 'basic senses' assigned to them in 238 different combinations (a subset of 322 = 1024 possible combinations).</Paragraph>
      <Paragraph position="1"> We now reduce all of WORDNET'S sense assignments to these basic senses. For instance, the seven different senses that WORDNET assigns to the lexical item book (see Figure I above) can be reduced to the two basic senses: 'art corn'. We do this for each lexical item and then group them into classes according to their assignments.</Paragraph>
      <Paragraph position="2"> From these one can filter out those classes that have only one member because they obviously do not represent a systematically polysemous class. The lexical items in those classes have a highly idiosyncratic behavior and are most likely homonyms. This leaves  Not all of the 442 classes are systematically polysemous. Consider for example the following classes: Some of these classes are collections of homonyms that are ambigtzotz,s in similar ways, but do not lead to any kind of predictable polysemous behavior, for instance the class 'act anm art' with the lexical items: drill ruff solitaire stud. Other classes consist of both homonyms and systematically polysemous lexical items like the class act log, which includes caliphate, clearing, emirate, prefecture, repair, wheeling vs. bolivia, charleston, chicago, michigan.  Whereas the first group of nouns express two separated but related meanings (the act of clearing, repair, etc. takes place at a certain location), the second group expresses two meanings that are not related (the charleston dance which was named after the town by the same name).</Paragraph>
      <Paragraph position="3"> The ambiguous classes need to be removed altogether, while the ones with mixed ambiguous and polllsemous lexical items are to be weeded out carefully. null</Paragraph>
    </Section>
    <Section position="2" start_page="27" end_page="29" type="sub_section">
      <SectionTitle>
2.2 Underspecified semantic types
</SectionTitle>
      <Paragraph position="0"> The next step in the research is to organize the remaining classes into knowledge representations that relate their senses to each other. These representations are based on Generative Lexicon theory (GPS), using qualia roles and (dotted) types (Pustejovsky, 19os).</Paragraph>
      <Paragraph position="1"> Qualia roles distinguish different semantic aspects: FORMAL indicates semantic type; CONSTITUTIVE part-whole information; AGENTIVE and TELIC associated events (the first dealing with the origin of the object, the second with its purpose). Each role is typed to a specific class of lexical items. Types are either simple (human, artifact,...) or complex (e.g., information.physical). Complex types are called dotted types after the 'dots' that are used as type constructors. Here I introduce two kinds of dots: Closed clots '.' connect systematically related types that are always interpreted simultaneonsly. null Open dots 'o' connect systematically related types that are not (normally) interpreted simultaneously.</Paragraph>
      <Paragraph position="2"> Both '#*~&amp;quot; and 'aor' denote sets of pairs of objects (a, b), a an object of type ~ and b an object of type ~'. A condition aRb restricts this set of pairs to only those for which some relation R holds, where R denotes a subset of the Cartesian product of the sets of type ~ objects and type r objects.</Paragraph>
      <Paragraph position="3"> The difference between types '#or' and 'cot' is in the nature of the objects they denote. The type 'aer' denotes sets of pairs of objects where each pair behaves as a complex object in discourse structure. For instance, the pairs of objects that are introduced by the type informationephysical (book, journal, scoreboard .... ) are addressed as the complex objects (x:information, y:physical) in discourse.</Paragraph>
      <Paragraph position="4"> On the other hand, the type '#or' denotes simply a set of pairs of objects that do not occur together in discourse structure. For instance, the pairs of objects that are introduced by the type form.artifact (door, gate, window .... ) are not (normally) addressed simultaneously in discourse, rather one side of the object is picked out in a particular context.</Paragraph>
      <Paragraph position="5"> Nevertheless, the pair as a whole remains active during processing.</Paragraph>
      <Paragraph position="6"> The resulting representations can be seen as under-specified lexical meanings and are therefore referred to as underspecified semantic types. CORELEX currently covers 104 underspecified semantic types.</Paragraph>
      <Paragraph position="7"> This section presents a number of examples, for a complete overview see the CORELEX webpage: http://~, ca. brandeis, edu/&amp;quot;paulb/Cor eLex/corelex, html Closed Dots Consider the underspecified representation for the semantic type actorelation:</Paragraph>
      <Paragraph position="9"> The representation introduces a number of objects that are of a certain type. The FORMAL role introduces an object Q of type actorelation. The CONSTITUTIVE introduces objects that are in a part-whole relationship with Q. These are either of the same type actorelation or of the simple types act or relation. The TELIC expresses the event P that can be associated with an object of type acterelation.</Paragraph>
      <Paragraph position="10"> For instance, the event of increase as in 'increasing the communication between member states' implies 'increasing' both the act of communicating an object</Paragraph>
      <Paragraph position="12"> RI and the communication relation between two objects R2 and Rs. All these objects are introduced on the semantic level and correspond to a number of objects that will be realized in syntax. However, not all semantic objects will be realized in syntax.</Paragraph>
      <Paragraph position="13"> (See Section 3.4 for more on the syntax-semantics interface.) The instances for the type act*relation are given in Figure 7, covering three different systematic polysemous classes. We could have chosen to include only the instances of the 'act rel' class, but the nouns in the other two classes seem similar enough to describe all of them with the same type.</Paragraph>
      <Paragraph position="14"> generative the lexicon should be and if one allows overgeneration of semantic objects.</Paragraph>
      <Paragraph position="15"> .nm rod bluepoint capon clam cockle crawdad crawfish crayfish duckling fowl grub hen lamb langouste limpet lobster monkfish mussel octopus panfish partridge pheasant pigeon poultry prawn pullet quail saki scallop scollop shellfish shrimp snail squid whelk whitebait whitefish winkle  other (the act and relation aspects are intimately connected). The following representation for type -nimalofood describes interpretations that can not occur simultaneously but are however related ~. It therefore uses a 'o' instead of a '.' as a type constructor: null</Paragraph>
      <Paragraph position="17"> The instances for this type only cover the class ' ~,m rod'. A case could be made for including also every instance of the class c~-m' because in principal every animal could be eaten. This is a question of how</Paragraph>
    </Section>
    <Section position="3" start_page="29" end_page="29" type="sub_section">
      <SectionTitle>
2.3 Homonyms
</SectionTitle>
      <Paragraph position="0"> CORELEX is designed around the idea of systematic polysemons classes that exclude homonyms.</Paragraph>
      <Paragraph position="1"> Traditionally a lot of research in lexical semantics has been occupied with the problem of ambiguity in homonyms. Our research shows however that homonyms only make up a fraction of the whole of the lexicon of a language. Out of the 37,793 noun stems that were derived from WORDNET 1637 are to be viewed as true homonyms because they have two or more unrelated senses, less than 5%. The remaining 95% are nouns that do have (an indefinite number of) different interpretations, hut all of these are somehow related and should be inferred from a common knowledge representation. These numbers suggest a stronger emphasis in research on systematic polysemy and less on homonyms, an approach that is advocated here (see also (Killgariff, 1992)).</Paragraph>
      <Paragraph position="2"> In CORZLEX homonyms are simply assigned two or more underspecified semantic types, that need to be disambiguated in a traditional way. There is however an added value also here because each disambiguated type can generate any number of context dependent interpretations.</Paragraph>
    </Section>
  </Section>
  <Section position="3" start_page="29" end_page="31" type="metho">
    <SectionTitle>
3 Adapting CORELEX to Domain
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="29" end_page="29" type="sub_section">
      <SectionTitle>
Specific Corpora
</SectionTitle>
      <Paragraph position="0"> The underspectfied semantic type that CORELEX assigns to a noun provides a basic lexical semantic structure that can be seen as the class-wide backbone semantic description on top of which specific information for each lexical item is to be defined.</Paragraph>
      <Paragraph position="1"> That is, doors and gates are both artifacts but they have different appearances. Gates are typically open constructions, whereas doors tend to be solid. This kind of information however is corpus specific and therefore needs to be adapted specifically to and on the basis of that particular corpus of texts.</Paragraph>
      <Paragraph position="2"> This process involves a number of consecutive steps that includes the probabilistic classification of unknown lexical items:  1. Assignment of underspecified semantic tags to those nouns that are in CORELEX 2. Running class-sensitive patterns over the (partly) tagged corpus 3. (a) Constructing a probabilistic classifier from the data obtained in step 2.</Paragraph>
      <Paragraph position="3"> (b) Probabilistically tag nouns that are not in CORELEX according to this classifier 4. Relating the data obtained in step 2. to one or  more qualia roles Step 1. is trivial, but steps 2. through 4. form a complex process of constructing a corpus specific semantic lexicon that is to be used in additional processing for knowledge intensive reasoning steps (i.e. abduction (Hobbs et al., 1993)) that would solve metaphoric, metonymic and other non-literal use of language.</Paragraph>
    </Section>
    <Section position="2" start_page="29" end_page="29" type="sub_section">
      <SectionTitle>
3.1 Assignment of CORELEX Tags
</SectionTitle>
      <Paragraph position="0"> The first step in analyzing a new corpus involves tagging each noun that is in CORELEX with an underspecified semantic tag. This tag represents the following information: a definition of the type of the noun (FORMAL); a definition of types of possible nouns it can stand in a part-whole relationship with (CONSTITUTIVE); a definition of types of possible verbs it can occur with and their argument structures (AGENTIVE / TELIC). CORELEX is implemented as a database of associative arrays, which allows a fast lookup of this information in pattern matching.</Paragraph>
    </Section>
    <Section position="3" start_page="29" end_page="30" type="sub_section">
      <SectionTitle>
3.2 Class-Sensitive Pattern Matching
</SectionTitle>
      <Paragraph position="0"> The pattern matcher runs over corpora that are: part-of-speech tagged using a widely used tagger (Brill, 1992); stemmed by using an experimental system that extends the Porter stemmer, a stemming algorithm widely used in information retrieval, with the Celex database on English morphology; (partly) semantically tagged using the CORELEX set of underspecified semantic tags as discussed in the previous section.</Paragraph>
      <Paragraph position="1"> There are about 30 different patterns that are arranged around the headnoun of an NP. They cover  the following syntactic constructions that roughly correspond to a VP, an S, an NP and an NP followed by a PP:  The heuristics for finding the headnoun is then simply to take the rightmost noun in the NP, which for English is mostly correct.</Paragraph>
      <Paragraph position="2"> The verb-headnoun patterns approach that of a true 'verb-obj' analysis by including a normalization of passive constructions as follows: \[Noun Have? Be Adv? Verb\] =~ \[Verb Noun\] Similarly, the headnoun-verb patterns approach a true 'sub j-verb' analysis. However, because no deep syntactic analysis is performed, the patterns can only approximate subjects and Objects in this way and I therefore do not refer to these patterns as 'subject-verb' and 'verb-object' respectively.</Paragraph>
      <Paragraph position="3"> The pattern matching is class-sensitive in employing the assigned CORELEX tag to determine if the application of this pattern is appropriate. For instance, one of the headnoun-preposition-headnoun patterns is the following, that is used to detect part- null Clearly not every syntactic construction that fits this pattern is to be interpreted as the expression of a part-whole relation. One of the heuristics we therefore use is that the pattern may only apply if both head nouns carry the same CORELEx tag or if the tag of the second head noun subsumes the tag of the first one through a dotted type. That is, if the second head noun is of a dotted type and the first is of one of its composing types. For instance, 'paragraph' ~The interpretation of '*' and '?' in this section follows that of common usage in regular expressions: 'w indicates 0 or more occurrences; '?' indicates 0 or 1 occurrence and 'journal' can be in a part-whole relation to each other because the first is of type information, while the second is of type information*physical. Similar heuristics can be identified for the application of other patterns.</Paragraph>
      <Paragraph position="4"> Recall of the patterns (percentage of nouns that are covered) is on average, among different corpora (wsJ, BROWN, PDGF - a corpus we constructed for independent purposes from 1000 medical abstracts in the MEDLINE database on Platelet Derived Growth Factor - and DARWIN - the complete Origin of Species), about 70% to 80%. Precision is much harder to measure, but depends both on the accuracy of the output of the part-of-speech tagger and on the accuracy of class-sensitive heuristics.</Paragraph>
    </Section>
    <Section position="4" start_page="30" end_page="30" type="sub_section">
      <SectionTitle>
3.3 Probabilistic Classification
</SectionTitle>
      <Paragraph position="0"> The knowledge about the linguistic context of nouns in the corpus that is collected by the pattern matcher is now used to classify unknown nouns. This involves a similarity measure between the linguistic contexts of classes of nouns that are in CORELEX and the linguistic context of unknown nouns. For this purpose the pattern matcher keeps two separate arrays, one that collects knowledge only on COrtELEx nouns and the other collecting knowledge on all nouns.</Paragraph>
      <Paragraph position="1"> The classifier uses mutual information (MI) scores rather than the raw frequences of the occurring patterns (Church and Hanks, 1990). Computing MI scores is by now a standard procedure for measuring the co-occurrence between objects relative to their overall occurrence. MI is defined in general as follows: null</Paragraph>
      <Paragraph position="3"> We can use this definition to derive an estimate of the connectedness between words, in terms of collocations (Smadja, 1993), but also in terms of phrases and grammatical relations (Hindle, 1990). For instance the co-occurrence of verbs and the heads of their NP objects iN: size of the corpus, i.e. the number of stems): N Cobj (v n) = log2 /(v) /(n) N N All nouns are now classified by running a similaxity measure over their MI scores and the MI scores of each CoRELEx class. For this we use the Jaccard measure that compares objects relative to the attributes they share (Grefenstette, 1994). In our case the 'attributes' are the different linguistic constructions a noun occurs in: headnoun-verb, adjective-headnoun, modifiernoun-headnoun, etc.</Paragraph>
      <Paragraph position="4"> The Jaccard measure is defined as the number of attributes shared by two objects divided by the total number of unique attributes shared by both objects:  The Jaccard scores for each CORELEx class are sorted and the class with the highest score is assigned to the noun. If the highest score is equal to 0, no class is assigned.</Paragraph>
      <Paragraph position="5"> The classification process is evaluated in terms of precision and recall figures, but not directly on the classified unknown nouns, because their precision is hard to measure. Rather we compute precision and recall on the classification of those nouns that are in CoreLex, because we can check their class automatically. The assumption then is that the precision and recall figures for the classification of nouns that are known correspond to those that are unknown. An additional measure of the effectiveness of the classifter is measuring the recall on classification of all nouns, known and unknown. This number seems to correlate with the size of the corpus, in larger corpora more nouns are being classified, but not necessarily more correctly. Correct classification rather seems to depend on the homogeneity of the corpus: if it is written in one style, with one theme and so on.</Paragraph>
      <Paragraph position="6"> Recall of the classifier (percentage of all nouns that are classified &gt; 0) is on average, among different larger corpora (&gt; 100,000 tokens), about 80% to 90%. Recall on the nouns in CoRELEx is between 35% and 55%, while precision is between 20% and 40%. The last number is much better on smaller corpora (70% on average). More detailed information about the performance of the classifier, matcher and acquisition tool (see below) can be obtained from (Buitelaar, forthcoming).</Paragraph>
    </Section>
    <Section position="5" start_page="30" end_page="31" type="sub_section">
      <SectionTitle>
3.4 Lexical Knowledge Acquisition
</SectionTitle>
      <Paragraph position="0"> The final step in the process of adapting CORELEx to a specific domain involves the 'translation' of observed syntactic patterns into corresponding semantic ones and generating a semantic lexicon representing that information.</Paragraph>
      <Paragraph position="1">  There are basically three kinds of semantic patterns that are utilized in a CORELEX lexicon: hyponymy (sub-supertype information) in the FORMAL role, meronymy (part-whole information) in the CONSTITUTIVE role and predicate-argument structure in the TELIC and AGENTIVE roles. There are no compelling reasons to exclude other kinds of information, but for now we base our basic design on ~PS, which only includes these three in its definition of qualia structure. null Hyponymic information is acquired through the classification process discussed in Sections 2.2 and 3.3. Meronymic information is obtained through a translation of various VP and PP patterns into 'has-part' and 'part-of' relations. Predicate-argument structure finally, is derived from verb-headnoun and headnoun-verb constructions.</Paragraph>
      <Paragraph position="2"> The semantic lexicon that is generated in such a way comes in two formats: T2)PS, a Type Description Language based on typed feature-logic (Krieger and Schaefer, 1994a) (Krieger and Schaefer, 1994b) and HTML, the markup language for the World Wide Web. The first provides a constraint-based formalism that allows CORELEX lexicons to be used stralghtforwardiy in constraint-based grammars. The second format is used to present a generated semantic lexicon as a semantic index on a World Wide Web document. We will not elaborate on this further because the subject of semantic indexing is out of the scope of this paper, but we refer to (Pustejovsky et al., 1997).</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="31" end_page="31" type="metho">
    <SectionTitle>
3.5 An Example: The PDGF Lexicon
</SectionTitle>
    <Paragraph position="0"> The semantic lexicon we generated for the PDGF corpus covers 1830 noun stems, spread over 81 CORELEX types. For instance, the noun evidence is of type communication.psychological and the following representation is generated:</Paragraph>
  </Section>
class="xml-element"></Paper>