<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-0718">
  <Title>Usage of WordNet in Natural Language Generation</Title>
  <Section position="3" start_page="128" end_page="128" type="metho">
    <SectionTitle>
3 Problems to be solved
</SectionTitle>
    <Paragraph position="0"> Despite the above advantages, some problems must be solved for the application of WordNet in a generation system to be successful. The first problem is how to adapt WordNet to a particular domain. With 121,962 unique words, 99,642 synsets, and 173,941 word senses as of version 1.6, WordNet is the largest publicly available lexical resource to date. The wide coverage is beneficial on one hand, since as a general resource it can provide information for different applications. On the other hand, it can also be quite problematic, since it is very difficult for an application to efficiently handle such a large database. Therefore, the first step towards utilizing WordNet in generation is to prune unrelated information from the general database so as to tailor it to the domain. Conversely, domain-specific knowledge that is not covered by the general database needs to be added to it.</Paragraph>
  </Section>
  <Section position="4" start_page="128" end_page="133" type="metho">
    <SectionTitle>
4 Solutions
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="128" end_page="132" type="sub_section">
      <SectionTitle>
4.1 Adapting WordNet to a domain
</SectionTitle>
      <Paragraph position="0"> We propose a corpus-based method to automatically adapt a general resource like WordNet to a domain. Most generation systems still use hand-coded lexicons and ontologies; however, corpus-based automatic techniques are in demand as natural language generation is used in more ambitious applications and large corpora in various domains become available. The proposed method involves three steps of processing.

Step 1: Prune unused words and synsets. We first prune words and synsets that are listed in WordNet but not used in the domain. This is accomplished by tagging the domain corpus with part-of-speech information. Then, for each word in WordNet, if it appears in the domain corpus with the same part of speech, the word is kept in the result; otherwise it is eliminated. For each synset</Paragraph>
      <Paragraph position="2"> in WordNet, if none of the words in the synset appears in the domain corpus, the synset as a whole is deleted. The only exception is that if a synset is the closest common ancestor of two synsets in the domain corpus, the synset is always kept in the result. The reason to keep this kind of synset is to generalize the semantic category of verb arguments, as we illustrate in step 2. The frequency of words in such synsets is marked as zero so that they will not be used in output. Figure 1 shows two example pruning operations: (A) is a general case, and (B) is the case involving an ancestor synset. In this step, words are not yet disambiguated, so all the senses of a word remain in the result; the pruning of unlikely senses is achieved in step 2, when verb argument clusters are utilized. Words that are in the corpus but not covered by WordNet are also identified in this stage; later, at step 3, we guess the meanings of these unknown words and place them into the domain ontology.</Paragraph>
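The Step 1 pruning just described can be sketched with toy data structures. This is a minimal illustration, not the real WordNet database format; all names (`prune`, the dict layouts) are hypothetical, and the exception for closest-common-ancestor synsets is omitted for brevity.

```python
# Sketch of Step 1: prune WordNet entries not attested in the domain corpus.
# Data structures are simplified stand-ins for the real WordNet files.

def prune(wordnet_words, wordnet_synsets, corpus_pos):
    """wordnet_words: {word: part_of_speech};
    wordnet_synsets: {synset_id: set of member words};
    corpus_pos: {word: part_of_speech} from a POS-tagged domain corpus.
    (The ancestor-synset exception from the paper is not modeled here.)"""
    # Keep a word only if it occurs in the corpus with the same POS.
    kept_words = {w: p for w, p in wordnet_words.items()
                  if corpus_pos.get(w) == p}
    # Keep a synset only if at least one member word survives.
    kept_synsets = {sid: members for sid, members in wordnet_synsets.items()
                    if members & kept_words.keys()}
    return kept_words, kept_synsets
```

Run over the basketball corpus, a filter like this is what yields the reduction from 94,473 WordNet nouns to the 1,414 attested in the domain.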
      <Paragraph position="3"> A total of 1,015 news reports on basketball games (1.7MB, Clarinet news, 1990-1991) were collected. The frequency count reported a total of 1,414 unique nouns (proper names excluded) and 993 unique verbs in the corpus. Compared to 94,473 nouns and 10,318 verbs in WordNet 1.6, only 1.5% of nouns and 9.6% of verbs are used in the domain. As we can see, this first pruning operation results in a significant reduction of entries. Among the words in the domain corpus, some appear very often (such as the verb score, which appears 3,141 times in the 1,015 reports, averaging 3.1 times per article), while some appear rarely (for example, the verb atone occurs only once in all the reports). In practical applications, low-frequency words are usually not handled by a generation system, so the reduction rate should be even higher.</Paragraph>
      <Paragraph position="4"> 47 (3.3%) of the nouns and 22 (2.2%) of the verbs in the corpus are not covered by WordNet. These are domain-specific words such as layup and layin. The small proportion of such words shows that WordNet is an appropriate general resource to use as a basis for building domain lexicons and ontologies, since it will probably cover most words in a specific domain. The situation might be different, however, if the domain is very specific, for example astronomy, in which case technical terms that are heavily used in the domain might not be included in WordNet.</Paragraph>
      <Paragraph position="5">  Our study in the basketball domain shows that a word is typically used uniformly in a specific domain; that is, it often has one or a few predominant senses in the domain, and for a verb, its arguments tend to be semantically close to each other and to belong to a single, or a few, more general semantic categories. In the following, we show by example how the uniform usage of words in a domain can help to identify predominant senses and obtain semantic constraints on verb arguments.</Paragraph>
      <Paragraph position="6"> In our basketball corpus, the verb add takes the following set of words as objects: (rebound, assist, throw, shot, basket, points). Based on the assumption that a verb typically takes arguments that belong to the same semantic category, we identify the sense of each word that keeps it connected to the largest number of words in the set. For example, for the word rebound, only one of its three senses is linked to other words in the set, so it is marked as the predominant sense of the word in the domain.</Paragraph>
      <Paragraph position="7"> The algorithm we used to identify the predominant senses is similar to the algorithm we introduced in (Jing et al., 1997), which identifies predominant senses of words using domain-dependent semantic classifications and WordNet. In this case, the set of arguments for a verb is considered as a semantic cluster. The algorithm can be briefly summarized as follows: construct the set of arguments for a verb, then traverse the WordNet hierarchy and locate all the possible links between senses of words in the set.</Paragraph>
      <Paragraph position="8"> The predominant sense of a word is the sense which has the largest number of links to other words in the set.</Paragraph>
      <Paragraph position="9"> In this example, the words (rebound, assist, throw, shot, basket) will each be disambiguated into the sense that makes all of them fall into the same semantic subtree in the WordNet hierarchy, as shown in Figure 2. The word points, however, does not belong to the same category and is not disambiguated. As we can see, the result is much further pruned compared to the result from step 1, with 5 out of 6 words now disambiguated into a single sense. Meanwhile, we have also obtained semantic constraints on verb arguments. For this example, the object of the verb add can be classified into two semantic categories: either points or the semantic category (accomplishment, achievement). The closest common ancestor (accomplishment, achievement) is used to generalize the semantic category of the arguments for a verb, even though the words accomplishment and achievement are not themselves used in the domain. This explains why, in step 1 pruning, synsets that are the closest common ancestor of two synsets in the domain are always kept in the result.</Paragraph>
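The link-counting heuristic above can be sketched as follows. The sense inventory and ancestor categories here are illustrative toy data standing in for real WordNet traversal; the function name `predominant_sense` is our own.

```python
# Sketch of the Step 2 heuristic: for each argument word, pick the sense
# that links it to the most other words in the verb's argument set.
# A "link" here is modeled as two senses sharing an ancestor category,
# a simplification of traversing the actual WordNet hierarchy.

def predominant_sense(word, arg_set, senses):
    """senses: {word: {sense_id: ancestor_category}}."""
    best, best_links = None, -1
    for sense_id, category in senses[word].items():
        links = sum(1 for other in arg_set
                    if other != word and category in senses[other].values())
        if links > best_links:
            best, best_links = sense_id, links
    return best
```

With the paper's example, the one sense of rebound that falls under (accomplishment, achievement) wins because it connects to the other objects of add, while senses like the "bounce back" reading have no links.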
      <Paragraph position="10"> A simple parser was developed to extract the subject, object, and main verb of a sentence. We then ran the algorithm described above and obtained selectional constraints for frequent verbs in the domain. The results show that, for most frequent verbs, the majority of their arguments can be categorized into one or a few semantic categories, with only a small number of exceptions. Table 1 shows some frequent verbs in the domain and their selectional constraints.</Paragraph>
      <Paragraph position="12"> Note that the existence of predominant senses for a word in a domain does not mean every occurrence of the word must have the predominant sense. For example, although the verb hit is used in the basketball domain mainly in the sense as in hitting a jumper or hitting a free throw, sentences like &amp;quot;The player fell and hit the floor&amp;quot;</Paragraph>
      <Paragraph position="14"> do appear in the corpus, although rarely. Such usage is not represented in our generalized selectional constraints on the verb arguments due to its low frequency.</Paragraph>
      <Paragraph position="15"> Step 3. Guessing unknown words and merging with domain specific ontologies.</Paragraph>
      <Paragraph position="16"> The grouping of verb arguments can also help us to guess the meaning of unknown words.</Paragraph>
      <Paragraph position="17"> For example, the word layup is often used as the object of the verb hit, but is not listed in WordNet. According to the selectional constraints from step 2, the object of the verb hit is typically in the semantic category (accomplishment, achievement). Therefore, we can guess that the word layup is probably in that semantic category too, though we do not know exactly where in the semantic hierarchy of Figure 2 to place the word.</Paragraph>
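A minimal sketch of this guessing step, assuming the Step 2 constraints are available as a lookup table. All names and the constraint data below are hypothetical illustrations, not values from the paper's tables.

```python
# Sketch of Step 3 unknown-word guessing: an unlisted word observed in a
# verb's argument slot inherits that slot's generalized semantic category.

def guess_category(observations, constraints):
    """observations: [(verb, slot)] occurrences of the unknown word;
    constraints: {(verb, slot): category} acquired in Step 2.
    Returns the most frequently implied category, or None."""
    guesses = [constraints[obs] for obs in observations if obs in constraints]
    if not guesses:
        return None
    return max(set(guesses), key=guesses.count)
```

For layup, repeatedly seen as the object of hit, this would return the category generalized for that slot, (accomplishment, achievement), without pinning down its exact position in the hierarchy, just as the text notes.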
      <Paragraph position="18"> We discussed above how to prune WordNet; the other part of the work in adapting WordNet to a domain is to integrate domain-specific ontologies with the pruned WordNet ontology. There are a few possible operations to do this: (1) Insertion. For example, in the basketball domain, if we have an ontology adapted from WordNet by following steps 1 and 2, and we also have a specific hierarchy of basketball team names, a good way to combine them is to place the hierarchy of team names under an appropriate node in the WordNet hierarchy, such as the node (basketball team). (2) Replacement. For example, in the medical domain, we need an ontology of medical disorders. WordNet includes some information under the node &amp;quot;medical disorder&amp;quot;, but it might not be enough to satisfy the application's needs. If such information, however, can be obtained from a medical dictionary, we can then substitute the subtree on &amp;quot;medical disorder&amp;quot; in WordNet with the more complete and reliable hierarchy from the medical dictionary. (3) Merging. If WordNet and the domain ontology contain information on the same topic, but knowledge from either side is incomplete, we need to combine the two to get a better ontology.</Paragraph>
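The first two operations can be sketched on a toy hierarchy of nested dicts. The tree representation, node names, and function names are all illustrative assumptions; the merging operation, which needs case-by-case reconciliation, is not modeled.

```python
# Sketch of the insertion and replacement operations on a toy ontology
# represented as nested dicts ({node_name: {child_name: {...}}}).

def find_parent(tree, name):
    """Return the dict that maps `name` to its subtree, or None."""
    if name in tree:
        return tree
    for child in tree.values():
        hit = find_parent(child, name)
        if hit is not None:
            return hit
    return None

def insert(tree, node, subtree):
    """Insertion: hang a domain hierarchy under an existing node."""
    find_parent(tree, node)[node].update(subtree)

def replace(tree, node, subtree):
    """Replacement: discard the WordNet subtree, use the domain one."""
    find_parent(tree, node)[node] = subtree
```

For example, a hierarchy of team names would be attached with `insert` under (basketball team), while a medical-dictionary hierarchy would displace the WordNet "medical disorder" subtree via `replace`.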
      <Paragraph position="19"> We studied ontologies in five generation systems, in the medical, telephone network planning, web log, basketball, and business domains. Generally, a domain-specific ontology can easily be merged with WordNet by either the insertion or the replacement operation.</Paragraph>
    </Section>
    <Section position="2" start_page="132" end_page="132" type="sub_section">
      <SectionTitle>
4.2 Using the result for generation
</SectionTitle>
      <Paragraph position="0"> The result we obtain after applying steps 1 to 3 of the above method is a reduced WordNet hierarchy, integrated with domain-specific ontology. In addition, it is augmented with selectional constraints and word frequency information acquired from the corpus. We now discuss the usage of this result for generation.</Paragraph>
      <Paragraph position="1"> * Lexical Paraphrases. As we mentioned in Section 1, synsets can provide lexical paraphrases; the problem to be solved is determining which words are interchangeable in a particular context. In our result, the words that appear in a synset but are not used in the domain are eliminated by corpus analysis, so the words left in the synsets are basically all applicable to the domain. They can, however, be further distinguished by the selectional constraints. For example, if A and B are in the same synset but they have different constraints on their arguments, they are not interchangeable. Frequency can also be taken into account: a low-frequency word should be avoided if there are other choices. Words left after these restrictions can be considered interchangeable synonyms and used for paraphrasing. * Discrimination net for lexicalization.</Paragraph>
      <Paragraph position="2"> The reduced WordNet hierarchy, together with selectional and frequency constraints, makes up a discrimination net for lexicalization. The selection can be based on the generality of the words; for example, a jumper is a kind of throw. If a user wants the output to be as detailed as possible, we can say &amp;quot;He hit a jumper&amp;quot;; otherwise we can say &amp;quot;He hit a throw.&amp;quot; Selectional constraints can also be used in selecting words. For example, both the words win and score can convey the meaning of obtaining advantages, gaining points, etc., and win is a hypernym of score. In the basketball domain, win is mainly used as win(team, game), while score is mainly used as score(player, points), so depending on the categories of the input arguments, we can choose between score and win.</Paragraph>
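The win/score choice just described can be sketched as a lookup of argument signatures. The signature table and the function name `choose_verb` are illustrative assumptions, not data from the paper.

```python
# Sketch of constraint-driven lexical choice: pick the verb whose
# recorded (subject category, object category) signature matches
# the semantic categories of the input arguments.

def choose_verb(subj_cat, obj_cat, signatures):
    """signatures: {verb: (subject_category, object_category)}."""
    for verb, (s, o) in signatures.items():
        if (s, o) == (subj_cat, obj_cat):
            return verb
    return None

# Illustrative signatures for the hypernym pair discussed in the text.
SIGNATURES = {"win": ("team", "game"), "score": ("player", "points")}
```

Given an input whose subject is a player and object is a point total, the table selects score; a team/game input selects win.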
      <Paragraph position="3"> Frequency can also be used in a way similar to the above. Although selectional constraints and frequency are useful criteria for lexical selection, there are many other constraints that can be used in a generation system for selecting words, for example syntactic constraints, discourse, and focus. These constraints are usually coded in individual systems, not obtained from WordNet.</Paragraph>
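The synonym filtering described under Lexical Paraphrases, combined with the frequency criterion, can be sketched as below. The threshold and all data structures are illustrative assumptions.

```python
# Sketch of the paraphrase filter: from a synset, drop words unattested
# or rare in the domain, then keep only words whose selectional
# constraints agree, leaving a set of interchangeable synonyms.

def interchangeable(synset, freq, constraints, min_freq=5):
    """synset: list of words; freq: {word: corpus count};
    constraints: {word: argument category from Step 2}."""
    domain_words = [w for w in synset if freq.get(w, 0) >= min_freq]
    if not domain_words:
        return []
    # Anchor on the most frequent member; keep words sharing its constraint.
    anchor = max(domain_words, key=lambda w: freq[w])
    return [w for w in domain_words
            if constraints.get(w) == constraints.get(anchor)]
```

Words surviving this filter are the ones the text treats as safe lexical paraphrases of one another.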
      <Paragraph position="4"> * Domain ontology. From step 3, we can acquire a unified ontology by integrating the pruned WordNet hierarchy with domain-specific ontologies. The unified ontology can then be used by planning and lexicalization components. How different modules use the ontology is a generation issue, which we will not address in this paper.</Paragraph>
    </Section>
    <Section position="3" start_page="132" end_page="133" type="sub_section">
      <SectionTitle>
4.3 Combining other types of knowledge for generation
</SectionTitle>
      <Paragraph position="0"> Although WordNet contains rich lexical knowledge, its information on verb argument structures is relatively weak. Also, while WordNet is able to provide lexical paraphrases through its synsets, it cannot provide syntactic paraphrases for generation. Other resources such as the COMLEX syntax dictionary (Grishman et al., 1994) and English Verb Classes and Alternations (EVCA) (Levin, 1993) can provide verb subcategorization information and syntactic paraphrases, but they are indexed by words and thus not suitable for direct use in generation.</Paragraph>
      <Paragraph position="1"> To augment WordNet with syntactic information, we combined three other resources with WordNet: COMLEX, EVCA, and the tagged Brown corpus. The resulting database contains not only rich lexical knowledge, but also substantial syntactic knowledge and language usage information. The combined database can be adapted to a specific domain using techniques similar to those introduced in this paper. We applied the combined lexicon to PLanDOC (McKeown et al., 1994), a practical generation system for telephone network planning. Together with a flexible architecture we designed, the lexicon is able to effectively improve the system's paraphrasing power, minimize the chance of grammatical errors, and substantially simplify the development process. A detailed description of the combining process and the application of the lexicon is presented in (Jing and McKeown, 1998).</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="133" end_page="133" type="metho">
    <SectionTitle>
5 Future work and conclusion
</SectionTitle>
    <Paragraph position="0"> In this paper, we demonstrate that WordNet is a valuable resource for generation: it can produce a large number of paraphrases, provide a semantic net for lexicalization, and be used for building domain ontologies.</Paragraph>
    <Paragraph position="1"> The main problem we discussed is adapting WordNet to a specific domain. We propose a three-step procedure based on corpus analysis to solve the problem. First, the general WordNet ontology is pruned based on a domain corpus; then verb argument clusters are used to further prune the result; and finally, the pruned WordNet hierarchy is integrated with domain-specific ontology to build a unified ontology. The other problems we discussed are how WordNet knowledge can be used in generation and how to augment WordNet with other types of knowledge.</Paragraph>
    <Paragraph position="2"> In the future, we would like to test our techniques in other domains besides basketball, and to apply them in practical generation systems.</Paragraph>
  </Section>
</Paper>