<?xml version="1.0" standalone="yes"?> <Paper uid="W00-1104"> <Title>Semantic Indexing using WordNet Senses</Title> <Section position="4" start_page="37" end_page="37" type="metho"> <SectionTitle> 3 System Architecture </SectionTitle> <Paragraph position="0"> There are three main modules used by this system: 1. Word Sense Disambiguation (WSD) module, which performs a semi-complete but precise disambiguation of the words in the documents. Besides semantic information, this module also adds part of speech tags to each word and stems the word using the WordNet stemming algorithm. Every document in the input set of documents is processed with this module. The output is a new document in which each word is replaced with the new format Pos|Stem|POS|Offset, where: Pos is the position of the word in the text; Stem is the stemmed form of the word; POS is the part of speech and Offset is the offset of the WordNet synset in which this word occurs.</Paragraph> <Paragraph position="1"> When no sense is assigned by the WSD module, or if the word cannot be found in WordNet, the last field is left empty.</Paragraph> <Paragraph position="2"> 2. Indexing module, which indexes the documents after they are processed by the WSD module. From the new format of a word, as returned by the WSD function, the Stem and, separately, the Offset|POS are added to the index. This enables the retrieval of the words, regarded as lexical strings, or the retrieval of the synset of the words (which actually means the retrieval of the given sense of the word and its synonyms).</Paragraph> <Paragraph position="3"> 3. Retrieval module, which retrieves documents based on an input query. 
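The Pos|Stem|POS|Offset record format described above can be handled with a small parser. This is a minimal illustrative sketch; the `Token` type and function names are our own, not part of the paper's system:

```python
from typing import NamedTuple, Optional

class Token(NamedTuple):
    pos: int                 # position of the word in the text
    stem: str                # WordNet stem of the word
    tag: str                 # part-of-speech tag
    offset: Optional[str]    # WordNet synset offset; None when the WSD
                             # module assigned no sense or the word is
                             # not in WordNet (empty last field)

def parse_token(raw: str) -> Token:
    """Parse one 'Pos|Stem|POS|Offset' record emitted by the WSD module."""
    pos, stem, tag, offset = raw.split("|")
    return Token(int(pos), stem, tag, offset or None)

print(parse_token("12|approval|NN|7766144"))
print(parse_token("13|hypersonic|JJ|"))  # no sense assigned: empty last field
```

Keeping the sense field optional mirrors the convention above, where an empty Offset marks words the WSD module could not disambiguate.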
As we are using a combined word-based and synset-based indexing, we can retrieve documents containing either (1) the input keywords, (2) the input keywords with an assigned sense or (3) synonyms of the input keywords.</Paragraph> </Section> <Section position="5" start_page="37" end_page="40" type="metho"> <SectionTitle> 4 Word Sense Disambiguation </SectionTitle> <Paragraph position="0"> As stated earlier, WSD is performed for both the query and the documents from which we have to retrieve information.</Paragraph> <Paragraph position="1"> The WSD algorithm used for this purpose is an iterative algorithm; it was first presented in (Mihalcea and Moldovan, 2000). It determines, in a given text, a set of nouns and verbs which can be disambiguated with high precision. The semantic tagging is performed using the senses defined in WordNet. In this section, we present the various methods used to identify the correct sense of a word. Then, we describe the main algorithm in which these procedures are invoked in an iterative manner.</Paragraph> <Paragraph position="2"> PROCEDURE 1. This procedure identifies the proper nouns in the text, and marks them as having sense #1.</Paragraph> <Paragraph position="3"> Example. ''Hudson'' is identified as a proper noun and marked with sense #1.</Paragraph> <Paragraph position="4"> PROCEDURE 2. Identify the words having only one sense in WordNet (monosemous words). Mark them with sense #1.</Paragraph> <Paragraph position="5"> Example. The noun subcommittee has one sense defined in WordNet. Thus, it is a monosemous word and can be marked as having sense #1.</Paragraph> <Paragraph position="6"> PROCEDURE 3. For a given word Wi, at position i in the text, form two pairs, one with the word before Wi (pair Wi-1-Wi) and the other one with the word after Wi (pair Wi-Wi+1). Determiners or conjunctions cannot be part of these pairs. 
Then, we extract all the occurrences of these pairs found within the semantically tagged corpus formed by the 179 texts from SemCor (Miller et al., 1993). If, in all the occurrences, the word Wi has only one sense #k, and the number of occurrences of this sense is larger than 3, then mark the word Wi as having sense #k.</Paragraph> <Paragraph position="7"> Example. Consider the word approval in the text fragment ''committee approval of''. The pairs formed are ''committee approval'' and ''approval of''. No occurrences of the first pair are found in the corpus. Instead, there are four occurrences of the second pair, and in all these occurrences the sense of approval is sense #1. Thus, approval is marked with sense #1.</Paragraph> <Paragraph position="8"> PROCEDURE 4. For a given noun N in the text, determine the noun-context of each of its senses. This noun-context is actually a list of nouns which can occur within the context of a given sense i of the noun N. In order to form the noun-context for every sense Ni, we determine all the concepts in the hypernym synsets of Ni. Also, using SemCor, we determine all the nouns which occur within a window of 10 words with respect to Ni.</Paragraph> <Paragraph position="9"> All of these nouns, determined using WordNet and SemCor, constitute the noun-context of Ni. We can now calculate the number of common words between this noun-context and the original text in which the noun N is found. Applying this procedure to all the senses of the noun N provides us with an ordering over its possible senses. We pick the sense i for the noun N which (1) is at the top of this ordering and (2) is at a distance larger than a given threshold from the next sense in the ordering.</Paragraph> <Paragraph position="10"> Example. The word diameter, as it appears in document 1340 from the Cranfield collection, has two senses. 
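The agreement-and-frequency test of Procedure 3 can be sketched as follows. The corpus data here is a toy stand-in for the SemCor pair occurrences, and all names are illustrative rather than the authors' code:

```python
from collections import Counter

def sense_from_pairs(pair_occurrences, min_count=4):
    """pair_occurrences: the senses of the target word observed in the
    tagged corpus for one bigram containing it. Return sense #k only if
    every occurrence agrees on k and the pair occurs more than 3 times
    (i.e., at least min_count=4 times); otherwise return None."""
    counts = Counter(pair_occurrences)
    if len(counts) == 1:
        sense, n = counts.most_common(1)[0]
        if n >= min_count:
            return sense
    return None

# The ''approval of'' example: four occurrences, all with sense #1.
print(sense_from_pairs([1, 1, 1, 1]))   # sense #1 is assigned
print(sense_from_pairs([1, 2, 1, 1]))   # senses disagree: no decision
print(sense_from_pairs([1, 1]))         # too few occurrences: no decision
```

Requiring both unanimity and a minimum count is what keeps this procedure high-precision at the cost of coverage.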
The common words found between the noun-contexts of its senses and the text are: for diameter#1: { property, hole, ratio } and for diameter#2: { form }.</Paragraph> <Paragraph position="11"> For this text, the threshold was set to 1, and thus we pick diameter#1 as the correct sense (there is a difference larger than 1 between the number of nouns in the two sets).</Paragraph> <Paragraph position="12"> PROCEDURE 5. Find words which are semantically connected to the already disambiguated words, for which the connection distance is 0. The distance is computed based on the WordNet hierarchy; two words are semantically connected at a distance of 0 if they belong to the same synset.</Paragraph> <Paragraph position="13"> Example. Consider these two words appearing in the text to be disambiguated: authorize and clear. The verb authorize is a monosemous word, and thus it is disambiguated with procedure 2. One of the senses of the verb clear, namely sense #4, appears in the same synset with authorize#1, and thus clear is marked as having sense #4.</Paragraph> <Paragraph position="14"> PROCEDURE 6. Find words which are semantically connected to each other, and for which the connection distance is 0. This procedure is weaker than procedure 5: none of the words considered by this procedure are already disambiguated. We have to consider all the senses of both words in order to determine whether or not the distance between them is 0, and this makes this procedure computationally intensive. Example. For the words measure and bill, both of them ambiguous, this procedure tries to find two possible senses for these words which are at a distance of 0, i.e., they belong to the same synset. The senses found are measure#4 and bill#1, and thus the two words are marked with their corresponding senses.</Paragraph> <Paragraph position="15"> PROCEDURE 7. Find words which are semantically connected to the already disambiguated words, and for which the connection distance is at most 1. 
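The synset-distance test underlying procedures 5-8 can be sketched over a toy WordNet fragment. A real system would query WordNet itself; the data and function names below are illustrative only:

```python
# Toy WordNet fragment: each synset is a frozenset of (word, sense) pairs,
# and hypernym links connect synsets.
SYNSETS = {
    "clear_v4": frozenset({("authorize", 1), ("clear", 4)}),
    "committee_n1": frozenset({("committee", 1)}),
    "subcommittee_n1": frozenset({("subcommittee", 1)}),
}
HYPERNYM = {"subcommittee_n1": "committee_n1"}  # hyponym -> hypernym

def distance(word_sense_a, word_sense_b):
    """0 if the two (word, sense) pairs share a synset, 1 if their synsets
    are linked by a hypernymy/hyponymy relation, None otherwise."""
    syn_a = next(s for s, members in SYNSETS.items() if word_sense_a in members)
    syn_b = next(s for s, members in SYNSETS.items() if word_sense_b in members)
    if syn_a == syn_b:
        return 0
    if HYPERNYM.get(syn_a) == syn_b or HYPERNYM.get(syn_b) == syn_a:
        return 1
    return None

print(distance(("authorize", 1), ("clear", 4)))         # 0: same synset
print(distance(("subcommittee", 1), ("committee", 1)))  # 1: hypernymy link
```

Procedures 5 and 6 accept only distance 0; procedures 7 and 8 also accept distance 1. The quadratic cost noted for procedures 6 and 8 comes from testing every sense pair of two ambiguous words.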
Again, the distance is computed based on the WordNet hierarchy; two words are semantically connected at a maximum distance of 1 if they are synonyms or they are related by a hypernymy/hyponymy relation. Example. Consider the nouns subcommittee and committee. The first one is disambiguated with procedure 2, and thus it is marked with sense #1. The word committee with its sense #1 is semantically linked with the word subcommittee by a hypernymy relation. Hence, we semantically tag this word with sense #1.</Paragraph> <Paragraph position="16"> PROCEDURE 8. Find words which are semantically connected to each other, and for which the connection distance is at most 1.</Paragraph> <Paragraph position="17"> This procedure is similar to procedure 6: both words are ambiguous, and thus all their senses have to be considered in the process of finding the distance between them.</Paragraph> <Paragraph position="18"> Example. The words gift and donation are both ambiguous. This procedure finds gift with sense #1 as being the hypernym of donation, also with sense #1. Therefore, both words are disambiguated and marked with their assigned senses.</Paragraph> <Paragraph position="19"> The procedures presented above are applied iteratively. This allows us to identify a set of nouns and verbs which can be disambiguated with high precision. About 55% of the nouns and verbs are disambiguated with over 92% accuracy.</Paragraph> <Paragraph position="20"> Algorithm Step 1. Pre-process the text. This implies tokenization and part-of-speech tagging. The part-of-speech tagging task is performed with high accuracy using an improved version of Brill's tagger (Brill, 1992). At this step, we also identify the complex nominals, based on WordNet definitions. For example, the word sequence ''pipeline companies'' is found in WordNet and thus it is identified as a single concept. There is also a list of words which we do not attempt to disambiguate. 
These words are marked with a special flag to indicate that they should not be considered in the disambiguation process. So far, this list consists of three verbs: be, have, do.</Paragraph> <Paragraph position="21"> Step 2. Initialize the Set of Disambiguated Words (SDW) with the empty set SDW={}.</Paragraph> <Paragraph position="22"> Initialize the Set of Ambiguous Words (SAW) with the set formed by all the nouns and verbs in the input text.</Paragraph> <Paragraph position="23"> Step 3. Apply procedure 1. The named entities identified here are removed from SAW and added to SDW.</Paragraph> <Paragraph position="24"> Step 4. Apply procedure 2. The monosemous words found here are removed from SAW and added to SDW.</Paragraph> <Paragraph position="25"> Step 5. Apply procedure 3. This step allows us to disambiguate words based on their occurrence in the semantically tagged corpus. The words whose sense is identified with this procedure are removed from SAW and added to SDW.</Paragraph> <Paragraph position="26"> Step 6. Apply procedure 4. This will identify a set of nouns which can be disambiguated based on their noun-contexts.</Paragraph> <Paragraph position="27"> Step 7. Apply procedure 5. This procedure tries to identify a synonymy relation between the words from SAW and SDW. The words disambiguated are removed from SAW and added to SDW.</Paragraph> <Paragraph position="28"> Step 8. Apply procedure 6. This step is different from the previous one, as the synonymy relation is sought among words in SAW (no SDW words involved). The words disambiguated are removed from SAW and added to SDW.</Paragraph> <Paragraph position="29"> Step 9. Apply procedure 7. This step tries to identify words from SAW which are linked at a distance of maximum 1 with the words from SDW. Remove the words disambiguated from SAW and add them to SDW.</Paragraph> <Paragraph position="30"> Step 10. Apply procedure 8. This procedure finds words from SAW connected at a distance of maximum 1. 
As in step 8, no words from SDW are involved. The words disambiguated are removed from SAW and added to SDW.</Paragraph> <Section position="1" start_page="39" end_page="40" type="sub_section"> <SectionTitle> Results </SectionTitle> <Paragraph position="0"> To determine the accuracy and the recall of the disambiguation method presented here, we have performed tests on 6 randomly selected files from SemCor. The following files have been used: br-a01, br-a02, br-k01, br-k18, br-m02, br-r05. Each of these files was split into smaller files with a maximum of 15 lines each. This size limit is based on our observation that small contexts reduce the applicability of procedures 5-8, while large contexts become a source of errors. Thus, we have created a benchmark with 52 texts, on which we have tested the disambiguation method.</Paragraph> <Paragraph position="1"> In table 1, we present the results obtained for these 52 texts. The first column indicates the file for which the results are presented.</Paragraph> <Paragraph position="2"> The average number of nouns and verbs considered by the disambiguation algorithm for each of these files is shown in the second column. Columns 3 and 4 present the average number of words disambiguated with procedures 1 and 2, and the accuracy obtained with these procedures. Columns 5 and 6 present the average number of words disambiguated and the accuracy obtained after applying procedure 3 (cumulative results). The cumulative results obtained after applying procedures 3, 4 and 5, 6 and 7 are shown in columns 7 and 8, 9 and 10, and 10 and 11, respectively.</Paragraph> <Paragraph position="3"> The novelty of this method lies in the fact that the disambiguation process is done in an iterative manner. 
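The iterative loop over SDW and SAW described in the Algorithm above can be sketched as follows. The procedures are passed in as functions, and everything here is an illustrative reconstruction rather than the authors' code:

```python
def disambiguate(words, procedures):
    """words: the nouns and verbs of the input text. Each procedure maps
    (SAW, SDW) to a dict {word: sense} for the words it can disambiguate.
    Procedures are applied in order, moving newly tagged words from the
    Set of Ambiguous Words (SAW) to the Set of Disambiguated Words (SDW)."""
    saw = set(words)   # SAW: still ambiguous
    sdw = {}           # SDW: word -> assigned sense
    for proc in procedures:
        for word, sense in proc(saw, sdw).items():
            saw.discard(word)
            sdw[word] = sense
    return sdw, saw

# Toy stand-ins for procedures 1 and 2: named entities and a known
# monosemous word, both tagged with sense #1.
proc1 = lambda saw, sdw: {w: 1 for w in saw if w[0].isupper()}
proc2 = lambda saw, sdw: {w: 1 for w in saw if w == "subcommittee"}

sdw, saw = disambiguate(["Hudson", "subcommittee", "measure"], [proc1, proc2])
print(sdw)   # Hudson and subcommittee tagged with sense #1
print(saw)   # measure remains ambiguous
```

Because later procedures see the SDW built by earlier ones, precise early procedures (1-3) seed the context that the weaker connectivity procedures (5-8) rely on.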
Several procedures, described above, are applied so as to build a set of words which are disambiguated with high accuracy: 55% of the nouns and verbs are disambiguated with a precision of 92.22%.</Paragraph> <Paragraph position="4"> The most important improvements which are expected to be achieved on the WSD problem are precision and speed. In the case of our approach to WSD, we can also talk about the need for an increased recall, meaning that we want to obtain a larger number of words which can be disambiguated in the input text.</Paragraph> <Paragraph position="5"> The precision of more than 92% obtained during our experiments is very high, considering the fact that WordNet, which is the dictionary used for sense identification, is very fine grained and sometimes the senses are very close to each other. The accuracy obtained is close to the precision achieved by humans in sense disambiguation.</Paragraph> </Section> </Section> <Section position="6" start_page="40" end_page="41" type="metho"> <SectionTitle> 5 Indexing and Retrieval </SectionTitle> <Paragraph position="0"> The indexing process takes a group of document files and produces a new index. Such things as unique document identifiers, proper SGML tags, and other artificial constructs are ignored. In the current version of the system, we are using only the AND and OR boolean operators. Future versions will consider the implementation of the NOT and NEAR operators. 
The information obtained from the WSD module is used by the main indexing process, where the word stem and location are indexed along with the WordNet synset (if present).</Paragraph> <Paragraph position="1"> Collocations are indexed at each location where a member of the collocation occurs.</Paragraph> <Paragraph position="2"> All elements of the document are indexed.</Paragraph> <Paragraph position="3"> This includes, but is not limited to, dates, numbers, document identifiers, the stemmed words, collocations, WordNet synsets (if available), and even those terms which other indexers consider stop words. The only items currently excluded from the index are punctuation marks which are not part of a word or collocation.</Paragraph> <Paragraph position="4"> The benefit of this form of indexing is that documents may be retrieved using stemmed words, or using synset offsets. Using synset offset values has the added benefit of retrieving documents which do not contain the original stemmed word, but do contain synonyms of the original word.</Paragraph> <Paragraph position="5"> The retrieval process is limited to the use of the Boolean operators AND and OR. 
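The combined stem/synset indexing described above can be sketched with a small inverted index. All names and data below are illustrative, not the system's actual code:

```python
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: list of (stem, pos_tag, offset-or-None)}. Both the
    stem and, when a sense was assigned, an 'offset|POS' key are indexed,
    so a document can be found by lexical string or by synset."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for stem, tag, offset in tokens:
            index[stem].add(doc_id)
            if offset:
                index[f"{offset}|{tag}"].add(doc_id)
    return index

# Toy documents; the offset value is made up for illustration.
docs = {
    "d1": [("bank", "NN", "6800223")],  # sense assigned by the WSD module
    "d2": [("bank", "NN", None)],       # sense not assigned
}
index = build_index(docs)
print(index["bank"])          # both documents match the lexical string
print(index["6800223|NN"])    # only d1 matches the synset key
```

Indexing the synset key alongside the stem is what lets a query retrieve documents containing only synonyms of the original word: every member of a synset maps to the same offset key.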
There is an auxiliary front end to the retrieval engine which allows the user to enter a textual query, such as, &quot;What financial institutions are found along the banks of the Nile?&quot; The auxiliary front end will then use the WSD to disambiguate the query and build a Boolean query for the standard retrieval engine.</Paragraph> <Paragraph position="6"> For the preceding example, the auxiliary front end would build the query: (financial_institution OR 60031M|NN) AND (bank OR 6800223|NN) AND (Nile OR 6826174|NN) where the numbers in the query represent the offsets of the synsets in which the words with their determined meaning occur.</Paragraph> <Paragraph position="7"> Once a list of documents meeting the query requirements has been determined, the complete text of each matching document is retrieved and presented to the user.</Paragraph> </Section> <Section position="7" start_page="41" end_page="41" type="metho"> <SectionTitle> 6 An Example </SectionTitle> <Paragraph position="0"> Consider, for example, the following question: &quot;Has anyone investigated the effect of surface mass transfer on hypersonic viscous interactions?&quot;. The question processing involves part of speech tagging, stemming and word sense disambiguation.</Paragraph> <Paragraph position="1"> The question becomes: &quot;Has anyone investigate|VB|535831 the effect|NN|7766144 of surface|NN|3447223 mass|NN|3923435 transfer|NN|132095 on hypersonic|JJ viscous|JJ interaction|NN|7840572&quot;.</Paragraph> <Paragraph position="2"> The selection of the keywords is not an easy task, and it is performed using the set of 8 heuristics presented in (Moldovan et al., 1999). Because of space limitations, we are not going to detail here the heuristics and the algorithm used for keyword selection. The main idea is that an initial number of keywords is determined using a subset of these heuristics. 
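Once keywords have been selected and disambiguated, the Boolean query built by the auxiliary front end can be sketched as follows; the helper and its inputs are our own illustrative reconstruction, and the offsets are placeholders:

```python
def build_boolean_query(keywords):
    """keywords: list of (stem, pos_tag, offset-or-None) for the selected
    query terms. Each term becomes '(stem OR offset|POS)' when a sense was
    assigned, plain 'stem' otherwise; terms are joined with AND."""
    clauses = []
    for stem, tag, offset in keywords:
        if offset:
            clauses.append(f"({stem} OR {offset}|{tag})")
        else:
            clauses.append(stem)
    return " AND ".join(clauses)

# Toy keywords echoing the Nile example above (offsets illustrative).
query = build_boolean_query([
    ("bank", "NN", "6800223"),
    ("Nile", "NN", "6826174"),
])
print(query)  # (bank OR 6800223|NN) AND (Nile OR 6826174|NN)
```

The OR within each clause lets a term match either lexically or by synset, while the AND across clauses keeps the query conjunctive, matching the engine's two supported operators.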
If no documents are retrieved, more keywords are added; conversely, if too many documents are retrieved, some of the keywords are dropped, in the reverse order in which they were added. For each question, three types of query are formed, using the AND and OR operators.</Paragraph> <Paragraph position="3"> 1. QwNStem. Keywords from the question, stemmed based on WordNet, concatenated with the AND operator.</Paragraph> <Paragraph position="4"> 2. QwNOffset. Keywords from the question, stemmed based on WordNet, concatenated using the OR operator with the associated synset offset, and concatenated with the AND operator among them.</Paragraph> <Paragraph position="5"> 3. QwNHyperOffset. Keywords from the question, stemmed based on WordNet, concatenated using the OR operator with the associated synset offset and with the offset of the hypernym synset, and concatenated with the AND operator among them.</Paragraph> <Paragraph position="6"> All these types of queries are run against the semantic index created based on words and synset offsets. We denote these runs RWNStem, RWNOffset and RWNHyperOffset.</Paragraph> <Paragraph position="7"> The three query formats, for the given question, are presented below: QwNStem. effect AND surface AND mass Using the first type of query, 7 documents were found, out of which 1 was considered to be relevant. With the second and third types of query, we obtained 11 and 17 documents, respectively, out of which 4 were found relevant and actually contained the answer to the question.</Paragraph> <Paragraph position="8"> (sample answer) ... the present report gives an account of the development of an approximate theory to the problem of hypersonic strong viscous interaction on a flat plate with mass-transfer at the plate surface. the disturbance flow region is divided into inviscid and viscous flow regions .... (cranfield0305).</Paragraph> </Section> </Paper>