<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0502"> <Title>Langer S. and Hickey M. in preparation. Using Semantic Lexicons for Intelligent Message Retrieval in a Communication Aid. Submitted to Journal of Natural Language Engineering, special issue on Natural Language Processing for Communication</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 The WordKeys system </SectionTitle> <Paragraph position="0"> WordKeys is a system based on full text retrieval of pre-stored messages. It is typically used in two different settings: * When the user wants to prepare a communication, new messages are typed in. These messages are automatically indexed and integrated into the system's database.</Paragraph> <Paragraph position="1"> * In communication mode, WordKeys displays the search field, where the user can type in search words, the list of predicted input words, the list of messages found and the field containing the selected message.</Paragraph> <Paragraph position="2"> Figure 1 shows the overall architecture of the WordKeys system.</Paragraph> <Paragraph position="3"> WordKeys is implemented in C++. There is a strong emphasis on re-usability of the software, especially the lexicon modules, for other AAC systems. We have also taken care to make the system portable to languages other than English. The different lexicons are text files that conform to a simple and clearly specified format. They can be exchanged for lexicons in other languages.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Indexing </SectionTitle> <Paragraph position="0"> The WordKeys system can import any text file and add it to the message database. Additionally, at any stage of a conversation, the user can add a message to the database or modify an existing message. 
When a message is added to the database, the following actions are performed:</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="13" type="metho"> <SectionTitle> * Tokenization; </SectionTitle> <Paragraph position="0"> [Figure 1: architecture of the WordKeys system. Legible component labels: supervisor preferences, input of new messages, input query, morphological module, message database, lexicon, indexing, message text file, stoplist, suffixes, exceptions.] * Morphological analysis: word forms are analysed to find lemmas and roots and to determine their syntactic category; * The resulting words are looked up in the semantic lexicon to find frequent hypernyms, which are added to the list of index words; * The message with the list of index words is added to the database and its index.</Paragraph> <Section position="1" start_page="10" end_page="11" type="sub_section"> <SectionTitle> 3.2 Morphological analysis </SectionTitle> <Paragraph position="0"> Morphological analysers are available in the public domain. However, we decided to use a custom-programmed morphological module, because the output of the available analysers did not meet our needs and, at least for English, a simple analysis is relatively easy to implement. The data used for analysis is partially based on the WordNet morphological information. The morphological module uses an affix list in combination with an exception list and the information about syntactic categories from WordNet. The analysis of a word form is carried out in two steps:</Paragraph> <Paragraph position="2"> * lemmatization; * determination of the derivational root (only for semantically transparent derivational affixes).</Paragraph> <Paragraph position="3"> Lemmatization is lexicon-based. After the affix removal, the unaffixed form is looked up in the lexicon, considering the possible syntactic category returned by the affix removal process. 
Only if the form is found there is it accepted as a lemma and added to the message index. Word forms leading to several possible lemmas are currently not disambiguated. Apart from the lack of disambiguation, we achieved error-free lemmatization of all occurring word forms for a trial message database of about 1200 words.</Paragraph> <Paragraph position="4"> After the lemmatization procedure, a derivational analysis is carried out on the lemmatized word forms. We separate the two steps in order to be able to give the link between a word form and its lemma a higher weight in message access than links between morphologically complex words and their roots. Distinguishing between the results of inflectional and derivational analysis is consistent with the findings reported in Hull (1996). He concludes that complex stemming algorithms can be slightly more effective than simple ones, and that the removal of derivational affixes is not always desirable. This is especially true for a system such as WordKeys, which uses semantic relationships for retrieval and performs message ranking, which can increase the impact of inaccuracies in the morphological analysis. The semantic relation between a lemma and one of its word forms differs considerably from the semantic relation between a derived word and its root.</Paragraph> <Paragraph position="5"> To be able to determine semantically related words without loss of precision, information from the morphological analysis is also used to determine the morpho-syntactic categories of word forms and lemmas. 
The category can be clearly determined in the following cases: * a word has one single entry in the main lexicon, which means the word is already a lemma; * a word form has an inflectional or derivational affix which only occurs with bases of one single morpho-syntactic category.</Paragraph> <Paragraph position="6"> Removing ambiguities concerning syntactic categories has a certain impact on the performance of the semantic expansion module. The fewer words with inappropriate syntactic categories are included in the index, the higher the precision achieved by the system, because fewer expansions will be generated. For many word forms in the messages, however, the category remains ambiguous. Currently, we are investigating the use of stochastic taggers and local grammars for determining syntactic information in these cases.</Paragraph> </Section> <Section position="2" start_page="11" end_page="12" type="sub_section"> <SectionTitle> 3.3 Message ranking </SectionTitle> <Paragraph position="0"> When the user has typed in one or several key words and decides to start the search, the following tasks are carried out: * Tokenization: the content of the input field on the interface is parsed into word forms. * Lemmatization: word forms are analysed to be able to look them up in the lexicon.</Paragraph> <Paragraph position="1"> * The word forms and lemmas are looked up in the message index. If they are found, the corresponding message numbers are added to the list of retrieved messages.</Paragraph> <Paragraph position="2"> * The lemmas are looked up in the semantic lexicon to retrieve related words. The relations used for query expansion depend on the semantic paths defined in the settings. The related words are then used for a further query against the index of the message database.</Paragraph> <Paragraph position="3"> The messages which have been found are displayed on the screen in the order of their score. 
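The retrieval steps listed above can be illustrated with a minimal Python sketch. It is not the WordKeys implementation (which is written in C++): the toy lemma table, the related-word table, the index layout and all function names are assumptions introduced purely for illustration, and scoring is left out.

```python
import re

# Hypothetical toy data; the real system derives this information from
# WordNet and from the indexed message database.
LEMMAS = {"swimming": "swim", "swims": "swim", "cars": "car"}
RELATED = {"swim": ["dip"], "car": ["automobile"]}
INDEX = {  # index word -> message numbers
    "swim": {1, 2},
    "dip": {4},
    "automobile": {7},
}


def tokenize(query):
    """Split the content of the search field into lower-case word forms."""
    return re.findall(r"[a-z]+", query.lower())


def lemmatize(form):
    """Look the word form up in the toy lemma table (lexicon-based)."""
    return LEMMAS.get(form, form)


def retrieve(query):
    """Return a dict mapping message number to how the message was found."""
    found = {}
    for form in tokenize(query):
        lemma = lemmatize(form)
        # Direct lookup of word form and lemma in the message index.
        for word in (form, lemma):
            for msg in INDEX.get(word, ()):
                found.setdefault(msg, "direct")
        # Query expansion: related words are used for a further query.
        for word in RELATED.get(lemma, ()):
            for msg in INDEX.get(word, ()):
                found.setdefault(msg, "expanded")
    return found
```

With this toy data, retrieve("swimming") finds messages 1 and 2 directly and message 4 through expansion. Recording how each message was found keeps the sketch open to a later ranking step.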
Trials with a number of different settings for the message retrieval algorithm have been carried out to improve message ranking. The ranking algorithm ensures that messages which are retrieved, but are not considered very relevant for a query, are put lower in the list or excluded from the display. In line with the results of the trials, messages retrieved from the database are ranked according to the semantic distance between key word and index word. Semantic distance is zero at the beginning of the following list and increases down the list: * other derivation of the root of the key word (investigation -- investigate); * synonyms of key word (car -- automobile); * other related words: the semantic paths and their weighting are defined in the settings file. A path is the concatenation of semantic links that are used to get from the input key word to the index word.</Paragraph> <Paragraph position="4"> Table 1 gives the figures for the message ranking criteria applied in the case of one single key word. For several key words, a combination of the semantic distances for the different key words is used for ranking. When several key words are typed in, the message retrieval algorithm works with an OR link between search words. 
However, any message retrieved by more than one of the key words is given an increased score; the more key words a message is related to, the better its score.</Paragraph> </Section> <Section position="3" start_page="12" end_page="12" type="sub_section"> <SectionTitle> Description | Weight decrease | Comment </SectionTitle> <Paragraph position="0">
Word in message is same word form as input word | 0 | exact match, best rating
Word in message is lemmatized in index and matches input word | 1 | lemmatization leads to less semantic distance than derivational analysis
Word in message is reduced to root in index to match input word | 2 | derivational analysis
Semantically related word is looked up in lexicon | > 5 | depends on semantic relation
We will illustrate the message ranking with an example. The messages retrieved from an experimental database for the item swim are (in that order): (1) Would you like to go for a swim? (2) Normally I don't like swimming, but this Sunday it was so hot that I spent the whole day on the beach and in the water.</Paragraph> <Paragraph position="1"> (3) I'm not a very good swimmer.</Paragraph> <Paragraph position="2"> (4) Shall we go for a dip? The first message contains the key word itself; message (2) contains another word form of the same lemma. The third message in the list contains a derivation of the key word. Finally, message (4) is an example of retrieval through semantic query expansion. It contains a synonym (dip) of the key word.</Paragraph> </Section> <Section position="4" start_page="12" end_page="13" type="sub_section"> <SectionTitle> 3.4 The lexicon for query expansion </SectionTitle> <Paragraph position="0"> One purpose of the main lexicon in WordKeys is to serve as a lexical database for the indexing module when performing morphological analysis. The main function of this lexicon, however, is to serve as a basis for the semantic query expansion. 
To choose the right lexicon, we had to bear in mind that WordKeys is a retrieval system for unrestricted text. This implies that the system must be able to retrieve messages containing any word of the English language, apart from extremely domain-specific vocabulary.</Paragraph> <Paragraph position="1"> We decided to use the semantic database WordNet for the following reasons: * it is very comprehensive; * it contains most relevant semantic links; * the information contained in WordNet is stored in text files and can easily be converted to any other format.</Paragraph> <Paragraph position="2"> In order to use the information in WordNet for our text retrieval algorithm, some preparation was needed.</Paragraph> <Paragraph position="3"> * WordNet was converted to a format suitable for the WordKeys software. We chose a format which was easily portable: a text file containing lemmas together with their syntactic category and related words corresponding to the different senses; * The semantic paths that the WordKeys software uses for query expansion were defined. A semantic path is a series of semantic relations which can be used to reach a lemmatized message word from a lemmatized input key word.</Paragraph> <Paragraph position="4"> This also involved defining weights for the links in order to rank retrieved messages. 
For example, messages containing synonyms of key words receive a high rating, while those containing hypernyms are assigned a lower rating.</Paragraph> <Paragraph position="5"> Additionally, we included word frequency statistics in the main lexicon, in order to be able to retrieve hypernyms of words that are useful as index words. These are not necessarily the closest superordinate words in the WordNet hierarchy, but often words occurring several levels higher.</Paragraph> <Paragraph position="6"> Consequently, the following information is stored in each lexicon entry: * Syntactic category of the word, which is used for morphological analysis and semantic links.</Paragraph> <Paragraph position="7"> * Frequency (0 if the word is not included in the frequency list). The frequency is taken from a large database of mainly written text, the British National Corpus (BNC).</Paragraph> <Paragraph position="8"> The list contains the 8000 most frequent words in this corpus; the results of a comparison between a lexicon with frequency counts and a lexicon without word frequencies are summarized in the next section.</Paragraph> <Paragraph position="9"> * Links to other words in the lexicon, with a specification of the type of link (synonym, hyponym etc.).</Paragraph> </Section> </Section> </Paper>