<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0825">
<Title>Contextual Semantics for WSD</Title>
<Section position="3" start_page="0" end_page="0" type="metho">
<SectionTitle> 2 Combined approach </SectionTitle>
<Paragraph position="0"> The SinLexEn system is quite similar to the system used during the last Senseval-2 evaluation campaign (Crestan et al., 2001). It is based on two stages: the first stage uses three Semantic Classification Trees in parallel, trained on different context sizes; the second stage brings in a decision system based on information retrieval techniques. The novelty of this approach lies in the use of a semantic resource as a conceptual view on extended context in both stages.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 2.1 Data pre-processing </SectionTitle>
<Paragraph position="0"> The first step toward getting the most from the data is to lemmatize and clean the sentences. Each paragraph from the training and test data is first passed through an internal tagger/lemmatizer.</Paragraph>
<Paragraph position="1"> Then, some grammatical words are removed, such as articles and possessive pronouns. The only word not handled in this process is the word to be disambiguated: previous work (Loupy et al., 1998) has shown that the form of this word can bring interesting clues about its possible sense. Other pronouns, such as subject pronouns, are replaced by a generic PRP tag.</Paragraph>
</Section>
<Section position="3" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 2.2 Semantic Classification Trees for WSD </SectionTitle>
<Paragraph position="0"> Semantic Classification Trees (SCT) were first introduced by Kuhn and De Mori (1995). They can be defined as simple binary decision trees. Training data are used to build one or more trees for each word to be disambiguated. An SCT is made up of questions distributed along the tree nodes. Each test sequence is presented to the corresponding trees and follows a path along the SCT according to the way the questions are answered. When no more questions are available (a leaf has been reached), the majority sense is assigned to the test instance.</Paragraph>
<Paragraph position="2"> In order to build the trees, the Gini impurity (Breiman et al., 1984) was used. It is defined as:</Paragraph>
<Paragraph position="3"> G(X) = 1 - SUM_s [P(s/X)]^2 </Paragraph>
<Paragraph position="4"> where P(s/X) is the likelihood of sense s given the population X.</Paragraph>
<Paragraph position="5"> At the first step of the tree building process, the Gini impurity is computed for each possible question. Then, the best question is selected, and the population made up of all the examples is divided between the examples that answer the question (yes branch) and the others (no branch). The same process is applied recursively on each branch until the maximum tree depth is reached.</Paragraph>
<Paragraph position="6"> In the framework of the SinLexEn system, three different trees were built for each word to be disambiguated. They use different context sizes, varying from one to three words on each side of the target word.</Paragraph>
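To make the node-splitting step concrete, here is a minimal sketch in Python. It is an illustration under our own assumptions, not the paper's code: the function names and data layout are ours, the position#word encoding is the one illustrated in the next paragraph, and the summed-impurity criterion is the one made explicit in section 3 (standard CART would instead weight each branch by its population size).

    from collections import Counter

    def window_features(words, target_index, width):
        """Encode the context as position#word questions, e.g. '-1#make', '1#of'."""
        feats = set()
        for offset in range(-width, width + 1):
            pos = target_index + offset
            if offset != 0 and 0 <= pos < len(words):
                feats.add(f"{offset}#{words[pos]}")
        return feats

    def gini(senses):
        """Gini impurity G(X) = 1 - SUM_s P(s/X)^2 over a list of sense labels."""
        if not senses:
            return 0.0
        n = len(senses)
        return 1.0 - sum((c / n) ** 2 for c in Counter(senses).values())

    def best_question(examples):
        """Pick the question whose yes/no split has the lowest total impurity.

        `examples` is a list of (feature_set, sense) pairs.
        """
        questions = set().union(*(feats for feats, _ in examples))
        return min(
            questions,
            key=lambda q: gini([s for f, s in examples if q in f])
                        + gini([s for f, s in examples if q not in f]),
        )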
<Paragraph position="7"> Following is an example of three training sequences using respectively 1, 2 and 3 words on each side of the target (0#sense):
-1#make 0#sense 1#of
-2#make -1#more 0#sense 1#to 2#annex
-3#ceiling -2#add -1#to 0#sense 1#of 2#space 3#and
The number preceding the # character gives the position of the word relative to the target. The set of possible questions for the SCT building process is composed of all the words present in the considered window. The tree shown in Figure 1 was built for the word 'sense' with a window width of 3 words. Each node corresponds to a question, while leaves contain the sense to be assigned to the target. The test sequence -1#make 0#sense 1#of will be assigned the WordNet sense sense%1:10:00:: ("the meaning of a word or expression") (Miller et al., 1990). For a more detailed description of SCT, see (Crestan et al., 2003).</Paragraph>
</Section>
<Section position="4" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 2.3 Semantic in short context </SectionTitle>
<Paragraph position="0"> WSD is much easier for a human than for a machine. By simply reading a sentence, we can determine at a glance the meaning of a word in that particular context.</Paragraph>
<Paragraph position="1"> However, we do not rely solely on what the words look like. The human brain is able to see the correlation between 'open a box' and 'open a jar': we have the ability to generalize over similar "concepts". In order to follow this scheme, we used WordNet's semantic classes (SC), which enable a generalization over words sharing the same high-level hypernym. Because the correct SC is not known for each word in context, all the possible SCs were included in the set of questions for a given word and position.</Paragraph>
<Paragraph position="2"> The WordNet top ontology is divided into 26 SCs for nouns and 15 for verbs. An extended description of SCs can be found in (Ciaramita and Johnson, 2003). In the following example, the first two sequences ('open a box', 'open a jar') share the same sense ('cause to open or to become open'), whereas the last one ('open a business') corresponds to another sense ('start to operate or function or cause to start operating or functioning'). The words box and jar share semantic classes such as measure; however, they have nothing in common with the word business.</Paragraph>
<Paragraph position="3"> Although many wrong SCs are proposed for each word according to its context, we noticed a 2% improvement on Senseval-2 data when using this high-level information.</Paragraph>
</Section>
<Section position="5" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 2.4 Semantic in full context </SectionTitle>
<Paragraph position="0"> The main improvement for this evaluation is the use of semantic clues at the paragraph level. Sinequa has developed over the last five years a large-scale semantic dictionary of about 100,000 entries. All the words of the language are organized across a semantic space composed of 800 dimensions. For example, a word such as 'diary' is present in the dimensions calendar, story, book and newspaper. The dictionary has been widely used in the information retrieval system Intuition (Manigot and Pelletier, 1997), (Loupy et al., 2003).</Paragraph>
<Paragraph position="1"> For each training sample, we summed the semantic vectors of all its words. This step results in a global semantic vector, from which only the 3 most representative dimensions (those with the highest scores) were kept.</Paragraph>
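This step can be sketched as follows (a minimal illustration: SEMANTIC_DICT stands in for Sinequa's proprietary 800-dimension dictionary, so its entries here are invented placeholders, except 'diary', whose dimensions are the ones cited above).

    from collections import Counter

    # Stand-in for the proprietary dictionary: word -> {dimension: score}.
    SEMANTIC_DICT = {
        "diary": {"calendar": 1.0, "story": 1.0, "book": 1.0, "newspaper": 1.0},
        "newspaper": {"newspaper": 1.0, "press": 1.0},  # invented entry
    }

    def top_dimensions(words, k=3):
        """Sum the semantic vectors of all the words in a sample and
        keep the k highest-scoring dimensions."""
        total = Counter()
        for word in words:
            total.update(SEMANTIC_DICT.get(word, {}))
        return [dim for dim, _ in total.most_common(k)]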
<Paragraph position="2"> That additional information was used as a possible question in the tree building process. Then, the same semantic analysis was performed on each test sentence. For example, the major dimension represented in the following (lemmatized) sentence for the word material is 'newspaper': 'furthermore , nothing have yet be say about all the research that do not depend on the collection of datum by the sociologist ( primary datum ) but instead make use of secondary datum - the wealth of material already available from other source , such as government statistics , personal diary , newspaper , and other kind of information .'</Paragraph>
<Paragraph position="3"> This provides a view of the context on a wider scale than the one given by the short-context SCTs alone. Preliminary experiments carried out on the Senseval-2 nouns have shown a 1% improvement. Some nouns such as dyke, sense and spade were dramatically improved (by more than 5%); however, words such as authority and post lost about 5% in precision. A first hypothesis can be proposed to explain why some words gained while others lost precision: wide-context semantics is mostly beneficial in cases of homonymy, but not when dealing with polysemy.</Paragraph>
</Section>
<Section position="6" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 2.5 Semantic similarity for decision system </SectionTitle>
<Paragraph position="0"> In order to select the appropriate sense among the three senses proposed by the SCTs, a decision system was used. It is based on the Intuition search engine in its default mode: both the words and the semantic vectors of documents are used, and the final score is a combination of the word score and the semantic score.</Paragraph>
<Paragraph position="1"> All the sentences linked to a given sense in the training data were concatenated to form a unique document (pseudo-document). Then, for a given test instance, the whole paragraph was used to query the engine.</Paragraph>
<Paragraph position="2"> The pseudo-documents' scores were then used to select among the three senses proposed by the SCTs. A 2% improvement was observed during the Senseval-2 evaluation campaign when using this strategy.</Paragraph>
</Section>
</Section>
<Section position="4" start_page="0" end_page="0" type="metho">
<SectionTitle> 3 Maximum clues approach </SectionTitle>
<Paragraph position="0"> Starting from the same preprocessing used for the combined approach, we implemented a simple approach based on the Gini impurity. Considering a short context, the Gini impurity is computed for all the possible questions in the training data (including the questions about the semantic level). For instance, if the question -1#of appears 3 times with sense S1 and 1 time with sense S2, and does not appear in 1 example of sense S1 and 2 examples of sense S2, the final score for this question is:</Paragraph>
<Paragraph position="1"> G(-1#of) = [1 - ((3/4)^2 + (1/4)^2)] + [1 - ((1/3)^2 + (2/3)^2)] = 0.375 + 0.444 = 0.819 </Paragraph>
<Paragraph position="2"> which corresponds to the Gini impurity of the examples where -1#of is present, plus the Gini impurity of the examples where it is not. Then, a score is given to each sense according to each question.</Paragraph>
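As a sanity check on these figures, a minimal sketch (the helper name is ours; it assumes the plain-sum combination described just above):

    def gini_from_counts(counts):
        """G = 1 - SUM_s (n_s / N)^2, from per-sense example counts."""
        n = sum(counts)
        return 1.0 - sum((c / n) ** 2 for c in counts)

    # -1#of present: 3 examples of S1, 1 of S2; absent: 1 of S1, 2 of S2.
    g_yes = gini_from_counts([3, 1])  # 1 - (9/16 + 1/16) = 0.375
    g_no = gini_from_counts([1, 2])   # 1 - (1/9 + 4/9) = 0.444...
    score = g_yes + g_no              # = 0.819..., the value given above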
<Paragraph position="3"> For the previous example, the score of S1 for the question -1#of is: Score(S1, -1#of) = P(S1/-1#of) * [G - G(-1#of)] where G is the initial Gini impurity; the reduction G - G(-1#of) is weighted by the probability of S1 when -1#of was observed.</Paragraph>
<Paragraph position="4"> When disambiguating a test phrase, the score for each sense is computed by summing its individual scores over all questions. The highest score gives the sense.</Paragraph>
<Paragraph position="5"> This simple approach has shown results similar to those obtained with the combined approach on nouns. Unlike the trees, this system is able to benefit from all the clues in the training corpus. By contrast, with an SCT, if two questions get rather good scores at the first stage, only one question is selected to build the node. This prevents the clues of the other question from being used, because its population is (or might be) divided between the two branches.</Paragraph>
</Section>
</Paper>