<?xml version="1.0" standalone="yes"?> <Paper uid="W04-0816"> <Title>Anselmo Peñas Dpto. Lenguajes y Sistemas Informáticos UNED, Spain anselmo@lsi.uned.es</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Data </SectionTitle> <Paragraph position="0"> Each Lexical Sample Task has a relatively large training set with disambiguated examples. The test set has approximately half as many examples as the training data.</Paragraph> <Paragraph position="1"> Each example offers an ambiguous word and its surrounding context, where the average context window varies from language to language. Each training example gives one or more semantic labels for the ambiguous word, corresponding to the correct sense in that context.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Association for Computational Linguistics </SectionTitle> <Paragraph position="0"> SENSEVAL-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, Barcelona, Spain, July 2004.</Paragraph> <Paragraph position="1"> Senseval-3 provided the training data and the test data in XML format. The XML tagging conventions provide an excellent basis for processing the corpora, allowing the data to be browsed and transformed in a simple way. However, some of the XML well-formedness constraints are not completely satisfied. For example, there is no XML declaration and no root element in the English Lexical Sample documents. Once these shortcomings are fixed, any XML parser can read and process the data.</Paragraph> <Paragraph position="2"> Despite the structural similarity of the lexical sample corpora across languages, we found a heterogeneous vocabulary in both the XML tags and the attributes, forcing us to develop ad hoc parsers for each language. 
We would have welcomed a common, public document type definition for all the tasks.</Paragraph> <Paragraph position="3"> Sense codification is another area where different solutions were adopted. In the English corpus, nouns and adjectives are annotated with the WordNet 1.7.1 classification1 (Fellbaum, 1998), while the verbs are based on Wordsmyth2 (Scott, 1997). In the Catalan and Spanish tasks the sense inventory gives a more coarse-grained classification than WordNet. Both tasks provide a dictionary with additional information such as examples, typical collocations and the equivalent WordNet 1.5 synsets. Finally, the Italian sense inventory is based on the MultiWordNet dictionary3 (Pianta et al., 2002). Unlike the other languages mentioned, the Italian task does not provide the dictionary as a separate file. Besides the training data provided by Senseval, we have used the SemCor collection (Miller et al., 1993), in which every word is already tagged with its part of speech and its WordNet sense and synset.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Preprocessing </SectionTitle> <Paragraph position="0"> A tokenized version of the Catalan, Spanish and Italian corpora has been provided, in which every word is tagged with its lemma and part-of-speech tag. This information was manually annotated by human assessors for the Catalan and Spanish corpora, while the Italian corpus was processed automatically with the TnT POS tagger4 (Brants, 2000), producing similar tags.</Paragraph> <Paragraph position="1"> The English data lacked this information, leading us to apply the TreeTagger5 (Schmid, 1994) tool to the training and test data as a preliminary step before disambiguation.</Paragraph> <Paragraph position="2"> Since the SemCor collection is already tagged, its preprocessing consisted of segmenting the texts by the paragraph tag, obtaining 5382 different fragments. 
Each paragraph of SemCor has been used as a separate training example for the English lexical sample task. We applied the mapping provided by Senseval to represent verbs according to the verb inventory used in Senseval-3.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Approach </SectionTitle> <Paragraph position="0"> The supervised UNED WSD system is an exemplar-based classifier that performs the disambiguation task by measuring the similarity between a new instance and the representations of labelled examples. However, instead of representing contexts as bags of terms and defining a similarity measure between the new context and the training contexts, we propose a representation of terms as bags of contexts and the definition of a similarity measure between terms. Thus, words, lemmas and senses are represented in the same space, where similarity measures can be defined between them. We call this space the Context Space. A new disambiguation context (a bag of words) is transformed into the Context Space by the inner product, becoming a kind of abstract term that can be compared with the individual senses represented in the same Context Space.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Representation </SectionTitle> <Paragraph position="0"> The training corpus is represented in the usual way as a term-by-context matrix A of weights wij, where each column corresponds to a training context cj.</Paragraph> <Paragraph position="1"> A new instance q, represented by the vector of weights (w1q,...,wiq,...,wTq), is transformed into a vector in the context space vectorq = (q1,...,qj,...,qN), where vectorq is given by the usual inner product vectorq = q * A (Figure 1).</Paragraph> <Paragraph position="3"> If the vectors cj (the columns of matrix A) and the vector q (the original test context) are normalized to unit length, then each qj becomes the cosine between the vectors q and cj. 
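As an illustration, the transformation of a new instance into the Context Space and the selection of the most similar sense can be sketched as follows. This is a minimal toy example with invented data, matrix sizes and names; it is not the authors' implementation.

```python
import numpy as np

# Term-by-context matrix A: rows = lemmas (T terms), columns = training
# contexts (N contexts); here w_ij = 1 if lemma i occurs in context j.
# The lemmas and contexts are invented for illustration.
A = np.array([
    [1, 0, 1],   # lemma 1
    [1, 1, 0],   # lemma 2
    [0, 1, 1],   # lemma 3
], dtype=float)

# Normalize each column c_j to unit length so that the inner product
# q . A yields cosines q_j = cos(q, c_j).
A_norm = A / np.linalg.norm(A, axis=0)

# New instance q as a bag-of-lemmas weight vector, also unit-normalized.
q = np.array([1.0, 1.0, 0.0])
q = q / np.linalg.norm(q)

# Transform q into the Context Space: q_vec[j] = cos(q, c_j).
q_vec = q @ A_norm

# Candidate senses of the ambiguous lemma as binary indicator vectors
# over training contexts (1 = that context is tagged with the sense).
senses = {
    "sense1": np.array([1.0, 1.0, 0.0]),   # tagged in contexts 0 and 1
    "sense2": np.array([0.0, 0.0, 1.0]),   # tagged in context 2
}

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# argmax_k sim(sen_k, q_vec): pick the sense closest to q in the
# Context Space.
best = max(senses, key=lambda s: cos(senses[s], q_vec))
```

With these toy vectors, q shares lemmas with contexts 0 and 1, so the sense tagged in those contexts wins.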
More formally, when both vectors are normalized, qj = q * cj = cos(q, cj).</Paragraph> <Paragraph position="5"> At this point, both the senses and the representation of the new instance vectorq are represented in the same context space (Figure 2), and a similarity measure can be defined between them: sim(vectorsenik, vectorq), where senik is the k-th candidate sense for the ambiguous lemma lemi. Each component j of vectorsenik is set to 1 if lemma lemi is used with sense senik in the training context j, and to 0 otherwise.</Paragraph> <Paragraph position="6"> For a new context of the ambiguous lemma lemi, the candidate sense with the highest similarity is selected: argmaxk sim(vectorsenik, vectorq)</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Bag of words versus bag of contexts </SectionTitle> <Paragraph position="0"> Table 1 shows experimental results on the English Lexical Sample test of Senseval-3. The system has been trained with the Senseval-3 data and the SemCor collection. The Senseval training data has been lemmatized and tagged with TreeTagger. Only nouns and adjectives, in their canonical form, have been considered.</Paragraph> <Paragraph position="1"> Three different weights wij have been tested: * Co-occurrence: wij and wiq are set to {0,1} depending on whether lemi is present or not in context cj and in the new instance q, respectively. 
After the inner product q * A, the components qj of vectorq contain the number of co-occurrences of different lemmas in both q and the training context cj.</Paragraph> <Paragraph position="3"> * tf.idf: wij = tfij * log(N/dfi), a standard tf.idf weight, where tfij is the frequency of lemi in context cj and dfi is the number of contexts that contain lemi.</Paragraph> <Paragraph position="4"> These weights have been normalized (wij/||cj||), so that the inner product q * A generates a vector vectorq of cosines as described above, where qj is the cosine between q and the context cj.</Paragraph> <Paragraph position="5"> Two similarity measures have been compared.</Paragraph> <Paragraph position="6"> The first one (maximum) is a similarity of q as a bag of words with the training contexts of sense sen: the similarity with sense sen is the highest similarity (cosine) between q (as a bag of words) and each of the training contexts (as bags of words) for sense sen.</Paragraph> <Paragraph position="8"> The second one (cosine) is the similarity of sense sen with vectorq in the Context Space: the similarity with sense sen is the cosine between vectorq and vectorsen. Table 1 shows that almost all the results improve when the similarity measure (cosine) is applied in the Context Space. The exception is the use of co-occurrences to disambiguate nouns. This exception led us to explore an alternative similarity measure aimed at improving the results on nouns. The following sections describe this new similarity measure and the criteria underlying it.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Criteria for the similarity measure </SectionTitle> <Paragraph position="0"> Co-occurrences perform quite well for disambiguating nouns, as shown in the experiment above. 
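The two similarity measures compared above, maximum over a sense's training contexts versus cosine in the Context Space, can be sketched as follows. The data, dimensions and names are invented for illustration; this is not the authors' code.

```python
import numpy as np

# Column-normalized term-by-context matrix A (3 lemmas x 4 contexts),
# with invented toy weights.
A = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
], dtype=float)
A = A / np.linalg.norm(A, axis=0)

# Unit-normalized bag-of-words vector for the new instance q.
q = np.array([1.0, 0.0, 1.0])
q = q / np.linalg.norm(q)

# Context-space representation: q_vec[j] = cos(q, c_j).
q_vec = q @ A

# Indices of the training contexts tagged with each candidate sense.
contexts_of = {"sense1": [0, 1], "sense2": [2, 3]}

def maximum_sim(sense):
    # "maximum": highest cosine between q and any training context
    # tagged with this sense.
    return max(q_vec[j] for j in contexts_of[sense])

def cosine_sim(sense):
    # "cosine": cosine in the Context Space between the binary sense
    # vector and q_vec.
    s = np.zeros(len(q_vec))
    s[contexts_of[sense]] = 1.0
    return float(s @ q_vec / (np.linalg.norm(s) * np.linalg.norm(q_vec)))

best_max = max(contexts_of, key=maximum_sim)
best_cos = max(contexts_of, key=cosine_sim)
```

On this toy data both measures agree; the paper's point is that on the real corpora they can disagree, with cosine winning in most cases except co-occurrence-weighted nouns.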
However, considering co-occurrences in the Context Space permits cumulative measures: instead of selecting the candidate sense associated with the single training context that has the maximum number of co-occurrences, we can consider the co-occurrences of q with all the contexts. The weights and the similarity function have been defined to satisfy the following criteria: 1. Select the sense senk assigned to the most training contexts cj that have the maximum number of co-occurrences with the test context q. For example, if sense sen1 has two training contexts with the highest number of co-occurrences and sense sen2 has only one with the same number of co-occurrences, sen1 must receive a higher value than sen2.</Paragraph> <Paragraph position="1"> 2. Try to avoid labelling inconsistencies in the training corpus. There are some training examples where the same ambiguous word is used with the same meaning but tagged with different senses by the human assessors.</Paragraph> <Paragraph position="2"> Table 2 shows an example of this kind of inconsistency.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.4 Similarity measure </SectionTitle> <Paragraph position="0"> We assign the weights wij and wiq so that vectorq is a vector of co-occurrence counts, where qj is the number of different nouns and adjectives that co-occur in q and the training context cj. In this way, wij is set to 1 if lemi is present in the context cj. 
Analogously for the new instance q, wiq is set to 1 if lemi is present in q and it is set to 0 otherwise.</Paragraph> <Paragraph position="2"> According to the second criterium, if there is only one context c1 with the higher number of co-occurrences with q, then we reduce the value of this context by reducing artificially its number of co-occurrences: Being c2 a context with the second higher number of co-occurrences with q, then we assign to the first context c1 the number of co-occurrences of context c2.</Paragraph> <Paragraph position="3"> After this slight modification of vectorq we implement the similarity measure between vectorq and a sense senk according to the first criterium:</Paragraph> <Paragraph position="5"> Finally, for a new context of lemi we select the candidate sense that gives more value to the similarity measure: argmaxk sim( vectorsenk,vectorq) <answer instance=&quot;grano.n.1&quot; senseid=&quot;grano.4&quot;/> <previous> La Federacin Nacional de Cafeteros de Colombia explic que el nuevo valor fue establecido con base en el menor de los precios de reintegro mnimo de grano del pas de los ltimos tres das, y que fue de 1,3220 dlares la libra, que fue el que alcanz hoy en Nueva York, y tambin en la tasa representativa del mercado para esta misma fecha (1.873,77 pesos por dlar). </previous> <target> El precio interno del caf colombiano permaneci sin modificacin hasta el 10 de noviembre de 1999, cuando las autoridades cafetaleras retomaron el denominado &quot;sistema de ajuste automtico&quot;, que tiene como referencia la cotizacin del <head>grano</head> nacional en los mercados internacionales. </target> <answer instance=&quot;grano.n.9&quot; senseid=&quot;grano.3&quot;/> <previous> La carga qued para maana en 376.875 pesos (193,41 dlares) frente a los 375.000 pesos (192,44 dlares) que rigi hasta hoy. 
</previous> <target> El reajuste al alza fue adoptado por el Comité de Precios de la Federación, que fijará el precio interno diariamente a partir de este lunes tomando en cuenta la cotización del <head>grano</head> en el mercado de Nueva York y la tasa de cambio del día, que para hoy fueron de 1,2613 dólares la libra y 1.948,60 pesos por dólar </target> Table 3 shows the results of this measure under the same conditions as the experiments in Table 1.</Paragraph> <Paragraph position="6"> Comparing the results in both tables, we observe that the new similarity measure behaves better only for the disambiguation of nouns. However, the difference is big enough to improve the overall results. The application of the second criterion (trying to avoid labelling inconsistencies) also improves the results, as shown in Tables 3 and 4. Table 4 shows the effect of applying this second criterion to all the languages in which we participated. With the exception of Catalan, all results improve slightly (about 1%) after the filtering of singular labelled contexts.</Paragraph> <Paragraph position="7"> Although this behavior is regular, the improvement is not statistically significant.</Paragraph> </Section> </Section> </Paper>