<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1036"> <Title>Enhancing electronic dictionaries with an index based on associations</Title> <Section position="4" start_page="281" end_page="283" type="metho"> <SectionTitle> 3 Accessing the target word by navigat- </SectionTitle> <Paragraph position="0"> ing in a huge associative network If one agrees with what we have just said, one could view the mental lexicon as a huge semantic network composed of nodes (words and concepts) and links (associations), with either being able to activate the other . Finding a word involves entering the network and following the links leading from the source node (the first word that comes to your mind) to the target word (the one you are looking for). Suppose you wanted to find the word nurse (target word), yet the only token coming to your mind is hospital. In this case the system would generate internally a graph with the source word at the center and all the associated words at the periphery. Put differently, the system would build internally a semantic network with hospital in the center and all its associated words as satellites (see Figure 1, next page).</Paragraph> <Paragraph position="1"> Obviously, the greater the number of associations, the more complex the graph. Given the diversity of situations in which a given object may occur we are likely to build many associations. In other words, lexical graphs tend to become complex, too complex to be a good representation to support navigation. Readability is hampered by at least two factors: high connectivity (the great number of links or associations emanating from each word), and distribution: conceptually related nodes, that is, nodes activated by the same kind of association are scattered around, that is, they do not necessarily occur next to each other, which is quite confusing for the user. In order to solve this problem, we suggest to display by category (chunks) all the words linked by the same kind of association to the source word (see Figure 2). Hence, rather than displaying all the connected words as a flat list, we suggest to present them in chunks to allow for categorial search. Having chosen a category, the user will be presented a list of words or categories from which he must choose. If the target word is in the category chosen by the user (suppose he looked for a hypernym, hence he checked the ISA-bag), search stops, otherwise it continues. The user could choose either another category (e.g. AKO or TIORA), or a word in the current list, which would then become the new starting point.</Paragraph> <Paragraph position="2"> While the links in our brain may only be weighted, they need to be labelled to become interpretable for human beings using them for navigational purposes in a lexicon.</Paragraph> <Paragraph position="4"> l i st of po t e nt i a l t a r ge t w o r ds ( LO P TW ) .. .</Paragraph> <Paragraph position="5"> A b s t r a c t r e pr es en t a t i on of t he s e a r c h g r a p h ho s pi t a l TI O R A IS A AK O c lin ic , s an a to r i um , . .. mi lit ar y h os pit al , p sy chi at ric h osp it al in ma t eS YN O N YM nur s e doc to r, .. .</Paragraph> <Paragraph position="6"> pat i e n t .. .</Paragraph> <Paragraph position="7"> A c on cr e t e e xam pl e F ig u r e 2 : P r opos e d c a ndi da t e s , g r oupe d by f a m il y , i. e . 
<Paragraph position="4"> As one can see, the fact that the links are labeled has some very important consequences: (a) While maintaining the power of a highly connected graph (possible cyclic navigation), it has at the interface level the simplicity of a tree: each node points only to data of the same type, i.e. to the same kind of association.</Paragraph> <Paragraph position="5"> (b) With words being presented in clusters, navigation can be accomplished by clicking on the appropriate category. The assumption is that the user generally knows to which category the target word belongs (or at least, he can recognize within which of the listed categories it falls), and that categorial search is in principle faster than search in a huge list of unordered (or alphabetically ordered) words.</Paragraph> <Paragraph position="6"> Obviously, in order to allow for this kind of access, the resource has to be built accordingly. This requires at least two things: (a) indexing words by the associations they evoke, and (b) identifying and labelling the most frequent/useful associations. (Even though this point is very important, at this stage we shall not worry too much about the names given to the links. Indeed, one might question nearly all of them. What is important is the underlying rationale: help users to navigate on the basis of symbolically qualified links. In reality, a whole set of words (synonyms, of course, but not only) could amount to a link, i.e. be its conceptual equivalent.) This is precisely our goal. Actually, we propose to build an associative network by enriching an existing electronic dictionary (essentially) with (syntagmatic) associations coming from a corpus, representing the average citizen's shared, basic knowledge of the world (encyclopaedia). While some associations are too complex to be extracted automatically, others are clearly within reach. We will illustrate in the next section how this can be achieved.</Paragraph> </Section> <Section position="5" start_page="283" end_page="286" type="metho"> <SectionTitle> 4 Automatic extraction of topical relations </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="283" end_page="283" type="sub_section"> <SectionTitle> 4.1 Definition of the problem </SectionTitle> <Paragraph position="0"> We have argued in the previous sections that dictionaries must contain many kinds of relations on the syntagmatic and paradigmatic axes to allow for natural and flexible access to words. Synonymy, hypernymy or meronymy clearly fall into the latter (paradigmatic) category, and well-known resources like WordNet (Miller, 1995), EuroWordNet (Vossen, 1998) or MindNet (Richardson et al., 1998) contain them.
However, as various researchers have pointed out (Harabagiu et al., 1999), these networks lack information, in particular with regard to syntagmatic associations, which are generally unsystematic. The latter, called TIORA (Zock and Bilac, 2004) or topical relations (Ferret, 2002), account for the fact that two words refer to the same topic or take part in the same situation or scenario. Word pairs like doctor-hospital, burglar-policeman or plane-airport are cases in point. The lack of such topical relations in resources like WordNet has been dubbed the tennis problem (Roger Chaffin, cited in Fellbaum, 1998). Some of these links have recently been introduced in WordNet via the domain relation, yet their number remains very small. For instance, WordNet 2.1 does not contain any of the three associations mentioned above, despite their high frequency.</Paragraph> <Paragraph position="1"> The lack of systematicity of these topical relations makes their extraction and typing very difficult on a large scale. This is why some researchers have proposed using automatic learning techniques to extend lexical networks like WordNet. In (Harabagiu & Moldovan, 1998), this was done by extracting topical relations from the glosses associated with the synsets. Other researchers used external sources: Mandala et al. (1999) integrated co-occurrences and a thesaurus into WordNet for query expansion; Agirre et al. (2001) built topic signatures from texts in relation to synsets; Magnini and Cavaglia (2000) annotated the synsets with Subject Field Codes. This last idea has been taken up and extended by (Avancini et al., 2003), who expanded the domains built from this annotation.</Paragraph> <Paragraph position="2"> Despite the improvements, all these approaches are limited by the fact that they rely too heavily on WordNet and some of its more sophisticated features (such as the definitions associated with the synsets). While often exploited by acquisition methods, these features are generally lacking in similar lexico-semantic networks. Moreover, these methods attempt to learn topical knowledge from a lexical network rather than topical relations. Since our goal is different, we have chosen not to rely on any significant resource, all the more so as we would like our method to be applicable to a wide array of languages. Consequently, we took an incremental approach (Ferret, 2006): starting from a network of lexical co-occurrences collected from a large corpus, we used these co-occurrences to select potential topical relations by means of a topical analyzer.</Paragraph> </Section> <Section position="2" start_page="283" end_page="284" type="sub_section"> <SectionTitle> 4.2 From a network of co-occurrences to a set of Topical Units </SectionTitle> <Paragraph position="0"> We start by extracting lexical co-occurrences from a corpus in order to build a network. To this end we follow the method introduced by (Church and Hanks, 1990), i.e. we slide a window of a given size over texts. The parameters of this extraction were set so as to catch the most obvious topical relations: the window was fairly large (20 words wide), and while it took text boundaries into account, it ignored the order of the co-occurrences. Like (Church and Hanks, 1990), we used mutual information to measure the cohesion between two words.</Paragraph>
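As a rough illustration of this extraction step, here is a minimal Python sketch of window-based co-occurrence collection weighted by pointwise mutual information; the corpus format, the function name and the min_freq cut-off are our own assumptions, not the exact settings of the experiments:

    import math
    from collections import Counter

    def cooccurrence_network(texts, window=20, min_freq=5):
        """Collect unordered co-occurrences with a sliding window and weight
        them with pointwise mutual information (PMI).

        `texts` is a list of texts, each a list of lemmatized plain words;
        the window never crosses text boundaries and pair order is ignored.
        `min_freq` is an illustrative cut-off, not the paper's setting.
        """
        word_freq, pair_freq = Counter(), Counter()
        total = 0
        for words in texts:
            total += len(words)
            word_freq.update(words)
            for i, w in enumerate(words):
                # pair the current word with the following words in the window
                for v in words[i + 1:i + window]:
                    if v != w:
                        pair_freq[frozenset((w, v))] += 1

        def pmi(pair):
            a, b = tuple(pair)
            p_a, p_b = word_freq[a] / total, word_freq[b] / total
            return math.log2((pair_freq[pair] / total) / (p_a * p_b))

        # the paper further normalises this measure by the maximal mutual
        # information observed on the corpus (see the following sentence)
        return {pair: pmi(pair) for pair, f in pair_freq.items() if f >= min_freq}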
<Paragraph position="1"> The finite size of the corpus allows us to normalize this measure with respect to the maximal mutual information relative to the corpus.</Paragraph> <Paragraph position="2"> This network, which is just another view of the set of co-occurrences (its nodes are the co-occurrent words and its edges the co-occurrence relations), is used by TOPICOLL (Ferret, 2002), a topic analyzer that performs three tasks simultaneously, all of them relevant for our goal: it segments texts into topically homogeneous segments; it selects in each segment the most representative words of its topic; and it proposes a restricted set of words from the co-occurrence network to expand the selected words of the segment.</Paragraph> <Paragraph position="3"> These three tasks rely on a common mechanism: a window is moved over the text to be analyzed in order to limit the focus space of the analysis. This focus space contains a lemmatized version of the text's plain words. For each position of the window, we select only those words of the co-occurrence network that are linked to at least three other words of the window (see Figure 3). This leads to selecting both words that are in the window (first-order co-occurrents) and words coming from the network (second-order co-occurrents). The number of links between the selected words of the network, called expansion words, and those of the window is a good indicator of the topical coherence of the window's content. Hence, when this number is small, a segment boundary can be assumed. This is the basic principle underlying our topic analyzer.</Paragraph> <Paragraph position="4"> Figure 3: Selection of words from the co-occurrence network.</Paragraph> <Paragraph position="5"> The words selected for each position of the window are added up, and only those occurring in at least 75% of the positions of the segment are kept. This reduces the number of words selected through non-topical co-occurrences. Once a corpus has been processed by TOPICOLL, we obtain a set of segments and, for each of them, a set of expansion words. The association of the selected words of a segment with its expansion words is called a Topical Unit. Since both sets of words are selected for reasons of topical homogeneity, a co-occurrence between them is more likely to be a topical relation than in our initial network.</Paragraph> </Section> <Section position="3" start_page="284" end_page="285" type="sub_section"> <SectionTitle> 4.3 Filtering of Topical Units </SectionTitle> <Paragraph position="0"> Before recording the co-occurrences of the Topical Units built in this way, the units are filtered twice. The first filter aims at discarding heterogeneous Topical Units, which can arise as a side effect of a document whose topics are so intermingled that it is impossible to get a reliable linear segmentation of the text. We consider this to be the case when, for a given text segment, no word can be selected as representative of the segment's topic. Moreover, we only keep the Topical Units that contain at least two words from their original segment: a topic is defined here as a configuration of words, and the identification of such a configuration cannot be based on a single word alone.</Paragraph> <Paragraph position="1"> The second filter is applied to the expansion words of each Topical Unit in order to increase their topical homogeneity. The principle of this filtering is the same as that of their selection described in Section 4.2: an expansion word is kept only if it is linked in the co-occurrence network to at least three text words of the Topical Unit.</Paragraph>
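The selection criterion shared by the expansion step of Section 4.2 and by this second filter can be sketched as follows (a toy illustration under our own naming; the co-occurrence network is represented here as a plain dictionary of neighbour sets):

    def select_expansion_words(unit_words, network, min_links=3):
        """Keep the words of the co-occurrence network that are linked to at
        least `min_links` distinct words of the current window or Topical Unit.
        `network` maps every word to the set of its co-occurrents."""
        unit_words = set(unit_words)
        selected = {}
        for word, neighbours in network.items():
            links = neighbours & unit_words
            if len(links) >= min_links:
                selected[word] = links
        return selected

    # Toy run echoing Table 1: 'écrouer' is kept because the network links it
    # to three words of the unit (juge, policier, enquête).
    network = {"écrouer": {"juge", "policier", "enquête", "prison"}}
    print(select_expansion_words({"juge", "policier", "enquête", "procès"}, network))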
<Paragraph position="2"> Moreover, a selective threshold is applied to the frequency and the cohesion of the co-occurrences supporting these links: only co-occurrences whose frequency and cohesion are greater than or equal to 15 and 0.15, respectively, are used. For instance, in Table 1, which shows an example of a Topical Unit after its filtering, écrouer (to imprison) is selected because it is linked in the co-occurrence network to the following words of the text: juge (judge): 52 (frequency), 0.17 (cohesion); policier (policeman): 56, 0.17; enquête (investigation): 42, 0.16.</Paragraph> <Paragraph position="3"> Table 1: Example of a Topical Unit after its filtering (words with their frequencies).</Paragraph> </Section> <Section position="4" start_page="285" end_page="285" type="sub_section"> <SectionTitle> 4.4 From Topical Units to a network of topical relations </SectionTitle> <Paragraph position="0"> After the filtering, a Topical Unit gathers a set of words that are supposed to be strongly coherent from the topical point of view. Next, we record the co-occurrences between these words for all the Topical Units remaining after filtering. Hence, we get a large set of topical co-occurrences, even though a significant number of non-topical co-occurrences remains, since the filtering of Topical Units is an unsupervised process.</Paragraph> <Paragraph position="1"> The frequency of a co-occurrence is in this case given by the number of Topical Units containing both words; no distinction is made concerning the origin of the words within the Topical Units.</Paragraph> <Paragraph position="2"> The network of topical co-occurrences built from the Topical Units is essentially a subset of the initial network. However, it also contains co-occurrences that are not part of it, i.e. co-occurrences that were not extracted from the corpus used for building the initial network, or whose frequency in this corpus was too low. Only some of these "new" co-occurrences are topical. Since it is difficult to estimate globally which ones are interesting, we have decided to focus only on the co-occurrences of the topical network that are already present in the initial network.</Paragraph> <Paragraph position="3"> Thus, we only use the network of topical co-occurrences as a filter for the initial co-occurrence network. Before doing so, we filter the topical network itself in order to discard co-occurrences whose frequency is too low, that is, co-occurrences that are unstable and not representative. Based on the use of the final network by TOPICOLL (see Section 4.5), this threshold was set experimentally to 5. Finally, the initial network is filtered by keeping only the co-occurrences present in the topical network; their frequency and cohesion are taken from the initial network.</Paragraph> <Paragraph position="4"> While the frequencies given by the topical network are potentially interesting for their topical significance, we do not use them because the results of the filtering of Topical Units are too hard to evaluate.</Paragraph> </Section> <Section position="5" start_page="285" end_page="286" type="sub_section"> <SectionTitle> 4.5 Results and evaluation </SectionTitle> <Paragraph position="0"> We applied the method described here to an initial co-occurrence network extracted from a corpus of 24 months of Le Monde, a major French newspaper. The size of the corpus was around 39 million words. The initial network contained 18,958 words and 341,549 relations. The first run produced 382,208 Topical Units; after filtering, we kept 59% of them.</Paragraph>
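For reference, the network-filtering stage of Section 4.4 that produced the final network evaluated below can be summarised by the following sketch; the data structures and names are our own, and only the frequency threshold of 5 comes from the text:

    from collections import Counter
    from itertools import combinations

    def filter_initial_network(initial_edges, topical_units, min_topical_freq=5):
        """Filter the initial co-occurrence network with the topical network.

        `initial_edges` maps frozenset({a, b}) to a (frequency, cohesion) pair;
        `topical_units` is an iterable of word sets, one per filtered Topical
        Unit.  Only the frequency threshold of 5 comes from the text above.
        """
        # frequency of a topical co-occurrence = number of Topical Units
        # containing both words, whatever the origin of the words
        topical_freq = Counter()
        for unit in topical_units:
            for a, b in combinations(sorted(unit), 2):
                topical_freq[frozenset((a, b))] += 1

        # keep only initial co-occurrences that are frequent enough in the
        # topical network; frequency and cohesion come from the initial network
        return {pair: values for pair, values in initial_edges.items()
                if topical_freq[pair] >= min_topical_freq}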
<Paragraph position="1"> The network built from these Topical Units was made of 11,674 words and 2,864,473 co-occurrences. 70% of these co-occurrences were new with regard to the initial network and were discarded. Finally, we got a filtered network of 7,160 words and 183,074 relations, which represents a cut of 46% of the initial network. A qualitative study showed that most of the discarded relations are non-topical.</Paragraph> <Paragraph position="2"> This is illustrated by Table 2, which gives, among the co-occurrents of the word acteur (actor) with a high cohesion (equal to 0.16), those that are filtered out by our method. For instance, the words cynique (cynical) or allocataire (beneficiary) are cohesive co-occurrents of acteur, even though they are not topically linked to it. These words are filtered out, while we keep words like gros_plan (close-up) or scénique (theatrical), which topically cohere with acteur even though their frequency is lower than that of the discarded words.</Paragraph> <Paragraph position="3"> Table 3: Evaluation of TOPICOLL with different networks.</Paragraph> <Paragraph position="4"> In order to evaluate our work more objectively, we compared the quantitative results of TOPICOLL with the initial network and with its filtered version. Precision is given by N_t/N_b and recall by N_t/N_d, with N_d being the number of document breaks, N_b the number of boundaries found by TOPICOLL and N_t the number of boundaries that are document breaks (a boundary should not be farther than 9 plain words from the document break). The P_k measure (Beeferman et al., 1999) evaluates the probability that a randomly chosen pair of words, separated by k words, is wrongly classified, i.e. the two words are found in the same segment by TOPICOLL while they are actually in different ones (miss of a document break), or they are found in different segments while they are actually in the same one (false alarm). The evaluation showed that the performance of the segmenter remains stable, even when a topically filtered network is used (see Table 3). Moreover, it became obvious that a network filtered only by frequency and cohesion performs significantly less well, even with a comparable size. To test the statistical significance of these results, we applied a one-sided t-test with a null hypothesis of equal means to the P_k values; levels lower than or equal to 0.05 are considered statistically significant. These values confirm that the difference between the initial network (I) and the topically filtered one (T) is not significant, whereas the filtering based only on co-occurrence frequencies leads to significantly lower results, compared both to the initial network and to the topically filtered one. Hence, one may conclude that our method is an effective way of preferentially selecting topical relations.</Paragraph> </Section> </Section> </Paper>