<?xml version="1.0" standalone="yes"?> <Paper uid="W98-1109"> <Title>Refining the Automatic Identification of Conceptual Relations in Large-scale Corpora</Title> <Section position="1" start_page="0" end_page="83" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> In the ACRONYM Project, we have taken the Firthian view (e.g. Firth 1957) that context is part of the meaning of the word, and measured similarity of meaning between words through second-order collocation. Using large-scale, free text corpora of UK journalism, we have generated collocational data for all words except for high-frequency grammatical words, and have found that semantically related word pairings can be identified, whilst syntactic relations are disfavoured. We have then moved on to refine this system, to deal with multi-word terms and identify changing conceptual relationships across time. The system, conceived in the late 1980s and developed in 1994-97, differs from others of the 1990s in purpose, scope, methodology and results, and comparisons will be drawn in the course of the paper.</Paragraph> <Paragraph position="1"> Introduction The team at Liverpool has created over the years a series of automated systems for handling and extracting information from large textual corpora. These systems consist of software and knowledge bases derived from those same textual resources. Most recently, the system known as ACRONYM (Automated Collocational Retrieval of 'Nyms') has involved the identification of conceptually related items. These are referred to as 'nyms', by analogy with 'synonyms' and other sense-related items, as a reflection of the fact that it is conceptual similarity that is being discovered through collocation.</Paragraph> <Paragraph position="2"> Like its predecessors, the ACRONYM system has a dual purpose. On the one hand, it is intended to generate pairs or clusters of items which can function as alternative search terms in a diachronic text retrieval environment. On the other, it is intended to support a description of the thesaurus in text. In the latter application, the precise nature of the nyms generated by a given target word is important.</Paragraph> <Paragraph position="3"> The basic system for identifying conceptual relations The starting point for the nymic identification system is the raw text from which thesaural relations are to be derived. This corpus currently contains over 300 million words from The Independent newspaper from 1988 to 1997. As with similar work (e.g. Brown et al 1992), the size of the corpus makes preprocessing such as lemmatization, POS tagging or partial parsing too costly. The sole preprocessing performed on the corpus is thus the relabelling of numeric tokens into general categories.</Paragraph> <Paragraph position="4"> During this preprocessing stage, the corpus is also integerised and indexed. This increases efficiency during later processing stages, in particular the creation of the collocate database. Within the system, collocates are by default defined as the four words to the left and right of every word. The only exceptions to this rule are that collocates are not recorded for a set of 253 stopwords (high-frequency terms: grammatical words, numeric labels and some verbs), and these are also not recorded as collocates of any other word. The raw frequencies of left and right span collocates for a given word-pair are merged, and their significance measured, using a Z-score statistic.
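By way of illustration, the windowed collocate extraction and the significance test might be sketched as follows in Python. This is a minimal reconstruction under stated assumptions: the stoplist is a three-item placeholder for the 253-word list, and the Z-score takes a standard Berry-Rogghe-style form, since the exact statistic used in ACRONYM is not reproduced in this paper.

    from collections import Counter, defaultdict
    from math import sqrt

    SPAN = 4                            # words to the left and right of the node
    STOPWORDS = {"the", "of", "and"}    # placeholder for the 253-item stoplist

    def collocate_counts(tokens):
        """Count co-occurrences within +/-SPAN of each node word, skipping
        stopwords both as nodes and as collocates; left and right span
        counts are merged into a single tally, as described above."""
        counts = defaultdict(Counter)
        for i, node in enumerate(tokens):
            if node in STOPWORDS:
                continue
            lo, hi = max(0, i - SPAN), min(len(tokens), i + SPAN + 1)
            for j in range(lo, hi):
                if j != i and tokens[j] not in STOPWORDS:
                    counts[node][tokens[j]] += 1
        return counts

    def z_score(observed, node_freq, colloc_freq, corpus_size, span=2 * SPAN):
        """Assumed Z-score form: observed co-occurrences against the number
        expected if the collocate were spread randomly over the corpus."""
        p = colloc_freq / corpus_size
        expected = p * span * node_freq
        return (observed - expected) / sqrt(expected * (1 - p))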
Statistically significant collocates are then stored as a sparse matrix in the collocate database. A corpus of 3 x 10^8 words and 1.5 x 10^6 word types produces just over 4.8 x 10^6 statistically significant collocates (using a liberal threshold for significance).</Paragraph> <Paragraph position="5"> We refer to the set of statistically significant collocates for a given word as a 'collocational profile'. Similarity between any two given words is then measured through comparison of their profiles; the measure itself is based on the size of the profiles of both words and the number of collocates they share (i.e., second order collocation), and also on the collocation between those two words (first order collocation).</Paragraph> <Paragraph position="6"> The definition of what constitutes first-order collocation differs for different researchers. For example, the corpus described in Grefenstette (1992) is comparatively small, allowing for extensive preprocessing, including POS tagging and partial parsing. Only modifiers and their modified nouns are recorded as collocating pairs, allowing for a more fine-grained analysis of just a subset of word classes. In contrast, Schütze and Pedersen (1995) treat the set of collocates for a word as a vector containing the frequencies of collocation with other words occurring within a 40-word window. Futrelle and Gauch (1993) use a similar approach but preserve positional information (i.e., the number of words to the left and right of the target word). Positional information is also retained by Brown et al (1992), who store collocation information as word n-grams.</Paragraph> <Paragraph position="7"> For very large corpora, a lot of collocational information will be generated, making any form of collocational similarity measurement computationally expensive. Again, researchers have adopted differing approaches to this problem. Schütze and Pedersen (1995) build their collocate vectors using a bootstrap method involving increasingly larger sets of the lexicon, finally constructing a low (20) dimensional word vector space by Singular Value Decomposition. A simpler method is employed by Futrelle and Gauch (1993), whereby collocate vectors are recorded for all word forms, but for each word form, only its frequency of co-occurrence with the top 150 most frequent word forms is recorded.</Paragraph> <Paragraph position="8"> The researchers use a 2-word window to the left and right of the target word, but they also preserve the positional information of the collocates, resulting in a 600-dimension word vector of mutual information measurements.</Paragraph> <Paragraph position="9"> While there are many different ways to record collocates, the primary difference between the ACRONYM collocate database and those detailed above is the omission both of collocation involving grammatical function words, and of positional information. By omitting these two elements, the resulting collocate database focuses less on the syntagmatic similarity between words and more on their paradigmatic relations.</Paragraph> <Paragraph position="10"> Generating conceptually-related word pairs The definition and purpose of word similarity measures in the above systems also differ.
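As a point of reference for the comparisons that follow, a profile comparison of the kind just described might look like the sketch below. The functional form and the weighting constant are illustrative assumptions; the paper does not state the ACRONYM measure in closed form.

    def profile_similarity(profiles, w1, w2, alpha=0.5):
        """Combine second-order evidence (a Jaccard-style overlap of the two
        collocational profiles, sensitive to both profile sizes) with a
        first-order bonus for direct collocation between the two words."""
        p1, p2 = set(profiles.get(w1, ())), set(profiles.get(w2, ()))
        if not p1 or not p2:
            return 0.0
        second_order = len(p1 & p2) / len(p1 | p2)
        first_order = 1.0 if (w2 in p1 or w1 in p2) else 0.0
        return second_order + alpha * first_order

The overlap term rewards shared collocates relative to the combined profile size, while the alpha term rewards the two words collocating with each other directly.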
Vector-based models define similarity as the cosine measure between two word vectors, and the purpose of Schütze and Brown's vector-based work is to cluster grammatical word classes automatically.</Paragraph> <Paragraph position="11"> Mutual information-based approaches, such as those of Brown et al (1992) and Futrelle and Gauch (1993), measure word similarity in the context of a set of words to be clustered, typically with the aim of clustering for general similarity. The Jaccard coefficient measurement offered by Grefenstette (1992) is the nearest to our own, defining similarity simply in terms of the number of shared 'attributes' between two words against the number of attributes of both.</Paragraph> <Paragraph position="12"> While Schütze and Pedersen (1995), Brown et al (1992) and Futrelle and Gauch (1993) all demonstrate the ability of their systems to identify word similarity using clustering on the most frequently occurring words in their corpus, only Grefenstette (1992) demonstrates his system by generating word similarities with respect to a set of target words. His purpose is to allow a user to specify a target word, and have the system return an ordered list of related words. To this extent, the purpose of the basic ACRONYM system is echoed in Grefenstette's work.</Paragraph> <Paragraph position="13"> Given the liberal thresholds currently used in the ACRONYM system, such a list of conceptually related words may contain several tens of thousands of entries. As the size of the lexicon renders it computationally infeasible to calculate all word-pair similarities in advance, the system generates word similarity measures for a given word on the fly, using the collocate database described above.</Paragraph> <Paragraph position="14"> Examples of conceptually-related, or nymic, output are given in Table 1 for the node words key, medicine, pretty and testing.</Paragraph> <Section position="1" start_page="77" end_page="77" type="sub_section"> <SectionTitle> Nyms </SectionTitle> <Paragraph position="0">
Node      Nyms
key       factor role element issues areas issue elements figure players component
medicine  complementary alternative food herbal preventive genito-urinary modern conventional clinical science
pretty    good sight looks look awful girl silly boring looked stupid
testing   nuclear random positive curriculum DNA drug genetic HIV psychometric tests
Table 1: Ten top nyms for nodes key, pretty, medicine, testing
Modification to software to increase semantic nature of nymic output The nymic output in Table 1 contains collocates and other related items, which are relevant in principle for IT purposes, but which for linguistic purposes may be usefully separated out.
This is achieved by using only second order collocation, which boosts semantically (and morphologically) related nymic output, as can be seen in Table 2.</Paragraph> </Section> <Section position="2" start_page="77" end_page="77" type="sub_section"> <SectionTitle> Node Nyms </SectionTitle> <Paragraph position="0">
Node      Nyms
key       crucial important vital significant essential main fundamental major strategic specific
medicine  medical medicines sciences mathematics biology science chemistry psychology physics clinical
pretty    fairly quite incredibly extremely terribly really nice extraordinarily lovely sexy
testing   tests test tested assessment monitoring screening research rigorous clinical curriculum
Table 2: Nyms for nodes key, pretty, medicine, testing, suppressing first order collocates
The Deese Antonyms The focussing effect achieved by suppressing first order collocational information may be further demonstrated with reference to the work of Deese (1964), cited in Grefenstette (1992), and specifically to a set of conceptually-related antonymic pairs which Deese had identified by a series of psycholinguistic tests. Grefenstette hypothesised that any system identifying shared collocation between words would pick up the Deese antonyms as being strongly related; he experimented by feeding the 'primer' word for each of those antonyms into his SEXTANT system, and listing the 10 most closely related words produced.</Paragraph> <Paragraph position="1"> The same 'primer' words were fed into the refined ACRONYM system, i.e. with first-order collocates suppressed, and the results are displayed in Table 3 below (Deese antonyms, where they occur, are capitalised):
actively organisations groups involved activities effective activity developing vigorous encourage
DEAD loved dying die mum frightened loves buried forever loving
down away ball straight FRONT again foot around yards feet
GOOD worse dreadful awful poor terrible nasty stupid silly appalling
bigger huge large biggest major larger smaller small massive enormous
WHITE red brown blue wearing pink yellow green grey leather
TOP relegated side relegation feet foot floor inches table straight
wash cool dry cleaning smooth kitchen warm water washing shiny
HOT warm dry cool boiling wet salt boiled cooked damp
grey pale brown green bright white blue red thick purple
deeper profound dark intense sand depth surface feelings thick mixture
dried hot warm soft brown creamy crisp salt lemon cold
easier difficult impossible harder HARD simple able quick enough unable
deserted filled crowded derelict crammed windows floor surrounded cramped nearby
faster bowler pace speed bowlers bowling SLOW slowly slower quick
pleased unhappy mum nice happily enjoy relaxed OK loving cheerful
harder difficult impossible EASY easier unable trying able enough tough
artillery thick huge massive metal large high rain vehicles low
LOW higher lower levels rising highest level increased increase falling
SMALL huge larger vast smaller big tiny substantial mainly plastic
leaving ball leave yards corner back injured pulled minutes shot
SHORT longer hair dark slow white wearing black down wide
steep WIDE broad stretch paths hill lined tiny path brick
existing technology proposed latest development plans current commercial systems design
ancient Victorian traditional white houses buildings around man black modern
fairly quite incredibly extremely terribly really nice extraordinarily lovely looks
spicy delicious flavours wealthy sweet fruit flavour ripe soft texture
wrong LEFT want freedom rights able back necessary wanted law
muddy sand wet dirt grass tricky dusty mud trees damp
LONG length brief straight tight balls wide wicket quick ball
SWEET salty soured spiced delicious tomato soy creamy chilli pungent
stronger strongest powerful WEAK strength sharp steady solid underlying strongly
THICK brown strips pale slices orange fat white creamy soft
Table 3: Most closely related words for the Deese 'primer' words (Deese antonyms capitalised)
A basic comparison reveals that the ACRONYM system yields a similar level of results to Grefenstette's system. 15 of the 33 Deese antonyms occur in SEXTANT output as first or second most-related word, compared with 13 generated by ACRONYM; whilst 16 of the Deese antonyms appear within the top 10 most related words of SEXTANT output, compared with 18 in output from ACRONYM. Similarly, as with Grefenstette's findings, there are cases where ACRONYM yields a non-Deese antonym which is nevertheless close: see for instance big-small; dark-pale; deep-surface; happy-unhappy; new-existing; old-modern. A scrutiny of the actual contents of each ACRONYM list further reveals that, like Futrelle and Gauch, ACRONYM, when using this particular non-collocate upweighting method, discovers &quot;... (entire) graded fields, rather than just pairs of opposites&quot;. Of particular interest in this respect are the results for big, easy, fast and strong.</Paragraph> <Paragraph position="2"> For some words in Table 3, failure to relate closely to their Deese antonym can in part be explained by textual domain. For example, in the listing for empty, the strongest nyms reflect the sense of an empty building or structure, suggesting that such a context predominates throughout the corpus. Likewise, with the word pretty, we see that intensifiers predominate as nyms, with the synonym lovely appearing only in 9th place. Analysis of a random sample of the corpus reveals that pretty is indeed predominantly adverbial, and only rarely adjectival.</Paragraph> </Section> <Section position="3" start_page="77" end_page="79" type="sub_section"> <SectionTitle> Multi-Word Nyms </SectionTitle> <Paragraph position="0"> As described, the basic ACRONYM system generates information on both first and second order collocates within its single-word nymic output. First order collocation can be suppressed to enrich the semantic information, as demonstrated in Tables 1 and 2.
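A minimal sketch of this suppression, reusing the illustrative profile representation from the earlier sketches (the scoring itself is an assumption, not the formula ACRONYM actually applies):

    def second_order_nyms(node, profiles, top_n=10):
        """Rank candidate nyms by shared-collocate (second-order) evidence
        only, explicitly excluding the node's own first-order collocates."""
        node_profile = set(profiles.get(node, ()))
        scores = {}
        for cand, prof in profiles.items():
            if cand == node or cand in node_profile:
                continue                 # suppress first-order collocates
            shared = node_profile & set(prof)
            if shared:
                scores[cand] = len(shared) / len(node_profile | set(prof))
        return sorted(scores, key=scores.get, reverse=True)[:top_n]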
Often, however, these first order collocates are really part of multi-word units, combining with other words to form hyponyms. They often combine with the target word itself, thereby forming hyponyms of the type 'superordinate plus modifier'. These items are of prime importance in linguistic description, since they represent hitherto undocumented differences between the textual thesaurus and the mental lexicon.</Paragraph> <Paragraph position="1"> Another refinement to the basic ACRONYM system is therefore achieved when the nyms of the single-word nymic output are recombined into multi-word units which better represent the target concept. This procedure is carried out in two stages. First, the list of nyms is processed by a software module which attempts to identify the most likely word pairs that could be created by combining the individual nyms, making use of a variety of measures including collocational as well as contextual clues from the corpus database.
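The paper does not enumerate those measures; as a hedged sketch, a first-stage module that ranks candidate pairs purely by the collocational link between the two nyms might look like this:

    from itertools import combinations

    def candidate_multiword_units(nyms, profiles, min_strength=0.0):
        """Stage one, illustrative only: propose word pairs from a flat nym
        list, ranked by the strength of the collocational evidence linking
        the two members; `profiles` maps each word to a dict of collocate
        significance scores. The ranking heuristic is an assumption."""
        candidates = []
        for w1, w2 in combinations(nyms, 2):
            strength = max(profiles.get(w1, {}).get(w2, 0.0),
                           profiles.get(w2, {}).get(w1, 0.0))
            if strength > min_strength:
                candidates.append(((w1, w2), strength))
        return sorted(candidates, key=lambda c: -c[1])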
The resulting list of word combinations, which need not necessarily be adjacent, is passed to a second-stage module, which checks which of these candidate word pairs have collocational environments similar to the original node word. The benefits of this approach are that no a priori word-pair list needs to be established, this being decided by the contents of the nym list and by the corpus, and that no collocational profiles need to be stored for word pairs.</Paragraph> <Paragraph position="2"> Tables 4 and 5 display multi-word nyms for therapy and weapons.</Paragraph> <Paragraph position="3"> In Tables 4 and 5, a series of multi-word hyponyms have emerged, several consisting of adjectives or nouns modifying the node or synonyms of it.</Paragraph> <Paragraph position="4"> Multi-word nodes are a further refinement of the system. As with multi-word nymic output, the collocate profiles for multi-word nodes are generated on the fly. Table 6 presents an example of the multi-word nymic output for a multi-word node.</Paragraph> </Section> <Section position="4" start_page="79" end_page="80" type="sub_section"> <SectionTitle> Semantic Clustering </SectionTitle> <Paragraph position="0"> As well as generating flat lists of semantically related words, the ACRONYM system can perform clustering upon a set of nyms, in order to reveal their semantic inter-relationships. In ACRONYM, the set of words to be clustered is usually one of the flat lists of nyms of the kind displayed above.</Paragraph> <Paragraph position="1"> This is in contrast to work by researchers such as Schütze and Pedersen (1995), Brown et al (1992) and Futrelle and Gauch (1993), where it is often the most frequent words in the lexicon which are clustered, predominantly with the purpose of determining their grammatical classes.</Paragraph> <Paragraph position="2"> ACRONYM uses two publicly available clustering tools, PAM and AGNES, described in Kaufman and Rousseeuw (1990). The first, PAM (Partitioning Around Medoids), is a k-medoid partitioning method, while AGNES is a variant on agglomerative nesting. Both algorithms allow object-relations to be represented by a similarity measure, which we take as the collocational profile similarity measure described earlier. An example of PAM output is shown in Table 7.</Paragraph> <Paragraph position="3"> The PAM-generated clusters in Table 7 are created from the top nyms for water, and reflect several different meanings (senses, uses or references) that are associated with the node word, namely: (1) 'water utility', (2) 'body of water', (3) 'type of drinking water', (4) 'fluid measurement', (5) 'unit of water', (6) 'liquid used in cooking', (7) 'medium for certain domestic processes', and (9) 'water in various more or less pure states'. Some of these senses are fairly conventional, others are more contextually determined. Not every cluster is adequate; here, Cluster 8 is weak and uninterpretable. Taken overall, it seems that this type of clustering does sharpen the picture for the user of the system.</Paragraph> <Paragraph position="4"> While previous researchers have used agglomerative nesting clustering (e.g. Brown et al (1992), Futrelle and Gauch (1993)), comparisons with our work are difficult to draw, due to their use of the 1,000 commonest words from their respective corpora.</Paragraph>
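As an illustration of the k-medoid partitioning used for Table 7, the toy sketch below alternates assignment and medoid-update steps over a precomputed dissimilarity matrix (for instance, one minus the pairwise profile similarity). It is an assumption-laden stand-in for Kaufman and Rousseeuw's PAM, whose BUILD and SWAP phases are not reproduced here.

    import random

    def k_medoids(dissim, k, iters=100, seed=0):
        """Partition items 0..n-1 around k medoids, given an n x n
        dissimilarity matrix `dissim` (a list of lists)."""
        rng = random.Random(seed)
        items = list(range(len(dissim)))
        medoids = rng.sample(items, k)
        for _ in range(iters):
            clusters = {m: [] for m in medoids}
            for i in items:              # assign each item to its nearest medoid
                clusters[min(medoids, key=lambda m: dissim[i][m])].append(i)
            new_medoids = [min(c, key=lambda x: sum(dissim[x][y] for y in c))
                           for c in clusters.values() if c]
            if set(new_medoids) == set(medoids):
                break                    # medoids stable: converged
            medoids = new_medoids
        return clusters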
<Paragraph position="5"> In Brown et al (1992), the authors provide some sample subtrees resulting from such a 1,000-word clustering. The sets of words from each subtree have been fed into the ACRONYM clustering system, and the results from AGNES are shown below. This is not strictly a fair comparison, as the clustering of a superset of these words would doubtless create a different structure. Nevertheless, it appears that ACRONYM organises these subsets into a more satisfactory taxonomy, in contrast with a tendency in Brown et al's system to produce right-heavy taxonomies.</Paragraph> <Paragraph position="6"> In Fig. 1, the last example highlights a distinction between syntax and semantics. While Brown et al's system splits the four words along the singular/plural divide (i.e., rep-representative and reps-representatives), ACRONYM splits them semantically; the abbreviated versions refer to sales-people or those representing travel companies, whilst the full versions are used in a political context.</Paragraph> </Section> <Section position="5" start_page="80" end_page="83" type="sub_section"> <SectionTitle> Identifying Change in Conceptual Relations </SectionTitle> <Paragraph position="0"> The ACRONYM database has also been designed in such a way that it can be accessed diachronically. This facility was incorporated in order to ensure that the system remains up to date in its application to text retrieval and linguistic description, and it has already enabled the Unit to establish a new words service within its web site (http://www.rdues.liv.ac.uk/newwds.html).</Paragraph> <Paragraph position="1"> The team first explored the way in which language changes over time in the AVIATOR Project (Renouf 1993, Collier 1993, Blackwell 1993), where they investigated the dynamic aspects not only of single words, but also of the collocational behaviour of those words, with the goal of identifying new collocations or changes in meaning. In ACRONYM (Renouf 1996, Collier & Pacey 1996), the collocational and diachronic concepts have been developed considerably, taking advantage of improvements in technology and the greater availability of electronic text. The result is an integrated system of databases and indexes which can be accessed as one virtual entity or divided into any desired configuration of its constituent parts. In the current database, the smallest accessible component, which we refer to as a segment, consists of three months of text from national UK newspapers, containing on average eight million running words (tokens).</Paragraph> <Paragraph position="2"> Each segment is composed of an integerised corpus database providing all the usual corpus- and text-retrieval facilities, from simple frequency information for a single word to the full KWIC, sentence or article context for Boolean (multi-word) searches. The frequency data is readily extractable, allowing a word or phrase to be 'tracked' over time. In addition, each segment has one or more collocate databases which store profiles for each word in the corpus.</Paragraph> <Paragraph position="3"> By comparing the output from two collocate databases, the change in collocational behaviour of any node can be identified in a similar fashion to a change in frequency of occurrence. As established in the AVIATOR Project, an alteration in a word's profile signals a change in its meaning, with a consequent change in the set of words which can be regarded as its semantic equivalents. If we require time scales longer than three months, the software needs to perform a comparison across several collocate databases rather than just two.
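As a simple illustration of what the segment architecture supports, tracking a word over time reduces to reading its frequency out of each quarterly segment in turn. The storage layout assumed below, a mapping from segment label to a Counter of token frequencies, is a hypothetical stand-in for the integerised databases just described.

    def track_frequency(word, segments):
        """Return the word's frequency in each segment, in segment order;
        `segments` maps labels such as '1994-Q1' to token Counters."""
        return {label: counts.get(word, 0) for label, counts in segments.items()}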
In order to accelerate comparisons spanning several collocate databases, we have added a facility for creating merged databases, for example combining all four databases for 1994 into one. By using this in conjunction with a similarly merged database built from other individual segments, the collocate comparison process can be carried out more efficiently.</Paragraph> <Paragraph position="4"> When identifying nymic change, we generally increase the time period to a whole year, to avoid recording any seasonal fluctuations in real-world events. The usual procedure is to create two merged collocate databases, one for the year in question (the target corpus) and another of all segment databases prior to that year (the baseline corpus). These two databases are then compared and any significant change in collocate profiles is recorded. This may be done for individual words or for all the words in the corpus. In looking at the changed profiles, the collocates are categorised into four sets: 'up' collocates, those which have increased in significance in the target corpus; 'down' collocates, those which have decreased in significance in the target corpus; 'new' collocates, those which have appeared for the first time in the target corpus; and 'gone' collocates, those which appeared in the baseline corpus but which are no longer present in the target corpus.</Paragraph> <Paragraph position="5"> The normal process of nym identification, as explained earlier, finds candidates which have as many collocates as possible in common with the target word. In monitoring the change in nymic relationships, however, only those collocates which are considered to have changed are involved in this process. The semantic proximity of the target word to the candidate nyms is therefore measured in terms of the number of changed collocates the two words have in common. If the task is to identify new nyms, then only 'up' and 'new' collocates are used; conversely, 'down' and 'gone' collocates are employed in finding nyms which have decreased in significance. This is exemplified in Table 8, which presents the 1997 'up' and 'new' collocates for the node word crisis.</Paragraph> <Paragraph position="6"> The next stage involves the identification of words which share these collocates with the original node word crisis in the 1997 collocate database. The output from this is given in Table 9.</Paragraph> <Paragraph position="7"> The nyms in Table 9, presented in descending order of strength of association, focus more clearly on the financial crisis looming in South East Asia.</Paragraph> <Paragraph position="8"> Since one of the chief goals of this methodology is to provide up-to-date information on thesaural equivalents, it can also be used to find nyms which have declined in significance. The output is harder to interpret, since the difference in size between the baseline and target databases results in many more down/gone collocates than up/new ones. For crisis, as an example, there were 3,187 down/gone collocates but only 68 new/up ones. Nevertheless, the nyms which are generated by using the down/gone collocates can be useful and interesting.
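The underlying comparison of baseline and target profiles might be sketched as follows. Profiles here map each collocate to a significance score, and any movement in score counts as a change, which is a simplification of whatever thresholds the system actually applies.

    def classify_change(baseline, target):
        """Categorise a node's collocates into the four sets described
        above by comparing its baseline and target profiles."""
        up, down, new = [], [], []
        for c, score in target.items():
            if c not in baseline:
                new.append(c)
            elif score > baseline[c]:
                up.append(c)
            elif score < baseline[c]:
                down.append(c)
        gone = [c for c in baseline if c not in target]
        return {"up": up, "down": down, "new": new, "gone": gone}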
Tables 10a and 10b show nyms for war, using 1993 as the target corpus and all previous data (1988-1992) as the baseline. In Table 10a, the 'up' nyms are presented, and it can be seen that these all relate to the civil war in the former Yugoslavia. In Table 10b, in contrast, those nyms are listed which in 1993 became less closely associated with war.</Paragraph> <Paragraph position="9"> The main reference reflected in the 'down' nyms of Table 10b is to the Gulf War, which followed Iraq's invasion of Kuwait. The implication of this evidence is that by 1993, the Gulf War had ceased to figure so prominently in our corpus data and so had become less strongly associated with the concept of war.</Paragraph> <Paragraph position="10"> Concluding Remarks This paper has described the basic ACRONYM system, a set of tools which has relevance both to text retrieval applications and to linguistic description. The focus here has been on outlining the recent modifications which have been carried out to refine the nymic output, to facilitate the linguistic task of describing the textual thesaurus. Several of them, in particular semantic clustering, are also intended to improve performance in document retrieval. The nymic output from ACRONYM intuitively appears to have the potential to increase both recall and precision, and initial tests of its effectiveness in this regard have been carried out, by using nymic output to extract article headlines. The next stage of the research will focus more closely on the evaluation and optimisation of the system as a text retrieval facility.</Paragraph> </Section> </Section> </Paper>