<?xml version="1.0" standalone="yes"?>
<Paper uid="H01-1030">
  <Title>First Story Detection using a Composite Document Representation.</Title>
  <Section position="4" start_page="1" end_page="1" type="metho">
    <SectionTitle>
2. LEXICAL CHAINING
</SectionTitle>
    <Paragraph position="0"> A lexical chain is a set of semantically related words in a text. For example in a document concerning cars a typical chain might consist of the following words {vehicle, engine, wheel, car, automobile, steering wheel}, where each word in the chain is directly or indirectly related to another word by a semantic relationship such as holonymy, hyponymy, meronymy and hypernymy.</Paragraph>
    <Paragraph position="1"> When reading any text it is obvious that it is not merely made up of a set of unrelated sentences, but that these sentences are in fact connected to each other in one of two ways cohesion and coherence. As Morris and Hirst [6] point out cohesion relates to the fact that the elements of a text 'tend to hang together'. Whilst coherence refers to the fact that 'there is sense in the text'. Obviously coherence is a semantic relationship and needs computationally expensive processing for identification, however cohesion is a surface relationship and is hence more accessible. As indicated by Halliday and Hasan [7] cohesion can be roughly classified into three distinct classes, reference, conjunction and lexical cohesion.</Paragraph>
    <Paragraph position="2"> Conjunction is the only class, which explicitly shows the relationship between two sentences, 'I have a cat and his name is Felix'. Reference and lexical cohesion on the other hand indicate sentence relationships in terms of two semantically same or related words. In the case of reference, pronouns are the most likely means of conveying referential meaning. For example in the following sentences, ' &amp;quot;Get inside now!&amp;quot; shouted the teacher. When nobody moved, he was furious'. In order for the reader to understand that 'the teacher' is being referred to by the pronoun 'he' in the second sentence, they must refer back to the first sentence. Lexical cohesion on the other hand arises from the selection of vocabulary items and the semantic relationships between them. For example, 'I parked outside the library, and then went inside the building to return my books', where cohesion is represented by the semantic relationship between the lexical items 'library', 'building' and 'books'. For automatic identification of these relationships it is far easier to work with lexical cohesion than reference because less underlying implicit information is needed to discover the relationship between the above pronoun and the word it references. Hence lexical cohesion is used as a linguistic device for investigating the discourse structure of texts and lexical chains have been found to be an adequate means of exposing this discourse structure. These lexical chains have many practical applications in IR and computational linguistics such as hypertext construction [8], automatic document summarization [9], the detection of malapropisms within text [10], as a term weighting technique capturing the lexical cohesion in a text [11], as a means of segmenting text into distinct blocks of self contained text [12]. For the purpose of this project we  exploit three such applications: 1. We use lexical chains as a means of exploring and presenting the most prevalent topics discussed in news stories.</Paragraph>
    <Paragraph position="3"> 2. A valuable side effect of lexical chain creation is that the words of a text are automatically disambiguated.</Paragraph>
    <Paragraph position="4"> 3. Because lexical chains disambiguate words based on  the context in which they occur, lexical chains also address two linguistic problems synonymy and polysemy, which hinder the effectiveness of traditional IR systems such as the vector space model.</Paragraph>
  </Section>
  <Section position="5" start_page="1" end_page="1" type="metho">
    <SectionTitle>
3. CHAIN FORMATION ALGORITHM
</SectionTitle>
    <Paragraph position="0"> In general the first task of an IR system is to execute a set of text operations (e.g. stemming, removal of stopwords) to reduce the complexity of a full text representation of a document into a more manageable set of index terms.</Paragraph>
    <Paragraph position="1"> Although these index terms are a subset of the original representation, their purpose is to adequately represent the semantic content of the original document in a more concise manner. This is a difficult NLP task, as natural language frequently does not obey the principle of compositionality where the meaning of the whole can be strictly determined from its parts. So in order to derive the correct representation of a text, we need to determine the interpretation of a word or phase in the context in which it occurs i.e. before the original text is manipulated into a set of index terms. The creation of lexical chains which is described below, aims to capture this additional textual information while still maintaining a manageable representation size.</Paragraph>
    <Paragraph position="2"> Firstly each term contained in a particular document is dealt with in chronological order. Then each subsequent word is added to an existing lexical chain or becomes the seed of a new chain, in much the same manner as the clustering of documents. A stronger criterion than simple semantic similarity is imposed on the addition of a term to a chain, where terms must be added to the most recently updated (semantically related) chain. This favors the creation of lexical chains containing words that are in close proximity within the text, prompting the correct disambiguation of a word based on the context in which it was used. We use WordNet to determine the semantic relatedness between a candidate word and the words of a chain. If we view WordNet as a large semantic network of nodes (meanings) inter-related by semantic relations (meronymy, hyponymy, etc.), then finding a relationship between two words in the chaining process involves activating the network of one node and observing the activity of the other in this activated network.</Paragraph>
    <Paragraph position="3">  So far we have talked abstractly about how to determine if a word is semantically related to a chain. To explain this fully it is first necessary to discuss the structure of the WordNet thesaurus, which is used to determine this semantic connection or closeness between words in a text.</Paragraph>
    <Paragraph position="4"> In WordNet, nouns, verbs, adjectives, and adverbs are arranged into synsets (group of synonymous words e.g. cat, feline, tabby), which are further organized into a set of lexical source files by syntactic category. In our case we are only interested in the noun index and data files, because the verb file in WordNet has no relation with the three other files (noun, adverb and adjective files), and the adverb file has only unidirectional relations with the adjective file. So each word in a particular document is searched for in the noun index file, if it is not found then we make the assumption that this word is not a noun and hence will play no further part in the chaining process. If the word is found then it will be represented by a unique set of synset numbers, where each synset number represents a particular sense associated with that word. Each synset number points to the position in the noun data file where words related to this sense of the word are stored with a gloss, and sample sentence using this word. Words related to a particular sense are associated with it by several different semantic relations, such as hyponymy (kind-of, lorry/vehicle), hypernymy (is-a, vehicle/car), holonymy (has-part, tree/branch) and meronymy (part-of, engine/car).</Paragraph>
    <Paragraph position="5"> As shown in Figure 1, each sense associated with a word is expanded using WordNet (in reality these senses and senses related to them are represented by synset numbers). This example of the chain formation process shows us that the word 'car' is related to the word 'trunk' by the fact that 'car trunk', one of the senses of 'trunk', is a meronymy of 'automobile' which is a possible sense of 'car'. In this way both words have been successfully disambiguated so all redundant senses belonging to each word are eliminated and 'car' is added to the chain containing 'trunk'. This chain may also contain other semantically related words pertaining to the topic of an automobile e.g. {car, trunk, engine, vehicle...}. The chain formation process is continued in this way until all the words in a particular document (in our case nouns) have been chained. Any words that remain unchained or ambiguous after this chaining process are eliminated from our chain word representation based on the following hypothesis: 'The occurrence of words in a text which fail to participate in the overall cohesive structure of a text (i.e. remain unchained) is purely coincidental. Consequently these words are considered irrelevant in describing the general topic of a document.' This implies that our lexical chaining strategy also provides us with an automatic means of selecting the most salient features of a particular news story. So when all redundant words have been removed in this manner, all remaining chains are then merged into a single chain containing all the synset numbers from each individual chain involved in this process. This representation is a semantic representation as opposed to a syntactic representation (in the case of a 'bag of words' representation) because it contains concepts (i.e. synset numbers) rather than simple terms to represent the content of a document.</Paragraph>
    <Paragraph position="6"> The final stage of our combined document representation strategy involves collecting all free-text words for each document and storing them in a set of index files. Effectively, then, the composite document representation used in the detection process (described in the next section) consists of two weighted vectors, a chain vector and an ordinary term vector, where both chain words and free-text words are weighted simply by the frequency with which they occur in a document.</Paragraph>
  </Section>
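As an illustration of the composite representation just described, the sketch below builds the two frequency-weighted vectors from the merged, disambiguated chains and the document's free-text terms; the dictionary-based vectors are a simplification of the index files used in practice:

```python
# Sketch of the composite representation: one frequency-weighted vector over synset
# numbers (the merged chain) and one over free-text terms.
from collections import Counter

def composite_representation(chains, free_text_terms):
    # `chains` is a list of chains, each a list of synset numbers after disambiguation.
    # Merge all surviving chains into a single concept vector, weighted by occurrence.
    chain_vector = Counter(synset_id for chain in chains for synset_id in chain)
    # Ordinary term vector over the free-text words, also frequency-weighted.
    term_vector = Counter(free_text_terms)
    return chain_vector, term_vector
```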
  <Section position="7" start_page="1" end_page="1" type="metho">
    <SectionTitle>
4. DETECTION ALGORITHM USING THE
FUSION METHOD
</SectionTitle>
    <Paragraph position="0"> Online Detection or First Story Detection is in essence a classification problem where documents arriving in chronological order on the input stream are tagged with a 'YES' flag if they discuss a previously unseen news event, or a 'NO' flag when they discuss an old news topic.</Paragraph>
    <Paragraph position="1"> However unlike detection in a retrospective environment a story must be identified as novel before subsequent stories can be considered. The single-pass clustering algorithm bases its clustering methodology on the same assumption, the general structure of which is summarised as follows.</Paragraph>
    <Paragraph position="2">  1. Convert the current document into a weighted chain word vector and a weighted free text vector.</Paragraph>
    <Paragraph position="3"> 2. The first document on the input stream will become the first cluster.</Paragraph>
    <Paragraph position="4"> 3. All subsequent incoming documents are compared  with all previously created clusters up to the current point in time. A comparison strategy is used here to determine the extent of the similarity between a document and a cluster. In our IR model we use subvectors to describe our two distinct document representations. This involves calculating the closeness or similarity between the chain word vectors and free text vectors for each document/cluster comparison using the standard cosine similarity measure (used in this variation of the vector space model to compute the cosine of the angle between two weighted vectors).</Paragraph>
    <Paragraph position="5"> The data fusion element of this experiment involves the combination of two distinct representations of document content in a single cluster run i.e. j equals 2 in equation (1). So the overall similarity between a document D and a cluster C is a linear combination of the similarities for each sub-vector formally defined as: where Sim(X, Y) is the cosine similarity measure for two vectors X and Y, and w is a coefficient that biases the weight of evidence each document representation j, contributes to the similarity measure.</Paragraph>
    <Paragraph position="6"> 4. When the most similar cluster is found a thresholding strategy [13] is used to discover if this similarity measure is high enough to warrant the addition of that document to the cluster and the classification of the current document as an old event. If this document does not satisfy the similarity condition set out by the thresholding methodology then the document is declared as discussing a new event, and this document will form the seed of a new cluster.</Paragraph>
    <Paragraph position="7"> 5. This clustering process will continue until all documents in the input stream have been classified.</Paragraph>
  </Section>
  <Section position="8" start_page="1" end_page="2" type="metho">
    <SectionTitle>
5. EXPERIMENTAL RESULTS
</SectionTitle>
    <Paragraph position="0"> A number of experiments were conducted on the TDT-1 broadcast news collection [1]. The results of these experiments were used to observe the effects on first story detection when lexical chains are used in conjunction with free text as a combined document classifier. The main aim of the experiments was to determine if lexical chains are a suitable document representation when classifying news stories in the TDT domain. The official TDT evaluation requires that the system output is a declaration (a YES or NO flag) for each story processed. These declarations are then used to calculate two system errors percentage misses and false alarms. Misses occur when the system fails to detect the first story discussing a new event and false alarms occur when a document discussing a previously detected event is classified as a new event.</Paragraph>
    <Section position="1" start_page="1" end_page="2" type="sub_section">
      <SectionTitle>
5.1 System Descriptions
</SectionTitle>
      <Paragraph position="0"> Three distinct detection systems TRAD, CHAIN and LexDetect are examined in the following set of experiments. The TRAD system [13], our benchmark system in these experiments is a basic FSD system that classifies news stories based on the syntactic similarity between documents and clusters. The design of this system is based on a traditional vector space model which represents documents as a vector, each component of which corresponds to a particular word and who's value reflects the frequency of that word in the document. Classification of a new event occurs in a similar manner to that described in Section 4, the most important difference between the two methods is that a single free text representation is used to express document content, rather than a combined representation. A Time Window [13] of length 30 is employed in the TRAD, CHAIN and LexDetect systems.</Paragraph>
      <Paragraph position="1"> The design of our second system LexDetect has been described in detail in sections 3 and 4. The dimensionality of LexDetect (80 words) remains static through out these experiments. Using the current method of lexical chain creation, just under 72% of documents contained greater than or equal to 30 chained words. We therefore normalized the length of chain word representations by imposing a chain dimensionality value  . In theory it is possible to vary the length of the free text representation in our combined representation however in these experiments all schemes contain free text representations of length 50, since optimal performance is achieved for TRAD when dimensionality 50 is used. The final system parameter to be varied in these experiments is the weighting coefficient w j used in equation (1). The design of our third system CHAIN like TRAD, involves the use of a singular document representation. However this document representation contains chain words only rather than free text terms, and so the dimensionality of the system must be 30.</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
5.2 The Data Fusion Experiment
</SectionTitle>
      <Paragraph position="0"> From the results shown in Figure 2 (a Detection Error Tradeoff Graph where points closer to the origin indicate better overall performance), we deduce that a marginal increase in system effectiveness can be achieved when lexical chain representations are used in conjunction with free text representations in the detection process. In particular, we see that the miss rate of our FSD system LexDetect decreased with little or no impact to the false alarm rate of the system.</Paragraph>
      <Paragraph position="1">  An IR 'system' and an IR 'scheme' are used in this context to describe two different concepts. An IR system refers to the physical implementation of an IR algorithm, which can have various operational modes or various parameter settings. The same IR system may be used to execute different IR schemes by adjusting these parameters [20].</Paragraph>
      <Paragraph position="2"> Optimal performance for the LexDetect system (as shown in Figure 2) was found when a weighted combination of evidence was used. This involved treating our free text representation as weaker evidence during the detection process. Results shown in Figure 3 contrast the effect on LexDetect performance when both the chain and free text representations are given equal weight (Lex) and when the weight of the free text representation is halved (LexDetect). This is an interesting result as similar experiments using composite document representations to improve search system performance based on ranking, only experienced optimal effectiveness when they allowed free text evidence to bias the retrieval process [14, 15]. This prompted us to question the necessity of the free text component of our composite representation, however results show that system performance degrades when this element of document content is excluded. This is due to the inability of WordNet to correlate the relationship between proper nouns and other semantically related concepts i.e.</Paragraph>
      <Paragraph position="3"> {Bill Clinton, US president}, which are often crucial in representing journalistic event identity because they reflect the 'who, what, where, when and how' of a news story.</Paragraph>
      <Paragraph position="4"> Our final experiment involves plotting TRAD_80 against LexDetect shown in Figure 4. The aim of this experiment is to prove that the increase in system effectiveness observed when a composite document representation is used can be attributed solely to the combination of evidence derived from our free text and chain representations rather than as a consequence of increasing the dimensionality of the system to 80 features. As the DET graph in Figure 4 shows, our LexDetect system still outperforms our TRAD system under conditions of equal dimensionality.</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="2" end_page="2" type="metho">
    <SectionTitle>
6. CRITERIA FOR SUCCESSFUL DATA
FUSION
</SectionTitle>
    <Paragraph position="0"> In the previous section our results showed that when a chain word representation is used in conjunction with a free text representation of a document, improvements in FSD effectiveness are observed. However these results fail to provide any concrete reasoning as to why data fusion under these particular conditions work. There are many papers in the data fusion literature, which attempt to explain why certain data fusion experiments succeed where others have failed. Many of these papers look at the effects of combining specific sources of evidence such as the combination of rank retrieval lists, multiple searches or multiple queries. However Ng and Kantor [16] have tried to formulate some general preconditions for successful data fusion involving non-specific sources of evidence.</Paragraph>
    <Paragraph position="1"> The first of these criteria is based on the dissimilarity between two sources of evidence.</Paragraph>
    <Paragraph position="2">  1. Dissimilarity: Data fusion between operationally very  similar IR systems may not give better performance.</Paragraph>
    <Paragraph position="3"> To calculate the level of dissimilarity between our FSD systems described in Section 5, we now define two ratios based on the number of common relevant and common non-relevant tagged documents between two distinct systems. The number of relevant tagged documents,  are similarly defined in terms of the number of incorrectly classified documents returned by both systems (i.e. missed events or wrongly detected new events) as shown in equation 3.</Paragraph>
    <Paragraph position="4">  The results for this experiment are shown in tables 1 and 2 below. We can see that in general the relevant document overlap R overlap between the pair-wise similarities of all four systems is between 85% and 92%, the most similar systems being not surprisingly our two TRAD schema which differ only in the length of their classifiers. The pair-wise similarities N overlap of all four systems regarding non-relevant document classifications exhibit a similar trend of high similarity between the TRAD and LexDetect systems. However the most important point to be taken from these sets of results regards the fact that our CHAIN and TRAD systems exhibit the lowest relevant and non-relevant document overlap of all our pair-wise comparisons. This is an important and encouraging result as it shows that our chain word representations (used in CHAIN) is sufficiently dissimilar to our simple 'bag of words' representation (used in TRAD) to contribute additional evidence to a combination experiment involving both these representations. In particular this satisfaction of Ng and Kantor's dissimilarity criteria explains why marginal improvements in system performance were observed in our data fusion experiment.</Paragraph>
    <Paragraph position="5">  The second criteria defined for successful data fusion regards efficacy or the quality of the individual sources of evidence before they are combined in the data fusion process.</Paragraph>
    <Paragraph position="6"> 2. Efficacy: Data fusion between a capable IR system and a very incapable IR system may not give better performance.</Paragraph>
    <Paragraph position="7"> In our data fusion experiment in Section 5 we observed that our CHAIN system was our worst performing FSD system. So as the efficacy criteria suggests a better performing chain word representation is needed before further improvements are observed in our combination system LexDetect.</Paragraph>
  </Section>
  <Section position="10" start_page="2" end_page="3" type="metho">
    <SectionTitle>
7. FUTURE WORK
</SectionTitle>
    <Paragraph position="0"> There are many factors which can affect the final chain word representation of a document, ranging from the greedy nature of the chaining algorithm, to the effects caused when varying degrees of freedom are used in this algorithm (i.e. system parameters such as the amount of activation used in WordNet). However the single biggest influence on the quality of the resultant lexical chains is the knowledge source used to create them. In other words the quality of our lexical chain formation is directly dependent on the comprehensiveness/complexity of the thesaurus used to create them. In the case of WordNet, there are a number of structural inadequacies that degrade the effectiveness of our chain representation:  1. Missing semantic links between related words.</Paragraph>
    <Paragraph position="1"> 2. Inconsistent semantic distances between different concepts.</Paragraph>
    <Paragraph position="2"> 3. Overloaded synsets such as 'being' which are connected to a large number of synsets. These types of synsets cause spurious chaining, where an unrelated word is added to a chain based on a weak yet semantically close relationship with one of these overloaded synsets (a special case of 2.).</Paragraph>
    <Paragraph position="3"> 4. No means of correlating the relationship between proper nouns and other noun phrases (see Section 5.2). 5. The level of sense granularity used to define word  meanings in WordNet is often too fine for the chain formation process.</Paragraph>
    <Paragraph position="4"> All of these factors play a part in reducing the effectiveness of the disambiguation process and the comprehensiveness and accuracy of the final chain representation. A number of these weaknesses are discussed in previous work on lexical chaining [8, 12]. However the last two cases are particularly important when considering the similarity between documents and clusters in the detection process. As explained in Section 6.2 lexical chains are an incomplete means of representing events in a topic detection application since they fail to contain information on the proper nouns involved in the discourse structure of the text.</Paragraph>
    <Paragraph position="5"> The last case is more a comment on the unsuitability of WordNet as a knowledge source in this application rather than as a reference to any specific weakness in its design. For example consider two distinct documents which both contain the word 'city' in their respective chain representations. WordNet defines three distinct meanings or senses of this word: = An incorporated administrative district establish by a state charter.</Paragraph>
    <Paragraph position="6"> = A large densely populated municipality.</Paragraph>
    <Paragraph position="7"> = An urban center.</Paragraph>
    <Paragraph position="8"> When disambiguating a word like 'city' in the chain formation process this level of sense distinction is unnecessary. In fact if our aforementioned documents have chosen two different yet closely related definitions of this word (i.e. different synset numbers) then these documents will be considered less related than they actually are. Other research efforts in the lexical chaining area have suggested 'cleaning' WordNet [8] of rare senses or using some additional knowledge source in the chaining process that could biases the suitability of certain senses in particular contexts  . In future work we hope to address this problem by considering the use of collocation information like noun pairs such as 'physician/hospital' or 'Gates/Microsoft' in the chain formation process. Using such information will help to smooth out the discrepancies in semantic distances between concepts and help detect missing semantic relationships between these concepts. This occurrence information could also reduce the sensitivity of the detection process to fine levels of sense granularity if such information was used when determining the similarity between two document representations. So effectively this technique would eliminate the need for a composite representation in the identification of novel events in a news stream. Instead the data fusion element of our system would involve supplementing our knowledge source WordNet with word co-occurrence information in the chain formation process.</Paragraph>
  </Section>
class="xml-element"></Paper>