<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-3001">
  <Title>Semantic Language Models for Topic Detection and Tracking</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Past work
</SectionTitle>
    <Paragraph position="0"> Traditionally NLP techniques have not met with much success in the IR domain. However, after several advances in tasks such as automatic tagging of text with high level semantics such as parts-of-speech (Ratnaparkhi, 1996), named-entities (Bikel et al., 1999), sentence-parsing (Charniak, 1997), etc., there is increasing hope that one could leverage this information into IR techniques. Traditional vector space models (Salton et al., 1975) and the more recent language models (Ponte and Croft, 1998) tend to ignore any semantic information and consider only word-tokens or word-stems as basic features.</Paragraph>
    <Paragraph position="1"> We know of no prior work in the language modeling framework that tries to incorporate semantic information into IR models. However, in vector space modeling framework, there have been a few attempts. For example, Allan, et al (Allan et al., 1999) use an ad-hoc weighting scheme to weight named-entities higher than other tokens in their vector space models for the new event detection task of TDT. They do not report any signi cant improvements in their results. Additionally, the weighting scheme is empirical and they present no principled approach to compute the weights.</Paragraph>
    <Paragraph position="2"> In the eld of ad-hoc retrieval, emerging research on integrating NLP tools into retrieval models seems encouraging. Mihalcea and Mihalcea (Mihalcea and Mihalcea, 2001) show that retrieval effectiveness can be improved by indexing words with their semantic classes such as parts-of-speech, named-entity-type, WordNet synonyms, hypernyms, hyponyms, etc.</Paragraph>
    <Paragraph position="3"> In this work, we present a principled approach to integrating semantic information into the language modeling framework and show how to compute the relative importance of various semantic classes automatically.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Semantic language models
</SectionTitle>
    <Paragraph position="0"> Recall that our task involves analyzing and comparing the content of news stories by the topics that they discuss.</Paragraph>
    <Paragraph position="1"> The topic of a news story is typically characterized by an event, one or more key players which may include persons or organizations (the who? of the event), a location to which the event is associated (the where? of the event), a time of occurrence of the event (the when? of the event) and a description of the event (the what? of the event).</Paragraph>
    <Paragraph position="2"> Hence, when comparing news stories, it makes sense to compare those features between the stories that answer the above mentioned four 'wh' questions (Allan et al., 2002b). However, extracting these features may not be a trivial task. It may need a deep understanding of the semantics of the story.</Paragraph>
    <Paragraph position="3"> As a rst step towards that end, we can leverage the ability of statistical taggers that can recognize automatically all instances of named-entities such as persons, locations, organizations, and parts-of-speech such as nouns, verbs, adjectives, etc., in a news story. As an approximation to our exact answers to the four 'wh' questions, we will assume that the set of tokens labeled as persons and organizations by the taggers correspond to an answer to the who? question, the set of dates correspond to the when? question, the set of locations to the where? question and lastly, the set of nouns, verbs and adjectives to the what? question. Our hope is that these categories of named-entities and parts-of-speech help us capture the semantics of the news story. Hence we will address these categories as semantic classes in this work and our model as semantic language model. Our model is a two-stage process in which the rst stage involves computing class-speci c likelihood ratios while the second stage consists of combining the ratios using a weighted perceptron. The ensuing discussion presents the mathematical description of the two stage process.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Class-specific likelihood ratio
</SectionTitle>
      <Paragraph position="0"> Let C = fC1;::;CjCjg be the set of semantic classes.</Paragraph>
      <Paragraph position="1"> Let C(w) be a relation that maps a given occurrence of a word w to its semantic class C 2 C. Then, for any story D, we de ne the list of features Fi(D) that belong to class Ci as follows:</Paragraph>
      <Paragraph position="3"> (2) where n = jFi(D)j. In other words, Fi(D) represents the list of all tokens in the story D that fall into the category Ci. Thus, each story is now represented as a set of feature-lists of all the semantic-classes as shown below: D fF1(D);:::;FjCj(D)g (3) For each semantic class Ci and story D, we de ne the class-speci c semantic language model Mi(D) as follows: null</Paragraph>
      <Paragraph position="5"> where f(w;Fi(D)) is the number of occurrences of a word w in a story D in the class Ci and GE is a general English collection, while is a smoothing parameter that lies between 0 and 1. Thus, the class-speci c semantic language model Mi(D) is a smoothed probability distribution of words in class Ci of story D. This is analogous to the standard document language models used by IR researchers.</Paragraph>
      <Paragraph position="6"> Given two stories D1 and D2, the semantic class speci c likelihood of D2 with respect to D1 is given by:</Paragraph>
      <Paragraph position="8"> where n = jFi(D2)j. We compute the log-likelihood ratio instead of just the generative probability P(D2jM(D1)) to overcome the tendency of the generative probability to favor shorter stories.</Paragraph>
      <Paragraph position="9"> The generative semantic-class-speci c general English model is given by:</Paragraph>
      <Paragraph position="11"/>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Weighted Perceptron approach
</SectionTitle>
      <Paragraph position="0"> Now, all that remains to be done is to combine the semantic class-speci c log-likelihood scores [L1(D2jD1);::;LjCj(D2jD1)]T in a principled way to obtain the overall similarity score of D1 with respect to D2. Towards that end, we cast the link detection task as a two-class classi cation problem, the two classes being 'on-topic' and 'off-topic'. In other words, each storypair (D1;D2) is a sample and the classi cation task involves assigning the label 'on-topic' or 'off-topic' to the story pair. We compute the semantic-class-speci c log-likelihood scores for all classes and treat them as components of the feature vector x of the sample as shown below:</Paragraph>
      <Paragraph position="2"> We use a linear discriminant function that is a linear combination of the components of x for classi cation as shown in the following equation:</Paragraph>
      <Paragraph position="4"> where y is the augmented feature vector given by y = [1; x]T , w = [w0;w1;::;wjCj]T is the weight vector. In particular w0 is called the bias or threshold weight. For a discriminant function of the form of equation 8, a two-class classi er implements the following decision rule: Decide 'on-topic' if g(y) &gt; 0 and 'off-topic' otherwise.</Paragraph>
      <Paragraph position="5"> The linear discriminant function clearly constitutes a perceptron. Figure 1 shows a graphical representation of the perceptron that takes the semantic-class-speci c log-likelihood scores as input.</Paragraph>
      <Paragraph position="6">  As the gure indicates, for each story pair (D1;D2), we build semantic-class-speci c models [M1(D1);::;MjCj(D1)] from story D1 as given by equation 4. We also construct the semantic class-speci c feature lists fF1(D2);::;FjCj(D2)g from story D2 as de ned in equation 2 and then compute the feature vector x = [L1(D2jD1);::;LjCj(D2jD1)]T where each component likelihood ratio is computed as given in equation 5. We then perform an inner product of the augmented feature vector y and the weight vector w of the perceptron and the resulting score is output as shown in the gure.</Paragraph>
      <Paragraph position="7"> The standard perceptron learns the optimal weight vector w by minimizing its misclassi cation rate on a training set (Duda et al., 2000). However, in TDT, misses (classi cation of an on-topic story pair as off-topic) are penalized 10 times more strongly than false alarms (classi cation of an off-topic story pair as on-topic) (Allan, 2002a). We have therefore incorporated these penalties into the criterion function to force the perceptron learn the optimal classi cation based on TDT's cost function.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Link detection task and evaluation
</SectionTitle>
    <Paragraph position="0"> In this section, we describe one of the tasks called link detection, on which we performed the experiments reported in this work.</Paragraph>
    <Paragraph position="1"> Link detection requires determining whether or not two randomly selected stories (D1;D2) discuss the same topic. The evaluation methodology of a link detection system requires the system to output a score for each story pair that represents the system's con dence that both stories in the pair discuss the same topic. The system's performance is then evaluated using a topicweighted Detection Error Trade-off (DET) curve (Martin et al., 1997) that plots miss rate against false alarm over a large number of story pairs, at different values of decision-threshold. A Link Detection cost function Clink is then used to combine the miss and false alarm probabilities at each value of threshold into a single normalized evaluation score (Allan, 2002a). We use the minimum value of Clink as the primary measure of effectiveness and show DET curves to illustrate the error trade-offs. It may be useful for the reader to remember that, since the DET curve is an error-tradeoff plot, the closer the curve is to the origin, the better is the performance, unlike the standard precision-recall curve familiar to the IR community. null</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="90" type="metho">
    <SectionTitle>
5 Experiments and results
</SectionTitle>
    <Paragraph position="0"> We have used Identi nder (Bikel et al., 1999) and Jtag (Xu et al., 1994) respectively, to tag each term by its named-entity type and its part of speech category. Additionally, we have used a list of 423 most frequent words to remove stop words from stories. Stemming is done using the Porter stemmer (Porter, 1980) while the model is implemented using Java.</Paragraph>
    <Paragraph position="1"> As a training set, we have used a subset of TDT3 corpus that consists of news stories from eight English sources collected roughly from October through December 1998. We have used manual transcriptions of stories when the source is audio/video. The training set consists of 7200 story pairs. For the general English model for this set, we have used the same TDT3 natural English manually transcribed set consisting of 37,526 news stories. For the test set, we have used a randomly chosen sub-set of natural English, manually transcribed stories from TDT2 corpus. It consists of 6,363 story pairs and the general English statistics are derived from 40,851 stories.</Paragraph>
    <Paragraph position="2"> In the unigram language modeling approach to link detection, which we have used as baseline in our experiments, we build a topic model M(D1) from one of the stories D1 in the pair. We then compute the log-likelihood ratio L(D2jD1) of the second story D2 with respect to M(D1) similar to equation 5 but considering the entire document as a single feature list. The semantic language model score, on the other hand, is computed as described in section 3.</Paragraph>
    <Paragraph position="3"> Sometimes we may use a symmetrized version of the formula, as shown below:</Paragraph>
    <Paragraph position="5"> However, in this work, we have considered only the asymmetric version of the formula to maintain simplicity of the scoring function. For fair comparison, we have used an asymmetric version of the baseline unigram language model too.</Paragraph>
    <Paragraph position="6"> We have considered the categories in gure 2 as our semantic classes. Note that only terms that are not classied as persons, organizations or locations are considered as candidates for nouns. The numbers in the table indicate the weight assigned by the perceptron to each class. We have trained the perceptron using the 7200 labeled story-pairs of the training set.</Paragraph>
    <Paragraph position="7"> The class All corresponds to the unigram model and consists of all the terms of the story. Note that some of the classes are de ned as the union of two or more subclasses. We have done this to nullify the labeling error of the named-entity and parts-of-speech taggers. For example, we have noticed that Identi nder mislabels Persons as Organizations and vice versa quite frequently. Our hope is that creating a new class that is a union of both Persons and Organizations will offset such tagging errors. null</Paragraph>
    <Paragraph position="9"> The optimum class-weights as learnt by the Perceptron offer some interesting insights. First we note that the class All receives the highest weight and this seems quite intuitive since this class contains all the information in the story. However, somewhat surprisingly, the class N SV SA receives higher weight than the class P SO SL indicating that the former class contains more topical information than the latter. Also, note that Persons are more important than Locations which are in turn more important than Organizations which seems to agree with common sense.</Paragraph>
    <Paragraph position="10"> Next we trained the unigram model on the training set and found the optimum value of the smoothing parameter to be 0:2. We have used the same value for the smoothing parameter in all the classes of the class-speci c language models and combined the class-speci c likelihood scores using the perceptron weights. A comparison of the performance of semantic language model and unigram model on the training set is shown in the DET curve of gure 3. Quite disappointingly, the results indicate that the overall performance as indicated by the minimum cost in the DET curve has only worsened.</Paragraph>
    <Paragraph position="11"> Figure 4 presents a comparison between unigram and semantic language models on the test set. The smoothing parameters and the perceptron weights are set to the val- null formance on training set ues learnt on the training set. This time, however, we note that the minimum cost of the semantic language model is slightly lower than that of the unigram model, but the improvement is very insigni cant.</Paragraph>
  </Section>
  <Section position="7" start_page="90" end_page="90" type="metho">
    <SectionTitle>
6 Discussion
</SectionTitle>
    <Paragraph position="0"> In this section, we rst brie y touch upon the variations in the model we considered and the various experiments we performed, but could not report in detail owing to space constraints. Secondly we discuss why we think the model's performance is unsatisfactory.</Paragraph>
    <Paragraph position="1"> We have considered a simple mixture model to start with, wherein each class-speci c model Mi(D1) generates a list of features in its class Fi(D2) but the model itself is sampled with a prior probability of P(Mi(D1)) which we made dependent on jFi(D1)j. This model's performance is found to be far below that of the uni-gram approach and hence we abandoned it to favor the perceptron-weighted likelihood ratios.</Paragraph>
    <Paragraph position="2"> In terms of experiments done, we started out with the basic semantic classes of P;O;L;N;V;A and Ad without considering unions of the classes. We found that taking unions improved performance and we report the list of classes whose combination performed the best.</Paragraph>
    <Paragraph position="3"> Coming to the second part of our discussion, we are yet to perform exploratory data analysis to understand the reasons behind the unsatisfactory performance of the new approach, but we believe the reasons could be three-fold: Firstly, it is possible that we are operating the semantic language model at a sub-optimal level. For example, we have used the same value of the smoothing parameter that we have learnt for the unigram model in all the classes of the semantic language model. It is possible that different classes may require different levels of smoothing for optimum performance. We believe one could use a gradient descent algorithm on TDT's cost function to learn from the training set the optimum values of the smoothing parameters for different classes.</Paragraph>
    <Paragraph position="4"> Secondly, a linear discriminant function that a perceptron implements is an overly simplistic classi er and may not be doing a good job on separating the on-topic pairs from the off-topic ones. A non-linear classi er such as an SVM (Burges, 1998) could help improve our accuracy.</Paragraph>
    <Paragraph position="5"> Lastly, it is possible that the unigram model is already capturing the relative importance of terms that we are trying to model using our semantic language models. The likelihood ratio score we use in the unigram approach behaves similar to the tf-idf weights, which we know are a powerful statistic to capture the relative importance of terms. If this were true, then the semantic language model may be rendered redundant.</Paragraph>
    <Paragraph position="6"> The real reasons will only be revealed by an analysis of the data and we hope to do this as part of our future work.</Paragraph>
  </Section>
class="xml-element"></Paper>