<?xml version="1.0" standalone="yes"?> <Paper uid="W04-0703"> <Title>Event Clustering on Streaming News Using Co-Reference Chains and Event Words</Title> <Section position="3" start_page="0" end_page="22" type="metho"> <SectionTitle> 2 Event Clustering using Co-Reference Chains </SectionTitle> <Paragraph position="0"> A co-reference chain in a document denotes an equivalence class of noun phrases (Cardie and Wagstaff, 1999). A co-reference resolution procedure first finds all possible NP candidates through word segmentation, named entity extraction, part-of-speech tagging, and noun phrase chunking. The candidates are then partitioned into equivalence classes using attributes such as the word/phrase itself, the part of speech of the head noun, named entities, positions in a document, number (singular, plural, or unknown), pronouns, gender (female, male, or unknown), and the semantics of head nouns. As the best F-measure of automatic co-reference resolution on English documents in MUC-7 was only 61.8% (MUC, 1998), a corpus hand-tagged with named entities and co-reference chains is prepared and employed to examine the real effects of co-reference chains in event clustering.</Paragraph> <Paragraph position="1"> The headline of a news story can be regarded as its short summary. That is, the words in the headline represent the content of a document in some sense. The co-reference chains that are initiated by the words in the headline are assumed to have higher weights. A sentence which contains any word in a given co-reference chain is said to &quot;cover&quot; that chain. Sentences which cover more co-reference chains contain more information and are selected to represent a document. Each sentence in a document is ranked according to the number of co-reference chains that it covers. The five scores shown below are computed. Sentences are sorted by the five scores in sequence, and the sentences with the highest scores are selected.
The selection procedure is repeated until the designated number of sentences, e.g., 4 in this paper, is obtained.</Paragraph> <Paragraph position="2"> (1) For each sentence that is not yet selected, count the number of noun co-reference chains from the headline which are covered by this sentence and have not been covered by the previously selected sentences.</Paragraph> <Paragraph position="3"> (2) For each sentence that is not yet selected, count the number of noun co-reference chains from the headline which are covered by this sentence, and add to this count the number of verbal terms in this sentence which also appear in the headline.</Paragraph> <Paragraph position="4"> (3) For each sentence that is not yet selected, count the number of noun co-reference chains which are covered by this sentence and have not been covered by the previously selected sentences.</Paragraph> <Paragraph position="5"> (4) For each sentence that is not yet selected, count the number of noun co-reference chains which are covered by this sentence, and add to this count the number of verbal terms in this sentence which also appear in the headline.</Paragraph> <Paragraph position="6"> (5) The position of the sentence. Score 1 considers only nominal features.</Paragraph> <Paragraph position="7"> In contrast, Score 2 considers nominal and verbal features together. Both scores are initiated by headlines. Scores 3 and 4 consider all the co-reference chains, no matter whether these chains are initiated by headlines or not. These two scores rank sentences that have the same Scores 1 and 2. Moreover, they can assign scores to news stories without headlines. Scores 1 and 3 are recomputed in each iteration. Finally, since news stories tend to contain more information in the leading paragraphs, Score 5 determines which sentence is selected according to sentence position when sentences have the same Scores (1)-(4).
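As an illustration, the iterative selection over Scores (1)-(5) can be sketched as follows; the data structures (sets of chain ids per sentence, a verbal-overlap count) and all names are our own assumptions, not the authors' code. The tuple comparison applies the five scores in sequence, and a sentence with a smaller position number wins ties.

```python
def select_sentences(sentences, k=4):
    """Greedy selection of k sentences by co-reference chain coverage."""
    selected = []      # indices of chosen sentences, in selection order
    covered = set()    # chain ids covered by sentences chosen so far
    while len(selected) < min(k, len(sentences)):
        best, best_key = None, None
        for i, s in enumerate(sentences):
            if i in selected:
                continue
            score1 = len(s["headline_chains"] - covered)              # Score (1)
            score2 = len(s["headline_chains"]) + s["headline_verbs"]  # Score (2)
            score3 = len(s["all_chains"] - covered)                   # Score (3)
            score4 = len(s["all_chains"]) + s["headline_verbs"]       # Score (4)
            key = (score1, score2, score3, score4, -s["position"])    # Score (5)
            if best_key is None or key > best_key:
                best, best_key = i, key
        selected.append(best)
        covered |= sentences[best]["all_chains"]
    return selected
```

Scores (1) and (3) are recomputed each round by subtracting the already-covered chains, mirroring the iterative recomputation described above.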
The smaller the position number of a sentence, the more it is preferred.</Paragraph> <Paragraph position="8"> The sentences extracted from a document form a summary of that document. The summary is represented as a term vector whose weights are normalized TF-IDF values, defined below.</Paragraph> <Paragraph position="10"> Here tf_ij is the frequency of term t_j in summary i, N is the total number of summaries in the collection being examined, df_j is the number of summaries in which term t_j occurs, and s_ij denotes the normalized TF-IDF value of term t_j in summary i.</Paragraph> <Paragraph position="13"> A single-pass complete link clustering algorithm incrementally divides the documents into several event clusters. We compute the similarity of the summary of an incoming news story with each summary in a cluster.</Paragraph> <Paragraph position="15"> If all the similarities are larger than a fixed threshold, the news story is assigned to the cluster.</Paragraph> <Paragraph position="16"> Otherwise, it forms a new cluster by itself. An event has a life span, which may be very long. Figure 1 shows that the life span of an air crash event is more than 100 days. To tackle the long life span of an event, a dynamic threshold (d_th), shown below, is introduced, where th is an initial threshold. In other words, the earlier the documents are put in a cluster, the smaller their thresholds are.
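The normalized TF-IDF weighting and the summary-to-summary similarity can be sketched as follows. The exact formulas are not shown in this extraction, so the tokenized-summary representation and the cosine measure between unit-normalized vectors are our assumptions; all names are ours.

```python
import math

def summary_vectors(summaries):
    """summaries: list of token lists; returns unit-length TF-IDF vectors."""
    N = len(summaries)
    df = {}                          # df_j: summaries containing term t_j
    for s in summaries:
        for t in set(s):
            df[t] = df.get(t, 0) + 1
    vectors = []
    for s in summaries:
        tf = {}                      # tf_ij: frequency of t_j in summary i
        for t in s:
            tf[t] = tf.get(t, 0) + 1
        v = {t: f * math.log(N / df[t]) for t, f in tf.items()}
        norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
        vectors.append({t: w / norm for t, w in v.items()})  # normalize
    return vectors

def cosine(u, v):
    """Similarity of two unit-normalized sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())
```

With unit vectors, the dot product alone gives the cosine similarity, so comparing an incoming summary against every summary in a cluster (complete link) is a loop of `cosine` calls against the threshold.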
Assume the published day of document D2 is later than that of document D1.</Paragraph> <Paragraph position="18"> where dist (day distance) denotes the number of days away from the day at which the event happened, and w_size (window size) keeps the threshold unchanged within the same window.</Paragraph> <Paragraph position="19"> Moreover, we use a square root function to prevent the dynamic threshold from decaying too fast.</Paragraph> </Section> <Section position="4" start_page="22" end_page="26" type="metho"> <SectionTitle> 3 Test Collection </SectionTitle> <Paragraph position="0"> In our experiment, we used the knowledge base provided by the United Daily News (http://udndata.com/), which has collected 6,270,000 Chinese news articles from 6 Taiwan local newspaper companies since 1975/1/1. To prepare a test corpus, we first set the topic to &quot;Hua Hang Kong Nan&quot; (Air Accident of China Airlines), and the search date range from 2002/5/26 to 2002/9/4 (when all rescue activities stopped). A total of 964 related news articles, each with published date, news source, headline and content, were returned by the search engine, all in SGML format. After reading those news articles, we deleted 5 news articles which had headlines but no content. The average length of a news article is 15.6 sentences. Figure 1 depicts the distribution of the number of documents over the event life span, where the x-axis denotes the day from the start of the year. For example, &quot;146&quot; denotes '2002/5/26', the 146th day of year 2002.</Paragraph> <Paragraph position="1"> Then, we identify thirteen focus events, e.g., rescue status. Meanwhile, two annotators are asked to read all the 959 news articles and classify these articles into the 13 events. If a news article cannot be classified, the article is marked as the &quot;other&quot; type. A news article which reports more than one event may be classified into more than one event cluster.
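The dynamic threshold formula (Formula (3)) itself has been lost in this extraction. The following is only a plausible reconstruction consistent with the surrounding description — square-root decay over a windowed day distance — and should be read as our assumption, not the authors' exact function.

```python
import math

def dynamic_threshold(th, dist, w_size):
    """Hypothetical reconstruction of Formula (3).
    th: initial threshold; dist: day distance from the start of the event;
    w_size: window size within which the threshold stays unchanged.
    The square root slows the decay; dropping it would give the
    linear-decay variant mentioned in the experiments."""
    window = dist // w_size + 1   # constant within each w_size-day window
    return th / math.sqrt(window)
```

Under this form the threshold is th in the first window and decreases slowly for older documents, matching the statement that earlier documents in a cluster get smaller thresholds.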
We compare the classification results of the two annotators and take the consistent results as our answer set. Table 1 shows the distribution of the 13 focus events.</Paragraph> <Paragraph position="2"> We adopt the metric used in Topic Detection and Tracking (Fiscus and Doddington, 2002). The evaluation is based on miss and false alarm rates. Both misses and false alarms are penalized, which more accurately measures the behavior of users who try to retrieve news stories. If the miss or false alarm rate is too high, users will not be satisfied with the clustering results. The performance is characterized by a detection cost, C_Det, in terms of the probabilities of miss and false alarm:</Paragraph> <Paragraph position="4"> C_Det = C_Miss * P_Miss * P_target + C_FA * P_FA * (1 - P_target), where C_Miss and C_FA are the costs of a miss and a false alarm, respectively, P_Miss and P_FA are the conditional probabilities of a miss and a false alarm, and C_Miss, C_FA and P_target are set to 1, 0.1 and 0.02, respectively. The smaller the detection cost, the higher the performance.</Paragraph> <Paragraph position="5"> For comparison, the centroid-based approach with single-pass clustering is regarded as the baseline model. A conventional TF-IDF scheme selects 20 features for each incoming news article, and each cluster uses 30 features as its centroid. Whenever an article is assigned to a cluster, the 30 words with the highest TF-IDF values are regarded as the new centroid of that cluster. The experimental results with various thresholds are shown in Table 2. The best result is 0.012990 when the threshold is set to 0.05.</Paragraph> <Paragraph position="6"> Kuo, Wong, Lin and Chen (2002) indicated that a compression rate of about 26% is suitable for a normal reader in multi-document summarization. Recall that the average length of a news story is 15.6 sentences. Following their postulation, a total of 4 sentences, i.e., about a quarter of the average length, are selected using co-reference chains. Table 3 shows the detection cost with various threshold settings.
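The detection cost above, with the standard TDT constants C_Miss = 1, C_FA = 0.1 and P_target = 0.02, is straightforward to compute:

```python
def detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """TDT detection cost:
    C_Det = C_Miss * P_Miss * P_target + C_FA * P_FA * (1 - P_target).
    Smaller is better."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
```

A system that misses everything costs 0.02, while one that false-alarms on everything costs 0.098, so the constants weight misses on true targets against false alarms on the far larger non-target population.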
We found that the best result was obtained using threshold 0.05; however, it was worse than the baseline result (i.e., 0.013137 > 0.012990).</Paragraph> <Paragraph position="7"> Next, we study the effects of dynamic thresholds. Three dynamic threshold functions are tested with window size 1. A linear decay approach removes the square root function from Formula (3). A slow decay approach adds a constant (0.05) to Formula (3) to keep the minimum threshold at 0.05 and degrade the threshold slowly. Table 4 shows that Formula (3) obtained the best result, and the dynamic threshold approach is better than the baseline model.</Paragraph> <Section position="1" start_page="26" end_page="26" type="sub_section"> <SectionTitle> Threshold Functions (Initial Threshold = 0.05) </SectionTitle> <Paragraph position="0"> Additionally, we evaluate the effect of the window size. Table 5 shows the results using various window sizes in Formula (3). The best detection cost, i.e., 0.012647, is achieved under window size 2. This also shows the effectiveness of the dynamic threshold and window size.</Paragraph> <Paragraph position="1"> The co-reference chains in the above approach consider features such as person names, organization names, locations, temporal expressions and number expressions. However, important words in an air crash event such as &quot;black box&quot; or &quot;rescue&quot; never appear in any co-reference chain. This section introduces the concept of event words.</Paragraph> <Paragraph position="2"> Topic and event words have been applied successfully to topic tracking (Fukumoto and Suzuki, 2000). The basic hypothesis is that an event word associated with a news article appears across the paragraphs of that article but not across documents; in contrast, a topic word frequently appears across all the news documents of the topic.
Because the goal of event clustering is to extract all the events associated with a topic, documents belonging to the same topic, e.g., China Airlines Air Accident, always share similar topic words like &quot;China Airlines&quot;, &quot;flight 611&quot;, &quot;air accident&quot;, &quot;Pen-Hu&quot;, &quot;Taiwan strait&quot;, &quot;rescue boats&quot;, etc. Topic words therefore seem to be of no help in event clustering. In contrast, each news article has different event words, e.g., &quot;emergency command center&quot;, &quot;set up&quot;, &quot;17:10PM&quot;, &quot;CKS airport&quot;, &quot;Commander Lin&quot;, &quot;stock market&quot;, &quot;body recovery&quot;, and so on. Extracting such keywords helps to understand the events and to distinguish one document from another.</Paragraph> <Paragraph position="3"> The postulation of Fukumoto and Suzuki (2002) is that the domain dependency among words is a key clue to distinguishing a topic from an event. This can be captured by a dispersion value and a deviation value. The former tells whether a word appears across paragraphs (documents), and the latter tells whether a word appears frequently. Event words are extracted using these two values. Formula (5) defines the weight of term t in the i-th story.</Paragraph> <Paragraph position="5"> The corresponding values are defined analogously at the paragraph level. The dispersion value of term t at the story level denotes how frequently term t appears across the m stories. The deviation value of term t in the i-th story denotes how frequently it appears in that particular story. Coefficients a and b are used to adjust the number of event words. In our experiments, 20 event words are extracted for each document. In this case, (a, b) is set to (10, 50) at the story level and to (10, 25) at the paragraph level.</Paragraph> <Paragraph position="6"> Formula (8) shows that term t frequently appears across paragraphs rather than stories.
Formula (9) shows that term t frequently appears in the i-th story rather than in paragraph P. Below we describe event clustering using event words only. First, we extract the event words of each news article using the whole news collection. For each sentence, we then compute the number of event words in it. After sorting all sentences, the designated number of sentences is extracted according to their number of event words. In the experiments, we use different window sizes to study the change of detection cost after introducing event words. Table 6 shows the experimental results under the same threshold (0.005) and test collection mentioned in Section 3.</Paragraph> </Section> <Section position="2" start_page="26" end_page="26" type="sub_section"> <SectionTitle> Various Window Sizes </SectionTitle> <Paragraph position="0"> The results in Table 6 are much better than those in Table 5, because the inclusion of event words selects more informative or representative sentences or paragraphs. The more informative feature words documents have, the more effectively documents of one event can be distinguished from those of another. In other words, the similarities of documents among different events become smaller, so that documents cannot easily be assigned to the same cluster under the higher threshold, and the best performance shifts from window size 2 to window size 3.</Paragraph> </Section> </Section> <Section position="5" start_page="26" end_page="26" type="metho"> <SectionTitle> 5 Event Clustering Using Both Co-reference </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="26" end_page="26" type="sub_section"> <SectionTitle> Chains and Event Words </SectionTitle> <Paragraph position="0"> According to the above experimental results, it is evident that both co-reference chains and event words are useful for event clustering on streaming news.
As co-reference chains and event words are complementary in some sense, we further examine the effect of using both of them on event clustering.</Paragraph> <Paragraph position="1"> Thus, two models, called the summation model and the two-level model, are proposed. The summation model is used to observe the combined effect of co-reference chains and event words on event clustering.</Paragraph> <Paragraph position="2"> On the other hand, the two-level model is used to observe the interaction between co-reference chains and event words.</Paragraph> </Section> <Section position="2" start_page="26" end_page="26" type="sub_section"> <SectionTitle> 5.1 Summation Model </SectionTitle> <Paragraph position="0"> In the summation model, the score of each sentence in a news document is simply the sum of its co-reference chain score and its event word score, each computed as described above. First, we extract the event words of each news article using the whole news collection described in Section 3. For each sentence, we then compute the number of event words in it and add this count to the number of co-reference chains it covers. The iterative procedure specified in Section 2 extracts the designated number of sentences according to the number of event words and co-reference chains.</Paragraph> <Paragraph position="1"> Table 7 summarizes the experimental results under the same test collection mentioned in Section 3.
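The summation score just described can be sketched as follows; the per-sentence data structures (the set of event words a sentence contains and the set of co-reference chains it covers) are our own assumptions.

```python
def summation_score(sentence):
    """Summation model: number of event words in the sentence plus the
    number of co-reference chains it covers."""
    return len(sentence["event_words"]) + len(sentence["chains"])

def rank(sentences):
    """Indices of sentences ordered by the combined score, highest first."""
    return sorted(range(len(sentences)),
                  key=lambda i: summation_score(sentences[i]),
                  reverse=True)
```

The designated number of sentences is then taken from the front of this ranking, re-applying the iterative coverage update of Section 2 where needed.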
The experiments with the summation model show that the best detection cost is 0.011603.</Paragraph> <Paragraph position="2"> Compared with the best results in Tables 5 and 6, the detection cost is decreased by 9% and 2%, respectively.</Paragraph> </Section> <Section position="3" start_page="26" end_page="26" type="sub_section"> <SectionTitle> 5.2 Two-level model </SectionTitle> <Paragraph position="0"> By comparing the experimental results described in Sections 3 and 4, we notice that the event word factor seems more important than the co-reference factor in event clustering of news documents.</Paragraph> <Paragraph position="1"> Moreover, from the summation model we only know that both factors are useful for event clustering. In order to make clear which factor is more important during event clustering of news documents, a two-level model is designed in such a way that the co-reference chains and the event words are used separately rather than simultaneously. That is, one sentence selection algorithm is applied first, and when there is a tie during sentence selection, the score function of the other approach decides which sentence is selected from the candidate sentences. Thus, two alternatives are considered. The type 1 model uses the event word sentence selection algorithm described in Section 4 to select the representative sentences from each document, and the co-reference chains are used to break ties. In contrast, the type 2 model uses the co-reference chain sentence selection algorithm described in Section 2 to select the representative sentences for each document, and event words are used to break ties. Table 8 shows the experimental results under the same test collection as described in the previous sections.</Paragraph> <Paragraph position="2"> The results show that using event words is better than using co-reference chains in event clustering.
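The two-level selection amounts to lexicographic ranking: a primary score orders the sentences, and the secondary score is consulted only on ties. A minimal sketch under assumed scoring callables (type 1 would pass the event word count as primary and chain coverage as secondary; type 2 swaps the roles):

```python
def two_level_select(sentences, primary, secondary, k=4):
    """Rank sentences by (primary, secondary) score tuples, highest first,
    and return the indices of the top k. The secondary score only matters
    when primary scores tie (lexicographic tuple comparison)."""
    order = sorted(range(len(sentences)),
                   key=lambda i: (primary(sentences[i]),
                                  secondary(sentences[i])),
                   reverse=True)
    return order[:k]
```

Tuple comparison gives the tie-breaking behavior for free: the second component is compared only when the first components are equal.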
Furthermore, the best score of type 1 is also better than the best score in Table 6. Thus, the introduction of co-reference chains can really improve the performance of event clustering using event words. On the other hand, the introduction of event words in type 2 does not have such an effect. Moreover, to further examine the use of co-reference chain information and event words in event clustering, a more elaborate combination of the two approaches, e.g., using mutual information or entropy, is needed.</Paragraph> </Section> </Section> </Paper>