<?xml version="1.0" standalone="yes"?> <Paper uid="W96-0109"> <Title>EXPLOITING TEXT STRUCTURE FOR TOPIC IDENTIFICATION</Title> <Section position="7" start_page="106" end_page="108" type="evalu"> <SectionTitle> 5. EXPERIMENTS </SectionTitle> <Paragraph position="0"> We have conducted a set of experiments to see how the full-text and &quot;discard&quot; models compare in performance on the topic identification task. Our experiments used a total of 43,253 full-text news articles from Nihon Keizai Shimbun, a Japanese business daily (Nihon-Keizai-Shimbun-Sha, 1992).</Paragraph> <Paragraph position="1"> All of the articles appeared in the first half of the year 1992. Of these, the 40,553 articles that appeared on May 31, 1992 or earlier were used for training, and the remaining 2,700 articles, which appeared on June 1, 1992 or later, were used for testing.</Paragraph> <Paragraph position="2"> A training set and a test set were obtained by extracting nouns from the newspaper corpus, which involves, as a sub-step, tokenizing each article into a set of words. The procedure was carried out with the tokenizer program JUMAN. The resultant training set contained some 2.5 million words, excluding stop words.</Paragraph> <Paragraph position="3"> The test set was then divided into nine subsets of news articles according to length. Each subset contained 300 articles. In Table 1, test set 1, for instance, consists of articles, each of which contains 100 to 200 characters. Test set 2, on the other hand, consists of larger articles, which are between 200 and 300 characters long.</Paragraph> <Paragraph position="4"> [Table 1 column headings: test set, length (in char.), num. of doc.; table body not recovered]</Paragraph> <Paragraph position="5"> In the experiments, we were interested in finding out the effectiveness of a segment model which considers a starting block of the article and ignores everything else. 
Here we tried two approaches: one is based on a fixed-length segment and the other on a proportional-length segment. The fixed-length approach uses the first i words of the text, i being constant across texts, whereas the proportional-length approach uses the first j% of the words contained in the text, so that the actual length of the segment is proportional to that of the whole text.</Paragraph> <Paragraph position="6"> Table 2 and Table 3 show break even points of experiments using the fixed-length and proportional-length strategies, respectively. A break even point is the highest point at which recall and precision are equal. It is meant as a summary figure of performance. Precision and recall are determined for each text in the test set by the formulae below:

Precision = (number of words correctly identified as title words) / (number of words assigned)

Recall = (number of words correctly identified as title words) / (number of actual topics)

We use an assignment strategy called probabilistic thresholding (Lewis, 1992) to decide which words to assign to the text as potential title indicators. Basically, we pick a thresholding constant k and assign those words whose probability of being a title word is greater than k. Typically, a small value of k gives high recall and low precision, while the opposite is the case with a large value of k. A break even point is obtained by varying the value of k.</Paragraph> <Paragraph position="7"> Returning to Table 2 and Table 3, i indicates the size of the segment, and l the length of the text. The '+/-' figure next to each break even point indicates the improvement (or drop) compared to a topic identification task using full texts. The asterisk '*' means that no break even point was found for the associated experiment and the precision at the highest recall is listed instead (the highest recall is given parenthetically). 
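The probabilistic-thresholding evaluation just described can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the word scores, title-word set, and function names are all hypothetical.

```python
def assign(scores, k):
    """Assign words whose estimated title-word probability exceeds k."""
    return {w for w, p in scores.items() if p > k}

def precision_recall(assigned, title_words):
    """Precision and recall against the actual title words of one text."""
    if not assigned:
        return 0.0, 0.0
    correct = len(assigned & title_words)
    return correct / len(assigned), correct / len(title_words)

def break_even(scores, title_words, steps=100):
    """Vary the threshold k and return the point where precision and
    recall come closest to being equal (the break even point)."""
    best = None
    for s in range(steps + 1):
        k = s / steps
        p, r = precision_recall(assign(scores, k), title_words)
        gap = abs(p - r)
        if best is None or gap < best[0]:
            best = (gap, (p + r) / 2)
    return best[1]

# Hypothetical per-word title probabilities for one test article.
scores = {"stocks": 0.9, "rose": 0.7, "trading": 0.6, "early": 0.3, "on": 0.1}
titles = {"stocks", "trading"}  # hypothetical actual title words
print(round(break_even(scores, titles), 2))  # -> 0.5
```

At the break even threshold here (k = 0.6), two words are assigned, one of which is a true title word, so precision and recall both equal .50, mirroring how the break even points in Tables 2-4 are obtained.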
When the length of a text is smaller than that of the segment, the whole text is used.</Paragraph> <Paragraph position="8"> The column labelled &quot;10&quot; in Table 2 is the result of applying a segment model which considers the starting 10-word block of a text. The table shows that at i = 10, no break even points were found for texts with more than 400 characters (l > 400).</Paragraph> <Paragraph position="9"> Both the fixed-length (FLM) and proportional-length (PLM) approaches produced an improvement over the full-text model. Discarding the rear portion of a text turns out to be more effective for large texts (l > 200) than for short texts (100 < l < 200). However, the effectiveness of the &quot;discard&quot; strategy slowly declines as the text length increases. In Table 2, for instance, the effectiveness falls from .42 to .32 at i = 20. The distribution of similarity measurements in Fig. 4 suggests that the distribution for large texts tends to be less skewed to the left than that for short texts. This would mean that title-indicating terms are scattered more evenly over the text, making it all the more difficult to demarcate relevant from irrelevant parts of the text.</Paragraph> <Paragraph position="10"> A problem with the PLM approach is that the segment from which topical words are chosen is too small for short texts. Thus at j = 20%, for instance, its performance on 100-200 character texts drops by 17% compared to the full-text approach, but gradually improves as the value of j increases (see Table 3 and Fig. 8). Interestingly enough, the situation turns around when l is large and j is small: thus at j = 20, there is a 20% increase for 500 < l < 600 but a 17% decrease for 100 < l < 200.</Paragraph> <Paragraph position="11"> The results of experiments using paragraphs are shown in Table 4. The experiments used the first paragraph of a text as the segment. 
Though the use of paragraphs achieved better results at some points (300-400 and 500-600 characters) than the other approaches, the overall performance is not outstanding compared to either FLM or PLM. In particular, the 'first paragraph' strategy is outperformed by the full-text method on texts with more than 900 characters.</Paragraph> <Paragraph position="12"> Table 4: Results for using paragraphs. Figures are break even points.</Paragraph> <Paragraph position="13"> size (char.):  100-200  200-300  300-400  400-500  500-600  600-700  700-800  800-900  900-1000
b.e. point:    .396     .358     .389     .371     .381     .338     .290     .283     .250</Paragraph> </Section> </Paper>
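The three segmentation strategies compared in this section (the first i words, the first j% of words, and the first paragraph) can be sketched as follows; the function names and sample text are illustrative assumptions, not the paper's code.

```python
def fixed_length_segment(words, i):
    """FLM: keep the first i words; shorter texts are used whole."""
    return words[:i]

def proportional_length_segment(words, j):
    """PLM: keep the first j percent of the words (at least one word)."""
    n = max(1, round(len(words) * j / 100))
    return words[:n]

def first_paragraph_segment(text):
    """Paragraph model: keep only the first paragraph of the text,
    assuming paragraphs are separated by blank lines."""
    return text.split("\n\n")[0].split()

# Hypothetical 10-word article body.
words = "tokyo stocks rose sharply in early trading on renewed buying".split()
print(fixed_length_segment(words, 5))          # first 5 words
print(proportional_length_segment(words, 20))  # first 20% -> 2 of 10 words
```

Topic identification then runs only on the returned segment, which is how the "discard" models ignore the rear portion of each article.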