<?xml version="1.0" standalone="yes"?>
<Paper uid="W96-0109">
  <Title>EXPLOITING TEXT STRUCTURE FOR TOPIC IDENTIFICATION</Title>
  <Section position="3" start_page="101" end_page="102" type="metho">
    <SectionTitle>
2. APPROACH TO TOPIC IDENTIFICATION
</SectionTitle>
    <Paragraph position="0"> Topic identification concerns the problem of identifying the topic of a text when no information about the text's title or keywords is given. In this paper, we propose a probabilistic approach to the problem. We start with the definition of topic. Here we simply assume that a title term, that is, a noun that appears in the title, counts as a topic for the text. The reason is that this gives a clear-cut, if somewhat simple-minded, definition of topic and lends itself easily to statistical treatment. Thus identifying a topic amounts to locating an instance of a title term. One might ask how we could locate an instance of a topic without any information on the title or keywords. We take a probabilistic approach in which we estimate the likelihood that a noun appears as a title term and choose among the best-scoring nouns. Although it is quite possible to expand the notion of 'topic' to include nouns semantically related to title terms, that possibility is not explored here.</Paragraph>
    <Paragraph position="1"> The method we use for topic identification is basically one that is standardly used in research on text categorization, which aims at finding an effective way of classifying documents under predefined categories (Lewis, 1992). Roughly speaking, text categorization proceeds in two steps: first, for each of the given categories, estimate the likelihood that it is a correct category for a document; second, decide whether to assign the category to the document based on the estimate, the rule being to use a suitable cutoff point to determine the choice.</Paragraph>
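    <Paragraph> The two-step scheme can be sketched as follows; the scoring function and the cutoff value are placeholders for illustration, not values from the paper:

```python
def categorize(doc, categories, score, cutoff=0.5):
    """Two-step text categorization: (1) estimate, for each candidate
    category, the likelihood that it fits the document; (2) assign the
    categories whose estimates clear a cutoff point."""
    estimates = {c: score(c, doc) for c in categories}
    return sorted(c for c, s in estimates.items() if s > cutoff)
```
    </Paragraph>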
    <Paragraph position="2"> Adapting text categorization for use in topic identification requires a slight change in the former, because in topic identification we want to assign documents to nouns which occur in the document, rather than to categories given a priori.</Paragraph>
    <Paragraph position="3"> As mentioned above, topic identification is a two-part process: estimating and assigning. Let us explain a bit more about how the estimating works; we will talk about the assigning part later in the paper. We begin with some terminology. We call a word or an expression which classifies the text its potential topic, and those that appear in the title actual topics. Let L(c | d) denote the likelihood that a document d is assigned to, or classified by, a category c, W(d) the set of words or expressions comprising a text d, and S(d) the set of potential topics for d. Then the process of estimating consists in computing the likelihood value L(c | d) for each c in S(d) such that S(d) ⊆ W(d). Later, we will be concerned with whether a particular choice of the set S(d) will in any way affect the performance on topic identification. Following Iwayama et al. (1994) and Fuhr (1989), we define the likelihood function by:</Paragraph>
    <Paragraph position="5"> L(c | d) = Σ_t P(c | t) · P(t | d)   (1) which is meant to be a relativization of the relationship between c and d to some index t (Fuhr, 1989). The index t could be anything from simple units such as a word, a bigram, or a trigram to more complex forms such as a phrase or a sentence. Put simply, the equation above says that the greater the number of indices associated with both c and d, the more likely d is to predict c. In text categorization, a set of indices is said to represent a text. Assume that every index t will be assigned to some category. Then by Bayes' theorem, equation 1 can be transformed into:</Paragraph>
    <Paragraph position="7"> L(c | d) = Σ_t (P(t | c) · P(c) / P(t)) · P(t | d)   (2) We estimate the component probabilities by:</Paragraph>
    <Paragraph position="9"> P(c) = doc_f(Dc) / doc_f(D); P(t | c) = Fc(t) / token_f(Dc); P(t | d) = Fd(t) / token_f(d); P(t) = FD(t) / token_f(D). D is the collection of texts (news articles) found in the training corpus. Dc is the collection of texts in D whose title contains the term c. doc_f(D) is the count of texts in D; similarly, doc_f(Dc) is the count of texts which have the term c in the title. Fc(t) is the frequency of an index t in Dc. token_f(Dc) is the total count of word tokens in Dc, and similarly for token_f(d) and token_f(D). Fd(t) is the frequency of an index t in d. Finally, FD(t) is the frequency of an index t in D.</Paragraph>
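    <Paragraph> As a sketch under stated assumptions, the estimate can be written directly from the counts defined in this section; the function and variable names are illustrative, not the authors' implementation, and word tokens stand in for indices:

```python
from collections import Counter

def estimate_likelihood(c, d_tokens, corpus_docs, titles):
    """Score L(c|d) = sum_t P(t|c) * P(c) * P(t|d) / P(t), with each
    component estimated from raw counts over the training collection D.
    corpus_docs: list of token lists; titles: parallel list of
    title-term sets."""
    D_c = [doc for doc, title in zip(corpus_docs, titles) if c in title]
    if not D_c:
        return 0.0
    p_c = len(D_c) / len(corpus_docs)          # doc_f(Dc) / doc_f(D)
    tokens_Dc = Counter(t for doc in D_c for t in doc)
    total_Dc = sum(tokens_Dc.values())         # token_f(Dc)
    tokens_D = Counter(t for doc in corpus_docs for t in doc)
    total_D = sum(tokens_D.values())           # token_f(D)
    freq_d = Counter(d_tokens)
    total_d = len(d_tokens)                    # token_f(d)
    score = 0.0
    for t, f_d in freq_d.items():
        if tokens_D[t] == 0:
            continue
        p_t_c = tokens_Dc[t] / total_Dc        # Fc(t) / token_f(Dc)
        p_t_d = f_d / total_d                  # Fd(t) / token_f(d)
        p_t = tokens_D[t] / total_D            # FD(t) / token_f(D)
        score += p_t_c * p_c * p_t_d / p_t
    return score
```
    </Paragraph>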
    <Paragraph position="10"> (Futsu) (gin) -ga (Kiev) -ni (chuuzai) (in) (jimusho).</Paragraph>
    <Paragraph position="11"> French bank SBJ Kiev at resident staff office (Futsu) (gin) (Oote) -no (Societe) (Generale) -wa (15-nichi), (U*) (kura*) (ina*) -no (shuto) French bank big-name which is Société Générale as for on 15th U- kra- ine whose capital (Kiev) -ni (chuuzai) (in) (jimusho) -wo (kaisetsu) suru-to (happyo) shita. Sude-ni (Kiev) (shi) Kiev at resident staff office OBJ open plan disclose did Already Kiev city (tookyoku) -no (kyoka) -mo eta to-iu.</Paragraph>
    <Paragraph position="12"> authority whose permission as well obtained sources say</Paragraph>
  </Section>
  <Section position="4" start_page="102" end_page="103" type="metho">
    <SectionTitle>
MAJOR FRENCH BANK OPENS OFFICE IN KIEV
</SectionTitle>
    <Paragraph position="0"> Société Générale, a major French bank, disclosed on the 15th a plan to open a resident office in Kiev, capital of Ukraine. Shown in Fig. 1 is a sample news article from the corpus we used. Nouns marked with parentheses are automatically extracted using a Japanese tokenizer program called JUMAN (Matsumoto et al., 1994). The star indicates that the relevant items are wrongly tokenized. What we have in Fig. 2 is sets</Paragraph>
  </Section>
  <Section position="5" start_page="103" end_page="104" type="metho">
    <SectionTitle>
3. PROBLEM
</SectionTitle>
    <Paragraph position="0"> This section briefly discusses some of the problems with the present approach to topic identification (Nomoto, 1995). The most serious problem is that as the length of a story increases, the model's performance quickly degrades. A cause of the problem appears to be our assumption that S(d) is equal to W(d), that is, that every noun in the text counts as a potential topic of the text. As a consequence, an increase in text length results in a larger set of potential topics.</Paragraph>
    <Paragraph position="1"> Fig. 3 shows how the proportion of actual topics to indices (that is, T(d)/W(d)) changes with the increase in text length. (Information is based on the news articles in the test set we used in the experiments later. '100' denotes a set of news articles between 100 and 200 characters long, '200' means news articles between 200 and 300 characters long, and similarly for the others. The vertical dimension represents the proportion of words in the title against words in the text.) Since the title length stays rather constant over the test corpus, the possibility that an actual topic is identified by chance would be higher for short texts than for lengthy ones; we find 13% of indices to be actual topics at 100, while the rate goes down to 3% at 900. In the following, we will investigate ways of reducing the size of S(d) without hurting the performance of topic identification.</Paragraph>
  </Section>
  <Section position="6" start_page="104" end_page="106" type="metho">
    <SectionTitle>
4. METHOD
</SectionTitle>
    <Paragraph position="0"> Our approach to the reduction problem above is to use text structure to demarcate relevant from irrelevant parts of a text. Since newspaper articles in general do not have formal structure indicators, we use a similarity function to discover the structure of a text. One such function is proposed by Hearst (1994), where text structure is determined by measuring the similarity between adjacent blocks of text. Rather than use a measure of within-document similarity, however, the present approach uses a similarity measure between the text and its title to determine the structure of the text. Behind this is the assumption that the parts of a text most similar to its title best represent its content. Thus we expect that discarding parts dissimilar to the title reduces irrelevancy in the text, contributing to an improvement in performance.</Paragraph>
    <Paragraph position="1"> Let F(d) be a complete set of non-overlapping text segments comprising a document d. Let s ∈ F(d) and let h be the title associated with the document d. Then the similarity between the title and a text segment is given by the usual tf·idf measure (Wilkinson, 1994):</Paragraph>
    <Paragraph position="3"> sim(h, s) = (1/N) Σ_{t ∈ h} ntf(t, s) · nidf(t), where N is the number of words that appear in h. ntf(t, s) is a normalized term frequency of t in s, given by: ntf(t, s) = tf(t, s) / max_tf(s), where tf(t, s) denotes the frequency of the term t in s and max_tf(s) the frequency of the most frequent term in s.</Paragraph>
    <Paragraph position="4"> nidf(t) = log(df / df_t) / log df, where df_t is the number of segments which have an occurrence of t, and df is the total number of segments in d; log df is a normalization factor. Further, any repetitions of words are removed from the title. The length of a segment is fixed at 10 words in the experiments. The idea of idf, or inverse document frequency, is to give more weight to words which have a localized distribution, that is, those that appear only in some of the documents and not others.</Paragraph>
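    <Paragraph> The title/segment similarity above can be sketched as follows, assuming the normalized-tf and normalized-idf forms given in this section; function names are illustrative:

```python
import math
from collections import Counter

def title_segment_similarity(title_terms, segment, all_segments):
    """tf*idf similarity between a title and one fixed-length segment.
    title_terms: title words with repetitions removed (a set);
    segment: token list; all_segments: all segments of the document.
    Uses ntf(t) = tf(t)/max_tf and nidf(t) = log(df/df_t)/log(df)."""
    if not title_terms:
        return 0.0
    n_seg = len(all_segments)                  # df: total segments in d
    tf = Counter(segment)
    max_tf = max(tf.values()) if tf else 1
    score = 0.0
    for t in title_terms:
        df_t = sum(1 for s in all_segments if t in s)
        if df_t == 0 or n_seg == 1:
            continue                           # term absent, or log(df) = 0
        ntf = tf[t] / max_tf
        nidf = math.log(n_seg / df_t) / math.log(n_seg)
        score += ntf * nidf
    return score / len(title_terms)            # divide by N = |title|
```
    </Paragraph>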
    <Paragraph position="5"> Fig. 4 shows similarity graphs for news articles published from January 1992 through April 1992 (Nihon-Keizai-Shimbun-Sha, 1992). Each graph corresponds to articles of a particular length; the one marked '100-200' is for articles from 100 to 200 characters long. Since the test sets we used in the later experiments on topic identification range from 100 to 1000 characters in length, the title/block similarity is measured only for the relevant sizes. 't' denotes the number of articles considered. We divided each news article into a set of 10-word segments ordered according to their appearance. For each of the 10-word segments contained in the text, we measured its similarity to the title. The horizontal dimension represents the position at which a segment appears. A segment at x = 5, for instance, spans the 40th to the 49th word of the text. The vertical dimension gives the probability that a segment at a particular position is chosen as most similar to the title. It is clear from Fig. 4 that the initial portion of a text is more likely to be chosen as most similar to the title than other parts; the later a segment appears in the text, the less chance it has of being selected. In terms of rhetorical structure theory (RST) (Fox, 1987), the results could be interpreted as indicating a stylistic norm particular to the newspaper domain, namely that the main claim is presented at the beginning of the article, followed by supplemental material. Consider, for instance, the sample news article given in Fig. 1. Since it does not affect the points of the discussion, the English translation is used here for convenience. The article consists of a title (1) plus two sentences (2, 3).</Paragraph>
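    <Paragraph> The segmentation procedure described above can be sketched as follows; plain word overlap with the title stands in for the tf·idf similarity, purely as an illustrative simplification:

```python
def best_segment_position(tokens, title_terms, seg_len=10):
    """Split a text into fixed-length segments (10 words, as in the
    experiments) and return the 1-based position of the segment most
    similar to the title, with word overlap as the similarity."""
    segments = [tokens[i:i + seg_len] for i in range(0, len(tokens), seg_len)]
    overlaps = [len(title_terms.intersection(seg)) for seg in segments]
    return overlaps.index(max(overlaps)) + 1
```
    </Paragraph>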
    <Paragraph position="6">  (1) Major French Bank opens office in Kiev.</Paragraph>
    <Paragraph position="7"> (2) Société Générale, a major French bank, disclosed on the 15th a plan to open a resident office in Kiev, capital of Ukraine.</Paragraph>
    <Paragraph position="8"> (3) The bank has already obtained permission from the city authority, sources say. Considering the title (1), it is fair to say that the first sentence constitutes the main news of the article, while the second is its elaboration, providing supplementary details about the news. In RST, the article would be analyzed as having an Issue structure, which consists of one nucleus and some adjuncts. A nucleus is a set of clauses that presents the main claim of the text, something that makes the text newsworthy, while an adjunct supplements the main claim with some background or ancillary information. Fig. 5 shows a diagrammatic representation of the article along the lines of Fox (1987). A node labelled '(2)' represents the nucleus of the text and a node labelled '(3)' an adjunct to the nucleus. The arc going from '(2)' to '(3)' is labelled with the type of relationship that holds between the relevant nodes. 'Elaboration' is one of the three types that occur with an Issue structure; the other two are 'Evidence' and 'Background' (Fox, 1987).</Paragraph>
    <Paragraph position="10"> In fact, the results of the experiments shown in Fig. 4 give some ground for believing that texts from the newspaper domain, as a rule, take a rhetorical structure similar to that of Fig. 5, such as the one in Fig. 6, where the nucleus appears at the beginning of the text, followed by any number of supplementary adjuncts.</Paragraph>
    <Paragraph position="11"> In the light of this, we have conducted a series of experiments to determine whether discarding rear portions of text affects the performance of topic identification.</Paragraph>
  </Section>
</Paper>