<?xml version="1.0" standalone="yes"?>
<Paper uid="E95-1037">
  <Title>Topic Identification in Discourse</Title>
  <Section position="3" start_page="0" end_page="268" type="metho">
    <SectionTitle>
2 A Language Model
</SectionTitle>
    <Paragraph position="0"> Brown and Yue (1983) pointed out there are two kinds of topics: one is sentence topic and the other is discourse topic. The discourse topic is usually the form of topic sentence. We postulate, further, that the noun in the topic sentence play important roles in the whole discourse. Thus nouns play the core part in the underlying language model. The associations of a noun with other nouns and verbs are supporting factors for it to be a topic.</Paragraph>
    <Paragraph position="1">  The importance of a specific verb or noun is defined by Inverse Document Frequency (IDF) (Salton, 1986b):</Paragraph>
    <Paragraph position="3"> where P is the number of documents in LOB Corpus, i.e. 500, O(I4/) is the number of documents with word W, and c is a threshold value. LOB Corpus is a million-word collection of present-day British English texts. It contains 500 texts of approximately 2,000 words distributed over 15 text categories (Johansson, 1986). These categories include reportage, editorial, reviews, religion, skills, trades, popular lore, belles lettres, biography, essays, learned and scientific writings, fictions, humour, adventure and western fiction, love story, etc. That is to say, LOB Corpus is a balanced corpus. The tag set of LOB Corpus is based on the philosophy of that of Brown Corpus (Francis and Kucera, 1979), but some modifications are made. This is to achieve greater delicacy, while preserving comparability with the Brown Corpus.</Paragraph>
    <Paragraph position="4"> Those words that appear more than one haft of the documents in LOB Corpus have negative log((P. O(W))/O(W)) shown below.</Paragraph>
    <Paragraph position="5"> Noun: Verb:</Paragraph>
    <Paragraph position="7"> The threshold values for nouns and verbs are set to 0.77 and 2.46 respectively. The two values are used to screen out the unimportant words, whose 1DF values are negative. That is, their 1DF values are reset to zero. The strength of one occurrence of a verb-noun pair or a noun-noun pair is computed by the importance of the words and their distances:</Paragraph>
    <Paragraph position="9"> where SNV denotes the strength of a noun-verb pair, SNN the strength of a noun-noun pair, and D(X,Y) represents the distance between X and Y. When i equals to k, the SNN(Ni,Nk) is set to zero. Including the distance factor is motivated by the fact that the related events are usually located in the same texthood. This is the spatial locality of events in a discourse.</Paragraph>
    <Paragraph position="10"> The distance is measured by the difference between cardinal numbers of two words. We assign a cardinal number to each verb and noun in sentences. The cardinal numbers are kept continuous across sentences in the same paragraph. For example, With so many problems 1 to solve2, it would be a great helP3 to select 4 some one problem 5 which might be the key 6 to all the others, and begin 7 there. If there is any such keyproblem 8, then it is undoubtedly the problem 9 of the unitYlo of the Gospelll. There are three viewsl2 of the Fourth Gospell3 which have been held14.</Paragraph>
    <Paragraph position="11"> Therefore, the cardinal number of problems, C(problems), equals to 1 and C(held) is 14. The distance can be defined to be</Paragraph>
    <Paragraph position="13"> The association norms of verb-noun and noun-noun pairs are summation of the strengths of all their occurrences in the corpus: ANV(Nj, V~) = Z SNV(Ni' Vs) (5) ANN(Ni, N k) = Z SNN(N~, N k ) (6) where ANV denotes the association norm of a noun-verb pair, and ANN the association norm of a noun-noun pair. The less frequent word has a higher IDF value so that the strength SNV and SNN of one occurrence may be larger. However, the number of terms to be summed is smaller. Thus, the formulae IDF and ANV (ANN) are complementary. LOB Corpus of approximately one million words is used to train the basic association norms. They are based on different levels: the paragraph and sentence levels for noun-noun and noun-verb pairs respectively.</Paragraph>
    <Paragraph position="14"> Table 1 shows the statistics of the training corpus. The words with tags NC, NNU and NNUS and Ditto tags are not considered. Here NC means cited words, and NNU (NNUS) denotes abbreviated (plural) unit of measurement unmarked for number. Ditto tags are those words whose senses in combination differ from the role of the same words in other context.</Paragraph>
    <Paragraph position="15"> For example, &amp;quot;as to&amp;quot;, &amp;quot;each other&amp;quot;, and &amp;quot;so as to&amp;quot; (Johansson, 1986).</Paragraph>
    <Paragraph position="16">  Under the topic coherence postulation in a paragraph, we compute the connectivities of the nouns in each sentence with the verbs and nouns in the paragraph. For example, 439 verbs in LOB Corpus have relationships with the word &amp;quot;problem&amp;quot; in different degrees. Some of them are listed below in descending order by the strength.</Paragraph>
    <Paragraph position="17"> solve(225.21), face(84.64) ..... specify(16.55) ..... explain(6.47), ..., fal1(2.52) ..... suppose(1.67) .... For the example in Section 1, the word &amp;quot;problem&amp;quot; and &amp;quot;dislocation&amp;quot; are coherent with the verbs and nouns in the discourse. The nouns with the strongest connectivity form the preferred topic set. Consider the interference effects. The constituents far apart have less relationship. Distance D(X,Y) is used to measure such effects. Assume there are m nouns and n verbs in a paragraph. The connective strength of a noun Ni (1 &lt; i &lt; m) is defined to be:</Paragraph>
    <Paragraph position="19"> where CS denotes the connective strength, and PAr and PV are parameters for CSNN and CSNV and PN+PV=I.</Paragraph>
    <Paragraph position="20"> The determination of par and PV is via deleted interpolation (Jelinek, 1985). Using equation PN + PV = 1 and equation 9, we could derive PAr and PV as equation 10 and equation 11 show.</Paragraph>
    <Paragraph position="22"> LOB corpus are separated into two parts whose volume ratio is 3:1. Both PN and PV are initialized to 0.5 and then are trained by using the 3/4 corpus.</Paragraph>
    <Paragraph position="23"> Alter the first set of parameters is generated, the remain 1/4 LOB corpus is run until par and PV converge using equations 9, 10 and 11. Finally, the parameters, PN and PV, converge to 0.675844 and 0.324156 respectively.</Paragraph>
  </Section>
  <Section position="4" start_page="268" end_page="268" type="metho">
    <SectionTitle>
3 Topic Identification in a Paragraph
</SectionTitle>
    <Paragraph position="0"> The test data are selected from the first text of the files LOBT-DI, LOBT-F1, LOBT-G1, LOBT-H1, LOBT-KI, LOBT-M1 and LOBT-N1 of horizontal version of LOB Tagged Corpus for inside test (hereafter, we will use D01, F01, G01, H01, K01, M01, and N01 to represent these texts respectively). Category D denotes religion, Category F denotes popular lore, Category G denotes belles lettres, biography and essays, Category H denotes Miscellaneous texts, Category K denotes general fiction, Category M denotes science fiction, and Category N denotes adventure and western fiction.</Paragraph>
    <Paragraph position="1"> Each paragraph has predetermined topics (called assumed topics) which are determined by a linguist.</Paragraph>
    <Paragraph position="2"> Because a noun with basic form N may appear more than once in the paragraph, say k times, its strength is normalized by the following recursive formula:</Paragraph>
    <Paragraph position="4"> where NCS represents the net connective strength, o(k) denotes the cardinal number of the k'th occurrence of the same N such that C(NoO)) &lt; C(No(2)) &lt; C(No(3)) &lt;... &lt; C(No(k-l)) &lt; C(No(k)).</Paragraph>
    <Paragraph position="5"> The possible topic N* has the high probability NCS(N*). Here, a topic set whose members are the first 20% of the candidates is formed. The performance can be measured as the Table 2 shows.</Paragraph>
  </Section>
  <Section position="5" start_page="268" end_page="269" type="metho">
    <SectionTitle>
4 The Preliminary Experimental Results
</SectionTitle>
    <Paragraph position="0"> According to the language model mentioned in Section 2, we build the ANN and ANV values for each noun-noun pair and noun-verb pair. Then, we apply recursive formula of NCS shown in equations 12 and 13 to identifying the topic set for test texts. Table 3 shows experimental results. Symbols tx and c denotes mean and standard deviation. (+) denotes correct number, (-) denotes error number and (?) denotes undecidable number in topic identification.</Paragraph>
    <Paragraph position="1"> The undecidable case is that the assumed topic is a pronoun. Figure 1 shows correct rate, error rate, and undecidable rate.</Paragraph>
    <Paragraph position="2"> Row (1) in Table 3 shows the difficulty in finding topics from many candidates. In general, there are more than 20 candidates in a paragraph, It is impossible to select topics at random. Row (2) gives  the rank of assumed topic. The assumed topics are assigned by a linguist. Comparing row (1) and row (2), the average number of candidates are much larger than the rank of assumed topic. Since it is impossible to randomly select candidates as topics, we know topic identification is valuable.</Paragraph>
    <Paragraph position="3"> Rows (3), (4) and (5) list the frequencies of candidates, assumed topics and computed topic. The results intensify the viewpoint that the repeated words make persons impressive, and these words are likely to be topics. Our topic identification algorithm demonstrates the similar behavior (see rows (4) and (5)). The average frequencies of assumed topics and computed topics are close and both of them are larger than average frequency of candidates. Figure 2 clearly demonstrates this point. Row (6) reflects an interesting phenomenon. The topic shifted by authors from paragraph to paragraph is demonstrated through comparison of data shown in this row and row (2). The rank value of previous topics do obviously increase. Recall that large rank value denotes low rank.</Paragraph>
  </Section>
  <Section position="6" start_page="269" end_page="269" type="metho">
    <SectionTitle>
5 Concluding Remarks
</SectionTitle>
    <Paragraph position="0"> Discourse analysis is a very difficult problem in natural language processing. This paper proposes a corpus-based language model to tackle topi.c identification. The word association norms of noun-noun pairs and noun-verb pairs which model the meanings of texts are based on three factors: 1) word importance, 2) pair occurrence, and 3) distance. The nouns that have the stronger connectivities with other nouns and verbs in a discourse could form a preferred topic set. Inside test of this proposed algorithm shows 61.07% correct rate (80 of 131 paragraphs).</Paragraph>
    <Paragraph position="1"> Besides topic identification, the algorithm could detect topic shift phenomenon. The meaning transition from paragraph to paragraph could be detected by the following way. The connective strengths of the topics in the previous paragraph with the nouns and the verbs in the current paragraph are computed, and compared with the topics in the current paragraph. As our experiments show, the previous topics have the tendency to decrease their strengths in the current paragraph.</Paragraph>
  </Section>
class="xml-element"></Paper>