<?xml version="1.0" standalone="yes"?> <Paper uid="W97-0305"> <Title>Detecting Subject Boundaries Within Text: A Language Independent Statistical Approach</Title> <Section position="4" start_page="47" end_page="49" type="intro"> <SectionTitle> 2 Design </SectionTitle> <Paragraph position="0"> The algorithm is divided into five distinct stages.</Paragraph> <Paragraph position="1"> Figure 1 shows the sequential, modular structure of the algorithm. Each stage of the algorithm is described in more detail below.</Paragraph> <Section position="1" start_page="47" end_page="47" type="sub_section"> <SectionTitle> 2.1 Preprocessing (stage 1) </SectionTitle> <Paragraph position="0"> In her implementation of the TextTiling algorithm Hearst ignores preprocessing, claiming it does not affect the results (Hearst, 1994). By preprocessing we mean lemmatizing, stemming, converting upper to lower case etc. Testing this assumption on her algorithm indeed seems not to change the results. However, using preprocessing in conjunction with stage 2 of our algorithm, does improve results. It is important for our algorithm that morphological differences between semantically related words are resolved, so that words like &quot;bankrupt&quot; and &quot;bankruptcy&quot;, for example, are identified as the same word.</Paragraph> </Section> <Section position="2" start_page="47" end_page="49" type="sub_section"> <SectionTitle> 2.2 Calculating a significance value for </SectionTitle> <Paragraph position="0"> each word (stage 2) Hearst treats a text more or less as a bag of words in its statistical analysis. But natural language is no doubt more structured than this. Different words have differing semantic functions and relationships with respect to the topic of discourse. We can broadly distinguish two extreme categories of words; content words versus function words. Content words introduce concepts, and are the means for the expression of ideas and facts, for example nouns, proper nouns, adjectives and so on. Function words (for example determiners, auxiliary verbs etc.) support and coordinate the combination of content words into meaningful sentences. Obviously, both are needed to form meaningful sentences, but, intuitively, it is the content words that carry most weight in defining the actual topic of discourse. Based on this intuition, we believe it would be advantageous to identify these content words in a text. It would then be possible to bias the calculation of lexical correspondences (stage 3) taking into account the higher significance of these words relative to function words.</Paragraph> <Paragraph position="1"> We would ideally like firstly to reduce the effect of noisy non-content words on the algorithm's performance, and secondly to pay more attention to words with a high semantic content. In her implementation, Hearst attempts to do this by having a finite list of problematic words that are filtered out from the text before the statistical analysis takes place (Hearst, 1994). These problematic words are primarily function words and low semantic content words, such as determiners, conjunctions, prepositions and very common nouns.</Paragraph> <Paragraph position="2"> Church and Gale (Church and Gale, 1995) mention the correlation between a word's semantic content and various measures of its distribution throughout corpora. They show that: &quot;Word rates vary from genre to genre, topic to topic, author to author, document to document, section to sec: tion, paragraph to paragraph. 
</Section> <Section position="2" start_page="47" end_page="49" type="sub_section"> <SectionTitle> 2.2 Calculating a significance value for each word (stage 2) </SectionTitle> <Paragraph position="0"> Hearst treats a text more or less as a bag of words in its statistical analysis. But natural language is no doubt more structured than this. Different words have differing semantic functions and relationships with respect to the topic of discourse. We can broadly distinguish two extreme categories of words: content words versus function words. Content words introduce concepts and are the means for the expression of ideas and facts, for example nouns, proper nouns, adjectives and so on. Function words (for example determiners, auxiliary verbs, etc.) support and coordinate the combination of content words into meaningful sentences. Obviously, both are needed to form meaningful sentences, but, intuitively, it is the content words that carry most weight in defining the actual topic of discourse. Based on this intuition, we believe it would be advantageous to identify these content words in a text. It would then be possible to bias the calculation of lexical correspondences (stage 3) to take into account the higher significance of these words relative to function words.</Paragraph> <Paragraph position="1"> We would ideally like, firstly, to reduce the effect of noisy non-content words on the algorithm's performance and, secondly, to pay more attention to words with a high semantic content. In her implementation, Hearst attempts to do this by having a finite list of problematic words that are filtered out from the text before the statistical analysis takes place (Hearst, 1994). These problematic words are primarily function words and words with low semantic content, such as determiners, conjunctions, prepositions and very common nouns.</Paragraph> <Paragraph position="2"> Church and Gale (Church and Gale, 1995) mention the correlation between a word's semantic content and various measures of its distribution throughout corpora. They show that: &quot;Word rates vary from genre to genre, topic to topic, author to author, document to document, section to section, paragraph to paragraph. These factors tend to decrease the entropy and increase the other test variables&quot;. One of these other test variables mentioned by Church and Gale is burstiness. They attribute the innovation of the notion of burstiness to Slava Katz, who, pertaining to this topic, writes (Katz, 1996): &quot;The notion of burstiness ... will be used for the characterisation of two closely related but distinct phenomena: (a) document-level burstiness, i.e. multiple occurrence of a content word or phrase in a single text document, which is contrasted with the fact that most other documents contain no other instances of this word or phrase at all; and (b) within-document burstiness (or burstiness proper), i.e. close proximity of all or some individual instances of a content word or phrase within a document exhibiting multiple occurrence.&quot; Katz has highlighted many interesting features of the distribution of content words which do not conform to the predictions of statistical models such as the Poisson. Katz (Katz, 1996) states that, when a concept named by a content word is topical for the document, that content word tends to be characterised by multiple and bursty occurrence. He claims that, while a single occurrence of a topically used content word or phrase is possible, it is more likely that a newly introduced topical entity will be repeated, &quot;if not for breaking the monotonous effect of pronoun use, then for emphasis or clarity&quot;. He also claims that, unlike function words, the number of instances of a specific content word is not directly associated with the document length, but is rather a function of how much the document is about the concept expressed by that word.</Paragraph> <Paragraph position="3"> Therefore, the characteristic distribution pattern of topical content words, which contrasts markedly with that of non-topical and non-content words, could provide a useful aid in identifying the semantically relevant words within a text. Brief mention should be made of the work done by Justeson and Katz (Justeson and Katz, 1995), which, to a certain degree, relates to the requirements of our task. In their paper, Justeson and Katz describe some linguistic properties of technical terminology and use them to formulate an algorithm that identifies the technical terms in a given document. However, their algorithm deals with complex noun phrases only, and, although the technical terms identified by their algorithm are generally highly topical, the algorithm does not provide the context-sensitive information of how topical each incidence of a given meaningful phrase is relative to its direct environment. It is precisely this information that is needed to judge the content of a particular segment of text.</Paragraph> <Paragraph position="4"> Although Katz (Katz, 1996) acknowledges what he calls two distinct, but closely related, forms of burstiness, he concentrates on modelling the inter-document distributions of content words and phrases. He then uses the inter-document distributions to make inferences about the probabilities of repeat occurrences of content words and phrases within a single document.
Another divergence between what Katz has done so far and what the task of subject boundary insertion requires is that he decides to ignore the issue of coincidental repetitions of non-topically used content words and simply equates &quot;single occurrence with non-topical occurrence, and multiple occurrence with topical occurrence&quot;.</Paragraph> <Paragraph position="5"> We have implemented a method which assigns an estimated significance score based on a measure of two context-dependent properties: local burstiness and global frequency. The heart of our solution to the problem of assigning context-based values of topical significance to all words in a text can be summed up in the following formula (equation 1):</Paragraph> <Paragraph position="6"> where x is an individual word in the document and D_x,i is the distance between word x and its ith nearest neighbour. The 1st nearest neighbour of word x is the nearest occurrence of the same word. The 2nd nearest neighbour of x is the nearest occurrence of the same word, ignoring the 1st nearest neighbour. In general, the ith nearest neighbour of x is the nearest occurrence of the same word, ignoring the 1st, 2nd, 3rd, ..., (i-1)th nearest neighbours. W is the total number of words in the text, w is the number of occurrences of the word x, and n is the number of nearest neighbours to include in the calculation, which depends on the overall frequency of the word in the text. This formula yields a raw significance score in which low values indicate high significance and high values indicate low significance. This number is then normalised to between 0 and 1, with 0 indicating very low significance and 1 indicating very high significance.</Paragraph> <Paragraph position="7"> The exact value of n is calculated separately for each distinct word, using the following formula:</Paragraph> <Paragraph position="9"> This is essentially a sigmoid function with a range varying between two and ten, as shown in Figure 2. The constants scale and translate the function to yield the desired behaviour, which was derived empirically. The number of nearest neighbours to consider in equation 1 therefore increases with the word's frequency. For example, when calculating the significance of the least frequent words, only two nearest neighbours are considered, but for the most frequently occurring words the number of nearest neighbours is ten.</Paragraph>
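<Paragraph> The equations themselves are not reproduced above, so the following Python sketch should be read as an illustration of the kind of calculation involved rather than as the exact formulae: it assumes that the raw score of a word occurrence is the mean distance to its n nearest same-word neighbours divided by the word's expected spacing W/w, and that n is obtained from a sigmoid of the word's relative frequency scaled into the range two to ten; the sigmoid constants are placeholders, not the empirically derived values mentioned above.

import math
from collections import defaultdict

def neighbour_count(freq, max_freq):
    # Sigmoid mapping a word's overall frequency to the number of nearest
    # neighbours n, ranging from 2 (rarest words) to 10 (most frequent).
    # The scaling constants here are illustrative placeholders.
    z = 8.0 * (freq / max_freq) - 4.0
    return int(round(2.0 + 8.0 / (1.0 + math.exp(-z))))

def significance_scores(words):
    # words: the preprocessed document as a list of tokens (stage 1 output).
    W = len(words)
    positions = defaultdict(list)
    for pos, word in enumerate(words):
        positions[word].append(pos)
    max_freq = max(len(p) for p in positions.values())

    scores = [0.0] * W
    for word, occs in positions.items():
        w = len(occs)
        n = neighbour_count(w, max_freq)
        expected_spacing = W / w
        for pos in occs:
            # Distances to the other occurrences of the same word, sorted so
            # that the first n entries are the n nearest neighbours.
            dists = sorted(abs(pos - other) for other in occs if other != pos)
            nearest = dists[:n]
            if nearest:
                # Assumed raw score: how tight the local cluster is relative
                # to this word's expected spacing (local burstiness balanced
                # against global frequency).  Small raw value = significant.
                raw = (sum(nearest) / len(nearest)) / expected_spacing
            else:
                raw = float('inf')   # single occurrence: treated as non-topical
            # Map to (0, 1], with 1 = very high and 0 = very low significance.
            scores[pos] = 1.0 / (1.0 + raw)
    return scores

Under these assumptions a tight cluster of a globally rare word scores close to 1, while even clustered occurrences of a very frequent word such as &quot;the&quot; score much lower, since their spacing is not much tighter than the expected spacing W/w. </Paragraph>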
<Paragraph position="10"> Figure 3 shows the main features of the performance of this significance assignment algorithm when tested on a sample text. The results for three very different words are shown. Two general trends are the most important features of this graph. Firstly, elevated significance scores are associated with local clusters of a word.</Paragraph> <Paragraph position="11"> For example, the cluster of three occurrences of &quot;software&quot; (a content word) at the end of the document has high significance scores. This contrasts with the relatively isolated occurrences of the word &quot;software&quot; in the middle of the document, which are deemed to be little more significant than several occurrences of the word &quot;the&quot; (a function word). Secondly, frequent words tend to receive lower significance scores. For example, even local clusters of the word &quot;the&quot; receive only relatively low significance scores, simply because the word has a high frequency throughout the document. Conversely, &quot;McNealy&quot; (a word with high semantic content), which occurs only in a cluster of three, receives a high significance value.</Paragraph> <Paragraph position="12"> The important result shown by the graph is that content words (real names such as &quot;McNealy&quot;) receive higher significance values than function words (such as &quot;the&quot;).</Paragraph> <Paragraph position="13"> We found that an optimal solution to the problem of balancing local density against global frequency was rather elusive. For example, the words at the centre of a cluster automatically receive a higher score, whereas it may be more desirable to have all the members of a cluster assigned scores lying in a narrower range. There are many other contentious issues which need to be investigated, such as the use of the ratio of all the occurrences of a word in a given text to the total length of that text in order to calculate the relative significance measure. Based on intuition, partly derived from Katz's discussion (Katz, 1996) of the relationship between document length and word frequency, the exact nature of this relationship across various document lengths may not be reliable enough. It may be more consistent to consider this ratio within a constant window size, e.g. 1000 words.</Paragraph> <Paragraph position="14"> The advantage of this simple statistical method of distinguishing significant content words from non-content words is that no words need to be removed before the algorithm proceeds. The output of this stage is a normalised significance score (0-1) for each word in the text. This significance score can then be taken into account when analysing the text for subject boundaries.</Paragraph> </Section> <Section position="3" start_page="49" end_page="49" type="sub_section"> <SectionTitle> 2.3 Calculate Biased Lexical Correspondences (stage 3) </SectionTitle> <Paragraph position="0"> Let us consider two sets of words, set A and set B. The main aim of this stage of the processing is to calculate a correspondence measure between two such sets according to how similar they are, where similarity is defined as a measure of lexical correspondence. If many words are shared by both set A and set B, then the lexical correspondence between the two sets is high. If the two sets do not share many words, then the correspondence is low.</Paragraph> <Paragraph position="1"> Now let A' be the subset of A that contains only those words that occur somewhere in B, and let B' be the subset of B that contains only those words that occur somewhere in A. The lexical correspondence between sets A and B can then be calculated using the simple formula:</Paragraph> <Paragraph position="2"> Correspondence = ( |A'| / |A| + |B'| / |B| ) / 2 </Paragraph> <Paragraph position="3"> This yields a value within the range 0 to 1. |A| can be re-written as 1 + 1 + 1 + 1 + ... by adding a 1 for every word in A. Each word has already been given a significance value, as described in stage 2 of the algorithm, and this information is taken into account by re-defining |A| as s1 + s2 + s3 + ..., where s1 is the significance value assigned to the first word in A, s2 that of the second, and so on. The same can be done for A', B and B'. The formula now takes the average of the biased ratios. All this means is that instead of each word counting for '1' in a set, it counts for its significance value (a value between 0 (insignificant) and 1 (highly significant)). The result is that each word affects the correspondence measure according to its significance in the text.</Paragraph>
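<Paragraph> A sketch of this biased measure in its basic form, assuming the two windows arrive as lists of (word, significance) pairs from stage 2 and that neither window is empty:

def weighted_size(pairs):
    # The biased set size: the sum of significance values s1 + s2 + s3 + ...
    # rather than a simple count of the words.
    return sum(sig for _word, sig in pairs)

def correspondence(a_pairs, b_pairs):
    # a_pairs, b_pairs: (word, significance) pairs for the two windows A and B.
    a_words = set(word for word, _sig in a_pairs)
    b_words = set(word for word, _sig in b_pairs)

    # A' and B': the members of one window whose word also occurs in the other.
    a_shared = [(w, s) for w, s in a_pairs if w in b_words]
    b_shared = [(w, s) for w, s in b_pairs if w in a_words]

    # Average of the two biased ratios |A'|/|A| and |B'|/|B|, in the range 0 to 1.
    ratio_a = weighted_size(a_shared) / weighted_size(a_pairs)
    ratio_b = weighted_size(b_shared) / weighted_size(b_pairs)
    return (ratio_a + ratio_b) / 2.0

The refined formula described next modifies these ratios so that significant words occurring in only one of the two windows also affect the measure. </Paragraph>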
<Paragraph position="4"> So far, a word that occurs only in A and not in B contributes zero to |A'|. This means that a highly significant word occurring only in A has exactly the same effect as an insignificant word occurring only in A. In other words, the significance biasing only takes place for words that appear in both A and B.</Paragraph> <Paragraph position="5"> Therefore, the formula actually used is: Correspondence = ( (|A'| - |A''|) / |A| + (|B'| - |B''|) / |B| ) / 2 where A'' is the subset of A which contains only those words that occur in A and not in B. Similarly, B'' is the subset of B which contains only those words that occur in B and not in A. This is shown in Figure 4.</Paragraph> <Paragraph position="6"> Recall that |A|, |A'|, |A''|, |B|, |B'| and |B''| are not calculated by adding one for each word in each set, but by summing the significance values of the words in each set.</Paragraph> <Paragraph position="7"> This stage of the processing looks at the output from the significance calculation stage and considers every sentence break in turn, starting at the top of the document and working down. The algorithm assigns a correspondence measure to each sentence break as follows. Firstly, set A is generated by taking all the words in the previous fifteen sentences. Next, set B is generated by taking all the words in the following fifteen sentences. Now the sets A', A'', B' and B'' are generated as described, and the formula above is applied, which assigns a correspondence value to the sentence break currently under consideration. The algorithm then moves to the next sentence break and repeats the process.</Paragraph> <Paragraph position="8"> The output from this stage of the algorithm is a list of sentence break numbers (1..n, where n is the number of sentences in the document), each with a lexical correspondence measure. These numbers provide the input for stage four: smoothing.</Paragraph> </Section> <Section position="4" start_page="49" end_page="49" type="sub_section"> <SectionTitle> 2.4 Smoothing (stage 4) </SectionTitle> <Paragraph position="0"> A graph can be plotted with lexical correspondence along the y-axis and sentence number along the x-axis. In order to distinguish the significant peaks and troughs from the many minor fluctuations, a simple smoothing algorithm is used, taking three neighbouring points on the graph, P1, P2 and P3.</Paragraph> <Paragraph position="2"> The line P1P3 is bisected and the midpoint is labelled A. P2 is perturbed by a constant amount (not dependent on the distance between A and P2) towards A. This new point is labelled B and becomes the new P2. This is performed simultaneously on every point on the graph. The process is then iterated a fixed number of times. The result is that noise is flattened out while the larger peaks and troughs remain (although slightly smaller).</Paragraph>
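<Paragraph> A minimal sketch of this smoothing step; the step size and the number of passes here are illustrative assumptions, not the constant and iteration count actually used.

import math

def smooth(values, step=0.02, iterations=5):
    # values: the lexical correspondence assigned to each sentence break.
    # One pass moves every interior point P2 by a constant amount towards A,
    # the midpoint of the line joining its neighbours P1 and P3; all points
    # are updated simultaneously, and the pass is repeated a fixed number
    # of times.
    vals = list(values)
    for _ in range(iterations):
        updated = list(vals)
        for i in range(1, len(vals) - 1):
            midpoint = (vals[i - 1] + vals[i + 1]) / 2.0
            diff = midpoint - vals[i]
            # Constant-size step (not proportional to the distance), capped
            # so the point never overshoots the midpoint itself.
            move = math.copysign(min(step, abs(diff)), diff)
            updated[i] = vals[i] + move
        vals = updated
    return vals
</Paragraph>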
<Paragraph position="3"> The output from this stage is simply the sentence break numbers and their new, smoothed correspondence values.</Paragraph> </Section> <Section position="5" start_page="49" end_page="49" type="sub_section"> <SectionTitle> 2.5 Inserting subject boundaries (stage 5) </SectionTitle> <Paragraph position="0"> Considering the graph described in the previous section, generating subject boundaries is simply a matter of identifying local minima on the graph. The confidence of a boundary is calculated from the 'depth' of the local minimum. This depth is calculated simply by taking the average of the heights of the 'peaks' on either side of the minimum, measured relative to the height of the minimum. This yields a list of candidate subject boundaries and an associated confidence measure for each one. Breaks are then inserted into the original text at the places corresponding to those local minima whose confidence value satisfies a 'minimum confidence' criterion. This cut-off criterion is arbitrary, and in our implementation it can be specified at run time.</Paragraph> </Section> </Section> </Paper>