<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1007">
  <Title>Combining Hierarchical Clustering and Machine Learning to Predict High-Level Discourse Structure</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Feature Set
</SectionTitle>
    <Paragraph position="0"> Each training example is described by a set of features. The features were deliberately kept fairly shallow, i.e. they make use only of tokenisation, part-of-speech and sentence boundary information (all of which were taken from the original Penn Treebank mark-up). They do not require any deep processing, such as parsing.</Paragraph>
    <Paragraph position="1"> The model uses features from 7 areas: segment position, segment length, term overlap, punctuation, tense, cue phrases, and lexical chains.</Paragraph>
    <Paragraph position="2"> Segment position This set comprises 3 features, indicating whether the left (right) segment of the pair is the first (last) in the text and whether the merged segment would be in the beginning, middle or end of the text. The motivation for these features is that the beginning and end of a text often have a special discourse role (at least in this domain), e.g. the first paragraph frequently leads into the text, while the last often provides a summary.</Paragraph>
    <Paragraph position="3"> Segment length This set consists of 6 features: the number of words, sentences, and paragraphs of the left and right segment. Segment length can often be a clue as to whether two segments should be merged. For example, very long segments are not normally merged with very short segments unless the short segment has a special position, e.g. is the first or last of the text.</Paragraph>
    <Paragraph position="4"> Term overlap We use the formulae in Section 2 to calculate term overlap. This yields a real-valued score between 0 and 1, which was quantised by breaking the range into 10 equal intervals.</Paragraph>
    <Paragraph position="5"> Punctuation This set comprises 7 features: the final punctuation mark of the left segment and whether the left (right) segment contains, starts with, or ends with a quotation mark. The presence of quotations in both segments may indicate that they are related and so should increase their merging probability. Likewise, the final punctuation mark can sometimes be an important clue, e.g. if the left segment ends with a question mark, the next segment might provide an answer to the question and this should increase the merging probability.</Paragraph>
    <Paragraph position="6"> Tense We use 6 tense features: the first, last, and majority tense of the left (right) segment. Tense information was obtained by using regular expressions to extract verbal complexes from the part-of-speech tagged text and then determine their tense. Tense often serves as a cue for discourse structure (Lascarides and Asher, 1993; Webber, 1988b). A shift from simple past to past perfect, for instance, can indicate the start of an embedded segment.</Paragraph>
    <Paragraph position="7"> Cue phrases This set comprises 4 features. The first three features are reserved for potential cue phrases in the first sentence of the right segment.</Paragraph>
    <Paragraph position="8"> Cue phrases are identified by scanning a sentence (or the first 100 characters of it, whichever is shorter) for an occurrence of one of the cue phrases listed in Knott (1996). We have three features to be able to deal with multiple cue phrases (e.g. But because. . . ). In this case, the feature first cue phrase will be assigned the first cue word (but), second cue phrase the second cue word (because) and so on. Cue phrases are often ambiguous between syntactic and discourse use, as well as among different rhetorical relations. While our algorithm does not attempt proper disambiguation between syntactic and discourse usage, some non-discourse usages are filtered out on the basis of part-of-speech information. For example, second can be an adverb (as in Example 4) as well as an adjective (as in Example 5) but when used as a discourse marker it is usually an adverb.</Paragraph>
    <Paragraph position="9">  (4) Second, the extra savings would spur so much extra economic growth that the Treasury wouldn't suffer.</Paragraph>
    <Paragraph position="10"> (5) It was announced yesterday that the profits have  fallen for the second year in a row.</Paragraph>
    <Paragraph position="11"> The fourth cue phrase feature encodes whether the first sentence of the right segment contains a discourse anaphor, i.e. an anaphor which refers to a discourse segment rather than a real world entity, and if so which it is. An example is that in Example 6 (cf. Webber (1988a)). We do not attempt proper anaphora resolution, instead we treat first sentence occurrences of this and that as discourse anaphors if they seem to be complete NPs, e.g. are directly followed by a verb. This method potentially over-generates as these expressions could still refer to a preceding NP and it potentially undergenerates as it can sometimes also refer to discourse segments.</Paragraph>
    <Paragraph position="12"> However, previous research has found that demonstrative anaphors rarely refer to NPs, while it rarely refers to discourse segments (Webber (1988a)).</Paragraph>
    <Paragraph position="13">  (6) It's always been presumed that when the  glaciers receded, the area got very hot. The Folsum men couldn't adapt, and they died out.</Paragraph>
    <Paragraph position="14"> That's what is supposed to have happened.</Paragraph>
    <Paragraph position="15"> Lexical chains This set comprises 28 features.</Paragraph>
    <Paragraph position="16"> The idea of using lexical chains as indicators of lexical cohesion goes back to Morris and Hirst (1991). A lexical chain is a sequence of semantically related words and can indicate the presence and extent of subtopics in a text. We use our own implementation to compute chains.</Paragraph>
    <Paragraph position="17"> A distinction is made between common noun chains, which are built on the basis of semantic relatedness using WordNet (Miller et al., 1990), and proper noun chains, which contain nouns not found in WordNet and are based on co-reference rather than semantic relatedness. As a first step, nouns are extracted and lemmatised using the Morpha analyser (Minnen et al., 2001) and then looked up in WordNet. If no entry can be found and the noun is a compound noun, the first lexeme is removed and the remaining string is looked up until an entry is found or only one lexeme remains. For example, if chief executive officer could not be found in WordNet, our algorithm would try executive officer and then officer. Each term that can be found in WordNet is treated as a potential element of a common noun chain, even if it is strictly speaking a proper noun. This allows chains like Mexico - country - Chile. If a noun cannot be found in WordNet it is treated as a potential member of a proper noun chain.</Paragraph>
    <Paragraph position="18"> A potential problem for lexical chains is that words can have more than one sense and semantic relatedness depends on the sense rather than the word itself. We take a greedy approach to word sense disambiguation: while a noun is in a chain on its own, the algorithm is agnostic about its sense but this changes when another noun is added. A new noun a0 is added by comparing each of its senses to the senses of the members of existing chains and a score is calculated for each sense pair depending on the WordNet distance between them. Only distances up to an empirically set cut-off point count as a match, where the cut-off point depends on whether the term is a proper noun and on the nature of the semantic relation (only hypernym, hyponym and synonym relations are considered). If there are one or more matches, the noun is added with the sense that achieved the highest score to the chain a1 with which this score was achieved. If a1 contains only one noun</Paragraph>
    <Paragraph position="20"> with which the match was achieved. Repeated occurrences of the same noun in a text are placed in the same chain, i.e. it is assumed that a word keeps its sense throughout the text.</Paragraph>
    <Paragraph position="21"> When all common noun chains have been built, the significance of each chain is assessed and chains that are not considered significant are deleted. To be considered significant a chain has to contain at least two nouns (or two occurrences of the same noun) and the Gsig (see equation 3) averaged over all its elements either has to be relatively high or the chain has to be relatively long compared to the overall length of all other chains, where length is measured as the number of &amp;quot;hits&amp;quot; a chain has in the text.2 For example, Wall Street Journal articles frequently contain expressions of date, such as December, month, Tuesday, but these do not normally make interesting chains as they are high frequency expressions and the appearance of various date expressions throughout the text does not normally indicate a subtopic, i.e. it does not mean that the text is &amp;quot;about&amp;quot; time and date expressions. However, if time and date expression are very frequent in the text this may be an indicator that these do indeed form a subtopic and that the chain should be retained.</Paragraph>
    <Paragraph position="22"> Proper noun chains are built for words not in WordNet. Chain membership is determined on the basis of identity, i.e. a chain contains repeated occurrences of the same noun. Some proper noun phrase matching is done. For example, the expressions U.S. District Judge Peter Smith, Judge Smith, and Mr. Smith are treated as referring to the same entity and can therefore be placed in the same chain.</Paragraph>
    <Paragraph position="23"> When all proper noun chains have been built, those that contain only one element (i.e. one occurrence of a term) are removed. All other chains are retained.</Paragraph>
    <Paragraph position="24"> Note, unlike most approaches that make use of lexical chains, we do not break a chain in two if too many sentences intervene between the individual chain elements; chains are continued as long as new elements can be found. However, the algorithm keeps track of where in the text chain elements were found. If a chain skips one or two paragraphs this is actually an important clue because it can indicate that the two paragraphs form an embedded segments. This is especially true if there are also chains which start in the left paragraph and end in the right. For example, Figure 4 shows a text with 5 paragraphs (A to E) and two lexical chains. Chain 1 spans the whole text but skips paragraphs B and C, while chain 2 only spans paragraphs B and C. A situation like this makes it likely that B and C should be merged before either of them is merged with another paragraph. Hence Tree 1 in Figure 5 should be more likely than Tree 2. For this analysis it is crucial that chain 1 is not broken into two. Obviously 2Both thresholds were empirically set.</Paragraph>
    <Paragraph position="25"> for very long texts the situation will be slightly different and there will be circumstances where a chain should be broken.</Paragraph>
    <Paragraph position="26">  The individual chain features distinguish between proper and common noun chains. The reason for this is that the former are likely to be more reliable as they are based on term identity rather than semantic relatedness. For both types the features encode whether and how many chains: a0 span the two segments a0 exclusively span the two segment (i.e. start in the left segment and end in the right) a0 start or end in the left (right) segment a0 skip both of the segments a0 exclusively skip the two segments (i.e. skip both segments but none of the neighbouring segments) a0 skip one of the two segments a0 exclusively skip the left (right) segment To combine all features, we trained a maximum entropy model (see e.g. Ratnaparkhi (1998)) on the training set. Each feature is automatically assigned a weight reflecting its usefulness. Once trained the model outputs a probability distribution over the classes merge and don't merge for each pair of segments, based on the weighted features for the pair. To prevent the model from overfitting we used a feature cut-off of 10, i.e. feature-value pairs that occur less than 10 times in the training set were discarded.</Paragraph>
  </Section>
class="xml-element"></Paper>