<?xml version="1.0" standalone="yes"?>
<Paper uid="I05-2027">
  <Title>Machine Learning Approach To Augmenting News Headline Generation</Title>
  <Section position="3" start_page="155" end_page="156" type="metho">
    <SectionTitle>
2 Topiary System
</SectionTitle>
    <Paragraph position="0"> In this section, we describe the Topiary system developed at the University of Maryland with BBN Technologies. As already stated, this system was the top performing headline generation system at DUC 2004. A Topiary-style headline consists of a set of topic labels followed by a compressed version of the lead sentence. Hence, the Topiary system views headline generation as a two-step process: first, create a compressed version of the lead sentence of the source text, and second, find a set of topic descriptors that adequately describe the general topic of the news story. We will now look at each of these steps in more detail.</Paragraph>
    <Paragraph position="1"> Dorr et al. (2003) stated that when human subjects were asked to write titles by selecting words in order of occurrence in the source text, 86.8% of these headline words occurred in the first sentence of the news story. Based on this result Dorr, Zajic and Schwartz, concluded that compressing the lead sentence was sufficient when generating titles for news stories.</Paragraph>
    <Paragraph position="2"> Consequently, their DUC 2003 system HedgeTrimmer used linguistically-motivated heuristics to remove constituents that could be eliminated from a parse tree representation of the lead sentence without affecting the factual correctness or grammaticality of the sentence.</Paragraph>
    <Paragraph position="3"> These linguistically-motivated trimming rules (Dorr et al., 2003; Zajic et al., 2004) iteratively remove constituents until a desired sentence compression rate is reached.</Paragraph>
    <Paragraph position="4"> The compression algorithm begins by removing determiners, time expressions and other low content words. More drastic compression rules are then applied to remove larger constituents of the parse tree until the required headline length is achieved. For the DUC 2004 headline generation task systems were required to produce headlines no longer than 75 bytes, i.e. about 10 words. The following worked example helps to illustrate the sentence compression process.</Paragraph>
    <Paragraph position="5">  The part of speech tags in the example are explained as follows: S represents a simple declarative clause; SBAR represents a clause introduced by a (possibly empty) subordinating conjunction; NP is a noun phrase; VP is a verb phrase; ADVP is an adverbial phrase.</Paragraph>
    <Paragraph position="6">  Lead Sentence: The U.S. space shuttle Discovery returned home this morning after astronauts successfully ended their 10-day Hubble Space telescope service mission.</Paragraph>
    <Paragraph position="7"> Parse: (S (NP (NP The U.S. space shuttle) Discovery) (VP returned (NP home) (NP this morning)) (SBAR after (S (NP astronauts) (VP (ADVP successfully) ended (NP their 10-day Hubble Space telescope service mission))))) 1. Choose leftmost S of parse tree and remove all determiners, time expressions and low content units such as quantifiers (e.g.</Paragraph>
    <Paragraph position="8"> each, many, some), possessive pronouns (e.g.</Paragraph>
    <Paragraph position="9"> their, ours, hers) and deictics (e.g. this, tese, those): Before: (S (NP (NP The U.S. space shuttle) Discovery) (VP returned (NP home) (NP this morning)) (SBAR after (S (NP astronauts) (VP (ADVP successfully) ended (NP their 10-day Hubble Space telescope service mission))))) After: (S (NP (NP U.S. space shuttle) Discovery) (VP returned (NP home)) (SBAR after (S (NP astronauts) (VP (ADVP successfully) ended (NP 10-day Hubble Space telescope service mission))))) 2. The next step iteratively removes  constituents until the desired length is reached. In this instance the algorithm will remove the trailing SBAR.</Paragraph>
    <Paragraph position="10"> Before: (S (NP (NP U.S. space shuttle) Discovery) (VP returned (NP home)) (SBAR after (S (NP astronauts) (VP (ADVP successfully) ended (NP 10-day Hubble Space telescope service mission))))) After: U.S. space shuttle Discovery returned home.</Paragraph>
    <Paragraph position="11"> Like the 'trailing SBAR' rule, the other iterative rules identify and remove non-essential relative clauses and subordinate clauses from the lead sentence. A more detailed description of these rules can be found in Dorr et al. (2003) and Zajic et al. (2004) In this example, we can see that after compression the lead sentence reads  more like a headline. The readability of the sentence in this case could be further improved by replacing the past tense verb 'returned' with its present tense form; however, this refinement is not currently implemented by the Topiary system or by our implementation of this compression algorithm.</Paragraph>
    <Paragraph position="12"> As stated earlier, a list of relevant topic words is also concatenated with this compressed sentence resulting in the final headline. The topic labels are generated by the UTD (Unsupervised Topic Discovery) algorithm (Zajic et al., 2004). This unsupervised information extraction algorithm creates a short list of useful topic labels by identifying commonly occurring words and phrases in the DUC corpus. So for each document in the corpus it identifies an initial set of important topic names for the document using a modified version of the tf.idf metric. Topic models are then created from these topic names using the OnTopic(TM) software package. The list of topic labels associated with the topic models closest in content to the source document are then added to the beginning of the compressed lead sentence produced in the previous step, resulting in a Topiary-style summary.</Paragraph>
    <Paragraph position="13"> One of the problems with this approach is that it will only produce meaningful topic models and labels if they are generated from a corpus containing additional on-topic documents on the news story being summarised. In the next section, we explore two alternative techniques for identifying topic labels, where useful summary words are identified 'locally' by analysing the source document rather than 'globally' using the entire DUC corpus, i.e. the UTD method.</Paragraph>
  </Section>
  <Section position="4" start_page="156" end_page="156" type="metho">
    <SectionTitle>
3 C5.0
</SectionTitle>
    <Paragraph position="0"> C5.0 (Quinlan, 1998) is a commercial machine learning program developed by RuleQuest Research and is the successor of the widely used ID3 (Quinlan, 1983) and C4.5 (Quinlan, 1993) algorithms developed by Ross Quinlan. C5.0 is a tool for detecting patterns that delineate categories. It subsequently generates decision trees based on these patterns. A decision tree is a classifier represented as a tree structure, where each node is either a leaf node, a classification that applies to all instances that reach the leaf (Witten, 2000), or a non-leaf node, some test is carried out on a single attribute-value, with one branch and sub-tree for each possible outcome of the test. A decision tree is a powerful and popular tool for classification and prediction and can be used to classify an instance by starting at the root of the tree and moving down the tree branch until reaching a leaf node. However, a decision tree may not be very easy to understand. An important feature of C5.0 is that it can convert trees into collections of rules called rulesets. C5.0 rulesets consist of unordered collections of simple if-then rules. It is easy to read a set of rules directly from a decision tree. One rule is generated for each leaf. The antecedent of the rule includes a condition for every node on the path from the root to that leaf, and the consequent of the rule is the class assigned by the leaf. This process produces rules that are unambiguous in that the order in which they are executed is irrelevant (Witten, 2000).</Paragraph>
    <Paragraph position="1"> C5.0 has been used for text classification in a number of research projects. For example, Akhtar et al. (2001) used C5.0 for automatically marking up XML documents, Newman et al.</Paragraph>
    <Paragraph position="2"> (2005) used it for generating multi-document summary, while Zhang et al. (2004) applied this approach to World Wide Web site summarisation.</Paragraph>
  </Section>
  <Section position="5" start_page="156" end_page="157" type="metho">
    <SectionTitle>
4 HybridTrim System
</SectionTitle>
    <Paragraph position="0"> The HybridTrim system uses our implementation of the Hedge Trimmer algorithm and the C5.0 (Quinlan, 1998) machine learning algorithm to create a decision tree capable of predicting which words in the source text should be included in the resultant gist.</Paragraph>
    <Paragraph position="1"> To identify pertinent topic labels the algorithm follows a two-step process: the first step involves creating an intermediate representation of a source text, and the second involves transforming this representation into a summary text. The intermediate representation we have chosen is a set of features, that we feel are good indicators of possible 'summary words'. We focus our efforts on the content words of a document, i.e. the nouns, verbs and adjectives that occur within the document. For each occurrence of a term in a document, we calculate several features: the tf, or term  frequency of the word in the document; the idf, or inverse document frequency of the term taken from an auxiliary corpus (TDT, 2004); and the relative position of a word with respect to the start of the document in terms of word distance. We also include binary features indicating whether a word is a noun, verb or adjective and whether it occurs in a noun or proper noun phrase. The final feature is a lexical cohesion score calculated with the aid of a linguistic technique called lexical chaining. Lexical chaining is a method of clustering words in a document that are semantically similar with the aid of a thesaurus, in our case WordNet. Our chaining method identifies the following word relationship (in order of strength): repetition, synonymy, specialisation and generalisation, and part/whole relationships. Once all lexical chains have been created for a text then a score is assigned to each chained word based on the strength of the chain in which it occurs. More specifically, as shown in Equation (1), the chain strength score is the sum of each strength score assigned to each word pair in the chain.</Paragraph>
    <Paragraph position="2"> where reps</Paragraph>
    <Paragraph position="4"> is the frequency of word i in the text, and rel(i,j) is a score assigned based on the strength of the relationship between word i and j. More information on the chaining process and cohesion score can be found in Doran et al.</Paragraph>
    <Paragraph position="5"> (2004a) and Stokes (2004).</Paragraph>
    <Paragraph position="6"> Using the DUC 2003 corpus as the training data for our classifier, we then assigned each word a set of values for each of these features, which are then used with a set of gold standard human-generated summaries to train a decision tree summarisation model using the C5.0 machine learning algorithm. The DUC 2003 evaluation provides four human summaries for each document, where words in the source text that occur in these model summaries are considered to be positive training examples, while document words that do not occur in these summaries are considered to be negative examples. Further use is made of these four summaries, where the model is trained to classify a word based on its summarisation potential. More specifically, the appropriateness of a word as a summary term is determined based on the class assigned to it by the decision tree. These classes are ordered from strongest to weakest as follows: 'occurs in 4 summaries', 'occurs in 3 summaries', 'occurs in 2 summaries', 'occurs in 1 summary', 'occurs in none of the summaries'. If the classifier predicts that a word will occur in all four of the human generated summaries, then it is considered to be a more appropriate summary word than a word predicted to occur in only three of the model summaries. This resulted in a total of 103267 training cases, where 5762 instances occurred in one summary, 1791 in two, 1111 in three, 726 in four, and finally 93877 instances were negative. A decision tree classifier was then produced by the C5.0 algorithm based on this training data.</Paragraph>
    <Paragraph position="7"> To gauge the accuracy of our decision tree topic label classifier, we used a training/test data split of 90%/10%, and found that on this test set the classifier had a precision (true positives divided by true positives and false positives) of 63% and recall (true positives divided by true positives and false negatives) of 20%.</Paragraph>
  </Section>
  <Section position="6" start_page="157" end_page="157" type="metho">
    <SectionTitle>
5 Evaluation and Results
</SectionTitle>
    <Paragraph position="0"> In this section we present the results of our headline generation experiments on the DUC 2004 corpus.</Paragraph>
    <Paragraph position="1">  We use the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metrics to evaluate the quality of our automatically generated headlines. In DUC 2004 task 1, participants were asked to generate very short (&lt;=75 bytes) single-document summaries for documents on TDT-defined events.</Paragraph>
    <Paragraph position="2"> The DUC 2004 corpus consists of 500 Associated Press and New York Times newswire documents. The headline-style summaries created by each system were evaluated against a set of human generated (or model) summaries using the ROUGE metrics. The format of the evaluation was based on six scoring metrics: ROUGE-1, ROUGE-2, ROUGE-3, ROUGE-4, ROUGE-LCS and ROUGE-W. The first four metrics are based on the average n-gram match between a set of model summaries and the system-generated summary for each document in the corpus.</Paragraph>
  </Section>
  <Section position="7" start_page="157" end_page="158" type="metho">
    <SectionTitle>
ROUGE-LCS calculated the longest common
</SectionTitle>
    <Paragraph position="0"> Details of our official DUC 2004 headline generation system can be found in Doran et al. (2004b). This system returned a list of keywords rather than 'a sentence + keywords' as a headline. It used a decision tree classifier to identify appropriate summary terms in the news story based on a number of linguistic and statistical word features.</Paragraph>
    <Paragraph position="1"> [?] += )),(*)(()( jirelrepsrepschainScore ji (1)  sub-string between the system summaries and the models, and ROUGE-W is a weighted version of the LCS measure. So for all ROUGE metrics, the higher the ROUGE value the better the performance of the summarisation system, since high ROUGE scores indicate greater overlap between the system summaries and their respective models. Lin and Hovy (2003) have shown that these metrics correlated well with human judgments of summary quality, and the summarisation community is now accepting these metrics as a credible and less time-consuming alternative to manual summary evaluation. In the official DUC 2004 evaluation all summary words were stemmed before the ROUGE metrics were calculated; however, stopwords were not removed. No manual evaluation of headlines was performed.</Paragraph>
    <Section position="1" start_page="158" end_page="158" type="sub_section">
      <SectionTitle>
5.1 ROUGE Evaluation Results
</SectionTitle>
      <Paragraph position="0"> Table 1 shows the results of our headline generation experiments on the DUC 2004 collection. Seven systems in total took part in this evaluation, three Topiary-style headline generation systems and four baselines: the goal of our experiments was to evaluate linguistically-motivated heuristic approaches to title generation, and establish which of our alternative techniques for padding Topiary-style headlines with topic labels works best.</Paragraph>
      <Paragraph position="1"> Since the DUC 2004 evaluation, Lin (2004) has concluded that certain ROUGE metrics correlate better with human judgments than others, depending on the summarisation task being evaluated, i.e. single document, headline, or multi-document summarisation. In the case of headline generation, Lin found that ROUGE-1,</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>