<?xml version="1.0" standalone="yes"?>
<Paper uid="W01-0719">
  <Title>Combining Linguistic and Machine Learning Techniques for Email Summarization</Title>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
CX
</SectionTitle>
    <Paragraph position="0">  ,wereassigned to account for the presence in the email subject line and/or headlines in the email body.  These weights were manually chosen after a set of experiments, but we plan to use a regression method to automatically learn them.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Symbolic Machine Learning Models
</SectionTitle>
      <Paragraph position="0"> We compared three symbolic machine learning paradigms (decision trees, rule induction and decision forests) applied to the task of salient NP extraction, evaluating five classifiers.</Paragraph>
      <Paragraph position="1">  Decision trees classify instances represented as feature vectors, where internal nodes of the tree test one or several attributes of the instance and where the leaves represent categories. Depending on how the test is performed at each node, there exists two types of decision tree classifiers: axis parallel and oblique. The axis-parallel decision trees check at each node the value of a single attribute. If the attributes are numeric, the test has  are real-valued coefficients.</Paragraph>
      <Paragraph position="2"> We compared the performance of C4.5, an axis-parallel decision tree classifier (Quinlan, 1993) and OC1, an oblique decision tree classifier (Murthy et al., 1993).</Paragraph>
      <Paragraph position="3">  In rule induction, the goal is to learn the smallest set of rules that capture all the generalisable knowledge within the data. Rule induction classification is based on firing rules on a new instance, triggered by the matching feature values to the left-hand side of the rules. Rules can be of various normal forms and can be ordered. However, the appropriate ordering can be hard to find and the key point of many rule induction algorithms is to minimize the search strategy through the space of possible rule sets and orderings. For our task, we test the effectiveness of two rule induction algorithms : C4.5rules that form production rules from unpruned decision tree, and a fast top-down propositional rule learning system, RIPPER (Cohen, 1995). Both algorithms first construct an initial model and then iteratively improve it. C4.5rules improvement strategy is a greedy search, thus potentially missing the best rule set. Furthermore, as discussed in (Cohen, 1995), for large noisy datasets RIPPER starts with an initial model of small size, while C4.5rules starts with an over-large initial model. This means that RIPPER's search is more efficient for noisy datasets and thus is more appropriate for our data collection. It also allows the user to specify the loss ratio, which indicates the ratio of the cost of false positives to the cost of false negatives, thus allowing a trade off between precision and recall. This is crucial for our analysis since we deal with sparse data due to the fact that in a document the number of salient NPs is much smaller than the number of irrelevant NPs.</Paragraph>
      <Paragraph position="4">  Decision forests are a collection of decision trees together with a combination function. We test the performance of DFC (Ho, 1998), a decision forest classifier that systematically constructs decision trees by pseudo-randomly selecting sub-sets of components of feature vectors. The advantage of this classifier is that it combines a set of different classifiers in order to improve accuracy. It implements different splitting functions. In the setting of our evaluation we tested the information gain ratio (similar to the one used by Quinlan in C4.5). An augmented feature vector (pairwise sums, differences, and products of features) was used for this classifier.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Linguistic Knowledge Enhances Machine Learning
</SectionTitle>
      <Paragraph position="0"> Not all simple noun phrases are equally important to reflect document meaning.</Paragraph>
      <Paragraph position="1"> Boguraev and Kennedy (1999) discuss the issue that for the task of document gisting, topical noun phrases are usually noun-noun compounds.</Paragraph>
      <Paragraph position="2"> In our work, we rely on ML techniques to decide which are the salient NPs, but we claim that a shallow linguistic filtering applied before the learning process improves the accuracy of the classifiers. We performed four filtering steps: 1. Inflectional morphological processing: Grouping inflectional variants together can help especially in case of short documents (which is sometimes the case for email messages). English nouns have only two kinds of regular inflection: a suffix for the plural mark and another suffix for the possessive one.</Paragraph>
      <Paragraph position="3"> 2. Removing unimportant modifiers: In this second step we remove the determiners that accompany the nouns and also the auxiliary words most and more that form the periphrastic forms of comparative and superlative adjectives modifying the nouns (e.g. &amp;quot;the most complex morphology&amp;quot; will be filtered to &amp;quot;complex morphology&amp;quot;).</Paragraph>
      <Paragraph position="4"> 3. Removing common words: We used a list of 571 common words used in IR systems in order to further filter the list of candidate NPs. Thus, words like even, following, every, are eliminated from the noun phrase structure.</Paragraph>
      <Paragraph position="5"> 4. Removing empty nouns: Words like lot, group, set, bunch are considered empty heads. For example the primary concept of the noun phrases like &amp;quot;group of students&amp;quot;, &amp;quot;lots of students&amp;quot; or &amp;quot;bunch of students&amp;quot; is given by the noun &amp;quot;students&amp;quot;. We extracted all the nouns that appear in front of the preposition &amp;quot;of&amp;quot; and then sorted them by frequency of appearance. A threshold was then used to select the final list (Klavans et al., 1990). Three different data collections were used: the Brown corpus, the Wall Street Journal, and a set of 4000 email messages (most of them related to a conference organization). We generated a set of 141 empty nouns that we used in this forth step of the filtering process.</Paragraph>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Results and Discussion
</SectionTitle>
    <Paragraph position="0"> One important step in summarization is the discovery of the relevant information from the source text. Our approach was to extract the salient NPs using linguistic knowledge and machine learning techniques. Our evaluation corpus consists of a collection of email messages which is heterogeneous in genre, length, and topic. We used 2,500 NPs extracted from 51 email messages as a training set and 324 NPs from 8 messages for testing.</Paragraph>
    <Paragraph position="1"> Each NP was manually tagged for saliency by one human judge. We are planning to add more judges in the future and measure the interuser agreement.</Paragraph>
    <Paragraph position="2"> This section outlines a comparative evaluation of five classifiers using two feature settings on the task of extracting salient NPs from email messages. The evaluation shows the following important results: Result 1. In the context of gisting, the head-modifier relationship is an ordered relation between semantically equal elements.</Paragraph>
    <Paragraph position="3"> We evaluate the impact of adding mh tfidf (see section 2.2), as an additional feature in the feature vector. This is shown in Table 2 in the different feature vectors fv1 and fv2. The first feature vector, fv1, contains the features in sections 2.2.1 and 2.2.2, while fv2 includes as an additional feature mh tfidf.</Paragraph>
    <Paragraph position="4"> As can be seen from Table 3, the results of evaluating these two feature settings using five different classifiers, show that fv2 performed better than fv1. For example, the DFC classifier shows an increase both in precision and recall. This allows us to claim that in the context of gisting, the syntactic head of the noun phrase is not always the semantic head, and modifiers can have also an important role.</Paragraph>
    <Paragraph position="5"> One advantage of the rule-induction algorithms is that their output is easily interpretable by humans. Analyzing C4.5rules output, we gain an insight on the features that contribute most in the classification process. In case of fv1,themostimportant features are: the first appearance of the NP and its head (np focc, head focc), the length of NP in number of words (np length words)and the tf*idf measure of the whole NP and its head (np tfidf, head tfidf ). For example:  In case of fv2, the new feature m tfidf impacts the rules for both Relevant and Not relevant categories. It supercedes the need for np tfidf and head tfidf, as can be seen also from the rules below: null  on the characteristics of the corpus, and combining classifiers improves accuracy This result was postulated by evaluating the performance of five different classifiers in the task of extracting salient noun phrases. As measures of performance we use precision and recall . The evaluation was performed according to what degree the output of the classifiers corresponds to the user judgments and the results are presented in Table 3.</Paragraph>
    <Paragraph position="6"> We first compare two decision tree classifiers: one which uses as the splitting function only a single feature (C4.5) and the other, the oblique tree classifier (OC1) which at each internal node tests a linear combination of features. Table 3 shows that OC1 outperforms C4.5.</Paragraph>
    <Paragraph position="7"> Columns 4 and 5 from Table 3 show the relative performance of RIPPER and C4.5rules. As discussed in (Cohen, 1995), RIPPER is more appropriate for noisy and sparse data collection than C4.5rules. Table 3 shows that RIPPER performs better than C4.5rules in terms of precision.</Paragraph>
    <Paragraph position="8"> Finally, we investigate whether a combination of classifiers will improve performance. Thus we choose the Decision Forest Classifier, DFC, to perform our test. DFC obtains the best results, as can be seen from column 6 of Table 3.</Paragraph>
    <Paragraph position="9"> Result 3. Linguistic filtering is an important step in extracting salient NPs As seen from Result 2, the DFC performed best in our task, so we chose only this classifier to present the impact of linguistic filtering. Table 4 shows that linguistic filtering improves precision and recall, having an important role especially on fv2, where the new feature, mh tfidf was used (from 69.2% precision and 56.25% recall to 85.7% precision and 87.9% recall).</Paragraph>
    <Paragraph position="10"> without filtering with filtering precision recall precision recall  This is explained by the fact that the filtering presented in section 3 removed the noise introduced by unimportant modifiers, common and empty nouns.</Paragraph>
    <Paragraph position="11"> Result 4. Noun phrases are better candidates than n-grams Presenting the gist of an email message by phrase extraction addresses one obvious question: are noun-phrases better than n-grams for representing the document content? To answer this question we compared the results of our system, GIST-IT, that extracts linguistically well motivated phrasal units, with KEA output, that extracts bigrams and trigrams as key phrases using aNa &amp;quot; ive Bayes model (Witten et al., 1999). Table 5 shows the results on one email message. The n-gram approach of KEA system extracts phrases like sort of batch, extracting lots, wn,andeven URLs that are unlikely to represent the gist of a document. This is an indication that the linguistically motivated GIST-IT phrases are more useful for document gisting. In future work we will perform also a task-based evaluation of these two</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
GIST-IT KEA
</SectionTitle>
    <Paragraph position="0"> perl module wordnet interface module 'wn' command line program sort of batch simple easy perl interface WordNet data wordnet.pm module accesses the WordNet wordnet system lots of WordNet query perl module WordNet perl wordnet QueryData wordnet package wn wordnet relation perl module command line extracting wordnet data use this module included man page extracting lots free software WordNet system</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML