<?xml version="1.0" standalone="yes"?>
<Paper uid="W01-1011">
  <Title>GIST-IT: Summarizing Email Using Linguistic Knowledge and Machine Learning Evelyne Tzoukermann</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 System architecture
</SectionTitle>
    <Paragraph position="0"> The input to GIST-IT is a single email message. The architecture, presented in Figure 1, consists of four distinct functional components. The first module is an email preprocessor developed for Text-To-Speech</Paragraph>
    <Paragraph position="2"> text processing unit, which is actually a pipeline of modules for extraction and filtering of simple NP candidates. The third functional component is a machine learning unit, which consists of a feature selection module and a text classifier.</Paragraph>
    <Paragraph position="3"> This module uses a training set and a testing set that were divided from our email corpus. To test the performance of GIST-IT on the task of summarization, we use a collection of email messages that is heterogeneous in genre, length, and topic. We represent each email as a set of NP feature vectors. We used 2,500 NPs extracted from 51 email messages as a training set and 324 NPs from 8 messages for testing.</Paragraph>
    <Paragraph position="4"> Each NP was manually tagged for saliency by one of the authors; we plan to add more judges in the future. The final module deals with the presentation of the gisted email message.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 The Email Preprocessor
</SectionTitle>
      <Paragraph position="0"> This module uses finite-state transducer technology in order to identify message content.</Paragraph>
      <Paragraph position="1"> Information at the top of the message related to &amp;quot;From/To/Date&amp;quot; as well as the signature block are separated from the message content.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Candidate Simple Noun Phrase Extraction and
Filtering Unit
</SectionTitle>
      <Paragraph position="0"> This module performs shallow text processing for extraction and filtering of simple NP candidates, consisting of a pipeline of three modules: text tokenization, NP extraction, and NP filtering. Since the tool was created to preprocess email for speech output, some of the text tokenization suitable for speech is not appropriate for text processing, and some modifications had to be implemented (e.g.</Paragraph>
      <Paragraph position="1"> the email preprocessor splits acronyms like DLI2 into DLI 2). The noun phrase extraction module uses Brill's POS tagger [Brill (1992)] and a base NP chunker [Ramshaw and Marcus (1995)].</Paragraph>
      <Paragraph position="2"> After analyzing some of the tagging errors, we augmented the tagger lexicon with entries from our training data and added lexical and contextual rules, mainly to deal with incorrect tagging of gerund endings. To improve the accuracy of the classifiers we perform linguistic filtering, as discussed in detail in Section 3.1.2.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Machine Learning Unit
</SectionTitle>
      <Paragraph position="0"> The first component of the ML unit is the feature selection module to compute NP vectors. In the training phase, a model for identifying salient simple NPs is created.</Paragraph>
      <Paragraph position="1"> The training data consist of a list of feature vectors already classified as salient/non-salient by the user. Thus we rely on user relevance judgments to train the ML unit. In the extraction phase this unit classifies relevant NPs using the model generated during training. We applied three machine learning paradigms (decision trees, rule induction algorithms, and decision forests), evaluating three different classifiers.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.4 Presentation
</SectionTitle>
      <Paragraph position="0"> The presentation of the message gist is a complex user interface issue with its own set of problems. Depending on the application and its use, one can think of different presentation techniques. The gist of the message could be the set of NPs, or the set of sentences in which these NPs occur, so that the added context makes it more understandable to the user. We do not address in this work the disfluency that could occur in listing a set of extracted sentences, since the aim is to deliver to the user the very content of the message, even in a raw fashion. GIST-IT is to be used in an application where the output is synthesized speech. The focus of this paper is on extracting content with GIST-IT; presentation is a topic for future research.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Combining Linguistic Knowledge and
Machine Learning for Email Gisting
</SectionTitle>
    <Paragraph position="0"> We combine symbolic machine learning and linguistic processing in order to extract the salient phrases of a document. Of the large syntactic constituents of a sentence (noun phrases, verb phrases, and prepositional phrases), we assume that noun phrases (NPs) carry the most contentful information about the document, even though verbs can be important too, as reported in the work of [Klavans and Kan (1998)]. The problem is that, regardless of the size of a document, the number of informative noun phrases is very small compared with the total number of noun phrases, making selection a necessity. Indeed, in the context of gisting, generating and presenting the list of all noun phrases, even with adequate linguistic filtering, may be overwhelming. Thus, we define the extraction of important noun phrases as a classification task, applying machine learning techniques to determine which features associated with the candidate NPs classify them as salient vs. non-salient. We represent the document -- in this case an email message -- as a set of candidate NPs, each associated with a feature vector used in the classification model. We use a number of linguistic methods both in the extraction and filtering of candidate noun phrases and in the selection of the features.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Candidate NPs
</SectionTitle>
      <Paragraph position="0"> Noun phrases were extracted using Ramshaw and Marcus's base NP chunker [Ramshaw and Marcus (1995)]. A base NP is either a simple NP as defined by Wacholder (1998) or a conjunction of two simple NPs. Since the feature vectors used in the classification scheme are simple NPs, we used different heuristics to automatically split conjoined NPs (CNPs) into simple ones (SNPs), properly assigning the premodifiers; Table 1 presents such an example. Since not all simple noun phrases are equally important to the document meaning, we use well-defined linguistic properties to extract only those NPs (or parts of NPs) that have a greater chance of conveying the salient information. By introducing this level of linguistic filtering before applying the learning scheme, we improve the accuracy of the classifiers, thus obtaining better results (see the discussion in Sections 4.1.3 and 5.3). We performed four filtering steps: 1. Inflectional morphological processing.</Paragraph>
      <Paragraph position="1"> English nouns have only two kinds of inflection: an affix that marks plural and an affix that marks possessive.</Paragraph>
      <Paragraph position="2"> 2. Removing unimportant modifiers. In this second step we remove the determiners that accompany the nouns, as well as the auxiliary words most and more that form the periphrastic comparative and superlative of the adjectives modifying the nouns.</Paragraph>
      <Paragraph position="3"> 3. Remove common words. We used a list of 571 common words used in IR systems to further filter the list of candidate NPs. Thus, words like even, following, every are eliminated from the noun phrase structure (e.g., &amp;quot;even more detailed information&amp;quot; and &amp;quot;detailed information&amp;quot; are thus grouped together).</Paragraph>
      <Paragraph position="4"> 4. Remove 'empty' nouns. Words like lot, group, set, bunch are considered 'empty' nouns in the sense that they contribute nothing to the noun phrase meaning. For example, the meaning of noun phrases like &amp;quot;group of students&amp;quot;, &amp;quot;lots of students&amp;quot; or &amp;quot;bunch of students&amp;quot; is given by the noun &amp;quot;students&amp;quot;. In order not to bias the extraction of empty nouns, we used three different data collections: the Brown corpus, the Wall Street Journal, and a set of 4000 email messages (most of which were collected during a conference organization). Our algorithm was simple: we extracted all the nouns that appear in front of the preposition &amp;quot;of&amp;quot;, sorted them by frequency of appearance across all three corpora, and used a threshold to select the final list. We generated a set of 141 empty nouns used in this fourth step of the filtering process.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Feature Selection
</SectionTitle>
      <Paragraph position="0"> We select a set of nine features that fall into three categories: linguistic, statistical (frequency-based) and positional. These features capture information about the relative importance of NPs to the document meaning.</Paragraph>
      <Paragraph position="1"> Several studies rely on the linguistic intuition that the head of the noun phrase makes a greater contribution to the semantics of the nominal group than the modifiers. For some NLP tasks, however, the head is not necessarily the most important item of the noun phrase. In analyzing email messages from the perspective of finding salient NPs, we claim that the constituents of the NP often have as much semantic content as the head. This view is also supported in the work of [Strzalkowski et al (1999)]. In many cases, the meaning of the NP is given equally by the modifier(s) -- usually nominal modifier(s) -- and the head. Consider the following list of simple NPs selected as candidates:  (1) &amp;quot;conference workshop announcement&amp;quot; (2) &amp;quot;international conference&amp;quot; (3) &amp;quot;workshop description&amp;quot; (4) &amp;quot;conference deadline&amp;quot;  In the case of noun phrase (1), the importance lies as much in the two noun modifiers, conference and workshop, as in the head announcement. We test this empirical observation by introducing as a separate feature in the feature vector a new TF*IDF measure that accounts for both the modifiers and the head of the noun phrase, thus treating the NP as a sequence of equally weighted elements. For the example above the new feature will be:</Paragraph>
      <Paragraph position="3"> We divided the set of features into three groups: one associated with the head of the noun phrase, one associated with the whole NP, and one that represents the new TF*IDF measure discussed above. Since we want to use this technique on other types of documents, all features are independent of text type or genre. For example, in the initial selection of our attributes we introduced as separate features the presence or absence of NPs in the subject line of the email and in the headlines of the body. Kilander (1996) pointed out that users estimate that &amp;quot;subject lines can be useful, but also devastating if their importance is overly emphasized&amp;quot;. Based on this study, and on our goal of providing a method that is domain- and genre-independent, we decided not to treat the subject line and the headlines as separate features, but rather as weights included in the TF*IDF measures as presented below. Another motivation for this decision is that in email processing headlines cannot always be identified reliably.</Paragraph>
      <Paragraph position="4"> 3.2.1 Features associated with the Head We choose two features to characterize the head of the noun phrases: head_tfidf - the TF*IDF measure of the head of the candidate NP.</Paragraph>
      <Paragraph position="5"> head_focc - The first occurrence of the head in the text (the number of words that precede the head, divided by the total number of words in the document).</Paragraph>
      <Paragraph position="6"> 3.2.2 Features associated with the whole</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
NP
</SectionTitle>
    <Paragraph position="0"> We select six features that we consider relevant in association with the whole NP: np_tfidf - the TF*IDF measure associated with the whole NP.</Paragraph>
    <Paragraph position="1"> np_focc - The first occurrence of the noun phrase in the document.</Paragraph>
    <Paragraph position="2"> np_length_words - Noun phrase length measured in number of words, normalized by dividing it by the total number of words in the candidate NP list.</Paragraph>
    <Paragraph position="3"> np_length_chars - Noun phrase length measured in number of characters, normalized by dividing it by the total number of characters in the candidate NP list.</Paragraph>
    <Paragraph position="4"> sent_pos - Position of the noun phrase in the sentence: the number of words that precede the noun phrase, divided by the sentence length. For noun phrases in the subject line and headlines (which are usually short and would be disproportionately affected by this measure), we use the maximum sentence length in the document as the normalization factor.</Paragraph>
    <Paragraph position="5"> par_pos - Position of noun phrase in paragraph, same as sent_pos, but at the paragraph level.</Paragraph>
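A sketch of how the positional features above might be computed from tokenized text, under the stated normalizations (word offsets divided by document or sentence length); the tokenization itself is an assumption:

```python
def positional_features(doc_words, np_words, sentences):
    """Compute np_focc and sent_pos for one NP, as defined above.
    doc_words: the tokenized document; np_words: the tokenized NP;
    sentences: the document split into tokenized sentences."""
    n = len(np_words)
    # np_focc: words preceding the first occurrence / total document words
    first = next(i for i in range(len(doc_words) - n + 1)
                 if doc_words[i:i + n] == np_words)
    np_focc = first / len(doc_words)
    # sent_pos: words preceding the NP in its sentence / sentence length
    for sent in sentences:
        for i in range(len(sent) - n + 1):
            if sent[i:i + n] == np_words:
                return np_focc, i / len(sent)
    return np_focc, 0.0
```

par_pos follows the same pattern with paragraphs in place of sentences.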
    <Paragraph position="6"> 3.2.3 Feature that considers all constituents of the NP equally weighted m_htfidf - the new TF*IDF measure that takes into account the importance of the modifiers.</Paragraph>
    <Paragraph position="7"> In computing the TF*IDF measures (head_tfidf, np_tfidf, m_htfidf), weights were assigned to account for the presence of terms in the subject line and/or headlines.</Paragraph>
    <Paragraph position="11"> These weights were manually chosen after a set of experiments, but we plan to use either a regression method or genetic algorithms to learn them automatically.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Three Paradigms of Supervised Machine
Learning
</SectionTitle>
      <Paragraph position="0"> Symbolic machine learning is used in conjunction with many NLP applications (syntactic and semantic parsing, POS tagging, text categorization, word sense disambiguation). In this paper we compare three symbolic learning techniques applied to the task of salient NP extraction: decision trees, rule induction learning, and decision forests.</Paragraph>
      <Paragraph position="1"> We tested the performance of an axis-parallel decision tree, C4.5 [Quinlan (1993)]; a rule learning system RIPPER [Cohen (1995)] and a decision forest classifier (DFC) [Ho (1998)].</Paragraph>
      <Paragraph position="2"> RIPPER allows the user to specify the loss ratio, which indicates the ratio of the cost of a false positive to the cost of a false negative, thus allowing a trade-off between precision and recall. This is crucial for our analysis since we deal with a sparse data set (in a document the number of salient NPs is much smaller than the number of irrelevant NPs). Finally, we examined whether a combination of classifiers might improve accuracy, increasing both precision and recall. The Decision Forest Classifier (DFC) uses an algorithm for systematically constructing decision trees by pseudo-randomly selecting subsets of components of feature vectors. It implements different splitting functions; in our evaluation we tested the information gain ratio (similar to the one used by Quinlan in C4.5). An augmented feature vector (pairwise sums, differences, and products of features) was used for this classifier.</Paragraph>
    </Section>
  </Section>
</Paper>