<?xml version="1.0" standalone="yes"?> <Paper uid="W01-0719"> <Title>Combining Linguistic and Machine Learning Techniques for Email Summarization</Title> <Section position="4" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Machine Learning for Content Extraction </SectionTitle> <Paragraph position="0"> Symbolic machine learning has been applied successfully to many NLP applications (syntactic and semantic parsing, POS tagging, text categorization, word sense disambiguation), as reviewed by Mooney and Cardie (1999).</Paragraph> <Paragraph position="1"> We used machine learning techniques to find salient noun phrases that can represent the summary of an email message. This section describes the three decisions involved in this classification task: 1) which representation is appropriate for the information to be classified as relevant or non-relevant (candidate phrases), 2) which features should be associated with each candidate, and 3) which classification models should be used.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Candidate Phrases </SectionTitle> <Paragraph position="0"> Of the major syntactic constituents of a sentence, e.g. noun phrases, verb phrases, and prepositional phrases, we assume that noun phrases (NPs) carry the most contentful information about the document, a well-supported hypothesis (Smeaton, 1999; Wacholder, 1998).</Paragraph> <Paragraph position="1"> Following Wacholder (1998), simple NPs are the maximal NPs that contain premodifiers but not post-nominal constituents such as prepositions or clauses. We chose simple NPs for content representation because they are semantically and syntactically coherent and less ambiguous than complex NPs. To extract simple noun phrases we first used Ramshaw and Marcus's base NP chunker (Ramshaw and Marcus, 1995). A base NP is either a simple NP or a coordination of simple NPs. 
We used heuristics based on POS tags to automatically split the coordinate NPs into simple ones, properly assigning the premodifiers. Table 1 presents some coordinate NPs (CNP) encountered in our data collection and the simple NPs (SNP1 and SNP2) into which our algorithm split them.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Features used for Classification </SectionTitle> <Paragraph position="0"> The choice of features used to represent the candidate phrases has a strong impact on the performance of the classifiers (e.g. the number of examples needed to obtain a given accuracy on the test data, the cost of classification). For our classification task of determining whether a noun phrase is salient to the meaning of the document, we chose a set of nine features.</Paragraph> <Paragraph position="1"> Several studies rely on the linguistic intuition that the head of the noun phrase makes a greater contribution to the semantics of the nominal group than the modifiers. However, for some specific tasks in NLP, the head is not necessarily the most semantically important part of the noun phrase. In analyzing email messages from the perspective of finding salient NPs, we claim that the modifier(s) of the noun phrase, usually nominal modifier(s), often have as much semantic content as the head. This view is also supported by the work of Strzalkowski et al. (1999), where syntactic NPs are captured with the goal of extracting their semantic content but are processed as an &quot;ordered&quot; string of words rather than a syntactic unit. Thus we introduce, as a separate feature in the feature vector, a new TF*IDF measure that considers the NP as a sequence of equally weighted elements, counting the modifier(s) and the head individually.</Paragraph> <Paragraph position="2"> Consider the following list of simple NPs selected as candidates: 1. conference workshop announcement 2. 
international conference 3. workshop description 4. conference deadline. In the case of the first noun phrase, for example, its importance lies in the two noun modifiers, conference and workshop, as much as in the head announcement, because they occur as heads or modifiers in candidate NPs 2-4. Our new feature will be: TF*IDF(conference) + TF*IDF(workshop) + TF*IDF(announcement).</Paragraph> <Paragraph position="4"> Given these linguistic observations, we divided the set of features into three groups, as also described in Tzoukermann et al. (2001): 1) one associated with the head of the noun phrase; 2) one associated with the whole NP; and 3) one that represents the new TF*IDF measure discussed above.</Paragraph> <Paragraph position="5"> 2.2.1 Features associated with the Head We chose two features to characterize the head of the noun phrase. head_tfidf: the TF*IDF measure of the head of the candidate NP. For the NP in example (1) this feature will be TF*IDF(announcement).</Paragraph> <Paragraph position="7"> head_focc: the position of the first occurrence of the head in the text (the number of words that precede the first occurrence of the head, divided by the total number of words in the document).</Paragraph> <Paragraph position="8"> 2.2.2 Features associated with the whole NP We selected six features that we consider relevant in determining the relative importance of the noun phrase. np_tfidf: the TF*IDF measure of the whole NP. 
For the NP in example (1) this feature will be TF*IDF(conference workshop announcement).</Paragraph> <Paragraph position="10"> np_focc: the position of the first occurrence of the noun phrase in the document.</Paragraph> <Paragraph position="11"> np_length_words: noun phrase length measured in number of words, normalized by dividing by the total number of words in the candidate NP list.</Paragraph> <Paragraph position="12"> np_length_chars: noun phrase length measured in number of characters, normalized by dividing by the total number of characters in the candidate NP list.</Paragraph> <Paragraph position="13"> sent_pos: position of the noun phrase in the sentence, i.e. the number of words that precede the noun phrase, divided by the sentence length.</Paragraph> <Paragraph position="14"> For noun phrases in the subject line (which are usually short and would be affected by this measure), we use the maximum sentence length in the document as the normalization factor.</Paragraph> <Paragraph position="15"> par_pos: position of the noun phrase in the paragraph, computed as for sent_pos but at the paragraph level.</Paragraph> <Paragraph position="16"> 2.2.3 Feature that considers all constituents of the NP equally weighted One of the important hypotheses we tested in this work is that both the modifiers and the head of the NP contribute equally to its salience. Thus we include mh_tfidf as an additional feature in the feature vector.</Paragraph> <Paragraph position="17"> mh_tfidf: the new TF*IDF measure that also takes into account the importance of the modifiers. In our example the value of this feature will be: TF*IDF(conference) + TF*IDF(workshop) + TF*IDF(announcement).</Paragraph> <Paragraph position="19"> In computing the TF*IDF measures (head_tfidf, np_tfidf, mh_tfidf), specific weights, w</Paragraph> </Section> </Section> </Paper>
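<!-- A minimal sketch of how the first-occurrence and TF*IDF features described in Section 2.2 might be computed. It is kept inside an XML comment so that the document stays well-formed. The function names (find_all, first_occurrence, tf_idf, mh_tfidf) are hypothetical. The weighting (occurrence count times log(N / df)), the summation used for the equally weighted modifier-and-head score, the value 1.0 returned for an absent phrase, and the toy corpus are all assumptions: the paper's own TF*IDF definition and its specific weights are not available in this extract.

```python
import math
from typing import List, Sequence


def find_all(tokens: Sequence[str], phrase: Sequence[str]) -> List[int]:
    """Return every start index at which `phrase` occurs in `tokens`."""
    m = len(phrase)
    if m == 0:
        return []
    return [i for i in range(len(tokens) - m + 1)
            if list(tokens[i:i + m]) == list(phrase)]


def first_occurrence(tokens: Sequence[str], phrase: Sequence[str]) -> float:
    """Words preceding the first occurrence, divided by document length.

    Returns 1.0 if the phrase is absent (a convention assumed here).
    """
    hits = find_all(tokens, phrase)
    return hits[0] / len(tokens) if hits else 1.0


def tf_idf(phrase: Sequence[str], doc: Sequence[str],
           corpus: Sequence[Sequence[str]]) -> float:
    """Textbook TF*IDF: occurrence count times log(N / df) (assumed weighting)."""
    tf = len(find_all(doc, phrase))
    df = sum(1 for d in corpus if find_all(d, phrase))
    return tf * math.log(len(corpus) / df) if df else 0.0


def mh_tfidf(np_words: Sequence[str], doc: Sequence[str],
             corpus: Sequence[Sequence[str]]) -> float:
    """Score each modifier and the head individually, equally weighted, then sum."""
    return sum(tf_idf([w], doc, corpus) for w in np_words)


# Toy data (invented for illustration only).
doc = "the conference workshop announcement lists the conference deadline".split()
corpus = [doc, "workshop description".split(), "project meeting notes".split()]
np_words = ["conference", "workshop", "announcement"]

head = np_words[-1:]  # in a simple NP (premodifiers only) the head is the last word
features = {
    "head_tfidf": tf_idf(head, doc, corpus),
    "head_focc": first_occurrence(doc, head),
    "np_tfidf": tf_idf(np_words, doc, corpus),
    "np_focc": first_occurrence(doc, np_words),
    "mh_tfidf": mh_tfidf(np_words, doc, corpus),
}
```

Treating the whole NP as one unit for the whole-NP score and as separate words for the modifier-and-head score keeps the two measures distinct: a modifier that recurs across several candidate NPs raises the latter even when the full phrase occurs only once. -->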