<?xml version="1.0" standalone="yes"?> <Paper uid="H05-1056"> <Title>Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 443-450, Vancouver, October 2005. c(c)2005 Association for Computational Linguistics Extracting Personal Names from Email: Applying Named Entity Recognition to Informal Text</Title> <Section position="4" start_page="443" end_page="444" type="metho"> <SectionTitle> 3 Existing NER Methods </SectionTitle> <Paragraph position="0"> In our first set of experiments we apply CRF, a machine-learning based probabilistic approach to labeling sequences of examples, and evaluate it on the problem of extracting personal names from email.</Paragraph> <Paragraph position="1"> Learning reduces NER to the task of tagging (i.e., classifying) each word in a document. We use a set of five tags, corresponding to (1) a one-token entity, (2) the first token of a multi-token entity, (3) the last token of a multi-token entity, (4) any other token of a multi-token entity and (5) a token that is not part of an entity.</Paragraph> <Paragraph position="2"> The sets of features used are presented in Table 2. All features are instantiated for the focus word, as well as for a window of 3 tokens to the left and to the right of the focus word. The basic features include the lower-case value of a token t, and its capitalization pattern, constructed by replacing all capital letters with the letter &quot;X&quot;, all lower-case letters with &quot;x&quot;, all digits with &quot;9&quot; and compressing runs of the same letter with a single letter. The dictionary features define various categories of words including common words, first names, last names 3 and &quot;roster names&quot; 4 (international names list, where first and</Paragraph> </Section> <Section position="5" start_page="444" end_page="446" type="metho"> <SectionTitle> Basic Features </SectionTitle> <Paragraph position="0"> t, lexical value, lowercase (binary form, e.g. f(t=&quot;hello&quot;)=1) capitalization pattern of t (binary form, e.g. f(t.cap=x+)=1) Dictionary Features inCommon: t in common words dictionary inFirst: t in first names dictionary inLast: t in last names dictionary inRoster: t in roster names dictionary First: inFirst [?]!isLast [?]!inCommon Last: !inFirst [?] inLast [?]!inCommon Name: (First [?] Last [?] inRoster) [?]! inCommon Title: t in a personal prefixes/suffixes dictionary Org: t in organization suffixes dictionary Loc: t in location suffixes dictionary Email Features t appears in the header t appears in the &quot;from&quot; field t is a probable &quot;signoff&quot; ([?] after two line breaks and near end of message) t is part of an email address (regular expression) does the word starts a new sentence ([?] capitalized after a period, question or exclamation mark) t is a probable initial (X or X.) t followed by the bigram &quot;and I&quot; t capitalized and followed by a pronoun within 15 tokens dictionary and is not in the common-words or lastname dictionaries is designated a &quot;sure first name&quot;. The common-words dictionary used consists of base forms, conjugations and plural forms of common English words, and a relatively small ad-hoc dictionary representing words especially common in email (e.g., &quot;email&quot;, &quot;inbox&quot;). 
<Paragraph position="1"> We also use small, manually created word dictionaries of prefixes and suffixes indicative of persons (e.g., &quot;mr&quot;, &quot;jr&quot;), locations (e.g., &quot;ave&quot;), and organizations (e.g., &quot;inc&quot;). Email structure features: we perform a simplified document analysis of the email message and use it to construct some additional features. One is an indicator of whether a token t is equal to some token in the &quot;from&quot; field. Another indicates whether a token t in the email body is equal to some token appearing anywhere in the header. An indicator feature based on a regular expression marks tokens that are part of a probable &quot;sign-off&quot; (i.e., a name at the end of a message). Finally, since the annotation rules do not consider email addresses to be names, we added an indicator feature for tokens that are inside an email address.</Paragraph> <Paragraph position="2"> We experimented with features derived from POS tags and NP-chunking of the email, but found the POS assignments too noisy to be useful. We did include some features based on approximate linguistic rules. One rule looks for capitalized words that are not common words and are followed by a pronoun within a distance of up to 15 tokens (as an example, consider &quot;Contact Puck tomorrow. He should be around.&quot;). Another rule looks for words followed by the bigram &quot;and I&quot;. As is common for hand-coded NER rules, both rules have high precision and low recall.</Paragraph> <Section position="1" start_page="444" end_page="445" type="sub_section"> <SectionTitle> 3.1 Email vs Newswire </SectionTitle> <Paragraph position="0"> In order to explore some of the differences between email and newswire NER problems, we stripped all header fields from the Mgmt-Game messages and trained a model (using basic features only) on the resulting corpus of email bodies. Figure 1 shows the features most indicative of a token being part of a name in the models trained on the Mgmt-Game and MUC-6 corpora. To make the list easier to interpret, it includes only the features corresponding to tokens surrounding the focus word.</Paragraph> <Paragraph position="1"> As one might expect, the important features for the MUC-6 dataset are mainly formal name titles such as &quot;mr&quot;, &quot;mrs&quot;, and &quot;jr&quot;, as well as job titles and other prenominal modifiers such as &quot;president&quot; and &quot;judge&quot;. However, for the Mgmt-Game corpus, most of the important features are related to email-specific structure. For example, the features &quot;left.1.by&quot; and &quot;left.2.by&quot; are often associated with a quoted excerpt from another email message, which in the Mgmt-Game corpus is often marked by mailers with text like &quot;Excerpts from mail: 7-Sep-97 Re: paper deadline by Richard Wang&quot;. Similarly, features like &quot;left.1.thanks&quot; and &quot;right.1.ps&quot; indicate a &quot;signoff&quot; section of an email, as does &quot;right.2.home&quot; (which often indicates proximity to a home phone number appearing in a signature).</Paragraph>
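The email-structure indicators discussed above can be sketched as simple predicates over a message. The following is our own minimal illustration, with assumed regular expressions and helper names; it is not the authors' implementation.

```python
import re

EMAIL_ADDR_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
TOKEN_RE = re.compile(r"\w+")

def email_structure_features(header, from_field, body, token, char_offset):
    """Indicator features tied to the structure of one plain-text message.
    header, from_field and body are assumed to be pre-separated strings;
    char_offset is the position of the token within body."""
    header_tokens = set(TOKEN_RE.findall(header.lower()))
    from_tokens = set(TOKEN_RE.findall(from_field.lower()))
    t = token.lower()
    # Probable sign-off: the token sits after a blank line and near the message end.
    near_end = char_offset >= 0.9 * len(body)
    after_blank_line = "\n\n" in body[max(0, char_offset - 80):char_offset]
    in_address = any(char_offset >= m.start() and m.end() >= char_offset
                     for m in EMAIL_ADDR_RE.finditer(body))
    return {
        "inHeader": t in header_tokens,
        "inFromField": t in from_tokens,
        "inSignoff": near_end and after_blank_line,
        "inEmailAddress": in_address,
    }
```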
</Section> <Section position="2" start_page="445" end_page="446" type="sub_section"> <SectionTitle> 3.2 Experimental Results </SectionTitle> <Paragraph position="0"> We now turn to evaluating the usefulness of the feature sets described above. Table 3 gives entity-level F1 performance for CRF-trained models on all datasets, using the basic features alone (B); the basic and email-tailored features (B+E); the basic and dictionary features (B+D); and all of the feature sets combined (B+D+E). All feature sets were tuned using the Mgmt-Game validation subset. The reported results are for previously unseen test sets.</Paragraph> <Paragraph position="1"> (Table 3: entity-level F1 across all datasets, with CRF training.)</Paragraph> <Paragraph position="2"> The results show that the email-specific features are very informative. In addition, they show that the dictionary features are especially useful. This can be explained by the relatively weak contextual evidence in email. While dictionaries are useful for named entity extraction in general, they are even more essential when extracting names from email text, where many name mentions are part of headers, name lists, etc. Finally, the results for the combined feature set are superior in most cases to those for any subset of the features.</Paragraph> <Paragraph position="3"> Overall, the level of performance using all features is encouraging, considering the limited training set size. Performance on Mgmt-Teams is somewhat lower than on Mgmt-Game, mainly because (by design) there is less similarity between training and test sets under this split. Enron emails seem to be harder than Mgmt-Game emails, perhaps because they include fewer structured instances of names.</Paragraph> <Paragraph position="4"> Enron-Meetings emails also contain a number of constructs that were not encountered in the Mgmt-Game corpus, notably lists (e.g., of people attending a meeting), and also include many location and organization names, which are rare in Mgmt-Game. A larger set of dictionaries might improve performance on the Enron corpora.</Paragraph> <Paragraph position="5"> 4 Repetition of named entities in email
In the experiments described above, the extractors have high precision but relatively low recall. This typical behavior suggests that some sort of recall-enhancing procedure might improve overall performance.
One family of recall-enhancing techniques is based on looking for multiple occurrences of names in a document, so that names which occur in ambiguous contexts become more likely to be recognized. It is an intuitive assumption that the ways in which names repeat themselves in a corpus will differ between email and newswire text. In news stories, one would expect repetitions within a single document to be common, as a means for an author to establish a shared context with the reader. In an email corpus, one would expect names to repeat more frequently across the corpus, in multiple documents, at least when the email corpus is associated with a group that works together closely. In this section we support this conjecture with quantitative analysis.</Paragraph> <Paragraph position="6"> In a first experiment, we plotted the percentage of person-name tokens w that appear in at most K distinct documents, as a function of K. Figure 2 shows this function for the Mgmt-Game, MUC-6, Enron-Meetings, and Enron-Random datasets.</Paragraph>
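The statistic plotted in Figure 2 is simple to compute from annotated data. The following is our own minimal sketch (the function name and input format are assumptions), counting each distinct name token once.

```python
from collections import Counter

def fraction_in_at_most_k_docs(docs_name_tokens, k_values):
    """docs_name_tokens: one set of (lower-cased) person-name tokens per document.
    Returns, for each K in k_values, the fraction of distinct name tokens that
    occur in at most K distinct documents."""
    doc_freq = Counter()
    for tokens in docs_name_tokens:
        doc_freq.update(set(tokens))
    total = len(doc_freq)
    return {k: sum(1 for df in doc_freq.values() if k >= df) / total
            for k in k_values}

# Example: fraction_in_at_most_k_docs([{"richard", "wang"}, {"wang"}], [1, 2])
# returns {1: 0.5, 2: 1.0}: "richard" occurs in one document, "wang" in two.
```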
<Paragraph position="7"> There is a large separation between MUC-6 and Mgmt-Game, the most workgroup-oriented email corpus. In MUC-6, for instance, almost 80% of the names appear in only a single document, while in Mgmt-Game only 30% of the names appear in a single document. At the other extreme, in MUC-6 only 1.3% of the names appear in 10 or more documents, while in Mgmt-Game almost 20% do. The Enron-Random and Enron-Meetings datasets show distributions of names that are intermediate between Mgmt-Game and MUC-6.</Paragraph> <Paragraph position="8"> As a second experiment, we implemented two very simple extraction rules. The single document repetition (SDR) rule marks every token that occurs more than once inside a single document as a name. Adding the tokens marked by the SDR rule to the tokens marked by the learned extractor generates a new extractor, which we will denote SDR+CRF. Thus, the recall of SDR+CRF serves as an upper bound on the token recall of any recall-enhancing method that improves the extractor by exploiting repetition within a single document. (Footnote 6: token-level recall is recall on the task of classifying tokens as inside or outside an entity name.) Analogously, the multiple document repetition (MDR) rule marks every token that occurs in more than one document as a name. Again, the token recall of the MDR+CRF rule is an upper bound on the token recall of any recall-enhancing method that exploits token repetition across multiple documents.</Paragraph> <Paragraph position="9"> The left bars in Figure 3 show the recall obtained by the SDR rule (top) and the MDR rule (bottom). The MDR rule has the highest recall for the two Mgmt corpora and the lowest recall for the MUC-6 corpus. Conversely, for the SDR rule, the highest recall level obtained is for MUC-6. The middle bars show the token recall obtained by the CRF extractor, using all features. The right bars show the token recall of the SDR+CRF and MDR+CRF extractors. Comparing them to the other bars, we see that the maximal potential recall gain from an SDR-like method is on MUC-6. For MDR-like methods, there are large potential gains on the Mgmt corpora, as well as on Enron-Meetings and Enron-Random to a lesser degree. This probably reflects the fact that the Enron corpora come from a larger and more weakly interacting set of users, compared to the Mgmt datasets. These results demonstrate the importance of exploiting the repetition of names across multiple documents for entity extraction from email.</Paragraph>
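The two repetition rules, and their combination with the extractor's predictions, can be written down directly. The sketch below is our own illustration (function names and input formats are assumptions); it follows the SDR and MDR definitions given above.

```python
def sdr_tokens(doc_tokens):
    """Single-document repetition: mark every token that occurs more than once
    inside this document as a name."""
    seen, repeated = set(), set()
    for t in doc_tokens:
        (repeated if t in seen else seen).add(t)
    return repeated

def mdr_tokens(corpus_docs):
    """Multiple-document repetition: mark every token that occurs in more than
    one document of the corpus as a name."""
    doc_count = {}
    for doc in corpus_docs:
        for t in set(doc):
            doc_count[t] = doc_count.get(t, 0) + 1
    return {t for t, c in doc_count.items() if c > 1}

def combine_with_extractor(crf_tokens, rule_tokens):
    """SDR+CRF or MDR+CRF: the union of tokens marked by the rule and by the
    learned extractor; its token recall upper-bounds the recall of any
    recall-enhancing method based on that kind of repetition."""
    return crf_tokens | rule_tokens
```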
</Section> </Section> <Section position="6" start_page="446" end_page="449" type="metho"> <SectionTitle> 5 Improving Recall With Inferred Dictionaries </SectionTitle> <Paragraph position="0"> Sequential learners of the sort used here classify tokens from each document independently; moreover, the classification of a word w is independent of the classification of other occurrences of w elsewhere in the document. That is, the fact that a word w has appeared somewhere in a context that clearly indicates that it is a name does not increase the probability that it will be classified as a name in other, more ambiguous contexts.</Paragraph> <Paragraph position="1"> Recently, sequential learning methods have been extended to directly utilize information about name co-occurrence in learning the sequential classifier. This approach provides an elegant solution to modeling repetition within a single document. However, it requires identifying candidate related entities in advance by applying some heuristic: Bunescu and Mooney (2004) link similar NPs (requiring their heads to be identical), and Sutton and McCallum (2004) connect pairs of identical capitalized words.</Paragraph> <Paragraph position="2"> Given that capitalization conventions are not followed consistently in email, there is no adequate heuristic for linking candidate entities prior to extraction. Further, it is not clear whether a collective classification approach can scale to modeling multiple-document repetition.</Paragraph> <Paragraph position="3"> We suggest an alternative approach of recall-enhancing name matching, which is appropriate for email. Our approach has points of similarity to the methods described by Stevenson and Gaizauskas (2000), who suggest matching text against name dictionaries, filtering out names that are also common words or that appear as non-names in high proportion in the training data. The approach described here is more systematic and general. In a nutshell, we suggest applying the noisy dictionary of predicted names over the test corpus, and using the approximate (predicted) name vs. non-name proportions over the test set itself to filter out ambiguous names. Therefore, our approach does not require a large amount of annotated training data. It also does not require word distributions to be similar between training and test data. We will now describe our approach in detail.</Paragraph> <Section position="1" start_page="447" end_page="447" type="sub_section"> <SectionTitle> 5.1 Matching names from dictionary </SectionTitle> <Paragraph position="0"> First, we construct a dictionary comprised of all spans predicted as names by the learned model. For personal names, we suggest expanding this dictionary further, using a transformation scheme. Such a scheme would construct a family of possible variations of a name n: as an example, Figure 4 shows name variations created for the name span &quot;Benjamin Brown Smith&quot;. Once a dictionary is formed, a single pass is made through the corpus, and every longest match to some name variation is marked as a name (see footnote 7). It may be that a partial name span n1 identified by the extractor is subsumed by a full name span n2 identified by the dictionary-matching scheme. In this case, entity-level precision is increased, having corrected the entity's boundaries.</Paragraph> <Paragraph position="1"> Footnote 7: Initials-only variants of a name (e.g., &quot;bs&quot; in Figure 4) are marked as a name only if the &quot;inSignoff&quot; feature holds, i.e., if they appear near the end of a message in an apparent signature.
(Figure 4: name variations created for &quot;Benjamin Brown Smith&quot;, including &quot;benjamin brown smith&quot;, &quot;benjamin smith&quot;, &quot;benjamin b. smith&quot;, &quot;b. brown smith&quot;, &quot;b. smith&quot;, &quot;benjamin-brown-smith&quot;, &quot;benjamin&quot;, &quot;smith&quot;, &quot;b. b. s.&quot;, &quot;bbs&quot;, and &quot;bs&quot;, among others.)</Paragraph>
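One plausible way to generate a subset of the Figure 4 variants is sketched below. The section does not spell out the full transformation scheme, so this code is our own approximation; the function name and the particular variants produced are assumptions.

```python
def name_variants(*parts):
    """Generate simple variants of a personal name, e.g. for
    ("Benjamin", "Brown", "Smith"): full name, first+last, dotted initials,
    hyphenated forms, and an initials-only form. This approximates, but does
    not reproduce, the paper's transformation scheme."""
    parts = [p.lower() for p in parts]
    first, last = parts[0], parts[-1]
    initials = [p[0] for p in parts]
    variants = {
        " ".join(parts),                      # benjamin brown smith
        "-".join(parts),                      # benjamin-brown-smith
        f"{first} {last}",                    # benjamin smith
        f"{initials[0]}. {last}",             # b. smith
        ". ".join(initials) + ".",            # b. b. s.
        "".join(initials),                    # bbs
        first,
        last,
    }
    if len(parts) == 3:
        variants.add(f"{first} {parts[1]} {initials[2]}.")  # benjamin brown s.
        variants.add(f"{initials[0]}. {parts[1]} {last}")   # b. brown smith
    return variants

# Initials-only variants such as "bs" are matched only inside an apparent
# signature (the "inSignoff" feature), as noted in footnote 7.
```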
</Section> <Section position="3" start_page="447" end_page="448" type="sub_section"> <SectionTitle> 5.2 Dictionary-filtering schemes </SectionTitle> <Paragraph position="0"> The noisy dictionary-matching scheme is susceptible to false positives. That is, some words predicted by the extractor to be names are in fact non-names.</Paragraph> <Paragraph position="1"> Presumably, these non-names could be removed by simply eliminating low-confidence predictions of the extractor; however, ambiguous words (words that are not exclusively personal names in the corpus) may need to be identified and removed as well. We note that ambiguity is better evaluated in the context of the corpus. For example, &quot;Andrew&quot; is a common first name, and may be confidently (and correctly) recognized as one by the extractor. However, in the Mgmt-Game corpus, &quot;Andrew&quot; is also the name of an email server, and most of the occurrences of this name in the corpus are not personal names. The high frequency of the word &quot;Andrew&quot; in the corpus, coupled with the fact that it is only sometimes a name, means that adding this word to the dictionary leads to a substantial drop in precision.</Paragraph> <Paragraph position="2"> We therefore suggest a measure for filtering the dictionary. This measure combines two metrics. The first metric, predicted frequency (PF), estimates the degree to which a word appears to be used consistently as a name throughout the corpus: PF(w) = cpf(w) / ctf(w), where cpf(w) denotes the number of times that the word w is predicted as part of a name by the extractor, and ctf(w) is the number of occurrences of the word w in the entire test corpus (we emphasize that estimating this statistic on test data is valid, as it is a fully automatic, &quot;blind&quot; procedure).</Paragraph> <Paragraph position="3"> Predicted frequency does not assess the likely cost of adding a word to a dictionary: as noted above, ambiguous or false dictionary terms that occur frequently will degrade accuracy. A number of statistics could be used here; for instance, practitioners sometimes filter a large dictionary by simply discarding all words that occur more than k times in a test corpus. We elected to use the inverse document frequency (IDF) of w to measure word frequency:</Paragraph> <Paragraph position="4"> IDF(w) = log((N + 0.5) / df(w)) / log(N + 1)</Paragraph> <Paragraph position="5"> Here df(w) is the number of documents that contain the word w, and N is the total number of documents in the corpus. Inverse document frequency is often used in the field of information retrieval (Allan et al., 1998), and the formula above has the virtue of being scaled between 0 and 1 (like our PF metric) and of including some smoothing. In addition to bounding the cost of a dictionary entry, the IDF formula is in itself a sensible filter, since personal names will not appear as frequently as common English words.</Paragraph> <Paragraph position="6"> The joint filter, PF.IDF, combines these two multiplicatively, with equal weights. PF.IDF takes into consideration both the probability of a word being a name and how common the word is in the entire corpus. Words that get low PF.IDF scores are therefore either words that are highly ambiguous in the corpus (as reflected in the extractor's predictions) or common words that were inaccurately predicted as names by the extractor.</Paragraph> <Paragraph position="7"> In the MDR method of Figure 3, we imposed an artificial requirement that words must appear in more than one document. In the method described here there is no such requirement: indeed, words that appear in a small number of documents are given higher weights, due to the IDF factor. Thus this approach exploits both single-document and multiple-document repetition.</Paragraph> <Paragraph position="8"> In a set of experiments that are not described here, the PF.IDF measure was found to be robust to parameter settings, and preferable to its separate components for improving recall at minimal cost in precision. As described, the PF.IDF value of a word ranges between 0 and 1. One can vary the threshold below which a word is removed from the dictionary in order to control the precision-recall trade-off. We tuned the PF.IDF threshold using the validation subsets, optimizing entity-level F1 (a threshold of 0.16 was found optimal).</Paragraph> <Paragraph position="9"> In summary, our recall-enhancing strategy is as follows:
1. Learn an extractor E from the training corpus C_train.
2. Apply the extractor E to a test corpus C_test to assign a preliminary labeling.
3. Build a dictionary S_th* including the names n such that (a) n is extracted somewhere in the preliminary labeling of the test corpus, or is derived from an extracted name by applying the name transformation scheme, and (b) PF.IDF(n) > th*.
4. Apply the dictionary-matching scheme of Section 5.1, using the dictionary S_th* to augment the preliminary labeling, and output the result.</Paragraph>
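The filtering computation can be sketched compactly. The PF and IDF definitions below follow the formulas above; because the text describes the combination only as multiplicative with equal weights, a plain product is assumed here, and all function names are our own (an illustration, not the authors' code).

```python
import math

def pf(word, cpf, ctf):
    """Predicted frequency: the fraction of a word's corpus occurrences that the
    extractor labeled as part of a name, PF(w) = cpf(w) / ctf(w)."""
    return cpf[word] / ctf[word]

def idf(word, df, n_docs):
    """Smoothed inverse document frequency, scaled to lie between 0 and 1."""
    return math.log((n_docs + 0.5) / df[word]) / math.log(n_docs + 1)

def filter_dictionary(candidate_words, cpf, ctf, df, n_docs, threshold=0.16):
    """Keep a candidate only if its PF.IDF score clears the threshold
    (0.16 is the value reported as optimal on the validation sets).
    A plain product PF(w) * IDF(w) is assumed for the combination."""
    return {w for w in candidate_words
            if pf(w, cpf, ctf) * idf(w, df, n_docs) > threshold}
```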
</Section> <Section position="4" start_page="448" end_page="449" type="sub_section"> <SectionTitle> 5.3 Experiments with inferred dictionaries </SectionTitle> <Paragraph position="0"> Table 4 shows results using the method described above. We consider all of the email corpora and the CRF learner, trained with the full feature set. The results are given in terms of relative change compared to the baseline results generated by the extractors, i.e., (score_result / score_baseline) - 1, together with the final values. As expected, recall is always improved. Entity-level F1 increases as well, since recall is increased more than precision is decreased. The largest improvements are for the Mgmt corpora, the two email datasets shown in Figure 3 to have the largest potential improvement from MDR-like methods. Recall improvements are more modest for the Enron datasets, as anticipated by the MDR analysis.</Paragraph> <Paragraph position="1"> Another reason for the gap is that extractor baseline performance is lower for the Enron datasets, so the Enron dictionaries are noisier.</Paragraph> <Paragraph position="2"> As detailed in Section 2, the Mgmt-Teams dataset was constructed so that the names in the training and test sets have only minimal overlap. The performance improvement on this dataset shows that repetition of mostly-novel names can be detected using our method. This technique is highly effective when names are novel or dense, and works best when the extractor's baseline precision is relatively high.</Paragraph> <Paragraph position="3"> (Table 4: relative change in performance from applying name-matching to models trained with CRF and the full feature set; F1 baselines are given in Table 3.)</Paragraph> </Section> </Section> </Paper>