A Large-Scale Exploration of Effective Global Features for a Joint Entity Detection and Tracking Model

2 Learning as Search Optimization

When one attempts to apply current, standard machine learning algorithms to problems with combinatorial structured outputs, the resulting algorithm implicitly assumes that it is possible to find the best structure for a given input (and some model parameters). Furthermore, most models require much more, either in the form of feature expectations for conditional likelihood-based methods (Lafferty et al., 2001) or local marginal distributions for margin-based methods (Taskar et al., 2003). In many cases, including EDT and coreference, this is a false assumption. Often we are not able to find the best solution, but rather must employ an approximate search to find the best possible solution given time and space constraints. The Learning as Search Optimization (LaSO) framework exploits this difficulty as an opportunity and seeks to find model parameters that are good within the context of search.

[Figure 1: The generic LaSO learning algorithm. Search proceeds as usual over a queue of nodes until either no node on the queue is y-good, or a goal node is reached that is not y-good; at that point the weights are updated toward the y-good siblings of the current node, and search restarts from those siblings.]

More formally, following the LaSO framework, we assume that there is a set of input structures $\mathcal{X}$ and a set of output structures $\mathcal{Y}$ (in our case, elements $x \in \mathcal{X}$ will be documents and elements $y \in \mathcal{Y}$ will be documents marked up with mentions and their coreference sets). Additionally, we provide the structure of a search space $\mathcal{S}$ that results in elements of $\mathcal{Y}$ (we will discuss our choice for this component later, in Section 3). The LaSO framework relies on a monotonicity assumption: given a structure $y \in \mathcal{Y}$ and a node $n$ in the search space, we must be able to calculate whether it is possible for $n$ to eventually lead to $y$ (such nodes are called $y$-good).

LaSO parameterizes the search process with a weight vector $w \in \mathbb{R}^D$, where weights correspond to features of search-space nodes and inputs. Specifically, we write $\Phi : \mathcal{X} \times \mathcal{S} \to \mathbb{R}^D$ for the function that takes a pair of an input $x$ and a search-space node $n$ and produces a vector of features. LaSO takes a standard search algorithm and modifies it to incorporate learning in an online manner, yielding the algorithm shown in Figure 1. The key idea is to perform search as normal until a point at which it becomes impossible to reach the correct solution. When this happens, the weight vector $w$ is updated in a corrective fashion. The algorithm relies on a parameter update formula; the two suggested by Daumé III and Marcu (2005) are a standard Perceptron-style update and an approximate large margin update of the sort proposed by Gentile (2001).
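To make the search-based learning loop concrete, here is a minimal sketch of one LaSO training pass in Python. It is an illustration of the framework under stated assumptions, not the authors' implementation: all callables (`initial`, `expand`, `siblings`, `is_goal`, `is_good`, `phi`) are hypothetical APIs supplied by the caller, the queue is a simple beam, and the simpler Perceptron-style update is shown for brevity.

```python
import numpy as np

def laso_train_example(x, y, initial, expand, siblings, is_goal, is_good, phi,
                       w, beam=8):
    """One online LaSO pass over a training pair (x, y).

    Hypothetical callables: initial(x) -> start node; expand(x, n) -> successor
    nodes; siblings(x, n, y) -> the y-good siblings of n; is_good(n, y) -> can
    n still lead to y; phi(x, n) -> np.ndarray of features for (input, node).
    """
    nodes = [initial(x)]
    while nodes:
        node = nodes.pop(0)
        if (not any(is_good(n, y) for n in nodes + [node])
                or (is_goal(node) and not is_good(node, y))):
            # The truth y is no longer reachable: make a corrective update
            # toward the y-good siblings of the current node.
            sibs = siblings(x, node, y)
            good = np.mean([phi(x, n) for n in sibs], axis=0)
            bad = np.mean([phi(x, n) for n in nodes + [node]], axis=0)
            w = w + (good - bad)              # Perceptron-style update
            nodes = list(sibs)                # restart search from good siblings
        else:
            if is_goal(node):
                return w                      # reached a y-good goal node
            nodes = sorted(nodes + expand(x, node),
                           key=lambda n: -float(w @ phi(x, n)))[:beam]
    return w
```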
In this work we use only the large margin update, since in the original LaSO work it consistently outperformed the simpler Perceptron updates. The update has the form:

$$w \leftarrow \operatorname{proj}\!\left( w + C\, k^{-1/2}\, \Delta \right)$$

where $k$ is the update number, $C$ is a tunable parameter, $\operatorname{proj}$ projects a vector into the unit sphere, and $\Delta$ is the difference between the average features of the $y$-good siblings and the average features of all nodes currently on the queue:

$$\Delta = \frac{1}{|\text{sibs}|}\sum_{n \in \text{sibs}} \Phi(x,n) \;-\; \frac{1}{|\text{nodes}|}\sum_{n \in \text{nodes}} \Phi(x,n)$$

3 Joint EDT Model

The LaSO framework essentially requires us to specify two components: the search space (and corresponding operations) and the features. These two are inherently tied, since the features rely on the search space, but for the time being we will ignore the issue of the feature functions and focus on the search.

3.1 Search Space

We structure search in a left-to-right decoding framework: a hypothesis is a complete identification of the initial segment of a document. For instance, on a document with $N$ words, a hypothesis that ends at position $0 \le i \le N$ is essentially what you would get if you took the full structured output and chopped it off at word $i$. In the example given in the introduction, one hypothesis might correspond to the prefix "Bill Clinton gave a" with Bill Clinton correctly marked as a person mention (which would be a $y$-good hypothesis), or to the same prefix with incorrect mention boundaries or labels (which would not be a $y$-good hypothesis).

A hypothesis is expanded through the application of the search operations. In our case, the search procedure first chooses the number of words it is going to consume (for instance, to form the mention Bill Clinton, it would need to consume two words). Then it decides on an entity type and a mention type (or it opts to call this chunk not an entity (NAE), corresponding to the non-underlined words in the introduction's example). Finally, assuming it did not choose NAE, it decides which of the foregoing coreference chains this entity belongs to, or none (if it is the first mention of a new entity). All these decisions are made simultaneously, and the resulting hypothesis is then scored.
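As an illustration of this expansion step, here is a minimal sketch of how successor hypotheses could be enumerated: every combination of span length, entity/mention type (or NAE), and antecedent chain yields one new hypothesis. The types, the span-length cap, and the `Hypothesis` representation are all invented for the example, not the paper's actual data structures.

```python
from dataclasses import dataclass, field

ENTITY_TYPES = ["PER", "ORG", "GPE", "LOC", "FAC"]  # illustrative ACE-style types
MENTION_TYPES = ["NAM", "NOM", "PRO"]
MAX_SPAN = 5                                        # assumed span-length cap

@dataclass
class Hypothesis:
    position: int                                   # words consumed so far
    chunks: list = field(default_factory=list)      # (start, end, etype, mtype, chain)
    num_chains: int = 0

def expand(words, hyp):
    """Enumerate all one-step extensions of a partial hypothesis."""
    succs, i = [], hyp.position
    for length in range(1, min(MAX_SPAN, len(words) - i) + 1):
        # Option 1: the next `length` words are not an entity (NAE).
        succs.append(Hypothesis(i + length,
                                hyp.chunks + [(i, i + length, None, None, None)],
                                hyp.num_chains))
        # Option 2: they form a mention; choose types and an antecedent chain
        # (chain id == num_chains means "start a new entity").
        for etype in ENTITY_TYPES:
            for mtype in MENTION_TYPES:
                for chain in range(hyp.num_chains + 1):
                    n = hyp.num_chains + (1 if chain == hyp.num_chains else 0)
                    succs.append(Hypothesis(
                        i + length,
                        hyp.chunks + [(i, i + length, etype, mtype, chain)],
                        n))
    return succs
```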
3.2 An Example

For concreteness, consider again the text given in the introduction. Suppose that we are at the word "them" and the hypothesis we are expanding is correct. That is, we have correctly identified Bill Clinton with entity type person and mention type name; we have identified the Senate with entity type organization and mention type name; and we have identified both The President and his as entities with entity type person and mention types nominal and pronoun, respectively, with The President pointing back to the chain {Bill Clinton} and his pointing back to the chain {Bill Clinton, The President}.

At this point of search, we have two choices for length: one or two (because there are only two words left: "them" and a period). A first hypothesis would be that the word "them" is NAE. A second hypothesis would be that "them" is a named person and a new entity; a third would be that "them" is a named person coreferent with the Bill Clinton chain; a fourth would be that "them" is a pronominal organization and a new entity; next, "them" could be a pronominal organization coreferent with the Senate; and so on. Similar choices would be considered for the string "them ." when two words are selected.

3.3 Linkage Type

One significant issue that arises in the context of assigning a hypothesis to a coreference chain is how to compute features over that chain. As we will discuss in Section 4, the majority of our coreference-specific features are over pairs of chunks: the proposed new mention and an antecedent. However, since in general a proposed mention can have more than one antecedent, we are left with a decision about how to combine this information.

The first, most obvious solution is to do nothing: simply compute the features over all pairs and add them up as usual. This method, however, intuitively has the potential for over-counting the effects of large chains. To compensate, one might advocate an average link computation, where the score for a coreference chain is computed by averaging over its elements. One might also consider a max link or min link scenario, where one of the extrema is chosen as the value. Other research has suggested that a simple last link, where a mention is matched only against the most recent mention in a chain, might be appropriate, while first link might also be appropriate because the first mention of an entity tends to carry the most information. In addition to these standard linkages, we also consider an intelligent link scenario, where the method of computing the link structure depends on the mention type. The intelligent link is computed as follows, based on the mention type of the current mention, $m$:

If $m =$ NAM then: match first on NAM elements in the chain; if there are none, match against the last NOM element; otherwise, use max link.

If $m =$ NOM then: match against the max NOM in the chain; otherwise, match against the last NAM; otherwise, use max link.

If $m =$ PRO then: use average link across all PRO or NAM; if there are none, use max link.

The construction of this methodology was guided by intuition (for instance, matching names against names is easy, and the first name tends to be the most complete) and subsequently tuned by experimentation on the development data. One might consider learning the best link method, and this may result in better performance, but we do not explore this option in this work. The initial results we present are based on the intelligent link, but we will also compare the different linkage types explicitly.
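To make the intelligent-link rule concrete, here is a minimal sketch under stated assumptions: `Mention` objects with a `mention_type` field and a generic pairwise `score` function are hypothetical, and "match first on NAM" is read as matching the first name in the chain (names tend to be most complete); the model's actual feature combination is richer than a single scalar.

```python
def intelligent_link(mention, chain, score):
    """Combine antecedent scores for `mention` against `chain` by mention type.

    mention.mention_type is one of "NAM", "NOM", "PRO" (hypothetical field);
    chain is a list of earlier Mention objects; score(a, b) -> float.
    """
    def of_type(t):
        return [m for m in chain if m.mention_type == t]

    mt = mention.mention_type
    if mt == "NAM":
        names = of_type("NAM")
        if names:                     # match first on NAM elements
            return score(mention, names[0])
        noms = of_type("NOM")
        if noms:                      # else the last NOM element
            return score(mention, noms[-1])
    elif mt == "NOM":
        noms = of_type("NOM")
        if noms:                      # max link over NOM elements
            return max(score(mention, m) for m in noms)
        names = of_type("NAM")
        if names:                     # else the last NAM
            return score(mention, names[-1])
    elif mt == "PRO":
        pros_nams = of_type("PRO") + of_type("NAM")
        if pros_nams:                 # average link over PRO or NAM
            return sum(score(mention, m) for m in pros_nams) / len(pros_nams)
    # fall back to max link over the whole chain
    return max(score(mention, m) for m in chain)
```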
4 Feature Functions

All the features we consider are of the form base-feature $\times$ decision-feature, where base features are functions of the input and decision features are functions of the hypothesis. For instance, a base feature might be "the current chunk contains the word Clinton" and a decision feature might be "the current chunk is a named person."

4.1 Base Features

For pedagogical purposes and to facilitate model comparisons, we have separated the base features into eleven classes: lexical, syntactic, pattern-based, count-based, semantic, knowledge-based, class-based, list-based, inference-based, string match and history-based features. We will deal with each of these in turn. These base features are combined into the meta-features that are actually used for prediction (a sketch of this cross-product construction follows).
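As a minimal illustration of the base $\times$ decision cross-product, each meta-feature is simply the conjunction of one base feature and one decision feature. The feature names below are invented; the real feature sets are those described in the remainder of this section.

```python
from itertools import product

def meta_features(base_feats, decision_feats):
    """Cross base features (functions of the input) with decision features
    (functions of the hypothesis) into conjunction meta-features."""
    return {f"{b}&{d}": bv * dv
            for (b, bv), (d, dv) in product(base_feats.items(),
                                            decision_feats.items())}

# Hypothetical example:
base = {"chunk-contains=Clinton": 1.0, "chunk-len=2": 1.0}
decision = {"entity-type=PER": 1.0, "mention-type=NAM": 1.0}
print(meta_features(base, decision))
# {'chunk-contains=Clinton&entity-type=PER': 1.0, ...}
```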
Lexical features. The class of lexical features contains simply computable features of single words. These include: the number of words in the current chunk; the unigrams (words) contained in this chunk; the bigrams; the two-character prefixes and suffixes; the word stem; the case of the word, computed by regular expressions like those given by Bikel et al. (1999); simple morphological features (number, person and tense, when applicable); and, in the case of coreference, pairs of features between the current mention and an antecedent.

Syntactic features. The syntactic features are based on running an in-house state-of-the-art part-of-speech tagger and syntactic chunker on the data. These features include unigrams and bigrams of part of speech as well as unigram chunk features. We have not used any parsing for this task.

Pattern-based features. We have included a range of features based on lexical and part-of-speech patterns surrounding the current word. These include: eight hand-written patterns for identifying pleonastic "it" and "that" (as in "It is raining" or "It seems to be the case that ..."); identification of pluralization features on the previous and next head nouns (intended to help make decisions about entity types); the previous and next content verbs (also intended to help with entity type identification); the possessor or possessee in the case of simple possessive constructions ("The president's speech" would yield a feature of "president" on the word "speech", and vice versa; this is intended to be a sort of weak sub-categorization principle); a similar feature applied to the previous and next content verbs (again to provide a weak sort of sub-categorization); and, for coreference, a list of part-of-speech and word-sequence patterns that match up to four words between nearby mentions and are either highly indicative of coreference (e.g., "of", "said", "am", "a") or highly indicative of non-coreference (e.g., "'s", "and", "in the", "and the"). This last set was generated by looking at intervening strings and finding the top twenty that had maximal mutual information with the class (coreferent or not coreferent) across the training data.

Count-based features. The count-based features apply only to the coreference task and attempt to capture regularities in the size and distribution of coreference chains. These include: the total number of entities detected thus far; the total number of mentions; the entity-to-mention ratio; the entity-to-word ratio; the mention-to-word ratio; the size of the hypothesized entity chain; the ratio of the number of mentions in the current entity chain to the total number of mentions; the number of intervening mentions between the current mention and the last one in our chain; the number of intervening mentions of the same type; the number of intervening sentence breaks; the Hobbs distance computed over syntactic chunks; and the decayed density of the hypothesized entity chain $E$, computed as

$$\frac{\sum_{m \in E} \delta^{d(m)}}{\sum_{m} \delta^{d(m)}}$$

where both sums range over all previous mentions (the numerator constrained to mentions in the same coreference chain as our mention), $d(m)$ is the number of entities away mention $m$ is, and $\delta \in (0,1)$ is a decay rate. This feature captures the fact that some entities are referred to consistently across a document, while others are mentioned only for short segments; it is relatively rare for an entity to be mentioned once at the beginning and then ignored until the end.
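A minimal sketch of the decayed-density computation follows; the mention representation (pairs of chain id and entity distance) and the default decay rate are assumptions made for the example.

```python
def decayed_density(mentions, chain_id, delta=0.5):
    """Decayed density of a hypothesized entity chain.

    mentions: list of (chain_id, entity_distance) for all previous mentions,
              where entity_distance is d(m), the number of entities away m is.
    delta:    decay rate in (0, 1); 0.5 here is an assumed value.
    """
    num = sum(delta ** d for c, d in mentions if c == chain_id)
    den = sum(delta ** d for _, d in mentions)
    return num / den if den else 0.0
```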
Semantic features. The semantic features are drawn from WordNet (Fellbaum, 1998). They include: the two most common synsets from WordNet for all the words in a chunk; all hypernyms of those synsets; and, for coreference, the distance in the WordNet graph between pairs of head words (defined to be the final word in the mention name) and whether one is a part of the other. Finally, we include the synset and hypernym information of the preceding and following verbs, again to model a sort of sub-categorization principle.

Knowledge-based features. Based on the hypothesis that many name-to-nominal coreference chains are best understood in terms of background knowledge (for instance, that George W. Bush is the President), we have attempted to take advantage of recent techniques from large-scale data mining to extract lists of such pairs. In particular, we use the name/instance lists described by Fleischman et al. (2003) and available on Fleischman's web page to generate features between names and nominals (this list contains approximately 2 million pairs mined from 15 GB of news data). Since this data set tends to focus mostly on person instances from news, we have additionally used similar data mined from a much larger web corpus, for which more general ISA relations were mined (Ravichandran et al., 2005).

Class-based features. The class-based features we employ are designed to get around the problem of data sparsity while simultaneously providing new information about word usage. The first class-based feature we use is based on word classes derived from the web corpus mentioned earlier, computed as described by Ravichandran et al. (2005). The second attempts to instill knowledge of collocations in the data: we use the technique described by Dunning (1993) to compute multi-word expressions and then mark words that are commonly used as such with a feature that expresses this fact.

List-based features. We have gathered a collection of about 40 lists of common places, organizations, names, etc. These include the standard lists of names gathered from census data and baby-name books, as well as standard gazetteer information listing countries, cities, islands, ports, provinces and states. We supplement these standard lists with lists of airport locations (gathered from the FAA) and company names (mined from the NASDAQ and NYSE web pages). We additionally include lists of semantically plural but syntactically singular words (e.g., "group"), which were mined from a large corpus by looking for patterns such as "members of the ...". Finally, we use a list of persons, organizations and locations that were identified at least 100 times in a large corpus by the BBN IdentiFinder named entity tagger (Bikel et al., 1999).

These lists are used in three ways. First, we use simple list membership as a feature to improve detection performance. Second, for coreference, we look for word pairs that appear on the same list but are not identical (for instance, Russia and England appearing on the country list but not being identical hints that they are different entities). Third, we look for pairs where one element is the head word of a mention and the other element is the name of a list; this is intended to capture the notion that a word appearing on our country list is often coreferent with the word "country".

Inference-based features. The inference-based features are computed by attempting to infer an underlying semantic property of a given mention. In particular, we attempt to identify gender and semantic number (e.g., "group" is semantically plural although it is syntactically singular). To do so, we created a corpus of example mentions labeled with number and gender, respectively. This data set was automatically extracted from our EDT data set by looking for words that corefer with pronouns for which we know the number or gender. For instance, a mention that corefers with "she" is known to be singular and female, while a mention that corefers with "they" is known to be plural. In about 5% of the cases this was ambiguous; these cases were thrown out. We then used essentially the same features as described above to build a maximum entropy model for predicting number and gender. The predictions of this model are used as features both for detection and for coreference (in the latter case, we check for matches). Additionally, we use several pre-existing classifiers as features. These are simple maximum entropy Markov models trained on the MUC6 data, the MUC7 data and our ACE data.

String match features. We use the standard string match features that are described in virtually every coreference paper. These are: string match; substring match; string overlap; pronoun match; and normalized edit distance. In addition, we use a string nationality match, which matches, for instance, Israel and Israeli, Russia and Russian, England and English, but not Netherlands and Dutch. This is done by checking for common suffixes on nationalities and requiring an exact match on the first half of the words. We additionally use a linguistically motivated string edit distance, where the replacement costs are lower for vowels and other easily confusable characters. We also use the Jaro distance as an additional string distance metric. Finally, we attempt to match acronyms by looking at the initial letters of the words in long chunks (see the sketch below).
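The following sketch illustrates two of these matchers. The suffix list and the half-word threshold are assumptions made for the example, not the authors' exact rules.

```python
def acronym_match(short, long_chunk):
    """Does `short` match the initial letters of the words in `long_chunk`?
    e.g. acronym_match("ACL", "Association for Computational Linguistics")."""
    initials = "".join(w[0] for w in long_chunk.split() if w[0].isupper())
    return short.replace(".", "").upper() == initials

NATIONALITY_SUFFIXES = ("an", "ian", "ish", "i", "ese")  # assumed suffix set

def nationality_match(country, nationality):
    """Suffix-stripping nationality match: Israel/Israeli, Russia/Russian,
    England/English -- but not irregular pairs like Netherlands/Dutch."""
    for suf in NATIONALITY_SUFFIXES:
        if nationality.lower().endswith(suf):
            stem = nationality[: len(nationality) - len(suf)]
            # require agreement on the first half of the country name
            half = max(1, len(country) // 2)
            if country.lower()[:half] == stem.lower()[:half]:
                return True
    return False
```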
History-based features. Finally, for the detection phase of the task, we include features having to do with long-range dependencies between words. For instance, if at the beginning of the document we tagged the word Arafat as a person's name (perhaps because it followed "Mr." or "Palestinian leader"), and later in the document we again see the word Arafat, we should be more likely to call it a person's name again. Such features have previously been explored in the context of information extraction from meeting announcements, using conditional random fields augmented with long-range links (Sutton and McCallum, 2004); the LaSO framework makes no Markov assumption, however, so no extra effort is required to include such features.

4.2 Decision Features

Our decision features are divided into three classes: simple, coreference and boundary features.

Simple. The simple decision features include: whether this chunk is tagged as an entity; its entity type; its entity subtype; its mention type; and its entity type/mention type pair.

Coreference. The coreference decision features include: whether this entity starts a new chain or continues an existing chain; the entity type of the started (or continued) chain; the entity subtype of the started (or continued) chain; the mention type of the started chain; and the mention type of the continued chain paired with the mention type of its most recent antecedent.

Boundary. The boundary decision features include: second- and third-order Markov features over entity type, entity subtype and mention type; features appearing at the previous (and next) words within a window of three; and the words that appear at the previous and next mention boundaries, specified also by entity type, entity subtype and mention type.