<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1060">
  <Title>Factorizing Complex Models: A Case Study in Mention Detection</Title>
  <Section position="3" start_page="473" end_page="474" type="metho">
    <SectionTitle>
2 Multi-Task Classification
</SectionTitle>
    <Paragraph position="0"> Many tasks in Natural Language Processing involve labeling a word or sequence of words with a specific property; classic examples are part-of-speech tagging, text chunking, word sense disambiguation and sentiment classification. Most of the time, the word labels are atomic labels, containing a very specific piece of information (e.g. the word 3While not wishing to delve too deep into the issue of label bias, we would also like to point out (as it was done, for instance, in (Klein, 2003)) that the label bias of MEMM classifiers can be significantly reduced by allowing them to examine the right context of the classification point - as we have done with our model.</Paragraph>
    <Paragraph position="1"> is noun plural, or starts a noun phrase, etc). There are cases, though, where the labels consist of several related, but not entirely correlated, properties; examples include mention detection--the task we are interested in--, syntactic parsing with functional tag assignment (besides identifying the syntactic parse, also label the constituent nodes with their functional category, as defined in the Penn Treebank (Marcus et al., 1993)), and, to a lesser extent, part-of-speech tagging in highly inflected languages.4 The particular type of mention detection that we are examining in this paper follows the ACE general definition: each mention in the text (a reference to a real-world entity) is assigned three types of information:5 * An entity type, describing the type of the entity it points to (e.g. person, location, organization, etc) * An entity subtype, further detailing the type (e.g. organizations can be commercial, governmental and non-profit, while locations can be a nation, population center, or an international region) * A mention type, specifying the way the entity is realized - a mention can be named (e.g. John Smith), nominal (e.g. professor), or pronominal (e.g. she).</Paragraph>
    <Paragraph position="2"> Such a problem - where the classification consists of several subtasks or attributes - presents additional challenges, when compared to a standard sequence classification task. Specifically, there are inter-dependencies between the subtasks that need to be modeled explicitly; predicting the tags independently of each other will likely result in inconsistent classifications. For instance, in our running example of mention detection, the subtype task is dependent on the entity type; one could not have a person with the subtype non-profit. On the other hand, the mention type is relatively independent of the entity type and/or subtype: each entity type could be realized under any mention type and viceversa. null The multi-task classification problem has been subject to investigation in the past. Caruana et al. (1997) analyzed the multi-task learning 4The goal there is to also identify word properties such as gender, number, and case (for nouns), mood and tense (for verbs), etc, besides the main POS tag. The task is slightly different, though, as these properties tend to have a stronger dependency on the lexical form of the classified word.</Paragraph>
    <Paragraph position="3"> 5There is a fourth assigned type - a flag specifying whether a mention is specific (i.e. it refers at a clear entity), generic (refers to a generic type, e.g. &amp;quot;the scientists believe ..&amp;quot;), unspecified (cannot be determined from the text), or negative (e.g. &amp;quot;no person would do this&amp;quot;). The classification of this type is beyond the goal of this paper.</Paragraph>
    <Paragraph position="4">  (MTL) paradigm, where individual related tasks are trained together by sharing a common representation of knowledge, and demonstrated that this strategy yields better results than one-task-ata-time learning strategy. The authors used a back-propagation neural network, and the paradigm was tested on several machine learning tasks. It also contains an excellent discussion on how and why the MTL paradigm is superior to single-task learning. Florian and Ngai (2001) used the same multi-task learning strategy with a transformation-based learner to show that usually disjointly handled tasks perform slightly better under a joint model; the experiments there were run on POS tagging and text chunking, Chinese word segmentation and POS tagging. Sutton et al. (2004) investigated the multitask classification problem and used a dynamic conditional random fields method, a generalization of linear-chain conditional random fields, which can be viewed as a probabilistic generalization of cascaded, weighted finite-state transducers. The subtasks were represented in a single graphical model that explicitly modeled the sub-task dependence and the uncertainty between them. The system, evaluated on POS tagging and base-noun phrase segmentation, improved on the sequential learning strategy.</Paragraph>
    <Paragraph position="5"> In a similar spirit to the approach presented in this article, Florian (2002) considers the task of named entity recognition as a two-step process: the first is the identification of mention boundaries and the second is the classification of the identified chunks, therefore considering a label for each word being formed from two sub-labels: one that specifies the position of the current word relative in a mention (outside any mentions, starts a mention, is inside a mention) and a label specifying the mention type . Experiments on the CoNLL'02 data show that the two-process model yields considerably higher performance.</Paragraph>
    <Paragraph position="6"> Hacioglu et al. (2005) explore the same task, investigating the performance of the AIO and the cascade model, and find that the two models have similar performance, with the AIO model having a slight advantage. We expand their study by adding the hybrid joint model to the mix, and further investigate different scenarios, showing that the cascade model leads to superior performance most of the time, with a few ties, and show that the cascade model is especially beneficial in cases where partially-labeled data (only some of the component labels are given) is available. It turns out though, (Hacioglu, 2005) that the cascade model in (Hacioglu et al., 2005) did not change to a &amp;quot;mention view&amp;quot; sequence classification6 (as we did in Section 3.3) in the tasks following the entity detection, to allow the system to use longer range features.</Paragraph>
    <Paragraph position="7"> 6As opposed to a &amp;quot;word view&amp;quot;.</Paragraph>
  </Section>
  <Section position="4" start_page="474" end_page="477" type="metho">
    <SectionTitle>
3 Classification Models
</SectionTitle>
    <Paragraph position="0"> This section presents the three multi-task classification models, which we will experimentally contrast in Section 4. We are interested in performing sequence classification (e.g. assigning a label to each word in a sentence, otherwise known as tagging). LetX denote the space of sequence elements (words) and Y denote the space of classifications (labels), both of them being finite spaces. Our goal is to build a classifier</Paragraph>
    <Paragraph position="2"> which has the property that |h(-x) |= |-x|,[?]-x [?]X+ (i.e. the size of the input sequence is preserved).</Paragraph>
    <Paragraph position="3"> This classifier will select the a posteriori most likely label sequence -y = argmax-yprime pparenleftbig-yprime|-xparenrightbig; in our case p(-y|-x) is computed through the standard Markov assumption:</Paragraph>
    <Paragraph position="5"> where yi,j denotes the sequence of labels yi..yj.</Paragraph>
    <Paragraph position="6"> Furthermore, we will assume that each label y is composed of a number of sub-labels y =parenleftbig y1y2 ...ykparenrightbig7; in other words, we will assume the factorization of the label space into k subspaces</Paragraph>
    <Paragraph position="8"> The classifier we used in the experimental section is a maximum entropy classifier (similar to (McCallum et al., 2000))--which can integrate several sources of information in a rigorous manner.</Paragraph>
    <Paragraph position="9"> It is our empirical observation that, from a performance point of view, being able to use a diverse and abundant feature set is more important than classifier choice, and the maximum entropy framework provides such a utility.</Paragraph>
    <Section position="1" start_page="474" end_page="475" type="sub_section">
      <SectionTitle>
3.1 The All-In-One Model
</SectionTitle>
      <Paragraph position="0"> As the simplest model among those presented here, the all-in-one model ignores the natural factorization of the output space and considers all labels as atomic, and then performs regular sequence classification. One way to look at this process is the following: the classification space Y = Y1 xY2 x ...xYk is first mapped onto a same-dimensional space Z through a one-to-one mapping o : Y -Z; then the features of the system are defined on the space X+ xZ, instead of X+ xY.</Paragraph>
      <Paragraph position="1"> While having the advantage of being simple, it suffers from some theoretical disadvantages: * The classification space can be very large, being the product of the dimensions of sub-task spaces. In the case of the 2004 ACE data there are 7 entity types, 4 mention types and many subtypes; the observed number of actual  the all-in-one and joint models sub-label combinations on the training data is 401. Since the dynamic programing (Viterbi) search's runtime dependency on the classification space is O(|Z|n) (n is the Markov dependency size), using larger spaces will negatively impact the decoding run time.8 * The probabilities p(zi|-x,zi[?]n,i[?]1) require large data sets to be computed properly. If the training data is limited, the probabilities might be poorly estimated.</Paragraph>
      <Paragraph position="2"> * The model is not friendly to partial evaluation or weighted sub-task evaluation: different, but partially similar, labels will compete against each other (because the system will return a probability distribution over the classification space), sometimes resulting in wrong partial classification.9 * The model cannot directly use data that is only partially labeled (i.e. not all sub-labels are specified).</Paragraph>
      <Paragraph position="3"> Despite the above disadvantages, this model has performed well in practice: Hajic and Hladk'a (1998) applied it successfully to find POS sequences for Czech and Florian et al. (2004) reports good results on the 2003 ACE task. Most systems that participated in the CoNLL 2002 and 2003 shared tasks on named entity recognition (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003) applied this model, as they modeled the identification of mention boundaries and the assignment of mention type at the same time.</Paragraph>
    </Section>
    <Section position="2" start_page="475" end_page="475" type="sub_section">
      <SectionTitle>
3.2 The Joint Model
</SectionTitle>
      <Paragraph position="0"> The joint model differs from the all-in-one model in the fact that the labels are no longer atomic: the features of the system can inspect the constituent sub-labels. This change helps alleviate the data 8From a practical point of view, it might not be very important, as the search is pruned in most cases to only a few hypotheses (beam-search); in our case, pruning the beam only introduced an insignificant model search error (0.1 F-measure).</Paragraph>
      <Paragraph position="1"> 9To exemplify, consider that the system outputs the following classifications and probabilities: O (0.2), B- null sparsity encountered by the previous model by allowing sub-label modeling. The joint model theoretically compares favorably with the all-in-one model:  require less training data to be properly estimated, as different sub-labels can be modeled separately.</Paragraph>
      <Paragraph position="2"> * The joint model can use features that predict just one or a subset of the sub-labels. Table 1 presents the set of basic features that predict the start of a mention for the CoNLL shared tasks for the two models. While the joint model can encode the start of a mention in one feature, the all-in-one model needs to use four features, resulting in fewer counts per feature and, therefore, yielding less reliably estimated features (or, conversely, it needs more data for the same estimation confidence).</Paragraph>
      <Paragraph position="3"> * The model can predict some of the sub-tags ahead of the others (i.e. create a dependency structure on the sub-labels). The model used in the experimental section predicts the sub-labels by using only sub-labels for the previous words, though.</Paragraph>
      <Paragraph position="4"> * It is possible, though computationally expensive, for the model to use additional data that is only partially labeled, with the model change presented later in Section 3.4.</Paragraph>
    </Section>
    <Section position="3" start_page="475" end_page="476" type="sub_section">
      <SectionTitle>
3.3 The Cascade Model
</SectionTitle>
      <Paragraph position="0"> For some tasks, there might already exist a natural hierarchy among the sub-labels: some sub-labels could benefit from knowing the value of other, primitive, sub-labels. For example, * For mention detection, identifying the mention boundaries can be considered as a primitive task. Then, knowing the mention boundaries, one can assign an entity type, subtype, and mention type to each mention.</Paragraph>
      <Paragraph position="1"> * In the case of parsing with functional tags, one can perform syntactic parsing, then assign the functional tags to the internal constituents.</Paragraph>
      <Paragraph position="2">  erties, making use of the fact that one knows the main tag.</Paragraph>
      <Paragraph position="3"> The cascade model is essentially a factorization of individual classifiers for the sub-tasks; in this framework, we will assume that there is a more or less natural dependency structure among subtasks, and that models for each of the subtasks will be built and applied in the order defined by the dependency structure. For example, as shown in Figure 1, one can detect mention boundaries and entity type (at the same time), then detect mention type and subtype in &amp;quot;parallel&amp;quot; (i.e. no dependency exists between these last 2 sub-tags).</Paragraph>
      <Paragraph position="4"> A very important advantage of the cascade model is apparent in classification cases where identifying chunks is involved (as is the case with mention detection), similar to advantages that rescoring hypotheses models have: in the second stage, the chunk classification stage, it can switch to a mention view, where the classification units are entire mentions and words outside of mentions. This allows the system to make use of aggregate features over the mention words (e.g. all the words are capitalized), and to also effectively use a larger Markov window (instead of 2-3 words, it will use 2-3 chunks/words around the word of interest). Figure 2 contains an example of such a case: the cascade model will have to predict the type of the entire phrase Donna Karan International, in the context 'Since &lt;chunk&gt; went public in ..', which will give it a better opportunity to classify it as an organization. In contrast, because the joint model and AIO have a word view of the sentence, will lack the benefit of examining the larger region, and will not have access at features that involve partial future classifications (such as the fact that another mention of a particular type follows).</Paragraph>
      <Paragraph position="5"> Compared with the other two models, this classification method has the following advantages:  * The classification spaces for each subtask are considerably smaller; this fact enables the creation of better estimated models * The problem of partially-agreeing competing labels is completely eliminated * One can easily use different/additional data to train any of the sub-task models.</Paragraph>
    </Section>
    <Section position="4" start_page="476" end_page="477" type="sub_section">
      <SectionTitle>
3.4 Adding Partially Labeled Data
</SectionTitle>
      <Paragraph position="0"> Annotated data can be sometimes expensive to come by, especially if the label set is complex. But not all sub-tasks were created equal: some of them might be easier to predict than others and, therefore, require less data to train effectively in a cascade setup. Additionally, in realistic situations, some sub-tasks might be considered to have more informational content than others, and have precedence in evaluation. In such a scenario, one might decide to invest resources in annotating additional data only for the particularly interesting sub-task, which could reduce this effort significantly.</Paragraph>
      <Paragraph position="1"> To test this hypothesis, we annotated additional data with the entity type only. The cascade model can incorporate this data easily: it just adds it to the training data for the entity type classifier model. While it is not immediately apparent how to incorporate this new data into the all-in-one and joint models, in order to maintain fairness in comparing the models, we modified the procedures to allow for the inclusion. Let T denote the original training data, and T prime denote the additional training data.</Paragraph>
      <Paragraph position="2"> For the all-in-one model, the additional training data cannot be incorporated directly; this is an inherent deficiency of the AIO model. To facilitate a fair comparison, we will incorporate it in an indirect way: we train a classifier C on the additional training data T prime, which we then use to classify the original training data T. Then we train the all-in-one classifier on the original training data T, adding the features defined on the output of applying the classifier C on T.</Paragraph>
      <Paragraph position="3"> The situation is better for the joint model: the new training data T prime can be incorporated directly into the training data T.10 The maximum entropy model estimates the model parameters by maximizing the data log-likelihood</Paragraph>
      <Paragraph position="5"> where ^p(x,y) is the observed probability distribution of the pair (x,y) and ql (y|x) =</Paragraph>
      <Paragraph position="7"> In the case where some of the data is partially annotated, the log-likelihood becomes</Paragraph>
      <Paragraph position="9"> 10The solution we present here is particular for MEMM models (though similar solutions may exist for other models as well). We also assume the reader is familiar with the normal MaxEnt training procedure; we present here only the differences to the standard algorithm. See (Manning and Sch&amp;quot;utze, 1999) for a good description.</Paragraph>
      <Paragraph position="11"> The only technical problem that we are faced with here is that we cannot directly estimate the observed probability ^p(x,y) for examples in T prime, since they are only partially labeled. Borrowing the idea from the expectation-maximization algorithm (Dempster et al., 1977), we can replace this probability by the re-normalized system proposed probability: for (x,yx) [?]T prime, we define</Paragraph>
      <Paragraph position="13"> where yx is the subset of labels from Y which are consistent with the partial classification of x in T prime.</Paragraph>
      <Paragraph position="14"> d(y [?] yx) is 1 if and only if y is consistent with the partial classification yx.11 The log-likelihood computation in Equation (2) becomes</Paragraph>
      <Paragraph position="16"> To further simplify the evaluation, the quantities ^q(x,y) are recomputed every few steps, and are considered constant as far as finding the optimum l values is concerned (the partial derivative computationsandnumericalupdatesotherwisebecome null quite complicated, and the solution is no longer unique). Given this new evaluation function, the training algorithm will proceed exactly the same way as in the normal case where all the data is fully labeled.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>