<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-0206">
  <Title>Data Selection in Semi-supervised Learning for Name Tagging</Title>
  <Section position="5" start_page="48" end_page="48" type="metho">
    <SectionTitle>
3 Motivation
</SectionTitle>
    <Paragraph position="0"> The performance of name taggers has been limited in part by the amount of labeled training data available. How can an unlabeled corpus help to address this problem? Based on its original training (on the labeled corpus), there will be some tags (in the unlabeled corpus) that the tagger will be very sure about. For example, there will be contexts that were always followed by a person name (e.g., &amp;quot;Capt.&amp;quot;) in the training corpus. If we find a new token T in this context in the unlabeled corpus, we can be quite certain it is a per-son name. If the tagger can learn this fact about T, it can successfully tag T when it appears in the test corpus without any indicative context. In the same way, if a previously-unseen context appears consistently in the unlabeled corpus before known person names, the tagger should learn that this is a predictive context.</Paragraph>
    <Paragraph position="1"> We have adopted a simple learning approach: we take the unlabeled text about which the tagger has greatest confidence in its decisions, tag it, add it to the training set, and retrain the tagger.</Paragraph>
    <Paragraph position="2"> This process is performed repeatedly to bootstrap ourselves to higher performance. This approach can be used with any supervised-learning tagger that can produce some reliable measure of confidence in its decisions.</Paragraph>
  </Section>
  <Section position="6" start_page="48" end_page="48" type="metho">
    <SectionTitle>
4 Baseline Multi-lingual Name Tagger
</SectionTitle>
    <Paragraph position="0"> Our baseline name tagger is based on an HMM that generally follows the Nymble model (Bikel et al, 1997). Then it uses best-first search to generate NBest hypotheses, and also computes the margin - the difference between the log probabilities of the top two hypotheses. This is used as a rough measure of confidence in our name tagging.</Paragraph>
    <Paragraph position="1">  In processing Chinese, to take advantage of name structures, we do name structure parsing using an extended HMM which includes a larger number of states (14). This new HMM can handle name prefixes and suffixes, and transliterated foreign names separately. We also augmented the HMM model with a set of post-processing rules to correct some omissions and systematic errors. The name tagger identifies three name types: Person (PER), Organization (ORG) and Geo-political (GPE) entities (locations which are also political units, such as countries, counties, and cities).</Paragraph>
  </Section>
  <Section position="7" start_page="48" end_page="52" type="metho">
    <SectionTitle>
5 Two Semi-Supervised Learning Methods for Name Tagging
</SectionTitle>
    <Paragraph position="0"> ods for Name Tagging We have applied this bootstrapping approach to two sources of data: first, to a large corpus of unlabeled data and second, to the test set. To distinguish the two, we shall label the first &amp;quot;bootstrapping&amp;quot; and the second &amp;quot;self-training&amp;quot;. We begin (Sections 5.1 and 5.2) by describing the basic algorithms used for these two processes.</Paragraph>
    <Paragraph position="1"> We expected that these basic methods would provide a substantial performance boost, but our experiments showed that, for best gain, the additional training data should be related to the target problem, namely, our test set. We present measures to select documents (Section 5.3) and sentences (Section 5.4), and show (in Section 6) the effectiveness of these measures.</Paragraph>
    <Section position="1" start_page="48" end_page="49" type="sub_section">
      <SectionTitle>
5.1 Bootstrapping
</SectionTitle>
      <Paragraph position="0"> We divided the large unlabeled corpus into segments based on news sources and dates in order to: 1) create segments of manageable size; 2) separately evaluate the contribution of each segment (using a labeled development test set) and reject those which do not help; and 3) apply the latest updated best model to each subsequent  We have also used this metric in the context of rescoring of name hypotheses (Ji and Grishman, 2005); Scheffer et al. (2001) used a similar metric for active learning of name tags.  segment. The procedure can be formalized as follows.</Paragraph>
      <Paragraph position="1">  1. Select a related set RelatedC from a large corpus of unlabeled data with respect to the test set TestT, using the document selection method described in section 5.3.</Paragraph>
      <Paragraph position="2"> 2. Split RelatedC into n subsets and mark them  . Call the updated HMM name tagger NameM (initially the baseline tagger), and a development test set DevT.</Paragraph>
      <Paragraph position="3"> 3. For i=1 to n (1) Run NameM on C i ; (2) For each tagged sentence S in C i , if S is tagged with high confidence, then keep S; otherwise remove S; (3) Relabel the current name tagger (NameM) as OldNameM, add C i to the training data, and retrain the name tagger, producing an updated model NameM; (4) Run NameM on DevT; if the performance gets worse, don't use C</Paragraph>
      <Paragraph position="5"/>
    </Section>
    <Section position="2" start_page="49" end_page="49" type="sub_section">
      <SectionTitle>
5.2 Self-training
</SectionTitle>
      <Paragraph position="0"> An analogous approach can be used to tag the test set. The basic intuition is that the sentences in which the learner has low confidence may get support from those sentences previously labeled with high confidence.</Paragraph>
      <Paragraph position="1"> Initially, we build the baseline name tagger from the labeled examples, then gradually add the most confidently tagged test sentences into the training corpus, and reuse them for the next iteration, until all sentences are labeled. The procedure can be formalized as follows.</Paragraph>
      <Paragraph position="2">  1. Cluster the test set TestT into n clusters T</Paragraph>
      <Paragraph position="4"> , by collecting document pairs with low cross entropy (described in section 5.3.2) into the same cluster.</Paragraph>
      <Paragraph position="5"> 2. For i=1 to n (1) NameM = baseline HMM name tagger; (2) While (there are new sentences tagged with confidence higher than a threshold) a. Run NameM on T i ; b. Set an appropriate threshold for margin; c. For each tagged sentence S in T</Paragraph>
      <Paragraph position="7"> tagged with high confidence, add S to the training data; d. Retrain the name tagger NameM with augmented training data.</Paragraph>
      <Paragraph position="8"> At each iteration, we lower the threshold so that about 5% of the sentences (with the largest margin) are added to the training corpus.  As an example, this yielded the following gradually improving performance for one English cluster</Paragraph>
    </Section>
    <Section position="3" start_page="49" end_page="49" type="sub_section">
      <SectionTitle>
Self-training (English)
</SectionTitle>
      <Paragraph position="0"> Self-training can be considered a cache model variant, operating across the entire test collection.</Paragraph>
      <Paragraph position="1"> But it uses confidence measures as weights for each name candidate, and relies on names tagged with high confidence to re-adjust the prediction of the remaining names, while in a cache model, all name candidates are equally weighted for voting (independent of the learner's confidence).</Paragraph>
    </Section>
    <Section position="4" start_page="49" end_page="50" type="sub_section">
      <SectionTitle>
5.3 Unlabeled Document Selection
</SectionTitle>
      <Paragraph position="0"> To further investigate the benefits of using very large corpora in bootstrapping, and also inspired by the gain from the &amp;quot;essence&amp;quot; of self-training, which aims to gradually emphasize the predictions from related sentences within the test set, we reconsidered the assumptions of our approach.</Paragraph>
      <Paragraph position="1"> The bootstrapping method implicitly assumes that the unlabeled data is reliable (not noisy) and uniformly useful, namely:  To be precise, we repeatedly reduce the threshold by 0.1 until an additional 5% or more of the sentences are included; however, if more than an additional 20% of the sentences are captured because many sentences have the same margin, we add back 0.1 to the threshold.</Paragraph>
      <Paragraph position="2">  * The unlabeled data supports the acquisition of new names and contexts, to provide new evidence to be incorporated in HMM and reduce the sparse data problem; * The unlabeled data won't make the old estimates worse by adding too many names whose tags are incorrect, or at least are incorrect in the context of the labeled training data and the test data.</Paragraph>
      <Paragraph position="3"> If the unlabeled data is noisy or unrelated to the test data, it can hurt rather than improve the learner's performance on the test set. So it is necessary to coarsely measure the relevance of the unlabeled data to our target test set. We define an IR (information retrieval) - style relevance measure between the test set TestT and an unlabeled document d as follows.</Paragraph>
      <Paragraph position="4"> 5.3.1 'Query set' construction We model the information expected from the unlabeled data by a 'bag of words' technique. We construct a query term set from the test corpus TestT to check whether each unlabeled document d is useful or not.</Paragraph>
      <Paragraph position="5"> * We prefer not to use all the words in TestT as key words, since we are only concerned about the distribution of name candidates.</Paragraph>
      <Paragraph position="6"> (Adding off-topic documents may in fact introduce noise into the model). For example, if one document in TestT talks about the presidential election in France while d talks about the presidential election in the US, they may share many common words such as 'election', 'voting', 'poll', and 'camp', but we would expect more gain from other unlabeled documents talking about the French election, since they may share many name candidates.</Paragraph>
      <Paragraph position="7"> * On the other hand it is insufficient to only take the name candidates in the top one hypothesis for each sentence (since we are particularly concerned with tokens which might be names but are not so labeled in the top hypothesis).</Paragraph>
      <Paragraph position="8"> So our solution is to take all the name candidates in the top N best hypotheses for each sentence to construct a query set Q.</Paragraph>
      <Paragraph position="9">  Using Q, we compute the cross entropy H(TestT, d) between TestT and d by:  where x is a name candidate in Q, and prob(x|TestT) is the probability (frequency) of x appearing in TestT while prob(x|d) is the probability of x in d. If H(T, d) is smaller than a threshold then we consider d a useful unlabeled</Paragraph>
    </Section>
    <Section position="5" start_page="50" end_page="52" type="sub_section">
      <SectionTitle>
5.4 Sentence Selection
</SectionTitle>
      <Paragraph position="0"> We don't want to add all the tagged sentences in a relevant document to the training corpus because incorrectly tagged or irrelevant sentences can lead to degradation in model performance.</Paragraph>
      <Paragraph position="1"> The value of larger corpora is partly dependent on how much new information is extracted from each sentence of the unlabeled data compared to the training corpus that we already have.</Paragraph>
      <Paragraph position="2"> The following confidence measures were applied to assist the semi-supervised learning algorithm in selecting useful sentences for re-training the model.</Paragraph>
      <Paragraph position="3">  For each sentence, we compute the HMM hypothesis margin (the difference in log probabilities) between the first hypothesis and the second hypothesis. We select the sentences with margins larger than a threshold  to be added to the training data.</Paragraph>
      <Paragraph position="4"> Unfortunately, the margin often comes down to whether a specific word has previously been observed in training; if the system has seen the word, it is certain, if not, it is uncertain. Therefore the sentences with high margins are a mix of interesting and uninteresting samples. We need to apply additional measures to remove the uninteresting ones. On the other hand, we may have confidence in a tagging due to evidence external to the HMM, so we explored measures beyond the HMM margin in order to recover additional sentences.</Paragraph>
      <Paragraph position="5">  We also tried a single match method, using the query set to find all the relevant documents that include any names belonging to Q, and got approximately the same result as cross-entropy. In addition to this relevance selection, we used one other simple filter: we removed a document if it includes fewer than five names, because it is unlikely to be news.</Paragraph>
      <Paragraph position="6">  In bootstrapping, this margin threshold is selected by testing on the development set, to achieve more than 93% FMeasure. null  5.4.2 Name coreference to find more reliable sentences Names introduced in an article are likely to be referred to again, so a name coreferred to by more other names is more likely to have been correctly tagged. In this paper, we use simple coreference resolution between names such as substring matching and name abbreviation resolution. null In the bootstrapping method we apply single-document coreference for each individual unlabeled text. In self-training, in order to further benefit from global contexts, we consider each cluster of relevant texts as one single big document, and then apply cross-document coreference. Assume S is one sentence in the document, and there are k names tagged in S: {N  In bootstrapping on unlabeled data, the margin criterion often selects some sentences which are too short or don't include any names. Although they are tagged with high confidence, they may make the model worse if added into the training data (for example, by artificially increasing the probability of non-names). In our experiments we don't use a sentence if it includes fewer than six words, or doesn't include any names.</Paragraph>
    </Section>
    <Section position="6" start_page="52" end_page="52" type="sub_section">
      <SectionTitle>
5.5 Data Flow
</SectionTitle>
      <Paragraph position="0"> We depict the above two semi-supervised learning methods in Figure 1 and Figure 2.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>