<?xml version="1.0" standalone="yes"?> <Paper uid="C02-1088"> <Title>Unsupervised Named Entity Classification Models and their Ensembles</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 8 IREX (IREX committee, 1999) categories, </SectionTitle> <Paragraph position="0"> with the 294,000-token IREX training corpus. Building a large corpus like this takes a great deal of time and labor.</Paragraph> <Paragraph position="1"> This paper proposes an unsupervised learning model that uses a small-scale named entity dictionary and an unlabeled corpus for classifying named entities. Collins and Singer (1999) opened the possibility of using an unlabeled corpus to classify named entities.</Paragraph> <Paragraph position="2"> They showed that the use of unlabeled data can reduce the requirements for supervision to just 7 simple seed rules. They exploited natural redundancy in the data: for many named-entity instances, both the spelling of the name and the context in which it appears are sufficient to determine its type.</Paragraph> <Paragraph position="3"> Our model considers syntactic relations in a sentence to resolve semantic ambiguity and uses an ensemble of three different learning methods to improve performance. 
They are Maximum Entropy Model, Memory-based Learning, and Sparse Network of Winnows (Roth, 1998).</Paragraph> <Paragraph position="4"> This model classifies proper nouns appearing in documents into person, organization, and location, on the assumption that the boundaries of the proper nouns have already been recognized.</Paragraph> </Section> <Section position="4" start_page="0" end_page="3" type="metho"> <SectionTitle> 2 The System for NE Classification </SectionTitle> <Paragraph position="0"> This section describes a system that classifies named entities by using machine learning algorithms. The system consists of four modules, as shown in Figure 1.</Paragraph> <Paragraph position="1"> First, we automatically build a training set, a named-entity-tagged corpus. This set is used to predict the categories of named entities within the target documents received as the input of the system.</Paragraph> <Paragraph position="2"> The second module extracts syntactic relations from the training set and the target documents. These are encoded in the format of training and test examples for machine learning. In the third module, classification learning proceeds independently with the three learning methods, and the three results generated by the learners are combined into one.</Paragraph> <Paragraph position="3"> Finally, the system decides the category of the test examples that have not yet been labeled by using a rule, and then outputs a named-entity-tagged corpus.</Paragraph> <Section position="1" start_page="0" end_page="2" type="sub_section"> <SectionTitle> 2.1 Building a Training Set </SectionTitle> <Paragraph position="0"> The system requires a training set with category labels in order to acquire knowledge for classification. 
We build a training set automatically using a named entity dictionary and a POS-tagged corpus, and then use it in place of a hand-tagged set in machine learning.</Paragraph> <Paragraph position="1"> We randomly extract 1,500 entries for each category (person, location, and organization) from a proper noun dictionary made by KORTERM and then reconstruct the named entity dictionary. The proper noun dictionary has about 51,000 proper nouns classified into 41 categories (person, animal, plant, etc.). We do not extract homonyms, in order to reduce ambiguity. To show that it is possible to classify named entities with a small-scale dictionary, we limit the number of entries to 1,500.</Paragraph> <Paragraph position="2"> We label each target word (a proper noun or capitalized word) appearing in the POS-tagged corpus by means of the NE dictionary mentioned above. We used a KAIST POS-tagged corpus composed of one million eojeols (an eojeol is a Korean linguistic unit delimited by blanks or punctuation). It is not easy to classify named entities correctly with a dictionary alone, since named entities are semantically ambiguous, so we have to consider the context around the target word.</Paragraph> <Paragraph position="3"> To consider the context, we use co-occurrence information between the category (c) of a target word (tw) and a head word (hw) appearing on the left or the right of the target word. We modify the categories labeled by the NE dictionary through the following process.</Paragraph> <Paragraph position="4"> 1. We extract pairs [c, hw] from the corpus labeled by means of the dictionary.</Paragraph> <Paragraph position="5"> 2. If hw occurs with several different categories, we assume that a tw occurring with hw may be ambiguous, and we remove the category label of tw.</Paragraph> <Paragraph position="6"> 3. We make rules for predicting the category of tw from the pairs [c, hw] and apply them to the corpus. Each rule states that a tw occurring with hw has category c.</Paragraph> <Paragraph position="7"> 4. 
We extract sentences including the labeled target words from the corpus.</Paragraph> <Paragraph position="8"> In step 3, nine rules are made. We label an unlabeled target word occurring with hw with category c if the pair [c, hw] is found more than a threshold number of times; we set the threshold to 10. Sentences including the 4,504 labeled target words are made into a training set in this process (Table 1).</Paragraph> </Section> <Section position="2" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 2.2 Extracting Syntactic Relations </SectionTitle> <Paragraph position="0"> In order to predict the category, most machine learning systems usually consider the two words on the left and the two words on the right of a target word as the context (Uchimoto et al., 2000; Petasis et al., 2000). However, this method has some problems.</Paragraph> <Paragraph position="1"> If words that are not helpful for predicting the category are near the target word, they can cause an incorrect prediction. In the following example, 'Kim' can be predicted as an organization instead of a person because of the left word 'Jeong-bu' (the government).</Paragraph> </Section> <Section position="3" start_page="2" end_page="3" type="sub_section"> <SectionTitle> Example </SectionTitle> <Paragraph position="0"> The government supports KIA on the premise that the chairman Kim submits a resignation.</Paragraph> <Paragraph position="1"> Jeong-bu (N: the government) neun (PP) Kim (PN) hoi-jang (N: the chairman) i (PP) sa-pyo (N: a resignation) reul (PP) je-chul-han-da (V: submit) neun (PP) jeon-je ro KIA reul ji-won-han-da.</Paragraph> <Paragraph position="3"> The system also cannot consider important words that fall outside the limit of the context window. In the example above, the word 'je-chul-han-da' (submit) is an important feature for predicting the category of 'Kim'. 
If a Korean functional word is counted as one window position, we cannot get this information within the four windows to the right. Even if we do not count the functional words, it is sometimes necessary to consider a window larger than two words, as in the example above.</Paragraph> <Paragraph position="4"> We notice that the words that modify the target word, or are modified by it, are more helpful for the prediction than any other words in the sentence. So we extract syntactic relations, as in Figure 2, as the context. The modifier is a word modifying the target word, and the modifiee is a word modified by the target word. The josa is a postposition that follows the target word, and the predicate is a verb that predicates the target word. The 'BLANK' label represents that there is no word corresponding to that slot of the template. These syntactic relations are extracted by a simple heuristic parser. We will show through an experiment in Section 3 that these syntactic relations bring a better result.</Paragraph> <Paragraph position="5"> These syntactic relations may seem language specific. However, a josa represents the case of the target word; if case information can be extracted from a sentence, syntactic relations like those in Figure 2 can also be built for other languages.</Paragraph> <Paragraph position="6"> As the machine learners require training and test examples represented in a feature-vector format, the syntactic relations are encoded as in Figure 3.</Paragraph> </Section> <Section position="4" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 2.3 Ensemble Learning </SectionTitle> <Paragraph position="0"> An ensemble of several classifiers can improve performance: errors made by a minority of the classifiers can be removed through the ensemble (Dietterich, 1997). In base noun phrase identification, Tjong Kim Sang et al. 
(2000) showed that the result obtained by combining seven different machine learning algorithms outperformed the best individual result.</Paragraph> <Paragraph position="1"> In our module, the machine learners train on the training examples and then classify the named entities in the test examples. This process is shown in Figure 4.</Paragraph> <Paragraph position="2"> A josa, attached to a nominal, is a postpositional particle in Korean.</Paragraph> <Paragraph position="3"> This ensemble learning has two characteristics. One is that the classification proceeds independently with three different learners, and the results are combined into one. The other is that the learning is repeated with new training examples generated through the learning, which enables the system to receive incremental feedback.</Paragraph> <Paragraph position="4"> Through this learning method, we can get a larger and more precise set of training examples for predicting the categories. This is important in an unsupervised learning model because there is no labeled data for learning.</Paragraph> <Paragraph position="5"> We use three learning methods: Memory-based Learning, Sparse Network of Winnows, and Maximum Entropy Model. We describe these methods briefly in this section.</Paragraph> <Paragraph position="6"> Memory-based Learning stores the training examples and classifies new examples by choosing the most frequent classification among the training examples closest to the new example. Examples are represented as sets of feature-value pairs. Each feature receives a weight based on the amount of information it provides for computing the classification of the examples in the training data. We use TiMBL (Daelemans et al., 1999), a memory-based learning software package.</Paragraph> <Paragraph position="7"> The Sparse Network of Winnows (SNoW) learning architecture is a sparse network of linear units. 
Nodes in the input layer of the network represent simple relations over the input example and are used as the input features. Each linear unit is called a target node and represents a classification of interest over the input examples. Given training examples, each input example is mapped into the set of features which are active (present) in it; this representation is presented to the input layer of SNoW and propagated to the target nodes. We use SNoW (Carlson et al., 1999), a Sparse Network of Winnows software package.</Paragraph> <Paragraph position="8"> The Maximum Entropy Model (MEM) is especially suited for integrating evidence from various information sources. MEM allows the computation of p(f|h) for any f in the space of possible futures, F, and for every h in the space of possible histories, H. Futures are defined as the possible classifications, and a history is all of the conditioning data that enables us to make a decision in the space of futures. The computation of p(f|h) depends on a set of features, which are binary functions of the history and future. A feature is represented as follows:</Paragraph> <Paragraph position="10"> g(h, f) = 1 if h meets some condition and f is one of the futures, and g(h, f) = 0 otherwise.</Paragraph> <Paragraph position="12"> Given a set of features and some training examples, a weighting parameter alpha_g for each feature g is computed. This allows us to compute the conditional probability as follows:</Paragraph> <Paragraph position="14"> p(f|h) = prod_g alpha_g^g(h,f) / Z(h), where Z(h) = sum_f prod_g alpha_g^g(h,f) is a normalizing constant. We use MEMT, the Maximum Entropy Modeling Toolkit (Ristad, 1998), to compute the parameters for the features.</Paragraph> <Paragraph position="15"> We use three different voting mechanisms to combine the results generated by the three learners. The first method is majority voting: each classification receives the same weight, and the most frequent classification is chosen. 
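As a minimal sketch (not the authors' implementation), the majority vote over the three learners' proposals might look like the following; the fallback to the first learner's proposal on a three-way tie is an assumption, since no tie-breaking rule is given:

```python
from collections import Counter

# Minimal sketch of majority voting over three classifiers' outputs.
# Each learner proposes one category per test example; the most frequent
# proposal wins. The fallback to the first learner's proposal on a
# three-way tie is an assumption, not taken from the paper.
def majority_vote(predictions):
    """predictions: a list of per-learner category lists of equal length."""
    combined = []
    for proposals in zip(*predictions):
        category, freq = Counter(proposals).most_common(1)[0]
        if freq == 1:                    # three-way tie: no majority
            category = proposals[0]      # fall back to the first learner
        combined.append(category)
    return combined

# Hypothetical proposals from TiMBL, SNoW, and MEMT for three test examples.
timbl = ["person", "location", "person"]
snow  = ["person", "location", "organization"]
memt  = ["organization", "location", "person"]
print(majority_vote([timbl, snow, memt]))  # ['person', 'location', 'person']
```

In practice the same combiner is applied once per loop of the iterative ensemble, so its output can be compared with the previous loop's output.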
The ending condition is satisfied when there is no difference between the result combined in the current loop and the result combined in the previous loop.</Paragraph> <Paragraph position="16"> The second method is probability voting.</Paragraph> <Paragraph position="17"> MEMT and SNoW propose probabilities for all categories, but TiMBL proposes only one category for each test example. We set the probability of the category TiMBL proposes to 0.6, and that of the others to 0.2. For each category, we multiply the probabilities proposed by the three learners and then choose the N examples that have the largest probabilities. In the next round of learning we set N = N + 100. When N is larger than a threshold, the ending condition is satisfied and the learning is over. We set the threshold to 3/4 of the number of test examples.</Paragraph> <Paragraph position="18"> The last method is mixed voting, in which we use the two voting methods mentioned above one after another. First, we use probability voting; after that learning is over, we use majority voting. The threshold for the probability voting here is 1/2 of the number of test examples.</Paragraph> </Section> <Section position="5" start_page="3" end_page="3" type="sub_section"> <SectionTitle> 2.4 Post-Processing </SectionTitle> <Paragraph position="0"> After the learning, the system modifies the test examples by using the one-sense-per-discourse rule. One sense per discourse means that the sense of a target word is highly consistent within any given document; Yarowsky (1995) showed that it is accurate in word sense disambiguation. We label the examples that are not yet labeled with the category of the labeled word in the same discourse, as in the following example, and output a named-entity-tagged corpus.</Paragraph> </Section> <Section position="6" start_page="3" end_page="3" type="sub_section"> <SectionTitle> Example </SectionTitle> <Paragraph position="0"> after the ensemble learning ... ... KIA<type=organization> reul ji-won-han-da. KIA neon ... 
...</Paragraph> <Paragraph position="1"> after post-processing ... ... KIA<type=organization> reul ji-won-han-da. KIA<type=organization> neon ... ...</Paragraph> </Section> </Section> </Paper>