<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-1034">
  <Title>A Hybrid Approach for Named Entity and Sub-Type Tagging*</Title>
  <Section position="2" start_page="247" end_page="249" type="metho">
    <SectionTitle>
1 FST-based Pattern Matching Rules for
Textract NE
</SectionTitle>
    <Paragraph position="0"> The most attractive feature of the FST (Finite State Transducer) formalism lies in its superior time and space efficiency \[Mohri 1997\] \[Roche &amp; Schabes 1997\]. Applying a deterministic FST depends linearly only on the input size of the text.</Paragraph>
    <Paragraph position="1"> Our experiments also show that an FST rule system is extraordinarily robust. In addition, it has been verified by many research programs \[Krupka &amp; Hausman 1998\] \[Hobbs 1993\] \[Silberztein 1998\] \[Srihari 1998\] \[Li &amp; Srihari 2000\], that FST is also a convenient tool for capturing linguistic phenomena, especially for idioms and semi-productive expressions like time NEs and numerical NEs.</Paragraph>
    <Paragraph position="2"> The rules which we have currently implemented include a grammar for temporal expressions (time, date, duration, frequency, age, etc.), a grammar for numerical expressions (money, percentage, length, weight, etc.), and a grammar for other non-MUC NEs (e.g. contact information like address, email).</Paragraph>
    <Paragraph position="3"> The following sample pattern rules give an idea of what our NE grammars look like. These rules capture typical US addresses, like: 5500 Main St., Williamsville, NY14221; 12345 Xyz Avenue, Apt. 678, Los Angeles, CA98765-4321.</Paragraph>
    <Paragraph position="4"> The following notation is used: @ for macro; I for logical OR; + for one or more; (...) for optionality.</Paragraph>
    <Paragraph position="6"> Our work is similar to the research on FST local grammars at LADL/University Paris VII \[Silberztein 1998\] 1, but that research was not turned into a functional rule based NE system.</Paragraph>
    <Paragraph position="7"> The rules in our NE grammars cover expressions with very predictable patterns. They were designed to address the weaknesses of our statistical NE tagger. For example, the following missings (underlined) and mistagging originally made by our statistical NE tagger have all been correctly identified by our temporal NE grammar.</Paragraph>
    <Paragraph position="8"> began &lt;TIMEX TYPE=&amp;quot;DATE&amp;quot;&gt;Dec. 15, the&lt;/TIMEX&gt; space agency on Jan. 28, &lt;TIMEX  including a grammar for certain temporal expressions and a grammar for stock exchange sub-language. TYPE=&amp;quot;TIME&amp;quot;&gt;Saturday at&lt;/TIMEX&gt; 2:42 a.m. ES&lt;ENAMEX  We use two gazetteers in our system, one for person and one for location. The person gazetteer consists of 3,000 male names, 5,000 female names and 14,000 family names. The location gazetteer consists of 250,000 location names with their categories such as CITY, PROVINCE, COUNTRY, AIRPORT, etc. The containing and being-contained relationship among locations is also provided.</Paragraph>
    <Paragraph position="9"> The following is a sample line in the location gazetteer, which denotes &amp;quot;Aberdeen&amp;quot; as a city in &amp;quot;California&amp;quot;, and &amp;quot;California&amp;quot; as a province of  Although gazetteers obviously contain useful name entity information, a straightforward word match approach may even degrade the system performance since the information from gazetteers is too ambiguous. There are a lot of common words that exist in the gazetteers, such as 'T', &amp;quot;A&amp;quot;, &amp;quot;Friday&amp;quot;, &amp;quot;June&amp;quot;, &amp;quot;Friendship&amp;quot;, etc. Also, there is large overlap between person names and location names, such as &amp;quot;Clinton&amp;quot;, &amp;quot;Jordan&amp;quot;, etc.</Paragraph>
    <Paragraph position="10"> Here we propose a machine learning approach to incorporate the gazetteer information with other common contextual information based on MaxEnt. Using MaxEnt, the system may learn under what situation the occurrence in gazetteers is a reliable evidence for a name entity. We first define &amp;quot;LFEATURE&amp;quot; based on occurrence in the location gazetteer as follows:  There is precedence from the first LFEATURE to the last one. Each token in the input document is assigned a unique &amp;quot;LFEATURE&amp;quot;. We also define &amp;quot;NFEATURE&amp;quot; based on occurrence in the name gazetteer as follows:  With these two extra features, every token in the document is regarded as a three-component vector (word, LFEATURE, NFEATURE). We can build a statistical model to evaluate the conditional probability based on these contextual and gazetteer features. Here &amp;quot;tag&amp;quot; represents one of the three possible tags (Person, Location, Other), and history represents any possible contextual history. Generally, we have:</Paragraph>
    <Paragraph position="12"> A maximum entropy solution for probability has the form \[Rosenfeld 1994\] \[Ratnaparkhi 1998\]</Paragraph>
    <Paragraph position="14"> where fi (history, tag) are binary-valued feature functions that are dependent on whether the feature is applicable to the current contextual history. Here is an example of our feature function:</Paragraph>
    <Paragraph position="16"> In (2) and (3) a i are weights associated to feature functions.</Paragraph>
    <Paragraph position="17"> The weight evaluation scheme is as follows: We first compute the average value of each feature function according to a specific training corpus. The obtained average observations are set as constraints, and the Improved Iterative Scaling (IIS) algorithm \[Pietra et al. 1995\] is employed to evaluate the weights. The resulting probability distribution (2) possesses the maximum entropy among all the probability distributions consistent with the constraints imposed by feature function average values.</Paragraph>
    <Paragraph position="18"> In the training stage, our gazetteer module contains two sub-modules: feature function induction and weight evaluation \[Pietra et al. 1995\]. The structure is shown in Figure 2.</Paragraph>
    <Paragraph position="19">  We predefine twenty-four feature function templates. The following are some examples and others have similar structures:  if current word = _, and tag = _ else if previous word = _, and tag = _ else if following word = _, and tag = _ else where the symbol .... denotes any possible values which may be inserted into that field. Different fields will be filled different values. Then, using a training corpus containing 230,000 tokens, we set up a feature function candidate space based on the feature function templates. The &amp;quot;Feature Function Induction Module&amp;quot; can select next feature function that reduces the Kullback-Leibler divergence the most \[Pietra et al. 1995\]. To make the weight evaluation computation tractable at the feature function induction stage, when trying a new feature function, all previous computed weights are held constant, and we only fit one new constraint that is imposed by the candidate feature function. Once the next feature function is selected, we recalculate the weights by IIS to satisfy all the constraints, and thus obtain the next tentative probability. The feature function induction module will stop when the Log-likelihood gain is less than a pre-set threshold.</Paragraph>
    <Paragraph position="20"> The gazetteer module recognizes the person and location names in the document despite the fact that some of them may be embedded in an organization name. For example, &amp;quot;New York Fire Department&amp;quot; may be tagged as &lt;LOCATION&gt; New York &lt;/NE&gt; Fire Department. In the input stream for HMM, each token being tagged as location is accordingly transformed into one of the built-in tokens &amp;quot;CITY&amp;quot;, &amp;quot;PROVINCE&amp;quot;, &amp;quot;COUNTRY&amp;quot;. The HMM may group &amp;quot;CITY Fire Department&amp;quot; into an organization name. A similar technique is applied for person names.</Paragraph>
    <Paragraph position="21"> Since the tagged tokens from the gazetteer module are regarded by later modules as either person or location names, we require that the current module generates results with the highest possible precision. For each tagged token we will compute the entropy of the answer. If the entropy is higher than a pre-set threshold, the system will not be certain enough about the answer, and the word will be untagged. The missed location or person names may be recognized by the following HMM module.</Paragraph>
  </Section>
  <Section position="3" start_page="249" end_page="251" type="metho">
    <SectionTitle>
3 Improving NE Segmentation through Constrained HMM
</SectionTitle>
    <Paragraph position="0"> constrained HMM Our original HMM is similar to the Nymble \[Bikel et al. 1997\] system that is based on bigram statistics. To correct some of the leading errors, we incorporate manual segmentation rules with HMM. These syntactic rules may provide information beyond bigram and balance the limitation of the training corpus.</Paragraph>
    <Paragraph position="1"> Our manual rules focus on improving the NE segmentation. For example, in the token sequence &amp;quot;College of William and Mary&amp;quot;, we have rules based on global sequence checking to determine if the words &amp;quot;and&amp;quot; or &amp;quot;of&amp;quot; are common words or parts of organization name.</Paragraph>
    <Paragraph position="2"> The output of the rules are some constraints on the HMM transition network, such as &amp;quot;same tags for tokens A, B&amp;quot;, or &amp;quot;common word for token A&amp;quot;. The Viterbi algorithm will select the optimized path that is consistent with such constraints.</Paragraph>
    <Paragraph position="3"> The manual rules are divided into three categories: (i) preposition disambiguation, (ii) spurious capitalized word disambiguation, and (iii) spurious NE sequence disambiguation.</Paragraph>
    <Paragraph position="4"> The rules of preposition disambiguation are responsible for determination of boundaries involving prepositions (&amp;quot;of&amp;quot;, &amp;quot;and&amp;quot;, &amp;quot;'s&amp;quot;, etc.). For example, for the sequence &amp;quot;A of B&amp;quot;, we have the following rule: A and B have same tags if the lowercase of A and B both occur in OXFD dictionary. A &amp;quot;global word sequence checking&amp;quot; \[Mikheev, 1999\] is also employed. For the sequence &amp;quot;Sprint and MCI&amp;quot;, we search the document globally. If the word &amp;quot;Sprint&amp;quot; or  &amp;quot;MCI&amp;quot; occurs individually somewhere else, we mark &amp;quot;and&amp;quot; as a common word.</Paragraph>
    <Paragraph position="5"> The rules of spurious capitalized word disambiguation are designed to recognize the first word in the sentence. If the first word is unknown in the training corpus, but occurs in OXFD as a common word in lowercase, HHM's unknown word model may be not accurate enough. The rules in the following paragraph are designed to treat such a situation.</Paragraph>
    <Paragraph position="6"> If the second word of the same sentence is in lowercase, the first word is tagged as a common word since it never occurs as an isolated NE token in the training corpus unless it has been recognized as a NE elsewhere in the document.</Paragraph>
    <Paragraph position="7"> If the second word is capitalized, we will check globally if the same sequence occurs somewhere else. If so, the HMM is constrained to assign the same tag to the two tokens. Otherwise, the capitalized token is tagged as a common word.</Paragraph>
    <Paragraph position="8"> The rules of spurious NE sequence disambiguation are responsible for finding spurious NE output from HMM, adding constraints, and re-computing NE by HMM. For example, in a sequence &amp;quot;Person Organization&amp;quot;, we will require the same output tag for these two tokens and run HMM again.</Paragraph>
  </Section>
  <Section position="4" start_page="251" end_page="253" type="metho">
    <SectionTitle>
4 NE Sub-Type Tagging using Maximum Entropy Model
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="251" end_page="252" type="sub_section">
      <Paragraph position="0"> The output document from constrained HMM contains MUC-standard NE.tags such as person, location and organization. However, for a real information extraction system, the MUC-standard NE tag may not be enough and further detailed NE information might be necessary. We have predefined the following sub-types for person, location and organization:  If a NE is not covered by any of the above sub-categories, it should remain a MUC-standard tag. Obviously, the sub-categorization requires much more information beyond bigram than MUC-standard tagging. For example, it is hard to recognize CNN as a Mass Media company by bigram if the token &amp;quot;CNN&amp;quot; never occurs in the training corpus. External gazetteer information is critical for some sub-category recognition, and trigger word models may also play an important role.</Paragraph>
      <Paragraph position="1"> With such considerations, we use the Maximum entropy model for sub-categorization, since MaxEnt is powerful enough to incorporate into the system gazetteer or other information sources which might become available at some later time.</Paragraph>
      <Paragraph position="2"> Similar to the gazetteer module in Section 2, the sub-categorization module in the training stage contains two sub-modules, (i) feature function induction and (ii) weight evaluation. We have the following seven feature function templates:</Paragraph>
      <Paragraph position="4"> f(history, tag)= {10 if following_Word= _,MUC_tag = _,andelse tag=_ f(history, tag)={lo ifMUC_tag= ,contain_male_name, and tag  = 1l if ,oc_ta =_,co. .in_fema,e_.ame,a.d,ag=_ f (history, tag ) to else We have trained 1,000 feature functions by the feature function induction module according to the above templates.</Paragraph>
      <Paragraph position="5"> Because much more external gazetteer information is necessary for the sub-categorization and there is an overlap between male and female name gazetteers, the result from the current MaxEnt module is not sufficiently accurate. Therefore, a conservative strategy has been applied. If the entropy of the output answer is higher than a threshold, we will back-off to the MUC-standard tags. Unlike MUC NE categories, local contextual information is not sufficient for sub-categorization. In the future more external gazetteers focusing on recognition of government, company, army, etc. will be incorporated into our system. And we are considering using trigger words \[Rosenfeld, 1994\] to recognize some sub-categories. For example, &amp;quot;psalms&amp;quot; may be a trigger word for &amp;quot;religious person&amp;quot;, and &amp;quot;Navy&amp;quot; may be a trigger word for &amp;quot;military person&amp;quot;.</Paragraph>
    </Section>
    <Section position="2" start_page="252" end_page="253" type="sub_section">
      <SectionTitle>
Experiment and Conclusion
</SectionTitle>
      <Paragraph position="0"> We have tested our system on MUC-7 dry run data; this data consists of 22,000 words and represents articles from The New York Times.</Paragraph>
      <Paragraph position="1"> Since a key was provided with the data, it is possible to properly evaluate the performance of our NE tagger. The scoring program computes both the precision and recall, and combines these two measures into f-measure as the weighted harmonic mean \[Chinchor, 1998\]. The formulas are as follows: number of correct responses Precision = number responses number of correct responses Recall = number correct in key  If the gazetteer module is removed from our system, and the constrained HMM is restored to the standard HMM, the f-measures for person, location, and organization are as follows:  Obviously, our gazetteer model and constrained HMM have greatly increased the system accuracy on the recognition of persons, locations, and organizations. Currently, there are some errors in our gazetteers. Some common words such as &amp;quot;Changes&amp;quot;, &amp;quot;USER&amp;quot;, &amp;quot;Administrator&amp;quot;, etc. are mistakenly included in the person name gazetteer. Also, too many person names are included into the location gazetteer. By cleaning up the gazetteers, we can continue improving the precision on person name and locations.</Paragraph>
      <Paragraph position="2"> We also ran our NE tagger on the formal test files of MUC-7. The following are the results:  There is some performance degradation in the formal test. This decrease is because that the formal test is focused on satellite and rocket domains in which our system has not been trained. There are some person/location names used as spacecraft or robot names (ex. Mir, Alvin, Columbia...), and there are many high-tech company names which do not occur in our HMM training corpus. Since the finding of organization names totally relies on the HMM model, it suffers most from domain shift (10% degradation). This difference implies that gazetteer information may be useful in overcoming the domain dependency. This paper has demonstrated improved performance in an NE tagger by combining symbolic and statistical approaches. MaxEnt has been demonstrated to be a viable technique for integrating diverse sources of information and has been used in NE sub-categorization.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>