<?xml version="1.0" standalone="yes"?>
<Paper uid="C96-1072">
  <Title>Learning to Recognize Names Across Languages</Title>
  <Section position="2" start_page="0" end_page="424" type="metho">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Proper names represent a unique challenge for MT and IR systems. They are not found in dictionaries, are very large in number, come and go every day, and appear in many alias forms. For these reasons, list based matching schemes do not achieve desired performance levels. Hand coded heuristics can be developed to achieve high accuracy, however this approach lacks portability. Much human effort is needed to port the system to a new domain.</Paragraph>
    <Paragraph position="1"> A desirable approach is one that maximizes reuse and minimizes human effort. This paper presents an approach to proper name recognition that uses machine learning and a language independent fi'amework. Knowledge incorporated into the framework is based on a set of measurable linguistic characteristics, or ,features. Some of this knowledge is constant across languages. The rest can be generated automatically through machine learning techniques.</Paragraph>
    <Paragraph position="2"> The problem being considered is that of segmenting natural language text into lexical units, and of tagging those units with various syntactic and semantic features. A lexical unit may be a word (e.g., &amp;quot;started&amp;quot;) or a phrase (e.g., &amp;quot;The Washington Post&amp;quot;). The particular lexical units of interest here are proper names. Segmenting and tagging proper names is very important for natural language processing, particularly IR and MT.</Paragraph>
    <Paragraph position="3"> Whether a phrase is a proper name, and what type of proper name it is (company name, location name, person name, date, other) depends on (1) the internal structure of the phrase, and (2) the surrounding context. null Internal: &amp;quot;Mr. Brandon&amp;quot; Context: &amp;quot;The new company, Safetek, will make air bags.&amp;quot; The person title &amp;quot;Mr.&amp;quot; reliably shows &amp;quot;Mr. Brandon&amp;quot; to be a person name. &amp;quot;Safetek&amp;quot; can be recognized as a company name by utilizing the preceding contextual phrase and appositive &amp;quot;The new company,&amp;quot;. The recognition task can be broken down into delimitation and classification. Delimitation is the determination of the boundaries of the proper name, while classification serves to provide a more specific category.</Paragraph>
    <Paragraph position="4"> Original: John Smith , chairman of Safetek , announced his resignation yesterday.</Paragraph>
    <Paragraph position="5"> Delimit: &lt;PN&gt; John Smith &lt;/PN&gt; , chairman of &lt;PN&gt; Safetek &lt;/PN&gt;, announced his resignation yesterday.</Paragraph>
    <Paragraph position="6"> Classify: &lt;person&gt; John Smith &lt;/person&gt; , chairman of &lt;company&gt; Safetek &lt;/company&gt; , announced his resignation yesterday.</Paragraph>
    <Paragraph position="7">  During the delimit step, the boundarics of all proper names are identified. Next, the delimited proper names are classified into more specific categodcs.</Paragraph>
    <Paragraph position="8"> How can a system developed in one language be ported to another language with minimal additional effort and comparable performance results? How much additional elTort will be required, and what degradation in performance, if any, is to be expected? These questions are addressed in the following sections. null</Paragraph>
  </Section>
  <Section position="3" start_page="424" end_page="425" type="metho">
    <SectionTitle>
2 Method
</SectionTitle>
    <Paragraph position="0"> The approach taken here is to utilize a data-drivcn knowledge acquisition strategy based on decision trees which uses contextual information. This differs from other approaches which attempt to achieve this task by: (1) hand-coded heuristics, (2) list-based matching schemes, (3) human-generated knowledge bases, and (4) combinations thereof. Delimitation occurs through the application of phrasal templates.</Paragraph>
    <Paragraph position="1"> These temphttes, built by hand, use logical ol~eralors (AND, OR, etc.) to combine features strongly asst)ciated with proper names, including: proper ,mun, ampersand, hyphen, and comma. In addition, ambiguities with delimitation are handled by including other predictive features within the templates.</Paragraph>
    <Paragraph position="2"> To acquire the knowledge required for classilication, each word is tagged with all of its associated features. These features are obtained through automated and manual techniques. A decision trec is built (lk)r each name class) from the initial feature set using a rccursive partitioning algorithm (Quinhtn, 1986; l)treiman et al., 1984) that uses the following function as its selection (splitting) criterion:</Paragraph>
    <Paragraph position="4"> where p represents the proportion of names behmging to the class for which the tree is built. The fea-ture which minimizes the weighted sum o1' tiffs function across both child nodes resulting from the split is chosen. A nmltitrce approach was chosen over learning a single tree for all name classes because it allows for the straightforward association of features within the tree with specific natllC classes, and facilitates troubleshooting.</Paragraph>
    <Paragraph position="5"> The result is a hierarchical collection of' co-occurring fcatures which predict inclusion to or exclusion from a particuhtr proper name class. Since a tree is built for each nmne class el' interest, the trees arc all applied individually, and then the results are mergcd.</Paragraph>
    <Section position="1" start_page="424" end_page="425" type="sub_section">
      <SectionTitle>
2.1 Features
</SectionTitle>
      <Paragraph position="0"> Various types o1' features indicate the type el' name: parts of speech, designators, morphology, syntax, semantics, and more. 1)esignators are features which altmc provide strong evidence lbr or against a partic~ ular nantc type. l';xamplcs include &amp;quot;Co.&amp;quot; (company), &amp;quot;l)r.&amp;quot; (person), anti &amp;quot;County&amp;quot; (location). For exampie, of all the company nmnes in the English training text, 28% are associated with a corporate designator.</Paragraph>
      <Paragraph position="1"> Other features are predetermined, obtained via on-line lists, or are selected automatically based on statistical measures. Parts of speech features are predetermined based on the part of speech tagger employed. On-line lists provide lists of cities, person names, nationalities, regitms, etc. The initial set of lexical features is selected by choosing those that appear most frequently (above somc threshold) throughout the training data, and those that appear most \['requcntly near the positive instances in the training data.</Paragraph>
      <Paragraph position="2"> Some features, such as morphological, keyword, and key phrase features, are determined by hantl analysis tff the text. Capitalization is one obvious</Paragraph>
      <Paragraph position="4"> morphological feature of importance. Determining keyword and key phrase features amounts to selecting prudent subject categories. These categories are associated with lists of lexical items or already existing features. For example, many of the statistically derived lexical features may fall under common sub-ject categories. The words &amp;quot;build&amp;quot;, &amp;quot;make&amp;quot;, &amp;quot;manufacture&amp;quot;, and &amp;quot;produce&amp;quot; can be associated with the subject category &amp;quot;make-type verbs&amp;quot;. Analysis of the immediate context surrounding company names may lead to the discovery of key phrases like &amp;quot;said it&amp;quot;, &amp;quot;entered a venture&amp;quot;, and &amp;quot;is located in&amp;quot;. Table 1 shows a summary of various types of features used in system development. The longest common substring (LCS) feature (Jacobs et al., 1993) is useful for finding proper name aliases.</Paragraph>
    </Section>
    <Section position="2" start_page="425" end_page="425" type="sub_section">
      <SectionTitle>
2.2 Feature Trees
</SectionTitle>
      <Paragraph position="0"> The ID3 algorithm (Quinlan, 1986) selects and organizes features into a discrimination tree, one tree for each type of name (person, company, etc.). The tree, once built, typically contains 100+ nodes, each one inquiring about one feature in the text, within the locality of the current proper name of interest.</Paragraph>
      <Paragraph position="1"> An example of a tree which was generated for companies is shown in Figure 1. The context level for this example is 3, meaning that the feature in question must occur within the region starting 3 words to the left of and ending 3 words to the right of the proper name's left boundary. A &amp;quot;(L)&amp;quot; or &amp;quot;(R)&amp;quot; following the feature name indicates that the feature must occur to the left of or to the right of the proper name's left boundary respectively. The numbers directly beneath a node of the tree represent the number of negative and positive examples present from the training set. These numbers are useful for associating a confidence level with each classification.</Paragraph>
      <Paragraph position="2"> Definitions for the features in Figure 1 (and other abbreviations) can be found in the appendix.</Paragraph>
      <Paragraph position="3"> The training set used for this example contains 1084 negative and 669 positive examples. To obtain the best initial split of the training set, the feature &amp;quot;CN_alias&amp;quot; is chosen. Recursively visiting and optimally splitting each concurrent subset results in the generation of 97 nodes (not including leaf nodes).</Paragraph>
      <Paragraph position="4"> N N ...... N '- N -' P</Paragraph>
    </Section>
    <Section position="3" start_page="425" end_page="425" type="sub_section">
      <SectionTitle>
2.3 Architecture
</SectionTitle>
      <Paragraph position="0"> Figure 2 shows the working development system.</Paragraph>
      <Paragraph position="1"> The starting point is training text which has been pre-tagged with the locations of all proper names. The tokenizer separates punctuation from words. For non-token languages (no spaces between words), it also separates contiguous characters into constituent words. The part of speech (POS) tagger (Brill, 1992; Farwell et. al., 1994; Matsumoto et al., 1992) attaches parts of speech. Thc set of derived features is attached. During the delimitation phase, proper names are delimited using a set of POS-based hand-coded tcmplates. Using ID3, a dccision tree is generated based on the existing feature set and thc specified level of context to be considered. The generated tree is applied to test data and scored. Manual analysis of the tree and scored result leads to the discovery of new features. The new features are added to the tokenized training text, and the process repeats.</Paragraph>
    </Section>
    <Section position="4" start_page="425" end_page="425" type="sub_section">
      <SectionTitle>
2.4 Cross Language Porting
</SectionTitle>
      <Paragraph position="0"> In order to work with another language, the following resources are needed: (1) pre-tagged training text in the new language using same tags as belore, (2) a tokenizer for non-token languages, (3) a POS tagger (plus translation of the tags to a standard POS convention), and (4) translation of designators and lexical (list-based) features.</Paragraph>
      <Paragraph position="1"> These language-specific modules are highlighted in Figure 2 with bold bordcrs. Feature translation occurs through the utilization of: on-line resources, dictionaries, atlases, bilingual speakers, etc. The remainder is constant across languages: a language independent core development system, and an optimally derived feature set for English.</Paragraph>
      <Paragraph position="2"> Also worth noting are the parts of development system that are executed by hand. These are shown shaded. Everything else is automatic.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="425" end_page="427" type="metho">
    <SectionTitle>
3 Experiment
</SectionTitle>
    <Paragraph position="0"> The system was first built for English and then ported to Spanish and Japanese. For English, the training text consisted of 50 messages obtained from the English Joint Ventures (EJV) domain MUC-5 corpus of the US Advanced Research Projects Agency (ARPA). This data was hand-tagged with the locations of company names, person names, locations names, and dates. The test set consisted of 10 new messages.</Paragraph>
    <Paragraph position="1"> Experimental results were obtained by applying the generated trees to test texts. The initial raw text is tokenized and tagged with parts of speech. All features necessary to apply rules and trees are attached. Phrasal template rules are applied in order to delimit proper names. Then trees for each proper name type are applied individually to the proper names in the featurized text. Proper names which are  choosing the highest priority class. Priorities are determined based on the independent perlormance of each tree. For example, if person trces perform better independently than location trees, then a per-son classification will be chosen over a location classification. Also, designators have a large impact on resolving conflicts.</Paragraph>
    <Section position="1" start_page="426" end_page="426" type="sub_section">
      <SectionTitle>
3.1 English
</SectionTitle>
      <Paragraph position="0"> Various parameterizations were used for system development, including: (1) context depth, (2) feature set size, (3) training set size, and (4) incorporation of hand-coded phrasal templates.</Paragraph>
      <Paragraph position="1"> Figure 3 shows ttle performance results for English. The metrics used were recall (R), precision (P), and an averaging measure, P&amp;R, defined as:</Paragraph>
      <Paragraph position="3"> Obtained results for English compare to the English results of Rau (1992) and McDonald (1993). The weighted average of the P&amp;R for companies, persons, locations, and dates is 94.0%.</Paragraph>
      <Paragraph position="4">  The date grammar is rather small in comparison to other name classes, hence the performance for dates was perfect. Locations, by contrast, exhibited the lowest performance. This can be attributed mainly to: (1) locations are commonly associated with commas, which can create ambiguities with delimitation, and (2) locations made up a small percentage of all names in the training set, which could have resulted in overfitting of the built tree to the training data.</Paragraph>
      <Paragraph position="5"> Features strengths were measured for companies, persons, and locations. This experiment involved removing one feature at a time from the text used for testing and then reapplying the stone tree. Figure 4 and Table 2 show performance results (P&amp;R) when the three most powerful features are removed, one at a time, for companies, persons, and locations respectively. This experiment demonstrates tim power of designator features across all proper name types, and the importance of the alias feature for companies.</Paragraph>
    </Section>
    <Section position="2" start_page="426" end_page="427" type="sub_section">
      <SectionTitle>
3.2 Spanish
</SectionTitle>
      <Paragraph position="0"> Three experiments have been conducted for Spanish.</Paragraph>
      <Paragraph position="1"> In the first experiment, the English trees, generated  from the feature set optimized for English, are applied to the Spanish text (E-E-S). In the second experiment, new Spanish-specific trees are generated from the feature set optimized for English and applied to the Spanish text (S-E-S). The third experiment proceeds like the second, except that minor adjustments and additions are made to the t'eature set with the goal of improving performance (S-S-S).</Paragraph>
      <Paragraph position="2"> The additional resources required for the first Spanish experiment (E-E-S) are a Spanish POS-tagger (Farwell et al., 1994) and also the translated feature set (including POS) optimally derived for English. The second and third Spanish experiments (S-E-S, S-S-S) require in addition pre-tagged Spanish training text using the same tags as for English.</Paragraph>
      <Paragraph position="3"> The obtained Spanish scores as compared to the scores from the initial English experiment (E-E-E) are shown in figure 5.</Paragraph>
      <Paragraph position="4">  The additional Spanish specific features derived for S-S-S are shown in Table 3. Only a few new features added to the core feature set allows for significant pcrfommnce improvement.</Paragraph>
    </Section>
    <Section position="3" start_page="427" end_page="427" type="sub_section">
      <SectionTitle>
3.3 Japanese
</SectionTitle>
      <Paragraph position="0"> The same three experiments conducted lor Spanish are being conducted for Japanese. The first two, E-E-J and J-E-J, have been completed; J-J-J is in progress.</Paragraph>
      <Paragraph position="1"> The additional resources required for the first Japanese experiment (E-E-J) are a Japanese tokenizer and POS-tagger (Matsumoto et al., 1992) and also the translated feature set optimally derived for English. The second and third Japanese experiments (J-E-J, J-J-J) require in addition pre-taggcd Japanese training text using the same tags as for English.</Paragraph>
      <Paragraph position="2"> The obtained Japanese scores as compared to the scores from the initial English experiment (E-E-E) are shown in Figure 6. The weighted averages of the P&amp;R measures across all languages, for companies, persons, locations, and dates, are shown in Figure 7. Table 4 shows comparisons to other work.</Paragraph>
      <Paragraph position="3"> companies persons locations dales</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="427" end_page="428" type="metho">
    <SectionTitle>
I~ E-E'E J
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
class="xml-element"></Paper>