<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0901">
  <Title>Reuse of a Proper Noun Recognition System in Commercial and Operational NLP Applications</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Description of NameTag
</SectionTitle>
    <Paragraph position="0"> NameTag is a multilingual name recognition system.</Paragraph>
    <Paragraph position="1"> It finds and disambiguates the names of people, organizations, and places in texts, as well as time and numeric expressions, with very high accuracy. The design of the system makes possible the dynamic recognition of names: NameTag does not rely on long lists of known names. Instead, NameTag makes use of a flexible pattern specification language to identify novel names that have not been encountered previously. In addition, NameTag can recognize and link variants of names in the same document automatically. For instance, it can link &quot;IBM&quot; to &quot;International Business Machines&quot; and &quot;President Clinton&quot; to &quot;Bill Clinton.&quot; NameTag incorporates a language-independent C++ pattern-matching engine along with the language-specific lexicons, patterns, and other resources necessary for each language. In addition, the Japanese, Chinese, and Thai versions integrate word segmenters to deal with the orthographic challenges of these languages. (NameTag currently has these language versions available plus ones for English, Spanish, and French.) NameTag is an extremely fast and robust system that can be easily integrated with other applications through its API. In our experience, NameTag has lent itself to so many successful integrations in diverse applications not just because of its accuracy, but also because of its speed. (Its NT version is currently benchmarked at 300 megabytes/hour on a Pentium Pro.) It is an attractive package to embed in an application, as it does not significantly slow the application's performance.</Paragraph>
    <Paragraph position="2"> In the following discussion, we refer to various versions of NameTag, most prominently systems for English and Japanese. Their extraction accuracy varies. For example, in the Sixth Message Understanding Conference (MUC-6), the English system was benchmarked against the Wall Street Journal blind test set for the name tagging task, and achieved a 96% F-measure, which is a combination of recall and precision measures. Our internal testing of the Japanese system against blind test sets of various Japanese newspaper articles indicates that it achieves accuracy in the high-80% to low-90% range, depending on the types of corpora. Indexing names in Japanese texts is usually more challenging than in English texts for two main reasons. First, there is no case distinction in Japanese, whereas English names in newspapers are capitalized, and capitalization is a very strong clue for English name tagging. Second, Japanese words are not separated by spaces and therefore must be segmented into separate words before the name tagging process. As segmentation is not 100% accurate, segmentation errors can sometimes cause name tagging rules not to fire or to misfire.</Paragraph>
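    <Paragraph> The way an F-measure combines recall and precision can be illustrated with a minimal sketch. It assumes the common definition of the F-measure as the weighted harmonic mean of precision and recall, with beta = 1 weighting the two equally. The function names and the sample tags are hypothetical illustrations; they are not part of NameTag or of the MUC-6 scoring software.</Paragraph>
    <Paragraph><![CDATA[
```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (beta=1 gives balanced F)."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0  # no correct answers at all
    return (1 + beta ** 2) * precision * recall / denominator


def score_tags(predicted: set, gold: set) -> tuple:
    """Compare predicted (name, type) pairs with a gold-standard answer key."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, f_measure(precision, recall)


# Hypothetical system output and answer key for one document.
predicted = {("IBM", "ORG"), ("Clinton", "PERSON"), ("Bill Folder", "PERSON")}
gold = {("IBM", "ORG"), ("Clinton", "PERSON"), ("Washington", "PLACE")}

precision, recall, f1 = score_tags(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} F={f1:.2f}")
```
]]></Paragraph>
    <Paragraph> Because the harmonic mean is dominated by the smaller of its two inputs, a system cannot reach a high F-measure by maximizing recall alone or precision alone.</Paragraph>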
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Proper Name Recognition
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Integrated With a Browsing &amp;
Retrieval System
</SectionTitle>
      <Paragraph position="0"> We have recently developed a system incorporating NameTag that allows monolingual users to access information on the World Wide Web in languages that they do not know (Aone, Charocopos, and Gorlinsky, 1997). For example, previously it was not easy for a monolingual English speaker to locate necessary information written in Japanese. The user would not know the query terms in Japanese even if the search engine accepted Japanese queries. In addition, even when the users located a possibly relevant text in Japanese, they would have little idea about what was in the text. The output of off-the-shelf machine translation (MT) systems is often of low quality, and even &quot;high-end&quot; MT systems have problems, particularly in translating proper names and specialized domain terms, which often contain the information most critical to the users.</Paragraph>
      <Paragraph position="1"> Now these users have available our multilingual (or cross-linguistic) information browsing and retrieval system, which is aimed at monolingual users who are interested in information from multiple language sources. The system takes advantage of name-recognition software as embodied in NameTag to improve the accuracy of cross-linguistic retrieval and to provide innovative methods to browse and explore multilingual document collections. The system indexes texts in different languages (currently English and Japanese) and allows the users to retrieve relevant texts in their native language (currently English). The retrieved text is then presented to the users with proper names and specialized domain terms translated and hyperlinked. Among the innovations in our system is the stress placed upon proper names and their role as indices for document content.</Paragraph>
      <Paragraph position="2"> The system consists of an Indexing Module, a Client Module, and a Term Translation Module.</Paragraph>
      <Paragraph position="3"> The Indexing Module creates and inserts indices into a database while the Client Module allows browsing and retrieval of information in the database through a Web-browser-based graphical user interface (GUI). The Term Translation Module dynamically translates English user queries into Japanese and the indexed terms in retrieved Japanese documents into English.</Paragraph>
      <Paragraph position="4"> The Indexing Module For the present application, the system indexes names of people, entities, and locations, as well as scientific and technical (S&amp;T) terms in both English and Japanese texts, and allows the user to query and browse the indexed database in English.</Paragraph>
      <Paragraph position="5"> As NameTag processes texts, the indexed terms are stored in a relational database with their semantic type information (person, entity, place, S&amp;T term) and alias information along with such meta data as source, date, language, and frequency information.</Paragraph>
      <Paragraph position="6"> The Client Module The Client Module lets the user both retrieve and browse information in the database through the Web-browser-based GUI. In the query mode, a form-based Boolean query issued by a user is automatically translated into an SQL query, and the English terms in the query are sent to the Term Translation Module. The Client Module then retrieves documents which match either the original English query or the translated .Japanese query. As the indices are names and terms which may consist of multiple words (e.g., &amp;quot;Warren Christopher,&amp;quot; &amp;quot;memory chip&amp;quot;), the query terms are delimited in separate boxes in the form, making sure no ambiguity occurs in both translation and retrieval.</Paragraph>
      <Paragraph position="7"> In its browsing mode, the Client Module allows the user to browse the information in the database in various ways. For example, once the user selects a particular document for viewing, the client sends it to an appropriate (i.e., English or Japanese) indexing server for creating hyperlinks for the indexed terms, and, in the case of a Japanese document, sends the indexed terms to the Term Translation Module to translate the Japanese terms into English.</Paragraph>
      <Paragraph position="8"> The result that the user browses is a document in which each indexed term is hyperlinked to other documents containing the same indexed term. Since hyperlinking is based on the original or translated English terms, the monolingual English speaker can follow the links to both English and Japanese documents transparently. In addition, the Client Module is integrated with a commercial MT system for a rough translation of the whole text.</Paragraph>
      <Paragraph position="10"> Module bi-directionally in two different modes.</Paragraph>
      <Paragraph position="11"> That is, it translates English query terms into Japanese in the query mode and, in reverse, translates Japanese indexed terms into English for viewing of a retrieved Japanese text in the browsing mode.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Issues Concerning Proper Name Recognition for Browsing and Retrieval
</SectionTitle>
      <Paragraph position="0"> Based on the system description in the preceding sections, we describe below in more detail the impact of name recognition on multilingual browsing and retrieval.</Paragraph>
      <Paragraph position="1"> To index, the system uses two different configurations of NameTag for English and Japanese. Indexing of names is particularly significant in the Japanese case, where the accuracy of indexing depends on the accuracy of segmentation of a sentence. In English, since words are separated by spaces, there is no issue of indexing accuracy for individual words. In languages such as Japanese, however, where word boundaries are not explicitly marked by spaces, word segmentation is necessary to index terms. Most segmentation algorithms are more likely to make errors on names, as these are less likely to be in the lexicons. Thus, use of name indexing can improve overall segmentation and indexing accuracy.</Paragraph>
      <Paragraph position="2"> As described above, the Indexing Module not only identifies names of people, entities, and locations, but also disambiguates types among themselves and between names and non-names. Thus, if the user is searching for documents with the location &quot;Washington&quot; (not a person or company named &quot;Washington&quot;) or a person &quot;Clinton&quot; (not a location), the system allows the user to specify, through the GUI, the type of each query term. This ability to disambiguate the types of query terms not only constrains the search and hence improves retrieval precision, but also shortens the search time considerably, especially when the database is very large. In developing this system, we have intentionally avoided an approach where we first translate foreign-language documents into English and index the translated English texts (Fluhr, 1995; Kay, 1995; Oard and Dorr, 1996). In (Aone et al., 1994), we have shown that, in an application of extracting information from foreign language texts and presenting the results in English, the &quot;MT first, Information Extraction second&quot; approach was less accurate than the approach in the reverse order, i.e., &quot;Information Extraction first, MT second.&quot; In particular, translation quality of names by even the best MT systems was poor. In an indexing and retrieval application such as the one under discussion, the proper identification and translation of names are critical.</Paragraph>
      <Paragraph position="3"> There are two cases where an MT system fails to translate names. First, it fails to recognize where a name starts and ends in a text string. This is a non-trivial problem in languages such as Japanese where words are not segmented by spaces and there is no capitalization convention. Often, an MT system &quot;chops up&quot; names into words and translates each word individually. For example, among the errors we have encountered, an MT system failed to recognize a person name &quot;Mori Hanae&quot; in kanji characters, segmented it into three words &quot;mori,&quot; &quot;hana,&quot; and &quot;e&quot; and translated them into &quot;forest,&quot; &quot;England,&quot; and &quot;blessing,&quot; respectively. Another common MT system error is the failure to distinguish between names and non-names. This distinction is very important in getting correct translations, as names are usually translated very differently from non-names. For example, a person named &quot;Dole&quot; in katakana was translated into a common noun &quot;doll.&quot; Abbreviated country names for Japan and the United States in single kanji characters, which often occur in newspapers, were sometimes translated by an MT system into their literal kanji meanings, &quot;day&quot; and &quot;rice,&quot; respectively.</Paragraph>
      <Paragraph position="4"> The proper name recognition capability provided by NameTag solves both of these problems.</Paragraph>
      <Paragraph position="5"> NameTag's ability to identify names prevents chopping names into pieces. NameTag's ability to assign semantic types to names makes possible greater precision in translating names.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Proper Name Recognition Integrated with Text Clustering
</SectionTitle>
    <Paragraph position="0"> Multimedia Fusion (MMF) is a system SRA developed to help people deal with large incoming streams of multimedia data (Aone, Bennett, and Gorlinsky, 1996). MMF clusters texts automatically into a hierarchical concept tree, and, unlike a typical message routing system, it does not require the users to specify beforehand which topics the incoming texts cluster into. It employs Cobweb-based conceptual clustering (Fisher, 1987), with the feature vectors required for that algorithm supplied by keywords picked from the body of a text based upon their worth as determined by the Inverse Document Frequency (IDF) metric (Church and Gale, 1995). In addition, NameTag is run over the incoming texts (CNN closed-captions and ClariNet news feeds) to identify the proper names in the document (persons, companies, locations).</Paragraph>
    <Paragraph position="1"> One of the novel features in this system was the important role of proper name recognition. It is important to recognize that using white-space-delimited tokens in a text as keywords provides significantly less information than is actually available. The proper name information (for persons, organizations, and locations) adds considerable information to what otherwise would be a meaningless string of tokens.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Issues Concerning Proper Name
Recognition for Text Clustering
</SectionTitle>
      <Paragraph position="0"> Proper names are natural keywords to characterize the contents of text clusters. First, without proper name recognition, &quot;International Business Machines&quot; is just a string containing three common nouns that may or may not be informative keywords. Recognizing it as a proper name enlarges the set of possible keywords for the document. Second, proper name recognition allows the disambiguation of names from non-names, such as &quot;white&quot; in &quot;white shirt&quot; vs. &quot;Bob White,&quot; which enhances the accuracy of keyword selection.</Paragraph>
      <Paragraph position="1"> The alias forms generated by NameTag are also of great value to IDF calculations, since we can select one of the name forms of a particular entity occurring within a document (&quot;President Clinton,&quot; &quot;Bill Clinton,&quot; &quot;Clinton&quot;) to be the canonical form for all the name forms. This reduces the chances that alternate forms of the same name will be used as distinct keywords for the same document.</Paragraph>
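      <Paragraph> The effect of alias merging on IDF can be sketched as follows. This is an illustration under stated assumptions, not the MMF implementation: IDF is taken to be log(N / df), a common definition; the longest variant in each alias group is arbitrarily chosen as the canonical form; a single alias list is applied to all documents for brevity, whereas the aliases discussed above are linked within one document; and all function names and sample keywords are hypothetical.</Paragraph>
      <Paragraph><![CDATA[
```python
import math
from collections import Counter


def canonicalize(names, alias_groups):
    """Replace each name variant with the canonical form of its alias group."""
    canonical = {}
    for group in alias_groups:
        head = max(group, key=len)  # assumption: the longest variant is canonical
        for variant in group:
            canonical[variant] = head
    return [canonical.get(name, name) for name in names]


def inverse_document_frequency(documents):
    """Compute log(N / df) per keyword, where df counts documents, not tokens."""
    n_docs = len(documents)
    doc_freq = Counter()
    for keywords in documents:
        doc_freq.update(set(keywords))  # each document counts a keyword once
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


# Hypothetical alias group and keyword lists for three documents.
ALIASES = [["President Clinton", "Bill Clinton", "Clinton"]]
docs = [
    ["President Clinton", "Clinton", "budget"],
    ["Bill Clinton", "summit"],
    ["budget", "deficit"],
]

merged = [canonicalize(doc, ALIASES) for doc in docs]
idf_raw = inverse_document_frequency(docs)
idf_merged = inverse_document_frequency(merged)
print(sorted(idf_raw))
print(sorted(idf_merged))
```
]]></Paragraph>
      <Paragraph> Without merging, the three name forms are counted as three unrelated keywords, each appearing in a single document; after merging, one keyword with a document frequency of two represents the entity.</Paragraph>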
      <Paragraph position="2"> As discussed in (Aone, Bennett, and Gorlinsky, 1996), we quantitatively evaluated the accuracy of clustering, and the use of proper name recognition raised the F-measure (a combined measure of recall and precision) from 50% to 61% when clustering the ClariNet news feed.</Paragraph>
      <Paragraph position="3"> The name recognition technology embodied in NameTag had to be customized for MMF, particularly for the closed-caption texts from CNN Headline News. It had to handle all-upper-case closed-caption texts, which pose some challenges due to the absence of case information. In general, lexical information has to be available for name recognition in upper-case text, which tends to slow system performance. In addition, since the closed captions are transcriptions of speech by anchor persons or reporters, characteristics of spoken language had to be accommodated (e.g., &quot;OK&quot; is Oklahoma in a text while it is an answer in a caption).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Recognition
</SectionTitle>
      <Paragraph position="0"> The proper name recognition used in MMF has to be highly accurate. In name recognition, as in other areas of language technology, there is a trade-off between recall and precision. In applications such as text clustering, high precision is preferred over high recall so that the system does not introduce errors in keyword selection. That is, not recognizing &quot;BILL CLINTON&quot; is more acceptable than mis-recognizing &quot;BILL FOLDER&quot; as a person name.</Paragraph>
      <Paragraph position="1"> To handle this, NameTag provides three settings, depending on what kind of application it is being used for: &quot;High Precision,&quot; &quot;High Accuracy,&quot; and &quot;Normal.&quot; The first setting ensures that all names identified are correct, even at the expense of some possibly missed ones. The second setting focuses on identifying all possible names within a document, even at the cost of some false positives. The third setting is a balanced one, aiming at achieving the highest possible combined scores. For MMF, we used the &quot;High Precision&quot; setting to optimize the Cobweb clustering performance.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="4" type="metho">
    <SectionTitle>
5 Proper Name Recognition Integrated with Manual Text Indexing
</SectionTitle>
    <Paragraph position="0"> SRA recently developed an operational indexing system, the Human Indexing Assistant (HIA), which assists the human indexing of an incoming flow of documents. The task involves human indexers who process a large incoming stream of documents and fill out a template with names of products and equipment, as well as companies and individuals involved in the manufacture of those items. Integral to it is the use of NameTag.</Paragraph>
    <Paragraph position="1"> HIA's GUI presents the user with three screens, the first containing the original document to be indexed, the second the template to be filled out, and the third used for iconic representations of the indexed material. This third screen serves as a working area where the user, having filled out a template for a name to be indexed, can iconify it, place it in the third screen and make links between it and other iconified objects such as company names. As part of the indexing process, NameTag is first launched from the indexing interface, and it highlights in the first screen the proper names in the document to be indexed. The indexer can then select what they think is appropriate to index and paste the names into templates in the second screen.</Paragraph>
    <Section position="1" start_page="0" end_page="4" type="sub_section">
      <SectionTitle>
5.1 Issues Concerning Proper Name Recognition for Manual Text Indexing
</SectionTitle>
      <Paragraph position="0"> For this application, we used the &quot;high recall&quot; setting of NameTag, as discussed in Section 4.1.3. It was important that as many potential names as possible be identified. It is a part of the indexers' job that no names be missed during indexing. Inserting NameTag into the process required that it gain the indexers' confidence that it could indeed hit all possible names. The indexers were not particularly concerned with misidentification of names, as these could be quickly passed over by the human (e.g., &quot;BILL FOLDER&quot; as a personal name in all-upper-case text). Once the users had confidence in NameTag, it was possible for them to stop reading documents in toto, thus producing great increases in the throughput of the operation.</Paragraph>
      <Paragraph position="1"> As a side issue in this sort of application, it should be pointed out that information on the quality of indexing done without automated or semi-automated help is rarely available or reliable. The users impose requirements that they themselves may not meet consistently (&quot;whatever you do, your system can't miss any names&quot;), and the developers must work within what may be a more or less fictional framework. However, dealing with issues of this kind is critical to success. Successful insertion of HIA into the workplace involved getting the indexers to buy into its value for them. Ultimately, the deciding factor was the clearly faster rate of indexing with the system than with the line-editing-oriented, totally manual system being replaced.</Paragraph>
    </Section>
  </Section>
</Paper>