<?xml version="1.0" standalone="yes"?> <Paper uid="E99-1001"> <Title>Named Entity Recognition without Gazetteers</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Named Entity recognition involves processing a text and identifying certain occurrences of words or expressions as belonging to particular categories of Named Entities (NE). NE recognition software serves as an important preprocessing tool for tasks such as information extraction, information retrieval and other text processing applications. What counts as a Named Entity depends on the application that makes use of the annotations. One such application is document retrieval or automated document forwarding: documents annotated with NE information can be searched more accurately than raw text. For example, NE annotation allows you to search for all texts that mention the company &quot;Philip Morris&quot;, ignoring documents about a possibly unrelated person by the same name. Or you can have all documents forwarded to you about a person called &quot;Gates&quot;, without receiving documents about things called gates. In a document collection annotated with Named Entity information you can more easily find documents about Java the programming language without getting documents about Java the country or Java the coffee. [Author footnote: Now also at Harlequin Ltd. (Edinburgh office).]</Paragraph> <Paragraph position="1"> Most common among marked categories are names of people, organisations and locations as well as temporal and numeric expressions. Here is an example of a text marked up with Named Entity information: [...] In an article on the Named Entity recognition competition (part of MUC-6) Sundheim (1995) remarks that &quot;common organization names, first names of people and location names can be handled by recourse to list lookup, although there are drawbacks&quot; (Sundheim 1995: 16). 
In fact, participants in that competition from the University of Durham (Morgan et al., 1995) and from SRA (Krupka, 1995) report that gazetteers did not make much of a difference to their systems. Nevertheless, in a recent article Cucchiarelli et al. (1998) report that one of the bottlenecks in designing NE recognition systems is the limited availability of large gazetteers, particularly gazetteers for different languages (Cucchiarelli et al. 1998: 291). People also use gazetteers of very different sizes. The basic gazetteers in the Isoquest system for MUC-7 contain 110,000 names, but Krupka and Hausman (1998) show that system performance does not degrade much when the gazetteers are reduced to 25,000 and 9,000 names; conversely, they also show that the addition of an extra 42 entries to the gazetteers improves performance dramatically. [Proceedings of EACL '99]</Paragraph> <Paragraph position="2"> This raises several questions: How important are gazetteers? Is it important that they are big? If gazetteers are important but their size isn't, then what are the criteria for building gazetteers? One might think that Named Entity recognition could be done by using lists of (e.g.) names of people, places and organisations, but that is not the case. To begin with, the lists would be huge: it is estimated that there are 1.5 million unique surnames in the U.S. alone. It is not feasible to list all possible surnames in the world in a Named Entity recognition system. There is a similar problem with company names. A list of all current companies worldwide would be huge, if available at all, and would immediately be out of date since new companies are formed all the time. In addition, company names can occur in variations: a list of company names might contain &quot;The Royal Bank of Scotland plc&quot;, but that company might also be referred to as &quot;The Royal Bank of Scotland&quot;, &quot;The Royal&quot; or &quot;The Royal plc&quot;. 
These variations would all have to be listed as well.</Paragraph> <Paragraph position="3"> Even if it were possible to list all possible organisations and locations and people, there would still be the problem of overlaps between the lists.</Paragraph> <Paragraph position="4"> Names such as Emerson or Washington could be names of people as well as places; Philip Morris could be a person or an organisation. In addition, such lists would also contain words like &quot;Hope&quot; and &quot;Lost&quot; (locations) and &quot;Thinking Machines&quot; and &quot;Next&quot; (companies), whereas these words could also occur in contexts where they don't refer to named entities.</Paragraph> <Paragraph position="5"> Moreover, names of companies can be complex entities, consisting of several words. Especially where conjunctions are involved, this can create problems. In &quot;China International Trust and Investment Corp decided to do something&quot;, it's not obvious whether there is a reference here to one company or two. In the sentence &quot;Mason, Daily and Partners lost their court case&quot; it is clear that &quot;Mason, Daily and Partners&quot; is the name of a company. In the sentence &quot;Unfortunately, Daily and Partners lost their court case&quot; the name of the company does not include the word &quot;unfortunately&quot;, but it still includes the word &quot;Daily&quot;, which is just as common a word as &quot;unfortunately&quot;. In this paper we report on a Named Entity recognition system which was amongst the highest scoring in the recent MUC-7 Message Understanding Conference/Competition (MUC). One of the features of our system is that even when it is run without any lists of names of organisations or people it still performs at a level comparable to that of many other MUC systems. 
We report on experiments which show the difference in performance of the NE system when it is run with gazetteers of different sizes for three types of named entities: people, organisations and locations.</Paragraph> </Section> </Paper>