<?xml version="1.0" standalone="yes"?> <Paper uid="E99-1001"> <Title>Named Entity Recognition without Gazetteers</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 The MUC Competition </SectionTitle> <Paragraph position="0"> The MUC competition for which we built our system took place in March 1998. Prior to the competition, participants received a detailed coding manual which specified what should and should not be marked up, and how the markup should proceed. They also received a few hundred articles from the New York Times Service, marked up by the organisers according to the rules of the coding manual.</Paragraph> <Paragraph position="1"> For the competition itself, participants received 100 articles. They then had 5 days to perform the chosen information extraction tasks (in our case: Named Entity recognition) without human intervention, and to mark up the text with the Named Entities found. The resulting marked-up file then had to be returned to the organisers for scoring.</Paragraph> <Paragraph position="2"> Scoring of the results is done automatically by the organisers. The scoring software compares a participant's answer file against a carefully prepared key file; the key file is considered to be the &quot;correctly&quot; annotated file. Amongst many other things, the scoring software calculates a system's recall and precision scores: Recall: Number of correct tags in the answer file over total number of tags in the key file.</Paragraph> <Paragraph position="3"> Precision: Number of correct tags in the answer file over total number of tags in the answer file.</Paragraph> <Paragraph position="4"> Recall and precision are generally accepted ways of measuring system performance in this field. For example, suppose you have a text which is 1000 words long, and 20 of these words express a location. Now imagine a system which assigns the LOCATION tag to every single word in the text.</Paragraph> <Paragraph position="5"> This system will have correctly tagged all 20 locations, since it tagged everything as LOCATION; its recall score is 20/20, or 100%. But of the 1000 LOCATION tags it assigned, only those 20 were correct; its precision is therefore only 20/1000, or 2%.</Paragraph> </Section> <Section position="5" start_page="0" end_page="4" type="metho"> <SectionTitle> 3 Finding Named Entities </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 A simple system </SectionTitle> <Paragraph position="0"> We decided first to test to what extent NE recognition can be carried out merely by recourse to list lookup. Such a system could be domain- and language-independent. It would need no grammars or even information about tokenization, but would simply mark up known strings in the text. Of course, the development and maintenance of the name lists would become more labour-intensive.</Paragraph> <Paragraph position="1"> (Palmer and Day, 1997) evaluated the performance of such a minimal NE recognition system equipped with name lists derived from MUC-6 training texts. The system was tested on news-wire texts for six languages. It achieved a recall rate of about 70% for Chinese, Japanese and Portuguese and about 40% for English and French.</Paragraph> <Paragraph position="2"> The precision of the system was not calculated but can be assumed to be quite high, because it would only be affected by cases where a capitalized word occurs in more than one list (e.g.
&quot;Columbia&quot; could occur in the list of organisations as well as locations) or where a capitalised word occurs in a list but could also be something completely different (e.g. &quot;Columbia&quot; occurs in the list of locations but could also be the name of a space shuttle).</Paragraph> <Paragraph position="3"> We trained a similar minimal system using the MUC-7 training data (200 articles) and ran it on the test data set (100 articles). The corpora we used in our experiments were the training and test corpora for the MUC-7 evaluation.</Paragraph> <Paragraph position="4"> From the training data we collected 1228 person names, 809 names of organizations and 770 names of locations. The resulting name lists were the only resource used by the minimal NE recognition system. It nevertheless achieved relatively high precision (around 90%) and recall in the range of 40-70%. The results are summarised in Figure 1 in the &quot;learned lists&quot; column.</Paragraph> <Paragraph position="5"> Despite its simplicity, this type of system does presuppose the existence of training texts, and these are not always available. To cope with the absence of training material we designed and tested another variation of the minimal system.</Paragraph> <Paragraph position="6"> Instead of collecting lists from training texts, we collected lists of commonly known entities: a list of 5,000 locations (countries and American states with their five biggest cities) from the CIA World Fact Book, a list of 33,000 organization names (companies, banks, associations, universities, etc.) from financial Web sites, and a list of 27,000 famous people from several websites.</Paragraph> <Paragraph position="7"> The results of this run can be seen in Figure 1 in the &quot;common lists&quot; column. In essence, this system's performance was comparable to that of the system using lists from the training set as far as location was concerned; it performed slightly worse on the person category and performed badly on organisations.</Paragraph> <Paragraph position="8"> In a final experiment we combined the two gazetteers, the one induced from the training texts with the one acquired from public resources, and achieved some improvement in recall at the expense of precision. The results of this test run are given in the &quot;combined lists&quot; column in Figure 1. We can conclude that the pure list lookup approach performs reasonably well for locations (precision of 90-94%; recall of 75-85%). For the person category and especially for the organization category this approach does not yield good performance: although the precision was not extremely bad (around 75-85%), recall was too low (lower than 50%)--i.e. every second person or organization name failed to be assigned.</Paragraph>
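For illustration only, a list-lookup tagger of this kind amounts to longest-match lookup of known strings. The sketch below is ours, not the code used in these experiments; the list contents and function names are assumed.

# Sketch of a pure list-lookup NE tagger: at each position, mark the longest
# string found in one of the name lists. No grammar, no contextual evidence.
def build_gazetteer(person_names, org_names, loc_names):
    gaz = {}
    for names, label in ((person_names, 'PERSON'),
                         (org_names, 'ORGANIZATION'),
                         (loc_names, 'LOCATION')):
        for name in names:
            gaz[tuple(name.split())] = label
    return gaz

def tag_by_lookup(tokens, gaz):
    # returns (start, end, label) spans, preferring the longest match
    max_len = max(len(entry) for entry in gaz)
    spans, i = [], 0
    while i != len(tokens):
        match = None
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = gaz.get(tuple(tokens[i:i + n]))
            if label:
                match = (i, i + n, label)
                break
        if match:
            spans.append(match)
            i = match[1]
        else:
            i += 1
    return spans

Scoring the spans such a tagger produces against a key file then reduces to the recall and precision counts defined in Section 2.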
<Paragraph position="9"> For document retrieval purposes low recall is not necessarily a major problem, since it is often sufficient to recognize just one occurrence of each distinctive entity per document, and many of the unassigned person and organization names were just repetitions of their full variants. But for many other applications, and for the MUC competition, higher recall and precision are necessary.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Combining rules and statistics </SectionTitle> <Paragraph position="0"> The system we fielded for MUC-7 makes extensive use of what McDonald (1996) calls internal (phrasal) and external (contextual) evidence in named entity recognition. The basic philosophy underlying our approach is as follows.</Paragraph> <Paragraph position="1"> A string of words like &quot;Adam Kluver&quot; has an internal (phrasal) structure which suggests that this is a person name; but we know that it can also be used as a shortcut for the name of an organization (&quot;Adam Kluver Ltd.&quot;) or a location (&quot;Adam Kluver Country Park&quot;). Looking it up on a list will not necessarily help: the string may not be on a list, may be on more than one list, or may be on the wrong list. However, somewhere in the text, there is likely to be some contextual material which makes it clear what type of named entity it is. Our strategy is to make a decision only once we have identified this bit of contextual information.</Paragraph> <Paragraph position="2"> We further assume that, once we have identified contextual material which makes it clear that &quot;Adam Kluver&quot; is (e.g.) the name of a company, then any other mention of &quot;Adam Kluver&quot; in that document is likely to refer to that company. If the author at some point in the same text also wants to refer to (e.g.) a person called &quot;Adam Kluver&quot;, s/he will provide some extra context to make this clear, and this context will be picked up in the first step. The fact that it is at first only an assumption, rather than a certainty, that &quot;Adam Kluver&quot; is a company is represented explicitly, and later processing components try to resolve the uncertainty. If no suitable context is found anywhere in the text to decide what sort of Named Entity &quot;Adam Kluver&quot; is, the system can check other resources, e.g. a list of known company names, and apply compositional phrasal grammars for different categories. Such grammars can, for instance, state that if a sequence of capitalized words ends with the word &quot;Ltd.&quot; it is the name of an organization, or that if a known first name is followed by an unknown capitalized word this is a person name.</Paragraph> <Paragraph position="3"> In our MUC system, we implemented this approach as a staged combination of a rule-based system with probabilistic partial matching. We describe each stage in turn.</Paragraph> </Section>
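Before the individual stages are described, it may help to see the overall control flow as a skeleton. The function names below are ours and the stage bodies are deliberately left empty; the only point is that all five stages share a document-level record of the entities identified so far.

# Skeleton of the five-stage architecture described in Sections 3.3-3.7.
def apply_sure_fire_rules(doc, lexicon):
    pass   # step 1: rules needing both internal and contextual evidence

def partial_match(doc, lexicon, model):
    pass   # steps 2 and 4: match variants of known entities, consult a maxent model

def apply_relaxed_rules(doc, lexicon, gazetteers):
    pass   # step 3: relaxed contexts, name grammar, gazetteer lookup

def assign_titles(doc, lexicon, model):
    pass   # step 5: mark up entities in all-capital headlines

def recognise_named_entities(doc, gazetteers, model):
    lexicon = {}   # entities identified so far in this document
    apply_sure_fire_rules(doc, lexicon)
    partial_match(doc, lexicon, model)
    apply_relaxed_rules(doc, lexicon, gazetteers)
    partial_match(doc, lexicon, model)
    assign_titles(doc, lexicon, model)
    return doc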
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Step 1. Sure-fire Rules </SectionTitle> <Paragraph position="0"> In the first step, the system applies sure-fire grammar rules. These rules combine internal and external evidence, and only fire when a possible candidate expression is surrounded by a suggestive context. Sure-fire rules rely on known corporate designators (Ltd., Inc., etc.), person titles (Mr., Dr., Sen.), and definite contexts such as those in Figure 2 (in which DD stands for a digit, PROF for a profession, REL for a relative, JJ* for a sequence of zero or more adjectives, and LOC for a known location). The sure-fire rules apply after POS tagging and simple semantic tagging, so at this stage words like &quot;former&quot; have already been identified as JJ (adjective), words like &quot;analyst&quot; have been identified as PROF (profession), and words like &quot;brother&quot; as REL (relative).</Paragraph> <Paragraph position="1"> At this stage our MUC system treats information from the lists as likely rather than definite and always checks whether the context is either suggestive or non-contradictory. For example, a likely company name with a conjunction (e.g. &quot;China International Trust and Investment Corp&quot;) is left untagged at this stage if the company is not listed in a list of known companies. Similarly, the system postpones the markup of unknown organizations whose name starts with a sentence-initial common word, as in &quot;Suspended Ceiling Contractors Ltd denied the charge&quot;.</Paragraph> <Paragraph position="2"> Names of possible locations found in our gazetteer of place names are marked as LOCATION only if they appear in a context that is suggestive of location. &quot;Washington&quot;, for example, can just as easily be a surname or the name of an organization. Only in a suggestive context, like &quot;in Washington&quot;, will it be marked up as a location.</Paragraph> </Section> <Section position="4" start_page="0" end_page="4" type="sub_section"> <SectionTitle> 3.4 Step 2. Partial Match 1 </SectionTitle> <Paragraph position="0"> After the sure-fire symbolic transduction the system performs a probabilistic partial match of the identified entities. First, the system collects all named entities already identified in the document. It then generates all possible partial orders of the composing words, preserving their order, and marks them if found elsewhere in the text. For instance, if &quot;Adam Kluver Ltd&quot; had already been recognised as an organisation by the sure-fire rules, in this second step any occurrences of &quot;Kluver Ltd&quot;, &quot;Adam Ltd&quot; and &quot;Adam Kluver&quot; are also tagged as possible organizations. This assignment, however, is not definite, since some of these words (such as &quot;Adam&quot;) could refer to a different entity. This information goes to a pre-trained maximum entropy model (see Mikheev (1998) for more details on this approach). This model takes into account contextual information for named entities, such as their position in the sentence, whether they exist in lowercase in general, whether they were used in lowercase elsewhere in the same document, etc. These features are passed to the model as attributes of the partially matched words. If the model provides a positive answer for a partial match, the system makes a definite assignment.</Paragraph> </Section> <Section position="5" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 3.5 Step 3. Rule Relaxation </SectionTitle> <Paragraph position="0"> Once this has been done, the system again applies the grammar rules. But this time the rules have much more relaxed contextual constraints and extensively use the information from already existing markup and from the lexicon compiled during processing, e.g. containing partial orders of already identified named entities.</Paragraph>
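To make the notion of &quot;partial orders of the composing words&quot; concrete: for an already identified name, the candidate variants could be enumerated as follows (our own illustrative code, not the paper's implementation).

from itertools import combinations

def name_variants(name):
    # All order-preserving sub-phrases of a recognised name, longest first,
    # excluding the full name itself and the empty phrase.
    words = name.split()
    variants = []
    for n in range(len(words) - 1, 0, -1):
        for combo in combinations(range(len(words)), n):
            variants.append(' '.join(words[i] for i in combo))
    return variants

# name_variants('Adam Kluver Ltd') returns, in order:
# 'Adam Kluver', 'Adam Ltd', 'Kluver Ltd', 'Adam', 'Kluver', 'Ltd'

In Step 2, such variants are only marked where they actually occur elsewhere in the document, and the final decision is left to the maximum entropy model.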
<Paragraph position="1"> At this stage the system will mark word sequences which look like person names. For this it uses a grammar of names: if the first capitalized word occurs in a list of first names and the following word(s) are unknown capitalized words, then this string can be tagged as a PERSON. Note that it is only at this late stage that a list of names is used. At this point we are no longer concerned that a person name can refer to a company. If the name grammar had applied earlier in the process, it might erroneously have tagged &quot;Adam Kluver&quot; as a PERSON instead of an ORGANIZATION. But at this point in the chain of NE processing, that is not a problem anymore: &quot;Adam Kluver&quot; will by now already have been identified as an ORGANIZATION by the sure-fire rules or during partial matching.</Paragraph> <Paragraph position="2"> If it hasn't, then it is likely to be the name of a person.</Paragraph> <Paragraph position="3"> At this stage the system will also attempt to resolve conjunction problems in names of organisations. For example, in &quot;China International Trust and Investment Corp&quot;, the system checks whether possible parts of the conjunction were used in the text on their own and are thus names of different organizations; if not, the system has no reason to assume that more than one company is being talked about.</Paragraph> <Paragraph position="4"> In a similar vein, the system resolves the attachment of sentence-initial capitalized modifiers, the problem alluded to above with the &quot;Suspended Ceiling Contractors Ltd&quot; example: if the modifier was seen with the organization name elsewhere in the text, then the system has good evidence that the modifier is part of the company name; if the modifier does not occur anywhere else in the text with the company name, it is assumed not to be part of it.</Paragraph> <Paragraph position="5"> This strategy is also used for expressions like &quot;Murdoch's News Corp&quot;. The genitival &quot;Murdoch's&quot; could be part of the name of the organisation, or could be a possessive. Further inspection of the text reveals that Rupert Murdoch is referred to in contexts which support a person interpretation; and &quot;News Corp&quot; occurs on its own, without the genitive. On the basis of evidence like this, the system decides that the name of the organisation is &quot;News Corp&quot;, and that &quot;Murdoch&quot; should be tagged separately as a person.</Paragraph> <Paragraph position="6"> At this stage known organizations and locations from the lists available to the system are marked in the text, again without checking the context in which they occur.</Paragraph> </Section> <Section position="6" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 3.6 Step 4. Partial Match 2 </SectionTitle> <Paragraph position="0"> At this point, the system has exhausted its resources (rules about internal and external evidence for named entities, as well as its gazetteers). The system then performs another partial match to annotate names like &quot;White&quot; when &quot;James White&quot; had already been recognised as a person, and to annotate company names like &quot;Hughes&quot; when &quot;Hughes Communications Ltd.&quot; had already been identified as an organisation.</Paragraph> <Paragraph position="1"> As in Partial Match 1, this process of partial matching is again followed by a probabilistic assignment supported by the maximum entropy model.
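The contextual attributes passed to that model (Section 3.4) could be collected roughly as follows; the feature names and the function signature are our own assumptions, not the published implementation.

def partial_match_features(candidate, sentence_tokens, document_tokens, common_words):
    # Attributes of a partially matched name of the kind mentioned above:
    # its position in the sentence, whether its words ever occur in lowercase
    # in general (common_words) or elsewhere in this document.
    words = candidate.split()
    lowercased = [w.lower() for w in words]
    return {
        'sentence_initial': sentence_tokens[:len(words)] == words,
        'exists_in_lowercase': any(w in common_words for w in lowercased),
        'lowercase_in_document': any(w in document_tokens for w in lowercased),
        'length_in_words': len(words),
    }

# A pre-trained maximum entropy model scores these attributes, and the system
# makes the assignment definite only on a positive answer.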
Conjunction resolution, for example, makes use of the fact that in this type of text it is more common to have conjunctions of like entities.</Paragraph> <Paragraph position="2"> In &quot;he works for Xxx and Yyy&quot;, if there is evidence that Xxx and Yyy are two entities rather than one, then it is more likely that Xxx and Yyy are two entities of the same type, i.e. both organisations or both people, rather than a mix of the two.</Paragraph> <Paragraph position="3"> This means that, even if only one of the entities in the conjunction has been recognised as definitely of a certain type, the conjunction rule will help decide on the type of the other entity. One of the texts in the competition contained the string &quot;UTited States and Russia&quot;. Because of the typo in &quot;UTited States&quot;, it wasn't found in a gazetteer. But there was internal evidence that it could be a location (the word &quot;States&quot;); and there was external evidence that it could be a location (the fact that it occurred in a conjunction with &quot;Russia&quot;, a known location). These two facts in combination meant that the system correctly identified &quot;UTited States&quot; as a location.</Paragraph> </Section> <Section position="7" start_page="4" end_page="4" type="sub_section"> <SectionTitle> 3.7 Step 5. Title Assignment </SectionTitle> <Paragraph position="0"> Because titles of news wires are in capital letters, they provide little guidance for the recognition of names. In the final stage of NE processing, entities in the title are marked up by matching or partially matching the entities found in the text, and checking against a maximum entropy model trained on document titles. For example, in &quot;GENERAL TRENDS ANALYST PREDICTS LITTLE SPRING EXPLOSION&quot;, &quot;GENERAL TRENDS&quot; will be tagged as an organization because it partially matches &quot;General Trends Inc&quot; elsewhere in the text, and &quot;LITTLE SPRING&quot; will be tagged as a location because elsewhere in the text there is supporting evidence for this hypothesis. In the headline &quot;MURDOCH SATELLITE EXPLODES ON TAKE-OFF&quot;, &quot;Murdoch&quot; is correctly identified as a person because of mentions of Rupert Murdoch elsewhere in the text. Applying a name grammar to this kind of headline without checking external evidence might result in erroneously tagging &quot;MURDOCH SATELLITE&quot; as a person (because &quot;Murdoch&quot; is also a first name, and &quot;Satellite&quot; in this headline starts with a capital letter).</Paragraph> </Section> </Section> <Section position="7" start_page="4" end_page="4" type="metho"> <SectionTitle> 4 MUC results </SectionTitle> <Paragraph position="0"> In the MUC competition, our system's combined precision and recall score was 93.39%. This was the highest score, better in a statistically significant way than the score of the next best system.</Paragraph> <Paragraph position="1"> Scores varied from 93.39% to 69.67%. Further details on this can be found in (Mikheev et al., 1998).</Paragraph>
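The &quot;combined precision and recall score&quot; used for the official ranking is the balanced F-measure, which weights recall and precision equally. A small illustration follows; the numbers in the example are made up, not the system's actual scores.

def f_measure(recall, precision, beta=1.0):
    # van Rijsbergen F-measure; beta=1.0 weights recall and precision equally.
    b2 = beta * beta
    return (b2 + 1.0) * precision * recall / (b2 * precision + recall)

# f_measure(0.90, 0.95) is roughly 0.92, i.e. a combined score of about 92%.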
<Paragraph position="2"> The table in Figure 3 shows the progress of the performance of the system we fielded for the MUC competition through the five stages.</Paragraph> <Paragraph position="3"> As one would expect, the sure-fire rules give very high precision (around 96-98%), but very low recall--in other words, they don't find many named entities, but the ones they find are correct.</Paragraph> <Paragraph position="4"> Subsequent phases of processing gradually add more and more named entities (recall increases from around 40% to around 90%), but on occasion introduce errors (resulting in a slight drop in precision). Our final score for ORGANISATION, PERSON and LOCATION is given in the bottom line of Figure 3.</Paragraph> </Section> <Section position="8" start_page="4" end_page="4" type="metho"> <SectionTitle> 5 The role of gazetteers </SectionTitle> <Paragraph position="0"> The system we fielded for the MUC competition made extensive use of gazetteers, containing around 4,900 names of countries and other place names, some 30,000 names of companies and other organisations, and around 10,000 first names of people. As explained in the previous section, these lists were used in a judicious way, taking into account other internal and external evidence before making a decision about a named entity. Only in step 3 is information from the gazetteers used without context-checking.</Paragraph> <Paragraph position="1"> It is not immediately obvious from Figure 3 what exactly the impact of these gazetteers is. To try and answer this question, we ran our system over 70 articles of the MUC competition in different modes; the remaining 30 articles were used to compile a limited gazetteer as described below and after that played no role in the experiments.</Paragraph> <Paragraph position="2"> Full gazetteers. We first ran the system again with the full gazetteers, i.e. the gazetteers used in the official MUC system. There are minor differences in Recall and Precision compared to the official MUC results, due to the fact that we were using a slightly different (smaller) corpus.</Paragraph> <Paragraph position="3"> No gazetteers. We then ran the system without any gazetteers. In this mode, the system can still use internal evidence (e.g. indicators such as &quot;Mr&quot; for people or &quot;Ltd&quot; for organisations) as well as external evidence (contexts such as &quot;XXX, the chairman of YYY&quot; as evidence that XXX is a person and YYY an organisation).</Paragraph> <Paragraph position="4"> The hypothesis was that names of organisations and names of people should still be handled relatively well by the system, since they have much internal and external evidence, whereas names of locations have fewer reliable contextual clues. For example, an expression such as &quot;XXX is based in YYY&quot; is not sure-fire evidence that YYY is a location - it could also be an organisation. And since many locations are so well-known, they receive very little extra context (&quot;in China&quot;, &quot;in Paris&quot;, vs &quot;in the small town of Ekeren&quot;). Some locations. We then ran the system with some locational information: about 200 names of countries and continents from www.yahoo.com/Regional/ and, because the MUC rules say explicitly that names of planets should be marked up as locations, the names of the 8 planets of our solar system.
The hypothesis was that even with those reasonably common location names, Named Entity recognition would already improve dramatically. This hypothesis was confirmed, as can be seen in Figure 4.</Paragraph> <Paragraph position="5"> Inspection of the errors confirms that the system makes most mistakes when there is no internal or external evidence to decide what sort of Named Entity is involved. For example, in a reference to &quot;a Hamburg hospital&quot;, &quot;Hamburg&quot; no longer gets marked up as a location, because the word occurs nowhere else in the text, and that context is not sufficient to assume it indicates a location (cf. a Community Hospital, a Catholic Hospital, an NHS Hospital, a Trust-Controlled Hospital, etc). Similarly, in a reference to &quot;the Bonn government&quot;, &quot;Bonn&quot; is no longer marked up as a location, because of lack of supportive context (cf. the Clinton government, the Labour government, etc).</Paragraph> <Paragraph position="6"> And in financial newspaper articles NYSE will be used without any indication that this is an organisation (the New York Stock Exchange).</Paragraph> <Paragraph position="7"> Limited gazetteers. The results so far suggest that the most useful gazetteers are those that contain very common names, names which the authors can expect their audience already to know about, rather than far-fetched examples of little-known places or organisations.</Paragraph> <Paragraph position="8"> This suggests that it should be possible to tune a system to the kinds of Named Entities that occur in its particular genre of text. To test this hypothesis, we wanted to know how the system would perform if it started with no gazetteers, started processing texts, built up gazetteers as it went along, and then used these gazetteers on a new set of texts in the same domain. We simulated these conditions by taking 30 of the 100 official MUC articles, extracting all the names of people, organisations and locations, and using these as the only gazetteers, thereby ensuring that we had extracted Named Entities from articles in the same domain as the test domain.</Paragraph> <Paragraph position="9"> Since we wanted to test how easy it was to build gazetteers automatically, we wanted to minimise the amount of processing done on Named Entities already found. We decided to use only the first names of people, and marked them all as &quot;likely&quot; first names: the fact that &quot;Bill&quot; actually occurs as a first name does not guarantee it will definitely be a first name next time you see it. Company names found in the 30 articles were put in the company gazetteer, irrespective of whether or not they were full company names (e.g. &quot;MCI Communications Corp&quot; as well as &quot;MCI&quot; and &quot;MCI Communications&quot;). Names of locations found in the 30 texts were simply added to the list of 200 location names already used in the previous experiments.</Paragraph> <Paragraph position="10"> The hope was that, despite the little effort involved in building these limited gazetteers, the performance of the Named Entity recognition system would improve.</Paragraph> <Paragraph position="11"> Figure 4 summarises the Precision and Recall results for each of these modes and confirms the hypotheses.</Paragraph> </Section> </Paper>