<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0101">
  <Title>Experiments with geographic knowledge for information extraction Dimitar Manov,</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 The KIM platform
</SectionTitle>
    <Paragraph position="0"> The KIM Platform provides a novel Knowledge and Information Management (KIM) infrastructure and services for automatic semantic annotation, indexing and retrieval of unstructured and semi-structured content. The ontologies and knowledge bases are kept in semantic repositories based on cutting-edge Semantic Web technology and standards, including RDF(S) repositories, ontology middleware (Kiryakov et al., 2002) and reasoning. It provides a mature infrastructure for scalable and customizable information extraction as well as annotation and document management, based on GATE (Cunningham et al., 2002). GATE, a General Architecture for Text Engineering, is developed by the Sheffield NLP group and has been used in many language processing projects, in particular for Information Extraction in a variety of languages (Maynard and Cunningham, 2003).</Paragraph>
    <Paragraph position="1"> An essential idea in KIM is semantic (or entity) annotation, depicted in figure 1. It can be seen as a classical named-entity recognition and annotation process.</Paragraph>
    <Paragraph position="2"> However, in contrast to most existing IE systems, KIM provides for each entity reference in the text (i) a pointer (URI) to the most specific class in the ontology and (ii) a pointer to the specific instance in the knowledge base. The latter is (to the best of our knowledge) a unique KIM feature, which allows further indexing and retrieval of documents with respect to entities.</Paragraph>
    <Paragraph position="3"> For the end user, using a KIM-based application is straightforward: one can highlight text in the browser and further explore the available knowledge for the entity, as shown in figure 3. A semantic query web user interface allows for queries such as &quot;Organization- [...] Information retrieval functionality is available, based on Lucene, which is adapted to measure relevance to entities instead of tokens and stems. The full architecture is shown in figure 2. It is important to note that KIM as a software platform is domain- and task-independent.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 The ontology
</SectionTitle>
      <Paragraph position="0"> The KIM Ontology (KIMO) covers the 250 most general classes of entities and 40 relations. The main classes are Entity, EntitySource and LexicalResource. The most important class in the ontology is Entity, which is further specialized into Object, Abstract and Happening. The LexicalResource class and its subclasses hold various kinds of IE-related information. The instances of the Alias class represent the different names of instances of Entity. The hasAlias relation links an Entity to its aliases (a one-to-many relation), and the hasMainAlias relation links it to its main alias (the official name). Each instance of Entity is linked to an instance of EntitySource via the generatedBy relation. There are two types of EntitySource: Trusted and Recognized.</Paragraph>
      <Paragraph position="1"> The &quot;trusted&quot; entities are the pre-defined ones. The &quot;recognized&quot; entities are those recognized in text as part of the IE tasks.</Paragraph>
      <Paragraph position="2"> The upper part of the ontology can be seen in the left frame of figure 3.</Paragraph>
      <Paragraph position="3"> For ontology representation we chose RDF(S), mainly because it allows easy extension to OWL (Lite).</Paragraph>
      <Paragraph position="4"> Location sub-ontology. Because geographic features (Locations) form a large part of the entities of general importance, we developed a Location sub-ontology as part of the KIM ontology. The goal was to include the most important and frequently used types of Locations (which are specializations of Entity), including relations between them (such as hasCapital and subRegionOf, which is more specific than part-of), relations between Locations and other Entities (Organization locatedIn Location) and various attributes. The Location entity denotes an area in 3D space (see footnote 7), which includes geographic entities with physical boundaries, such as geographical areas and landmasses, bodies of water and geological formations, as well as politically defined areas (e.g. &quot;U.S. Administered areas&quot;). The classification hierarchy (consisting of 97 classes) is based on the ADL Feature Type Thesaurus version 070203. The differences target simplicity: a number of distinctions and unnecessary levels of abstraction were removed where they are irrelevant to a general (non-geographic) context, as we wanted the ontology to be easy for an average user to understand. Examples of omitted sub-classes: Territorial waters, Tribal areas, Administrative Areas (its sub-types are put directly under Location).</Paragraph>
      <Paragraph position="5"> The Location ontology provides the following additional information: * the exact type of a feature, for example to be able to recognize a geographic feature as a CountryCapital instead of just a Location.</Paragraph>
      <Paragraph position="6"> * relations between geographic features and other entities (e.g. &quot;Diego Garcia&quot; is a MilitaryBase, located somewhere in the Indian Ocean, and it is subRegionOf USA).</Paragraph>
      <Paragraph position="7"> * the different names of a location (&quot;Peking&quot; and &quot;Beijing&quot; are two aliases for one location).</Paragraph>
      <Paragraph position="8"> * the transitive subRegionOf relation allows one to search for Entities located in a continent (e.g. &quot;Morgan Stanley&quot; - locatedIn - &quot;New York&quot; - subRegionOf - &quot;NY&quot; - subRegionOf - &quot;USA&quot; - subRegionOf - &quot;North America&quot;). * the distinction between &quot;trusted&quot; and &quot;recognized&quot; sources in the generatedBy property of a Location is an extra hint in disambiguation tasks. The class hierarchy is shown in figure 5.</Paragraph>
      <Paragraph position="9"> Footnote 7: Actually, the instances of Location are Entities with spatial identity criteria (Guarino and Welty, 2000). For instance, a building can be considered as Property, Location or Cultural Artifact, but the focus in the ontology is placed on the Location</Paragraph>
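The transitive subRegionOf relation described above can be illustrated with a small sketch. It is only an illustration in Python and not KIM's RDF(S)-based implementation: the dictionary layout and the function names `ancestors` and `located_within` are assumptions, while the example facts are the ones given in the text.

```python
# Illustrative sketch only: KIM stores these facts in an RDF(S) repository.
# The dictionaries and function names below are hypothetical.

# subRegionOf facts from the example in the text (child -> parent).
SUB_REGION_OF = {
    "New York": "NY",
    "NY": "USA",
    "USA": "North America",
}

# locatedIn facts (entity -> location).
LOCATED_IN = {"Morgan Stanley": "New York"}


def ancestors(location):
    """Return every region containing `location`, following subRegionOf transitively."""
    found = []
    seen = {location}
    current = SUB_REGION_OF.get(location)
    while current is not None and current not in seen:
        found.append(current)
        seen.add(current)
        current = SUB_REGION_OF.get(current)
    return found


def located_within(entity, region):
    """True if `entity` is locatedIn `region` directly or via subRegionOf."""
    place = LOCATED_IN.get(entity)
    if place is None:
        return False
    return place == region or region in ancestors(place)
```

With these facts, a query for Entities located in "North America" would match "Morgan Stanley" even though only its locatedIn link to "New York" is stated explicitly.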
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 The knowledge base
</SectionTitle>
      <Paragraph position="0"> Geographic information usually introduces a high level of ambiguity between named entities, for the following three reasons: * there could be several Locations with the same name (this includes sharing a common alias); * the name of a Location could match a common English word (e.g. &quot;Has&quot;, &quot;The&quot;); * other named entities (Company, Person, even Date or Numeric data) could share a common alias with a Location (examples: &quot;Paris Corporation&quot;, &quot;O'Brian&quot; county, &quot;10&quot; district, &quot;Departamento de Nueve de Julio&quot; with alias &quot;9 de Julio&quot;). In order to allow easy bootstrapping of applications based on KIM and to eliminate the need for them to write a Geo-gazetteer, the KIM knowledge base provides exhaustive coverage of entities of general importance. By limiting the Locations to only the &quot;important&quot; ones, we also keep the system as generic and as domain- and task-independent as possible. The &quot;importance&quot; of a location is hard to define, and part of the problem is that it depends on the domain on which the IE tasks are focused. Yet it is common sense that such locations include continents, countries, big cities, some rivers, mountains, etc. In addition to the above predefined locations, KIM: * learns from the texts it analyses; * has a comprehensive set of rules and patterns helping it to recognize unknown entities; * has a Hidden Markov Model learner, capable of correcting symbolic patterns.</Paragraph>
      <Paragraph position="1"> As a test domain, KIM uses political and economic news articles from leading newswires.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Populating the location knowledge base
</SectionTitle>
    <Paragraph position="0"> As the main source of geographic knowledge we used NIMA's GEOnet Names Server (GNS) data. The GNS database is the official repository of foreign place-name decisions approved by the U.S. Board on Geographic Names (US BGN) and contains approximately 3.9 million features with 5.37 million names. Approximately 20,000 of the database's features are updated monthly.</Paragraph>
    <Paragraph position="1"> The data is available for download as text files in a standard format, which contain a unique feature index (UFI), several names per Location (the official name, a short name, sometimes different transcriptions of the name) and geographic coordinates (one point; no bounding rectangle).</Paragraph>
    <Paragraph position="2"> Geographic coverage of the data is worldwide, exclud- [...] data we partially used USGS/GNIS data, which follows a format similar to that of the GNS data. For country names we followed FIPS10, which was a natural choice since the GNS data is structured that way. A list of big cities was obtained from the UN Statistics site, which covers city data (http://unstats.un.org/unsd/citydata/).</Paragraph>
    <Paragraph position="3"> We then created a mapping between our location classes and GNS feature designators. Some of the features were completely ignored (e.g. &quot;abandoned populated places&quot;, &quot;drainage ditch&quot;), while others were combined into one class (e.g. &quot;ADM2&quot; and &quot;ADMD&quot; into County).</Paragraph>
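A minimal sketch of such a designator-to-class mapping is given here, assuming a plain lookup table. Only the ADM2/ADMD entries and the two ignored feature types come from the text; the names `DESIGNATOR_TO_CLASS`, `IGNORED_FEATURES` and `map_feature` are hypothetical, and the ignored features are identified by their descriptions because their designator codes are not given in the text.

```python
# Hypothetical sketch of the mapping; not the actual KIM import code.

# Several GNS designators can map to one Location class.
DESIGNATOR_TO_CLASS = {
    "ADM2": "County",
    "ADMD": "County",
}

# Feature types that are skipped entirely (descriptions as quoted in the text).
IGNORED_FEATURES = {"abandoned populated places", "drainage ditch"}


def map_feature(designator, description=""):
    """Return the Location class for a GNS feature, or None if it is not imported."""
    if description in IGNORED_FEATURES:
        return None
    # Designators without a mapping are also left out.
    return DESIGNATOR_TO_CLASS.get(designator)
```

A many-to-one table of this kind makes it easy to add further designators to a class when the data for some country turns out to use them inconsistently.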
    <Paragraph position="4"> There is some inconsistency in the way the data is entered for different countries, mostly because of improper usage of designators (different designators used for similar geographic features and vice versa). This made the creation of the mapping somewhat harder, as we needed to map more designators to one class. The per-country files were almost consistently entered (with some exceptions; for example, in the UK, &quot;England&quot;, &quot;Scotland&quot;, &quot;Northern Ireland&quot; and &quot;Wales&quot; are entered as AREA, which suggests the same importance as the other 40 areas in the UK). We expect that a per-country mapping instead of a global one would lead to better results, yet we have not experimented with this, as it would require manual tuning for about 250 countries.</Paragraph>
    <Paragraph position="5"> The different names of the geographic features are mapped to aliases of the Location entities, with a main alias pointing to the official name. The RDF representation of a Location is shown in figure 4. Because these names sometimes match common English words and person names, a list of stop words was created and the aliases are filtered against it.</Paragraph>
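The alias filtering step could look roughly like the following sketch. The function name `filter_aliases`, the case-insensitive comparison and the contents of `STOP_WORDS` are assumptions; the two stop words shown are the example names cited in section 3.2.

```python
# Hypothetical sketch of stop-word filtering; the real stop-word list is not given in the text.

STOP_WORDS = {"has", "the"}  # "Has" and "The" are the examples cited in section 3.2


def filter_aliases(aliases, stop_words=STOP_WORDS):
    """Drop aliases that match a stop word, ignoring case."""
    return [alias for alias in aliases if alias.lower() not in stop_words]
```

For example, `filter_aliases(["Has", "Beijing", "Peking"])` keeps only the two unambiguous names.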
    <Paragraph position="6"> The import procedure uses the mapping described above but can also be restricted by a list of countries and classes to be imported. The currently imported classes are: Continent, GlobalRegion, Country, Province, County, CountryCapital, LocalCapital, City, Ocean, Sea, Gulf, OilField, Monument, Bridge, Plateau, Mountain, MountainRange, Plain. These classes were selected as &quot;important&quot;, based on common sense and statistical information derived from the GNS data.</Paragraph>
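The restriction by countries and classes can be sketched as a simple predicate applied to each mapped record. The record keys `location_class` and `country`, the function name `should_import` and the country codes used in the usage note are assumptions; the class list is the one given in the text.

```python
# Hypothetical sketch of the import restriction; record fields are assumed.

IMPORTED_CLASSES = {
    "Continent", "GlobalRegion", "Country", "Province", "County",
    "CountryCapital", "LocalCapital", "City", "Ocean", "Sea", "Gulf",
    "OilField", "Monument", "Bridge", "Plateau", "Mountain",
    "MountainRange", "Plain",
}


def should_import(record, countries=None, classes=IMPORTED_CLASSES):
    """Keep a mapped record only if its class and country pass the restrictions.

    `record` is a dict with the assumed keys "location_class" and "country".
    `countries=None` means that no country restriction applies.
    """
    if record["location_class"] not in classes:
        return False
    return countries is None or record["country"] in countries
```

Calling `should_import(record, countries={"UK"})` would then limit an import run to records from one country file, assuming "UK" is the code used for that file.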
    <Paragraph position="7"> The GNS data has three main problems when it comes to extracting only geographical entities of global importance and the relations between them: * There is no way to tell the importance of a location (e.g. is Chirpan a big city or a small town); * The only part-of relations available are between a location and its country, but not its province or county; * Some locations are not country-specific (e.g.</Paragraph>
    <Paragraph position="8"> oceans, seas, mountains) but are listed as separate locations with different identifiers in different per-country lists.</Paragraph>
    <Paragraph position="9"> We addressed the first problem by limiting the types of locations to a small subset of important ones (as explained above). The importance of cities was determined by using a list of all big cities (with population over 100,000). We attempted to solve the second problem with an algorithm that calculates the distance between a location and all provinces/counties in its country, and then creates a part-of relation with the nearest one. However, our experiments showed that the accuracy of the results was not satisfactory. This is mostly because the GNS data gives only a single point for each location (its footprint), but not its extent. We solved the third problem by comparing the geographic coordinates of locations with a common alias and type and merging the matching ones into a single entity in the knowledge base.</Paragraph>
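The two coordinate-based steps above can be sketched as follows. This is an illustration under stated assumptions and not the procedure actually used: the text does not say how distances were computed, so the haversine formula is swapped in here, and the record layout (`lat`, `lon`, `type`, `aliases`), the function names and the tolerance value are all hypothetical.

```python
import math

# Hypothetical sketch; record layout and tolerance are assumptions.


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def nearest_region(location, regions):
    """Pick the province/county whose single GNS point is closest to `location`.

    This is the kind of heuristic whose accuracy proved unsatisfactory,
    because one point says nothing about the extent of a region.
    """
    return min(
        regions,
        key=lambda r: haversine_km(location["lat"], location["lon"], r["lat"], r["lon"]),
    )


def merge_duplicates(records, tolerance_deg=0.01):
    """Merge records sharing an alias and type whose coordinates nearly coincide."""
    merged = []
    for rec in records:
        for kept in merged:
            same_type = kept["type"] == rec["type"]
            shared_alias = bool(kept["aliases"].intersection(rec["aliases"]))
            close = (
                tolerance_deg >= abs(kept["lat"] - rec["lat"])
                and tolerance_deg >= abs(kept["lon"] - rec["lon"])
            )
            if same_type and shared_alias and close:
                # Same real-world feature listed in another per-country file.
                kept["aliases"].update(rec["aliases"])
                break
        else:
            merged.append({**rec, "aliases": set(rec["aliases"])})
    return merged
```

The nearest-point heuristic fails, for instance, when a town lies inside a large province whose reference point is farther away than that of a small neighbouring province, which is consistent with the unsatisfactory accuracy reported above.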
    <Paragraph position="10"> Currently the KB contains about 50,000 Locations grouped into 6 Continents, 27 GlobalRegions (such as &quot;Caribbean&quot; or &quot;Eastern Europe&quot;), 282 Countries, all country capitals and 4,700 Cities (including all the cities with population over 100,000). Each location has several aliases (usually including English, French and sometimes the local transcription of the location), geographic coordinates, the designator (DSG) and the Unique Feature Index (UFI), according to GNS. The figures for entities of global importance in the KIM KB are shown in table 1.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Experiments with direct use for IE
</SectionTitle>
    <Paragraph position="0"> The locations KB is used for Information Extraction (IE) as part of the KIM system, which combines symbolic and stochastic approaches and is based on the ANNIE IE components from GATE. As a baseline, a gazetteer module looks up the aliases of the entities (including all locations) in the text. Further, unknown or not precisely matching entities are recognized with pattern-based grammars: * using location pre/post keys to identify locations, e.g. &quot;The River Thames&quot; * using location pre/post keys + Location, e.g. &quot;north Egypt&quot;, &quot;south Wales&quot; * context-based recognition, such as: &quot;in&quot; + Token-with-first-uppercase. A number of disambiguation problems (mostly cases of Location names occurring in the composite names of other Entities) are also detected and resolved: * ambiguity between Person and Organization, e.g.</Paragraph>
    <Paragraph position="1"> &quot;U.S. Navy&quot; (this would normally be recognized as a Person name from the pattern &quot;two initials + Family name&quot;, but in this case the initials match a location alias) * occurrence of locations in person names, e.g. &quot;Jack London&quot; (disambiguated because in the KB there is a LexicalResource stating that &quot;Jack&quot; is a first name of a Person) * occurrence of locations in Organization names, e.g. &quot;Scotland Yard&quot; (disambiguated because such an Organization exists in the KB). Finally, those recognized Entities (including Locations) that are not marked as nouns by the part-of-speech tagger are discarded.</Paragraph>
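The interplay of gazetteer lookup, key-based patterns and disambiguation can be illustrated with a toy sketch. KIM implements these steps with GATE/ANNIE components and pattern grammars; the Python function `annotate`, the rule ordering and the tiny word lists here are assumptions made purely for illustration, populated with the examples from the text.

```python
# Hypothetical sketch: not the GATE grammars used by KIM.

LOCATION_ALIASES = {"London", "Scotland", "Egypt", "Wales"}
FIRST_NAMES = {"Jack"}                  # stands in for LexicalResource first names
ORGANIZATIONS = {("Scotland", "Yard")}  # organizations known to the KB
LOCATION_PRE_KEYS = {"River"}           # keys that announce a location


def annotate(tokens):
    """Return (start, end, type) spans over `tokens`; `end` is exclusive."""
    spans = []
    i = 0
    while len(tokens) > i:
        pair = tuple(tokens[i:i + 2])
        if pair in ORGANIZATIONS:
            # A known Organization wins over the Location inside its name.
            spans.append((i, i + 2, "Organization"))
            i += 2
        elif len(pair) == 2 and pair[0] in FIRST_NAMES and pair[1][:1].isupper():
            # First name + capitalized token: a Person, even if the token is a location alias.
            spans.append((i, i + 2, "Person"))
            i += 2
        elif len(pair) == 2 and pair[0] in LOCATION_PRE_KEYS and pair[1][:1].isupper():
            # Pre-key pattern, e.g. "River Thames", for locations absent from the gazetteer.
            spans.append((i, i + 2, "Location"))
            i += 2
        elif tokens[i] in LOCATION_ALIASES:
            # Baseline gazetteer lookup.
            spans.append((i, i + 1, "Location"))
            i += 1
        else:
            i += 1
    return spans
```

In this sketch the more specific rules are tried before the plain gazetteer match, so "London" in "Jack London" is not tagged as a Location, while a later stand-alone "London" still is.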
    <Paragraph position="2"> Some of the newly recognized Locations appear frequently in the analyzed texts. Those that can be found in the GNS data are candidates for entry into the knowledge base, because their frequency is extra evidence of their importance. This is a way to extend the knowledge base so that it contains all the &quot;important&quot; Locations, in the sense of those frequently used in one or more application domains.</Paragraph>
    <Paragraph position="3"> The performance of the KIM system was measured on a news corpus using GATE's evaluation tools. The system was also compared to a high-precision named entity recognition system, which uses small flat gazetteer lists.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Evaluation Corpus
</SectionTitle>
      <Paragraph position="0"> The corpus was collected from 3 online English newspapers: the Independent, the Guardian and the Financial Times. In total it contains 101 documents with 56,221 words. The corpus was manually annotated with entities.</Paragraph>
      <Paragraph position="1"> Table 2 shows the number of entities of each type in the corpus.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Corpus Benchmark Tool
</SectionTitle>
      <Paragraph position="0"> The Corpus Benchmark Tool (CBT) is a GATE component that enables automatic evaluation of an application in terms of Precision, Recall and F-measure against a set of ground truths. It also enables two versions of a system to be compared against each other (e.g. for regression testing), or two different systems to be compared. Each system is evaluated by comparing the annotations it produces with a set of key annotations (produced manually) and producing a score. Two systems can therefore be compared with each other, and indications are given as to where they differ.</Paragraph>
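The three measures can be made concrete with a short sketch. It uses simple exact-match scoring over sets of annotations and the balanced F-measure; it does not reproduce the Corpus Benchmark Tool's own implementation or options, and the function name `evaluate` and the tuple layout are assumptions.

```python
# Hypothetical sketch of exact-match scoring; not the CBT implementation.


def evaluate(key, response):
    """Compare response annotations with manually produced key annotations.

    Both arguments are sets of (start, end, type) tuples.
    Returns (precision, recall, f_measure).
    """
    correct = len(key.intersection(response))
    precision = correct / len(response) if response else 0.0
    recall = correct / len(key) if key else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

Precision is the share of produced annotations that are correct, recall is the share of key annotations that were found, and the F-measure is their harmonic mean.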
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 MUSE
</SectionTitle>
      <Paragraph position="0"> MUSE is an information extraction system developed within GATE which aims to perform named entity recognition on different types of text (Maynard et al., 2002). MUSE recognises the standard MUC entity types of Person, Location, Organisation, Date, Time, Percent, and some additional types such as Addresses and Identifiers.</Paragraph>
      <Paragraph position="1"> The system is based on ANNIE, the default IE system within GATE, but has been extended to deal with a variety of text sources and genres, and incorporates a mechanism for automatically selecting the most appropriate set of resources depending on the text type.</Paragraph>
      <Paragraph position="2"> MUSE uses flat-list gazetteers which primarily contain contextual clues that help with the identification of named entities, e.g., company designators (such as Ltd, GmbH), job titles, person titles (such as Mr, Mrs), common first names, typical organisation types (e.g., Ministry, University). In addition, MUSE has lists enumerating concrete types of locations which have about 27,500 entries, in- [...] As can be seen from the location entries in the MUSE gazetteers, the system is specifically tailored to recognise UK locations with high recall and precision, whereas the KIM locations KB is not skewed towards any particular country.</Paragraph>
      <Paragraph position="3"> We ran the MUSE system over our test corpus to see how KIM matched up to it.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5.4 Results
</SectionTitle>
    <Paragraph position="0"> A comparison of MUSE and KIM performance is given in table 4. When interpreting these results one must also bear in mind that the high-performance IE system (MUSE) only tags geographical entities as locations, whereas the GNS-based system (KIM) actually disambiguates them with respect to their specific type (e.g., City, Province, Country).</Paragraph>
    <Paragraph position="1"> Investigation of the reasons behind the lower recall shows that: * the KB is too coarse-grained, i.e., there are no &quot;smaller&quot; locations, such as small towns/counties in the UK, and we do not import military bases (e.g. &quot;Diego Garcia&quot;) from the GNS data into the KB, etc.</Paragraph>
    <Paragraph position="2"> * the application was not specifically tuned for the corpus/news texts, e.g. we do not use the fact that the texts often clarify locations when they are first mentioned (e.g., Aberdeen, UK).</Paragraph>
    <Paragraph position="3"> * there are no historical Locations, such as &quot;Soviet Union&quot;.</Paragraph>
    <Paragraph position="4"> It is expected that the first two problems will be fixed by enhancing the KB with regard to the target domain of a KIM-based application. To check this assumption we ran another experiment. Because the corpus contains a lot of UK-related information (the articles are from three English newspapers) and MUSE is specifically tailored to UK locations, we needed extra UK-specific information in the KB. As mentioned earlier, the import procedure is flexible enough to allow us to add all the locations from the UK GNS data. The performance of this enhanced KB is shown in table 5.</Paragraph>
    <Paragraph position="5"> The recall is higher than that of MUSE (increased to 95%, vs 93% for MUSE).</Paragraph>
    <Paragraph position="6"> The precision is 10 percentage points behind MUSE (85% vs 95%).</Paragraph>
    <Paragraph position="7"> An obvious reason is that we have more entities in the KB and do not control the aliases (except through the stop-word list), while all the locations in the MUSE gazetteer lists are manually entered and therefore produce very little ambiguity.</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Discussion
</SectionTitle>
    <Paragraph position="0"> We produced a KB of locations with worldwide coverage using GNS data. With about 50,000 Locations, it is larger than what most other IE systems have. It is not big (compared to the 4M locations in the ADL Gazetteer), but it provides good coverage of Locations (91%). Because the KB was not tuned for the specifics of the test corpus, we could expect similar coverage for other corpora.</Paragraph>
    <Paragraph position="1"> Our flexible import procedure allows domain-targeted versions of the KB to be produced (by importing more Location types), which are expected to have good enough coverage of locations.</Paragraph>
    <Paragraph position="2"> The impact of the location KB on the IE performance is still under evaluation and improvement. We are working on improvements in two directions: i) decreasing the amount of GNS data entered in the KB, for both locations and their aliases; ii) changing the way in which the IE system uses the KB, to improve precision. On the latter, we are currently experimenting with applying the regular named entity recognition grammars first and then using the location KB to look up only the unclassified entities, instead of using it as a gazetteer prior to named entity recognition as we do now.</Paragraph>
    <Paragraph position="3"> 7 Bootstrapping IE for new languages from the KB. We were able to make use of the KB as part of the TIDES Surprise Language Exercise, a collaborative effort between a number of sites to develop resources and tools for various language engineering tasks on an unknown language. A dry run of this program took place in March 2003, in which participants were given a week from the time the language was announced to collect tools and resources for processing that language. The language chosen was Cebuano, spoken by 24% of the population in the Philippines. The University of Sheffield developed a Named Entity recognition system for Cebuano, to which we contributed a list of locations from the Philippines.</Paragraph>
    <Paragraph position="4"> This was particularly useful as this kind of information was not readily available from the Internet, and time was of the essence. The NE system (developed within a week) achieved scores for the recognition of locations of 73% Precision, 78% Recall and 76% F-measure. We predict that this kind of information will be very useful for the full Surprise Language Program in June, where participants will have more time (a month) to create resources on another surprise language, not only for Information Extraction but also for tasks such as Cross-Language Information Retrieval and Machine Translation.</Paragraph>
  </Section>
</Paper>