<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0110"> <Title>On building a high performance gazetteer database</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> We are interested in collecting the largest possible set of geographic entities, so as to be able to produce a variety of extremely comprehensive gazetteers. These gazetteers are currently produced to search for both direct and indirect geospatial references in text. The production process can be tailored to produce custom gazetteers for other applications, such as historical queries.</Paragraph> <Paragraph position="1"> The purpose of the MetaCarta GazDB is to provide both a place and supporting mechanisms for storing, maintaining, and exporting everything we know about our collection of geographic entities.</Paragraph> <Paragraph position="2"> To produce a gazetteer from various data sources, we make use of a database, the GazDB, as well as two sets of scripts: conversion scripts, to transfer the data from its source format into the GazDB, and export scripts, to output data from the GazDB in the form of gazetteers.</Paragraph> <Paragraph position="3"> The interaction between these elements is illustrated in Figure 1. Geographic input data is collected from multiple (not necessarily disjoint) sources, each with its own peculiar format. As such, the conversion scripts must perform some amount of normalization and classification of the input data in order to maintain a single unified repository of geographic data. However, in order to justify the overhead of consolidating all the data into a single entity, it must be possible to output all of it into multiple gazetteers designed for different goals.</Paragraph> <Paragraph position="4"> It should also be possible to perform filtering operations on the gazetteer entries, such as comparing entry names against common-language dictionaries.
This can be used to determine whether occurrences of gazetteer names in documents are geographically relevant (Rauch et al., 2003).</Paragraph> <Paragraph position="5"> This is the task for the export scripts. However, in this paper, we shall focus on the heart of the system, namely the GazDB. Section 2 describes how the GazDB relates geographic names and features. In Section 3 we describe how the GazDB handles ambiguities and inconsistencies in geographic names. Finally, in Section 4 we outline the classification and storage system used for geographic features.</Paragraph> <Paragraph position="6"> 2 Gazetteer entries in the GazDB The most basic form of a gazetteer entry consists of a mapping between a geographic name and a geographic location. [Figure 2: Relating features and names in the GazDB] The Alexandria Digital Library Project (Hill, 2000), however, defines a gazetteer entry as also requiring a type designation to describe the entity referred to by the name and location. Because a geographical type designation classifies the physical entity rather than the name assigned to it, we think of gazetteer entries produced by the GazDB as relating geographic names and geographic features (which have inherent types).
We will separately discuss geographic names and geographic features in greater detail later, and focus on the stored relations between them first.</Paragraph> <Paragraph position="7"> A naive approach to creating a gazetteer is to maintain a flat file with one gazetteer entry per line, as follows:
Boston 42° 21'30&quot;N, 71° 4'23&quot;W
Cambridge 42° 23'30&quot;N, 71° 6'22&quot;W
Somerville 42° 23'15&quot;N, 71° 6'00&quot;W
This schema is overly simplistic because it supposes a one-to-one mapping between geographic names and features, when in reality many geographic features have more than one name commonly associated with them.</Paragraph> <Paragraph position="8"> For instance, the tallest mountain in North America is unambiguously referred to as either Mount McKinley or Denali. Using this gazetteer, recording both names for the mountain would result in the creation of two entries. This is highly impractical on a large scale due to space requirements and the complexity of systematically updating or modifying the gazetteer.</Paragraph> <Paragraph position="9"> The GazDB uses the well-known relational approach (Codd, 1970) to store the geographic data for the gazetteer. To do so, we separate the notion of a geographic name from the geographic feature that it represents. We maintain distinct tables for locations and names; mappings between names and locations are stored in a third table, keyed by the unique numerical identifiers of both the name and the location, as shown in Figure 2. This system enables the GazDB to support both many-to-one relations between names and features, as in the case of Denali and McKinley, and one-to-many relations such as London being the name of both a city in Britain and a town in Connecticut. [Figure 3: Updating a name in the GazDB]</Paragraph> <Paragraph position="10"> In the GazDB, several other relational tables are used to store numerical data associated with the known geographic features.
For example, population data is kept in a separate table that links census figures with the IDs of entries in the feature table. This is useful because it allows queries to be restricted to inhabited places. Elevation data is stored in a similar manner.</Paragraph> <Paragraph position="11"> As gazetteers get updated, corrections are often made to the name or to the feature data. To update a name, we formally abandon the old ID, create a new name entry, and update the name-feature mapping table by replacing the old name ID with the new one, as in Figure 3. We repeat this process for each table in the GazDB that refers to the old ID; this is simple, because the tables are indexed by ID. Updating geographic locations or numerical data in the GazDB is done in an identical manner. The GazDB also includes a table for storing detailed information about the sources of the data in the GazDB, for instance, &quot;NIMA GeoNet names datafile for Afghanistan (AF), published November 8 2002&quot;. Every element in the GazDB is then associated with the appropriate entry in the source table. This makes every entry in the GazDB traceable to its source, preventing the appearance of &quot;mystery data&quot;. The source table also allows easy, systematic, source-specific modifications of the GazDB's entries to keep pace with frequently updated datasets, thereby keeping the GazDB's data fresh.</Paragraph> <Paragraph position="12"> The GazDB also includes a complete log of all updates to the database tables and entries. Because data rows are abandoned but not deleted during updates, it is possible to recreate the state of the database prior to any particular set of updates.</Paragraph> <Paragraph position="13"> The flexibility of the relational design also allows the inclusion of new kinds of data that were not anticipated or not available when the original schema was designed.
For instance, one could add yearly precipitation data for geographic locations by creating an additional table mapping locations to rainfall amounts, without the need to re-ingest the data already in the GazDB.</Paragraph> <Paragraph position="14"> The GazDB also maintains a historical geographical record by capturing temporal extents for mappings, e.g.</Paragraph> <Paragraph position="15"> the city at 59° 54'20&quot;N, 30° 16'9&quot;E would be associated with the names: The GazDB can thus export temporally sensitive gazetteers customized for use in historical documents.</Paragraph> </Section> </Paper>
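The relational layout described in Section 2 (separate name and feature tables joined by a mapping table, with updates that abandon rows rather than delete them) can be sketched as follows. This is only a minimal illustration using Python's standard sqlite3 module, not the GazDB's actual schema: every table name, column name, function name, and the `abandoned` flag is an assumption made for the example.

```python
import sqlite3


def build_gazdb():
    """Create an in-memory database with hypothetical GazDB-like tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE names (
            name_id   INTEGER PRIMARY KEY,
            name      TEXT NOT NULL,
            abandoned INTEGER NOT NULL DEFAULT 0  -- assumed flag: 1 = superseded
        );
        CREATE TABLE features (
            feature_id  INTEGER PRIMARY KEY,
            description TEXT NOT NULL
        );
        -- Third table: one row per (name, feature) relation.
        CREATE TABLE name_feature (
            name_id    INTEGER NOT NULL REFERENCES names(name_id),
            feature_id INTEGER NOT NULL REFERENCES features(feature_id)
        );
    """)
    conn.executemany(
        "INSERT INTO names (name_id, name) VALUES (?, ?)",
        [(1, "Denali"), (2, "Mount McKinley"), (3, "London")],
    )
    conn.executemany(
        "INSERT INTO features (feature_id, description) VALUES (?, ?)",
        [(10, "tallest mountain in North America"),
         (20, "city in Britain"),
         (30, "town in Connecticut")],
    )
    # Two names for one feature, and one name for two features.
    conn.executemany(
        "INSERT INTO name_feature (name_id, feature_id) VALUES (?, ?)",
        [(1, 10), (2, 10), (3, 20), (3, 30)],
    )
    return conn


def names_for_feature(conn, feature_id):
    """Return all current names mapped to a feature, sorted alphabetically."""
    rows = conn.execute(
        "SELECT n.name FROM names n "
        "JOIN name_feature nf ON n.name_id = nf.name_id "
        "WHERE nf.feature_id = ? AND n.abandoned = 0 ORDER BY n.name",
        (feature_id,),
    )
    return [row[0] for row in rows]


def features_for_name(conn, name):
    """Return descriptions of all features that carry the given current name."""
    rows = conn.execute(
        "SELECT f.description FROM features f "
        "JOIN name_feature nf ON f.feature_id = nf.feature_id "
        "JOIN names n ON n.name_id = nf.name_id "
        "WHERE n.name = ? AND n.abandoned = 0 ORDER BY f.description",
        (name,),
    )
    return [row[0] for row in rows]


def update_name(conn, old_name_id, new_name):
    """Abandon the old name row, add a new one, and repoint the mapping table."""
    new_id = conn.execute(
        "INSERT INTO names (name) VALUES (?)", (new_name,)
    ).lastrowid
    conn.execute("UPDATE names SET abandoned = 1 WHERE name_id = ?", (old_name_id,))
    conn.execute(
        "UPDATE name_feature SET name_id = ? WHERE name_id = ?",
        (new_id, old_name_id),
    )
    return new_id
```

Because the old name row is only flagged and never deleted, the earlier state remains recoverable in principle. A fuller version would also record each change in an update log and link every row to a source table, as the text describes.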