<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0110">
  <Title>On building a high performance gazetteer database</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Geographic names
</SectionTitle>
    <Paragraph position="0"> Geographic names present a number of challenges to a gazetteer. These include issues inherent to translation and transliteration of foreign names, mediation between repeated entries and multiple sources, and the (in)accuracy of placename specifications.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Resolution of names
</SectionTitle>
      <Paragraph position="0"> The first hurdle is internationalization (i18n). Differences between character encodings and display capabilities result in some names taking on a variety of forms (e.g.</Paragraph>
      <Paragraph position="1"> printing São Tomé as Sao Tome). Although the printed forms of the name are not character-identical, the name itself has not changed from its original representation.</Paragraph>
      <Paragraph position="2"> To resolve this, the GazDB defines and stores a geographic name as a triple: [canonical name, display name, search name], with each element at a different level of resolution. The canonical form of the feature's name is kept as a 16-bit string (Unicode / UTF-8), the display form is 8-bit (ISO 8859-1), and the search name is 7-bit uppercase ASCII. These resolutions are appropriate for different purposes: wide characters are necessary for Chinese/Japanese/Korean (CJK) content, the display name is a necessary compromise given the default display capabilities of Internet browsers, and the search name is necessary given the data entry capabilities of the default (US-ASCII) keyboard. We henceforth use the term name to implicitly refer to this triple.</Paragraph>
      <Paragraph position="3"> We also support Soundex and Metaphone geographic name searches at 7-bit resolution, by storing the hash codes in separate tables within the GazDB.</Paragraph>
      <Paragraph position="4"> However, there are cases where variations in a name arise from multiple transliterations rather than from character encodings, as in the case of Macau and Macao. We therefore further define a spelling of a geographic name to be a similarly constructed triple of [UTF-8, 8859-1, ASCII] encodings, with the added restriction that while the authoritative name is directly associated with a geographical entity, a spelling is directly associated only with a name.</Paragraph>
      <Paragraph position="5"> Thus while Macao is a spelling variant of Macau, and Macau is the name of a city in Southern China, nonetheless Macao is not considered to be a GazDB name proper for the city.</Paragraph>
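      <Paragraph position="6"> The name triple described above can be sketched as follows. This is only an illustration under stated assumptions, not the GazDB implementation: the names GeoName, fold_to_ascii and make_name are hypothetical, and the search form is derived by stripping diacritics, which works for Latin-script names such as São Tomé but would need a real transliteration step for CJK content.

```python
import unicodedata
from typing import NamedTuple


class GeoName(NamedTuple):
    """Hypothetical name triple; the field names are assumptions."""
    canonical: str  # full Unicode form
    display: str    # form restricted to ISO 8859-1
    search: str     # 7-bit uppercase ASCII form


def fold_to_ascii(text: str) -> str:
    """Strip diacritics and drop anything that is still not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.encode("ascii", "ignore").decode("ascii")


def make_name(canonical: str) -> GeoName:
    """Derive the display and search forms from the canonical form."""
    canonical = unicodedata.normalize("NFC", canonical)
    try:
        canonical.encode("latin-1")  # representable in ISO 8859-1?
        display = canonical
    except UnicodeEncodeError:
        display = fold_to_ascii(canonical)  # assumed fallback
    return GeoName(canonical, display, fold_to_ascii(canonical).upper())


name = make_name("S\u00e3o Tom\u00e9")
print(name.search)  # SAO TOME
```

      Under this sketch a spelling variant such as Macao would be a second triple of the same shape, linked to the name Macau rather than to the city itself.</Paragraph>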
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Authoritativeness
</SectionTitle>
      <Paragraph position="0"> The GazDB also makes a distinction about the authoritativeness of names. We view a placename as an information resource in and of itself, independent of the feature that it names. This is analogous to the Unicode standard, where the name of a character is treated as an information resource independent of the glyph it corresponds to.</Paragraph>
      <Paragraph position="1"> There are multiple names that refer to the same geographic feature but are neither spelling variants of one another nor seemingly derived from one another, such as Holland vs. The Netherlands or Nihon vs. Japan. Because of this, we define and maintain alternate names for each authoritative name. Each geographic entity is permitted to have only one authoritative name, but that authoritative name can have several more informal alternate names associated with it. Both alternate names and authoritative names can have variant spellings.</Paragraph>
      <Paragraph position="2"> Conflicts between authoritative names from different sources are inevitable. However, we cannot independently determine the proper solution in an objective way because we are not a mapping agency: we seek to use geographic data, not produce it. Without being able to take our own measurements, we must resolve these discrepancies on the basis of the perceived trustworthiness of the sources providing the data. The GazDB draws on many sources that can be trusted to varying degrees. We put the highest trust in the Geographic Names Information System (USGS, 2003) data and the GEOnet Names Server (NIMA, 2003) data, and mediate the incorporation of all the other sources accordingly. To enforce the distinction between the authoritative and the alternate versions of a name, and to emphasize the authoritative name, we speak of &quot;names&quot; referring only to the authoritative name. For all others, we speak of &quot;alternates&quot; and &quot;spellings&quot;.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Explicitness
</SectionTitle>
      <Paragraph position="0"> Lastly, the GazDB distinguishes fully specified geographic names, such as New York City, New York, USA, from their short forms, such as New York City or even the more colloquial yet ambiguous New York.</Paragraph>
      <Paragraph position="1"> The GazDB maintains a taxonomy of geographic features, consisting of an administrative hierarchy of the world. The administrative hierarchy serves to locate geographic entities by country, then state, county, and so forth. This is based upon both the FIPS 10-4 and the ISO 3166-2 (ISO, 1998) codes. However, these standards often disagree and are updated infrequently, so we base ours upon the Hierarchical Administrative Subdivision Codes (HASC) system (Law, 1999). Using this taxonomy, we can specify geographic entities by name and by their location within the political divisions of the world. The GazDB is capable of maintaining multiple taxonomies for geographic entities, such as one based upon physical features (for instance: &quot;Mont Blanc is a mountain in the Alps which are in Europe&quot;, in addition to &quot;Mont Blanc is a mountain in France&quot;); however, these have not yet been completed.</Paragraph>
      <Paragraph position="2"> We define as an authoritative title the unambiguous list of hierarchical administrative regions that contain the geographic entity. Here New York State, United States would be the authoritative title, such that the sequence New York City, New York State, USA unambiguously refers to a single geographic entity. The authoritative title is the ordered sequence of the authoritative names for the list of hierarchical regions that contain the feature, so it is easy to compute from a hierarchical region tree in the GazDB. Other titles can be computed by using variants or spellings of the containing regions' names, or by omitting some of them (New York City, USA, for example).</Paragraph>
      <Paragraph position="3"> We have thus imposed an order on the GazDB geographic names: each feature can have one primary (most authoritative) GazDB name and some alternate GazDB names. Each GazDB name, both primary and alternate, can have multiple spellings associated with it. All of the above are available at all three encoding resolutions.</Paragraph>
      <Paragraph position="4"> This ordering allows the GazDB to classify geographic names along three orthogonal scales: general/vernacular vs. authoritative; raw (original character encoding) vs.</Paragraph>
      <Paragraph position="5"> cooked (character-set- and transliteration-normalized); and implicit (short form) vs. explicit (long form). This allows us to export, on an as-needed basis, multiple gazetteers from the GazDB at different name resolutions.</Paragraph>
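      <Paragraph position="6"> Computing an authoritative title from the hierarchical region tree, as described earlier in this section, can be sketched as a walk up parent links. This is a minimal illustration, not the GazDB schema: the Region class and both function names are hypothetical, and each node is assumed to store only its authoritative name and its containing region.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Region:
    """Hypothetical tree node: an authoritative name plus its container."""
    name: str
    parent: Optional["Region"] = None


def authoritative_title(feature: Region) -> List[str]:
    """Ordered authoritative names of the regions containing the feature."""
    title = []
    node = feature.parent
    while node is not None:
        title.append(node.name)
        node = node.parent
    return title


def fully_specified_name(feature: Region) -> str:
    """Feature name followed by its authoritative title."""
    return ", ".join([feature.name] + authoritative_title(feature))


usa = Region("United States")
state = Region("New York State", usa)
city = Region("New York City", state)
print(fully_specified_name(city))
```

      Shorter titles, such as New York City, USA, would correspond to dropping entries from the list returned by authoritative_title before joining.</Paragraph>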
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Language information
</SectionTitle>
      <Paragraph position="0"> The multilingual support in the GazDB goes beyond the use of Unicode. To map different name entries to geographic features for different languages, we also maintain within the GazDB a detailed list of the world's languages (Grimes and Grimes, 2000), and associate all names and descriptions with their language.</Paragraph>
      <Paragraph position="1"> The GazDB can keep one authoritative name (but arbitrary numbers of associated spellings, variants, and titles) per language in the world for any geographic feature. Therefore, given authoritative sets of raw geographic data in a foreign language, the GazDB could produce a gazetteer in that language. By matching gazetteer entries by feature, the GazDB could potentially issue a multilingual gazetteer as well. Of course, obtaining the large, accurate geographic datasets in foreign languages required for this purpose is a major ongoing undertaking, one that we make no claim to have completed!</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Geographic features
</SectionTitle>
    <Paragraph position="0"> As mentioned in Section 2, a geographic feature includes both a geographic location and some categorization of what is situated there. The GazDB classifies geographic entities along three orthogonal scales: spatial representation, functional class, and administrative type. These classifications allow users to better restrict gazetteer queries, perhaps via pull-down menus, for more relevant results.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Spatial representations
</SectionTitle>
      <Paragraph position="0"> Simple point/bounding-box categorization does not accurately depict the topological footprint of most features (Hill et al., 1999). Points do not represent the geographic extents of locations, and bounding boxes misrepresent features by oversimplifying the shape. Of particular interest is the ability to categorize geographic entities with &quot;fuzzy boundaries&quot;, such as the extent of wetlands, or disjoint regions, such as an archipelago. The GazDB classifies features by their footprint into 6 major types (each with numerous subtypes):</Paragraph>
      <Paragraph position="1">
        1 point - 0-dimensional (approximated to a point, e.g. a factory gate or a well)
        2 line - 1-dimensional (e.g. a road or power line)
        3 area - 2-dimensional without clearly defined boundaries (e.g. wetlands)
        4 point-area - a 2-D region with clearly defined boundaries (e.g. county or lake)
        5 cluster of point-areas - e.g. an archipelago
        6 probability density distribution - a feature that shifts over time, e.g. ice packs
        0 unknown/unclassified</Paragraph>
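      <Paragraph position="2"> The footprint codes listed above map naturally onto an enumeration that a query can filter on. The sketch below is illustrative only: the Footprint class, its member names and the filter_by_footprint helper are hypothetical, and only the numeric codes come from the list above.

```python
from enum import IntEnum


class Footprint(IntEnum):
    """Footprint codes from the list above; member names are assumed."""
    UNKNOWN = 0
    POINT = 1
    LINE = 2
    AREA = 3                      # no clearly defined boundaries
    POINT_AREA = 4                # clearly defined boundaries
    CLUSTER = 5                   # cluster of point-areas
    PROBABILITY_DISTRIBUTION = 6  # feature that shifts over time


def filter_by_footprint(features, wanted):
    """Keep the names of (name, footprint) pairs with the wanted footprint."""
    return [name for name, footprint in features if footprint == wanted]


# Toy data for illustration only
features = [
    ("well", Footprint.POINT),
    ("county", Footprint.POINT_AREA),
    ("archipelago", Footprint.CLUSTER),
    ("lake", Footprint.POINT_AREA),
]
print(filter_by_footprint(features, Footprint.POINT_AREA))
```

      Restricting a query to a single footprint type in this way is one possible reading of how such a classification narrows gazetteer results.</Paragraph>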
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Functional classes
</SectionTitle>
      <Paragraph position="0"> Many features, particularly structures, can also be described by their functional class:  It is worth reiterating that these categorizations are deliberately broad and are used for filtering purposes only. The GazDB maintains a complete hierarchical tree of all the administrative subdivisions within a country and the geographic entities contained therein, without any depth limitations.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.4 Using feature categorization
</SectionTitle>
      <Paragraph position="0"> The particular categories and classifications are specified for a number of reasons:
        To facilitate Knowledge Representation within the GazDB by axiomatizing how we classify data. We currently have no ontology for the geographic entities, but we leave open the option to add one to our taxonomies.</Paragraph>
      <Paragraph position="1"> To reduce the need for human training, such that an average user of the gazetteer can have reasonable expectations of what each category includes based on intuition.
        User convenience: the categories in the appropriate pull-down menu should be ones useful to a user.</Paragraph>
      <Paragraph position="2"> To make querying more efficient: for example, we can use axiomatic expectation to assume that a polygonal feature matches only other polygons.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.5 Storing geographic locations
</SectionTitle>
      <Paragraph position="0"> A major advantage that coordinate systems have over naming systems is that, given an appropriate method, it is possible to convert from one coordinate system to another with reasonable accuracy. As such, the GazDB currently stores geocoordinates only in decimal degrees (albeit in two versions: one high-precision, and the other rounded for display purposes). However, the conversion and export scripts are already prepared to handle a wide variety of coordinate systems, such as Degrees-Minutes-Seconds (DMS), Military Grid Reference System (MGRS), and Universal Transverse Mercator (UTM) coordinates, to name a few.</Paragraph>
      <Paragraph position="1"> The GazDB scripts can also convert between map projections, but so far this is done only to convert source data into the GazDB standard format.</Paragraph>
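      <Paragraph position="2"> As an illustration of the simplest of these conversions, decimal degrees to Degrees-Minutes-Seconds and back can be sketched as follows. This is not the GazDB export script: the function names, the hemisphere-letter convention and the default precision of two decimal places for seconds are all assumptions.

```python
def decimal_to_dms(value, is_latitude, precision=2):
    """Convert decimal degrees to (degrees, minutes, seconds, hemisphere)."""
    if is_latitude:
        hemisphere = "N" if value >= 0 else "S"
    else:
        hemisphere = "E" if value >= 0 else "W"
    # Round the total number of seconds first so seconds never reach 60
    total_seconds = round(abs(value) * 3600, precision)
    degrees, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return int(degrees), int(minutes), round(seconds, precision), hemisphere


def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert a DMS tuple back to signed decimal degrees."""
    sign = -1 if hemisphere in ("S", "W") else 1
    return sign * (degrees + minutes / 60 + seconds / 3600)


print(decimal_to_dms(45.5, True))      # (45, 30, 0.0, 'N')
print(decimal_to_dms(-122.25, False))  # (122, 15, 0.0, 'W')
```

      Grid-based systems such as MGRS and UTM additionally depend on a datum and a projection, so they are not covered by this sketch.</Paragraph>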
    </Section>
  </Section>
</Paper>