<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0112">
  <Title>Main Library Building</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 The geo-coding process
</SectionTitle>
    <Paragraph position="0"> In its current implementation, the service consists of two main components, the geo-parser and the geoXwalk gazetteer, with a generic demonstrator interface. The term geo-parsing refers to the identification of place-names in a document/resource, whereas geo-coding refers to tagging each candidate, and consequently the resource, with a geographic footprint. Figure 1 shows the basic geo-coding flowline.</Paragraph>
    <Paragraph position="1"> A resource is submitted to the geo-parser, which identifies a series of potential placenames. Each placename is displayed along with the number of occurrences in the text, and the number of matching gazetteer candidates.</Paragraph>
    <Paragraph position="2"> For each placename, a link to the gazetteer records is displayed, and a highlight option is available to locate the name in the original text, which is displayed beneath the table. Various sorting functions are also available for the table records. County and feature type are the default attributes for disambiguation, although more are available through the geoXwalk feature specification. Currently, multiple gazetteer entries can be attached to a single placename, enabling output of different instances of the same name in the text. Geo-coding output is available in an application-specific XML schema, CSV, or HTML, and contains parser and editor metadata. Output placenames can be viewed on a map. Clearly the degree of human interaction is high during the review stage, with the process currently limited to individual resources. As geo-parser development continues, user interaction at this stage of the process will decrease, although the potential for 'post process' queries will rise as the parser is more closely integrated with the geoXwalk database.</Paragraph>
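
The review table described above can be sketched roughly as follows. This is a minimal illustration, not the service's implementation: the toy gazetteer, its records, and the function names summarise and disambiguate are all hypothetical, and only the two default disambiguation attributes (county and feature type) are modelled.

```python
from collections import Counter

# Hypothetical toy gazetteer: placename -> candidate records.
# These records are illustrative only and are not taken from geoXwalk.
TOY_GAZETTEER = {
    "Perth": [
        {"county": "Perthshire", "feature_type": "town"},
        {"county": "Perthshire", "feature_type": "parish"},
    ],
    "Leith": [{"county": "Midlothian", "feature_type": "town"}],
}


def summarise(placenames, gazetteer):
    """Return one row per distinct placename: name, occurrences, candidate count."""
    counts = Counter(placenames)
    rows = []
    for name in sorted(counts):
        rows.append(
            {
                "placename": name,
                "occurrences": counts[name],
                "candidates": len(gazetteer.get(name, [])),
            }
        )
    return rows


def disambiguate(name, gazetteer, county=None, feature_type=None):
    """Filter candidate records by county and/or feature type, when given."""
    keep = []
    for record in gazetteer.get(name, []):
        if county is not None and record["county"] != county:
            continue
        if feature_type is not None and record["feature_type"] != feature_type:
            continue
        keep.append(record)
    return keep
```

Calling summarise on the list of names found by the parser gives the rows of the table, and disambiguate narrows the candidates for one name; because it returns a list, several gazetteer entries can remain attached to a single placename.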
    <Paragraph position="3"> As geo-parser development progresses, the interface will need to accommodate a more flexible approach to the geo-coding process, since interface requirements are determined by users, their collections of specific document types, and their output requirements. A range of functionality is required at various levels between a fully automated batch processing mode and a more interactive analytical approach to individual documents. Further investigation is required on the integration of geo-coding output into existing document metadata.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 The geo-parser
</SectionTitle>
    <Paragraph position="0"> The current architecture of the geo-parser is conceptually based on several passes across the text at varying levels of abstraction. Documents are split into blocks, and blocks into tokens. Tokens are re-constituted into sentences, and the sentences are run through a place-name finder to identify candidate placenames. The current parser implementation uses two techniques. The first applies approximately 300 different regular expressions at the token level, based on patterns from training data (The Statistical Accounts of Scotland, http://edina.ac.uk/statacc). Once all the patterns have been run on the document, a second pass is made to find likely placenames in conjunctions / disjunctions with other placenames. Other patterns are used to attempt to remove false positives such as the names of people, while others are based on the proximity of placename-like words ('shire', 'river' etc.). The second approach uses the Brill tagger (Brill, 1994) to mark each token with a part-of-speech tag, enabling rules to be applied to the text surrounding proper nouns to select likely placenames. Candidate placenames are then cross-referenced with the geoXwalk gazetteer, and a marked-up version of the original document and a summary XML version of the results are returned. The need for large quantities of experimental data in order to develop identification and disambiguation further is recognised.</Paragraph>
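
As a rough sketch of the first technique, the two passes might look like the following. The three cue patterns and the function names are hypothetical stand-ins: none of the roughly 300 expressions used by the parser are reproduced here, the patterns are applied to whole sentences rather than to tokens, and the second pass only looks for a capitalised word to the right of an already-found name. The false-positive filters and the Brill tagger approach are not covered.

```python
import re

# Illustrative cue patterns only (assumptions, not the parser's own patterns).
CUE_PATTERNS = [
    # A capitalised word ending in 'shire', e.g. "Perthshire".
    re.compile(r"\b[A-Z][a-z]+shire\b"),
    # "river" or "River" followed by a capitalised name, e.g. "river Tay".
    re.compile(r"\b[Rr]iver\s+([A-Z][a-z]+)\b"),
    # "parish of" followed by a capitalised name.
    re.compile(r"\bparish\s+of\s+([A-Z][a-z]+)\b"),
]

# Second pass template: a capitalised word joined to a known candidate
# by "and" (conjunction) or "or" (disjunction).
CONJUNCTION = r"\b{known}\s+(?:and|or)\s+([A-Z][a-z]+)\b"


def first_pass(sentence):
    """Collect names matched directly by the cue patterns."""
    found = set()
    for pattern in CUE_PATTERNS:
        for match in pattern.finditer(sentence):
            # Use the captured name when the pattern has a group,
            # otherwise the whole match.
            found.add(match.group(1) if pattern.groups else match.group(0))
    return found


def second_pass(sentence, known):
    """Collect capitalised words joined to a known name by 'and' / 'or'."""
    extra = set()
    for name in known:
        pattern = re.compile(CONJUNCTION.format(known=re.escape(name)))
        for match in pattern.finditer(sentence):
            extra.add(match.group(1))
    return extra - set(known)


def find_candidates(sentence):
    """Run both passes and return the combined set of candidate placenames."""
    known = first_pass(sentence)
    return known | second_pass(sentence, known)
```

For a sentence such as "The river Tay and Earn flow through Perthshire.", the first pass picks up 'Tay' and 'Perthshire' from their cue words, and the second pass adds 'Earn' because it is conjoined with 'Tay'. The resulting candidates would then be checked against the gazetteer.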
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 GeoXwalk
</SectionTitle>
    <Paragraph position="0"> GeoXwalk is more than just a simple lookup facility for the geo-parser as every geographic feature stored in the gazetteer has its detailed geometry stored with it.</Paragraph>
    <Paragraph position="1"> This clearly enables more complex searching. The ability to derive the relationships between features implicitly by geometric computation is significant and provides more accurate results than can be ascertained by simple lookups based on hierarchical thesauri methods, as in traditional gazetteers. When candidates are referenced against the gazetteer, geoXwalk provides a means to access its 'alternate' geographies (of which there are many in the UK) as well as a standard footprint. For example, a candidate placename 'Knowsley' could be resolved as parish code 'BX003' as well as grid reference 340900, 392300 - 347217, 397660. The result is that more powerful geographically based search strategies can be applied, e.g. 'find me all documents about Gaelic songs that do not reference the Western Isles'.</Paragraph>
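
The idea of deriving relationships by geometric computation can be illustrated with a minimal sketch. It assumes footprints are simplified to axis-aligned bounding boxes and reads the 'Knowsley' grid reference quoted above as (min easting, min northing) - (max easting, max northing); the gazetteer itself stores detailed geometry, and the BBox class and documents_outside function are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """Axis-aligned footprint in grid coordinates (easting, northing)."""

    min_e: float
    min_n: float
    max_e: float
    max_n: float

    def intersects(self, other):
        """True when the two boxes share at least one point."""
        return (
            self.max_e >= other.min_e
            and other.max_e >= self.min_e
            and self.max_n >= other.min_n
            and other.max_n >= self.min_n
        )

    def contains(self, other):
        """True when the other box lies entirely inside this one."""
        return (
            other.min_e >= self.min_e
            and self.max_e >= other.max_e
            and other.min_n >= self.min_n
            and self.max_n >= other.max_n
        )


# The 'Knowsley' extent quoted in the text, read here as a bounding box
# (an assumption about how the two coordinate pairs should be interpreted).
KNOWSLEY = BBox(340900, 392300, 347217, 397660)


def documents_outside(documents, region):
    """Return ids of documents with no footprint intersecting the region."""
    return [
        doc_id
        for doc_id, footprints in documents.items()
        if not any(fp.intersects(region) for fp in footprints)
    ]
```

A query of the form 'documents that do not reference region X' then reduces to calling documents_outside with a mapping from document id to its list of footprints. No hierarchy of place names is consulted; containment and overlap follow from the coordinates alone.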
  </Section>
</Paper>