<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0103">
  <Title>Semi-supervised learning of geographical gazetteers from the internet</Title>
  <Section position="2" start_page="0" end_page="1" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Reasoning about locations is essential for many NLP tasks, such as Information Extraction.</Paragraph>
    <Paragraph position="1"> Knowledge of place names normally comes from a Named Entity Recognition module. Unfortunately, most state-of-the-art Named Entity Recognition systems support only very coarse-grained classifications and thus can distinguish only between locations and non-locations.</Paragraph>
    <Paragraph position="2"> One of the main components of a Named Entity Recognition system is a gazetteer -- a huge list of preclassified entities. It has been shown in (Mikheev et al., 1999) that a NE Recognition system performs reasonably well for most classes even without gazetteers. Locations, however, could not be reliably identified (51.7% F-measure without gazetteers compared to 94.5% with a full gazetteer). Obviously, when one needs more fine-grained classes, including various types of locations, gazetteers become even more important.</Paragraph>
    <Paragraph position="3"> One possible solution would be to create gazetteers manually, using World atlases, lists of place names on the Web, and already existing digital collections, such as (ADL, 2000). This task is only feasible, of course, when those resources have compatible formats and can thus be merged automatically; otherwise it becomes very time-consuming. Manually compiled gazetteers can provide high-quality data. Unfortunately, these resources have some drawbacks. First, some items can simply be missing. For example, the atlases (Knaur, 1994), (Philip, 2000), and (Collins, 2001) that we used in our study do not list small islands, rivers, and mountains. Such gazetteers contain only positive information: if CG is not classified as an ISLAND, we cannot say whether there really is no island with the name CG or the gazetteer is simply incomplete. Another problem arises when one wants to change the underlying classification, for example, subdividing CITY into CAPITAL and NON-CAPITAL. In this case it might be necessary to reclassify all (or a substantial part of) the items; when done manually, this again becomes a very time-consuming task. Finally, geographical names vary across languages. It takes a lot of time to adapt a French gazetteer to German, and such a resource can hardly help much for languages with non-Latin alphabets, such as Armenian or Japanese. Even collecting different variants of proper names within one language is a non-trivial task; one possible solution was proposed in (Smith, 2002).</Paragraph>
    <Paragraph position="4"> At least some information on almost any particular location already exists somewhere on the Internet. The only problem is that this knowledge is highly distributed over millions of web pages and, thus, difficult to find.</Paragraph>
    <Paragraph position="5"> This leads us to the conclusion that one can exploit standard Data Mining techniques in order to induce gazetteers from the Internet (semi-)automatically. As has been shown recently in (Keller et al., 2002), Internet counts produce reliable data for linguistic analysis, correlating well with corpus statistics and plausibility judgments.</Paragraph>
    <Paragraph position="6"> In this paper we present an approach for learning geographical gazetteers using very scarce resources. This work is a continuation of our previous study (Ourioupina, 2002), described briefly in Section 3. In that work we obtained collocational information from the Internet using a set of manually precompiled patterns. The system used this information to learn six binary classifiers, determining for a given word whether it is a CITY, ISLAND, RIVER, MOUNTAIN, REGION, or COUNTRY. Although this approach helped us to reduce hand-coding drastically, we still needed some manually encoded knowledge. In particular, we spent a lot of time looking for a reasonable set of patterns. In addition, we had to compile a small gazetteer (see Section 2 for details) to be able to train and test the system. Finally, we were only able to classify items provided by users, not to acquire new place names automatically. Classifiers, unlike gazetteers, produce negative information (X is not an ISLAND), but they are slower because they require Internet counts. A combination of classifiers and gazetteers would do the job better.</Paragraph>
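The pattern-based classification idea can be sketched as follows. This is a minimal illustration, not the paper's exact setup: the patterns, thresholds, and the get_hit_count() lookup are all hypothetical stand-ins (a real system would query a search engine such as AltaVista).

```python
# Illustrative sketch: decide each class for a name by summing
# search-engine hit counts for "pattern + name" queries and applying
# a threshold. All patterns, counts and thresholds are assumptions.

PATTERNS = {
    "CITY": ["the city of {}", "{} is a city"],
    "ISLAND": ["the island of {}", "{} island"],
    "RIVER": ["the river {}", "{} flows"],
}

def get_hit_count(query):
    """Stand-in for a Web search-engine page-count lookup."""
    fake_counts = {"the city of Tokyo": 120000, "Tokyo is a city": 8000}
    return fake_counts.get(query, 0)

def classify(name, threshold=1000):
    """Return the set of classes whose summed pattern counts pass the threshold."""
    labels = set()
    for cls, patterns in PATTERNS.items():
        total = sum(get_hit_count(p.format(name)) for p in patterns)
        if total >= threshold:
            labels.add(cls)
    return labels
```

With real Web counts, a name can pass the threshold for several classes at once, which matches the binary-classifier formulation used here.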
    <Paragraph position="7"> In our present study we attempt to overcome these drawbacks by applying bootstrapping, as described in (Riloff and Jones, 1999). Bootstrapping is a machine learning approach that efficiently combines a small portion of labeled (seed) examples with a much larger amount of unlabeled data. Riloff and Jones have shown that even with only a dozen preclassified items, bootstrapping-based algorithms perform well if a reasonable amount of unlabeled data is available.</Paragraph>
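The core bootstrapping loop can be sketched in the spirit of Riloff and Jones (1999): alternate between scoring extraction patterns by how many known names they match and adding the best new names those patterns extract. The corpus, the scoring function, and the selection strategy below are deliberately simplified assumptions, not the published algorithm in full.

```python
# Simplified bootstrapping sketch (assumptions, not the exact
# Riloff & Jones mutual-bootstrapping algorithm): grow a lexicon
# from a few seeds using pattern/name co-occurrence.

def bootstrap(seeds, corpus_matches, rounds=2, per_round=1):
    """
    seeds: initial known names of one class.
    corpus_matches: pattern -> set of names that pattern extracts.
    Returns the grown lexicon after a few rounds.
    """
    lexicon = set(seeds)
    for _ in range(rounds):
        # Score each pattern by how many currently known names it extracts.
        scored = sorted(corpus_matches.items(),
                        key=lambda kv: len(kv[1] & lexicon),
                        reverse=True)
        # Add the unseen names extracted by the top-scoring pattern(s).
        new_names = set()
        for pattern, names in scored[:per_round]:
            new_names |= names - lexicon
        if not new_names:
            break
        lexicon |= new_names
    return lexicon

# Toy corpus: two hypothetical patterns and the names they extract.
matches = {
    "located on the island of X": {"Crete", "Hokkaido", "Java"},
    "X stock exchange": {"Tokyo", "Java"},
}
grown = bootstrap({"Crete"}, matches, rounds=1)
```

On noisy Web data this loop degrades quickly without extra control: one wrong name boosts bad patterns, which add more wrong names, which is exactly the contamination problem discussed next.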
    <Paragraph position="8"> It must be noted, however, that Riloff and Jones ran their algorithm on a carefully prepared, balanced corpus.</Paragraph>
    <Paragraph position="9"> It is not a priori clear whether bootstrapping is suitable for data as noisy as the World Wide Web. Brin (1998) describes a similar approach aimed at mining (book title, author) pairs from the Internet. Although his system was able to extract many such pairs (even some very rare ones), it needed a human expert to check its results; otherwise the book list could quickly become contaminated and the system's performance would deteriorate. This problem is extremely acute when dealing with huge noisy datasets.</Paragraph>
    <Paragraph position="10"> In our approach we apply bootstrapping techniques to six classes.</Paragraph>
    <Paragraph position="11">  By comparing the results obtained for the different classes, we are able to reduce the noise substantially. Additionally, we use Machine Learning to select the most reliable candidates (names and patterns). Finally, we use the seed examples and learned classifiers not only to initialize and continue the processing, but also as another means of controlling the noise. This allows us to avoid expensive manual checking.</Paragraph>
    <Paragraph position="12"> The approach is described in detail in Section 4 and evaluated in Section 5.</Paragraph>
    <Paragraph position="13">  Riloff and Jones also had several classes, but these were processed largely separately.</Paragraph>
    <Paragraph position="14">  However, incorporating additional classes is not problematic. As the classes may overlap (for example, Washington belongs to the classes CITY, REGION, ISLAND and MOUNTAIN), the problem was reformulated as six binary classification tasks.</Paragraph>
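The reformulation as six binary tasks can be shown concretely. The Washington example below follows the text; the helper function and class inventory layout are illustrative.

```python
# Sketch: recast the overlapping six-class problem as six independent
# yes/no decisions per name, so a name like Washington can be positive
# for several classes at once.

CLASSES = ["CITY", "ISLAND", "RIVER", "MOUNTAIN", "REGION", "COUNTRY"]

# Gold labels as a set of classes rather than a single class.
gold = {"Washington": {"CITY", "REGION", "ISLAND", "MOUNTAIN"}}

def to_binary_tasks(name, labels):
    """One (name, yes/no) training instance per class."""
    return {cls: cls in labels for cls in CLASSES}

tasks = to_binary_tasks("Washington", gold["Washington"])
```

Each of the six classifiers is then trained and evaluated independently on its own yes/no instances, which also makes it cheap to add further classes later.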
    <Paragraph position="15"> Our main dataset consists of 1260 names of locations.</Paragraph>
    <Paragraph position="16"> Most of them were sampled randomly from the indexes of the World Atlases (Knaur, 1994), (Collins, 2001), and (Philip, 2000). However, this random sample contained mostly names of very small and unknown places. In order to balance it, we added a list of several countries and well-known locations, such as Tokyo or Hokkaido. Finally, our dataset contains about 10% low-frequency names, 10% high-frequency names (the most frequent one, California, was found by AltaVista in about 25,000,000 pages), and 80% medium-frequency ones.</Paragraph>
    <Paragraph position="17"> These names were classified manually using the above-mentioned atlases and the Statoids webpage (Law, 2002).</Paragraph>
    <Paragraph position="18"> The same dataset was used in our previous experiments as well. An example of the classification is shown in Table 1. For the present study we randomly sampled 100 items of each class from this gazetteer. This resulted in six lists (of CITIES, ISLANDS, ...). As many names refer to several geographical objects, these lists overlap to some extent (for example, Victoria is in both the ISLAND and MOUNTAIN lists). Altogether the lists contain 520 different names of locations. The remaining part of the gazetteer (740 items) was reserved for testing. Both training and testing items were preclassified by hand: although Washington is only in the MOUNTAIN list, the system knows that it can be a CITY, a REGION, or an ISLAND as well (we also tried to relax this requirement; see Section 5.2 for details).</Paragraph>
  </Section>
</Paper>