<?xml version="1.0" standalone="yes"?>
<Paper uid="T87-1005">
  <Title>WORDS AND WORLDS</Title>
  <Section position="1" start_page="0" end_page="16" type="abstr">
    <SectionTitle>
WORDS AND WORLDS
</SectionTitle>
    <Paragraph position="0"> For several years now I have been concerned with how artificial intelligence is going to build the substitute for human world knowledge needed in performing the task of text understanding.</Paragraph>
    <Paragraph position="1"> I continue to believe that the bulk of this knowledge will have to be derived from existing machine-readable texts produced as a byproduct of computer typesetting and word processing technologies which have overtaken the publishing industries. However, there are many obstacles to the acquisition of world knowledge from text.</Paragraph>
    <Paragraph position="2"> There are some, I am sure, who would argue that world knowledge of the form needed in text understanding will have to be hand-coded and cannot be derived from existing reference books or other texts. My basic argument against those who hold this view is that they are ignoring the magnitude of the task ahead. Whether measured in terms of bytes or man-years, the sum of recorded knowledge is so massive that it is unlikely to be capable of being recoded in anything less than man-centuries.</Paragraph>
    <Paragraph position="3"> Put another way, there currently exist sizeable publishing empires in this country which every day employ hundreds of people involved directly in the coding of information for new reference texts and revised editions of older reference works. To attempt a recoding of world knowledge solely for use in AI would eventually become an attempt to parallel this effort. It would become a major industry in itself. Thus, it is more likely that, instead of a new knowledge-base industry, we will see an evolutionary change in the methods used by the existing publishing empires to record knowledge in a manner that is of use in producing text both for human consumption and as knowledge bases for computers. Researchers in AI and computational linguistics therefore have some responsibility to determine how the existing printed knowledge can evolve into usable computational world knowledge. Now, of course, I do admit there are subclasses of world knowledge that evidence to date has not shown to exist in print at all. Jerry Hobbs is attempting to codify one such subclass in his work on TACITUS (Hobbs et al. 1986). There are others as well, such as some forms of linguistic knowledge. However, I am concerned about the very large body of knowledge that we try to communicate to people through books, newspapers and other texts. This knowledge of the outside world, of experiences in which the individual has not been and in fact may never personally be involved, is nevertheless shared knowledge known to all of us through reading and listening to the words of others.</Paragraph>
    <Paragraph position="4"> Another assumption, and one that has been guiding my work for many years now, is that natural language systems cannot understand text for which they do not possess the lexicon. This seems so elemental an assumption that I find it hard to see how we can ignore the fact that we do not have a lexicon of any real world text as common as a newspaper.</Paragraph>
    <Paragraph position="5"> What is in this missing lexicon? The problem has several parts. First, it now seems clear that even unabridged dictionaries miss sizeable amounts of the lexicon needed to do lexical recognition in a newspaper such as The New York Times. Earlier results (Walker &amp; Amsler 1986) have shown that some of this lexicon was excluded from the dictionaries by choice, such as the proper nouns, but more recent research has revealed that even here the problem is more complex. Proper nouns are not quite lexical in nature. They possess a grammatical structure which some researchers have noted (Carroll 1985). This is to say that a typical proper noun has a variety of forms which tend to make the use of a single lexical entry for the proper noun less computationally useful than for a common noun.</Paragraph>
    <Paragraph position="6"> Thus we recognize, &amp;quot;International Business Machines Corporation's Thomas J. Watson Research Center at Yorktown Heights, New York,&amp;quot; as the same thing as</Paragraph>
  </Section>
</Paper>