
<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0601">
  <Title>What's Happened Since the First SIGDAT</Title>
  <Section position="1" start_page="0" end_page="0" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> The first workshop on Very Large Corpora was held just before the 1993 ACL meeting in Columbus, Ohio. The turnout was even greater than anyone could have predicted (or else we would have called the meeting a conference rather than a workshop).</Paragraph>
    <Paragraph position="1"> We knew that corpus-based language processing was a &amp;quot;hot area,&amp;quot; but we didn't appreciate just how hot it would turn out to be.</Paragraph>
    <Paragraph position="2"> The 1990s were: witnessing a resurgence of interest in 1950s-style empirical and statistical methods of language analysis, Empiricism was at its peak in the 1950s, dominating a broad set of fields ranging from psychology (behaviorism) to electrical engineering (information theory). At that time, it was common practice in linguistics to classify words not only on the basis of their meanings but also on the basis of their co-occurrence with other words. Firth, a leading figure in British linguistics during the 1950s, summarized the approach with the memorable line: &amp;quot;You shall know a word by the company it keeps.&amp;quot; Regrettably, interest in empiricism faded in the late 1950s and early 1960s with a number of significant events including Chomsky's criticism of n-grams in Syntactic Structures (Chomsky, 1957) and Minsky and Papert's criticism of neural networks in Pereeptrons (Minsky and Papert, 1969).</Paragraph>
    <Paragraph position="3"> Perhaps the most immediate reason for this empirical renaissance is the availability of massive quantities of data: text is available like never before. Just ten years earlier, the one-million word Brown Corpus (Francis and Kucera, 1982) was considered large, but these days, everyone has access to the web. Experiments are routinely carried out on many gigabytes of text. Some researchers are even working with terabytes. null The big difference since the first SIGDAT meeting in 1993 is that large corpora are now having a big impact on ordinary users. Web search engines/portals are an obvious example. Managing gigabytes is not only the title of a popular book that recently came out with a second edition (Moffat, Bell and Witten, 1999), but it is something that ordinary users are beginning to take for granted. Recent progress in Information Retrieval and Digital Libraries is worth a fortune (according to the stockmarket). Speech Recognition and Machine Translation are also changing the world. If you walk into any software store these days, you will find a shelf full of speech recognition and machine translation products. And it is getting so you can't use the telephone these days without talking to a computer.</Paragraph>
  </Section>
class="xml-element"></Paper>