<?xml version="1.0" standalone="yes"?>
<Paper uid="H89-2013">
  <Title>Enhanced Good-Turing and Cat-Cal: Two New Methods for Estimating Probabilities of English Bigrams (abbreviated version)</Title>
  <Section position="2" start_page="0" end_page="82" type="intro">
    <SectionTitle>
1. Materials
</SectionTitle>
    <Paragraph position="0"> Our corpus was selected from articles distributed by the Associated Press (AP) during 1988. Some portions of the year were lost. The remainder was processed automatically by Riley and Liberman to remove nearly identical articles. There remained N = 4.4x107 words in the corpus, with a vocabulary of V = 400,653. When we speak of &amp;quot;words,&amp;quot; we use a common term to hide a number of processing decisions. Roughly, a word is a string of characters delimited by white space. For instance, The and the are different words, and so are need and needs. In addition, punctuation such as period and comma are treated as &amp;quot;words&amp;quot;. Additional tokens are inserted automatically to delimit sentences, paragraphs and discourses. In the future we hope to use a more balanced sample of general English. However, for the purpose of testing methods, a large sample is desirable; the the AP corpus is considerably larger than alternatives such as the Brown Corpus. The vocabulary size is also considerably larger than the 5000 word vocabulary reported in (N~das, 1984).</Paragraph>
    <Paragraph position="1">  We split the 1988 AP wire into two halves by assigning bigrams beginning with even numbered words to one sample, those beginning with odd numbered words to the other. It is important that we made this split by taking every other bigram. We have found that spliting the corpus into two half-year periods, for example, generates two quite different samples, which complicates matters considerably. Since our aim is to study methods, we have adopted this extreme measure in order to construct two very similar samples.</Paragraph>
    <Paragraph position="2"> Our goal is to develop a methodology for extending an n-gram model to an (n+l)-gram model. We regard the model for unigrams as completely fixed before beginning to study bigrams. This includes specifying V, the vocabulary, and e(p(x)), an estimate of the probability of each word. We also suppose that variances of the estimates are known. Likewise, we would regard a bigram model as fixed before studying a trigram model.</Paragraph>
  </Section>
class="xml-element"></Paper>