<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1210">
  <Title>Finding Structure via Compression</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> The modelling of a symbol sequence requires some assumptions about the nature of the process which generated it, and the modelling of English text would, for example, commonly make the assumption that the text consists of words, short strings which usually recur and which are separated by whitespace, and punctuation symbols. The whitespace symbol (which we shall represent explicitly by A) and its distinctive function do not seem to occur in spoken English, and would not seem to be essential in written English. We are concerned with finding such structure using weak assumptions, rather than being given it as part of a model.</Paragraph>
    <Paragraph position="1"> In this paper we show that a statistical model may be used to do just that, and our results indicate that a model which bootstraps itself using this structure undergoes a reduction in perplexity.</Paragraph>
  </Section>
</Paper>