<?xml version="1.0" standalone="yes"?>
<Paper uid="C94-2165">
  <Title>CATCHING THE CHESHIRE CAT</Title>
  <Section position="4" start_page="1021" end_page="1021" type="metho">
    <SectionTitle>
MATERIAL
</SectionTitle>
    <Paragraph position="0"> The material is Alice's Adventures in Wonderland by Lewis Carroll, available in electronic format via email from Project Gutenberg. The text contains 27332 words, of which 2576 are unique, making up a total of 14509 unique word pairs. Alice in Wonderland was chosen because it is a well-known text, it contains some phrases that we know are in there (e.g. March Hare), and it contains a sufficient number of words, and variations of words, to be interesting for the experiment. Studies could be done for other collections of texts, e.g. medical abstracts.</Paragraph>
    <Paragraph position="1"> As more documents become available, comparisons between documents can be done (Steier &amp; Belew, 1991). This experiment contains only within-text comparisons of phrases for one specific text.</Paragraph>
  </Section>
  <Section position="5" start_page="1021" end_page="1021" type="metho">
    <SectionTitle>
METHOD
</SectionTitle>
    <Paragraph position="0"> For each of the unique words in the text, the frequencies of all immediately following words were collected. In this experiment, no filtering of the text was performed. Some initial experiments were performed with a stoplist to remove function words and some other common words (see Fox, 1992, for details). Some simple stemming was also tried, e.g. removing 's' and 'ed' from the end of words. Stemming may lead to difficulties in distinguishing compounds from noun-verb complexes. It is not clear whether the pros of using stemming outweigh the cons; consequently, we decided to work with the raw text. Stoplists and stemming might be more important when the ordinary g-measure is used.</Paragraph>
  </Section>
</Paper>