<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1804">
  <Title>Using Masks, Suffix Array-based Data Structures and Multidimensional Arrays to Compute Positional Ngram Statistics from Corpora</Title>
  <Section position="9" start_page="18" end_page="18" type="concl">
    <SectionTitle>
6 Conclusion
</SectionTitle>
    <Paragraph position="0"> In this paper, we have described an implementation that computes positional ngram statistics using masks, suffix array-based data structures and multidimensional arrays. Our C++ solution takes 8.59 minutes to compute both frequency and Mutual Expectation for a 1,092,723-word corpus on an Intel Pentium III 900 MHz with a seven-word context window. Our architecture exhibits O(h(F) N log N) time complexity. To some extent, this work responds to the conclusion of Kit and Wilks (1998), who claim that &quot;[...] a utility for extracting discontinuous co-occurrences of corpus tokens, of any distance from each other, can be implemented based on this program [The</Paragraph>
  </Section>
</Paper>