<?xml version="1.0" standalone="yes"?> <Paper uid="W03-1804"> <Title>Using Masks, Suffix Array-based Data Structures and Multidimensional Arrays to Compute Positional Ngram Statistics from Corpora</Title> <Section position="4" start_page="0" end_page="3" type="intro"> <SectionTitle> CCFFN </SectionTitle> <Paragraph position="0"> Equation 1: Number of positional ngrams. To illustrate this equation, 4,299,742 positional ngrams (n = 1..7) would be generated from a 100,000-word corpus with a seven-word window context. In comparison, only 700,000 ngrams would be computed for the classical ngram model. Clearly, considerable effort is needed to process positional ngram statistics in reasonable time and space.</Paragraph> <Paragraph position="1"> In this paper, we describe an implementation that computes the Frequency and the Mutual Expectation (Dias et al. 1999) of any positional ngram with time complexity O(h(F) N log N). The global architecture is based on the definition of masks that make it possible to represent any positional ngram in the corpus virtually. We thus follow the Virtual Corpus approach introduced by Kit and Wilks (1998) and apply a suffix-array-like method, coupled with the Multikey Quicksort algorithm (Bentley and Sedgewick, 1997), to compute positional ngram frequencies.</Paragraph> <Paragraph position="2"> Finally, a multidimensional array is built to simplify the computation of the Mutual Expectation, an association measure for collocation extraction.</Paragraph> <Paragraph position="3"> The evaluation of our C++ implementation was carried out on the CETEMPublico corpus and shows satisfactory results. 
For example, it takes 8.59 minutes to compute both the frequency and the Mutual Expectation for a 1,092,723-word corpus on an Intel Pentium III 900 MHz personal computer with a seven-word window context.</Paragraph> <Paragraph position="4"> This article is divided into four sections: (1) we explain the basic principles of positional ngrams and the mask representation used to build the Virtual Corpus; (2) we present the suffix-array-based data structure that allows counting occurrences of positional ngrams; (3) we show how a multidimensional array supports the efficient computation of the Mutual Expectation; (4) we present results over sub-corpora of different sizes from the CETEMPublico corpus.</Paragraph> </Section> </Paper>