<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0412">
  <Title>Non-Contiguous Word Sequences for Information Retrieval</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 The use of phrases in IR
</SectionTitle>
    <Paragraph position="0"> There are various ways to incorporate phrases in the document modeling. The usual technique is to consider phrases as supplementary terms of the vector space, with the same technique as for word terms. In other words, phrases are thrown into the bag of words. However, Strzalkowski and Carballo (1996) argue that using a standard weighting scheme is inappropriate for mixed feature sets (such as single words and phrases). The weight given to least frequent phrases is considered too low. Their specificity is nevertheless often crucial in order to determine the relevance of a document (Lahtinen, 2000). In weighting the phrases, the interdependency between a phrase and the words that compose it is another difficult issue to account for Strzalkowski et al. (1998).</Paragraph>
    <Paragraph position="1"> There are two main types of phrases: statistical phrases, formed by straight word occurrence counts, and syntactical phrases.</Paragraph>
    <Paragraph position="2"> Statistical Phrases. Mitra et al. (1987) form a statistical phrase for each pair of 2 stemmed adjacent words that occur in at least 25 documents of the TREC-1 collection. The selected pairs are then sorted in lexicographical order. In this technique, we see 2 problems. First, this lexicographical sorting means to ignore crucial information about word pairs: their order of occurrence ! This is equivalent to saying that AB is identical to BA. Furthermore, no gap is allowed, although it is frequent to represent the same concept by adding at least one word between two others. For example, this definition of a phrase does not permit to note any similarity between the two text fragments &amp;quot;XML document retrieval&amp;quot; and &amp;quot;XML retrieval&amp;quot;. This model is thus quite far from natural language.</Paragraph>
    <Paragraph position="3"> Syntactical Phrases. The technique presented by Mitra et al. (1987) for extracting syntactical phrases is based on a parts-of-speech analysis (POS) of the document collection. A set of tag sequence patterns are predefined to be recognized as useful phrases. All maximal sequences of words accepted by this grammar form the set of syntactical phrases. For example, a sequence of words tagged as &amp;quot;verb, cardinal number, adjective, adjective, noun&amp;quot; will constitute a syntactical phrase of size 5. Every sub-phrase occurring in this same order is also generated, with an unlimited gap (e.g., the pair &amp;quot;verb, noun&amp;quot; is also generated). This technique offers a sensible representation of natural language. Unfortunately, to obtain the POS of a whole document collection is very costful. The index size is another issue, given that all phrases are stored, regardless of their frequency. In the experiments, the authors indeed admit to creating no index a priori, but instead that the phrases were generated according to each query.</Paragraph>
    <Paragraph position="4"> This makes the process tractable, but implies very slow answers from the retrieval system, and quite a long wait for the end user.</Paragraph>
    <Paragraph position="5"> On top of computational problems, we see a few further issues. First, the lack of a minimal frequency threshold to reduce the number of phrases in the index. This means that unfrequent phrases are taking up most of the space, and have a big influence on the results, whereas their low frequency may simply illustrate an inadequate use or a typographical error. To allow an illimited gap so as to generate subpairs is dangerous as well: the phrase &amp;quot;I like to eat hot dogs&amp;quot; will generate the subpair &amp;quot;hot dogs&amp;quot;, but it will also generate the subpair &amp;quot;like dogs&amp;quot;, whose semantical meaning is very far from that of the original sentence.</Paragraph>
    <Paragraph position="6"> Other types of phrases. Many efficient techniques exist to extract multiword expressions, collocations, lexical units and idioms (Church and Hanks, 1989; Smadja, 1993; Dias et al., 2000; Dias, 2003). Unfortunately, very few have been applied to information retrieval with a deep evaluation of the results.</Paragraph>
    <Paragraph position="7"> Maximal Frequent Sequences. We propose Maximal Frequent Sequences (MFS) as a new alternative to account for word ordering in the modeling of textual documents. One of their strength is the fact that they are extracted if and only if they occur more often than a given frequency threshold, which hopefully permits to avoid storing the numerous least significant phrases. A gap between words is allowed within the extraction process itself, permitting to deal with a larger variety of language.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Maximal Frequent Sequences
</SectionTitle>
    <Paragraph position="0"> In our approach, we represent documents by word features within the vector space model, and by Maximal Frequent Sequences, accounting for the sequential aspect of text. For each of those two representations, a Retrieval Status Value (RSV) is computed. Those values are later combined to form a single RSV per document. null</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Definition and Extraction
Technique
</SectionTitle>
      <Paragraph position="0"> MFS are sequences of words that are frequent in the document collection and, moreover, that are not contained in any other longer frequent sequence. Given a frequency threshold s, a sequence is considered to be frequent if it appears in at least s documents.</Paragraph>
      <Paragraph position="1"> Ahonen-Myka (1999) presents an algorithm combining bottom-up and greedy methods, which permits to extract maximal sequences without considering all their frequent subsequences. This is a necessity, since maximal frequent sequences in documents may be rather long.</Paragraph>
      <Paragraph position="2"> Nevertheless, when we tried to extract the maximal frequent sequences from the collection of documents, their number and the total number of word features in the collection did pose a clear computational problem and did not actually permit to obtain any result.</Paragraph>
      <Paragraph position="3"> To bypass this complexity problem, we decomposed the collection of documents into several disjoint subcollections, small enough so that we could efficiently extract the set of maximal frequent sequences of each subcollection. Joining all the sets of MFS', we obtained an approximate of the maximal frequent sequence set for the full collection.</Paragraph>
      <Paragraph position="4"> We conjecture that more consistent subcollections permit to obtain a better approximation. This is due to the fact that maximal frequent sequences are formed from similar text fragments. Accordingly, we formed the subcollection by clustering similar documents together using the well-known k-means algorithm (see for example Willett (1988) or Doucet and Ahonen-Myka (2002)).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Main Strengths of the Maximal
Frequent Sequences
</SectionTitle>
      <Paragraph position="0"> The method efficiently extracts all the maximal frequent word sequences from the collection. From the definitions above, a sequence is said to be maximal if and only if no other frequent sequence contains that sequence.</Paragraph>
      <Paragraph position="1"> Furthermore, a gap between words is allowed: in a sentence, the words do not have to appear continuously. A parameter g tells how many other words two words in a sequence can have between them. The parameter g usually gets values between 1 and 3.</Paragraph>
      <Paragraph position="2"> For instance, if g = 2, a phrase &amp;quot;President Bush&amp;quot; will be found in both of the following text fragments: ..President of the United States Bush..</Paragraph>
      <Paragraph position="3"> ..President George W. Bush..</Paragraph>
      <Paragraph position="4"> Note: Articles, prepositions and small words were pruned away during the preprocessing.</Paragraph>
      <Paragraph position="5"> This allowance of gaps between words of a sequence is probably the strongest specificity of the method, compared to most existing methods for extracting text descriptors. This greatly increases the quality of the phrase, since processing takes the variety of natural language into account.</Paragraph>
      <Paragraph position="6"> The other powerful specificity of the technique is the ability to extract maximal frequent sequences of any length. This permits to obtain a very compact description of documents. For example, by restricting the length of phrases to 8, the presence, in the document collection, of a frequent phrase of 25 words would result in thousands of phrases representing the same knowledge as this one maximal sequence.</Paragraph>
      <Paragraph position="7"> The result of this extraction is that each document of the collection is described by a (possibly empty) set of MFS.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Evaluating Documents
</SectionTitle>
    <Paragraph position="0"> Once documents and queries are represented within our two models, a way to estimate the relevance of a document with respect to a query remains to be found. As mentioned earlier, we compute two separate RSV values for the word features vector space model and the MFS model. In the second step, we aggregate these two RSVs into one single relevance score for each document with respect to a query.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Word features RSV
</SectionTitle>
      <Paragraph position="0"> The vector space model offers a very convenient framework for computing similarities between documents and queries. Indeed, there exist a number of techniques to compare two vectors, Euclidean distance, Jaccard and cosine similarity being the most frequently used in IR.</Paragraph>
      <Paragraph position="1"> We have used cosine similarity because of its computational efficiency. By normalizing the vectors, which we did in the indexing phase, cosine([?]-d1,[?]-d2) indeed simplifies to the vector product (d1 *d2).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 MFS RSV
</SectionTitle>
      <Paragraph position="0"> The first step is to create an MFS index for the document collection. Once a set of maximal frequent sequences has been extracted and each document is attached to the corresponding phrases, as detailed in the previous section, it remains to define the procedure to match a phrase describing a document and a keyphrase (from a query).</Paragraph>
      <Paragraph position="1"> Note that from here onwards, keyphrase denotes a phrase found in a query, and maximal sequence denotes a phrase extracted from a document. null Our approach consists in decomposing keyphrases of the query into pairs. Each of these pairs is bound to a score representing its quantity of relevance. Informally speaking, the quantity of relevance of a word pair tells how much it makes a document relevant to include an occurrence of this pair. This value depends on the specificity of the pair (expressed in terms of inverted document frequency) and modifiers, among which is an adjacency coefficient, reducing the quantity of relevance given to a pair formed by two words that are not adjacent.</Paragraph>
      <Paragraph position="2">  Let D be a collection of N documents and A1 ...Am a keyphrase of size m. Let Ai and Aj be 2 words of A1 ...Am occurring in this order, and n be the number of documents of the collection in which AiAj was found. We define the quantity of relevance of the pair AiAj to be:</Paragraph>
      <Paragraph position="4"> and when decomposing the keyphrase A1 ...Am into pairs, adj(AiAj) is a score modifier to penalize word pairs AiAj formed from non-adjacent words, and d(Ai,Aj) indicates the number of words appearing between the two words Ai and Aj (d(Ai,Aj) = 0 signifies that</Paragraph>
      <Paragraph position="6"> Accordingly, the larger the distance between the two words, the lower a quantity of relevance is attributed to the corresponding pair. In our runs, we will actually ignore distances higher than 1 (i.e., (k &gt; 1) = (ak = 0)).</Paragraph>
      <Paragraph position="7">  For example, ignoring distances above 1, a keyphrase ABCD is decomposed into 5 tuples (pair, adjacency coefficient): (AB, 1), (BC, 1), (CD, 1), (AC, a1), (BD, a1) Let us compare this keyphrase to the documents d1,d2,d3,d4 and d5, described respectively by the frequent sequences AB, AC, AFB, ABC and ACB. The corresponding quantities of relevance brought by the keyphrase ABCD are shown in table 1. Note that in practice, we lost the maximality property during the partitionjoin step presented in subsection 4.1. Hence, there can be a frequent sequence AB together with a frequent sequence ABC, if they were extracted from two different document clusters. Assuming equal idf values, we observe that the quantities of relevance form a coherent order. The longest matches rank first, and matches of equal size are untied by adjacency. Moreover, non-adjacent matches (AC and ABC) are not ignored as in many other phrase representations (Mitra et al., 1987).</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 Aggregated RSV
</SectionTitle>
      <Paragraph position="0"> In practice, some queries do not contain any keyphrase, and some documents do not contain any MFS. However, there can of course be correct answers to these queries, and those documents must be relevant to some queries. Also, all documents containing the same matching phrases get the same MFS RSV. Therefore, it is necessary to find a way to separate them. The word-based cosine similarity measure is very appropriate for that.</Paragraph>
      <Paragraph position="1"> Another natural response would have been to re-decompose the pairs into single words and form document vectors accordingly. However, this would not be satisfying, because the least frequent words are all missed by the algorithm for MFS extraction. An even more important category of missed words is that of frequent words that do not frequently co-occur with other words. The loss would be considerable. null This is the reason to compute another RSV using a basic word-features vector space model.  To combine both RSVs to one single score, we must first make them comparable by mapping them to a common interval. To do so, we used Max Norm, as presented by Vogt and Cottrell (1998), which permits to bring all positive scores within the range [0,1]: New Score = Old ScoreMax Score Following this normalization step, we aggregate both RSVs using a linear interpolation factor l representing the relative weight of scores obtained with each technique (similarly as in Marx et al. (2002)).</Paragraph>
      <Paragraph position="2"> Aggregated Score = l*RSVWord Features+(1[?]l)*RSVMFS The evidence of experiments with the INEX 2002 collection showed good results when weighting the single word RSV with the number of distinct word terms in the query (let a be that number), and the MFS RSV with the number of distinct word terms found in keyphrases of the query (let b be that number). Thus: l = aa+b For example, in Figure 1 showing topic 47, there are 11 distinct word terms and 7 distinct word terms occurring in keyphrases. Thus, for this topic, we have l = 1111+7.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Experiments and Results
</SectionTitle>
    <Paragraph position="0"> We based our experiments on the 494Mb INEX document collection (Initiative for the Evaluation of XML retrieval1). INEX was created in 2002 to compensate the lack of an evaluation forum for the XML information retrieval.</Paragraph>
    <Paragraph position="1"> This collection consists of 12,107 scientific articles written in English from IEEE journals, combined to a set of queries and corresponding manual assessments. The specificity of this</Paragraph>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
ABCD
</SectionTitle>
    <Paragraph position="0"> document collection is its rich logical structure into sections, subsections, paragraphs, lists, etc.</Paragraph>
    <Paragraph position="1"> However, in the present experiments, we ignore this structure and only exploit plain text to return full articles as our candidate retrieval answers. null The manual assessments indeed tell us which candidate answers are relevant and which ones are not. We use these relevance values to compute precision and recall measures, which permit scoring each set of candidate answers, and equivalently the means by which each set was obtained. In our experiments, we used average precision over the n first hits as our main reference. This evaluation measure was first introduced by Raghavan et al. (1989) and was used as the official evaluation measure in the INEX 2002 campaign (G&amp;quot;overt et al., 2003).</Paragraph>
    <Paragraph position="2"> Protocol of the Experiments. As a baseline, we computed and evaluated a run using only single word terms, as detailed in section 2. Our goal was to compare our new technique to the state of the art. Thus we computed one run using our technique (aggregating the MFS RSVs and the single word term RSVs topic-wise, with the weighting scheme mentioned hereabove), and one run by calculating all statistical phrases following the definition of Mitra et al. (1987). The only difference is that we did not set a minimal document frequency threshold. We made this choice from the standpoint that our aim was not to measure efficiency, but the quality of the results. The corresponding number of features is given in table 2. We extracted 328,289 MFS of different sizes. Their splitting forms no more than 674,257 pairs (this number is probably lower because the same pair can be extracted from numerous MFS).</Paragraph>
    <Paragraph position="3"> MFS vs. Statistical Phrases. For those representations, the average precision for the n first retrieved documents are presented in table 3. We learn two things from those results.</Paragraph>
    <Paragraph position="4">  ear combinations First, the fact that phrases improve results in lower levels of recall is confirmed, as greater improvement is obtained when we check further down the ranked list. Second, our technique outperforms that of statistical phrases. However, as we use different phrases indeed, but also a different technique to match them against queries, it remains to find out whether the improvement stems from the MFS themselves, from the way they are used, or from both. Thus we experimented with various linear combinations to aggregate the word term RSV and the statistical phrase RSV. The results are presented in table 4. The technique of gathering word and pairs features within the same vector space clearly performs better in this case. Therefore, the better performance of MFS is not only due to the aggregation weigthing scheme presented in subsection 5.3. This underlines their intrinsic quality as document descriptors.</Paragraph>
  </Section>
class="xml-element"></Paper>