<?xml version="1.0" standalone="yes"?> <Paper uid="H94-1071"> <Title>Learning from Relevant Documents in Large Scale Routing Retrieval</Title> <Section position="4" start_page="358" end_page="360" type="metho"> <SectionTitle> 3. RELEVANT SUBDOCUMENT SELECTION STRATEGIES </SectionTitle> <Paragraph position="0"> Our approach to uneven full-text collections [3,6,8] has been to segment long documents at the next paragraph boundary after a run of 360 words, giving subdocument units of more uniform length. Documents containing multiple unrelated stories with detectable separation markers are also segmented at the markers.</Paragraph> <Paragraph position="1"> This approach may impact favorably on: 1) precision, because shorter, more local units may diminish chance occurrences of terms used in senses different from what is intended; 2) term weighting, because unrealistic probability estimates of term weights may be avoided; 3) query training and expansion, because long documents may have unrelated and irrelevant topics and concepts that can add noise to these operations; 4) retrieval output display, because one can narrow down to the relevant portion of a long document for the user; and 5) general efficiency, because of handling multiple, more uniform subdocuments instead of one long document. In the TREC collections, documents thousands of words long are not uncommon, and an example of a really long document is in the Disk1 Federal Register: FR89119-0111 with 400,748 words. With respect to item 3), query training and expansion, having many of these long documents in the training set would not only overwhelm our system but also lead to ambiguity and imprecision. Segmenting them into subdocuments may provide us with strategies for selecting the appropriate relevant portions of documents for learning. 
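The segmentation rule above (break at the first paragraph boundary after a run of 360 words) might be sketched as follows; the function and parameter names are illustrative, not from the paper:

```python
# Sketch of the subdocument segmentation described above: accumulate
# paragraphs until a run of at least `run` words, then close the unit at
# that paragraph boundary. Names here are assumptions for illustration.

def segment(paragraphs, run=360):
    """Split a list of paragraph strings into subdocuments of roughly `run` words."""
    subdocs, current, count = [], [], 0
    for para in paragraphs:
        current.append(para)
        count += len(para.split())
        if count >= run:                  # close the unit at this paragraph boundary
            subdocs.append(" ".join(current))
            current, count = [], 0
    if current:                           # trailing (possibly short) unit
        subdocs.append(" ".join(current))
    return subdocs
```

A 'non-break' document is then simply one whose paragraphs yield a single unit.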
In the next subsections we consider document selection methods that can be broadly classified into three types: approaches based on document properties only, approaches based on ranking, and approaches based on combinations of both.</Paragraph> <Section position="1" start_page="359" end_page="359" type="sub_section"> <SectionTitle> 3.1 Subdocument Selection Based on Document Properties </SectionTitle> <Paragraph position="0"> These selection methods employ heuristics on the properties of documents. Because they are based solely on a list of known relevant subdocuments, they can bring in concepts that are not explicitly stated in or related to the query. These methods are also efficient because no ranking operation is required. A risk of this type of approach is that if the selection method is not well designed, many irrelevant portions of relevant documents may be included for training and become counter-productive. Four methods have been experimented with, and the rationale for each is given below: (a) Use all subdocuments for learning and query expansion. This is the usual approach in small collections. In a large scale environment it may have the drawbacks of ambiguity, imprecision and inefficiency discussed in Section 1, but it will serve as a basis for comparison.</Paragraph> <Paragraph position="1"> (b) Use only relevant documents that 'break' into a maximum of max subdocuments. This effectively means eliminating long documents for learning, and may diminish the ambiguities that come with them. Short documents should be more concentrated and focused in their content, and can be considered as quality items for training. In particular, max=1 means employing only 'nonbreak' documents. This was the strategy used in the originally submitted results of our TREC-2 experiments. 
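Method (b)'s max filter is straightforward once documents are segmented; a minimal sketch, assuming relevant documents are stored as a mapping from document id to its list of subdocuments (a layout chosen here for illustration):

```python
# Illustrative sketch of method (b): keep only relevant documents that
# segment into at most `max_subdocs` units; max_subdocs=1 keeps just the
# 'nonbreak' documents. The data layout is an assumption, not the paper's.

def select_by_max(relevant_docs, max_subdocs=1):
    """relevant_docs: dict mapping doc_id -> list of subdocument strings."""
    return {doc_id: subdocs
            for doc_id, subdocs in relevant_docs.items()
            if len(subdocs) <= max_subdocs}
```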
However, if the given relevants are mostly long, we may artificially diminish the available number of relevants used for training.</Paragraph> <Paragraph position="2"> (c) Many articles, including scholarly documents and certain newspaper and magazine items, introduce their themes by stating the most important concepts and contents at the beginning of the document. They also summarize at the end. Therefore another approach is to use only the first or last subdocuments for training. Because our segmentation can leave some last subdocuments only a few words long, and because some Wall Street Journal articles can have multiple unrelated stories within a document, we can only approximate our intent with these experiments.</Paragraph> <Paragraph position="3"> (d) A method labelled ffmax=2 uses the first subdocument of max=2 items. This strategy uses the quality items of (b) but also includes the beginning portions of documents as in (c), about twice as long, and would remedy the fact that there may not be sufficient quality items for training.</Paragraph> </Section> <Section position="2" start_page="359" end_page="360" type="sub_section"> <SectionTitle> 3.2 Subdocument Selection Based on a Ranking Operation </SectionTitle> <Paragraph position="0"> These methods first perform a subdocument ranking operation with the routing queries, so that the best-ranking units can be selected for training. By design, best-ranking subdocuments have a high probability of being 'truly relevant' to their queries, and they have been proven to work in user relevance feedback. By ignoring poorer-ranked units one hopes to suppress the noise portions of documents during training. A drawback in this case is that the best-ranked subdocuments by default share many or high-weighted terms with a query, so that learning may become limited to enhancing the given free-text representation of the query. 
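The ranking step can be sketched as below, with a simple term-overlap score standing in for the paper's actual retrieval model (the scoring function and names are assumptions for illustration):

```python
# Sketch of ranking-based selection (Section 3.2): score each relevant
# subdocument against the routing query and keep the bestn top-ranked
# units overall. The overlap score is a stand-in for the real retrieval model.

def score(query, subdoc):
    """Toy score: number of distinct query terms present in the subdocument."""
    return len(set(query.split()) & set(subdoc.split()))

def best_n(query, subdocs, bestn=2):
    """Return the bestn subdocuments, ranked by descending score."""
    ranked = sorted(subdocs, key=lambda s: score(query, s), reverse=True)
    return ranked[:bestn]
```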
Subdocuments that are relevant but do not resemble the query (and therefore are not ranked early) will not be used. Performing a ranking is also time-consuming compared with the methods in Section 3.1. We have experimented with the two methods given below: (e) Select the bestn best-ranked relevant subdocuments for training, after ranking with respect to the given routing query representations. A variant of this method is to enhance/expand the query representations first by using method (b) max=1 documents before doing the ranking. Selecting these bestnx best-ranked subdocuments would include more 'truly relevant' ones than before, because the ranking operation is more sophisticated and has been shown to achieve improved performance in our initial TREC-2 experiments [8].</Paragraph> <Paragraph position="1"> (f) Select the topn highest-ranked subdocuments of every relevant. Since our purpose is to try to avoid the noise portions of relevant documents, these top-ranked units should have a high probability of being mostly the signal portions, as in (e). Moreover, because all relevant documents are used, this method may include the advantage of Section 3.1 that units not resembling the query would also be included for training. A variant is, as before, to enhance/expand the queries first before ranking for the topnx highest-ranked subdocuments for later training.</Paragraph> </Section> <Section position="3" start_page="360" end_page="360" type="sub_section"> <SectionTitle> 3.3 Subdocument Selection Based on Combination of Methods </SectionTitle> <Paragraph position="0"> By combining training document sets obtained from the best of the previous two subsections, we hope to improve on the individual approaches alone. Our objective is to define a training set of subdocuments that are specific to and resemble a query representation, while also including overall subdocuments that are relevant. 
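Method (f) differs from (e) in ranking within each relevant document rather than across the whole pool; a minimal sketch, again using query-term overlap as an assumed stand-in score and an illustrative data layout:

```python
# Sketch of method (f): from every relevant document, keep its topn
# best-scoring subdocuments, so each relevant contributes its likely
# 'signal' portion. Scoring by query-term overlap is an assumed stand-in.

def top_n_per_doc(query, relevant_docs, topn=1):
    """relevant_docs: dict mapping doc_id -> list of subdocument strings."""
    qterms = set(query.split())
    selected = {}
    for doc_id, subdocs in relevant_docs.items():
        ranked = sorted(subdocs,
                        key=lambda s: len(qterms & set(s.split())),
                        reverse=True)
        selected[doc_id] = ranked[:topn]
    return selected
```

Unlike the global bestn selection, every relevant document contributes at least one unit here, which is what preserves the Section 3.1 advantage noted above.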
The following two methods have been tried: (g) Merge the documents obtained by method (e), bestn/bestnx retrieved, with those of method (b) using max=1. The rationale is that method (e) selects the best of those resembling the query, while method (b) uses short quality relevant documents in general.</Paragraph> <Paragraph position="1"> (h) Merge the documents obtained by method (e), bestn/bestnx retrieved, with those of method (f), the topn/topnx=1 units of every document. This is similar to (g), except that instead of using short documents only, we now incorporate the best portions of every relevant.</Paragraph> </Section> </Section> </Paper>