<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1220">
  <Title>Evolution and Evaluation of Document Retrieval Queries</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3. Methodology
</SectionTitle>
    <Paragraph position="0"> The first point to note is that the optimized query expansion method will be evolved in a development phase, prior to everyday use of the search enhancement.</Paragraph>
    <Paragraph position="1"> Robert Steele and David Powers (1998) Evolution and Evaluation of Document Retrieval Queries. In D.M.W. Powers (ed.) NeMLaP3/CoNLL98: New Methods in Language Processing and Computational Natural Language Learning, ACL, pp 163-164. In constructing a Genetic Programming system, there are three basic variables that need to be defined.</Paragraph>
    <Paragraph position="2"> Firstly, the internal nodes for the application must be chosen. Here they will be either 'and', 'or', or 'not'. The reason for this choice is that these operators are already commonly available in search engines.</Paragraph>
    <Paragraph position="3"> Secondly, the leaf nodes must be chosen. Here they will be the words of the original query and various related words produced by WordNet. The important feature of these is that they will not be fixed words but rather of the form A, synonym(A) or hyponym(A), for example (where A is an original search term). It is this that allows the evolved expression trees to be applicable to any search that may be made.</Paragraph>
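One way to picture such an expression tree is the minimal Python sketch below. It is an illustration only, not the authors' implementation: the names Leaf, Node, instantiate and the RELATED table are hypothetical, the table is a toy stand-in for WordNet lookups, and joining several related words with 'or' is an assumption.

```python
from dataclasses import dataclass

# Toy stand-in for WordNet lookups (hypothetical data, not real WordNet output).
RELATED = {
    "synonym": {"car": ["automobile"]},
    "hyponym": {"car": ["taxi", "coupe"]},
}


@dataclass(frozen=True)
class Leaf:
    slot: int               # index of the original search term (A = 0, B = 1, ...)
    relation: str = "self"  # "self", "synonym" or "hyponym"


@dataclass(frozen=True)
class Node:
    op: str          # "and", "or" or "not"
    children: tuple  # one child for "not", two or more otherwise


def instantiate(tree, terms):
    """Turn an abstract tree into a concrete Boolean query string."""
    if isinstance(tree, Leaf):
        word = terms[tree.slot]
        if tree.relation == "self":
            return word
        related = RELATED.get(tree.relation, {}).get(word, [])
        if not related:
            # Assumption: fall back to the original word when nothing is known.
            return word
        # Assumption: several related words are combined with "or".
        return "(" + " or ".join(related) + ")"
    parts = [instantiate(child, terms) for child in tree.children]
    if tree.op == "not":
        return "(not " + parts[0] + ")"
    return "(" + (" " + tree.op + " ").join(parts) + ")"


# The same evolved tree can be applied to any query.
tree = Node("or", (Leaf(0), Leaf(0, "synonym")))
print(instantiate(tree, ["car"]))
print(instantiate(tree, ["boat"]))
```

Because the leaves refer to term slots rather than to fixed words, a tree evolved on one set of queries can be instantiated for a query it has never seen.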
    <Paragraph position="4"> Thirdly, the fitness evaluation function is required. Fitness in this case is determined by the relevance of the documents returned by a query. A number of possible measures exist (Hatter, 1996): * frequency of the original search word in the document; * nearness of multiple search words in the full document; * correct relative frequencies of the words desired; * cluster signatures, which can be used to indicate whether the retrieved documents are similar to each other.</Paragraph>
    <Paragraph position="5"> Greater homogeneity is better.</Paragraph>
    <Paragraph position="6"> * location and frequency of various related words, suggested by WordNet, in the full document.</Paragraph>
    <Paragraph position="7"> The evaluation function will weigh up all the data that can be extracted from the full returned documents, weighting each measure according to how well it is deemed to indicate relevance.</Paragraph>
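As a rough illustration of such a weighted evaluation, the Python sketch below combines two of the listed measures (term frequency and nearness of search words). Everything specific in it is an assumption made for the example: the function names, the way each measure is normalised, the WEIGHTS values, and the choice of a simple weighted sum averaged over the returned documents. The paper does not specify these details.

```python
# Hypothetical weights; the paper does not state how the measures are weighted.
WEIGHTS = {"frequency": 0.6, "nearness": 0.4}


def term_frequency(words, term):
    """Fraction of the document's words that equal the search term."""
    return words.count(term) / len(words) if words else 0.0


def nearness(words, term_a, term_b):
    """1 / (1 + smallest distance between the two terms), or 0 if one is absent."""
    pos_a = [i for i, w in enumerate(words) if w == term_a]
    pos_b = [i for i, w in enumerate(words) if w == term_b]
    if not pos_a or not pos_b:
        return 0.0
    gap = min(abs(i - j) for i in pos_a for j in pos_b)
    return 1.0 / (1.0 + gap)


def document_score(text, terms):
    """Weighted sum of the two measures for one returned document."""
    words = text.lower().split()
    freq = sum(term_frequency(words, t) for t in terms) / len(terms)
    near = nearness(words, terms[0], terms[1]) if len(terms) >= 2 else 0.0
    return WEIGHTS["frequency"] * freq + WEIGHTS["nearness"] * near


def query_fitness(documents, terms):
    """Mean document score over everything the query returned."""
    if not documents:
        return 0.0
    return sum(document_score(d, terms) for d in documents) / len(documents)


docs = ["genetic programming evolves queries", "unrelated text here"]
print(query_fitness(docs, ["genetic", "programming"]))
```

Further measures from the list, such as cluster signatures or WordNet-related word counts, would enter the same sum as additional weighted terms.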
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4. Implementation
</SectionTitle>
    <Paragraph position="0"> A problem with the development of the system is that it will require the retrieval of many documents. For this reason it is best to develop it off-line. The TIPSTER CD used at the TREC conferences represents a good benchmark. To make use of this, we will create a basic indexing system, similar to those of existing search engines.</Paragraph>
    <Paragraph position="1"> This will involve creating a file for each word that occurs in the database (excepting very frequent words, and possibly words that do not occur in any document more times than some threshold number), and storing in the file a reference to each document the word occurs in, the corresponding frequency, and the word's first location in the document.</Paragraph>
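The indexing scheme just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' system: it keeps the index in an in-memory dictionary rather than one file per word, tokenises by whitespace, and uses a hypothetical STOP_WORDS set and threshold parameter to stand in for the excluded very frequent and rare words.

```python
from collections import Counter

# Hypothetical list of very frequent words to leave out of the index.
STOP_WORDS = {"the", "a", "of", "and", "in"}


def build_index(documents, threshold=0):
    """Map each word to a list of (doc_id, frequency, first_position) entries.

    documents: dict of doc_id to text.
    threshold: a word is kept only if some document contains it more than
               this many times (assumed reading of the threshold rule).
    """
    index = {}
    for doc_id, text in documents.items():
        words = text.lower().split()
        counts = Counter(words)
        first_pos = {}
        for pos, word in enumerate(words):
            first_pos.setdefault(word, pos)  # remember only the first location
        for word, count in counts.items():
            if word in STOP_WORDS:
                continue
            index.setdefault(word, []).append((doc_id, count, first_pos[word]))
    # Drop words that never exceed the threshold in any single document.
    return {
        word: postings
        for word, postings in index.items()
        if max(count for _, count, _ in postings) > threshold
    }


docs = {"d1": "the query evolves the query", "d2": "a query tree"}
index = build_index(docs)
print(index["query"])
```

Each entry carries the three items named in the text: a document reference, the frequency of the word in that document, and its first location there.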
    <Paragraph position="2"> This simulated search engine will order the importance of documents with the following rules:</Paragraph>
    <Paragraph position="4"> Where Rf is the relevance based on frequency, Rl is the relevance based on location, b is some weight and R is the overall relevance value.</Paragraph>
  </Section>
</Paper>