<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-2020">
  <Title>A Robust Retrieval Engine for Proximal and Structural Search</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 A Ranking Model for Structured Queries and Texts
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Queries and Texts
</SectionTitle>
      <Paragraph position="0"> This section defines the relevance between a document and a structured query expressed in the region algebra. The key idea is to decompose a structured query into subqueries and to represent the relevance of the whole query as a vector of the relevance measures of those subqueries.</Paragraph>
      <Paragraph position="1"> The region algebra (Clarke et al., 1995) is a set of operators that represent relations between extents (i.e., regions in texts). In this paper, we assume that the region algebra has seven operators: four containment operators (▷, ◁, ⋫, ⋪) representing the containment relations between extents, two combination operators (△, ▽) corresponding to the "and" and "or" operators of the boolean model, and an ordering operator (◇) representing the order of words or structures in the texts. For convenience of explanation, we represent a query as a tree structure, as shown in Figure 1 (see footnote 1).</Paragraph>
      <Paragraph position="3"> The query in Figure 1 represents 'Retrieve the books whose title has the word "retrieval."' Our model represents the relevance of a structured query as a vector of the relevance measures of its subqueries. In other words, relevance is determined by the number of portions of a document that match the subqueries. If an extent matches a subquery of a query $q$, the extent is considered somewhat relevant to $q$ even when it does not match $q$ exactly. Figure 1 shows an example of a query and its subqueries. In this example, an extent that matches "retrieval" or '[title] ▷ "retrieval"' is considered relevant to the query even if it does not match the whole query exactly. Subqueries are formally defined as follows.</Paragraph>
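      <Paragraph> As a minimal sketch of how the containment and ordering operators act on extents, the following Python code models an extent as a pair of token positions. The names Extent, contains, containing, contained_in, not_containing, not_contained_in, and followed_by are hypothetical, the toy positions are invented, and the semantics are a simplified reading of the region algebra (in particular, no minimality filtering is applied to the ordering operator), not the implementation of the engine described here.

```python
from typing import List, NamedTuple


class Extent(NamedTuple):
    start: int  # position of the first token in the extent
    end: int    # position of the last token in the extent


def contains(outer: Extent, inner: Extent) -> bool:
    # True when `inner` lies entirely within `outer`.
    return inner.start >= outer.start and outer.end >= inner.end


def containing(a: List[Extent], b: List[Extent]) -> List[Extent]:
    # Extents of A that contain at least one extent of B.
    return [x for x in a if any(contains(x, y) for y in b)]


def contained_in(a: List[Extent], b: List[Extent]) -> List[Extent]:
    # Extents of A that are contained in at least one extent of B.
    return [x for x in a if any(contains(y, x) for y in b)]


def not_containing(a: List[Extent], b: List[Extent]) -> List[Extent]:
    # Extents of A that contain no extent of B.
    return [x for x in a if not any(contains(x, y) for y in b)]


def not_contained_in(a: List[Extent], b: List[Extent]) -> List[Extent]:
    # Extents of A that are contained in no extent of B.
    return [x for x in a if not any(contains(y, x) for y in b)]


def followed_by(a: List[Extent], b: List[Extent]) -> List[Extent]:
    # Spans running from the start of an A extent to the end of a later
    # B extent (simplified: every qualifying pair is kept).
    return [Extent(x.start, y.end) for x in a for y in b if y.start > x.end]


# Toy document with tokens at positions 0..9 (assumed for illustration).
book = [Extent(0, 9)]
title = [Extent(1, 4)]
retrieval = [Extent(3, 3), Extent(7, 7)]

titles_with_word = containing(title, retrieval)
books_matching = containing(book, titles_with_word)
```

In this toy setting the [book] extent matches the whole query because its [title] extent contains one occurrence of "retrieval"; the second occurrence lies outside the title and is returned only by not_contained_in.</Paragraph>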
      <Paragraph position="4"> Definition 1 (Subquery) Let $q$ be a given query and $n_1, \ldots, n_m$ be the nodes of $q$. The subqueries $q_1, \ldots, q_m$ of $q$ are the subtrees of $q$, where each $q_i$ has node $n_i$ as its root node.</Paragraph>
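      <Paragraph> Definition 1 can be illustrated with a short sketch that enumerates one subtree per node of a query tree. The class QueryNode and the functions subqueries and to_text are hypothetical names, the operator is spelled out as the word "containing" for readability, and the tree is our reading of the example query described in the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QueryNode:
    label: str  # operator name or leaf term
    children: List["QueryNode"] = field(default_factory=list)


def subqueries(root: QueryNode) -> List[QueryNode]:
    # Return every subtree of the query, one per node, in pre-order.
    result = [root]
    for child in root.children:
        result.extend(subqueries(child))
    return result


def to_text(node: QueryNode) -> str:
    # Render a subtree in infix form for display.
    if not node.children:
        return node.label
    joined = (" " + node.label + " ").join(to_text(c) for c in node.children)
    return "(" + joined + ")"


# Assumed form of the example query: books whose title has the word "retrieval".
query = QueryNode("containing", [
    QueryNode("[book]"),
    QueryNode("containing", [QueryNode("[title]"), QueryNode('"retrieval"')]),
])

for sub in subqueries(query):
    print(to_text(sub))
```

The tree has five nodes, so the sketch yields five subqueries, including the whole query itself and the single-word subquery "retrieval".</Paragraph>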
      <Paragraph position="5"> Given a relevance measure between each subquery $q_i$ and a document $d$, the relevance of the whole query is defined as follows.</Paragraph>
      <Paragraph position="6"> Definition 2 (Relevance of the whole query) Let $q$ be a given query, $d$ be a document, and $q_1, \ldots, q_m$ be the subqueries of $q$. The relevance vector $S(q, d)$ of $d$ is the vector whose $i$-th element is the relevance measure between $q_i$ and $d$.</Paragraph>
      <Paragraph position="8"> The relevance of a subquery should be defined in the same way as that of keyword-based queries in traditional ranked retrieval. For example, TFIDF, which is used in our experiments in Section 3, is the simplest and most straightforward measure, although other, more recently proposed relevance measures (Robertson and Walker, 2000) can also be applied. The TF value is calculated from the number of extents matching the subquery, and the IDF value is calculated from the number of documents that include extents matching the subquery.</Paragraph>
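      <Paragraph> The following sketch shows one way such a TFIDF-style relevance could be computed per subquery and collected into a relevance vector. The exact weighting used in the experiments is not given in the text, so the form TF times log(N / DF) is an assumption, as are the names tfidf, relevance_vector, and ExtentCounts and the invented match counts.

```python
import math
from typing import Dict, List

# Hypothetical input: extent_counts[subquery][doc_id] is the number of extents
# in that document that match the subquery.
ExtentCounts = Dict[str, Dict[str, int]]


def tfidf(extent_counts: ExtentCounts, subquery: str, doc_id: str, num_docs: int) -> float:
    per_doc = extent_counts.get(subquery, {})
    tf = per_doc.get(doc_id, 0)  # matching extents in this document
    df = sum(1 for count in per_doc.values() if count > 0)  # documents with a match
    if tf == 0 or df == 0:
        return 0.0
    # Assumed weighting: raw TF times the log of the inverse document frequency.
    return tf * math.log(num_docs / df)


def relevance_vector(extent_counts: ExtentCounts, subqueries: List[str],
                     doc_id: str, num_docs: int) -> List[float]:
    # One relevance value per subquery, in subquery order.
    return [tfidf(extent_counts, sq, doc_id, num_docs) for sq in subqueries]


# Invented counts for a collection of four documents.
counts = {
    '"retrieval"': {"d1": 2, "d2": 1},
    '[title] containing "retrieval"': {"d1": 1},
    '[book] containing ([title] containing "retrieval")': {},
}
vector = relevance_vector(counts, list(counts), "d1", num_docs=4)
print(vector)
```

Document d1 does not match the whole query in this example, yet its vector is nonzero because two of the subqueries match.</Paragraph>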
      <Paragraph position="9"> Footnote 1: In this query, '[x]' is syntactic sugar for '&lt;x&gt; ◇ &lt;/x&gt;'.</Paragraph>
      <Paragraph position="10"> Because the relevance of a structured query is defined as a vector, we need a way to sort documents according to their relevance vectors. In this paper, we first map each vector to a scalar value and then sort the documents according to this scalar measure. We introduce three methods for mapping the relevance vector to a scalar measure. The first simply takes the sum of the elements of the relevance vector.</Paragraph>
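      <Paragraph> The first mapping can be sketched as follows: each relevance vector is reduced to the sum of its elements, and documents are sorted by that sum in descending order. The function names sum_score and rank_documents and the example vectors are hypothetical.

```python
from typing import Dict, List, Tuple


def sum_score(vector: List[float]) -> float:
    # Scalar measure: the sum of the elements of the relevance vector.
    return sum(vector)


def rank_documents(vectors: Dict[str, List[float]]) -> List[Tuple[str, float]]:
    # Sort documents by their scalar measure, highest first.
    scored = [(doc_id, sum_score(vec)) for doc_id, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Invented relevance vectors; the first element stands for the whole query.
vectors = {
    "d1": [0.0, 1.2, 0.7],
    "d2": [0.9, 0.9, 0.9],
    "d3": [0.0, 0.0, 0.4],
}
ranking = rank_documents(vectors)
print(ranking)
```

Documents d1 and d3 do not match the whole query, but they still receive a rank from their partial matches.</Paragraph>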
      <Paragraph position="12"> The second method reflects the rareness of structures.</Paragraph>
      <Paragraph position="13"> When the query is A ▷ B or A ◁ B, if the number of extents matching the query is close to the number of extents matching A, matching the query does not seem to be very important, because it means that most of the extents that match A also match A ▷ B or A ◁ B. The other operators are treated in the same way as ▷ and ◁.</Paragraph>
      <Paragraph position="14"> Definition 4 (Structure Coefficient) When the operator op is △, ▽, or ◇, the structure coefficient of the query A op B is:</Paragraph>
      <Paragraph position="16"> and when the operator op is ▷ or ◁, the structure coefficient of the query A op B is:</Paragraph>
      <Paragraph position="18"> where A and B are queries and C(A) is the number of extents that match A in the document collection.</Paragraph>
      <Paragraph position="19"> The scalar measure $\rho_{\mathrm{sc}}(q_i, d)$ is then defined as</Paragraph>
      <Paragraph position="21"> The third method combines the measure of the query itself with the measures of its subqueries. Although we calculate the score of an extent from the subqueries rather than from the whole query alone, the scores of different subqueries cannot be compared directly with one another. We therefore assume a normalized weight for each subquery and interpolate between the weight of a parent node and the weights of its child nodes.</Paragraph>
      <Paragraph position="22"> Definition 5 (Interpolated Coefficient) The interpolated coefficient of the query qi is recursively defined as follows:</Paragraph>
      <Paragraph position="24"> where $c_i$ is a child of node $n_i$, $l$ is the number of children of node $n_i$, and $0 \le \lambda \le 1$.</Paragraph>
      <Paragraph position="25"> This formula means that the weight of each node is a weighted average of the weight of the query and the weights of its subqueries. When $\lambda = 1$, the weight of each query is the normalized weight of the query itself. When $\lambda = 0$, the weight of each query is calculated from the weights of its subqueries, i.e., only from the weights of the words used in the query.</Paragraph>
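      <Paragraph> The recursion described above can be sketched under an explicit assumption about its form. The formula of Definition 5 is not reproduced in this text, so the code below assumes that a node's coefficient is the interpolation parameter times its own normalized weight plus the remainder times the mean coefficient of its children, and that a leaf keeps its own weight. The names Node and interpolated, the parameter name lam, and the example weights are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    weight: float  # normalized weight of this (sub)query, assumed to be given
    children: List["Node"] = field(default_factory=list)


def interpolated(node: Node, lam: float) -> float:
    # Assumed form: lam * own weight + (1 - lam) * mean of the children's
    # coefficients. A leaf (a word) simply keeps its own weight.
    if not node.children:
        return node.weight
    child_mean = sum(interpolated(c, lam) for c in node.children) / len(node.children)
    return lam * node.weight + (1.0 - lam) * child_mean


# Invented weights: a parent query with two word-level children.
tree = Node(0.9, [Node(0.2), Node(0.6)])

only_query = interpolated(tree, 1.0)   # uses the weight of the query itself
only_words = interpolated(tree, 0.0)   # uses the weights of the words only
halfway = interpolated(tree, 0.5)
```

Under this assumed form, the two boundary values of the parameter reproduce the behavior described in the text: the weight of the query itself at one extreme and a value determined only by the words at the other.</Paragraph>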
      <Paragraph position="26">
1 '([cons] ▷ ([sem] ▷ "G#DNA domain or region")) △ ("in" ◇ ([cons] ▷ ([sem] ▷ ("G#tissue" ▽ "G#body part"))))'
2 '([event] ▷ ([obj] ▷ "gene")) △ ("in" ◇ ([cons] ▷ ([sem] ▷ ("G#tissue" ▽ "G#body part"))))'
3 '([event] ▷ ([obj] ◇ ([sem] ▷ "G#DNA domain or region"))) △ ("in" ◇ ([cons] ▷ ([sem] ▷ ("G#tissue" ▽ "G#body part"))))'
4 '([event] ▷ ([dummy] ▷ "G#DNA domain or region")) △ ("in" ◇ ([cons] ▷ ([sem] ▷ ("G#tissue" ▽ "G#body part"))))'
</Paragraph>
    </Section>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Experiments
</SectionTitle>
    <Paragraph position="0"> In this section, we present the results of preliminary text retrieval experiments using our model. Because no test collection of structured queries and tag-annotated texts was available, we used the GENIA corpus (Ohta et al., 2002) as the structured text. The corpus is an XML document composed of paper abstracts in the field of biomedical science, and it consists of 1,990 articles, 873,087 words (including tags), and 16,391 sentences.</Paragraph>
    <Paragraph position="1"> We compared three retrieval models: i) our model, ii) exact matching of the region algebra (exact), and iii) an unstructured flat model (flat). In the flat model, each query was submitted as the words of the corresponding query in Table 1 connected by the "and" operator (△). The queries submitted to our system are shown in Table 1, and the document unit was a sentence, delimited by &lt;sentence&gt; tags. Queries 1, 2, and 3 are real queries made by an expert in the field of biomedicine. Query 4 is a toy query that we made so that the robustness of our model relative to the exact model can be seen easily. For each model, the system output the ten results with the highest relevance (see footnote 2).</Paragraph>
    <Paragraph position="2"> Table 2 shows the number of results judged relevant among the top ten results when the ranking was done using $\rho_{\mathrm{sum}}$. The results show that our model was superior to the exact and flat models for Queries 1, 2, and 3. Compared with the exact model, our model output more relevant documents because it allows partial matching of the query, which shows the robustness of our model. In addition, our model outperformed the flat model, which means that the structural specification of the query was effective for finding relevant documents. For Query 4, our model succeeded in finding relevant results, whereas the exact model found no results because Query 4 includes a tag that does not occur in the text (the &lt;dummy&gt; tag). This result also shows the robustness of our model.</Paragraph>
    <Paragraph position="3"> Although we omit the detailed results for $\rho_{\mathrm{sc}}$ and $\rho_{\mathrm{ic}}$ because of space limitations, we summarize them here. The number of relevant results using $\rho_{\mathrm{sc}}$ was the same as that using $\rho_{\mathrm{sum}}$, but the rank of irrelevant results using $\rho_{\mathrm{sc}}$ was lower than that using $\rho_{\mathrm{sum}}$. The results using $\rho_{\mathrm{ic}}$ varied between those of the flat model and those of $\rho_{\mathrm{sum}}$, depending on the value of $\lambda$.</Paragraph>
    <Paragraph position="4"> Table 2 (columns: Query, our model, exact, flat). Caption fragment: "of all results) in top 10 results."</Paragraph>
    <Paragraph position="5"> Footnote 2: For the exact model, ten results were selected randomly from the exactly matched results if the total number of results was more than ten. After obtaining the results for each model, we shuffled them randomly for each query, and an expert in the field of biomedicine judged whether each shuffled result was relevant.</Paragraph>
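    <Paragraph> The judging procedure and the count reported per model can be sketched as follows. The function names shuffled_pool and relevant_in_top_k, the fixed random seed, the removal of duplicate results before shuffling, and the sentence identifiers are all assumptions made for illustration; none of the values are data from the experiments.

```python
import random
from typing import Dict, List, Set


def shuffled_pool(results_by_model: Dict[str, List[str]], seed: int = 0) -> List[str]:
    # Merge the outputs of all models and shuffle them so that the judge
    # cannot tell which model produced which result. Removing duplicates
    # is an assumption; the text does not say how they were handled.
    pool = sorted({r for results in results_by_model.values() for r in results})
    random.Random(seed).shuffle(pool)
    return pool


def relevant_in_top_k(results: List[str], relevant: Set[str], k: int = 10) -> int:
    # Number of results among the first k that were judged relevant.
    return sum(1 for r in results[:k] if r in relevant)


# Invented sentence identifiers for one query.
results_by_model = {
    "ours": ["s1", "s2", "s3"],
    "exact": ["s1"],
    "flat": ["s4", "s2", "s5"],
}
pool = shuffled_pool(results_by_model)
judged_relevant = {"s1", "s2"}  # stand-in for the expert's judgments
counts = {m: relevant_in_top_k(r, judged_relevant) for m, r in results_by_model.items()}
print(counts)
```

Each model is then scored by how many of its own top results fall in the judged-relevant set, which corresponds to the per-model counts that Table 2 reports.</Paragraph>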
  </Section>
</Paper>