<?xml version="1.0" standalone="yes"?> <Paper uid="P03-2008"> <Title>A Ranking Model of Proximal and Structural Text Retrieval Based on Region Algebra</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Background: Region algebra </SectionTitle> <Paragraph position="0"> The region algebra (Salminen and Tompa, 1994; Clarke et al., 1995; Jaakkola and Kilpelainen, 1999) is a set of operators representing the relation between the extents (i.e. regions in texts), where an extent is represented by a pair of positions, beginning and ending position. Region algebra allows for the specification of the structure of text.</Paragraph> <Paragraph position="1"> In this paper, we suppose the region algebra proposed in (Clarke et al., 1995). It has seven operators as shown in Table 1; four containment operators (/, 6/, C/, 6C/) representing the containment relation between the extents, two combination operators (4, 5) corresponding to &quot;and&quot; and &quot;or&quot; operator of the boolean model, and ordering operator (3) representing the order of words or structures in the texts. A containment relation between the extents is represented as follows: e = (ps;pe) contains e0 = (p0s;p0e) iff ps * p0s * p0e * pe (we express this relation as e = e0). The result of retrieval is a set of non-nested extents, that is defined by the following function G over a set of extents S:</Paragraph> <Paragraph position="3"/> <Paragraph position="5"> Intuitively, G(S) is an operation for finding the shortest matching. A set of non-nested extents matching query q is expressed as Gq.</Paragraph> <Paragraph position="6"> For convenience of explanation, we represent a query as a tree structure as shown in Figure 1 ('[x]' is a abbreviation of 'hxi 3 h/xi'). This query represents 'Retrieve the books whose title has the word &quot;retrieval.&quot; ' The algorithm for finding an exact match of a query works efficiently. The time complexity of the algorithm is linear to the size of a query and the size of documents (Clarke et al., 1995).</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 A Ranking Model for Structured </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Queries and Texts </SectionTitle> <Paragraph position="0"> This section describes the definition of the relevance between a document and a structured query represented by the region algebra. The key idea is that a structured query is decomposed into subqueries, and the relevance of the whole query is represented as a vector of relevance measures of subqueries.</Paragraph> <Paragraph position="1"> Our model assigns a relevance measure of the query matching extents in (1,15) matching extents in (16,30) constructed by 1 2 3 4 5 6 htitlei tf and idf h/titlei ranked 7 8 9 10 11 12 retrieval h/chapteri h/booki hbooki htitlei structured 13 14 15 16 17 18 text h/titlei hchapteri htitlei search for 19 20 21 22 23 24 structured text h/titlei retrieval h/chapteri h/booki 25 26 27 28 29 30 structured query as a vector of relevance measures of the subqueries. In other words, the relevance is defined by the number of portions matched with subqueries in a document. If an extent matches a subquery of query q, the extent will be somewhat relevant to q even when the extent does not exactly match q. Figure 2 shows an example of a query and its subqueries. 
</Section>
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 A Ranking Model for Structured Queries and Texts </SectionTitle>
<Paragraph position="0"> This section describes the definition of the relevance between a document and a structured query represented in the region algebra. The key idea is that a structured query is decomposed into subqueries, and the relevance of the whole query is represented as a vector of relevance measures of the subqueries.</Paragraph>
<Paragraph position="1"> Our model assigns a relevance measure of the structured query as a vector of relevance measures of the subqueries. [Figure 3: an example text in which two documents correspond to the extents (1,15) and (16,30). Table 2: the extents matching each subquery in each document.] In other words, the relevance is defined by the number of portions of a document matched by subqueries. If an extent matches a subquery of query q, the extent will be somewhat relevant to q even when the extent does not exactly match q. Figure 2 shows an example of a query and its subqueries. In this example, even when an extent does not match the whole query exactly, if the extent matches &quot;retrieval&quot; or '[title] ⊃ &quot;retrieval&quot;', the extent is considered to be relevant to the query. Subqueries are formally defined as follows.</Paragraph>
<Paragraph position="2"> Definition 1 (Subquery) Let q be a given query and n_1, ..., n_m be the nodes of q. The subqueries q_1, ..., q_m of q are the subtrees of q, where each q_i has node n_i as its root node.</Paragraph>
<Paragraph position="3"> When a relevance σ(q_i, d) between a subquery q_i and a document d is given, the relevance of the whole query is defined as follows.</Paragraph>
<Paragraph position="4"> Definition 2 (Relevance of the whole query) Let q be a given query, d be a document, and q_1, ..., q_m be the subqueries of q. The relevance vector S(q, d) of d is defined as follows:</Paragraph>
<Paragraph position="5"> S(q, d) = (σ(q_1, d), σ(q_2, d), ..., σ(q_m, d))</Paragraph>
<Paragraph position="6"> The relevance of a subquery should be defined similarly to that of keyword-based queries in traditional ranked retrieval. For example, TFIDF, which is used in our experiments in Section 4, is the simplest and most straightforward measure, while other recently proposed relevance measures (Robertson and Walker, 2000; Fuhr, 1992) can also be applied. The TF of a subquery is calculated from the number of extents matching the subquery, and the IDF of a subquery is calculated from the number of documents containing extents that match the subquery. When the text is given as in Figure 3 and the document collection is {(1,15), (16,30)}, the extents matching each subquery in each document are shown in Table 2, and TF and IDF are calculated from the number of extents matching each subquery in Table 2.</Paragraph>
<Paragraph position="7"> While we have defined the relevance of a structured query as a vector, we still need to order the documents according to these relevance vectors. In this paper, we first map each vector onto a scalar value and then sort the documents according to this scalar measure.</Paragraph>
<Paragraph position="8"> Three methods are introduced for mapping the relevance vector onto a scalar measure. The first one simply takes the sum of the elements of the relevance vector:</Paragraph>
<Paragraph position="9"> σ_sum(q, d) = Σ_i σ(q_i, d)</Paragraph>
<Paragraph position="10"> The second appends a coefficient representing the rareness of the structures. When the query is A ⊃ B or A ⊂ B, if the number of extents matching the query is close to the number of extents matching A, matching the query does not seem very important, because it means that the extents that match A mostly match A ⊃ B or A ⊂ B as well. The other operators are treated in the same way as ⊃ and ⊂.</Paragraph>
<Paragraph position="11"> Definition 4 (Structure Coefficient) The structure coefficient of a query A op B is defined according to the operator op, with one form when op is △, ▽ or ▹ and another when op is ⊃ or ⊂, where A and B are queries and C(A) is the number of extents that match A in the document collection.</Paragraph>
<Paragraph position="15"> The scalar measure σ_sc(q, d) is then obtained by weighting the relevance of each subquery with its structure coefficient.</Paragraph>
<Paragraph position="17"> The third is a combination of the measure of the query itself and the measures of its subqueries. Although we calculate the score of extents from subqueries instead of using only the whole query, the score of one subquery cannot directly be compared with the scores of the other subqueries. We therefore assume a normalized weight for each subquery and interpolate the weight of a parent node with the weights of its children.</Paragraph>
<Paragraph position="18"> Definition 5 (Interpolated Coefficient) The interpolated coefficient of the query q_i is recursively defined as follows:</Paragraph>
<Paragraph position="19"> σ_ic(q_i, d) = λ · σ̂(q_i, d) + ((1 − λ) / l) · Σ_{c_i} σ_ic(c_i, d)</Paragraph>
<Paragraph position="20"> where σ̂(q_i, d) is the normalized weight of q_i, c_i ranges over the children of node n_i, l is the number of children of node n_i, and 0 ≤ λ ≤ 1.</Paragraph>
<Paragraph position="21"> This formula means that the weight of each node is a weighted average of the weight of the query and the weights of its subqueries. When λ = 1, the weight of a query is the normalized weight of the query itself. When λ = 0, the weight of a query is calculated from the weights of its subqueries, i.e. only from the weights of the words used in the query.</Paragraph>
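<Paragraph> The decomposition and scoring described in this section can be summarized by the following Python sketch. It is illustrative only: the helpers extents (extents matching a subquery in a document), doc_freq (number of documents containing such extents) and norm_weight (the normalized weight of a subquery) stand for some region algebra evaluator and are hypothetical, and the σ_ic recursion follows the reconstruction given above.

import math

class Node:
    # A query node; the subtree rooted at a node is a subquery.
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def subqueries(q):
    # Definition 1: every subtree of the query tree is a subquery.
    yield q
    for c in q.children:
        yield from subqueries(c)

def sigma(node, doc, n_docs, extents, doc_freq):
    # A TFIDF-style relevance for one subquery: TF from the number of extents
    # matching the subquery in doc, IDF from the number of documents containing such extents.
    tf = len(extents(node, doc))
    df = doc_freq(node)
    return 0.0 if df == 0 else tf * math.log(n_docs / df)

def relevance_vector(q, doc, n_docs, extents, doc_freq):
    # Definition 2: one relevance value per subquery.
    return [sigma(s, doc, n_docs, extents, doc_freq) for s in subqueries(q)]

def sigma_sum(q, doc, n_docs, extents, doc_freq):
    # First scalar mapping: the sum of the vector's elements.
    return sum(relevance_vector(q, doc, n_docs, extents, doc_freq))

def sigma_ic(node, doc, lam, norm_weight):
    # Interpolated coefficient: a weighted average of the node's own normalized
    # weight and the average interpolated weight of its children.
    own = norm_weight(node, doc)
    if not node.children or lam == 1.0:
        return own
    child_avg = sum(sigma_ic(c, doc, lam, norm_weight) for c in node.children) / len(node.children)
    return lam * own + (1.0 - lam) * child_avg

Documents are then sorted by whichever scalar measure is chosen.</Paragraph>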
</Section>
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Experiments </SectionTitle>
<Paragraph position="0"> In this section, we show the results of preliminary text retrieval experiments using our model.</Paragraph>
<Paragraph position="1"> We used the GENIA corpus (Ohta et al., 2002) and the OHSUMED test collection (Hersh et al., 1994).</Paragraph>
<Paragraph position="2"> We compared three retrieval models: i) our model, ii) exact matching with the region algebra (exact), and iii) a non-structured model (flat). The queries submitted to our system are shown in Tables 3 and 4. In the flat model, each query was submitted as the words of the original query connected by the &quot;and&quot; operator (△). For example, in the case of Query 1, the query submitted to the system in the flat model is '&quot;G#DNA domain or region&quot; △ &quot;in&quot; △ &quot;G#tissue&quot; △ &quot;G#body part&quot;.' For each model, the system output the ten results with the highest relevance.</Paragraph>
<Paragraph position="3"> In the following experiments, we used a computer with a 1.27 GHz Pentium III CPU and 4 GB of memory.</Paragraph>
<Paragraph position="4"> The system was implemented in C++ with the Berkeley DB library.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 GENIA corpus </SectionTitle>
<Paragraph position="0"> The GENIA corpus is an XML document composed of paper abstracts in the field of biomedical science. The corpus consisted of 1,990 articles, 873,087 words (including tags), and 16,391 sentences. In the GENIA corpus, the document structure was annotated with tags such as &quot;⟨article⟩&quot; and &quot;⟨sentence⟩&quot;, technical terms were annotated with &quot;⟨cons⟩&quot;, and events were annotated with &quot;⟨event⟩&quot;. The queries in Table 3 were made by an expert in the field of biomedicine. The unit of retrieval (the &quot;document&quot;) in this experiment was the sentence. Query 1 retrieves sentences including a gene in a tissue. Queries 2 and 3 retrieve sentences representing an event that has a gene as its object and occurs in a tissue. In Query 2, a gene was represented by the word &quot;gene,&quot; and in Query 3, a gene was represented by the annotation. [Table 4: queries submitted to the system for OHSUMED; the first line of each entry is the query submitted to the system, and the second and third lines are the original query of the OHSUMED test collection, giving the patient information and the request information, respectively.] For the exact model, ten results were selected randomly from the exactly matched results if the number of results was more than ten.
The results were blind-tested: after we had obtained the results for each model, we shuffled them randomly for each query, and an expert in the field of biomedicine judged whether each of the shuffled results was relevant or not.</Paragraph>
<Paragraph position="1"> Table 5 shows the number of results judged relevant among the top ten. The results show that our model was superior to the exact and flat models for all queries. Compared to the exact model, our model output more relevant documents because it allows partial matching of the query, which shows the robustness of our model. In addition, our model gave better results than the flat model, which means that the structural specification in the query was effective for finding the relevant documents.</Paragraph>
<Paragraph position="2"> Comparing the variants of our model, the number of relevant results using σ_sc was the same as that using σ_sum. The results using σ_ic varied between those of the flat model and those of the exact model depending on the value of λ.</Paragraph>
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 OHSUMED test collection </SectionTitle>
<Paragraph position="0"> The OHSUMED test collection is a document set composed of paper abstracts in the field of biomedical science. [Table 6: number of relevant results in the top ten for the OHSUMED test collection, with columns Query / our model / exact / flat; &quot;all results&quot; are the relevance-judged results in the exact model.] The collection has a query set and a list of relevant documents for each query; for each query, from 50 to 300 documents have been judged as relevant or not relevant. Each query consists of patient information and an information request. We used the title, abstract, and human-assigned MeSH term fields of the documents in the experiments. Since the original OHSUMED is not annotated with tags, we annotated it with tags representing document structures, such as &quot;⟨article⟩&quot; and &quot;⟨sentence⟩&quot;, and annotated technical terms with tags such as &quot;⟨disease⟩&quot; and &quot;⟨therapeutic⟩&quot; by longest matching of terms of the Unified Medical Language System (UMLS). Unlike the GENIA corpus, OHSUMED has no annotation of relations between technical terms, such as events. The collection consisted of 348,566 articles, 78,207,514 words (including tags), and 1,731,953 sentences.</Paragraph>
<Paragraph position="1"> Twelve of the 106 OHSUMED queries were converted into structured region algebra queries by an expert in the field of biomedicine. These queries are shown in Table 4 and were submitted to the system. The unit of retrieval (the &quot;document&quot;) in this experiment was the article. For the exact model, all exact matches of the whole query were judged. Since OHSUMED contains documents that have not been judged for relevance to a given query, we considered only the judged documents.</Paragraph>
<Paragraph position="2"> Table 6 shows the number of relevant results among the top ten. The results show that our model succeeded in finding relevant results that the exact model could not find, and that it was superior to the flat model for Queries 4, 5, and 6.
However, our model was inferior to the flat model for Queries 14 and 15.</Paragraph>
<Paragraph position="3"> Comparing the variants of our model, the number of relevant results using σ_sc and σ_ic was lower than that using σ_sum.</Paragraph>
</Section>
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Discussion </SectionTitle>
<Paragraph position="0"> In the experiments on OHSUMED, the number of relevant documents retrieved by our model was smaller than that of the flat model for some queries. We think this is because i) specifying structures was not effective, ii) the weighting of subqueries did not work, iii) the MeSH terms embedded in the documents are effective for the flat model but not for our model, or iv) there are many documents that our system found relevant but that were never judged, since the OHSUMED test collection was made using keyword-based retrieval.</Paragraph>
<Paragraph position="1"> As for i), the structural specification in the queries does not appear to be well written, because the exact model failed to achieve high precision and its coverage was very low. In the experiments on OHSUMED we used only the tags marking technical terms as structures; these were not very effective because the tags are annotated by longest matching of terms. We need to use tags representing relations between technical terms to improve the results. Moreover, the structured queries used in the experiments may not specify the request information exactly. We therefore think that converting queries written in natural language into appropriate structured queries is important, and that this leads toward question answering over richly tag-annotated texts.</Paragraph>
<Paragraph position="2"> As for ii), we think the weighting did not work because we simply use the frequency of subqueries for weighting. To improve the weighting, we have to assign a high weight to the structures that reflect the user's intention, as written in the request information. This is shown by the results of Query 9: relevant documents were retrieved only by the model using σ_ic, because although the request information concerned &quot;lupus nephritis&quot;, the weight associated with &quot;lupus nephritis&quot; was smaller than that associated with &quot;thrombotic&quot; and &quot;thrombocytopenic purpura&quot; in the other models. Because the structures reflecting the user's intention did not coincide with the most heavily weighted structures, the relevant documents were not retrieved.</Paragraph>
<Paragraph position="3"> As for iii), MeSH terms are human-assigned keywords for each document, and no relation exists across the boundaries between MeSH terms. In the flat model, these MeSH terms improve the results. In our model, however, the structure sometimes matches in unexpected ways. For example, in the case of Query 14, the subquery '&quot;chronic&quot; ▹ &quot;fatigue&quot; ▹ &quot;syndrome&quot;' matched in the MeSH term field across term boundaries when that field contained text such as &quot;Affective Disorders/*CO; Chronic Disease; Fatigue/*PX; Human; Syndrome&quot;, because the operator ▹ has no limitation on distance.</Paragraph>
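<Paragraph> The following Python sketch illustrates the effect described in the Query 14 example above; the tokenization and the followed_by helper are illustrative assumptions, not the system's implementation. An ordering operator with no distance limit happily matches words drawn from three different MeSH terms:

import re

def followed_by(tokens, words):
    # Extents (start, end) in which the given words occur in this order,
    # with no restriction on how far apart they are.
    occ = {w: [i for i, t in enumerate(tokens) if t == w] for w in words}
    results = []
    for start in occ[words[0]]:
        pos, ok = start, True
        for w in words[1:]:
            later = [p for p in occ[w] if p > pos]
            if not later:
                ok = False
                break
            pos = min(later)
        if ok:
            results.append((start, pos))
    return results

mesh_field = "Affective Disorders/*CO; Chronic Disease; Fatigue/*PX; Human; Syndrome"
tokens = re.findall(r"[a-z]+", mesh_field.lower())
# "chronic", "fatigue" and "syndrome" come from three different MeSH terms,
# yet the ordered query still finds an extent spanning all of them.
print(followed_by(tokens, ["chronic", "fatigue", "syndrome"]))  # [(3, 8)]

A distance-limited or term-boundary-aware ordering operator would avoid this kind of spurious match.</Paragraph>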
<Paragraph position="4"> As for iv), the OHSUMED test collection was constructed by attaching relevance judgements to documents retrieved by keyword-based retrieval.</Paragraph>
<Paragraph position="5"> To show the effectiveness of structured retrieval more clearly, we need a test collection with (structured) queries and lists of relevant documents, together with tag-annotated documents, for example with tags representing relations between technical terms, such as &quot;event&quot;, or taggers that can produce such annotations. Tables 7 and 8 show that the retrieval time increases with the size of the document collection. The system is efficient enough for information retrieval over a rather small document set such as the GENIA corpus. To apply it to huge databases such as Web-based applications, we might require a constant-time algorithm, which should be the subject of future research.</Paragraph>
</Section> </Section> </Paper>