<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0309">
  <Title>Biomedical Text Retrieval in Languages with a Complex Morphology</Title>
  <Section position="5" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
4 Experimental Results
</SectionTitle>
    <Paragraph position="0"> The assessment of the experimental results is based on the aggregation of all 52 selected queries on the one hand, and on a separate analysis of expert vs.</Paragraph>
    <Paragraph position="1"> layman queries, on the other hand. In particular, we calculated the average interpolated precision values at fixed recall levels (in increments of 10%) over the top 200 documents retrieved. Additionally, we provide the average of the precision values at all eleven fixed recall levels (11pt recall), and the average of the precision values at the recall levels of 20%, 50%, and 80% (3pt recall).</Paragraph>
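The evaluation protocol above can be sketched as follows; this is a minimal illustration of fixed-level interpolated precision and the 11pt/3pt averages (function and variable names are our own, not the authors' implementation):

```python
def interpolated_precision(ranked_relevance, total_relevant, levels=None):
    """Interpolated precision at fixed recall levels.

    ranked_relevance: 0/1 relevance flags for the ranked retrieved documents.
    Precision at recall level r is the maximum precision observed at any
    recall >= r (the standard interpolation rule).
    """
    if levels is None:
        levels = [i / 10 for i in range(11)]  # 0%, 10%, ..., 100%
    hits = 0
    points = []  # (recall, precision) after each retrieved document
    for rank, rel in enumerate(ranked_relevance, start=1):
        hits += rel
        points.append((hits / total_relevant, hits / rank))
    interp = []
    for level in levels:
        candidates = [p for r, p in points if r >= level]
        interp.append(max(candidates) if candidates else 0.0)
    return interp

def avg_11pt(interp):
    """Average over all eleven fixed recall levels."""
    return sum(interp) / len(interp)

def avg_3pt(interp):
    """Average at the 20%, 50%, and 80% recall levels."""
    return (interp[2] + interp[5] + interp[8]) / 3
```

For example, a ranking with relevance flags `[1, 0, 1]` and two relevant documents overall yields interpolated precision 1.0 at the low recall levels and 2/3 at the high ones.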
    <Paragraph position="2"> We here discuss the results from the analysis of the complete query set, the data of which is given in Table 2 and visualized in Figure 1. For our baseline (WS), the direct match between query terms and document terms, precision is already poor at low recall points, ranging from 53.3% down to 31.9%; at high recall points, precision drops from 19.1% to 3.7%. When we take term proximity (adjacency) into account (WSA), we observe a small though statistically insignificant increase in precision at all recall points, 1.6% on average. Orthographic normalization alone (WSO), however, interestingly caused a marginal decrease in precision, 0.6% on average. When both parameters, orthographic normalization and adjacency, are combined (WSAO), they produce an increase in precision at nine of eleven recall points, 2.5% on average compared with WS. None of these differences is statistically significant when the two-tailed Wilcoxon test is applied at all eleven recall levels.</Paragraph>
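As an illustration of the adjacency condition (WSA), a phrase-style match requires the query terms to occur as a contiguous run in the document. A minimal sketch (helper names are assumptions, not the authors' implementation):

```python
def positions(tokens):
    """Map each token to the list of positions where it occurs."""
    index = {}
    for i, tok in enumerate(tokens):
        index.setdefault(tok, []).append(i)
    return index

def adjacent_match(query_terms, doc_tokens):
    """True if the query terms occur as a contiguous run in the document."""
    index = positions(doc_tokens)
    for start in index.get(query_terms[0], []):
        if doc_tokens[start:start + len(query_terms)] == query_terms:
            return True
    return False
```

A real engine would use this as a ranking signal rather than a hard filter, but the positional-index idea is the same.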
    <Paragraph position="3"> Trigram indexing (TG) yields the poorest results of all the methodologies tested. It is comparable to WS at low recall levels, but at high ones its precision decreases almost dramatically. Unless very high rates of misspellings are to be expected (this explains the favorable results for trigram indexing in (Franz et al., 2000)), one cannot really recommend this method.</Paragraph>
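For reference, letter-trigram indexing of the kind tested here is commonly implemented with boundary-padded character trigrams, which makes matching robust to minor misspellings. A minimal sketch (our own naming, not the tested system):

```python
def trigrams(term, pad="#"):
    """Boundary-padded character trigrams, e.g. 'gene' -> #ge, gen, ene, ne#."""
    padded = pad + term.lower() + pad
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def trigram_overlap(a, b):
    """Dice coefficient over trigram sets; spelling variants score > 0."""
    ta, tb = set(trigrams(a)), set(trigrams(b))
    return 2 * len(ta & tb) / (len(ta) + len(tb))
```

The downside visible in the results above is the flip side of this robustness: trigram overlap also matches many unrelated terms, which hurts precision at high recall.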
    <Paragraph position="4"> The subword approach (SU) clearly outperforms the previously discussed approaches. We compare it here with WSAO, the best-performing lexicon-free method. Within this setting, the gain in precision for SU ranges from 6.5% to 14% at low recall points, while for high recall points it is still in the range of 4.8% to 6%. Indexing by synonym class identifiers (SY) results in a marginal decrease of overall performance compared with SU. To estimate the statistical significance of the differences SU vs. WSAO and SY vs. WSAO, we compared value pairs at each fixed recall level, using the two-tailed Wilcoxon test (for a description and its applicability to the interpretation of precision/recall graphs, cf. (Rijsbergen, 1979)). Statistically significant results are set in bold face in Table 2.</Paragraph>
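The Wilcoxon signed-rank test compares paired precision values across the fixed recall levels. A stdlib sketch of the test statistic W = min(W+, W-) (significance would then be read from a table or a normal approximation; this is an illustration, not the authors' code):

```python
def wilcoxon_w(xs, ys):
    """Wilcoxon signed-rank statistic for paired samples.

    Zero differences are dropped; tied |differences| share averaged ranks.
    Returns W = min(W+, W-), the smaller of the signed rank sums.
    """
    diffs = [x - y for x, y in zip(xs, ys) if x != y]
    ordered = sorted(diffs, key=abs)
    n = len(ordered)
    w_plus = w_minus = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and abs(ordered[j]) == abs(ordered[i]):
            j += 1  # group of tied absolute differences: ranks i+1 .. j
        avg_rank = (i + 1 + j) / 2
        for k in range(i, j):
            if ordered[k] > 0:
                w_plus += avg_rank
            else:
                w_minus += avg_rank
        i = j
    return min(w_plus, w_minus)
```

With eleven paired precision values (one per recall level), a small W indicates that one method dominates the other consistently.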
    <Paragraph position="5"> The data for the comparison between expert and layman queries is given in Tables 3 and 4, respectively, and visualized in Figures 2 and 3, respectively. The prima facie observation that the layman recall data is higher than that of the experts is of little value, since the queries were acquired in quite different ways (cf. Section 3). The adjacency criterion for word index search (WSA) has no influence on the layman queries, probably because they contain fewer search terms. This may also explain the poor performance of trigram search. A considerably higher gain for the subword indexing approach (SU) is evident from the data for layman queries.</Paragraph>
    <Paragraph position="6"> Compared with WSAO, the average gain in precision amounts to 9.6% for layman queries, but only 5.6% for expert queries. The difference is also obvious when we compare the statistically significant differences in both tables (bold face). This is also compatible with the finding that the rate of query result mismatches (cases where a query did not yield any document as an answer) equals zero for SU, but amounts to 8% and 29.6% for expert and layman queries, respectively, under the token match paradigm WS* (cf. Table 5).</Paragraph>
    <Paragraph position="7"> When we compare the results for synonym class indexing (SY), we note a small, though statistically insignificant, improvement for layman queries at some recall points. We attribute the different results partly to the lower baseline for layman queries, and partly to the probably more accentuated vocabulary mismatch between layman queries and documents using expert terminology. However, this difference is below the level we expected. In forthcoming releases of the subword dictionary, in which coverage, stop word lists and synonym classes will be augmented, we hope to demonstrate the added value of the subword approach more convincingly.</Paragraph>
    <Paragraph position="8"> Generalizing the interpretation of our data in the light of these findings, we recognize a substantial increase in retrieval performance when query and text tokens are segmented according to the principles of the subword model. The gain is still not overwhelming. With regard to orthographic normalization, we expected a higher performance benefit because of the well-known spelling problems for German medical terms of Latin or Greek origin (such as 'Zäkum', 'Cäkum', 'Zaekum', 'Caekum', 'Zaecum', 'Caecum'). For our experiments, however, we used quite a homogeneous document collection following the spelling standards of medical publishers. The same standards apparently applied to the original multiple choice questions by which the acquisition of expert queries was guided (cf. Section 3). The layman queries contained only a few Latin or Greek terms and, therefore, did not take advantage of the spelling normalization. However, the experience with medical text retrieval (especially on medical reports, which exhibit a high rate of spelling variations) shows that orthographic normalization is a desideratum. Considering the proximity (adjacency) of search terms as a crucial parameter for output ranking proved useful, so we use it as a default for subword and synonym class indexing.</Paragraph>
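A hypothetical sketch of orthographic normalization rules in the spirit of the spelling variants cited above; the exact rule set is our assumption, chosen only so that the six 'Zäkum'/'Caecum'-style variants collapse to one index form:

```python
import re

# Assumed rules: expand umlauts, then conflate the Latin/Greek c ~ z/k
# alternations. Rules are applied in order.
RULES = [
    (r"ä", "ae"), (r"ö", "oe"), (r"ü", "ue"), (r"ß", "ss"),
    (r"^c", "z"),          # word-initial c -> z (Caecum -> Zaecum)
    (r"c(?=[aou])", "k"),  # c before a back vowel -> k (Zaecum -> Zaekum)
]

def normalize(term):
    """Map a spelling variant to a single normalized index form."""
    term = term.lower()
    for pattern, repl in RULES:
        term = re.sub(pattern, repl, term)
    return term
```

Under these rules all six variants of the example term map to the same entry, which is exactly the effect orthographic normalization is meant to have on the index.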
    <Paragraph position="9"> Whereas the usefulness of Subword Indexing became evident, we could not provide sufficient evidence for Synonym Class Indexing so far. However, synonym mapping is still incomplete in the current state of our subword dictionary. A question we have to deal with in the future is an alternative way to evaluate the comparative value of synonym class indexing. We have reason to believe that precision cannot be taken as the sole measure for the advantages of a query expansion in cases where the subword approach is already superior (for all layman and expert queries this method retrieved relevant documents, whereas word-based methods failed in 29.6% of the layman queries and 8% of the expert queries, cf. Table 5). It would be interesting to evaluate the retrieval effectiveness (in terms of precision and recall) of different versions of the synonym class indexing approach in those cases where retrieval using word or subword indexes fails due to a complete mismatch between query and documents. This will become even more interesting when mappings of our synonym identifiers to a large medical thesaurus (MeSH, (NLM, 2001)) are incorporated into our system. Alternatively, we may think of user-centered comparative studies (Hersh et al., 1995).</Paragraph>
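Synonym class indexing as described amounts to replacing each subword by a class identifier, so that expert and layman variants share index entries. A minimal sketch; the class table and identifiers below are invented for illustration, not taken from the authors' dictionary:

```python
# Hypothetical synonym class table: subwords that belong to the same
# equivalence class map to one shared class identifier.
SYNONYM_CLASSES = {
    "kidney": "SYN:0042", "renal": "SYN:0042", "nephr": "SYN:0042",
    "inflamm": "SYN:0107", "itis": "SYN:0107",
}

def class_index_terms(subwords):
    """Replace each subword by its synonym class id (or keep it unchanged)."""
    return [SYNONYM_CLASSES.get(s, s) for s in subwords]
```

With such a table, a layman query segmented as ["kidney", "inflamm"] and the expert term segmented as ["nephr", "itis"] produce identical index terms, which is how class indexing bridges the vocabulary mismatch discussed above.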
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 The AltaVista™ Experiment
</SectionTitle>
      <Paragraph position="0"> Before we developed our own search engine, we used the AltaVista™ Search Engine 3.0 (http://solutions.altavista.com) as our testbed, a widely distributed, easy-to-install, off-the-shelf IR system. For the conditions WSA, SU, and SY, we give the comparative results in Table 6. The experiments were run on an earlier version of the dictionary, hence the different results. AltaVista™ yielded a superior performance for all three major test scenarios compared with our home-grown engine. This is not at all surprising, given all the tuning efforts that went into AltaVista™. The data reveals clearly that commercially available search engines are compatible with our indexing approach. In an experimental setting, however, their use is hardly justifiable, because their internal design remains hidden and, therefore, cannot be modified under experimental conditions.</Paragraph>
      <Paragraph position="1"> The benefit of the subword indexing method is apparently higher for the commercial IR system. For AltaVista™, the average precision gain was 15.9% for SU and 11.5% for SY, whereas our simple tf-idf driven search engine gained only 5.3% for SU and 3.4% for SY. Given the imbalanced benefit for the two systems (other things being equal), it seems highly likely that the parameters feeding AltaVista™ profit even more from the subword approach than our simple prototype system does.</Paragraph>
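For concreteness, a "simple tf-idf driven" engine of the kind mentioned can be sketched as follows; this is a generic illustration of tf-idf scoring under standard definitions, not the authors' implementation:

```python
import math
from collections import Counter

def tfidf_scores(query_terms, docs):
    """Score each document (a token list) against the query by sum of tf*idf.

    tf  = raw term count in the document
    idf = log(N / df), with df the number of documents containing the term
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = sum(tf[t] * math.log(n / df[t]) for t in query_terms if df[t])
        scores.append(score)
    return scores
```

Whether the tokens fed to such a scorer are word forms, subwords, or synonym class identifiers is exactly the indexing choice the experiments above vary.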
    </Section>
  </Section>
</Paper>