<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1128">
  <Title>Semantic Retrieval for the Accurate Identification of Relational Concepts in Massive Textbases Yusuke Miyao, Tomoko Ohta, Katsuya Masuda, Yoshimasa Tsuruoka</Title>
  <Section position="3" start_page="0" end_page="1017" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Rapid expansion of text information has motivated the development of efficient methods of accessing information in huge texts. Furthermore, user demand has shifted toward the retrieval of more precise and complex information, including relational concepts. For example, biomedical researchers deal with a massive quantity of publications; MEDLINE contains approximately 15 million references to journal articles in life sciences, and its size is rapidly increasing, at a rate of more than 10% yearly (National Library of Medicine, 2005). Researchers would like to be able to search this huge textbase for biomedical correlations such as protein-protein or gene-disease associations (Blaschke and Valencia, 2002; Hao et al., 2005; Chun et al., 2006). However, the framework of traditional information retrieval (IR) has difficulty with the accurate retrieval of such relational concepts because relational concepts are essentially determined by semantic relations between words, and keyword-based IR techniques are insufficient to describe such relations precisely.</Paragraph>
    <Paragraph position="1"> The present paper demonstrates a framework for the accurate real-time retrieval of relational concepts from huge texts. Prior to retrieval, we prepare a semantically annotated textbase by applying NLP tools including deep parsers and term recognizers. That is, all sentences are annotated in advance with semantic structures and are stored in a structured database. User requests are converted on the fly into patterns of these semantic annotations, and texts are retrieved by matching these patterns with the pre-computed semantic annotations. The accurate retrieval of relational concepts is attained because we can precisely describe relational concepts using semantic annotations. In addition, real-time retrieval is possible because semantic annotations are computed in advance.</Paragraph>
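The two-phase design described above (annotate everything in advance, then match query patterns against the stored annotations) can be illustrated with a minimal Python sketch. This is not the authors' system: the class and field names are hypothetical, the protein names are placeholders, and the annotations are written by hand rather than produced by a deep parser.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """One pre-computed predicate-argument annotation (hypothetical schema)."""
    sentence_id: int
    predicate: str  # lemma of the predicate
    arg1: str       # semantic subject
    arg2: str       # semantic object


class AnnotatedTextbase:
    """Toy store: annotations are indexed once, before any query arrives."""

    def __init__(self):
        self.sentences = {}
        self.by_predicate = defaultdict(list)

    def add(self, sentence_id, text, relations):
        """Offline step: store a sentence together with its annotations."""
        self.sentences[sentence_id] = text
        for predicate, arg1, arg2 in relations:
            self.by_predicate[predicate].append(
                Relation(sentence_id, predicate, arg1, arg2))

    def search(self, predicate, arg1=None, arg2=None):
        """Online step: match a pattern against stored annotations.

        An argument left as None acts as a wildcard.
        """
        hits = []
        for rel in self.by_predicate.get(predicate, []):
            if arg1 is not None and rel.arg1 != arg1:
                continue
            if arg2 is not None and rel.arg2 != arg2:
                continue
            hits.append(self.sentences[rel.sentence_id])
        return hits


# Hand-written annotations stand in for parser output (an assumption).
textbase = AnnotatedTextbase()
textbase.add(0, "ProtA activates ProtB.", [("activate", "ProtA", "ProtB")])
textbase.add(1, "ProtB is activated by ProtA.", [("activate", "ProtA", "ProtB")])
textbase.add(2, "ProtB activates ProtC.", [("activate", "ProtB", "ProtC")])

# Query pattern: "what does ProtA activate?"
results = textbase.search("activate", arg1="ProtA")
print(results)
```

The active and passive sentences share the same annotation, so both are retrieved, while a keyword search for "ProtA" and "activate" could not tell sentence 1 apart from a sentence in which ProtA is the one being activated. No parsing happens at query time, which is what makes real-time retrieval plausible in this design.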
    <Paragraph position="2"> This framework has been implemented as a text retrieval system for MEDLINE. We first apply a deep parser (Miyao and Tsujii, 2005) and a dictionary-based term recognizer (Tsuruoka and Tsujii, 2004) to MEDLINE and obtain annotations of predicate argument structures and ontological identifiers of genes, gene products, diseases, and events. We then provide a search engine for these annotated sentences. User requests are converted into queries of region algebra (Clarke et al., 1995) extended with variables (Masuda et al., 2006) on these annotations. A search engine for the extended region algebra efficiently finds sentences having semantic annotations that match the input queries. In this paper, we evaluate this system with respect to the retrieval of biomedical correlations and examine the effects of using predicate argument structures and ontological identifiers.</Paragraph>
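To give a rough feel for querying annotations as tagged regions, the following Python sketch implements one containment operator and a variable-style join between regions. It is only loosely inspired by region algebra with variables; it does not reproduce the operators or query syntax of Clarke et al. (1995) or Masuda et al. (2006). The Region schema, the attribute names (id, base, arg1, arg2, name), and the placeholder protein names are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A tagged span of token positions, end exclusive (hypothetical schema)."""
    tag: str
    start: int
    end: int
    attrs: tuple = ()  # (name, value) pairs

    def attr(self, name):
        return dict(self.attrs).get(name)


def region(tag, start, end, **attrs):
    """Convenience constructor that freezes keyword attributes into a tuple."""
    return Region(tag, start, end, tuple(attrs.items()))


def select(regions, tag, **attrs):
    """Regions with the given tag whose attributes equal the given values."""
    return [r for r in regions
            if r.tag == tag and all(r.attr(k) == v for k, v in attrs.items())]


def containing(outer, inner):
    """Regions in `outer` that enclose at least one region in `inner`."""
    return [o for o in outer
            if any(i.start >= o.start and o.end >= i.end for i in inner)]


def match_relation(regions, lemma):
    """Bind two variables per sentence: the verb's arg1 and arg2 must equal
    the ids of entity regions lying inside the same sentence region."""
    results = []
    for sent in select(regions, "sentence"):
        inside = [r for r in regions
                  if r.start >= sent.start and sent.end >= r.end]
        entities = {e.attr("id"): e for e in select(inside, "entity")}
        for verb in select(inside, "word", base=lemma):
            x, y = verb.attr("arg1"), verb.attr("arg2")
            if x in entities and y in entities:
                results.append((sent.attr("id"),
                                entities[x].attr("name"),
                                entities[y].attr("name")))
    return results


# Hand-built regions for three toy sentences (not real parser output):
# "ProtA activates ProtB" / "ProtB is activated by ProtA" / "ProtA binds ProtC"
REGIONS = [
    region("sentence", 0, 3, id="s0"),
    region("entity", 0, 1, id="e1", name="ProtA"),
    region("word", 1, 2, base="activate", arg1="e1", arg2="e2"),
    region("entity", 2, 3, id="e2", name="ProtB"),
    region("sentence", 3, 8, id="s1"),
    region("entity", 3, 4, id="e3", name="ProtB"),
    region("word", 5, 6, base="activate", arg1="e4", arg2="e3"),
    region("entity", 7, 8, id="e4", name="ProtA"),
    region("sentence", 8, 11, id="s2"),
    region("entity", 8, 9, id="e5", name="ProtA"),
    region("word", 9, 10, base="bind", arg1="e5", arg2="e6"),
    region("entity", 10, 11, id="e6", name="ProtC"),
]

activations = match_relation(REGIONS, "activate")
bind_sentences = containing(select(REGIONS, "sentence"),
                            select(REGIONS, "word", base="bind"))
print(activations)
print([s.attr("id") for s in bind_sentences])
```

In this sketch the variables are resolved by a plain dictionary lookup inside each sentence; an engine working at MEDLINE scale would need indexed region lists rather than the linear scans used here.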
    <Paragraph position="3"> The need for the discovery of relational concepts has been investigated intensively in Information Extraction (IE). However, little research has targeted on-demand retrieval from huge texts.</Paragraph>
    <Paragraph position="4"> One difficulty is that IE techniques such as pattern matching and machine learning require processing that is too heavy to be applied on the fly.</Paragraph>
    <Paragraph position="5"> Another difficulty is that target information must be formalized beforehand and each system is designed for a specific task. For instance, an IE system for protein-protein interactions is not useful for finding gene-disease associations. Apart from IE research, enrichment of texts with various annotations has been proposed and is becoming a new research area for information management (IBM, 2005; TEI, 2004). The present study basically examines this new direction in research.</Paragraph>
    <Paragraph position="6"> The significant contribution of the present paper, however, is to provide the first empirical results of this framework for a real task with a huge textbase.</Paragraph>
  </Section>
</Paper>