<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1128">
  <Title>Semantic Retrieval for the Accurate Identification of Relational Concepts in Massive Textbases. Yusuke Miyao, Tomoko Ohta, Katsuya Masuda, Yoshimasa Tsuruoka</Title>
  <Section position="4" start_page="1017" end_page="1018" type="metho">
    <SectionTitle>
2 Background: Resources and Tools for Semantic Annotations
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="1017" end_page="1017" type="sub_section">
      <SectionTitle>
Semantic Annotations
</SectionTitle>
      <Paragraph position="0"> The proposed system for the retrieval of relational concepts is a product of recent developments in NLP resources and tools. In this section, ontology databases, deep parsers, and search algorithms for structured data are introduced.</Paragraph>
    </Section>
    <Section position="2" start_page="1017" end_page="1017" type="sub_section">
      <SectionTitle>
2.1 Ontology databases
</SectionTitle>
      <Paragraph position="0"> Ontology databases are collections of words and phrases in specific domains. Such databases have been constructed extensively for the systematic management of domain knowledge by organizing textual expressions of ontological entities that are detached from actual sentences.</Paragraph>
      <Paragraph position="1"> For example, GENA (Koike and Takagi, 2004) is a database of genes and gene products that is semi-automatically collected from well-known databases, including HUGO, OMIM, Genatlas, Locuslink, GDB, MGI, FlyBase, WormBase, CYGD, and SGD. Table 1 shows an example of a GENA entry. &quot;Symbol&quot; and &quot;Name&quot; denote short forms and nomenclatures of genes, respectively. &quot;Species&quot; represents the organism species in which this gene is observed. &quot;Synonym&quot; is a list of synonyms and name variations. &quot;Product&quot; gives a list of products of this gene, such as proteins coded by this gene. &quot;External links&quot; provides links to other databases, and helps to obtain detailed information from these databases. For biomedical terms other than genes/gene products, the Unified Medical Language System (UMLS) meta-thesaurus (Lindberg et al., 1993) is a large database that contains various names of biomedical and health-related concepts.</Paragraph>
      <Paragraph position="2"> Ontology databases provide mappings between textual expressions and entities in the real world. For example, Table 1 indicates that CRP, MGC88244, and PTX1 denote the same gene conceptually. Hence, these resources enable us to canonicalize variations of textual expressions of ontological entities.</Paragraph>
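      <Paragraph><![CDATA[ The canonicalization described above can be sketched as a lookup from surface forms to a shared identifier. This is a minimal illustration, not the GENA implementation: the entry layout and the identifier "GENA:0000001" are hypothetical, and only the three surface forms CRP, MGC88244, and PTX1 come from the text.

```python
# Hypothetical, hand-written entry modeled on the GENA fields described in the text.
# The identifier "GENA:0000001" is invented for illustration only.
GENA_ENTRIES = [
    {
        "gene_id": "GENA:0000001",
        "symbol": "CRP",
        "synonyms": ["MGC88244", "PTX1"],
    },
]


def build_synonym_index(entries):
    """Map every surface form (lower-cased) to its canonical gene ID."""
    index = {}
    for entry in entries:
        for form in [entry["symbol"], *entry["synonyms"]]:
            index[form.lower()] = entry["gene_id"]
    return index


def canonicalize(term, index):
    """Return the canonical ID for a surface form, or None if it is unknown."""
    return index.get(term.lower())


INDEX = build_synonym_index(GENA_ENTRIES)
```

With this index, the three name variations resolve to one identifier, so a query for any of them can be rewritten into the same ontological key.]]></Paragraph>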
    </Section>
    <Section position="3" start_page="1017" end_page="1018" type="sub_section">
      <SectionTitle>
2.2 Parsing technologies
</SectionTitle>
      <Paragraph position="0"> State-of-the-art CFG parsers (Charniak and Johnson, 2005) can now compute phrase structures of natural sentences with fairly high accuracy. These parsers have been used in various NLP tasks including IE and text mining. In addition, parsers that compute deeper analyses, such as predicate argument structures, have become available for the processing of real-world sentences (Miyao and Tsujii, 2005). Predicate argument structures are canonicalized representations of sentence meanings, and express the semantic relations of words explicitly. Figure 1 shows an output of an HPSG parser (Miyao and Tsujii, 2005) for the sentence &quot;A normal serum CRP measurement does not exclude deep vein thrombosis.&quot; The dotted lines express predicate argument relations. For example, the ARG1 arrow coming from &quot;exclude&quot; points to the noun phrase &quot;A normal serum CRP measurement&quot;, which indicates that the subject of &quot;exclude&quot; is this noun phrase; such relations are not explicitly represented by phrase structures.</Paragraph>
      <Paragraph position="1"> Predicate argument structures are beneficial for our purpose because they can represent relational concepts in an abstract manner. For example, the relational concept of &quot;CRP excludes thrombosis&quot; can be represented as a predicate argument structure, as shown in Figure 2. This structure is shared by various syntactic expressions of the same concept, such as passivization (e.g., &quot;thrombosis is excluded by CRP&quot;) and relativization (e.g., &quot;thrombosis that CRP excludes&quot;). Hence, we can abstract surface variations of sentences and describe relational concepts in a canonicalized form.</Paragraph>
    </Section>
    <Section position="4" start_page="1018" end_page="1018" type="sub_section">
      <SectionTitle>
2.3 Structural search algorithms
</SectionTitle>
      <Paragraph position="0"> Search algorithms for structured texts have been studied extensively, and examples include XML databases with XPath (Clark and DeRose, 1999) and XQuery (Boag et al., 2005), and region algebra (Clarke et al., 1995). The present study focuses on region algebra extended with variables (Masuda et al., 2006) because it provides an efficient search algorithm for tags with cross boundaries. When we annotate texts with various levels of syntactic/semantic structures, cross boundaries are inherently nonnegligible. In fact, as described in Section 3, our system exploits annotations of predicate argument structures and ontological entities, which include substantial cross boundaries.</Paragraph>
      <Paragraph position="1"> Region algebra is defined as a set of operators on regions, i.e., word sequences. Table 2 shows operators of the extended region algebra, where A and B denote regions, and results of operations are also regions. For example, &quot;A &amp; B&quot; denotes a region that includes both A and B. Four containment operators, &gt;, &gt;&gt;, &lt;, and &lt;&lt;, represent ancestor/descendant relations in XML. In search algorithms for region algebra, the cost of retrieving the first answer is constant, and that of an exhaustive search is bounded by the lowest frequency of a word in a query (Clarke et al., 1995).</Paragraph>
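      <Paragraph><![CDATA[ To make the idea of operators over regions concrete, the following is a simplified sketch in which a region is an inclusive (start, end) pair of word offsets. It is not the algorithm of Clarke et al. (1995) or Masuda et al. (2006): the function names are hypothetical, the evaluation is a naive pairwise scan rather than an indexed search, and no mapping to the four operator symbols of Table 2 is assumed.

```python
def containing(a_regions, b_regions):
    """Regions of A that contain at least one region of B."""
    return [a for a in a_regions
            if any(a[0] <= b[0] and b[1] <= a[1] for b in b_regions)]


def contained_in(a_regions, b_regions):
    """Regions of A that lie inside at least one region of B."""
    return [a for a in a_regions
            if any(b[0] <= a[0] and a[1] <= b[1] for b in b_regions)]


def both(a_regions, b_regions):
    """Smallest spans that cover one region from A and one region from B."""
    spans = {(min(a[0], b[0]), max(a[1], b[1]))
             for a in a_regions for b in b_regions}
    # Keep only minimal spans: drop any span that contains a different span.
    return sorted(s for s in spans
                  if not any(t != s and s[0] <= t[0] and t[1] <= s[1]
                             for t in spans))


# Toy data (assumed offsets): two sentences, one "CRP" token, two "exclude" tokens.
sentences = [(0, 9), (10, 19)]
crp = [(3, 3)]
exclude = [(7, 7), (15, 15)]
```

Because every operator returns a list of regions, results compose: containing(sentences, both(crp, exclude)) selects the sentences that contain both words.]]></Paragraph>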
      <Paragraph position="2"> Variables in the extended region algebra allow us to express shared structures and are necessary in order to describe predicate argument structures.</Paragraph>
      <Paragraph position="3"> For example, Figure 3 shows a formula in the extended region algebra that represents the predicate argument structure of &quot;CRP excludes something.&quot; This formula indicates that a sentence contains a region in which the word &quot;exclude&quot; exists, the first argument (&quot;arg1&quot;) phrase of which includes the word &quot;CRP.&quot; A predicate argument relation is expressed by the variable, &quot;$subject.&quot; Figure 4 shows a situation in which this formula is satisfied.</Paragraph>
      <Paragraph position="4"> Three horizontal bars describe regions covered by &lt;sentence&gt;, &lt;phrase&gt;, and &lt;word&gt; tags, respectively. The dotted line denotes the relation expressed by this variable. Given this formula as a query, a search engine can retrieve sentences having semantic annotations that satisfy this formula.</Paragraph>
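      <Paragraph><![CDATA[ The role of the variable can be illustrated with a small matcher over hand-built annotations. This is a sketch under stated assumptions, not the search engine of Masuda et al. (2006): the classes, the function name, and the phrase spans are hypothetical, base forms are approximated by surface tokens, and only the attribute values arg1="1" and arg2="24" are taken from the paper's example.

```python
from dataclasses import dataclass, field


@dataclass
class Phrase:
    id: str
    start: int  # offset of the first word covered by the phrase
    end: int    # offset of the last word covered by the phrase (inclusive)


@dataclass
class Word:
    base: str
    position: int
    args: dict = field(default_factory=dict)  # e.g. {"arg1": "1"}


@dataclass
class Sentence:
    words: list
    phrases: list


def matches(sentence, predicate, arg_name, keyword):
    """True if `predicate` has an `arg_name` phrase that covers `keyword`."""
    phrases = {p.id: p for p in sentence.phrases}
    for word in sentence.words:
        if word.base != predicate:
            continue
        # Binding step: the attribute value plays the role of the variable.
        phrase = phrases.get(word.args.get(arg_name))
        if phrase is None:
            continue
        if any(w.base == keyword and phrase.start <= w.position <= phrase.end
               for w in sentence.words):
            return True
    return False


tokens = "A normal serum CRP measurement does not exclude deep vein thrombosis".split()
words = [Word(base=t, position=i) for i, t in enumerate(tokens)]
words[7].args = {"arg1": "1", "arg2": "24"}  # attributes of "exclude"
# Phrase spans below are assumed for illustration.
example = Sentence(words=words, phrases=[Phrase("1", 0, 4), Phrase("24", 8, 10)])
```

The phrase ID shared by the word attribute and the phrase tag is what ties the two regions together, which is the part that plain containment operators cannot express.]]></Paragraph>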
    </Section>
  </Section>
  <Section position="5" start_page="1018" end_page="1020" type="metho">
    <SectionTitle>
3 A Text Retrieval System for MEDLINE
</SectionTitle>
    <Paragraph position="0"> While the above resources and tools have been developed independently, their collaboration opens up a new framework for the retrieval of relational concepts, as described below (Figure 5).</Paragraph>
    <Paragraph position="1"> Off-line processing: Prior to retrieval, a deep parser is applied to compute predicate argument structures, and a term recognizer is applied to create mappings from textual expressions into identifiers in ontology databases. Semantic annotations are stored and indexed in a structured database for the extended region algebra.</Paragraph>
    <Paragraph position="2"> On-line processing: User input is converted into queries of the extended region algebra. A search engine retrieves sentences having semantic annotations that match the queries.</Paragraph>
    <Paragraph position="3"> This framework is applied to a text retrieval engine for MEDLINE. MEDLINE is an exhaustive database covering nearly 4,500 journals in the life sciences and includes the bibliographies of articles, about half of which have abstracts. Research on IE and text mining in biomedical science has focused mainly on MEDLINE. In the present paper, we target all articles indexed in MEDLINE at the end of 2004 (14,785,094 articles). The following sections explain in detail off-/on-line processing for the text retrieval system for MEDLINE.</Paragraph>
    <Section position="1" start_page="1019" end_page="1020" type="sub_section">
      <SectionTitle>
3.1 Off-line processing: HPSG parsing and term recognition
</SectionTitle>
      <Paragraph position="0"> We first parsed all sentences using an HPSG parser (Miyao and Tsujii, 2005) to obtain their predicate argument structures. Because our target is biomedical texts, we re-trained a parser (Hara et al., 2005) with the GENIA treebank (Tateisi et al., 2005), and also applied a bidirectional part-of-speech tagger (Tsuruoka and Tsujii, 2005) trained with the GENIA treebank as a preprocessor.</Paragraph>
      <Paragraph position="1"> Because parsing the entire MEDLINE on a single machine is still unrealistically slow, we used two geographically separated computer clusters having 170 nodes (340 Xeon CPUs).</Paragraph>
      <Paragraph position="2"> These clusters are separately administered and not dedicated to the present study. In order to effectively use such an environment, GXP (Taura, 2004) was used to connect these clusters and distribute the load among them. Our processes were given the lowest priority so that our task would not disturb other users. We finished parsing the entire  Next, we annotated technical terms, such as genes and diseases, to create mappings to ontological identifiers. A dictionary-based term recognition algorithm (Tsuruoka and Tsujii, 2004) was applied for this task. First, an expanded term list was created by generating name variations of terms in GENA and the UMLS meta-thesaurus (footnote 1).</Paragraph>
      <Paragraph position="3"> Table 3 shows the size of the original database and the number of entries expanded by name variations. Terms in MEDLINE were then identified by the longest matching of entries in this expanded list with words/phrases in MEDLINE.</Paragraph>
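      <Paragraph><![CDATA[ The longest-matching step can be sketched as a greedy left-to-right scan over tokens. This is an illustration only and not the algorithm of Tsuruoka and Tsujii (2004): the function name, the max_len cutoff, the lower-casing, and the identifiers in TERM_INDEX are all assumptions made for the example.

```python
def longest_match(tokens, term_index, max_len=5):
    """Greedy longest matching of dictionary entries against a token list.

    Returns (start, end, identifier) triples, where `end` is exclusive.
    """
    found = []
    i = 0
    while i < len(tokens):
        hit = None
        # Try the longest candidate first, then shrink the window.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n]).lower()
            if candidate in term_index:
                hit = (i, i + n, term_index[candidate])
                break
        if hit:
            found.append(hit)
            i = hit[1]  # continue after the matched term
        else:
            i += 1
    return found


# Hypothetical expanded term list; the identifiers are invented.
TERM_INDEX = {
    "crp": "gene:0001",
    "thrombosis": "disease:0001",
    "deep vein thrombosis": "disease:0002",
}

TOKENS = "A normal serum CRP measurement does not exclude deep vein thrombosis".split()
```

Trying longer candidates first is what makes "deep vein thrombosis" win over the shorter entry "thrombosis" in this example.]]></Paragraph>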
      <Paragraph position="4"> The necessity of ontologies is not limited to nominal expressions. Various verbs are used for expressing events. For example, activation events of proteins can be expressed by &quot;activate,&quot; &quot;enhance,&quot; and other event expressions. Although the numbers of verbs and their event types are much smaller than those of technical terms, verbal expressions are important for the description of relational concepts. Since ontologies of event expressions in this domain have not yet been constructed, we developed an ontology from scratch. We investigated 500 abstracts extracted from MEDLINE, and classified 167 frequent expressions, including verbs and their nominalized forms, into 18 event types. Table 4 shows a part of this ontology. These expressions in MEDLINE were automatically annotated with event types.</Paragraph>
      <Paragraph position="5"> As a result, we obtained semantically annotated MEDLINE. Table 5 shows the size of the original MEDLINE and semantic annotations. Figure 6 shows semantic annotations for the sentence in Figure 1, where &quot;-&quot; indicates nodes of XML (footnote 2), although the latter half of the sentence is omitted because of space limitations.</Paragraph>
      <Paragraph position="6"> Sentences are annotated with four tags (footnote 3), &quot;phrase,&quot; &quot;word,&quot; &quot;sentence,&quot; and &quot;entity name,&quot; and their attributes as given in Table 6. Predicate argument structures are annotated as attributes, &quot;mod&quot; and &quot;argX,&quot; which point to the IDs of the argument phrases. For example, in Figure 6, the &lt;word&gt; tag for &quot;exclude&quot; has the attributes arg1=&quot;1&quot; and arg2=&quot;24&quot;, which denote the IDs of the subject and object phrases, respectively.</Paragraph>
      <Paragraph position="7"> Footnote 1: We collected disease names by specifying a query with the semantic type as &quot;Disease or Syndrome.&quot; Footnote 2: Although this example is shown in XML, this textbase contains tags with cross boundaries because tags for predicate argument structures and technical terms may overlap. Footnote 3: Additional tags exist for representing document structures such as &quot;title&quot; (details omitted).</Paragraph>
    </Section>
    <Section position="2" start_page="1020" end_page="1020" type="sub_section">
      <SectionTitle>
Tag Attributes
</SectionTitle>
      <Paragraph position="0">
        phrase: id, cat, head, lex head
        word: id, cat, pos, base, mod, argX, rel type
        sentence: sentence id
        entity name: id, type, gene id/disease id, gene symbol, gene name, species, db site
      </Paragraph>
    </Section>
    <Section position="3" start_page="1020" end_page="1020" type="sub_section">
      <SectionTitle>
Attribute Description
</SectionTitle>
      <Paragraph position="0">
        id: unique identifier
        cat: syntactic category
        head: head daughter's ID
        lex head: lexical head's ID
        pos: part-of-speech
        base: base form of the word
        mod: ID of modifying phrase
        argX: ID of the X-th argument of the word
        rel type: event type
        sentence id: sentence's ID
        type: whether gene, gene prod, or disease
        gene id: ID in GENA
        disease id: ID in the UMLS meta-thesaurus
        gene symbol: short form of the gene
        gene name: nomenclature of the gene
        species: species that have this gene
        db site: links to external databases
      </Paragraph>
    </Section>
    <Section position="4" start_page="1020" end_page="1020" type="sub_section">
      <SectionTitle>
3.2 On-line processing
</SectionTitle>
      <Paragraph position="0"> The off-line processing described above results in much simpler on-line processing. User input is converted into queries of the extended region algebra, and the converted queries are entered into a search engine for the extended region algebra. The implementation of a search engine is described in detail in Masuda et al. (2006).</Paragraph>
      <Paragraph position="1"> Basically, given subject x, object y, and verb v, the system creates the following query:</Paragraph>
      <Paragraph position="3"> Ontological identifiers are substituted for x, y, and v, if possible. Nominal keywords, i.e., x and y, are replaced by [entity_name gene_id=&quot;n&quot;] or [entity_name disease_id=&quot;n&quot;], where n is the ontological identifier of x or y. For verbal keywords, base=&quot;v&quot; is replaced by rel_type=&quot;r&quot;, where r is the event type of v.</Paragraph>
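      <Paragraph><![CDATA[ The substitution rules above can be sketched as a small query builder. The query formula itself is not reproduced in this text, so QUERY_TEMPLATE below is a hypothetical stand-in and not the authors' query; the fallback pattern for unmapped nominal keywords, the function names, and the identifiers and event type name used in the usage note are likewise assumptions. Only the bracketed [entity_name gene_id="n"], [entity_name disease_id="n"], base="v", and rel_type="r" forms come from the paragraph above.

```python
# Hypothetical template: the actual query formula is not available here.
QUERY_TEMPLATE = (
    '[sentence] > ([word {verb} arg1="$subject" arg2="$object"]'
    ' & ([phrase id="$subject"] > {x})'
    ' & ([phrase id="$object"] > {y}))'
)


def nominal_pattern(keyword, gene_ids, disease_ids):
    """Replace a nominal keyword by an entity pattern when an ID is known."""
    key = keyword.lower()
    if key in gene_ids:
        return '[entity_name gene_id="{}"]'.format(gene_ids[key])
    if key in disease_ids:
        return '[entity_name disease_id="{}"]'.format(disease_ids[key])
    # Assumed fallback: match the keyword itself as a word.
    return '[word base="{}"]'.format(keyword)


def verb_condition(verb, event_types):
    """Replace base="v" by rel_type="r" when the verb has an event type."""
    if verb in event_types:
        return 'rel_type="{}"'.format(event_types[verb])
    return 'base="{}"'.format(verb)


def build_query(x, y, v, gene_ids, disease_ids, event_types):
    """Fill the template for subject x, object y, and verb v."""
    return QUERY_TEMPLATE.format(
        verb=verb_condition(v, event_types),
        x=nominal_pattern(x, gene_ids, disease_ids),
        y=nominal_pattern(y, gene_ids, disease_ids),
    )
```

For instance, build_query("CRP", "thrombosis", "activate", {"crp": "G1"}, {"thrombosis": "D1"}, {"activate": "activation"}) produces a query keyed on identifiers and an event type, whereas passing empty dictionaries leaves the keywords as they were typed.]]></Paragraph>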
    </Section>
  </Section>
</Paper>