<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-4005">
  <Title>Sydney, July 2006. (c) 2006 Association for Computational Linguistics. An intelligent search engine and GUI-based efficient MEDLINE search tool based on deep syntactic parsing</Title>
  <Section position="4" start_page="0" end_page="17" type="metho">
    <SectionTitle>
2 Enju: An English HPSG Parser
</SectionTitle>
    <Paragraph position="0"> We developed an English HPSG parser, Enju (Miyao and Tsujii, 2005; Hara et al., 2005; Ninomiya et al., 2005). Table 1 shows its performance. The F-score in the table is the accuracy of the predicate-argument relations output by the parser. A predicate-argument relation is defined as a tuple &lt;s, wh, a, wa&gt;, where s is the predicate type (e.g., adjective, intransitive verb), wh is the head word of the predicate, a is the argument label (MOD, ARG1, ..., ARG4), and wa is the head word of the argument. Precision and recall are the ratios of tuples correctly identified by the parser, measured against the tuples it outputs and the gold-standard tuples, respectively. The lexicon of the grammar was extracted from Sections 02-21 of the Penn Treebank (39,832 sentences). In the table, 'HPSG-PTB' means that the statistical model was trained on the Penn Treebank, and 'HPSG-GENIA' means that it was trained on both the Penn Treebank and the GENIA treebank, as described in (Hara et al., 2005).</Paragraph>
    <Paragraph position="1"> The GENIA treebank (Tateisi et al., 2005) consists of 500 abstracts (4,446 sentences) extracted from MEDLINE.</Paragraph>
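    <Paragraph> The tuple-based evaluation described above can be sketched in a few lines of Python. This is a minimal illustration, not Enju's evaluation script: the PredArg type, the score function, the predicate-type labels, and the toy tuples are all hypothetical, and only the tuple layout (s, wh, a, wa) comes from the text.

```python
from typing import NamedTuple


class PredArg(NamedTuple):
    """One predicate-argument relation (s, wh, a, wa)."""
    s: str   # predicate type (label names here are hypothetical)
    wh: str  # head word of the predicate
    a: str   # argument label: MOD, ARG1, ..., ARG4
    wa: str  # head word of the argument


def score(predicted, gold):
    """Return (precision, recall, F-score) for two sets of PredArg tuples."""
    correct = len(predicted.intersection(gold))
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Toy data, invented for illustration only (not actual parser output).
gold = {
    PredArg("verb_transitive", "forced", "ARG1", "conditions"),
    PredArg("verb_transitive", "forced", "ARG2", "them"),
    PredArg("verb_transitive", "scrub", "ARG2", "return"),
    PredArg("adjective", "scheduled", "ARG1", "return"),
}
predicted = {
    PredArg("verb_transitive", "forced", "ARG1", "conditions"),
    PredArg("verb_transitive", "forced", "ARG2", "them"),
    PredArg("verb_transitive", "scrub", "ARG2", "Monday"),  # wrong argument head
}

precision, recall, f_score = score(predicted, gold)
print(precision, recall, f_score)
```

In this sketch a relation counts as correct only when all four fields of the tuple match the gold standard.</Paragraph>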
    <Paragraph position="2"> Figure 1 shows a part of the parse tree and feature structure for the sentence &quot;NASA officials vowed to land Discovery early Tuesday at one of three locations after weather conditions forced them to scrub Monday's scheduled return.&quot;</Paragraph>
  </Section>
  <Section position="5" start_page="17" end_page="17" type="metho">
    <SectionTitle>
3 MEDIE: a search engine for MEDLINE
</SectionTitle>
    <Paragraph position="0"> Figure 2 shows the top page of MEDIE. MEDIE is an intelligent search engine for the accurate retrieval of relational concepts from MEDLINE (Miyao et al., 2006). Prior to retrieval, all sentences are annotated with predicate-argument structures and ontological identifiers by applying Enju and a term recognizer.</Paragraph>
    <Section position="1" start_page="17" end_page="17" type="sub_section">
      <SectionTitle>
3.1 Automatically Annotated Corpus
</SectionTitle>
      <Paragraph position="0"> First, we applied a POS analyzer and then Enju.</Paragraph>
      <Paragraph position="1"> The POS analyzer and the HPSG parser were trained on the GENIA corpus (Tsuruoka et al., 2005; Hara et al., 2005), which comprises around 2,000 MEDLINE abstracts annotated with POS tags and Penn Treebank style syntactic parse trees (Tateisi et al., 2005). The HPSG parser generates parse trees in a stand-off format that can be converted to XML by combining them with the original text.</Paragraph>
      <Paragraph position="2"> We also annotated technical terms for genes and diseases in the corpus we developed. Technical terms are annotated simply by exact matching between dictionary entries and terms in MEDLINE, where terms are separated by space, tab, period, comma, hat, colon, semicolon, brackets, square brackets, or slash.</Paragraph>
      <Paragraph position="3"> The entire dictionary was generated by applying the automatic generation method of name variations (Tsuruoka and Tsujii, 2004) to the GENA dictionary for the gene names (Koike and Takagi, 2004) and to the UMLS (Unified Medical Language System) meta-thesaurus for the disease names (Lindberg et al., 1993). The resulting gene and disease dictionary contains 4,467,855 entries.</Paragraph>
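      <Paragraph> The exact-match term annotation described in this section can be pictured with the following Python sketch. It is an illustration under stated assumptions rather than the actual MEDIE term recognizer: the dictionary entries and identifiers are invented, 'brackets' is read as parentheses, and matching multi-word entries as token sequences with a longest-match-first rule is our own assumption.

```python
import re

# Delimiters listed in the text: space, tab, period, comma, hat, colon,
# semicolon, brackets (assumed to mean parentheses), square brackets, slash.
DELIMS = re.compile(r"[ \t.,^:;()\[\]/]+")

# Toy dictionary with invented identifiers; keys are token sequences.
DICTIONARY = {
    ("dystrophin",): "GENE:0001",
    ("muscular", "dystrophy"): "DISEASE:0001",
}
MAX_LEN = max(len(key) for key in DICTIONARY)


def annotate(text):
    """Return (term, identifier) pairs found by exact dictionary matching."""
    tokens = [t for t in DELIMS.split(text) if t]
    found = []
    i = 0
    while len(tokens) > i:
        # Try the longest candidate first (an assumption of this sketch).
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in DICTIONARY:
                found.append((" ".join(key), DICTIONARY[key]))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return found


hits = annotate("dystrophin/utrophin levels in muscular dystrophy (toy example).")
print(hits)
```

Because matching is exact, a variant such as a differently capitalized or hyphenated name is found only if the dictionary lists it, which is why the dictionary is expanded with generated name variations.</Paragraph>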
    </Section>
    <Section position="2" start_page="17" end_page="17" type="sub_section">
      <SectionTitle>
3.2 Functions of MEDIE
</SectionTitle>
      <Paragraph position="0"> MEDIE provides three types of search: semantic search, keyword search, and GCL search. GCL search provides the most fundamental and powerful functions, in which users can specify boolean relations, linear order relations, and structural relations with variables. Trained users can access all functions of MEDIE through GCL search, but it is not easy for general users to write appropriate queries for the parsed corpus. Semantic search lets users easily specify an event verb together with its subject and object. MEDIE automatically generates a GCL query from the semantic query and runs the GCL search. Figure 3 shows the output of semantic search for the query 'What disease does dystrophin cause?'. This example gives an intuitive understanding of proximal and structural retrieval with a richly annotated parsed corpus. MEDIE retrieves sentences that include the event verb 'cause' and the noun 'dystrophin' such that 'dystrophin' is the subject of the event verb. The event verb and its subject and object are highlighted with designated colors. As seen in the figure, even small clauses within relative clauses, passive forms, or coordination are retrieved. Because the objects of the event verbs are highlighted, we can easily see what disease dystrophin causes. Because the target corpus is already annotated with disease entities, MEDIE can efficiently retrieve the disease expressions.</Paragraph>
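      <Paragraph> The idea behind semantic search can be illustrated with a small Python sketch that filters pre-annotated events by verb, subject, and object type. It does not reproduce MEDIE's GCL queries, whose syntax is not given here; the Event type, the semantic_search function, and the placeholder sentences are all hypothetical.

```python
from typing import NamedTuple


class Event(NamedTuple):
    sentence: str   # original sentence text
    verb: str       # base form of the event verb
    subject: str    # head word of the verb's subject
    obj: str        # object expression
    obj_type: str   # ontological class of the object, e.g. "disease"


def semantic_search(events, verb=None, subject=None, obj_type=None):
    """Keep events matching every field that is specified (None = any)."""
    def ok(wanted, actual):
        return wanted is None or wanted == actual

    return [
        e for e in events
        if ok(verb, e.verb) and ok(subject, e.subject) and ok(obj_type, e.obj_type)
    ]


# Placeholder corpus, invented for illustration only.
events = [
    Event("Toy sentence A: dystrophin causes disorder X.",
          "cause", "dystrophin", "disorder X", "disease"),
    Event("Toy sentence B: disorder Y is caused by dystrophin.",
          "cause", "dystrophin", "disorder Y", "disease"),
    Event("Toy sentence C: protein Z causes disorder X.",
          "cause", "protein Z", "disorder X", "disease"),
]

hits = semantic_search(events, verb="cause", subject="dystrophin", obj_type="disease")
for e in hits:
    print(e.obj, "|", e.sentence)
```

The passive toy sentence B is returned alongside the active sentence A because the sketch assumes both have already been normalized to the same subject and object by the parser.</Paragraph>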
    </Section>
  </Section>
  <Section position="6" start_page="17" end_page="19" type="metho">
    <SectionTitle>
4 Info-PubMed: a GUI-based MEDLINE search tool
</SectionTitle>
    <Paragraph position="0"> Info-PubMed is a MEDLINE search tool with a GUI that helps users find information about biomedical entities such as genes, proteins, and the interactions between them.</Paragraph>
    <Paragraph position="1"> Info-PubMed provides information from MEDLINE on protein-protein interactions. Given the name of a gene or protein, it shows a list of the names of other genes/proteins which co-occur in sentences from MEDLINE, along with the frequency of co-occurrence.</Paragraph>
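    <Paragraph> The co-occurrence list described above amounts to counting, for a given gene or protein, the other names that appear in the same sentences. A minimal Python sketch follows; the cooccurring function and the toy sentence annotations are hypothetical and stand in for name recognition over MEDLINE.

```python
from collections import Counter


def cooccurring(query, sentences):
    """Count names that share a sentence with `query`.

    `sentences` is a list of sets, each holding the gene/protein names
    recognized in one sentence. Returns (name, frequency) pairs,
    most frequent first.
    """
    counts = Counter()
    for names in sentences:
        if query in names:
            counts.update(name for name in names if name != query)
    return counts.most_common()


# Toy annotations, invented for illustration only.
sentences = [
    {"RAF1", "MEK1"},
    {"RAF1", "MEK1", "RAS"},
    {"RAS", "HSP90"},
]

partners = cooccurring("RAF1", sentences)
print(partners)
```

Each sentence contributes at most one count per partner name because the names are stored as sets.</Paragraph>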
    <Paragraph position="2"> Co-occurrence of two proteins/genes in the same sentence does not always imply that they interact. For more accurate extraction of sentences that indicate interactions, it is necessary to identify the relation between the two substances. We adopted the predicate-argument structures (PASs) derived by Enju and constructed extraction patterns on specific verbs and their arguments based on these PASs (Yakusiji, 2006).</Paragraph>
    <Section position="1" start_page="18" end_page="19" type="sub_section">
      <SectionTitle>
4.1 Functions of Info-PubMed
</SectionTitle>
      <Paragraph position="0"> In the 'Gene Searcher' window, enter the name of a gene or protein that you are interested in.</Paragraph>
      <Paragraph position="1"> For example, if you are interested in Raf1, type &quot;raf1&quot; in the 'Gene Searcher' (Figure 4). You will see a list of genes whose description in our dictionary contains &quot;raf1&quot; (Figure 5). Then, drag one of the GeneBoxes from the 'Gene Searcher' to the 'Interaction Viewer.' You will see a list of genes/proteins which co-occur in the same sentences, along with their co-occurrence frequency.</Paragraph>
      <Paragraph position="2"> The GeneBox in the leftmost column is the one you have moved to the 'Interaction Viewer.' The GeneBoxes in the second column correspond to genes/proteins which co-occur with it in the same sentences, and the boxes in the third column are InteractionBoxes.</Paragraph>
      <Paragraph position="3"> Drag an InteractionBox to the 'ContentViewer' to see the content of the box (Figure 6). An InteractionBox is a set of SentenceBoxes. A SentenceBox corresponds to a sentence in MEDLINE in which the two genes/proteins co-occur. A SentenceBox indicates whether the co-occurrence in the sentence is direct evidence of interaction or not. If it is judged to be direct evidence of interaction, it is labeled Interaction. Otherwise, it is labeled Co-occurrence.</Paragraph>
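      <Paragraph> The distinction between the Interaction and Co-occurrence labels can be pictured with the following Python sketch, which checks whether a verb from a pattern list links the two names as its arguments. The verb list, the label function, and the relation triples are hypothetical; the actual extraction patterns built on Enju's predicate-argument structures are not listed in this paper.

```python
# Hypothetical pattern verbs, chosen only for illustration.
INTERACTION_VERBS = {"activate", "bind", "interact", "phosphorylate"}


def label(relations, gene_a, gene_b):
    """Label one sentence for the pair (gene_a, gene_b).

    `relations` holds (verb, arg1_head, arg2_head) triples derived from
    the sentence's predicate-argument structures.
    """
    pair = {gene_a, gene_b}
    for verb, arg1, arg2 in relations:
        # Direct evidence: a pattern verb whose two arguments are the pair.
        if verb in INTERACTION_VERBS and {arg1, arg2} == pair:
            return "Interaction"
    return "Co-occurrence"


# Toy relations, invented for illustration only.
direct = [("activate", "RAF1", "MEK1")]
indirect = [("express", "cells", "RAF1"), ("express", "cells", "MEK1")]

print(label(direct, "RAF1", "MEK1"))
print(label(indirect, "RAF1", "MEK1"))
```

Sentences that mention both names without such a linking verb still appear in the InteractionBox, but only with the weaker Co-occurrence label.</Paragraph>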
    </Section>
  </Section>
  <Section position="7" start_page="19" end_page="19" type="metho">
    <SectionTitle>
5 Conclusion
</SectionTitle>
    <Paragraph position="0"> We presented Enju, an English HPSG parser; MEDIE, a search engine for relational concepts from MEDLINE; and Info-PubMed, a GUI-based MEDLINE search tool.</Paragraph>
    <Paragraph position="1"> MEDIE and Info-PubMed demonstrate how the results of deep parsing can be used for intelligent text mining and semantic information retrieval in the biomedical domain.</Paragraph>
  </Section>
</Paper>