<?xml version="1.0" standalone="yes"?> <Paper uid="W02-0311"> <Title>Utilizing Text Mining Results: The PastaWeb System</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 The PASTA System </SectionTitle> <Paragraph position="0"> The overall aim of the PASTA system is to extract information about the roles of residues in protein molecules, specifically to assist in identifying active sites and binding sites. We do not describe the system in great detail here, as this is described elsewhere (Demetriou et al., 2002).</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 PASTA Extraction Tasks 3.1.1 Terminological Tagging </SectionTitle> <Paragraph position="0"> A key component of PASTA, and of various other IE systems operating in the biomedical domain, is the identification and classification of textual references (terms) to key entity types in the domain. We have identified 12 significant classes of technical terms in the PASTA domain: protein, species, residue, site, region, secondary structure, supersecondary structure, quaternary structure, base, atom (element), non-protein compound, interaction. Guidelines defining the scope of the term classes were written, and an SGML-based markup scheme was specified to allow instances of the term classes to be tagged in texts.</Paragraph> <Paragraph position="1"> The PASTA template conforms to the MUC template specification and is object oriented. Slot fillers are of three types: (1) string fill - a string excised directly from the text (e.g. Pseudomonas cepacia); (2) set fill - a normalised form selected from a predefined set (e.g. the expressions Ser or serine are mapped to SERINE, one of a set of normalised forms that represent the 20 standard amino acids); (3) pointer fill - a pointer to another template object, used, e.g., for indicating relations between objects. 
To meet the objectives of PASTA, three template elements and two template relations were identified. The elements are RESIDUE, PROTEIN and SPECIES; the two relations are IN PROTEIN, holding between a residue and the protein in which it occurs, and IN SPECIES, holding between a protein and the species in which it occurs.</Paragraph> <Paragraph position="2"> An example of a template produced by PASTA for a Medline abstract is shown in Figure 1, which illustrates the three template element objects and two template relation objects. As can be seen from the figure, the &lt;RESIDUE&gt; template object contains slots for the residue name and the residue number in the sequence (NO). Secondary and quaternary structural arrangements of the part of the structure in which the residue is found are stored in the SEC STRUCT and QUAT STRUCT slots respectively. The SITE/FUNCTION slot is filled with widely recognizable descriptions that indicate that this residue is important for the structure's activation (e.g. active-site) or functional characteristics (e.g. catalytic). The REGION slot describes the more general geographical areas of the structure (e.g. lid) in which this particular residue is found. The INTERACTION slot captures textual references to hydrogen bonds, disulphide bonds or other types of atomic contacts. 
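As a rough illustration of the template just described, the three template elements, the two template relations and the three slot-fill types might be modelled as below. This is a minimal sketch and not the PASTA implementation: the class and field names, the normalisation table and the example values (protein name and residue number) are assumptions introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed normalisation table for set fills; only two of the 20 standard
# amino acids are listed here.
RESIDUE_FORMS = {"ser": "SERINE", "serine": "SERINE",
                 "cys": "CYSTEINE", "cysteine": "CYSTEINE"}


def normalise_residue(expression: str) -> str:
    """Set fill: map a textual variant to its normalised form."""
    return RESIDUE_FORMS[expression.lower()]


@dataclass
class Species:
    name: str  # string fill, excised directly from the text


@dataclass
class Protein:
    name: str  # string fill


@dataclass
class Residue:
    name: str                                   # set fill, e.g. SERINE
    no: Optional[int] = None                    # residue number in the sequence
    site_function: List[str] = field(default_factory=list)
    sec_struct: Optional[str] = None
    quat_struct: Optional[str] = None
    region: Optional[str] = None
    interaction: List[str] = field(default_factory=list)


@dataclass
class InProtein:
    residue: Residue  # pointer fill
    protein: Protein  # pointer fill


@dataclass
class InSpecies:
    protein: Protein  # pointer fill
    species: Species  # pointer fill


# Illustrative values only; the protein name and residue number are made up.
species = Species(name="Pseudomonas cepacia")
protein = Protein(name="lipase")
residue = Residue(name=normalise_residue("Ser"), no=87,
                  site_function=["active-site", "catalytic"], region="lid")
relations = [InProtein(residue, protein), InSpecies(protein, species)]
```

Keeping the relations as separate objects that hold pointers, rather than as fields on the elements, mirrors the distinction the text draws between template element objects and template relation objects.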
At present, the only attributes extracted for protein and species objects are their names.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 System Architecture </SectionTitle> <Paragraph position="0"> The PASTA system has been adapted from an IE system called LaSIE (Large Scale Information Extraction), originally developed for participation in the MUC competitions (Humphreys et al., 1998).</Paragraph> <Paragraph position="1"> The PASTA system is a pipeline of processing components that perform the following major tasks: text preprocessing, terminological processing, syntactic and semantic analysis, discourse interpretation, and template extraction.</Paragraph> <Paragraph position="2"> Text Preprocessing The text preprocessing phase performs low-level text processing tasks, including the analysis of the structure of the MEDLINE abstracts in terms of separate sections (e.g. the title, author names, abstract etc.), tokenisation and sentence boundary identification. With respect to tokenisation, tokens are identified at the subword level, resulting in the splitting of biochemical compound terms into their constituents, which need to be matched separately during the lexical lookup phase. For example, the term Cys128 is split into the three-letter residue abbreviation Cys and the numeral 128.</Paragraph> <Paragraph position="3"> Terminological Processing The aim of the 3-stage terminological processing phase is to identify and correctly classify instances of the term classes described above in section 3.1.1. 
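The sub-word tokenisation described under text preprocessing, on which these stages rely, can be sketched as follows. The regular expression is an assumption chosen to reproduce the Cys128 example; the actual PASTA tokeniser is not specified in this section.

```python
import re

# Assumed pattern: a run of letters, a run of digits, or any single other
# non-space character.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9\s]")


def tokenise(text):
    """Split text into sub-word tokens so that letters and digits separate."""
    return TOKEN_PATTERN.findall(text)


print(tokenise("Cys128"))  # ['Cys', '128']
```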
During the morphological analysis stage individual tokens are analysed to see if they contain interesting biochemical affixes such as -ase or -in that indicate candidate protein names.</Paragraph> <Paragraph position="4"> During the lexical lookup stage the previously tokenised terms are matched against terminological lexicons which have been compiled from biological databases such as CATH and SCOP and have been augmented with terms produced by corpus processing techniques (Demetriou and Gaizauskas, 2000). Additional subcategorisation information is provided for multi-token terms by splitting the terms into their constituents and placing the constituents into subclasses whose combination is determined by [...] Finally, in a terminology parsing stage, a rule-based parser is used to analyse the tokenisation information and the morphological and lexical properties of component terms and to combine them into a single multi-token unit.</Paragraph> <Paragraph position="5"> Syntactic and Semantic Processing Terms classified during the previous stages (proteins, species, residues etc.) are passed to the syntactic processing modules as non-decomposable noun phrases and a part-of-speech tagger assigns syntactic labels to the remaining text tokens. With the application of phrasal grammar rules, the phrase structure of each sentence is derived and this is used to build a semantic representation via compositional semantic rules. During the discourse processing stage, the semantic representation of each sentence is added to a predefined domain model which provides a conceptualisation of the knowledge of the domain. The domain model consists of a concept hierarchy (ontology) together with inheritable properties and inference rules for the concepts. 
Instances of concepts are gradually added to the hierarchy in order to construct a complete discourse model of the input text.</Paragraph> <Paragraph position="6"> Template Extraction A template writing module scans the final discourse model for any instances that are relevant to the template filling task, ensures that it has all necessary information to generate a template and fill its slots, and then formats and outputs the templates.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Development and Evaluation </SectionTitle> <Paragraph position="0"> Following standard IE system development methodology, a corpus of texts relevant to the study of protein structure was assembled. The corpus consists of 1513 Medline abstracts from 20 major scientific journals that publish new macromolecular structures. Of these abstracts, 113 were manually tagged for the 12 term classes mentioned above and 55 had associated templates filled manually. These annotated data were divided into distinct training and test sets.</Paragraph> <Paragraph position="1"> The corpus and annotated data assisted in the refinement of the extraction task definitions, supported system development and permitted final blind evaluation of the system. Detailed results of the evaluation, for each term class, and for each slot in the templates, can be found in Demetriou et al. (2002).</Paragraph> <Paragraph position="2"> In Table 1 we present the summary totals for the development corpus, the unseen final evaluation corpus (Blind) and the human interannotator agreement where one annotator is taken to be the gold standard and the other scored against him/her. 
The evaluation metrics are the well known measures of precision and recall.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 The PastaWeb Interface </SectionTitle> <Paragraph position="0"> The PASTAWeb interface is aimed at providing quick access and navigation facilities through the database of the PASTA tagged texts and their associated templates. PASTAWeb has borrowed ideas from the interface component of the TRESTLE system (Gaizauskas et al., 2001), developed to support information workers in the pharmaceutical industry. Key characteristics of PASTAWeb are the seamless integration between the PASTA IE results and WWW-based browsing technology, the dynamic generation of WWW pages from &quot;static&quot; content and the fusion of information relating to proteins and amino acid residues when found in different sources.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 PASTAWeb Architecture </SectionTitle> <Paragraph position="0"> The PASTAWeb architecture is illustrated in Fig 2.</Paragraph> <Paragraph position="1"> Initially, MEDLINE abstracts are fed through the PASTA IE system which produces two kinds of output: (i) texts annotated with SGML tags describing term class information for proteins, residues, species and regions, and (ii) templates which are used as the main stores of information about residues, including relational information between proteins and residues and between proteins and species.</Paragraph> <Paragraph position="2"> Once PASTA has run, a separate indexing process creates three indices. The first associates with each processed document the terminology-tagged version of the text and any templates extracted from the text. The second is a relational table between each document and each of the instances of the main term classes (i.e. proteins, residues or species) mentioned in the document. 
This index also points to the title of the document, because the title can provide vital clues about the content of the text.</Paragraph> <Paragraph position="3"> The final index is used to assist the 'fusion' of the information in templates generated from multiple texts for the same protein. This index provides information about those proteins for which there are templates generated from multiple documents. Due to variations in the expression of the same protein name from text to text, the identification of suitable templates for fusion is not trivial. The problem of matching variant expressions of the same term in different databases is a well known problem in bioinformatics. The current implementation of the indexing addresses this problem using simple heuristic rules. Simply put, two templates are considered suitable for fusion if the protein names either match exactly (ignoring case) or they include the same &quot;head term&quot;. The applicability of the heuristic for finding a &quot;head term&quot; is limited to constituent terms ending in -ase or -in (to exclude common words, such as &quot;protein&quot;, &quot;domain&quot; etc.). For example, the protein terms &quot;scorpion toxin&quot;, &quot;diphtheria toxin&quot; and &quot;toxins&quot; would match with each other because they all include the head term &quot;toxin&quot;. Consequently, the corresponding template information about the residues occurring in these proteins would be merged into a single table, though information about which slot fillers belong to which term variant is retained.</Paragraph> <Paragraph position="4"> The decision to do the matching of variant names at the index level and not at the interface level is simply due to operational issues. Matching the protein names from multiple texts involves pairwise string comparisons between all proteins in the PASTA templates. 
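The fusion heuristic and the pairwise comparison it entails can be sketched as follows. This is an illustration under stated assumptions rather than the PASTAWeb code: the function names are hypothetical, the stop list of common words, the stripping of a plural -s and the choice of the last qualifying constituent as the head term are all assumptions made so that the examples given above behave as described.

```python
from itertools import combinations

AFFIXES = ("ase", "in")
# Assumed stop list for common words that carry one of the affixes.
COMMON_WORDS = {"protein", "domain"}


def head_term(name):
    """Return the last constituent ending in -ase or -in, or None."""
    for token in reversed(name.lower().split()):
        # Assumed plural handling, so that "toxins" yields "toxin".
        if token.endswith("s"):
            token = token[:-1]
        if token.endswith(AFFIXES) and token not in COMMON_WORDS:
            return token
    return None


def can_fuse(name_a, name_b):
    """Exact match ignoring case, or a shared head term."""
    if name_a.lower() == name_b.lower():
        return True
    head_a, head_b = head_term(name_a), head_term(name_b)
    return head_a is not None and head_a == head_b


def fusion_candidates(names):
    """Compare every pair of names: n * (n - 1) / 2 comparisons for n names."""
    return [(a, b) for a, b in combinations(names, 2) if can_fuse(a, b)]
```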
The number of these comparisons grows quadratically with the number of protein names as new texts and templates are added to the database, and it was found that this causes considerable delay to the operation of the PASTAWeb interface.</Paragraph> <Paragraph position="5"> Since information seeking tasks of molecular biologists may require complex navigation capabilities, the storing of the results in &quot;static&quot; HTML pages would have been unsuitable both practically (more difficult to implement pointers between different pieces of information and to alter and maintain pages) and economically (requires more disk space). We therefore opted for a dynamic page creator that is triggered by the users' requests expressed as choices over hypertext links. The dynamic page creator compiles the information from the indices and the associated databases (texts and templates) and sends the results to the WWW browser via a Web server. In the dynamically created pages, each hypertext link encodes the current frame, the information to be displayed when the link is selected, and the frame in which this information is to be displayed. For example, the hypertext link for a title of a document encodes information about the document id of the document as well as about the target frame in which the text will be displayed. Clicking on this link expresses a request to PASTAWeb for displaying that particular text in the target frame. The whole operation of PASTAWeb loosely resembles the operation of a finite-state automaton.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Interface Overview </SectionTitle> <Paragraph position="0"> PASTAWeb offers a number of ways of accessing the protein structure information extracted by PASTA. As shown in Fig 3, the interface layout can be split into four main areas (frames). 
On the left side of the page we find the &quot;Access Frame&quot; which allows the user to select amongst text access options.</Paragraph> <Paragraph position="1"> These options include browsing the contents of the text databases via either the protein, the residue or the species indices or via a text search option over these indexed terms.</Paragraph> <Paragraph position="2"> The right hand side of the screen is split into three frames. The top frame, the so-called &quot;Header Frame&quot; (see Fig 3), is used to generate an alphabetical index for protein or species names whenever the user has chosen the protein or species access modes for navigation. For residues, rather than an alphabetical index, a list of residue names is displayed in the &quot;Header Frame&quot;. This is because while the number of protein names and their variants is probably indeterminable, the number of residues remains constant (i.e. the 20 standard amino acids).</Paragraph> <Paragraph position="3"> Just below the &quot;Header Frame&quot; is the &quot;Document Index Frame&quot; which initially serves to display the automatically generated indices together with document information. The &quot;Index Frame&quot; is split into two columns, the left of which is used to present an alphabetically sorted list of the chosen type of index (i.e. protein, residue, species). The right column occupies more space because it displays the list of corresponding document titles (as extracted by the PASTA IE system). These titles are presented as clickable hyperlinks to the full texts, each of which can be displayed in the &quot;Tagged Text Frame&quot; below. 
A second use of the &quot;Index Frame&quot; is for displaying template results, explained in more detail below.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Term Access to Texts </SectionTitle> <Paragraph position="0"> A typical interaction session with PASTAWeb requires the user to select one of the three term categories in the &quot;Access Frame&quot;, i.e. proteins, residues or species. The &quot;Header Frame&quot; then displays a list of alphabetical indices (for proteins and species) or a list of residue names. Selecting any of these indices, e.g. &quot;M&quot; for proteins, activates the dynamic generation of a list of protein terms that are indexed by &quot;M&quot; (on the left of the &quot;Index Frame&quot;) and their corresponding document titles (on the right). Different font colours are used to distinguish between the two different kinds of information.</Paragraph> <Paragraph position="1"> The selection of any of the title links causes the system to dynamically transform the PASTA-tagged text from SGML to HTML and display it in the bottom &quot;Tagged Text Frame&quot; with the recognised term types highlighted in different colours. The colour index for the term categories can be viewed in a frame just below the &quot;Access Frame&quot; (the &quot;Colour Index Frame&quot;). 
Each tagged protein, species or residue term is itself a hyperlink which can be used to dynamically fetch the indices of the texts in which this term occurs and display them in the &quot;Index Frame&quot;.</Paragraph> <Paragraph position="2"> Using this functionality, the user can navigate through a succession of texts following a single term, or at any point branch off this chain by selecting another term and following its occurrences in the text collection.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.4 Web-based Access to Templates </SectionTitle> <Paragraph position="0"> Unfortunately, although the type of object-oriented template produced by PASTA (Fig 1) is an efficient data structure for storing complex information, it is not suitable for displaying to end-users. For this reason, the templates are dynamically converted to a format that can be readily accommodated within the screen layout while being at the same time easily accessible. The format chosen for displaying the PASTA templates is tabular and is implemented as an HTML table (see background picture in Fig 3).</Paragraph> <Paragraph position="1"> Access to the templates produced by PASTA is facilitated by special template &quot;icons&quot; or &quot;flags&quot; which are displayed next to text titles or protein terms in the &quot;Index Frame&quot;.</Paragraph> <Paragraph position="2"> When a &quot;single&quot; template icon is displayed to the right of a title, this serves to flag that a template for this text is available and can be accessed by clicking on the icon. On the other hand, when a &quot;double&quot; template icon is displayed next to a protein name in the left column of the &quot;Index Frame&quot;, this indicates that there are multiple templates (i.e. templates extracted from different texts) for this protein. 
Clicking on either of these icons will trigger PASTAWeb to scan the corresponding object-oriented templates, analyse their structures and convert them into tabular format. In the case of fused templates the information is assimilated into a single template. The template information is then displayed in the &quot;Index Frame&quot; together with a hyperlink to the title of the original text which, when selected, displays the (tagged) text in the &quot;Tagged Text Frame&quot;. This enables the user to retrieve more detailed information from the text if needed, or to inspect and verify the correctness of the extracted information.</Paragraph> <Paragraph position="3"> PASTAWeb offers a simple and easy-to-use mechanism for the tracking of information for a specific entity from text to text, but can also assist in the linking of information between different entities in multiple documents. Starting with a specific protein in mind, for example, a molecular biologist may want to investigate structural similarities between that and other proteins with respect to what has been described in the literature.</Paragraph> </Section> </Section> </Paper>