<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1635">
  <Title>Sydney, July 2006. ©2006 Association for Computational Linguistics. Protein folding and chart parsing</Title>
  <Section position="4" start_page="293" end_page="295" type="metho">
    <SectionTitle>
2 A brief introduction to protein folding
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="293" end_page="293" type="sub_section">
      <SectionTitle>
2.1 Protein structure
</SectionTitle>
      <Paragraph position="0"> The primary structure describes the linear sequence of amino acids that are linked via peptide bonds (and form the backbone of the polypeptide chain). Each amino acid has one side chain which branches off the backbone. Proteins contain twenty different kinds of amino acids, which differ only in the size and chemical properties of their side chains. One important distinction is that between hydrophobic (water-repelling) and hydrophilic (polar) amino acids.</Paragraph>
      <Paragraph position="1"> The secondary structure refers to patterns of local structures such as α-helices or β-sheets, which occur in many different folded structures. These secondary structure elements often assemble into larger domains. The tertiary structure represents the fully folded three-dimensional conformation of a single-chain protein, and typically consists of multiple domains. Since proteins in the cell are surrounded by water, hydrophobic side chains are typically inside this structure and in close contact with each other, forming a hydrophobic core, whereas polar side chains are more likely to be on the surface of this structure. This hydrophobic effect is known to be the main driving force for the folding process.</Paragraph>
      <Paragraph position="2"> Computational models of protein folding often use a very simplified representation of these structures. Ultimately, models which explicitly capture all atoms and their physical interactions are required to study the folding of real proteins. However, since such models often require huge computational resources such as supercomputers or distributed systems, novel search strategies and other general properties of the folding problem are usually first studied with coarse-grained, simplified representations, such as the HP model (Lau and Dill, 1989; Dill et al., 1995) used here.</Paragraph>
    </Section>
    <Section position="2" start_page="293" end_page="294" type="sub_section">
      <SectionTitle>
2.2 Folding and thermodynamics
</SectionTitle>
      <Paragraph position="0"> As first shown by Anfinsen (1973), protein folding is a reversible process: under "denaturing" conditions, proteins typically unfold into a random state (which still preserves the chain connectivity of the primary amino acid sequence), and refold again into their unique native state if the natural folding conditions are restored. Thus, all the information that is necessary to determine the folded structure has to be encoded in the primary sequence. This is analogous to natural language, where the meaning of sentences such as "I drink coffee with milk" vs. "I drink coffee with friends" is also determined by their words.</Paragraph>
      <Paragraph position="1"> Since folding occurs spontaneously, the native state has to be the thermodynamically optimal structure (under folding conditions), i.e. the structure that results in the lowest free energy. The free energy G = H − TS of a system depends on its energy H, its entropy S (the amount of disorder in the system), and the temperature T. A computational model therefore requires an energy function φ : R^n → R, which maps n-dimensional vectors that describe the structure of a polypeptide chain (e.g. in terms of the coordinates of its atoms) to the free energies of the corresponding structures.</Paragraph>
      <Paragraph position="2"> The native state is assumed to be the global minimum of this function. This is again analogous to statistical parsing, where the correct analysis is assumed to be the structure with the highest probability. In the case of proteins, we can use the laws of physics to determine the energy function, whereas in language, the "energies" have to be estimated from corpora.1 The energy H of a single protein structure depends essentially on the interactions (contacts) between side chains and on the bond angles along the backbone, whereas the entropy S also depends on the surrounding solvent (water). It is this impact on S which creates the hydrophobic effect.</Paragraph>
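      <Paragraph> The parallel between energy minimization and statistical parsing can be written out explicitly. The following is only a sketch in notation introduced here for illustration: x stands for a structure vector, t for a candidate analysis of a sentence s, and P for a parsing model; apart from φ, none of these symbols come from the text above.

```latex
\[
  x^{*} = \operatorname*{arg\,min}_{x \in \mathbb{R}^{n}} \varphi(x),
  \qquad
  t^{*} = \operatorname*{arg\,max}_{t} P(t \mid s)
        = \operatorname*{arg\,min}_{t} \bigl(-\log P(t \mid s)\bigr).
\]
```

      Under this reading, a negative log-probability plays the role of an "energy", which is why a lowest-energy search and a highest-probability search can share the same algorithmic machinery.</Paragraph>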
      <Paragraph position="3"> For simplicity's sake, most computational models use an implicit solvent energy function, which captures the hydrophobic effect by assuming that the contact energies between hydrophobic side chains are particularly favorable. Since bond angles alone cannot capture the hydrophobic effect (Dill, 1999), simplified models typically ignore their impact and represent the energy of a conformation only in terms of the side chain contacts. One particularly well-known example is the Miyazawa-Jernigan (1996) energy function, a 20×20 matrix of contact potentials whose parameters are estimated from the Protein Data Bank, a database of experimentally verified protein structures. These simplified energy functions are therefore very similar to the bi-lexical dependency models that are commonly used in statistical parsing.</Paragraph>
      <Paragraph position="4"> It is this similarity between inter-residue contacts and word-word dependencies that grammar-based approaches (Searls, 2002) exploit. The set of contacts for a given structure can be represented as a polymer graph, although often only the edges of this graph are given in the form of a contact map (a triangular matrix whose entry C_ij corresponds to the contact between the i-th and j-th residue).</Paragraph>
      <Paragraph position="5"> The edges in this graph are inherently undirected.</Paragraph>
      <Paragraph position="6"> In α-helices and parallel β-sheets, the edges are crossing. Although grammars that capture the "dependencies" in specific kinds of protein structures have been written (Chiang, 2004), it is at present unclear whether such an approach can be generalized. The difficulty for all approximations to structural representations (grammar-based or otherwise) lies in accounting for excluded volume or steric clashes (the fact that no two amino acids can occupy the same point in space).</Paragraph>
      <Paragraph position="7"> The so-called "New View" of protein folding (Dill and Chan, 1997) assumes that the speed of the folding process can be explained by the shape of the energy landscape (i.e. the surface of the energy function for all possible structures of a given chain). Folding is fastest if the landscape is funnel-shaped (i.e. has no local minima, and there is a direct downward path from all points to the native state). If the energy landscape is rugged (i.e. has many local minima) or golf-course shaped (i.e. all structures except for the native state have the same, high, energy), folding is slow. In the rugged case, energetic barriers slow down the folding process: the chain gets stuck in local minima, or kinetic traps. Such traps correspond to structures that contain "incorrect" (non-native) contacts which have to be broken (thus increasing the energy) before the native state can be reached. In the case of a plateau in the landscape, the search for the native state is slowed down by entropic barriers, i.e. a situation where a large number of equivalent structures with the same energy are accessible. Implicit in the landscape perspective is the assumption that folding is a greedy search, i.e. that local moves in the landscape can successfully identify the global minimum. Not all amino acid sequences have such landscapes, and in fact, most random amino acid sequences are unlikely to fold into a unique structure. This is again similar to language, where random sequences of words are also unlikely to form a grammatical sentence. [Figure 1 caption, partially recovered: "Greek key" β-sheet (1-17) and α-helix (17-24).]</Paragraph>
      <Paragraph position="8"> Computational simulations of the folding process are typically based on Monte Carlo or related techniques. These approaches require an energy function as well as a "move set" (a set of rules which describe how one conformation can be transformed into another). However, since each individual simulation can only capture the folding trajectory of a single chain, many runs are typically required to sample the entire landscape to a sufficient degree.</Paragraph>
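      <Paragraph> To make the roles of the energy function and the move set concrete, the following minimal sketch shows one common Monte Carlo scheme, the Metropolis acceptance rule. The text above does not say which Monte Carlo variant is meant, so the choice of Metropolis is an assumption, and the names `metropolis_step`, `run_trajectory`, `energy` and `propose_move` are hypothetical. The move set is passed in as a function, so that the sketch stays independent of any particular representation.

```python
import math
import random
from typing import Callable, TypeVar

State = TypeVar("State")


def metropolis_step(
    state: State,
    energy: Callable[[State], float],
    propose_move: Callable[[State, random.Random], State],
    temperature: float,
    rng: random.Random,
) -> State:
    """Apply one move from the move set and accept or reject it."""
    candidate = propose_move(state, rng)
    delta = energy(candidate) - energy(state)
    # Downhill and neutral moves are always accepted; an uphill move is
    # accepted with probability exp(-delta / temperature).
    if delta > 0 and rng.random() >= math.exp(-delta / temperature):
        return state
    return candidate


def run_trajectory(
    start: State,
    energy: Callable[[State], float],
    propose_move: Callable[[State, random.Random], State],
    temperature: float,
    n_steps: int,
    seed: int = 0,
) -> State:
    """Simulate a single trajectory; many seeds are needed to sample a landscape."""
    rng = random.Random(seed)
    state = start
    for _ in range(n_steps):
        state = metropolis_step(state, energy, propose_move, temperature, rng)
    return state
```

      At a very low temperature this rule degenerates into a purely greedy descent, which is exactly the regime in which kinetic traps become a problem on a rugged landscape.</Paragraph>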
    </Section>
    <Section position="3" start_page="294" end_page="295" type="sub_section">
      <SectionTitle>
2.3 The HP model
</SectionTitle>
      <Paragraph position="0"> The HP model (Lau and Dill, 1989; Dill et al., 1995) is one of the most simplified protein models.</Paragraph>
      <Paragraph position="1"> Here, proteins are short chains that are placed onto a 2-dimensional square lattice (Figure 1). Each HP sequence consists of two kinds of monomers, hydrophobic (H) and polar (P), and each monomer is represented as a single bead on a lattice site.</Paragraph>
      <Paragraph position="2"> The chain is placed onto the lattice such that each lattice site is occupied by at most one bead, and beads that are adjacent in the sequence are on adjacent lattice sites, so that it forms a self-avoiding walk (SAW) on the lattice. Such lattice models are commonly used in polymer physics, since they capture excluded volume effects, and the properties of such SAWs on different types of lattices are a well-studied problem in combinatorics.</Paragraph>
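      <Paragraph> The two placement constraints just described (no shared lattice site, sequence neighbours on adjacent sites) can be checked directly. This is a minimal sketch, assuming a conformation is stored as a list of integer (x, y) coordinates in sequence order; the representation and the name `is_self_avoiding_walk` are illustrative choices, not taken from the paper.

```python
from typing import List, Tuple

Site = Tuple[int, int]


def is_self_avoiding_walk(coords: List[Site]) -> bool:
    """Return True if coords is a valid chain placement on the 2D square lattice."""
    # Excluded volume: each lattice site may hold at most one bead.
    if len(set(coords)) != len(coords):
        return False
    # Chain connectivity: beads adjacent in the sequence must sit on
    # lattice sites that are exactly one step apart.
    for (x1, y1), (x2, y2) in zip(coords, coords[1:]):
        if abs(x1 - x2) + abs(y1 - y2) != 1:
            return False
    return True
```

      For example, the four-bead square [(0, 0), (1, 0), (1, 1), (0, 1)] passes the check, whereas a chain that steps back onto its own first site does not.</Paragraph>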
      <Paragraph position="3"> Each distinct SAW corresponds to one "conformation", or possible structure. The energy of a conformation is determined by the contacts between two H monomers i and j that are not adjacent in the sequence. Contacts arise if the chain is in a configuration such that monomers i and j occupy adjacent lattice sites.</Paragraph>
      <Paragraph position="5"> Each HH-contact contributes −1 to the energy. The energy E(c) of a conformation c with n HH contacts is therefore −n. We consider only sequences that have a single lowest-energy conformation (native state), since these are the most protein-like.</Paragraph>
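      <Paragraph> The HP energy of a conformation is simply minus its number of HH contacts, where a contact is a pair of H monomers on neighbouring lattice sites that are not neighbours in the sequence. The sketch below computes it, assuming the sequence is a string over the letters H and P and the conformation a list of (x, y) lattice coordinates; the function name `hp_energy` is hypothetical.

```python
from typing import Dict, Sequence, Tuple

Site = Tuple[int, int]


def hp_energy(sequence: str, coords: Sequence[Site]) -> int:
    """Return minus the number of HH contacts of a lattice conformation."""
    occupied: Dict[Site, int] = {site: i for i, site in enumerate(coords)}
    contacts = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Looking only to the right and upwards counts each pair once.
        for neighbour in ((x + 1, y), (x, y + 1)):
            j = occupied.get(neighbour)
            # Sequence neighbours always touch, so they do not count.
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                contacts += 1
    return -contacts
```

      For the sequence HPPH folded into a square, the first and last monomers touch, giving an energy of -1; the same sequence laid out in a straight line has energy 0.</Paragraph>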
      <Paragraph position="6"> All unique-folding sequences up to a length of 25 monomers and their native states are known (Irbäck and Troein, 2002). In our experiments, we will concentrate on the set of all unique-folding HP sequences of length 20, of which there are 24,900. These 20-residue chains have 41,889,578 viable conformations on the 2D lattice.</Paragraph>
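      <Paragraph> The size of the conformation space can be explored by brute force for short chains. The sketch below counts self-avoiding walks up to rotation and reflection, under the assumption (ours, not stated in the text) that "viable conformations" means walks counted modulo these lattice symmetries. It fixes the first step to point east and requires the first turn, if any, to point north, so that each symmetry class is visited once. The name `count_conformations` is hypothetical.

```python
from typing import List, Set, Tuple

Site = Tuple[int, int]


def count_conformations(n_monomers: int) -> int:
    """Count self-avoiding walks of n_monomers beads (at least 2), up to symmetry."""

    def extend(path: List[Site], occupied: Set[Site], turned: bool) -> int:
        if len(path) == n_monomers:
            return 1
        x, y = path[-1]
        total = 0
        for dx, dy in ((1, 0), (0, 1), (-1, 0), (0, -1)):
            # Until the chain has left the x-axis, a step south would only
            # produce the mirror image of a walk whose first turn goes north.
            if not turned and dy == -1:
                continue
            site = (x + dx, y + dy)
            if site in occupied:
                continue
            path.append(site)
            occupied.add(site)
            total += extend(path, occupied, turned or dy != 0)
            path.pop()
            occupied.remove(site)
        return total

    # Fixing the first step removes the four rotations.
    return extend([(0, 0), (1, 0)], {(0, 0), (1, 0)}, False)
```

      Chains of 3, 4 and 5 monomers have 2, 5 and 13 such conformations. With this convention a 20-monomer chain would be expected to give the 41,889,578 conformations quoted above, since the straight chain is the only walk that is its own mirror image, although plain Python needs a long time to enumerate that many walks.</Paragraph>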
      <Paragraph position="7"> Despite its simplicity, the HP model is commonly used to test protein folding algorithms, since it captures essential physical properties of proteins such as chain connectivity and the hydrophobic effect, and since finding the lowest energy conformation is an NP-complete problem (Crescenzi et al., 1998; Berger and Leighton, 1998), as in real proteins.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="295" end_page="297" type="metho">
    <SectionTitle>
3 Folding as hierarchical search
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="295" end_page="295" type="sub_section">
      <SectionTitle>
3.1 Evidence for hierarchical folding
</SectionTitle>
      <Paragraph position="0"> There is substantial evidence in the experimental literature (starting with Crippen (1978) and Rose (1979); but see also Baldwin and Rose (1999a; 1999b)) that the folding process is guided by a hierarchical search strategy, whereby folding begins simultaneously and independently in different parts of the chain, leading initially to the formation of local structures which either grow larger, or assemble with other local structures. Folded protein structures can typically be recursively decomposed, and in many proteins, small, contiguous parts of the chain form near-native structures during early stages of the folding process. On the theoretical side, Dill et al. (1993) demonstrate that local contacts are easiest to form when the chain is unfolded, and facilitate the subsequent formation of less local contacts, leading to a "zipping" effect, where small, local structures grow larger before being assembled.</Paragraph>
    </Section>
    <Section position="2" start_page="295" end_page="297" type="sub_section">
      <SectionTitle>
3.2 Folding routes as trees
</SectionTitle>
      <Paragraph position="0"> Folding routes describe how individual chains move from the unfolded to the native state. If protein folding is a recursive, parallel process, as assumed here, folding routes are trees whose leaf nodes represent substrings of the primary sequence, and whose root represents the folded structure of the entire chain (Figure 2). The nodes in between the leaves and root correspond to chain segments whose length lies between that of the shortest initial segments and the final complete chain. Folding begins independently and simultaneously at each of the leaves, and moves toward the root. Each internal node of a folding route tree represents a set of partially folded conformations of the corresponding chain segment that is found by combining conformations of smaller pieces formed in previous steps.</Paragraph>
      <Paragraph position="1"> Figure 2 also shows that the state of the entire chain at different stages during the folding process is given by a horizontal treecut, a set of nodes whose segments span the entire chain, but do not overlap.</Paragraph>
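      <Paragraph> The defining property of a treecut, that its segments cover the whole chain without overlapping, is easy to state as a check on spans. The sketch below assumes that every node is identified by the inclusive, zero-based range of leaf indices it covers; it verifies only the covering property and does not check that the spans belong to one particular folding route tree. The name `is_treecut` is hypothetical.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int]


def is_treecut(spans: Iterable[Span], n_leaves: int) -> bool:
    """Return True if the spans tile the leaves 0 .. n_leaves - 1 exactly once."""
    expected_start = 0
    for start, end in sorted(spans):
        # Each span must begin where the previous one stopped.
        if start != expected_start or start > end:
            return False
        expected_start = end + 1
    return expected_start == n_leaves
```

      For a chain with five leaves, the spans (0, 1) and (2, 4) form a treecut, while (0, 2) and (2, 4) overlap and (0, 1) and (3, 4) leave a gap.</Paragraph>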
      <Paragraph position="2"> Because we assume that folding routes are trees, contacts between two adjacent segments A and B can only be formed when A and B are combined to form their parent C. Our assumption also implies that in a sequence uvw, contacts between v and w or between v and u have to be formed before or at the same time as contacts between u and w.</Paragraph>
      <Paragraph position="3"> Trees provide a unified representation of the growth and assembly process assumed by hierarchical folding theories: A growth step corresponds to a local tree in which a non-terminal node and a leaf node are combined, whereas an assembly step corresponds to a local tree in which two non-terminal nodes are combined.</Paragraph>
      <Paragraph position="4"> Folding route trees thus play a very different role from the traditional phrase structure trees in natural language, since they represent merely the process by which the desired structure was formed, and not the structure itself. This is more akin to the role of syntactic derivations in formalisms such as CCG (Steedman, 2000): in CCG, syntactic derivation trees do not constitute an autonomous level of representation, but only specify how the semantic interpretation of a sentence is constructed. We will see below that proteins, like sentences in CCG, have a "flexible" constituent structure, with multiple folding routes leading to the native state.</Paragraph>
      <Paragraph position="5"> 4 Protein folding as chart parsing. Here, we show how the CKY algorithm (Kasami, 1965; Younger, 1967) can be adapted to protein folding in the HP model. Although we use a simplified lattice model, our technique is sufficiently general to be applicable to other representations. As in standard CKY, structures for substrings i...j are formed from pairs of previously identified structures for substrings i...k and k+1...j, and, as in standard probabilistic CKY, we use a pruning strategy akin to Viterbi search, and only retain the lowest-energy structures in each cell.</Paragraph>
      <Paragraph position="6"> The complexity of standard CKY is O(n^3 |G|), where n is the length of the input string and |G| the "size" of the grammar. Since we do not have a grammar with a fixed set of nonterminals, which would allow us to compactly represent all possible structures for a given substring, the constant factor |G| is replaced by an exponential factor c^n, representing the number of possible conformations of a chain of length n. Our pruning strategy captures the physical assumption that only locally optimal structures are stable enough not to unfold before further contacts can be made. With a larger set of amino acids and a corresponding energy function, a beam search strategy (with threshold pruning) may be more appropriate. Pruning is an essential part of our algorithm: without it, the algorithm would amount to exhaustive enumeration, repeated O(n^3) times. The chart. Since only HH contacts contribute to the energy of a conformation, the dimensions of the chart are determined by the number of H monomers in the sequence. We segment every HP sequence into h substrings that contain one H each (splitting long substrings of Ps in the middle). For efficiency reasons, non-empty prefixes or suffixes of P monomers (e.g. in sequences of the form PPPH...HP) may also be split off as additional substrings (and are then only combined with the rest of the chain once the substring from the first to the last H monomer has been analyzed).</Paragraph>
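      <Paragraph> The basic segmentation regime, one H per substring with runs of Ps split in the middle, might be implemented as follows. This is a sketch under two assumptions of ours: a run with an odd number of Ps gives the extra P to the substring on its right, and leading or trailing Ps simply stay attached to the first or last substring (the optional splitting of P prefixes and suffixes is not modelled). The name `segment` is hypothetical.

```python
from typing import List


def segment(sequence: str) -> List[str]:
    """Split an HP sequence into substrings that contain one H each."""
    h_positions = [i for i, monomer in enumerate(sequence) if monomer == "H"]
    if not h_positions:
        return [sequence]  # nothing to anchor a split on
    cuts = [0]
    for left, right in zip(h_positions, h_positions[1:]):
        n_ps = right - left - 1
        # The left substring keeps the first half of the P run.
        cuts.append(left + 1 + n_ps // 2)
    cuts.append(len(sequence))
    return [sequence[a:b] for a, b in zip(cuts, cuts[1:])]
```

      For instance, HPPHPH is split into HP, PH and PH, and PPHHP into PPH and HP. The number of substrings, and hence the dimension of the chart, equals the number of H monomers.</Paragraph>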
      <Paragraph position="7"> These substrings correspond to the leaf nodes in the folding trees. Other regimes are also conceivable. Since no adjacent H monomers can form a contact, up to three consecutive Hs may be kept in the same substring. While this typically leads to an increase in efficiency, it comes at a slight cost in accuracy with our current pruning strategy. Long substrings of Ps could also be treated as separate substrings in a manner similar to P prefixes and suffixes. Chart items. The items in our chart represent the lowest-energy conformations that are found for the corresponding substring. Unlike in standard CKY, each cell contains the full set of structures for its substring (which leads to the exponential worst-case behavior observed above). Therefore, the chart does not need to be unpacked to obtain the desired output structure. Backpointers from items in chart[i][j] to pairs of items in chart[i][k] and chart[k+1][j] represent the folding route trees, and thus record the history of the folding process. Each item can have at most j − i pairs of backpointers, since it can only be constructed from one pair of conformations in each pair of cells.</Paragraph>
      <Paragraph position="8"> Initializing the chart. The chart is initialized by filling the cells chart[i][i], which correspond to the i-th substring. Since each initial substring has at most one H, all its conformations are equivalent (and the size of chart[i][i] is thus exponential in the length of its substring). This exhaustive enumeration can be performed off-line.</Paragraph>
      <Paragraph position="9"> Filling the chart. As in standard CKY, the internal cells chart[i][j] are filled by combining the entries of cells chart[i][k] and chart[k+1][j] for i ≤ k &lt; j. Two conformations l ∈ chart[i][k] and r ∈ chart[k+1][j] are combined like two pieces of a jigsaw puzzle where the only constraint is that two pieces may not overlap. That is, we append all (rotational and translational) variants of r to any free site adjacent to the site of l's last monomer, and add all resulting viable conformations c (i.e. those where no two monomers occupy the same lattice site) into chart[i][j].</Paragraph>
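      <Paragraph> The jigsaw-style combination of two conformations can be sketched as below. A conformation is assumed to be a tuple of (x, y) lattice sites in sequence order. Mirror images of the right-hand piece are tried in addition to its four rotations; this is an assumption of ours, since the text mentions only rotational and translational variants. The names `SYMMETRIES` and `combine` are hypothetical.

```python
from typing import Set, Tuple

Site = Tuple[int, int]
Conf = Tuple[Site, ...]

STEPS = ((1, 0), (0, 1), (-1, 0), (0, -1))
# Four rotations of the square lattice, followed by their mirror images.
SYMMETRIES = (
    lambda x, y: (x, y), lambda x, y: (-y, x),
    lambda x, y: (-x, -y), lambda x, y: (y, -x),
    lambda x, y: (x, -y), lambda x, y: (y, x),
    lambda x, y: (-x, y), lambda x, y: (-y, -x),
)


def combine(left: Conf, right: Conf) -> Set[Conf]:
    """Return all viable conformations obtained by appending right to left."""
    occupied = set(left)
    last_x, last_y = left[-1]
    results: Set[Conf] = set()
    for dx, dy in STEPS:
        start = (last_x + dx, last_y + dy)
        if start in occupied:
            continue  # no free site in this direction
        for transform in SYMMETRIES:
            origin_x, origin_y = transform(*right[0])
            # Rotate or mirror the piece, then translate its first
            # monomer onto the chosen free site.
            placed = tuple(
                (transform(x, y)[0] - origin_x + start[0],
                 transform(x, y)[1] - origin_y + start[1])
                for x, y in right
            )
            # The two pieces may not overlap.
            if occupied.isdisjoint(placed):
                results.add(left + placed)
    return results
```

      Combining two straight two-bead pieces in this way yields nine four-bead conformations, which matches the number of three-step self-avoiding walks whose first step is fixed.</Paragraph>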
      <Paragraph position="10"> With our current pruning strategy, only the lowest-energy conformations in each cell are kept.</Paragraph>
      <Paragraph position="11"> CKY terminates when the top cell, chart[1][n], is filled. It has succeeded if the top cell contains an item with only one conformation, the native state.</Paragraph>
      <Paragraph position="12"> Contact maps as node labels. We have also developed a variant of this algorithm where the entries in a cell correspond to contact maps (sets of HH-contacts), and where each entry corresponds in turn to the set of conformations that corresponds to this contact map. Conformations that have the same contact map are assumed to be physically equivalent. While the number of possible contact maps is also exponential in the length of the substring (Vendruscolo et al., 1999), it is obviously much smaller than the number of actual conformations. In our current implementation, the amount of search required is identical in both variants; but in extending this approach beyond the lattice, it may be possible to use a more efficient sampling approach to speed up the combination of conformations in two cells.</Paragraph>
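      <Paragraph> Putting initialization, combination and pruning together gives the following compact sketch of the chart-filling loop. It is an illustration under several assumptions of ours rather than the implementation described here: cells are indexed from 0, backpointers are omitted, conformations are stored as one canonical representative per class of translated, rotated and mirrored copies, and all function names are hypothetical. The input is a list of substrings.

```python
from typing import Dict, List, Set, Tuple

Site = Tuple[int, int]
Conf = Tuple[Site, ...]

STEPS = ((1, 0), (0, 1), (-1, 0), (0, -1))
# The eight symmetries of the square lattice (rotations and mirror images).
SYMMETRIES = (
    lambda x, y: (x, y), lambda x, y: (-y, x),
    lambda x, y: (-x, -y), lambda x, y: (y, -x),
    lambda x, y: (x, -y), lambda x, y: (y, x),
    lambda x, y: (-x, y), lambda x, y: (-y, -x),
)


def energy(seq: str, conf: Conf) -> int:
    """Minus the number of HH contacts between non-consecutive monomers."""
    index = {site: i for i, site in enumerate(conf)}
    contacts = 0
    for i, (x, y) in enumerate(conf):
        for site in ((x + 1, y), (x, y + 1)):  # right and up: each pair once
            j = index.get(site)
            if j is not None and abs(i - j) > 1 and seq[i] == seq[j] == "H":
                contacts += 1
    return -contacts


def canonical(conf: Conf) -> Conf:
    """Pick one representative among all symmetric copies of a conformation."""
    x0, y0 = conf[0]
    return min(tuple(t(x - x0, y - y0) for x, y in conf) for t in SYMMETRIES)


def leaf_conformations(length: int) -> Set[Conf]:
    """Exhaustively enumerate the self-avoiding walks of a leaf substring."""
    walks: Set[Conf] = {((0, 0),)}
    for _ in range(length - 1):
        walks = {w + ((w[-1][0] + dx, w[-1][1] + dy),)
                 for w in walks for dx, dy in STEPS
                 if (w[-1][0] + dx, w[-1][1] + dy) not in w}
    return {canonical(w) for w in walks}


def combine(left: Conf, right: Conf) -> Set[Conf]:
    """Attach every variant of right to a free site next to left's last monomer."""
    occupied = set(left)
    results: Set[Conf] = set()
    for dx, dy in STEPS:
        sx, sy = left[-1][0] + dx, left[-1][1] + dy
        for t in SYMMETRIES:
            ox, oy = t(*right[0])
            placed = tuple((t(x, y)[0] - ox + sx, t(x, y)[1] - oy + sy)
                           for x, y in right)
            # An occupied attachment site also fails this overlap test.
            if occupied.isdisjoint(placed):
                results.add(canonical(left + placed))
    return results


def prune(seq: str, confs: Set[Conf]) -> Set[Conf]:
    """Keep only the lowest-energy conformations of a cell."""
    if not confs:
        return set()
    best = min(energy(seq, c) for c in confs)
    return {c for c in confs if energy(seq, c) == best}


def fold(segments: List[str]) -> Set[Conf]:
    """Fill the chart bottom-up and return the contents of the top cell."""
    n = len(segments)
    chart: Dict[Tuple[int, int], Set[Conf]] = {}
    for i, segment in enumerate(segments):
        chart[i, i] = prune(segment, leaf_conformations(len(segment)))
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            seq = "".join(segments[i:j + 1])
            cell: Set[Conf] = set()
            for k in range(i, j):
                for left in chart[i, k]:
                    for right in chart[k + 1, j]:
                        cell |= combine(left, right)
            chart[i, j] = prune(seq, cell)
    return chart[0, n - 1]
```

      Calling `fold(["HP", "PH"])` returns a single conformation with energy -1, the square in which the two H monomers touch. Because pruning discards every conformation that is not optimal for its own substring, the search is not guaranteed to reach the native state for longer sequences.</Paragraph>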
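      <Paragraph> Grouping the conformations of a cell by their contact maps, as in the variant just described, amounts to building a dictionary keyed on sets of HH contacts. The sketch below assumes lattice conformations given as tuples of (x, y) sites and represents a contact map as a frozen set of index pairs (i, j) with i smaller than j; the names `hh_contact_map` and `group_by_contact_map` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Tuple

Site = Tuple[int, int]
Conf = Tuple[Site, ...]
ContactMap = FrozenSet[Tuple[int, int]]


def hh_contact_map(seq: str, conf: Conf) -> ContactMap:
    """Return the set of HH contacts of a conformation as index pairs."""
    index = {site: i for i, site in enumerate(conf)}
    contacts = set()
    for i, (x, y) in enumerate(conf):
        for site in ((x + 1, y), (x, y + 1)):
            j = index.get(site)
            if j is not None and abs(i - j) > 1 and seq[i] == seq[j] == "H":
                contacts.add((min(i, j), max(i, j)))
    return frozenset(contacts)


def group_by_contact_map(
    seq: str, confs: Iterable[Conf]
) -> Dict[ContactMap, List[Conf]]:
    """Collect conformations that share a contact map under one chart entry."""
    groups: Dict[ContactMap, List[Conf]] = defaultdict(list)
    for conf in confs:
        groups[hh_contact_map(seq, conf)].append(conf)
    return dict(groups)
```

      Since the energy depends only on the number of contacts, every group has a single energy, so pruning can operate on the dictionary keys without inspecting individual conformations.</Paragraph>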
    </Section>
  </Section>
</Paper>