ECCC-3 Contents Page THEOCHEM Home Page Elsevier Chemistry Home Page

A Priori and a Posteriori Mixture Distributions for Using Databases in Protein Structure Determination

Stanley L. Sclove and Simon A. Sherman

University of Illinois at Chicago and University of Nebraska Medical Center
E-mail: slsclove@uic.edu and ssherm@mail.unmc.edu

Keywords: protein local structure determination; cluster analysis; statistical finite mixture model; a priori and a posteriori distributions

Abstract:

The twists and turns of protein molecules correspond to the rotational angles of the main and side chains in their constituent amino acid residues. Due to stereochemical restrictions the joint distribution of these angles falls into several well-separated clusters. Consequently, we fit a statistical finite mixture model to data from the Brookhaven Protein Data Bank (PDB), obtaining a new classification of conformational states of amino acids. The results of this database modeling are important for knowledge-based approaches to protein structure determination and analysis and can be used in conjunction with experimental data in the determination of protein spatial structure. As part of this process, the mixture distributions we have fit can be used as a priori distributions. A posteriori distributions corresponding to these a priori mixture distributions are characterized and applications of the distributions in the build-up and refinement of protein structure determinations are illustrated.


0. Introduction and Summary

This paper is directed toward the further use of statistical methods in the study of protein structure. It is a companion to the papers [7] and [11] in this conference (earlier work is [14]) and discusses the procedures from the viewpoint of mathematical statistics.

First we present our work on statistical description of the dihedral angle values in the PDB by means of finite mixture model cluster analysis. Then we discuss statistical methods for applying this description to the assessment and refinement of alternative hypothesized protein structures. Refinement involves the use of the statistical mixture model as a prior distribution for use in Bayesian classification and estimation. The nature of the corresponding posterior distribution is discussed.

1. Background

A protein is a directed sequence ("chain")

R1, R2, ..., Ri, ..., Rn

of amino-acid residues, a residue being the combining form of a molecule, i.e., the form it takes as part of a macromolecule. Each Ri (i = 1, 2, ...,n residues) is one of twenty amino acids, Alanine (Ala), Arginine (Arg), Asparagine (Asn), Aspartate (Asp), Cysteine (Cys), Glutamate (Glu), Glutamine (Gln), Glycine (Gly), Histidine (His), Isoleucine (Ile), Leucine (Leu), Lysine (Lys), Methionine (Met), Phenylalnine (Phe), Proline (Pro), Serine (Ser), Threonine (Thr), Tryptophan (Trp), Tyrosine (Tyr), Valine (Val).

Amino acids. An amino acid typically consists of a main chain and a side chain. The main chain or "backbone" is formed by two successive single bonds, nitrogen to carbon to carbon (N-C-C). The first carbon, the alpha carbon, is denoted by Ca; the second, the carbonyl carbon, C' .. Let the index i range over the residues in a sequence; i = 1, 2,..., n. One can consider successive residues

Ri-1, Ri, Ri+1,

in terms of

{Ni-1, Ci-1a, Ci-1'}, {Ni, Cia, Ci'}, {Ni+1, Ci+1a, Ci+1')}.

All but one of the amino acids (Gly) have side chains. The individual amino acids differ according to these side chains. The side chain is linked to Ca and begins with a carbon atom, denoted by Cb, typically followed by a gamma atom, delta atom, etc. Thus we may extend our description of the i-th residue Ri to {Ni, Cia, Ci', Cib}.

Dihedral angles. In general, a dihedral angle is an angle between planes. Given a system of four successive atoms A-B-C-D, a dihedral angle describes the relationship between the C-D and A-B bonds. This angle is the one between the plane determined by the bonds B-C and C-D (i.e., by the three atoms B, C and D) and the plane determined by the bonds A-B and B-C (i.e., by the atoms A, B, C). (The dihedral angles are also called torsion angles or rotational angles.) The angle phi is the dihedral angle of the system with A=Ci-1', B=Ni, C=Cia, D=Ci'. The angle psi is the angle of the system with A=Ni, B=Cia, C=Ci', D=Ni+1.

All but two of the amino acids (Ala and Gly) have a side chain with one or more angles of rotation. Pro has a side chain but its rotational angles are fixed. The side-chain dihedral angle chi1 is defined as the angle between the plane determined by N, Ca and Cb and that determined by Ca , Cb and the gamma atom. The angle chi1 is the first rotational angle in the side chain; successive such angles chi2, chi3, etc., are defined. Here we are concerned only with chi1. The three bivariate distributions are described to some extent in the literature, but the trivariate distribution of phi, psi and chi1 is of interest and is a subject of our investigation. It is known that the distribution of chi1 is tri-modal, with modes at about -60, +60 and 180 degrees. (These modes -- "rotamers" -- are sometimes denoted by g-, g+ and t, but the notation is not consistent across all amino acids.) This tri-modality is illustrated in Fig. 1, a dotplot of chi1 for Val.

 
                                                              :
                                                              :
                                                            .::
                                                            :::
                                                            :::
                     .                                     ::::
                    ::                                   .:::::.
                . .::::...            ....:......       .:::::::..
               -------+---------+---------+---------+---------+---------
                    -60         0        60        120       180
                                       chi1

  Fig. 1.  Dot plot of chi1 for Val. (Each dot represents 10 points.)

Database. We analyze the conformations of 58 inhomologous proteins of high resolution (2 Å or better) from the PDB, which includes X-ray measurements of proteins in the crystalline state. Our dataset consisted of several thousand observations from this databank. Each case in the dataset corresponds to a particular amino acid in a particular position in the sequence of a particular protein. The aim of the phase of our research reported here is the statistical modeling and description of this database of angular values, to see how they cluster and how the clusters vary across amino acids. The result is a characterization, in terms of the statistical finite mixture model, of the joint distributions, for data in the PDB, of phi, psi and chi1 for the amino acids with side chains, and the joint distribution of phi and psi for the others. Dunbrack & Karplus [3,4,5] have also done extensive work to describe these distributions. In contrast to their work, which produces descriptive statistics for these data, we perform probabilistic modeling of the data, using the statistical finite mixture model to fit the clustered trivariate distribution of angular values.

Motivating applications. The description of the database is interesting and useful in its own right. Also, the resulting parameter estimates are used in the determination of protein strucuture, as follows. For proteins in solution NMR is used to obtain estimates of inter-proton distances; from these, estimates of the dihedral angles are obtained. These estimates are not very precise. We supplement them with the additional information from the modeling of the databank. The finite-mixture distribution fit to the X-ray data from the PDB is used as an a priori ("prior") distribution. This procedure defines a number of clusters (states) for the dihedral angles. Cluster means can be combined with the NMR measurements to provide improved a posteriori ("posterior") estimates. Description of the procedure and its computer implementation is given in the companion papers [7,11]. Mathematical description of the procedure is given later in this paper.

The fitting of distributions to the angular data in the PDB by finite mixture model cluster analysis is described next.

2. Finite Mixture Model Cluster Analysis: the Model and Method

The finite mixture model [8,15] was used to fit the clusters of points in angular space. The probability density function (p.d.f.) of a finite mixture distribution is of the form

f(y) = p1 f1(y) + p2 f2(y) + . . . + pc fc(y) + ... pCfC(y) ,

where the mixing probabilities pc, c = 1,2,...., C are positive and sum to 1 and the p.d.f.'s fc(y) are the components of the mixture. Often the components are assumed to belong to a specified parametric family, e.g., Gaussian ("normal"). The integer C is the number of components. The number of components C can be greater than the number of clusters (in the sense of modes) because two component densities close together might result in only one mode.

Fitting the model involves determining C and estimating the mixing probabilities and the distributional parameters.

In our case the variable y is the vector of the three angles phi, psi, chi1. (Later we shall refer to this vector as A, for "angles".) In our work to date, the component distributions are taken to be trivariate Gaussian with different covariance matrices as well as different means. Different covariance matrices are used because it is clear from steric contour plots (Ramachandran diagrams) and data plots that the clusters have different orientations. In two dimensions, a cluster running from southwest to northeast is indicative of positive correlation; from northwest to southeast, negative correlation. As an example, a plot of psi vs. phi for Val is shown in Fig. 2.

The E-M algorithm [2,8,15,16] was used for estimation. It is an iterative, maximum likelihood estimation procedure. In the finite-mixture model situation, it starts with preliminary estimates of the distributional parameters, computes posterior probabilities based on these, uses the posterior probabilities to update the estimates of the distributional parameters, etc.


    180+         2   * *
       -      *376432   2***
       -   *  3*895+72 32332*
       -     22567+66845532423*
       -    ***45+++++696843833                    *  2
    120+     * *3++6+++8562*2**
       -        *6656436* 2 * *
       -    * *   **   * 2
       -                   *
       -
     60+
       -            *                                  *
       -        *    *                           2
       -            2
       -           2** 2*
 psi 0+         * * 22* *2   *
       -            242 *** 3*
       -        *  *23** *37733
       -              **228+++9*
       -         *  **2**259+++4
    -60+           ***   *2  342
       -              *        *
       -
       -
       -             *
   -120+
       -
       -
       -             *     *
       -       *
   -180+        **     *
         +---------+---------+---------+---------+------
       -180      -120      -60         0        60
                                phi
Fig. 2.  Scatterplot of  psi vs. phi for Val.  Numbers are
counts at each point (* means 1,  + means >9).

When a mixture of C distributions is to be fit, the parameters to be estimated are the C mixing probabilities and the distributional parameters for the C component distributions. When the components are multivariate Gaussian with p variables, the distributional parameters for each component are p means, p standard deviations and p(p-1)/2 correlations. In our case there are three variables, phi, psi and chi1, so there are 3 means, 3 standard deviations and 3 correlations: phi with psi, phi with chi1, and psi with chi1.

The procedure used is iterative and hence requires initial values. Initial values for the means were suggested by the Ramachandran diagrams and from earlier work [9,12,13] . Ramachandran diagrams are plots in the (phi,psi)-plane showing which (phi,psi)-pairs are stericallly allowed. A (phi,psi)-pair is disallowed if it would cause some pair of atoms to become too close to one another. (See, e.g., [1], pp. 258-260.)

The initial values of the standard deviations were set at 10 degrees for all three angles; the correlations were initialized at 0. To start, the values of the mixing probabilities were taken as equal, i.e., all equal to 1/C.

3. Data Pre-processing

Data coding. The data are angular. A recorded value of -179 is the same as 181 even though their numerical values are far apart. Consequently, some folding was necessary.

It is known that chi1 has three modes, near -60, +60, and +180 (= -180) degrees. Values from -360 to about -120 were folded over to the corresponding positive values.

Further data coding. An interesting approach to the problem of angular data, not used here because it doubles the number of variables, is to use trigonometric coordinates, representing an angle A by the pair (cos A, sin A). This would have resulted in six, rather than three, variables, nonlinearly related in pairs, so we did not opt for this procedure. Rather, since clustering was to be done anyway, the approach used was first to fit the clusters and then combine coinciding clusters by suitably recoding the data in one of them. E.g., if there were a cluster C1 centered at (phi,psi,chi1) = (170, 180, 60), say, and another C2 centered at (-190, -180, 60), i.e., the same centers coded differently, then the points in C2 would be recoded according to phi' = phi + 360, psi' = psi + 360, chi1' = chi1. Comparison of Figure 3 with Figure 2 illustrates this; the cluster at the upper left in Figure 3 is formed from clusters at the upper left and lower left in Figure 2; the bottom left of Figure 2 becomes the top left in Figure 3. There is a number of clusters present; in fact, we fit a model with sixteen clusters for the combined dataset for all seventeen amino acids with variable side-chain angles.

       -             *     *
       -       *
    180+        *3   * 2
       -      *376432   2***
       -   *  3*895+72 32332*
       -     22567+66845532423*
       -    ***45+++++696843833                    *  2
    120+     * *3++6+++8562*2**
       -        *6656436* 2 * *
       -    * *   **   * 2
       -                   *
       -
     60+
       -            *                                  *
       -        *    *                           2
       -            2
       -           2** 2*
 psi  0+         * * 22* *2   *
       -            242 *** 3*
       -        *  *23** *37733
       -              **228+++9*
       -         *  **2**259+++4
    -60+           ***   *2  342
       -              *        *
       -
       -
       -             *
   -120+
       -
       -
         +---------+---------+---------+---------+------
       -180      -120      -60         0        60
                                phi

Fig. 3.  Scatterplot of  psi vs. phi  for Val.
Numbers are counts at each point (* means 1,  + means >9).  Preceding
figure with portion near -180 moved to +180 to put together a cluster.

There are different conventions in recording the data for the various amino acids. E.g., 120 degrees must be added to chi1 measurements for Ile and Thr to make them comparable to those for the other amino acids.

4. Results of the Cluster Analysis

Mixture distributions were fit to the data for the 20 amino acids.

The computer programs used, implementing the E-M algorithm for multivariate Gaussian component distributions, were programmed by Sclove [10]. Similar programs are available elsewhere also (see, e.g., [8] ). The estimate of the mean vector for a given cluster is a weighted average of all n observations, the weights being proportional to the estimate at that stage of the probability that the observation arose from that cluster, i.e., the conditional probability of cluster c, given observation i. The estimate of the covariance matrix for a given cluster can be similarly interpreted.

The clusters obtained admit interpretations in terms of six (phi,psi) configurations constituting structures (called B, P, R T, L, and E), taken in combination with the three modes for chi1. Our B denotes beta-structure; P, polyproline II helix; R, righthand alpha-helix; L, lefthand alpha-helix; T, twisted transition state, E, extended transition state. These six, taken in combination with the three rotamers of chi1, would give 18 clusters, except two of these (L and E) cannot (or at least are very unlikely to) exist with chi1 at +60. Within the six configurations there is a consistent pattern of (phi,psi) locations corresponding to the three chi1 states, with +60 to the upper left, 180 to the lower right and -60 between these. See Figure 4.

          -
          -      B3 B1    P3
          -               P1
          -                   P2
       120+           B2E1
          -                E2
          -
          -
          -
        60+
          -
          -                                        L1L2
          -
          -            T3
   psi   0+
          -               T1
          -                   R3
          -                T2 R1
          -                    R2
       -60+
            +---------+---------+---------+---------+--
         -180      -120       -60         0        60
                               phi

       Fig. 4.  The 16 cluster means.   See columns 3 and 4 of Table 1 for
       the coordinates plotted (cluster mean values of phi and psi).
       Code for the corresonding regular structures:--  B: beta-structure;
       P:  polyproline II helix;  E:  extended transition state; R: righthand
       alpha-helix;  T: twisted transition state;  L: lefthand alpha-helix.
       Code for  chi1:  1 means near -60;  2,  near +180;  3, near +60.
       Example: "L2" indicates that the mean of that cluster has values
       of phi and psi approximating those for a lefthand alpha-helix
       and chi1 is near 180 degrees.

The results include estimates of the 9 distributional parameters (means and variances of the 3 angles and covariances between the 3 pairs of angles) for each component distribution.

The probabilities across the 16 clusters for the combined data for the seventeen amino acids with variable side-chain angles, as well as means, standard deviations and correlations, are given in Table 1. A negative correlation means that the corresponding elliptical cluster runs from northwest to southeast; a positive correlation, that the cluster runs from southwest to northeast.

TABLE 1.  Probabilities, means (in degrees), standard deviations (in
degrees) and correlations for the combined data of the 17 amino acids
with variable side-chain angles (n = 6730)
_______________________________________________________________________

                    Means         Std.Devs.            Correlations                      ________________  _____________  _______________________
                                                  phi,  phi,   psi,
Cluster Prob    phi  psi  chi1    phi psi chi1    psi   chi1   chi1
_______________________________________________________________________

   B1   .062   -129  153   -61     10  14    9   -0.2  -0.1    0.1
   B2   .160   -114  125   180     21  14    9   -0.2   0.0   -0.1
   B3   .051   -146  160    64     13  14   10   -0.3   0.1    0.0

   P1   .105    -89  148   -63     20  15   11   -0.1  -0.2    0.2
   P2   .038    -67  133   180     11   8    9    0.0   0.1   -0.1
   P3   .025    -91  157    66     21  19   10   -0.2   0.0    0.1

   E1   .029   -111  115   -61     16  26   10   -0.5  -0.1    0.2
   E2   .010    -84  103   185      8  19    7    0.1   0.3   -0.4

   R1   .124    -64  -37   -69      6   9    9   -0.4   0.0    0.0
   R2   .142    -62  -44   178      7   6   10   -0.3   0.0    0.0
   R3   .047    -69  -22    67     11  13   11   -0.7   0.1    0.0

   T1   .139    -92  -12   -63     19  22   10   -0.5  -0.1    0.0
   T2   .029    -84  -39   179     23  21   11    0.1   0.1   -0.1
   T3   .022   -109    8    63     21  30   11   -0.2   0.2   -0.1

   L1   .017     59   36   -62      8  13   11   -0.6   0.0    0.0
   L2   .002     65   39   190     18  11    6   -0.4   0.0   -0.2
_______________________________________________________________________

The estimates of the mixing probabilities in the finite-mixture model differ across amino-acid residues; i.e., different amino-acid residues have different propensities for the 16 alternative states. This can be seen from a cross-tabulation (not shown here) of amino acid by cluster, the row- percentages of which are the mixing probabilities for the corresponding amino acid across the final 16 component distributions.

For the three amino acids not having a variable side-chain angle (Ala, Gly, Pro) we fit the bivariate mixture distribution of (phi,psi). See Tables 2 (Pro), 3 (Ala) and 4 (Gly). Note that various correlational patterns were observed within clusters. E.g., sometimes the correlation of phi with psi is positive, sometimes negative, sometimes essentially zero; and similarly for the correlations of phi with chi1 and of psi with chi1. There were earlier claims (see, e.g., [6] that the side- chain angle varies only slightly with the main-chain angles. In fact, within clusters, these angles are highly correlated for some amino acids.

TABLE 2.  Probabilities, means, standard deviations and correlations
 for Pro (n=446)
 _________________________________________________________

                                              Correlation
                     Means      Std.Devs.       between
 Cluster  Prob     phi   psi   phi   psi     phi and psi
 ________________________________________________________

      1   .399    -63    -28    11   14       - .6
      2   .022    -76     58     9   12       - .8
      3   .115    -73    138    13   20       - .3
      4   .438    -64    148     8   12       - .5
      5   .026    -75    180    10   21         .0
 ________________________________________________________



TABLE 3.  Probabilities, means, standard deviations and correlations
for Ala (n = 923)
___________________________________________________________

                                                Correlation
                     Means         Std.Devs.      between
 Cluster  Prob     phi    psi     phi    psi    phi and psi
____________________________________________________________

      1   .200    -131    142      25     20      -.5
      2   .148     -71    143      14     14      -.5
      3   .180     -83    -13      25     30      -.7
      4   .427     -63    -38       6      9      -.4
      5   .028      57     19      22     72      -.1
      6   .013    -128   -135      29     41       .3
      7   .003     162    162      12     15      -.1
 _________________________________________________________



TABLE 4.  Probabilities, means, standard deviations and correlations
for Gly (n = 949)
__________________________________________________________

                                              Correlation
                     Means        Std.Devs.    between
 Cluster  Prob     phi    psi     phi   psi   phi and psi
__________________________________________________________

     1    .027    -162   -164      12     11       .2
     2    .026     154    172      16      6       .3
     3    .061    -158    165      12     12      -.2
     4    .040     161   -163      17     13      -.2
     5    .044    -100   -158      20     20      -.4
     6    .061     100    150      20     23      -.4
     7    .094     -85     -7      25     35      -.5
     8    .107      95      0      20     25      -.2
     9    .087     -82    154      19     17       .0
    10    .101      73   -145      15     20      -.5
    11    .096     -62    -41       6      8      -.3
    12    .258      80      9      13     19      -.9
 _________________________________________________________


5. Applications of the Results of the Cluster Analysis

The procedures discussed here for determination and refinement of protein structures are bottom-up procedures, working at the atomic level within successive residues. The work can be done in terms of atomic coordinates on the one hand or the dihedral angles on the other. Our work is in terms of the angles, though the mathematical formulation here could be applied similarly to coordinates.

Procedures for protein structure determination involve combining experimental data with the existing knowledge base, the database. Thus the mixture-distribution fit to the database has applications in the determination and refinement of protein structures. One way to do this is simply to use the results of the cluster analysis to suggest starting values for optimization of a target function. (The target function includes an energy function and penalty terms for violation of constraints provided by other experimental data.) Another way to use the results is to use the Bayesian statistical paradigm to combine the information in the database with experimental information. The mixture-distribution is taken as a prior distribution and combined with the experimental data. Note that in this application of the Bayesian paradigm, there is no subjective element: The prior distribution is obtained from empirical data, namely, the existing database.

5.1. The Bayesian Paradigm

Let the vector A denote the true, unknown value of the dihedral angles at a given residue. The vector A is an unknown parameter, to be estimated. Experimental data x, bearing upon the value A, will be obtained. This experimental data will be combined with the prior knowledge about A. The p.d.f. of the prior distribution will be denoted by f(A). It is this density which has been modeled according to the finite-mixture model. The prior distribution is combined with the experimental data according to Bayes' formula to obtain posterior probabilities. The experimental data x can be either discrete or continuous.

5.2. Discrete experimental data

When the experimental data x is discrete, it means x takes on, say, m distinct values, v1, v2, ..., vm.

Much of the work reported in the companion papers [7,11] is done in terms of categorical variables, the "d-connectivities." Each such d is a function of the distance between two protons. E.g., dNa relates to the protons in the nitrogen and alpha carbon atoms. The d's are categorical variables indicating the absence or presence and strength of a "crosspeak," a peak in the bispectrum of the two protons.

We consider first an oversimplified example. Suppose there were three d's (corresponding to three pairs of protons), d1, d2, d3: x = (d1 d2 d3). Suppose further that each d is either 1 or 0, corresponding to the presence or absence of a crosspeak. Then the vector x can have m = 8 possible values (patterns), (1 1 1), (1 1 0), (1 0 1), (0 1 1), (1 0 0), (0 1 0 ), (0 0 1), and (0 0 0). Note that to discriminate among the 16 possible states, these 8 values would have to be supplemented with other information.

In the actual application by program FiSiNOE reported earlier in [14], the d's are not coded simply as 1 or 0 but rather are semi-quantitatively coded on an ordinal scale, absent, weak, medium, or strong. Further, there are not just three d's but nine, five intra-residue d's and four sequential (inter-residue) d's. Also, there are three coupling constants (in Hz), coded as strong, medium, weak, and absent/very weak. Now, x represents such data at a given site. Given x, it is possible to say which clusters c are compatible with it. Typically, there will be two or three such clusters, but sometimes only one. Suppose there are three matching clusters, and, according to the fitted mixture model, these have prior probabilities (pc's) equal to .20, .12, and .08. Then the posterior probabilities of these clusters are obtained by normalization as .20/(.20+.12+.08) = .5, .12/(.20+.12+.08) = .3, and .08/(.20+.12+.08) = .2.

A more formal procedure for obtaining these posterior probabilities would be the following. The conditional probability of cluster c, given that x equals pattern vj, j = 1,2,...,m, is, using Bayes' formula and letting p(.) denote the requisite probability mass functions,

p(c|x=vj) = pcp(vj|c)/p(x=vj)

[Note that this formula is analogous to the conditional probability formula Pr(A|B) = Pr(A and B)/Pr(B) = Pr(A)Pr(A|B)/Pr(B).] Expanding the numerator in terms of the C states, we have

p(c|x=vj) = pcp(vj|c)/[p1p(vj|1) + p2p(vj|2) + ... + pcp(vj|c) + ... + pCp(vj|C)] .

The conditional probability p(vj|c), the conditional probability of the particular experimental outcome, given state (cluster) c, is a measure of compatibility, on a zero-to-one scale, between that state and that experimental result. The above procedure, then, corresponds to estimating this as 1 or 0, according as there is or is not a match between c and vj. Alternatively, the p(vj|c) could presumably be estimated experimentally. This would mean obtaining experimental data x a number of times from a residue or residues whose angles are in configuration c and obtaining the distribution across the vj's. This would need to be done for each state c.

Using posterior probabilities in protein spatial structure determination. In this way, posterior probabilities of the states are obtained for each of the n sites (residues). Random structures are then built up as follows, using a Metroplis-type algorithm. Consider a case in which three components, say C1, C2 and C3, have posterior probabilities of .5, .3, and .2. A random number u between 0 and 1 is obtained. If u is between 0 and .5, the site is classified as C1; if u is between .5 and .8, as C2; if u is between .8 and 1, as C3. The means for these components can then be used as the values of the dihedral angles at that site. This is done for all n sites. This gives one configuration for the chain. The procedure is repeated for all n sites, a number , say M, times. (M might be a hundred, for example.) This yields M suggested configurations for the given chain. These M can all be tried as starting configurations in a target-function optimization. This Metropolis-type algorithm, being a form of importance sampling, constitutes a distinct improvement over the pure random sampling form of Monte Carlo simulation. A Metropolis-type algorithm has been coded in the program FiSiNOE-3; see companion paper [11]. The maximum likelihood estimate of the structure is the one where, at each site, the estimate of the dihedral angles is taken to be the mean of the component with the highest probability. In addition to this highest-probability structure, on the order of 100,000 alternative structures can be generated; on the order of 1,000 of these are chosen as starting configurations for the target-function optimization. These 1,000 can be chosen to be those with the highest probabilities.

5.3. Continuous experimental data

Numerical values for the dihedral angles are continuous data. Such values are available both from the build-up process and from final suggested structures. Bayesian procedures of classification and estimation may be applied to such values.

Classification. Now let x denote the vector of three angles. The state from which the vector x arose can be estimated by Bayesian classification, as follows. Maximum a posteriori (MAP) classification can be used. This consists of simply assigning x to that state for which the posterior probability is highest. That is,

assign x to c* iff. c* = arg max c=1,2,...,C{p(c|x)} .

Here

p(c|x) = pcfc(x)/f(x) .

The multivariate Gaussian p.d.f. with mean vector mc and covariance matrix Sc is

fc(x) = (2pi) -(p/2)(det Sc)-1/2exp[ - (1/2)D2(x,mc;Sc)] ,

where ln is the natural logarithm, p is the number of variables, in our case, 3, det denotes determinant, and, given any p-vectors u, v and any nonsingular p-by-p matrix M, the function D2 is the squared statistical distance (Mahalanobis distance), the quadratic form

D2(u,v;M) = (u-v)'M-1(u-v),

where v' denotes the transpose of the (column) vector v. [If the covariance matrix were the identity, this would be Euclidean distance.]

Consequently, we have

ln p(c|x) = ln pc -(p/2) ln(2pi) - (1/2)ln det Sc - (1/2)D2(x,mc;Sc) - ln f(x).

Thus, multiplying by -2 and dropping terms which do not vary with c, it follows that

c* = arg minc=1,2,...,C {D2(x,mc;Sc) + ln det Sc - 2 ln pc } .

[If all the covariance matrices Sc were equal and all the prior probabilities pc were equal, this would be simply minimum distance classification, where the distance is the appropriate statistical (Mahalanobis) distance.]

Having classified x as having arisen from state c*, one might take the cluster mean mc* as the estimate of the vector A of dihedral angles at that site. Alternatively, Bayesian estimation could be used.

Estimation. If one wishes to combine x with the information in the database to obtain an improved estimate of the vector A of dihedral angles at a site, one can proceed as follows. Let us just illustrate what happens in the estimate of a single angle, say A. ("A" would be phi, psi or chi1.) The posterior estimate (mean of the posterior distribution) is

E[A|x] = w1x + w2E(A),

where E(A) is the mean of the mixture prior distribution and the weights w1 and w2 are proportional to the precisions (reciprocal variances):

w1 = Prec(x|A)/Const., w2 = Prec(A)/Const.,


where

Prec(A) = 1/Var(A), Prec(x|A) = 1/Var(x|A)

and

Const. = Prec(A) + Prec(x|A).

Here Var(A) is the variance of the prior distribution; Var(x|A), that of the conditional distribution of x, given A.

That is, the estimator is a weighted average of the prior mean and the experimental result, where the weights depend upon the relative precisions.

Consider now the expression for the posterior p.d.f. in terms of the prior p.d.f:

h(A|x) = f(A)g(x|A)/k(x),

where g(x|A) is the p.d.f. of x, for fixed A (sometimes called the"likelihood"), and k(x) is the marginal p.d.f. of x, i.e., the integral of the numerator with respect to A. When the prior has the finite-mixture model form, it can be shown that the posterior can be expressed in terms of posteriors corresponding to the component p.d.f's. In particular, the mean of the posterior comes out in terms of the means of the posteriors corresponding to the component p.d.f.'s. In turn, the means of the component posteriors involve the means of the components of the mixture prior. When the experimental data x consists of numerical estimates of the dihedral angles, this provides a way of combining experimental data x with the means of the clusters corresponding to the mixture prior. Details will be given in future work.

Scoring hypothesized alternative structures. Suppose there are T competing suggested structures for a protein of n residues. Denote the dihedral angles in the t-th suggested structure as {ati, i = 1,2,...,n}. The statistical likelihood function provides an appropriate way of scoring these suggested structures, i.e., the likelihood is a figure-of-merit for alternative suggested structures. The likelihood provides a measure of consistency of the suggested structure with the information in the database. The likelihood for structure t is f(at1) x f(at2) x ... x f(atn), where f(a) is the fitted mixture p.d.f., evaluated at a. This product is computed for each structure t; the highest value is the best suggested structure. When the a's have been obtained from the Bayesian posterior probability formula, the resulting product h(at1|x) x h(at2|x) x ... x h(atn|x) is the posterior probability of the suggested structure.

6. Conclusions

By finite mixture model cluster analysis we have fit distributions to dihedral angles from the PDB. The trivariate distribution of (phi, psi, chi1) is fit for amino acid residues with variable side-chain angles and the bivariate distribution of (phi, psi) is fit for those without variable side-chain angles.

The resulting 16 clusters admit interpretation as combinations of six states (B, P, R, E, T, and L) with three modes for chi1.

Different amino-acid residues have different propensities for the 16 alternative states.

The fitted mixture model is useful in obtaining starting values for target-function optimization and for Bayesian refinement of estimates of dihedral angles.


References

(In this Web version of the paper the References are given in a more informative long form, including titles.)

[1] C.R. Cantor and P.R. Schimmel. Biophysical Chemistry. W. H. Freeman, New York, 1980.

[2] A.P. Dempster, N.M. Laird and D.B. Rubin. Maximum likelihood from incomplete data via the E-M algorithm, J. Royal Statistical Society B39 (1977) 1-38.

[3] R. L. Dunbrack, Jr. and M. Karplus. Backbone-dependent rotamer library for proteins: application to side-chain prediction, J. Molecular Biology 230 (1993) 543-574.

[4] R. L. Dunbrack, Jr. and M. Karplus. Conformational analysis of the backbone-dependent rotamer preferences of protein sidechains, Nature Structural Biology 1 (1994) 334-340.

[5] R. L. Dunbrack, Jr. The backbone-dependent rotamer library Webpage, 1996. World Wide Web URL http: //www.cmpharm.ucsf.edu /~dunbrack.

[6] J. Janin, S. Wodak, M. Levitt and B. Maigret. Conformation of amino acid side-chains in proteins, J. Molecular Biology 125 (1978) 357-386.

[7] L. Kirnarsky, O. Shats and S. Sherman. Improving the efficiency of protein structure determination from NMR. ECCC3, Paper #52.

[8] G.J. McLachlan and K.E. Basford. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York, 1987.

[9] J. Moult and M.N.G. James. An algorithm for determining the conformation of polypeptide segments in proteins by systematic search. Proteins: Structure, Function and Genetics 1 (1986) 146-163.

[10] S.L. Sclove. CLUSPAC: Computer Programs for Mixture-Model Cluster Analysis. CRIM Working Paper No. 92-2 (1992). Center for Research in Information Management, University of Illinois at Chicago, Chicago, Ill.

[11] O. Shats and S.A. Sherman. The FiSiNOE-3 Program for Determination of Protein and Peptide Conformations from NMR Data. ECCC3, Paper #53.

[12] S.A. Sherman, A.M. Andrianov and A.A. Akhrem. Method of determining protein conformations by the two-dimensional nuclear Overhauser enhancement spectroscopy data, J. Biomolecular Structure and Dynamics 4 (1987) 869-884.

[13] S.A. Sherman, A.M. Andrianov and A.A. Akhrem. Method of modeling protein structure by the two- dimensional nuclear magnetic resonance spectroscopy data: application to the proteinase inhibitor BUSI IIA from bull seminal plasma. J. Biomolecular Structure and Dynamics 5 (1988) 785-801.

[14] S. Sherman, S. Sclove, L. Kirnarsky, I. Tomchin and O. Shats, Improvement in the accuracy of protein local structure determination from NMR data. (Paper in ECCC2) J. Mol. Struct.: THEOCHEM 368 (1996), 153-162.

[15] D.M. Titterington, A.F.M. Smith and U.E. Makov. Statistical Analysis of Finite Mixture Distributions. John Wiley, New York, 1985.

[16] J.H. Wolfe. Pattern clustering by multivariate mixture analysis, Multivariate Behavioral Research 5 (1970) 329-350.


Stanley L. Sclove, Ph.D.
Information & Decision Sciences Dept.           M/C 294
University of Illinois at Chicago
601 S. Morgan St.
Chicago, IL 60607-7124
slsclove@uic.edu
www.uic.edu/~slsclove

Simon A. Sherman, Ph.D.
Eppley Institute for Research in Cancer and Allied Diseases
University of Nebraska Medical Center
600 S. 42nd St.
Omaha, NE 68198-9805
ssherm@mail.unmc.edu


latest revision 18-June-1997