Introduction

Virtual screening in the pharmaceutical industry is an essential part of molecular modeling's contribution to lead discovery and, to a lesser extent, lead optimization. This has led to considerable research into which method or approach works best, typically by means of 'retrospective' evaluations, i.e. attempting to predict future, prospective, behavior by appraising techniques on known systems. Despite this, there is no agreed-upon theory as to how to conduct a retrospective evaluation. As a consequence, it is very difficult for an outsider to assess whether methods are getting better, have stayed the same, or have even worsened over time. In a practical enterprise such as drug discovery, the proposed benefits of virtual screening, i.e. avoiding the cost and time of a real screen, have to be weighed against one simple question: does it actually work? Without proper metrics of success, i.e. ones that go beyond the anecdotal, molecular modeling is not guaranteed a vibrant future.

Why is this? Why is the modeling field so poor at the most basic elements of evaluation? A charitable view would be that, as with communication skills, most modelers receive little appropriate formal training. Certainly there is no central resource, whether scholastic review, book or paper. A slightly less charitable view is that journals have not developed standards for publication, and as such there is little Darwinian pressure to improve what the community sees as acceptable. It is to be hoped that this is a learning curve, i.e. that editors will eventually appreciate what is required in a study. An extreme view is that we are poor at evaluations because we simply do not matter very much. If large fortunes were won or lost on the results of computational techniques there would be immense debate as to how to analyze and compare methods, on what we know and exactly when we know it. There would be double-blind, prospective and rigorously reviewed studies of a scale and depth unknown in our field but common in, for instance, clinical trials. In short, there would be standards.

Experimental design

In what follows we consider the importance of both intensive and extensive properties of an experiment. An intensive property is something intrinsic to a design, whereas extensive properties change with the size of the system. For example, the type of decoys used in a retrospective study is an intensive property; the number of such decoys is an extensive property. We believe the most overlooked intensive characteristic is the design goal, i.e. what is trying to be proved. This typically falls into a few discrete classes, and appropriate labeling would help combine lessons from different studies. For extensive quantities we consider how common statistical approaches can aid the choice of the numbers of actives, decoys and targets. Finally, actives, decoys, targets or methods are not always independent, and this has to be quantified even in as simple a matter as comparing two programs. Techniques for accounting for correlation within an experimental design are known but rarely applied.

Intensive properties

One of the most basic issues in designing a retrospective screen is how to choose decoys. Typically there are a certain number of active compounds and one wishes to see if a method can distinguish these from a second set, presumed inactive. This is the most basic of classification problems: is X of type A or type B? The legal system often faces the same dilemma, e.g. was X at the scene of a crime or not?
A police line-up has all the components of a virtual screen. The number of actives (suspects) is small, usually one. The number of decoys (called 'fillers') has to be sufficient that random selection does not compete with real recognition; a minimum of four is usual. But it cannot be so large that guilt is hidden within the statistical variance of the innocent. The fillers need to be convincing, i.e. not outlandishly dissimilar to the guilty party, but not too similar or even potentially also at the scene (i.e. false false positives). As courtroom verdicts can depend on the appropriateness of a line-up, standard procedures are well established.

Decoys for virtual screening can likewise be divided into a few broad classes. Universal decoys are drawn from a large, general collection with no tailoring to the problem at hand. Drug-like decoys are additionally filtered to match the property distributions of known drugs. Mimetics are matched to the physical properties of the actives themselves, as with the DUD 'self' decoy sets used below. Modeled decoys are compounds known, or presumed, to be inactive against the actual target. Each class serves a different design goal, and labeling a study as universal, drug-like, mimetic or modeled would make it far easier to combine lessons from different evaluations.

Extensive properties

In addition to intensive properties, there are extensive properties, such as how many actives, decoys and targets are used. Once again the important consideration is knowing what we want to know. If the purpose is to evaluate a single method on a single target, the necessary extensive properties are quite different than for a broad study of the efficacy of several methods on many targets. We illustrate this with some basic error analysis.

If the sources of error in an experiment are independent, they combine as:

$$ {\text{Err}} \approx \sqrt{{\text{Err}}_1^2 + {\text{Err}}_2^2 + {\text{Err}}_3^2 + \ldots} $$

For a virtual screening method evaluated over sets of targets, actives and inactives this becomes:

$$ {\text{Err(method)}} \approx \sqrt{{\text{Err}}_{\text{targets}}^2 + {\text{Err}}_{\text{actives}}^2 + {\text{Err}}_{\text{inactives}}^2} \approx \sqrt{{\text{Var}}_{\text{targets}}/N_t + {\text{Var}}_{\text{actives}}/N_a + {\text{Var}}_{\text{inactives}}/N_i} $$

The variances are properties intrinsic to 'targets', 'actives' and 'inactives'. How do we know what these variances are? One way is to bootstrap, i.e. leave out a randomly chosen fraction of the targets, or a subset of the actives or inactives, and measure the change in performance. Repeating this procedure many times gives a statistical sampling of the sensitivity to outliers and to the number of measurements.
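As a concrete sketch of the bootstrap just described, the following resamples per-target AUC values to estimate the standard error of a method's mean performance. The AUC values are hypothetical and numpy is assumed; any per-target metric could be substituted.

```python
import numpy as np

def bootstrap_error(per_target_auc, n_boot=10000, seed=None):
    """Estimate the standard error of the mean AUC by resampling targets.

    Re-drawing targets with replacement mimics the sensitivity of the
    mean to which targets happened to be chosen for the benchmark.
    """
    rng = np.random.default_rng(seed)
    aucs = np.asarray(per_target_auc, dtype=float)
    n = len(aucs)
    # Draw n_boot resampled benchmarks of the same size, with replacement.
    samples = rng.choice(aucs, size=(n_boot, n), replace=True)
    means = samples.mean(axis=1)
    return means.std(ddof=1)

# Hypothetical per-target AUCs for one method over a small benchmark.
aucs = [0.91, 0.55, 0.73, 0.62, 0.88, 0.70, 0.64, 0.79]
print(f"mean AUC = {np.mean(aucs):.3f} +/- {bootstrap_error(aucs):.3f}")
```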
Alternatively, in some cases the variance can be established more precisely. In the case of AUC, for example, it can be shown that for a particular target the variance contributions from the actives and from the inactives can be approximated by:

$$ {\text{Var}}_{\text{active}} = \sum_i (p_i - \langle p \rangle)^2 / N_{\text{active}} $$

$$ {\text{Var}}_{\text{inactive}} = \sum_j (q_j - \langle q \rangle)^2 / N_{\text{inactive}} $$

where p_i is the fraction of inactives ranked below active i, and q_j is the fraction of actives ranked above inactive j. Figure 2 shows the resulting confidence intervals for each virtual screen in the DUD set.

Fig. 2 AUC values ordered from left to right by number of actives for each target in the DUD set. Program used: FRED with Chemscore as the posing and scoring function. Error bars are 95% confidence intervals for each virtual screen

When a method is instead evaluated over N_t targets, the error in the mean AUC follows from the observed spread of per-target values:

$$ {\text{Err(AUC)}} \approx \sqrt{{\text{Var}}_{\text{Obs}}/N_t} $$

$$ {\text{Var}}_{\text{Obs}} = \sum_i ({\text{AUC}}_i - \langle {\text{AUC}} \rangle)^2 / N_t $$

Therefore:

$$ {\text{Var}}_{\text{targets}} = N_t \left\{ ({\text{Var}}_{\text{Obs}}/N_t) - ({\text{Var}}_{\text{actives}}/N_a) - ({\text{Var}}_{\text{decoys}}/N_i) \right\} $$

i.e. the intrinsic target-to-target variance is what remains of the observed variance once the finite numbers of actives and inactives have been accounted for. Table 1 shows this decomposition for four methods over the DUD dataset with its 'self' decoys.

Table 1 The contribution to observed variance from actives, decoys and targets over the DUD dataset (DUD-self decoys)

Method | Err²(inactives) | Err²(actives) | Err²(observed) | Err²(targets)
FRED   | 0.000048        | 0.0020        | 0.023          | 0.021
ROCS   | 0.000025        | 0.0022        | 0.041          | 0.039
MACCS  | 0.00004         | 0.0017        | 0.030          | 0.028
LINGOS | 0.000039        | 0.0017        | 0.035          | 0.033

The estimated error (squared) from the variation between targets is estimated from the observed variance and corresponds to that which would be obtained if the number of actives and inactives were infinite.

When calculating the properties of a single system the number of actives is fairly important, but the number of inactives does not have to be substantially larger: the error contribution scales as √(Var_actives/N_a + Var_inactives/N_i), and so falls off only slowly once N_i exceeds N_a. A ratio of decoys to actives of 4:1 has an error only 11% higher than the limiting value from an infinite number of inactives. It would be more useful to include sets of inactives designed for different purposes than to attempt to 'overwhelm' the actives with decoys.
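The per-target AUC error formulas above are straightforward to apply directly. The following sketch computes the AUC of a single screen together with its approximate error from the spreads of p_i and q_j; the scores are hypothetical and numpy is assumed.

```python
import numpy as np

def auc_with_error(active_scores, inactive_scores):
    """AUC and an approximate standard error, per the formulas above.

    p_i = fraction of inactives scored below active i
    q_j = fraction of actives scored above inactive j
    """
    act = np.asarray(active_scores, dtype=float)
    inact = np.asarray(inactive_scores, dtype=float)
    # Higher score = better rank; ties ignored for simplicity.
    p = np.array([np.mean(inact < a) for a in act])
    q = np.array([np.mean(act > b) for b in inact])
    auc = p.mean()  # equals q.mean()
    var_active = np.sum((p - p.mean()) ** 2) / len(act)
    var_inactive = np.sum((q - q.mean()) ** 2) / len(inact)
    err = np.sqrt(var_active / len(act) + var_inactive / len(inact))
    return auc, err

# Hypothetical screen: 25 actives versus decoys at a 4:1 ratio.
rng = np.random.default_rng(7)
actives = rng.normal(1.0, 1.0, 25)
decoys = rng.normal(0.0, 1.0, 100)
auc, err = auc_with_error(actives, decoys)
print(f"AUC = {auc:.3f} +/- {err:.3f}")
```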
If the purpose is to test a method against other methods with 95% confidence, the number of systems required is very large, much larger even than DUD. In our analysis the contribution to the variance from a limited number of actives is almost insignificant compared to the target-to-target variation. For example, it would take over 100 test systems to tease apart the difference between the ligand-based method ROCS and the docking program FRED with 95% confidence (see below).

Correlations

The error estimates above assume each measurement is independent, yet actives, decoys and targets frequently are not. Correlation may be explicit, e.g. the same drug-like decoy set reused across targets, or latent, e.g. compounds selected by similar criteria even when not shared as decoys; either reduces the effective number of independent measurements. Targets themselves can also be correlated: Fig. 3 shows docking performance against two isoforms of the same enzyme from the Warren study, a pair far more alike in behavior than a typical pair of targets.

Fig. 3 Docking performance against the two isoforms in the Warren study (PDFS and PDFE), compared to the averaged difference over all other pairs of targets

Correlation also matters in as simple a question as whether method A outperforms method B over the same set of targets. The appropriate error is not that of each method separately but that of the per-target differences:

$$ {\text{Var}}_{\text{diff}} = \sum_i ((A_i - B_i) - (\langle A \rangle - \langle B \rangle))^2 / N_t $$

$$ {\text{Err(diff)}} \approx \sqrt{{\text{Var}}_{\text{diff}}/N_t} $$

Var_diff can be decomposed into the individual variances and a cross term:

$$ {\text{Var}}_{\text{diff}} = {\text{Var}}_A + {\text{Var}}_B - 2\,{\text{Corr}}(A, B) $$

$$ {\text{Corr}}(A, B) = \sum_i (A_i - \langle A \rangle)(B_i - \langle B \rangle)/N_t $$

Here Corr(A,B) is a measure of the correlation between methods A and B and is related to the Pearson correlation coefficient, thus:

$$ {\text{Pear}}(A, B) = {\text{Corr}}(A, B)/\sqrt{{\text{Var}}_A\,{\text{Var}}_B} $$

From the joint difference and its error we can estimate a p-value, i.e. the probability that the observed difference would arise by chance if the two methods were in fact equivalent:

$$ p = \left(1 - erf\left(\langle A - B \rangle \sqrt{0.5\,N_t/{\text{Var}}_{\text{diff}}}\right)\right)/2 $$

where erf is the error function. For small numbers of targets, Student's t-distribution gives a more reliable estimate of p than this normal approximation. Table 2 applies these measures to the DUD data set.
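A minimal sketch of this paired comparison, using the same Var_diff and erf-based p-value as in the text; the per-target AUC lists are hypothetical.

```python
import numpy as np
from math import erf, sqrt

def paired_comparison(auc_a, auc_b):
    """Mean difference, its error and a p-value for two methods scored
    on the same targets, using the paired formulas in the text."""
    a = np.asarray(auc_a, dtype=float)
    b = np.asarray(auc_b, dtype=float)
    n_t = len(a)
    diff = a - b
    mean_diff = diff.mean()
    # The paired variance automatically subtracts 2*Corr(A, B).
    var_diff = np.sum((diff - mean_diff) ** 2) / n_t
    err = sqrt(var_diff / n_t)
    # One-sided p that A beats B by chance (normal approximation;
    # a t-distribution is preferable for small n_t).
    p = (1.0 - erf(mean_diff * sqrt(0.5 * n_t / var_diff))) / 2.0
    return mean_diff, err, p

# Hypothetical per-target AUCs for two methods on eight shared targets.
auc_a = [0.78, 0.66, 0.91, 0.55, 0.70, 0.83, 0.62, 0.74]
auc_b = [0.72, 0.60, 0.88, 0.57, 0.66, 0.79, 0.55, 0.71]
d, e, p = paired_comparison(auc_a, auc_b)
print(f"<A-B> = {d:.3f} +/- {e:.3f}, p = {p:.3f}")
```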
Table 2 Statistical measures necessary to accurately assess the relative performance of methods, here applied to the DUD data set (DUD-self decoys)

Method | FRED        | ROCS           | MACCS           | LINGOS
FRED   | 0.684/0.043 | 0.11/0.08/0.07 | 0.1/0.07/0.06   | 0.1/0.07/0.065
ROCS   | 0.17/0.09   | 0.732/0.065    | 0.12/0.085/0.05 | 0.125/0.09/0.05
MACCS  | 0.03/0.05   | 0.70/0.47      | 0.734/0.055     | 0.115/0.08/0.055
LINGOS | 0.19/0.14   | 0.65/0.36      | 0.54/0.31       | 0.72/0.061

Diagonal entries give the mean AUC and its standard error for each method. Entries above the diagonal characterize each pairwise difference, the final two values being the error in the difference calculated without and then with the correlation between methods; entries below the diagonal give the corresponding significance measures. As discussed above, separating closely matched methods such as ROCS and FRED with 95% confidence would require considerably more targets than DUD provides.

Metrics

Properties of virtual screening metrics

In a somewhat circular manner, one of the first characteristics of a good measure is that everyone uses it. Clearly one of the problems with a field with diverse measures is incomparability, the "apples and oranges" problem. The most straightforward solution is not the imposition of a particular standard but full disclosure of all data. The authors of a study may want to present enrichment at 5%, but if the data is freely available others may calculate the enrichment at 1% or 13% or whatever they wish. This would inevitably lead to standardization as independent parties harvest data from many sources, publishing larger and larger studies on the advantages and disadvantages of different methods and measures. This would provide another example of meta-analysis described above. Sometimes a valid excuse against disclosure is that compounds or targets are proprietary. However, just providing lists of actives and inactives in rank order with unique, but not necessarily identifying, tags is enough to calculate most of the metrics for a particular virtual screen. Currently the field of modeling lacks even an agreed-upon format for the exchange of such rarely available information.

What, then, are the desirable properties of a metric? We suggest five:

(i) Independence from extensive variables
(ii) Robustness
(iii) Straightforward assessment of error bounds
(iv) No free parameters
(v) Easily understood and interpretable

Take for example the very popular "enrichment" measure. Everyone understands the concept of enrichment: swirl a pan of water and gravel from the Klondike river in 1896 in just the right way and you ended up with mostly gold. In virtual screening you look at the top few percent and see whether there are more actives than you would expect by chance. As a mathematical formula this is typically presented as:

$$ {\text{EF}}(X\%) = (100/X) \times ({\text{Fraction of Actives Found}}) $$

Enrichment so defined fails several of these criteria: in particular, it depends on the ratio of actives to inactives, an extensive property of a particular experiment. An improvement is to report the ratio of the true positive rate to the false positive rate at a given false positive rate, which removes this dependence on the composition of the screening deck; we propose the term "ROC enrichment" for this measure.

Early performance in virtual screening

A common argument is that what matters in virtual screening is "early" performance, i.e. the recognition of actives in the first fraction of an ordered list. Two methods can have similar AUCs yet differ markedly in such early behavior (Fig. 4), and metrics such as RIE and BedROC have been proposed to capture this.

Fig. 4 Example ROC plots for "early" and "late" methods

So do BedROC or RIE qualify as good metrics for virtual screening? Compared against the five criteria listed above, both are more robust than enrichment, and the error protocols for BedROC satisfy criterion (iii). RIE suffers from an ill-defined numerical interpretation (i.e. how good is an RIE of 5.34?). BedROC attempts to overcome this by scaling between 0.0 and 1.0, but does this qualify as being understandable? There is no absolute, interpretable meaning to a BedROC (or RIE) number, only a relative meaning when ranking methods.
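To make the contrast between the classical enrichment factor and the ROC enrichment suggested above concrete, here is a sketch under hypothetical scores; the function names are illustrative, not from any published package.

```python
import numpy as np

def enrichment_factor(scores, labels, x_percent):
    """Classical EF(X%): fraction of actives found in the top X% of the
    ranked list, times 100/X. Depends on the active:decoy ratio."""
    order = np.argsort(scores)[::-1]              # best scores first
    ranked = np.asarray(labels)[order]
    n_top = max(1, int(round(len(ranked) * x_percent / 100.0)))
    fraction_found = ranked[:n_top].sum() / ranked.sum()
    return (100.0 / x_percent) * fraction_found

def roc_enrichment(scores, labels, fpr_level):
    """'ROC enrichment' as described above: TPR/FPR at a fixed false
    positive rate, independent of the active:decoy ratio."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)
    inactive_scores = np.sort(scores[~labels])[::-1]
    # Score threshold at which the chosen fraction of inactives is passed
    # (ties ignored for simplicity).
    k = max(1, int(round(fpr_level * len(inactive_scores))))
    threshold = inactive_scores[k - 1]
    tpr = np.mean(scores[labels] >= threshold)
    return tpr / fpr_level

# Hypothetical screen: 50 actives, 1000 decoys.
rng = np.random.default_rng(1)
scores = np.concatenate([rng.normal(1, 1, 50), rng.normal(0, 1, 1000)])
labels = np.concatenate([np.ones(50), np.zeros(1000)])
print(f"EF(1%) = {enrichment_factor(scores, labels, 1.0):.1f}")
print(f"ROC enrichment at FPR=0.01 = {roc_enrichment(scores, labels, 0.01):.1f}")
```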
Cost structures of virtual screening

Whether the "early" behavior illustrated in Fig. 4 actually matters depends on the cost structure of the screen, i.e. the value attached to each of the four possible outcomes, applied to the true and false positive rates. Suppose, for illustration, each true positive is worth TP = 8.0 in some arbitrary unit of value, each false negative costs FN = −2.0, each false positive costs FP = −0.16 and each true negative is worth TN = 0.02. The expected return at a score threshold t is then:

$$ {\text{Cost}}(t) = {\text{TPR}} \cdot N_a \cdot (8.0) + (1 - {\text{TPR}}) \cdot N_a \cdot (-2.0) + {\text{FPR}} \cdot N_i \cdot (-0.16) + (1 - {\text{FPR}}) \cdot N_i \cdot (0.02) $$

where N_a and N_i are the numbers of actives and inactives. Assuming 100 inactives for every active, this simplifies to:

$$ {\text{Cost}}(t)/N_i = ({\text{TPR}} \cdot (8.0 + 2.0) - 2.0)/100 - {\text{FPR}} \cdot (0.16 + 0.02) + 0.02 = 0.10 \cdot {\text{TPR}} - 0.18 \cdot {\text{FPR}} $$

The optimal operating point is wherever this quantity is maximized along the ROC curve (Fig. 5). With a different cost structure, e.g. TP = 8.0, FN = −2.0, FP = −0.04 and TN = 0.03, the expected return becomes:

$$ {\text{Cost}}(t)/N_i = 0.1 \cdot {\text{TPR}} - 0.07 \cdot {\text{FPR}} + 0.01 $$

and the optimal threshold moves to a later point on the same curve, because false positives are now cheaper relative to the gain from true positives.

Fig. 5 (a, b) Optimal operating points on a ROC curve under the two cost structures; panel b uses TP = 8.0, FN = −2.0, FP = −0.04, TN = 0.03

These examples are obviously only illustrative, but the point they make is real. Early enrichment is important only because of an assumed cost structure. Clearly much more complicated models could be constructed, possibly with real data, as with medical tests. However, to the author's knowledge this has never been published, presented or even discussed within the industry. It is an assumption that early enrichment is better. Likewise, it is also an assumption that virtual screening itself is a productive exercise compared to physical screening.
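A minimal sketch of locating the optimal operating point under the cost structures derived above; the ROC curve here is a hypothetical analytic form, not data from any study.

```python
import numpy as np

def best_operating_point(tpr, fpr, gain_tpr, cost_fpr, const=0.0):
    """Scan a ROC curve for the threshold maximizing the per-inactive
    return, e.g. 0.10*TPR - 0.18*FPR as derived in the text."""
    tpr = np.asarray(tpr, dtype=float)
    fpr = np.asarray(fpr, dtype=float)
    returns = gain_tpr * tpr - cost_fpr * fpr + const
    i = int(np.argmax(returns))
    return i, returns[i]

# Hypothetical ROC curve points (FPR, TPR) for one method.
fpr = np.linspace(0.0, 1.0, 101)
tpr = fpr ** 0.3                      # a concave, 'early-performing' curve

i, r = best_operating_point(tpr, fpr, 0.10, 0.18)
print(f"first cost model:  best FPR = {fpr[i]:.2f}, return/N_i = {r:.3f}")

# Second cost structure (0.10*TPR - 0.07*FPR + 0.01): cheaper false
# positives favor a later, more permissive threshold.
i2, r2 = best_operating_point(tpr, fpr, 0.10, 0.07, 0.01)
print(f"second cost model: best FPR = {fpr[i2]:.2f}, return/N_i = {r2:.3f}")
```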
Averaged properties of virtual screening

Finally, consider performance averaged over many targets. Figure 6 shows target-averaged ROC curves for twenty methods from the Warren study; Fig. 7 shows the same for four methods over DUD. Figures 8 and 9 compare an early-recognition measure, BedROC, with the AUC: over many virtual screens the two track each other closely, both per screen and per method.

Fig. 6 Averaged ROC curves for twenty methods in the Warren study for which scores for all eight targets were available. Programs and scoring functions are listed to the right of the graph

Fig. 7 Average ROC curves for FRED, ROCS, MACCS keys and LINGOS over DUD, with DUD-self decoys. FRED was run with the ChemGauss3 scoring function

Fig. 8 BedROC scores with an exponential factor of 5.0 versus the AUC for 270 virtual screens from the Warren study

Fig. 9 The average AUC for each method run against all eight targets in the Warren study versus the averaged BedROC score for each such method

Conclusions

In this study we have considered several aspects of experimental design and performance metrics for virtual screening. There is clearly interest in doing things the right way, not least because of a popular, if unproven, belief that virtual screening saves the pharmaceutical industry money. As with many relatively young endeavors, molecular modeling has been long on promises and short on standards, and it is standards that ultimately deliver the proof that our field is useful. For many years the computer industry suffered from similar growing pains. Not only were there few, if any, reliable comparison metrics for different processors, operating systems, compilers and so forth, the proposed benefits of computers were more assumed than quantified. These days no one doubts the impact of the computing revolution. It is to be hoped that a similar statement can one day be made for molecular modeling. It is with this in mind that the following observations and recommendations are made:

(i) Retrospective studies should state the design intent of their decoy sets, e.g. universal, drug-like, mimetic or modeled.

(ii) Providing access to primary data would allow the field to gain cumulative knowledge. The field of modeling has almost no "meta-analysis", i.e. research combining the results from studies, largely because of a lack of standards as to procedures and measures, but also due to the lack of primary data.

(iii) A comprehensive format for virtual screening information would be useful.

(iv) The inclusion of multiple decoy sets of different design and intent for each target in an evaluation would, in combination with (i) and (ii) above, greatly increase the cumulative value of published studies.

(v) The numbers of targets, actives and inactives need to be carefully considered with respect to the purpose of the experiment and the required accuracy of the results. These can be derived from simple statistical methods that are almost never applied.

(vi) Correlation, whether explicit or latent, reduces the effective number of independent measurements and should be accounted for. Correlation between targets needs further research, in particular the question of the variance of computational methods on closely related systems.

(vii) Deciding on the metrics to be reported should be a community effort, although access to primary data to encourage "meta-analysis" would aid the autonomous adoption of metrics.

(viii) There are good reasons metrics such as the AUC are popular in other fields, and any new or additional measures for virtual screening need to be assessed against the characteristics that have made such metrics successful. Five characteristics required for a metric to be of similar heft to the AUC are proposed: independence from extensive variables, robustness, error bounds, no adjustable parameters and ease of interpretation. As an illustration, an improvement to the common enrichment measure is described. We propose the term "ROC enrichment" for this new measure. Similar improvements to early measures are proposed.

(ix) The assumption that "early" behavior is necessarily a benefit is based on an assumed cost structure that may or may not hold. Similar statements are true for virtual screening in general. A rigorous attempt to assign real-world costs would be of use to the field.

(x) On average, measures of early performance track the AUC (Figs. 8, 9). Divergence from this average behavior may be an indicator of local or domain knowledge, i.e. knowing the right answer and/or extensive knowledge of the system under study. A potential future area of research is whether this is also an indicator of over-parameterization, posterior system preparation or other reliance on retrospective knowledge. Interestingly, 2D methods applied to DUD showed no evidence of such a divergence.