QSAR for Decision Support
Models built by scientists are usually characterized by the type and quality of the data available. For example, given permeability values across a membrane, one can either build a regression model that predicts the value directly or build a classifier that predicts the bin a compound falls into (such as high or low). Conventionally, a model is judged by statistical metrics such as R2 or Q2 for regression models, or precision and recall for classifiers. In reality, the only metric a model needs to satisfy, and indeed the only one that counts, is whether the model is useful. This raises the question: how does one define usefulness? To answer it, let us step back and understand why models are needed in the first place. Models are mathematical representations of biological or chemical hypotheses that need to be confirmed by experiment. By creating a QSAR model, we are implicitly asserting the following.
1) There exists a relationship between structure and activity.
2) This relationship is complex and not easily discernible upon inspection.
3) However, this complex relationship can be captured mathematically.
4) This mathematical relationship will enable the chemist to pick/design the next set of molecules to experiment upon.
What is implicit is that a useful model allows me to make a different decision from the one I would make in its absence, i.e. the model predictions change my next experimental step. Indeed, if I make the same decision whether or not I have access to a model, why do I need it? Once we accept that the primary purpose of modeling is its use, we immediately face the next question: what decision is my model going to affect? Depending on where one is in the pharma pipeline, one faces different challenges, different model needs and hence different metrics for model “goodness”. Is the model being used as a screen? Do you want to select leads or candidates from a pool of possibilities? Is the model an aid for design? The answers to these questions lead to the interesting and perhaps not so intuitive notion that factors external to the type or quality of the data have a huge impact on the kind of model that needs to be built. I would like to illustrate, using a real case, how this notion of model utility shaped the kind of model that was eventually deployed successfully within a pharmaceutical organization.
The problem: A mid-size pharma company faced the challenge of identifying leads that could penetrate the blood-brain barrier (BBB) and bind to a particular target localized in the brain. The 3-D structure of that target was such that good binders inevitably ended up being rather large molecules that did not easily penetrate the BBB. Molecules that seemed to satisfy permeability and binding criteria in vitro were injected into rats and the brain-plasma ratio (B/P) was measured. Normally, molecules with B/P > 2 are called penetrants, those with B/P < 0.5 are non-penetrants, and all others fall into an ambiguous category. B/P measurement in animals requires sacrifices at multiple time points to construct the brain and blood AUCs, which makes it expensive and time-consuming. What the organization required was a model that would predict the B/P of their leads so that they could be prioritized for the animal experiments. The twist is that, because it is difficult to design good binders that can enter the brain, it is critical to avoid false negatives, i.e. the model must flag all penetrants accurately even if it misclassifies some non-penetrants.
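The three B/P categories above can be written down as a small function. This is only a sketch: the function name `bp_category` is made up for illustration, and treating the boundary values 0.5 and 2 as ambiguous is an assumption, since the text specifies only the strict inequalities.

```python
def bp_category(bp_ratio: float) -> str:
    """Map a brain-plasma ratio (B/P) to the three categories used in the text."""
    if bp_ratio < 0:
        raise ValueError("B/P ratio cannot be negative")
    if bp_ratio > 2.0:
        return "penetrant"
    if bp_ratio < 0.5:
        return "non-penetrant"
    # Assumption: the boundary values 0.5 and 2.0 themselves count as ambiguous.
    return "ambiguous"


# Made-up ratios, for illustration only.
for ratio in (3.1, 1.0, 0.2):
    print(ratio, bp_category(ratio))
```

The same binning can be applied to measured and to predicted B/P values, which is what makes a regression model comparable with a classifier on decision-oriented terms.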
The Models: Using the data provided, we first built two regression models: a 20-feature model (R2 = 0.97, Q2 = 0.895) and a 9-feature model (R2 = 0.87, Q2 = 0.744). The statistical metrics looked good, and y-randomization also seemed to confirm that the models captured a real signal in the data. To test whether these models would be good for decision-making, we used the cross-validation predictions on the training set to assess the discriminatory power of each model in identifying penetrants. The results are shown in the table below.
| Accuracy (%) | 20-feature model | 9-feature model |
| --- | --- | --- |
| Overall | 85.7 | 78.6 |
| False Positives | 0.0 | 0.0 |
| False Negatives | 66.7 | 33.3 |
The models did not perform as well as we wanted them to. While the overall accuracy was good, the false-negative rates (66.7% and 33.3%) were unacceptably high for a problem in which missing a penetrant is the costliest error. Hence, we did not believe that they would perform well for decision making.
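The check described above can be sketched as follows: bin both the measured and the cross-validated predicted B/P values with the thresholds from the problem statement, then count the errors. The article does not define its error rates, so the definitions here are assumptions: the false-negative rate is taken as the share of measured penetrants that are not predicted to be penetrants, and the false-positive rate as the share of all other compounds that are predicted to be penetrants. The function names and the data values are made up for illustration.

```python
def categorize(bp_ratio):
    """Bin a B/P value using the thresholds given in the problem statement."""
    if bp_ratio > 2.0:
        return "penetrant"
    if bp_ratio < 0.5:
        return "non-penetrant"
    return "ambiguous"


def decision_metrics(measured, predicted):
    """Return overall accuracy, false-positive and false-negative rates (percent)."""
    if not measured or len(measured) != len(predicted):
        raise ValueError("need two non-empty lists of equal length")
    pairs = [(categorize(m), categorize(p)) for m, p in zip(measured, predicted)]
    correct = sum(true == pred for true, pred in pairs)
    # Assumed definitions: "positive" means penetrant, everything else is negative.
    pred_for_penetrants = [pred for true, pred in pairs if true == "penetrant"]
    pred_for_others = [pred for true, pred in pairs if true != "penetrant"]
    missed = sum(pred != "penetrant" for pred in pred_for_penetrants)
    false_alarms = sum(pred == "penetrant" for pred in pred_for_others)
    return {
        "overall": 100.0 * correct / len(pairs),
        "false_negatives": 100.0 * missed / len(pred_for_penetrants)
        if pred_for_penetrants else 0.0,
        "false_positives": 100.0 * false_alarms / len(pred_for_others)
        if pred_for_others else 0.0,
    }


# Made-up measured and cross-validated predicted B/P values, for illustration only.
measured = [3.0, 2.5, 4.1, 0.2, 0.3, 1.0, 0.1]
cv_predicted = [2.6, 1.4, 1.8, 0.3, 0.4, 1.2, 0.2]
metrics = decision_metrics(measured, cv_predicted)
print(metrics)
```

In this toy data the regression predictions are numerically close to the measurements, yet two of the three penetrants are predicted just under the B/P > 2 threshold and are therefore missed. That is one plausible way for a well-fitting regression model to show good overall accuracy together with a high false-negative rate.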
We then took the alternative approach of building a 3-way classifier using a 4-feature neural network. Testing the classifier in the same manner as above, we saw better performance, in particular no false negatives. Based on the performance shown in the table below, we believed that the classifier would be a better model for decision-making.
| Accuracy (%) | Classifier |
| --- | --- |
| Overall | 89.9 |
| False Positives | 2.3 |
| False Negatives | 0.0 |
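Our hunch proved correct: when we tested the 20-feature regression model and the classifier on a large test set gathered over the next several months, the classifier outperformed the regression model on the measure that mattered most, false negatives (18.2% versus 66.7%), at the cost of more false positives, and proved far better suited to decision making, as shown by the statistics in the table below.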
| Accuracy (%) | 20-feature model | Classifier |
| --- | --- | --- |
| Overall | 61.5 | 69.8 |
| False Positives | 10.0 | 22.0 |
| False Negatives | 66.7 | 18.2 |
Conclusions: What I hope I have convinced you of is that a model's primary measure of goodness is its utility rather than its statistical measures. In this study we used the cross-validation predictions as a measure of model utility, and this proved to be a reasonable approach. Even after one has constructed the best possible model, it is rather dangerous to use models as a filter, especially in the later stages of the pipeline. The right approach is to use models to prioritize one's next experiments, in the belief that a “good” model will allow one to perform the right confirmatory experiment earlier. This was indeed true for the model described above: the cost savings were estimated at $425,000 per 10 leads selected.
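The difference between filtering and prioritizing can be made concrete with a minimal sketch. It assumes the model emits, for each lead, an estimated probability of being a penetrant; the article does not say that the deployed classifier did so. The names `Lead`, `hard_filter` and `prioritize`, the cutoff and all scores are hypothetical.

```python
from typing import List, NamedTuple


class Lead(NamedTuple):
    name: str            # hypothetical compound identifier
    p_penetrant: float   # assumed model output: estimated probability of B/P > 2


def hard_filter(leads: List[Lead], cutoff: float) -> List[Lead]:
    """Filter usage: leads scoring below the cutoff are never tested."""
    return [lead for lead in leads if lead.p_penetrant >= cutoff]


def prioritize(leads: List[Lead]) -> List[Lead]:
    """Prioritization usage: every lead stays in the queue, best-scored first."""
    return sorted(leads, key=lambda lead: lead.p_penetrant, reverse=True)


# Made-up leads and scores, for illustration only.
leads = [
    Lead("lead-A", 0.35),
    Lead("lead-B", 0.90),
    Lead("lead-C", 0.10),
    Lead("lead-D", 0.62),
]

kept = hard_filter(leads, cutoff=0.5)   # low-scoring leads are dropped for good
queue = prioritize(leads)               # all leads remain, only the order changes

print([lead.name for lead in kept])
print([lead.name for lead in queue])
```

With the filter, a model false negative is lost permanently. With the queue, it is merely tested later, which is why a prioritization scheme tolerates model error better.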