Data Visualization in Cheminformatics
Simon Xi
Computational Sciences CoE
Pfizer Cambridge

My Background
Professional Experience
Senior Principal Scientist, Computational Sciences CoE, Pfizer Cambridge
9 years of experience in pharmaceutical research, focused on
developing cheminformatics and bioinformatics applications for
research scientists

Education
MSc in Molecular Cell Biology, UT Dallas
MSc in Software Engineering, SMU
Finishing Ph.D. in Bioinformatics, Boston University

What we will cover today

Introduction to drug discovery


Cheminformatics basics
Encoding of the chemical structures
Visualizing data and structures
Design and optimization of compound library
A case study

The Billion Dollar Molecules


Drug Name    2006 Worldwide Sales    Primary Use
Lipitor      $14,385M                cholesterol
Nexium       $5,182M                 heartburn
Advair       $6,129M                 asthma
Prevacid     $3,425M                 heartburn
Plavix       $6,057M                 anticoagulant
Singulair    $3,579M                 asthma
Seroquel     $3,560M                 depression
Effexor      $3,722M                 depression
Norvasc      $4,866M                 hypertension

Lipitor: $14 billion in annual sales

Industry Productivity vs. Investment


The Challenge
[Figure: total R&D investment ($ billions, axis $0-$25) and number of NMEs (axis 0-60) by year, 1970-2000; series labeled "NME/$" and "# NMEs"]
Source: PhRMA annual survey, 2000
Nature Reviews Drug Discovery 3, 451-456 (2004)

~100 Discovery Approaches

Attrition in the R&D Process

[Figure: attrition funnel from Idea to Drug over 11-15 years. Labels: millions of compounds screened; Discovery; preclinical pharmacology; preclinical safety; Exploratory Development, Phase I; clinical pharmacology and safety; Full Development, Phase II, Phase III; 1-2 products]

Nat Rev Drug Discov. 2007 6:636-49.

What is Chemoinformatics?
The use of computational and information techniques
to solve a range of problems in chemistry.
These in silico techniques are widely used by
pharmaceutical companies in drug discovery.
Chemistry is a visual science, so data visualization is a
key component of cheminformatics.

Encoding Chemical Structures


SD format
[Figure: SD file for Lipitor, with the atom and bond blocks labeled]

SMILES format
Lipitor
CC(C)C1=C(C(=O)NC2=CC=CC=C2)C(C3=CC=CC=C3)=C(N1CCC(O)CC(O)CC(O)=O)C4=CC=C(F)C=C4
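
To make the SMILES encoding concrete, here is a minimal sketch that reads the Lipitor string above and counts heavy atoms per element. It is not a real SMILES parser: it assumes the string uses only the organic-subset symbols (no bracket atoms, charges, or isotopes), and the function name `count_heavy_atoms` is made up for illustration. A production workflow would normally use a cheminformatics toolkit instead.

```python
import re
from collections import Counter

# Two-letter symbols come first so "Cl" is not read as "C" followed by "l".
# Lowercase letters are aromatic atoms in SMILES.
ATOM_PATTERN = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")


def count_heavy_atoms(smiles):
    """Count non-hydrogen atoms per element in a bracket-free SMILES string."""
    if "[" in smiles:
        # Bracket atoms (charges, isotopes, explicit H) are out of scope here.
        raise ValueError("bracket atoms are not handled by this sketch")
    symbols = ATOM_PATTERN.findall(smiles)
    return Counter(symbol.capitalize() for symbol in symbols)


lipitor = (
    "CC(C)C1=C(C(=O)NC2=CC=CC=C2)C(C3=CC=CC=C3)"
    "=C(N1CCC(O)CC(O)CC(O)=O)C4=CC=C(F)C=C4"
)
counts = count_heavy_atoms(lipitor)
# C33 N2 O5 F1, consistent with atorvastatin's formula C33H35FN2O5
# once the implicit hydrogens are left out.
print(dict(counts))
```

Ring-closure digits, parentheses, and bond symbols are simply skipped, which is enough for counting but not for recovering connectivity.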

Representing Structure as Fingerprints

010 0 100 0 1001 00000 1 00

Compound Similarity Search
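
A similarity search over bit-string fingerprints can be sketched as follows. The slides do not name the similarity metric; the Tanimoto coefficient used here is a common choice and is an assumption, as are the function names, the example fingerprints, and the 0.5 threshold.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two equal-length bit strings of '0'/'1'."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    on_a = {i for i, bit in enumerate(fp_a) if bit == "1"}
    on_b = {i for i, bit in enumerate(fp_b) if bit == "1"}
    union = on_a | on_b
    if not union:
        return 0.0  # convention chosen here for two all-zero fingerprints
    return len(on_a & on_b) / len(union)


def similarity_search(query_fp, library, threshold=0.5):
    """Return (name, similarity) pairs at or above threshold, best first."""
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in library.items()]
    hits = [hit for hit in scored if hit[1] >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical 4-bit fingerprints, far shorter than real ones.
library = {"cmpd_a": "1100", "cmpd_b": "1000", "cmpd_c": "0011"}
hits = similarity_search("1100", library, threshold=0.5)
print(hits)
```

The coefficient is the number of bits set in both fingerprints divided by the number set in either, so identical fingerprints score 1 and fingerprints with no shared bits score 0.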

Compound Properties/Descriptors
1D, 2D, 3D, and multi-dimensional properties

1D: molecular weight, clogP, number of atoms, charge, number of H-bond donors and acceptors

2D: atom pairs, substructures, functional groups

3D: shape, pharmacophores

nD: fingerprints, etc.


Chemical series: compounds sharing
the same core structure

Series Classifications
Ward's Clustering

Iteratively merge pairs of nodes until all nodes are merged.
At each merging step, the two nodes whose merger gives the
minimal increase in variance are chosen and merged
into one new node.
Once the tree hierarchy is generated, clusters can be
defined by cutting the tree at a chosen
dissimilarity threshold.
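
The merging procedure described above can be sketched in plain Python. This is a small illustration, not an efficient implementation: it assumes compounds are already represented as numeric descriptor vectors, it measures the merge cost as the increase in within-cluster sum of squares, and it stops merging once the cheapest merge exceeds the threshold. Because Ward merge costs never decrease as the tree grows, stopping early gives the same clusters as building the whole tree and cutting it at that height. All names and data are hypothetical.

```python
from itertools import combinations


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    dist2 = sum((x - y) ** 2 for x, y in zip(a["centroid"], b["centroid"]))
    return a["n"] * b["n"] / (a["n"] + b["n"]) * dist2


def ward_clusters(points, threshold):
    """Agglomerative Ward clustering, cut at the given merge-cost threshold."""
    clusters = [
        {"n": 1, "centroid": list(p), "members": [i]}
        for i, p in enumerate(points)
    ]
    while len(clusters) > 1:
        # Pick the pair of clusters whose merge adds the least variance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: ward_cost(clusters[ij[0]], clusters[ij[1]]),
        )
        a, b = clusters[i], clusters[j]
        if ward_cost(a, b) > threshold:
            break  # equivalent to cutting the tree at this height
        n = a["n"] + b["n"]
        centroid = [
            (a["n"] * x + b["n"] * y) / n
            for x, y in zip(a["centroid"], b["centroid"])
        ]
        merged = {"n": n, "centroid": centroid,
                  "members": a["members"] + b["members"]}
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(c["members"]) for c in clusters)


# Two well-separated groups of hypothetical 2D descriptor vectors.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
print(ward_clusters(points, threshold=5.0))
```

Each pass rescans every pair of clusters, which is fine for a few hundred compounds but too slow for large libraries.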

What makes a drug?


Primary pharmacology
In vitro potency
Cell-based potency
Functional assays
Selectivity against other targets
Toxicity Properties
Inhibition of CYP450 isozymes
PXR transactivation
Human hepatocyte toxicity
Mutagenicity
Mitochondrial toxicity
Covalent protein binding
Inhibition of hERG

ADME/Physicochemical Properties
Solubility
Chemical stability
Hydrophobicity/hydrogen-bonding potential
Intestinal mucosal cell permeation
Liver and kidney clearance
Metabolism
Transporters
Charge
Size
Protein binding
Blood-brain barrier permeation
Target cell permeation

Drug-Likeness: Rule of Five


Proposed by C. Lipinski to describe drug-like molecules.
Molecules with good oral absorption and/or distribution
properties are likely to have the following characteristics:
Molecular weight ≤ 500
logP ≤ 5.0
H-bond donors (number of OH and NH groups) ≤ 5
H-bond acceptors (number of N and O atoms) ≤ 10
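
As a sketch of how the rule can be applied, the check below flags a compound for each limit it exceeds. It follows Lipinski's published cutoffs, which are inclusive (a value equal to the limit passes). The `Descriptors` class, the function name, and the two example compounds are hypothetical; the descriptor values are assumed to come from some upstream calculation.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    mol_weight: float  # molecular weight
    logp: float        # calculated logP
    h_donors: int      # number of OH and NH groups
    h_acceptors: int   # number of N and O atoms


def rule_of_five_violations(d):
    """Return a description of each Rule of Five limit the compound exceeds."""
    checks = {
        "molecular weight > 500": d.mol_weight > 500,
        "logP > 5": d.logp > 5,
        "H-bond donors > 5": d.h_donors > 5,
        "H-bond acceptors > 10": d.h_acceptors > 10,
    }
    return [name for name, violated in checks.items() if violated]


# Hypothetical compounds with made-up descriptor values.
small = Descriptors(mol_weight=320.4, logp=2.1, h_donors=2, h_acceptors=5)
large = Descriptors(mol_weight=612.7, logp=5.8, h_donors=3, h_acceptors=9)
print(rule_of_five_violations(small))
print(rule_of_five_violations(large))
```

Returning the list of violations rather than a single pass/fail flag makes it easy to show in a table view which property is responsible.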

Data Visualization
Grid View
Table View
Plot View

Heatmap View

Software Relevance
Software Usability
Software Management

Building Predictive Models Using Machine Learning Techniques

Use computational models to understand the Structure-Activity Relationship (SAR)

Use computational models to run virtual screens that guide compound selection for synthesis

Interpretability of Predictive Models


The good part

Can we derive this for non-linear models?

The not so good part

Multiple Parameter Optimization in Combinatorial Library Design
Given a 100x100x100 virtual library space and a set of
predictive models for various properties (e.g. potency,
ADME, selectivity), select the best 300 compounds for
synthesis: those with the highest probability of being potent and
drug-like, while sampling the chemical space diversely.
[Structure: diaminopyrimidine core with substituents R1, R2, and R3]

For example, a diaminopyrimidine library
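
The selection problem stated above (100 x 100 x 100 = 1,000,000 virtual compounds, 300 to synthesize) can be sketched as a greedy pick over the enumerated library. This is only one simple approach, not the method used in the talk: the scoring function stands in for the combined predictive models, and capping how often each reagent may be reused is a crude, assumed stand-in for diverse sampling. All names are hypothetical.

```python
from collections import Counter
from itertools import product


def select_compounds(r1, r2, r3, score, n_select, max_per_reagent):
    """Greedily pick top-scoring combinations while limiting reagent reuse."""
    # Enumerate the full virtual library and rank it by the combined score.
    ranked = sorted(product(r1, r2, r3), key=score, reverse=True)
    used = Counter()
    picked = []
    for combo in ranked:
        # A reagent is identified by its position (R1/R2/R3) and its identity.
        keys = list(enumerate(combo))
        if any(used[key] >= max_per_reagent for key in keys):
            continue  # this reagent is already over-represented
        picked.append(combo)
        used.update(keys)
        if len(picked) == n_select:
            break
    return picked


# Toy 3x3x3 library; the "score" is just the sum of reagent indices.
reagents = [0, 1, 2]
picked = select_compounds(reagents, reagents, reagents,
                          score=sum, n_select=3, max_per_reagent=1)
print(picked)
```

For the full-size problem the call would use three lists of 100 reagents with `n_select=300`; sorting a million tuples is still practical, but larger spaces would need sampling rather than full enumeration.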

The Problem of Multiple Parameter Optimization

The chemical space is huge

Predictive models are not very predictive

There are many parameters to optimize, and they sometimes contradict each other

MPO: a case study with kinase selectivity

~200 cmpds from a library were tested against 40
kinases. Can we design another 100 cmpds that
are highly selective?

[Structure: trifluoro-diaminopyrimidine core with substituents R1 and R2]
Trifluoro-diaminopyrimidine
series (~200 cmpds)

Identify compounds with the desired selectivity
profile in the expanded virtual chemical space

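Virtual Library Profile

[Figure: workflow from tested compounds (R1 x R2 matrix) through FW model building and enumeration to the predictable virtual chemical space (R1 x R2 matrix)]

Tested compounds
FW model building: solving R-group contributions using linear regression
Enumeration: 5-50x expansion
Predictable Virtual Chemical Space

Only a few R-group/kinase combinations have been previously tested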
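
Solving R-group contributions with linear regression can be sketched as an additive model: activity = baseline + contribution(R1) + contribution(R2). "FW" on these slides is read here as Free-Wilson analysis, which is an assumption, as are the function names and the toy pIC50 values. One such model would be fitted per kinase. The first R-group at each position is used as the reference level and absorbed into the baseline.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("some R-group is not determined by the data")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_free_wilson(observations):
    """Fit an additive R1 + R2 model; observations are ((r1, r2), pIC50)."""
    r1_levels = sorted({key[0] for key, _ in observations})
    r2_levels = sorted({key[1] for key, _ in observations})
    # One indicator column per non-reference R-group, plus the intercept.
    columns = [(0, g) for g in r1_levels[1:]] + [(1, g) for g in r2_levels[1:]]

    def encode(r1, r2):
        if r1 not in r1_levels or r2 not in r2_levels:
            raise KeyError("R-group was never tested, cannot predict")
        groups = (r1, r2)
        return [1.0] + [1.0 if groups[pos] == g else 0.0 for pos, g in columns]

    rows = [encode(*key) for key, _ in observations]
    ys = [y for _, y in observations]
    p = len(columns) + 1
    # Least squares via the normal equations (A^T A) x = A^T y.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p)]
    coef = solve_linear(ata, aty)
    return lambda r1, r2: sum(c * x for c, x in zip(coef, encode(r1, r2)))


# Toy data: three tested compounds out of a 2x2 matrix of R-groups.
tested = [(("A", "X"), 6.0), (("B", "X"), 7.0), (("A", "Y"), 6.5)]
predict = fit_free_wilson(tested)
print(predict("B", "Y"))  # untested combination, predicted from the parts
```

The prediction for an untested combination is only as good as the additivity assumption, and an R-group that never appears in a tested compound cannot be predicted at all.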

Predictive models - Leave-One-Out Validations
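
Leave-one-out validation can be sketched as follows: each compound is held out in turn, the model is refitted on the rest, and the held-out value is predicted. The one-variable line fit is only a placeholder for the real model, and reporting r2 as the squared Pearson correlation between observed and predicted values is an assumption about how the slides compute it. Function names and data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line; returns a prediction function."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def leave_one_out(xs, ys, fit=fit_line):
    """Predict each point from a model fitted on all the other points."""
    predictions = []
    for i in range(len(xs)):
        model = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        predictions.append(model(xs[i]))
    return predictions


def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov ** 2 / (var_o * var_p)


# Hypothetical descriptor values and noisy activities.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9]
loo_predictions = leave_one_out(xs, ys)
print(round(r_squared(ys, loo_predictions), 3))
```

Because every prediction comes from a model that never saw that compound, the resulting r2 is a more honest estimate than the fit on the training data itself.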

Experimental Validation of Predictions

KSS pIC50 vs. FW pIC50
~40 cmpds in two series were selected for KSS testing

[Figure: scatter-plot panels of KSS pIC50 vs. FW pIC50, annotated "More promiscuous" and "More selective". Panel r2 values as extracted: 0.45, 0.59, 0.92, 0.86, 0.74, 0.83, 0.63, 0.88, 0.85, 0.81, 0.81, 0.85]

Cheminformatics Challenges for Drug Discovery
Information retrieval and knowledge management - rapidly and
efficiently present all relevant data/knowledge to scientists at
the right time and in the right place
Predictive models - drastically improve the accuracy and
interpretability of in silico models for potency and ADME
endpoints
Computer-aided design - provide easy-to-use software
applications that help scientists analyze/visualize their data and
make efficient use of prior knowledge during compound
design
