Molecular Descriptors and Virtual Screening Using Datamining Approach

Aim of Cheminformatics Project
To screen molecules interacting with the potential TB targets using classifiers.
To dock the selected molecules with the targets to further screen them for leads.
To use cheminformatics techniques such as QSAR, 3D-QSAR, and ADMET to look for potential leads, and to design drugs from those leads by building combinatorial libraries.
Tuberculosis
Obstacles For Drug Design
The HIV epidemic has dramatically increased the risk of developing active TB.
Increasing emergence of multi-drug-resistant TB (MDR-TB).
Emergence of extensively drug-resistant (XDR) TB strains.
XDR-TB is characterized by resistance to at least the two first-line drugs rifampicin and isoniazid, and additionally to a fluoroquinolone and an injectable drug such as kanamycin.
Existing TB drugs are only able to target actively growing bacteria, through the inhibition of cell processes such as cell wall biogenesis and DNA replication.
TB chemotherapy is characterized by efficient bactericidal activity but extremely weak sterilizing activity, i.e. an inability to kill slowly growing and slowly metabolizing strains.
Drugs Currently in Development
Expected timelines towards approval of candidate drugs currently in the clinical stage of development
(Sources: Global TB Alliance Annual Report 2004-2005; Stop TB Partnership Working Group on New Drugs for TB, Strategic Plan 2006-2015)
Commonly Used TB Drugs and Targets
Main Properties of Anti-TB Drugs
QSAR and Drug Design
Compounds + biological activity -> QSAR -> New compounds with improved biological activity
What is QSAR?
QSAR is a mathematical relationship between the biological activity of a molecular system and its geometric and chemical characteristics.
A general formula for a quantitative structure-activity relationship (QSAR) is:
activity = f(molecular or fragmental properties)
QSAR attempts to find a consistent relationship between biological activity and molecular properties, so that these rules can be used to evaluate the activity of new compounds.
Molecule Properties
SPC: Structure Property Correlation
MOLECULE STRUCTURE
INTRINSIC PROPERTIES: molar volume, connectivity indices, charge distribution, molecular weight, polar surface area, ...
CHEMICAL PROPERTIES: pKa, log P, solubility, stability
BIOLOGICAL PROPERTIES: activity, toxicity, biotransformation, pharmacokinetics
Molecule Descriptors
o Molecular descriptors are numerical values that characterize properties of molecules.
o The descriptors fall into four classes:
a) Topological
b) Geometrical
c) Electronic
d) Hybrid or 3D Descriptors
Classification of Descriptors
Topological Descriptors
Topological descriptors are derived directly from the connection table representation of the structure. They include:
a) atom and bond counts
b) substructure counts
c) molecular connectivity indices (Wiener index, Randic index, chi index)
d) kappa indices
e) path descriptors
f) distance-sum connectivity
g) molecular symmetry
Geometrical Descriptors
Geometrical descriptors are derived from the three-dimensional representations and include:
a) principal moments of inertia
b) molecular volume
c) solvent-accessible surface area
d) charged partial surface area
e) molecular surface area
Electronic Descriptors
Electronic descriptors characterize the molecular structures with quantities such as:
a) dipole moment
b) quadrupole moment
c) polarizability
d) HOMO and LUMO energies
e) dielectric energy
f) molar refractivity
Hybrid and 3D Descriptors

a) geometric atom pairs and topological torsions
b) spatial autocorrelation vectors
c) WHIM indices
d) BCUTs
e) GETAWAY descriptors
f) topomers
g) pharmacophore fingerprints
h) EVA descriptors
i) descriptors of molecular field
Limit Of Descriptors
The data set should contain at least 5 times as many compounds as descriptors in the QSAR.
The reason is that too few compounds relative to the number of descriptors will give a falsely high correlation:
2 points exactly determine a line.
3 points exactly determine a plane (etc.)
A data set of drug candidates that is similar in size to the number of descriptors gives a meaningless correlation.
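The warning above can be checked numerically. The following is a minimal sketch, assuming a single descriptor and purely random (hence meaningless) activity values; the function names and the sample sizes are illustrative and do not come from the slides.

```python
import random

def r_squared(xs, ys):
    """Squared correlation coefficient of a one-descriptor least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)

def random_dataset(n_compounds, rng):
    """Random descriptor values and random 'activities' with no real relationship."""
    xs = [rng.random() for _ in range(n_compounds)]
    ys = [rng.random() for _ in range(n_compounds)]
    return xs, ys

rng = random.Random(42)
# Two compounds, one descriptor: the fitted line passes through both points.
r2_two_points = r_squared(*random_dataset(2, rng))
# Fifty compounds, one descriptor: the lack of a relationship becomes visible.
r2_many_points = r_squared(*random_dataset(50, rng))
print(r2_two_points, r2_many_points)
```

With two compounds the fit is perfect even though the data are noise, which is exactly the falsely high correlation the slide warns about.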
Freely Available Tools to Calculate Molecular Descriptors
CDK descriptor tool: http://rguha.net/code/java/cdkdesc.html
PowerMV: http://nisla05.niss.org/PowerMV/?q=PowerMV/
Mold2: http://www.fda.gov/ScienceResearch/BioinformaticsTools/Mold2/default.htm
PaDEL-Descriptor: http://www.downv.com/Windows/installPaDEL- Descriptor-10439915.htm
ADMET Descriptors to Screen Molecules
Bioavailability
The bioavailability of a compound is classified as:
[Diagram relating bioavailability to: absorption, liver metabolism, gut-wall metabolism, permeability, solubility, transporters, lipophilicity, hydrogen bonding, molecular size/shape, flexibility]
PREDICTION OF ADMET PROPERTIES
Requirements for a drug:
Must bind tightly to the biological target in vivo
Must pass through one or more physiological barriers (cell membrane or blood-brain barrier)
Must remain long enough to take effect
Must be removed from the body by metabolism, excretion, or other means
ADMET: Absorption, Distribution, Metabolism, Excretion (Elimination), Toxicity
Lipinski Rule of Five (Oral Drug Properties)
Poor absorption or permeation is more likely when:
MW > 500
LogP > 5
More than 5 H-bond donors (sum of OH and NH groups)
More than 10 H-bond acceptors (sum of N and O atoms)
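The four criteria above translate directly into a filter. This is a minimal sketch that assumes the descriptor values have already been computed by some other tool; the function name and the two example molecules are hypothetical.

```python
def lipinski_violations(mol_weight, log_p, h_bond_donors, h_bond_acceptors):
    """Return the list of Rule of Five criteria that a molecule violates."""
    violations = []
    if mol_weight > 500:
        violations.append("MW > 500")
    if log_p > 5:
        violations.append("LogP > 5")
    if h_bond_donors > 5:
        violations.append("more than 5 H-bond donors")
    if h_bond_acceptors > 10:
        violations.append("more than 10 H-bond acceptors")
    return violations

# Hypothetical descriptor values, for illustration only.
small_molecule = lipinski_violations(350.4, 2.1, 2, 5)
large_molecule = lipinski_violations(720.9, 6.3, 7, 12)
print(small_molecule)
print(large_molecule)
```

Returning the list of violated criteria, rather than a single pass/fail flag, keeps the reason for rejecting a molecule visible during screening.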
Polar Surface Area
o Defined as the amount of molecular (van der Waals) surface arising from polar atoms (nitrogen and oxygen atoms together with their attached hydrogens).
o PSA seems to optimally encode those drug properties which play an important role in membrane penetration: molecular polarity, H-bonding features, and also solubility.
o It provides excellent correlations with transport properties of drugs. (PSA is used in the prediction of oral absorption, brain penetration, intestinal absorption, and Caco-2 permeability.)
It has also been used effectively to characterize drug-likeness during virtual screening and combinatorial library design.
The calculation of PSA, however, is rather time-consuming, because of the need to generate a reasonable 3D molecular geometry and to calculate the surface itself.
Peter Ertl introduced an extremely rapid method to obtain the PSA descriptor simply from the sum of contributions of polar fragments in a molecule, without the need to generate its three-dimensional (3D) geometry.
PSA in Intestinal Absorption
Intestinal absorption is usually expressed as fraction absorbed (FA), the percentage of the initial dose appearing in the portal vein.
A PSA model was built for the β-adrenoreceptor antagonists [1]. An excellent sigmoidal relationship between PSA and FA after oral administration was obtained. Similar sigmoidal relationships can also be obtained for the topological PSA (TPSA).
These results suggest that drugs with a PSA < 60 Å² are completely (more than 90%) absorbed, whereas drugs with a PSA > 140 Å² are absorbed to less than 10%. This conclusion was later confirmed by the correct classification of a set of endothelin receptor antagonists as having low, intermediate, or high permeability.
PSA was also shown to play an important role in explaining human in vivo jejunum permeability [2]. A model based on PSA and LogP for the prediction of drug absorption was developed for 199 well absorbed and 35 poorly absorbed compounds [3].
PSA in Blood-Brain Barrier (BBB) Penetration
Drugs that act on the CNS need to be able to cross the BBB in order to reach their target, while minimal BBB penetration is required for other drugs to prevent CNS side effects.
A common measure of BBB penetration is the ratio of drug concentrations in the brain and the blood, expressed as log(Cbrain/Cblood).
Van de Waterbeemd and Kansy were probably the first to correlate the PSA of a series of CNS drugs with their membrane transport. They obtained a fair correlation of brain uptake with single-conformer PSA and molecular volume descriptors.
Clark et al. derived a model for 55 compounds using TPSA and LogP.
TPSA alone:
LogBB = 0.516 - 0.115 * TPSA
n = 55, r² = 0.686, r = 0.828, σ = 0.42
TPSA in combination with ClogP:
LogBB = 0.070 - 0.014 * TPSA + 0.169 * ClogP
n = 55, r² = 0.787, r = 0.887, σ = 0.35
The great majority of orally administered CNS drugs have a PSA < 70 Å². Analysis of non-CNS compounds suggested that these have a PSA < 120 Å².
Thus, to conclude, the majority of non-CNS-penetrating, orally absorbed compounds have PSA values between 70 and 120 Å².
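The two-descriptor model quoted above is easy to evaluate for a candidate molecule. This is a minimal sketch that uses the coefficients exactly as they appear on the slide; the function name and the example TPSA and ClogP values are hypothetical.

```python
def log_bb(tpsa, clogp):
    """LogBB estimate from the two-descriptor model quoted on the slide.

    tpsa  -- topological polar surface area in square Angstroms
    clogp -- calculated octanol/water partition coefficient
    """
    return 0.070 - 0.014 * tpsa + 0.169 * clogp

# Hypothetical descriptor values, for illustration only.
low_psa = log_bb(tpsa=40.0, clogp=3.0)
high_psa = log_bb(tpsa=110.0, clogp=3.0)
print(low_psa, high_psa)
```

At fixed ClogP the estimate falls as TPSA rises, in line with the PSA thresholds stated for CNS and non-CNS drugs.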
Partition coefficients
The partition coefficient P (usually expressed as log10 P or logP) is defined as:
P = [X]octanol / [X]aqueous
P is a measure of the relative affinity of a molecule for the lipid and aqueous phases in the absence of ionisation.
1-Octanol is the most frequently used lipid phase in pharmaceutical research, because:
It has a polar and a non-polar region (like a membrane phospholipid)
Po/w is fairly easy to measure
Po/w often correlates well with many biological properties
It can be predicted fairly accurately using computational models
Calculation of logP
LogP for a molecule can be calculated from a sum of fragmental or atom-based terms plus various corrections.
logP = Σ fragments + Σ corrections
[Figure: structure of phenylbutazone]
clogP for Windows output
C: 3.16 M: 3.16 PHENYLBUTAZONE
Class      | Type   | Log(P) Contribution Description  | Value
FRAGMENT   | # 1    | 3,5-pyrazolidinedione            | -3.240
ISOLATING  | CARBON | 5 Aliphatic isolating carbon(s)  |  0.975
ISOLATING  | CARBON | 12 Aromatic isolating carbon(s)  |  1.560
EXFRAGMENT | BRANCH | 1 chain and 0 cluster branch(es) | -0.130
EXFRAGMENT | HYDROG | 20 H(s) on isolating carbons     |  4.540
EXFRAGMENT | BONDS  | 3 chain and 2 alicyclic (net)    | -0.540
RESULT     | 2.11   | All fragments measured           | clogP 3.165
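The reported clogP is the plain sum of the six contributions listed in the output above. The short sketch below only re-adds the values from the slide; the variable names are illustrative.

```python
# Contributions copied from the clogP output for phenylbutazone.
contributions = [
    ("3,5-pyrazolidinedione fragment", -3.240),
    ("5 aliphatic isolating carbons", 0.975),
    ("12 aromatic isolating carbons", 1.560),
    ("1 chain and 0 cluster branches", -0.130),
    ("20 H on isolating carbons", 4.540),
    ("3 chain and 2 alicyclic bonds (net)", -0.540),
]

# logP = sum of fragment terms + sum of correction terms
clogp = sum(value for _, value in contributions)
print(round(clogp, 3))
```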
What else does logP affect?
logP affects:
Binding to enzyme / receptor
Aqueous solubility
Binding to P450 metabolising enzymes
Absorption through membrane
Binding to blood / tissue proteins (less drug free to act)
Binding to hERG heart ion channel (cardiotoxicity risk)
So log P needs to be optimised.
ADMET Descriptors Calculation Tools
PreADMET: http://preadmet.bmdrc.org/
Molecular descriptors calculation: 1081 diverse molecular descriptors
Drug-likeness prediction: Lipinski rule, lead-like rule, Drug DB-like rule
ADME prediction: Caco-2, MDCK, BBB, HIA, plasma protein binding, and skin permeability data
Toxicity prediction: Ames test and rodent carcinogenicity assay
SPARC Online Calculator: http://ibmlc2.chem.uga.edu/sparc/
SPARC on-line calculator for prediction of pKa, solubility, polarizability, and other properties; search in the database of experimental pKa values is also available
Daylight Chemical Information Systems: www.daylight.com/daycgi/clogp
Calculation of log P by the CLOGP algorithm from BioByte; also access to the LOGPSTAR database of experimental log P data.
ADMET Tools Continued...
Molinspiration Cheminformatics: www.molinspiration.com/services/index.
Calculation of molecular properties relevant to drug design and QSAR, including log P, polar surface area, Rule of Five parameters, and drug-likeness index
Pirika: www.pirika.com
Calculation of various types of molecular properties, including boiling point, vapor pressure, and solubility; web demo restricted to aliphatic molecules only
Actelion: www.actelion.com/page/property_explorer
Calculation of molecular weight, logP, solubility, drug-score, and toxicity risk
Virtual Computational Chemistry Laboratory: www.vcclab.org
Prediction of log P and water solubility based on associative neural networks, as well as other parameters; comparison of various prediction methods
Virtual Screening
Ways to Assess Structures from a Virtual Screening Experiment
Use a previously derived mathematical model that predicts the biological activity of each structure.
Run substructure queries to eliminate molecules with undesirable functionality.
Use a docking program to identify structures predicted to bind strongly to the active site of a protein (if the target structure is known).
Filters remove unwanted structures in a succession of screening methods.
Main Classes of Virtual Screening Methods
The choice depends on the amount of structural and bioactivity data available:
One active molecule known: perform a similarity search (ligand-based virtual screening).
Several active molecules known: try to identify a common 3D pharmacophore, then do a 3D database search.
Reasonable number of active and inactive structures known: train a machine learning technique.
3D structure of the protein known: use protein-ligand docking.
STRUCTURE-BASED VIRTUAL SCREENING
Protein-Ligand Docking
Aims to predict the 3D structure formed when a molecule docks to a protein.
Needs a way to explore the space of possible protein-ligand geometries (poses).
Needs to score or rank the poses, such that the score reflects the binding affinity of the ligand, in order to identify the most likely binding mode and assign a priority to the molecules.
Problem: involves many degrees of freedom (rotation, conformation) and solvent effects.
Conformations of ligands in complexes often have very similar geometries to minimum-energy conformations of the isolated ligand.
Protein-Ligand Docking Methods
Modern methods explore orientational and conformational degrees of freedom at the same time:
Monte Carlo algorithms (change the conformation of the ligand, or subject the molecule to a translation or rotation within the binding site)
Genetic algorithms
Incremental construction approaches
Distinguish Docking and Scoring
Docking involves the prediction of the binding mode of individual molecules.
Goal: identify the orientation closest in geometry to the observed X-ray structure.
Scoring ranks the ligands using some function related to the free energy of association of the two units.
The DOCK function looks at atom pairs separated by 2.3-3.5 Angstroms.
A pair-wise linear potential looks at attractive and repulsive regions, taking into account steric and hydrogen-bonding interactions (e.g. MolDock).
Structure-Based Virtual Screening: Other Aspects
Computationally intensive and complex.
A multitude of possible parameters figure into docking programs.
Docking programs require a 3D conformation as the starting point, or require partial atomic charges for protein and ligand.
X-ray crystallographic structures usually do not include hydrogens, but most docking programs require them.
Ligand Based Virtual Screening

The ligand-based approach mainly uses pharmacophore maps and QSAR to identify or modify a lead in the absence of a known three-dimensional structure of the receptor. It requires experimental affinities and molecular properties for a set of active compounds whose chemical structures are known.
a) PHARMACOPHORE: A pharmacophore is an explicit geometric hypothesis of the critical features of a ligand. Standard features include H-bond donors and acceptors, charged groups, and hydrophobic patterns. The hypothesis can be used to screen databases for compounds and to refine existing leads.
For a geometric alignment of the functional groups of the leads, it is necessary to specify the conformations that individual compounds adopt in their bound state.
Since the simple presence of a pharmacophoric fingerprint is not sufficient for predicting activity, inactive compounds possessing the required pharmacophoric features must also be considered.
By comparing the volumes of the active and the inactive compounds, a common volume can be constructed to approximate the shape of the (unknown) receptor site, to further refine the pharmacophore model, and to screen out additional compounds.
[Figure: Pharmacophore Modelling Workflow. Stages shown: 3D compound structures, set of conformers, feature analysis, align to template, compare, pharmacophore validation, application.]
b) QSAR: The goal of QSAR studies is to predict the activity of new compounds based solely on their chemical structure. The underlying assumption is that the biological activity can be attributed to incremental contributions of the molecular fragments that determine it. This assumption is called the linear free energy principle. Information about the strength of interactions is captured for each compound by, for example, steric, electronic, and hydrophobic descriptors.
Molecular similarity and searching Molecules

What is it?
Chemical, pharmacological or biological properties of two compounds
match.
The more the common features, the higher the similarity between two
molecules.
Chemical
The two structures on top are chemically similar to each other. This is reflected in their
common sub-graph, or scaffold: they share 14 atoms
Pharmacophore
The two structures above are less similar chemically (topologically) yet have the same
pharmacological activity, namely they both are Angiotensin-Converting Enzyme (ACE)
inhibitors
Molecular similarity
How to calculate it?
Quantitative assessment of the similarity/dissimilarity of structures needs a numerically tractable form: molecular descriptors, fingerprints, structural keys.
These are sequences/vectors of bits, or numeric values, that can be compared by distance functions and similarity metrics.
E = Euclidean distance
T = Tanimoto index
$$E(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
$$T(x, y) = \frac{B(x \,\&\, y)}{B(x) + B(y) - B(x \,\&\, y)}$$
where B(v) is the number of bits set in v and & is the bitwise AND.
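Both measures can be written in a few lines. This is a minimal sketch, assuming numeric descriptor vectors are stored as Python lists and bit fingerprints as Python integers; the function names and the storage choice are assumptions made for illustration.

```python
import math

def euclidean(x, y):
    """Euclidean distance E(x, y) between two numeric descriptor vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def bit_count(v):
    """B(v): number of bits set in the fingerprint v."""
    return bin(v).count("1")

def tanimoto(x, y):
    """Tanimoto index T(x, y) of two fingerprints stored as integers."""
    common = bit_count(x & y)
    union = bit_count(x) + bit_count(y) - common
    # Two empty fingerprints are treated as identical (a convention chosen here).
    return common / union if union else 1.0

print(euclidean([0.0, 0.0], [3.0, 4.0]))
print(tanimoto(0b1100, 0b1010))
```

Note the opposite directions: a Euclidean distance of 0 means identical vectors, while a Tanimoto index of 1 means identical fingerprints.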
Molecular descriptors
a) chemical fingerprint
hashed binary fingerprint
o encodes topological properties of the chemical graph: connectivity, edge label (bond type), node label (atom type)
o allows the comparison of two molecules with respect to their chemical structure
Construction
1. find all 0, 1, ..., n step walks in the chemical graph
2. generate a bit array for each walk with a given number of bits set
3. merge the bit arrays with the logical OR operation
Example 1: chemical fingerprint
Example: CH3-CH2-OH, walks from the first carbon atom
length | walk | bit array
0      | C    | 1010000000
1      | CH   | 0001010000
1      | CC   | 0001000100
2      | CCH  | 0001000010
2      | CCO  | 0100010000
3      | CCOH | 0000011000
merged bit array for the first carbon atom: 1111011110
This example illustrates how a 10-bit topological chemical fingerprint is created for a simple chain structure. All walks of up to 3 steps are considered, and 2 bits are set for each pattern.
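The merge step of the ethanol example can be reproduced directly from the bit arrays on the slide. The sketch below does that, and also shows one possible way to turn a walk into a bit array; the SHA-1 based `walk_bits` function is an assumption for illustration and is not the hash function that produced the slide's values.

```python
import hashlib

def merge(bit_arrays):
    """Merge bit arrays (given as strings) with the logical OR operation."""
    width = len(bit_arrays[0])
    merged = 0
    for bits in bit_arrays:
        merged |= int(bits, 2)
    return format(merged, "0{}b".format(width))

def walk_bits(walk, n_bits=10, bits_per_walk=2):
    """Hypothetical hashing step: map a walk label to at most 2 set bits."""
    digest = hashlib.sha1(walk.encode("ascii")).digest()
    mask = 0
    for k in range(bits_per_walk):
        mask |= 1 << (digest[k] % n_bits)   # two hash bytes may collide on one bit
    return format(mask, "0{}b".format(n_bits))

# Bit arrays for the walks C, CH, CC, CCH, CCO, CCOH, as given on the slide.
slide_arrays = ["1010000000", "0001010000", "0001000100",
                "0001000010", "0100010000", "0000011000"]
fingerprint = merge(slide_arrays)
print(fingerprint)
```

Because several walks can set the same bit, a set bit indicates only that a pattern may be present, which is why hashed fingerprints are used for similarity and pre-screening rather than exact substructure matching.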
Molecular Similarity
Example 1: chemical fingerprint
0100010100010100010000000001101010011010100000010100000000100000
0100010100010100010000000001101010011010100000000100000000100000
Example 2: pharmacophore fingerprint
encodes pharmacophore properties of molecules as frequency counts of pharmacophore point pairs at a given topological distance
allows the comparison of two molecules with respect to their pharmacophore
Construction
1. map pharmacophore point types to atoms
2. calculate the length of the shortest path between each pair of atoms
3. assign a histogram to every pharmacophore point pair and count the frequency of the pair with respect to its distance
Example 2: pharmacophore fingerprint
Pharmacophore point type based
coloring of atoms: acceptor, donor,
hydrophobic, none.
[Figure: pharmacophore fingerprint histograms for two molecules. x-axis: pharmacophore point pairs AA, DA, DD, HA, HD, HH (A = acceptor, D = donor, H = hydrophobic) at topological distances 1 to 6; y-axis: frequency count.]
Virtual screening using fingerprints

Individual query structure
0101010100010100010100100000000000010010000010010100100100010000
(query fingerprint)
The proximity of the query to each of the following targets is computed:
0000000100001101000000101010000000000110000010000100001000001000
0100010110010010010110011010011100111101000000110000000110001000
0100010100011101010000110000101000010011000010100000000100100000
0001101110011101111110100000100010000110110110000000100110100000
0100010100110100010000000010000000010010000000100100001000101000
0100011100011101000100001011101100110110010010001101001100001000
0101110100110101010111111000010000011111100010000100001000101000
0100010100111101010000100010000000010010000010100100001000101000
0001000100010100010100100000000000001010000010000100000100000000
0100010100010011000000000000000000010100000010000000000000000000
0100010100010100000000000000101000010010000000000100000000000000
0101010101111100111110100000000000011010100011100100001100101000
0100010100011000010000011000000000010001000000110000000001100000
0000000100000000010000100000000000001010100000000100000100100000
0100010100010100000000100000000000010000000000000100001000011000
0001000100001100010010100000010100101011100010000100001000101000
0100011100010100010000100001001110010010000010001100000000101000
0101010100010100010100100000000000010010000010010100100100010000
(target fingerprints)
The targets closest to the query are returned as hits.
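The screening scheme above can be sketched as a ranking of target fingerprints by their Tanimoto similarity to the query. The query and the three targets are copied from the slide; the function names and the 0.5 threshold are assumptions chosen for illustration.

```python
def tanimoto(a, b):
    """Tanimoto index of two fingerprints given as '0'/'1' strings."""
    x, y = int(a, 2), int(b, 2)
    common = bin(x & y).count("1")
    union = bin(x).count("1") + bin(y).count("1") - common
    return common / union if union else 1.0

def screen(query, targets, threshold):
    """Return (similarity, target) pairs at or above the threshold, best first."""
    scored = [(tanimoto(query, target), target) for target in targets]
    return sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)

query = "0101010100010100010100100000000000010010000010010100100100010000"
targets = [
    "0001000100010100010100100000000000001010000010000100000100000000",
    "0100010100010011000000000000000000010100000010000000000000000000",
    "0101010100010100010100100000000000010010000010010100100100010000",
]

# The threshold of 0.5 is a hypothetical cutoff, not a value from the slides.
for similarity, target in screen(query, targets, threshold=0.5):
    print(round(similarity, 3), target)
```

In practice the threshold trades recall against the number of compounds passed on to the more expensive screening steps.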
Hypothesis Fingerprints
Advantages
strict conditions for hits if actives are fairly similar
captures common features of more diverse active sets
specific treatment of the absence of a feature
less sensitive to outliers
Disadvantages
false results with asymmetric metrics
misses common features of highly diverse sets
very sensitive to one missing feature
less selective if actives are very similar
SUMMARY
Virtual screening methods are central to many cheminformatics problems in design, selection, and analysis.
Increasing numbers of molecules can be evaluated using these techniques.
Reliability and accuracy remain problems in docking and in predicting ADMET properties.
Much more reliable and consistent experimental data are needed.
Datamining and Machine Learning Approaches to Virtual Screening
Idea of Datamining
Data mining is the discovery of patterns in data, for example:
a) a hunter looks for patterns in animal migration behavior.
b) farmers seek patterns in crop growth.
c) politicians seek patterns in voter opinion.
d) patterns in compound structures.
The patterns that are discovered must be meaningful and lead to some advantage.
The process must be automatic or semiautomatic.
Canonical Learning Problems
Supervised learning: given examples of inputs and corresponding desired outputs, predict outputs on future inputs.
a) Classification
b) Regression
c) Time series prediction
Unsupervised learning: given only inputs, automatically discover representations, features, structure, etc.
a) Clustering
b) Outlier detection
c) Compression
Datamining Methods
Substructural Analysis
Each substructural fragment is assumed to make a contribution to activity irrespective of the other fragments in the molecule. The idea is to derive a weight for each fragment that reflects its tendency to occur in active or inactive molecules. The sum of the weights of its fragments gives the score of a molecule, which enables a new set of structures to be ranked in decreasing probability of activity.
One commonly used form of the weight is:
$$w_i = \frac{act(i)}{act(i) + inact(i)}$$
where act(i) is the number of active molecules that contain the i-th fragment and inact(i) is the number of inactive molecules that contain the i-th fragment.
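The fragment weighting described above can be sketched as follows, assuming the simple and commonly used form w(i) = act(i) / (act(i) + inact(i)); the equation on the original slide did not survive extraction, so this form is an assumption. The fragment names and the training data are hypothetical.

```python
def fragment_weights(training_set):
    """Weight per fragment: act(i) / (act(i) + inact(i)).

    training_set is a list of (fragments, is_active) pairs, where fragments
    is the set of substructural fragments present in one molecule.
    """
    act, inact = {}, {}
    for fragments, is_active in training_set:
        counts = act if is_active else inact
        for fragment in fragments:
            counts[fragment] = counts.get(fragment, 0) + 1
    all_fragments = set(act) | set(inact)
    return {f: act.get(f, 0) / (act.get(f, 0) + inact.get(f, 0))
            for f in all_fragments}

def score(fragments, weights):
    """Score of a molecule: sum of the weights of its fragments."""
    return sum(weights.get(f, 0.0) for f in fragments)

# Hypothetical training molecules, for illustration only.
training_set = [
    ({"nitro", "phenyl"}, True),
    ({"nitro"}, True),
    ({"phenyl"}, False),
    ({"amide", "phenyl"}, False),
]
weights = fragment_weights(training_set)
new_molecules = [{"amide"}, {"nitro", "phenyl"}, {"phenyl"}]
ranked = sorted(new_molecules, key=lambda m: score(m, weights), reverse=True)
print(ranked)
```

Fragments never seen in training contribute zero here; that is a design choice of this sketch, and other treatments are possible.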
Discriminant Algorithms
The aim of discriminant analysis is to separate the molecules into their constituent classes.
The simplest is linear discriminant analysis, which, in the case of two activity classes and two descriptors, aims to find a straight line that separates the data such that the maximum number of compounds are correctly classified.
If more than two variables are used, the line becomes a hyperplane.
The idea is to express a class as a linear combination of attributes:
X = w0 + w1*a1 + w2*a2 + w3*a3 + ...
X = class; a1, a2, ... = attributes; w1, w2, ... = weights
Neural Networks (NN)
The two neural network architectures most commonly used in chemistry are feed-forward networks and Kohonen networks.
The feed-forward NN is a supervised learning method, as it uses the values of dependent variables to derive the model. The Kohonen network, or self-organizing map (SOM), is an unsupervised method.
The feed-forward NN contains layers of nodes, with connections between all pairs of nodes in adjacent layers. A key feature is the presence of hidden nodes, which together with the back-propagation algorithm makes the network applicable to many fields.
The neural network must first be trained with a set of inputs. Once trained, it can be used to predict values for new and unseen molecules.
Neural Networks Continued...
The figure shows a feed-forward network with 3 hidden nodes and one output.
A Kohonen NN consists of a rectangular array of nodes, each of which has an associated vector that corresponds to the input data (descriptor values).
The data are presented to the network one molecule at a time, and the distance between each node vector and the molecule vector is determined with a distance metric. The node with the minimum distance becomes the winning node.
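Selecting the winning node is a nearest-vector search over the grid. This is a minimal sketch that assumes Euclidean distance as the metric and a 2 x 2 grid with made-up node vectors; it covers only the winner selection, not the subsequent update of node vectors during training.

```python
import math

def winning_node(node_vectors, molecule):
    """Return the grid position whose vector is closest to the molecule vector.

    node_vectors maps (row, column) positions to descriptor-like vectors.
    """
    return min(node_vectors,
               key=lambda position: math.dist(node_vectors[position], molecule))

# Hypothetical 2 x 2 Kohonen grid with two descriptor values per node.
grid = {
    (0, 0): [0.0, 0.0],
    (0, 1): [1.0, 0.0],
    (1, 0): [0.0, 1.0],
    (1, 1): [1.0, 1.0],
}
winner = winning_node(grid, [0.9, 0.2])
print(winner)
```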
Disadvantages of Neural Networks
It is difficult to design the network architecture, i.e. to choose the number of hidden layers and nodes that will best fit the data.
Another practical issue is overtraining. An overtrained NN will give excellent results on the training data but will perform poorly on unseen data (test data), because the network has memorized the training data.
The way to solve this problem is to divide the data into training and test sets and to monitor performance on the test set during training. Test-set performance improves until it reaches a plateau and then starts to decline; at that point the network has its maximum predictive ability.
DECISION TREES (DT)
In a feed-forward NN it is not possible to determine why a given input produces a given result: because of the complex interconnections between nodes, one cannot determine which properties are important.
Decision trees consist of a set of rules that associate molecular descriptor values with the property of interest.
A DT is a tree whose nodes contain specific rules. Each rule may correspond to the presence or absence of a particular feature.
In a DT one starts at the root node and follows the edge appropriate to the first rule. This continues until a terminal node is reached, at which point the molecule can be assigned to the active or inactive class.
DT algorithms such as ID3, C4.5, and C5.0 use information theory to choose which criterion to apply at each step.
In random forests, a small subset of the descriptors is randomly selected at each node rather than using the full set.
Support Vector Machines (SVM)
Support vector machines select a small number of critical boundary instances, called support vectors, from each class and build a linear discriminant function that separates them as widely as possible.
Molecules in the test set are mapped to the same feature space, and their activity is predicted according to which side of the hyperplane they fall on.
The distance to the boundary can be used to assign a confidence level to the prediction: the greater the distance, the higher the confidence.
The output of a linear SVM is given by f(x) = sign(g(x)), where g(x) = w^T x + b, w is a vector, and b is a scalar.
A linear SVM can be applied only when the active and inactive compounds can be divided by a straight line (hyperplane) in the feature space.
SVM continued...
When the data cannot be separated linearly, kernel functions are used to transform them to a higher-dimensional space.
The output of the SVM is again given by f(x) = sign(g(x)), where g(x) takes the standard kernel form
$$g(x) = \sum_{k=1}^{m} \alpha_k y_k K(x_k, x) + b$$
where K is the so-called kernel function, the suffix k indexes the support vectors x_k (with learned coefficient \alpha_k and class label y_k), and m is the number of support vectors.
The Gaussian and polynomial kernel functions are commonly used.
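The kernel decision function can be sketched as below, assuming the standard form g(x) = sum over the m support vectors of c_k * K(x_k, x) + b, where each coefficient c_k combines the learned multiplier and the class label. The support vectors, coefficients, bias, and gamma value are hypothetical; in practice they come from training.

```python
import math

def gaussian_kernel(u, v, gamma=1.0):
    """Gaussian (RBF) kernel: exp(-gamma * squared distance)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def svm_decision(x, support_vectors, coefficients, bias, kernel=gaussian_kernel):
    """g(x): weighted sum of kernel values against each support vector, plus bias."""
    return sum(c * kernel(sv, x)
               for sv, c in zip(support_vectors, coefficients)) + bias

def svm_predict(x, support_vectors, coefficients, bias):
    """f(x) = sign(g(x)); g(x) == 0 is assigned to the positive class here."""
    g = svm_decision(x, support_vectors, coefficients, bias)
    return 1 if g >= 0 else -1

# Hypothetical trained model: one active (+1) and one inactive (-1) support vector.
support_vectors = [[0.0, 0.0], [2.0, 2.0]]
coefficients = [1.0, -1.0]      # multiplier times class label for each support vector
bias = 0.0

print(svm_predict([0.1, 0.1], support_vectors, coefficients, bias))
print(svm_predict([1.9, 2.0], support_vectors, coefficients, bias))
```

The magnitude of g(x) can serve as the confidence measure mentioned earlier, since it grows with distance from the decision boundary.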
Strengths and Weaknesses of SVM
Strengths
Training is relatively easy
No local optima
It scales relatively well to high-dimensional data
The tradeoff between classifier complexity and error can be controlled explicitly
Non-traditional data such as strings and trees can be used as input to an SVM, instead of feature vectors
Weaknesses
Need to choose a "good" kernel function.
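Measuring Classifier Performance
N = total number of instances in the dataset
TPj = number of true positives for class j
FPj = number of false positives for class j
TNj = number of true negatives for class j
FNj = number of false negatives for class j
The standard definitions are:
$$\text{Accuracy} = \frac{TP_j + TN_j}{N}$$
$$\text{Sensitivity (recall)} = \frac{TP_j}{TP_j + FN_j}$$
$$\text{Specificity} = \frac{TN_j}{TN_j + FP_j}$$
$$\text{Precision} = \frac{TP_j}{TP_j + FP_j}$$
Note that specificity and precision are different quantities, although the slide lists them together.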
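These measures follow directly from the four counts defined above. The sketch below uses the standard definitions, which are assumed here because the formulas on the original slide did not survive extraction; the example counts are hypothetical.

```python
def classifier_metrics(tp, fp, tn, fn):
    """Standard performance measures for one class from its confusion counts."""
    n = tp + fp + tn + fn                  # total number of instances
    return {
        "accuracy": (tp + tn) / n,
        "sensitivity": tp / (tp + fn),     # also called recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }

# Hypothetical counts for the 'active' class of a screening classifier.
metrics = classifier_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(name, round(value, 3))
```

For heavily imbalanced screening sets, where inactives dominate, accuracy alone can look high even when few actives are recovered, so sensitivity and precision are worth reporting alongside it.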
Types of Datamining Learning Process in Weka
Classification learning: the learning scheme is presented with a set of classified examples from which it is expected to learn a way of classifying unseen examples.
Association learning: any association among features is sought, not just ones that predict a particular class value.
Clustering: groups of examples that belong together are sought.
Numeric prediction: the outcome to be predicted is not a discrete class but a numeric quantity.
Classifier Algorithms in WEKA
a) Bayes Classifier
AODE
BAYES NET
NAIVE BAYES
NAIVE BAYES MULTINOMIAL
NAIVE BAYES UPDATABLE
b) Trees
ADTREE
ID3
J48
LMT
NB5TREE
RANDOM FOREST
RANDOM TREE
REP TREE
c) Functions
LINEAR REGRESSION
LOGISTIC
MULTILAYER PERCEPTRON
RBF NETWORK
SIMPLE LINEAR REGRESSION
SIMPLE LOGISTIC
SMO, SMO REG
d) Rules
CONJUNCTIVE RULE
DECISION TABLE
JRIP
M5RULES
NNGE
ONE R
PRISM
ZERO R
Summary
Machine learning is mainly applied to ligand-based drug screening, where it is applied to the calculation of the optimal distance between the feature vectors of active and inactive compounds.
A kernel is essentially a similarity function with certain mathematical properties, and it is possible to define kernel functions over all sorts of structures, for example sets, strings, trees, and probability distributions.
Interest in neural networks appears to have declined since the arrival of support vector machines, perhaps because the latter generally require fewer parameters to be tuned to achieve the same (or greater) accuracy.
THANK YOU
