
Molecular Descriptors and Virtual Screening Using a Data Mining Approach

Aim of the Cheminformatics Project

o Screen molecules that interact with potential TB targets using classifiers.
o Dock the selected molecules against the targets to screen them further for leads.
o Use cheminformatics techniques such as QSAR, 3D-QSAR, and ADMET prediction to look for potential leads, and design drugs from those leads by building combinatorial libraries.

Tuberculosis: Obstacles for Drug Design

o The HIV epidemic has dramatically increased the risk of developing active TB.
o Multi-drug-resistant TB (MDR-TB) is increasingly common.
o Extensively drug-resistant (XDR) TB strains have emerged.
o XDR-TB is characterized by resistance to at least the two first-line drugs rifampicin and isoniazid, and additionally to a fluoroquinolone and an injectable drug such as kanamycin.
o Existing TB drugs can only target actively growing bacteria, through the inhibition of cell processes such as cell wall biogenesis and DNA replication.
o TB chemotherapy is therefore characterized by efficient bactericidal activity but extremely weak sterilizing activity, i.e. an inability to kill slowly growing and slowly metabolizing strains.

Drugs Currently in Development

[Figure: expected timelines towards approval of candidate drugs currently in the clinical stage of development. Sources: Global TB Alliance Annual Report 2004-2005; Stop TB Partnership Working Group on New Drugs for TB, Strategic Plan 2006-2015.]

Commonly Used TB Drugs and Targets

Main Properties of Anti-TB Drugs

QSAR and Drug Design

Compounds + biological activity -> QSAR -> New compounds with improved biological activity

What is QSAR?
QSAR is a mathematical relationship between the biological activity of a molecular system and its geometric and chemical characteristics.
A general formula for a quantitative structure-activity relationship (QSAR) is:

activity = f(molecular or fragmental properties)

QSAR attempts to find a consistent relationship between biological activity and molecular properties, so that these rules can be used to evaluate the activity of new compounds.

Molecule Properties
SPC: Structure-Property Correlation

MOLECULE STRUCTURE

INTRINSIC PROPERTIES
Molar volume
Connectivity indices
Charge distribution
Molecular weight
Polar surface area
...

CHEMICAL PROPERTIES
pKa
Log P
Solubility
Stability

BIOLOGICAL PROPERTIES

Activity
Toxicity
Biotransformation
Pharmacokinetics

Molecular Descriptors
o Molecular descriptors are numerical values that characterize properties of molecules.
o The descriptors fall into four classes:
a) Topological
b) Geometrical
c) Electronic
d) Hybrid or 3D descriptors

Classification of Descriptors
Topological Descriptors
Topological descriptors are derived directly from the connection-table representation of the structure. They include:
a) Atom and bond counts
b) Substructure counts
c) Molecular connectivity indices (Wiener index, Randić index, chi indices)
d) Kappa indices
e) Path descriptors
f) Distance-sum connectivity
g) Molecular symmetry

Geometrical Descriptors
Geometrical descriptors are derived from three-dimensional representations and include:
a) Principal moments of inertia
b) Molecular volume
c) Solvent-accessible surface area
d) Charged partial surface area
e) Molecular surface area

Electronic Descriptors
Electronic descriptors characterize molecular structures with quantities such as:
a) Dipole moment
b) Quadrupole moment
c) Polarizability
d) HOMO and LUMO energies
e) Dielectric energy
f) Molar refractivity

Hybrid and 3D Descriptors
a) Geometric atom pairs and topological torsions
b) Spatial autocorrelation vectors
c) WHIM indices
d) BCUTs
e) GETAWAY descriptors
f) Topomers
g) Pharmacophore fingerprints
h) EVA descriptors
i) Descriptors of molecular fields

Limit on the Number of Descriptors
o The data set should contain at least 5 times as many compounds as descriptors in the QSAR.
o The reason is that too few compounds relative to the number of descriptors will give a falsely high correlation:
   2 points exactly determine a line;
   3 points exactly determine a plane (etc.).
o A data set of drug candidates that is similar in size to the number of descriptors therefore gives a meaningless correlation.
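
The rule of thumb above can be illustrated with a minimal Python sketch. The function names and the numeric points are hypothetical; the ratio of 5 is the one quoted on the slide. The second function shows why a fit through as many points as parameters is meaningless: a line through any two points has R^2 = 1.

```python
def max_descriptors(n_compounds, ratio=5):
    """Largest descriptor count allowed by the 'ratio compounds per descriptor' rule."""
    return n_compounds // ratio


def r_squared_two_points(p, q):
    """R^2 of the straight line fitted through exactly two (x, y) points.

    Assumes the two points differ in both x and y.
    """
    (x1, y1), (x2, y2) = p, q
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    mean_y = (y1 + y2) / 2
    # Residual and total sums of squares for the two-point "data set"
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in (p, q))
    ss_tot = sum((y - mean_y) ** 2 for _, y in (p, q))
    return 1 - ss_res / ss_tot


# Hypothetical data: 40 compounds support at most 8 descriptors
print(max_descriptors(40))
# Any two distinct points give a "perfect" correlation
print(r_squared_two_points((1.0, 2.0), (3.0, 7.0)))
```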

Freely Available Tools to Calculate Molecular Descriptors
o CDK tool: http://rguha.net/code/java/cdkdesc.html
o PowerMV: http://nisla05.niss.org/PowerMV/?q=PowerMV/
o Mold2: http://www.fda.gov/ScienceResearch/BioinformaticsTools/Mold2/default.htm
o PaDEL-Descriptor: http://www.downv.com/Windows/installPaDEL- Descriptor-10439915.htm

ADMET Descriptors to Screen Molecules

Bioavailability
[Diagram: factors that determine the bioavailability of a compound: absorption, permeability, lipophilicity, hydrogen bonding, solubility, molecular size/shape, flexibility, transporters, liver metabolism, and gut-wall metabolism.]

PREDICTION OF ADMET PROPERTIES
Requirements for a drug:
o Must bind tightly to the biological target in vivo
o Must pass through one or more physiological barriers (cell membrane or blood-brain barrier)
o Must remain long enough to take effect
o Must be removed from the body by metabolism, excretion, or other means
ADMET: Absorption, Distribution, Metabolism, Excretion (Elimination), Toxicity

Lipinski Rule of Five (Oral Drug Properties)
Poor absorption or permeation is more likely when:
o MW > 500
o LogP > 5
o More than 5 H-bond donors (sum of OH and NH groups)
o More than 10 H-bond acceptors (sum of N and O atoms)
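
As a minimal sketch, the four criteria above can be written as a filter over precomputed properties. The function name and the example property values are hypothetical; the properties would normally come from a descriptor calculator.

```python
def lipinski_violations(mw, logp, h_donors, h_acceptors):
    """Return the list of Rule of Five criteria that a compound violates."""
    violations = []
    if mw > 500:
        violations.append("MW > 500")
    if logp > 5:
        violations.append("LogP > 5")
    if h_donors > 5:
        violations.append("more than 5 H-bond donors")
    if h_acceptors > 10:
        violations.append("more than 10 H-bond acceptors")
    return violations


# Hypothetical compounds: (MW, LogP, H-bond donors, H-bond acceptors)
print(lipinski_violations(350.0, 2.1, 2, 5))
print(lipinski_violations(720.0, 6.3, 2, 12))
```

An empty list means no criterion is violated; a screening pipeline might, for instance, discard compounds with more than one violation.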

Polar Surface Area
o Defined as the amount of molecular (van der Waals) surface arising from polar atoms (nitrogen and oxygen atoms together with their attached hydrogens).
o PSA seems to optimally encode those drug properties which play an important role in membrane penetration: molecular polarity, H-bonding features, and also solubility.
o It provides excellent correlations with transport properties of drugs. (PSA is used in the prediction of oral absorption, brain penetration, intestinal absorption, and Caco-2 permeability.)
o It has also been used effectively to characterize drug-likeness during virtual screening and combinatorial library design.
o The calculation of PSA, however, is rather time-consuming because of the need to generate a reasonable 3D molecular geometry and to calculate the surface itself.
o Peter Ertl introduced an extremely rapid method to obtain the PSA descriptor simply from the sum of contributions of polar fragments in a molecule, without the need to generate its three-dimensional (3D) geometry.

PSA in Intestinal Absorption

o Intestinal absorption is usually expressed as fraction absorbed (FA), the percentage of the initial dose appearing in the portal vein.
o A PSA model was built for β-adrenoreceptor antagonists [1]. An excellent sigmoidal relationship between PSA and FA after oral administration was obtained. Similar sigmoidal relationships can also be obtained for the topological PSA (TPSA).
o These results suggest that drugs with a PSA < 60 Å² are completely (more than 90%) absorbed, whereas drugs with a PSA > 140 Å² are absorbed to less than 10%. This conclusion was later confirmed by the correct classification of a set of endothelin receptor antagonists as having low, intermediate, or high permeability.
o PSA was also shown to play an important role in explaining human in vivo jejunum permeability [2]. A model based on PSA and LogP for the prediction of drug absorption was developed for 199 well-absorbed and 35 poorly absorbed compounds [3].

PSA in Blood-Brain Barrier (BBB) Penetration

o Drugs that act on the CNS need to cross the BBB to reach their target, while minimal BBB penetration is required for other drugs to prevent CNS side effects.
o A common measure of BBB penetration is the ratio of drug concentrations in the brain and the blood, expressed as log(C_brain/C_blood).
o Van de Waterbeemd and Kansy were probably the first to correlate the PSA of a series of CNS drugs with their membrane transport. They obtained a fair correlation of brain uptake with single-conformer PSA and molecular volume descriptors.
o Clark et al. derived models for 55 compounds using TPSA and LogP. With TPSA alone:
   LogBB = 0.516 - 0.115 * TPSA
   n = 55, r² = 0.686, r = 0.828, σ = 0.42
o TPSA in combination with ClogP:
   LogBB = 0.070 - 0.014 * TPSA + 0.169 * ClogP
   n = 55, r² = 0.787, r = 0.887, σ = 0.35
o The great majority of orally administered CNS drugs have a PSA < 70 Å². Analysis of non-CNS compounds suggested that these have a PSA < 120 Å².
o Thus the majority of orally absorbed compounds that do not penetrate the CNS have PSA values between 70 and 120 Å².
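
A minimal sketch of how the two-descriptor model above could be evaluated. The coefficients are those quoted on the slide; the function name and the example descriptor values are hypothetical.

```python
def log_bb(tpsa, clogp):
    """LogBB from the two-descriptor model quoted above (TPSA in square Angstroms)."""
    return 0.070 - 0.014 * tpsa + 0.169 * clogp


# Hypothetical compound with TPSA = 50 and ClogP = 2.0
value = log_bb(50.0, 2.0)
print(round(value, 3))
```

Under this model, lower TPSA and higher ClogP both push LogBB upward, i.e. towards better brain penetration.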

Partition Coefficients

[Diagram: compound X partitioning between the aqueous and octanol phases.]

The partition coefficient P (usually expressed as log10 P or logP) is defined as:

$$P = \frac{[X]_{\text{octanol}}}{[X]_{\text{aqueous}}}$$

P is a measure of the relative affinity of a molecule for the lipid and aqueous phases in the absence of ionisation.
1-Octanol is the most frequently used lipid phase in pharmaceutical research because:
o It has a polar and a non-polar region (like a membrane phospholipid)
o Po/w is fairly easy to measure
o Po/w often correlates well with many biological properties
o It can be predicted fairly accurately using computational models

Calculation of logP
LogP for a molecule can be calculated from a sum of fragmental or atom-based terms plus various corrections.

logP = Σ fragments + Σ corrections

[Figure: structure of phenylbutazone, with the chain branch annotated.]

clogP for Windows output

C: 3.16 M: 3.16 PHENYLBUTAZONE

Class      | Type   | Log(P) Contribution Description  | Value
FRAGMENT   | # 1    | 3,5-pyrazolidinedione            | -3.240
ISOLATING  | CARBON | 5 Aliphatic isolating carbon(s)  |  0.975
ISOLATING  | CARBON | 12 Aromatic isolating carbon(s)  |  1.560
EXFRAGMENT | BRANCH | 1 chain and 0 cluster branch(es) | -0.130
EXFRAGMENT | HYDROG | 20 H(s) on isolating carbons     |  4.540
EXFRAGMENT | BONDS  | 3 chain and 2 alicyclic (net)    | -0.540
RESULT     | 2.11   | All fragments measured           | clogP 3.165
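
The additive scheme can be checked directly against the phenylbutazone output above. This is only a sketch of the final summation step, using the contribution values listed in the output; it does not reproduce how CLOGP identifies fragments or corrections. The short labels are paraphrases.

```python
# Contributions copied from the clogP output for phenylbutazone
contributions = [
    ("3,5-pyrazolidinedione fragment", -3.240),
    ("5 aliphatic isolating carbons", 0.975),
    ("12 aromatic isolating carbons", 1.560),
    ("1 chain branch", -0.130),
    ("20 H on isolating carbons", 4.540),
    ("3 chain and 2 alicyclic bonds (net)", -0.540),
]

# logP = sum of fragment terms + sum of correction terms
clogp = sum(value for _, value in contributions)
print(round(clogp, 3))  # 3.165, matching the reported clogP
```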

What else does logP affect?

o Binding to enzyme / receptor
o Aqueous solubility
o Binding to P450 metabolising enzymes
o Absorption through membrane
o Binding to blood / tissue proteins (less drug free to act)
o Binding to hERG heart ion channel (cardiotoxicity risk)

So logP needs to be optimised.

ADMET Descriptor Calculation Tools

PreADMET: http://preadmet.bmdrc.org/
o Molecular descriptor calculation: 1081 diverse molecular descriptors
o Drug-likeness prediction: Lipinski rule, lead-like rule, Drug DB-like rule
o ADME prediction: Caco-2, MDCK, BBB, HIA, plasma protein binding, and skin permeability data
o Toxicity prediction: Ames test and rodent carcinogenicity assay

SPARC Online Calculator: http://ibmlc2.chem.uga.edu/sparc/
o Prediction of pKa, solubility, polarizability, and other properties; a search of the database of experimental pKa values is also available

Daylight Chemical Information Systems: www.daylight.com/daycgi/clogp
o Calculation of log P by the CLOGP algorithm from BioByte; also access to the LOGPSTAR database of experimental log P data

ADMET Tools Continued

Molinspiration Cheminformatics: www.molinspiration.com/services/index.
o Calculation of molecular properties relevant to drug design and QSAR, including log P, polar surface area, Rule of Five parameters, and drug-likeness index

Pirika: www.pirika.com
o Calculation of various types of molecular properties, including boiling point, vapor pressure, and solubility; the web demo is restricted to aliphatic molecules

Actelion: www.actelion.com/page/property_explorer
o Calculation of molecular weight, logP, solubility, drug-score, and toxicity risk

Virtual Computational Chemistry Laboratory: www.vcclab.org
o Prediction of log P and water solubility based on associative neural networks, as well as other parameters; comparison of various prediction methods

Virtual Screening

Ways to Assess Structures from a Virtual Screening Experiment
o Use a previously derived mathematical model that predicts the biological activity of each structure
o Run substructure queries to eliminate molecules with undesirable functionality
o Use a docking program to identify structures predicted to bind strongly to the active site of a protein (if the target structure is known)
o Filters remove unwanted structures in a succession of screening methods

Main Classes of Virtual Screening Methods
The choice depends on the amount of structural and bioactivity data available:
o One active molecule known: perform a similarity search (ligand-based virtual screening)
o Several active molecules known: try to identify a common 3D pharmacophore, then do a 3D database search
o Reasonable number of active and inactive structures known: train a machine learning technique
o 3D structure of the protein known: use protein-ligand docking

STRUCTURE-BASED VIRTUAL SCREENING
Protein-Ligand Docking
o Aims to predict the 3D structure formed when a molecule docks to a protein
o Needs a way to explore the space of possible protein-ligand geometries (poses)
o Needs to score or rank the poses, such that the score reflects the binding affinity of the ligand, in order to identify the most likely binding mode and assign a priority to the molecules
o Problem: involves many degrees of freedom (rotation, conformation) and solvent effects
o Conformations of ligands in complexes often have very similar geometries to minimum-energy conformations of the isolated ligand

Protein-Ligand Docking Methods
Modern methods explore orientational and conformational degrees of freedom at the same time:
o Monte Carlo algorithms (change the conformation of the ligand or subject the molecule to a translation or rotation within the binding site)
o Genetic algorithms
o Incremental construction approaches

Distinguish Docking and Scoring
o Docking involves the prediction of the binding mode of individual molecules
   Goal: identify the orientation closest in geometry to the observed X-ray structure
o Scoring ranks the ligands using some function related to the free energy of association of the two units
   The DOCK function looks at atom pairs between 2.3 and 3.5 Ångstroms apart
   The pair-wise linear potential looks at attractive and repulsive regions, taking into account steric and hydrogen-bonding interactions (e.g. MolDock)

Structure-Based Virtual Screening: Other Aspects
o Computationally intensive and complex
o A multitude of possible parameters figure into docking programs
o Docking programs require a 3D conformation as the starting point, or require partial atomic charges for protein and ligand
o X-ray crystallographic structures don't include hydrogens, but most docking programs require them

Ligand-Based Virtual Screening

The ligand-based approach mainly uses pharmacophore maps and QSAR to identify or modify a lead in the absence of a known three-dimensional structure of the receptor. It requires experimental affinities and molecular properties for a set of active compounds whose chemical structures are known.

a) PHARMACOPHORE: A pharmacophore is an explicit geometric hypothesis of the critical features of a ligand. Standard features include H-bond donors and acceptors, charged groups, and hydrophobic patterns. The hypothesis can be used to screen databases for compounds and to refine existing leads.

o For a geometric alignment of the functional groups of the leads, it is necessary to specify the conformations that individual compounds adopt in their bound state.
o Since the simple presence of a pharmacophoric fingerprint is not sufficient for predicting activity, inactive compounds possessing the required pharmacophoric features must also be considered.
o By comparing the volumes of the active and the inactive compounds, a common volume can be constructed to approximate the shape of the (unknown) receptor site, to further refine the pharmacophore model, and to screen out additional compounds.

[Diagram: Pharmacophore Modelling Workflow. Boxes shown: 3D compound structures, set of conformers, feature analysis, align to template, compare, pharmacophore, validation, application.]

Continued...
b) QSAR: The goal of QSAR studies is to predict the activity of new compounds based solely on their chemical structure. The underlying assumption is that biological activity can be attributed to incremental contributions of the molecular fragments that determine it. This assumption is called the linear free energy principle. Information about the strength of interactions is captured for each compound by, for example, steric, electronic, and hydrophobic descriptors.

Molecular Similarity and Searching Molecules

What is it?
Two compounds are similar when their chemical, pharmacological, or biological properties match.
The more features they have in common, the higher the similarity between two molecules.

Chemical
[Figure: two structures.] The two structures are chemically similar to each other. This is reflected in their common sub-graph, or scaffold: they share 14 atoms.

Pharmacophore
[Figure: two structures.] The two structures are less similar chemically (topologically) yet have the same pharmacological activity: both are angiotensin-converting enzyme (ACE) inhibitors.

Molecular Similarity
How to calculate it?
Quantitative assessment of the similarity/dissimilarity of structures needs a numerically tractable form:
o molecular descriptors, fingerprints, structural keys
o sequences/vectors of bits, or numeric values, that can be compared by distance functions or similarity metrics

E = Euclidean distance, T = Tanimoto index:

$$E(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

$$T(x, y) = \frac{B(x \mathbin{\&} y)}{B(x) + B(y) - B(x \mathbin{\&} y)}$$

where B(.) is the number of bits set and x & y is the bitwise AND of the two fingerprints.
Molecular descriptors
a) Chemical fingerprint (hashed binary fingerprint)
o encodes topological properties of the chemical graph: connectivity, edge label (bond type), node label (atom type)
o allows the comparison of two molecules with respect to their chemical structure
Construction
1. find all 0, 1, ..., n step walks in the chemical graph
2. generate a bit array for each walk with a given number of bits set
3. merge the bit arrays with a logical OR operation

Molecular descriptors
Example 1: chemical fingerprint
Example: CH3-CH2-OH, walks from the first carbon atom

length | walk | bit array
0      | C    | 1010000000
1      | CH   | 0001010000
1      | CC   | 0001000100
2      | CCH  | 0001000010
2      | CCO  | 0100010000
3      | CCOH | 0000011000

Merged bit arrays for the first carbon atom: 1111011110
This example illustrates how a 10-bit topological chemical fingerprint is created for a simple chain structure. All walks of up to 3 steps are considered, and 2 bits are set for each pattern.
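
The merge step of this example can be reproduced with a short Python sketch. The bit arrays are the ones from the slide; labelling the first array as the zero-step walk "C" is an assumption, and the hashing that decides which 2 bits each walk sets is not shown.

```python
from functools import reduce

# Bit arrays per walk, copied from the example above
# (the key "C" for the zero-step walk is an assumption)
walk_bits = {
    "C": "1010000000",
    "CH": "0001010000",
    "CC": "0001000100",
    "CCH": "0001000010",
    "CCO": "0100010000",
    "CCOH": "0000011000",
}


def merge_bit_arrays(bit_strings):
    """Merge equal-length bit strings with a logical OR."""
    width = len(bit_strings[0])
    merged = reduce(lambda a, b: a | b, (int(s, 2) for s in bit_strings))
    return format(merged, "0{}b".format(width))


fingerprint = merge_bit_arrays(list(walk_bits.values()))
print(fingerprint)  # 1111011110, as on the slide
```

Because several walks may set the same bit, a set bit does not identify a unique pattern; this is the usual trade-off of hashed fingerprints.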

Molecular Similarity
Example 1: chemical fingerprint

0100010100010100010000000001101010011010100000010100000000100000

0100010100010100010000000001101010011010100000000100000000100000
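
A minimal sketch of the Tanimoto index and Euclidean distance defined earlier, applied to the two 64-bit fingerprints shown above. The function names are hypothetical; fingerprints are handled as strings of 0s and 1s.

```python
import math


def tanimoto(fp_a, fp_b):
    """Tanimoto index of two equal-length bit strings."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    a, b = int(fp_a, 2), int(fp_b, 2)
    common = bin(a & b).count("1")  # B(x & y)
    total = bin(a).count("1") + bin(b).count("1") - common
    return common / total if total else 1.0


def euclidean(x, y):
    """Euclidean distance between two numeric descriptor vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


fp1 = "0100010100010100010000000001101010011010100000010100000000100000"
fp2 = "0100010100010100010000000001101010011010100000000100000000100000"

# The fingerprints have 17 and 16 bits set and share 16, so T = 16/17
print(round(tanimoto(fp1, fp2), 3))  # 0.941
```

The two fingerprints differ in a single bit, which is why the similarity is close to 1.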

Molecular descriptors
Example 2: pharmacophore fingerprint
o encodes pharmacophore properties of molecules as frequency counts of pharmacophore point pairs at a given topological distance
o allows the comparison of two molecules with respect to their pharmacophore

Construction
1. map pharmacophore point types to atoms
2. calculate the length of the shortest path between each pair of atoms
3. assign a histogram to every pharmacophore point pair and count the frequency of the pair with respect to its distance
Molecular descriptors
Example 2: pharmacophore fingerprint
Pharmacophore point type based coloring of atoms: acceptor (A), donor (D), hydrophobic (H), none.

[Figure: two pharmacophore-pair histograms. The horizontal axis lists the point pairs AA, DA, DD, HA, HD, and HH, each at topological distances 1 to 6; the vertical axis is the pair count.]

Virtual screening using fingerprints


Individual query structure
0101010100010100010100100000000000010010000010010100100100010000

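(query fingerprint)

[Diagram: the query fingerprint is compared with each target fingerprint by a proximity measure; the closest targets are reported as hits.]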

0000000100001101000000101010000000000110000010000100001000001000
0100010110010010010110011010011100111101000000110000000110001000
0100010100011101010000110000101000010011000010100000000100100000
0001101110011101111110100000100010000110110110000000100110100000
0100010100110100010000000010000000010010000000100100001000101000
0100011100011101000100001011101100110110010010001101001100001000
0101110100110101010111111000010000011111100010000100001000101000
0100010100111101010000100010000000010010000010100100001000101000
0001000100010100010100100000000000001010000010000100000100000000
0100010100010011000000000000000000010100000010000000000000000000
0100010100010100000000000000101000010010000000000100000000000000
0101010101111100111110100000000000011010100011100100001100101000
0100010100011000010000011000000000010001000000110000000001100000
0000000100000000010000100000000000001010100000000100000100100000
0100010100010100000000100000000000010000000000000100001000011000
0001000100001100010010100000010100101011100010000100001000101000
0100011100010100010000100001001110010010000010001100000000101000
0101010100010100010100100000000000010010000010010100100100010000

target fingerprints

hits

Hypothesis Fingerprints

Advantages | Disadvantages
strict conditions for hits if actives are fairly similar | false results with asymmetric metrics; misses common features of highly diverse sets; very sensitive to one missing feature
captures common features of more diverse active sets | less selective if actives are very similar
captures common features of more diverse active sets; specific treatment of the absence of a feature; less sensitive to outliers | less selective if actives are very similar

SUMMARY
o Virtual screening methods are central to many cheminformatics problems in:
   Design
   Selection
   Analysis
o Increasing numbers of molecules can be evaluated using these techniques
o Reliability and accuracy remain problems in docking and in predicting ADMET properties
o Much more reliable and consistent experimental data are needed

Data Mining and Machine Learning Approaches to Virtual Screening

Idea of Data Mining
o Data mining is the discovery of patterns in data, for example:
   a) a hunter looks for patterns in animal migration behavior
   b) farmers seek patterns in crop growth
   c) politicians seek patterns in voter opinion
   d) patterns in compound structures
o The patterns discovered must be meaningful and lead to some advantage.
o The process must be automatic or semi-automatic.

Canonical Learning Problems
o Supervised learning: given examples of inputs and corresponding desired outputs, predict outputs for future inputs.
   a) Classification
   b) Regression
   c) Time series prediction
o Unsupervised learning: given only inputs, automatically discover representations, features, structure, etc.
   a) Clustering
   b) Outlier detection
   c) Compression

Data Mining Methods
Substructural Analysis
o Each substructural fragment is assumed to make a contribution to activity irrespective of the other fragments in the molecule. The idea is to derive a weight for each fragment that reflects its tendency to be active or inactive. The sum of the weights gives the score of a molecule, which enables a new set of structures to be ranked in decreasing probability of activity.
o A commonly used form of the weight is:

$$w_i = \frac{act(i)}{act(i) + inact(i)}$$

where act(i) is the number of active molecules that contain the i-th fragment and inact(i) is the number of inactive molecules that contain the i-th fragment.
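
A minimal sketch of substructural analysis, assuming the commonly used weight act(i) / (act(i) + inact(i)). The fragment names and training molecules are hypothetical, and each molecule is represented simply as a set of fragment labels.

```python
def fragment_weights(actives, inactives):
    """Weight per fragment: fraction of the molecules containing it that are active."""
    fragments = set().union(*actives, *inactives)
    weights = {}
    for frag in fragments:
        act = sum(frag in mol for mol in actives)
        inact = sum(frag in mol for mol in inactives)
        weights[frag] = act / (act + inact)
    return weights


def score(molecule, weights):
    """Score of a molecule: sum of the weights of its fragments (unseen ones count 0)."""
    return sum(weights.get(frag, 0.0) for frag in molecule)


# Hypothetical training data: one set of fragment labels per molecule
actives = [{"nitro", "amide"}, {"amide"}]
inactives = [{"nitro"}, {"ester"}]
weights = fragment_weights(actives, inactives)

# Rank new (hypothetical) molecules by decreasing score
candidates = {"mol_1": {"amide", "ester"}, "mol_2": {"nitro", "amide"}, "mol_3": {"ester"}}
ranking = sorted(candidates, key=lambda name: score(candidates[name], weights), reverse=True)
print(ranking)
```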

Discriminant Algorithms
o The aim of discriminant analysis is to separate the molecules into their constituent classes.
o The simplest case is linear discriminant analysis, which for two activity classes and two descriptors aims to find a straight line that separates the data such that the maximum number of compounds is correctly classified.
o If more than two variables are used, the line becomes a hyperplane.
o The idea is to express a class as a linear combination of attributes:

X = w0 + w1*a1 + w2*a2 + w3*a3 + ...

where X = class, a1, a2, ... = attributes, and w1, w2, ... = weights.
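
The linear combination above can be sketched as follows. The weights are hypothetical (in practice they are fitted to training data), and using 0 as the decision threshold between the two classes is an assumption.

```python
def discriminant(attributes, weights, w0):
    """X = w0 + w1*a1 + w2*a2 + ... for one molecule."""
    return w0 + sum(w * a for w, a in zip(weights, attributes))


def classify(attributes, weights, w0):
    """Assign the class from the side of the line (hyperplane) the molecule falls on."""
    return "active" if discriminant(attributes, weights, w0) > 0 else "inactive"


# Hypothetical fitted weights for two descriptors
weights, w0 = [0.5, -1.0], 0.25
print(discriminant([1.0, 2.0], weights, w0))
print(classify([1.0, 2.0], weights, w0))
print(classify([3.0, 0.5], weights, w0))
```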

Neural Networks (NN)

o The two neural network architectures most commonly used in chemistry are feed-forward networks and Kohonen networks.
o The feed-forward NN is a supervised learning method, as it uses the values of the dependent variables to derive the model. The Kohonen network, or self-organizing map (SOM), is an unsupervised method.
o The feed-forward NN contains layers of nodes, with connections between all pairs of nodes in adjacent layers. A key feature is the presence of hidden nodes, which together with the back-propagation algorithm makes the network applicable to many fields.
o The neural network must first be trained with a set of inputs. Once trained, it can be used to predict values for new and unseen molecules.

Neural Networks Continued...

o [Figure: a feed-forward network with 3 hidden nodes and one output.]
o A Kohonen NN consists of a rectangular array of nodes, each of which has an associated vector that corresponds to the input data (descriptor values).
o The data are presented to the network one molecule at a time, and the distance between each node vector and the molecule vector is determined with a distance metric. The node with the minimum distance becomes the winning node.
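
Finding the winning node can be sketched as below. The 2 x 2 grid and its node vectors are hypothetical, Euclidean distance is assumed as the metric, and the subsequent update of the winner and its neighbours during training is not shown.

```python
import math


def winning_node(node_vectors, molecule):
    """Return the grid position whose node vector is closest to the molecule vector.

    node_vectors maps (row, col) grid positions to vectors of descriptor values.
    """
    def distance(vector):
        return math.sqrt(sum((v - m) ** 2 for v, m in zip(vector, molecule)))

    return min(node_vectors, key=lambda pos: distance(node_vectors[pos]))


# Hypothetical 2 x 2 map with two descriptors per node
nodes = {
    (0, 0): [0.0, 0.0],
    (0, 1): [1.0, 0.0],
    (1, 0): [0.0, 1.0],
    (1, 1): [1.0, 1.0],
}
print(winning_node(nodes, [0.9, 0.2]))
```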

Disadvantages of Neural Networks
o It is difficult to design the network, i.e. to choose the number of hidden layers and nodes that will best fit the data.
o Another practical issue is overtraining. An overtrained NN gives excellent results on the training data but performs poorly on unseen data (test data), because the network memorizes the training data.
o One way to address this is to divide the data into training and test sets and monitor performance on the test set during training. Test-set performance improves until it reaches a plateau and then starts to decline; at that point the network has its maximum predictive ability.
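
The stopping criterion described above can be sketched as a simple early-stopping rule. The per-epoch test-set scores are hypothetical, and the patience parameter (how many non-improving epochs to tolerate) is an assumption not taken from the slides.

```python
def early_stopping_epoch(test_scores, patience=2):
    """Return the epoch with the best test-set score seen before training is stopped.

    Training stops once `patience` consecutive epochs fail to improve on the best score.
    """
    best_score, best_epoch = float("-inf"), -1
    for epoch, score in enumerate(test_scores):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch


# Hypothetical test-set accuracy after each training epoch
scores = [0.61, 0.70, 0.78, 0.81, 0.81, 0.79, 0.74]
print(early_stopping_epoch(scores))
```

The network weights saved at the returned epoch would be the ones kept for prediction.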

DECISION TREES (DT)

o In a feed-forward NN it is not possible to determine how the result for a given input arises: because of the complex interconnections between nodes, one cannot determine which properties are important.
o Decision trees consist of a set of rules that associate molecular descriptor values with the property of interest.
o A DT is a tree whose nodes contain specific rules. Each rule may correspond to the presence or absence of a particular feature.
o In a DT, one starts at the root node and follows the edge appropriate to the first rule. This continues until a terminal node is reached, at which point the molecule can be assigned to the active or inactive class.
o DT algorithms such as ID3, C4.5, and C5.0 use information theory to choose which criterion to apply at each step.
o In random forests, a small subset of the descriptors is randomly selected at each node rather than using the full set.
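
The information-theoretic criterion can be illustrated with entropy and information gain, the quantity ID3 maximizes when choosing a rule (C4.5 uses a normalized variant, the gain ratio). This is a sketch for binary presence/absence features; the feature names and labels are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_present, labels):
    """Entropy reduction from splitting on a binary feature (present/absent)."""
    n = len(labels)
    gain = entropy(labels)
    for value in (True, False):
        subset = [lab for f, lab in zip(feature_present, labels) if f == value]
        if subset:
            gain -= len(subset) / n * entropy(subset)
    return gain


# Hypothetical training set of four molecules
labels = ["active", "active", "inactive", "inactive"]
has_nitro = [True, True, False, False]   # separates the classes perfectly
has_ester = [True, False, True, False]   # carries no information

print(information_gain(has_nitro, labels))
print(information_gain(has_ester, labels))
```

The feature with the larger gain would be chosen as the rule at the current node.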

Support Vector Machines (SVM)

o Support vector machines select a small number of critical boundary instances, called support vectors, from each class and build a linear discriminant function that separates them as widely as possible.
o Molecules in the test set are mapped to the same feature space, and their activity is predicted according to which side of the hyperplane they fall on.
o The distance to the boundary can be used to assign a confidence level to the prediction: the greater the distance, the higher the confidence.
o The output of the SVM is given by f(x) = sign(g(x)), where g(x) = w^T x + b, w is a vector, and b is a scalar.
o A linear SVM can be applied only when the active and inactive compounds can be divided by a straight line (hyperplane) in the feature space.

SVM continued...

o When the data cannot be separated linearly, kernel functions are used to transform them into a higher-dimensional space.
o The output of the SVM is given by f(x) = sign(g(x)), and g(x) takes the standard kernel form

$$g(x) = \sum_{k=1}^{m} \alpha_k y_k K(x_k, x) + b$$

where K is the so-called kernel function, the suffix k indexes the support vectors, m is the number of support vectors, y_k is the class label of support vector x_k, and \alpha_k is its learned coefficient.
o The Gaussian and polynomial kernel functions are commonly used.
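
A minimal sketch of the kernel decision function f(x) = sign(g(x)), assuming the standard form in which g(x) is a weighted sum of kernel values over the m support vectors plus a scalar b. The support vectors, coefficients, and kernel parameters below are hypothetical; in practice they come out of training.

```python
import math


def gaussian_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel: exp(-gamma * squared Euclidean distance)."""
    return math.exp(-gamma * sum((xi - zi) ** 2 for xi, zi in zip(x, z)))


def polynomial_kernel(x, z, degree=2, c=1.0):
    """Polynomial kernel: (x . z + c) ** degree."""
    return (sum(xi * zi for xi, zi in zip(x, z)) + c) ** degree


def decision_value(x, support_vectors, coeffs, b, kernel):
    """g(x): sum over support vectors of coeff_k * K(x_k, x), plus b.

    Each coeff_k stands for the product of the learned multiplier and class label.
    """
    return sum(c * kernel(sv, x) for sv, c in zip(support_vectors, coeffs)) + b


def predict(x, support_vectors, coeffs, b, kernel):
    """Return (class label, |g(x)|); |g(x)| serves as an unnormalized confidence."""
    g = decision_value(x, support_vectors, coeffs, b, kernel)
    return (1 if g >= 0 else -1), abs(g)


# Hypothetical model: one inactive (-1) and one active (+1) support vector
svs = [[0.0, 0.0], [2.0, 2.0]]
coeffs = [-1.0, 1.0]
print(predict([1.8, 2.1], svs, coeffs, 0.0, gaussian_kernel))
print(predict([0.1, 0.0], svs, coeffs, 0.0, gaussian_kernel))
```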

Strengths and Weaknesses of SVM
Strengths
o Training is relatively easy
o No local optima
o It scales relatively well to high-dimensional data
o The tradeoff between classifier complexity and error can be controlled explicitly
o Non-traditional data such as strings and trees can be used as input to an SVM, instead of feature vectors
Weaknesses
o Need to choose a "good" kernel function

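Measuring Classifier Performance
N = total number of instances in the dataset
TPj = number of true positives for class j
FPj = number of false positives for class j
TNj = number of true negatives for class j
FNj = number of false negatives for class j

$$\text{Accuracy} = \frac{\sum_j TP_j}{N}$$

$$\text{Sensitivity (recall)}_j = \frac{TP_j}{TP_j + FN_j}$$

$$\text{Specificity}_j = \frac{TN_j}{TN_j + FP_j}$$

$$\text{Precision}_j = \frac{TP_j}{TP_j + FP_j}$$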
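
A minimal sketch of these measures for the two-class case, using the standard definitions; note that specificity and precision are different quantities. The function name and the counts are hypothetical.

```python
def classifier_metrics(tp, fp, tn, fn):
    """Standard two-class performance measures from the confusion-matrix counts."""
    n = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / n,        # fraction of all instances classified correctly
        "sensitivity": tp / (tp + fn),    # recall: fraction of true actives found
        "specificity": tn / (tn + fp),    # fraction of true inactives rejected
        "precision": tp / (tp + fp),      # fraction of predicted actives that are active
    }


# Hypothetical screening result on 100 compounds
metrics = classifier_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(name, round(value, 3))
```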

Types of Data Mining Learning Process in Weka
o Classification learning: the learning scheme is presented with a set of classified examples from which it is expected to learn a way of classifying unseen examples.
o Association learning: any association among features is sought, not just ones that predict a particular class value.
o Clustering: groups of examples that belong together are sought.
o Numeric prediction: the outcome to be predicted is not a discrete class but a numeric quantity.

Classifier Algorithms in WEKA
a) Bayes classifiers
AODE
BAYES NET
NAIVE BAYES
NAIVE BAYES MULTINOMIAL
NAIVE BAYES UPDATABLE

b) Trees
ADTREE
ID3
J48
LMT
NBTREE
RANDOM FOREST
RANDOM TREE
REP TREE

c) Functions
LINEAR REGRESSION
LOGISTIC
MULTILAYER PERCEPTRON
RBF NETWORK
SIMPLE LINEAR REGRESSION
SIMPLE LOGISTIC
SMO, SMO REG

d) Rules
CONJUNCTIVE RULE
DECISION TABLE
JRIP
M5 RULES
NNGE
ONE R
PRISM
ZERO R

Summary
o Machine learning is mainly applied to ligand-based drug screening, where it is used to calculate the optimal distance between the feature vectors of active and inactive compounds.
o A kernel is essentially a similarity function with certain mathematical properties, and it is possible to define kernel functions over all sorts of structures, for example sets, strings, trees, and probability distributions.
o Interest in neural networks appears to have declined since the arrival of support vector machines, perhaps because the latter generally require fewer parameters to be tuned to achieve the same (or greater) accuracy.

THANK YOU
