Tel: 416-593-1122
Fax: 416-593-5077
www.angoss.com
Table of Contents
1 Decision Trees ..... 3
1.2 Algorithms ..... 5
2 Predictive Models ..... 13
KnowledgeSTUDIO Technical Notes
1 Decision Trees
Notation
Y – dependent variable (DV);
N – number of records at the node;
v_i – i-th category of the independent variable (IV);
n_i – number of records at the node with IV in v_i.
Numeric DV
y_{i1}, ..., y_{in_i} – values of DV for records with IV in v_i;

y_i = \sum_{j=1}^{n_i} y_{ij};

\bar{y}_i = \frac{y_i}{n_i};

y = \sum_{i=1}^{r} y_i = \sum_{i=1}^{r} \sum_{j=1}^{n_i} y_{ij};

\bar{y} = \frac{y}{N};

STD = \sum_{i=1}^{r} \sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2;

SSTR = \sum_{i=1}^{r} n_i (\bar{y}_i - \bar{y})^2;

SSE = \sum_{i=1}^{r} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2;

MSTR = \frac{SSTR}{r - 1};
Mathematical Formulations
MSE = \frac{SSE}{N - r};

F^* = \frac{MSTR}{MSE};

PV = F(F^*; r - 1, N - r).
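Reading PV as the upper-tail p-value of the F statistic, the quantities above amount to a one-way ANOVA over the IV categories. A minimal sketch (the function name and grouping format are illustrative, not the product's API):

```python
import numpy as np
from scipy import stats

def split_statistics(groups):
    """ANOVA-style split statistics for a numeric DV.
    `groups` is a list of 1-D arrays, one per IV category v_i."""
    r = len(groups)
    all_y = np.concatenate(groups)
    N = all_y.size
    grand_mean = all_y.mean()
    SSTR = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    SSE = sum(((g - g.mean()) ** 2).sum() for g in groups)
    MSTR = SSTR / (r - 1)
    MSE = SSE / (N - r)
    F_star = MSTR / MSE
    PV = stats.f.sf(F_star, r - 1, N - r)   # upper-tail p-value
    return F_star, PV
```

The result agrees with `scipy.stats.f_oneway`, which computes the same statistic.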
If i and k are two categories of the IV then:

MSTR_{\{i,k\}} = n_i \left( \bar{y}_i - \frac{y_i + y_k}{n_i + n_k} \right)^2 + n_k \left( \bar{y}_k - \frac{y_i + y_k}{n_i + n_k} \right)^2;

MSE_{\{i,k\}} = \frac{ \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2 + \sum_{j=1}^{n_k} (y_{kj} - \bar{y}_k)^2 }{ n_i + n_k - 2 };

F^*_{\{i,k\}} = \frac{MSTR_{\{i,k\}}}{MSE_{\{i,k\}}}.
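For two categories the pairwise statistic reduces to a pooled two-sample comparison, and F*_{i,k} equals the square of the usual equal-variance Student t statistic, which gives a convenient check. A sketch (the helper name is mine):

```python
import numpy as np
from scipy import stats

def pairwise_F(yi, yk):
    """F*_{i,k} from the pooled two-category comparison above."""
    ni, nk = yi.size, yk.size
    pooled_mean = (yi.sum() + yk.sum()) / (ni + nk)
    MSTR = (ni * (yi.mean() - pooled_mean) ** 2
            + nk * (yk.mean() - pooled_mean) ** 2)
    MSE = (((yi - yi.mean()) ** 2).sum()
           + ((yk - yk.mean()) ** 2).sum()) / (ni + nk - 2)
    return MSTR / MSE
```

With F having 1 and n_i + n_k − 2 degrees of freedom, its upper-tail p-value matches the two-sided t-test p-value.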
Categorical DV
d – number of distinct values that DV can take;
m_j – number of records at the node with DV equal to its j-th value;
q_{ij} – number of records at the node with IV in v_i and DV equal to its j-th value;

\hat{q}_{ij} = \frac{n_i m_j}{N}.

We have:

m_j = \sum_{i=1}^{r} q_{ij}, \quad n_i = \sum_{j=1}^{d} q_{ij}, \quad N = \sum_{j=1}^{d} m_j = \sum_{i=1}^{r} n_i = \sum_{i=1}^{r} \sum_{j=1}^{d} q_{ij};

X^2 = \sum_{i=1}^{r} \sum_{j=1}^{d} \frac{(q_{ij} - \hat{q}_{ij})^2}{\hat{q}_{ij}};

X^2_{\{i,k\}} = \sum_{j=1}^{d} \frac{(q_{ij} - \hat{q}_{ij})^2}{\hat{q}_{ij}} + \sum_{j=1}^{d} \frac{(q_{kj} - \hat{q}_{kj})^2}{\hat{q}_{kj}};
PV = \chi^2(X^2; df) where df = (d - 1)(r - 1);

E_i = -\frac{1}{\ln d} \sum_{j=1}^{d} \frac{q_{ij}}{n_i} \ln \frac{q_{ij}}{n_i};

E = -\frac{1}{\ln d} \sum_{j=1}^{d} \frac{m_j}{N} \ln \frac{m_j}{N};

G = 1 - \frac{\sum_{j=1}^{d} m_j^2}{N^2};

G_i = 1 - \frac{\sum_{j=1}^{d} q_{ij}^2}{n_i^2}.
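The categorical statistics above can all be computed from the contingency counts q_{ij}. A minimal sketch (the function name is mine); X² and PV agree with the standard chi-square test of independence:

```python
import numpy as np
from scipy import stats

def categorical_split_stats(q):
    """q[i, j] = count of records with IV in v_i and the j-th DV value.
    Returns (X2, PV, E, G) per the formulas above."""
    r, d = q.shape
    n = q.sum(axis=1)                    # n_i
    m = q.sum(axis=0)                    # m_j
    N = q.sum()
    q_hat = np.outer(n, m) / N           # expected counts n_i m_j / N
    X2 = ((q - q_hat) ** 2 / q_hat).sum()
    PV = stats.chi2.sf(X2, (d - 1) * (r - 1))
    p = m / N
    E = -(p[p > 0] * np.log(p[p > 0])).sum() / np.log(d)   # node entropy
    G = 1.0 - (m ** 2).sum() / N ** 2                      # node Gini
    return X2, PV, E, G
```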
Algorithms
Treatment of a Numeric IV
To use a numeric IV in a decision tree it must first be made categorical. KnowledgeSTUDIO breaks the range of values of the IV into a user-controlled number of intervals (10 by default) so that each interval contains the same number of records. Each interval is then treated as a category.
KnowledgeSEEKER
Exhaustive Method
This method starts with each value of the IV being a separate category. The following occurs at each iteration:
1. For every eligible pair of categories (i, k) its similarity statistic is computed. For a numeric DV this statistic is F^*_{\{i,k\}}; for a categorical DV it is X^2_{\{i,k\}}.
2. The pair with the highest similarity statistic is merged into a single category, thus decreasing the total number of categories r by 1. The new set of categories is the next candidate split and its p-value is computed. For a numeric DV the p-value is F(F^*; r - 1, N - r); for a categorical DV it is \chi^2(X^2; df).
These two steps are repeated until r = 1 . Of all candidate splits, the one with the highest Info (see below) is chosen.
If Info is above the user-controlled threshold, the split is presented for that IV. Otherwise no significant splits for the
IV can be found.
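A simplified sketch of this merging loop for a numeric DV. It is my reading, with assumptions labeled: "most similar" is taken as the pair whose F*_{i,k} has the largest p-value, and eligibility rules and the Info computation are omitted:

```python
import numpy as np
from scipy import stats

def one_way_p(groups):
    # p-value F(F*; r - 1, N - r) of a candidate split
    N = sum(g.size for g in groups)
    r = len(groups)
    gm = np.concatenate(groups).mean()
    mstr = sum(g.size * (g.mean() - gm) ** 2 for g in groups) / (r - 1)
    mse = sum(((g - g.mean()) ** 2).sum() for g in groups) / (N - r)
    return stats.f.sf(mstr / mse, r - 1, N - r)

def exhaustive_candidates(groups):
    """Greedily merge the most similar pair and record each candidate
    split's p-value (a sketch, not the product's exact procedure)."""
    groups = [np.asarray(g, float) for g in groups]
    candidates = []
    while len(groups) > 1:
        best, best_p = None, -1.0
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                ga, gb = groups[a], groups[b]
                na, nb = ga.size, gb.size
                pm = (ga.sum() + gb.sum()) / (na + nb)
                mstr = na * (ga.mean() - pm) ** 2 + nb * (gb.mean() - pm) ** 2
                mse = (((ga - ga.mean()) ** 2).sum()
                       + ((gb - gb.mean()) ** 2).sum()) / (na + nb - 2)
                p = stats.f.sf(mstr / mse, 1, na + nb - 2) if mse > 0 else 1.0
                if p > best_p:
                    best, best_p = (a, b), p
        a, b = best
        groups[a] = np.concatenate([groups[a], groups[b]])
        del groups[b]
        if len(groups) > 1:
            candidates.append((len(groups), one_way_p(groups)))
    return candidates
```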
Cluster Method
This method also starts with each value of the IV being a separate category. The following occurs at each iteration:
1. For every eligible pair of categories (i, k) its similarity statistic is computed. For a numeric DV this statistic is F^*_{\{i,k\}}; for a categorical DV it is X^2_{\{i,k\}}.
2. The pair with the highest similarity statistic is merged into a single category, thus decreasing the total number of categories r by 1. The new set of categories is the next candidate split and its p-value is computed. For a numeric DV the p-value is F(F^*; r - 1, N - r); for a categorical DV it is \chi^2(X^2; df).
These two steps are repeated until the p-value of the pair with the highest similarity statistic found in step 1 exceeds the user-controlled threshold (this number is entered in the Significance To Merge column of the Tree Attribute Editor, on the Tools menu, where it is set by default to 0.05) or until r = 1. The above p-value is equal to F(F^*_{\{i,k\}}; 1, N - 2) for a numeric DV and \chi^2(X^2_{\{i,k\}}; d - 1) for a categorical DV. Of all candidate splits, the one with the highest Info is chosen. After this split is found, the algorithm attempts to improve the significance by taking single DV values from the categories to which they were assigned and grouping them with other categories. If an improvement is achieved, the split that produced it becomes the candidate. This re-splitting process may return a different split from the one that merging produced.¹ If its Info is above the user-controlled threshold, the split is presented for that IV. Otherwise no significant splits for the IV can be found.

¹ The exact technical details of re-splitting are available on request.
Number of split caches per node. This parameter is used for memory management. It
determines how many splits (one per IV) are stored for each node. If the user wants to come
back to an earlier split node to see its split on a different IV, that split has to be recomputed if
it is not stored.
The following two parameters come into effect when the Automatic option is chosen from the Grow menu.
Auto-Grow Stop size (min. records). If a node contains fewer records than specified by this parameter, that node will not be split;
Auto-Grow maximum tree depth. If the path from the root node to a given node contains
as many nodes as specified by this parameter, that node will not be split. When this parameter
is equal to 0 (as it is by default) it means that tree depth is not restricted.
HeatSEEKER
per tree. One of them is chosen. If the DV is numeric, it will be the node with the lowest standard deviation. If the DV is categorical, it will be the node with the highest mode.² The tree whose node is chosen provides the prediction for this record.
Split Statistics
When evaluating the significance of a split, KnowledgeSTUDIO calculates a number of statistics (apart from record
counts and proportions) that can be viewed in the Split Report tab of a decision tree or displayed directly in the
tree via settings in the Node Detail - Edit menu. Which statistics are reported is measure-dependent, except for Info, which is given for every measure and DV type (it can only be viewed in the Split Report).
Below we provide the list of statistics with mathematical expressions for each measure and DV type.
Statistical Measures
A split is a partition of the node dataset into subsets according to the value of the independent variable. According to
a statistical measure the best split is the one that generates subsets which are least likely to come from the same
distribution. The likelihood is measured by standard statistical tests.
A split is presented if it passes a predetermined significance level. This level can be chosen in two ways:
1. With an unadjusted p-value measure the desired significance level is directly applied to the best split.
2. The adjusted measure takes into account the fact that a number of different splits are evaluated for every IV (see Algorithms above) and only one of them is required to be significant. This split has to pass a more stringent significance test than if only a single split were attempted. The choice of that stringent test is made within the Bonferroni framework by adjusting the p-value of the split. The exact technical details of this adjustment are beyond the scope of this document and are available on request.
Unadjusted p-value
1. Numeric DV

Info = 100 \times C;
df1 = N - r;
df2 = r - 1;
F = F^* = \frac{MSTR}{MSE};
P = PV = F(F^*; df2, df1);
C = 1 - P.

² Mode is the highest proportion of a single value of the DV in the node dataset. In terms of our notation it is \max_{1 \le j \le d} \frac{m_j}{N}.
2. Categorical DV

Info = 100 \times C;
df = (d - 1)(r - 1);
Chi2 = X^2 = \sum_{i=1}^{r} \sum_{j=1}^{d} \frac{(q_{ij} - \hat{q}_{ij})^2}{\hat{q}_{ij}};
P = PV = \chi^2(X^2; df);
C = 1 - P.
Adjusted p-value
1. Numeric DV

Info = 100 \times C;
df1 = N - r;
df2 = r - 1;
F = F^* = \frac{MSTR}{MSE};
uP = PV = F(F^*; df2, df1);
uC = 1 - uP;
P = B \times uP;
C = 1 - P;
B = B_a B_m;

B_m is the additional adjustment factor that the user can specify in the Measure tab of the Tree Configuration dialog when the adjusted measure is chosen. By default it is set to 1.
2. Categorical DV

Info = 100 \times C;
df = (d - 1)(r - 1);
Chi2 = X^2 = \sum_{i=1}^{r} \sum_{j=1}^{d} \frac{(q_{ij} - \hat{q}_{ij})^2}{\hat{q}_{ij}};
uP = PV = \chi^2(X^2; df);
uC = 1 - uP;
P = B \times uP;
C = 1 - P;
B = B_a B_m;

B_m is the additional adjustment factor that the user can specify in the Measure tab of the Tree Configuration dialog when the adjusted measure is chosen. By default it is set to 1.
Entropy Variance
This quantity tells how much information about a DV value is contained in a random draw from the node dataset. If
DV values were spread evenly within that set then, by drawing a record at random, one would gain no information
because every DV value would be equally likely. If, on the contrary, all records had the same DV value, a random
draw would provide perfect information about it. High entropy means low information and vice versa. Entropy E
achieves a maximum of 1 for a uniform (evenly spread) distribution and a minimum of 0 for a degenerate (peaked)
distribution.
1. Numeric DV

Info = 100 \times RatioVariance;
InputVariance = STD;
OutputVariance = SSE;
RatioVariance = 1 - \frac{OutputVariance}{InputVariance}.

³ B_a depends on the number of distinct values of the IV (r), the number of categories formed by the algorithm (Cluster or Exhaustive), the algorithm itself, and the grouping type of the IV. B_a is greater for an unordered than for an ordered IV, greater for Cluster than for Exhaustive, and it increases with the cardinality of the IV.
2. Categorical DV

Info = 100 \times RatioEntropy;
InputEntropy = E;
OutputEntropy = \sum_{i=1}^{r} \frac{n_i}{N} E_i;
RatioEntropy = 1 - \frac{OutputEntropy}{InputEntropy}.
Gini Variance
Gini variance is analogous to entropy variance in that it also measures how evenly (or unevenly) DV values are
spread within a dataset.
1. Numeric DV

Info = 100 \times RatioVariance;
InputVariance = STD;
OutputVariance = SSE;
RatioVariance = 1 - \frac{OutputVariance}{InputVariance}.

2. Categorical DV

Info = 100 \times RatioIndex;
InputIndex = G;
OutputIndex = \sum_{i=1}^{r} \frac{n_i}{N} G_i;
RatioIndex = 1 - \frac{OutputIndex}{InputIndex}.
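Both ratios can be computed directly from the contingency counts q_{ij}, with Info = 100 × the returned ratio. A sketch (function names are mine):

```python
import numpy as np

def ratio_entropy(q):
    """RatioEntropy = 1 - OutputEntropy / InputEntropy from counts q[i, j]."""
    n = q.sum(axis=1)
    m = q.sum(axis=0)
    N = q.sum()
    d = q.shape[1]

    def ent(counts, total):
        p = counts[counts > 0] / total
        return -(p * np.log(p)).sum() / np.log(d)

    input_entropy = ent(m, N)
    output_entropy = sum(n[i] / N * ent(q[i], n[i]) for i in range(q.shape[0]))
    return 1.0 - output_entropy / input_entropy

def ratio_index(q):
    """RatioIndex = 1 - OutputIndex / InputIndex (Gini) from counts q[i, j]."""
    n = q.sum(axis=1)
    m = q.sum(axis=0)
    N = q.sum()
    G = 1.0 - (m ** 2).sum() / N ** 2
    G_out = sum(n[i] / N * (1.0 - (q[i] ** 2).sum() / n[i] ** 2)
                for i in range(q.shape[0]))
    return 1.0 - G_out / G
```

A split that perfectly separates the DV values yields a ratio of 1; a split that leaves the DV distribution unchanged in every child yields 0.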
2 Predictive Models
Notation
The following notation is common to all predictive models. The dataset to which it applies, depending on context,
may be the training dataset or the validation dataset.
N – number of observations (records) in the dataset;

SSR = \sum_{i=1}^{N} (\hat{y}_i - \bar{\hat{y}})^2;

SSE = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2;

MDEV = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|.
Categorical DV
d – number of different values that the DV can take;
E = -\frac{1}{\ln d} \sum_{j=1}^{d} \frac{f_j}{N} \ln \frac{f_j}{N};

\hat{E} = -\frac{1}{\ln d} \sum_{j=1}^{d} \frac{\hat{f}_j}{N} \ln \frac{\hat{f}_j}{N}.
Common Output
After the parameters for a predictive model have been specified, its training progress and results can be seen in three views. The output there depends only on the type of the DV and not on the type of the model (except for regressions, where conventional output is provided; see below). Training of any predictive model is performed by minimizing the appropriate objective function over the records in the training set.
Results View
There are four numeric outputs in this view.
1. Sample Size is the number of records in the training set.
2. Training Results:
Variance/Entropy Explained. For a numeric DV it is 1 - \frac{SSR}{SSTO}; for a categorical DV it is 1 - \frac{\hat{E}}{E};
Records correctly predicted. Self-explanatory;
Percentage correctly predicted. Self-explanatory.
Training View
The output of analytical interest in this view is the pair of green and blue lines that show the training progress. The green line, scaled at the left, shows the value of the objective function divided by the number of records, calculated for the training subset. The blue line shows the value of Variance/Entropy Explained, calculated for the validation subset.
Scoring View
Output in the scoring view becomes available after the model has been validated using the dataset specifically
reserved for this purpose. This dataset must have the same structure as the training dataset and the values of the DV
must be known for that set. After the validation is finished, the table in the bottom part of the view is filled with
output. For a categorical DV the top part of the view displays a confusion matrix.
Regression Models
Linear or logistic regression is used to predict the value of a DV using values of a set of numeric or categorical
independent variables (IV).
A categorical variable is included as an IV in a regression as follows. Consider a categorical variable with n distinct values. These values are ordered lexicographically by their names. The first value in this ordering is chosen as the reference. For every other value a design variable is created; thus, n - 1 design variables are added to the regression. The default choice of the reference value can be changed.
Y = (y_1, y_2, \ldots, y_N)' – column vector of DV values;

b = (b_0, b_1, \ldots, b_p)' – column vector of estimated parameters (in models without intercept, b_0 is omitted);

J – N \times N matrix of 1s.
Linear Regression
In a linear regression the DV must be numeric. KnowledgeSTUDIO estimates a linear regression by ordinary least squares. Thus, the vector of parameter estimates is equal to:

b = (X'X)^{-1} X'Y
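A numerical sketch of the OLS estimate (the data here is made up; `np.linalg.solve` is used on the normal equations rather than forming the inverse explicitly):

```python
import numpy as np

# toy design matrix: intercept column plus one numeric IV
X = np.column_stack([np.ones(5), np.arange(5.0)])
Y = np.array([1.0, 2.1, 2.9, 4.2, 5.0])

# b = (X'X)^{-1} X'Y, computed by solving the normal equations
b = np.linalg.solve(X.T @ X, X.T @ Y)
```

The same estimate is returned by `np.linalg.lstsq(X, Y, rcond=None)`.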
Sequence Statistics
Dependent Variable: name

P-Value: F\left( \frac{MSR}{MSE}; RDF, EDF \right)
Analysis of Variance

Source      | Sum-of-Squares | DF  | Mean-Square
REGRESSION  | SSR            | RDF | SSR / RDF
ERROR       | SSE            | EDF | SSE / EDF
TOTAL       | SSTO           | TDF | SSTO / TDF
Values for b_i

For every coefficient b_i (including those for the design variables of categorical variables) the following are reported:

SE(b_i) = \sqrt{ MSE \cdot \left( (X'X)^{-1} \right)_{ii} };

Wald statistic \left( \frac{b_i}{SE(b_i)} \right)^2;

P-value \chi^2\left( \left( \frac{b_i}{SE(b_i)} \right)^2; 1 \right).
In the above:
TDF = N – number of observations in the regression;
RDF = p – number of variables in the regression excluding the intercept;
EDF = TDF - RDF;

SSR = b'X'Y - \frac{1}{N} Y'JY;

SSTO = Y'Y - \frac{1}{N} Y'JY = y_1^2 + \ldots + y_N^2 - \frac{1}{N}(y_1 + \ldots + y_N)^2;

SSE = SSTO - SSR = (Y - Xb)'(Y - Xb).
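These identities can be checked numerically; for a regression that includes an intercept, SSTO = SSR + SSE holds exactly. A sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 20, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
Y = X @ rng.normal(size=p + 1) + rng.normal(scale=0.1, size=N)

b = np.linalg.solve(X.T @ X, X.T @ Y)
J = np.ones((N, N))                      # N x N matrix of 1s

SSR = b @ X.T @ Y - Y @ J @ Y / N
SSTO = Y @ Y - Y @ J @ Y / N
SSE = (Y - X @ b) @ (Y - X @ b)
```

Note that Y'JY / N is simply (Σy_i)² / N, the correction term for the mean.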
After the set of variables to consider for regression is chosen, there is an option for performing variable selection.
Regressions that are run during variable selection are called sequences and their output is available for viewing (see
above). Six methods are available:
1. Forward Order starts with the base model (which includes the intercept and all variables
chosen to be forced into the model by the user) and adds variables in the order that has been
set in Step 5 of the Neural Net wizard starting from the first.
2. Backward Order starts with the full model (all variables from the set) and deletes variables
in the order determined in Step 5 of the Neural Net wizard starting from the last.
3. Stepwise Selection starts with the base model and, at every step, adds the variable with the highest contribution (in terms of the partial F test) to the explanatory power of the model chosen in the previous step, provided this contribution is higher than the predetermined level chosen together with the variable selection method (Significance Limits – Entry, set by default to 0.15). After each addition, every previously added variable is checked, and the variable with the smallest contribution is dropped if its Wald statistic is below the predetermined level (Significance Limits – Exit, with a default value of 0.15). The selection process also stops if a variable is cycled (i.e. included and excluded) three times.
4. Forward Variable Selection is a simplification of Stepwise Selection where no
variables are dropped after they have been added.
5. Backward Variable Selection is analogous to Forward Variable Selection but the
process starts with the full model and variables are dropped based on the criterion outlined in
Stepwise Selection.
6. R-Squared Selection simply runs regressions on all subsets of the chosen set of variables.
Logistic Regression
For logistic regression the DV must be categorical. KnowledgeSTUDIO estimates binary or polytomous logistic
regression. Parameters are estimated by the maximum likelihood method.
Let Y take values in \{v_0, \ldots, v_d\} with v_0 being the reference value. Denote:

\pi_j(x_i) = P(Y = v_j \mid x_i), \quad j = 0, \ldots, d;

g_j(x_i) = \ln \frac{\pi_j(x_i)}{\pi_0(x_i)} = b_{j0} + b_{j1} x_{i1} + \ldots + b_{jp} x_{ip}, \quad j = 1, \ldots, d;

where x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})'.

Thus,

\pi_0(x_i) = \frac{1}{1 + \sum_{k=1}^{d} e^{g_k(x_i)}} \quad and \quad \pi_j(x_i) = \frac{e^{g_j(x_i)}}{1 + \sum_{k=1}^{d} e^{g_k(x_i)}} \quad for j = 1, \ldots, d.
Let m_i be the value index of y_i, i.e. m_i = m if y_i = v_m. Then the log-likelihood function is:

L(b) = \sum_{i=1}^{N} \ln \pi_{m_i}(x_i) = \sum_{i=1}^{N} \left\{ g_{m_i}(x_i) - \ln \left[ 1 + \sum_{k=1}^{d} e^{g_k(x_i)} \right] \right\}

(with g_0(x_i) \equiv 0 for the reference value).
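The class probabilities and the log-likelihood can be sketched as follows (names are illustrative; the rows of B hold (b_{j0}, …, b_{jp}) for j = 1, …, d):

```python
import numpy as np

def class_probs(B, x):
    """(pi_0(x), ..., pi_d(x)) for a polytomous logistic model;
    class 0 is the reference, so g_0 = 0 is implicit."""
    g = B @ np.concatenate([[1.0], x])       # g_j(x), j = 1, ..., d
    e = np.exp(g)
    return np.concatenate([[1.0], e]) / (1.0 + e.sum())

def log_likelihood(B, X, y):
    """L(b) = sum_i ln pi_{m_i}(x_i); y[i] is the value index m_i in 0..d."""
    return sum(np.log(class_probs(B, x)[m]) for x, m in zip(X, y))
```

With all parameters zero the model assigns equal probability 1 / (d + 1) to every class, which gives a quick sanity check.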
Sequence Statistics
Dependent Variable: name

Entropy Explained: 1 - \frac{L(b)}{L(b_0)}

Chi-Square: G = -2[L(b_0) - L(b)]    Chi-Square Degrees of Freedom: p

P-Value: \chi^2(G; p)
In the above:
3 Cluster Analysis
Cluster analysis is a technique for grouping individuals or objects into clusters so that objects in the same cluster are more like each other than they are like objects in other clusters. The resulting clusters should then exhibit high internal (within-cluster) homogeneity and high external (between-cluster) heterogeneity with respect to some predetermined selection criteria.
Two clustering algorithms are provided in KnowledgeSTUDIO: K-Means and Expectation-Maximization (EM). Both are nonhierarchical clustering procedures, which do not involve tree-like construction. K-Means measures similarity by distance; EM is based on probability theory.
Two preparation steps (before clustering) are common to both K-Means and EM. The first is to normalize the attributes so that their values are on a comparable scale. The second is to select the initial cluster seeds.
Normalize Attributes
For the selected attributes, the original data is rescaled to new unitless attributes.
If an attribute is a continuous variable, the data matrix is standardized to convert the original attribute scores to new attributes with a mean of 0 and a standard deviation of 1.
If an attribute is an ordered variable, it is rescaled in the same way as a continuous variable.
If an attribute is qualitative, each category is converted into a dummy binary variable.
Now we have J sets of cluster solutions CM_i, i = 1, \ldots, J; the CM_i together form a new dataset CM;
Initialize cluster seeds from each CM_i and use K-Means to cluster CM again, producing a cluster solution FM_i;
The FM_i with minimal distortion over the dataset CM becomes the set of initial cluster seeds.
E_{jk} = \sum_{i=1}^{I} (X_{ij} - M_{ik})^2;

i = 1, \ldots, I attributes;
j = 1, \ldots, N objects;
k = 1, \ldots, K cluster seeds.

Each object j is assigned to the seed k achieving \min_{k = 1, \ldots, K} E_{jk}.
Each cluster seed is recomputed to be the mean of all objects within its cluster:

M_{ik} = \frac{1}{N_k} \sum_{j=1}^{N_k} x_{ij}.

Iteration stops when the value of E no longer decreases.
K final clusters are composed of the objects surrounding the K cluster seeds from the final iteration.
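The two K-Means steps above (assign each object to its nearest seed, then recompute each seed as the cluster mean) can be sketched as:

```python
import numpy as np

def kmeans(X, seeds, iters=100):
    """Plain K-Means sketch: assign each object to its nearest seed,
    then recompute each seed as the mean of its cluster."""
    M = np.asarray(seeds, float).copy()
    for _ in range(iters):
        # squared distance E_{jk} from object j to seed k
        d2 = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_M = np.array([X[labels == k].mean(axis=0) if (labels == k).any()
                          else M[k] for k in range(len(M))])
        if np.allclose(new_M, M):
            break
        M = new_M
    return labels, M
```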
Typically one chooses w_j = \frac{1}{k}.

EM estimates a probability density function (PDF) for each class and then performs classification based on Bayes' rule:

Posterior probability for class j: P(j \mid x^n) = \frac{P(x^n \mid j) \, P(j)}{P(x^n)},

where P(x^n \mid j) is the PDF for class j.

EM estimates P(x^n \mid j) as a weighted average of multiple Gaussians:
P(x^n \mid j) = \sum_{l=1}^{N_j} w_l G_l, \quad G_l = \frac{1}{\sqrt{(2\pi)^n |\Sigma_j|}} \, e^{-\frac{1}{2}(x^n - \mu_j)' \Sigma_j^{-1} (x^n - \mu_j)};

\mu_j – mean of the Gaussian;
\Sigma_j – covariance matrix.

P(x^n) is the overall PDF:

P(x^n) = \sum_{j=1}^{K} P(x^n \mid j) \, P(j);

P(j) = \frac{N_j}{N}.
Calculate the likelihood function:

L_j = \prod_{n=1}^{N} P(x^n \mid j).

Find a new set of parameter values that maximizes the likelihood, in an iterative fashion.
w_j^{new} = \frac{1}{d} \sum_{n=1}^{N_j} P^{old}(j \mid x^n);

\mu_j^{new} = \frac{ \sum_{n=1}^{N} P^{old}(j \mid x^n) \, x^n }{ \sum_{n=1}^{N} P^{old}(j \mid x^n) };

(\sigma_j^{new})^2 = \frac{ \sum_{n=1}^{N} P^{old}(j \mid x^n) \, (x^n - \mu_j^{new})^2 }{ d \sum_{n=1}^{N} P^{old}(j \mid x^n) };

P^{new}(j) = \frac{1}{N} \sum_{n=1}^{N} P^{old}(j \mid x^n).
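One EM iteration for a one-dimensional Gaussian mixture, mirroring the standard form of these updates (variable names are mine; for d = 1 the 1/d factors drop out). Each iteration is guaranteed not to decrease the likelihood:

```python
import numpy as np

def em_step(x, mu, sigma2, pi):
    """One EM iteration for a 1-D Gaussian mixture (sketch)."""
    # E-step: posterior P(j | x^n) by Bayes' rule
    dens = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma2)) \
        / np.sqrt(2 * np.pi * sigma2)
    post = dens * pi
    post /= post.sum(axis=1, keepdims=True)
    # M-step: re-estimate means, spreads and mixing proportions
    w = post.sum(axis=0)
    mu_new = (post * x[:, None]).sum(axis=0) / w
    sigma2_new = (post * (x[:, None] - mu_new[None, :]) ** 2).sum(axis=0) / w
    pi_new = w / x.size
    return mu_new, sigma2_new, pi_new

def log_lik(x, mu, sigma2, pi):
    dens = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma2)) \
        / np.sqrt(2 * np.pi * sigma2)
    return np.log((dens * pi).sum(axis=1)).sum()
```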
Then treat the "new" parameters as "old" and search again for new parameters that maximize the likelihood L_j.
Repeat this iterative process until the parameter values converge, at which point the likelihood is maximized.
K clusters are formed with centroids given by the final K seeds of parameter values.
Assign each record to the cluster with the maximum likelihood score.
Calculate the likelihood of that record for each distribution:

L_j(x^n) = P(x^n \mid j) \, P(j).
Normalize attributes
With selected attributes for both output and input variables, the original data is rescaled to new unitless attributes.
If an attribute is continuous, standardize the data matrix to convert original attribute scores to new
attributes with mean of 0 and standard deviation of 1.
If an attribute is ordered, the treatment of rescaling is the same as that for a continuous variable.
If an attribute is qualitative, convert each category into a dummy binary variable.
Y = \frac{1 - e^{-net}}{1 + e^{-net}} – hyperbolic tangent function;

Y = net – linear function;

Y = \frac{1}{1 + e^{-net}} – sigmoid function;

where net = w_0 + \sum_{i=1}^{N} w_i x_i.
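These activation functions, together with the derivative formulas used for backpropagation in the original (sigmoid: Y(1 − Y); hyperbolic tangent form: ½(1 + Y)(1 − Y)), can be checked against a numerical derivative. Note the document's "hyperbolic tangent" form equals tanh(net/2):

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def hyp_tan(net):
    # the document's hyperbolic tangent: (1 - e^{-net}) / (1 + e^{-net})
    return (1.0 - np.exp(-net)) / (1.0 + np.exp(-net))

def d_sigmoid(y):
    return y * (1.0 - y)                 # dY/dnet = Y(1 - Y)

def d_hyp_tan(y):
    return 0.5 * (1.0 + y) * (1.0 - y)   # dY/dnet = (1/2)(1 + Y)(1 - Y)
```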
Backpropagation is one of many error-minimization methods; it tunes these weights to generate the desired smooth nonlinear mapping between input and output variables.
w_{ij} – weight between node i of layer K-1 and node j of layer K;
E_l = \frac{1}{2} \sum_{j=1}^{N_k} (d_{jl} - y_{jl})^2;

d_{jl} – desired value of output l given by input node j;
The backpropagation procedure trains a neural net to minimize the error function E_l at the output layer K+1: pick the learning rate \eta and momentum \alpha by the conjugate method and then adjust the weight w_{ij} between node j and input node i:

\Delta w_{ij} = -\eta \frac{\partial E_l}{\partial w_{ij}} + \alpha (\Delta w'_{ij} - \Delta w''_{ij});

\frac{\partial E_l}{\partial w_{ij}} – partial derivative of the error with respect to w_{ij};

\eta – step size of the steepest gradient descent; the learning rate affects how much the weights are changed on each update;

\alpha – momentum term that "smooths" the minimization over successive descents, i.e. a fraction of the previous step is included in the current step.

Compute \frac{\partial E_l}{\partial w_{ij}} from the output layer down to the input layer, using the chain rule:
\frac{\partial E_l}{\partial w_{ij}} = \frac{\partial net_{jk}}{\partial w_{ij}} \cdot \frac{\partial y_{jk}}{\partial net_{jk}} \cdot \frac{\partial E_l}{\partial y_{jk}};

where \frac{\partial net_{jk}}{\partial w_{ij}} = x_i;

\frac{\partial y_{jk}}{\partial net_{jk}} = y_{jk}(1 - y_{jk}) for the sigmoid function Y = \frac{1}{1 + e^{-net}};

\frac{\partial y_{jk}}{\partial net_{jk}} = 1 for the linear function Y = net;

\frac{\partial y_{jk}}{\partial net_{jk}} = \frac{1}{2}(1 + y_{jk})(1 - y_{jk}) for the hyperbolic tangent function Y = \frac{1 - e^{-net}}{1 + e^{-net}};

\frac{\partial E_l}{\partial y_{jk}} = d_{jl} - y_{jl} for the output layer K+1;

\frac{\partial E_j}{\partial y_{ik-1}} = \sum_{j=1}^{N_k} \frac{\partial net_{jk}}{\partial y_{ik-1}} \cdot \frac{\partial y_{jk}}{\partial net_{jk}} \cdot \frac{\partial E_j}{\partial y_{jk}} = \sum_{j=1}^{N_k} w_{ij} \cdot y_{jk}(1 - y_{jk}) \cdot \frac{\partial E_j}{\partial y_{jk}}

for hidden layer K.
Then the weight changes are added up over all sample inputs, and the weights are changed.
Each cycle of backpropagation training is called an epoch. Once the specified number of epochs or the specified error value is reached, training stops; the weights generated in the final epoch become the final weights of the neural network model.
x^n is the input vector;
t^n is the target vector.
The statistical properties of the generator of the data are given by the probability density p(x, t) in the joint input-target space. p(x, t) is modeled using a Parzen kernel estimator constructed from the data set:

y(x) = \langle t \mid x \rangle = \frac{ \sum_n t^n \, e^{-\|x - x^n\|^2 / 2h^2} }{ \sum_n e^{-\|x - x^n\|^2 / 2h^2} }.

This is called the Nadaraya-Watson estimator. Each basis function is centred on a data point and the coefficients in the expansion are given by the target values t^n.
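A minimal sketch of the estimator with Gaussian kernels (names are mine):

```python
import numpy as np

def nadaraya_watson(x, X, T, h):
    """y(x) = sum_n t^n K_n / sum_n K_n, K_n = exp(-||x - x^n||^2 / 2h^2)."""
    K = np.exp(-((X - x) ** 2).sum(axis=1) / (2.0 * h ** 2))
    return (T * K).sum() / K.sum()
```

For a small bandwidth h the prediction at a training point approaches its target; the output is always a weighted average of the targets.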
This approach is then extended by replacing the kernel estimator with an adaptive mixture model. The parameters of the mixture model are estimated using the expectation-maximization (EM) algorithm. For a mixture of M spherical Gaussian functions, the joint density is:

\hat{p}(x, t) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{(2\pi h^2)^{(d+c)/2}} \exp\left\{ -\frac{\|x - x^n\|^2}{2h^2} - \frac{\|t - t^n\|^2}{2h^2} \right\}.
Then a normalized radial basis function is calculated in which the basis function centers are no longer constrained to coincide with the data points:

y(x) = \frac{ \sum_j P(j) \, v_j \, e^{-\|x - \mu_j\|^2 / 2h^2} }{ \sum_j P(j) \, e^{-\|x - \mu_j\|^2 / 2h^2} }.
Y = w_0 + \sum_{i=1}^{N} w_i \exp\left[ -\frac{(x - \mu_i)'(x - \mu_i)}{2\sigma_i^2 h} \right];

\mu_i – center of the i-th basis function;
\sigma_i – spread of the i-th basis function;
h – overlap parameter, which controls the degree of overlap between basis functions;
w_0 – bias.
RBF shares the same neural network structure as the MLP. With a single hidden layer, assume K-1 is the input layer, K is the hidden layer and K+1 is the output layer.
Use K-Means to compute the center and spread of the Gaussians for node j of layer K:
where \mu_m = \frac{1}{N_m} \sum_{n \in S_m} X^n.

1. Assign each data point to a new cluster according to the nearest \mu_m, in order to minimize J;
2. Recompute \mu_m and repeat until there are no further changes in the grouping of data points.
Compute the output:

y_{jl} = w_{0j} + \sum_{m=1}^{N_m} w_{mj} \phi_m(X);

where w_{mj} – weights between the j-th output and the m-th normalized Gaussian;

E = \frac{1}{2} \sum_{j=1}^{N_K} (d_j - y_j)^2;

W^T = (\phi^T \phi)^{-1} \phi^T D;

where \phi = \begin{pmatrix} 1 & G_1 \\ 1 & G_2 \\ \vdots & \vdots \\ 1 & G_N \end{pmatrix}, \quad D = \begin{pmatrix} d_1 \\ d_2 \\ \vdots \\ d_N \end{pmatrix};

G_i = [\phi_1(X), \phi_2(X), \ldots, \phi_M(X)] is the row vector of Gaussian outputs for the i-th input node;

d_i is the desired output for the i-th input node.
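The output-weight computation can be sketched with synthetic data (the centers, spread and target below are made up): the design matrix Φ gets a bias column of 1s next to the Gaussian outputs, and W solves the normal equations:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))             # 30 objects, 2 attributes
centers = X[:5]                          # stand-in for K-Means centers
s2 = 1.0                                 # common spread, for simplicity

# Gaussian outputs G_i = [phi_1, ..., phi_M] plus a bias column of 1s
G = np.exp(-((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2) / (2 * s2))
Phi = np.column_stack([np.ones(len(X)), G])
D = X[:, 0] + X[:, 1]                    # a toy desired output

W = np.linalg.solve(Phi.T @ Phi, Phi.T @ D)   # W = (Phi' Phi)^{-1} Phi' D
```

Because the hidden-to-output map is linear in the weights, this least-squares solve replaces iterative training for the output layer.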
ANG-00108