
Mathematical Formulations

Technical Notes for


KnowledgeSTUDIO version 3

Angoss Software Corporation


34 St Patrick Street, Suite 200
Toronto, ON M1T 1V1
Canada

Tel: 416-593-1122
Fax: 416-593-5077

www.angoss.com

November 28, 2000



Table of Contents

1 Decision Trees
1.1 Notation
1.2 Algorithms
1.3 Measures for Splitting a Node
1.4 Thresholds for Informativeness of Splits

2 Predictive Models
2.1 Notation
2.2 Common Output
2.3 Regression Models

3 Cluster Analysis
3.1 K-Means Algorithm Procedure
3.2 Expectation-Maximization Algorithm Procedure

4 Neural Network Algorithms
4.1 Multi-layer Perceptron (MLP)
4.2 Probabilistic Neural Network (PNN)
4.3 Radial Basis Function (RBF)


1 Decision Trees
Notation
Y – dependent variable (DV);
N – number of records at the node.

For a given independent variable (IV):


Category – a group of values that a variable can take (ranging from a single value to all possible values);
r – number of disjoint categories;
$v_i$ – i-th category;
$n_i$ – number of records at the node with IV in $v_i$.
Numeric DV
$y_{i1}, \ldots, y_{in_i}$ – values of the DV for records with IV in $v_i$;

$y_i = \sum_{j=1}^{n_i} y_{ij}$;

$\bar{y}_i = \dfrac{y_i}{n_i}$;

$y = \sum_{i=1}^{r} y_i = \sum_{i=1}^{r}\sum_{j=1}^{n_i} y_{ij}$;

$\bar{y} = \dfrac{y}{N}$;

$STD = \sum_{i=1}^{r}\sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2$;

$SSTR = \sum_{i=1}^{r} n_i (\bar{y}_i - \bar{y})^2$;

$SSE = \sum_{i=1}^{r}\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2$;

$MSTR = \dfrac{SSTR}{r-1}$;

$MSE = \dfrac{SSE}{N-r}$;

$F^* = \dfrac{MSTR}{MSE}$;

$PV = F(F^*; r-1, N-r)$.

If i and k are two categories of the IV then:

$MSTR_{\{i,k\}} = n_i \left( \bar{y}_i - \dfrac{y_i + y_k}{n_i + n_k} \right)^2 + n_k \left( \bar{y}_k - \dfrac{y_i + y_k}{n_i + n_k} \right)^2$;

$MSE_{\{i,k\}} = \dfrac{\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2 + \sum_{j=1}^{n_k} (y_{kj} - \bar{y}_k)^2}{n_i + n_k - 2}$;

$F^*_{\{i,k\}} = \dfrac{MSTR_{\{i,k\}}}{MSE_{\{i,k\}}}$.
If i and k are two categories of the IV then:
2 2
 y + yk   y + yk 
MSTR{i ,k } = ni  y i − i  + n k  y k − i  ;
 ni + n k   ni + n k 
ni nk

∑ ( yij − yi ) 2 + ∑ ( y kj − y k ) 2
j =1 j =1
MSE{i ,k } = ;
ni + n k − 2
MSTR{i , k }
F{*i ,k } = .
MSE{i ,k }
Categorical DV
d – number of distinct values that the DV can take;

$w_j$ – j-th value of the DV;

$q_{ij}$ – number of records with DV $= w_j$ and IV in $v_i$;

$m_j$ – number of records with DV $= w_j$;

$\hat{q}_{ij} = \dfrac{n_i}{N} m_j$ – expected count under independence.

We have:

$m_j = \sum_{i=1}^{r} q_{ij}$,  $n_i = \sum_{j=1}^{d} q_{ij}$,  $N = \sum_{j=1}^{d} m_j = \sum_{i=1}^{r} n_i = \sum_{i=1}^{r}\sum_{j=1}^{d} q_{ij}$;

$X^2 = \sum_{i=1}^{r}\sum_{j=1}^{d} \dfrac{(q_{ij} - \hat{q}_{ij})^2}{\hat{q}_{ij}}$;

$X^2_{\{i,k\}} = \sum_{j=1}^{d} \dfrac{(q_{ij} - \hat{q}_{ij})^2}{\hat{q}_{ij}} + \sum_{j=1}^{d} \dfrac{(q_{kj} - \hat{q}_{kj})^2}{\hat{q}_{kj}}$;

$PV = \chi^2(X^2; df)$ where $df = (d-1)(r-1)$;

$E_i = -\dfrac{1}{\ln d} \sum_{j=1}^{d} \dfrac{q_{ij}}{n_i} \ln \dfrac{q_{ij}}{n_i}$;

$E = -\dfrac{1}{\ln d} \sum_{j=1}^{d} \dfrac{m_j}{N} \ln \dfrac{m_j}{N}$;

$G = 1 - \dfrac{\sum_{j=1}^{d} m_j^2}{N^2}$;

$G_i = 1 - \dfrac{\sum_{j=1}^{d} q_{ij}^2}{n_i^2}$.

Algorithms

Treatment of a Numeric IV
To use a numeric IV in a decision tree it must be made categorical. KnowledgeSTUDIO breaks the range of values
of the IV into a user-controlled number of intervals (set by default to 10) so that the number of records for each
interval is the same. Each interval is then considered a category.
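As a rough illustration, the sketch below (Python with pandas; the function name and the example column are hypothetical, the 10-interval default follows the description above) shows this kind of equal-frequency binning.

    import pandas as pd

    def bin_numeric_iv(values, n_bins=10):
        """Equal-frequency binning: break the range of a numeric IV into n_bins
        intervals so that each interval holds roughly the same number of records;
        each interval is then treated as a category."""
        # pd.qcut cuts at quantiles; duplicate edges are dropped when the
        # variable has many tied values.
        return pd.qcut(values, q=n_bins, duplicates="drop")

    # bins = bin_numeric_iv(df["income"], n_bins=10)   # "income" is a hypothetical IV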
KnowledgeSEEKER

Finding a Split for a given IV


The split is found by iteratively grouping categories of the IV into new categories. An eligible pair of categories is
grouped into a single category if the two are “similar” with respect to the DV. For an unordered IV all pairs are
eligible. For an ordered IV only adjacent pairs are. There are two grouping methods: exhaustive and cluster.

Exhaustive Method
This method starts with each value of the IV being a separate category. The following occurs at each iteration:
1. For every eligible pair of categories $(i, k)$ its similarity statistic is computed. For a numeric DV this statistic is $F^*_{\{i,k\}}$; for a categorical DV it is $X^2_{\{i,k\}}$.
2. The pair with the highest similarity statistic is merged into a single category, thus decreasing the total number of categories r by 1. The new set of categories is the next candidate split and its p-value is computed. For a numeric DV the p-value is $F(F^*; r-1, N-r)$; for a categorical DV it is $\chi^2(X^2; df)$.

These two steps are repeated until $r = 1$. Of all candidate splits, the one with the highest Info (see below) is chosen. If Info is above the user-controlled threshold, the split is presented for that IV. Otherwise no significant split for the IV can be found.
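The sketch below illustrates this merging loop for a categorical DV with an unordered IV. It is a simplified illustration, not the Angoss implementation: scipy's chi-square test stands in for $X^2_{\{i,k\}}$ and $X^2$, "most similar pair" is read as the pair with the largest pairwise p-value, and DV values with zero counts are assumed absent.

    import numpy as np
    from scipy.stats import chi2_contingency

    def exhaustive_merge(counts):
        """counts: r x d array of counts q_ij (rows = IV categories, columns = DV values).
        Repeatedly merges the most similar pair of categories and keeps the candidate
        split with the highest Info = 100 * (1 - p-value of the full chi-square test)."""
        groups = [[i] for i in range(counts.shape[0])]     # start: one category per IV value
        best_info, best_groups = -1.0, [g[:] for g in groups]
        while len(groups) > 1:
            # most similar eligible pair (unordered IV: all pairs are eligible)
            best_pair, best_p = None, -1.0
            for a in range(len(groups)):
                for b in range(a + 1, len(groups)):
                    table = np.vstack([counts[groups[a]].sum(axis=0),
                                       counts[groups[b]].sum(axis=0)])
                    p = chi2_contingency(table)[1]         # large p: similar w.r.t. the DV
                    if p > best_p:
                        best_pair, best_p = (a, b), p
            a, b = best_pair
            groups[a] += groups.pop(b)                     # merge: r decreases by 1
            if len(groups) >= 2:
                merged = np.vstack([counts[g].sum(axis=0) for g in groups])
                pv = chi2_contingency(merged)[1]           # p-value of the candidate split
                info = 100.0 * (1.0 - pv)
                if info > best_info:
                    best_info, best_groups = info, [g[:] for g in groups]
        return best_groups, best_info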

Cluster Method
This method also starts with each value of the IV being a separate category. The following occurs at each iteration:
1. For every eligible pair of categories $(i, k)$ its similarity statistic is computed. For a numeric DV this statistic is $F^*_{\{i,k\}}$; for a categorical DV it is $X^2_{\{i,k\}}$.
2. The pair with the highest similarity statistic is merged into a single category, thus decreasing the total number of categories r by 1. The new set of categories is the next candidate split and its p-value is computed. For a numeric DV the p-value is $F(F^*; r-1, N-r)$; for a categorical DV it is $\chi^2(X^2; df)$.

These two steps are repeated until the p-value of the pair with the highest similarity statistic found in step 1 exceeds the user-controlled threshold (this number is entered in the Significance To Merge column of the Tree Attribute Editor - Tools menu - where it is set by default to 0.05) or until $r = 1$. The above p-value equals $F(F^*_{\{i,k\}}; 1, N-2)$ for a numeric DV and $\chi^2(X^2_{\{i,k\}}; d-1)$ for a categorical DV. Of all candidate splits, the one with the highest Info is chosen. After this split is found, the algorithm attempts to improve its significance by taking single IV values out of the categories to which they were assigned and grouping them with other categories. If an improvement is achieved, the split that produced it becomes the candidate. This re-splitting process may therefore return a different split from the one produced by merging¹. If its Info is above the user-controlled threshold, the split is presented for that IV. Otherwise no significant split for the IV can be found.

Measuring Informativeness of splits obtained for each IV


In the first part of the algorithm at most one split has been found for each IV. Informativeness of these splits is
computed according to the split measure in effect. Split measures are described below. A split is informative if it
passes the user-controlled threshold, set in the Measure tab of the Tree Configuration dialog - Tools menu - or
in Step 3 of the Insert Tree wizard in the Maximum Attr. Info and Minimum Attr. Info fields. The user can
choose any informative split. The most informative one is chosen by default.

Growth Control Parameters


The tree growth is also constrained by a number of user-controlled parameters found in the Tree Growth tab of the
Tree Configuration option - Tools menu. These parameters are defined as follows:
Maximum Number of IV Categories. If an IV has more categories than this number, it
will not be used for splitting;
Minimum Node Creation Size. If, at the end of a splitting algorithm (Cluster or
Exhaustive), a node in the found split contains fewer records than this number, that node is
merged with an adjacent node (for an ordered IV) or with the most statistically similar node
(for an unordered IV);

1
The exact technical details of re-splitting are available on request.


Number of split caches per node. This parameter is used for memory management. It
determines how many splits (one per IV) are stored for each node. If the user wants to come
back to an earlier split node to see its split on a different IV, that split has to be recomputed if
it is not stored.
The following two parameters come into effect when the Automatic option is chosen from the Grow menu.
Auto-Grow Stop size (min. records). If a node contains fewer records than specified by
this parameter, that node will not be split;
Auto-Grow maximum tree depth. If the path from the root node to a given node contains
as many nodes as specified by this parameter, that node will not be split. When this parameter
is equal to 0 (as it is by default) it means that tree depth is not restricted.
HeatSEEKER

Finding a Split for a given IV


This part of the algorithm uses the Cluster method described for the KnowledgeSEEKER algorithm above.

Measuring informativeness of splits obtained for each IV


This part too is performed exactly as in KnowledgeSEEKER.

Growth control parameters


These parameters can be controlled in the Algorithm tab of the Tree Configuration dialog - Tools menu.
Maximum tree Depth. Has the same meaning as Auto-Grow maximum tree depth
above;
Maximum tree Width. Controls the splitting procedure. If a candidate split has more nodes
than this number, nodes are further merged on the basis of statistical similarity. Formally, this
constraint overrides Significance to Merge in the Cluster splitting method;
Minimum node Population. Has the same meaning as Minimum Node Creation Size
above;
Number of split caches per node. Same as in KnowledgeSEEKER;
Minimum Matching Objectives. Analogous to Minimum node Population;
Discrete Information Loss.
Parameters that control automatic growth have the same definition as in KnowledgeSEEKER. They can be accessed
in the Grow Automatic dialog that appears when the Automatic option is chosen from the Grow menu.
Voting
The Automatic dialog from the Grow menu allows the user to grow multiple trees. The Number of Trees to
grow field (denoted in this section by t) indicates how many trees will be grown. When the Grow Multiple Trees
for Voting Score radio button is checked KnowledgeSTUDIO does the following: the t most informative splitting
variables are chosen at the root node. For each of these variables a separate tree is grown automatically. Each of
these trees can be viewed in the Go To Split option - Reshape menu. The t trees are tied together in the voting
procedure when they are turned into a predictive model. When a prediction is made for a record, it is first made by every grown tree, where the record falls into one of the leaf nodes. Then, for that record, we have t candidate leaf nodes, one per tree. One of them is chosen: if the DV is numeric, it will be the node with the lowest standard deviation; if the DV is categorical, it will be the node with the highest mode (the mode is the highest proportion of a single value of the DV in the node dataset, i.e. $\max_{1 \le j \le d} \frac{m_j}{N}$ in our notation). The tree whose node is chosen provides the prediction for this record.

Measures for Splitting a Node

Split Statistics
When evaluating the significance of a split, KnowledgeSTUDIO calculates a number of statistics (apart from record
counts and proportions) that can be viewed in the Split Report tab of a decision tree or displayed directly in the
tree via settings in the Node Detail - Edit menu. Which statistics are reported is measure-dependent except Info
which is given for every measure and DV type (it can only be viewed in the Split Report).
Below we provide the list of statistics with mathematical expressions for each measure and DV type.
Statistical Measures
A split is a partition of the node dataset into subsets according to the value of the independent variable. According to
a statistical measure the best split is the one that generates subsets which are least likely to come from the same
distribution. The likelihood is measured by standard statistical tests.
A split is presented if it passes a predetermined significance level. This level can be chosen in two ways:
1. With an unadjusted p-value measure the desired significance level is directly applied to the
best split.
2. The adjusted measure takes into account the fact that a number of different splits is evaluated
for every IV (see Algorithms above) and only one of them is required to be significant. This
split has to pass a more stringent significance test than if only a single split was attempted.
The choice of that stringent test is made within the Bonferroni framework by adjusting the p-
value of the split. Exact technical details of this adjustment are beyond the current scope of
this document and are available on request.

Unadjusted p-value
1. Numeric DV
$Info = 100 \times C$;
$df1 = N - r$;
$df2 = r - 1$;
$F = F^* = \dfrac{MSTR}{MSE}$;
$P = PV = F(F^*; df2, df1)$;
$C = 1 - P$.

2. Categorical DV
$Info = 100 \times C$;
$df = (d-1)(r-1)$;
$Chi2 = X^2 = \sum_{i=1}^{r}\sum_{j=1}^{d} \dfrac{(q_{ij} - \hat{q}_{ij})^2}{\hat{q}_{ij}}$;
$P = PV = \chi^2(X^2; df)$;
$C = 1 - P$.
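A minimal sketch of these formulas, assuming scipy and reading $F(\cdot)$ and $\chi^2(\cdot)$ as upper-tail probabilities (so that P falls below the significance level for an informative split):

    from scipy.stats import f, chi2

    def info_numeric(MSTR, MSE, r, N):
        """Unadjusted p-value measure for a numeric DV."""
        F_star = MSTR / MSE
        P = f.sf(F_star, r - 1, N - r)     # PV, upper-tail probability of F*
        return 100.0 * (1.0 - P)           # Info = 100 * C with C = 1 - P

    def info_categorical(X2, d, r):
        """Unadjusted p-value measure for a categorical DV."""
        df = (d - 1) * (r - 1)
        P = chi2.sf(X2, df)                # PV, upper-tail probability of X^2
        return 100.0 * (1.0 - P)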

Adjusted p-value
1. Numeric DV

$Info = 100 \times C$;
$df1 = N - r$;
$df2 = r - 1$;
$F = F^* = \dfrac{MSTR}{MSE}$;
$uP = PV = F(F^*; df2, df1)$;
$uC = 1 - uP$;
$P = B \times uP$;
$C = 1 - P$;
$B = B_a B_m$;

$B_m$ is the additional adjustment factor that the user can specify in the Measure tab of the Tree Configuration dialog when the adjusted measure is chosen. By default it is set to 1;

$B_a$ is the Bonferroni adjustment factor determined automatically by the algorithm (see above)³.

2. Categorical DV

$Info = 100 \times C$;
$df = (d-1)(r-1)$;
$Chi2 = X^2 = \sum_{i=1}^{r}\sum_{j=1}^{d} \dfrac{(q_{ij} - \hat{q}_{ij})^2}{\hat{q}_{ij}}$;
$uP = PV = \chi^2(X^2; df)$;
$uC = 1 - uP$;
$P = B \times uP$;
$C = 1 - P$;
$B = B_a B_m$;

$B_m$ is the additional adjustment factor that the user can specify in the Measure tab of the Tree Configuration dialog when the adjusted measure is chosen. By default it is set to 1;

$B_a$ is the Bonferroni adjustment factor determined automatically by the algorithm (see above)³.

³ $B_a$ depends on the number of distinct values of the IV (r), the number of categories formed by the algorithm (Cluster or Exhaustive), the algorithm itself and the grouping type of the IV. $B_a$ is greater for an unordered than for an ordered IV, greater for Cluster than for Exhaustive, and it increases with the cardinality of the IV.
Non-statistical Measures

Entropy Variance
This quantity tells how much information about a DV value is contained in a random draw from the node dataset. If
DV values were spread evenly within that set then, by drawing a record at random, one would gain no information
because every DV value would be equally likely. If, on the contrary, all records had the same DV value, a random
draw would provide perfect information about it. High entropy means low information and vice versa. Entropy E
achieves a maximum of 1 for a uniform (evenly spread) distribution and a minimum of 0 for a degenerate (peaked)
distribution.

1. Numeric DV

$Info = 100 \times RatioVariance$;
$InputVariance = STD$;
$OutputVariance = SSE$;
$RatioVariance = 1 - \dfrac{OutputVariance}{InputVariance}$.

2. Categorical DV

$Info = 100 \times RatioEntropy$;
$InputEntropy = E$;
$OutputEntropy = \sum_{i=1}^{r} \dfrac{n_i}{N} E_i$;
$RatioEntropy = 1 - \dfrac{OutputEntropy}{InputEntropy}$.

Gini Variance
Gini variance is analogous to entropy variance in that it also measures how evenly (or unevenly) DV values are
spread within a dataset.

1. Numeric DV

$Info = 100 \times RatioVariance$;
$InputVariance = STD$;
$OutputVariance = SSE$;
$RatioVariance = 1 - \dfrac{OutputVariance}{InputVariance}$.

2. Categorical DV

$Info = 100 \times RatioIndex$;
$InputIndex = G$;
$OutputIndex = \sum_{i=1}^{r} \dfrac{n_i}{N} G_i$;
$RatioIndex = 1 - \dfrac{OutputIndex}{InputIndex}$.
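Both non-statistical measures for a categorical DV can be computed directly from the table of counts $q_{ij}$. The sketch below (Python/numpy; the r × d table is passed as an array, and the parent node is assumed not to be pure) follows the formulas above.

    import numpy as np

    def ratio_entropy(counts):
        """counts: r x d array of q_ij. Returns Info = 100 * RatioEntropy."""
        d = counts.shape[1]
        m = counts.sum(axis=0)             # DV counts m_j at the parent node
        n = counts.sum(axis=1)             # records n_i per child node
        N = counts.sum()

        def entropy(freq, total):
            p = freq[freq > 0] / total     # 0 * ln 0 is taken as 0
            return -np.sum(p * np.log(p)) / np.log(d)

        E_in = entropy(m, N)                                             # InputEntropy = E
        E_out = sum(n[i] / N * entropy(counts[i], n[i]) for i in range(len(n)))
        return 100.0 * (1.0 - E_out / E_in)

    def ratio_gini(counts):
        """counts: r x d array of q_ij. Returns Info = 100 * RatioIndex."""
        m = counts.sum(axis=0)
        n = counts.sum(axis=1)
        N = counts.sum()
        G_in = 1.0 - np.sum(m ** 2) / N ** 2                             # InputIndex = G
        G_out = sum(n[i] / N * (1.0 - np.sum(counts[i] ** 2) / n[i] ** 2)
                    for i in range(len(n)))
        return 100.0 * (1.0 - G_out / G_in)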


Thresholds for Informativeness of Splits


In the description of algorithms above it was mentioned that only informative splits (i.e. those with Info passing the
threshold) are presented. This threshold is set differently for different measure types.
Statistical Measures
For statistical measures the Info threshold is derived from the significance level that the user sets in the Measure
tab of the Tree Configuration dialog - Tools menu. Splits for which P is below that significance level (and thus
Info is above the derived threshold) are presented. The default value of the significance level is 0.05.
Non-statistical Measures
For non-statistical measures the user can set maximum and minimum thresholds for Info. This can be done in Step
3 of the Insert Tree wizard in the Maximum Attr. Info and Minimum Attr. Info fields. Splits with Info value
between these two numbers are highlighted even though all splits with positive Info are presented.


2 Predictive Models
Notation
The following notation is common to all predictive models. The dataset to which it applies, depending on context,
may be the training dataset or the validation dataset.
N – number of observations (records) in the dataset;

y1 ,..., y N – DV values in the dataset;


yˆ1 ,..., yˆ N – predicted DV values;
p – number of independent variables in the model (excluding intercept for regressions).
Numeric DV

$\bar{y} = \dfrac{y_1 + \ldots + y_N}{N}$;

$\bar{\hat{y}} = \dfrac{\hat{y}_1 + \ldots + \hat{y}_N}{N}$;

$SSTO = \sum_{i=1}^{N} (y_i - \bar{y})^2$;

$SSR = \sum_{i=1}^{N} (\hat{y}_i - \bar{\hat{y}})^2$;

$SSE = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$;

$MDEV = \dfrac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|$.

Categorical DV

d – number of different values that the DV can take;

$v_1, \ldots, v_d$ – possible values of the DV;

$\pi_{ij}$ – predicted probability that the DV is equal to $v_j$ for the i-th record;

$m_i$ – value index of $y_i$, i.e. $m_i = m$ if $y_i = v_m$;

$L = -\sum_{i=1}^{N} \ln \pi_{i m_i}$ – log-likelihood function;

$f_1, \ldots, f_d$ – frequencies of DV values;

$\hat{f}_1, \ldots, \hat{f}_d$ – predicted frequencies of DV values;

$E = -\dfrac{1}{\ln d} \sum_{j=1}^{d} \dfrac{f_j}{N} \ln \dfrac{f_j}{N}$;

$\hat{E} = -\dfrac{1}{\ln d} \sum_{j=1}^{d} \dfrac{\hat{f}_j}{N} \ln \dfrac{\hat{f}_j}{N}$.

Common Output
After the parameters for a predictive model have been specified, its training progress and results can be seen in three
views. Output there depends only on the type of DV and not on that of the model (except regressions where
conventional output is provided, see below). Training of any predictive model is performed by minimizing the
appropriate objective function using records in the training set.

Results View
There are four numeric outputs in this view.
1. Sample Size is the number of records in the training set.
2. Training Results:
Variance/Entropy Explained. For a numeric DV it is $1 - \dfrac{SSR}{SSTO}$; for a categorical DV it is $1 - \dfrac{\hat{E}}{E}$;
Records correctly predicted. Self-explanatory;
Percentage correctly predicted. Self-explanatory.

Training View
The output of analytical interest in this view is the green and blue lines that show the training progress. The green line, scaled at the left, shows the value of the objective function divided by the number of records, calculated for the training subset. The blue line shows the value of Variance/Entropy Explained, calculated for the validation subset.

Scoring View
Output in the scoring view becomes available after the model has been validated using the dataset specifically
reserved for this purpose. This dataset must have the same structure as the training dataset and the values of the DV
must be known for that set. After the validation is finished, the table in the bottom part of the view is filled with
output. For a categorical DV the top part of the view displays a confusion matrix.


In the bottom table:


Training Variance/Entropy Explained is the same number as in the Results View;
# Records refers to the validation set;
Invalid Records. A record can be found invalid for a variety of reasons. One is the presence
of missing values, especially in the DV field. Another example is the difference in the field
structure between the training and validation datasets;
Correct Records is the number of correct predictions on the validation set;
Scoring Variance/Entropy Explained is Variance/Entropy Explained calculated for
the validation set;
Mean Deviation is only calculated for a numeric DV and is equal to MDEV above.

Regression Models
Linear or logistic regression is used to predict the value of a DV using values of a set of numeric or categorical
independent variables (IV).
Inclusion of a categorical variable as IV in a regression is organized as follows. Consider a categorical variable with
n distinct values. These values are ordered lexicographically by their names. The first value in this ordering is
chosen as reference. For every other value a design variable is created. Thus, n-1 design variables are added to the
regression. The default choice of the reference value can be changed.
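A small sketch of this design-variable construction, assuming pandas; the helper name and the example column are illustrative only.

    import pandas as pd

    def design_variables(column, reference=None):
        """Create n-1 design variables for a categorical IV; the reference value
        defaults to the lexicographically first value of the variable."""
        values = sorted(column.astype(str).unique())          # lexicographic ordering
        ref = reference if reference is not None else values[0]
        dummies = pd.get_dummies(column.astype(str), prefix=column.name)
        return dummies.drop(columns=f"{column.name}_{ref}")   # drop the reference value

    # For a three-valued IV "region" with values {"east", "north", "west"}, this
    # yields two design variables, with "east" as the default reference value.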

Introducing the following notation:

N – number of observations included in the regression;
p – number of variables in the regression (excluding the intercept);

$X = \begin{pmatrix} x_{11} & \ldots & x_{1p} & 1 \\ x_{21} & \ldots & x_{2p} & 1 \\ \vdots & & \vdots & \vdots \\ x_{N1} & \ldots & x_{Np} & 1 \end{pmatrix}$ – matrix of IV values (design matrix; in models without an intercept, the last column is omitted);

$Y = (y_1, y_2, \ldots, y_N)'$ – column vector of DV values;

$b = (b_0, b_1, \ldots, b_p)'$ – column vector of estimated parameters (in models without an intercept, $b_0$ is omitted);

$J$ – the $N \times N$ matrix with every entry equal to 1.
Linear Regression
In a linear regression the DV must be numeric. KnowledgeSTUDIO estimates linear regression by ordinary least
squares. Thus, the vector of parameter estimates is equal to:

$b = (X'X)^{-1} X'Y$

It is computed using singular value decomposition of X.
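A minimal sketch of this estimate, assuming numpy; numpy's least-squares routine is SVD-based, though it is of course not the Angoss implementation.

    import numpy as np

    def ols_svd(X, y):
        """Least-squares estimate b = (X'X)^{-1} X'Y; numpy's lstsq solves the
        problem through an SVD-based routine, which stays stable even when X'X
        is close to singular."""
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        return b

    # Usage sketch: the intercept is the last column of the design matrix, as above.
    # X = np.column_stack([x1, x2, np.ones(len(y))])
    # b = ols_svd(X, y)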


Regression output can be found in the Results view of the predictive model when one of the sequences is chosen in
the Output to view menu. The output has the following format:

Sequence Statistics

Dependent Variable: name

Variance Explained: $\dfrac{SSR}{SSTO}$        Adjusted Variance Explained: $1 - \dfrac{MSE}{MSO}$

F-Ratio: $\dfrac{MSR}{MSE}$        Degrees of Freedom 1 / 2: RDF / EDF

P-Value: $F\!\left(\dfrac{MSR}{MSE};\, RDF,\, EDF\right)$

Analysis of Variance

Source          Sum-of-Squares    DF     Mean-Square
REGRESSION      SSR               RDF    SSR / RDF
ERROR           SSE               EDF    SSE / EDF
TOTAL           SSTO              TDF    SSTO / TDF

Independent Variable Statistics

For each variable (values are listed individually for categorical variables) the table reports:

Variable Name / Value;
Model Parameters: $b_i$;
Parameter Standard Error: $SE(b_i) = \left( MSE \cdot \left( (X'X)^{-1} \right)_{ii} \right)^{1/2}$;
Wald Statistics: $\left( \dfrac{b_i}{SE(b_i)} \right)^2$;
Significance: $\chi^2\!\left( \left( \dfrac{b_i}{SE(b_i)} \right)^2 ; 1 \right)$.

In the above:

TDF = N – number of observations in the regression;
RDF = p – number of variables in the regression excluding the intercept;
EDF = TDF − RDF;

$SSR = b'X'Y - \dfrac{1}{N} Y'JY$;

$SSTO = Y'Y - \dfrac{1}{N} Y'JY = y_1^2 + \ldots + y_N^2 - \dfrac{1}{N}(y_1 + \ldots + y_N)^2$;

$SSE = SSTO - SSR = (Y - Xb)'(Y - Xb)$.
After the set of variables to consider for regression is chosen, there is an option for performing variable selection. Regressions that are run during variable selection are called sequences and their output is available for viewing (see above). Six methods are available:

1. Forward Order starts with the base model (which includes the intercept and all variables chosen to be forced into the model by the user) and adds variables in the order that has been set in Step 5 of the Neural Net wizard, starting from the first.
2. Backward Order starts with the full model (all variables from the set) and deletes variables in the order determined in Step 5 of the Neural Net wizard, starting from the last.
3. Stepwise Selection starts with the base model and, at every step, adds the variable that contributes most to the explanatory power of the model chosen at the previous step, in terms of the partial F test, provided this contribution is higher than a predetermined level chosen together with the variable selection method (Significance Limits – Entry, set by default to 0.15). After each addition, every previously added variable is checked and the variable with the smallest contribution is dropped if its Wald statistic is below the predetermined level (Significance Limits – Exit, with a default value of 0.15). The selection process also stops if a variable is cycled (i.e. included and excluded) three times.
4. Forward Variable Selection is a simplification of Stepwise Selection where no variables are dropped after they have been added.
5. Backward Variable Selection is analogous to Forward Variable Selection, except that the process starts with the full model and variables are dropped based on the criterion outlined in Stepwise Selection.
6. R-Squared Selection simply runs regressions on all subsets of the chosen set of variables.
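The sketch below illustrates Forward Variable Selection (method 4) with the partial F test; it is a simplified illustration that assumes the candidate IVs are supplied as numeric columns and ignores forced variables and categorical design variables.

    import numpy as np
    from scipy.stats import f as f_dist

    def sse(X, y):
        """Error sum of squares of an OLS fit of y on X."""
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        r = y - X @ b
        return float(r @ r)

    def forward_selection(candidates, y, entry=0.15):
        """candidates: dict mapping variable name -> column (1-D array).
        Starts from the intercept-only model and, at each step, adds the variable
        with the smallest partial-F p-value, as long as that p-value is below the
        entry threshold (0.15 by default, as for Significance Limits - Entry)."""
        N = len(y)
        chosen, X = [], np.ones((N, 1))                  # base model: intercept only
        while True:
            best = None
            for name, col in candidates.items():
                if name in chosen:
                    continue
                X_new = np.column_stack([X, col])
                df_err = N - X_new.shape[1]
                partial_F = (sse(X, y) - sse(X_new, y)) / (sse(X_new, y) / df_err)
                p = f_dist.sf(partial_F, 1, df_err)
                if best is None or p < best[2]:
                    best = (name, X_new, p)
            if best is None or best[2] >= entry:
                return chosen                            # no remaining variable qualifies
            chosen.append(best[0])
            X = best[1]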
Logistic Regression
For logistic regression the DV must be categorical. KnowledgeSTUDIO estimates binary or polytomous logistic
regression. Parameters are estimated by the maximum likelihood method.

Let Y take values in $\{v_0, \ldots, v_d\}$ with $v_0$ being the reference value. Denote:

$\pi_j(x_i) = P(Y = v_j \mid x_i)$, j = 0, ..., d;

$g_j(x_i) = \ln \dfrac{\pi_j(x_i)}{\pi_0(x_i)} = b_{j0} + b_{j1} x_{i1} + \ldots + b_{jp} x_{ip}$, j = 1, ..., d;

where $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})'$.

Thus,

$\pi_0(x_i) = \dfrac{1}{1 + \sum_{k=1}^{d} e^{g_k(x_i)}}$  and  $\pi_j(x_i) = \dfrac{e^{g_j(x_i)}}{1 + \sum_{k=1}^{d} e^{g_k(x_i)}}$  for j = 1, ..., d.

Let $m_i$ be the value index of $y_i$, i.e. $m_i = m$ if $y_i = v_m$. Then the log-likelihood function is:

$L(b) = \sum_{i=1}^{N} \ln \pi_{m_i}(x_i) = \sum_{i=1}^{N} \left\{ g_{m_i}(x_i) - \ln\!\left[ 1 + \sum_{k=1}^{d} e^{g_k(x_i)} \right] \right\}$

This function is maximized by the conjugate gradient method.


When the DV is binary the log-likelihood has a particularly simple form:
$L(b) = \sum_{i=1}^{N} \left\{ y_i \ln \pi(x_i) + (1 - y_i) \ln[1 - \pi(x_i)] \right\}$
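A minimal sketch of this binary log-likelihood, assuming numpy, a 0/1-coded DV and an intercept column already included in X (the maximization itself, by conjugate gradient, is not shown):

    import numpy as np

    def binary_log_likelihood(b, X, y):
        """L(b) = sum_i { y_i ln pi(x_i) + (1 - y_i) ln(1 - pi(x_i)) }
        for a DV coded 0/1; X includes the intercept column."""
        g = X @ b                              # g(x_i) = b0 + b1 x_i1 + ... + bp x_ip
        pi = 1.0 / (1.0 + np.exp(-g))          # pi(x_i) = e^g / (1 + e^g)
        return float(np.sum(y * np.log(pi) + (1 - y) * np.log(1 - pi)))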

The output of a logistic regression has the following format:

Sequence Statistics

Dependent Variable: name

Entropy Explained: $1 - \dfrac{L(b_0)}{L(b)}$

Chi-Square: $G = -2[L(b_0) - L(b)]$        Chi-Square Degrees of Freedom: p

P-Value: $\chi^2(G; p)$

Model Fitting Summary

Source              Negative 2 Log-Likelihood    DF
Intercept only      $-2 L(b_0)$                  -
Full model          $-2 L(b)$                    p

Independent Variable Statistics

For each value of the dependent variable, and for each variable (values are listed individually for categorical variables), the table reports:

Variable Name / Value;
Model Parameters: $b_i$;
Parameter Standard Error: $SE(b_i) = \sqrt{\Sigma(b)_{ii}}$;
Wald Statistics: $\left( \dfrac{b_i}{SE(b_i)} \right)^2$;
Significance: $\chi^2\!\left( \left( \dfrac{b_i}{SE(b_i)} \right)^2 ; 1 \right)$.


In the above:

$L(b_0)$ is the log-likelihood for the regression with intercept only;

$\Sigma(b) = -\left[ \dfrac{\partial^2 L}{\partial b^2} \right]^{-1}$, where $\dfrac{\partial^2 L}{\partial b^2}$ is the estimated Hessian of the log-likelihood.
When the DV takes more than two values, the Independent Variable Statistics table is given for every value except
the reference one which is by default the first in alphabetical order.
Variable selection methods are identical to those for linear regression except that the likelihood ratio test is used for
inclusion and exclusion of variables.


3 Cluster Analysis
Cluster analysis is a technique for grouping individuals or objects into clusters so that the objects in the same cluster are more like each other than they are like the objects in other clusters. The resulting clusters should then exhibit internal (within-cluster) homogeneity and high external (between-cluster) heterogeneity with respect to some predetermined selection criteria.
Two clustering algorithms are provided in KnowledgeSTUDIO: K-Means and Expectation-Maximization (EM). Both are nonhierarchical clustering procedures, i.e. they do not involve a tree-like construction process. K-Means measures similarity by distance; EM is based on a probabilistic model.
Two preparation steps performed before clustering are common to both K-Means and EM. The first step is to normalize the attributes so that their values are on a comparable scale. The second step is to select the initial cluster seeds.

Normalize Attributes
For the selected attributes, the original data is rescaled to new unitless attributes.
If an attribute is a continuous variable, the data matrix is standardized to convert the original attribute scores to new attributes with a mean of 0 and a standard deviation of 1.
If an attribute is an ordered variable, it is rescaled in the same way as a continuous variable.
If an attribute is a qualitative variable, each category is converted into a dummy binary variable.
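A small sketch of this normalization step, assuming pandas; which attributes are continuous, ordered or qualitative is supplied by the user.

    import pandas as pd

    def normalize_attributes(df, continuous, ordered, qualitative):
        """Rescale the selected attributes: continuous and ordered variables are
        standardized to mean 0 / standard deviation 1; qualitative variables are
        converted into dummy binary variables."""
        parts = []
        for col in list(continuous) + list(ordered):
            x = df[col].astype(float)
            parts.append((x - x.mean()) / x.std())
        if qualitative:
            parts.append(pd.get_dummies(df[list(qualitative)].astype(str)))
        return pd.concat(parts, axis=1)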

Select Initial Cluster Seeds

There are two options in KnowledgeSTUDIO:
1. Randomly select K objects as the initial cluster seeds.
2. Use the Refinement Algorithm, an iterative procedure that produces refined initial cluster seeds and so converges to a "better" local minimum:
Select J random sub-samples of the full dataset, i = 1, ..., J;
Randomly select K objects as initial cluster seeds for each sub-sample;
Use K-Means to generate a cluster solution for each sub-sample;
If there is an empty cluster at the termination of K-Means, re-assign the initial seed that is farthest from the empty cluster and then re-cluster the sub-sample with K-Means;
This gives J sets of cluster solutions $CM_i$, i = 1, ..., J, which together form a new dataset CM;
Initialize the cluster seeds from CM and use K-Means to cluster each $CM_i$ again, producing a cluster solution $FM_i$;
The $FM_i$ with minimal distortion over the dataset CM becomes the set of initial cluster seeds.


K-Means Algorithm Procedure

Assign the remaining objects to the cluster seed that each object is nearest to.
Calculate the (squared) Euclidean distance coefficients:

$E_{jk} = \sum_{i=1}^{I} (X_{ij} - M_{ik})^2$;

i = 1, ..., I attributes;
j = 1, ..., N objects;
k = 1, ..., K cluster seeds;

$X_{ij}$ – value of the data for the i-th attribute and j-th object;
$M_{ik}$ – cluster seed value for the i-th attribute and k-th cluster.

Assign object j to the cluster with $\min_{k=1,\ldots,K} \{ E_{jk} \}$.

Each cluster seed is then recomputed to be the mean of all objects within its cluster.
Calculate the new cluster seeds:

$M_{ik} = \dfrac{1}{N_k} \sum_{j=1}^{N_k} x_{ij}$.

Reassign each object to its nearest new cluster seed.

If, after a number of iterations, there is no further change in the grouping of objects, the iteration terminates.
Calculate the sum-of-squares clustering function:

$E^2 = \sum_{k=1}^{K} \sum_{j=1}^{N_k} E_{jk}^2$.

Iteration stops when the $E^2$ value no longer decreases.
The K final clusters are composed of the objects surrounding the K cluster seeds from the final iteration.
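The sketch below (Python/numpy) follows this procedure; it assumes the attributes have already been normalized and the initial seeds selected as described above.

    import numpy as np

    def k_means(X, seeds, max_iter=100):
        """X: N x I matrix of normalized attributes; seeds: K x I initial cluster seeds.
        Alternates assignment and seed re-computation until the grouping of objects
        stops changing."""
        M = np.array(seeds, dtype=float)
        labels = None
        for _ in range(max_iter):
            # E_jk: squared Euclidean distance of object j to cluster seed k
            E = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
            new_labels = E.argmin(axis=1)          # assign each object to its nearest seed
            if labels is not None and np.array_equal(new_labels, labels):
                break                              # no further change in grouping: stop
            labels = new_labels
            for k in range(M.shape[0]):
                members = X[labels == k]
                if len(members) > 0:
                    M[k] = members.mean(axis=0)    # seed becomes the mean of its cluster
        return labels, M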


Expectation-Maximization Algorithm Procedure

Assign a weight to each of the K classes, $w_j$, with $\sum_j w_j = 1$. Typically $w_j = \dfrac{1}{K}$ is chosen.

EM estimates the probability density function (PDF) for each class, and classification is then performed based on Bayes' rule.

Posterior probability for class j:  $P(j \mid x^n) = \dfrac{P(x^n \mid j) \, P(j)}{P(x^n)}$,

where $P(x^n \mid j)$ is the PDF for class j.

EM estimates $P(x^n \mid j)$ as a weighted average of multiple Gaussians:

$P(x^n \mid j) = \sum_{l=1}^{N_j} w_l G_l$ is the estimate of the PDF for class j, where

$G_l = \dfrac{1}{\sqrt{2\pi |\Sigma_j|}} \, e^{-\frac{1}{2} (x^n - \mu_j)^T \Sigma_j^{-1} (x^n - \mu_j)}$;

$\mu_j$ – mean of the Gaussian;
$\Sigma_j$ – covariance matrix.

$P(x^n)$ is the overall PDF:

$P(x^n) = \sum_{j=1}^{K} P(x^n \mid j) \, P(j)$,

and $P(j)$ is the prior probability for class j:

$P(j) = \dfrac{N_j}{N}$.

Calculate the likelihood function:

$L_j = \prod_{n=1}^{N} P(x^n \mid j)$.

Find a new set of parameter values that maximizes the likelihood, in an iterative fashion:

$w_j^{new} = \dfrac{1}{d} \sum_{n=1}^{N_j} P^{old}(j \mid x^n)$;

$\mu_j^{new} = \dfrac{\sum_{n=1}^{N} P^{old}(j \mid x^n) \, x^n}{\sum_{n=1}^{N} P^{old}(j \mid x^n)}$;

$(\sigma_j^{new})^2 = \dfrac{1}{d} \cdot \dfrac{\sum_{n=1}^{N} P^{old}(j \mid x^n) \, (x^n - \mu_j^{new})^2}{\sum_{n=1}^{N} P^{old}(j \mid x^n)}$;

$P^{new}(j) = \dfrac{1}{N} \sum_{n=1}^{N} P^{old}(j \mid x^n)$.

Then treat the 'new' parameters as 'old' and search for new parameters again to maximize the likelihood $L_j$.

This iterative process is repeated until the parameter values converge, at which point the likelihood is maximized.
K clusters are formed, with centroids given by the final K sets of parameter values.
Each record is assigned to the cluster with the maximum likelihood score; the likelihood of a record for each distribution is

$L_j(x^n) = P(x^n \mid j) \, P(j)$.
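The sketch below is a standard EM loop for spherical Gaussian clusters, written to mirror the steps above; it assumes equal treatment of all attributes after normalization and is not the exact Angoss procedure.

    import numpy as np

    def em_spherical(X, mu, sigma2, prior, n_iter=50):
        """EM for K spherical Gaussian clusters (numpy float arrays assumed).
        X: N x d data; mu: K x d means; sigma2: length-K variances; prior: length-K weights."""
        N, d = X.shape
        for _ in range(n_iter):
            # E-step: posterior P(j | x^n) by Bayes' rule
            dist2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)        # N x K
            pdf = np.exp(-0.5 * dist2 / sigma2) / (2 * np.pi * sigma2) ** (d / 2)
            joint = pdf * prior                                                # P(x|j) P(j)
            post = joint / joint.sum(axis=1, keepdims=True)
            # M-step: re-estimate means, spherical variances and priors
            resp = post.sum(axis=0)                                            # effective counts
            mu = (post.T @ X) / resp[:, None]
            dist2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
            sigma2 = (post * dist2).sum(axis=0) / (d * resp)
            prior = resp / N
        # assign each record to the cluster with the highest likelihood score P(x|j) P(j)
        dist2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        pdf = np.exp(-0.5 * dist2 / sigma2) / (2 * np.pi * sigma2) ** (d / 2)
        labels = (pdf * prior).argmax(axis=1)
        return labels, mu, sigma2, prior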


4 Neural Network Algorithms


Before neural network training begins, the original data is normalized as follows.

Normalize attributes
With selected attributes for both output and input variables, the original data is rescaled to new unitless attributes.
If an attribute is continuous, standardize the data matrix to convert original attribute scores to new
attributes with mean of 0 and standard deviation of 1.
If an attribute is ordered, the treatment of rescaling is the same as that for a continuous variable.
If an attribute is qualitative, convert each category into a dummy binary variable.

Multi-layer Perceptron (MLP)


The Multi-layer Perceptron is one of the most popular neural network architectures and uses a nonparametric algorithm. An MLP can be viewed as an interconnected network made up of nodes acting as simple computational units. The nodes are arranged into one or more layers. The first layer is the input layer and the final layer is the output layer. The layers between them are called hidden layers. The output of a node in a hidden layer serves as an input to the next layer. Each hidden node outputs the value obtained by applying a sigmoid function to a weighted sum of its inputs. In KnowledgeSTUDIO, only one hidden layer is used.
Between each pair of adjoining layers there are weights, used to weigh the inputs that are summed to form the node's output, which in turn becomes an input for the nodes in the next layer. The weights are free parameters to be trained from the data.
KnowledgeSTUDIO offers three activation functions to choose from as the transfer function for the output y of each node:

$Y = \dfrac{1 - e^{-net}}{1 + e^{-net}}$  – hyperbolic tangent function;

$Y = net$  – linear function;

$Y = \dfrac{1}{1 + e^{-net}}$  – sigmoid function;

where $net = w_0 + \sum_{i=1}^{N} w_i x_i$;

i = 1, ..., N nodes of the input layer;
$w_0$ – bias of the node;
$w_i$ – weight for the connection from the i-th node of the input layer;
$x_i$ – input from the i-th node of the input layer.
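For reference, the three transfer functions and the weighted sum can be written directly (a minimal numpy sketch):

    import numpy as np

    def net(x, w, w0):
        """Weighted-sum input of a node: net = w0 + sum_i w_i * x_i."""
        return w0 + np.dot(w, x)

    def hyperbolic_tangent(n):
        return (1 - np.exp(-n)) / (1 + np.exp(-n))

    def linear(n):
        return n

    def sigmoid(n):
        return 1.0 / (1.0 + np.exp(-n))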

Backpropagation is one of many error-minimizing procedures that tune these weights to generate the desired smooth nonlinear mapping between input and output variables.


Compute the squared-error cost function:

$E = \sum_{i=1}^{N} \left[ \dfrac{1}{2} \sum_{j=1}^{J} (d_{ij} - y_{ij})^2 \right]$;

i = 1, ..., N nodes of the input layer;
j = 1, ..., J nodes of the output layer;
$d_{ij}$ – desired value of output j given by input node i;
$y_{ij}$ – actual output value from the model.

Backpropagation often uses an optimization algorithm other than simple gradient descent. Such an algorithm tunes the weights by means of gradient descent plus momentum. In KnowledgeSTUDIO, a conjugate descent procedure is chosen as the form of gradient descent plus momentum, in which the parameters $\eta$ and $\alpha$ are determined automatically at each iteration.
With a single hidden layer, assume K-1 is the input layer, K is the hidden layer and K+1 is the output layer. Suppose the activation function is the sigmoid.
Compute the sigmoid function for node j of layer K:

$y_{jk} = \dfrac{1}{1 + e^{-net_{jk}}}$;

where $net_{jk} = w_{0j} + \sum_{i=1}^{N_{K-1}} w_{ij} x_i$;

i = 1, ..., $N_{K-1}$ nodes of layer K-1;
j = 1, ..., $N_K$ nodes of layer K;
l = 1, ..., $N_{K+1}$ nodes of layer K+1;
$w_{0j}$ – bias;
$w_{ij}$ – weight between node i of layer K-1 and node j of layer K;
$x_i$ – input from node i of layer K-1.

At the output layer K+1:

For estimation, a linear transfer function is used:

$y_{l,K+1} = w_{0l} + \sum_{j=1}^{N_K} w_{jl} x_j$;

For classification, the sigmoid non-linear function is used:

$y_{l,K+1} = \dfrac{1}{1 + e^{-net_{l,K+1}}}$,  where  $net_{l,K+1} = w_{0l} + \sum_{j=1}^{N_K} w_{jl} x_j$;

Compute the error at output node l:

$E_l = \dfrac{1}{2} \sum_{j=1}^{N_K} (d_{jl} - y_{jl})^2$;

$d_{jl}$ – desired value of output l given by input node j;
$y_{jl}$ – actual value of the output.

Overall error function:

$E = \sum_{l=1}^{N_{K+1}} E_l$.

The backpropagation procedure trains the neural net to minimize the error function $E_l$ at output layer K+1: pick the learning rate $\eta$ and momentum $\alpha$ by the conjugate method and then adjust the weight $w_{ij}$ between node j and input node i:

$\Delta w_{ij} = -\eta \dfrac{\partial E_l}{\partial w_{ij}} + \alpha (\Delta w_{ij}' - \Delta w_{ij}'')$;

$\dfrac{\partial E_l}{\partial w_{ij}}$ – partial derivative of the error with respect to $w_{ij}$;

$\eta$ – step size of the steepest gradient descent; it is the learning rate that determines how much the weights change on each update;

$\alpha$ – momentum term, included so that the minimization is "smoothed" over successive descents, i.e. a fraction of the previous steps is included in the current step.

Compute $\dfrac{\partial E_l}{\partial w_{ij}}$ from the output layer down to the input layer, using the chain rule:

$\dfrac{\partial E_l}{\partial w_{ij}} = \dfrac{\partial net_{jk}}{\partial w_{ij}} \cdot \dfrac{\partial y_{jk}}{\partial net_{jk}} \cdot \dfrac{\partial E_l}{\partial y_{jk}}$;

where $\dfrac{\partial net_{jk}}{\partial w_{ij}} = x_i$;

$\dfrac{\partial y_{jk}}{\partial net_{jk}} = y_{jk}(1 - y_{jk})$  for the sigmoid function $Y = \dfrac{1}{1 + e^{-net}}$;

$\dfrac{\partial y_{jk}}{\partial net_{jk}} = 1$  if the linear function $Y = net$ is chosen;

$\dfrac{\partial y_{jk}}{\partial net_{jk}} = \dfrac{1}{2} (1 + y_{jk})(1 - y_{jk})$  if the hyperbolic tangent function $Y = \dfrac{1 - e^{-net}}{1 + e^{-net}}$ is chosen;

$\dfrac{\partial E_l}{\partial y_{jk}} = d_{jl} - y_{jl}$  for the output layer K+1;

$\dfrac{\partial E_j}{\partial y_{i,K-1}} = \sum_{j=1}^{N_K} \dfrac{\partial net_{jk}}{\partial y_{i,K-1}} \cdot \dfrac{\partial y_{jk}}{\partial net_{jk}} \cdot \dfrac{\partial E_j}{\partial y_{jk}} = \sum_{j=1}^{N_K} w_{ij} \, y_{jk} (1 - y_{jk}) \dfrac{\partial E_j}{\partial y_{jk}}$  for the hidden layer K.

Then the weight changes are added up over all sample inputs, and the weights are changed.
Each cycle of backpropagation training is called an epoch. Once the specified number of epochs or a specified error value is reached, training stops; the weights generated in the final epoch are the final weights of the neural network model.
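The sketch below shows one such epoch for a single hidden layer with sigmoid hidden units and a linear output layer (the estimation case). Plain gradient descent with a fixed learning rate is used for simplicity, whereas KnowledgeSTUDIO determines the learning rate and momentum automatically by the conjugate procedure described above.

    import numpy as np

    def backprop_epoch(X, D, W1, b1, W2, b2, eta=0.01):
        """One training epoch: forward pass, backpropagation of the squared-error
        cost E = 1/2 * sum (d - y)^2, and a weight update.
        X: N x n_in inputs; D: N x n_out desired outputs; W1, b1: hidden-layer
        weights and biases; W2, b2: output-layer weights and biases (float arrays)."""
        # forward pass
        net_h = X @ W1 + b1                      # net_jk at the hidden layer
        y_h = 1.0 / (1.0 + np.exp(-net_h))       # sigmoid hidden outputs
        y_out = y_h @ W2 + b2                    # linear output layer (estimation case)
        # backward pass (chain rule, as in the derivatives above)
        delta_out = y_out - D                    # dE/dnet at the output layer
        delta_h = (delta_out @ W2.T) * y_h * (1.0 - y_h)
        # add up the weight changes over all sample inputs, then change the weights
        W2 -= eta * y_h.T @ delta_out
        b2 -= eta * delta_out.sum(axis=0)
        W1 -= eta * X.T @ delta_h
        b1 -= eta * delta_h.sum(axis=0)
        return 0.5 * np.sum((D - y_out) ** 2)    # epoch error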

Probabilistic Neural Network (PNN)


Probabilistic neural network is a memory based reasoning technique. This is a technique for estimating regression
functions from noisy data, based on the theory of kernel regression.

Given a set of training data $\{x^n, t^n\}$, n = 1, ..., N, where $x^n$ is the input vector and $t^n$ is the target vector.
The statistical properties of the generator of the data are given by the probability density $p(x, t)$ in the joint input-target space. $p(x, t)$ is modeled using a Parzen kernel estimator constructed from the data set:

$\hat{p}(x, t) = \dfrac{1}{N} \sum_{n=1}^{N} \dfrac{1}{(2\pi h^2)^{(d+c)/2}} \exp\!\left\{ -\dfrac{\|x - x^n\|^2}{2h^2} - \dfrac{\|t - t^n\|^2}{2h^2} \right\}$;

where d and c are the dimensionalities of the input and output spaces respectively.
Find an optimal mapping by forming the regression, or conditional average $\langle t \mid x \rangle$, of the target data conditioned on the input variables:

$y(x) = \langle t \mid x \rangle = \dfrac{\sum_n t^n \, e^{-\|x - x^n\|^2 / 2h^2}}{\sum_n e^{-\|x - x^n\|^2 / 2h^2}}$.

This is called the Nadaraya-Watson estimator. Each basis function is centred on a data point and the coefficients in the expansion are given by the target values $t^n$.
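A minimal sketch of this estimator, assuming numpy; the training arrays and the kernel width h are supplied by the caller.

    import numpy as np

    def nadaraya_watson(x, X_train, T_train, h):
        """y(x): kernel-weighted average of the targets, with Gaussian kernels of
        width h centred on the training inputs. X_train: N x d; T_train: N x c."""
        w = np.exp(-np.sum((X_train - x) ** 2, axis=1) / (2.0 * h ** 2))
        return (w[:, None] * T_train).sum(axis=0) / w.sum()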
This approach is then extended by replacing the kernel estimator with an adaptive mixture model, whose parameters are estimated using the expectation-maximization (EM) algorithm. For a mixture of M spherical Gaussian functions, the joint density is:

$\hat{p}(x, t) = \dfrac{1}{M} \sum_{j=1}^{M} \dfrac{1}{(2\pi h^2)^{(d+c)/2}} \exp\!\left\{ -\dfrac{\|x - \mu_j\|^2}{2h^2} - \dfrac{\|t - v_j\|^2}{2h^2} \right\}$.

Then a normalized radial basis function is calculated, in which the basis function centers are no longer constrained to coincide with the data points:

$y(x) = \dfrac{\sum_j P(j) \, v_j \, e^{-\|x - \mu_j\|^2 / 2h^2}}{\sum_j P(j) \, e^{-\|x - \mu_j\|^2 / 2h^2}}$.

Radial Basis Function (RBF)


The radial basis function algorithm generates a mapping between inputs and outputs using a weighted sum of basis functions. The basis functions are radial Gaussians (i.e. they are invariant to rotation about their centers). A basis function network can be trained by an iterative gradient descent method, similar to an MLP with backpropagation training; however, basis function networks train rapidly and without local-minima problems. One major disadvantage of basis function networks is that, after training, they are slower to use and require more computation to perform classification or estimation.
The output y is an RBF mapping over the hidden-layer neurons:

$Y = w_0 + \sum_{i=1}^{N} w_i \, \exp\!\left[ -\dfrac{1}{2\sigma_i^2 h} (x - \mu_i)^T (x - \mu_i) \right]$;

where i = 1, ..., N basis functions;
$x$ – input vector;
$\mu_i$ – center of the i-th basis function;
$\sigma_i$ – spread of the i-th basis function;
$h$ – overlap parameter, which controls the degree of overlap between basis functions;
$w_0$ – bias;
$w_i$ – weight given to each basis function.

RBF networks use a two-step training procedure:

1. Determine the centers and spreads of the Gaussians by minimizing the average variation of all training patterns to the Gaussian centers, using K-means clustering.
2. Adjust the weights from each Gaussian to the output by minimizing the mean square error. Since the weighted-sum relationship between an output variable and the Gaussians is linear, the global optimum can be computed quickly through the standard least-squares solution.
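The sketch below illustrates the second step, assuming the centers and spreads have already been obtained from K-means (step 1); the normalization of the basis functions follows the formulas given further below.

    import numpy as np

    def gaussian_outputs(X, centers, sigma2, h=1.0):
        """Normalized Gaussian basis-function outputs (N x M) for inputs X."""
        dist2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        G = np.exp(-dist2 / (2.0 * sigma2 * h))
        return G / G.sum(axis=1, keepdims=True)

    def train_rbf_weights(X, D, centers, sigma2, h=1.0):
        """Step 2: solve the output weights by least squares, W = (Phi'Phi)^{-1} Phi'D,
        where Phi has a bias column followed by the normalized Gaussian outputs."""
        Phi = np.column_stack([np.ones(len(X)), gaussian_outputs(X, centers, sigma2, h)])
        W, *_ = np.linalg.lstsq(Phi, D, rcond=None)
        return W

    def rbf_predict(X_new, centers, sigma2, W, h=1.0):
        Phi = np.column_stack([np.ones(len(X_new)), gaussian_outputs(X_new, centers, sigma2, h)])
        return Phi @ W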

RBF networks share the same structure as the MLP. With a single hidden layer, assume K-1 is the input layer, K is the hidden layer and K+1 is the output layer.
Use K-means to compute the center and spread of the Gaussian for node j of layer K:

1. Initialize a set of M cluster centers $\mu_m$, m = 1, ..., M, from the X input vectors;
2. Partition X into M disjoint subsets $S_m$ containing $N_m$ data points;
3. Calculate the sum-of-squares error function:

$J = \sum_{m=1}^{M} \sum_{n \in S_m} (X^n - \mu_m)^2$,  where  $\mu_m = \dfrac{1}{N_m} \sum_{n \in S_m} X^n$;

4. Assign each data point to a new cluster according to the nearest $\mu_m$, in order to minimize J;
5. Recompute $\mu_m$ and repeat until there are no further changes in the grouping of data points.


For each cluster m, compute the Gaussian output:

$\phi_m(X) = \exp\!\left[ -\dfrac{1}{2\sigma_m^2 h} (x - \mu_m)^T (x - \mu_m) \right]$;

$\tilde{\phi}_m(X) = \phi_m(X) \Big/ \sum_{m=1}^{M} \phi_m(X)$ – normalized Gaussian output.

Compute the output:

$y_{jl} = w_{0j} + \sum_{m=1}^{M} w_{mj} \, \tilde{\phi}_m(X)$;

where $w_{mj}$ – weight between the j-th output and the m-th normalized Gaussian;
$w_{0j}$ – bias for the j-th node.

Overall error function:

$E = \dfrac{1}{2} \sum_{j=1}^{N_K} (d_j - y_j)^2$;

where $d_j$ – desired value of the output for input node j;
$y_j$ – actual value of the output for input node j.

Calculate the optimal solution by minimizing the above error function for the output:

$W^T = (\Phi^T \Phi)^{-1} \Phi^T D$;

where $\Phi = \begin{pmatrix} 1 & G_1 \\ 1 & G_2 \\ \vdots & \vdots \\ 1 & G_N \end{pmatrix}$,  $D = \begin{pmatrix} d_1 \\ d_2 \\ \vdots \\ d_N \end{pmatrix}$;

$G_i$ is the row vector of Gaussian outputs for the i-th input node:
$G_i = [\phi_1(X), \phi_2(X), \ldots, \phi_M(X)]$;
$d_i$ is the desired output for the i-th input node.

ANG-00108
