$$f(x) = \sum_{m=1}^{M} g_m(w_m^\top x), \qquad (1.9)$$
where both $g_m$ and $w_m$ are learned from the data. The regularization is now performed by
choosing $M$ and the class of $\{g_m\}_{m=1}^{M}$.
Note: PPR is not a pure ERM. Just like the GAM problem, in the PPR problem $\{g_m\}_{m=1}^{M}$ are
learned by Kernel Regression. Solving the PPR problem is thus a hybrid of ERM and Kernel
Regression algorithms.
Note: If M is taken arbitrarily large, for appropriate choice of gm the PPR model can
approximate any continuous function in $\mathbb{R}^p$ arbitrarily well. Such a class of models is called a
universal approximator. However this generality comes at a price. Interpretation of the fitted
model is usually difficult, because each input enters into the model in a complex and
multifaceted way. As a result, the PPR model is most useful for prediction, and not very
useful for producing an understandable model for the data.
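As a rough illustration of the model in (1.9), base R ships a projection pursuit regression fitter, `stats::ppr`. The sketch below is only an example under assumptions of my own choosing: the built-in `mtcars` data, the three predictors, and M = 2 terms are all illustrative. Note also that `ppr` estimates the ridge functions with Friedman's super smoother by default, not with a kernel regression.

```r
# Fit a PPR model with M = 2 ridge functions to a built-in data set
# (the data set and the predictors are illustrative assumptions)
fit <- ppr(mpg ~ wt + hp + disp, data = mtcars, nterms = 2)

# The learned projection directions w_m, one column per term
fit$alpha

# Predictions f(x) for the training data
predictions <- predict(fit, newdata = mtcars)
head(predictions)
```

Increasing `nterms` makes the class richer, which is exactly the regularization knob $M$ described above.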
Notice also that the neural network model with one hidden layer has exactly the same form
as the projection pursuit model described above. The difference is that the PPR model uses
nonparametric functions $g_m(v)$, while the neural network uses a far simpler function based on
sigmoid($v$).
$$g_m(x) := \beta_m \, \sigma(x - \alpha_m)$$
where only $\{\alpha_m, \beta_m\}_{m=1}^{M}$ are learned from the data. A typical activation function is the standard
logistic CDF: $\sigma(t) = \frac{1}{1 + e^{-t}}$.
As can be seen, the NNET is merely a non-linear regression model, the parameters of which
are often called weights.
Loss Functions: Like any other ERM problem, we are free to choose the appropriate loss
function.
Universal Approximator: Like the PPR, even when $\{g_m\}_{m=1}^{M}$ are fixed beforehand, the class is
still a universal approximator.
Regularization: regularization of the model is done via the selection of $\sigma$, the number of
nodes/variables in the network, and the number of layers.
$$f(x) = \sum_{m=1}^{M} c_m I\{x \in R_m\}$$
The parameters of the model are the different conditions $\{R_m\}_{m=1}^{M}$ and the function's value at
each condition, $\{c_m\}_{m=1}^{M}$.
Loss Functions: As usual, a squared loss can be used for continuous outcomes y. For
categorical outcomes, the loss function is called the impurity measure. One
can use either the misclassification error, the multinomial likelihood (known as the deviance, or
cross-entropy), or a first order approximation of the latter known as the Gini Index.
Universal Approximator: CART is a universal approximator.
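The three impurity measures named above can be written down in a few lines. This is a minimal sketch, assuming a node is summarized by its vector of class proportions `p`; the function name `impurity` is hypothetical.

```r
# p: vector of class proportions within a node (non-negative, sums to 1)
impurity <- function(p) {
  nonzero <- p[p > 0]   # by convention 0 * log(0) = 0
  c(misclassification = 1 - max(p),
    gini              = sum(p * (1 - p)),
    cross_entropy     = -sum(nonzero * log(nonzero)))
}

impurity(c(0.5, 0.5))   # a maximally mixed two-class node
impurity(c(1, 0))       # a pure node: all three measures are zero
```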
Random Forests
Trees are very flexible hypothesis classes. They thus have small bias but large variance.
Bagging trees will reduce this variance by averaging trees from different bootstrap samples.
Alas, the variance (thus the MSE) of bagged trees is bounded from below because the trees use
the same variables, and are thus correlated. To remedy this, [Breiman, 2001] proposed to fit
trees to bootstrapped samples, using only a random subset of variables. This decorrelates
the trees, thus allowing a reduction in the variance of their average (and thus in its MSE).
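The lower bound can be made explicit under a simplifying assumption that is not stated in these notes: suppose the $B$ bagged trees $T_1(x), \dots, T_B(x)$ share the same variance $\sigma^2$ and the same pairwise correlation $\rho > 0$. Then the variance of their average is:

```latex
\mathrm{Var}\left[\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right]
  = \frac{1}{B^2}\left(B\sigma^2 + B(B-1)\rho\sigma^2\right)
  = \rho\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
  \;\xrightarrow[B \to \infty]{}\; \rho\sigma^2 .
```

Averaging more trees only removes the second term; the first term shrinks only if the correlation $\rho$ itself is reduced, which is what the random variable subsets aim for.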
Unsupervised Learning
1 Introduction to Unsupervised Learning
2 Density Estimation
2.1 Parametric Density Estimation
2.2 Kernel Density Estimation
2.3 Graphical Models
3 High Density Regions
3.1 Association Rules
4 Linear-Space Embeddings
4.1 Principal Components Analysis (PCA)
4.2 Random Projections
4.3 Sparse Principal Component Analysis (sPCA)
4.4 Multidimensional Scaling (MDS)
4.5 Local MDS
4.6 Isometric Feature Mapping (Isomap)
5 Non-Linear-Space Embeddings
5.1 Kernel Principal Component Analysis (kPCA)
5.2 Self Organizing Maps (SOM)
5.3 Principal Curves and Surfaces
5.4 Local Linear Embedding (LLE)
5.5 Auto Encoders
5.6 Matrix Factorization
5.7 Information Bottleneck
6 Latent Space Generative Models
6.1 Factor Analysis (FA)
6.2 Independent Component Analysis (ICA)
6.3 Exploratory Projection Pursuit
6.4 Compressed Sensing
6.5 Generative Topographic Map (GTM)
6.6 Finite Mixtures
6.7 Hidden Markov Models (HMM)
6.8 Latent Space Graphical Models
6.9 Latent Dirichlet Allocation (LDA)
6.10 Probabilistic Latent Semantic Indexing (PLSI)
6.11 Prediction by Partial Matching (PPM)
6.12 Dynamic Markov Compression (DMC)
7 Random Graph Models
7.1 Erdos Renyi
7.2 Exchangeable Graph Model
7.3 p1 Graph Model
7.4 p2 Graph Model
7.5 Stochastic Block Graph Model
7.6 Latent Space Graph Model
7.7 Exponential Random Graphs (ERGMs)
8 Cluster Analysis
8.1 K-Means Clustering
8.2 K-Medoids Clustering (PAM)
8.3 Quality Threshold Clustering (QT)
8.4 Hierarchical Clustering
8.5 Fuzzy Clustering
8.6 Self Organizing Maps (SOM)
8.7 Spectral Clustering
8.8 Bi Clustering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The algorithm: (use dummy variables for 0/1 response = "in basket"/"Not in basket").
The first pass over the data computes the support (relative frequency) of all single-item sets.
Those whose support is less than the threshold are discarded. The second
pass computes the support of all item sets of size two that can be formed
from pairs of the single items surviving the first pass. In other words, to
generate all frequent itemsets with |K| = m, we need to consider only
candidates such that all of their m ancestral item sets of size m - 1 are
frequent. Those size-two item sets with support less than the threshold are
discarded. Each successive pass over the data considers only those item
sets that can be formed by combining those that survived the previous
pass with those retained from the first pass. Passes over the data continue
until all candidate rules from the previous pass have support less than the
specified threshold.
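The passes described above can be sketched in base R. This is a toy illustration, not the implementation used by any package: the basket data and the names `support` and `frequent_itemsets` are hypothetical, and candidates are formed simply by extending each surviving set with a surviving single item.

```r
# Hypothetical toy data: rows are baskets, columns are 0/1 dummy variables for items
baskets <- rbind(
  c(bread = 1, jelly = 1, milk = 0, peanut_butter = 1),
  c(bread = 1, jelly = 1, milk = 0, peanut_butter = 1),
  c(bread = 1, jelly = 0, milk = 1, peanut_butter = 0),
  c(bread = 1, jelly = 0, milk = 0, peanut_butter = 1),
  c(bread = 0, jelly = 0, milk = 1, peanut_butter = 0)
)

# Support: relative frequency of baskets containing all the items in the set
support <- function(items, baskets) {
  mean(rowSums(baskets[, items, drop = FALSE]) == length(items))
}

frequent_itemsets <- function(baskets, threshold) {
  is_frequent <- function(sets) {
    vapply(sets, support, numeric(1), baskets = baskets) >= threshold
  }
  # First pass: single items
  singles <- as.list(colnames(baskets))
  singles <- singles[is_frequent(singles)]
  survivors <- singles
  result <- singles
  # Later passes: extend the survivors of the previous pass by a surviving single item
  while (length(survivors) > 0) {
    candidates <- list()
    for (set in survivors) {
      for (item in setdiff(unlist(singles), set)) {
        candidates[[length(candidates) + 1]] <- sort(c(set, item))
      }
    }
    candidates <- unique(candidates)
    survivors <- candidates[is_frequent(candidates)]
    result <- c(result, survivors)
  }
  result
}

frequent <- frequent_itemsets(baskets, threshold = 0.5)
frequent
```

With a threshold of 0.5 only bread, peanut butter, and the pair of the two survive; jelly and milk are discarded in the first pass and never enter a larger candidate.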
> Example: suppose the item set K = {peanut butter, jelly, bread} and
consider the rule {peanut butter, jelly} => {bread}. A support value
of 0.03 for this rule means that peanut butter, jelly, and bread appeared
together in 3% of the market baskets. A confidence of 0.82 for this rule implies
that when peanut butter and jelly were purchased, 82% of the time
bread was also purchased. If bread appeared in 43% of all market baskets
then the rule {peanut butter, jelly} => {bread} would have a lift of 0.82/0.43 = 1.91.
The goal of this analysis is to produce association rules (A => B) with both
high values of support and confidence(A => B).
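The arithmetic behind the example can be checked directly. The sketch uses only the three numbers quoted in the example; the variable names are illustrative.

```r
# Values quoted in the example
support_rule       <- 0.03  # P(peanut butter, jelly, bread)
confidence         <- 0.82  # P(bread | peanut butter, jelly)
support_consequent <- 0.43  # P(bread)

# Support of the antecedent {peanut butter, jelly}, implied by the first two numbers
support_antecedent <- support_rule / confidence

# Lift: how much more likely bread is given the antecedent than in a random basket
lift <- confidence / support_consequent

round(c(support_antecedent = support_antecedent, lift = lift), 3)
```

A lift above 1 means that the antecedent and the consequent appear together more often than they would if they were independent.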
Remark: Two interpretations of "linear" can be found in the literature. It may refer to the
nature of the low dimensional space approximating the data, or to the nature of the
embedding operation.
4.1 PCA
Maximizing the variance of $v^\top X$ under a constraint on $v$, using Lagrange multipliers,
where $\mathrm{Cov}[X] = \Sigma$ and $\mathrm{Cov}[v^\top X] = v^\top \Sigma v$.
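A sketch of the standard derivation, assuming the usual unit-norm constraint $v^\top v = 1$ (an assumption, since the constraint is not spelled out in these notes):

```latex
\begin{aligned}
&\max_{v} \; v^\top \Sigma v \quad \text{s.t.} \quad v^\top v = 1, \\
&\mathcal{L}(v, \lambda) = v^\top \Sigma v - \lambda \,(v^\top v - 1), \\
&\nabla_v \mathcal{L} = 2\Sigma v - 2\lambda v = 0
  \;\Longrightarrow\; \Sigma v = \lambda v, \\
&v^\top \Sigma v = \lambda \, v^\top v = \lambda .
\end{aligned}
```

The stationary points are thus eigenvectors of $\Sigma$, and the variance attained is the matching eigenvalue, so the maximizer is the eigenvector with the largest eigenvalue.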
PCA is such a basic technique it has been rediscovered and renamed independently in many
fields. It can be found under the names of discrete Karhunen-Loeve Transform; Hotelling
Transform; Proper Orthogonal Decomposition (POD); Eckart-Young Theorem;
Schmidt-Mirsky Theorem; Empirical Orthogonal Functions; Empirical Eigenfunction
Decomposition; Empirical Component Analysis; Quasi-Harmonic Modes; Spectral
Decomposition; Empirical Modal Analysis; and possibly more.
Example:
Consider human height and weight data. While clearly two dimensional data, you don't really
need both to understand how "big" the people in the data are. This is because height and
weight vary mostly along a single dimension, which can be interpreted as the "bigness" of an
individual. This is why physicians use the Body Mass Index (BMI) as an indicator of size,
instead of a two-dimensional measurement.
Assume now that you wish to give each individual a size score that is a linear combination of
height and weight. PCA does just that. It returns the linear combination that has the most
variability, i.e., the combination which best distinguishes between individuals.
Notice we have currently offered two motivations for PCA: (i) Find linear combinations that
best distinguish between observations, i.e., maximize variance.
(ii) Find the linear subspace that best approximates the data. The reason these two problems
are equivalent is the use of the squared error. Informally speaking, the data has some
total variance. This variance can be decomposed into the part captured in M, and the part not
captured. Maximizing the former is thus the same as minimizing the latter.
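The variance decomposition can be verified numerically. The sketch below simulates hypothetical height and weight data (the means, slopes, and noise levels are made-up assumptions) and checks that the total variance splits into the part captured by the first component and the part left in the residuals.

```r
set.seed(2015)
n <- 200
height <- rnorm(n, mean = 170, sd = 10)                  # hypothetical heights
weight <- 70 + 0.9 * (height - 170) + rnorm(n, sd = 5)   # hypothetical weights
X <- scale(cbind(height, weight), center = TRUE, scale = FALSE)  # mean-centered data

# (i) Maximal variance: the leading eigenvector of the covariance matrix
eig <- eigen(cov(X))
v1 <- eig$vectors[, 1]
scores <- drop(X %*% v1)          # the "bigness" score of each individual

# (ii) Best linear approximation: project onto the line spanned by v1
residual <- X - scores %*% t(v1)

total        <- sum(diag(cov(X)))         # total variance
captured     <- var(scores)               # variance along the first component
not_captured <- sum(diag(cov(residual)))  # squared approximation error (n - 1 divisor)

all.equal(total, captured + not_captured)
all.equal(captured, prcomp(X)$sdev[1]^2)  # agrees with prcomp()
```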
Note: Usually for simplicity of exposition, we will assume that the data X has been mean
centered.
Terminology:
Principal Components: The linear combinations of the features, which best separate
between observations. In our example - the "bigness" index of each individual.
The first component captures the most variance, the second component the second
most variance, etc. In terms of M, the principal components are an orthogonal basis
for M.
Scores: Synonymous to Principal Components.
Loadings: The weights of each feature in each principal component.
In our example, the importance of the height and weight in constructing the "bigness"
score.
PCA as a Graph Method
Starting from the maximal variance motivation, it is perhaps not surprising
that PCA depends only on the similarities between features, as measured by
their empirical covariance. The linearity of the target manifold was there by
assumption.
The building blocks of all these graph-based dimensionality reduction
methods are:
1. Compute some similarity graph G (or dissimilarity graph D) from the
raw features.
2. Call upon graph embedding theory to map the data points into the
target manifold M.
To summarize:
Task = dim reduce
Type = optimization
Input = Graph (G)
Output = embedding function
Sparse Principal Component Analysis (sPCA)
When analyzing the PCA results, we often wish to understand which features contribute to
which component. This is much easier when the loadings (A) are sparse, i.e., include many
zeroes. sPCA performs this in LASSO style, by means of l1 regularization.
4.4 Multidimensional Scaling (MDS)
MDS - Both self-organizing maps and principal curves and surfaces map data points
in $\mathbb{R}^p$ to a lower dimensional manifold. Multidimensional scaling (MDS) has a similar
goal, but approaches the problem in a somewhat different way.
MDS represents high-dimensional data in a low-dimensional coordinate system.
MDS requires only the dissimilarities dij , in contrast to the SOM and principal curves
and surfaces which need the data points xi.
MDS aims at representing a network (= a weighted graph) of distances (or
similarities) between observations, by embedding the observations in a q dimensional
linear subspace, while preserving the original distances.
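A small sketch of classical MDS with `stats::cmdscale`, which takes only a dissimilarity object as input. The use of the built-in `USArrests` data and of Euclidean distances on standardized features is an illustrative assumption.

```r
# Dissimilarities between the 50 states, computed from standardized features
d <- dist(scale(USArrests))

# Classical (metric) MDS into q = 2 dimensions
embedding <- cmdscale(d, k = 2)

# How well are the original dissimilarities preserved in the plane?
cor(as.vector(d), as.vector(dist(embedding)))

# With q = 4 (the number of original features) the distances are recovered exactly
full <- cmdscale(d, k = 4)
all.equal(as.vector(dist(full)), as.vector(d))
```

Note that `cmdscale` never sees the data points themselves, only `d`, in line with the remark that MDS requires only the dissimilarities.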
of the optimization problem takes a very simple form. The classes of such g's are known as
Reproducing Kernel Hilbert Spaces (RKHS).
Nonlinear Dimension Reduction and Local Multidimensional Scaling - These methods can be
thought of as flattening the manifold, and hence reducing the data to a set of low-dimensional
coordinates that represent their relative positions in the manifold. They are useful for problems where
signal-to-noise ratio is very high (e.g., physical systems), and are probably not as useful for
observational data with lower signal-to-noise ratios.
Three Methods of Nonlinear MDS:
ISOMAP = Isometric feature mapping (Tenenbaum et al., 2000) - constructs a graph to
approximate the geodesic distance between points along the manifold. Specifically, for each data
point we find its neighbors: points within some small Euclidean distance of that point. We construct a
graph with an edge between any two neighboring points. The geodesic distance between any two
points is then approximated by the shortest path between points on the graph. Finally, classical
scaling is applied to the graph distances, to produce a low-dimensional mapping.
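The three steps of ISOMAP (neighborhood graph, shortest paths, classical scaling) fit in a few lines of base R. This is a minimal sketch and not the implementation of any package: the function name `isomap_sketch`, the radius `eps`, and the half-circle example are assumptions made for illustration.

```r
isomap_sketch <- function(X, eps, q = 2) {
  D <- as.matrix(dist(X))            # Euclidean distances
  G <- ifelse(D <= eps, D, Inf)      # keep edges between neighbors only
  for (m in seq_len(nrow(G))) {      # Floyd-Warshall: shortest paths on the graph
    G <- pmin(G, outer(G[, m], G[m, ], "+"))
  }
  stopifnot(all(is.finite(G)))       # the neighborhood graph must be connected
  cmdscale(as.dist(G), k = q)        # classical scaling of the graph distances
}

# Hypothetical example: 50 points along a half circle, a 1-dimensional manifold in the plane
theta <- seq(0, pi, length.out = 50)
X <- cbind(cos(theta), sin(theta))
embedding <- isomap_sketch(X, eps = 0.2, q = 1)
abs(cor(embedding[, 1], theta))      # close to 1: the arc has been flattened
```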
LLE = Local linear embedding (Roweis and Saul, 2000) - takes a very different approach, trying
to preserve the local affine structure of the high-dimensional data. Each data point is approximated by
a linear combination of neighboring points. Then a lower dimensional representation is constructed
that best preserves these local approximations.
LLE aims at finding linear subspaces that are good approximations of small neighborhoods of
the whole data X. It is similar in spirit to Isomap and LocalMDS (§5.4.5). It differs, however,
in the way similarities are computed, and in the way embeddings are performed. In particular,
as the name may suggest, LLE performs local embedding to linear subspaces.
To summarize:
Task = dim. reduction
Type = algorithm
Input = graph (G)
Output = data embedding
Concept = local distance
Local MDS (Chen and Buja, 2008) - takes the simplest and arguably the most direct approach.
We define N to be the symmetric set of nearby pairs of points; specifically a pair (i, i') is in N if point i
is among the K-nearest neighbors of i', or vice-versa.
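The symmetric neighbor set N can be built as follows. This is a sketch under the assumption of Euclidean distances, with ties broken by observation order; the function name `neighbor_pairs` is hypothetical.

```r
neighbor_pairs <- function(X, K) {
  D <- as.matrix(dist(X))
  diag(D) <- Inf   # a point is not its own neighbor
  # is_nn[j, i] is TRUE when point j is among the K nearest neighbors of point i
  is_nn <- apply(D, 1, function(d) rank(d, ties.method = "first") <= K)
  is_nn | t(is_nn)  # symmetrize: "or vice-versa"
}

X <- matrix(c(0, 1, 2, 10), ncol = 1)
N <- neighbor_pairs(X, K = 1)
which(N & upper.tri(N), arr.ind = TRUE)  # the pairs (1,2), (2,3), (3,4)
```

The pair (3,4) is in N although point 4 is not the nearest neighbor of point 3; it is enough that point 3 is the nearest neighbor of point 4.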
Self Organizing Maps (SOM)
SOMs are a non-linear-subspace dimensionality reduction method, aimed at
good clustering. The method is non-linear because the algorithm (which cannot be cast
as an ERM problem, i.e., an optimization problem) returns an embedding into a non-linear
manifold.
To summarize:
Task = dim. reduction
Type = algorithm
Input = X (data)
Output = parametric curve or surface
Concept = self consistency, i.e., a curve with a path that is the average of all its closest
data points. Roughly speaking, one can think of this curve as a
parameterized function, connecting all the k-means cluster centers in the smoothest
way possible.
8 Cluster Analysis
K-medoids Clustering - For a given cluster assignment (C) find the observation in the
cluster minimizing total distance to other points in that cluster. This algorithm
assumes attribute data, but the approach can also be applied to data described only by
proximity matrices. There is no need to explicitly compute cluster centers.
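The medoid update for a given assignment needs nothing but the proximity matrix. A minimal sketch, with a hypothetical function name `find_medoids` and made-up one-dimensional data:

```r
# D: dissimilarities between observations; assignment: cluster label of each observation
find_medoids <- function(D, assignment) {
  D <- as.matrix(D)
  sapply(split(seq_len(nrow(D)), assignment), function(members) {
    within <- D[members, members, drop = FALSE]
    members[which.min(rowSums(within))]  # member with minimal total distance to the others
  })
}

x <- c(0, 1, 2, 10, 11, 15)
medoids <- find_medoids(dist(x), assignment = c(1, 1, 1, 2, 2, 2))
medoids  # observations 2 and 5
```

The function only reads `D`, so it works equally well when no attribute data is available.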
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Recommender Systems Algorithms
1. Content Filtering
2. Collaborative Filtering
3. Hybrid Filtering
4. Recommender Systems
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The two main approaches to recommender systems include content filtering and
collaborative filtering.
1. Content Filtering
In content filtering, the system is assumed to have some background information on the user
(say, because he logged in), and uses this information to give him recommendations. The
recommendation in this case, is approached as a supervised learning problem: the system
learns to predict a product's rating based on the user's features.
2. Collaborative Filtering
Unlike content filtering, in collaborative filtering, there is no external information on the user
or the products, besides the ratings of other users.
Collaborative filtering can be approached as a supervised learning problem, or as an
unsupervised learning problem. This is because it is neither. It is essentially a missing data
problem.
The two main approaches to collaborative filtering include neighborhood methods, and latent
factor models.
a. The neighborhood methods to collaborative filtering rest on the assumption that
similar individuals have similar tastes. If someone similar to individual i has seen
movie j, then i should have a similar opinion.
b. The latent factor models approach to collaborative filtering rests on the assumption
that the rankings are a function of some latent user attributes and latent movie
attributes. This idea is not a new one, as we have seen it in the context of
unsupervised learning in factor analysis (FA) and independent component analysis
(ICA). This is why this approach is more commonly known as the
Matrix Factorization approach to collaborative filtering.
We can present several matrix factorization problems in the ERM framework.
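One such formulation, sketched under assumptions of my own choosing: a squared loss on the observed ratings, a ridge penalty with weight `lambda`, `q` latent attributes, and alternating ridge regressions as the solver. The simulated ratings and all names are hypothetical; this is one of several possible matrix factorization problems, not necessarily the one the notes have in mind.

```r
set.seed(2015)
n_users <- 20; n_items <- 15; q <- 2; lambda <- 0.1

# Hypothetical ratings generated from q latent user and item attributes
R <- matrix(rnorm(n_users * q), n_users, q) %*% t(matrix(rnorm(n_items * q), n_items, q))
observed <- matrix(runif(n_users * n_items) < 0.7, n_users, n_items)  # TRUE = rating is known

# Penalized empirical risk: squared loss on observed ratings plus a ridge penalty
risk <- function(U, V) {
  sum(((R - U %*% t(V))[observed])^2) + lambda * (sum(U^2) + sum(V^2))
}

# Ridge regression of the observed ratings y on the matching latent attributes Z
ridge_fit <- function(Z, y) {
  drop(solve(crossprod(Z) + lambda * diag(ncol(Z)), crossprod(Z, y)))
}

U <- matrix(rnorm(n_users * q, sd = 0.1), n_users, q)   # latent user attributes
V <- matrix(rnorm(n_items * q, sd = 0.1), n_items, q)   # latent item attributes
risk_path <- risk(U, V)

for (iter in 1:20) {
  for (i in seq_len(n_users)) {   # update user attributes, holding items fixed
    j <- which(observed[i, ])
    U[i, ] <- ridge_fit(V[j, , drop = FALSE], R[i, j])
  }
  for (j in seq_len(n_items)) {   # update item attributes, holding users fixed
    i <- which(observed[, j])
    V[j, ] <- ridge_fit(U[i, , drop = FALSE], R[i, j])
  }
  risk_path <- c(risk_path, risk(U, V))
}

# Impute the missing ratings from the fitted factors
R_hat <- U %*% t(V)
```

Each inner update exactly minimizes the penalized risk over one row of `U` or `V`, so `risk_path` cannot increase from one sweep to the next.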
3. Hybrid Filtering
After introducing the ideas of content filtering and collaborative filtering, why not marry the
two? Hybrid filtering is the idea of imputing the missing data, thus making recommendations,
using both a viewer's attributes, and other viewers' preferences.
It can be presented as an ERM problem.
Recommender Systems Terminology
Content Based Filtering: A supervised learning approach to recommendations.
Collaborative Filtering: A missing data imputation approach to recommendations.
Memory Based Filtering: A non-parametric (neighborhood) approach to collaborative
filtering.
Misc notes:
========
The Relation Between Supervised and Unsupervised Learning
It may be surprising that collaborative filtering can be seen as both an unsupervised and a
supervised learning problem. But these are not mutually exclusive problems.
Since in unsupervised learning we try to learn the joint distribution of x, i.e., the
relationship between any variable in x and the rest, we may see it as several supervised
learning problems. In each, a different variable in x plays the role of y.
The answer is: functions that belong to (RKHS) Reproducing Kernel Hilbert Space
function space.
The Bayesian View of RKHS
Just as the ridge regression has a Bayesian interpretation, so does the kernel trick. Informally,
the functions solving Eq.(1) can be seen as the posterior mode if our prior beliefs postulate
that the function we are trying to recover is a Gaussian zero-mean process with covariance
given by K.
Generative Models
By generative model we mean that we specify the whole data distribution.
This is particularly relevant to supervised learning where many methods only assume the
distribution of P(y|x) without stating the distribution of P(x).
LDA, QDA, and Naive Bayes follow this exact same rationale.
Dimensionality Reduction
- It is thus intimately related to lossy compression in information theory.
- Dimensionality reduction is often performed before supervised learning to keep
computational complexity low.
R code
Supervised Learning Code
```{r}
library(magrittr) # for piping
library(dplyr) # for handling data frames
# Some utility functions:
l2 <- function(x) x^2 %>% sum %>% sqrt
l1 <- function(x) abs(x) %>% sum
MSE <- function(x) x^2 %>% mean
# Misclassification rate: the off-diagonal cells of a 2x2 confusion table
missclassification <- function(tab) sum(tab[c(2,3)])/sum(tab)
```
We also initialize the random number generator so that we all get the same results (at least
upon a first run)
```{r set seed}
set.seed(2015)
```
# OLS
## OLS Regression
Starting with OLS regression, and a split train-test data set:
```{r OLS Regression}
View(prostate)
# now verify that your data looks as you would expect....
ols.1 <- lm(lcavol~. ,data = prostate.train)
# Train error:
MSE( predict(ols.1)- prostate.train$lcavol)
# Test error:
MSE( predict(ols.1, newdata = prostate.test)- prostate.test$lcavol)
```
Now using cross validation to estimate the prediction error:
```{r Cross Validation}
folds <- 10
# Assign each observation to one of the folds:
fold.assignment <- sample(1:folds, nrow(prostate), replace = TRUE)
errors <- NULL
for (k in 1:folds){
  prostate.cross.train <- prostate[fold.assignment!=k,]
  prostate.cross.test <- prostate[fold.assignment==k,]
  .ols <- lm(lcavol~. ,data = prostate.cross.train)
  .predictions <- predict(.ols, newdata=prostate.cross.test)
  .errors <- .predictions - prostate.cross.test$lcavol
  errors <- c(errors, .errors)
}
cancor()
# Kernel based robust version
kernlab::kcca()
```
## OLS Classification
```{r OLS Classification}
# Making train and test sets:
ols.2 <- lm(spam~., data = spam.train.dummy)
# Train confusion matrix:
.predictions.train <- predict(ols.2) > 0.5
(confusion.train <- table(prediction=.predictions.train, truth=spam.train.dummy$spam))
missclassification(confusion.train)
# Test confusion matrix:
.predictions.test <- predict(ols.2, newdata = spam.test.dummy) > 0.5
(confusion.test <- table(prediction=.predictions.test, truth=spam.test.dummy$spam))
missclassification(confusion.test)
```
# Ridge Regression
```{r Ridge I}
# install.packages('ridge')
library(ridge)
ridge.1 <- linearRidge(lcavol~. ,data = prostate.train)
# Note that if not specified, lambda is chosen automatically by linearRidge.
# Train error:
MSE( predict(ridge.1)- prostate.train$lcavol)
# Test error:
MSE( predict(ridge.1, newdata = prostate.test)- prostate.test$lcavol)
```
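Another implementation. Note that `glmnet` does not choose the tuning parameter $\lambda$ itself: it fits the whole regularization path, so the errors below are averaged over all the values of $\lambda$ on that path (`cv.glmnet` can be used to select a single value):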
```{r Ridge II}
# install.packages('glmnet')
library(glmnet)
ridge.2 <- glmnet(x=X.train, y=y.train, alpha = 0)
# Train error:
MSE( predict(ridge.2, newx =X.train)- y.train)
# Test error:
MSE( predict(ridge.2, newx = X.test)- y.test)
```
__Note__: `glmnet` is slightly picky.
I could not have created `y.train` using `select()` because I need a vector and not a
`data.frame`. Also, `as.matrix` is there as `glmnet` expects a `matrix` class `x` argument.
These objects are created in the make_samples.R script, which we sourced in the beginning.
# LASSO Regression
```{r LASSO}
# install.packages('glmnet')
library(glmnet)
lasso.1 <- glmnet(x=X.train, y=y.train, alpha = 1)
# Train error:
MSE( predict(lasso.1, newx =X.train)- y.train)
# Test error:
MSE( predict(lasso.1, newx = X.test)- y.test)
```
# SVM
## Classification
```{r SVM classification}
library(e1071)
svm.1 <- svm(spam~., data = spam.train)
# Train confusion matrix:
.predictions.train <- predict(svm.1)
(confusion.train <- table(prediction=.predictions.train, truth=spam.train$spam))
missclassification(confusion.train)
# Test confusion matrix:
.predictions.test <- predict(svm.1, newdata = spam.test)
(confusion.test <- table(prediction=.predictions.test, truth=spam.test$spam))
missclassification(confusion.test)
```
## Regression
```{r SVM regression}
svm.2 <- svm(lcavol~., data = prostate.train)
# Train error:
MSE( predict(svm.2)- prostate.train$lcavol)
# Test error:
MSE( predict(svm.2, newdata = prostate.test)- prostate.test$lcavol)
```
# GAM Regression
```{r GAM}
# install.packages('mgcv')
library(mgcv)
form.1 <- lcavol~ s(lweight)+ s(age)+s(lbph)+s(svi)+s(lcp)+s(gleason)+s(pgg45)+s(lpsa)
# The model is too rich. Let's select a variable subset:
gam.1 <- gam(form.1, data = prostate.train)
# Select the most promising coefficients (a very arbitrary practice):
ridge.1 %>% coef %>% abs %>% sort(decreasing = TRUE)
# Keep only promising coefficients in model:
form.2 <- lcavol~ s(lweight)+ s(age)+s(lbph)+s(lcp)+s(pgg45)+s(lpsa)
gam.2 <- gam(form.2, data = prostate.train)
# Train error:
MSE( predict(gam.2)- prostate.train$lcavol)
# Test error:
MSE( predict(gam.2, newdata = prostate.test)- prostate.test$lcavol)
```
# Neural Net
## Regression
```{r NNET regression}
library(nnet)
nnet.1 <- nnet(lcavol~., size=20, data=prostate.train, rang = 0.1, decay = 5e-4, maxit = 1000)
# Train error:
MSE( predict(nnet.1)- prostate.train$lcavol)
# Test error:
MSE( predict(nnet.1, newdata = prostate.test)- prostate.test$lcavol)
```
```{r}
validate.nnet(3)
validate.nnet(4)
validate.nnet(20)
validate.nnet(50)
sizes <- seq(2, 30)
validate.sizes <- rep(NA, length(sizes))
for (i in seq_along(sizes)){
validate.sizes[i] <- validate.nnet(sizes[i])$test
}
plot(validate.sizes~sizes, type='l')
```
What can I say... This plot is not what I would expect. Could be due to the random nature of
the fitting algorithm.
## Classification
```{r NNET Classification}
nnet.2 <- nnet(spam~., size=5, data=spam.train, rang = 0.1, decay = 5e-4, maxit = 1000)
# Train confusion matrix:
.predictions.train <- predict(nnet.2, type='class')
(confusion.train <- table(prediction=.predictions.train, truth=spam.train$spam))
missclassification(confusion.train)
# Test confusion matrix:
.predictions.test <- predict(nnet.2, newdata = spam.test, type='class')
(confusion.test <- table(prediction=.predictions.test, truth=spam.test$spam))
missclassification(confusion.test)
```
# CART
## Regression
```{r Tree regression}
library(rpart)
tree.1 <- rpart(lcavol~., data=prostate.train)
# Train error:
MSE( predict(tree.1)- prostate.train$lcavol)
# Test error:
MSE( predict(tree.1, newdata = prostate.test)- prostate.test$lcavol)
```
At this stage we should prune the tree using `prune()`...
## Classification
# Random Forest
TODO
# Rotation Forest
TODO
# Smoothing Splines
I will demonstrate the method with a single predictor, so that we can visualize the smoothing
that has been performed:
```{r Smoothing Splines}
spline.1 <- smooth.spline(x=X.train, y=y.train)
# Visualize the non linear hypothesis we have learned:
plot(y.train~X.train, col='red', type='h')
points(spline.1, type='l')
```
I am not extracting train and test errors as the output of `smooth.spline` will require some
tweaking for that.
# KNN
## Classification
```{r knn classification}
library(class)
knn.1 <- knn(train = X.train.spam, test = X.test.spam, cl =y.train.spam, k = 1)
```
# Kernel Regression
Kernel regression includes many particular algorithms.
```{r kernel}
# install.packages('np')
library(np)
ksmooth.1 <- npreg(txdat =X.train, tydat = y.train)
# Train error:
MSE( predict(ksmooth.1)- prostate.train$lcavol)
```
There is currently no method to make prediction on test data with this function.
# Stacking
As seen in the class notes, there are many ensemble methods.
Stacking, in my view, is by far the most useful and coolest. It is thus the only one I present
here.
The following example is adapted from [James E. Yonamine](http://jayyonamine.com/?p=456).
```{r Stacking}
#####step 1: train models ####
#logits
logistic.2 <- cv.glmnet(x=X.train.spam, y=y.train.spam, family = "binomial", alpha = 0)
logistic.3 <- cv.glmnet(x=X.train.spam, y=y.train.spam, family = "binomial", alpha = 1)
```
# Fisher's LDA
```{r LDA}
library(MASS)
lda.1 <- lda(spam~., spam.train)
# Train confusion matrix:
.predictions.train <- predict(lda.1)$class
(confusion.train <- table(prediction=.predictions.train, truth=spam.train$spam))
missclassification(confusion.train)
# Test confusion matrix:
.predictions.test <- predict(lda.1, newdata = spam.test)$class
(confusion.test <- table(prediction=.predictions.test, truth=spam.test$spam))
missclassification(confusion.test)
```
__Caution__:
Both `MASS` and `dplyr` have a function called `select`. I will thus try to avoid the two packages being
loaded at once, or call the function by its full name: `MASS::select` or `dplyr::select`.
# Naive Bayes
```{r Naive Bayes}
library(e1071)
nb.1 <- naiveBayes(spam~., data = spam.train)
# Train confusion matrix:
.predictions.train <- predict(nb.1, newdata = spam.train)
(confusion.train <- table(prediction=.predictions.train, truth=spam.train$spam))
missclassification(confusion.train)
# Test confusion matrix:
.predictions.test <- predict(nb.1, newdata = spam.test)
(confusion.test <- table(prediction=.predictions.test, truth=spam.test$spam))
missclassification(confusion.test)
```
__Note__: `foo::bar` means that function `bar` is part of the `foo` package.
With this syntax, there is no need to load (`library`) the package.
If a line does not run, you may need to install the package: `install.packages('foo')`.
Sadly, RStudio currently does not autocomplete function arguments when using the `::`
syntax.
# Learning Distributions
## Gaussian Density Estimation
```{r}
# Sample from a multivariate Gaussian:
## Generate a covariance matrix
p <- 10
Sigma <- bayesm::rwishart(nu = 100, V = diag(p))$W
lattice::levelplot(Sigma)
# Sample from a multivariate Gaussian:
n <- 1e3
means <- 1:p
X1 <- mvtnorm::rmvnorm(n = n, sigma = Sigma, mean = means)
dim(X1)
# Estimate parameters and compare to truth:
estim.means <- colMeans(X1) # recall truth is (1,...,10)
plot(estim.means~means); abline(0,1, lty=2)
estim.cov <- cov(X1)
estim.cov.errors <- Sigma - estim.cov
lattice::levelplot(estim.cov.errors)
plot(estim.cov~Sigma); abline(0,1, lty=2)
frobenius(estim.cov.errors)
```
## Association rules
Note: Visualization examples are taken from the arulesViz [vignette](http://cran.rproject.org/web/packages/arulesViz/vignettes/arulesViz.pdf)
```{r association rules}
library(arules)
data("Groceries")
inspect(Groceries[1:2])
summary(Groceries)
rules <- apriori(Groceries, parameter = list(support=0.001, confidence=0.5))
summary(rules)
rules %>% sort(by='lift') %>% head %>% inspect
```
# Dimensionality Reduction
## PCA
Note: example is a blend from [Gaston Sanchez](http://gastonsanchez.com/blog/howto/2012/06/17/PCA-in-R.html) and [Georgia's Geography dept.](http://geog.uoregon.edu/GeogR/topics/pca.html).
```{r}
# As a correlation graph
cor.1 <- cor(USArrests)
qgraph::qgraph(cor.1)
qgraph::qgraph(cor.1, layout = "spring", posCol = "darkgreen", negCol = "darkmagenta")
```
```{r PCA}
USArrests.1 <- USArrests[,-3] %>% scale
pca1 <- prcomp(USArrests.1, scale. = TRUE)
(pca1$rotation) # loadings
# Now score the states:
pca1$x %>% extract(,1) %>% sort %>% head
```
Interpretation:
- PC1 seems to capture overall crime rate.
- PC2 seems to distinguish between sexual and non-sexual crimes.
The bi-Plot
```{r biplot}
biplot(pca1) #ugly!
# library(devtools)
# install_github("vqv/ggbiplot")
ggbiplot::ggbiplot(pca1, labels = rownames(USArrests.1)) # better!
```
The scree-plot
```{r screeplot}
ggbiplot::ggscreeplot(pca1)
```
So clearly the main differentiation
Visualize the scoring as a projection of the states' attributes onto the factors.
```{r}
# get parameters of component lines (after Everitt & Rabe-Hesketh)
load <- pca1$rotation
slope <- load[2, ]/load[1, ]
mn <- apply(USArrests.1, 2, mean)
intcpt <- mn[2] - (slope * mn[1])
# scatter plot with the two new axes added
par(pty = "s") # square plotting frame
USArrests.2 <- USArrests[,1:2] %>% scale
xlim <- range(USArrests.2) # overall min, max
plot(USArrests.2, xlim = xlim, ylim = xlim, pch = 16, col = "purple") # both axes same length
abline(intcpt[1], slope[1], lwd = 2) # first component solid line
abline(intcpt[2], slope[2], lwd = 2, lty = 2) # second component dashed
legend("right", legend = c("PC 1", "PC 2"), lty = c(1, 2), lwd = 2, cex = 1)
# projections of points onto PCA 1
y1 <- intcpt[1] + slope[1] * USArrests.2[, 1]
x1 <- (USArrests.1[, 2] - intcpt[1])/slope[1]
y2 <- (y1 + USArrests.1[, 2])/2
x2 <- (x1 + USArrests.1[, 1])/2
segments(USArrests.1[, 1], USArrests.1[, 2], x2, y2, lwd = 2, col = "purple")
```
## sPCA
```{r sPCA}
```
## kPCA
```{r kPCA}
kernlab::kpca()
```
## Random Projections
```{r Random Projections}
```
## MDS
```{r MDS}
stats::cmdscale()
MASS::sammon()
MASS::isoMDS()
```
## Isomap
```{r Isomap}
```
## LLE
```{r LLE}
```
## LocalMDS
```{r Local MDS}
```
## Principal Curves & Surfaces
```{r Principal curves}
```
## HMM
```{r}
# install.packages('HiddenMarkov')
library(HiddenMarkov)
```
# Clustering:
Generate clusters:
```{r generate clusters}
X <- clusterGeneration::genRandomClust(numClust=2)
clusterGeneration::viewClusters(X, cl=2)
```
## K-means
```{r kmeans}
stats::kmeans()
```
## Kmeans++
```{r kmeansPP}
kmpp <- function(X, k) {
n <- nrow(X)
C <- numeric(k)
C[1] <- sample(1:n, 1)
for (i in 2:k) {
dm <- distmat(X, X[C, ])
pr <- apply(dm, 1, min); pr[C] <- 0
```
## K-medoids
```{r kmedoids}
cluster::pam()
# Many other similarity measures:
proxy::dist()
```
## Hierarchical
```{r}
hclust()
# install.packages('cluster')
library(cluster)
agnes()
```
## Spectral Clustering
```{r}
# install.packages('kernlab')
library(kernlab)
specc()
```