Jeff Byers - Machine Learning and Advanced Statistics

“Physics of Data” Padua Lectures
Dr. Jeff M. Byers

Naval Research Laboratory
Washington, D.C.
Local Email: Jeff.Byers@pd.infn.it
Syllabus of supplementary lectures
Spring semester 2019
CLASS: Statistical Mechanics of Complex Systems (8 hours)
Diffusion: The Ultimate Learning Machine
• Machine learning kernels and the Green’s function of the Diffusion equation
• Local structure versus Global symmetries: Hamiltonians or Green’s functions
• Diffusion on a Manifold: Parametrix approximation
Information theory: Understanding the world as a communications channel
• Comparing probability distributions: From KL divergence to information geometry
• Relating probability, information theory and geometry
• Statistical manifolds: The geometry of models
• The manifold of multivariate Gaussian probability distributions
Random Matrix Theory: 1-d statistical mechanics of eigenvalues
• Random covariance matrices: Semi-circle and Quarter circle Laws
• Inferring covariance matrices from data: The Wishart distribution
• Mean-field theory for estimating matrices: Variational Bayes
CLASS: Advanced Statistics for Physical Analysis (8 hours)
Bayesian inference: Letting the model fluctuate around the data
• Using a stochastic process as a Bayesian prior
• Gaussian Process: Using a path integral to fit data
• GP Example: Fitting spherical harmonics to data
• Dirichlet Process: The probability over probability distributions
• DP Example: Computing the uncertainty in the density estimation of a gas
• Beta Process: The probability over combinations of features
The statistics of high-dimensional point clouds: Your intuition is wrong!
• Correcting the bias of the average: The James-Stein estimator
• When to use the Gram or the Covariance Matrix
• Estimating the dimensionality of the data
• Divide and conquer the manifold: Estimating tangent spaces in high-d data
• Stitching together the tangent spaces: Curvature functionals
Dynamics: The OODA Loop
WARNING LABELS
Our Choices should not matter!!!
Let the experiment make the choices. Physicists should avoid
making theoretical choices. If such a choice is required then
show why it doesn’t matter. In other words, it’s just for
computational efficacy. Or, it is compelled by some
symmetry property or conservation law (Noether’s Theorem)
Probability is in our Models!!!

Probability quantifies uncertainty in our representations and
the corresponding inference procedures from data to models.
Failure to understand this will cause our minds to get mixed
into the physical world in a very confusing way.
Do NOT physicalize probability.
The Map is not the World
• Concepts such as coordinate systems and probability are powerful tools for
humans to build models that represent aspects of the universe for our
particular purposes.
• Note, these are choices we make and NOT fundamental attributes of the
universe. They reside in our minds and associated artifacts (e.g., databases,
journal articles, computer programs) as a representation of the universe.
• These choices are both necessary and arbitrary and there is no way to avoid
making them.
• We should NEVER become confused and push even our most useful
concepts onto the universe and imagine they have a reality independent
from the representations in our mind.
• The key is to identify those properties of the universe that remain after we
have removed the effect of our arbitrary choices in how to represent it.
Historical Note: Doing this for coordinate systems led Einstein to General Relativity and the failure to
do this for probability has led to the interpretation morass of Quantum Mechanics.
Types of Models
X – Properties
Y - Observations
Statistical resolvability of a model
The “ball” in a “box”

What could be observed? → How big is the box?
What is observed? → Where is the ball?
What is the fidelity of the observation? → How big is the ball?
“Data” here refers to all the possible observations.
Bayesians believe they can more easily specify the size of the “Model” box using a prior distribution and this specifies the sampling distribution.
Figure labels: p(model), p(data | model)
Frequentists believe they can more easily specify the size of the “Data” box using a sampling distribution and then test the significance of models.
Figure labels: p(data), p(data = obs), p(data = obs | model)
Bayesians “pull back” the observations to the model space as a posterior probability, p(model | obs).
Bayesian Inference: Looking backwards through the model
Physicists usually think about what model can explain the data
BUT what about a different view of this question:
How does the data constrain the possible models?

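Bayes’ Formula:
$$\underbrace{p(\text{model} \mid \text{data})}_{\text{POSTERIOR}} = \frac{\overbrace{p(\text{data} \mid \text{model})}^{\text{LIKELIHOOD}} \; \overbrace{p(\text{model})}^{\text{PRIOR}}}{\underbrace{p(\text{data})}_{\text{EVIDENCE}}}$$
The models are instances of a stochastic process that fluctuates around the data!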
Bayesian Learning: Data chips away the Prior
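So how do we get the data to construct its own representation?
Prior representation: p(μ, σ)
Bayes’ Theorem:
$$\underbrace{p(\mu, \sigma \mid x)}_{\text{posterior}} = \frac{\overbrace{p(x \mid \mu, \sigma)}^{\text{likelihood}} \; \overbrace{p(\mu, \sigma)}^{\text{prior}}}{\underbrace{p(x)}_{\text{evidence}}}$$
UPDATE: NEW DATA turn the Prior into the Posterior, which is the REPRESENTATION after N samples.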
Bayesian Inference: Looking backwards through the model
Goal: Learn the fairness or bias, B, of a coin.
The sequence of coin tosses forms the data set: x = H H T H T T H T T H T H
- Assign prior probability based on beliefs: 3 different priors (or initial conditions)
BLUE → Coin is heavily biased to TAILS
RED → Coin is biased to HEADS
GREEN → No information (max. entropy)
- Assign a likelihood to the process:
p(x = H | B) = B
p(x = T | B) = 1 − B
NOTE: B is an unknown parameter.
- Update beliefs with data sequence:
$$\underbrace{p(B \mid x)}_{\text{posterior}} = \frac{\overbrace{p(x \mid B)}^{\text{likelihood}} \; \overbrace{p(B)}^{\text{prior}}}{\underbrace{p(x)}_{\text{evidence}}}$$
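The coin update can be sketched numerically on a grid over B. This is a minimal illustration, not the slide's own computation: the three prior shapes below are hypothetical Beta-like densities chosen to stand in for the BLUE, RED and GREEN curves, whose exact forms are not given.

```python
import numpy as np

# Data from the slide: H = heads, T = tails.
tosses = "HHTHTTHTTHTH"
n_heads = tosses.count("H")
n_tails = tosses.count("T")

# Grid over the unknown bias B = p(x = H | B).
B = np.linspace(0.0, 1.0, 1001)
dB = B[1] - B[0]

def normalize(density):
    """Rescale a density on the grid so that it integrates to one."""
    return density / (density.sum() * dB)

# ASSUMPTION: hypothetical priors standing in for the three curves.
priors = {
    "tails-biased": normalize(B**1 * (1 - B) ** 7),  # Beta(2, 8) shape
    "heads-biased": normalize(B**5 * (1 - B) ** 2),  # Beta(6, 3) shape
    "flat": normalize(np.ones_like(B)),              # maximum entropy
}

# Likelihood of the whole sequence: p(x | B) = B^heads * (1 - B)^tails.
likelihood = B**n_heads * (1 - B) ** n_tails

# Posterior is likelihood times prior, divided by the evidence p(x).
posteriors = {name: normalize(likelihood * prior) for name, prior in priors.items()}
posterior_means = {
    name: float((B * post).sum() * dB) for name, post in posteriors.items()
}

for name, mean in posterior_means.items():
    print(f"{name:>13s}: posterior mean of B = {mean:.3f}")
```

The normalization step plays the role of dividing by the evidence, so the same code works for any prior that can be evaluated on the grid.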
At some point, we must make a decision …
Decision-theoretic perspective:
• Define a set of probability models, p(X | θ), for the data, X, indexed by a parameter, θ ∈ Θ.
• Define a procedure d(X) that operates on the data to produce a decision.
• Define a loss function, L(θ, d(X)).
• The goal is to use the loss function to compare procedures via a risk, R; however, both arguments are unknown!
Figure: the statistical estimation procedure, θ̂ = d(X), maps DATA SPACE to MODEL SPACE (θ ∈ Θ).
• DATA SPACE: Beliefs about the data you might sample place a probability, p_θ(X), on data space.
• MODEL SPACE: Beliefs about models prior to acquiring data place a probability, p(θ), on model space.
• Loss function: L(θ, d(X)).
Goal: We need to assign a scalar to each decision procedure, R : d → ℝ
Slide inspired by Prof. Michael Jordan (UC Berkeley): http://videolectures.net/mlss09uk_jordan_bfway/
The two ways to make decisions with probability
Goal: We need to assign a scalar to each decision procedure, R : d → ℝ
Loss function (common to both): L(θ, d(X))

Frequentist:
• θ is an index over models, NOT a random variable.
• Sampling distribution: Put a probability distribution, p_θ(X), on observing various data.
• Frequentist expectation of risk:
$$R_F(\theta, d) = \int dX \; L(\theta, d(X)) \, p_\theta(X)$$
• Model pessimism: Consider the best decision for the worst choice of model:
$$\theta^*(d) = \arg\max_{\theta} R_F(\theta, d)$$
• Frequentist decision rule (Minimax):
$$d_F = \arg\min_{d} R_F(\theta = \theta^*(d), d)$$

Bayesian:
• θ is an index over models AND a random variable.
• Prior distribution: Put an initial probability distribution, p(θ), on the space of model indices (parameters).
• Bayesian expectation of risk:
$$R_B(d(X)) = \int d\theta \; p(\theta) \, L(\theta, d(X))$$
• Data optimism: Consider the best decision given the observed data, X = X_obs.
• Bayesian decision rule:
$$d_B = \arg\min_{d} R_B(d(X = X_{obs}))$$

Minimax: For each d select the θ that maximizes R_F, then select the minimum of these as the risk.
Slide inspired by Prof. Michael Jordan (UC Berkeley): http://videolectures.net/mlss09uk_jordan_bfway/

Bayes’ Risk:
$$R(d) = \int d\theta \, dX \; p(\theta) \, L(\theta, d(X)) \, p_\theta(X)$$
Coverage: The true value of θ lies in a bounded region of Θ determined by X and d_F.
Coherence:
$$p(X) = \int d\theta \; p(X \mid \theta) \, p(\theta)$$
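The two ways of turning a risk into a scalar can be compared on the coin problem. The sketch below is an illustration under stated assumptions, not the slide's own example: it uses a squared-error loss, n = 12 tosses, a uniform prior p(θ), and three hypothetical candidate procedures. The Bayesian score here is the Bayes' risk integrated over both θ and X, rather than the risk conditioned on X = X_obs.

```python
import numpy as np
from math import comb, sqrt

n = 12                                # number of tosses (assumed)
theta = np.linspace(0.0, 1.0, 201)    # grid over the model index theta
k = np.arange(n + 1)                  # possible data: number of heads

# Sampling distribution p(k | theta), shape (len(theta), n + 1).
binom = np.array([comb(n, j) for j in k], dtype=float)
p_k_given_theta = binom * theta[:, None] ** k * (1.0 - theta[:, None]) ** (n - k)

# ASSUMPTION: three hypothetical procedures d(k) estimating theta.
procedures = {
    "mle": k / n,
    "posterior-mean": (k + 1) / (n + 2),           # uniform prior
    "shrunk": (k + sqrt(n) / 2) / (n + sqrt(n)),   # constant-risk estimator
}

def frequentist_risk(decisions):
    """R_F(theta, d) = sum_k L(theta, d(k)) p(k | theta), squared-error loss."""
    loss = (theta[:, None] - decisions[None, :]) ** 2
    return (loss * p_k_given_theta).sum(axis=1)

def average_over_prior(risk):
    """Integrate R_F over a uniform prior p(theta) = 1 (trapezoid rule)."""
    return float(((risk[:-1] + risk[1:]) / 2 * np.diff(theta)).sum())

risk_curves = {name: frequentist_risk(d) for name, d in procedures.items()}
max_risk = {name: float(r.max()) for name, r in risk_curves.items()}
bayes_risk = {name: average_over_prior(r) for name, r in risk_curves.items()}

# Minimax: smallest worst-case risk. Bayes: smallest prior-averaged risk.
minimax_choice = min(max_risk, key=max_risk.get)
bayes_choice = min(bayes_risk, key=bayes_risk.get)
print("minimax picks:", minimax_choice)
print("Bayes picks:  ", bayes_choice)
```

With these assumptions the two rules select different procedures, which is the point of the comparison: the ranking depends on how the unknown θ is removed from the risk.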

MOTIVATION & BACKGROUND
High-dimensional Point Cloud Data

Hyperspectral Imaging Computer Vision – Local Descriptors
Parts Models of Objects Dynamic Reconfiguration

Dynamics in the Data
Understanding the past is as difficult as predicting the future.
NOW
PAST FUTURE
DATA → X_{t−2} → X_{t−1} → X_t → X_{t+1} → X_{t+2} →
State estimation Prediction

“Dynamics” in the Model
• Observe: Sensor representation spaces, input from the recent past

• Orient: Model spaces or Belief spaces of POMDPs, state estimation
of the present
• Decide: Game-theoretic spaces, look into the possible futures given
the current state.
o Continuous game strategy spaces
o Bayesian games of incomplete/imperfect information
• Act: State spaces of our systems, at some point you have to make a
choice. As an agent in the world, the actions reveal the causal
connections within it.
Complex geometry of models
The University of Florida Sparse Matrix Collection, T. A. Davis and Y. Hu, ACM Transactions on Mathematical Software, Vol. 38, Issue 1, 2011, pp. 1-25. http://www.cise.ufl.edu/research/sparse/matrices/synopsis
From rules to intelligence
Prosthesis for the mind:
From data to symbolic reasoning
https://www.fiverr.com/bilalahmedd/machine-learning-data-science-tensor-flow-python
Neural Networks
Already did these?

Dynamics in the Model
Stochastic Gradient Descent: Backpropagation on mini-batches (random sub-sampling of data)
Neural Network: y = f_W(x)
Cost Function:
$$C(\mathbf{W}) = \frac{1}{N} \sum_{n=1}^{N} \left\| y_n - f_{\mathbf{W}}(x_n) \right\|^2$$
Gradient descent (gradients from Automatic Differentiation):
$$w_{ij}(t+1) = w_{ij}(t) - \eta \, \frac{\partial C}{\partial w_{ij}}$$
Learning Rate: η. More sophisticated choices (e.g., ADAM) are possible.
Magic: “Inductive Bias” imposed on the network.
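The mini-batch update rule can be sketched in a few lines. This is a minimal illustration under stated assumptions: a linear model f_W(x) = W·x stands in for the neural network, the gradient is written by hand rather than obtained by automatic differentiation, and the data, learning rate and batch size are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# ASSUMPTION: synthetic data from a hypothetical linear model.
N, D = 512, 3
W_true = np.array([1.5, -2.0, 0.5])
X = rng.normal(size=(N, D))
y = X @ W_true + 0.01 * rng.normal(size=N)

def cost(W, X, y):
    """C(W) = (1/N) * sum_n (y_n - f_W(x_n))^2 for the linear model."""
    return float(np.mean((y - X @ W) ** 2))

def grad(W, X, y):
    """Gradient of the cost with respect to W on the given (mini-)batch."""
    return -2.0 * X.T @ (y - X @ W) / len(y)

W = np.zeros(D)
eta = 0.05          # learning rate
batch_size = 32

for epoch in range(50):
    order = rng.permutation(N)               # random sub-sampling of the data
    for start in range(0, N, batch_size):
        idx = order[start:start + batch_size]
        W -= eta * grad(W, X[idx], y[idx])   # w(t+1) = w(t) - eta * dC/dw

final_cost = cost(W, X, y)
print("estimated W:", np.round(W, 3), " final cost:", final_cost)
```

Each step uses only a random subset of the data, so the gradient is a noisy estimate of the full-data gradient; the noise shrinks as the fit improves because it scales with the residuals.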
Deep Learning as Layered Neural Nets

Deep Learning as Recurrent Neural Nets

Is Deep Learning how we do it?
“I’ve worked all my life in Machine
Learning, and I’ve never seen one
algorithm knock over benchmarks like
Deep Learning.”
Playing to the hope of

interpretability
LEVITY
Deep Learning as all kinds of stuff …

Intriguing properties of neural networks:
Adversarial examples
Besides pigs being able to fly in Deep Learning, turtles can be very threatening … and your autonomous car can play Stopball.
MIT CSAIL: Synthesizing Robust Adversarial Examples, http://proceedings.mlr.press/v80/athalye18b/athalye18b.pdf
Use of ShapeShifter by Shang-Tse Chen, Ga.Tech
Adversarial Examples
The algorithm is >99.6% confident of these labels
“Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images”,
Anh Nguyen, Jason Yosinski and Jeff Clune, CVPR 2015, p.427-436.
So What’s Happening?
The data lies along a manifold constrained to low dimensions by the generative mechanism: d << D
However, the DL Neural Network typically chops up the space with hyperplanes to form non-local decision boundaries for classes.
Exception!
Real data Adversarial data
https://thomas-tanay.github.io/post--L2-regularization/
Generative Adversarial Network (GAN): Bug as Feature
“This, and the variations that are now being proposed is
the most interesting idea in the last 10 years in ML, in
my opinion.” -Yann LeCun (2016)
More Supervised Learning to the rescue … LABEL = { Real, Fake }
Data manifold
This is a form of Implicit Density Estimation
Is this data from the manifold?
YES
NO
NOT on data manifold

Why density estimation?
• Generate new data for simulators.
• Work with missing data.
• Represent data in more compressed forms to find latent spaces.
• Detect anomalies and outliers.
DECISION: Think fast … OR Think slow?
Finding structure in high-dimensional point clouds
Challenge: Use the data to estimate a crossover regime from
high-dimensional noise to low-dimensional structure.
Solution: Mini-batches as output from Locality Sensitive Hashing, not randomly sampling the entire data set.
LOCAL MODELS
Signal subspace: d << D
Noise subspace: d_N = D − d
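The split into a signal subspace and a noise subspace can be illustrated on one local patch of data. The sketch below rests on assumptions that are not in the slides: the patch is synthetic (a flat d-dimensional subspace plus isotropic noise), it is treated as a single neighborhood rather than being produced by Locality Sensitive Hashing, and the dimension is read off from the largest gap in the log-eigenvalue spectrum, which is only one possible heuristic.

```python
import numpy as np

rng = np.random.default_rng(1)

# ASSUMPTION: a synthetic local patch, d-dimensional signal embedded in D dimensions.
D, d, N = 10, 2, 500
basis, _ = np.linalg.qr(rng.normal(size=(D, d)))   # orthonormal columns, shape (D, d)
signal = rng.normal(size=(N, d)) @ basis.T         # points on the signal subspace
noise = 0.05 * rng.normal(size=(N, D))             # isotropic high-dimensional noise
X = signal + noise

# Eigenvalues of the sample covariance matrix, sorted in descending order.
Xc = X - X.mean(axis=0)
eigvals = np.linalg.eigvalsh(Xc.T @ Xc / (N - 1))[::-1]

# Heuristic: the crossover from structure to noise is the largest log-gap.
gaps = np.log(eigvals[:-1]) - np.log(eigvals[1:])
d_est = int(np.argmax(gaps)) + 1

print("eigenvalues:", np.round(eigvals, 4))
print("estimated signal dimension d =", d_est, " noise dimension d_N =", D - d_est)
```

On real data the gap is rarely this clean, which is why the estimate is best made locally, where the manifold is close to flat.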
Spectral Dynamics
Dynamical Systems Theory Random Matrix Theory
Statistical Mechanics of Knowledge
“Atom of Decidability”
PHYSICS
INFORMATION 0 1
LANGUAGE F T
Cantor’s Nightmare: Developing a proper fear of the infinite
The real number system is a very unstable representation for modelling.
Likewise, functions on the reals are strange objects. Why?
The function should be viewed as the limit of a probabilistic association between the elements of the sets in the domain and range.
More topics
Mathematical Background
Thinking of functions as “really big” vectors
Functional derivatives and integrals
Linear models: Everything linear algebra can do with data

Optimization in high-dimensional parameter space: Saddle points
Data compression while preserving information
• Unsupervised Learning, p(x): Rate Distortion Theory (Reconstructive Loss)
a) Learning how to predict itself
b) Density estimation
i. Clustering
ii. Manifold learning
• Supervised Learning, p(x, y): Information Bottleneck (Discriminative Loss)
From probability to functions:
$$p(y \mid x) \to \delta(y - f(x)) \to y = f(x)$$
$$H(Y \mid X) = -\left\langle \log_2 p(y \mid x) \right\rangle_{p(x, y)} \to 0$$
Perspectives on models:
Deterministic → Probabilistic → Information-theoretic → Geometric
From functions to measures to rescaling to
Where do labels come from?
Two spaces
In the beginning, there was only one space … but then we imagined there were two.
Many problems in machine learning and statistics are fundamentally about finding
a decomposition of a space into two subspaces and then learning a mapping
between them using a finite data set.
Entire space: z ∈ E = (B × F)
Base space: x ∈ B
Fiber: y ∈ F
Trivializing neighborhood: U ⊂ B
$$z = \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} x_1 \\ \vdots \\ x_d \\ y_1 \\ \vdots \\ y_{D-d} \end{pmatrix}$$
Vapnik’s Gamble: Discriminative Supervised Learning
“When solving a problem of interest, do not solve a more general problem as an intermediate step.” – Vladimir Vapnik
Inference route: Observations y → INFERENCE (Retrieval, “f⁻¹”) → x̂ → LABELING FUNCTION L → Label ℓ̂
ML route (coarse-grained retrieval): Observations y → ML Prediction L̃ → Label, where
$$\tilde{L} = L \circ f^{-1}$$
$$\tilde{L} \to p(\ell \mid y) = \int dx \; p(\ell \mid x) \, p(x \mid y)$$
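For discrete spaces the coarse-grained retrieval reduces to a matrix product: p(ℓ | y) is obtained by summing p(ℓ | x) p(x | y) over x. The tables below are made-up numbers for illustration only (two observations, three property values, two labels).

```python
import numpy as np

# ASSUMPTION: hypothetical retrieval table p(x | y), rows = y, columns = x.
p_x_given_y = np.array([
    [0.8, 0.2, 0.0],
    [0.1, 0.6, 0.3],
])

# ASSUMPTION: hypothetical labeling table p(l | x), rows = x, columns = l.
p_l_given_x = np.array([
    [1.0, 0.0],
    [0.5, 0.5],
    [0.0, 1.0],
])

# p(l | y) = sum_x p(l | x) p(x | y): marginalize the intermediate property x.
p_l_given_y = p_x_given_y @ p_l_given_x

print(p_l_given_y)
```

The learner only ever needs the product on the left; the two factors on the right are the more general problem that the quote advises against solving as an intermediate step.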

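Summary of perspectives on Data-to-Decisions
Deterministic:
• Decisions: labels ℓ and ℓ′, related by label mixing
• Data to decisions: labeling functions L and L̃
• Data: x → y through the forward model, f
Probabilistic:
• Decisions: p(ℓ) and p(ℓ′), related by the confusion matrix, p(ℓ′ | ℓ)
• Data to decisions: p(ℓ | x) and p(ℓ′ | y)
• Data: p(x) and p(y), related by the likelihood, p(y | x)
Information-theoretic:
• Decisions: H(L) and H(L̃), related by I(L, L̃)
• Data to decisions: I(X, L) and I(Y, L̃)
• Data: H(X) and H(Y), related by I(X, Y)
$$I(L, \tilde{L}) = H(L) - H(L \mid \tilde{L})$$
$$C = \max_{p(\ell \mid x)} I(L, \tilde{L})$$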
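The mutual information between two labelings can be computed directly from a confusion matrix using I = H(L) − H(L | L̃). The counts below are made up for illustration, and the sketch evaluates the mutual information only; it does not carry out the maximization that defines the capacity C.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability vector (zero entries are skipped)."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-(nz * np.log2(nz)).sum())

def mutual_information(joint):
    """I(L, L') = H(L) - H(L | L'), from a joint table or a confusion-count matrix."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()
    p_l = joint.sum(axis=1)         # marginal over rows (true labels)
    p_lp = joint.sum(axis=0)        # marginal over columns (assigned labels)
    h_cond = entropy(joint.ravel()) - entropy(p_lp)   # H(L | L') = H(L, L') - H(L')
    return entropy(p_l) - h_cond

# ASSUMPTION: hypothetical confusion counts, rows = true label, columns = assigned label.
confusion = np.array([[45, 5],
                      [5, 45]])

mi_bits = mutual_information(confusion)
print(f"I(L, L') = {mi_bits:.4f} bits")
```

A perfect classifier on two equally likely labels gives 1 bit; label mixing lowers the value toward 0, which it reaches when the two labelings are independent.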
Information-theoretic view of remote sensing
Using the “Information Bottleneck” approach to solve for the classification design and the experimental design.
Property space: x. Observation space: y. QR indices: ℓ.
Forward model, f, from property space to observation space: y = f(x)
Labeling function, L, on property space: ℓ = L(x)
Labeling function, L̃, on observation space: ℓ′ = L̃(y)

Classification Design (to support the experiment)
Given the possible observations of the property space, what are the reliable classification schemes?
Quantizing the property space, x, to preserve information about the observations, y:
$$p(\ell \mid x) = \frac{p(\ell)}{Z(x, \beta)} \exp\left[ -\beta \sum_{y} p(y \mid x) \log \frac{p(y \mid x)}{p(y \mid \ell)} \right]$$
$$p(y \mid \ell) = \frac{1}{p(\ell)} \sum_{x} p(\ell \mid x) \, p(y \mid x) \, p(x)$$
$$p(\ell) = \sum_{x} p(\ell \mid x) \, p(x)$$
The parameter β represents the tradeoff between compression of x and the predictive accuracy of L.

Experimental Design (to support the classification)
Given a classification scheme of the property space, what are the reliable observations?
Quantizing the property space, x, w.r.t. the labeling function, L:
$$p(y \mid x) = \frac{p(y)}{Z(x, \beta)} \exp\left[ -\beta \sum_{\ell} p(\ell \mid x) \log \frac{p(\ell \mid x)}{p(\ell \mid y)} \right]$$
$$p(\ell \mid y) = \frac{1}{p(y)} \sum_{x} p(y \mid x) \, p(\ell \mid x) \, p(x)$$
$$p(y) = \sum_{x} p(y \mid x) \, p(x)$$
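The three self-consistent equations of the classification design can be iterated to a fixed point. The sketch below is a minimal version under stated assumptions: x, y and ℓ are discrete, the joint table p(x, y) is made up, the initial p(ℓ | x) is random, and a small constant guards the logarithms. The function name and its parameters are hypothetical.

```python
import numpy as np

def information_bottleneck(p_xy, n_labels, beta, n_iter=200, seed=0):
    """Iterate the self-consistent equations for p(l|x), p(l) and p(y|l)."""
    rng = np.random.default_rng(seed)
    eps = 1e-12
    p_x = p_xy.sum(axis=1)
    p_y_given_x = p_xy / p_x[:, None]

    # Random soft assignment p(l | x) as a starting point.
    p_l_given_x = rng.random((len(p_x), n_labels))
    p_l_given_x /= p_l_given_x.sum(axis=1, keepdims=True)

    def marginals(p_l_given_x):
        p_l = p_l_given_x.T @ p_x                       # p(l) = sum_x p(l|x) p(x)
        p_y_given_l = (p_l_given_x * p_x[:, None]).T @ p_y_given_x
        p_y_given_l /= p_l[:, None] + eps               # p(y|l) = (1/p(l)) sum_x ...
        return p_l, p_y_given_l

    for _ in range(n_iter):
        p_l, p_y_given_l = marginals(p_l_given_x)
        # KL[p(y|x) || p(y|l)] for every pair (x, l).
        log_ratio = (np.log(p_y_given_x[:, None, :] + eps)
                     - np.log(p_y_given_l[None, :, :] + eps))
        kl = (p_y_given_x[:, None, :] * log_ratio).sum(axis=2)
        # p(l|x) proportional to p(l) exp(-beta * KL); the row sum is Z(x, beta).
        logits = np.log(p_l + eps)[None, :] - beta * kl
        logits -= logits.max(axis=1, keepdims=True)
        p_l_given_x = np.exp(logits)
        p_l_given_x /= p_l_given_x.sum(axis=1, keepdims=True)

    p_l, p_y_given_l = marginals(p_l_given_x)
    return p_l_given_x, p_l, p_y_given_l

# ASSUMPTION: made-up joint p(x, y); x0, x1 predict y = 0 and x2, x3 predict y = 1.
p_xy = 0.25 * np.array([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]])
p_l_given_x, p_l, p_y_given_l = information_bottleneck(p_xy, n_labels=2, beta=10.0)
labels = p_l_given_x.argmax(axis=1)
print("label assigned to each x:", labels)
```

At large β the assignments become nearly hard and group together the values of x that predict the same observations; at small β all x collapse onto a single label. The experimental-design equations have the same form with the roles of y and ℓ exchanged.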
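Marginal of a Gaussian PDF
$$p(\mathbf{z} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = (2\pi)^{-d/2} \, |\boldsymbol{\Sigma}|^{-1/2} \exp\left( -\tfrac{1}{2} (\mathbf{z} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{z} - \boldsymbol{\mu}) \right)$$
$$\mathbf{z} = \begin{pmatrix} x \\ y \end{pmatrix}, \quad \boldsymbol{\mu} = \begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix}, \quad \boldsymbol{\Sigma} = \begin{pmatrix} \sigma_{xx}^2 & \sigma_{xy}^2 \\ \sigma_{xy}^2 & \sigma_{yy}^2 \end{pmatrix}, \quad |\boldsymbol{\Sigma}| = \sigma_{xx}^2 \sigma_{yy}^2 - \sigma_{xy}^4$$
$$\boldsymbol{\Lambda} = \boldsymbol{\Sigma}^{-1} = \begin{pmatrix} \lambda_{xx}^2 & \lambda_{xy}^2 \\ \lambda_{xy}^2 & \lambda_{yy}^2 \end{pmatrix} = \frac{1}{\sigma_{xx}^2 \sigma_{yy}^2 - \sigma_{xy}^4} \begin{pmatrix} \sigma_{yy}^2 & -\sigma_{xy}^2 \\ -\sigma_{xy}^2 & \sigma_{xx}^2 \end{pmatrix}$$
Completing the square in the Gaussian:
$$\left( (y - \mu_y) + B (x - \mu_x) \right)^2 = (y - \mu_y)^2 + 2B (x - \mu_x)(y - \mu_y) + B^2 (x - \mu_x)^2, \qquad B = -\frac{\sigma_{xy}^2}{\sigma_{xx}^2}$$
$$p(x, y) = (2\pi)^{-d/2} \, |\boldsymbol{\Sigma}|^{-1/2} \, e^{-\frac{1}{2} q^2}$$
$$q^2 = \lambda_{xx}^2 (x - \mu_x)^2 + \lambda_{yy}^2 (y - \mu_y)^2 + 2 \lambda_{xy}^2 (x - \mu_x)(y - \mu_y)$$
$$= \frac{\sigma_{yy}^2 (x - \mu_x)^2 + \sigma_{xx}^2 (y - \mu_y)^2 - 2 \sigma_{xy}^2 (x - \mu_x)(y - \mu_y)}{\sigma_{xx}^2 \sigma_{yy}^2 - \sigma_{xy}^4}$$
$$= \frac{\left( \sigma_{yy}^2 - \sigma_{xx}^2 B^2 \right)(x - \mu_x)^2 + \sigma_{xx}^2 \left[ (y - \mu_y)^2 + 2B (x - \mu_x)(y - \mu_y) + B^2 (x - \mu_x)^2 \right]}{\sigma_{xx}^2 \sigma_{yy}^2 - \sigma_{xy}^4}$$
$$= \frac{\left( \sigma_{yy}^2 - \sigma_{xy}^4 / \sigma_{xx}^2 \right)(x - \mu_x)^2 + \sigma_{xx}^2 \left( y - \tilde{\mu}_y(x) \right)^2}{\sigma_{xx}^2 \sigma_{yy}^2 - \sigma_{xy}^4} = \frac{(x - \mu_x)^2}{\sigma_{xx}^2} + \frac{\left( y - \tilde{\mu}_y(x) \right)^2}{\tilde{\sigma}_{yy}^2}$$
where
$$\tilde{\mu}_y(x) = \mu_y + \frac{\sigma_{xy}^2}{\sigma_{xx}^2} (x - \mu_x) \quad \text{and} \quad \tilde{\sigma}_{yy}^2 = \sigma_{xx}^{-2} \left( \sigma_{xx}^2 \sigma_{yy}^2 - \sigma_{xy}^4 \right).$$
Since $|\boldsymbol{\Sigma}| = \sigma_{xx}^2 \, \tilde{\sigma}_{yy}^2$, the joint PDF factorizes:
$$p(x, y) = (2\pi \tilde{\sigma}_{yy}^2)^{-1/2} \exp\left( -\frac{\left( y - \tilde{\mu}_y(x) \right)^2}{2 \tilde{\sigma}_{yy}^2} \right) \cdot (2\pi \sigma_{xx}^2)^{-1/2} \exp\left( -\frac{(x - \mu_x)^2}{2 \sigma_{xx}^2} \right)$$
Marginals:
$$p(x \mid \mu_x, \sigma_{xx}^2) = (2\pi \sigma_{xx}^2)^{-1/2} \exp\left( -\frac{(x - \mu_x)^2}{2 \sigma_{xx}^2} \right)$$
$$p(y \mid \mu_y, \sigma_{yy}^2) = (2\pi \sigma_{yy}^2)^{-1/2} \exp\left( -\frac{(y - \mu_y)^2}{2 \sigma_{yy}^2} \right)$$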
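The factorization of the bivariate Gaussian into a conditional times a marginal can be checked numerically. The sketch below uses a made-up mean and covariance, and the standard conditional mean μ_y + (σ_xy²/σ_xx²)(x − μ_x) with conditional variance σ_yy² − σ_xy⁴/σ_xx². In the slide notation σ_xy² denotes the covariance entry itself, so it appears as `s_xy` without a further square.

```python
import numpy as np

# ASSUMPTION: hypothetical mean vector and covariance matrix.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
s_xx, s_xy, s_yy = Sigma[0, 0], Sigma[0, 1], Sigma[1, 1]

def joint_pdf(x, y):
    """Bivariate Gaussian density evaluated with the full inverse covariance."""
    z = np.array([x, y]) - mu
    q2 = z @ np.linalg.inv(Sigma) @ z
    return float(np.exp(-0.5 * q2) / (2 * np.pi * np.sqrt(np.linalg.det(Sigma))))

def normal_pdf(t, mean, var):
    """One-dimensional Gaussian density."""
    return float(np.exp(-0.5 * (t - mean) ** 2 / var) / np.sqrt(2 * np.pi * var))

def cond_mean(x):
    """Mean of y given x: a straight line in x (linear regression)."""
    return mu[1] + s_xy / s_xx * (x - mu[0])

cond_var = s_yy - s_xy**2 / s_xx   # variance of y given x, independent of x

def factorized_pdf(x, y):
    """p(y | x) * p(x), obtained by completing the square."""
    return normal_pdf(y, cond_mean(x), cond_var) * normal_pdf(x, mu[0], s_xx)

for x, y in [(0.0, 0.0), (1.5, -2.5), (-1.0, 1.0)]:
    print(f"({x:5.1f}, {y:5.1f})  joint = {joint_pdf(x, y):.6e}  "
          f"factorized = {factorized_pdf(x, y):.6e}")
```

The two columns agree to rounding error, which confirms both the conditional mean and the reduced variance.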
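Mutual information: Gaussians
Bivariate Gaussian PDF:
$$p(x, y) = p(y \mid x) \, p(x) = (2\pi \tilde{\sigma}_{yy}^2)^{-1/2} \exp\left( -\frac{\left( y - \tilde{\mu}_y(x) \right)^2}{2 \tilde{\sigma}_{yy}^2} \right) \cdot (2\pi \sigma_{xx}^2)^{-1/2} \exp\left( -\frac{(x - \mu_x)^2}{2 \sigma_{xx}^2} \right)$$
Marginals:
$$p(x \mid \mu_x, \sigma_{xx}^2) = (2\pi \sigma_{xx}^2)^{-1/2} \exp\left( -\frac{(x - \mu_x)^2}{2 \sigma_{xx}^2} \right)$$
$$p(y \mid \mu_y, \sigma_{yy}^2) = (2\pi \sigma_{yy}^2)^{-1/2} \exp\left( -\frac{(y - \mu_y)^2}{2 \sigma_{yy}^2} \right)$$
Conditional moments:
$$\tilde{\mu}_y(x) = (2\pi \tilde{\sigma}_{yy}^2)^{-1/2} \int_{-\infty}^{\infty} dy \; y \, \exp\left( -\frac{\left( y - \tilde{\mu}_y(x) \right)^2}{2 \tilde{\sigma}_{yy}^2} \right)$$
$$\tilde{\sigma}_{yy}^2 = (2\pi \tilde{\sigma}_{yy}^2)^{-1/2} \int_{-\infty}^{\infty} dy \; \left( y - \tilde{\mu}_y(x) \right)^2 \exp\left( -\frac{\left( y - \tilde{\mu}_y(x) \right)^2}{2 \tilde{\sigma}_{yy}^2} \right)$$
Mean-shift (linear regression):
$$\tilde{\mu}_y(x) = \mu_y + \frac{\sigma_{xy}^2}{\sigma_{xx}^2} (x - \mu_x)$$
Variance reduction:
$$\tilde{\sigma}_{yy}^2 = \sigma_{yy}^2 - \sigma_{xx}^{-2} \sigma_{xy}^4 \geq 0 \quad \text{(Note: no x-dependence.)}$$
Mutual information:
$$I(X, Y) = \int_{-\infty}^{\infty} dx \; p(x) \int_{-\infty}^{\infty} dy \; p(y \mid x) \ln \frac{p(y \mid x)}{p(y)}$$
$$\ln \frac{p(y \mid x)}{p(y)} = \ln p(y \mid x) - \ln p(y)$$
$$= \ln\left[ (2\pi \tilde{\sigma}_{yy}^2)^{-1/2} \exp\left( -\frac{\left( y - \tilde{\mu}_y(x) \right)^2}{2 \tilde{\sigma}_{yy}^2} \right) \right] - \ln\left[ (2\pi \sigma_{yy}^2)^{-1/2} \exp\left( -\frac{(y - \mu_y)^2}{2 \sigma_{yy}^2} \right) \right]$$
$$= \frac{(y - \mu_y)^2}{2 \sigma_{yy}^2} - \frac{\left( y - \tilde{\mu}_y(x) \right)^2}{2 \tilde{\sigma}_{yy}^2} + \tfrac{1}{2} \ln\left( \sigma_{yy}^2 \, \tilde{\sigma}_{yy}^{-2} \right)$$
Rewriting the first quadratic around the conditional mean:
$$(y - \mu_y)^2 = (y - \tilde{\mu}_y)^2 + 2 (\tilde{\mu}_y - \mu_y) \, y + (\mu_y^2 - \tilde{\mu}_y^2)$$
$$(2\pi \tilde{\sigma}_{yy}^2)^{-1/2} \int_{-\infty}^{\infty} dy \; (y - \mu_y)^2 \exp\left( -\frac{\left( y - \tilde{\mu}_y(x) \right)^2}{2 \tilde{\sigma}_{yy}^2} \right) = \tilde{\sigma}_{yy}^2 + 2 \tilde{\mu}_y (\tilde{\mu}_y - \mu_y) + (\mu_y^2 - \tilde{\mu}_y^2)$$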
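Mutual information: Gaussians
$$\int_{-\infty}^{\infty} dy \; p(y \mid x) \ln \frac{p(y \mid x)}{p(y)} = (2\pi \tilde{\sigma}_{yy}^2)^{-1/2} \int_{-\infty}^{\infty} dy \; \exp\left( -\frac{\left( y - \tilde{\mu}_y(x) \right)^2}{2 \tilde{\sigma}_{yy}^2} \right) \left[ \frac{(y - \mu_y)^2}{2 \sigma_{yy}^2} - \frac{\left( y - \tilde{\mu}_y(x) \right)^2}{2 \tilde{\sigma}_{yy}^2} + \tfrac{1}{2} \ln\left( \sigma_{yy}^2 \, \tilde{\sigma}_{yy}^{-2} \right) \right]$$
$$= \tfrac{1}{2} \ln\left( \sigma_{yy}^2 \, \tilde{\sigma}_{yy}^{-2} \right) - \tfrac{1}{2} + \tfrac{1}{2} \sigma_{yy}^{-2} \left[ \tilde{\sigma}_{yy}^2 + 2 \tilde{\mu}_y (\tilde{\mu}_y - \mu_y) + (\mu_y^2 - \tilde{\mu}_y^2) \right]$$
$$= \tfrac{1}{2} \ln\left( \sigma_{yy}^2 \, \tilde{\sigma}_{yy}^{-2} \right) - \tfrac{1}{2} + \tfrac{1}{2} \sigma_{yy}^{-2} \tilde{\sigma}_{yy}^2 + \tfrac{1}{2} \sigma_{yy}^{-2} \mu_y^2 - \tfrac{1}{2} \sigma_{yy}^{-2} \tilde{\mu}_y^2 + \sigma_{yy}^{-2} \tilde{\mu}_y (\tilde{\mu}_y - \mu_y)$$
$$= \tfrac{1}{2} \sigma_{yy}^{-2} \left( \sigma_{yy}^2 \ln\left( \sigma_{yy}^2 \, \tilde{\sigma}_{yy}^{-2} \right) + \tilde{\sigma}_{yy}^2 - \sigma_{yy}^2 + (\tilde{\mu}_y - \mu_y)^2 \right)$$
$$= \tfrac{1}{2} \sigma_{yy}^{-2} \left( \sigma_{yy}^2 \ln\left( \sigma_{yy}^2 \, \tilde{\sigma}_{yy}^{-2} \right) + \tilde{\sigma}_{yy}^2 - \sigma_{yy}^2 + \sigma_{xx}^{-4} \sigma_{xy}^4 (x - \mu_x)^2 \right)$$
Averaging over x:
$$I(X, Y) = \tfrac{1}{2} \sigma_{yy}^{-2} \int_{-\infty}^{\infty} dx \; p(x) \left( \sigma_{yy}^2 \ln\left( \sigma_{yy}^2 \, \tilde{\sigma}_{yy}^{-2} \right) + \tilde{\sigma}_{yy}^2 - \sigma_{yy}^2 + \sigma_{xx}^{-4} \sigma_{xy}^4 (x - \mu_x)^2 \right)$$
$$= \tfrac{1}{2} \sigma_{yy}^{-2} \left( \sigma_{yy}^2 \ln\left( \sigma_{yy}^2 \, \tilde{\sigma}_{yy}^{-2} \right) + \tilde{\sigma}_{yy}^2 - \sigma_{yy}^2 + \sigma_{xx}^{-2} \sigma_{xy}^4 \right)$$
$$= \tfrac{1}{2} \left( \ln\left( \sigma_{yy}^2 \, \tilde{\sigma}_{yy}^{-2} \right) - 1 + \sigma_{yy}^{-2} \tilde{\sigma}_{yy}^2 + \sigma_{xx}^{-2} \sigma_{yy}^{-2} \sigma_{xy}^4 \right)$$
The non-logarithmic terms cancel:
$$\sigma_{yy}^{-2} \tilde{\sigma}_{yy}^2 + \sigma_{xx}^{-2} \sigma_{yy}^{-2} \sigma_{xy}^4 - 1 = \sigma_{yy}^{-2} \left( \sigma_{yy}^2 - \sigma_{xx}^{-2} \sigma_{xy}^4 \right) + \sigma_{xx}^{-2} \sigma_{yy}^{-2} \sigma_{xy}^4 - 1 = 0$$
so that
$$I(X, Y) = \tfrac{1}{2} \ln \frac{\sigma_{yy}^2}{\tilde{\sigma}_{yy}^2} = \tfrac{1}{2} \ln \frac{\sigma_{xx}^2 \sigma_{yy}^2}{|\boldsymbol{\Sigma}|}$$
$$I(X, Y) = H(X) + H(Y) - H(X, Y) = \ln \frac{\sigma_{xx} \sigma_{yy}}{|\boldsymbol{\Sigma}|^{1/2}} = \ln \frac{A_0}{A}$$
Proposed generalization:
$$I(X, Y) = \ln \frac{|\boldsymbol{\Sigma}_{XX}|^{1/2} \, |\boldsymbol{\Sigma}_{YY}|^{1/2}}{|\boldsymbol{\Sigma}|^{1/2}}$$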
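The determinant form of the Gaussian mutual information is easy to evaluate and to cross-check against the entropy decomposition I = H(X) + H(Y) − H(X, Y). The sketch below uses a made-up covariance matrix; the function names and the `dim_x` argument (the number of leading coordinates that belong to X) are hypothetical.

```python
import numpy as np

def gaussian_entropy(Sigma):
    """Differential entropy in nats: H = 0.5 * ln((2*pi*e)^d * |Sigma|)."""
    Sigma = np.atleast_2d(np.asarray(Sigma, dtype=float))
    d = Sigma.shape[0]
    _, logdet = np.linalg.slogdet(Sigma)
    return 0.5 * (d * np.log(2 * np.pi * np.e) + logdet)

def gaussian_mutual_information(Sigma, dim_x):
    """I(X, Y) = 0.5 * ln(|Sigma_XX| |Sigma_YY| / |Sigma|) in nats."""
    Sigma = np.asarray(Sigma, dtype=float)
    _, logdet_xx = np.linalg.slogdet(Sigma[:dim_x, :dim_x])
    _, logdet_yy = np.linalg.slogdet(Sigma[dim_x:, dim_x:])
    _, logdet = np.linalg.slogdet(Sigma)
    return 0.5 * (logdet_xx + logdet_yy - logdet)

# ASSUMPTION: hypothetical bivariate covariance matrix.
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

mi = gaussian_mutual_information(Sigma, dim_x=1)
mi_from_entropies = (gaussian_entropy(Sigma[:1, :1])
                     + gaussian_entropy(Sigma[1:, 1:])
                     - gaussian_entropy(Sigma))
print(f"I(X, Y) = {mi:.6f} nats (from entropies: {mi_from_entropies:.6f})")
```

The factors of 2πe cancel between the three entropies, which is why only the determinants survive. For a correlation coefficient ρ the bivariate result reduces to −½ ln(1 − ρ²).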
Class Schedules
Advanced Statistics for Physics
Statistical Mechanics of Complex Systems
