Sunteți pe pagina 1din 61

APPLIED MACHINE

LEARNING IN PYTHON

Applied Machine Learning

Introduction
APPLIED MACHINE
LEARNING IN PYTHON

What is Machine Learning (ML)?


• The study of computer programs (algorithms)
that can learn by example
• ML algorithms can generalize from existing
examples of a task
– e.g. after seeing a training set of labeled images, an
image classifier can figure out how to apply labels
accurately to new, previously unseen images
APPLIED MACHINE
LEARNING IN PYTHON

Speech Recognition
However, not all problems lend themselves to being solved
effectively by writing a handcrafted program or set of rules.

Speech
"How do I get
recognition
to Ann
rules
Arbor?
APPLIED MACHINE
LEARNING IN PYTHON

Speech Recognition
• Given how a subtle and complex human speech is with the huge variety of different
pronunciations, vocabulary, accents, and so forth.

• Writing a set of program rules by hand that could recognize portions of an audio
signal and decide which words were in the signal and so forth would be a
gargantuan task.

• And even then, it would still likely be inflexible and not very robust at recognizing
different types of speech.

• Moreover, if we needed to customize the system so it could recognize new words


or
other features that we hadn't encoded in our existing rules, we'd have to write
a whole new set of rules, which would be a prohibitively difficult task.
APPLIED MACHINE
LEARNING IN PYTHON

Machine Learning models can learn by example


Audio signal Output text
• Algorithms learn rules from
labelled examples How do I
• A set of labelled examples get to Ann
used for learning is called Arbor?
training data.
Hello!
• The learned rules should
also be able to generalize Please order
to correctly recognize or me a pizza.
predict new examples not in
the training set.
APPLIED MACHINE
LEARNING IN PYTHON

Machine Learning models learn from experience


OK
• Labeled examples
(Email spam detection)
Spam

• User feedback
(Clicks on a search page)

• Surrounding environment
(self-driving cars)
APPLIED MACHINE
LEARNING IN PYTHON

Machine Learning brings together


statistics, computer science, and more..
• Statistical methods
– Infer conclusions from data
– Estimate reliability of predictions
• Computer science
– Large-scale computing architectures
– Algorithms for capturing, manipulating, indexing, combining, retrieving and
performing predictions on data
– Software pipelines that manage the complexity of multiple subtasks
• Economics, biology, psychology
– How can an individual or system efficiently improve their performance in a
given environment?
– What is learning and how can it be optimized?
APPLIED MACHINE
LEARNING IN PYTHON

Machine Learning for fraud detection


and credit scoring
Data instance/example Features ML algorithm User feedback

$$$ • Time
• Location Fraud rules Notification
Credit card
transaction • Amount

User
history
APPLIED MACHINE
LEARNING IN PYTHON

Web search: query spell-checking, result ranking,


content classification and selection, advertising placement
APPLIED MACHINE
LEARNING IN PYTHON

Machine Learning for Speech


Recognition
Acoustic Language
model model

"How do I
Feature
Preprocessing Decoder get to Ann
extraction
Arbor?"

Lexicon
APPLIED MACHINE
LEARNING IN PYTHON

Machine Learning algorithms are at the heart of


the information economy
• Finance: fraud detection, credit scoring
• Web search
• Speech recognition
• eCommerce: Product recommendations
• Email spam filtering
• Health applications: drug design and discovery
• Education: Automated essay scoring
APPLIED MACHINE
LEARNING IN PYTHON

What is Applied Machine


Learning?
• Understand basic ML concepts and workflow
• How to properly apply 'black-box' machine learning
components and features
• Learn how to apply machine learning algorithms in
Python using the scikit-learn package

• What is not covered in this course:


– Underlying theory of statistical machine learning
– Lower-level details of how particular ML components work
– In-depth material on more advanced concepts like deep learning
APPLIED MACHINE
LEARNING IN PYTHON

Recommended text for this course

Introduction to Machine Learning with Python


A Guide for Data Scientists
By Andreas C. Müller and Sarah Guido

O'Reilly Media
APPLIED MACHINE
LEARNING IN PYTHON

Applied Machine Learning

Key Concepts in Machine Learning


APPLIED MACHINE
LEARNING IN PYTHON

Key types of Machine Learning problems

Supervised machine learning: Learn to predict target


values from labelled data.
• Classification (target values are discrete
classes)
• Regression (target values are continuous values)
APPLIED MACHINE
LEARNING IN PYTHON

Supervised Learning (classification


Training set
XX
example)
Y
Future sample

Sample Target Value (Label)


Classifier
𝑥1 𝑦 f:X→Y
Apple 1

𝑥2 𝑦 f f
Lemon 2

𝑦 At training time, the


𝑥3 Label: Orange
Apple 3
classifier uses labelled
examples to learn rules for After training, at prediction
recognizing each fruit type.
𝑥4 Orange
𝑦4
time, the trained model is
used to predict the fruit type
for new instances using the
learned rules.
APPLIED MACHINE
LEARNING IN PYTHON

Examples of explicit and implicit label sources


Implicit labels (which pages are more releevnt to
Explicit labels
this query)

"cat"

Task "dog" Human judges/


requester annotators

"cat"
"house"

"cat"
Clicking and reading the "Mackinac Island" result can be
an implicit label for the search engine to learn that
"dog" "Mackinac Island" is especially relevant for the query
[vacations in michigan] for that specific user.

Crowdsourcing platform: Mechanical Turk or Crowd Flower


APPLIED MACHINE
LEARNING IN PYTHON

Key types of Machine Learning problems

Supervised machine learning: Learn to predict target


values from labelled data.
•Classification (target values are discrete classes)
•Regression (target values are continuous values)
Unsupervised machine learning: Find structure in
unlabeled data
• Find groups of similar instances in the data (clustering)
• Finding unusual patterns (outlier detection)
APPLIED MACHINE
LEARNING IN PYTHON

Unsupervised learning: finding useful structure or


knowledge in data when no labels are available
Careful
• Finding clusters of similar
researchers
users (clustering)

Time on site
Power users
• Detecting abnormal server
access patterns
(unsupervised outlier
detection) Quick browsers
Server accesses

Pages accessed
Categorize users (to offer them
For example, to
detect a cyber discounts) based on their usage of an
attack e-commerce site.

Time
APPLIED MACHINE
LEARNING IN PYTHON

A Basic Machine Learning


Workflow

Representation Evaluation Optimization

Choose:
Choose: Choose: • How to search for the
• A feature representation • What criterion settings/parameters that
• Type of classifier to use distinguishes good vs. bad give the best classifier for
classifiers? this evaluation criterion
e.g. image pixels, with
k-nearest neighbor classifier e.g. % correct predictions on test set
e.g. try a range of values for "k" parameter
in k-nearest neighbor classifier
APPLIED MACHINE
LEARNING IN PYTHON

Feature Representations
Feature Count Feature representation
to
To: Chris Brooks From: Daniel Romero 1
Subject: Next course offering Hi chris 2
Email Daniel,
Could you please send the outline for
brooks 1 A list of words with
the next course offering? Thanks! --
from 1 their frequency counts
daniel 2
Chris romero 1
the ...
2

Picture A matrix of color


values (pixels)

Feature Value

DorsalFin Yes
Sea Creatures MainColor
Orange A set of attribute values
Stripes Yes
StripeColor1 White
StripeColor2 Black
Length 4.3 cm
APPLIED MACHINE
LEARNING IN PYTHON

Representing a piece of fruit as an array of features (plus label


information)
width 1. Feature representation
Label information
(available in training data only) Feature representation
height

2. Learning model Classifier


mass
color
Each tow of a table represents single
data instance, and columns Predicted class
represent features of the object (apple)
APPLIED MACHINE
LEARNING IN PYTHON

Represent / Train / Evaluate / Refine


Cycle Representation:
Extract and
select object
features
APPLIED MACHINE
LEARNING IN PYTHON

Represent / Train / Evaluate / Refine


Cycle Representation:
Train models:
Extract and
Fit the estimator
select object
to the data
features
APPLIED MACHINE
LEARNING IN PYTHON

Represent / Train / Evaluate / Refine


Cycle Representation:
Train models:
Extract and
Fit the estimator
select object
to the data
features

Evaluation
APPLIED MACHINE
LEARNING IN PYTHON

Represent / Train / Evaluate / Refine


Cycle Representation:
Train models:
Extract and
Fit the estimator
select object
to the data
features

Feature and
model Evaluation
refinement
APPLIED MACHINE
LEARNING IN PYTHON

Applied Machine Learning

Python Tools for Machine


Learning
APPLIED MACHINE
LEARNING IN PYTHON

scikit-learn: Python Machine Learning


Library
• scikit-learn Homepage
http://scikit-learn.org/
• scikit-learn User Guide
http://scikit-learn.org/stable/user_guide.html
• scikit-learn API reference
http://scikit-learn.org/stable/modules/classes.html
• In Python, we typically import classes and functions we need
like this:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
APPLIED MACHINE
LEARNING IN PYTHON

SciPy Library: Scientific Computing


Tools
http://www.scipy.org/

• Provides a variety of useful scientific computing


tools, including statistical distributions, optimization
of functions, linear algebra, and a variety of
specialized mathematical functions.
• With scikit-learn, it provides support for sparse
matrices, a way to store large tables that consist
mostly of zeros.
• Example import: import scipy as sp
APPLIED MACHINE
LEARNING IN PYTHON

NumPy: Scientific Computing Library


http://www.numpy.org/

• Provides fundamental data structures used by


scikit-learn, particularly multi-dimensional arrays.
• Typically, data that is input to scikit-learn will be
in the form of a NumPy array.
• Example import: import numpy as np
APPLIED MACHINE
LEARNING IN PYTHON

Pandas: Data Manipulation and


Analys s
http://pandas.pydata.org/
i
• Provides key data structures like DataFrame
• Also, support for reading/writing data in different
formats
• Example import: import pandas as pd
APPLIED MACHINE
LEARNING IN PYTHON

matplotlib and other plotting libraries


http://matplotlib.org
/

• We typically use matplotlib's pyplot module:


import matplotlib.pyplot as plt
• We also sometimes use the seaborn visualization
library (http://seaborn.pydata.org/)
import seaborn as sn
• And sometimes the graphviz plotting library:
APPLIED MACHINE
LEARNING IN PYTHON

Versions of main libraries used in this


course
Library name Minimum version
It’s okay if your versions of these
scikit-learn 0.18.1 don’t match ours exactly, as long
scipy 0.19.0 as the version of scikit-learn and
numpy 1.12.1 other libraries you’re using is the
pandas 0.19.2 same or greater than listed here.
matplotlib 2.0.1
seaborn 0.7.1
graphviz 0.7
APPLIED MACHINE
LEARNING IN PYTHON

Applied Machine Learning

An Example Machine Learning Problem


APPLIED MACHINE
LEARNING IN PYTHON

The Fruit Dataset


width

lemon
height

apple
orange

mass mandarin orange


color_score
fruit_data_with_colors.tx
t

Credit: Original version of the fruit dataset created by Dr. Iain Murray, Univ. of Edinburgh
APPLIED MACHINE
LEARNING IN PYTHON

The input data as a table

Each row corresponds to a


single data instance (sample)

The fruit_label column contains


the label for each data instance
(sample)
APPLIED MACHINE
LEARNING IN PYTHON

The scale for the (simplistic) color_score feature


used in the fruit dataset

0.00 0.25 0.50 0.75 1.00


Color category
color_score
Red 0.85 - 1.00
Orange 0.75 - 0.85
Yellow 0.65 - 0.75
Green 0.45 - 0.65
APPLIED MACHINE
LEARNING IN PYTHON

Training and Test


• set that we already we had a classifier ready to go
Assuming for the moment

• how would we know if its predictions were likely to be accurate?

• Well, we could choose a fruit sample, called a test sample, for which we already
had a label.

• So we could feed the features of that piece of fruit into the classifier, and then
compare the label that the classifier predicts with the actual true label of the fruit
type.

• So, here's a very important point though, if we use one of our labeled fruit
examples in the data that we use to train the classifier, we can't also use that
same fruit sample later as a test sample to also evaluate the classifier.
APPLIED MACHINE
LEARNING IN PYTHON

Training and Test


set
• Why is that?

• Well, a key ability that our classifier needs to have is that it needs to work

• Well on any input sample, any new pieces of fruit that we might see in the
future not just on the ones that we have on our training set.
APPLIED MACHINE
LEARNING IN PYTHON

Creating Training and Testing


Sets
X y X_train y_train y_test
X_test


X_train, X_test, y_train,
y_test
= train_test_split(X, y)

Original data set Training set Test set


APPLIED MACHINE
LEARNING IN PYTHON

Applied Machine Learning


Examining the Data
APPLIED MACHINE
LEARNING IN PYTHON

Examining the data, missing or incorrect


• This initial exploration can be especially valuable when you're dealing with
complex objects like text that may be represented by many features that are
extracted using a series of several pre-processing steps.

• For example, you might discover that the data set you got has a single
column with person's name that still needs to be split into two separate first
and last name columns.

• For example, if you're using the name as one of the prediction feature, that
might be important.

• Second, you might notice missing or noisy data.


APPLIED MACHINE
LEARNING IN PYTHON

Examining the data, missing or incorrect


• Or maybe some specific inconsistencies, such as the wrong data type being
used for a column.

• Incorrect or inconsistent units of measurement for a particular column,


particular feature. Or maybe you'll notice that there aren't enough examples
of a particular labeled class.

• You might notice, for example, that some measurements of a person's


weight.

• Let's say you're doing a health application with a patient record for each row.
APPLIED MACHINE
LEARNING IN PYTHON

Examining the data, missing or incorrect


• Some might accidentally have the weight in grams instead of kilograms and
so forth. And that can make obviously a huge difference in how accurate
your results are.

• So inspecting and visualizing the data will help you detect and understand
these potential source of noise or errors.
APPLIED MACHINE
LEARNING IN PYTHON

Some reasons why looking at the data initially is


important
Examples of incorrect or missing feature
• Inspecting feature values may helpvalues
identify what cleaning or preprocessing
still needs to be done once you can see
the range or distribution of values that is
typical for each attribute.

• You might notice missing or noisy data,


or inconsistencies such as the wrong
data type being used for a column,
incorrect units of measurements for a
particular column, or that there aren’t
enough examples of a particular class.

• You may realize that your problem is


actually solvable without machine
learning.
APPLIED MACHINE
LEARNING IN PYTHON

A pairwise feature scatterplot visualizes the data using all possible pairs of features,
with one scatterplot per feature pair, and histograms for each feature along the
diagonal
.

height
apple
mandarin

width
orange

lemon

mass
apple

color_score
Individual scatterplot plotting all fruits by
their height and color_score.
Colors represent different fruit classes.
height width mass color_score
APPLIED MACHINE
LEARNING IN PYTHON
APPLIED MACHINE
LEARNING IN PYTHON
APPLIED MACHINE
LEARNING IN PYTHON
APPLIED MACHINE
LEARNING IN PYTHON

A three-dimensional feature scatterplot

apple

mandarin

lemon orange
APPLIED MACHINE
LEARNING IN PYTHON

A three-dimensional feature scatterplot

apple

mandarin

lemon orange
APPLIED MACHINE
LEARNING IN PYTHON

K-Nearest Neighbors Classification


• It can be used for both, classification and regression

• k-NN classifiers are an example of what's called instance based or


memory based supervised learning.

• What this means is that instance based learning methods work by memorizing
the labeled examples that they see in the training set.

• And then they use those memorized examples to classify new objects later.

• The k in k-NN refers to the number of nearest neighbors the classifier will
retrieve and use in order to make its prediction
APPLIED MACHINE
LEARNING IN PYTHON

The k-Nearest Neighbor (k-NN) Classifier


Algorithm
Given a training set X_train with labels y_train, and
given a new instance x_test to be classified:

1. Find the most similar instances (let's call them X_NN) to x_test
that are in X_train.
2. Get the labels y_NN for the instances in X_NN
3. Predict the label for x_test by combining the labels y_NN
e.g. simple majority vote
APPLIED MACHINE
LEARNING IN PYTHON

A visual explanation of k-NN classifiers


These two (width and height) features together form what is called the feature
space of the classifier.

Fruit dataset
Decision boundaries
with k = 1
APPLIED MACHINE
LEARNING IN PYTHON

A nearest neighbor algorithm


needs four things specified

1. A distance metric
2. How many 'nearest' neighbors to look at?
3. Optional weighting function on the neighbor points
4. Method for aggregating the classes of neighbor points
APPLIED MACHINE
LEARNING IN PYTHON

A nearest neighbor algorithm


needs four things specified
1. A distance metric
Typically Euclidean (Minkowski with p = 2)
2. How many 'nearest' neighbors to look at?
e.g. five
3. Optional weighting function on the neighbor points
Ignored
4. How to aggregate the classes of neighbor points Simple
majority vote
(Class with the most representatives among nearest neighbors)
APPLIED MACHINE
LEARNING IN PYTHON

K-nearest neighbors (k=1) for fruit dataset


APPLIED MACHINE
LEARNING IN PYTHON

K-nearest neighbors (k=5) for fruit dataset


APPLIED MACHINE
LEARNING IN PYTHON

K-nearest neighbors (k=10) for fruit dataset


APPLIED MACHINE
LEARNING IN PYTHON

K=1 K=5 K=10


APPLIED MACHINE
LEARNING IN PYTHON

How sensitive is k-NN classifier accuracy to the choice


of 'k' parameter?

Fruit dataset
with 75%/25%
train-test split

S-ar putea să vă placă și