
Business Analytics

Chapter - 1


Logistic Regression
Linear Regression

Recall:
Linear regression is the process of finding a best-fitting straight line through a set of points, with the objective of using the equation of that line as a model for prediction.
The key assumption is that both the predictor and target variables are continuous, as seen in the chart below. Intuitively, one can state that when X increases, Y increases along the slope of the line.
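As a quick illustration, a best-fitting line can be obtained in R with the lm function (the data here are made up for the sketch):

```r
# Hypothetical continuous data for X and Y
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

fit <- lm(y ~ x)                  # least-squares straight line
coef(fit)                         # intercept and slope of the fitted line
predict(fit, data.frame(x = 6))   # use the line as a model for prediction
```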
Logistic Regression - Introduction

Logistic regression is a form of regression where the dependent variable (outcome variable) is dichotomous (binary) and the independent variables (influencing factors) can be either continuous or categorical. It is used for prediction when the outcome variable is categorical.

Logistic regression can be binomial (binary) or multinomial.

In binary logistic regression, the outcome can have only two possible types of values, e.g. "Yes" or "No", "Success" or "Failure".
The outcome is coded as 0 and 1 in binary logistic regression.


Logistic Regression - Introduction

These models are extensively used in the banking, finance and telecommunication sectors to estimate the risks involved.
Multinomial logistic regression refers to cases where the outcome can have three or more possible types of values, e.g. "good" vs. "very good" vs. "best".
Examples
If you would like to predict who will win the next T20 world cup, based on players' strength and other details.
Imagine you want to predict the outcome of an election. Here
your outcome variable is binary i.e. Win/Lose. Factors influencing
this outcome could be the amount of money spent on the
campaign, amount of time spent campaigning, previous
election history, etc.
You might be interested in factors that influence (or explain)
whether or not a person in the U.S. owns a U.S.-made or foreign-
made (non-U.S.) car. It would be natural to code owning a U.S.
car as a 1 and owning a foreign car as a 0. So the Y-variable in the
logistic regression is whether or not a person owns a U.S. car
coded as a 1 if he or she does and a 0 otherwise.
Logistic Curve
The logistic model: The logistic curve, illustrated below, is better for modeling binary dependent variables coded 0 or 1 because it comes closer to hugging the y = 0 and y = 1 points on the y axis.
Moreover, the logistic function is bounded between 0 and 1.
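As a small sketch, the logistic (sigmoid) function can be evaluated directly in R; note that every output lies strictly between 0 and 1:

```r
# Logistic (sigmoid) function: p = 1 / (1 + exp(-x))
sigmoid <- function(x) 1 / (1 + exp(-x))

x <- seq(-6, 6, by = 0.5)
p <- sigmoid(x)
range(p)                                          # all values lie in (0, 1)
plot(x, p, type = "l", main = "Logistic curve")   # the S-shaped curve
```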
Returning to the Sales vs. Ad Spend example: what happens if the target variable is not continuous?
When the target variable (Y) is discrete, the straight line is no longer a good fit, as seen in this chart.
Although intuitively we can still state that when X (say, advertising spend) increases, Y (say, response or no response to a mailing campaign) also increases, there is no gradual transition: the Y value abruptly jumps from one binary outcome to the other.
Thus the straight line is a poor fit for this data.
The Straight Line is a Poor Fit for Binary Outcome
S- Shaped Curve is a Better Fit

On the other hand, take a look at the S-shaped curve below.
This is certainly a better fit for the data shown.
If we then know the equation of this "sigmoid" curve, we can use it as effectively as we used the straight line in the case of linear regression.
Logistic Regression
Logistic regression is thus the process of obtaining an appropriate
sigmoid curve to fit the data when the target variable is discrete.
In statistics, logistic regression or logit regression is a type of regression
analysis used for predicting the outcome of a categorical dependent
variable.
Key facts to keep in mind
Logistic Regression is the equivalent of linear regression to use when
the target (or dependent) variable is discrete i.e. not continuous.
The predictor variables can be either continuous or categorical.
R Data Analysis Examples

Logistic regression, also called a logit model, is used to model dichotomous outcome variables.
In the logit model, the log odds of the outcome are modeled as a linear combination of the predictor variables.
We will require the following packages:
- caret
- aod
- ggplot2
install.packages("packagename")
R Data Analysis Examples
Example: German Credit data set
The data set consists of customers' information about bank account, age, sex, credit history and present credit situation, credit purpose, property and installment, employment, and residence as its 20 independent variables.

The target variable is Class, which is binary with levels Bad and Good:
Bad: customer with high credit risk
Good: customer with low credit risk

The purpose of analyzing this data set is to predict a customer's credit risk based on the other information about the customer.
R Data Analysis Examples

The German Credit dataset is an inbuilt R data set and can be loaded by loading the caret library as follows.
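The original slide's screenshot is not shown; a minimal sketch of loading and inspecting the data, assuming the caret package is installed, might look like:

```r
library(caret)               # the GermanCredit data set ships with caret
data(GermanCredit)

dim(GermanCredit)            # number of observations and variables
table(GermanCredit$Class)    # target variable with levels Bad / Good
summary(GermanCredit)        # descriptive statistics for all variables
```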

We can get the descriptive statistics for all the variables by giving
following command:
We need to convert the target variable Class from Bad/Good to 0/1 to
apply logistic regression. This can be done by giving the following
command.
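One way to recode the target is shown below (a sketch; the slide's exact command is not shown in the source):

```r
# Recode the factor Class (Bad/Good) as 0/1: Good -> 1, Bad -> 0
GermanCredit$Class <- ifelse(GermanCredit$Class == "Good", 1, 0)
table(GermanCredit$Class)   # the target is now coded 0 and 1
```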
Partitioning the Data Set
One issue that arises when fitting the model is to check how
well the newly created model behaves when applied to new
data.
To address this issue, the data set can be divided into two partitions: a training partition used to build the model and a test partition used to validate how well the model performs.

Now we will partition our German Credit data set into training and
testing sets using the createDataPartition function.
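A sketch of the split, assuming a 70/30 partition stratified on Class (the seed value is illustrative; the slides do not state one):

```r
library(caret)
set.seed(123)   # for reproducibility (assumed; not stated in the slides)

# 70% training / 30% testing split, stratified on the Class variable
idx   <- createDataPartition(GermanCredit$Class, p = 0.7, list = FALSE)
train <- GermanCredit[idx, ]
test  <- GermanCredit[-idx, ]

nrow(train)   # about 700 observations
nrow(test)    # about 300 observations
```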
Partitioning the Data Set

The training data set contains 700 observations and the testing data set contains 300 observations.
We will now take the training data set and use logistic regression to model Class as a function of seven predictor variables (Age, Amount, ForeignWorker, Property.RealEstate, Housing.Own, CreditHistory.Critical and Purpose.NewCar) using the glm function.
Model Building
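The model-building screenshot is not shown in the source; the glm call described above would look roughly like this (variable names `train` and `model` are illustrative):

```r
# Logistic regression of Class on the seven predictors, fitted on the
# training partition
model <- glm(Class ~ Age + Amount + ForeignWorker + Property.RealEstate +
               Housing.Own + CreditHistory.Critical + Purpose.NewCar,
             data = train, family = binomial(link = "logit"))

summary(model)   # coefficients, null deviance, residual deviance
```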
Interpreting Output
All the variables in our model are significant.
The null deviance indicates the deviance for a model without
variables whereas the residual deviance is for our model.
The further apart the null deviance and the residual deviance are, the more the predictors improve the model.
Here the null deviance is 839.40 and the residual deviance is 769.93, which indicates that there is not much of a difference between them.
Estimates from logistic regression characterize the relationship between the predictor and response variables on a log-odds scale.
The estimate for the variable Age is 0.01839. This implies that for every 1-unit increase in Age, the log odds of the customer having good credit increase by 0.01839; equivalently, the odds are multiplied by exp(0.01839) ≈ 1.0186.
The other coefficients are interpreted similarly.
Prediction

There is 28.43% misclassification in our model.
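The prediction step might be sketched as follows, assuming `model` and `test` are the fitted glm and the test partition from the previous steps, and using the conventional 0.5 cutoff (the slides do not state their cutoff):

```r
# Predicted probabilities on the test set, then a 0/1 classification
prob <- predict(model, newdata = test, type = "response")
pred <- ifelse(prob > 0.5, 1, 0)

table(Predicted = pred, Actual = test$Class)   # confusion matrix
mean(pred != test$Class)                       # misclassification rate
```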


Goodness of Fit: Hosmer Lemeshow Test
Hosmer Lemeshow test is used for testing overall goodness of fit.
The statistic is computed on the data after the observations have been grouped by similar predicted probabilities.
It examines whether the observed proportion of events are similar
to predicted probabilities of occurrence in the subgroups of the
data set using a Pearson Chi-Square test.
The hypotheses are:
H0: the current model fits well
vs.
H1: the current model does not fit well.
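One common implementation is hoslem.test from the ResourceSelection package (an assumption; the slides do not name the function they used):

```r
library(ResourceSelection)

# Hosmer-Lemeshow test: observed vs. fitted probabilities in g = 10 groups
hoslem.test(model$y, fitted(model), g = 10)
```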
Goodness of Fit: Hosmer Lemeshow Test

Since the p-value is greater than 0.05, we do not reject H0. i.e. the
current model is a good fit.
Wald Test
A Wald test is used to evaluate the statistical significance of each
coefficient in the model and is calculated by taking the ratio of the
square of the regression coefficient to the square of the standard
error of the coefficient.
We test whether the coefficient of the independent variable in the
model is significantly different from zero.
If the test fails to reject the null hypothesis, this suggests that
removing the variable from the model will not substantially harm the
fit of that model.
To apply the Wald test, the regTermTest function from the survey library will be used.
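A sketch of the call, assuming `model` is the glm fit from the Model Building step:

```r
library(survey)

# Wald test for the ForeignWorker term in the fitted model
regTermTest(model, "ForeignWorker")
```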
Wald Test

Since the p-value is less than 0.05, we reject H0 i.e. removing the
variable ForeignWorker from the model will harm the fit of the model.
McFadden's R2
Unlike linear regression, logistic regression has no R2 statistic that explains the proportion of variation in the dependent variable accounted for by the predictors.
For this we use McFadden's R2.

The predictors in our model explain just 8.27% of the variation in the data.
This suggests that one or more variables are missing in our model.
We can try a model by considering all the variables.
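McFadden's R2 can be computed directly from the log-likelihoods of the fitted and null models (the pR2 function from the pscl package reports the same quantity); `model` and `train` are the objects from the earlier steps:

```r
# McFadden's R^2 = 1 - logLik(full model) / logLik(intercept-only model)
null_model <- glm(Class ~ 1, data = train, family = binomial)
mcfadden   <- 1 - as.numeric(logLik(model)) / as.numeric(logLik(null_model))
mcfadden   # about 0.0827 for the model described on the slides
```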
ROC Curve
The Receiver Operating Characteristic (ROC) curve is a plot of the true positive rate against the false positive rate.
It shows the tradeoff between sensitivity and specificity (any increase in sensitivity will be accompanied by a decrease in specificity).
The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test.
ROC Curve
The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.
Accuracy is measured by the area under the ROC curve. An area of 1 represents a perfect test; an area of 0.5 represents a worthless test.
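A sketch using the ROCR package (an assumption; the slides do not name the package), with `prob` and `test` from the prediction step:

```r
library(ROCR)

# ROC curve: true positive rate vs. false positive rate
pred_obj <- prediction(prob, test$Class)
roc <- performance(pred_obj, measure = "tpr", x.measure = "fpr")
plot(roc)
abline(0, 1, lty = 2)   # 45-degree reference diagonal

# Area under the ROC curve
auc <- performance(pred_obj, measure = "auc")
auc@y.values[[1]]
```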
Area Under ROC Curve

Since the area under the curve is 0.6944, we can say the discrimination
ability of our model is fair.
Kolmogorov Smirnov Chart
The Kolmogorov Smirnov chart measures the performance of classification models.
It is a measure of the degree of separation between the Goods (Events) and the Bads (Non-Events).
The distance between the goods and the bads should be as large as possible.
The "random" line in the chart corresponds to the case of capturing the responders (Ones) by random selection, i.e., when you do not have any model at your disposal.
The "model" line represents the case of capturing the responders if you go by the model-generated probability scores, where you begin by targeting the data points with the highest probability scores.
Kolmogorov Smirnov Chart

As the separation between the goods and the bads is very small, we can say that the performance of the model is not good.
Kolmogorov Smirnov Statistic

The Kolmogorov Smirnov statistic is the maximum difference between the cumulative true positive rate and the cumulative false positive rate.
It is the maximum difference between the goods and the bads.
It is often used as the deciding metric to judge the efficacy of models in credit scoring.
The higher the value of the Kolmogorov Smirnov statistic, the more efficient the model is at capturing the responders (Ones, or Goods).
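Using the same ROCR objects as in the ROC step (an assumption about the package used), the statistic is the maximum gap between the two cumulative rates:

```r
library(ROCR)

# KS statistic: maximum gap between cumulative TPR and cumulative FPR.
# prob and test$Class are the predicted probabilities and actual labels
# from the prediction step.
pred_obj <- prediction(prob, test$Class)
perf <- performance(pred_obj, measure = "tpr", x.measure = "fpr")

ks <- max(perf@y.values[[1]] - perf@x.values[[1]])
ks   # higher values mean better separation of Goods from Bads
```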
Kolmogorov Smirnov Statistic

As the value of the Kolmogorov Smirnov statistic is small, we can say that the model is not very efficient at capturing the responders.
Tree Diagrams
A useful way of investigating probability problems is to use what are
known as tree diagrams.
Tree diagrams are a useful way of mapping out all possible outcomes
for a given scenario.
They are widely used in probability and are often referred to as
probability trees.
They are also used in decision analysis where they are referred to as
decision trees.
In the context of decision theory, a complex series of choices is available with various different outcomes, and we look for the best of these under a given performance criterion such as maximizing profit or minimizing cost.
Tree Diagram Example 1
Suppose we are given three boxes, Box A contains 10 light bulbs, of
which 4 are defective, Box B contains 6 light bulbs, of which 1 is
defective and Box C contains 8 light bulbs, of which 3 are defective.
We select a box at random and then draw a light bulb from that box at
random. What is the probability that the bulb is defective?

Here we are performing two experiments:
Selecting a box at random
Selecting a bulb at random from the chosen box
If A, B and C denote the events of choosing box A, B, or C respectively, and D and N denote the events that a defective or non-defective bulb is chosen, the two experiments can be represented on the diagram below.
Tree Diagram Example 1

We can compute the following probabilities and insert them onto the branches of the tree:
Tree Diagram Example 1
To get the probability for a particular path of the tree (left to right) we
multiply the corresponding probabilities on the branches of the path.

For example, the probability of selecting box A and then getting a defective bulb is 1/3 × 4/10 = 2/15.
Since all the paths are mutually exclusive and there are three paths which lead to a defective bulb, to answer the original question we must add the probabilities for the three paths,
i.e. 1/3 × 4/10 + 1/3 × 1/6 + 1/3 × 3/8 = 2/15 + 1/18 + 1/8 ≈ 0.314.
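The arithmetic above can be checked directly in R:

```r
# Total probability of a defective bulb across the three equally likely boxes:
# Box A: 4 of 10 defective, Box B: 1 of 6, Box C: 3 of 8
p_defective <- (1/3) * (4/10) + (1/3) * (1/6) + (1/3) * (3/8)
round(p_defective, 3)   # 0.314
```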
Tree Diagram Example 2

Machines A and B turn out respectively 10% and 90% of the total production of a certain type of article.
The probability that machine A turns out a defective item is 0.01 and the probability that machine B turns out a defective item is 0.05.
(i) What is the probability that an article taken at random from the production line is defective?
(ii) What is the probability that an article taken at random from the production line was made by machine A, given that it is defective?
Tree Diagram Example 2
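The two questions follow directly from the tree: total probability for (i), and Bayes' rule for (ii). A short check in R:

```r
# Production shares and per-machine defect probabilities
p_A <- 0.10; p_B <- 0.90
p_D_given_A <- 0.01; p_D_given_B <- 0.05

# (i) Total probability that a random article is defective
p_D <- p_A * p_D_given_A + p_B * p_D_given_B
p_D                     # 0.046

# (ii) Bayes' rule: probability it came from machine A, given defective
p_A_given_D <- p_A * p_D_given_A / p_D
round(p_A_given_D, 4)   # 0.0217
```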
Thank You
