Student ID 2017030
Answer
CMM723 – Statistics for Business Analysis 2018: Answer by 2017030 1
Table of Contents
Introduction
2. Splitting the dataset into Training and Testing data set
Summary
4. Is there any difference between the creditability of female customers and male customers?
6. Assessment of validity of regression model and its variables of training data on testing data set
Attachments
Introduction:
The data set records the proposed creditability measures for a sample of 1000 customers of the given bank.
Based on creditability and the other recorded parameters, the bank decides whether or not to extend
credit facilities.
The necessary cost-profit analysis is carried out with the help of this data. Most of the variables are
categorical, either nominal or ordinal: nominal variables are used to “name” or label a series of values,
while ordinal scales also provide information about the order of the choices. For the purpose of the
analysis, I also transformed the numerical variables into categorical variables.
The JASP software tool (version 0.8.6.0) is used for this analysis. However, I used MS Excel (the format
in which the data set is given) for those calculations that JASP could not perform.
Data Analysis:
Recoding is not possible in JASP. Therefore, I used MS Excel for it.
Refer 01.dataset.xls
Refer 02A. testing dataset.xls in MS Excel and 02B.testing dataset. JASP file
Refer 03B. training dataset.xls in MS Excel and 03B.training dataset. JASP file
Summary:
Refer 03B. training dataset JASP file
From the table below, it is clear that the majority of the respondents in the study (approx. 70%, n = 554)
were categorised as credit worthy, whereas more than a quarter (approx. 30%, n = 246) were
categorised as not credit worthy.
Looking at the payment status of previous credit, I observe that the majority (52.5%, n = 420)
either had no previous credits or had paid back all previous credits. However, about
5% (n = 40) had a problematic running account, while 4% (n = 32) were hesitant in paying previous
credits.
The descriptive statistics of Creditability, Account balance, Payment Status of Previous Credit and
Purpose in the training data show average values of 0.693, 2.539, 2.556 and 2.804 respectively.
The standard deviations of these categorical variables are 0.462, 1.252, 1.097 and 2.749 respectively.
Creditability, Account balance, Payment Status of Previous Credit and Purpose range from 0 to 1,
1 to 4, 0 to 4 and 0 to 10 respectively.
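For a variable coded 0/1 such as Creditability, the mean is simply the proportion of 1s, and the standard deviation follows from that proportion. The sketch below rebuilds the column from the counts reported in this Summary (554 credit-worthy and 246 not credit-worthy customers). It assumes that the reported standard deviation is the sample standard deviation (n - 1 denominator); that convention is an assumption, not something stated in the report.

```python
import statistics

# Counts reported in the Summary for the training data:
# 554 credit-worthy (coded 1) and 246 not credit-worthy (coded 0).
creditability = [1] * 554 + [0] * 246

# For a 0/1 variable the mean equals the proportion of 1s.
mean = statistics.mean(creditability)

# Sample standard deviation (n - 1 denominator); assumed convention.
sd = statistics.stdev(creditability)

print(f"n = {len(creditability)}, mean = {mean:.4f}, sd = {sd:.3f}")
```

The same check does not carry over directly to the multi-level variables, because their level counts are not all listed in the text.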
The descriptive statistics of Value Savings/Stocks, Length of current employment, Instalment per cent
and Sex & Marital Status in the training data show average values of 2.07, 3.395, 2.958 and 2.672
respectively. The standard deviations of these categorical variables are 1.562, 1.207, 1.113 and 0.713
respectively. Value Savings/Stocks, Length of current employment, Instalment per cent and Sex &
Marital Status range from 1 to 5, 1 to 5, 1 to 4 and 1 to 4 respectively.
The descriptive statistics of Guarantors, Duration in Current Address, Most valuable available asset,
Concurrent Credits and Type of apartment in the training data show average values of 1.144,
2.886, 2.377, 2.674 and 1.921 respectively. The standard deviations of these categorical variables are
0.475, 1.093, 1.062, 0.706 and 0.539 respectively. Guarantors, Duration in Current Address,
Most valuable available asset, Concurrent Credits and Type of apartment range from 1 to 3, 1 to 4, 1 to
4, 1 to 3 and 1 to 3 respectively.
The descriptive statistics of Number of Credits at this bank, Occupation, Number of dependents,
Telephone and Foreign Worker in the training data show average values of 1.413, 2.875, 1.156,
1.389 and 1.036 respectively. The standard deviations of these five categorical variables are 0.583,
0.661, 0.363, 0.448 and 0.187 respectively. Number of Credits at this bank, Occupation, Number
of dependents, Telephone and Foreign Worker range from 1 to 4, 1 to 4, 1 to 2, 1 to 2 and 1 to 2
respectively.
The descriptive statistics of Duration of monthly credit, Amount of Credit and Age group in the training
data show average values of 1.524, 1.964 and 1.915 respectively. The standard deviations of
these categorical variables are 0.638, 1.059 and 0.548 respectively. Duration of monthly credit,
Amount of Credit and Age group range from 1 to 3, 1 to 4 and 1 to 3 respectively.
Out of 800 people, 30.75% are not credit worthy and the remaining 69.25% are credit worthy.
Out of 800 people, 28.375% have no balance or debit, followed by 27.875% who have no
running account. The largest share (38%) have checked out $200 for at least 1 year.
The largest share (52%) have no previous or pending credits, followed by 30.125% who
have paid back previous credits at this bank. Only 4.25% and 4.875% face hesitant
payment of previous credits and a problematic running account respectively.
The largest share (27.875%) need credit for purchasing items and furniture, followed by
24.375% who require credit for other purposes. It is notable that 9.875% of people
need credit for purchasing used cars.
Among the 800 people, most (61.375%) have no available savings, followed by 17.625%
who have more than $1000 in savings.
Only 6.125% of people are currently unemployed. 34.25% have been employed for 1 to 4 years, followed
by 25.625% who have been employed for more than 7 years.
46.5% of people have an Instalment per cent of less than 20%.
53.875% of people are either single or widowed males. Only 9.25% are females.
42.5% have lived at their current address for more than 7 years, followed by 29.625%
who have lived at their current address for only 2 to 4 years.
More than 33% have a savings contract with a building society or life insurance.
More than 70% live in an owner-occupied flat, the most frequent category, and only 10.875%
live in a rented flat, the least frequent category.
More than 62% of people have only one credit at this bank and only 0.75% have six or more credits
at this bank.
The people asking for credit at this bank are mostly skilled workers, skilled employees or
minor civil servants. Only 2.625% are unemployed or unskilled labour with no permanent residence.
61.125% of people use a telephone, whereas 38.875% do not.
96.375% of people are foreign workers, whereas 3.625% are not.
For 55.5% of people the duration of credit is less than 40 months, followed by 36.625% for whom
it is more than 40 months but less than 60 months.
The largest group is people whose amount of credit is less than $2000 (42.875%),
followed by those with a credit amount of more than $2000 but less than $4000 (32.875%).
The majority of people, almost 70%, belong to the age group of 25 to 50 years.
Reading a Boxplot
Statistically significant associations with the response variable Creditability are found for the
following variables:
2) Payment Status of previous credits (r = 0.249, p-value <0.001): Weak positive correlation
4. Is there any difference between the creditability of female customers and male
customers?
Refer 04B. dataset for t-test JASP file
4.1 T-test:
This test checks the equality of the averages of a numerical variable (here, Creditability) across
the levels of a categorical variable (here, sex). For the calculation, I recoded levels
1, 2 and 3 of Sex & Marital Status as “Males” and level 4 as “Females”. This recoding is not possible in
JASP. Therefore, I carried it out in MS Excel.
The independent samples t-test produces a t-value of 0.62 with 998 degrees of freedom and a p-value of 0.535.
The mean creditability of the 92 females is 0.728 and that of the 908 males is 0.697.
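Because Creditability is coded 0/1, the t-value can be reproduced from counts alone. The following is a minimal sketch of the pooled-variance (Student) t-test, using the credit-worthy counts from the contingency table in this section: 67 of 92 females, and 30 + 201 + 402 = 633 of 908 males. The function name is hypothetical, and the two-sided p-value uses a normal approximation to the t-distribution, which is an assumption that is reasonable only because the degrees of freedom are large.

```python
import math

def pooled_t_from_binary(ones_a, n_a, ones_b, n_b):
    """Student's t for two groups of 0/1 values, given the count of 1s in each."""
    mean_a, mean_b = ones_a / n_a, ones_b / n_b
    # Sample variance of a 0/1 variable: n / (n - 1) * p * (1 - p)
    var_a = n_a / (n_a - 1) * mean_a * (1 - mean_a)
    var_b = n_b / (n_b - 1) * mean_b * (1 - mean_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / std_err, df

# Females: 67 credit-worthy of 92; males: 633 credit-worthy of 908.
t_value, dof = pooled_t_from_binary(67, 92, 633, 908)

# Two-sided p-value via the normal approximation (assumption: large df).
p_approx = math.erfc(abs(t_value) / math.sqrt(2))

print(f"t = {t_value:.2f}, df = {dof}, p (approx.) = {p_approx:.3f}")
```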
Contingency Tables

Sex & Marital Status                            Creditability
                                                Not credit-worthy   Credit-worthy   Total
Male - divorced / living apart   Count          20.0                30.0            50.0
                                 % within row   40.0%               60.0%           100.0%
Male - single                    Count          109.0               201.0           310.0
                                 % within row   35.2%               64.8%           100.0%
Male - married / widowed         Count          146.0               402.0           548.0
                                 % within row   26.6%               73.4%           100.0%
Female                           Count          25.0                67.0            92.0
                                 % within row   27.2%               72.8%           100.0%
Total                            Count          300.0               700.0           1000.0
                                 % within row   30.0%               70.0%           100.0%

Chi-Squared Tests

        Value    df    p
Χ²      9.605    3     0.022
N       1000
The p-value is 0.022, which is below the 5% level of significance. We thus reject the null
hypothesis and conclude that there is a significant association between sex and creditability of the person.
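The Chi-squared statistic above can be recomputed directly from the counts in the contingency table. The sketch below is a plain-Python illustration, not the JASP procedure itself. The p-value line uses the closed-form upper tail of the chi-squared distribution that holds only for 3 degrees of freedom, which is the case for this 4 x 2 table.

```python
import math

# Observed counts from the contingency table:
# rows = Sex & Marital Status, columns = (not credit-worthy, credit-worthy).
observed = [
    [20, 30],    # Male - divorced / living apart
    [109, 201],  # Male - single
    [146, 402],  # Male - married / widowed
    [25, 67],    # Female
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Pearson chi-squared: sum of (observed - expected)^2 / expected.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)

# Upper-tail probability; this closed form is valid for df == 3 only.
p_value = (math.erfc(math.sqrt(chi2 / 2))
           + math.sqrt(2 * chi2 / math.pi) * math.exp(-chi2 / 2))

print(f"chi2 = {chi2:.3f}, df = {df}, p = {p_value:.3f}")
```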
4.3 Hypotheses:
Null hypothesis (H0): The difference between the average creditability of males and females is 0.
Alternative hypothesis (H1): The average creditability of males and females differ from each other.
Level of significance: 5%
Decision Making: The t-test p-value (0.535) is greater than the 5% level of significance, so the null
hypothesis is not rejected; the average creditability of males and females is not significantly different.
Conclusion: According to both types of t-test, the average creditability of males does not differ
significantly from the average creditability of females. The Chi-squared test on the four Sex &
Marital Status categories, however, shows a significant association (p = 0.022). As can be seen from
the contingency table, females (72.8% credit-worthy) tend to be more credit worthy than men who are
single (64.8%) or divorced and living apart (60.0%).
Training data
Using the training dataset, a logistic model was fitted to predict the creditability of a customer. Factors
such as duration of the credit, credit amount, instalment percent and age of the person were considered
in developing the model. Results are given below:
Area Under Curve: As a validation check, the AUC should be more than 0.7 in both the training and
validation samples, and there should not be a significant difference between the AUC scores of the two
samples. An AUC of more than 0.8 is considered an excellent score. As calculated above, the AUC of the
model is 0.805, meaning the model separates credit-worthy and not credit-worthy customers well.
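As an illustration of what the AUC measures, it can be computed as the share of (credit-worthy, not credit-worthy) pairs in which the credit-worthy customer receives the higher predicted score, with ties counted as half. The sketch below uses this pairwise definition; the function name and the four toy scores are hypothetical and are not taken from the bank data.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0   # positive case scored higher: correct ordering
            elif p == q:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))

# Hypothetical predicted probabilities and true 0/1 labels (toy data).
toy_scores = [0.9, 0.4, 0.6, 0.2]
toy_labels = [1, 1, 0, 0]
toy_auc = auc(toy_scores, toy_labels)
print(toy_auc)
```

Applying the same function to the predicted probabilities of the training and testing samples would give the two AUC values that the validation check compares.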
Coefficients                   Wald Chi-square
(Intercept)                    11.33423872
Purpose                        1.525951557
Value Savings/Stocks           9.460284665
Occupation                     0.011531012
No of dependents               0.516601563
Telephone                      3.058274405
Foreign Worker                 2.698979592
Duration_of_monthly_Credit     4.794589774
Credit_Amount                  3.398412098
Age_group                      2.891921223
The logistic regression model takes “Creditability” as the dependent variable and all other
variables as independent variables. The model indicates that the significant
factors that influence Creditability are Account balance, Payment Status of Previous
Credits and Instalment per cent. The rest of the factors do not significantly affect the dependent
variable, Creditability. Variables such as Duration in current address (Wald
statistic = 0.003), Occupation (Wald statistic = 0.0115) and Number of dependents (Wald statistic = 0.5)
could easily be omitted from the model.
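A Wald Chi-square statistic for a single coefficient can be compared with the 5% critical value of the chi-squared distribution with 1 degree of freedom (3.841). The sketch below applies that rule to the values in the table above. It assumes each listed term has 1 degree of freedom, and it covers only the rows listed in the table, which does not include every variable discussed in the text (Account balance, for example, is not listed).

```python
import math

CRITICAL_5PCT_DF1 = 3.841  # chi-squared critical value, 1 df, 5% level

# Wald Chi-square values as listed in the table above.
wald = {
    "(Intercept)": 11.33423872,
    "Purpose": 1.525951557,
    "Value Savings/Stocks": 9.460284665,
    "Occupation": 0.011531012,
    "No of dependents": 0.516601563,
    "Telephone": 3.058274405,
    "Foreign Worker": 2.698979592,
    "Duration_of_monthly_Credit": 4.794589774,
    "Credit_Amount": 3.398412098,
    "Age_group": 2.891921223,
}

def wald_p_value(statistic):
    """Upper-tail p-value of a chi-squared statistic with 1 df."""
    return math.erfc(math.sqrt(statistic / 2))

significant = [name for name, w in wald.items() if w > CRITICAL_5PCT_DF1]

for name, w in wald.items():
    print(f"{name:28s} Wald = {w:8.4f}  p = {wald_p_value(w):.4f}")
print("Above the 5% critical value:", significant)
```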
The Akaike information criterion (AIC) is an estimator of the relative quality of statistical models for a
given set of data. Given a collection of models for the data, AIC estimates the quality of each model
relative to each of the other models. Thus, AIC provides a means for model selection. The AIC value
for the logistic regression model of the bank is 814.369. Because AIC is a relative measure, this value
is mainly useful for comparison with alternative models fitted to the same data. The significant
p-value (p < 0.001) indicates that the model is well fitted.
The coefficient for the duration of credit was found to be -0.038. With Creditability coded 1 for
credit-worthy customers, the negative sign shows that an increase in the duration of credit lowers
the chance of being credit worthy. In short, a longer duration decreases the chances of credit
worthiness of a person. The credit amount is not significant in the model. The coefficient for the
instalment per cent is -0.351; this shows that as the instalment per cent decreases, so does the chance
of not being credit worthy. That is, a lower instalment per cent tends to increase credit worthiness.
Lastly, the coefficient for age is 0.031; this implies that an increase in age increases the odds of
credit worthiness by about 3%. Overall, the results of the logistic regression indicate a significant
association between creditability and the duration of credit, instalment per cent and age of the person.
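Logistic regression coefficients are on the log-odds scale, so exponentiating a coefficient gives the odds ratio for a one-unit increase in that predictor. The sketch below does this for the three coefficients quoted above; the dictionary keys are labels chosen for illustration. The percentages refer to the odds of the modelled outcome, not to its probability.

```python
import math

# Coefficients quoted in the text (log-odds scale).
coefficients = {
    "duration of credit": -0.038,
    "instalment per cent": -0.351,
    "age": 0.031,
}

# Odds ratio per one-unit increase in each predictor.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    change_pct = (ratio - 1) * 100  # percentage change in the odds
    print(f"{name:20s} odds ratio = {ratio:.4f} ({change_pct:+.1f}% odds per unit)")
```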
Squared Pearson residuals plot
6. Assessment of validity of regression model and its variables of training data on testing
data set:
Testing Data: Refer 03B. testing dataset JASP File
The logistic regression on the testing data indicates that the model is also well fitted (p-value < 0.001),
with an AIC value of 215.79. However, only Account balance is found to be significant in the logistic model
of the testing data, with a p-value of less than 0.001. In this logistic regression model, the Wald
Chi-square statistic also indicates that Purpose (Wald statistic = 0.49), Duration in current address
(Wald statistic = 0.005), Age group (Wald statistic = 0.69), Foreign Worker (Wald statistic = 0.0001)
and Occupation (Wald statistic = 0.001) are unnecessary predictors in the model. These
variables should be eliminated from the logistic model. Therefore, it could be said that the logistic
regression model fitted on the training data is not completely validated by the logistic regression on
the testing data.
Coefficients                   Wald Chi-square
Purpose                        0.49
Value Savings/Stocks           7.716049
Guarantors                     2.528843
Occupation                     0.001072
No of dependents               0.206612
Telephone                      0.409489
Duration_of_monthly_Credit     0.046172
Credit_Amount                  4.548889
Age_group                      0.690305
Unlike the training dataset, where three of the four independent variables were significant, the
testing dataset showed that none of the independent variables was statistically significant in the
model.
Refer 01.dataset.xls
When a bank rejects an applicant with a good credit risk who is likely to repay the loan, it loses
business; when a bank accepts an applicant with a bad credit risk, it also suffers a
financial loss.
The two kinds of decision, which lead to loss and profit, are called wrong and correct decisions respectively.
The total credit amount in the analysis is $3,271,248 (see Dataset.xls, tab dataset, cell F1003).
Using the logistic model fitted on the training data, I computed the credit risk probabilities for the
whole data set (1000 samples).
If the credit risk probability is greater than or equal to 0.5, the case is considered
risky and labelled “1”.
If the credit risk probability is less than 0.5, the case is considered non-risky and is therefore
labelled “0”.
Where the levels of “Creditability” and “Credit risk” are equal (1 or 0 for both), the decision is
counted as wrong; otherwise it is counted as correct.
Further, the revenue is accounted as 135% for correct decisions and as 0% for wrong decisions.
The total revenue is $1,256,545 (see Dataset.xls, tab dataset, cell AK1003).
The deficit is calculated as $2,014,703. The bank would face a loss if it did not review its
creditability procedure. Note that cost-profit analysis is almost impossible in JASP.
Therefore, I carried it out in MS Excel.
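The decision rule and revenue accounting described above can be sketched as follows. This is an illustration of the spreadsheet logic, not the Excel workbook itself: the four applicants are hypothetical, and applying the 135% and 0% rates to each applicant's credit amount is an assumption about how the spreadsheet applies them. The last lines check the reported deficit as total credit minus total revenue.

```python
RISK_THRESHOLD = 0.5         # probability at or above this is labelled risky ("1")
REVENUE_RATE_CORRECT = 1.35  # 135% for correct decisions
REVENUE_RATE_WRONG = 0.0     # 0% for wrong decisions

def revenue(creditability, risk_probability, credit_amount):
    """Revenue for one applicant under the rule described in the text."""
    credit_risk = 1 if risk_probability >= RISK_THRESHOLD else 0
    # Equal levels of Creditability and Credit risk count as a wrong decision.
    wrong = credit_risk == creditability
    rate = REVENUE_RATE_WRONG if wrong else REVENUE_RATE_CORRECT
    return rate * credit_amount  # assumption: rate applies to the credit amount

# Hypothetical applicants: (creditability, risk probability, credit amount).
applicants = [
    (1, 0.20, 1000.0),  # credit worthy, predicted non-risky: correct
    (1, 0.70, 2000.0),  # credit worthy, predicted risky: wrong
    (0, 0.80, 1500.0),  # not credit worthy, predicted risky: correct
    (0, 0.30, 500.0),   # not credit worthy, predicted non-risky: wrong
]
toy_total_credit = sum(amount for _, _, amount in applicants)
toy_total_revenue = sum(revenue(*applicant) for applicant in applicants)
toy_deficit = toy_total_credit - toy_total_revenue

# Totals reported in the text for the full data set.
reported_deficit = 3_271_248 - 1_256_545
print(toy_total_revenue, toy_deficit, reported_deficit)
```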
Attachments
01.dataset.xls
02A. testing dataset.xls
02B.testing dataset. JASP file
03B. training dataset.xls
03B.training dataset. JASP file
04A. dataset for t-test
04A. dataset for t-test JASP file