Elements of Optimal Predictive Modeling Success in Data Science: An Analysis of Survey Data for the Give Me Some Credit Competition Hosted on Kaggle

Acknowledgements: Thanks to Ross Gayler for suggesting Kaggle as a way to test the hypothesis that credit scoring is a commodity, and to Anthony Goldbloom for generous support and use of Kaggle, the predictive modeling crowdsourcing platform.


Executive Summary
Introduction
Survey Analysis of the Give Me Some Credit Contest
Correlation amongst variables
Analysis of Free Text Questions
Principal Components Analysis of Survey
Regression Results for Predicting Model Ranking
Regression Results of Linear Model of Predictive Modeling Ranking Adjusted for Response Bias
Conclusion
References
Appendix of Survey Questions
Appendix of R code for survey analysis
Appendix of Rattle Log
Appendix of Online Blog postings of Winners


Executive Summary
In September 2011 a three-month credit scoring contest called Give Me Some Credit was hosted on the Kaggle predictive modeling platform. It was the most popular Kaggle crowdsourcing contest to date: competition was intense and drew out some of the best data scientists and credit scoring practitioners in the world. The contest results supported the hypothesis that credit scoring is a commodity which cannot create sustainable competitive advantage. The survey data also provided insight into predictive modeling skill and into the factors most important to optimal predictive modeling: model selection (the top three finishers used hybrid blends of random forests, support vector machines, and gradient boosted machines), proficiency in predictive modeling, effort (proxied by the number of methods explored), team size (larger teams generally built better models), and domain knowledge. Predictive model performance is thus a function of:

Performance = Chosen model (exploring multiple models and choosing the best algorithm) + Effort (hard work) + Predictive modeling skill + Domain knowledge + Teamwork

The choice of the right kind of algorithm or model dominated performance the most, and exploration of different approaches mattered more than background or experience.


Introduction
In September 2011 the contest that would become the most popular predictive modeling contest on the Kaggle platform was launched; surprisingly for that time period, it became the most popular data science contest while offering only a $5,000 bounty. The contest drew 990 teams, 1,473 players, and 8,301 entries.1 During its run it had more contestants than the $3 million Heritage Health Prize and the $10,000 Wikipedia participation contest.
The aim of the contest was to build the best credit scoring system that people could use to score their own risk profiles, using attributes they know about their own credit behavior and usage. The contest was sponsored by Credit Fusion, a think tank devoted to commoditizing credit and to making borrower well-being the foundation of lending.
The contest involved a 250,000-observation data set, with area under the receiver operating characteristic (ROC) curve as the metric of model performance. Area under the curve was used because dominance in the ROC curve, which plots a model's true positive rate against its false positive rate, has been shown to be equivalent to dominance in maximum profit and to efficient frontier dominance in the business objectives of optimal profit and volume trade-offs (Beling et al., 2005). The benchmark algorithm for the contest was a random forest ensemble, which performed very strongly out of the box with an area under the curve of 0.864. Random forests were invented by Leo Breiman and Adele Cutler; they use random variable selection and random data sampling to build a majority-voting ensemble of weak learners (Breiman, 2001). The ROC curve charts accurate positive predictions (true positives) against inaccurate positive predictions (false positives), and the area under it is a common metric for measuring the accuracy of predictive models: the higher the area, the more powerful the model.
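As a rough illustration of this kind of out-of-the-box benchmark, the R sketch below fits a 500-tree random forest on the contest training file and scores it by area under the curve using the pROC package. The file name cs-training.csv and the target column SeriousDlqin2yrs follow the public contest data page and should be treated as assumptions here; this is not the organizers' actual benchmark code.

# Hedged sketch of an out-of-the-box random forest benchmark with AUC scoring.
library(randomForest)
library(pROC)
train <- read.csv("cs-training.csv")                  # assumed Kaggle training file
train <- na.roughfix(train)                           # crude imputation of missing values
train$SeriousDlqin2yrs <- as.factor(train$SeriousDlqin2yrs)
set.seed(42)
rf <- randomForest(SeriousDlqin2yrs ~ ., data = train, ntree = 500)
p <- predict(rf, type = "prob")[, 2]                  # out-of-bag predicted probabilities
auc(roc(train$SeriousDlqin2yrs, p))                   # the contest benchmark AUC was roughly 0.864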
This benchmark is 0.006 below the area under the curve of 0.870 achieved by the best commercial credit scores, which are built by teams of PhDs using hundreds of attributes and models. The success of the simple random forest, which builds 500 simple decision tree models using random samples and random variable selection and predicts the outcome by a majority vote of the resulting models, is remarkable and shows that proprietary predictive modeling adds little value on top of simple algorithms (Breiman, 2001). Random forests have been found to outperform standard logistic regression models by 5-7% and to be well suited to credit scoring out of the box because of their ability to deal with correlated variables, which confound the modeling process when regression methods are used (Sharma, 2009; Sharma, 2011).

1 Kaggle represents the culmination of the internet revolution: it allows organizations to publish their data and invite contestants to build statistical models. www.kaggle.com
The winning teams achieved an area under the curve 0.005 better than the benchmark random forest and used a blended approach of multiple random forests, support vector machines, data cleaning, and gradient boosting machines. Stochastic gradient boosting machines are similar to random forests in that they too build weak learners, but they do so in a stagewise manner, fitting additive regression trees to the pseudo-residuals of a least squares fit and using randomly selected subsamples to build the weak learners (Friedman, 1999). Support vector machines are an algorithm invented by Vapnik which uses hyperplanes in high-dimensional space to maximize the margin between the model and the class boundaries (Cristianini and Shawe-Taylor, 2000). Random forests, support vector machines, and gradient boosted machines represent the state of the art in predictive modeling classification and were used heavily by the winning teams. In the end, model performance was a function of algorithms, hard work (proxied by the number of models explored), team size, some domain knowledge, and predictive modeling proficiency:
Performance = Optimal model + Effort (hard work) + Predictive modeling skill + Domain knowledge + Teamwork
What is impressive is that, using predictive modeling skill and open source tools, the winners were able to extract all possible predictive value in the data within three months and achieve performance in line with models built from much more data by teams of PhDs. The competition was intense, and it provided information on predictive modeling skill itself, on the factors at play in building the best models, and on the characteristics of the best modelers. The survey results are used below to analyze patterns in predictive modeling success.
The results supported the original hypothesis that credit scoring and credit risk prediction are commodity processes which are not a source of sustainable competitive advantage or value creation. The limits of credit scoring, studied under the phenomenon of insensitivity of regression model weights known as the flat maximum effect, have long been known and documented, but with ensemble algorithms we can be sure we are near the true performance asymptote of the information in the data (Overstreet et al., 1992).


Survey Analysis of the Give Me Some Credit Contest


Although 990 teams were on the final leaderboard (the online ranking of results; http://www.kaggle.com/c/GiveMeSomeCredit/leaderboard), the survey had 77 respondents, a response rate of 7.77%, which is similar to the response rate of surveys on other Kaggle predictive modeling contests. This suggests the survey might be better administered as a required step, with responses collected before the final contest results are unveiled.
In the distribution of respondent ranks, 1 is the best rank and larger numbers are worse scores. The out-of-the-box random forest benchmark finished in 386th position out of 990 contestants, so 38.9% of contestants were able to beat the benchmark by some slight margin.
In terms of survey response bias, the distribution of ranks shows that responses were biased upwards: respondents who did well were more likely to answer the survey. That said, it is important to note that Opera Solutions, a leading analytics firm with strong experience in credit scoring, competed as well and finished in the top 20, showing that the contest had the best and brightest in the world competing. The winners were Perfect Storm, an Australian interdisciplinary team; Gxav, a one-person team from Singapore; and Occupy, a one-person team from Boston, MA.
A shout-out to the best credit scorers in the world: the Australian team Perfect Storm (Alec Stephenson, Eu Jin Lok and Nathaniel Ramm) was #1, the one-person Singapore team GXAV (Xavier Conort) was #2, and team Occupy (Joe Malicki of Boston, MA) was #3. Their blog postings about their approaches, taken from the Give Me Some Credit blog, are in the appendix.
Responses Skewed Towards Top Performers


Graphical Display of Survey Results

Analysis
Looking at the education and occupation characteristics of contestants shows that the sample included contestants with PhDs, Master's degrees, and Bachelor's degrees, and even high school students. In terms of occupation, computer science and predictive modeling practitioners did the best.
In terms of experience, the top performers had high proficiency in predictive modeling and less experience in the domain of credit scoring, credit risk, and the credit industry. Interestingly, the higher-ranking teams tried many different models, while the winning teams had fewer submissions, indicating they avoided overfitting to the test data in their modeling process.

The biggest predictor of success among top-ranking groups was the use of multiple and hybrid models; top-ranking teams tended to use random forests, gradient boosting machines, logistic regression, decision trees (which are the basis of both random forests and gradient boosting machines), and hybrid solutions. Not everyone in the top two quartiles used support vector machines, but all three winning groups used them along with the other hybrid models.
Thus the optimal model was a function of the algorithm more than anything, with blended ensembles built on different samples performing the best, along with experience in predictive modeling, hard work (as evidenced by trying many models and entries), and some domain expertise.
Running a random forest to identify the factors most important to predictive modeling proficiency, that is, to being among the best credit scorers in the world, gives the following ordering: the algorithm or model used dictates performance the most, followed by proficiency in credit scoring and predictive modeling and years of experience doing data analysis. Education and years of experience in the credit industry correlated the least with performance. Ironically, the algorithm used most frequently in the credit scoring industry, logistic regression, performed the worst. Team size was important, as was the number of models tried as a proxy for hard work and effort.
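A sketch of this importance run is given below. It assumes the cleaned survey data frame data built in the appendix of R code, with the variable names defined there.

# Regression random forest of leaderboard rank on the survey factors; the
# permutation importance ranks which factors matter most for performance.
library(randomForest)
set.seed(42)
fit <- randomForest(rank ~ TeamSize + Education + Occupation +
                      YearsExperienceDataAnalysis + YearsExperienceCredit +
                      ProficiencyCreditScoring + ProficiencyPredictiveModel +
                      NumberModelsTried + NbrSubmissions + Multi,
                    data = data, importance = TRUE, na.action = na.roughfix)
importance(fit)    # %IncMSE: increase in error when a factor is permuted
varImpPlot(fit)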

Plots of performance show that team size and predictive modeling experience were associated with higher performance. Team size and number of submissions were also related, as seen in the principal components plot; having more team members resulted in more solutions being explored and in improved performance.
Credit scoring proficiency and domain knowledge resulted in better performance, but only for respondents with a great deal of experience (more than 10 years in the credit domain) and high proficiency in credit scoring and predictive modeling.


Correlation amongst variables

There was a strong positive correlation between proficiency in credit scoring, predictive modeling experience, the use of multiple models, and the number of models tried, and a negative correlation between modeling proficiency and the number of submissions, suggesting that more proficient modelers were less likely to overfit. There was also a strong correlation between credit domain knowledge and credit scoring proficiency.
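The correlation matrix itself can be reproduced from the numeric survey fields with a few lines of R (variable names as in the appendix code; the Rattle log in the appendix draws the corresponding plot with the ellipse package's plotcorr function):

# Pairwise Pearson correlations among the numeric survey variables and rank.
num_vars <- c("TeamSize", "YearsExperienceDataAnalysis", "YearsExperienceCredit",
              "ProficiencyCreditScoring", "ProficiencyPredictiveModel",
              "NumberModelsTried", "NbrSubmissions", "Multi", "rank")
round(cor(data[, num_vars], use = "pairwise.complete.obs", method = "pearson"), 2)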

Analysis of Free Text Questions


The three open-ended survey questions about what worked well and what did not were analyzed using word clouds, which display frequently occurring words in bolder and bigger text. The word clouds show that gradient boosting algorithms, random forests, SVMs, boosting, trees, and logistic regression were used the most in the final models. The methods that worked best were random forests and gradient boosted machines. The methods said to work least effectively were logistic regression, neural nets, SVMs, and nearest neighbors.
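The word clouds below were generated with the online tools cited under each figure; a rough R equivalent using the tm and wordcloud packages is sketched here, where survey and FinalChoice are assumed names for the raw survey export and its free-text column.

# Build a term frequency table from one free-text question and plot a word cloud.
library(tm)
library(wordcloud)
txt    <- tolower(as.character(survey$FinalChoice))   # assumed free-text column
corpus <- VCorpus(VectorSource(txt))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
freqs  <- sort(rowSums(as.matrix(TermDocumentMatrix(corpus))), decreasing = TRUE)
wordcloud(names(freqs), freqs, min.freq = 2)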
What different modeling techniques did you try to use? What was your final
choice?

Source: http://www.abcya.com/word_clouds.htm


Which modeling techniques gave you the most improvement?


Source: tagcrowd

Which modeling techniques were the least useful?

Principal Components Analysis of Survey

Principal components analysis showed that 8 components (using TeamSize, YearsExperienceDataAnalysis, YearsExperienceCredit, ProficiencyCreditScoring, ProficiencyRisk, ProficiencyPredictiveModel, NumberModelsTried, NbrSubmissions, and Multi, the actual number of models in the final solution) explained all the variance in the ranking prediction.
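A sketch of this principal components step, assuming the cleaned survey frame data and the variable names from the appendix of R code:

# PCA of the standardized survey inputs used to predict ranking.
pca_vars <- c("TeamSize", "YearsExperienceDataAnalysis", "YearsExperienceCredit",
              "ProficiencyCreditScoring", "ProficiencyRisk",
              "ProficiencyPredictiveModel", "NumberModelsTried",
              "NbrSubmissions", "Multi")
pc <- prcomp(na.omit(data[, pca_vars]), center = TRUE, scale. = TRUE)
summary(pc)    # proportion of variance explained by each component
biplot(pc)     # joint display of respondents and survey variables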


Regression Results for Predicting Model Ranking


Because of the response bias, a Heckman two-stage procedure was used to adjust for selection bias (Vartanian, 2009). This involved building a probit model of survey response using the full leaderboard list of teams and the subset of teams that responded to the survey. The probit model predicted survey response as a function of the number of submissions, since people who participated little, did poorly, and put less effort into the process were more likely not to respond.
Probit Model for response:

glm(formula = Reponse ~ NbrSubmissions, family = binomial(link = "probit"),
    data = r)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.7008 -0.3705 -0.3370 -0.3210  2.4656

Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)    -1.689211   0.075646 -22.330  < 2e-16 ***
NbrSubmissions  0.023176   0.003609   6.421 1.35e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 527.86  on 968  degrees of freedom
Residual deviance: 485.27  on 967  degrees of freedom
AIC: 489.27

#heckman 2 stage
#http://www.brynmawr.edu/socialwork/GSSW/Vartanian/Handouts/Heckman%20selection%20model.pdf
#predict the probability of filling out the survey with the probit model m,
#then compute the inverse Mills ratio and include it in the final regression
xb <- predict(m, data, type = "link")     # probit linear predictor
phi <- (1/sqrt(2*pi))*exp(-(xb^2/2))      # standard normal density at xb
capphi <- pnorm(xb)                       # standard normal CDF at xb
data$invmills <- phi/capphi               # inverse Mills ratio

The inverse Mills ratio was added to the final regression model to adjust for this bias.
The analysis of response bias is consistent with Adams' equity theory: humans compute ratios of output to input that determine motivation, and contestants who fared poorly saw little output for their work, could see others performing better, and self-selected out of the contest and the survey (Adams, 1965).


Regression Results of Linear Model of Predictive Modeling Ranking Adjusted for Response Bias
Analysis of variance of the regression results for contest rank showed that model choice (gradient boosting machines/random forests), predictive modeling proficiency, number of models tried (effort), and some domain knowledge resulted in the best performance.
lm(formula = rank ~ ., data = crs$dataset[, c(crs$input, crs$target)])

Residuals:
    Min      1Q  Median      3Q     Max
-469.71 -122.96   -5.88  120.93  442.20

Coefficients:
                                    Estimate Std. Error t value Pr(>|t|)
(Intercept)                       -1006.2381  5828.8519  -0.173   0.8638
TeamSize                             97.4602    54.4115   1.791   0.0808 .
EducationHigh School               -304.4767   214.7896  -1.418   0.1641
EducationMasters                     11.4456    83.2452   0.137   0.8913
EducationPhD                        -50.4428   118.1022  -0.427   0.6716
OccupationBusiness Analyst/Intell    71.6045   110.2972   0.649   0.5199
OccupationComp. Science             221.8455   118.0458   1.879   0.0675 .
OccupationPractitioner               47.0529   102.7625   0.458   0.6495
OccupationStats                     315.6056   225.0497   1.402   0.1685
YearsExperienceDataAnalysis           6.0830     5.8916   1.032   0.3080
YearsExperienceCredit               -22.1285    18.4920  -1.197   0.2385
ProficiencyCreditScoring             32.3485    35.7630   0.905   0.3711
ProficiencyPredictiveModel            0.6131    43.8714   0.014   0.9889
NumberModelsTried                    -1.1355     5.7530  -0.197   0.8445
NbrSubmissions                        2.6274    11.5279   0.228   0.8209
TreeY                               -15.4942   114.0108  -0.136   0.8926
NNY                                 -26.8972   127.6654  -0.211   0.8342
RFY                                 -41.7868   109.8000  -0.381   0.7055
SVMY                                 68.7060   130.4858   0.527   0.6014
GBMY                               -307.4722   144.2320  -2.132   0.0392 *
BayesY                               81.7748   133.2795   0.614   0.5430
Multi                               -73.3270    78.2430  -0.937   0.3543
invmills                            464.1616  2042.6937   0.227   0.8214
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 235.3 on 40 degrees of freedom
  (14 observations deleted due to missingness)
Multiple R-squared: 0.5082,     Adjusted R-squared: 0.2377
F-statistic: 1.879 on 22 and 40 DF,  p-value: 0.04083

Analysis of Variance Table

Response: rank
                            Df  Sum Sq Mean Sq F value   Pr(>F)
TeamSize                     1    1541    1541  0.0278 0.868310
Education                    3  124764   41588  0.7514 0.528041
Occupation                   4  177852   44463  0.8033 0.530350
YearsExperienceDataAnalysis  1     646     646  0.0117 0.914541
YearsExperienceCredit        1  165579  165579  2.9915 0.091412 .
ProficiencyCreditScoring     1  141525  141525  2.5569 0.117681
ProficiencyPredictiveModel   1  650621  650621 11.7548 0.001420 **
NumberModelsTried            1  197001  197001  3.5592 0.066489 .
NbrSubmissions               1   77501   77501  1.4002 0.243672
Tree                         1      38      38  0.0007 0.979244
NN                           1   53944   53944  0.9746 0.329468
RF                           1  111318  111318  2.0112 0.163886
SVM                          1    9807    9807  0.1772 0.676063
GBM                          1  516383  516383  9.3295 0.004001 **
Bayes                        1    1475    1475  0.0266 0.871151
Multi                        1   55093   55093  0.9954 0.324430
invmills                     1    2858    2858  0.0516 0.821402
Residuals                   40 2213972   55349
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Time taken: 0.06 secs

Conclusion
Building the best predictive models is primarily a function of the algorithm. That said, choosing an algorithm does not happen in a vacuum, and optimal predictive modeling is a function of the algorithm, proficiency in modeling, multiple hybrid models, experience in the domain, and hard work.
The best and brightest competed intensely to predict credit risk using variables consumers can use to score themselves. The final model was about 0.005 better in area under the curve (roughly 0.5%) than the out-of-the-box random forest algorithm, supporting the claim that credit scoring is a commodity function, that borrowers should manage their own risk and know their own risk profiles, and that banks should compete on value creation such as cost, convenience, and safe products.
Crowdsourcing also allows for more transparent model building, teamwork, and improvement, as many teams improved and moved up in the rankings by working together. That said, teamwork is not strictly necessary, as a few individuals did very well on skill and talent alone. The final algorithm choice dictated and explained most of the performance.


References
Adams, J.S. (1965) Inequity in social exchange. Adv. Exp. Soc. Psychol. 62: 335-343.
Beling, P., Covaliu, Z. and Oliver, R.M. (2005) Optimal scoring cutoff policies and efficient frontiers. Journal of the Operational Research Society, 56.
Breiman, L. (2001) Random forests. Machine Learning, 45(1): 5-32.
Cristianini, N. and Shawe-Taylor, J. (2000) An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press.
Friedman, J.H. (1999) Stochastic Gradient Boosting. Technical report. Stanford, CA: Department of Statistics, Stanford University.
Overstreet, G.A., Bradley, E.L. and Kemp, R. (1992) The flat maximum effect and generic linear scoring models: a test. IMA Journal of Mathematics Applied in Business and Industry, 4(1): 97-109.
Sharma, Dhruv (2011) Improving the Art, Craft and Science of Economic Credit Risk Scorecards Using Random Forests: Why Credit Scorers and Economists Should Use Random Forests (June 9, 2011). Available at SSRN: http://ssrn.com/abstract=1861535 or http://dx.doi.org/10.2139/ssrn.1861535. Forthcoming, Academy of Banking Studies, summer 2012.
Sharma, Dhruv (2009) Not If Affordability Data Adds Value But How to Add Real Value by Leveraging Affordability Data: Enhancing Predictive Capability of Credit Scoring Using Affordability Data (August 15, 2009). Available at SSRN: http://ssrn.com/abstract=1801346 or http://dx.doi.org/10.2139/ssrn.1801346; earlier copy at http://www.casact.org/research/wp/papers/working-paper-sharma-2009-09.pdf
Vartanian, T. (2009) UCLA Heckman Two-Stage Selection Tutorial. From http://gseis.ucla.edu/courses/ed231c/notes3/selection.html and retrieved from http://www.brynmawr.edu/socialwork/GSSW/Vartanian/Handouts/Heckman%20selection%20model.pdf
http://cran.r-project.org
http://rattle.togaware.com/
www.creditfusion.org
http://www.kaggle.com/c/GiveMeSomeCredit/leaderboard
www.kaggle.com and http://www.kaggle.com/c/GiveMeSomeCredit


Appendix of Survey Questions

What is your email? ip address?

What was your team name?


How many people were in your team (total, including yourself)?
What was the highest level of education that you have attained?
What is your current occupation?
How many years of experience do you have with modeling/data analysis?
How many years of experience do you have with commercial banking/lending?
Please describe similar and/or related work experience you may have in relation to
commercial banking/lending programs.

(Likert scale questions: 1 = low, 5 = high)


o Please rate your proficiency with credit scoring:
o Please rate your proficiency with credit risk:
o Please rate your proficiency with predictive modeling:
How many different models did you try?
What different modeling techniques did you try to use? What was your final choice?
Which modeling techniques gave you the most improvement?
Which modeling techniques were the least useful?
What was your final score on the leaderboard?
(Final score computed based on team name and the leaderboard.)

Appendix of R code for survey analysis


# give me credit survey
data<-read.csv("give me some credit.csv")
names(data) <- c("team", "TeamSize", "Education", "OccupationDetail", "Occupation",
                 "YearsExperienceDataAnalysis", "YearsExperienceCredit",
                 "ProficiencyCreditScoring", "ProficiencyRisk",
                 "ProficiencyPredictiveModel", "NumberMOdelsTried", "FinalScore",
                 "BelowBench", "NbrSubmissions", "Tree", "NN", "RF", "SVM", "GBM",
                 "Logisitic", "DA", "Bayes", "rank")
data<-subset(data,BelowBench!="")
#bench 385 and 0.864249
par(mfrow=c(5,3));
#graphs univariate
#education

barplot(table(data$Education)*100/nrow(data), ylim=c(0,50),
main='Distribution of Contestants by Education',ylab='%')
plot(rank~Education,data=data, main='Rank (Low is good) by Education')
table(data$Education)*100/nrow(data)
#occupation
barplot(table(data$Occupation)*100/nrow(data), ylim=c(0,50),
main='Distribution of Contestants by Occupation',ylab='%')
plot(rank~Occupation,data=data, main='Rank (Low is good) by Occupation')
table(data$Occupation)*100/nrow(data)
#experience YearsExperienceDataAnalysis YearsExperienceCredit
hist(data$YearsExperienceDataAnalysis, freq=FALSE , main='Histogram of
Experience in Data Analysis' )
plot(rank~YearsExperienceDataAnalysis,data=data, main='Rank (Low is
good) by Data Analysis Experience')
plot(rank~cut(YearsExperienceDataAnalysis,10),data=data, main='Rank
(Low is good) by Data Analysis Experience Deciles')
#experience in domain Credit
hist(data$YearsExperienceCredit, freq=FALSE , main='Histogram of
Experience in Credit Domain' )
plot(rank~YearsExperienceCredit,data=data, main='Rank (Low is good) by
Experience in Credit Domain')
plot(rank~cut(YearsExperienceCredit,10),data=data, main='Rank (Low is
good) by Credit Domain Experience Deciles')
#expertise in ProficiencyCreditScoring ProficiencyRisk
ProficiencyPredictiveModel
plot(data$ProficiencyCreditScoring ~data$YearsExperienceCredit)
boxplot(rank~ProficiencyCreditScoring ,data=data, main='Rank (Low is
good) by Credit Scoring Proficiency (5 is highest,1=Lowest) ')
hist(data$ProficiencyCreditScoring , freq=FALSE , main='Histogram of
Proficiency in Credit Scoring' );
#weight proficiency by actual experience in credit domain, as one could
#inflate self-reported proficiency with no experience
plot(rank~cut(ProficiencyCreditScoring*YearsExperienceCredit ,
10),data=data, main='Rank (Low is good) by Credit Scoring
Proficiency*Experience in Domain Deciles')
plot(rank~ProficiencyCreditScoring,data=subset(data,YearsExperienceCred
it>=3), main='Rank (Low is good) by Credit Scoring Proficiency')
dev.new();
par(mfrow=c(3,3));
#risk proficiency
hist(data$ProficiencyRisk , freq=FALSE , main='Histogram of Proficiency
in Risk Management' )
boxplot(rank~ProficiencyRisk ,data=data, main='Rank (Low is good) by
Risk Management Proficiency (5 is highest,1=Lowest) ')
# predictive modeling
hist(data$ProficiencyPredictiveModel, freq=FALSE , main='Histogram of
Proficiency in Predictive Modeling ' )
boxplot(rank~ProficiencyPredictiveModel,data=data, main='Rank (Low is
good) by Predictive Modeling Proficiency (5 is highest,1=Lowest) ')

#TeamSize
hist(data$TeamSize, freq=FALSE , main='Histogram of TeamSize' )
plot(rank~TeamSize,data=data, main='Rank (Low is good) by TeamSize')
plot(rank~cut(TeamSize,5),data=data, main='Rank (Low is good) by
TeamSize Quantiles')
# Team size / interdisciplinarity is a function of occupations and education;
# only the winning team had 2 different occupations and educations (interdisciplinary)
# NumberMOdelsTried and NbrSubmissions proxy for effort (# of models)
data$NumberMOdelsTried<-as.numeric(data$NumberMOdelsTried)
names(data)[11]<-"NumberModelsTried"
data$top10<- as.factor(ifelse(data$rank<=100,"Y","N"))
data$Multi<-as.numeric(data$Tree)+as.numeric(data$NN)
+as.numeric(data$RF)+as.numeric(data$SVM)+as.numeric(data$GBM)
+as.numeric(data$Logisitic)+as.numeric(data$DA)+as.numeric(data$Bayes)
data$Multi<-data$Multi-16
data$Hybrid <- as.factor(ifelse(data$Multi>0,"Y","N"))
library(randomForest)
data<-na.roughfix(data)
#990 teams, 79 respondents; 61% of people scored worse than the benchmark, but in the
#survey sample only 41% did worse than the benchmark,
#so responses skewed towards higher-ranking players
plot(density(na.roughfix(data$rank)));
data<-subset(data,select=-OccupationDetail)
data<-subset(data,select=-team)
par(mfrow=c(5,3));
for (i in c(1:length(data)))
{
if (class(data[,i])=='factor' & !names(data)[i] %in% c('BelowBench','top10'))
{
plot(data$rank~data[,i],main=names(data)[i]);
}
}
par(mfrow=c(3,3));
for (i in c(2,3,5,6,7,8,9,10,11,14,15,16:22))
{
if (is.numeric(data[,i])==TRUE)
{
hist(data[,i],main=names(data)[i]);
}
}


#estimating response bias


r<-read.csv("resp data.csv")
names(r)[3]<-"NbrSubmissions"
m<-glm(Reponse~NbrSubmissions,data=r,family=binomial(link="probit"))
summary(m)
#heckman 2 stage from ucla stats website originally retrieved from
#http://www.brynmawr.edu/socialwork/GSSW/Vartanian/Handouts/Heckman%20selection
%20model.pdf

#heckman 2 stage
#predict prob of filling out survey
#score data and generate mills ratio and include in regression
#generate phi = (1/sqrt(2*pi))*exp(-(pi^2/2)) /*standardize it*/
#generate capphi = norm(p1)
#generate invmills = phi/capphi
xb <- predict(m, data, type = "link")     # probit linear predictor for each team
phi <- (1/sqrt(2*pi))*exp(-(xb^2/2))      # standard normal density at xb
capphi <- pnorm(xb)                       # standard normal CDF at xb
data$invmills <- phi/capphi               # inverse Mills ratio

Appendix of Rattle Log


# Rattle is Copyright (c) 2006-2011 Togaware Pty Ltd.
#============================================================
# Rattle timestamp: 2012-03-10 11:50:01 i386
# Rattle version 2.6.13 user ''
# Export this log textview to a file using the Export button or the Tools
# menu to save a log of all activity. This facilitates repeatability. Exporting
# to file 'myrf01.R', for example, allows us to type in the R Console
# the command source('myrf01.R') to repeat the process automatically.
# Generally, we may want to edit the file to suit our needs. We can also directly
# edit this current log textview to record additional information before exporting.
# Saving and loading projects also retains this log.
library(rattle)
# This log generally records the process of building a model. However, with very
# little effort the log can be used to score a new dataset. The logical variable
# 'building' is used to toggle between generating transformations, as when building
# a model, and simply using the transformations, as when scoring a dataset.
building <- TRUE
scoring <- ! building
# The colorspace package is used to generate the colours used in plots, if available.
library(colorspace)
# A pre-defined value is used to reset the random seed so that results are repeatable.
crv$seed <- 42
#============================================================
# Rattle timestamp: 2012-03-10 11:55:41 i386
# Load an R data frame.
crs$dataset <- data
# Display a simple summary (structure) of the dataset.
str(crs$dataset)
#============================================================
# Rattle timestamp: 2012-03-10 11:55:42 i386
# Note the user selections.
# Build the training/validate/test datasets.
set.seed(crv$seed)
crs$nobs <- nrow(crs$dataset) # 77 observations
crs$sample <- crs$train <- sample(nrow(crs$dataset), 0.69*crs$nobs) # 53 observations
crs$validate <- sample(setdiff(seq_len(nrow(crs$dataset)), crs$train),
                       0.14*crs$nobs) # 11 observations
crs$test <- setdiff(setdiff(seq_len(nrow(crs$dataset)), crs$train),
                    crs$validate) # 13 observations
# The following variable selections have been noted.
crs$input <- c("team", "Education", "OccupationDetail", "Occupation",
"YearsExperienceDataAnalysis", "YearsExperienceCredit",
"ProficiencyCreditScoring", "ProficiencyRisk",
"ProficiencyPredictiveModel", "NumberModelsTried", "FinalScore",
"BelowBench",
"NbrSubmissions", "Tree", "NN", "RF",
"SVM", "GBM", "Logisitic", "DA",
"Bayes", "rank", "top10", "Multi",
"Hybrid", "invmills")

crs$numeric <- c("YearsExperienceDataAnalysis",
"YearsExperienceCredit", "ProficiencyCreditScoring", "ProficiencyRisk",
"ProficiencyPredictiveModel", "NumberModelsTried", "FinalScore",
"NbrSubmissions",
"rank", "Multi", "invmills")
crs$categoric <- c("team", "Education", "OccupationDetail",
"Occupation",
"BelowBench", "Tree", "NN", "RF",
"SVM", "GBM", "Logisitic", "DA",
"Bayes", "top10", "Hybrid")
crs$target  <- "TeamSize"
crs$risk    <- NULL
crs$ident   <- NULL
crs$ignore  <- NULL
crs$weights <- NULL

#============================================================
# Rattle timestamp: 2012-03-10 11:56:23 i386
# Note the user selections.
# Build the training/validate/test datasets.
set.seed(crv$seed)
crs$nobs <- nrow(crs$dataset) # 77 observations
crs$sample <- crs$train <- sample(nrow(crs$dataset), 0.69*crs$nobs) # 53 observations
crs$validate <- sample(setdiff(seq_len(nrow(crs$dataset)), crs$train),
0.14*crs$nobs) # 11 observations
crs$test <- setdiff(setdiff(seq_len(nrow(crs$dataset)), crs$train),
crs$validate) # 13 observations
#============================================================
# Rattle timestamp: 2012-03-10 11:56:31 i386
# Note the user selections.
# Build the training/validate/test datasets.
set.seed(crv$seed)
crs$nobs <- nrow(crs$dataset) # 77 observations
crs$sample <- crs$train <- sample(nrow(crs$dataset), 0.69*crs$nobs) # 53 observations
crs$validate <- sample(setdiff(seq_len(nrow(crs$dataset)), crs$train),
0.14*crs$nobs) # 11 observations
crs$test <- setdiff(setdiff(seq_len(nrow(crs$dataset)), crs$train),
crs$validate) # 13 observations
# The following variable selections have been noted.
crs$input <- c("TeamSize", "Education", "Occupation",
"YearsExperienceDataAnalysis",
"YearsExperienceCredit", "ProficiencyCreditScoring",
"ProficiencyPredictiveModel", "NumberModelsTried",
"NbrSubmissions", "invmills")


crs$numeric <- c("TeamSize", "YearsExperienceDataAnalysis",


"YearsExperienceCredit", "ProficiencyCreditScoring",
"ProficiencyPredictiveModel", "NumberModelsTried",
"NbrSubmissions", "invmills")
crs$categoric <- c("Education", "Occupation")
crs$target  <- "rank"
crs$risk    <- NULL
crs$ident   <- c("ProficiencyRisk", "FinalScore")
crs$ignore  <- c("team", "OccupationDetail", "BelowBench", "Tree", "NN", "RF",
                 "SVM", "GBM", "Logisitic", "DA", "Bayes", "top10", "Multi",
                 "Hybrid")
crs$weights <- NULL

#============================================================
# Rattle timestamp: 2012-03-10 11:56:37 i386# Regression model
# Build a Regression model.
crs$glm <- lm(rank ~ ., data=crs$dataset[crs$train,c(crs$input,
crs$target)])
# Generate a textual view of the Linear model.
print(summary(crs$glm))
cat('==== ANOVA ====
')
print(anova(crs$glm))
print("
")
# Time taken: 0.05 secs
#============================================================
# Rattle timestamp: 2012-03-10 11:57:24 i386
# Note the user selections.
# Build the training/validate/test datasets.
set.seed(crv$seed)
crs$nobs <- nrow(crs$dataset) # 77 observations
crs$sample <- crs$train <- sample(nrow(crs$dataset), 0.69*crs$nobs) # 53 observations
crs$validate <- sample(setdiff(seq_len(nrow(crs$dataset)), crs$train),
0.14*crs$nobs) # 11 observations
crs$test <- setdiff(setdiff(seq_len(nrow(crs$dataset)), crs$train),
crs$validate) # 13 observations
# The following variable selections have been noted.

crs$input <- c("TeamSize", "Education", "Occupation",
"YearsExperienceDataAnalysis",
"YearsExperienceCredit", "ProficiencyCreditScoring",
"ProficiencyPredictiveModel", "NumberModelsTried",
"NbrSubmissions", "RF", "SVM", "GBM",
"Multi", "invmills")
crs$numeric <- c("TeamSize", "YearsExperienceDataAnalysis",
"YearsExperienceCredit", "ProficiencyCreditScoring",
"ProficiencyPredictiveModel", "NumberModelsTried",
"NbrSubmissions", "Multi",
"invmills")
crs$categoric <- c("Education", "Occupation", "RF", "SVM",
"GBM")
crs$target  <- "rank"
crs$risk    <- NULL
crs$ident   <- c("ProficiencyRisk", "FinalScore")
crs$ignore  <- c("team", "OccupationDetail", "BelowBench", "Tree",
                 "NN", "Logisitic", "DA", "Bayes", "top10", "Hybrid")
crs$weights <- NULL
#============================================================
# Rattle timestamp: 2012-03-10 11:57:33 i386# Regression model
# Build a Regression model.
crs$glm <- lm(rank ~ ., data=crs$dataset[crs$train,c(crs$input,
crs$target)])
# Generate a textual view of the Linear model.
print(summary(crs$glm))
cat('==== ANOVA ====
')
print(anova(crs$glm))
print("
")
# Time taken: 0.05 secs
#============================================================
# Rattle timestamp: 2012-03-10 11:57:42 i386# Note the user selections.
# The following variable selections have been noted.
crs$input <- c("TeamSize", "Education", "Occupation",
"YearsExperienceDataAnalysis",
"YearsExperienceCredit", "ProficiencyCreditScoring",
"ProficiencyPredictiveModel", "NumberModelsTried",
"NbrSubmissions", "RF", "SVM", "GBM",

"Multi", "invmills")
crs$numeric <- c("TeamSize", "YearsExperienceDataAnalysis",
"YearsExperienceCredit", "ProficiencyCreditScoring",
"ProficiencyPredictiveModel", "NumberModelsTried",
"NbrSubmissions", "Multi",
"invmills")
crs$categoric <- c("Education", "Occupation", "RF", "SVM",
"GBM")
crs$target  <- "rank"
crs$risk    <- NULL
crs$ident   <- c("ProficiencyRisk", "FinalScore")
crs$ignore  <- c("team", "OccupationDetail", "BelowBench", "Tree",
                 "NN", "Logisitic", "DA", "Bayes", "top10", "Hybrid")
crs$weights <- NULL
#============================================================
# Rattle timestamp: 2012-03-10 11:57:47 i386# Regression model
# Build a Regression model.
crs$glm <- lm(rank ~ ., data=crs$dataset[,c(crs$input, crs$target)])
# Generate a textual view of the Linear model.
print(summary(crs$glm))
cat('==== ANOVA ====
')
print(anova(crs$glm))
print("
")
# Time taken: 0.09 secs
#============================================================
# Rattle timestamp: 2012-03-10 11:58:44 i386# Note the user selections.
# The following variable selections have been noted.
crs$input <- c("TeamSize", "YearsExperienceDataAnalysis",
"YearsExperienceCredit", "ProficiencyCreditScoring",
"ProficiencyPredictiveModel", "NumberModelsTried",
"NbrSubmissions", "RF",
"SVM", "GBM", "Multi", "invmills")
crs$numeric <- c("TeamSize", "YearsExperienceDataAnalysis",
"YearsExperienceCredit", "ProficiencyCreditScoring",
"ProficiencyPredictiveModel", "NumberModelsTried",
"NbrSubmissions", "Multi",
"invmills")


crs$categoric <- c("RF", "SVM", "GBM")


crs$target  <- "rank"
crs$risk    <- NULL
crs$ident   <- c("ProficiencyRisk", "FinalScore")
crs$ignore  <- c("team", "Education", "OccupationDetail", "Occupation",
                 "BelowBench", "Tree", "NN", "Logisitic", "DA", "Bayes", "top10",
                 "Hybrid")
crs$weights <- NULL
#============================================================
# Rattle timestamp: 2012-03-10 11:58:49 i386# Regression model
# Build a Regression model.
crs$glm <- lm(rank ~ ., data=crs$dataset[,c(crs$input, crs$target)])
# Generate a textual view of the Linear model.
print(summary(crs$glm))
cat('==== ANOVA ====
')
print(anova(crs$glm))
print("
")
# Time taken: 0.08 secs
#============================================================
# Rattle timestamp: 2012-03-10 11:59:39 i386# Note the user selections.
# The following variable selections have been noted.
crs$input <- c("TeamSize", "Education", "Occupation",
"YearsExperienceDataAnalysis",
"YearsExperienceCredit", "ProficiencyCreditScoring",
"ProficiencyPredictiveModel", "NumberModelsTried",
"NbrSubmissions", "invmills")
crs$numeric <- c("TeamSize", "YearsExperienceDataAnalysis",
"YearsExperienceCredit", "ProficiencyCreditScoring",
"ProficiencyPredictiveModel", "NumberModelsTried",
"NbrSubmissions", "invmills")
crs$categoric <- c("Education", "Occupation")
crs$target  <- "rank"
crs$risk    <- NULL
crs$ident   <- c("ProficiencyRisk", "FinalScore")
crs$ignore  <- c("team", "OccupationDetail", "BelowBench", "Tree",
                 "NN", "RF", "SVM", "GBM", "Logisitic", "DA", "Bayes", "top10", "Multi",
                 "Hybrid")
crs$weights <- NULL
#============================================================
# Rattle timestamp: 2012-03-10 11:59:44 i386# Regression model
# Build a Regression model.
crs$glm <- lm(rank ~ ., data=crs$dataset[,c(crs$input, crs$target)])
# Generate a textual view of the Linear model.
print(summary(crs$glm))
cat('==== ANOVA ====
')
print(anova(crs$glm))
print("
")
# Time taken: 0.03 secs
#============================================================
# Rattle timestamp: 2012-03-10 12:01:49 i386# Note the user selections.
#============================================================
# Rattle timestamp: 2012-03-10 12:01:54 i386# Note the user selections.
# The following variable selections have been noted.
crs$input <- c("TeamSize", "Education", "Occupation",
"YearsExperienceDataAnalysis",
"YearsExperienceCredit", "ProficiencyCreditScoring",
"ProficiencyPredictiveModel", "NumberModelsTried",
"NbrSubmissions", "Multi", "invmills")
crs$numeric <- c("TeamSize", "YearsExperienceDataAnalysis",
"YearsExperienceCredit", "ProficiencyCreditScoring",
"ProficiencyPredictiveModel", "NumberModelsTried",
"NbrSubmissions", "Multi",
"invmills")
crs$categoric <- c("Education", "Occupation")
crs$target  <- "rank"
crs$risk    <- NULL
crs$ident   <- c("ProficiencyRisk", "FinalScore")
crs$ignore  <- c("team", "OccupationDetail", "BelowBench", "Tree", "NN", "RF",
                 "SVM", "GBM", "Logisitic", "DA", "Bayes", "top10", "Hybrid")
crs$weights <- NULL
#============================================================
# Rattle timestamp: 2012-03-10 12:02:00 i386# Regression model
# Build a Regression model.
crs$glm <- lm(rank ~ ., data=crs$dataset[,c(crs$input, crs$target)])
# Generate a textual view of the Linear model.
print(summary(crs$glm))
cat('==== ANOVA ====
')
print(anova(crs$glm))
print("
")
# Time taken: 0.03 secs
#============================================================
# Rattle timestamp: 2012-03-10 12:04:06 i386# Note the user selections.
# The following variable selections have been noted.
crs$input <- c("TeamSize", "Education", "Occupation",
"YearsExperienceDataAnalysis",
"YearsExperienceCredit", "ProficiencyCreditScoring",
"ProficiencyPredictiveModel", "NumberModelsTried",
"NbrSubmissions", "Tree", "NN", "RF",
"SVM", "GBM", "Bayes", "Multi",
"invmills")
crs$numeric <- c("TeamSize", "YearsExperienceDataAnalysis",
"YearsExperienceCredit", "ProficiencyCreditScoring",
"ProficiencyPredictiveModel", "NumberModelsTried",
"NbrSubmissions", "Multi",
"invmills")
crs$categoric <- c("Education", "Occupation", "Tree", "NN",
"RF", "SVM", "GBM", "Bayes")
crs$target  <- "rank"
crs$risk    <- NULL
crs$ident   <- c("ProficiencyRisk", "FinalScore")
crs$ignore  <- c("team", "OccupationDetail", "BelowBench", "Logisitic",
                 "DA", "top10", "Hybrid")
crs$weights <- NULL

#============================================================
# Rattle timestamp: 2012-03-10 12:04:11 i386-

# Regression model
# Build a Regression model.
crs$glm <- lm(rank ~ ., data=crs$dataset[,c(crs$input, crs$target)])
# Generate a textual view of the Linear model.
print(summary(crs$glm))
cat('==== ANOVA ====
')
print(anova(crs$glm))
print("
")
# Time taken: 0.06 secs
#============================================================
# Rattle timestamp: 2012-03-10 12:21:13 i386# Generate a correlation plot for the variables.
# The 'ellipse' package provides the 'plotcorr' function.
require(ellipse, quietly=TRUE)
# Correlations work for numeric variables only.
crs$cor <- cor(crs$dataset[, crs$numeric], use="pairwise",
method="pearson")
# Order the correlations by their strength.
crs$ord <- order(crs$cor[1,])
crs$cor <- crs$cor[crs$ord, crs$ord]
# Display the actual correlations.
print(crs$cor)
# Graphically display the correlations.
plotcorr(crs$cor, col=colorRampPalette(c("red", "white", "blue"))(11)
[5*crs$cor + 6])
title(main="Correlation data using Pearson",
sub=paste("Rattle", format(Sys.time(), "%Y-%b-%d %H:%M:%S"),
Sys.info()["user"]))

Appendix of Online Blog postings of Winners:


Source: http://blog.kaggle.com/2012/01/03/the-perfect-storm-meet-the-winners-of-give-me-some-credit/#more-1621


Rank 1. Perfect Storm


The Perfect Storm, comprising Alec Stephenson, Eu Jin Lok and Nathaniel Ramm,
brought home first prize in Give Me Some Credit. We caught up with Alec and Eu Jin.

How does it feel to have done so well in a contest with almost 1000 teams?
EJ: Pretty amazing, especially when it was such an intense competition with so many
good competitors. Personally, I felt a strong sense of achievement together as a team.
AS: It feels great, particularly because we won by such a well-defined margin. The gap
between first and second place was the largest gap in the top 500 placings.
What were your backgrounds prior to entering this challenge?
EJ: My background is in statistics and econometric modelling. More recently I've worked
in data mining and machine learning for Deloitte Analytics Australia, where I am a
Senior Analyst.
AS: My formal background is in mathematics and statistics. I am a largely self-taught
programmer, and have written a number of R packages. I do not work in data mining, but
have picked up an interest in it over the last year or so, mainly due to Kaggle! I am an
academic, originally from London, and have studied or worked at universities in England,
Singapore, China and Australia.
What preprocessing and supervised learning methods did you use?
AS: We tried many different supervised learning methods, but we decided to keep our ensemble to only those things that we knew would improve our score through cross-validation evaluations. In the end we only used five supervised learning methods: a random forest of classification trees, a random forest of regression trees, a classification tree boosting algorithm, a regression tree boosting algorithm, and a neural network.
This competition had a fairly simple data set and relatively few features - did that affect how you went about things?
EJ: It meant that the barrier to entry was low, competition would be very intense and
everyone would eventually arrive at similar results and methods. Before we formed a
team, I knew that I would have to work extra hard and be really innovative in my
approach to solving this problem. Collaboration was the last ace and as the competition
started to hit the ceiling, I decided to play that card.
What was your most important insight into the data?

EJ: I discovered 2 key features, the first being the total number of late days, and second
the difference between income and expense. They turned out to be very predictive!
Were you surprised by any of your insights?
AS: I was surprised at how well neural networks performed. They certainly gave a good
improvement over and above more modern approaches based on bagging and boosting. I
have tried neural networks in other competitions where they did not perform as well.
How did working in a team help you?
TOGETHER: As individuals, we were unlikely to win. But with Nathaniel's expertise in
credit scoring, Alec's expertise in algorithms and Eu Jin's knowledge in data mining, we
had something completely different to offer that was really powerful. In a literal sense,
we stormed our way up to the top.
Which tools did you use?
TOGETHER: SQL, SAS, R, Viscovery and even Excel!
What have you taken away from this competition?
AS: That data mining is fun when you are in a team, and also how effective a team can be
if the skills of its members complement each other. You can learn a lot from the people
that you work with.

Source: http://blog.kaggle.com/2011/12/21/score-xavier-conort-on-coming-second-in-give-me-some-credit/

Rank2: Team GXAV


Xavier Conort came second in Give Me Some Credit and let us in on some of the tricks of
the trade.
How does it feel to have done so well in a contest with almost 1000 teams?
I feel great because Machine Learning is not part of my natural toolkit. I now look
forward to exploiting this insight in my professional life and exploring new ideas and
techniques in other competitions.
What was your background prior to entering this challenge?
I am an actuary and set up a consultancy called Gear Analytics a few months ago. It's based in Singapore, and helps companies to build internal capabilities in predictive modeling and risk management using R. Previously, I worked in France, Brazil, China and Singapore holding different roles (actuary, CFO, risk manager) in the Life and Non-Life Insurance industry.
What preprocessing and supervised learning methods did you use?
I didn't spend much time on preprocessing. My most important input was to create a
variable which estimates the likelihood of being late by more than 90 days.
I used a mix of 15 models including Gradient Boosting Machine (GBM), weighted
GBMs, Random Forest, balanced Random Forest, GAM, weighted GAM, Support Vector
Machine (SVM) and bagged ensemble of SVMs. My best score, however, was an
ensemble without SVMs.
This competition had a fairly simple data set and relatively few features - did that affect how you went about things?
The data was simple yet messy. I found off-the-shelf techniques such as GBM could
handle it. The relative simplicity of the data allowed me to allocate more time to trying
different models and ensembling my individual models.
What was your most important insight into the data?
The likelihood of being late was by far the most important predictor in my GBM, and its inclusion as a predictor improved the accuracy of my individual fits.
Were you surprised by any of your insights?
I've always believed that people can benefit from diversity, but I was surprised to see
how much data science can also benefit from it (through ensembling techniques). The
strong performance achieved by Alec, Eu Jin and Nathaniel (Perfect Storm) also shows
that teamwork matters.
My best individual fit was a weighted GBM which scored 0.86877 in the private set.
Without ensembling weaker models, my rank would have been 25.
Which tools did you use?
R, the excellent book The Elements of Statistical Learning and the great online Stanford
course by Andrew Ng. All are free.
What have you taken away from this competition?
When I entered the competition, I was still unfamiliar with Machine Learning techniques
as they are rarely used in the insurance industry. I was amazed by the capacity of
Gradient Boosting Machine (also called Boosted Regression Trees) to learn non-linear

functions (including interactions) and accommodate missing values and outliers. It is
definitely something that I will include in my toolbox in the future.

Source: http://blog.kaggle.com/2011/12/20/credit-where-credits-due-joe-malicki-on-placing-third-in-give-me-some-credit-2/

Rank3: Team Occupy


Joe Malicki placed third in Give Me Some Credit despite health problems preventing him
from submitting for the majority of the competition.

How does it feel to have done so well in a competition with almost 1000 teams?
Great! This was my first serious attempt at Kaggle - I've been doing data modeling for a
while, and wanted to try cutting my teeth at a real competition for the first time.
What was your background prior to entering this challenge?
I have a computer science (CS) degree and have done some graduate work in CS,
statistics, and applied math. I currently work for Nokia doing machine learning for local
search ranking.
What preprocessing and supervised learning methods did you use?
I tried quite a bit of preprocessing, mentioned below, and in the code I posted to the
forums.
I used random forests, gradient boosted decision trees, logistic regression, and SVMs,
with various subsampling of the data for various positive/negative feature balancing. In
the end, using 50/50 balanced boosting, and 10/90 balance random forests, and averaging
them, won.
This competition had a fairly simple data set and relatively few features - did that affect how you went about things?
Absolutely! One of the big problems in this competition was a large and imbalanced
dataset - defaults are rare. I used stratified sampling for classes to produce my training
set.
Transformations of the data were key. To expand the number of features, I tried to use
my prior knowledge of credit scoring and personal finance to expand the feature set.

For instance, knowing that people above age 60 are likely to qualify for social security,
which makes their income more stable, and that 33% (for the mortgage) and 43% (for the
mortgage plus other debt) are often magic debt to income (DTI) numbers was very
useful. I feel strongly that knowing what the data elements actually represent in the real
world, rather than numbers, is huge for modelling.
Random forests are a great learning algorithm, but they deal poorly where transforms of features are very important. So I identified some things that looked important: combining income and DTI to estimate debt, and combining number of dependents and income as a proxy for disposable income/constraints. The latter is important since struggling to barely support dependents makes default more likely.
Also, trying to notice "interesting" incomes divisible by 1000 was useful - I was guessing
that fraud might be more likely for these, and/or they may signal new jobs where a person
hasn't had percent raises. I was going to try a Benford's law-inspired income feature, to
help detect fraud, but had to leave the competition before I got a chance.
What was your most important insight into the data?
Forum discussion pointed out that number of times past due in the 90's seemed like a
special code, and should be treated differently. Also, noticing that DTI for those with
missing income data seemed weird was critical.
Were you surprised by any of your insights?
I had tried keeping a small hold-out data set, and then combining all of the models I
trained via a logistic regression on that data set, using the model probabilities as features,
but found that only taking the best models of various classes worked better.
Which tools did you use?
R. It's a memory hog, but it has first-class algorithm implementations.
What have you taken away from this competition?
The results I saw on the public test set differed dramatically from any results I could get
on any held-out portion of the training set. In the end, the results on the private testing
data set for all of the models I submitted were extremely close to my private evaluations,
far closer than to the performance on the public set. This highlighted to me the
importance of relying on cross-validation on large samples.
