Acknowledgements: Thanks to Ross Gayler for suggesting Kaggle as a way to test the
hypothesis of credit scoring as a commodity, and to Anthony Goldbloom for his
generous support and the use of Kaggle, the platform for predictive modeling
crowdsourcing.
Executive Summary
Introduction
Survey Analysis of the Give Me Some Credit Contest
Correlation Amongst Variables
Analysis of Free Text Questions
Principal Components Analysis of Survey
Regression Results of Predictive Model Ranking
Regression Results of Linear Model of Predictive Modeling Ranking Adjusted for Response Bias
Conclusion
References
Appendix of Survey Questions
Appendix of R Code for Survey Analysis
Appendix of Rattle Log
Appendix of Online Blog Postings of Winners
Executive Summary
In September 2011 a three-month credit scoring contest called Give Me Some Credit was
hosted on the Kaggle predictive modeling platform. It was the most popular Kaggle
crowdsourcing contest to date, drew intense competition, and attracted some of the best
data scientists and credit scoring practitioners in the world. The contest results
supported the hypothesis that credit scoring is a commodity which cannot create
sustainable competitive advantage. The survey data also provided insight into predictive
modeling skill. The factors most important to optimal predictive modeling were: model
selection (the top 3 teams used hybrid models of random forests, support vector
machines, and gradient boosted machines), proficiency in predictive modeling, effort
(proxied by the number of methods explored), team size (more people generally resulted
in better models), and domain knowledge. Thus predictive model performance is a
function of:

Performance = Chosen Model (exploring multiple models and choosing the best
algorithm) + Effort (hard work) + Predictive Modeling Skill + Domain Knowledge +
Teamwork

The choice of the right kind of algorithm or model dominated performance, and
exploration of different approaches mattered more than background or experience.
Introduction
In September 2011 the most popular predictive modeling contest to date was hosted on
the Kaggle platform; surprisingly for the period, it became the most popular data
science contest despite offering only a $5,000 bounty. The contest drew 990 teams,
1,473 players, and 8,301 entries.1 During its duration it had more contestants than the
$3 million Heritage Health Prize and the $10,000 Wikipedia contest.
The aim of the contest was to build the best credit scoring system that people could use
to score their own risk profiles, using attributes people know about their own credit
behavior and usage. The contest was sponsored by Credit Fusion, a think tank devoted to
commoditizing credit and to ensuring borrower well-being as the foundation of lending.
The contest involved a 250,000-observation data set, with area under the receiver
operating characteristic (ROC) curve as the metric of model performance. The ROC
curve plots a model's true positive rate against its false positive rate, and the area
under the curve (AUC) is a common metric of predictive accuracy: the higher the area,
the more powerful the model. AUC was used because dominance in the ROC curve has
been shown to be equivalent to dominance in maximum profit and in the efficient
frontier of the business objectives of optimal profit and volume trade-offs (Beling et al.,
2005). The benchmark algorithm for the contest was a random forest ensemble, which
had very strong performance and an AUC of .864. Random forests were invented by
Leo Breiman and Adele Cutler; they use randomly selected variables and randomly
sampled data to build a majority voting scheme of weak learners (Breiman, 2001).
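To make the metric concrete, AUC can be computed directly from predicted scores and true outcomes via the rank-sum (Mann-Whitney) formulation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as half. The sketch below is illustrative only (the paper's analysis is in R; this function and its name are not from the paper):

```python
def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) formulation:
    the probability that a randomly chosen positive case is scored higher
    than a randomly chosen negative case, counting ties as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfectly separating model scores 1.0, and a model no better than chance scores 0.5, which is why the contest benchmark of .864 is strong.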
This is .006 below the best commercial credit scores, which have an AUC of .870 and
are built by teams of PhDs using hundreds of attributes and models. The success of the
simple random forest, which builds 500 simple decision tree models using random
samples and random variable selection and uses a majority vote of the resulting models
to predict the outcome, is remarkable and shows that proprietary predictive modeling
adds little value on top of simple algorithms (Breiman, 2001). Random forests have
been found to outperform standard logistic regression models by 5-7% and are well
suited for credit scoring out of the box, due to their ability to deal with the correlated
variables which confound the modeling process in regression methods (Sharma, 2009;
Sharma, 2011).

1 Kaggle represents the culmination of the internet revolution: it allows organizations to
publish their data and invite contestants to build statistical models. www.kaggle.com
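The random forest recipe just described (bootstrap samples, random variable selection, majority vote) can be sketched generically. This is an illustrative Python sketch, not Breiman's implementation or the paper's R code; `train_tree` stands in for any weak-learner fitting routine, and `majority_stump` is a deliberately trivial stand-in for a real decision tree:

```python
import random
from collections import Counter

def train_forest(rows, labels, train_tree, n_trees=500, n_features=3, seed=42):
    """Random-forest-style bagging: each weak learner is fit on a bootstrap
    sample of the rows and a random subset of the feature indices."""
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]         # bootstrap sample
        feats = rng.sample(range(d), min(n_features, d))   # random variable selection
        sample_rows = [rows[i] for i in idx]
        sample_labels = [labels[i] for i in idx]
        forest.append(train_tree(sample_rows, sample_labels, feats))
    return forest

def predict_forest(forest, row):
    """Majority vote of the weak learners."""
    return Counter(tree(row) for tree in forest).most_common(1)[0][0]

# A deliberately trivial weak learner for demonstration: ignore the features
# and always predict the majority label of its bootstrap sample.
def majority_stump(rows, labels, feats):
    maj = Counter(labels).most_common(1)[0][0]
    return lambda row: maj
```

In a real forest each `train_tree` grows an unpruned decision tree restricted to the sampled features, which is what gives the ensemble its robustness to correlated variables.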
The winning teams had an AUC .005 better than the benchmark random forest and used
a blended approach of multiple random forests, support vector machines, data cleaning,
and gradient boosting machines. Stochastic gradient boosting machines are similar to
random forests in that they too build weak learners, but they do so in a stage-wise
manner, fitting additive regression trees to the pseudo-residuals of the least squares fit
and using randomly selected subsamples to build the weak learners (Friedman, 1999).
Support vector machines, an algorithm invented by Vapnik, use hyperplanes in
high-dimensional space to maximize the margin between the classes (Cristianini et al.,
2000). Random forests, support vector machines, and gradient boosted machines
represent the state of the art in predictive modeling classification and were used heavily
by the winning teams. In the end, model performance was a function of algorithms,
hard work (proxied by the number of models explored), team size, some domain
knowledge, and predictive modeling proficiency:

Performance = Optimal Model + Effort + Predictive Modeling Skill + Domain
Knowledge + Teamwork
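The stage-wise pseudo-residual fitting behind gradient boosting can be sketched for squared-error loss, where the pseudo-residuals are simply the targets minus the current ensemble prediction. This is an illustrative Python sketch under that assumption, not Friedman's algorithm in full; `fit_mean` is a deliberately trivial weak learner standing in for the regression trees used in the contest:

```python
def boost(xs, ys, fit_weak, n_stages=50, lr=0.1):
    """Stage-wise least-squares boosting: each stage fits a weak learner to
    the pseudo-residuals (target minus current ensemble prediction) and adds
    a shrunken copy of the learner to the additive model."""
    preds = [0.0] * len(ys)
    stages = []
    for _ in range(n_stages):
        resid = [y - p for y, p in zip(ys, preds)]  # pseudo-residuals for L2 loss
        weak = fit_weak(xs, resid)
        stages.append(weak)
        preds = [p + lr * weak(x) for p, x in zip(preds, xs)]
    return lambda x: sum(lr * w(x) for w in stages)

# A deliberately trivial weak learner for demonstration: predict the mean residual.
def fit_mean(xs, resid):
    m = sum(resid) / len(resid)
    return lambda x: m
```

The shrinkage factor `lr` slows the fit so that many small corrections accumulate, which is the mechanism that distinguishes boosting's sequential stages from a random forest's independent votes. Friedman's stochastic variant additionally fits each stage on a random subsample.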
What is impressive is that, using predictive modeling skill, the winners were able to
extract all possible predictive value in the data and, within three months and using open
source tools, achieve performance in line with models built from much more data by
teams of PhDs. The competition was intense and provided information on predictive
modeling skill itself, on the factors at play in building the best models, and on the
characteristics of the best modelers. The survey results will be used to analyze patterns
in predictive modeling success.
The results supported the original hypothesis that credit scoring and credit risk
prediction are commodity processes which are not a source of sustainable competitive
advantage or value creation. The limits of credit scoring, studied under the phenomenon
of insensitivity of regression models known as the flat maximum effect, have long been
known and documented, but with ensemble algorithms we can be sure we are near the
true performance asymptote of the information (Overstreet et al., 1992).
Analysis
The education and occupation characteristics of contestants show representation of
contestants with PhDs, Master's degrees, some Bachelor's degrees, and even high school
students. In terms of occupation, computer science and predictive modeling
practitioners did the best.
In terms of experience, the top performers had high proficiency in predictive modeling
and less experience in the domain of credit scoring, risk, and the credit industry.
Interestingly, the higher ranking teams tried many different models, while the winning
teams had fewer submissions, indicating they avoided overfitting to the test data in their
modeling process.
The biggest predictor of success among top ranking groups was the use of multiple and
hybrid models: top ranking teams tended to use random forests, gradient boosting
machines, logistic regression, decision trees (which are the basis of both random forests
and gradient boosting machines), and hybrid solutions. Not everyone in the top two
quartiles used support vector machines, but all three winning groups used them along
with the other hybrid models.
Thus the optimal model was a function of the algorithm more than anything, with
blended ensembles over different samples performing best, along with experience in
predictive modeling, hard work (as evidenced by using many models and entries), and
some domain expertise.
Running a random forest on the factors most important to ranking among the best credit
scorers in the world gives the following: the algorithm or model used dictates
performance, followed by proficiency in credit scoring and predictive modeling and
years of experience doing data analysis. Education and years of experience in the credit
industry correlated least with performance. Ironically, the algorithm used most
frequently in the credit scoring industry, logistic regression, performed the worst. Team
size was important, as was the number of models tried as a proxy for hard work and
effort.
Plots of performance show that team size and predictive modeling experience resulted
in higher performance. Team size and number of submissions were related, as seen in
the principal components plot: having more team members resulted in more solutions
being explored and an improvement in performance. Credit scoring proficiency and
domain knowledge resulted in better performance, but only with a great deal of
experience (more than 10 years in the credit domain) and high proficiency in credit
scoring and predictive modeling.
[Word cloud of free-text survey responses. Source: http://www.abcya.com/word_clouds.htm]
[Regression summary output: residual quantiles and coefficients, on 968 and 967 degrees of freedom.]
#heckman 2 stage
#http://www.brynmawr.edu/socialwork/GSSW/Vartanian/Handouts/Heckman%20selection%20model.pdf
#predict prob of filling out survey
#score data and generate inverse Mills ratio and include in regression
p<-predict(m,data,method='prob')       # predicted probability of survey response
phi<-(1/sqrt(2*pi))*exp(-(p^2/2))      # standard normal density at p
capphi<-pnorm(p)                       # standard normal CDF at p
data$invmills<- phi/capphi             # inverse Mills ratio
The inverse Mills ratio was added to the final regression model to adjust for this bias.
The analysis of response bias shows that, as predicted by Adams' equity theory, humans
compute ratios of output to input which determine motivation; contestants who fared
poorly saw little output for their work, could see others performing better, and
self-selected out of the contest and survey (Adams, 1965).
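The correction above hinges on the inverse Mills ratio, the standard normal density over its cumulative distribution, phi(z)/Phi(z). A small illustrative Python translation of the R snippet (the function name is mine, not from the paper):

```python
import math

def inv_mills(z):
    """Inverse Mills ratio: standard normal pdf phi(z) over cdf Phi(z)."""
    phi = math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)   # density
    capphi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))       # cumulative
    return phi / capphi
```

The ratio is large for respondents predicted unlikely to answer and shrinks as the predicted probability of responding grows, which is what lets its inclusion absorb the selection effect in the second-stage regression.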
[Linear model summary of predictive modeling ranking adjusted for response bias: in the ANOVA, the invmills term had F = 0.0516 and p = 0.8214 on 1 degree of freedom; residuals on 40 degrees of freedom. Time taken: 0.06 secs.]
Conclusion
Building the best predictive models is primarily a function of algorithms. That said,
choosing an algorithm does not happen in a vacuum, and optimal predictive modeling is
a function of the algorithm, proficiency in modeling, multiple hybrid models, experience
in the domain, and hard work.
The best and brightest competed intensely to predict credit risk using variables
consumers can use to score themselves. The final model's AUC was .005 better than the
out-of-the-box random forest algorithm, supporting the claim that credit scoring is a
commodity function and that borrowers should manage their own risk and have
knowledge of their own risk profiles, leaving banks to compete on value creation such as
cost, convenience, and safe products.
Crowdsourcing also allows for more transparent model building, teamwork, and
improvement, as many teams improved and moved up in the rankings by working
together. That said, teamwork is not strictly necessary, as a few individuals did very
well on skill and talent alone. The final algorithm choice dictated most of the
performance and explained performance the best.
References
Adams, J.S. (1965) Inequity in social exchange. Advances in Experimental Social
Psychology 62: 335-343.
Beling, P., Covaliu, Z., and Oliver, R.M. (2005) Optimal scoring cutoff policies and
efficient frontiers. Journal of the Operational Research Society 56.
Breiman, L. (2001) Random forests. Machine Learning 45(1): 5-32.
Cristianini, N. and Shawe-Taylor, J. (2000) An Introduction to Support Vector Machines
and Other Kernel-Based Learning Methods. Cambridge University Press.
Friedman, J.H. (1999) Stochastic Gradient Boosting. Technical report. Stanford, CA:
Department of Statistics, Stanford University.
Overstreet, G.A., Bradley, E.L., and Kemp, R. (1992) The flat maximum effect and
generic linear scoring models: a test. IMA Journal of Mathematics Applied in Business
and Industry 4(1): 97-109.
Sharma, D. (2011) Improving the Art, Craft and Science of Economic Credit Risk
Scorecards Using Random Forests: Why Credit Scorers and Economists Should Use
Random Forests (June 9, 2011). Available at SSRN: http://ssrn.com/abstract=1861535
or http://dx.doi.org/10.2139/ssrn.1861535. Forthcoming, Academy of Banking Studies,
summer 2012.
Sharma, D. (2009) Not If Affordability Data Adds Value But How to Add Real Value by
Leveraging Affordability Data: Enhancing Predictive Capability of Credit Scoring Using
Affordability Data (August 15, 2009). Available at SSRN:
http://ssrn.com/abstract=1801346 or http://dx.doi.org/10.2139/ssrn.1801346; earlier
copy at http://www.casact.org/research/wp/papers/working-paper-sharma-2009-09.pdf.
Vartanian, T. (2009) UCLA Heckman Two Stage Selection Tutorial. From
http://gseis.ucla.edu/courses/ed231c/notes3/selection.html and
http://www.brynmawr.edu/socialwork/GSSW/Vartanian/Handouts/Heckman%20selection%20model.pdf.
http://cran.r-project.org
http://rattle.togaware.com/
www.creditfusion.org
http://www.kaggle.com/c/GiveMeSomeCredit/leaderboard
www.kaggle.com and http://www.kaggle.com/c/GiveMeSomeCredit
barplot(table(data$Education)*100/nrow(data), ylim=c(0,50),
main='Distribution of Contestants by Education',ylab='%')
plot(rank~Education,data=data, main='Rank (Low is good) by Education')
table(data$Education)*100/nrow(data)
#occupation
barplot(table(data$Occupation)*100/nrow(data), ylim=c(0,50),
main='Distribution of Contestants by Occupation',ylab='%')
plot(rank~Occupation,data=data, main='Rank (Low is good) by Occupation')
table(data$Occupation)*100/nrow(data)
#experience YearsExperienceDataAnalysis YearsExperienceCredit
hist(data$YearsExperienceDataAnalysis, freq=FALSE , main='Histogram of
Experience in Data Analysis' )
plot(rank~YearsExperienceDataAnalysis,data=data, main='Rank (Low is
good) by Data Analysis Experience')
plot(rank~cut(YearsExperienceDataAnalysis,10),data=data, main='Rank
(Low is good) by Data Analysis Experience Deciles')
#experience in domain Credit
hist(data$YearsExperienceCredit, freq=FALSE , main='Histogram of
Experience in Credit Domain' )
plot(rank~YearsExperienceCredit,data=data, main='Rank (Low is good) by
Experience in Credit Domain')
plot(rank~cut(YearsExperienceCredit,10),data=data, main='Rank (Low is
good) by Credit Domain Experience Deciles')
#expertise in ProficiencyCreditScoring ProficiencyRisk
ProficiencyPredictiveModel
plot(data$ProficiencyCreditScoring ~data$YearsExperienceCredit)
boxplot(rank~ProficiencyCreditScoring ,data=data, main='Rank (Low is
good) by Credit Scoring Proficiency (5 is highest,1=Lowest) ')
hist(data$ProficiencyCreditScoring , freq=FALSE , main='Histogram of
Proficiency in Credit Scoring' );
#weight proficiency by actual experience in the credit domain, as one could
#inflate self-reported proficiency with no experience
plot(rank~cut(ProficiencyCreditScoring*YearsExperienceCredit ,
10),data=data, main='Rank (Low is good) by Credit Scoring
Proficiency*Experience in Domain Deciles')
plot(rank~ProficiencyCreditScoring,data=subset(data,YearsExperienceCredit>=3),
main='Rank (Low is good) by Credit Scoring Proficiency')
dev.new();
par(mfrow=c(3,3));
#risk proficiency
hist(data$ProficiencyRisk , freq=FALSE , main='Histogram of Proficiency
in Risk Management' )
boxplot(rank~ProficiencyRisk ,data=data, main='Rank (Low is good) by
Risk Management Proficiency (5 is highest,1=Lowest) ')
# predictive modeling
hist(data$ProficiencyPredictiveModel, freq=FALSE , main='Histogram of
Proficiency in Predictive Modeling ' )
boxplot(rank~ProficiencyPredictiveModel,data=data, main='Rank (Low is
good) by Predictive Modeling Proficiency (5 is highest,1=Lowest) ')
#TeamSize
hist(data$TeamSize, freq=FALSE , main='Histogram of TeamSize' )
plot(rank~TeamSize,data=data, main='Rank (Low is good) by TeamSize')
plot(rank~cut(TeamSize,5),data=data, main='Rank (Low is good) by
TeamSize Quantiles')
# Team interdisciplinarity as a function of occupations and education:
# only the winning team had 2 different occupations and an
# interdisciplinary education
# NumberMOdelsTried
# of models
data$NumberMOdelsTried<-as.numeric(data$NumberMOdelsTried)
names(data)[11]<-"NumberModelsTried"
data$top10<- as.factor(ifelse(data$rank<=100,"Y","N"))
data$Multi<-as.numeric(data$Tree)+as.numeric(data$NN)+
as.numeric(data$RF)+as.numeric(data$SVM)+as.numeric(data$GBM)+
as.numeric(data$Logisitic)+as.numeric(data$DA)+as.numeric(data$Bayes)
data$Multi<-data$Multi-16
data$Hybrid <- as.factor(ifelse(data$Multi>0,"Y","N"))
library(randomForest)
data<-na.roughfix(data)
#990 teams, 79 survey respondents; 61% of all players scored worse than the
#benchmark, but in the survey sample only 41% did worse than the benchmark,
#so the sample is skewed towards higher-ranking players
plot(density(na.roughfix(data$rank)));
data<-subset(data,select=-OccupationDetail)
data<-subset(data,select=-team)
par(mfrow=c(5,3));
for (i in c(1:length(data)))
{
if (class(data[,i])=='factor' & !names(data)[i]%in%c('BelowBench','top10'))
{
plot(data$rank~data[,i],main=names(data)[i]);
}
}
par(mfrow=c(3,3));
for (i in c(2,3,5,6,7,8,9,10,11,14,15,16:22))
{
if (is.numeric(data[,i])==TRUE)
{
hist(data[,i],main=names(data)[i]);
}
}
#heckman 2 stage
#predict prob of filling out survey
#score data and generate inverse Mills ratio and include in regression
#generate phi = (1/sqrt(2*pi))*exp(-(p^2/2)) /*standard normal density*/
#generate capphi = pnorm(p)
#generate invmills = phi/capphi
p<-predict(m,data,method='prob')
phi<-(1/sqrt(2*pi))*exp(-(p^2/2))
capphi<-pnorm(p)
data$invmills<- phi/capphi
# 'building' is used to toggle between generating transformations, as when
# building a model, and simply using the transformations, as when scoring
# a dataset.
building <- TRUE
scoring <- ! building
# The colorspace package is used to generate the colours used in plots, if available.
library(colorspace)
# A pre-defined value is used to reset the random seed so that results are repeatable.
crv$seed <- 42
#============================================================
# Rattle timestamp: 2012-03-10 11:55:41 i386
# Load an R data frame.
crs$dataset <- data
# Display a simple summary (structure) of the dataset.
str(crs$dataset)
#============================================================
# Rattle timestamp: 2012-03-10 11:55:42 i386
# Note the user selections.
# Build the training/validate/test datasets.
set.seed(crv$seed)
crs$nobs <- nrow(crs$dataset) # 77 observations
crs$sample <- crs$train <- sample(nrow(crs$dataset), 0.69*crs$nobs) # 53 observations
crs$validate <- sample(setdiff(seq_len(nrow(crs$dataset)), crs$train), 0.14*crs$nobs) # 11 observations
crs$test <- setdiff(setdiff(seq_len(nrow(crs$dataset)), crs$train), crs$validate) # 13 observations
# The following variable selections have been noted.
crs$input <- c("team", "Education", "OccupationDetail", "Occupation",
"YearsExperienceDataAnalysis", "YearsExperienceCredit",
"ProficiencyCreditScoring", "ProficiencyRisk",
"ProficiencyPredictiveModel", "NumberModelsTried", "FinalScore",
"BelowBench",
"NbrSubmissions", "Tree", "NN", "RF",
"SVM", "GBM", "Logisitic", "DA",
"Bayes", "rank", "top10", "Multi",
"Hybrid", "invmills")
crs$numeric <- c("YearsExperienceDataAnalysis",
"YearsExperienceCredit", "ProficiencyCreditScoring", "ProficiencyRisk",
"ProficiencyPredictiveModel", "NumberModelsTried", "FinalScore",
"NbrSubmissions",
"rank", "Multi", "invmills")
crs$categoric <- c("team", "Education", "OccupationDetail",
"Occupation",
"BelowBench", "Tree", "NN", "RF",
"SVM", "GBM", "Logisitic", "DA",
"Bayes", "top10", "Hybrid")
crs$target  <- "TeamSize"
crs$risk    <- NULL
crs$ident   <- NULL
crs$ignore  <- NULL
crs$weights <- NULL
#============================================================
# Rattle timestamp: 2012-03-10 11:56:23 i386
# Note the user selections.
# Build the training/validate/test datasets.
set.seed(crv$seed)
crs$nobs <- nrow(crs$dataset) # 77 observations
crs$sample <- crs$train <- sample(nrow(crs$dataset), 0.69*crs$nobs) # 53 observations
crs$validate <- sample(setdiff(seq_len(nrow(crs$dataset)), crs$train), 0.14*crs$nobs) # 11 observations
crs$test <- setdiff(setdiff(seq_len(nrow(crs$dataset)), crs$train), crs$validate) # 13 observations
#============================================================
# Rattle timestamp: 2012-03-10 11:56:31 i386
# Note the user selections.
# Build the training/validate/test datasets.
set.seed(crv$seed)
crs$nobs <- nrow(crs$dataset) # 77 observations
crs$sample <- crs$train <- sample(nrow(crs$dataset), 0.69*crs$nobs) # 53 observations
crs$validate <- sample(setdiff(seq_len(nrow(crs$dataset)), crs$train), 0.14*crs$nobs) # 11 observations
crs$test <- setdiff(setdiff(seq_len(nrow(crs$dataset)), crs$train), crs$validate) # 13 observations
# The following variable selections have been noted.
crs$input <- c("TeamSize", "Education", "Occupation",
"YearsExperienceDataAnalysis",
"YearsExperienceCredit", "ProficiencyCreditScoring",
"ProficiencyPredictiveModel", "NumberModelsTried",
"NbrSubmissions", "invmills")
crs$target  <- "rank"
crs$risk    <- NULL
crs$ident   <- c("ProficiencyRisk", "FinalScore")
crs$ignore  <- c("team", "OccupationDetail", "BelowBench", "Tree",
"SVM", "GBM", "Logisitic", "DA", "Bayes", "top10", "Multi")
crs$weights <- NULL
#============================================================
# Rattle timestamp: 2012-03-10 11:56:37 i386
# Regression model
# Build a Regression model.
crs$glm <- lm(rank ~ ., data=crs$dataset[crs$train,c(crs$input,
crs$target)])
# Generate a textual view of the Linear model.
print(summary(crs$glm))
cat('==== ANOVA ====
')
print(anova(crs$glm))
print("
")
# Time taken: 0.05 secs
#============================================================
# Rattle timestamp: 2012-03-10 11:57:24 i386
# Note the user selections.
# Build the training/validate/test datasets.
set.seed(crv$seed)
crs$nobs <- nrow(crs$dataset) # 77 observations
crs$sample <- crs$train <- sample(nrow(crs$dataset), 0.69*crs$nobs) # 53 observations
crs$validate <- sample(setdiff(seq_len(nrow(crs$dataset)), crs$train), 0.14*crs$nobs) # 11 observations
crs$test <- setdiff(setdiff(seq_len(nrow(crs$dataset)), crs$train), crs$validate) # 13 observations
# The following variable selections have been noted.
crs$input <- c("TeamSize", "Education", "Occupation",
"YearsExperienceDataAnalysis",
"YearsExperienceCredit", "ProficiencyCreditScoring",
"ProficiencyPredictiveModel", "NumberModelsTried",
"NbrSubmissions", "RF", "SVM", "GBM",
"Multi", "invmills")
crs$numeric <- c("TeamSize", "YearsExperienceDataAnalysis",
"YearsExperienceCredit", "ProficiencyCreditScoring",
"ProficiencyPredictiveModel", "NumberModelsTried",
"NbrSubmissions", "Multi",
"invmills")
crs$categoric <- c("Education", "Occupation", "RF", "SVM",
"GBM")
crs$target <- "rank"
crs$risk <- NULL
crs$ident <- c("ProficiencyRisk", "FinalScore")
crs$ignore <- c("team", "OccupationDetail", "BelowBench", "Tree",
"NN", "Logisitic", "DA", "Bayes", "top10", "Hybrid")
crs$weights <- NULL
#============================================================
# Rattle timestamp: 2012-03-10 11:57:33 i386
# Regression model
# Build a Regression model.
crs$glm <- lm(rank ~ ., data=crs$dataset[crs$train,c(crs$input,
crs$target)])
# Generate a textual view of the Linear model.
print(summary(crs$glm))
cat('==== ANOVA ====
')
print(anova(crs$glm))
print("
")
# Time taken: 0.05 secs
#============================================================
# Rattle timestamp: 2012-03-10 11:57:42 i386
# Note the user selections.
# The following variable selections have been noted.
crs$input <- c("TeamSize", "Education", "Occupation",
"YearsExperienceDataAnalysis",
"YearsExperienceCredit", "ProficiencyCreditScoring",
"ProficiencyPredictiveModel", "NumberModelsTried",
"NbrSubmissions", "RF", "SVM", "GBM",
"Multi", "invmills")
crs$numeric <- c("TeamSize", "YearsExperienceDataAnalysis",
"YearsExperienceCredit", "ProficiencyCreditScoring",
"ProficiencyPredictiveModel", "NumberModelsTried",
"NbrSubmissions", "Multi",
"invmills")
crs$categoric <- c("Education", "Occupation", "RF", "SVM",
"GBM")
crs$target <- "rank"
crs$risk <- NULL
crs$ident <- c("ProficiencyRisk", "FinalScore")
crs$ignore <- c("team", "OccupationDetail", "BelowBench", "Tree",
"NN", "Logisitic", "DA", "Bayes", "top10", "Hybrid")
crs$weights <- NULL
#============================================================
# Rattle timestamp: 2012-03-10 11:57:47 i386
# Regression model
# Build a Regression model.
crs$glm <- lm(rank ~ ., data=crs$dataset[,c(crs$input, crs$target)])
# Generate a textual view of the Linear model.
print(summary(crs$glm))
cat('==== ANOVA ====
')
print(anova(crs$glm))
print("
")
# Time taken: 0.09 secs
#============================================================
# Rattle timestamp: 2012-03-10 11:58:44 i386
# Note the user selections.
# The following variable selections have been noted.
crs$input <- c("TeamSize", "YearsExperienceDataAnalysis",
"YearsExperienceCredit", "ProficiencyCreditScoring",
"ProficiencyPredictiveModel", "NumberModelsTried",
"NbrSubmissions", "RF",
"SVM", "GBM", "Multi", "invmills")
crs$numeric <- c("TeamSize", "YearsExperienceDataAnalysis",
"YearsExperienceCredit", "ProficiencyCreditScoring",
"ProficiencyPredictiveModel", "NumberModelsTried",
"NbrSubmissions", "Multi",
"invmills")
crs$target <- "rank"
crs$risk   <- NULL
crs$ident  <- c("ProficiencyRisk", "FinalScore")
crs$ignore <- c("team", "OccupationDetail", "BelowBench", "Tree",
"NN", "RF", "SVM", "GBM", "Logisitic", "DA", "Bayes", "top10", "Multi",
"Hybrid")
crs$weights <- NULL
#============================================================
# Rattle timestamp: 2012-03-10 11:59:44 i386
# Regression model
# Build a Regression model.
crs$glm <- lm(rank ~ ., data=crs$dataset[,c(crs$input, crs$target)])
# Generate a textual view of the Linear model.
print(summary(crs$glm))
cat('==== ANOVA ====
')
print(anova(crs$glm))
print("
")
# Time taken: 0.03 secs
#============================================================
# Rattle timestamp: 2012-03-10 12:01:49 i386
# Note the user selections.
#============================================================
# Rattle timestamp: 2012-03-10 12:01:54 i386
# Note the user selections.
# The following variable selections have been noted.
crs$input <- c("TeamSize", "Education", "Occupation",
"YearsExperienceDataAnalysis",
"YearsExperienceCredit", "ProficiencyCreditScoring",
"ProficiencyPredictiveModel", "NumberModelsTried",
"NbrSubmissions", "Multi", "invmills")
crs$numeric <- c("TeamSize", "YearsExperienceDataAnalysis",
"YearsExperienceCredit", "ProficiencyCreditScoring",
"ProficiencyPredictiveModel", "NumberModelsTried",
"NbrSubmissions", "Multi",
"invmills")
crs$categoric <- c("Education", "Occupation")
crs$target <- "rank"
crs$risk   <- NULL
crs$ident  <- c("ProficiencyRisk", "FinalScore")
crs$ignore <- c("team", "OccupationDetail", "BelowBench", "Tree",
"NN", "RF", "SVM", "GBM", "Logisitic", "DA", "Bayes", "top10", "Hybrid")
crs$weights <- NULL
#============================================================
# Rattle timestamp: 2012-03-10 12:02:00 i386
# Regression model
# Build a Regression model.
crs$glm <- lm(rank ~ ., data=crs$dataset[,c(crs$input, crs$target)])
# Generate a textual view of the Linear model.
print(summary(crs$glm))
cat('==== ANOVA ====
')
print(anova(crs$glm))
print("
")
# Time taken: 0.03 secs
#============================================================
# Rattle timestamp: 2012-03-10 12:04:06 i386
# Note the user selections.
# The following variable selections have been noted.
crs$input <- c("TeamSize", "Education", "Occupation",
"YearsExperienceDataAnalysis",
"YearsExperienceCredit", "ProficiencyCreditScoring",
"ProficiencyPredictiveModel", "NumberModelsTried",
"NbrSubmissions", "Tree", "NN", "RF",
"SVM", "GBM", "Bayes", "Multi",
"invmills")
crs$numeric <- c("TeamSize", "YearsExperienceDataAnalysis",
"YearsExperienceCredit", "ProficiencyCreditScoring",
"ProficiencyPredictiveModel", "NumberModelsTried",
"NbrSubmissions", "Multi",
"invmills")
crs$categoric <- c("Education", "Occupation", "Tree", "NN",
"RF", "SVM", "GBM", "Bayes")
crs$target  <- "rank"
crs$risk    <- NULL
crs$ident   <- c("ProficiencyRisk", "FinalScore")
crs$ignore  <- c("team", "OccupationDetail", "BelowBench", "Logisitic",
"DA", "top10", "Hybrid")
crs$weights <- NULL
#============================================================
# Rattle timestamp: 2012-03-10 12:04:11 i386
# Regression model
# Build a Regression model.
crs$glm <- lm(rank ~ ., data=crs$dataset[,c(crs$input, crs$target)])
# Generate a textual view of the Linear model.
print(summary(crs$glm))
cat('==== ANOVA ====
')
print(anova(crs$glm))
print("
")
# Time taken: 0.06 secs
#============================================================
# Rattle timestamp: 2012-03-10 12:21:13 i386
# Generate a correlation plot for the variables.
# The 'ellipse' package provides the 'plotcorr' function.
require(ellipse, quietly=TRUE)
# Correlations work for numeric variables only.
crs$cor <- cor(crs$dataset[, crs$numeric], use="pairwise",
method="pearson")
# Order the correlations by their strength.
crs$ord <- order(crs$cor[1,])
crs$cor <- crs$cor[crs$ord, crs$ord]
# Display the actual correlations.
print(crs$cor)
# Graphically display the correlations.
plotcorr(crs$cor, col=colorRampPalette(c("red", "white", "blue"))(11)
[5*crs$cor + 6])
title(main="Correlation data using Pearson",
sub=paste("Rattle", format(Sys.time(), "%Y-%b-%d %H:%M:%S"),
Sys.info()["user"]))
Appendix of Online Blog Postings of Winners
How does it feel to have done so well in a contest with almost 1000 teams?
EJ: Pretty amazing, especially when it was such an intense competition with so many
good competitors. Personally, I felt a strong sense of achievement together as a team.
AS: It feels great, particularly because we won by such a well-defined margin. The gap
between first and second place was the largest gap in the top 500 placings.
What were your backgrounds prior to entering this challenge?
EJ: My background is in statistics and econometric modelling. More recently I've worked
in data mining and machine learning for Deloitte Analytics Australia, where I am a
Senior Analyst.
AS: My formal background is in mathematics and statistics. I am a largely self-taught
programmer, and have written a number of R packages. I do not work in data mining, but
have picked up an interest in it over the last year or so, mainly due to Kaggle! I am an
academic, originally from London, and have studied or worked at universities in England,
Singapore, China and Australia.
What preprocessing and supervised learning methods did you use?
AS: We tried many different supervised learning methods, but we decided to keep our
ensemble to only those things that we knew would improve our score through
cross-validation evaluations. In the end we only used five supervised learning
methods: a random forest of classification trees, a random forest of regression
trees, a classification tree boosting algorithm, a regression tree boosting
algorithm, and a neural network.
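The cross-validation-gated averaging described above can be sketched in a few lines of base R. This is an illustrative reconstruction, not the winners' code: the synthetic data, the two stand-in `glm` models and the rank-sum AUC helper are all assumptions made for the example.

```r
# Minimal sketch: blend predictions from several models by simple averaging,
# keeping a model in the ensemble only if it improves hold-out performance.
set.seed(1)
n  <- 2000
d  <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- rbinom(n, 1, plogis(1.5 * d$x1 - d$x2))   # synthetic default flag
train <- 1:1500; hold <- 1501:n

# Two simple stand-in models (the winners used random forests, boosting, etc.)
m1 <- glm(y ~ x1,      family = binomial, data = d[train, ])
m2 <- glm(y ~ x1 + x2, family = binomial, data = d[train, ])
p1 <- predict(m1, newdata = d[hold, ], type = "response")
p2 <- predict(m2, newdata = d[hold, ], type = "response")
blend <- (p1 + p2) / 2                           # equal-weight ensemble

# AUC via the rank-sum (Mann-Whitney) identity, so no extra packages needed
auc <- function(p, y) {
  r  <- rank(p)
  n1 <- sum(y == 1); n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
auc(blend, d$y[hold])  # keep the blend only if this beats each component
```

In practice the decision to keep a model would be made on cross-validated folds of the training data rather than a single hold-out split.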
This competition had a fairly simple data set and relatively few features - did
that affect how you went about things?
EJ: It meant that the barrier to entry was low, competition would be very intense and
everyone would eventually arrive at similar results and methods. Before we formed a
team, I knew that I would have to work extra hard and be really innovative in my
approach to solving this problem. Collaboration was the last ace and as the competition
started to hit the ceiling, I decided to play that card.
What was your most important insight into the data?
EJ: I discovered two key features: the first was the total number of late days,
and the second the difference between income and expense. They turned out to be
very predictive!
Were you surprised by any of your insights?
AS: I was surprised at how well neural networks performed. They certainly gave a good
improvement over and above more modern approaches based on bagging and boosting. I
have tried neural networks in other competitions where they did not perform as well.
How did working in a team help you?
TOGETHER: As individuals, we were unlikely to win. But with Nathaniel's expertise in
credit scoring, Alec's expertise in algorithms and Eu Jin's knowledge in data mining, we
had something completely different to offer that was really powerful. In a literal sense,
we stormed our way up to the top.
Which tools did you use?
TOGETHER: SQL, SAS, R, Viscovery and even Excel!
What have you taken away from this competition?
AS: That data mining is fun when you are in a team, and also how effective a team can be
if the skills of its members complement each other. You can learn a lot from the people
that you work with.
Source: http://blog.kaggle.com/2011/12/21/score-xavier-conort-on-coming-second-in-give-me-some-credit/
modeling and risk management using R. Previously, I worked in France, Brazil, China
and Singapore, holding different roles (actuary, CFO, risk manager) in the Life
and Non-Life Insurance industry.
What preprocessing and supervised learning methods did you use?
I didn't spend much time on preprocessing. My most important input was to create a
variable which estimates the likelihood of being late by more than 90 days.
I used a mix of 15 models including Gradient Boosting Machine (GBM), weighted
GBMs, Random Forest, balanced Random Forest, GAM, weighted GAM, Support Vector
Machine (SVM) and bagged ensemble of SVMs. My best score, however, was an
ensemble without SVMs.
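The weighted blending mentioned above can be sketched as a small validation-set search. This is a hypothetical illustration, not Conort's actual procedure: the variable names (`p_gbm`, `p_rf`), the random stand-in predictions and the log-loss criterion are assumptions for the example.

```r
# Sketch: choose the blend weight between two models' validation-set
# probabilities by grid search, instead of fixing it at 0.5.
set.seed(3)
p_gbm <- runif(200)                         # stand-in validation predictions
p_rf  <- runif(200)
y     <- rbinom(200, 1, (p_gbm + p_rf) / 2) # synthetic outcomes

logloss <- function(p, y) -mean(y * log(p) + (1 - y) * log(1 - p))

ws     <- seq(0, 1, by = 0.05)              # candidate weights on p_gbm
ll     <- sapply(ws, function(w) logloss(w * p_gbm + (1 - w) * p_rf, y))
best_w <- ws[which.min(ll)]                 # weight minimising validation loss
```

With 15 models the same idea generalises to a weight vector, typically found by optimisation under a sum-to-one constraint rather than a grid.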
This competition had a fairly simple data set and relatively few features - did
that affect how you went about things?
The data was simple yet messy. I found off-the-shelf techniques such as GBM could
handle it. The relative simplicity of the data allowed me to allocate more time to trying
different models and ensembling my individual models.
What was your most important insight into the data?
The likelihood of being late was by far the most important predictor in my GBM,
and its inclusion as a predictor improved the accuracy of my individual fits.
Were you surprised by any of your insights?
I've always believed that people can benefit from diversity, but I was surprised to see
how much data science can also benefit from it (through ensembling techniques). The
strong performance achieved by Alec, Eu Jin and Nathaniel (Perfect Storm) also shows
that teamwork matters.
My best individual fit was a weighted GBM which scored 0.86877 in the private set.
Without ensembling weaker models, my rank would have been 25.
Which tools did you use?
R, the excellent book The Elements of Statistical Learning and the great online Stanford
course by Andrew Ng. All are free.
What have you taken away from this competition?
When I entered the competition, I was still unfamiliar with Machine Learning techniques
as they are rarely used in the insurance industry. I was amazed by the capacity of
Gradient Boosting Machine (also called Boosted Regression Trees) to learn non-linear
functions (including interactions) and accommodate missing values and outliers. It is
definitely something that I will include in my toolbox in the future.
Source: http://blog.kaggle.com/2011/12/20/credit-where-credits-due-joe-malicki-on-placing-third-in-give-me-some-credit-2/
How does it feel to have done so well in a competition with almost 1000 teams?
Great! This was my first serious attempt at Kaggle - I've been doing data modeling for a
while, and wanted to try cutting my teeth at a real competition for the first time.
What was your background prior to entering this challenge?
I have a computer science (CS) degree and have done some graduate work in CS,
statistics, and applied math. I currently work for Nokia doing machine learning for local
search ranking.
What preprocessing and supervised learning methods did you use?
I tried quite a bit of preprocessing, mentioned below, and in the code I posted to the
forums.
I used random forests, gradient boosted decision trees, logistic regression, and SVMs,
with various subsampling of the data for different positive/negative class
balances. In the end, averaging 50/50 balanced boosted trees with 10/90
balanced random forests won.
This competition had a fairly simple data set and relatively few features - did
that affect how you went about things?
Absolutely! One of the big problems in this competition was a large and imbalanced
dataset - defaults are rare. I used stratified sampling for classes to produce my training
set.
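The class-balanced subsampling idea (50/50 for the boosted trees, 10/90 for the random forests) can be sketched as follows. The helper name `balanced_sample` and the 7% default rate are assumptions for illustration, not the competition's actual figures or Malicki's code.

```r
# Sketch: draw a training subset with a fixed positive/negative mix.
balanced_sample <- function(y, pos_frac, n) {
  pos   <- which(y == 1)
  neg   <- which(y == 0)
  n_pos <- round(n * pos_frac)
  c(sample(pos, n_pos,     replace = TRUE),   # oversample rare defaults
    sample(neg, n - n_pos, replace = TRUE))
}

set.seed(42)
y   <- rbinom(10000, 1, 0.07)        # defaults are rare (~7% here)
idx <- balanced_sample(y, 0.5, 2000) # 50/50 subset, as for the boosted trees
# balanced_sample(y, 0.1, 2000) would give the 10/90 mix used for the forests
mean(y[idx])                         # exactly 0.5 by construction
```

Models fit on rebalanced data produce biased probability estimates, which matters less here since the competition metric (AUC) depends only on ranking.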
Transformations of the data were key. To expand the number of features, I tried to use
my prior knowledge of credit scoring and personal finance to expand the feature set.
For instance, knowing that people above age 60 are likely to qualify for social security,
which makes their income more stable, and that 33% (for the mortgage) and 43% (for the
mortgage plus other debt) are often magic debt to income (DTI) numbers was very
useful. I feel strongly that knowing what the data elements actually represent
in the real world, rather than treating them as mere numbers, is huge for
modelling.
Random forests are a great learning algorithm, but they deal poorly with cases
where transforms of features are very important. So I identified some things
that looked important: combining income and DTI to estimate debt, and combining
number of dependents and income as a proxy for disposable income/constraints.
The latter is important since struggling to barely support dependents makes
default more likely.
Also, trying to notice "interesting" incomes divisible by 1000 was useful - I was guessing
that fraud might be more likely for these, and/or they may signal new jobs where a person
hasn't had percent raises. I was going to try a Benford's law-inspired income feature, to
help detect fraud, but had to leave the competition before I got a chance.
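The derived features described in the last few paragraphs might look something like the sketch below. The column names (`age`, `income`, `dti`, `dependents`) and the two example rows are hypothetical, not the competition's exact schema.

```r
# Illustrative reconstruction of the hand-crafted features described above.
d <- data.frame(age        = c(25, 64),
                income     = c(5000, 3100),
                dti        = c(0.25, 0.50),
                dependents = c(2, 0))

d$social_security <- as.integer(d$age > 60)         # likely stable income
d$est_debt        <- d$income * d$dti               # income x DTI ~ debt level
d$income_per_head <- d$income / (d$dependents + 1)  # disposable-income proxy
d$high_dti_33     <- as.integer(d$dti > 0.33)       # mortgage DTI threshold
d$high_dti_43     <- as.integer(d$dti > 0.43)       # mortgage-plus-debt threshold
d$round_income    <- as.integer(d$income %% 1000 == 0)  # possible fraud signal
```

Each derived column encodes a piece of domain knowledge directly, so that tree-based learners do not have to rediscover the transform from splits on the raw inputs.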
What was your most important insight into the data?
Forum discussion pointed out that values in the 90s for the number-of-times-
past-due fields seemed like special codes, and should be treated differently.
Also, noticing that DTI for those with missing income data seemed weird was
critical.
Were you surprised by any of your insights?
I had tried keeping a small hold-out data set, and then combining all of the models I
trained via a logistic regression on that data set, using the model probabilities as features,
but found that only taking the best models of various classes worked better.
Which tools did you use?
R. It's a memory hog, but it has first-class algorithm implementations.
What have you taken away from this competition?
The results I saw on the public test set differed dramatically from any results I could get
on any held-out portion of the training set. In the end, the results on the private testing
data set for all of the models I submitted were extremely close to my private evaluations,
far closer than to the performance on the public set. This highlighted to me the
importance of relying on cross-validation on large samples.