
3/6/2016

A Complete Tutorial to Learn Data Science with Python from Scratch

Introduction
It happened a few years back. After working on SAS for more than 5 years, I decided to move out of
my comfort zone. Being a data scientist, my hunt for other useful tools was ON! Fortunately, it didn't
take me long to decide: Python was my appetizer.
I always had an inclination towards coding. This was the time to do what I really loved. Code. It turned
out that coding was quite easy!
I learned the basics of Python within a week. Since then, I've not only explored this language in
depth, but have also helped many others to learn it. Python was originally a general
purpose language. But, over the years, with strong community support, it has gained dedicated
libraries for data analysis and predictive modeling.
Due to the lack of resources on Python for data science, I decided to create this tutorial to help many
others learn Python faster. In this tutorial, we will take bite-sized pieces of information about how to use
Python for data analysis, chew on them till we are comfortable, and practice them at our own end.

Table of Contents

http://www.analyticsvidhya.com/blog/2016/01/completetutoriallearndatasciencepythonscratch2/


1. Basics of Python for Data Analysis
   Why learn Python for data analysis?
   Python 2.7 v/s 3.4
   How to install Python?
   Running a few simple programs in Python
2. Python libraries and data structures
   Python Data Structures
   Python Iteration and Conditional Constructs
   Python Libraries
3. Exploratory analysis in Python using Pandas
   Introduction to series and dataframes
   Analytics Vidhya dataset: Loan Prediction Problem
4. Data Munging in Python using Pandas
5. Building a Predictive Model in Python
   Logistic Regression
   Decision Tree
   Random Forest

Let's get started!

1. Basics of Python for Data Analysis
Why learn Python for data analysis?
Python has gathered a lot of interest recently as a choice of language for data analysis. I
had compared it against SAS & R some time back. Here are some reasons which go in favour of
learning Python:


Open source: free to install
Awesome online community
Very easy to learn
Can become a common language for data science and production of web-based analytics products.

Needless to say, it still has a few drawbacks too:
It is an interpreted language rather than a compiled language, hence it might take up more CPU time.
However, given the savings in programmer time (due to ease of learning), it might still be a good choice.

Python 2.7 v/s 3.4
This is one of the most debated topics in Python. You will invariably cross paths with it, especially if
you are a beginner. There is no right/wrong choice here. It totally depends on the situation and your
needs. I will try to give you some pointers to help you make an informed choice.

Why Python 2.7?
1. Awesome community support! This is something you'd need in your early days. Python 2 was released
in late 2000 and has been in use for more than 15 years.
2. Plethora of third-party libraries! Though many libraries have provided 3.x support, a large number
of modules still work only on 2.x versions. If you plan to use Python for specific applications like web
development with high reliance on external modules, you might be better off with 2.7.
3. Some features of the 3.x versions are backward compatible and can work with version 2.7.

Why Python 3.4?
1. Cleaner and faster! Python developers have fixed some inherent glitches and minor drawbacks in order
to set a stronger foundation for the future. These might not be very relevant initially, but will matter
eventually.
2. It is the future! 2.7 is the last release for the 2.x family and eventually everyone has to shift to 3.x
versions. Python 3 has released stable versions for the past 5 years and will continue to do so.

There is no clear winner, but I suppose the bottom line is that you should focus on learning Python as
a language. Shifting between versions should just be a matter of time. Stay tuned for a dedicated
article on Python 2.x vs 3.x in the near future!


How to install Python?
There are 2 approaches to install Python:
You can download Python directly from its project site and install the individual components and libraries you
want.
Alternately, you can download and install a package which comes with pre-installed libraries. I would
recommend downloading Anaconda. Another option could be Enthought Canopy Express.

The second method provides a hassle-free installation and hence I'll recommend that to
beginners. The limitation of this approach is that you have to wait for the entire package to be upgraded,
even if you are interested in the latest version of a single library. It should not matter
unless you are doing cutting-edge statistical research.

Choosing a development environment
Once you have installed Python, there are various options for choosing an environment. Here are the
3 most common options:
Terminal / Shell based
IDLE (default environment)
iPython notebook, similar to markdown in R


[Screenshot: IDLE editor for Python]
While the right environment depends on your need, I personally prefer iPython Notebooks a lot. It
provides a lot of good features for documenting while writing the code itself, and you can choose to
run the code in blocks (rather than line-by-line execution).
We will use the iPython environment for this complete tutorial.

Warming up: Running your first Python program
You can use Python as a simple calculator to start with:
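As a rough sketch of what such a calculator session might look like, here are a few arithmetic expressions. The variable names and numbers are my own illustration. The division uses a float operand so that the result is the same on Python 2.7 and 3.x.

```python
# Simple arithmetic, as you might type it into a notebook cell.
# All names and values here are illustrative.
total = 2 + 3          # addition
product = 4 * 5        # multiplication
power = 2 ** 10        # exponentiation
quotient = 7 / 2.0     # float operand gives 3.5 on both Python 2 and 3

print(total)
print(product)
print(power)
print(quotient)
```

In a notebook you can also type an expression on its own (for example 2 + 3) and the result appears as the cell output.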


A few things to note
You can start the iPython notebook by writing ipython notebook on your terminal / cmd, depending on the
OS you are working on.
You can name an iPython notebook by simply clicking on the name Untitled0 in the above screenshot.
The interface shows In [*] for inputs and Out [*] for output.
You can execute code by pressing Shift + Enter, or ALT + Enter if you want to insert an additional
row after.

Before we deep dive into problem solving, let's take a step back and understand the basics of
Python. As we know, data structures and iteration and conditional constructs form the crux of any
language. In Python, these include lists, strings, tuples, dictionaries, for-loop, while-loop, if-else, etc.
Let's take a look at some of these.

2. Python libraries and Data Structures
Python Data Structures
Following are some data structures which are used in Python. You should be familiar with them in
order to use them as appropriate.


Lists: Lists are one of the most versatile data structures in Python. A list can simply be defined by
writing a list of comma-separated values in square brackets. Lists might contain items of different types,
but usually the items all have the same type. Python lists are mutable and individual elements of a list
can be changed.

Here is a quick example to define a list and then access it:
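A minimal sketch of defining, accessing and changing a list could look like this. The list name and its values are made up for illustration.

```python
# A list of comma-separated values in square brackets (illustrative values).
squares_list = [0, 1, 4, 9, 16, 25]

first = squares_list[0]        # indexing starts at 0
last = squares_list[-1]        # negative indices count from the end
middle = squares_list[2:4]     # slicing returns a new list with items 2 and 3

squares_list[0] = 100          # lists are mutable: replace an element in place
squares_list.append(36)        # and grow the list at the end

print(squares_list)
```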

Strings: Strings can simply be defined by use of single ('), double (") or triple (''') inverted commas.
Strings enclosed in triple quotes (''') can span over multiple lines and are used frequently in docstrings
(Python's way of documenting functions). \ is used as an escape character. Please note that Python
strings are immutable, so you cannot change part of a string.
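The points above can be sketched as follows. The strings themselves are arbitrary examples of mine.

```python
greeting = 'Hello'                 # single quotes
multiline = """This string
spans two lines."""                # triple quotes may contain line breaks
escaped = 'It\'s Python'           # the backslash escapes the quote character

first_char = greeting[0]           # reading a character is fine

try:
    greeting[0] = 'J'              # strings are immutable, so this fails
except TypeError as err:
    error_message = str(err)

changed = 'J' + greeting[1:]       # build a new string instead
print(changed)
```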


Tuples: A tuple is represented by a number of values separated by commas. Tuples are immutable
and the output is surrounded by parentheses so that nested tuples are processed correctly. Additionally,
even though tuples are immutable, they can hold mutable data if needed.

Since tuples are immutable and cannot change, they are faster in processing as compared to
lists. Hence, if your list is unlikely to change, you should use tuples instead of lists.
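Here is a small sketch of these tuple properties, using made-up values:

```python
point = 3, 4                        # values separated by commas form a tuple
nested = (point, ('a', 'b'))        # parentheses make nesting unambiguous
x, y = point                        # unpacking into separate names

try:
    point[0] = 10                   # tuples are immutable, so this fails
except TypeError:
    tuple_is_immutable = True

record = ('scores', [70, 85])       # a tuple can hold a mutable list
record[1].append(90)                # the list inside can still change
print(record)
```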


Dictionary: A dictionary is an unordered set of key: value pairs, with the requirement that the keys are
unique (within one dictionary). A pair of braces creates an empty dictionary: {}.
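A short sketch of creating and using a dictionary follows. The keys and numbers are hypothetical.

```python
empty = {}                                    # a pair of braces: empty dictionary
extensions = {'alice': 9073, 'bob': 9128}     # hypothetical key: value pairs

extensions['carol'] = 9223                    # add a new key
extensions['bob'] = 9999                      # keys are unique: this overwrites

alice_ext = extensions['alice']               # look up a value by its key
has_dave = 'dave' in extensions               # membership test on keys
names = sorted(extensions.keys())             # sort keys for a stable order

print(names)
```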

Python Iteration and Conditional Constructs
Like most languages, Python also has a FOR-loop, which is the most widely used method for
iteration. It has a simple syntax:


for i in [Python Iterable]:
    expression(i)

Here "Python Iterable" can be a list, tuple or other advanced data structures which we will explore in
later sections. Let's take a look at a simple example, determining the factorial of a number.

N = 5  # example input (an assumed value; any non-negative integer works)
fact = 1
for i in range(1, N + 1):
    fact *= i

Coming to conditional statements, these are used to execute code fragments based on a condition.
The most commonly used construct is if-else, with the following syntax:

if [condition]:
    __execution if true__
else:
    __execution if false__

For instance, if we want to print whether the number N is even or odd:

if N % 2 == 0:
    print('Even')
else:
    print('Odd')

Now that you are familiar with Python fundamentals, let's take a step further. What if you have to
perform the following tasks:
1. Multiply 2 matrices
2. Find the roots of a quadratic equation
3. Plot bar charts and histograms
4. Make statistical models
5. Access web pages

If you try to write code from scratch, it's going to be a nightmare and you won't stay on Python for
more than 2 days! But let's not worry about that. Thankfully, there are many libraries with predefined
functions which we can directly import into our code to make our life easy.
For example, consider the factorial example we just saw. We can do that in a single step as:

math.factorial(N)

Of course, we need to import the math library for that. Let's explore the various libraries next.
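To make the comparison concrete, here is a small sketch that computes the factorial both ways and checks that they agree. The value of N is just an example.

```python
import math

N = 5  # example input (assumed value)

# Hand-written loop, as in the earlier example.
fact = 1
for i in range(1, N + 1):
    fact *= i

# Library call that does the same in one step.
library_fact = math.factorial(N)

print(fact == library_fact)
```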

Python Libraries
Let's take one step ahead in our journey to learn Python by getting acquainted with some useful
libraries. The first step is obviously to learn to import them into our environment. There are several
ways of doing so in Python:

import math as m

from math import *

In the first manner, we have defined an alias m for the library math. We can now use various functions
from the math library (e.g. factorial) by referencing them using the alias: m.factorial().
In the second manner, you have imported the entire namespace of math, i.e. you can directly use
factorial() without referring to math.
Tip: Google recommends that you use the first style of importing libraries, as you will know where
the functions have come from.
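As a quick sketch of the difference, the snippet below uses the alias style and, in place of the star import, imports a single name explicitly. That is my own substitution: it gives the same direct access to factorial() while keeping the rest of the namespace clean.

```python
import math as m                 # alias style: every call shows its origin
from math import factorial       # explicit single-name import (instead of *)

via_alias = m.factorial(5)       # referenced through the alias
direct = factorial(5)            # used directly, without the math prefix
root = m.sqrt(16)                # other math functions stay behind the alias

print(via_alias == direct)
```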


Following is a list of libraries you will need for any scientific computations and data analysis:
NumPy stands for Numerical Python. The most powerful feature of NumPy is the n-dimensional array. This
library also contains basic linear algebra functions, Fourier transforms, advanced random number
capabilities and tools for integration with other low-level languages like Fortran, C and C++.
SciPy stands for Scientific Python. SciPy is built on NumPy. It is one of the most useful libraries for a
variety of high-level science and engineering modules like discrete Fourier transform, Linear Algebra,
Optimization and Sparse matrices.
Matplotlib for plotting a vast variety of graphs, starting from histograms to line plots to heat plots. You can
use the Pylab feature in ipython notebook (ipython notebook --pylab=inline) to use these plotting features
inline. If you ignore the inline option, then pylab converts the ipython environment to an environment very
similar to Matlab. You can also use LaTeX commands to add math to your plot.
Pandas for structured data operations and manipulations. It is extensively used for data munging and
preparation. Pandas was added relatively recently to Python and has been instrumental in boosting
Python's usage in the data scientist community.
Scikit Learn for machine learning. Built on NumPy, SciPy and matplotlib, this library contains a lot of
efficient tools for machine learning and statistical modeling including classification, regression,
clustering and dimensionality reduction.
Statsmodels for statistical modeling. Statsmodels is a Python module that allows users to explore data,
estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics,
statistical tests, plotting functions, and result statistics are available for different types of data and each
estimator.
Seaborn for statistical data visualization. Seaborn is a library for making attractive and informative
statistical graphics in Python. It is based on matplotlib. Seaborn aims to make visualization a central part
of exploring and understanding data.
Bokeh for creating interactive plots, dashboards and data applications on modern web browsers. It
empowers the user to generate elegant and concise graphics in the style of D3.js. Moreover, it has the
capability of high-performance interactivity over very large or streaming datasets.
Blaze for extending the capability of NumPy and Pandas to distributed and streaming datasets. It can be
used to access data from a multitude of sources including Bcolz, MongoDB, SQLAlchemy, Apache
Spark, PyTables, etc. Together with Bokeh, Blaze can act as a very powerful tool for creating effective
visualizations and dashboards on huge chunks of data.
Scrapy for web crawling. It is a very useful framework for getting specific patterns of data. It has the
capability to start at a website home url and then dig through web pages within the website to gather
information.
SymPy for symbolic computation. It has wide-ranging capabilities from basic symbolic arithmetic to
calculus, algebra, discrete mathematics and quantum physics. Another useful feature is the capability of
formatting the result of the computations as LaTeX code.
Requests for accessing the web. It works similarly to the standard Python library urllib2 but is much
easier to code. You will find subtle differences with urllib2, but for beginners Requests might be more
convenient.

Additional libraries you might need:
os for operating system and file operations
networkx and igraph for graph-based data manipulations
regular expressions for finding patterns in text data
BeautifulSoup for scraping the web. It is inferior to Scrapy as it will extract information from just a single
webpage in a run.

Now that we are familiar with Python fundamentals and additional libraries, let's take a deep dive into
problem solving through Python. Yes, I mean making a predictive model! In the process, we use
some powerful libraries and also come across the next level of data structures. We will take you
through the 3 key phases:
1. Data Exploration: finding out more about the data we have
2. Data Munging: cleaning the data and playing with it to make it better suit statistical modeling
3. Predictive Modeling: running the actual algorithms and having fun

3. Exploratory analysis in Python using Pandas
In order to explore our data further, let me introduce you to another animal (as if Python was not
enough!): Pandas


Image source: Wikipedia
Pandas is one of the most useful data analysis libraries in Python (I know these names sound weird,
but hang on!). It has been instrumental in increasing the use of Python in the data science
community. We will now use Pandas to read a data set from an Analytics Vidhya competition,
perform exploratory analysis and build our first basic categorization algorithm for solving this
problem.
Before loading the data, let's understand the 2 key data structures in Pandas: Series and
DataFrames.

Introduction to Series and Dataframes
A Series can be understood as a 1-dimensional labelled / indexed array. You can access individual
elements of this series through these labels.
A dataframe is similar to an Excel workbook: you have column names referring to columns and you
have rows, which can be accessed with the use of row numbers. The essential difference is that
column names and row numbers are known as the column and row index in the case of dataframes.
Series and dataframes form the core data model for Pandas in Python. The data sets are first read
into these dataframes and then various operations (e.g. group by, aggregation etc.) can be applied
very easily to their columns.
More: 10 Minutes to Pandas
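Here is a minimal sketch of both structures, built from a few made-up rows rather than the real loan data. The column names are borrowed from the data set described later. The labels and values are assumptions for illustration only.

```python
import pandas as pd

# A Series: 1-dimensional values with labels (the index).
income = pd.Series([5000, 3000, 4500], index=['a', 'b', 'c'])
income_b = income['b']                      # access an element by its label

# A DataFrame: named columns plus a row index (0, 1, 2 by default).
df_small = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male'],
    'ApplicantIncome': [5000, 3000, 4500],
})
income_col = df_small['ApplicantIncome']    # one column is itself a Series
second_row = df_small.iloc[1]               # a row selected by position

# A group-by followed by an aggregation on one column.
mean_by_gender = df_small.groupby('Gender')['ApplicantIncome'].mean()
print(mean_by_gender)
```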


Practice data set: Loan Prediction Problem
You can download the dataset from here. Here is the description of variables:

VARIABLE DESCRIPTIONS:
Variable            Description
Loan_ID             Unique Loan ID
Gender              Male / Female
Married             Applicant married (Y/N)
Dependents          Number of dependents
Education           Applicant Education (Graduate / Under Graduate)
Self_Employed       Self employed (Y/N)
ApplicantIncome     Applicant income
CoapplicantIncome   Coapplicant income
LoanAmount          Loan amount in thousands
Loan_Amount_Term    Term of loan in months
Credit_History      Credit history meets guidelines
Property_Area       Urban / Semi Urban / Rural
Loan_Status         Loan approved (Y/N)

Let's begin with exploration
To begin, start the iPython interface in Inline Pylab mode by typing the following on your terminal / windows
command prompt:

ipython notebook --pylab=inline


This opens up the iPython notebook in the pylab environment, which has a few useful libraries already
imported. Also, you will be able to plot your data inline, which makes this a really good environment
for interactive data analysis. You can check whether the environment has loaded correctly by typing
the following command (and getting the output as seen in the figure below):

plot(arange(5))

I am currently working in Linux, and have stored the dataset in the following location:
/home/kunal/Downloads/Loan_Prediction/train.csv

Importing libraries and the data set:
Following are the libraries we will use during this tutorial:
numpy
matplotlib
pandas


Please note that you do not need to import matplotlib and numpy because of the Pylab environment. I
have still kept them in the code, in case you use the code in a different environment.
After importing the libraries, you read the dataset using the function read_csv(). This is how the code looks
till this stage:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # pyplot is the plotting interface used later as plt

# Reading the dataset into a dataframe using Pandas
df = pd.read_csv("/home/kunal/Downloads/Loan_Prediction/train.csv")

Quick Data Exploration
Once you have read the dataset, you can have a look at the top few rows by using the function head():

df.head(10)


This should print 10 rows. Alternately, you can also look at more rows by printing the dataset.
Next, you can look at a summary of numerical fields by using the describe() function:

df.describe()


The describe() function provides count, mean, standard deviation (std), min, quartiles and max in its
output. (Read this article to refresh basic statistics to understand population distribution.)
Here are a few inferences you can draw by looking at the output of the describe() function:
1. LoanAmount has (614 - 592) 22 missing values.
2. Loan_Amount_Term has (614 - 600) 14 missing values.
3. Credit_History has (614 - 564) 50 missing values.
4. We can also see that about 84% of applicants have a credit_history. How? The mean of the Credit_History
field is 0.84. (Remember, Credit_History has value 1 for those who have a credit history and 0 otherwise.)
5. The ApplicantIncome distribution seems to be in line with expectation. Same with CoapplicantIncome.

Please note that we can get an idea of a possible skew in the data by comparing the mean to the
median, i.e. the 50% figure.
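The same reasoning can be tried on a tiny made-up series (these are not values from the loan data): the count reported by describe() excludes missing entries, and one extreme value pulls the mean well above the median.

```python
import pandas as pd

# Six made-up incomes (one of them extreme) plus one missing entry.
incomes = pd.Series([2500, 3000, 3200, 3500, 4000, 81000, None])

summary = incomes.describe()
n_missing = len(incomes) - summary['count']   # same logic as 614 - 592 above
mean_income = summary['mean']
median_income = summary['50%']                # the 50% row is the median
right_skewed = mean_income > median_income    # mean far above median

print(summary)
```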
For the non-numerical values (e.g. Property_Area, Credit_History etc.), we can look at the frequency
distribution to understand whether they make sense or not. The frequency table can be printed by the
following command:

df['Property_Area'].value_counts()

Similarly, we can look at the unique values of Credit_History. Note that dfname['column_name'] is a
basic indexing technique to access a particular column of the dataframe. It can be a list of columns as
well. For more information, refer to the "10 Minutes to Pandas" resource shared above.
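As an illustration on a few made-up rows (not the actual data set), here is how the frequency table and both forms of column indexing behave. Note that value_counts() leaves out missing values by default.

```python
import pandas as pd

# Hypothetical rows that reuse the column names of the loan data set.
sample = pd.DataFrame({
    'Property_Area': ['Urban', 'Rural', 'Urban', 'Semiurban', 'Urban'],
    'Credit_History': [1.0, 0.0, 1.0, 1.0, float('nan')],
    'LoanAmount': [120, 66, 141, 95, 110],
})

area_counts = sample['Property_Area'].value_counts()      # one column
credit_counts = sample['Credit_History'].value_counts()   # NaN is not counted
subset = sample[['Property_Area', 'LoanAmount']]          # a list of columns

print(area_counts)
print(credit_counts)
```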

Distribution analysis
Now that we are familiar with basic data characteristics, let us study the distribution of various variables.
Let us start with the numeric variables, namely ApplicantIncome and LoanAmount.
Let's start by plotting the histogram of ApplicantIncome using the following command:


df['ApplicantIncome'].hist(bins=50)

Here we observe that there are a few extreme values. This is also the reason why 50 bins are required
to depict the distribution clearly.
Next, we look at box plots to understand the distributions. The box plot for ApplicantIncome can be plotted by:

df.boxplot(column='ApplicantIncome')


This confirms the presence of a lot of outliers / extreme values. This can be attributed to the income
disparity in society. Part of this can be driven by the fact that we are looking at people with
different education levels. Let us segregate them by Education:

df.boxplot(column='ApplicantIncome',by='Education')

We can see that there is no substantial difference between the mean income of graduates and
non-graduates. But there are a higher number of graduates with very high incomes, which appear
to be the outliers.
Now, let's look at the histogram and boxplot of LoanAmount using the following commands:

df['LoanAmount'].hist(bins=50)

df.boxplot(column='LoanAmount')


Again, there are some extreme values. Clearly, both ApplicantIncome and LoanAmount require some
amount of data munging. LoanAmount has missing as well as extreme values, while
ApplicantIncome has a few extreme values, which demand deeper understanding. We will take this
up in the coming sections.

Categorical variable analysis
Now that we understand the distributions of ApplicantIncome and LoanAmount, let us understand
categorical variables in more detail. We will use an Excel-style pivot table and cross-tabulation. For
instance, let us look at the chances of getting a loan based on credit history. This can be achieved in
MS Excel using a pivot table as:


Note: here loan status has been coded as 1 for Yes and 0 for No. So the mean represents the
probability of getting a loan.
Now we will look at the steps required to generate a similar insight using Python. Please refer to this
article for getting the hang of the different data manipulation techniques in Pandas.

temp1 = df['Credit_History'].value_counts(ascending=True)
# Map Y/N to 1/0 so that the mean is the share of approved loans per class
temp2 = df.pivot_table(values='Loan_Status', index=['Credit_History'],
                       aggfunc=lambda x: x.map({'Y': 1, 'N': 0}).mean())
print('Frequency Table for Credit History:')
print(temp1)

print('\nProbability of getting loan for each Credit History class:')
print(temp2)


Now we can observe that we get a pivot_table similar to the MS Excel one. This can be plotted as a
bar chart using the matplotlib library with the following code:

import matplotlib.pyplot as plt
fig = plt.figure(figsize=(8,4))
ax1 = fig.add_subplot(121)
temp1.plot(kind='bar', ax=ax1)  # draw on the left subplot explicitly
ax1.set_xlabel('Credit_History')
ax1.set_ylabel('Count of Applicants')
ax1.set_title("Applicants by Credit_History")

ax2 = fig.add_subplot(122)
temp2.plot(kind='bar', ax=ax2)  # without ax=, a DataFrame plot opens a new figure
ax2.set_xlabel('Credit_History')
ax2.set_ylabel('Probability of getting loan')
ax2.set_title("Probability of getting loan by credit history")


This shows that the chances of getting a loan are eight-fold if the applicant has a valid credit history.
You can plot similar graphs by Married, Self-Employed, Property_Area, etc.
Alternately, these two plots can also be visualized by combining them in a stacked chart:

temp3=pd.crosstab(df['Credit_History'],df['Loan_Status'])
temp3.plot(kind='bar',stacked=True,color=['red','blue'],grid=False)


You can also add gender into the mix (similar to the pivot table in Excel):
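One possible way to do this is to pass a list of two columns as the row index of pd.crosstab. It is sketched here on a small made-up table rather than the real loan data, and the name temp4 and all the rows are my own assumptions.

```python
import pandas as pd

# Hypothetical rows reusing the column names of the loan data set.
loans = pd.DataFrame({
    'Credit_History': [1, 1, 0, 1, 0, 1],
    'Gender': ['Male', 'Female', 'Male', 'Male', 'Female', 'Female'],
    'Loan_Status': ['Y', 'Y', 'N', 'N', 'N', 'Y'],
})

# Two row keys (credit history, then gender) against loan status.
temp4 = pd.crosstab([loans['Credit_History'], loans['Gender']],
                    loans['Loan_Status'])
print(temp4)
```

Calling temp4.plot(kind='bar', stacked=True) on the result would then draw it as a stacked chart, in the same way as temp3 above.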

If you have not realized already, we have just created two basic classification algorithms here, one
based on credit history, while the other is based on 2 categorical variables (including gender). You can quickly
code this to create your first submission on AV Datahacks.
We just saw how we can do exploratory analysis in Python using Pandas. I hope your love for
pandas (the animal) would have increased by now, given the amount of help the library can
provide you in analyzing datasets.
Next let's explore the ApplicantIncome and LoanStatus variables further, perform data munging and
create a dataset for applying various modeling techniques. I would strongly urge that you take
another dataset and problem and go through an independent example before reading further.

4. Data Munging in Python: Using Pandas
For those who have been following, here are your must-wear shoes to start running.

Data munging: recap of the need
During our exploration of the data, we found a few problems in the data set which need to be solved
before the data is ready for a good model. This exercise is typically referred to as "Data Munging". Here
are the problems we are already aware of:
1. There are missing values in some variables. We should estimate those values wisely depending on the
amount of missing values and the expected importance of the variables.
2. While looking at the distributions, we saw that ApplicantIncome and LoanAmount seemed to contain
extreme values at either end. Though they might make intuitive sense, they should be treated
appropriately.

In addition to these problems with numerical fields, we should also look at the non-numerical fields,
i.e. Gender, Property_Area, Married, Education and Dependents, to see if they contain any useful
information.
If you are new to Pandas, I would recommend reading this article before moving on. It details some
useful techniques of data manipulation.

Check missing values in the dataset
Let us look at missing values in all the variables, because most models don't work with missing
data and, even if they do, imputing them helps more often than not. So, let us check the number of
nulls / NaNs in the dataset:

df.apply(lambda x: sum(x.isnull()), axis=0)

This command should tell us the number of missing values in each column, as isnull() returns True
(which counts as 1 in the sum) if the value is null.
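To see the counting at work, here is a sketch on a small made-up frame (not the loan data). It also shows isnull().sum(), a shorter pandas idiom that should give the same counts.

```python
import pandas as pd

# Hypothetical frame with a few deliberately missing entries.
sample = pd.DataFrame({
    'LoanAmount': [120.0, None, 141.0, None],
    'Gender': ['Male', None, 'Female', 'Male'],
    'Loan_Status': ['Y', 'N', 'Y', 'N'],
})

# Column-wise count of nulls, written as in the command above.
missing_apply = sample.apply(lambda x: sum(x.isnull()), axis=0)

# Shorter equivalent: sum the True/False mask directly.
missing_short = sample.isnull().sum()

print(missing_apply)
```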

Though the missing values are not very high in number, many variables have them and each one
of these should be estimated and added to the data. Get a detailed view of different imputation
techniques through this article.
Note: Remember that missing values may not always be NaNs. For instance, if the
Loan_Amount_Term is 0, does it make sense or would you consider that missing? I suppose your
answer is "missing" and you're right. So we should check for values which are impractical.
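A check of this kind might look like the sketch below, again on made-up values. Treating a zero term as missing by replacing it with NaN is just one option, shown here as an assumption rather than the approach this tutorial takes.

```python
import numpy as np
import pandas as pd

# Hypothetical loan terms: two zeros and one genuine NaN.
sample = pd.DataFrame({'Loan_Amount_Term': [360.0, 0.0, 180.0, None, 0.0]})

# Count impractical values that isnull() would not flag.
n_zero_terms = (sample['Loan_Amount_Term'] == 0).sum()

# One option: mark them as missing so later imputation treats them alike.
cleaned = sample['Loan_Amount_Term'].replace(0, np.nan)
n_missing_after = cleaned.isnull().sum()

print(n_zero_terms)
print(n_missing_after)
```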

How to fill missing values in LoanAmount?
