10/6/2016

Data Munging in Python Using Pandas

Time flies by! I see Jenika (my daughter) running around the entire house and my office now. She
still slips and trips, but is now independent enough to explore the world and figure out new stuff on her own.
I hope I have been able to inspire similar confidence in the use of Python for data analysis among
the followers of this series.
For those who have been following, here is a pair of shoes for you to start running!

By the end of this tutorial, you will have all the tools necessary to perform data analysis by
yourself using Python.

Recap: Getting the basics right
In the previous posts in this series, we downloaded and set up a Python installation, got
introduced to several useful libraries and data structures, and finally started with an exploratory
analysis in Python (using Pandas).
In this tutorial, we will continue our journey from where we left off in the last tutorial: we have a
reasonable idea about the characteristics of the dataset we are working on. If you have not gone
through the previous article in the series, kindly do so before proceeding further.

http://www.analyticsvidhya.com/blog/2014/09/datamungingpythonusingpandasbabystepspython/

Data munging: recap of the need
During our exploration of the data, we found a few problems in the dataset which need to be solved
before the data is ready for a good model. This exercise is typically referred to as data munging. Here
are the problems we are already aware of:
1. About 31% (277 out of 891) of values in Age are missing. We expect Age to play an important role and
hence would want to estimate these values in some manner.
2. While looking at the distributions, we saw that Fare seemed to contain extreme values at either end: a
few tickets were probably provided free or contained data entry errors. On the other hand, $512 sounds
like a very high fare for booking a ticket.

In addition to these problems with the numerical fields, we should also look at the non-numerical fields,
i.e. Name, Ticket and Cabin, to see if they contain any useful information.

Check missing values in the dataset
Let us look at Cabin to start with. A first glance at the variable leaves us with the impression that there
are too many NaNs in it. So, let us check the number of nulls/NaNs in this column:

sum(df['Cabin'].isnull())

This command tells us the number of missing values, because isnull() returns True for every null value
and sum() counts each True as 1. The output is 687, which is a lot of missing values. So, we'll need to
drop this variable.
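As a minimal sketch of this check, the snippet below counts missing values on a small made-up DataFrame (the name `toy` and its values are assumptions for illustration, not the tutorial's data). It also shows that calling isnull().sum() on the whole DataFrame gives the count for every column at once, and that isnull().mean() gives the share of missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only to illustrate the calls
toy = pd.DataFrame({
    'Cabin': ['C85', np.nan, np.nan, 'E46'],
    'Age': [22.0, np.nan, 35.0, 54.0],
})

# Number of missing values in one column
cabin_missing = toy['Cabin'].isnull().sum()

# Number of missing values in every column at once
missing_per_column = toy.isnull().sum()

# Fraction of missing values per column (mean of True/False flags)
missing_share = toy.isnull().mean()

print(cabin_missing)
print(missing_per_column)
print(missing_share)
```

Looking at the share rather than the raw count can make it easier to judge whether a column is worth keeping.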

Next, let us look at the variable Ticket. Ticket appears to be a mix of numbers and text and doesn't seem to
contain any useful information, so we will drop Ticket as well.

df = df.drop(['Ticket', 'Cabin'], axis=1)

How to fill missing values in Age?
There are numerous ways to fill the missing values of Age, the simplest being replacement by the
mean, which can be done with the following code:

meanAge = np.mean(df.Age)
df.Age = df.Age.fillna(meanAge)
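To make the effect of this replacement concrete, here is a small sketch on a made-up Series (the values and the name `ages` are assumptions for illustration). The pandas mean() method skips missing values by default, so the mean is computed from the known ages only.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value
ages = pd.Series([20.0, 30.0, np.nan, 70.0])

# mean() ignores NaN by default: (20 + 30 + 70) / 3 = 40
mean_age = ages.mean()

# Replace the missing entry with the mean of the known entries
mean_filled = ages.fillna(mean_age)

print(mean_filled.tolist())
```

Every missing entry receives the same value, which is why this is the simplest option.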

The other extreme could be to build a supervised learning model to predict Age on the basis of the other
variables, and then use Age along with the other variables to predict survival.
Since the purpose of this tutorial is to bring out the steps in data munging, I'll rather take an
approach which lies somewhere in between these two extremes. The key hypothesis is that the
salutations in Name, together with Gender and Pclass, can provide us with the information required to fill in
the missing values to a large extent.
Here are the steps required to work on this hypothesis:
Step 1: Extracting salutations from Name

Let us define a function which extracts the salutation from a Name written in this format:
Family_Name, Salutation. First Name

def name_extract(word):
    # Take the part after the comma, then the part before the first dot
    return word.split(',')[1].split('.')[0].strip()

This function takes a Name, splits it at the comma (,), then splits the second part at the dot (.), and strips the
surrounding whitespace. Calling the function with 'Jain, Mr. Kunal' returns 'Mr', and 'Jain, Miss. Jenika'
returns 'Miss'.
Next, we apply this function to the entire column using the apply() function and convert the outcome to a
new DataFrame, df2:

df2 = pd.DataFrame({'Salutation': df['Name'].apply(name_extract)})
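The extraction can be tried on the two example names from the text. The sketch below uses a made-up two-row DataFrame called `names` (an assumption for illustration) and also shows a vectorised alternative with str.extract, which is not part of the original tutorial but should give the same result for names in this format.

```python
import pandas as pd

def name_extract(word):
    # Part after the comma, then part before the first dot, without spaces
    return word.split(',')[1].split('.')[0].strip()

# Hypothetical sample using the two names mentioned in the text
names = pd.DataFrame({'Name': ['Jain, Mr. Kunal', 'Jain, Miss. Jenika']})

# Row-by-row extraction with apply()
salutations = names['Name'].apply(name_extract)

# Vectorised alternative: capture the text between the comma and the first dot
vectorised = names['Name'].str.extract(r',\s*([^.]+)\.', expand=False).str.strip()

print(salutations.tolist())
print(vectorised.tolist())
```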

Once we have the salutations, let us look at their distribution. We use the good old groupby after
merging the DataFrame df2 with the DataFrame df:

df = pd.merge(df, df2, left_index=True, right_index=True)  # merges on index
temp1 = df.groupby('Salutation').PassengerId.count()
print(temp1)

Following is the output:

Salutation
Capt              1
Col               2
Don               1
Dr                7
Jonkheer          1
Lady              1
Major             2
Master           40
Miss            182
Mlle              2
Mme               1
Mr              517
Mrs             125
Ms                1
Rev               6
Sir               1
the Countess      1
dtype: int64

As you can see, there are four main salutations (Mr, Mrs, Miss and Master); all the others are far fewer in
number. Hence, we will combine all the remaining salutations under a single salutation, Others. To
do so, we take the same approach as we did to extract Salutation: define a function, apply
it to the Salutation column, store the outcome in a new DataFrame, and then merge it with the old DataFrame:

def group_salutation(old_salutation):
    # Keep the four common salutations; group everything else as 'Others'
    if old_salutation in ('Mr', 'Mrs', 'Master', 'Miss'):
        return old_salutation
    return 'Others'

df3 = pd.DataFrame({'New_Salutation': df['Salutation'].apply(group_salutation)})
df = pd.merge(df, df3, left_index=True, right_index=True)
# Count passengers per grouped salutation (counting on df, since df3 has no other column to count)
temp1 = df.groupby('New_Salutation').PassengerId.count()
print(temp1)
df.boxplot(column='Age', by='New_Salutation')

Following is the outcome for the distribution of New_Salutation and the variation of Age across them:

Step 2: Creating a simple grid (Class x Gender) x Salutation

Similarly, plotting the distribution of Age by Sex and Class shows a sloping pattern:

So, we create a pivot table which provides us with the median values for all the cells mentioned above. Next,
we define a function which returns the values of these cells, and apply it to fill the missing values of
Age.
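The code for this step is not shown above, so what follows is only a sketch of one way it could look, run on a small made-up DataFrame. The helper name `lookup_median_age` and the toy values are assumptions; the column names Pclass, Sex, New_Salutation and Age follow the ones used earlier in the tutorial.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
df = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Sex': ['male', 'male', 'male', 'female', 'female', 'female'],
    'New_Salutation': ['Mr', 'Mr', 'Mr', 'Miss', 'Miss', 'Miss'],
    'Age': [40.0, 50.0, np.nan, 20.0, 30.0, np.nan],
})

# Median Age for every (Pclass, Sex) x New_Salutation cell; NaN ages are ignored
table = df.pivot_table(values='Age', index='New_Salutation',
                       columns=['Pclass', 'Sex'], aggfunc='median')

def lookup_median_age(row):
    # Return the median Age of the cell this passenger falls into
    return table.loc[row['New_Salutation'], (row['Pclass'], row['Sex'])]

# Fill only the rows where Age is missing
missing = df['Age'].isnull()
df.loc[missing, 'Age'] = df[missing].apply(lookup_median_age, axis=1)

print(df)
```

In this toy example the missing first-class 'Mr' gets the median of 40 and 50, and the missing third-class 'Miss' gets the median of 20 and 30. On real data some cells may have no known ages at all, in which case the lookup would still return NaN and a fallback (for example the overall median) would be needed.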
