
9/6/2016

12 Useful Pandas Techniques in Python for Data Manipulation

Introduction

Python is fast becoming the preferred language for data scientists, and for good reason. It provides the larger ecosystem of a programming language and the depth of good scientific computation libraries. If you are starting to learn Python, have a look at the learning path on Python.
Among its scientific computation libraries, I found Pandas to be the most useful for data science operations. Pandas, along with scikit-learn, provides almost the entire stack needed by a data scientist. This article focuses on providing 12 ways for data manipulation in Python. I've also shared some tips & tricks which will allow you to work faster.
I would recommend that you look at the codes for data exploration before going ahead. To help you understand better, I've taken a dataset to perform these operations and manipulations.
Data Set: I've used the dataset of the Loan Prediction problem. Download the dataset and get started.

Let's get started
I'll start by importing modules and loading the dataset into the Python environment:

http://www.analyticsvidhya.com/blog/2016/01/12pandastechniquespythondatamanipulation/


import pandas as pd
import numpy as np
data = pd.read_csv("train.csv", index_col="Loan_ID")

#1 Boolean Indexing
What do you do if you want to filter values of a column based on conditions from another set of columns? For instance, we want a list of all females who are not graduates and got a loan. Boolean indexing can help here. You can use the following code:

data.loc[(data["Gender"]=="Female") & (data["Education"]=="Not Graduate") & (data["Loan_Status"]=="Y"), ["Gender", "Education", "Loan_Status"]]

Read More: Pandas Selecting and Indexing
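The same filter can also be written with DataFrame.query, which takes the condition as a single string. A minimal sketch on a toy frame (the values below are made up, not taken from the loan dataset):

```python
import pandas as pd

# Toy stand-in for the loan data (hypothetical values):
df = pd.DataFrame({
    "Gender": ["Female", "Female", "Male"],
    "Education": ["Not Graduate", "Graduate", "Not Graduate"],
    "Loan_Status": ["Y", "N", "Y"],
})

# query() expresses the same boolean filter as one string expression:
subset = df.query('Gender == "Female" and Education == "Not Graduate" and Loan_Status == "Y"')
print(subset)
```

Some find the query string easier to read than chained & conditions, though .loc remains the more general tool.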


#2 Apply Function
It is one of the commonly used functions for playing with data and creating new variables. apply returns some value after passing each row/column of a data frame through some function. The function can be either default or user-defined. For instance, here it can be used to find the number of missing values in each row and column.

#Create a new function:
def num_missing(x):
    return sum(x.isnull())

#Applying per column:
print("Missing values per column:")
print(data.apply(num_missing, axis=0))  #axis=0 defines that function is to be applied on each column

#Applying per row:
print("\nMissing values per row:")
print(data.apply(num_missing, axis=1).head())  #axis=1 defines that function is to be applied on each row


Thus we get the desired result.
Note: the head() function is used in the second output because it contains many rows.
Read More: Pandas Reference (apply)
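If defining a named function feels heavy, the same counts can come from a lambda, or from the vectorised isnull().sum() shortcut. A small sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing entries (hypothetical):
df = pd.DataFrame({"a": [1, np.nan, 3], "b": [np.nan, np.nan, 6]})

# apply() with a lambda, equivalent to a named num_missing function:
per_column = df.apply(lambda x: x.isnull().sum(), axis=0)

# The built-in shortcut gives the same per-column counts:
per_column_short = df.isnull().sum()
print(per_column)
```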

#3 Imputing missing values
fillna() does it in one go. It is used for updating missing values with the overall mean/mode/median of the column. Let's impute the 'Gender', 'Married' and 'Self_Employed' columns with their respective modes.

#First we import a function to determine the mode
from scipy.stats import mode
mode(data['Gender'])

Output: ModeResult(mode=array(['Male'], dtype=object), count=array([489]))

This returns both mode and count. Remember that the mode can be an array, as there can be multiple values with high frequency. We will take the first one by default, always using:

mode(data['Gender']).mode[0]

Now we can fill the missing values and check using technique #2.

#Impute the values:
data['Gender'].fillna(mode(data['Gender']).mode[0], inplace=True)
data['Married'].fillna(mode(data['Married']).mode[0], inplace=True)
data['Self_Employed'].fillna(mode(data['Self_Employed']).mode[0], inplace=True)

#Now check the #missing values again to confirm:
print(data.apply(num_missing, axis=0))

Hence, it is confirmed that the missing values are imputed. Please note that this is the most primitive form of imputation. Other sophisticated techniques include modeling the missing values using grouped averages (mean/mode/median). I'll cover that part in my next articles.
Read More: Pandas Reference (fillna)
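Pandas can also compute the mode itself, without scipy: Series.mode() returns the most frequent value(s), and taking its first element mirrors mode(...).mode[0]. A sketch on a hypothetical column:

```python
import numpy as np
import pandas as pd

# Hypothetical column with missing entries:
gender = pd.Series(["Male", "Male", "Female", np.nan, np.nan])

# Series.mode() ignores NaN and returns the most frequent value(s);
# take the first in case of ties, then fill:
filled = gender.fillna(gender.mode()[0])
print(filled)
```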


#4 Pivot Table
Pandas can be used to create MS Excel style pivot tables. For instance, in this case, a key column is 'LoanAmount' which has missing values. We can impute it using the mean amount of each 'Gender', 'Married' and 'Self_Employed' group. The mean 'LoanAmount' of each group can be determined as:

#Determine pivot table
impute_grps = data.pivot_table(values=["LoanAmount"], index=["Gender", "Married", "Self_Employed"], aggfunc=np.mean)
print(impute_grps)

More: Pandas Reference (Pivot Table)

#5 Multi-Indexing
If you notice the output of step #4, it has a strange property. Each index is made up of a combination of 3 values. This is called Multi-Indexing. It helps in performing operations really fast.
Continuing the example from #4, we have the values for each group, but they have not been imputed. This can be done using the various techniques learned till now.

#iterate only through rows with missing LoanAmount
for i, row in data.loc[data['LoanAmount'].isnull(), :].iterrows():
    ind = tuple([row['Gender'], row['Married'], row['Self_Employed']])
    data.loc[i, 'LoanAmount'] = impute_grps.loc[ind].values[0]

#Now check the #missing values again to confirm:
print(data.apply(num_missing, axis=0))

Note:
1. Multi-indexing requires a tuple for defining groups of indices in the loc statement. Remember to use a tuple, not a list.
2. The .values[0] suffix is required because, by default, a Series element is returned, which has an index not matching that of the data frame. In this case, a direct assignment gives an error.
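The row-by-row loop can also be replaced by groupby + transform, which broadcasts each group's mean back to the original rows, so no pivot-table lookup is needed. A sketch on a toy frame (column values are made up):

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the grouping columns (hypothetical values):
df = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female"],
    "Married": ["Yes", "Yes", "No", "No"],
    "LoanAmount": [100.0, np.nan, 50.0, np.nan],
})

# transform('mean') returns a Series aligned with the original index,
# so fillna() can consume it directly, without an explicit loop:
group_mean = df.groupby(["Gender", "Married"])["LoanAmount"].transform("mean")
df["LoanAmount"] = df["LoanAmount"].fillna(group_mean)
print(df)
```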

#6 Crosstab
This function is used to get an initial "feel" (view) of the data. Here, we can validate some basic hypotheses. For instance, in this case, 'Credit_History' is expected to affect the loan status significantly. This can be tested using cross-tabulation as shown below:

pd.crosstab(data["Credit_History"], data["Loan_Status"], margins=True)

http://www.analyticsvidhya.com/blog/2016/01/12pandastechniquespythondatamanipulation/

7/13

9/6/2016

12UsefulPandasTechniquesinPythonforDataManipulation

These are absolute numbers. But percentages can be more intuitive for making some quick insights. We can do this using the apply function:

def percConvert(ser):
    return ser / float(ser[-1])

pd.crosstab(data["Credit_History"], data["Loan_Status"], margins=True).apply(percConvert, axis=1)

Now, it is evident that people with a credit history have much higher chances of getting a loan, as 80% of people with a credit history got a loan, compared to only 9% without a credit history.
But that's not it. It tells an interesting story. Since I know that having a credit history is super important, what if I predict the loan status to be Y for those with a credit history and N otherwise? Surprisingly, we'll be right 82 + 378 = 460 times out of 614, which is a whopping 75%!
I won't blame you if you're wondering why the hell we need statistical models. But trust me, increasing the accuracy by even 0.001% beyond this mark is a challenging task. Would you take this challenge?
Note: 75% is on the train set. The test set will be slightly different but close. Also, I hope this gives some intuition into why even a 0.05% increase in accuracy can result in a jump of 500 ranks on the Kaggle leaderboard.

Read More: Pandas Reference (crosstab)
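Newer pandas versions can produce the percentage view directly: crosstab accepts a normalize argument, which replaces the manual percConvert step. A sketch on made-up values:

```python
import pandas as pd

# Hypothetical credit-history / loan-status pairs:
df = pd.DataFrame({
    "Credit_History": [1, 1, 1, 0, 0],
    "Loan_Status": ["Y", "Y", "N", "N", "N"],
})

# normalize='index' turns each row into proportions that sum to 1:
ct = pd.crosstab(df["Credit_History"], df["Loan_Status"], normalize="index")
print(ct)
```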

#7 Merge DataFrames
Merging dataframes becomes essential when we have information coming from different sources to be collated. Consider a hypothetical case where the average property rates (INR per sq. meter) are available for different property types. Let's define a dataframe as:

prop_rates = pd.DataFrame([1000, 5000, 12000], index=['Rural', 'Semiurban', 'Urban'], columns=['rates'])
prop_rates

Now we can merge this information with the original dataframe as:

data_merged = data.merge(right=prop_rates, how='inner', left_on='Property_Area', right_index=True, sort=False)
data_merged.pivot_table(values='Credit_History', index=['Property_Area', 'rates'], aggfunc=len)


The pivot table validates a successful merge operation. Note that the 'values' argument is irrelevant here because we are simply counting the values.
Read More: Pandas Reference (merge)
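merge also takes an indicator=True argument, which appends a _merge column telling where each row came from; that is another quick way to validate a join besides the pivot-table count. A sketch with hypothetical rates:

```python
import pandas as pd

# Hypothetical frames; one area is deliberately missing from the rates table:
left = pd.DataFrame({"Property_Area": ["Urban", "Rural", "Semiurban"]})
rates = pd.DataFrame({"rates": [1000, 12000]}, index=["Rural", "Urban"])

# indicator=True adds a _merge column ('both', 'left_only' or 'right_only'):
merged = left.merge(rates, how="left", left_on="Property_Area",
                    right_index=True, indicator=True)
print(merged)
```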

#8 Sorting DataFrames
Pandas allows easy sorting based on multiple columns. This can be done as:

data_sorted = data.sort_values(['ApplicantIncome', 'CoapplicantIncome'], ascending=False)
data_sorted[['ApplicantIncome', 'CoapplicantIncome']].head(10)

Note: the Pandas sort function is now deprecated. We should use sort_values instead.
More: Pandas Reference (sort_values)
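ascending also accepts a list, one flag per sort key, when the columns should go in opposite directions. A sketch with made-up incomes:

```python
import pandas as pd

# Hypothetical incomes:
df = pd.DataFrame({
    "ApplicantIncome": [5000, 5000, 3000],
    "CoapplicantIncome": [0, 1500, 2000],
})

# Highest applicant income first; ties broken by lowest co-applicant income:
out = df.sort_values(["ApplicantIncome", "CoapplicantIncome"],
                     ascending=[False, True])
print(out)
```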


#9 Plotting (Boxplot & Histogram)
Many of you might be unaware that boxplots and histograms can be plotted directly in Pandas, and calling matplotlib separately is not necessary. It's just a one-line command. For instance, if we want to compare the distribution of ApplicantIncome by Loan_Status:

import matplotlib.pyplot as plt
%matplotlib inline
data.boxplot(column="ApplicantIncome", by="Loan_Status")

data.hist(column="ApplicantIncome", by="Loan_Status", bins=30)

This shows that income is not a big deciding factor on its own, as there is no appreciable difference between the people who received and were denied the loan.
Read More: Pandas Reference (hist) | Pandas Reference (boxplot)

#10 Cut function for binning
Sometimes numerical values make more sense if clustered together. For example, if we're trying to model traffic (#cars on road) with time of the day (minutes), the exact minute of an hour might not be that relevant for predicting traffic as compared to the actual period of the day, like 'Morning', 'Afternoon', 'Evening', 'Night', 'Late Night'. Modeling traffic this way will be more intuitive and will avoid overfitting.
Here we define a simple function which can be reused for binning any variable fairly easily.


#Binning:
def binning(col, cut_points, labels=None):
    #Define min and max values:
    minval = col.min()
    maxval = col.max()

    #create list by adding min and max to cut_points
    break_points = [minval] + cut_points + [maxval]

    #if no labels provided, use default labels 0 ... (n-1)
    if not labels:
        labels = range(len(cut_points) + 1)

    #Binning using cut function of pandas
    colBin = pd.cut(col, bins=break_points, labels=labels, include_lowest=True)
    return colBin

#Binning LoanAmount:
cut_points = [90, 140, 190]
labels = ["low", "medium", "high", "very high"]
data["LoanAmount_Bin"] = binning(data["LoanAmount"], cut_points, labels)
print(pd.value_counts(data["LoanAmount_Bin"], sort=False))

Read More: Pandas Reference (cut)
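A close cousin of cut is qcut, which splits on quantiles instead of fixed break points, so each bin gets roughly equal membership. A sketch on hypothetical amounts:

```python
import pandas as pd

# Hypothetical loan amounts:
amounts = pd.Series([100, 120, 150, 180, 200, 250, 300, 400])

# qcut chooses the bin edges from the data's quartiles; with 8 values
# and q=4 every bin ends up with 2 members:
bins = pd.qcut(amounts, q=4, labels=["low", "medium", "high", "very high"])
print(bins.value_counts(sort=False))
```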


#11 Coding nominal data
Often, we find a case where we've to modify the categories of a nominal variable. This can be due to various reasons:
1. Some algorithms (like Logistic Regression) require all inputs to be numeric, so nominal variables are mostly coded as 0, 1 ... (n-1).
2. Sometimes a category might be represented in 2 ways. For e.g. temperature might be recorded as 'High', 'Medium', 'Low', 'H', 'low'. Here, both 'High' and 'H' refer to the same category. Similarly, in 'Low' and 'low' there is only a difference of case. But Python would read them as different levels.
3. Some categories might have very low frequencies, and it's generally a good idea to combine them.

Here I've defined a generic function which takes a dictionary as input and codes the values using the replace function in Pandas.

#Define a generic function using the Pandas replace function
def coding(col, codeDict):
    colCoded = pd.Series(col, copy=True)
    for key, value in codeDict.items():
        colCoded.replace(key, value, inplace=True)
    return colCoded

#Coding Loan_Status as Y=1, N=0:
print('Before Coding:')
print(pd.value_counts(data["Loan_Status"]))
data["Loan_Status_Coded"] = coding(data["Loan_Status"], {'N': 0, 'Y': 1})
print('\nAfter Coding:')
print(pd.value_counts(data["Loan_Status_Coded"]))
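For a one-shot recoding like Y/N to 1/0, Series.map with a dictionary is a compact alternative to the replace loop (note that values not covered by the dictionary become NaN, whereas replace would leave them untouched):

```python
import pandas as pd

# Hypothetical status column:
status = pd.Series(["Y", "N", "Y", "Y"])

# map() looks every element up in the dict in one pass:
coded = status.map({"Y": 1, "N": 0})
print(coded)
```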
