
9/6/2016

12 Useful Pandas Techniques in Python for Data Manipulation

Introduction

Python is fast becoming the preferred language for data scientists, and for good reason. It provides the larger ecosystem of a programming language and the depth of good scientific computation libraries. If you are starting to learn Python, have a look at the learning path on Python.
Among its scientific computation libraries, I found Pandas to be the most useful for data science operations. Pandas, along with scikit-learn, provides almost the entire stack needed by a data scientist. This article focuses on providing 12 ways for data manipulation in Python. I've also shared some tips & tricks which will allow you to work faster.
I would recommend that you look at the codes for data exploration before going ahead. To help you understand better, I've taken a dataset to perform these operations and manipulations.
Data Set: I've used the dataset of the Loan Prediction problem. Download the dataset and get started.

Let's get started
I'll start by importing modules and loading the dataset into the Python environment:

http://www.analyticsvidhya.com/blog/2016/01/12pandastechniquespythondatamanipulation/


import pandas as pd
import numpy as np
data = pd.read_csv("train.csv", index_col="Loan_ID")

#1 Boolean Indexing
What do you do if you want to filter values of a column based on conditions from another set of columns? For instance, we want a list of all females who are not graduates and got a loan. Boolean indexing can help here. You can use the following code:

data.loc[(data["Gender"]=="Female") & (data["Education"]=="Not Graduate") & (data["Loan_Status"]=="Y"), ["Gender", "Education", "Loan_Status"]]

Read More: Pandas Selecting and Indexing
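The same filter can also be written with DataFrame.query, which takes the condition as a single string. A minimal sketch on a toy frame (the values below are made up, not taken from the loan dataset):

```python
import pandas as pd

# Toy stand-in for the loan data (hypothetical values):
df = pd.DataFrame({
    "Gender": ["Female", "Female", "Male"],
    "Education": ["Not Graduate", "Graduate", "Not Graduate"],
    "Loan_Status": ["Y", "N", "Y"],
})

# query() expresses the same boolean filter as one string expression:
subset = df.query('Gender == "Female" and Education == "Not Graduate" and Loan_Status == "Y"')
print(subset)
```

Some find the query string easier to read than chained & conditions, though .loc remains the more general tool.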


#2 Apply Function
It is one of the commonly used functions for playing with data and creating new variables. apply returns some value after passing each row/column of a data frame through some function. The function can be either default or user-defined. For instance, here it can be used to find the number of missing values in each row and column.

#Create a new function:
def num_missing(x):
    return sum(x.isnull())

#Applying per column:
print("Missing values per column:")
print(data.apply(num_missing, axis=0))  #axis=0 defines that function is to be applied on each column

#Applying per row:
print("\nMissing values per row:")
print(data.apply(num_missing, axis=1).head())  #axis=1 defines that function is to be applied on each row


Thus we get the desired result.
Note: the head() function is used in the second output because it contains many rows.
Read More: Pandas Reference (apply)
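If defining a named function feels heavy, the same counts can come from a lambda, or from the vectorised isnull().sum() shortcut. A small sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing entries (hypothetical):
df = pd.DataFrame({"a": [1, np.nan, 3], "b": [np.nan, np.nan, 6]})

# apply() with a lambda, equivalent to a named num_missing function:
per_column = df.apply(lambda x: x.isnull().sum(), axis=0)

# The built-in shortcut gives the same per-column counts:
per_column_short = df.isnull().sum()
print(per_column)
```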

#3 Imputing missing values
fillna() does it in one go. It is used for updating missing values with the overall mean/mode/median of the column. Let's impute the 'Gender', 'Married' and 'Self_Employed' columns with their respective modes.

#First we import a function to determine the mode
from scipy.stats import mode
mode(data['Gender'])

Output: ModeResult(mode=array(['Male'], dtype=object), count=array([489]))

This returns both mode and count. Remember that the mode can be an array, as there can be multiple values with high frequency. We will take the first one by default, always using:

mode(data['Gender']).mode[0]

Now we can fill the missing values and check using technique #2.

#Impute the values:
data['Gender'].fillna(mode(data['Gender']).mode[0], inplace=True)
data['Married'].fillna(mode(data['Married']).mode[0], inplace=True)
data['Self_Employed'].fillna(mode(data['Self_Employed']).mode[0], inplace=True)

#Now check the #missing values again to confirm:
print(data.apply(num_missing, axis=0))

Hence, it is confirmed that the missing values are imputed. Please note that this is the most primitive form of imputation. Other sophisticated techniques include modeling the missing values using grouped averages (mean/mode/median). I'll cover that part in my next articles.
Read More: Pandas Reference (fillna)
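Pandas can also compute the mode itself, without scipy: Series.mode() returns the most frequent value(s), and taking its first element mirrors mode(...).mode[0]. A sketch on a hypothetical column:

```python
import numpy as np
import pandas as pd

# Hypothetical column with missing entries:
gender = pd.Series(["Male", "Male", "Female", np.nan, np.nan])

# Series.mode() ignores NaN and returns the most frequent value(s);
# take the first in case of ties, then fill:
filled = gender.fillna(gender.mode()[0])
print(filled)
```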


#4 Pivot Table
Pandas can be used to create MS Excel style pivot tables. For instance, in this case, a key column is 'LoanAmount' which has missing values. We can impute it using the mean amount of each 'Gender', 'Married' and 'Self_Employed' group. The mean 'LoanAmount' of each group can be determined as:

#Determine pivot table
impute_grps = data.pivot_table(values=["LoanAmount"], index=["Gender", "Married", "Self_Employed"], aggfunc=np.mean)
print(impute_grps)

More: Pandas Reference (Pivot Table)

#5 Multi-Indexing
If you notice the output of step #4, it has a strange property. Each index is made up of a combination of 3 values. This is called Multi-Indexing. It helps in performing operations really fast.
Continuing the example from #4, we have the values for each group, but they have not been imputed. This can be done using the various techniques learned till now.

#iterate only through rows with missing LoanAmount
for i, row in data.loc[data['LoanAmount'].isnull(), :].iterrows():
    ind = tuple([row['Gender'], row['Married'], row['Self_Employed']])
    data.loc[i, 'LoanAmount'] = impute_grps.loc[ind].values[0]

#Now check the #missing values again to confirm:
print(data.apply(num_missing, axis=0))

Note:
1. Multi-indexing requires a tuple for defining groups of indices in the loc statement. Remember to use a tuple, not a list.
2. The .values[0] suffix is required because, by default, a Series element is returned, which has an index not matching that of the data frame. In this case, a direct assignment gives an error.
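The row-by-row loop can also be replaced by groupby + transform, which broadcasts each group's mean back to the original rows, so no pivot-table lookup is needed. A sketch on a toy frame (column values are made up):

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the grouping columns (hypothetical values):
df = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female"],
    "Married": ["Yes", "Yes", "No", "No"],
    "LoanAmount": [100.0, np.nan, 50.0, np.nan],
})

# transform('mean') returns a Series aligned with the original index,
# so fillna() can consume it directly, without an explicit loop:
group_mean = df.groupby(["Gender", "Married"])["LoanAmount"].transform("mean")
df["LoanAmount"] = df["LoanAmount"].fillna(group_mean)
print(df)
```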

#6 Crosstab
This function is used to get an initial "feel" (view) of the data. Here, we can validate some basic hypotheses. For instance, in this case, 'Credit_History' is expected to affect the loan status significantly. This can be tested using cross-tabulation as shown below:

pd.crosstab(data["Credit_History"], data["Loan_Status"], margins=True)

http://www.analyticsvidhya.com/blog/2016/01/12pandastechniquespythondatamanipulation/

7/13

9/6/2016

12UsefulPandasTechniquesinPythonforDataManipulation

These are absolute numbers. But percentages can be more intuitive for making some quick insights. We can do this using the apply function:

def percConvert(ser):
    return ser / float(ser[-1])

pd.crosstab(data["Credit_History"], data["Loan_Status"], margins=True).apply(percConvert, axis=1)

Now, it is evident that people with a credit history have much higher chances of getting a loan, as 80% of people with a credit history got a loan, compared to only 9% without a credit history.
But that's not it. It tells an interesting story. Since I know that having a credit history is super important, what if I predict the loan status to be Y for those with a credit history and N otherwise? Surprisingly, we'll be right 82 + 378 = 460 times out of 614, which is a whopping 75%!
I won't blame you if you're wondering why the hell we need statistical models. But trust me, increasing the accuracy by even 0.001% beyond this mark is a challenging task. Would you take this challenge?
Note: 75% is on the train set. The test set will be slightly different but close. Also, I hope this gives some intuition into why even a 0.05% increase in accuracy can result in a jump of 500 ranks on the Kaggle leaderboard.

Read More: Pandas Reference (crosstab)
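Newer pandas versions can produce the percentage view directly: crosstab accepts a normalize argument, which replaces the manual percConvert step. A sketch on made-up values:

```python
import pandas as pd

# Hypothetical credit-history / loan-status pairs:
df = pd.DataFrame({
    "Credit_History": [1, 1, 1, 0, 0],
    "Loan_Status": ["Y", "Y", "N", "N", "N"],
})

# normalize='index' turns each row into proportions that sum to 1:
ct = pd.crosstab(df["Credit_History"], df["Loan_Status"], normalize="index")
print(ct)
```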

#7 Merge DataFrames
Merging dataframes becomes essential when we have information coming from different sources to be collated. Consider a hypothetical case where the average property rates (INR per sq. meter) are available for different property types. Let's define a dataframe as:

prop_rates = pd.DataFrame([1000, 5000, 12000], index=['Rural', 'Semiurban', 'Urban'], columns=['rates'])
prop_rates

Now we can merge this information with the original dataframe as:

data_merged = data.merge(right=prop_rates, how='inner', left_on='Property_Area', right_index=True, sort=False)
data_merged.pivot_table(values='Credit_History', index=['Property_Area', 'rates'], aggfunc=len)


The pivot table validates a successful merge operation. Note that the 'values' argument is irrelevant here because we are simply counting the values.
Read More: Pandas Reference (merge)
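merge also takes an indicator=True argument, which appends a _merge column telling where each row came from; that is another quick way to validate a join besides the pivot-table count. A sketch with hypothetical rates:

```python
import pandas as pd

# Hypothetical frames; one area is deliberately missing from the rates table:
left = pd.DataFrame({"Property_Area": ["Urban", "Rural", "Semiurban"]})
rates = pd.DataFrame({"rates": [1000, 12000]}, index=["Rural", "Urban"])

# indicator=True adds a _merge column ('both', 'left_only' or 'right_only'):
merged = left.merge(rates, how="left", left_on="Property_Area",
                    right_index=True, indicator=True)
print(merged)
```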

#8 Sorting DataFrames
Pandas allows easy sorting based on multiple columns. This can be done as:

data_sorted = data.sort_values(['ApplicantIncome', 'CoapplicantIncome'], ascending=False)
data_sorted[['ApplicantIncome', 'CoapplicantIncome']].head(10)

Note: the Pandas sort function is now deprecated. We should use sort_values instead.
More: Pandas Reference (sort_values)
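ascending also accepts a list, one flag per sort key, when the columns should go in opposite directions. A sketch with made-up incomes:

```python
import pandas as pd

# Hypothetical incomes:
df = pd.DataFrame({
    "ApplicantIncome": [5000, 5000, 3000],
    "CoapplicantIncome": [0, 1500, 2000],
})

# Highest applicant income first; ties broken by lowest co-applicant income:
out = df.sort_values(["ApplicantIncome", "CoapplicantIncome"],
                     ascending=[False, True])
print(out)
```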


#9 Plotting (Boxplot & Histogram)
Many of you might be unaware that boxplots and histograms can be plotted directly in Pandas, and calling matplotlib separately is not necessary. It's just a one-line command. For instance, if we want to compare the distribution of ApplicantIncome by Loan_Status:

import matplotlib.pyplot as plt
%matplotlib inline
data.boxplot(column="ApplicantIncome", by="Loan_Status")

data.hist(column="ApplicantIncome", by="Loan_Status", bins=30)

This shows that income is not a big deciding factor on its own, as there is no appreciable difference between the people who received and were denied the loan.
Read More: Pandas Reference (hist) | Pandas Reference (boxplot)

#10 Cut function for binning
Sometimes numerical values make more sense if clustered together. For example, if we're trying to model traffic (#cars on road) with time of the day (minutes), the exact minute of an hour might not be that relevant for predicting traffic as compared to the actual period of the day, like 'Morning', 'Afternoon', 'Evening', 'Night', 'Late Night'. Modeling traffic this way will be more intuitive and will avoid overfitting.
Here we define a simple function which can be reused for binning any variable fairly easily.


#Binning:
def binning(col, cut_points, labels=None):
    #Define min and max values:
    minval = col.min()
    maxval = col.max()

    #create list by adding min and max to cut_points
    break_points = [minval] + cut_points + [maxval]

    #if no labels provided, use default labels 0 ... (n-1)
    if not labels:
        labels = range(len(cut_points) + 1)

    #Binning using cut function of pandas
    colBin = pd.cut(col, bins=break_points, labels=labels, include_lowest=True)
    return colBin

#Binning LoanAmount:
cut_points = [90, 140, 190]
labels = ["low", "medium", "high", "very high"]
data["LoanAmount_Bin"] = binning(data["LoanAmount"], cut_points, labels)
print(pd.value_counts(data["LoanAmount_Bin"], sort=False))

Read More: Pandas Reference (cut)
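A close cousin of cut is qcut, which splits on quantiles instead of fixed break points, so each bin gets roughly equal membership. A sketch on hypothetical amounts:

```python
import pandas as pd

# Hypothetical loan amounts:
amounts = pd.Series([100, 120, 150, 180, 200, 250, 300, 400])

# qcut chooses the bin edges from the data's quartiles; with 8 values
# and q=4 every bin ends up with 2 members:
bins = pd.qcut(amounts, q=4, labels=["low", "medium", "high", "very high"])
print(bins.value_counts(sort=False))
```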


#11 Coding nominal data
Often, we find a case where we've to modify the categories of a nominal variable. This can be due to various reasons:
1. Some algorithms (like Logistic Regression) require all inputs to be numeric, so nominal variables are mostly coded as 0, 1 ... (n-1).
2. Sometimes a category might be represented in 2 ways. For e.g. temperature might be recorded as 'High', 'Medium', 'Low', 'H', 'low'. Here, both 'High' and 'H' refer to the same category. Similarly, in 'Low' and 'low' there is only a difference of case. But Python would read them as different levels.
3. Some categories might have very low frequencies, and it's generally a good idea to combine them.

Here I've defined a generic function which takes a dictionary as input and codes the values using the replace function in Pandas.

#Define a generic function using the Pandas replace function
def coding(col, codeDict):
    colCoded = pd.Series(col, copy=True)
    for key, value in codeDict.items():
        colCoded.replace(key, value, inplace=True)
    return colCoded

#Coding Loan_Status as Y=1, N=0:
print('Before Coding:')
print(pd.value_counts(data["Loan_Status"]))
data["Loan_Status_Coded"] = coding(data["Loan_Status"], {'N': 0, 'Y': 1})
print('\nAfter Coding:')
print(pd.value_counts(data["Loan_Status_Coded"]))
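For a one-shot recoding like Y/N to 1/0, Series.map with a dictionary is a compact alternative to the replace loop (note that values not covered by the dictionary become NaN, whereas replace would leave them untouched):

```python
import pandas as pd

# Hypothetical status column:
status = pd.Series(["Y", "N", "Y", "Y"])

# map() looks every element up in the dict in one pass:
coded = status.map({"Y": 1, "N": 0})
print(coded)
```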
