
Paper 036-2013

Big Data, Fast Processing Speeds


Kevin McGowan, SAS Solutions on Demand, Cary, NC

ABSTRACT
As data sets continue to grow, it is important for programs to be written very efficiently to
make sure no time is wasted processing data. This paper covers various techniques to speed up
data processing time for very large data sets or databases, including PROC SQL, data step,
indexes and SAS macros. Some of these procedures may result in just a slight speed increase,
but when you process 500 million records per day, even a 10% increase is very good. The paper
includes actual time comparisons to demonstrate the speed increases using the new
techniques.

INTRODUCTION
More organizations are running into problems with processing big data every day. The bigger
the data, the longer the processing time in most cases. Many projects have tight time
constraints that must be met because of contractual agreements. When the data size increases,
it can mean that the processing time will be longer than the allotted time to process the data.
Since the amount of data cannot be reduced (except in rare cases), the best solution is to seek
out methods to reduce the run time of programs by making them more efficient. This is also a
cheaper method than simply spending a lot of money to buy bigger/faster hardware, which
may or may not speed up the processing time.
Important Note: In this paper, whenever code is presented that is efficient it will be shown in
green. Code that should not be used is shown in red.

WHAT IS BIG DATA?
There are many different definitions of "big data," and more definitions are being created
every day. If you ask 10 people, you will probably get 10 different definitions. At SAS
Solutions on Demand (SSO) we have many projects that would be considered big data projects.
Some of these projects have jobs that run anywhere from 16 to 40 hours because of the large
amount of data and complex calculations that are performed on each record or data point.

These projects also have very large and fast servers. One example of a typical SAS server that is
used by SSO has these specifications:

24 CPU cores
256 GB of RAM
5+ TB of disk space
Very fast RAID disk drive arrays with advanced caches
Linux or AIX operating system
Very high speed internal network connections (up to 10 Gb per second)
Encrypted data transfers between servers
Production, Test, and Development servers that are identical
Grid computing system (for one large project)

The projects also use a three-tiered system where the main server is supported by an Oracle
database server and a front-end terminal server for the end users. The support servers are
typically sized about the same as the SAS server. In most cases the data is split: some data
is stored in SAS data sets, while the data used most by the end users is stored in Oracle tables.
This setup was chosen because Oracle tables allow faster access to data in real time. Even with
this large amount of computing horsepower, it still takes a long time to run a single job
because of the large amount of data the projects use. Most of these projects process over
200 million records during a single run. The data is complex, and requires a large number of
calculations to produce the desired results.

BEST WAYS TO MEASURE SPEED IMPROVEMENTS
SAS has several system options that are very helpful to determine the level of increase in
performance. In this paper, we will focus on the actual clock time (not CPU time) the
program takes to run. In an environment with multiple CPUs, the CPU time statistic can be
confusing; it's even possible that CPU time can be greater than the clock time. The most
important SAS options for measuring performance are listed below with a short description:

Stimer/Fullstimer: These options control the amount of data produced for CPU usage.
Fullstimer is the preferred option for debugging and trying to increase program speed.

Memrpt: This option shows the amount of memory usage for each step. While memory
usage is not directly related to program speed in all cases, this data can be helpful when used
along with the CPU time data.

Msglevel=I: This option outputs information about the index usage during the execution of the
program.

Options Obs=N: This option can be very useful to test programs on a small subset of data. Care
must be taken to make sure that this option is turned off for production.
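
As a minimal sketch of how these options might be combined during a tuning session (the value 1000 for OBS= and the use of the sample table sashelp.class are illustrative assumptions, not settings from the projects described here):

```sas
/* Detailed timing statistics plus notes about index usage */
options fullstimer msglevel=i;

/* Limit every step to the first 1000 observations for a quick test run */
options obs=1000;

/* sashelp.class ships with SAS and stands in for a large table here */
proc means data=sashelp.class;
   var height weight;
run;

/* Restore full processing before the job goes to production */
options obs=max;
```

With FULLSTIMER on, the log note for each step reports real time separately from CPU time, which is the clock-time figure this paper focuses on.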

DATABASE/DATA SET ACCESS SPEED IMPROVEMENTS USING SQL
Since many SAS programmers access data that is stored in a relational database as well as SAS
data sets, this is a key area that can be changed to speed up program speed. In an ideal world,
the SAS programmers would be able to help design the database layout. But, that is not always
possible. In this paper, we will assume that the database design is complete, and the SAS
programmer accesses the data with no ability to change the database structure. There are
three main ways SAS developers access data in a relational database:

PROC SQL
LIBNAME access to a database
Converting database data to text files (this should be a last resort when no other method works, such as when using custom database software)

Most of the methods described here will work for all three methods for database access. One of
the primary reasons to speed up database access is that it is typically one of the easiest ways to
speed up a program. Database access normally uses a lot of input/output (I/O) to disk, which is
slower than reading data from memory. Advanced disk drive systems can cache data in
memory for faster access (compared to disk) but it's best to assume that the system you are
using does not have data cached in memory.
The simplest way to speed access to either databases or SAS data sets is to make sure you are
using indexes as much as possible. Indexes are very familiar to database programmers but
many SAS programmers, especially beginners, are not as familiar with the use of indexes. Using
data without indexes is similar to trying to find information in a book without an index or table
of contents. Even after a project has started, it's always possible to go back and add indexes to
the data to speed up access. Oracle has tools that can help a programmer determine which
indexes should be added to speed up database access; the system database admins (DBAs)
can help with the use of those tools.
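
For SAS data sets, adding an index after the fact could look something like the sketch below. The data set work.orders and its columns are hypothetical and are generated here only so the example runs on its own.

```sas
/* Hypothetical data set: one row per order */
data work.orders;
   do order_id = 1 to 100000;
      prod_id = mod(order_id, 500) + 1;
      amount  = round(ranuni(1) * 100, 0.01);
      output;
   end;
run;

/* Add a simple index on the column used most often in WHERE clauses */
proc datasets library=work nolist;
   modify orders;
   index create prod_id;
quit;

/* MSGLEVEL=I writes a note to the log saying whether an index was used */
options msglevel=i;

data work.one_product;
   set work.orders;
   where prod_id = 42;
run;
```

Whether SAS actually uses the index depends on how selective the WHERE clause is, so the log note is worth checking rather than assuming.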
There are many, many methods to speed up data access. This paper will list the methods the
author has used over the years that have worked well. A simple Google search on the topic of
SQL efficiency will find other methods that are not covered in this paper.

DBAs can be very helpful in making database queries run faster. If there is a query or set of
queries that is running long, a good first step is to get the DBAs to take a look at the query while
it is running to see exactly how the query is being processed by the database. In some cases,
the query optimizer will not optimize the query because of the way the SQL code is written.
The DBAs can make suggestions about how to rewrite the query to make it run better.
The first method is to drop indexes and constraints when adding data to a database table.
After the data is loaded, the indexes and constraints are restored. This speeds up the process of
data loading, because it's faster to restore the indexes than to update them every time a record
is loaded. This is very important if you are importing millions of records during a data load.
There is one problem to watch out for with this method: you have to make sure the data being
loaded is very clean. If the data is not clean, it could cause problems later when the indexes and
constraints are put back into the tables.
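
The same drop, load, and rebuild pattern can be sketched with SAS data sets. This is an illustration only: the text above describes database tables, and the names work.master and work.new_rows are hypothetical.

```sas
/* Hypothetical master table with an index, plus a batch of new rows */
data work.master(index=(prod_id));
   do prod_id = 1 to 1000;
      qty = 0;
      output;
   end;
run;

data work.new_rows;
   do prod_id = 1001 to 2000;
      qty = 1;
      output;
   end;
run;

/* Drop the index before the load */
proc datasets library=work nolist;
   modify master;
   index delete prod_id;
quit;

/* Load the new rows without any index maintenance */
proc append base=work.master data=work.new_rows;
run;

/* Rebuild the index once, after the load */
proc datasets library=work nolist;
   modify master;
   index create prod_id;
quit;
```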
The second method for speeding up database access is using the EXISTS operator in SQL
rather than the IN operator. For example
    select * from table_a a
    where exists (select * from orders o
                  where a.prod_id = o.prod_id);
is the best way to write an SQL statement with a subquery.
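
For comparison, the sketch below puts the IN form next to the EXISTS form in PROC SQL. The two small tables are made-up sample data; only the table and column names follow the example above.

```sas
/* Made-up sample data */
data work.table_a;
   input prod_id name $;
   datalines;
1 bolt
2 nut
3 gear
;
run;

data work.orders;
   input order_id prod_id;
   datalines;
100 1
101 3
102 3
;
run;

proc sql;
   /* IN form: the subquery builds the list of prod_id values */
   select * from work.table_a
   where prod_id in (select prod_id from work.orders);

   /* EXISTS form: correlated subquery that selects the same rows */
   select * from work.table_a a
   where exists (select * from work.orders o
                 where a.prod_id = o.prod_id);
quit;
```

Both queries select the rows for products 1 and 3. Which form runs faster on a given database is worth measuring rather than assuming.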
The third method is to avoid using SQL functions in WHERE clauses or other predicate clauses. An
expression using a column, such as a function having a column as an argument, can
cause the SQL optimizer to ignore the use of an index on that column.
This is an example of SQL code that should not be used:
    where to_number(SUBSTR(a.order_no, INSTR(b.order_no, '.') - 1))
        = to_number(SUBSTR(a.order_no, INSTR(b.order_no, '.') - 1))
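
A simpler illustration of the same idea, as a sketch in PROC SQL: a function wrapped around an indexed date column can often be replaced by a range test on the bare column. The table work.orders_dt and its contents are hypothetical.

```sas
/* Hypothetical table with one row per day and an index on the date */
data work.orders_dt(index=(order_date));
   do order_date = '01JAN2011'd to '31DEC2013'd;
      amount = 10;
      output;
   end;
   format order_date date9.;
run;

proc sql;
   /* Function applied to the column: the index is unlikely to help */
   select count(*) from work.orders_dt
   where year(order_date) = 2012;

   /* Same filter written as a range on the bare column */
   select count(*) from work.orders_dt
   where order_date between '01JAN2012'd and '31DEC2012'd;
quit;
```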

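Another example of a predicate that prevents the use of an index is:
    select name
    from orders
    where amount != 0;

The problem with this query is that an index cannot tell you what is not in the data! So the
index is not used in this case.
Change this WHERE clause to
    where amount > 0;
And the index will be used. (The two forms return the same rows only if amount is never negative.)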
The fourth method is advice to not use HAVING clauses in select statements. The reason for
this is simple: HAVING only filters rows after all the rows have been returned. In most queries,
you do not want all rows returned, just a subset of rows. Therefore, only use HAVING when
summary operations are applied to columns restricted by the WHERE clause.
    select state from orders where state = 'NC' group by state;
Is much faster than
    select state from orders group by state having state = 'NC';
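Another method is to minimize the use of subqueries; instead, use join statements when the
data is contained in a small number of tables.
Instead of this query:
    select ename
    from employees emp
    where exists (select price from prices
                  where prod_id = emp.prod_id and class = 'J');
Use this query instead:
    select ename
    from prices pr, employees emp
    where pr.prod_id = emp.prod_id and class = 'J';
The join returns one row per matching row in prices, so the two queries agree only when each
employee matches at most one price row.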
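The order that tables are listed in the SQL statement can greatly impact the speed of a query.
In most cases, the table with the greatest number of rows should be listed first in a query.
The reason is that the SQL parser moves from right to left rather than left to right. It scans the
last table listed, and merges all of the rows from the first table with the rows in the last table.
This advice applies mainly to rule-based optimizers; a cost-based optimizer with current
statistics usually chooses the join order itself.

For example, if table Tab1 has 20,000 rows and Tab2 has 1 row then
    select count(*) from Tab1, Tab2
is the best way to write the query

Instead of
    select count(*) from Tab2, Tab1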
When querying data from multiple tables it's very common to perform a join between the
tables. However, a join is not always needed. A very simple way to query two tables with one
query is to use the following code.
    select a.name, a.grade,
           b.name, b.grade
    from emp a, empx b
    where b.emp_no = 1010 and a.emp_no = 2010;
When performing a join with DISTINCT, it's much more efficient to use EXISTS rather than
DISTINCT:
    select date, name
    from sales s
    where exists (select 'X' from
                  employee emp
                  where emp.prod_id = s.prod_id);
('X' is a dummy value that is needed to make this query work correctly)
Instead of:
    select distinct date, name
    from sales s, employee emp
    where s.prod_id = emp.prod_id;
EXISTS is a faster alternative because the database realizes that when the subquery has been
satisfied once, it can stop evaluating the subquery for that row.

The performance of group by queries can be improved by removing unneeded rows early in the
selection process. The following queries return the same data. However, the second query is
potentially faster, since rows will be removed from the query before the set operators are
applied.
    select job, avg(pay_rate)
    from employees
    group by job
    having job = 'Manager';
Is not as good as
    select job, avg(pay_rate)
    from employees
    where job = 'Manager'
    group by job;

SAS MACRO SPEED INCREASES
It's very common for big data projects that use SAS to employ a lot of SAS macros. Macros save
a lot of time in coding, and they also make code much easier to reuse and debug. The
downside to macros is that if they are not used correctly, they can actually slow down a
program rather than speed it up. This is especially true if some of the macro debugging features
are turned on once the code is fully tested and ready to be put into production status. Here are
some techniques to use to make sure that macros do not slow down a program.
The basic tip for using macros is that after debugging the macro is complete, set the system
options NOMLOGIC, NOMPRINT, NOMRECALL, and NOSYMBOLGEN. If these options are not
used for a production job, there are two main problems that can happen. First, the log file can
grow to be very large. In some programs with a lot of code that loops many times, the log file
can grow so large that it can fill up the disk and cause the program to crash. The other problem
is that writing out all those log messages can greatly reduce the speed of the program because
disk I/O is a very slow process.
The first macro technique is to use compiled macros. A compiled macro runs faster because it
does not need to be parsed or compiled when the program runs. Macros should not be
compiled until they are fully tested and debugged. One caution with compiled macros is that
once they are compiled, they cannot be converted back into readable SAS source code. It is
essential to store the macro code in a safe place so that it can be modified or added to at a later
date. Another advantage of compiled macros is that the code is not visible to the user. This is
important if you are giving the code to a customer to use but they should not be allowed to
view the source code.
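
One way to set this up is the stored compiled macro facility. The sketch below is a minimal illustration: the libref mymacs and the macro row_count are hypothetical names, and the WORK directory is used only so the example runs anywhere. A production job would point the LIBNAME at a permanent directory.

```sas
/* Hypothetical library that will hold the compiled macro catalog */
libname mymacs "%sysfunc(pathname(work))";
options mstored sasmstore=mymacs;

/* The STORE option saves the compiled macro in the mymacs library */
%macro row_count(ds) / store;
   proc sql;
      select count(*) from &ds;
   quit;
%mend row_count;

/* Later sessions need only the LIBNAME and OPTIONS lines before the call */
%row_count(sashelp.class)
```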
Another way to speed up macros is to avoid nested macros where a macro is defined inside
another macro. This is not optimal because the inner macro is recompiled every time the outer
macro is executed. And when you are processing millions of records, recompiling a macro for
each record can really slow down the program. In general, it's also easier to maintain macros
that are not nested. It's much better to define two or more macros separately as shown below:
    %macro m1;
       <macro 1 code goes here>
    %mend m1;
    %macro m2;
       <macro 2 code goes here>
    %mend m2;
Instead of
    %macro m1;
       %macro m2;
       %mend m2;
    %mend m1;
Calling a macro from a macro will not slow down the processing, because it will not cause the
called macro to be recompiled every time the main macro is called:
    %macro test1;
       %another_macro   /* this macro was defined outside of macro test1 */
    %mend test1;

Although %include looks like part of the macro language, one big difference is that any code
that is put into the program with %include is not compiled as a macro. Therefore, it will run
faster than a normal macro. The best use for %include along with macros is to put simple
statements in the include file:
    %let dept_name = Sales;
    %let number_div = 4;
Another good idea for using %include to speed up a program is that an external shell program
that calls SAS can write out values as SAS code into a text file, which is then included into the
SAS code at the time of execution. This technique allows one SAS program to be written that is
very flexible. The way the program is run depends on inputs passed to the code from the
external program. The external program can be written in a variety of languages such as C, C++,
HTML, or Java. This method also means that the SAS code never has to be changed. Therefore,
there is less chance of bugs being introduced into the program. The SAS code can even be
stored as read-only so that no changes can be made to the source code.
Hereishowthismethodworks:

SAS source code is written with %include statements to subset data
The external shell program collects information from the end user, for example Species=Mice
The external shell program writes out a line for each piece of information collected: %let species=Mice; or if species='Mice';
SAS program is called from the shell program and %include files are executed
Program runs faster because the correct data subset is used

The external shell program can be simple or complex. The main goal is to collect information to
speed up the execution of the SAS program by making sure the correct data is used. The shell
program can also collect information from users to use in formatting of tables, colors, output
format (such as ODS methods), log location, output location, and so on.
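
The flow above can be sketched entirely in SAS. In this illustration a DATA _NULL_ step stands in for the external shell program, and the file name params.sas, its location in the WORK directory, and the sample data are all assumptions made so the example is self-contained.

```sas
/* Stand-in for the external shell program: write one %LET line to a file */
%let param_file = %sysfunc(pathname(work))/params.sas;

data _null_;
   file "&param_file";
   put '%let species = Mice;';
run;

/* The fixed SAS program starts here and never needs to change */
%include "&param_file";

data work.animals;
   length species $8;
   input species $ weight;
   datalines;
Mice 21
Rats 250
Mice 19
;
run;

/* Only the requested subset is processed */
data work.subset;
   set work.animals;
   where species = "&species";
run;
```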

GRID COMPUTING OPTION
The best option to speed up a big data project with SAS is to use grid computing. This option is
not inexpensive or simple to set up but for now it is the ultimate way to increase computing
power and decrease processing time for SAS processing. SSO currently uses a grid system for
one of our large retail projects that uses a lot of data, and has a very tight timeline to complete
the daily and weekly processing.

The key points for a grid environment at SSO are:

Windows servers for end-user access
One SAS server
One database server
Four or more grid servers, which are used for the main computing
RAID disk arrays for fast access to data
High speed (10 Gb) access between main server and grid nodes
SAS Grid computing software package

Here is a diagram of a typical grid computing layout used at SSO:

Figure 1. Grid Computing Layout for SSO

In this example, there are 12 grid nodes, each with 12 cores and 64 GB of RAM (the diagram
says 12 of 40 nodes because the other 28 nodes are used for development and test servers).
The DMZ means those systems and disks are located together in one location. There is a DMZ
for the main SAS server and an Oracle server, and a separate DMZ for the grid nodes.
The key advantages of grid computing are:

Improved program distribution and CPU utilization
Can be used for multiple users and multiple applications
Job, queue, and host management services
Grid nodes can be set up as hot backups for main server
Simplifies administration of multiple systems
Allows easy maintenance since grid nodes can be shut down without disrupting application
Provides real-time monitoring of systems and applications

In SSO, the grid nodes are not used for data loading (ETL) or reporting. They are used only for
the "number crunching" aspect of the project. The normal data flow for a grid computing
project in SSO is as follows:
1. Data is loaded daily and weekly using the SAS server and Oracle server
2. Data is partitioned into subsets that match the number of grid nodes (10 sets for 10 grid
nodes). The user can select the method for partitioning
3. During processing, the data is copied to the grid nodes for complex calculations
4. When processing is done, the results are copied back to the SAS and Oracle servers
It is technically possible to use a grid computing architecture without using the extra grid node
systems. This method still partitions the data, and processes the data in smaller batches. But, it
is not as fast as the full grid system shown above.
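
The partitioning in step 2 can be sketched with a single DATA step. The round-robin rule, the choice of four partitions, and the data set names are all assumptions for illustration; the projects described here let the user choose the partitioning method.

```sas
/* Hypothetical input data */
data work.big;
   do id = 1 to 1000;
      value = id * 2;
      output;
   end;
run;

/* Write each record to one of four partitions in round-robin order */
data work.part1 work.part2 work.part3 work.part4;
   set work.big;
   select (mod(_n_ - 1, 4) + 1);
      when (1) output work.part1;
      when (2) output work.part2;
      when (3) output work.part3;
      otherwise output work.part4;
   end;
run;
```

Each partition can then be processed as a separate batch, either on its own grid node or one after another on a single server.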
Of course, it is important to point out the disadvantages of a grid system:

Greatly increased cost for software and hardware vs. a non-grid system
More points of potential failure (grid nodes, connections, etc.)
Harder to set up and maintain a grid vs. a single-server system
A grid might not speed up all processing, such as data loads
Extra CPU processing time is used copying data back and forth to the grid nodes
More disk I/O is used in a grid system

GENERAL SAS PROGRAMMING IDEAS FOR FASTER PROCESSING OF BIG DATA
The topics covered so far in this paper might not apply to all SAS programs. The rest of this
paper will give advice on standard SAS programming: the data step
along with various procedures (procs). This is an important area to consider because it is
applicable to a wide variety of programs. Not every SAS program will use SQL or macros, but
nearly every SAS program will use at least one data step.
The basic idea to speed up processing with the data step is to reduce the amount of work that
SAS needs to do. One simple way to do this is to make sure that when possible, a section of
code is only executed one time instead of many times (one time for each record in the data).
For example, the easiest way to do this is by using the retain statement.
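
A minimal sketch of the RETAIN technique follows. The data set work.sales and the rate calculation are hypothetical; the calculation stands in for any costly expression whose result is the same for every record.

```sas
/* Hypothetical input data */
data work.sales;
   input amount;
   datalines;
100
250
75
;
run;

data work.converted;
   set work.sales;
   retain rate;                      /* keep the value across iterations */
   if _n_ = 1 then
      rate = exp(log(1.05) * 3);     /* computed once, on the first record only */
   adjusted = amount * rate;
run;
```

Without RETAIN, rate would be reset to missing at the start of each iteration and would have to be recomputed for every record.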
Another little-known method for increasing the speed of calculations in a data step involves the
use of missing values. If a variable is known to have a lot of missing values, it is a best practice
to list that variable last in a mathematical expression. For example, if the variable T4 has a lot of
missing values then
    Total = (x*b) + c*(abc) + T4;
Is more efficient than
    Total = T4 + (x*b) + c*(abc);
The reason for this is that if the missing value is early, that missing value is propagated through
all the calculations and SAS has to use more CPU time to compute the values and keep track of
the missing values. It is also a good idea to check for a missing value first by using code like this:

    if T4 ne . then do;
       Total = (x*b) + c*(abc) + T4;
    end;

In most cases, PROC FORMAT is a much faster way to assign values to data rather than using a
long list of if-then statements. Statements like this:
    if educ = 0 then neweduc = "<3 yrs old";
    else if educ = 1 then neweduc = "no school";
    else if educ = 2 then neweduc = "nursery school";
Should be converted to this code:
    proc format;
       value educf
          0 = "<3 yrs old"
          1 = "no school"
          2 = "nursery school";
    run;

    data new;
       set old;
       neweduc = put(educ, educf.);
    run;

In a similar manner, the use of the IN operator uses less CPU time than a group of OR conditions.
Instead of
    if x = 8 or x = 9 or x = 23 or x = 45 then do;
Use
    if x in (8, 9, 23, 45) then do;
The reason for this change is that with the use of OR, SAS checks all the conditions. The IN
operator stops after it finds the first matching number to make the expression true.
SAS uses more CPU time when it has to process larger volumes of data. A very easy way to
reduce the size of the data is to avoid using the default data size for variables. By default, all
SAS numeric variables have a size of 8 bytes. For many variables, 8 bytes is much larger than is
needed. For example, a variable that is used for the age of a person can easily be stored in 3
bytes, which means that the size of the data for that one variable has been reduced by 5/8 or
62.5%. When dealing with very large data sets that number in the hundreds of millions of
records, the CPU processing time savings can be substantial.
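
A short sketch of the LENGTH statement for this purpose is shown below; the data set work.people and its values are made up. Reduced lengths are safe only for integer values that fit in the shorter length, since a shortened numeric loses precision for fractions and very large numbers.

```sas
data work.people;
   length age 3;      /* 3 bytes instead of the default 8 */
   input age income;
   datalines;
34 52000
61 48000
;
run;

/* PROC CONTENTS lists the stored length of each variable */
proc contents data=work.people;
run;
```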
Many programs that were written in older versions of SAS can be changed to take advantage of
more modern SAS programming features. In older versions of SAS, procedures could not run on
a subset of data. If the analysis needed to be run on just one sex for example, a new data set
was created that included just members of the sex needed. Now it's much quicker to use a
subset statement in the procedure statements such as
    proc freq;
       where sex = 'Male'; run;
Or
    proc means;
       where sex = 'Female'; run;

In many cases, it is possible to write code using either traditional SAS DATA steps and PROCs or
write code using SQL statements in place of the DATA steps and PROCs. These two pieces of
code produce the same results:
    data abc;
       set old_data;
       keep name date city;
    run;
    proc sort data=abc;
       by name date city;
    run;
Vs
    proc sql;
       create table abc as
       select name, date, city
       from old_data
       order by name, date, city;
    quit;
One advantage of the SQL code is that it is more compact and easier to read (assuming
knowledge of SQL programming). The question comes up: which method is faster? The SQL
code appears to be faster since there is one step versus two steps in the data step code. The
truth is there is no easy answer to that question. The answer to which is faster really depends
on many different factors:

Amount of data processed
How many indexes are used
Hardware and software configuration (Windows versus Linux or Unix, PC versus Mainframe, and so on)
Type of analysis needed: can it be done in the database or only in SAS

If possible, it's a good idea to test DATA steps and PROCs versus SQL processing on a small
subset of data to determine which method is fastest.

SAS IN-DATABASE PRODUCT
SAS has a relatively new product called SAS In-Database. The big advantage of this product is
that it allows SAS jobs to run directly in the database server. Most databases have a very
limited feature set for statistical analysis; adding SAS directly into a database greatly increases
the amount of analysis that can be done without needing to pull data into SAS (using SQL or a
DATA step) and potentially sending the results back to the database. Currently, In-Database
works on the following databases: Aster database, EMC Greenplum, IBM DB2 and Netezza,
Oracle, and Teradata. In-Database uses massive parallel processing (MPP) to enhance system
scalability and performance. It's much better to move the processing to the data rather than
move the data to SAS, especially considering the fact that I/O is one of the main factors that can
slow down the speed of a program. The three parts of In-Database are:

SAS Scoring Accelerator
Analytics Accelerator for Teradata
SAS Analytic Advantage for Teradata

THREADS AND CPUCOUNT OPTIONS
These two options can be very helpful for speeding up processing but it's important to be
careful when you are using them. In general, it's best to use them only for very large data sets.
Using them on smaller data sets might actually slow down processing. The SAS system will
decide if these options are actually used based on different factors such as number of CPUs
installed in the system, or options selected for a given DATA step or procedure that is used. It's
also a very good idea to test the threads option versus no threads to make sure that the speed
does increase by using threads.
The best way to use the CPUCOUNT and threads options is:
    options threads cpucount=actual;
The ACTUAL value on the CPUCOUNT option tells SAS to use the actual number of CPUs installed in
the system. It might be tempting to try to use a higher number for CPUs. In reality, that does not
make the program run faster. Also, this option means the programmer does not have to spend time learning
how many CPUs are in the system.
A simple way to explain the threads option is that it divides the work up into smaller "chunks"
so that they can be worked on in parallel by different CPUs. This is very helpful in many
different SAS procedures. If the program uses PROC SQL with the pass-through option, the
threads option will have no impact because the SQL code is passed to the database where it is
executed. The pass-through option treats the database as a sort of "black box" that SAS has no
control over. However, it is possible the database system might use its own version of
multi-threading to speed up processing within the database.
Another important fact about the threads option is that the results can vary depending on what
type of hardware is used to run the SAS program. For example, a program that uses threads on
Linux might not work as well if the source code is run on Windows or a mainframe system.
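
A simple way to run the threads versus no-threads test recommended above is to sort the same data twice and compare the real time reported in the log. This is only a sketch: the data set work.big and its size are arbitrary, and the result will differ from one system to another.

```sas
options fullstimer threads cpucount=actual;

/* Arbitrary test data: two million rows with a random sort key */
data work.big;
   do i = 1 to 2000000;
      key = ranuni(1);
      output;
   end;
run;

/* Threaded sort */
proc sort data=work.big out=work.sorted_t threads;
   by key;
run;

/* Same sort with threading turned off for this step */
proc sort data=work.big out=work.sorted_nt nothreads;
   by key;
run;
```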

DON'T FORGET ABOUT MAKING OLD/EXISTING CODE RUN FASTER!
Programmers read technical papers such as this one and decide to start using these techniques
in the future. While that is a very good idea, all the information in this article can (and should)
be used to examine old code to determine if the old code can be improved. Just because old
code has been running without problems (sometimes for years) does not mean that the code is
efficient. "If it is not broken, don't fix it" is a good saying but sometimes a program might be
"broken" even when it produces the correct results. In this context, "broken" means that the
code can be changed to run faster while still producing the correct results.

CONCLUSION
With data volumes increasing all the time, it is important to always be mindful of ways to speed
up processing. It can be very tempting to simply "throw money at the problem" by buying
faster or bigger servers to run the SAS code. Better hardware can potentially speed up the
processing, but it is far from the cheapest way to increase performance. This paper presented
two basic ways to decrease processing time for big data projects: by using better programming
techniques with SQL, SAS macros, and general SAS programming techniques, and by using
multiple servers in a grid environment. The first three methods can be implemented at very low
costs, so they should be evaluated for all projects. For organizations with larger budgets or very
large amounts of data, the grid environment is a good choice to investigate.

ACKNOWLEDGEMENTS
I would like to thank the writing staff at SAS for editing help on this paper. I would also like to
thank all my co-workers at RTI, SRA and SAS who have helped me become a better SAS
programmer throughout my career. Special thanks to Dr. Bill Sanders at the University of
Tennessee (and later the director of the SAS EVAAS group) who showed me SAS programming
for the very first time.

CONTACT INFORMATION
Kevin McGowan
SAS Solutions on Demand
Kevin.McGowan@sas.com
(919) 531-2731
http://www.sas.com
SAS and all other SAS Institute Inc. product or service names are registered trademarks or
trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.
