
Cleaning Messy Data With OpenRefine

Scott Carlson
9/15/2015

What is OpenRefine?
OpenRefine (formerly Google Refine) is a powerful tool for working with messy data, including tools for cleaning it, transforming it from one format into another, extending it with web services, and linking it to databases.

How do I get OpenRefine?


Refine is open source and can be downloaded here. Please note that the latest stable version is still branded as Google Refine; developers are hard at work getting the first OpenRefine version to a stable release format.

For the sake of clarity, we'll simply refer to the software as Refine.

Let's Jump In
After you've successfully installed Refine, click on the application icon to launch the program. Refine opens in your default web browser, but it runs locally as a virtual server. If you use Windows, you'll see this running behind it:

Refine listens on port 3333 of your own computer, so once the program is started, you should see your default browser open to either
http://127.0.0.1:3333/

or
http://localhost:3333/

to show you the start screen.

As shown above, the start screen will allow you to create projects from known sources of information and file types, continue working on past projects, and import projects exported by others.

Refine is smart enough to be able to parse raw, unformatted data. As you can see, you can simply copy and paste data into Refine using the Clipboard function, but this will require formulating a certain amount of structure before you can start using it. For our purposes, it's better to start with structured sample data to get an idea of how the application works, so we'll use some sample data from the Powerhouse Museum in Sydney, Australia, which you can download here as a comma-separated values (CSV) file. Go ahead and select the file you downloaded under This Computer in the home screen.

You should now see a parsing window. Refine is showing you what your data will look like in the program. In the bottom right-hand corner, you should see this:

A couple of important things are going on here:

1) Refine has automatically taken the first row of the data and used it as the column headers.
2) Refine has automatically tried to parse the data into known formats: dates, numbers, etc.
3) Cells with no information are not only kept, but they are stored as null, which is separate from other values and will be important later on.
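
To make those three behaviours concrete, here is a minimal Python sketch of the same idea. It is not Refine's actual parser; the three-column sample and the helper names `parse_cell` and `load_rows` are made up for illustration, and the real Powerhouse file has different columns.

```python
import csv
import io

# Hypothetical sample data, not the real Powerhouse export.
SAMPLE = "Record ID,Height,Categories\n7,12.5,Clocks\n9,,\n"

def parse_cell(text):
    """Return None for an empty cell, a number when the text looks numeric, else the text."""
    if text == "":
        return None  # empty cells are kept, but stored as a distinct null value
    try:
        return int(text)
    except ValueError:
        pass
    try:
        return float(text)
    except ValueError:
        return text

def load_rows(csv_text):
    """Use the first row as column headers and parse every remaining cell."""
    reader = csv.reader(io.StringIO(csv_text))
    headers = next(reader)
    return [dict(zip(headers, (parse_cell(cell) for cell in row))) for row in reader]

rows = load_rows(SAMPLE)
```

The second data row keeps its empty cells as `None` rather than dropping them, which mirrors the null handling described in point 3.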

Click Create Project.

The Refine Interface


After the data loads, you will be staring at the default Refine interface for your new project. In the upper left-hand corner, you will see your project name (which can be changed by clicking on it) and the program's logo:

If you need to return to the start screen, click that logo.

Each of the columns from the data has a drop-down menu with different commands:

In time, we will go through a number of the most often used commands.

Facets
The core of Refine's power lies in its use of Facets. Facets allow you to take a macro-level look at a large amount of data by counting individual pieces of column data, and grouping them.

Let's try it out: on the Categories column, click the drop-down and go to Facets. Since these categories are going to be a form of text, go ahead and choose Text Facet.
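
Conceptually, a text facet is a tally of how often each distinct cell value occurs in a column. The sketch below illustrates that idea in Python; the function name `text_facet` and the sample category values are assumptions made for the example, not Refine internals.

```python
from collections import Counter

def text_facet(rows, column):
    """Count how many rows hold each distinct value in the given column."""
    return Counter(row[column] for row in rows)

# Hypothetical rows; real category values will differ.
rows = [
    {"Categories": "Clocks"},
    {"Categories": "Clocks|Timekeeping"},
    {"Categories": "Clocks"},
]
facet = text_facet(rows, "Categories")
```

Note that a cell holding several pipe-separated values is counted as one distinct value of its own, because the tally works on whole cells.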

Right away, we see there's something up: the categories have pipes (|) in them, signifying multiple entries for a single field:

Most of the data you'll work with will contain one value per field. But sometimes, like above, your data will have multiple values for certain fields. Refine can easily break apart this data, and put it back together later when you're done cleaning. To do this, we go back to the drop-down on Categories and select Edit Cells > Split Multi-Valued Cells:

Since we know the data is separated by a pipe, we enter that into the next window:

The Facet window is now updated.

You'll see that Refine has separated and counted 350 individual pieces of data in the column. The default view is an alphabetical list, but you can also see them by instance count, from most to least.

Side Note: After breaking apart those records, at the top of the main window, you'll see a count of the number of rows in the project set to 8097. This is an increase over the original number of rows, which was 3029 when we started. The increase is because each broken-apart category now has its own separate row in the data. If you don't want to lose the contextual information while cleaning data, you can select Show as Records, which will continue to group all the broken-apart data with its original parent row.
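
The relationship between rows and records can be sketched in a few lines of Python. This is only an illustration under assumed names (`split_multi_valued`, `join_multi_valued`, and a `record` key that remembers the parent row); Refine's own data model is more involved.

```python
def split_multi_valued(records, column, separator="|"):
    """Give each separated value its own row, remembering which record it came from."""
    rows = []
    for record_id, record in enumerate(records):
        for part in record[column].split(separator):
            rows.append({"record": record_id, column: part})
    return rows

def join_multi_valued(rows, column, separator="|"):
    """Reverse the split: gather the rows of each record back into one cell."""
    joined = {}
    for row in rows:
        joined.setdefault(row["record"], []).append(row[column])
    return [separator.join(parts) for _, parts in sorted(joined.items())]

# Hypothetical category values.
records = [{"Categories": "Clocks|Timekeeping"}, {"Categories": "Sculpture"}]
rows = split_multi_valued(records, "Categories")
rejoined = join_multi_valued(rows, "Categories")
```

Two records become three rows after the split, which is the same effect that raises the row count in the project, while the `record` key plays the role of the Show as Records grouping.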

Inspecting and Cleaning


Now that our column data is separate, we can get to work inspecting and cleaning. In the facet window, there's a button labeled Cluster. Click it.

Clustering is Refine's way of comparing the data in a column against itself to look for inconsistencies. Refine uses two methods (Key Collision and Nearest Neighbor) with different functions to look for potential data inconsistencies. These are a lot of fun to play around with, but for our purposes, we will stick with the default Cluster method, Key Collision / Fingerprint, which is designed to provide as few false positive results as possible.
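
The idea behind key collision is that each value is reduced to a normalized key, and values that share a key are proposed as a cluster. The following Python sketch approximates a fingerprint-style key; it is an assumption-laden simplification (the exact normalization steps and their order in Refine may differ), and the function names and sample values are invented for the example.

```python
import string
import unicodedata
from collections import defaultdict

def fingerprint(value):
    """Build a normalized key: ASCII-folded, lowercased, punctuation-free, sorted unique tokens."""
    # Fold accented characters to plain ASCII where possible.
    ascii_text = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    # Trim, lowercase, and drop punctuation.
    cleaned = ascii_text.strip().lower().translate(str.maketrans("", "", string.punctuation))
    # Sort the unique tokens so that word order and repeats do not matter.
    return " ".join(sorted(set(cleaned.split())))

def key_collision_clusters(values):
    """Group distinct values whose fingerprints collide; keep groups with more than one member."""
    buckets = defaultdict(set)
    for value in values:
        buckets[fingerprint(value)].add(value)
    return [sorted(group) for group in buckets.values() if len(group) > 1]

# Hypothetical category values that differ only in capitalization.
clusters = key_collision_clusters(["Musical instruments", "Musical Instruments", "Clocks"])
```

Because the key ignores case, punctuation, and word order but nothing else, it rarely groups values that are genuinely different, which fits the low false-positive behaviour described above.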

The sample of Powerhouse data is relatively clean, though as we can see, there are a few instances where capitalization differs between instances:

The question this raises is: what is the formatting standard for this data? How should these categories be entered? The existing data leans heavily on the first word of the category being capitalized, though there are several that do not follow this formatting.

Back to the Cluster window. We definitely want to merge these records to make the data more consistent, so tick the Merge? boxes next to each. We also want to make sure we're using the right form of the headings. If we click on the form of heading we want, it will populate the New Cell Value space automatically. Your window should now look like this:

Now click Merge Selected and Re-Cluster to make the changes final.

Just for fun, let's try another Cluster method: Nearest Neighbor. The default function is Levenshtein. By playing around with the Radius and Block Characters, we can change the parameters for looking for possible inconsistencies. Set yours to a 2.0 radius and 5 Block Characters.

Our hits increased, but unfortunately, all but one are false positives. The only one that looks like an inconsistency is the last, Sculpture vs. Sculptures:

Again, let's refer back to the category list in our facet window. The majority of category terms are in the plural form, so we might as well change this to make it more consistent. Merge only that Cluster set to Sculptures.

Side Note: Now might be a good time to mention that Refine is powerful but not infallible: you get out of it the level of detail you request from it. For example, for the level of cluster searching we did, Refine missed two potentially inconsistent terms: Retail Equipment and Retailing Equipment.

The upshot is that, like any method, you're going to have to make double- and triple-checking your data a priority.
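
Levenshtein distance counts the single-character insertions, deletions, and substitutions needed to turn one string into another. The sketch below is a plain textbook implementation in Python, not Refine's code, and comparing the distance directly against a radius of 2 is a simplifying assumption that ignores how Refine's blocking step works.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]

RADIUS = 2  # assumed to behave as a simple distance cutoff
pairs = [("Sculpture", "Sculptures"), ("Retail Equipment", "Retailing Equipment")]
within_radius = {pair: levenshtein(*pair) <= RADIUS for pair in pairs}
```

Under that assumption, Sculpture and Sculptures are one edit apart and fall inside the radius, while Retail Equipment and Retailing Equipment are three edits apart, which may help explain why a 2.0 radius did not pair them.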

Let's finish up this part of cleaning: we still need to reattach the multi-valued cells. Go back to the drop-down menu and select Edit Cells > Join Multi-Valued Cells. Then, enter the separator between each (let's stick with a single pipe) and click OK.

Congratulations, you just cleaned data!

Project History
By now, you may have noticed that the Facet window area has another tab next to it, labeled Undo/Redo. This is the project's history: every data transformation you do is tracked here. If you make a mistake, you can easily step backward, all the way to the beginning of the project.

Manually Changing Data


What happens if we want to change data that isn't related to clustering? Let's say, for example, that the category term Audio records displeases us in some way, and all three instances of the term should be changed to Vinyl records.

If you move your mouse pointer over the term in the facet window, you'll see that the option to Edit the term comes up.

By changing the text in the edit box and clicking Apply, you will automatically change all instances in the data to Vinyl records.


What happens if multiple terms need to be changed? You can change them one by one, but that can get time-consuming. Luckily, there are a couple of ways to do lots of things at once.

Let's say, for whatever reason, you want to change any categories that have the word Advertising in them to the broader term Advertisements. Because the keyword advertising could be found in varying forms under different terms, let's use a filter to find all of them. Under the Categories drop-down menu, select Text Filter. A new window will open below the facet. Let's try searching for advert, which is general enough to find varying forms of the word:

Only three terms come up, and the main window is now restricted to the 12 rows with any of those three terms. (If you're viewing as Records, switch to rows for the next part.)
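
A text filter amounts to keeping only the rows whose cell contains the search string. Here is a small Python sketch of that behaviour; the function name, the sample values, and the choice to ignore case are assumptions made for illustration.

```python
def text_filter(rows, column, query):
    """Keep rows whose cell contains the query, ignoring case; null cells never match."""
    needle = query.lower()
    return [row for row in rows if needle in (row[column] or "").lower()]

# Hypothetical category values.
rows = [
    {"Categories": "Advertising"},
    {"Categories": "Advertisements"},
    {"Categories": "Clocks"},
    {"Categories": None},
]
matches = text_filter(rows, "Categories", "advert")
```

Searching for a short stem such as advert catches several spellings at once, which is why it is a better query here than the full word.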

Let's go back to the drop-down menu for Categories. Click Edit Cells > Transform...


The transform window is where we get into the nitty-gritty of Refine's cleaning power. In this window, you can execute custom-written commands using a couple of different languages, the most important being the default language, GREL (Google Refine Expression Language).


There's quite a lot to work with here, and if you have the time and experience, you can come up with some pretty amazing tricks. (More advanced GREL recipes can be found here.)

You'll notice the word value is logged in the command window. This variable stands in for the original value in the cells we're going to change. Underneath the command window, you can see all of the changes your code will inflict.

What we want to do is to change the existing values to the same term: Advertisements. To do this, we wipe out the value command and replace it with our new category, in quotation marks. The quotation marks tell Refine that there are no commands to run, just a new literal value to replace the previous values:

Click OK. You can now remove the Text Filter and see that your previous categories are now merged together.
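
The transform step can be thought of as running one expression against every cell that is currently selected by the filter. The Python sketch below illustrates that idea; `transform`, the `selected` predicate, and the sample values are hypothetical names and data, and the lambda stands in for a GREL expression that ignores `value` and returns a constant.

```python
def transform(rows, column, expression, selected=lambda row: True):
    """Replace each selected cell with the result of expression(original value)."""
    for row in rows:
        if selected(row):
            row[column] = expression(row[column])
    return rows

# Hypothetical category values.
rows = [
    {"Categories": "Advertising"},
    {"Categories": "Advertising material"},
    {"Categories": "Clocks"},
]

transform(
    rows,
    "Categories",
    expression=lambda value: "Advertisements",                        # constant replacement
    selected=lambda row: "advert" in row["Categories"].lower(),       # the active text filter
)
```

Rows outside the filter are left untouched, which is why the filter should stay in place until the transform has been applied.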

Exporting
When your data has been cleaned to your satisfaction, you can export it to a variety of different file formats. Be sure all text filters and facet windows are closed, and that you have joined any split multi-valued data before you export.

Click on the Export button in the upper right-hand corner of the screen:


Export Project will save a copy of your Refine project that can be shared and opened in other locations. For most people, exporting to a spreadsheet or other document in the second tier of file types will be just fine.

Side Note: For whatever reason, Refine exports Excel spreadsheets in a previous Office format. If exporting to Excel, your best bet is to re-save the file in the most current Office format, to stave off any future formatting problems.

Unlocking More of Refine's Power


So far, we've seen Refine's abilities to clean messy data. But what if our data isn't messy, but woefully incomplete? Refine has some fantastic methods to enrich certain kinds of preexisting data. In this example, we will be taking publicly available data from the Internet Movie Database (IMDb) and enriching it by using an external web API (application programming interface).

A number of websites allow the public to extract metadata from their sites and databases via web APIs. For example, Google has an API that turns addresses into geocoded metadata.

Clicking on

https://maps.googleapis.com/maps/api/geocode/json?address=6100+Main+St,+Houston,+TX+77005

...returns the following (truncated) response:

{
  "results": [
    {
      "address_components": [
        {
          "long_name": "6100",
          "short_name": "6100",
          "types": ["street_number"]
        },
        {
          "long_name": "South Main Street",
          "short_name": "S Main St",
          "types": ["route"]
        },
        {
          "long_name": "South Central Houston",
          "short_name": "South Central Houston",
          "types": ["neighborhood", "political"]
        }...

This may look like gibberish, but it's really JSON, or JavaScript Object Notation, a standard format for exchanging structured data. The majority of web APIs transmit their shareable data using JSON.
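
To show that JSON is easy for a program to read, here is a short Python sketch that parses a shortened copy of the response above and pulls out one field. The helper name `component` is invented for the example, and the embedded text is a hand-trimmed excerpt, not a live API result.

```python
import json

# A shortened, hand-copied excerpt of the geocoding response shown above.
RESPONSE = """
{
  "results": [
    {
      "address_components": [
        {"long_name": "6100", "short_name": "6100", "types": ["street_number"]},
        {"long_name": "South Main Street", "short_name": "S Main St", "types": ["route"]}
      ]
    }
  ]
}
"""

def component(parsed, wanted_type):
    """Return the long_name of the first address component tagged with wanted_type."""
    for item in parsed["results"][0]["address_components"]:
        if wanted_type in item["types"]:
            return item["long_name"]
    return None

parsed = json.loads(RESPONSE)
street = component(parsed, "route")
number = component(parsed, "street_number")
```

Once parsed, the nested braces and brackets become ordinary dictionaries and lists that can be navigated by key and position.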

IMDb also has a lot of great information. Unfortunately, much of it is locked inside the site: IMDb has no public API, and makes little effort to make its data downloadable. Fortunately, the site does have one downloading loophole: public lists of films, generated by the site's users. If logged in, one can easily download the bare-bones metadata of the films in the list as a CSV file.


For the purposes of this exercise, metadata from a list of the top 1001 movies on the IMDb was acquired and put into an Excel spreadsheet, which can be downloaded here.

Once loaded into Refine, we can see the metadata is useful, but incomplete. We would like to add other key metadata about the films, such as top-billed actors, plot summaries, and more.

However, since IMDb doesn't offer an API for us to query, we will have to go elsewhere. Fortunately, we can use the Open Movie Database (OMDb), a web API that has been loaded with user-submitted information on films. The OMDb's data is linked to the IMDb via the latter's unique identifiers. The command for the OMDb to return JSON data is:
http://www.omdbapi.com/?i=[IMDbID]&plot=full&r=json

Luckily for us, our IMDb download already includes the unique ID for each film. This means we can automatically generate an OMDb API search for each entry in the data set. We do this by going to the drop-down menu of the ID column (const) and selecting Edit Column > Add column by fetching URLs.


This will call up a code window not unlike the one we saw in Transform. The difference here is that we construct the API requests and Refine will automatically run them for us, returning the full JSON results in a new column for each film. To do this, we will enter the following GREL expression for our new JSON URL column:
"http://www.omdbapi.com/?i="+value+"&plot=full&r=json"


Because there's literally a thousand searches to be done here, it will take a while for the results to roll in. But once finished, we now should have an entire column of JSON data that we can use to enrich our previous data. What's left is for us to parse the JSON and extract the data that we want.
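
The URL-building step above is plain string concatenation: one request URL per ID in the const column. As a rough Python equivalent, the sketch below builds the same URLs without fetching anything; the function name `omdb_url` and the two IDs are placeholders invented for the example.

```python
from urllib.parse import urlencode

BASE = "http://www.omdbapi.com/"

def omdb_url(imdb_id):
    """Build the request URL for one film, mirroring the GREL expression's parameters."""
    return BASE + "?" + urlencode({"i": imdb_id, "plot": "full", "r": "json"})

# Placeholder IDs, not taken from the real list.
ids = ["tt0000001", "tt0000002"]
urls = [omdb_url(imdb_id) for imdb_id in ids]
```

Each URL differs only in its `i` parameter, which is why a single expression built around `value` is enough to generate the whole column of requests.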

We will start by selecting Edit Column > Add column based on this column from the drop-down on the JSON column. We will then use a GREL expression built on the command value.parseJson(), which parses the JSON value of that column into a data element we can work with programmatically.

We'll start with a GREL expression to get actor data for each movie:
value.parseJson()["Actors"]

We now have a column with a string of actor metadata. If we wanted to, we could take what we learned in the previous section and split apart those actor names and see if they need to be cleaned up.
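
For readers more familiar with general-purpose languages, the parse-and-extract step looks roughly like this in Python. The cell contents are a made-up placeholder record, not real OMDb output, and the helper name `extract` and the comma separator between actor names are assumptions for illustration.

```python
import json

# Placeholder JSON standing in for one cell of the fetched column.
cell = '{"Title": "Example Film", "Actors": "Actor One, Actor Two, Actor Three", "Plot": "A placeholder plot."}'

def extract(value, key):
    """Parse a JSON cell and return one field, or None when the cell is empty or invalid."""
    try:
        return json.loads(value).get(key)
    except (TypeError, ValueError):
        return None

actors = extract(cell, "Actors")
# Splitting the single string gives one name per entry, ready for cleaning.
actor_list = [name.strip() for name in actors.split(",")]
```

Returning `None` for cells that failed to fetch keeps one bad response from stopping the whole column.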

Let's do one more: we'll add a column with the entire plots from the JSON results:
value.parseJson()["Plot"]

Once this plot data is added, we could then extend this data by using a text filter to search for keywords and phrases, giving us an opportunity to look for patterns in the top 1,000 movies on IMDb.


Useful Refine Resources

OpenRefine Home Page
http://openrefine.org/

OpenRefine Users Wiki
https://github.com/OpenRefine/OpenRefine/wiki

OpenRefine GitHub Page
https://github.com/OpenRefine/OpenRefine

Using OpenRefine, a how-to/cookbook by Ruben Verborgh and Max De Wilde
https://www.packtpub.com/bigdataandbusinessintelligence/usingopenrefine

Additional questions? Contact me at sjc5@rice.edu.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
