DATA SCIENCE CONCEPTS
Susan Lund et al., Game Changers: Five Opportunities for US Growth and Renewal,
McKinsey Global Institute Report, July 2013.
http://www.mckinsey.com/insights/americas/us_game_changers
Two initial definitions
It involves data, but...
https://www.oreilly.com/ideas/what-is-data-science
It involves programming, but...
http://simplystatistics.org/2013/12/12/the-key-word-in-data-science-is-not-data-it-is-science/
It involves statistics, but...
http://magazine.amstat.org/blog/2013/07/01/datascience/
It is all of this (and even more?)
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
It is a process (?)
Doing Data Science, Rachel Schutt and Cathy O'Neil, O'Reilly, 2014
It is a process (?)
Introducing Data Science, Davy Cielen, Arno Meysman, Mohamed Ali, Manning, 2016
What exactly is a Data Scientist?
Networks and infrastructure.
Jeffrey Stanton et al, Interdisciplinary Data Science Education,
http://pubs.acs.org/doi/abs/10.1021/bk-2012-1110.ch006
T-Shaped Data Scientist
Doing Data Science, Rachel Schutt and Cathy O'Neil, O'Reilly, 2014
So you want to be a data scientist
You:
Have access (or can gain access) to thematic data collections in
varying degrees of organization and/or
Skill: Understanding the Problem
Skill: Finding and Organizing Data
Skill: Finding Data
Do I need to replicate/sample?
Big Data
3 Vs:
Volume: how much storage is needed. Depends on
technological capacity: storage and processing
capacity.
Big Data Lessons from the Climate Science Community, Seth McGinnis, 2016
Skill: Understanding How the Data Is Organized (1)
Before processing:
How are the data represented?
How should they be transformed?
Skill: Understanding How the Data Is Organized (2)
Will we collect the data repeatedly?
Skill: Understanding How the Data Is Organized (3)
Which technologies are needed?
NoSQL can be more flexible for data with differing structures.
Various approaches/implementations/models...
NoSQL
Key/value-based
Associative arrays, maps, or dictionaries.
Column-based
Extends key/value to multiple columns.
Cassandra, HBase
Document-based
Allows a hierarchy of keys/values/documents.
Graph-based
Stores nodes and the relationships between nodes.
Neo4J, OrientDB
https://www.digitalocean.com/community/tutorials/a-comparison-of-nosql-database-management-systems-and-models
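The four NoSQL models listed above can be contrasted with a small sketch. It uses plain Python structures only, not any real database client, and every record name and value in it is made up for illustration. The helper functions `neighbors` and `coauthors` are hypothetical and exist only to show how a graph model answers relationship questions.

```python
# Illustrative only: plain Python structures standing in for each NoSQL model.
# All keys, names, and values below are invented for this example.

# Key/value: one opaque value per key (an associative array).
key_value = {"user:1": "Ana", "user:2": "Bruno"}

# Column-based: each row key maps to its own set of named columns,
# and rows do not have to share the same columns.
column_family = {
    "user:1": {"name": "Ana", "city": "Recife"},
    "user:2": {"name": "Bruno", "email": "bruno@example.org"},
}

# Document-based: values are nested documents (keys, values, sub-documents).
documents = {
    "paper:1": {
        "title": "A study",
        "authors": [{"name": "Ana"}, {"name": "Bruno"}],
    }
}

# Graph-based: nodes plus explicit relationships between nodes.
nodes = {"Ana", "Bruno", "paper:1"}
edges = [("Ana", "AUTHOR_OF", "paper:1"), ("Bruno", "AUTHOR_OF", "paper:1")]


def neighbors(node, relation):
    """Return the nodes reachable from `node` through `relation`."""
    return sorted(dst for src, rel, dst in edges if src == node and rel == relation)


def coauthors(author):
    """Return other authors who share at least one paper with `author`."""
    papers = set(neighbors(author, "AUTHOR_OF"))
    return sorted(
        src
        for src, rel, dst in edges
        if rel == "AUTHOR_OF" and dst in papers and src != author
    )


print(coauthors("Ana"))
```

The same facts appear in each structure, but the questions that are cheap to answer differ: lookup by key in the first, per-row columns in the second, nested traversal in the third, and relationship queries in the fourth.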
Skill: Analysis (Hacking)
Basic knowledge of statistics/modeling is very useful.
Important reminder!
Skill: Analysis (Hacking): Python
Basic example
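The code from this slide did not survive extraction. As a stand-in, here is a minimal sketch (not the original example) that assumes a tab-separated file of (month, index) rows like the one used in the R example of this deck. The sample rows are the six shown in that R output. The function name `read_index` is hypothetical.

```python
import csv
import io
import statistics

# First rows of the series as printed in the R example; the full file
# 'dollar_vs_major_currencies_index.txt' is assumed to have the same layout.
SAMPLE = (
    "JAN 1973\t108.1883\n"
    "FEB 1973\t103.7461\n"
    "MAR 1973\t100.0000\n"
    "APR 1973\t100.8251\n"
    "MAY 1973\t100.0602\n"
    "JUN 1973\t98.2137\n"
)


def read_index(handle):
    """Parse tab-separated (month, index) rows into a list of tuples."""
    reader = csv.reader(handle, delimiter="\t")
    return [(month, float(value)) for month, value in reader]


rows = read_index(io.StringIO(SAMPLE))
values = [value for _, value in rows]

print(len(rows), "rows")
print("min:", min(values), "max:", max(values))
print("mean:", round(statistics.mean(values), 4))
```

To read the real file instead of the in-memory sample, the same function can be given an open file handle, for example `with open(path, newline="") as f: rows = read_index(f)`.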
Criticisms of Python
Skill: Analysis (Hacking): R
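Basic example:
> # Read the tab-separated file (no header row) and name its two columns.
> d = read.table('dollar_vs_major_currencies_index.txt',
header=FALSE, sep="\t", col.names=c("month", "index"))
> # Number of rows and columns.
> dim(d)
[1] 437 2
> # First six rows.
> head(d)
month index
1 JAN 1973 108.1883
2 FEB 1973 103.7461
3 MAR 1973 100.0000
4 APR 1973 100.8251
5 MAY 1973 100.0602
6 JUN 1973 98.2137
> # Plot the index values in file order.
> plot(d$index)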
RStudio!
Criticisms of R
Only R and Python?
Skill: Machine Learning, Models
Cautions:
Models can be far more complex than EDA suggests.
Model interpretability and validation are essential!
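One simple form of the validation mentioned above is a holdout check: fit on one part of the data and measure error on the part the model never saw. The sketch below is an illustration under stated assumptions, not a method taken from the slides. The data are synthetic (a made-up line plus noise), and the names `fit_line`, `mse`, and `holdout_split` are hypothetical.

```python
import random


def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mse(points, intercept, slope):
    """Mean squared error of the fitted line on the given points."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in points) / len(points)


def holdout_split(points, test_fraction=0.25, seed=0):
    """Shuffle a copy of the data and split it into train and test parts."""
    shuffled = points[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


# Synthetic data (invented for this sketch): y = 2 + 3x plus Gaussian noise.
rng = random.Random(42)
data = [(x, 2 + 3 * x + rng.gauss(0, 1)) for x in range(40)]

train, test = holdout_split(data)
intercept, slope = fit_line(train)
train_error = mse(train, intercept, slope)
test_error = mse(test, intercept, slope)

print(f"slope={slope:.2f} train MSE={train_error:.2f} test MSE={test_error:.2f}")
```

A test error far above the training error is a warning sign that the model has fit noise rather than structure. With a model this simple the fitted slope is also directly interpretable, which is not the case for more complex models.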
Skill: Communicating Results
Jupyter
Skill: Understanding the Problem (Better)
Data Science Concepts
Projects
LattesLab
Domain Expertise:
Knowledge of the analysis and reporting needs based on Lattes.
Analysis:
Text mining and pattern/graph matching.
Visualization.
Hacking:
Processing data in XML.
Data Products:
Similarity dictionaries for names and concepts.
Publication matching.
Annotation databases.
Graph databases.
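To make the idea of a name-similarity dictionary concrete, here is a minimal sketch. It is not LattesLab's actual implementation, which the slides do not describe. It assumes that names are compared after accent and case normalization, using the standard library's `difflib.SequenceMatcher`. The threshold of 0.8 and the sample names are invented for illustration.

```python
import difflib
import unicodedata


def normalize(name):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() else " " for ch in no_accents.lower())
    return " ".join(cleaned.split())


def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized names."""
    return difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def build_similarity_dictionary(names, threshold=0.8):
    """Map each name to the other names whose similarity meets the threshold."""
    result = {name: [] for name in names}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if similarity(first, second) >= threshold:
                result[first].append(second)
                result[second].append(first)
    return result


# Hypothetical author-name variants (not taken from any real dataset).
names = ["João da Silva", "Joao da Silva", "JOÃO DA SILVA.", "Maria Oliveira"]

for name, matches in build_similarity_dictionary(names).items():
    print(name, "->", matches)
```

The pairwise comparison is quadratic in the number of names, so a collection of any real size would need some blocking or indexing step before the comparison. That step is outside the scope of this sketch.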
S-Plus Virtual Observatory
Domain Expertise:
Basic knowledge of astronomy, complemented by the team.
Analysis:
Only for subprojects.
Hacking:
Data Products:
Object catalogs and search systems over them.
Metadata (provenance).
Data Science Concepts
References
Data Scientists at Work, Sebastian Gutierrez, Apress
Introducing Data Science, Davy Cielen, Arno D. B. Meysman, Mohamed Ali, Manning, 2016
Data Science at the Command Line, Jeroen Janssens
Beginning Data Science with R, Manas A. Pathak
Coming Soon!
Content breakdown:
Doing Data Science, Rachel Schutt and Cathy O'Neil, O'Reilly, 2014