
ANALYTICS IN THE BIG DATA ERA

ANALYTICS TECHNOLOGY AND ARCHITECTURE TO MANAGE VELOCITY AND VARIETY,
DISCOVER RELATIONSHIPS AND CLASSIFY HUGE AMOUNTS OF DATA

MAURIZIO SALUSTI, SAS

Copyright 2012, SAS Institute Inc. All rights reserved.

AGENDA

From DBMS to BIG DATA
Architectural Considerations
Big Data Analytics
Methods
Data Discovery: Visual Analytics

Copyright 2013, SAS Institute Inc. All rights reserved.

WHAT IS BIG DATA?

DATA are everywhere:

IT organizations often collect large amounts of data in an EDW (enterprise data warehouse), but then
need to integrate it with many other sources.

The ability to generate, communicate, share, and access information has
been revolutionized by the increasing number of people, devices, and
sensors that are now connected by digital networks.
People leave information in networks.
Devices provide information in many ways.
Data are a continuous stream of information.
Data are not only measures but also text, images, and sounds.

CURRENT COMPANY DATA ORGANIZATION

DATA ARE DEPLOYED AS SNAPSHOTS OF INFORMATION:

DATA WAREHOUSE
ANALYTICAL DATAMARTS
The same information is replicated in several data structures, which makes
updates slow and data renewal slow.

Spreading information requires drastic changes to the paradigm of how
companies collect their data and how they use it:
Customer data are not only in the company's customer DB. These data
give only a partial view of customers: e.g., Telco operators collect customer
voice and SMS traffic, while many of their customers establish
contacts using social media and apps.
Customers can give many signals about market preferences, like a
sensor on the market, but current data storage structures and their
analytics tools are not able to deal with these data.

TRENDS IN COMPANY DATA ORGANIZATION

NEEDS:

TO AVOID DATA PROLIFERATION
TO PROVIDE SEVERAL SCENARIOS FROM THE SAME DATA
DATA ENRICHMENT WITH SEVERAL SOURCES
QUICK DATA RENEWAL
TO PROVIDE PATTERNS OF CHANGING SCENARIOS

Big data refers to datasets whose size is beyond the ability of typical
database software tools to capture, store, manage, and analyze.
The ability to store, aggregate, and combine data and then use the results to
perform analysis in motion has become ever more accessible.


NEW QUESTIONS

Data are not always in a structured data model
Often we need to join data that do not share the same keys
Often data arrive in a periodic flow, in near real time
Often we need to recognize patterns in frequently changing data

New ways are needed to manage data that are distributed and not structured in the
classical way:
We need a different paradigm to organize data and, above all, to query them.
Collecting several sources and managing them opens several new problems:
Relational data (GRAPH DATA) can be useful to understand how events
spread in a population.
Data in motion coming from several tools in the field (sensor
devices, smartphones) provide dynamic patterns, often without a
history of their form.

ANALYSIS

You cannot always apply sampling to extract data
You cannot always join data to define an ABT (analytical base table)
Often you need to know how the environment can influence
an event, such as a purchase, a choice, or a change.
Often we need to merge information collected for
different purposes.

SQL queries are often useless for reaching these data:

Information is not organized into DB structures
Data provide information in very different ways: e.g., text
is not easy to query using traditional query languages.

Merging is driven by fuzzy keys, where you can group
information according to statistical relationships.
An event can be driven by relationships with other data
rather than by a specific behavior.
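
The idea of merging on fuzzy keys can be sketched as follows. This is a minimal illustration, not a SAS method: it assumes string similarity from Python's standard `difflib` module as the matching rule, and the records, the `fuzzy_join` helper, and the 0.8 threshold are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score between two normalized keys."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def fuzzy_join(left, right, key, threshold=0.8):
    """Pair each left record with its best-scoring right record,
    keeping the pair only if the score reaches the threshold."""
    matches = []
    for l_row in left:
        best_row, best_score = None, 0.0
        for r_row in right:
            score = similarity(l_row[key], r_row[key])
            if score > best_score:
                best_row, best_score = r_row, score
        if best_row is not None and best_score >= threshold:
            matches.append((l_row, best_row, round(best_score, 2)))
    return matches


# Hypothetical records: a company customer DB and a social media source
crm = [{"name": "Maria Rossi", "plan": "voice+sms"},
       {"name": "Luca Bianchi", "plan": "data"}]
social = [{"name": "maria rosi", "handle": "@mrossi"},
          {"name": "Giulia Verdi", "handle": "@gverdi"}]

pairs = fuzzy_join(crm, social, key="name")
for l_row, r_row, score in pairs:
    print(l_row["name"], "<->", r_row["handle"], score)
```

Records with no sufficiently similar counterpart are left unmatched rather than forced into a pair, which is the main difference from an exact-key join.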

BIG DATA

What types?


ARCHITECTURAL CONSIDERATIONS


DBMS and datamarts help to
analyze data coming from one
central data point.
You only need to know where the data
are and what they mean.
Queries are managed directly by the
DBMS.

Data are stored in different places, and
you have to know the relationship
MAPPING across the different
sources.
Here, before you extract data, your
query has to know where in the
network the data reside.
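
One way to picture this mapping is a small catalog that a query consults before touching any source. The sketch below is only an illustration under stated assumptions: the `CATALOG` contents, the physical locations, and the `plan_query` helper are hypothetical, and the source names are borrowed from the systems listed elsewhere in this deck.

```python
# Hypothetical catalog: logical dataset name -> (source system, physical location).
CATALOG = {
    "customers": ("teradata", "crm.customers"),
    "web_clicks": ("hadoop", "/logs/clicks"),
    "call_records": ("oracle", "billing.cdr"),
}


def plan_query(datasets):
    """Group the requested datasets by the source that holds them,
    so that each source is contacted once."""
    plan = {}
    for name in datasets:
        if name not in CATALOG:
            raise KeyError(f"no mapping for dataset {name!r}")
        source, location = CATALOG[name]
        plan.setdefault(source, []).append(location)
    return plan


plan = plan_query(["customers", "web_clicks"])
print(plan)
```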

MULTI POINT DATA HUB

BUILDING BLOCKS OF A BIG DATA ANALYTICS PROCESS

ANALYTICS


REFERENCE
ARCHITECTURE

EXAMPLE SAS-RACK IMPLEMENTATION

[Diagram labels: CLIENT, GREENPLUM, HADOOP, TERADATA, ORACLE]

[Diagram labels: Input, Hadoop, Output, Visual Analytics, Metadata, High Performance Analytics]

[Diagram labels: In memory, GRID COMPUTING, In Database, Input, Output, Visual Analytics, Metadata, Analytical Tool, High Performance Analytics]

BIG DATA ANALYTICS


SAS HIGH-PERFORMANCE
ANALYTICS

Worrying about software performance is not a new
concept at SAS

What is New?

Dedicated high-performance software
Accelerated development

Why Now?

Customer needs
Blade systems have proven viable platforms for high-performance
computing
New computing paradigms
Partnerships with MPP database vendors

SAS
PROCEDURES

THEN AND NOW

THEN:

proc logistic data=TD.mydata;
   class A B C;
   model y(event=1) = A B B*C;
run;

Single-threaded
Not aware of distributed computing environment
Runs on client

NOW:

proc hplogistic data=TD.mydata;
   class A B C;
   model y(event=1) = A B B*C;
run;

Multi-threaded
Aware of distributed computing environment
Runs on client or DBMS appliance


HP PROCS IN SINGLE SERVER

libname disk base '/filesys';
proc hpreg data=disk.source;
   model y = x1 x2;   /* stands in for the "analytic stuff"; y, x1, x2 are hypothetical */
run;

OPERATING SYSTEM

SAS Process

SAS Process Steps:

(1) SAS process starts on HW & O/S
(2) SAS sets up access library to disk
(3) SAS starts HPREG PROC
(4) HPREG reads data through ACCESS during computation*
(5) Multiple threads are launched to process the incoming data
(6) As execution continues, temporary data is written out to utility files on disk

*SMP HP PROCS do not load the entire source dataset into RAM; the SAS process uses the
MEMSIZE option as a boundary. This is no different from
MVA or regular procs, the DATA step, etc.
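
The footnote's point, that a regression can be computed without holding the whole dataset in RAM, can be sketched as follows. This is a Python illustration of the general idea, not how HPREG is implemented: it assumes a simple one-predictor regression fitted from running sums, and the `chunks` helper, the chunk size, and the generated data are hypothetical.

```python
from itertools import islice


def chunks(rows, size):
    """Yield lists of at most `size` rows, so only one chunk is in memory at a time."""
    it = iter(rows)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block


def streaming_simple_regression(rows, chunk_size=1000):
    """Fit y = intercept + slope * x by accumulating running sums over chunks."""
    n = sx = sy = sxx = sxy = 0.0
    for block in chunks(rows, chunk_size):
        for x, y in block:
            n += 1
            sx += x
            sy += y
            sxx += x * x
            sxy += x * y
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


def source():
    """Hypothetical source: a generator standing in for rows read from disk."""
    for i in range(10_000):
        x = float(i)
        yield x, 3.0 + 2.0 * x


intercept, slope = streaming_simple_regression(source(), chunk_size=500)
print(intercept, slope)
```

Only five running totals survive between chunks, so the memory needed is bounded by the chunk size rather than by the size of the source table.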


[Diagram labels: Disks /filesys, Temp/Utility files to support SAS Process, SAS Datasets]

HP PROCS IN DISTRIBUTED ARCHITECTURE

HADOOP HDAT SHARED-RACK EXAMPLE

libname a sashdat;
option set=GRIDHOST="NAMENODE";
proc hpreg data=a.source;
   model y = x1 x2;   /* stands in for the "analytic stuff"; y, x1, x2 are hypothetical */
   performance nodes=all;
run;

SAS Process Steps:

(1) SAS process starts on HW & O/S
(2) SAS sets up access library to disk
(3) SAS starts HPREG PROC
(4) Due to GRIDHOST and the proper access engine setting, multi-threaded processes
are started on grid nodes (via TKGrid)
(5) As TKGrid processes start up, ALL data is lifted into RAM from HDFS
(6) Processing occurs in parallel against in-memory data
(7) Results return to the initiating process on the SAS server
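
Steps (4) to (7) follow a partition, compute, and combine pattern, which can be sketched as below. This is a Python illustration under stated assumptions, not the TKGrid mechanism: threads stand in for grid nodes, each "node" holds its partition in memory, and the data and the `partial_sums` and `combine` helpers are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sums(partition):
    """Work done on one node: sufficient statistics for its in-memory partition."""
    n = len(partition)
    sx = sum(x for x, _ in partition)
    sy = sum(y for _, y in partition)
    sxx = sum(x * x for x, _ in partition)
    sxy = sum(x * y for x, y in partition)
    return n, sx, sy, sxx, sxy


def combine(parts):
    """Work done on the initiating process: add up the partial results and solve."""
    n, sx, sy, sxx, sxy = (sum(values) for values in zip(*parts))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Hypothetical data, split round-robin across three simulated nodes
data = [(float(x), 1.0 + 0.5 * x) for x in range(300)]
partitions = [data[i::3] for i in range(3)]

with ThreadPoolExecutor(max_workers=3) as pool:
    parts = list(pool.map(partial_sums, partitions))

intercept, slope = combine(parts)
print(intercept, slope)
```

Each node returns only five numbers, so the traffic back to the initiating process stays small however large the partitions are.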

[Diagram labels: HADOOP NAMENODE, OPERATING SYSTEM, SAS Process, NODE 1, NODE 2, NODE N, each node holding Data; the callouts 2, 3, 4, and 6 refer to the numbered process steps]

Big data analysis can be done using several analytic
strategies.
SAS brings together many different methods, many of them
coming from traditional statistical inference analysis
using the SEMMA paradigm.
Others come from stochastic process analysis, for both
continuous and discrete events.
Others come from linear and nonlinear mixed
models.

Graph analysis


METHODS


ANALYTICAL CATEGORIES AND TARGET USAGE

Statistics
   Binary target & continuous number predictions
   Linear, nonlinear, & mixed linear modeling

Data Mining
   Complex relationships
   Tree-based classification
   Variable selection

Text Mining
   Parsing large-scale text collections
   Extract entities
   Auto. stemming & synonym detection

Forecasting
   Large-scale, multiple hierarchy problems

Econometrics
   Probability of events
   Severity of random events

Optimization
   Local search optimization
   Large-scale linear & mixed integer problems
   Graph theory

Data coming from different sources can be tied together using
different methods, such as canonical decomposition.
Pattern variability in data in motion, such as data
coming from devices, can be sampled, or the pattern
distribution can be simulated, using Markov chain Monte Carlo (MCMC) methods.
Sparse vector data with missing values can be simulated
using MCMC or other regression methods.
Discrete choice among different events can be modeled
using multinomial discrete models.
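
To make the MCMC idea concrete, here is a minimal random-walk Metropolis sampler. It is a sketch under stated assumptions, not the procedure used by any SAS product: the target is a standard normal distribution chosen only because its mean and variance are known, and the step size, seed, and number of steps are arbitrary.

```python
import math
import random


def metropolis(log_density, start, steps, step_size, rng):
    """Random-walk Metropolis: propose a nearby point and accept it
    with probability min(1, density ratio); otherwise stay put."""
    x = start
    samples = []
    for _ in range(steps):
        proposal = x + rng.gauss(0.0, step_size)
        log_ratio = log_density(proposal) - log_density(x)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
        samples.append(x)
    return samples


def log_standard_normal(x):
    """Log density of N(0, 1), up to an additive constant."""
    return -0.5 * x * x


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = metropolis(log_standard_normal, start=0.0, steps=20_000,
                     step_size=1.0, rng=rng)
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
print(round(mean, 2), round(variance, 2))
```

The sampler only needs the density up to a constant, which is why it suits distributions whose shape is known from the data but whose normalizing constant is not.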


GRAPH
ANALYSIS

The Network Analysis objectives are:

Identifying the subnets (communities)
with high potential for information
exchange.
Measuring changes over time.
Producing initiatives that increase the
enterprise's presence in individual
communities, knowing the spreading
strength of each community.

[Diagram labels: Network, Community]


GRAPH
ANALYSIS

[Diagram labels: Link, Node]

A network is a collection of
relationships among nodes, expressed by links.
A node is an individual characterized by
qualities that can be transmitted
through the links (impulses).
A link is the relationship that
connects 2 nodes. It can be
outgoing, incoming, or
undirected.
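
These definitions map onto a small data structure. The sketch below is illustrative only: the link list is hypothetical, and connected components are used as a deliberately crude stand-in for community detection, which in practice relies on more refined methods.

```python
from collections import defaultdict, deque


def build_graph(links):
    """Store directed links, plus an undirected view used to find connected groups."""
    outgoing = defaultdict(set)
    neighbors = defaultdict(set)
    for source, target in links:
        outgoing[source].add(target)
        neighbors[source].add(target)
        neighbors[target].add(source)
    return outgoing, neighbors


def connected_groups(neighbors):
    """Breadth-first search over the undirected view; each group is one connected component."""
    seen, groups = set(), []
    for start in sorted(neighbors):
        if start in seen:
            continue
        group, queue = set(), deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            group.add(node)
            for other in neighbors[node]:
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        groups.append(group)
    return groups


# Hypothetical links: (caller, callee) pairs
links = [("ann", "bob"), ("bob", "cat"), ("cat", "ann"), ("dan", "eve")]
outgoing, neighbors = build_graph(links)
groups = connected_groups(neighbors)

# Incoming links per node: how many other nodes point at it
in_degree = {n: sum(n in targets for targets in outgoing.values()) for n in neighbors}
print(groups, in_degree)
```

Keeping the directed links separate from the undirected view lets the same structure answer both questions from the text: who can reach whom (direction matters) and who belongs together (direction ignored).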


DATA DISCOVERY: VISUAL ANALYTICS


SAS VISUAL
ANALYTICS

. . . provide very easy-to-use, yet sophisticated,
statistical graphic tools to all of your users?
. . . use ad hoc exploration and visualizations to analyze
multivariate results?
. . . quickly produce mobile dashboards and reports that
convey more foresight than hindsight?

A single solution
for statistical
visualization and
reporting

SAS VISUAL
ANALYTICS

BUSINESS VISUALIZATION DRIVEN BY ANALYTICS

EXPLORATION AND VISUALIZATION
POWER OF ANALYTICS
RAPID DELIVERY OF MOBILE INSIGHTS

BUSINESS
VISUALIZATION

THE DIFFERENCE BETWEEN RAPID INSIGHT AND FAST INFORMATION

DATA VISUALIZATION
ANALYTIC VISUALIZATION
EXPLORATION
DISCOVERY


BENEFITS

INCREASE THE USE OF ANALYTICS AND BI

Self-service
Easy-to-use analytics
Work with more data
Reporting and dashboards
Mobile BI
Collaboration

SAS VISUAL
ANALYTICS

MEETING YOUR BUSINESS NEEDS THROUGH FLEXIBILITY

Traditional on-premise deployments
Public, private, hybrid
SAS Cloud & SAS Solutions on Demand
