
#APS

Overview of the Microsoft Analytics Platform System (APS)
Matt Usher
Senior Program Manager
@two_under

About.me
Senior Program Manager
9 years at Microsoft
Visual Studio
Office
Windows Server
Analytics Platform System (APS)

Amazon, Deloitte Consulting, 5 startups

"Data warehousing has reached the most significant tipping point since its inception. The biggest, possibly most elaborate data management system in IT is changing."
- Gartner, The State of Data Warehousing in 2012

The traditional data warehouse


BI and analytics: Dashboards, Reporting
Data warehouse
ETL
Data sources: OLTP, ERP, CRM, LOB

1 Increasing data volumes
2 Real-time data
3 New data sources and types (devices, web, sensors, social)
4 Cloud-born data

The modern data warehouse


BI AND ANALYTICS

Self-service

Corporate

Collaboration

Mobile

Predictive

DATA ENRICHMENT AND FEDERATED QUERY

Single query
model

Extract, transform,
load

Data quality

Master data
management

DATA MANAGEMENT AND PROCESSING

Relational

Non-relational

Analytical

Streaming

INFRASTRUCTURE

Data sources
OLTP ERP CRM LOB

Non-relational data

Devices Web Sensors Social

Internal and external

Insights from all your data


Enrich and optimize your data from non-traditional sources
A city wanted better
insights into service
effectiveness. They
improved services by
using social, service logs,
devices, and GPS to
improve safety and
enhance services and
community.

A building management
company wanted to
integrate and analyze
data from sensors and
equipment to improve
efficiency and lower
energy costs by 20
percent.

A technical university
needed on-demand
computing in the cloud for
DNA sequencing to
accelerate access,
discovery, and analysis.

Social and web analytics

Live data feeds

Advanced analytics

Roadblocks to a modern data warehouse

Keep legacy investment: limited scalability and ability to handle new data types
Acquire Big Data solution: significant training and data silos
Buy new tier-one hardware appliance: high acquisition and migration costs
Acquire business intelligence: complex with low adoption

Introducing the Microsoft Analytics Platform System


The turnkey modern data warehouse appliance

Enterprise-ready Big Data
Relational and non-relational data in a single appliance
Enterprise-ready Hadoop
Integrated querying across Hadoop and PDW using T-SQL
Direct integration with Microsoft BI tools such as Microsoft Excel

Next-generation performance at scale
Near real-time performance with In-Memory Columnstore
Ability to scale out to accommodate growing data
Removal of data warehouse bottlenecks with MPP SQL Server
Concurrency that fuels rapid adoption

Engineered for optimal value
Industry's lowest data warehouse appliance price per terabyte
Value through a single appliance solution
Value with flexible hardware options using commodity hardware

Microsoft Analytics Platform System


The turnkey modern data warehouse appliance

Enterprise-ready
Big Data

Next-generation
performance at
scale

Engineered for
optimal value

What is Big Data and why is it valuable to the business?
Value to the business

Megabytes

Petabytes

Evolution in the nature and use of data in the enterprise

Data complexity:
variety and
velocity

Historical
analysis

Insight
analysis

Predictive
analytics

Predictive
forecasting

What is Hadoop?

Distributed, scalable system on commodity hardware
Composed of a few parts:
HDFS: distributed file system
MapReduce: programming model
Others: HBase, R, Pig, Hive, Flume, Mahout, Avro, ZooKeeper

Hadoop clusters provide scale-out storage and distributed data processing on commodity hardware

[Diagram: Hadoop stack. Operational services: Ambari, Oozie, Falcon. Data services: Flume, Sqoop, HBase, Pig, Hive and HCatalog; load and extract through NFS and WebHDFS. Core services: MapReduce, YARN, HDFS. Each Hadoop cluster node provides compute and storage.]

APS delivers enterprise-ready Hadoop with HDInsight


Manageable, secured, and highly available Hadoop integrated into the appliance

SQL Server
Parallel Data
Warehouse

High performance
and tuned within
the appliance

End-user
authentication
with Active
Directory

100-percent
Apache Hadoop

Managed and
monitored using
System Center

PolyBase

Microsoft
HDInsight

Accessible
insights for
everyone with
Microsoft BI tools

APS appliance overview


A region is a logical container within an appliance
Each workload contains the following boundaries: security, metering, servicing

[Diagram: appliance comprising Hardware, Fabric, Parallel Data Warehouse workload, and HDInsight workload.]

Demo

Connecting islands of data with PolyBase

Bringing Hadoop point solutions and the data warehouse together for users and IT
Select

Microsoft Azure
HDInsight

Hortonworks for
Windows and Linux
Cloudera

Result
set

SQL Server
Parallel Data
Warehouse

PolyBase

Microsoft
HDInsight

Provides a single T-SQL query model for PDW and Hadoop with rich features of T-SQL, including joins without ETL
Uses the power of MPP to enhance query execution performance
Supports Windows Azure HDInsight to enable new hybrid cloud scenarios
Provides the ability to query non-Microsoft Hadoop distributions, such as Hortonworks and Cloudera

(HDFS) Bridge
Direct and parallelized HDFS access
Enhancing the Data Movement Service (DMS) of APS to allow direct communication between HDFS data nodes and PDW compute nodes

[Diagram: non-relational data (social apps, sensor and RFID, mobile apps, web apps) held in Hadoop, and relational data (traditional schema-based data warehouse applications) held in PDW. Regular T-SQL goes to the enhanced PDW query engine, which reaches Hadoop through the HDFS bridge using an external table, external data source, and external file format, and returns results.]

Automatic MapReduce pushdown


Source systems

Analytics / Ad-hoc / Visualization

SQL Server
Data Marts

Hadoop / Data Lake


(Cloudera, Hortonworks,
HDInsight)

MapReduce

SQL Server
Parallel Data
Warehouse

T-SQL
SQL Server
Reporting Services

PolyBase

Microsoft
HDInsight

APS

Day / Hour / Minute Refresh

SQL Server
Analysis Services

PolyBase Predicate pushdown


HDFS file / directory
//hdfs/social_media/twitter
//hdfs/social_media/twitter/Daily.log

Techniques illustrated: dynamic binding, column filtering, row filtering

SELECT User, Product, Sentiment
FROM Twitter_Table
WHERE Hour = Current - 1
AND Date = Today
AND Sentiment >= 0

[Sample table in Hadoop: Twitter_Table with columns Hour, Date, User, Location, Product, Sentiment, Rtwt; sample rows for users Sean, Audie, Suz, Tom, Sanjay, Roger, and Steve, dated 5-13-14 to 5-15-14.]
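The column filtering and row filtering named on this slide can be illustrated with a small Python sketch. It is not PolyBase's implementation; it only shows the idea of applying the WHERE predicate and the SELECT list at the data source so that less data travels to the warehouse. The function name `scan_with_pushdown`, the `hour` and `sentiment` values, and the value of `current_hour` are assumptions made up for the example.

```python
# Hypothetical sample records standing in for lines of an HDFS file.
# The values are invented for illustration; they are not the slide's data.
ROWS = [
    {"hour": 9, "date": "5-15-14", "user": "Sean", "product": "xbox", "sentiment": -1},
    {"hour": 9, "date": "5-15-14", "user": "Audie", "product": "excel", "sentiment": 1},
    {"hour": 8, "date": "5-15-14", "user": "Suz", "product": "xbox", "sentiment": 0},
    {"hour": 9, "date": "5-14-14", "user": "Roger", "product": "ssas", "sentiment": 1},
]


def scan_with_pushdown(rows, columns, predicate):
    """Filter rows and project columns at the source, yielding only what the query needs."""
    for row in rows:
        if predicate(row):                      # row filtering (WHERE clause)
            yield {c: row[c] for c in columns}  # column filtering (SELECT list)


current_hour, today = 10, "5-15-14"  # assumed "Current" and "Today" values

result = list(scan_with_pushdown(
    ROWS,
    columns=("user", "product", "sentiment"),
    predicate=lambda r: (r["hour"] == current_hour - 1
                         and r["date"] == today
                         and r["sentiment"] >= 0),
))
print(result)
```

Only the one row that satisfies all three conditions leaves the source, and it carries three columns instead of five.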

Enhanced compatibility and ease of use with PolyBase


Improve APS operations by extending PolyBase

Syntax extensions

HDFS file formats


Textfile and
RCFile
support

Security and
permission model

External table
source and file
format syntax

Microsoft Azure
HDInsight

HDInsight on APS

Hortonworks Data
Platform 1.3 and 2.0
(Linux/Windows Server)

Cloudera Linux 4.3

Azure extensions
Microsoft
Azure
Storage
Blobs

AU1
PolyBase
v2

Analytics Platform
System
(powered by PolyBase)

Big Data insights for anyone


New insights with familiar tools through native Microsoft BI integration

Takes advantage of high adoption of Excel, Power View, PowerPivot, and SQL Server Analysis Services
Offers Hadoop tools like MapReduce, Hive, and Pig for data scientists
Minimizes IT intervention for discovering data with tools such as Microsoft Excel
Enables DBA and power users to join relational and Hadoop data with T-SQL

Everyone else using


Microsoft BI tools

Power users

Data scientist

Create External Table


CREATE EXTERNAL TABLE table_name
({<column_definition>} [,...n])
{WITH (
    DATA_SOURCE = <data_source>,    -- Referencing external data source
    FILE_FORMAT = <file_format>,    -- Referencing external file format
    LOCATION = <file_path>,         -- Path of the Hadoop file/folder
    [REJECT_VALUE = <value>]        -- (Optional) Reject parameters
)};

Create External Data Source


CREATE EXTERNAL DATA SOURCE datasource_name
{WITH (
    TYPE = <data_source>,                     -- 1 Type of external data source
    LOCATION = <location>,                    -- 2 Location of external data source
    [JOB_TRACKER_LOCATION = <jb_location>]    -- 3 Enabling or disabling of MapReduce job generation
)};

Create External File Format


CREATE EXTERNAL FILE FORMAT fileformat_name
{WITH (
    FORMAT_TYPE = <type>,                     -- 1 Type of external file format
    [SERDE_METHOD = <serde_method>,]          -- 2 (De)serialization method [Hive RCFile]
    [DATA_COMPRESSION = <compr_method>,]      -- 3 Compression method
    [FORMAT_OPTIONS (<format_options>)]       -- 4 (Optional) Format options [text files]
)};

Format Options
<format_options> ::=
    [FIELD_TERMINATOR = value]       -- Column delimiter
    [, STRING_DELIMITER = value]     -- Delimiter for string data types
    [, DATE_FORMAT = value]          -- To specify a particular date format
    [, USE_TYPE_DEFAULT = value]     -- How missing entries are handled
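To make the text-file format options more concrete, here is a minimal Python sketch of how a delimited-text reader might apply a field terminator, a string delimiter, and a rule for missing entries. It is an illustration only, not how APS parses files. The function `parse_delimited` and its parameter names are hypothetical, and the behavior modeled for missing entries (type default when enabled, otherwise a null) is an assumption about what USE_TYPE_DEFAULT controls. DATE_FORMAT is not modeled.

```python
import csv
import io


def parse_delimited(text, field_terminator="|", string_delimiter='"',
                    column_types=(str, int), use_type_default=False):
    """Parse delimited text into typed rows.

    Assumption: an empty field becomes the column type's default value
    (0 for int, "" for str) when use_type_default is True, else None.
    """
    reader = csv.reader(io.StringIO(text),
                        delimiter=field_terminator,
                        quotechar=string_delimiter)
    rows = []
    for fields in reader:
        row = []
        for raw, col_type in zip(fields, column_types):
            if raw == "":
                row.append(col_type() if use_type_default else None)
            else:
                row.append(col_type(raw))
        rows.append(tuple(row))
    return rows


# Two sample lines: the first has a terminator inside a delimited string,
# the second has a missing integer entry.
sample = '"xbox|one"|3\n"excel"|\n'
as_nulls = parse_delimited(sample)
as_defaults = parse_delimited(sample, use_type_default=True)
print(as_nulls)
print(as_defaults)
```

The string delimiter keeps `xbox|one` as a single value even though it contains the field terminator, and the missing second field is either left null or replaced by the integer default.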

Use cases where PolyBase simplifies using Hadoop data

Bringing islands of Hadoop data together
Running high performance queries against Hadoop data
Archiving data warehouse data to Hadoop (move)
Exporting relational data to Hadoop (copy)
Importing Hadoop data into a data warehouse (copy)

Demo

Microsoft Analytics Platform System


The turnkey modern data warehouse appliance

Enterprise-ready
Big Data

Next-generation
performance at
scale

Engineered for
optimal value

Performance limitations and scale in traditional data warehouse

Scale up
Diminishing scale as requirements grow
[Diagram: scale-up path with data forklift steps.]

Rowstore
Querying data by row
Sub-optimal performance for many data warehouse queries
[Diagram: rowstore layout with rows R1 to R6 across columns C1 to C4, stored in pages 1 to 3.]

Scaling out your data to petabytes
Scale-out technologies in the Analytics Platform System

Scale out
Multiple nodes with dedicated CPU, memory, and storage
Ability to incrementally add hardware for near-linear scale to multiple petabytes
Ability to handle query complexity and concurrency at scale
No forklift of prior warehouse to increase capacity
Ability to scale out HDInsight and PDW

[Diagram: PDW / HDInsight nodes added incrementally, scaling from 0 terabytes to 6 petabytes.]
Blazing-fast performance

MPP and In-Memory Columnstore for next-generation performance

Columnstore index representation
[Diagram: columns C1 to C6 stored separately; parallel query execution returns query results.]

Up to 100x faster queries
Up to 15x more compression

Updateable clustered columnstore vs. table with customary indexing
Store data in columnar format for massive compression
Load data into or out of memory for next-generation performance with up to 60% improvement in data loading speed
Updateable and clustered for real-time trickle loading
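One reason a columnar layout compresses well is that values from a single column sit next to each other, so repeated values form long runs. The Python sketch below illustrates this with run-length encoding, which is used here only as a simple stand-in and is not claimed to be the encoding used by In-Memory Columnstore. The sample rows and function names are hypothetical.

```python
from itertools import groupby

# Hypothetical fact rows: (region, product, quantity)
rows = [
    ("WA", "xbox", 1), ("WA", "xbox", 2), ("WA", "excel", 1),
    ("CA", "excel", 4), ("CA", "excel", 1), ("CA", "xbox", 3),
]


def to_columns(table):
    """Pivot a row-oriented table into one list per column."""
    return [list(column) for column in zip(*table)]


def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


region, product, quantity = to_columns(rows)

encoded_region = run_length_encode(region)    # six values collapse to two pairs
encoded_product = run_length_encode(product)

# An aggregate over one column reads only that column, not whole rows.
total_quantity = sum(quantity)

print(encoded_region, encoded_product, total_quantity)
```

In a row-oriented layout the same region values are interleaved with product and quantity values, so the runs that make this encoding effective never appear.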

Clustered columnstore index


Why is a clustered columnstore index important?
Saves space
Provides easier management by eliminating maintenance of secondary indexes
Supports all PDW data types, including high-precision decimal data types and more
In-Memory Columnstore is featured in the storage engine in PDW AU1

[Chart: Space used in GB (table with 101 million rows); 91% savings. Space used = table space + index space.]

Distributed parallel query execution


Relational query execution processing
1. SQL queries sent to control node
2. Control node creates query execution plan
3. Query plan creates distributed queries to run on each compute node
4. Distributed queries sent to compute nodes (all running in parallel)
5. Control node collects query results and returns them to user

[Diagram: a client sends the user query to the appliance; the control node creates the query plan; compute nodes process query plan operations in parallel; the control node aggregates query results and returns them. A management node is shown alongside the control node.]
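The control node and compute node roles described above can be sketched as a scatter and gather pattern in Python. This is a toy model under stated assumptions, not the PDW engine: threads stand in for compute nodes, a modulo on an integer id stands in for hash distribution, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def distribute(rows, node_count):
    """Assign each row to a compute node (modulo on id stands in for hash distribution)."""
    nodes = [[] for _ in range(node_count)]
    for row in rows:
        nodes[row["id"] % node_count].append(row)
    return nodes


def compute_node_query(partition):
    """Distributed query run on one compute node: a partial SUM and COUNT."""
    return sum(r["amount"] for r in partition), len(partition)


def control_node_query(rows, node_count=4):
    """Control node: fan the query out to all compute nodes, then aggregate the partial results."""
    partitions = distribute(rows, node_count)
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(compute_node_query, partitions))
    total = sum(partial_sum for partial_sum, _ in partials)
    count = sum(partial_count for _, partial_count in partials)
    return total, count


sales = [{"id": i, "amount": i * 10} for i in range(1, 11)]
print(control_node_query(sales))
```

Each compute node sees only its own partition, and the control node combines the small partial results instead of the raw rows.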

Concurrency that fuels rapid adoption


Great performance with mixed workloads
ETL/ELT with SSIS, DQS,
MDS
ERP

CRM

LOB

Analytics Platform
Intra-Day
System

APPS

ETL/ELT with DWLoader

CRTAS

SQL Server SMP

Link Table

Near real-time

PDW

Real-Time

Reporting and cubes

Columnstore
ROLAP / MOLAP
DirectQuery

Hadoop / Big Data

Ad hoc queries

Polybase

PolyBase

Fast ad hoc

HDInsight

SNAC

BI Tools

Microsoft Analytics Platform System


The turnkey modern data warehouse appliance

Enterprise-ready
Big Data

Next-generation
performance at
scale

Engineered for
optimal value

Lowest price per TB for data warehouse appliance with APS
High performance using commodity hardware

Significantly lower price per terabyte than the closest competitor

[Chart: Price per terabyte for leading vendors (Oracle, EMC, IBM, Teradata, Microsoft). Y-axis: price per terabyte for user-available storage (compressed), in thousands, from $0 to $30. NOTE: Orange line indicates average price per terabyte.]

Lower storage costs with Windows Server 2012 Storage Spaces

Hardware and software engineered together


The ease of an appliance

PDW

Integrated
support plan
with a single
Microsoft
contact

Coengineered
with HP, Dell,
and Quanta
best practices

Preconfigured,
built, and
tuned
software and
hardware

Leading
performance
with
commodity
hardware

PolyBase

HDInsight

Rack #2

Rack #1

InfiniBand

InfiniBand

InfiniBand

InfiniBand

Ethernet

Ethernet

Ethernet

HDI extension
base unit

Failover node

Hardware architecture

Ethernet
Control node
Failover node

Networking

PDW region

HST-01

Master node

HST-02

Failover node
Compute nodes

Economical disk storage

Compute nodes

HDI active
scale unit

Compute nodes

Economical disk storage

HSA-01
Economical disk storage

HDInsight region
HST-02

Compute nodes

HDI active
scale unit

Economical disk storage

HDI extension
base unit

Addition of two or three compute


nodes depending on OEM
hardware configuration and
related storage

Passive Unit

Host for non-worker HDInsight


nodes

Failover
Node

High availability for the rack

Economical disk storage

PDW region

Economical disk storage

Active Unit

Compute nodes

Compute nodes

Economical disk storage

IB and
Ethernet

Virtualized architecture overview

Software details
All hosts run Windows Server 2012 Standard and Windows Azure Virtual Machines
Fabric or workload in Hyper-V Virtual Machines
Fabric virtual machine, management server (MAD01), and control server (CTL) share one server
PDW agent that runs on all hosts and all virtual machines
DWConfig and Admin Console
Windows Storage Spaces and Azure Storage blobs

[Diagram: base unit with Hosts 1 to 4. Virtual machines CTL (PDW engine, DMS Manager, SQL Server 2012 Enterprise Edition (PDW build)), MAD, AD, and VMM; Compute 1 and Compute 2; IB and Ethernet networking; economical disk storage over direct attached SAS.]
Failover functionality
Virtual machine migration can be used to move workload nodes to new hosts after hardware failure

Cluster Shared Volumes
Enable all nodes to access logical unit numbers (LUNs) on economical disk storage
Use Server Message Block (SMB3) protocol

Failover capabilities
Uses one cluster across the whole appliance
Automatically migrates virtual machines on host failure
Enforces rules with affinity and anti-affinity maps
Uses Windows Failover Cluster Manager

[Diagram: base unit and passive unit with Hosts 1 to 5; the CTL, MAD, FAB, AD, VMM, and Compute virtual machines move to another host after a failure; hosts connect over IB and Ethernet to economical disk storage through direct attached SAS.]
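The idea of migrating virtual machines on host failure while enforcing anti-affinity rules can be sketched as follows. This is a simplified illustration, not the behavior of Windows Failover Clustering: it models anti-affinity only (no affinity rules), picks the first acceptable host, and uses hypothetical names throughout, including the host and virtual machine labels.

```python
def migrate_on_failure(placement, failed_host, hosts, anti_affinity):
    """Return a new placement after moving VMs off the failed host.

    placement: dict of VM name -> host name
    anti_affinity: list of sets; VMs in the same set must not share a host
    """
    new_placement = dict(placement)
    survivors = [h for h in hosts if h != failed_host]
    for vm, host in placement.items():
        if host != failed_host:
            continue
        # VMs that must not be co-located with this one
        conflicts = set().union(*(g for g in anti_affinity if vm in g)) - {vm}
        for candidate in survivors:
            residents = {v for v, h in new_placement.items() if h == candidate}
            if not residents & conflicts:
                new_placement[vm] = candidate
                break
        else:
            raise RuntimeError(f"no host satisfies the rules for {vm}")
    return new_placement


# Hypothetical layout: two compute VMs that must stay on different hosts,
# plus a spare host that is idle until a failure occurs.
hosts = ["Host 2", "Host 3", "Host 5"]
placement = {"Compute 1": "Host 2", "Compute 2": "Host 3"}
rules = [{"Compute 1", "Compute 2"}]

after = migrate_on_failure(placement, "Host 2", hosts, rules)
print(after)
```

Here `Compute 1` skips `Host 3` because `Compute 2` already runs there, and lands on the spare host instead.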

Security enhancements
Integrated
authentication

Transparent
data
encryption

Scenarios
User logs in with domain credentials
Kerberos attempted first
If no Kerberos, then fall back to NTLM
If no NTLM, then authentication fails

User logs in with SQL username/pass
No trust is required
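The fallback order in these scenarios can be written down as a small decision function. This is only a sketch of the ordering stated on the slide; the boolean flags are hypothetical stand-ins for the real protocol handshakes, and the function name is made up for the example.

```python
def authenticate(login_type, kerberos_ok=False, ntlm_ok=False, sql_password_ok=False):
    """Return the mechanism that authenticated the login, or raise PermissionError."""
    if login_type == "sql":
        # SQL username/password logins need no domain trust.
        if sql_password_ok:
            return "sql"
        raise PermissionError("SQL authentication failed")
    if login_type == "domain":
        if kerberos_ok:   # Kerberos attempted first
            return "kerberos"
        if ntlm_ok:       # if no Kerberos, fall back to NTLM
            return "ntlm"
        raise PermissionError("integrated authentication failed")
    raise ValueError(f"unknown login type: {login_type}")


print(authenticate("domain", kerberos_ok=True, ntlm_ok=True))
print(authenticate("domain", ntlm_ok=True))
print(authenticate("sql", sql_password_ok=True))
```

A domain login with neither mechanism available raises `PermissionError`, matching the last step of the scenario.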

Integrated authentication: Trust
Trust is between corporate domain and workload domain

Minimum configuration (NTLM)
One-way (outgoing) external (non-transitive) trust between corporate DC and PDW Workload AD

Minimum configuration (Kerberos)
One-way forest
Two-way forest
Two-way external trust

TDE in PDW V2 AU1: Under the hood

1. User creates master key in master database
   PDW creates master key on CTL01
   PDW creates separate master key on all compute nodes
2. User enables encryption at appliance level
   PDW encrypts tempdb and pdwtempdb
3. User creates certificate in master database
   PDW creates certificate on CTL01
   PDW exports certificate and imports it into all CMP nodes
4. User creates database encryption key (UserDB)
   PDW creates database encryption key on CTL01
   PDW creates different database encryption key (ALL CMP)
5. Initiate database encryption for user database
   PDW encrypts user database
Demo

The Royal Bank of Scotland, the leading UK provider of corporate banking services, needed a powerful analytics platform to improve performance and customer services. The bank implemented a Microsoft SQL Server 2012 Parallel Data Warehouse appliance to increase productivity by 40 percent for faster response to business needs.

"I knew that it would be easy for my team to transition from managing SQL Server databases to SQL Server 2012 PDW, and the solution cost about 85 percent less than products from other vendors."
- Alan Grogan, Chief Analytics Officer, Royal Bank of Scotland

Get started today!


Learn more at www.microsoft.com/aps
Try HDInsight at www.microsoft.com/bigdata

Resources
Learning
Sessions on Demand

http://channel9.msdn.com/Events/TechEd

TechNet
Resources for IT Professionals

http://microsoft.com/technet

Microsoft Certification & Training Resources

www.microsoft.com/learning

msdn
Resources for Developers

http://microsoft.com/msdn


© 2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.
The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
