
How to Run a Big Data POC — In Six Weeks

A practical workbook to deploy your first proof of concept and avoid early failure

Contents

Introduction
–– The Importance of Starting Small

Part 1: General Lessons
–– POC Challenges
–– Big Data Management Lessons

Part 2: Picking the First Project
–– Exploring Cloud Data Management Drivers
–– Limiting the Scope
–– Sample Projects

Part 3: Proving Value
–– Defining the Right Metrics

Part 4: Structuring the POC
–– The Three Layers of a Big Data POC
–– The Three Pillars of Big Data Management
–– A Checklist for Big Data POCs
–– A Solution Architecture for Big Data Management
–– A Sample Schedule for a Big Data POC

Conclusion
–– Making the Most of Your POC

About Informatica®

Tip: Click on parts to jump to the particular section you want.



Introduction

The Importance of Starting Small

The Importance of Starting Small

At this point there’s little doubt that big data represents real and tangible value for enterprises. A full 84 percent of them see big data analytics changing the way they compete1. And companies like Uber2 and Airbnb3 are disrupting whole industries with smarter operations fueled by big data.

But even though the potential for big data innovation is significant4, and even though the low cost of storage and scalable distributed computing has shrunk the gap between ideas and implementations, the journey to big data success is a tricky one.

A new set of technological and organizational challenges makes big data projects prone to failure. But if there’s one thing that early big data projects have proven, it’s that you need a phased approach—not only to make projects more manageable, but also to prove the value of big data to the enterprise.

That means starting with a well-planned proof of concept (POC) and building toward a coherent vision of big data-driven success. The aim of this workbook is to share the advice and best practice needed to run a successful big data POC.

Central to this is ensuring your POC has a short time frame. So we’ve written this workbook to help you complete your POC in an ambitious but feasible six weeks. By following the lessons shared here, you’ll be able to run a POC that is:

–– Small enough that it can be completed in six weeks.

–– Important enough that it can prove the value of big data to the rest of the organization.

–– Intelligently designed so you can grow the project incrementally.

Let’s start.

Drinking our own champagne
The marketing team at Informatica recently ran a successful POC for a marketing analytics data lake to fuel account-based marketing. We’ve written a 200-page book that describes the experiences, challenges, and lessons learned during the project. Download a free copy of the book here.

Part 1

General Lessons

“Give me six hours to chop down a tree and I will spend the first four sharpening the axe.”

Abraham Lincoln

POC Challenges

Far too often, POCs run over budget, fall behind schedule, or fail to live up to expectations. In fact, only 13 percent of organizations have big data projects in full-scale production5. And only 27 percent of executives deem their big data projects ‘successful’5.

So before we get into what it takes to make big data projects succeed, it’s important to consider why so many POCs fail.

A lack of alignment with executives
The success or failure of your POC will depend on your relationship with the executives sponsoring the project. So rather than just occasionally reporting to the executive, it’s important that you make them an active participant in the project.

That means understanding the outcomes they’re interested in championing and making sure that what you build proves their value. In some cases, you may need to manage expectations, and in others you may need to show them they can actually aim higher. In either case, it’s important that they’re treated like the crucial part of the POC process they are.

Without the active involvement of an executive, you cannot prove the value of a POC to the enterprise. And the POC can’t be completed in six weeks.

Lack of planning and design
You don’t want to spend three of your six weeks just reconfiguring your Hadoop cluster because it was set up incorrectly or for a different project. Even at an early stage, planning, architecture, and design are crucial considerations for big data implementations.

While you may not need your project to have a fully-designed architecture to prove value (in fact, with a time frame of six weeks it’s actually wise not to build a complete architecture), you will need to understand the implications of various technical components.

13%: Only 13% of organizations have big data projects in full-scale production.
27%: Just 27% of executives deem their big data projects “successful.”

And POC or not, a poorly rationalized technology stack with component technologies that aren’t configured in a coordinated way is a recipe for disaster.

Scope creep
As excitement around the project grows, it’s common for new business units, executives, and team members to add requirements and alter your priorities. In fact, as you go through the build and realize what’s possible, you may be tempted to broaden the scope of your first use case too.

But scope creep invariably slows down deployments. And a small project that solves a business problem is a lot more valuable than a broad project that doesn’t.

So it’s crucial that your POC be tightly focused on the use case it’s meant to solve in six weeks’ time. That means focusing everyone involved on a single use case that’s important enough to see through before you start building.

Ignoring data management
In all the excitement around new technologies like Hadoop and Spark, it’s worryingly easy to forget the basic principles of data management needed to make big data work. It’s this simple: if the data fueling your big data POC isn’t reliable, connected, and easy to manage, it isn’t successful.

And if you’re hand-coding integrations, transformations, and business logic, you won’t have the time you need to actually solve a business problem.

So rather than getting caught up in making hand-coded scripts for “slowly changing dimensions” work in new technologies like Spark, it’s a lot smarter to use tools you already know. More importantly, it isn’t easy to scale, deploy, and maintain the code of a small group of experts. So automating crucial data management helps both during and after your POC’s six-week time frame.

Big Data Management Lessons

A phased approach to big data projects is crucial. But the success of a phased approach isn’t just about starting small. It’s also about having a cohesive vision to build toward. To that end, big data management has a crucial role to play in the success of big data projects for two reasons.

First, because big data management principles streamline a large part of the work that needs to be done during your first six weeks, allowing your developers to focus on the domain-specific problem in front of them.

And second, because once you prove your solution works in a POC environment, you need to be able to scale and deploy it to a production environment in a reliable, flexible, and maintainable way.

So when it comes to building out the solution architecture for your POC, consider the following big data management principles:

Abstraction is key
The key to scale is leveraging all available hardware platforms and production artifacts (such as existing data pipelines). So when it comes to data management, it’s important that the logic and metadata you need is abstracted away from the execution platform.

That way, you make sure that no matter what changes in the runtime environment—for instance, if you realize you need to move from MapReduce to Spark—you can still preserve your integrations, transformations, and logic. It gives you the neutrality you need to manage the data and not the underlying technology.

Metadata matters
One of the biggest challenges with large datasets is the sheer amount of data “wrangling” (cleansing, transforming, and stitching) that needs to occur for data to be usable. In some cases this can take up as much as 80 percent of a data scientist’s time6.

So rather than saddling your scientists with the burden of manually applying the same transformations over and over again, it makes a lot more sense to manage metadata in an accessible way. Global repositories of metadata, rules, and logic make it a lot easier to use a rationalized set of integration patterns and transformations repeatedly.
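To make the abstraction principle concrete, here is a minimal sketch of keeping transformation logic in an engine-neutral spec and executing it on whichever runtime you have. The spec format, field names, and helper functions are illustrative assumptions, not a particular vendor's API; a data management tool would hold this metadata for you.

```python
# A minimal sketch of "abstraction" in practice: the transformation logic lives in a
# plain-Python spec (metadata), and only a thin adapter knows about the runtime engine.
# Names and fields here are hypothetical, not a specific vendor API.
import pandas as pd

PIPELINE_SPEC = [
    {"op": "rename", "from": "acct_nm", "to": "account_name"},
    {"op": "trim",   "column": "account_name"},
    {"op": "filter", "column": "country", "equals": "US"},
]

def run_on_pandas(df: pd.DataFrame, spec) -> pd.DataFrame:
    for step in spec:
        if step["op"] == "rename":
            df = df.rename(columns={step["from"]: step["to"]})
        elif step["op"] == "trim":
            df[step["column"]] = df[step["column"]].str.strip()
        elif step["op"] == "filter":
            df = df[df[step["column"]] == step["equals"]]
    return df

def run_on_spark(df, spec):
    # The same logic re-expressed for a Spark DataFrame; the spec itself is unchanged,
    # so moving from one engine to another doesn't mean rewriting business rules.
    from pyspark.sql import functions as F
    for step in spec:
        if step["op"] == "rename":
            df = df.withColumnRenamed(step["from"], step["to"])
        elif step["op"] == "trim":
            df = df.withColumn(step["column"], F.trim(F.col(step["column"])))
        elif step["op"] == "filter":
            df = df.filter(F.col(step["column"]) == step["equals"])
    return df

if __name__ == "__main__":
    sample = pd.DataFrame({"acct_nm": ["  Acme ", "Globex"], "country": ["US", "DE"]})
    print(run_on_pandas(sample, PIPELINE_SPEC))
```

Because the business rules live in the spec rather than in engine-specific code, swapping MapReduce for Spark (or pandas for Spark) only means adding another thin adapter.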

Leverage your existing talent
Thanks to technologies like Hadoop, it’s become incredibly easy to add low-cost commodity hardware. At the same time, it’s becoming harder to add the right talent. For instance, if you’re using Spark as your runtime environment, you need developers who can code in specialized languages and understand the underlying computing paradigm.

So instead of investing in outside talent for every new skill and technology, it’s smarter to leverage your existing talent to integrate, cleanse, and secure all your data using the data management tools and interfaces your team knows how to use. Make sure your data management tool can deploy data pipelines to big data platforms such as Hadoop, and specialty processing engines on YARN or Spark.

Automate to operationalize
Data management in a POC is hard. And data quality is fundamental to the success of a big data project. So while you might be able to complete a POC with a small team of experts writing a lot of hand-coded scripts, all it proves is that you can get Hadoop to work in a POC environment.

It’s no guarantee that what you built will continue to work in a production data center, comply with industry regulations, scale to future workloads, or be maintainable as technologies change and new data is onboarded. The key to a successful POC is preparing for what comes next.

At this stage you need automation to scale your project. So keep the future scale of your initiative in mind as you plan and design the POC.

Security will be more than a checkbox
At the POC stage, data security is basically treated like a checkbox. That’s because in a non-production environment with a small group of users, you might be able to manage risk without a clearly-defined security strategy.

But once you operationalize your project, you have to make sure your big data application doesn’t turn into a data free-for-all. At scale, you need to be able to not only apply strict security policies centrally, but also to make sure enforcing them secures your data across users, applications, environments, and regions.

It’s no mean feat. For example, do you know which data sets contain sensitive information? And can you de-identify and de-sensitize them in case of a security breach? Even though you may get away with minimal security now, it helps to consider and discuss the various threats to the security of the data that resides in your cluster.

Part 2

Picking the First Project



Limiting the Scope

The biggest challenge with a big data POC lies in picking the first use case. The key lies in finding a use case that can deliver an important business outcome—while still being small enough to solve within budget and in six weeks. (Note: we’d advise against having multiple use cases for a six-week POC.)

Once you’ve identified a business challenge you can solve with big data, minimize the scope of the project as much as possible. The smaller and more tightly focused your project case, the greater the likelihood of success.

Of course, it isn’t worth over-simplifying your project and missing out on crucial capabilities. But starting small and having a clearly defined roadmap and phased approach is key.

It’s better to spend a few weeks refining the features and functions of your project case down to the bare minimum required to prove value, before you start the actual POC. That way, you can ensure everyone’s on the same page about what needs to be done and avoid scope creep later. And any requirements that fall outside the scope of your first project can be re-considered during the next phase.

Limit the scope of your POC along these dimensions:

The business unit
In a six-week POC, it’s smarter to serve one business unit rather than trying to serve the whole enterprise. For enterprise-wide initiatives, you typically need champions from multiple departments, and within a six-week timeframe it can be challenging to find and manage them.

The smaller the team of business stakeholders you need to engage and report to, the easier it’ll be to meet and manage expectations.

The executive
As we’ve said, executive support is crucial to the success of your POC. But trying to cater to the needs of multiple executives takes much-needed focus and resources away from your six-week timeframe. As far as possible, make sure you’re only working with one executive during your POC—and that you’re working closely together to deliver the same vision.

Engaging your Integration Competency Center (ICC)

A lot of the data management technology (from data quality tools to data pipelines) and standards (from business logic to glossary terms) you need may already exist within your Integration Competency Center or Integration Center of Excellence.

It makes sense to engage these experts and show them what you’re working on. If your group and theirs decide to work together, you get the benefit of pre-built standards and tools. And even if you don’t work together for the POC, you’ll have established a relationship that will be useful when you operationalize your project.

Limiting the Scope

The team
Effective collaboration between IT and business stakeholders is crucial if you’re going to complete your POC in six weeks. So involve only the most essential team members needed to prove value in six weeks.

Depending on the nature of your project and the way your legacy stack is set up, this team is likely to consist of a domain expert, data scientist, architect, data engineer, and data steward—though one person may wear many hats.

The data
Aim for as few data sources as possible. Go back and forth between what needs to be achieved and what’s possible in six weeks until you have to manage only a few sources. You can always add new ones once you’ve proved the value.

In terms of the actual data itself, focus only on the dimensions and measures that actually matter to your specific POC. If your web analytics data has 400 fields but you only need 10 of them, you can rationalize your efforts by focusing on those 10.

Similarly, if you’re attempting to run predictive analytics based on customer data, it might be better to validate a model based on a smaller but indicative subset of attributes so as to not overfit your analytic models.

Cluster size
Typically, your starting cluster should only be as big as five to 10 nodes. We recommend you work with your Hadoop vendor to determine the appropriate size of cluster needed for your POC. It’s smarter to start small during the POC phase and then scale your development and production clusters, since Hadoop has already proven its linear scalability.

And the six weeks you have for your POC should really only serve to prove your solution can solve your domain-specific business challenge.

When it came to building out the marketing data lake at Informatica, our team focused on only the most essential data sources needed to gain a consolidated view of how prospect accounts engaged with our website: Adobe Analytics, Salesforce, and Marketo.

Once we’d proven the value of this use case, we were in a position to add new sources such as Demandbase and ERP data, as well as other techniques such as predictive analytics.

Sample Projects

Now let’s take a look at some use cases from successful POCs that limited their scope to prove value and grow incrementally.

Business Use Cases:

Fraud Detection in Hadoop
Problem: One U.S. bank found that managing its general ledger data on a mainframe was proving to be far too expensive. Even worse, it made it incredibly time-consuming to use that data for risk analytics.

Project: The team took the general ledger data off the mainframe and stored it in Hadoop to reduce costs. This allowed them to identify anomalies and trends with a new and better approach to fraud detection.

Customer 360 Analytics
Problem: Disparate data sources and a fragmented view of customers across systems made it impossible for marketers at a global payment services company to detect trends and patterns in their customer data.

Project: The team decided to process a subset of customer data for a customer 360 initiative that would allow them to analyze and predict purchasing behavior and consumer preferences across different channels.

A Marketing Data Lake for Account-Based Marketing
Problem: The Informatica marketing team lacked a consolidated view of how known and unknown prospects from target accounts interacted with the Informatica website and content.

Project: The team integrated site traffic data from Adobe Analytics with lead data from Marketo and Salesforce, as well as reverse-IP lookup from Demandbase, to understand what users within specific accounts were looking for.

Looking forward: Now that the team has proven the value of both the solution architecture and its account-based approach to marketing, we’re extending its capabilities to include cohort analysis, attribution modeling, and identifying influencers.

Sample Use Cases

Technical Use Cases

In some cases, IT teams have the resources and authority to invest in technical use cases for big data. This may be to identify technical challenges and possibilities around building an enterprise data lake or analytics cloud, for instance.

It’s important to note that while we’ve been describing the first phase of a big data project as a Proof of Concept, we’re really talking about a Proof of Value. The difference being that while a POC aims to demonstrate the capabilities of features and functions in big data technology, a Proof of Value goes one step further and demonstrates the business impact.

Even if you have the license to explore big data, it’s still important that you have a clear view of the value you need to demonstrate for business stakeholders.

Some of the most common technical use cases are:

Data warehouse extensions
–– Building a staging environment to offload data storage and ETL workloads from a data warehouse.
–– Extending your data warehouse with a Hadoop-based ODS (operational data store).

Building data lakes and hubs
–– Building a centralized data lake that stores all enterprise data required for big data analytic projects.
–– Building a centralized data hub with a logical data structure over Hadoop to perform aggregations and transformations on the fly.

Building an active data archive
–– Managing archived data with full data lineage to increase auditability.
–– Managing all archived data in a single platform to make it easier to search and query.

Part 3

Proving Value

Defining the Right Metrics

A successful POC needs to prove a number of different things:

–– That the big data project will deliver a measurable return.

–– That the technology stack used for the project will help business stakeholders achieve that return in a reliable, repeatable way.

–– That the solution architecture is worth further investment once you start to operationalize the project.

The starting costs for big data POCs can be misleadingly low, and costs can spiral once you operationalize. So your POC needs to have comprehensively proved its value before this point.

To that end, even before your POC has started, it’s important that you begin translating expectations of success into clearly defined metrics for success.

Based on your first use case, list out as many of the most appropriate metrics for success as you can. If the metrics you’re listing aren’t likely to prove this is a project worth further investment, it’s worth redesigning your project to achieve more relevant metrics.

Quantitative measures (examples)
–– Identify 1,200 high revenue potential prospects within one month.

–– Identify anomalies in general ledger data 30 percent faster than in the enterprise data warehouse.

–– Make customer product preferences available within the top fold of the Salesforce interface for 20,000 leads.

Qualitative measures
–– Agility: Allow marketers to ask more sophisticated questions, such as who are my most influential customers.

–– Transparency: Improve visibility into the provenance of atypical data entries, leading to better identification of fraudulent patterns.

–– Productivity and efficiency: Reduce the number of man-hours spent by data analysts on stitching and de-duplicating data by 50 percent.

Capturing Usage Metrics to Prove Value

One big indicator of success for Informatica’s marketing team was the number of people actually using the tools they created. By recording usage metrics, they could prove adoption of the solution by field marketers and sales development representatives to make sure they were on the right track.

Additionally, it helps to analyze why more users are flocking to your new interfaces and tools. For instance, how much faster can sales qualify a new opportunity compared to previous approaches? Are lead conversions increasing? What fields can sales development representatives see now that they have a consolidated view of different systems?

SL No  Job Title           View ID  View Name  Last Viewed     Viewed
1      Product Marketing   1096     Dashboard  12/17/16 14:02  71
2      Sales Director      1096     Dashboard  01/04/17 11:07  217
3      SDR                 1096     Dashboard  12/21/16 11:06  1
4      Team leader         1096     Dashboard  12/10/16 12:19  40
5      Team leader         1096     Dashboard  10/26/16 7:34   7
6      VP of Sales         1096     Dashboard  11/03/16 12:39  42
7      Field Marketing     1096     Dashboard  01/04/17 14:20  322
8      SDR                 1096     Dashboard  10/08/16 11:29  15
9      Field Marketing     1096     Dashboard  12/17/16 11:30  232
10     Solutions Director  1096     Dashboard  01/04/17 12:40  24
11     VP of Sales         1096     Dashboard  12/17/16 13:38  235
12     Solutions Director  1096     Dashboard  01/04/17 07:27  16
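If the view log can be exported as tabular data, a few lines of pandas are enough to turn it into adoption numbers like the table above. This is a hedged sketch: the column names mirror the table, but how you extract the log from your BI server (Tableau's audit views, in the Informatica example) will vary.

```python
# A small sketch of turning raw dashboard-view logs (like the table above) into
# adoption metrics. Treat the inline records as a placeholder for your real export.
import pandas as pd

views = pd.DataFrame([
    {"job_title": "Field Marketing",   "view_name": "Dashboard", "last_viewed": "2017-01-04", "viewed": 322},
    {"job_title": "VP of Sales",       "view_name": "Dashboard", "last_viewed": "2016-12-17", "viewed": 235},
    {"job_title": "SDR",               "view_name": "Dashboard", "last_viewed": "2016-12-21", "viewed": 1},
    {"job_title": "Product Marketing", "view_name": "Dashboard", "last_viewed": "2016-12-17", "viewed": 71},
])

views["last_viewed"] = pd.to_datetime(views["last_viewed"])

# Total views per role shows which audiences are actually adopting the tool.
adoption_by_role = (
    views.groupby("job_title")["viewed"]
         .sum()
         .sort_values(ascending=False)
)

# Users who haven't come back recently are a signal to follow up and ask why.
stale = views[views["last_viewed"] < pd.Timestamp("2017-01-01")]

print(adoption_by_role)
print(f"{len(stale)} of {len(views)} users have not viewed the dashboard this year")
```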



Part 4

Structuring the POC



The Three Layers of a Big Data POC

For the most part, the hype around big data technologies has focused on either visualization and analysis tools like Tableau and Qlik, or distributed storage technology like Hadoop and Spark.

And for the most part, the hype has been justified. Visualization tools have successfully democratized the use of data in enterprises. And Hadoop has simultaneously reduced costs, improved agility, and increased the linear scalability of enterprise technology stacks.

But big data projects rely on a crucial layer between these two—the data management layer. Without it, data isn’t fit-for-purpose, developers get mired in maintaining spaghetti code, architectures become impossible to scale for enterprise-wide use, and data lakes become data swamps.

So when it comes to building out the solution architecture for your POC, it’s important you consider all three of the following layers:

1. Visualization and analysis
Includes analytic and visualization tools used for statistical analyses, machine learning, predictive analytics, and data exploration.

Here, user experience is key. Business users, analysts, and scientists need to be able to quickly and easily query data, test hypotheses, and visualize their findings, preferably in familiar spreadsheet-like interfaces.

2. Big data management
Includes core tools to integrate, govern, and secure big data, such as pre-built connectors and transformations, data quality, data lineage, and data masking.

Here, the emphasis should be on ensuring data is made fit-for-purpose and protected in an automated, flexible, and scalable way that speeds up your build.

3. Storage persistence layer
Includes storage persistence technologies and distributed processing frameworks like Hadoop, Spark, and Cassandra.

Here, the emphasis should be primarily on cost reduction and leveraging the linear scalability these technologies enable.

Should Big Data POCs Be Done On-Premise or in the Cloud?

Just over half (51 percent) of Hadoop clusters are run on-premise today7. That’s a surprising amount considering Hadoop’s association with the rise of cloud computing.

But it does raise an interesting question about whether or not you should run your cluster in the cloud or on your own hardware.

On the one hand, cloud storage is cheaper, offers elasticity, and makes it easier to access commodity hardware. In fact, cloud vendors have the resources and expertise to actually provide significantly better security than most companies can afford.

At the same time, you have to be careful about your organization’s standards and mandates, as well as perceptions about the cloud.

Additionally, if you’ve already sunk the costs of hardware and have the staff for 24/7 maintenance, it might actually be easier to get started on your own premises.

In either case, it’s clear that as projects scale and evolve, you’ll likely change the runtime environment under your stack. So make sure you manage your metadata strategically and abstract away transformations, logic, and code so it’s easier to deploy elsewhere.

Location of Hadoop clusters: 51% on-premise; 49% cloud providers, managed services, and other.

The Three Pillars of Big Data Management

When it comes to building the big data management layer of your POC, you need your stack to account for three key areas:

1. Big Data Integration
The first challenge with any big data project is about ingesting the data you need. Big data means huge volumes of data from multiple data sources in multiple different schemas. As a result, you need high-throughput data integration and, if necessary, real-time streaming for low-latency data.

In the more specific context of your six-week POC, it’s important that you aim for the right things as early as possible.

–– Leverage pre-built connectors, transformations, and integration patterns to speed up the development of your data pipelines. They’ll make it easier for analysts and developers to connect to key sources and ensure no one has to reinvent the wheel every time they need a new connection.

–– Versatility is key. Once you’ve completed your POC, you’ll need to be able to grow your platform incrementally. Aim for universal connectivity across data sources, data types, and dynamic schemas so you can onboard new types of data quickly.

–– Abstraction matters. As we described in Part 1, it helps to abstract your code and logic away from the runtime environment. That way, no matter what changes you make to your underlying technology stack, you still have a global repository of patterns and transformations to use.

2. Big Data Quality (and Governance)
Without fit-for-purpose data, it doesn’t matter how sophisticated your models and algorithms are. It’s essential that you proactively manage the quality of your data and ensure you have the tools you need to cleanse and validate data.

At the scale of big data, managing data quality is all about maximizing IT’s impact with consistent standards and centralized policies for effective governance.

It’s about making sure you can:

–– Detect anomalies quickly: You can’t eyeball all your data. And you can’t attempt to manually correct every error. For this to be scalable, you need to be able to track and monitor data quality with a common set of rules, standards, and scorecards acting as a frame of reference across the environment.

–– Manage technical and operational metadata: A global metadata repository of data assets, code, and logic about transformations enables every user to easily find datasets and deploy the cleansing and validation they need. Most important, it gives you the ability to define those rules once and govern anywhere.

–– Match and link entities from disparate data sets: The more data you’re managing, the harder it becomes to infer and detect relationships between different entities and domains. By building entity matching and linking into your stack, you make it easier for your analysts to uncover trends and patterns.

–– Provide end-to-end lineage: An important function of a global repository of metadata is to make it easy to uncover the provenance of data. For one thing, this streamlines compliance processes because it increases the auditability of all your data. But it also makes it easy to map data transformations that need repeating.

3. Big Data Security
Data security can be fairly light during POCs. But the second your project succeeds, security becomes a whole other ball game. This is particularly challenging with Hadoop implementations, where it’s hard to monitor and control different users from different teams and even regions.

For instance, if you have a team in India accessing certain data that was generated in California, you need to make sure you have the tools in place to control what they can and cannot see.

While it may not directly apply to your POC project, it’s important to start thinking (and talking) about your security strategy for a production environment. At scale you need to be able to:

–– Monitor sensitive data wherever it lives and wherever it’s used: That means risk-centric security analytics based on a comprehensive 360-degree view of all your sensitive data.

–– Carry out risk assessments and alert stakeholders: Security breaches occur all too frequently. So you need to make sure you clearly understand where your sensitive data resides, who’s using it, and for what purpose. Risk scorecards based on trends and anomalies are essential to detect exposure to breaches when they happen.

–– Mask data in different environments to de-identify sensitive data: Encryption and access control can only go so far. You don’t want your security controls to be prohibitive and intrusive. So it makes a lot more sense to use persistent and dynamic data masking to make sure different users can see the data they need, while never exposing the data they don’t.
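As a starting point for the "do you know which data sets contain sensitive information?" question, even a POC can run a lightweight scan over its sources. The sketch below is a deliberately simplified, assumption-laden example: a handful of regex classifiers over a pandas sample. A data security or cataloging product would use far richer detection, but the shape of the output, a column-to-sensitivity mapping, is the same.

```python
# A minimal sketch of answering "which columns hold sensitive data?" for a small POC
# dataset. The patterns and column names below are illustrative assumptions only.
import pandas as pd

PATTERNS = {
    "email":       r"^[^@\s]+@[^@\s]+\.[^@\s]+$",
    "us_ssn":      r"^\d{3}-\d{2}-\d{4}$",
    "credit_card": r"^\d{13,16}$",
}

def classify_columns(df: pd.DataFrame, sample_size: int = 100) -> dict:
    """Return {column_name: [matching sensitivity labels]} based on a value sample."""
    findings = {}
    for col in df.columns:
        sample = df[col].dropna().astype(str).head(sample_size)
        labels = [
            label for label, pattern in PATTERNS.items()
            if len(sample) > 0 and sample.str.match(pattern).mean() > 0.8
        ]
        if labels:
            findings[col] = labels
    return findings

customers = pd.DataFrame({
    "account_name": ["Acme", "Globex"],
    "contact_email": ["jane@acme.com", "sam@globex.com"],
    "card_number": ["4111111111111111", "5500005555555559"],
})
print(classify_columns(customers))  # {'contact_email': ['email'], 'card_number': ['credit_card']}
```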

Four Dimensions of Big Data Security

When most of your data no longer lives behind the firewall, you can’t afford to rely on authentication and authorization alone for your big data security strategy. In fact, your big data security should ideally encompass four types of protection:

1. Authentication & Authorization
Tools like Kerberos, Knox, and Sentry are essential for perimeter control when multiple users are accessing your cluster.

2. Encryption
Initiatives like Intel’s Project Rhino allow you to encrypt the data living in your data stores and manage keys. This is crucial when you need to ensure outsiders can’t access sensitive data but insiders need to be able to decrypt and access the same data. And format-preserving encryption can be used to get as granular as the field level.

3. Tokenization
This is all about substituting sensitive data elements with non-sensitive equivalents or “tokens”. It allows you to preserve data types and formats and protect against stolen keys.

4. Data masking
To persistently or dynamically protect data from certain users (e.g. protecting credit card data in dev and test environments), data masking can be used to de-identify data while still ensuring it looks like the original data. Unlike with encryption, the “masked” data element can’t be decrypted by insiders to reveal the sensitive information being protected.
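To illustrate the behavioral difference between tokenization and masking, here is a deliberately simple Python sketch. It is not production-grade protection (real products use vaulted or format-preserving tokenization and policy-driven masking); it only shows that tokens stay consistent for joins and counts while masked values keep their shape.

```python
# An illustrative toy contrast between tokenization and masking as described above.
import hashlib

def tokenize(value: str, salt: str = "poc-only-salt") -> str:
    """Substitute a sensitive value with a consistent, non-reversible token.
    The same input always yields the same token, so joins and counts still work."""
    digest = hashlib.sha256((salt + value).encode()).hexdigest()
    return f"tok_{digest[:12]}"

def mask_card(card_number: str) -> str:
    """De-identify a card number while keeping its length and format,
    so dev/test code that validates the field still behaves the same."""
    return "*" * (len(card_number) - 4) + card_number[-4:]

card = "4111111111111111"
print(tokenize(card))   # deterministic token, stable across runs for the same salt
print(mask_card(card))  # ************1111
```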

A Checklist for Big Data POCs

1. Visualization and Analytics

Questions to ask yourself

1. List the users (and their job roles) who will be working with these tools.

2. If you’re aiming for self-service discovery, list the types of visualizations your users need (e.g. bar charts, heat maps, bubble charts, scatter plots, tree maps).

3. What types of analytic algorithms will you need to run (e.g. basic or advanced statistics, predictive analytics, machine learning)?

4. If you’re aiming for reporting or predictive analytics, which tools will you be deploying?

5. Do your tools need to access multiple systems? If so, how will they access all of these systems?

6. What is the format of the data you will deliver to the visualization and analytics tools?
2. Big Data Integration

Questions to ask yourself

1. Will you be ingesting real-time streaming data?

2. Which source systems are you ingesting from?

3. Outline the process for accessing data from those systems. (Who do you need to speak to? How frequently will the data be transferred? Will you just capture changes or the entire dataset?)

4. How many data integration patterns will you need for your use case? (Try to get this number down to between two and three.)

5. What types of transformations will your use case call for (e.g. complex file parsing, joins, aggregation, updates)?

6. What type of data (e.g. relational, machine data, social, JSON, etc.) are you ingesting? (Review sample data in the required format.)

7. How many dimensions and measures are necessary for your six-week timeframe?

8. How much data is needed for your use case?

9. How will you infer join keys for disparate datasets?
3. Big Data Quality and Governance

Questions to ask yourself

1. Will your end-users need business-intelligence levels of data quality?

2. Alternatively, will they be able to work with only a few modifications to the raw data and then prepare the data themselves?

3. What tools are available for end-users to do their own data prep?

4. Will you need to record the provenance of the data for either documentation, auditing, or security needs?

5. What system can you use for metadata management?

6. Can you track and record transformations as workflows?

7. Will your first use case require data stewards? If so, list out their names and job titles.

8. How will you match entities (e.g. customers) between datasets from different systems?
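Question 8 is often where POC teams get stuck, so here is a minimal sketch of one common approach: normalize a shared key and join on it. The field names and the key logic are assumptions for illustration; a purpose-built matching engine adds fuzzy scoring and survivorship rules on top of this basic idea.

```python
# A naive sketch of matching customer entities across two systems by normalizing a
# shared key (here, company name plus email domain). Field names are hypothetical.
import pandas as pd

def match_key(name: str, email: str) -> str:
    company = "".join(ch for ch in name.lower() if ch.isalnum())
    domain = email.split("@")[-1].lower()
    return f"{company}|{domain}"

crm = pd.DataFrame({
    "account": ["Acme Corp.", "Globex LLC"],
    "email":   ["jane@acme.com", "sam@globex.com"],
    "stage":   ["Opportunity", "Lead"],
})
web = pd.DataFrame({
    "company": ["ACME Corp", "Initech"],
    "contact": ["buyer@acme.com", "pm@initech.com"],
    "visits":  [42, 7],
})

crm["key"] = [match_key(n, e) for n, e in zip(crm["account"], crm["email"])]
web["key"] = [match_key(n, e) for n, e in zip(web["company"], web["contact"])]

# An outer join keeps unmatched entities visible so data stewards can review them.
matched = crm.merge(web, on="key", how="outer", indicator=True)
print(matched[["account", "company", "stage", "visits", "_merge"]])
```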
4. Big Data Security

Questions to ask yourself

1. Do you know which datasets contain sensitive data and on which systems they reside?

2. Will you need different data masking rules for different environments (e.g. development and production)?

3. Which techniques will you use to protect sensitive data (masking, encryption, and/or tokenization)?

4. Have you set up profiles for Kerberos authentication?

5. Have you set up your users to only access data within your network? What datasets will they have access to?

6. What security policies do you need to work within your POC environment?

5. Big Data Storage Persistence Layer

Questions to ask yourself

1. What systems will you use to store your data (e.g. data warehouse, Hadoop, Cassandra, or other NoSQL platforms)?

2. What type of data do you plan to store in each of these systems? How do you plan to share data between these systems if there is more than one?

3. Will you be using Hadoop for pre-processing before moving data to your data warehouse?

4. Or will you be using Hadoop to both store and analyze large volumes of unstructured data?

5. Which Hadoop sub-projects will you be using (e.g. Hive, HBase, Kafka, Spark)?

The Importance of Documentation

Documentation is vital, especially if your POC is successful. You’ll need to be able to share the details of it once you decide to expand the scope of your project. You’ll also benefit from being able to socialize your new capabilities, what they enabled, and why they matter.

Instead of trying to reverse-engineer the documentation you need after the project, it makes sense to document your progress along these lines:

–– The goals of your POC as defined by the desired business outcomes.

–– The success of your POC as defined by your metrics for success.

–– How data flows between different systems; crucial transformations; mappings; integration patterns and end-points.

–– Your solution architecture and explanations about important design choices the team made along the way.

–– Development support: It should be said that with the right selection of tools, you can minimize the amount of code that needs to be documented.

A Solution Architecture for Big Data Management

No two POCs will have the same solution architecture. But when it comes to determining what technology you need for your big data management stack, it’s important to consider all the necessary layers.

In the following solution architecture, we’ve used a general set of technologies worth considering that span:

–– The visualization and analytics layer

–– The big data management layer

–– And the data storage layer

Additionally, you’ll notice that we’ve also included a horizontal layer to represent the necessary universal metadata catalog we’ve described throughout this workbook.

Use this reference architecture, along with the checklist from the previous section, to plot out your technology needs for an efficient, reliable, and repeatable big data infrastructure.



A Sample Schedule for a Big Data POC

The key to completing a POC in six weeks is having a schedule that prioritizes and organizes everything you need to do. The specific components and layout of your schedule will depend on the nature of your own POC.

But consider the following schedule from the Informatica marketing team’s data lake project for an idea of how to structure your efforts. Bear in mind, the timeline for this deployment was two months, so your six-week POC will need to be carried out in a slightly tighter timeframe.

Big Data Analytics POC Schedule (June–August)

Big Data Analytics Application — Storyboard
–– Data lake POC kick-off
–– Storyboard 1: Internal marketing ops use case for Informatica Web Intelligence Dashboard
–– Storyboard 2: Integrate Rio Social

Data Lake Environment Setup
–– Environment planning & procurement for Amazon Hadoop cluster
–– Set up & configure Informatica BDM with Hadoop
–– Install Tableau Server; set up and test connection with Hadoop

Data Lake Development
–– Extracts scheduled for Adobe Analytics files
–– Extracts scheduled for Rio
–– Analysis & design for story requirement
–– Create Hive tables & integrate Marketo, Rio Social, & Adobe Analytics data on Hadoop
–– Informatica Data Masking & Test Data Generation

Data Lake Web Analytics Dashboards
–– Implement Informatica Web Intelligence Dashboard in Tableau
–– Dashboard validation & POC sign-off
–– Set up & training on Tableau Server

Big Data Analytics Documentation
–– Document reference architecture
–– Develop Data Dictionary

For more context about the details of this project, read our blog series “Naked Marketing”.
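For a sense of what the "Create Hive tables & integrate" line in the schedule involves, here is a hedged PySpark sketch. The paths, schemas, and table names are hypothetical placeholders, and the Informatica team did this step with Informatica BDM rather than hand-written code, but the shape of the work is the same: read the raw extracts, join them, and publish a Hive table for Tableau to query.

```python
# A hedged sketch of landing raw extracts on Hadoop and publishing a Hive table.
# All paths, schemas, and column names are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("marketing-data-lake-poc")
         .enableHiveSupport()
         .getOrCreate())

# Raw extracts landed on HDFS by the scheduled extract jobs.
adobe = spark.read.option("header", True).csv("/data/raw/adobe_analytics/")
marketo = spark.read.json("/data/raw/marketo_leads/")

# Keep only the dimensions the storyboard needs, then join on a shared visitor key.
web_activity = adobe.select("visitor_id", "page_url", "visit_date")
leads = marketo.select("visitor_id", "email", "account_name")

engagement = (web_activity
              .join(leads, on="visitor_id", how="left")
              .groupBy("account_name")
              .agg(F.countDistinct("visitor_id").alias("visitors"),
                   F.count("page_url").alias("page_views")))

# Persist as a Hive table so Tableau can query it over the Hive/Spark SQL endpoint.
spark.sql("CREATE DATABASE IF NOT EXISTS marketing_lake")
engagement.write.mode("overwrite").saveAsTable("marketing_lake.account_engagement")
```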

Conclusion

Making the Most of Your POC



Making the Most of Your POC

At the start of this workbook, we explained that our aim was to help you build and run a POC that is:

–– Small enough that it can be completed in six weeks.

–– Important enough that it can prove the value of big data to the rest of the organization.

–– Intelligently designed so you can grow the project incrementally.

We hope the advice, lessons, and examples we’ve shared in this workbook help you achieve all these things. But more than that, we hope they can help you kick-start a bigger, more important journey for your company.

Successful POCs don’t just prove the value of a specific solution architecture. They prove the value of big data as a viable source of insight and innovation at a time when enterprises need both more than ever.

So really, a successful big data POC is actually just the start of a constant, iterative, and increasingly interesting journey towards better products, smarter analyses, and bigger data.

From Lab to Factory: The Big Data Management Workbook

Breakthrough innovation starts with experimentation. When it comes to big data, that means enterprises need to be able to move ideas from lab to factory environments faster than ever. Read our workbook to find out how.

READ IT NOW
About Informatica®

Digital transformation is changing our world. As the leader in Enterprise Cloud Data Management, we’re prepared to help you intelligently lead the way and provide you with the foresight to become more agile, realize new growth opportunities or even create new inventions. We invite you to explore all that Informatica has to offer—and unleash the power of data to drive your next intelligent disruption. Not just once, but again and again.

Worldwide Headquarters
2100 Seaport Blvd, Redwood City, CA 94063, USA
Phone: 650.385.5000
Fax: 650.385.5500
Toll-free in the US: 1.800.653.3871

informatica.com
linkedin.com/company/informatica
twitter.com/Informatica

CONTACT US

IN18-1217-3055
© Copyright Informatica LLC 2017. Informatica and the Informatica logo are trademarks or registered trademarks of Informatica LLC in the United States and other countries. A current list of Informatica trademarks is available on the web at https://www.informatica.com/trademarks.html. Other company and product names may be trade names or trademarks of their respective owners. The information in this documentation is subject to change without notice and provided “AS IS” without warranty of any kind, express or implied.

Sources

1. Forbes, 84% of enterprises see big data analytics changing their competitive landscape, 2014

2. Data Science Central, The amazing ways Uber is using big data, 2015

3. Forbes, How Airbnb uses big data and machine learning to guide hosts to the perfect price, 2015

4. Fast Company, The world’s top 10 most innovative companies of 2015 in big data, 2015

5. Capgemini, Cracking the data conundrum: How successful companies make big data operational, 2015

6. New York Times, For big data scientists, janitor work is the key hurdle to insights, 2014

7. TDWI, Hadoop for the Enterprise: Making data management massively scalable, agile, feature-rich and cost-effective, Second Quarter 2015
