Contents
Introduction
The Importance of Starting Small
How To Run A Big Data POC - In Six Weeks
At this point there’s little doubt that big data represents real and tangible value for enterprises. A full 84 percent of them see big data analytics changing the way they compete1. And companies like Uber2 and Airbnb3 are disrupting whole industries with smarter operations fueled by big data.

But even though the potential for big data innovation is significant4, and even though the low cost of storage and scalable distributed computing has shrunk the gap between ideas and implementations, the journey to big data success is a tricky one.

A new set of technological and organizational challenges makes big data projects prone to failure. But if there’s one thing that early big data projects have proven, it’s that you need a phased approach—not only to make projects more manageable, but also to prove the value of big data to the enterprise.

That means starting with a well-planned proof of concept (POC) and building toward a coherent vision of big data-driven success. The aim of this workbook is to share the advice and best practice needed to run a successful big data POC.

Central to this is ensuring your POC has a short time frame. So we’ve written this workbook to help you complete your POC in an ambitious but feasible six weeks. By following the lessons shared here, you’ll be able to run a POC that is:

–– Small enough that it can be completed in six weeks.

–– Important enough that it can prove the value of big data to the rest of the organization.

–– Intelligently designed so you can grow the project incrementally.

Let’s start.

Drinking our own champagne
The marketing team at Informatica recently ran a successful POC for a marketing analytics data lake to fuel account-based marketing. We’ve written a 200-page book that describes the experiences, challenges, and lessons learned during the project. Download a free copy of the book here.
Part 1
General Lessons
POC Challenges
Far too often, POCs run over budget, fall behind schedule, or fail to live up to expectations. In fact, only 13 percent of organizations have big data projects in full-scale production5. And only 27 percent of executives deem their big data projects ‘successful’5.

So before we get into what it takes to make big data projects succeed, it’s important to consider why so many POCs fail.

A lack of alignment with executives
The success or failure of your POC will depend on your relationship with the executives sponsoring the project. So rather than just occasionally reporting to the executive, it’s important that you make them an active participant in the project.

That means understanding the outcomes they’re interested in championing and making sure that what you build proves their value. In some cases you may need to manage expectations, and in others you may need to show them they can actually aim higher. In either case, it’s important that they’re treated like the crucial part of the POC process they are.

Without the active involvement of an executive, you cannot prove the value of a POC to the enterprise. And the POC can’t be completed in six weeks.

Lack of planning and design
You don’t want to spend three of your six weeks just reconfiguring your Hadoop cluster because it was set up incorrectly or for a different project. Even at an early stage, planning, architecture, and design are crucial considerations for big data implementations.

While you may not need your project to have a fully-designed architecture to prove value (in fact, with a time frame of six weeks it’s actually wise not to build a complete architecture), you will need to understand the implications of various technical components.
But scope creep invariably slows down deployments. And a small project that solves a business problem is a lot more valuable than a broad project that doesn’t.

So it’s crucial that your POC be tightly focused on the use case it’s meant to solve in six weeks’ time. That means focusing everyone involved on a single use case that’s important enough to see through before you start building.

So rather than getting caught up in making hand-coded scripts for “slowly changing dimensions” work in new technologies like Spark, it’s a lot smarter to use tools you already know. More importantly, it isn’t easy to scale, deploy, and maintain the code of a small group of experts. So automating crucial data management helps both during and after your POC’s six-week time frame.
A phased approach to big data projects is crucial. But the success of a phased approach isn’t just about starting small. It’s also about having a cohesive vision to build toward. To that end, big data management has a crucial role to play in the success of big data projects for two reasons.

First, because big data management principles streamline a large part of the work that needs to be done during your first six weeks, allowing your developers to focus on the domain-specific problem in front of them.

And second, because once you prove your solution works in a POC environment, you need to be able to scale and deploy it to a production environment in a reliable, flexible, and maintainable way.

So when it comes to building out the solution architecture for your POC, consider the following big data management principles:

Abstraction is key
The key to scale is leveraging all available hardware platforms and production artifacts (such as existing data pipelines). So when it comes to data management, it’s important that the logic and metadata you need is abstracted away from the execution platform.

That way, you make sure that no matter what changes in the runtime environment—for instance, if you realize you need to move from MapReduce to Spark—you can still preserve your integrations, transformations, and logic. It gives you the neutrality you need to manage the data and not the underlying technology.

Metadata matters
One of the biggest challenges with large datasets is the sheer amount of data “wrangling” (cleansing, transforming, and stitching) that needs to occur for data to be usable. In some cases this can take up as much as 80 percent of a data scientist’s time6.

So rather than saddling your scientists with the burden of manually applying the same transformations over and over again, it makes a lot more sense to manage metadata in an accessible way. Global repositories of metadata, rules, and logic make it a lot easier to use a rationalized set of integration patterns and transformations repeatedly.
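The abstraction principle can be sketched in a few lines of code: if pipeline logic lives as engine-neutral metadata, swapping the runtime (say, MapReduce for Spark) doesn’t mean rewriting your transformations. This is an illustrative sketch under our own assumptions, not any vendor’s API; the names `Transform`, `PIPELINE`, and `run_locally` are hypothetical.

```python
# A minimal sketch (not a vendor API): pipeline logic declared as
# engine-neutral metadata so the same transformations can be re-targeted
# from one runtime (e.g. MapReduce) to another (e.g. Spark).
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass(frozen=True)
class Transform:
    name: str
    fn: Callable  # pure, engine-agnostic record logic

# The pipeline itself is just data: an ordered list of named transforms.
PIPELINE = [
    Transform("trim_names", lambda row: {**row, "name": row["name"].strip()}),
    Transform("usd_to_cents", lambda row: {**row, "amount": int(row["amount"] * 100)}),
]

def run_locally(rows: Iterable[dict]) -> list[dict]:
    """Local runner; a Spark runner would map the same transforms over a DataFrame."""
    out = list(rows)
    for t in PIPELINE:
        out = [t.fn(r) for r in out]
    return out

result = run_locally([{"name": " Ada ", "amount": 1.5}])
```

Because the transforms hold no engine-specific code, only the runner changes when the runtime does.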
Leverage your existing talent
Thanks to technologies like Hadoop, it’s become incredibly easy to add low-cost commodity hardware. At the same time, it’s becoming harder to add the right talent. For instance, if you’re using Spark as your runtime environment, you need developers who can code in specialized languages and understand the underlying computing paradigm.

So instead of investing in outside talent for every new skill and technology, it’s smarter to leverage your existing talent to integrate, cleanse, and secure all your data using the data management tools and interfaces your team knows how to use. Make sure your data management tool can deploy data pipelines to big data platforms such as Hadoop, and specialty processing engines on YARN or Spark.

Automate to operationalize
Data management in a POC is hard. And data quality is fundamental to the success of a big data project. So while you might be able to complete a POC with a small team of experts writing a lot of hand coding, all it proves is that you can get Hadoop to work in a POC environment.

It’s no guarantee that what you built will continue to work in a production data center, comply with industry regulations, scale to future workloads, or be maintainable as technologies change and new data is onboarded. The key to a successful POC is preparing for what comes next.

At this stage you need automation to scale your project. So keep the future scale of your initiative in mind as you plan and design the POC.

Security will be more than a checkbox
At the POC stage, data security is basically treated like a checkbox. That’s because in a non-production environment with a small group of users, you might be able to manage risk without a clearly-defined security strategy.

But once you operationalize your project, you have to make sure your big data application doesn’t turn into a data free-for-all. At scale, you need to be able to not only apply strict security policies centrally, but also to make sure enforcing them secures your data across users, applications, environments, and regions.

It’s no mean feat. For example, do you know which data sets contain sensitive information? And can you de-identify and de-sensitize them in case of a security breach? Even though you may get away with minimal security now, it helps to consider and discuss the various threats to the security of the data that resides in your cluster.
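The question “which data sets contain sensitive information?” can be explored even during a POC. The sketch below is illustrative only: a naive scan of sample records for sensitive patterns, with simplistic example regexes and hypothetical field names, nothing like a production discovery tool.

```python
# Illustrative only: a naive scan to answer "which datasets contain
# sensitive information?" before operationalizing. The patterns and
# field names are simplistic examples, not a production discovery tool.
import re

PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def flag_sensitive(rows: list[dict]) -> set[str]:
    """Return the column names whose sample values match a sensitive-data pattern."""
    flagged = set()
    for row in rows:
        for col, val in row.items():
            if any(p.search(str(val)) for p in PATTERNS.values()):
                flagged.add(col)
    return flagged

cols = flag_sensitive([{"contact": "ada@example.com", "city": "Redwood City"}])
```

Even a rough inventory like this gives you something concrete to discuss when planning production security.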
Part 2
The biggest challenge with a big data POC lies in picking the first use case. The key lies in finding a use case that can deliver an important business outcome—while still being small enough to solve within budget and in six weeks. (Note: we’d advise against having multiple use cases for a six-week POC.)

Once you’ve identified a business challenge you can solve with big data, minimize the scope of the project as much as possible. The smaller and more tightly focused your project case, the greater the likelihood of success.

Of course, it isn’t worth over-simplifying your project and missing out on crucial capabilities. But starting small and having a clearly defined roadmap and phased approach is key.

It’s better to spend a few weeks refining the features and functions of your project case down to the bare minimum required to prove value, before you start the actual POC. That way, you can ensure everyone’s on the same page about what needs to be done and avoid scope creep later. And any requirements that fall outside the scope of your first project can be re-considered during the next phase.

Limit the scope of your POC along these dimensions:

The business unit
In a six-week POC, it’s smarter to serve one business unit rather than trying to serve the whole enterprise. For enterprise-wide initiatives, you typically need champions from multiple departments, and within a six-week timeframe it can be challenging to find and manage them. The smaller the team of business stakeholders you need to engage and report to, the easier it’ll be to meet and manage expectations.

The executive
As we’ve said, executive support is crucial to the success of your POC. But trying to cater to the needs of multiple executives takes much-needed focus and resources away from your six-week timeframe. As far as possible, make sure you’re only working with one executive during your POC—and that you’re working closely together to deliver the same vision.
The team
Effective collaboration between IT and business stakeholders is crucial if you’re going to complete your POC in six weeks. So involve only the most essential team members needed to prove value in six weeks.

Depending on the nature of your project and the way your legacy stack is set up, this team is likely to consist of a domain expert, data scientist, architect, data engineer, and data steward—though one person may wear many hats.

The data
Aim for as few data sources as possible. Go back and forth between what needs to be achieved and what’s possible in six weeks until you have to manage only a few sources. You can always add new ones once you’ve proved the value.

In terms of the actual data itself, focus only on the dimensions and measures that actually matter to your specific POC. If your web analytics data has 400 fields but you only need 10 of them, you can rationalize your efforts by focusing on those 10.

Similarly, if you’re attempting to run predictive analytics based on customer data, it might be better to validate a model based on a smaller but indicative subset of attributes so as to not overfit your analytic models.

Cluster size
Typically, your starting cluster should only be as big as five to 10 nodes. We recommend you work with your Hadoop vendor to determine the appropriate size of cluster needed for your POC. It’s smarter to start small during the POC phase and then scale your development and production clusters, since Hadoop has already proven its linear scalability.

And the six weeks you have for your POC should really only serve to prove your solution can solve your domain-specific business challenge.

When it came to building out the marketing data lake at Informatica, our team focused on only the most essential data sources needed to gain a consolidated view of how prospect accounts engaged with our website: Adobe Analytics, Salesforce, and Marketo.

Once we’d proven the value of this use case, we were in a position to add new sources such as Demandbase and ERP data, as well as other techniques such as predictive analytics.
Sample Projects
Now let’s take a look at some use cases from successful POCs that limited their scope to prove value and grow incrementally.

A Marketing Data Lake for Account-Based Marketing
Part 3
Proving Value
A successful POC needs to prove a number of different things:

–– That the big data project will deliver a measurable return.

–– That the technology stack used for the project will help business stakeholders achieve that return in a reliable, repeatable way.

–– That the solution architecture is worth further investment once you start to operationalize the project.

The starting costs for big data POCs can be misleadingly low, and costs can spiral once you operationalize. So your POC needs to have comprehensively proved its value before this point.

To that end, even before your POC has started, it’s important that you begin translating expectations of success into clearly defined metrics for success.

Based on your first use case, list out as many of the most appropriate metrics for success as you can. If the metrics you’re listing aren’t likely to prove this is a project worth further investment, it’s worth redesigning your project to achieve more relevant metrics.

Quantitative measures (examples)
–– Identify 1,200 high revenue potential prospects within one month.

–– Identify anomalies in general ledger data 30 percent faster than in the enterprise data warehouse.

–– Make customer product preferences available within top-fold of Salesforce interface for 20,000 leads.

Qualitative measures
–– Agility: Allow marketers to ask more sophisticated questions, such as who are my most influential customers.

–– Transparency: Improve visibility into provenance of atypical data entries, leading to better identifying fraudulent patterns.

–– Productivity and efficiency: Reduce number of man-hours spent by data analysts on stitching and de-duplicating data by 50 percent.
One big indicator of success for Informatica’s marketing team was the number of people actually using the tools they created. By recording usage metrics, they could prove adoption of the solution by field marketers and sales development representatives to make sure they were on the right track.

SL No  Job Title          View ID  View Name  Last Viewed     Viewed
1      Product Marketing  1096     Dashboard  12/17/16 14:02  71
2      Sales Director     1096     Dashboard  01/04/17 11:07  217
3      SDR                1096     Dashboard  12/21/16 11:06  1
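The adoption measurement described above can be sketched very simply: tally dashboard views per job role from a usage log. The field names below are illustrative, not the actual schema the Informatica team used.

```python
# A hypothetical sketch of adoption measurement: tally dashboard views
# per job role from a usage log. Field names are illustrative only.
from collections import Counter

view_log = [
    {"job_title": "Product Marketing", "view_id": 1096},
    {"job_title": "Sales Director", "view_id": 1096},
    {"job_title": "Sales Director", "view_id": 1096},
]

# Views per role: a simple adoption indicator for the POC report.
adoption = Counter(v["job_title"] for v in view_log)
```

Tracked over the six weeks, counts like these make “people are actually using it” a quantifiable claim.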
Part 4
For the most part, the hype around big data technologies has focused on either visualization and analysis tools like Tableau and Qlik, or distributed storage and processing technology like Hadoop and Spark.

And for the most part, the hype has been justified. Visualization tools have successfully democratized the use of data in enterprises. And Hadoop has simultaneously reduced costs, improved agility, and increased the linear scalability of enterprise technology stacks.

But big data projects rely on a crucial layer between these two—the data management layer. Without it, data isn’t fit-for-purpose, and developers get mired in maintaining spaghetti code.

Visualization and analysis
Includes analytic and visualization tools used for statistical analyses, machine learning, predictive analytics, and data exploration.

Here, user experience is key. Business users, analysts, and scientists need to be able to quickly and easily query data, test hypotheses, and visualize their findings, preferably in familiar spreadsheet-like interfaces.

Big data management
Includes core tools to integrate, govern, and secure big data, such as pre-built connectors and transformations, data quality, data lineage, and data masking.

Storage persistence layer
Includes storage persistence technologies and distributed processing frameworks like Hadoop, Spark, and Cassandra.

Here the emphasis should be primarily on cost reduction and leveraging the linear scalability these technologies enable.
Just over half (51 percent) of Hadoop clusters are run on-premise today7. That’s a surprising amount considering Hadoop’s association with the rise of cloud computing.

But it does raise an interesting question about whether or not you should run your cluster in the cloud or on your own hardware.

On the one hand, cloud storage is cheaper, offers elasticity, and makes it easier to access commodity hardware. In fact, cloud vendors have the resources and expertise to actually provide significantly better security than most companies can afford.

Additionally, if you’ve already sunk the costs of hardware and have the staff for 24/7 maintenance, it might actually be easier to get started on your own premises.

In either case, it’s clear that as projects scale and evolve, you’ll likely change the runtime environment under your stack. So make sure you manage your metadata strategically and abstract away transformations, logic, and code so it’s easier to deploy elsewhere.

[Chart: 51% of Hadoop clusters run on-premise; 49% on cloud providers, managed services, and other]
–– Match and link entities from disparate data sets: The more data you’re managing, the harder it becomes to infer and detect relationships between different entities and domains. By building entity matching and linking into your stack, you’re making it easier for your analysts to uncover trends and patterns.

–– Provide end-to-end lineage: An important function of a global repository of metadata is to make it easy to uncover the provenance of data. For one thing, this streamlines compliance processes because it increases the auditability of all your data. But it also makes it easy to map data transformations that need repeating.

3. Big data security
Data security can be fairly light during POCs. But the second your project succeeds, security becomes a whole other ball game. This is particularly challenging with Hadoop implementations, where it’s hard to monitor and control different users from different teams and even regions.

For instance, if you have a team in India accessing certain data that was generated in California, you need to make sure you have the tools in place to control what they can and cannot see.

While it may not directly apply to your POC project, it’s important to start thinking (and talking) about your security strategy for a production environment. At scale you need to be able to:

–– Monitor sensitive data wherever it lives and wherever it’s used: That means risk-centric security analytics based on a comprehensive 360-degree view of all your sensitive data.

–– Carry out risk assessments and alert stakeholders: Security breaches occur all too frequently. So you need to make sure you clearly understand where your sensitive data resides, who’s using it, and for what purpose. Risk scorecards based on trends and anomalies are essential to detect exposure to breaches when they happen.

–– Mask data in different environments to de-identify sensitive data: Encryption and access control can only go so far. You don’t want your security controls to be prohibitive and intrusive. So it makes a lot more sense to use persistent and dynamic data masking to make sure different users can see the data they need, while never exposing the data they don’t.
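As a minimal sketch of what persistent masking means in practice, the snippet below replaces sensitive fields with stable one-way tokens, so records still join on the masked value while raw values never reach the shared cluster. The field names and salt are hypothetical, and real data management tools apply such policies centrally rather than in ad-hoc code; note too that a simple salted hash can be guessable for low-entropy values.

```python
# A minimal sketch of persistent masking, assuming a simple tokenization
# scheme: sensitive fields are replaced with stable one-way tokens so
# records still join, while raw values never reach the shared cluster.
# Field names and salt are hypothetical; production tools apply masking
# policies centrally rather than in ad-hoc code.
import hashlib

SENSITIVE_FIELDS = {"email", "ssn"}

def mask_record(record: dict, salt: str = "poc-demo-salt") -> dict:
    """Return a copy of record with sensitive values replaced by stable tokens."""
    masked = {}
    for key, value in record.items():
        if key in SENSITIVE_FIELDS:
            digest = hashlib.sha256((salt + str(value)).encode()).hexdigest()[:12]
            masked[key] = "tok_" + digest
        else:
            masked[key] = value
    return masked

row = mask_record({"email": "ada@example.com", "region": "CA"})
```

Because the token is deterministic for a given salt, two masked datasets can still be linked on the masked field without exposing the underlying value.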
1. List the users (and their job roles) who will be working with these tools.
3. What types of analytic algorithms will you need to run (e.g. basic or advanced statistics, predictive analytics, machine learning)?
5. Do your tools need to access multiple systems? If so, how will they access all of these systems?

1. Will you be ingesting real-time streaming data?
3. Outline the process for accessing data from those systems. (Who do you need to speak to? How frequently will the data be transferred? Will you just capture changes or the entire dataset?)
5. What types of transformations will your use case call for (e.g. complex file parsing, joins, aggregation, updates)?

1. Will your end-users need business-intelligence levels of data quality?
2. Alternatively, will they be able to work with only a few modifications to the raw data and then prepare the data themselves?
3. What tools are available for end-users to do their own data prep?
4. Will you need to record the provenance of the data for either documentation, auditing, or security needs?

1. What system can you use for metadata management?
3. Will your first use case require data stewards? If so, list out their names and job titles.

1. Do you know which datasets contain sensitive data and on which systems they reside?
2. Will you need different data masking rules for different environments (e.g. development and production)?
3. Which techniques will you use to protect sensitive data (masking, encryption, and/or tokenization)?
4. Have you set up profiles for Kerberos authentication?
5. Have you set up your users to only access data within your network? What datasets will they have access to?
6. What security policies do you need to work within your POC environment?

2. What type of data do you plan to store in each of these systems? How do you plan to share data between these systems if there is more than one?
4. Or will you be using Hadoop to both store and analyze large volumes of unstructured data?
No two POCs will have the same solution architecture. But when it comes to determining what technology you need for your big data management stack, it’s important to consider all the necessary layers.

In the following solution architecture, we’ve used a general set of technologies worth considering that span:

Additionally, you’ll notice that we’ve also included a horizontal layer to represent the necessary universal metadata catalog we’ve described throughout this workbook.

Use this reference architecture, along with the checklist from the previous section, to plot out your technology needs for an efficient, reliable, and repeatable big data infrastructure.
Data lake POC kick off
Storyboard 1: Internal marketing ops use case for Informatica Web Intelligence Dashboard
Storyboard 2: Integrate Rio Social

–– Environment planning & procurement for Amazon Hadoop cluster
–– Install Tableau Server; set up and test connection with Hadoop
–– Set up & configure Informatica BDM with Hadoop
–– Extracts scheduled for Adobe Analytics files; extracts scheduled for Rio
–– Create Hive tables & integrate Marketo, Rio Social, & Adobe Analytics data on Hadoop
–– Set up & training on Tableau server
–– Dashboard validation & POC sign-off

For more context about the details of this project, read our blog series “Naked Marketing”.
Conclusion
At the start of this workbook, we explained that our aim was to help you build and run a POC that is:

–– Small enough that it can be completed in six weeks.

–– Important enough that it can prove the value of big data to the rest of the organization.

–– Intelligently designed so you can grow the project incrementally.

Successful POCs don’t just prove the value of a specific solution architecture. They prove the value of big data as a viable source of insight and innovation at a time when enterprises need both more than ever.

So really, a successful big data POC is actually just the start of a constant, iterative, and increasingly interesting journey towards better products, smarter analyses, and bigger data.
About Informatica®
Digital transformation is changing our world. As the leader in Enterprise Cloud Data Management, we’re prepared to help you intelligently lead the way and provide you with the foresight to become more agile, realize new growth opportunities or even create new inventions. We invite you to explore all that Informatica has to offer—and unleash the power of data to drive your next intelligent disruption. Not just once, but again and again.

Worldwide Headquarters
2100 Seaport Blvd, Redwood City, CA 94063, USA
Phone: 650.385.5000
Fax: 650.385.5500
Toll-free in the US: 1.800.653.3871

informatica.com
linkedin.com/company/informatica
twitter.com/Informatica
IN18-1217-3055
© Copyright Informatica LLC 2017. Informatica and the Informatica logo are trademarks or registered trademarks of
Informatica LLC in the United States and other countries. A current list of Informatica trademarks is available on the
web at https://www.informatica.com/trademarks.html. Other company and product names may be trade names or
trademarks of their respective owners. The information in this documentation is subject to change without notice and
provided “AS IS” without warranty of any kind, express or implied.
Sources
1. Forbes, 84% of enterprises see big data analytics changing their competitive landscape, 2014
2. Data Science Central, The amazing ways Uber is using big data, 2015
3. Forbes, How Airbnb uses big data and machine learning to guide hosts to the perfect price, 2015
4. Fast Company, The world’s top 10 most innovative companies of 2015 in big data, 2015
5. Capgemini, Cracking the data conundrum: How successful companies make big data operational, 2015
6. New York Times, For big data scientists, janitor work is the key hurdle to insights, 2014
7. TDWI, Hadoop for the Enterprise: Making data management massively scalable, agile, feature-rich and cost-effective, Second Quarter 2015