TDWI E-Book
Sponsored by:
tdwi.org
Expert Q&A
About Pivotal
MARKET OVERVIEW:
OPEN SOURCE DATA WAREHOUSING
What are the benefits and risks of using an open source data
warehouse, and why are they just coming to market now? We
look at the basics of open source data warehousing with
Jeff Kelly, a data market strategist at Pivotal Software, Inc.
TDWI E-Book: Shaping the Future of Data Warehousing Through Open Source Software
Open source data warehousing has been held back due to a lack of
vendor support and reluctance on the part of practitioners to throw
their lot in with the few, untested open source options that were
available. With the speed of business today, practitioners require
more agile, more powerful approaches to data warehousing and
analytics that are simultaneously cost-effective to scale. Only open
source data warehousing can meet those requirements.
How does open source data warehousing relate to the larger big
data technology stack, much of which is based on open source
technology itself?
Data warehousing is an important part of the big data stack,
and open source data warehousing in particular is a perfect
complement to the other open source technologies in that stack,
such as Hadoop. Using a common open source consumption
model for all the components of your big data platform makes
administration that much easier. Open source data warehousing is
also complementary to other big data technologies from a workload
perspective, providing flexible, high-performance analytics and
reporting capabilities that support other important workloads such
as streaming and unstructured data analysis.
Do you expect other data warehouse vendors to move their
proprietary products to open source?
Potentially, but the challenge most vendors face is that open source
is a threat to their business models, which are based on expensive,
proprietary appliances that lead to vendor lock-in. Open source
data warehouses are generally deployed on inexpensive commodity
hardware, and they significantly reduce the risk of lock-in because
practitioners can stop paying their vendor at any time yet continue to
use the software indefinitely!
Are there risks to relying on open source data warehousing?
There are in the sense that for most organizations, the reporting and
analytics that data warehousing supports are mission-critical to the
business, so it is important that they select a hardened, battle-tested,
and reliable open source data warehouse. Failure is not an option.
In a dual ETL topology, the same loads are done on each cluster. For
example, let's say you wanted to create a data warehouse for a retail
store chain. You would have two clusters, and you would feed, say, retail
store purchase data locally into both systems. That enables you to have
two systems that are both live. This allows you to distribute your
workloads between both systems. It allows you to upgrade one cluster
and not the other. It allows you to take downtime on a cluster if you
want to make hardware changes, and so on. This approach also permits
DBAs to boost concurrency more efficiently. A dual ETL topology can
permit an organization to achieve double the rate of concurrency,
supporting thousands or potentially tens of thousands of simultaneous users.
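The dual ETL idea described above can be sketched in a few lines. The `Cluster` class, the cluster names, and `dual_etl_load` below are hypothetical illustrations, not anything from Greenplum or the e-book; the point is simply that every batch is written to both clusters, so either can take maintenance downtime while the other keeps serving.

```python
# Hypothetical sketch of a dual ETL loader: every source batch is written
# to both clusters so each stays live and independently queryable.

class Cluster:
    def __init__(self, name):
        self.name = name
        self.online = True
        self.rows = []

    def load_batch(self, batch):
        if not self.online:
            raise RuntimeError(self.name + " is down for maintenance")
        self.rows.extend(batch)

def dual_etl_load(clusters, batch):
    """Feed the same batch to every cluster and report per-cluster status."""
    results = {}
    for cluster in clusters:
        try:
            cluster.load_batch(batch)
            results[cluster.name] = "ok"
        except RuntimeError:
            results[cluster.name] = "skipped"  # cluster in a maintenance window
    return results

east, west = Cluster("dc-east"), Cluster("dc-west")

# Normal operation: both live clusters receive the purchases batch.
both_ok = dual_etl_load([east, west], [("store-17", 129.99)])

# Take one cluster down for an upgrade; the other keeps loading.
west.online = False
degraded = dual_etl_load([east, west], [("store-42", 58.10)])
```

A real deployment would of course load through the clusters' own ETL tooling; the sketch only shows why both systems stay live and why one can be upgraded without an outage.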
Dual ETL isn't everything, of course. MPP systems such as Greenplum use
workload management facilities to manage concurrency, too. The key,
which has been proven across multiple vendors, is to have a good workload
management system where you can define and enforce dynamic rules. "This
is a rule-based workload management system where you can set different
thresholds and conditions and, based on that, allow queries of different
priorities to run at different times," Novick argues, noting that not all
queries (or all user classes) are equal, and that some of the users or
groups who initiate queries are more important (or more trustworthy) than
others. "If you know users of a certain group are problematic, it's
important to have the ability to terminate their overly expensive queries."
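A minimal sketch of such a rule-based policy might look like the following. The group names, priorities, and cost units are invented for illustration; production systems such as Greenplum express the same idea through their own resource-management features.

```python
# Hypothetical rule-based workload manager: each user group gets a
# priority and a cost threshold, and a query that exceeds its group's
# threshold is terminated rather than allowed to run.

RULES = {
    "executive_dashboards": {"priority": 1, "max_cost": 1_000_000},
    "analysts":             {"priority": 2, "max_cost": 100_000},
    "ad_hoc":               {"priority": 3, "max_cost": 10_000},
}

def admit(group, estimated_cost):
    """Decide whether a query may run, based on its group's threshold."""
    rule = RULES.get(group, {"priority": 99, "max_cost": 0})  # unknown: deny
    return "run" if estimated_cost <= rule["max_cost"] else "terminate"

decisions = [
    admit("analysts", 50_000),               # within the group's budget
    admit("ad_hoc", 500_000),                # overly expensive ad hoc query
    admit("executive_dashboards", 500_000),  # high-priority group, big budget
]
```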
In addition, off-loading infrequently accessed data to non-MPP storage
can simplify data archiving as well as boost performance. Frequently
accessed data is still available in an online, MPP context, which means
cluster resources can be allocated to the workloads that need them most.
Data that is accessed less frequently can be saved on an external system
and queried seamlessly, using external tables, from the same SQL
interface as internal data, albeit with a performance penalty. This is
ideal for meeting the requirements of regulatory agencies that mandate
online access to older data.
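The tiering described above can be sketched as follows. The table contents and dates are invented; the point is that one query function spans both the hot tier and the external archive, so callers see a single interface.

```python
# Sketch of tiered access: recent rows live in hot MPP storage, older
# rows in an external archive, and one query function reads both.

from datetime import date

hot_rows = [{"day": date(2016, 5, 20), "amount": 40.0}]     # online MPP tier
archive_rows = [{"day": date(2015, 1, 5), "amount": 75.0}]  # external table

def query_sales(min_day):
    """Return matching rows from both tiers, as one SQL interface would."""
    hot = [r for r in hot_rows if r["day"] >= min_day]
    # The archive answers the same predicate, just more slowly in practice.
    cold = [r for r in archive_rows if r["day"] >= min_day]
    return hot + cold

recent = query_sales(date(2016, 5, 1))        # served entirely from hot tier
full_history = query_sales(date(2014, 1, 1))  # transparently spans the archive
```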
Conclusion
In the era of big data, MPP database systems are ideally suited for many
if not most analytical workloads. They're able to support high
concurrency rates and, in some cases, new kinds of advanced, NoSQL
analytics. For example, some MPP platforms can parallelize and run
different types of algorithms in the context of the database engine.
Take the Apache MADlib (incubating) machine-learning library, which runs
in the context of the Greenplum database engine. This permits it to
benefit from Greenplum's MPP processing.
"This is just one example," says Novick. "The MPP data warehouses of
today are utilizing a cluster of servers to store and process data. You
can run a machine-learning algorithm that leverages the CPUs of all of
the servers in that cluster to do the analysis in parallel."
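The pattern Novick describes can be illustrated with a toy example. Here thread workers stand in for segment servers and the "algorithm" is just a mean; this is an analogy for the partial-aggregation pattern, not MADlib's actual implementation. Each worker computes partial statistics over its local shard only, and a coordinator merges them.

```python
# Sketch of the MPP pattern behind in-database parallel analytics: each
# "segment" computes (sum, count) over its local shard in parallel, and
# a coordinator combines the partials into the global result.

from concurrent.futures import ThreadPoolExecutor

shards = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]  # one per segment

def partial_stats(shard):
    # Each segment reads only its own data and returns (sum, count).
    return sum(shard), len(shard)

with ThreadPoolExecutor(max_workers=len(shards)) as pool:
    partials = list(pool.map(partial_stats, shards))

total = sum(s for s, _ in partials)
count = sum(n for _, n in partials)
mean = total / count
```

Real machine-learning algorithms decompose the same way: each iteration computes per-segment partial gradients or sufficient statistics, which the coordinator merges.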
Hadoop and other NoSQL platforms have positive, distinctive roles
to play in the big data architectures of today and tomorrow. NoSQL
platforms are well suited for storing and managing multistructured
data (e.g., text files, multimedia content, and binary objects), as
well as for storing relational data at truly massive scale. The MPP
data warehouse, however, is ideal for dynamic, mixed workloads,
as well as for storing, managing, and processing information at
big data volumes.
"There are really only about seven products in the world that can do
that: support big data volumes and big data analytics in a data
warehouse. This is why we fully embrace the term data warehouse. We're
targeting people to use our system who are running serious businesses,"
Novick says.
An MPP database can scale up (within a single SMP node) across all of
the available cores in a server node. However, an MPP database also
scales horizontally in the sense that it's distributed across multiple
SMP nodes in a cluster. When an MPP database processes a query, each of
the nodes in the cluster independently processes a piece of the query,
so instead of, say, 24 cores, an MPP database can muster 192 cores, 384
cores, 768 cores, and more.
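The core counts cited follow directly from scaling out a 24-core node. A quick check of the arithmetic (the node counts of 8, 16, and 32 are inferred from the figures in the text):

```python
# 24 cores per node, scaled out across 8, 16, and 32 nodes, yields the
# 192-, 384-, and 768-core totals cited above.

CORES_PER_NODE = 24
node_counts = (8, 16, 32)
total_cores = [CORES_PER_NODE * n for n in node_counts]
```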
Communal Innovation
Rojas says Pivotal isn't just paying lip service to the importance of
community.
In addition to MADlib, which became an ASF incubator project late last
year, Pivotal also donated other valuable proprietary IP: namely, HAWQ,
a port of Greenplum that runs natively in Hadoop, with full SQL support,
RDBMS-like transactional consistency guarantees, and MPP-database-like
parallelization, courtesy of Hadoop.
This is further proof that the proprietary data warehouse is an
endangered species, Rojas argues. This isn't to equate the term
proprietary with intellectual property (IP), as if killer IP no longer
mattered. IP was, is, and ever shall be a critical differentiator.
Instead, it's to make a distinction between closed source IP and open
source IP.