
APRIL 2016

TDWI E-Book

Shaping the Future of Data Warehousing through Open Source Software
Market Overview: Open Source Data Warehousing
Big Data Analytics: The Importance of MPP Data Warehouses
MPP Data Warehouses: The Open Source Perspective
About Pivotal

Sponsored by: Pivotal

MARKET OVERVIEW:
OPEN SOURCE DATA WAREHOUSING
What are the benefits and risks of using an open source data
warehouse, and why are they just coming to market now? We
look at the basics of open source data warehousing with
Jeff Kelly, a data market strategist at Pivotal Software, Inc.

TDWI: What is an open source data warehouse?


Jeff Kelly: An open source data warehouse is a specialized database
built entirely on open source software code that supports enterprise,
production-grade data analytics and reporting. An open source data
warehouse should also support large-scale exploratory analytics and
data science workloads including machine learning.
What role does the open source community play as it relates to
data warehousing?
Like other open source technologies and projects, community
involvement leads to faster development cycles. Rather than being
beholden to the slow development cycles of a given data warehouse
vendor, open source data warehouse practitioners benefit from
continuous improvements made by the open source community,
which practitioners themselves can participate in to influence
product direction.
Most people associate open source with lower total cost of
ownership (TCO) compared to proprietary technology. Is that
the case with open source data warehousing?
Yes, open source data warehousing significantly reduces TCO. With
open source data warehousing, there are no software licensing costs
and no expensive proprietary hardware to purchase. The code is free
and generally runs on inexpensive commodity hardware. Open source
data warehouses are also an ideal environment for important but
less complex workloads many enterprises currently run on expensive
proprietary appliances, such as large-scale extract, transform, and
load (ETL) workloads.

Open source is not a new phenomenon, but it hasn't been associated with data warehousing until now. Why?

Open source data warehousing has been held back by a lack of vendor support and by practitioners' reluctance to throw in their lot with the few, untested open source options that were available. With the speed of business today, practitioners require more agile, more powerful approaches to data warehousing and analytics that are simultaneously cost-effective to scale. Only open source data warehousing can meet those requirements.

What are other important criteria to consider when evaluating an open source data warehousing option?

As I mentioned, data warehouses support mission-critical workloads, so it is important to select an open source data warehouse with an active, growing community that is continuously developing the code base. It is also critical to pick an open source data warehouse that has at least one, but preferably several, trusted vendors backing it up with world-class support.

How does open source data warehousing relate to the larger big
data technology stack, much of which is based on open source
technology itself?
Data warehousing is an important part of the big data stack,
and open source data warehousing in particular is a perfect
complement to the other open source technologies in that stack,
such as Hadoop. Using a common open source consumption
model for all the components of your big data platform makes
administration that much easier. Open source data warehousing is
also complementary to other big data technologies from a workload
perspective, providing flexible, high-performance analytics and
reporting capabilities that support other important workloads such
as streaming and unstructured data analysis.
Do you expect other data warehouse vendors to move their
proprietary products to open source?
Potentially, but the challenge most vendors face is that open source
is a threat to their business models, which are based on expensive,
proprietary appliances that lead to vendor lock-in. Open source
data warehouses are generally deployed on inexpensive commodity
hardware, and they significantly reduce the risk of lock-in because
practitioners can stop paying their vendor at any time yet continue to
use the software indefinitely!
Are there risks to relying on open source data warehousing?
There are in the sense that for most organizations, the reporting and analytics that data warehousing supports are mission-critical to the business, so it is important that they select a hardened, battle-tested, and reliable open source data warehouse. Failure is not an option.

BIG DATA ANALYTICS: THE IMPORTANCE OF MPP DATA WAREHOUSES

MPP data warehouses are ideal for the dynamic, mixed workloads of today, and they can store, manage, and process information at big data volumes.
The data warehouse plays a critical role in storing, managing,
and processing information at big data scale. This might sound
counterintuitive, especially now that Hadoop, Cassandra, MongoDB,
and other NoSQL platforms are marketed as replacements for the
data warehouse. True, one or more SQL query engines exist for all
of these platforms, but a SQL query engine does not a data
warehouse make.
"If all you want to do is take some flat files and execute SQL across them, that doesn't actually require a database. It requires a translator from SQL to execution. In order to design and build a massively parallel processing [MPP] database, some of the most difficult problems to solve are maintaining consistency across a huge database that runs on a distributed cluster, where you have concurrent access to that data," notes Ivan Novick, a product manager with Pivotal Software, Inc., which markets Greenplum, an open source MPP database.
The number of ACID-compliant (atomic, consistent, isolated, durable) MPP analytical data warehouses is remarkably small. This doesn't mean MPP is a prohibitively expensive proposition, however. Not anymore. Thanks to the commodification of MPP software and server hardware, and a little assist from the world of open source software, MPP performance is surprisingly affordable. It's surprisingly cost-effective, too.

The MPP Data Warehouse Reimagined


Of course, NoSQL systems can also claim very strong price-performance. When it comes to query processing, however, NoSQL's price-performance benefits disappear. NoSQL query engines cannot process analytical queries as efficiently, as richly, or as reliably as MPP databases can.
For one thing, none of the extant SQL engines fully adheres to modern versions of the ANSI SQL standard. (Few, if any, fully implement the ANSI SQL-92 standard; most implement only portions of SQL:1999 and later.) Second, a SQL query engine is only as good, only as useful, powerful, and valuable, as the underlying database it's querying. Hadoop, Cassandra, and MongoDB are not relational database systems. They lack the guarantees, such as support for ACID transactions and rich metadata management features, that ensure data is reliably ingested, structured, cataloged, and, as it were, true.
What's more, Hadoop, MongoDB, and similar NoSQL technologies are general-purpose parallel processing platforms, not analytical MPP platforms. They likewise can't efficiently process concurrent SQL queries from multiple simultaneous users. "Basically, it's very difficult to build a data warehouse as opposed to just a SQL engine," Novick explains. "The difference between a data warehouse and
a SQL engine is that the data warehouse is a guarantor of truth. The data can be consistently guaranteed and true. At the same time, the warehouse can support concurrent access across dozens or hundreds of users. These NoSQL engines compromise on either concurrency, ACID, or richness of SQL expression."
Fifteen years ago, only the largest companies could afford MPP
performance. Thanks to a combination of technological innovation
and ongoing commodification, MPP performance is now much more
affordable. Commercial offerings from Microsoft (which markets
its SQL Server Parallel Data Warehouse), Amazon (which markets
Redshift, an MPP data warehouse in the cloud), and Pivotal (which
offers both commercial and free versions of its open source
Greenplum MPP data warehouse) are priced at a fraction of the
cost of traditional MPP databases. MPP hardware has become less
expensive and more scalable, too. MPP databases used to be sold
as software-hardware bundles. This meant that an MPP server
node stuffed with the latest and greatest Intel Pentium Pro or Xeon
processors would cost more, sometimes much more, than equivalent hardware from manufacturers such as Dell, HP, or IBM.
To some extent, this was a necessary evil. An MPP database
distributes data across all of the nodes in a cluster; this can entail
significant data movement. MPP also relies on a technique called
message passing to coordinate communications between and
among nodes. For this reason, MPP database server nodes used to
be outfitted with proprietary interconnects designed specifically for
high data throughput and low latency.
Today, commodity high-throughput, low-latency technologies such
as 10-gigabit Ethernet are available at much lower price points.
Add it all up and the hardware market has changed dramatically, Novick argues. "Essentially, what's ideal is to use the sweet spot of commodity hardware. In today's world, basically, Intel is commodity, and generally a server that has two Intel processors, literally two, is the sweet spot. If you put four Intel processors in a computer, then it becomes a proprietary design and it becomes very expensive, and you don't get the bang for the buck," he explains.

"Because the data warehouse technologies today are scale out, you can combine 10, 50, 100, even 500 or more servers in a single cluster and build a data warehouse."
Moving to MPP isn't an overly complex or costly process, Novick maintains. You'd move an existing data warehouse over to an MPP data warehouse much like you'd migrate or transition from one conventional data warehouse to another. "The way to do it is to identify and pull out strategic applications based on business value.

Build a new system based on one, two, or three business-critical applications and add to that over time," he says. "If you're running a Netezza data warehouse, you could probably migrate in one shot, but if you have a Teradata system that's supporting 200 different use cases, you'd start by migrating data and applications to the new system piecemeal."
Designing a schema for an MPP database system isn't overly complicated, either. Novick recommends a vertical partitioning scheme with a clearly defined fact table based on a star or snowflake schema.
What's vertical partitioning? He offers an example. "In addition to dividing data into pieces per machine, [a technique called] horizontal sharding, which is what all scale-out systems do, vertical partitioning divides it by, say, time. If I have 500 days of data, I could have a different partition [distributed across all of the nodes in a cluster] for each day," Novick says.

"Now, imagine you're an analyst in a financial services company and someone says to you, 'Give me the count of trades done on this one specific day'; then all 100 machines will operate in parallel and they'll all loop through the data. However, because the data is partitioned separately [and] you have partitioned it by time, [this query] will only process one day (1/500th) of the data it would have scanned if you had not partitioned it."
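A minimal sketch of what this looks like in Greenplum's SQL dialect, with hypothetical table and column names: DISTRIBUTED BY handles the horizontal sharding across machines, while PARTITION BY RANGE adds the time-based vertical partitioning that lets the optimizer skip all but the requested day.

-- Hypothetical trades fact table: hash-distributed across all segments,
-- then range-partitioned by day so a single-day query scans 1/500th of the data.
CREATE TABLE trades (
    trade_id   bigint,
    symbol     text,
    trade_date date,
    amount     numeric
)
DISTRIBUTED BY (trade_id)          -- horizontal sharding across cluster nodes
PARTITION BY RANGE (trade_date)    -- "vertical" partitioning by time
(
    START (date '2015-01-01') INCLUSIVE
    END   (date '2016-05-15') EXCLUSIVE
    EVERY (INTERVAL '1 day')
);

-- All segments work in parallel, but only the one-day partition is scanned.
SELECT count(*) FROM trades WHERE trade_date = date '2016-04-01';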
He also touts a technique that Pivotal and a few other vendors call "dual ETL." It's an alternative to techniques such as data replication or change data capture (CDC), wherein data is replicated from a live master system to a standby backup system. Dual ETL describes a scheme in which two distinct clusters both act as live, or "hot," systems. Both systems are fed by the same ETL processes and enforce the same data validation, data consistency, and data quality rules and mechanisms. Both are likewise available to users for querying and advanced analytics. The clusters can also be geographically distributed for disaster recovery or business continuity purposes, Novick says.
"The customer might say, 'I want to use a replicated solution,' which means that as data is modified on one warehouse, it's automatically updated on the second one. That is expensive and troublesome on the big data system. Just the pure transfer rate of the data, plus the syncing of the data, especially at big data scale, would be expensive and vulnerable to failure," he explains.
Thanks to the commodification of MPP server and software kits, dual ETL is cost-effective on its own terms, Novick argues. "You essentially build two clusters and all of the data warehouse inputs

are done on each cluster. For example, let's say you wanted to create a data warehouse for a retail store chain. You would have two clusters and you would feed, say, retail store purchase data locally into both systems. What that enables you to do is to have two systems that are both live. This allows you to distribute your workloads between both systems. It allows you to do upgrades to one cluster and not to the other. It allows you to have downtime on a cluster if you want to do hardware changes, and so on." This approach permits DBAs to more efficiently boost concurrency, too. A dual ETL topology can permit an organization to achieve double the rate of concurrency, supporting thousands or potentially tens of thousands of simultaneous users.
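The mechanics are simple, which is the point. As a minimal sketch (the host names, file path, and table are hypothetical), the same load script simply runs against both masters, and each cluster ingests and validates its own copy:

-- Run the identical load against both live clusters, for example:
--   psql -h gp-master-east -f load_purchases.sql
--   psql -h gp-master-west -f load_purchases.sql
-- Contents of load_purchases.sql; no cluster-to-cluster replication is needed:
COPY retail_purchases FROM '/staging/purchases_2016-04-01.csv' CSV HEADER;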
Dual ETL isn't everything, of course. MPP systems such as Greenplum use workload management facilities to manage concurrency, too. "The key to this, which has been proven across multiple vendors, is to have a good workload management system where you can define and enforce dynamic rules. This is a rule-based workload management system where you can set different thresholds and conditions and, based on that, allow different queries of different priorities to run at different times," Novick argues, noting that not all queries (or all user classes) are equal, and that some of the users or groups who initiate queries are more important (or more trustworthy) than others. If you know users of a certain group are problematic, it's important to have the ability to terminate their overly expensive queries.
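In Greenplum, this kind of rule-based control is expressed through resource queues. A minimal sketch, with hypothetical queue and role names:

-- Ad hoc analysts get low priority, at most five concurrent
-- statements, and a planner-cost ceiling per query.
CREATE RESOURCE QUEUE adhoc_queue WITH (
    ACTIVE_STATEMENTS=5,
    PRIORITY=LOW,
    MAX_COST=1e+9
);

-- Queries from this role are now admitted through the queue's rules.
ALTER ROLE adhoc_analyst RESOURCE QUEUE adhoc_queue;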
In addition, off-loading infrequently accessed data to non-MPP storage can simplify data archiving as well as boost performance. Frequently accessed data is still available in an online, MPP context, which means cluster resources can be allocated to the workloads that need them most. Less frequently accessed data can be saved on an external system and queried through external tables, seamlessly, from the same SQL interface as internal data, albeit with a performance penalty. This is ideal for meeting the requirements of regulatory agencies that mandate online access to older data.
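In Greenplum, archived data can be exposed this way with an external table definition. A minimal sketch, with a hypothetical archive host and file set served by the gpfdist file-distribution utility:

-- External table over archived files; queried like any other table,
-- just more slowly than data stored inside the cluster.
CREATE EXTERNAL TABLE trades_archive (LIKE trades)
LOCATION ('gpfdist://archive-host:8081/trades_2013_*.csv')
FORMAT 'CSV';

SELECT count(*) FROM trades_archive WHERE trade_date < date '2014-01-01';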

Column Storage, Cloud Storage, and Data Warehouse Futures
Conventional relational databases store data in rows, which means that even if a query needs data from only a single column, the database still has to read every column of each row it scans. This increases I/O as well as seek time and latency. By contrast, a

columnar design helps minimize I/O contention and drastically reduces disk seek times. For these and other reasons, including superior compressibility, a columnar architecture is generally advantageous for analytical workloads. Columnar isn't a panacea, however; there are analytical workloads (e.g., queries on very wide tables) for which a row store is superior.
Columnar versus row store isn't an either/or proposition, Novick argues; some database systems, including Pivotal Greenplum, support both.

"We have a hybrid approach where you define the storage format at the time of table creation, and that can include both row versus column, as well as compression. You can do it not only at the table level but also at the partition level. The point is that you have that flexibility," he says.
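A minimal sketch of that per-table, per-partition flexibility in Greenplum's DDL, with hypothetical names and settings:

-- Storage format is declared at creation time, per table and per partition.
CREATE TABLE sales_history (
    sale_id   bigint,
    region    text,
    sale_date date,
    amount    numeric
)
DISTRIBUTED BY (sale_id)
PARTITION BY RANGE (sale_date)
(
    -- Colder data: column-oriented and compressed.
    PARTITION y2015 START (date '2015-01-01') END (date '2016-01-01')
        WITH (appendonly=true, orientation=column, compresstype=zlib),
    -- Hot, recent data: row-oriented for fast whole-row access.
    PARTITION recent START (date '2016-01-01') END (date '2017-01-01')
        WITH (appendonly=true, orientation=row)
);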
Cloud, too, isn't an either/or proposition. For many (if not most) customers, it will likely be both/and.

"I think the first question about data warehousing in the cloud is a simple one: Where is the source of your data? If your source is in the cloud, then it makes sense to have the data warehouse be there because the data itself is already there. If you have to migrate a huge amount of data from an on-premises location to the cloud, that's another matter," he points out.
Again, for most customers, both on-premises and cloud deployments will likely make sense. There's no shortage of cloud data warehouse services, with offerings from Amazon and Microsoft in addition to specialty providers, but organizations must safeguard against vendor lock-in. The promise of the cloud is openness and portability, but cloud platform-as-a-service (PaaS) offerings aren't always as portable or open as they seem.

"If you're going to locate your data warehouse in the cloud, don't use a single vendor's cloud platform. Use the cloud for infrastructure-as-a-service (IaaS). Instead of consolidating your whole data warehouse on a single vendor's cloud stack, go with something like an Amazon Web Services or Microsoft Azure for the IaaS, meaning the servers, the network, and storage capacity, but use data warehouse software that is portable," he argues.
Novick's isn't a disinterested opinion: Pivotal's Greenplum database can run both in traditional on-premises environments and in cloud IaaS environments. However, Pivotal has its own highly successful cloud PaaS service, too (Cloud Foundry), Novick

points out. Running in cloud IaaS makes it possible to more easily shift data warehousing workloads between on-premises and cloud environments and vice versa.

Rather than being locked into a specific provider's system, you can easily redeploy your data warehouse to another cloud provider or to on-premises hardware. You can more easily take advantage of external cloud storage, too. Right now, cloud providers are offering very cheap storage in the form of services such as Amazon S3. Cheap cloud storage can also be leveraged for data archiving, for example, when you off-load infrequently accessed or "cold" data.

Conclusion
In the era of big data, MPP database systems are ideally suited for many if not most analytical workloads. They're able to support high concurrency rates and, in some cases, new kinds of advanced, NoSQL-style analytics. For example, some MPP platforms can parallelize and run different types of algorithms in the context of the database engine. Take the Apache MADlib (incubating) machine learning library, which runs in the context of the Greenplum database engine. This permits it to benefit from Greenplum's MPP processing.

"This is just one example," says Novick. "The MPP data warehouses of today are utilizing a cluster of servers to store and process data. You can run a machine-learning algorithm that leverages the CPUs of all of the servers in that cluster to do the analysis in parallel."
Hadoop and other NoSQL platforms have positive, distinctive roles to play in the big data architectures of today and tomorrow. NoSQL platforms are well suited for storing and managing multistructured data (e.g., text files, multimedia content, and binary objects), as well as for storing relational data at truly massive scale. The MPP data warehouse, however, is ideal for dynamic, mixed workloads, as well as for storing, managing, and processing information at big data volumes.

"There're really only about seven products in the world that can do that, support big data volumes and big data analytics in a data warehouse. This is why we fully embrace the term data warehouse. We're targeting people to use our system who are running serious businesses," Novick says.

MPP DATA WAREHOUSES: THE OPEN SOURCE PERSPECTIVE

Enterprises looking for alternatives to proprietary data warehouses may find big benefits in open source solutions.
Open source software has transformed software development, delivery, and licensing. It hasn't just changed how businesses use software but also what they expect from the software they use.

Open source has also transformed the price-performance calculus organizations use to evaluate and make decisions about their strategic IT investments. In an era of open source innovation, it's becoming increasingly difficult for organizations to justify the cost and the inflexibility of proprietary platforms. This is true even of highly specialized product segments such as machine learning, data mining, statistical analysis, and, yes, the massively parallel processing (MPP) data warehouse.
"Many customers are looking for alternatives to proprietary data warehouses, especially in the open source space, and the main reason for this is that not only are they paying premium prices, but there's a vendor lock-in coming that companies cannot justify any longer," says Cesar Rojas, product marketing director for the Greenplum open source data warehouse at Pivotal Software, Inc. "Proprietary platforms used to make sense because of their uniqueness in the market, but they're not making sense anymore. Customers feel trapped in that environment, and this is made even worse by the sky-high total cost. In addition to their software, many vendors push their proprietary appliances in a big way, and with their appliances come expensive consulting and implementation services."

Pivotal's Greenplum database is an open source MPP data warehouse. Greenplum itself is based on the open source PostgreSQL database, which has a rich open source pedigree. Greenplum wasn't always an open source offering, though; it wasn't until October 2015 that it became available under the Apache License, Version 2. Rojas says that Pivotal made this decision for Greenplum (and all Pivotal data products) because it was the right thing to do for Pivotal's customers.
"A large number of customers want to start new data warehouse projects or migrate away from proprietary technologies to Greenplum's open source platform because it frees the organization from vendor lock-in," he explains.

"This is definitely something customers are looking for because they want to be able to run a variety of use cases, including reporting, advanced analytics, and data science, in a massively scalable and open source environment. In some ways, we're able to classify ourselves in a very unique way because no one else is doing what we're currently doing."

Open Source MPP Data Warehouse


Open source software has lowered the proverbial bar with respect to
cost of entry, cost of maintenance, and total cost of ownership (TCO)
in many once-specialized markets.

Take the open source GNU-Linux operating system, which turns 23 this year. A quarter of a century ago, the dominant UNIX operating systems were proprietary and ran on costly RISC hardware. GNU-Linux isn't technically UNIX, but it's UNIX-like, and its market share now trounces that of its proprietary UNIX rivals.
Here's another example: the open source R statistical programming environment. Disciplines don't get much more specialized than statistics and data mining, which were dominated by proprietary vendors such as SAS Institute Inc. and the former SPSS Inc. (now IBM SPSS) for decades. R hasn't just mounted a challenge to the dominance of SAS and SPSS; it has arguably already won. Most graduates of college business, engineering, social science, and, of course, statistics programs learned their craft on R, not on proprietary platforms.
There's no shortage of open source database offerings; PostgreSQL and MySQL are just two of the more prominent open source database platforms. Non-MPP platforms use a technology called symmetric multiprocessing (SMP) to scale up (or scale vertically). A MySQL or standard SQL Server database is designed to run on a single server, or node, and to scale across all of the processors or cores on that node. Ideally, an SMP database would scale linearly; in practice, this is never the case because as you add more cores, the ability of the database to use those cores diminishes.

An MPP database can scale up (within a single SMP node) across all of the available cores in a server node. However, an MPP database also scales horizontally, in the sense that it's distributed across multiple SMP nodes in a cluster. When an MPP database processes a query, each of the nodes in the cluster independently processes a piece of the query, so instead of, say, 24 cores, an MPP database can muster 192 cores, 384 cores, 768 cores, and more.

There are several commercial MPP data warehouse platforms, Rojas says, but Greenplum is the only open source MPP database. There isn't another credible open source alternative. In a sense, he maintains, Greenplum's own pedigree demonstrates the difficulty of developing an open source MPP database technology from scratch. Unlike Linux and R, Greenplum started out as a commercial, best-of-breed database. Its designers forked PostgreSQL and spent a decade enhancing Greenplum as a proprietary product.

Unlike its non-MPP open source database alternatives, Greenplum supports both row-based and columnar storage. "Greenplum is fully compliant with SQL. We provide both columnar and row orientation; we call it polymorphic storage," Rojas explains. "We obviously are an MPP database, but we have also developed as part of this technology the first query optimizer specifically for big data. It's a very new open source development (project name: GPORCA) that is modular and independent of the Greenplum engine," he continues.

What does it mean to develop a query optimizer for big data? "When Greenplum utilizes GPORCA to optimize queries, it considers many, many more alternatives than other query optimizers. It optimizes a much wider range of queries and it uses memory extensively to do it," Rojas says.
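In Greenplum builds that include GPORCA, the optimizer can be switched on per session with a configuration parameter, which makes it easy to compare plans against the legacy planner. A minimal sketch (the query and table are hypothetical):

-- Enable GPORCA for this session and inspect the plan it produces.
SET optimizer = on;
EXPLAIN SELECT region, sum(amount) FROM sales_history GROUP BY region;

-- Switch back to the legacy planner for comparison.
SET optimizer = off;
EXPLAIN SELECT region, sum(amount) FROM sales_history GROUP BY region;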
Elsewhere, Pivotal plans to offer a cloud infrastructure-as-a-service (IaaS) deployment option for Greenplum. Once again, even though there's no shortage of cloud databases, it's possible to count the number of cloud MPP data warehouses on a single hand, and to have fingers left over.

"We're running right now on Amazon Web Services, but at the same time we are in the process of working with our Pivotal Cloud Foundry [service] to be able to deploy in any potential client environment that is available to the customer," he says. "This year, we have a very extensive pipeline of cloud innovations. One of the things that is coming in the near term is the ability to [write to] external [database] tables running on Amazon S3. Anything we do in the cloud is going to help us move faster to a managed services type of environment, which makes our technology more elastic."

Machine Learning to the Max

Last fall, Pivotal donated its MADlib machine learning framework to the Apache Software Foundation, or ASF. Apache MADlib (incubating) describes a collection of more than 30 machine learning algorithms. It's one of several machine learning, predictive analytics, data mining, and statistical algorithms or libraries that can run in the context of the Greenplum database engine itself, says Rojas.

"The MADlib library runs in this massively parallel environment. In addition to that, we also run other kinds of in-database analytics," he continues, citing PostGIS, a spatial database extension for the PostgreSQL database, which also runs in-database in Greenplum. "We also provide in-database programming, so anything that is called PL/, so PL/R, which enables R to run in-database, PL/Perl, PL/Python. Those aren't just running in-database, they're running in this MPP environment."
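A minimal sketch of what such in-database programming looks like with PL/R, assuming the PL/R language is installed (the function and table are hypothetical). Because the function lives inside the database, Greenplum can evaluate it on every segment in parallel:

-- Hypothetical PL/R function: computes a median in R, in-database.
-- PL/R exposes the first (unnamed) argument to R as arg1.
CREATE OR REPLACE FUNCTION r_median(double precision[])
RETURNS double precision AS
'return(median(arg1))'
LANGUAGE 'plr';

-- Each region's median is computed where the data lives.
SELECT region, r_median(array_agg(amount)) FROM sales_history GROUP BY region;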

In other words, it's possible to parallelize machine learning, predictive analytics, data mining, and other advanced analytics workloads across a Greenplum MPP cluster. Because these workloads are running on distributed nodes, using the discrete processing, storage, and network resources of those nodes, they

can execute much faster, sometimes several orders of magnitude faster, than on a single-system SMP database. (Not to mention that not all single-system SMP databases can run machine learning algorithms or other types of analytical workloads in-database.)
With machine learning and other advanced analytics practices, iteration, the ability to rapidly build and test prototypes, or hypotheses, is key. The idea is to fail faster. By rapidly iterating through what doesn't work, you more quickly arrive at what does. MPP permits extremely rapid iteration, Rojas says.

"Let's say you're playing with R, you have your R model, you're working in your own little environment [on a test-dev system]. When you want to execute [your prototype] on a massive scale, you take that R model and you execute it in this MPP infrastructure. You're able to run everything in parallel and you get the results in parallel. As an analyst, you're able to iterate much more quickly."
MADlib permits an MPP database to iterate even faster, Rojas maintains. Just as important, he points out, it exposes a SQL interface, so analysts who aren't well versed in Java or Python can write SQL code to exploit MADlib algorithms. "MADlib provides you with MPP implementations of mathematical, statistical, and machine learning [algorithms] for both structured and unstructured data. MADlib also has full SQL execution on top of it, [along with] embedded functions that are also run as SQL," he explains. "For those who are not familiar with Java development, MADlib definitely democratizes the access to analytics by giving the SQL-fluent analyst access to very complete algorithms that would be out of their reach if they needed to start coding Java."
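A minimal sketch of that SQL interface, using MADlib's documented linear regression function with hypothetical table and column names:

-- Train a linear regression model in-database; the computation is
-- parallelized across all Greenplum segments.
SELECT madlib.linregr_train(
    'house_sales',              -- source table (hypothetical)
    'house_sales_model',        -- output table for the fitted model
    'price',                    -- dependent variable
    'ARRAY[1, sqft, bedrooms]'  -- independent variables (1 = intercept)
);

-- Inspect the fitted coefficients and goodness of fit with plain SQL.
SELECT coef, r2 FROM house_sales_model;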

Communal Innovation

Rojas says Pivotal isn't just paying lip service to the importance of community. In addition to MADlib, which became an ASF incubator project late last year, it also donated other valuable proprietary IP: namely, HAWQ, a port of Greenplum that runs natively in Hadoop, with full SQL support, RDBMS-like transactional consistency guarantees, and MPP-database-like parallelization, courtesy of Hadoop.

This is further proof that the proprietary data warehouse is an endangered species, Rojas argues. This isn't to equate the term proprietary with intellectual property (IP), as if killer IP no longer mattered. IP was, is, and ever shall be a critical differentiator. Instead, it's to make a distinction between closed source IP and open source IP.

"The collaboration with the open source community has been incredible. Every single day, there's either a GitHub pull request or a comment. Every day, either our engineers or the community at large is answering questions," he concludes. "We're seeing more and more collaboration not only with the core database engine but also at the tool level. This is very exciting for the engineers. They were working on their own, by themselves, and now everybody wants to collaborate with them. There's also a lot of collaboration with the PostgreSQL community. They're pretty much embracing the fact that we went open source. There's also integration with the main PostgreSQL project. From that point of view, we want to be able to be 100 percent integrated with the latest PostgreSQL release."

In the end, Rojas concludes, community is the backbone of open source software. "We believe all of this [IP] is going to be utilized by the larger community, not only us. This is core to our development philosophy. We've designed all of the components [of the Greenplum database] in a way that is very modular. We want to work with the community and innovate with the community."

pivotal.io

Pivotal's Cloud Native platform drives software innovation for many of the world's most admired brands. With millions of developers in communities around the world, Pivotal technology touches billions of users every day. After shaping the software development culture of Silicon Valley's most valuable companies for over a decade, today Pivotal leads a global technology movement transforming how the world builds software.

Pivotal Greenplum: The Open Source Massively Parallel Data Warehouse

Greenplum Database: The World's First Open Source Massively Parallel Data Warehouse

tdwi.org

TDWI is your source for in-depth education and research on all things data. For 20 years, TDWI has been helping data professionals get smarter so the companies they work for can innovate and grow faster.

TDWI provides individuals and teams with a comprehensive portfolio of business and technical education and research to acquire the knowledge and skills they need, when and where they need them. The in-depth, best-practices-based information TDWI offers can be quickly applied to develop world-class talent across your organization's business and IT functions to enhance analytical, data-driven decision making and performance.

TDWI advances the art and science of realizing business value from data by providing an objective forum where industry experts, solution providers, and practitioners can explore and enhance data competencies, practices, and technologies.

TDWI offers five major conferences, topical seminars, onsite education, a worldwide membership program, business intelligence certification, live webinars, resourceful publications, industry news, an in-depth research program, and a comprehensive website: tdwi.org.

© 2016 by TDWI, a division of 1105 Media, Inc. All rights reserved. Reproductions in whole or in part are prohibited except by written permission. Email requests or feedback to info@tdwi.org. Product and company names mentioned herein may be trademarks and/or registered trademarks of their respective companies.
