
Big Data Everywhere: Making Big Data Results Easy, Economic & Fast

Companies that can harness big data will trample big data incompetents, writes The
Economist. The McKinsey Global Institute's report on Big Data states that the value of
Big Data to the US health care system could be $300 billion, that a 60% increase in
retailer operating margins is possible and that location data could provide $600
billion of value to consumers globally.
Big Data is everywhere and it represents a huge opportunity to those who can use
it effectively. But how do you know whether you have a Big Data problem? And if
you do, how do you solve it?
The relational database has been at the heart of many IT systems for more than two
decades. In recent years, this has been augmented by data warehousing technology that
can support business analytics without impacting the core online transaction processing
(OLTP) functions of the main database.
The world is changing. Location data from mobile devices, global web-scale applications like
social media, machine-generated data and the data exhaust from a multitude of new systems
present new opportunities to derive value and improve efficiency, but they also present
significant challenges in terms of volume, velocity and variety of data. Traditional relational
databases are often a poor fit for managing these new Big Data requirements because of cost,
performance or both.
The analyst firm Forrester defines Big Data in terms of four key attributes - Volume, Velocity,
Variety and Variability. This is a catchy definition but the reality is that any one of the four Vs
can be a challenge for existing data management systems, giving you a Big Data problem.
This white paper provides an introduction to some of the new technology that can help.
Scaling out versus scaling up
For more than two decades, relational databases have been the standard way to manage large
datasets. The technology is well understood, highly standardised and works well for a wide
range of use cases. But as data volumes grow and workloads change, alternative approaches
become attractive. The traditional relational database runs on a single computer. As data sizes
and transaction rates increase, the means of growing the capacity of the system is to attach
more storage and use a faster computer. This works up to the point that no faster computer or
high speed storage is available, or if available, is too expensive to be considered.
Using many smaller computers rather than one large one is an attractive idea. The sweet spot
in terms of price-performance is smaller computer systems based on commodity components.
When these computers can use local storage rather than a costly storage area network (SAN)
or other shared storage, further costs get eliminated along the way. This has been known for
years; the problem has been how to take advantage of these lower cost computers for database
applications.
New generation database technologies make it possible to apply many low cost servers to the
task of running a database, distributing the load across the servers in a scale-out approach that
can collectively provide the resources needed even for very demanding Big Data applications.
The ability to run on low cost hardware and scale in this way gives another advantage: since
public cloud infrastructures are built on this kind of architecture, scale-out databases are also
well suited to running in the cloud.
NOSQL databases, together with distributed processing frameworks such as Hadoop, are
aimed at making the exploitation of scale-out architecture possible. By reducing the cost of
dealing with the data explosion, such technologies change the economics of Big Data, allowing
you to drive value and transform your business without sacrificing the performance that Big
Data applications demand.
US office: 181 Fremont St, San Francisco CA. +1-866-487-2650 UK office: Wenlock Building, 50-52 Wharf Rd, London N1 7EU. +44-203-1760143
For additional information please email contact@acunu.com or call us at the numbers below.
NoSQL & NOSQL
NoSQL (and NOSQL) are frequently used terms. Unfortunately neither is very meaningful.
NoSQL is a recently invented term used to describe those databases which do not have SQL
as their query language. But document databases, graph databases, object databases, tuple
and triple databases, key-value stores and many others all meet this definition while being wildly
different in their capabilities. NoSQL is merely a category that excludes traditional relational
databases.
To add to the confusion, some of the new NoSQL databases have recently had SQL or SQL-like
features added, leading some to adjust NoSQL to read NOSQL, meaning "Not Only SQL".
This does not help much either.
In reality, NoSQL or NOSQL databases make a number of design compromises which go
beyond SQL. Typically they allow some of the guarantees around atomicity, consistency and
isolation to be relaxed, disallow some kinds of complex transactions and do not support join
operations. This sounds drastic to those brought up to think of these features as fundamental
to database design but there is a trade-off here. NOSQL databases can provide outstanding
performance and high availability for a wide variety of emerging use cases which are outlined
below. More importantly, they can do so at low cost.
Databases such as Acunu Reflex are built for this new world. Low cost commodity servers or
the cloud are assumed to be the target for deployment. That means that distribution is a given,
both to ensure the ability to scale out and to provide continued operation even in the presence
of the occasional inevitable hardware failure, whether of a server component such as a disk or
network interface, or even of an entire data center.
The death of the relational database?
Any claims that the relational database is dead are greatly exaggerated. Relational databases
are not going away. Expect to continue using them for applications where they work well: those
that require the sophisticated support that relational databases provide for complex transactions
for example.
But there are plenty of use cases where the relational database is overkill, where scaling to
meet performance or throughput challenges is a problem and where new technology can
provide a better solution at lower cost.
What use cases make sense?
Time series data, like telemetry from smart metering projects, IT infrastructure monitoring and
price data in financial markets
Time series data is common but using a relational database to store it can be a poor choice. In
many cases, new time-series data simply gets added at the end of existing data. There are no
complex transactions needed to combine the new data with the existing data, so the
complexity of a relational database simply imposes performance overheads and
cost on the resulting system.
Well-designed NOSQL systems will handle large volumes of time series data
easily and without the performance degradation over time that can hamper the
ability of a relational database to handle the largest time series datasets.
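The append-only pattern described above can be sketched in a few lines of Python. This is an illustrative toy, not how any particular NOSQL product stores data: the class name `TimeSeriesStore`, the series name `meter-42` and the sample readings are all hypothetical. It shows that new readings are simply added at the end of a series and that a time-range read is a search over ordered data, with no transaction needed to merge new data into old.

```python
import bisect
from collections import defaultdict


class TimeSeriesStore:
    """Toy in-memory store: one append-only, time-ordered list per series."""

    def __init__(self):
        self._timestamps = defaultdict(list)
        self._values = defaultdict(list)

    def append(self, series: str, timestamp: int, value: float) -> None:
        """Add a reading at the end of the series; nothing existing is modified."""
        ts = self._timestamps[series]
        if ts and timestamp < ts[-1]:
            raise ValueError("this sketch only accepts in-order timestamps")
        ts.append(timestamp)
        self._values[series].append(value)

    def read_range(self, series: str, start: int, end: int) -> list:
        """Return (timestamp, value) pairs with start <= timestamp < end."""
        ts = self._timestamps[series]
        lo = bisect.bisect_left(ts, start)
        hi = bisect.bisect_left(ts, end)
        return list(zip(ts[lo:hi], self._values[series][lo:hi]))


# Hypothetical smart-meter readings: (timestamp in seconds, reading).
store = TimeSeriesStore()
for t, reading in [(100, 1.5), (160, 1.7), (220, 2.1)]:
    store.append("meter-42", t, reading)

print(store.read_range("meter-42", 100, 200))
```

A real system adds durability, distribution and handling of late-arriving data, but the access pattern stays the same: append at the end, read by time range.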
Write intensive workloads with large working sets and low latency requirements, such as games, online
advertising and real-time operational analytics
Relational databases are typically optimised for workloads where reading the data occurs more often than
writing or modifying it. Where you need to absorb huge amounts of data which will be read relatively
infrequently, look for a system that supports fast writes.
Many NOSQL databases are optimised for write-intensive workloads, providing the ability to deal with
data rates that would require high-end hardware in a traditional solution.
High availability - particularly where you do not have control over the hardware infrastructure and do not want to
have to trust to a cloud provider's SLA (or lack of one)
Building high availability into a relational database is hard. Common solutions include using shared
storage and multiple replica or standby servers that can take over when the active server fails and
special software to manage the failover process. Not only is this expensive, it often requires a high
level of control over the hardware infrastructure, a fairly sophisticated operations capability and
continual testing. Worse, in cloud deployments you may have no control over the kind of hardware
that is provided to you, so some of the techniques used to make a relational database highly available
become impossible.
The answer is a NOSQL system that allows simple, automatic
replication of data across multiple servers with no single point
of failure. Systems such as Acunu Reflex allow the deployment
of clusters that span multiple data centers, so that tough
business continuity challenges become much easier.
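As a rough illustration of replication with no single point of failure, the sketch below places each key on several nodes chosen from a hash ring. It is a simplified teaching example under stated assumptions, not the placement algorithm of Acunu Reflex or any other product: the `Ring` class, the node names and the replication factor of 3 are all hypothetical.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a string to a stable position on the ring."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    """Toy hash ring: every node is equal and owns one position on the ring."""

    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = replication_factor
        self._ring = sorted((token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._ring]

    def replicas(self, key):
        """Walk clockwise from the key's position and take the next nodes."""
        start = bisect.bisect_right(self._tokens, token(key))
        count = min(self.replication_factor, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]

    def readable_from(self, key, failed):
        """Replicas still able to serve the key when some nodes are down."""
        return [n for n in self.replicas(key) if n not in failed]


ring = Ring(["node-a", "node-b", "node-c", "node-d", "node-e"])
owners = ring.replicas("customer:1001")
survivors = ring.readable_from("customer:1001", failed={owners[0]})
print("replicas:", owners)
print("still readable from:", survivors)
```

Because every key lives on three nodes and no node plays a special coordinating role, losing any one server still leaves two copies available. Spanning data centers follows the same idea, with replicas chosen so that they land in different locations.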
Global deployments where you need to be able to put the data close
to your customers
Whether it is providing low latency access to web customers to
improve conversion rates or capturing activity across a global user
base, NOSQL technologies provide automated data distribution,
replication, etc. in a manner that is generally much less expensive
than when using a complex relational architecture. NOSQL databases have been designed to facilitate a
high degree of automated data protection while leveraging commodity hardware. This is much less
expensive and easier to manage than the commonly manual approach of spreading data out across
numerous relational database servers.
Scale from a handful to tens or even hundreds of servers at low incremental cost
It is possible to scale a relational database across many servers by sharding, that is, splitting the data so
that in a customer database, for example, customers whose names begin with A sit on one server, B
on the next and so on. The problem with this approach is that splitting the data and load evenly between
servers is hard (you probably don't have many customers whose names begin with X) and that many
operations that would be simple if the data were all in one place (like joins) become difficult or impossible.
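The skew problem with first-letter sharding is easy to demonstrate. The sketch below compares it with hashing the whole key, a common alternative. The function names and the customer list are hypothetical and chosen only for illustration.

```python
import hashlib
from collections import Counter


def shard_by_initial(name: str, num_shards: int) -> int:
    """Range-style sharding: the first letter picks the shard (A -> 0, B -> 1, ...)."""
    return (ord(name[0].upper()) - ord("A")) % num_shards


def shard_by_hash(name: str, num_shards: int) -> int:
    """Hash sharding: a stable digest of the whole key picks the shard."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Hypothetical customer names, skewed towards one common initial.
customers = ["Smith", "Sanchez", "Singh", "Suzuki", "Schmidt", "Stewart",
             "Adams", "Brown", "Xu"]

by_initial = Counter(shard_by_initial(n, 26) for n in customers)
by_hash = Counter(shard_by_hash(n, 4) for n in customers)

print("customers per shard, by initial:", dict(by_initial))
print("customers per shard, by hash:   ", dict(by_hash))
```

With first-letter sharding, six of the nine customers land on the "S" server while most of the 26 servers sit empty. Hashing the key tends to spread a large key set more evenly, but it does not by itself solve the second problem: an operation that needs data from several shards, such as a join, still has to be coordinated across servers.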
Unstructured or semi-structured data
Changing the schema of a relational database once it contains lots of data is a costly and time consuming
operation. But with agile approaches to software development, the ability to adjust the database quickly to
reflect changes in business requirements is becoming increasingly important. New database technologies
offer this flexibility and in some cases can also handle unstructured data, data whose format has not been
anticipated in advance by the database administrator.
What to use, when?
Given the wide range of NOSQL (or NoSQL) databases available, how should one choose? Some of
the technologies lend themselves to specific specialised use cases. If you are looking to store a social
graph, for example, it is worth looking at a graph database.
But for the use cases described in this paper, several of the mainstream NOSQL databases are a good fit.
Acunu Reflex incorporates Apache Cassandra, which has a particularly strong set of features combined with a
significant user base. Cassandra works well for time series data, for applications that have write-heavy
workloads, supports deployments ranging in size from a handful to hundreds of nodes, either on hardware
or in the cloud, and provides great features for high availability: it allows clusters which span multiple
geographically distributed data centers or cloud availability zones while having no single point of failure.
While Acunu Reflex is able to scale to large deployments, it works well in smaller ones too. All the nodes
in an Acunu Reflex cluster are the same; there's no master and slave, no central controlling node and no
read-only replicas. This simplicity makes starting small an easy task and helps in sizing for growth as well.
Because Acunu Reflex incorporates Cassandra, it supports CQL, a SQL-like query language that provides
familiarity to a generation of developers brought up on SQL.
Hadoop, another Apache project, has become almost synonymous with Big Data in the eyes of many.
Unlike Cassandra, it is not really a database. Instead, it is a distributed processing framework based
around technology originally developed by Google. Where there is a requirement to store and process
huge volumes of unstructured data, Hadoop can be a great fit. But Hadoop's big strength is batch analytics.
Both Cassandra and Hadoop can deliver huge cost and ROI benefits over legacy technologies but support
different use cases and workloads. Big Data commonly doesn't lend itself to a single all-purpose solution.
What's different? How to get started
In some cases there are existing applications where current database technology is struggling to keep up
with the workload. More commonly there is an existing or new dataset that matches one or more of the
use cases described here and which lends itself to a Big Data solution. In both cases, delivering business
benefit without big fixed costs for high-end hardware or recurrent costs for enterprise software licensing
is also a significant driver.
Choosing the technology to use requires some research and is naturally influenced by the skills and size of
the team, along with the appetite for risk. None of the new technologies are as mature as the relational
database, so the surrounding ecosystem of tools is sometimes limited and skills are in great demand and
often hard to find.
Acunu can help with technology selection and understanding the use cases that can best benefit from the
new generation of Big Data tools. Acunu also provides tailored training based around your needs to help
get your team up to speed. To get the best out of these new technologies means adopting design practices
which are strikingly different from those that have been widely used in the past. Picking one example,
normalization has been a guideline to the design of relational databases in order to eliminate redundant
data, but when the same principle is used with some of the new data stores, much of the performance
benefit can be lost. Understanding the trade-offs is critical to project success.
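The normalization trade-off can be illustrated with a small sketch. The data, the dictionary layout and the function names below are hypothetical. The sketch contrasts a normalised layout, where each read performs a lookup across two tables, with a denormalised layout, where the data is duplicated at write time so that a read is a single fetch.

```python
# Normalised layout: each fact is stored once, so reading an order
# requires a second lookup to find the customer's name.
customers = {1: {"name": "Ada"}}
orders = [{"order_id": 10, "customer_id": 1, "total": 25.0}]


def order_view_normalised(order):
    customer = customers[order["customer_id"]]  # the "join", done on every read
    return {
        "order_id": order["order_id"],
        "customer_name": customer["name"],
        "total": order["total"],
    }


# Denormalised layout: the customer name is copied into every order row
# at write time, so all of a customer's orders can be read in one fetch.
orders_by_customer = {}


def write_order_denormalised(order_id, customer_id, total):
    row = {
        "order_id": order_id,
        "customer_name": customers[customer_id]["name"],
        "total": total,
    }
    orders_by_customer.setdefault(customer_id, []).append(row)


write_order_denormalised(10, 1, 25.0)

print(order_view_normalised(orders[0]))
print(orders_by_customer[1])
```

Both layouts return the same answer. The denormalised one trades extra storage, and extra work if a customer's name ever changes, for reads that need no cross-table lookup, which matters when the tables would otherwise live on different servers.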
Acunu also offers a product that incorporates a fully tested and supported version of Apache
Cassandra, web-based cluster management and an optimised operating system image in a
single software package that can streamline deployment and reduce support effort. For
organisations that need the benefits that cutting edge technology can bring to their Big Data
opportunities but do not want to become experts in the numerous open source projects out
there, Acunu Reflex provides an easily accessible route to Big Data results.
