What is Big Data?

Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn't fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it.

The hot IT buzzword of 2012, big data has become viable as cost-effective
approaches have emerged to tame the volume, velocity and variability of
massive data. Within this data lie valuable patterns and information,
previously hidden because of the amount of work required to extract them.
To leading corporations, such as Walmart or Google, this power has been in
reach for some time, but at fantastic cost. Today’s commodity hardware,
cloud architectures and open source software bring big data processing into
the reach of the less well-resourced. Big data processing is eminently
feasible even for small garage startups, which can cheaply rent server time in the cloud.

The value of big data to an organization falls into two categories: analytical
use, and enabling new products. Big data analytics can reveal insights
hidden previously by data too costly to process, such as peer influence
among customers, revealed by analyzing shoppers’ transactions, social and
geographical data. Being able to process every item of data in reasonable
time removes the troublesome need for sampling and promotes an
investigative approach to data, in contrast to the somewhat static nature of
running predetermined reports.

The past decade’s successful web startups are prime examples of big data
used as an enabler of new products and services. For example, by combining
a large number of signals from a user’s actions and those of their friends,
Facebook has been able to craft a highly personalized user experience and
create a new kind of advertising business. It’s no coincidence that the lion’s
share of ideas and tools underpinning big data have emerged from Google,
Yahoo, Amazon and Facebook.

The emergence of big data into the enterprise brings with it a necessary
counterpart: agility. Successfully exploiting the value in big data requires
experimentation and exploration. Whether creating new products or looking
for ways to gain competitive advantage, the job calls for curiosity and an
entrepreneurial outlook.

Characteristics of Big Data

What does big data look like?

As a catch-all term, “big data” can be pretty nebulous, in the same way that
the term “cloud” covers diverse technologies. Input data to big data systems
could be chatter from social networks, web server logs, traffic flow sensors,
satellite imagery, broadcast audio streams, banking transactions, MP3s of
rock music, the content of web pages, scans of government documents, GPS
trails, telemetry from automobiles, financial market data, the list goes on.
Are these all really the same thing?

The 10 Vs of Big Data

Big data goes beyond volume, variety, and velocity alone. You need to know
these 10 characteristics and properties of big data to prepare for both the
challenges and advantages of big data initiatives.

The term big data started to show up sparingly in the early 1990s, and its
prevalence and importance increased exponentially as years passed.
Nowadays big data is often seen as integral to a company's data strategy.

#1: Volume

Volume is probably the best known characteristic of big data; this is no


surprise, considering more than 90 percent of all today's data was created in
the past couple of years. The current amount of data can actually be quite
staggering. Here are some examples:

-- 300 hours of video are uploaded to YouTube every minute.

-- An estimated 1.1 trillion photos were taken in 2016, and that number is
projected to rise by 9 percent in 2017. As the same photo usually has
multiple instances stored across different devices, photo or document
sharing services as well as social media services, the total number of photos
stored is also expected to grow from 3.9 trillion in 2016 to 4.7 trillion in 2017.

-- In 2016, estimated global mobile traffic amounted to 6.2 exabytes per month. That's 6.2 billion gigabytes.

#2: Velocity

Velocity refers to the speed at which data is being generated, produced, created, or refreshed.

Sure, it sounds impressive that Facebook's data warehouse stores upwards of 300 petabytes of data, but the velocity at which new data is created should be taken into account. Facebook claims 600 terabytes of incoming data per day. Google alone processes on average more than "40,000 search queries every second," which roughly translates to more than 3.5 billion searches per day.
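
(As a rough check: 40,000 queries per second × 86,400 seconds per day ≈ 3.46 billion queries per day.)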

#3: Variety

When it comes to big data, we don't have to handle only structured data but also semistructured and, for the most part, unstructured data. As you can deduce from the above examples, most big data seems to be unstructured, but besides audio, image and video files, social media updates, and other text formats, there are also log files, click data, machine and sensor data, and so on.

#4: Variability

Variability in big data's context refers to a few different things. One is the number of inconsistencies in the data. These need to be found by anomaly and outlier detection methods in order for any meaningful analytics to occur.

Big data is also variable because of the multitude of data dimensions resulting from multiple disparate data types and sources. Variability can also refer to the inconsistent speed at which big data is loaded into your database.
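
As a minimal illustration of the outlier detection mentioned above, the Python sketch below flags values that fall more than three standard deviations from the mean; the simulated readings, the injected spike and the threshold are assumptions made for this example, not part of any particular product.

    import numpy as np

    def flag_outliers(values, threshold=3.0):
        """Return a boolean mask marking values more than `threshold`
        standard deviations from the mean (a simple z-score rule)."""
        values = np.asarray(values, dtype=float)
        z_scores = np.abs((values - values.mean()) / values.std())
        return z_scores > threshold

    # Example: 1,000 simulated sensor readings with one injected anomaly.
    rng = np.random.default_rng(0)
    readings = rng.normal(loc=100.0, scale=5.0, size=1000)
    readings[500] = 400.0                          # injected spike
    print(np.where(flag_outliers(readings))[0])    # index 500 is flagged

In practice, more robust methods (median-based scores, isolation forests and so on) are used, but the idea is the same: inconsistencies must be identified before meaningful analytics can occur.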

#5: Veracity

This is one of the unfortunate characteristics of big data. As any or all of the
above properties increase, the veracity (confidence or trust in the data)
drops. This is similar to, but not the same as, validity or volatility (see below).
Veracity refers more to the provenance or reliability of the data source, its
context, and how meaningful it is to the analysis based on it.

For example, consider a data set of statistics on what people purchase at restaurants and these items' prices over the past five years. You might ask: Who created the source? What methodology did they follow in collecting the data? Were only certain cuisines or certain types of restaurants included? Did the data creators summarize the information? Has the information been edited or modified by anyone else?

Answers to these questions are necessary to determine the veracity of this information. Knowledge of the data's veracity in turn helps us better understand the risks associated with analysis and business decisions based on this particular data set.

#6: Validity

Similar to veracity, validity refers to how accurate and correct the data is for
its intended use. According to Forbes, an estimated 60 percent of a data
scientist's time is spent cleansing their data before being able to do any
analysis. The benefit from big data analytics is only as good as its underlying
data, so you need to adopt good data governance practices to ensure
consistent data quality, common definitions, and metadata.
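
As a small, hypothetical illustration of that cleansing work, the pandas sketch below removes rows with missing keys, drops duplicates, coerces a numeric field and normalizes a categorical one; the column names and rules are invented for the example.

    import pandas as pd

    # Hypothetical raw extract with common quality problems.
    raw = pd.DataFrame({
        "customer_id": [101, 101, 102, None, 104],
        "amount":      ["25.50", "25.50", "abc", "40.00", "-1"],
        "country":     ["US", "US", "us ", "DE", "DE"],
    })

    clean = (
        raw
        .dropna(subset=["customer_id"])    # drop rows missing the key
        .drop_duplicates()                 # remove exact duplicates
        .assign(
            amount=lambda d: pd.to_numeric(d["amount"], errors="coerce"),
            country=lambda d: d["country"].str.strip().str.upper(),
        )
        .query("amount > 0")               # enforce a simple validity rule
    )
    print(clean)

Steps like these, applied consistently under a governance policy, are what keep definitions and quality common across data sets.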

#7: Vulnerability

Big data brings new security concerns. After all, a data breach involving big data is a big breach, and unfortunately there have been many. For example, as reported by CRN, in May 2016 "a hacker called Peace posted data on the dark web to sell, which allegedly included information on 167 million LinkedIn accounts and ... 360 million emails and passwords for MySpace users."

Information on many others can be found at Information is Beautiful.

#8: Volatility

How old does your data need to be before it is considered irrelevant, historic,
or not useful any longer? How long does data need to be kept for?

Before big data, organizations tended to store data indefinitely -- a few terabytes of data might not create high storage expenses; it could even be kept in the live database without causing performance issues. In a classical data setting, there might not even be data archival policies in place.

Due to the velocity and volume of big data, however, its volatility needs to
be carefully considered. You now need to establish rules for data currency
and availability as well as ensure rapid retrieval of information when required.
Make sure these are clearly tied to your business needs and processes --
with big data the costs and complexity of a storage and retrieval process are
magnified.
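
A retention rule of this kind can be as simple as filtering on a timestamp. The sketch below keeps only events inside an assumed 365-day window; the column names and the window itself are hypothetical and would come from your business needs.

    import pandas as pd

    RETENTION_DAYS = 365   # assumed business rule: keep one year of events

    def apply_retention(events: pd.DataFrame, now: pd.Timestamp) -> pd.DataFrame:
        """Keep only events newer than the retention window; older records
        would be archived or deleted according to the retention policy."""
        cutoff = now - pd.Timedelta(days=RETENTION_DAYS)
        return events[events["event_time"] >= cutoff]

    events = pd.DataFrame({
        "event_time": pd.to_datetime(["2023-01-05", "2024-06-01", "2024-12-20"]),
        "payload":    ["a", "b", "c"],
    })
    print(apply_retention(events, now=pd.Timestamp("2024-12-31")))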

#9: Visualization

Another characteristic of big data is how challenging it is to visualize.

Current big data visualization tools face technical challenges due to limitations of in-memory technology and poor scalability, functionality, and response time. You can't rely on traditional graphs when trying to plot a billion data points, so you need different ways of representing data, such as data clustering or using tree maps, sunbursts, parallel coordinates, circular network diagrams, or cone trees.

Combine this with the multitude of variables resulting from big data's variety
and velocity and the complex relationships between them, and you can see
that developing a meaningful visualization is not easy.
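
As one small example of moving beyond plotting raw points, the sketch below summarizes a large synthetic point cloud as hexagonal density bins with matplotlib; the data is generated for the example and merely stands in for a much larger real set.

    import numpy as np
    import matplotlib.pyplot as plt

    # Synthetic stand-in for a very large point cloud.
    rng = np.random.default_rng(42)
    x = rng.normal(size=1_000_000)
    y = 0.5 * x + rng.normal(scale=0.8, size=1_000_000)

    # Hexagonal binning shows density instead of drawing every point.
    fig, ax = plt.subplots(figsize=(6, 5))
    hb = ax.hexbin(x, y, gridsize=80, bins="log")
    fig.colorbar(hb, ax=ax, label="log10(count)")
    ax.set_title("Density view of one million points")
    plt.show()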

#10: Value

Last, but arguably the most important of all, is value. The other
characteristics of big data are meaningless if you don't derive business value
from the data.

Substantial value can be found in big data, including understanding your customers better, targeting them accordingly, optimizing processes, and improving machine or business performance. You need to understand the potential, along with the more challenging characteristics, before embarking on a big data strategy.

Business Intelligence vs Data Analytics

Analytics is an immense field with many subfields, so it can be difficult to sort out all the buzzwords around it.

We know that analytics refers to the skills, technologies, applications and practices for continuous iterative exploration and investigation of data to gain insight and drive business planning. Analytics consists of two major areas: Business Intelligence and Advanced Analytics.

What is often overlooked is how the two differ based on the questions they
answer:
Business Intelligence — traditionally focuses on using a consistent set of
metrics to measure past performance and guide business planning. Business
Intelligence consists of querying, reporting, OLAP (online analytical
processing), and can answer questions including “what happened,” “how
many,” and “how often.”

Advanced Analytics — goes beyond Business Intelligence by using sophisticated modeling techniques to predict future events or discover patterns which cannot be detected otherwise.

Advanced Analytics can answer questions including "why is this happening," "what if these trends continue," "what will happen next" (prediction), and "what is the best that can happen" (optimization).
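
The contrast can be made concrete with a small, hypothetical example: a BI-style report answers "how many" with an aggregation, while an advanced-analytics step answers "what will happen next" with a deliberately naive trend forecast. The data and the linear model below are invented purely for illustration.

    import numpy as np
    import pandas as pd

    # Hypothetical monthly sales data.
    sales = pd.DataFrame({
        "month":   pd.period_range("2023-01", periods=12, freq="M"),
        "region":  ["North", "South"] * 6,
        "revenue": [100, 90, 105, 92, 110, 95, 115, 97, 120, 99, 124, 101],
    })

    # Business Intelligence: "what happened, how many, how often".
    print(sales.groupby("region")["revenue"].agg(["sum", "mean", "count"]))

    # Advanced Analytics: "what will happen next" via a naive linear trend.
    north = sales[sales["region"] == "North"].reset_index(drop=True)
    t = np.arange(len(north))
    slope, intercept = np.polyfit(t, north["revenue"], deg=1)
    print(f"Forecast for next month (North): {slope * len(north) + intercept:.1f}")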

The need for data analytics

Big data is now a reality: The volume, variety and velocity of data coming
into your organization continue to reach unprecedented levels. This
phenomenal growth means that not only must you understand big data in
order to decipher the information that truly counts, but you also must
understand the possibilities of big data analytics.

What is big data analytics?

Big data analytics is the process of examining big data to uncover hidden
patterns, unknown correlations and other useful information that can be
used to make better decisions. With big data analytics, data scientists and
others can analyze huge volumes of data that conventional analytics and
business intelligence solutions can't touch. Consider that your organization
could accumulate (if it hasn't already) billions of rows of data with hundreds
of millions of data combinations in multiple data stores and abundant
formats. High-performance analytics is necessary to process that much data
in order to figure out what's important and what isn't. Enter big data
analytics.

Why collect and store terabytes of data if you can't analyze it in full context?
Or if you have to wait hours or days to get results? With new advances in
computing technology, there's no need to avoid tackling even the most
challenging business problems. For simpler and faster processing of only
relevant data, you can use high-performance analytics. Using high-
performance data mining, predictive analytics, text mining, forecasting and
optimization on big data enables you to continuously drive innovation and
make the best possible decisions. In addition, organizations are discovering
that the unique properties of machine learning are ideally suited to
addressing their fast-paced big data needs in new ways.

Why is big data analytics important?

Customers have evolved their analytics methods from a reactive view into a
proactive approach using predictive and prescriptive analytics. Both reactive
and proactive approaches are used by organizations.

Reactive vs. proactive approaches

There are four approaches to analytics, and each falls within the reactive or
proactive category:

Reactive – business intelligence. In the reactive category, business intelligence (BI) provides standard business reports, ad hoc reports, OLAP and even alerts and notifications based on analytics. This ad hoc analysis looks at the static past, which has its purpose in a limited number of situations.

Reactive – big data BI. When reporting pulls from huge data sets, we can say this is performing big data BI. But decisions based on these two methods are still reactionary.

Proactive – big analytics. Making forward-looking, proactive decisions requires proactive big analytics like optimization, predictive modeling, text mining, forecasting and statistical analysis. They allow you to identify trends, spot weaknesses or determine conditions for making decisions about the future. But although it's proactive, big analytics cannot be performed on big data because traditional storage environments and processing times cannot keep up.

Proactive – big data analytics. By using big data analytics you can extract only the relevant information from terabytes, petabytes and exabytes, and analyze it to transform your business decisions for the future. Becoming proactive with big data analytics isn't a one-time endeavor; it is more of a culture change – a new way of gaining ground by freeing your analysts and decision makers to meet the future with sound knowledge and insight.

Data Analytics Lifecycle

Data Analytics and Big Data are gaining importance and attention. They are expected to create customer value and competitive advantage for the business. Given this focus on big data, it is worth examining its impact on the data analytics life cycle. Typical analytics projects have a characteristic distribution of effort and time across the life-cycle stages. Of course, various factors influence the time taken across these stages, such as the complexity of the business problem, the messiness of the data (quality, variety and volume), the experience of the data analyst or scientist, and the maturity of analytics and analytical tools/systems in the organization. But data manipulation is one of the biggest drains on analyst time.

Understanding Business Objective

Big Data or any other technology plays little role in understanding the business objective and converting a business problem into an analytics problem. But the flexibility and versatility of the tools and technology guide what can and cannot be done. For example, a brick-and-mortar retailer may have to launch a survey to understand customer sensitivity toward prices, while an eCommerce retailer may carry out an analysis using customers' web visits – which other eCommerce websites customers visit before and after visiting the retailer's own site.

Data Manipulation/Preparation

Data manipulation requires significant effort from an analyst, and big data is expected to impact this stage the most. Big data helps an analyst get the result of a query more quickly (the velocity of big data). It also facilitates accessing and using unstructured data (the variety of big data), which was a challenge with traditional technology. Handling larger data volumes (the volume of big data) is expected to help by removing data-volume processing constraints or improving speed. Statisticians devised sampling techniques precisely to get around the constraint of processing high volumes of data; since big data platforms can process high volumes directly, sampling may not be required from that perspective, but it is still relevant and often required.
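
One classic sampling technique of this kind is reservoir sampling, which draws a fixed-size random sample from a stream too large to hold in memory. The sketch below is a minimal version; the simulated stream simply stands in for a real high-volume feed.

    import random

    def reservoir_sample(stream, k, seed=0):
        """Return k items drawn uniformly at random from an iterable of
        unknown length, holding only k items in memory at any time."""
        rng = random.Random(seed)
        reservoir = []
        for i, item in enumerate(stream):
            if i < k:
                reservoir.append(item)
            else:
                j = rng.randint(0, i)      # replace with decreasing probability
                if j < k:
                    reservoir[j] = item
        return reservoir

    # Simulated stream of ten million records, sampled down to five.
    print(reservoir_sample(range(10_000_000), k=5))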

Data Analysis and Modeling

Most machine learning and statistical techniques are available on traditional technology platforms, so the value added by big data here could be limited. One argument in favour of machine learning on big data is that the more data fed to a machine learning algorithm, the more it can learn and the higher the quality of its insights. Many practitioners, however, do not believe that volume alone leads to better insights.

Certainly, having different dimensions of data, such as customer web clicks and call data, will lead to better insights and improved accuracy of the predictive models.
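
A small scikit-learn sketch of that point is shown below: two disparate (synthetic) feature sources, web clicks and call data, are combined into one predictive model. Everything here, from the feature names to the target, is invented to illustrate the idea rather than to reproduce any real system.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(1)
    n = 2000

    # Two disparate sources describing the same customers.
    web_clicks = rng.poisson(lam=3.0, size=(n, 2))   # e.g. visits, page views
    call_data  = rng.normal(size=(n, 2))             # e.g. call count, duration

    # Synthetic target influenced by both sources.
    logits = 0.8 * web_clicks[:, 0] + 1.2 * call_data[:, 1] - 2.5
    y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logits))).astype(int)

    X = np.hstack([web_clicks, call_data])           # combine the dimensions
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))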

Action on Insights or Deployment

Big Data has created a new wave in industry, and there is a lot of pressure on organizations to think about big data. The technology is still maturing, but organizations are making investments to tap big data for competitive advantage. A few organizations, such as Facebook and Amazon, have already adopted it and are using big data in production. The real differentiator between successful and unsuccessful organizations will be deriving the right insights and acting on them.

Big Data technology is expected to enable quicker deployment of insights or predictive models; more importantly, the speed from analytics to action will approach real time.

Offer Recommendation on Web and Big Data

Generic offers are prevalent on the web, without much success. A personalized and relevant offer is what customers expect, and organizations are moving in this direction. One way to identify customer needs is to combine web-click behavior and transactional behavior in real time and provide a personalized offer to the customer. This may become a reality using big data and big data analytics.
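
A real recommendation engine would be far more sophisticated, but the rule-based Python sketch below conveys the idea of combining click behaviour with transaction history; the categories, rules and offer texts are entirely hypothetical.

    def recommend_offer(recent_clicks, past_purchases):
        """Pick a personalized offer by combining real-time click behaviour
        with historical transaction behaviour (hypothetical rules)."""
        browsed = {c["category"] for c in recent_clicks}
        bought = {p["category"] for p in past_purchases}

        # Browsing a category the customer never bought from suggests new intent.
        new_interest = browsed - bought
        if new_interest:
            return f"10% off a first purchase in {sorted(new_interest)[0]}"
        # Otherwise reward loyalty in the most frequently purchased category.
        if bought:
            top = max(bought, key=lambda c: sum(p["category"] == c for p in past_purchases))
            return f"Loyalty coupon for {top}"
        return "Generic welcome offer"

    clicks = [{"category": "cameras"}, {"category": "laptops"}]
    purchases = [{"category": "laptops"}, {"category": "laptops"}]
    print(recommend_offer(clicks, purchases))   # -> 10% off a first purchase in cameras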

Evaluation

Due to Big Data and Big Data Analytics, data analytics cycle time and cost are expected to come down. The cost reduction and the shrinkage in cycle time will have a favourable impact on analytics adoption. Organizations will be more open to moving toward a culture of experimentation and learning. Of course, this is not going to happen automatically.

Big data is going to impact each stage of the data analytics life cycle, but the main value add (until big data analytics tools mature) will be around data manipulation.

Industry applications of big data analytics

Big data analytics is the process of examining large data sets containing a
variety of data types -- i.e., big data -- to uncover hidden patterns, unknown
correlations, market trends, customer preferences and other useful business
information. The analytical findings can lead to more effective marketing,
new revenue opportunities, better customer service, improved operational
efficiency, competitive advantages over rival organizations and other
business benefits.

The primary goal of big data analytics is to help companies make more
informed business decisions by enabling data scientists, predictive modelers
and other analytics professionals to analyze large volumes of transaction
data, as well as other forms of data that may be untapped by conventional
business intelligence (BI) programs. That could include Web server logs and
Internet clickstream data, social media content and social network activity
reports, text from customer emails and survey responses, mobile-phone call
detail records and machine data captured by sensors connected to the
Internet of Things. Some people exclusively associate big data with semi-structured and unstructured data of that sort, while others consider structured data to be a valid component of big data analytics applications as well.

Big data can be analyzed with the software tools commonly used as part of
advanced analytics disciplines such as predictive analytics, data mining, text
analytics and statistical analysis. Mainstream BI software and data
visualization tools can also play a role in the analysis process. But the semi-
structured and unstructured data may not fit well in traditional data
warehouses based on relational databases. Furthermore, data warehouses
may not be able to handle the processing demands posed by sets of big data
that need to be updated frequently or even continually -- for example, real-
time data on the performance of mobile applications or of oil and gas
pipelines. As a result, many organizations looking to collect, process and
analyze big data have turned to a newer class of technologies that includes
Hadoop and related tools such as YARN, MapReduce, Spark, Hive and Pig as
well as NoSQL databases. Those technologies form the core of an open
source software framework that supports the processing of large and diverse
data sets across clustered systems.

In some cases, Hadoop clusters and NoSQL systems are being used as
landing pads and staging areas for data before it gets loaded into a data
warehouse for analysis, often in a summarized form that is more conducive
to relational structures. Increasingly though, big data vendors are pushing
the concept of a Hadoop data lake that serves as the central repository for
an organization's incoming streams of raw data. In such architectures, subsets of the data can be filtered for analysis in data warehouses and analytical databases, or the data can be analyzed directly in Hadoop using batch query tools, stream processing software and SQL-on-Hadoop technologies that run interactive, ad hoc queries written in SQL.
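
As a minimal illustration of that pattern, the PySpark sketch below filters and summarizes a subset of raw data-lake files so it can be loaded into a warehouse; the paths, fields and filter are assumptions made for the example.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("data-lake-subset").getOrCreate()

    # Hypothetical raw clickstream files landed in the data lake.
    events = spark.read.json("hdfs:///datalake/raw/clickstream/2024/*/*.json")

    # Filter and summarize a subset suitable for the warehouse.
    daily_purchases = (
        events
        .filter(F.col("event_type") == "purchase")
        .groupBy("event_date", "product_id")
        .agg(F.count("*").alias("purchases"))
    )

    daily_purchases.write.mode("overwrite").parquet(
        "hdfs:///datalake/curated/daily_purchases")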

Potential pitfalls that can trip up organizations on big data analytics initiatives include a lack of internal analytics skills and the high cost of hiring experienced analytics professionals. The amount of information that's typically involved, and its variety, can also cause data management headaches, including data quality and consistency issues. In addition, integrating Hadoop systems and data warehouses can be a challenge, although various vendors now offer software connectors between Hadoop and relational databases, as well as other data integration tools with big data capabilities.

Industry predictions

Cyber Analytics

It seems every few months a large organization announces that it has been
compromised. Traditionally, data and network security has focused on
firewalls, intrusion detection and intrusion prevention. Security specialists
are beginning to realize that putting up a barrier doesn’t always work. As fast
as experts put up a firewall, savvy hackers find a way around it—or through
it.

It turns out the key to stopping hackers in their tracks is data – lots and lots
of data. Public and private-sector organizations all over the world are starting
to use something called Cyber Analytics to defend their systems. They are
taking internal and external information—from firewall data and
behavioural profiles to cyber threat intelligence and fraud alerts—
analyzing it in real time and identifying anomalous patterns of activity that in
the past would have gone undetected. What used to be a liability is now an
asset, as data becomes the key to recognizing threats before they
materialize and identifying clandestine activity before it becomes a breach.
Now organizations can spot potential hacks and prevent them from occurring.
In 2015 more organizations will turn to analytics to heighten their cyber
security capabilities.
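
One very simplified flavour of such analytics is rate-based anomaly detection on security events. The Python sketch below flags a burst of failed logins from a single source within a sliding window; the window size, threshold and event format are invented for the example.

    from collections import defaultdict, deque
    from datetime import datetime, timedelta

    WINDOW = timedelta(minutes=5)
    THRESHOLD = 20        # assumed rule: more than 20 failures in 5 minutes

    recent_failures = defaultdict(deque)   # source IP -> failed-login timestamps

    def process_event(source_ip, timestamp, success):
        """Track failed-login rates per source and flag anomalous bursts."""
        if success:
            return None
        window = recent_failures[source_ip]
        window.append(timestamp)
        # Drop events that have fallen outside the sliding window.
        while window and timestamp - window[0] > WINDOW:
            window.popleft()
        if len(window) > THRESHOLD:
            return f"ALERT: {source_ip} had {len(window)} failed logins in 5 minutes"
        return None

    # Simulated burst of failed logins from one source address.
    t0 = datetime(2024, 1, 1, 12, 0, 0)
    for i in range(25):
        alert = process_event("203.0.113.7", t0 + timedelta(seconds=10 * i), False)
        if alert:
            print(alert)
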
Cloud Analytics

Cloud computing fundamentally changes the way IT services are delivered and consumed, offering a range of benefits like business flexibility, operational efficiency and IT economies of scale. In today's economic climate of increased customer expectations, accelerated business pace and fierce competition, organizations need to make better use of their ever-increasing volumes of data for fact-based decision making. More and more enterprises are taking their data use to new levels and applying business analytics to support rigorous, constant business experimentation that drives better decisions – whether it involves testing new products, developing better business models or transforming the customer experience. The challenge to provide better information faster is changing the traditional approach to information technology, and many organizations are increasingly turning to cloud solutions to help drive their business needs.

Cloud computing's ability to provide elastic scalability, faster service delivery, greater IT efficiencies and a subscription-based accounting model has broken down many of the physical and financial barriers to aligning IT with evolving business goals. With its promise to deliver better business models and services quickly and cheaply, cloud computing has become a major driver of business innovation across all industries, and this trend will continue in 2015.

Cloud computing techniques are enabling organizations to deploy analytics more rapidly throughout their organizations while lowering the total cost of ownership through reduced hardware costs. SAS Cloud Analytics has grown 3 per cent year over year (YOY); last year alone the YOY increase was 20 per cent. The cloud phenomenon is also addressing the analytics talent gap, as companies that don't have sufficient analytical talent in-house are looking for hosted solutions to provide analytical insight.

The Data Skills Shortage is Real!

As enterprises increasingly recognize the value of the vast volumes of data they collect, the demand for people who can unlock that value is rising sharply. The McKinsey Global Institute predicts that by 2018, there will be a shortage of as many as 190,000 data scientists in the US alone. SAS has joined Canada's Big Data Consortium to lend industry insight to an upcoming study from Ryerson University that will be launched in early 2015. The study will try to understand the analytics skills shortage in Canada and what can be done to address it.

As nearly every industry starts to embrace a new era of data collection, management and analysis, forward-thinking schools all over Canada have clearly responded by offering innovative courses to ensure Canadian graduates have the analytics skills that businesses need. 2015 will be a promising year for these analytically savvy graduates as organizations race to hire the best of their class.

Real-time Analytics

In this always-on culture, consumers are more demanding in how companies interact with them. Today's consumer wants the right message delivered to the right channel, to the device of their choice, at the time of their choosing. As mobile devices continue to proliferate, marketers will need to understand the needs of the mobile user and be ready to deliver real-time, relevant communications in each and every customer interaction.

While real-time marketing is not a new term, the key for 2015 is to use real-time analytics to serve an extremely complex consumer marketplace. With offers, advertisements and marketing messages being sent from every possible direction, savvy marketers can no longer simply bombard consumers' senses. In an ideal world, consumers receive messages that are tailored, relevant and timely – offered to them based on their past actions and behaviors or, better yet, on how they might act or behave in the future. These personalized offers are consistent regardless of the channel being used. They are enabled by technology such as beacon hardware sensors, which are designed to wirelessly communicate and transmit data to mobile devices within a specific proximity, supporting in-store promotions, geolocation-targeted messaging and shopper analytics. In other words, they deliver the offer at the moment of truth – the moment that matters most to a customer's purchasing decision.

This is the customer experience that has come to be expected, and only analytics can drive the type of real-time insights that underpin personalized marketing. In 2015 it will be key for decision-makers to have access to up-to-date, accurate and relevant data, available to them in real time so they can make the right decisions.

There's no doubt that data is helping to improve everything from marketing insight to healthcare delivery to urban infrastructure.

Horizontal vs Vertical Scaling

Horizontal scaling means that you scale by adding more machines to your pool of resources, whereas vertical scaling means that you scale by adding more power (CPU, RAM) to your existing machine.

In a database world, horizontal scaling is often based on partitioning of the data, i.e., each node contains only part of the data; in vertical scaling, the data resides on a single node and scaling is done through multi-core, i.e., spreading the load between the CPU and RAM resources of that machine. With horizontal scaling it is often easier to scale dynamically by adding more machines to the existing pool; vertical scaling is often limited to the capacity of a single machine, and scaling beyond that capacity often involves downtime and comes with an upper limit.
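
The partitioning idea behind horizontal scaling can be illustrated with a toy hash-routing scheme: each record is sent to one node based on a hash of its key, so every node holds only part of the data. The node names and keys below are arbitrary, and real systems such as Cassandra use consistent hashing rather than this simple modulo approach.

    import hashlib

    NODES = ["node-0", "node-1", "node-2", "node-3"]   # the horizontal pool

    def node_for(key):
        """Route a record to a node by hashing its key (simple modulo scheme)."""
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return NODES[int(digest, 16) % len(NODES)]

    for customer_id in ["cust-1001", "cust-1002", "cust-1003", "cust-1004"]:
        print(customer_id, "->", node_for(customer_id))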

Good examples of horizontal scaling are Cassandra and MongoDB; a good example of vertical scaling is MySQL. Amazon RDS (the cloud version of MySQL) provides an easy way to scale vertically by switching from smaller to bigger machines, a process that often involves downtime.

In-memory data grids such as GigaSpaces XAP, Coherence, etc. are often optimized for both horizontal and vertical scaling, simply because they are not bound to disk: they scale horizontally through partitioning and vertically through multi-core support. Horizontal scalability is the ability to increase capacity by connecting multiple hardware or software entities so that they work as a single logical unit.

When servers are clustered, the original server is being scaled out
horizontally. If a cluster requires more resources to improve performance
and provide high availability (HA), an administrator can scale out by adding
more servers to the cluster.

An important advantage of horizontal scalability is that it can provide administrators with the ability to increase capacity on the fly. Another advantage is that, in theory, horizontal scalability is only limited by how many entities can be connected successfully. The distributed storage system Cassandra, for example, runs on top of hundreds of commodity nodes spread across different data centers. Because the commodity hardware is scaled out horizontally, Cassandra is fault tolerant and does not have a single point of failure (SPoF).

Vertical scalability, on the other hand, increases capacity by adding more resources, such as more memory or an additional CPU, to a machine. Scaling vertically, which is also called scaling up, usually requires downtime while new resources are being added and has limits that are defined by hardware. When Amazon RDS customers need to scale vertically, for example, they can switch from a smaller to a bigger machine, but Amazon's largest RDS instance has only 68 GB of memory.

Scaling horizontally has both advantages and disadvantages. For example, adding inexpensive commodity computers to a cluster might seem to be a cost-effective solution at first glance, but it's important for the administrator to know whether the licensing costs for those additional servers, the additional operations cost of powering and cooling, and the large footprint they will occupy in the data center truly make scaling horizontally a better choice than scaling vertically.
