The hot IT buzzword of 2012, big data has become viable as cost-effective
approaches have emerged to tame the volume, velocity and variability of
massive data. Within this data lie valuable patterns and information,
previously hidden because of the amount of work required to extract them.
To leading corporations, such as Walmart or Google, this power has been in
reach for some time, but at fantastic cost. Today’s commodity hardware,
cloud architectures and open source software bring big data processing into
the reach of the less well-resourced. Big data processing is eminently
feasible for even the small garage startups, who can cheaply rent server
time in the cloud.
The value of big data to an organization falls into two categories: analytical
use, and enabling new products. Big data analytics can reveal insights
hidden previously by data too costly to process, such as peer influence
among customers, revealed by analyzing shoppers’ transactions, social and
geographical data. Being able to process every item of data in reasonable
time removes the troublesome need for sampling and promotes an
investigative approach to data, in contrast to the somewhat static nature of
running predetermined reports.
The past decade’s successful web startups are prime examples of big data
used as an enabler of new products and services. For example, by combining
a large number of signals from a user’s actions and those of their friends,
Facebook has been able to craft a highly personalized user experience and
create a new kind of advertising business. It’s no coincidence that the lion’s
share of ideas and tools underpinning big data have emerged from Google,
Yahoo, Amazon and Facebook.
The emergence of big data into the enterprise brings with it a necessary
counterpart: agility. Successfully exploiting the value in big data requires
experimentation and exploration. Whether creating new products or looking
for ways to gain competitive advantage, the job calls for curiosity and an
entrepreneurial outlook.
Characteristics of Big Data
As a catch-all term, “big data” can be pretty nebulous, in the same way that
the term “cloud” covers diverse technologies. Input data to big data systems
could be chatter from social networks, web server logs, traffic flow sensors,
satellite imagery, broadcast audio streams, banking transactions, MP3s of
rock music, the content of web pages, scans of government documents, GPS
trails, telemetry from automobiles, financial market data, the list goes on.
Are these all really the same thing?
Big data goes beyond volume, variety, and velocity alone. You need to know
these 10 characteristics and properties of big data to prepare for both the
challenges and advantages of big data initiatives.
The term big data started to show up sparingly in the early 1990s, and its
prevalence and importance increased exponentially as years passed.
Nowadays big data is often seen as integral to a company's data strategy.
#1: Volume
-- An estimated 1.1 trillion photos were taken in 2016, and that number is
projected to rise by 9 percent in 2017. As the same photo usually has
multiple instances stored across different devices, photo or document
sharing services as well as social media services, the total number of photos
stored is also expected to grow from 3.9 trillion in 2016 to 4.7 trillion in 2017.
-- In 2016, estimated global mobile traffic amounted to 6.2 exabytes per
month. That's 6.2 billion gigabytes.
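As a quick sanity check on figures like these, the exabyte-to-gigabyte conversion can be sketched in a few lines (using decimal SI prefixes, where 1 exabyte = 10^9 gigabytes; the function name is ours, not from any library):

```python
# 1 exabyte = 10^9 gigabytes when using decimal (SI) prefixes.
GB_PER_EXABYTE = 10**9

def exabytes_to_gigabytes(eb: float) -> float:
    """Convert exabytes to gigabytes using decimal (SI) prefixes."""
    return eb * GB_PER_EXABYTE

# The monthly mobile traffic figure quoted above.
monthly_traffic_gb = exabytes_to_gigabytes(6.2)
print(f"{monthly_traffic_gb:,.0f} GB")  # 6.2 billion gigabytes
```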
#2: Velocity
Velocity is the speed at which data is generated and comes into your
organization. Much of it -- the mobile traffic above, sensor readings,
social media updates -- accumulates continuously rather than arriving in
periodic batches.
#3: Variety
When it comes to big data, we have to handle not only structured data but
also semi-structured and unstructured data. As you can deduce from the
above examples, most big data is unstructured, but besides audio, image,
and video files, social media updates, and other text formats, there are
also log files, click data, machine and sensor data, and more.
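To make the structured/semi-structured distinction concrete, here is a minimal sketch that normalizes two of the formats named above into comparable records. The sample log line, the JSON event, and the regex field names are all illustrative inventions, not from any real system:

```python
import json
import re

# Hypothetical inputs: a semi-structured web server log line and a
# structured JSON social-media event.
log_line = '203.0.113.9 - - [10/Oct/2016:13:55:36] "GET /cart HTTP/1.1" 200'
json_event = '{"user": "alice", "action": "share", "item": "/cart"}'

# A simple regex pulls named fields out of the log line; the field names
# (ip, ts, req, status) are our own choice for this sketch.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<req>[^"]+)" (?P<status>\d+)'
)

def parse_log(line: str) -> dict:
    """Parse one log line into a dict, or return {} if it doesn't match."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else {}

log_record = parse_log(log_line)       # semi-structured -> dict via regex
event_record = json.loads(json_event)  # structured -> dict via json parser
print(log_record["status"], event_record["action"])
```

The point of the sketch: structured data parses directly, while semi-structured data needs a format-specific extraction step before it can join the same analysis.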
#4: Variability
Variability in big data's context refers to a few different things. One is the
number of inconsistencies in the data. These need to be found by anomaly
and outlier detection methods in order for any meaningful analytics to occur.
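A minimal illustration of the outlier-detection step mentioned above is a z-score filter: flag any value more than a few standard deviations from the mean. The sensor readings and the threshold are made-up assumptions; real pipelines use more robust methods:

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean.
    A deliberately simple detector for illustration only."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical: nothing can be an outlier
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Made-up sensor data with one obvious inconsistency.
readings = [10.1, 10.3, 9.9, 10.0, 10.2, 98.7, 10.1]
print(zscore_outliers(readings, threshold=2.0))  # [98.7]
```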
#5: Veracity
This is one of the unfortunate characteristics of big data. As any or all of the
above properties increase, the veracity (confidence or trust in the data)
drops. This is similar to, but not the same as, validity or volatility (see below).
Veracity refers more to the provenance or reliability of the data source, its
context, and how meaningful it is to the analysis based on it.
#6: Validity
Similar to veracity, validity refers to how accurate and correct the data is for
its intended use. According to Forbes, an estimated 60 percent of a data
scientist's time is spent cleansing their data before being able to do any
analysis. The benefit from big data analytics is only as good as its underlying
data, so you need to adopt good data governance practices to ensure
consistent data quality, common definitions, and metadata.
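The cleansing work described above often comes down to mundane steps: normalizing values, dropping invalid rows, removing duplicates. A toy sketch of such a pass, with illustrative field names of our own choosing:

```python
def clean_records(records):
    """A minimal cleansing pass: normalize casing/whitespace, drop rows
    with missing required fields, and de-duplicate on email.
    Field names here are illustrative only."""
    seen = set()
    cleaned = []
    for rec in records:
        email = (rec.get("email") or "").strip().lower()
        name = (rec.get("name") or "").strip()
        if not email or not name:
            continue  # invalid: a required field is missing
        if email in seen:
            continue  # duplicate after normalization
        seen.add(email)
        cleaned.append({"email": email, "name": name})
    return cleaned

raw = [
    {"email": "A@Example.com ", "name": "Ann"},
    {"email": "a@example.com", "name": "Ann"},  # duplicate once normalized
    {"email": "", "name": "Bob"},               # invalid: no email
]
print(clean_records(raw))
```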
#7: Vulnerability
Big data brings new security concerns. After all, a data breach with big data
is a big breach. Unfortunately there have been many big data breaches.
For example, as reported by CRN: in May 2016 "a hacker called Peace
posted data on the dark web to sell, which allegedly included information on
167 million LinkedIn accounts and ... 360 million emails and passwords for
MySpace users."
#8: Volatility
How old does your data need to be before it is considered irrelevant, historic,
or not useful any longer? How long does data need to be kept for?
Due to the velocity and volume of big data, however, its volatility needs to
be carefully considered. You now need to establish rules for data currency
and availability as well as ensure rapid retrieval of information when required.
Make sure these are clearly tied to your business needs and processes --
with big data the costs and complexity of a storage and retrieval process are
magnified.
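The retention rules described above can be sketched as a simple age-based split. The 90-day cutoff and the record shape are assumed values for illustration, not a recommended policy:

```python
from datetime import datetime, timedelta

def apply_retention(records, now, max_age_days=90):
    """Split records into (current, expired) by age.
    The 90-day default is an assumed policy for this sketch."""
    cutoff = now - timedelta(days=max_age_days)
    current = [r for r in records if r["ts"] >= cutoff]
    expired = [r for r in records if r["ts"] < cutoff]
    return current, expired

now = datetime(2017, 1, 1)
records = [
    {"id": 1, "ts": datetime(2016, 12, 20)},  # 12 days old -> kept
    {"id": 2, "ts": datetime(2016, 6, 1)},    # ~7 months old -> expired
]
current, expired = apply_retention(records, now)
print([r["id"] for r in current], [r["id"] for r in expired])
```

Tying the cutoff parameter to an explicit business rule, rather than hard-coding it, is what keeps such a policy auditable.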
#9: Visualization
Visualization is how analytical results reach decision-makers. Combine
the multitude of variables resulting from big data's variety and velocity
with the complex relationships between them, and you can see that
developing a meaningful visualization is not easy.
#10: Value
Last, but arguably the most important of all, is value. The other
characteristics of big data are meaningless if you don't derive business value
from the data.
What is often overlooked is how business intelligence and big data
analytics differ based on the questions they answer:
Business Intelligence — traditionally focuses on using a consistent set of
metrics to measure past performance and guide business planning. Business
Intelligence consists of querying, reporting, OLAP (online analytical
processing), and can answer questions including "what happened," "how
many," and "how often."
Big Data Analytics — examines data to uncover hidden patterns and unknown
correlations, and so can also address exploratory, forward-looking
questions such as "why did it happen" and "what is likely to happen."
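The BI-style questions "how many" and "how often" reduce to counting and grouping over historical records, which a few lines of aggregation can illustrate. The transaction log below is invented sample data:

```python
from collections import Counter

# Hypothetical transaction log; field names are illustrative only.
transactions = [
    {"store": "north", "product": "widget"},
    {"store": "north", "product": "gadget"},
    {"store": "south", "product": "widget"},
    {"store": "north", "product": "widget"},
]

how_many = len(transactions)                           # "how many" in total
how_often = Counter(t["store"] for t in transactions)  # "how often" per store
print(how_many, how_often.most_common(1))
```

The forward-looking questions of big data analytics ("what is likely to happen") require models rather than counts, which is where the predictive techniques discussed later come in.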
Big data is now a reality: The volume, variety and velocity of data coming
into your organization continue to reach unprecedented levels. This
phenomenal growth means that not only must you understand big data in
order to decipher the information that truly counts, but you also must
understand the possibilities of big data analytics.
Big data analytics is the process of examining big data to uncover hidden
patterns, unknown correlations and other useful information that can be
used to make better decisions. With big data analytics, data scientists and
others can analyze huge volumes of data that conventional analytics and
business intelligence solutions can't touch. Consider that your organization
could accumulate (if it hasn't already) billions of rows of data with hundreds
of millions of data combinations in multiple data stores and abundant
formats. High-performance analytics is necessary to process that much data
in order to figure out what's important and what isn't. Enter big data
analytics.
Why collect and store terabytes of data if you can't analyze it in full context?
Or if you have to wait hours or days to get results? With new advances in
computing technology, there's no need to avoid tackling even the most
challenging business problems. For simpler and faster processing of only
relevant data, you can use high-performance analytics. Using high-
performance data mining, predictive analytics, text mining, forecasting and
optimization on big data enables you to continuously drive innovation and
make the best possible decisions. In addition, organizations are discovering
that the unique properties of machine learning are ideally suited to
addressing their fast-paced big data needs in new ways.
Customers have evolved their analytics methods from a reactive view into a
proactive approach that uses predictive and prescriptive analytics; in
practice, organizations use both reactive and proactive approaches.
There are four approaches to analytics, and each falls within the reactive
or proactive category:
-- Descriptive analytics (reactive): summarizes what has happened.
-- Diagnostic analytics (reactive): examines why it happened.
-- Predictive analytics (proactive): forecasts what is likely to happen.
-- Prescriptive analytics (proactive): recommends what action to take.
Data analytics and big data are gaining importance and attention. They are
expected to create customer value and competitive advantage for the
business. Given this focus on big data, an analysis was undertaken to
understand its impact on the data analytics life cycle. Typical analytics
projects have the effort and time distribution shown in the column chart
below. Of course, various factors influence the time taken across the
stages of the analytics life cycle, such as the complexity of the business
problem, the messiness of the data (quality, variety and volume), the
experience of the data analyst or scientist, and the maturity of analytics
in an organization or of its analytical tools and systems. But data
manipulation is one of the biggest drains on analyst time.
Understanding Business Objective
Big data, like any other technology, plays little role in understanding
the business objective and converting a business problem into an analytics
problem. But the flexibility and versatility of the tools and technology
guide what can and cannot be done. For example, a brick-and-mortar
retailer may have to launch a survey to understand customer sensitivity to
prices, whereas an eCommerce retailer can carry out the same analysis
using customers' web visits -- which other eCommerce websites customers
visit before and after visiting the retailer's own site.
Data Manipulation/Preparation
Data manipulation requires significant effort from an analyst, and big
data is expected to impact this stage the most. Big data helps an analyst
get query results faster (the velocity of big data). It also facilitates
accessing and using unstructured data (the variety of big data), which was
a challenge with traditional technology. Handling large data volumes (the
volume of big data) is expected to help by removing the data-volume
processing constraint or by improving speed. Statisticians devised
sampling techniques precisely to work around the constraint of processing
high volumes of data; big data platforms can process those volumes
directly, so sampling may not be required from that perspective. But
sampling is still relevant and required.
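The classical workaround mentioned above, drawing a random sample instead of processing everything, can be sketched in a few lines. The population, sample size, and fixed seed are illustrative assumptions:

```python
import random

def sample_records(records, k, seed=42):
    """Draw a simple random sample of k records.
    The seed is fixed here only so the sketch is reproducible."""
    rng = random.Random(seed)
    return rng.sample(records, k)

population = list(range(1_000_000))  # stand-in for a large dataset
subset = sample_records(population, k=1000)
print(len(subset))  # 1000
```

Even when a platform can process the full data set, a well-drawn sample like this remains useful for cheap exploratory iteration before a full run.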
Big data has created a new wave in industry, and there is a lot of
pressure on organizations to adopt it. The technology is still maturing,
but organizations are investing to tap big data for competitive advantage.
A few organizations, such as Facebook and Amazon, have already adopted big
data and use it routinely. The real differentiator between successful and
unsuccessful organizations will be deriving the right insights and acting
on them.
Evaluation
Due to big data and big data analytics, data analytics cycle time and cost
are expected to come down. The cost reduction and shrinkage in cycle time
will have a favorable impact on analytics adoption. Organizations will be
more open to moving toward an experimentation-and-learning culture. Of
course, this is not going to happen automatically.
Big data is going to impact each stage of the data analytics life cycle,
but the main value add (until big data analytics tools mature) will be
around data manipulation.
Big data analytics is the process of examining large data sets containing a
variety of data types -- i.e., big data -- to uncover hidden patterns, unknown
correlations, market trends, customer preferences and other useful business
information. The analytical findings can lead to more effective marketing,
new revenue opportunities, better customer service, improved operational
efficiency, competitive advantages over rival organizations and other
business benefits.
The primary goal of big data analytics is to help companies make more
informed business decisions by enabling data scientists, predictive modelers
and other analytics professionals to analyze large volumes of transaction
data, as well as other forms of data that may be untapped by conventional
business intelligence (BI) programs. That could include Web server logs and
Internet clickstream data, social media content and social network activity
reports, text from customer emails and survey responses, mobile-phone call
detail records and machine data captured by sensors connected to the
Internet of Things. Some people associate big data exclusively with
semi-structured and unstructured data of that sort, while others consider
structured data to be a valid component of big data analytics applications
as well.
Big data can be analyzed with the software tools commonly used as part of
advanced analytics disciplines such as predictive analytics, data mining, text
analytics and statistical analysis. Mainstream BI software and data
visualization tools can also play a role in the analysis process. But the semi-
structured and unstructured data may not fit well in traditional data
warehouses based on relational databases. Furthermore, data warehouses
may not be able to handle the processing demands posed by sets of big data
that need to be updated frequently or even continually -- for example, real-
time data on the performance of mobile applications or of oil and gas
pipelines. As a result, many organizations looking to collect, process and
analyze big data have turned to a newer class of technologies that includes
Hadoop and related tools such as YARN, MapReduce, Spark, Hive and Pig as
well as NoSQL databases. Those technologies form the core of an open
source software framework that supports the processing of large and diverse
data sets across clustered systems.
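The programming model behind Hadoop's MapReduce can be illustrated in a single process: a map phase emits key-value pairs, the framework shuffles them by key, and a reduce phase aggregates each group. This is a sketch of the model only, not the Hadoop API; the word-count task and the sample documents are the conventional teaching example:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

documents = ["big data needs big tools", "data tools evolve"]
pairs = list(chain.from_iterable(map_phase(d) for d in documents))
word_counts = reduce_phase(shuffle(pairs))
print(word_counts)
```

At cluster scale, the map and reduce functions stay this simple; Hadoop's contribution is running them in parallel across machines and handling the shuffle, storage, and failures in between.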
In some cases, Hadoop clusters and NoSQL systems are being used as
landing pads and staging areas for data before it gets loaded into a data
warehouse for analysis, often in a summarized form that is more conducive
to relational structures. Increasingly though, big data vendors are pushing
the concept of a Hadoop data lake that serves as the central repository for
an organization's incoming streams of raw data. In such architectures,
subsets of the data can then be filtered for analysis in data warehouses and
analytical databases, or the data can be analyzed directly in Hadoop using batch
query tools, stream processing software and SQL on Hadoop technologies
that run interactive, ad hoc queries written in SQL.
Industry predictions
Cyber Analytics
It seems every few months a large organization announces that it has been
compromised. Traditionally, data and network security has focused on
firewalls, intrusion detection and intrusion prevention. Security specialists
are beginning to realize that putting up a barrier doesn’t always work. As fast
as experts put up a firewall, savvy hackers find a way around it—or through
it.
It turns out the key to stopping hackers in their tracks is data – lots and lots
of data. Public and private-sector organizations all over the world are starting
to use something called Cyber Analytics to defend their systems. They are
taking internal and external information—from firewall data and
behavioural profiles to cyber threat intelligence and fraud alerts—
analyzing it in real time and identifying anomalous patterns of activity that in
the past would have gone undetected. What used to be a liability is now an
asset, as data becomes the key to recognizing threats before they
materialize and identifying clandestine activity before it becomes a breach.
Now organizations can spot potential hacks and prevent them from occurring.
In 2015 more organizations will turn to analytics to heighten their cyber
security capabilities.
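The real-time anomaly spotting described above can be sketched as a rolling baseline: keep a window of recent observations and flag any new value that deviates sharply from it. The window size, threshold, warm-up length, and simulated login counts are all assumed values for this toy sketch, far simpler than production cyber analytics:

```python
from collections import deque
import statistics

class RollingAnomalyDetector:
    """Flag events that deviate sharply from a rolling baseline.
    Window, threshold, and warm-up length are illustrative assumptions."""

    def __init__(self, window=20, threshold=3.0):
        self.history = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, value):
        """Return True if `value` looks anomalous against recent history."""
        is_anomaly = False
        if len(self.history) >= 5:  # require a short warm-up first
            mean = statistics.fmean(self.history)
            stdev = statistics.pstdev(self.history)
            if stdev > 0 and abs(value - mean) / stdev > self.threshold:
                is_anomaly = True
        self.history.append(value)
        return is_anomaly

detector = RollingAnomalyDetector()
# Simulated login attempts per minute, with one suspicious spike.
stream = [12, 11, 13, 12, 11, 12, 13, 300, 12]
flags = [detector.observe(v) for v in stream]
print(flags)
```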
Cloud Analytics
Real-time Analytics
While real-time marketing is not a new term, the key for 2015 is to use real-
time analytics to feed an extremely complex consumer marketplace. With
offers, advertisements and marketing messages being sent from every
possible direction, savvy marketers can no longer bombard consumers’
senses. In an ideal world, consumers receive messages that are tailored,
relevant and timely – offered to them based on their actions and behaviors
in the past or, better yet, how they might act or behave in the future.
These personalized offers are consistent regardless of the channel being
used. They are enabled by technology such as beacons -- hardware sensors
designed to communicate wirelessly with mobile devices within a specific
proximity -- which power in-store promotions, geolocation-targeted
messaging and shopper analytics. In other words, they
deliver the offer at the moment of truth - a moment that matters most to a
customer’s purchasing decision.
This is the customer experience that has come to be expected, and only
analytics can drive the type of real-time insights that underpin
personalized marketing. In 2015 it will be key for decision-makers to have
access to up-to-date, accurate and relevant data, available to them in
real time so they can make the right decisions.
Horizontal scaling means that you scale by adding more machines to your
pool of resources, whereas vertical scaling means that you scale by adding
more power (CPU, RAM) to an existing machine.
In-memory data grids such as GigaSpaces XAP, Coherence, etc. are often
optimized for both horizontal and vertical scaling simply because they are
not bound to disk: they scale horizontally through partitioning and
vertically through multi-core support. Horizontal scalability is the
ability to increase capacity by connecting multiple hardware or software
entities so that they work as a single logical unit.
When servers are clustered, the original server is scaled out
horizontally. If a cluster requires more resources to improve performance
and provide high availability (HA), an administrator can scale out by adding
more servers to the cluster.
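The partitioning mechanism behind horizontal scaling can be sketched as hashing each key to one of the cluster's nodes. The key names and node count below are invented for illustration; production systems typically use consistent hashing instead, so that adding a node reshuffles only a fraction of the keys:

```python
import hashlib

def partition_for(key: str, num_nodes: int) -> int:
    """Route a key to one of `num_nodes` servers by hashing it.
    Simple modulo placement, shown only to illustrate partitioning;
    real systems usually prefer consistent hashing."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

# Hypothetical keys spread across a 4-node cluster.
keys = ["user:1001", "user:1002", "user:1003"]
placement = {k: partition_for(k, num_nodes=4) for k in keys}
print(placement)
```

Because every node applies the same deterministic function, any server can locate any key's owner without a central lookup table, which is what lets the cluster act as a single logical unit.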