Big data analytics is an often-complex method of examining large and varied data
sets, or big data, to discover knowledge such as hidden patterns, unknown correlations,
industry trends, and consumer preferences that can help organizations make better
business decisions. It is an area that deals with ways to analyze, systematically extract
information from, or otherwise deal with data sets that are too large or complex to be
handled by conventional data-processing application software. Big data problems include
data collection, data storage, data processing, search, sharing, transfer, visualization,
querying, updating, privacy, and data sourcing. Big data was initially associated with three main
concepts: volume, variety, and velocity. Big data also includes data whose scale
exceeds the capacity of conventional software to process, and extract value from, within a
reasonable time.
Big data definitions have evolved rapidly, which has raised some confusion. This is
evident from an online survey of 154 C-suite global executives conducted by Harris
Interactive on behalf of SAP in April 2012 (“Small and midsize companies look to make
big gains with big data,” 2012). It shows executives differed in their understanding of big
data, where some definitions focused on what it is, while others tried to answer what it
does. Clearly, size is the first characteristic that comes to mind considering the question
“what is big data?”
There isn’t a widely accepted definition of Big Data. A common one, however, is that Big
Data is more and different data than is easily handled by a typical RDBMS. Some people
say 10 terabytes is Big Data, but the trouble with this definition is that what is big today
will be commonplace tomorrow. A more useful approach is to identify the
characteristics of Big Data – the 3Vs. It is of higher volume, velocity, and variety than
can be handled by a traditional RDBMS. There is more of it, it arrives more quickly, and it
takes more forms.
Not too long ago, a terabyte-sized data warehouse was considered to be Big Data. To
illustrate, Teradata recognized customers who had data warehouses this large. Today,
Teradata has 30 customers who have data warehouses that store over a petabyte of
data. Because of this volume of data (and other reasons discussed later), new analytical
platforms have emerged. In a few instances these platforms have replaced data
warehouses, but in most cases, they are an addition to the decision support data
infrastructure.
Organizations are collecting, storing, and analyzing many more kinds of data. Unlike the
structured data usually stored in an RDBMS, this data is described by terms such as loosely
structured, poorly structured, unstructured, and multi-structured. It is estimated that 80 percent of all
organizational data is multi-structured.
The Internet and social media have changed how companies engage with customers
before, during, and after a purchase. Companies have a brief window to affect the
transaction to make it more productive and profitable. At this moment of engagement,
the results of data analyses must come into play. For example, consider the product
recommendations on websites like Amazon.
3-Velocity. It refers to the rate at which data are generated and the speed at which they
should be analyzed and acted upon. The proliferation of digital devices such as
smartphones and sensors has led to an unprecedented rate of data creation and is
driving a growing need for real-time analytics and evidence-based planning. Even
conventional retailers are generating high-frequency data. Wal-Mart, for instance,
processes more than one million transactions per hour (Cukier, 2010). The data
emanating from mobile devices and flowing through mobile apps produces torrents of
information that can be used to generate real-time, personalized offers for everyday
customers. This data provides sound information about customers, such as geospatial
location, demographics, and past buying patterns, which can be analyzed in real time
to create real customer value.
4-Veracity. IBM coined Veracity as the fourth V, which represents the unreliability
inherent in some sources of data. For example, customer sentiments in social media
are uncertain in nature, since they entail human judgment. Yet they contain valuable
information. Thus, the need to deal with imprecise and uncertain data is another facet of
big data, which is addressed using tools and analytics developed for management and
mining of uncertain data.
6-Volatility describes the lifespan of data. When applied to source systems, volatility
considers how long the data will be available. For analytics systems it involves how long
the data should be stored. Can at least some of the data be moved offline, archived, or
expired? Data typically are stored until they are no longer needed. However, some data
store times are governed by regulations such as privacy laws and tax regulations.
Significantly, the data lifespan may determine which analysis is performed on the data
and when.
7-Value. Oracle introduced Value as a defining attribute of big data. Based on Oracle’s
definition, big data are often characterized by relatively “low value density”. That is, the
data received in the original form usually has a low value relative to its volume. However,
a high value can be obtained by analyzing large volumes of such data.
Stored data, by itself, does not generate business value. This is true of traditional
databases, data warehouses, and the new technologies for storing data (e.g.,
Hadoop/MapReduce). Once the data is appropriately stored, however, it can be
analyzed and this can generate tremendous value. Sometimes the data is analyzed
by analytics built into the storage, such as in-database analytics, and sometimes by
tools and applications that access and analyze the data.
The technologies for Big Data improve analytics by supporting the analysis of new
data types (e.g., social media), providing new analysis techniques (e.g., in-database
analytics), and improving data management and performance.
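To make the Hadoop/MapReduce idea mentioned above more concrete, here is a minimal, single-machine Python sketch of the map, shuffle, and reduce phases of a word-count job; the in-memory documents are invented for illustration, and a real Hadoop job would distribute these phases across a cluster.

    from collections import defaultdict

    # Hypothetical in-memory "documents" standing in for files on a distributed file system.
    documents = [
        "big data needs new analytical platforms",
        "data warehouses store structured data",
        "new platforms complement data warehouses",
    ]

    # Map phase: emit (word, 1) pairs for every word in every document.
    mapped = []
    for doc in documents:
        for word in doc.split():
            mapped.append((word, 1))

    # Shuffle phase: group the emitted pairs by key (the word).
    grouped = defaultdict(list)
    for word, count in mapped:
        grouped[word].append(count)

    # Reduce phase: sum the counts for each word.
    word_counts = {word: sum(counts) for word, counts in grouped.items()}
    print(word_counts)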
Big Data Considerations
1. You can’t process the amount of data that you want to because of the limitations
of your current platform.
2. You can’t include new/contemporary data sources (e.g., social media, RFID,
sensor, Web, GPS, textual data) because they do not comply with the data
schema.
3. You need to (or want to) integrate data as quickly as possible to keep your
analysis current.
4. You want to work with a schema-on-demand data storage paradigm because of the
variety of data types (see the sketch after this list).
5. The data is arriving so fast at your organization’s doorstep that your analytics
platform cannot handle it.
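As an illustration of the schema-on-demand idea in item 4 (sometimes called schema-on-read), the following minimal Python sketch, with invented field names and records, parses semi-structured JSON whose fields vary from record to record and imposes structure only at query time.

    import json

    # Semi-structured records with varying fields, as they might arrive from different sources.
    raw_records = [
        '{"user": "u1", "action": "purchase", "amount": 42.5}',
        '{"user": "u2", "action": "view", "page": "/home", "device": "mobile"}',
        '{"user": "u1", "action": "view", "page": "/product/17"}',
    ]

    # Schema-on-read: parse the records as-is, without a fixed table schema.
    records = [json.loads(line) for line in raw_records]

    # Structure is imposed only when a question is asked, e.g. total purchase amount per user.
    totals = {}
    for rec in records:
        if rec.get("action") == "purchase":
            totals[rec["user"]] = totals.get(rec["user"], 0.0) + rec.get("amount", 0.0)

    print(totals)  # {'u1': 42.5}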
Keys to Success with Big Data Analytics
1. A clear business need
2. Strong, committed sponsorship
3. Alignment between the business and IT strategy
4. A fact-based decision-making culture
5. A strong data infrastructure
6. The right analytics tools
7. Personnel with advanced analytical skills
Technical Challenges of Big Data Analytics
1. Uncertainty of the Data Management Landscape: Because big data is
continuously expanding, new companies and technologies are being developed
every day. A big challenge for businesses is to figure out which technology works
best for them without introducing new risks and problems.
2. The Big Data Talent Gap: While Big Data is a growing field, there are very few
experts available in it. This is because Big Data is a complex field, and people who
understand its complexity and intricate nature are few and far between. This talent
gap is one of the major challenges facing the industry.
3. Getting data into the big data platform: Data is increasing every single day.
This means that companies have to handle a virtually limitless amount of data on a regular
basis. The scale and variety of data available today can overwhelm any
data practitioner, which is why it is important to make data access simple
and convenient for brand managers and owners.
4. Need for synchronization across data sources: As data sets become more
diverse, there is a need to incorporate them into an analytical platform. If this is
ignored, it can create gaps and lead to wrong insights and messages.
5. Getting important insights through the use of Big Data analytics: It is
important that companies gain proper insights from big data analytics, and that
the correct department has access to this information. A major
challenge in big data analytics is bridging this gap in an effective fashion.
Business Challenges of Big Data Analytics
1. Need for Synchronization Across Disparate Data Sources
As data sets are becoming bigger and more diverse, there is a big challenge to
incorporate them into an analytical platform. If this is overlooked, it will create
gaps and lead to wrong messages and insights.
Business organizations are growing at a rapid pace. As companies and large business
organizations grow, so does the amount of data they produce. Storing this massive
amount of data is becoming a real challenge for everyone. Popular data storage options
such as data lakes and data warehouses are commonly used to gather and store large
quantities of unstructured and structured data in its native format. The real problem
arises when a data lake or warehouse tries to combine unstructured and inconsistent
data from diverse sources: it encounters errors. Missing data, inconsistent data, logic
conflicts, and duplicate data all result in data quality challenges.
Once business enterprises discover how to use Big Data, it brings them a wide range of
possibilities and opportunities. However, it also involves potential risks around the
privacy and security of the data. The Big Data tools used for analysis and storage draw
on data from disparate sources. This eventually leads to a high risk of exposure, making
the data vulnerable. Thus, the rise of voluminous amounts of data increases privacy and
security concerns.
1. Information Extraction
This is the most famous text mining technique. Information extraction refers to the
process of extracting meaningful information from vast chunks of textual data. This text
mining technique focuses on extracting entities, attributes, and their relationships from
semi-structured or unstructured texts. Whatever information is extracted is then stored
in a database for future access and retrieval. The efficacy and relevancy of the
outcomes are checked and evaluated using precision and recall measures.
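As a minimal sketch of these ideas, the following Python snippet uses simple regular-expression rules in place of a full entity extractor to pull entity-like items (dates and email addresses; the patterns and sample text are invented) and then evaluates the output against a small hand-labeled reference set using precision and recall.

    import re

    text = "Contact Ana Pop at ana@example.com before 2023-05-01 or john@example.org after 2023-06-15."

    # Rule-based extraction: regular expressions standing in for a full entity extractor.
    emails = re.findall(r"[\w.]+@[\w.]+\.\w+", text)
    dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)
    extracted = set(emails + dates)

    # Hand-labeled reference (gold) set for this example; the rules miss the person name.
    gold = {"Ana Pop", "ana@example.com", "john@example.org", "2023-05-01", "2023-06-15"}

    # Precision: fraction of extracted items that are correct.
    precision = len(extracted & gold) / len(extracted)
    # Recall: fraction of gold items that were extracted.
    recall = len(extracted & gold) / len(gold)

    print(f"precision={precision:.2f}, recall={recall:.2f}")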
2. Information Retrieval
Information Retrieval (IR) refers to the process of extracting relevant and associated
patterns based on a specific set of words or phrases. In this text mining technique, IR
systems make use of different algorithms to track and monitor user behaviors and
discover relevant data accordingly. Google and Yahoo search engines are the two most
renowned IR systems.
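A minimal sketch of an IR system, assuming scikit-learn is available: the documents and the query (all invented here) are represented as TF-IDF vectors and the documents are ranked by cosine similarity to the query.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    documents = [
        "big data analytics for retail customers",
        "audio analytics in customer call centers",
        "video analytics for security cameras",
    ]
    query = "customer call center analytics"

    # Build TF-IDF vectors for the documents and the query in the same vocabulary space.
    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(documents)
    query_vector = vectorizer.transform([query])

    # Rank documents by cosine similarity to the query.
    scores = cosine_similarity(query_vector, doc_vectors).ravel()
    for idx in scores.argsort()[::-1]:
        print(f"{scores[idx]:.3f}  {documents[idx]}")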
3. Categorization
4. Clustering
Clustering is one of the most crucial text mining techniques. It seeks to identify intrinsic
structures in textual information and organize them into relevant subgroups or
‘clusters’ for further analysis. A significant challenge in the clustering process is to form
meaningful clusters from the unlabeled textual data without having any prior information
on them. Cluster analysis is a standard text mining tool that assists in data distribution
or acts as a pre-processing step for other text mining algorithms running on detected
clusters.
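A minimal clustering sketch, assuming scikit-learn: a handful of invented, unlabeled snippets are vectorized with TF-IDF and grouped into two clusters with k-means; the choice of two clusters is for illustration only, since in practice the number of clusters is itself unknown.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    texts = [
        "shipping delay and late delivery complaint",
        "package arrived late, delivery was delayed",
        "great product quality, very satisfied",
        "excellent quality, satisfied with the product",
    ]

    # Vectorize the unlabeled texts.
    vectors = TfidfVectorizer().fit_transform(texts)

    # Group them into k clusters without any prior labels.
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)

    for label, text in zip(kmeans.labels_, texts):
        print(label, text)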
5. Summarization
Audio Analytics
Audio analytics analyze and extract information from unstructured audio data. When
applied to human spoken language, audio analytics is also referred to as speech
analytics. Since these techniques have mostly been applied to spoken audio, the terms
audio analytics and speech analytics are often used interchangeably. Currently,
customer call centers and healthcare are the primary application areas of audio
analytics.
Call centers use audio analytics for efficient analysis of thousands or even millions of
hours of recorded calls. These techniques help improve customer experience, evaluate
agent performance, enhance sales turnover rates, monitor compliance with different
policies (e.g., privacy and security policies), gain insight into customer behavior, and
identify product or service issues, among many other tasks. Audio analytics systems
can be designed to analyze a live call, formulate cross/up-selling recommendations
based on the customer's past and present interactions, and provide feedback to agents
in real time. In addition, automated call centers use the Interactive Voice Response
(IVR) platforms to identify and handle frustrated callers.
Video analytics
Video analytics, also known as video content analysis (VCA), involves a variety of
techniques to monitor, analyze, and extract meaningful information from video streams.
Although video analytics is still in its infancy compared to other types of data mining,
various techniques have already been developed for processing real-time as well as
pre-recorded videos. The increasing prevalence of closed-circuit television (CCTV)
cameras and the booming popularity of video-sharing websites are the two leading
contributors to the growth of computerized video analysis. A key challenge, however, is
the sheer size of video data. To put this into perspective, one second of a high-definition
video, in terms of size, is equivalent to over 2000 pages of text. Now consider that
100 hours of video are uploaded to YouTube every minute.
Big data technologies turn this challenge into an opportunity. By obviating the need for
cost-intensive and risk-prone manual processing, big data technologies can be leveraged to
automatically sift through and draw intelligence from thousands of hours of video. As a
result, big data technology is the third factor that has contributed to the development
of video analytics. In terms of system architecture, there are two approaches to
video analytics, namely server-based and edge-based.
Predictive analytics
Predictive analytics comprise a variety of techniques that predict future outcomes based
on historical and current data. In practice, predictive analytics can be applied to almost
all disciplines – from predicting the failure of jet engines based on the stream of data
from several thousand sensors, to predicting customers’ next moves based on what
they buy, when they buy, and even what they say on social media.
The third factor corresponds to the distinctive features inherent in big data:
heterogeneity, noise accumulation, spurious correlations, and incidental endogeneity,
which are described below.
Heterogeneity. Big data are often obtained from different sources and
represent information from different sub-populations. As a result, big data are
highly heterogeneous. In small samples, data from a sub-population are deemed
outliers because of their insufficient frequency. However, the sheer size of big
data sets creates a unique opportunity to model the heterogeneity arising from
sub-population data, which requires sophisticated statistical techniques.
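One such technique is a mixture model. The sketch below, assuming scikit-learn and using synthetic data, fits a two-component Gaussian mixture to data pooled from a large group and a small sub-population, recovering both groups rather than treating the smaller one as outliers.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)

    # Synthetic data pooled from two sub-populations with different means.
    subpop_a = rng.normal(loc=0.0, scale=1.0, size=(900, 1))  # majority group
    subpop_b = rng.normal(loc=5.0, scale=1.0, size=(100, 1))  # small sub-population
    data = np.vstack([subpop_a, subpop_b])

    # A two-component mixture recovers both groups instead of averaging them away.
    gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
    print("estimated means:", gmm.means_.ravel())
    print("estimated weights:", gmm.weights_)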
Noise accumulation. Estimating predictive models for big data often involves
the simultaneous estimation of several parameters. The accumulated estimation
error (or noise) for different parameters could dominate the magnitudes of
variables that have true effects within the model. In other words, some variables
with significant explanatory power might be overlooked as a result of noise
accumulation.
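The following simulation sketch (numpy only, synthetic data) illustrates the effect: a simple mean-difference classifier performs well when only the ten truly informative features are used, but its held-out accuracy degrades as thousands of pure-noise features are added and their estimation errors accumulate.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p_signal = 200, 10  # samples per class, number of truly informative features

    def accuracy_with_noise(p_noise):
        p = p_signal + p_noise
        mu = np.zeros(p)
        mu[:p_signal] = 1.0                 # classes differ only on the first 10 features
        X0 = rng.normal(size=(n, p))        # class 0
        X1 = rng.normal(size=(n, p)) + mu   # class 1
        # Estimate class means (and hence the mean-difference direction) on training halves.
        m0, m1 = X0[: n // 2].mean(axis=0), X1[: n // 2].mean(axis=0)
        w = m1 - m0
        threshold = w @ (m0 + m1) / 2
        # Classify the held-out halves by projecting onto the estimated direction.
        correct = (X1[n // 2:] @ w > threshold).sum() + (X0[n // 2:] @ w <= threshold).sum()
        return correct / n

    for p_noise in (0, 100, 1000, 10000):
        print(p_noise, "noise features -> accuracy", round(accuracy_with_noise(p_noise), 3))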
Spurious correlation. For big data, spurious correlation refers to
uncorrelated variables being falsely found to be correlated because of the massive
scale of the data set. This phenomenon can be shown through a simulation example
in which the maximum sample correlation between independent random variables
increases with the number of variables in the data set. As a result, some variables
that are scientifically unrelated (due to their independence) are erroneously found
to be correlated as a result of high dimensionality.
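The simulation mentioned above can be sketched as follows (numpy only, synthetic data): every variable is generated independently of the target, yet the maximum sample correlation with the target grows steadily as the number of candidate variables increases.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100  # sample size

    for p in (10, 100, 1000, 10000):
        # p independent candidate variables and one independent target, n observations each.
        X = rng.normal(size=(n, p))
        y = rng.normal(size=n)
        # Sample correlation of y with every column of X (all true correlations are zero).
        Xc, yc = X - X.mean(axis=0), y - y.mean()
        corr = Xc.T @ yc / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
        print(f"{p:6d} variables -> max |sample correlation|: {np.abs(corr).max():.3f}")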
Incidental endogeneity. A common assumption in regression analysis is the
exogeneity assumption: the explanatory variables, or predictors, are independent
of the residual term. The validity of most statistical methods used in regression
analysis depends on this assumption. In other words, the existence of incidental
endogeneity (i.e., the dependence of the residual term on some of the predictors)
undermines the validity of the statistical methods used for regression analysis.
Although the exogeneity assumption is usually met in small samples, incidental
endogeneity is commonly present in big data. It is worthwhile to mention that, in
contrast to spurious correlation, incidental endogeneity refers to a genuine
relationship between variables and the error term.
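A small simulation sketch (numpy only, synthetic data) makes the consequence concrete: when the error term is constructed to depend on one predictor, the ordinary least squares estimate of that predictor's coefficient is biased away from its true value.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 5000
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)

    # Endogenous error: the residual term depends on x1, violating the exogeneity assumption.
    eps = 0.8 * x1 + rng.normal(size=n)

    # True model: y = 1.0 * x1 + 1.0 * x2 + eps
    y = 1.0 * x1 + 1.0 * x2 + eps

    # Ordinary least squares estimate.
    X = np.column_stack([np.ones(n), x1, x2])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    print("true coefficients: intercept=0.0, x1=1.0, x2=1.0")
    print("OLS estimates:    ", np.round(beta_hat, 2))  # the x1 coefficient is biased toward ~1.8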
The irrelevance of statistical significance, the challenges of computational efficiency,
and the unique characteristics of big data discussed above highlight the need to
develop new statistical techniques to gain insights from predictive models.