Table of Contents

Letter of Transmittal
Acknowledgement
Introduction
Executive Summary
Problem Statement
Background of Big Data
Methodology
Limitations of Big Data
Basic Concept of Big Data
Big Data Analysis
Findings of Big Data
Conclusion
References

Letter of Transmittal

25 April, 2016

Md. Rashedul Hasan
Lecturer
Southeast University

Dear Sir,
It gives us immense pleasure to present this term paper, which was assigned to
us in fulfillment of our MIS2111 course requirement. We have found the study
quite interesting, beneficial and informative. We have tried our level best to
prepare an effective and creditable term paper. This report is about Big Data
analysis.

We also want to thank you for your support and patience with us. We shall be
very pleased to answer any query you think necessary as and when needed.
Sincerely,

Sajid Ahmed
Raihan Ahsan
Samaun Akter Sama
Sheik Kawsar Jaman

Acknowledgment

[Figure: Visualization of daily Wikipedia edits created by IBM. At multiple terabytes in size, the text and images of Wikipedia are an example of big data.]

[Figure: Growth and digitization of global information storage capacity.]

Big data is a term for data sets that are so large or complex that traditional data processing applications are
inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization,
querying and information privacy. The term often refers simply to the use of predictive analytics or certain other
advanced methods to extract value from data, and seldom to a particular size of data set. Accuracy in big data
may lead to more confident decision making, and better decisions can result in greater operational efficiency,
cost reduction and reduced risk.

Analysis of data sets can find new correlations to "spot business trends, prevent diseases, and combat crime and
so on." Scientists, business executives, practitioners of medicine, advertising and governments alike regularly
meet difficulties with large data sets in areas including Internet search, finance and business informatics.
Scientists encounter limitations in e-Science work, including meteorology, genomics, connectomics, complex
physics simulations, biology and environmental research.

Data sets are growing rapidly in part because they are increasingly gathered by cheap and numerous
information-sensing mobile devices, aerial (remote sensing), software logs, cameras, microphones, radio-
frequency identification (RFID) readers and wireless sensor networks. The world's technological per-capita
capacity to store information has roughly doubled every 40 months since the 1980s; as of 2012, every day 2.5
exabytes (2.5×10^18 bytes) of data are created. One question for large enterprises is determining who should own big
data initiatives that affect the entire organization.

Relational database management systems and desktop statistics and visualization packages often have difficulty
handling big data. The work instead requires "massively parallel software running on tens, hundreds, or even
thousands of servers." What is considered "big data" varies depending on the capabilities of the users and their
tools, and expanding capabilities make big data a moving target. For some organizations, facing hundreds of
gigabytes of data for the first time may trigger a need to reconsider data management options.

Introduction
Numerous technological innovations are driving the dramatic increase in data and data gathering.
This is why big data has become a recent area of strategic investment for IT organizations. For
example, the rise of mobile users has increased enterprise aggregation of user statistics
(geographic, sensor, and capability data) that can, if properly synthesized and analyzed, provide
extremely powerful business intelligence. In addition, the increased use of sensors for everything
from traffic patterns and purchasing behaviors to real-time inventory management is a primary
example of the massive increase in data. Much of this data is gathered in real time and provides a
unique and powerful opportunity if it can be analyzed and acted upon quickly. Machine-to-machine
interchange is another often unrecognized source of big data. The rise of security information
management (SIM) and the security information and event management (SIEM) industry is at the
heart of gathering, analyzing, and proactively responding to event data from active machine log files.
At the heart of this trend is the ability to capture, analyze, and respond to data and data trends in real
time. Although it may be clear that new technologies and new forms of personal communication are
driving the big data trend, consider that the global Internet population grew by 6.5% from 2010 to
2011 and now represents over two billion people. This may seem large, but it suggests that the vast
majority of the world's population has yet to connect. While it may be that we never reach 100% of
the world's population online (due to resource constraints, cost of goods, and limits to material
flexibility), increasingly those who are online are more connected than ever. Just a few years ago, it
was realistic to think that many had a desktop (perhaps at work) and maybe a laptop at their disposal.
However, today we also may have a connected smartphone and even a tablet computing device. So,
of today's two billion connected people, many are connected for the vast majority of their waking
hours, every second generating data:

In 2011 alone, mankind created over 1.2 trillion GB of data.
Data volumes are expected to grow 50 times by 2020.
Google receives over 2,000,000 search queries every minute.
72 hours of video are added to YouTube every minute.
There are 217 new mobile Internet users every minute.
Twitter users send over 100,000 tweets every minute (that's over 140 million per day).
Companies, brands, and organizations receive 34,000 likes on social networks every minute.

International Data Corporation (IDC) predicts that the market for big data technology and services will
reach $16.9 billion by 2015, with 40% growth over the prediction horizon. Not only will this technology
and services spend directly impact big data technology providers for related SQL database technologies,
Hadoop or MapReduce file systems, and related software and analytics software solutions, but it also will
impact new server, storage, and networking infrastructure that is specifically designed to leverage and
optimize the new analytical solutions.

Executive Summary

Numerous technological innovations are driving the dramatic increase in data and data gathering. This is why
big data has become a recent area of strategic investment for IT organizations. For example, the rise of mobile
users has increased enterprise aggregation of user statistics (geographic, sensor, and capability data) that can, if
properly synthesized and analyzed, provide extremely powerful business intelligence. In addition, the increased
use of sensors for everything from traffic patterns and purchasing behaviors to real-time inventory management is
a primary example of the massive increase in data. Much of this data is gathered in real time and provides a
unique and powerful opportunity if it can be analyzed and acted upon quickly. Machine-to-machine interchange
is another often unrecognized source of big data. The rise of security information management (SIM) and the
security information and event management (SIEM) industry is at the heart of gathering, analyzing, and
proactively responding to event data from active machine log files. At the heart of this trend is the ability to
capture, analyze, and respond to data and data trends in real time. Although it may be clear that new
technologies and new forms of personal communication are driving the big data trend, consider that the global
Internet population grew by 6.5% from 2010 to 2011 and now represents over two billion people. This may
seem large, but it suggests that the vast majority of the world's population has yet to connect. While it may be
that we never reach 100% of the world's population online (due to resource constraints, cost of goods, and limits
to material flexibility), increasingly those who are online are more connected than ever. Just a few years ago, it
was realistic to think that many had a desktop (perhaps at work) and maybe a laptop at their disposal. However,
today we also may have a connected smartphone and even a tablet computing device. So, of today's two billion
connected people, many are connected for the vast majority of their waking hours, every second generating
data:

In 2011 alone, mankind created over 1.2 trillion GB of data.
Data volumes are expected to grow 50 times by 2020.
Google receives over 2,000,000 search queries every minute.
72 hours of video are added to YouTube every minute.
There are 217 new mobile Internet users every minute.
Twitter users send over 100,000 tweets every minute (that's over 140 million per day).
Companies, brands, and organizations receive 34,000 likes on social networks every minute.

International Data Corporation (IDC) predicts that the market for big data technology and services will reach
$16.9 billion by 2015, with 40% growth over the prediction horizon. Not only will this technology and services
spend directly impact big data technology providers for related SQL database technologies, Hadoop or
MapReduce file systems, and related software and analytics software solutions, but it also will impact new
server, storage, and networking infrastructure that is specifically designed to leverage and optimize the new
analytical solutions.

Problem Statement

System Model: We integrate the respective advantages of Hadoop and R and design the big data visualization
and algorithm analysis model. With this model, processing and analysis of big data that was previously
impossible becomes possible. In the system model, people can easily implement parallel algorithms
using R and can analyze and visualize data at petabyte and zettabyte scale. Previously, the statistician
had to rely on sample extraction, hypothesis testing and various regression calculations when
analyzing and processing large data; now the statistician can complete the mathematical analysis
over the total big data based on the system model.
Figure 1 shows the whole process of the system integrated model.
In this system model, we use the Hadoop Distributed File System to store big data, use the
MapReduce computation model to implement distributed computation, and use R to
control the algorithms and data streaming. People who are skilled in using R can
easily use the system model to implement various parallel algorithms. The system
model has three divisions: the first part is the client, which people can use to control
and monitor the data streaming. The second part is the design of parallel algorithms, in
which people can use R to program their own thoughts and ideas. The third part is the
storage of big data with the distributed file system. Through this new perspective, treating
large-scale data storage, data processing, data analysis and data visualization as a whole
is of great practical significance for processing and analyzing big data.

Algorithm Model: The algorithm model is the most flexible portion of our big data
visualization and algorithm analysis integrated model. People can program various parallel
algorithms based on our system model; the only requirement is that they understand the
theory of MapReduce, while they do not need to care about the data processing details.
Figure 2 shows the finer details. The algorithm model includes three key parts: the top
part is the application layer, in which people can program various algorithms. The middle
and bottom parts are the processing and analysis layers; people can control the data
streaming from the application layer through these algorithms. In the next sections, we
design a parallelized collaborative filtering algorithm with the algorithm model. The
experimental results demonstrate the algorithm model's advantages; the Experimental
Results section explains the details.
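The following is a minimal, illustrative sketch (not the paper's actual Hadoop-and-R implementation) of how a collaborative filtering step can be split into a map phase and a reduce phase, which is the shape of program the algorithm layer described above has to supply. The toy ratings data and function names are assumptions made for the example.

```python
# Illustrative map/reduce decomposition of item co-occurrence counting,
# the core step of a simple collaborative filtering recommender.
from collections import defaultdict
from itertools import combinations

# Hypothetical toy ratings: (user, item) pairs standing in for HDFS input records.
ratings = [
    ("u1", "A"), ("u1", "B"), ("u1", "C"),
    ("u2", "A"), ("u2", "B"),
    ("u3", "B"), ("u3", "C"),
]

def map_phase(records):
    """Map: group items by user, then emit a count of 1 for each item pair."""
    items_by_user = defaultdict(list)
    for user, item in records:
        items_by_user[user].append(item)
    for items in items_by_user.values():
        for a, b in combinations(sorted(items), 2):
            yield (a, b), 1

def reduce_phase(pairs):
    """Reduce: sum the counts per item pair to get co-occurrence scores."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

co_occurrence = reduce_phase(map_phase(ratings))
print(co_occurrence)  # {('A', 'B'): 2, ('A', 'C'): 1, ('B', 'C'): 2}
```

On a real cluster the map output would be shuffled by key across nodes before the reduce step; the point here is only how the algorithm divides into the two MapReduce phases.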

Design Analysis: In this paper, the main characteristic of the big data visualization and
algorithm analysis integrated model is that it can implement parallel algorithms easily. This is
also the focus of this paper. As we all know, many algorithms basically run on a single
computer only. When the quantity of data reaches a certain level, the performance of the
algorithm is restricted; in other words, a performance bottleneck appears. Using the big
data visualization and algorithm analysis integrated model can remove this performance
bottleneck, because we use the distributed file system and divide-and-conquer logic to
process the problem, as shown in Figure 3: the image on the left shows the traditional
standalone processing data model, while the one on the right shows the extended
parallel processing model. This paper also uses the extended parallel processing model
to process and analyze the big data.
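As a rough illustration of the divide-and-conquer idea behind the extended parallel processing model, the sketch below computes the same summary serially and with a pool of worker processes. The data, chunk size and worker count are arbitrary choices for the example, not values from the paper.

```python
# Divide-and-conquer sketch: partition the data, summarize each partition,
# then merge the partial results. The same merge works whether the partition
# work runs serially (standalone model) or in parallel (extended model).
from multiprocessing import Pool

def summarize(chunk):
    """Per-partition work: local count and sum."""
    return len(chunk), sum(chunk)

def merge(partials):
    """Combine partition results into a global mean."""
    n = sum(c for c, _ in partials)
    total = sum(s for _, s in partials)
    return total / n

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i:i + 100_000] for i in range(0, len(data), 100_000)]

    serial_mean = merge([summarize(c) for c in chunks])       # standalone model
    with Pool(processes=4) as pool:                           # parallel model
        parallel_mean = merge(pool.map(summarize, chunks))

    assert serial_mean == parallel_mean
```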

Background and History of Big Data


The story of how data became big starts many years before the current buzz around big data. As early as seventy
years ago we encounter the first attempts to quantify the growth rate in the volume of data, or what has popularly
been known as the "information explosion" (a term first used in 1941, according to the Oxford English
Dictionary). The following are the major milestones in the history of sizing data volumes, plus other "firsts" in
the evolution of the idea of big data and observations pertaining to the data or information explosion.

1944: Fremont Rider, Wesleyan University Librarian, publishes "The Scholar and the Future of the Research
Library." He estimates that American university libraries were doubling in size every sixteen years. Given this
growth rate, Rider speculates that the Yale Library in 2040 will have "approximately 200,000,000 volumes, which
will occupy over 6,000 miles of shelves... [requiring] a cataloging staff of over six thousand persons."

1961: Derek Price publishes "Science Since Babylon," in which he charts the growth of scientific knowledge by
looking at the growth in the number of scientific journals and papers. He concludes that the number of new
journals has grown exponentially rather than linearly, doubling every fifteen years and increasing by a factor of
ten during every half-century. Price calls this the "law of exponential increase," explaining that "each
[scientific] advance generates a new series of advances at a reasonably constant birth rate, so that the number of
births is strictly proportional to the size of the population of discoveries at any given time."

November 1967: B. A. Marron and P. A. D. de Maine publish "Automatic data compression" in
the Communications of the ACM, stating that "The information explosion noted in recent years makes it
essential that storage requirements for all information be kept to a minimum." The paper describes "a fully
automatic and rapid three-part compressor which can be used with any body of information to greatly reduce
slow external storage requirements and to increase the rate of information transmission through a computer."

1971: Arthur Miller writes in "The Assault on Privacy" that "Too many information handlers seem to measure a
man by the number of bits of storage capacity his dossier will occupy."

1975: The Ministry of Posts and Telecommunications in Japan starts conducting the Information Flow Census,
tracking the volume of information circulating in Japan (the idea was first suggested in a 1969 paper). The
census introduces "amount of words" as the unifying unit of measurement across all media. The 1975 census
already finds that information supply is increasing much faster than information consumption, and in 1978 it
reports that "the demand for information provided by mass media, which are one-way communication, has
become stagnant, and the demand for information provided by personal telecommunications media, which are
characterized by two-way communications, has drastically increased.... Our society is moving toward a new
stage in which more priority is placed on segmented, more detailed information to meet individual needs,
instead of conventional mass-reproduced conformed information."

April 1980: I. A. Tjomsland gives a talk titled "Where Do We Go From Here?" at the Fourth IEEE Symposium
on Mass Storage Systems, in which he says, "Those associated with storage devices long ago realized that
Parkinson's First Law may be paraphrased to describe our industry: 'Data expands to fill the space
available.' ... I believe that large amounts of data are being retained because users have no way of identifying
obsolete data; the penalties for storing obsolete data are less apparent than are the penalties for discarding
potentially useful data."

1981: The Hungarian Central Statistics Office starts a research project to account for the country's information
industries, including measuring information volume in bits. The research continues to this day. In 1993, Istvan
Dienes, chief scientist of the Hungarian Central Statistics Office, compiles a manual for a standard system of
national information accounts.

August 1983: Ithiel de Sola Pool publishes "Tracking the Flow of Information" in Science. Looking at growth
trends in 17 major communications media from 1960 to 1977, he concludes that "words made available to
Americans (over the age of 10) through these media grew at a rate of 8.9 percent per year... words actually
attended to from those media grew at just 2.9 percent per year.... In the period of observation, much of the
growth in the flow of information was due to the growth in broadcasting.... But toward the end of that period
[1977] the situation was changing: point-to-point media were growing faster than broadcasting." Pool, Inose,
Takasaki and Hurwitz follow in 1984 with "Communications Flows: A Census in the United States and Japan," a
book comparing the volumes of information produced in the United States and Japan.

July 1986: Hal B. Becker publishes "Can users really absorb data at today's rates? Tomorrow's?" in Data
Communications. Becker estimates that "the recording density achieved by Gutenberg was approximately 500
symbols (characters) per cubic inch, 500 times the density of [4,000 B.C. Sumerian] clay tablets. By the year
2000, semiconductor random access memory should be storing 1.25x10^11 bytes per cubic inch."

September 1990: Peter J. Denning publishes "Saving All the Bits" in American Scientist. Says Denning:
"The imperative [for scientists] to save all the bits forces us into an impossible situation: The rate and volume of
information flow overwhelm our networks, storage devices and retrieval systems, as well as the human capacity
for comprehension.... What machines can we build that will monitor the data stream of an instrument, or sift
through a database of recordings, and propose for us a statistical summary of what's there?... It is possible to
build machines that can recognize or predict patterns in data without understanding the meaning of the patterns.
Such machines may eventually be fast enough to deal with large data streams in real time.... With these
machines, we can significantly reduce the number of bits that must be saved, and we can reduce the hazard of
losing latent discoveries from burial in an immense database. The same machines can also pore through existing
databases looking for patterns and forming class descriptions for the bits that we've already saved."

1996: Digital storage becomes more cost-effective for storing data than paper, according to R. J. T. Morris and
B. J. Truskowski in "The Evolution of Storage Systems," IBM Systems Journal, July 1, 2003.

October 1997: Michael Cox and David Ellsworth publish "Application-controlled demand paging for out-of-
core visualization" in the Proceedings of the IEEE 8th Conference on Visualization. They start the article with
"Visualization provides an interesting challenge for computer systems: data sets are generally quite large, taxing
the capacities of main memory, local disk, and even remote disk. We call this the problem of big data. When
data sets do not fit in main memory (in core), or when they do not fit even on local disk, the most common
solution is to acquire more resources." It is the first article in the ACM digital library to use the term "big data."

1997: Michael Lesk publishes "How much information is there in the world?" Lesk concludes that "There may
be a few thousand petabytes of information all told; and the production of tape and disk will reach that level by
the year 2000. So in only a few years, (a) we will be able to save everything: no information will have to be
thrown out, and (b) the typical piece of information will never be looked at by a human being."

April 1998: John R. Mashey, Chief Scientist at SGI, presents at a USENIX meeting a paper titled "Big Data
and the Next Wave of Infrastress."

October 1998: K. G. Coffman and Andrew Odlyzko publish "The Size and Growth Rate of the Internet." They
conclude that "the growth rate of traffic on the public Internet, while lower than is often cited, is still about
100% per year, much higher than for traffic on other networks. Hence, if present growth trends continue, data
traffic in the U.S. will overtake voice traffic around the year 2002 and will be dominated by the Internet."
Odlyzko later established the Minnesota Internet Traffic Studies (MINTS), tracking the growth in Internet
traffic from 2002 to 2009.

August 1999: Steve Bryson, David Kenwright, Michael Cox, David Ellsworth, and Robert Haimes publish
"Visually exploring gigabyte data sets in real time" in the Communications of the ACM. It is the first CACM
article to use the term "Big Data" (the title of one of the article's sections is "Big Data for Scientific
Visualization"). The article opens with the following statement: "Very powerful computers are a blessing to
many fields of inquiry. They are also a curse; fast computations spew out massive amounts of data. Where
megabyte data sets were once considered large, we now find data sets from individual simulations in the 300GB
range. But understanding the data resulting from high-end computations is a significant endeavor. As more than
one scientist has put it, it is just plain difficult to look at all the numbers. And as Richard W. Hamming,
mathematician and pioneer computer scientist, pointed out, the purpose of computing is insight, not numbers."

October 1999: Bryson, Kenwright and Haimes join David Banks, Robert van Liere, and Sam Uselton on a panel
titled "Automation or interaction: what's best for big data?" at the IEEE 1999 conference on Visualization.

October 2000: Peter Lyman and Hal R. Varian at UC Berkeley publish "How Much Information?" It is the first
comprehensive study to quantify, in computer storage terms, the total amount of new and original information
(not counting copies) created in the world annually and stored in four physical media: paper, film, optical (CDs
and DVDs), and magnetic. The study finds that in 1999, the world produced about 1.5 exabytes of unique
information, or about 250 megabytes for every man, woman, and child on earth. It also finds that "a vast
amount of unique information is created and stored by individuals" (what it calls the "democratization of data")
and that "not only is digital information production the largest in total, it is also the most rapidly growing."
Calling this finding "dominance of digital," Lyman and Varian state that "even today, most textual information
is 'born digital,' and within a few years this will be true for images as well." A similar study conducted in 2003
by the same researchers found that the world produced about 5 exabytes of new information in 2002 and that
92% of the new information was stored on magnetic media, mostly in hard disks.

November 2000: Francis X. Diebold presents to the Eighth World Congress of the Econometric Society a
paper titled "Big Data Dynamic Factor Models for Macroeconomic Measurement and Forecasting," in
which he states, "Recently, much good science, whether physical, biological, or social, has been forced to
confront, and has often benefited from, the 'Big Data' phenomenon. Big Data refers to the explosion in the
quantity (and sometimes, quality) of available and potentially relevant data, largely the result of recent and
unprecedented advancements in data recording and storage technology."

February 2001: Doug Laney, an analyst with the Meta Group, publishes a research note titled "3D Data
Management: Controlling Data Volume, Velocity, and Variety." A decade later, the "3Vs" have become the
generally accepted three defining dimensions of big data, although the term itself does not appear in Laney's
note.

September 2005: Tim O'Reilly publishes "What is Web 2.0," in which he asserts that "data is the next Intel
Inside." O'Reilly: "As Hal Varian remarked in a personal conversation last year, 'SQL is the new HTML.'
Database management is a core competency of Web 2.0 companies, so much so that we have sometimes
referred to these applications as 'infoware' rather than merely software."

March 2007: John F. Gantz, David Reinsel and other researchers at IDC release a white paper titled "The
Expanding Digital Universe: A Forecast of Worldwide Information Growth through 2010." It is the first
study to estimate and forecast the amount of digital data created and replicated each year. IDC estimates that in
2006, the world created 161 exabytes of data and forecasts that between 2006 and 2010, the information added
annually to the digital universe will increase more than sixfold to 988 exabytes, doubling every 18 months.
According to the 2010 and 2012 releases of the same study, the amount of digital data created annually
surpassed this forecast, reaching 1,227 exabytes in 2010, and growing to 2,837 exabytes in 2012.

January 2008: Bret Swanson and George Gilder publish "Estimating the Exaflood," in which they
project that U.S. IP traffic could reach one zettabyte by 2015 and that the U.S. Internet of 2015 will be at least
50 times larger than it was in 2006.

June 2008: Cisco releases the "Cisco Visual Networking Index - Forecast and Methodology, 2007-2012,"
part of an ongoing initiative to track and forecast the impact of visual networking applications.
It predicts that IP traffic will nearly double every two years through 2012 and that it will reach half a zettabyte
in 2012. The forecast held well, as Cisco's latest report (May 30, 2012) estimates IP traffic in 2012 at just over
half a zettabyte and notes it "has increased eightfold over the past 5 years."


September 2008: A special issue of Nature on "Big Data" examines what big data sets mean for contemporary
science.

December 2008: Randal E. Bryant, Randy H. Katz, and Edward D. Lazowska publish "Big-Data Computing:
Creating Revolutionary Breakthroughs in Commerce, Science and Society." They write: "Just as search
engines have transformed how we access information, other forms of big-data computing can and will
transform the activities of companies, scientific researchers, medical practitioners, and our nation's defense and
intelligence operations. Big-data computing is perhaps the biggest innovation in computing in the last decade.
We have only begun to see its potential to collect, organize, and process data in all walks of life. A modest
investment by the federal government could greatly accelerate its development and deployment."

December 2009: Roger E. Bohn and James E. Short publish "How Much Information? 2009 Report on
American Consumers." The study finds that in 2008, "Americans consumed information for about 1.3 trillion
hours, an average of almost 12 hours per day. Consumption totaled 3.6 zettabytes and 10,845 trillion words,
corresponding to 100,500 words and 34 gigabytes for an average person on an average day." Bohn, Short, and
Chaitanya Baru follow this up in January 2011 with "How Much Information? 2010 Report on Enterprise
Server Information," in which they estimate that in 2008, "the world's servers processed 9.57 zettabytes of
information, almost 10 to the 22nd power, or ten million million gigabytes. This was 12 gigabytes of
information daily for the average worker, or about 3 terabytes of information per worker per year. The world's
companies on average processed 63 terabytes of information annually."

February 2010: Kenneth Cukier publishes in The Economist a special report titled "Data, data everywhere."
Writes Cukier: "the world contains an unimaginably vast amount of digital information which is getting ever
vaster ever more rapidly.... The effect is being felt everywhere, from business to science, from governments to the
arts. Scientists and computer engineers have coined a new term for the phenomenon: 'big data.'"

February 2011: Martin Hilbert and Priscila Lopez publish "The World's Technological Capacity to Store,
Communicate, and Compute Information" in Science. They estimate that the world's information storage
capacity grew at a compound annual growth rate of 25% per year between 1986 and 2007. They also estimate
that in 1986, 99.2% of all storage capacity was analog, but in 2007, 94% of storage capacity was digital, a
complete reversal of roles (in 2002, digital information storage surpassed non-digital for the first time).

May 2011: James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh,
and Angela Hung Byers of the McKinsey Global Institute publish "Big data: The next frontier for innovation,
competition, and productivity." They estimate that "by 2009, nearly all sectors in the US economy had at least
an average of 200 terabytes of stored data (twice the size of US retailer Wal-Mart's data warehouse in 1999) per
company with more than 1,000 employees" and that the securities and investment services sector leads in terms
of stored data per firm. In total, the study estimates that 7.4 exabytes of new data were stored by enterprises and
6.8 exabytes by consumers in 2010.

April 2012: The International Journal of Communications publishes a special section titled "Info Capacity" on
the methodologies and findings of various studies measuring the volume of information. In "Tracking the flow
of information into the home," Neuman, Park, and Panek (following the methodology used by Japan's
MPT and Pool above) estimate that the total media supply to U.S. homes has risen from around 50,000 minutes
per day in 1960 to close to 900,000 in 2005. And looking at the ratio of supply to demand in 2005, they estimate
that people in the U.S. are "approaching a thousand minutes of mediated content available for every minute
available for consumption." Bounie and Gille (following Lyman and Varian above) estimate that the world
produced 14.7 exabytes of new information in 2008, nearly triple the volume of information in 2003.

May 2012: danah boyd and Kate Crawford publish "Critical Questions for Big Data" in Information,
Communication & Society. They define big data as "a cultural, technological, and scholarly phenomenon
that rests on the interplay of: (1) Technology: maximizing computation power and algorithmic accuracy to
gather, analyze, link, and compare large data sets. (2) Analysis: drawing on large data sets to identify patterns in
order to make economic, social, technical, and legal claims. (3) Mythology: the widespread belief that large data
sets offer a higher form of intelligence and knowledge that can generate insights that were previously
impossible, with the aura of truth, objectivity, and accuracy."

Methodology

1. Data quality grading and assurance

This research will develop new and adapt existing methodologies for merging data from multiple sources. It will
also develop robust techniques for data quality grading and assurance, providing automated data quality and
cleaning procedures for use by researchers.
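A hedged sketch of the kind of automated quality grading described above appears below; the column names, valid ranges, and grade thresholds are hypothetical choices for the example, not part of the methodology itself.

```python
# Simple per-column quality grading: completeness and range checks rolled
# into a letter grade that a researcher could act on.
import pandas as pd

def quality_report(df: pd.DataFrame, valid_ranges: dict) -> pd.DataFrame:
    """Return one quality row per column: completeness, range violations, grade."""
    rows = []
    for col in df.columns:
        completeness = df[col].notna().mean()
        lo, hi = valid_ranges.get(col, (None, None))
        out_of_range = 0
        if lo is not None:
            out_of_range = int(((df[col] < lo) | (df[col] > hi)).sum())
        grade = "A" if completeness > 0.98 and out_of_range == 0 else \
                "B" if completeness > 0.90 else "C"
        rows.append({"column": col, "completeness": round(completeness, 3),
                     "out_of_range": out_of_range, "grade": grade})
    return pd.DataFrame(rows)

df = pd.DataFrame({"age": [34, 29, None, 210], "income": [40e3, 52e3, 48e3, 61e3]})
print(quality_report(df, {"age": (0, 120)}))
```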

2. Identifying "unusual" data segments

Methods will be developed to automatically identify "unusual" data segments through an ICMetrics-based
technique. Such methods will be able to alert researchers to specific data segments that require subsequent
further analysis and identify potential issues with unsolicited data manipulation and integrity breaches.
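The sketch below is not the ICMetrics-based technique itself, only a simple statistical stand-in that flags fixed-size segments whose mean deviates sharply from the rest of a series, to illustrate the idea of automatically alerting researchers to suspect segments; the synthetic data and threshold are assumptions.

```python
# Flag "unusual" segments whose mean is a statistical outlier.
import numpy as np

def unusual_segments(values: np.ndarray, segment_size: int, z_threshold: float = 3.0):
    """Split a series into fixed-size segments and return indices of outlying ones."""
    n_segments = len(values) // segment_size
    segments = values[:n_segments * segment_size].reshape(n_segments, segment_size)
    means = segments.mean(axis=1)
    z = (means - means.mean()) / means.std()
    return [i for i, score in enumerate(z) if abs(score) > z_threshold]

rng = np.random.default_rng(0)
series = rng.normal(loc=10.0, scale=1.0, size=10_000)
series[4_000:4_100] += 25.0          # simulate a tampered segment
print(unusual_segments(series, segment_size=100))   # expected to flag segment 40
```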

3. Confidentiality preserving data mining techniques

Some datasets include sensitive information; this research considers how best to aggregate/transform data to
allow subsequent analysis to be undertaken with the minimum loss of information. Methods for dimensionality
reduction and data perturbation techniques will be investigated alongside privacy-preserving data mining
methods.
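As one concrete example of the data perturbation techniques mentioned above, the sketch below releases a noisy mean using Laplace noise in the style of the standard Laplace mechanism; the epsilon value, clipping bounds, and salary figures are illustrative assumptions, not choices made by this research.

```python
# Release an aggregate with calibrated Laplace noise so that no single record
# can be recovered from the published value.
import numpy as np

def perturbed_mean(values: np.ndarray, lower: float, upper: float, epsilon: float) -> float:
    """Noisy mean: noise scale follows the usual sensitivity / epsilon form."""
    clipped = np.clip(values, lower, upper)
    sensitivity = (upper - lower) / len(clipped)   # how much one record can move the mean
    noise = np.random.default_rng().laplace(0.0, sensitivity / epsilon)
    return float(clipped.mean() + noise)

salaries = np.array([31_000, 45_000, 52_000, 38_000, 61_000], dtype=float)
# With only five records the noise is large; that is the privacy/utility trade-off.
print(perturbed_mean(salaries, lower=20_000, upper=100_000, epsilon=1.0))
```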

4. Text data mining

Textual data represents rich information, but it lacks structure and requires specialist techniques to be mined and
linked properly, as well as to reason with it and draw useful correlations. A set of techniques will be developed for
extracting entities, the relations between them, opinions, and other elements.
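A toy stand-in for the entity-extraction techniques described above is shown below: regular expressions pull two simple entity types out of free text. Real text mining would rely on proper NLP models; the patterns and sample sentence are made up for the example.

```python
# Pattern-based extraction of two illustrative entity types from free text.
import re
from collections import Counter

ENTITY_PATTERNS = {
    "email": r"[\w.+-]+@[\w-]+\.[\w.]+",
    "money": r"[$£€]\s?\d[\d,]*(?:\.\d+)?",
}

def extract_entities(text: str) -> dict:
    """Return {entity_type: Counter of matched strings}."""
    return {name: Counter(re.findall(pattern, text))
            for name, pattern in ENTITY_PATTERNS.items()}

sample = ("Contact billing@council.gov.uk about the $1,200 invoice; "
          "refunds over $50 go to refunds@council.gov.uk within 30 days.")
print(extract_entities(sample))
```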

5. Tracking interactions among users

Data generated via the interaction of users online contains a wealth of information. This research will
investigate automatic methods for tracking interactions that can be used, for example, to identify service
pathways in local government or business data to aid organizations in improving service delivery to
citizens/customers. Methods to identify the context of the interaction and the individual user's needs, in order to
provide tailor-made services, will also be developed.

6. Machine learning and transactional data

This research will investigate machine learning and other methods for identifying stylized facts; seasonal, spatial
or other relations; and patterns of behavior at the level of the individual, group, or region from transactional data
from business, local government or other organizations. Such methods can provide essential decision support
information to organizations in planning services based on predicted trends, spikes or troughs in demand.
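A small illustration of mining a seasonal pattern from transactional data follows; the synthetic daily sales series and column names are assumptions, and a production system would work from real transaction logs.

```python
# Derive a monthly seasonal index from daily transaction amounts:
# values above 1.0 mark months with predictably higher demand.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
dates = pd.date_range("2014-01-01", "2015-12-31", freq="D")
# Synthetic daily sales with an annual seasonal swing plus noise.
amount = 100 + 30 * np.sin(2 * np.pi * dates.dayofyear / 365) + rng.normal(0, 5, len(dates))
transactions = pd.DataFrame({"date": dates, "amount": amount})

monthly_avg = transactions.groupby(transactions["date"].dt.month)["amount"].mean()
seasonal_index = (monthly_avg / monthly_avg.mean()).round(2)
print(seasonal_index)
```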

7. Developing methods to evaluate, target and monitor the provision of care.



Models and statistical methods for the analysis of local government health and social care data will be
developed alongside new data mining and machine learning algorithms to identify intervention subgroups, and
new joint modelling methods to improve existing predictive models with a view to evaluate, target and monitor
the provision of care.

Limitations of Big Data


As everyone knows, big data is all the rage in digital marketing nowadays. Marketing organizations across the
globe are trying to find ways to collect and analyze user-level or touchpoint-level data in order to uncover
insights about how marketing activity affects consumer purchase decisions and drives loyalty.

In fact, the buzz around big data in marketing has risen to the point where one could easily get the illusion that
utilizing user-level data is synonymous with modern marketing.

This is far from the truth. Case in point: Gartner's hype cycle as of last August placed big data for digital
marketing near the apex of inflated expectations, about to descend into the trough of disillusionment.

It is important for marketers and marketing analysts to understand that user-level data is not the be-all and end-all
of marketing: as with any type of data, it is suitable for some applications and analyses but unsuitable for others.

Following is a list describing some of the limitations of user-level data and the implications for marketing
analytics.

1. User Data Is Fundamentally Biased


The user-level data that marketers have access to covers only individuals who have visited your owned digital
properties or viewed your online ads, which is typically not representative of the total target consumer base.

Even within the pool of trackable cookies, the accuracy of the customer journey is dubious: many consumers
now operate across devices, and it is impossible to tell for any given touchpoint sequence how fragmented the
path actually is. Furthermore, those who operate across multiple devices are likely to be from a different
demographic compared to those who only use a single device, and so on.

User-level data is far from being accurate or complete, which means that there is inherent danger in assuming
that insights from user-level data apply to your consumer base at large.

2. User-Level Execution Only Exists In Select Channels


Certain marketing channels are well suited for applying user-level data: website personalization, email
automation, dynamic creatives, and RTB spring to mind.

In many channels, however, it is difficult or impossible to apply user data directly to execution except via
segment-level aggregation and whatever other targeting information is provided by the platform or publisher.

Social channels, paid search, and even most programmatic display are based on segment-level or attribute-level
targeting at best. For offline channels and premium display, user-level data cannot be applied to execution at all.

3. User-Level Results Cannot Be Presented Directly


More accurately, they can be presented via a few visualizations, such as a flow diagram, but these tend to be
incomprehensible to all but domain experts. This means that user-level data needs to be aggregated up to a daily
segment level or property level at the very least in order for the results to be consumable at large.

4. User-Level Algorithms Have Difficulty Answering Why


Broadly speaking, there are only two ways to analyze user-level data: one is to aggregate it into a smaller data
set in some way and then apply statistical or heuristic analysis; the other is to analyze the data set directly using
algorithmic methods.

Both can result in predictions and recommendations (e.g., move spend from campaign A to B), but algorithmic
analyses tend to have difficulty answering "why" questions (e.g., why should we move spend?) in a manner
comprehensible to the average marketer. Certain types of algorithms, such as neural networks, are black boxes
even to the data scientists who designed them. This leads to the next limitation:

5. User Data Is Not Suited For Producing Learnings


This will probably strike you as counter-intuitive. Big data = big insights = big learnings, right?

Wrong! For example, let's say you apply big data to personalize your website, increasing overall conversion
rates by 20%. While certainly a fantastic result, the only learning you get from the exercise is that you should
indeed personalize your website. This result certainly raises the bar on marketing, but it does nothing to raise
the bar for marketers.

Actionable learnings that require user-level data (for instance, applying a look-alike model to discover
previously untapped customer segments) are relatively few and far between, and require tons of effort to
uncover. Boring ol' small data remains far more efficient at producing practical, real-world learnings that you
can apply to execution today.

6. User-Level Data Is Subject To More Noise


If you have analyzed regular daily time series data, you know that a single outlier can completely throw off
analysis results. The situation is similar with user-level data, but worse.

In analyzing touchpoint data, you will run into situations where, for example, a particular cookie received for
whatever reason a hundred display impressions in a row from the same website within an hour (happens much
more often than you might think). Should this be treated as a hundred impressions or just one, and how will it
affect your analysis results?

Even more so than smaller data, user-level data tends to be filled with so much noise and so many potentially
misleading artifacts that it can take forever just to clean up the data set in order to get reasonably accurate
results.
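As a rough sketch of the clean-up step this implies, the snippet below collapses bursts of impressions from the same cookie and site within the same hour into a single exposure before analysis; the log format shown is hypothetical.

```python
# Cap repeated display impressions: keep one row per cookie/site/hour bucket.
import pandas as pd

log = pd.DataFrame({
    "cookie_id": ["c1"] * 5 + ["c2"],
    "site": ["news.example"] * 5 + ["blog.example"],
    "timestamp": pd.to_datetime([
        "2016-04-01 10:00", "2016-04-01 10:05", "2016-04-01 10:20",
        "2016-04-01 10:45", "2016-04-01 13:00", "2016-04-01 09:30",
    ]),
})

log["hour_bucket"] = log["timestamp"].dt.floor("h")
deduped = log.drop_duplicates(subset=["cookie_id", "site", "hour_bucket"])
print(len(log), "raw impressions ->", len(deduped), "after capping")   # 6 -> 3
```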

7. User Data Is Not Easily Accessible or Transferable


Because of security concerns, user data cannot be made accessible to just anyone, and requires care in
transferring from machine to machine, server to server.

Because of scale concerns, not everyone has the technical know-how to query big data in an efficient manner,
which causes database admins to limit the number of people who have access in the first place.

Because of the high amount of effort required, whatever insights are mined from big data tend to remain a
one-off exercise, making it difficult for team members to conduct follow-up analyses and validation.

All of these factors limit the agility of analysis and the ability to collaborate.

Basic Concept of Big Data


It is a new set of approaches for analysing data sets that were not previously accessible because they
posed challenges across one or more of the 3 Vs of big data:

Volume (too big): terabytes and more of credit card transactions, web usage data, and system logs.

Variety (too complex): truly unstructured data such as social media, customer reviews, and call
centre records.

Velocity (too fast): sensor data, live web traffic, mobile phone usage, and GPS data.

SQL analytics at this scale, on hundreds of nodes and petabytes of data, is already well addressed by the
data warehouse crowd.

Revolutionary. That pretty much describes the data analysis era in which we live. Businesses grapple
with huge quantities and varieties of data on one hand, and ever-faster expectations for analysis on the
other. The vendor community is responding by providing highly distributed architectures and new levels
of memory and processing power. Upstarts also exploit the open-source licensing model, which is not
new, but is increasingly accepted and even sought out by data-management professionals.

Apache Hadoop, a nine-year-old open-source data-processing platform first used by Internet giants
including Yahoo and Facebook, leads the big-data revolution. Cloudera introduced commercial support
for enterprises in 2008, and MapR and Hortonworks piled on in 2009 and 2011, respectively. Among
data-management incumbents, IBM and EMC-spinout Pivotal each has introduced its own Hadoop
distribution. Microsoft and Teradata offer complementary software and first-line support for
Hortonworks' platform. Oracle resells and supports Cloudera, while HP, SAP, and others act more like
Switzerland, working with multiple Hadoop software providers.

In-memory analysis gains steam as Moore's Law brings us faster, more affordable, and more-memory-
rich processors. SAP has been the biggest champion of the in-memory approach with its Hana platform,
but Microsoft and Oracle are now poised to introduce in-memory options for their flagship databases.
Focused analytical database vendors including Actian, HP Vertica, and Teradata have introduced options
for high-RAM-to-disk ratios, along with tools to place specific data into memory for ultra-fast analysis.

Advances in bandwidth, memory, and processing power also have improved real-time stream-processing
and stream-analysis capabilities, but this technology has yet to see broad adoption. Several vendors here offer
complex event processing, but outside of the financial trading, national intelligence, and security
communities, deployments have been rare. Watch this space and, particularly, new open-source options
as breakthrough applications in ad delivery, content personalization, logistics, and other areas push
broader adoption.

This overview includes broad-based data-management vendors -- IBM, Microsoft, Oracle, SAP -- that
offer everything from data-integration software and database-management systems (DBMSs) to business
intelligence and analytics software, to in-memory, stream-processing, and Hadoop options. Teradata is a
blue chip focused more narrowly on data management, and like Pivotal, it has close ties with analytics
market leader SAS.

Plenty of vendors covered here offer cloud options, but 1010data and Amazon Web Services (AWS)
have staked their entire businesses on the cloud model. Amazon has the broadest selection of products of
the two, and it's an obvious choice for those running big workloads and storing lots of data on the AWS
platform. 1010data has a highly scalable database service and supporting information-management, BI,
and analytics capabilities that are served up private-cloud style.

The jury is still out on whether Hadoop will become as indispensable as database management systems.
Where volume and variety are extreme, Hadoop has proven its utility and cost advantages. Cloudera,
Hortonworks, and MapR are doing everything they can to move Hadoop beyond high-scale storage and
MapReduce processing into the world of analytics.

The niche vendors here include Actian, InfiniDB/Calpont, HP Vertica, Infobright, and Kognitio, all of
which have centered their big-data stories around database management systems focused entirely on
analytics rather than transaction processing. German DBMS vendor Exasol is another niche player in
this mold, but we don't cover it here as its customer base is almost entirely in continental Europe. It
opened offices in the U.S. and U.K. in January 2014.

This overview does not cover analytics vendors, such as Alpine Data Labs, Revolution Analytics, and
SAS. These vendors invariably work in conjunction with platforms provided by third-party DBMS
vendors and Hadoop distributors, although SAS in particular is blurring this line with growing support
for SAS-managed in-memory data grids and Hadoop environments. We also excluded NoSQL and
NewSQL DBMSs, which are heavily (though not entirely) focused on high-scale transaction processing,
not analytics; they merit a separate treatment of their own.

New Opportunities, New Challenges



Big data technologies can derive value from large datasets in ways that were
previously impossible; indeed, big data can generate insights that researchers
didn't even think to seek. But the technical capabilities of big data have reached
a level of sophistication and pervasiveness that demands consideration of how
best to balance the opportunities afforded by big data against the social and
ethical questions these technologies raise.

The power and opportunity of big data applications:

Used well, big data analysis can boost economic productivity, drive improved consumer
and government services, thwart terrorists, and save lives. For example, big data
and the growing Internet of Things have made it possible to merge the industrial and
information economies. Jet engines and delivery trucks can now be outfitted with
sensors that monitor hundreds of data points and send automatic alerts when maintenance is needed.

Finding the needle in the haystack


Computational capabilities now make finding a needle in a haystack not only
possible, but practical. In the past, searching large datasets required both rationally
organized data and a specific research question, relying on choosing the right query to
return the correct result. Big data analytics enable data scientists to amass lots of
data, including unstructured data, and find anomalies or patterns.

The benefits and consequences of perfect personalization

The fusion of many different kinds of data, processed in real time, has the power to
deliver exactly the right message, product or service to consumers before they even
ask. Small bits of data can be brought together to create a clear picture of a person and to
predict preferences or behaviors. These detailed personal profiles and personalized
experiences are effective in the consumer marketplace and can deliver products and
offers to precise segments of the population, like a professional accountant with a
passion for knitting, or a home chef with a penchant for horror films.

De-identification and re-identification

As techniques like data fusion make big data analytics more powerful, the challenges
to current expectations of privacy grow more serious. When data is initially linked to an
individual or device, some privacy-protective technology seeks to remove this linkage,
or "de-identify" personally identifiable information, but equally effective techniques
exist to pull the pieces back together through "re-identification." Similarly, integrating
diverse data can lead to what some analysts call the "mosaic effect," whereby personally
identifiable information can be derived or inferred from datasets that do not even
include personal identifiers, bringing into focus a picture of who an individual is and
what he or she likes.

The persistence of data

In the past, retaining physical control over one's personal information was often
sufficient to ensure privacy. Documents could be destroyed, conversations forgotten,
and records expunged. But in the digital world, information can be captured, copied,
shared, and transferred at high fidelity and retained indefinitely. Volumes of data that
were once unthinkably expensive to preserve are now easy and affordable to store on
a chip the size of a grain of rice. As a consequence, data, once created, is in many
cases effectively permanent.

Big Data Analysis

In the previous section, Techniques for Analyzing Big Data, we discussed some of the methods
you can use to find meaning and discover hidden relationships in big data. Here are three
significant requirements for conducting these inquiries in an expedient way:

1. Minimize data movement
2. Use existing skills
3. Attend to data security

Minimizing data movement is all about conserving computing resources. In traditional
analysis scenarios, data is brought to the computer, processed, and then sent to the next
destination. For example, production data might be extracted from e-business systems,
transformed into a relational data type, and loaded into an operational data store structured
for reporting. But as the volume of data grows, this type of ETL architecture becomes
increasingly less efficient. There's just too much data to move around. It makes more sense
to store and process the data in the same place.

With new data and new data sources comes the need to acquire new skills. Sometimes the
existing skillset will determine where analysis can and should be done. When the requisite
skills are lacking, a combination of training, hiring and new tools will address the problem.
Since most organizations have more people who can analyze data using SQL than using
MapReduce, it is important to be able to support both types of processing.

Data security is essential for many corporate applications. Data warehouse users are
accustomed not only to carefully defined metrics, dimensions and attributes, but also to a
reliable set of administration policies and security controls. These rigorous processes are
often lacking with unstructured data sources and open source analysis tools. Pay attention
to the security and data governance requirements of each analysis project and make sure
that the tools you are using can accommodate those requirements.
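A small, self-contained illustration of the "minimize data movement" point: the aggregation below is pushed into the database so that only summary rows leave it, instead of extracting every record ETL-style and summing on the client. The table and values are made up for the example.

```python
# In-database aggregation versus client-side processing.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("east", 120.0), ("east", 80.0), ("west", 200.0)])

# In-database analysis: only two summary rows cross the wire.
for region, total in conn.execute(
        "SELECT region, SUM(amount) FROM orders GROUP BY region"):
    print(region, total)

# The alternative (SELECT * and summing in the client) would move every row,
# which is exactly the data movement the text advises against at scale.
conn.close()
```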

The primary goal of big data analytics is to help companies make more informed business decisions by enabling
data scientists, predictive modelers and other analytics professionals to analyze large volumes of transaction
data, as well as other forms of data that may be untapped by conventional business intelligence (BI) programs.
That could include Web server logs and Internet clickstream data, social media content and social network
activity reports, text from customer emails and survey responses, mobile-phone call detail records and machine
data captured by sensors connected to the Internet of Things. Some people exclusively associate big data with
semi-structured and unstructured data of that sort, but consulting firms like Gartner Inc. and Forrester Research
Inc. also consider transactions and other structured data to be valid components of big data analytics
applications.

Big data can be analyzed with the software tools commonly used as part of advanced analytics disciplines such
as predictive analytics, data mining, text analytics and statistical analysis. Mainstream BI software and data
visualization tools can also play a role in the analysis process. But the semi-structured and unstructured data
may not fit well in traditional data warehouses based on relational databases. Furthermore, data warehouses may
not be able to handle the processing demands posed by sets of big data that need to be updated frequently or
even continually -- for example, real-time data on the performance of mobile applications or of oil and gas
pipelines. As a result, many organizations looking to collect, process and analyze big data have turned to a
newer class of technologies that includes Hadoop and related tools such as YARN, MapReduce, Spark, Hive
and Pig as well as NoSQL databases. Those technologies form the core of an open source software framework
that supports the processing of large and diverse data sets across clustered systems.

Tools for Analyzing Big Data:

There are five key approaches to analyzing big data and generating insight. Discovery tools
are useful throughout the information lifecycle for rapid, intuitive exploration and analysis of
information from any combination of structured and unstructured sources. These tools permit
analysis alongside traditional BI source systems. Because there is no need for upfront
modeling, users can draw new insights, come to meaningful conclusions, and make informed
decisions quickly. BI tools are important for reporting, analysis and performance management,
primarily with transactional data from data warehouses and production information systems. BI
tools provide comprehensive capabilities for business intelligence and performance
management, including enterprise reporting, dashboards, ad-hoc analysis, scorecards, and
what-if scenario analysis on an integrated, enterprise-scale platform. In-database analytics
include a variety of techniques for finding patterns and relationships in your data. Because
these techniques are applied directly within the database, you eliminate data movement to and
from other analytical servers, which accelerates information cycle times and reduces total
cost of ownership. Hadoop is useful for pre-processing data to identify macro trends or find
nuggets of information, such as out-of-range values. It enables businesses to unlock potential
value from new data using inexpensive commodity servers. Organizations primarily use
Hadoop as a precursor to advanced forms of analytics. Decision management includes
predictive modeling, business rules, and self-learning to take informed action based on the
current context. This type of analysis enables individual recommendations across multiple
channels, maximizing the value of every customer interaction. Oracle Advanced Analytics
scores can be integrated to operationalize complex predictive analytic models and create
real-time decision processes. All of these approaches have a role to play in uncovering hidden
relationships. Traditional data discovery tools like Oracle Endeca Information Discovery, BI
tools like Oracle Analytics, and decision management tools like Oracle Real-Time Decisions
are covered extensively in other white papers. In this paper, we mainly focus on the
integrated use of Hadoop and in-database analytics to process and analyze a vast field of
new data.
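To make the "Hadoop as a pre-processing step" idea concrete, the sketch below expresses an out-of-range filter as a map function and a per-sensor count as a reduce function in plain Python; a real deployment would run equivalent logic as a MapReduce or Spark job on a cluster, and the readings and valid range are assumptions for the example.

```python
# Map: flag out-of-range sensor readings. Reduce: count flags per sensor.
from collections import Counter

readings = [("s1", 21.4), ("s1", 999.0), ("s2", 19.8), ("s2", -40.0), ("s1", 22.1)]
LOW, HIGH = -10.0, 60.0    # assumed valid operating range

def map_out_of_range(records):
    for sensor, value in records:
        if value < LOW or value > HIGH:
            yield sensor, 1

def reduce_counts(pairs):
    counts = Counter()
    for sensor, one in pairs:
        counts[sensor] += one
    return counts

print(reduce_counts(map_out_of_range(readings)))   # Counter({'s1': 1, 's2': 1})
```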

Healthcare:

The healthcare delivery system can develop more thorough and insightful diagnoses and
treatments, resulting, one would expect, in higher quality care at lower costs and in
better outcomes overall [12]. The potential for big data analytics in healthcare to lead to
better outcomes exists across many scenarios, for example: by analyzing patient
characteristics and the cost and outcomes of care to identify the most clinically and cost
effective treatments and offer analysis and tools, thereby influencing provider behavior;
applying advanced analytics to patient profiles (e.g., segmentation and predictive
modeling) to proactively identify individuals who would benefit from preventative care or
lifestyle changes; broad scale disease profiling to identify predictive events and support
prevention initiatives; collecting and publishing data on medical procedures, thus
assisting patients in determining the care protocols or regimens that offer the best value;
identifying, predicting and minimizing fraud by implementing advanced analytic systems
for fraud detection, checking the accuracy and consistency of claims, and implementing
claim authorization much nearer to real time; and creating new revenue
streams by aggregating and synthesizing patient clinical records and claims data sets to
provide data and services to third parties, for example, licensing data to assist
pharmaceutical companies in identifying patients for inclusion in clinical trials. Many
payers are developing and deploying mobile apps that help patients manage their care,
locate providers and improve their health. Via analytics, payers are able to monitor
adherence to drug and treatment regimens and detect trends that lead to individual and
population wellness benefits [12,16-18].
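As a brief illustration of the claims-screening idea in the fraud item above, the following sketch
flags claims that are far larger than is typical for the submitting provider. The record layout, the
example figures and the "five times the provider median" rule are assumptions chosen only to make
the sketch runnable; a real pre-adjudication system would use far richer features and models.

# claims_screen.py - toy sketch of statistical claims screening (illustrative data and rule).
from statistics import median

claims = [  # hypothetical (claim_id, provider_id, amount) records
    ("c1", "p1", 120.0), ("c2", "p1", 135.0), ("c3", "p1", 118.0),
    ("c4", "p1", 4100.0),                     # unusually large for this provider
    ("c5", "p2", 300.0), ("c6", "p2", 290.0), ("c7", "p2", 310.0),
]

def flag_suspicious(records, factor=5.0):
    """Flag claims whose amount exceeds `factor` times the provider's median claim."""
    by_provider = {}
    for claim_id, provider_id, amount in records:
        by_provider.setdefault(provider_id, []).append((claim_id, amount))
    flagged = []
    for provider_id, items in by_provider.items():
        baseline = median(amount for _, amount in items)
        for claim_id, amount in items:
            if amount > factor * baseline:
                flagged.append((claim_id, provider_id, amount))
    return flagged

print(flag_suspicious(claims))   # -> [('c4', 'p1', 4100.0)], queued for manual review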
This section provides an overview of big data analytics in healthcare as it is emerging as a
discipline. First, we define and discuss the various advantages and characteristics of big
data analytics in healthcare. Then we describe the architectural framework of big data
analytics in healthcare. Third, the big data analytics application development
methodology is described. Fourth, we provide examples of big data analytics in
healthcare reported in the literature. Fifth, the challenges are identified. Lastly, we offer
conclusions and future directions.

Big data analytics in healthcare


Health data volume is expected to grow dramatically in the years ahead [6]. In addition,
healthcare reimbursement models are changing; meaningful use and pay for
performance are emerging as critical new factors in today's healthcare environment.
Although profit is not and should not be a primary motivator, it is vitally important for
healthcare organizations to acquire the available tools, infrastructure, and techniques to
leverage big data effectively or else risk losing potentially millions of dollars in revenue
and profits [19].
What exactly is big data? A report delivered to the U.S. Congress in August 2012 defines
big data as "large volumes of high velocity, complex, and variable data that require
advanced techniques and technologies to enable the capture, storage, distribution,
management and analysis of the information" [6]. Big data encompasses such
characteristics as variety, velocity and, with respect specifically to healthcare, veracity
[20-23]. Existing analytical techniques can be applied to the vast amount of existing (but
currently unanalyzed) patient-related health and medical data to reach a deeper
understanding of outcomes, which then can be applied at the point of care. Ideally,
individual and population data would inform each physician and her patient during the
decision-making process and help determine the most appropriate treatment option for
that particular patient.

Advantages to healthcare
By digitizing, combining and effectively using big data, healthcare organizations ranging
from single-physician offices and multi-provider groups to large hospital networks and
accountable care organizations stand to realize significant benefits [2]. Potential benefits
include detecting diseases at earlier stages, when they can be treated more easily and
effectively; managing specific individual and population health; and detecting healthcare
fraud more quickly and efficiently. Numerous questions can be addressed with big data
analytics. Certain developments or outcomes may be predicted and/or estimated based
on vast amounts of historical data, such as length of stay (LOS); patients who will choose
elective surgery; patients who likely will not benefit from surgery; complications; patients
at risk for medical complications; patients at risk for sepsis, MRSA, C. difficile, or other
hospital-acquired illness; illness/disease progression; patients at risk for advancement in
disease states; causal factors of illness/disease progression; and possible comorbid
conditions (EMC Consulting). McKinsey estimates that big data analytics can enable more
than $300 billion in savings per year in U.S. healthcare, two thirds of that through
reductions of approximately 8% in national healthcare expenditures. Clinical operations
and R & D are two of the largest areas for potential savings with $165 billion and $108
billion in waste respectively [24]. McKinsey believes big data could help reduce waste
and inefficiency in the following three areas:

Clinical operations: Comparative effectiveness research to determine more clinically relevant and cost-
effective ways to diagnose and treat patients.
Research & development: 1) predictive modeling to lower attrition and produce a leaner, faster, more
targeted R & D pipeline in drugs and devices;
2) statistical tools and algorithms to improve clinical trial design and patient recruitment to better match
treatments to individual patients, thus reducing trial failures and speeding new treatments to market; and 3)
analyzing clinical trials and patient records to identify follow-on indications and discover adverse effects
before products reach the market.
Public health: 1) analyzing disease patterns and tracking disease outbreaks and transmission to improve public
health surveillance and speed response; 2) faster development of more accurately targeted vaccines, e.g.,
choosing the annual influenza strains; and, 3) turning large amounts of data into actionable information that
can be used to identify needs, provide services, and predict and prevent crises, especially for the benefit of
populations.
In addition, big data analytics in healthcare can contribute to:
Evidence-based medicine: Combine and analyze a variety of structured and unstructured data (EMRs, financial
and operational data, clinical data, and genomic data) to match treatments with outcomes, predict patients
at risk for disease or readmission, and provide more efficient care;
Genomic analytics: Execute gene sequencing more efficiently and cost-effectively and make genomic analysis
a part of the regular medical care decision process and the growing patient medical record;
Pre-adjudication fraud analysis: Rapidly analyze large numbers of claim requests to reduce fraud, waste and abuse;
Device/remote monitoring: Capture and analyze in real time large volumes of fast-moving data from in-
hospital and in-home devices, for safety monitoring and adverse event prediction;
Patient profile analytics: Apply advanced analytics to patient profiles (e.g., segmentation and predictive
modeling) to identify individuals who would benefit from proactive care or lifestyle changes, for example,
those patients at risk of developing a specific disease (e.g., diabetes) who would benefit from preventive
care (see the sketch below).
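The following is a minimal sketch of the kind of segmentation and predictive modeling described in the
patient profile analytics item above: a logistic regression trained to rank patients by risk. The
features, the synthetic data and the choice of scikit-learn are assumptions made purely for
illustration and do not come from the cited healthcare studies.

# patient_risk.py - illustrative sketch only: synthetic data, hypothetical features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
# Hypothetical patient features: age, BMI, prior admissions in the last year.
age = rng.integers(20, 90, n)
bmi = rng.normal(27, 5, n)
prior_admissions = rng.poisson(0.5, n)
X = np.column_stack([age, bmi, prior_admissions])
# Synthetic "high risk" label loosely driven by the same features plus noise.
logit = 0.04 * (age - 55) + 0.08 * (bmi - 27) + 0.9 * prior_admissions
y = (1 / (1 + np.exp(-logit)) + rng.normal(0, 0.1, n) > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("held-out accuracy:", round(model.score(X_test, y_test), 2))
# Rank unseen patients by predicted risk so preventive-care outreach can be prioritised.
risk = model.predict_proba(X_test)[:, 1]
top = np.argsort(risk)[::-1][:5]
print("highest-risk test patients (index, probability):",
      [(int(i), round(float(risk[i]), 2)) for i in top])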

Areas in which enhanced data and analytics yield the greatest results
include: pinpointing patients who are the greatest consumers of health resources or at
the greatest risk for adverse outcomes; providing individuals with the information they
need to make informed decisions and more effectively manage their own health as well
as more easily adopt and track healthier behaviors; identifying treatments, programs and
processes that do not deliver demonstrable benefits or cost too much; reducing
readmissions by identifying environmental or lifestyle factors that increase risk or trigger
adverse events and adjusting treatment plans accordingly; improving outcomes by
examining vitals from at-home health monitors; managing population health by detecting
vulnerabilities within patient populations during disease outbreaks or disasters; and
bringing clinical, financial and operational data together to analyze resource utilization
productively and in real time.

The 4 Vs of big data analytics in healthcare

Like big data generally, big data analytics in healthcare is described by four primary
characteristics: volume, velocity, variety and, specific to healthcare, veracity. Over time,
health-related data will be created and accumulated continuously, resulting in an incredible
volume of data. The already daunting volume of existing healthcare data includes personal
medical records, radiology images, clinical trial data, FDA submissions, human genetics and
population data, genomic sequences, etc. Newer forms of big data, such as 3D imaging, genomics
and biometric sensor readings, are also fueling this exponential growth.

Findings of Big Data:


The digital universe will grow from 3.2 zettabytes to 40 zettabytes in only six years. Every day, we create 2.5
quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two
years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media
sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. The
volume of data created by U.S. companies alone each year is enough to fill ten thousand Libraries of Congress.

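As a quick back-of-the-envelope check of the growth figure quoted above (the arithmetic is mine, not
a figure from the cited source), growing from 3.2 to 40 zettabytes in six years implies roughly a 50%
compound annual growth rate:

# implied_growth.py - derive the growth rate implied by the 3.2 ZB -> 40 ZB forecast.
import math

start_zb, end_zb, years = 3.2, 40.0, 6
cagr = (end_zb / start_zb) ** (1 / years) - 1
doubling_years = math.log(2) / math.log(1 + cagr)

print(f"implied compound annual growth rate: {cagr:.1%}")          # about 52% per year
print(f"implied doubling time: about {doubling_years:.1f} years")  # about 1.6 years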
Zuckerberg noted that 1 billion pieces of content are shared via Facebook's Open Graph daily. Facebook adds
over 10 million photographs every hour, and around 3 billion "like" buttons are pushed every day. Google
processes more than 24 petabytes of data every day. 48 hours of video are uploaded to YouTube every minute,
resulting in nearly 8 years of content every day. 70% of data is created by individuals, but enterprises are
responsible for storing and managing 80% of it.

88% of data is ignored

According to a recent study by Forrester Research, most companies analyze a mere 12% of the data they have.
These firms may be missing out on data-driven insights hidden inside the 88% of data they're ignoring. A lack
of analytics tools and restrictive data silos are two reasons companies ignore a vast majority of their own
data, says Forrester, as well as the simple fact that it is often hard to know which information is valuable
and which is best left ignored.

Structured vs Unstructured Data

While classifying data, Tata Consultancy Services Limited (TCS) looked at how much of companies' data was
structured versus unstructured, as well as how much was generated internally versus externally. It found
that 51% of data is structured, 27% is unstructured and 21% is semi-structured. A much higher than anticipated
share of the data was not structured (that is, it was either unstructured or semi-structured), and a little
less than a quarter of the data was external.

Booming of jobs, but lack of talents

By 2015, 4.4 million IT jobs globally will be created to support big data, generating 1.9 million IT jobs in the
United States. Every big data-related role in the US will create employment for three people outside of IT, so
over the next four years a total of 6 million jobs in the U.S. will be generated by the information economy. The
challenge? There's not enough talent in the industry. By 2018, the United States alone could face a shortage of
140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the
know-how to use the analysis of big data to make effective decisions. According to Dice, a career site for tech
and engineering professionals, job postings for NoSQL experts were up 54% year over year, and those for big
data talent rose 46%, the site reported in April. Similarly, postings for Hadoop and Python pros were up 43%
and 16%, respectively.

Big Data means big money

According to Modis, a global IT staffing services provider, data scientists remain in high demand but short
supply, which translates into generous six-figure salaries for some PhDs with relevant big data experience.
According to a study by Burtch Works, the base salary for a staff data scientist is $120,000, and $160,000 for a
manager. The estimates are based on interviews with more than 170 data scientists from a Burtch Works
employment database.

However, the Data Scientists Salary Survey shows that data scientists' salaries in Europe and Asia are
significantly lower.

Quality of Data

More than half of IT leaders (57%) and IT professionals (52%) report they don't always know who owns the
data. If one doesn't know who owns the data, there is no one to hold accountable for its quality. As different
sources and varieties of data are fused together for big data projects, ensuring the accuracy and quality of the
data will be critical to success.
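To make the point about ownership and quality slightly more concrete, the sketch below runs a few simple
checks over a batch of fused records: unknown owner, missing values and duplicate identifiers. The record
structure and the rules are assumptions made only for illustration.

# quality_check.py - minimal sketch of pre-load data-quality checks on fused records.
from collections import Counter

records = [  # hypothetical records fused from two sources
    {"id": "r1", "owner": "billing", "value": 42.0},
    {"id": "r2", "owner": None,      "value": 13.5},   # nobody accountable for quality
    {"id": "r3", "owner": "claims",  "value": None},   # missing measurement
    {"id": "r3", "owner": "claims",  "value": 99.0},   # duplicate identifier
]

def quality_report(rows):
    """Return simple counts of the quality problems found in `rows`."""
    issues = Counter()
    ids = Counter(r["id"] for r in rows)
    issues["duplicate_id"] = sum(c - 1 for c in ids.values() if c > 1)
    issues["missing_owner"] = sum(1 for r in rows if not r.get("owner"))
    issues["missing_value"] = sum(1 for r in rows if r.get("value") is None)
    return dict(issues)

print(quality_report(records))
# e.g. {'duplicate_id': 1, 'missing_owner': 1, 'missing_value': 1}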

Big Data drives software growth

In its latest Worldwide Semiannual Software Tracker, International Data Corporation (IDC) predicts that the
worldwide software market will grow 5.9% year over year in current US dollars (USD). IDC believes that the
compound annual growth rate (CAGR) for the 2013-2018 forecast period will remain close to 6%. The average
2013-2018 CAGR for Asia/Pacific (excluding Japan), Latin America, and Central and Eastern Europe, the Middle
East, and Africa (CEMA) is 8.5%, while the average CAGR for the mature regions (North America, Western Europe,
and Japan) is 5.9%.

Visualization is in demand

Visualization is hot because it makes data analysis easier. According to the InformationWeek Business
Intelligence, Analytics and Information Management Survey, nearly half (45%) of the 414 respondents cited
ease-of-use challenges with complex software for less technically savvy employees as the second-biggest
barrier to adopting BI/analytics products. The online dating giant Match.com started using Tableau Software
because it wanted to put analysis capabilities "in the hands of our users, not elite analytics or BI experts."

Conclusion:

We have entered an era of Big Data. Through better analysis of the large volumes of data that are
becoming available, there is the potential for making faster advances in many scientific disciplines
and improving the profitability and success of many enterprises. However, many technical
challenges described in this paper must be addressed before this potential can be realized fully.
The challenges include not just the obvious issues of scale, but also heterogeneity, lack of
structure, error-handling, privacy, timeliness, provenance, and visualization, at all stages of the
analysis pipeline from data acquisition to result interpretation. These technical challenges are
common across a large variety of application domains, and therefore not cost-effective to address
in the context of one domain alone. Furthermore, these challenges will require transformative
solutions, and will not be addressed naturally by the next generation of industrial products. We
must support and encourage fundamental research towards addressing these technical challenges
if we are to achieve the promised benefits of Big Data.
True data-driven insight calls for domain expertise. For a communications service provider (CSP), this
means in-depth knowledge of how the network functions, what data to pull from the network's nodes and
OSS/BSS systems, and an understanding of how to connect data from multiple sources end-to-end to yield an enriched set
of information sources. This is what ultimately enables the creation of a range of services and user-
centric applications. Smarter networks, customer experience management, data brokering and
marketing are just some examples of what is possible.
A common, horizontal big data analytics platform is necessary to support a variety of analytics
applications. Such a platform analyzes incoming data in real time, makes correlations (guided by
domain expertise), produces insights and exposes those insights to various applications. This
approach both enhances the performance of each application and leverages the big data
investments across multiple applications. Storing and processing huge amounts of information is
no longer the issue. The challenge now is to know what needs to be done within the big data
analytics platform to create specific value.
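A toy sketch of the core of such a horizontal platform is shown below: events from two sources are
correlated per subscriber in (near) real time and the resulting insight is exposed in a form several
applications could consume. The event shapes, the sources and the alert rule are assumptions made for
illustration, not a description of any particular product.

# horizontal_platform.py - toy sketch of a shared streaming-correlation layer.
from collections import defaultdict

def correlate(events, error_threshold=3):
    """Join network and billing events per subscriber and expose simple insights
    that several downstream applications (care, marketing, ops) could consume."""
    per_subscriber = defaultdict(lambda: {"dropped_calls": 0, "billing_disputes": 0})
    for event in events:                      # in production this would be a live stream
        record = per_subscriber[event["subscriber"]]
        if event["source"] == "network" and event["type"] == "dropped_call":
            record["dropped_calls"] += 1
        elif event["source"] == "billing" and event["type"] == "dispute":
            record["billing_disputes"] += 1
    insights = []
    for subscriber, counts in per_subscriber.items():
        if counts["dropped_calls"] >= error_threshold and counts["billing_disputes"]:
            # Cross-domain correlation: poor network experience plus a billing complaint
            # is a churn-risk signal a customer-experience app could act on.
            insights.append({"subscriber": subscriber, "signal": "churn_risk", **counts})
    return insights

stream = [
    {"subscriber": "s1", "source": "network", "type": "dropped_call"},
    {"subscriber": "s1", "source": "network", "type": "dropped_call"},
    {"subscriber": "s1", "source": "network", "type": "dropped_call"},
    {"subscriber": "s1", "source": "billing", "type": "dispute"},
    {"subscriber": "s2", "source": "network", "type": "dropped_call"},
]
print(correlate(stream))   # flags subscriber s1 only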
While big data storage and processing techniques are necessary enablers, the goal must be the
creation of the right use cases. The big data tools and technologies deployed have to support the
process of finding insights that are adequate, accurate and actionable. Instead of talking about the
three Vs of big data, CSPs should therefore focus their attention on these three A's of big data.

References:

https://www.google.com/
https://www.wikipedia.org/
http://bigdata-madesimple.com/exciting-facts-and-findings-about-big-data/
https://www.accenture.com/hk-en/insight-big-data-research.aspx
http://www.sciencedirect.com/science/article/pii/S0925527314004253
http://www.oracle.com/technetwork/database/options/advanced-analytics/bigdataanalyticswpoaa-1930891.pdf
http://aisel.aisnet.org/cgi/viewcontent.cgi?article=3785&context=cais
http://www.slideshare.net/search/slideshow?searchfrom=header&q=big+data+anlysis&ud=any&ft=all&lang=**&sort=
http://www.slideshare.net/srikarasu/big-data-ppt
http://www.slideshare.net/Reports_Corner/big-data-and-telecom-analytics-market-business-case-market-analysis-forecasts-2014-2019-reports-corner?next_slideshow=1
https://www.google.com/?gws_rd=ssl#q=big+data+analysis
http://www.slideshare.net/ChiuYW/data-analysis-making-big-data-work?qid=84302611-29c6-42aa-aa06-59b99e5c88ca&v=&b=&from_search=1
