
BIG DATA

CS-585
Unit-1: Lecture 5
Big Data Analytics

Contents

• Challenges to Big Data Analytics
• Adoption and Architecture of Big Data
• Benefits of Big Data
• Barriers to Big Data


The Data Scientist

Hal Varian, McKinsey Quarterly, January 2009:


“The sexy job in the next ten years will be statisticians… The ability to take
data—to be able to understand it, to process it, to extract value from it, to
visualize it, to communicate it—that’s going to be a hugely important skill.”
Ref: http://www.mckinseyquarterly.com/Hal_Varian_on_how_the_Web_challenges_managers_2286

“The critical job in the next 20 years will be the analytic scientist … the
individual with the ability to understand a problem domain, to understand
and know what data to collect about it, to identify analytics to process that
data/information, to discover its meaning, and to extract knowledge from it—
that’s going to be a very critical skill.”
- Kaisler, Armour, Espinosa, Money (2014) Amended

For both roles:


Analytic scientists require advanced training in specific domains, data science
tools, multiple analytics, and visualization to perform predictive and
prescriptive analytics. They may hold Ph.D.’s, but pragmatic experience in a
domain will be equally important.

Copyright (except where referenced) 2014-2016 Stephen H. Kaisler, Frank Armour, Alberto Espinosa and William H. Money
Some Big Data Issues Affecting Analytics

•Volume:
–How much data is really relevant to the problem solution? Cost of processing?
–So, can you really afford to store and process all that data?
•Velocity:
–Much data coming in at high speed
–Need for streaming versus block approach to data analysis
–So, how to analyze data in-flight and combine with data at-rest
•Variety:
–A small fraction is in structured formats (relational, XML, etc.)
–A fair amount is semi-structured, such as web logs
–The rest of the data is unstructured: text, photographs, etc.
–So, no single data model can currently handle the diversity
•Veracity: cover term for …
–Accuracy, Precision, Reliability, Integrity
–So, what is it that you don’t know you don’t know about the data?
•Value:
–How much value is created for each unit of data (whatever it is)?
–So, what is the contribution of subsets of the data to the problem solution?

Source: https://www.elderresearch.com/blog/42-v-of-big-data
Types of Analytics

•Descriptive: A set of techniques for reviewing and examining the data set(s) to
understand the data and analyze business performance.
•Diagnostic: A set of techniques for determining what has happened and why
•Predictive: A set of techniques that analyze current and historical data to
determine what is most likely to (not) happen
•Prescriptive: A set of techniques for computationally developing and
analyzing alternatives that can become courses of action – either tactical or
strategic – that may discover the unexpected
•Decisive: A set of techniques for visualizing information and recommending
courses of action to facilitate human decision-making when presented with a
set of alternatives.

            Passive        Active
Deductive   Descriptive    Diagnostic
Inductive   Predictive     Prescriptive

Descriptive Analytics

•Process (a minimal code sketch follows this list):
–Identify the attributes, then assess/evaluate the attributes
–Estimate the magnitude of each attribute's contribution to the final solution
–Accumulate more instances of data from the data sources
–If possible, perform the steps of evaluation, classification and categorization quickly
–Yield a measure of adaptability within the OODA loop
•At some threshold, cross over into diagnostic and predictive analytics
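
As a minimal, hypothetical illustration of the first steps above (identifying attributes and estimating their relative contribution), a pandas-style sketch might look like the following. The CSV file name and the numeric "outcome" column are assumptions, not part of the original material.

```python
# Minimal descriptive-analytics sketch (illustrative only).
# The CSV file and the numeric "outcome" column are hypothetical.
import pandas as pd

df = pd.read_csv("observations.csv")        # accumulate instances from a data source
print(df.describe(include="all"))           # summary statistics for every attribute
print(df.dtypes)                            # identify/classify the attributes

# Estimate each numeric attribute's association with the outcome of interest
numeric = df.select_dtypes("number")
print(numeric.corr()["outcome"].sort_values(ascending=False))
```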

Diagnostic Analytics

•Process:
–Begin with descriptive analytics
–Extract patterns from large data quantities via data mining
–Correlate data types for explanation of near-term behavior – past and present
–Estimate linear/non-linear behavior not easily identifiable through other
approaches.
•Example (sketched below): by classifying past insurance claims, estimate which
future claims to flag for investigation as having a high probability of being
fraudulent.
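
The insurance example could be sketched roughly as follows with scikit-learn. The file names, the "fraud" label column, the numeric features, and the 0.8 threshold are all assumptions made for illustration; this is a sketch, not the method the slides prescribe.

```python
# Illustrative sketch of the insurance example: learn from labelled past claims,
# then flag new claims with a high estimated probability of fraud.
# File names, the "fraud" label column, and the 0.8 threshold are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

past = pd.read_csv("past_claims.csv")                 # historical, labelled claims
X, y = past.drop(columns=["fraud"]), past["fraud"]    # numeric features assumed

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

new = pd.read_csv("new_claims.csv")                   # same feature columns, no label
p_fraud = model.predict_proba(new)[:, 1]              # estimated probability of fraud
print((p_fraud > 0.8).sum(), "claims flagged for investigation")
```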

Predictive Analytics

•Process:
–Begin with descriptive AND diagnostic analytics
–Choose the right data based on domain knowledge and relationships among variables
–Choose the right techniques to yield insight into possible outcomes
–Determine the likelihood of possible outcomes given initial boundary conditions (a short forecasting sketch follows this list)
–Remember! Data-driven analytics is non-linear; do NOT treat it like an engineering project
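
As a deliberately simple, hypothetical illustration of "determine the likelihood of possible outcomes", the sketch below fits a linear trend to historical observations and projects it forward. The numbers and the 3-period horizon are made up.

```python
# Minimal predictive sketch: fit a simple trend to historical observations and
# project it forward. The numbers and the 3-period horizon are made up.
import numpy as np

history = np.array([112, 118, 121, 130, 137, 145])    # e.g. demand per period (assumed)
t = np.arange(len(history))

slope, intercept = np.polyfit(t, history, deg=1)      # choose a (deliberately simple) technique
future_t = np.arange(len(history), len(history) + 3)
forecast = slope * future_t + intercept               # likely outcomes for the next periods
print("next 3 periods:", forecast.round(1))
```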

Prescriptive Analytics

•Process:
–Begin w/ predictive analytics
–Determine what should occur and how to make it so
–Determine the mitigating factors that lead to desirable/undesirable outcomes
–“What-if” analysis w/ local or global optimization
–Ex: Find the best set of prices and advertising frequencies to maximize revenue (a toy what-if sketch follows)
–Ex: And the right set of business moves to make to achieve that goal

“Make it so”
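
A toy "what-if" sketch for the pricing example above: enumerate candidate prices and weekly ad frequencies and keep the combination that maximizes revenue. The demand model and the cost figure are hypothetical placeholders, not a real market model.

```python
# Toy "what-if" sketch for the pricing example: enumerate candidate prices and
# weekly ad frequencies and keep the combination that maximises revenue.
# The demand model and cost figure are hypothetical placeholders.
import itertools

def demand(price, ads_per_week):
    return max(0.0, 500 - 30 * price + 40 * ads_per_week)     # assumed market response

def net_revenue(price, ads_per_week, cost_per_ad=50):
    return price * demand(price, ads_per_week) - cost_per_ad * ads_per_week

best = max(itertools.product(range(5, 21), range(0, 11)),     # candidate prices and ad counts
           key=lambda combo: net_revenue(*combo))
print("best (price, ads/week):", best, "-> revenue:", net_revenue(*best))
```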

Decisive Analytics

•Process:
–Given a set of decision alternatives, choose the one course of action to take from possibly many
–But it may not be the optimal one
–Visualize alternatives – whole or partial subset
–Perform exploratory analysis – what-if and why
•How do I get there from here?
•How did I get here from there?

The Role of Analytics

•“Tools and techniques that gear the analyst’s mind to apply higher
levels of critical thinking can substantially improve analysis…
structuring information, challenging assumptions, and exploring
alternative interpretations.”
Richards Heuer, Jr., “The Psychology of Intelligence Analysis”

•Beware Frege's Caution:
–Converse problems:
•If you focus on the details, you lose the overview
•If you focus on the overview, you don't see the details
–Problem with data mining:
•Applying statistics to understand trends causes a loss of grounding in the data
Analytics Is About Discovery

Novelty Discovery
–Finding new, rare, one-in-a-[million / billion / trillion/ etc.] objects
and events
Class Discovery
–Finding new classes of objects and behaviors
–Learning the rules that constrain class boundaries
Association Discovery
–Finding unusual (improbable) co-occurring associations
Correlation Discovery
–Finding patterns and dependencies, which reveal new natural
laws or new scientific principles
Ref: Kirk Borne, Dynamic Events in Massive Data Streams, GMU

The Goal of Analytics

From sensors (data collection, measurement, observation, …)


to Monitoring and Alerting
to Sensemaking (Data and Analytics Science)
to Cents-Making (Getting to ROI!!)

Adapted from: Kirk Borne, Dynamic Events in Massive Data Streams, GMU
The New Analytic Paradigm

#1: You will be expected to do something with information


#2: There really is more to know
#3: You will have to know more about knowing
#4: Brain science and decision science are converging
#5: The environment is changing our brain
#6: Information management is the essence of leadership
#7: A more connected world means much more data is
available (and accessible)
#8: Math matters (but so do logic and rules)
#9: There are significant downsides to not knowing
#10: Knowing can change the world

Source: Thornton May, The New Know: Innovation Powered by Analytics, 2009
Analytics Challenges

Finding A Needle in a Haystack
•What if the “needle” happens to be a complex data structure?
–Brute force search and computation are unlikely to succeed due to inefficiency
–Complexity increases with streaming data as opposed to a static data set

•Absence of evidence (so far) is not evidence of absence! (Borne 2013)

•What preprocessing do we need to do before searching?


–Quality vs. Quantity: What data are required to satisfy the given value proposition?
–At what precision, accuracy, and reliability?

•What if the needle must be derived rather than found?


–How do we track the provenance of the derived data/information?
–Is the process repeatable as we change algorithms and data structures?

Challenge: Consider finding the few packets in the millions (er, tens of
billions) flowing through a network that carry a virus or malware.

Network Forensics

•Networks have become exponentially faster. They carry more traffic and more
types of data than ever before. Yet as they get faster, they become more
difficult to monitor and analyze.
–40G Networks
–Richer Data: VOIP as the telephony standard
–Malicious security threats are more subtle
•Problems:
–Finding proof of a security attack
–Troubleshooting intermittent performance issues
–Identifying the source of data leaks
–Troubleshooting VOIP and Video over VOIP
•Network forensics must be:
–Precise: capture high-speed packets without droppage
–Scalable: extend to new network technologies and speeds
–Flexible: adapt to heterogeneous network segments
–VOIP-Smart: reconstruct & replay VoIP calls; present Call Detail Records (CDR) for each call
–Continuously available: run 24/7 with adequate storage; support real-time analysis
Finding the Knees

•The knee of an algorithm or analytic is the scale value at which performance begins to degrade as larger data volumes are processed.
–Every analytic method and algorithm can have one (or more?)
–Where positive slope increases begin to flatten out
–Where positive or flat slopes transition to negative slopes
•Factors affecting the knee:
–data structure, volume, and variety
–algorithm complexity and implementation, and
–infrastructure implementation.
•What is/are the corollaries for non-algorithmic analytics? (A toy measurement sketch follows.)
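
One crude, purely illustrative way to locate a knee empirically: time an analytic at growing data volumes and report the size at which the per-record cost jumps the most. Sorting stands in for the analytic and the sizes are arbitrary; this is a sketch, not a benchmarking methodology.

```python
# Toy sketch: time an analytic at growing data volumes and report the size at
# which the per-record cost jumps the most - a crude empirical "knee".
# Sorting stands in for the analytic; sizes are arbitrary.
import time

def find_knee(analytic, sizes):
    per_record = []
    for n in sizes:
        data = list(range(n, 0, -1))                   # synthetic workload of size n
        start = time.perf_counter()
        analytic(data)
        per_record.append((time.perf_counter() - start) / n)
    jumps = [b - a for a, b in zip(per_record, per_record[1:])]
    return sizes[jumps.index(max(jumps)) + 1]          # size with the largest cost increase

print("approximate knee at n =", find_knee(sorted, [10_000, 100_000, 1_000_000, 5_000_000]))
```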

Finding the Tipping Point

•A tipping point is one in which change in a system becomes potentially irreversible and maybe even unstoppable.
–May be associated with negative or positive effects
–In social systems, a buildup to a critical mass at which point a seminal change occurs
–Ex: MySpace was a formidable competitor of Facebook, but once Facebook membership reached its "tipping point", people started abandoning MySpace and signing up for Facebook.

• Small events can create ripple effects – may be linear or non-linear, chaotic or perturbative
• Concept of emerging trends in the commercial marketplace
• The explosion of a viral infection into an epidemic

Ref: Choucri, N., et al. (2006) Understanding and Modeling State stability: Exploiting System Dynamics.
MIT Sloan Research Papers, No. 4574-06, Jan. 2006.

Spinning Straw Into Gold

•With (all) the data available, describe a situation in a generalized form such
that predictions for future events and prescriptions for courses of actions can
be made.
•Objective: Identify one or more patterns that characterize the behavior of the
system.
•Remember: All data has value to someone, but not all data has value to
everyone.

•Patterns may be unknown or ambiguously defined.
•Patterns may be morphing over time.
•The problem is sensemaking: the dual process of trying to fit data to a frame or model and of fitting a frame around the data.
•Neither data nor frame comes first!
•They must evolve concurrently!

What Are Grand Challenges?

•Definition: A specific scientific or technological innovation that would remove a critical barrier to solving an important domain problem, with a high likelihood of global impact and feasibility.
–Provide scope for engineering ambition to build something that has never been seen before.
–Generally comprehensible, and capture the imagination of the general public, as well as the esteem of scientists in other disciplines.
–Go beyond what is initially possible, and require development of understanding, techniques and tools unknown at the start of the project.
–Since these first appeared in the 1980s, they abound in every science and discipline!
•Not just a restatement of the many "big problems" facing the world today
–Are they equivalent to Wicked Problems? Or are Grand Challenges Wicked Problems?
•A tool for focusing investigators working towards overcoming one or more bottlenecks on a foreseeable path toward a solution to significant domain problems.

Some Grand Challenges

•Modeling Our Planet's Systems:
–Assessing global warming and determining mitigating actions
•Confronting Existential Risk:
–What is the impact of a dangerous genetically modified pathogen?
•Exploring Transhumanism:
–What is the impact of embedded nanotechnology, genetic therapy, and "smart" prosthetics?
•The Singularity?
–What happens when systems approach the level of human intelligence? Emotional intelligence?
•Dealing Effectively with Globalism:
–Modeling the interconnectedness of human societies/organizations

Ref: Martin, J. "The Meaning of the 21st Century: A Vital Blueprint for Ensuring Our Future“, Jan 2007

Where is the ROI?

•ROI (Return on Investment) is not always immediately obvious


•Results of analytics may be available only after years of following the
prescription
•Requires long-term effort(s) to develop a sustainable capability
•Examples:
–Health: moving from predictive to preventative health care
–Health: enabling personalized medicine for shortened time to value
–Health: recognizing and predicting the spread of infectious diseases (Ebola)
–Crime: aggressively recognizing and combating syndicated, multiparty fraud
online
–Crime: predicting potential crime locales and time to preventatively deploy
police
–Environment: predicting weather, floods, earthquakes, volcanic eruptions
earlier
–Computer Security: surveying systems to predict potential for attacks

Is this …?

Oops! We mean the Exabyte Age!

Big Data Adoption

As part of advancing digitization, many enterprises feel the need to explore the possibilities big data may provide for their business.

However, only a few companies use big data applications productively, despite its high expected potential.

How companies examine the possibilities of big data is therefore a highly interesting and relevant question.

Against this background, a growing number of companies are investing in big data looking for competitive advantages (Constantiou and Kallinikos, 2015).

Nevertheless, companies seem to have difficulties with the productive implementation of big data applications.

According to a Gartner study, only 14% of enterprises have put big data projects into production (Kart, 2015).

The important questions here are: Why? And how?


Big Data Adoption Process
[Process flow: Application Requirements → Cluster Design → Hardware Architecture → Network Architecture → Storage Architecture → Info-Security Architecture]

●Cluster Design: Application requirements are analyzed in terms of workload, volume and other associated parameters, based on which the cluster is designed. Cluster design is not an iterative process: the initial setup is verified and validated with a sample application and sample data before being rolled out. Although Big Data cluster design allows flexibility in fine-tuning the configuration parameters, the layered number of parameters and their cross-impact introduce additional complexity.

●Hardware Architecture: The key success factor for Hadoop clusters is the use of commodity equipment. Most Hadoop users are cost conscious, and as clusters grow, their cost can be significant. In the present scenario, the hardware architecture requirements for the NameNode are higher RAM and moderate HDD; if the JobTracker is a physically separate server, it will have higher RAM and CPU speed. DataNodes are standard low-end server-class machines.

●Network Architecture: Currently, the network architecture is not specifically designed for Big Data; that is, inputs from cluster design and application requirements are not always mapped to it. The standard network setup within the existing data center is used as the backbone. In most cases this may result in an overestimated network deployment and at times has a negative effect on the MapReduce data processing algorithms. Hence, there is significant scope for creating concrete guidelines for designing network architecture for Big Data.

●Storage Architecture: Most enterprises have huge investments in NAS and SAN devices. When implementing Big Data, they attempt to re-use this existing storage infrastructure even though DAS is the recommended storage for Big Data clusters. Parameters like the type of disk and shared-nothing versus shared-something are often not taken into account.

●Information Security Architecture: A general examination of different Big Data implementations shows that security features are sparse and aftermarket security offerings are not fully tailored to these clusters. Findings show these deployments to be largely insecure and wholly reliant on network and perimeter security support.

https://www.happiestminds.com/whitepapers/big-data-infrastructure-considerations.pdf
Recommended Adoption Process: Big Data Architectures
• Big Data architecture is the combination of tools and technologies used to accomplish the whole of the task.
• An ideal big data architecture would be resilient, cost-effective, secure, and adaptive to the new needs of the environment.
• This can be achieved by beginning with a proven architecture and creatively and progressively restructuring it as additional needs and problems arise.
• It should ultimately align with the architecture of the organization/universe itself.
• Types:
  • Generic Layered Architecture
  • Traditional Architecture
  • Streaming Analytics Architecture for Big Data
  • Lambda Architecture
  • Kappa Architecture
Generic Layered Architecture: Functions of Big Data Architecture Layers

1. Data Sources: Consists of disparate data sources, which range from streaming sensor data to structured information in relational databases. This layer also helps to sort the unstructured and semi-structured data.

2. Integration Processes: This layer acquires the data and integrates the data sets into a uniform format. It performs the necessary pre-processing/filtering operations.

3. Data Storage: This layer consists of a pool of resources – distributed file systems, RDF stores, NoSQL and NewSQL databases. The resources are suitable for the storage of a large number of datasets.

4. Analytical and Computing Models: This layer encapsulates the different data-processing tools, e.g. MapReduce (a minimal sketch follows this list). It runs on the storage resources and includes the data management and programming model.

5. Presentation: This layer enables the visualization technologies. It tends to meet infrastructure requirements like cost-effectiveness, elasticity and the ability to scale up or down.
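
To make layer 4 concrete, here is a minimal single-process sketch of the MapReduce programming model it refers to. On a real cluster the map and reduce phases would run distributed over the storage layer (e.g. HDFS) rather than in one Python process; the input lines are made up.

```python
# Minimal single-process illustration of the MapReduce programming model (word count).
# On a real cluster the map and reduce phases run distributed over the storage layer.
from collections import defaultdict

def map_phase(records):
    for line in records:
        for word in line.split():
            yield word.lower(), 1          # emit (key, value) pairs

def reduce_phase(pairs):
    totals = defaultdict(int)
    for key, value in pairs:               # shuffle: group by key
        totals[key] += value               # reduce: sum the values for each key
    return dict(totals)

print(reduce_phase(map_phase(["Big Data is big", "data at rest"])))
```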
Big Data Is Still a Work in Progress
• Choosing the right architecture is key for any (big data) project
• Big Data is still quite a young field, so there are no standard architectures available that have been used for years
• In the past few years, a few architectures have evolved and have been discussed online
• Know the use cases before choosing your architecture
• Having one or a few reference architectures can help in choosing the right components
NIST Big Data Architecture

Important properties for choosing a big data architecture:
• Latency
• Keep raw and un-interpreted data "forever"?
• Volume, Velocity, Variety, Veracity
• Ad-hoc query capabilities needed?
• Robustness & fault tolerance
• Scalability
• ...

Traditional Architecture for Big Data
Apache Spark: The new kid on the block

Apache Spark is a fast and general engine for large-scale data processing
– The hot trend in Big Data
– Originally developed in 2009 in UC Berkeley's AMPLab
– Can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
– One of the largest OSS communities in big data, with over 200 contributors in 50+ organizations
– Open-sourced in 2010; part of the Apache Software Foundation since 2014; supported by many vendors (a minimal sketch follows)
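
A minimal PySpark sketch of the kind of job Spark accelerates: load a CSV and run a distributed aggregation. The file path and the "category" column are placeholders, and running it assumes a local Spark installation.

```python
# Minimal PySpark sketch: load a CSV and run a distributed aggregation.
# The file path and the "category" column are placeholders; requires a Spark install.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-demo").getOrCreate()

events = spark.read.csv("events.csv", header=True, inferSchema=True)
(events.groupBy("category")                # executed in parallel across the cluster
       .agg(F.count("*").alias("n"))
       .orderBy(F.desc("n"))
       .show())

spark.stop()
```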
Motivation: Why Apache Spark
Apache Spark Ecosystem
Spark Ecosystem: Technology Mapping

Limitations of the Traditional Architecture for Big Data
●Batch processing; not for low-latency use cases
●Spark can speed things up, but if positioned as an alternative to Hadoop MapReduce, it is still batch processing
●The Spark ecosystem offers a lot of additional advanced analytic capabilities (machine learning, graph processing, ...)
Streaming Analytics Architecture for Big Data
Also known as Complex Event Processing

Streaming Analytics Technology Mapping

Advantages:
●The solution for low-latency use cases
●Process each event separately => low latency
●Process events in micro-batches => increases latency but offers better reliability (a minimal micro-batch sketch follows)
●Previously known as "Complex Event Processing"
●Keep the data moving / Data in Motion instead of Data at Rest => raw events are (often) not stored
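
One way such a micro-batch pipeline might look, sketched with Spark Structured Streaming: read lines from a socket and keep a running word count. The host/port are placeholders and this assumes a local Spark installation; it is an illustration of the micro-batch style, not a prescribed architecture.

```python
# Micro-batch streaming sketch using Spark Structured Streaming.
# Reads lines from a socket (host/port are placeholders) and keeps a running word count.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

lines = (spark.readStream.format("socket")
              .option("host", "localhost")
              .option("port", 9999)
              .load())

words = lines.select(F.explode(F.split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()               # state carried across micro-batches

query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```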


Lambda Architecture for Big Data

Use case: social media and social network analysis

Pros and Cons:
●Combines (Big) Data at Rest with (Fast) Data in Motion
●Closes the gap left by high-latency batch processing
●Keeps the raw information forever
●Makes it possible to rerun analytics on the whole data set if necessary
  • because the old run had an error, or
  • because we have found a better algorithm we want to apply
●Have to implement functionality twice (a toy serving-layer sketch follows):
  • once for batch
  • once for real-time streaming
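
One way to picture the "implement twice" cost: the serving layer answers queries by merging a batch view with a speed-layer view produced by separate code paths. A toy sketch, with in-memory dictionaries standing in for the two views (all values made up):

```python
# Toy illustration of a Lambda serving layer: answers are the batch view plus the
# speed-layer view computed since the last batch run. Both views are made up.
batch_view = {"user_42": 1_250, "user_7": 310}     # precomputed by the batch layer
speed_view = {"user_42": 17, "user_99": 3}         # maintained by the streaming layer

def serve(key):
    return batch_view.get(key, 0) + speed_view.get(key, 0)

print(serve("user_42"))   # 1267: batch result plus real-time increments
print(serve("user_99"))   # 3: only seen by the speed layer so far
```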
Kappa Architecture for Big Data

Pros and Cons:
●Today, stream-processing infrastructures are as scalable as Big Data batch-processing architectures; some use the same base infrastructure, i.e. Hadoop YARN
●Only implement the processing / analytics logic once
●Can replay historical events out of a historical (raw) event store, provided by either the Messaging or Raw Data (Reservoir) component
●Updates of processing logic / event replay are handled by deploying the new version of the logic in parallel to the old one; the new logic reprocesses events until it has caught up with the current events, and then the old version can be decommissioned (a toy replay sketch follows)
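
A toy sketch of the replay idea: the new processing version rebuilds its state by re-reading the raw event log before the old version is retired. The event log and both processing versions are made up for illustration.

```python
# Toy Kappa-style upgrade: the new processing version rebuilds its state by replaying
# the raw event log, then the old version can be decommissioned. All data is made up.
event_log = [("click", 1), ("click", 1), ("purchase", 20), ("click", 1)]  # append-only

def process_v1(events):                    # old logic: count all events
    return sum(1 for _ in events)

def process_v2(events):                    # new logic: total value per event type
    totals = {}
    for kind, value in events:
        totals[kind] = totals.get(kind, 0) + value
    return totals

print("v1 (to be decommissioned):", process_v1(event_log))
print("v2 (rebuilt via replay):", process_v2(event_log))
```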
Benefits of Big Data
• Cost saving
• Time reduction in data processing
• New product development
• Understanding market conditions
• Controlling online reputation


Challenges/Barriers to Big Data

Infrastructure Cost
•It's difficult to project the cost of a big data project, and given how quickly such projects scale, they can quickly eat up resources. The challenge lies in taking into account all costs of the project, from acquiring new hardware, to paying a cloud provider, to hiring additional personnel. Businesses pursuing on-premises projects must remember the cost of training, maintenance and expansion. Big data projects in the cloud must carefully evaluate the service-level agreement with the provider to determine how usage will be billed and whether there will be any additional fees.

Manpower Requirements
•Businesses are feeling the data talent shortage. Not only is there a shortage of data scientists, but successfully implementing a big data project requires a sophisticated team of developers, data scientists and analysts who also have sufficient domain knowledge to identify valuable insights. Many big data vendors seek to overcome this challenge by providing their own educational resources or by providing the bulk of the management.

Security of the Data
•Keeping that vast lake of data secure is another big data challenge. Specific challenges include:
•User authentication for every team and team member accessing the data
•Restricting access based on a user's need
•Recording data access histories and meeting other compliance regulations
•Proper use of encryption on data in transit and at rest

Data Quality
•Data quality is not a new concern, but the ability to store every piece of data a business produces in its original form compounds the problem. Dirty data costs companies in the United States $600 billion every year. Common causes of dirty data that must be addressed include user input errors, duplicate data and incorrect data linking. In addition to being meticulous about maintaining and cleaning data, big data algorithms can also be used to help clean data (a small cleaning sketch follows).
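
A small pandas sketch of the routine cleaning steps mentioned above (duplicate removal, a crude input-error check, and normalization before record linking). The file name, columns, and validity range are assumptions made for illustration.

```python
# Minimal data-quality sketch: drop duplicates, filter obvious input errors, and
# normalise a field before linking records. File and column names are placeholders.
import pandas as pd

df = pd.read_csv("customers.csv")

df = df.drop_duplicates()                              # duplicate data
df = df[df["age"].between(0, 120)]                     # crude user-input error check
df["email"] = df["email"].str.strip().str.lower()      # normalise before record linking

print(len(df), "clean rows remain")
```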
References
• IBM ICE notes from Tekstac
• https://www.happiestminds.com/whitepapers/big-data-infrastructure-considerations.pdf
• Apache Foundation, Hadoop overview, http://hadoop.apache.org/
• MapReduce.org, information about the MapReduce framework, http://www.mapreduce.org/
• Forbes, Ten Properties of the Perfect Big Data Storage Architecture, http://www.forbes.com/sites/danwoods/2012/07/23/ten-properties-of-the-perfect-big-data-storage-architecture/2/
• Securosis, Securing Big Data: Security Recommendations for Hadoop and NoSQL Environments, https://securosis.com/Research/Publication/securing-big-data-security-recommendations-for-hadoop-and-nosql-environment
• Forbes, Big Data Meets Cloud, http://www.forbes.com/sites/forrester/2012/08/15/big-data-meets-cloud/
• Virginia State University, Evaluating MapReduce System Performance: A Simulation Approach, http://scholar.lib.vt.edu/theses/available/etd-08282012-152556/unrestricted/Wang_G_D_2012.pdf
• https://www.bigdataframework.org/big-data-architecture/
• https://docs.microsoft.com/en-us/azure/architecture/guide/architecture-styles/big-data
● Thank you
