CS-585
Unit-1: Lecture 5
Big Data Analytics
“The critical job in the next 20 years will be the analytic scientist … the
individual with the ability to understand a problem domain, to understand
and know what data to collect about it, to identify analytics to process that
data/information, to discover its meaning, and to extract knowledge from it—
that’s going to be a very critical skill.”
- Kaisler, Armour, Espinosa, Money (2014) Amended
•Volume:
–How much data is really relevant to the problem solution? Cost of processing?
–So, can you really afford to store and process all that data?
•Velocity:
–Much data coming in at high speed
–Need for streaming versus block approach to data analysis
–So, how to analyze data in-flight and combine with data at-rest
•Variety:
–A small fraction is in structured formats: relational, XML, etc.
–A fair amount is semi-structured, such as web logs
–The rest of the data is unstructured text, photographs, etc.
–So, no single data model can currently handle the diversity
•Veracity: cover term for …
–Accuracy, Precision, Reliability, Integrity
–So, what is it that you don’t know you don’t know about the data?
•Value:
–How much value is created for each unit of data (whatever it is)?
–So, what is the contribution of subsets of the data to the problem solution?
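The Velocity point above – analyzing data in-flight and combining it with data at-rest – can be sketched in a few lines. This is a toy illustration with hypothetical sensor data: the "at-rest" side is a reference table loaded once, the "in-flight" side is processed record by record with only a small sliding window kept in memory.

```python
# Sketch: combine data "in-flight" (a stream) with data "at-rest"
# (a reference table); sensor names and offsets are hypothetical.
from statistics import mean

# Data at rest: a reference table loaded once (toy calibration offsets).
at_rest = {"sensor-1": 0.5, "sensor-2": -0.2}

def analyze_in_flight(stream):
    """Process each record as it arrives instead of storing the whole block."""
    window = []
    for sensor_id, raw_value in stream:
        corrected = raw_value + at_rest.get(sensor_id, 0.0)  # join with at-rest data
        window.append(corrected)
        if len(window) > 3:          # keep only a sliding window, not all history
            window.pop(0)
        yield sensor_id, round(mean(window), 3)

readings = [("sensor-1", 1.0), ("sensor-2", 2.0), ("sensor-1", 3.0)]
results = list(analyze_in_flight(readings))
```

The point of the sketch is the memory profile: the stream is never materialized, so the same loop works whether three records arrive or three billion.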
•Descriptive: A set of techniques for reviewing and examining the data set(s) to
understand the data and analyze business performance.
•Diagnostic: A set of techniques for determining what has happened and why
•Predictive: A set of techniques that analyze current and historical data to
determine what is most likely to (not) happen
•Prescriptive: A set of techniques for computationally developing and
analyzing alternatives that can become courses of action – either tactical or
strategic – that may discover the unexpected
•Decisive: A set of techniques for visualizing information and recommending
courses of action to facilitate human decision-making when presented with a
set of alternatives.
            Passive       Active
Deductive   Descriptive   Diagnostic
Inductive   Predictive    Prescriptive
Descriptive Analytics
•Process:
–Identify the attributes, then assess/evaluate the attributes
–Estimate the magnitude of each attribute's relative contribution to the final
solution
–Accumulate more instances of data from the data sources
–If possible, perform the steps of evaluation, classification and categorization quickly
–Yield a measure of adaptability within the OODA loop
•At some threshold, crossover into diagnostic and predictive analytics
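The descriptive steps above – identify attributes, evaluate them, and estimate each attribute's contribution – can be sketched with a toy data set. All attribute names and values here are made up; the contribution estimate uses the Pearson correlation of each attribute with the outcome as a simple proxy.

```python
# Minimal descriptive-analytics sketch over a hypothetical data set:
# summarize each attribute and correlate it with the outcome.
from statistics import mean, stdev

rows = [  # toy records: (attribute_a, attribute_b, outcome)
    (1.0, 9.0, 2.0),
    (2.0, 7.5, 4.1),
    (3.0, 8.2, 5.9),
    (4.0, 6.9, 8.1),
]

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))

outcome = [r[2] for r in rows]
for name, idx in (("attribute_a", 0), ("attribute_b", 1)):
    col = [r[idx] for r in rows]
    print(name, "mean:", round(mean(col), 2),
          "correlation with outcome:", round(pearson(col, outcome), 2))
```

A strongly correlated attribute is a candidate for the diagnostic and predictive stages; a near-zero one may not justify its storage and processing cost (the Volume question above).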
Diagnostic Analytics
•Process:
–Begin with descriptive analytics
–Extract patterns from large data quantities via data mining
–Correlate data types for explanation of near-term behavior – past and present
–Estimate linear/non-linear behavior not easily identifiable through other
approaches.
•Example: by classifying past insurance claims, estimate the number of
future claims to flag for investigation with a high probability of being
fraudulent.
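The insurance example above can be sketched with a deliberately simple classifier: learn per-category fraud rates from past labeled claims, then flag new claims whose estimated probability of fraud crosses a threshold. The categories and history are hypothetical; a real system would use far richer features.

```python
# Toy claims classifier: historical fraud rate per category as the score.
from collections import defaultdict

past_claims = [  # (category, was_fraud) -- made-up history
    ("theft", True), ("theft", True), ("theft", False),
    ("storm", False), ("storm", False), ("storm", True), ("storm", False),
]

def fraud_rates(history):
    counts = defaultdict(lambda: [0, 0])       # category -> [fraud, total]
    for category, was_fraud in history:
        counts[category][0] += int(was_fraud)
        counts[category][1] += 1
    return {c: f / t for c, (f, t) in counts.items()}

def flag(new_claims, rates, threshold=0.5):
    """Flag incoming claims whose category's past fraud rate is high."""
    return [c for c in new_claims if rates.get(c, 0.0) >= threshold]

rates = fraud_rates(past_claims)               # theft: 2/3, storm: 1/4
flagged = flag(["theft", "storm", "theft"], rates)
```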
Predictive Analytics
•Process:
–Begin with descriptive AND diagnostic analytics
–Choose the right data based on domain knowledge and relationships among variables
–Choose the right techniques to yield insight into possible outcomes
–Determine the likelihood of possible outcomes given initial boundary conditions
–Remember! Data driven analytics is non-linear; do NOT treat like an engineering project
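The "likelihood of possible outcomes given initial boundary conditions" step can be sketched as a Monte Carlo estimate. The scenario here is invented for illustration: given an initial stock level and a hypothetical daily-demand model, how likely is a stock-out over the next 30 days?

```python
# Monte Carlo sketch: probability of a stock-out under a toy demand model.
import random

def stockout_probability(initial_stock, daily_mean, days=30, trials=10_000, seed=42):
    rng = random.Random(seed)                  # fixed seed: repeatable estimate
    stockouts = 0
    for _ in range(trials):
        stock = initial_stock
        for _ in range(days):
            stock -= rng.gauss(daily_mean, daily_mean / 4)  # simulated demand
            if stock <= 0:
                stockouts += 1
                break
    return stockouts / trials

p_low = stockout_probability(initial_stock=250, daily_mean=10)   # tight stock
p_high = stockout_probability(initial_stock=400, daily_mean=10)  # ample stock
```

Changing the boundary conditions (initial stock, demand distribution) and re-running gives the distribution of outcomes rather than a single point forecast, which is exactly the non-linear, iterative character the slide warns about.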
Prescriptive Analytics
•Process:
–Begin w/ predictive analytics
–Determine what should occur and how to make it so
–Determine the mitigating factors that lead to desirable/undesirable outcomes
–“What-if” analysis w/ local or global optimization
–Ex: Find the best set of prices and advertising frequency to maximize revenue
–Ex: And, the right set of business moves to make to achieve that goal
“Make it so”
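The pricing example above can be sketched as a "what-if" grid search. Everything in the demand model is invented for illustration (the coefficients are not from the source); the point is the prescriptive shape of the computation: enumerate candidate actions, evaluate each under the model, pick the best.

```python
# Prescriptive sketch: grid search over price and ad frequency to
# maximize net revenue under a hypothetical linear demand model.
def demand(price, ads_per_week):
    return max(0.0, 1000 - 40 * price + 25 * ads_per_week)

def best_plan(prices, ad_levels, ad_cost=50.0):
    """Evaluate every (price, ads) combination; return the best plan."""
    plans = (
        (price * demand(price, ads) - ad_cost * ads, price, ads)
        for price in prices
        for ads in ad_levels
    )
    return max(plans)   # (net revenue, price, ad frequency)

net, price, ads = best_plan(prices=[5, 10, 15, 20], ad_levels=[0, 2, 4])
```

A grid search is local optimization over an enumerated space; for continuous or high-dimensional decision spaces the same pattern would hand the model to a global optimizer instead.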
Decisive Analytics
•Process:
–Given a set of decision alternatives, choose one course of action from possibly many
–But it may not be the optimal one
–Visualize alternatives – whole or partial subset
–Perform exploratory analysis – what-if and why
•How do I get there from here?
•How did I get here from there?
•“Tools and techniques that gear the analyst’s mind to apply higher
levels of critical thinking can substantially improve analysis…
structuring information, challenging assumptions, and exploring
alternative interpretations.”
Richards Heuer, Jr., “The Psychology of Intelligence Analysis”
Novelty Discovery
–Finding new, rare, one-in-a-[million / billion / trillion/ etc.] objects
and events
Class Discovery
–Finding new classes of objects and behaviors
–Learning the rules that constrain class boundaries
Association Discovery
–Finding unusual (improbable) co-occurring associations
Correlation Discovery
–Finding patterns and dependencies, which reveal new natural
laws or new scientific principles
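Association discovery above – finding unusual (improbable) co-occurrences – is commonly scored with "lift": how much more often two items appear together than independence would predict (lift well above 1 suggests an unusual association). The baskets below are the classic toy market-basket illustration, not data from the source.

```python
# Association-discovery sketch: lift over toy market baskets.
from itertools import combinations

baskets = [
    {"milk", "bread"}, {"milk", "bread"}, {"milk", "diapers", "beer"},
    {"diapers", "beer"}, {"bread"}, {"diapers", "beer"},
]

def lift(item_a, item_b, baskets):
    """P(a and b) / (P(a) * P(b)); 1.0 means independent."""
    n = len(baskets)
    p_a = sum(item_a in b for b in baskets) / n
    p_b = sum(item_b in b for b in baskets) / n
    p_ab = sum(item_a in b and item_b in b for b in baskets) / n
    return p_ab / (p_a * p_b)

scores = {
    pair: round(lift(*pair, baskets), 2)
    for pair in combinations(["milk", "bread", "diapers", "beer"], 2)
}
```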
Ref: Kirk Borne, Dynamic Events in Massive Data Streams, GMU
Copyright (except where referenced) 2014-2016
Stephen H. Kaisler, Frank Armour, Alberto Espinosa and William H. Money
The New Analytic Paradigm
Source: Thompson May, The New Know: Innovation Powered by Analytics, 2009
Analytics Challenges
Finding A Needle in a Haystack
•What if the “needle” happens to be a complex data structure?
–Brute force search and computation are unlikely to succeed due to inefficiency
–Complexity increases with streaming data as opposed to a static data set
Challenge: Consider finding the few packets in the millions (er, tens of
billions) flowing through a network that carry a virus or malware.
•Networks have become exponentially faster. They carry more traffic and more
types of data than ever before. Yet as they get faster, they become more
difficult to monitor and analyze.
–40G Networks
–Richer Data: VOIP as the telephony standard
–Malicious security threats are more subtle
•Problems:
–Finding proof of a security attack
–Troubleshooting intermittent performance issues
–Identifying the source of data leaks
–Troubleshooting VOIP and Video over VOIP
•Network forensics must be:
–Precise: capture high-speed packets without loss
–Scalable: extend to new network technologies and speeds
–Flexible: adapt to heterogeneous network segments
–VOIP-Smart: reconstruct & replay VoIP calls; present Call Detail Records (CDR) for each call
–Continuously available: run 24/7 with adequate storage; support real-time analysis
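The needle-in-a-haystack challenge above can be sketched as a streaming filter: scan packets one at a time for a known byte signature, keeping only the matches. The signature and packets here are invented; real network forensics would match against many signatures at line rate, but the memory profile is the same: nothing is stored except the hits.

```python
# Streaming filter sketch: flag packets carrying a (hypothetical)
# malware byte signature without buffering the traffic.
SIGNATURE = b"\xde\xad\xbe\xef"

def scan(packet_stream, signature=SIGNATURE):
    """Yield (packet_number, payload) for suspicious packets; O(1) memory."""
    for number, payload in enumerate(packet_stream):
        if signature in payload:       # in-flight test; nothing else is kept
            yield number, payload

packets = [b"GET / HTTP/1.1", b"\x00\xde\xad\xbe\xef\x00", b"hello"]
hits = list(scan(packets))
```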
Finding the Knees
Ref: Choucri, N., et al. (2006) Understanding and Modeling State stability: Exploiting System Dynamics.
MIT Sloan Research Papers, No. 4574-06, Jan. 2006.
•With (all) the data available, describe a situation in a generalized form such
that predictions for future events and prescriptions for courses of actions can
be made.
•Objective: Identify one or more patterns that characterize the behavior of the
system.
•Remember: All data has value to someone, but not all data has value to
everyone.
Some Grand Challenges
Ref: Martin, J. "The Meaning of the 21st Century: A Vital Blueprint for Ensuring Our Future", Jan 2007
Where is the ROI?
Is this ….?
Big Data Adoption
As part of advancing digitization, many enterprises feel the need to explore the
possibilities big data may provide for their business.
However, only a few companies use big data applications productively, despite the high
expected potential.
How companies examine the possibilities of big data is therefore a highly interesting and
relevant question.
Against this background, a growing number of companies are investing in big data looking for competitive
advantages (Constantiou and Kallinikos, 2015).
Nevertheless, companies seem to have difficulties with the productive implementation of big
data applications.
According to a Gartner study, only 14% of enterprises have put big data projects into
production (Kart, 2015).
https://www.happiestminds.com/whitepapers/big-data-infrastructure-considerations.pdf
Recommended Adoption Process
• Big Data Architecture is the combination of tools
and technologies to accomplish the whole of the
task.
• An ideal big data architecture would be resilient,
cost-effective, secure, and adaptive to the new
needs of the environment.
Big Data Architectures
This can be achieved by beginning with the proven …
• Latency
• Scalability
• ...
Traditional Architecture for Big Data
Apache Spark: The New Kid on the Block
• Apache Spark is a fast and general engine for large-scale data processing
In contrast to Hadoop:
• Batch processing
• Not for low-latency use cases
• Map/Reduce – it’s still batch processing
• Event replay: historical events can be replayed out of a historical (raw) event store, provided by either the Messaging or the Raw Data (Reservoir) component.
• Updates of processing logic / event replay are handled by deploying the new version of the logic in parallel to the old one; the new logic reprocesses events until it has caught up with the current events, and then the old version can be decommissioned.
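The replay/upgrade pattern above can be sketched in miniature: an append-only raw event store (reservoir) keeps every event, so a new version of the processing logic can be run in parallel, reprocess the full history until it catches up, and then replace the old version. All names and the event shape here are illustrative.

```python
# Sketch of event replay from a raw store with parallel logic versions.
reservoir = []                          # raw event store: append-only history

def ingest(event):
    reservoir.append(event)             # every raw event is retained

def logic_v1(event):                    # old processing logic (in production)
    return event["value"]

def logic_v2(event):                    # new logic, deployed in parallel
    return event["value"] * 2

for v in (1, 2, 3):
    ingest({"value": v})

old_view = [logic_v1(e) for e in reservoir]   # current production output
new_view = [logic_v2(e) for e in reservoir]   # v2 replays history to catch up
# Once new_view covers the same events as old_view, v1 can be decommissioned.
```

The design choice being illustrated is that the reservoir, not the derived views, is the source of truth; derived state is disposable and can always be rebuilt by replay.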
Cost Saving
Challenges/Barriers
Manpower Requirements
•Businesses are feeling the data talent shortage. Not only is there a shortage of
data scientists, but successfully implementing a big data project requires a
sophisticated team of developers, data scientists, and analysts who also have
sufficient domain knowledge to identify valuable insights. Many big data
vendors seek to overcome this challenge by providing their own
educational resources or by providing the bulk of the management.
Data Quality
•Data quality is not a new concern, but the ability to store every piece of data a
business produces in its original form compounds the problem. Dirty data costs
companies in the United States $600 billion every year. Common causes of dirty
data that must be addressed include user input errors, duplicate data, and
incorrect data linking. In addition to meticulous maintenance and cleaning of
data, big data algorithms themselves can be used to help clean data.
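The common causes above – user input errors and duplicate data – can be attacked algorithmically. A minimal sketch, with hypothetical field names, that normalizes input before deduplicating (naive raw comparison would miss near-duplicates like " Alice " vs "alice"):

```python
# Data-cleaning sketch: normalize user input, then drop duplicates.
def clean(records):
    seen, cleaned = set(), []
    for rec in records:
        name = rec["name"].strip().lower()        # normalize input errors
        email = rec["email"].strip().lower()
        key = (name, email)                       # dedup on normalized fields
        if key not in seen:
            seen.add(key)
            cleaned.append({"name": name, "email": email})
    return cleaned

raw = [
    {"name": " Alice ", "email": "A@EXAMPLE.COM"},
    {"name": "alice", "email": "a@example.com"},   # duplicate after cleaning
    {"name": "Bob", "email": "bob@example.com"},
]
rows = clean(raw)
```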
References
• IBM ICE notes from tekstac
• https://www.happiestminds.com/whitepapers/big-data-infrastructure-considerations.pdf
• Securosis, Securing Big Data: Security Recommendations for Hadoop and NoSQL Environments, https://securosis.com/Research/Publication/securing-big-data-security-recommendations-for-hadoop-and-nosql-environment
• https://www.bigdataframework.org/big-data-architecture/
• https://docs.microsoft.com/en-us/azure/architecture/guide/architecture-styles/big-data
Thank you