
Introduction to Hadoop and Extension to Machine Learning
Phone: 647 977 2648
Meeting ID: 270 984 711

Canada: Toronto, Calgary
India: Delhi, Bangalore

E-mail: info@2iSolutions.com
Website: www.2isolutions.com
AGENDA
Agenda

• About Me
• What is Big Data?
• Importance of Big Data
• Why?
• Business Cases
• What is in the future?
• Possible Training Programs
IIBS Introduction: Who made it possible!

• Since 2005
• Trained more than 12,000 students in Big Data, PMP, CBAP, BA, SAP, SCRUM, Software Testing and ISTQB
• Involved in Training, Resume Prep, Projects
• LMS
• Almost Free Repeat Policy (some conditions apply)
Iibs.ca

• Training on Big Data, Hadoop, Spark, R, Data Science, Python
• Other Training on
  – SAP
  – BA, Software Testing, ISTQB, CBAP, PMP
INTRODUCTION
About Me

• Praveen Kumar
• 18 Years in SAP, BI and Big Data

• 2iSolutions
• Specialized in SAP and Big Data
• Cloudera and SAP Partner Company
• 12 Years in Business
• Offices in Canada and in India
CHALLENGES IN TODAY'S WORLD
What are the Challenges?

• Disruptive Technologies
• Digital
• So many Tools
• So many roadmaps
• HANA
• HADOOP
• Lumira
• Machine Learning
• Deep Learning
• ANN
• Supervised Learning
• Unsupervised Learning
Market Research: Forbes 2016

• 72% of CEOs believe the next 3 years will be more critical for their industry than the previous 50 (Forbes 2016)
• One of the top three priorities of CEOs over the next 3 years is implementing disruptive technologies
• 77% are concerned whether their organization is keeping up with new technologies
Disrupt or Be Disrupted?

65% of the CEOs surveyed are concerned that new entrants are
disrupting their business models

More than half (53%) of CEOs believe that their company is not disrupting their industry's business models enough

Either you are a disruptor or you will be disrupted


• Amazon
• Apple
• Google
• Netflix
• Uber
• Expedia
• AirBnB
Disruption Tools and Techniques

• IoT
• Big Data
• Machine Learning
• Analytics
• Blockchain
• Data Intelligence
• Deep Learning
Digital
LET'S TALK BIG DATA…
Today's Challenges

• 2,500,000 terabytes of data are generated every day
• 90% of data is unstructured
• Only 0.5% of data is being used for analytics
In the span of an internet minute …
What is Big Data?
What is Hadoop?

• Hadoop is a software framework for distributed processing of large datasets across large clusters of computers
  – Large datasets: terabytes or petabytes of data
  – Large clusters: hundreds or thousands of nodes
• Hadoop is based on a simple data model; any data will fit
• Technically
  – MapReduce
  – HDFS
  – Commodity Hardware
  – YARN
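To make the MapReduce item above concrete, here is a minimal single-machine sketch of the map, shuffle and reduce phases using the classic word-count example. It is plain Python, not Hadoop's API; the function names and the two input splits are hypothetical and exist only for illustration. On a real cluster each split would be mapped on a different node.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine the values for each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}

def word_count(documents):
    # Here the splits are processed one after another; a cluster would
    # run the map calls in parallel on the nodes that hold the data.
    all_pairs = []
    for document in documents:
        all_pairs.extend(map_phase(document))
    return reduce_phase(shuffle_phase(all_pairs))

# Hypothetical input splits, invented for illustration.
splits = ["big data needs hadoop", "hadoop stores big data"]
counts = word_count(splits)
print(counts)
```

The point of the pattern is that the map and reduce functions contain no knowledge of how many machines are involved, which is what lets the framework scale them out.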
Big Data Analytics Tools and Technology
HDFS: Hadoop Distributed File System

• Runs on top of the operating system's native file system.
• Designed to handle very large files with streaming data access patterns.
• Hadoop uses blocks to store a file or parts of a file

[Figure: input data is split into blocks (Block 1, Block 2, Block 3, …), and copies of each block are stored on several nodes.]
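The block idea can be sketched in a few lines of Python. This is a simulation, not HDFS code: the 128 MB block size and replication factor of 3 are assumed values, the node names are hypothetical, and the round-robin placement is a deliberate simplification of how a real cluster chooses nodes.

```python
BLOCK_SIZE_MB = 128   # assumed block size for this sketch
REPLICATION = 3       # assumed number of copies kept of each block

def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size of each block needed to hold the file."""
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        blocks.append(min(block_size_mb, remaining))
        remaining -= block_size_mb
    return blocks

def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement

# Hypothetical 300 MB file on a hypothetical four-node cluster.
blocks = split_into_blocks(300)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
print(blocks)
print(placement)
```

Because every block lives on more than one node, losing a single machine does not lose any part of the file.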
Why Hadoop?

Hadoop vs. Database

Hadoop
• Scalability (petabytes of data, thousands of machines)
• Flexibility in accepting all data formats (no schema)
• Efficient and simple fault-tolerant mechanism
• Inexpensive commodity hardware

Database
• Performance (tons of indexing, tuning, data organization techniques)
• Features:
  - Provenance tracking
  - Annotation management
  - ….

Top 10 Reasons for Big Data and Hadoop

1. Creating an Innovation Platform for Disruptive Technology
2. Any Data will fit
3. Commodity Hardware
4. Easier Data Management, ELT Model
5. Business Cases, e.g. AI, Deep Learning, Machine Learning, Chatbots
6. Archival Storage
7. Data Lake
8. Scalable
9. Hidden Opportunities for Savings and Innovation
10. Open Source
ANALYTICS…WHERE ARE WE HEADING?
4 Types of Analytics

• Descriptive: What happened?
• Diagnostic: Why did it happen?
• Predictive: What is likely to happen?
• Prescriptive: What should we do about it?
Where is Analytics heading?
Enter Apache Spark
Flexible, in-memory data processing for Hadoop

Ease of Use
• Rich & flexible APIs for Scala, Java, and Python
• Seamlessly interleave SQL syntax with code
• Interactive shell

Advanced Analytics & Machine Learning
• Unified framework for batch and stream processing
• Rich collection of distributed ML algorithms

Performance
• In-memory caching
• Optimized scheduler
• Query optimizer
SOME BUSINESS CASES…
EDW (Enterprise Data Warehouse) to EDH (Enterprise Data Hub)

1 Active Compliance Archive
• Full fidelity original data
• Indefinite time, any source
• Lowest cost storage

2 Persistent Staging
• One source of data for all analytics
• Persist state of transformed data
• Significantly faster & cheaper

3 Self-Service Exploratory BI
• Simple search + BI tools
• “Schema on read” agility
• Reduce BI user backlog requests

4 Diverse Analytic Platform
• Bring applications to data
• Combine different workloads on common data (i.e. SQL + Search)
• True analytic agility

[Figure: the four stages shown over existing systems (servers, marts, EDWs, documents, storage, search, archive). Data sources: ERP, CRM, RDBMS, machines; files, images, videos, logs, clickstreams; external data sources.]

©2014 Cloudera, Inc. All rights reserved.


Other Potential Business Cases

• Chatbot
• Predictive Maintenance
• AI
• Data Lake
• Archive Solutions
• IoT
• Streaming Data
• Health
MACHINE LEARNING USING CLOUDERA DSW
Why Machine Learning?

• It is very hard to write programs that solve problems like recognizing a three-dimensional object from a novel viewpoint in new lighting conditions in a cluttered scene.
• It is hard to write a program to compute the probability that a credit card transaction is fraudulent.
What is Machine Learning?

• Definition: Machine Learning is a field of study that gives computers the ability to learn without being explicitly programmed [Arthur Samuel, 1959]
• Instead of writing a program by hand for each specific task, we collect lots of examples that specify the correct output for a given input.
• A machine learning algorithm then takes these examples and produces a program that does the job.
• Massive amounts of computation are now cheaper than paying someone to write a task-specific program.
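As a rough illustration of "examples in, program out", the sketch below learns a cut-off value from labelled examples instead of having a person hard-code it. The transaction amounts and labels are invented for illustration, and a single threshold is far simpler than anything a real fraud system would use.

```python
def learn_threshold(examples):
    """Pick the cut-off that misclassifies the fewest training examples.

    examples: list of (value, label) pairs with label 0 or 1.
    The learned rule is: predict 1 when value > threshold.
    """
    values = sorted(value for value, _ in examples)
    # Candidate thresholds: midpoints between neighbouring values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold):
        return sum((value > threshold) != bool(label) for value, label in examples)

    return min(candidates, key=errors)

# Hypothetical labelled examples: (transaction amount, 1 = fraudulent).
training = [(12, 0), (40, 0), (55, 0), (300, 1), (720, 1), (950, 1)]
threshold = learn_threshold(training)

def predict(amount):
    """The 'program' produced by learning: a rule nobody wrote by hand."""
    return int(amount > threshold)

print(threshold, predict(20), predict(800))
```

Changing the training examples changes the resulting rule without anyone editing `predict`, which is the shift in workflow the bullets above describe.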
Age of Machine Learning
Best Use Cases

• Recognizing patterns:
  – Objects in real scenes
  – Facial identities or facial expressions
  – Spoken words
• Recognizing anomalies:
  – Unusual sequences of credit card transactions
  – Unusual patterns of sensor readings in a nuclear power plant
• Prediction:
  – Future stock prices or currency exchange rates
  – Which movies will a person like?
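For the anomaly case, one very simple approach (chosen here for illustration, not taken from the slides) is to flag readings that lie far from the mean, measured in standard deviations. The sensor values and the cut-off of 2 standard deviations are assumptions.

```python
import statistics

def find_anomalies(readings, max_z=2.0):
    """Return (index, value) for readings more than `max_z` standard
    deviations away from the mean of all readings."""
    mean = statistics.mean(readings)
    spread = statistics.pstdev(readings)
    if spread == 0:
        return []  # all readings identical, nothing stands out
    return [
        (i, value)
        for i, value in enumerate(readings)
        if abs(value - mean) / spread > max_z
    ]

# Hypothetical sensor readings with one obvious outlier.
readings = [20.1, 19.8, 20.3, 20.0, 19.9, 35.0, 20.2, 20.1]
anomalies = find_anomalies(readings)
print(anomalies)
```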
Types of Learning

• Supervised learning
  – Learn to predict an output when given an input vector.
  – Each training example consists of an input vector x and a target output t.
• Unsupervised learning
  – Discover a good internal representation of the input
• Others:
  – Reinforcement learning, recommender systems
Overview of Machine Learning Algorithms

• Classification is a family of supervised machine learning algorithms that designate input as belonging to one of several pre-defined classes. Some common use cases for classification include:
  – credit card fraud detection
  – email spam detection
• In clustering, an algorithm groups objects into categories by analyzing similarities between input examples. Clustering uses include:
  – Search results grouping
  – Grouping of customers
  – Anomaly detection
  – Text categorization
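Clustering can be sketched with k-means, one common clustering algorithm (the slides do not name a specific one). The version below works on one-dimensional data in plain Python; the customer spend figures and the starting centroids are hypothetical.

```python
def k_means_1d(points, centroids, iterations=10):
    """Cluster 1-D points; `centroids` are the starting guesses."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical annual spend per customer; no labels are provided.
spend = [100, 120, 110, 900, 950, 1000]
centroids, clusters = k_means_1d(spend, centroids=[100, 1000])
print(centroids, clusters)
```

Unlike classification, no example here carries a label: the two customer groups emerge only from how close the values are to each other.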
Supervised Learning: Classification

Predict a discrete class label

– The simplest case is a choice between 1 and 0.
– We can also have multiple alternative labels
Supervised Learning: Regression

• Predict a continuous-valued output
  – The price of a stock in 6 months' time
  – The temperature at noon tomorrow
How Supervised Learning Works?

• We start by choosing a model-class:
  – A model-class, f, is a way of using some numerical parameters, W, to map each input vector, x, into a predicted output y.
• Learning usually means adjusting the parameters to reduce the discrepancy between the target output, t, on each training case and the actual output, y, produced by the model.
  – For regression, the squared difference, ½(y − t)², is often a sensible measure of the discrepancy.
  – For classification there are other measures that are generally more sensible (they also work better).
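The loop described above can be sketched with the simplest possible model-class, a straight line, trained by gradient descent on the squared-error discrepancy. Gradient descent, the learning rate, the epoch count and the training cases are all assumptions made for this illustration; the names f, w, x, y and t follow the slide.

```python
def f(w, x):
    """Model-class: a line y = w[0] + w[1] * x with parameters w."""
    return w[0] + w[1] * x

def train(cases, learning_rate=0.05, epochs=2000):
    """Adjust w to reduce the mean of 0.5 * (y - t)**2 over the cases."""
    w = [0.0, 0.0]
    for _ in range(epochs):
        grad0 = grad1 = 0.0
        for x, t in cases:
            y = f(w, x)
            # The derivative of 0.5 * (y - t)**2 with respect to y is (y - t).
            grad0 += (y - t)        # dy/dw[0] = 1
            grad1 += (y - t) * x    # dy/dw[1] = x
        w[0] -= learning_rate * grad0 / len(cases)
        w[1] -= learning_rate * grad1 / len(cases)
    return w

# Hypothetical training cases generated from t = 2x + 1.
cases = [(0, 1), (1, 3), (2, 5), (3, 7), (4, 9)]
w = train(cases)
print(w)
```

After training, `w` should be close to the intercept 1 and slope 2 that generated the data, which is what "reducing the discrepancy" amounts to in this toy case.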
Neural Networks

• Inspired by our understanding of how the brain learns
• Powerful tool for addressing typical machine learning tasks such as regression and classification
• Perform exceptionally well in speech recognition and object detection in images
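To show what a network of artificial neurons looks like in code, here is a tiny two-layer network that computes XOR. The weights are set by hand for illustration; in practice they would be learned from data, and real networks use smoother activation functions than the simple threshold assumed here.

```python
def step(z):
    """Threshold activation: the unit fires (1) when its input is positive."""
    return 1 if z > 0 else 0

def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of inputs, then an activation."""
    return step(sum(i * w for i, w in zip(inputs, weights)) + bias)

def xor_network(x1, x2):
    # Hidden layer: one neuron detects "at least one input on" (OR),
    # the other detects "both inputs on" (AND).
    h_or = neuron([x1, x2], [1, 1], -0.5)
    h_and = neuron([x1, x2], [1, 1], -1.5)
    # Output neuron: fires when OR is on but AND is off.
    return neuron([h_or, h_and], [1, -1], -0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_network(a, b))
```

No single neuron of this kind can compute XOR on its own; the hidden layer builds intermediate features that make the final decision easy.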
What is Deep Learning?

• A family of methods that uses deep architectures to learn high-level feature representations
• These representations are then used to perform typical machine learning tasks such as classification and regression.
HADOOP FOR AI
Apache Spark
Cloudera as a platform
Machine Learning, Deep Learning and AI
Apache Spark: FAST DATA PROCESSING
How is Spark used?
Apache Spark use case for Data Science
The Full Platform for Data Science and AI
Data Science Workbench
DSW: A Self Service Data Science
Open Ecosystem
Machine Learning on Hadoop
CDSW integrated with Cloudera Manager
How does CDSW help?
Random Forest Classifier As an Example
DL Framework
Why Deep Learning Today?
Key Benefits
TensorFlow on Spark
About TensorFlow

• TensorFlow™ is an open source software library for numerical computation using data flow graphs.
• Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them.
• The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API.
• TensorFlow was originally developed by researchers and engineers working on the Google Brain Team for the purposes of conducting machine learning and deep neural networks research.
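The data flow graph idea can be illustrated with a toy graph in plain Python. This is not TensorFlow's API: the `Node` class and the `placeholder`, `constant`, `add` and `multiply` helpers are hypothetical names invented for this sketch, and plain lists stand in for tensors.

```python
class Node:
    """A graph node: an operation plus the nodes whose outputs feed it."""
    def __init__(self, op, *inputs):
        self.op = op
        self.inputs = inputs

    def evaluate(self, feed):
        # Evaluate upstream nodes first, then apply this node's operation.
        values = [node.evaluate(feed) for node in self.inputs]
        return self.op(feed, *values)

def placeholder(name):
    """A node whose value is supplied when the graph is run."""
    return Node(lambda feed: feed[name])

def constant(value):
    return Node(lambda feed: value)

def add(a, b):
    return Node(lambda feed, x, y: [xi + yi for xi, yi in zip(x, y)], a, b)

def multiply(a, b):
    return Node(lambda feed, x, y: [xi * yi for xi, yi in zip(x, y)], a, b)

# Build the graph y = w * x + b first; nothing is computed yet.
x = placeholder("x")
w = constant([2.0, 2.0, 2.0])
b = constant([1.0, 1.0, 1.0])
y = add(multiply(w, x), b)

# Run the graph by feeding a value for x.
result = y.evaluate({"x": [1.0, 2.0, 3.0]})
print(result)
```

Separating graph construction from execution is what lets a framework decide where each operation runs, for example on a CPU or a GPU.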
SUMMARY
Summary

• Need for Big Data
• Need for Spark
• Use of AI Frameworks
• Languages:
  – R
  – Python
  – Scala
Next 3 Steps

• Identify your business case


• Start a PoC (Start Small)
• Confirm benefits and Iterate
Let's Connect

• www.linkedin.com/in/hipraveen

• Twitter: hipraveen
IIBS TRAINING PROGRAM
Training Program at IIBS.CA

• www.iibs.ca
• Big Data Training Program: http://iibs.ca/big-data-hadoop/
• Python: http://iibs.ca/python/
• Data Scientist:
http://iibs.ca/artificial-intelligence/data-scientist/
• Machine Learning
http://iibs.ca/machine-learning-with-python/
BONUS
People to Jobs
Questions
