
https://signup.live.com
https://studio.azureml.net
Email password

1 myaml01@outlook.com MLAzure2@2@ Myaml01 Lab


2 myaml02@outlook.com MLAzure2@2@ Myaml02 Lab
3 myaml03@outlook.com MLAzure2@2@ Myaml03 Lab
4 myaml04@outlook.com MLAzure2@2@ Myaml04 Lab
5 myaml05@outlook.com MLAzure2@2@ Myaml05 Lab
6 myaml06@outlook.com MLAzure2@2@ Myaml06 Lab
7 myaml07@outlook.com MLAzure2@2@ Myaml07 Lab
8 myaml08@outlook.com MLAzure2@2@ Myaml08 Lab
9 myaml09@outlook.com MLAzure2@2@ Myaml09 Lab
10 myaml10@outlook.com MLAzure2@2@ Myaml10 Lab
https://www.cio.com/article/3238088/analytics/data-analytics-myths-debunked.html

1. Data analytics requires major investments.
2. You need a big data platform to do analytics.
6. Data science is a mysterious black art.
7. To do more data science you need more data scientists.
8. Analytics takes too long.
9. Technology is the hard part.
10. Data analytics should be a separate department.
11. Analytics is just for PhDs.
BIG DATA
Predictive & Prescriptive Analytics
Data Science
They are different – but linked!

BIG DATA
By default this involves Hadoop & BD techniques (Parallel and Distributed Computing).
• Hadoop and BD techniques. Gartner: 3-V definition etc.
• Big Data Engineering.
• Big Data Management.
e.g. MapReduce, etc.

Predictive & Prescriptive Analytics
Extracting value out of data to forecast/predict data not available, and making decisions.
• Data Visualization.
• Data Mining/Machine Learning.
• Cognitive APIs and applications.

Data Science
(Evolving – computer science, math/stats, information systems, domain applications).
http://datascience.nyu.edu/what-is-data-science/
Data Scientist is in High Demand
Wikipedia 2015
https://en.wikipedia.org/wiki/Data_science
Data Science is the extraction of
knowledge from large volumes of data
that are structured or unstructured.
Big Data was a Hype

Wikipedia 21-02-2017
https://en.wikipedia.org/wiki/Data_science
Data science, also known as data-driven
science, is an interdisciplinary field about
scientific methods, processes and systems
to extract knowledge or insights from
data in various forms, either structured or
unstructured, similar to Knowledge
Discovery in Databases (KDD).
https://www.glassdoor.com/List/Best-Jobs-in-America-LST_KQ0,20.htm
Big Data no longer a Hype
General consensus of Data Science
Data Science still evolving (http://datascience.nyu.edu/what-is-data-science/)

• Applied Field.
• Multi-disciplinary.
• Science and Mathematics based.
Data Science Activities – industry / practitioner view

Primary goal: To answer 5 types of questions using data – but to achieve and benefit from this, they also need to:
• Prepare data to answer the 5 questions (Data Engineering, Feature Engineering).
• Extract value from the answers to these 5 questions (operationalizing data):
  • Predicting and forecasting.
  • AI/Intelligence/Cognitive applications.
https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-data-science-for-beginners-the-5-questions-data-science-answers
Classification: Is it A or B? e.g. Will the HDD fail next month? Yes/No.
Regression (Forecasting): How much/how many? e.g. what is the temperature next Tuesday?
Reinforcement Learning: Which option? e.g. does the car stop or go on an orange light.
Is it weird? e.g. Fraud Detection.
Clustering: Which groups? e.g. which viewers like the same type of movie.
Machine Learning algorithms are used in Advanced Analytics by Data Scientists.
Classification (A or B)
• Logistic regression.
• Two-Class Boosted Decision Trees.
• Two-Class Decision Forest.
• Two-Class Neural Network.
• Two-Class [Locally-Deep] SVM
• K-Nearest Neighbors
Regression/Forecasting (How many)
• Linear regression.
• Boosted Decision Tree Regression.
• Forest regression.
• NN Regression.
• ARIMA
• K-Nearest Neighbors
Neural Net (Reinforcement Learning)
Clustering (Which Group)
• K-Means.
• Hierarchical Agglomerative Clustering.
• Recommenders.
Evaluation:
• Iteration.
• Parameter Tuning.
• Cross Validation.
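The evaluation steps above (iteration, parameter tuning, cross validation) can be sketched in a few lines. This is a minimal illustration, not the Azure ML implementation: it assumes a hand-written 1-D K-Nearest Neighbors classifier and hypothetical toy data, and the names `knn_predict` and `cross_val_accuracy` are made up for the example.

```python
import random
from collections import Counter

def knn_predict(train, x, k):
    """Majority vote among the k training rows whose feature is nearest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def cross_val_accuracy(data, k_neighbors, n_folds=5, seed=0):
    """Mean accuracy over n_folds held-out folds (k-fold cross validation)."""
    rows = data[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::n_folds] for i in range(n_folds)]
    scores = []
    for i, test in enumerate(folds):
        # Train on every fold except the one held out for testing.
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        hits = sum(knn_predict(train, x, k_neighbors) == y for x, y in test)
        scores.append(hits / len(test))
    return sum(scores) / n_folds

# Hypothetical toy data: class "A" for features 0..9, class "B" for 10..19.
data = [(float(i), "A") for i in range(10)] + [(float(i), "B") for i in range(10, 20)]

# Parameter tuning: sweep the number of neighbors and keep the best score.
results = {k: cross_val_accuracy(data, k) for k in (1, 3, 5)}
best_k = max(results, key=results.get)
print(results, best_k)
```

Each candidate parameter value is scored only on data the model did not see during training, which is the point of cross validation.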

Ask a sharp question → Get more data → Put data in a table → Check for quality → Transform features → Answer the question → Use the Answer
("no" at any step loops back)

Business Understanding → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment
Modeling

Classification (A or B)
• Logistic regression.
• Two-Class Boosted Decision Trees.
• Two-Class Decision Forest.
• Two-Class Neural Network.
• Two-Class [Locally-Deep] SVM.
• K-Nearest Neighbors.
Regression, Forecasting (How many)
• Linear regression.
• Boosted Decision Tree Regression.
• Forest regression.
• NN Regression.
• ARIMA.
• K-Nearest Neighbors.
Neural Net (Reinforcement Learning)
Clustering (Which Group)
• K-Means.
• Hierarchical Agglomerative Clustering.
• Recommenders.

Evaluation

Classification (A or B)
• Accuracy.
• AUC.
• Confusion matrix.
• Recall.
• Feature interpretation.
• Tuning/Sweeping Model Parameters.
• Cross validation.
• Nested cross validation to tune parameters.
• Comparison between models.
Regression (How many)
• Root Mean Square Error.
• Standard Error (ARMA).
• Coefficient of determination.
• Residual Visualization:
  • Time series plots of actual versus predicted.
  • Box-Plot.
  • Time series plots of residuals (line, histograms).
• Regularization.
• Tuning/Sweeping Model Parameters.
• Cross validation.
• Nested cross validation to tune parameters.
• Comparison between models.
Clustering (Which Group)
• Maximal distance to cluster center.
• Principal Component Analysis (PCA).
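Several of the evaluation measures listed above reduce to short formulas. The sketch below computes a confusion matrix, accuracy and recall for a two-class problem, and Root Mean Square Error and the coefficient of determination for a regression. The labels and numbers are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import math

def confusion_counts(actual, predicted, positive):
    """Return (tp, fp, fn, tn) for a two-class problem."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    return tp, fp, fn, tn

def rmse(actual, predicted):
    """Root Mean Square Error: square root of the mean squared residual."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Classification (A or B): hypothetical "will the HDD fail?" labels.
actual = ["fail", "ok", "fail", "ok", "ok", "fail"]
predicted = ["fail", "ok", "ok", "ok", "fail", "fail"]
tp, fp, fn, tn = confusion_counts(actual, predicted, positive="fail")
accuracy = (tp + tn) / (tp + fp + fn + tn)
recall = tp / (tp + fn)

# Regression (How many): hypothetical actual and predicted values.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
print(accuracy, recall, rmse(y_true, y_pred), r_squared(y_true, y_pred))
```

Accuracy alone can mislead when one class is rare, which is why the confusion matrix and recall are listed alongside it.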
Microsoft Data Science Programs for Universities

Microsoft Professional Program for Data Science Certification
• Professional Certification in Data Science (STEM Professionals)

Cortana Intelligence Education Program
• Postgraduate Data Science
• Postgraduate Data Engineers for Big Data
• Postgraduate Data Analyst

Undergraduate Data Science Program
• Computer Science – Data Science
• Information Systems – Applied Data Science/Data Engineering

Make AI/Data Science as important as Software Development


MPP for Data Science Overview

• 10 Modules.
• Contents are free – and tracked by the Open edX platform.
• Each module will have a completion “badge” (paid).
• 169 – 256 hours (5-6 Credits only)
• 3 months full time, 6 months part time.
MPP - DATA SCIENCE TRACK
https://academy.microsoft.com/en-us/tracks/data-science/

FUNDAMENTALS
• DATA SCIENCE ORIENTATION
• QUERYING DATA WITH TRANSACT-SQL
• ANALYZING AND VISUALIZING DATA WITH EXCEL
• ANALYZING AND VISUALIZING DATA WITH POWER BI
• INTRODUCTION TO STATISTICS

CORE DATA SCIENCE
• INTRODUCTION TO R FOR DATA SCIENCE
• INTRODUCTION TO PYTHON FOR DATA SCIENCE
• DATA SCIENCE ESSENTIALS
• PRINCIPLES OF MACHINE LEARNING

APPLIED DATA SCIENCE
• PROGRAMMING WITH R FOR DATA SCIENCE
• PROGRAMMING WITH PYTHON FOR DATA SCIENCE
• APPLIED MACHINE LEARNING SCENARIOS
• IMPLEMENTING PREDICTIVE MODELS WITH SPARK IN AZURE HDINSIGHT
• DEVELOPING INTELLIGENT APPLICATIONS

PROJECT
• CORTANA COMPETITION
• DEGREE

Azure Machine Learning, HDInsight
CIEP Curriculum Content
i. Free access to a world class curriculum to train the next generation of applied data scientists, analysts and data developers
ii. Combines fundamental concepts along with real world use cases suited for both business and technical data science tracks
iii. Includes video tutorials, lecture material, hands on exercises with data sets and labs developed by academic and industry experts

Data Science/ML Track
• Overview of machine learning theory
• Data acquisition, ingestion, sampling, quantization, cleaning and transformation
• Building data science workflows with Azure ML, Python and R
• Data exploration and visualization
• Building, evaluating and publishing machine learning models with Azure Text Analytics

Big Data Track
• Provisioning an HDInsight cluster, upload data, and run MapReduce jobs
• Use Hive to store and process data, Process data using Pig.
• Use custom Python user-defined functions from Hive and Pig.
• Define and run workflows for data processing using Oozie
• Transfer data between HDInsight and databases using Sqoop

Data Visualization Track
• Explain Power BI Service and Power BI Desktop
• Build a data model showcasing the features of Power BI Desktop
• Publish the data model, create dashboard and refresh data in Power BI Service
• Explain collaboration options with lab
• Explain direct connectivity options
• Explain Power BI REST API

Computer Science in Applied Data Science
Ideal approach given time and effort.

Present MQA/ACM Computer Science Curricular Core
Statistics with R
Data Engineering | Machine Learning | Advanced AI (optional) | Introduction to Big Data | End to End Project in Data Science (Capstone)

Entry level
• Computer Scientist.
• Data Scientist.
• Data Engineer.
• AI Systems Developer.
• Data Analyst (if Core AI
Information System in Applied Data Science

Present MQA/ACM Information System Curricular Core
Statistics with R??
Advanced Programming, Algorithms, Basic AI
Data Engineering | Business Intelligence & Data Visualization | Advanced IS/Data Mining (optional) | End to End Project in Data Science (Capstone)

Entry level
• IS Professional.
• Data Engineer.
• AI Systems Developer.
• Data Analyst (if Advanced IS is
UNDERGRADUATES | CUSTOMISED

We’re customising the three curricula above to give you the below, for undergrads:

1. MTA: Database Fundamentals
2. Introduction to Python
3. Introduction to Data Science
4. Introduction to R for Data Science
5. Analyzing Big Data with Microsoft R
6. Perform Cloud Data Science with Azure Machine Learning
Introduction half-day workshop series.
1. Introduction to Applied Data Science.
2. Data Preparation – Cleansing, Manipulation.
3. Machine Learning Model and Evaluation.
4. Improving Models.
5. Introduction to Big Data for Data Science using HDInsight.
Cortana Intelligence Suite helps to boost data science graduates’ career prospects at a Malaysian university

Objectives
• Required an integrated solution to reduce time spent on non-learning related tasks to improve learning outcomes.

Solution
• Leveraged existing industry tools rather than “building from scratch”.
• Introduced data science students to Cortana Intelligence Suite (Azure ML) in an early foundation course, which forms the basis for extensive learning in advanced Machine Learning modules.

Results
• Cloud-based platform eliminated the need for high-spec hardware and enabled mobile learning.
• An increase in the application of Machine Learning algorithms implemented as case studies / mini projects.
• Students at IIUM were able to implement more algorithms during lesson time on the Azure ML platform.
• The hands-on work helps students understand machine learning concepts and apply them to industry problems.

“Understanding the machine learning concepts through Azure ML will help the students excel in their future career.”
— Dr. Amelia Ritahani Ismail, Associate Professor at International Islamic University of Malaysia
Date | My stock price
Date | Americas sales | Europe and Africa sales | Asia sales
Competitor | Product | Market share
Product | First month users | First quarter users | First year users
Date | Dow Jones | Nikkei

Stock price | Date | Day of week | Dow Jones | Last month sales | Last quarter sales | Market share | New users last month | New users last quarter | Days since last press release | Days since last product release | Total users
57.3 | 5/21 | Tue | 17,245 | 68.8M | 211.2M | 23.1% | 63,522 | 195,322 | 3 | 96 | 2.49M
58.8 | 5/22 | Wed | 17,289 | 68.8M | 211.2M | 23.1% | 63,522 | 195,322 | 4 | 97 | 2.49M
56.9 | 5/23 | Thu | 17,115 | 68.8M | 211.2M | 23.1% | 63,522 | 195,322 | 5 | 98 | 2.49M
57.4 | 5/24 | Fri | 17,278 | 68.8M | 211.2M | 23.1% | 63,522 | 195,322 | 6 | 99 | 2.49M
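Columns such as "Day of week" and "Days since last press release" in the table above are derived features: they are computed from raw dates rather than collected directly. A minimal sketch of that step follows. The slides give no year or event dates, so the year 2013 and the two release dates are assumptions chosen only so the result matches the first row; `date_features` is a hypothetical helper name.

```python
from datetime import date

DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def date_features(day, last_press_release, last_product_release):
    """Derive calendar and 'days since' features for one table row."""
    return {
        "Date": f"{day.month}/{day.day}",
        "Day of week": DAY_NAMES[day.weekday()],
        "Days since last press release": (day - last_press_release).days,
        "Days since last product release": (day - last_product_release).days,
    }

# Assumed dates (not in the source): picked to reproduce the 5/21 row.
row = date_features(
    day=date(2013, 5, 21),
    last_press_release=date(2013, 5, 18),
    last_product_release=date(2013, 2, 14),
)
print(row)
```

Turning dates into numeric or categorical columns like these is what lets a model use them in a single flat table.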
Business Understanding → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment

[Diagram: Dataset, Reference Data, Join Data, Transformation, Scored Model]
[Diagram: Reference Data, Import Data, Transformation, Scored Model]
[Diagram: Web Service Input, Reference Data, Import Data, Transformation, Scored Model, Web Service Output]

Excel, SQL Server Tools, Power BI, Azure ML, R/Python, Jupyter
Data Science Team for Advanced Analytics - Challenges
• Different types of skills.
• Number of highly skilled professionals.
• Takes a long time.

[Diagram: Data Science Team, surrounded by Math/Stats Algorithms, IT Platform Management, Data Engineering, Business Acumen, Application Development, Data Management; outcome: Data Dividend]
Understanding the role of Mathematics
• Given data x_1, x_2, …, x_{t-1}, predict x_t
• Or P(x_t | x_1, x_2, …, x_{t-1})
• Basis of this is Bayes rule
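
The prediction problem above can be written out as follows. This is a sketch, not the slide author's derivation: the parameter symbol \theta is an assumption introduced here for a model's unknown parameters, and the last line additionally assumes that x_t is independent of the earlier data once \theta is known.

```latex
% Definition of the conditional (predictive) probability
P(x_t \mid x_1, \dots, x_{t-1})
  = \frac{P(x_1, \dots, x_{t-1}, x_t)}{P(x_1, \dots, x_{t-1})}

% Bayes rule for the parameters \theta given the data seen so far
P(\theta \mid x_1, \dots, x_{t-1})
  = \frac{P(x_1, \dots, x_{t-1} \mid \theta)\, P(\theta)}{P(x_1, \dots, x_{t-1})}

% Prediction by averaging over the posterior (assumes x_t is
% conditionally independent of the past given \theta)
P(x_t \mid x_1, \dots, x_{t-1})
  = \int P(x_t \mid \theta)\, P(\theta \mid x_1, \dots, x_{t-1})\, d\theta
```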

Machine Learning
• Supervised learning.
• Unsupervised learning.
• Reinforcement learning.

Selecting or combining the following mathematical models:


• Latent variable models (e.g. factor analysis, K-means etc).
• EM algorithms (ML estimation).
• Modelling time series.
• Nonlinear, factorial models & hierarchical models.
• Graphical models.
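
K-means, named above as a latent variable model, can be sketched in a few lines: the hidden variable is the cluster each point belongs to, and the algorithm alternates between assigning points to the nearest center and recomputing each center, which resembles an EM algorithm with hard assignments. The 1-D data, starting centers and function name `kmeans_1d` are assumptions for illustration.

```python
def kmeans_1d(points, centers, n_iter=10):
    """Alternate assignment and update steps for 1-D K-means."""
    clusters = [[] for _ in centers]
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical data with two obvious groups.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])
print(centers, clusters)
```

With these starting centers the first pass puts 3.0 in the wrong group; the second pass corrects it and the centers settle at 2.0 and 11.0.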
ML (2-5 years)
General Purpose ML Platforms (>10 years)
Out of the HC: IoT, 3D Printing, Hybrid Cloud, etc.
• Went viral : 50 Million Users in 7 Days, 1.2M users/hour.
Democratizing Data Science

[Diagram: Data Science Professional, surrounded by Math/Stats Algorithms, IT Platform Management, Data Engineering, Business Acumen, Application Development, Data Management; outcome: Data Dividend at lower cost and faster turn-around]
Microsoft Imagine Academy Program | Microsoft Professional Program
These Learning Paths are current as of September 2017.

Computer Science (Software Engineer, Programmer)
• Hour of Code
• Minecraft Redstone (STEM)
• Minecraft (Computer Science)
• Creative Coding Games and Apps
• Technopreneurship (April 2017)
• Introduction to Python (April 2017)
• Harvard CS50 AP© Computer Science Principles
• MTA: Software Development Fundamentals
• MTA: HTML5 Application Development Fundamentals

IT Infrastructure (IT Pro)
• MTA: MTA: MTA: MTA: MTA: MTA:

Data Science (in development*) (Data Scientist)
• Excel 2016 and Excel Expert 2016
• MTA: Database Fundamentals
*TBD -- Courses under review may include:
• SQL Database Fundamentals
• Analyzing and Visualizing Data with Excel
• Analyzing and Visualizing Data with Power BI
• Data Science and Machine Learning Essentials
• Introduction to Python for Data Science or Introduction to R for Data Science

Productivity

Microsoft Certified Solutions Associate (MCSA) — Data Management & Analytics path
Microsoft Professional Programme (MPP) — Data Science Track
FUNDAMENTALS
• DATA SCIENCE ORIENTATION
• QUERYING DATA WITH TRANSACT-SQL
• ANALYZING AND VISUALIZING DATA WITH EXCEL
• ANALYZING AND VISUALIZING DATA WITH POWER BI
• INTRODUCTION TO STATISTICS

CORE DATA SCIENCE
• INTRODUCTION TO R FOR DATA SCIENCE
• INTRODUCTION TO PYTHON FOR DATA SCIENCE
• DATA SCIENCE ESSENTIALS
• PRINCIPLES OF MACHINE LEARNING

APPLIED DATA SCIENCE
• PROGRAMMING WITH R FOR DATA SCIENCE
• PROGRAMMING WITH PYTHON FOR DATA SCIENCE
• APPLIED MACHINE LEARNING SCENARIOS
• IMPLEMENTING PREDICTIVE MODELS WITH SPARK IN AZURE HDINSIGHT
• DEVELOPING INTELLIGENT APPLICATIONS

PROJECT
• CAPSTONE EVENT
• Certificate of Completion

Azure Machine Learning, HDInsight

How the Microsoft Data Science program works

Topics: RDBMS, Database Design, DDL, DML; Transform Data, Mashup Data, Data Modelling, Data Visualization; Data Classification, Bayesian Model, Linear Regression; R for Data Analysis, Data Structures, Custom Plots, Data Frames, Control Flow; Data Science Process, Probability and Statistics, Machine Learning; Classification, Regression, Clustering, Modelling; Control Flow, Creating Visualizations, Predictive Analysis; Time Series and Forecasting, Spatial Data, Text Analytics.

Tools and techniques: data exploration and visualization techniques; T-SQL language and data manipulation in SQL Server and Azure SQL; Excel Data Tools, DAX Engine, Power BI Cloud Service and Dashboards; statistics with the Excel Data Analysis Pack; R syntax and Python language; Azure ML; visualization and exploration with R / Python; SciKit-Learn; visualizations with ggplot2; natural language.
Three Microsoft Curriculum Tracks
UNDERGRADUATES | CUSTOMISED

We’re customising the three curricula above to give you the below, for undergrads:

1. MTA: Database Fundamentals
2. Introduction to Python
3. Introduction to Data Science
4. Introduction to R for Data Science
5. Analyzing Big Data with Microsoft R
6. Perform Cloud Data Science with Azure Machine Learning
