https://studio.azureml.net
Data Science
They are different – but linked!
BIG DATA (Gartner: 3-V definition, etc.)
• Hadoop and BD techniques.
• Big Data Engineering.
• Big Data Management.
By default this involves Hadoop & BD techniques (parallel and distributed computing), e.g. MapReduce.
Wikipedia 21-02-2017
https://en.wikipedia.org/wiki/Data_science
Data science, also known as data-driven
science, is an interdisciplinary field about
scientific methods, processes and systems
to extract knowledge or insights from
data in various forms, either structured or
unstructured, similar to Knowledge
Discovery in Databases (KDD).
https://www.glassdoor.com/List/Best-Jobs-in-America-LST_KQ0,20.htm
Big Data no longer a Hype
General consensus on Data Science
Data Science is still evolving (http://datascience.nyu.edu/what-is-data-science/)
• Applied field.
• Multi-disciplinary.
• Science and mathematics based.
Data Science Activities – industry / practitioner view
Ask a sharp question → Get more data → Put data in a table → Check for quality → Transform features → Answer the question → Use the answer
(each step has a "no" branch in the flowchart)
Business Understanding → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment
Modeling
Classification (A or B)
• Logistic regression.
• Two-Class Boosted Decision Trees.
• Two-Class Decision Forest.
• Two-Class Neural Network.
• Two-Class [Locally-Deep] SVM.
• K-Nearest Neighbors.
Regression, Forecasting (How many)
• Linear regression.
• Boosted Decision Tree Regression.
• Forest regression.
• NN Regression.
• ARIMA.
• K-Nearest Neighbors.
Neural Net (Reinforcement Learning)
Clustering (Which Group)
• K-Means.
• Hierarchical Agglomerative Clustering.

Evaluation
Classification (A or B)
• Accuracy.
• AUC.
• Confusion matrix.
• Recall.
• Feature interpretation.
• Tuning/Sweeping Model Parameters.
• Cross validation.
• Nested cross validation to tune parameters.
• Comparison between models.
Regression (How many)
• Root Mean Square Error.
• Standard Error (ARMA).
• Coefficient of determination.
• Residual Visualization:
  • Time series plots of actual versus predicted.
  • Box-Plot.
  • Time series plots of residuals (line, histograms).
• Regularization.
• Tuning/Sweeping Model Parameters.
• Cross validation.
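Several of the evaluation measures listed above can be computed directly from a model's predictions. The following is a minimal sketch in plain Python and is not tied to Azure Machine Learning; the function names and the small data sets are made up for illustration.

```python
import math

def confusion_matrix(actual, predicted, positive="A"):
    """Return (tp, fp, fn, tn) counts for a two-class (A or B) problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1          # predicted A, really A
        elif p == positive:
            fp += 1          # predicted A, really B
        elif a == positive:
            fn += 1          # predicted B, really A
        else:
            tn += 1          # predicted B, really B
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + fp + fn + tn)

def recall(tp, fn):
    """Fraction of the real positives that the model found."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def rmse(actual, predicted):
    """Root Mean Square Error of a regression model."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Illustrative (made-up) classification results.
labels = ["A", "A", "B", "B", "A", "B"]
guesses = ["A", "B", "B", "B", "A", "A"]
tp, fp, fn, tn = confusion_matrix(labels, guesses)
acc = accuracy(tp, fp, fn, tn)
rec = recall(tp, fn)

# Illustrative (made-up) regression results.
observed = [3.0, 5.0, 7.0, 9.0]
forecast = [2.0, 5.0, 8.0, 9.0]
error = rmse(observed, forecast)
fit = r_squared(observed, forecast)
```

Accuracy and recall answer different questions, which is why the confusion matrix is usually inspected alongside them; RMSE is in the units of the target, while the coefficient of determination is unitless.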
• 10 Modules.
• Contents are free – and tracked by the Open edX platform.
• Each module will have a completion “badge” (paid).
• 169 – 256 hours (5-6 Credits only).
• 3 months full time, 6 months part time.
MPP - DATA SCIENCE TRACK
https://academy.microsoft.com/en-us/tracks/data-science/
FUNDAMENTALS | CORE DATA SCIENCE | APPLIED DATA SCIENCE | PROJECT
• DATA SCIENCE ORIENTATION
• QUERYING DATA WITH TRANSACT-SQL
• ANALYZING AND VISUALIZING DATA WITH EXCEL
• INTRODUCTION TO STATISTICS
• INTRODUCTION TO R FOR DATA SCIENCE
• DATA SCIENCE ESSENTIALS
• PRINCIPLES OF MACHINE LEARNING
• PROGRAMMING WITH R FOR DATA SCIENCE
• APPLIED MACHINE LEARNING SCENARIOS
• IMPLEMENTING PREDICTIVE MODELS WITH SPARK IN AZURE HDINSIGHT
• CORTANA COMPETITION
DEGREE
Azure Machine Learning, HDInsight
CIEP Curriculum Content
i. Free access to a world-class curriculum to train the next generation of applied data scientists, analysts and data developers.
ii. Combines fundamental concepts with real-world use cases suited to both business and technical data science tracks.
iii. Includes video tutorials, lecture material, hands-on exercises with data sets, and labs developed by academic and industry experts.
• Statistics with R
• Introduction to Big Data
• Data Engineering
• Machine Learning
• Advanced AI (optional)
• End to End Project in Data Science (Capstone)
Entry level
• Computer Scientist.
• Data Scientist.
• Data Engineer.
• AI Systems Developer.
• Data Analyst (if
Core AI
Business Understanding → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment
• Data Engineering
• Business Intelligence & Data Visualization
• Advanced IS/Data Mining (optional)
• End to End Project in Data Science (Capstone)
Entry level
• IS Professional.
• Data Engineer.
• AI Systems Developer.
• Data Analyst (if
Advanced IS is
UNDERGRADUATES | CUSTOMISED
We’re customising the three curricula above to give undergraduates the following:
1. MTA: Database Fundamentals
2. Introduction to Python
3. Introduction to Data Science
4. Introduction to R for Data Science
5. Analyzing Big Data with Microsoft R
6. Perform Cloud Data Science with Azure Machine Learning
Advanced Programming, Algorithms, Basic AI
Business Understanding → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment
Product First month users First quarter users First year users
[Figure: Data Science Team, Business Acumen, Data Dividend, Application Development, Data Management]
Understanding the role of Mathematics
• Given data x_1, x_2, …, x_{t-1}, predict x_t.
• Or estimate P(x_t | x_1, x_2, …, x_{t-1}).
• The basis of this is Bayes’ rule.
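As a worked form of the statement above, Bayes' rule can be written out for the prediction problem. This is a sketch using standard probability identities only; the shorthand $x_{1:t-1}$ for $x_1, \dots, x_{t-1}$ and the dummy variable $x'$ are introduced here for compactness and are not part of the original slide.

```latex
\begin{align}
P(x_t \mid x_{1:t-1})
  &= \frac{P(x_{1:t-1} \mid x_t)\, P(x_t)}{P(x_{1:t-1})}
  && \text{(Bayes' rule)} \\
P(x_{1:t-1})
  &= \sum_{x'} P(x_{1:t-1} \mid x_t = x')\, P(x_t = x')
  && \text{(normalising constant, discrete } x_t \text{)}
\end{align}
```

The prediction for $x_t$ is then the value with the highest conditional probability, or the full distribution if uncertainty matters.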
Machine Learning
• Supervised learning.
• Unsupervised learning.
• Reinforcement learning.
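The difference between the first two learning types can be illustrated with a short sketch in plain Python. It assumes one-dimensional made-up data and uses two simple stand-in techniques (a 1-nearest-neighbour classifier and a basic K-Means loop); the function names are hypothetical, and reinforcement learning is not covered here.

```python
def nearest_neighbor_predict(train_x, train_y, query):
    """Supervised: labelled examples are given; return the label of the closest one."""
    best = min(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    return train_y[best]

def k_means_1d(points, centers, iterations=10):
    """Unsupervised: no labels; alternate between assigning points and moving centers."""
    centers = list(centers)
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            groups[nearest].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers

# Made-up measurements forming two obvious groups.
values = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8]

# Supervised: labels are known for the training data.
labels = ["A", "A", "A", "B", "B", "B"]
predicted = nearest_neighbor_predict(values, labels, 4.5)

# Unsupervised: only the values are used; the groups are discovered.
centers = k_means_1d(values, centers=[0.0, 6.0])
```

The supervised call needs the `labels` list, while the clustering call never sees it and finds the two groups from the values alone.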
[Figure: Math/Stats Algorithms; IT Platform Management; Data Engineering; Application Development; Data Management; Data Dividend; Dividend at faster turn-around]
Cortana Intelligence Suite helps to boost data science graduates’ career prospects at a Malaysian university
[Figure: Computer Science. Courses: Hour of Code; Minecraft (STEM); Minecraft Redstone (Computer Science); Creative Coding Games and Apps; Technopreneurship (April 2017); Introduction to Python (April 2017); Harvard CS50 AP© Computer Science Principles; MTA: Software Development Fundamentals. Other labels: Software Engineer; Programmer; IT Infrastructure; Development Fundamentals; IT Pro; Productivity.]
• Introduction to Python for Data Science or Introduction to R for Data Science
Microsoft Certified Solutions Associate (MCSA) — Data Management & Analytics path
Microsoft Professional Programme (MPP) — Data Science Track
FUNDAMENTALS | CORE DATA SCIENCE | APPLIED DATA SCIENCE | PROJECT
• DATA SCIENCE ORIENTATION
• QUERYING DATA WITH TRANSACT-SQL
• ANALYZING AND VISUALIZING DATA WITH EXCEL
• INTRODUCTION TO STATISTICS
• INTRODUCTION TO R FOR DATA SCIENCE
• DATA SCIENCE ESSENTIALS
• PRINCIPLES OF MACHINE LEARNING
• PROGRAMMING WITH R FOR DATA SCIENCE
• APPLIED MACHINE LEARNING SCENARIOS
• IMPLEMENTING PREDICTIVE MODELS WITH SPARK IN AZURE HDINSIGHT
• CAPSTONE EVENT
Certificate of Completion
HDInsight
How the Microsoft Data Science program works
Topics:
• RDBMS; Database Design; DDL, DML
• Transform Data; Mashup Data; Data Modelling; Data Visualization
• Data Classification; Bayesian Model; Linear Regression
• R for Data Analysis; Data Structures; Custom Plots; Data Frames; Control Flow
• Data Science Process; Probability and Statistics; Machine Learning
• Classification; Regression; Clustering; Modelling
• Control Flow; Creating Visualizations; Predictive Analysis
• Time Series and Forecasting; Spatial Data; Text Analytics
Tools and techniques:
• Data exploration and visualization techniques; Statistics with the Excel Data Analysis Pack
• T-SQL Language; Data Manipulation in SQL Server; Azure SQL
• Excel Data Tools; DAX Engine; Power BI Cloud Service; Dashboards
• Data Visualization; Conditional Probability
• Python Language; R Syntax; Visualization and Exploration with R / Python; Visualizations with ggplot2; SciKit-Learn
• Azure ML; SQL; Azure; Natural Language
Three Microsoft Curriculum Tracks