
Notes on Intro to Data Science Udacity

Created 12/20/13
Updated 01/20/14, Updated 02/21/14, Updated 04/06/14, Updated 04/19/14, Updated 04/26/14, Updated 05/04/14
Updated 05/17/14, Updated 05/31/14

Introduction
The Introduction to Data Science class (Udacity UD359) will survey the foundational topics in data science,
namely:
Data Manipulation
Data Analysis with Statistics and Machine Learning
Data Communication with Information Visualization
Data at Scale -- Working with Big Data
The class will focus on breadth and present the topics briefly instead of focusing on a single topic in depth. This
will give you the opportunity to sample and apply the basic techniques of data science.
Presenters are Dave Holtz (Yub, TrialPay) and Cheng-Han Lee (formerly at Microsoft, now at Udacity).
Course duration and time required: approx. 2 months, assuming 6 hr/wk (work at your own pace)
URL: https://www.udacity.com/course/ud359
There is a similar course on Coursera, taught by Bill Howe of University of Washington.

Related Groups
https://groups.google.com/forum/#!forum/datsciprojects - this is the Monday night group (6:30), organized by Mike Wilber.
https://groups.google.com/forum/#!forum/spark-ml-and-big-data-analytics - this is the Wednesday night group (6:30), organized by Richard Walker.

Lectures
Lesson 1: Introduction to Data Science

Introduction to Data Science


What is a Data Scientist?
Pi-Chaun (Data Scientist @ Google): What is Data Science?
Gabor (Data Scientist @ Twitter): What is Data Science?

The definitions here included the ideas that a data scientist has math/stat knowledge, domain knowledge, and good
analysis skills. They don't presuppose any specific set of tools, though R and MATLAB-like tools are important. The
definitions also include a focus on communication skills, to get the results across.

Problems Solved by Data Science


Pandas = R + Python
Dataframes = data sets that can be run through algorithms
Create a New Dataframe - indicates that a dataframe is a data set with metadata, similar to stores and models in ExtJS.

At this point, we are starting to work with an R + Python environment, which the student should be installing on a
computer, though the initial exercises can be run interactively on the web.
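
For my own reference, a minimal sketch (not from the course materials) of building a pandas DataFrame from a plain
Python dict, which is the pandas analogue of an R data.frame; the data here is made up:

import pandas

# hypothetical example data; any dict of equal-length lists works
data = {'name': ['Alice', 'Bob', 'Carol'],
        'age': [34, 29, 41],
        'city': ['NYC', 'SF', 'Austin']}
df = pandas.DataFrame(data)
print(df.describe())      # summary statistics for the numeric columns
print(df['age'].mean())   # column access works like a dict of Series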


Lesson project: Titanic data. Can we predict who will survive? In this case, we were asked to fill in some Python
code that implements a heuristic that generates a prediction. In the first case, the performance is about 70%; then the
questions ask you to improve the heuristic to get better and better results.
import pandas

predictions = {}
df = pandas.read_csv(file_path)   # file_path points at the Titanic CSV
columns = ['Survived', 'Sex', 'Age', 'SibSp', 'Parch']
print(df[df['Survived'] == 1][columns])

You could export the data and run a regression analysis on it, for instance, but in this case you are on your own
to come up with a heuristic.
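
A rough sketch of the kind of heuristic the exercise expects (predict that female passengers survive); the function
name is mine, and the exact accuracy depends on the heuristic you choose:

import pandas

def simple_heuristic(file_path):
    predictions = {}
    df = pandas.read_csv(file_path)
    for _, passenger in df.iterrows():
        # predicting survival from 'Sex' alone is already a reasonable baseline
        predictions[passenger['PassengerId']] = 1 if passenger['Sex'] == 'female' else 0
    return predictions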
Final Lectures:
Pi-Chaun (Data Scientist @ Google): What is Data Science?
Gabor (Data Scientist @ Twitter): What is Data Science?
Summary: there is no cookie-cutter way to become a data scientist, but a strong mathematical background is very important.

Our next lesson will focus on getting data and loading it so that you can get value from it.

Lesson 2: Data Wrangling

What is Data Wrangling?


Acquiring Data: from databases, from text files, from the web (scraping). Versions include CSV, XML, and JSON.

The lectures point out that data wrangling can take 70% of your time.

Common Data Formats, examples of each of the common formats. Baseball data is used here. JSON data looks like a dictionary.

The baseball database is at: http://www.seanlahman.com/baseball-archive/statistics/


Assignment: load CSV data, add another column to it.
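
A sketch of roughly what this assignment looks like (I believe it combined the nameFirst and nameLast columns from
the Lahman Master table into a new column; the function name here is my own):

import pandas

def add_full_name(path_to_csv, path_to_new_csv):
    df = pandas.read_csv(path_to_csv)
    df['nameFull'] = df['nameFirst'] + ' ' + df['nameLast']
    df.to_csv(path_to_new_csv, index=False)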

What are Relational Databases?


Aadhaar Data (this is from an Indian national registry project)
Aadhaar Data and Relational Databases
Introduction to Database Schemas
APIs

The examples are based on the API of last.fm. The structure of the request is located in the URL.
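
A hedged sketch of what such a request looks like in Python; the method and api_key below are placeholders following
last.fm's documented URL pattern, not values from the course:

import json
import requests

url = ('http://ws.audioscrobbler.com/2.0/'
       '?method=artist.gettopalbums&artist=Cher&api_key=YOUR_API_KEY&format=json')
data = requests.get(url).json()     # the JSON response parses straight into a dict
print(json.dumps(data, indent=2))   # pretty-print the nested structure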

Data in JSON Format


How to Access an API efficiently
Missing Values

Two ways to deal with missing values: partial deletion, and imputing data. Approaches to imputing data include
using the average, or performing linear regression. There are functions within pandas to perform this in two steps:

import numpy

# baseball is a DataFrame loaded earlier from the Lahman data
x = numpy.mean(baseball['weight'])
baseball['weight'] = baseball['weight'].fillna(x)

More sophisticated functions exist as well.

Easy Imputation
Impute using Linear Regression

This was discussed but not demonstrated.
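
Here is my own hedged sketch of what regression-based imputation might look like, using numpy.polyfit to predict
weight from height in the baseball data (not the lesson's code):

import numpy
import pandas

baseball = pandas.read_csv('Master.csv')     # Lahman player table with height/weight
known = baseball[baseball['weight'].notnull() & baseball['height'].notnull()]
slope, intercept = numpy.polyfit(known['height'], known['weight'], 1)

# fill only the rows where weight is missing but height is available
missing = baseball['weight'].isnull() & baseball['height'].notnull()
baseball.loc[missing, 'weight'] = slope * baseball.loc[missing, 'height'] + intercept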


Tip of the Imputation Iceberg

Lesson Project: Wrangle NYC subway and weather data (11 parts). Rough sketches of a few of these steps follow the list.
Step 1: Number of rainy days - SQL query
Step 2: Temp on foggy and non-foggy days - SQL query
Step 5: Fixing NYC subway turnstile data - this focused on operations using pandas data frames, and
deriving a corrected data set.
Step 7: Filtering irregular data - another example of using pandas data frames, such as
turnstile_data[(turnstile_data.DESCn == 'REGULAR')]
Step 9: Get hourly exits - in this case we are comparing a count value for a data set that has hourly values
with the value in the row above, for the prior hour.
Step 10: Time to hour - this was a way to learn some of the operations on a datetime, such as extracting a
field, converting to a string and extracting a character, etc. Several different versions of solutions were posted.
Step 11: Reformat subway dates - demonstrated a few formatting operations on a datetime, including
converting back to a string and extracting a substring.
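
Rough sketches of the kinds of operations these steps ask for; the file names are placeholders, while the column
names (rain, EXITSn) match the course CSVs:

import pandas
from pandasql import sqldf

# Step 1: run SQL directly against a DataFrame with pandasql
weather_data = pandas.read_csv('weather_underground.csv')    # placeholder path
q = "SELECT count(*) AS num_rainy_days FROM weather_data WHERE rain = 1;"
rainy_days = sqldf(q, locals())

# Step 9: hourly exits as the difference from the previous row
turnstile = pandas.read_csv('turnstile_data.csv')             # placeholder path
turnstile['EXITSn_hourly'] = (turnstile['EXITSn'] - turnstile['EXITSn'].shift(1)).fillna(0)

# Step 10: pull the hour out of an 'HH:MM:SS' time string
hour = int('09:00:00'[:2])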

Lesson 3: Data Analysis

Statistical Rigor
Kurt (Data Scientist @ Twitter) - Why is Stats Useful?
Introduction to Normal Distribution
T Test
Welch T Test

By this point, we have seen how to understand tests, and seen a formulation in Python.
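
A minimal sketch of running this in Python with scipy (the equal_var=False flag is what makes scipy's t-test the
Welch version); the data here is made up:

import numpy as np
import scipy.stats

sample_a = np.random.normal(loc=5.0, scale=1.0, size=200)
sample_b = np.random.normal(loc=5.3, scale=2.0, size=150)
t_stat, p_value = scipy.stats.ttest_ind(sample_a, sample_b, equal_var=False)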

Non-Parametric Tests

These don't assume that the data is drawn from any specific probability distribution. An example is the Mann-Whitney
U test.

Non-Normal Data

Shapiro-Wilk test
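
A quick sketch of the Shapiro-Wilk normality test in scipy; a small p-value is evidence that the data is not
normally distributed (the exponential sample here is deliberately non-normal):

import numpy as np
import scipy.stats

data = np.random.exponential(scale=2.0, size=500)
w_stat, p_value = scipy.stats.shapiro(data)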

Stats vs. Machine Learning


Different Types of Machine Learning
Prediction with Regression

For instance, predict home runs given information about a baseball player

Cost Function

This introduces steepest-descent (gradient descent) methods, which are built around the idea of a cost function, where
the cost function is typically the sum of the squares of the errors.

How to Minimize Cost Function

This is intended to be a discussion of an algorithm, but it is rather weak. One of the students added a note linking to
the Coursera machine learning class as a better source.
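
For my own notes, a minimal sketch of batch gradient descent for linear regression (the 1/(2m) scaling follows the
usual convention; this is my version, not the lesson's exact code):

import numpy as np

def compute_cost(features, values, theta):
    # cost = sum of squared errors, scaled by 1/(2m)
    m = len(values)
    predictions = features.dot(theta)
    return np.square(predictions - values).sum() / (2 * m)

def gradient_descent(features, values, theta, alpha, num_iterations):
    # repeatedly step theta in the direction that reduces the cost
    m = len(values)
    for _ in range(num_iterations):
        predictions = features.dot(theta)
        theta = theta - (alpha / m) * features.T.dot(predictions - values)
    return theta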


Coefficient of Determination

This value is also called R-squared. The closer it is to 1, the better our model fits.
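
Computing it by hand is short: R-squared = 1 - (sum of squared residuals) / (total sum of squares). A sketch:

import numpy as np

def compute_r_squared(data, predictions):
    ss_res = np.square(data - predictions).sum()
    ss_tot = np.square(data - np.mean(data)).sum()
    return 1 - ss_res / ss_tot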
The final lectures list other issues to take into account. Gradient descent is only one implementation of linear
regression. Another issue is overfitting; cross-validation is one approach to dealing with it. The cost function may
also have local minima.
Lesson Project: Analyze NYC subway and weather data. We will be analyzing data, and modeling links between
weather and ridership, for instance.
Step 1: Exploratory Data Analysis - examine the hourly entries in our NYC subway data and determine
what distribution the data follows. In this case, we are using matplotlib with pandas, although there has
been little discussion of matplotlib in the source materials. My plotting code:

import numpy as np
import scipy.stats
import matplotlib.pyplot as plt

# turnstile_weather is the project data frame provided by the exercise
plt.figure()
x = turnstile_weather[turnstile_weather['rain'] == 1]
y = turnstile_weather[turnstile_weather['rain'] == 0]
x['ENTRIESn_hourly'].hist(color='r', bins=30, alpha=0.5, label='Rain')
y['ENTRIESn_hourly'].hist(color='b', bins=30, alpha=0.5, label='No rain')
plt.xlabel("Ridership")
plt.ylabel("Counts")
plt.legend()
plt.show()

Step 2: Welch's t-test - this just asks whether you think the t-test applies. Since the data is not
normal, it doesn't.
Step 3: Mann-Whitney U test - there is not a lot of explanation here, simply a request to run this test,
which is built into the stats libraries:

a = turnstile_weather['ENTRIESn_hourly'][turnstile_weather['rain'] == 1]
b = turnstile_weather['ENTRIESn_hourly'][turnstile_weather['rain'] == 0]
with_rain_mean = np.mean(a)
without_rain_mean = np.mean(b)
U, p = scipy.stats.mannwhitneyu(a, b)

Step 4: Rainy day ridership vs. non-rainy day ridership - this asks you a question about interpreting the
Mann-Whitney test.
Step 5: Linear regression - this asks you to integrate the cost function and calls to steepest descent into an
analysis of ridership data. Since this dataset is much larger than the baseball set, you run on a subset
of the data.
Step 6: Plot residuals - this adds to Step 5, and asks you to generate a histogram of (predicted minus actual) values.
Step 7: Compute R-squared - this also adds to Step 5.
Step 8: Non gradient-descent linear regression - this is an example of using the ordinary least squares
analytical formulation, which is built into a function called stats.linregress. However, it requires that the
input matrix avoid problems such as collinearity. I had trouble coming up with a set of columns that didn't
exhibit this behavior. A sketch of the analytical approach follows.
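
This is a hedged sketch of the analytical OLS route using statsmodels rather than stats.linregress (which handles only
a single predictor); the column choices are illustrative, not a recommendation:

import pandas
import statsmodels.api as sm

turnstile_weather = pandas.read_csv('turnstile_data_master_with_weather.csv')
features = turnstile_weather[['rain', 'precipi', 'Hour', 'meantempi']]
features = sm.add_constant(features)                 # add the intercept term
values = turnstile_weather['ENTRIESn_hourly']

model = sm.OLS(values, features).fit()               # analytical least squares
predictions = model.predict(features)
print(model.params)                                  # fitted coefficients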

Lesson 4: Data Visualization

Effective Information Visualization


Napoleon's March on Russia

Discussion of the classic diagram by Charles Joseph Minard.

Don (Principal Data Scientist @ AT&T): Communicating Findings


Rishiraj (Principal Data Scientist @ AT&T): Communicating Findings Well
Visual Encodings

Here is a discussion of using lines, colors, indicators, thickness/direction of lines, etc.

Perception of Visual Cues

The different possible ways to encode information are ranked by people's perception.


Types of charts

We have seen a chart-type selection diagram in a similar course, and wanted to add it here.

Plotting in Python

We are going to use ggplot instead of matplotlib. The former looks nicer, and follows a "grammar of graphics". What is
meant by a grammar of graphics is a set of composable graphing components, similar to d3/nvd3. This in turn encourages us to
think about scales and other low-level elements of the chart, rather than simply having all of the configuration
provided to us and not visible or controllable.

Data Scales

This is similar to the discussion about scales in d3. Here is an example:


import pandas as pd
from ggplot import *

def plot_homeruns(hr_year_csv):
    # hr_year_csv: path to a CSV of home runs by year
    f = pd.read_csv(hr_year_csv)
    p = ggplot(f, aes(x='yearID', y='HR')) + geom_point(color='red') + geom_line(color='red')
    p = p + ggtitle('Number of Home Runs by Year') + xlab('Year') + ylab('Home Runs')
    return p

Commentary
After reading the help files of Python's ggplot clone, the documentation basically says: "Warning, ggplot is NOT
Pythonic!" This means it is weird to use. Don't worry; it was designed to be weird. It follows the rules of a famous
book called The Grammar of Graphics (a book written before most of us were born!).
Don't worry about it being weird; you can just copy and paste some code, then modify it for yourself.


Visualizing Time Series Data

A LOESS curve can emphasize long-term trends.
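
A sketch (my own, not the course's code) of fitting a LOESS/LOWESS curve to a daily ridership series with statsmodels:

import pandas
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

df = pandas.read_csv('turnstile_data_master_with_weather.csv')
daily = df.groupby('DATEn')['ENTRIESn_hourly'].sum().reset_index()

x = range(len(daily))                              # day index as the x axis
smoothed = lowess(daily['ENTRIESn_hourly'], x, frac=0.3)

plt.plot(x, daily['ENTRIESn_hourly'], '.', label='daily entries')
plt.plot(smoothed[:, 0], smoothed[:, 1], label='LOESS trend')
plt.legend()
plt.show()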


Lesson Project: Visualizing NYC subway data
Step 1: Visualization 1: this was a simple map by date
Step 2: Visualization 2: here I produced a bar chart by date; I had trouble grouping it any other way.

Lesson 5: MapReduce

Big Data and MapReduce

This includes a discussion about how much data counts as big data. The answer was several terabytes or more.

Basics of MapReduce
Mapper
Reducer

At this point, we have been introduced to the concepts, and the idea that MapReduce performs partitioning of a large
data problem has been described.

MapReduce with Aadhaar Data

At this point, we are writing mapper and reducer functions in Python. However, they appear to be operating on the
same datasets as in earlier lessons (Aadhaar data), so we are not really using Hadoop or a compute cluster. But this
is similar to the use of MapReduce in our MongoDB work back in 2012.
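
A hedged sketch of the Hadoop-streaming style of mapper and reducer used here (read stdin, emit tab-separated
key/value pairs); the column index for the district field is an assumption, not the exercise's exact layout:

import sys

def mapper():
    # emit (district, 1) for every input record
    for line in sys.stdin:
        fields = line.strip().split(',')
        if len(fields) > 3:
            district = fields[3]          # assumed position of the district column
            print('{0}\t{1}'.format(district, 1))

def reducer():
    # input arrives sorted by key, so totals can be accumulated per key
    current_key, total = None, 0
    for line in sys.stdin:
        key, value = line.strip().split('\t')
        if current_key is not None and key != current_key:
            print('{0}\t{1}'.format(current_key, total))
            total = 0
        current_key = key
        total += int(value)
    if current_key is not None:
        print('{0}\t{1}'.format(current_key, total))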

MapReduce Ecosystem

At this point the discussion started to include the term Hadoop, and really large configurations are described. It
would appear that setting up such examples, using compute resources at classroom scale, is beyond the scope
of the course.

Joshua (Data Scientist @ Twitter): MapReduce Tools, Pig

So we get to hear from industry experts about Hadoop, Hive, and Pig. Hive is a library that makes it easier to write
Hadoop jobs. Pig is a high-level platform for creating MapReduce programs used with Hadoop.
The language for this platform is called Pig Latin. Pig Latin abstracts the programming away from the Java MapReduce
idiom into a notation that makes MapReduce programming high level, similar to that of SQL for RDBMS systems.
Pig Latin can be extended using UDFs (User Defined Functions), which the user can write in Java, Python,
JavaScript, Ruby or Groovy and then call directly from the language. Hive was developed at Facebook, and Pig at
Yahoo.

MapReduce with Subway Data

Lesson 6: Final Project


This is up to each individual student or group.

Appendix A: Set Up
Local installation
You would need to install the following Python libraries and packages to run the assignments on your own
computer:


pandas
numpy
scipy
statsmodels
ggplot
matplotlib
pandasql
We would highly recommend that you install Anaconda, which should contain most of the libraries and packages that
you need to work on the assignments.
One caveat is that Anaconda does not include pandasql but installing it after Anaconda is as easy as:
pip install -U pandasql
From the same company there is a hosted data science toolbox, www.wakari.io: nothing to install, and the Anaconda
distribution is included.

Virtual machine
There is a Vagrant specification for a virtual machine here:
https://github.com/asimihsan/intro-to-data-science-udacity

By following these instructions, which are also present in the Git repository README file, you will be able to
create a virtual machine on Linux, Mac OS X, or Windows, that will include all dependencies required for this
course, and additionally be able to use IPython Notebooks, which make following this class much easier.

Install VirtualBox: https://www.virtualbox.org/wiki/Downloads

Install Vagrant: http://www.vagrantup.com/downloads.html

Download this repository's contents to your machine. Either:

1. Install Git, then clone this repo to your computer: git clone git@github.com:asimihsan/intro-to-data-science-udacity.git, OR

2. Download then extract a ZIP file of this repo.

Change directory to your clone: cd intro-to-data-science-udacity

From the root of the clone run: vagrant up

Check for errors. There should be none. A warning about the version of the Guest Additions is harmless.

SSH onto the box using: vagrant ssh.

For more basic information on using Vagrant refer to the official documentation: http://docs.vagrantup.com/v2/getting-started/index.html

After starting the virtual machine you can run an IPython Notebook server by running the following inside the guest VM:
ipython notebook --ip 0.0.0.0 --pylab inline. Then on your host machine browse to http://localhost:58888. Congratulations!

Appendix B: Project Files


There was a 16MB download provided with the course.


Baseball data
http://www.seanlahman.com/baseball-archive/statistics/
Contains player information and scores going back to 1871.

NYC Subway data


The best example file was turnstile_data_master_with_weather.csv. This is over 15MB and has over 10,000 rows by
22 columns of data about turnstile locations, routes, usage by period (4-hour intervals in some cases, other intervals
in other cases) from 5/1/2011 through 5/15/2011, and the weather on each date.
For discussion see http://chriswhong.com/open-data/visualizing-the-mtas-turnstile-data/
This data is from https://nycopendata.socrata.com/. Apparently there are over 1,000 data sets available there.

Appendix C: What other projects and data are relevant?


RO data: construct a data set of activity events joined to user data for the K challenge. Determine if there is a statistical
difference between tracking in the different activity types of the challenge.
RO data: construct a data set of challenge members joined to user data for the K challenge. Determine a model to predict
a challenge member's final score given user data.
JG data: construct a data set of donations joined to user data for 2012-2013. Determine a model to predict donation
amount given user data.
JG data: construct a data set of charity information joined to donation data for 2012-2013. Determine if there is a
relationship between charity output divided by charity capitalization and donation amounts.
Scott Davis <scdavis6@gmail.com> May 16 09:32PM -0700
Just got access to a plethora of Amazon datasets by professor McAuley from Stanford. Hope the site helps anyone
in the beginning stages of getting their project together.
Here's the link: http://snap.stanford.edu/data/web-Amazon-links.html.

