
MapReduce with Python

Topic II Extra for Stats Applications II


Outline: regression and linear models
Write and try a mapper and reducer for more statistical procedures
• Now what about calculating the sample standard deviation, correlation, simple linear regression, a paired t-test, or an A/B test?
• Task 1: compute the standard deviation of annual salaries per job title.
• Task 2: compute the sample correlation between annual salaries and gross pay.
• Task 3: fit a simple linear regression line through the origin to predict gross pay from annual salary.
• Task 4: identify outliers and data patterns.
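
Tasks 1 to 3 all reduce to a handful of running sums per job title, which is why they suit a mapper/reducer split. A minimal sketch, assuming x is annual salary and y is gross pay within one job title; the function name `summary_stats` is illustrative and not taken from the course scripts. It needs at least two rows, and neither x nor y may be constant.

```python
import math

def summary_stats(pairs):
    """Compute sd of x, correlation of (x, y), and the through-origin slope
    from running sums only, the way a reducer could accumulate them."""
    n = sx = sy = sxx = syy = sxy = 0.0
    for x, y in pairs:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        syy += y * y
        sxy += x * y
    # Corrected (centered) sums of squares and cross-products.
    cxx = sxx - sx * sx / n
    cyy = syy - sy * sy / n
    cxy = sxy - sx * sy / n
    sd_x = math.sqrt(cxx / (n - 1))     # Task 1: sample standard deviation
    corr = cxy / math.sqrt(cxx * cyy)   # Task 2: sample correlation
    slope0 = sxy / sxx                  # Task 3: slope of the line through the origin
    return sd_x, corr, slope0
```

With salary-sized values the raw sums of squares can lose precision, so a production version might center the data or use a streaming update instead.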

2 STA 9760 Big Data Tech | Spring 2018 | Junyi Zhang 3/23/2018
Regression w/ an intercept and a slope
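• Simple Linear Regression Model:

$$Y_{ij} = \alpha_i + \beta_i \times X_{ij} + \epsilon_{ij},$$

where $Y_{ij}$ is the GrossPay observed for the $j$-th employee with the $i$-th JobTitle (such as Account II); $X_{ij}$ is the AnnualSalary for the $j$-th employee with the $i$-th JobTitle; the $\epsilon_{ij}$'s are independently and identically distributed as normal with mean 0 and variance $\sigma_i^2$.

• Parameters for estimation: $(\alpha_i, \beta_i, \sigma_i^2)$; all are job (group) specific coefficients.

• Least-squares solution:

$$\begin{pmatrix} \hat\alpha_i \\ \hat\beta_i \end{pmatrix} = \begin{pmatrix} \sum_j 1 & \sum_j x_{ij} \\ \sum_j x_{ij} & \sum_j x_{ij}^2 \end{pmatrix}^{-1} \times \begin{pmatrix} \sum_j y_{ij} \\ \sum_j x_{ij} y_{ij} \end{pmatrix},$$

$$\hat\sigma_i^2 = MSE = \frac{1}{n_i - 2} \left[ \sum_j y_{ij}^2 - \begin{pmatrix} \sum_j y_{ij} \\ \sum_j x_{ij} y_{ij} \end{pmatrix}^{\top} \begin{pmatrix} \sum_j 1 & \sum_j x_{ij} \\ \sum_j x_{ij} & \sum_j x_{ij}^2 \end{pmatrix}^{-1} \begin{pmatrix} \sum_j y_{ij} \\ \sum_j x_{ij} y_{ij} \end{pmatrix} \right],$$

where $n_i = \sum_j 1$ is the number of employees with the $i$-th JobTitle.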
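
Because the matrix is only 2×2, the least-squares solution can be written out in closed form from six running sums. A minimal sketch under that assumption; the function name `fit_from_sums` and its argument names are illustrative and do not come from the course scripts. The MSE here is the residual sum of squares divided by n − 2.

```python
def fit_from_sums(n, sx, sy, sxx, sxy, syy):
    """Solve the 2x2 normal equations for one job title.

    Arguments are the running sums: n = sum(1), sx = sum(x), sy = sum(y),
    sxx = sum(x*x), sxy = sum(x*y), syy = sum(y*y).
    Returns (alpha_hat, beta_hat, mse).
    """
    det = n * sxx - sx * sx  # determinant of [[n, sx], [sx, sxx]]
    if n <= 2 or det == 0:
        raise ValueError("need more than 2 rows and at least 2 distinct x values")
    alpha = (sxx * sy - sx * sxy) / det
    beta = (n * sxy - sx * sy) / det
    # Residual sum of squares: sum(y^2) minus the part explained by the fit.
    sse = syy - (alpha * sy + beta * sxy)
    mse = sse / (n - 2)
    return alpha, beta, mse
```

For the toy data x = 0, 1, 2, 3 and y = 1, 3, 4, 8 the sums are n = 4, Σx = 6, Σy = 16, Σx² = 14, Σxy = 35, Σy² = 90, and `fit_from_sums(4, 6, 16, 14, 35, 90)` gives an intercept of 0.7, a slope of 2.2, and an MSE of 0.9 (up to floating-point rounding).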

Regression w/ an intercept and a slope
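• Simple Linear Regression Model:

$$Y_{ij} = \alpha_i + \beta_i \times X_{ij} + \epsilon_{ij}$$

• Least-squares solution, written as sums over the employees $j$ with the $i$-th JobTitle:

$$\begin{pmatrix} \hat\alpha_i \\ \hat\beta_i \end{pmatrix} = \left[ \sum_j \begin{pmatrix} 1 & x_{ij} \\ x_{ij} & x_{ij}^2 \end{pmatrix} \right]^{-1} \times \sum_j \begin{pmatrix} y_{ij} \\ x_{ij} \times y_{ij} \end{pmatrix},$$

$$\hat\sigma_i^2 = MSE = \frac{1}{n_i - 2} \left\{ \sum_j y_{ij}^2 - \left[ \sum_j \begin{pmatrix} y_{ij} \\ x_{ij} y_{ij} \end{pmatrix} \right]^{\top} \times \left[ \sum_j \begin{pmatrix} 1 & x_{ij} \\ x_{ij} & x_{ij}^2 \end{pmatrix} \right]^{-1} \times \sum_j \begin{pmatrix} y_{ij} \\ x_{ij} y_{ij} \end{pmatrix} \right\},$$

where $n_i$ is the number of employees with the $i$-th JobTitle.

• The variances of the intercept and slope estimates are the diagonal elements of the following matrix:

$$\hat\sigma_i^2 \times \left[ \sum_j \begin{pmatrix} 1 & x_{ij} \\ x_{ij} & x_{ij}^2 \end{pmatrix} \right]^{-1}$$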
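
The diagonal elements can be read off in closed form, since the inverse of a 2×2 matrix is explicit. A minimal sketch, assuming the MSE has already been computed for the job title; the function name `coef_std_errors` is illustrative, not from the course scripts.

```python
import math

def coef_std_errors(n, sx, sxx, mse):
    """Standard errors of the intercept and slope estimates.

    They are the square roots of the diagonal of
    mse * inverse([[n, sx], [sx, sxx]]), where n = sum(1),
    sx = sum(x), and sxx = sum(x*x) for one job title.
    """
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("all x values are equal; the matrix is singular")
    var_alpha = mse * sxx / det  # (1,1) element: variance of the intercept
    var_beta = mse * n / det     # (2,2) element: variance of the slope
    return math.sqrt(var_alpha), math.sqrt(var_beta)
```

Dividing each estimate by its standard error gives the usual t statistic with n − 2 degrees of freedom.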

Regression w/ an intercept and a slope
salary_grosspay_regr_mapper2.py
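
A minimal sketch of what a mapper for this model could look like, not the course script itself. It assumes the input is a CSV file with a header row containing columns named `JobTitle`, `AnnualSalary`, and `GrossPay`, and that money values may carry a `$` sign or thousands separators; the function names `parse_money` and `run_mapper` are hypothetical.

```python
import csv

def parse_money(text):
    """Turn a string such as '$51,862.00' into a float; None if blank or invalid."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None

def run_mapper(lines, out):
    """Emit 'JobTitle<TAB>AnnualSalary<TAB>GrossPay' for every usable row.

    `lines` is any iterable of CSV lines (assumed column names, see above);
    `out` is any object with a write() method.
    """
    for row in csv.DictReader(lines):
        title = (row.get("JobTitle") or "").strip()
        x = parse_money(row.get("AnnualSalary") or "")
        y = parse_money(row.get("GrossPay") or "")
        # Rows with a missing key or a missing value cannot enter the regression.
        if title and x is not None and y is not None:
            out.write(f"{title}\t{x}\t{y}\n")
```

As a streaming script, this would end with `import sys` and a call to `run_mapper(sys.stdin, sys.stdout)`. Emitting the raw (x, y) pair keeps the mapper simple and leaves all the summing to the reducer.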


Regression w/ an intercept and a slope
salary_grosspay_regr_reducer2.py
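
A minimal sketch of what a reducer for this model could look like, not the course script itself. It assumes each input line has the form `JobTitle<TAB>AnnualSalary<TAB>GrossPay` and that lines arrive sorted by job title; the function names `parse_rows` and `run_reducer` and the output layout are hypothetical.

```python
from itertools import groupby

def parse_rows(lines):
    """Yield (title, x, y) from 'title<TAB>x<TAB>y' lines, skipping malformed ones."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue
        try:
            yield parts[0], float(parts[1]), float(parts[2])
        except ValueError:
            continue

def run_reducer(lines, out):
    """Fit y = alpha + beta * x per job title; input must be sorted by title."""
    for title, rows in groupby(parse_rows(lines), key=lambda r: r[0]):
        n = sx = sy = sxx = sxy = syy = 0.0
        for _, x, y in rows:
            n += 1
            sx += x
            sy += y
            sxx += x * x
            sxy += x * y
            syy += y * y
        det = n * sxx - sx * sx
        if n <= 2 or det == 0:
            continue  # too few rows, or all x equal: the fit is undefined
        alpha = (sxx * sy - sx * sxy) / det
        beta = (n * sxy - sx * sy) / det
        mse = (syy - alpha * sy - beta * sxy) / (n - 2)
        # Output columns: title, n, intercept, slope, MSE.
        out.write(f"{title}\t{int(n)}\t{alpha:.4f}\t{beta:.4f}\t{mse:.4f}\n")
```

As a streaming script, this would end with `import sys` and a call to `run_reducer(sys.stdin, sys.stdout)`. Job titles with two or fewer rows are skipped because the MSE divides by n − 2.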


Regression w/ an intercept and a slope
Run salary_grosspay_regr_mapper2.py


Regression w/ an intercept and a slope
Run salary_grosspay_regr_reducer2.py
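
When testing a reducer locally, its input has to be sorted by key first, because a streaming reducer expects all lines for one key to arrive together. A minimal sketch of that sort-and-group step in plain Python, assuming mapper output lines of the form `key<TAB>value`; the function name `shuffle_and_group` is illustrative and not part of any course script or of Hadoop.

```python
from itertools import groupby

def shuffle_and_group(mapper_lines):
    """Mimic the sort between mapper and reducer on a small local sample.

    Returns a dict mapping each key to the list of its value strings.
    Lines without a tab are ignored.
    """
    keyed = []
    for line in mapper_lines:
        parts = line.rstrip("\n").split("\t", 1)
        if len(parts) == 2:
            keyed.append((parts[0], parts[1]))
    keyed.sort()  # the sort is what brings equal keys next to each other
    return {
        key: [value for _, value in group]
        for key, group in groupby(keyed, key=lambda kv: kv[0])
    }
```

Without the sort, `groupby` would start a new group every time the key changes, and a job title that appears in several places in the file would be fitted several times on partial data.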

Regression w/ an intercept and a slope
Results after running the reducer:

