Table of Contents
1. Prologue
2. Chapter 0 - Data Scientist's Toolbox
3. Chapter 1 - R Programming
4. Chapter 2 - Getting and Cleaning Data
5. Chapter 3 - Exploratory Data Analysis
6. Chapter 4 - Reproducible Research
7. Chapter 5 - Statistical Inference
8. Chapter 6 - Regression Models
9. Chapter 7 - Practical Machine Learning
10. Chapter 8 - Developing Data Products
11. Capstone
12. Epilogue
Prologue
Welcome recruits!
During the next year you will learn the fundamentals of data science. The Data Science Specialization, offered by Johns
Hopkins University, is challenging. Success requires a strategy. This book aims to equip each of you with the knowledge
and skills to complete boot-camp. The "Data Science Boot-Camp Survival Manual" alone cannot guarantee success. Listen
to the instructor's lectures and apply yourself to the evaluations throughout your training.
According to Jeff Leek and the Data Science Specialization Team the key word in data science is "science". To this end, the
focus of the ten-course series including a capstone project is to provide the learner with:
1. an introduction to the key ideas behind reproducible research,
2. an introduction to the tools and techniques to transform raw data into a presentable report,
3. an opportunity to gain hands-on practice so you can learn the techniques for yourself, and
4. an appreciation of the mathematics & statistics involved in data science.
Core Courses
The courses comprising the Data Science Specialization are:
Data Scientist's Toolbox
R Programming
Getting and Cleaning Data
Exploratory Data Analysis
Reproducible Research
Statistical Inference
Regression Models
Practical Machine Learning
Developing Data Products
These courses, taught by Brian Caffo, Jeff Leek, and Roger D. Peng, give the learner the foundational skills. While
the lectures and assignments build these foundational skills, learners often require further explanation. The course
forums allow learners to discuss the lecture topics and assignments, yet each session of a course begins without the
shared knowledge of previous participants. Serving as a Community Teaching Assistant (CTA) made it clear that a companion
guide would be beneficial.
Are you up to the challenge of Johns Hopkins University's Data Science Specialization?
Chapter 1: R Programming
URL: https://www.coursera.org/course/rprog
Synopsis: "Learn how to program in R and how to use R for effective data analysis. This is the second course in the Johns
Hopkins Data Science Specialization."
Chapter 2: Getting and Cleaning Data
URL: https://www.coursera.org/course/getdata
Synopsis: "Learn how to gather, clean, and manage data from a variety of sources. This is the third course in the Johns
Hopkins Data Science Specialization."
Chapter 3: Exploratory Data Analysis
URL: https://www.coursera.org/course/exdata
Synopsis: "Learn the essential exploratory techniques for summarizing data. This is the fourth course in the Johns Hopkins
Data Science Specialization."
Chapter 4: Reproducible Research
URL: https://www.coursera.org/course/repdata
Synopsis: "Learn the concepts and tools behind reporting modern data analyses in a reproducible manner. This is the fifth
course in the Johns Hopkins Data Science Specialization."
Chapter 5: Statistical Inference
URL: https://www.coursera.org/course/statinference
Synopsis: "Learn how to draw conclusions about populations or scientific truths from data. This is the sixth course in the
Johns Hopkins Data Science Course Track."
Chapter 6: Regression Models
URL: https://www.coursera.org/course/regmods
Synopsis: "Learn how to use regression models, the most important statistical analysis tool in the data scientist's toolkit.
This is the seventh course in the Johns Hopkins Data Science Specialization."
Chapter 7: Practical Machine Learning
URL: https://www.coursera.org/course/predmachlearn
Synopsis: "Learn the basic components of building and applying prediction functions with an emphasis on practical
applications. This is the eighth course in the Johns Hopkins Data Science Specialization."
Chapter 8: Developing Data Products
URL: https://www.coursera.org/course/devdataprod
Synopsis: "Learn the basics of creating data products using Shiny, R packages, and interactive graphics. This is the ninth
course in the Johns Hopkins Data Science Specialization."
Data Science Capstone
URL: https://www.coursera.org/course/dsscapstone
Synopsis: "The capstone project class will allow students to create a usable/public data product that can be used to show
your skills to potential employers. Projects will be drawn from real-world problems and will be conducted with industry,
government, and academic partners."
Course synopses quoted from the course information pages at Coursera as of 1 April 2015.
Figure 1 Course dependency diagram provided by Daniel M. Bontje (created 17 November 2014)
You need a language or system to perform the tasks (R Programming) and data to analyse (Getting and Cleaning Data). You
then get a sense of the data (Exploratory Data Analysis) before building models and drawing inferences (Statistical
Inference, Regression Models) or making predictions (Practical Machine Learning), and finally present your conclusions
and supporting evidence (Developing Data Products, Reproducible Research).
The recommended mathematics background is linear algebra and introductory statistics (descriptive and inferential).
Statistical Inference and Regression Models, courses in this specialisation, cover all the basic statistical concepts forming a
solid foundation for subsequent courses in the Data Science Specialization. These courses along with Practical Machine
Learning are the theoretical underpinnings, while the other six courses are applied in nature: obtaining data, scrubbing
data, exploring data, modeling data, and interpreting data, collectively known as the OSEMN (pronounced "awesome")
model.
Again welcome to the Data Science Boot-Camp. Review the "Data Science Boot-Camp Survival Manual" on a regular basis
throughout your training.
Learning Objectives
You will have learned the basic skills to successfully use the various tools required throughout the book and the data
science specialisation courses.
Virtualisation Software
While the various applications required for these courses can be installed on the host operating system of your computer,
we recommend using virtualisation software such as Oracle VirtualBox, VMware Workstation, Fusion or Player, or
Parallels Desktop, depending upon the operating system running on the computer. Another virtualisation option is the
RStudio Server Amazon Machine Image (AMI), or rolling your own local or hosted virtual machine instance.
This section will describe two scenarios:
importing a ready-made disk image (AMI) of Ubuntu Linux 14.04 LTS (64-bit) on the Amazon Web Services Elastic
Compute Cloud (AWS EC2) hosting platform, and
importing a ready-made disk image of Ubuntu Linux 14.04 LTS (32-bit or 64-bit) into Oracle VirtualBox on your
computer.
An advantage of virtualisation software, running on your computer or remotely hosted by a service provider, is all the
required applications are kept separate from your computer's operating system and by default isolated from the host file
system.
Option A: Amazon Web Services Elastic Compute Cloud (EC2) with Amazon Machine Image
If you prefer installing Oracle VirtualBox and creating a virtual machine on your computer, you can skip this section.
Instructions are forthcoming.
Option B: Local Computer with Oracle VirtualBox
Please consult the instructions about downloading and installing Oracle VirtualBox onto your computer before proceeding.
Download the ready-made disk image of Ubuntu Linux (32-bit or 64-bit) based on the version supported by Oracle
VirtualBox and the architecture of the computer.
Note: Some computers are 64-bit but only allow 32-bit operating systems to run within virtualisation software.
Extract the compressed archive containing the disk image using p7zip.
$ 7za e Ubuntu_14.04.2-32bit.7z
7-Zip (A) [64] 9.20 Copyright (c) 1999-2010 Igor Pavlov 2010-11-18
p7zip Version 9.20 (locale=en_CA.UTF-8,Utf16=on,HugeFiles=on,2 CPUs)
Processing archive: Ubuntu_14.04.2-32bit.7z
Extracting 32bit/Ubuntu 14.04.2 (32bit).vdi
Extracting 32bit
Everything is Ok
Folders: 1
Files: 1
Size: 3807379456
Compressed: 776252068
$
After installing Oracle VirtualBox it is time to launch it so we can import the virtual machine disk image (.vdi).
Figure 0.2 Allocating system memory to the new virtual machine instance
Select the amount of system memory (RAM) to allocate to the virtual machine. Allocate 2048 MB of system memory to this
virtual machine instance. This parameter can be modified later if necessary. Click 'Next' to continue.
Figure 0.3 Associating an existing virtual hard drive to the new virtual machine instance
Select 'Use an existing virtual hard drive file' and click on the file folder icon to navigate to the virtual hard drive file
previously downloaded and uncompressed. Click 'Create' to associate this disk image with the current virtual machine.
Figure 0.6 Starting the new virtual machine instance
At the main screen of Oracle VirtualBox select the newly created virtual machine instance. Click 'Start' to launch the virtual
machine. At the login prompt type the password from the download webpage.
The final preparatory step is enabling the VirtualBox Guest Additions and updating any out-of-date packages installed on
the virtual machine. Open a terminal window (CTRL + ALT + T).
Activate the VirtualBox Guest Additions so the virtual machine instance integrates with the host system.
$ cd /media/osboxes/VBOXADDITIONS*
$ sudo sh VBoxLinuxAdditions.run
Upon successful installation, shut down the virtual machine instance by clicking the Gear icon in the upper right corner
of the virtual machine, then unmount the VirtualBox Guest Additions by reversing the steps shown in Figures 0.4 and 0.5.
Alternatively, you may choose to leave the VirtualBox Guest Additions ISO attached.
Note: Whenever an updated Linux kernel is installed as part of the normal update process the VirtualBox Guest Additions
will have to be reapplied to ensure the shared clipboard, for example, continues to work. Do NOT forget to restart the virtual
machine instance so the VirtualBox Guest Additions are activated.
Figure 0.10 Editing the user name, password, language preference and enabling automatic login
Automatic login can be enabled, and the display name and password for the user account can be changed, if desired, via
'System Settings' by clicking 'User Accounts'.
15-minute Introduction to Navigating and Manipulating the File System from the Terminal
Let's start exploring the basic features of the environment from the comfort of a terminal session and the command line.
Open a terminal window (CTRL+ALT+T) if you are running a graphical desktop environment. By learning a few basic
commands to navigate and manipulate the file system you will feel at ease and understand what is going on behind the
scenes within the Files pane of RStudio.
| Command | Description | Common Flags | Arguments |
|---|---|---|---|
| pwd | print the working directory | | |
| ls | list directory contents | -l (long form), -a (hidden), -R (recursive) | [directory_path/][pattern] (optional) |
| mkdir | make directory | | [directory_path/]directory_name or [directory_path/]directory_name_list (mandatory) |
| cd | change directory | | [directory_path/][directory_name] (optional) |
| touch | create a file | | [directory_path/]file_name |
| echo | display a line of text (by default stdout) | -e, -n (no carriage return) | |
| cp | copy files or directories | -r (recursive) | (source) [directory_path/][file_name] (target) [directory_path/][file_name] (mandatory) |
| mv | move or rename files or directories | | (source) [directory_path/][file_name] (target) [directory_path/][file_name] (mandatory) |
| rm | remove/delete file or directory | -f (force), -r (recursive) | [directory_path/][file_name] (mandatory) |

Arguments in brackets are optional, but if the 'mandatory' designation is present, at least one of the arguments must be
supplied. Directory names and paths as well as file names may contain wildcard characters (* and ?) when used with some
of these commands.
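The wildcard characters mentioned above can be tried safely in a scratch directory. The following is a minimal sketch rather than part of the boot-camp exercises: the directory comes from `mktemp -d`, and the file names are made up for illustration.

```shell
# Sketch: the shell expands * and ? into matching names before the command runs.
# The scratch directory and the file names are illustrative assumptions.
scratch=$(mktemp -d)                 # throwaway directory, removed at the end
cd "$scratch" || exit 1
touch file1.txt file2.txt file10.txt notes.md

# '*' matches any run of characters, including none
set -- file*.txt                     # load the matching names as positional parameters
star_count=$#

# '?' matches exactly one character, so file10.txt is left out
set -- file?.txt
question_count=$#

ls file?.txt                         # lists file1.txt and file2.txt
echo "file*.txt matched $star_count names; file?.txt matched $question_count names"

cd / && rm -r "$scratch"             # clean up
```

Because the expansion is done by the shell, the same patterns behave identically with `ls`, `cp`, `mv`, and `rm`.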
Example 1: Print the name of the current working directory
$ pwd
/home/osboxes
Example 2: List the file and subdirectory names in the current working directory
$ ls
Desktop Downloads Music Public Videos
Documents examples.desktop Pictures Templates
Example 3: Create a subdirectory named 'test' and make it the current working directory
$ mkdir test
$ cd test
$ pwd
/home/osboxes/test
Example 4: Create subdirectories named '1', '2', '3', and '4' in the current directory
$ mkdir {1,2,3,4}
Example 5: List the files and subdirectory names in the current directory
$ ls
1 2 3 4
Example 6: Create some empty files and some files with content
$ touch 1/file01.txt 2/file02.txt
$ echo "Bonjour tout le monde"
Bonjour tout le monde
$ echo "Hello World!" > ./1/file0101.txt
$ echo "To be or not to be" > ./3/file03.txt
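Example 6 relies on the `>` operator to send the output of `echo` into a file instead of the screen. As a hedged aside, the sketch below contrasts `>` (create or overwrite) with `>>` (append); the scratch directory and the file name `greeting.txt` are assumptions made for illustration only.

```shell
# Sketch: '>' creates or overwrites a file, '>>' appends to it.
# The scratch directory and the file name greeting.txt are illustrative.
scratch=$(mktemp -d)

echo 'Hello World!' > "$scratch/greeting.txt"            # create the file
echo 'Bonjour tout le monde' >> "$scratch/greeting.txt"  # append a second line
line_count=$(( $(wc -l < "$scratch/greeting.txt") ))     # number of lines so far

echo 'To be or not to be' > "$scratch/greeting.txt"      # overwrite: one line again
final_text=$(cat "$scratch/greeting.txt")

echo "lines after append: $line_count"
echo "content after overwrite: $final_text"

rm -r "$scratch"                                         # clean up
```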
Example 7: Change to the directory immediately above the current directory and list the files and subdirectory names in the
subdirectory named '1'
$ cd ..
$ ls -l test/1
total 4
-rw-rw-r-- 1 osboxes osboxes 13 Apr 3 09:28 file0101.txt
-rw-rw-r-- 1 osboxes osboxes 0 Apr 3 09:27 file01.txt
Example 8: List the files ending with '.txt' in the subdirectory named '3'
$ ls -l test/3/*.txt
-rw-rw-r-- 1 osboxes osboxes 19 Apr 3 09:29 test/3/file03.txt
Example 9: (a) Copy the file 'file02.txt' from directory named '${HOME}/test/2' to directory '${HOME}/test/4' and name the
file 'file04.txt'
$ cp ./test/2/file02.txt ./test/4/file04.txt
(b) Copy the file 'file02.txt' from directory named '${HOME}/test/2' to directory '${HOME}/test/4' and name the file 'file02.txt'
$ cp ~/test/2/file02.txt ./test/4/file02.txt
Example 10: Make subdirectory '${HOME}/test/3' the current working directory and create a hidden file and a hidden
subdirectory
$ cd test/3
$ touch .hidden01.txt
$ mkdir .hidden
Example 11: List the names of non-hidden files and subdirectories in the current directory, and then list all names including hidden ones
$ ls
file03.txt
$ ls -a
. .. file03.txt .hidden .hidden01.txt
Example 12: Create a subdirectory named 'another' in the home directory of the user and recursively copy the files and
subdirectories from '${HOME}/test' to '${HOME}/another'
$ mkdir ~/another
$ cp -r ../* ~/another
Example 13: List the files and subdirectories in the home directory of the user
$ ls ~
another Documents examples.desktop Pictures Templates Videos
Desktop Downloads Music Public test
Example 14: List the files and subdirectories in '${HOME}/another'
$ ls ~/another
1 2 3 4
Example 15: List the file names and recursively the subdirectories in '${HOME}/another'
$ ls -R ~/another
/home/osboxes/another:
1 2 3 4
/home/osboxes/another/1:
file0101.txt file01.txt
/home/osboxes/another/2:
file02.txt
/home/osboxes/another/3:
file03.txt
/home/osboxes/another/4:
file02.txt file04.txt
Example 16: Create a subdirectory named 'test/5' in the home directory of the user and move (copy and delete) the files
and/or subdirectories from '${HOME}/another'
$ mkdir ~/test/5
$ mv ~/another/* ../5
Example 17: List all entries, including hidden ones, remaining in '${HOME}/another'
$ ls -a /home/osboxes/another
. ..
Example 18: List the files and subdirectories in '${HOME}/test/5'
$ ls ../5
1 2 3 4
Example 19: List the file names and recursively the subdirectories in '${HOME}/test/5'
$ ls -R ~/test/5
/home/osboxes/test/5:
1 2 3 4
/home/osboxes/test/5/1:
file0101.txt file01.txt
/home/osboxes/test/5/2:
file02.txt
/home/osboxes/test/5/3:
file03.txt
/home/osboxes/test/5/4:
file02.txt file04.txt
Example 20: Return to the home directory and print the current working directory
$ cd
$ pwd
/home/osboxes
Example 21: Delete the subdirectories 'test' and 'another' from '${HOME}', and then list the file and subdirectory names in
the current directory
$ exit
A cheatsheet for the Bourne Again SHell (BASH) has been prepared by the folks at Learn Code the Hard Way (LCodeTHW).
A complete manual for BASH is available from the GNU Project if you want to further explore the CLI and its capabilities.
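The `rm` command from the table above deletes files immediately, with no recycle bin, so it is worth rehearsing in a throwaway location first. A minimal sketch, assuming a scratch directory from `mktemp -d` and made-up directory names (this is not the boot-camp session itself):

```shell
# Sketch: rehearse recursive deletion inside a scratch directory.
# All names below are illustrative; nothing outside $scratch is touched.
scratch=$(mktemp -d)
mkdir -p "$scratch/test/1" "$scratch/another"
touch "$scratch/test/1/file01.txt"

# Plain rm refuses to delete a directory; -r removes it and its contents
rm -r "$scratch/test" "$scratch/another"

remaining=$(ls -A "$scratch")     # -A lists hidden entries too, except . and ..
if [ -z "$remaining" ]; then
    result="scratch directory is empty"
else
    result="left over: $remaining"
fi
echo "$result"

rmdir "$scratch"                  # rmdir only succeeds on an empty directory
```

Adding `-f` suppresses prompts and errors for missing files, which is a good reason to double-check the path before pressing [ENTER].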
HTML and XML. Taking a portion of this book as an example, with some minor changes to demonstrate particular features,
we explore some of the more common markdown elements.
Prologue
===
# Introduction
During the next year you will learn the fundamentals of data science.
Surviving the nine courses which make up the [Data Science
Specialization][0001] offered by [Johns Hopkins University][jhu] requires a
**strategy**.
To this end, the focus of the ten-course series including a capstone project
is to provide the learner with:
1. an introduction to the key ideas behind reproducible research,
2. an introduction to the tools and techniques to transform raw
data into a presentable report,
4. an opportunity to gain hands-on practice so you can learn the
techniques for yourself, and
3. an appreciation of the mathematics & statistics involved in
data science.
## Core Courses
The courses comprising the Data Science Specialization are:
* Data Scientist's Toolbox
* R Programming
* Exploratory Data Analysis
* Getting and Cleaning Data
* Reproducible Research
* Statistical Inference
* Regression Models
* Practical Machine Learning
* Developing Data Products
![Course Dependency](dst_courses.png)
*Figure 1 Course dependency diagram*
[0001]: https://www.coursera.org/specialization/jhudatascience/1?utm_medium=courseDescripTop
[jhu]: http://www.jhu.edu
A text editor combined with the markdown-to-html converter is all that is needed.
$ nano sample.md
$ markdown sample.md # sends HTML output to the screen
$ markdown sample.md > sample.html # sends HTML output to a file named 'sample.html'
$ firefox sample.html # view the rendered HTML in a web browser
Take your time working through the sample markdown document until you fully understand why each element produces the
observed results. This book is written in a markdown language. In another course you will learn how to produce a
markdown document combining text and executable R code using R Markdown, and convert it to HTML and PDF using
RStudio.
If you have installed a different distribution refer to the system documentation to determine the package manager needed to
install software from the software repository.
15-minute Introduction to Version Control with Git from the Terminal
Let's start exploring the basic features of version control from the comfort of a terminal session. Open a terminal
window (CTRL+ALT+T) if you are running a graphical desktop environment. Once RStudio is installed you will also have
integrated access to Git.
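| Command | Description | Common Flags | Arguments |
|---|---|---|---|
| git init | create an empty repository | | [directory_path/][directory_name] (optional) |
| git branch | list or create branches | | |
| git checkout | switch to a branch | -b (new branch) | branch_name (mandatory) |
| git status | show the state of the working directory | | |
| git show | show the most recent commit | | |
| git add | stage changes for the next commit | -A (add, including deletions) | [directory_path/][file_name] (mandatory) |
| git commit | record staged changes | -a (add), -m (message) | [directory_path/][file_name] "a string of characters" (optional, mandatory) |
| git pull | fetch and merge changes from a remote repository | | source target (mandatory) |
| git push | send local commits to a remote repository | -u (add upstream (tracking) reference) | target source |
| git merge | merge another branch into the current branch | --squash | |
| git revert | create a commit that undoes an earlier commit | | |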
Note: The output of some Git commands in these examples has been reformatted for presentation within this book.
Example 1: Create a local repository.
$ mkdir Projects
$ mkdir Projects/DataScientistsToolbox
$ mkdir Projects/DataScientistsToolbox/sample
$ cd Projects/DataScientistsToolbox/sample
$ git init
Initialised empty Git repository in /home/osboxes/Projects/DataScientistsToolbox/sample/.git/
$ ls -la
drwxrwxr-x 3 osboxes osboxes 4096 Apr 5 19:15 .
drwxrwxr-x 3 osboxes osboxes 4096 Apr 5 19:07 ..
drwxrwxr-x 7 osboxes osboxes 4096 Apr 5 19:15 .git
Example 2: Create an empty README.md file, commit it, and inspect the repository
$ touch README.md
$ git add .
$ git commit -m "initial commit"
[master (root-commit) b7c48f3] initial commit
1 file changed, 0 insertions(+), 0 deletions(-)
create mode 100644 README.md
$ git status
On branch master
nothing to commit, working directory clean
$ git show
commit b7c48f3e5cdc772e6a198c3633acd853a69a5778
Author: jhudss
Date: Sun Apr 5 19:21:21 2015 -0300
initial commit
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..e69de29
Example 3: Edit the README.md file and paste the sample markdown document into the file.
$ nano README.md
$ git add -A .
$ git commit -m "added content"
[master 8fd8eb8] added content
1 file changed, 41 insertions(+)
Example 4: Edit the README.md file swapping "Getting and Cleaning Data" and "Exploratory Data Analysis."
$ nano README.md
$ git commit -a -m "swapped order of two courses"
[master 87d0125] swapped order of two courses
1 file changed, 1 insertion(+), 1 deletion(-)
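An edited file only becomes part of a commit once it has been staged, either with `git add` or with the `-a` flag of `git commit`. The sketch below illustrates this under stated assumptions: Git is installed, and the scratch repository, user identity, and file contents are placeholders rather than anything from the boot-camp session.

```shell
# Sketch: a tracked file that was edited must be staged before it is committed.
# Repository location, user identity and file contents are illustrative.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name "Recruit"                 # local identity for this repo only
git config user.email "recruit@example.com"

echo "first line" > README.md
git add README.md
git commit -q -m "initial commit"

echo "second line" >> README.md                # edit the tracked file

# Without staging, git commit has nothing to record and exits non-zero
if git commit -q -m "not staged" >/dev/null 2>&1; then
    unstaged_commit="created"
else
    unstaged_commit="refused"
fi

git commit -q -a -m "added a second line"      # -a stages modified tracked files first
commit_count=$(git rev-list --count HEAD)

echo "commit without staging: $unstaged_commit"
echo "commits in history: $commit_count"

cd / && rm -rf "$repo"                         # clean up
```

Note that `-a` only picks up files Git already tracks; a brand-new file still needs `git add`.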
$ git status
On branch master
nothing to commit, working directory clean
$ git show
commit 87d012594aa5a8a39e99d4728dc8c853779587ab
Author: jhudss
Date: Sun Apr 5 19:34:34 2015 -0300
swapped order of two courses
diff --git a/README.md b/README.md
index 756292a..48587e6 100644
--- a/README.md
+++ b/README.md
@@ -25,8 +25,8 @@ The courses comprising the Data Science Specialization are:
* Data Scientist's Toolbox
* R Programming
-* Exploratory Data Analysis
Example 7: Edit the README.md file to add "Git is easy. Git is fun. Thanks Linus!" anywhere in the file.
$ nano README.md
$ git status
On branch draft
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git checkout -- <file>..." to discard changes in working directory)
modified: README.md
no changes added to commit (use "git add" and/or "git commit -a")
$ git commit -a -m "thanked the creator of Git"
[draft 34af00f] thanked the creator of Git
1 file changed, 2 insertions(+)
Example 8: Switch to the 'master' branch and check the repository status.
Example 9: Merge the 'draft' branch with the 'master' branch and check the repository status.
Author: jhudss
Date: Sun Apr 5 19:49:08 2015 -0300
thanked the creator of Git
diff --git a/README.md b/README.md
index 48587e6..aa53fee 100644
--- a/README.md
+++ b/README.md
@@ -19,6 +19,8 @@ is to provide the learner with:
3. an appreciation of the mathematics & statistics involved in
data science.
+Git is easy. Git is fun. Thanks Linus!
+
## Core Courses
The courses comprising the Data Science Specialization are:
A cheatsheet for Git and GitHub has been prepared by the folks at GitHub.
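The branch-and-merge cycle described in Examples 7 to 9 can be sketched as follows. This is an illustration under stated assumptions, not the original session: Git is installed, the scratch repository, identity, and file contents are placeholders, and the name of the starting branch is read from Git rather than assumed to be 'master'.

```shell
# Sketch: create a 'draft' branch, commit on it, then merge it back.
# Repository location, identity and file contents are illustrative.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.name "Recruit"
git config user.email "recruit@example.com"

echo "# Introduction" > README.md
git add README.md
git commit -q -m "initial commit"
main_branch=$(git rev-parse --abbrev-ref HEAD)    # name of the starting branch

git checkout -q -b draft                          # create and switch to 'draft'
echo "Git is easy. Git is fun. Thanks Linus!" >> README.md
git commit -q -a -m "thanked the creator of Git"

git checkout -q "$main_branch"                    # switch back to the starting branch
git merge -q draft                                # bring in the commit made on 'draft'

merged_line=$(tail -n 1 README.md)                # last line now comes from 'draft'
pending=$(git status --porcelain)                 # empty when nothing is left to commit

echo "last line after merge: $merged_line"

cd / && rm -rf "$repo"                            # clean up
```

Because the starting branch gained no commits of its own in the meantime, Git can simply move it forward to the tip of 'draft' and no merge conflict can arise.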
GitHub - Repository Hosting Service Supporting the Git Version Control System
GitHub is a repository hosting service allowing any number of people to collaboratively contribute to software development
or other projects. Some of the courses require learners to submit their programming assignments to GitHub as part of a
peer assessment grading process.
15-minute Introduction to Version Control with GitHub from the Terminal and Web Browser
A cheatsheet for Git and GitHub has been prepared by the folks at GitHub.
Install the latest version of R, which might be newer than shown in the figures.
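| Command | Description | Arguments |
|---|---|---|
| install.packages | install a package from a repository | package_name (mandatory) |
| install_github | install a package hosted on GitHub (from the devtools package) | package_name (mandatory) |
| library | load a package | package_name (mandatory) |
| ? | display help | [package_name] [function_name] (mandatory) |
| q() | exit R | |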
$ R
R version 3.1.1 (2014-07-10) -- "Sock it to Me"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: i686-pc-linux-gnu (32-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> install.packages("devtools")
> library(devtools)
> ?devtools
> q()
$
Figure 0.31 Read student grades file and output the contents
Highlight the code in the 'student_grades.R' tab. Click 'Run'.
Figure 0.38 GitHub repository named demo after the push from local repository
Congratulations! You successfully configured a virtual machine for use during the data science boot-camp.
Practise. Practise. Practise your newly acquired knowledge and skills in preparation for the course project.
Final Thoughts
Data Scientist's Toolbox introduced the statistical computing and graphing suite, the integrated development
environment, and the version / revision control system selected by the Data Science Specialization Lab Team in the
Biostatistics Department of Johns Hopkins University. The features and capabilities of these tools extend beyond the
basics presented in this chapter. While the graphical user interface is convenient we highly recommend and encourage you
to become comfortable with the command-line as well.
You are now a data science recruit outfitted with your kit (Git, R, RStudio, Ubuntu Linux, and a GitHub account), and
the instructor for R Programming awaits. Boot-camp has been easy up to this point. Read the "Data Science Boot-Camp
Survival Manual" regularly to avoid washing out of boot-camp.
Recruits, dismissed.
Chapter 1 - R Programming
Capstone
Epilogue