Practical Data Analysis - Second Edition
By Hector Cuesta and Dr. Sampath Kumar
About this ebook
- Learn to use various data analysis tools and algorithms to classify, cluster, visualize, simulate, and forecast your data
- Apply Machine Learning algorithms to different kinds of data such as social networks, time series, and images
- A hands-on guide to understanding the nature of data and how to turn it into insight
This book is for developers who want to implement data analysis and data-driven algorithms in a practical way. It is also suitable for those without a background in data analysis or data processing. Basic knowledge of Python programming, statistics, and linear algebra is assumed.
Hector Cuesta
Hector Cuesta holds a B.A. in Informatics and an M.Sc. in Computer Science. He provides consulting services for software engineering and data analysis, with experience in a variety of industries including financial services, social networking, e-learning, and human resources. He is a lecturer in the Department of Computer Science at the Autonomous University of Mexico State (UAEM). His main research interests lie in computational epidemiology, machine learning, computer vision, high-performance computing, big data, simulation, and data visualization. He helped in the technical review of the books Raspberry Pi Networking Cookbook by Rick Golden and Hadoop Operations and Cluster Management Cookbook by Shumin Guo for Packt Publishing. He is also a columnist at Software Guru magazine and has published several scientific papers in international journals and conferences. He is an enthusiast of Lego Robotics and Raspberry Pi in his spare time. You can follow him on Twitter at https://twitter.com/hmCuesta.
Table of Contents
Practical Data Analysis - Second Edition
Credits
About the Authors
About the Reviewers
www.PacktPub.com
eBooks, discount offers, and more
Why subscribe?
Free access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Downloading the color images of this book
Errata
Piracy
Questions
1. Getting Started
Computer science
Artificial intelligence
Machine learning
Statistics
Mathematics
Knowledge domain
Data, information, and knowledge
Inter-relationship between data, information, and knowledge
The nature of data
The data analysis process
The problem
Data preparation
Data exploration
Predictive modeling
Visualization of results
Quantitative versus qualitative data analysis
Importance of data visualization
What about big data?
Quantified self
Sensors and cameras
Social network analysis
Tools and toys for this book
Why Python?
Why mlpy?
Why D3.js?
Why MongoDB?
Summary
2. Preprocessing Data
Data sources
Open data
Text files
Excel files
SQL databases
NoSQL databases
Multimedia
Web scraping
Data scrubbing
Statistical methods
Text parsing
Data transformation
Data formats
Parsing a CSV file with the CSV module
Parsing CSV file using NumPy
JSON
Parsing JSON file using the JSON module
XML
Parsing XML in Python using the XML module
YAML
Data reduction methods
Filtering and sampling
Binned algorithm
Dimensionality reduction
Getting started with OpenRefine
Text facet
Clustering
Text filters
Numeric facets
Transforming data
Exporting data
Operation history
Summary
3. Getting to Grips with Visualization
What is visualization?
Working with web-based visualization
Exploring scientific visualization
Visualization in art
The visualization life cycle
Visualizing different types of data
HTML
DOM
CSS
JavaScript
SVG
Getting started with D3.js
Bar chart
Pie chart
Scatter plots
Single line chart
Multiple line chart
Interaction and animation
Data from social networks
An overview of visual analytics
Summary
4. Text Classification
Learning and classification
Bayesian classification
Naïve Bayes
E-mail subject line tester
The data
The algorithm
Classifier accuracy
Summary
5. Similarity-Based Image Retrieval
Image similarity search
Dynamic time warping
Processing the image dataset
Implementing DTW
Analyzing the results
Summary
6. Simulation of Stock Prices
Financial time series
Random Walk simulation
Monte Carlo methods
Generating random numbers
Implementation in D3.js
Quantitative analyst
Summary
7. Predicting Gold Prices
Working with time series data
Components of a time series
Smoothing time series
Linear regression
The data - historical gold prices
Nonlinear regressions
Kernel Ridge Regressions
Smoothing the gold prices time series
Predicting in the smoothed time series
Contrasting the predicted value
Summary
8. Working with Support Vector Machines
Understanding the multivariate dataset
Dimensionality reduction
Linear Discriminant Analysis (LDA)
Principal Component Analysis (PCA)
Getting started with SVM
Kernel functions
The double spiral problem
SVM implemented on mlpy
Summary
9. Modeling Infectious Diseases with Cellular Automata
Introduction to epidemiology
The epidemiology triangle
The epidemic models
The SIR model
Solving the ordinary differential equation for the SIR model with SciPy
The SIRS model
Modeling with Cellular Automaton
Cell, state, grid, neighborhood
Global stochastic contact model
Simulation of the SIRS model in CA with D3.js
Summary
10. Working with Social Graphs
Structure of a graph
Undirected graph
Directed graph
Social networks analysis
Acquiring the Facebook graph
Working with graphs using Gephi
Statistical analysis
Male to female ratio
Degree distribution
Histogram of a graph
Centrality
Transforming GDF to JSON
Graph visualization with D3.js
Summary
11. Working with Twitter Data
The anatomy of Twitter data
Tweet
Followers
Trending topics
Using OAuth to access Twitter API
Getting started with Twython
Simple search using Twython
Working with timelines
Working with followers
Working with places and trends
Working with user data
Streaming API
Summary
12. Data Processing and Aggregation with MongoDB
Getting started with MongoDB
Database
Collection
Document
Mongo shell
Insert/Update/Delete
Queries
Data preparation
Data transformation with OpenRefine
Inserting documents with PyMongo
Group
Aggregation framework
Pipelines
Expressions
Summary
13. Working with MapReduce
An overview of MapReduce
Programming model
Using MapReduce with MongoDB
Map function
Reduce function
Using mongo shell
Using Jupyter
Using PyMongo
Filtering the input collection
Grouping and aggregation
Counting the most common words in tweets
Summary
14. Online Data Analysis with Jupyter and Wakari
Getting started with Wakari
Creating an account in Wakari
Getting started with IPython notebook
Data visualization
Introduction to image processing with PIL
Opening an image
Working with an image histogram
Filtering
Operations
Transformations
Getting started with pandas
Working with Time Series
Working with multivariate datasets with DataFrame
Grouping, Aggregation, and Correlation
Sharing your Notebook
The data
Summary
15. Understanding Data Processing using Apache Spark
Platform for data processing
The Cloudera platform
Installing Cloudera VM
An introduction to the distributed file system
First steps with Hadoop Distributed File System - HDFS
File management with HUE - web interface
An introduction to Apache Spark
The Spark ecosystem
The Spark programming model
An introductory working example of Apache Spark
Summary
Practical Data Analysis - Second Edition
Copyright © 2016 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: October 2013
Second edition published: September 2016
Production reference: 1260916
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.
ISBN 978-1-78528-971-2
www.packtpub.com
Credits
About the Authors
Hector Cuesta is founder and Chief Data Scientist at Dataxios, a machine intelligence research company. He holds a B.A. in Informatics and an M.Sc. in Computer Science. He provides consulting services for data-driven product design, with experience in a variety of industries including financial services, retail, fintech, e-learning, and human resources. He is a robotics enthusiast in his spare time.
You can follow him on Twitter at https://twitter.com/hmCuesta.
I would like to dedicate this book to my wife Yolanda, and to my wonderful children Damian and Isaac for all the joy they bring into my life. To my parents Elena and Miguel for their constant support and love.
Dr. Sampath Kumar works as an assistant professor and head of the Department of Applied Statistics at Telangana University. He has completed an M.Sc., an M.Phil., and a Ph.D. in statistics. He has five years of experience teaching postgraduate courses and more than four years of experience in the corporate sector. His expertise is in statistical data analysis using SPSS, SAS, R, Minitab, MATLAB, and so on. He is an advanced programmer in SAS and MATLAB. He has taught applied and pure statistics subjects such as forecasting models, applied regression analysis, multivariate data analysis, and operations research to M.Sc. students. He is currently supervising Ph.D. scholars.
About the Reviewers
Chandana N. Athauda is currently employed at BAG (Brunei Accenture Group) Networks—Brunei and he serves as a technical consultant. He mainly focuses on Business Intelligence, Big Data and Data Visualization tools and technologies.
He has been working professionally in the IT industry for more than 15 years (Ex-Microsoft Most Valuable Professional (MVP) and Microsoft Ranger for TFS). His roles in the IT industry have spanned the entire spectrum from programmer to technical consultant. Technology has always been a passion for him.
If you would like to talk to Chandana about this book, feel free to write to him at info@inzeek.net or tweet him at @inzeek.
Mark Kerzner is a Big Data architect and trainer. Mark is a founder and principal at Elephant Scale, offering Big Data training and consulting. Mark has written HBase Design Patterns for Packt.
I would like to acknowledge my co-founder Sujee Maniyam and his colleague Tim Fox, as well as all the students and teachers. Last but not least, thanks to my multi-talented family.
www.PacktPub.com
For support files and downloads related to your book, please visit www.PacktPub.com.
eBooks, discount offers, and more
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at customercare@packtpub.com for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www2.packtpub.com/books/subscription/packtlib
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.
Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Free access for Packt account holders
Get notified! Find out when new books are published by following @PacktEnterprise on Twitter or the Packt Enterprise Facebook page.
Preface
Practical Data Analysis provides a series of practical projects for turning data into insight. It covers a wide range of data analysis tools and algorithms for classification, clustering, visualization, simulation, and forecasting. The goal of this book is to help you understand your data and find patterns, trends, relationships, and insight.
The projects in this book use MongoDB, D3.js, and the Python language and its ecosystem to present the concepts through code snippets and detailed descriptions.
What this book covers
Chapter 1, Getting Started, discusses the principles of data analysis and the data analysis process.
Chapter 2, Preprocessing Data, explains how to scrub and prepare your data for analysis and introduces OpenRefine, a data cleansing tool.
Chapter 3, Getting to Grips with Visualization, shows how to visualize different kinds of data using D3.js, a JavaScript visualization framework.
Chapter 4, Text Classification, introduces binary classification, using a Naïve Bayes algorithm to classify spam.
Chapter 5, Similarity-Based Image Retrieval, presents a project to find the similarity between images using a dynamic time warping approach.
Chapter 6, Simulation of Stock Prices, explains how to simulate a stock price using a Random Walk algorithm, visualized with a D3.js animation.
Chapter 7, Predicting Gold Prices, explains how Kernel Ridge Regression works and how to use it to predict gold prices from time series data.
Chapter 8, Working with Support Vector Machines, describes how to use Support Vector Machines as a classification method.
Chapter 9, Modeling Infectious Diseases with Cellular Automata, introduces the basic concepts of computational epidemiology simulation and explains how to implement a cellular automaton to simulate an epidemic outbreak using D3.js and JavaScript.
Chapter 10, Working with Social Graphs, explains how to obtain and visualize your social media graph from Facebook using Gephi.
Chapter 11, Working with Twitter Data, explains how to use the Twitter API to retrieve data from Twitter. It also shows how to extend text classification to perform sentiment analysis using the Naïve Bayes algorithm implemented in the Natural Language Toolkit (NLTK).
Chapter 12, Data Processing and Aggregation with MongoDB, introduces the basic operations in MongoDB as well as methods for grouping, filtering, and aggregation.
Chapter 13, Working with MapReduce, illustrates how to use the MapReduce programming model implemented in MongoDB.
Chapter 14, Online Data Analysis with Jupyter and Wakari, explains how to use the Wakari platform and introduces the basic use of pandas and PIL with IPython.
Chapter 15, Understanding Data Processing using Apache Spark, explains how to use a distributed file system with the Cloudera VM and how to get started with a data environment. Finally, it describes the main features of Apache Spark with a practical example.
What you need for this book
The basic requirements for this book are as follows:
Python
OpenRefine
D3.js
mlpy
Natural Language Toolkit (NLTK)
Gephi
MongoDB
Who this book is for
This book is for software developers, analysts, and computer scientists who want to implement data analysis and visualization in a practical way. The book also provides a self-contained set of practical projects for getting insight from different kinds of data, such as time series, numerical, multidimensional, social media graphs, and text.
You are not required to have previous knowledge of data analysis, but some basic knowledge of statistics and a general understanding of Python programming are assumed.
Conventions
In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: For this example, we will use the BeautifulSoup library version 4.
A block of code is set as follows:
from bs4 import BeautifulSoup
import urllib.request
from time import sleep
from datetime import datetime
Any command-line input or output is written as follows:
>>> readers@packt.com
>>> readers
>>> packt.com
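The three output lines above show an address followed by its two parts. As a rough illustration only, output of that shape could come from a short script like the one below. The function name split_email is hypothetical and does not come from the book, which may produce this output in a different way.

```python
def split_email(address):
    """Split an e-mail address into its local part and its domain.

    Hypothetical helper for illustration; not taken from the book.
    """
    # partition() splits at the first "@" and returns (before, sep, after).
    local, _, domain = address.partition("@")
    return local, domain


address = "readers@packt.com"
local, domain = split_email(address)

print(address)  # the full address
print(local)    # the part before "@"
print(domain)   # the part after "@"
```

Using str.partition keeps the sketch free of regular expressions; it assumes the address contains exactly one "@".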
New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: Now, just click on the OK button to apply the transformation.
Note
Warnings or important notes appear in a box like this.
Tip
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.
To send us general feedback, simply e-mail feedback@packtpub.com, and mention the book's title in the subject of your message.
If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
Downloading the example code
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
You can download the code files by following these steps:
Log in to or register on our website using your e-mail address and password.
Hover the mouse pointer over the SUPPORT tab at the top.
Click on Code Downloads & Errata.
Enter the name of the book in the Search box.
Select the book for which you're looking to download the code files.
Choose from the drop-down menu where you purchased this book.
Click on Code Download.
You can also download the code files by clicking on the Code Files button on the book's webpage at the Packt Publishing website. This page can be accessed by entering the book's name in the Search box. Please note that you need to be logged in to your Packt account.
Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Practical-Data-Analysis-Second-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Downloading the color images of this book
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/B04227_PracticalDataAnalysisSecondEdition_ColorImages.pdf.
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.
Please contact us at copyright@packtpub.com with a link to the suspected pirated material.
We appreciate your help in protecting our authors and our ability to bring you valuable content.
Questions
If you have a problem with any aspect of this book, you can contact us at questions@packtpub.com, and we will do our best to address