Practical Data Analysis - Second Edition
Ebook, 575 pages, 4 hours

About this ebook

About This Book
  • Learn to use various data analysis tools and algorithms to classify, cluster, visualize, simulate, and forecast your data
  • Apply Machine Learning algorithms to different kinds of data such as social networks, time series, and images
  • A hands-on guide to understanding the nature of data and how to turn it into insight
Who This Book Is For

This book is for developers who want to implement data analysis and data-driven algorithms in a practical way. It is also suitable for those without a background in data analysis or data processing. Basic knowledge of Python programming, statistics, and linear algebra is assumed.

Language: English
Release date: Sep 30, 2016
ISBN: 9781785286667
Author

Hector Cuesta

Hector Cuesta holds a B.A. in Informatics and an M.Sc. in Computer Science. He provides consulting services for software engineering and data analysis, with experience in a variety of industries including financial services, social networking, e-learning, and human resources. He is a lecturer in the Department of Computer Science at the Autonomous University of Mexico State (UAEM). His main research interests lie in computational epidemiology, machine learning, computer vision, high-performance computing, big data, simulation, and data visualization. He helped with the technical review of the books Raspberry Pi Networking Cookbook by Rick Golden and Hadoop Operations and Cluster Management Cookbook by Shumin Guo for Packt Publishing. He is also a columnist at Software Guru magazine and has published several scientific papers in international journals and conferences. He is an enthusiast of Lego Robotics and Raspberry Pi in his spare time. You can follow him on Twitter at https://twitter.com/hmCuesta.

    Book preview

    Practical Data Analysis - Second Edition - Hector Cuesta

    Table of Contents

    Practical Data Analysis - Second Edition

    Credits

    About the Authors

    About the Reviewers

    www.PacktPub.com

    eBooks, discount offers, and more

    Why subscribe?

    Free access for Packt account holders

    Preface

    What this book covers

    What you need for this book

    Who this book is for

    Conventions

    Reader feedback

    Customer support

    Downloading the example code

    Downloading the color images of this book

    Errata

    Piracy

    Questions

    1. Getting Started

    Computer science

    Artificial intelligence

    Machine learning

    Statistics

    Mathematics

    Knowledge domain

    Data, information, and knowledge

    Inter-relationship between data, information, and knowledge

    The nature of data

    The data analysis process

    The problem

    Data preparation

    Data exploration

    Predictive modeling

    Visualization of results

    Quantitative versus qualitative data analysis

    Importance of data visualization

    What about big data?

    Quantified self

    Sensors and cameras

    Social network analysis

    Tools and toys for this book

    Why Python?

    Why mlpy?

    Why D3.js?

    Why MongoDB?

    Summary

    2. Preprocessing Data

    Data sources

    Open data

    Text files

    Excel files

    SQL databases

    NoSQL databases

    Multimedia

    Web scraping

    Data scrubbing

    Statistical methods

    Text parsing

    Data transformation

    Data formats

    Parsing a CSV file with the CSV module

    Parsing CSV file using NumPy

    JSON

    Parsing JSON file using the JSON module

    XML

    Parsing XML in Python using the XML module

    YAML

    Data reduction methods

    Filtering and sampling

    Binned algorithm

    Dimensionality reduction

    Getting started with OpenRefine

    Text facet

    Clustering

    Text filters

    Numeric facets

    Transforming data

    Exporting data

    Operation history

    Summary

    3. Getting to Grips with Visualization

    What is visualization?

    Working with web-based visualization

    Exploring scientific visualization

    Visualization in art

    The visualization life cycle

    Visualizing different types of data

    HTML

    DOM

    CSS

    JavaScript

    SVG

    Getting started with D3.js

    Bar chart

    Pie chart

    Scatter plots

    Single line chart

    Multiple line chart

    Interaction and animation

    Data from social networks

    An overview of visual analytics

    Summary

    4. Text Classification

    Learning and classification

    Bayesian classification

    Naïve Bayes

    E-mail subject line tester

    The data

    The algorithm

    Classifier accuracy

    Summary

    5. Similarity-Based Image Retrieval

    Image similarity search

    Dynamic time warping

    Processing the image dataset

    Implementing DTW

    Analyzing the results

    Summary

    6. Simulation of Stock Prices

    Financial time series

    Random Walk simulation

    Monte Carlo methods

    Generating random numbers

    Implementation in D3.js

    Quantitative analyst

    Summary

    7. Predicting Gold Prices

    Working with time series data

    Components of a time series

    Smoothing time series

    Linear regression

    The data - historical gold prices

    Nonlinear regressions

    Kernel Ridge Regressions

    Smoothing the gold prices time series

    Predicting in the smoothed time series

    Contrasting the predicted value

    Summary

    8. Working with Support Vector Machines

    Understanding the multivariate dataset

    Dimensionality reduction

    Linear Discriminant Analysis (LDA)

    Principal Component Analysis (PCA)

    Getting started with SVM

    Kernel functions

    The double spiral problem

    SVM implemented on mlpy

    Summary

    9. Modeling Infectious Diseases with Cellular Automata

    Introduction to epidemiology

    The epidemiology triangle

    The epidemic models

    The SIR model

    Solving the ordinary differential equation for the SIR model with SciPy

    The SIRS model

    Modeling with Cellular Automaton

    Cell, state, grid, neighborhood

    Global stochastic contact model

    Simulation of the SIRS model in CA with D3.js

    Summary

    10. Working with Social Graphs

    Structure of a graph

    Undirected graph

    Directed graph

    Social networks analysis

    Acquiring the Facebook graph

    Working with graphs using Gephi

    Statistical analysis

    Male to female ratio

    Degree distribution

    Histogram of a graph

    Centrality

    Transforming GDF to JSON

    Graph visualization with D3.js

    Summary

    11. Working with Twitter Data

    The anatomy of Twitter data

    Tweet

    Followers

    Trending topics

    Using OAuth to access Twitter API

    Getting started with Twython

    Simple search using Twython

    Working with timelines

    Working with followers

    Working with places and trends

    Working with user data

    Streaming API

    Summary

    12. Data Processing and Aggregation with MongoDB

    Getting started with MongoDB

    Database

    Collection

    Document

    Mongo shell

    Insert/Update/Delete

    Queries

    Data preparation

    Data transformation with OpenRefine

    Inserting documents with PyMongo

    Group

    Aggregation framework

    Pipelines

    Expressions

    Summary

    13. Working with MapReduce

    An overview of MapReduce

    Programming model

    Using MapReduce with MongoDB

    Map function

    Reduce function

    Using mongo shell

    Using Jupyter

    Using PyMongo

    Filtering the input collection

    Grouping and aggregation

    Counting the most common words in tweets

    Summary

    14. Online Data Analysis with Jupyter and Wakari

    Getting started with Wakari

    Creating an account in Wakari

    Getting started with IPython notebook

    Data visualization

    Introduction to image processing with PIL

    Opening an image

    Working with an image histogram

    Filtering

    Operations

    Transformations

    Getting started with pandas

    Working with Time Series

    Working with multivariate datasets with DataFrame

    Grouping, Aggregation, and Correlation

    Sharing your Notebook

    The data

    Summary

    15. Understanding Data Processing using Apache Spark

    Platform for data processing

    The Cloudera platform

    Installing Cloudera VM

    An introduction to the distributed file system

    First steps with Hadoop Distributed File System - HDFS

    File management with HUE - web interface

    An introduction to Apache Spark

    The Spark ecosystem

    The Spark programming model

    An introductory working example of Apache Spark

    Summary

    Practical Data Analysis - Second Edition


    Copyright © 2016 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    First published: October 2013

    Second edition published: September 2016

    Production reference: 1260916

    Published by Packt Publishing Ltd.

    Livery Place

    35 Livery Street

    Birmingham B3 2PB, UK.

    ISBN 978-1-78528-971-2

    www.packtpub.com

    Credits

    About the Authors

    Hector Cuesta is the founder and Chief Data Scientist at Dataxios, a machine intelligence research company. He holds a BA in Informatics and an M.Sc. in Computer Science. He provides consulting services for data-driven product design, with experience in a variety of industries including financial services, retail, fintech, e-learning, and human resources. He is a robotics enthusiast in his spare time.

    You can follow him on Twitter at https://twitter.com/hmCuesta.

    I would like to dedicate this book to my wife Yolanda, and to my wonderful children Damian and Isaac for all the joy they bring into my life. To my parents Elena and Miguel for their constant support and love. 

    Dr. Sampath Kumar works as an assistant professor and head of the Department of Applied Statistics at Telangana University. He has completed an M.Sc., an M.Phil., and a Ph.D. in statistics. He has five years of experience teaching postgraduate courses and more than four years of experience in the corporate sector. His expertise is in statistical data analysis using SPSS, SAS, R, Minitab, MATLAB, and so on. He is an advanced programmer in SAS and MATLAB. He has taught a range of applied and pure statistics subjects to M.Sc. students, such as forecasting models, applied regression analysis, multivariate data analysis, and operations research. He is currently supervising Ph.D. scholars.

    About the Reviewers

    Chandana N. Athauda is currently employed at BAG (Brunei Accenture Group) Networks in Brunei, where he serves as a technical consultant. He mainly focuses on Business Intelligence, Big Data, and Data Visualization tools and technologies.

    He has been working professionally in the IT industry for more than 15 years (Ex-Microsoft Most Valuable Professional (MVP) and Microsoft Ranger for TFS). His roles in the IT industry have spanned the entire spectrum from programmer to technical consultant. Technology has always been a passion for him.

    If you would like to talk to Chandana about this book, feel free to write to him at info@inzeek.net or tweet him at @inzeek.

    Mark Kerzner is a Big Data architect and trainer. Mark is a founder and principal at Elephant Scale, offering Big Data training and consulting. Mark has written HBase Design Patterns for Packt.

    I would like to acknowledge my co-founder Sujee Maniyam and his colleague Tim Fox, as well as all the students and teachers. Last but not least, thanks to my multi-talented family.

    www.PacktPub.com

    For support files and downloads related to your book, please visit www.PacktPub.com.

    eBooks, discount offers, and more

    Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at customercare@packtpub.com for more details.

    At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

    https://www2.packtpub.com/books/subscription/packtlib

    Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

    Why subscribe?

    Fully searchable across every book published by Packt

    Copy and paste, print, and bookmark content

    On demand and accessible via a web browser

    Free access for Packt account holders

    Get notified! Find out when new books are published by following @PacktEnterprise on Twitter or the Packt Enterprise Facebook page.

    Preface

    Practical Data Analysis provides a series of practical projects in order to turn data into insight. It covers a wide range of data analysis tools and algorithms for classification, clustering, visualization, simulation and forecasting. The goal of this book is to help you to understand your data to find patterns, trends, relationships and insight.

    This book contains practical projects that take advantage of MongoDB, D3.js, and the Python language and its ecosystem to present the concepts using code snippets and detailed descriptions.

    What this book covers

    Chapter 1, Getting Started, discusses the principles of data analysis and the data analysis process.

    Chapter 2, Preprocessing Data, explains how to scrub and prepare your data for analysis, and introduces OpenRefine, a data cleansing tool.

    Chapter 3, Getting to Grips with Visualization, shows how to visualize different kinds of data using D3.js, a JavaScript visualization framework.

    Chapter 4, Text Classification, introduces binary classification using a Naïve Bayes algorithm to classify spam.

    Chapter 5, Similarity-Based Image Retrieval, presents a project to find the similarity between images using a dynamic time warping approach.

    Chapter 6, Simulation of Stock Prices, explains how to simulate a stock price using a Random Walk algorithm, visualized with a D3.js animation.

    Chapter 7, Predicting Gold Prices, introduces how Kernel Ridge Regression works, and how to use it to predict the gold price using time series.

    Chapter 8, Working with Support Vector Machines, describes how to use Support Vector Machines as a classification method.

    Chapter 9, Modeling Infectious Diseases with Cellular Automata, introduces the basic concepts of Computational Epidemiology simulation and explains how to implement a cellular automaton to simulate an epidemic outbreak using D3.js and JavaScript.

    Chapter 10, Working with Social Graphs, explains how to obtain and visualize your social media graph from Facebook using Gephi.

    Chapter 11, Working with Twitter Data, explains how to use the Twitter API to retrieve data from Twitter. We also see how to improve the text classification to perform sentiment analysis using the Naïve Bayes algorithm implemented in the Natural Language Toolkit (NLTK).

    Chapter 12, Data Processing and Aggregation with MongoDB, introduces the basic operations in MongoDB as well as methods for grouping, filtering, and aggregation.

    Chapter 13, Working with MapReduce, illustrates how to use the MapReduce programming model implemented in MongoDB.

    Chapter 14, Online Data Analysis with Jupyter and Wakari, explains how to use the Wakari platform and introduces the basic use of pandas and PIL with IPython.

    Chapter 15, Understanding Data Processing using Apache Spark, explains how to use a distributed file system along with the Cloudera VM and how to get started with a data environment. Finally, we describe the main features of Apache Spark with a practical example.
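
As a small taste of the projects listed above, the random walk simulation mentioned for Chapter 6 can be sketched in a few lines of Python. This is only an illustrative sketch and not the book's code: the function name `simulate_random_walk`, the starting price, the fixed step size, and the fair coin flip are all assumptions made for this example.

```python
import random


def simulate_random_walk(start_price, n_steps, step_size=1.0, seed=None):
    """Return a list of prices; each step moves up or down by step_size.

    Hypothetical helper for illustration only (not taken from the book).
    """
    rng = random.Random(seed)  # private generator so a seed makes runs repeatable
    prices = [start_price]
    for _ in range(n_steps):
        # Each step is an independent fair coin flip: up or down.
        move = step_size if rng.random() < 0.5 else -step_size
        prices.append(prices[-1] + move)
    return prices


if __name__ == "__main__":
    walk = simulate_random_walk(start_price=100.0, n_steps=10, seed=42)
    print(walk)
```

Note that nothing in this simple additive walk stops the price from dropping below zero; a more realistic model would likely use multiplicative steps.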

    What you need for this book

    The basic requirements for this book are as follows:

    Python

    OpenRefine

    D3.js

    mlpy

    Natural Language Toolkit (NLTK)

    Gephi

    MongoDB

    Who this book is for

    This book is for software developers, analysts, and computer scientists who want to implement data analysis and visualization in a practical way. The book is also intended to provide a self-contained set of practical projects for gaining insight from different kinds of data, such as time series, numerical, multidimensional, social media graphs, and text.

    You are not required to have previous knowledge about data analysis, but some basic knowledge about statistics and a general understanding of Python programming is assumed.

    Conventions

    In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning. Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: For this example, we will use the BeautifulSoup library version 4.

    A block of code is set as follows:

    from bs4 import BeautifulSoup

    import urllib.request

    from time import sleep

    from datetime import datetime

    Any command-line input or output is written as follows:

    >>> readers@packt.com

    >>> readers

    >>> packt.com
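
The sample output above looks like an e-mail address being split at the `@` sign. The code that produced it is not part of this preview, so the following is only a guess at one way to get the same three lines; the helper name `split_email` is hypothetical.

```python
def split_email(address):
    """Split an e-mail address into (local part, domain) at the first '@'.

    Hypothetical helper for illustration; not the book's code.
    """
    local_part, _, domain = address.partition("@")
    return local_part, domain


address = "readers@packt.com"
user, domain = split_email(address)
print(address)
print(user)
print(domain)
```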

    New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: Now, just click on the OK button to apply the transformation.

    Note

    Warnings or important notes appear in a box like this.

    Tip

    Tips and tricks appear like this.

    Reader feedback

    Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

    To send us general feedback, simply e-mail feedback@packtpub.com, and mention the book's title in the subject of your message.

    If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

    Customer support

    Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

    Downloading the example code

    You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

    You can download the code files by following these steps:

    Log in or register to our website using your e-mail address and password.

    Hover the mouse pointer on the SUPPORT tab at the top.

    Click on Code Downloads & Errata.

    Enter the name of the book in the Search box.

    Select the book for which you're looking to download the code files.

    From the drop-down menu, choose where you purchased this book.

    Click on Code Download.

    You can also download the code files by clicking on the Code Files button on the book's webpage at the Packt Publishing website. This page can be accessed by entering the book's name in the Search box. Please note that you need to be logged in to your Packt account.

    Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

    WinRAR / 7-Zip for Windows

    Zipeg / iZip / UnRarX for Mac

    7-Zip / PeaZip for Linux

    The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Practical-Data-Analysis-Second-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

    Downloading the color images of this book

    We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/B04227_PracticalDataAnalysisSecondEdition_ColorImages.pdf.

    Errata

    Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

    To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

    Piracy

    Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

    Please contact us at copyright@packtpub.com with a link to the suspected pirated material.

    We appreciate your help in protecting our authors and our ability to bring you valuable content.

    Questions

    If you have a problem with any aspect of this book, you can contact us at questions@packtpub.com, and we will do our best to address
