
Learning Predictive Analytics with R

Ebook · 638 pages · 4 hours


About this ebook

If you are a statistician, chief information officer, business analyst, data scientist, machine learning practitioner, quantitative analyst, or student of machine learning, this is the book for you.
Language: English
Release date: Sep 24, 2015
ISBN: 9781782169369



    Learning Predictive Analytics with R - Eric Mayor

    Table of Contents

    Learning Predictive Analytics with R

    Credits

    About the Author

    About the Reviewers

    www.PacktPub.com

    Support files, eBooks, discount offers, and more

    Why subscribe?

    Free access for Packt account holders

    Preface

    Prediction

    Supervised and unsupervised learning

    Unsupervised learning

    Supervised learning

    Classification and regression problems

    Classification

    Regression

    The role of field knowledge in data modeling

    Caveats

    What this book covers

    What you need for this book

    Who this book is for

    Conventions

    Reader feedback

    Customer support

    Downloading the example code

    Downloading the color images of this book

    Errata

    Piracy

    eBooks, discount offers, and more

    Questions

    1. Setting GNU R for Predictive Analytics

    Installing GNU R

    The R graphic user interface

    The menu bar of the R console

    A quick look at the File menu

    A quick look at the Misc menu

    Packages

    Installing packages in R

    Loading packages in R

    Summary

    2. Visualizing and Manipulating Data Using R

    The roulette case

    Histograms and bar plots

    Scatterplots

    Boxplots

    Line plots

    Application – Outlier detection

    Formatting plots

    Summary

    3. Data Visualization with Lattice

    Loading and discovering the lattice package

    Discovering multipanel conditioning with xyplot()

    Discovering other lattice plots

    Histograms

    Stacked bars

    Dotplots

    Displaying data points as text

    Updating graphics

    Case study – exploring cancer-related deaths in the US

    Discovering the dataset

    Integrating supplementary external data

    Summary

    4. Cluster Analysis

    Distance measures

    Learning by doing – partition clustering with kmeans()

    Setting the centroids

    Computing distances to centroids

    Computing the closest cluster for each case

    Tasks performed by the main function

    Internal validation

    Using k-means with public datasets

    Understanding the data with the all.us.city.crime.1970 dataset

    Finding the best number of clusters in the life.expectancy.1971 dataset

    External validation

    Summary

    5. Agglomerative Clustering Using hclust()

    The inner working of agglomerative clustering

    Agglomerative clustering with hclust()

    Exploring the results of votes in Switzerland

    The use of hierarchical clustering on binary attributes

    Summary

    6. Dimensionality Reduction with Principal Component Analysis

    The inner working of Principal Component Analysis

    Learning PCA in R

    Dealing with missing values

    Selecting how many components are relevant

    Naming the components using the loadings

    PCA scores

    Accessing the PCA scores

    PCA scores for analysis

    PCA diagnostics

    Summary

    7. Exploring Association Rules with Apriori

    Apriori – basic concepts

    Association rules

    Itemsets

    Support

    Confidence

    Lift

    The inner working of apriori

    Generating itemsets with support-based pruning

    Generating rules by using confidence-based pruning

    Analyzing data with apriori in R

    Using apriori for basic analysis

    Detailed analysis with apriori

    Preparing the data

    Analyzing the data

    Coercing association rules to a data frame

    Visualizing association rules

    Summary

    8. Probability Distributions, Covariance, and Correlation

    Probability distributions

    Introducing probability distributions

    Discrete uniform distribution

    The normal distribution

    The Student's t-distribution

    The binomial distribution

    The importance of distributions

    Covariance and correlation

    Covariance

    Correlation

    Pearson's correlation

    Spearman's correlation

    Summary

    9. Linear Regression

    Understanding simple regression

    Computing the intercept and slope coefficient

    Obtaining the residuals

    Computing the significance of the coefficient

    Working with multiple regression

    Analyzing data in R: correlation and regression

    First steps in the data analysis

    Performing the regression

    Checking for the normality of residuals

    Checking for variance inflation

    Examining potential mediations and comparing models

    Predicting new data

    Robust regression

    Bootstrapping

    Summary

    10. Classification with k-Nearest Neighbors and Naïve Bayes

    Understanding k-NN

    Working with k-NN in R

    How to select k

    Understanding Naïve Bayes

    Working with Naïve Bayes in R

    Computing the performance of classification

    Summary

    11. Classification Trees

    Understanding decision trees

    ID3

    Entropy

    Information gain

    C4.5

    The gain ratio

    Post-pruning

    C5.0

    Classification and regression trees and random forest

    CART

    Random forest

    Bagging

    Conditional inference trees and forests

    Installing the packages containing the required functions

    Installing C4.5

    Installing C5.0

    Installing CART

    Installing random forest

    Installing conditional inference trees

    Loading and preparing the data

    Performing the analyses in R

    Classification with C4.5

    The unpruned tree

    The pruned tree

    C50

    CART

    Pruning

    Random forests in R

    Examining the predictions on the testing set

    Conditional inference trees in R

    Caret – a unified framework for classification

    Summary

    12. Multilevel Analyses

    Nested data

    Multilevel regression

    Random intercepts and fixed slopes

    Random intercepts and random slopes

    Multilevel modeling in R

    The null model

    Random intercepts and fixed slopes

    Random intercepts and random slopes

    Predictions using multilevel models

    Using the predict() function

    Assessing prediction quality

    Summary

    13. Text Analytics with R

    An introduction to text analytics

    Loading the corpus

    Data preparation

    Preprocessing and inspecting the corpus

    Computing new attributes

    Creating the training and testing data frames

    Classification of the reviews

    Document classification with k-NN

    Document classification with Naïve Bayes

    Classification using logistic regression

    Document classification with support vector machines

    Mining the news with R

    A successful document classification

    Extracting the topics of the articles

    Collecting news articles in R from the New York Times article search API

    Summary

    14. Cross-validation and Bootstrapping Using Caret and Exporting Predictive Models Using PMML

    Cross-validation and bootstrapping of predictive models using the caret package

    Cross-validation

    Performing cross-validation in R with caret

    Bootstrapping

    Performing bootstrapping in R with caret

    Predicting new data

    Exporting models using PMML

    What is PMML?

    A brief description of the structure of PMML objects

    Examples of predictive model exportation

    Exporting k-means objects

    Hierarchical clustering

    Exporting association rules (apriori objects)

    Exporting Naïve Bayes objects

    Exporting decision trees (rpart objects)

    Exporting random forest objects

    Exporting logistic regression objects

    Exporting support vector machine objects

    Summary

    A. Exercises and Solutions

    Exercises

    Chapter 1 – Setting GNU R for Predictive Modeling

    Chapter 2 – Visualizing and Manipulating Data Using R

    Chapter 3 – Data Visualization with Lattice

    Chapter 4 – Cluster Analysis

    Chapter 5 – Agglomerative Clustering Using hclust()

    Chapter 6 – Dimensionality Reduction with Principal Component Analysis

    Chapter 7 – Exploring Association Rules with Apriori

    Chapter 8 – Probability Distributions, Covariance, and Correlation

    Chapter 9 – Linear Regression

    Chapter 10 – Classification with k-Nearest Neighbors and Naïve Bayes

    Chapter 11 – Classification Trees

    Chapter 12 – Multilevel Analyses

    Chapter 13 – Text Analytics with R

    Solutions

    Chapter 1 – Setting GNU R for Predictive Modeling

    Chapter 2 – Visualizing and Manipulating Data Using R

    Chapter 3 – Data Visualization with Lattice

    Chapter 4 – Cluster Analysis

    Chapter 5 – Agglomerative Clustering Using hclust()

    Chapter 6 – Dimensionality Reduction with Principal Component Analysis

    Chapter 7 – Exploring Association Rules with Apriori

    Chapter 8 – Probability Distributions, Covariance, and Correlation

    Chapter 9 – Linear Regression

    Chapter 10 – Classification with k-Nearest Neighbors and Naïve Bayes

    Chapter 11 – Classification Trees

    Chapter 12 – Multilevel Analyses

    Chapter 13 – Text Analytics with R

    B. Further Reading and References

    Preface

    Chapter 1 – Setting GNU R for Predictive Modeling

    Chapter 2 – Visualizing and Manipulating Data Using R

    Chapter 3 – Data Visualization with Lattice

    Chapter 4 – Cluster Analysis

    Chapter 5 – Agglomerative Clustering Using hclust()

    Chapter 6 – Dimensionality Reduction with Principal Component Analysis

    Chapter 7 – Exploring Association Rules with Apriori

    Chapter 8 – Probability Distributions, Covariance, and Correlation

    Chapter 9 – Linear Regression

    Chapter 10 – Classification with k-Nearest Neighbors and Naïve Bayes

    Chapter 11 – Classification Trees

    Chapter 12 – Multilevel Analyses

    Chapter 13 – Text Analytics with R

    Chapter 14 – Cross-validation and Bootstrapping Using Caret and Exporting Predictive Models Using PMML

    Index

    Learning Predictive Analytics with R

    Copyright © 2015 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    First published: September 2015

    Production reference: 1180915

    Published by Packt Publishing Ltd.

    Livery Place

    35 Livery Street

    Birmingham B3 2PB, UK.

    ISBN 978-1-78216-935-2

    www.packtpub.com

    Credits

    Author

    Eric Mayor

    Reviewers

    Ajay Dhamija

    Khaled Tannir

    Matt Wiley

    Commissioning Editor

    Kunal Parikh

    Acquisition Editor

    Kevin Colaco

    Content Development Editor

    Siddhesh Salvi

    Technical Editor

    Deepti Tuscano

    Copy Editors

    Puja Lalwani

    Merilyn Pereira

    Project Coordinator

    Kranti Berde

    Proofreader

    Safis Editing

    Indexer

    Rekha Nair

    Production Coordinator

    Aparna Bhagat

    Cover Work

    Aparna Bhagat

    About the Author

    Eric Mayor is a senior researcher and lecturer at the University of Neuchatel, Switzerland. He is an enthusiastic user of open source and proprietary predictive analytics software packages, such as R, Rapidminer, and Weka. He analyzes data on a daily basis and is keen to share his knowledge in a simple way.

    About the Reviewers

    Ajay Dhamija is a senior scientist in the Defence Research and Development Organization, Delhi. He has more than 24 years of experience as a researcher and an instructor. He holds an MTech (in computer science and engineering) from IIT Delhi and an MBA (in finance and strategy) from FMS, Delhi. He has to his credit more than 14 research works of international reputation in various fields, including data mining, reverse engineering, analytics, neural network simulation, TRIZ, and so on. He was instrumental in developing a state-of-the-art Computerized Pilot Selection System (CPSS) containing various cognitive and psycho-motor tests. It is used to comprehensively assess the flying aptitude of aspiring pilots of the Indian Air Force. Ajay was honored with an Agni Award for excellence in self reliance in 2005 by the Government of India. He specializes in predictive analytics, information security, big data analytics, machine learning, Bayesian social networks, financial modeling, neuro-fuzzy simulation, and data analysis and data mining using R. He is presently involved in his doctoral work on Financial Modeling of Carbon Finance Data from IIT, Delhi. He has written an international bestseller, Forecasting Exchange Rate: Use of Neural Networks in Quantitative Finance (http://www.amazon.com/Forecasting-Exchange-rate-Networks-Quantitative/dp/3639161807), and is currently authoring another book in R named Multivariate Analysis using R.

    Apart from Analytics, Ajay Dhamija is actively involved in information security research. He has been associated with various international and national researchers in the government as well as the corporate sector to pursue his research on ways to amalgamate two important and contemporary fields of data handling, that is, predictive analytics and information security. While he was associated with researchers from the Predictive Analytics and Information Security Institute of India (PRAISIA: www.praisia.com) during his research endeavors, he worked on refining methods of big data analytics for security data analysis (log assessment, incident analysis, threat prediction, and so on) and vulnerability management automation.

    You can connect with Ajay at:

    LinkedIn: ajaykumardhamija

    ResearchGate: Ajay_Dhamija2

    Academia: ajaydhamija

    Facebook: akdhamija

    Twitter: @akdhamija

    Quora: Ajay-Dhamija

    I would like to thank my fellow scientists from the Defence Research and Development Organization and researchers from the corporate sector, including the Predictive Analytics and Information Security Institute of India (PRAISIA), a unique institute of repute due to its pioneering work in marrying two giant and contemporary fields of data handling, that is, predictive analytics and information security, through bespoke and refined methods of big data analytics. They all contributed to a fruitful review of this book. I'm also thankful to my wife, Seema Dhamija, the managing director at PRAISIA, who was kind enough to share her research team's time with me for technical discussions. I'm also thankful to my son, Hemant Dhamija, who many a time gave invaluable feedback that I inadvertently neglected during the course of this review, and to a budding security researcher, Shubham Mittal from Makemytrip Inc., for his constant and constructive critiques of my work.

    Matt Wiley is a tenured associate professor of mathematics who currently resides in Victoria, Texas. He holds degrees in mathematics (with a computer science minor) from the University of California and a master's degree in business administration from Texas A&M University. He directs the Quality Enhancement Plan at Victoria College and is the managing partner at Elkhart Group Limited, a statistical consultancy. With programming experience in R, C++, Ruby, Fortran, and JavaScript, he has always found ways to meld his passion for writing with his love of logical problem solving and data science. From the boardroom to the classroom, Matt enjoys finding dynamic ways to partner with interdisciplinary and diverse teams to make complex ideas and projects understandable and solvable.

    Matt can be found online at www.MattWiley.org.

    Khaled Tannir is a visionary solution architect with more than 20 years of technical experience, who has focused on big data technologies and data mining since 2010.

    He is widely recognized as an expert in these fields and has a bachelor's degree in electronics and a master's degree in system information architectures. He completed his education with a master of research degree.

    Khaled is a Microsoft Certified Solution Developer (MCSD) and an avid technologist. He has worked for many companies in France (and recently in Canada), leading the development and implementation of software solutions and giving technical presentations.

    He is the author of the books RavenDB 2.x Beginner's Guide and Optimizing Hadoop MapReduce, Packt Publishing, and a technical reviewer on the books Pentaho Analytics for MongoDB and MongoDB High Availability, Packt Publishing.

    He enjoys taking landscape and night photos, traveling, playing video games, creating funny electronics gadgets using Arduino, Raspberry Pi, and .Net Gadgeteer, and of course spending time with his wife and family.

    You can reach him at contact@khaledtannir.net.

    www.PacktPub.com

    Support files, eBooks, discount offers, and more

    For support files and downloads related to your book, please visit www.PacktPub.com.

    Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and, as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us for more details.

    At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

    https://www2.packtpub.com/books/subscription/packtlib

    Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

    Why subscribe?

    Fully searchable across every book published by Packt

    Copy and paste, print, and bookmark content

    On demand and accessible via a web browser

    Free access for Packt account holders

    If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.

    Preface

    The amount of data in the world is increasing exponentially as time passes. It is estimated that the total amount of data produced in 2020 will be 20 zettabytes (Kotov, 2014), that is, 20 billion terabytes. Organizations spend a lot of effort and money on collecting and storing data, and still, most of it is not analyzed at all, or not analyzed properly. One reason to analyze data is to predict the future, that is, to produce actionable knowledge. The main purpose of this book is to show you how to do that with reasonably simple algorithms. The book is composed of chapters describing the algorithms and their use, and of appendices with exercises, solutions to the exercises, and references.

    Prediction

    What is meant by prediction? The answer, of course, depends on the field and the algorithms used, but the following explanation holds most of the time: given attested, reliable relationships between indicators (predictors) and an outcome, the presence (or level) of the indicators in similar cases is a reliable clue to the presence (or level) of the outcome in the future. Here are some examples of relationships, starting with the most obvious:

    Taller people weigh more

    Richer individuals spend more

    More intelligent individuals earn more

    Customers in segment X buy more of product Y

    Customers who bought product P will also buy product Q

    Products P and Q are bought together

    Some credit card transactions predict fraud (Chan et al., 1999)

    Google search queries predict influenza infections (Ginsberg et al., 2009)

    Tweet content predicts election poll outcomes (O'Connor and Balasubramanyan, 2010)

    In the following section, we provide minimal definitions of the distinctions between supervised and unsupervised learning and classification and regression problems.

    Supervised and unsupervised learning

    Two broad families of algorithms will be discussed in this book:

    Unsupervised learning algorithms

    Supervised learning algorithms

    Unsupervised learning

    In unsupervised learning, the algorithm will seek to find the structure that organizes unlabelled data. For instance, based on similarities or distances between observations, an unsupervised cluster analysis will determine groups and which observations fit best into each of the groups. An application of this is, for instance, document classification.
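
    As a minimal sketch (not code from the book), the kmeans() function covered in Chapter 4 can group the built-in iris measurements without ever seeing the species labels; the choice of three clusters and the seed are illustrative assumptions:

```r
# Unsupervised sketch: k-means groups unlabelled observations by similarity.
data(iris)
measures <- iris[, 1:4]          # keep only the measurements, not the labels
set.seed(42)                     # k-means starts from random centroids
fit <- kmeans(measures, centers = 3)
table(fit$cluster)               # size of each discovered group
```

    The algorithm is told only how many groups to look for, not what the groups mean; interpreting them is left to the analyst.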

    Supervised learning

    In supervised learning, we know the class or the level of some observations of a given target attribute. When performing a prediction, we use known relationships in labeled data (data for which we know what the class or level of the target attribute is) to predict the class or the level of the attribute in new cases (of which we do not know the value).
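
    A hedged sketch of this pattern, using the k-NN classifier from the class package (the algorithm is covered in Chapter 10) on the built-in iris data; the 100/50 split and k = 3 are illustrative assumptions, not the book's own analysis:

```r
# Supervised sketch: learn from labelled cases, predict the labels of new ones.
library(class)                     # provides knn(); ships with standard R installs
set.seed(1)
idx <- sample(nrow(iris), 100)     # 100 cases whose species we "know"
train <- iris[idx, 1:4]
new_cases <- iris[-idx, 1:4]       # pretend the species of these is unknown
predicted <- knn(train, new_cases, cl = iris$Species[idx], k = 3)
mean(predicted == iris$Species[-idx])   # accuracy on the held-out cases
```

    Because the true labels of the held-out cases are in fact known, we can measure how well the known relationships generalize to new data.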

    Classification and regression problems

    There are basically two types of problems that predictive modeling deals with:

    Classification problems

    Regression problems

    Classification

    In some cases, we want to predict which group an observation is part of. Here, we are dealing with a quality of the observation. This is a classification problem. Examples include:

    The prediction of the species of plants based on morphological measurements

    The prediction of whether individuals will develop a disease or not, based on their health habits

    The prediction of whether an e-mail is spam or not
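
    The first example above (predicting plant species from measurements) can be sketched with a classification tree from the rpart package (trees are covered in Chapter 11); the call is an illustration, not the book's own analysis:

```r
# Classification sketch: predict a quality (the species) from measurements.
library(rpart)                          # recursive partitioning trees
tree <- rpart(Species ~ ., data = iris)
predict(tree, iris[c(1, 51, 101), ], type = "class")
# → setosa, versicolor, virginica
```

    The predicted value is a class label, not a number: that is what makes this a classification problem.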

    Regression

    In other cases, we want to predict an observation's level on an attribute. Here, we are dealing with a quantity, and this is a regression problem. Examples include:

    The prediction of how much individuals will cost the healthcare system, based on their health habits

    The prediction of the weight of animals based on their diets

    The prediction of the number of defective devices based on manufacturing specifications
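
    The first relationship listed in the Prediction section ("taller people weigh more") makes a convenient regression sketch, using R's built-in women dataset; the new height value is an arbitrary illustration:

```r
# Regression sketch: predict a quantity (weight) from another quantity (height).
fit <- lm(weight ~ height, data = women)   # built-in data: 15 averaged cases
coef(fit)                                  # intercept and slope
predict(fit, data.frame(height = 70))      # predicted weight for a new case
```

    Here the predicted value is a number on a continuous scale, which is what distinguishes regression from classification.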

    The role of field knowledge in data modeling

    Of course, analyzing data without knowledge of the field is not a serious way to proceed. That is acceptable for showing how some algorithms work, how to make use of them, and for practice. However, for real-life applications, be sure that you know the topic well, or else consult experts for help. The Cross Industry Standard Process for Data Mining (CRISP-DM, Shearer, 2000) underlines the importance of field knowledge. The steps of the process are depicted as follows:

    The Cross Industry Standard Process for Data Mining

    As stressed in the preceding diagram, field knowledge (here called Business Understanding) informs and is informed by data understanding. The understanding of the data then informs how the data has to be prepared. The next step is data modeling, which can also lead to further data preparation. Data models have to be evaluated, and this evaluation can be informed by field knowledge (this is also stressed in the diagram), which is also updated through the data mining process. Finally, if the evaluation is satisfactory, the models are deployed for prediction. This book will focus on the data modeling and evaluation stages.

    Caveats

    Of course, predictions are not always accurate, and some have written about the caveats of data science. What do you think about the relationship between the attributes titled Predictor and Outcome in the following plot? It seems like there is a relationship between the two. For the statistically inclined, I tested its significance: r = 0.4195, p = .0024. The p-value is the probability of obtaining a relationship of this strength or stronger if there is actually no relationship between the attributes. We could conclude that the relationship between these variables in the population they come from is quite reliable, right?

    The relationship between the attributes titled Predictor and Outcome

    Believe it or not, the population these observations come from is that of randomly generated numbers. We generated a data frame of 50 columns of 50 randomly generated numbers each. We then examined all the correlations (manually) and generated a scatterplot of the two attributes with the largest correlation we found. The code is provided here in case you want to check it yourself: line 1 sets the seed so that you find the same results as we did, line 2 creates the data frame, line 3 fills it with random numbers, column by column, line 4 generates the scatterplot, line 5 adds the regression line, and line 6 tests the significance of the correlation:

    1  set.seed(1)

    2  DF = data.frame(matrix(nrow=50,ncol=50))

    3  for (i in 1:50) DF[,i] = runif(50)

    4  plot(DF[[2]],DF[[16]], xlab = "Predictor", ylab = "Outcome")

    5  abline(lm(DF[[16]]~DF[[2]]))

    6  cor.test(DF[[2]], DF[[16]])

    How could this relationship happen, given that the odds were 2.4 in 1,000? Well, think of it; we correlated all 50 attributes two by two, which resulted in 2,450 tests (not counting the correlation of each attribute with itself). Such a spurious correlation was to be expected. The usual threshold below which we consider a relationship significant is p = 0.05, as we will discuss in Chapter 8, Probability Distributions, Covariance, and Correlation. This means that we expect to be wrong once in 20 times. You would be right to suspect that there are other significant correlations in the generated data frame (there should be approximately 122 of them in total, that is, 5 percent of 2,450). This is why we should always correct for the number of tests. In our example, as we performed 2,450 tests, our threshold for significance should be 0.0000204 (0.05 / 2450). This is called the Bonferroni correction.
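
    In R, this kind of correction need not be done by hand: the built-in p.adjust() function implements Bonferroni (among less conservative alternatives). The p-values below are made-up illustrations:

```r
# Bonferroni sketch: multiply each p-value by the number of tests performed
# (capped at 1), instead of lowering the threshold by hand.
pvals <- c(0.0024, 0.010, 0.040, 0.200)
p.adjust(pvals, method = "bonferroni")
# → 0.0096 0.0400 0.1600 0.8000
```

    Comparing the adjusted p-values to 0.05 is equivalent to comparing the raw ones to 0.05 / length(pvals).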

    Spurious correlations are always a possibility in data analysis and this should be kept in mind at all times. A related concept is that of overfitting. Overfitting happens, for instance, when a weak classifier bases its prediction on the noise in data. We will discuss overfitting in the book, particularly when discussing cross-validation in Chapter 14, Cross-validation and Bootstrapping Using Caret and Exporting Predictive Models Using PMML. All the chapters are listed in the following section.

    We hope you enjoy reading the book.
