Perspectives on Data Science for Software Engineering
About this ebook
Perspectives on Data Science for Software Engineering presents the best practices of seasoned data miners in software engineering. The idea for this book emerged at a 2014 seminar at Dagstuhl, an invitation-only gathering of leading computer scientists who meet to identify and discuss cutting-edge informatics topics.
At that seminar, the question of how to transfer the knowledge of seasoned software engineers and data scientists to newcomers in the field dominated many discussions. While there are many books covering data mining and software engineering basics, they present only the fundamentals and lack the perspective that comes from real-world experience. This book offers unique insights into the wisdom of the community's leaders, gathered to share hard-won lessons from the trenches.
Ideas are presented in digestible chapters designed to be applicable across many domains. Topics covered include data collection, data sharing, data mining, and how to utilize these techniques in successful software projects. Newcomers to software engineering data science will learn the tips and tricks of the trade, while more experienced data scientists will benefit from war stories that show what traps to avoid.
- Presents the wisdom of community experts, derived from a summit on software analytics
- Provides contributed chapters that share discrete ideas and techniques from the trenches
- Covers top areas of concern, including mining security and social data, data visualization, and cloud-based data
- Presented in clear chapters designed to be applicable across many domains
Tim Menzies
Tim Menzies is a Full Professor of Computer Science at NC State University and a former software research chair at NASA. He has published more than 200 papers, many in the area of software analytics. He serves on the editorial boards of the IEEE Transactions on Software Engineering, the Automated Software Engineering journal, and the Empirical Software Engineering journal. His research includes artificial intelligence, data mining, and search-based software engineering. He is best known for his work on the PROMISE open source repository of data for reusable software engineering experiments.
Perspectives on Data Science for Software Engineering
First Edition
Tim Menzies
Laurie Williams
Thomas Zimmermann
Table of Contents
Cover image
Title page
Copyright
Contributors
Acknowledgments
Introduction
Perspectives on data science for software engineering
Abstract
Why This Book?
About This Book
The Future
Software analytics and its application in practice
Abstract
Six Perspectives of Software Analytics
Experiences in Putting Software Analytics into Practice
Seven principles of inductive software engineering: What we do is different
Abstract
Different and Important
Principle #1: Humans Before Algorithms
Principle #2: Plan for Scale
Principle #3: Get Early Feedback
Principle #4: Be Open Minded
Principle #5: Be Smart With Your Learning
Principle #6: Live With the Data You Have
Principle #7: Develop a Broad Skill Set That Uses a Big Toolkit
The need for data analysis patterns (in software engineering)
Abstract
The Remedy Metaphor
Software Engineering Data
Needs of Data Analysis Patterns
Building Remedies for Data Analysis in Software Engineering Research
From software data to software theory: The path less traveled
Abstract
Pathways of Software Repository Research
From Observation, to Theory, to Practice
Why theory matters
Abstract
Introduction
How to Use Theory
How to Build Theory
In Summary: Find a Theory or Build One Yourself
Success Stories/Applications
Mining apps for anomalies
Abstract
The Million-Dollar Question
App Mining
Detecting Abnormal Behavior
A Treasure Trove of Data …
… but Also Obstacles
Executive Summary
Embrace dynamic artifacts
Abstract
Acknowledgments
Can We Minimize the USB Driver Test Suite?
Still Not Convinced? Here’s More
Dynamic Artifacts Are Here to Stay
Mobile app store analytics
Abstract
Introduction
Understanding End Users
Conclusion
The naturalness of software
Abstract
Introduction
Transforming Software Practice
Conclusion
Advances in release readiness
Abstract
Predictive Test Metrics
Universal Release Criteria Model
Best Estimation Technique
Resource/Schedule/Content Model
Using Models in Release Management
Research to Implementation: A Difficult (but Rewarding) Journey
How to tame your online services
Abstract
Background
Service Analysis Studio
Success Story
Measuring individual productivity
Abstract
No Single and Simple Best Metric for Success/Productivity
Measure the Process, Not Just the Outcome
Allow for Measures to Evolve
Goodhart’s Law and the Effect of Measuring
How to Measure Individual Productivity?
Stack traces reveal attack surfaces
Abstract
Another Use of Stack Traces?
Attack Surface Approximation
Visual analytics for software engineering data
Abstract
Gameplay data plays nicer when divided into cohorts
Abstract
Cohort Analysis as a Tool for Gameplay Data
Play to Lose
Forming Cohorts
Case Studies of Gameplay Data
Challenges of Using Cohorts
Summary
A success story in applying data science in practice
Abstract
Overview
Analytics Process
Communication Process—Best Practices
Summary
There's never enough time to do all the testing you want
Abstract
The Impact of Short Release Cycles (There's Not Enough Time)
Learn From Your Test Execution History
The Art of Testing Less
Tests Evolve Over Time
In Summary
The perils of energy mining: measure a bunch, compare just once
Abstract
A Tale of Two HTTPs
Let's ENERGISE Your Software Energy Experiments
Summary
Identifying fault-prone files in large industrial software systems
Abstract
Acknowledgment
A tailored suit: The big opportunity in personalizing issue tracking
Abstract
Many Choices, Nothing Great
The Need for Personalization
Developer Dashboards or A Tailored Suit
Room for Improvement
What counts is decisions, not numbers—Toward an analytics design sheet
Abstract
Decisions Everywhere
The Decision-Making Process
The Analytics Design Sheet
Example: App Store Release Analysis
A large ecosystem study to understand the effect of programming languages on code quality
Abstract
Comparing Languages
Study Design and Analysis
Results
Summary
Code reviews are not for finding defects—Even established tools need occasional evaluation
Abstract
Results
Effects
Conclusions
Techniques
Interviews
Abstract
Why Interview?
The Interview Guide
Selecting Interviewees
Recruitment
Collecting Background Data
Conducting the Interview
Post-Interview Discussion and Notes
Transcription
Analysis
Reporting
Now Go Interview!
Look for state transitions in temporal data
Abstract
Bikeshedding in Software Engineering
Summarizing Temporal Data
Recommendations
Card-sorting: From text to themes
Abstract
Preparation Phase
Execution Phase
Analysis Phase
Tools! Tools! We need tools!
Abstract
Tools in Science
The Tools We Need
Recommendations for Tool Building
Evidence-based software engineering
Abstract
Introduction
The Aim and Methodology of EBSE
Contextualizing Evidence
Strength of Evidence
Evidence and Theory
Which machine learning method do you need?
Abstract
Learning Styles
Do Additional Data Arrive Over Time?
Are Changes Likely to Happen Over Time?
If You Have a Prediction Problem, What Do You Really Need to Predict?
Do You Have a Prediction Problem Where Unlabeled Data are Abundant and Labeled Data are Expensive?
Are Your Data Imbalanced?
Do You Need to Use Data From Different Sources?
Do You Have Big Data?
Do You Have Little Data?
In Summary…
Structure your unstructured data first!: The case of summarizing unstructured data with tag clouds
Abstract
Unstructured Data in Software Engineering
Summarizing Unstructured Software Data
Conclusion
Parse that data! Practical tips for preparing your raw data for analysis
Abstract
Use Assertions Everywhere
Print Information About Broken Records
Use Sets or Counters to Store Occurrences of Categorical Variables
Restart Parsing in the Middle of the Data Set
Test on a Small Subset of Your Data
Redirect Stdout and Stderr to Log Files
Store Raw Data Alongside Cleaned Data
Finally, Write a Verifier Program to Check the Integrity of Your Cleaned Data
Natural language processing is no free lunch
Abstract
Natural Language Data in Software Projects
Natural Language Processing
How to Apply NLP to Software Projects
Summary
Aggregating empirical evidence for more trustworthy decisions
Abstract
What's Evidence?
What Does Data From Empirical Studies Look Like?
The Evidence-Based Paradigm and Systematic Reviews
How Far Can We Use the Outcomes From Systematic Review to Make Decisions?
If it is software engineering, it is (probably) a Bayesian factor
Abstract
Causing the Future With Bayesian Networks
The Need for a Hybrid Approach in Software Analytics
Use the Methodology, Not the Model
Becoming Goldilocks: Privacy and data sharing in "just right" conditions
Abstract
Acknowledgments
The Data Drought
Change is Good
Don’t Share Everything
Share Your Leaders
Summary
The wisdom of the crowds in predictive modeling for software engineering
Abstract
The Wisdom of the Crowds
So… How is That Related to Predictive Modeling for Software Engineering?
Examples of Ensembles and Factors Affecting Their Accuracy
Crowds for Transferring Knowledge and Dealing With Changes
Crowds for Multiple Goals
A Crowd of Insights
Ensembles as Versatile Tools
Combining quantitative and qualitative methods (when mining software data)
Abstract
Prologue: We Have Solid Empirical Evidence!
Correlation is Not Causation and, Even If We Can Claim Causation…
Collect Your Data: People and Artifacts
Build a Theory Upon Your Data
Conclusion: The Truth is Out There!
Suggested Readings
A process for surviving survey design and sailing through survey deployment
Abstract
Acknowledgments
The Lure of the Sirens: The Attraction of Surveys
Navigating the Open Seas: A Successful Survey Process in Software Engineering
In Summary
Wisdom
Log it all?
Abstract
A Parable: The Blind Woman and an Elephant
Misinterpreting Phenomenon in Software Engineering
Using Data to Expand Perspectives
Recommendations
Why provenance matters
Abstract
What’s Provenance?
What are the Key Entities?
What are the Key Tasks?
Another Example
Looking Ahead
Open from the beginning
Abstract
Alitheia Core
GHTorrent
Why the Difference?
Be Open or Be Irrelevant
Reducing time to insight
Abstract
What is Insight Anyway?
Time to Insight
The Insight Value Chain
What To Do
A Warning on Waste
Five steps for success: How to deploy data science in your organizations
Abstract
Step 1. Choose the Right Questions for the Right Team
Step 2. Work Closely With Your Consumers
Step 3. Validate and Calibrate Your Data
Step 4. Speak Plainly to Give Results Business Value
Step 5. Go the Last Mile—Operationalizing Predictive Models
How the release process impacts your software analytics
Abstract
Linking Defect Reports and Code Changes to a Release
How the Version Control System Can Help
Security cannot be measured
Abstract
Gotcha #1: Security is Negatively Defined
Gotcha #2: Having Vulnerabilities is Actually Normal
Gotcha #3: More Vulnerabilities Does Not Always Mean Less Secure
Gotcha #4: Design Flaws are not Usually Tracked
Gotcha #5: Hackers are Innovative Too
An Unfair Question
Gotchas from mining bug reports
Abstract
Do Bug Reports Describe Code Defects?
It's the User That Defines the Work Item Type
Do Developers Apply Atomic Changes?
In Summary
Make visualization part of your analysis process
Abstract
Leveraging Visualizations: An Example With Software Repository Histories
How to Jump the Pitfalls
Don't forget the developers! (and be careful with your assumptions)
Abstract
Acknowledgments
Disclaimer
Background
Are We Actually Helping Developers?
Some Observations and Recommendations
Limitations and context of research
Abstract
Small Research Projects
Data Quality of Open Source Repositories
Lack of Industrial Representatives at Conferences
Research From Industry
Summary
Actionable metrics are better metrics
Abstract
What Would You Say… I Should DO?
The Offenders
Actionable Heroes
Cyclomatic Complexity: An Interesting Case
Are Unactionable Metrics Useless?
Replicated results are more trustworthy
Abstract
The Replication Crisis
Reproducible Studies
Reliability and Validity in Studies
So What Should Researchers Do?
So What Should Practitioners Do?
Diversity in software engineering research
Abstract
Introduction
What Is Diversity and Representativeness?
What Can We Do About It?
Evaluation
Recommendations
Future Work
Once is not enough: Why we need replication
Abstract
Motivating Example and Tips
Exploring the Unknown
Types of Empirical Results
Do's and Don'ts
Mere numbers aren't enough: A plea for visualization
Abstract
Numbers Are Good, but…
Case Studies on Visualization
What to Do
Don’t embarrass yourself: Beware of bias in your data
Abstract
Dewey Defeats Truman
Impact of Bias in Software Engineering
Identifying Bias
Assessing Impact
Which Features Should I Look At?
Operational data are missing, incorrect, and decontextualized
Abstract
Background
Examples
A Life of a Defect
What to Do?
Data science revolution in process improvement and assessment?
Abstract
Correlation is not causation (or, when not to scream "Eureka!")
Abstract
What Not to Do
Example
Examples from Software Engineering
What to Do
In Summary: Wait and Reflect Before You Report
Software analytics for small software companies: More questions than answers
Abstract
The Reality for Small Software Companies
Small Software Companies Projects: Smaller and Shorter
Different Goals and Needs
What to Do About the Dearth of Data?
What to Do on a Tight Budget?
Software analytics under the lamp post (or what Star Trek teaches us about the importance of asking the right questions)
Abstract
Prologue
Learning from Data
Which Bin is Mine?
Epilogue
What can go wrong in software engineering experiments?
Abstract
Operationalize Constructs
Evaluate Different Design Alternatives
Match Data Analysis and Experimental Design
Do Not Rely on Statistical Significance Alone
Do a Power Analysis
Find Explanations for Results
Follow Guidelines for Reporting Experiments
Improving the reliability of experimental results
One size does not fit all
Abstract
While models are good, simple explanations are better
Abstract
Acknowledgments
How Do We Compare a USB2 Driver to a USB3 Driver?
The Issue With Our Initial Approach
Just Tell us What Is Different and Nothing More
Looking Back
Users Prefer Simple Explanations
The white-shirt effect: Learning from failed expectations
Abstract
A Story
The Right Reaction
Practical Advice
Simpler questions can lead to better insights
Abstract
Introduction
Context of the Software Analytics Project
Providing Predictions on Buggy Changes
How to Read the Graph?
(Anti-)Patterns in the Error-Handling Graph
How to Act on (Anti-)Patterns?
Summary
Continuously experiment to assess values early on
Abstract
Most Ideas Fail to Show Value
Every Idea Can Be Tested With an Experiment
How Do We Find Good Hypotheses and Conduct the Right Experiments?
Key Takeaways
Lies, damned lies, and analytics: Why big data needs thick data
Abstract
How Great It Is, to Have Data Like You
Looking for Answers in All the Wrong Places
Beware the Reality Distortion Field
Build It and They Will Come, but Should We?
To Classify Is Human, but Analytics Relies on Algorithms
Lean in: How Ethnography Can Improve Software Analytics and Vice Versa
Finding the Ethnographer Within
The world is your test suite
Abstract
Watch the World and Learn
Crashes, Hangs, and Bluescreens
The Need for Speed
Protecting Data and Identity
Discovering Confusion and Missing Requirements
Monitoring Is Mandatory
Copyright
Morgan Kaufmann is an imprint of Elsevier
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, USA
© 2016 Elsevier Inc. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.
This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).
Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.
Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.
To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.
Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library
ISBN: 978-0-12-804206-9
For information on all Morgan Kaufmann publications visit our website at https://www.elsevier.com/
Publisher: Todd Green
Editorial Project Manager: Lindsay Lawrence
Production Project Manager: Mohana Natarajan
Cover Designer: Mark Rogers
Typeset by SPi Global, India
Contributors
Bram Adams Polytechnique Montréal, Canada
A. Bacchelli Delft University of Technology, Delft, The Netherlands
T. Barik North Carolina State University, Raleigh, NC, United States
E.T. Barr University College London, London, United Kingdom
O. Baysal Carleton University, Ottawa, ON, Canada
A. Bener Ryerson University, Toronto ON, Canada
G.R. Bergersen University of Oslo, Norway
C. Bird Microsoft Research, Redmond, WA, United States
D. Budgen Durham University, Durham, United Kingdom
B. Caglayan Lero, University of Limerick, Ireland
T. Carnahan Microsoft, Redmond, WA, United States
J. Czerwonka Principal Architect Microsoft Corp., Redmond, WA, United States
P. Devanbu UC Davis, Davis, CA, United States
M. Di Penta University of Sannio, Benevento, Italy
S. Diehl University of Trier, Trier, Germany
T. Dybå SINTEF ICT, Trondheim, Norway
T. Fritz University of Zurich, Zurich, Switzerland
M.W. Godfrey University of Waterloo, Waterloo, ON, Canada
G. Gousios Radboud University Nijmegen, Nijmegen, The Netherlands
P. Guo University of Rochester, Rochester, NY, United States
K. Herzig Software Development Engineer, Microsoft Corporation, Redmond, United States
A. Hindle University of Alberta, Edmonton, AB, Canada
R. Holmes University of British Columbia, Vancouver, Canada
Zhitao Hou Microsoft Research, Beijing, China
J. Huang Brown University, Providence, RI, United States
Andrew J. Ko University of Washington, Seattle, WA, United States
N. Juristo Universidad Politécnica de Madrid, Madrid, Spain
S. Just Researcher, Software Engineering Chair & Center for IT-Security, Privacy and Accountability, Saarland University, Germany
M. Kim University of California, Los Angeles
E. Kocaguneli Microsoft, Seattle, WA, United States
K. Kuutti University of Oulu, Oulu, Finland
Qingwei Lin Microsoft Research, Beijing, China
Jian-Guang Lou Microsoft Research, Beijing, China
N. Medvidovic University of Southern California, Los Angeles, CA, United States
A. Meneely Rochester Institute of Technology, Rochester, NY, United States
T. Menzies North Carolina State University, Raleigh, NC, United States
L.L. Minku University of Leicester, Leicester, United Kingdom
A. Mockus The University of Tennessee, Knoxville, TN, United States
J. Münch
Reutlingen University, Reutlingen, Germany
University of Helsinki, Helsinki, Finland
Herman Hollerith Center, Böblingen, Germany
G.C. Murphy University of British Columbia, Vancouver, BC, Canada
B. Murphy Microsoft Research, Cambridge, United Kingdom
E. Murphy-Hill North Carolina State University, Raleigh, NC, United States
M. Nagappan Rochester Institute of Technology, Rochester, NY, United States
M. Nayebi University of Calgary, Calgary, AB, Canada
M. Oivo University of Oulu, Oulu, Finland
A. Orso Georgia Institute of Technology, Atlanta, GA, United States
T. Ostrand Mälardalen University, Västerås, Sweden
F. Peters Lero - The Irish Software Research Centre, University of Limerick, Limerick, Ireland
D. Posnett University of California, Davis, CA, United States
L. Prechelt Freie Universität Berlin, Berlin, Germany
Venkatesh-Prasad Ranganath Kansas State University, Manhattan, KS, United States
B. Ray University of Virginia, Charlottesville, VA, United States
R. Robbes University of Chile, Santiago, Chile
P. Rotella Cisco Systems, Inc., Raleigh, NC, United States
G. Ruhe University of Calgary, Calgary, AB, Canada
P. Runeson Lund University, Lund, Sweden
B. Russo Software Engineering Research Group, Faculty of Computer Science, Free University of Bozen-Bolzano, Italy
M. Shepperd Brunel University London, Uxbridge, United Kingdom
E. Shihab Concordia University, Montreal, QC, Canada
D.I.K. Sjøberg
University of Oslo
SINTEF ICT, Trondheim, Norway
D. Spinellis Athens University of Economics and Business, Athens, Greece
M.-A. Storey University of Victoria, Victoria, BC, Canada
C. Theisen North Carolina State University, Raleigh, NC, United States
A. Tosun Istanbul Technical University, Maslak Istanbul, Turkey
B. Turhan University of Oulu, Oulu, Finland
H. Valdivia-Garcia Rochester Institute of Technology, Rochester, NY, United States
S. Vegas Universidad Politécnica de Madrid, Madrid, Spain
S. Wagner University of Stuttgart, Stuttgart, Germany
E. Weyuker Mälardalen University, Västerås, Sweden
J. Whitehead University of California, Santa Cruz, CA, United States
L. Williams North Carolina State University, Raleigh, NC, United States
Tao Xie University of Illinois at Urbana-Champaign, Urbana, IL, United States
A. Zeller Saarland University, Saarbrücken, Germany
Dongmei Zhang Microsoft Research, Beijing, China
Hongyu Zhang Microsoft Research, Beijing, China
Haidong Zhang Microsoft Research, Beijing, China
T. Zimmermann Microsoft Research, Redmond, WA, United States
Acknowledgments
A project this size is completed only with the dedicated help of many people. Accordingly, the editors of this book gratefully acknowledge the extensive and professional work of our authors and the Morgan Kaufmann editorial team. Also, special thanks to the staff and organizers of Schloss Dagstuhl (https://www.dagstuhl.de/ueber-dagstuhl/, where computer scientists meet) who hosted the original meeting that was the genesis of this book.
Introduction
Perspectives on data science for software engineering
T. Menzies*; L. Williams*; T. Zimmermann†
* North Carolina State University, Raleigh, NC, United States
† Microsoft Research, Redmond, WA, United States
Abstract
Given recent increases in how much data we can collect, and given a shortage of skilled analysts who can assess that data, there now exists more data than people to study it. Consequently, the analysis of real-world data is an exploding field, to say the least. For software projects, a great deal of information is recorded in software repositories. Never before have we had so much information about the details of how people collaborate to build software.
Keywords
Data Science; Software Analytics; Mining Software Repositories; Software repositories; Data mining; Data analytics
Chapter Outline
Why This Book?
About This Book
The Future
References
Why This Book?
Historically, this book began as a week-long workshop in Dagstuhl, Germany [1]. The goal of that meeting was to document the wide range of work on software analytics.
That meeting had the following premise: So little time, so much data.
That is, given recent increases in how much data we can collect, and given a shortage of skilled analysts who can assess that data [2], there now exists more data than people to study it. Consequently, the analysis of real-world data (using semi-automatic or fully automatic methods) is an exploding field, to say the least.
This issue is made more pressing by two factors:
• Many useful methods: Decades of research in artificial intelligence, social science methods, visualizations, statistics, etc. has generated a large number of powerful methods for learning from data.
• Much support for those methods: Many of those methods are explored in standard textbooks and education programs. Those methods are also supported in toolkits that are widely available (sometimes even via free downloads). Further, given the "Big Data" revolution, it is now possible to acquire the hardware necessary, even for the longest runs of these tools. So now the issue becomes not how to get these tools but, instead, how to use them.
If general analytics is an active field, software analytics is doubly so. Consider what we know about software projects:
• source code;
• emails about that code;
• check-ins;
• work items;
• bug reports;
• test suites;
• test executions;
• and even some background information on the developers.
All that information is recorded in software repositories, such as CVS, Subversion, Git, GitHub, and Bugzilla. Found in these repositories are telemetry data, run-time traces, and log files reflecting how customers experience software; application and feature usage; records of performance and reliability; and more.
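To make the point concrete, here is a minimal sketch (ours, assuming only a local Git checkout) of how little code it takes to start mining such a repository; it counts how often each file has changed, a common first signal in defect-prediction studies.

```python
# A minimal repository-mining sketch: count how often each file changes
# in a local Git checkout.
import subprocess
from collections import Counter

def change_counts(repo_path="."):
    # "--pretty=format:" suppresses commit headers; "--name-only" leaves one
    # changed file per line, so counting non-blank lines counts changes per file.
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "--name-only", "--pretty=format:"],
        capture_output=True, text=True, check=True,
    ).stdout
    return Counter(line for line in log.splitlines() if line.strip())

if __name__ == "__main__":
    # The ten most frequently changed files.
    for path, n in change_counts().most_common(10):
        print(f"{n:5d}  {path}")
```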
Never before have we had so much information about the details on how people collaborate to
• use someone else’s insights and software tools;
• generate and distribute new insights and software tools;
• maintain and update existing insights and software tools.
Here, by "tools" we mean everything from the four lines of SQL that are triggered when someone surfs to a web page, to scripts that might be only dozens to hundreds of lines of code, to much larger open source and proprietary systems. Also, our use of "tools" covers building new tools, ongoing maintenance work, and combinations of hardware and software systems.
Accordingly, for your consideration, this book explores the process for analyzing data from software development applications to generate insights. The chapters here were written by participants at the Dagstuhl workshop (Fig. 1), plus numerous other experts in the field on industrial and academic data mining. Our goal is to summarize and distribute their experience and combined wisdom and understanding about the data analysis process.
Fig. 1 The participants of the Dagstuhl Seminar 14261 on Software Development Analytics (June 22-27, 2014).
About This Book
Each chapter is aimed at a generalized audience with some technical interest in software engineering (SE). Hence, the chapters are very short and to the point. Also, the chapter authors have taken care to avoid excessive and confusing techno-speak.
As to insights themselves, they are in two categories:
• Lessons specific to software engineering: Some chapters offer valuable comments on issues that are specific to data science for software engineering. For example, see Guenther Ruhe's excellent chapter on decision support for software engineering.
• General lessons about data analytics: Other chapters are more general. These comment on issues relating to drawing conclusions from real-world data. The case study material for these chapters comes from the domain of software engineering problems. That said, this material has much to offer data scientists working in many other domains.
Our insights take many forms:
• Some introductory material to set the scene;
• Success stories and application case studies;
• Techniques;
• Words of wisdom;
• Tips for success, traps for the unwary, as well as the steps required to avoid those traps.
That said, all our insights have one thing in common: we wish we had known them years ago! If we had, then that would have saved us and our clients so much time and money.
The Future
While these chapters were written by experts, they are hardly complete. Data science methods for SE are continually changing, so we view this book as a "first edition" that will need significant and regular updates. To that end, we have created a news group for posting new insights. Feel free to make any comment at all there.
• To browse the messages in that group, go to https://groups.google.com/forum/#!forum/perspectivesds4se
• To post to that group, send an email to perspectivesds4se@googlegroups.com
• To unsubscribe from that group, send an email to perspectivesds4se+unsubscribe@googlegroups.com
Note that if you want to be considered for any future update of this book:
• Make the subject line an eye-catching "mantra"; ie, a slogan reflecting a best practice for data science for SE.
• The post should read something like the chapters of this book. That is, it should:
○ be short, and to the point;
○ make little or no use of jargon, formulas, diagrams, or references;
○ be approachable by a broad audience and have a clear take-away message.
Share and enjoy!
References
[1] Gall H., Menzies T., Williams L., Zimmermann T. Software development analytics (Dagstuhl Seminar 14261). Dagstuhl Reports. 2014;4(6):64–83. http://drops.dagstuhl.de/opus/volltexte/2014/4763/.
[2] McKinsey & Company. Big data: the next frontier for competition. http://www.mckinsey.com/features/big_data.
Software analytics and its application in practice
Dongmei Zhang*; Tao Xie†
* Microsoft Research, Beijing, China
† University of Illinois at Urbana-Champaign, Urbana, IL, United States
Abstract
A huge wealth of data exists in the software life cycle, and hidden in that data is information about the quality of software and services, as well as the dynamics of software development. Using various analytical and computing technologies, software analytics aims to obtain insightful and actionable information for data-driven tasks in engineering software and services. In this chapter, we discuss different aspects of software analytics, and we share lessons learned from putting software analytics into practice.
Keywords
Software analytics; research topics; target audience; technology pillars; connection to practice
Various types of data naturally exist in the software development process, such as source code, bug reports, check-in histories, and test cases. As software services and mobile applications are widely available in the Internet era, a huge amount of program runtime data, eg, traces, system events, and performance counters, as well as users’ usage data, eg, usage logs, user surveys, online forum posts, blogs, and tweets, can be readily collected.
Considering the increasing abundance and importance of data in the software domain, software analytics [1,2] utilizes data-driven approaches to enable software practitioners to perform data exploration and analysis, in order to obtain insightful and actionable information for completing various tasks around software systems, software users, and the software development process.
Software analytics has broad applications in real practice. For example, using a mechanism similar to Windows error reporting [3], event tracing for Windows traces can be collected to achieve Windows performance debugging [4]. Given limited time and resources, a major challenge is to identify and prioritize the performance issues using millions of callstack traces. Another example is data-driven quality management for online services [5]. When a live-site issue occurs, a major challenge is to help service-operation personnel utilize the humongous number of service logs and performance counters to quickly diagnose the issue and restore the service.
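As a toy illustration of that prioritization idea (a sketch only, with made-up module names; StackMine's actual clustering of millions of traces is far more sophisticated [4]), one can group traces by a callstack signature and rank the groups by frequency:

```python
from collections import Counter

# Hypothetical traces, each a tuple of stack frames (innermost first).
traces = [
    ("ntoskrnl!KeWait", "app!LoadFile", "app!Main"),
    ("ntoskrnl!KeWait", "app!LoadFile", "app!Main"),
    ("gdi32!Render", "app!Paint", "app!Main"),
]

# Use the top two frames as a crude signature; the most frequent signatures
# suggest where analysts should look first.
signatures = Counter(trace[:2] for trace in traces)
for signature, count in signatures.most_common():
    print(count, " <- ".join(signature))
```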
In this chapter, we discuss software analytics from six perspectives. We also share our experiences on putting software analytics into practice.
Six Perspectives of Software Analytics
The six perspectives of software analytics are research topics, target audience, input, output, technology pillars, and connection to practice. While the first four perspectives are easily accessible from the definition of software analytics, the last two need some elaboration.
As stated in the definition, software analytics focuses on software systems, software users, and the software development process. From the research point of view, these focuses constitute three research topics—software quality, user experience, and development productivity. As illustrated in the aforementioned examples, the variety of data input to software analytics is huge. Regarding the insightful and actionable output, it often requires well-designed and complex analytics techniques to create such output. It should be noted that the target audience of software analytics spans a broad range of software practitioners, including developers, testers, program managers, product managers, operation engineers, usability engineers, UX designers, customer-support engineers, management personnel, etc.
Technology Pillars. In general, the primary technologies employed by software analytics include large-scale computing (to handle large-scale datasets); analysis algorithms in machine learning, data mining, and pattern recognition (to analyze data); and information visualization (to help with analyzing data and presenting insights). While the software domain is called the "vertical area" on which software analytics focuses, these three technology areas are called the "horizontal" research areas. Quite often, there are challenges in the vertical area that cannot be readily addressed using the existing technologies in one or more of the horizontal areas. Such challenges can open up new research opportunities in the corresponding horizontal areas.
Connection-to-Practice. Software analytics is naturally tied to practice, with four real elements: real data, real problems, real users, and real tools.
Real data. The data sources under study in software analytics come from real-world settings, including both industrial proprietary settings and open-source settings. For example, open-source communities provide a huge data vault of source code, bug history, and check-in information, etc.; and better yet, the vault is active and evolving, which makes the data sources fresh and live.
Real problems. There are various types of questions to be answered in practice using the rich software artifacts. For example, when a service system is down, how can service engineers quickly diagnose the problem and restore the service [5]? How can monthly and daily active users be increased based on the usage data?
Real users. The aforementioned target audience is the consumers of software analytics results, techniques, and tools. They are also a source of feedback for continuously improving and motivating software analytics research.
Real tools. Software artifacts are constantly changing. Getting actionable insights from such dynamic data sources is critical to completing many software-related tasks. To accomplish this, software analytics tools are often deployed as part of software systems, enabling rich, reliable, and timely analyses requested by software practitioners.
Experiences in Putting Software Analytics into Practice
The connection-to-practice nature opens up great opportunities for software analytics to make an impact with a focus on the real settings. Furthermore, there is huge potential for the impact to be broad and deep because software analytics spreads across the areas of system quality, user experience, development productivity, etc.
Despite these opportunities, there are still significant challenges in putting software analytics technologies into real use. For example, how do we ensure that the analysis output is insightful and actionable? How do we know whether practitioners are concerned about the questions answered with the data? How do we evaluate our analysis techniques in real-world settings? Next we share some of our learnings from working on various software analytics projects [2,4–6].
Identifying essential problems. Various types of data are incredibly rich in the software domain, and the scale of the data is significantly large. It is often not difficult to grab some datasets, apply certain data analysis techniques, and obtain some observations. However, these observations, even with good evaluation results from the data-analysis perspective, may not be useful for accomplishing the target task of practitioners. It is important to first identify the essential problems for accomplishing the target task in practice, and then obtain the right datasets suitable for helping solve those problems. Essential problems are those whose solution substantially improves the overall effectiveness of tackling the task, such as improving software quality, user experience, and practitioner productivity.
Usable system built early to collect feedback. It is an iterative process to create software analytics solutions to solve essential problems in practice. Therefore, it is much more effective to build a usable system early on in order to start the feedback loop with the software practitioners. The feedback is often valuable for formulating research problems and researching appropriate analysis algorithms. In addition, software analytics projects can benefit from early feedback in terms of building trust between researchers and practitioners, as well as enabling the evaluation of the results in real-world settings.
Using domain semantics for proper data preparation. Software artifacts often carry semantics specific to the software domain; therefore, they cannot simply be treated as generic data such as text and sequences. For example, callstacks are sequences with program execution logic, and bug reports contain relational data and free text describing software defects. Understanding the semantics of software artifacts is a prerequisite for analyzing the data later on. In the case of StackMine [4], there was a steep learning curve for us to understand the performance traces before we could conduct any analysis.
In practice, understanding data is three-fold: data interpretation, data selection, and data filtering. To conduct data interpretation, researchers need to understand basic definitions of domain-specific terminologies and concepts. To conduct data selection, researchers need to understand the connections between the data and the problem being solved. To conduct data filtering, researchers need to understand defects and limitations of existing data to avoid incorrect inference.
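As a small illustration of the data filtering step (ours, with hypothetical field names), domain knowledge about known defects in the data can be encoded as an explicit filter before any inference is run:

```python
# Hypothetical trace records; two of them exhibit known data defects.
raw_traces = [
    {"event": "open",  "timestamp": 105, "collector_start": 100, "duration_ms": 12},
    {"event": "read",  "timestamp": 90,  "collector_start": 100, "duration_ms": 3},   # logged before collection began
    {"event": "close", "timestamp": 110, "collector_start": 100, "duration_ms": -1},  # clock skew
]

def usable(record):
    # Domain knowledge encoded as a filter: events logged before the collector
    # started have bogus timestamps; negative durations indicate clock skew.
    return (record["timestamp"] >= record["collector_start"]
            and record["duration_ms"] >= 0)

clean_traces = [r for r in raw_traces if usable(r)]
print(clean_traces)  # only the "open" event survives
```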
Scalable and customizable solutions. Due to the scale of data in real-world settings, scalable analytic solutions are often required to solve essential problems in practice. In fact, scalability may directly impact the choice of underlying analysis algorithms for problem solving. Customization is another common requirement for incorporating domain knowledge, due to the variations in software and services. The effectiveness of solution customization in analytics tasks can be summarized as (1) filtering noisy and irrelevant data, (2) specifying intrinsic relationships between data points that cannot be derived from the data itself, and (3) providing empirical and heuristic guidance to make the algorithms robust against biased data. Solution customization is typically conducted in an iterative fashion via close collaboration between software analytics researchers and practitioners.
Evaluation criteria tied with real tasks in practice. Because of the natural connection with practice, software analytics projects should be (at least partly) evaluated using the real tasks that they are targeted to help with. Common evaluation criteria of data analysis, such as precision and recall, can be used to measure intermediate results. However, they are often not the only set of evaluation criteria when real tasks are involved. For example, in the StackMine project [4], we use the coverage of detected performance bottlenecks to evaluate our analysis results; such coverage is directly related to the analysis task of Windows analysts. When conducting an evaluation in practice with practitioners involved, researchers need to be aware of and cautious about the evaluation cost and benefits incurred for practitioners.
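The following minimal sketch (ours, not StackMine's actual code) contrasts the two kinds of criteria: generic precision/recall over reported patterns versus a task-level measure such as the fraction of known bottlenecks covered by at least one reported pattern.

```python
def precision_recall(reported: set, relevant: set):
    # Generic, intermediate measures over the reported patterns.
    hits = len(reported & relevant)
    precision = hits / len(reported) if reported else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def bottleneck_coverage(reported: set, bottleneck_to_patterns: dict):
    # Task-level criterion: a known bottleneck counts as covered if the
    # analysis reported at least one pattern that exposes it.
    covered = sum(1 for patterns in bottleneck_to_patterns.values()
                  if patterns & reported)
    return covered / len(bottleneck_to_patterns) if bottleneck_to_patterns else 0.0

# Toy example: precision and recall are both 2/3, and two of the three
# known bottlenecks are covered.
reported = {"p1", "p2", "p9"}
print(precision_recall(reported, relevant={"p1", "p2", "p3"}))
print(bottleneck_coverage(reported, {"io-wait": {"p1"},
                                     "lock-contention": {"p2", "p3"},
                                     "paging": {"p7"}}))
```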
References
[1] Zhang D., Dang Y., Lou J.-G., Han S., Zhang H., Xie T. Software analytics as a learning case in practice: approaches and experiences. In: International workshop on machine learning technologies in software engineering (MALETS 2011); 2011:55–58.
[2] Zhang D., Han S., Dang Y., Lou J.-G., Zhang H., Xie T. Software analytics in practice. IEEE Software. 2013;30(5):30–37 [Special issue on the many faces of software analytics].
[3] Glerum K., Kinshumann K., Greenberg S., Aul G., Orgovan V., Nichols G., Grant D., Loihle G., Hunt G. Debugging in the (very) large: ten years of implementation and experience. In: Proceedings of the ACM SIGOPS 22nd symposium on operating systems principles, SOSP 2009; 2009:103–116.
[4] Han S., Dang Y., Ge S., Zhang D., Xie T. Performance debugging in the large via mining millions of stack traces. In: Proceedings of the 34th international conference on software engineering, ICSE 2012; 2012:145–155.
[5] Lou J.-G., Lin Q., Ding R., Fu Q., Zhang D., Xie T. Software analytics for incident management of online services: an experience report. In: Proceedings of 28th IEEE/ACM international conference on automated software engineering, ASE 2013; 2013:475–485 [Experience papers].
[6] Dang Y., Zhang D., Ge S., Chu C., Qiu Y., Xie T. XIAO: tuning code clones at hands of engineers in practice. In: Proceedings of the 28th annual computer security applications conference, ACSAC 2012; 2012:369–378.
Seven principles of inductive software engineering
What we do is different
T. Menzies North Carolina State University, Raleigh, NC, United States
Abstract
Inductive software engineering is the branch of software engineering focusing on the delivery of data-mining-based software applications. Within such applications, the core problem is induction: the extraction of small patterns from larger data sets. Inductive engineers spend much effort trying to understand business goals in order to inductively generate the models that matter the most.
Keywords
Inductive software engineering; Induction; Data mining; Feature selection; Row selection
Chapter Outline
Different and Important
Principle #1: Humans Before Algorithms
Principle #2: Plan for Scale
Principle #3: Get Early Feedback
Principle #4: Be Open Minded
Principle #5: Be Smart With Your Learning
Principle #6: Live With the Data You Have
Principle #7: Develop a Broad Skill Set That Uses a Big Toolkit
References
Different and Important
Inductive software engineering is the branch of software engineering focusing on the delivery of data-mining-based software applications. Within such applications, the core problem is induction: the extraction of small patterns from larger data sets. Inductive engineers spend much effort trying to understand business goals in order to inductively generate the models that matter the most.
Previously, with Christian Bird, Thomas Zimmermann, Wolfram Schulte, and Ekrem Kocaguneli, we wrote an Inductive Engineering Manifesto [1] that offered some details on this new kind of engineering. The whole manifesto is a little long, so here I offer a quick summary. Following are seven key principles which, if ignored, can make it harder to deploy analytics in the real world. For more details (and more principles), refer to the original document [1].
Principle #1: Humans Before Algorithms
Mining algorithms are only good if humans find them useful in real-world applications. This means that humans need to:
• understand the results
• understand that those results add value to their work.
Accordingly, it is strongly recommended that, once the algorithms generate some model, the inductive engineer talk to humans about those results. In the case of software analytics, these humans are the subject matter experts or business problem owners who are asking you to improve the ways they are generating software.
In our experience, such discussions lead to a second, third, fourth, etc., round of learning. To assess if you are talking "in the right way" to your humans, check the following:
• Do they bring their senior management to the meetings? If yes, great!
• Do they keep interrupting (you or each other) and debating your results? If yes, then stay quiet (and take lots of notes!)
• Do they indicate they understand your explanation of the results? For example, can they correctly extend your results to list desirable and undesirable implications of your results?
• Do your results touch on issues that concern them? This is easy to check… just count how many times they glance up from their notes, looking startled or alarmed.
• Do they offer more data sources for analysis? If yes, they like what you are doing and want you to do it more.
• Do they invite you to their workspace and ask you to teach them how to do XYZ? If yes, this is a real win.
Principle #2: Plan for Scale
Data mining methods are usually repeated multiple times in order to:
• answer new questions, inspired by the current results;
• enhance the data mining method or fix some bugs; and
• deploy the results, or the analysis methods, to different user groups.
So that means that, if it works, you will be asked to do it again (and again and again). To put that another way: thou shalt not click. That is, if all your analysis requires lots of pointing-and-clicking in a pretty GUI environment, then you are definitely not planning for scale.
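In that spirit, here is a minimal sketch of a click-free, rerunnable pipeline (ours; the file and column names are hypothetical): because every step is a function in a script, the whole study can be repeated end-to-end whenever new questions, bug fixes, or new audiences arrive.

```python
# "Thou shalt not click": each analysis step is plain code, so rerunning the
# whole study is one command, not a GUI session.
import pandas as pd

def load(path="defects.csv"):
    return pd.read_csv(path)                  # step 1: raw data

def clean(df):
    return df.dropna(subset=["loc", "bugs"])  # step 2: repeatable cleaning

def analyze(df):
    # step 3: the actual analysis; swap in fancier learners as questions evolve
    return df.groupby("module")["bugs"].sum().sort_values(ascending=False)

if __name__ == "__main__":
    print(analyze(clean(load())).head(10))
```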