Discover millions of ebooks, audiobooks, and so much more with a free trial

Only $11.99/month after trial. Cancel anytime.

Mastering Social Media Mining with Python
Mastering Social Media Mining with Python
Mastering Social Media Mining with Python
Ebook592 pages4 hours

Mastering Social Media Mining with Python

Rating: 5 out of 5 stars

5/5

()

Read preview

About this ebook

About This Book
  • Make sense of highly unstructured social media data with the help of the insightful use cases provided in this guide
  • Use this easy-to-follow, step-by-step guide to apply analytics to complicated and messy social data
  • This is your one-stop solution to fetching, storing, analyzing, and visualizing social media data
Who This Book Is For

This book is for Python developers who want to engage with the use of public APIs to collect data from social media platforms and perform statistical analysis in order to produce useful insights from data. The book assumes just a basic understanding of the Python Standard Library and provides practical examples to guide you toward the creation of your data analysis project based on social data.

LanguageEnglish
Release dateJul 29, 2016
ISBN9781783552023
Mastering Social Media Mining with Python

Related to Mastering Social Media Mining with Python

Related ebooks

Computers For You

View More

Related articles

Reviews for Mastering Social Media Mining with Python

Rating: 5 out of 5 stars
5/5

1 rating0 reviews

What did you think?

Tap to rate

Review must be at least 10 words

    Book preview

    Mastering Social Media Mining with Python - Marco Bonzanini

    Table of Contents

    Mastering Social Media Mining with Python

    Credits

    About the Author

    About the Reviewer

    www.PacktPub.com

    eBooks, discount offers, and more

    Why subscribe?

    Preface

    What this book covers

    What you need for this book

    Who this book is for

    Conventions

    Reader feedback

    Customer support

    Downloading the example code

    Downloading the color images of this book

    Errata

    Piracy

    Questions

    1. Social Media, Social Data, and Python

    Getting started

    Social media - challenges and opportunities

    Opportunities

    Challenges

    Social media mining techniques

    Python tools for data science

    Python development environment setup

    pip and virtualenv

    Conda, Anaconda, and Miniconda

    Efficient data analysis

    Machine learning

    Natural language processing

    Social network analysis

    Data visualization

    Processing data in Python

    Building complex data pipelines

    Summary

    2. #MiningTwitter – Hashtags, Topics, and Time Series

    Getting started

    The Twitter API

    Rate limits

    Search versus Stream

    Collecting data from Twitter

    Getting tweets from the timeline

    The structure of a tweet

    Using the Streaming API

    Analyzing tweets - entity analysis

    Analyzing tweets - text analysis

    Analyzing tweets - time series analysis

    Summary

    3. Users, Followers, and Communities on Twitter

    Users, friends, and followers

    Back to the Twitter API

    The structure of a user profile

    Downloading your friends' and followers' profiles

    Analysing your network

    Measuring influence and engagement

    Mining your followers

    Mining the conversation

    Plotting tweets on a map

    From tweets to GeoJSON

    Easy maps with Folium

    Summary

    4. Posts, Pages, and User Interactions on Facebook

    The Facebook Graph API

    Registering your app

    Authentication and security

    Accessing the Facebook Graph API with Python

    Mining your posts

    The structure of a post

    Time frequency analysis

    Mining Facebook Pages

    Getting posts from a Page

    Facebook Reactions and the Graph API 2.6

    Measuring engagement

    Visualizing posts as a word cloud

    Summary

    5. Topic Analysis on Google+

    Getting started with the Google+ API

    Searching on Google+

    Embedding the search results in a web GUI

    Decorators in Python

    Flask routes and templates

    Notes and activities from a Google+ page

    Text analysis and TF-IDF on notes

    Capturing phrases with n-grams

    Summary

    6. Questions and Answers on Stack Exchange

    Questions and answers

    Getting started with the Stack Exchange API

    Searching for tagged questions

    Searching for a user

    Working with Stack Exchange data dumps

    Text classification for question tags

    Supervised learning and text classification

    Classification algorithms

    Naive Bayes

    k-Nearest Neighbor

    Support Vector Machines

    Evaluation

    Performing text classification on Stack Exchange data

    Embedding the classifier in a real-time application

    Summary

    7. Blogs, RSS, Wikipedia, and Natural Language Processing

    Blogs and NLP

    Getting data from blogs and websites

    Using the WordPress.com API

    Using the Blogger API

    Parsing RSS and Atom feeds

    Getting data from Wikipedia

    A few words about web scraping

    NLP Basics

    Text preprocessing

    Sentence boundary detection

    Word tokenization

    Part-of-speech tagging

    Word normalization

    Case normalization

    Stemming

    Lemmatization

    Stop word removal

    Synonym mapping

    Information extraction

    Summary

    8. Mining All the Data!

    Many social APIs

    Mining videos on YouTube

    Mining open source software on GitHub

    Mining local businesses on Yelp

    Building a custom Python client

    HTTP made simple

    Summary

    9. Linked Data and the Semantic Web

    A Web of Data

    Semantic Web vocabulary

    Microformats

    Linked Data and Open Data

    Resource Description Framework

    JSON-LD

    Schema.org

    Mining relations from DBpedia

    Mining geo coordinates

    Extracting geodata from Wikipedia

    Plotting geodata on Google Maps

    Summary

    Mastering Social Media Mining with Python


    Mastering Social Media Mining with Python

    Copyright © 2016 Packt Publishing

    All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

    Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

    Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

    First published: July 2016

    Production reference: 1260716

    Published by Packt Publishing Ltd.

    Livery Place

    35 Livery Street

    Birmingham 

    B3 2PB, UK.

    ISBN 978-1-78355-201-6

    www.packtpub.com

    Credits

    About the Author

    Marco Bonzanini is a data scientist based in London, United Kingdom. He holds a PhD in information retrieval from Queen Mary University of London. He specializes in text analytics and search applications, and over the years, he has enjoyed working on a variety of information management and data science problems.

    He maintains a personal blog at http://marcobonzanini.com, where he discusses different technical topics, mainly around Python, text analytics, and data science.

    When not working on Python projects, he likes to engage with the community at PyData conferences and meet-ups, and he also enjoys brewing homemade beer.

    This book is the outcome of a long journey that goes beyond the mere content preparation. Many people have contributed in different ways to shape the final result. Firstly, I would like to thank the team at Packt Publishing, particularly Sonali Vernekar and Siddhesh Salvi, for giving me the opportunity to work on this book and for being so helpful throughout the whole process. I would also like to thank Dr. Weiai Wayne Xu for reviewing the content of this book and suggesting many improvements. Many colleagues and friends, through casual conversations, deep discussions, and previous projects, strengthened the quality of the material presented in this book. Special mentions go to Dr. Miguel Martinez-Alvarez, Marco Campana, and Stefano Campana. I'm also happy to be part of the PyData London community, a group of smart people who regularly meet to talk about Python and data science, offering a stimulating environment. Last but not least, a distinct special mention goes to Daniela, who has encouraged me during the whole journey, sharing her thoughts, suggesting improvements, and providing a relaxing environment to go back to after work.

    About the Reviewer

    Weiai Wayne Xu is an assistant professor in the department of communication at University of Massachusetts – Amherst and is affiliated with the University’s Computational Social Science Institute. Previously, Xu worked as a network science scholar at the Network Science Institute of Northeastern University in Boston. His research on online communities, word-of-mouth, and social capital have appeared in various peer-reviewed journals. Xu also assisted four national grant projects in the area of strategic communication and public opinion. Aside from his professional appointment, he is a co-founder of a data lab called CuriosityBits Collective (http://www.curiositybits.org/).

    www.PacktPub.com

    eBooks, discount offers, and more

    Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at customercare@packtpub.com for more details.

    At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.

    https://www2.packtpub.com/books/subscription/packtlib

    Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.

    Why subscribe?

    Fully searchable across every book published by Packt

    Copy and paste, print, and bookmark content

    On demand and accessible via a web browser

    Preface

    In the past few years, the popularity of social media has grown dramatically, with more and more users sharing all kinds of information through different platforms. Companies use social media platforms to promote their brands, professionals maintain a public profile online and use social media for networking, and regular users discuss about any topic. More users also means more data waiting to be mined.

    You, the reader of this book, are likely to be a developer, engineer, analyst, researcher, or student who wants to apply data mining techniques to social media data. As a data mining practitioner (or practitioner-to-be), there is no lack of opportunities and challenges from this point of view.

    Mastering Social Media Mining with Python will give you the basic tools you need to take advantage of this wealth of data. This book will start a journey through the main tools for data analysis in Python, providing the information you need to get started with applications such as NLP, machine learning, social network analysis, and data visualization. A step-by-step guide through the most popular social media platforms, including Twitter, Facebook, Google+, Stack Overflow, Blogger, YouTube and more, will allow you to understand how to access data from these networks, and how to perform different types of analysis in order to extract useful insight from the raw data.

    There are three main aspects being touched in the book, as listed in the following list:

    Social media APIs: Each platform provides access to their data in different ways. Understanding how to interact with them can answer the questions: how do we get the data? and also what kind of data can we get? This is important because, without access to the data, there would be no data analysis to carry out. Each chapter focuses on different social media platforms and provides details on how to interact with the relevant API.

    Data mining techniques: Just getting the data out of an API doesn't provide much value to us. The next step is answering the question: what can we do with the data? Each chapter provides the concepts you need to appreciate the kind of analysis that you can carry out with the data, and why it provides value. In terms of theory, the choice is to simply scratch the surface of what is needed, without digging too much into details that belong to academic textbooks. The purpose is to provide practical examples that can get you easily started.

    Python tools for data science: Once we understand what we can do with the data, the last question is: how do we do it? Python has established itself as one of the main languages for data science. Its easy-to-understand syntax and semantics, together with its rich ecosystem for scientific computing, provide a gentle learning curve for beginners and all the sharp tools required by experts at the same time. The book introduces the main Python libraries used in the world of scientific computing, such as NumPy, pandas, NetworkX, scikit-learn, NLTK, and many more. Practical examples will take the form of short scripts that you can use (and possibly extend) to perform different and interesting types of analysis over the social media data that you have accessed.

    If exploring the area where these three main topics meet is something of interest, this book is for you. 

    What this book covers

    Chapter 1, Social Media, Social Data, and Python, introduces the main concepts of data mining applied to social media using Python. By walking the reader through a brief overview on machine learning, NLP, social network analysis, and data visualization, this chapter discusses the main Python tools for data science and provides some help to set up the Python environment.

    Chapter 2, #MiningTwitter – Hashtags, Topics, and Time Series, opens the practical discussion on data mining using the Twitter data. After setting up a Twitter app to interact with the Twitter API, the chapter explains how to get data through the streaming API and how to perform some frequentist analysis on hashtags and text. The chapter also discusses some time series analysis to understand the distribution of tweets over time.

    Chapter 3, Users, Followers, and Communities on Twitter, continues the discussion on Twitter mining, focusing the attention on users and interactions between users. This chapter shows how to mine the connections and conversations between the users. Interesting applications explained in the chapter include user clustering (segmentation) and how to measure influence and user engagement.

    Chapter 4, Posts, Pages, and User Interactions on Facebook, focuses on Facebook and the Facebook Graph API. After understanding how to interact with the Graph API, including aspects of security and privacy, examples of how to mine posts from a user's profile and Facebook pages are provided. The concepts of time series analysis and user engagement are applied to user interactions such as comments, Likes, and Reactions.

    Chapter 5, Topic Analysis on Google+, covers the social network by Google. After understanding how to access the Google centralized platform, examples of how to search content and users on Google+ are discussed. This chapter also shows how to embed data coming from the Google API into a custom web application that is built using the Python microframework, Flask.

    Chapter 6, Questions and Answers on Stack Exchange, explains the topic of question answering and uses the Stack Exchange network as paramount example. The reader has the opportunity to learn how to search for users and content on the different sites of this network, most notably Stack Overflow. By using their data dumps for online processing,  this chapter introduces supervised machine learning methods applied to text classification and shows how to embed machine learning model into a real-time application.

    Chapter 7, Blogs, RSS, Wikipedia, and Natural Language Processing, teaches text analytics. The Web is full of opportunities in terms of text mining, and this chapter shows how to interact with several data sources such as the WordPress.com API, Blogger API, RSS feeds, and Wikipedia API. Using textual data, the basic notions of NLP briefly mentioned throughout the book are formalized and expanded. The reader is then walked through the process of information extraction with custom examples on how to extract references of entities from free text.

    Chapter 8, Mining All the Data!, reminds us of the many opportunities, in terms of data mining, that are available out there beyond the most common social networks. Examples of how to mine data from YouTube, GitHub, and Yelp are provided, along with a discussion on how to build your own API client, in case a particular platform doesn't provide one.

    Chapter 9, Linked Data and the Semantic Web, provides an overview on the Semantic Web and related technologies. This chapter discusses the topics of Linked Data, microformats, and RDF, and offers examples on how to mine semantic information from DBpedia and Wikipedia.

    What you need for this book

    The code examples provided in this book assume that you are running a recent version of Python on either Linux, macOS, or Windows. The code has been tested on Python 3.4.* and Python 3.5.*. Older versions (Python 3.3.* or Python 2.*) are not explicitly supported.

    Chapter 1, Social Media, Social Data, and Python, provides some instructions to set up a local development environment and introduces a brief list of tools that are going to be used throughout the book. We're going to take advantage of some of the essential Python libraries for scientific computing (for example, NumPy, pandas, and matplotlib), machine learning (for example, scikit-learn), NLP (for example, NLTK), and social network analysis (for example, NetworkX).

    Who this book is for

    This book is for intermediate Python developers who want to engage with the use of public APIs to collect data from social media platforms and perform statistical analysis in order to produce useful insights from the data. The book assumes a basic understanding of Python standard library and provides practical examples to guide you towards the creation of your data analysis project based on social data.

    Conventions

    In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

    Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: Moreover, the genre attribute is here presented as a list, with a variable number of values.

    A block of code is set as follows:

    from timeit import timeit

    import numpy as np

    if __name__ == '__main__':

      setup_sum = 'data = list(range(10000))'

      setup_np = 'import numpy as np;'

      setup_np += 'data_np = np.array(list(range(10000)))'

    When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

    Type your question, or type exit to quit. > What's up with Gandalf and Frodo lately? They haven't been in the Shire for a while...

    Question: What's up with Gandalf and Frodo lately? They haven't been in the Shire for a while...

    Predicted labels: plot-explanation, the-lord-of-the-rings

    Any command-line input or output is written as follows:

    $ pip install --upgrade [package name]

    New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: On the Keys and Access Tokens configuration page, the developer can find the API key and secret, as well as the access token and access token secret.

    Note

    Warnings or important notes appear in a box like this.

    Tip

    Tips and tricks appear like this.

    Reader feedback

    Feedback from our readers is always welcome. Let us know what you think about this book-what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail feedback@packtpub.com, and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

    Customer support

    Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

    Downloading the example code

    You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

    You can download the code files by following these steps:

    Log in or register to our website using your e-mail address and password.

    Hover the mouse pointer on the SUPPORT tab at the top.

    Click on Code Downloads & Errata.

    Enter the name of the book in the Search box.

    Select the book for which you're looking to download the code files.

    Choose from the drop-down menu where you purchased this book from.

    Click on Code Download.

    Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

    WinRAR / 7-Zip for Windows

    Zipeg / iZip / UnRarX for Mac

    7-Zip / PeaZip for Linux

    The code bundle for the book is also hosted on GitHub at https://github.com/bonzanini/Book-SocialMediaMiningPython. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

    Downloading the color images of this book

    We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/MasteringSocialMediaMiningWithPython_ColorImages.pdf.

    Errata

    Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books-maybe a mistake in the text or the code-we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

    To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

    Piracy

    Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

    Please contact us at copyright@packtpub.com with a link to the suspected pirated material.

    We appreciate your help in protecting our authors and our ability to bring you valuable content.

    Questions

    If you have a problem with any aspect of this book, you can contact us at questions@packtpub.com, and we will do our best to address the problem.

    Chapter 1.  Social Media, Social Data, and Python

    This book is about applying data mining techniques to social media using Python. The three highlighted keywords in the previous sentence help us define the intended audience of this book: any developer, engineer, analyst, researcher, or student who is interested in exploring the area where the three topics meet.

    In this chapter, we will cover the following topics:

    Social media and social data

    The overall process of data mining from social media

    Setting up the Python development environment

    Python tools for data science

    Processing data in Python

    Getting started

    In the second quarter of 2015, Facebook reported nearly 1.5 billion monthly active users. In 2013, Twitter had reported a volume of 500+ million tweets per day. On a smaller scale, but certainly of interest for the readers of this book, in 2015, Stack Overflow announced that more than 10 million programming questions had been asked on their platform since the website has opened.

    These numbers are just the tip of the iceberg when describing how the popularity of social media has grown exponentially with more users sharing more and more information through different platforms. This wealth of data provides unique opportunities for data mining practitioners. The purpose of this book is to guide the reader through the use of social media APIs to collect data that can be analyzed with Python tools in order to produce interesting insights on how users interact on social media.

    This chapter lays the ground for an initial discussion on challenges and opportunities in social media mining and introduces some Python tools that will be used in the following chapters.

    Social media - challenges and opportunities

    In traditional media, users are typically just consumers. Information flows in one direction: from the publisher to the users. Social media breaks this model, allowing every user to be a consumer and publisher at the same time. Many academic publications have been written on this topic with the purpose of defining what the term social media really means (for example, Users of the world, unite! The challenges and opportunities of Social Media, Andreas M. Kaplan and Michael Haenlein, 2010). The aspects that are most commonly shared across different social media platforms are as follows:

    Internet-based applications

    User-generated content

    Networking

    Social media are Internet-based applications. It is clear that the advances in Internet and mobile technologies have promoted the expansion of social media. Through your mobile, you can, in fact, immediately connect to a social media platform, publish your content, or catch up with the latest news.

    Social media platforms are driven by user-generated content. As opposed to the traditional media model, every user is a potential publisher. More importantly, any user can interact with every other user by sharing content, commenting, or expressing positive appraisal via the like button (sometimes referred to as upvote, or thumbs up).

    Social media is about networking. As described, social media is about the users interacting with other users. Being connected is the central concept for most social media platform, and the content you consume via your news feed or timeline is driven by your connections.

    With these main features being central across several platforms, social media is used for a variety of purposes:

    Staying in touch with friends and family (for example, via Facebook)

    Microblogging and catching up with the latest news (for example, via Twitter)

    Staying in touch with your professional network (for example, via LinkedIn)

    Sharing multimedia content (for example, via Instagram, YouTube, Vimeo, and Flickr)

    Finding answers to your questions (for example, via Stack Overflow, Stack Exchange, and Quora)

    Finding and organizing items of interest (for example, via Pinterest)

    This book aims to answer one central question: how to extract useful knowledge from the data coming from the social media? Taking one step back, we need to define what is knowledge and what is useful.

    Traditional definitions of knowledge come from information science. The concept of knowledge is usually pictured as part of a pyramid, sometimes referred to as knowledge hierarchy, which has data as its foundation, information as the middle layer, and knowledge at the top. This knowledge hierarchy is represented in the following diagram:

    Figure 1.1: From raw data to semantic knowledge

    Climbing the pyramid means refining knowledge from raw data. The journey from raw data to distilled knowledge goes through the integration of context and meaning. As we climb up the pyramid, the technology we build gains a deeper understanding of the original data, and more importantly, of the users who generate such data. In other words, it becomes more useful.

    In this context, useful knowledge means actionable knowledge, that is, knowledge that enables a decision maker to implement a business strategy. As a reader of this book, you'll understand the key principles to extract value from social data. Understanding how users interact through social media platforms is one of the key aspects in this journey.

    The following sections lay down some of the challenges and opportunities of mining data from social media platforms.

    Opportunities

    The key opportunity of developing data mining systems is to extract useful insights from data. The aim of the process is to answer interesting (and sometimes difficult) questions using data mining techniques to enrich our knowledge about a particular domain. For example, an online retail store can apply data mining to understand how their customers shop. Through this analysis, they are able to recommend products to their customers, depending on their shopping habits (for example, users who buy item A, also buy item B). This, in

    Enjoying the preview?
    Page 1 of 1