
Website Scraping with Python

How to use BeautifulSoup and Scrapy in practice


Gabor Laszlo Hajba
This book is for sale at http://leanpub.com/websitescrapingwithpython
This version was published on 2015-09-15

This is a Leanpub book. Leanpub empowers authors and publishers with the Lean Publishing
process. Lean Publishing is the act of publishing an in-progress ebook using lightweight tools
and many iterations to get reader feedback, pivot until you have the right book and build
traction once you do.
© 2015 Gabor Laszlo Hajba

Tweet This Book!


Please help Gabor Laszlo Hajba by spreading the word about this book on Twitter!
The suggested hashtag for this book is #WebsiteScrapingWithPython.
Find out what other people are saying about the book by clicking on this link to search for this
hashtag on Twitter:
https://twitter.com/search?q=#WebsiteScrapingWithPython

Also By Gabor Laszlo Hajba


CDI
XML processing and website scraping in Java
Python 3 in Anger

Contents
Preface
    What will I do exactly?
    About the programming language
    Some extra feature
    Prerequisites
    Length of the book
    LeanPub
1. Not really a chapter
2. BeautifulSoup4 - The ancestor of JSoup
3. Scrapy - another way to gather data
4. Performance of the solutions
5. Creating plots with Python
    5.1 Simple examples
    5.2 Display multiple data ranges
    5.3 Displaying the averages
    5.4 Displaying the legend
    5.5 Formatting the plot
    5.6 Conclusion
6. Some thoughts on functional programming
7. Parallel working
8. Extra! Extra! Read all about it!

Preface
As you might know, I have written some articles and a book about website scraping with Java.
There I introduced some concepts and a sample application implemented with two different tools,
and compared the runtimes as I implemented features.
Now it is time to build this sample application in Python.
For this book I will take a look at the following tools:
* BeautifulSoup4
* Scrapy
These are the best-known website scraping tools in the Python world; however, if I find a newer
one which aids my work better, I'll introduce it too.

Note
This book is not a 1:1 copy of the Java version. That's because I did not face the same
problems and tasks here that I encountered with the Google App Engine and the like. This
book only takes the concept of the sample application and the performance measuring from
its predecessor, nothing more. If you would like to read (more) about topics not covered
in this book but covered in the previous one, please feel free to send me a message or
e-mail with your request and I'll see what I can do about it.

What will I do exactly?


Starting a book is not as simple as you might think: I have a vague concept in my mind (the
title reflects this concept), but that is not everything. I do not have a full plan for how to
write each chapter and how to organize the book properly.
However, I know that I will give a simple introduction to the tools I mentioned above, each worth
its own chapter. In parallel with this introduction I will write the sample application from
scratch using those tools.
After the introduction section I'll compare runtimes again, on the machines you might know
from the other book.

Lean note
As this book is published with LeanPub, I plan to do the publishing in the lean way:
as early as possible and as often as required.
This means that as soon as I am ready with the BeautifulSoup part I will publish the
first release. After this is done I can focus on your requests and remarks about the book
and write the Scrapy part. At the end I can write the sample application for the
last chapter, which will again be released in two parts: BeautifulSoup and Scrapy.
http://leanpub.com/javaxml


And at the end I plan to compare the results of the Python-based applications with the Java-based
ones. Again, you have to keep in mind that in this case I am comparing apples with peaches (not
pears this time); however, it is not bad to see which tool gives you a better result if you can
choose between Java and Python.
So, let's dive right into the programming, or in other words: slap into the soup.

About the programming language


Do not worry, I will not write a chapter about Python. I will just give a glimpse of which
version I will use. As you know, there are two mainstream versions of Python available: 2.7 and 3.
I thought about which version to use. Long-time developers tend towards 2.7 (I like it too because
I have fewer parentheses to write when using print); however, 3 is the newer version, and the
people I come across on the internet needing help with Python use the current version of Python 3.
So I decided to write the sample application in Python 3.4 (the currently installed version is
3.4.2). However, I am optimistic and think that the application can run with 2.7 too; I will give
it a try at the end when I am done.

Some extra feature


As always, you cannot publish a book without some extra features. I am planning to write a
complete application which I can provide for you to use and rewrite. It will be a bit different
than the one I'll create along the way in this book (to have some reference for comparison with
the Java version of the application) but it will cover the same functionality.
I know I am overloading myself a bit with two different applications, but I think this creates
the best value for your money.

Prerequisites
Nothing. Really, there are no requirements if you want to read this book.
However, it helps to have some interest in website scraping, Python and software
development. I try to write the book for everyone but sometimes I get off-track. So if you cannot
follow along with my thoughts just cry out loud (preferably with a message) and I will come to
your aid and fix the mess I've created.

Length of the book


I plan this book to be a short one. Short but rich in information. That's because you as a reader
do not want to waste time on whatever I happen to think about; you want to learn new things (how
to scrape websites, for example) and I focus on this. As an extension you could learn
some new programming paradigms if you wish, but you do not have to.
I hope you will see that the price you pay for this book corresponds to its length.


LeanPub
I publish my books on LeanPub because
* I can distribute new parts faster
* there are only digital versions: fewer trees have to die
* it is cheaper because you need less infrastructure (and I too buy the cheaper books myself)
* every book purchased gets the updates and corrections (no need to look up some errata)

So if someone does not want to pay a few dollars because there won't be any updates and he/she
would have to use some online errata, I can just say: do not worry, here you'll get every update
of the book and they are already included in the price.
It can happen that I expand the book with some chapters of my own or write another sample
application, and then the new version of the book will be available here too. This is because
technology changes so fast that after some time a book stuffed with old knowledge won't
be useful.

1. Not really a chapter


This is the sample of my book, which you can download for free. Here I include the table of
contents of the whole book with some sentences from the beginning of each chapter as a teaser,
so you can see that it is worth purchasing this book.
So, let's get started with the contents.

2. BeautifulSoup4 - The ancestor of JSoup
If you've read my book XML processing and website scraping in Java, you know what JSoup is:
a handy tool to scrape websites using a CSS-query-like syntax. The main idea behind this project
came from the Python version: BeautifulSoup.
Now let's look at the tool and see how I can migrate the application from Java to Python
without copying and reusing code, only taking the concepts and ideas. As you will see, I could
have copied some parts of the code, because the method names and parameters are reusable between
JSoup and BeautifulSoup.

Note
Not copying the code might result in a bit of overhead because I have to write every
website-select statement again. But I like this approach because this way I learn
something new (or refresh old ideas) and maybe I can refactor the code I've created
in Java.
And besides learning, I am not tempted to copy some code that might not work in Python
because lists are handled differently than they are in Java. For example, they do
not have the methods first() and last().
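To illustrate the point about lists, here is a minimal sketch of how the first and last element of a plain Python list can be read without first() and last(). The list contents are hypothetical stand-ins for scraped values, not data from the sample application.

```python
# Hypothetical stand-in for a list of scraped values.
headlines = ["first headline", "second headline", "third headline"]

first = headlines[0]   # what a first() method would return
last = headlines[-1]   # negative indices count from the end of the list

# Indexing an empty list raises IndexError, so guard the access
# whenever a selection might return nothing.
empty = []
first_or_none = empty[0] if empty else None

print(first, "|", last, "|", first_or_none)
```

The guard in the last step matters for scraping in particular, because a selector that matches nothing typically yields an empty list.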

http://leanpub.com/javaxml

3. Scrapy - another way to gather data
After BeautifulSoup, let's take a look at another engine which lets you scrape websites in the
Python domain.
The sample application I'm preparing with Scrapy is the same as the one I did with BeautifulSoup,
because this is how I can measure performance.
So in this chapter I'll try to maintain the same structure I used when talking about BeautifulSoup,
but this time I'll share my experience with Scrapy. The examples will select the same blocks of
the site to see how the two tools compare.

Note
As you might know, Scrapy is not just a library to use like BeautifulSoup. It is mostly
used as a standalone tool where you configure your own scraper application with
spiders and so on. So this chapter will evaluate and introduce a tool in its own right,
not just a library.
However, I will show how to create a simple script using Scrapy as a library, and
I will evaluate the performance based on this solution too.

4. Performance of the solutions


So now I am done with introducing the two tools for website scraping. Now it is time to
compare the performance of both.
This time the performance comparison is fairer because both tools are meant for the same
purpose, scraping data from websites, unlike in my other book, where XMLBeam was an XML
parsing library.
I'm really curious which tool performs better.

5. Creating plots with Python


One of the most interesting parts of this book was the previous chapter: monitoring the
performance of the solutions. However, visualizing the results is fun too: raw data is hard to
digest and nobody cares about numbers in tables, while plots visualize the results and reach
more people.
So in this chapter I'll show how I created the plots for this book with matplotlib.pyplot.

5.1 Simple examples


The import statement I use to import matplotlib is the following:
The import statement for the examples

import matplotlib.pyplot as plt

To get started, let's see a simple example:


Simple plot example

plt.plot([1, 2, 3, 4, 5])
plt.ylabel('The value range')
plt.xlabel('The number range')
plt.show()

Figure: Simple plot example

As you can see, the example draws a simple plot and the names of the functions are really
straightforward, but let's go a bit deeper.
Extended plot example

plt.plot([1, 2, 3, 4, 5], [1, 4, 9, 16, 25], 'ro')
plt.ylabel('The value range')
plt.xlabel('The number range')
plt.axis([0, 6, 0, 30])
plt.show()

Figure: Extended plot example

The first line above now has three parameters: the first list contains the numbers on the X axis,
the second list contains the corresponding values on the Y axis.
The third parameter, the string, describes how to display the plot's values. The default display
method is 'b-', which represents the plot as a blue line. And as you might guess, 'ro' displays
the values as red circles. So yes, the first letter is the color of the values; the remaining
characters describe how the plot should be displayed: line, dashed line, squares, circles,
triangles and so on.
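A few more format strings can make the color-plus-style idea concrete. The following is a minimal sketch: the data is made up, and selecting the off-screen Agg backend is my assumption so that the snippet also runs without a display. It relies on plot() returning the created line objects, whose color, marker and line style can then be inspected.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen, no window needed
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
squares = [value ** 2 for value in x]

# plot() returns a list with one Line2D object per plotted dataset.
(red_circles,) = plt.plot(x, squares, 'ro')    # red circles, no line
(green_dashed,) = plt.plot(x, squares, 'g--')  # green dashed line, no markers
(black_squares,) = plt.plot(x, x, 'ks')        # 'k' stands for black, 's' for squares

print(red_circles.get_color(), red_circles.get_marker())
print(green_dashed.get_color(), green_dashed.get_linestyle())
print(black_squares.get_color(), black_squares.get_marker())
```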

5.2 Display multiple data ranges


To get all the scenarios on one plot I had to display multiple value ranges.
The approach is really simple because the plot function takes an arbitrary number of arguments.
However, this would only be a perfect approach if the number of datasets were always the same.
I will show a simple example:

Displaying multiple datasets

import numpy as np

# creates a dataset of the integers 0 to 9 (the stop value 10 is excluded)
data = np.arange(0, 10, 1)
# blue asterisks, green squares and yellow triangles
plt.plot(data, data, 'b*', data, data*2, 'gs', data, data**2, 'y^')
plt.show()

The result is as described above: blue asterisks, green squares and yellow triangles:

Figure: Displaying multiple datasets

Unfortunately, I plot a varying number of datasets to export them as images, so I needed something
different. However, the solution is not too complex.
Because repeated calls to plot draw onto the same figure until show() is called, I can call the
plot function in a loop and add new data to visualize.
Without any format parameters (like color or symbol) matplotlib assigns the next color from its
default color cycle to each new dataset and draws it as a line. Naturally, you can manually assign
a color and symbol to the dataset.
Let's look at the same example as above; now I put the three datasets into a list and plot the
contents of the list:

Displaying multiple datasets in a loop

dataset = []
dataset.append([i for i in range(10)])
dataset.append([i*2 for i in range(10)])
dataset.append([i**2 for i in range(10)])
for data in dataset:
    x = [i for i in range(len(data))]
    plt.plot(x, data)
plt.show()

Figure: Displaying multiple datasets in a loop

As you can see, the result is not quite the same: the colors and the symbols are different. The
default symbol for plots is a line, and lines are not a good representation for discrete data
such as my runtime results. To make the data look the same as in the previous example, I create
a list containing the formatting options and assign these options to the displayed dataset:


Displaying multiple datasets in a loop

dataset = []
dataset.append([i for i in range(10)])
dataset.append([i*2 for i in range(10)])
dataset.append([i**2 for i in range(10)])
# one format string per dataset, in the same order
symbols = ['b*', 'gs', 'y^']
counter = 0
for data in dataset:
    x = [i for i in range(len(data))]
    plt.plot(x, data, symbols[counter])
    counter += 1
plt.show()

Figure: Displaying multiple datasets in a loop, formatted

Naturally, this approach is good for a fixed number of datasets. But what about more data? The
solution I used is almost the same as above, but I defined one list for colors and another list
for symbols. And I choose the next color and symbol from the lists with a simple modulo calculation:


Displaying multiple datasets in a loop for variable amount of data

dataset = []
dataset.append([i for i in range(10)])
dataset.append([i*2 for i in range(10)])
dataset.append([i**2 for i in range(10)])
symbols = ['*', '^', '.', 'o', 's']
colors = ['r','g','b','c','k','m']
counter = 0
for data in dataset:
    x = [i for i in range(len(data))]
    # the modulo wraps the index around when a list runs out of entries
    symbol = counter % len(symbols)
    color = counter % len(colors)
    plt.plot(x, data, symbols[symbol]+colors[color])
    counter += 1
plt.show()

If there are more datasets than entries in the lists, the symbols and colors will restart from the beginning.
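The wrap-around can be checked without drawing anything. The sketch below uses the same two lists; the helper name format_for is hypothetical and only exists for this illustration.

```python
symbols = ['*', '^', '.', 'o', 's']
colors = ['r', 'g', 'b', 'c', 'k', 'm']


def format_for(index):
    """Return the format string for the dataset at the given index."""
    symbol = symbols[index % len(symbols)]
    color = colors[index % len(colors)]
    return symbol + color


# The sixth dataset (index 5) wraps the symbols, the seventh wraps the colors.
formats = [format_for(index) for index in range(8)]
print(formats)

# 5 symbols and 6 colors share no common divisor, so the same
# symbol/color pair only comes back after 5 * 6 = 30 datasets.
distinct = {format_for(index) for index in range(30)}
print(len(distinct))
```

Using lists whose lengths have no common divisor is what keeps the combinations distinct for so long; with two lists of equal length the pairs would repeat as soon as one list is exhausted.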

5.3 Displaying the averages


I decided to display the average runtime of a scenario as a horizontal line. To make clear that the
line belongs to the dataset I used the same color.
Creating the average is simple too: numpy has a function which calculates the average of
a list, and to display this average as a horizontal line I only need to repeat it for the length
of the dataset. This means that if the dataset has 100 entries on the x axis, the dataset for the
average line has to contain the average runtime 100 times:
Calculating and displaying the average of a dataset

averages = [np.average(data)] * len(x)
plt.plot(x, averages, colors[color] + '-')

And that was it. Now the plot displays the individual values of each run and the average as a
line, both in the same color.
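The repeated-average trick can be tried on its own. This is a minimal sketch with made-up runtimes; it swaps numpy's average for mean from the standard library's statistics module so that it runs without extra packages.

```python
from statistics import mean

# Hypothetical runtimes in seconds, one entry per run.
runtimes = [2.0, 4.0, 3.0, 5.0, 1.0]
x = list(range(len(runtimes)))

# One copy of the average per x position gives a flat line when plotted.
average = mean(runtimes)
averages = [average] * len(x)
print(averages)
```

As an alternative, matplotlib's axhline function draws a horizontal line at a given y value across the whole data field, which would avoid building the repeated list.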


5.4 Displaying the legend


The data itself does not say much. Without any legend it is bare and contains
no information, just colorful symbols.
A legend helps here: it lets you assign meaning to the displayed values.
Adding the legend is quite simple.
The first step is to assign a label to the displayed values. This is done while plotting the data.
Besides the color and symbol arguments you can assign a label to the plot too:
Assigning the legend to the datasets

for data in dataset:
    x = [i for i in range(len(data))]
    symbol = counter % len(symbols)
    color = counter % len(colors)
    # the label is the text shown for this dataset in the legend
    plt.plot(x, data, symbols[symbol]+colors[color], label="Dataset #{0}".format(counter+1))
    counter += 1

Naturally, assigning the label alone does not display anything. We have to tell matplotlib to
show the legend with the labels of the values. For this we have to call a function:
Calling the legend

plt.legend()
plt.show()

Figure: Displaying the legend

This results in a basic legend displayed inside the plot, which can be a bit of a pain in the neck
because the legend can hide some data, as it does in this case too. In the next section I will
show how you can move the legend out of the data field.
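Both steps of this section, labeling while plotting and then calling legend(), fit into one short script. The following is a minimal sketch with made-up data; the off-screen Agg backend is my assumption so that it runs without a display. It relies on legend() returning the legend object, from which the label texts can be read back.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen, no window needed
import matplotlib.pyplot as plt

# Hypothetical values for two scenarios.
dataset = [[1, 2, 3], [2, 4, 6]]

plt.figure()  # start with a fresh, empty figure
for counter, data in enumerate(dataset):
    x = list(range(len(data)))
    plt.plot(x, data, 'o', label="Dataset #{0}".format(counter + 1))

# legend() collects the labels assigned above and returns the Legend object.
legend = plt.legend()
labels = [text.get_text() for text in legend.get_texts()]
print(labels)
```

Using enumerate here replaces the manual counter variable; the labels come out the same either way.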

5.5 Formatting the plot


Formatting a plot is something that can fill an entire book. Because of this I will not go into
details, just share some thoughts about the needs and problems I encountered while creating the
plots for this book.

Setting the plot title


This is quite a simple thing but you need it for some of your plots.
Setting the plot title

plt.title("The title of the plot")

As you can see, the solution is straightforward. You can pass in any string; naturally, a longer
one will not always fit the screen. If you have a long title, consider shortening it or adding
it outside of the plot.


Adding title for the axes


Labeling the axes is essential. Displaying only numbers without any context is not too helpful
for the readers of your plot, so label the axes to show what data is represented.
Labeling the axes
plt.ylabel('Runtime in seconds')
plt.xlabel('Number of the scenario')

This solution is straightforward too: you label the x and y axes with their titles.

Moving the legend outside of the plot


As I promised in the previous section here is how you can move the legend out of the data field.
This is quite simple: you only have to call the legend() function with some extra parameters:
Moving the legend out of the data field
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0., title='Legend')

However if you run the plotting you get an unexpected result:

Figure: Legend out of the data field

To solve this issue you have to resize the subplots a bit to fit both into the display area:
Displaying both data field and legend

plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0., title='Legend')
plt.subplots_adjust(bottom=0.1, left=.05, right=.70, top=.95, hspace=.35)
plt.show()

Figure: Displaying the legend in the screen
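The legend parameters are easier to follow with comments next to them. This is a minimal sketch with made-up data; the off-screen Agg backend is my assumption, and only the right margin is adjusted to keep the example short.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen, no window needed
import matplotlib.pyplot as plt

fig = plt.figure()
plt.plot([0, 1, 2], [0, 1, 4], 'ro', label="Dataset #1")

# bbox_to_anchor is given relative to the data field: (1.05, 1) lies just to
# the right of its top-right corner. loc=2 is the numeric code for
# 'upper left', so the legend's upper-left corner is pinned to that point.
legend = plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left',
                    borderaxespad=0., title='Legend')

# Let the data field end at 70% of the figure width so that the
# remaining 30% is free for the legend.
plt.subplots_adjust(right=0.70)

print(fig.subplotpars.right, legend.get_title().get_text())
```

How much space the legend needs depends on the label lengths and the figure size, so the 0.70 is a value to tune rather than a rule.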

Resizing the plot


Sometimes you have a lot of data which has to fit into the data field and the default settings of
matplotlib do not fit your expectations. This was the case when I started to display the runtime
results of the website scraping applications.
In this case you can resize the plot to be bigger when generated so the data has more space to
show itself.


Resizing the plot

plt.figure(figsize=(15, 6))

The code above resizes the plot to 15 x 6 inches, which is 1500 x 600 pixels at 100 dots per inch.
This is not obvious at first glance, but if you look at the resulting images you can see that the
size is exactly 1500 x 600.

Figure: The resized plot
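The relation between figsize and the pixel size can be verified directly. This is a minimal sketch; the dpi value is passed explicitly as my assumption, because the default resolution can differ between matplotlib versions and configurations, and the off-screen Agg backend is chosen so that no display is needed.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen, no window needed
import matplotlib.pyplot as plt

# figsize is (width, height) in inches; dpi is the number of pixels per inch.
fig = plt.figure(figsize=(15, 6), dpi=100)

# Size in pixels = size in inches * dots per inch.
width_px, height_px = fig.get_size_inches() * fig.dpi
print(width_px, height_px)
```

When saving to a file, the resolution used by savefig determines the size of the image, so the same figure can come out with a different pixel size if that setting differs.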

Some other format options


There are some options you can pass to the figure() function to reformat your plot. I will show
some of them, but as I've said, this book is not about formatting plots.
facecolor
With the facecolor parameter you can set the color of your plot's face, or at least the color of
the background behind the data field and the legend.
White background for the plot

plt.figure(figsize=(15, 6), facecolor='w')

Figure: A plot with white background

edgecolor
As the name suggests, this modifies the edge of the plot. I do not see use cases where I
would set the edge, but for this example I add a green edge:
Green edge for the plot

plt.figure(figsize=(15, 6), edgecolor='g', linewidth=16)

Figure: Plot with a green edge

Well, not the most beautiful plot. This gray background and the green edges... Well, let's take a
step forward in formatting.
Grids
This is more important than plot edges in my eyes: showing the grid. This is quite simple too:


Enabling the grid

plt.grid()

This results in a basic grid. If you want to format the grid, you have to provide some parameters
to the grid() function.
The most used are linewidth, linestyle and color. They accept the same values you can assign to a
Line2D object in matplotlib.
For example, I create a grid with red color and dotted style:
Formatted grid

plt.grid(color='r', linestyle=':')

Figure: The data with a formatted grid
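The third of the commonly used parameters, linewidth, works the same way. This is a minimal sketch of a more subdued grid; the data and the chosen values are hypothetical, and the off-screen Agg backend is my assumption. It reads the settings back from the grid lines of the x axis to show that they are ordinary line objects.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen, no window needed
import matplotlib.pyplot as plt

plt.figure()
plt.plot([0, 1, 2, 3], [0, 1, 4, 9], 'b*')

# Thin, gray, dashed grid lines; the values are only an illustration.
plt.grid(color='gray', linestyle='--', linewidth=0.5)

# The grid lines are Line2D objects, which is why grid() accepts
# the same properties as a plotted line.
first_gridline = plt.gca().xaxis.get_gridlines()[0]
print(first_gridline.get_color(),
      first_gridline.get_linestyle(),
      first_gridline.get_linewidth())
```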

Naturally, there are some style guides where you can learn that less is more; however, sometimes
your customers (or your boss) think that nice colors can make some plots even more fancy. Well,
implement their wishes and try to convince them that you can get better results with fewer colors
and less design.

5.6 Conclusion
Matplotlib gives you everything you need to draw a plot and far more. You can configure the plots
as you wish and if you search a bit around the internet (StackOverflow for example) you can find
other interesting use cases someone wanted to implement.
So if you have data in Python you can visualize it with matplotlib easily. This is why I've chosen
this tool for my performance measurements: it is in Python and it is easy to learn and use.

6. Some thoughts on functional programming
Again, this won't be a book about functional programming and I won't go into deeper detail
on how to get there. If you have been developing with Python for some time you might already know
how to develop more and more in the real functional way, but if you are new you might be doing it
as well without knowing about it.
In this chapter I'll introduce some bits and pieces of code which I refactored after the first
version of the implementation. Yes, the first version was a crude, wood-cutter version, where I
tried to avoid functional paradigms to be able to write this chapter for you.


7. Parallel working
I introduced this way of working with Java 8 in my other book, and I think this would be a
nice chapter in this book too. That's because I would like to learn a bit more about Python's
concurrency model, and I hope for some performance gain when I write some operations to work on
the data in parallel.
Let's see where I get with my tuning and whether the results are worth the effort.
http://leanpub.com/javaxml


8. Extra! Extra! Read all about it!


Yes, this is something extra, something new and fresh you will love. A sample application which
covers the concept of website scraping and introduces the tools covered in this book, and you
get the sources. And all that for free.
OK, almost for free, because you have to pay some money when you buy the book (and some
more for the sources), but I think this is a good exchange. It is rare to see such a detailed
application coming along with a book / tutorial / article about this topic.
