This is a Leanpub book. Leanpub empowers authors and publishers with the Lean Publishing
process. Lean Publishing is the act of publishing an in-progress ebook using lightweight tools
and many iterations to get reader feedback, pivot until you have the right book and build
traction once you do.
2015 Gabor Laszlo Hajba
Contents
Preface
  What will I do exactly?
  About the programming language
  Some extra feature
  Prerequisites
  Length of the book
  LeanPub
7. Parallel working
Preface
As you might know, I have written some articles and a book about website scraping with Java.
There I introduced some concepts and an example application implemented with two different
tools, and I compared the runtimes as I implemented features.
Now it is time to write this sample application in Python.
For this book I will take a look at the following tools:
* BeautifulSoup4
* Scrapy
These are the best-known website scraping tools in the Python world; however, if I find a
newer one which aids my work better, I'll introduce it too.
Note
This book is not a 1:1 copy of the Java version. That's because I did not face the same
problems and tasks I encountered with the Google App Engine and similar things. This book
only takes the concept of the sample application and the performance measuring from its
predecessor, nothing more. If you would like to read (more) about topics not covered
in this book but covered in the previous one, please feel free to send me a message or
an e-mail with your request and I'll see what I can do about it.
Lean note
Because this book is published with LeanPub, I plan to do the publishing in the lean way:
as early as possible and as often as required.
This means that as soon as I am ready with the BeautifulSoup part I will publish the
first release. After this is done I can focus on your requests and remarks about my book
and write the Scrapy part. At the end I can write the sample application for the
last chapter, which will again be released in two parts: BeautifulSoup and Scrapy.
http://leanpub.com/javaxml
At the end I plan to compare the results of the Python-based applications with the Java-based
ones. Again, you have to keep in mind that in this case I am comparing apples with peaches (not
pears this time); however, it does not hurt to see which tool gives you a better result if you
can choose between Java and Python.
So, let's dive right into the programming, or in other words: slap into the soup.
Prerequisites
Nothing. There are no requirements if you want to read this book.
However, some interest in website scraping, Python and software development helps.
I try to write the book for everyone, but sometimes I get off-track. So if you cannot
follow my thoughts, just cry out loud (preferably with a message) and I will come to
your aid and fix the mess I've created.
LeanPub
I publish my books on LeanPub because
So if someone does not want to pay some dollars because there won't be any updates and he/she
has to use some online errata, I can just say: do not worry, here you'll get every update of the
book, and they are already included in the price.
It can happen that I expand the book with some chapters on my own or write another sample
application; when that happens, the new version of the book will be available here too. This is
because the technology is changing so fast that after some time a book stuffed with old
knowledge won't be useful.
Note
Not copying the code might result in a bit of overhead because I have to write every
website-select statement again and again. But I like this approach because this way
I learn something new (or refresh the old ideas) and maybe I can refactor the code
I've created in Java.
Besides learning, I am also not tempted to copy some code that might not work in Python
because lists are handled differently than they are in Java. For example, they do
not have the methods first() and last().
Note
As you might know, Scrapy is not just a library to use like BeautifulSoup. It is mostly
used as a standalone tool where you configure your own scraper application with
spiders and so on. So this chapter will evaluate and introduce a tool on its own, not
just a library.
However, I will show how to create a simple script using Scrapy as a library, and
I will evaluate the performance based on this solution too.
import matplotlib.pyplot as plt

# A single list is used as the Y values; X defaults to 0..N-1.
plt.plot([1, 2, 3, 4, 5])
plt.ylabel('The value range')
plt.xlabel('The number range')
plt.show()
As you can see, the example draws a simple plot and the names of the functions are really
straightforward, but let's go a bit deeper.
Extended plot example
The first line above now has three parameters: the first list contains the numbers on the X axis,
the second list contains the corresponding values on the Y axis.
The third parameter, the string, describes how to display the plot's values. The default display
method is 'b-', which represents the plot as a blue line. And as you might guess, 'ro' displays
the values as red circles. So yes, one letter is the color of the values and the remaining
characters describe how the plot should be displayed: line, dashed line, squares, circles,
triangles and so on.
The result is as described above: blue asterisks, green squares and yellow triangles.
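The listing this passage describes did not survive in this copy, so here is a minimal sketch of what such a call could look like. The three data series are an assumption borrowed from the later examples in this chapter; only the format strings 'b*', 'gs' and 'y^' come from the text.

```python
import matplotlib.pyplot as plt

x = list(range(10))
linear = [i for i in x]          # assumed sample data
doubled = [i * 2 for i in x]     # assumed sample data
squared = [i ** 2 for i in x]    # assumed sample data

# Each (x, y, format) triple adds one dataset to the same plot:
# blue asterisks, green squares and yellow triangles.
lines = plt.plot(x, linear, 'b*', x, doubled, 'gs', x, squared, 'y^')

# plt.show() would display the figure; it is left out so the
# sketch also runs on a machine without a display.
```

The call returns one line object per dataset, which is handy if you want to tweak a single series afterwards.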
Unfortunately, I plot a varying number of datasets to export them as images, so I needed something
different. However, the solution is not too complex.
The plot function adds to the current figure each time it is called, so I can call it in a loop
and add new data to visualize.
Without any formatting parameters (like color or symbol) matplotlib assigns the next color of its
default color cycle to the new data and draws it as a line. Naturally, you can also
assign color and symbol to the dataset manually.
Let's look at the same example as above; this time I put the three data series into a list and
plot the contents of the list:
import matplotlib.pyplot as plt

dataset = []
dataset.append([i for i in range(10)])
dataset.append([i*2 for i in range(10)])
dataset.append([i**2 for i in range(10)])

# Every call to plot() adds one more data series to the same figure.
for data in dataset:
    x = [i for i in range(len(data))]
    plt.plot(x, data)
plt.show()
As you can see, the result is not quite the same: the colors and the symbols are different. The
default symbol for plots is a line, and lines are not a good representation for distinct data
such as my runtime results. To make the data look the same as in the previous example, I create
a list containing the formatting options and assign these options to the displayed datasets:
import matplotlib.pyplot as plt

dataset = []
dataset.append([i for i in range(10)])
dataset.append([i*2 for i in range(10)])
dataset.append([i**2 for i in range(10)])

# One format string per dataset: color letter followed by the symbol.
symbols = ['b*', 'gs', 'y^']
counter = 0
for data in dataset:
    x = [i for i in range(len(data))]
    plt.plot(x, data, symbols[counter])
    counter += 1
plt.show()
Naturally, this approach is good for a fixed number of datasets. But what about more data? The
solution I used is almost the same as above, but I defined one list for colors and another list
for symbols, and I choose the next color and symbol from these lists with a simple modulo
calculation:
import matplotlib.pyplot as plt

dataset = []
dataset.append([i for i in range(10)])
dataset.append([i*2 for i in range(10)])
dataset.append([i**2 for i in range(10)])

symbols = ['*', '^', '.', 'o', 's']
colors = ['r','g','b','c','k','m']
counter = 0
for data in dataset:
    x = [i for i in range(len(data))]
    # The modulo wraps the index around when a list is exhausted.
    symbol = counter % len(symbols)
    color = counter % len(colors)
    plt.plot(x, data, symbols[symbol]+colors[color])
    counter += 1
plt.show()
If there are more datasets than entries in the lists, the symbols and colors restart from the
beginning.
And that was it. Now the plot displays the distinct values of each run and the averages as a line,
both in the same color.
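To see the wrap-around without drawing anything, the modulo calculation can be isolated in a small helper. This is only an illustrative sketch: the function name style_for is hypothetical, while the two lists are the ones used in the listing above.

```python
SYMBOLS = ['*', '^', '.', 'o', 's']
COLORS = ['r', 'g', 'b', 'c', 'k', 'm']


def style_for(index):
    """Return the format string for the dataset at position `index`."""
    symbol = SYMBOLS[index % len(SYMBOLS)]   # wraps after 5 datasets
    color = COLORS[index % len(COLORS)]      # wraps after 6 datasets
    return symbol + color


styles = [style_for(i) for i in range(8)]
print(styles)
# prints ['*r', '^g', '.b', 'oc', 'sk', '*m', '^r', '.g']
```

Because the two lists have 5 and 6 entries, the same symbol and color pair only repeats after 30 datasets, which is more than enough for most plots.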
Naturally, assigning the label alone does not display anything. We have to tell matplotlib to
show the legend with the labels of the values. For this we have to call a function:
Calling the legend
plt.legend()
plt.show()
This results in a basic legend inside the plot, which can be a bit of a pain in the neck because
the legend can hide some data, as it does in this case too. In the next section I will show how
you can move the legend out of the data field.
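Putting the two steps together, labels first and the legend call second, could look like the following minimal sketch. The label texts and the sample data are assumptions made for illustration.

```python
import matplotlib.pyplot as plt

x = list(range(10))
plt.figure()
# The label is only stored with the data series at this point.
plt.plot(x, [i * 2 for i in x], 'gs', label='doubled')   # assumed label
plt.plot(x, [i ** 2 for i in x], 'y^', label='squared')  # assumed label

# Only this call actually draws the legend with the stored labels.
legend = plt.legend()

# plt.show() would display the figure; left out so the sketch runs headless.
```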
As you can see, the solution is straightforward. You can pass in any string; naturally, a longer
one will not always fit the screen. If you have a long title, consider shortening it or placing
it outside of the plot.
This solution is straightforward too: you label the x and y axes with their titles.
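As a short sketch of both steps, the title and the axis labels can be set like this. The title text is a made-up example; the axis label texts are reused from the first example of this chapter.

```python
import matplotlib.pyplot as plt

plt.figure()
plt.plot([1, 2, 3, 4, 5], 'ro')
plt.title('Runtime per run')        # hypothetical title text
plt.xlabel('The number range')
plt.ylabel('The value range')

# The current axes object lets you read the texts back if needed.
ax = plt.gca()

# plt.show() would display the figure; left out so the sketch runs headless.
```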
To solve this issue you have to resize the subplots a bit to fit both into the display area:
Displaying both data field and legend
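The original listing is missing from this copy. One common way to do this, which may differ from the listing in the book, is to shrink the data field with subplots_adjust and anchor the legend just outside of it with bbox_to_anchor. The values 0.8 and (1.0, 0.5) as well as the labels are assumptions you would tune for your own plot.

```python
import matplotlib.pyplot as plt

x = list(range(10))
fig = plt.figure()
plt.plot(x, [i * 2 for i in x], 'gs', label='doubled')   # assumed label
plt.plot(x, [i ** 2 for i in x], 'y^', label='squared')  # assumed label

# Let the data field end at 80% of the figure width ...
plt.subplots_adjust(right=0.8)
# ... and pin the left-center point of the legend to the right edge
# of the data field, so the legend sits in the freed space.
legend = plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))

# plt.show() would display the figure; left out so the sketch runs headless.
```

If the legend is still cut off, a smaller value for right leaves more room for it.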
plt.figure(figsize=(15, 6))
The code above resizes the plot to 1500 x 600 pixels. This is not obvious at first glance because
figsize is given in inches; matplotlib multiplies it by the resolution (dpi), and at 100 dpi the
resulting images are exactly 1500 x 600 pixels.
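The relation between figsize and the pixel size can be checked with a few lines. This sketch sets dpi=100 explicitly, which is an assumption; the default resolution depends on your matplotlib version and configuration.

```python
import matplotlib.pyplot as plt

# figsize is in inches, dpi is dots (pixels) per inch.
fig = plt.figure(figsize=(15, 6), dpi=100)

# Pixel size = size in inches * dots per inch.
width_px, height_px = (int(round(v)) for v in fig.get_size_inches() * fig.dpi)
print(width_px, height_px)
# prints 1500 600
```

When you export the figure with savefig, the dpi argument of that call decides the resolution of the image file, so pass the same value there if you need exact pixel sizes.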
edgecolor
As the name tells, this modifies the edge of the plot. Well, I do not see use cases where I
would set the edge, but for this example I add a green edge:
Green edge for the plot
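The listing for this example is missing here, so the following is only a sketch of how a green edge could be set. It assumes that a line width is passed as well, because the figure edge has a width of 0 by default and would otherwise stay invisible; the width of 5 is an arbitrary choice.

```python
import matplotlib.pyplot as plt

# edgecolor alone is not visible: the border needs a width too.
fig = plt.figure(figsize=(15, 6), edgecolor='g', linewidth=5)
plt.plot([1, 2, 3, 4, 5], 'ro')

# plt.show() would display the figure; left out so the sketch runs headless.
```

Depending on the matplotlib version, savefig may use its own edge color setting when exporting, so the edge in a saved image can differ from the one on screen.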
Well, not the most beautiful plot. This gray background and the green edges... Well, let's take a
step forward in formatting.
Grids
This is more important than plot edges in my eyes: showing the grid. This is quite simple too:
plt.grid()
This results in a basic grid. If you want to format the grid you have to provide some parameters
to the grid() function.
The most used are linewidth, linestyle and color. They accept the same values as the
corresponding properties of a Line2D object in matplotlib.
For example, I create a grid with red color and dotted style:
Formatted grid
plt.grid(color='r', linestyle=':')
Naturally, there are some style guides where you can learn that less is more; however, sometimes
your customers (or your boss) think that nice colors can make some plots even more fancy. Well,
implement their wishes and try to convince them that you can get better results with fewer colors
and less design.
5.6 Conclusion
Matplotlib gives you everything you need to draw a plot, and far more. You can configure the
plots as you wish, and if you search around the internet a bit (StackOverflow for example) you
can find other curious use cases someone wanted to implement.
So if you have data in Python, you can visualize it with matplotlib easily. This is why I've
chosen this tool for my performance measurements: it is in Python and it is easy to learn and use.
7. Parallel working
I introduced this type of working with Java 8 in my other book, and I think it would make a
nice chapter in this book too. That's because I would like to learn a bit more about Python's
concurrency model, and I hope for some performance gain when I write some operations to work on
the data in parallel.
Let's see where I get with my tuning and whether the results are worth the effort.
http://leanpub.com/javaxml