
Proceedings of the 10th INDIACom; INDIACom-2016; IEEE Conference ID: 37465

2016 3rd International Conference on “Computing for Sustainable Global Development”, 16th - 18th March, 2016

Bharati Vidyapeeth's Institute of Computer Applications and Management (BVICAM), New Delhi (INDIA)

A Dive into Web Scraper World


Deepak Kumar Mahto
HMR Institute of Technology and Management, Delhi, INDIA
Email ID: deepak7mahto@gmail.com

Lisha Singh
HMR Institute of Technology and Management, Delhi, INDIA
Email ID: lisha.engg@gmail.com

Abstract – This paper talks about the World of Web Scraper. Web scraping is related to web indexing, whose task is to index information on the web with the help of a bot or web crawler. Here the legal aspect, both the positive and the negative side, is taken into view. Some cases regarding the legal issues are also taken into account. The Web Scraper's designing principles and methods are contrasted, and it is shown how a working Scraper is designed. The implementation is divided into three parts: the Web Crawler to fetch the desired links, the data extractor to fetch the data from the links, and storing that data into a CSV file. The Python language is used for the implementation. On combining all these with good knowledge of libraries and working experience, we can have a fully-fledged Scraper. Due to the vast community and library support for Python and the beauty of the coding style of the Python language, it is most suitable for scraping data from Websites.

Keywords – Web Scraping; Screen Scraping; Web Crawling; Implementing Web Scrape; Designing Web Scraper

I INTRODUCTION
We all use a Web Browser to extract the needed information from Web Sites; if you think this is the only way to access information from the internet then you are missing out on a huge range of available possibilities. Web Scraper is one of those possibilities, in which we access the information on the internet using programs and pre-written libraries.
[1] Wikipedia says "Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites. Usually, such software programs simulate human exploration of the World Wide Web by either implementing low-level Hypertext Transfer Protocol (HTTP), or embedding a fully-fledged web browser, such as Mozilla Firefox. Web scraping is closely related to web indexing, which indexes information on the web using a bot or web crawler and is a universal technique adopted by most search engines. In contrast, web scraping focuses more on the transformation of unstructured data on the web, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet. Web scraping is also related to web automation, which simulates human browsing using computer software. Uses of web scraping include online price comparison, contact scraping, weather data monitoring, website change detection, research, web mashup and web data integration."
[2] Web Harvy says "Web Scraping (also termed Screen Scraping, Web Data Extraction, Web Harvesting etc.) is a technique employed to extract large amounts of data from websites whereby the data is extracted and saved to a local file in your computer or to a database in table (spreadsheet) format. Data displayed by most websites can only be viewed using a web browser. Examples are data listings at yellowpages' directories, real estate sites, social networks, industrial inventory, online shopping sites, contact databases etc. Most websites do not offer the functionality to save a copy of the data which they display to your computer. The only option then is to manually copy and paste the data displayed by the website in your browser to a local file in your computer - a very tedious job which can take many hours or sometimes days to complete."
The concept of Web Scraping is not new to us; it is getting more famous these days because of the new Startups, as they don't have to do much hard work to get the data. They mostly prefer to use the data scraped from other similar websites and then modify it as per their need. Due to this, the bigger existing companies are facing much loss as their data is being anonymously gathered and then reproduced by some other companies.

II IS WEB SCRAPING LEGAL?
This question is always left unanswered properly. There are lots of different views of different people on the legal and illegal aspects of scraping the Web. In today's world we can see many examples of the legal use of Web Scraper such as price comparison websites and reviewing Websites. In this section we will see some example cases and try to answer this question.
[3] The eBay action: "Not much could be done about the practice until in 2000 eBay filed a preliminary injunction against Bidder's Edge. In the injunction eBay claimed that the use of bots on the site, against the will of the company, violated Trespass to Chattels law. The court granted the injunction because users had to opt in and agree to the terms of service on the site and that a large number of bots could be disruptive to eBay's computer systems. The lawsuit was settled out of court so it all never came to a head but the legal precedent was set." Another case was seen: "In 2001 however, a travel agency sued a competitor who had "scraped" its prices from its Web site to help the rival set its own prices. The judge ruled that the fact that this scraping was not welcomed by the site's owner was not sufficient to make it "unauthorized access" for the purpose of federal hacking laws." Facebook's action: "In 2009 Facebook won one of the first copyright suits against a web scraper. This laid the groundwork for numerous lawsuits that tie any web scraping with a direct copyright violation and very clear monetary damages."

There is a case of AT&T's where "Andrew Auernheimer was convicted of hacking based on the act of web scraping. Although the data was unprotected and publically available via AT&T's website, the fact that he wrote web scrapers to harvest that data in mass amounted to "brute force attack". He did not have to consent to terms of service to deploy his bots and conduct the web scraping. The data was not available for purchase. It wasn't behind a login. He did not even financially gain from the aggregation of the data. Most importantly, it was buggy programing by AT&T that exposed this information in the first place. Yet Andrew was at fault. This isn't just a civil suit anymore. This charge is a felony violation that is on par with hacking or denial of service attacks and carries up to a 15-year sentence for each charge."
[4] A Quora user answered "The key part is what you want to do with the scraped data. If you use it for your own, personal use, then it is legal as it falls under the fair use doctrine. There is nothing special in accessing data for yourself with a browser; you can use other means, i.e. scraping. The complications start if you want to use scraped data for other, especially commercial, purposes. However, even then you may be able to do a lot. A good example is deep linking, a practice where links to pages within a site are placed on another site, bypassing the target site's home page. There have been several legal precedents for such cases and in several of them courts have ruled that deep linking is legal, even including short descriptions and meta data from target pages, as long as it is clear that the site where deep links are placed is not claiming ownership of the data."
The legality of the scraper is still unclear based on the views of different people. The only thing we can still conclude about the legality is that if we are not doing any harm to the website and not selling the scraped data then it is legal; otherwise it may well not be legal any more.

III DESIGNING A CUSTOM SCRAPER
A Web Scraper is broadly composed of two parts:
Web crawler for crawling links + Data Extractor from crawled links.

A Web Crawler
[5] Wikipedia says "A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose of Web indexing. A Web crawler may also be called a Web spider, an ant, an automatic indexer, or (in the FOAF software context) a Web scutter." A Web Crawler usually crawls a website using recursive algorithms, in which it scans the first page, finds the links on that page, stores them in some type of data structure, then fetches the first link from the stored links, opens the webpage at that link, stores its links into the same data structure and recursively repeats the process, till all links get crawled.
B Data Extractor
The Data Extractor extracts the needed information from the Web Page. There is a lot of junk and useful data mixed together on a website, and we have to look only for the needed useful information. To separate out the useful part from the mixture we target only that part of the page in which the information is present. To target that specific part, we are using CSS Selectors.
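As a small illustration of such targeting (the URL and the selector below are hypothetical, not taken from any real site), BeautifulSoup's select() method accepts a CSS Selector and returns only the matching elements:

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Hypothetical listing page and selector, used only to illustrate CSS-based targeting.
html = urlopen("http://example.com/listings").read()
soup = BeautifulSoup(html, "html.parser")
# select() narrows the parsed tree to just the useful fragment of the page,
# e.g. every name rendered inside <div class="item"><h2 class="name">...</h2>.
for tag in soup.select("div.item h2.name"):
    print(tag.get_text(strip=True))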
IV WEB CRAWLER EXPLAINED
[5] Wikipedia explains "Web search engines and some other sites use Web crawling or spidering software to update their web content or indexes of others sites' web content. Web crawlers can copy all the pages they visit for later processing by a search engine which indexes the downloaded pages so the users can search much more efficiently."

Fig. 1. Architecture of Web Crawler [5]

The working of a web crawler is very simple: it starts with a list of URLs to visit, called the seeds. The crawler then visits all the URLs and processes all the web pages at those URLs; then the crawler finds the anchor tags in which links are contained. Those links are stored in a list which will be further processed one by one, and more links are fetched from them; that list of URLs is called the crawl frontier. Two types of URLs are seen by the crawler while visiting the web pages: one is the Relative URL and the other is the Absolute URL. First the crawler has to identify whether a URL is a Relative URL or an Absolute URL. If it is an Absolute URL then the crawler can simply store that URL in the list and start fetching pages from it, and if the URL is a Relative URL then we need to provide the base URL. Sometimes there is a case when the base URL is different; to get the complete URL of the crawled links we have to do manual analysis of the Web Page source, identify the common pattern in the needed URLs, and then accordingly convert them into Absolute URLs.
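As an illustrative sketch of this conversion (the implementation later in this paper concatenates the URLs directly; urljoin from the Python standard library, shown here with hypothetical URLs, is an equivalent way to do it):

from urllib.parse import urljoin

base = "http://example.com/catalog/"           # hypothetical base URL
print(urljoin(base, "item?id=7"))              # relative link  -> http://example.com/catalog/item?id=7
print(urljoin(base, "/about"))                 # root-relative  -> http://example.com/about
print(urljoin(base, "http://other.com/page"))  # already absolute, returned unchanged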
Sometimes there are also redirect loops that can cause our crawler to get stuck in an infinite loop if we have left the system and kept the crawler running. Many problems of this kind occur while crawling, so we need to take all these types of issues into account while designing a crawler.
Let's take an example from Wikipedia related to one such issue: "A simple online photo gallery may offer three options to users, as specified through HTTP GET parameters in the URL. If there exist four ways to sort images, three choices of thumbnail size, two file formats, and an option to disable user-provided content, then the same set of content can be accessed with 48 different URLs, all of which may be linked on the site. This mathematical combination creates a problem for crawlers, as they must sort through endless combinations of relatively minor scripted changes in order to retrieve unique content."

C Crawling Policy
The behavior of a Web crawler is given by the combination of the following policies:
Selection Policy - It states the pages to download.
The selection policy simply refers to prioritizing Web Pages based on a metric of their importance. The importance in the Web Crawler to be designed for a Web Scraper can be easily identified, as here we only have to focus on some important links, not on all the links on the website; following all the links would result in a waste of time and resources while running the Web Crawler to find the links.
Re-Visit Policy - It states when to check for changes to the pages.
In today's world most websites are found to have a dynamic nature. Suppose we have planned to crawl a website for all available links to start extracting data; it could take days or weeks to fully crawl a big website. By the time our crawler has finished getting the links, the website may have added new updated content, so we have to design the crawler in such a way that it can identify the new content itself and doesn't waste resources on the old content that has already been fetched. The check-for-changes part of the crawler can add very much efficiency when writing a long-term running crawler.
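One possible sketch of such a check, given here only as an assumption (the paper does not prescribe a method, and real crawlers often rely on HTTP headers such as Last-Modified or ETag instead), keeps a digest of every fetched page and treats a URL as new content only when its digest changes:

import hashlib
from urllib.request import urlopen

seen_digests = {}  # url -> digest of the copy fetched last time

def has_changed(url):
    # Compare a hash of the current page body with the stored one;
    # only a different hash means the page carries new content to extract.
    body = urlopen(url).read()
    digest = hashlib.sha256(body).hexdigest()
    changed = seen_digests.get(url) != digest
    seen_digests[url] = digest
    return changed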
Politeness Policy - It states how to avoid overloading Web Sites.
This policy is used to prevent the crawler from overloading a website by sending multiple requests, so that the website doesn't go offline. We need to consider that a crawler can go anywhere on the website at a much higher speed than a human does; this could affect the performance of the website. Part of the solution is the robots.txt exclusion, in which the administrator specifies which URLs are not to be crawled by the crawler. To make our crawler faster we may make a crawler that sends multiple requests at one time, but this will degrade the performance of the website for genuine users. As developers we can provide an artificial delay between requests so that there are no such issues for the websites while crawling them.
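A minimal sketch combining both ideas, assuming hypothetical URLs and an arbitrary two-second delay:

import time
import urllib.robotparser
from urllib.request import urlopen

robots = urllib.robotparser.RobotFileParser("http://example.com/robots.txt")
robots.read()  # fetch and parse the administrator's exclusion rules

for url in ["http://example.com/page1", "http://example.com/page2"]:
    if not robots.can_fetch("*", url):  # skip URLs the site has excluded
        continue
    html = urlopen(url).read()
    # ... hand the page over to the extractor here ...
    time.sleep(2)                       # artificial delay so the site is not overloaded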
Parallelization Policy - It states how to coordinate distributed Web crawlers.
In this policy we speed up the crawler by running parallel crawlers side by side, with the goal of maximizing the visiting rate of the website while avoiding repeated links from getting into the database of links. Many algorithms can be designed to make the crawler work in parallel. To avoid getting duplicate links we need to write the links into the database as soon as they are found. Then, as an added filtering layer, we can check each new link against all the already found links, but this will add some more complexity to the code.
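A rough sketch of this idea with a thread pool and a shared set of already stored links (the pool size, the URLs and the use of threads are assumptions, not the paper's design):

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

seen = set()  # links already written to the store, checked before queuing new ones

def fetch(url):
    return url, urlopen(url).read()

frontier = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]
frontier = [u for u in frontier if u not in seen]
seen.update(frontier)

# Several pages are downloaded side by side; links found in the results would be
# filtered against `seen` before they are added to the next frontier.
with ThreadPoolExecutor(max_workers=4) as pool:
    for url, html in pool.map(fetch, frontier):
        print(url, len(html))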
D Crawling The Deep Web
The Deep Web is the part of the internet that is not indexed even by massive search engines such as Google. The surface web is just a small part of the whole internet we have, and the sites on the Deep Web contain much more information than the Surface Web, so sometimes we need to crawl and extract data from the Deep Web too. But the structure of websites on the Deep Web is much more challenging, so we need to design our crawler by doing careful analysis of the structure of the website we are focusing on. Websites placed on the Deep Web often go offline, and some websites come online only at specific times, so to extract content from such websites we need to take into account the many issues that can be faced.
Browser Identification
Using the USER AGENT field of the HTTP headers, the Web Crawler identifies itself to a Web Server. Web Crawler identification is helpful for Web Administrators to communicate with the Crawler owner. But some websites don't allow crawlers which don't identify themselves as some famous Web Browser, so for these the Crawler's USER AGENT field is spoofed to that of a Web Browser so that the crawler can do its work easily. This spoofing sometimes prevents crawl traps. So we have to pose like a genuine user.
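A small sketch of such spoofing with the standard library, where the browser string and the URL are illustrative assumptions:

from urllib.request import Request, urlopen

# The crawler identifies itself through the User-Agent header; here it is set to a
# browser-like string so that sites which refuse unknown bots still answer the request.
request = Request(
    "http://example.com/",
    headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"},
)
html = urlopen(request).read()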
V DATA EXTRACTOR
The crawled list of links is now available to us, and in it we are going to find the required information. If we start to gather the whole website and save it as a copy, this won't be that useful. So we have to extract the useful information and convert it into the needed format. All this is done step by step:
i. We find out the content we are going to approach; this can be done using the inspect element button provided in the context menu of the majority of Web Browsers.
ii. We find out the common pattern of CSS Selectors which is the same on all other similar pages.
iii. We check which method we are going to use, whether we have support for XPath or simple CSS Selectors; depending upon that we can carry our work further.
iv. We have the extractor ready, but we have to store the data in a well-formatted manner which should be very common and easy to convert into any other format; one such format is CSV, so check for the support of CSV in your programming language library.
v. We have the Data Extractor ready; do a recheck by extracting content into a temporary file so that the CSV format gives a well-formatted required output.
Now we have the required data, and this data can be easily converted into any other format as CSV is so common to all. We can convert it into JSON or SQL format.
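As a small illustration of that flexibility (the file names, field names and values below are hypothetical), rows written to CSV can be re-read and dumped as JSON with the Python standard library:

import csv
import json

# Write the extracted rows into a CSV file (hypothetical field names and values).
rows = [{"name": "Item A", "price": "10"}, {"name": "Item B", "price": "12"}]
with open("data.csv", "w", newline="") as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(rows)

# Because CSV is so widely understood, converting it onwards (here to JSON) is trivial.
with open("data.csv", newline="") as csv_file:
    records = list(csv.DictReader(csv_file))
with open("data.json", "w") as json_file:
    json.dump(records, json_file, indent=2)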
VI IMPLEMENTATION OF CRAWLER USING PYTHON

[11] In this implementation we are using PYTHON as the coding language; the reason behind choosing PYTHON is that PYTHON has vast community support and a good amount of libraries available for crawling a website, such as its inbuilt urllib and third party libraries such as mechanize, etc.
In the following code we have imported urlopen from urllib; urlopen will help us in visiting the URLs by opening them, and passing a URL to urlopen will do that. Then we have imported BeautifulSoup from bs4; BeautifulSoup is a very powerful library for creating a tree-like structure of the content of a web page. Then we have used a Python set; a set in Python has the special property that it can hold list elements but all the elements stored will be unique regardless of their position (sets are unordered in Python). Then we have created a function named get_links to get the links from the website. In this function we pass the fetched URL and the URL as base URL; when the function is called for the first time, the fetched pageUrl variable is given an empty string as we don't have any URL fetched yet, and in the recursive calls we pass the fetched link in pageUrl as now we have the URL. In the get_links function we convert the relative URL to an absolute URL by concatenating the fetched URL and the given base URL. Once the page is fetched, using findAll we find the "a" tags in the page and then check for "href" in the attribute list of each "a" tag. Then we check whether that link is in the set or not: if the link is not in the set then we store it there, and if it is in the set then we ignore it; using the recursive algorithm we pass the next link in the set to this function to keep fetching links and storing them in the set.

Fig. 2. Simple Web Crawler [11]
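A minimal sketch along the lines of the crawler described above (the name get_links, the Python set and findAll come from the description; the Python 3 imports, the hypothetical seed site and the restriction to root-relative links are assumptions of this sketch, not the paper's exact code):

from urllib.request import urlopen
from urllib.error import URLError
from bs4 import BeautifulSoup

found_links = set()  # every link discovered so far, kept unique by the set

def get_links(base_url, page_url=""):
    # On the first call page_url is the empty string, so only the base URL is fetched;
    # on recursive calls the relative link is concatenated with the base URL.
    try:
        html = urlopen(base_url + page_url).read()
    except URLError:
        return
    soup = BeautifulSoup(html, "html.parser")
    for anchor in soup.findAll("a"):
        if "href" in anchor.attrs:
            link = anchor.attrs["href"]
            # Only unseen, root-relative links are stored and followed; seen ones are ignored.
            if link.startswith("/") and link not in found_links:
                found_links.add(link)
                print(link)
                get_links(base_url, link)

get_links("http://example.com")  # hypothetical seed site

In practice a crawler of this kind is usually bounded by a depth limit or a maximum number of pages so that the recursion does not run indefinitely.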
VII IMPLEMENTATION OF EXTRACTOR USING PYTHON
Data can be extracted by doing simple HTML parsing using the Python module named BeautifulSoup, which is good at creating a tree structure of HTML elements. Using the tree structure of BeautifulSoup, we can navigate through the elements and reach the needed content easily.

Fig. 3. Title Extractor [11]

Here also we have used the urlopen and BeautifulSoup libraries; we have discussed them in the Implementation of Crawler section. Now we ask for the URL from the user using raw_input, then, opening the URL using urlopen, we get the raw content. That content is converted into a tree structure using BeautifulSoup, and then using BeautifulSoup we fetch the title of the Web Site at that URL; in a similar fashion we can print the needed information using CSS Selectors also (for more you have to read the documentation of BeautifulSoup). Then on the title object we use the get_text function, as the title object has the HTML code but we need only the text.
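A minimal sketch consistent with this description (the paper's raw_input implies Python 2; the Python 3 equivalents input() and urllib.request used here are assumptions):

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = input("Enter a URL: ")               # Python 3 counterpart of raw_input
html = urlopen(url).read()                 # raw content of the page
soup = BeautifulSoup(html, "html.parser")  # converted into a navigable tree
title = soup.find("title")                 # the <title> element still carries markup
print(title.get_text())                    # get_text() keeps only the text of the title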
VIII WRITING TO CSV
PYTHON has great support for CSV: you just have to import the csv module and you can use it with the inbuilt file handling system. CSV is a bit faster, smaller in size, very easy to handle (even in Excel) and many existing applications understand it; it is a widely used standard. Here we have used the csv library available in Python: we open a file named test.csv in append mode, which will create the file and add the content into it, then we write the heading into a row, and then using a for loop we write the numbers into the rows.

Fig. 4. CSV Writing [11]

IX FUTURE SCOPE
While throwing light on the future of the Web Scraper, it can be seen that people will start to develop many kinds of services such as Price Comparison Websites and Big-Data Analysis. With Internet and web technology spreading, massive amounts of data will be accessible on the web. Since 'big data' can be both structured and unstructured, web scrapers will have to get sharper and more incisive. With the rise of open source languages like Python, R & Ruby, customized scraping tools will only flourish, bringing in a new wave of data collection and aggregation methods. Web scraping can be seen as providing a solution to those who want data for their big data analysis.

X CONCLUSION
"If programming is magic, then web scraping is wizardry." says Ryan Mitchell, author of the book Web Scraping with Python. From the discussed topics we can conclude that the
use of Scrapers in the coming years will increase significantly, as the Scraper opens up another way of retrieving information without the use of an API, and mostly the data is accessed anonymously.
But the people who are doing Scraping should take into account that they are not breaking any kind of law which could make them liable for any offence. It should also be considered that the resources of the target website should not be consumed so heavily that it is no longer able to serve content to the legitimate users.
REFERENCES
[1]. https://en.wikipedia.org/wiki/Web_scraping
[2]. https://www.webharvy.com/articles/what-is-web-scraping.html
[3]. http://resources.distilnetworks.com/h/i/53822104-is-web-scraping-illegal-depends-on-what-the-meaning-of-the-word-is-is/181642
[4]. https://www.quora.com/What-is-the-legality-of-web-scraping
[5]. https://en.wikipedia.org/wiki/Web_crawler
[6]. Kolari, P. and Joshi, A., "Web Mining: Research and Practice", Computing in Science & Engineering, vol. 6, no. 4, 2004.
[7]. Malik, S.K. and Ravi, S.M., "Information Extraction Using Web Usage Mining, Web Scrapping and Semantic Annotation", IEEE International Conference on Computational Intelligence and Communication Networks (CICN), 2011.
[8]. Fister, I., Fong, S. and Yan Zhuang, "Data Reconstruction of Abandoned Websites", IEEE 2nd International Symposium on Computational and Business Intelligence (ISCBI), 2014.
[9]. Quang Thai Le and Pishva, D., "Application of Web Scraping and Google API Service to Optimize Convenience Stores Distribution", 17th IEEE International Conference on Advanced Communication Technology (ICACT), 2015.
[10]. Jaunt – Java Web Scraping and Automation, http://jaunt-api.com/jaunt-tutorial.htm
[11]. Ryan Mitchell, Web Scraping with Python, First Edition, O'Reilly Media, June 2015.
