I've used the Firebug plugin for the Firefox web browser to look at the structure of the page, as it allows me to explore in more detail than the standard 'View Source' feature in most browsers. See if you can pick out the elements you want to scrape.
For example, each of the listings of Garages for Rent in Oxford is contained within a div with the class 'pagewidget', so I can use the selector $dom->find('div.pagewidget') to locate them. (This sort of selector will be familiar to anyone used to working with CSS - Cascading Style Sheets.)
3) Check what ScraperWiki returns and start refining your scraper
If you click 'Run' below your scraper you should now see a range of elements returned in the console. The default PHP template loops through all the elements that match the selector we just set and prints them out to the console.
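For reference, a minimal sketch of that default template might look like the following (the listings URL here is a placeholder, and I'm assuming ScraperWiki's bundled simple_html_dom parser):

    <?php
    require 'scraperwiki/simple_html_dom.php';

    // Fetch the listings page (placeholder URL) and parse it
    $html = scraperwiki::scrape("http://example.org/garages-to-rent");
    $dom = new simple_html_dom();
    $dom->load($html);

    // Print every element matching our selector to the console
    foreach ($dom->find('div.pagewidget') as $data) {
        print $data->plaintext . "\n";
    }
    ?>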
My scraper returns quite a few elements I don't want (there must be more than just the Garage listings picked out by the div.pagewidget selector), so I look for something uniform about the elements I do want. In this case, they all start with 'Site Location' (or at least their plaintext versions, as returned by $data->plaintext, do).
I can now add some conditional code to my scraper to only carry on processing those elements that contain 'Site Location'. I've chosen to use the 'stristr' function in PHP, which just checks whether one string is contained in another and is case-insensitive, rather than checking the exact position of the phrase, so the scraper is tolerant of any variation in the way the data is presented that I've not spotted.
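As a sketch, that conditional might sit inside the loop like this (the processing itself is left out):

    foreach ($dom->find('div.pagewidget') as $data) {
        // stristr() is case-insensitive and returns FALSE if the
        // needle isn't found anywhere in the haystack
        if (stristr($data->plaintext, 'Site Location') === FALSE) {
            continue; // skip the elements we don't want
        }
        // ... process the garage listing ...
    }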
Sometimes you get down to text which isn't nicely formatted in HTML, and then you will need to use different string processing to pull apart the bits you want. For example, in the Garage listings we can separate each line of plain text by splitting the text at <br>
elements, and then splitting each line at the colon ':' used to separate titles and values. A check of the raw source shows the Oxford Garages page uses both <BR> and <br /> as elements, so we can use a replace function to standardise these (or we could use regular expressions for splitting).
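A rough sketch of that splitting, working from each element's inner HTML (the $record array is just illustrative):

    // Standardise both <br> variants (case-insensitively) into newlines;
    // '<br />' goes first so the shorter '<br>' doesn't leave ' />' behind
    $text = str_ireplace(array('<br />', '<br>'), "\n", $data->innertext);

    $record = array();
    foreach (explode("\n", $text) as $line) {
        // Split each line at the first colon into a title and a value
        $parts = explode(':', strip_tags($line), 2);
        if (count($parts) == 2) {
            $record[trim($parts[0])] = trim($parts[1]);
        }
    }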
In the Oxford Garages case as well, our data is split across multiple pages, so once we have the scraper for a single page working right, we can nest it inside a scraper that grabs the list of pages and loops through those too. ScraperWiki also includes useful helper code for working with forms, for sites where you have to submit searches or make selections in forms to view any data.
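In outline, the nesting might look something like this (the pagination selector and the scrape_page() function are hypothetical stand-ins for the real site structure and for the single-page scraper we built above):

    // Grab the index page, find the links to each page of listings,
    // and run the single-page scraper over every one of them
    $dom->load(scraperwiki::scrape("http://example.org/garages-to-rent"));
    foreach ($dom->find('div.pagination a') as $link) {
        scrape_page(scraperwiki::scrape($link->href));
    }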
To store your data, ScraperWiki's save function takes four arguments.
Firstly, an array indicating the name of the unique key in your data that should be used to work out whether a record is new, or an update to an existing record.
Second, an array of the data itself, as name => value pairs.
Third, the date of the record (for indexing). Leave as null to just use the date the scraper was run.
Fourth, an array of latitude and longitude, if you have geocoded your data.
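Put together, a save call might look like this sketch (the field names are illustrative, not the ones in my scraper):

    // scraperwiki::save(unique keys, data, date, latlng)
    scraperwiki::save(
        array('site_location'),               // which field identifies a record
        array('site_location' => $location,
              'garages'       => $count),     // the record itself
        null,                                 // null = date the scraper ran
        array($lat, $lng)                     // only if you have geocoded
    );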
Run your scraper and check the 'data' tab to see what is being saved.
You can also create 'Views' onto your data, using pre-prepared templates to create maps and other useful visualisations of your data, direct from within ScraperWiki. ScraperWiki will run your scraper every 24 hours, meaning that as long as it keeps working, you can rely on it as an up-to-date data source.
Below is the map I produced, showing Garages to Rent around Oxford, with the number of
garages, photos, and links off to the pages with details about them.
One of the best things about ScraperWiki overall, though, is that it is wiki-like. You can take a look at my Oxford Garages code at http://scraperwiki.com/scrapers/oxford-garages-to-rent/ and you can edit and improve it (and there are lots of potential improvements to be made).
You can also suggest scrapers you would like other people to create, or respond to
requests for scrapers from others.