
Very basic Parsing, on returned web data - tutorial

Alright, I'm sure you're saying to yourself, ok I have all this
data (web page, file data, it's all the same to us) but I really
want to extract some very specific data out of it. Does that
sound like what you're looking for? Well what we'll do is a
basic php web scrape just like in the first tutorial, but we're
going to take and pull some data out of it. For our example
what we'd like to do is find out how many pages of our site are
indexed by MSN and just return that scraped number. Sound
like something useful? Hopefully this is going to give you the
very basics of parsing out data. So let's go!

Most Basic Web Data Parsing Script
Whole script -
The whole script minus the line numbers of course. Those are
just there for our reference.

1. <?php
2. $data = file_get_contents('http://search.msn.com/results.aspx?q=site%3Afroogle.com');
3. $regex = '/Page 1 of (.+?) results/';
4. preg_match($regex,$data,$match);
5. var_dump($match);
6. echo $match[1];
7. ?>

Script Explanation -
Ok here goes with the basic explanation...

Line 2.
$data = file_get_contents('http://search.msn.com/results.aspx?q=site%3Afroogle.com');
Now if you studied up on the first tutorial you'll know that
we're pulling data from MSN search using the
file_get_contents() function and assigning the data to the $data
variable.

However we're also passing some data in the url to get the
specific page from MSN that we want to scrape. If you already
know about passing variables in the url you can go to Line 3.
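
Before moving on, one note on the file_get_contents() call from Line 2. The whole script assumes the fetch worked, but the function hands back false when it can't read anything. Here's a minimal sketch of checking for that. A local temp file stands in for the MSN page so it runs offline (that file and its contents are my own stand-ins, not part of the original script), and fetching a real url this way also assumes allow_url_fopen is turned on in your php setup.

```php
<?php
// Sketch only: a temp file stands in for the MSN results page.
$path = tempnam(sys_get_temp_dir(), 'scrape');
file_put_contents($path, '<h5>Page 1 of 9,138 results</h5>');

// Same call whether the source is a url or a file path.
$data = file_get_contents($path);

if ($data === false) {
    // file_get_contents() returns false when the read fails.
    exit("Could not fetch the data\n");
}

echo strlen($data) . " bytes fetched\n";

unlink($path);  // clean up the stand-in file
```

Checking with === false matters here, since an empty page would come back as an empty string, which is not the same thing as a failed read.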

You might be asking what is all that stuff after the MSN url?
I'm sure you've seen it a lot of times but might not have been
sure what it was. Basically all that stuff is just like passing
a variable in a php script, but you're doing it through a url.
Let's take a peek at the url we're using here to get a better
understanding. Our url if you don't remember is
"http://search.msn.com/results.aspx?q=site%3Afroogle.com".

Let's break it into two parts split on the question mark. Why
you ask? That's where the url ends and the data being passed
begins. With it separated we have:

http://search.msn.com/results.aspx
and
q=site%3Afroogle.com

Now I hope I don't need to go into an explanation on the first
part so I'm really only going to talk about the second. Also I'll
do some basic tutorials on accepting data later so you have an
understanding of what happens to this url on the other side. When
you look at the second part of the url you'll always see a field
and a value for the field, although sometimes that value is
blank. How do you know which is the field and which is the
value you ask? The field is always going to come before the
equal sign = and the value will come after. Basically think of it
like assigning a variable a value. In this data being passed by
the url our field is "q" if you didn't already guess and our
value is site%3Afroogle.com. The field 'q' that MSN takes
stands for query. So passing data assigned to the 'q' field is
telling MSN, "hey look this search/query up for me."
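
If you'd rather let php do the splitting for you, here's a small sketch using the built-in parse_url() and parse_str() functions on the same url. The variable names ($query, $fields) are just ones I picked for illustration. Note that parse_str() also decodes the value for you, so the %3A comes back as a plain colon.

```php
<?php
$url = 'http://search.msn.com/results.aspx?q=site%3Afroogle.com';

// Everything after the question mark is the query string.
$query = parse_url($url, PHP_URL_QUERY);
echo $query . "\n";

// parse_str() splits the query string into field => value pairs
// and decodes each value along the way.
parse_str($query, $fields);

foreach ($fields as $field => $value) {
    echo "$field = $value\n";
}
```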

The value assigned to the field 'q' is site%3Afroogle.com. First
thing you're probably thinking is what in the world is that
%3A, I didn't type that. Well to keep things very simple,
there are certain characters that can't be passed through urls
as they are, things like colons, quotes, semicolons etc, because
these are reserved and mean certain things to a web server when
it sees them. So we need to use some other form of formatting.
In this case we're converting the ':' in site:froogle.com to an
encoded value, %3A, where 3A is the hex code for a colon (more
on that later). So what we're asking for by the site: command in
MSN is how many pages from site X are in your search engine. So
specifically how many pages from froogle are indexed in MSN.
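
You don't have to work out those % codes by hand. As a quick sketch, php's urlencode() function does the conversion for you, and urldecode() turns it back. The $search and $encoded names here are just for illustration.

```php
<?php
$search = 'site:froogle.com';

// urlencode() swaps reserved characters for %XX codes, so ':' becomes %3A.
$encoded = urlencode($search);

// Glue the encoded value onto the 'q' field of the MSN url.
$url = 'http://search.msn.com/results.aspx?q=' . $encoded;

echo $url . "\n";
echo urldecode($encoded) . "\n";  // back to the text we started with
```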

Click here to see the page we're scraping

Line 3.
$regex = '/Page 1 of (.+?) results/';
First things first: when we're scraping a page we're scraping the
source code of the page, so that's always what we're going to
want to be looking at when we're picking out what we want to
grab. Hopefully you know this already, because if you don't
you're probably lost. Go to view source in your browser then
search for what you're looking to pull out. Here's a chunk of
the source code we're going to pull our value out of.

<div id="search_header"><h1>site:froogle.com</h1><h5>Page 1 of 9,138 results</h5>&nbsp;<b>&#01

Now that we have the data we want to get the result from,
we can get into the meat of the parsing. I know to most of you
regex is a big scary thing with all those crazy symbols and
patterns. And well if you want to be a regex master yes, it's
pretty daunting. But don't let all those funny chars scare you
'cause there's a real simple way to use regex. The regex gurus
and preachers will mock you and say you're bastardizing it but
I say whatever works.

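I'm not going to go into much detail on this part, since we're
just assigning a string to a variable in this statement. Anytime
you see a $varname = 'something here'; or $varname = "something
here"; you know it's just a value being assigned to a variable.
Also note you can use single ' or double " quotes for this, but
inside double quotes php fills in $variables and treats things
like \n specially, so single quotes are the safer habit for a
regex.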

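(.+?) is our best friend when it comes to regex. It basically
means match everything starting after the text in the beginning
(I'll call that text an anchor too, so be prepared for me to use
the two interchangeably) and stopping at our end text/anchor.
Piece by piece: the . matches any character except a line break,
the + means one or more of them, the ? says take as few as
possible so we stop at the first closing anchor, and the
parentheses tell regex to hand that part back to us. Something
like this:

opening anchor text here (.+?) closing anchor text here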

Pretty easy huh? Yeah I thought so. The only other thing to
note in this is that there are forward slashes in the '/stuff/';
that's a regex thing. Just know that in php the pattern always
has to sit between a pair of delimiters, and forward slashes
are the usual choice.
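
To see why that little ? inside (.+?) matters, here's a rough sketch that runs the pattern with and without it. The sample string is made up for the demo (I tacked a second " results" onto it on purpose), so treat it as an illustration and not real MSN output.

```php
<?php
// Made-up sample with TWO possible closing anchors.
$data = '<h5>Page 1 of 9,138 results</h5> and later 12 results more';

// With the ?, matching stops at the first " results" it finds.
preg_match('/Page 1 of (.+?) results/', $data, $lazy);

// Without the ?, matching runs on to the last " results" in the text.
preg_match('/Page 1 of (.+) results/', $data, $greedy);

echo $lazy[1] . "\n";
echo $greedy[1] . "\n";
```

The first echo gives you just 9,138. The second one drags along everything up to the last closing anchor, which is almost never what you want when scraping.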

Of course I can talk about regex all day and type 1000 pages
on it. But for now I'm trying to keep it super simple.

Line 4.
preg_match($regex,$data,$match);
Ah a new function's in town, preg_match(). Preg_match() is
the PHP function to call regex for a single match. So anytime
we want to match one thing in our data we're going to call the
parsing function preg_match().

With preg_match() we're doing something called passing data to
the function for it to work on. In this case we're passing
$regex, $data, $match. We know what both $regex (parsing
string we just made) and $data (scraped page from MSN) are
but what is the $match variable? It's just the variable that our
parsed data is going to be returned to. In plain English we're
saying take $data and then apply the filter $regex to it. Then
whatever comes through that filter dump out into $match. Make
sense?
I sure hope you said yep, that's easy.
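
One thing the script skips is checking whether the filter actually caught anything. preg_match() itself returns 1 when it finds a match, 0 when it doesn't, and false if something went wrong with the regex. A minimal sketch of using that, with a hard-coded sample string standing in for the scraped page and $found as a name I made up:

```php
<?php
// Hard-coded stand-in for the scraped MSN page.
$data  = '<h5>Page 1 of 9,138 results</h5>';
$regex = '/Page 1 of (.+?) results/';

$found = preg_match($regex, $data, $match);

if ($found === 1) {
    echo "Matched: " . $match[1] . "\n";
} elseif ($found === 0) {
    // Nothing came through the filter, so $match is empty.
    echo "No match, maybe the page layout changed\n";
} else {
    // preg_match() returned false: the pattern itself is broken.
    echo "Regex error\n";
}
```

This is handy with scraping because the site you're pulling from can change its page at any time, and then $match[1] simply won't exist.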

Line 5.
var_dump($match);
The function var_dump() is your best friend as a programmer.
It says whatever is in this variable or array dump it out onto
the screen so I can see what's happening. So this line will
output this onto the screen.

array(2) {
[0]=>
string(23) "Page 1 of 9,138 results"
[1]=>
string(5) "9,138"
}

Array? What's that? Well this is as good a time as any to
introduce what an array is. They're extremely useful tools for
you to know. So let's back up a little. We know that a variable
is something that holds 1 thing, right? Well an array is just
like a variable except it holds multiple things. I like to think
of it like this. Stop and imagine a train for a second. It has
all these cars on it that hold things, right? Well a variable is
a single car and can only hold a single thing, whereas an array
is like a train that has multiple cars to hold things. In the
output above we have a two (2) cell array, which is just like a
2 car train. In car 0 we have the string 'Page 1 of 9,138
results' and in car 1 we have the string '9,138', which is the
result we want, right? You might be asking why does preg_match
return an array rather than just a simple string. It does this
to give you two options on how to match things. You'll notice
car/cell 0 has the anchors included as well as the matched text,
whereas car 1 only has the text inside the anchors, meaning
whatever the (.+?) in parentheses caught.
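
Here's the train idea as a quick sketch you can run. I'm building the array by hand with the same two strings from the output above, and the names $car and $train are just mine to match the analogy. The short [ ] array syntax assumes php 5.4 or newer.

```php
<?php
// A variable is one car; an array is the whole train.
$car   = '9,138';
$train = ['Page 1 of 9,138 results', '9,138'];

echo count($train) . "\n";  // how many cars are on the train
echo $train[0] . "\n";      // what's in car 0
echo $train[1] . "\n";      // what's in car 1

// Walk the train one car at a time.
foreach ($train as $number => $contents) {
    echo "car $number holds: $contents\n";
}
```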

Line 6.
echo $match[1];
What's with the new notation? If you hadn't already guessed
that's how we access the cars in our train. We know if we have
an array and what we want is in car 1 we access that by
'referencing' that car which is what the [1] means. We want to
output only what's in the second cell because we don't want the
anchors included. This will output to our screen:

9,138

Which is exactly what we aimed to do.
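
Keep in mind that what you get back is the string '9,138', comma and all, not a number. If you want to do math with it or store it, here's one way you might clean it up. This is a sketch that assumes the count always uses commas as thousands separators, and $pages is a name I picked.

```php
<?php
// Same array preg_match() gave us in the script above.
$match = ['Page 1 of 9,138 results', '9,138'];

// Strip the thousands separator, then cast the rest to an integer.
$pages = (int) str_replace(',', '', $match[1]);

var_dump($pages);
```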

Click here to see what your parsed result should look like!

Download the file here

Other things to try -


Some fun stuff to try using our new skills.
1. Use the link: command in MSN and see if you can get the
number of links for a domain. Don't forget that : = %3A

2. See if you can get the title of any web page. Hint:
anchors are going to be <title> and </title>.
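
If you get stuck on number 2, here's one possible sketch. The catch is that </title> has a forward slash in it, which clashes with the forward slashes around the regex, so this version wraps the pattern in # instead (php lets you pick other delimiter characters). The sample page source is made up so the example runs without fetching anything.

```php
<?php
// Made-up page source standing in for a fetched page.
$data = '<html><head><title>My Example Page</title></head><body></body></html>';

// # is the delimiter here so the / in </title> needs no escaping.
$regex = '#<title>(.+?)</title>#';

if (preg_match($regex, $data, $match)) {
    echo $match[1] . "\n";
}
```

To try it on a live page, swap the sample string for a file_get_contents() call like the one in the main script.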

Conclusion -
You can make some pretty cool tools with just the two very
basic things I've shared with you so far: pulling data from
somewhere using the file_get_contents() function and parsing
that data with the preg_match() function. Have fun with it and
I'll see you on the next data scraping tutorial.

Next: Parsing Multiple Items from A Data Source


Copyright Me Bitches! - Web 1.0 Style - Represent
