
Luminati Provides

Web-Transparency
Web Scraping
Proxy Management
Workshop

#scraping_in_Delhi
Get ready for the workshop

Download Luminati Proxy Manager
https://luminati.io/lpm

Download Firefox and cURL
Use cURL on Windows Git Bash or Mac/Linux Terminal

$20 Bonus
Register with this link: http://bit.ly/workshop_newdelhi

Download the presentation
http://bit.ly/lum_workshop_pdf
Username and password

Email: workshop@luminati.io
Password: HandsOnWorkshop951
The Agenda for Today

01 Introduction
Tamir Roter, VP Sales

02 Getting Started with Scraping


Aviv Besinsky, Product Manager

03 Robot Detection
Saarya Berlinger, Solutions Engineer

04 SERP
Itamar Abramovich, Product Manager

05 Advanced Scraping Techniques


Saarya Berlinger, Solutions Engineer

06 One-on-One Sessions
Luminati Networks Join our community

Join us @ http://luminati.io/community

Community discussions
Talk to the community and the experts who built Luminati.

Community feature requests


Let us know how we can make this product even better.

Luminati tips & updates


Keep up-to-date with our latest releases, examples, tips, changes and fixes.

Follow us on Twitter @ luminati_io


Our Company

Tamir Roter
VP Sales
tamirr@luminati.io
Challenge: websites know who is watching

ONE WAY

€1,665

ROUNDTRIP

$1,324

They respond with different content, prices, and ads to different viewers


We developed a P2P network of 35M+ consumers willing to help

Consumers opt in to the network in return for free use of a partner's application
How do we get users' active consent?
How does it work?

We use a peer’s IP address only when a device meets 3 conditions:

● Sits idle: We route traffic through the device only when it’s absolutely not in use
● Connected to internet: We prefer to use WiFi and may use very limited cellular data
● Connected to power: Device is plugged in or has enough battery power
Businesses can see the web, as these 35 million consumers would see it
Luminati Proxy Networks Available

Rotating Residential Network
● 35,000,000+ IPs
● 195 countries
● 99.99% uptime
● Country, City, ASN and Mobile Carrier Targeting

Mobile Network
● 10,000,000+ IPs
● 195 countries
● 3g/4g connections
● Country, City, ASN and Mobile Carrier Targeting

Datacenter Network
● 750,000+ IPs
● 95 countries
● Available with country and city targeting
● Multiple IP types in a shared or dedicated pool

Static Residential (ISP) IPs
● 85,000+ IPs
● 35+ countries
● Non-rotating
● Directly from an ISP
Crawling Network Architecture

crawler
Unblocker - Automated Unblocking Software

● Network and IP Management: Route requests through the correct network and IP automatically
● Header Management: Automatically set User-Agent and other headers based on target site requirements
● Automatic Protocol Upgrades: Seamlessly upgrades HTTP protocol and rotates TLS/SSL fingerprint
● Cookie Management: Automatic IP priming and cookie management
● Detection and Matching: Intelligent detection of blocked requests based on response codes, response content, request timing, and more
● Automatic Retry: Automatically retries failed requests
Questions?

Thank You
Tamir Roter
VP Sales
tamirr@luminati.io
Getting Started
with Scraping
Aviv Besinsky
Product manager
avivb@luminati.io
Getting Started with Scraping

Luminati Proxy Manager

Open-source software for seamlessly managing multiple proxies via an API and
admin UI

● One entry point for managing proxies


● Built in features
● Auto retry rules
● Real time statistics
Practice 1: LPM basics - Selecting geolocation

● Open a new port in LPM
● Set it with targeting to DE
● Configure your browser to use the LPM port and go to
http://lumtest.com/myip.json
OR make a direct request to
http://lumtest.com/myip.json through
cURL or any other client you are using
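
For the direct-request option, a minimal Python sketch might look like the following. It assumes LPM runs on the same machine and that the port you opened is 24000; the port number and the helper names are illustrative assumptions, so substitute whatever port your LPM shows.

```python
import json
import urllib.request

# Assumption: LPM runs locally and the port opened in the admin UI is 24000.
LPM_PROXY = "http://127.0.0.1:24000"


def make_proxy_opener(proxy_url: str) -> urllib.request.OpenerDirector:
    """Return an opener that sends HTTP and HTTPS traffic through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch_myip(opener: urllib.request.OpenerDirector) -> dict:
    """Fetch http://lumtest.com/myip.json and decode the JSON body.

    This performs a real network request through the proxy.
    """
    with opener.open("http://lumtest.com/myip.json", timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage (needs a running LPM port):
#   info = fetch_myip(make_proxy_opener(LPM_PROXY))
#   print(info)
```

With DE targeting set on the port, the geolocation reported in the response should be German.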
Practice 2: LPM basics - Rotating IPs

● Go to port settings → IP control, and set Max requests to 1
● Run a few requests or refresh your browser a few times to see how the IP rotates
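
To check rotation from a script rather than by refreshing the browser, something like the sketch below could be used. It assumes the LPM port is 24000 and that the JSON returned by lumtest.com contains an `ip` field; both are assumptions, and the function names are made up for this example.

```python
import json
import urllib.request
from typing import Callable, List

LPM_PROXY = "http://127.0.0.1:24000"  # assumption: local LPM port


def fetch_ip_via_lpm() -> str:
    """Ask lumtest.com which exit IP it sees (real network request)."""
    handler = urllib.request.ProxyHandler({"http": LPM_PROXY})
    opener = urllib.request.build_opener(handler)
    with opener.open("http://lumtest.com/myip.json", timeout=30) as resp:
        # Assumption: the response JSON has an "ip" key.
        return json.loads(resp.read().decode("utf-8"))["ip"]


def collect_ips(fetch_ip: Callable[[], str], attempts: int) -> List[str]:
    """Call fetch_ip the given number of times and return the IPs in order."""
    return [fetch_ip() for _ in range(attempts)]


def rotated(ips: List[str]) -> bool:
    """True if more than one distinct exit IP was observed."""
    return len(set(ips)) > 1


# Usage (needs a running LPM port with Max requests set to 1):
#   ips = collect_ips(fetch_ip_via_lpm, 5)
#   print(ips, rotated(ips))
```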
Preparation – Install certificate

Go to https://luminati.io/faq#proxy-certificate
and follow the steps to install the certificate

OR go to https://luminati.io/faq and search for ‘certificate’
Practice 3: LPM basics - LPM rules

● Set an LPM rule to retry if the status code is 502
● To reproduce a 502 status code, send your request to http://httpstat.us/502
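
In the practice the retry is configured inside LPM. The sketch below is only a client-side analogue of the same idea, to show what "retry on 502" means in code; the function names and the retry limit are assumptions, not LPM settings.

```python
import urllib.error
import urllib.request
from typing import Callable, Tuple


def fetch_status(url: str) -> int:
    """Return the HTTP status code for url (real network request)."""
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is still available.
        return err.code


def request_with_retry(
    send: Callable[[], int], retry_on: int = 502, max_retries: int = 3
) -> Tuple[int, int]:
    """Call send() until the status differs from retry_on or retries run out.

    Returns (final_status, number_of_attempts).
    """
    attempts = 0
    while True:
        status = send()
        attempts += 1
        if status != retry_on or attempts > max_retries:
            return status, attempts


# Usage (real request; httpstat.us/502 always answers 502):
#   print(request_with_retry(lambda: fetch_status("http://httpstat.us/502")))
```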
Practice 4: LPM basics - assigning headers and user-agent

● Go to the headers tab and choose a user agent preset
● Go to http://lumtest.com/echo.json and see that the user agent is the one you set
● Rotate between a few user agents in LPM and see how they change in the browser
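
The same rotation can be done on the client side instead of in LPM. The sketch below cycles through a small list of user-agent strings; the strings are illustrative examples and are not LPM's presets.

```python
import itertools
import urllib.request
from typing import Iterator, List

# Illustrative user-agent strings (assumptions, not LPM's preset list).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/13.0 Safari/605.1.15",
]


def rotating_requests(url: str, user_agents: List[str]) -> Iterator[urllib.request.Request]:
    """Yield an endless stream of Request objects, cycling through user_agents."""
    for ua in itertools.cycle(user_agents):
        yield urllib.request.Request(url, headers={"User-Agent": ua})


# Usage: take the next prepared request and send it with urllib.request.urlopen
#   reqs = rotating_requests("http://lumtest.com/echo.json", USER_AGENTS)
#   first = next(reqs)
```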
Robot
Detection
Saarya Berlinger
FAE/Solutions Engineer
saarya@luminati.io
Agenda

01 Data Collection

02 Bot Blocking

03 Fingerprints

04 How to Unblock
Data Collection

● Data points

● Parsing

● Geo sensitivity

● Scale
Getting Blocked?
Getting Blocked

Basic blocking
● Bad status code
● Captcha
● Socket hangup
● Cloaking

Status codes
● 403 Forbidden
● 404 Not found
● 500 Server Error
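
The basic blocking signs listed on this slide can be turned into a rough check. The sketch below is a simplification under stated assumptions: the captcha marker is a guess at what a challenge page might contain, and cloaking (a normal-looking page with altered content) cannot be caught by a check this simple.

```python
from typing import Optional

# Status codes named on the slide as typical block responses.
BLOCK_STATUS_CODES = {403, 404, 500}
# Assumption: a captcha page mentions the word "captcha" somewhere in its body.
CAPTCHA_MARKERS = ("captcha",)


def looks_blocked(status: Optional[int], body: str = "") -> bool:
    """Heuristic: does this response look like a basic block?

    status is None when the connection dropped before any response
    arrived (a socket hang-up).
    """
    if status is None:
        return True
    if status in BLOCK_STATUS_CODES:
        return True
    lowered = body.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)
```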
Fingerprints – how you are getting blocked
Fingerprints

● Individual encounter - consistency

● Recurring encounters - uniqueness
Fingerprints

● TCP/IP

● TLS

● HTTP version

● HTTP headers

● Browser features

● Usage
Fingerprints

Desktop <> Mobile Android <> iOS


Fingerprints

https://amiunique.org/fp
Fingerprints

AudioContext properties:
How to Unblock

● TCP/IP

● TLS

● HTTP version

● HTTP headers

● Browser features

● Usage
How to Unblock

How can an IP Proxy Network help?

● Send requests from a different source
● Change origin geolocation
● Split many requests across multiple sources
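
Splitting requests across sources can be as simple as a round-robin assignment. The sketch below only plans which source handles which URL; the proxy addresses in the usage note are placeholders and the function name is hypothetical.

```python
from itertools import cycle
from typing import Dict, List


def split_across_sources(urls: List[str], proxies: List[str]) -> Dict[str, List[str]]:
    """Assign each URL to a proxy in round-robin order.

    Returns a mapping of proxy -> list of URLs to send through it.
    """
    if not proxies:
        raise ValueError("at least one proxy is required")
    plan: Dict[str, List[str]] = {proxy: [] for proxy in proxies}
    # zip stops when urls is exhausted, so cycling the proxies is safe.
    for url, proxy in zip(urls, cycle(proxies)):
        plan[proxy].append(url)
    return plan
```

For example, three URLs and two placeholder proxies `pA` and `pB` would give `pA` the first and third URL and `pB` the second.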
How to Unblock

Simple HTTP request:
● Browser profile (age, usage stats)
● Header values
● Header values per page
● Header values per session
● Header values per geo
● Header order
● Header case

TLS, HTTP
● HTTP version
● HTTP2 settings
● HTTP2 interaction
● HTTP2 features
● TLS version
● TLS extensions
● TLS ciphers
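
Header order and header case are easy to lose, because many HTTP clients normalize header names or store them in unordered maps. The sketch below illustrates the point by serializing an HTTP/1.1 request from an ordered list of pairs, so the bytes on the wire keep exactly the order and case given. It is an illustration only, not a full HTTP client, and the function name is made up.

```python
from typing import List, Tuple


def serialize_request(method: str, path: str, headers: List[Tuple[str, str]]) -> bytes:
    """Build the raw bytes of an HTTP/1.1 request with no body.

    Headers are written in the order given, with names exactly as cased
    by the caller.
    """
    lines = [f"{method} {path} HTTP/1.1"]
    lines += [f"{name}: {value}" for name, value in headers]
    # A blank line terminates the header section.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")
```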
How to Unblock

Browser
● webRTC
● Timezone
● JS
● CSS
● window size
● mouse movement

Account login
● User data
● Usage pattern
Unblocker Challenge

Make an HTTP request to https://botcheck.io and get every fingerprinting test to pass
Download today’s presentations

Learn more with our FAQ @ https://luminati.io/faq
Robot Detection

Thank you!

Saarya Berlinger
FAE/Solutions Engineer
saarya@luminati.io
Luminati
SERP
Itamar Abramovich
Product manager
itamar@luminati.io
Agenda

01 What is SERP?

02 Practice
SERP? Search Engine Result Page

Send Google search requests through Luminati’s Residential Network.
Practice 5: SERP - Google search

curl --proxy zproxy.lum-superproxy.io:22225 \
  --proxy-user lum-customer-workshop-zone-google:w05wd62fhjw6 \
  'http://www.google.com/search?q=taxi' > results_page.html
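
For readers who prefer a script to cURL, the proxy part of the command could be expressed in Python roughly as below. The host and port are the ones from the cURL command; the helper names are assumptions, and credentials are percent-encoded in case they contain reserved characters.

```python
import urllib.request
from urllib.parse import quote


def proxy_url(
    username: str,
    password: str,
    host: str = "zproxy.lum-superproxy.io",
    port: int = 22225,
) -> str:
    """Build an http://user:pass@host:port proxy URL with encoded credentials."""
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"http://{user}:{secret}@{host}:{port}"


def superproxy_opener(username: str, password: str) -> urllib.request.OpenerDirector:
    """Return an opener that routes HTTP requests through the super proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url(username, password)})
    return urllib.request.build_opener(handler)


# Usage (real request, using the zone credentials from the slide):
#   opener = superproxy_opener("lum-customer-workshop-zone-google", "<password>")
#   html = opener.open("http://www.google.com/search?q=taxi", timeout=60).read()
```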
Practice 6: SERP - Google search + Specific peer + JSON

curl --proxy zproxy.lum-superproxy.io:22225 \
  --proxy-user lum-customer-workshop-zone-google:w05wd62fhjw6 \
  'http://www.google.co.in/search?q=taxi&gl=in&hl=hi&lum_json=1' > results_page.json
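
The query string in this command carries the search term (`q`), the country (`gl`), the interface language (`hl`) and the `lum_json=1` switch for JSON output. A small helper for building such URLs might look like the sketch below; the function name and defaults are assumptions chosen to match the command on the slide.

```python
from urllib.parse import urlencode


def google_search_url(
    query: str,
    domain: str = "www.google.co.in",
    country: str = "in",
    language: str = "hi",
    as_json: bool = True,
) -> str:
    """Build a Google search URL with gl/hl targeting and optional lum_json=1."""
    params = {"q": query, "gl": country, "hl": language}
    if as_json:
        params["lum_json"] = "1"
    # urlencode escapes spaces and other reserved characters in the query.
    return f"http://{domain}/search?{urlencode(params)}"
```

The resulting URL can then be sent through the proxy exactly as in the cURL command, and the saved response parsed with `json.load`.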
Practice 7: SERP - Google shopping

curl --proxy zproxy.lum-superproxy.io:22225 \
  --proxy-user lum-customer-workshop-zone-google:w05wd62fhjw6 \
  'http://www.google.at/shopping/product/232536990647203309' > product_page.html
Practice 8: SERP - Google maps

curl -v --compressed --proxy zproxy.lum-superproxy.io:22225 \
  --proxy-user lum-customer-workshop-zone-google:w05wd62fhjw6 \
  'http://www.google.com/maps/search/restaurants,+new+york/@40.7436582,-74.0216375,13z' \
  -o map_results_page.html
SERP - Main features

Thank you!

Itamar Abramovich
Product manager
itamar@luminati.io
Advanced Scraping Techniques

HTTP request VS Browser


Waterfall Routing

Target Website
Practice 8: Advanced Practice - Waterfall

HTTP requests: waterfall

● Set up 2 LPM ports: one with datacenter (DC) IPs and one with residential IPs
● Set a rule on the DC port to retry through the residential port on a 502 status code
● Test on https://httpstat.us/502
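
In the practice the waterfall is configured as an LPM rule. The sketch below shows the same routing idea on the client side: try the cheaper tier first and move to the next tier only when the fallback status comes back. The tier names and fetch callables are hypothetical stand-ins for requests sent through each LPM port.

```python
from typing import Callable, List, Tuple

Fetch = Callable[[str], int]  # takes a URL, returns an HTTP status code


def waterfall(
    tiers: List[Tuple[str, Fetch]], url: str, fallback_status: int = 502
) -> Tuple[str, int]:
    """Try each tier in order until one returns a status other than fallback_status.

    Returns (tier_name, status) for the tier that produced the final answer.
    If every tier returns fallback_status, the last tier's result is returned.
    """
    if not tiers:
        raise ValueError("at least one tier is required")
    for name, fetch in tiers:
        status = fetch(url)
        if status != fallback_status:
            break
    return name, status
```

Ordering the tiers as datacenter first and residential second mirrors the two-port setup in the practice.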
Practice 9: Advanced Practice - Browser

A) Browser session management API


● Set up LPM port to use long session

● Use LPM API to change IP on demand

● Test with http://lumtest.com/myip.json

B) Browser bandwidth (BW) optimization
● Use an LPM rule to return a null response for images

● Test in browser on any website with images
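
A rule that returns a null response for images needs some way to recognize image requests. One common approach is a URL pattern; the sketch below shows what such a pattern could look like. The extension list and the regular expression are assumptions for illustration and are not taken from LPM.

```python
import re

# Assumption: image requests can be recognized by a file extension at the end
# of the URL path, optionally followed by a query string or fragment.
IMAGE_URL = re.compile(r"\.(?:png|jpe?g|gif|webp|svg|ico)(?:[?#].*)?$", re.IGNORECASE)


def is_image_url(url: str) -> bool:
    """True if the URL looks like a request for an image file."""
    return IMAGE_URL.search(url) is not None
```

Images served from URLs without a file extension would slip past a pattern like this, so results on real sites may vary.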


One on One practice sessions

/luminati_io /luminati-networks /luminati.io #scraping-in-delhi


Practice 10: One on One Practice

● If you have your own scraping project, try using some of the techniques we discussed with it
● If not, choose some data that interests you and start collecting!
● Sample practice project: can you find the most popular birds in Germany by season?


