Bypass Captcha using Python and Tesseract OCR engine

A CAPTCHA is a type of challenge-response test used in computing
as an attempt to ensure that the response is generated by a
person. The process usually involves one computer (a server)
asking a user to complete a simple test which the computer is
able to generate and grade. The term "CAPTCHA" was coined in 2000
by Luis von Ahn, Manuel Blum, Nicholas J. Hopper, and John
Langford (all of Carnegie Mellon University). It is an acronym
based on the word "capture" and standing for "Completely
Automated Public Turing test to tell Computers and Humans
Apart".

In this post I am going to show you guys how to crack weak
captchas using Python and the Tesseract OCR engine. A few days
back I was playing around with a web application. The application
was using a captcha as an anti-automation technique when taking
user feedback.
First, let me give you guys a brief idea of how the captcha
worked in that web application.
Inspecting the captcha image, I found that the form loads it in
this way:
<img src="http://www.site.com/captcha.php">
From this you can easily understand that the "captcha.php" file
returns an image file.
If we access the URL http://www.site.com/captcha.php directly, it
generates an image with a new random number each and every time.
To make this clearer, let me give you an example.
Suppose that after opening the feedback form you get a few text
fields and a captcha, and that the captcha loads with a number,
for example "4567".
If you enter the code "4567", the form will be submitted
successfully.

Now the most interesting thing: if you copy the captcha image
URL (http://www.site.com/captcha.php in this case) and open it in
a new tab of the same browser, the captcha will load with a
different number, as I told you earlier. Suppose you get "9090"
this time. Now if you try to submit the feedback form with the
number that was loaded earlier with the form (which was "4567"),
the application will not accept it. If you enter "9090", the
application will accept the form.
For a clearer idea, I have created this simple figure.
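
The behaviour described above can also be modelled in a few lines of Python. This is only a toy sketch of what the server seems to be doing, not its real code: the class name, the session ids and the idea that the code is stored per session are my assumptions. It shows why the last image requested is the one whose number gets accepted.

```python
import random


class FeedbackServer:
    """Toy model of the server-side captcha logic (all names hypothetical)."""

    def __init__(self):
        # session id -> captcha code most recently generated for that session
        self.stored_codes = {}

    def captcha_php(self, session_id):
        """Model a request to captcha.php: make a new code and remember it."""
        code = "%04d" % random.randint(0, 9999)
        self.stored_codes[session_id] = code  # overwrites any earlier code
        return code  # the real server would return this drawn as an image

    def submit_feedback(self, session_id, captcha_code):
        """Accept the form only if the code matches the stored one."""
        return self.stored_codes.get(session_id) == captcha_code


server = FeedbackServer()
first = server.captcha_php("browser-1")   # captcha loaded with the form
second = server.captcha_php("browser-1")  # image URL opened in a new tab
print(server.submit_feedback("browser-1", second))  # latest code is accepted
```

Unless the two random codes happen to be equal, submitting `first` after `second` has been generated is rejected, which is the "4567" versus "9090" situation.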

Now my strategy to bypass this anti-automation technique was:

1) Download only the image from
http://www.site.com/captcha.php
2) Feed that image to the OCR engine
3) Craft an HTTP POST request with all required parameters and
the decoded captcha code, and POST it.
So what is happening here?
When you request the image file, the server performs steps
1 to 5 as shown in the figure.
When we then post the HTTP request, the server matches the
received captcha code against the value that was temporarily
stored. As long as the OCR engine read the image correctly, the
code will match and the server will accept the form.
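
The three steps above can be sketched roughly like this in Python 3, using only the standard library. Treat it as an outline under assumptions: the form URL "feedback.php", the field name "captcha" and the idea that the stored code is tied to a cookie session are all my guesses, and `solve_captcha` stands for whatever OCR function you plug in.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

CAPTCHA_URL = "http://www.site.com/captcha.php"
FORM_URL = "http://www.site.com/feedback.php"  # hypothetical form target


def make_session():
    """Opener that keeps cookies, so both requests share one session."""
    return urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar()))


def build_post(form_url, fields, captcha_code, captcha_field="captcha"):
    """Build the POST request carrying the form fields and the decoded code."""
    data = dict(fields)
    data[captcha_field] = captcha_code  # field name is an assumption
    body = urllib.parse.urlencode(data).encode("ascii")
    return urllib.request.Request(form_url, data=body, method="POST")


def submit_feedback(solve_captcha, fields):
    """Steps 1 to 3; solve_captcha maps image bytes to the decoded text."""
    session = make_session()
    image_bytes = session.open(CAPTCHA_URL).read()               # step 1
    code = solve_captcha(image_bytes).strip()                    # step 2
    response = session.open(build_post(FORM_URL, fields, code))  # step 3
    return response.status
```

The point of the shared opener is that the image request and the POST go out with the same cookies, so the server compares the posted code against the value it stored for that very image.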

I used this Python script to automate the entire process.

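# Python 2 script; requires PIL and the pytesser module.
from PIL import Image, ImageEnhance
from pytesser import *
from urllib import urlretrieve

def get(link):
    # Download the captcha image to a local file.
    urlretrieve(link, 'temp.png')

get('http://www.site.com/captcha.php')

# Enlarge the image five times so the digits are easier to recognize.
im = Image.open("temp.png")
nx, ny = im.size
im2 = im.resize((int(nx*5), int(ny*5)), Image.BICUBIC)
im2.save("temp2.png")

# Preview the original image with 30% more contrast (not used for OCR).
enh = ImageEnhance.Contrast(im)
enh.enhance(1.3).show("30% more contrast")

# Remove background noise: every pixel that is not pure black becomes white.
imgx = Image.open('temp2.png')
imgx = imgx.convert("RGBA")
pix = imgx.load()
for y in xrange(imgx.size[1]):
    for x in xrange(imgx.size[0]):
        if pix[x, y] != (0, 0, 0, 255):
            pix[x, y] = (255, 255, 255, 255)
imgx.save("bw.gif", "GIF")

# Resize to a fixed size and save as TIFF for the OCR step.
original = Image.open('bw.gif')
bg = original.resize((116, 56), Image.NEAREST)
ext = ".tif"
bg.save("input-NEAREST" + ext)

# Run OCR and print the recognized text.
image = Image.open('input-NEAREST.tif')
print image_to_string(image)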

Here I am only posting the code for the OCR part. If you are a
Python lover like me, you can use the "httplib" Python module to
do the rest. This script is not independent: the pytesser Python
module is required to run it. PyTesser is an Optical Character
Recognition module for Python. It takes as input an image or
image file and outputs a string.
PyTesser uses the Tesseract OCR engine, converting images to an
accepted format and calling the Tesseract executable as an
external script.
You can get this package at http://code.google.com/p/pytesser/
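
To give an idea of what "calling the Tesseract executable" means, here is a minimal Python 3 sketch of the same approach. It is not PyTesser's actual source; the function names are mine, and it assumes the `tesseract` program is installed and on your PATH.

```python
import os
import subprocess
import tempfile


def build_tesseract_command(image_path, output_base):
    """Command line of the form: tesseract <image> <output base>."""
    return ["tesseract", image_path, output_base]


def ocr_with_tesseract(image_path):
    """Run the tesseract executable and return the text it recognized."""
    with tempfile.TemporaryDirectory() as workdir:
        output_base = os.path.join(workdir, "result")
        subprocess.run(build_tesseract_command(image_path, output_base),
                       check=True)
        # Tesseract appends ".txt" to the output base name.
        with open(output_base + ".txt") as handle:
            return handle.read().strip()
```

Calling `ocr_with_tesseract("input-NEAREST.tif")` would then play the role of `image_to_string` in the script above.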

The script works in this way.

1) First, the script downloads the captcha image using the
"urlretrieve" function from Python's urllib module.
2) It then makes the image bigger so the digits are easier to
recognize, and tries to clean the background noise.
3) At last, it feeds the processed image to the OCR engine.
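
The noise-cleaning rule is simple enough to show on its own. Below is a small pure-Python sketch that works on a grid of RGBA tuples instead of a real image. The function name and the `tolerance` parameter are my additions: with `tolerance=0` it follows the same rule as the script (only pure black survives), while a larger value also keeps nearly black pixels, which may help with the soft edges left by a bicubic resize.

```python
WHITE = (255, 255, 255, 255)


def keep_dark_pixels(pixels, tolerance=0):
    """Turn every pixel that is not (nearly) black into white.

    pixels is a list of rows, each row a list of (R, G, B, A) tuples.
    tolerance=0 keeps only pure black, the same rule as the script.
    """
    cleaned = []
    for row in pixels:
        new_row = []
        for r, g, b, a in row:
            is_dark = max(r, g, b) <= tolerance and a == 255
            new_row.append((r, g, b, a) if is_dark else WHITE)
        cleaned.append(new_row)
    return cleaned


noisy = [[(0, 0, 0, 255), (90, 90, 90, 255), (20, 20, 20, 255)]]
print(keep_dark_pixels(noisy))
print(keep_dark_pixels(noisy, tolerance=40))
```

In the first call only the pure black pixel is kept; in the second, the dark grey (20, 20, 20) pixel is kept as well.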
Here is another Python script which is very useful while testing
captchas. You can add these lines to your script if the target
captcha image is too small. It can help you change the resolution
of any image.

from PIL import Image, ImageEnhance

# Enlarge the image five times and save the result.
im = Image.open("test.png")
nx, ny = im.size
im2 = im.resize((int(nx*5), int(ny*5)), Image.BICUBIC)
im2.save("final_pic.png")

# Preview the original image with 30% more contrast.
enh = ImageEnhance.Contrast(im)
enh.enhance(1.3).show("30% more contrast")
