Profile

Hi, I'm Veerapat Sriarunrungrueang, an expert in the technology field, especially full-stack web development and performance testing. This is my coding diary: I collect code snippets and tricks here and update it when I have time. These days I also consult for several well-known firms in Thailand.

Monday, November 5, 2012

Crawling a page that returns HTTP 403 Forbidden (a possible fix)

Sometimes, when you try to crawl a specific page, the web server blocks you from fetching its content with a simple check: it inspects the User-Agent header sent by the client. Well-known web browsers such as IE, Firefox, Chrome, and Opera are generally trusted, so servers allow them to access the content. A custom program, however, may be refused simply because it doesn't send a recognizable browser signature. When a server blocks requests this way to filter out bots that send no (or an unknown) User-Agent header, the easiest workaround is to add a User-Agent header to your program, using the signature of a well-known browser.

The following code is an example of setting the User-Agent header in C#:

using System.Net;

// Create the request and pretend to be Firefox so the server accepts it
HttpWebRequest request = (HttpWebRequest)WebRequest.CreateDefault(uri);
request.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.6) Gecko/20060728 Firefox/1.5";
And this is an example for Python 2 (urllib2):

import urllib2

# Build an opener that sends a browser-like User-Agent with every request
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
ob = opener.open('http://www.google.com/')
print ob.read()
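
The urllib2 module exists only in Python 2. If you are on Python 3 (an assumption beyond the snippets above, which target Python 2), the same trick should look roughly like this sketch using urllib.request; the URL is just a placeholder:

import urllib.request

# Attach a browser-like User-Agent to a single request (Python 3 sketch)
req = urllib.request.Request('http://www.google.com/',
                             headers={'User-Agent': 'Mozilla/5.0'})
with urllib.request.urlopen(req) as resp:
    print(resp.read())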
