Sometimes, when you try to crawl a specific page, the web server may block your crawler using a simple method: inspecting the User-Agent header of each incoming request. Requests from well-known browsers such as IE, Firefox, Chrome, and Opera carry familiar User-Agent strings, so servers allow them to access the content. A custom program, however, may be rejected simply because it lacks such a signature. To work around this, you can set the User-Agent header in your program to the signature of a well-known browser. When a server blocks clients to keep out simple bots that send no User-Agent header, this is the easiest solution.
The following code is an example of setting the User-Agent header in C#:
HttpWebRequest request = (HttpWebRequest)WebRequest.CreateDefault(uri);
request.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.6) Gecko/20060728 Firefox/1.5";
And this is example code in Python (using the Python 2 urllib2 module):
import urllib2

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
ob = opener.open('http://www.google.com/')
print ob.read()
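Note that urllib2 exists only in Python 2. On Python 3, the same technique uses urllib.request; here is a minimal sketch (the URL is just a placeholder, and the User-Agent string is a generic example rather than a full browser signature):

import urllib.request

# Build a request that carries a browser-like User-Agent header.
req = urllib.request.Request(
    'http://www.google.com/',
    headers={'User-Agent': 'Mozilla/5.0'},
)

# urllib stores header names capitalized, so query it as 'User-agent'.
print(req.get_header('User-agent'))

# To actually fetch the page with that header attached:
# with urllib.request.urlopen(req) as resp:
#     print(resp.read())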