Sometimes, when you try to crawl a specific page, the web server may block your crawler using a simple method: inspecting the User-Agent header of each incoming request. Requests from well-known browsers such as IE, Firefox, Chrome, and Opera carry familiar User-Agent strings, so servers allow them to access the content. A custom program, however, may be rejected simply because it lacks such a signature. To work around this, you can set the User-Agent header in your program to the signature of a well-known browser. When a server blocks clients to keep out simple bots that send no User-Agent header, this is the easiest solution.
The following code is an example of setting the User-Agent header in C#:
HttpWebRequest request = (HttpWebRequest)WebRequest.CreateDefault(uri);
request.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.6) Gecko/20060728 Firefox/1.5";
And this is example code in Python (using the Python 2 urllib2 module):
import urllib2

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
ob = opener.open('http://www.google.com/')
print ob.read()
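Note that urllib2 exists only in Python 2. On Python 3, the same technique uses urllib.request; here is a minimal sketch (the URL is just a placeholder, and the User-Agent string is a generic example rather than a full browser signature):

import urllib.request

# Build a request that carries a browser-like User-Agent header.
req = urllib.request.Request(
    'http://www.google.com/',
    headers={'User-Agent': 'Mozilla/5.0'},
)

# urllib stores header names capitalized, so query it as 'User-agent'.
print(req.get_header('User-agent'))

# To actually fetch the page with that header attached:
# with urllib.request.urlopen(req) as resp:
#     print(resp.read())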