Saving Bandwidth With Yahoo Crawlers

myshtern

This is straight from Yahoo :cool:

If you run a public webserver, you have likely seen our webcrawler, named Slurp, in your logs. Its job is to find, fetch, and archive all of the page content that is fed into the Yahoo! Search engine. We continuously improve our crawler to pick up new pages and changes to your site, but the flip side is that the crawler uses up some of your bandwidth as it navigates your site. Here are a few features that Yahoo!'s crawler supports that you can use to save bandwidth while ensuring that we still get the latest content from your site:

Gzipped Files: Our crawler supports gzipped files to reduce bandwidth requirements. On average, you will get a 75% savings when you enable compression for your site. Many webservers provide mechanisms for sending out HTML content in a compressed format (for example, mod_gzip for Apache). How much of your site's total bandwidth you can save depends on how much of your content is compressed and how well it compresses; in general, static pages are good candidates for compression. Any user agent, whether it is a browser or a search engine spider, lets the webserver know it can process compressed content by adding "Accept-Encoding: gzip, x-gzip" to the headers of its HTTP request. All major browsers support gzip-compressed content. You should also be happy to know that our crawler encounters only a small percentage of decompression failures in practice, and whenever it has trouble with a compressed page, it simply re-fetches the uncompressed version.
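
To make the request side concrete, here is a minimal Python sketch of a client that advertises gzip support and decompresses the response when the server obliges; example.com, the path, and the example-crawler User-Agent are placeholders, not Slurp's actual values.

    import gzip
    import http.client

    # Request a page the way a crawler might, advertising gzip support,
    # then decompress the body only if the server actually compressed it.
    conn = http.client.HTTPConnection("example.com")
    conn.request("GET", "/", headers={
        "Accept-Encoding": "gzip, x-gzip",
        "User-Agent": "example-crawler/1.0",
    })
    resp = conn.getresponse()
    body = resp.read()

    if resp.getheader("Content-Encoding", "").lower() in ("gzip", "x-gzip"):
        body = gzip.decompress(body)  # the server honored our Accept-Encoding

    print(resp.status, len(body), "bytes after decompression")
    conn.close()

On the server side, enabling whatever compression module your webserver provides (mod_gzip in the example above) is usually all that is needed for static pages.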

Smart Caching: Our crawler acts very much like a web cache. Once we grab your content, we hold onto it and keep a history of how it changes over time. We do this for a variety of reasons. One of them is so that we can use HTTP mechanisms designed to reduce network usage when a client (that's us) repeatedly fetches a web file that has not changed. In particular, our crawler often sends the HTTP If-Modified-Since header (see section 14.25 of RFC 2616) when making repeat requests. If your webserver is set up to recognize this header, it will respond with a 304 HTTP status code instead of a 200 when the content is unchanged. The advantage is that a 304 doesn't include your page content, so it uses less bandwidth than a full 200 response. Again, we'd like to emphasize that our crawler is conservative when it comes to ensuring it has the latest content; it won't use an If-Modified-Since request if it needs to re-fetch your content for any reason.
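
As a rough sketch of what such a repeat request looks like, the Python below sends If-Modified-Since with a previously recorded Last-Modified value and treats a 304 as "keep the cached copy"; the hostname, path, and date are placeholders.

    import http.client

    # A conditional GET: we cached the page earlier and recorded the
    # Last-Modified value the server sent with it (placeholder date below).
    cached_last_modified = "Wed, 01 Mar 2006 12:00:00 GMT"

    conn = http.client.HTTPConnection("example.com")
    conn.request("GET", "/index.html", headers={
        "If-Modified-Since": cached_last_modified,
    })
    resp = conn.getresponse()

    if resp.status == 304:
        # Not Modified: no body comes back, so the cached copy stays valid
        # and the transfer costs only the headers.
        resp.read()
        print("Content unchanged; using cached copy")
    else:
        body = resp.read()
        # Remember the new Last-Modified for the next conditional request.
        cached_last_modified = resp.getheader("Last-Modified", cached_last_modified)
        print("Content changed; re-fetched", len(body), "bytes")
    conn.close()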

Most webservers will automatically handle If-Modified-Since requests for static content out of the box. Proper cache control of dynamic content (such as PHP pages and CGI scripts) can be tricky and is an advanced topic; in most cases, servers play it safe by ignoring If-Modified-Since requests for dynamic content. There are several sites on the web that let you test the cacheability of your web pages. For the purposes of our crawler, pay attention to what they say about the Last-Modified value in your response header.
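
If you do want your dynamic pages to benefit, the idea is to emit a Last-Modified header yourself and answer a matching If-Modified-Since with a 304. Here is a hedged sketch using Python's standard http.server rather than PHP or CGI, with a hard-coded timestamp standing in for whatever your script's underlying data would provide.

    from email.utils import formatdate, parsedate_to_datetime
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Placeholder content and timestamp; a real handler would derive these
    # from the data behind the dynamic page.
    CONTENT = b"<html><body>Generated page</body></html>"
    LAST_MODIFIED_TS = 1141200000  # fixed epoch seconds for the example

    class CacheAwareHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            ims = self.headers.get("If-Modified-Since")
            if ims:
                try:
                    if parsedate_to_datetime(ims).timestamp() >= LAST_MODIFIED_TS:
                        # Unchanged: send 304 with no body, saving bandwidth.
                        self.send_response(304)
                        self.end_headers()
                        return
                except (TypeError, ValueError):
                    pass  # unparseable date: fall through to a full response
            self.send_response(200)
            self.send_header("Last-Modified", formatdate(LAST_MODIFIED_TS, usegmt=True))
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(CONTENT)))
            self.end_headers()
            self.wfile.write(CONTENT)

    if __name__ == "__main__":
        HTTPServer(("", 8000), CacheAwareHandler).serve_forever()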

Crawl-Delay: There's one last trick you can use to help reduce the bandwidth requirements of your site. You can use a special robots.txt directive, crawl-delay, to reduce the speed at which our crawlers make requests to your site. This lets webmasters manage their bandwidth without blocking crawlers from any of their content, and it is being used effectively by sites like Slashdot. A safe value is a delay that still allows us to fetch every page on your site in about five days. So a five-second delay (crawl-delay: 5.0) would be fine for a site with 2,000 pages, but not for a site with 100,000 or more.
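
As a back-of-the-envelope check on that guideline, the Python sketch below works out the largest delay that still covers a site in roughly five days; the helper name is ours, and the robots.txt lines in the trailing comment are just an illustration of where the directive goes.

    # Largest crawl-delay (in seconds) that still lets every page on the
    # site be fetched within about five days, per the guideline above.
    FIVE_DAYS_IN_SECONDS = 5 * 24 * 60 * 60  # 432,000 seconds

    def max_safe_crawl_delay(page_count: int) -> float:
        return FIVE_DAYS_IN_SECONDS / page_count

    for pages in (2_000, 100_000):
        print(f"{pages:>7} pages -> crawl-delay up to {max_safe_crawl_delay(pages):.1f} s")

    # 2,000 pages leave room for up to 216 s per request, so 5.0 is easily safe;
    # 100,000 pages leave only about 4.3 s, so a 5.0 delay stretches past five days.
    # In robots.txt the directive would look something like:
    #   User-agent: Slurp
    #   Crawl-delay: 5.0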

We hope you find these tips for safely saving hosting bandwidth useful, and we'd appreciate any feedback, questions, or new ideas to help further improve how our crawler interacts with your websites.
 