Friday, November 17, 2006

Ahhh, no more GoogleBot... I hope!

To my surprise, the bandwidth for our website at www.gerona.gov.ph is almost spent. :(

Upon inspecting the log statistics with AWStats, I was surprised to find that it was none other than GoogleBot that had been eating up our bandwidth. All the while, I thought the culprit was the Gallery and all the photos in it.

So with my limited knowledge on this webserver stuff, how do I stop GoogleBot from eating our bandwidth?!

Ironically, Google itself provided the answers. I did this search and this search, from which I learned that I have to place a robots.txt file at the publicly accessible root of our website (e.g. http://gerona.gov.ph/robots.txt).

So what are the contents of the robots.txt file?! I find this link very helpful. Even more helpful is this link, which automatically generates the correct robots.txt content for blocking specific bots. This webpage is also a good source of information.
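For anyone wondering what such a file can look like, here is a minimal sketch, not necessarily the exact file I ended up with. It assumes the two crawlers are addressed by the user-agent tokens Googlebot and Slurp (Slurp being the name Inktomi's crawler goes by), and it uses Python's standard urllib.robotparser module to check what the rules would allow. The helper name is_allowed and the bot name SomeOtherBot are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: shut out Googlebot and Slurp (Inktomi)
# from the whole site, and leave every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: Slurp
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if these robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for bot in ("Googlebot", "Slurp", "SomeOtherBot"):
        print(bot, is_allowed(ROBOTS_TXT, bot, "http://gerona.gov.ph/"))
```

Keep in mind that robots.txt is only a request. Well-behaved crawlers honor it, but nothing forces a crawler to do so.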

As of this writing, I haven't tested whether it's working, but I'm keeping my fingers crossed as GoogleBot is threatening to eat all of our site's bandwidth.

The question is, why is it doing that?! Is it being used by malicious spammers, some sort of vehicle?! I hope to be enlightened. Anyone?

Update: It works! GoogleBot and Inktomi are no longer eating my bandwidth :)

3 comments:

Elijah Alcantara said...

The question is, why is it doing that?! Is it being used by malicious spammers, some sort of vehicle?! I hope to be enlightened. Anyone?

It's from Google; they are scanning your blog to index it and make it searchable. It's actually a good thing, because it lets other people find your site through a Google search.

If you like that but want to lessen the load on your bandwidth, you can set the crawl rate to a slower pace at https://google.com/webmasters/tools/. You can set up an account there and change those settings.

fishfillet said...

Thanks Elijah!

iandexter said...

You may have to check your logs to see which particular pages or areas GoogleBot is crawling.
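If the statistics package does not break this down, a rough tally can be scripted. The sketch below assumes the server writes the Apache "combined" log format, which the post does not confirm, and it groups the bytes served to one crawler by the first path segment of each request. The function name bytes_by_section and the default bot name are illustrative choices.

```python
import re
from collections import Counter

# Assumed layout (Apache "combined" format):
#   ... "GET /path HTTP/1.1" 200 5120 "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" '
    r'\d{3} (?P<bytes>\d+|-) "[^"]*" "(?P<agent>[^"]*)"'
)


def bytes_by_section(lines, bot="Googlebot"):
    """Sum the bytes served to `bot`, keyed by top-level site section."""
    totals = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None or bot.lower() not in match.group("agent").lower():
            continue  # unparseable line, or a different client
        # Drop the query string, then keep only the first path segment.
        path = match.group("path").split("?", 1)[0]
        section = "/" + path.lstrip("/").split("/", 1)[0]
        size = match.group("bytes")
        totals[section] += 0 if size == "-" else int(size)
    return totals
```

Calling it on an open log file, for example `bytes_by_section(open("access.log"))` with whatever path your host uses, and then `.most_common(5)` on the result would list the sections where the crawler spends the most bandwidth.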

It might be caught in a crawler trap. I noticed that you have a calendar on the site. You may want to exclude just that from Google's crawling, because most likely the bot is trying to follow all the links (chained by dates, etc.) in the calendar.

Excluding / from Googlebot effectively prevents Google from showing your site in searches. A quick view of [site:gerona.gov.ph] lists only two results. (Your robots.txt worked. ;))
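A narrower rule might look like the sketch below. It keeps Googlebot out of the calendar only, so the rest of the site can stay in the index. The /calendar/ path is a guess for illustration, not the site's actual URL layout, and the helper name googlebot_may_fetch is likewise made up. The check uses Python's standard urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule: block only the calendar section for Googlebot.
# "/calendar/" is an assumed path; substitute the real one.
SELECTIVE_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /calendar/
"""


def googlebot_may_fetch(robots_txt: str, url: str) -> bool:
    """Return True if the given rules let Googlebot fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("Googlebot", url)


if __name__ == "__main__":
    for url in (
        "http://gerona.gov.ph/",
        "http://gerona.gov.ph/calendar/2006/11/17",
    ):
        print(url, googlebot_may_fetch(SELECTIVE_ROBOTS_TXT, url))
```

Disallow matches by path prefix, so every date-chained link under that directory is covered by the single rule.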