Many will look at that subject and go huh? Those with websites will tend to know what I am saying. Well, one exception in my mind is a good friend of mine who has a website but doesn’t look at logs. Hi buddy! 😉
So anyway, I am looking at my web stats to see how IE is faring against FireFox in the web races. I don’t think I ever mentioned this here before but I watched the FireFox penetration go nuts. I saw IE drop from 98%+ for hits to under 70%. Now it is at 73.1% so it has been slowly getting back more of the “share” versus 17.8% FireFox. My Windows versus Linux is still around 94% for Windows to rounding error (.7%) for Linux but that is probably to be expected, my site is primarily Windows based. Interestingly, at least to a geek like me, I am showing 2 hits from CPM and 14 hits from BEOS. Is there some Operating System museum out there showing off my website or something? 
 
So what about this MSNBot thing? If you don’t know, Bot is short for Robot. No, this isn’t like Rosie from the Jetsons or the one running around bellowing “Danger Will Robinson”, nor is it R. Daneel Olivaw[1]. This is an automated program that scans web sites and indexes their content, so when you want to search for something (or is that google for something…) you enter the keywords and the search site can quickly display hits for you. Without this pre-indexing and scouring of the web, it would be impossible to find pretty much anything unless you knew where to go up front.
 
To give a concrete example, if you enter something in the search box on the right side of this screen (assuming your RSS Feed reader allows you to see it) it scans immediately over the MySQL database that holds the blog entries. There is no pre-indexing, but theoretically, it should be fast as there should not be that much data relatively speaking. Now speaking of the whole web, it is HUGE, and it isn’t contained in whole in any one place, so it isn’t even a starter to try and search it “on the fly”. So the Robots or Bots as they are more commonly known, go off and look at pages and “weigh” them (i.e. give them priorities) and index keywords so when you enter the word, say joeware, it can present to you what it has seen previously. If you have put up a new web site and you wonder why Google doesn’t see you, this is why, they just haven’t crawled and indexed your site yet…
 
So back to topic… What is up with the MSNBot? After looking at the FireFox versus IE stats I was kind of peeking over the other numbers and saw the Bot stats, which piqued my interest. The top reported bot was the MSNBot, with a hit count of 8464+663; this is the number of page hits plus the number of hits on the robots.txt file, which tells robots how they are supposed to behave on your site (say, for instance, you don’t want them indexing some of your content because it is dynamic and wouldn’t show the same way twice). This was with a total bandwidth of 226.92MB… Contrast that with the second highest, something I have never heard of called Inktomi Slurp, with 2923+1769 hits and 60.93MB… Now where does Google fit into this? Googlebot is at #4 in the ranking and has 1966+64 hits and 36.80MB of bandwidth use.
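To put those three bots side by side, here is a rough sketch of the arithmetic in Python. The figures are the ones quoted above; the helper names are made up for illustration, and I am assuming that the bandwidth total covers the robots.txt fetches too and that 1MB means 1024KB, neither of which the stats report confirms.

```python
# Figures quoted in the post: (page hits, robots.txt hits, bandwidth in MB)
bots = {
    "MSNBot": (8464, 663, 226.92),
    "Inktomi Slurp": (2923, 1769, 60.93),
    "Googlebot": (1966, 64, 36.80),
}

def kb_per_hit(page_hits, robots_hits, mb):
    """Average KB per request, counting robots.txt fetches as requests (assumption)."""
    return mb * 1024 / (page_hits + robots_hits)

def bandwidth_ratio(bot, baseline="Googlebot"):
    """How many times more bandwidth `bot` used than `baseline`."""
    return bots[bot][2] / bots[baseline][2]

for name, (pages, robots, mb) in bots.items():
    print(f"{name:14s} {kb_per_hit(pages, robots, mb):6.1f} KB/hit  "
          f"{bandwidth_ratio(name):4.1f}x Googlebot bandwidth")
```

Under those assumptions MSNBot comes out at a bit over six times Googlebot's bandwidth, and at roughly 25KB per request against roughly 19KB for Googlebot, so most of the gap looks like it comes from the sheer number of requests rather than from heavier pages.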
 
All of that leads me to ask…. WTF MSNBot? What are you doing? Your searching capability is nowhere near six times better than Google yet you are consuming 6 times more bandwidth. I have actually been comparing Google versus MSN capability (check out http://www.addysanto.com/dualsearch.htm) for some time now and MSN is equivalent to Google only on good days.
I submitted feedback to MSN Search that if they don’t get their bandwidth utilization closer to the average that I am seeing for everyone else, I will modify my robots.txt file to block their bot or, if that doesn’t work, modify the site to not respond to their bot’s requests at all. If every Bot was such a pig, I would have over 6GB of bandwidth just from robots crawling over the site, which is insane.
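For anyone wondering what "block their bot" would look like, here is a minimal sketch. The robots.txt content is hypothetical (it is not what this site actually serves), and it only works if the bot chooses to honor it. Python's standard urllib.robotparser module is used to check that the rules do what they are meant to.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out msnbot, leave every other crawler alone.
ROBOTS_TXT = """\
User-agent: msnbot
Disallow: /

User-agent: *
Disallow:
"""

def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

print("msnbot allowed:   ", allowed("msnbot", "http://blog.joeware.net/"))
print("Googlebot allowed:", allowed("Googlebot", "http://blog.joeware.net/"))
```

The empty Disallow line under the wildcard entry means "nothing is off limits", so only msnbot is turned away.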
Let’s look at some monthly numbers:
Month         MSN      #2      Google
April 2006    227MB    61MB    37MB
Mar 2006      242MB    62MB    40MB
Feb 2006      222MB    25MB    25MB
Jan 2006      208MB    30MB    30MB
Dec 2005      188MB    37MB    37MB
If someone from MSN wants to comment on why MSNBot is so inefficient I would be all ears. Again, if I don’t see any perceived benefit of that additional bandwidth consumption, would anyone else? Will other websites continue allowing this much bandwidth to be consumed by MSN? Again, what if all the rest of the search engines did it so poorly? There are Web Hosting companies that won’t let you as a web site owner use 5GB without charging you more for the bandwidth. Luckily I have a lot of bandwidth available so I will let this go on for a bit, but now that I am aware of it, I won’t let it go on much longer.
  joe
[1] From the Asimov stories for those of you who don’t do Science Fiction…
 
 


site:http://blog.joeware.net/
on msn and google, u will see msn has 900 pages and google around 300.
Hi print,
I am not sure I understand your argument…
The stats you quote just show that MSN Search has another issue… That issue being guessing at how many hits it has available for me… I won’t complain about that much since I understand it is just a quick look at an index to get an estimate… but it is certainly not a good measure of why MSN is using so much more bandwidth.
If you scroll all the way to the end of the hits you will see the following:
MSN Search (live.com) shows 912 web pages available, scroll to the last page and you see you are on web page 243.
Google search shows 347 web pages available, scroll to the last page and you see you are on web page 319.
It is time to start doing something with the robots.txt file, I think. I took a look and saw that I only get about 190 hits from the Microsoft search engine and over 8000 from Google. So blocking MSNBot won’t really impact me.
Hi Joe,
Thank you for bringing this to our attention. While we strive at MSN Search to have a fresh and comprehensive index, we of course hate to see individuals suffer from too much traffic.
You pointed out that for your site MSNBot’s average monthly bandwidth consumption is 230MB, which is about 3.5 times that of Yahoo aka Inktomi Slurp (60MB) and six times that of Google (40MB).
Currently we have indexed 1760 pages from joeware.net, which is about three times as many as Google (565), and 3.5 times as many as Yahoo (492). This, combined with our policy of always refreshing content every few weeks, might explain the discrepancy.
One way to reduce bandwidth consumption by MSNBot is to use the Crawl-Delay rule in the robots.txt file:
User-Agent: msnbot
Crawl-Delay: 600
This instructs MSNBot to wait 10 minutes between downloads, resulting in a maximum rate of 144 pages per day. Note that this robots.txt file should also be copied to all subdomains.
Thanks,
Dave
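The 144 pages per day figure in the comment above follows directly from the Crawl-Delay value. A small sketch of that arithmetic, reading the delay with Python's standard urllib.robotparser (the function name is made up for illustration, and it assumes the bot fetches exactly one page per delay interval):

```python
from urllib.robotparser import RobotFileParser

SECONDS_PER_DAY = 24 * 60 * 60

def max_pages_per_day(robots_lines, user_agent):
    """Upper bound on daily fetches implied by Crawl-Delay, or None if unset."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    delay = parser.crawl_delay(user_agent)  # None when no Crawl-Delay applies
    if not delay:
        return None
    return SECONDS_PER_DAY // delay

rules = ["User-Agent: msnbot", "Crawl-Delay: 600"]
print(max_pages_per_day(rules, "msnbot"))  # 86400 // 600 = 144
```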
Hi Dave,
Thanks for the reply. My response, though, is nearly identical to the one I gave “print”, just with a different “site:” value…
Thanks for the info on inktomi slurp though, I definitely didn’t know that one.
So stats for site joeware.net
site – initial reported max – actual max
live.com – 1698 – 247
google.com – 567 – 531
yahoo.com – 516 – 462
So the stats would be that MSFT has under half as many pages as Google, and a little over half of what Yahoo has.
I would think that MSFT would have made an analysis of this sort prior to this and that I would hear, “oh yeah, we know exactly what is wrong, it is blah… and it is because we do this or that better…”.
One of my good friends who is also a huge MSFT supporter but not a blind follower says that it is probably related to .NET. He figures the code is written in .NET and is just fat and the people writing the stuff just have no idea what it is doing. While that is certainly possible, I really hope it isn’t.
I don’t find that robots.txt file listing very satisfactory. I don’t want to slow down how long it takes you guys to index my site; I would rather you figure out why you take so much bandwidth and correct it, or just do it once or twice a month. If I am going to add something to robots.txt it will probably be to block MSNBot entirely, just on principle. In the meantime though, I am going to collect logs and try to figure out what you guys are doing differently from the other search engines so at least I have a clue and understand it.
Honestly I would think you guys would be looking at the deltas in what the indexing engines are doing on your own sites to really understand why you are eating more bandwidth and seemingly presenting less of the content.
joe
It occurs to me that you can effectively limit the MSN Crawl rate by setting the crawl delay to a value based on the number of pages you have on your site divided into the number of seconds in a month, if you wanted one crawl. I’m assuming they have a decent queuing algorithm.
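As a sketch of that calculation, assuming a 30-day month and a bot that fetches each page once per crawl and respects the delay (the function name and the 30-day figure are my own choices, not anything MSN documents):

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumes a 30-day month

def crawl_delay_for(pages, crawls_per_month=1):
    """Seconds between requests so `pages` pages get fetched `crawls_per_month` times."""
    return SECONDS_PER_MONTH // (pages * crawls_per_month)

# 1760 is the page count MSN Search quoted for joeware.net earlier in the thread.
delay = crawl_delay_for(1760)
print(f"User-Agent: msnbot\nCrawl-Delay: {delay}")
```

Rounding the delay down leaves a little slack, so the bot could finish one pass slightly before the month is up.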
It’s unfortunate to have to treat Microsoft as a special case requiring different attention, but I’m seeing similar behavior a year and a half after your last entry. Unfortunately, in computing we’re rather used to this high-maintenance state of affairs from a certain industry player.