joeware - never stop exploring... :)

Information about joeware mixed with wild and crazy opinions...

MSNBot Revisited

by @ 5:40 pm on 6/3/2006. Filed under tech

Well, I was looking at the traffic logs on June 1 and noted that MSNBot still hadn’t calmed down. I was reading up on robots.txt to figure out what to do when one of my MSFT friends pinged me about it (did he have a reminder set in Outlook too, I wonder?). I told him that it was still a problem and that I was actually trying to figure out how to block it. He asked me to hold on while he forwarded it to some people he knew. The result is a post (or maybe posts) in the comments of the original blog entry.

Unfortunately, what they pointed out was inaccurate. They tried to use the Google/Yahoo/live.com “of about” hit count to measure how many pages they had on file (or “indexed” or “cached”) for the site. You have seen this: you type in some search term and see “you are looking at pages 1 to 10 of about 200,000 for {insert search term here}”, and then you go and look and run out of pages at around 1000 or so. The reason is that the number of pages is sort of a guess based on the backend indexing info. Depending on how the indexing is done and the number of actual hits, the number may be a little off or way off. It is almost 100% guaranteed to not be correct unless it is less than 10 or so.

No matter what, that number is not the way you should be measuring how many pages you have indexed. I have to say I am more than a little concerned that a search person would try to justify bandwidth based on a number like that. Surely they would know how bogus that number is? Right? Maybe they figure __I__ don’t know how bogus the number is. Could be, I guess, but with MSFT I tend not to attribute to evil intent what can be attributed to lesser things: lack of a clue, confusion, lack of communication, or just outright not knowing or understanding. Alternatively, maybe I just don’t understand what they are saying or trying to accomplish. I have been known to say in mixed company (i.e. MSFT and non-MSFT friendly people[1]) that MSFT is less evil and more addled than anything…

From one of the comments:

site:http://blog.joeware.net/
on msn and google, u will see msn has 900 pages and google around 300.

From my response:

MSN Search (live.com) shows 912 web pages available, scroll to the last page and you see you are on web page 243.

Google search shows 347 web pages available, scroll to the last page and you see you are on web page 319.

From one of the other comments:

You pointed that for your site MSNBot’s average monthly bandwidth consumption is 230MB, which is about 3.5 times that of Yahoo aka Inktomi Slurp (60MB) and six times that of Google (40MB).

Currently we have indexed 1760 pages from joeware.net, which is about three times as many as Google (565), and 3.5 times as many as Yahoo (492). This, combined with our policy of always refreshing content every few weeks, might explain the discrepancy.

From my response:

So stats for site joeware.net

site – initial reported max – actual max

live.com – 1698 – 247
google.com – 567 – 531
yahoo.com – 516 – 462
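To put a number on how far off the “of about” figures are, here is a quick Python sketch that divides each reported count by the last result that could actually be reached. The figures are the ones from the table above; the function name is just for illustration.

```python
# Figures copied from the table above:
# site -> (initial reported max, actual max reachable by paging to the end)
counts = {
    "live.com": (1698, 247),
    "google.com": (567, 531),
    "yahoo.com": (516, 462),
}


def overstatement(reported, actual):
    """Return how many times larger the reported estimate is than the reachable count."""
    return reported / actual


for site, (reported, actual) in counts.items():
    ratio = overstatement(reported, actual)
    print(f"{site}: reported {reported}, reachable {actual}, ratio {ratio:.2f}x")
```

The Google and Yahoo estimates land within roughly ten percent of what you can page to, while the live.com estimate is several times too high.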

 

The second response also gave a way to slow MSNBot down so that it takes longer to get through the whole site, but that isn’t what I want.

One way to reduce bandwidth consumption by MSNBot is to use the Crawl-Delay rule in the robots.txt file:

User-Agent: msnbot
Crawl-Delay: 600

This instructs MSNBot to wait 10 minutes between downloads, resulting in a maximum rate of 144 pages per day. Note that this robots.txt file should also be copied to all subdomains.

If I am going to let something index my site at all, I want it to get through all of my site in one quick pass and update all of it in one fell swoop. Either they need to figure out why they are using more bandwidth, or they need to let me say “only update once a month” or something.
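The arithmetic behind the Crawl-Delay suggestion is easy to check. The sketch below reproduces the 144 pages per day figure and then estimates how long one full pass would take at that rate. It assumes the 1760-page count quoted in the comment is the number of pages to fetch, which may well not be right; the function names are made up for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def max_pages_per_day(crawl_delay_seconds):
    """Upper bound on fetches per day if the bot waits the full delay between downloads."""
    return SECONDS_PER_DAY // crawl_delay_seconds


def days_for_full_pass(page_count, crawl_delay_seconds):
    """Days needed to fetch every page once at one page per delay interval."""
    return page_count * crawl_delay_seconds / SECONDS_PER_DAY


print(max_pages_per_day(600))                      # pages per day with Crawl-Delay: 600
print(round(days_for_full_pass(1760, 600), 1))     # days for one pass over 1760 pages
```

At a 10 minute delay, a single pass over 1760 pages takes a bit over 12 days, which is the opposite of one quick pass.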

When it gets down to it, I don’t care if I block MSNBot completely. I have been watching how many visitors I get from MSFT Search, and it is less than 200 per month compared to over 8000 per month from Google. All of the search engines together only account for 7.4% of the links in to my site, so I wouldn’t even mind blocking all bots, except that I use Google myself to find my tool links when telling people where to go: enter the tool name and usually click “I’m Feeling Lucky”.
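For reference, blocking a single bot is a two-line robots.txt rule. The sketch below uses Python's standard `urllib.robotparser` to check that such a rule would shut out `msnbot` while leaving other crawlers alone. The robots.txt content here is an illustration, not the file actually deployed on this site, and it only tells you how a well-behaved parser reads the rules, not what any particular bot will do.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block msnbot everywhere, allow everyone else.
ROBOTS_TXT = """\
User-agent: msnbot
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

msnbot_allowed = parser.can_fetch("msnbot", "http://blog.joeware.net/")
googlebot_allowed = parser.can_fetch("Googlebot", "http://blog.joeware.net/")

print("msnbot allowed:", msnbot_allowed)
print("Googlebot allowed:", googlebot_allowed)
```

As with the Crawl-Delay rule, the file would need to be present on every subdomain to have any effect there.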

I am sure there are some people who want to be in every search engine and feel it is so important that whatever is needed should be done or allowed. I am not in that crowd. I dislike wasteful, inefficient designs in anything but proof-of-concept work, and I don’t care if people find my website or not. It doesn’t do anything for me for them to come here; I put the stuff out there for them and refer them to it in the newsgroups and listservs.

My current stats are showing ~18,000 unique visitors with about 42,000 visits per month and some 900,000 hits. Yes, still a small site with a low hit count, but that is fine. The more I get, the better (read: more expensive) the hosting package I have to pay for.

Oh, for fun, here are the MSNBot stats along with the other search engines again for the month of May:

Month                               MSN                       #2                      Google

May 2006                        181MB                   65MB                     39MB

 

I have decided I am going to collect the actual logs over the month of June (plus the logs going back to mid-May, which I still had on the server) and try to work out why MSN is burning up so much more bandwidth. It really isn’t my job, but after those responses I figure someone should understand why this is happening. After that I will revisit the decision to block MSNBot completely. I am thinking it is a strong possibility, plus I will advertise it and see if I can get Slashdot to pay attention.
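A first pass over those logs could be as simple as summing bytes per crawler. The sketch below is one way to do that, under a few assumptions: that the server writes the Apache “combined” log format, and that the bots can be recognized by the substrings `msnbot`, `slurp` and `googlebot` in the user-agent field. The sample lines are invented to exercise the code and are not real log entries from this site.

```python
import re
from collections import Counter

# Assumed Apache "combined" layout: ... "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'"[^"]*" (?P<status>\d{3}) (?P<bytes>\d+|-) "[^"]*" "(?P<agent>[^"]*)"'
)

# Substrings to look for in the user-agent field (assumed; verify against real logs).
BOTS = {"MSN": "msnbot", "Yahoo": "slurp", "Google": "googlebot"}


def bytes_by_bot(lines):
    """Return (total bytes, hit count) per known bot for an iterable of log lines."""
    totals = Counter()
    hits = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # skip lines that do not look like the assumed format
        agent = match.group("agent").lower()
        size = 0 if match.group("bytes") == "-" else int(match.group("bytes"))
        for name, token in BOTS.items():
            if token in agent:
                totals[name] += size
                hits[name] += 1
                break
    return totals, hits


# Invented sample lines, for illustration only.
SAMPLE_LINES = [
    '10.0.0.1 - - [01/Jun/2006:00:00:01 -0400] "GET /a.htm HTTP/1.0" 200 5000 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"',
    '10.0.0.2 - - [01/Jun/2006:00:00:05 -0400] "GET /a.htm HTTP/1.1" 304 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '10.0.0.1 - - [01/Jun/2006:00:10:01 -0400] "GET /b.htm HTTP/1.0" 200 7000 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"',
]

totals, hits = bytes_by_bot(SAMPLE_LINES)
for name in BOTS:
    print(f"{name}: {hits[name]} hits, {totals[name]} bytes")
```

Dividing bytes by hits for each bot would then show whether the extra bandwidth comes from more requests or from larger responses per request.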

I like MSFT, but if they do something half-assed I have no problem pointing it out so they can fix it. It seems, though, that the only time some of them pay attention is when you do something to make them feel pain from their decisions or designs. Certainly that applies to the Exchange Dev Team at MSFT. Any time I have found a problem and told them about it nicely I have been steadfastly ignored, but if I highlight it in the public eye and make fun of how stupid they are being, they start looking at correcting things. I am sure that there are good folks on the Exchange teams; I just don’t know if they are being listened to.

I know MSFT has some absolutely amazing people, and I have met many of them. They are usually locked up where you can’t normally reach them on the MSFT campus in Redmond, but I have met some from Las Colinas in Texas as well. I have also dealt with some real worthless folks, though. Unfortunately they seem to be more of the front line people, and anyone I meet from a “local” office I tend to lump into this category until they prove themselves otherwise, and a couple have. I think it is why so many people who only lightly deal with MSFT think they suck so bad overall. I think MSFT is great overall but with some severe trouble spots, with the understanding that once the team being a trouble spot “gets it”, the problem will go away. I hold out hope for the Exchange Dev folks (including folks who moved to LCS from Exchange).

  joe

 

[1] In groups like that, i.e. MSFT and non-MSFT friendly folks, I tend to be the black sheep because the MSFT friendly people think I am too mean to MSFT and the non-MSFT people know that I am an MVP so figure I am a cheerleader. It really is quite a fun situation because I don’t care what either side thinks about my allegiances. 🙂 I can say things to rile both sides up and mean it all.

 


2 Responses to “MSNBot Revisited”

  1. Alun Jones says:

    I always love those folks that think MVPs are MSFT cheerleaders. I’ll admit I got my MVP start in large part because I was telling people to stop posting complaints about MSFT – but that’s because their complaints were so bogus. They’d complain about broken features that were not broken, or behaviours that demonstrably didn’t exist.
    My point to them was that their rumour-mongering and ranting about fables was getting in the way of me trying to report real bugs and real problems.

  2. Fred says:

    My problem with MS is that I’ve met very few of them who actually know what’s going on with their own products. And you are likely right about the external people, Joe, because I have to think there are people–deep, deep–within MS who do know what they’re doing. This much is evident from their products.

    Know what else is evident from their products? There are a lot more people who don’t know what they’re doing. The paradox of Microsoft.
