I often find “oddities” in various programs and systems and tend to work with the vendors to get them corrected.
Something I tend to run into a lot is the whole “bug” vs “by design” issue. Vendors like to say things are by design so that you have to follow a different method to get the changes you want, usually one with a much longer wait period and often one requiring funds from you, and in the end there is no guarantee they will do it anyway. Bugs, on the other hand, are things that absolutely do not work according to the published functionality. Companies hate admitting to bugs because it means they really did kind of blow it and it will be entirely on their dime to correct it.
Many of the items I point out end up being declared “by design”, generally because there is no documentation that can be pointed at to show that no, what is being seen isn’t intended behavior. I think some companies and some products purposely are a bit light on docs just to help them in cases where they would rather say “by design” than admit to a bug.
Personally, I see two different aspects of “by design”. There is the true “by design”: this is the way we figured it would be used, this is the intended functionality, that is exactly how it is supposed to work, and what you are doing is outside of what we, or in fact anyone, had considered. When you encounter that, it is “generally” fairly easy to understand it and see that yes, you are doing something that probably wasn’t intended. You will note I use the word intended a couple of times there…
I think intent is HUGE in determining if something really is a bug or not. Were the results functioning in the way that was intended, or are the results a side effect of the implementation of the design? That side effect case is the other kind of “by design”. To most vendors, not all, there is no difference there. If someone can claim that was part of the design, intended or not, and there is nothing documenting the functionality to the contrary, then by gosh it is “by design” and definitely not a bug, so submit a request to change the functionality and maybe pony up a few bucks. Other companies will look at it and say, well, we kind of botched the design, we will call that a bug.
I personally do this quite a bit with “customers” using my tools. I know what I intended, and if I botched something because of how I implemented it, I will call it a bug even if it is a direct result of the design. There may be a person or two who have disagreed with me on whether or not I intended something, but that is how the cookie bounces.
Anyway, intent intent intent intent. SteveB should go around repeating that for a while. A “by design” can be a bug if the intent isn’t to do what the implementation ends up doing.
What, you want an example? Ok… Way back when I was doing ops for a largish org, we kept running into an issue with people getting a password reset and apparently logging on ok but being unable to change their password when prompted. The issue cropped up in Windows 2000 after Microsoft had switched from using a single master PDC for password changes to using a multimaster DC mechanism where a password could be changed anywhere.
First off, here is a rough sketch of how it kind of worked under NT4. The admin would reset the user’s password and set it to be expired. So the user would use the new password, and the local BDC would say, that isn’t what I have, let me ask the PDC (called PDC chaining). The PDC would say back to the BDC, hey, that is perfectly fine, let the user on, BTW, they need to change their password. So the user logs on and then gets hit with the next prompt: enter your old password and your new password. Obviously this was intended to mean enter the password you used to log on just now and also the new password you would like to use. The system was even nice and populated the first field. The password change call on the local machine would find the PDC and send the request there, and the change would be a success.
However, if somehow the machine screwed up and hit a non-PDC (say the 1B WINS record was wrong), the NT4 server that got the request would say, oh, password change, ok, am I authoritative (i.e. am I the PDC)? No? Ok, refuse the change, I am not the PDC. The client would get the response back that it was barking up the wrong tree, and it would pull the 1C record for the domain instead and query every machine until it found the one that responded that it was the PDC. Either way, as long as the PDC could be resolved and responded to RPC calls, the password would get changed and everything was peachy.
This is where the design comes in. That was the NT4 design, and in Windows 2000 the design changed so that any DC could take a password change. But was there maybe an UNINTENDED problem? You tell me… Say a user gets their password reset and expired. They go to log on, the local DC says no, that isn’t right, let me check the PDC, and so it chains up to the PDC. The PDC says, and this should be familiar at this point, yes that is fine, however the user needs to change their password right away. So the local DC says sure, come on in, but you need to change your password, and up pops the familiar dialog. The user enters the new password, and the password change routine knows that any DC can be used for that change now. So why, heck, it uses the one that it was authenticated by (or at least thinks it was authenticated by). It sends the old password and the new password to the DC that just said, fine, come on in. The DC says, hmm, am I authoritative… Of course I am, let me change it… But first, did they send me the right old password? It looks and says, well heck no. That password is not what I have, this change is not going to happen, and it sends back a message to the client saying, you entered the wrong password. The user is sitting there going WTF, I didn’t even have to enter the password, it used the one that I entered to log on and you just told me it is fine…
See what happened is a bug in the implementation. The initial check for being authoritative was removed… however, where was the only place that could previously change the password? The PDC of course. As everyone knows, the PDC is the most authoritative authority on any password at any given time by definition. So when the DC checked its local password for the user, it knows from the old code that it is absolutely authoritative and doesn’t even start to have code to go chase back to the PDC with a chaining request like the logon did. It simply rejects the request.
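For anyone who would rather read code than prose, the flow above can be sketched as a toy Python model. This is not Windows code. The class, the method names, and the result strings are all made up for illustration, and the third method is only a guess at the general shape of a fix (chase the PDC on a mismatch, the same way logon already did), not a description of what actually shipped.

```python
class DomainController:
    """Toy DC: a name, a PDC flag, and a local copy of user passwords."""

    def __init__(self, name, is_pdc=False, pdc=None):
        self.name = name
        self.is_pdc = is_pdc
        self.pdc = pdc            # the PDC this DC chains to (None on the PDC itself)
        self.passwords = {}       # local copy, possibly stale

    def logon(self, user, password):
        # Local check first; on a mismatch a non-PDC chains to the PDC.
        if self.passwords.get(user) == password:
            return True
        if not self.is_pdc and self.pdc is not None:
            return self.pdc.logon(user, password)
        return False

    def _change_local(self, user, old, new):
        if self.passwords.get(user) != old:
            return "WRONG_PASSWORD"
        self.passwords[user] = new
        return "OK"

    def change_password_nt4(self, user, old, new):
        # NT4 style: only the PDC takes changes, everyone else refuses.
        if not self.is_pdc:
            return "NOT_PDC"
        return self._change_local(user, old, new)

    def change_password_w2k(self, user, old, new):
        # Authoritative check removed: any DC trusts its own (maybe stale) copy.
        return self._change_local(user, old, new)

    def change_password_chained(self, user, old, new):
        # Hypothetical fix: on a mismatch, hand the change to the PDC.
        stale = self.passwords.get(user) != old
        if stale and not self.is_pdc and self.pdc is not None:
            return self.pdc._change_local(user, old, new)
        return self._change_local(user, old, new)


def build_scenario():
    """Admin reset has landed on the PDC but has not replicated yet."""
    pdc = DomainController("pdc", is_pdc=True)
    local = DomainController("local-dc", pdc=pdc)
    pdc.passwords["alice"] = "Reset123"
    local.passwords["alice"] = "OldSecret"
    return pdc, local


if __name__ == "__main__":
    pdc, local = build_scenario()
    print("logon via local DC:", local.logon("alice", "Reset123"))
    print("NT4 change at local DC:", local.change_password_nt4("alice", "Reset123", "New456"))
    print("W2K change at local DC:", local.change_password_w2k("alice", "Reset123", "New456"))
    print("chained change at local DC:", local.change_password_chained("alice", "Reset123", "New456"))
```

The point of the model is the middle case: the same DC that happily let the user log on by chaining turns around and rejects the change, because the change path only ever looks at its own copy.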
Once I figured out what was going on and the most probable why, which is what I laid out above, I started chatting with our local onsite PSS engineer. He told me that is by design and that is how it is, look at all of the KB articles, they say that is how it is and you need to change the password on the DC that the user is going to hit… Great, that is fine to tell the rubes but that doesn’t work in the Enterprise world, try again. We contested the issue for weeks, and finally he hooked me up with a real PSS engineer in Texas. I spent weeks in debate. Finally, after much pain and sore throat spray, I got to CPR, people who actually could read the code and, better yet, change it. I spent several hours on a con call where the CPR guy kept slipping and saying things he wasn’t supposed to and being dunned by the TAM that was on the line. Finally, after he walked me through all of the workarounds and I responded with why they don’t work at the level of enterprise I was at, he admitted, yeah, there is no good answer for you guys, I can see why you consider this a bug… WHAT! Could it be victory? Yes indeed, it was, and that change made it into 2K SP4 and 2K3 Gold. It took months of arguing, literally eating up many hours debating with engineers and responding to emails, and the whole time I was being told it was “by design” I just kept saying, so you intended to make it so I couldn’t change a password because of this? I would much rather hear you guys muffed it and didn’t think it through than that you purposely wrote it to be that way.
Another example was a debate over the GCs that DSACCESS gives out to clients. I spent several months back in 2003/2004 working and reworking PSS to get them to admit that it was screwed up and needed to be fixed. Eventually they did, and that functionality has now been changed in E2K3 SP2 to correct the issues I pointed out over and over again. Basically it is much harder now to get a GC that isn’t a DC for the domain the user is a member of. Why is this important? Well, because the mechanism used by Outlook (NSPI) doesn’t do referrals, so if you asked for a change to, say, your public delegates and you hit a GC that wasn’t a DC for your own domain and so couldn’t take the update, you were screwed. Outlook was too stupid to tell you, and forever onward it displayed the wrong info, which didn’t align with reality.
I am working another one now where I have shown that the WMI DSACCESS provider can easily give you incorrect info for what DCs/GCs are in use by an Exchange Server. That provider is used by scripts, tools, and the ESM itself to show you what is being used. I determined that if you keep touching the provider by querying it for information, it won’t update the information it maintains. Basically you have to go six minutes or longer where NOTHING asks for the DC list through WMI from that server in order for the list to be updated. They are firmly at the point now of, that is “by design”. So I am firmly at the point of asking and reasking, are you saying that you intended to give me bad information? The pieces may be functioning per design, but I don’t expect the intent was for it to be possible to make the WMI provider NEVER update the info it gives out. In fact, it is quite easy to make an Exchange server never update the DC list it shows through WMI unless you restart the management service and force it to reload. The following script will do it quite nicely:
strComputer = "exch"
while 1
    wscript.echo time()
    ' Connect to the Exchange WMI namespace on the target server
    set objWMI = GetObject("winmgmts:\\" & strComputer & "\root\MicrosoftExchangeV2")
    ' 48 = wbemFlagReturnImmediately + wbemFlagForwardOnly
    set objDCList = objWMI.ExecQuery("Select * from Exchange_DSAccessDC",,48)
    for each objDc in objDCList
        Wscript.Echo "DCName:" & objDc.name
        strTemp = "Automatic"
        if (objDc.ConfigurationType=0) then strTemp="Manual"
        Wscript.Echo " Selection: " & strTemp
        Wscript.Echo " Is Fast : " & objDc.IsFast
        Wscript.Echo " In Sync : " & objDc.IsInSync
        Wscript.Echo " Is Up : " & objDc.IsUp
        Wscript.Echo " Ldap Port: " & objDc.LDAPPort
        strTemp = "Global Catalog"
        if (objDc.type=0) then strTemp = "Config"
        if (objDc.type=1) then strTemp = "Local Domain"
        Wscript.Echo " Role : " & strTemp
        Wscript.Echo "-----------"
    Next
    ' Only the connection object is released; objDc and objDCList are still held
    set objWMI=nothing
    wscript.echo "Sleeping..."
    wscript.sleep(60*60*1000) ' one hour sleep between queries
wend
Although the code appears to break the WMI connection and sleeps for an hour, this script will force the Exchange Management Service to hold on to the list of DCs it had at the moment the script started, for as long as the script keeps running or until the Exchange Management Service is recycled.
Two small changes make it so WMI can refresh the list, assuming nothing else queries via WMI in the meantime, which in a large environment isn’t something you can even start to guarantee.
The small changes are to add the following two lines above set objWMI=nothing:
Set objDc = Nothing
Set objDCList = Nothing
The big problem as I see it here is that you can never query for the info and know how valid it is. Something could be preventing the list from being refreshed, and there is no way for you to know that, nor how recently the data was gathered. So if you can’t have any faith in the information, what is the point in Exchange supplying it at all? Again, this is “by design”, but no one has said whether that is the intent or not. I have a lot of faith in MS, and I don’t believe for an instant that someone knew this would happen and said that is fine. However, it doesn’t change the fact that I am still sitting here listening to how it was “by design”.
It does make me wonder how much of the rest of the WMI counters and data for Exchange is messed up in a similar way. I have heard numerous complaints about how unreliable the information is for Exchange out of WMI. WMI itself isn’t that flawed. I don’t like WMI, but it isn’t that flawed or else all of MS would have thrown it out. These are simply issues with this implementation.
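One way to picture the behavior described above is a cache whose refresh clock gets reset by every read. The Python sketch below is only a model of that observed symptom. The class names, the loader function, and the timing mechanism are all assumptions made for illustration, the six minute figure is the one from this post, and nothing here is taken from the Exchange provider itself. The second class shows a fixed-age alternative that also exposes when the data was loaded.

```python
import itertools

IDLE_SECONDS = 6 * 60  # six minutes, the idle window described in the post


class IdleRefreshCache:
    """Reloads only if nobody has read it for IDLE_SECONDS (hypothetical model)."""

    def __init__(self, loader, now=0.0):
        self._loader = loader
        self.value = loader()
        self.loaded_at = now
        self._last_read = now

    def read(self, now):
        if now - self._last_read >= IDLE_SECONDS:
            self.value = self._loader()
            self.loaded_at = now
        self._last_read = now  # every read pushes the refresh further away
        return self.value


class FixedTtlCache:
    """Reloads once the data itself is older than IDLE_SECONDS."""

    def __init__(self, loader, now=0.0):
        self._loader = loader
        self.value = loader()
        self.loaded_at = now   # callers can check how old the data is

    def read(self, now):
        if now - self.loaded_at >= IDLE_SECONDS:
            self.value = self._loader()
            self.loaded_at = now
        return self.value


def make_loader():
    """Stand-in for 'go find the current DC list'; returns 0, 1, 2, ... per load."""
    counter = itertools.count()
    return lambda: next(counter)


def poll(cache, interval, count, start=0.0):
    """Read the cache `count` times, `interval` seconds apart; return what was seen."""
    now = start
    seen = []
    for _ in range(count):
        now += interval
        seen.append(cache.read(now))
    return seen


if __name__ == "__main__":
    # A monitoring tool polling every five minutes.
    print("idle-refresh:", poll(IdleRefreshCache(make_loader()), 300, 6))
    print("fixed-ttl:   ", poll(FixedTtlCache(make_loader()), 300, 6))
```

With a poll every five minutes the first cache never leaves its initial load, while the second keeps moving, and its loaded_at value is exactly the "how recent is this" answer that is missing today.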
And yes, I know that MONAD is what will be used in E12. However, we will be running E2K and E2K3 for a long while yet. The WMI issues will be a pain for quite a while longer. Possibly forever, as I expect the Exchange folks aren’t too interested in fixing something that is going away, even though we the customers will be living with it for years. I have already started to point out “by design” issues I am perceiving in how they will use MONAD. If you watch the EHLO blog you will note that there was an E12 Dev blog post that I commented on. Basically it was about pulling info. It seems the design, or as they called it “the MONAD way”, is to retrieve all of the mailbox properties even if you only need a couple. Then you use the script to toss out what you don’t need and display the rest… Great… That will certainly be efficient when querying 150,000 mailboxes across Exchange Servers around the world for the mailbox size or the last time they were logged into…
In summary, something that is “by design” is not necessarily exempt from being called a bug, in my opinion. When I think design, I think intent. If something works the way it does and that isn’t intended but is instead a result of not fully understanding the implications of the implementation, that can easily be a bug.
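To put a rough shape on the cost concern, here is a small Python sketch that compares pulling every property and filtering afterwards with asking only for the properties you want. It is a counting exercise, not MONAD or Exchange code. The store, the property names (size_kb, last_logon, prop0 and so on), and the number of properties per mailbox are all hypothetical.

```python
def make_store(mailboxes, properties_per_mailbox):
    """Build a fake mailbox store; every name and value here is made up."""
    store = []
    for i in range(mailboxes):
        record = {f"prop{n}": n for n in range(properties_per_mailbox - 2)}
        record["size_kb"] = 1000 + i
        record["last_logon"] = f"day-{i}"
        store.append(record)
    return store


def fetch_all_then_filter(store, wanted):
    """Pull every property of every mailbox, then throw most of it away."""
    transferred = 0
    results = []
    for mailbox in store:
        record = dict(mailbox)            # the whole record is retrieved
        transferred += len(record)
        results.append({key: record[key] for key in wanted})
    return results, transferred


def fetch_projected(store, wanted):
    """Ask only for the wanted properties in the first place."""
    transferred = 0
    results = []
    for mailbox in store:
        record = {key: mailbox[key] for key in wanted}
        transferred += len(record)
        results.append(record)
    return results, transferred


if __name__ == "__main__":
    store = make_store(mailboxes=1000, properties_per_mailbox=50)
    wanted = ["size_kb", "last_logon"]
    _, all_count = fetch_all_then_filter(store, wanted)
    _, projected_count = fetch_projected(store, wanted)
    print("values retrieved, fetch everything:", all_count)
    print("values retrieved, fetch only wanted:", projected_count)
```

Both functions hand back the same answer. The only difference is how much had to be retrieved to get it, and that difference scales with the number of mailboxes.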