I get a chuckle out of how many places will see SYMPTOMS of some sort and the first thing they come up with will be to _decide_ the problem is that they need to reboot a DC. If I had to give this form of troubleshooting a name, I would give it something like “Wishful thinking”, “I should be fired”, “I can’t be bothered to do my job”, etc.
This is especially popular in larger orgs with a mix of decentralized and centralized support resources and domain controllers. Someone will have maybe once seen that rebooting a DC “solved” a problem and that becomes THE solution. You will get someone out in a remote site, or in an application group, who can somehow get physical access to a DC (or server in general) and who will then take it as their job to reboot whatever server they think needs to be rebooted, regardless of whether it does or not. Alternately maybe they will “request” the reboot, and when I quote the word request it is because I don’t mean request; I mean they whine and complain and demand you reboot because they “know” that will solve the problem.
The proper answer is to try to work out what the problem is, especially if it is recurring. I have walked into environments where I have been told the solution to a problem is to just reboot the DC when it occurs. This is not a solution; this is a bandage to alleviate symptoms. A solution involves actually troubleshooting and working out the underlying issues. And you can’t do that if the first thing you do is reboot.
This silly type of “troubleshooting” mechanism is not just something low-skill admins come up with; unfortunately it is also something I have heard come from the mouths of MSFT people, more often MCS (Consulting) folks than PSS. Most of the senior folks in PSS are very good in this area; the last thing they want is a reboot, because it often erases all evidence of what is going on if the problem really is at the server being rebooted.
As one quick actual example, there was a company I went into back in 2001 that had some issues. Approximately 80% of the DCs in one domain (so about 80 DCs) weren’t actually working properly, but they had no monitoring and no one proactively looking at things. The troubleshooting mechanism was: if anyone at the site complained, the DC was rebooted. The local site people had actually been trained that that was the solution, so their tickets started changing from “we see these symptoms” to “reboot this DC”, which would simply be processed by the centralized people. (This is a firing offense in my opinion. Trouble tickets are often like you telling your doctor your issues; he/she is supposed to take that and really work through the problem, not listen to you say what you would like done and then just do it.)

When my group took these DCs over we were getting double-digit reboot requests a week, and our immediate response was “no, not going to happen”; you tell us what is going on and what you think is wrong and we will take it from there. I can’t begin to explain how badly this pissed some people off, because they just wanted it done. I had high-level escalations, etc., and thankfully this is an easy battle to win if your management aren’t complete idiots. The argument is: “I would like to figure out WHY this needs to be done every X days/months/etc. versus just doing it, and maybe we can remove the reboot need entirely and give more availability.”… See how I got that availability keyword in there? Mucho helpful. Anyway, I don’t think there was a single issue we didn’t track down to specific items that we were able to correct, and within 3-6 months the environment stabilized dramatically due to the lack of reboots and actually having everything configured properly.
Don’t get me wrong, sometimes a reboot is the answer, even the correct one. But you need to work to understand WHY it is. Rebooting because you know it will alleviate symptoms is not troubleshooting; don’t pretend that it is. If you do reboot, what other steps are you taking to ascertain what that reboot did to solve the issue, and then to prevent those things from occurring again?
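If nothing else, capture the volatile state of the box before you bounce it, because the reboot destroys exactly the evidence you need. Here is a rough sketch of the idea in Python; the commands are stock Windows tools, and the output folder is just an example value, so adjust it for your shop:

```python
# Rough sketch: snapshot volatile state before a reboot erases it.
# The output directory is an example value; put it wherever makes sense.
import os
import subprocess
import time

def snapshot(out_dir=r"C:\diag"):
    os.makedirs(out_dir, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    # Stock Windows commands that capture state a reboot would destroy.
    commands = {
        "netstat":  ["netstat", "-ano"],    # open connections and owning PIDs
        "tasklist": ["tasklist", "/v"],     # running processes and memory usage
        "ipconfig": ["ipconfig", "/all"],   # current network configuration
    }
    for name, cmd in commands.items():
        out_file = os.path.join(out_dir, "%s-%s.txt" % (stamp, name))
        with open(out_file, "w") as f:
            subprocess.run(cmd, stdout=f, stderr=subprocess.STDOUT)

if __name__ == "__main__":
    snapshot()
```

Even that little bit of output means that when the problem comes back (and it will come back), you have something to compare against instead of a clean slate.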
joe
It’s really only since I started working with Microsoft server OSs that I’ve run into the whole pervasive idea of “reboot first, ask questions later.” As a Unix admin, I was loath to restart my servers for any reason, uptime being such a golden egg.
Joe I really do enjoy your site a whole lot, but it seems like most of your posts are wasted on complaining about how shitty the rest of us admins are. I think you should really think about writing a post like this and then following up with some of the things you did to fix the problem: tools you used, procedures, references. You are a pretty smart guy, as we all know, but throwing all this negativity around isn’t going to help us become less stupid, you know?
Fred: Exactly my point. In the UNIX world a reboot is rarely accepted as the solution; it is a remediation step, but the issue is generally heavily investigated. Most Windows admins have not been subjected to this type of rigor, but will be more and more as applications move from Mainframe and UNIX platforms to Windows.
Daniel: Fair point, but the issue is that it is not really possible to document even a fair number of the possible issues unless you start with a list of symptoms. It literally could be anything that someone thinks they need to reboot a server for. My point is, it is never right to just assume reboot is the answer, as Windows admins have been taught and have come to accept. Admins need to do their job and actually troubleshoot the specific issues as they occur. Is it network connectivity issues? OK, start looking at things like netstat and network traces, etc. Is it resource issues? OK, start looking at memory allocations and paged pool tracking, etc. Seriously, it could be anything; people want to reboot servers for any number of issues, and possibly that helps alleviate the symptoms and possibly it doesn’t, but regardless it really isn’t troubleshooting.
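For the resource case specifically, even something crude tells you within a day whether pool usage is climbing. Here is a rough sketch; typeperf ships with the OS, and the counter list, interval, sample count, and output path are all just example values:

```python
# Rough sketch: sample kernel pool counters over time to see whether a
# "resource issue" is actually a leak. typeperf ships with Windows; the
# interval, sample count, and output path below are example values only.
import subprocess

COUNTERS = [
    r"\Memory\Pool Paged Bytes",
    r"\Memory\Pool Nonpaged Bytes",
    r"\Memory\Available MBytes",
]

def sample_pools(out_csv=r"C:\diag\pools.csv", interval_sec=300, samples=288):
    # 288 samples at 5-minute intervals = 24 hours of trend data.
    cmd = ["typeperf"] + COUNTERS + [
        "-si", str(interval_sec),
        "-sc", str(samples),
        "-f", "CSV",
        "-o", out_csv,
    ]
    subprocess.run(cmd)

if __name__ == "__main__":
    sample_pools()
```

If paged pool only ever goes up and never comes back down, you have a leak to chase, and the reboot was just resetting the clock.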
please reboot! 😉
Maybe some common troubleshooting, I guess. I know that a lot of admins run things in a similar fashion that would be considered ‘wrong’. I am sure you have run into situations that one admin gets lost in, where you can hear the symptoms from a user and say ‘ahh hah’.
All I am saying is: more examples, more informative posts, more things that can be used by everyone who reads. I have pretty much come away from your post crying, but it is OK. I know I am guilty of restarting a server a couple times to ‘fix’ an issue, but normally I try to dig down and find out what is going on.
Hey Joe,
Are you going to publish an update to your best-selling book “Active Directory, 3rd Edition” before the release of Windows Server 2008?
Boaz
At one of the jobs I work at, it is even worse than rebooting to solve a problem. This is a big organization with 25k+ users. This place reboots all servers once a month, for no reason. Here is why I say it gets worse: I work at a remote site for that job and I ask “HQ” why they are rebooting again. The servers are rebooted every month during patch installs anyway.
The answer I usually get is, “Well, the servers are on the reboot list.” Just no thinking, no forethought; just make sure to finish the checklist and don’t question anything. This is the same mentality I ran into many times when I was in the Army, so I’m used to it by now.
I guess the only good news for me is that since it is my second job at night I can always be happy about the extra cash.
The only time I’ve suggested rebooting a DC is when it needs a new SSL certificate to be picked up by the LDAP service.
Erm..joe? I think the comments are taking too long to get up on the site. Perhaps you should …..reboot?
😉
Jorge: ;o)
Daniel: Yeah, I try to get that up when I can. Unfortunately, every time I post something like that I get about 500 emails with people describing similar but slightly different scenarios that they want me to troubleshoot for them. Or alternately it is the exact same scenario, they just don’t grasp what is going on. Also, a lot of the stuff I work on anymore I don’t have the complete freedom to discuss.
Boaz: O’Reilly hasn’t approached me and I haven’t approached them. To tell the truth, I took a bath on that book. I make less than a dollar per book, and while it has sold well, the work effort versus financial remuneration made no sense. I knew that for the first one too, but it was a case of wanting to “give back”, as that was the first AD book I read (and unfortunately it set a lot of wrong things in my head about how AD worked), and I wanted to help out by extending it with what I considered great info as well as correcting a lot of info that wasn’t correct. Being a technical writer doesn’t really pay very well unless you have a bunch of books, or you aren’t really writing them; you have someone else write them (or better, a series of them) and then you put your name on them… I won’t mention any names.
Mike: This is not uncommon, and there have been times I have recommended similar actions based on time available and problem stack, especially in the case of remote servers with no “Remote Access” type hardware solutions. I do always try to get back to those, though, and work out why they are hanging up, unless the hardware is super old and I know that replacing it will likely solve the issue (say memory is getting flaky or disks are getting ready to puke).
Alun: Hmm, that shouldn’t be necessary anymore once we have Longhorn in place with the restartable DS, but I wonder if that is a scenario that was tested.
M@: LOL, this actually runs on a, I think it is a Linux server now, used to be BSD when I initially signed up for it. As everyone knows, Linux NEVER has a problem nor has to be rebooted. ;o)
The pervasive use of “when in doubt, reboot” comes from the perception that it is easier to make the symptom go away than it is to solve the underlying problem.
Ever heard, “Heck, it only takes a few minutes a day to reboot. It could take me hours to find the problem and fix it.”
The whole “reboot to fix it” approach reminds me too painfully of “load the updates to fix it.”
When you have a problem, especially with a server, the *LAST* resort in solving the issue is to reboot it, upgrade/update it, or load software on it.
I have worked in environments where this is the case. In some cases it can be laziness. In others it can be lack of skill. In some cases, however, there are issues which are outside the control of the individual admin/team. One prime example is third-party applications, where the application vendor is unable/unwilling/uninterested/no longer around, and the part of the business responsible for using that application is unable/unwilling to change.
This is less of a problem with DCs, I know, as a well-run organisation doesn’t run third-party software on DCs, but it is a very similar situation with application servers.
Do you have any words of wisdom for dealing with this?
Mike