joeware - never stop exploring... :)

Information about joeware mixed with wild and crazy opinions...

Virtual DC Poll

by @ 5:07 pm on 3/12/2013. Filed under tech

I was in a discussion and someone told me that more than 60% of Enterprise-class Microsoft customers are already virtualizing writeable Domain Controllers in their production corporate environments. Bullshit!!!

I started chuckling when I heard this. I don’t believe even for a second that the numbers are ANYWHERE near that level of penetration. Certainly there is a lot of chatter in this space, but my personal experience is that penetration is down in the single digits among companies using writeable virtual domain controllers in Enterprise-class corporate environments where AD failure could impact thousands, tens of thousands, or hundreds of thousands of people. People are concerned, and despite Windows Server 2012 there are reasons behind it. Certainly it can be done; I have seen it done magnificently, with no loss of redundancy to a single external disk storage system or other single point of failure. I have also seen it done completely half-assed, and I know of a production environment that completely blew up and had to be rebuilt from scratch, not from backup, but from scratch. If you are lacking solid process in one area, chances are that isn’t the only area…

So I guess you could play around with the definition of Enterprise class, with different people considering different sized environments to be "Enterprise Class". I think the smaller the environment, the more likely people are to take the risk, but also the more likely they are to have the tight operational control they need to successfully virtualize DCs.

Now lab environments… I see virtual lab environments all the time. Bravo; I would rather see a virtual lab environment than NO lab environment, and really I am quite ok with lab environments being virtual, assuming they are true no-SLA, no-SLO, no-production-expectations-of-any-sort sandbox environments. Environments that, if they went a bit wonky on you, you could fairly painlessly, or better, absolutely painlessly, blow away and start over. If you have an SLA/SLO or an expectation of it being available and expect people to come running to put it back together if it blows… that isn’t a lab environment, that is another production environment.

I have been running virtual DCs in a lab / sandbox manner since the first Windows 2000 beta I received, which I loaded in a guest on an NT4 box running a beta or POC of the first VMware Workstation product. And yes, I have seen issues when I have made a mistake, or the backend storage wasn’t as solid as it needed to be, or it took a power hit at just the wrong time, etc. But again, it is a lab environment; when it screws up, I delete it, or sometimes I see just how bad things can get in AD before I start to cry or get a headache. I have one lab AD that won’t allow me to promote another DC no matter what. I have it off to the side because I want to troubleshoot it until I figure out why. Every day I think I am more likely to see that somewhere else out in the real world. (This isn’t a request for someone to help me sort it out; I will get to it when I get to it.)

Anyway, I am guessing, based on what I have seen out in the world, that this other person is, IMO, wildly guessing based on what they have seen out in the world. I figured I would give the joeware followers a chance to respond, as I think they comprise a good number of the big Enterprise-class companies (and militaries and governments) out there, and I truly am curious. The poll, if I set it up correctly, will run until March 31, and results should be out the first week of April. Please, please respond, and get your friends in other companies to respond too. I would really like to see where we truly are.

If you like, you can always email me as well. If you don’t virtualize DCs, are you being pressured to do so? Do you have a written policy against it? If you do virtualize DCs, I would like to hear those stories as well: How big? How many issues? How is the redundancy handled? Internal pass-thru disks on the physical host per MSFT’s recommendations, or external, or ????

      joe

 

POLLS START HERE

[yop_poll id="2"] [yop_poll id="3"] [yop_poll id="4"] [yop_poll id="6"] [yop_poll id="7"] [yop_poll id="8"] [yop_poll id="9"]

THANK YOU!!!


8 Responses to “Virtual DC Poll”

  1. Mike Kline says:

    60 percent doesn’t seem off (or is maybe even a little low) if you are just talking about one virtual DC anywhere in the environment. The last three agencies I’ve supported (spanning the HHS, DoD, and HHS umbrellas) all had some virtual DCs. I’m looking forward to seeing the final results.

  2. Mark says:

    Hey Joe,

    First and foremost, thank you for your great website, tools, and blogs that have helped my MS career over the many years. I work mostly in DoD environments, and all are looking to run, if not already running, their own private virtual cloud (VMware or Hyper-V). Domain controllers are a big part of this virtualization. The general rule I have seen is at least one physical DC per domain as a recovery point for a virtual network/storage failure but, other than that, to utilize the virtual environment as much as possible to save on hardware costs and quickly provision new DCs (if needed).

    Mark

  3. wkasdo says:

    Joe, I’m not sure how you would define “Enterprise”, but at the customers I see (>5000 seats, generally) virtual DCs are common. 60% is not far off. In some cases these are specialized or limited to branch offices, like you guessed. 100% virtualized is rare, and when I encounter that configuration I always get them to put the PDCe on a class-A physical machine with local storage.

    Yes, I agree that you need to know what you are doing when handling virtual DCs. I also agree with you that VM-Generation ID is just a partial solution. Still… it seems to me that you are overreacting. Just my 2 cts.

    • joe says:

      wkasdo,

      Before anything else: how many issues have you experienced, directly or indirectly, with virtual DCs? I have found that folks who haven’t had issues tend to fall on the side of “not much to worry about”, whereas folks who have encountered issues are a bit more tentative and want to understand environments and support limitations before recommending direction. The folks who have been burned by the “rarely happens” side of the coin are also on the tentative side, because they have learned that “rarely happens” or “very unlikely” is not the same as truly impossible, and once they make that realization they have to weigh out the worst impact, how they would handle it, and how the company would handle it. This is standard IT stuff though: redundancy is waste, and it is always a balancing act between how much waste you are willing to accept to mitigate risk and, on the flip side, how much risk you are willing to accept to reduce waste.

      5000 seats is what I would consider mid-sized. My personal admin experience is in environments of 75k up through a couple of hundred thousand. My technical lead / escalation / SWAT experience is wider, but usually about 10k seats is on the smaller end of things that I have ended up working with. Mostly these are environments that have follow-the-sun type operations, with people sitting all over the world at different parts of the 24-hour day handling work: North America, South America, KL, China, India, Europe, etc. Think of any country in the world that has supplied a low-cost support center, and I have probably worked at some point with a company that has worked with that center. Aside from that, though, a lot of it comes down to costs and penalties. You don’t have to be huge in order to have an environment where damage to AD or some other critical infrastructure will be measured in hundreds of thousands or perhaps millions of dollars per hour. I’ve seen them. I have also seen companies, large companies in the 150k user range, that could at one point fairly recently have lost their AD and been fine for a week or more.

      Anyway, it is currently looking like the smaller implementations are doing heavier amounts of virtualization and again, I kind of expected that.

      The larger the org, the larger the range in number, size, and quality of the support groups involved, which may seem counter-intuitive to anyone who isn’t used to large orgs (100k+). Some would think, and some often do, that the larger the org, the faster and more flexible it is, because of more money being available or whatever. It truly doesn’t work out that way in most large orgs I have seen. Some of them are almost paralyzed by process (and change control) that is designed to prevent massive screw-ups from occurring too often, because of resources who likely shouldn’t even be doing the work they are doing, and also by cost-savings concerns, because you need to squeeze every penny that comes along to make stockholders happy. For them to try to implement new tech or new processes can be excessively costly in time and human resources, if they even have enough solid resources to accomplish the tasks in the first place.

      It is tougher and tougher nowadays; the IT talent pool is much more shallow than it was even 10 years ago due to the various financial collapses. Not many are willing to stay in a field where salaries have been cut 30-70% and layoffs are a regular, real concern after one rough quarter or market downturn.

      I am curious about the “when I encounter that configuration I always get them to put the PDCe on a class-A physical machine with local storage”… Why? I am not trying to be leading or facetious, but if you trust the virtual environment enough to use it for every other DC in the environment except the FSMO roles, why wouldn’t you do them all? Corruption can replicate; I have seen it first hand. Putting one DC on physical isn’t saving anything unless you have all of your other DCs on a single SAN, and then someone just plain needs to be slapped.

      The new guidance from MSFT says that you don’t have to keep anything physical; it is ok for them all to be virtual. And that isn’t just for 2012. As mentioned previously, there is no real change in 2012 safety-wise other than locking off one piece that could hurt you, and it has been broadcast for a long time that it could hurt you. The true major enhancement is the cloning piece. So effectively any safety guidance for 2012 is valid back to 2003 SP1, other than the “if you accidentally click snapshot rollback you shouldn’t be screwed” piece. But again, that has been a known no-no for a long time.
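
      If you want to sanity-check whether that one locked-off piece is even active in your environment, something like this works (a rough sketch in Python using the ldap3 library; the DC names, credentials, and base DN are made up, so adjust for your forest):

      # Rough sketch: ask each DC whether it has recorded a VM-Generation ID
      # (msDS-GenerationId). A virtualized 2012+ DC on a hypervisor that
      # exposes VM-Generation ID writes it to its own computer object, and
      # the attribute is not replicated, so each DC has to be asked about
      # itself. All names and credentials below are hypothetical.
      from ldap3 import Server, Connection, NTLM

      DCS = ["dc01.example.com", "dc02.example.com"]   # hypothetical DC list
      BASE_DN = "dc=example,dc=com"                    # hypothetical base DN

      for dc in DCS:
          conn = Connection(Server(dc), user="EXAMPLE\\admin", password="...",
                            authentication=NTLM, auto_bind=True)
          # Look up this DC's own computer object in its own copy of the DB.
          conn.search(BASE_DN, f"(&(objectClass=computer)(dNSHostName={dc}))",
                      attributes=["msDS-GenerationId"])
          has_genid = (conn.entries and
                       "msDS-GenerationId" in conn.entries[0].entry_attributes)
          print(dc, "VM-GenID recorded" if has_genid
                else "no VM-GenID (physical, pre-2012, or no hypervisor support)")
          conn.unbind()

      A virtual DC that shows no VM-GenID is living under the 2003 SP1 rules whether anyone realizes it or not.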

      On your last paragraph, I think we agree more than disagree. I think perhaps I have had the opportunity to see more issues, and more environments with processes and personnel that make me think companies need to slow down and really work through it before jumping in head first. And again, you would think most companies would do that, but when you get a bunch of IT Managers and CIOs who are a long way away from the tech and have consultants buzzing in their ears or going off to fun conferences, it can be difficult to explain things out and what is really involved, because the folks they first spoke to have no understanding at all of the environment they are talking about.

      Overall I am absolutely fine with virtualizing DCs; it just needs to be done with full knowledge and support from all necessary involved groups. In environments that live or die based on documentation and standard process, all of that needs to be A+ quality, because those environments are often trying to live on process and documentation precisely because they know they don’t have “I.T. S.W.A.T.”-like capable support staff across the board; they can’t afford to. Again, smaller support orgs and consulting firms are more likely to have a relatively high level of capability consistently across the board. It is much easier to maintain quality if you have 50 support guys versus a couple of hundred or a couple of thousand or more. The larger you get, the more you have to depend on process and tools, and if you can’t depend on those near 100% of the time, something could slip through, and you have to understand what the implications of that are and whether or not you are willing to accept the impact.

      You want to know whether or not a given company can safely virtualize DCs? Go talk to the AD tech lead who is directly working with the guys running the systems; that is the main person I would trust, assuming they were a solid technical resource. The next thing is to look at the sev-1 issue log for the last couple of years and see what kind of issues have occurred, how long they took to fix, and the quality of the root cause analysis docs. What? The company doesn’t track sev-1 issues? That is a strike against doing anything complex right from the word go.

      joe

  4. wkasdo says:

    Hi Joe,

    I need to re-phrase this a bit. You are working for the truly large enterprises of this world, and the advice and opinions that you have reflect this. What I was thinking is that 90% of your readers manage smaller environments than this. Where I live, our customer base has maybe 10 customers at 50,000 seats and over. My team of engineers typically encounters customers between 5,000 and 20,000 seats. That accounts for the majority of the work.

    For these customers, the impact of a global AD disaster is less than for a 150,000-seat giant (exceptions noted). Many of them adopt the “virtualize everything” attitude.

    > Putting one DC on physical isn’t saving anything unless you have all of your other DCs on a single SAN and then someone just plain needs to be slapped.

    That is exactly what I see happening. Worse, they think they are safe when they have twin datacenters and have the SAN replicate everything between them.

    What we have seen from the VM-related AD disasters that we have encountered is that having one physical DC would have allowed them to get back up to speed much quicker. DNS works, you can log on, the ESX layer starts up and you can actually manage it (vCenter), you have a working source for dcpromo, etc. Is it a 100% solution? No, of course not, as you convincingly argued. In the end, you need a proven backup as a last resort. And as you say, if all your AD is physical you are simply less vulnerable. No argument.

    So yes, I see your point. I really do. I’m just wondering if it is the most realistic advice for most companies.

    • joe says:

      As I have said in previous posts, my issues aren’t with virtualization per se; it is the lack of reflection of the additional risk in the mindsets and processes that I seem to encounter with regular frequency. I have run into multiple environments where not even the most basic MSFT CYA advice is followed. I have actually run into environments that specifically spell out snapshot rollback as a valid recovery method for DCs in multidomain forests, and more often I have spoken with management who considered that one of the benefits of using virtual DCs. The mental horsepower just doesn’t seem to be there sometimes to get a true understanding of mesh/distributed computing mechanics. You can’t blame them; many of the companies producing monitoring products don’t actually get mesh/distributed mechanics either.
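
      If the danger isn’t obvious, here is a toy model of the bookkeeping (plain Python, nowhere near the real USN / up-to-dateness vector machinery, but the skipping behavior is the point):

      # Toy model of why snapshot rollback breaks replication: partners
      # track the highest USN they have consumed per source invocationID,
      # and a rolled-back DC re-issues USNs that partners silently skip.

      class DC:
          def __init__(self, name, invocation_id):
              self.name = name
              self.invocation_id = invocation_id
              self.usn = 0
              self.changes = []   # (invocationID, USN, payload)
              self.utd = {}       # highest USN consumed per invocationID

          def write(self, payload):
              self.usn += 1
              self.changes.append((self.invocation_id, self.usn, payload))

          def replicate_from(self, src):
              for inv, usn, payload in src.changes:
                  if usn > self.utd.get(inv, 0):
                      print(f"{self.name}: applied {payload!r} (USN {usn})")
                      self.utd[inv] = usn
                  # USNs at or below the high-water mark are skipped. That is
                  # normal dedupe, but after an unsafe rollback it means
                  # silently lost writes.

      dc1, dc2 = DC("DC1", "inv-aaaa"), DC("DC2", "inv-bbbb")
      dc1.write("create user Alice")
      snapshot = (dc1.usn, list(dc1.changes))   # hypervisor snapshot taken here
      dc1.write("create user Bob")
      dc2.replicate_from(dc1)                   # DC2 consumes up to USN 2

      dc1.usn, dc1.changes = snapshot           # admin rolls DC1 back
      dc1.write("create user Carol")            # Carol re-uses USN 2
      dc2.replicate_from(dc1)                   # USN 2 "already seen": skipped

      # With VM-Generation ID support (2012+ DC on a cooperating hypervisor),
      # the restored DC detects the generation change and takes a new
      # invocationID, so Carol's write would replicate under the new ID
      # instead of being silently dropped.

      Run it and Carol simply never shows up on DC2, with no error anywhere. That is what “valid recovery method” buys you.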

      Ironically, the environments that truly seem to understand what needs to be done to properly support it in a minimal-risk manner, and that have the capability to do so, are often very unlikely to do so. Some very large, serious tech companies with scary smart tech resources are willing to assist customers to virtualize but won’t virtualize their own internal corp forest. That, though, simply reflects the difference between the goals of IT and Sales.

  5. Sean says:

    Good poll. I have wondered about this too. There is talk on the internet like everyone is doing it, but then when you go to conferences and actually talk to people running most large corporate environments, they look at you funny, or they say they are doing it but then backtrack and say it isn’t their main corporate forests but other utility forests on the network, or just certain groups.

    I think “enterprise sized companies” start around multiple tens of thousands of users with global sites. The smaller environments would be small to mid-sized and are likely more consolidated, and they aren’t measuring their site outages in hundreds of thousands of dollars per hour and corporate outages in the millions or tens of millions per hour. A company like AT&T or Toyota Motors, for example, would be in really bad shape if they lost corporate authentication for a day or, worse, days. But a company like Tesla Motors, perhaps not so much. I imagine that if a serious issue occurred with authentication at AT&T, the CIO wouldn’t just be in trouble, he would be put in front of a firing squad, so the consideration of risk isn’t just to the company assets but personal as well.

    We aren’t anywhere near AT&T big, but we run our production single domain forest environment on hardware. We have thought about virtual but have refrained for a lot of the reasons listed, plus the concern that no matter how well you follow process, you still have enhanced risk and less cushion for mistakes, and there are always mistakes; no one is perfect, and if that weren’t the case you wouldn’t need change control. So even with a little risk, the results of an issue could be bad. As my manager says: his brother is a police officer, and he isn’t likely to be shot in the course of his job even though he is out in a cruiser arresting people nearly every working day, but he wears a bulletproof vest anyway; not because he expects to get shot, but because he wants to lessen the danger in case it does happen.

    We have lost two different lab environments that were all virtual. Specifically, we hit really weird issues with cross-domain functions and Exchange acting hokey, and no one could figure out why or thought it was worth spending weeks trying. It really sucked because it put us behind on projects, but they were labs and not production-line critical. We were never sure what we did wrong, as we thought we followed the rules properly. We now have three separate single domain forest virtual labs, which more closely matches our production forest, and things have worked out well so far; if we lose one, it will hopefully be just the one and not the whole environment. However, now we are adding several physical machines to replicate production hardware, as we recently performed a firmware/driver update on some DCs and something went hokey and they got really slow. We rolled back and everything is good again, but it took us a while to figure it out, which angered the application people. Now we are required to test all firmware/driver updates in the safety of the lab and to the satisfaction of the line-of-business teams. It will slow things down, but we did break them, so we have to deal with the consequences of that.

    I wanted to say AdFind rocks. I use it every day. When is the next update coming out? Did you convert it to Visual Studio yet like you blogged about? I know you mentioned that should reduce the size of the tool. Also, you probably don’t remember me, but I talked to you at TEC after the great Joe and Dean Show presentation, and you mentioned you had written a tool that was multithreaded and could give “up to the second” replication times to all of the DCs in a domain or forest. I think you called it Puddle? Did you ever release that under a different name or something? I have looked a couple of times but haven’t found it yet.

    Thanks for everything you share. We appreciate it in the trenches. Also come back to ActiveDir.org, you don’t ever seem to answer questions there anymore.

    • joe says:

      Sean: Thanks for the feedback. I like analogies in general, and I really like the police analogy. I was talking the other day with a math type guy who happens to be an architect, and he went on a tear about numbers and risk, etc., and how many companies are likely to be hurt pretty badly at some point because they always figure the bad stuff will happen to someone else.

      On the cross-domain functions issue, that is interesting. I was just working with my co-author on an issue he was hitting in a customer’s virtual AD lab recently, and cross-domain trust and replication issues were among the symptoms. We were digging through source code and trying to figure out various hacks, and in the end, IIRC, he declared that lab a complete loss and told them to rebuild it. For some odd and interesting reason the customer seemed to think the lab environment worked fine…

      I am unsure on the next AdFind update. I have been seriously busy with my real job and life, so it is hard to dedicate enough time for dev work; 10 minutes here or there is not enough to do it. So no, it also isn’t converted to Visual C++ yet.

      Puddle… LOL. Hahahaha. That cracked me up; that should be the name, it is funnier than the real name. The utility is called Ripple. It basically emulated tossing a pebble into the AD “puddle” and then watching for the ripples to hit the edges. I actually first started working on it back in about 2000 or so, but I never got it to scale well once the number of DCs hit several hundred, since it spawned a single thread for every DC. I need to rework the whole threading model in it. I should get back to it, as it was pretty cool. It would tell you live what was going on with replication.
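
      For the curious, the shape of it is roughly this (a sketch in Python rather than the real code; the DC names, target DN, and credentials are all made up):

      # Ripple-style convergence check: stamp a marker on a throwaway test
      # object via one DC, then poll every DC with a bounded worker pool
      # until each one sees the marker replicate in.
      import time
      from concurrent.futures import ThreadPoolExecutor
      from ldap3 import Server, Connection, NTLM, BASE, MODIFY_REPLACE

      DCS = [f"dc{i:02}.example.com" for i in range(1, 11)]    # hypothetical
      TARGET_DN = "cn=ripple-test,ou=lab,dc=example,dc=com"    # throwaway object
      MARKER = f"ripple-{int(time.time())}"

      def bind(host):
          return Connection(Server(host), user="EXAMPLE\\admin", password="...",
                            authentication=NTLM, auto_bind=True)

      # Toss the pebble: write the marker through the first DC.
      origin = bind(DCS[0])
      origin.modify(TARGET_DN, {"description": [(MODIFY_REPLACE, [MARKER])]})
      origin.unbind()
      start = time.time()

      def wait_for_ripple(host):
          conn = bind(host)
          deadline = start + 300                  # give up after five minutes
          while time.time() < deadline:
              conn.search(TARGET_DN, "(objectClass=*)", search_scope=BASE,
                          attributes=["description"])
              if conn.entries and MARKER in conn.entries[0].description.values:
                  return host, time.time() - start
              time.sleep(1)
          return host, None

      # A few dozen workers cover hundreds of DCs without a thread per DC.
      with ThreadPoolExecutor(max_workers=32) as pool:
          for host, secs in pool.map(wait_for_ripple, DCS):
              if secs is None:
                  print(f"{host}: no ripple before deadline")
              else:
                  print(f"{host}: converged after {secs:.1f}s")

      The bounded pool is the part my original threading model got wrong; a thread per DC is exactly what fell over once the count hit several hundred.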

      ActiveDir.org… I do miss that list. My main PC, the one I used for my personal joeware email including the list work, has been down, and I need to completely rebuild it. I could go through the email in the Gmail web interface (joeware is hosted out of Google Apps now), but it isn’t really conducive to lists. It is hard enough just to respond to regular email on it.

      I am glad the tools are useful to you.

      joe
