One of the things I love about the MVP summit is getting together with really smart people and discussing various deployment architectures.
One of the topics of conversation during a get-together at the Experience Music Project social event was how to make a high visibility public web site, based on Windows Servers in an Active Directory domain, highly available.
First you need to discuss what highly available AD means…
The ability to log on despite a single DC or infrastructure failure is only one aspect of a highly available environment. What about being available through
- security compromise
- administrative "oh shit!" (OS!) events
- updates that just didn’t go right
While AD naturally has a fault tolerant distributed deployment model, that does nothing to help with those types of issues. In fact, depending on how it is all managed, a distributed deployment model could contribute to the possibility of these issues as well as the overall impact.
Security compromise… For the most part, *most* companies *probably* don’t have to worry about someone outright attacking their AD environment. However, that doesn’t mean no one has to worry about it. In those companies where these concerns are real, security needs to be in the front seat for the high availability discussions… Think of the military, think of the government, think of the NSA, think of NASA, think of Microsoft, think of Apple, think of very large companies that are likely targets for corporate theft/espionage, think of companies using AD in a DMZ or similar for internet facing applications. The directory is in an exposed position, and it is pretty much a certainty that at some point someone will know more about how to compromise it than the person running the AD does. Don’t take that as an insult… In the battle of good versus evil in the security world, you as the good guy have to be on the ball and right 100% of the time; the bad guy only has to be right once. Due to the nature of AD, once one DC is compromised, it is a short step to compromising every DC in the forest.
"OS!"… Everyone needs to be concerned about "OS!" events. PERIOD. We are, I believe, all humans, humans make mistakes, failure to take that into account in the first place is just one more failure to add onto the list of items you are reviewing when performing the failure analysis. These types of mistakes made to the directory will quickly (you wanted low convergence times right?) replicate around your entire domain/forest. You accidently delete all users in an OU and soon they will be gone from all DCs.
Good updates going bad… I think many of us, especially those of us who have been in this business a long while, have seen this happen. Something worked great in the lab, and out in production something goes left instead of right, and you are standing there going WTF[1]? And those without a production environment at all… Well, they really are likely to have an issue. What do I mean when I say you don’t have a production environment??? Let me quote something Don Hatcherl[2] said on ActiveDir.Org when someone said they just had a production environment and no lab environment…
I have to make a comment here, as I’ve heard this too many times. You do, in fact, have a lab environment. What you do not have is a production environment.
DonH
I have a great story about updates going bad from when I was working for a Fortune 5 and the Microsoft Consulting guys were testing schema updates in the lab (yes, we had an official lab). Everything looked great to them, and the testing went on for months, so you would think any issues in there would have been found. Well, it comes to production and ugh… we have mangled attribute names on several attributes. This is just one example of something that can go wrong. Fortunately that was pretty easy to fix, but some other updates that go bad aren’t quite as easy to identify and fix. Anyone who ran into the TCP Chimney issues[3][4][5][6] with Windows Server 2003 SP2 can probably attest to that, as it usually took some time to work out what was going on. That issue hit all DCs as well, but thankfully it wasn’t damaging, not like say… applying a kiosk GPO at the domain level and locking all machines down to kiosk mode, or mucking with the machine certs of every machine in the domain, or changing other security settings. All of which will replicate with lightning speed to the whole environment.
If AD’s natural fault tolerance doesn’t protect you from these types of issues, can you protect yourself enough to build something where highly available AD means taking care of these items as well? It depends. It depends on what highly available actually means to you and your company. That answer will vary, and the work and architecture you need to put into place to cover for it will also vary. Like security, this is a sliding scale that you need to slide to your sweet spot – or at least the spot you can deal with. For most companies, there is going to be a "good enough" point where they stop worrying about it because the money and resources needed to account for the problem exceed their concerns about the problem.
Back to the public web site…
In this environment, all three of these issues are very realistic and likely… in fact, even expected. These absolutely would be on the table as issues every single day for the admins who have to run it. This environment must be absolutely available all of the time. Downtime runs in the thousands of dollars per minute, or perhaps even thousands of dollars per second. The environment absolutely would be a target for hackers and couldn’t afford downtime due to administrative OS! events or update failures.
The first thing a configuration like this needs, which really isn’t about AD, is physical location redundancy. You do this by putting the web servers and domain controllers in multiple data centers. Say 4 data centers in North America, 4 data centers in Europe, 4 data centers in Asia. Regional failure/capacity planning says that you can lose a single data center and maintain standard performance; if you lose two in the same region, the site will still work but with reduced performance, which maybe costs you only a couple tens of thousands of dollars per hour, and that may be acceptable for short periods versus the cost of beefing up even more. (A rough sizing sketch follows below.)
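To make that sizing argument concrete, here is a minimal back-of-the-envelope sketch in Python. The peak load number and the math are my own assumptions for illustration; the only things taken from the text above are "4 data centers per region" and "survive the loss of one at full performance."

```python
# Rough N-1 / N-2 capacity sizing sketch for one region (hypothetical numbers).

def per_dc_capacity(regional_peak: float, data_centers: int, tolerated_failures: int) -> float:
    """Capacity each data center must carry so the region still serves the
    full regional peak with 'tolerated_failures' data centers down."""
    surviving = data_centers - tolerated_failures
    if surviving <= 0:
        raise ValueError("cannot lose every data center in the region")
    return regional_peak / surviving

regional_peak_rps = 100_000          # assumed regional peak, requests/sec
dcs = 4                              # data centers per region, per the post

# Size for full performance with one data center down (N-1):
n1 = per_dc_capacity(regional_peak_rps, dcs, 1)   # ~33,333 rps per data center

# With two down (N-2), the same hardware only carries part of the peak,
# which is the "reduced performance" case described above:
served_with_two_down = n1 * (dcs - 2)             # ~66,667 of 100,000 rps

print(f"Per-DC capacity for N-1: {n1:,.0f} rps")
print(f"Load servable with two DCs down: {served_with_two_down:,.0f} of {regional_peak_rps:,} rps")
```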
The next thing you need, which again isn’t about AD, is web/app server redundancy. You can throw as many web/app servers into every data center as you feel is needed to maintain availability. Also, with the cool virtualization failover and resource management scenarios from VMware, like VMotion, VMware HA, and VMware DRS, you don’t need quite as many web/app servers at any given moment to still have good redundancy.
Now we come to Active Directory. What is the best way to set AD up for this environment? The default thinking would be to set up a single domain forest with lots of DCs in each site. This might work out, but I think it is wrong and you can’t properly address the three issues previously mentioned. AD is not isolated, and a single forest cannot be isolated no matter how you try to break things up – staggering replication, OU security separation, whatever – it is still all connected. Any security issues, changes, or mistakes that impact the whole forest impact every single data center. Obviously that would be silly after taking the time and money to break up the web site into different physical data centers in the first place. So what do we do? IMO… You have a single domain forest dedicated to *each* *individual* data center. No trusts, no interconnections, firewalled off from each other, completely free standing in each case. Updates only occur in one data center at a time and don’t move on to another data center until everything is validated as working 100% (a sketch of that staged rollout follows below). But the costs of separate forests… Oh my! Oh wait, you are paying for separate data centers, what is the small cost of the extra forests compared to that? Seriously.
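As a minimal sketch of what "one data center at a time, validate before moving on" could look like operationally, here is an illustration in Python. The site names, apply_update(), and validate() are hypothetical placeholders, not anything from this post; the only idea carried over is that a failed validation stops the rollout before it touches the next independent forest.

```python
# Sketch of a staged rollout across independent, free-standing data centers.
from typing import Callable, Sequence

def staged_rollout(
    data_centers: Sequence[str],
    apply_update: Callable[[str], None],
    validate: Callable[[str], bool],
) -> None:
    """Apply an update to each forest/data center in turn, halting at the
    first one that fails validation so the blast radius stays at one site."""
    for dc in data_centers:
        apply_update(dc)
        if not validate(dc):
            # Stop here; the remaining data centers keep running the
            # known-good configuration while this one is fixed or rolled back.
            raise RuntimeError(f"update failed validation in {dc}, rollout halted")
        print(f"{dc}: update validated, moving on")

# Example usage with made-up site names and stand-in check functions:
if __name__ == "__main__":
    sites = ["na-east-1", "na-west-1", "eu-west-1", "ap-south-1"]
    staged_rollout(
        sites,
        apply_update=lambda dc: print(f"{dc}: applying update"),
        validate=lambda dc: True,  # stand-in for real smoke tests
    )
```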
Thoughts?
joe
[1] For those who don’t know this term… We shall say it is the NetBIOS name of the Windows Forest called "WindowsTestForest.loc". 🙂
[2] AD God
[3] http://msexchangeteam.com/archive/2007/07/18/446400.aspx
[4] http://blogs.msdn.com/jamesche/archive/2007/12/19/having-network-problems-on-win2003-sp2.aspx
[5] http://www.cisco.com/en/US/ts/fn/620/fn62761.html
[6] http://support.microsoft.com/kb/945977
You’ve concentrated on the environmental and AD issues, but what about the people side? Are you going to have an SA team in each region? Are you going to have some sort of auditing or oversight function?
My experience with large AD implementations is that nobody ever thinks about what actually has to happen when that Oh Shit moment happens. They concentrate on the technology and forget someone has to water and feed it. And someone has to do something when Hal won’t open the pod bay door.
When Hurricane Katrina quite literally drowned a bunch of the Evil Empire’s DCs in the US, the only person with the access, experience, and ability to work remotely to restore services was one really pissed off Exchange administrator in Sydney. Pretty sad for a company with offices and technical staff in more countries than Microsoft.
“…applying a kiosk GPO at the domain level and locking all machines down to kiosk mode…”
Really, when has one of those damn kiosk GPOs ever once (or twice) impacted a company before :-).