More comments on the Active Directory Services Team blog concerning Lag Sites. My friend Guido, probably one of the top guys in the industry in terms of understanding the backup/recovery solution space for Active Directory, stepped up and commented as well. He didn’t even know I had left a comment, and later pinged me to mention how similar our responses were.
Here was the response I received if you don’t feel like going to the site…
Hi Joe,
Great comments!
To add a few thoughts here (as Gary is out for a few days; I’ll let him reply in depth when he returns).
The lag site is *not* a fully supported scenario. That is the point of this post. If you call me and my team here and ask for advice on how to best configure a lag site, we will tell you the same. ‘Supported’ has a very specific meaning when you talk to our product group and us – it means we exhaustively test the scenario: this is not done for lag sites. It’s also why, if you read our TechNet documentation, you will not find a guide to creating lag sites.
The other main point that Gary was trying to reach is that we have found in Support that many thousands of customers have been using Lag Sites *exclusively*. They don’t use, maintain, or test their systemstate backup systems – then we work tons of cases each year where they thought that their lag sites would save them, and they did not. So this wasn’t directly pulled from Gary’s behind – we have 10 years of 3rd tier support cases evidence to back it up.
And your main point is well taken – you probably will not have good backups or a good disaster recovery strategy if you’re not doing your job as an admin.
(PS: love your webpage, tools, and general AD passion)
– Ned
This was my response
Hey Ned, glad you enjoy the utilities/site/etc. 🙂
So which part of the lag site concept isn’t supported?
My understanding from speaking to various folks around MS within PSS and the PG is that what isn’t supported is using a lag site as the sole DR mechanism. Again, I fully agree with that. That is an insane position to put yourself into.
Anyway, let’s break it down to some of the various components that may or may not be used in any given lag site configuration…
* Delayed replication sites are supported.
* Auth restoring objects on any arbitrary DC in a domain is supported.
* Disabling registration of domain SRV record specific DNS entries pointing to a given site is supported.
* Disabling replication entirely (or shutting DCs down) for periods not exceeding the forest TSL on a given DC or every DC in a site is supported.
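To make the TSL limit in that last bullet concrete, here is a minimal Python sketch of the date arithmetic. The function names are made up for illustration, and the tombstone lifetime is passed in rather than assumed, since it has to be read from the forest in question. This only does the arithmetic; it does not query AD.

```python
from datetime import datetime, timedelta


def within_tombstone_lifetime(last_replicated, now, tsl_days):
    """Return True if the DC has been out of replication for less than the TSL.

    last_replicated and now are datetime objects; tsl_days is the forest's
    tombstone lifetime in days (supplied by the caller, not looked up here).
    """
    return (now - last_replicated) < timedelta(days=tsl_days)


def days_of_margin(last_replicated, now, tsl_days):
    """Whole days left before the DC's data ages past the tombstone lifetime."""
    return tsl_days - (now - last_replicated).days


if __name__ == "__main__":
    # Hypothetical example: a lag site DC that last replicated a week ago,
    # in a forest whose TSL was read as 60 days.
    last = datetime(2008, 1, 1)
    today = datetime(2008, 1, 8)
    print(within_tombstone_lifetime(last, today, 60))
    print(days_of_margin(last, today, 60))
```

The point of keeping the check this simple is that the lag interval for a site should sit comfortably inside the TSL, not anywhere near it.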
I have been involved in various situations where PSS has indicated one or more of each of those be done for a given situation. Heck, anyone who has been on a call with a customer and PSS in a major accidental deletion incident has likely heard “has the deletion replicated to all DCs in the domain?” and, if not, that is followed by “stop replication to that DC immediately and let’s restore the objects from there”. I have heard a multitude of stories from the PG that started that way. Every time that is done it is an acknowledgement of the concept of the lag site.
Will PSS help someone set up a lag site if someone asks for that specific thing? Sounds like no, and I can understand the reticence to do so unless you have a thorough understanding of the overall DR plan/process for a given customer. Will PSS help a customer set up a site to replicate on a schedule that is measured in days instead of hours or minutes? Absolutely, I have talked to customers who have been walked through the process by PSS. Will PSS help a customer auth restore an object from any arbitrary DC? Absolutely, I have seen it with my own eyes. Ditto for the other items.
What seems to be the issue PSS has is the intent behind the uses of these features in the technology, not the use of the features themselves.
The comment that “many thousands of customers” have been using lag sites exclusively scares me. That suggests to me that someone at MSFT isn’t getting the concepts of how backup/restore works in AD out there very well. I am also just surprised to hear that number. I work in a very large services org for my full-time job and have dealt with many large customers over the years, and I have seen very few instances of lag sites that I wasn’t involved in setting up in some way. Smaller companies never seem that interested due to the hardware and OS licensing investment.
Not to bust your chops, but I think the 10 years of cases is a bit of an exaggeration, Ned. We are on the 7th year of truly popular use of AD (though some of us had it in large scale Fortune x if not Fortune xx production as early as 99 or 2000) and lag sites didn’t really start catching mainstream attention until several years into AD being in production. Some of us picked up on the idea that a latent (non-converged) site (which is what those of us who were publicly discussing it called it initially) could be used for this type of recovery, but the people talking about it were people who could work it out on their own and also understood the repercussions. I recall the first time I heard the “lag site” moniker was at one of the DEC conferences four or five years ago, at which point the concept started to explode.
Anyway, people do a lot of stupid things in their production ADs. Lag sites are a relatively painless and innocuous item. I am far more worried about, and have seen far more issues with, DC virtualization than lag sites, though I do recommend lag sites be running on virtual machines when I recommend lag sites. 😉 And yes, I do officially recommend them to companies. I also give them the caveats of when a lag site is and isn’t good to use and make sure they fully realize it is a mitigator, not a total DR solution.
Let’s face it, setting up a lag site isn’t rocket science. If someone can’t work it out themselves, they likely shouldn’t be doing it for a variety of reasons. Being who I am I would also go as far as to say they probably shouldn’t be running AD at all, but that’s just me. No one who has to call PSS to ask how it should be set up should be doing it.
joe
Lag sites can be a real time-saver for those “oh, crap” moments, especially when used in conjunction with regular LDIF exports so you can fix group membership changes that happened since the last backup.
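As a rough illustration of how regular LDIF exports help with group membership, here is a minimal Python sketch that compares two exports and reports which `member` values were added or removed per group. It is a sketch under stated assumptions, not a full LDIF parser: it assumes one attribute per line, no folded (continued) lines, and no base64-encoded `member::` values. The function names are made up for illustration.

```python
def parse_members(ldif_text):
    """Map each entry's DN to the set of its 'member' values.

    Assumes simple LDIF: 'dn: ' starts an entry, a blank line ends it,
    no folded lines, no base64 ('member::') values.
    """
    groups = {}
    current = None
    for line in ldif_text.splitlines():
        if not line.strip():
            current = None  # blank line ends the current entry
        elif line.startswith("dn: "):
            current = line[4:]
            groups[current] = set()
        elif line.startswith("member: ") and current is not None:
            groups[current].add(line[8:])
    return groups


def membership_changes(old_ldif, new_ldif):
    """Return {dn: {'added': set, 'removed': set}} for entries whose members differ."""
    old, new = parse_members(old_ldif), parse_members(new_ldif)
    changes = {}
    for dn in old.keys() | new.keys():
        removed = old.get(dn, set()) - new.get(dn, set())
        added = new.get(dn, set()) - old.get(dn, set())
        if removed or added:
            changes[dn] = {"added": added, "removed": removed}
    return changes
```

Feeding it the export taken before an incident and one taken after gives a list of memberships to put back by hand once the objects themselves have been restored.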
As far as the licensing goes, Microsoft virtualization is an excellent way to introduce a lag site with minimal cost (being able to utilize the bundled “free” OS licenses with Windows 2003/2008 EE or DE) and effort. The biggest problem with restoring DCs, of course, is the similar hardware requirement when restoring the system state. Granted, it’s not a big deal for those of us in larger orgs that buy servers by the dozens or hundreds, but it’s still a concern, nonetheless. Virtualization lessens the blow of that requirement as well, making it even easier to build a DR strategy that includes both backups and lag sites.
Ned Pyle had an update (sort of) to this Lag Sites discussion here:
http://blogs.technet.com/askds/archive/2008/11/15/follow-up-on-lag-sites-sort-of.aspx