Over on the “Ask the Directory Services Team” blog there is a post about Lag Sites. I really disliked what was written and left a nice long comment. I am not sure whether it will be posted, and I also wanted my comments about lag sites to reach folks who possibly don’t read that blog.
My personal thoughts on lag sites are that they can be a good thing. They are not, and visualize me saying this three times out loud to you, they are not a COMPLETE DISASTER RECOVERY SOLUTION. However, they can be PART of an overall DR solution, even an integral part that can help you meet tight SLA/SLO goals.
Better than a lag site, IMO, would be tombstone reanimation combined with a snapshot data recovery mechanism, but that is a Windows Server 2008 kind of thing, and if the customers I work with are any indication, it will be a while before most large orgs are in a position to use that. In the meantime, lag sites work with everything all the way back to OEM Windows 2000.
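For reference, the Server 2008 piece I am talking about works roughly like this: take a snapshot of the directory database with ntdsutil, mount it with dsamain as a read-only LDAP instance, and pull the old attribute values from it to repair or reanimate objects in the live directory. This is a minimal sketch; the snapshot path and the port number are examples, not a prescription.

    ntdsutil
      activate instance ntds
      snapshot
      create
      list all
      mount 1
      quit
      quit

    rem Expose the mounted snapshot copy of ntds.dit as a standalone LDAP instance.
    rem The $SNAP_... path comes from the mount step above; 10389 is just an example port.
    dsamain /dbpath C:\$SNAP_200810201200_VOLUMEC$\Windows\NTDS\ntds.dit /ldapport 10389

Point ldp.exe or Users and Computers at port 10389, read the values you need, and fix up or reanimate the live objects from there.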
Here is the blog post:
http://blogs.technet.com/askds/archive/2008/10/20/lag-site-or-hot-site-aka-delayed-replication-for-active-directory-disaster-recovery-support.aspx
These are my comments to the post:
All I read here is that you need to know what you are doing and should have a clear design and operational model in mind if you use lag sites. I would argue you should know what you are doing and have a good plan if you are responsible for AD at all.
Everything you list here can be covered if you have knowledgeable, informed admins. If you have an unwitting admin, you already have a problem; maybe the lag site will help you catch the problem and eradicate it before something serious happens.
Even a repadmin /force can be stopped dead in its tracks if necessary. The methods may or may not be supported by PSS, but that doesn’t mean they don’t work just fine. Lots of things in the real world that work just fine aren’t supported by PSS… yet.
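As one example of the kind of lever involved, repadmin itself can pause inbound replication on the lag-site DC. This is a sketch only, with an example DC name; whether this or any other measure is enough to stop a forced sync in your environment, and whether PSS blesses it in your scenario, is exactly the gray area I mean.

    rem Pause inbound replication on the lag-site DC (example name)
    repadmin /options LAG-DC01 +DISABLE_INBOUND_REPL

    rem Check the current options, then re-enable inbound replication when you are done
    repadmin /options LAG-DC01
    repadmin /options LAG-DC01 -DISABLE_INBOUND_REPL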
Point: Lag sites are not guaranteed to be intact in a disaster
Counterpoint: Ditto for backups. You should hopefully (again, not guaranteed if your admins are the idiots that couldn’t properly run a lag site) have enough backups over time to go back far enough, but if the issue occurred longer ago than the tombstone lifetime (TSL), you are SOL either way. Several years ago a friend of mine presented to the DS PG an awesome way to poison backups that was completely undetectable under normal circumstances. That hole has since been plugged because of that conversation, but if MSFT now guarantees that backups will be intact in the event of a disaster, I would like to see that guarantee in writing. If not, this point is moot with the understanding that lag sites are not a COMPLETE DR solution, but could be PART of an overall solution.
Point: Replicating from a lag site might have unrecoverable consequences
Counterpoint: And restoring from tape also uses out-of-date data, correct? Do you have the same concern there? Logic says you should. A schema update can have the same kind of consequences; should we not do those either? These are the same scare tactics folks used in the early days of AD to warn people off of schema changes. We quickly learned that if we know what we are doing and use proper precautions and procedures, we will be fine. I especially enjoyed the “may have to do a forest recovery…” bit. Had that been presented to me in a meeting with MSFT in front of a customer, I likely would have been unable to control my chuckling.
Point: Lag sites pose security threats to the corporate environment
Counterpoint: This one gave me a good chuckle too. Ever hear of normal slow convergence across a large enterprise? Ever hear of Kerberos tickets? At what point did Kerberos start validating whether a currently unexpired ticket is tied to a disabled or deleted userid? Yes, there *could* be additional issues if auth is possible through the lag site, but that is simply a design and operational criterion to take into account for lag sites, just as it is for normal overall convergence of data churn. The same bad thing can happen when replication gets plugged, when a site is normally more latent than other sites, with “official” lag sites, or when someone adjusts Kerberos ticket configuration settings. It isn’t an “oh my god the sky is falling, don’t do lag sites because of this” item.
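If you want to see that window for yourself, look at the end times on already-issued tickets on a client. klist is in-box on newer Windows versions and in the support tools on older ones; nothing about a lag site changes the fact that a ticket stays usable until it expires.

    rem List the cached Kerberos tickets and their end times on a client
    klist

    rem Purge cached tickets so the next auth gets fresh tickets (and a fresh account check)
    klist purge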
Point: Careful consideration must be put into configuring and deploying lag sites
Counterpoint: Of course. Careful consideration must be put into configuring and deploying ANY site, as well as ANY domain, ANY forest, or ANY domain controller.
You likely should have stopped your post after stating that one week is the hard-coded upper limit on normal replication schedules. The rest of it was unhelpful and again reminded me of all of the “schema updates are bad” scare tactics that went around in the initial years of AD.
If you wanted, you could have stated that lag sites need to be properly planned. They need to be properly managed. They aren’t a complete DR plan, but they can be part of an overall DR plan that is used for various scenarios along with tombstone reanimation, snapshot data recovery in Windows Server 2008, and, god forbid, tape recovery. As a personal point of interest, I would much rather restore objects out of a lag site than from a backup file. I trust the lag sites more than I trust the backup/restore process.
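For what it is worth, the lag-site restore I have in mind looks roughly like this on the lag-site DC. It is a sketch with example names; the ntdsutil portion runs from Directory Services Restore Mode on 2000/2003 (or with the AD DS service stopped on 2008, where you also run “activate instance ntds” first), and it assumes the deletion has not yet replicated to the lag DC.

    rem Keep the deletion from reaching the lag DC while you work (example DC name)
    repadmin /options LAG-DC01 +DISABLE_INBOUND_REPL

    rem Mark the still-live objects on the lag DC authoritative (example OU)
    ntdsutil
      authoritative restore
      restore subtree "OU=Sales,DC=contoso,DC=com"
      quit
      quit

    rem Push the restored objects back out, then resume normal inbound replication
    repadmin /syncall LAG-DC01 /e /d /A /P
    repadmin /options LAG-DC01 -DISABLE_INBOUND_REPL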
Going forward, please don’t give advice based on misinformation, little information, or just plain “let’s scare ’em” type scenarios.
The wrap-up is that a lag site is simply a site that replicates on a longer convergence schedule than “normal” sites, possibly up to a week out of convergence. This is a configuration fully supported by MSFT. It just isn’t supported as your sole Disaster Recovery solution, and it shouldn’t be, because it isn’t a full Disaster Recovery solution.
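It is also easy to keep an eye on. repadmin will happily tell you how far behind the lag DC actually is; the DC name below is, again, just an example.

    rem Largest replication deltas per DC; the lag DC should show roughly the intended lag
    repadmin /replsummary

    rem Per-partner replication detail for the lag-site DC
    repadmin /showrepl LAG-DC01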