joeware - never stop exploring... :)

Information about joeware mixed with wild and crazy opinions...

Active Directory Lag Sites

by @ 2:05 am on 10/22/2008. Filed under tech

Over on the “Ask the Directory Services Team” blog there is a post about Lag Sites. I really disliked what was written and left a nice long comment. I am not sure whether it will be posted, and I also wanted my comments about lag sites to reach folks who possibly don’t read that blog.

My personal thoughts on lag sites are that they can be a good thing. They are not, and visualize me saying this three times out loud to you, they are not a COMPLETE DISASTER RECOVERY SOLUTION. However they can be PART of an overall DR solution. Even an integral part that can help you meet tight SLA/SLO goals.

Better than a lag site, IMO, would be tombstone reanimation combined with a snapshot data recovery mechanism, but that is a Windows Server 2008 kind of thing and, if the customers I work with are any indication, it will be a while before most large orgs are in a position to use that. In the meantime, lag sites work with everything all the way back to OEM Windows 2000.

 

Here is the blog post

http://blogs.technet.com/askds/archive/2008/10/20/lag-site-or-hot-site-aka-delayed-replication-for-active-directory-disaster-recovery-support.aspx

 

These are my comments to the post:

All I read here is that you need to know what you are doing and should have a clear design and operational model in mind if you use lag sites. I would argue you should know what you are doing and have a good plan if you are responsible for AD at all.

Everything you list here can be covered if you have knowledgeable informed admins. If you have an unwitting admin, you already have a problem, maybe the lag site will help you catch the problem and eradicate it before something serious happens.

Even the repadmin /force can be stopped dead in its tracks if necessary. The methods may or may not be supported by PSS but it doesn’t mean they don’t work just fine. Lots of things in the real world aren’t supported by PSS… Yet…. that work just fine.

Point: Lag sites are not guaranteed to be intact in a disaster:

Counterpoint: Ditto for backups. You should hopefully (again, not guaranteed if your admins are the idiots that couldn’t properly run a lag site) have enough backups over time to go back far enough, but if the issue occurred pre-TSL, you are SOL either way. Several years ago a friend of mine presented to the DS PG an awesome way to poison the backups that was completely undetectable under normal circumstances. That hole has since been plugged because of that conversation, but if MSFT now guarantees that backups will be intact in the event of a disaster, I would like to see that guarantee in writing. If not, this point is moot with the understanding that Lag Sites are not a COMPLETE DR solution, but could be PART of an overall solution.

Point: Replicating from lag site might have unrecoverable consequences

Counterpoint: And restoring from tape is also using out-of-date data, correct? Do you have the same concern there??? Logic says you should. Also, doing a schema update can do the same; should we not do those either? These are the same scare tactics used in the early days of AD to warn folks off doing Schema changes. We quickly learned that if we know what we are doing and use proper precautions and procedures we will be fine. I especially enjoyed the “may have to do a forest recovery…” bit. Had that been presented to me in a meeting with MSFT in front of a customer, I likely would have been unable to control my chuckling.

Point: Lag sites pose security threats to the corporate environment

Counterpoint: This one gave me a good chuckle too. Ever hear of normal slow convergence across a large enterprise? Ever hear of Kerberos Tickets? At what point did Kerberos start validating if a currently unexpired ticket was tied to a disabled or deleted userid? Yes, there *could* be additional issues if auth is possible through the lag site, but this is simply a design and operational criterion to take into account for lag sites as well as for normal overall convergence of data churn. It could be a bad thing that happens when repl gets plugged, or when a site is normally more latent than other sites, or with “official” lag sites, or if someone adjusts kerb ticket configuration settings. It isn’t an “oh my god the sky is falling don’t do lag sites because of this” item.

Point: Careful consideration must be put in configuring and deploying lag sites:

Counterpoint: Of course. Careful consideration must be put in configuring and deploying ANY site as well as ANY domain or ANY forest or ANY domain controller.

You likely should have stopped your post after stating that one week is the hard coded upper limit on normal replication schedules. The rest of this was unhelpful and again reminded me of all of the “Schema Updates are bad” scare tactics that went around in the initial years of AD.

If you wanted, you could have stated that Lag Sites need to be properly planned. They need to be properly managed. They aren’t a complete DR plan but they can be part of an overall DR plan that is used for various scenarios along with tombstone reanimation, snapshot data recovery in Windows Server 2008, and, god forbid, tape recovery. As a personal point of interest, I would much rather restore objects out of a lag site than from a backup file. I trust the lag sites more than I trust the backup/restore process.

Going forward, please don’t give advice based on misinformation, little information, or just plain “let’s scare em” type scenarios.

The wrap-up is that a lag site is simply a site that replicates on a longer convergence frequency than “normal sites”. Possibly up to a week out of convergence. This is a fully supported configuration by MSFT. It just isn’t supported as your sole Disaster Recovery solution. And it shouldn’t be because it isn’t a full Disaster Recovery solution.
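To make the “up to a week” figure concrete, here is a rough sketch. It is not anything taken from AD itself: it simply models a replication schedule as a weekly repeating grid of hours and works out the longest stretch between replication windows. The function name and the hour-level grid are assumptions made up for illustration (real schedules may be finer grained). Because the schedule repeats every week, even a single open hour caps the lag at about one week.

```python
HOURS_PER_WEEK = 7 * 24  # the modeled schedule repeats every week


def longest_lag_hours(open_hours):
    """Longest gap, in hours, between the starts of consecutive
    replication windows in a weekly repeating schedule.

    open_hours: iterable of hour-of-week indexes (0..167) in which
    replication is allowed. Hypothetical model, hour granularity only.
    """
    hours = sorted(set(open_hours))
    if not hours:
        raise ValueError("schedule never opens; replication would never occur")
    if hours[0] < 0 or hours[-1] >= HOURS_PER_WEEK:
        raise ValueError("hour-of-week must be in 0..167")
    # Gaps between neighbouring windows inside the same week.
    gaps = [b - a for a, b in zip(hours, hours[1:])]
    # Wrap around from the last window of this week to the first of next week.
    gaps.append(hours[0] + HOURS_PER_WEEK - hours[-1])
    return max(gaps)


normal_site = range(HOURS_PER_WEEK)  # a window in every hour of the week
lag_site = [2]                       # a single one-hour window per week

print(longest_lag_hours(normal_site))  # 1
print(longest_lag_hours(lag_site))     # 168, i.e. one week
```

The wrap-around term is the whole point: however sparse the schedule, the next week's first window always arrives, which is why a week is the ceiling on a scheduled lag.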


2 Responses to “Active Directory Lag Sites”

  1. Mike Kline says:

    Good point about the Windows 2008 rollouts. Here in DC the contracts I’ve been on are to support federal agencies and I’m not seeing people running to upgrade their forests and domains to W2K8 yet.

    Part of that is due to the uncertainty of future budgets and cuts that are probably on the way due to the current state of the economy here in the U.S.

    For now we are working fine with W2K3 and XP desktops. Exchange 2007 deployments are rolling out so we are not completely behind.

  2. Aaron says:

    Hehe. I remember hearing plenty of “you can’t roll back the Schema” or “Schema restores aren’t supported” talk from many techs and writers.

    Yet, when you authoritatively restore the system state of the SM role holder to a point prior to the backups and toast your other DCs, the “old” is back, prior to your changes. Hence the beauty of the backup.

    I had to do this for a Big 3 customer several years ago (around 2002) when they were deploying a new product into their environment. Since they had been counseled that “restoring the Schema is not supported,” they took it to mean that it was impossible. After I restored it for the 205th time in a test lab, they were convinced that it could be restored. We still took one DC off-line at each site when we did the upgrade (definite lag site)

    “Not supported” a lot of times is vendor-speak for “we don’t want to come rescue you after you’ve screwed the pooch.”

    Of course, it always pays to test your theories out (and have a current resume). 🙂
