Arguably the most important feature of Exchange 2007 Service Pack 1 is Standby Continuous Replication (SCR). In a nutshell, SCR allows you to replicate your Exchange database information from your production servers to a standby server that can be brought online should the production servers be lost. Although existing Exchange 2007 technologies such as Clustered Continuous Replication (CCR) offer high availability, site resilience is currently best achieved via SCR. This is because it can be problematic to implement CCR across datacenters that have different IP subnets, as the members of the CCR cluster must be in the same subnet when Windows 2003 is used as the operating system. Although this requirement can sometimes be addressed by the networking team, many organizations are looking at implementing SCR in the backup datacenter and opting to manually initialize the SCR servers in the event of a disaster at the production datacenter. Quite often, manual intervention to bring up the Exchange system at the backup datacenter is actually preferable to an automated process.
In this three-part article, we are going to look at the process of implementing SCR between two CCR environments. The idea behind the article is that I was interested in outlining the procedure for moving a Clustered Mailbox Server (CMS) from one CCR environment to another and then back again. Obviously, in the real world these CCR environments would be in separate datacenters, but for the purposes of this article all servers are virtual servers configured on the same network. For clarity, I will be using the terms production datacenter and backup datacenter to help illustrate which CCR environment we are dealing with at the time. In this article we will go through the process of:
- Enabling SCR between the two CCR environments. Strictly speaking, the CCR environment in the backup datacenter is a standby cluster and is thus a pair of passive nodes ready to take ownership of and run the CMS.
- Simulating the loss of the production CCR environment and therefore producing the need to bring the CMS up on the standby cluster in the backup datacenter.
- Moving the CMS back to the CCR environment in the production datacenter once this datacenter is available again.
Server Configuration
Let us have a look at the five servers that I have in my virtual environment that will be used to construct and test the SCR scenario. They are:
- NH-W2K3-SRV02, a combined domain controller, Client Access Server and Hub Transport server.
- NH-W2K3-SRV03, initially set to the active node of the production CCR environment.
- NH-W2K3-SRV04, initially set to the passive node of the production CCR environment.
- NH-W2K3-SRV01, the first passive node of a standby cluster.
- NH-W2K3-SRV05, the second passive node of a standby cluster.
Since the server names have incremental numbers, it would have been neater not to have servers NH-W2K3-SRV01 and NH-W2K3-SRV05 as the standby cluster, but I had already built the existing CCR environment up to server NH-W2K3-SRV04 and did not want to reinstall the entire environment. In fact, server NH-W2K3-SRV01 used to be an Edge Transport server, which is why the combined domain controller, Client Access Server and Hub Transport server is NH-W2K3-SRV02.
There are some other important names to identify:
- The actual production cluster name is E2K7CLU01.
- The standby cluster name for the backup datacenter is E2K7CLU02.
- The CMS name is CCREX01. This is the name that the Outlook clients actually connect to.
You will note that there is only a single domain controller, Hub Transport server and Client Access Server within this setup. In the real world, the backup datacenter would contain additional domain controllers, Hub Transport servers and Client Access Servers that would automatically be used by the CCR environment at the backup datacenter. As the focus of this article is the recovery of the CMS to a new CCR environment using SCR, I shall be using the same domain controller, Hub Transport server and Client Access Server for both the production and backup CCR environments. This keeps things simple for this article, but of course in any real site resilience design these additional servers must be considered.
One additional thing to note is that all servers are running Windows 2003, so the steps in this article relate to Windows 2003 and not Windows 2008. Several steps differ if your servers are running Windows 2008, and those differences are not covered here. Maybe that will be the topic of a future article here on msexchange.org as Windows 2008 starts to be deployed.
Standby Cluster Installation
As I have already alluded to within this article, it is important to note the difference between the production CCR environment and the standby cluster in the backup datacenter. The production CCR environment is installed as detailed in Henrik Walther’s article, Installing, Configuring and Testing an Exchange 2007 CCR Based Mailbox Server on MSExchange.org. The standby cluster is installed slightly differently, since it is not designed to run a CMS from the outset. Broadly speaking, the main difference is that instead of installing the Active Clustered Mailbox Role on one cluster node and the Passive Clustered Mailbox Role on the other cluster node as is the case with CCR, both standby cluster nodes will be installed with the Passive Clustered Mailbox Role only. Therefore, in my example network, servers NH-W2K3-SRV01 and NH-W2K3-SRV05 are both configured with the Passive Clustered Mailbox Server role. This selection is made during the Exchange 2007 setup routine as you can see from Figure 1.
Figure 1: Passive Clustered Mailbox Server Installation
One key consideration when installing the standby cluster for use with SCR is that the path for the database and log files must be the same on both the SCR source and SCR target machines. In other words, if the CCR environment is configured to place all database files into E:\Databases, then the location of the databases on the standby cluster nodes will also be set to E:\Databases as and when SCR is enabled.
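If you want to double-check the current paths on the production CMS before enabling SCR, something along these lines from the EMS will show them. This is just a quick sketch; I am assuming the storage group and database names used throughout this article:
Get-StorageGroup -Server CCREX01 | Format-List Name,LogFolderPath,SystemFolderPath
Get-MailboxDatabase -Server CCREX01 | Format-List Name,EdbFilePath
Get-PublicFolderDatabase -Server CCREX01 | Format-List Name,EdbFilePath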
Activate SCR
Since this is an article on using SCR to achieve a failover between two CCR environments, the first thing to do is to enable SCR for both storage groups on the CMS. This is done using the Enable-StorageGroupCopy cmdlet which has the important –StandbyMachine parameter. Since the SCR target is a standby cluster consisting of two passive nodes, either of these can be specified in the –StandbyMachine parameter and will, ultimately, become the active node running the CMS when it is recovered later. In this article, I am going to choose NH-W2K3-SRV01 as the SCR target server. Also, the Enable-StorageGroupCopy cmdlet has been updated to include the –ReplayLagTime parameter, which is used to specify an amount of time to elapse before the log files that have been replicated to the SCR target are actually replayed into the database. This is useful in various situations such as when logical corruption has occurred with the databases on the production CCR environment, since you have time to ensure that this corruption does not make its way into the databases on the SCR target server. By default, the value for the –ReplayLagTime parameter is 1 day so I am going to override this value in the test environment and configure a value of 0. Therefore, the cmdlets required to enable SCR for both storage groups are as follows:
Enable-StorageGroupCopy “CCREX01\First Storage Group” –StandbyMachine NH-W2K3-SRV01 –ReplayLagTime 0.0:0:0
Enable-StorageGroupCopy “CCREX01\Second Storage Group” –StandbyMachine NH-W2K3-SRV01 –ReplayLagTime 0.0:0:0
The running of these cmdlets is shown in Figure 2.
Figure 2: Enable-StorageGroupCopy cmdlets
Once the above cmdlets have been executed, a copy of the two storage groups is created on the target machine NH-W2K3-SRV01. In Figure 3 below, you can see the contents of the First Storage Group folder on NH-W2K3-SRV01. Since in my lab environment I have chosen to keep the database and transaction log files in the same folder location, you may notice that there is at least one key file missing from this folder: the actual database file. There is a reason for this which will be explained in part two of this article.
Figure 3: Contents of the First Storage Group Folder
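Incidentally, if you want to keep an eye on the state of the SCR copies at this stage, the Get-StorageGroupCopyStatus cmdlet was also updated in Service Pack 1 to accept the –StandbyMachine parameter. A minimal check, assuming the same storage group names, would look like this:
Get-StorageGroupCopyStatus "CCREX01\First Storage Group" -StandbyMachine NH-W2K3-SRV01
Get-StorageGroupCopyStatus "CCREX01\Second Storage Group" -StandbyMachine NH-W2K3-SRV01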
Seed the SCR Target
Let’s now look at the process of seeding the SCR target. Although this can be achieved manually by dismounting and copying the databases, I am going to use the Exchange Management Shell (EMS) to achieve the desired result. This process centers around the use of several EMS cmdlets, most notably the Update-StorageGroupCopy cmdlet, and we will be running these from the passive node NH-W2K3-SRV01.
Before re-seeding a database, the storage group replication process must be suspended by running the Suspend-StorageGroupCopy cmdlet. Since we have two storage groups that we are working with, both will be suspended at the same time as shown in Figure 4 below. The cmdlets to use are:
Suspend-StorageGroupCopy “CCREX01\First Storage Group” –StandbyMachine NH-W2K3-SRV01
Suspend-StorageGroupCopy “CCREX01\Second Storage Group” –StandbyMachine NH-W2K3-SRV01
Figure 4: Suspending the Storage Group Copy Process
With storage group replication suspended, it is now possible to remove any existing database files from NH-W2K3-SRV01. If you look again at Figure 3 from part one of this article, you can see that enabling SCR has already produced several transaction log files in the storage group folder on NH-W2K3-SRV01, but no database file. This is because SCR only creates the target database once at least 50 transaction log files have been copied over from the SCR source and the period specified by the –ReplayLagTime parameter has elapsed. The value of 50 transaction log files is hard-coded and cannot be changed. As you have seen from the cmdlets used when we enabled SCR, the –ReplayLagTime value has been set to 0, which effectively means the databases will only be created once 50 transaction logs have been shipped from the SCR source. Re-seeding the database now will create the database immediately.
Let us get back to removing the existing database files. It is now possible to safely remove any .EDB, .LOG, .JRS and .CHK files from the folders containing the copies of the storage groups on NH-W2K3-SRV01. Once this has been done, the databases can be seeded onto NH-W2K3-SRV01 by running the following two cmdlets:
Update-StorageGroupCopy “CCREX01\First Storage Group” –StandbyMachine NH-W2K3-SRV01
Update-StorageGroupCopy “CCREX01\Second Storage Group” –StandbyMachine NH-W2K3-SRV01
The results of running these cmdlets can be seen in Figure 5 where you can see the database seeding process in action.
Figure 5: Database Reseeding in Progress
The cmdlets used above will automatically resume replication to the SCR target, so there is no need to use the Resume-StorageGroupCopy cmdlet at this time.
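As an aside, rather than deleting the existing files by hand, the Update-StorageGroupCopy cmdlet also offers a -DeleteExistingFiles switch that removes them for you before re-seeding. A sketch of that alternative, using my storage group names, would be:
Update-StorageGroupCopy "CCREX01\First Storage Group" -StandbyMachine NH-W2K3-SRV01 -DeleteExistingFiles
Update-StorageGroupCopy "CCREX01\Second Storage Group" -StandbyMachine NH-W2K3-SRV01 -DeleteExistingFiles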
Site Failover Process
At this point SCR has been configured and any transaction logs that are created on the active node of the CCR environment are not only replicated to the CCR passive node, they are also replicated to the SCR target server NH-W2K3-SRV01. Thus, assuming the CCR environment is in the production datacenter, the SCR environment is in the backup datacenter and the other required services also exist in the backup datacenter, a site resilient solution has been created. These other required services include Active Directory domain controllers, Hub Transport servers, Client Access Servers, DNS, and so on.
To simulate a failure of the production datacenter, and thus the production CCR environment, I will simply shut down both the active and passive nodes of the CCR environment, namely NH-W2K3-SRV03 and NH-W2K3-SRV04. At this point, we need to recover the CMS, CCREX01, so that it runs on the standby cluster. There are quite a few steps to perform to achieve this goal, the first being to activate the storage group copy on the standby cluster via the Restore-StorageGroupCopy cmdlet. You will remember that when we first enabled the storage group copy, we specified the target server as NH-W2K3-SRV01, so in this example I am going to run the Restore-StorageGroupCopy cmdlets from this standby cluster node. The two cmdlets to run are:
Restore-StorageGroupCopy –Identity “CCREX01\First Storage Group” –StandbyMachine NH-W2K3-SRV01 -Force
Restore-StorageGroupCopy –Identity “CCREX01\Second Storage Group” –StandbyMachine NH-W2K3-SRV01 -Force
One thing that you should note from the above cmdlets is the use of the –Force parameter. This is used when the SCR source, in this case the CCR environment consisting of NH-W2K3-SRV03 and NH-W2K3-SRV04, is no longer available, which will be the case if you have lost the production datacenter. If the original SCR source were still available, the –Force parameter would not be used, as any outstanding transaction logs would be copied over from the SCR source. The results of running these cmdlets are shown in Figure 6.
Figure 6: Restoring the Storage Groups to the Standby Cluster Node
Recover the CMS
Once the storage groups have been prepared for mounting via the Restore-StorageGroupCopy cmdlets, the next thing to do is to recover the CMS. This is achieved easily by running the Exchange 2007 setup.com program on the target server NH-W2K3-SRV01. The setup.com program has a special switch called /RecoverCMS which requires that you also specify the CMS name that you are recovering, as well as the CMS IP address. One thing to remember here is that if you are recovering the CMS in a different datacenter, you will likely be specifying a new IP address for the CMS, since the disaster recovery datacenter will probably be on a different IP subnet. This is why, in the example below, a different IP address of 172.16.6.153 is used rather than the one originally owned by the CMS (172.16.6.80) when it ran on nodes NH-W2K3-SRV03 and NH-W2K3-SRV04. This is perfectly normal. The setup.com command to use in my example is:
setup.com /RecoverCMS /CMSName:CCREX01 /CMSIPAddress:172.16.6.153
In Figure 7, you can see the results of running the setup.com program.
Figure 7: Recovering the CMS
Once the recovery of the CMS has been performed, the databases can be mounted either via the Exchange Management Console (EMC) or via the EMS. Since we have been focusing on management shell cmdlets and command-line programs so far, let us continue down this route and use the management shell to mount the two databases that we have in this system. This can be achieved by using the Mount-Database cmdlet. Since we have two databases to mount, there are two cmdlets to run as follows:
Mount-Database –Identity “CCREX01\First Storage Group\Mailbox Database”
Mount-Database –Identity “CCREX01\Second Storage Group\Public Folder Database”
Assuming the databases are mounted correctly, the EMS prompt simply returns without any error messages. I then verified that I could access my mailbox as normal, which I could.
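If you prefer a shell-based check, the mount state of the databases can also be confirmed from the EMS. A quick sketch, assuming the CMS name used in this article (the -Status switch is needed for the Mounted property to be populated):
Get-MailboxDatabase -Server CCREX01 -Status | Format-Table Name,Mounted
Get-PublicFolderDatabase -Server CCREX01 -Status | Format-Table Name,Mounted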
Re-create CCR Environment
We have now successfully recovered the CMS to NH-W2K3-SRV01 and mounted the databases, so everything is looking good so far. However, since the production datacenter had a CCR configuration, it is desirable for the backup datacenter to have a similar configuration, particularly if there are plans to operate out of this datacenter for some time. Therefore, we now have to ensure that the databases are seeded onto the other cluster node in the backup datacenter, namely NH-W2K3-SRV05. As a result of this process, the original standby cluster running in the backup datacenter will now be a full CCR environment running the original CMS called CCREX01. You have already seen detailed information on the database seeding process within this article, so rather than repeat it in full I will simply outline the commands below.
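Broadly speaking, the commands are the same as the ones used to seed the SCR target, except that they are run from the new passive node NH-W2K3-SRV05 and the -StandbyMachine parameter is omitted, because this time we are seeding the CCR copy rather than an SCR copy. A rough outline, assuming the existing storage group names and using -DeleteExistingFiles in place of manual file deletion, would be:
Suspend-StorageGroupCopy "CCREX01\First Storage Group"
Suspend-StorageGroupCopy "CCREX01\Second Storage Group"
Update-StorageGroupCopy "CCREX01\First Storage Group" -DeleteExistingFiles
Update-StorageGroupCopy "CCREX01\Second Storage Group" -DeleteExistingFiles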
As a final test of the configuration of the standby cluster called E2K7CLU02, it is prudent to ensure that the CMS and other cluster resources can be correctly moved between the two cluster nodes NH-W2K3-SRV01 and NH-W2K3-SRV05. Since the resources are currently running on NH-W2K3-SRV01, we need to move them to NH-W2K3-SRV05 and test for correct functionality. The default cluster group that contains resources such as the Majority Node Set can be moved easily by right-clicking the group called Cluster Group in Cluster Administrator and choosing Move from the context menu as shown in Figure 8.
Figure 8: Moving the Cluster Group
The CMS resources have to be moved using the Move-ClusteredMailboxServer cmdlet. The full cmdlet to use is shown below, where the –TargetMachine parameter specifies the cluster node that you would like to move the resources to, and the –MoveComment parameter specifies a reason for the move, which is written to the application event log.
Move-ClusteredMailboxServer CCREX01 –TargetMachine NH-W2K3-SRV05 –MoveComment “Test move after CMS recovery”
The results of running this cmdlet should be that all CMS resources are taken offline, moved to NH-W2K3-SRV05 and then brought online again. At this point you need to test that access to the CMS is still possible and that you are confident that the CMS can safely operate on both cluster nodes of the standby CCR environment.
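If you want a quick shell-based confirmation of which node is now hosting the CMS after the move, the Get-ClusteredMailboxServerStatus cmdlet reports the state of the clustered mailbox server and the nodes involved. A simple check would be:
Get-ClusteredMailboxServerStatus -Identity CCREX01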
Move Back to the Production Datacenter
The situation now is that we have our CMS called CCREX01 running on the cluster called E2K7CLU02, which itself comprises the two nodes NH-W2K3-SRV01 and NH-W2K3-SRV05. However, it is possible that at some point in the future the original datacenter will become available again, and possibly the original CCR cluster nodes NH-W2K3-SRV03 and NH-W2K3-SRV04 with it. Many organizations prefer to run their systems from a primary datacenter location, so the process of moving back to the primary datacenter must be addressed. The original cluster still believes it has a CMS configured on it, so this configuration must be removed. Here is the procedure to accomplish this.
First, I will bring both NH-W2K3-SRV03 and NH-W2K3-SRV04 back online. Of course, this assumes that the required services such as Active Directory, Hub Transport and Client Access Servers are already up and running back at the production datacenter. At the time of the original failure at the production datacenter, the active cluster node was NH-W2K3-SRV03 and so running Cluster Administrator on this node shows us that all CMS resources are in an offline state as you would expect. This is shown in Figure 9.
Figure 9: Offline Cluster Resources
To remove the CMS resources from the cluster we need to run the Exchange 2007 setup.com program on node NH-W2K3-SRV03 with the /ClearLocalCMS switch. We also need to specify the CMS name. The full command to use is:
setup.com /ClearLocalCMS /CMSName:CCREX01
Figure 10 shows the output as a result of running the above command.
Figure 10: Clearing the Local CMS
Once this has finished, refreshing Cluster Administrator confirms that the CMS resources have been removed. If you think about it, servers NH-W2K3-SRV03 and NH-W2K3-SRV04 now have the same configuration that servers NH-W2K3-SRV01 and NH-W2K3-SRV05 had before we started the SCR replication process; we effectively have a mirror of our previous setup.
Activate SCR to Production Datacenter
Since we have a mirror of our previous setup, we now need to mirror the SCR configuration that we performed earlier in this article. In other words, we need to enable SCR from the backup datacenter, currently hosting the CMS, to the production datacenter. NH-W2K3-SRV03 will be selected as the target for SCR and therefore we now need to run the following two cmdlets:
Enable-StorageGroupCopy “CCREX01\First Storage Group” –StandbyMachine NH-W2K3-SRV03 –ReplayLagTime 0.0:0:0
Enable-StorageGroupCopy “CCREX01\Second Storage Group” –StandbyMachine NH-W2K3-SRV03 –ReplayLagTime 0.0:0:0
Of course, in my scenario I simply powered off NH-W2K3-SRV03 and NH-W2K3-SRV04, which means that the old database and log files are still there to be used. However, in this article I am going to assume a full re-seed is still required along with the implementation of SCR back to the production datacenter. We have already covered the results of these cmdlets in part one of this article so I will not include another screen shot here.
Database Reseed
With NH-W2K3-SRV03 now having SCR enabled and a copy of the database, it is also important to think about seeding the database onto NH-W2K3-SRV04. This can either be done after the CMS is brought online on NH-W2K3-SRV03, or it can be sped up by ensuring that NH-W2K3-SRV04 is also configured as an SCR target of the CMS, as sketched below. Since this is a lab environment I am working with, I do not mind re-seeding the database onto NH-W2K3-SRV04, but make sure you consider these options in your production environment. I have already covered the database re-seeding operation within part two of this article so I will not be covering it again here.
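If you did want NH-W2K3-SRV04 to be an additional SCR target rather than waiting to re-seed it later, a storage group can be enabled for more than one standby machine. A sketch of that approach, mirroring the earlier cmdlets, would be:
Enable-StorageGroupCopy "CCREX01\First Storage Group" -StandbyMachine NH-W2K3-SRV04 -ReplayLagTime 0.0:0:0
Enable-StorageGroupCopy "CCREX01\Second Storage Group" -StandbyMachine NH-W2K3-SRV04 -ReplayLagTime 0.0:0:0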
Dismount the Databases
Before the switch back to NH-W2K3-SRV03 is made, the next step is to dismount the databases running on CCREX01 since, unlike the original failover, this time the CMS is still running and servicing users. The databases must be dismounted first so that they no longer generate new transaction logs that would then have to be shipped over to NH-W2K3-SRV03. Continuing our theme of using the EMS within this article, the databases can be dismounted by using the Dismount-Database cmdlet twice as shown below:
Dismount-Database “CCREX01\First Storage Group\Mailbox Database”
Dismount-Database “CCREX01\Second Storage Group\Public Folder Database”
There is not much to say about the output of these cmdlets, as all you will receive is an "are you sure?" confirmation prompt unless you suppress it. To suppress this prompt, just add the –Confirm:$false parameter to the end of each cmdlet.
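For example, the prompt-free versions of the two cmdlets above would simply be:
Dismount-Database "CCREX01\First Storage Group\Mailbox Database" -Confirm:$false
Dismount-Database "CCREX01\Second Storage Group\Public Folder Database" -Confirm:$false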
Restore the Storage Groups
Next the storage groups must be prepared for mounting via the Restore-StorageGroupCopy cmdlet. The two cmdlets to use are:
Restore-StorageGroupCopy “CCREX01\First Storage Group” –StandbyMachine NH-W2K3-SRV03
Restore-StorageGroupCopy “CCREX01\Second Storage Group” –StandbyMachine NH-W2K3-SRV03
You may remember in part two of this article that we used the Restore-StorageGroupCopy cmdlet with the –Force parameter. This is not required this time since the CMS nodes are running and thus any required information is available. I will not include a screen shot here since no output is generated as a result of running these cmdlets.
Stop the CMS
The CMS running in the backup datacenter now needs to be stopped using the Stop-ClusteredMailboxServer cmdlet, because this time the CMS is still running whereas in the original failover situation it was not. The full cmdlet to use is:
Stop-ClusteredMailboxServer CCREX01 –StopReason “Moving back to production data center”
In the above cmdlet, the –StopReason parameter is used to record the reason for stopping the CMS in the event log. If you examine the event log after issuing this cmdlet, you should find an event with an ID of 105 containing your chosen phrase. Running the Stop-ClusteredMailboxServer cmdlet should give you an output similar to that shown in Figure 11. You should also check that the CMS resources are offline using Cluster Administrator.
Figure 11: Stopping the CMS
Recover the CMS
After doing this we are now back to the point where we need to recover the CMS to the server NH-W2K3-SRV03. We have already looked at this process in part two of this article so I will only briefly cover it again here. Just remember, however, that you need to specify the IP address of the CMS when using the /RecoverCMS switch and so, this time, you are going to be giving the CMS an IP address within the production datacenter IP subnet. In fact, you would most likely give the CMS its original IP address back.
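In my lab, that means running setup.com again, this time on NH-W2K3-SRV03 and with the original IP address:
setup.com /RecoverCMS /CMSName:CCREX01 /CMSIPAddress:172.16.6.80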
Figure 12 shows the second time I have recovered the CMS, this time using the original IP address of 172.16.6.80.
Figure 12: Recovering the CMS
Once recovered, the databases can be remounted in the same way we have already seen in part two of this article. In my lab, I would then re-seed the databases back onto NH-W2K3-SRV04 at this point, although if I had chosen to also enable SCR for NH-W2K3-SRV04 earlier, I would also have to run the Resume-StorageGroupCopy cmdlet for each storage group to re-enable replication between the two nodes, along the lines sketched below. When all databases have been mounted and replication is occurring between the two cluster nodes, we have successfully moved our CMS from one CCR environment to another and back again by using SCR.
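For completeness, the resume step in that scenario would be run against each storage group without the -StandbyMachine parameter, since it is the CCR copy between the two nodes that is being resumed. A minimal sketch, assuming the storage group names are unchanged, would be:
Resume-StorageGroupCopy "CCREX01\First Storage Group"
Resume-StorageGroupCopy "CCREX01\Second Storage Group"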
The last thing to do is to make sure the backup datacenter is primed so that it can handle any future incidents at the production datacenter. There are only two major steps, both of which we have already covered within the parts of this article, so I shall simply list them here with a brief command recap after the list:
- Remove the local CMS from the cluster E2K7CLU02 running at the backup datacenter.
- Re-enable SCR from the source CCR environment to the target standby cluster.
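As a reminder, those two steps boil down to commands we have already used: the /ClearLocalCMS switch run on one of the E2K7CLU02 nodes, followed by the Enable-StorageGroupCopy cmdlets pointing back at a node of the standby cluster (NH-W2K3-SRV01 in my setup). In outline:
setup.com /ClearLocalCMS /CMSName:CCREX01
Enable-StorageGroupCopy "CCREX01\First Storage Group" -StandbyMachine NH-W2K3-SRV01 -ReplayLagTime 0.0:0:0
Enable-StorageGroupCopy "CCREX01\Second Storage Group" -StandbyMachine NH-W2K3-SRV01 -ReplayLagTime 0.0:0:0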