There has been a quiet revolution in availability for SQL Server, and it comes under the heading of AlwaysOn. This name covers many different high availability technologies that have been enhanced and improved in SQL Server 2012. These include the ability to replicate between SQL Server instances (both synchronously and asynchronously), even when those instances are in multiple data centres. This allow for resilience of data between data centres, without the need for SAN replication.
SQL Server and the associated features are now supported in System Center 2012 SP1, with the full details of that support being available here http://technet.microsoft.com/en-US/library/jj628198.aspx
So why are you blogging about a SQL feature?
SQL 2012 has some great features and AlwaysOn can enhance the availability of the System Center databases. While this is a great feature, there are so many aspects to it, it is not always possible to fully support all of the features across the entire product set from day 1. This blog article is to detail a scenario that we have had with one of our clients that affected SCOM 2012 SP1 and to detail the work around that we implemented there. This will obviously require me to go into more detail around the SQL side than usual, but I suspect that this scenario will end up being more common as time goes on and so I felt it important for us to detail the problem and the solution.
In common with many enterprises, the client had two datacentres and wished to install SQL instances at each site and have the data replicated between them so that their operations manager infrastructure could failover between those sites. As SAN replication was not an option, we implemented AlwaysOn and configured an Availability Group Listener to allow SCOM to talk to the primary SQL instance. The initial setup would look like this.
The primary node and secondary nodes could be switched to provide for cross site failover. I am not going to go through all the setup for how to achieve this in AlwaysOn here as that that is another post by itself, but in SQL Management Studio, this looks will look like the following.
It is at this point that the default installation settings can start to cause a problem in some environments.
The issue – Cross Subnet Failover
As I said earlier – I needed an Availability Group Listener to allow SCOM to talk to the database, regardless of whether or not the primary node was in Site A or Site B. With this customer, they had separate subnets at each site. When both nodes are in a single subnet (stretched across both sites), a single IP address can be used, but to provide for correct routing, two IP addresses must be allocated to the listener. This will look like the following when implemented.
Now, in order for SQL 2012 to achieve the aims of providing for very high availability, *both* of these IP addresses will be registered with DNS by default. This is an excellent option for when you are writing a custom application, as you can add an option (MultiSubnetFailover) to the connection string to ensure that the client “knows” to start the connection to both IP addresses at the same time. Of course SCOM cannot drop support for older versions of SQL and as this option is not applicable to those connections strings it cannot be added at this time. This means that application such as SCOM will connect to these IP addresses one after the other and we have found that in some situations this can cause timeouts in a small number of SCOM workflows.
There are lots of potential errors that this can cause, mostly related to SQL timeouts in various scripts. The most frequent for me are timeouts storing data in the Data Warehouse. These were errors with Event ID 31551 in the Health Service Modules as shown in this capture.
The resolution, once the problem is understood, is pretty obvious – change the default so that one the *active* IP address is registered in DNS. This then allows SCOM to always connect to the primary node (regardless of site) and so to get a faster and more reliable connection. The only downside to this is that when failing over you now have to wait for DNS replication to take place before SCOM will pick up the new name. However, this setting can be adjusted down from the default of 20 minutes and so this time can be reduced to an appropriate value for your environment. Operations Manager’s default ability to queue updates at the Management Servers comes into play at this point and ensures that no data should be lost during this failover time.
The easiest way of changing these settings is with PowerShell on the SQL server. The following commands run from an administrator PowerShell prompt will change the registration to only register a single IP address and to reduce the DNS timeout to 5 minutes.
Get-ClusterResource jon-sql3-ag_jon-sql3-listen | Set-ClusterParameter RegisterAllProvidersIP 0
Get-ClusterResource jon-sql3-ag_jon-sql3-listen | Set-ClusterParameter HostRecordTTL 300
Note that the Cluster Resource name is the Availability Group Name, combined with the Listener name, separated with an underscore. The output from these commands will be similar to the following.
Obviously both IP addresses will still be registered in DNS from the initial setup – you can either delete the inactive IP address from DNS, or initiate a failover to force the new settings to take effect.
SQL 2012 AlwaysOn is a great new feature that can ensure high availability of your SCOM data in a remote data centre with no supporting hardware. While the default settings can cause problems in some scenarios, a simple change can ensure reliability and acceptably fast failover in the event of a disaster.