Polyserve failover issue...and my disappearing databases

by swjohnson 5/5/2007 5:00:00 PM

We had one instance in our matrix that would not failover correctly.  It would bounce from server to server in its rotation until it got back to its primary server.  When the instance was rehosted to a failover server, the SQL service would not start and it would move on to the next machine in the failover order and eventually coming to rest on the server where it was originally host but with no databases.  We called tech support and they got right back to us (we aren't live yet but...).  What we found out was that the system didn't have the proper permissions to start the TEMPDB in the new location.  I had recently used the Alter Database command to move the TEMPDB to its own set of LUNS for performance reasons and since SQL needs a TEMPDB in order to function properly, it failed to start the service and moved on to the next server.

Also, in our troubleshooting, we poked around in the registry a bit and it appears that one of the failovers didn't quite go a planned and the registry entries for that SQL instance were not pointing to the correct live instances of the master, model, and msdb databases.  So even once we got the permissions issue corrected, it was missing all of my databases.  Confused yet?

Polyserve, in order to make their system work, does a lot of swapping of registry entries from the primary/active machine to the failover machine.  Once the registry keys have been moved to the failover server from the primary, it starts the appropriate instances' SQL services on the failover server.

The registry keys (i.e. HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL.1) from the primary point to a common master, model, msdb databases for this instance (in our case c:\sql_instances\data1).  These files are located on a central location on the SAN so that all servers have access to (via the same junction point or mount point for you *nix users).  With the new registry entries, the failover server can now behave as if it was the primary server.  By the way, all your data is also stored in a central location on the SAN (c:\sql_data for the data files, c:\sql_logs for the transaction logs, and c:\sql_tempdb for the temp databases) for the same reason.

If you think about it, it's really the only way to get that specific instance to start on another server.  When you install an instance, you must copy that install to each of the other servers that will be a failover partner for that instance.  This creates a bunch of master, model, msdb databases (shell systems if you will) that aren't used and can cause you great confusion.  As such, you have to be very cognizant of where your live master, model, msdb database are and which are the active ones for your instance.  In our case, since the the failover didn't work correctly, the primary instance was pointing to one of the shells' master databases and that is why it seemed that our databases disappeared.  They weren't deleted but the server was just using a clean master database.

To fix our instance, we had modified the registry entry on the primary server so that it pointed to the correct master database and then restarted the SQL service and it started working again and all of our databases appeared.


Feeds