We had one instance in our matrix that would not
failover correctly. It would bounce from server to server in its
rotation until it got back to its primary server. When the instance
was rehosted to a failover server, the SQL service would not start and
it would move on to the next machine in the failover order and
eventually coming to rest on the server where it was originally host
but with no databases. We called tech support and they got right back
to us (we aren't live yet but...). What we found out was that the
system didn't have the proper permissions to start the TEMPDB in the
new location. I had recently used the Alter Database command to move
the TEMPDB to its own set of LUNS for performance reasons and since SQL
needs a TEMPDB in order to function properly, it failed to start the
service and moved on to the next server.
Also, in our troubleshooting, we poked around in
the registry a bit and it appears that one of the failovers didn't
quite go a planned and the registry entries for that SQL instance were
not pointing to the correct live instances of the master, model, and
msdb databases. So even once we got the permissions issue corrected,
it was missing all of my databases. Confused yet?
Polyserve, in order to make their system work,
does a lot of swapping of registry entries from the primary/active
machine to the failover machine. Once the registry keys have been
moved to the failover server from the primary, it starts the
appropriate instances' SQL services on the failover server.
The registry keys (i.e.
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL.1)
from the primary point to a common master, model, msdb databases for
this instance (in our case c:\sql_instances\data1). These files are
located on a central location on the SAN so that all servers have
access to (via the same junction point or mount point for you *nix
users). With the new registry entries, the failover server can now
behave as if it was the primary server. By the way, all your data is
also stored in a central location on the SAN (c:\sql_data for the data
files, c:\sql_logs for the transaction logs, and c:\sql_tempdb for the
temp databases) for the same reason.
If you think about it, it's really the only way
to get that specific instance to start on another server. When you
install an instance, you must copy that install to each of the other
servers that will be a failover partner for that instance. This
creates a bunch of master, model, msdb databases (shell systems if you
will) that aren't used and can cause you great confusion. As such, you
have to be very cognizant of where your live master, model, msdb
database are and which are the active ones for your instance. In our
case, since the the failover didn't work correctly, the primary
instance was pointing to one of the shells' master databases and that
is why it seemed that our databases disappeared. They weren't deleted
but the server was just using a clean master database.
To fix our instance, we had modified the registry
entry on the primary server so that it pointed to the correct master
database and then restarted the SQL service and it started working
again and all of our databases appeared.