Polyserve Failover...

by swjohnson 7/20/2007 5:16:00 PM

Hot Dog!  We had an HBA issue on one of our SQL DB servers in the Polyserve Matrix last night.  Polyserve saw it, notified us, failed the instance over correctly, and shut down access to the errant server.  Total downtime was about 30 seconds.  The instance it moved only had two large databases on it, but it worked.  The client application, which is coded to retry the connection to the DB if it receives an error code, also worked correctly: it retried the transactions and the client never knew there was an outage.  We had a complaint about performance for a short time, but I will take that over "why is it down?" any day!
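For anyone wondering what "coded to retry" looks like, here is a minimal sketch in Python.  It is not our application's code: TransientDbError is a hypothetical stand-in for whatever error your DB driver raises when the connection drops, and the attempt count and delay are made-up defaults.

```python
import time


class TransientDbError(Exception):
    """Hypothetical stand-in for the driver error raised when the connection drops."""


def run_with_retry(operation, attempts=5, delay_seconds=0.0):
    """Call operation(); on TransientDbError wait and try again, up to `attempts` times."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientDbError as err:
            last_error = err
            # Only sleep if another attempt is still coming.
            if attempt < attempts:
                time.sleep(delay_seconds)
    # Every attempt failed, so surface the last error to the caller.
    raise last_error
```

With a failover that takes about 30 seconds, the attempts multiplied by the delay need to cover at least that long, otherwise the client gives up before the instance is back.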

I was a bit worried about the notifiers because they use a batch file and BLAT.exe (a command line SMTP mail client).  We weren't able to test them very much during our install and were cautious about using them this way.

So far so good with Polyserve! 

Polyserve failover issue...and my disappearing databases

by swjohnson 5/5/2007 5:00:00 PM

We had one instance in our matrix that would not fail over correctly.  It would bounce from server to server in its rotation until it got back to its primary server.  When the instance was rehosted to a failover server, the SQL service would not start, so it would move on to the next machine in the failover order, eventually coming to rest on the server where it was originally hosted but with no databases.  We called tech support and they got right back to us (we aren't live yet but...).  What we found out was that the system didn't have the proper permissions to start the TEMPDB in the new location.  I had recently used the ALTER DATABASE command to move the TEMPDB to its own set of LUNs for performance reasons, and since SQL needs a TEMPDB in order to function properly, it failed to start the service and moved on to the next server.
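For reference, a TEMPDB move of the kind described above is normally done with two ALTER DATABASE ... MODIFY FILE statements.  The sketch below just builds them as strings in Python.  The logical names tempdev and templog are the SQL Server defaults; the file names under c:\sql_tempdb are assumptions for illustration, not the exact paths we used.

```python
def tempdb_move_statements(data_path, log_path):
    """Return the T-SQL that points tempdb's data and log files at new locations."""
    return [
        # tempdev is the default logical name of the tempdb data file.
        f"ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, FILENAME = '{data_path}');",
        # templog is the default logical name of the tempdb log file.
        f"ALTER DATABASE tempdb MODIFY FILE (NAME = templog, FILENAME = '{log_path}');",
    ]


# Assumed file names under the shared tempdb folder.
for statement in tempdb_move_statements(
    r"c:\sql_tempdb\tempdb.mdf", r"c:\sql_tempdb\templog.ldf"
):
    print(statement)
```

The new location only takes effect the next time the SQL service starts, and the service account needs rights on the new folder.  That is why a permissions problem like ours shows up as a service that will not start rather than as an error at the time of the move.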

Also, in our troubleshooting, we poked around in the registry a bit and it appears that one of the failovers didn't quite go as planned: the registry entries for that SQL instance were not pointing to the correct live instances of the master, model, and msdb databases.  So even once we got the permissions issue corrected, it was missing all of my databases.  Confused yet?

Polyserve, in order to make their system work, does a lot of swapping of registry entries from the primary/active machine to the failover machine.  Once the registry keys have been moved to the failover server from the primary, it starts the appropriate instances' SQL services on the failover server.

The registry keys (e.g. HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL.1) from the primary point to a common set of master, model, and msdb databases for this instance (in our case c:\sql_instances\data1).  These files are kept in a central location on the SAN that all servers can access via the same junction point (or mount point for you *nix users).  With the new registry entries, the failover server can now behave as if it were the primary server.  By the way, all your data is also stored in a central location on the SAN (c:\sql_data for the data files, c:\sql_logs for the transaction logs, and c:\sql_tempdb for the temp databases) for the same reason.

If you think about it, it's really the only way to get that specific instance to start on another server.  When you install an instance, you must copy that install to each of the other servers that will be a failover partner for that instance.  This creates a bunch of master, model, and msdb databases (shell systems if you will) that aren't used and can cause you great confusion.  As such, you have to be very cognizant of where your live master, model, and msdb databases are and which are the active ones for your instance.  In our case, since the failover didn't work correctly, the primary instance was pointing to one of the shells' master databases, and that is why it seemed that our databases had disappeared.  They weren't deleted; the server was just using a clean master database.

To fix our instance, we modified the registry entry on the primary server so that it pointed to the correct master database and then restarted the SQL service.  It started working again and all of our databases appeared.
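If you want to check which master database an instance will start with, here is a rough sketch in Python.  It assumes the usual SQL Server 2005 layout, where the startup arguments live in values named SQLArg0, SQLArg1, ... under the instance's MSSQLServer\Parameters key and the -d argument names the master data file.  The dictionaries are hand-typed, hypothetical examples rather than a dump of our registry, and the shell path is made up.

```python
def master_paths(parameters):
    """Pick the master data file (-d) and master log file (-l) out of startup arguments."""
    found = {}
    for value in parameters.values():
        if value.lower().startswith("-d"):
            found["master_data"] = value[2:]
        elif value.lower().startswith("-l"):
            found["master_log"] = value[2:]
    return found


def points_at(parameters, expected_dir):
    """True if the -d argument names a file directly under expected_dir."""
    data = master_paths(parameters).get("master_data", "")
    prefix = expected_dir.lower().rstrip("\\") + "\\"
    return data.lower().startswith(prefix)


# Hypothetical values: the first set points at the live shared files.
live = {
    "SQLArg0": r"-dc:\sql_instances\data1\master.mdf",
    "SQLArg1": r"-ec:\sql_instances\data1\ERRORLOG",
    "SQLArg2": r"-lc:\sql_instances\data1\mastlog.ldf",
}
# The second set points at an unused shell copy (path is an assumption).
shell = dict(live, SQLArg0=r"-dc:\mxsqlshells\data1\master.mdf")

print(points_at(live, r"c:\sql_instances\data1"))
print(points_at(shell, r"c:\sql_instances\data1"))
```

The second case is the "disappearing databases" situation: nothing is deleted, the service is simply told to start from a clean master.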

Polyserve Day #1

by swjohnson 5/2/2007 4:58:00 PM

So now you have signed on the dotted line.  What's next?  One of the first things you will get from Polyserve is a Pre-Install Checklist.  In this document, you will have to tell them about the server hardware you are planning to use, the operating system, version and service packs.  You will also tell them about the make and model of your switches, the number of VLANs, multicasting and DNS.

Probably one of the most important aspects is your storage infrastructure.  Here they want to know if you will be using iSCSI or FibreChannel, how many VLANs, and other related information.  For us, we are using FibreChannel with Emulex HBAs and an IBM DS8100 SAN.

Another very important part is the MPIO or Multi-path IO software and what fencing methodology you will be using.

Ok, I know some of you are like me and saying to yourself, what are all of these terms--fencing, MPIO, FibreChannel,...?  While I have been a DBA for many years, I had heard of some of them but never had to use them, as most of my systems have always used large SCSI RAID arrays, or I have taken over DB servers that were already configured to the SAN and could rely on the SAN administrator for everything I needed.

Fencing is the capability of shutting down access to various shared resources on a SAN in a controlled fashion.  The servers themselves can remain up even while they are excluded from accessing shared resources. This may be accomplished, depending on configuration, by turning off the FibreChannel port(s) to which the offending server is attached, or by manipulating zoning or other policies within the SAN.  This white paper by Polyserve explains how they implement it (http://www.erexi.com.tw/whitepapers/data_integrity_in_cfs_whitepaper.pdf) if you are interested.  With Polyserve you have the option to go with fabric based fencing or server based fencing (which is dependent on your hardware).  Basically it has the ability to stop communication with a server in order to protect data integrity. 

MPIO stands for Multi-Path I/O.  Multipathing solutions use redundant physical path components–adapters, cables, and switches–to create logical "paths" between the server and the storage device. In the event that one or more of these components fails, causing the path to fail, multipathing logic uses an alternate path for I/O so that applications can still access their data.  Basically you are creating multiple routes for the data to get to its end location. 
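To make the idea concrete, here is a toy sketch in Python of the decision multipathing logic makes.  It is only an illustration: the Path class and the path names are invented, and real MPIO software does this in the storage driver stack, not in application code.

```python
from dataclasses import dataclass


@dataclass
class Path:
    """One logical route from the server to the storage (adapter + cable + switch)."""
    name: str
    healthy: bool = True


def pick_path(paths):
    """Return the first healthy path, or raise if every path has failed."""
    for path in paths:
        if path.healthy:
            return path
    raise RuntimeError("no healthy path to storage")


# Two made-up redundant paths through different HBAs and switches.
paths = [Path("hba0-switchA"), Path("hba1-switchB")]
print(pick_path(paths).name)

# Simulate the first HBA failing; I/O moves to the alternate path.
paths[0].healthy = False
print(pick_path(paths).name)
```

Real products can also spread I/O across all healthy paths instead of always using the first one, but the failover idea is the same.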

HBA is the acronym for Host Bus Adapter which is the interface card through which the computer communicates with the SAN.

VLAN is a Virtual Local Area Network.  A virtual LAN, commonly known as a VLAN, is a method of creating independent logical networks within a physical network. Several VLANs can co-exist within such a network. This helps in reducing the broadcast domain and aids in network administration by separating logical segments of a LAN (like company departments) that should not exchange data using a LAN (they still can exchange data by routing).  (Thanks to Wikipedia http://en.wikipedia.org/wiki/Virtual_LAN).  Polyserve wants 2 VLANs for communication.  One will be for public traffic that will carry your SQL communications between the db server and the client and one for private traffic from the Polyserve system to others in your matrix. 

So for me, I got a crash course in SAN architecture and I was glad my network admin and our SAN vendor/admin (Brian Kuebler of The ATS Group) were onsite during the install.  This is something that I would recommend, as most of the first day will be spent getting the SAN and network to communicate correctly with the Polyserve software.  Once things are working correctly and you know the IP addresses of your switches and VLANs, the software is surprisingly easy to configure.

Another thing that you will be asked about is the LUNs for the system.  Polyserve requires 3 small 100 MB LUNs that it uses for storing cluster membership information.  Otherwise you can set up your data LUNs in many different ways.  We set our 1.5 TB up in 20 GB increments and will add more as necessary.  Why 20 GB increments?  Well, it was a round number, small enough to handle most things but not too big either.  Actually, it really came down to the number of spindles that we were getting.  The more spindles, theoretically the faster your reads/writes should go.
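As a quick back-of-the-envelope check on how many 20 GB LUNs that is, here is the arithmetic in Python.  The exact count depends on whether 1.5 TB is counted as 1500 GB or 1536 GB, which is an assumption either way, and it ignores the three small membership LUNs.

```python
def full_luns(total_gb, lun_gb):
    """Number of whole LUNs of size lun_gb that fit in total_gb."""
    return total_gb // lun_gb


print(full_luns(1500, 20))  # 1.5 TB counted as 1500 GB -> 75
print(full_luns(1536, 20))  # 1.5 TB counted as 1536 GB -> 76
```

Either way you end up with somewhere around 75 data LUNs to carve up among your volumes.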

Next we talked about our file structure and how and where we wanted to store the install files, data files, transaction logs, temp Db's and such.  All of these are going on our SAN and will be shared across all servers/instances.  We used junction points in Windows 2003 to map a directory to a volume.  It makes it look just like a regular windows folder but it is really on your SAN.

Then we talked about our network details such as machine names, IP addresses, and domains, and then set up a local account and the requisite domain accounts for SQL services.  We also installed .Net 2.0 on each machine in the matrix as that is a requirement.  Now we finally got to install the Polyserve software and all the fixes.  Then we installed SQL and all the Service Packs, rebooted the servers, and started some configuration so the system knew about the fiber channels and community strings.  We also configured the 3 membership partitions and were able to start the service.

The next part just blew my mind.  We were able to push this configuration out to all of our other servers with just a few clicks. 

Then we were able to start building our dynamic volumes.  We had several volumes:  sql_install_bin (for the sql install files and it had a block size of 4K), sql_instances (for the active virtual instances--think live Master, model, and msdb and all the data files were set with block sizes of 8K), mxsqlshells (for the failover instances that are not used), sql_data_1 for the actual MDF and LDF files, sql_log_1 for the transaction logs and finally, sql_tempdb for the temp files for each instance. 

At this point, we called it a day and were given some homework called RTFM (read the fun manual).

Experiences with Polyserve...

by swjohnson 4/15/2007 4:49:00 PM

Let me start off by saying that this and upcoming blog entries are not a sales pitch but rather my experiences (good and/or bad) with a software solution that we selected to better manage our Microsoft SQL Server systems.  We have just installed it over the past three days and I am on the flight back home, so we are only starting to really learn the application and what it can do for us.  This first post will be about our decision and why, and a bit about the selected solution.  Then, I will have one or two posts about the installation and initial tests.  As well, I will do additional posts when I see a need or come across something that would be beneficial to know before purchasing (ah yes, those infamous "if I only knew THAT before, I wouldn't have done it that way" moments).

Polyserve is a high availability solution for a variety of Linux and Windows environments in either 32 bit or 64 bit versions.  We are using ours for the management of our growing SQL Server farm, so we have installed the Polyserve Matrix Server and the Polyserve Matrix SQL Server applications.  We had several different SQL environments for each of our product lines and each one had its own unique failover and management requirements.  Since several of our business lines were growing so rapidly and we could see what senior management was planning for future growth, we laid out a plan to control the SQL proliferation for the entire company before it attacked us.

Our current environment ranges from a series of active/passive clusters using Double-Take for mirroring and failover on one end of the spectrum to stand-alone SQL Servers with log shipping on the other (one using a manually built system for log shipping and one using Red-Gate’s SQL Backup software for failover).  So about half of the machines were really just sitting idle and basically collecting dust, and that was a serious chunk of money to waste in my opinion.  Also, we have been doubling our business activity every year for the last four years, and our current design was starting to reach the limits of the solutions we had in place.  We knew we had to scale out or up.

To make matters worse, we never really fully liked our failover system as the failback was rather painful and could require some downtime--thankfully it rarely happened.  Also, we are in the midst of developing a new transaction processing system that we believe would really tax our current structure and we are starting to use SQL Server Reporting Services for all of our reporting needs whereas previously the data was exported and MS Access was used for our reports. 

So we did a fair amount of research about high availability solutions for SQL Server, from MSCS to Virtualization to Partitioning to Replication to Mirroring.  We talked with quite a few vendors, but in the end we thought Polyserve would best meet our needs and goals.  Our high level goals were pretty simple and are listed in order of importance:

  1. provide scalability and extensibility for our rapid growth for the next several years,
  2. allow us to better utilize our existing hardware (i.e. not require us to have any or very few idle machines),
  3. reduce administration and associated costs,
  4. create a failover system that we could trust and help us meet our demanding SLAs,
  5. allow us to move to 64 bit SQL Server 2005 on Windows 2003 for the bulk of our system and SQL Server 2000 and Windows 2003 (both 32 bit) for about 4 systems and upgrade them easily at a later date.

So why did we choose Polyserve?  Well, it was the one solution that we felt met all of our listed goals.  The price was within our budget (sorry I will not tell what we paid, let’s just say it isn’t cheap but should pay for itself in less than one year) and we were able to secure the necessary funding.  

With Polyserve it is very easy to move SQL Server instances around from one server that is being over utilized to another server that is not as busy.  We were also able to consolidate several servers with smaller volumes of transactions onto one server by stacking instances so that each one still maintained its own environment and configuration without affecting the others.  Simply put, we were able to better manage our load volume.  Instead of managing a single server, we are managing resources or pools to provide better up-time and responsiveness of the system overall.

All servers can see the same data via their shared Cluster File System.  This basically eliminates the need to mirror the data and subsequently reduces network traffic.  Data integrity is also protected with their distributed lock manager.  Since the data is all shared, each node in the cluster can manage the other nodes, and since your SQL databases are all stored in one location on your SAN, your backups are much easier to manage.

Their Matrix Volume Manager manages the LUNS from our SAN (a SAN is a requirement for their solution and while they are compatible with most, you should check their compatibility section on their website).  As such, it is simply a matter of creating additional LUNS and publishing them to your network and the Matrix Volume Manager can quickly add that additional space into the existing system and you have just extended your disk space. 

You are allowed up to 16 servers of various processor sizes and OS configurations in a matrix.  A matrix is a grouping of servers that are all actively running SQL but have each other as a failover.  Therefore, unlike MSCS where all your hardware needs to be exactly alike, you can mix and match servers to create your matrix.  So you can consolidate your existing systems into your solution and rotate in newer, more powerful machines as you purchase them.

Their software allows us to bring a new server online and easily install all the appropriate SQL instances on the server.  A new server can be added to the matrix in a very short amount of time.  The software also has features that allow you to push SQL Updates and Hotfixes to all your servers in one shot.  For example, we installed 10 instances of SQL Server in our matrix.  If I had to do that manually on each server, it would have taken more than a day.  However, with their tools, it took about 10-15 minutes.

Ok, so now you know what we were looking for and why we chose Polyserve.  In my next post, I will start talking about our experiences as we installed the solution and started using it.

As a side note, when we purchased the software, Polyserve was an independent company.  However, about a month after we signed the agreement, they were acquired by Hewlett Packard.
