
Building a highly available Datacore SAN

Created on Thursday, 22 December 2011 09:20


A few weeks ago, one of our customers asked whether the SAN environment we had proposed to him is really that fault-tolerant and can handle errors in each and every part (assuming only one part is affected at a time). He also asked how Datacore SANsymphony-V handles split-brain scenarios. I guess this is a question customers all over the world are concerned about when choosing a new SAN environment. You won't spend thousands of dollars/euros on a SAN that won't survive a single failure. You may think: well, it's a SAN, put in redundant components and that's it. Yes, it could be that easy, but what if you have two datacenters with a spanned SAN running active-active? In that situation you should keep some things in mind, and that's what this article is all about.

First, let's draw a picture of what we are talking about.

You will have two separated datacenters. Let's assume they are in two different buildings on the same campus, about 200m apart. Both datacenters are equipped with one Datacore SANsymphony-V node and two Fibre Channel (or iSCSI, it doesn't matter) switches in separate fabrics. The kind of storage SSY-V uses doesn't matter either. To simplify matters we use direct attached storage. The two SSY-V nodes have front-end ports in each fabric and two mirror ports that are directly connected via loop to the other node. It's also possible to use the two FC fabrics for the mirror traffic, but in most environments you would prefer direct attachment.

As hosts there is a VMware vSphere 4/5 host in each datacenter, using the mirrored virtual disks presented by the Datacore environment. The LAN is shown as non-redundant, but for this case it doesn't matter. You could also have redundant LAN fabrics. Well, this looks like a standard layout for a redundant SAN environment. Sure it is, we don't have to reinvent the wheel, but you should really know what happens for each kind of failure in the SAN. Now, let's have a look at the different things that could happen and how the SAN and the application hosts will be affected.

1. The vSphere server in datacenter1/2 crashes: don't care, VMware HA/FT will take care of that failure.
2. The vSphere server in datacenter1/2 loses a SAN path because of a broken cable: VMware MPIO will take care of it and switch the active path to one of the surviving paths.
3. One of the FC switches in datacenter1/2 crashes: VMware MPIO will take care of it and switch the active path to one of the surviving paths.
4. The SSY-V server in datacenter1/2 crashes: the second node in the SSY-V server group will take over and traffic will be rerouted to the surviving SSY-V node.
5. Storage failure on the SSY-V server in datacenter1/2: see answer 4.
6. The whole datacenter1/2 has an outage (perhaps power or whatever): all of the above will handle the SAN failover, and after a short period of time your VMs are back online.

The cases above are quite easy to understand and to handle because all these errors result in defined situations. But what if you run into undefined situations? And what the hell is an undefined situation? An undefined situation occurs if the SAN or the application host doesn't know exactly what's wrong and therefore makes the wrong decision, or no decision at all. This can happen in various ways. Let's have a look at these situations.

If you have a geographically distributed SAN like in our situation, there is always the risk that the connection between the two datacenters will be lost. In clustered SAN environments this could lead to a split brain. A split brain happens if the two storage controllers (here the SSY-V nodes) lose their connection but remain active. The local application servers in each datacenter remain online and keep working, thus changing data on the storage. Now the connection comes back up again and there are changes on both sides. Merging these changes can cause data corruption, and that (in simple terms) is what a split brain is.

How is our environment affected by a split brain? In simple terms: not at all. That's because split brains can only happen in a clustered storage environment. Datacore's SANsymphony-V (and SANmelody/SANsymphony7 too) isn't a cluster. There is nothing shared between the two nodes, no quorum that could be corrupted. SSY-V is a grid architecture, and that's why a split brain isn't even possible. Back to what could happen if there are problems on the connections between the datacenters.

1. One of the ISLs between the SAN fabrics is lost: no problem, the MPIO of the application servers (the vSphere servers) will take care of it.
2. Both ISLs between the SAN fabrics are lost: the MPIO of the VMware server will also take care of that situation. Traffic will be routed to the local SSY-V node and replication will be done via the separate mirror links.
3. LAN communication between the SSY-V nodes is lost: SAN traffic will be handled normally, but you will be unable to make any configuration changes in the SSY-V GUI.
4. One of the mirror paths between the SSY-V nodes is lost: mirror traffic will be re-routed over the surviving path.
5. Both mirror paths break down but the LAN communication is still available: Datacore handles this situation and denies access to one side of each mirrored virtual disk. In that case the mirror will be out of sync, but one side will be online and can be accessed by hosts.
6. Worst case: all ISLs (LAN and SAN) and mirror paths between the two datacenters break down in exactly the same second: in that situation, Datacore will be unable to set one side of each virtual disk offline as it would in the example above. This leads to both sides being active. In this case, application servers like our two ESX servers above can continue to access the SAN storage and keep working.

Well, sounds good, so why panic? Because of the way vSphere works. Let's assume there are two virtual machines on the mirrored disk in the picture above. Since it is an active/active setup, it is possible that VM1 runs on ESX1 in datacenter1 and VM2 runs on ESX2 in datacenter2. In this case, both VMs will keep running and keep changing data on the virtual disk. This leads to changed data on both sides of the same virtual disk. As soon as the connection between the two SSY-V nodes comes up again, Datacore tries to resync the mirror, but that's not possible with changed data on each side of the mirror. In that case, Datacore responds with a so-called "double failure" and sets both parts of the mirrored disk offline to prevent data corruption. There is a manual and, under certain circumstances, time-consuming way to resolve this situation, but describing it would go too far for this article. Such a situation should ALWAYS be solved by getting in touch with Datacore support. They will help you bring your SAN online without risking data integrity.

It is important to remember that regardless of which connection failure you have, Datacore will handle it and would rather stop access to the data than risk data corruption.
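The link-failure cases above can be summarised in a small lookup function. The following Python sketch is only a reading aid built from the behaviour described in this article; the names `LinkState`, `expected_behaviour` and `resync_result` are made up for illustration and have nothing to do with any Datacore API. Combinations the article does not describe are reported as such instead of being guessed.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LinkState:
    """Hypothetical description of the inter-datacenter links."""
    san_isls_up: int      # working SAN fabric ISLs between the sites (0..2)
    mirror_paths_up: int  # working direct mirror paths between the nodes (0..2)
    lan_up: bool          # LAN communication between the SSY-V nodes


def expected_behaviour(state: LinkState) -> List[str]:
    """Return the behaviour the article describes for a given link state."""
    if state.mirror_paths_up == 0:
        if state.lan_up:
            # Case 5: mirror paths down, LAN still available.
            return ["access denied to one side of each mirrored virtual disk; "
                    "mirror out of sync, other side stays online"]
        if state.san_isls_up == 0:
            # Case 6: everything between the sites is gone.
            return ["both sides stay active; diverging writes can end in a "
                    "double failure on reconnect"]
        return ["combination not described in the article"]

    notes: List[str] = []
    if state.mirror_paths_up == 1:
        notes.append("mirror traffic re-routed over the surviving mirror path")
    if state.san_isls_up == 1:
        notes.append("host MPIO switches to a surviving path")
    elif state.san_isls_up == 0:
        notes.append("hosts use the local SSY-V node; "
                     "replication runs over the mirror links")
    if not state.lan_up:
        notes.append("SAN traffic normal; no configuration changes via the GUI")
    return notes or ["normal operation"]


def resync_result(changed_on_side_a: bool, changed_on_side_b: bool) -> str:
    """What happens when the nodes see each other again after an outage."""
    if changed_on_side_a and changed_on_side_b:
        return "double failure: both sides set offline"
    return "mirror resynchronised"
```

Calling `expected_behaviour(LinkState(0, 0, False))` returns the worst case, and `resync_result(True, True)` shows why it matters that VMs were writing on both sides in the meantime.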

With that knowledge in mind, there are some simple best practices you should follow:

1. If you have a geographically distributed Datacore SAN and ESX servers on both sides, do not share datastores between virtual machines. Use one datastore for each VM. You can use several datastores for a single VM, but do not use one datastore for two or more VMs. This will help you keep downtime as low as possible even in the split datacenter scenario above. vSphere 4 and 5 currently support up to 256 LUNs per host, so if you have fewer than 256 VMs in your environment this is the way to go. If you have more than 256 VMs, use the 1VM/LUN approach for important VMs and shared datastores for the rest.
2. Try to use separate ISLs. Do not bundle all connections over a single cable. Try to route the ISLs over different ways to keep the impact of a cable failure as low as possible.
3. Use high quality cables for the ISLs and keep the maximum length for 2/4/8GBit in mind. You won't be happy trying to push 8GBit over a 300m OM2 cable.
4. If possible, separate the mirror paths from the SAN fabric ISLs and use direct-connected cables (HBA <-> HBA).
5. Use the right path policy for your environment. Round-robin is fine for load balancing and maxing out your storage's performance but has some caveats you should know about. I will write an article on that in the near future, so come back and check it out.
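The first best practice boils down to a simple planning rule, sketched below in Python. This is only an illustration under stated assumptions: the `VM` type, the function `plan_datastores`, the `ds-...` naming scheme and the idea of reserving a fixed number of shared datastores are all hypothetical, and the 256 LUN limit is the figure quoted above for vSphere 4 and 5. A VM count of exactly 256, which the text does not address, is treated here as still fitting.

```python
from typing import Dict, List, NamedTuple

MAX_LUNS_PER_HOST = 256  # limit quoted in the article for vSphere 4 and 5


class VM(NamedTuple):
    name: str
    important: bool


def plan_datastores(vms: List[VM],
                    max_luns: int = MAX_LUNS_PER_HOST,
                    shared_count: int = 1) -> Dict[str, str]:
    """Map each VM name to a (hypothetical) datastore name.

    If all VMs fit under the LUN limit, every VM gets its own datastore.
    Otherwise important VMs get dedicated datastores until the budget is
    used up, and all remaining VMs are spread over shared datastores.
    """
    if len(vms) <= max_luns:
        return {vm.name: f"ds-{vm.name}" for vm in vms}

    if not 1 <= shared_count < max_luns:
        raise ValueError("shared_count must be between 1 and max_luns - 1")

    dedicated_budget = max_luns - shared_count
    plan: Dict[str, str] = {}
    dedicated_used = 0
    shared_index = 0
    # sorted() is stable, so important VMs come first in their original order.
    for vm in sorted(vms, key=lambda v: not v.important):
        if vm.important and dedicated_used < dedicated_budget:
            plan[vm.name] = f"ds-{vm.name}"
            dedicated_used += 1
        else:
            plan[vm.name] = f"ds-shared-{shared_index % shared_count}"
            shared_index += 1
    return plan
```

With the default limit and, say, 300 VMs of which 40 are flagged as important, this yields 40 dedicated datastores and one shared datastore for the remaining 260 VMs.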

I hope you now have a deeper understanding of what happens in a Datacore/VMware environment in case of a hardware failure, and understand why you won't get many hits on Google if you search for "Datacore SANsymphony-V split brain".

A few hours after publishing the article below, Datacore sent out their email announcing the availability of the HUP (Hot Upgrade Process), or more precisely of the course DCIEs have to complete before performing the HUP. Currently their training scheduler only shows a single course in 2011 in Ft. Lauderdale; for 2012 there is currently no course scheduled. Hope this will change....

A few weeks ago, Datacore revealed that there will be an officially supported upgrade path from SANmelody3.x/SANsymphony7 to SANsymphony-V later this year. The upgrade won't be "cold" as previously stated but will be "hot".

Since the upgrade process isn't that easy and has some caveats, only specially trained DCIE-Vs (DataCore Certified Integration Engineers for SANsymphony-V) will be allowed to perform the upgrade. To qualify, you first need the DCIE-V status and then have to attend a one-day update course to learn how to upgrade to SSY-V. This course has been announced but is not yet available, so we have to wait a few more days or weeks. I hope the course will be available in early Q1/2012, because there are many SANmelody users waiting to upgrade to SSY-V to benefit from the new features and the enhanced GUI.

Created on Wednesday, 21 December 2011 08:22


All admins planning to upgrade to vSphere5 in a Datacore environment, take caution! Even with the latest SSY-V version (SSY-V 8.1 PSP1 Update 2) there is no official support from VMware. Datacore still hasn't made its way onto the VMware HCL (Hardware Compatibility List), so VMware can deny support for vSphere5 environments running on Datacore software. SANmelody3 and the older SANsymphony7 lack vSphere5 support, too. According to Datacore it's only a matter of time until SSY-V is on the HCL, but vSphere5 has been available for a few months and we are still waiting for official support. Nevertheless, we use vSphere5 on an SSY-V two-node grid in our demo lab, using the same settings that are valid for vSphere 4.x (Datacore's technical bulletin TB5c), and have had no problems for weeks now. So even if it's not officially supported, from a technical point of view there is, IMHO, no reason not to switch to a vSphere5/SANsymphony-V 8.1 combination.

At the beginning of August, Datacore released the new version 8.1 of their SANsymphony-V product. It brings the following new features (taken from the official release notes):

Automated Storage Tiering. Migrates frequently accessed data to fast storage (e.g. SSD), and infrequently accessed data to slower (lower cost, e.g. SATA) storage within a disk pool, based on access history, within individual virtual disks.

PowerShell cmdlet library. Provides a full-featured command line interface to SANsymphony-V. What can be done in the GUI can now be performed through a universal scripting interface, which opens up integration with 3rd party applications such as VMware vSphere, Microsoft System Center, etc.

Support for greater than 5 node Windows clusters. Increases the number of hosts that share virtual disks using SCSI Persistent Reserve. This means that Microsoft Windows 2008 clusters will no longer be restricted to the 5 node limitation. Microsoft currently supports up to 16 nodes in a cluster.

VSS. Enables the integration of SANsymphony-V with common Microsoft application backup environments, including Microsoft System Center Data Protection Manager (DPM).

Offline Replication Initialization. Provides the capability to initialize virtual disks to a transportable media and then ship to the destination server at the remote location.

iSCSI Redundant Mirror Paths. Adds support for inter-node mirror traffic over iSCSI to support mirror path failover.

Besides these new features, there are many bug fixes and some minor enhancements that make this release the recommended version for all production environments. A direct upgrade can be done from version 8.0 PSP2; customers using version 8.0 PSP1 or below have to contact Datacore support before applying the update.

This version still does not enable customers to upgrade from SANsymphony 7.x or SANmelody 3.x. There is currently no upgrade path (neither cold nor hot migration), so we still have to wait.... The same goes for automatic leveling after adding hard disks and for automatic space reclamation.

On July 30, 2011 SANsymphony-V 8.1 will be released with some cool new features. Here is a short overview of what is new in version 8.1:

- Automated Storage Tiering. Migrates frequently accessed data to fast storage (e.g. SSD), and infrequently accessed data to slower (lower cost, e.g. SATA) storage, based on access history, within individual virtual disks. This feature will be available as a base option for customers who are buying VL2 or VL3 and will be included for customers buying VL4 or VL5. This feature will not be available at VL1.

- PowerShell cmdlet library. Provides a full-featured command line interface to SANsymphony-V. What can be done in the GUI can now be performed through a universal scripting interface, which opens up integration with 3rd party applications such as VMware vSphere, MSFT System Center, etc.

- Support for >5 node Windows clusters. Increase the number of hosts that share virtual disks using SCSI Persistent Reserve. This means that Microsoft Windows 2008 clusters will no longer be restricted to the 5 node limitation. Microsoft currently supports up to 16 nodes in a cluster.

- VSS. Enables the integration of SANsymphony-V with common Microsoft application backup environments, including MSFT System Center Data Protection Manager (DPM).

- Offline Replication Initialization. Addresses a long-standing difficulty in deploying Replication. Bandwidth required for initialization is often 10x or 100x greater than that required to perform day-to-day changes. Customers have requested a way to ship data to avoid higher bandwidth costs. This avoids the "I don't want to buy that much bandwidth which I only use one time" problem. Instead, this provides the capability to initialize virtual disks to a transportable media and then ship it to the machine at the remote location.

- iSCSI Redundant Mirror Paths. Previously only Fibre Channel was supported for inter-node mirror traffic to support high availability; with this capability iSCSI is now available to customers as well.

- License activation enhancements. Improves the user experience by simplifying the activation process and reducing the number of keys and keystrokes needed; adds the ability to activate multiple keys in a single go, making manual web-based activation much easier. Also adds the ability to activate capacity partially: "I just got my 100 TB product key in email. I'll use it to put 20 TB on this server group, 40 TB on this other group, ..."

Created on Thursday, 16 June 2011 10:07


Only two weeks after PSP2 for SSY-V was released there is another update for Datacore's new storage virtualization product. Update1 solves some issues with trace logging, support bundle uploads and fixes a problem with VDS because of changed behavior of VDS after W2K8R2 SP1 installation. All users with current support are encouraged to update to this new release as soon as possible. The updated installation package can be found here (DataCore account required)

Created on Friday, 10 June 2011 15:07


If you have trouble installing SP1 for Windows Server 2008 R2 (error message 0x800f0a12, see screenshot below) while DataCore's SANsymphony-V is installed, follow these steps. ATTENTION: SP1 is only supported for SANsymphony-V installations with PSP2 or later applied! Do not try to update your Windows OS to SP1 if you are using any other version of Datacore software.

1. Stop the SSY-V service in the SSY-V GUI (right-click on the DCS you want to install SP1 on and select "Stop Datacore Server")
2. Close the GUI
3. Go to Windows services and set the "Datacore Executive" service to disabled
4. Reboot your server (you have to reboot the OS, simply stopping the service is not enough!)
5. Install SP1
6. Set the "Datacore Executive" service to Automatic and reboot your server once again

If you still have problems installing SP1, please check if AUTOMOUNT is enabled. To check, open a command prompt and start diskpart. Type "automount" in the diskpart command prompt and see if it is enabled or disabled. If it is disabled, enable it by typing "automount enable". You will get a success message. Exit diskpart and restart your server. Now SP1 should install without problems.
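The AUTOMOUNT check can also be scripted. The Python sketch below is one possible way to do it, not an official procedure: it feeds commands to diskpart through a script file (`diskpart /s`), and it assumes that the English-language output of the `automount` command contains the words "Automatic mounting" together with "enabled" or "disabled". That wording is an assumption, so check it against your own system (especially on localized Windows versions). The function names are made up for illustration, and the script must be run from an elevated prompt on the Windows server itself.

```python
import os
import subprocess
import tempfile
from typing import List, Optional


def parse_automount_state(diskpart_output: str) -> Optional[bool]:
    """Return True/False for enabled/disabled, or None if not recognisable.

    Assumes English output with a line mentioning 'Automatic mounting'.
    """
    for line in diskpart_output.splitlines():
        text = line.strip().lower()
        if "automatic mounting" not in text:
            continue
        if "disabled" in text:
            return False
        if "enabled" in text:
            return True
    return None


def run_diskpart(commands: List[str]) -> str:
    """Run the given diskpart commands via a temporary script file."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as script:
        script.write("\n".join(commands) + "\n")
        path = script.name
    try:
        result = subprocess.run(["diskpart", "/s", path],
                                capture_output=True, text=True, check=True)
        return result.stdout
    finally:
        os.remove(path)


def ensure_automount_enabled() -> bool:
    """Enable automount if it is disabled. Returns True if a change was made."""
    state = parse_automount_state(run_diskpart(["automount"]))
    if state is None:
        raise RuntimeError("could not determine automount state from diskpart output")
    if state:
        return False  # already enabled, nothing to do
    run_diskpart(["automount enable"])
    return True  # a restart of the server is still required afterwards
```

If `ensure_automount_enabled()` returns True, restart the server before retrying the SP1 installation, as described above.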

Created on Friday, 10 June 2011 12:42


Datacore released a new PSP (PSP2) for SANSymphony-V on 4th of July.

The change log for the new PSP is quite long and solves some major issues so all customers running SSY-V are recommended to update to the new release.

With PSP2, Datacore also starts supporting Windows Server 2008 R2 SP1 but you have to use PSP2. R2 SP1 will not be supported with SSY-V < PSP2.

Created on Friday, 10 June 2011 12:24


Recently we installed SANsymphony-V (SSY-V) in our demo lab. The installation ran fine, with no problems (as expected, because SSY-V is really a good product). A few days later we needed to update the motherboard BIOS on both DCS to upgrade to a new SmartArray controller. Unfortunately this caused both motherboards to die, probably because the BIOS flash image had some kind of error. In the end we had to replace both motherboards.

The motherboards used have two onboard LAN adapters, so with the replacement of the boards the MAC addresses of these two onboard ports also changed. These ports were used by SSY-V, one for management access and one for the mirror connection. After replacing the boards and booting up the servers, Windows automatically configured the new adapters with the correct IP information, and we only had to rebind the Datacore iSCSI driver to the new cards.

So far, so good....

Starting the SSY-V GUI, we first had problems connecting to the local server. Trying several times resolved that and we had the GUI open, but then the fun began.

The GUI started up and showed us...NOTHING. All virtual volumes were gone, and no server ports were shown on either DCS. Surprisingly, the attached application hosts had no problems accessing the storage via all configured paths. So mirroring and presentation of storage seemed to work, but the GUI was empty. Pressing F5 to refresh the screen brought up a tiny little red warning in the lower region of the GUI telling us:

Update failed: The Datacore Executive service has experienced an internal error. The error reported is: Object reference not set to an instance of an object

Googling for this kind of error only brings up developer sites dealing with the .NET Framework or other programming language topics, so this didn't help us at all. Opening a call with Datacore showed that this kind of problem was unknown until then, so the responsible support engineer scheduled a remote session. A few days later, Datacore found the error. There must have been some problems with the SSY-V internal config files Xconfig.xml and Xconfig.jrl. Unfortunately, only Datacore can repair these files. We had to exchange several support bundles with Datacore and they built a new config for us. After installing the new config files (a simple copy&paste to the SSY-V install dir and a restart of the DCS service), all objects were available again.

If you ever see this tiny little red warning at the bottom of the SSY-V GUI and you can't see all your objects, don't search the internet for a solution any longer. Open a call with Datacore and ask for a new config file.