Update 3 : I have added two new aspects to consider – the notion of the guarantee of the array you buy and the aspect of capacity.
Update 2 : If you are interested in my recommendations for All-Flash Arrays please read my most recent post on the new class of flash arrays that fully support deduplication and a full range of storage features, Recommendations for All-Flash Storage Arrays; Aiming Beyond Simply IOPS and Low Latency. But make sure to read this as well as it provides a structure for helping you to determine how to go about choosing an all-flash array.
Update 1. If you are grappling with what questions to ask an SSD or flash storage array vendor who says his last customer saw a 10x increase in performance but can’t give you a write-up of the hardware and software configuration – consider that maybe the 10x increase wasn’t only the flash in the array. And if everything on the market represents “next generation”, “cloud” or “enterprise” infrastructure and software, then the words “next generation” kind of lose their meaning. Here are a few candidate questions I have assembled to consider when talking to flash array vendors. I am continuously adding other questions that should be asked of flash storage array vendors. In this post I have added some more – Replication, Backup and Restore software, Monitoring, Automation and Data (Protection) and Company Viability.
In the past, I aimed at the heart of the matter in another post, Today’s IOPS Matter Less Than A Good Architecture and Storage Features. In today’s post, I am aiming at providing anyone looking at purchasing a new flash storage array with some questions that might cause tachycardia among sales folks – and perhaps even worse among marketing folks, who often do their best to hide a product’s warts. These are part of the notes from a long-running work-in-progress, An Introduction to Using High-Performance Flash-Based Storage in Constructing High Volume Transaction Architectures – A Manager’s Guide to Selecting Flash Storage – which will be coming out in the near future.
This post is only part of the story. One of the architectural choices being made today is whether to use server-local flash or networked, array-shared flash. Architects have made, and continue to make, both architectures work. And the choices are becoming more interesting because the SSD and PCIe flash vendors are providing increasingly higher capacities and in some cases their solutions are hot-swappable.
It is possible to choose, for example, Fusion IO’s PCIe flash card solutions to provide in-memory databases with extremely high performance. Obviously, even though networks are getting faster – not traversing a network provides advantages. We will discuss these aspects in a future post.
If you have a fire-breathing dragon of a flash array that can deliver millions of IOPS, but you can’t increase its storage capacity via data reduction (deduplication, compression and thin provisioning), can’t upgrade the array while it’s running live in production, can’t easily replicate the data on it, can’t scale out twenty or so arrays into one unified view of the storage, can’t guarantee service levels or throttle those IOPS with quality-of-service, and can’t protect running operations with scale-out forms of high availability – what do those IOPS serve ? Features that support cloud and enterprise operations within these flash storage arrays are as important as, or more important than, sheer IOPS, and architecture and price are certainly important considerations as well. In reality, flash usually does provide better performance, but in the end the IOPS that flash delivers matter less than a good architecture, good storage features and the capacity you need – some of which you may already have with your traditional disk architecture.
Some flash array vendors want you to ask the other vendors ‘gotcha’ questions. This gets silly and often turns into asking questions of low relevance to you. Also, the focus on high performance and low latency is only a fraction of the problem we all face. Some things can be very fast and not have the features you need, yet some vendors have focused overly on performance and low latency because historically they have been very weak in other areas. Here is my list of questions to ask. These are not ‘gotcha’ questions; some may relate to what you are doing and some may be less relevant.
All the flash array vendors today have managed to do quite well at offering great performance and latency numbers on their arrays. Not all of them are equal in terms of storage features.
A quick word about benchmarks. When benchmarks are offered up, it’s good to know what is being demonstrated. Keep in mind exactly what type of benchmarks or workloads are executed, what block sizes the measured IOPS refer to, and whether the benchmarks are read-only or a mixed read/write workload. Obviously, benchmarks done by reputable benchmarking sites like thessdperformanceblog, MySQL Performance Blog, Tom’s Hardware and StorageReview.com can speak volumes about the product in question. However, vague anecdotal evidence is increasingly used to avoid providing these sites with the hardware to test and to keep customers from seeing a true comparison. Anecdotes like “Customer X saw a 10x improvement in metric Y”, while interesting, are often virtually meaningless without a detailed whitepaper covering the workload, network, hardware and software configuration and rigorous test information. Consider that a boost in performance can often be demonstrated by simply going from 8 Gbps FC to 16 Gbps FC cards. Did customer X really see a 10x improvement because of the flash array alone, or was it a combination of adding memory or adding high speed networking ? Anecdotes are increasingly used loosely to avoid hard benchmark results. Beware of the anecdote bomb – it is usually delivered with the intention of avoiding giving up hard information or distracting from reputable, independent third-party benchmark sites. The anecdote bomb looks something like this: “X was able to achieve 8x processing times”, and then the discussion turns to that specific 8x increase in processing times – there is no real comparison of before/after configurations (including possible network upgrades). The 8x or 20x number that is provided is intended to distract and to become the number the buyer will assume they will get.
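One way to put a quoted IOPS figure in context is to convert it into a data rate at the stated block size and compare that with what a single host link could carry. The following is a minimal Python sketch of that arithmetic. The function names, the 200,000 IOPS quote and the 0.8 usable fraction of the line rate are illustrative assumptions, not vendor data.

```python
def throughput_mib_per_s(iops: float, block_kib: float) -> float:
    """Data rate implied by an IOPS figure at a given block size, in MiB/s."""
    return iops * block_kib / 1024.0


def link_ceiling_iops(link_gbps: float, block_kib: float,
                      usable_fraction: float = 0.8) -> float:
    """Most IOPS one link could carry at this block size.

    usable_fraction is an ASSUMED share of the nominal line rate that is
    left for payload; measure or look up the real value for your fabric.
    """
    usable_bytes_per_s = link_gbps * 1e9 / 8 * usable_fraction
    return usable_bytes_per_s / (block_kib * 1024)


if __name__ == "__main__":
    # Hypothetical quote: 200,000 IOPS measured with 4 KiB blocks.
    print("MiB/s at 4 KiB :", throughput_mib_per_s(200_000, 4))
    print("MiB/s at 64 KiB:", throughput_mib_per_s(200_000, 64))
    for gbps in (8, 16):
        print(gbps, "Gbps link ceiling at 4 KiB:", link_ceiling_iops(gbps, 4))
```

Under these assumptions the same IOPS number means a very different data rate at 4 KiB than at 64 KiB, and doubling the link speed doubles the ceiling on its own, which is why a before/after claim needs the block size and the network configuration stated alongside it.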
On to some questions you might consider asking.
Does the array support full redundancy with hot-swap everything ? For example, does it have two controllers ? Do the scale-out arrays support each other if a full array fails ? This type of availability translates into increased resiliency. It also translates into an ability to hot-swap all components of an array in production without downtime. Or is resilience built specifically into a cluster of nodes to avoid outages ? There are different ways to handle resiliency.
How does the flash array handle failure ? What if I pull a hot component out ? How will the array behave ? If you have data corruption, you have a serious problem. This is not a science question. It’s good to know what happens before someone accidentally pulls the wrong component out of the wrong array in the middle of the night.
Does the array support full non-disruptive upgrades to all aspects of the array ? Anytime you need to upgrade the operating environment in your flash array, you want to do so without taking an outage. Some vendors in the very recent past told you they had non-disruptive upgrades, but the sordid truth is they didn’t have full non-disruptive upgrades. They could partially upgrade their array without taking an outage, but that’s not a full upgrade of the array and you are left with an array inhabited by two versions of their operating system. If you don’t have full non-disruptive upgrade capability, any major upgrade will probably require taking an outage. Why put the burden of this work on your storage administrators when the array should be doing this for you ? Array vendors like Nimbus Data, SolidFire, Pure Storage and HDS provide full non-disruptive upgrades.
Does the flash array’s operating environment support data reduction features ? Specifically : de-duplication, compression and thin-provisioning ? De-duplication can save you considerably in storage costs, and there is a lot of effort and discussion around dedup. It increases the effective capacity of your array. Is there an add-on cost for these in-line data-reduction features ? Data reduction matters. It means you spend less on additional flash storage, and it saves on datacenter space plus power and cooling costs. HDS, Pure Storage, SolidFire and Nimbus offer all three forms of data reduction with their arrays.
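The effect of data reduction on capacity and cost is simple arithmetic, sketched below in Python. The 20 TB array, the $200,000 price and the 4:1 reduction ratio are made-up figures for illustration; real ratios depend heavily on the workload, so treat any vendor ratio as something to verify against your own data.

```python
def effective_capacity_tb(raw_tb: float, reduction_ratio: float) -> float:
    """Logical capacity after data reduction (ratio 4.0 means 4:1)."""
    return raw_tb * reduction_ratio


def cost_per_gb(price: float, capacity_tb: float) -> float:
    """Price per GB, using decimal units (1 TB = 1000 GB)."""
    return price / (capacity_tb * 1000)


if __name__ == "__main__":
    raw_tb, price, ratio = 20.0, 200_000.0, 4.0   # hypothetical figures
    eff_tb = effective_capacity_tb(raw_tb, ratio)
    print("effective TB       :", eff_tb)
    print("$ per raw GB       :", cost_per_gb(price, raw_tb))
    print("$ per effective GB :", cost_per_gb(price, eff_tb))
```

Comparing vendors on price per effective GB only makes sense if the reduction ratio was measured on data like yours, and if any add-on license cost for the data-reduction features is included in the price.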
Does the array support scale-out features ? Can you natively cluster multiple arrays to produce a single view of storage ? Some vendors like SolidFire and Kaminario offer an ability to cluster their arrays from five to a hundred nodes. You get not only a single view of your flash storage but also cluster redundancy. A number of flash arrays can scale only up to four nodes; others can scale much higher. If you are building an enterprise cloud or public cloud it is important to know that you can scale to petabytes rather than to the low hundreds of terabytes. For example, some vendors can scale out to four (and later eight) nodes – 280 TB or 560 TB – versus others that can scale to multiple petabytes and over 100 nodes.
Does the array natively support snapshots and clones ? Is this an add-on cost ? Consider that this feature gives you a quick and easy way to do point-in-time snapshots of your data. In virtualization you can leverage clones, for example VMware linked clones, which help conserve storage space.
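A quick way to check whether a scale-out limit fits your plans is to work out how many nodes a target capacity needs. The Python sketch below uses the 280 TB at four nodes figure from above, which works out to 70 TB per node. The 2 PB target is a hypothetical requirement, and the calculation assumes capacity grows linearly with node count and uses decimal units (1 PB = 1000 TB).

```python
import math


def nodes_needed(target_tb: float, per_node_tb: float) -> int:
    """Smallest whole number of nodes that reaches the target capacity."""
    return math.ceil(target_tb / per_node_tb)


def fits_cluster(target_tb: float, per_node_tb: float, max_nodes: int) -> bool:
    """True if the target capacity fits within the cluster size limit."""
    return nodes_needed(target_tb, per_node_tb) <= max_nodes


if __name__ == "__main__":
    per_node_tb = 280 / 4        # 70 TB per node, from the example in the text
    target_tb = 2000             # hypothetical requirement: 2 PB
    print("nodes needed      :", nodes_needed(target_tb, per_node_tb))
    print("fits in 8 nodes   :", fits_cluster(target_tb, per_node_tb, 8))
    print("fits in 100 nodes :", fits_cluster(target_tb, per_node_tb, 100))
```

The same check can be rerun with effective rather than raw capacity per node if the vendor's data reduction has been validated on your workload.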
Does your flash storage vendor support NFS and SMB/CIFS ? Both of these are pervasive. For example, some companies rely completely on NFS as their shared file system. Others that are Windows-centric rely on SMB. Key question – what version of these protocols ?
How is the flash array serviced ? Some arrays require sliding them out of the rack and taking the top off to expose components for service. Others can be serviced in the rack from the front or back. This makes a difference. Also, how service-safe are the components ? Does the act of servicing the array itself pose a risk, or has the vendor designed fool-proof serviceability into the array ?
In some environments, like cloud providers, there is a desire for quality-of-service (QoS). Does the array offer QoS ? The quality-of-service feature allows performance guarantees and keeps some hosts from over-consuming IOPS. Some of these cloud providers use QoS to monetize storage IOPS guarantees. This is something SolidFire does extremely well. Consider that in virtualized environments memory, CPU and storage space are controlled via VM/system resource management. Already, some vendors treat IOPS in the same way, and this is becoming extremely important in heavily virtualized cloud settings.
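To make the idea of limiting IOPS per host concrete, here is a minimal Python sketch of a token bucket, a common generic technique for rate limiting. It is not a description of how SolidFire or any other vendor implements QoS. The class name `IopsLimiter` and its parameters are hypothetical, and it covers only the maximum (throttling) side; minimum guarantees need scheduling across all tenants, which is beyond this sketch.

```python
import time


class IopsLimiter:
    """Token bucket: refills at max_iops tokens per second, up to burst tokens."""

    def __init__(self, max_iops, burst=None, clock=time.monotonic):
        self.rate = float(max_iops)
        self.capacity = float(burst if burst is not None else max_iops)
        self.tokens = self.capacity      # start with a full bucket
        self.clock = clock               # injectable so tests can fake time
        self.last = clock()

    def allow(self, ios=1):
        """Return True and consume tokens if the I/O may proceed now."""
        now = self.clock()
        # Add tokens earned since the last call, capped at the bucket size.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= ios:
            self.tokens -= ios
            return True
        return False


if __name__ == "__main__":
    limiter = IopsLimiter(max_iops=1000)   # hypothetical per-volume limit
    admitted = sum(limiter.allow() for _ in range(5000))
    print("admitted in one burst:", admitted)
```

A host that stays under its limit never notices the bucket, while a noisy host is held to its configured rate instead of starving its neighbours, which is the behaviour a cloud provider would want to verify when a vendor claims QoS support.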
How easy is it to use the array ? In other words, does the array provide an easy-to-use interface to create, export and delete LUNs or file systems ? Does it provide a way to easily view the LUNs and exported file systems ?
Does the array support replication and, if yes, what type ? Replication is an important feature. It provides an efficient method of keeping a copy of the data at a remote site, and it is a strategy which underlies disaster recovery.
When one or more SSDs or flash modules suddenly die, what happens ? What protection is provided to support continued operations ? How fast are the rebuilds ? What mechanism is used for protection – RAID or a RAID-like technology ?
Do they have a backup and restore strategy for the product ? Does the vendor bundle in backup and restore software, or do they provide you with recommendations ? Or do they leave it totally up to the customer to figure out the backup/restore strategy ?
Does the vendor provide detailed reference architectures for building out a cloud or database cluster ? Do these reference architectures provide the details – servers used, storage used, software used, etc ? Are they collaborating with the companies that create the servers and operating systems ?
What tools are provided to monitor, first, the individual array and, second, multiple arrays aggregated into a single cloud storage view ? Some vendors have provided nice UI and command line tools for monitoring while others have remarkably sophisticated GUIs that
What level of automation and management is provided ? In other words, are there user interface and command line automation tools that allow quick ways of managing storage ? Is there serious support for leading management frameworks such as OpenStack, CloudStack, VMware and other automation frameworks ? Support for SNMP ?
Does the array support third-party (Microsoft, Oracle, etc) in-memory databases ? Increasingly there is support for in-memory databases that leverage both main memory and flash storage (whether PCIe cards or flash arrays). In the PCIe card world, Fusion IO has demonstrated remarkable performance by leveraging in-memory features of databases like MySQL and Microsoft SQL Server. The open question for flash array vendors increasingly is whether they can demonstrate a qualitative advantage in leveraging any of these in-memory database features.
What levels of network performance are supported ? New and extremely high performance network cards are becoming increasingly available. For example, 16 Gbps Fibre Channel, 40 Gbps Ethernet and 56 Gbps InfiniBand are all available or soon will be – does the vendor support these networking cards ?
What support is there for virtualization standards and products ? Some vendors have been extremely slow to support OpenStack and have offered only rudimentary support; other vendors have chosen to support OpenStack by offering cloud reference platforms based on it. Obviously this makes a difference in your enthusiasm for a vendor if you have chosen to go the OpenStack road. VMware is another virtualization choice where support is important. Does the vendor support VMware ? In what ways ? Do they support VMware’s storage APIs ? Some companies, like Tintri and Nutanix, have attacked the problem by offering products that provision VMs, highlight rapidly changing VMs, and create highly distributed architectures, with a number of features aimed at heavily virtualized environments and cloud stacks.
What support is there for cloud use cases ? If you are building a cloud it is worthwhile asking a simple question – what public and private cloud infrastructures use your flash storage hardware ? Some companies are able to serve cloud providers because they support the features cloud providers need. A simple example is scale-out combined with quality-of-service: scale-out lets cloud providers aggregate the arrays’ storage into one large storage namespace, and quality-of-service then controls that storage so that guaranteed service levels can be provided to customers.
Is the company viable ? Is it stable ? It should go without saying, but a company’s viability needs to be taken into consideration. I’m not talking about large versus small companies but about how viable the company is. There are plenty of small, innovative, viable companies with unique and excellent products. The things to watch are the stability of the execs at a company and, if it is public, the earnings reports. For example, if the company loses its CTO, COO and CEO within a month, that is something to be extremely concerned about. If it is focused on storage, has only a few products and is losing money on storage quarter after quarter, that is a concern too. These are clear warning signs. Is the company in heavy debt, with investors wanting to part it out or sell it ? Or is it relatively debt free ? Or is it a large company with a long history in storage and extremely sluggish growth ? Is the company viable for the long term ? All these are important considerations. Worse yet are the companies where sales people talk about the company being in hyper-growth mode while the company is quietly laying off people – it indicates a deep disconnect with reality. Large lay-offs mean a loss of corporate memory and corporate talent and often point to the company being poorly managed. When you see a company where the technical and sales people are turning over quickly, that is a red flag. It usually indicates a poorly managed company. Viability can also be read from the vibe the company produces in the marketplace. Is it one of continuous losses, a mass exodus of executives, lawsuits, steep stock declines, large lay-offs and competitive pressures ? A somewhat boring lack of activity ? Or is the vibe one of creativity, quiet growth, a vision of direction and a lack of bad news ? If the vibe produced is negative, it’s worth avoiding them and looking at a company that has a better chance of being here next year.
What is the maximum storage capacity provided ? You may need more capacity. Most flash arrays may not be able to address the amount of storage capacity you need. In that case, you may want to look at hybrid arrays like those from Tegile (which also produces flash arrays). Hybrid arrays combine traditional disk storage and flash storage and provide excellent performance when compared to traditional disk arrays. In a recent example, a university saw a hybrid array provide 25-40x the performance of its traditional storage while giving it much more capacity. The interesting aspect of this option is that it provides both a capacity and a performance choice, and one can couple these as a high performance duo (made easier if both are from the same vendor with the same OS).
Finally, does your flash storage vendor offer you a guarantee ? One such guarantee (again from Tegile) sets minimum levels for array performance, pricing per raw GB of storage, data reduction, heavy-write endurance (with free controller updates) and availability (a limit on downtime per year). The notion of a guarantee for your arrays is something of a game-changer in my mind. It puts the onus for performance, endurance and cost claims squarely on the vendor and provides the buyer with a significant advantage.
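An availability guarantee is easier to compare across vendors once it is translated into allowed downtime per year. A minimal Python sketch of that conversion follows. The percentages are generic examples, not figures from any particular vendor's guarantee, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60   # assumes a 365-day year


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Allowed downtime per year implied by an availability percentage."""
    return (1 - availability_pct / 100.0) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for pct in (99.9, 99.99, 99.999):   # example availability levels
        minutes = downtime_minutes_per_year(pct)
        print(f"{pct}% -> {minutes:.2f} minutes/year")
```

When reading a guarantee it is also worth asking whether planned maintenance windows count against this downtime budget, since that changes what the number means in practice.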
Depending on what you are doing, some or all of these may be of interest to you. In the end, it is about what you are trying to do with applications, performance, storage capacity required and a host of other aspects.
[ Photo : Dragon, Shanghai Art Museum. digitalcld.com ]
Go to more posts on storage and flash storage at http://digitalcld.com/cld/category/storage.