As drive capacities continue to increase and customer tolerance of data outages decreases, the high cost of RAID impacts how customers buy storage. Vendors hide most of RAID’s cost elements, making it hard to understand the real cost of protection. Exposing these costs and eliminating them is vital to lowering the overall cost of storage. Additionally, most of these costs make data more vulnerable to media failure, so addressing them also leads to improved data integrity and reliability.
The High Cost of RAID: Drive Redundancy
As discussed in “What is RAID?”, redundancy levels are typically set system-wide, meaning that all volumes on the system have to assume the same redundancy profile. A technology based on erasure coding, like our vRAID technology, enables IT to set redundancy per volume. A volume-level approach allows for a more granular application of redundancy settings. vRAID is also much more capacity-efficient than RAID. With vRAID, you sacrifice a smaller percentage of capacity to redundancy than with a comparable RAID 5 or RAID 6 configuration. However, these capacity savings are nothing compared to the savings you will realize if you can lower the drive redundancy setting.
The High Cost of RAID: Rebuild Times
Rebuild time is the time it takes for a media failure protection solution to return the system to a protected state. Most customers set drive redundancy to more than one drive because they are concerned about these rebuild times. It is not uncommon for an all-flash system that uses RAID 5 or RAID 6 to experience rebuild times of 10 to 12 hours and for hard disk customers using 12TB or 16TB HDDs to experience rebuild times of days or even weeks. If you are using single parity protection and another drive fails during the rebuild process, you lose all the data on the RAID group’s volumes. The key is surviving simultaneous drive failures. The longer the rebuild process takes, the greater the chances of a simultaneous drive failure.
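The link between rebuild time and risk can be sketched with a back-of-the-envelope model. The snippet below is only an illustration: it assumes drives fail independently at a constant rate, and the 2% annualized failure rate, the 11 surviving drives, and the function name are hypothetical values chosen for the example, not measurements from any product.

```python
import math

HOURS_PER_YEAR = 8760


def second_failure_probability(surviving_drives, afr, rebuild_hours):
    """Chance that at least one surviving drive fails before the rebuild ends.

    Assumes independent failures with a constant hazard rate derived from
    the annualized failure rate (AFR). Real drives in the same batch often
    fail in a correlated way, so treat the result as a rough lower bound.
    """
    # Convert the yearly failure probability into an hourly hazard rate.
    hourly_rate = -math.log(1.0 - afr) / HOURS_PER_YEAR
    return 1.0 - math.exp(-surviving_drives * hourly_rate * rebuild_hours)


# Hypothetical rebuild windows: minutes, hours, and days.
for label, hours in [("5 minutes", 5 / 60), ("12 hours", 12), ("5 days", 120)]:
    p = second_failure_probability(surviving_drives=11, afr=0.02, rebuild_hours=hours)
    print(f"{label:>10}: {p:.6%}")
```

For small probabilities the result grows roughly in proportion to the rebuild time, which is why shortening the rebuild window is such an effective lever.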
If a rebuild takes most of a day or even multiple days, you will more than likely want to increase your redundancy setting. Increasing your drive redundancy increases the capacity overhead allotted to protection and reduces the capacity available to applications. The net result is that your cost per usable GB increases.
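As a rough illustration of how redundancy feeds into cost per usable GB, here is a minimal sketch. It assumes the simplest possible accounting, where each level of redundancy consumes one drive's worth of capacity, and the drive count, size, and price are hypothetical.

```python
def cost_per_usable_gb(drive_count, drive_tb, drive_price, redundancy_drives):
    """Purchase cost divided by the capacity left after redundancy overhead.

    Simplified model: redundancy consumes `redundancy_drives` worth of
    capacity; file system and metadata overhead are ignored.
    """
    data_drives = drive_count - redundancy_drives
    if data_drives <= 0:
        raise ValueError("redundancy leaves no capacity for data")
    usable_gb = data_drives * drive_tb * 1000  # decimal TB to GB
    return drive_count * drive_price / usable_gb


# Hypothetical shelf: twelve 16 TB drives at 400 (any currency) each.
for redundancy in (1, 2, 3):
    cost = cost_per_usable_gb(12, 16, 400, redundancy)
    print(f"redundancy {redundancy}: {cost:.4f} per usable GB")
```

The purchase price stays the same in every row; only the divisor shrinks, so each extra level of redundancy raises the effective price of the capacity that remains.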
In our Q3-2020 release, StorONE is advancing its already rapid rebuild technology to provide the fastest rebuilds in the industry. Thanks to our latest update, rebuilds of flash drives can occur in five or six minutes, and rebuilds of high-capacity hard disk drives can happen in four to five hours.
These fast rebuild times mean that customers may not need to set as high a level of drive redundancy, because the window in which a simultaneous drive failure can occur shrinks from many hours or days to minutes or a few hours. vRAID provides the option to set your redundancy level as high as you’d like (assuming you have enough drives) but makes such a high setting unnecessary. The net result is a lower cost per usable GB.
The High Cost of RAID: Hot Spares
Another deceptive aspect of RAID is its dependence on hot spares. These drives sit idle until there is a failure. When a drive fails, the RAID technology logically puts the hot spare in the place of the failed drive. All the other drives in the RAID group then use parity information to reconstruct the missing data onto that single drive. The first problem with hot spares is that the single drive creates a bottleneck. It can only receive so much data at one time, which is partly responsible for the long rebuild times mentioned above. The second problem is that once the hot spare is used, the customer must replace the failed drive with another drive as soon as possible. Because of the current pandemic, access to data centers isn’t as easy as it used to be, and we continue to see supply shortages.
As I explained in my blog, “Drive Density Kills RAID”, the third and more significant challenge of requiring hot spares is the increasing capacity of both flash and hard disk drives. With this increased density, each dedicated hot spare sacrifices a sizable amount of capacity. Most storage administrators will set up at least two hot spares in their environment. Many will even set aside three. Additionally, because standard RAID doesn’t support mixing drive sizes or types, you have to have a set of global hot spares per drive type. It is common for customers to have dozens to even hundreds of terabytes of capacity sitting idle in case a drive fails. This problem will get worse as drive capacities increase. Again, as with slow rebuilds, the requirement to use hot spares dramatically raises the cost per usable GB.
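To see how quickly idle capacity adds up, consider a small tally of hot spares per drive type. The environment below is hypothetical: the drive sizes come from the examples in this article, while the drive and spare counts are assumptions made for illustration.

```python
# Hypothetical environment: one set of global hot spares per drive type.
pools = [
    {"type": "15.8TB SSD", "drive_tb": 15.8, "drives": 24, "hot_spares": 2},
    {"type": "16TB HDD", "drive_tb": 16.0, "drives": 60, "hot_spares": 3},
    {"type": "12TB HDD", "drive_tb": 12.0, "drives": 36, "hot_spares": 2},
]


def idle_spare_tb(pools):
    """Total capacity held back in hot spares across all drive types."""
    return sum(p["drive_tb"] * p["hot_spares"] for p in pools)


def raw_tb(pools):
    """Total purchased capacity, counting data drives and hot spares."""
    return sum(p["drive_tb"] * (p["drives"] + p["hot_spares"]) for p in pools)


idle = idle_spare_tb(pools)
print(f"idle in hot spares: {idle:.1f} TB "
      f"({idle / raw_tb(pools):.1%} of purchased capacity)")
```

With these assumed counts, seven spare drives hold back a little over 100 TB, and the figure scales directly with drive size as denser drives arrive.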
As we discussed in our on-demand whiteboard session, vRAID doesn’t require hot spares. Instead, it uses all available drives all the time. When a drive fails, it redistributes the data from the failed drive across the remaining drives. Redistributing the data across multiple drives is partly responsible for vRAID’s rapid rebuilds. It also means that no drives sit idle waiting for a failure. Instead, their capacity is immediately usable, and your data is safer than ever. As a result, vRAID once again reduces the cost per usable GB while improving performance and resiliency.
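The difference between rebuilding onto one hot spare and redistributing across many drives can be approximated with a simple throughput model. This is a sketch under stated assumptions, not a description of how vRAID is implemented: it considers only write bandwidth at the rebuild target, ignores parity computation and foreground I/O, and uses hypothetical throughput figures.

```python
MB_PER_TB = 1_000_000  # decimal units
SECONDS_PER_HOUR = 3600


def rebuild_hours_single_spare(data_tb, spare_write_mb_s):
    """All reconstructed data funnels into one hot spare."""
    return data_tb * MB_PER_TB / spare_write_mb_s / SECONDS_PER_HOUR


def rebuild_hours_distributed(data_tb, surviving_drives, per_drive_write_mb_s):
    """Reconstructed data is spread evenly across the surviving drives."""
    aggregate_mb_s = surviving_drives * per_drive_write_mb_s
    return data_tb * MB_PER_TB / aggregate_mb_s / SECONDS_PER_HOUR


# Hypothetical: a full 16 TB drive fails in a 24-drive system.
single = rebuild_hours_single_spare(16, spare_write_mb_s=200)
# Each survivor is assumed to give only part of its bandwidth to the rebuild.
spread = rebuild_hours_distributed(16, surviving_drives=23, per_drive_write_mb_s=50)
print(f"single hot spare: {single:.1f} h, distributed: {spread:.1f} h")
```

Even when each surviving drive contributes only a fraction of its bandwidth, the aggregate write rate exceeds what a single spare can absorb, and it grows as drives are added.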
vRAID’s ability to mix drive sizes also lowers the cost of media failure protection. Our customers can use 15.8TB SSDs today, enjoy their full capacity, and get full performance, knowing that next year, when 32TB drives are commonly available, they can mix those drives into the same media pools and enjoy their full capacity as well. Customers using the old RAID technology will have to either format these drives down to 16TB or create new RAID groups and manually migrate data to them.
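The capacity penalty of mixing drive sizes in a traditional RAID group can be shown with simple arithmetic. The sketch below assumes a RAID group treats every member as if it were the size of the smallest drive, and compares that with a pool that uses each drive's full size. Both functions count raw capacity before any redundancy overhead, and the drive mix is hypothetical.

```python
def raid_group_raw_tb(drive_sizes_tb):
    """Raw capacity when every drive is truncated to the smallest member."""
    return min(drive_sizes_tb) * len(drive_sizes_tb)


def pooled_raw_tb(drive_sizes_tb):
    """Raw capacity when each drive contributes its full size."""
    return sum(drive_sizes_tb)


# Hypothetical mix: twelve existing 16 TB drives plus four new 32 TB drives.
drives = [16] * 12 + [32] * 4
truncated = raid_group_raw_tb(drives)
pooled = pooled_raw_tb(drives)
print(f"truncated: {truncated} TB, pooled: {pooled} TB, "
      f"stranded: {pooled - truncated} TB")
```

Under this assumption, half of every larger drive added to the older group is stranded, which is the capacity a mixed-size pool recovers.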
To learn more, sign up for our expert class on Primary Storage Data Protection. Each week’s technical training session includes exclusive written and video content on protecting production data.