Even though it is a second copy of data, backup storage needs better RAID than production storage systems. Today, most storage systems are at least front-ended with flash media, and many are all-flash. When a drive fails on a flash-based storage system, the rebuild process is not as fast as it should be, but it is not as egregious as the multi-day rebuild time of hard disk drives, which backup storage targets use almost exclusively.
Backup Storage Targets Need Hard Drives
The simple solution to the backup storage RAID problem is to create an all-flash backup target. The reality is that while flash media is more affordable than it was ten years ago, it is still not at the point where it is price competitive with high-density hard disk drives. 1PB of hard disk capacity is almost 10X less expensive than 1PB of flash media.
A backup storage target is not only going to use hard disk drives. It will use high-density hard disk drives which provide 16TB or more of capacity. They consume less data center footprint and less power per TB of capacity, so it is also more economical to use them. They are, however, the primary instigator of the backup storage RAID problem.
Legacy Code is the Source of the Backup Storage RAID Problem
Legacy RAID algorithms take multiple days to bring a RAID group back to a protected state. Legacy RAID also limits how many simultaneous drive failures the backup storage target can sustain. In most cases, today’s backup storage targets can only sustain two simultaneous drive failures and, in rare cases, three. The problem is that a backup storage target supporting a large environment may have hundreds if not thousands of drives. Two- or three-drive redundancy is too small a percentage of the total drives available to protect a volume effectively. This low drive redundancy limit forces IT to create and manage multiple backup volumes and manually route jobs to those volumes.
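To put the percentage argument in numbers, here is a minimal sketch. It assumes, as a simplification, that the whole volume is one pool of drives protected by a fixed number of redundant drives. The function name and the drive counts are hypothetical and chosen only for illustration.

```python
def redundancy_percent(redundant_drives: int, total_drives: int) -> float:
    """Percent of the drive pool that can fail at once without data loss.

    Simplifying assumption: the volume is treated as a single pool.
    """
    if total_drives <= 0:
        raise ValueError("total_drives must be positive")
    if not 0 <= redundant_drives <= total_drives:
        raise ValueError("redundant_drives must be between 0 and total_drives")
    return 100.0 * redundant_drives / total_drives


# Hypothetical pool sizes, each with two-drive redundancy.
for total in (12, 100, 1000):
    print(f"{total:>5} drives, 2 redundant: {redundancy_percent(2, total):.2f}%")
```

The fixed two or three drives of protection cover a shrinking share of the pool as the drive count grows, which is why large environments end up split into multiple volumes.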
The Impact of the Backup Storage RAID Problem on Backups
All of these challenges may not seem critical since backup storage is a second copy of data. Still, the backup storage RAID problem has repercussions throughout the data center. If a drive fails on a backup storage target, that system must start a rebuild process, one that may take three days or more to complete. While that rebuild is occurring, backup jobs are still executing. The rebuild runs in the background and significantly reduces the ingest performance of the backup storage target. As a result, backup jobs slow down, and in many cases, the backup software misses its backup windows. IT may need to limit the frequency of backups during this time and may even need to cancel backups of less critical servers.
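A rough back-of-the-envelope sketch shows why a rebuild of a high-density drive can run to days. It assumes the rebuild has to rewrite the full capacity of the drive at a sustained rate. The rates below are hypothetical, not measurements of any product; the point is only that a rebuild competing with backup ingest runs well below what a drive can deliver when idle.

```python
def rebuild_days(capacity_tb: float, rebuild_mb_per_s: float) -> float:
    """Days needed to rewrite a whole drive at a sustained rebuild rate."""
    if capacity_tb <= 0 or rebuild_mb_per_s <= 0:
        raise ValueError("capacity and rate must be positive")
    capacity_mb = capacity_tb * 1_000_000  # decimal terabytes to megabytes
    seconds = capacity_mb / rebuild_mb_per_s
    return seconds / 86_400  # seconds per day


# Hypothetical sustained rebuild rates in MB/s for a 16TB drive.
for rate in (200, 100, 60):
    print(f"16TB at {rate} MB/s: {rebuild_days(16, rate):.1f} days")
```

Under these assumptions, a rate of about 60 MB/s already puts a 16TB rebuild past the three-day mark.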
The chances of an additional drive failure increase during the rebuild because of the additional load placed on the drives. If another drive fails during the multi-day rebuild process, the rebuild process extends even longer and requires more resources since it is rebuilding multiple drives. If enough drives fail to surpass the drive redundancy setting (two is very common for backup storage), the storage system will experience a complete data loss. Remember, this is the backup copy. There is no backup of it for restoration. Even if the organization made a disaster recovery copy, that copy is off-site or in the cloud, and restoring the backup data from it is slow. The only alternative is to restart your backup jobs from scratch, sacrificing all prior backup history and data versioning.
The Impact of the Backup Storage RAID Problem on Recovery
During a multi-day RAID rebuild process, it is likely that there will be recovery requests. Those requests come in two forms: recovery over the network, which replaces data, or recovery in place (instant recovery). In either case, it is necessary to query the data on the backup storage device to locate the specific data requested. Response times to those requests may be much slower than usual. Some IT professionals report that their backup software is “unusable” while a backup storage target goes through a RAID rebuild.
RAID rebuilds will also negatively impact data recoveries that replace data over the network. First, if the backup software does deduplication, it must re-hydrate that data from a device working hard to recover from a failed drive. Second, after data re-hydration, the backup software needs to copy that data from the backup storage target, over the network, to the production storage system. IT can expect a 3X increase to standard recovery efforts.
A RAID rebuild also severely impacts performance in instant recovery situations where the backup software instantiates server or VM data on the backup storage target. First, instant recoveries suffer the same recovery performance impact as a regular restore, minus the network transfer. Second, the backup storage target also needs to provide production-class performance during the time it is hosting data. For the reasons we mention in our blog “Backup Storage Targets Need to Change”, traditional backup storage targets already struggle to provide that performance. Trying to provide production-class performance during a rebuild is nearly impossible. IT should not even bother attempting an instant recovery while the backup storage target is going through a RAID rebuild process.
Fixing the Backup Storage RAID Problem
StorONE’s S1:Backup integrates our new RAID technology, vRAID, which provides the fastest recovery from drive failure in the industry. The only limitation on the number of redundant drives is the number of drives in the system. As a result, you can set your drive redundancy ratio according to the number of drives in the system. vRAID also future-proofs your storage investment, protecting you from backup storage migrations. It enables you to add hard disk drives of higher densities as they come to market and mix them into the same volumes, yet still enjoy their full capacities.
This flexibility in volume configuration means that you can have a single backup volume and not create additional volumes as the capacity increases or technology advancements occur. The result is a common, consistent backup storage target that you can set and forget.
StorONE’s vRAID provides recovery from high-density hard disk drives (16TB or 18TB) in less than two hours, even if those drives are full of data. We’ve shown the speed of our vRAID rebuild performance on our vRAID page. If you want, contact us and we will provide a demonstration for you live.
During the rebuild process, StorONE’s vRAID does not impact the performance of backup or recovery jobs. This performance guarantee also includes instant recovery situations. S1:Backup includes a flash tier whose size is customer adjustable. All instant recoveries occur to that tier. Our Data-Centric volumes isolate storage IO so that RAID rebuilds can operate at full speed while the system is also serving up data to applications.
Finally, vRAID is also part of the reason we can provide such low upfront and long-term pricing. vRAID enables you to run at very high capacity utilization levels; 90% utilization is not uncommon for our backup storage target customers. It also requires no hot spares, so all those high-capacity HDDs are available all the time. When a drive fails, vRAID redistributes the data on the failed drive to surviving drives in the system. The speed at which we can rebuild data also lowers costs. Although you can set drive redundancy high, you may not need to, because you only need to protect against multiple simultaneous drive failures for two hours.
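The effect of a shorter exposure window can be sketched with a simple probability model. This is an illustration under stated assumptions, not a description of how vRAID works: drive failures are treated as independent, and the annual failure rate is treated as a constant hazard rate. The drive count and the 2% annual failure rate are hypothetical. Because rebuild load raises the chance of failure, the real risk over a multi-day rebuild is likely higher than this model suggests.

```python
import math

HOURS_PER_YEAR = 8760


def p_additional_failure(surviving_drives: int,
                         annual_failure_rate: float,
                         window_hours: float) -> float:
    """Chance that at least one more drive fails within the rebuild window.

    Assumes independent failures at a constant rate (exponential model).
    """
    expected_failures = (
        surviving_drives * annual_failure_rate * window_hours / HOURS_PER_YEAR
    )
    return 1.0 - math.exp(-expected_failures)


# Hypothetical system: 99 surviving drives, 2% annual failure rate.
three_days = p_additional_failure(99, 0.02, 72)
two_hours = p_additional_failure(99, 0.02, 2)
print(f"72-hour window: {three_days:.4%}")
print(f" 2-hour window: {two_hours:.4%}")
print(f"risk ratio: {three_days / two_hours:.0f}x")
```

Under this model the risk scales roughly with the length of the window, so cutting a rebuild from days to hours shrinks the period in which extra redundancy is needed.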
Learn More
To learn more, please register for our upcoming webinar below, “4 Reasons RAID is Breaking Backups and How to Fix Them.” During the webinar, we will review these challenges and demonstrate S1:Backup and our vRAID engine.
If you are a Veeam customer or reseller, please register for our virtual whiteboard session “Building a Better Veeam Backup Infrastructure”.
If you have an All-Flash Array, then please register for our virtual whiteboard session “How to Backup an All-Flash Array”.