As more and more of our A/V/L equipment becomes computerized—and thus is running from a hard drive—figuring out how long a drive will last is quickly becoming a big deal for technical leaders. A quick count in my tech booth tallied 20 hard drives (both SSD and spinning). Some of those are backups, and that doesn’t count the 3 others that we keep in another room as backups. So to say our operation relies on hard drives is an understatement.
But how long will they last?
That is the question. I’ve had a lot of experience with hard drives over the years. I’m sure I’ve owned, managed, bought, and replaced hundreds of drives. And for the life of me, I’ve not been able to come up with a consistent answer. Thankfully, there are companies that use not hundreds of hard drives but thousands. Tens of thousands, actually.
Backblaze is a company that provides online backup. They started five years ago and now deploy 75 petabytes of storage. They are quickly approaching 30,000 drives in service. They elected to install consumer-grade drives, not the server-grade, industrial strength ones. People told them they were crazy, but it’s worked out OK. They have a ton of data on hard drive failure rates.
Recently they wrote a blog post that details their current knowledge of failure rates. I recommend you go read the whole article because it’s quite interesting. But here’s the CliffsNotes version.
Drives last 6 years.
Well, that’s sort of true. They are extrapolating five years of data out to six years and arriving at the point where 50% of the drives have failed. That becomes the median lifespan. In other words, if their projection holds up (and we’ll have to wait a year or more to see), 50% of drives will fail before 6 years. Which also means 50% will continue to run. But wait…there’s more!
The Bathtub Curve.
Reliability engineers point to a curve called the Bathtub Curve. It combines three things: the early failure rate, a constant failure rate, and the wear-out failure rate. When the three are added together, the result looks a bit like a bathtub: high at the start, low in the middle, and high again at the end.
This curve mirrors what Backblaze finds as well. Indeed, I would say this is what I’ve tended to see in my much more limited experience. Backblaze finds that drives have three distinct failure rates. From their post:
- For the first 1.5 years, drives fail at 5.1% per year.
- For the next 1.5 years, drives fail LESS, at about 1.4% per year.
- After 3 years though, failure rates skyrocket to 11.8% per year.
The early rate is the “infant mortality” rate: drives that likely have some sort of manufacturing defect. However, if a drive survives the first year and a half, it tends to do quite well. After three years, however, parts start to wear out. Drives are mechanical devices, so bearings will wear, heads will wear, and even the magnetic properties will change. At that point, almost 12% of drives fail each year.
While that looks like a huge jump, keep in mind that after three years, more than 80% of the drives are still working. So that’s not too bad.
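If you want to check that survival number yourself, here is a rough Python sketch. It uses the three annual rates quoted above, and it assumes each rate simply compounds over the time it applies. That compounding model is my own simplification, not Backblaze's method, so treat the result as a sanity check rather than a reproduction of their projection.

```python
# Annual failure rates quoted from the Backblaze post.
# Each entry: (length of the phase in years, annual failure rate).
PHASES = [
    (1.5, 0.051),            # early "infant mortality" period
    (1.5, 0.014),            # steady period
    (float("inf"), 0.118),   # wear-out period
]


def surviving_fraction(years, phases=PHASES):
    """Estimate the fraction of drives still working after `years`.

    Assumption (mine, not Backblaze's): a rate r held for t years
    leaves (1 - r) ** t of the drives alive.
    """
    remaining = 1.0
    left = years
    for duration, rate in phases:
        span = min(left, duration)   # time actually spent in this phase
        remaining *= (1.0 - rate) ** span
        left -= span
        if left <= 0:
            break
    return remaining


if __name__ == "__main__":
    for y in (1.5, 3, 4):
        print(f"{y} years: {surviving_fraction(y):.1%} still running")
```

Under that assumption, roughly 90% of drives are still running at the three-year mark, which is consistent with the "more than 80%" figure.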
What does this mean for us?
I think it means what it always meant: we have to back up regularly. For every mission critical drive, we need to have a hot backup that can be swapped in quickly in case of failure. If it’s not a RAID copy, having a clone of the drive with the most current data possible will make it much easier to get up and running when the drive fails.
Also, I think having a policy of replacing mission critical drives on a regular basis is a good idea. I’ve personally settled on a 3-year replacement policy for most of my drives, and that’s just based on my experience. Interestingly, it seems to be mirrored by this data. I could probably stretch it out to 4 years because I have good backups, but drives are so inexpensive now, it seems to make sense to replace them.
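A replacement policy only works if you know how old each drive is. Here is a minimal sketch of how you might track that. The drive labels and install dates are made up for illustration, and the 3-year threshold is just the policy described above.

```python
from datetime import date

REPLACE_AFTER_YEARS = 3  # the replacement interval discussed above


def drives_due(inventory, today, max_years=REPLACE_AFTER_YEARS):
    """Return the labels of drives in service for at least `max_years`."""
    due = []
    for name, installed in inventory.items():
        age_years = (today - installed).days / 365.25
        if age_years >= max_years:
            due.append(name)
    return sorted(due)


# Hypothetical inventory: drive label -> date it went into service.
inventory = {
    "playback-main": date(2010, 9, 1),
    "playback-clone": date(2012, 6, 15),
    "lighting-console": date(2011, 1, 10),
}

if __name__ == "__main__":
    print(drives_due(inventory, date.today()))
```

Even a spreadsheet would do the same job; the point is to write down the install date when the drive goes into service.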
I would love to see a study like this with SSDs. My gut tells me we’ll be on a 3-year plan with those as well, but we’ll have to wait and see.
As with most things, planning is everything. Knowing our drives will fail makes it easy to justify backups as well as money in the budget for replacements. Remember, when it comes to drives, it’s not a question of “if,” but “when.”