Three years ago I installed a 1.5 TB WD Elements USB drive as an external backup for the “file server” in the Basement Laboratory. The log files show that the drive started spitting out “short reads” early in October, which means the rust has begun flaking off the platters.
Repeated fsck -fyv /dev/sda1 runs produce repeated failures at various spots, so it’s not in good condition:
e2fsck 1.41.14 (22-Dec-2010)
Backup-1.5TB contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Error reading block 97649088 (Attempt to read block from filesystem resulted in short read) while getting next inode from scan.  Ignore error? yes
... snippage ...
Pass 2: Checking directory structure
Error reading block 104039017 (Attempt to read block from filesystem resulted in short read) while reading directory block.  Ignore error? yes
Force rewrite? yes
Directory inode 26009985, block #26, offset 0: directory corrupted
Salvage? yes
... snippage ...
Pass 4: Checking reference counts
Inode 25903223 ref count is 41, should be 40.  Fix? yes
... snippage ...
Backup-1.5TB: ***** FILE SYSTEM WAS MODIFIED *****

   736471 inodes used (0.80%)
    10173 non-contiguous files (1.4%)
     9367 non-contiguous directories (1.3%)
          # of inodes with ind/dind/tind blocks: 119655/12234/0
142996292 blocks used (39.04%)
        0 bad blocks
        3 large files

   276772 regular files
   459614 directories
        0 character device files
        0 block device files
        0 fifos
 10377447 links
       76 symbolic links (72 fast symbolic links)
        0 sockets
 --------
 11113909 files
Given that rsnapshot lashes the daily backups together with extensive hard links, so that there’s only one copy of a given file version on the drive, I don’t know what 76 symbolic links might mean.
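If you want to see for yourself what the filesystem thinks it holds, the symlinks and the hard-link structure are easy enough to inspect; a minimal sketch, assuming the drive is mounted at /mnt/backup and uses the usual rsnapshot daily.N directories (the paths and file name are hypothetical):

# list every symlink in the snapshot tree, with its target
find /mnt/backup -type l -ls

# spot-check the hard linking: identical inode numbers across snapshots
# mean there is only one physical copy of that file on the drive
ls -li /mnt/backup/daily.0/somefile /mnt/backup/daily.1/somefile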
It’s been spinning up once a day, every day, for about 40 months; call it 1200 power cycles and you’ll be close. The usual runtime is about 10 minutes, giving the poor thing barely enough time to warm up.
One data point does not a curve make.
The warranty on new WD Element drives seems to be a year; I have no idea what it was slightly over three years ago, although I’m pretty sure it wasn’t more than three years…
The various desktop boxes around here get powered up once a day, too, but I tend to replace them every few years and have never had a hard drive failure; a few system boards have crapped out, though. The boxes acting as controllers for the 3D printers and the Sherline CNC mill have a much lower duty cycle.
Comments
My preference in disk drives is for WD Black drives. They still carry a 5 year warranty. They cost a bit more, but are worth it in my opinion.
At this point, I’m running whatever comes in the off-lease Dell Optiplexes, which already have a few years on them when they arrive. The drives definitely aren’t their weak point!
Haven’t scratch-built a PC in, oh, a decade or so…
My strategy is to use the newest, most reliable drive I can get for backups. If I need to get something off of the backup drive, something bad has already happened – either a finger check or a real drive failure. I can’t tolerate having the backup drive fail, too. When the backup drive saves my butt, it is worth every extra penny I paid for it. At the first sign of trouble with a backup drive, it stops being a backup device.
Wisely is it written that if you have only one copy of the data, you have none. If you have two copies, you have a chance.
More-or-less once a year I dump the file server to a randomly chosen hard drive from the Big Box o’ Drives, walk over to the safe deposit box, and swap it for the one that’s been hiding there for a year. So far, I’ve never used any of those off-site drives for any reason.
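For anyone copying the arrangement, the yearly dump amounts to mirroring the snapshot tree while preserving its hard links; a minimal sketch, assuming the server tree lives at /mnt/backup and the off-site drive mounts at /mnt/offsite (both paths are placeholders):

# -a keeps permissions / times / ownership, -H preserves the hard links
# so the rsnapshot tree doesn't balloon on the destination drive
rsync -aH --delete /mnt/backup/ /mnt/offsite/
sync && umount /mnt/offsite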
In the event of an actual emergency taking out the neighborhood, I’m doomed anyway. Losing all my data wouldn’t be the worst thing that could possibly happen…
Back in the days of the first 5″ HDDs we used to say that the drives in our product worked fine as long as you never turned them on. Spinning them up was like Russian roulette. Once going they were fine, though.
Turned out their designer did not fully appreciate how far out of the safe operating area the drive transistors had to go when the spindle motor was powered on. Not a Good Thing…
Sort of like flying: it’s the takeoffs and landings that get ya.
I’ve had a few failures, but have been lucky to catch them in time. Back in the early 90s, lost a 40M(!) hard drive to the stiction problem that was keeping Seagate (I think) drives from spinning up. In 2003 I lost a 2 year old 80G drive from my prime Linux box. At that time I was using it 1-8 hours a day. Not sure of the original brand, but the replacement WD drive was going strong with moderate usage until I stopped using the box last year. FWIW, the drives in the ’98 vintage P2 box are still strong, with continuous usage for the first couple of years (I was running SETI-at-home until California power got erratic).
Note to self: I must get more paranoid about making backups on the newer boxes.
Hmm, recalling the stiction problem, I recall having one drive replaced under warranty, and the replacement failing a few months later. These were the two-bay 5″ HDDs, and were replaced by two one-bay drives, doubling my capacity.
We had several minicomputers at work, and the ones with non-removable media tended to last a long time–they usually went obsolete before they died. Of course, the costs were huge, even though we bought the units from ourselves (HP). Not such a good record on the removable HDDs, but they got exposed to random crud at times. The computer rooms were only sort-of clean.
Re power surge, the first big HP desktop computer (9845) could brown out its circuit on the first couple of cycles. It was quick enough not to trip breakers, but you wanted it OFF if power died to the building and you had more than one on hand.
That dying external drive had the daily / monthly backups for the last year, so I felt kind of exposed. In truth, I desperately need a backup file maybe three times a year, instantly after an egregious finger fumble: not having them isn’t a major loss.
The 750 GB external hard drive I just set up has the current full backup (476 GB @ 20 MB/s = 6.6 hours), one daily backup, and I’m starting to feel better already. It’s 70% full and could probably hold a year’s worth of backup, but it has a noisy always-on fan that’s impossible to live with.
When we have a power blink, all the UPS units start chirping and clicking and warbling at once; takes about five minutes to walk around the house and console them during a real outage.
If the power fail is at night, it’s three parts 1) wake up due to lack of CPAP power, 2) stop the UPS, and 3) calm the border collie. She has firm opinions on what is permitted to wake us up…
Always check the SMART status: it’s the self-monitoring and testing firmware present on practically all disks available today. Because the SMART error counters are maintained by the drive firmware, they record errors as soon as they appear and get corrected, errors that stay hidden from the filesystem layer until they result in actual data loss.
smartctl and skdump are two Linux programs that read SMART data. The particular attributes worth paying attention to are the reallocated sector count and the current pending sector count: if they move off zero, it’s time to replace the disk.
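A minimal check along those lines, assuming the drive shows up as /dev/sda (use whatever device name your box assigns):

# dump the attribute table and pull out the two scary counters
smartctl -A /dev/sda | egrep -i 'Reallocated_Sector|Current_Pending'

# overall pass/fail verdict from the drive itself
smartctl -H /dev/sda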
SMART also has various short and long self-tests that can be triggered periodically if you’re so inclined. This is a good thing, because disks nowadays are so large that problems often go unnoticed when they start in rarely used areas of the disk.
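If you’d rather have the machine remember for you, smartd (the daemon in the smartmontools package) can run those tests on a schedule; a sketch of an /etc/smartd.conf entry, with the schedule made up for the example:

# monitor everything, mail root on trouble, run a short self-test
# every night at 02:00 and a long self-test early Saturday morning
/dev/sda -a -m root -s (S/../.././02|L/../../6/03)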
The status report now shows the drive has long since died from old age, which isn’t particularly surprising: it has 20240 power-on hours (2.3 years), 1281 power cycles (roughly what I estimated), and 3155 starts (three times what I estimated). I think the number of starts reflects the thing’s ability to spin down, perhaps while rsnapshot finds the next changed file.

Back when this mess started, the short test completed successfully and said the drive was healthy; I ran fsck to see what it could recover, rather than ask SMART what it knew, which turned into a two-day marathon. I’ll run a long test and see what happens.

The 750 GB drive never had SMART turned on; I’ve fixed that and the long test is running even as we type…
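For the record, turning SMART on and kicking off the long test is a one-liner apiece; a sketch, with /dev/sdb standing in for the 750 GB drive:

smartctl -s on /dev/sdb        # enable SMART data collection on the drive
smartctl -t long /dev/sdb      # start the long self-test (runs inside the drive)
smartctl -l selftest /dev/sdb  # read the self-test log after it finishes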
I think SMART functionality is always ON, in that you can always get data from ‘smartctl/skdump’. There’s often a SMART on/off setting in the BIOS which I am not sure about—I think it makes the system query SMART status on bootup and give a warning if the disk reports problems. Could you check this before you turn SMART on?
BTW, SMART should work across most USB-SATA controllers now, so you can test your external drives as well as the internal ones (flash drives, sadly, are not covered, only the drives that internally use SATA). This used to NOT work, because SMART ATA commands weren’t passed across the USB interface due to a combination of hardware and software limitations.
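On the USB bridges that do pass the commands through, telling smartctl to use SAT (SCSI-to-ATA translation) usually does the trick; a sketch, with /dev/sdc standing in for the external drive:

# -d sat routes the ATA SMART commands through the USB-SATA bridge
smartctl -d sat -a /dev/sdc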
I’d been testing it on the file server in the basement, but I have no idea what that BIOS does; it boots automagically at 6 am and turns itself off after the backup around midnight. I tested the dead drive on the new-to-me Optiplex 980, which has BIOS-level SMART checking turned on.
After fsck ran for two days straight on that drive, I spot-checked a few files and saw they were there, so I tucked the thing into the basement safe. Parts of the directory structure are chewed up and some files have damage, but in the unlikely event I need a deep backup of a current file, most likely it’ll be there… as long as I don’t screw around with the drive and wear off any more rust.

Better the backup drive fail than the main drive in the file server, though… even if the latter is nothing more than an ordinary desktop box on the Basement Laboratory floor.
My oldest drive in operation tells me it’s been powered on for 5.8 years and it’s properly at 0 for the things Przemek mentioned (and it’s fine on other variables too). I’m more inclined to think power cycles and start/stop counts are potential issues here, and possibly summer operating temperatures. Power cycles stands at 617 and start/stop at 1044.
Also, a 1.5TB drive from a few years ago probably has quite a number of platters in order to reach that capacity. More platters of course means more points of failure within a single drive.
If it were completely dead, I’d tear it apart and use its platters as wind chimes, but … maybe in a year or two.
We were just over the Cascades doing the Fall Costco run and a 1TB Seagate backup drive followed me home. I now remember to back up Quicken data to CD-ROM (only needed it once so far this year), but this computer is now doing a bunch more stuff than I expected. Between worship preparation and various church financial stuff, I’ve got a lot more data that I don’t want to lose. Silly me, I knew that I’d be working harder after retirement, but somehow I thought it might be simpler.
If the drive is halfway decent, I’ll pick up one or two more in the Spring Costco run (or the occasional Dreaded Winter run–ODOT uses a calcium salt on that highway, and it’s not fun stuff to get off paint).
This is a Windows box, so I get to decide between the MS version and the Seagate backup software. I want a full backup and incremental dailies, so whichever does that gets the nod. Neither of the Linux boxes has better than USB-1, and the barn/shop environment might be too tough, so I might stick with CD-ROM until a budget for better Linux hardware falls out of the sky.
My (limited) experience is that the usb-to-whatever controllers in external drives tend to fail before the drives themselves. Might be worth opening it up to see if the drive is still usable.
Ever considered one of the small NAS boxes? I got a ReadyNAS 102 recently; nice feature set, though I’m still deciding exactly how I’m going to use it.
This one still looks like a hard drive, albeit with errors, so the controller is going strong. I’ll eventually try prying the drive out and replacing it, but …
There’s one in the heap, but the transfer rate over even a 100 Mb/s network was so pokey that I gave up on it: more than a day for a full-disk backup.