FRESH

Hacker News

Home

Is DWPD Still a Useful SSD Spec?

44 points by zdw

by mdtancsa

1 subcomments

dropping off the bus is the best case fail really. Its more annoying when writes become slower than the other disks often causing confusing performance profiles of the overall array. Having good metrics for each disk (we use telegraf) will help flag it early. On my zfs pools, monitoring disk io for each disk, smartmon metrics help tease that out. For SSDs probably the worst is when there is some firmware bug that triggers on all disks at the same time. e.g. the infamous HP SSD Failure at 32,768 Hours of Use. Yikes!

by Havoc

1 subcomments

After getting burned by consumer drives I decided it’s time for a zfs array from used enterprise ssds. Tons of writes on them but full mirrored config and zfs is easier to backup so should be ok. And the really noisy stuff like logging im just sticking into optanes - those are 6+ dwpd depending on model which may as well be unlimited for personal use scenarios

by igtztorrero

3 subcomments

The most common catastrophic failure you’ll see in SSDs: the entire drive simply drops off the bus as though it were no longer there.
Happened to me last week.
I just put it in a plastic bag into the freezer during 15 minutes, and works.
I made a copy to my laptop and then install a new server.
But not always works like charms.
Please always have a backup for documents, and a recent snapshot for critical systems.

by markhahn

1 subcomments

Text is wrong about CRCs: everyone uses pretty heavy ECC, so it's not just a re-read. This also provides a somewhat graduated measure of the block's actual health, so the housekeeping firmware can decide whether to stop using the block (ie, move the content elsewhere).
I'm also not a fan of buy bigger storage concept, or the conspiracy-theory on 480 v 512.
It sure would be nice if when considering a product, you could just look at some claimed stats from the vendor about time-related degradation, firmware sparing policy, etc. we shouldn't have to guess!

by 0manrho

0 subcomment

> Controller failure: by far the most common catastrophic failure you’ll see in SSDs
In consumer drives. Often not even a hardware failure, but a firmware one, but to most consumers, this is splitting hairs as the drive is still "Dead" as the common ingress points to fix this are not present/disabled on consumer class drives (thus the blurb at the end of that section about physically swapping controllers). Also, cell failure is far more prevalent than controller failure in instances where the drives lack a DRAM/SLC cache (aka transition flash) layer. Controllers still fail, even at the hardware level, for enterprise and consumers alike though, it's a prevalant issue (pro tip, monitor and rectify the thermals and the prevalence of this problem drops significantly)
> Failure to retain charge: typically, only seen in SSDs, thumb drives, and similar devices left unpowered for long periods of time.
Also happens to flash that see lots of writes, power cycles, or frequent significant temperature fluctuations. This is more common on portable media (thumb drives) or mobile devices (phones, laptops, especially thin ones)
> Now, let’s take a look at the DC600M Series 2.5” SATA Enterprise SSD datasheet for one of my favorite enterprise-grade drives: Kingston’s DC600M.
Strange choice of drive but okay, especially considering they don't talk about any of it's features that actually make it an enterprise version as opposed to their consumer alternatives: Power loss protection, Transition flash/DRAM cache, controller and diagnostics options, etc etc.
> Although Kingston’s DC600M is 3D TLC like Samsung’s EVO (and newer “Pro”) models, it offers nearly double the endurance of Samsung’s older MLC drives, let alone the cheaper TLC! What gives?
For starters the power regulation and delivery circuitry on entrprise grade drives tends to be more robust (usually, even on a low-end drive like the DC600M), so that those writes that wear the cells are much less likely to actually cause wear due to out-of-spec voltage/amps. Their flash topology, channels, bitwidths, redundancy (for wear levelling/error correction) etc etc are also typically significantly improved. all of these things are FAR more important than the TLC/SLC/MLC discussion they dive into. None of these things are a given just because someone brands it an "Enterprise drive" but these are things that enterprises are concerned with where consumers typically don't often have workloads where such considerations really make a meaningful difference and they can just use either DWPD or brute force by vastly overbuying capacity to evaluate what works for them.
> One might, for example, very confidently expect 20GB per day to be written to a LOG vdev in a pool with synchronous NFS exports, and therefore spec a tiny 128GB consumer SSD rated for 0.3 DWPD... On the surface, this seems more than fine:
Perhaps, but let me stop you right there as the math that follows is irrelevant for the context presented. You should be asking what kind of DRAM/Transition flash (typically SLC if not DRAM) is present in the drive and how the controller handles it (also if it has PLP) before you ever consider DWPD. If your (S)LOG's payloads fit within the controllers cache size, and that's it's only meaningful workload then 0.3DWPD is totally fine as the actual NAND cells that comprise the available capacity will experience much less wear than if there were no cache present on the drive.
Furthermore, regardless of specific application, if your burstable payloads exceed whatever cache layer your drive can handle, you're going to see much more immediate performance degradation entirely independent of wear on any of your components. This is one area that significantly separates consumer flash with enterprise flash, not QLC/TLC/MLC or how many 3d stacks of it there are. That stuff IS relevant, but it's equally relevant in enterprise and consumer, and is first and foremost a function of cost and capacity than endurance, performance, or anything else.
This is an example of how DWPD is a generic that can be broadly used, but when you get into the specifics of use, can kinda fall on it's face.
Thermals are also very important to endurance/wear and performance both, and often goes overlooked/misunderstood.
DWPD is not as important as it once was when flash was expensive, drive capacity limited, and their was significantly more overhead in scaling them up (to vastly oversimplify, a lot less PCIe lanes available), but it's still a valuable metric. And like any individual metric, in isolation it can only tell you so much, and different folks/context will have different constraints and needs.
Note, kudos for them bringing it up that not all DWPD is equal. Some report DWPD endurance over 3 years instead of 5 to artificially inflate their DWPD metric, something to be aware of.
TL;DR: DWPD, IOPs, Capacity and price are all perfectly valid ways to evaluate flash drives, especially in the consumer space. As your concerns get more specific/demanding/"enterprise", they come with more and more caveats/nuance, but that's true of any metric for any device tbh.

by justsomehnguy

0 subcomment

> Is DWPD Still a Useful SSD Spec?
Yes.
You need years from that SSD? Buy a drive with DWPD > 3.
You are a cheap ass and have the money only for a DWPD 0.3 drive? Replace it every year.
You are not sure what your usage would be? Over-provision by buying a bigger drive than you need.
And while we are at it: no, leaving >= 25% of the drive empty for the drives > 480GB is just idiotic. Either buy a bigger drive or use a common sense - even 10% of a 480GB drive is 48Gb already, for a 2048GB drive it's 204GB.

by mgerdts

1 subcomments

This article misses several important points.
- Consumer drives like Samsung 980 Pro and WD SN 850 Black use TLC as SLC when about 30+% of the drive is erased. At this time you a burst write a bit less than 10% of the drive capacity at 5 GB/s. After that, it slows remarkably. If the filesystem doesn’t automatically trim free space, the drive will eventually be stuck in slow mode all the time.
- Write amplification factor (WAF) is not discussed. Random small writes and partial block deletions will trigger garbage collection, which ends up rewriting data to reclaim freed space in a NAND block.
- A drive with a lot of erased blocks can endure more TBW than one that has all user blocks with data. This is because garbage collection can be more efficient. Again, enable TRIM on your fs.
- Overprovisioning can be used to increase a drive’s TBW. If before you write to your 0.3 DWPD 1024 GB drive, you partition it so you use only 960 GB, you now have a 1 DWPD drive.
- per the NVMe spec there are indicators of drive health in the SMART log page.
- Almost all current datacenter or enterprise drives support an OCP SMART log page. This allows you to observe things like the write amplification factor (WAF), rereads due to ECC errors, etc.