It seems pointless to issue flush commands when writing to an NVMe drive with a direct IO implementation that functions properly. The NVMe spec says:
> 6.8 Flush command
> …
> If a volatile write cache is not present or not enabled, then Flush commands shall complete successfully and have no effect.
And:
> 5.21.1.6 Volatile Write Cache
> …
> Note: If the controller is able to guarantee that data present in a write cache is written to non-volatile media on loss of power, then that write cache is considered non-volatile and this feature does not apply to that write cache.
If you know your application will only ever run against enterprise SSDs with power loss protection, then sending flush commands to the drive itself would indeed be pointless no-ops. But if it's a flush command that has effects somewhere between the application layer and the NVMe drive (e.g. if you're not using direct IO), or if there's any possibility of the code being run on a consumer SSD (e.g. a developer's laptop), then the flush commands are probably worth including; the performance hit on enterprise drives will be very small.
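For what it's worth, on Linux you can get both behaviours from the same code path: write with O_DIRECT and then call fdatasync(). As far as I understand it, the kernel only turns that into an NVMe Flush when the device advertises a volatile write cache, so on a PLP drive the call is close to free while on a consumer drive it still buys you durability. A minimal sketch, with the file path and sizes as placeholders:

```c
/* Minimal sketch (Linux): direct IO write followed by an explicit flush.
 * The buffer, offset, and length must satisfy the device's alignment rules. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 4096;                 /* multiple of the logical block size */
    void *buf;
    if (posix_memalign(&buf, 4096, len))     /* O_DIRECT needs an aligned buffer */
        return 1;
    memset(buf, 0xAB, len);

    int fd = open("testfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    if (pwrite(fd, buf, len, 0) != (ssize_t)len) { perror("pwrite"); return 1; }

    /* Durability barrier: issues an NVMe Flush only if the drive reports a
     * volatile write cache; on PLP drives it should return almost immediately. */
    if (fdatasync(fd) != 0) { perror("fdatasync"); return 1; }

    close(fd);
    free(buf);
    return 0;
}
```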
IOCTLs can tell you if write caching is enabled or not. Can they reliably tell you whether the write cache is volatile, though? Many drives with PLP still report volatile write caches, or at least did when I was testing this a few years back.
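To make it concrete, here is roughly what those ioctls give you on Linux via the NVMe admin passthrough interface: the Identify Controller VWC byte says whether a volatile write cache is present, and Get Features with FID 0x06 says whether it is enabled. The device path is a placeholder, and neither answer tells you whether the cache is protected by PLP, which is exactly the gap I was complaining about.

```c
/* Minimal sketch (Linux): query the drive's own claims about its write cache. */
#include <fcntl.h>
#include <linux/nvme_ioctl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/nvme0", O_RDONLY);   /* placeholder device */
    if (fd < 0) { perror("open"); return 1; }

    /* Identify Controller (opcode 0x06, CNS=1): byte 525 bit 0 = volatile write cache present. */
    uint8_t id[4096] = {0};
    struct nvme_admin_cmd identify = {
        .opcode   = 0x06,
        .addr     = (uint64_t)(uintptr_t)id,
        .data_len = sizeof(id),
        .cdw10    = 1,
    };
    if (ioctl(fd, NVME_IOCTL_ADMIN_CMD, &identify) < 0) { perror("identify"); return 1; }
    printf("volatile write cache present: %s\n", (id[525] & 1) ? "yes" : "no");

    /* Get Features, FID 0x06 (Volatile Write Cache): completion dword 0 bit 0 = cache enabled.
     * The spec allows this to fail if no volatile write cache is present. */
    struct nvme_admin_cmd getfeat = {
        .opcode = 0x0A,
        .cdw10  = 0x06,
    };
    if (ioctl(fd, NVME_IOCTL_ADMIN_CMD, &getfeat) == 0)
        printf("volatile write cache enabled: %s\n", (getfeat.result & 1) ? "yes" : "no");
    else
        perror("get features");

    close(fd);
    return 0;
}
```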
What SSDs are reasonably performant without a volatile write cache? The spec sections you quote spell out exactly why it is necessary to issue flushes!
Per the definition of volatile write cache in the standard I quoted, pretty much any TLC drive in the hyperscaler, datacenter, or enterprise product lineup will have great write performance. They have a DRAM cache that is battery-backed, and as such is not a volatile write cache.
A specific somewhat dated example: the Samsung 980 Pro (consumer client), PM9A1 (OEM client), and PM9A3 (datacenter) are very similar drives that have the same PCI ID and are all available as M.2. PM9A3 drives have power loss protection and the others don't. The PM9A3 has very consistent write latency (on the order of 20-50 μs when not exceptionally busy) and very consistent throughput (up to 1.5 GB/s) regardless of how full it is. The same cannot be said of the client drives, which lack PLP but use tricks like TurboWrite (aka pseudo-SLC): when more than 30% of the NAND is erased, the client drives can take writes at 5 GB/s, but that rate falls off a cliff and gets wobbly once the pseudo-SLC cache fills.
Thanks! Yes, as the sibling noted, if you limit this to PLP drives it makes sense, but that is also a special case. Aside from the latency hit (which is significant in some cases), FLUSH is also nearly free on those drives, though.
Did you check that the drives actually honor the flush? Half of drives tested lose FLUSH'd data on power loss.
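For anyone who wants to run that kind of test themselves: the usual approach is a writer that appends a sequence number, flushes, and only then reports the number as durable (to the console or another machine), while you pull the plug mid-run; afterwards you verify that every reported number actually made it to disk. A rough sketch of the writer half, with a made-up filename and record format:

```c
/* Minimal sketch of the writer half of a pull-the-plug flush test. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("flush_test.dat", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    for (uint64_t seq = 0; ; seq++) {            /* runs until power is cut */
        char rec[64];
        int n = snprintf(rec, sizeof(rec), "%llu\n", (unsigned long long)seq);
        if (write(fd, rec, n) != n) { perror("write"); return 1; }
        if (fdatasync(fd) != 0)     { perror("fdatasync"); return 1; }
        /* Only after the flush completes do we claim the record is durable. */
        printf("durable: %llu\n", (unsigned long long)seq);
        fflush(stdout);
    }
}
```

The verification pass after reboot just re-reads the file and checks that the last reported sequence number (and everything before it) is present.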
Those are not contradictory
Very appropriate topic, after yesterday's "High-Performance DBMSs with io_uring: When and How to use it": https://news.ycombinator.com/item?id=46517319 https://arxiv.org/abs/2512.04859