News Flash at The Next Platform: A 35 Petabyte All-Flash Balancing Act, about Perlmutter, an HPC distributed computer that features all-flash storage in the three-dozen-petabyte range.
With 35 petabytes, the system will be the largest all-flash storage system we’ve seen to date but scale is only one part of the story.
Okay. So let’s do this. Let’s plan a 35 PB all-flash storage system and see where this leads us.
36,000 TB of raw flash is 2250 drives at 16 TB each. Assuming you do not want to put more than 20 into a single EPYC storage box, you end up with 120-ish machines, drawing around 60 kW of power.
2250 Micron drives are around 10-ish million USD, maybe 7 million if you buy them all at once.
120 EPYC storage boxes set you back another 3 million or so.
So you end up somewhere around 10-13 million USD for the entire 35 PB NVMe storage cluster, provided you have the rack space and network to place it.
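Redoing that sizing as a quick script. The per-drive price, per-box price, and per-box power draw are my own round-number assumptions (chosen to land on the figures above), not vendor quotes:

```python
import math

# Back-of-envelope sizing for the 35 PB all-flash cluster.
DRIVE_TB = 16
RAW_TB = 36_000                    # 36,000 TB raw, a bit over 35 PB
DRIVES_PER_BOX = 20

drives = RAW_TB // DRIVE_TB                 # 2250 drives
boxes = math.ceil(drives / DRIVES_PER_BOX)  # 113, call it 120-ish with spares

# Assumed round numbers, not quotes: ~500 W per loaded box,
# ~4,400 USD per 16 TB drive, ~25,000 USD per EPYC box.
power_kw = boxes * 500 / 1000
drive_cost_musd = drives * 4_400 / 1e6      # ~10 million USD
box_cost_musd = boxes * 25_000 / 1e6        # ~3 million USD

print(drives, boxes, power_kw, drive_cost_musd + box_cost_musd)
```

At ~120 boxes instead of the bare-minimum 113, the power draw lands at the roughly 60 kW mentioned above.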
DWPD and Network
At 16,000,000 MB per device and 86,400 seconds in a day, you need to write 185 MB/s, sustained, to each drive for one drive write per day (DWPD). Across the 20 drives of a box, that is 3700 MB/s, or 30-ish Gbit/s per box, to achieve one DWPD.
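That per-drive and per-box arithmetic as a sketch:

```python
# One drive write per day (DWPD) expressed as sustained bandwidth.
DRIVE_MB = 16_000_000         # 16 TB in (decimal) megabytes
DAY_S = 86_400

per_drive_mb_s = DRIVE_MB / DAY_S         # ~185 MB/s per drive
per_box_mb_s = per_drive_mb_s * 20        # ~3700 MB/s per 20-drive box
per_box_gbit_s = per_box_mb_s * 8 / 1000  # ~30 Gbit/s per box

print(round(per_drive_mb_s), round(per_box_mb_s), round(per_box_gbit_s))
```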
The maximum is dictated by the network: 20 drives of 16 TB each are 320 TB per box. With a single 100 Gbit/s card at full blast you are not doing more than 12.5 GB/s, so a full load front to back takes at least 25,600 s. That’s around 1/3 of a day (7.1 h).
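The fill-time arithmetic, as a sketch:

```python
# Time to fill one box through a single 100 Gbit/s NIC at line rate.
BOX_TB = 20 * 16               # 320 TB per box
NIC_GB_S = 100 / 8             # 12.5 GB/s

fill_s = BOX_TB * 1000 / NIC_GB_S  # 25,600 s front to back
fill_h = fill_s / 3600             # ~7.1 hours, about a third of a day
print(fill_s, round(fill_h, 1))
```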
The total data stream across all drives for one DWPD is 416,250 MB/s, around half a terabyte per second. So you run this at 2 DWPD to ingest the better part of a TByte/s (around 6.7 Terabit/s).
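Summing it up, using the rounded 185 MB/s per-drive figure (note that doubling it lands closer to 6.7 Tbit/s than to a full 8):

```python
# Aggregate ingest across all 2250 drives.
per_drive_mb_s = 185                      # rounded per-drive DWPD rate
drives = 2250

one_dwpd_mb_s = per_drive_mb_s * drives   # 416,250 MB/s, ~0.4 TB/s
two_dwpd_tbit_s = one_dwpd_mb_s * 2 * 8 / 1e6  # ~6.7 Tbit/s at 2 DWPD
print(one_dwpd_mb_s, round(two_dwpd_tbit_s, 1))
```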
The box linked above has a Mellanox ConnectX-5, which will give you 100 Gbit/s per box and port. You build a leaf-and-spine, but with Arista 7060, 7050CX or Juniper 5200 as ToR, and 100/400 Gbit/s instead of 10/40.
Also, you don’t build storage racks, but put one or two of the 120 storage boxes into the top of each of your compute racks. This will smear the storage workload across the entire leaf-and-spine, and also contain the blast radius of a single rack loss.
You could add more Mellanox ports per box to improve networking, but that may cost you PCIe lanes that you are using for NVMe. EPYC gives you 128 PCIe lanes, and assuming you run full blast at all times, they would need to be split evenly between NVMe and network. That would be a very unusual workload outside of cryptobro circles, though.
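To see how tight the lane budget actually is, a rough check under assumed (but typical) link widths that are not in the text, namely x4 per NVMe drive and x16 per 100 Gbit/s NIC:

```python
# Rough PCIe lane budget for one 20-drive storage box.
# Assumptions (typical, not from the text): x4 lanes per NVMe drive,
# x16 lanes per 100 Gbit/s NIC, 128 lanes on a single-socket EPYC.
LANES_TOTAL = 128
DRIVES, LANES_PER_DRIVE = 20, 4
NICS, LANES_PER_NIC = 1, 16

used = DRIVES * LANES_PER_DRIVE + NICS * LANES_PER_NIC  # 96 lanes
spare = LANES_TOTAL - used                              # 32 lanes left
print(used, spare)   # spare covers at most two more x16 NICs
```

Under these assumptions a single-NIC box fits comfortably, and there is lane room for a second or even third x16 NIC before the drives and the network start fighting over the budget.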