So after testing LVM Raid in princple, I have been trying it on some real hardware to see what happens. The idea was to estimate if it scales and if not, how it doesn’t. I was expecting to run into all kinds of obscure problems in my testing, but in fact, it was a quick and short death.
The setup was very straight forward:
# pvcreate /dev/nvme*n1 # vgcreate vg00 /dev/nvme*n1 # lvcreate -n mysqlvol -L60T vg00 # lvconvert --type raid1 -m1 /dev/vg00/mysqlVol
The objective was to ensure that an existing non-raided volume could be converted into a RAID after the fact. My installation is half-sized. The actual production version of this would be on a machine with 24 of these drives, 12 of them currently filled by a 90T database, and I just used a spare box to test the basic procedure.
So, this worked:
# lvs -a -o+raid_min_recovery_rate,raid_max_recovery_rate,raid_mismatch_count,raid_sync_action LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert MinSync MaxSync Mismatches SyncAction audit sysvm -wi-ao---- 1.95g log sysvm -wi-ao---- 10.00g root sysvm -wi-ao---- 9.77g swap sysvm -wi-ao---- 1000.00m var sysvm -wi-ao---- 35.00g mysqlVol vg00 rwi-a-r--- 60.00t 0.51 0 recover [mysqlVol_rimage_0] vg00 iwi-aor--- 60.00t [mysqlVol_rimage_1] vg00 Iwi-aor--- 60.00t [mysqlVol_rmeta_0] vg00 ewi-aor--- 4.00m [mysqlVol_rmeta_1] vg00 ewi-aor--- 4.00m
For very slow and single threaded values of work:
iostat -x -k 10
As you can see, nmve0n1 is being read at around 240 MB/s, and being written at the same speed to nvme6n1. The rimage_0 is dm-8, you can see the reading, and the writing goes to dm-10, the rimage_1.
At this rate, I will get a RAID sync in 3.5 days or so (60*1024/0.22 = 280k seconds).
The expected behavior is to see a request queue deep enough to get the full transfer rate from a single NVME device (3.5 GB/s according to the spec sheet, anything above 2.5 GB/s sustained is good enough), and on all 6 device pairs in parallel, for a total of anything above 6x 2.5GB/s = 15 GB/s for an unconstrained RAID-1 sync. That is what the hardware can do here. In this case, the sync would complete in 4096 seconds, a bit under 1.5h.
So basically this died in the crib, because it does not leverage any parallelism the CPU, the multitude of drives, the deep queues of the NVMEs or anything else would have to offer.
It’s from the past.
LVM is in need of serious overhaul in order to become a tool for the 2020’s. Outside of toy workloads, snapshots don’t work, lvmraid also does not work.
A single NVME drive (we have 2x 6 of them) like the one I used is a 4xPCIe 3.0 device, so we get a transfer rate on the bus of 32 GByte/s. The internal structure of the flash limits the performance here, and according to the technical specifications of the device we get 800k-ish IOPS (let’s say it’s a million) and 3.5 GByte/s from it.
We also can measure the device and see that it does have a read latency of around 100-120 microseconds, and a buffered write latency (to the internal RAM of the device, assume it has 512 MB or so) of 50 microseconds. If you write a lot, writes become unbuffered at, say, 420 micros.
In order to get 1 million IOPS single threaded at a queue depth of 1, you would need a latency of 1 microseconds. You don’t get that, so in order to get all IOPS, you need to have deep queues (and async IO) or many threads.
In order to get 3.5 GB/s (say, 4 GB/s) at 4 KB block size, you need 1 million blocks per second (920k for 3.5 GB/s). Again, you need deep queues or many thread to get that.
Today, I monitored the still syncing array and it has been progressing at around 250 MB/s equalling around 1.1% per hours over night, and is now at 20-something percent synced.
# lvchange -maxrecoveryspeed=1000k /dev/vg00/mysqlVol
This works, and throttles the sync to just below 1000k per second. Conversely, setting a recovery speed of 0k falls back to the default 250 MB/s.
But: Setting it to 1000000k ups the speed to around 550 MB/s, which is the single threaded native speed of the array. I see around 4300 requests/s at a size of 256 KB at the NVME level, and around 8600 requests/s at a size of 128 KB one level higher in the dm-layer. There is also some kind of request merging going on somewhere in the stack, which for adjacent requests even makes sense.
If you try to split the Volume while it is still synching, you will find that this is not possible.
# lvconvert --splitmirrors 1 -n splitlv /dev/vg00/mysqlVol Unable to split vg00/mysqlVol while it is not in-sync.
Another observationL While lvmraid employs mdraid code, working with devicemapper block devices for the data and external metadata, mdraid does not see any of this in /proc;
# cat /proc/mdstat Personalities : [raid6] [raid5] [raid4] [raid1] unused devices: <none>
You can’t use any mdraid tooling to turn knobs inside lvm controlled mdraid code.