Revisiting the file server

The new disks in the file server had synchronized nicely, and that resulted in an interesting graph:
Sectors on the outer part of a hard disk are transferred faster than inner sectors. You can see how the disk speed halves between the outermost and the innermost part.
While watching, I decided on a whim that I wanted to convert the entire setup from using Linux mdraid to dmraid, the LVM2 implementation of RAID1.
It is essentially the same code, but integrated into LVM2 instead of using mdadm for control.
I had already experimented with dmraid before, as documented in an earlier article, and that was not without problems.
But since the array would contain no original data, only backups, I decided to give it a try.
So here we go:
Destroying the mdraid
I had created a RAID-1 pair of the disks /dev/sdb2 and /dev/sdc2 under the name /dev/md126, then added /dev/md126 to the hdd volume group. In order to get the disks back, I had to undo this.
So we need to check if the /dev/md126 PV is empty:
# pvdisplay --map /dev/md126
--- Physical volume ---
PV Name /dev/md126
VG Name hdd
PV Size 9.09 TiB / not usable 1022.98 MiB
Allocatable yes
PE Size 1.00 GiB
Total PE 9303
Free PE 9303
Allocated PE 0
PV UUID ...
--- Physical Segments ---
Physical extent 1 to 9302:
FREE
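The same check can be scripted. What follows is a minimal sketch, not something from the original setup: the function name pv_is_empty is made up for illustration, and it only parses pvdisplay-style text on stdin, so it can be tried on the output above without touching a real PV.

```shell
# Hypothetical helper: succeed only if pvdisplay-style output on stdin
# reports zero allocated extents.
pv_is_empty() {
    awk '/Allocated PE/ { found = 1; if ($3 != 0) bad = 1 }
         END { if (found && !bad) exit 0; exit 1 }'
}

# Sample lines taken from the pvdisplay output shown above.
sample='PV Name /dev/md126
Total PE 9303
Free PE 9303
Allocated PE 0'

if printf '%s\n' "$sample" | pv_is_empty; then
    echo "PV is empty"
else
    echo "PV still holds extents"
fi
# On a live system (as root): pvdisplay /dev/md126 | pv_is_empty
```

Parsing text on stdin keeps the check usable as a guard in front of vgreduce without hard-coding a device name.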
That’s fine. We can remove the pair from the volume group again, remove the LVM2 label, and then stop and destroy the raid:
# vgreduce hdd /dev/md126
...
# pvremove /dev/md126
...
# mdadm --stop /dev/md126
...
# mdadm --remove /dev/md126
...
# mdadm --zero-superblock /dev/sdb2 /dev/sdc2
That did not work: an Ubuntu component, os-prober, took possession of the devices after removing the RAID-1. I had to actually uninstall the component and remove the osprober devices from devicemapper before I could continue:
# ### After uninstalling os-prober:
# dmsetup ls
...
# dmsetup remove osprober-linux-sdb2
...
# dmsetup remove osprober-linux-sdc2
...
# mdadm --zero-superblock /dev/sdb2 /dev/sdc2
...
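With more than two leftover mappings, the removal can be looped. This is a hedged sketch: the helper name osprober_maps is hypothetical, the device numbers in the sample are invented, and the loop only prints the dmsetup commands instead of running them.

```shell
# Hypothetical helper: print the names of leftover os-prober mappings
# from `dmsetup ls`-style output on stdin (name in the first column).
osprober_maps() {
    awk '$1 ~ /^osprober-/ { print $1 }'
}

# Sample input; the (major:minor) numbers are made up.
sample='hdd-backup (253:39)
osprober-linux-sdb2 (253:50)
osprober-linux-sdc2 (253:51)'

# Dry run: print the commands instead of executing them.
printf '%s\n' "$sample" | osprober_maps | while read -r name; do
    echo "dmsetup remove $name"
done
# On a live system, feed `dmsetup ls` into osprober_maps and drop the echo.
```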
Preparing the disks individually for LVM2
Only then could I continue, add the disks to LVM2 and extend the hdd volume group:
# pvcreate /dev/sd{b,c}2
...
# vgextend hdd /dev/sd{b,c}2
...
We now have a very weird, asymmetric VG in which there is a single 9.09 TiB raid PV and two 9.09 TiB unraided PVs. Our next objective is to evacuate the raided PV, and then destroy this as well, adding the disks back unraided. This will result in a 36 TiB total VG.
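As a sanity check on the 36 TiB figure: pvdisplay reports 9303 usable extents of 1 GiB per PV, and the final VG holds four such PVs. A small sketch of the arithmetic, with a made-up function name:

```shell
# vg_size_tib PV_COUNT EXTENTS_PER_PV EXTENT_SIZE_GIB
# Prints the usable VG size in TiB, rounded to two decimals.
vg_size_tib() {
    awk -v n="$1" -v pe="$2" -v s="$3" \
        'BEGIN { printf "%.2f\n", n * pe * s / 1024 }'
}

vg_size_tib 4 9303 1
# prints: 36.34
```

The result is slightly below 4 x 9.09 TiB because each PV loses the part that pvdisplay lists as "not usable".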
Evacuating /dev/md127 to unraid it
This is going to take a long time. We need to do this in a tmux session:
# pvmove /dev/md127 /dev/sdb2
...
This will, over the course of about one day, move all physical extents from /dev/md127 to /dev/sdb2.
This exposes us to disk failure, as for the moment the data on /dev/sdb2 is unraided.
Restoring redundancy
There are two kinds of backup on this disk pack:
A number of Apple Time Machine targets and an Acronis Windows target, tm_* and win_*, and an internal backup that is being produced by a cron job, rsyncing data from the internal SSDs to the disks, /backup.
I decided to destroy the /backup LV and recreate it as a raid10 across all 4 disks.
This was relatively easy, and worked immediately – it just took a few hours for the backup job to run from scratch.
# umount /backup
# lvremove /dev/hdd/backup
...
# lvcreate --type raid10 -i2 -m1 -n backup -L4T hdd
...
# mkfs -t xfs -f /dev/hdd/backup
# mount /backup
# /root/bin/make-backup
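To see why -i2 -m1 ends up on all four disks: two stripes, each mirrored once, give four images, and each image holds half of the 4T. The sketch below works through that arithmetic. The function name is hypothetical, and it assumes one metadata extent (1 GiB with this VG's PE size) per image, which is what the lvs listing in this article shows.

```shell
# raid10_layout SIZE_GIB STRIPES MIRRORS PE_GIB
# Prints image count, per-image size and raw space used, in GiB.
raid10_layout() {
    size=$1 stripes=$2 mirrors=$3 pe=$4
    images=$(( stripes * (mirrors + 1) ))
    per_image=$(( size / stripes ))
    raw=$(( images * (per_image + pe) ))   # one rmeta extent per image
    echo "images=$images per_image=${per_image}G raw=${raw}G"
}

raid10_layout 4096 2 1 1
# prints: images=4 per_image=2048G raw=8196G
```

So the 4T LV costs a little over 8T of raw space, spread evenly over the four PVs.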
For the Time Machine and Acronis targets, I decided to lvconvert them to raid1.
As stated before, there are two competing implementations of RAID in LVM2, --type mirror and --type raid1.
The mirror implementation is very extremely strongly deprecated; the raid1 implementation is okay, because it uses the same code as mdraid internally.
We need to make sure to specify --type raid1 in the lvconvert command to ensure the proper type is being used.
For each target we do:
# lvconvert --type raid1 -m1 /dev/hdd/tm_...
...
This returns immediately, and begins to sync the mirror halves internally. If we at this point convert all Time Machine and Acronis targets at once, the sync speed is abysmally slow, because of disk head thrashing. There is no way to stop this. The only option is to slow down all resyncs except one:
# ### for all targets, slow them down to minimum
# lvchange --maxrecoveryrate 1k /dev/hdd/tm_...
# ### for one target, set a very large max speed:
# lvchange --maxrecoveryrate 100000k /dev/hdd/tm_...
...
and then let one tm_... target finish.
After that, we can increase the max sync speed for the next target, and so on.
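The throttling round can be written down as a small generator. This is a sketch under assumptions: the function name throttle_plan is made up, the rates are the ones used above, and the option is spelled --maxrecoveryrate as in the lvchange(8) man page. It only prints the commands.

```shell
# Hypothetical helper: print lvchange commands that let the first LV
# resync at full speed and throttle all the others.
# Usage: throttle_plan VG FAST_LV [SLOW_LV...]
throttle_plan() {
    vg=$1; fast=$2; shift 2
    echo "lvchange --maxrecoveryrate 100000k /dev/$vg/$fast"
    for lv in "$@"; do
        echo "lvchange --maxrecoveryrate 1k /dev/$vg/$lv"
    done
}

# Dry run with the LV names from this VG.
throttle_plan hdd tm_aircat tm_joram tm_mini win_kk
```

Once the fast LV has finished, calling the function again with the next LV in first position produces the commands for the following round.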
This is what a --type raid1 looks like in lsblk:
# lsblk /dev/sda2 /dev/sdg2
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda2 8:2 0 9.1T 0 part
├─hdd-tm_aircat_rmeta_0 253:18 0 1G 0 lvm
│ └─hdd-tm_aircat 253:20 0 1T 0 lvm /export/tm_aircat
├─hdd-tm_aircat_rimage_0 253:25 0 1T 0 lvm
│ └─hdd-tm_aircat 253:20 0 1T 0 lvm /export/tm_aircat
...
sdg2 8:98 0 9.1T 0 part
├─hdd-tm_aircat_rmeta_1 253:26 0 1G 0 lvm
│ └─hdd-tm_aircat 253:20 0 1T 0 lvm /export/tm_aircat
├─hdd-tm_aircat_rimage_1 253:30 0 1T 0 lvm
│ └─hdd-tm_aircat 253:20 0 1T 0 lvm /export/tm_aircat
...
That is, for each leg of the RAID1 you get a rimage and rmeta LV.
If the LVs are named mimage_0 and mimage_1 and there is no rmeta but only a single mlog, this is not --type raid1, but the deprecated mirror implementation.
This is not good, and should be converted to --type raid1.
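Instead of reading sub-LV names, the segment type can be queried directly with the segtype report field of lvs. A minimal sketch, assuming "name segtype" pairs on stdin; the helper name is hypothetical and old_lv in the sample does not exist in this VG.

```shell
# Hypothetical helper: list LVs whose segment type is the old "mirror",
# reading "name segtype" pairs on stdin.
legacy_mirrors() {
    awk '$2 == "mirror" { print $1 }'
}

# Made-up sample; "old_lv" is purely illustrative.
sample='tm_aircat raid1
backup raid10
old_lv mirror'

printf '%s\n' "$sample" | legacy_mirrors
# On a live system: lvs --noheadings -o lv_name,segtype hdd | legacy_mirrors
```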
There is no way to convert a linear or raid1 to raid10, unfortunately.
Checking status and progress
A few handy commands to check the status and the progress of the conversion.
Monitor the sync state of the devices:
# lvs -a -o name,copy_percent,devices,raid_max_recovery_rate,raid_mismatch_count,raid_sync_action hdd
The -a shows all the LVs, even the internal ones.
We are interested in the copy_percent to see the progress of the sync.
We also want the raid_max_recovery_rate, because we might have throttled it with the lvchange command mentioned above.
And we want to see the raid_mismatch_count and raid_sync_action to see what’s going on.
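For unattended waiting, the copy_percent column can drive a polling loop. This is only a sketch: all_synced is a made-up name, it expects one copy_percent value per line, and it treats blank lines (LVs without a sync state) as done.

```shell
# Hypothetical helper: succeed when every non-empty copy_percent value
# on stdin equals 100.00.
all_synced() {
    awk 'NF && $1 != "100.00" { pending = 1 }
         END { if (pending) exit 1; exit 0 }'
}

# Sample values; the second LV is still syncing.
if printf '100.00\n42.17\n\n' | all_synced; then
    echo "all in sync"
else
    echo "still syncing"
fi
# On a live system, something like:
#   until lvs --noheadings -o copy_percent hdd | all_synced; do sleep 60; done
```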
Of course, the ubiquitous lvs -a -o +devices is always handy to get an impression of the entire VG:
# lvs -a -o +devices hdd
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert Devices
backup hdd rwi-aor--- 4.00t 100.00 backup_rimage_0(0),backup_rimage_1(0),backup_rimage_2(0),backup_rimage_3(0)
[backup_rimage_0] hdd iwi-aor--- 2.00t /dev/sdb2(2561)
[backup_rimage_1] hdd iwi-aor--- 2.00t /dev/sdc2(1)
[backup_rimage_2] hdd iwi-aor--- 2.00t /dev/sda2(1026)
[backup_rimage_3] hdd iwi-aor--- 2.00t /dev/sdg2(3158)
[backup_rmeta_0] hdd ewi-aor--- 1.00g /dev/sdb2(2560)
[backup_rmeta_1] hdd ewi-aor--- 1.00g /dev/sdc2(0)
[backup_rmeta_2] hdd ewi-aor--- 1.00g /dev/sda2(1025)
[backup_rmeta_3] hdd ewi-aor--- 1.00g /dev/sdg2(3157)
tm_aircat hdd rwi-aor--- 1.00t 100.00 tm_aircat_rimage_0(0),tm_aircat_rimage_1(0)
[tm_aircat_rimage_0] hdd iwi-aor--- 1.00t /dev/sda2(0)
[tm_aircat_rimage_1] hdd iwi-aor--- 1.00t /dev/sdg2(2133)
[tm_aircat_rmeta_0] hdd ewi-aor--- 1.00g /dev/sda2(1024)
[tm_aircat_rmeta_1] hdd ewi-aor--- 1.00g /dev/sdg2(2132)
tm_joram hdd rwi-aor--- 1.50t 100.00 tm_joram_rimage_0(0),tm_joram_rimage_1(0)
[tm_joram_rimage_0] hdd iwi-aor--- 1.50t /dev/sdc2(2049)
[tm_joram_rimage_1] hdd iwi-aor--- 1.50t /dev/sdb2(1)
[tm_joram_rmeta_0] hdd ewi-aor--- 1.00g /dev/sdc2(3585)
[tm_joram_rmeta_1] hdd ewi-aor--- 1.00g /dev/sdb2(0)
tm_mini hdd rwi-aor--- 512.00g 100.00 tm_mini_rimage_0(0),tm_mini_rimage_1(0)
[tm_mini_rimage_0] hdd iwi-aor--- 512.00g /dev/sdb2(8786)
[tm_mini_rimage_1] hdd iwi-aor--- 512.00g /dev/sdc2(4609)
[tm_mini_rmeta_0] hdd ewi-aor--- 1.00g /dev/sdb2(9298)
[tm_mini_rmeta_1] hdd ewi-aor--- 1.00g /dev/sdc2(4608)
win_kk hdd rwi-aor--- 2.08t 100.00 win_kk_rimage_0(0),win_kk_rimage_1(0)
[win_kk_rimage_0] hdd iwi-aor--- 2.08t /dev/sdg2(2)
[win_kk_rimage_1] hdd iwi-aor--- 2.08t /dev/sda2(3075)
[win_kk_rmeta_0] hdd ewi-aor--- 1.00g /dev/sdg2(0)
[win_kk_rmeta_1] hdd ewi-aor--- 1.00g /dev/sda2(3074)
Another way to look at the construct is lsblk:
# lsblk /dev/sd{a,b,c,g}
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 9.1T 0 disk
├─sda1 8:1 0 10G 0 part
└─sda2 8:2 0 9.1T 0 part
├─hdd-tm_aircat_rmeta_0 253:18 0 1G 0 lvm
│ └─hdd-tm_aircat 253:20 0 1T 0 lvm /export/tm_aircat
├─hdd-tm_aircat_rimage_0 253:25 0 1T 0 lvm
│ └─hdd-tm_aircat 253:20 0 1T 0 lvm /export/tm_aircat
├─hdd-backup_rmeta_2 253:35 0 1G 0 lvm
│ └─hdd-backup 253:39 0 4T 0 lvm /backup
├─hdd-backup_rimage_2 253:36 0 2T 0 lvm
│ └─hdd-backup 253:39 0 4T 0 lvm /backup
├─hdd-win_kk_rmeta_1 253:42 0 1G 0 lvm
│ └─hdd-win_kk 253:23 0 2.1T 0 lvm /export/win_kk
└─hdd-win_kk_rimage_1 253:43 0 2.1T 0 lvm
└─hdd-win_kk 253:23 0 2.1T 0 lvm /export/win_kk
...
What is where?
And of course, we might be interested in the actual distribution of data on the disk:
# pvdisplay --map /dev/sd{a,b,c,g}2
...
--- Physical volume ---
PV Name /dev/sdg2
VG Name hdd
PV Size 9.09 TiB / not usable 1022.98 MiB
Allocatable yes
PE Size 1.00 GiB
Total PE 9303
Free PE 4098
Allocated PE 5205
PV UUID ...
--- Physical Segments ---
Physical extent 0 to 0:
Logical volume /dev/hdd/win_kk_rmeta_0
Logical extents 0 to 0
Physical extent 1 to 1:
FREE
Physical extent 2 to 2131:
Logical volume /dev/hdd/win_kk_rimage_0
Logical extents 0 to 2129
Physical extent 2132 to 2132:
Logical volume /dev/hdd/tm_aircat_rmeta_1
Logical extents 0 to 0
Physical extent 2133 to 3156:
Logical volume /dev/hdd/tm_aircat_rimage_1
Logical extents 0 to 1023
Physical extent 3157 to 3157:
Logical volume /dev/hdd/backup_rmeta_3
Logical extents 0 to 0
Physical extent 3158 to 5205:
Logical volume /dev/hdd/backup_rimage_3
Logical extents 0 to 2047
Physical extent 5206 to 9302:
FREE
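The segment map can also be summarized mechanically, for example to cross-check the "Free PE" line. A minimal sketch with a hypothetical helper name, fed with a shortened copy of the map above:

```shell
# Hypothetical helper: sum the FREE extents in `pvdisplay --map` output.
free_extents() {
    awk '/Physical extent/ { sub(":", "", $5); n = $5 - $3 + 1 }
         $1 == "FREE"      { total += n }
         END               { print total + 0 }'
}

# Shortened sample from the /dev/sdg2 map shown above.
sample='Physical extent 0 to 0:
Logical volume /dev/hdd/win_kk_rmeta_0
Logical extents 0 to 0
Physical extent 1 to 1:
FREE
Physical extent 5206 to 9302:
FREE'

printf '%s\n' "$sample" | free_extents
# prints: 4098
```

The two free segments add up to 1 + 4097 = 4098 extents, the same number pvdisplay reports as Free PE for /dev/sdg2.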
Defragmenting /dev/sdb2
Due to the way we created things, initially all RAID1 LVs have the left leg of their mirror on /dev/sdb2, because that is where we pvmoved stuff initially.
We might want to fix that, and push a few things over.
I did that, as can be seen in the lvs -a -o +devices hdd output further up.
Here is how:
# pvmove -n <LV_name> /dev/sdb2 /dev/sdg2
...
This will move all data belonging to LV_name that is currently on /dev/sdb2 to /dev/sdg2.
Again, this will take a long time, and there should be no other sync action currently active for maximum speed, as rotating disks slow down a lot when there are competing disk seeks.
This leaves us with a fragmented /dev/sdb2 and LVs on higher numbered extents, which are a lot slower than lower numbered extents.
We could fix that as well, again with pvmove:
# pvdisplay --map /dev/sdc2
...
Physical extent 3587 to 3596:
Logical volume /dev/hdd/keks_rimage_1
Logical extents 0 to 9
Physical extent 3597 to 4607:
FREE
...
# pvmove -n keks --alloc anywhere /dev/sdc2:3587-3596 /dev/sdc2:4000-4009
/dev/sdc2: Moved: 0.00%
...
/dev/sdc2: Moved: 100.00%
This moves extents internally on a drive.
pvmove normally refuses to do this, so we have to tell it to shut up about this, using --alloc anywhere.
We then use extent-addressing to change the map manually:
We move data from /dev/sdc2:3587-3596, the entire keks Test-LV, somewhere into the FREE space, 4000-4009.
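Source and destination ranges have to cover the same number of extents, here 10 each. A small sketch that derives the destination range from the source range and a new start; the function name is made up, and it only prints the two PV:range arguments for pvmove.

```shell
# Hypothetical helper: build matching source and destination arguments
# for moving an extent range to a new start on the same PV.
# Usage: extent_move_args PV FIRST LAST NEW_FIRST
extent_move_args() {
    pv=$1 first=$2 last=$3 new_first=$4
    new_last=$(( new_first + last - first ))
    echo "$pv:$first-$last $pv:$new_first-$new_last"
}

extent_move_args /dev/sdc2 3587 3596 4000
# prints: /dev/sdc2:3587-3596 /dev/sdc2:4000-4009
```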
The result looks like this:
...
Physical extent 3587 to 3999:
FREE
Physical extent 4000 to 4009:
Logical volume /dev/hdd/keks_rimage_1
Logical extents 0 to 9
Physical extent 4010 to 4607:
FREE
...
# lvremove /dev/hdd/keks
Do you really want to remove and DISCARD active logical volume hdd/keks? [y/n]: y
Logical volume "keks" successfully removed
And that concludes a largely pointless refactoring of my home storage, because I could.
# vgs
VG #PV #LV #SN Attr VSize VFree
data 2 15 0 wz--n- 7.28t 3.25t
hdd 4 5 0 wz--n- 36.34t 18.17t
system 1 4 0 wz--n- 466.94g 234.94g
Do you have checksums?
Not yet.
There is a thing called dm-integrity, though, and a gist that I have to try. The dm-integrity documentation is here.
And integritysetup is part of cryptsetup-bin on Ubuntu.