Replacing and resizing a linux software raid, live

Let me describe the scenario:

You have a linux software raid (raid5, in my case, created with mdadm).
On top of it, you have a few LVM volumes, and LUKS encrypted partitions.
You literally set this up 10 years ago: 4 disks, 2 TB each.
It has been running strong for the last 10 years, with the occasional disk replaced.
You just bought new 8 TB disks.

And now, you want to replace the old disks with the new ones, increase the size of the raid5 volume and, well, you want to do it live (with the partition in use, read-write, without unmounting it and without rebooting the machine).

All of this with consumer hardware, that DOES NOT SUPPORT ANY SORT OF HOT SWAP. Basically, no hardware raid controller, just the cheapest SATA support offered by the cheapest atom motherboard that you bought 10 years ago that happened to have enough SATA plugs.

Not for the faint of heart, but it turns out this is possible with a stock linux kernel, fairly easy to do, and worked really well for me.

All you need to do is type a few more commands from your shell, so that your incredibly cheap and naive SATA controller and linux system know what you're up to before you go around touching the wiring.

During the entire process I did have to reboot the server once: the chassis was not server class, and I did not have access to the disks without removing the case, and moving some cables around.

But that was it: one shutdown, 10 mins of moving cables around, back on, and the rest of the work was done live.

Preparation

In short:

cat /proc/mdstat
lsblk -do +VENDOR,MODEL,SERIAL
echo check > /sys/block/mdX/md/sync_action
dmesg, smartctl -a /dev/sdX

Longer explanation:

If you don't remember the disks in your array, cat /proc/mdstat to see each volume. In my case, I had to replace the disks in md5.

 # cat /proc/mdstat 
 Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
 md0 : active (auto-read-only) raid1 sdb1[0] sda1[1]
       145344 blocks [2/2] [UU]

 md1 : active raid1 sdb5[0] sda5[1]
       244043264 blocks [2/2] [UU]

 md5 : active raid5 sde[7] sdd[6] sdc[5] sdf[4]
       5860538880 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUUU]

Label the disks. Run the command lsblk -do +VENDOR,MODEL,SERIAL, print the output or copy it to your laptop, open your case. Put a label on each disk marking "sdc", "sdd", "sde", "sdf" by checking the SERIAL # (printed on labels on the disk).

 NAME MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT VENDOR   MODEL                 SERIAL
 sda    8:0    0 232.9G  0 disk            ATA      WDC_WD2500AVVS-73M8B0 WD-WCAV94350152
 sdb    8:16   0 232.9G  0 disk            ATA      WDC_WD2500AVVS-73M8B0 WD-WCAV94283568
 sdc    8:32   0   1.8T  0 disk            ATA      WDC_WD20EARX-22PASB0  WD-WCAZAJ370736
 sdd    8:48   0   1.8T  0 disk            ATA      WDC_WD20EZRX-00D8PB0  WD-WCC4M1KD6YU5
 sde    8:64   0   1.8T  0 disk            ATA      WDC_WD20EARS-00MVWB0  WD-WMAZA3309946
 sdf    8:80   0   1.8T  0 disk            ATA      WDC_WD20EZRX-00D8PB0  WD-WMC4N0H1XYK1

Check that the raid is in good health by following the next 2 steps. If it's a raid5, you can only afford one disk down at a time. If another disk turns out to be damaged while you are replacing a disk, you will lose data.

cat /proc/mdstat one more time. Verify that all disks are up. [UUUU] means that there are 4 disks, each in up state (see the example output above). If one of the disks had failed, you would see something like [UU_U], indicating one disk down, along with a recovery in progress or a degraded array. If that's the case, you must replace the damaged disk before proceeding!

Trigger an array check. The command is echo check > /sys/block/md5/md/sync_action, with md5 being the name of your raid array. Then cat /proc/mdstat: you should now see something like the output below, indicating that the check is in progress:

 # cat /proc/mdstat 
 Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
 md0 : active (auto-read-only) raid1 sdb1[0] sda1[1]
       145344 blocks [2/2] [UU]

 md1 : active raid1 sdb5[0] sda5[1]
       244043264 blocks [2/2] [UU]

 md5 : active raid5 sde[7] sdd[6] sdc[5] sdf[4]
       5860538880 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUUU]
       [>....................]  check =  1.7% (34292164/1953512960) finish=423.6min speed=97372K/sec

Wait. Wait. Wait, until the check is completed. I left my console up with watch -d cat /proc/mdstat.
At the end of the process, check dmesg to see if any read error was reported. Then run smartctl -a /dev/sdc, then /dev/sdd, and so on for each element of your array to see if any disk had errors.
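The [UUUU] check described above is easy to script if you would rather not eyeball the output every time. What follows is a minimal sketch, not something from my original notes: md_all_up is a name I made up, and it assumes the status brackets are the last field on the line right after the "mdX :" line, as in the outputs shown here.

```shell
# md_all_up ARRAY [MDSTAT_FILE]
# Hypothetical helper: succeeds only if the status field for ARRAY
# (for example [UUUU]) contains no "_", i.e. no member is down.
# Assumes the status is the last field of the line after "ARRAY : ...".
md_all_up() {
    array=$1
    mdstat=${2:-/proc/mdstat}
    status=$(awk -v a="$array" '
        $1 == a && $2 == ":" { found = 1; next }   # the "md5 : active ..." line
        found { print $NF; exit }                  # last field of the next line
    ' "$mdstat")
    case $status in
        "")      echo "$array: not found in $mdstat" >&2; return 2 ;;
        *_*)     echo "$array: degraded $status" >&2; return 1 ;;
        "["*"]") echo "$array: all members up $status" ;;
        *)       echo "$array: unexpected status '$status'" >&2; return 2 ;;
    esac
}
```

Something like md_all_up md5 || echo "do not pull a disk yet" is the kind of use I have in mind before touching any hardware.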

In my case, the check was successful, dmesg was clean of errors after the array check was started, but smartctl -a /dev/sde showed that the disk - with more than 10 years of total run time - had multiple read errors and recovered from most of them. Still OK, but not in good health. About to fail.

Given that raid5 can only tolerate one disk failing, I started by replacing /dev/sde. Really, you don't want another disk to fail on you while you are waiting for your new disk to be synchronized.
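Running smartctl by hand on every member gets tedious, so here is a rough sketch of how the list of members could be pulled out of /proc/mdstat and fed to smartctl -a. The function names md_members and smart_report are my own invention, and the parsing assumes member entries look like sde[7], as in the output above.

```shell
# md_members ARRAY [MDSTAT_FILE]
# Hypothetical helper: prints one member device name per line,
# e.g. "sde" for the "sde[7]" entry on the "md5 : ..." line.
md_members() {
    awk -v a="$1" '
        $1 == a && $2 == ":" {
            for (i = 3; i <= NF; i++)
                if ($i ~ /\[[0-9]+\]/) {    # member entries look like sde[7]
                    sub(/\[.*/, "", $i)      # drop "[7]" and anything after it
                    print $i
                }
            exit
        }
    ' "${2:-/proc/mdstat}"
}

# smart_report ARRAY
# Runs smartctl -a on every member of ARRAY (needs root and smartmontools).
smart_report() {
    for d in $(md_members "$1"); do
        echo "=== /dev/$d ==="
        smartctl -a "/dev/$d"
    done
}
```

With this, smart_report md5 | less gives one long report to read through, disk by disk.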

Replacing each disk

It is actually easier than it sounds. In short (do at your own risk, or read the explanation below):

mdadm --manage /dev/mdX --fail /dev/sdX
mdadm --manage /dev/mdX --remove /dev/sdX
echo 1 > /sys/block/sdX/device/delete
Unplug the device (power cable first, sata cable next)
Plug the new device (sata cable first, power cable next)
Wait 10-15 seconds.

Run:

 # Ask every SCSI/SATA host to rescan all channels, targets and LUNs.
 for file in /sys/class/scsi_host/*/scan; do
   echo "- - -" > "$file"
 done

mdadm --add /dev/mdX /dev/sdX
cat /proc/mdstat
Once synchronization is done, repeat from the first step for the next disk.

Longer explanation:

Pick the disk to replace. In my case, I started with /dev/sde.
Mark it faulty, so the raid stops using it: mdadm --manage /dev/md5 --fail /dev/sde.
Tell linux you want the disk entirely out of the array: mdadm --manage /dev/md5 --remove /dev/sde.
If you plan to re-use the disk on a different machine or different array, and you are super-confident you won't have to plug it back in to recover your data, use mdadm --zero-superblock /dev/sde to remove the raid metadata, so your disk won't be detected as a raid array when plugged into another machine. I wanted to keep the old disk around in case the array sync failed for any reason, so I did not do this step.
Tell linux (and the controller!) you are about to unplug the device: echo 1 > /sys/block/sde/device/delete. This step may not be necessary if your controller supports hot swap.
Check dmesg, see a few messages notifying that the device was deleted. Check ls /dev/sd*, sde should be gone.
Physically remove the device. I personally removed the power first, and then unplugged the sata wire. If you have a hot swap/easy swap case, you can probably just remove the disk.

If you forgot the two steps before unplugging (deleting the device and checking that it is gone), in my case linux noticed the disk gone on its own within a few minutes. After unplugging, check dmesg and ls /dev/sd* again, and wait until the software layer thinks the drive is gone.

Not sure if I was lucky and those two steps are optional, but I did not trust my controller.

Prepare the new disk. Make sure the disk is new, or that any raid superblock was deleted beforehand (on a separate machine, or before it was unplugged) with mdadm --zero-superblock /dev/sdX.
Connect the disk. Here, I first connected the SATA wire, and then the power. Nothing happened: according to dmesg the disk was not detected, although it started spinning.

Ask linux to rescan all the SATA buses, looking for new disks:

  # "- - -" is a wildcard for channel, target and LUN: rescan everything.
  for file in /sys/class/scsi_host/*/scan; do
    echo "- - -" > "$file"
  done

Check dmesg, and ls /dev/sde. dmesg should tell you that a new disk was found, and /dev/sde should now be magically back.
If you need a partition table on the disk, now is the time to create it. In my case, I use LVM on top of raid, so I just wanted to add the whole physical disk to the raid, no partition table.

Note that the new physical disk is 8 TB, while the old one is 2 TB, so 6 TB are wasted for now. But my plan is to replace all disks in the array, and resize the array at the end, so no partitioning was necessary. If you wanted to have one disk larger, you could have partitioned it, put 2 TB back in the raid5 array, and used the rest for something else.
Add the new disk to the array: mdadm --add /dev/md5 /dev/sde

Check cat /proc/mdstat. You should now see the disk being synchronized:

  # cat /proc/mdstat 
  Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
  md0 : active (auto-read-only) raid1 sdb1[0] sda1[1]
        145344 blocks [2/2] [UU]

  md1 : active raid1 sdb5[0] sda5[1]
        244043264 blocks [2/2] [UU]

  md5 : active raid5 sde[7] sdd[6] sdc[5] sdf[4]
        5860538880 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [_UUU]
        [>....................]  recovery =  3.4% (66457728/1953512960) finish=1237.1min speed=25421K/sec

Once recovery is done (in 1237 minutes...) make sure that it succeeded: check /proc/mdstat, always, run smartctl -a /dev/sde and on any other disk involved, and look for errors in dmesg.
Assuming all went well, repeat the steps to replace each disk in the array.
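Since a recovery takes many hours, it can help to have the shell tell you when the array is done instead of checking back by hand. The sketch below polls the sync_action file (the same file used earlier to trigger the check) until it reads idle. The function name wait_md_idle and its optional arguments are made up for illustration, and I am assuming sync_action reports idle once a check or recovery has finished.

```shell
# wait_md_idle ARRAY [INTERVAL_SECONDS] [SYSFS_ROOT]
# Hypothetical helper: polls ARRAY's sync_action until it reads "idle".
# SYSFS_ROOT defaults to /sys/block; overriding it is only useful for testing.
wait_md_idle() {
    action_file="${3:-/sys/block}/$1/md/sync_action"
    interval=${2:-60}
    while :; do
        action=$(cat "$action_file") || return 2   # array or file missing
        [ "$action" = idle ] && break
        echo "$1: $action in progress"
        sleep "$interval"
    done
    echo "$1: idle"
}
```

For example, wait_md_idle md5 300 && cat /proc/mdstat prints the final state once the recovery is over, checking every 5 minutes. It only tells you the array stopped syncing, not that the sync succeeded, so the checks above still apply.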

Depending on what your machine is supposed to be doing while all of this is in progress, you can adjust the minimum and maximum sync speed by changing the value of:

  /sys/block/md5/md/sync_speed_max
  /sys/block/md5/md/sync_speed_min

For example, with:

  # cat /sys/block/md5/md/sync_speed
  24873
  # echo 50000 > /sys/block/md5/md/sync_speed_max
  # echo 40000 > /sys/block/md5/md/sync_speed_min
  # cat /sys/block/md5/md/sync_speed
  26988
  # cat /sys/block/md5/md/sync_speed
  31840
  # cat /sys/block/md5/md/sync_speed
  37234
  # cat /sys/block/md5/md/sync_speed
  40024

Repeat this same process for each disk. For a week, I'd pretty much replace 1 disk every time I came home from work. All had been replaced by the end of the week.

Resizing the raid

Resizing the raid was easier than I expected. In short:

mdadm --grow /dev/md5 -z max - to resize the array to the maximum supported by the underlying partitions / physical disk.
Wait for the resize process to be completed. I used something like: mdadm -D /dev/md5 | grep -e "Array Size" -e "Dev Size" to see the delta shrinking, and the usual cat /proc/mdstat.
pvresize /dev/md5 to instruct LVM that the physical volume is now larger, and its size can be adjusted to match it.
Once the resize is complete, vgs should show the additional space available in the volume group. Now you can resize any logical volume to have as much space as you need.
If you need instructions on how to resize a logical volume encrypted with LUKS, I wrote about it a few years ago. You can read them all here, they still worked like a charm this time around.
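Put together, the resize steps above could look roughly like the sketch below. It is a dry run by default: the run wrapper, the DRY_RUN variable and the function name grow_md_and_pv are all assumptions of mine, and I am using mdadm --wait here in place of watching /proc/mdstat by hand, on the understanding that it blocks until any resync, recovery or reshape on the array has finished.

```shell
# run CMD...: prints the command when DRY_RUN=1 (the default here),
# executes it when DRY_RUN=0. Hypothetical wrapper, for illustration only.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# grow_md_and_pv MD_DEVICE
# Grows the array to the maximum the members allow, waits for the
# resync of the new space, then lets LVM pick up the larger size.
grow_md_and_pv() {
    md=$1                               # e.g. /dev/md5
    run mdadm --grow "$md" --size=max   # long form of "-z max"
    run mdadm --wait "$md"              # wait for the resync to finish
    run pvresize "$md"                  # tell LVM the physical volume grew
    run vgs                             # the extra space should show up here
}
```

Running grow_md_and_pv /dev/md5 as is only prints the four commands, which is a cheap way to double check the device name. DRY_RUN=0 runs them for real, as root.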

Conclusions

The whole process was surprisingly painless. All worked like a charm, and by the end of it I had an array with 24 TB of total space available, with close to zero downtime.
