Let me describe the scenario:
And now you want to replace the old disks with new ones, increase the size of the raid5 volume and, well, you want to do it live (with the partition in use, read write, without unmounting it, and without rebooting the machine).
All of this with consumer hardware, that DOES NOT SUPPORT ANY SORT OF HOT SWAP. Basically, no hardware raid controller, just the cheapest SATA support offered by the cheapest atom motherboard that you bought 10 years ago that happened to have enough SATA plugs.
Not for the faint of heart, but it turns out this is possible with a stock linux kernel, fairly easy to do, and it worked really well for me.
All you need to do is type a few extra commands from your shell, so that your incredibly cheap and naive SATA controller and your linux system know what you're up to before you go around touching the wiring.
During the entire process I did have to reboot the server once: the chassis was not server class, and I could not get to the disks without removing the case and moving some cables around.
But that was it: one shutdown, 10 mins of moving cables around, back on, and the rest of the work was done live.
In short:
cat /proc/mdstat
lsblk -do +VENDOR,MODEL,SERIAL
echo check > /sys/block/mdX/md/sync_action
dmesg
smartctl -a /dev/sdX
Longer explanation:
If you don't remember which disks are in your array, cat /proc/mdstat to see each volume.
In my case, I had to replace the disks in md5.
# cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active (auto-read-only) raid1 sdb1[0] sda1[1]
      145344 blocks [2/2] [UU]

md1 : active raid1 sdb5[0] sda5[1]
      244043264 blocks [2/2] [UU]

md5 : active raid5 sde[7] sdd[6] sdc[5] sdf[4]
      5860538880 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUUU]
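If you want a more verbose view of a single array, mdadm can print roughly the same information per device. A minimal sketch, assuming your array is md5 like mine:
# Show state, member devices and failed/spare counts for md5
mdadm --detail /dev/md5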
Label the disks. Run the command lsblk -do +VENDOR,MODEL,SERIAL, print the output or copy it to your laptop, and open your case. Put a label on each disk marking "sdc", "sdd", "sde", "sdf" by checking the SERIAL # (printed on the label on each disk).
NAME MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT VENDOR MODEL                 SERIAL
sda    8:0    0 232.9G  0 disk            ATA    WDC_WD2500AVVS-73M8B0 WD-WCAV94350152
sdb    8:16   0 232.9G  0 disk            ATA    WDC_WD2500AVVS-73M8B0 WD-WCAV94283568
sdc    8:32   0   1.8T  0 disk            ATA    WDC_WD20EARX-22PASB0  WD-WCAZAJ370736
sdd    8:48   0   1.8T  0 disk            ATA    WDC_WD20EZRX-00D8PB0  WD-WCC4M1KD6YU5
sde    8:64   0   1.8T  0 disk            ATA    WDC_WD20EARS-00MVWB0  WD-WMAZA3309946
sdf    8:80   0   1.8T  0 disk            ATA    WDC_WD20EZRX-00D8PB0  WD-WMC4N0H1XYK1
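If reading the serials out of that table is tedious, a small loop prints one raid member per line; a quick sketch, assuming your array members are sdc through sdf as above:
# Print device name and serial for each array member
for d in sdc sdd sde sdf; do
    echo "$d: $(lsblk -ndo SERIAL /dev/$d)"
done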
Check that the raid is in good health by following the next 2 steps. If it's a raid5, you can only afford one disk down at a time: if another disk turns out to be damaged while you are replacing one, you will lose data.
Run cat /proc/mdstat one more time. Verify that all disks are up: [UUUU] means that there are 4 disks, each in up state (see the example output above). If one of the disks had failed, you would have seen something like [UU_U], indicating one disk down, and probably something like recovery in progress or degraded mode. If that's the case, you must replace the damaged disk before proceeding!
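If you prefer not to count U characters, mdadm can report the same health summary directly; a sketch, again assuming md5:
# "State : clean" with "Failed Devices : 0" means all members are up
mdadm --detail /dev/md5 | grep -e "State :" -e "Failed Devices" -e "Working Devices"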
Trigger an array check. The command is echo check > /sys/block/md5/md/sync_action, with md5 being the name of your raid array. Check cat /proc/mdstat again: you should now see something like below, indicating that the check is in progress:
# cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active (auto-read-only) raid1 sdb1[0] sda1[1]
      145344 blocks [2/2] [UU]

md1 : active raid1 sdb5[0] sda5[1]
      244043264 blocks [2/2] [UU]

md5 : active raid5 sde[7] sdd[6] sdc[5] sdf[4]
      5860538880 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUUU]
      [>....................]  check =  1.7% (34292164/1953512960) finish=423.6min speed=97372K/sec
Wait. Wait. Wait, until the check is completed. I left my console up with watch -d cat /proc/mdstat.
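Once the check completes, md also exposes a counter of inconsistent blocks it found while scrubbing; worth a quick look (again assuming md5), since a value of 0 means no mismatches were detected:
cat /sys/block/md5/md/mismatch_cnt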
At the end of the process, check dmesg to see if any read error was reported. Then use smartctl -a /dev/sdc, and then /dev/sdd, ... for each element of your array, to see if any disk had errors.
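If you don't feel like reading the full smartctl output for every disk, the attributes I cared about can be grepped out; a sketch, assuming the usual SMART attribute names (reallocated, pending and uncorrectable sectors) on these drives:
for d in sdc sdd sde sdf; do
    echo "=== /dev/$d ==="
    smartctl -a /dev/$d | grep -i -e reallocated -e pending -e uncorrect
done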
In my case, the check was successful, dmesg was clean of errors after the array check was started, but smartctl -a /dev/sde showed that the disk - with more than 10 years of total run time - had multiple read errors and recovered from most of them. Still OK, but not in good health. About to fail.
Given that raid5 can only tolerate one disk failing, I started by replacing /dev/sde. Really, you don't want another disk to fail on you while you are waiting for your new disk to be synchronized.
It is actually easier than it sounds. In short (do at your own risk, or read the explanation below):
mdadm --manage /dev/mdX --fail /dev/sdX
mdadm --manage /dev/mdX --remove /dev/sdX
echo 1 > /sys/block/sdX/device/delete
for file in /sys/class/scsi_host/*/scan; do echo "- - -" > $file; done;
mdadm --add /dev/mdX /dev/sdX
cat /proc/mdstat
Longer explanation:
Pick the disk to replace. In my case, I started with /dev/sde.
Mark it faulty, so the raid stops using it: mdadm --manage /dev/md5 --fail /dev/sde.
Tell linux you want the disk entirely out of the array: mdadm --manage /dev/md5 --remove /dev/sde.
If you plan to re-use the disk on a different machine or different array, and you are super-confident you won't have to plug it back in to recover your data, use mdadm --zero-superblock /dev/sde to remove the raid metadata, so your disk won't be detected as a raid array when plugged into another machine. I wanted to keep the old disk around in case the array sync failed for any reason, so I did not do this step.
Tell linux (and the controller!) you are about to unplug the device: echo 1 > /sys/block/sde/device/delete. This step may not be necessary if your controller supports hot swap.
Check dmesg, see a few messages notifying that the device was deleted. Check ls /dev/sd*: sde should be gone.
Physically remove the device. I personally removed the power first, and then unplugged the SATA wire. If you have a hot swap / easy swap case, you can probably just remove the disk.
If you forget about step 4 or 5: in my case, linux noticed the disk was gone on its own within a few minutes. Check step 5 again after unplugging, and wait until the software layer thinks the drive is gone.
Not sure if I was lucky and steps 4 and 5 are optional, but I did not trust my controller.
Prepare the new disk. Make sure the disk is new, or that any raid superblock was deleted beforehand (on a separate machine, or before it was unplugged) with mdadm --zero-superblock /dev/sdX.
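To double check that the new disk really carries no leftover metadata before adding it, you can ask mdadm to examine it; a sketch, with sdX being whatever name the new disk will get:
# A clean disk should report that no md superblock was found
mdadm --examine /dev/sdX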
Connect the disk. Here, I first connected the SATA wire, and then the power. Nothing happened: looking at dmesg, the disk was not detected, although it started spinning.
Ask linux to rescan all the SATA buses, looking for new disks:
for file in /sys/class/scsi_host/*/scan; do echo "- - -" > $file; done;
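That loop pokes every SCSI/SATA host on the system, which is the lazy and safe option. If you know which host the new disk hangs off of, you can rescan just that one; host0 below is only a placeholder, check ls /sys/class/scsi_host/ for the names on your machine:
ls /sys/class/scsi_host/
echo "- - -" > /sys/class/scsi_host/host0/scan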
Check dmesg, and ls /dev/sde. dmesg should tell you that a new disk was found, and /dev/sde should now be magically back.
If you need a partition table on the disk, now is the time to create it. In my case, I use LVM on top of raid, so I just wanted to add the whole physical disk to the raid, with no partition table.
Note that the new physical disk is 8TB, while the old one is 2TB, so 6TB would get wasted. But my plan is to replace all disks in the array and resize the array at the end, so no partitioning was necessary. If you wanted one disk to be larger, you could have partitioned it, put 2TB back in the raid5 array, and used the rest for something else.
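For that last scenario - keeping 2TB in the raid and using the rest separately - the partitioning could look roughly like the sketch below. This is not what I did; the device name and sizes are only an example:
# Hypothetical: split an 8TB disk into a 2TB raid member and a second partition for the rest
parted -s /dev/sdX mklabel gpt
parted -s /dev/sdX mkpart raid 1MiB 2TB
parted -s /dev/sdX mkpart rest 2TB 100%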
Add the new disk to the array: mdadm --add /dev/md5 /dev/sde. Check cat /proc/mdstat. You should now see the disk being synchronized:
# cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active (auto-read-only) raid1 sdb1[0] sda1[1]
      145344 blocks [2/2] [UU]

md1 : active raid1 sdb5[0] sda5[1]
      244043264 blocks [2/2] [UU]

md5 : active raid5 sde[7] sdd[6] sdc[5] sdf[4]
      5860538880 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [_UUU]
      [>....................]  recovery =  3.4% (66457728/1953512960) finish=1237.1min speed=25421K/sec
Once recovery is done (in 1237 minutes...), make sure that it succeeded: check /proc/mdstat, as always, run smartctl -a /dev/sde (and on any other disk involved), and check dmesg for errors.
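mdadm also reports the rebuild progress directly, which I find easier to glance at than the mdstat bar; a sketch, assuming md5:
# Shows "Rebuild Status : N% complete" while syncing, gone once the array is clean
mdadm --detail /dev/md5 | grep -e "Rebuild Status" -e "State :"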
Assuming all went well, repeat the steps to replace each disk in the array.
Depending on what your machine is supposed to be doing while all of this is in progress, you can adjust the minimum and maximum sync speed by changing the values of:
/sys/block/md5/md/sync_speed_max
/sys/block/md5/md/sync_speed_min
For example, with:
# cat /sys/block/md5/md/sync_speed
24873
# echo 50000 > /sys/block/md5/md/sync_speed_max
# echo 40000 > /sys/block/md5/md/sync_speed_min
# cat /sys/block/md5/md/sync_speed
26988
# cat /sys/block/md5/md/sync_speed
31840
# cat /sys/block/md5/md/sync_speed
37234
# cat /sys/block/md5/md/sync_speed
40024
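The same throttles also exist as system-wide sysctls, in case you want a default that applies to every md array rather than just md5; values are in KB/s, like above:
sysctl -w dev.raid.speed_limit_min=40000
sysctl -w dev.raid.speed_limit_max=50000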
Repeat this same process for each disk. For a week, I pretty much replaced one disk every time I came home from work. All had been replaced by the end of the week.
Resizing the raid was easier than I expected. In short:
mdadm --grow /dev/md5 -z max - to resize the array to the maximum supported by the underlying partitions / physical disks.
Wait for the resize process to be completed. I used something like mdadm -D /dev/md5 | grep -e "Array Size" -e "Dev Size" to see the delta shrinking, and the usual cat /proc/mdstat.
pvresize /dev/md5 - to instruct LVM that the physical volume is now larger, so its size can be adjusted to match.
Once the resize is complete, vgs should show the additional space available in the volume group. Now you can resize any logical volume to have as much space as you need.
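As an example of that last step, growing an unencrypted ext4 logical volume would look roughly like the sketch below; vg0/data is a made up name, not my setup, and the LUKS variant is covered by the instructions mentioned next:
# Hypothetical volume group vg0 with logical volume data, ext4 on top
lvextend -l +100%FREE /dev/vg0/data
resize2fs /dev/vg0/data    # ext4 can be grown online while mounted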
If you need instructions on how to resize a logical volume encrypted with LUKS, I wrote about it a few years ago. You can read them all here; they still worked like a charm this time around.
The whole process was surprisingly painless. All worked like a charm, and by the end of it I had an array with 24 TB of total space available, with close to zero downtime.