How to replace a failed software RAID device

Last week one of the hard disks in my server failed unexpectedly.
The disk was part of a software RAID mirror, which gave me an opportunity to describe the whole replacement process.

Identify failed device

First, inspect the dmesg output for potential issues.

$ dmesg | grep raid
[ 6782.323751] raid1: Disk failure on sdb1, disabling device.
[ 6782.323753] raid1: Operation continuing on 1 devices.

It looks like there is an issue with the sdb1 device, so let’s inspect the /proc/mdstat file to verify the status of the software RAID arrays.

$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md6 : active raid1 sdb8[1] sda8[0]
      48829440 blocks [2/2] [UU]
md7 : active raid1 sdb9[1] sda9[0]
      25101440 blocks [2/2] [UU]
md1 : active raid1 sdb2[1] sda2[0]
      3903680 blocks [2/2] [UU]
md2 : active raid1 sdb3[1] sda3[0]
      29294400 blocks [2/2] [UU]
md5 : active raid1 sdb7[1] sda7[0]
      9767424 blocks [2/2] [UU]
md3 : active raid1 sdb5[1] sda5[0]
      341798784 blocks [2/2] [UU]
md4 : active raid1 sdb6[1] sda6[0]
      29294400 blocks [2/2] [UU]
md0 : active raid1 sdb1[1](F) sda1[0]
      393472 blocks [2/2] [U_]
unused devices: <none>
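
With many arrays it is easy to overlook the one degraded entry. The following is a minimal sketch that filters the same file for arrays whose member map contains an underscore. The function name degraded_arrays is my own, and it assumes arrays are named mdN as in the output above.

```shell
# Print the name of every md array whose member map (for example [U_])
# contains an underscore, which marks a missing or failed member.
# Reads /proc/mdstat by default; pass another file, or - for stdin.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }            # remember the current array
        /blocks/ && $NF ~ /_/ { print name }   # status line with a gap
    ' "${1:-/proc/mdstat}"
}
```

Calling degraded_arrays without arguments on the output shown above should print md0 only.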

The sdb1 device belongs to the md0 array, so let’s examine it further.

$ sudo mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90
  Creation Time : Wed Feb 17 12:35:33 2010
     Raid Level : raid1
     Array Size : 393472 (384.31 MiB 402.92 MB)
  Used Dev Size : 393472 (384.31 MiB 402.92 MB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent
    Update Time : Thu May  7 11:20:35 2015
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0
           UUID : 834bb5fd:544081d3:a1ae249f:9be67250
         Events : 89
    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       0        0        1      removed
       1       8       17        -      faulty /dev/sdb1

In short, sdb needs to be replaced. You can investigate further by analyzing the dmesg output and using smartmontools.
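
As a rough illustration of the smartmontools check, the sketch below wraps two smartctl calls in a helper. The name check_disk_health is hypothetical, and the three attribute names assume an ATA disk that reports them under the usual smartmontools labels; other drives may use different names.

```shell
# Hypothetical helper: summarize SMART health for one disk.
# Assumes smartmontools is installed and the disk supports SMART.
check_disk_health() {
    disk=$1
    # -H prints the overall health self-assessment (PASSED or FAILED)
    sudo smartctl -H "$disk"
    # -A prints the vendor attribute table; these three counters are
    # commonly watched because non-zero raw values suggest bad sectors
    sudo smartctl -A "$disk" |
        grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'
}
```

For the disk in this post the call would be check_disk_health /dev/sdb.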

Remove failed device

Remove the failed device from the RAID array.

$ sudo mdadm --manage /dev/md0 --remove /dev/sdb1

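Remove the remaining sdb partitions from the other arrays. They are still active members, so mark each one as failed before removing it.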

$ sudo mdadm --manage /dev/md1 --fail   /dev/sdb2
$ sudo mdadm --manage /dev/md1 --remove /dev/sdb2
$ sudo mdadm --manage /dev/md2 --fail   /dev/sdb3
$ sudo mdadm --manage /dev/md2 --remove /dev/sdb3
$ sudo mdadm --manage /dev/md3 --fail   /dev/sdb5
$ sudo mdadm --manage /dev/md3 --remove /dev/sdb5
$ sudo mdadm --manage /dev/md4 --fail   /dev/sdb6
$ sudo mdadm --manage /dev/md4 --remove /dev/sdb6
$ sudo mdadm --manage /dev/md5 --fail   /dev/sdb7
$ sudo mdadm --manage /dev/md5 --remove /dev/sdb7
$ sudo mdadm --manage /dev/md6 --fail   /dev/sdb8
$ sudo mdadm --manage /dev/md6 --remove /dev/sdb8
$ sudo mdadm --manage /dev/md7 --fail   /dev/sdb9
$ sudo mdadm --manage /dev/md7 --remove /dev/sdb9
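
Typing these pairs by hand is error-prone, so here is a sketch that derives them from /proc/mdstat. It only prints the commands and does not run them. The function name print_removal_commands is my own, and it assumes sdX-style disk names and mdN array names as used in this post.

```shell
# Print (not run) the mdadm commands that detach every partition of one
# disk from its array. Members already flagged (F) only need --remove.
# Usage: print_removal_commands sdb [mdstat-file]
print_removal_commands() {
    awk -v disk="$1" '
        /^md[0-9]+ :/ {
            for (i = 3; i <= NF; i++) {
                member = $i
                failed = (member ~ /\(F\)/)
                sub(/\[.*/, "", member)        # sdb2[1](F) -> sdb2
                if (member !~ ("^" disk "[0-9]+$")) continue
                if (!failed)
                    printf "mdadm --manage /dev/%s --fail /dev/%s\n", $1, member
                printf "mdadm --manage /dev/%s --remove /dev/%s\n", $1, member
            }
        }
    ' "${2:-/proc/mdstat}"
}
```

Review the printed list first. Once it looks right, it can be piped to sudo sh to execute it.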

Replace failed device

Physically replace the hard disk and boot the system again.

Copy partition table

Copy the partition table from the healthy disk (sda) to the new hard disk (sdb). Both sfdisk invocations need root privileges.

$ sudo sfdisk -d /dev/sda | sudo sfdisk /dev/sdb

Please read the "How to backup DOS-type partition table/GPT and LVM meta-data" blog post for more information.

Reinstall GRUB

Install GRUB on the new hard disk, so that the system can still boot if the other disk fails.

$ sudo grub-install /dev/sdb

Recreate RAID arrays

The new hard disk is already partitioned in the same way as the first one, so let’s add its partitions back to the existing RAID1 arrays.

$ sudo mdadm --manage /dev/md0 --add /dev/sdb1
$ sudo mdadm --manage /dev/md1 --add /dev/sdb2
$ sudo mdadm --manage /dev/md2 --add /dev/sdb3
$ sudo mdadm --manage /dev/md3 --add /dev/sdb5
$ sudo mdadm --manage /dev/md4 --add /dev/sdb6
$ sudo mdadm --manage /dev/md5 --add /dev/sdb7
$ sudo mdadm --manage /dev/md6 --add /dev/sdb8
$ sudo mdadm --manage /dev/md7 --add /dev/sdb9
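
The same list can be generated from /proc/mdstat by pairing each partition of the surviving disk with the same-numbered partition on the replacement. This is only a sketch: print_add_commands is a hypothetical name, it prints the commands without running them, and it assumes both disks share an identical partition layout and sdX-style names.

```shell
# Print (not run) one "mdadm --add" command per array, pairing each
# partition of the surviving disk with the same-numbered partition on
# the replacement disk.
# Usage: print_add_commands sda sdb [mdstat-file]
print_add_commands() {
    awk -v src="$1" -v dst="$2" '
        /^md[0-9]+ :/ {
            for (i = 3; i <= NF; i++) {
                member = $i
                sub(/\[.*/, "", member)        # sda2[0] -> sda2
                if (member !~ ("^" src "[0-9]+$")) continue
                number = substr(member, length(src) + 1)
                printf "mdadm --manage /dev/%s --add /dev/%s%s\n", $1, dst, number
            }
        }
    ' "${3:-/proc/mdstat}"
}
```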

Verify and speed up the recovery process

Look at the /proc/mdstat file to verify the recovery process.

$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid1 sdb2[0] sda2[1]
      3903680 blocks [2/2] [UU]
md3 : active raid1 sdb5[2] sda5[1]
      341798784 blocks [2/1] [_U]
        resync=DELAYED
md2 : active raid1 sdb3[2] sda3[1]
      29294400 blocks [2/1] [_U]
        resync=DELAYED
md0 : active raid1 sdb1[0] sda1[1]
      393472 blocks [2/2] [UU]
md4 : active raid1 sdb6[2] sda6[1]
      29294400 blocks [2/1] [_U]
      [=>...................]  recovery =  5.2% (1546176/29294400) finish=5.6min speed=81377K/sec
md7 : active raid1 sdb9[2] sda9[1]
      25101440 blocks [2/1] [_U]
        resync=DELAYED
md5 : active raid1 sdb7[2] sda7[1]
      9767424 blocks [2/1] [_U]
        resync=DELAYED
md6 : active raid1 sdb8[2] sda8[1]
      48829440 blocks [2/1] [_U]
        resync=DELAYED
unused devices: <none>
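
To follow only the array that is currently rebuilding, the progress line can be extracted with a small filter. This is a sketch: the function name rebuild_progress is my own, and it assumes the progress line has the layout shown above.

```shell
# Print one line per array that is currently rebuilding, for example
# "md4 5.2% finish=5.6min". Arrays waiting with resync=DELAYED are skipped.
# Reads /proc/mdstat unless a file (or - for stdin) is given.
rebuild_progress() {
    awk '
        /^md[0-9]+ :/ { name = $1 }
        /(recovery|resync) *= *[0-9.]+%/ {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /%$/)       percent = $i
                if ($i ~ /^finish=/) eta = $i
            }
            print name, percent, eta
        }
    ' "${1:-/proc/mdstat}"
}
```

For a continuously refreshing view of the whole file, watch -n 5 cat /proc/mdstat is a simple alternative.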

Increase the minimum rebuild speed, which is the rate md aims for even while the array is in use.

$ sudo sysctl dev.raid.speed_limit_min=$(expr $(sysctl -n dev.raid.speed_limit_min) \* 2)

Increase the maximum rebuild speed, which applies while the array is otherwise idle.

$ sudo sysctl dev.raid.speed_limit_max=$(expr $(sysctl -n dev.raid.speed_limit_max) \* 2)

These commands double the current values, which are the defaults unless you have changed them before. Please note that these changes will only last until the next reboot.

Alternatively, you can write the new values directly to the files under /proc/sys/dev/raid/ to achieve the same goal.
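
A minimal sketch of the proc-based variant follows. The helper name double_raid_limit is hypothetical; the files speed_limit_min and speed_limit_max under /proc/sys/dev/raid/ hold the same values as the sysctl keys used above, expressed in KiB per second per device.

```shell
# Double the value stored in one md speed-limit file.
# Usage: double_raid_limit /proc/sys/dev/raid/speed_limit_min
double_raid_limit() {
    file=$1
    current=$(cat "$file")
    # tee performs the write with root privileges; a plain
    # "sudo echo ... > file" would fail because the redirection
    # is carried out by the calling, unprivileged shell.
    echo $((current * 2)) | sudo tee "$file" > /dev/null
}
```

Run it once for speed_limit_min and once for speed_limit_max to mirror the two sysctl commands.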

Additional notes

This technique can also be used to create software RAID on a running system. You can also adapt it to replace disks with larger ones.

References

Please read the md and mdadm manual pages. Additional documentation is available in the /usr/share/doc/mdadm/ directory.