How to replace failed software RAID device

I had an unexpected incident last week when one of the hard disks in my server failed. The device was part of a software RAID mirror, which created an opportunity to describe the whole replacement process.

Identify failed device

First, inspect the dmesg output for potential issues.

$ dmesg | grep raid
[ 6782.323751] raid1: Disk failure on sdb1, disabling device.
[ 6782.323753] raid1: Operation continuing on 1 devices.

It looks like there is an issue with the sdb1 device, so let's inspect the /proc/mdstat file to verify the status of the software RAID arrays.

$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md6 : active raid1 sdb8[1] sda8[0]
      48829440 blocks [2/2] [UU]
md7 : active raid1 sdb9[1] sda9[0]
      25101440 blocks [2/2] [UU]
md1 : active raid1 sdb2[1] sda2[0]
      3903680 blocks [2/2] [UU]
md2 : active raid1 sdb3[1] sda3[0]
      29294400 blocks [2/2] [UU]
md5 : active raid1 sdb7[1] sda7[0]
      9767424 blocks [2/2] [UU]
md3 : active raid1 sdb5[1] sda5[0]
      341798784 blocks [2/2] [UU]
md4 : active raid1 sdb6[1] sda6[0]
      29294400 blocks [2/2] [UU]
md0 : active raid1 sdb1[1](F) sda1[0]
      393472 blocks [2/1] [U_]
unused devices: <none>
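If the listing is long, failed members can be picked out by scanning for the (F) marker. A minimal sketch (the file argument exists only so the function can be pointed at a saved copy of /proc/mdstat; by default it reads the live file):

```shell
# list_failed: print "array: failed member device" for every member
# flagged with "(F)" in an mdstat-format file.
list_failed() {
    awk '/^md/ {
        for (i = 5; i <= NF; i++)
            if ($i ~ /\(F\)/) {
                member = $i
                sub(/\[.*/, "", member)   # strip the "[1](F)" suffix
                print $1 ": failed member " member
            }
    }' "${1:-/proc/mdstat}"
}
# list_failed            # on a live system
```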

The sdb1 device belongs to the md0 array, so let's examine it further.

$ sudo mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90
  Creation Time : Wed Feb 17 12:35:33 2010
     Raid Level : raid1
     Array Size : 393472 (384.31 MiB 402.92 MB)
  Used Dev Size : 393472 (384.31 MiB 402.92 MB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Thu May  7 11:20:35 2015
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

           UUID : 834bb5fd:544081d3:a1ae249f:9be67250
         Events : 89

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       0        0        1      removed
       1       8       17        -      faulty /dev/sdb1

The simple conclusion is that the sdb disk needs to be replaced. You can investigate further by analyzing the dmesg output and using smartmontools.

Remove failed device

Remove the failed device from the RAID array.

$ sudo mdadm --manage /dev/md0 --remove /dev/sdb1

Mark every other partition on the sdb device as failed to take it off-line and safely remove it from the remaining RAID arrays.

$ sudo mdadm --manage /dev/md1 --fail   /dev/sdb2
$ sudo mdadm --manage /dev/md1 --remove /dev/sdb2
$ sudo mdadm --manage /dev/md2 --fail   /dev/sdb3
$ sudo mdadm --manage /dev/md2 --remove /dev/sdb3
$ sudo mdadm --manage /dev/md3 --fail   /dev/sdb5
$ sudo mdadm --manage /dev/md3 --remove /dev/sdb5
$ sudo mdadm --manage /dev/md4 --fail   /dev/sdb6
$ sudo mdadm --manage /dev/md4 --remove /dev/sdb6
$ sudo mdadm --manage /dev/md5 --fail   /dev/sdb7
$ sudo mdadm --manage /dev/md5 --remove /dev/sdb7
$ sudo mdadm --manage /dev/md6 --fail   /dev/sdb8
$ sudo mdadm --manage /dev/md6 --remove /dev/sdb8
$ sudo mdadm --manage /dev/md7 --fail   /dev/sdb9
$ sudo mdadm --manage /dev/md7 --remove /dev/sdb9

Replace failed device

Physically replace the hard disk and boot the system again.

Copy partition table

Copy the partition table from the healthy disk to the new one.

$ sudo sfdisk -d /dev/sda | sudo sfdisk /dev/sdb

Please read the How to backup DOS-type partition table/GPT and LVM meta-data blog post for more information.
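For reference, the dump produced by sfdisk -d is plain text, so it can also be saved to a file as a backup and replayed later. An illustrative example (device names and sizes are made up, not taken from the system above; partition type fd is Linux raid autodetect):

```
# sudo sfdisk -d /dev/sda > sda.partition-table
label: dos
device: /dev/sda
unit: sectors

/dev/sda1 : start=2048, size=786944, type=fd, bootable
/dev/sda2 : start=788992, size=7807360, type=fd
```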

Reinstall GRUB

Install GRUB on the new hard disk.

$ sudo grub-install /dev/sdb

Recreate RAID arrays

The new hard disk is now partitioned in the same way as the first one, so let's rebuild the existing RAID1 arrays by re-adding its partitions.

$ sudo mdadm --manage /dev/md0 --add /dev/sdb1
$ sudo mdadm --manage /dev/md1 --add /dev/sdb2
$ sudo mdadm --manage /dev/md2 --add /dev/sdb3
$ sudo mdadm --manage /dev/md3 --add /dev/sdb5
$ sudo mdadm --manage /dev/md4 --add /dev/sdb6
$ sudo mdadm --manage /dev/md5 --add /dev/sdb7
$ sudo mdadm --manage /dev/md6 --add /dev/sdb8
$ sudo mdadm --manage /dev/md7 --add /dev/sdb9

Verify and speed up recovery process

Look at the /proc/mdstat file to monitor the recovery process.

$ cat /proc/mdstat 
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md1 : active raid1 sdb2[0] sda2[1]
      3903680 blocks [2/2] [UU]
md3 : active raid1 sdb5[2] sda5[1]
      341798784 blocks [2/1] [_U]
md2 : active raid1 sdb3[2] sda3[1]
      29294400 blocks [2/1] [_U]
md0 : active raid1 sdb1[0] sda1[1]
      393472 blocks [2/2] [UU]
md4 : active raid1 sdb6[2] sda6[1]
      29294400 blocks [2/1] [_U]
      [=>...................]  recovery =  5.2% (1546176/29294400) finish=5.6min speed=81377K/sec
md7 : active raid1 sdb9[2] sda9[1]
      25101440 blocks [2/1] [_U]
md5 : active raid1 sdb7[2] sda7[1]
      9767424 blocks [2/1] [_U]
md6 : active raid1 sdb8[2] sda8[1]
      48829440 blocks [2/1] [_U]
unused devices: <none>
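Arrays rebuild one at a time, so while the others wait their turn, the one currently recovering can be picked out of the listing. A small sketch (again, the file argument is only there so a saved copy can be used instead of the live file):

```shell
# show_recovery: print the array currently rebuilding and its progress line.
show_recovery() {
    awk '/^md/ { array = $1 }
         /recovery/ {
             sub(/^[ \t]+/, "")         # trim leading whitespace
             print array ": " $0
         }' "${1:-/proc/mdstat}"
}
# show_recovery          # on a live system
```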

Increase the minimum rebuild speed that is guaranteed even when the array is otherwise in use.

$ sudo sysctl -w dev.raid.speed_limit_min=$(expr $(sysctl -n dev.raid.speed_limit_min) \* 2)

Increase the maximum rebuild speed used when the array is idle.

$ sudo sysctl -w dev.raid.speed_limit_max=$(expr $(sysctl -n dev.raid.speed_limit_max) \* 2)

I have increased these settings to twice their default values. Note that these changes will only last until the next reboot.

Alternatively, you can use the proc file-system to achieve the same goal.
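A sketch of the same doubling done through the proc file-system. The directory argument exists only for illustration and testing; on a live system the function must run as root against the real /proc/sys/dev/raid directory:

```shell
# double_speed_limits: double both rebuild speed limits (values in KiB/s)
# by rewriting the files under /proc/sys/dev/raid. Reverts at reboot.
double_speed_limits() {
    raid_proc=${1:-/proc/sys/dev/raid}
    for limit in speed_limit_min speed_limit_max; do
        cur=$(cat "$raid_proc/$limit")
        echo $((cur * 2)) > "$raid_proc/$limit"    # needs root on the real path
        echo "$limit: $cur -> $(cat "$raid_proc/$limit")"
    done
}
# double_speed_limits    # on a live system, as root
```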

Additional notes

This technique can also be used to create a software RAID on a running system, or adapted to replace disks with larger ones.


Please read the md and mdadm manual pages. Additional documentation is available in the /usr/share/doc/mdadm/ directory.