How to replace a failed disk in a software RAID 1 array on Linux

How do you replace a failed hard disk using mdadm?

With RAID 1, RAID 5 and similar setups, a failed disk can normally be hot-swapped with little effort when the mirroring is done at the hardware level. The same operation is trickier on software RAID 1, where ideally the operating system would be shut down so that no application interferes with the disk swap.

The hpssacli rpm can be downloaded from the HPE website; this article assumes it has already been downloaded and installed on the blade server.

Note: hpssacli has recently been renamed to ssacli. Since an older build of hpssacli is installed here, the commands below use "hpssacli", but the same commands work with "ssacli".

Server setup used in this article:

  • HP ProLiant BL460c Gen9
  • Two internal disks, 900 GB each
  • Hardware RAID 0 configured as two arrays (one disk in each)
  • Software RAID 1 configured on top of these arrays

Correct disk mapping

Normally the hard disks map to the logical drives as follows:
Array A -> Logical Drive 1 (/dev/sda) -> Bay 1
Array B -> Logical Drive 2 (/dev/sdb) -> Bay 2

It is still a good idea to verify the mapping before starting the swap, to make sure the correct disk gets replaced.

# hpssacli ctrl slot=0 show config detail | grep -E 'Array:|Logical Drive:|Bay:|Disk'
   Array: A
      Logical Drive: 1
         Disk Name: /dev/sda          Mount Points: None
         Bay: 1
   Array: B
      Logical Drive: 2
         Disk Name: /dev/sdb          Mount Points: None
         Bay: 2
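
The same check can be condensed into one mapping line per array. A minimal sketch, with the awk program being my own illustration and a captured sample standing in for the live command; on a real system, pipe the output of "hpssacli ctrl slot=0 show config detail" in instead:

```shell
# Condense the config detail into "Array -> device -> Bay" lines.
# The here-doc below is a captured sample of the hpssacli output.
mapping=$(awk '
    /Array:/     { array = $2 }
    /Disk Name:/ { disk  = $3 }
    /Bay:/       { print "Array " array " -> " disk " -> Bay " $2 }
' <<'EOF'
   Array: A
      Logical Drive: 1
         Disk Name: /dev/sda          Mount Points: None
         Bay: 1
   Array: B
      Logical Drive: 2
         Disk Name: /dev/sdb          Mount Points: None
         Bay: 2
EOF
)
echo "$mapping"
```

If a line such as "Array A -> /dev/sda -> Bay 2" comes back, the mapping is reversed and the bay numbers must be trusted over the device names.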

Reversed disk mapping

Array A -> Logical Drive 1 (/dev/sda) -> Bay 2
Array B -> Logical Drive 2 (/dev/sdb) -> Bay 1

In that case, the output looks as follows:

# hpssacli ctrl slot=0 show config detail | grep -E 'Array:|Logical Drive:|Bay:|Disk'
   Array: A
      Logical Drive: 1
         Disk Name: /dev/sda          Mount Points: None
         Bay: 2
   Array: B
      Logical Drive: 2
         Disk Name: /dev/sdb          Mount Points: None
         Bay: 1

How do you check whether a hard disk has failed?

There are several places (logs) that provide enough evidence to identify the failed disk.
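
Besides the hardware logs, the kernel's own view in /proc/mdstat flags failed members with "(F)". A small sketch for pulling those out, shown here against a captured sample; on a live system, feed in /proc/mdstat itself:

```shell
# List RAID members flagged (F) (failed) per md device. The here-doc
# is a captured /proc/mdstat sample; on a live system use:
#   awk '...' /proc/mdstat
failed=$(awk '/^md/ { for (i = 5; i <= NF; i++)
                          if ($i ~ /\(F\)$/) print $1 ": " $i }' <<'EOF'
md2 : active raid1 sda8[0](F) sdb8[1]
md0 : active raid1 sda5[0](F) sdb5[1]
EOF
)
echo "$failed"
```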

The iLO log contains messages like the following.

Right-hand disk:
Internal Storage Enclosure Device Failure (Bay 1, Box 1, Port 1I, Slot 0)

Left-hand disk:
Internal Storage Enclosure Device Failure (Bay 2, Box 1, Port 1I, Slot 0)

The operating system's syslog should contain messages like the ones below (assuming the hp-ams tools are installed, since they report all hardware-related alerts).

Right-hand disk:

Aug 27 07:27:31 mylinux hp-ams[12332]: CRITICAL: Internal Storage Enclosure Device Failure (Bay 1, Box 1, Port 1I, Slot 0)

Left-hand disk:

Aug 27 21:36:29 mylinux hp-ams[12854]: CRITICAL: Internal Storage Enclosure Device Failure (Bay 2, Box 1, Port 1I, Slot 0)

The logical drive status can also be checked with the following command.

Logical drive 1 failed:

my-linux-box: # hpssacli ctrl slot=0 ld all show status
   logicaldrive 1 (838.3 GB, 0): Failed
   logicaldrive 2 (838.3 GB, 0): OK

Logical drive 2 failed:

my-linux-box: # hpssacli ctrl slot=0 ld all show status
   logicaldrive 1 (838.3 GB, 0): OK
   logicaldrive 2 (838.3 GB, 0): Failed

Replacing Logical Drive 1 (/dev/sda)

Check the RAID status

Next, re-verify the RAID status:

# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid1 sda8[0](F) sdb8[1]
      870112064 blocks super 1.0 [2/1] [_U]
      bitmap: 3/7 pages [12KB], 65536KB chunk
md0 : active raid1 sda5[0](F) sdb5[1]
      529600 blocks super 1.0 [2/1] [_U]
      bitmap: 1/1 pages [4KB], 65536KB chunk
md3 : active raid1 sda7[0](F) sdb7[1]
      4200640 blocks super 1.0 [2/1] [_U]
      bitmap: 0/1 pages [0KB], 65536KB chunk
md1 : active raid1 sda6[0](F) sdb6[1]
      4200640 blocks super 1.0 [2/1] [_U]
      bitmap: 1/1 pages [4KB], 65536KB chunk
unused devices: <none>

Now remove the failed RAID partitions:

my-linux-box:~ # mdadm /dev/md0 --remove /dev/sda5
mdadm: hot removed /dev/sda5 from /dev/md0
my-linux-box:~ # mdadm /dev/md1 --remove /dev/sda6
mdadm: hot removed /dev/sda6 from /dev/md1
my-linux-box:~ # mdadm /dev/md3 --remove /dev/sda7
mdadm: hot removed /dev/sda7 from /dev/md3
my-linux-box:~ # mdadm /dev/md2 --remove /dev/sda8
mdadm: hot removed /dev/sda8 from /dev/md2
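
The four removals above follow one pattern, so they can also be scripted. A dry-run sketch using this article's md-to-partition layout; the echo prints the commands instead of running them, so drop it to actually execute:

```shell
# Dry run of the hot-removals above; remove the leading "echo"
# inside the loop to execute for real. The md-to-partition mapping
# is the one used in this article.
cmds=$(for pair in md0:sda5 md1:sda6 md3:sda7 md2:sda8; do
    md=${pair%%:*}
    part=${pair##*:}
    echo "mdadm /dev/$md --remove /dev/$part"
done)
echo "$cmds"
```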

Next, check the RAID status again to verify that all failed partitions have been removed:

# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid1 sdb8[1]
      870112064 blocks super 1.0 [2/1] [_U]
      bitmap: 3/7 pages [12KB], 65536KB chunk
md0 : active raid1 sdb5[1]
      529600 blocks super 1.0 [2/1] [_U]
      bitmap: 1/1 pages [4KB], 65536KB chunk
md3 : active raid1 sdb7[1]
      4200640 blocks super 1.0 [2/1] [_U]
      bitmap: 0/1 pages [0KB], 65536KB chunk
md1 : active raid1 sdb6[1]
      4200640 blocks super 1.0 [2/1] [_U]
      bitmap: 1/1 pages [4KB], 65536KB chunk
unused devices: <none>

Replace the failed disk with the new one; the syslog should then contain a message similar to the following:

Aug 18 15:53:12 my-linux-box kernel: [ 8365.422069] hpsa 0000:03:00.0: added scsi 0:2:0:0: Direct-Access     HP       EG0900FBVFQ      RAID-UNKNOWN SSDSmartPathCap- En- Exp=2 qd=30
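
One way to confirm the controller registered the new disk is to filter the kernel log for the hpsa driver's "added scsi" message. A sketch against the captured line above; on a live system, use the output of dmesg in place of the sample:

```shell
# Count hpsa "added scsi" messages; the sample variable stands in
# for "dmesg" output on a live system.
sample='[ 8365.422069] hpsa 0000:03:00.0: added scsi 0:2:0:0: Direct-Access     HP       EG0900FBVFQ'
added=$(echo "$sample" | grep -c 'hpsa .*: added scsi')
echo "new disks seen by hpsa: $added"
```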

Re-enable the logical drive with hpssacli

Re-enable the logical drive, then verify that its status now reports "OK".

my-linux-box: # hpssacli ctrl slot=0 ld 1 modify reenable forced
my-linux-box:# hpssacli ctrl slot=0 ld all show status
   logicaldrive 1 (838.3 GB, 0): OK
   logicaldrive 2 (838.3 GB, 0): OK

The sdaX partitions are now missing from the RAID, as expected.

# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid1 sdb8[1]
      870112064 blocks super 1.0 [2/1] [_U]
      bitmap: 5/7 pages [20KB], 65536KB chunk
md0 : active raid1 sdb5[1]
      529600 blocks super 1.0 [2/1] [_U]
      bitmap: 1/1 pages [4KB], 65536KB chunk
md3 : active raid1 sdb7[1]
      4200640 blocks super 1.0 [2/1] [_U]
      bitmap: 0/1 pages [0KB], 65536KB chunk
md1 : active raid1 sdb6[1]
      4200640 blocks super 1.0 [2/1] [_U]
      bitmap: 1/1 pages [4KB], 65536KB chunk
unused devices: <none>

Now copy the partition table from sdb to sda.

my-linux-box:~ # sfdisk -d /dev/sdb | grep -v ten | sfdisk /dev/sda --force --no-reread
Checking that no-one is using this disk right now ...
Warning: extended partition does not start at a cylinder boundary.
DOS and Linux will interpret the contents differently.
OK
Disk /dev/sda: 109437 cylinders, 255 heads, 63 sectors/track
Warning: extended partition does not start at a cylinder boundary.
DOS and Linux will interpret the contents differently.
Old situation:
Units = cylinders of 8225280 bytes, blocks of 1024 bytes, counting from 0
   Device Boot Start     End   #cyls    #blocks   Id  System
/dev/sda1   *      0+ 109437- 109438- 879054336    f  W95 Ext'd (LBA)
/dev/sda2          0       -       0          0    0  Empty
/dev/sda3          0       -       0          0    0  Empty
/dev/sda4          0       -       0          0    0  Empty
/dev/sda5          0+     66-     66-    529664   fd  Linux raid autodetect
/dev/sda6         66+    588-    523-   4200704   fd  Linux raid autodetect
/dev/sda7        589+   1111-    523-   4200704   fd  Linux raid autodetect
/dev/sda8       1112+ 109435- 108324- 870112256   fd  Linux raid autodetect
New situation:
Units = sectors of 512 bytes, counting from 0
   Device Boot    Start       End   #sectors  Id  System
/dev/sda1   *       512 1758109183 1758108672   f  W95 Ext'd (LBA)
/dev/sda2             0         -          0   0  Empty
/dev/sda3             0         -          0   0  Empty
/dev/sda4             0         -          0   0  Empty
/dev/sda5          1024   1060351    1059328  fd  Linux raid autodetect
/dev/sda6       1060864   9462271    8401408  fd  Linux raid autodetect
/dev/sda7       9462784  17864191    8401408  fd  Linux raid autodetect
/dev/sda8      17864704 1758089215 1740224512  fd  Linux raid autodetect
Warning: partition 1 does not end at a cylinder boundary
Successfully wrote the new partition table
Re-reading the partition table ...
If you created or changed a DOS partition, /dev/foo7, say, then use dd(1)
to zero the first 512 bytes:  dd if=/dev/zero of=/dev/foo7 bs=512 count=1
(See fdisk(8).)
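
After the copy, a quick sanity check is that both disks report the same layout once the device names are normalised. A sketch of the idea, with captured one-line samples standing in for the real "sfdisk -d /dev/sda" and "sfdisk -d /dev/sdb" dumps:

```shell
# Compare the partition layouts of both disks with the device names
# rewritten to a neutral name; the two variables stand in for the
# live "sfdisk -d" dumps.
dump_sda='/dev/sda5 : start= 1024, size= 1059328, Id=fd'
dump_sdb='/dev/sdb5 : start= 1024, size= 1059328, Id=fd'
if [ "$(echo "$dump_sda" | sed 's#/dev/sda#/dev/sdX#')" = \
     "$(echo "$dump_sdb" | sed 's#/dev/sdb#/dev/sdX#')" ]; then
    result="partition tables match"
else
    result="partition tables differ"
fi
echo "$result"
```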

Erase possible RAID config data (from a reused disk)

Once this is done, it is important to erase any leftover old software RAID metadata from the newly attached disk before re-adding it to the RAID.

my-linux-box:~ # mdadm --zero-superblock /dev/sda5
my-linux-box:~ # mdadm --zero-superblock /dev/sda6
my-linux-box:~ # mdadm --zero-superblock /dev/sda7
my-linux-box:~ # mdadm --zero-superblock /dev/sda8

After that, the partitions can be added back to the software RAID.

my-linux-box:~ # mdadm /dev/md0 --add /dev/sda5
mdadm: added /dev/sda5
my-linux-box:~ # mdadm /dev/md1 --add /dev/sda6
mdadm: added /dev/sda6
my-linux-box:~ # mdadm /dev/md3 --add /dev/sda7
mdadm: added /dev/sda7
my-linux-box:~ # mdadm /dev/md2 --add /dev/sda8
mdadm: added /dev/sda8

Note: add the RAID partitions one at a time, and only add the next one once the partition added last shows [UU].
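
That wait can be automated with a small polling loop. A sketch, where the function name is my own and the mdstat path is a parameter so the function can be exercised against a captured sample; on a live system it defaults to /proc/mdstat:

```shell
# Poll mdstat until the given array shows [UU] (both members up).
# Intended use between the mdadm --add commands above, e.g.:
#   wait_synced md0
wait_synced() {
    md=$1
    mdstat=${2:-/proc/mdstat}
    until grep -A1 "^$md :" "$mdstat" | grep -q '\[UU\]'; do
        sleep 10
    done
}
```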

How do you install GRUB on the disks?

Once md0 has synchronised, GRUB should be installed again on both disks by invoking the GRUB shell.

Finally, running grub-install should install GRUB on both disks (hd0 and hd1) without any error messages.

# grub-install
    GNU GRUB  version 0.97  (640K lower/3072K upper memory)
 [ Minimal BASH-like line editing is supported.  For the first word, TAB
   lists possible command completions.  Anywhere else TAB lists the possible
   completions of a device/filename. ]
grub> setup --stage2=/boot/grub/stage2 --force-lba (hd0) (hd0,4)
 Checking if "/boot/grub/stage1" exists... yes
 Checking if "/boot/grub/stage2" exists... yes
 Checking if "/boot/grub/e2fs_stage1_5" exists... yes
 Running "embed /boot/grub/e2fs_stage1_5 (hd0)"...  17 sectors are embedded.
succeeded
 Running "install --force-lba --stage2=/boot/grub/stage2 /boot/grub/stage1 (hd0) (hd0)1+17 p (hd0,4)/boot/grub/stage2 /boot/grub/menu.lst"... succeeded
Done.
grub> setup --stage2=/boot/grub/stage2 --force-lba (hd1) (hd1,4)
 Checking if "/boot/grub/stage1" exists... yes
 Checking if "/boot/grub/stage2" exists... yes
 Checking if "/boot/grub/e2fs_stage1_5" exists... yes
 Running "embed /boot/grub/e2fs_stage1_5 (hd1)"...  17 sectors are embedded.
succeeded
 Running "install --force-lba --stage2=/boot/grub/stage2 /boot/grub/stage1 (hd1) (hd1)1+17 p (hd1,4)/boot/grub/stage2 /boot/grub/menu.lst"... succeeded
Done.
grub> quit

Finally, verify the RAID status:

# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid1 sda8[2] sdb8[1]
      870112064 blocks super 1.0 [2/2] [UU]
      bitmap: 6/7 pages [24KB], 65536KB chunk
md0 : active raid1 sda5[2] sdb5[1]
      529600 blocks super 1.0 [2/2] [UU]
      bitmap: 1/1 pages [4KB], 65536KB chunk
md3 : active raid1 sda7[2] sdb7[1]
      4200640 blocks super 1.0 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk
md1 : active raid1 sda6[2] sdb6[1]
      4200640 blocks super 1.0 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk
unused devices: <none>

A disk replacement for the second logical drive can be carried out in the same way.

Date: 2020-06-02 22:17:02  Source: oir  Author: oir