How to replace a failed hard disk using mdadm?
With hardware RAID levels such as RAID 1 and RAID 5, a failed disk can simply be hot-swapped, since the mirroring is handled at the hardware level. Performing the same operation on a software RAID 1 is trickier, because ideally the operating system would have to be shut down so that no application I/O interferes with the disk swap.
The hpssacli rpm can be downloaded from the HPE website; for this article, it is assumed that it has already been downloaded and installed on the blade server.
Note: hpssacli has recently been renamed to ssacli. Since an older hpssacli build is installed here, the commands below use "hpssacli", but the same commands work with "ssacli" as well.
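For reference, the status check used later in this article would look as follows with the newer tool; this is a sketch assuming the subcommand syntax is unchanged between hpssacli and ssacli (which holds for the commands used here):

# ssacli ctrl slot=0 ld all show status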
Server setup used in this article:
- HP ProLiant BL460c Gen9
- Two internal disks, 900 GB each
- Hardware RAID 0 configured as two arrays (one disk each)
- Software RAID 1 configured on top of these arrays
Correct disk mapping
Usually the mapping of hard disks to logical drives looks like this:
Array A -> Logical Drive 1 (/dev/sda) -> Bay 1
Array B -> Logical Drive 2 (/dev/sdb) -> Bay 2
Still, it is good practice to verify the mapping before starting the disk swap, to make sure the correct disk gets replaced.
# hpssacli ctrl slot=0 show config detail | grep -E 'Array:|Logical Drive:|Bay:|Disk'

   Array: A
      Logical Drive: 1
         Disk Name: /dev/sda
         Mount Points: None
      Bay: 1
   Array: B
      Logical Drive: 2
         Disk Name: /dev/sdb
         Mount Points: None
      Bay: 2
Reversed disk mapping
Array A -> Logical Drive 1 (/dev/sda) -> Bay 2
Array B -> Logical Drive 2 (/dev/sdb) -> Bay 1
In that case the output looks like this:
# hpssacli ctrl slot=0 show config detail | grep -E 'Array:|Logical Drive:|Bay:|Disk'

   Array: A
      Logical Drive: 1
         Disk Name: /dev/sda
         Mount Points: None
      Bay: 2
   Array: B
      Logical Drive: 2
         Disk Name: /dev/sdb
         Mount Points: None
      Bay: 1
How to check whether a hard disk has failed?
There are several places (logs) that provide enough evidence to gather more details about the failed disk.
In the iLO logs, messages like the following are available.
Right-hand disk:
Internal Storage Enclosure Device Failure (Bay 1, Box 1, Port 1I, Slot 0)
Left-hand disk:
Internal Storage Enclosure Device Failure (Bay 2, Box 1, Port 1I, Slot 0)
The syslog of the operating system should contain messages like the ones below (assuming the hp-ams tools are installed, since they report all hardware-related alerts).
Right-hand disk:
Aug 27 07:27:31 mylinux hp-ams[12332]: CRITICAL: Internal Storage Enclosure Device Failure (Bay 1, Box 1, Port 1I, Slot 0)
Left-hand disk:
Aug 27 21:36:29 mylinux hp-ams[12854]: CRITICAL: Internal Storage Enclosure Device Failure (Bay 2, Box 1, Port 1I, Slot 0)
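To pin down when exactly a disk failed, the syslog can also be searched for hp-ams alerts directly; a minimal example, assuming a classic /var/log/messages setup:

# grep 'hp-ams' /var/log/messages | grep -i failure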
The logical drive status can also be checked with the following command.
Failure state of Logical Drive 1:
my-linux-box:~ # hpssacli ctrl slot=0 ld all show status

   logicaldrive 1 (838.3 GB, 0): Failed
   logicaldrive 2 (838.3 GB, 0): OK
Failure state of Logical Drive 2:
my-linux-box:~ # hpssacli ctrl slot=0 ld all show status

   logicaldrive 1 (838.3 GB, 0): OK
   logicaldrive 2 (838.3 GB, 0): Failed
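The physical drives behind the arrays can be queried as well, and the failed member should show up there too. A sketch using the standard hpssacli physical drive query (the drive addresses and output layout shown are illustrative for this setup):

# hpssacli ctrl slot=0 pd all show status

   physicaldrive 1I:1:1 (port 1I:box 1:bay 1, 900 GB): Failed
   physicaldrive 1I:1:2 (port 1I:box 1:bay 2, 900 GB): OK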
Replacing Logical Drive 1 (/dev/sda)
Check the RAID status
First, verify the current RAID state:
# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid1 sda8[0](F) sdb8[1]
      870112064 blocks super 1.0 [2/1] [_U]
      bitmap: 3/7 pages [12KB], 65536KB chunk

md0 : active raid1 sda5[0](F) sdb5[1]
      529600 blocks super 1.0 [2/1] [_U]
      bitmap: 1/1 pages [4KB], 65536KB chunk

md3 : active raid1 sda7[0](F) sdb7[1]
      4200640 blocks super 1.0 [2/1] [_U]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md1 : active raid1 sda6[0](F) sdb6[1]
      4200640 blocks super 1.0 [2/1] [_U]
      bitmap: 1/1 pages [4KB], 65536KB chunk

unused devices: <none>
Now remove the failed RAID partitions:
my-linux-box:~ # mdadm /dev/md0 --remove /dev/sda5
mdadm: hot removed /dev/sda5 from /dev/md0
my-linux-box:~ # mdadm /dev/md1 --remove /dev/sda6
mdadm: hot removed /dev/sda6 from /dev/md1
my-linux-box:~ # mdadm /dev/md3 --remove /dev/sda7
mdadm: hot removed /dev/sda7 from /dev/md3
my-linux-box:~ # mdadm /dev/md2 --remove /dev/sda8
mdadm: hot removed /dev/sda8 from /dev/md2
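Note that mdadm can only hot-remove members that are already marked as faulty; here the kernel had already flagged the sda partitions with (F) in /proc/mdstat. If a partition is not yet marked, it must first be failed manually, for example:

# mdadm /dev/md0 --fail /dev/sda5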
Next, check the RAID status again to verify that all failed partitions have been removed:
# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid1 sdb8[1]
      870112064 blocks super 1.0 [2/1] [_U]
      bitmap: 3/7 pages [12KB], 65536KB chunk

md0 : active raid1 sdb5[1]
      529600 blocks super 1.0 [2/1] [_U]
      bitmap: 1/1 pages [4KB], 65536KB chunk

md3 : active raid1 sdb7[1]
      4200640 blocks super 1.0 [2/1] [_U]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md1 : active raid1 sdb6[1]
      4200640 blocks super 1.0 [2/1] [_U]
      bitmap: 1/1 pages [4KB], 65536KB chunk

unused devices: <none>
Replace the failed disk with the new one. The syslog should then contain a message similar to this:
Aug 18 15:53:12 my-linux-box kernel: [ 8365.422069] hpsa 0000:03:00.0: added scsi 0:2:0:0: Direct-Access HP EG0900FBVFQ RAID-UNKNOWN SSDSmartPathCap- En- Exp=2 qd=30
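To catch this message the moment the new disk is inserted, the syslog can be followed live; this assumes messages are written to the classic /var/log/messages:

# tail -f /var/log/messages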
Re-enable the logical drive using hpssacli
After re-enabling the logical drive, verify that the reported status is back to "OK".
my-linux-box:~ # hpssacli ctrl slot=0 ld 1 modify reenable forced
my-linux-box:~ # hpssacli ctrl slot=0 ld all show status

   logicaldrive 1 (838.3 GB, 0): OK
   logicaldrive 2 (838.3 GB, 0): OK
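At this point the operating system should see the replacement disk as /dev/sda again; a quick sanity check before touching the partition table (assuming lsblk is available, any block-device listing works):

# lsblk /dev/sda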
The sda partitions are now missing from the RAID, as expected:
# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid1 sdb8[1]
      870112064 blocks super 1.0 [2/1] [_U]
      bitmap: 5/7 pages [20KB], 65536KB chunk

md0 : active raid1 sdb5[1]
      529600 blocks super 1.0 [2/1] [_U]
      bitmap: 1/1 pages [4KB], 65536KB chunk

md3 : active raid1 sdb7[1]
      4200640 blocks super 1.0 [2/1] [_U]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md1 : active raid1 sdb6[1]
      4200640 blocks super 1.0 [2/1] [_U]
      bitmap: 1/1 pages [4KB], 65536KB chunk

unused devices: <none>
Now copy the partition table from sdb to sda:
my-linux-box:~ # sfdisk -d /dev/sdb | sfdisk /dev/sda --force --no-reread
Checking that no-one is using this disk right now ...
Warning: extended partition does not start at a cylinder boundary.
DOS and Linux will interpret the contents differently.
OK

Disk /dev/sda: 109437 cylinders, 255 heads, 63 sectors/track
Warning: extended partition does not start at a cylinder boundary.
DOS and Linux will interpret the contents differently.
Old situation:
Units = cylinders of 8225280 bytes, blocks of 1024 bytes, counting from 0

   Device Boot Start     End   #cyls    #blocks   Id  System
/dev/sda1   *      0+ 109437- 109438- 879054336    f  W95 Ext'd (LBA)
/dev/sda2          0       -       0          0    0  Empty
/dev/sda3          0       -       0          0    0  Empty
/dev/sda4          0       -       0          0    0  Empty
/dev/sda5          0+     66-     66-    529664   fd  Linux raid autodetect
/dev/sda6         66+    588-    523-   4200704   fd  Linux raid autodetect
/dev/sda7        589+   1111-    523-   4200704   fd  Linux raid autodetect
/dev/sda8       1112+ 109435- 108324- 870112256   fd  Linux raid autodetect
New situation:
Units = sectors of 512 bytes, counting from 0

   Device Boot    Start        End   #sectors   Id  System
/dev/sda1   *       512 1758109183 1758108672    f  W95 Ext'd (LBA)
/dev/sda2             0          -          0    0  Empty
/dev/sda3             0          -          0    0  Empty
/dev/sda4             0          -          0    0  Empty
/dev/sda5          1024    1060351    1059328   fd  Linux raid autodetect
/dev/sda6       1060864    9462271    8401408   fd  Linux raid autodetect
/dev/sda7       9462784   17864191    8401408   fd  Linux raid autodetect
/dev/sda8      17864704 1758089215 1740224512   fd  Linux raid autodetect
Warning: partition 1 does not end at a cylinder boundary
Successfully wrote the new partition table

Re-reading the partition table ...

If you created or changed a DOS partition, /dev/foo7, say, then use dd(1)
to zero the first 512 bytes:  dd if=/dev/zero of=/dev/foo7 bs=512 count=1
(See fdisk(8).)
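Before proceeding, it can be worth double-checking that both disks now carry an identical layout; one minimal way is to diff the normalized sfdisk dumps (the sed step is only a sketch to mask the differing device names, and the process substitution assumes bash):

# diff <(sfdisk -d /dev/sda | sed 's/sda/DISK/g') <(sfdisk -d /dev/sdb | sed 's/sdb/DISK/g')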
Erase possible RAID config data (from a reused disk)
After this step it is important to remove any leftover old SW RAID metadata from the newly attached disk before re-adding it to the RAID.
my-linux-box:~ # mdadm --zero-superblock /dev/sda5
my-linux-box:~ # mdadm --zero-superblock /dev/sda6
my-linux-box:~ # mdadm --zero-superblock /dev/sda7
my-linux-box:~ # mdadm --zero-superblock /dev/sda8
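Whether the wipe worked can be verified with mdadm's examine mode, which should now report that no superblock is present:

# mdadm --examine /dev/sda5
mdadm: No md superblock detected on /dev/sda5.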
After that, the partitions can be added back to the SW RAID:
my-linux-box:~ # mdadm /dev/md0 --add /dev/sda5
mdadm: added /dev/sda5
my-linux-box:~ # mdadm /dev/md1 --add /dev/sda6
mdadm: added /dev/sda6
my-linux-box:~ # mdadm /dev/md3 --add /dev/sda7
mdadm: added /dev/sda7
my-linux-box:~ # mdadm /dev/md2 --add /dev/sda8
mdadm: added /dev/sda8
Note: add the RAID partitions one at a time, and only add the next one once the previously added partition shows up as [UU] in /proc/mdstat.
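The resync progress can be followed while each partition synchronizes, either by polling /proc/mdstat or by querying the array directly:

# watch cat /proc/mdstat
# mdadm --detail /dev/md0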
How to install GRUB on the disks?
Once md0 has synced, GRUB should be installed again on both disks by invoking the GRUB installer.
Finally, grub-install should install GRUB on both disks (hd0 and hd1) without any error messages.
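GRUB legacy resolves hd0 and hd1 through its device map, so it is worth confirming the mapping before running the installer; the file location assumes the GRUB 0.97 defaults, and the entries shown are what would be expected on this setup:

# cat /boot/grub/device.map
(hd0)   /dev/sda
(hd1)   /dev/sdb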
# grub-install

    GNU GRUB  version 0.97  (640K lower / 3072K upper memory)

 [ Minimal BASH-like line editing is supported.  For the first word, TAB
   lists possible command completions.  Anywhere else TAB lists the possible
   completions of a device/filename. ]

grub> setup --stage2=/boot/grub/stage2 --force-lba (hd0) (hd0,4)
 Checking if "/boot/grub/stage1" exists... yes
 Checking if "/boot/grub/stage2" exists... yes
 Checking if "/boot/grub/e2fs_stage1_5" exists... yes
 Running "embed /boot/grub/e2fs_stage1_5 (hd0)"...  17 sectors are embedded.
succeeded
 Running "install --force-lba --stage2=/boot/grub/stage2 /boot/grub/stage1 (hd0) (hd0)1+17 p (hd0,4)/boot/grub/stage2 /boot/grub/menu.lst"... succeeded
Done.
grub> setup --stage2=/boot/grub/stage2 --force-lba (hd1) (hd1,4)
 Checking if "/boot/grub/stage1" exists... yes
 Checking if "/boot/grub/stage2" exists... yes
 Checking if "/boot/grub/e2fs_stage1_5" exists... yes
 Running "embed /boot/grub/e2fs_stage1_5 (hd1)"...  17 sectors are embedded.
succeeded
 Running "install --force-lba --stage2=/boot/grub/stage2 /boot/grub/stage1 (hd1) (hd1)1+17 p (hd1,4)/boot/grub/stage2 /boot/grub/menu.lst"... succeeded
Done.
grub> quit
Finally, verify the RAID status:
# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid1 sda8[2] sdb8[1]
      870112064 blocks super 1.0 [2/2] [UU]
      bitmap: 6/7 pages [24KB], 65536KB chunk

md0 : active raid1 sda5[2] sdb5[1]
      529600 blocks super 1.0 [2/2] [UU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

md3 : active raid1 sda7[2] sdb7[1]
      4200640 blocks super 1.0 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md1 : active raid1 sda6[2] sdb6[1]
      4200640 blocks super 1.0 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

unused devices: <none>
The disk replacement for the second logical drive can be performed in the same way.