ASM hangs when adding or dropping disks

Solution

  1. One of the sessions holding the dismounted disk discovery enqueue "DD-00000000-00000000" in exclusive mode is waiting indefinitely on "kfk: async disk IO".

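This session prevents RBAL from acquiring the same enqueue (the dismounted disk discovery enqueue) for the new devices added on the other nodes of the RAC environment.
That is why the following message repeats in the RBAL trace.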

kfgbTryFn: failed to acquire DD.0.0 in 6 for kfgbDiscoverNow (of group 7/0x259d8ac6)

The blocking session can be identified with the SQL script below:

set linesize 200
set pagesize 1000
column username format a10
column mod format a20
column blocker format a7
column waiter format a7
column lmode format 9999
column request format 9999
column I format 99
column sid format 9999
col username format a6
col osuser format a8
col s# format 99999
col CS_pid format a13
col pname format a10
col program format a20
col waitsec format 999,999,999
col pid format 9999
--col p1 format 9999
col p2 format a20
col sql format a20
spool locking_information
prompt ########################
prompt # Blocking Information #
prompt ########################
select  b.inst_id||'/'||b.sid blocker,
--      s.module,
        w.inst_id||'/'||w.sid waiter,
        b.type,
        b.id1,
        b.id2,
        b.lmode,
        w.request
from    gv$lock b,
        ( select inst_id, sid, type, id1, id2, lmode, request
          from   gv$lock  where request > 0 ) w
--      gv$session s
where   b.lmode > 0
and     ( b.id1 = w.id1 and b.id2 = w.id2 and b.type = w.type )
--and   ( b.sid = s.sid and b.inst_id = s.inst_id )
order by b.inst_id, b.sid
/
prompt ##########################
prompt # Rebalance Information  #
prompt ##########################
select * from gv$asm_operation
/
prompt ########################
prompt # Locking Information  #
prompt ########################
select a.type, a.id1, a.id2, a.lmode, a.request, a.inst_id inst, a.sid,
case when a.type='DD' and a.id1=0 and a.id2=0 and a.lmode=6 then '<<<<<<------------------' end "Dismounted DD enq holder"
from gv$lock a
order by a.type, a.id1, a.id2, a.lmode
/
prompt ########################
prompt # Session Information  #
prompt ########################
select  s.inst_id I, s.sid, s.serial# s#, p.pid, s.username, s.process||'/'||spid CS_pid, p.pname,  --> p.program in 10g_11gR1
s.status, s.module program, s.osuser ,
substr(w.event, 1, 30) wait_event, w.seconds_in_wait waitsec, w.p1,
case
  when w.event='DFS lock handle' and w.p2=38 then 'ASM diskgroup discovery wait'
  when w.event='DFS lock handle' and w.p2=39 then 'ASM diskgroup release'
  when w.event='DFS lock handle' and w.p2=40 then 'ASM push DB updates'
  when w.event='DFS lock handle' and w.p2=41 then 'ASM add ACD chunk'
  when w.event='DFS lock handle' and w.p2=42 then 'ASM map resize message'
  when w.event='DFS lock handle' and w.p2=43 then 'ASM map lock message'
  when w.event='DFS lock handle' and w.p2=44 then 'ASM map unlock message (phase 1)'
  when w.event='DFS lock handle' and w.p2=45 then 'ASM map unlock message (phase 2)'
  when w.event='DFS lock handle' and w.p2=46 then 'ASM generate add disk redo marker'
  when w.event='DFS lock handle' and w.p2=47 then 'ASM check of PST validity'
  when w.event='DFS lock handle' and w.p2=48 then 'ASM offline disk CIC'
  when w.event='DFS lock handle' and w.p2=52 then 'ASM F1X0 relocation'
  when w.event='DFS lock handle' and w.p2=55 then 'ASM disk operation message'
  when w.event='DFS lock handle' and w.p2=56 then 'ASM I/O error emulation'
  when w.event='DFS lock handle' and w.p2=60 then 'ASM Pre-Existing Extent Lock wait'
  when w.event='DFS lock handle' and w.p2=61 then 'Perform a ksk action through DBWR'
  when w.event='DFS lock handle' and w.p2=62 then 'ASM diskgroup refresh wait'
  else to_char(w.p2)
end  p2 , substr(q.sql_text, 1, 100) sql
from gv$session s , gv$process p , gv$session_wait w , gv$sqlarea q
where   ( s.paddr = p.addr and s.inst_id = p.inst_id )
and     ( s.inst_id = w.inst_id and s.sid = w.sid )
and     ( s.inst_id = q.inst_id(+) and s.sql_address = q.address(+) )
order by s.inst_id, s.sid --, s.audsid
/
spool off
exit

Sample output:

------------------------------------------------------------------------------------------------------------------------
DD          0          0     6          0                   2         182 <<<<<<------------------        ( Inst# 2, SID 182 is an exclusive holder process for DD-00000000-00000000 )

Note that ID1 and ID2 are both 0 (that is, DD-00000000-00000000) and that LMODE is 6, which means exclusive mode.

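  2. One of the devices added to the affected disk group shows close to 100% utilization.
    For example, in the following "iostat -xt 2" output, xvdev1 is one of the added devices.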
Device:         rrqm/s   wrqm/s   r/s   w/s     rsec/s   wsec/s avgrq-sz  avgqu-sz     await     svctm     %util
xvdev           0.00     0.00    0.00  0.00     0.00     0.00     0.00    8.00         0.00      0.00      100.00          <<<<<------- Utilization shows 100%
xvdev1          0.00     0.00    0.00  0.00     0.00     0.00     0.00    3.00         0.00      0.00      100.00
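
A saturated device like this can be picked out of captured iostat output by reading the %util column at its header position. The Python sketch below is illustrative only: the function name busy_devices and the 95% threshold are assumptions, not part of the original note, and it assumes the "iostat -x" column layout shown above.

```python
def busy_devices(iostat_text, threshold=95.0):
    """Return {device: %util} for devices at or above the threshold.

    Assumes 'iostat -x' style output whose header row starts with
    'Device' and contains a '%util' column (an assumption based on
    the sample in this note).
    """
    util_idx = None
    busy = {}
    for line in iostat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith("Device"):
            # Locate the %util column from the header row.
            util_idx = fields.index("%util") if "%util" in fields else None
            continue
        if util_idx is None or len(fields) <= util_idx:
            # Timestamp, CPU, or other non-device lines.
            continue
        try:
            util = float(fields[util_idx])
        except ValueError:
            continue
        if util >= threshold:
            busy[fields[0]] = util
    return busy
```

Reading the column by header position rather than taking the last field keeps the parser working when a line carries a trailing annotation, as in the sample above. If the text contains several reports, the value from the last report wins.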

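Resolving the problem:

  1. Fix the problem at the operating system or storage level that causes the device to show close to 100% utilization.

  2. After fixing the problem device, try to reproduce the issue by creating a dummy disk group with the new devices, as described in Note 557348.1.
    Then run asm_blocking.sql to check whether any process holds "DD-00000000-00000000" for a long time.
    If the new DUMMY disk group can be created without problems, the issue will not recur.

  3. If the rebalance does not start automatically after the storage problem is fixed at the operating system level, restart the rebalance of the disk group manually.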

SQL>  alter diskgroup DATA rebalance power 6;
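
Step 2 depends on telling whether an exclusive holder of DD-00000000-00000000 persists over time. The Python sketch below shows one way to compare two captures of the "Locking Information" output. It assumes the columns appear in the order used by the script (type, id1, id2, lmode, request, inst, sid); the function names dd_holders and persistent_holders are hypothetical.

```python
def dd_holders(lock_output):
    """Return a set of (inst, sid) pairs holding DD 0/0 in mode 6.

    Assumes each data row is whitespace separated in the order
    type, id1, id2, lmode, request, inst, sid (an assumption taken
    from the query in this note).
    """
    holders = set()
    for line in lock_output.splitlines():
        fields = line.split()
        if len(fields) < 7 or fields[0] != "DD":
            continue
        try:
            id1, id2, lmode, _request, inst, sid = (int(x) for x in fields[1:7])
        except ValueError:
            continue
        if id1 == 0 and id2 == 0 and lmode == 6:
            holders.add((inst, sid))
    return holders


def persistent_holders(earlier_output, later_output):
    """Holders present in both captures, i.e. likely long-term holders."""
    return dd_holders(earlier_output) & dd_holders(later_output)
```

A holder that appears in two captures taken several minutes apart is a candidate for the stuck session described above. How long counts as "a long time" is left to the operator.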

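Problem

  1. In a RAC environment, several disks were added to an existing disk group. The sqlplus session that issued the add operation never returned control and had to be disconnected manually.

  2. gv$asm_operation shows no rebalance in progress: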

SQL> select * from gv$asm_operation;
no rows selected
  3. "disk validation pending" messages are visible on the other nodes, but the ASM alert.log contains no "SUCCESS: refreshed membership" message:
Tue Aug 27 23:32:36 2013
NOTE: disk validation pending for group 2/0x75fe02b8 (DATA)
Wed Aug 28 05:28:52 2013
  4. The RBAL trace repeatedly shows the following message.
kfgbTryFn: failed to acquire DD.0.0 in 6 for kfgbDiscoverNow (of group 7/0x259d8ac6)

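Note: "DD.0.0" refers to the dismounted disk discovery enqueue, and "6" refers to exclusive mode.

  5. Queries against v$asm_disk and v$asm_diskgroup hang, but queries against the v$asm_disk_stat and v$asm_diskgroup_stat views work.

Sample v$asm_disk_stat output for the new devices.
Note the "ADDING" state: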

GN DN    m_status    h_status     mo_status     state     dname
 2   10  OPENED      MEMBER       SYNCING       ADDING    DATA_0010   
 2   11  OPENED      MEMBER       SYNCING       ADDING    DATA_0011
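
The repeated RBAL trace message listed among the symptoms can be counted with a short script. The Python sketch below is a minimal example that assumes every occurrence follows exactly the format of the single line quoted in this note. The function name count_failed_acquires is hypothetical, and the hexadecimal value after the group number is matched but not interpreted.

```python
import re
from collections import Counter

# Pattern derived from the one trace line quoted in this note (assumption).
TRACE_RE = re.compile(
    r"kfgbTryFn: failed to acquire (?P<enq>\w+\.\d+\.\d+) in (?P<mode>\d+) "
    r"for (?P<caller>\w+) \(of group (?P<group>\d+)/0x[0-9a-fA-F]+\)"
)


def count_failed_acquires(trace_text):
    """Count failed enqueue acquisitions per (enqueue, mode, group number)."""
    counts = Counter()
    for line in trace_text.splitlines():
        match = TRACE_RE.search(line)
        if match:
            key = (match.group("enq"), int(match.group("mode")), int(match.group("group")))
            counts[key] += 1
    return counts
```

A steadily growing count for ("DD.0.0", 6, <group number>) matches the symptom described above: RBAL keeps retrying the dismounted disk discovery enqueue in exclusive mode and never obtains it.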
Date: 2020-09-17 00:11:19 Source: oir Author: oir