rac修改异步IO导致挂死

今天遇到个rac的异步IO的问题，环境是oracle 10201的双节点rac，存储使用ASM，机器是HP 的PA的机器，操作系统是11.23.
在udump下的日志中有大量的报错：

oracle@rp8440a:/oracle/oracle/admin/gzxnh/bdump> cat gzxnh1_ora_19552.trc
/oracle/oracle/admin/gzxnh/bdump/gzxnh1_ora_19552.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options
ORACLE_HOME = /oracle/oracle/product/10.2.0/db_2
System name:    HP-UX
Node name:      rp8440a
Release:        B.11.23
Version:        U
Machine:        9000/800
Instance name: gzxnh1
Redo thread mounted by this instance: 1
Oracle process number: 0
Unix process pid: 19552, image: oracle@rp8440a

Ioctl ASYNC_CONFIG error, errno = 1

oracle@rp8440a:/oracle/oracle/admin/gzxnh/bdump> cat gzxnh1_ora_19552.trc

/oracle/oracle/admin/gzxnh/bdump/gzxnh1_ora_19552.trc

Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - 64bit Production

With the Partitioning, Real Application Clusters, OLAP and Data Mining options

ORACLE_HOME = /oracle/oracle/product/10.2.0/db_2

System name: HP-UX

Node name: rp8440a

Release: B.11.23

Version: U

Machine: 9000/800

Instance name: gzxnh1

Redo thread mounted by this instance: 1

Oracle process number: 0

Unix process pid: 19552, image: oracle@rp8440a

Ioctl ASYNC_CONFIG error, errno = 1

看到这个报错，觉得很轻松，之前也遇到过，是oracle用户的dba组在操作系统上没有MLOCK的权限，在这里的文章，我也写到过。由于hp的11.23的系统，默认是开asyncdsk的。于是就用root给oracle用户的dba组加MLOCK权限了。

可是，等加完之后，问题就出现了：登录的提示就是不完整的，登录之后select所有信息，都会报没有连接数据库：

oracle@rp8440a:/oracle/oracle/admin/gzxnh/udump> sqlplus "/ as sysdba"

SQL*Plus: Release 10.2.0.1.0 - Production on Thu Dec 3 10:28:47 2009

Copyright (c) 1982, 2005, Oracle.  All rights reserved.

Connected.
SQL> select * from dual;
select * from dual
*
ERROR at line 1:
ORA-01012: not logged on


SQL> conn /a  as sysdba
Connected to an idle instance.
SQL> 
SQL> 
SQL> exit

oracle@rp8440a:/oracle/oracle/admin/gzxnh/udump> sqlplus "/ as sysdba"

SQL*Plus: Release 10.2.0.1.0 - Production on Thu Dec 3 10:28:47 2009

Connected.

SQL> select * from dual;

select * from dual

ERROR at line 1:

ORA-01012: not logged on

SQL> conn /a as sysdba

Connected to an idle instance.

SQL>

SQL> exit

检查环境变量的正常的，确实登录的是gzxnh1的实例，而且，oracle的进程都是在的：

oracle@rp8440a:/oracle/oracle/admin/gzxnh/udump> ps -ef |grep ora _
  oracle 24562 13823  0 10:28:41 pts/1     0:00 grep ora_
  oracle 26691     1  0 00:37:55 ?         0:00 ora_q002_gzxnh1
  oracle 18836     1  0  Nov 28  ?         4:42 ora_cjq0_gzxnh1
  oracle 18865     1  0  Nov 28  ?         4:15 ora_mmon_gzxnh1
  oracle 19044     1  0  Nov 28  ?         0:09 ora_rbal_gzxnh1
  oracle 18614     1  0  Nov 28  ?         1:13 ora_pmon_gzxnh1
  oracle 19598     1  0  Nov 28  ?        273:22 ora_j000_gzxnh1
  oracle 18800     1  0  Nov 28  ?         2:03 ora_ckpt_gzxnh1
  oracle 18927     1  0  Nov 28  ?        13:13 ora_lck0_gzxnh1
  oracle 18643     1  0  Nov 28  ?         3:21 ora_diag_gzxnh1
  oracle 18805     1  0  Nov 28  ?         1:42 ora_smon_gzxnh1
  oracle 18672     1  0  Nov 28  ?         0:10 ora_psp0_gzxnh1
  oracle 18674     1  0  Nov 28  ?        22:01 ora_lmon_gzxnh1
  oracle 18771     1  0  Nov 28  ?         2:56 ora_lgwr_gzxnh1
  oracle 19013     1  0  Nov 28  ?         0:11 ora_asmb_gzxnh1
  oracle 18834     1  0  Nov 28  ?         0:00 ora_reco_gzxnh1
  oracle 18867     1  0  Nov 28  ?         7:25 ora_mmnl_gzxnh1
  oracle 19835     1  0  Nov 28  ?         0:14 ora_q000_gzxnh1
  oracle 18898     1  0  Nov 28  ?         0:00 ora_s000_gzxnh1
  oracle 18896     1  0  Nov 28  ?         0:00 ora_d000_gzxnh1
  oracle 18707     1  0  Nov 28  ?        10:52 ora_lms0_gzxnh1
  oracle 18703     1  0  Nov 28  ?        18:31 ora_lmd0_gzxnh1
  oracle 18736     1  0  Nov 28  ?         8:54 ora_lms1_gzxnh1
  oracle 19461     1  0 22:00:01 ?         0:05 ora_j001_gzxnh1
  oracle 18738     1  0  Nov 28  ?         0:25 ora_mman_gzxnh1
  oracle 18769     1  0  Nov 28  ?         3:22 ora_dbw0_gzxnh1
  oracle 27708     1  0 10:54:35 ?         1:47 ora_pz99_gzxnh1
  oracle 15114     1  0  Dec  1  ?         0:00 ora_o002_gzxnh1
  oracle 19541     1  0  Nov 28  ?         0:14 ora_qmnc_gzxnh1
oracle@rp8440a:/oracle/oracle/admin/gzxnh/udump>

oracle@rp8440a:/oracle/oracle/admin/gzxnh/udump> ps -ef |grep ora _

oracle 24562 13823 0 10:28:41 pts/1 0:00 grep ora_

oracle 26691 1 0 00:37:55 ? 0:00 ora_q002_gzxnh1

oracle 18836 1 0 Nov 28 ? 4:42 ora_cjq0_gzxnh1

oracle 18865 1 0 Nov 28 ? 4:15 ora_mmon_gzxnh1

oracle 19044 1 0 Nov 28 ? 0:09 ora_rbal_gzxnh1

oracle 18614 1 0 Nov 28 ? 1:13 ora_pmon_gzxnh1

oracle 19598 1 0 Nov 28 ? 273:22 ora_j000_gzxnh1

oracle 18800 1 0 Nov 28 ? 2:03 ora_ckpt_gzxnh1

oracle 18927 1 0 Nov 28 ? 13:13 ora_lck0_gzxnh1

oracle 18643 1 0 Nov 28 ? 3:21 ora_diag_gzxnh1

oracle 18805 1 0 Nov 28 ? 1:42 ora_smon_gzxnh1

oracle 18672 1 0 Nov 28 ? 0:10 ora_psp0_gzxnh1

oracle 18674 1 0 Nov 28 ? 22:01 ora_lmon_gzxnh1

oracle 18771 1 0 Nov 28 ? 2:56 ora_lgwr_gzxnh1

oracle 19013 1 0 Nov 28 ? 0:11 ora_asmb_gzxnh1

oracle 18834 1 0 Nov 28 ? 0:00 ora_reco_gzxnh1

oracle 18867 1 0 Nov 28 ? 7:25 ora_mmnl_gzxnh1

oracle 19835 1 0 Nov 28 ? 0:14 ora_q000_gzxnh1

oracle 18898 1 0 Nov 28 ? 0:00 ora_s000_gzxnh1

oracle 18896 1 0 Nov 28 ? 0:00 ora_d000_gzxnh1

oracle 18707 1 0 Nov 28 ? 10:52 ora_lms0_gzxnh1

oracle 18703 1 0 Nov 28 ? 18:31 ora_lmd0_gzxnh1

oracle 18736 1 0 Nov 28 ? 8:54 ora_lms1_gzxnh1

oracle 19461 1 0 22:00:01 ? 0:05 ora_j001_gzxnh1

oracle 18738 1 0 Nov 28 ? 0:25 ora_mman_gzxnh1

oracle 18769 1 0 Nov 28 ? 3:22 ora_dbw0_gzxnh1

oracle 27708 1 0 10:54:35 ? 1:47 ora_pz99_gzxnh1

oracle 15114 1 0 Dec 1 ? 0:00 ora_o002_gzxnh1

oracle 19541 1 0 Nov 28 ? 0:14 ora_qmnc_gzxnh1

oracle@rp8440a:/oracle/oracle/admin/gzxnh/udump>

怎么回事？为何数据库登录有问题？登录上去报错没有连接，再次登录就报错connect to an idle instance？

进一步检查oracle的alertlog：

Thread 1 advanced to log sequence 332
  Current log# 6 seq# 332 mem# 0: +DATA1/gzxnh/onlinelog/group_6.592.704141411
Thu Dec  3 10:24:47 2009
Process PZ98 died, see its trace file
Thu Dec  3 10:32:55 2009
Process PZ99 died, see its trace file
Process PZ99 died, see its trace file
Thu Dec  3 10:33:57 2009
Process PZ99 died, see its trace file
Thu Dec  3 10:34:26 2009
Process PZ99 died, see its trace file
Process PZ99 died, see its trace file
oracle@rp8440a:/oracle/oracle/admin/gzxnh/bdump>

Thread 1 advanced to log sequence 332

Current log# 6 seq# 332 mem# 0: +DATA1/gzxnh/onlinelog/group_6.592.704141411

Thu Dec 3 10:24:47 2009

Process PZ98 died, see its trace file

Thu Dec 3 10:32:55 2009

Process PZ99 died, see its trace file

Thu Dec 3 10:33:57 2009

Process PZ99 died, see its trace file

Thu Dec 3 10:34:26 2009

Process PZ99 died, see its trace file

oracle@rp8440a:/oracle/oracle/admin/gzxnh/bdump>

PZ99的进程死了？虽然在alertlog中没看到trace file指的是哪个，但是通过bdump下的最新一个trace文件可以看到：

oracle@rp8440a:/oracle/oracle/admin/gzxnh/bdump> cat gzxnh1 _ora _19598.trc
/oracle/oracle/admin/gzxnh/bdump/gzxnh1_ora_19598.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options
ORACLE_HOME = /oracle/oracle/product/10.2.0/db_2
System name:    HP-UX
Node name:      rp8440a
Release:        B.11.23
Version:        U
Machine:        9000/800
Instance name: gzxnh1
Redo thread mounted by this instance: 1
Oracle process number: 0
Unix process pid: 19598, image: oracle@rp8440a
……
*** 2009-12-03 10:34:26.934
*** ACTION NAME:() 2009-12-03 10:34:26.934
*** MODULE NAME:() 2009-12-03 10:34:26.934
*** SERVICE NAME:(SYS$USERS) 2009-12-03 10:34:26.934
*** CLIENT ID:() 2009-12-03 10:34:26.934
*** SESSION ID:(3251.32206) 2009-12-03 10:34:26.934
kxfpg1srv
        could not start local PZ99
kxfpg1sg
        GV$ query failed to get slave on inst 1
kxfpg1srv
        could not start PZ99, inst 2
kxfpg1sg
        GV$ query failed to get slave on inst 2
kxfpg1srv
        could not start local PZ99
kxfpg1sg
        GV$ query failed to get slave on inst 1
kxfpg1srv
        could not start PZ99, inst 2
kxfpg1sg
        GV$ query failed to get slave on inst 2
oracle@rp8440a:/oracle/oracle/admin/gzxnh/bdump>

oracle@rp8440a:/oracle/oracle/admin/gzxnh/bdump> cat gzxnh1 _ora _19598.trc

/oracle/oracle/admin/gzxnh/bdump/gzxnh1_ora_19598.trc

Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - 64bit Production

With the Partitioning, Real Application Clusters, OLAP and Data Mining options

ORACLE_HOME = /oracle/oracle/product/10.2.0/db_2

System name: HP-UX

Node name: rp8440a

Release: B.11.23

Version: U

Machine: 9000/800

Instance name: gzxnh1

Redo thread mounted by this instance: 1

Oracle process number: 0

Unix process pid: 19598, image: oracle@rp8440a

……

*** 2009-12-03 10:34:26.934

*** ACTION NAME:() 2009-12-03 10:34:26.934

*** MODULE NAME:() 2009-12-03 10:34:26.934

*** SERVICE NAME:(SYS$USERS) 2009-12-03 10:34:26.934

*** CLIENT ID:() 2009-12-03 10:34:26.934

*** SESSION ID:(3251.32206) 2009-12-03 10:34:26.934

kxfpg1srv

could not start local PZ99

kxfpg1sg

GV$ query failed to get slave on inst 1

kxfpg1srv

could not start PZ99, inst 2

kxfpg1sg

GV$ query failed to get slave on inst 2

kxfpg1srv

could not start local PZ99

kxfpg1sg

GV$ query failed to get slave on inst 1

kxfpg1srv

could not start PZ99, inst 2

kxfpg1sg

GV$ query failed to get slave on inst 2

oracle@rp8440a:/oracle/oracle/admin/gzxnh/bdump>

在rac2上看：

oracle@rp8440b:/oracle/oracle/admin/gzxnh/bdump> cat gzxnh2_ora_10642.trc

/oracle/oracle/admin/gzxnh/bdump/gzxnh2_ora_10642.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options
ORACLE_HOME = /oracle/oracle/product/10.2.0/db_2
System name:    HP-UX
Node name:      rp8440b
Release:        B.11.23
Version:        U
Machine:        9000/800
Instance name: gzxnh2
Redo thread mounted by this instance: 2
Oracle process number: 0
Unix process pid: 10642, image: oracle@rp8440b

Died during process startup with error 27061 (seq=596)
OPIRIP: Uncaught error 27061. Error stack:
ORA-27061: waiting for async I/Os failed
oracle@rp8440b:/oracle/oracle/admin/gzxnh/bdump

oracle@rp8440b:/oracle/oracle/admin/gzxnh/bdump> cat gzxnh2_ora_10642.trc

/oracle/oracle/admin/gzxnh/bdump/gzxnh2_ora_10642.trc

Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - 64bit Production

With the Partitioning, Real Application Clusters, OLAP and Data Mining options

ORACLE_HOME = /oracle/oracle/product/10.2.0/db_2

System name: HP-UX

Node name: rp8440b

Release: B.11.23

Version: U

Machine: 9000/800

Instance name: gzxnh2

Redo thread mounted by this instance: 2

Oracle process number: 0

Unix process pid: 10642, image: oracle@rp8440b

Died during process startup with error 27061 (seq=596)

OPIRIP: Uncaught error 27061. Error stack:

ORA-27061: waiting for async I/Os failed

oracle@rp8440b:/oracle/oracle/admin/gzxnh/bdump

看来确实和rac上刚刚开启的异步IO有关系了，既然这样，赶紧还是把异步IO关闭了吧：在操作系统级别用sam将asyncdsk改成unused，将/etc/privgroup文件删除，重启主机后，数据库启动，恢复正常。

PZ99是rac中并发进程的slave进程（可查看这里），查询Gv$视图需要该进程，目前我猜测是由于修改成了异步IO，导致原来正在执行的pz99进程挂死，sqlplus登录时无法查询gv$视图。如果此时kill掉PZ99的进程，说不定不用重启主机，关闭异步IO，就能解决故障了。

不过，还是想不通PZ99进程和异步IO间的关系，哪位如果也遇到这个情况，能说说当时你是怎么处理的么？

rac修改异步IO导致挂死

3条评论

回复江风渔火取消回复

rac修改异步IO导致挂死

相关文章

在11.2的grid中安装10g database报错OUI-35000

oracle 11g 慢慢的走近了我们……

某项目KT技术小结

3条评论

回复 江风渔火 取消回复

回复江风渔火取消回复