今天遇到个rac的异步IO的问题,环境是oracle 10201的双节点rac,存储使用ASM,机器是HP 的PA的机器,操作系统是11.23.
在udump下的日志中有大量的报错:
| 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | oracle@rp8440a:/oracle/oracle/admin/gzxnh/bdump> cat gzxnh1_ora_19552.trc /oracle/oracle/admin/gzxnh/bdump/gzxnh1_ora_19552.trc Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - 64bit Production With the Partitioning, Real Application Clusters, OLAP and Data Mining options ORACLE_HOME = /oracle/oracle/product/10.2.0/db_2 System name:    HP-UX Node name:      rp8440a Release:        B.11.23 Version:        U Machine:        9000/800 Instance name: gzxnh1 Redo thread mounted by this instance: 1 Oracle process number: 0 Unix process pid: 19552, image: oracle@rp8440a Ioctl ASYNC_CONFIG error, errno = 1 | 
看到这个报错,觉得很轻松,之前也遇到过,是oracle用户的dba组在操作系统上没有MLOCK的权限,在这里的文章,我也写到过。由于hp的11.23的系统,默认是开asyncdsk的。于是就用root给oracle用户的dba组加MLOCK权限了。
可是,等加完之后,问题就出现了:登录的提示就是不完整的,登录之后select所有信息,都会报没有连接数据库:
| 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 | oracle@rp8440a:/oracle/oracle/admin/gzxnh/udump> sqlplus "/ as sysdba" SQL*Plus: Release 10.2.0.1.0 - Production on Thu Dec 3 10:28:47 2009 Copyright (c) 1982, 2005, Oracle.  All rights reserved. Connected. SQL> select * from dual; select * from dual * ERROR at line 1: ORA-01012: not logged on SQL> conn /a  as sysdba Connected to an idle instance. SQL>  SQL>  SQL> exit | 
检查环境变量的正常的,确实登录的是gzxnh1的实例,而且,oracle的进程都是在的:
| 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 | oracle@rp8440a:/oracle/oracle/admin/gzxnh/udump> ps -ef |grep ora _   oracle 24562 13823  0 10:28:41 pts/1     0:00 grep ora_   oracle 26691     1  0 00:37:55 ?         0:00 ora_q002_gzxnh1   oracle 18836     1  0  Nov 28  ?         4:42 ora_cjq0_gzxnh1   oracle 18865     1  0  Nov 28  ?         4:15 ora_mmon_gzxnh1   oracle 19044     1  0  Nov 28  ?         0:09 ora_rbal_gzxnh1   oracle 18614     1  0  Nov 28  ?         1:13 ora_pmon_gzxnh1   oracle 19598     1  0  Nov 28  ?        273:22 ora_j000_gzxnh1   oracle 18800     1  0  Nov 28  ?         2:03 ora_ckpt_gzxnh1   oracle 18927     1  0  Nov 28  ?        13:13 ora_lck0_gzxnh1   oracle 18643     1  0  Nov 28  ?         3:21 ora_diag_gzxnh1   oracle 18805     1  0  Nov 28  ?         1:42 ora_smon_gzxnh1   oracle 18672     1  0  Nov 28  ?         0:10 ora_psp0_gzxnh1   oracle 18674     1  0  Nov 28  ?        22:01 ora_lmon_gzxnh1   oracle 18771     1  0  Nov 28  ?         2:56 ora_lgwr_gzxnh1   oracle 19013     1  0  Nov 28  ?         0:11 ora_asmb_gzxnh1   oracle 18834     1  0  Nov 28  ?         0:00 ora_reco_gzxnh1   oracle 18867     1  0  Nov 28  ?         7:25 ora_mmnl_gzxnh1   oracle 19835     1  0  Nov 28  ?         0:14 ora_q000_gzxnh1   oracle 18898     1  0  Nov 28  ?         0:00 ora_s000_gzxnh1   oracle 18896     1  0  Nov 28  ?         0:00 ora_d000_gzxnh1   oracle 18707     1  0  Nov 28  ?        10:52 ora_lms0_gzxnh1   oracle 18703     1  0  Nov 28  ?        18:31 ora_lmd0_gzxnh1   oracle 18736     1  0  Nov 28  ?         8:54 ora_lms1_gzxnh1   oracle 19461     1  0 22:00:01 ?         0:05 ora_j001_gzxnh1   oracle 18738     1  0  Nov 28  ?         0:25 ora_mman_gzxnh1   oracle 18769     1  0  Nov 28  ?         3:22 ora_dbw0_gzxnh1   oracle 27708     1  0 10:54:35 ?         1:47 ora_pz99_gzxnh1   oracle 15114     1  0  Dec  1  ?         0:00 ora_o002_gzxnh1   oracle 19541     1  0  Nov 28  ?         0:14 ora_qmnc_gzxnh1 oracle@rp8440a:/oracle/oracle/admin/gzxnh/udump>  | 
怎么回事?为何数据库登录有问题?登录上去报错没有连接,再次登录就报错connect to an idle instance?
进一步检查oracle的alertlog:
| 1 2 3 4 5 6 7 8 9 10 11 12 13 | Thread 1 advanced to log sequence 332   Current log# 6 seq# 332 mem# 0: +DATA1/gzxnh/onlinelog/group_6.592.704141411 Thu Dec  3 10:24:47 2009 Process PZ98 died, see its trace file Thu Dec  3 10:32:55 2009 Process PZ99 died, see its trace file Process PZ99 died, see its trace file Thu Dec  3 10:33:57 2009 Process PZ99 died, see its trace file Thu Dec  3 10:34:26 2009 Process PZ99 died, see its trace file Process PZ99 died, see its trace file oracle@rp8440a:/oracle/oracle/admin/gzxnh/bdump>  | 
PZ99的进程死了?虽然在alertlog中没看到trace file指的是哪个,但是通过bdump下的最新一个trace文件可以看到:
| 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 | oracle@rp8440a:/oracle/oracle/admin/gzxnh/bdump> cat gzxnh1 _ora _19598.trc /oracle/oracle/admin/gzxnh/bdump/gzxnh1_ora_19598.trc Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - 64bit Production With the Partitioning, Real Application Clusters, OLAP and Data Mining options ORACLE_HOME = /oracle/oracle/product/10.2.0/db_2 System name:    HP-UX Node name:      rp8440a Release:        B.11.23 Version:        U Machine:        9000/800 Instance name: gzxnh1 Redo thread mounted by this instance: 1 Oracle process number: 0 Unix process pid: 19598, image: oracle@rp8440a …… *** 2009-12-03 10:34:26.934 *** ACTION NAME:() 2009-12-03 10:34:26.934 *** MODULE NAME:() 2009-12-03 10:34:26.934 *** SERVICE NAME:(SYS$USERS) 2009-12-03 10:34:26.934 *** CLIENT ID:() 2009-12-03 10:34:26.934 *** SESSION ID:(3251.32206) 2009-12-03 10:34:26.934 kxfpg1srv         could not start local PZ99 kxfpg1sg         GV$ query failed to get slave on inst 1 kxfpg1srv         could not start PZ99, inst 2 kxfpg1sg         GV$ query failed to get slave on inst 2 kxfpg1srv         could not start local PZ99 kxfpg1sg         GV$ query failed to get slave on inst 1 kxfpg1srv         could not start PZ99, inst 2 kxfpg1sg         GV$ query failed to get slave on inst 2 oracle@rp8440a:/oracle/oracle/admin/gzxnh/bdump>  | 
在rac2上看:
| 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 | oracle@rp8440b:/oracle/oracle/admin/gzxnh/bdump> cat gzxnh2_ora_10642.trc /oracle/oracle/admin/gzxnh/bdump/gzxnh2_ora_10642.trc Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - 64bit Production With the Partitioning, Real Application Clusters, OLAP and Data Mining options ORACLE_HOME = /oracle/oracle/product/10.2.0/db_2 System name:    HP-UX Node name:      rp8440b Release:        B.11.23 Version:        U Machine:        9000/800 Instance name: gzxnh2 Redo thread mounted by this instance: 2 Oracle process number: 0 Unix process pid: 10642, image: oracle@rp8440b Died during process startup with error 27061 (seq=596) OPIRIP: Uncaught error 27061. Error stack: ORA-27061: waiting for async I/Os failed oracle@rp8440b:/oracle/oracle/admin/gzxnh/bdump | 
看来确实和rac上刚刚开启的异步IO有关系了,既然这样,赶紧还是把异步IO关闭了吧:在操作系统级别用sam将asyncdsk改成unused,将/etc/privgroup文件删除,重启主机后,数据库启动,恢复正常。
PZ99是rac中并发进程的slave进程(可查看这里),查询Gv$视图需要该进程,目前我猜测是由于修改成了异步IO,导致原来正在执行的pz99进程挂死,sqlplus登录时无法查询gv$视图。如果此时kill掉PZ99的进程,说不定不用重启主机,关闭异步IO,就能解决故障了。
不过,还是想不通PZ99进程和异步IO间的关系,哪位如果也遇到这个情况,能说说当时你是怎么处理的么?
3条评论
没down库就改了? 修改asyncio时两个节点都改了?
re dhhb:由于os系统默认是开启异步IO的,数据库默认也是开启异步IO,只是oracle用户所在的dba组没MLOCK权限,因此在线操作了。
太好了,真行啊