今天遇到个rac的异步IO的问题,环境是oracle 10201的双节点rac,存储使用ASM,机器是HP 的PA的机器,操作系统是11.23.
在udump下的日志中有大量的报错:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
oracle@rp8440a:/oracle/oracle/admin/gzxnh/bdump> cat gzxnh1_ora_19552.trc /oracle/oracle/admin/gzxnh/bdump/gzxnh1_ora_19552.trc Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - 64bit Production With the Partitioning, Real Application Clusters, OLAP and Data Mining options ORACLE_HOME = /oracle/oracle/product/10.2.0/db_2 System name: HP-UX Node name: rp8440a Release: B.11.23 Version: U Machine: 9000/800 Instance name: gzxnh1 Redo thread mounted by this instance: 1 Oracle process number: 0 Unix process pid: 19552, image: oracle@rp8440a Ioctl ASYNC_CONFIG error, errno = 1 |
看到这个报错,觉得很轻松,之前也遇到过,是oracle用户的dba组在操作系统上没有MLOCK的权限,在这里的文章,我也写到过。由于hp的11.23的系统,默认是开asyncdsk的。于是就用root给oracle用户的dba组加MLOCK权限了。
可是,等加完之后,问题就出现了:登录的提示就是不完整的,登录之后select所有信息,都会报没有连接数据库:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
oracle@rp8440a:/oracle/oracle/admin/gzxnh/udump> sqlplus "/ as sysdba" SQL*Plus: Release 10.2.0.1.0 - Production on Thu Dec 3 10:28:47 2009 Copyright (c) 1982, 2005, Oracle. All rights reserved. Connected. SQL> select * from dual; select * from dual * ERROR at line 1: ORA-01012: not logged on SQL> conn /a as sysdba Connected to an idle instance. SQL> SQL> SQL> exit |
检查环境变量的正常的,确实登录的是gzxnh1的实例,而且,oracle的进程都是在的:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 |
oracle@rp8440a:/oracle/oracle/admin/gzxnh/udump> ps -ef |grep ora _ oracle 24562 13823 0 10:28:41 pts/1 0:00 grep ora_ oracle 26691 1 0 00:37:55 ? 0:00 ora_q002_gzxnh1 oracle 18836 1 0 Nov 28 ? 4:42 ora_cjq0_gzxnh1 oracle 18865 1 0 Nov 28 ? 4:15 ora_mmon_gzxnh1 oracle 19044 1 0 Nov 28 ? 0:09 ora_rbal_gzxnh1 oracle 18614 1 0 Nov 28 ? 1:13 ora_pmon_gzxnh1 oracle 19598 1 0 Nov 28 ? 273:22 ora_j000_gzxnh1 oracle 18800 1 0 Nov 28 ? 2:03 ora_ckpt_gzxnh1 oracle 18927 1 0 Nov 28 ? 13:13 ora_lck0_gzxnh1 oracle 18643 1 0 Nov 28 ? 3:21 ora_diag_gzxnh1 oracle 18805 1 0 Nov 28 ? 1:42 ora_smon_gzxnh1 oracle 18672 1 0 Nov 28 ? 0:10 ora_psp0_gzxnh1 oracle 18674 1 0 Nov 28 ? 22:01 ora_lmon_gzxnh1 oracle 18771 1 0 Nov 28 ? 2:56 ora_lgwr_gzxnh1 oracle 19013 1 0 Nov 28 ? 0:11 ora_asmb_gzxnh1 oracle 18834 1 0 Nov 28 ? 0:00 ora_reco_gzxnh1 oracle 18867 1 0 Nov 28 ? 7:25 ora_mmnl_gzxnh1 oracle 19835 1 0 Nov 28 ? 0:14 ora_q000_gzxnh1 oracle 18898 1 0 Nov 28 ? 0:00 ora_s000_gzxnh1 oracle 18896 1 0 Nov 28 ? 0:00 ora_d000_gzxnh1 oracle 18707 1 0 Nov 28 ? 10:52 ora_lms0_gzxnh1 oracle 18703 1 0 Nov 28 ? 18:31 ora_lmd0_gzxnh1 oracle 18736 1 0 Nov 28 ? 8:54 ora_lms1_gzxnh1 oracle 19461 1 0 22:00:01 ? 0:05 ora_j001_gzxnh1 oracle 18738 1 0 Nov 28 ? 0:25 ora_mman_gzxnh1 oracle 18769 1 0 Nov 28 ? 3:22 ora_dbw0_gzxnh1 oracle 27708 1 0 10:54:35 ? 1:47 ora_pz99_gzxnh1 oracle 15114 1 0 Dec 1 ? 0:00 ora_o002_gzxnh1 oracle 19541 1 0 Nov 28 ? 0:14 ora_qmnc_gzxnh1 oracle@rp8440a:/oracle/oracle/admin/gzxnh/udump> |
怎么回事?为何数据库登录有问题?登录上去报错没有连接,再次登录就报错connect to an idle instance?
进一步检查oracle的alertlog:
1 2 3 4 5 6 7 8 9 10 11 12 13 |
Thread 1 advanced to log sequence 332 Current log# 6 seq# 332 mem# 0: +DATA1/gzxnh/onlinelog/group_6.592.704141411 Thu Dec 3 10:24:47 2009 Process PZ98 died, see its trace file Thu Dec 3 10:32:55 2009 Process PZ99 died, see its trace file Process PZ99 died, see its trace file Thu Dec 3 10:33:57 2009 Process PZ99 died, see its trace file Thu Dec 3 10:34:26 2009 Process PZ99 died, see its trace file Process PZ99 died, see its trace file oracle@rp8440a:/oracle/oracle/admin/gzxnh/bdump> |
PZ99的进程死了?虽然在alertlog中没看到trace file指的是哪个,但是通过bdump下的最新一个trace文件可以看到:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 |
oracle@rp8440a:/oracle/oracle/admin/gzxnh/bdump> cat gzxnh1 _ora _19598.trc /oracle/oracle/admin/gzxnh/bdump/gzxnh1_ora_19598.trc Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - 64bit Production With the Partitioning, Real Application Clusters, OLAP and Data Mining options ORACLE_HOME = /oracle/oracle/product/10.2.0/db_2 System name: HP-UX Node name: rp8440a Release: B.11.23 Version: U Machine: 9000/800 Instance name: gzxnh1 Redo thread mounted by this instance: 1 Oracle process number: 0 Unix process pid: 19598, image: oracle@rp8440a …… *** 2009-12-03 10:34:26.934 *** ACTION NAME:() 2009-12-03 10:34:26.934 *** MODULE NAME:() 2009-12-03 10:34:26.934 *** SERVICE NAME:(SYS$USERS) 2009-12-03 10:34:26.934 *** CLIENT ID:() 2009-12-03 10:34:26.934 *** SESSION ID:(3251.32206) 2009-12-03 10:34:26.934 kxfpg1srv could not start local PZ99 kxfpg1sg GV$ query failed to get slave on inst 1 kxfpg1srv could not start PZ99, inst 2 kxfpg1sg GV$ query failed to get slave on inst 2 kxfpg1srv could not start local PZ99 kxfpg1sg GV$ query failed to get slave on inst 1 kxfpg1srv could not start PZ99, inst 2 kxfpg1sg GV$ query failed to get slave on inst 2 oracle@rp8440a:/oracle/oracle/admin/gzxnh/bdump> |
在rac2上看:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
oracle@rp8440b:/oracle/oracle/admin/gzxnh/bdump> cat gzxnh2_ora_10642.trc /oracle/oracle/admin/gzxnh/bdump/gzxnh2_ora_10642.trc Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - 64bit Production With the Partitioning, Real Application Clusters, OLAP and Data Mining options ORACLE_HOME = /oracle/oracle/product/10.2.0/db_2 System name: HP-UX Node name: rp8440b Release: B.11.23 Version: U Machine: 9000/800 Instance name: gzxnh2 Redo thread mounted by this instance: 2 Oracle process number: 0 Unix process pid: 10642, image: oracle@rp8440b Died during process startup with error 27061 (seq=596) OPIRIP: Uncaught error 27061. Error stack: ORA-27061: waiting for async I/Os failed oracle@rp8440b:/oracle/oracle/admin/gzxnh/bdump |
看来确实和rac上刚刚开启的异步IO有关系了,既然这样,赶紧还是把异步IO关闭了吧:在操作系统级别用sam将asyncdsk改成unused,将/etc/privgroup文件删除,重启主机后,数据库启动,恢复正常。
PZ99是rac中并发进程的slave进程(可查看这里),查询Gv$视图需要该进程,目前我猜测是由于修改成了异步IO,导致原来正在执行的pz99进程挂死,sqlplus登录时无法查询gv$视图。如果此时kill掉PZ99的进程,说不定不用重启主机,关闭异步IO,就能解决故障了。
不过,还是想不通PZ99进程和异步IO间的关系,哪位如果也遇到这个情况,能说说当时你是怎么处理的么?
3条评论
没down库就改了? 修改asyncio时两个节点都改了?
re dhhb:由于os系统默认是开启异步IO的,数据库默认也是开启异步IO,只是oracle用户所在的dba组没MLOCK权限,因此在线操作了。
太好了,真行啊