oracle在shutdown immediate的时候,如果遇到活动的事务,比如job还在跑,那么该shutdown操作就会挂起,直到job完成为止。我们在alertlog中可以看到:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
ARC0: Completed archiving log 2 thread 1 sequence 68549 Thu May 20 01:18:52 2010 Shutting down instance: further logons disabled Shutting down instance (immediate) License high water mark = 530 Thu May 20 01:23:59 2010 Active call for process 6622 user 'oracle' program 'oracle@my_dc01 (J000)' Active call for process 6246 user 'oracle' program 'oracle@my_dc01 (TNS V1-V3)' SHUTDOWN: waiting for active calls to complete. Thu May 20 02:13:28 2010 Thread 1 advanced to log sequence 68551 Current log# 1 seq# 68551 mem# 0: /dev/vg_ora1/rredo_256m_01 Current log# 1 seq# 68551 mem# 1: /dev/vg_ora2/rredo_256m_11 Current log# 1 seq# 68551 mem# 2: /dev/vg_ora3/rredo_256m_21 Thu May 20 02:13:28 2010 ARC0: Evaluating archive log 3 thread 1 sequence 68550 ARC0: Beginning to archive log 3 thread 1 sequence 68550 Creating archive destination LOG_ARCHIVE_DEST_1: '/oraarchlog/arch_1_68550.arc' ARC0: Completed archiving log 3 thread 1 sequence 68550 |
如果在一小时内,job没有跑完,那么该shutdown的操作就会自动被取消。这些动作,在alertlog中为:
1 2 3 4 5 6 7 8 9 10 11 12 |
Thu May 20 02:19:33 2010 SHUTDOWN: Active sessions prevent database close operation Thu May 20 03:15:08 2010 Thread 1 advanced to log sequence 68552 Current log# 4 seq# 68552 mem# 0: /dev/vg_ora1/rredo_256m_04 Current log# 4 seq# 68552 mem# 1: /dev/vg_ora2/rredo_256m_14 Current log# 4 seq# 68552 mem# 2: /dev/vg_ora3/rredo_256m_24 Thu May 20 03:15:08 2010 ARC0: Evaluating archive log 1 thread 1 sequence 68551 ARC0: Beginning to archive log 1 thread 1 sequence 68551 Creating archive destination LOG_ARCHIVE_DEST_1: '/oraarchlog/arch_1_68551.arc' Thu May 20 03:15:27 2010 |
这是一个非常令人头痛的问题,尤其是在没有dba介入的双机热备切换操作中。
上周就出现了这样一个故障,某省做4506的交换机切换测试,交换机是核心交换机,有2台,是主备模式的架构,用于连接应用服务器和数据库服务器,数据库服务器的架构为HP SG的双机热备。每个数据库主机有2个网卡,lan1和lan3做冗余,lan1接4506-1,lan3接4506-2。此次切换测试的操作是:
1、先down掉4506-1,工作在4506-2,
2、再启动4506-1,down掉4506-2,工作在4506-1,
3、最后再启动4506-2。
本来,这次操作和数据库没什么关系,因为在同一时间点内,肯定有至少1台交换机在线。
但是问题出现了,由于在进程切换测试时,当第2步的时候,估计启动4506-1时间太短,还没启动完整,就把4506-2 down掉了。于是就出现了同一时间内,没有交换机在线了。在syslog中,我们可以看到:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
May 20 00:56:04 my_dc01 cmnetd[27383]: lan1 failed May 20 00:56:04 my_dc01 cmnetd[27383]: lan1 switching to lan3 May 20 00:56:04 my_dc01 cmnetd[27383]: Subnet 10.203.92.0 switching from lan1 to lan3 May 20 00:56:04 my_dc01 cmnetd[27383]: Subnet 10.203.92.0 switched from lan1 to lan3 May 20 00:56:04 my_dc01 cmcld[27375]: Local switch has occurred since net_id 0x3 was not found on subnet 10.203.92.0. May 20 00:56:04 my_dc01 cmnetd[27383]: lan1 switched to lan3 May 20 01:18:52 my_dc01 cmnetd[27383]: lan3 failed May 20 01:18:52 my_dc01 cmnetd[27383]: Subnet 10.203.92.0 down May 20 01:18:52 my_dc01 cmcld[27375]: Subnet 10.203.92.0 in package oracle is down. May 20 01:18:52 my_dc01 cmcld[27375]: Failing package oracle on node my_dc01 due to subnet failure. May 20 01:18:52 my_dc01 cmcld[27375]: Request from node my_dc01 to fail package oracle on node my_dc01. May 20 01:18:52 my_dc01 cmcld[27375]: Executing '/etc/cmcluster/pkg1/pkg1.cntl stop' for package oracle, as service PKG*5377. May 20 01:18:52 my_dc01 syslog: cmhaltserv db_service May 20 01:18:52 my_dc01 su: + tty?? root-oracle May 20 01:19:20 my_dc01 cmnetd[27383]: lan1 recovered May 20 01:19:20 my_dc01 cmnetd[27383]: Subnet 10.203.92.0 switching from lan3 to lan1 May 20 01:19:20 my_dc01 cmnetd[27383]: Subnet 10.203.92.0 switched from lan3 to lan1 May 20 01:19:20 my_dc01 cmnetd[27383]: lan3 switched to lan1 May 20 02:14:22 my_dc01 cmnetd[27383]: lan3 recovered |
我们在log中看到 00:56,发生lan1 fail,准备切换到做冗余的lan3上。到01:18,lan3 fail,因此sg认为此时没有一个正常的网络环境,做冗余的2个网卡都down了,需要切换到数据库备机。01:19 lan1恢复,工作在lan 1上;在02:14 lan3恢复。
根据syslog,我怀疑网络工程师在操作上述的第2步时,在01:18不是先起4506-1,再down 4506-2,而是操作成了先down 4506-2(造成01:18 lan3 fail),再起4506-1(在01:19 lan1 恢复)。因此,syslog中是先出现lan3 fail——此时却没有在线的交换机,所以发生sg mc切换。
发生mc切换,如果数据库里面没有active事务,那也就没事了,但是偏偏有active事务,于是,shutdown命令发起1小时后,数据库自动取消了此次shutdown操作。
如果数据库自动取消了shutdown操作,那也就没事了,偏偏在mc切换的脚本中有这样一个判断:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 |
ora_7_3_3=yes SID_NAME=locldb ORACLE_HOME=/oracle/app/oracle/product/9.2.0 SQLNET=yes SQLNET_NAME= SQLNET_PASS= MONITOR_INTERVAL=30 set -A MONITOR_PROCESSES ora_pmon_${SID_NAME} ora_ckpt_${SID_NAME} ora_dbw0_${SID_NAME} ora_reco_${SID_NAME} ora_lgwr_${SID_NAME} PACKAGE_NAME=oracle TIME_OUT=60 HOST=`hostname` DATE=`date` PATH=${ORACLE_HOME}/bin:/sbin:/usr/bin:/usr/sbin:/etc:/bin export ORACLE_SID=${SID_NAME} export ORACLE_HOME function shutdown_cmds { if [[ ${ora_7_3_3} = yes ]] then su oracle -c ${ORACLE_HOME}/bin/sqlplus <<EOF connect / as sysdba shutdown immediate EOF su oracle -c ${ORACLE_HOME}/bin/lsnrctl <<EOF stop EOF else su oracle -c ${ORACLE_HOME}/bin/sqldba lmode=y <<EOF connect internal shutdown immediate EOF fi if [[ $? != 0 ]] then print "Oracle shutdown failed!" else print "Oracle shutdown done!" fi } |
我们这里看到$? != 0 的判断,这个判断是上述最后一个语句执行成功的话,返回为0,如果不成功,返回为非0。而在上面的mc切换脚本中,先后包含了2个操作步骤,先是shutdown数据库,再是stop 侦听。
在前一步,我们确实因为active事务,在运行shutdown命令1小时后失败,但是后续的操作stop侦听却是成功的,于是$? != 0根据最后一个语句判断返回值为0为真,认为语句执行成功(含数据库shutdown成功和stop侦听成功),准备切换到备机!
于是mc的log中:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 |
May 20 01:18:52 - Node "tj_dc01": Halting service db_service *** /etc/cmcluster/pkg1/oracle.sh called with shutdown argument! *** "tj_dc01": Shutting down Oracle SESSION tjmisc at Thu May 20 01:18:52 EAT 2010 SQL*Plus: Release 9.2.0.8.0 - Production on Thu May 20 01:18:52 2010 Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved. Enter user-name: Enter password: Connected to: Oracle9i Enterprise Edition Release 9.2.0.8.0 - 64bit Production With the Partitioning, OLAP and Oracle Data Mining options JServer Release 9.2.0.8.0 - Production SQL> ORA-01013: user requested cancel of current operation SQL> Disconnected from Oracle9i Enterprise Edition Release 9.2.0.8.0 - 64bit Production With the Partitioning, OLAP and Oracle Data Mining options JServer Release 9.2.0.8.0 - Production LSNRCTL for HPUX: Version 9.2.0.8.0 - Production on 20-MAY-2010 02:19:33 Copyright (c) 1991, 2006, Oracle Corporation. All rights reserved. Welcome to LSNRCTL, type "help" for information. LSNRCTL> Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=10.203.92.11)(PORT=1521))) The command completed successfully LSNRCTL> Oracle shutdown done! May 20 02:19:33 - Node "tj_dc01": Remove IP address 10.203.92.11 from subnet 10.203.92.0 May 20 02:19:33 - Node "tj_dc01": Unmounting filesystem on /dev/vg_arch/lvol1 May 20 02:19:34 - Node "tj_dc01": Deactivating volume group vg_ora1 vgchange: Couldn't deactivate volume group "vg_ora1": Device busy May 20 02:19:34 - vgchange -a n vg_ora1 failed, trying again. vgchange: Couldn't deactivate volume group "vg_ora1": Device busy May 20 02:19:35 - vgchange -a n vg_ora1 failed, trying again. vgchange: Couldn't deactivate volume group "vg_ora1": Device busy ERROR: Function deactivate_volume_group ERROR: Failed to deactivate vg_ora1 May 20 02:19:36 - Node "tj_dc01": Deactivating volume group vg_ora2 vgchange: Couldn't deactivate volume group "vg_ora2": Device busy May 20 02:19:36 - vgchange -a n vg_ora2 failed, trying again. vgchange: Couldn't deactivate volume group "vg_ora2": |
在切换时由于数据库没down下去,vg还是激活状态,因此对vg的操作失败。mc也就没切到db02上去。于是在db02上的vg没挂过去,数据库没起来,浮动IP也没起来,侦听也没起来。
当前实际的情况是:在db01上oracle是启动的,但是浮动IP没有启动——因为切换包的时候,浮动IP会从db01切换到db02上,但是由于切换出现问题,数据库没有切换过去,浮动IP切换失败——浮动IP在2个db主机上都没有了。
再次启动mc的包,将浮动IP启动在db01上。原本以为没事,结果又发现job的守护进程cjq0不见了:
1 2 3 4 5 6 7 8 9 10 11 12 13 |
oracle@tj_dc01:/oracle/app/oracle/admin/tjmisc/bdump > ps -ef |grep ora_ oracle 27481 1 0 Jan 28 ? 2:14 ora_arc1_tjmisc oracle 27471 1 0 Jan 28 ? 57:13 ora_ckpt_tjmisc oracle 27465 1 9 Jan 28 ? 95:41 ora_dbw2_tjmisc oracle 27467 1 7 Jan 28 ? 90:34 ora_dbw3_tjmisc oracle 27473 1 0 Jan 28 ? 5:35 ora_smon_tjmisc oracle 27459 1 0 Jan 28 ? 79:44 ora_pmon_tjmisc oracle 27469 1 0 Jan 28 ? 1198:19 ora_lgwr_tjmisc oracle 27479 1 0 Jan 28 ? 97:40 ora_arc0_tjmisc oracle 27475 1 0 Jan 28 ? 0:01 ora_reco_tjmisc oracle 27461 1 8 Jan 28 ? 492:13 ora_dbw0_tjmisc oracle 27463 1 8 Jan 28 ? 119:42 ora_dbw1_tjmisc oracle 19623 7671 0 05:16:37 pts/tc 0:00 grep ora_ |
这个问题倒是以前没发现过,原来active进程prevent shutdown操作,等待1小时之后,数据库自动取消shutdown操作,会导致job的守护进程消失!
没办法,用cmhaltcl的方式down掉整个cluster,在cmruncl的方式重新启动cluster,带起数据库,cjq0的进程恢复。
故障总结:
1、进行网络测试的时候,可能没有按照事先做好方案,没有严格按照步骤执行,操作错了执行步骤。——根据syslog判断。
2、mc切换的脚本有缺陷,对$? != 0的判断,导致数据库没down下去,mc却认为数据库已经shutdown成功。去做后续的操作,包括去激活vg和虚拟IP的切换。——根据mc的切换log和切换shell判断。
3、shutdown immediate时如果有active事务,如果事务无法完成,等1小时后,数据库自动取消shutdown操作,会导致cjq0进程消失。需要重启数据库。——根据alertlog判断。对付这个,最好就是去kill在alertlog中显示的active process(会有进程号显示)。
5条评论
“发生mc切换,如果数据库里面有active事务,那也就没事了,但是偏偏有active事务,于是,shutdown命令发起1小时后,数据库自动取消了此次shutdown操作。”
应该是:“发生mc切换,如果数据库里没有active事务… …”吧!?
re TQ:谢谢,是写错了。已经修改了。
最近遇上差不多的情况,脚本自动shutdown immediate数据库不成功
Fri May 11 06:38:09 2012
SHUTDOWN: Active processes prevent shutdown operation
Fri May 11 06:43:10 2012
SHUTDOWN: Active processes prevent shutdown operation
请教下,if [[ $? != 0 ]]真能判断shutdown失败吗?
系统是RHEL 4,谢谢
re andyoung:if [[ $? != 0 ]] 能判断,不过由于shutdown immediate是否成功的结果出来要1小时以后了,所以基本没有意义。
那还有什么好方法吗