Oracle has long been trumpeting how highly available its RAC is and how it can deliver application-transparent failover. In real-world use, though, the conditions for achieving that transparency are remarkably strict.
Let me start with an outage. One day at lunch I suddenly got a call saying that application connections to the database were hanging, and that test connections from a client also hung with no response for a long time.
The database in this province is a two-node RAC. After logging in I found that rac2 was already down, with only its listener still alive. The rac2 alert log showed:
Tue Apr 29 10:12:10 2008
Thread 2 advanced to log sequence 11999
  Current log# 8 seq# 11999 mem# 0: /dev/vg_rac1/rG1_redo_512m_204
  Current log# 8 seq# 11999 mem# 1: /dev/vg_rac2/rG2_redo_512m_214
Tue Apr 29 10:12:10 2008
  Current log# 8 seq# 11999 mem# 2: /dev/vg_rac3/rG3_redo_512m_224
Tue Apr 29 10:12:10 2008
ARC1: Evaluating archive log 6 thread 2 sequence 11998
Tue Apr 29 10:12:10 2008
ARC1: Beginning to archive log 6 thread 2 sequence 11998
Creating archive destination LOG_ARCHIVE_DEST_1: '/archlog2/fjmisc2_2_11998.arc'
Tue Apr 29 10:12:32 2008
ARC1: Completed archiving log 6 thread 2 sequence 11998
Tue Apr 29 10:42:18 2008
alter tablespace DATA_PORTAL add datafile '/dev/vg_rac1/rG1_data_2g_185' size 2047m
Tue Apr 29 10:42:50 2008
Completed: alter tablespace DATA_PORTAL add datafile '/dev/vg
Tue Apr 29 12:16:59 2008
Communications reconfiguration: instance 0
Waiting for clusterware split-brain resolution
Tue Apr 29 12:17:32 2008
Trace dumping is performing id=[80429121659]
Tue Apr 29 12:18:09 2008
Trace dumping is performing id=[80429121745]
Tue Apr 29 12:27:04 2008
Errors in file /oracle/app/oracle/admin/fjmisc/bdump/fjmisc2_lmon_6757.trc:
ORA-29740: evicted by member 0, group incarnation 5
Tue Apr 29 12:27:04 2008
LMON: terminating instance due to error 29740
Tue Apr 29 12:27:04 2008
Errors in file /oracle/app/oracle/admin/fjmisc/bdump/fjmisc2_pmon_6753.trc:
ORA-29740: evicted by member , group incarnation
Tue Apr 29 12:27:04 2008
Errors in file /oracle/app/oracle/admin/fjmisc/bdump/fjmisc2_dbw0_6765.trc:
ORA-29740: evicted by member , group incarnation
Tue Apr 29 12:27:05 2008
Errors in file /oracle/app/oracle/admin/fjmisc/bdump/fjmisc2_lck0_6793.trc:
ORA-29740: evicted by member , group incarnation
Instance terminated by LMON, pid = 6757
The error reported is ORA-29740, which generally points to a communication problem between the RAC instances causing one node to be evicted from the cluster group. For the detailed analysis, we need the trace file:
*** 2008-04-29 12:16:59.317
kjxgrgetresults: Detect reconfig from 0, seq 4, reason 3
kjxgrrcfgchk: Initiating reconfig, reason 3
*** 2008-04-29 12:17:04.332
kjxgmrcfg: Reconfiguration started, reason 3
kjxgmcs: Setting state to 4 0.
*** 2008-04-29 12:17:04.347
     Name Service frozen
kjxgmcs: Setting state to 4 1.
*** 2008-04-29 12:17:04.392
kjxgrrecp2: Waiting for split-brain resolution, upd 0, seq 5
*** 2008-04-29 12:27:04.479
Voting results, upd 0, seq 5, bitmap: 0
*** 2008-04-29 12:27:04.479
kjxgrdtrt: Evicted by 0, seq (5, 5)
IMR state information
  Member 1, thread 2, state 4, flags 0x00a1
  RR seq 5, propstate 3, pending propstate 0
  rcfg rsn 3, rcfg time 2235139911, mem ct 2
  master 1, master rcfg time 2235139911
  Member information:
    Member 0, incarn 4, version 122818100
      thrd 1, prev thrd 65535, status 0x0047, err 0x0002
      valct 0
    Member 1, incarn 4, version 134210
      thrd 2, prev thrd 65535, status 0x0107, err 0x0000
      valct 2
Group name: FJMISC
Member id: 1
Cached SKGXN event: 0
Group State:
  State: 4 1
  Commited Map: 0 1
  New Map: 0 1
  SKGXN Map: 0 1
  Master node: 0
  Memcnt 2   Rcvcnt 0
  Substate Proposal: false
  Inc Proposal:
    incarn 0  memcnt 0  master 0
    proposal false  matched false
    map:
  Master Inc State:
    incarn 0  memcnt 0  agrees 0  flag 0x1
    wmap:
    nmap:
    ubmap:
Submitting asynchronized dump request [1]
error 29740 detected in background process
ORA-29740: evicted by member 0, group incarnation 5
ksuitm: waiting for [5] seconds before killing DIAG
oracle@FJ_DB02:/oracle/app/oracle/admin/fjmisc/bdump >
Here we can see the crash was caused by ORA-29740 with reason 3. MetaLink (Note:219361.1) explains the reason codes:
For troubleshooting ORA-29740 errors, the 'reason' will be very important.
In the above example, the first section indicates the reason for the
initiated reconfiguration. The reasons are as follows:

   Reason 0 = No reconfiguration
   Reason 1 = The Node Monitor generated the reconfiguration.
   Reason 2 = An instance death was detected.
   Reason 3 = Communications Failure
   Reason 4 = Reconfiguration after suspend

For ORA-29740 errors, you will most likely see reasons 1, 2, or 3.

-----------------------------------------------------------------------------
Reason 1: The Node Monitor generated the reconfiguration.

This can happen if:
   a) An instance joins the cluster
   b) An instance leaves the cluster
   c) A node is halted

It should be easy to determine the cause of the error by reviewing the alert
logs and LMON trace files from all instances. If an instance joins or leaves
the cluster or a node is halted then the ORA-29740 error is not a problem.

ORA-29740 evictions with reason 1 are usually expected when the cluster
membership changes. Very rarely are these types of evictions a real problem.

If you feel that this eviction was not correct, do a search in Metalink or
the bug database for: ORA-29740 'reason 1'

Important files to review are:
   a) Each instance's alert log
   b) Each instance's LMON trace file
   c) Statspack reports from all nodes leading up to the eviction
   d) Each node's syslog or messages file
   e) iostat output before, after, and during evictions
   f) vmstat output before, after, and during evictions
   g) netstat output before, after, and during evictions

There is a tool called "OS Watcher" that is being developed that helps gather
this information. For more information on "OS Watcher" see Note 301137.1
"OS Watcher User Guide".

-----------------------------------------------------------------------------
Reason 2: An instance death was detected.

This can happen if:
   a) An instance fails to issue a heartbeat to the control file.

When the heartbeat is missing, LMON will issue a network ping to the instance
not issuing the heartbeat. As long as the instance responds to the ping, LMON
will consider the instance alive. If, however, the heartbeat is not issued
for the length of time of the control file enqueue timeout, the instance is
considered to be problematic and will be evicted.

Common causes for an ORA-29740 eviction (Reason 2):
   a) NTP (Time changes on cluster) - usually on Linux, Tru64, or IBM AIX
   b) Network Problems (SAN).
   c) Resource Starvation (CPU, I/O, etc..)
   d) An Oracle bug.

Common bugs for reason 2 evictions:
   Bug 2820871 - Abrupt time adjustments can crash instance with ORA-29740
                 (Reason 2) (Linux Only) Fixed-Releases: 9204+ A000
   Bug 3917158 - ORA-29740 and a false instance eviction can occur (Reason 2)
                 (IBM AIX Only) Fixed Releases: 9206+

If you feel that this eviction was not correct, do a search in Metalink or
the bug database for: ORA-29740 'reason 2'

Important files to review are:
   a) Each instance's alert log
   b) Each instance's LMON trace file
   c) Statspack reports from all nodes leading up to the eviction
   d) The CKPT process trace file of the evicted instance
   e) Other bdump or udump files...
   f) Each node's syslog or messages file
   g) iostat output before, after, and during evictions
   h) vmstat output before, after, and during evictions
   i) netstat output before, after, and during evictions

There is a tool called "OS Watcher" that is being developed that helps gather
this information. For more information on "OS Watcher" see Note 301137.1
"OS Watcher User Guide".

-----------------------------------------------------------------------------
Reason 3: Communications Failure.

This can happen if:
   a) The LMON processes lose communication between one another.
   b) One instance loses communications with the LMS, LMD, process of
      another instance.
   c) The LCK processes lose communication between one another.
   d) A process like LMON, LMD, LMS, or LCK is blocked, spinning, or stuck
      and is not responding to remote requests.

In this case the ORA-29740 error is recorded when there are communication
issues between the instances. It is an indication that an instance has been
evicted from the configuration as a result of IPC send timeout.

A communications failure between processes across instances will also
generate a ORA-29740 with reason 3. When this occurs, the trace file of the
process experiencing the error will print a message:

   Reporting Communication error with instance:

If communication is lost at the cluster layer (for example, network cables
are pulled), the cluster software may also perform node evictions in the
event of a cluster split-brain. Oracle will detect a possible split-brain
and wait for cluster software to resolve the split-brain. If cluster
software does not resolve the split-brain within a specified interval,
Oracle proceeds with evictions.

Oracle Support has seen cases where resource starvation (CPU, I/O, etc...)
can cause an instance to be evicted with this reason code. The LMON or LMD
process could be blocked waiting for resources and not respond to polling by
the remote instance(s). This could cause that instance to be evicted. If you
have a statspack report available from the time just prior to the eviction
on the evicted instance, check for poor I/O times and high CPU utilization.
Poor I/O times would be an average read time of > 20ms.

Common causes for an ORA-29740 eviction (Reason 3):
   a) Network Problems.
   b) Resource Starvation (CPU, I/O, etc..)
   c) Severe Contention in Database.
   d) An Oracle bug.
OK, so we can basically narrow the problem down to the network. Next, the system syslog:
Apr 29 11:59:10 FJ_DB02 su: + 0 root-oracle
Apr 29 12:11:02 FJ_DB02 cmcld: lan1 failed
Apr 29 12:11:02 FJ_DB02 cmcld: Subnet 10.203.17.0 switching from lan1 to lan2
Apr 29 12:11:02 FJ_DB02 cmcld: lan1 switching to lan2
Apr 29 12:11:02 FJ_DB02 cmcld: Subnet 10.203.17.0 switched from lan1 to lan2
Apr 29 12:11:02 FJ_DB02 cmcld: lan1 switched to lan2
Apr 29 12:11:22 FJ_DB02 cmcld: lan1 recovered
Apr 29 12:11:22 FJ_DB02 cmcld: Subnet 10.203.17.0 switching from lan2 to lan1
Apr 29 12:11:22 FJ_DB02 cmcld: lan2 switching to lan1
Apr 29 12:11:22 FJ_DB02 cmcld: Subnet 10.203.17.0 switched from lan2 to lan1
Apr 29 12:11:22 FJ_DB02 cmcld: lan2 switched to lan1
Apr 29 12:14:15 FJ_DB02 vmunix: NFS server FJ_DB01 not responding still trying
Apr 29 12:14:59 FJ_DB02 cmcld: HB connection to 10.203.17.1 not responding, closing
Apr 29 12:14:59 FJ_DB02 cmcld: GS connection to 10.203.17.1 not responding, closing
Apr 29 12:18:59 FJ_DB02 cmcld: GS connection to 10.203.17.1 is responding
Apr 29 12:19:26 FJ_DB02 vmunix: NFS server FJ_DB01 ok
Apr 29 12:20:59 FJ_DB02 cmcld: HB connection to 10.203.17.1 is responding
Apr 29 12:22:47 FJ_DB02 xntpd[3717]: synchronized to 218.200.246.196, stratum=1
The host had indeed hit a network problem: from rac2, node rac1 (10.203.17.1) was unreachable, and at that point the application hung. My suspicion was that application connections were landing on rac2, whose listener was still up even though the instance had crashed. With local and remote listeners configured on the server side, and load balancing plus failover configured on the client side, a connection that randomly arrives at an instance triggers a server-side check of the other node's load (this is what the remote listener enables); since the network was down, the other node's state could not be determined, and the connection simply hung. After shutting down the listener on rac2, the application recovered!!
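As an aside, here is a minimal sketch of the server-side registration such a setup implies. The parameter values are illustrative assumptions (instance names, listener alias), not copied from this system:

# init.ora / spfile entries (hypothetical values)
fjmisc1.local_listener = '(ADDRESS=(PROTOCOL=TCP)(HOST=10.203.17.1)(PORT=1521))'
fjmisc2.local_listener = '(ADDRESS=(PROTOCOL=TCP)(HOST=10.203.17.2)(PORT=1521))'
*.remote_listener      = 'LISTENERS_FJMISC'   # tnsnames alias listing both listeners

With remote_listener set, PMON registers each instance with the other node's listener as well, so either listener can redirect an incoming connection to the less-loaded node. That cross-node load check is exactly the step that hung once the 10.203.17.0 network broke.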
Further reading of the host syslog showed that the network had recovered at 12:18 and connections were responding again. Pinging rac1 also worked. After checking that the cluster software was fine, I started to bring the rac2 instance back up.
The trouble was that the startup hit problems again. The connections on rac1 had all been fine, and the application was connecting to rac1 without issue, yet during the restart every application session hung. Figuring the two nodes were busy synchronizing memory and recovering from the split-brain, and that both rac1 and rac2 would be slow during that window, I asked the application side to wait a while... Fifteen minutes later rac2 still had not come up, and its instance was terminated by LMON once again.
Ugh! So the network still had problems!! I didn't dare touch anything else during the day for fear of hitting the business again, so the tinkering resumed at night...
That night I pulled in a network engineer. The "FJ_DB02 cmcld: GS connection to 10.203.17.1 not responding, closing" errors were still appearing, so we went over the network carefully, starting with which interface the instance was actually using for the interconnect:
$ sqlplus "/ as sysdba"
SQL> oradebug setmypid
SQL> oradebug ipc
SQL> exit;
The resulting trace file showed:
SSKGXPT 0x1aa5e4 flags SSKGXPT_READPENDING     info for network 0
        socket no 9     IP 10.203.17.2  UDP 57854   <------ here
        sflags SSKGXPT_UP
        info for network 1
        socket no 0     IP 0.0.0.0      UDP 0
        sflags SSKGXPT_DOWN
        context timestamp 0
        no ports
   sconno   accono   ertt  state  seq#  sent  async  sync  rtrans  acks
   ach  accono  sconno  admno  state  lstack  nxtseq  rcv  rtrans  acks  credts
So the interconnect was bound to 10.203.17.2, the same subnet the application connects over, not the private network. Checking further:
SQL> show parameter cluster

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
cluster_database                     boolean     TRUE
cluster_database_instances           integer     2
cluster_interconnects                string
And cluster_interconnects was empty, so the private NIC was not configured there either.
The RAC network in this province is laid out as follows:
lan0 : 192.168.0.0
lan1 : 10.203.17.0
lan2 : 0.0.0.0
Here lan2 is the standby for lan1. The lan1 interfaces of the two nodes are connected through a pair of mutually redundant switches, while lan0 is a direct crossover-cable link. The application connects to the database over the 10.203.17.0 subnet, so lan0 was effectively unused by the database, and a problem on lan1 was enough to take rac2 down.
Further ping tests by the network engineer showed packet loss on the lan1 interface. Because the loss was small and intermittent rather than a total outage, the failover to lan2 was never triggered. The packet loss caused the split-brain, and with it the rac2 instance crash. Asking around on site, we learned that someone had in fact unplugged the lan1 cable around noon that day; it was probably not reseated firmly, which caused this outage.
Fix: reseat the lan1 cable, then replace the cable and the switch port. Inter-node communication returned to normal, and rac2 started up cleanly.
Follow-up improvements: move the direct-connected interconnect onto a switch (gigabit required), and configure cluster_interconnects=ip_private.
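A minimal sketch of that parameter change, assuming the usual fjmisc1/fjmisc2 instance names and assuming the lan0 private addresses are 192.168.0.1 and 192.168.0.2 (the actual host IPs on that subnet do not appear above):

SQL> alter system set cluster_interconnects='192.168.0.1' scope=spfile sid='fjmisc1';
SQL> alter system set cluster_interconnects='192.168.0.2' scope=spfile sid='fjmisc2';

The parameter is not dynamic, so it takes effect at the next restart of each instance; rerunning oradebug ipc afterwards should show the 192.168.0.x address instead of 10.203.17.x.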
Summary: RAC is actually quite demanding. On the hardware side it requires a gigabit private network for the interconnect, and if a NIC could carry both interconnect traffic and application connections, cluster_interconnects must be set explicitly. Load balancing can actually happen in two places. One is on the server side, driven by the configured remote and local listeners: even a client that explicitly connects to rac1 may be redirected by the server side to rac2. The other is on the client side, via tnsnames: the client picks randomly from the address list, and the server-side listener may then adjust the final target again. (This feels hard to control: with both client-side and server-side load balancing configured, it is basically unpredictable which node a new connection will end up on. On top of that, Oracle touts the balancing as CPU-load based, but how that CPU figure is computed remains a black box.) As for failover, TAF seems to apply only to SELECTs already running against a RAC instance; in-flight UPDATEs and INSERTs basically cannot be transparently failed over, and brand-new connections have nothing to do with session failover at all...
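For concreteness, here is a sketch of the client-side piece as a tnsnames.ora entry combining connect-time load balancing, connect-time failover, and TAF. The hosts match this system's public subnet, but the alias and service name are illustrative:

FJMISC =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (LOAD_BALANCE = on)   # client picks an address at random
      (FAILOVER = on)       # connect-time failover: try the next address
      (ADDRESS = (PROTOCOL = TCP)(HOST = 10.203.17.1)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = 10.203.17.2)(PORT = 1521))
    )
    (CONNECT_DATA =
      (SERVICE_NAME = fjmisc)
      (FAILOVER_MODE =
        (TYPE = SELECT)     # TAF: only running queries resume on the survivor
        (METHOD = BASIC)    # reconnect at failure time, no preconnect
      )
    )
  )

Even with TYPE=SELECT, an open transaction is rolled back on failover; the application must catch the error and reissue its DML, which is why TAF is transparent for queries only.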
5 comments
I'm also getting the errors below. Help!!:
Fri Jun 13 00:27:37 2008
IPC Send timeout detected.Sender: ospid 5000
Receiver: inst 1 binc 10682 ospid 4224
Fri Jun 13 00:27:39 2008
Errors in file d:\oracle\product\10.2.0\admin\etccms\bdump\etccms3_lms1_5000.trc:
ORA-27508: IPC error sending a message
ORA-27300: OS system dependent operation:IPCSOCK_Send failed with status: 10054
ORA-27301: OS failure message: An existing connection was forcibly closed by the remote host.
ORA-27302: failure occurred at: send_3
Fri Jun 13 00:27:39 2008
IPC Send timeout to 0.2 inc 26 for msg type 38 from opid 8
Fri Jun 13 00:27:39 2008
Communications reconfiguration: instance_number 1
Fri Jun 13 00:27:39 2008
Trace dumping is performing id=[cdmp_20080613002739]
Fri Jun 13 00:29:20 2008
Waiting for clusterware split-brain resolution
Fri Jun 13 00:39:20 2008
Errors in file d:\oracle\product\10.2.0\admin\etccms\bdump\etccms3_lmon_5540.trc:
ORA-29740: evicted by member 2, group incarnation 28
Fri Jun 13 00:39:20 2008
LMON: terminating instance due to error 29740
Fri Jun 13 00:39:20 2008
WARNING: inbound connection timed out (ORA-3136)
Fri Jun 13 00:39:20 2008
Errors in file d:\oracle\product\10.2.0\admin\etccms\bdump\etccms3_pmon_2072.trc:
ORA-29740: evicted by member , group incarnation
Fri Jun 13 00:39:20 2008
Errors in file d:\oracle\product\10.2.0\admin\etccms\bdump\etccms3_lmd0_5396.trc:
ORA-29740: evicted by member , group incarnation
Fri Jun 13 00:39:20 2008
Errors in file d:\oracle\product\10.2.0\admin\etccms\bdump\etccms3_lms1_5000.trc:
ORA-29740: evicted by member , group incarnation
Fri Jun 13 00:39:20 2008
WARNING: inbound connection timed out (ORA-3136)
Fri Jun 13 00:39:20 2008
WARNING: inbound connection timed out (ORA-3136)
Fri Jun 13 00:39:20 2008
WARNING: inbound connection timed out (ORA-3136)
…….
Errors in file d:\oracle\product\10.2.0\admin\etccms\bdump\etccms3_dbw0_5380.trc:
ORA-29740: evicted by member , group incarnation
Fri Jun 13 00:39:20 2008
WARNING: inbound connection timed out (ORA-3136)
Fri Jun 13 00:39:21 2008
Errors in file d:\oracle\product\10.2.0\admin\etccms\bdump\etccms3_lck0_4744.trc:
ORA-29740: evicted by member , group incarnation
Fri Jun 13 00:39:21 2008
WARNING: inbound connection timed out (ORA-3136)
Fri Jun 13 00:39:21 2008
WARNING: inbound connection timed out (ORA-3136)
…
Fri Jun 13 00:39:28 2008
Errors in file d:\oracle\product\10.2.0\admin\etccms\bdump\etccms3_reco_1028.trc:
ORA-29740: evicted by member , group incarnation
Instance terminated by LMON, pid = 5540
Fri Jun 13 02:20:46 2008
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Interface type 1 private 192.168.1.0 configured from OCR for use as a cluster interconnect
Interface type 1 public 10.81.4.0 configured from OCR for use as a public interface
Picked latch-free SCN scheme 2
Autotune of undo retention is turned on.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
ksdpec: called for event 13740 prior to event group initialization
Starting up ORACLE RDBMS Version: 10.2.0.1.0.
System parameters with non-default values:
processes = 150
…
…
The DB did restart, though sometimes it doesn't even manage to restart..
It's happened two days in a row...
Still haven't found a direct fix... getting anxious!!
good
Subscribed!
Good stuff. Just this morning I was trying to persuade a customer to use switches for the interconnect instead of a direct connection. Heh. Now here's a living example to show them...
Good stuff, +1!!!!