Oracle has long been trumpeting how highly available its RAC is and how it can deliver application-transparent failover. In real-world use, though, the conditions for achieving that transparency are remarkably strict.
Let me start with an outage. One day at lunch I suddenly got a call saying that application connections to the database were hanging, and that test connections from a client also hung with no response for a long time.
The database in this province is a two-node RAC. After logging in I found that rac2 was already down, with only its listener still alive. The rac2 alert log showed:
Tue Apr 29 10:12:10 2008
Thread 2 advanced to log sequence 11999
  Current log# 8 seq# 11999 mem# 0: /dev/vg_rac1/rG1_redo_512m_204
  Current log# 8 seq# 11999 mem# 1: /dev/vg_rac2/rG2_redo_512m_214
Tue Apr 29 10:12:10 2008
  Current log# 8 seq# 11999 mem# 2: /dev/vg_rac3/rG3_redo_512m_224
Tue Apr 29 10:12:10 2008
ARC1: Evaluating archive log 6 thread 2 sequence 11998
Tue Apr 29 10:12:10 2008
ARC1: Beginning to archive log 6 thread 2 sequence 11998
Creating archive destination LOG_ARCHIVE_DEST_1: '/archlog2/fjmisc2_2_11998.arc'
Tue Apr 29 10:12:32 2008
ARC1: Completed archiving log 6 thread 2 sequence 11998
Tue Apr 29 10:42:18 2008
alter tablespace DATA_PORTAL add datafile '/dev/vg_rac1/rG1_data_2g_185' size 2047m
Tue Apr 29 10:42:50 2008
Completed: alter tablespace DATA_PORTAL add datafile '/dev/vg
Tue Apr 29 12:16:59 2008
Communications reconfiguration: instance 0
Waiting for clusterware split-brain resolution
Tue Apr 29 12:17:32 2008
Trace dumping is performing id=[80429121659]
Tue Apr 29 12:18:09 2008
Trace dumping is performing id=[80429121745]
Tue Apr 29 12:27:04 2008
Errors in file /oracle/app/oracle/admin/fjmisc/bdump/fjmisc2_lmon_6757.trc:
ORA-29740: evicted by member 0, group incarnation 5
Tue Apr 29 12:27:04 2008
LMON: terminating instance due to error 29740
Tue Apr 29 12:27:04 2008
Errors in file /oracle/app/oracle/admin/fjmisc/bdump/fjmisc2_pmon_6753.trc:
ORA-29740: evicted by member , group incarnation
Tue Apr 29 12:27:04 2008
Errors in file /oracle/app/oracle/admin/fjmisc/bdump/fjmisc2_dbw0_6765.trc:
ORA-29740: evicted by member , group incarnation
Tue Apr 29 12:27:05 2008
Errors in file /oracle/app/oracle/admin/fjmisc/bdump/fjmisc2_lck0_6793.trc:
ORA-29740: evicted by member , group incarnation
Instance terminated by LMON, pid = 6757
The error reported is ORA-29740, which generally points to a communication problem between the RAC instances causing one node to be evicted from the cluster group. For the detailed analysis, we need the trace file:
*** 2008-04-29 12:16:59.317
kjxgrgetresults: Detect reconfig from 0, seq 4, reason 3
kjxgrrcfgchk: Initiating reconfig, reason 3
*** 2008-04-29 12:17:04.332
kjxgmrcfg: Reconfiguration started, reason 3
kjxgmcs: Setting state to 4 0.
*** 2008-04-29 12:17:04.347
     Name Service frozen
kjxgmcs: Setting state to 4 1.
*** 2008-04-29 12:17:04.392
kjxgrrecp2: Waiting for split-brain resolution, upd 0, seq 5
*** 2008-04-29 12:27:04.479
Voting results, upd 0, seq 5, bitmap: 0
*** 2008-04-29 12:27:04.479
kjxgrdtrt: Evicted by 0, seq (5, 5)
IMR state information
  Member 1, thread 2, state 4, flags 0x00a1
  RR seq 5, propstate 3, pending propstate 0
  rcfg rsn 3, rcfg time 2235139911, mem ct 2
  master 1, master rcfg time 2235139911
  Member information:
    Member 0, incarn 4, version 122818100
      thrd 1, prev thrd 65535, status 0x0047, err 0x0002
      valct 0
    Member 1, incarn 4, version 134210
      thrd 2, prev thrd 65535, status 0x0107, err 0x0000
      valct 2
Group name: FJMISC
Member id: 1
Cached SKGXN event: 0
Group State:
  State: 4 1
  Commited Map: 0 1
  New Map: 0 1
  SKGXN Map: 0 1
  Master node: 0
  Memcnt 2   Rcvcnt 0
  Substate Proposal: false
  Inc Proposal:
    incarn 0  memcnt 0  master 0
    proposal false  matched false
    map:
  Master Inc State:
    incarn 0  memcnt 0  agrees 0  flag 0x1
    wmap:
    nmap:
    ubmap:
Submitting asynchronized dump request [1]
error 29740 detected in background process
ORA-29740: evicted by member 0, group incarnation 5
ksuitm: waiting for [5] seconds before killing DIAG
oracle@FJ_DB02:/oracle/app/oracle/admin/fjmisc/bdump >
Here we can see the crash was caused by ORA-29740 with reason 3. MetaLink (Note:219361.1) explains the reason codes:
For troubleshooting ORA-29740 errors, the 'reason' will be very important.
In the above example, the first section indicates the reason for the
initiated reconfiguration. The reasons are as follows:

   Reason 0 = No reconfiguration
   Reason 1 = The Node Monitor generated the reconfiguration.
   Reason 2 = An instance death was detected.
   Reason 3 = Communications Failure
   Reason 4 = Reconfiguration after suspend

For ORA-29740 errors, you will most likely see reasons 1, 2, or 3.

-----------------------------------------------------------------------------
Reason 1: The Node Monitor generated the reconfiguration.

This can happen if:
   a) An instance joins the cluster
   b) An instance leaves the cluster
   c) A node is halted

It should be easy to determine the cause of the error by reviewing the alert
logs and LMON trace files from all instances. If an instance joins or leaves
the cluster or a node is halted then the ORA-29740 error is not a problem.

ORA-29740 evictions with reason 1 are usually expected when the cluster
membership changes. Very rarely are these types of evictions a real problem.

If you feel that this eviction was not correct, do a search in Metalink or
the bug database for: ORA-29740 'reason 1'

Important files to review are:
   a) Each instance's alert log
   b) Each instance's LMON trace file
   c) Statspack reports from all nodes leading up to the eviction
   d) Each node's syslog or messages file
   e) iostat output before, after, and during evictions
   f) vmstat output before, after, and during evictions
   g) netstat output before, after, and during evictions

There is a tool called "OS Watcher" that is being developed that helps gather
this information. For more information on "OS Watcher" see Note 301137.1
"OS Watcher User Guide".

-----------------------------------------------------------------------------
Reason 2: An instance death was detected.

This can happen if:
   a) An instance fails to issue a heartbeat to the control file.

When the heartbeat is missing, LMON will issue a network ping to the instance
not issuing the heartbeat. As long as the instance responds to the ping, LMON
will consider the instance alive. If, however, the heartbeat is not issued
for the length of time of the control file enqueue timeout, the instance is
considered to be problematic and will be evicted.

Common causes for an ORA-29740 eviction (Reason 2):
   a) NTP (Time changes on cluster) - usually on Linux, Tru64, or IBM AIX
   b) Network Problems (SAN).
   c) Resource Starvation (CPU, I/O, etc..)
   d) An Oracle bug.

Common bugs for reason 2 evictions:
   Bug 2820871 - Abrupt time adjustments can crash instance with ORA-29740
                 (Reason 2) (Linux Only) Fixed-Releases: 9204+ A000
   Bug 3917158 - ORA-29740 and a false instance eviction can occur (Reason 2)
                 (IBM AIX Only) Fixed Releases: 9206+

If you feel that this eviction was not correct, do a search in Metalink or
the bug database for: ORA-29740 'reason 2'

Important files to review are:
   a) Each instance's alert log
   b) Each instance's LMON trace file
   c) Statspack reports from all nodes leading up to the eviction
   d) The CKPT process trace file of the evicted instance
   e) Other bdump or udump files...
   f) Each node's syslog or messages file
   g) iostat output before, after, and during evictions
   h) vmstat output before, after, and during evictions
   i) netstat output before, after, and during evictions

There is a tool called "OS Watcher" that is being developed that helps gather
this information. For more information on "OS Watcher" see Note 301137.1
"OS Watcher User Guide".

-----------------------------------------------------------------------------
Reason 3: Communications Failure.

This can happen if:
   a) The LMON processes lose communication between one another.
   b) One instance loses communications with the LMS, LMD, process of
      another instance.
   c) The LCK processes lose communication between one another.
   d) A process like LMON, LMD, LMS, or LCK is blocked, spinning, or stuck
      and is not responding to remote requests.

In this case the ORA-29740 error is recorded when there are communication
issues between the instances. It is an indication that an instance has been
evicted from the configuration as a result of IPC send timeout.

A communications failure between processes across instances will also
generate a ORA-29740 with reason 3. When this occurs, the trace file of the
process experiencing the error will print a message:

   Reporting Communication error with instance:

If communication is lost at the cluster layer (for example, network cables
are pulled), the cluster software may also perform node evictions in the
event of a cluster split-brain. Oracle will detect a possible split-brain
and wait for cluster software to resolve the split-brain. If cluster
software does not resolve the split-brain within a specified interval,
Oracle proceeds with evictions.

Oracle Support has seen cases where resource starvation (CPU, I/O, etc...)
can cause an instance to be evicted with this reason code. The LMON or LMD
process could be blocked waiting for resources and not respond to polling by
the remote instance(s). This could cause that instance to be evicted. If you
have a statspack report available from the time just prior to the eviction
on the evicted instance, check for poor I/O times and high CPU utilization.
Poor I/O times would be an average read time of > 20ms.

Common causes for an ORA-29740 eviction (Reason 3):
   a) Network Problems.
   b) Resource Starvation (CPU, I/O, etc..)
   c) Severe Contention in Database.
   d) An Oracle bug.
OK, so we can basically narrow the problem down to the network. Next, the system syslog:
Apr 29 11:59:10 FJ_DB02 su: + 0 root-oracle
Apr 29 12:11:02 FJ_DB02 cmcld: lan1 failed
Apr 29 12:11:02 FJ_DB02 cmcld: Subnet 10.203.17.0 switching from lan1 to lan2
Apr 29 12:11:02 FJ_DB02 cmcld: lan1 switching to lan2
Apr 29 12:11:02 FJ_DB02 cmcld: Subnet 10.203.17.0 switched from lan1 to lan2
Apr 29 12:11:02 FJ_DB02 cmcld: lan1 switched to lan2
Apr 29 12:11:22 FJ_DB02 cmcld: lan1 recovered
Apr 29 12:11:22 FJ_DB02 cmcld: Subnet 10.203.17.0 switching from lan2 to lan1
Apr 29 12:11:22 FJ_DB02 cmcld: lan2 switching to lan1
Apr 29 12:11:22 FJ_DB02 cmcld: Subnet 10.203.17.0 switched from lan2 to lan1
Apr 29 12:11:22 FJ_DB02 cmcld: lan2 switched to lan1
Apr 29 12:14:15 FJ_DB02 vmunix: NFS server FJ_DB01 not responding still trying
Apr 29 12:14:59 FJ_DB02 cmcld: HB connection to 10.203.17.1 not responding, closing
Apr 29 12:14:59 FJ_DB02 cmcld: GS connection to 10.203.17.1 not responding, closing
Apr 29 12:18:59 FJ_DB02 cmcld: GS connection to 10.203.17.1 is responding
Apr 29 12:19:26 FJ_DB02 vmunix: NFS server FJ_DB01 ok
Apr 29 12:20:59 FJ_DB02 cmcld: HB connection to 10.203.17.1 is responding
Apr 29 12:22:47 FJ_DB02 xntpd[3717]: synchronized to 218.200.246.196, stratum=1
The host had indeed hit a network problem: from rac2, node rac1 (10.203.17.1) was unreachable, and at that point the application hung. My suspicion was that application connections were landing on rac2, whose listener was still up even though the instance had crashed. With local and remote listeners configured on the server side, and load balancing plus failover configured on the client side, a connection that randomly arrives at an instance triggers a server-side check of the other node's load (this is what the remote listener enables); since the network was down, the other node's state could not be determined, and the connection simply hung. After shutting down the listener on rac2, the application recovered!!
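As an aside, here is a minimal sketch of the server-side registration such a setup implies. The parameter values are illustrative assumptions (instance names, listener alias), not copied from this system:

# init.ora / spfile entries (hypothetical values)
fjmisc1.local_listener = '(ADDRESS=(PROTOCOL=TCP)(HOST=10.203.17.1)(PORT=1521))'
fjmisc2.local_listener = '(ADDRESS=(PROTOCOL=TCP)(HOST=10.203.17.2)(PORT=1521))'
*.remote_listener      = 'LISTENERS_FJMISC'   # tnsnames alias listing both listeners

With remote_listener set, PMON registers each instance with the other node's listener as well, so either listener can redirect an incoming connection to the less-loaded node. That cross-node load check is exactly the step that hung once the 10.203.17.0 network broke.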
Further reading of the host syslog showed that the network had recovered at 12:18 and connections were responding again. Pinging rac1 also worked. After checking that the cluster software was fine, I started to bring the rac2 instance back up.
The trouble was that the startup hit problems again. The connections on rac1 had all been fine, and the application was connecting to rac1 without issue, yet during the restart every application session hung. Figuring the two nodes were busy synchronizing memory and recovering from the split-brain, and that both rac1 and rac2 would be slow during that window, I asked the application side to wait a while... Fifteen minutes later rac2 still had not come up, and its instance was terminated by LMON once again.
Ugh! So the network still had problems!! I didn't dare touch anything else during the day for fear of hitting the business again, so the tinkering resumed at night...
That night I pulled in a network engineer. The "FJ_DB02 cmcld: GS connection to 10.203.17.1 not responding, closing" errors were still appearing, so we went over the network carefully, starting with which interface the instance was actually using for the interconnect:
$ sqlplus "/ as sysdba"
SQL> oradebug setmypid
SQL> oradebug ipc
SQL> exit;
The resulting trace file showed:
SSKGXPT 0x1aa5e4 flags SSKGXPT_READPENDING     info for network 0
        socket no 9     IP 10.203.17.2  UDP 57854   <------ here
        sflags SSKGXPT_UP
        info for network 1
        socket no 0     IP 0.0.0.0      UDP 0
        sflags SSKGXPT_DOWN
        context timestamp 0
        no ports
   sconno   accono   ertt  state  seq#  sent  async  sync  rtrans  acks
   ach  accono  sconno  admno  state  lstack  nxtseq  rcv  rtrans  acks  credts
So the interconnect was bound to 10.203.17.2, the same subnet the application connects over, not the private network. Checking further:
SQL> show parameter cluster

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
cluster_database                     boolean     TRUE
cluster_database_instances           integer     2
cluster_interconnects                string
And cluster_interconnects was empty, so the private NIC was not configured there either.
The RAC network in this province is laid out as follows:
lan0 : 192.168.0.0
lan1 : 10.203.17.0
lan2 : 0.0.0.0
Here lan2 is the standby for lan1. The lan1 interfaces of the two nodes are connected through a pair of mutually redundant switches, while lan0 is a direct crossover-cable link. The application connects to the database over the 10.203.17.0 subnet, so lan0 was effectively unused by the database, and a problem on lan1 was enough to take rac2 down.
Further ping tests by the network engineer showed packet loss on the lan1 interface. Because the loss was small and intermittent rather than a total outage, the failover to lan2 was never triggered. The packet loss caused the split-brain, and with it the rac2 instance crash. Asking around on site, we learned that someone had in fact unplugged the lan1 cable around noon that day; it was probably not reseated firmly, which caused this outage.
Fix: reseat the lan1 cable, then replace the cable and the switch port. Inter-node communication returned to normal, and rac2 started up cleanly.
Follow-up improvements: move the direct-connected interconnect onto a switch (gigabit required), and configure cluster_interconnects=ip_private.
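A minimal sketch of that parameter change, assuming the usual fjmisc1/fjmisc2 instance names and assuming the lan0 private addresses are 192.168.0.1 and 192.168.0.2 (the actual host IPs on that subnet do not appear above):

SQL> alter system set cluster_interconnects='192.168.0.1' scope=spfile sid='fjmisc1';
SQL> alter system set cluster_interconnects='192.168.0.2' scope=spfile sid='fjmisc2';

The parameter is not dynamic, so it takes effect at the next restart of each instance; rerunning oradebug ipc afterwards should show the 192.168.0.x address instead of 10.203.17.x.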
Summary: RAC is actually quite demanding. On the hardware side it requires a gigabit private network for the interconnect, and if a NIC could carry both interconnect traffic and application connections, cluster_interconnects must be set explicitly. Load balancing can actually happen in two places. One is on the server side, driven by the configured remote and local listeners: even a client that explicitly connects to rac1 may be redirected by the server side to rac2. The other is on the client side, via tnsnames: the client picks randomly from the address list, and the server-side listener may then adjust the final target again. (This feels hard to control: with both client-side and server-side load balancing configured, it is basically unpredictable which node a new connection will end up on. On top of that, Oracle touts the balancing as CPU-load based, but how that CPU figure is computed remains a black box.) As for failover, TAF seems to apply only to SELECTs already running against a RAC instance; in-flight UPDATEs and INSERTs basically cannot be transparently failed over, and brand-new connections have nothing to do with session failover at all...
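For concreteness, here is a sketch of the client-side piece as a tnsnames.ora entry combining connect-time load balancing, connect-time failover, and TAF. The hosts match this system's public subnet, but the alias and service name are illustrative:

FJMISC =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (LOAD_BALANCE = on)   # client picks an address at random
      (FAILOVER = on)       # connect-time failover: try the next address
      (ADDRESS = (PROTOCOL = TCP)(HOST = 10.203.17.1)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = 10.203.17.2)(PORT = 1521))
    )
    (CONNECT_DATA =
      (SERVICE_NAME = fjmisc)
      (FAILOVER_MODE =
        (TYPE = SELECT)     # TAF: only running queries resume on the survivor
        (METHOD = BASIC)    # reconnect at failure time, no preconnect
      )
    )
  )

Even with TYPE=SELECT, an open transaction is rolled back on failover; the application must catch the error and reissue its DML, which is why TAF is transparent for queries only.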
5 comments
I'm also getting the errors below. Help!!:
Fri Jun 13 00:27:37 2008
IPC Send timeout detected.Sender: ospid 5000
Receiver: inst 1 binc 10682 ospid 4224
Fri Jun 13 00:27:39 2008
Errors in file d:\oracle\product\10.2.0\admin\etccms\bdump\etccms3_lms1_5000.trc:
ORA-27508: IPC error sending a message
ORA-27300: OS system dependent operation:IPCSOCK_Send failed with status: 10054
ORA-27301: OS failure message: An existing connection was forcibly closed by the remote host.
ORA-27302: failure occurred at: send_3
Fri Jun 13 00:27:39 2008
IPC Send timeout to 0.2 inc 26 for msg type 38 from opid 8
Fri Jun 13 00:27:39 2008
Communications reconfiguration: instance_number 1
Fri Jun 13 00:27:39 2008
Trace dumping is performing id=[cdmp_20080613002739]
Fri Jun 13 00:29:20 2008
Waiting for clusterware split-brain resolution
Fri Jun 13 00:39:20 2008
Errors in file d:\oracle\product\10.2.0\admin\etccms\bdump\etccms3_lmon_5540.trc:
ORA-29740: evicted by member 2, group incarnation 28
Fri Jun 13 00:39:20 2008
LMON: terminating instance due to error 29740
Fri Jun 13 00:39:20 2008
WARNING: inbound connection timed out (ORA-3136)
Fri Jun 13 00:39:20 2008
Errors in file d:\oracle\product\10.2.0\admin\etccms\bdump\etccms3_pmon_2072.trc:
ORA-29740: evicted by member , group incarnation
Fri Jun 13 00:39:20 2008
Errors in file d:\oracle\product\10.2.0\admin\etccms\bdump\etccms3_lmd0_5396.trc:
ORA-29740: evicted by member , group incarnation
Fri Jun 13 00:39:20 2008
Errors in file d:\oracle\product\10.2.0\admin\etccms\bdump\etccms3_lms1_5000.trc:
ORA-29740: evicted by member , group incarnation
Fri Jun 13 00:39:20 2008
WARNING: inbound connection timed out (ORA-3136)
Fri Jun 13 00:39:20 2008
WARNING: inbound connection timed out (ORA-3136)
Fri Jun 13 00:39:20 2008
WARNING: inbound connection timed out (ORA-3136)
…….
Errors in file d:\oracle\product\10.2.0\admin\etccms\bdump\etccms3_dbw0_5380.trc:
ORA-29740: evicted by member , group incarnation
Fri Jun 13 00:39:20 2008
WARNING: inbound connection timed out (ORA-3136)
Fri Jun 13 00:39:21 2008
Errors in file d:\oracle\product\10.2.0\admin\etccms\bdump\etccms3_lck0_4744.trc:
ORA-29740: evicted by member , group incarnation
Fri Jun 13 00:39:21 2008
WARNING: inbound connection timed out (ORA-3136)
Fri Jun 13 00:39:21 2008
WARNING: inbound connection timed out (ORA-3136)
…
Fri Jun 13 00:39:28 2008
Errors in file d:\oracle\product\10.2.0\admin\etccms\bdump\etccms3_reco_1028.trc:
ORA-29740: evicted by member , group incarnation
Instance terminated by LMON, pid = 5540
Fri Jun 13 02:20:46 2008
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Interface type 1 private 192.168.1.0 configured from OCR for use as a cluster interconnect
Interface type 1 public 10.81.4.0 configured from OCR for use as a public interface
Picked latch-free SCN scheme 2
Autotune of undo retention is turned on.
LICENSE_MAX_USERS = 0
SYS auditing is disabled
ksdpec: called for event 13740 prior to event group initialization
Starting up ORACLE RDBMS Version: 10.2.0.1.0.
System parameters with non-default values:
processes = 150
…
…
The DB did restart, though sometimes it doesn't even manage to restart..
It's happened two days in a row...
Still haven't found a direct fix... getting anxious!!
good
Subscribed!
Good stuff. Just this morning I was trying to persuade a customer to use switches for the interconnect instead of a direct connection. Heh. Now here's a living example to show them...
Good stuff, +1!!!!