A few days ago a colleague ran into a problem. In a RAC environment, the SA needed to apply a patch and wanted RAC to run in single-node mode for the duration. He first brought down one server, then asked the DBA to start the instances on the remaining server.
This environment has two RAC databases on two servers, with each server running two instances. That is:
for the SIAP database, SIAP1 runs on server1 and SIAP2 on server2; for the SIMP database, SIMP1 runs on server1 and SIMP2 on server2.
The databases are 11g RAC; shared storage and the heartbeat are managed by Veritas, and the RAC cluster is managed by CRS.
At this point server2 was already down, and we wanted to start SIAP1 and SIMP1 on server1. The problem: SIMP1 started fine, but SIAP1 would not start, failing with:
SQL> startup
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:check if cable failed with status: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgxpcini1
ORA-27303: additional information: requested interface ce2 interface not running
set _disable_interface_checking = TRUE to disable this check for single instance cluster. Check output from ifconf
Per the error message, the ce2 NIC is not running, and the suggestion is to set _disable_interface_checking to TRUE in order to start the database.
Since the situation was urgent, there was no time to dig in; once the SA brought the other server back up, SIAP1 started without issue.
This begs the question: why, on the same server, could one instance start while the other could not?
Let's look at the RAC network configuration and see which NIC ce2 is. (Everything is back to normal at this point: both servers and all four instances are running.)
Information in CRS:

oracle@vus029pa:SIAP1:/opt/app/oracle/admin
$ oifcfg getif
ce0  144.135.159.0  global  public
ce6  144.135.159.0  global  public

Configuration in /etc/hosts:

oracle@vus029pa:SIAP1:/opt/app/oracle/admin
$ more /etc/hosts
#Public
144.135.159.111  vus029pa vus029pa.in.telstra.com.au loghost
#Oracle Virtual IP Addresses
144.135.159.110  osiiprd1dbr01.in.telstra.com.au
144.135.159.112  osiiprd1dbr02.in.telstra.com.au
# Private Interconnects
192.168.0.1      osiiprd1db2-priv osiiprd1db2-priv.in.telstra.com.au
192.168.0.2      osiiprd1db1-priv osiiprd1db1-priv.in.telstra.com.au

NIC configuration:

oracle@vus029pa:SIAP1:/opt/app/oracle/admin
$ ifconfig -a
ce0: flags=1009040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER,FIXEDMTU> mtu 1500 index 2
        inet 144.135.159.13 netmask ffffff00 broadcast 144.135.159.255
        groupname clustermgmt-mnb
ce0:1: flags=1001000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4,FIXEDMTU> mtu 1500 index 2
        inet 144.135.159.111 netmask ffffff00 broadcast 144.135.159.255
ce0:2: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
        inet 144.135.159.109 netmask ffffff00 broadcast 144.135.159.255
ce2: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 18
        inet 192.168.0.1 netmask ffffff00 broadcast 192.168.0.255
ce6: flags=1009040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER,FIXEDMTU> mtu 1500 index 6
        inet 144.135.159.63 netmask ffffff00 broadcast 144.135.159.255
        groupname clustermgmt-mnb
ce6:1: flags=1040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4> mtu 1500 index 6
        inet 144.135.159.110 netmask ffffff00 broadcast 144.135.159.255
We can see that ce2 is configured with IP 192.168.0.1, which the hosts file shows is a private-interconnect address. In other words, when SIAP1 starts, it checks whether the NIC carrying the private address is up; only if that NIC is up can the instance start.
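The shape of this check can be sketched as follows. This is only an illustrative model, not Oracle's actual skgxp code: given ifconfig-style output, the check passes only if the requested interface is present and carries the RUNNING flag.

```python
import re

def interface_running(ifconfig_output, iface):
    """Return True if `iface` appears in ifconfig-style output with the
    RUNNING flag set (an illustrative re-creation of the startup check)."""
    # Match the interface stanza header, e.g. "ce2: flags=1000843<UP,...,RUNNING,...>"
    pattern = re.compile(r"^%s: flags=\w+<([^>]*)>" % re.escape(iface), re.M)
    m = pattern.search(ifconfig_output)
    return bool(m) and "RUNNING" in m.group(1).split(",")

sample = """ce2: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 18
        inet 192.168.0.1 netmask ffffff00 broadcast 192.168.0.255"""

print(interface_running(sample, "ce2"))   # True: ce2 is running, startup may proceed
print(interface_running(sample, "ce9"))   # False: an absent/down NIC blocks startup
```

In the real failure, ce2 had been taken down with server2's half of the interconnect, so this check failed and startup aborted with ORA-27504.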
The same check of the private-network NIC also exists in an ASM + 10g RAC environment:
[root@rac1 ~]# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:0C:29:AE:9A:38
          inet addr:192.168.190.131  Bcast:192.168.190.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:feae:9a38/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:3648 errors:0 dropped:0 overruns:0 frame:0
          TX packets:3809 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:362192 (353.7 KiB)  TX bytes:357537 (349.1 KiB)
          Interrupt:10 Base address:0x1480

eth1      Link encap:Ethernet  HWaddr 00:0C:29:AE:9A:42
          inet addr:10.10.10.31  Bcast:10.10.10.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:feae:9a42/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:595 errors:0 dropped:0 overruns:0 frame:0
          TX packets:22 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:107822 (105.2 KiB)  TX bytes:1092 (1.0 KiB)
          Interrupt:5 Base address:0x1800

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:41187 errors:0 dropped:0 overruns:0 frame:0
          TX packets:41187 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:12143968 (11.5 MiB)  TX bytes:12143968 (11.5 MiB)

[root@rac1 ~]# cat /etc/hosts |grep priv
10.10.10.31     rac1-priv.mycorpdomain.com      rac1-priv
10.10.10.32     rac2-priv.mycorpdomain.com      rac2-priv
10.10.10.33     rac3-priv.mycorpdomain.com      rac3-priv
[root@rac1 ~]# ifconfig eth1 down
[root@rac1 ~]# ifconfig
eth0      Link encap:Ethernet  HWaddr 00:0C:29:AE:9A:38
          inet addr:192.168.190.131  Bcast:192.168.190.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:feae:9a38/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:3769 errors:0 dropped:0 overruns:0 frame:0
          TX packets:3914 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:373764 (365.0 KiB)  TX bytes:368107 (359.4 KiB)
          Interrupt:10 Base address:0x1480

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:41240 errors:0 dropped:0 overruns:0 frame:0
          TX packets:41240 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:12145505 (11.5 MiB)  TX bytes:12145505 (11.5 MiB)

[root@rac1 ~]# su - oracle
rac1-> crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora.devdb.db   application    OFFLINE   OFFLINE
ora....b1.inst application    OFFLINE   OFFLINE
ora....b2.inst application    OFFLINE   OFFLINE
ora....b3.inst application    OFFLINE   OFFLINE
ora....SM1.asm application    OFFLINE   OFFLINE
ora....C1.lsnr application    OFFLINE   OFFLINE
ora.rac1.gsd   application    OFFLINE   OFFLINE
ora.rac1.ons   application    OFFLINE   OFFLINE
ora.rac1.vip   application    OFFLINE   OFFLINE
ora....SM2.asm application    OFFLINE   OFFLINE
ora....C2.lsnr application    OFFLINE   OFFLINE
ora.rac2.gsd   application    OFFLINE   OFFLINE
ora.rac2.ons   application    OFFLINE   OFFLINE
ora.rac2.vip   application    OFFLINE   OFFLINE
ora....SM3.asm application    OFFLINE   OFFLINE
ora....C3.lsnr application    OFFLINE   OFFLINE
ora.rac3.gsd   application    OFFLINE   OFFLINE
ora.rac3.ons   application    OFFLINE   OFFLINE
ora.rac3.vip   application    OFFLINE   OFFLINE
rac1-> export ORACLE_SID=+ASM1
rac1-> sqlplus "/as sysdba"

SQL*Plus: Release 10.2.0.1.0 - Production on Fri Jul 13 22:07:31 2012

Copyright (c) 1982, 2005, Oracle.  All rights reserved.

Connected to an idle instance.

SQL> startup
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:if_not_up failed with status: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgxpvaddr5
ORA-27303: additional information: requested interface eth1 is not UP. Check output from ifconfig command
SQL>
During startup, the ASM alert log also shows the private network being checked:
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Interface type 1 eth1 10.10.10.0 configured from OCR for use as a cluster interconnect
Interface type 1 eth0 192.168.190.0 configured from OCR for use as a public interface
Picked latch-free SCN scheme 2
......
In other words, whether on 10g or 11g, and whether for an ASM instance or a database instance, startup in a RAC environment always checks that the private NIC is up; only then can the instance start.
So why was our SIMP1 able to start?
Let's compare the alert logs of SIAP1 and SIMP1 at startup and see how they differ.
SIAP1's alert log at startup:
Sat Jul 07 19:27:24 GMT 2012
Starting ORACLE instance (normal)
cluster_interconnects    = 192.168.0.1
Cluster communication is configured to use the following interface(s) for this instance
  192.168.0.1
Sat Jul 07 19:28:07 GMT 2012
cluster interconnect IPC version:Oracle UDP/IP (generic)
IPC Vendor 1 proto 2
......

(Note: the SIAP1 instance sets cluster_interconnects and does not list the NIC information from OCR; the SIMP1 instance does not set cluster_interconnects and does list the NIC information.)
SIMP1's alert log at startup:
Sat Jul 07 15:09:36 GMT 2012
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Interface type 1 ce0 144.135.159.0 configured from OCR for use as a public interface
Interface type 1 ce6 144.135.159.0 configured from OCR for use as a public interface
WARNING: No cluster interconnect has been specified. Depending on
         the communication driver configured Oracle cluster traffic
         may be directed to the public interface of this machine.
         Oracle recommends that RAC clustered databases be configured
         with a private interconnect for enhanced security and performance.
....
Cluster communication is configured to use the following interface(s) for this instance
  144.135.159.111
Sat Jul 07 15:09:43 GMT 2012
cluster interconnect IPC version:Oracle UDP/IP (generic)
IPC Vendor 1 proto 2
We can see that SIMP1 uses 144.135.159.111 for inter-node communication at startup, while SIAP1 uses 192.168.0.1.
Why, when neither database has a cluster_interconnect configured in CRS, do the two instances end up on completely different IPs?
We know that if no cluster_interconnect is configured in CRS, interconnect traffic falls back to the public IP; SIMP1 is indeed in that situation. So why doesn't SIAP1 behave the same way?
That points to another setting: the initialization parameter cluster_interconnects. When this parameter is set, the CRS configuration is overridden, because the initialization parameter takes precedence. Let's check this parameter on both instances:
SIAP1:
NAME                          TYPE        VALUE
----------------------------- ----------- -------------------
cluster_interconnects         string      192.168.0.1
SIMP1:
NAME                          TYPE        VALUE
----------------------------- ----------- -------------------
cluster_interconnects         string
So this is the problem. Because SIAP1 has the initialization parameter cluster_interconnects set to a fixed private address, that setting overrides whatever is in CRS, and the instance will not fall back to the public address. When the SA brought down the private NIC, ce2, SIAP1 could no longer start.
SIMP1, on the other hand, has no cluster_interconnects set, so it follows the CRS configuration; and since no cluster_interconnect is configured in CRS either, it falls back to the public network, i.e. the 144.135.159.111 address we saw in the alert log.
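To tie the two observations together, here is a toy simulation (purely illustrative; the function and data names are mine, not Oracle's) of why SIAP1 fails while SIMP1 starts when only the private NIC is down:

```python
# Toy model of the failure scenario (illustrative only, not Oracle internals).
# Each NIC maps to (address, is_running); the interconnect address is either
# pinned by cluster_interconnects or falls back to the public address.
nics = {
    "ce0": ("144.135.159.111", True),   # public NIC: still up
    "ce2": ("192.168.0.1", False),      # private NIC: downed by the SA
}

def try_startup(instance, pinned_address):
    # Resolve which address the instance will use for cluster communication
    addr = pinned_address or nics["ce0"][0]   # no pin -> public fallback
    # Startup succeeds only if the NIC carrying that address is running
    for nic, (nic_addr, running) in nics.items():
        if nic_addr == addr:
            if running:
                return "%s started on %s" % (instance, addr)
            return "%s failed: interface %s not running" % (instance, nic)
    return "%s failed: no interface for %s" % (instance, addr)

print(try_startup("SIAP1", "192.168.0.1"))  # fails: ce2 not running
print(try_startup("SIMP1", None))           # starts on the public address
```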
So there are three places to consider: the initialization parameter cluster_interconnects, the global cluster_interconnect configuration in CRS, and the global public configuration in CRS, in that order of precedence. Once these relationships and priorities are clear, the cause of the failure is obvious.
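Under those precedence rules, the address selection can be sketched as follows. This is a simplified model for illustration (the function name is mine, and I use host IPs where `oifcfg getif` would actually show network addresses):

```python
def pick_interconnect(init_param, ocr_interfaces):
    """Model of interconnect selection (simplified, for illustration).

    init_param:      value of the cluster_interconnects init parameter, or None
    ocr_interfaces:  list of (nic, address, role) tuples, role being
                     "cluster_interconnect" or "public", as registered in OCR/CRS
    """
    # 1. The initialization parameter cluster_interconnects wins outright.
    if init_param:
        return init_param
    # 2. Otherwise a global cluster_interconnect registered in OCR/CRS is used.
    for nic, addr, role in ocr_interfaces:
        if role == "cluster_interconnect":
            return addr
    # 3. Failing both, traffic falls back to the public interface.
    for nic, addr, role in ocr_interfaces:
        if role == "public":
            return addr
    return None

ocr = [("ce0", "144.135.159.111", "public"), ("ce6", "144.135.159.110", "public")]
print(pick_interconnect("192.168.0.1", ocr))  # SIAP1: pinned to the private address
print(pick_interconnect(None, ocr))           # SIMP1: falls back to the public address
```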
Comments
Reading your articles is a real treat; like the butcher carving the ox in the old fable, the analysis is crystal clear. How can I reach your level?
I would really appreciate your guidance; my email is gghuyangg@163.com.
Best regards!
My database just happened to report this error:
SQL> startup
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:check if cable failed with status: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: skgxpcini1
ORA-27303: additional information: requested interface ce0 interface not running
set _disable_interface_checking = TRUE to disable this check for single instance cluster. Check output from ifcon
Of course, the cause has already been found: the private-network switch was broken. And my database's cluster_interconnect parameter is not configured with a specific address; it is empty. When I start the database I get the error above, which does not match what you wrote:
"We can see that SIMP1 uses 144.135.159.111 for inter-node communication at startup, while SIAP1 uses 192.168.0.1.
Why, when neither has a cluster_interconnect configured in CRS, would the two instances take completely different IPs?
We know that if no cluster_interconnect is configured in CRS, the interconnect falls back to the public IP; SIMP1 is indeed that case. So why doesn't SIAP1 behave the same way?
That points to another parameter, the initialization parameter cluster_interconnects; when it is set, the CRS configuration is overridden, because the initialization parameter has the higher priority."
So I think there is something wrong with your conclusion.
Thanks.
Re www_xylove: did you configure a cluster_interconnect in your CRS? Check with oifcfg getif -global. I believe our situations are different: you most likely did configure the CRS cluster_interconnect, whereas mine was not configured.