修改hosts后，rac通讯失败

今天收到个告警某省的一个数据库的一个节点down了，重启后，只能到started状态，数据库无法open，登录上去后，看到alertlog中：

Mon Jun 15 15:38:28 2009
Errors in file /oracle/app/oracle/admin/fjmisc/udump/fjmisc2_ora_26950.trc:
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:gethostbyname failed with status: 3
ORA-27301: OS failure message: No such process
ORA-27302: failure occurred at: sskgxpmyip2
ORA-27303: additional information: nodename FJ_DB02
 Mon Jun 15 15:41:19 2009
Errors in file /oracle/app/oracle/admin/fjmisc/udump/fjmisc2_ora_27062.trc:
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:gethostbyname failed with status: 3
ORA-27301: OS failure message: No such process
ORA-27302: failure occurred at: sskgxpmyip2
ORA-27303: additional information: nodename FJ_DB02
Mon Jun 15 15:42:37 2009
Errors in file /oracle/app/oracle/admin/fjmisc/udump/fjmisc2_ora_27147.trc:
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:gethostbyname failed with status: 3
ORA-27301: OS failure message: No such process
ORA-27302: failure occurred at: sskgxpmyip2
ORA-27303: additional information: nodename FJ_DB02
 Mon Jun 15 15:46:19 2009
Errors in file /oracle/app/oracle/admin/fjmisc/udump/fjmisc2_ora_27362.trc:
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context

Mon Jun 15 15:38:28 2009

Errors in file /oracle/app/oracle/admin/fjmisc/udump/fjmisc2_ora_26950.trc:

ORA-00603: ORACLE server session terminated by fatal error

ORA-27504: IPC error creating OSD context

ORA-27300: OS system dependent operation:gethostbyname failed with status: 3

ORA-27301: OS failure message: No such process

ORA-27302: failure occurred at: sskgxpmyip2

ORA-27303: additional information: nodename FJ_DB02

Mon Jun 15 15:41:19 2009

Errors in file /oracle/app/oracle/admin/fjmisc/udump/fjmisc2_ora_27062.trc:

ORA-00603: ORACLE server session terminated by fatal error

ORA-27504: IPC error creating OSD context

ORA-27300: OS system dependent operation:gethostbyname failed with status: 3

ORA-27301: OS failure message: No such process

ORA-27302: failure occurred at: sskgxpmyip2

ORA-27303: additional information: nodename FJ_DB02

Mon Jun 15 15:42:37 2009

Errors in file /oracle/app/oracle/admin/fjmisc/udump/fjmisc2_ora_27147.trc:

ORA-00603: ORACLE server session terminated by fatal error

ORA-27504: IPC error creating OSD context

ORA-27300: OS system dependent operation:gethostbyname failed with status: 3

ORA-27301: OS failure message: No such process

ORA-27302: failure occurred at: sskgxpmyip2

ORA-27303: additional information: nodename FJ_DB02

Mon Jun 15 15:46:19 2009

Errors in file /oracle/app/oracle/admin/fjmisc/udump/fjmisc2_ora_27362.trc:

ORA-00603: ORACLE server session terminated by fatal error

ORA-27504: IPC error creating OSD context

当时第一个反映是gethostbyname failed 应该是主机名解析有问题了。但是ping的时候发现还是能ping通：

oracle@FJ_DB02:/oracle/app/oracle/admin/fjmisc/bdump > ping FJ_DB01
PING FJ_DB01: 64 byte packets
64 bytes from 10.203.17.1: icmp_seq=0. time=0. ms
64 bytes from 10.203.17.1: icmp_seq=1. time=0. ms
64 bytes from 10.203.17.1: icmp_seq=2. time=1. ms
64 bytes from 10.203.17.1: icmp_seq=3. time=3. ms
64 bytes from 10.203.17.1: icmp_seq=4. time=0. ms

----FJ_DB01 PING Statistics----
5 packets transmitted, 5 packets received, 0% packet loss
round-trip (ms)  min/avg/max = 0/1/3
oracle@FJ_DB02:/oracle/app/oracle/admin/fjmisc/bdump >

oracle@FJ_DB02:/oracle/app/oracle/admin/fjmisc/bdump > ping FJ_DB01

PING FJ_DB01: 64 byte packets

64 bytes from 10.203.17.1: icmp_seq=0. time=0. ms

64 bytes from 10.203.17.1: icmp_seq=1. time=0. ms

64 bytes from 10.203.17.1: icmp_seq=2. time=1. ms

64 bytes from 10.203.17.1: icmp_seq=3. time=3. ms

64 bytes from 10.203.17.1: icmp_seq=4. time=0. ms

----FJ_DB01 PING Statistics----

5 packets transmitted, 5 packets received, 0% packet loss

round-trip (ms) min/avg/max = 0/1/3

oracle@FJ_DB02:/oracle/app/oracle/admin/fjmisc/bdump >

奇怪了，但是能ping通为啥报这个错？

询问了当地的同事，说修改了hosts文件，但是现在已经改回去了。进一步检查hosts文件，发现内容虽然已经改回去，但是权限没改回去：

drwxr-xr-x   2 bin        bin             96 Dec 26  2006 gss
-rw-r--r--   1 bin        bin            489 Dec 27  2006 hba.conf
-rw-------   1 bin        bin            847 Jun 15 15:08 hosts
-r--r--r--   1 root       sys            847 Jun 15 14:57 hosts.bak
-r--r--r--   1 root       sys            626 Dec 28  2006 hosts.old

drwxr-xr-x 2 bin bin 96 Dec 26 2006 gss

-rw-r--r-- 1 bin bin 489 Dec 27 2006 hba.conf

-rw------- 1 bin bin 847 Jun 15 15:08 hosts

-r--r--r-- 1 root sys 847 Jun 15 14:57 hosts.bak

-r--r--r-- 1 root sys 626 Dec 28 2006 hosts.old

oracle用户cat hosts文件是cat不出来的。
于是改回这个文件的权限为444，重启数据库（注意先用abort shutdown数据库），问题解决：

Mon Jun 15 15:52:09 2009
Shutting down instance (abort)
License high water mark = 2
Instance terminated by USER, pid = 28002
Mon Jun 15 15:52:34 2009
Starting ORACLE instance (normal)
Mon Jun 15 15:52:34 2009
Global Enqueue Service Resources = 36504, pool = 8
Mon Jun 15 15:52:34 2009
Global Enqueue Service Enqueues = 54768
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Mon Jun 15 15:52:50 2009
SCN scheme 1
Mon Jun 15 15:53:16 2009
Using log_archive_dest parameter default value
LICENSE_MAX_USERS = 0
SYS auditing is disabled
Starting up ORACLE RDBMS Version: 9.2.0.6.0.
System parameters with non-default values:
  processes                = 1500
  timed_statistics         = TRUE
  shared_pool_size         = 536870912
  large_pool_size          = 218103808
  java_pool_size           = 218103808
  spfile                   = /dev/vg_rac3/rG3_spfile_128m_01
  control_files            = /dev/vg_rac1/rG1_ctrl_256m_01, /dev/vg_rac2/rG2_ctrl_256m_02, /dev/vg_rac3/rG3_ctrl_256m_03
  db_block_size            = 8192
  db_cache_size            = 8589934592
  compatible               = 9.2.0.0.0
  log_archive_start        = TRUE
  log_archive_dest_1       = location=/archlog2
  log_archive_format       = fjmisc2_%t_%s.arc
  log_buffer               = 4194304
  db_files                 = 2000
  db_file_multiblock_read_count= 16
  cluster_database         = TRUE
  cluster_database_instances= 2
  thread                   = 2
  fast_start_mttr_target   = 800
  instance_number          = 2
  undo_management          = AUTO
  undo_tablespace          = UNDOTBS2
  undo_suppress_errors     = TRUE
  undo_retention           = 21600
  remote_login_passwordfile= EXCLUSIVE
  db_domain                = 
  instance_name            = fjmisc2
  local_listener           = fjmisc2
  remote_listener          = fjmisc
  job_queue_processes      = 10
  hash_join_enabled        = TRUE
  background_dump_dest     = /oracle/app/oracle/admin/fjmisc/bdump
  user_dump_dest           = /oracle/app/oracle/admin/fjmisc/udump
  max_dump_file_size       = 5120
  core_dump_dest           = /oracle/app/oracle/admin/fjmisc/cdump
  sort_area_size           = 524288
  db_name                  = fjmisc
  open_cursors             = 400
  star_transformation_enabled= FALSE
  query_rewrite_enabled    = TRUE
  pga_aggregate_target     = 838860800
Mon Jun 15 15:53:16 2009
cluster interconnect IPC version:Oracle UDP/IP
IPC Vendor 1 proto 2 Version 1.0
PMON started with pid=2
DIAG started with pid=3
LMON started with pid=4
LMD0 started with pid=5
LMS0 started with pid=6
LMS1 started with pid=7
DBW0 started with pid=8
LGWR started with pid=9
CKPT started with pid=10
SMON started with pid=11
RECO started with pid=12
CJQ0 started with pid=13
Mon Jun 15 15:53:20 2009
ARCH: STARTING ARCH PROCESSES
ARC0 started with pid=14
ARC0: Archival started
ARC1 started with pid=15
Mon Jun 15 15:53:20 2009
ARCH: STARTING ARCH PROCESSES COMPLETE
Mon Jun 15 15:53:20 2009
ARC1: Archival started
Mon Jun 15 15:53:20 2009
ARC1: Thread not mounted
Mon Jun 15 15:53:20 2009
ARC0: Thread not mounted
Mon Jun 15 15:53:20 2009
ALTER DATABASE   MOUNT
Mon Jun 15 15:53:21 2009
lmon registered with NM - instance id 2 (internal mem no 1)
Mon Jun 15 15:53:21 2009
Reconfiguration started (old inc 0, new inc 4)
List of nodes:
 0 1
 Global Resource Directory frozen
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
 Resources and enqueues cleaned out
 Resources remastered 0
 0 GCS shadows traversed, 0 cancelled, 0 closed
 0 GCS resources traversed, 0 cancelled
 set master node info 
 Submitted all remote-enqueue requests
 Update rdomain variables
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 0 GCS shadows traversed, 0 replayed, 0 unopened
 Submitted all GCS remote-cache requests
 0 write requests issued in 480118 GCS resources
 116 PIs marked suspect, 0 flush PI msgs
Mon Jun 15 15:53:29 2009
Reconfiguration complete
LCK0 started with pid=17
Mon Jun 15 15:53:35 2009
Successful mount of redo thread 2, with mount id 248173721
Mon Jun 15 15:53:35 2009
Database mounted in Shared Mode (CLUSTER_DATABASE=TRUE).
Completed: ALTER DATABASE   MOUNT
Mon Jun 15 15:53:35 2009
ALTER DATABASE OPEN
Picked Lamport scheme to generate SCNs
Mon Jun 15 15:53:39 2009
LGWR: Primary database is in CLUSTER CONSISTENT mode
Thread 2 advanced to log sequence 18067
Thread 2 opened at log sequence 18067
  Current log# 8 seq# 18067 mem# 0: /dev/vg_rac1/rG1_redo_512m_204
  Current log# 8 seq# 18067 mem# 1: /dev/vg_rac2/rG2_redo_512m_214
  Current log# 8 seq# 18067 mem# 2: /dev/vg_rac3/rG3_redo_512m_224
Successful open of redo thread 2
Mon Jun 15 15:53:39 2009
SMON: enabling cache recovery
Mon Jun 15 15:53:39 2009
ARC0: Evaluating archive   log 6 thread 2 sequence 18066
ARC0: Beginning to archive log 6 thread 2 sequence 18066
Creating archive destination LOG_ARCHIVE_DEST_1: '/archlog2/fjmisc2_2_18066.arc'
Mon Jun 15 15:53:40 2009
Successfully onlined Undo Tablespace 3.
Mon Jun 15 15:53:40 2009
SMON: enabling tx recovery
Mon Jun 15 15:53:40 2009
Database Characterset is ZHS16GBK
replication_dependency_tracking turned off (no async multimaster replication found)
Completed: ALTER DATABASE OPEN

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

132

133

134

135

136

137

138

139

140

141

142

143

144

145

146

147

148

Mon Jun 15 15:52:09 2009

Shutting down instance (abort)

License high water mark = 2

Instance terminated by USER, pid = 28002

Mon Jun 15 15:52:34 2009

Starting ORACLE instance (normal)

Mon Jun 15 15:52:34 2009

Global Enqueue Service Resources = 36504, pool = 8

Mon Jun 15 15:52:34 2009

Global Enqueue Service Enqueues = 54768

LICENSE_MAX_SESSION = 0

LICENSE_SESSIONS_WARNING = 0

Mon Jun 15 15:52:50 2009

SCN scheme 1

Mon Jun 15 15:53:16 2009

Using log_archive_dest parameter default value

LICENSE_MAX_USERS = 0

SYS auditing is disabled

Starting up ORACLE RDBMS Version: 9.2.0.6.0.

System parameters with non-default values:

processes = 1500

timed_statistics = TRUE

shared_pool_size = 536870912

large_pool_size = 218103808

java_pool_size = 218103808

spfile = /dev/vg_rac3/rG3_spfile_128m_01

control_files = /dev/vg_rac1/rG1_ctrl_256m_01, /dev/vg_rac2/rG2_ctrl_256m_02, /dev/vg_rac3/rG3_ctrl_256m_03

db_block_size = 8192

db_cache_size = 8589934592

compatible = 9.2.0.0.0

log_archive_start = TRUE

log_archive_dest_1 = location=/archlog2

log_archive_format = fjmisc2_%t_%s.arc

log_buffer = 4194304

db_files = 2000

db_file_multiblock_read_count= 16

cluster_database = TRUE

cluster_database_instances= 2

thread = 2

fast_start_mttr_target = 800

instance_number = 2

undo_management = AUTO

undo_tablespace = UNDOTBS2

undo_suppress_errors = TRUE

undo_retention = 21600

remote_login_passwordfile= EXCLUSIVE

db_domain =

instance_name = fjmisc2

local_listener = fjmisc2

remote_listener = fjmisc

job_queue_processes = 10

hash_join_enabled = TRUE

background_dump_dest = /oracle/app/oracle/admin/fjmisc/bdump

user_dump_dest = /oracle/app/oracle/admin/fjmisc/udump

max_dump_file_size = 5120

core_dump_dest = /oracle/app/oracle/admin/fjmisc/cdump

sort_area_size = 524288

db_name = fjmisc

open_cursors = 400

star_transformation_enabled= FALSE

query_rewrite_enabled = TRUE

pga_aggregate_target = 838860800

Mon Jun 15 15:53:16 2009

cluster interconnect IPC version:Oracle UDP/IP

IPC Vendor 1 proto 2 Version 1.0

PMON started with pid=2

DIAG started with pid=3

LMON started with pid=4

LMD0 started with pid=5

LMS0 started with pid=6

LMS1 started with pid=7

DBW0 started with pid=8

LGWR started with pid=9

CKPT started with pid=10

SMON started with pid=11

RECO started with pid=12

CJQ0 started with pid=13

Mon Jun 15 15:53:20 2009

ARCH: STARTING ARCH PROCESSES

ARC0 started with pid=14

ARC0: Archival started

ARC1 started with pid=15

Mon Jun 15 15:53:20 2009

ARCH: STARTING ARCH PROCESSES COMPLETE

Mon Jun 15 15:53:20 2009

ARC1: Archival started

Mon Jun 15 15:53:20 2009

ARC1: Thread not mounted

Mon Jun 15 15:53:20 2009

ARC0: Thread not mounted

Mon Jun 15 15:53:20 2009

ALTER DATABASE MOUNT

Mon Jun 15 15:53:21 2009

lmon registered with NM - instance id 2 (internal mem no 1)

Mon Jun 15 15:53:21 2009

Reconfiguration started (old inc 0, new inc 4)

List of nodes:

0 1

Global Resource Directory frozen

Communication channels reestablished

Master broadcasted resource hash value bitmaps

Non-local Process blocks cleaned out

Resources and enqueues cleaned out

Resources remastered 0

0 GCS shadows traversed, 0 cancelled, 0 closed

0 GCS resources traversed, 0 cancelled

set master node info

Submitted all remote-enqueue requests

Update rdomain variables

Dwn-cvts replayed, VALBLKs dubious

All grantable enqueues granted

0 GCS shadows traversed, 0 replayed, 0 unopened

Submitted all GCS remote-cache requests

0 write requests issued in 480118 GCS resources

116 PIs marked suspect, 0 flush PI msgs

Mon Jun 15 15:53:29 2009

Reconfiguration complete

LCK0 started with pid=17

Mon Jun 15 15:53:35 2009

Successful mount of redo thread 2, with mount id 248173721

Mon Jun 15 15:53:35 2009

Database mounted in Shared Mode (CLUSTER_DATABASE=TRUE).

Completed: ALTER DATABASE MOUNT

Mon Jun 15 15:53:35 2009

ALTER DATABASE OPEN

Picked Lamport scheme to generate SCNs

Mon Jun 15 15:53:39 2009

LGWR: Primary database is in CLUSTER CONSISTENT mode

Thread 2 advanced to log sequence 18067

Thread 2 opened at log sequence 18067

Current log# 8 seq# 18067 mem# 0: /dev/vg_rac1/rG1_redo_512m_204

Current log# 8 seq# 18067 mem# 1: /dev/vg_rac2/rG2_redo_512m_214

Current log# 8 seq# 18067 mem# 2: /dev/vg_rac3/rG3_redo_512m_224

Successful open of redo thread 2

Mon Jun 15 15:53:39 2009

SMON: enabling cache recovery

Mon Jun 15 15:53:39 2009

ARC0: Evaluating archive log 6 thread 2 sequence 18066

ARC0: Beginning to archive log 6 thread 2 sequence 18066

Creating archive destination LOG_ARCHIVE_DEST_1: '/archlog2/fjmisc2_2_18066.arc'

Mon Jun 15 15:53:40 2009

Successfully onlined Undo Tablespace 3.

Mon Jun 15 15:53:40 2009

SMON: enabling tx recovery

Mon Jun 15 15:53:40 2009

Database Characterset is ZHS16GBK

replication_dependency_tracking turned off (no async multimaster replication found)

Completed: ALTER DATABASE OPEN

在这里，我们注意一下，在hpux的环境下，哪怕某用户没有cat /etc/hosts的权限，他也是能ping通主机名的，这个和linux还是有点区别的。但是rac是需要读取hosts文件的，因此oracle用户必须有读hosts的权限。

修改hosts后，rac通讯失败

相关文章

阿里云的rds的几个问题的探究

耐心，再耐心些

11g中autotask disable后应该查哪个表来验证？

发表回复 取消回复

发表回复取消回复