今天收到个告警某省的一个数据库的一个节点down了,重启后,只能到started状态,数据库无法open,登录上去后,看到alertlog中:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 |
Mon Jun 15 15:38:28 2009 Errors in file /oracle/app/oracle/admin/fjmisc/udump/fjmisc2_ora_26950.trc: ORA-00603: ORACLE server session terminated by fatal error ORA-27504: IPC error creating OSD context ORA-27300: OS system dependent operation:gethostbyname failed with status: 3 ORA-27301: OS failure message: No such process ORA-27302: failure occurred at: sskgxpmyip2 ORA-27303: additional information: nodename FJ_DB02 Mon Jun 15 15:41:19 2009 Errors in file /oracle/app/oracle/admin/fjmisc/udump/fjmisc2_ora_27062.trc: ORA-00603: ORACLE server session terminated by fatal error ORA-27504: IPC error creating OSD context ORA-27300: OS system dependent operation:gethostbyname failed with status: 3 ORA-27301: OS failure message: No such process ORA-27302: failure occurred at: sskgxpmyip2 ORA-27303: additional information: nodename FJ_DB02 Mon Jun 15 15:42:37 2009 Errors in file /oracle/app/oracle/admin/fjmisc/udump/fjmisc2_ora_27147.trc: ORA-00603: ORACLE server session terminated by fatal error ORA-27504: IPC error creating OSD context ORA-27300: OS system dependent operation:gethostbyname failed with status: 3 ORA-27301: OS failure message: No such process ORA-27302: failure occurred at: sskgxpmyip2 ORA-27303: additional information: nodename FJ_DB02 Mon Jun 15 15:46:19 2009 Errors in file /oracle/app/oracle/admin/fjmisc/udump/fjmisc2_ora_27362.trc: ORA-00603: ORACLE server session terminated by fatal error ORA-27504: IPC error creating OSD context |
当时第一个反映是gethostbyname failed 应该是主机名解析有问题了。但是ping的时候发现还是能ping通:
1 2 3 4 5 6 7 8 9 10 11 12 |
oracle@FJ_DB02:/oracle/app/oracle/admin/fjmisc/bdump > ping FJ_DB01 PING FJ_DB01: 64 byte packets 64 bytes from 10.203.17.1: icmp_seq=0. time=0. ms 64 bytes from 10.203.17.1: icmp_seq=1. time=0. ms 64 bytes from 10.203.17.1: icmp_seq=2. time=1. ms 64 bytes from 10.203.17.1: icmp_seq=3. time=3. ms 64 bytes from 10.203.17.1: icmp_seq=4. time=0. ms ----FJ_DB01 PING Statistics---- 5 packets transmitted, 5 packets received, 0% packet loss round-trip (ms) min/avg/max = 0/1/3 oracle@FJ_DB02:/oracle/app/oracle/admin/fjmisc/bdump > |
奇怪了,但是能ping通为啥报这个错?
询问了当地的同事,说修改了hosts文件,但是现在已经改回去了。进一步检查hosts文件,发现内容虽然已经改回去,但是权限没改回去:
1 2 3 4 5 |
drwxr-xr-x 2 bin bin 96 Dec 26 2006 gss -rw-r--r-- 1 bin bin 489 Dec 27 2006 hba.conf -rw------- 1 bin bin 847 Jun 15 15:08 hosts -r--r--r-- 1 root sys 847 Jun 15 14:57 hosts.bak -r--r--r-- 1 root sys 626 Dec 28 2006 hosts.old |
oracle用户cat hosts文件是cat不出来的。
于是改回这个文件的权限为444,重启数据库(注意先用abort shutdown数据库),问题解决:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 |
Mon Jun 15 15:52:09 2009 Shutting down instance (abort) License high water mark = 2 Instance terminated by USER, pid = 28002 Mon Jun 15 15:52:34 2009 Starting ORACLE instance (normal) Mon Jun 15 15:52:34 2009 Global Enqueue Service Resources = 36504, pool = 8 Mon Jun 15 15:52:34 2009 Global Enqueue Service Enqueues = 54768 LICENSE_MAX_SESSION = 0 LICENSE_SESSIONS_WARNING = 0 Mon Jun 15 15:52:50 2009 SCN scheme 1 Mon Jun 15 15:53:16 2009 Using log_archive_dest parameter default value LICENSE_MAX_USERS = 0 SYS auditing is disabled Starting up ORACLE RDBMS Version: 9.2.0.6.0. System parameters with non-default values: processes = 1500 timed_statistics = TRUE shared_pool_size = 536870912 large_pool_size = 218103808 java_pool_size = 218103808 spfile = /dev/vg_rac3/rG3_spfile_128m_01 control_files = /dev/vg_rac1/rG1_ctrl_256m_01, /dev/vg_rac2/rG2_ctrl_256m_02, /dev/vg_rac3/rG3_ctrl_256m_03 db_block_size = 8192 db_cache_size = 8589934592 compatible = 9.2.0.0.0 log_archive_start = TRUE log_archive_dest_1 = location=/archlog2 log_archive_format = fjmisc2_%t_%s.arc log_buffer = 4194304 db_files = 2000 db_file_multiblock_read_count= 16 cluster_database = TRUE cluster_database_instances= 2 thread = 2 fast_start_mttr_target = 800 instance_number = 2 undo_management = AUTO undo_tablespace = UNDOTBS2 undo_suppress_errors = TRUE undo_retention = 21600 remote_login_passwordfile= EXCLUSIVE db_domain = instance_name = fjmisc2 local_listener = fjmisc2 remote_listener = fjmisc job_queue_processes = 10 hash_join_enabled = TRUE background_dump_dest = /oracle/app/oracle/admin/fjmisc/bdump user_dump_dest = /oracle/app/oracle/admin/fjmisc/udump max_dump_file_size = 5120 core_dump_dest = /oracle/app/oracle/admin/fjmisc/cdump sort_area_size = 524288 db_name = fjmisc open_cursors = 400 star_transformation_enabled= FALSE query_rewrite_enabled = TRUE pga_aggregate_target = 838860800 Mon Jun 15 15:53:16 2009 cluster interconnect IPC version:Oracle UDP/IP IPC Vendor 1 proto 2 Version 1.0 PMON started with pid=2 DIAG started with pid=3 LMON started with pid=4 LMD0 started with pid=5 LMS0 started with pid=6 LMS1 started with pid=7 DBW0 started with pid=8 LGWR started with pid=9 CKPT started with pid=10 SMON started with pid=11 RECO started with pid=12 CJQ0 started with pid=13 Mon Jun 15 15:53:20 2009 ARCH: STARTING ARCH PROCESSES ARC0 started with pid=14 ARC0: Archival started ARC1 started with pid=15 Mon Jun 15 15:53:20 2009 ARCH: STARTING ARCH PROCESSES COMPLETE Mon Jun 15 15:53:20 2009 ARC1: Archival started Mon Jun 15 15:53:20 2009 ARC1: Thread not mounted Mon Jun 15 15:53:20 2009 ARC0: Thread not mounted Mon Jun 15 15:53:20 2009 ALTER DATABASE MOUNT Mon Jun 15 15:53:21 2009 lmon registered with NM - instance id 2 (internal mem no 1) Mon Jun 15 15:53:21 2009 Reconfiguration started (old inc 0, new inc 4) List of nodes: 0 1 Global Resource Directory frozen Communication channels reestablished Master broadcasted resource hash value bitmaps Non-local Process blocks cleaned out Resources and enqueues cleaned out Resources remastered 0 0 GCS shadows traversed, 0 cancelled, 0 closed 0 GCS resources traversed, 0 cancelled set master node info Submitted all remote-enqueue requests Update rdomain variables Dwn-cvts replayed, VALBLKs dubious All grantable enqueues granted 0 GCS shadows traversed, 0 replayed, 0 unopened Submitted all GCS remote-cache requests 0 write requests issued in 480118 GCS resources 116 PIs marked suspect, 0 flush PI msgs Mon Jun 15 15:53:29 2009 Reconfiguration complete LCK0 started with pid=17 Mon Jun 15 15:53:35 2009 Successful mount of redo thread 2, with mount id 248173721 Mon Jun 15 15:53:35 2009 Database mounted in Shared Mode (CLUSTER_DATABASE=TRUE). Completed: ALTER DATABASE MOUNT Mon Jun 15 15:53:35 2009 ALTER DATABASE OPEN Picked Lamport scheme to generate SCNs Mon Jun 15 15:53:39 2009 LGWR: Primary database is in CLUSTER CONSISTENT mode Thread 2 advanced to log sequence 18067 Thread 2 opened at log sequence 18067 Current log# 8 seq# 18067 mem# 0: /dev/vg_rac1/rG1_redo_512m_204 Current log# 8 seq# 18067 mem# 1: /dev/vg_rac2/rG2_redo_512m_214 Current log# 8 seq# 18067 mem# 2: /dev/vg_rac3/rG3_redo_512m_224 Successful open of redo thread 2 Mon Jun 15 15:53:39 2009 SMON: enabling cache recovery Mon Jun 15 15:53:39 2009 ARC0: Evaluating archive log 6 thread 2 sequence 18066 ARC0: Beginning to archive log 6 thread 2 sequence 18066 Creating archive destination LOG_ARCHIVE_DEST_1: '/archlog2/fjmisc2_2_18066.arc' Mon Jun 15 15:53:40 2009 Successfully onlined Undo Tablespace 3. Mon Jun 15 15:53:40 2009 SMON: enabling tx recovery Mon Jun 15 15:53:40 2009 Database Characterset is ZHS16GBK replication_dependency_tracking turned off (no async multimaster replication found) Completed: ALTER DATABASE OPEN |
在这里,我们注意一下,在hpux的环境下,哪怕某用户没有cat /etc/hosts的权限,他也是能ping通主机名的,这个和linux还是有点区别的。但是rac是需要读取hosts文件的,因此oracle用户必须有读hosts的权限。