I was implementing a change recently that was simple in itself: multiplexing the control files across different mount points. The plan was to run `alter system set control_files=... scope=spfile`, shut down the RAC database, `mv` the control files to their new locations, and start the database back up so that the new paths recorded in the spfile would take effect. The trouble started right after shutting down the RAC: the database would not come back up.
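For context, the intended change amounts to something like the following sketch. This is my reconstruction, not the exact statement run during the change; the three paths are the control file locations that appear later in the alert log.

```shell
# Sketch of the intended control-file relocation (assumed form).
CF1=/ora/db001/data001/SIAP/control01SIAP.ctl
CF2=/ora/db001/redo002/SIAP/control02SIAP.ctl
CF3=/ora/db001/redo001/SIAP/control03SIAP.ctl
# Statement to run in sqlplus before shutting the RAC down:
STMT="alter system set control_files='$CF1','$CF2','$CF3' scope=spfile sid='*';"
echo "$STMT"
# After this: srvctl stop database -d SIAP, mv the control files to the
# new mount points, then srvctl start database -d SIAP.
```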
Let's start from the shutdown and walk through what happened.
oracle@vus029pa:SIAP1:/opt/app/oracle/admin $ srvctl stop database -d SIAP
While the command ran, we were tailing (`tail -f`) the alert logs on both nodes. Node 2 shut down quickly and cleanly, but node 1 lingered.
Node 2's alert log:
Sat Jun 02 14:04:45 GMT 2012
Shutting down instance (immediate)
Sat Jun 02 14:04:45 GMT 2012
Shutting down instance: further logons disabled
Sat Jun 02 14:04:49 GMT 2012
Stopping background process CJQ0
Sat Jun 02 14:04:49 GMT 2012
Stopping background process QMNC
Sat Jun 02 14:04:51 GMT 2012
Stopping background process MMNL
Sat Jun 02 14:04:52 GMT 2012
Stopping background process MMON
License high water mark = 1004
Sat Jun 02 14:04:53 GMT 2012
Job queue slave processes stopped
Sat Jun 02 14:05:59 GMT 2012
ALTER DATABASE CLOSE NORMAL
Sat Jun 02 14:05:59 GMT 2012
SMON: disabling tx recovery
SMON: disabling cache recovery
Sat Jun 02 14:06:00 GMT 2012
Shutting down archive processes
Archiving is disabled
Sat Jun 02 14:06:05 GMT 2012
ARCH shutting down
ARC9: Archival stopped
Sat Jun 02 14:06:10 GMT 2012
ARCH shutting down
ARC8: Archival stopped
Sat Jun 02 14:06:15 GMT 2012
ARCH shutting down
ARC7: Archival stopped
Sat Jun 02 14:06:21 GMT 2012
ARCH shutting down
ARC6: Archival stopped
Sat Jun 02 14:06:31 GMT 2012
ARCH shutting down
ARC4: Archival stopped
Sat Jun 02 14:06:36 GMT 2012
ARCH shutting down
ARC3: Archival stopped
Sat Jun 02 14:06:46 GMT 2012
ARCH shutting down
ARC1: Archival stopped
Sat Jun 02 14:06:51 GMT 2012
ARCH shutting down
Sat Jun 02 14:06:51 GMT 2012
ARC5: Becoming the 'no FAL' ARCH
Sat Jun 02 14:06:51 GMT 2012
ARC0: Archival stopped
Sat Jun 02 14:06:51 GMT 2012
ARC5: Becoming the 'no SRL' ARCH
ARC5: Archiving disabled
Sat Jun 02 14:06:51 GMT 2012
ARCH shutting down
ARC5: Archival stopped
Sat Jun 02 14:06:51 GMT 2012
ARC2: Becoming the heartbeat ARCH
Sat Jun 02 14:06:58 GMT 2012
Thread 2 closed at log sequence 21703
Successful close of redo thread 2
Sat Jun 02 14:07:26 GMT 2012
Completed: ALTER DATABASE CLOSE NORMAL
Sat Jun 02 14:07:26 GMT 2012
ALTER DATABASE DISMOUNT
Completed: ALTER DATABASE DISMOUNT
ARCH: Archival disabled due to shutdown: 1089
Shutting down archive processes
Archiving is disabled
Archive process shutdown avoided: 0 active
ARCH: Archival disabled due to shutdown: 1089
Shutting down archive processes
Archiving is disabled
Archive process shutdown avoided: 0 active
Sat Jun 02 14:07:33 GMT 2012
freeing rdom 0
Node 2 was fully down by about 14:07.
Node 1's alert log:
Sat Jun 02 14:04:52 GMT 2012
Shutting down instance (immediate)
Sat Jun 02 14:04:52 GMT 2012
Shutting down instance: further logons disabled
Sat Jun 02 14:04:54 GMT 2012
Stopping background process CJQ0
Sat Jun 02 14:04:54 GMT 2012
Stopping background process QMNC
Sat Jun 02 14:04:56 GMT 2012
Stopping background process MMNL
Sat Jun 02 14:04:57 GMT 2012
Stopping background process MMON
License high water mark = 1417
Sat Jun 02 14:04:58 GMT 2012
Job queue slave processes stopped
Sat Jun 02 14:12:41 GMT 2012
IPC Send timeout detected. Sender: ospid 14002 Receiver: inst 2 binc 379475 ospid 17274
Sat Jun 02 14:12:42 GMT 2012
IPC Send timeout to 1.0 inc 4 for msg type 8 from opid 33
Sat Jun 02 14:12:42 GMT 2012
Communications reconfiguration: instance_number 2
Sat Jun 02 14:12:42 GMT 2012
Trace dumping is performing id=[cdmp_20120602141242]
Sat Jun 02 14:13:33 GMT 2012
Shutting down instance (abort)
License high water mark = 1417
Termination issued to instance processes. Waiting for the processes to exit
Sat Jun 02 14:13:43 GMT 2012
Instance termination failed to kill one or more processes
Sat Jun 02 14:14:13 GMT 2012
Instance terminated by USER, pid = 5214
So node 1 was finally shut down at 14:13, and the clusterware did it with a shutdown abort.
At that point the stop command returned with the following error:
oracle@vus029pa:SIAP1:/opt/app/oracle/admin $ srvctl stop database -d SIAP
PRKP-1002 : Error stopping instance SIAP1 on node vus029pa
CRS-0216: Could not stop resource 'ora.SIAP.SIAP1.inst'.
oracle@vus029pa:SIAP1:/opt/app/oracle/admin $
Although Oracle reported that instance 1 could not be stopped, a check showed all of its OS processes were gone. Believing node 1 to be fully down, we went ahead and restarted the database. Unexpectedly, it would not start:
oracle@vus029pa:SIAP1:/ora/db001/data001/SIAP $ srvctl start database -d SIAP
PRKP-1001 : Error starting instance SIAP1 on node vus029pa
CRS-1028: Dependency analysis failed because of:
CRS-0223: Resource 'ora.SIAP.SIAP1.inst' has placement error.
PRKP-1001 : Error starting instance SIAP2 on node vus160po
CRS-0215: Could not start resource 'ora.SIAP.SIAP2.inst'.
oracle@vus029pa:SIAP1:/ora/db001/data001/SIAP $
Node 1 got roughly as far as nomount — its background processes existed but the startup went no further — while node 2 did not come up at all.
The corresponding alert logs:
Node 1:
Starting ORACLE instance (normal)
Sat Jun 02 14:40:57 GMT 2012
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Picked latch-free SCN scheme 3
LICENSE_MAX_USERS = 0
SYS auditing is disabled
ksdpec: called for event 13740 prior to event group initialization
Starting up ORACLE RDBMS Version: 10.2.0.5.0.
System parameters with non-default values:
  processes                = 1500
  timed_statistics         = TRUE
  __shared_pool_size       = 6442450944
  shared_pool_size         = 6442450944
  __large_pool_size        = 218103808
  large_pool_size          = 218103808
  __java_pool_size         = 33554432
  java_pool_size           = 33554432
  __streams_pool_size      = 0
  spfile                   = /ora/db001/data001/SIAP/spfileSIAP.ora
  filesystemio_options     = SETALL
  disk_asynch_io           = TRUE
  cluster_interconnects    = 192.168.0.1
  sga_target               = 21474836480
  control_files            = /ora/db001/data001/SIAP/control01SIAP.ctl, /ora/db001/redo002/SIAP/control02SIAP.ctl, /ora/db001/redo001/SIAP/control03SIAP.ctl
  control_file_record_keep_time= 30
  db_block_size            = 8192
  __db_cache_size          = 14763950080
  db_cache_size            = 10737418240
  db_writer_processes      = 10
  compatible               = 10.2.0
  log_archive_config       = DG_CONFIG=(SIAP,SIAP_DG)
  log_archive_dest_1       = location=/ora/db001/dbfra001/SIAP valid_for=(all_logfiles,all_roles) db_unique_name=SIAP
  log_archive_dest_2       = service=SIAP_DG valid_for=(online_logfiles,primary_role) LGWR ASYNC DB_UNIQUE_NAME=SIAP_DG
  log_archive_dest_state_1 = ENABLE
  log_archive_dest_state_2 = ENABLE
  log_archive_max_processes= 10
  standby_archive_dest     = /ora/db001/dbfra001/SIAP/standby
  log_archive_format       = arch_SIAP_%s_%t_%r.arc
  fal_client               = SIAP
  fal_server               = SIAP_DG
  log_buffer               = 14234112
  db_files                 = 1000
  db_file_multiblock_read_count= 16
  cluster_database         = TRUE
  cluster_database_instances= 2
  db_recovery_file_dest    = /ora/db001/dbfra001/SIAP
  db_recovery_file_dest_size= 64424509440
  standby_file_management  = AUTO
  thread                   = 1
  fast_start_mttr_target   = 300
  log_checkpoints_to_alert = TRUE
  instance_number          = 1
  undo_management          = AUTO
  undo_tablespace          = undots
  undo_retention           = 3600
  create_stored_outlines   = FALSE
  remote_login_passwordfile= EXCLUSIVE
  db_domain                = in.telstra.world
  instance_name            = SIAP1
  local_listener           = LISTENER_VUS029PA
  remote_listener          = LISTENER_SIAP
  session_cached_cursors   = 100
  utl_file_dir             = /tmp
  job_queue_processes      = 2
  cursor_sharing           = force
  background_dump_dest     = /opt/app/oracle/admin/SIAP/bdump
  user_dump_dest           = /opt/app/oracle/admin/SIAP/udump
  max_dump_file_size       = 5M
  core_dump_dest           = /opt/app/oracle/admin/SIAP/cdump
  audit_file_dest          = /opt/app/oracle/admin/SIAP/adump
  session_max_open_files   = 200
  open_links               = 4
  audit_trail              = DB
  db_name                  = SIAP
  open_cursors             = 1000
  star_transformation_enabled= true
  query_rewrite_enabled    = true
  query_rewrite_integrity  = trusted
  pga_aggregate_target     = 17179869184
  workarea_size_policy     = auto
Cluster communication is configured to use the following interface(s) for this instance
  192.168.0.1
Sat Jun 02 14:41:01 GMT 2012
cluster interconnect IPC version:Oracle UDP/IP (generic)
IPC Vendor 1 proto 2
PMON started with pid=2, OS id=10522
DIAG started with pid=3, OS id=10524
PSP0 started with pid=4, OS id=10526
LMON started with pid=5, OS id=10528
LMD0 started with pid=6, OS id=10530
LMS0 started with pid=7, OS id=10539
LMS1 started with pid=8, OS id=10544
LMS2 started with pid=9, OS id=10548
LMS3 started with pid=10, OS id=10552
LMS4 started with pid=11, OS id=10556
LMS5 started with pid=12, OS id=10560
LMS6 started with pid=13, OS id=10564
LMS7 started with pid=14, OS id=10576
LMS8 started with pid=15, OS id=10580
LMS9 started with pid=16, OS id=10584
LMSa started with pid=17, OS id=10594
LMSb started with pid=18, OS id=10598
LMSc started with pid=19, OS id=10602
LMSd started with pid=20, OS id=10606
MMAN started with pid=21, OS id=10623
DBW0 started with pid=22, OS id=10653
DBW1 started with pid=23, OS id=10673
DBW2 started with pid=24, OS id=10686
DBW3 started with pid=25, OS id=10690
DBW4 started with pid=26, OS id=10692
DBW5 started with pid=27, OS id=10694
DBW6 started with pid=28, OS id=10696
DBW7 started with pid=29, OS id=10706
DBW8 started with pid=30, OS id=10708
DBW9 started with pid=31, OS id=10710
LGWR started with pid=32, OS id=10712
CKPT started with pid=33, OS id=10714
SMON started with pid=34, OS id=10716
RECO started with pid=35, OS id=10718
CJQ0 started with pid=36, OS id=10720
MMON started with pid=37, OS id=10723
MMNL started with pid=38, OS id=10732    <<< hung here
Node 2's alert log:
Sat Jun 02 14:22:15 GMT 2012
Starting ORACLE instance (normal)
Sat Jun 02 14:23:25 GMT 2012
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Picked latch-free SCN scheme 3
LICENSE_MAX_USERS = 0
SYS auditing is disabled
ksdpec: called for event 13740 prior to event group initialization
Starting up ORACLE RDBMS Version: 10.2.0.5.0.
System parameters with non-default values:
  processes                = 1500
  timed_statistics         = TRUE
  __shared_pool_size       = 6442450944
  shared_pool_size         = 6442450944
  __large_pool_size        = 218103808
  large_pool_size          = 218103808
  __java_pool_size         = 33554432
  java_pool_size           = 33554432
  __streams_pool_size      = 0
  spfile                   = /ora/db001/data001/SIAP/spfileSIAP.ora
  filesystemio_options     = SETALL
  disk_asynch_io           = TRUE
  cluster_interconnects    = 192.168.0.2
  sga_target               = 21474836480
  control_files            = /ora/db001/data001/SIAP/control01SIAP.ctl, /ora/db001/redo002/SIAP/control02SIAP.ctl, /ora/db001/redo001/SIAP/control03SIAP.ctl
  control_file_record_keep_time= 30
  db_block_size            = 8192
  __db_cache_size          = 14763950080
  db_cache_size            = 10737418240
  db_writer_processes      = 10
  compatible               = 10.2.0
  log_archive_config       = DG_CONFIG=(SIAP,SIAP_DG)
  log_archive_dest_1       = location=/ora/db001/dbfra001/SIAP valid_for=(all_logfiles,all_roles) db_unique_name=SIAP
  log_archive_dest_2       = service=SIAP_DG valid_for=(online_logfiles,primary_role) LGWR ASYNC DB_UNIQUE_NAME=SIAP_DG
  log_archive_dest_state_1 = ENABLE
  log_archive_dest_state_2 = ENABLE
  log_archive_max_processes= 10
  standby_archive_dest     = /ora/db001/dbfra001/SIAP/standby
  log_archive_format       = arch_SIAP_%s_%t_%r.arc
  fal_client               = SIAP
  fal_server               = SIAP_DG
  log_buffer               = 14234112
  db_files                 = 1000
  db_file_multiblock_read_count= 16
  cluster_database         = TRUE
  cluster_database_instances= 2
  db_recovery_file_dest    = /ora/db001/dbfra001/SIAP
  db_recovery_file_dest_size= 64424509440
  standby_file_management  = AUTO
  thread                   = 2
  fast_start_mttr_target   = 300
  log_checkpoints_to_alert = TRUE
  instance_number          = 2
  undo_management          = AUTO
  undo_tablespace          = undots2
  undo_retention           = 3600
  create_stored_outlines   = FALSE
  remote_login_passwordfile= EXCLUSIVE
  db_domain                = in.telstra.world
  instance_name            = SIAP2
  local_listener           = LISTENER_VUS160PO
  remote_listener          = LISTENER_SIAP
  session_cached_cursors   = 100
  utl_file_dir             = /tmp
  job_queue_processes      = 2
  cursor_sharing           = force
  background_dump_dest     = /opt/app/oracle/admin/SIAP/bdump
  user_dump_dest           = /opt/app/oracle/admin/SIAP/udump
  max_dump_file_size       = 5M
  core_dump_dest           = /opt/app/oracle/admin/SIAP/cdump
  audit_file_dest          = /opt/app/oracle/admin/SIAP/adump
  session_max_open_files   = 200
  open_links               = 4
  audit_trail              = DB
  db_name                  = SIAP
  open_cursors             = 1000
  star_transformation_enabled= true
  query_rewrite_enabled    = true
  query_rewrite_integrity  = trusted
  pga_aggregate_target     = 17179869184
  workarea_size_policy     = auto
Cluster communication is configured to use the following interface(s) for this instance
  192.168.0.2
Sat Jun 02 14:23:28 GMT 2012
cluster interconnect IPC version:Oracle UDP/IP (generic)
IPC Vendor 1 proto 2
PMON started with pid=2, OS id=24777
DIAG started with pid=3, OS id=24779
PSP0 started with pid=4, OS id=24781
LMON started with pid=5, OS id=24783
LMD0 started with pid=6, OS id=24785
LMS0 started with pid=7, OS id=24787
LMS1 started with pid=8, OS id=24794
LMS2 started with pid=9, OS id=24803
LMS3 started with pid=10, OS id=24807
LMS4 started with pid=11, OS id=24811
LMS5 started with pid=12, OS id=24816
LMS6 started with pid=13, OS id=24841
LMS7 started with pid=14, OS id=24852
LMS8 started with pid=15, OS id=24857
LMS9 started with pid=16, OS id=24861
LMSa started with pid=17, OS id=24865
LMSb started with pid=18, OS id=24901
LMSc started with pid=19, OS id=24907
LMSd started with pid=20, OS id=24920
MMAN started with pid=21, OS id=24924
DBW0 started with pid=22, OS id=24926
DBW1 started with pid=23, OS id=24928
DBW2 started with pid=24, OS id=24930
DBW3 started with pid=25, OS id=24932
DBW4 started with pid=26, OS id=24934
DBW5 started with pid=27, OS id=24938
DBW6 started with pid=28, OS id=24946
DBW7 started with pid=29, OS id=24948
DBW8 started with pid=30, OS id=24950
DBW9 started with pid=31, OS id=24952
LGWR started with pid=32, OS id=24954
CKPT started with pid=33, OS id=24956
SMON started with pid=34, OS id=24958
RECO started with pid=35, OS id=24968
CJQ0 started with pid=36, OS id=24970
MMON started with pid=37, OS id=24972
MMNL started with pid=38, OS id=24976
Sat Jun 02 14:23:46 GMT 2012
USER: terminating instance due to error 304
Instance terminated by USER, pid = 23354
So node 2 did in fact attempt to start, but it hit error 304 and terminated itself.
After a manual shutdown abort of node 1, we tried again: same story. Node 1 spawned its background processes and then hung; node 2 started and was immediately terminated.
The srvctl error was slightly different this time, reporting only that node 2 failed to start:
oracle@vus029pa:SIAP1:/ora/db001/data001/SIAP $ srvctl start database -d SIAP
PRKP-1001 : Error starting instance SIAP2 on node vus160po
CRS-0215: Could not start resource 'ora.SIAP.SIAP2.inst'.
oracle@vus029pa:SIAP1:/ora/db001/data001/SIAP $
Repeated attempts produced the same error. So the RAC would not start because node 2 could not start, and node 2 could not start because it kept hitting error 304.
Let's look up what error 304 means:
oracle@vus160po:SIAP2:/opt/app/oracle/product/10.2.0/db_1/dbs $ oerr ora 304
00304, 00000, "requested INSTANCE_NUMBER is busy"
// *Cause:  An instance tried to start by using a value of the
//          initialization parameter INSTANCE_NUMBER that is already in use.
// *Action: Either
//          a) specify another INSTANCE_NUMBER,
//          b) shut down the running instance with this number
//          c) wait for instance recovery to complete on the instance with
//             this number.
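Action (b) hints at the useful question: is something still holding the instance number? A quick sanity check — my own addition, not a step from the original session — is to grep the process list for survivors of the supposedly-dead instance. Shown here against canned `ps -ef` rows so the filter itself can be verified:

```shell
# Count processes still belonging to instance SIAP2. The canned SAMPLE stands
# in for real `ps -ef` output; note the tail -f row must not be counted.
SAMPLE='oracle 26392 1 0 Jan 25 ? 0:00 oracleSIAP2 (LOCAL=NO)
oracle 12509 8593 0 17:31:03 pts/4 0:00 tail -f alert_SIAP2.log
oracle 26374 1 0 Jan 25 ? 0:00 oracleSIAP2 (LOCAL=NO)'
LEFTOVER=$(echo "$SAMPLE" | grep -c 'oracleSIAP2 (LOCAL=NO)')
echo "leftover SIAP2 server processes: $LEFTOVER"
```

Against the live system, replace the `echo "$SAMPLE"` with `ps -ef`. A nonzero count while the instance is supposed to be down is exactly the ORA-00304 setup described below.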
According to the alert log, node 2's instance_number is 2:
……
fast_start_mttr_target   = 300
log_checkpoints_to_alert = TRUE
instance_number          = 2
……
Following action (a), we tried remapping instance_number in the pfile:
SIAP1.instance_number=3
SIAP2.instance_number=4
Another restart attempt still failed:
oracle@vus029pa:SIAP1:/opt/app/oracle/product/10.2.0/db_1/dbs $ srvctl start service -d SIAP -s SIAP2
PRKP-1030 : Failed to start the service SIAP2.
CRS-0215: Could not start resource 'ora.SIAP.SIAP2.cs'.
Node 1's alert log looked the same as before — hung right after `MMNL started with pid=xxx` — while node 2's alert log got no new entries at all; node 2 never even attempted to start.
Since changing instance_number made no difference, we reverted it.
Next we tried starting node 2 manually:
SQL> startup nomount
ORA-00304: requested INSTANCE_NUMBER is busy
SQL>
At this point a colleague who was looking at the problem with me spotted the root cause: although the database had been shut down, quite a few `LOCAL=NO` server processes were still alive at the OS level.
ps -ef |grep SIAP2
oracle  5074     1  0 17:49:36 ?     0:02 oracleSIAP2 (LOCAL=NO)
oracle 18782     1  0 18:00:05 ?     1:01 oracleSIAP2 (LOCAL=NO)
oracle  4649     1  0 17:49:19 ?     0:46 oracleSIAP2 (LOCAL=NO)
oracle 21736     1  0 18:02:31 ?     0:29 oracleSIAP2 (LOCAL=NO)
oracle  3693     1  0   Mar 23 ?     0:00 oracleSIAP2 (LOCAL=NO)
oracle 26392     1  0   Jan 25 ?     0:00 oracleSIAP2 (LOCAL=NO)
oracle 12509  8593  0 17:31:03 pts/4 0:00 tail -f /opt/app/oracle/admin/SIAP/bdump/alert_SIAP2.log
oracle 26382     1  0   Jan 25 ?     0:00 oracleSIAP2 (LOCAL=NO)
oracle 29425     1  0   Apr 14 ?     0:00 oracleSIAP2 (LOCAL=NO)
oracle 26372     1  0   Jan 25 ?     0:00 oracleSIAP2 (LOCAL=NO)
oracle 23702     1  0   May 19 ?     0:00 oracleSIAP2 (LOCAL=NO)
oracle 26368     1  0   Jan 25 ?     0:00 oracleSIAP2 (LOCAL=NO)
oracle 26377     1  0   Jan 25 ?     0:00 oracleSIAP2 (LOCAL=NO)
oracle  7945     1  0   Apr 28 ?     0:00 oracleSIAP2 (LOCAL=NO)
oracle 26396     1  0   Jan 25 ?     0:00 oracleSIAP2 (LOCAL=NO)
oracle 26379     1  0   Jan 25 ?     0:00 oracleSIAP2 (LOCAL=NO)
oracle 14564     1  0   May 25 ?     0:00 oracleSIAP2 (LOCAL=NO)
oracle 26367     1  0   Jan 25 ?     0:00 oracleSIAP2 (LOCAL=NO)
oracle 26373     1  0   Jan 25 ?     0:00 oracleSIAP2 (LOCAL=NO)
oracle 23715     1  0   May 19 ?     0:00 oracleSIAP2 (LOCAL=NO)
oracle  6311     1  0   Apr 18 ?     0:00 oracleSIAP2 (LOCAL=NO)
oracle 29428     1  0   Apr 14 ?     0:00 oracleSIAP2 (LOCAL=NO)
oracle 27062     1  0   Apr 02 ?     0:00 oracleSIAP2 (LOCAL=NO)
oracle  1283     1  0   May 12 ?     0:00 oracleSIAP2 (LOCAL=NO)
oracle 26380     1  0   Jan 25 ?     0:00 oracleSIAP2 (LOCAL=NO)
oracle 26374     1  0   Jan 25 ?     0:00 oracleSIAP2 (LOCAL=NO)
oracle 26369     1  0   Jan 25 ?     0:00 oracleSIAP2 (LOCAL=NO)
oracle 26395     1  0   Jan 25 ?     0:00 oracleSIAP2 (LOCAL=NO)
oracle 26370     1  0   Jan 25 ?     0:00 oracleSIAP2 (LOCAL=NO)
oracle 20775     1  0   May 28 ?     0:00 oracleSIAP2 (LOCAL=NO)
oracle 26371     1  0   Jan 25 ?     0:00 oracleSIAP2 (LOCAL=NO)
The `ps -ef` output confirmed it: plenty of `LOCAL=NO` processes that never went away. They made Oracle believe instance 2 was still active, which is why the instance could not be restarted.
This system runs two RAC databases on the same pair of servers, and we had stopped only one of them. With the other database still running, the servers were naturally full of `LOCAL=NO` processes, which is why I didn't notice at first that some of them still belonged to the database we had stopped.
OK, the problem was found. We killed the leftover processes with:
kill -9 `ps -ef|grep LOCAL=NO |grep SIAP2|grep -v grep|awk '{print $2}'` |
After killing them, the database restarted without a hitch.
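A pipeline like that is worth sanity-checking before aiming `kill -9` at it — a wrong pattern here could take out sessions of the other database running on the same servers. The same filter run against a few canned `ps` rows (abridged from the listing above; the PIDs shown are from that listing):

```shell
# Verify the PID-harvesting filter on sample rows before wiring it into kill.
SAMPLE='oracle  5074     1  0 17:49:36 ?     0:02 oracleSIAP2 (LOCAL=NO)
oracle 12509  8593  0 17:31:03 pts/4 0:00 tail -f alert_SIAP2.log
oracle 26392     1  0   Jan 25 ?     0:00 oracleSIAP2 (LOCAL=NO)'
# Same filter chain as the real command, fed sample text instead of ps -ef:
PIDS=$(echo "$SAMPLE" | grep 'LOCAL=NO' | grep SIAP2 | grep -v grep | awk '{print $2}')
echo $PIDS   # only the two SIAP2 server-process PIDs survive the filter
```

The `grep SIAP2` term is what protects the other database's `LOCAL=NO` sessions; the `tail -f` row is dropped because it never matches `LOCAL=NO`.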
Sigh. I actually wrote an article long ago about tricks for shutting down a database, including killing `LOCAL=NO` processes. I didn't expect the same trick to apply in a RAC environment — and that failing to kill the `LOCAL=NO` processes could actually prevent the database from starting at all.