Environment:
Oracle 9.2.0.6, three-node RAC, on HP-UX 11.11.
【Oct 29: the failure occurs】
During the truncate of a large table (all of the operations below were performed on node 1, gdrac1):
SQL> select sum(bytes)/1024/1024/1024 size_g from user_segments where segment_name='ALERTBUFF';

    SIZE_G
----------
  329.8125

  1* select tablespace_name,PARTITIONED from user_tables where table_name='ALERTBUFF'
SQL> /

TABLESPACE_NAME                PAR
------------------------------ ---
DATA_DYNAMIC                   NO

SQL> select EXTENT_MANAGEMENT,SEGMENT_SPACE_MANAGEMENT from user_tablespaces where tablespace_name='DATA_DYNAMIC';

EXTENT_MAN SEGMEN
---------- ------
LOCAL      MANUAL

SQL> select index_name,tablespace_name,PARTITIONED from user_indexes where table_name='ALERTBUFF';

INDEX_NAME                     TABLESPACE_NAME                PAR
------------------------------ ------------------------------ ---
IDX_ALERTBUFF_ALERTTIME        DATA_DYNAMIC                   NO
PK_ALERTBUFF                   DATA_DYNAMIC                   NO
We ran truncate table xxx reuse storage, intending to keep the allocated extents at first and then release them step by step after the truncate.
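The intended approach was roughly the following; this is only a sketch, and the KEEP sizes are illustrative values rather than the ones actually planned:

SQL> truncate table misc.alertbuff reuse storage;                 -- keep the allocated extents
SQL> alter table misc.alertbuff deallocate unused keep 204800m;   -- then shrink in steps
SQL> alter table misc.alertbuff deallocate unused keep 102400m;
SQL> alter table misc.alertbuff deallocate unused keep 0;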
But while truncate table ALERTBUFF reuse storage was running, the network connection dropped. When we logged back into the database the session was still there, so we looked up its sid, serial# and spid, killed the session and the OS process, and planned to rerun the truncate.
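For reference, locating and killing the session would have been done with commands of this kind (a reconstruction, not a transcript of what was actually typed; the serial# is left as a substitution variable rather than invented):

SQL> select s.sid, s.serial#, p.spid
  2    from v$session s, v$process p
  3   where s.paddr = p.addr
  4     and s.sid = 64;

SQL> alter system kill session '64,&serial';

-- then, at the OS level on gdrac1: kill -9 <spid>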
A check showed that the session's status had changed to KILLED, but its locks were still present in dba_dml_locks:
SQL> select * from dba_dml_locks;

SESSION_ID OWNER      NAME       MODE_HELD     MODE_REQUESTE LAST_CONVERT BLOCKING_OTHERS
---------- ---------- ---------- ------------- ------------- ------------ ---------------
        64 MISC       ALERTBUFF  Exclusive     None                  1643 Global
        64 SYS        MLOG$      Row-X (SX)    None                  1643 Global

SQL> l
  1* select * from v$session_wait where sid=64
SQL> /

 SID  SEQ# EVENT           P1TEXT    P1         P1RAW            P2TEXT  P2 P2RAW            P3TEXT  P3 P3RAW            WAIT_TIME SECONDS_IN_WAIT STATE
---- ----- --------------- --------- ---------- ---------------- ------ --- ---------------- ------ --- ---------------- --------- --------------- -------
  64  3540 DFS lock handle type|mode 1128857605 0000000043490005 id1     18 0000000000000012 id2      2 0000000000000002         0           39340 WAITING

SQL>
The alert log contained this warning:
……
Wed Oct 29 23:12:28 2008
PMON failed to delete process, see PMON trace file
Wed Oct 29 23:14:40 2008
Thread 1 advanced to log sequence 53672
Current log# 1 seq# 53672 mem# 0: /ora_redo/redo11
Current log# 1 seq# 53672 mem# 1: /ora_redo/redo12
Wed Oct 29 23:14:40 2008
For the related PMON trace file, click here: gdrac1_pmon_7976.
At that point no operation on this table would complete; even a select ended up waiting on library cache lock.
Many helpful experts on ITPUB offered suggestions (click here to view the thread), such as locating the dead process reported in the PMON trace and killing it at the OS level, or renaming the table, and so on. But the process in question had already been killed on the night of the incident.
【Oct 30: the failure escalates】
As time went on we found that, besides ALERTBUFF, other tables were also being affected:
12:43:56 SQL> select * from gv$lock where block=1;

   INST_ID ADDR             KADDR                   SID TY        ID1        ID2      LMODE    REQUEST      CTIME      BLOCK
---------- ---------------- ---------------- ---------- -- ---------- ---------- ---------- ---------- ---------- ----------
         1 C0000004439C2CB0 C0000004439C2CD0         95 JI    1529349          0          6          0      27846          1
         3 C0000004439C27D0 C0000004439C27F0        577 RO          1          0          6          0       4008          1

2 rows selected.

Checking on instance 3:

SQL> select instance_name from v$instance;

INSTANCE_NAME
----------------
gdrac3

SQL> select event from v$session_wait where sid=577;

EVENT
----------------------------------------------------------------
DFS lock handle

SQL> select sql_text from v$session a ,v$sqlarea b
  2  where a.sql_address=b.address and a.sid=577;

SQL_TEXT
--------------------------------------------------------------------------------
truncate table Service_exclusive
A desc of Service_exclusive hung indefinitely.
A select against this table likewise waited on library cache lock:
SQL> select sid from v$mystat group by sid;

       SID
----------
       945

  1* select * from portal.Service_exclusive where rownum<=10
SQL> /
(the SQL hung at this point)

From another session:

SQL> select event from gv$session_wait where inst_id=3 and sid=945;

EVENT
----------------------------------------------------------------
library cache lock
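On a 9i instance, one way to see which session actually holds a library cache lock is to query x$kgllk as SYS on the node where the waiter sits. This is a generic diagnostic sketch (the x$ column names are as I recall them for 9.2, so treat them as an assumption), not a query taken from the incident:

-- run as SYS on the instance where the waiting session is; x$kgllk is per-instance
SQL> select s.sid, s.serial#, l.kglnaobj object_name,
  2         l.kgllkmod mode_held, l.kgllkreq mode_requested
  3    from x$kgllk l, v$session s
  4   where l.kgllkses = s.saddr
  5     and l.kglnaobj like 'SERVICE_EXCLUSIVE%';
-- rows with kgllkmod > 1 hold the lock; rows with kgllkreq > 0 are waiting for it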
At the same time we opened an SR on MetaLink. Oracle said it was a bug (5732292) with no workaround; the only fix was to restart the database ("You must try restarting the instances to see if this helps fix it.").
Following the SR's advice, we planned to restart rac2 and rac3 to see whether that cleared the problem; if not, we would find another window to restart all nodes:
【Nov 1: restart rac2 and rac3】
I. The plan:
Stopping two nodes and leaving one running would not interrupt the business (only degrade it); stopping all three nodes would affect the entire business (a full outage).
So we would try the lower-impact recovery first. In the SR Oracle also said to stop rac2 and rac3 and check whether things recovered, and if not, to shutdown abort rac1 as well. From an impact point of view we would therefore restart two nodes first and see whether that solved it; if not, we would schedule a restart of all three nodes.
To minimize the impact on the business, we would try to recover without bringing the whole database down, operating in the small hours when traffic was low:
1. Stop some of the applications that are not single points of failure.
2. Stop the listener, set job_queue_processes to 0, kill the LOCAL=NO processes, shutdown rac2 (see the sketch after this list).
3. Stop the listener, set job_queue_processes to 0, kill the LOCAL=NO processes, shutdown rac3.
4. Keep rac1 online and watch v$transaction and v$session_wait on rac1; if the smon timer wait appears, things are back to normal.
5. Whether or not it recovers, start the database again after 3 hours and bring the business systems back.
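For steps 2 and 3, the per-instance commands would be along these lines. This is only a sketch: the listener name is illustrative, and generating the OS kill commands for the dedicated (LOCAL=NO) server processes from v$process/v$session is just one way to do what was actually done with ps at the OS level:

$ lsnrctl stop listener_gdrac2          -- listener name is illustrative

SQL> alter system set job_queue_processes=0;

SQL> -- list kill commands for the remaining user (dedicated server) processes
SQL> select 'kill -9 '||p.spid
  2    from v$process p, v$session s
  3   where s.paddr = p.addr
  4     and s.type = 'USER'
  5     and s.username is not null
  6     and s.sid <> (select sid from v$mystat where rownum = 1);

SQL> shutdown immediate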
II. Execution:
When we shut down rac2, nothing happened for nearly an hour. In the alert log:
ALTER DATABASE CLOSE NORMAL
Sat Nov 1 00:28:11 2008
Waiting for smon to disable tx recovery.
Sat Nov 1 01:17:08 2008
Thread 2 advanced to log sequence 46765
Current log# 9 seq# 46765 mem# 0: /ora_redo/redo91
Current log# 9 seq# 46765 mem# 1: /ora_redo/redo92
Sat Nov 1 01:17:08 2008
ARC1: Evaluating archive log 10 thread 2 sequence 46764
ARC1: Beginning to archive log 10 thread 2 sequence 46764
Creating archive destination LOG_ARCHIVE_DEST_1: '/arch/gdrac_2_46764.arc'
ARC1: Completed archiving log 10 thread 2 sequence 46764
We had no choice but to shutdown abort.
We started the instance again, it came up normally, and we tried shutdown immediate once more.
Still nothing.
Having seen that the instance came back up normally after a shutdown abort, we decided to shutdown abort rac2 and carry on stopping rac3 as planned.
shutdown abort rac2
shutdown immediate rac3: again no response. We waited well over half an hour.
shutdown abort rac3
Watching the transactions and waits on rac1, the hung session 64 was still there, and the database still showed large numbers of library cache pin, library cache lock, and enqueue waits.
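The waits were summarized with a query of this sort (a reconstruction of the kind of check used, not the exact statement from that night):

SQL> select inst_id, event, count(*)
  2    from gv$session_wait
  3   where event in ('library cache pin', 'library cache lock',
  4                   'enqueue', 'DFS lock handle')
  5   group by inst_id, event
  6   order by inst_id, event;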
We restarted rac2 and its listener, then rac3 and its listener.
Business was resumed.
So there was no way to resolve the problem while keeping the business impact to a minimum; it looked like all nodes would have to be restarted.
III. Follow-up:
The plan at that point was to restart the entire database on Monday night (Nov 3).
【Early morning of Nov 4: restart all three nodes, i.e. the entire database】:
1. At 23:00, stop the rac2 listener.
2. Checkpoint 3 times, switch logfile 3 times, stop the job queue processes, shutdown immediate (a command sketch follows this list).
3. After 30 minutes the instance still had not come down; the oradebug trace file showed:
WAIT #1: nam='instance state change' ela= 986275 p1=1 p2=0 p3=170
WAIT #1: nam='instance state change' ela= 986281 p1=1 p2=0 p3=171
WAIT #1: nam='instance state change' ela= 986287 p1=1 p2=0 p3=172
WAIT #1: nam='instance state change' ela= 986275 p1=1 p2=0 p3=173
……
4.shutdown abort rac2
5. At 00:00, stop the applications.
6. Stop the rac3 listener and the rac1 listener.
7. On rac3, checkpoint 3 times, switch logfile 3 times, stop the job queue processes, shutdown immediate.
8. After 30 minutes it still had not come down; the oradebug trace file showed:
WAIT #1: nam='instance state change' ela= 986275 p1=1 p2=0 p3=170
WAIT #1: nam='instance state change' ela= 986281 p1=1 p2=0 p3=171
WAIT #1: nam='instance state change' ela= 986287 p1=1 p2=0 p3=172
WAIT #1: nam='instance state change' ela= 986275 p1=1 p2=0 p3=173
……
9.shutdown abort rac3
10. On rac1, checkpoint 3 times, switch logfile 3 times, stop the job queue processes, shutdown immediate.
11. It still would not come down; the oradebug trace file showed:
WAIT #0: nam='opishd' ela= 986314 p1=0 p2=0 p3=0
WAIT #0: nam='opishd' ela= 986146 p1=0 p2=0 p3=0
WAIT #0: nam='opishd' ela= 986281 p1=0 p2=0 p3=0
WAIT #0: nam='opishd' ela= 986285 p1=0 p2=0 p3=0
WAIT #0: nam='opishd' ela= 986309 p1=0 p2=0 p3=0
WAIT #0: nam='opishd' ela= 986193 p1=0 p2=0 p3=0
WAIT #0: nam='opishd' ela= 986278 p1=0 p2=0 p3=0
WAIT #0: nam='opishd' ela= 986271 p1=0 p2=0 p3=0
WAIT #0: nam='opishd' ela= 986277 p1=0 p2=0 p3=0
WAIT #0: nam='opishd' ela= 986306 p1=0 p2=0 p3=0
WAIT #0: nam='opishd' ela= 986210 p1=0 p2=0 p3=0
WAIT #0: nam='opishd' ela= 986268 p1=0 p2=0 p3=0
……
12.shutdown abort rac1.
13. Restart rac1: it started normally.
14. Restart rac2: it started normally.
15. Restart rac3: it started normally.
16. Check that the database shows no anomalies, then restore the job queue processes and the listeners.
17. Start the applications and verify the business functions.
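The preparation named in steps 2, 7 and 10 amounts to the following on each instance (a sketch; the fallback to abort is what actually had to be used here):

SQL> alter system checkpoint;          -- repeated three times
SQL> alter system switch logfile;      -- repeated three times
SQL> alter system set job_queue_processes=0;
SQL> shutdown immediate
-- and, when the immediate shutdown hung for too long:
SQL> shutdown abort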
With that, the problem was solved.
【Lessons learned】:
There are quite a few lessons to be drawn from this incident:
1. Do not run the truncate of a large table from an interactive client session; write it as a script and run it in the background on the server.
2. If the truncate is interrupted, do not kill the session lightly.
3. Before truncating a large table, use dbms_metadata.get_ddl to capture the index creation statements, drop the indexes first, truncate the table, and then recreate the indexes from the saved DDL (a sketch follows).
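A sketch of lesson 3 for this particular table. The DBMS_METADATA call and dictionary views are standard; the spool file name is just an example, and the assumption that PK_ALERTBUFF backs the primary key constraint (so the constraint, not just the index, has to be dropped and re-added) is mine:

SQL> set long 100000 pagesize 0 linesize 200
SQL> spool alertbuff_idx_ddl.sql
SQL> select dbms_metadata.get_ddl('INDEX', index_name, 'MISC')
  2    from dba_indexes
  3   where table_owner = 'MISC' and table_name = 'ALERTBUFF';
SQL> spool off

SQL> drop index misc.idx_alertbuff_alerttime;
SQL> alter table misc.alertbuff drop constraint pk_alertbuff;   -- assumes PK_ALERTBUFF backs the PK
SQL> truncate table misc.alertbuff reuse storage;
SQL> @alertbuff_idx_ddl.sql     -- recreate the indexes, then re-add the PK constraint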
Appendix: the SR discussion. As an aside, the skill, sense of responsibility, and risk control of the MetaLink engineer... sigh.
SR Number          7251740.992
Open Date          30-OCT-08 01:54:01
Support Identifier 14827454
Name               He JianMin
Severity           2
Last Update        04-NOV-08 07:37:03
Product            Oracle Server - Enterprise Edition
Product Version    9.2.0.6.0
Platform           HP-UX PA-RISC (64-bit)
Detailed Status    Hard Close
SR Reference       n/a
BUG Reference      n/a
Abstract           TRUNCATE TABLE HANG,SESSION WAIT IS DFS LOCK HANDLE,PMON CAN'T CLEAN DEAD PROCES

Resolution History

30-OCT-08 01:54:01 GMT

### PROBLEM STATEMENT ###
we try to truncate a very large table,

SQL> select sum(bytes)/1024/1024/1024 size_g from user_segments where segment_name='ALERTBUFF';

    SIZE_G
----------
  329.8125

  1* select tablespace_name,PARTITIONED from user_tables where table_name='ALERTBUFF'
SQL> /

TABLESPACE_NAME                PAR
------------------------------ ---
DATA_DYNAMIC                   NO

SQL> select EXTENT_MANAGEMENT,SEGMENT_SPACE_MANAGEMENT from user_tablespaces where tablespace_name='DATA_DYNAMIC';

EXTENT_MAN SEGMEN
---------- ------
LOCAL      MANUAL

SQL> select index_name,tablespace_name,PARTITIONED from user_indexes where table_name='ALERTBUFF';

INDEX_NAME                     TABLESPACE_NAME                PAR
------------------------------ ------------------------------ ---
IDX_ALERTBUFF_ALERTTIME        DATA_DYNAMIC                   NO
PK_ALERTBUFF                   DATA_DYNAMIC                   NO

we execute "truncate table xxx reuse storage",when it was executing,the network broke.
And we login database again,we found the session still active.
we try to found this sid,serial#,spid.we killed this session and process in OS and wanted to execute that truncate statement again.
we check this status is killed but in dba_dml_locks it still lock:

SQL> select * from dba_dml_locks;

SESSION_ID OWNER      NAME       MODE_HELD     MODE_REQUESTE LAST_CONVERT BLOCKING_OTHERS
---------- ---------- ---------- ------------- ------------- ------------ ---------------
        64 MISC       ALERTBUFF  Exclusive     None                  1643 Global
        64 SYS        MLOG$      Row-X (SX)    None                  1643 Global

SQL> l
  1* select * from v$session_wait where sid=64
SQL> /

 SID  SEQ# EVENT           P1TEXT    P1         P1RAW            P2TEXT  P2 P2RAW            P3TEXT  P3 P3RAW            WAIT_TIME SECONDS_IN_WAIT STATE
---- ----- --------------- --------- ---------- ---------------- ------ --- ---------------- ------ --- ---------------- --------- --------------- -------
  64  3540 DFS lock handle type|mode 1128857605 0000000043490005 id1     18 0000000000000012 id2      2 0000000000000002         0           39340 WAITING

SQL>

and in alertlog,we found that:
"PMON failed to delete process, see PMON trace file"

oracle@gd_rac01:/oracle/app/oracle/admin/gdrac/bdump$ more gdrac1_pmon_7976.trc
Dump file /oracle/app/oracle/admin/gdrac/bdump/gdrac1_pmon_7976.trc
Oracle9i Enterprise Edition Release 9.2.0.6.0 - 64bit Production
With the Partitioning and Real Application Clusters options
JServer Release 9.2.0.6.0 - Production
ORACLE_HOME = /oracle/app/oracle/product/9.2.0
System name: HP-UX
Node name: gd_rac01
Release: B.11.11
Version: U
Machine: 9000/800
Instance name: gdrac1
Redo thread mounted by this instance: 1
Oracle process number: 2
Unix process pid: 7976, image: oracle@gd_rac01 (PMON)

*** 2008-10-29 22:55:45.399
*** SESSION ID:(1.1) 2008-10-29 22:55:45.380
ksb_wait_for_cic_completion: ksiget(KSBCIQUE) returned 33
ksb_wait_for_cic_completion: ksiget(KSBCIPAR) returned 33
...

we can not do any thing on table ALERTBUFF,even select can cause library cache lock.
we want to know how to solve the problem except restart database.
pls help us as soon as possible.
Text continued in next action...

30-OCT-08 01:54:03 GMT

Text continued from previous action...
### GENERAL SYSTEM/DATABASE INFORMATION ###
gdrac1 production
### RECENT CHANGES ###
no change
### WORKAROUNDS USED ###
no workround
### BUSINESS IMPACT ###
we can not do anything on the table ALERTBUFF!
### NUMBER OF INSTANCES/NODES ###
3
### STEPS TO REPRODUCE ###
can not reproduce
### O/S DETAILS ###
HP-UX gd_rac01 B.11.11 U 9000/800 2468376034 unlimited-user license
### CLUSTERWARE ###
Veritas Storage Foundation for Oracle RAC
### CLUSTERWARE DETAILS ###
unknown
### DATABASE VERSION/PATCHSET ###
not allowed
### CLUSTER INTERCONNECT ###
### CLUSTER INTERCONNECT DETAILS ###
### OCR, VOTING/QUORUM DISK LOCATION ###
### OCR, VOTING/QUORUM DISK LOCATION DETAILS ###
### DATABASE FILE LOCATION ###
### DATABASE FILE LOCATION DETAILS ###
### DIAGNOSTIC DATA COLLECTED? ###
### DIAGNOSTIC DATA COLLECTED ###
Can you easily recover from, bypass or work around the problem? No
Does your system or application continue normally after the problem occurs? No
Are the standard features of the system or application still available; is the loss of service minor? No
Contact me via : MetaLink

30-OCT-08 02:00:03 GMT
New info : jimmyhe1981@gmail.com
all that operation is execute on gdrac1,not no gdrac2 or gdrac3.

30-OCT-08 02:01:06 GMT
New info : jimmyhe1981@gmail.com
we check our veritas cluster version is 3.5

30-OCT-08 02:07:01 GMT
The customer : jimmyhe1981@gmail.com : has uploaded the following file via MetaLink:
E:\alert_gdrac1.log

30-OCT-08 02:08:02 GMT
The customer : jimmyhe1981@gmail.com : has uploaded the following file via MetaLink:
E:\gdrac1_pmon_7976.trc

30-OCT-08 02:09:02 GMT
New info : jimmyhe1981@gmail.com
some of the alertlog and pmon trace file has sent to you.

30-OCT-08 02:17:58 GMT
UPDATE
======
Hi,
Thank you for using MetaLink. We are currently reviewing/researching the situation and will update the Service Request (SR) as soon as we have relevant information.
Thank you for your patience.
Best Regards,
Koushal
Global Customer Services
STATUS
=======
@WIP -- Work In Progress

30-OCT-08 02:59:43 GMT
.
DATA COLLECTED
===============
<alert_gdrac1.log>
_____________________
Wed Oct 29 23:12:28 2008
PMON failed to delete process, see PMON trace file

<gdrac1_pmon_7976.trc>
_________________________
*** SESSION ID:(1.1) 2008-10-29 22:55:45.380
ksb_wait_for_cic_completion: ksiget(KSBCIQUE) returned 33
ksb_wait_for_cic_completion: ksiget(KSBCIPAR) returned 33
..
....same as above....
..
*** 2008-10-30 10:04:51.504
found process c0000004435e0a00 pid=28 serial=82 dead
ksb_wait_for_cic_completion: ksiget(KSBCIQUE) returned 33
ksb_wait_for_cic_completion: ksiget(KSBCIQUE) returned 33
ksb_wait_for_cic_completion: ksiget(KSBCIQUE) returned 33
.
STATUS
=======
@WIP -- Work In Progress

30-OCT-08 03:18:05 GMT
.
RESEARCH
=========
Found mutliple bugs for 9i: Bug.4595726 & Bug.5392244/5368853/4207513/5732292
.
UPDATE
=======
Hi,
From the internal bugs that I have referred for this errors, I have learnt there is no workaround.
You must try restarting the instances to see if this helps fix it.
You can apply patch for Base Bug.5732292 to avoid this problem in future. (Available on top of 9.2.0.7 as one-off fix)
Thanks,
Koushal

30-OCT-08 03:18:12 GMT
Email Update button has been pressed: Sending email to jimmyhe1981@gmail.com.

30-OCT-08 05:09:03 GMT
New info : jimmyhe1981@gmail.com
Thanks for you help.
You must try restarting the instances to see if this helps fix it.
----do you mean that restart one instance or three instance? why do you suggest this?
And if restart three instance,should i stop all of them and start all of them ? or stop one and start one ,then stop another and start another? why do you suggest this?
And how do you guarantee the db will shutdown or startup normally?because we found there is a transaction in v$transaction,and it does not change any data since this accident happen.if there still transaction in v$transaction,can it be exec "shutdown immediate" without hang?
This is a serious product system,we should make sure every operation correct.

30-OCT-08 05:17:03 GMT
.
UPDATE
=======
You can first try shutting down 2&3 instances out of 3 to avoid CI calls (CI-cross instance) that seem to be failing here.
If this too doesnot fix the problem then performing shutdown abort of last instance 1 and then startup each instance one-by-one is what I would suggest.
There is no workaround mentioned in any internal bugs update so we cannot guarantee which is going to work.
9i is no longer a supported relase so I cannot involve Oracle Development for this issue.
Thnx.

30-OCT-08 05:17:05 GMT
New info : jimmyhe1981@gmail.com
i found the bug information: Bug 5732292.cannot be displayed. Possible reasons are:...
can you paste this note to me?

30-OCT-08 05:17:09 GMT
Email Update button has been pressed: Sending email to jimmyhe1981@gmail.com.

30-OCT-08 06:29:02 GMT
New info : jimmyhe1981@gmail.com
You can first try shutting down 2&3 instances out of 3 to avoid CI calls (CI-cross instance) that seem to be failing here.
----how do you found that CIC is on rac3?
and we found new situation:
on rac1:

12:43:56 SQL> select * from gv$lock where block=1;

   INST_ID ADDR             KADDR                   SID TY        ID1        ID2      LMODE    REQUEST      CTIME      BLOCK
---------- ---------------- ---------------- ---------- -- ---------- ---------- ---------- ---------- ---------- ----------
         1 C0000004439C2CB0 C0000004439C2CD0         95 JI    1529349          0          6          0      27846          1
         3 C0000004439C27D0 C0000004439C27F0        577 RO          1          0          6          0       4008          1

2 rows selected.

then on rac3:

SQL> select instance_name from v$instance;

INSTANCE_NAME
----------------
gdrac3

SQL> select event from v$session_wait where sid=577;

EVENT
----------------------------------------------------------------
DFS lock handle

SQL> select sql_text from v$session a ,v$sqlarea b
  2  where a.sql_address=b.address and a.sid=577;

SQL_TEXT
--------------------------------------------------------------------------------
truncate table Service_exclusive

this is new table we found can not truncate.Does this has any relationship?

30-OCT-08 06:34:02 GMT
New info : jimmyhe1981@gmail.com
If this too doesnot fix the problem then performing shutdown abort of last instance 1
----how i check whether it work?if i shutdown rac2 and rac3,you mean just waiting?how long should i wait to take the next step or what can i do to check it has already been recovered.

30-OCT-08 07:21:02 GMT
New info : jimmyhe1981@gmail.com
are you sure the instance you suggest shutdown first is rac2 and rac3,let the rac1 runing?
this problem happen on rac1.why not we shutdown rac1 first?
And if we shutdown all the instance,which instance you suggest startup first? rac3->rac2->rac1 is that ok?

30-OCT-08 07:30:33 GMT
.
UPDATE
=======
shutdown rac2 & rac3, let rac1 be kept running, wait for a while to check if the dropped table ALERTBUFF is stil listed in dba_ views, see if you can drop it again or perform a select on same table to verfiy if there are any new issues.
shutdown rac1 too if above doesnot fix it.
You can startup instances in any order this doesnot matter, just make sure to start one after the other (after open complete) and not all together.
Thnx.

30-OCT-08 07:30:44 GMT
Email Update button has been pressed: Sending email to jimmyhe1981@gmail.com.

30-OCT-08 08:37:16 GMT
New info : jimmyhe1981@gmail.com
when rac 1 is online, at that time ,should we set FAST_START_PARALLEL_ROLLBACK=high?
and we want to trace smon whether it work well,how do we trace this?

30-OCT-08 08:43:00 GMT
.
UPDATE
=======
-- Here is the information on parameter : FAST_START_PARALLEL_ROLLBACK
Syntax: FAST_START_PARALLEL_ROLLBACK = {HI | LO | FALSE}
Default value: LOW
Parameter class: Dynamic: ALTER SYSTEM
FAST_START_PARALLEL_ROLLBACK determines the maximum number of processes that can exist for performing parallel rollback. This parameter is useful on systems in which some or all of the transactions are long running.
Values:
FALSE indicates that parallel rollback is disabled
LOW limits the number of rollback processes to 2 * CPU_COUNT
HIGH limits the number of rollback processes to 4 * CPU_COUNT
-- I am not sure why you want to trace SMON. Your last update is not clear, please restate problem in detail.
Thnx.

30-OCT-08 08:43:06 GMT
Email Update button has been pressed: Sending email to jimmyhe1981@gmail.com.

30-OCT-08 10:00:03 GMT
New info : jimmyhe1981@gmail.com
THANKS.
I WANT TO KNOW IS THIS PARAMETER HELPFUL TO OUR RECOVERY? DO YOU SUGGEST TO SET IT TO HIGH?
BECAUSE OF SMON NOW IS NOT WORKING WELL TO CLEAN DEAD TRANASCTION,DURING THE RECOVER TIME,HOW CAN I KNOW IT WORK WELL?CAN I TRACE WHAT SMON IS DOING?

30-OCT-08 10:15:17 GMT
.
UPDATE
=======
You can leave this parameter to default (i.e. LOW) If need be you can make this parameter value HIGH.
Monitoring parallel rollback can be accomplished using the V$FAST_START_SERVERS and V$FAST_START_TRANSACTIONS views.
What makes you feel SMON is not working well ? i am not sure why you want to trace SMON, this will make things worse and not recommended by us unless we require it to do so.
Thnx.

30-OCT-08 10:15:31 GMT
Email Update button has been pressed: Sending email to jimmyhe1981@gmail.com.

03-NOV-08 01:53:02 GMT
New info : jimmyhe1981@gmail.com
on 31,Oct we try to restart rac2 and rac3 to see if it can recover.if not ,then we take the next step,restart 3 instance;
【plan on 31,Oct】:
1.stop some app.
2.stop lsnr,set job process=0,kill LOCAL=NO,see if any transaction in v$transaction, shutdown rac2
3.stop lsnr,set job process=0,kill LOCAL=NO,see if any transaction in v$transaction,shutdown rac3
4.keep rac1 online,see if the table ALERTBUFF can be desc,if there is no DFS LOCK HANDLE event.
5.whatever it because normal,after 3 hours ,startup rac 2 and rac3,restart app.
【situation on 31,OCT】:
we shutdown rac2,and see alterlog:
ALTER DATABASE CLOSE NORMAL
Sat Nov 1 00:28:11 2008
Waiting for smon to disable tx recovery.
Sat Nov 1 01:17:08 2008
Thread 2 advanced to log sequence 46765
Current log# 9 seq# 46765 mem# 0: /ora_redo/redo91
Current log# 9 seq# 46765 mem# 1: /ora_redo/redo92
Sat Nov 1 01:17:08 2008
ARC1: Evaluating archive log 10 thread 2 sequence 46764
ARC1: Beginning to archive log 10 thread 2 sequence 46764
Creating archive destination LOG_ARCHIVE_DEST_1: '/arch/gdrac_2_46764.arc'
ARC1: Completed archiving log 10 thread 2 sequence 46764

we see about an hour can not shutdown.
so we try shutdown abort rac2.it work.and we startup,check db can startup normally.we try shutdown immediate again. but it still hang.
so we have to shutdown abort rac2,and try to shutdowm rac as our plan.
shutdown immediate rac3,wait half an hour still hang.
shutdown abort rac3
see still a lot library cache pin,library cache lock,enqueue,2 DFS lock handle. and sid=64 that session there.
we still CAN NOT desc alertbuff,desc portal.Service_exclusive
wait about half an hour,startup rac2 and rac3.
so it seem that we can not solve the problem by just shutdown 2 instance.
we try to shutdown 3 instance tonight.
【question】:
we will shutdown 3 instance tonight,as we try to stop 2 instance,we can shutdown them normally,we have to shutdown them by shutdown abort.
my question is that if we shutdown 3 instance tonoght,it seem that we have to use shutdown abort on 3 instance.
is that any risk on this operation? how can we do to avoid that risk?
how should we guarantee that it can restart db saft and solve the problem ?

03-NOV-08 01:55:03 GMT
New info : jimmyhe1981@gmail.com
please give me a step by setp plan what i should do tonight.thanks.

03-NOV-08 03:35:04 GMT
New info : jimmyhe1981@gmail.com
pls check my question,thanks.

03-NOV-08 07:35:19 GMT
New info : jimmyhe1981@gmail.com
pls check my question as soon as possible,many thanks.

04-NOV-08 02:36:08 GMT
.
UPDATE
=======
From all the tests you have donw have you not tried shutting down all the RAC instance at once and then restarting it ?
What are the symptoms when you say tables cannot be described ?
There should not be any problem with uisng abort especially when immediate hangs or takes longer time.
Just make sure you do SQL> alter system checkpoint ; before doing a abort and this should be OK to use.

04-NOV-08 02:36:16 GMT
Email Update button has been pressed: Sending email to jimmyhe1981@gmail.com.

04-NOV-08 06:34:04 GMT
New info : jimmyhe1981@gmail.com
your answer is too late!
i update the tar three time yesterday to ask for answer,but you do nothing just let me waiting!
the reason we shutdown 2 instance on 31,OCT is that,considering the influence of shut down database,we have to try 2 instance first.if we shutdown 2 instance,it just business partial abnomal,but still can work.if we shutdown all the instance,it is business interrupt,which is more serious than abnormal.to avoid the bigger risk is very common in large system,as an oracle engineer,can't you understand this ?
and as you said "You can first try shutting down 2&3 instances out of 3 to avoid CI calls (CI-cross instance) that seem to be failing here." so we try to shutdown 2 instance first to see if it recover,if not recover,then we take the next step,restart 3 instance.
can not desc table mean that,when i DECS ALETRBUFF or DESC PORTAL.Service_exclusive,it was hang.waiting for library cache lock.in my case,can not truncate table usually go with library cache lock.
you don't understand this situation and did not ask me to collect information to confirm,did not help me to analyze what is holder of the lock,just told me to restart my database at the begining of this case?
you said shutdown abort will be safe,but i met a situation that i stop lnsr,kill all LOCAL=NO process,checkpint 3 time, switch logfile 3 time,set job_queue_processes=0,set aq_tm_processes=0,check v$transaction,then shutdown immeidate,wait half an hour,still not down,so i shutdown abort.when i startup,there is a lot of alert like "SMON: about to recover undo segment 60",i have to alter session set "_smu_debug_mode"=4 and alter rollback segment "_SYSSMU60$" online.shutdown abort is not safe as you think.
when your customer need you response,when your customer need you professional analysis,when your customer need you professional advice,what are you doing?are you serious to my tar? are you suitable be an oracle engineer?
i called a on-site oracle engineer to solve the problem yesterday night.
i am not satisfied your service on this tar.

04-NOV-08 07:13:50 GMT
.
UPDATE
=======
Hi,
Sorry for replying late, I was not in office yesterday to read your replies.
Incase of urgent issues please make sure to call Oracle support numbers for immediate assistance from available analyst on shift.
The issues you have faced for shutdown immediate, etc are different from original problem.
The suggestion to restart instances was given after researching on this case and no easy way available to fix it.
I hope everyting is all right now and we are good to close this case ?
Please reply as appropiate.
Thnx,
Koushal

04-NOV-08 07:13:58 GMT
Email Update button has been pressed: Sending email to jimmyhe1981@gmail.com.

04-NOV-08 07:29:03 GMT
New info : victor.xu@oracle.com
==============
ACS Update
==============
Dear Support,
The whole RAC instances have been restarted last late night, so far the "library cache pin/lock" has gone.
The affected table "MISC.ALERTBUFF" has been sucessfully truncated.
After discussed with customer DBA ZhaoLeHuan, he agreed that we can close this SR now. Thanks!
Regards,
Victor Xu
OSS China ACS

04-NOV-08 07:31:46 GMT
.
UPDATE
=======
Hi Victor,
Thanks for update.
I am inactivating this Service Request, as I have received confirmation from you that this issue is now resolved.
Best Regards,
Koushal
Global Customer Services
STATUS
=======
Inactivated -- Issue Resolved

04-NOV-08 07:31:56 GMT
Email Update button has been pressed: Sending email to victor.xu@oracle.com.

04-NOV-08 07:37:04 GMT
Closed by customer : victor.xu@oracle.com
6 comments
A good summary, learned a lot. Thanks, 小荷.
The first item in the lessons is the key one!
A good summary! Thanks!!
May I ask how you did this:
11. It still would not come down; the oradebug trace file showed:
WAIT #0: nam='opishd' ela= 986314 p1=0 p2=0 p3=0
WAIT #0: nam='opishd' ela= 986146 p1=0 p2=0 p3=0
WAIT #0: nam='opishd' ela= 986281 p1=0 p2=0 p3=0
I've also run into cases where an instance takes a long time to come down. I usually check whether archive logs are being generated; if they are, a large transaction is probably rolling back. But there are cases where no archiving happens, and in that situation how do you find out the cause?
Sometimes I connect one session first to watch the status and issue the shutdown from another, but sometimes I forget. On the whole I don't think my approach is good.
Could you share your method?
re hihiw: before typing the shutdown command, check your own spid, then use oradebug to start a 10046 trace, then issue the shutdown command and go read the trace file under udump.
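For reference, that sequence looks roughly like this (a sketch; the exact trace file name and the udump path depend on user_dump_dest):

SQL> select p.spid
  2    from v$process p, v$session s
  3   where p.addr = s.paddr
  4     and s.sid = (select sid from v$mystat where rownum = 1);

SQL> oradebug setmypid
SQL> oradebug event 10046 trace name context forever, level 12
SQL> shutdown immediate

-- then read the trace file named after that spid under udump (user_dump_dest)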
I see, so you are still looking at your own session. My earlier thinking was that some background process must be the reason the instance would not come down. My understanding was that once the shutdown command is issued, the background processes do the various pieces of work while the issuing process just waits, and only when everything is finished is the coordination information returned to the issuing process, a bit like how a commit actually waits for lgwr to finish.
So tracing my own process seemed to be of little value. (That was my understanding.)
Or is my understanding wrong?
I'll run an experiment and see.
Thanks
After a rough look, my understanding now is that after the shutdown command is issued the issuing process itself has to perform many steps, so monitoring those steps in detail lets you infer the rough cause. My earlier understanding was too simplistic: I assumed the main steps were carried out by pmon with the other processes only assisting.
I took too much for granted! That's why it never occurred to me to trace the process that issued the command.
Thanks!