Environment:
Oracle 9.2.0.6, three-node RAC, on HP-UX 11.11.
【Oct 29: the failure occurs】
During the truncate of a large table (all of the operations below were performed on node 1, gdrac1):
SQL> select sum(bytes)/1024/1024/1024 size_g from user_segments where segment_name='ALERTBUFF';

    SIZE_G
----------
  329.8125

  1* select tablespace_name,PARTITIONED from user_tables where table_name='ALERTBUFF'
SQL> /

TABLESPACE_NAME                PAR
------------------------------ ---
DATA_DYNAMIC                   NO

SQL> select EXTENT_MANAGEMENT,SEGMENT_SPACE_MANAGEMENT from user_tablespaces where tablespace_name='DATA_DYNAMIC';

EXTENT_MAN SEGMEN
---------- ------
LOCAL      MANUAL

SQL> select index_name,tablespace_name,PARTITIONED from user_indexes where table_name='ALERTBUFF';

INDEX_NAME                     TABLESPACE_NAME                PAR
------------------------------ ------------------------------ ---
IDX_ALERTBUFF_ALERTTIME        DATA_DYNAMIC                   NO
PK_ALERTBUFF                   DATA_DYNAMIC                   NO
We ran truncate table xxx reuse storage, intending to keep the allocated extents at first and then release them step by step after the truncate.
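The intended approach was roughly the following; this is only a sketch, and the KEEP sizes are illustrative values rather than the ones actually planned:

SQL> truncate table misc.alertbuff reuse storage;                 -- keep the allocated extents
SQL> alter table misc.alertbuff deallocate unused keep 204800m;   -- then shrink in steps
SQL> alter table misc.alertbuff deallocate unused keep 102400m;
SQL> alter table misc.alertbuff deallocate unused keep 0;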
But while truncate table ALERTBUFF reuse storage was running, the network connection dropped. When we logged back into the database the session was still there, so we looked up its sid, serial# and spid, killed the session and the OS process, and planned to rerun the truncate.
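For reference, locating and killing the session would have been done with commands of this kind (a reconstruction, not a transcript of what was actually typed; the serial# is left as a substitution variable rather than invented):

SQL> select s.sid, s.serial#, p.spid
  2    from v$session s, v$process p
  3   where s.paddr = p.addr
  4     and s.sid = 64;

SQL> alter system kill session '64,&serial';

-- then, at the OS level on gdrac1: kill -9 <spid>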
A check showed that the session's status had changed to KILLED, but its locks were still present in dba_dml_locks:
SQL> select * from dba_dml_locks;

SESSION_ID OWNER      NAME       MODE_HELD     MODE_REQUESTE LAST_CONVERT BLOCKING_OTHERS
---------- ---------- ---------- ------------- ------------- ------------ ---------------
        64 MISC       ALERTBUFF  Exclusive     None                  1643 Global
        64 SYS        MLOG$      Row-X (SX)    None                  1643 Global

SQL> l
  1* select * from v$session_wait where sid=64
SQL> /

 SID  SEQ# EVENT           P1TEXT    P1         P1RAW            P2TEXT  P2 P2RAW            P3TEXT  P3 P3RAW            WAIT_TIME SECONDS_IN_WAIT STATE
---- ----- --------------- --------- ---------- ---------------- ------ --- ---------------- ------ --- ---------------- --------- --------------- -------
  64  3540 DFS lock handle type|mode 1128857605 0000000043490005 id1     18 0000000000000012 id2      2 0000000000000002         0           39340 WAITING

SQL>
The alert log contained this warning:
……
Wed Oct 29 23:12:28 2008
PMON failed to delete process, see PMON trace file
Wed Oct 29 23:14:40 2008
Thread 1 advanced to log sequence 53672
Current log# 1 seq# 53672 mem# 0: /ora_redo/redo11
Current log# 1 seq# 53672 mem# 1: /ora_redo/redo12
Wed Oct 29 23:14:40 2008
For the related PMON trace file, click here: gdrac1_pmon_7976.
At that point no operation on this table would complete; even a select ended up waiting on library cache lock.
Many helpful experts on ITPUB offered suggestions (click here to view the thread), such as locating the dead process reported in the PMON trace and killing it at the OS level, or renaming the table, and so on. But the process in question had already been killed on the night of the incident.
【Oct 30: the failure escalates】
As time went on we found that, besides ALERTBUFF, other tables were also being affected:
12:43:56 SQL> select * from gv$lock where block=1;

   INST_ID ADDR             KADDR                   SID TY        ID1        ID2      LMODE    REQUEST      CTIME      BLOCK
---------- ---------------- ---------------- ---------- -- ---------- ---------- ---------- ---------- ---------- ----------
         1 C0000004439C2CB0 C0000004439C2CD0         95 JI    1529349          0          6          0      27846          1
         3 C0000004439C27D0 C0000004439C27F0        577 RO          1          0          6          0       4008          1

2 rows selected.

Checking on instance 3:

SQL> select instance_name from v$instance;

INSTANCE_NAME
----------------
gdrac3

SQL> select event from v$session_wait where sid=577;

EVENT
----------------------------------------------------------------
DFS lock handle

SQL> select sql_text from v$session a ,v$sqlarea b
  2  where a.sql_address=b.address and a.sid=577;

SQL_TEXT
--------------------------------------------------------------------------------
truncate table Service_exclusive
A desc of Service_exclusive hung indefinitely.
A select against this table likewise waited on library cache lock:
SQL> select sid from v$mystat group by sid;

       SID
----------
       945

  1* select * from portal.Service_exclusive where rownum<=10
SQL> /
(the SQL hung at this point)

From another session:

SQL> select event from gv$session_wait where inst_id=3 and sid=945;

EVENT
----------------------------------------------------------------
library cache lock
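On a 9i instance, one way to see which session actually holds a library cache lock is to query x$kgllk as SYS on the node where the waiter sits. This is a generic diagnostic sketch (the x$ column names are as I recall them for 9.2, so treat them as an assumption), not a query taken from the incident:

-- run as SYS on the instance where the waiting session is; x$kgllk is per-instance
SQL> select s.sid, s.serial#, l.kglnaobj object_name,
  2         l.kgllkmod mode_held, l.kgllkreq mode_requested
  3    from x$kgllk l, v$session s
  4   where l.kgllkses = s.saddr
  5     and l.kglnaobj like 'SERVICE_EXCLUSIVE%';
-- rows with kgllkmod > 1 hold the lock; rows with kgllkreq > 0 are waiting for it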
At the same time we opened an SR on MetaLink. Oracle said it was a bug (5732292) with no workaround; the only fix was to restart the database ("You must try restarting the instances to see if this helps fix it.").
Following the SR's advice, we planned to restart rac2 and rac3 to see whether that cleared the problem; if not, we would find another window to restart all nodes:
【Nov 1: restart rac2 and rac3】
I. The plan:
Stopping two nodes and leaving one running would not interrupt the business (only degrade it); stopping all three nodes would affect the entire business (a full outage).
So we would try the lower-impact recovery first. In the SR Oracle also said to stop rac2 and rac3 and check whether things recovered, and if not, to shutdown abort rac1 as well. From an impact point of view we would therefore restart two nodes first and see whether that solved it; if not, we would schedule a restart of all three nodes.
To minimize the impact on the business, we would try to recover without bringing the whole database down, operating in the small hours when traffic was low:
1. Stop some of the applications that are not single points of failure.
2. Stop the listener, set job_queue_processes to 0, kill the LOCAL=NO processes, shutdown rac2 (see the sketch after this list).
3. Stop the listener, set job_queue_processes to 0, kill the LOCAL=NO processes, shutdown rac3.
4. Keep rac1 online and watch v$transaction and v$session_wait on rac1; if the smon timer wait appears, things are back to normal.
5. Whether or not it recovers, start the database again after 3 hours and bring the business systems back.
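For steps 2 and 3, the per-instance commands would be along these lines. This is only a sketch: the listener name is illustrative, and generating the OS kill commands for the dedicated (LOCAL=NO) server processes from v$process/v$session is just one way to do what was actually done with ps at the OS level:

$ lsnrctl stop listener_gdrac2          -- listener name is illustrative

SQL> alter system set job_queue_processes=0;

SQL> -- list kill commands for the remaining user (dedicated server) processes
SQL> select 'kill -9 '||p.spid
  2    from v$process p, v$session s
  3   where s.paddr = p.addr
  4     and s.type = 'USER'
  5     and s.username is not null
  6     and s.sid <> (select sid from v$mystat where rownum = 1);

SQL> shutdown immediate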
II. Execution:
When we shut down rac2, nothing happened for nearly an hour. In the alert log:
ALTER DATABASE CLOSE NORMAL
Sat Nov 1 00:28:11 2008
Waiting for smon to disable tx recovery.
Sat Nov 1 01:17:08 2008
Thread 2 advanced to log sequence 46765
Current log# 9 seq# 46765 mem# 0: /ora_redo/redo91
Current log# 9 seq# 46765 mem# 1: /ora_redo/redo92
Sat Nov 1 01:17:08 2008
ARC1: Evaluating archive log 10 thread 2 sequence 46764
ARC1: Beginning to archive log 10 thread 2 sequence 46764
Creating archive destination LOG_ARCHIVE_DEST_1: '/arch/gdrac_2_46764.arc'
ARC1: Completed archiving log 10 thread 2 sequence 46764
We had no choice but to shutdown abort.
We started the instance again, it came up normally, and we tried shutdown immediate once more.
Still nothing.
Having seen that the instance came back up normally after a shutdown abort, we decided to shutdown abort rac2 and carry on stopping rac3 as planned.
shutdown abort rac2
shutdown immediate rac3: again no response. We waited well over half an hour.
shutdown abort rac3
Watching the transactions and waits on rac1, the hung session 64 was still there, and the database still showed large numbers of library cache pin, library cache lock, and enqueue waits.
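The waits were summarized with a query of this sort (a reconstruction of the kind of check used, not the exact statement from that night):

SQL> select inst_id, event, count(*)
  2    from gv$session_wait
  3   where event in ('library cache pin', 'library cache lock',
  4                   'enqueue', 'DFS lock handle')
  5   group by inst_id, event
  6   order by inst_id, event;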
We restarted rac2 and its listener, then rac3 and its listener.
Business was resumed.
So there was no way to resolve the problem while keeping the business impact to a minimum; it looked like all nodes would have to be restarted.
III. Follow-up:
The plan at that point was to restart the entire database on Monday night (Nov 3).
【Early morning of Nov 4: restart all three nodes, i.e. the entire database】:
1. At 23:00, stop the rac2 listener.
2. Checkpoint 3 times, switch logfile 3 times, stop the job queue processes, shutdown immediate (a command sketch follows this list).
3. After 30 minutes the instance still had not come down; the oradebug trace file showed:
WAIT #1: nam='instance state change' ela= 986275 p1=1 p2=0 p3=170
WAIT #1: nam='instance state change' ela= 986281 p1=1 p2=0 p3=171
WAIT #1: nam='instance state change' ela= 986287 p1=1 p2=0 p3=172
WAIT #1: nam='instance state change' ela= 986275 p1=1 p2=0 p3=173
……
4.shutdown abort rac2
5. At 00:00, stop the applications.
6. Stop the rac3 listener and the rac1 listener.
7. On rac3, checkpoint 3 times, switch logfile 3 times, stop the job queue processes, shutdown immediate.
8. After 30 minutes it still had not come down; the oradebug trace file showed:
WAIT #1: nam='instance state change' ela= 986275 p1=1 p2=0 p3=170
WAIT #1: nam='instance state change' ela= 986281 p1=1 p2=0 p3=171
WAIT #1: nam='instance state change' ela= 986287 p1=1 p2=0 p3=172
WAIT #1: nam='instance state change' ela= 986275 p1=1 p2=0 p3=173
……
9.shutdown abort rac3
10. On rac1, checkpoint 3 times, switch logfile 3 times, stop the job queue processes, shutdown immediate.
11. It still would not come down; the oradebug trace file showed:
WAIT #0: nam='opishd' ela= 986314 p1=0 p2=0 p3=0
WAIT #0: nam='opishd' ela= 986146 p1=0 p2=0 p3=0
WAIT #0: nam='opishd' ela= 986281 p1=0 p2=0 p3=0
WAIT #0: nam='opishd' ela= 986285 p1=0 p2=0 p3=0
WAIT #0: nam='opishd' ela= 986309 p1=0 p2=0 p3=0
WAIT #0: nam='opishd' ela= 986193 p1=0 p2=0 p3=0
WAIT #0: nam='opishd' ela= 986278 p1=0 p2=0 p3=0
WAIT #0: nam='opishd' ela= 986271 p1=0 p2=0 p3=0
WAIT #0: nam='opishd' ela= 986277 p1=0 p2=0 p3=0
WAIT #0: nam='opishd' ela= 986306 p1=0 p2=0 p3=0
WAIT #0: nam='opishd' ela= 986210 p1=0 p2=0 p3=0
WAIT #0: nam='opishd' ela= 986268 p1=0 p2=0 p3=0
……
12.shutdown abort rac1.
13. Restart rac1: it started normally.
14. Restart rac2: it started normally.
15. Restart rac3: it started normally.
16. Check that the database shows no anomalies, then restore the job queue processes and the listeners.
17. Start the applications and verify the business functions.
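The preparation named in steps 2, 7 and 10 amounts to the following on each instance (a sketch; the fallback to abort is what actually had to be used here):

SQL> alter system checkpoint;          -- repeated three times
SQL> alter system switch logfile;      -- repeated three times
SQL> alter system set job_queue_processes=0;
SQL> shutdown immediate
-- and, when the immediate shutdown hung for too long:
SQL> shutdown abort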
With that, the problem was solved.
【Lessons learned】:
There are quite a few lessons to be drawn from this incident:
1. Do not run the truncate of a large table from an interactive client session; write it as a script and run it in the background on the server.
2. If the truncate is interrupted, do not kill the session lightly.
3. Before truncating a large table, use dbms_metadata.get_ddl to capture the index creation statements, drop the indexes first, truncate the table, and then recreate the indexes from the saved DDL (a sketch follows).
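A sketch of lesson 3 for this particular table. The DBMS_METADATA call and dictionary views are standard; the spool file name is just an example, and the assumption that PK_ALERTBUFF backs the primary key constraint (so the constraint, not just the index, has to be dropped and re-added) is mine:

SQL> set long 100000 pagesize 0 linesize 200
SQL> spool alertbuff_idx_ddl.sql
SQL> select dbms_metadata.get_ddl('INDEX', index_name, 'MISC')
  2    from dba_indexes
  3   where table_owner = 'MISC' and table_name = 'ALERTBUFF';
SQL> spool off

SQL> drop index misc.idx_alertbuff_alerttime;
SQL> alter table misc.alertbuff drop constraint pk_alertbuff;   -- assumes PK_ALERTBUFF backs the PK
SQL> truncate table misc.alertbuff reuse storage;
SQL> @alertbuff_idx_ddl.sql     -- recreate the indexes, then re-add the PK constraint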
Appendix: the SR discussion. As an aside, the skill, sense of responsibility, and risk control of the MetaLink engineer... sigh.
SR Number          7251740.992
Open Date          30-OCT-08 01:54:01
Support Identifier 14827454
Name               He JianMin
Severity           2
Last Update        04-NOV-08 07:37:03
Product            Oracle Server - Enterprise Edition
Product Version    9.2.0.6.0
Platform           HP-UX PA-RISC (64-bit)
Detailed Status    Hard Close
SR Reference       n/a
BUG Reference      n/a
Abstract           TRUNCATE TABLE HANG,SESSION WAIT IS DFS LOCK HANDLE,PMON CAN'T CLEAN DEAD PROCES

Resolution History

30-OCT-08 01:54:01 GMT

### PROBLEM STATEMENT ###
we try to truncate a very large table,

SQL> select sum(bytes)/1024/1024/1024 size_g from user_segments where segment_name='ALERTBUFF';

    SIZE_G
----------
  329.8125

  1* select tablespace_name,PARTITIONED from user_tables where table_name='ALERTBUFF'
SQL> /

TABLESPACE_NAME                PAR
------------------------------ ---
DATA_DYNAMIC                   NO

SQL> select EXTENT_MANAGEMENT,SEGMENT_SPACE_MANAGEMENT from user_tablespaces where tablespace_name='DATA_DYNAMIC';

EXTENT_MAN SEGMEN
---------- ------
LOCAL      MANUAL

SQL> select index_name,tablespace_name,PARTITIONED from user_indexes where table_name='ALERTBUFF';

INDEX_NAME                     TABLESPACE_NAME                PAR
------------------------------ ------------------------------ ---
IDX_ALERTBUFF_ALERTTIME        DATA_DYNAMIC                   NO
PK_ALERTBUFF                   DATA_DYNAMIC                   NO

we execute "truncate table xxx reuse storage",when it was executing,the network broke.
And we login database again,we found the session still active.
we try to found this sid,serial#,spid.we killed this session and process in OS and wanted to execute that truncate statement again.
we check this status is killed but in dba_dml_locks it still lock:

SQL> select * from dba_dml_locks;

SESSION_ID OWNER      NAME       MODE_HELD     MODE_REQUESTE LAST_CONVERT BLOCKING_OTHERS
---------- ---------- ---------- ------------- ------------- ------------ ---------------
        64 MISC       ALERTBUFF  Exclusive     None                  1643 Global
        64 SYS        MLOG$      Row-X (SX)    None                  1643 Global

SQL> l
  1* select * from v$session_wait where sid=64
SQL> /

 SID  SEQ# EVENT           P1TEXT    P1         P1RAW            P2TEXT  P2 P2RAW            P3TEXT  P3 P3RAW            WAIT_TIME SECONDS_IN_WAIT STATE
---- ----- --------------- --------- ---------- ---------------- ------ --- ---------------- ------ --- ---------------- --------- --------------- -------
  64  3540 DFS lock handle type|mode 1128857605 0000000043490005 id1     18 0000000000000012 id2      2 0000000000000002         0           39340 WAITING

SQL>

and in alertlog,we found that:
"PMON failed to delete process, see PMON trace file"

oracle@gd_rac01:/oracle/app/oracle/admin/gdrac/bdump$ more gdrac1_pmon_7976.trc
Dump file /oracle/app/oracle/admin/gdrac/bdump/gdrac1_pmon_7976.trc
Oracle9i Enterprise Edition Release 9.2.0.6.0 - 64bit Production
With the Partitioning and Real Application Clusters options
JServer Release 9.2.0.6.0 - Production
ORACLE_HOME = /oracle/app/oracle/product/9.2.0
System name: HP-UX
Node name: gd_rac01
Release: B.11.11
Version: U
Machine: 9000/800
Instance name: gdrac1
Redo thread mounted by this instance: 1
Oracle process number: 2
Unix process pid: 7976, image: oracle@gd_rac01 (PMON)

*** 2008-10-29 22:55:45.399
*** SESSION ID:(1.1) 2008-10-29 22:55:45.380
ksb_wait_for_cic_completion: ksiget(KSBCIQUE) returned 33
ksb_wait_for_cic_completion: ksiget(KSBCIPAR) returned 33
...

we can not do any thing on table ALERTBUFF,even select can cause library cache lock.
we want to know how to solve the problem except restart database.
pls help us as soon as possible.
Text continued in next action...

30-OCT-08 01:54:03 GMT

Text continued from previous action...
### GENERAL SYSTEM/DATABASE INFORMATION ###
gdrac1 production
### RECENT CHANGES ###
no change
### WORKAROUNDS USED ###
no workround
### BUSINESS IMPACT ###
we can not do anything on the table ALERTBUFF!
### NUMBER OF INSTANCES/NODES ###
3
### STEPS TO REPRODUCE ###
can not reproduce
### O/S DETAILS ###
HP-UX gd_rac01 B.11.11 U 9000/800 2468376034 unlimited-user license
### CLUSTERWARE ###
Veritas Storage Foundation for Oracle RAC
### CLUSTERWARE DETAILS ###
unknown
### DATABASE VERSION/PATCHSET ###
not allowed
### CLUSTER INTERCONNECT ###
### CLUSTER INTERCONNECT DETAILS ###
### OCR, VOTING/QUORUM DISK LOCATION ###
### OCR, VOTING/QUORUM DISK LOCATION DETAILS ###
### DATABASE FILE LOCATION ###
### DATABASE FILE LOCATION DETAILS ###
### DIAGNOSTIC DATA COLLECTED? ###
### DIAGNOSTIC DATA COLLECTED ###
Can you easily recover from, bypass or work around the problem? No
Does your system or application continue normally after the problem occurs? No
Are the standard features of the system or application still available; is the loss of service minor? No
Contact me via : MetaLink

30-OCT-08 02:00:03 GMT
New info : jimmyhe1981@gmail.com
all that operation is execute on gdrac1,not no gdrac2 or gdrac3.

30-OCT-08 02:01:06 GMT
New info : jimmyhe1981@gmail.com
we check our veritas cluster version is 3.5

30-OCT-08 02:07:01 GMT
The customer : jimmyhe1981@gmail.com : has uploaded the following file via MetaLink:
E:\alert_gdrac1.log

30-OCT-08 02:08:02 GMT
The customer : jimmyhe1981@gmail.com : has uploaded the following file via MetaLink:
E:\gdrac1_pmon_7976.trc

30-OCT-08 02:09:02 GMT
New info : jimmyhe1981@gmail.com
some of the alertlog and pmon trace file has sent to you.

30-OCT-08 02:17:58 GMT
UPDATE
======
Hi,
Thank you for using MetaLink. We are currently reviewing/researching the situation and will update the Service Request (SR) as soon as we have relevant information.
Thank you for your patience.
Best Regards,
Koushal
Global Customer Services
STATUS
=======
@WIP -- Work In Progress

30-OCT-08 02:59:43 GMT
.
DATA COLLECTED
===============
<alert_gdrac1.log>
_____________________
Wed Oct 29 23:12:28 2008
PMON failed to delete process, see PMON trace file

<gdrac1_pmon_7976.trc>
_________________________
*** SESSION ID:(1.1) 2008-10-29 22:55:45.380
ksb_wait_for_cic_completion: ksiget(KSBCIQUE) returned 33
ksb_wait_for_cic_completion: ksiget(KSBCIPAR) returned 33
..
....same as above....
..
*** 2008-10-30 10:04:51.504
found process c0000004435e0a00 pid=28 serial=82 dead
ksb_wait_for_cic_completion: ksiget(KSBCIQUE) returned 33
ksb_wait_for_cic_completion: ksiget(KSBCIQUE) returned 33
ksb_wait_for_cic_completion: ksiget(KSBCIQUE) returned 33
.
STATUS
=======
@WIP -- Work In Progress

30-OCT-08 03:18:05 GMT
.
RESEARCH
=========
Found mutliple bugs for 9i: Bug.4595726 & Bug.5392244/5368853/4207513/5732292
.
UPDATE
=======
Hi,
From the internal bugs that I have referred for this errors, I have learnt there is no workaround.
You must try restarting the instances to see if this helps fix it.
You can apply patch for Base Bug.5732292 to avoid this problem in future. (Available on top of 9.2.0.7 as one-off fix)
Thanks,
Koushal

30-OCT-08 03:18:12 GMT
Email Update button has been pressed: Sending email to jimmyhe1981@gmail.com.

30-OCT-08 05:09:03 GMT
New info : jimmyhe1981@gmail.com
Thanks for you help.
You must try restarting the instances to see if this helps fix it.
----do you mean that restart one instance or three instance? why do you suggest this?
And if restart three instance,should i stop all of them and start all of them ? or stop one and start one ,then stop another and start another? why do you suggest this?
And how do you guarantee the db will shutdown or startup normally?because we found there is a transaction in v$transaction,and it does not change any data since this accident happen.if there still transaction in v$transaction,can it be exec "shutdown immediate" without hang?
This is a serious product system,we should make sure every operation correct.

30-OCT-08 05:17:03 GMT
.
UPDATE
=======
You can first try shutting down 2&3 instances out of 3 to avoid CI calls (CI-cross instance) that seem to be failing here.
If this too doesnot fix the problem then performing shutdown abort of last instance 1 and then startup each instance one-by-one is what I would suggest.
There is no workaround mentioned in any internal bugs update so we cannot guarantee which is going to work.
9i is no longer a supported relase so I cannot involve Oracle Development for this issue.
Thnx.

30-OCT-08 05:17:05 GMT
New info : jimmyhe1981@gmail.com
i found the bug information: Bug 5732292.cannot be displayed. Possible reasons are:...
can you paste this note to me?

30-OCT-08 05:17:09 GMT
Email Update button has been pressed: Sending email to jimmyhe1981@gmail.com.

30-OCT-08 06:29:02 GMT
New info : jimmyhe1981@gmail.com
You can first try shutting down 2&3 instances out of 3 to avoid CI calls (CI-cross instance) that seem to be failing here.
----how do you found that CIC is on rac3?
and we found new situation:
on rac1:

12:43:56 SQL> select * from gv$lock where block=1;

   INST_ID ADDR             KADDR                   SID TY        ID1        ID2      LMODE    REQUEST      CTIME      BLOCK
---------- ---------------- ---------------- ---------- -- ---------- ---------- ---------- ---------- ---------- ----------
         1 C0000004439C2CB0 C0000004439C2CD0         95 JI    1529349          0          6          0      27846          1
         3 C0000004439C27D0 C0000004439C27F0        577 RO          1          0          6          0       4008          1

2 rows selected.

then on rac3:

SQL> select instance_name from v$instance;

INSTANCE_NAME
----------------
gdrac3

SQL> select event from v$session_wait where sid=577;

EVENT
----------------------------------------------------------------
DFS lock handle

SQL> select sql_text from v$session a ,v$sqlarea b
  2  where a.sql_address=b.address and a.sid=577;

SQL_TEXT
--------------------------------------------------------------------------------
truncate table Service_exclusive

this is new table we found can not truncate.Does this has any relationship?

30-OCT-08 06:34:02 GMT
New info : jimmyhe1981@gmail.com
If this too doesnot fix the problem then performing shutdown abort of last instance 1
----how i check whether it work?if i shutdown rac2 and rac3,you mean just waiting?how long should i wait to take the next step or what can i do to check it has already been recovered.

30-OCT-08 07:21:02 GMT
New info : jimmyhe1981@gmail.com
are you sure the instance you suggest shutdown first is rac2 and rac3,let the rac1 runing?
this problem happen on rac1.why not we shutdown rac1 first?
And if we shutdown all the instance,which instance you suggest startup first? rac3->rac2->rac1 is that ok?

30-OCT-08 07:30:33 GMT
.
UPDATE
=======
shutdown rac2 & rac3, let rac1 be kept running, wait for a while to check if the dropped table ALERTBUFF is stil listed in dba_ views, see if you can drop it again or perform a select on same table to verfiy if there are any new issues.
shutdown rac1 too if above doesnot fix it.
You can startup instances in any order this doesnot matter, just make sure to start one after the other (after open complete) and not all together.
Thnx.

30-OCT-08 07:30:44 GMT
Email Update button has been pressed: Sending email to jimmyhe1981@gmail.com.

30-OCT-08 08:37:16 GMT
New info : jimmyhe1981@gmail.com
when rac 1 is online, at that time ,should we set FAST_START_PARALLEL_ROLLBACK=high?
and we want to trace smon whether it work well,how do we trace this?

30-OCT-08 08:43:00 GMT
.
UPDATE
=======
-- Here is the information on parameter : FAST_START_PARALLEL_ROLLBACK
Syntax: FAST_START_PARALLEL_ROLLBACK = {HI | LO | FALSE}
Default value: LOW
Parameter class: Dynamic: ALTER SYSTEM
FAST_START_PARALLEL_ROLLBACK determines the maximum number of processes that can exist for performing parallel rollback. This parameter is useful on systems in which some or all of the transactions are long running.
Values:
FALSE indicates that parallel rollback is disabled
LOW limits the number of rollback processes to 2 * CPU_COUNT
HIGH limits the number of rollback processes to 4 * CPU_COUNT
-- I am not sure why you want to trace SMON. Your last update is not clear, please restate problem in detail.
Thnx.

30-OCT-08 08:43:06 GMT
Email Update button has been pressed: Sending email to jimmyhe1981@gmail.com.

30-OCT-08 10:00:03 GMT
New info : jimmyhe1981@gmail.com
THANKS.
I WANT TO KNOW IS THIS PARAMETER HELPFUL TO OUR RECOVERY? DO YOU SUGGEST TO SET IT TO HIGH?
BECAUSE OF SMON NOW IS NOT WORKING WELL TO CLEAN DEAD TRANASCTION,DURING THE RECOVER TIME,HOW CAN I KNOW IT WORK WELL?CAN I TRACE WHAT SMON IS DOING?

30-OCT-08 10:15:17 GMT
.
UPDATE
=======
You can leave this parameter to default (i.e. LOW) If need be you can make this parameter value HIGH.
Monitoring parallel rollback can be accomplished using the V$FAST_START_SERVERS and V$FAST_START_TRANSACTIONS views.
What makes you feel SMON is not working well ? i am not sure why you want to trace SMON, this will make things worse and not recommended by us unless we require it to do so.
Thnx.

30-OCT-08 10:15:31 GMT
Email Update button has been pressed: Sending email to jimmyhe1981@gmail.com.

03-NOV-08 01:53:02 GMT
New info : jimmyhe1981@gmail.com
on 31,Oct we try to restart rac2 and rac3 to see if it can recover.if not ,then we take the next step,restart 3 instance;
【plan on 31,Oct】:
1.stop some app.
2.stop lsnr,set job process=0,kill LOCAL=NO,see if any transaction in v$transaction, shutdown rac2
3.stop lsnr,set job process=0,kill LOCAL=NO,see if any transaction in v$transaction,shutdown rac3
4.keep rac1 online,see if the table ALERTBUFF can be desc,if there is no DFS LOCK HANDLE event.
5.whatever it because normal,after 3 hours ,startup rac 2 and rac3,restart app.
【situation on 31,OCT】:
we shutdown rac2,and see alterlog:
ALTER DATABASE CLOSE NORMAL
Sat Nov 1 00:28:11 2008
Waiting for smon to disable tx recovery.
Sat Nov 1 01:17:08 2008
Thread 2 advanced to log sequence 46765
Current log# 9 seq# 46765 mem# 0: /ora_redo/redo91
Current log# 9 seq# 46765 mem# 1: /ora_redo/redo92
Sat Nov 1 01:17:08 2008
ARC1: Evaluating archive log 10 thread 2 sequence 46764
ARC1: Beginning to archive log 10 thread 2 sequence 46764
Creating archive destination LOG_ARCHIVE_DEST_1: '/arch/gdrac_2_46764.arc'
ARC1: Completed archiving log 10 thread 2 sequence 46764

we see about an hour can not shutdown.
so we try shutdown abort rac2.it work.and we startup,check db can startup normally.we try shutdown immediate again. but it still hang.
so we have to shutdown abort rac2,and try to shutdowm rac as our plan.
shutdown immediate rac3,wait half an hour still hang.
shutdown abort rac3
see still a lot library cache pin,library cache lock,enqueue,2 DFS lock handle. and sid=64 that session there.
we still CAN NOT desc alertbuff,desc portal.Service_exclusive
wait about half an hour,startup rac2 and rac3.
so it seem that we can not solve the problem by just shutdown 2 instance.
we try to shutdown 3 instance tonight.
【question】:
we will shutdown 3 instance tonight,as we try to stop 2 instance,we can shutdown them normally,we have to shutdown them by shutdown abort.
my question is that if we shutdown 3 instance tonoght,it seem that we have to use shutdown abort on 3 instance.
is that any risk on this operation? how can we do to avoid that risk?
how should we guarantee that it can restart db saft and solve the problem ?

03-NOV-08 01:55:03 GMT
New info : jimmyhe1981@gmail.com
please give me a step by setp plan what i should do tonight.thanks.

03-NOV-08 03:35:04 GMT
New info : jimmyhe1981@gmail.com
pls check my question,thanks.

03-NOV-08 07:35:19 GMT
New info : jimmyhe1981@gmail.com
pls check my question as soon as possible,many thanks.

04-NOV-08 02:36:08 GMT
.
UPDATE
=======
From all the tests you have donw have you not tried shutting down all the RAC instance at once and then restarting it ?
What are the symptoms when you say tables cannot be described ?
There should not be any problem with uisng abort especially when immediate hangs or takes longer time.
Just make sure you do SQL> alter system checkpoint ; before doing a abort and this should be OK to use.

04-NOV-08 02:36:16 GMT
Email Update button has been pressed: Sending email to jimmyhe1981@gmail.com.

04-NOV-08 06:34:04 GMT
New info : jimmyhe1981@gmail.com
your answer is too late!
i update the tar three time yesterday to ask for answer,but you do nothing just let me waiting!
the reason we shutdown 2 instance on 31,OCT is that,considering the influence of shut down database,we have to try 2 instance first.if we shutdown 2 instance,it just business partial abnomal,but still can work.if we shutdown all the instance,it is business interrupt,which is more serious than abnormal.to avoid the bigger risk is very common in large system,as an oracle engineer,can't you understand this ?
and as you said "You can first try shutting down 2&3 instances out of 3 to avoid CI calls (CI-cross instance) that seem to be failing here." so we try to shutdown 2 instance first to see if it recover,if not recover,then we take the next step,restart 3 instance.
can not desc table mean that,when i DECS ALETRBUFF or DESC PORTAL.Service_exclusive,it was hang.waiting for library cache lock.in my case,can not truncate table usually go with library cache lock.
you don't understand this situation and did not ask me to collect information to confirm,did not help me to analyze what is holder of the lock,just told me to restart my database at the begining of this case?
you said shutdown abort will be safe,but i met a situation that i stop lnsr,kill all LOCAL=NO process,checkpint 3 time, switch logfile 3 time,set job_queue_processes=0,set aq_tm_processes=0,check v$transaction,then shutdown immeidate,wait half an hour,still not down,so i shutdown abort.when i startup,there is a lot of alert like "SMON: about to recover undo segment 60",i have to alter session set "_smu_debug_mode"=4 and alter rollback segment "_SYSSMU60$" online.shutdown abort is not safe as you think.
when your customer need you response,when your customer need you professional analysis,when your customer need you professional advice,what are you doing?are you serious to my tar? are you suitable be an oracle engineer?
i called a on-site oracle engineer to solve the problem yesterday night.
i am not satisfied your service on this tar.

04-NOV-08 07:13:50 GMT
.
UPDATE
=======
Hi,
Sorry for replying late, I was not in office yesterday to read your replies.
Incase of urgent issues please make sure to call Oracle support numbers for immediate assistance from available analyst on shift.
The issues you have faced for shutdown immediate, etc are different from original problem.
The suggestion to restart instances was given after researching on this case and no easy way available to fix it.
I hope everyting is all right now and we are good to close this case ?
Please reply as appropiate.
Thnx,
Koushal

04-NOV-08 07:13:58 GMT
Email Update button has been pressed: Sending email to jimmyhe1981@gmail.com.

04-NOV-08 07:29:03 GMT
New info : victor.xu@oracle.com
==============
ACS Update
==============
Dear Support,
The whole RAC instances have been restarted last late night, so far the "library cache pin/lock" has gone.
The affected table "MISC.ALERTBUFF" has been sucessfully truncated.
After discussed with customer DBA ZhaoLeHuan, he agreed that we can close this SR now. Thanks!
Regards,
Victor Xu
OSS China ACS

04-NOV-08 07:31:46 GMT
.
UPDATE
=======
Hi Victor,
Thanks for update.
I am inactivating this Service Request, as I have received confirmation from you that this issue is now resolved.
Best Regards,
Koushal
Global Customer Services
STATUS
=======
Inactivated -- Issue Resolved

04-NOV-08 07:31:56 GMT
Email Update button has been pressed: Sending email to victor.xu@oracle.com.

04-NOV-08 07:37:04 GMT
Closed by customer : victor.xu@oracle.com
6 comments
A good summary, learned a lot. Thanks, 小荷.
The first item in the lessons is the key one!
A good summary! Thanks!!
May I ask how you did this:
11. It still would not come down; the oradebug trace file showed:
WAIT #0: nam='opishd' ela= 986314 p1=0 p2=0 p3=0
WAIT #0: nam='opishd' ela= 986146 p1=0 p2=0 p3=0
WAIT #0: nam='opishd' ela= 986281 p1=0 p2=0 p3=0
I've also run into cases where an instance takes a long time to come down. I usually check whether archive logs are being generated; if they are, a large transaction is probably rolling back. But there are cases where no archiving happens, and in that situation how do you find out the cause?
Sometimes I connect one session first to watch the status and issue the shutdown from another, but sometimes I forget. On the whole I don't think my approach is good.
Could you share your method?
re hihiw: before typing the shutdown command, check your own spid, then use oradebug to start a 10046 trace, then issue the shutdown command and go read the trace file under udump.
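For reference, that sequence looks roughly like this (a sketch; the exact trace file name and the udump path depend on user_dump_dest):

SQL> select p.spid
  2    from v$process p, v$session s
  3   where p.addr = s.paddr
  4     and s.sid = (select sid from v$mystat where rownum = 1);

SQL> oradebug setmypid
SQL> oradebug event 10046 trace name context forever, level 12
SQL> shutdown immediate

-- then read the trace file named after that spid under udump (user_dump_dest)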
I see, so you are still looking at your own session. My earlier thinking was that some background process must be the reason the instance would not come down. My understanding was that once the shutdown command is issued, the background processes do the various pieces of work while the issuing process just waits, and only when everything is finished is the coordination information returned to the issuing process, a bit like how a commit actually waits for lgwr to finish.
So tracing my own process seemed to be of little value. (That was my understanding.)
Or is my understanding wrong?
I'll run an experiment and see.
Thanks
After a rough look, my understanding now is that after the shutdown command is issued the issuing process itself has to perform many steps, so monitoring those steps in detail lets you infer the rough cause. My earlier understanding was too simplistic: I assumed the main steps were carried out by pmon with the other processes only assisting.
I took too much for granted! That's why it never occurred to me to trace the process that issued the command.
Thanks!