deadlock引起数据库挂死

某天，应用程序突然挂了，程序中报错连不上数据库。登录数据库主机后，发现sqlplus也登录不了，登录时，sqlplus长时间没有响应，检查alertlog只是发现半小时前检测到一个deadlock，没有其他的报错信息。心想oracle检测到deadlock能自动解锁，因此一开始没怀疑到是因为deadlock的原因，由于登录不上数据库，为了尽快的回复业务，停掉应用程序，杀掉ps -ef |grep LOCAL=N的进程后，再次能登录数据库，重启了数据库，再次启动应用，回复正常。

追溯原因时，一开始是怀疑系统的内核参数设置不正确，导致主机资源耗尽，因此无法sqlplus登录。但是检查了一遍主机的内核参数，发现均正常。主机的syslog同样也没发现异常。奇怪，为什么会出现挂死的现象？请教了老网虫白鳝后，才明白了某些版本的oracle检查到deadlock后会触发一个关于deadlock的bug，导致enqueue hash chain的parent latch被hold住，然后在做SYSTEM DUMP的时候可能产生LATCH死锁，这个时候可能hang住，时间可能是在几秒钟，也有可能是几小时。

……
ARC0: Evaluating archive   log 3 thread 1 sequence 36918
ARC0: Beginning to archive log 3 thread 1 sequence 36918
Creating archive destination LOG_ARCHIVE_DEST_1: '/arch/arch_1_36918.arc'
Thu Jan 31 15:59:13 2008
ARC0: Completed archiving  log 3 thread 1 sequence 36918
Thu Jan 31 16:05:30 2008
Thread 1 advanced to log sequence 36920
  Current log# 1 seq# 36920 mem# 0: /dev/vg_ora1/rredo_256m_01
  Current log# 1 seq# 36920 mem# 1: /dev/vg_ora2/rredo_256m_11
  Current log# 1 seq# 36920 mem# 2: /dev/vg_ora3/rredo_256m_21
Thu Jan 31 16:05:30 2008
ARC1: Evaluating archive   log 4 thread 1 sequence 36919
ARC1: Beginning to archive log 4 thread 1 sequence 36919
Creating archive destination LOG_ARCHIVE_DEST_1: '/arch/arch_1_36919.arc'
Thu Jan 31 16:05:32 2008
ORA-000060: Deadlock detected. More info in file /oracle/app/oracle/admin/sd168/udump/sd168_ora_14652.trc.---- 注意此处
Thu Jan 31 17:00:11 2008
ARC1: Completed archiving  log 4 thread 1 sequence 36919
Thu Jan 31 17:00:14 2008
SMON: Parallel transaction recovery tried
Thu Jan 31 17:02:34 2008
Thread 1 advanced to log sequence 36921
  Current log# 2 seq# 36921 mem# 0: /dev/vg_ora1/rredo_256m_02
  Current log# 2 seq# 36921 mem# 1: /dev/vg_ora2/rredo_256m_12
  Current log# 2 seq# 36921 mem# 2: /dev/vg_ora3/rredo_256m_22
Thu Jan 31 17:02:34 2008
ARC0: Evaluating archive   log 1 thread 1 sequence 36920
ARC0: Beginning to archive log 1 thread 1 sequence 36920
Creating archive destination LOG_ARCHIVE_DEST_1: '/arch/arch_1_36920.arc'
ARC0: Completed archiving  log 1 thread 1 sequence 36920
……

……

ARC0: Evaluating archive log 3 thread 1 sequence 36918

ARC0: Beginning to archive log 3 thread 1 sequence 36918

Creating archive destination LOG_ARCHIVE_DEST_1: '/arch/arch_1_36918.arc'

Thu Jan 31 15:59:13 2008

ARC0: Completed archiving log 3 thread 1 sequence 36918

Thu Jan 31 16:05:30 2008

Thread 1 advanced to log sequence 36920

Current log# 1 seq# 36920 mem# 0: /dev/vg_ora1/rredo_256m_01

Current log# 1 seq# 36920 mem# 1: /dev/vg_ora2/rredo_256m_11

Current log# 1 seq# 36920 mem# 2: /dev/vg_ora3/rredo_256m_21

Thu Jan 31 16:05:30 2008

ARC1: Evaluating archive log 4 thread 1 sequence 36919

ARC1: Beginning to archive log 4 thread 1 sequence 36919

Creating archive destination LOG_ARCHIVE_DEST_1: '/arch/arch_1_36919.arc'

Thu Jan 31 16:05:32 2008

ORA-000060: Deadlock detected. More info in file /oracle/app/oracle/admin/sd168/udump/sd168_ora_14652.trc.---- 注意此处

Thu Jan 31 17:00:11 2008

ARC1: Completed archiving log 4 thread 1 sequence 36919

Thu Jan 31 17:00:14 2008

SMON: Parallel transaction recovery tried

Thu Jan 31 17:02:34 2008

Thread 1 advanced to log sequence 36921

Current log# 2 seq# 36921 mem# 0: /dev/vg_ora1/rredo_256m_02

Current log# 2 seq# 36921 mem# 1: /dev/vg_ora2/rredo_256m_12

Current log# 2 seq# 36921 mem# 2: /dev/vg_ora3/rredo_256m_22

Thu Jan 31 17:02:34 2008

ARC0: Evaluating archive log 1 thread 1 sequence 36920

ARC0: Beginning to archive log 1 thread 1 sequence 36920

Creating archive destination LOG_ARCHIVE_DEST_1: '/arch/arch_1_36920.arc'

ARC0: Completed archiving log 1 thread 1 sequence 36920

……

进一步检查这个deadlock的trace文件：

/oracle/app/oracle/admin/sd168/udump/sd168_ora_14652.trc
Oracle9i Enterprise Edition Release 9.2.0.6.0 - 64bit Production
With the Partitioning, OLAP and Oracle Data Mining options
JServer Release 9.2.0.6.0 - Production
ORACLE_HOME = /oracle/app/oracle/product/9.2.0
System name:	HP-UX
Node name:	sd_db01
Release:	B.11.11
Version:	U
Machine:	9000/800
Instance name: sd168
Redo thread mounted by this instance: 1
Oracle process number: 590
Unix process pid: 14652, image: oracle@sd_db01 (TNS V1-V3)

*** 2008-01-31 16:05:32.480
*** SESSION ID:(465.37272) 2008-01-31 16:05:32.403
DEADLOCK DETECTED
Current SQL statement for this session:
 delete eips_corporation corp where corp.corpid=:1   
The following deadlock is not an ORACLE error. It is a
deadlock due to user error in the design of an application
or from issuing incorrect ad-hoc SQL. The following
information may aid in determining the deadlock:
Deadlock graph:
                       ---------Blocker(s)--------  ---------Waiter(s)---------
Resource Name          process session holds waits  process session holds waits
TM-0000872c-00000000       590     465    SX   SSX      581     281    SX   SSX  -------注意此处
TM-0000872c-00000000       581     281    SX   SSX      590     465    SX   SSX
session 465: DID 0001-024E-0000003F	session 281: DID 0001-0245-0000002E
session 281: DID 0001-0245-0000002E	session 465: DID 0001-024E-0000003F
Rows waited on:
Session 281: obj - rowid = 00008762 - AAAIdiACGAAAAAAAAA
  (dictionary objn - 34658, file - 134, block - 0, slot - 0)
Session 465: obj - rowid = 00008762 - AAAIdiACGAAAAAAAAA
  (dictionary objn - 34658, file - 134, block - 0, slot - 0)
Information on the OTHER waiting sessions:
Session 281:
  pid=581 serial=53477 audsid=7366710 user: 54/EICP
  O/S info: user: eicp, term: unknown, ospid: , machine: sd-oam01
            program: JDBC Thin Client
  application name: JDBC Thin Client, hash value=0
  Current SQL Statement:
   delete eips_corporation corp where corp.corpid=:1   
End of information on OTHER waiting sessions.
===================================================
PROCESS STATE
-------------
Process global information:
     process: c0000001991a1720, call: c0000001a57bcfd8, xact: c00000019bfbc5b0, curses: c0000001993bfd88, usrses: c0000001993bfd88
  ----------------------------------------
  SO: c0000001991a1720, type: 2, owner: 0000000000000000, flag: INIT/-/-/0x00
  (process) Oracle pid=590, calls cur/top: c0000001a57bcfd8/c0000001a57bcfd8, flag: (0) -
            int error: 0, call error: 0, sess error: 0, txn error 0
  (post info) last post received: 197 0 4
              last post received-location: kslpsr
              last process to post me: c0000001990e5be0 1 6
              last post sent: 0 0 16
              last post sent-location: ksasnd
              last process posted by me: c0000001990e5be0 1 6
    (latch info) wait_event=0 bits=10
      holding     c0000001c1a4d458 Parent+children enqueue hash chains level=4    -------注意此处
        Location from where latch is held: ksqcmi: kslgpl: 
        Context saved from call: 0
        state=busy
        recovery area:
Dump of memory from 0xC0000001990E4628 to 0xC0000001990E4638
……

/oracle/app/oracle/admin/sd168/udump/sd168_ora_14652.trc

Oracle9i Enterprise Edition Release 9.2.0.6.0 - 64bit Production

With the Partitioning, OLAP and Oracle Data Mining options

JServer Release 9.2.0.6.0 - Production

ORACLE_HOME = /oracle/app/oracle/product/9.2.0

System name: HP-UX

Node name: sd_db01

Release: B.11.11

Version: U

Machine: 9000/800

Instance name: sd168

Redo thread mounted by this instance: 1

Oracle process number: 590

Unix process pid: 14652, image: oracle@sd_db01 (TNS V1-V3)

*** 2008-01-31 16:05:32.480

*** SESSION ID:(465.37272) 2008-01-31 16:05:32.403

DEADLOCK DETECTED

Current SQL statement for this session:

delete eips_corporation corp where corp.corpid=:1

The following deadlock is not an ORACLE error. It is a

deadlock due to user error in the design of an application

or from issuing incorrect ad-hoc SQL. The following

information may aid in determining the deadlock:

Deadlock graph:

---------Blocker(s)-------- ---------Waiter(s)---------

Resource Name process session holds waits process session holds waits

TM-0000872c-00000000 590 465 SX SSX 581 281 SX SSX -------注意此处

TM-0000872c-00000000 581 281 SX SSX 590 465 SX SSX

session 465: DID 0001-024E-0000003F session 281: DID 0001-0245-0000002E

session 281: DID 0001-0245-0000002E session 465: DID 0001-024E-0000003F

Rows waited on:

Session 281: obj - rowid = 00008762 - AAAIdiACGAAAAAAAAA

(dictionary objn - 34658, file - 134, block - 0, slot - 0)

Session 465: obj - rowid = 00008762 - AAAIdiACGAAAAAAAAA

(dictionary objn - 34658, file - 134, block - 0, slot - 0)

Information on the OTHER waiting sessions:

Session 281:

pid=581 serial=53477 audsid=7366710 user: 54/EICP

O/S info: user: eicp, term: unknown, ospid: , machine: sd-oam01

program: JDBC Thin Client

application name: JDBC Thin Client, hash value=0

Current SQL Statement:

delete eips_corporation corp where corp.corpid=:1

End of information on OTHER waiting sessions.

===================================================

PROCESS STATE

-------------

Process global information:

process: c0000001991a1720, call: c0000001a57bcfd8, xact: c00000019bfbc5b0, curses: c0000001993bfd88, usrses: c0000001993bfd88

----------------------------------------

SO: c0000001991a1720, type: 2, owner: 0000000000000000, flag: INIT/-/-/0x00

(process) Oracle pid=590, calls cur/top: c0000001a57bcfd8/c0000001a57bcfd8, flag: (0) -

int error: 0, call error: 0, sess error: 0, txn error 0

(post info) last post received: 197 0 4

last post received-location: kslpsr

last process to post me: c0000001990e5be0 1 6

last post sent: 0 0 16

last post sent-location: ksasnd

last process posted by me: c0000001990e5be0 1 6

(latch info) wait_event=0 bits=10

holding c0000001c1a4d458 Parent+children enqueue hash chains level=4 -------注意此处

Location from where latch is held: ksqcmi: kslgpl:

Context saved from call: 0

state=busy

recovery area:

Dump of memory from 0xC0000001990E4628 to 0xC0000001990E4638

……

检查addr为c0000001c1a4d458的latch：

SQL> select name from v$latch where addr=upper('c0000001c1a4d458');

NAME
----------------------------------------------------------------
enqueue hash chains

SQL> select name from v$latch where addr=upper('c0000001c1a4d458');

NAME

----------------------------------------------------------------

enqueue hash chains

确实我们看到了enqueue hash chain。enqueue hash chain是用于保护每一个library cache中的hash bucket。
（这里借用一下eygle的图说明一下library cache的结构）

由于parent latch被HOLD了，所有对数据库library cache加锁的操作都无法进行，而sqlplus登录的时候要申请AUDSES$的SEQUENCE，因此sqlplus无法登录，同理所有的应用程序也无法登录——应用程序挂死。

在metalink上也确认了这个bug：2530125

可是，在第二天下午，再次出现的由于deadlock导致数据库挂死的问题，检查alertlog发现在很早之前已经有deadlock的现象，为何系统会频繁的出现deadlock？

根据deadlock的trace文件，根据deadlock graph的wait的模式是SSX，这就很有可能是外键没加索引造成。检查相关语句中的外键，靠！确实是外键没加索引。

对外键增加索引后，系统正常运行一周，不再发生死锁，也不再发生挂死现象。至此问题解决。回顾问题：由于外键没加索引–引起死锁–触发bug2530125–enqueue hash latch–无法登录–程序挂死。

deadlock引起数据库挂死

相关文章

导回awr报告中的历史执行计划

SQL Plan Directives与过量的动态采样

oracle将推出Exadata mini版？

发表回复 取消回复

发表回复取消回复