While recently tuning a provincial system, I found that its database kept running into intermittent business blockage, and the problem was noticeably worse at night. To analyze the situation, I deployed statspack and had it collect a snapshot every 10 minutes.
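For reference, a minimal sketch of that kind of setup (the script path and dbms_job interval shown here are the standard ones, stated as an assumption rather than the exact commands used):

-- Install statspack as SYSDBA: @?/rdbms/admin/spcreate.sql
-- Then schedule a snapshot every 10 minutes via dbms_job
VARIABLE jobno NUMBER
BEGIN
  dbms_job.submit(:jobno, 'statspack.snap;', SYSDATE, 'SYSDATE + 10/1440');
  COMMIT;
END;
/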
Whenever the blockage occurred, the first entry in the statspack Top 5 events was enqueue:
Top 5 Timed Events
~~~~~~~~~~~~~~~~~~                                                     % Total
Event                                               Waits    Time (s) Ela Time
-------------------------------------------- ------------ ----------- --------
enqueue                                             2,057       6,012    57.63
db file sequential read                           230,588       2,250    21.57
db file scattered read                            141,246       1,797    17.22
CPU time                                                          186     1.78
buffer busy waits                                  12,426          57      .54
          -------------------------------------------------------------
I collected statspack reports for several of the blockage periods, and enqueue was first every time. So we went on to the enqueue-related part of statspack, but the Enqueue activity section was different in every report:
Enqueue activity for DB: XJMISC  Instance: xjmisc  Snaps: 985 -986
-> Enqueue stats gathered prior to 9i should not be compared with 9i data
-> ordered by Wait Time desc, Waits desc

                                                      Avg Wt         Wait
Eq     Requests    Succ Gets Failed Gets       Waits   Time (ms)     Time (s)
-- ------------ ------------ ----------- ----------- ------------- ------------
CF          305          305           0           1        818.00            1
SQ          707          707           0           6           .50            0
          -------------------------------------------------------------
Enqueue activity for DB: XJMISC  Instance: xjmisc  Snaps: 979 -980
-> Enqueue stats gathered prior to 9i should not be compared with 9i data
-> ordered by Wait Time desc, Waits desc

                                                      Avg Wt         Wait
Eq     Requests    Succ Gets Failed Gets       Waits   Time (ms)     Time (s)
-- ------------ ------------ ----------- ----------- ------------- ------------
SQ        1,446        1,446           0          11           .18            0
          -------------------------------------------------------------
The CF enqueue is the controlfile enqueue; could the backup be the culprit? So we tried cancelling the backup job. The SQ enqueue is the sequence enqueue; was the sequence cache too small? So we tried enlarging the sequence cache.
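For the sequence side, a minimal sketch of that change (the sequence name is hypothetical and 1000 is just an illustrative value):

-- A larger cache means fewer updates to the data dictionary, hence fewer SQ enqueues
ALTER SEQUENCE app.t_business_seq CACHE 1000;

-- Sequences with small caches can be listed from dba_sequences
SELECT sequence_owner, sequence_name, cache_size
  FROM dba_sequences
 WHERE cache_size < 100;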
We then collected another full day of data. In the early morning hours enqueue was still the top event, and the Enqueue activity section below still showed only SQ, CF and the like, none of them waiting for very long.
So what was actually causing the enqueue waits? Could it be that SQ and CF were not the cause at all?
To get to the bottom of it, I picked an early morning when the business blockage usually occurs, logged into the database to check its state, and found a large number of sessions waiting on enqueue. Decoding the wait parameters:
select sid, event, p1, p1raw,
       -- the two high-order bytes of P1 encode the enqueue type (e.g. 'TX'),
       -- the low-order bits encode the requested lock mode
       chr(bitand(P1, -16777216)/16777215) ||
       chr(bitand(P1, 16711680)/65535)   type,
       mod(P1, 16)                       "MODE"
  from v$session_wait
 where event = 'enqueue';
They all turned out to be TX enqueues requested in mode 6 (row-level exclusive locks).
Once we knew that, the fix was much easier to find: the cause was a session doing a long full-table scan on a large table and committing only after a very long time.
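For mode-6 TX waits like these, the blocker can be confirmed straight from v$lock; a minimal sketch of such a check (not the exact query used that night):

-- Pair every waiting session with the session holding the lock it wants
SELECT b.sid  AS blocker_sid,
       w.sid  AS waiter_sid,
       w.type, w.request
  FROM v$lock b, v$lock w
 WHERE b.block   = 1
   AND w.request > 0
   AND b.id1 = w.id1
   AND b.id2 = w.id2;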
The problem is that the enqueue we see ranked first in statspack cannot be analyzed in detail through the Enqueue activity section of that same statspack!
Moreover, although we did see CF waits in Enqueue activity, those waits simply do not correspond, within statspack, to the TX waits that were actually blocking the system.
A statspack report from another time window further confirmed my suspicion: when enqueue appears in the statspack Top 5 events, it has nothing to do with what the Enqueue activity section shows.
Top 5 Timed Events
~~~~~~~~~~~~~~~~~~                                                     % Total
Event                                               Waits    Time (s) Ela Time
-------------------------------------------- ------------ ----------- --------
db file sequential read                           428,602       6,175    22.06
sbtwrite2                                         746,814       6,076    21.71
db file scattered read                            861,713       5,756    20.57
async disk IO                                     772,638       4,070    14.54
CPU time                                                        2,944    10.52
          -------------------------------------------------------------
Enqueue activity for DB: XJMISC  Instance: xjmisc  Snaps:  1571 -1577
-> Enqueue stats gathered prior to 9i should not be compared with 9i data
-> ordered by Wait Time desc, Waits desc

                                                      Avg Wt         Wait
Eq     Requests    Succ Gets Failed Gets       Waits   Time (ms)     Time (s)
-- ------------ ------------ ----------- ----------- ------------- ------------
TX      344,261      344,259           0         116          8.49            1
CF        3,524        3,521           3           5        135.60            1
TC           15           15           0           3         55.33            0
SQ        3,684        3,684           0          16           .13            0
          -------------------------------------------------------------
While this statspack was being taken, a full database backup was running (the sbtwrite2 waits also give the backup away). There is no enqueue in the Top 5 events, yet in Enqueue activity the CF and TC waits are relatively high without ever appearing in the Top 5. And there was no business blockage at that time.
To verify my guess I ran an experiment: several sessions doing UPDATEs were blocked by one session that had not committed. Every one of them was waiting on a TX enqueue, and enqueue did indeed show up in the Top 5, yet the Enqueue activity section showed nothing about TX:
Enqueue activity  DB/Inst: ORALOCAL/oralocal  Snaps: 2-5
-> only enqueues with waits are shown
-> Enqueue stats gathered prior to 10g should not be compared with 10g data
-> ordered by Wait Time desc, Waits desc

Enqueue Type (Request Reason)
------------------------------------------------------------------------------
    Requests    Succ Gets Failed Gets       Waits Wt Time (s)  Av Wt Time(ms)
------------ ------------ ----------- ----------- ------------ --------------
CF-Controlfile Transaction
         120          120           0           1            0         375.00
TC-Tablespace Checkpoint
           3            3           0           1            0            .00
          -------------------------------------------------------------
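The experiment itself needs nothing more than one open transaction blocking the others; a minimal sketch, with a hypothetical table t_demo:

-- Session 1: lock a row and leave the transaction open (no COMMIT)
UPDATE t_demo SET val = val + 1 WHERE id = 1;

-- Sessions 2..n: the same statement now hangs on a mode-6 TX enqueue
UPDATE t_demo SET val = val + 1 WHERE id = 1;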
Fortunately, in a 10g environment the Top 5 itself tells us which type of enqueue is involved. Here is the Top 5 that corresponds to the Enqueue activity report above:
Top 5 Timed Events                                                    Avg %Total
~~~~~~~~~~~~~~~~~~                                                   wait   Call
Event                                            Waits    Time (s)   (ms)   Time
----------------------------------------- ------------ ----------- ------ ------
enq: TX - row lock contention                      449       1,349   3004   79.6
RMAN backup & recovery I/O                       1,458         181    124   10.7
control file sequential read                       745         124    166    7.3
control file parallel write                         83          15    181     .9
db file sequential read                             57          11    201     .7
          -------------------------------------------------------------
As you can see, the 10g Top 5 already names the enqueue type: enq: TX - row lock contention.
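And once the event name points at TX row-lock contention, 10g also lets you go straight from the named event to the blocker; a minimal sketch:

-- 10g+: the blocking session and the contended row are exposed directly
SELECT sid, event, blocking_session, row_wait_obj#, row_wait_row#
  FROM v$session
 WHERE event = 'enq: TX - row lock contention';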
4 Comments
This is exactly the kind of case where AWR earns its keep.
That said, the 9i statspack is odd here; at the very least the TX lock type should show up as accounting for most of the time.
A nice case study!
Indeed, statspack is often of limited help when analyzing a problem; you still have to do the analysis while the problem is actually happening.
Compare statspack reports against each other; the SQL sections should reveal some anomalies.
The troubleshooting approach here is well worth learning from. It seems that believing everything in the books is worse than having no books at all.