近期遇一个省的数据库说总是繁忙,idle很小,在凌晨稍微跑几个比较大的统计脚本,idle就出现0的情况,导致业务发生堵塞。
经过连续几天的观察,发现cpu曲线大致如下:
和白鳝交流了一下,老白说cpu的user%比例过高可能是由于buffer get比较高的sql引起。可以重点找一下statspack中buffer get比较高的语句。
根据sar出来的结果,查看对应时间段的statspack:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 |
Snap Id Snap Time Sessions Curs/Sess Comment --------- ------------------ -------- --------- ------------------- Begin Snap: 148 17-Nov-08 23:59:01 439 3,968.9 End Snap: 149 18-Nov-08 00:04:02 444 3,925.3 Elapsed: 5.02 (mins) …… Top 5 Timed Events ~~~~~~~~~~~~~~~~~~ % Total Event Waits Time (s) Ela Time -------------------------------------------- ------------ ----------- -------- CPU time 495 29.68 db file scattered read 78,197 465 27.90 db file sequential read 36,235 342 20.50 PL/SQL lock timer 1 117 7.03 enqueue 108 75 4.49 …… SQL ordered by Gets for DB: XJUSER Instance: xjUSER Snaps: 148 -149 -> End Buffer Gets Threshold: 10000 -> Note that resources reported for PL/SQL includes the resources used by all SQL statements called within the PL/SQL code. As individual SQL statements are also reported, it is possible and valid for the summed total % to exceed 100 CPU Elapsd Buffer Gets Executions Gets per Exec %Total Time (s) Time (s) Hash Value --------------- ------------ -------------- ------ -------- --------- ---------- 2,619,562 3,841 682.0 23.8 326.51 330.57 3767922078 Module: dbtrans_query@xj_svr02 (TNS V1-V3) select provinceid into :b0 from PROBLEM_TAB where PREFIXNUM=sub str(:b1,1,7) 1,095,641 1 1,095,641.0 9.9 7.54 8.65 1460516408 Module: dbtrans_single@xj_svr01 (TNS V1-V3) select b.servid ,a.tb1shortname ,b.servname ,(((TO_CHAR(c.PreDel Time,'MM')||'月')||TO_CHAR(c.PreDelTime,'DD'))||'日') ,(((TO_CHA R(c.expectdeltime,'MM')||'月')||TO_CHAR(c.expectdeltime,'DD'))|| '日') ,b.accessmodeid from tb1 a ,inst_tab_01 b ,servdelinterfa ce c where (((a.tb1id=b.tb1id and b.servid=c.servid) and b.servs tatus='P') and b.servid in (select servid from instead_table_01 where (dealstatus=1 and expectdeltime<=Sysdate))) union select b.servid ,a.tb1shortname ,b.servname ,(((TO_CHAR(c.PreDelTime,'M M')||'月')||TO_CHAR(c.PreDelTime,'DD'))||'日') ,(((TO_CHAR(c.exp ectdeltime,'MM')||'月')||TO_CHAR(c.expectdeltime,'DD'))||'日') , b.accessmodeid from tb1 a ,inst_tab_01 b ,instead_table_01 c wh ere (((a.tb1id=b.tb1id and b.servid=c.servid) and b.servstatus=' E') and b.servid in (select servid from instead_table_01 where (dealstatus=0 and expectdeltime<=Sysdate))) union select b.servi d ,a.tb1shortname ,b.servname ,(((TO_CHAR(c.PreDelTime,'MM')||' 月')||TO_CHAR(c.PreDelTime,'DD'))||'日') ,(((TO_CHAR(c.expectdel time,'MM')||'月')||TO_CHAR(c.expectdeltime,'DD'))||'日') ,b.acce ssmodeid from tb1 a ,inst_tab_01 b ,instead_table_01 c ,tb1serv ices d where ((((((((a.tb1id=b.tb1id and b.servstatus<>'P') and b.servstatus<>'E') and b.type=1) and b.ServGroupId=d.ServGroupId ) and d.type=0) and d.servstatus='P') and d.servid=c.servid) and d.servid in (select servid from instead_table_01 where (dealst atus=1 and expectdeltime<=Sysdate))) union select b.servid ,a.ic pshortname ,b.servname ,(((TO_CHAR(c.PreDelTime,'MM')||'月')||TO _CHAR(c.PreDelTime,'DD'))||'日') ,(((TO_CHAR(c.expectdeltime,'MM ')||'月')||TO_CHAR(c.expectdeltime,'DD'))||'日') ,b.accessmodeid from tb1 a ,inst_tab_01 b ,instead_table_01 c ,inst_tab_01 d w here ((((((((a.tb1id=b.tb1id and b.servstatus<>'P') and b.servst atus<>'E') and b.type=1) and b.ServGroupId=d.ServGroupId) and d. type=0) and d.servstatus='E') and d.servid=c.servid) and d.servi d in (select servid from instead_table_01 where (dealstatus=0 a nd expectdeltime<=Sysdate))) 710,704 433 1,641.3 6.5 44.54 104.60 3049199920 Module: boss_syncstatu@xj_svr02 (TNS V1-V3) begin pk_bossusrchgstatus_proc . nUserStatusChgProc ( :nBatchRow s , :sTablePartName , :nRet ) ; end ; 474,094 1 474,094.0 4.3 2.91 9.39 2446775708 Module: dbtrans_single@xj_svr01 (TNS V1-V3) select b.servid ,a.tb1shortname ,b.servname ,(((TO_CHAR(c.PreDel Time,'MM')||'月')||TO_CHAR(c.PreDelTime,'DD'))||'日') ,(((TO_CHA R(c.expectdeltime,'MM')||'月')||TO_CHAR(c.expectdeltime,'DD'))|| SQL ordered by Gets for DB: XJUSER Instance: xjUSER Snaps: 148 -149 -> End Buffer Gets Threshold: 10000 -> Note that resources reported for PL/SQL includes the resources used by all SQL statements called within the PL/SQL code. As individual SQL statements are also reported, it is possible and valid for the summed total % to exceed 100 CPU Elapsd Buffer Gets Executions Gets per Exec %Total Time (s) Time (s) Hash Value --------------- ------------ -------------- ------ -------- --------- ---------- '日') ,b.accessmodeid from tb1 a ,inst_tab_01 b ,servdelinterfa ce c where (((a.tb1id=b.tb1id and b.servid=c.servid) and b.servs tatus='P') and b.servid in (select servid from instead_table_01 where dealstatus=0)) union select b.servid ,a.tb1shortname ,b.s ervname ,(((TO_CHAR(c.PreDelTime,'MM')||'月')||TO_CHAR(c.PreDelT ime,'DD'))||'日') ,(((TO_CHAR(c.expectdeltime,'MM')||'月')||TO_C HAR(c.expectdeltime,'DD'))||'日') ,b.accessmodeid from tb1 a ,i cpservices b ,instead_table_01 c ,inst_tab_01 d where ((((((((a. tb1id=b.tb1id and b.servstatus<>'P') and b.servstatus<>'E') and b.type=1) and b.ServGroupId=d.ServGroupId) and d.type=0) and d.s ervstatus='P') and d.servid=c.servid) and d.servid in (select se rvid from instead_table_01 where dealstatus=0)) 364,188 535 680.7 3.3 43.21 43.90 3582277248 Module: boss_syncstatu@xj_svr02 (TNS V1-V3) SELECT NETWORKTYPEID FROM PROBLEM_TAB WHERE PREFIXNUM = SUBSTR(: B1 , 1, 7) 293,675 1 293,675.0 2.7 17.87 183.31 2752729414 Module: SQL*Plus select count(*) from table_0002 where length(deviceid)>7 and sta tus<>'D' |
首先查了hash value为1460516408和2446775708的sql,因为这2个sql的执行次数不高,但是单次执行的消耗很高,查询后发现2个sql的consistent gets有47万。与开发沟通后,该sql的执行计划已经无法改变,但是由于是通过时间点触发,可以将该sql的执行时间延迟到6点之后执行。
同时hash value为3767922078之类的sql由于执行太过于频繁,要求该进程的执行频率是否能够降低一些?但是该进程由于涉及状态同步,如果降低频率会引起用户投诉,暂时无法处理该sql。
待调整后,继续观察一天。发现晚上的曲线还是大致相仿:
继续观察对应时间段的statspack:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 |
Snap Id Snap Time Sessions Curs/Sess Comment --------- ------------------ -------- --------- ------------------- Begin Snap: 447 19-Nov-08 00:19:05 438 4,018.6 End Snap: 448 19-Nov-08 00:24:01 434 4,055.7 Elapsed: 4.93 (mins) …… Top 5 Timed Events ~~~~~~~~~~~~~~~~~~ % Total Event Waits Time (s) Ela Time -------------------------------------------- ------------ ----------- -------- db file sequential read 138,603 1,651 54.24 CPU time 937 30.79 db file scattered read 22,998 337 11.08 buffer busy waits 5,000 42 1.37 log file parallel write 34,923 34 1.12 …… SQL ordered by Gets for DB: XJUSER Instance: xjUSER Snaps: 447 -448 -> End Buffer Gets Threshold: 10000 -> Note that resources reported for PL/SQL includes the resources used by all SQL statements called within the PL/SQL code. As individual SQL statements are also reported, it is possible and valid for the summed total % to exceed 100 CPU Elapsd Buffer Gets Executions Gets per Exec %Total Time (s) Time (s) Hash Value --------------- ------------ -------------- ------ -------- --------- ---------- 4,436,086 2,070 2,143.0 30.4 147.41 150.68 2931180186 Module: boss_usersvr@xj_svr02 (TNS V1-V3) select a.HistoryId ,a.mid ,a.SubsId ,a.ActionId ,b.umflag ,to_ch ar(a.ActionDate,'yyyymmddhh24miss') ,a.ActionPlace ,b.tb1servid ,c.tb1code ,a.AccessModeId ,a.servid ,b.ServAttr ,b.USERID ,a.Ch argeMid ,a.ChannelId into :s1:s2 ,:s3:s4 ,:s5:s6 ,:s7:s8 ,:s9:s1 0 ,:s11:s12 ,:s13:s14 ,:s15:s16 ,:s17:s18 ,:s19:s20 ,:s21:s22 ,: s23:s24 ,:s25:s26 ,:s27:s28 ,:s29:s30 from instead_tab2_Histor y a ,service b ,tb1 c where (((a.historyid=:b2 and a.servid=b.se rvid) and a.tb1id=c.tb1id) and a.tb1id=b.tb1id) 4,148,056 101 41,069.9 28.5 367.21 1019.89 3049199920 Module: boss_syncstatu@xj_svr02 (TNS V1-V3) begin pk_bossusrchgstatus_proc . nUserStatusChgProc ( :nBatchRow s , :sTablePartName , :nRet ) ; end ; 2,526,128 3,702 682.4 17.3 300.70 317.37 3582277248 Module: boss_syncstatu@xj_svr02 (TNS V1-V3) SELECT NETWORKTYPEID FROM PROBLEM_TAB WHERE PREFIXNUM = SUBSTR(: B1 , 1, 7) 2,319,482 3,402 681.8 15.9 289.98 299.17 3767922078 Module: dbtrans_query@xj_svr02 (TNS V1-V3) select provinceid into :b0 from PROBLEM_TAB where PREFIXNUM=sub str(:b1,1,7) |
发现consistent gets很高的那几个sql已经不再出现,但是为何user的比例还是很高呢?
此时,突然发现了执行很频繁的几个sql,cpu time(s)有300多秒!!如hash value 3582277248、3767922078这几个sql的cpu time比昨天调整的consistent gets 还高!!
其实昨天查的时候,也注意过这几个sql,这几个sql虽然是走全表扫描,不过执行的速度很快:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 |
PLAN_TABLE_OUTPUT -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- --------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost | --------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | 11 | 64 | |* 1 | TABLE ACCESS FULL | PROBLEM_TAB | 1 | 11 | 64 | --------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - filter("PROBLEM_TAB"."PREFIXNUM"=SUBSTR(:Z,1,7)) Note: cpu costing is off 14 rows selected. 11:01:41 SQL> select count(*) from PROBLEM_TAB; COUNT(*) ---------- 100415 11:01:50 SQL> |
buffer get部分中cpu time这么高,估计是这几个sql影响了。查询了一下该sql在别的省的执行计划,发现是走PK索引的。这个省的主键不知道被谁删除掉了!!
建立PK后,发现user一下子就下去了。
1 2 3 4 5 6 7 8 9 |
11:02:41 SQL> l 1 ALTER TABLE "USER"."PROBLEM_TAB" ADD CONSTRAINT "PK_USERPREFIX_USERID_PREFIXNUM" PRIMARY KEY ("PREFIXNUM") 2 USING INDEX PCTFREE 10 INITRANS 2 MAXTRANS 255 3 STORAGE(INITIAL 524288 NEXT 524288 MINEXTENTS 1 MAXEXTENTS 2147483645 4 PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1 BUFFER_POOL DEFAULT) 5* TABLESPACE "INDEX_DYNAMIC" ENABLE 11:02:42 SQL> / Table altered. |
1 2 3 4 5 6 7 8 9 10 11 12 13 |
%usr %sys %wio %idle 11:02:38 40 2 50 8 11:02:40 40 1 51 9 11:02:42 38 1 51 9 11:02:44 42 1 45 12 11:02:46 32 1 46 20 11:02:48 25 1 50 25 11:02:50 3 0 47 50 11:02:52 6 1 59 34 11:02:54 5 2 75 19 11:02:56 3 0 67 30 11:02:58 5 1 64 31 11:03:00 4 0 65 30 |
总结:遇到cpu的user高问题,如果没有enqueue,latch free之类的等待,仅仅是cpu time的等待,看statspack 中buffer get中的cpu time(s)高的sql很重要。
4条评论
经典案例啊~!
白老大果然很强 案例很经典值得学习
cpu曲线 是什么软件绘制的
很常见的问题