今天本来是休假半天,正在外面办事的时候,突然收到pushmail的告警,oracle数据库中有500多个ora-的报错,心里咯噔了一下,怎么会有那么多的报错!但是又没收到数据库down的pushmail,难道是程序遇到bug了?赶紧打了电话给同事,请他们帮忙看看。
同时也顾不上吃午饭了,匆匆赶到了公司,了解情况。Team Leader已经把该故障升级为一级故障,已经影响到了部分应用模块的正常使用。同事已经在metalink上开了SR,另外也开了一个电话会议,向客户介绍了当前情况。
赶紧先把SR扫了一遍,大致了解了情况,是ora-600[19004]的报错,oracle要求提供explain plan和10053的trace,于是一边登录现网执行语句,查看情况,另一边也介入了电话会议,确认了影响的范围,安抚客户情绪,并表达了针对这个问题的初步诊断分析,以及对策。
在alertlog中,我们看到有大量的报错:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 |
Errors in file /u01/app/oracle/product/10.2.0/admin/MICPRD/udump/micprd_ora_1482830.trc: ORA-00600: internal error code, arguments: [19004], [], [], [], [], [], [], [] Thu Jan 20 14:16:13 2011 Errors in file /u01/app/oracle/product/10.2.0/admin/MICPRD/udump/micprd_ora_1789954.trc: ORA-00600: internal error code, arguments: [19004], [], [], [], [], [], [], [] Thu Jan 20 14:16:16 2011 Errors in file /u01/app/oracle/product/10.2.0/admin/MICPRD/udump/micprd_ora_1507510.trc: ORA-00600: internal error code, arguments: [19004], [], [], [], [], [], [], [] Thu Jan 20 14:16:29 2011 Errors in file /u01/app/oracle/product/10.2.0/admin/MICPRD/udump/micprd_ora_1884392.trc: ORA-00600: internal error code, arguments: [19004], [], [], [], [], [], [], [] Thu Jan 20 14:16:31 2011 Errors in file /u01/app/oracle/product/10.2.0/admin/MICPRD/udump/micprd_ora_1372166.trc: ORA-00600: internal error code, arguments: [19004], [], [], [], [], [], [], [] Thu Jan 20 14:16:35 2011 Errors in file /u01/app/oracle/product/10.2.0/admin/MICPRD/udump/micprd_ora_1388784.trc: ORA-00600: internal error code, arguments: [19004], [], [], [], [], [], [], [] Thu Jan 20 14:16:37 2011 Errors in file /u01/app/oracle/product/10.2.0/admin/MICPRD/udump/micprd_ora_1388784.trc: ORA-00600: internal error code, arguments: [19004], [], [], [], [], [], [], [] Thu Jan 20 14:16:37 2011 Errors in file /u01/app/oracle/product/10.2.0/admin/MICPRD/udump/micprd_ora_1884392.trc: ORA-00600: internal error code, arguments: [19004], [], [], [], [], [], [], [] Thu Jan 20 14:16:43 2011 Errors in file /u01/app/oracle/product/10.2.0/admin/MICPRD/udump/micprd_ora_1105974.trc: ORA-00600: internal error code, arguments: [19004], [], [], [], [], [], [], [] Thu Jan 20 14:17:17 2011 Errors in file /u01/app/oracle/product/10.2.0/admin/MICPRD/udump/micprd_ora_1482830.trc: ORA-00600: internal error code, arguments: [19004], [], [], [], [], [], [], [] Thu Jan 20 14:17:24 2011 Errors in file /u01/app/oracle/product/10.2.0/admin/MICPRD/udump/micprd_ora_1773588.trc: ORA-00600: internal error code, arguments: [19004], [], [], [], [], [], [], [] Thu Jan 20 14:17:30 2011 Errors in file /u01/app/oracle/product/10.2.0/admin/MICPRD/udump/micprd_ora_1527844.trc: ORA-00600: internal error code, arguments: [19004], [], [], [], [], [], [], [] Thu Jan 20 14:17:33 2011 Errors in file /u01/app/oracle/product/10.2.0/admin/MICPRD/udump/micprd_ora_1527844.trc: ORA-00600: internal error code, arguments: [19004], [], [], [], [], [], [], [] Thu Jan 20 14:17:39 2011 Errors in file /u01/app/oracle/product/10.2.0/admin/MICPRD/udump/micprd_ora_1773588.trc: ORA-00600: internal error code, arguments: [19004], [], [], [], [], [], [], [] Thu Jan 20 14:17:40 2011 Errors in file /u01/app/oracle/product/10.2.0/admin/MICPRD/udump/micprd_ora_1388784.trc: ORA-00600: internal error code, arguments: [19004], [], [], [], [], [], [], [] Thu Jan 20 14:17:47 2011 Errors in file /u01/app/oracle/product/10.2.0/admin/MICPRD/udump/micprd_ora_1589298.trc: |
从tracefile中看,语句是类似的,只是带入的变量有所不同。其sql是:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 |
SELECT a.attr2 AS attr2, d.valuename AS pccodename, a.workingscope AS workingscope, h.currencyname AS currencyname, a.globalpccode AS globalpccode, a.grade AS grade, a.fax AS fax, b.zmclpattern AS zmclpattern1, a.orgtype AS orgtype, a.industrytype AS industry, a.scagreementtype AS scagreementtype, a.regioncode AS regioncode, a.booksetid AS booksetid, a.areaid AS areaid, a.citycode AS citycode, g.languagename AS languagename, a.relationtype AS relationtype, a.saleschannelid AS saleschannelid, a.corporateperson AS corporateperson, j.cityname AS cityname, a.attr3 AS attr3, f.countryname AS countryname, a.isbottom AS isbottom, a.paymentmethod AS paymentmethod, n.orgname AS relatedorgname, a.bgorgid AS bgorgid, a.shipmentterm AS shipmentterm, a.shipmentmethod AS shipmentmethod, '' AS tyresize, c.areacode AS areacode, a.parentorgid AS parentorgid, '' AS selectedcustom, j.citycode AS citycode1, a.comments AS comments, a.ownerorgid AS ownerorgid, a.registerfund AS registerfund, a.telephone AS telephone, a.orgid AS orgid, a.opendate AS opendate, o.valuename AS solutiontypename, a.orgname AS orgname, l.valuename AS statusname, a.solutiontype AS solutiontype, a.initflag AS initflag, a.datelastupdated AS datelastupdated, a.createdate AS createdate, a.targettype AS targettype, a.url AS url, a.attr1 AS attr1, d.languagecode AS languagecode, a.portcode AS portcode, a.preferedcurrency AS preferedcurrency, c.areaname AS areaname, a.countrycode AS orgcountrycode, a.status AS status, i.saleschannelname AS saleschannelname, a.paymentterm AS paymentterm, e.valuename AS corptypename, b.orgcode AS parentorgcode, a.booksetopendate AS booksetopendate, a.relatedorgid AS relatedorgid, b.shortname AS parentorgname, a.corporationtype AS corporationtype, a.orgcode AS orgcode, a.shortname AS shortname, a.email AS email, a.defaultlanguage AS defaultlanguage, j.languagecode AS languagecode7, 1 AS COUNT FROM table_abcd d, table_abcd o, table_112233 c, table_aaaa_bbbb_cc h, lang g, table_abcd l, table_xyzzxy b, saleschannel i, table_xyzzxy a, table_abcd_1234_a f, city j, table_xyzzxy n, table_abcd e WHERE 1 = 1 AND h.languagecode = j.languagecode AND a.defaultlanguage = g.languagecode AND d.languagecode = h.languagecode AND l.TYPE = 'Status' AND a.status = l.VALUE AND d.TYPE = 'GlobalPCCode' AND n.orgid(+) = a.relatedorgid AND d.languagecode = o.languagecode AND o.TYPE = 'SolutionType' AND a.solutiontype = o.VALUE AND a.globalpccode = d.VALUE AND d.languagecode = f.languagecode AND i.saleschannelid(+) = a.saleschannelid AND a.preferedcurrency = h.currencycode AND j.citycode(+) = a.citycode AND c.areaid(+) = a.areaid AND e.TYPE = 'CorporationType' AND b.orgid(+) = a.parentorgid AND a.countrycode = f.countrycode AND a.corporationtype = e.VALUE AND d.languagecode = l.languagecode AND d.languagecode = e.languagecode AND d.languagecode = 'cn' AND a.orgid = 100001419 |
这个sql无论在做explain plan for还是10053的时候,都会报600的错。
在metalink中查到,这个ora-600的报错是一个oracle的bug,在oracle 9208,10105,10203版本的几乎所有平台都会发生。这个bug已经有patch了:patch 4899105
当然,幸好oracle也提供了临时解决方法:
1 |
Gather statistics without histograms |
在故障现场,执行了以下2个语句之后,问题解决:
1 2 3 4 5 |
--删除原有的统计信息: exec dbms_stats.delete_table_stats(ownname=>'user_name',tabname=>'table_name',cascade_indexes=>true); --重新收集不带直方图的统计信息: exec dbms_stats.gather_table_stats(ownname=>'user_name',tabname=>'table_name',cascade=>true,method_opt=>'for all columns size 1'); |
临时的解决故障后,之后的步骤,就是和客户商量时间打上这个重要的patch了。
其实我还是有点纳闷的是,为什么会触发这个bug,难道是统计信息不够精确?但是我的数据库是使用oracle的默认收集统计信息的设置,而且某个月还会手工收集一次,应该不会有很旧的统计信息。另外,如果表的直方图信息有问题,这样的表应该不止一个,为什么直方图的信息会导致600的报错?为什么只是在执行一个sql的时候报错了600,而不是所有信息不准确的且multi join的sql都报错?