One day, at the height of a business peak, a database in one of the provinces suddenly went down. These are the errors found in the alert log:
    Mon Jan 12 09:49:42 2009
    ARC0: Evaluating archive   log 5 thread 1 sequence 142032
    ARC0: Beginning to archive log 5 thread 1 sequence 142032
    Creating archive destination LOG_ARCHIVE_DEST_1: '/oraarchlog/arch_1_142032.arc'
    Mon Jan 12 09:49:56 2009
    ARC0: Completed archiving  log 5 thread 1 sequence 142032
    Mon Jan 12 09:57:57 2009
    Errors in file /oracle/app/oracle/admin/orasid/bdump/orasid_lgwr_8267.trc:
    ORA-00313: open failed for members of log group 1 of thread 1
    ORA-00312: online log 1 thread 1: '/dev/vg_ora3/rredo_256m_21'
    ORA-27041: unable to open file
    HP-UX Error: 23: File table overflow
    Additional information: 2
    ORA-00312: online log 1 thread 1: '/dev/vg_ora2/rredo_256m_11'
    ORA-27041: unable to open file
    HP-UX Error: 23: File table overflow
    Additional information: 2
    ORA-00312: online log 1 thread 1: '/dev/vg_ora1/rredo_256m_01'
    ORA-27041: unable to open file
    HP-UX Error: 23: File table overflow
    Additional information: 2
    Mon Jan 12 09:57:57 2009
    Errors in file /oracle/app/oracle/admin/orasid/bdump/orasid_lgwr_8267.trc:
    ORA-00313: open failed for members of log group 1 of thread 1
    ORA-00312: online log 1 thread 1: '/dev/vg_ora3/rredo_256m_21'
    ORA-27041: unable to open file
    HP-UX Error: 23: File table overflow
    Additional information: 2
    ORA-00312: online log 1 thread 1: '/dev/vg_ora2/rredo_256m_11'
    ORA-27041: unable to open file
    HP-UX Error: 23: File table overflow
    Additional information: 2
Evidently too many open files ("HP-UX Error: 23: File table overflow") brought the database down.
The monitoring history showed that the open-file count had already hit 98% of the limit the day before, and was at 99% by a little after 8 a.m. on the day of the crash, yet none of this caught the attention of the monitoring staff, and the file table finally overflowed.
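A check like this is easy to automate. Below is a minimal watchdog sketch of my own, assuming the sar -v column layout shown in the next listing (file-sz as the 8th whitespace-separated field) and a placeholder alert address:

    #!/usr/bin/sh
    # warn when file-sz utilization reported by "sar -v" crosses a threshold;
    # $8 assumes the column layout shown below, and dba@example.com is a
    # placeholder address
    THRESHOLD=85
    PCT=`sar -v 1 1 | tail -1 | \
         awk '{ split($8, f, "/"); printf "%d", f[1] * 100 / f[2] }'`
    if [ "$PCT" -ge "$THRESHOLD" ]; then
        echo "`hostname`: file table at ${PCT}% of nfile" | \
            mailx -s "nfile utilization alert" dba@example.com
    fi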
The system-wide open-file total can be seen with sar -v:
    HP-UX my_db01 B.11.11 U 9000/800    01/17/09

    01:08:52 text-sz  ov  proc-sz  ov  inod-sz  ov  file-sz  ov
    01:08:54   N/A   N/A 1139/4096  0  2678/34816 0  113494/200010 0
    01:08:56   N/A   N/A 1139/4096  0  2678/34816 0  113494/200010 0
    01:08:58   N/A   N/A 1135/4096  0  2675/34816 0  113487/200010 0
    01:09:00   N/A   N/A 1135/4096  0  2675/34816 0  113487/200010 0
    01:09:02   N/A   N/A 1135/4096  0  2675/34816 0  113487/200010 0
The file-sz column is the open-file count: the first number is how many files are currently open, and the second (200,010 here) is the ceiling, which is determined by the kernel parameter nfile. There are three reference points for sizing nfile: the recommended value is (15 * NPROC + 2048); the cautionary value is Oracle's processes parameter times the number of data files, plus 2048; and the extreme value, for a dedicated database host, is the OS's maximum process count times the number of data files, i.e. every process holding every data file open.
    Nfile
    Max Number of simultaneously Open files system-wide at any given time.
    Total number of slots in the file descriptor table,
    default = 16*(nproc+16+maxusers)/10 + 32 + 2*(npty+nstrpty)
    recommended: to use default.
    >= Oracle 9i: (15 * NPROC + 2048)
    For Oracle installations with a high number of data files this might
    not be enough; then use
    >= ((number of oracle processes) * (number of Oracle data files) + 2048)

    Nproc
    Max Number of Processes that can exist simultaneously in the system,
    default = (20 + 8*MAXUSERS), influences ninode, nfile.
    recommended: >= Oracle 9i: 4096
Plugging in our numbers: the recommended formula puts nfile at 63,488; the cautionary formula gives about 1.41 million; and the extreme formula gives about 3.86 million:
    SQL> !
    oracle@my_db01:/oracle > kmtune |grep nproc
    nproc                    4096  -  4096
    oracle@my_db01:/oracle > exit

    SQL> select 15*4096+2048 from dual;

    15*4096+2048
    ------------
           63488

    SQL> show parameter process

    NAME                                 TYPE        VALUE
    ------------------------------------ ----------- ------------------------------
    processes                            integer     1500

    SQL> select count(*) from v$datafile;

      COUNT(*)
    ----------
           944

    SQL> select 1500*944+2048 from dual;

    1500*944+2048
    -------------
          1418048

    SQL> select 4096*944 from dual;

      4096*944
    ----------
       3866624
A setting of around 63,000 was clearly too small for this system, while 1.41 million or 3.86 million seemed far too large.
So exactly which processes need that many open files? sar -v only shows a system-wide total; can we see how many files each individual process has open?
For this, HP provides a tool called crashinfo, which reports the open-file count of every process.
Usage:
    As root:
    my_db01#[/tmp]chmod +x crashinfo
    my_db01#[/tmp]./crashinfo -ofiles
This shows the open-file count of each process:
    my_db01#[/tmp] ./crashinfo -ofiles
    crashinfo (4.74)
    Note: HP CONFIDENTIAL
    libp4 (9.305): Opening /stand/vmunix /dev/kmem
    Loading symbols from /stand/vmunix
    Kernel TEXT pages not requested in crashconf
    Will use an artificial mapping from a.out TEXT pages
    Command line: ./crashinfo -ofiles

                           crashinfo (4.74)

    pid   p_maxof ofiles p_comm
    ----- ------- ------ ------
    1     4096    2      init
    2011  4096    4      getty
    94    4096    4      evmmirr
    62    4096    4      emcpdaemon
    63    4096    4      emcpProcd
    64    4096    4      emcpd
    65    4096    4      emcpdfd
    66    4096    4      emcpwdd
    67    4096    4      emcpstratd
    68    4096    4      MpAsyncIoDaemo
    69    4096    4      MpTestDaemon
    70    4096    4      MpPirpNotDoneD
    71    4096    4      MpPeriodicDaem
    72    4096    4      MpDeferredDaem
    73    4096    4      MpGeneralReque
    74    4096    4      MpcAsyncIoDaem
    75    4096    4      MpcResumeIoDae
    76    4096    4      MpcPeriodicCal
    12986 4096    671    oracle
    13005 4096    671    oracle
    13291 4096    10     oracle
    12991 4096    671    oracle
    ……
    22466 4096    51     oracle
    12300 4096    847    oracle
    12898 4096    9      oracle
    12302 4096    979    oracle
    13003 4096    671    oracle
    13009 4096    670    oracle
    13012 4096    459    oracle
    13014 4096    49     oracle

    Total = 116775

    my_db01#[/tmp]ps -ef |grep 12298
      oracle 12298     1  0  Jan 15  ?         9:31 ora_dbw0_orasid
        root 12469 11867  1 02:06:39 pts/te    0:00 grep 12298
    my_db01#[/tmp]ps -ef |grep 12300
      oracle 12300     1  0  Jan 15  ?         9:32 ora_dbw1_orasid
        root 13363 11867  0 02:16:09 pts/te    0:00 grep 12300
    my_db01#[/tmp]
Running ps -ef | grep on the pids with the highest counts showed that the processes with the most open files were dbwr, smon, lgwr and ckpt, each holding roughly 800 to 900 files open; and since there were 8 dbwr processes, their combined total was larger still. The other heavy consumers were LOCAL=NO (dedicated server) processes; correlating their spid with sql_text showed they were operating on the history table (a query sketch follows). The history table is a large one, with more than 300 data files in its tablespace, so each of those processes also held around 600 files open.
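The spid-to-sql_text correlation can be done with the standard dynamic-view join; a minimal sketch, where &os_pid is a placeholder for a pid reported by crashinfo or ps:

    -- map an OS pid (v$process.spid) to the statement its session is running
    select s.sid, s.serial#, q.sql_text
      from v$process p, v$session s, v$sql q
     where p.spid = '&os_pid'
       and s.paddr = p.addr
       and s.sql_address = q.address
       and s.sql_hash_value = q.hash_value;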
Then I suddenly remembered that during a tuning exercise in that province a while back, the DBA had raised the number of dbwr processes from 2 to 8. Quadrupling the dbwr count raised the open-file count accordingly, and that is what let nfile usage creep upward until the table finally burst.
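The scale of that change is easy to estimate from the crashinfo output above, taking the observed ~850 open files per dbwr process:

    before: 2 dbwr x ~850 files ≈ 1,700 file-table slots
    after:  8 dbwr x ~850 files ≈ 6,800 file-table slots

That is roughly 5,000 extra entries: modest against the 200,010 ceiling, but real margin on a file table already running above 98% full.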
With the cause identified, the fix was straightforward. At the next maintenance window we set db_writer_processes back to 2 and, for extra safety, raised nfile to 400,000; a sketch of the change follows.
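Both changes need downtime: db_writer_processes is a static init parameter, and on HP-UX 11.11 nfile is a static kernel parameter. A hedged sketch, assuming an spfile is in use; the kernel rebuild step (mk_kernel/kmupdate, or driving it through SAM) varies with local practice:

    SQL> alter system set db_writer_processes=2 scope=spfile;
    -- static parameter: takes effect at the next instance restart

    my_db01#[/tmp]kmtune -s nfile=400000
    # then rebuild the kernel and reboot for the new value to take effect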
Appendix: other uses of the crashinfo command:
    Usage:   crashinfo [options ...] [coredir | kernel core]
    Default: coredir="." if "INDEX" file present else
             kernel="/stand/vmunix" core="/dev/kmem"
    Options:
            -h | -help [flag,flag...]
                     flags: detail
            -u | -update
            -v | -verbose
            -c | -continue
            -H | -Html
            -e | -email <mail_addr>[,flag,flag...]
                     flags: file=<file>
                            from=<from>
                            callid=<callid>
            -t | -trace [flag,flag...]
                     flags: args
                            regs    (PA Only)
                            Rregs   (PA Only)
                            locals  (IA64 Only)
                            frame   (IA64 Only)
                            mems    (IA64 Only)
                            bsp     (IA64 Only)
                            ss      (IA64 Only)
            -s | -syscall
            -f | -full_comm
            -S | -Sleep
            -l | -listonly [flag,flag...]
                     flags: pid=<pid>
                 -top [count]
            -p | -proctree [flag,flag...]
                     flags: tids
            -i | -ioscan
                 -ofiles [pid][,flag,flag...]
                     flags: tids
                 -signals [pid]
                 -seminfo
                 -unlinked [all]
            -m | -msginfo [flag,flag...]
                     flags: skip
                            id=<id>
                            msg=<msg>
                            type=<type>
                 -flocks [flag,flag...]
                     flags: sleep
                            vnode=<vnode>
                            inode=<inode>
                 -sgfr [version]
                 -fwlog [cpu]
                 -vmtrace [flag,flag...]
                     flags: bucket=<bucket>
                            arena=<arena>
                            count=<num>
                            leak
                            cor
                            log
                            parse

    E.g.:

    my_db01#[/tmp]ps -ef |grep ckpt
      oracle 12304     1  0  Jan 15  ?         3:29 ora_ckpt_orasid
        root 13509 11867  0 02:18:35 pts/te    0:00 grep ckpt
    my_db01#[/tmp]
    my_db01#[/tmp]./crashinfo -ofiles 12304
    crashinfo (4.74)
    Note: HP CONFIDENTIAL
    libp4 (9.305): Opening /stand/vmunix /dev/kmem
    Loading symbols from /stand/vmunix
    Kernel TEXT pages not requested in crashconf
    Will use an artificial mapping from a.out TEXT pages
    Command line: ./crashinfo -ofiles 12304

                           crashinfo (4.74)

    pid   p_maxof ofiles p_comm
    ----- ------- ------ ------
    12304 4096    961    oracle
    my_db01#[/tmp]
5 comments
nfile usage has to be monitored; here we generally start the emergency procedure once it hits 85%.
85% is the lifeline.
re David.Guo: We had monitoring too, it just never caught the staff's attention. Sigh, what a failure~~~
Even worse, the SA hadn't configured ServiceGuard properly: it wasn't monitoring the Oracle processes, so after the database went down there was no automatic failover to node B.
Our nfile here is 1,000,000. Heh.
I recall that lsof can show open-file counts too.
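For example, where lsof is installed, something like this gives a similar per-process view (approximate, since lsof's header line and memory-mapped entries are included):

    # open-file count for a single pid
    lsof -p 12304 | wc -l
    # rough top consumers system-wide (column 2 of lsof output is the pid)
    lsof | awk 'NR > 1 { print $2 }' | sort | uniq -c | sort -rn | head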
I ran into this problem recently too, but had no tool to analyze it.
Could you share the crashinfo tool? I can't download it from the website.
Sending it to my email works too:
ora315@gmail.com
Thanks!
I hit this problem as well. Following the MetaLink note I increased the values of
nproc = Max Number of Processes
nfile = Max Number of Open Files
nflocks = Max Number of File Locks
but file-sz just kept climbing (if only it were a stock price) and the database went down again; I raised the values again, and it went down again. I'm at my wits' end. Master He, please advise.