One day, at the height of a business peak, a database in one province suddenly went down. These are the errors found in the alert log:
Mon Jan 12 09:49:42 2009
ARC0: Evaluating archive log 5 thread 1 sequence 142032
ARC0: Beginning to archive log 5 thread 1 sequence 142032
Creating archive destination LOG_ARCHIVE_DEST_1: '/oraarchlog/arch_1_142032.arc'
Mon Jan 12 09:49:56 2009
ARC0: Completed archiving log 5 thread 1 sequence 142032
Mon Jan 12 09:57:57 2009
Errors in file /oracle/app/oracle/admin/orasid/bdump/orasid_lgwr_8267.trc:
ORA-00313: open failed for members of log group 1 of thread 1
ORA-00312: online log 1 thread 1: '/dev/vg_ora3/rredo_256m_21'
ORA-27041: unable to open file
HP-UX Error: 23: File table overflow
Additional information: 2
ORA-00312: online log 1 thread 1: '/dev/vg_ora2/rredo_256m_11'
ORA-27041: unable to open file
HP-UX Error: 23: File table overflow
Additional information: 2
ORA-00312: online log 1 thread 1: '/dev/vg_ora1/rredo_256m_01'
ORA-27041: unable to open file
HP-UX Error: 23: File table overflow
Additional information: 2
Mon Jan 12 09:57:57 2009
Errors in file /oracle/app/oracle/admin/orasid/bdump/orasid_lgwr_8267.trc:
ORA-00313: open failed for members of log group 1 of thread 1
ORA-00312: online log 1 thread 1: '/dev/vg_ora3/rredo_256m_21'
ORA-27041: unable to open file
HP-UX Error: 23: File table overflow
Additional information: 2
ORA-00312: online log 1 thread 1: '/dev/vg_ora2/rredo_256m_11'
ORA-27041: unable to open file
HP-UX Error: 23: File table overflow
Additional information: 2
So it appeared that too many open files had brought the database down.

Monitoring of the open file count showed that it had already reached 98% the day before, and by shortly after 8 a.m. it had hit 99%, yet none of this caught the monitoring staff's attention, until the open file table finally overflowed.

The system-wide open file count can be seen with sar -v:
HP-UX my_db01 B.11.11 U 9000/800    01/17/09

01:08:52 text-sz  ov  proc-sz    ov  inod-sz     ov  file-sz        ov
01:08:54  N/A    N/A  1139/4096   0  2678/34816   0  113494/200010   0
01:08:56  N/A    N/A  1139/4096   0  2678/34816   0  113494/200010   0
01:08:58  N/A    N/A  1135/4096   0  2675/34816   0  113487/200010   0
01:09:00  N/A    N/A  1135/4096   0  2675/34816   0  113487/200010   0
01:09:02  N/A    N/A  1135/4096   0  2675/34816   0  113487/200010   0
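Since the warning signs were visible a full day in advance, a small scripted check is worth wiring into cron. Below is a minimal sketch, not part of the original incident handling: it assumes the sar -v output format shown above (file-sz printed as "used/limit" in the 8th whitespace-separated field) and an arbitrary 85% alert threshold.

#!/usr/bin/sh
# Sketch: warn when the kernel file table (file-sz vs. nfile) crosses 85%.
# Assumptions: HP-UX "sar -v 1 1" prints file-sz as "used/limit" in field 8,
# as in the sample output above; adjust the field index for other releases.
sar -v 1 1 | awk '
    index($8, "/") {
        split($8, a, "/")
        pct = a[1] * 100 / a[2]
        if (pct >= 85)
            printf "WARNING: kernel file table at %.0f%% (%s of nfile)\n", pct, $8
    }'

Piped into mailx from cron, a check like this would have flagged the 98% reading the day before.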
The file-sz column is the open file count: the value after the slash (the 200,000-odd number) is the system-wide limit on open files, and the value before it is the number of files currently open. The limit is set by the kernel parameter nfile. When sizing nfile, the recommended value is (15 * NPROC + 2048); the warning-level value is the number of Oracle processes multiplied by the number of data files, plus 2048; and the extreme value, on a dedicated database server, would be the OS's maximum number of processes multiplied by the number of data files, i.e. every process opening every data file.
Nfile    Max number of simultaneously open files system-wide at any given time.
         Total number of slots in the file descriptor table,
         default = 16*(nproc+16+maxusers)/10 + 32 + 2*(npty+nstrpty)
         recommended: use the default.
         >= Oracle 9i: (15 * NPROC + 2048)
         For Oracle installations with a high number of data files this might
         not be enough; then use
         >= ((number of Oracle processes) * (number of Oracle data files) + 2048)

Nproc    Max number of processes that can exist simultaneously in the system,
         default = (20 + 8*MAXUSERS), influences ninode, nfile.
         recommended: >= Oracle 9i: 4096
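Before plugging numbers into these formulas, it is worth confirming the values currently in effect; the limit of 200010 seen in the sar -v output above should match nfile. A quick check with kmtune (the same command used for nproc in the session below):

my_db01#[/tmp]kmtune | grep -E 'nfile|nproc'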
Doing the math: with the recommended formula nfile comes out at 63,488; with the warning-level formula, about 1.41 million; and with the extreme formula, about 3.86 million:
SQL> !
oracle@my_db01:/oracle > kmtune |grep nproc
nproc                   4096            -       4096
oracle@my_db01:/oracle > exit

SQL> select 15*4096+2048 from dual;

15*4096+2048
------------
       63488

SQL> show parameter process

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
processes                            integer     1500
SQL>
SQL>
SQL> select count(*) from v$datafile;

  COUNT(*)
----------
       944

SQL> select 1500*944+2048 from dual;

1500*944+2048
-------------
      1418048

SQL> select 4096*944 from dual;

  4096*944
----------
   3866624

SQL>
A value of roughly 60,000 is clearly far too small for this system, while 1.41 million or 3.86 million seems far too large.

So which processes actually need that many open files? sar -v only shows a system-wide total; is there a way to see how many files each individual process has open?

For this, HP provides a tool called crashinfo, which can report the open file count of every process.

Usage:
As root:
my_db01#[/tmp]chmod +x crashinfo
my_db01#[/tmp]./crashinfo -ofiles
This lists the open file count for each process:
my_db01#[/tmp] ./crashinfo -ofiles
crashinfo (4.74)
Note: HP CONFIDENTIAL
libp4 (9.305): Opening /stand/vmunix /dev/kmem
Loading symbols from /stand/vmunix
Kernel TEXT pages not requested in crashconf
Will use an artificial mapping from a.out TEXT pages
Command line: ./crashinfo -ofiles
crashinfo (4.74)

  pid   p_maxof   ofiles   p_comm
-----   -------   ------   ------
    1      4096        2   init
 2011      4096        4   getty
   94      4096        4   evmmirr
   62      4096        4   emcpdaemon
   63      4096        4   emcpProcd
   64      4096        4   emcpd
   65      4096        4   emcpdfd
   66      4096        4   emcpwdd
   67      4096        4   emcpstratd
   68      4096        4   MpAsyncIoDaemo
   69      4096        4   MpTestDaemon
   70      4096        4   MpPirpNotDoneD
   71      4096        4   MpPeriodicDaem
   72      4096        4   MpDeferredDaem
   73      4096        4   MpGeneralReque
   74      4096        4   MpcAsyncIoDaem
   75      4096        4   MpcResumeIoDae
   76      4096        4   MpcPeriodicCal
12986      4096      671   oracle
13005      4096      671   oracle
13291      4096       10   oracle
12991      4096      671   oracle
……
22466      4096       51   oracle
12300      4096      847   oracle
12898      4096        9   oracle
12302      4096      979   oracle
13003      4096      671   oracle
13009      4096      670   oracle
13012      4096      459   oracle
13014      4096       49   oracle

Total = 116775

my_db01#[/tmp]ps -ef |grep 12298
  oracle 12298     1  0  Jan 15  ?         9:31 ora_dbw0_orasid
    root 12469 11867  1 02:06:39 pts/te    0:00 grep 12298
my_db01#[/tmp]ps -ef |grep 12300
  oracle 12300     1  0  Jan 15  ?         9:32 ora_dbw1_orasid
    root 13363 11867  0 02:16:09 pts/te    0:00 grep 12300
my_db01#[/tmp]
Running ps -ef | grep on the PIDs with the largest counts showed that the processes with the most open files were dbwr, smon, lgwr and ckpt, each holding roughly 800 to 900 open files; and since there were 8 dbwr processes, their combined total was even higher. The other heavy openers were LOCAL=NO (remote client) processes; mapping their spid to sql_text showed they were working on the history tables. The history tables are large, their tablespace contains more than 300 data files, and so each of these processes also had around 600 files open.
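If crashinfo is not at hand, a similar per-process breakdown can usually be obtained with lsof (as one of the comments below also points out), provided it is installed on the host. This is only a sketch, not part of the original diagnosis; note that lsof also counts memory-mapped files and the like, so its numbers will not match the kernel file table exactly.

# Open file count for a single PID (12300 is ora_dbw1_orasid above):
lsof -p 12300 | wc -l

# Rough system-wide ranking of PIDs by number of open files:
lsof | awk 'NR > 1 { n[$2]++ } END { for (p in n) print n[p], p }' | sort -rn | head -10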
At that point I remembered that during a tuning exercise in this province a while back, the DBA had raised the dbwr process count from 2 to 8. Quadrupling the number of dbwr processes drove up the open file count accordingly, and that was most likely why nfile usage had been creeping upward recently until it finally overflowed.
With the cause found, the fix was straightforward: during a scheduled downtime window, db_writer_processes was set back to 2, and, just to be on the safe side, nfile was raised to 400,000.
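For reference, the Oracle side of that change can be scripted as below. This is only a sketch: db_writer_processes is a static parameter, so the new value takes effect only after an instance restart, and the example assumes an spfile is in use (otherwise edit the init.ora directly).

# Run as the oracle user during the maintenance window (sketch).
sqlplus / as sysdba <<EOF
-- Static parameter: written to the spfile, picked up at the next restart.
ALTER SYSTEM SET db_writer_processes = 2 SCOPE = SPFILE;
EOF
# The kernel side (nfile = 400000) is separate: on HP-UX 11.11 it is a static
# tunable, typically changed through SAM or kmtune, and requires a kernel
# rebuild and a reboot.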
Appendix: other uses of the crashinfo command:
Usage: crashinfo [options ...] [coredir | kernel core]

       Default: coredir="." if "INDEX" file present
                else kernel="/stand/vmunix" core="/dev/kmem"

       Options:
         -h | -help [flag,flag...]
              flags: detail
         -u | -update
         -v | -verbose
         -c | -continue
         -H | -Html
         -e | -email <mail_addr>[,flag,flag...]
              flags: file=<file> from=<from> callid=<callid>
         -t | -trace [flag,flag...]
              flags: args regs (PA Only) Rregs (PA Only)
                     locals (IA64 Only) frame (IA64 Only) mems (IA64 Only)
                     bsp (IA64 Only) ss (IA64 Only)
         -s | -syscall
         -f | -full_comm
         -S | -Sleep
         -l | -listonly [flag,flag...]
              flags: pid=<pid>
         -top [count]
         -p | -proctree [flag,flag...]
              flags: tids
         -i | -ioscan
         -ofiles [pid][,flag,flag...]
              flags: tids
         -signals [pid]
         -seminfo
         -unlinked [all]
         -m | -msginfo [flag,flag...]
              flags: skip id=<id> msg=<msg> type=<type>
         -flocks [flag,flag...]
              flags: sleep vnode=<vnode> inode=<inode>
         -sgfr [version]
         -fwlog [cpu]
         -vmtrace [flag,flag...]
              flags: bucket=<bucket> arena=<arena> count=<num>
                     leak cor log parse

For example:

my_db01#[/tmp]ps -ef |grep ckpt
  oracle 12304     1  0  Jan 15  ?         3:29 ora_ckpt_orasid
    root 13509 11867  0 02:18:35 pts/te    0:00 grep ckpt
my_db01#[/tmp]
my_db01#[/tmp]./crashinfo -ofiles 12304
crashinfo (4.74)
Note: HP CONFIDENTIAL
libp4 (9.305): Opening /stand/vmunix /dev/kmem
Loading symbols from /stand/vmunix
Kernel TEXT pages not requested in crashconf
Will use an artificial mapping from a.out TEXT pages
Command line: ./crashinfo -ofiles 12304
crashinfo (4.74)

  pid   p_maxof   ofiles   p_comm
-----   -------   ------   ------
12304      4096      961   oracle
my_db01#[/tmp]
5 Comments
nfile usage has to be monitored; at our site we start the emergency procedure once it reaches 85%.

85% really is the lifeline.

re David.Guo: We do have monitoring, but it never caught the monitoring staff's attention. Sigh, what a failure~~~

What was even worse, though, was that the SA had not configured Service Guard properly: it wasn't monitoring the Oracle processes, so when the database went down it didn't even fail over to node B automatically.

Our nfile here is 1,000,000. Heh.

I seem to remember lsof can show open file counts too.
I ran into this problem recently too, but had no tool to analyze it with. Could you share the crashinfo tool? It can't be downloaded from the website. Sending it to my mailbox is fine too:
ora315@gmail.com
Thanks!
I hit this problem too. Following the MetaLink note I increased the values of
nproc = Max Number of Processes
nfile = Max Number of Open Files
nflocks = Max Number of File Locks
but file-sz just keeps climbing (if only it were a stock price), and in the end the database went down again. I raised the values again, it went down again, and I'm at a loss. Master He, please advise.