One day, at the height of a business peak, a database in one province suddenly went down. These are the errors found in the alert log:
Mon Jan 12 09:49:42 2009
ARC0: Evaluating archive log 5 thread 1 sequence 142032
ARC0: Beginning to archive log 5 thread 1 sequence 142032
Creating archive destination LOG_ARCHIVE_DEST_1: '/oraarchlog/arch_1_142032.arc'
Mon Jan 12 09:49:56 2009
ARC0: Completed archiving log 5 thread 1 sequence 142032
Mon Jan 12 09:57:57 2009
Errors in file /oracle/app/oracle/admin/orasid/bdump/orasid_lgwr_8267.trc:
ORA-00313: open failed for members of log group 1 of thread 1
ORA-00312: online log 1 thread 1: '/dev/vg_ora3/rredo_256m_21'
ORA-27041: unable to open file
HP-UX Error: 23: File table overflow
Additional information: 2
ORA-00312: online log 1 thread 1: '/dev/vg_ora2/rredo_256m_11'
ORA-27041: unable to open file
HP-UX Error: 23: File table overflow
Additional information: 2
ORA-00312: online log 1 thread 1: '/dev/vg_ora1/rredo_256m_01'
ORA-27041: unable to open file
HP-UX Error: 23: File table overflow
Additional information: 2
Mon Jan 12 09:57:57 2009
Errors in file /oracle/app/oracle/admin/orasid/bdump/orasid_lgwr_8267.trc:
ORA-00313: open failed for members of log group 1 of thread 1
ORA-00312: online log 1 thread 1: '/dev/vg_ora3/rredo_256m_21'
ORA-27041: unable to open file
HP-UX Error: 23: File table overflow
Additional information: 2
ORA-00312: online log 1 thread 1: '/dev/vg_ora2/rredo_256m_11'
ORA-27041: unable to open file
HP-UX Error: 23: File table overflow
Additional information: 2
So it appeared that too many open files had brought the database down.

Monitoring of the open file count showed that it had already reached 98% the day before, and by shortly after 8 a.m. it had hit 99%, yet none of this caught the monitoring staff's attention, until the open file table finally overflowed.

The system-wide open file count can be seen with sar -v:
HP-UX my_db01 B.11.11 U 9000/800    01/17/09

01:08:52 text-sz  ov  proc-sz    ov  inod-sz     ov  file-sz        ov
01:08:54  N/A    N/A  1139/4096   0  2678/34816   0  113494/200010   0
01:08:56  N/A    N/A  1139/4096   0  2678/34816   0  113494/200010   0
01:08:58  N/A    N/A  1135/4096   0  2675/34816   0  113487/200010   0
01:09:00  N/A    N/A  1135/4096   0  2675/34816   0  113487/200010   0
01:09:02  N/A    N/A  1135/4096   0  2675/34816   0  113487/200010   0
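Since the warning signs were visible a full day in advance, a small scripted check is worth wiring into cron. Below is a minimal sketch, not part of the original incident handling: it assumes the sar -v output format shown above (file-sz printed as "used/limit" in the 8th whitespace-separated field) and an arbitrary 85% alert threshold.

#!/usr/bin/sh
# Sketch: warn when the kernel file table (file-sz vs. nfile) crosses 85%.
# Assumptions: HP-UX "sar -v 1 1" prints file-sz as "used/limit" in field 8,
# as in the sample output above; adjust the field index for other releases.
sar -v 1 1 | awk '
    index($8, "/") {
        split($8, a, "/")
        pct = a[1] * 100 / a[2]
        if (pct >= 85)
            printf "WARNING: kernel file table at %.0f%% (%s of nfile)\n", pct, $8
    }'

Piped into mailx from cron, a check like this would have flagged the 98% reading the day before.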
The file-sz column is the open file count: the value after the slash (the 200,000-odd number) is the system-wide limit on open files, and the value before it is the number of files currently open. The limit is set by the kernel parameter nfile. When sizing nfile, the recommended value is (15 * NPROC + 2048); the warning-level value is the number of Oracle processes multiplied by the number of data files, plus 2048; and the extreme value, on a dedicated database server, would be the OS's maximum number of processes multiplied by the number of data files, i.e. every process opening every data file.
Nfile    Max number of simultaneously open files system-wide at any given time.
         Total number of slots in the file descriptor table,
         default = 16*(nproc+16+maxusers)/10 + 32 + 2*(npty+nstrpty)
         recommended: use the default.
         >= Oracle 9i: (15 * NPROC + 2048)
         For Oracle installations with a high number of data files this might
         not be enough; then use
         >= ((number of Oracle processes) * (number of Oracle data files) + 2048)

Nproc    Max number of processes that can exist simultaneously in the system,
         default = (20 + 8*MAXUSERS), influences ninode, nfile.
         recommended: >= Oracle 9i: 4096
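Before plugging numbers into these formulas, it is worth confirming the values currently in effect; the limit of 200010 seen in the sar -v output above should match nfile. A quick check with kmtune (the same command used for nproc in the session below):

my_db01#[/tmp]kmtune | grep -E 'nfile|nproc'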
Doing the math: with the recommended formula nfile comes out at 63,488; with the warning-level formula, about 1.41 million; and with the extreme formula, about 3.86 million:
SQL> !
oracle@my_db01:/oracle > kmtune |grep nproc
nproc                   4096            -       4096
oracle@my_db01:/oracle > exit

SQL> select 15*4096+2048 from dual;

15*4096+2048
------------
       63488

SQL> show parameter process

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
processes                            integer     1500
SQL>
SQL>
SQL> select count(*) from v$datafile;

  COUNT(*)
----------
       944

SQL> select 1500*944+2048 from dual;

1500*944+2048
-------------
      1418048

SQL> select 4096*944 from dual;

  4096*944
----------
   3866624

SQL>
A value of roughly 60,000 is clearly far too small for this system, while 1.41 million or 3.86 million seems far too large.

So which processes actually need that many open files? sar -v only shows a system-wide total; is there a way to see how many files each individual process has open?

For this, HP provides a tool called crashinfo, which can report the open file count of every process.

Usage:
As root:
my_db01#[/tmp]chmod +x crashinfo
my_db01#[/tmp]./crashinfo -ofiles
This lists the open file count for each process:
my_db01#[/tmp] ./crashinfo -ofiles
crashinfo (4.74)
Note: HP CONFIDENTIAL
libp4 (9.305): Opening /stand/vmunix /dev/kmem
Loading symbols from /stand/vmunix
Kernel TEXT pages not requested in crashconf
Will use an artificial mapping from a.out TEXT pages
Command line: ./crashinfo -ofiles
crashinfo (4.74)

  pid   p_maxof   ofiles   p_comm
-----   -------   ------   ------
    1      4096        2   init
 2011      4096        4   getty
   94      4096        4   evmmirr
   62      4096        4   emcpdaemon
   63      4096        4   emcpProcd
   64      4096        4   emcpd
   65      4096        4   emcpdfd
   66      4096        4   emcpwdd
   67      4096        4   emcpstratd
   68      4096        4   MpAsyncIoDaemo
   69      4096        4   MpTestDaemon
   70      4096        4   MpPirpNotDoneD
   71      4096        4   MpPeriodicDaem
   72      4096        4   MpDeferredDaem
   73      4096        4   MpGeneralReque
   74      4096        4   MpcAsyncIoDaem
   75      4096        4   MpcResumeIoDae
   76      4096        4   MpcPeriodicCal
12986      4096      671   oracle
13005      4096      671   oracle
13291      4096       10   oracle
12991      4096      671   oracle
……
22466      4096       51   oracle
12300      4096      847   oracle
12898      4096        9   oracle
12302      4096      979   oracle
13003      4096      671   oracle
13009      4096      670   oracle
13012      4096      459   oracle
13014      4096       49   oracle

Total = 116775

my_db01#[/tmp]ps -ef |grep 12298
  oracle 12298     1  0  Jan 15  ?         9:31 ora_dbw0_orasid
    root 12469 11867  1 02:06:39 pts/te    0:00 grep 12298
my_db01#[/tmp]ps -ef |grep 12300
  oracle 12300     1  0  Jan 15  ?         9:32 ora_dbw1_orasid
    root 13363 11867  0 02:16:09 pts/te    0:00 grep 12300
my_db01#[/tmp]
Running ps -ef | grep on the PIDs with the largest counts showed that the processes with the most open files were dbwr, smon, lgwr and ckpt, each holding roughly 800 to 900 open files; and since there were 8 dbwr processes, their combined total was even higher. The other heavy openers were LOCAL=NO (remote client) processes; mapping their spid to sql_text showed they were working on the history tables. The history tables are large, their tablespace contains more than 300 data files, and so each of these processes also had around 600 files open.
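If crashinfo is not at hand, a similar per-process breakdown can usually be obtained with lsof (as one of the comments below also points out), provided it is installed on the host. This is only a sketch, not part of the original diagnosis; note that lsof also counts memory-mapped files and the like, so its numbers will not match the kernel file table exactly.

# Open file count for a single PID (12300 is ora_dbw1_orasid above):
lsof -p 12300 | wc -l

# Rough system-wide ranking of PIDs by number of open files:
lsof | awk 'NR > 1 { n[$2]++ } END { for (p in n) print n[p], p }' | sort -rn | head -10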
At that point I remembered that during a tuning exercise in this province a while back, the DBA had raised the dbwr process count from 2 to 8. Quadrupling the number of dbwr processes drove up the open file count accordingly, and that was most likely why nfile usage had been creeping upward recently until it finally overflowed.
With the cause found, the fix was straightforward: during a scheduled downtime window, db_writer_processes was set back to 2, and, just to be on the safe side, nfile was raised to 400,000.
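For reference, the Oracle side of that change can be scripted as below. This is only a sketch: db_writer_processes is a static parameter, so the new value takes effect only after an instance restart, and the example assumes an spfile is in use (otherwise edit the init.ora directly).

# Run as the oracle user during the maintenance window (sketch).
sqlplus / as sysdba <<EOF
-- Static parameter: written to the spfile, picked up at the next restart.
ALTER SYSTEM SET db_writer_processes = 2 SCOPE = SPFILE;
EOF
# The kernel side (nfile = 400000) is separate: on HP-UX 11.11 it is a static
# tunable, typically changed through SAM or kmtune, and requires a kernel
# rebuild and a reboot.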
Appendix: other uses of the crashinfo command:
Usage: crashinfo [options ...] [coredir | kernel core]

       Default: coredir="." if "INDEX" file present
                else kernel="/stand/vmunix" core="/dev/kmem"

       Options:
         -h | -help [flag,flag...]
              flags: detail
         -u | -update
         -v | -verbose
         -c | -continue
         -H | -Html
         -e | -email <mail_addr>[,flag,flag...]
              flags: file=<file> from=<from> callid=<callid>
         -t | -trace [flag,flag...]
              flags: args regs (PA Only) Rregs (PA Only)
                     locals (IA64 Only) frame (IA64 Only) mems (IA64 Only)
                     bsp (IA64 Only) ss (IA64 Only)
         -s | -syscall
         -f | -full_comm
         -S | -Sleep
         -l | -listonly [flag,flag...]
              flags: pid=<pid>
         -top [count]
         -p | -proctree [flag,flag...]
              flags: tids
         -i | -ioscan
         -ofiles [pid][,flag,flag...]
              flags: tids
         -signals [pid]
         -seminfo
         -unlinked [all]
         -m | -msginfo [flag,flag...]
              flags: skip id=<id> msg=<msg> type=<type>
         -flocks [flag,flag...]
              flags: sleep vnode=<vnode> inode=<inode>
         -sgfr [version]
         -fwlog [cpu]
         -vmtrace [flag,flag...]
              flags: bucket=<bucket> arena=<arena> count=<num>
                     leak cor log parse

For example:

my_db01#[/tmp]ps -ef |grep ckpt
  oracle 12304     1  0  Jan 15  ?         3:29 ora_ckpt_orasid
    root 13509 11867  0 02:18:35 pts/te    0:00 grep ckpt
my_db01#[/tmp]
my_db01#[/tmp]./crashinfo -ofiles 12304
crashinfo (4.74)
Note: HP CONFIDENTIAL
libp4 (9.305): Opening /stand/vmunix /dev/kmem
Loading symbols from /stand/vmunix
Kernel TEXT pages not requested in crashconf
Will use an artificial mapping from a.out TEXT pages
Command line: ./crashinfo -ofiles 12304
crashinfo (4.74)

  pid   p_maxof   ofiles   p_comm
-----   -------   ------   ------
12304      4096      961   oracle
my_db01#[/tmp]
5 Comments
nfile usage has to be monitored; at our site we start the emergency procedure once it reaches 85%.

85% really is the lifeline.

re David.Guo: We do have monitoring, but it never caught the monitoring staff's attention. Sigh, what a failure~~~

What was even worse, though, was that the SA had not configured Service Guard properly: it wasn't monitoring the Oracle processes, so when the database went down it didn't even fail over to node B automatically.

Our nfile here is 1,000,000. Heh.

I seem to remember lsof can show open file counts too.
I ran into this problem recently too, but had no tool to analyze it with. Could you share the crashinfo tool? It can't be downloaded from the website. Sending it to my mailbox is fine too:
ora315@gmail.com
Thanks!
I hit this problem too. Following the MetaLink note I increased the values of
nproc = Max Number of Processes
nfile = Max Number of Open Files
nflocks = Max Number of File Locks
but file-sz just keeps climbing (if only it were a stock price), and in the end the database went down again. I raised the values again, it went down again, and I'm at a loss. Master He, please advise.