今天遇到一个问题,同事在做完change之后,尝试用srvctl start database -d
1 2 3 4 5 6 7 8 |
2011-11-28 17:59:51.413: [ OCRUTL][42]u_freem: mem passed is null 2011-11-28 17:59:53.119: [ CRSRES][155876] startRunnable: setting CLI values 2011-11-28 17:59:53.157: [ CRSRES][155875] Attempting to start `ora.TSPRD.TSPRD1.inst` on member `au11qapa50tels2` 2011-11-28 17:59:53.172: [ CRSRES][155876] Attempting to start `ora.TSPRD.TSPRD2.inst` on member `au11qapa70tels2` 2011-11-28 17:59:54.736: [ OCRUTL][48]u_freem: mem passed is null 2011-11-28 18:00:13.427: [ CRSAPP][155876] StartResource error for ora.TSPRD.TSPRD2.inst error code = 1 2011-11-28 18:00:16.323: [ CRSRES][155876] Start of `ora.TSPRD.TSPRD2.inst` on member `au11qapa70tels2` failed. 2011-11-28 18:04:06.100: [ CRSRES][155875] Start of `ora.TSPRD.TSPRD1.inst` on member `au11qapa50tels2` succeeded. |
检查crs_stat -t 发现node2确实是offline。
在node2上的alertlog中,我们看到:
1 2 3 4 5 |
Mon Nov 28 17:59:56 2011 Starting ORACLE instance (normal) Mon Nov 28 17:59:56 2011 WARNING: EINVAL creating segment of size 0x0000000380008000 fix shm parameters in /etc/system or equivalent |
我们尝试用srvctl指定instance来启动,运行srvctl start instance -d
那么为什么在node2上会报错:WARNING: EINVAL creating segment of size 0x0000000380008000和fix shm parameters in /etc/system or equivalent, 难道是内核参数设置的问题?
从报错信息,我们看到0x0000000380008000,转换成十进制是14G,也就是我们sga的大小,也就是sga的大小设置超过了操作系统中类似像shmmax之类的参数设置,操作系统层面无法给oracle配置超过14G的内存,所以oracle就报错了。
我们尝试用sqlplus登录后,启动node2,却能正常启动了。奇怪,srvctl无法正常启动,报错设置的操作系统内存太小,但是sqlplus却能正常启动?
如果说是sqlplus也不能正常启动,那确实是因为操作系统层面设置的问题了,但是目前的情况是sqlplus能正常启动,但是srvctl却不行,难道是bug?
在搜索metalink中,我找到了一篇文章:Bug 5340239: SRVCTL IS NOT ABLE TO GET THE RIGHT SHM RELATED SETTINGS(该bug全文见最后)。上面说,srcvtl在启动时会通过prctl命令读取/etc/system中的信息,似乎srvctl在读取该信息的时候,无法正确的读取,造成该报错。而sqlplus去读取/etc/system则是没有问题的。临时的解决方法是手工的在/etc/system中加入shmmax参数的设置。
检查了2个节点的/etc/system,发现2边的文件都是一样的,都没有在里面设置shmmax参数。那么,为什么在node1上能起来,node2上无法起来呢?
检查oracle用户的shm-memory,发现privileged是22G,也确实超过了sga需要的14G。
1 2 3 4 5 6 7 8 9 |
au11qapa70tels2:TSPRD2:/ora/admin>id -p uid=150(oracle) gid=200(dba) projid=100(user.oracle) au11qapa70tels2:TSPRD2:/ora/admin>prctl -n project.max-shm-memory -i project 100 project: 100: user.oracle NAME PRIVILEGE VALUE FLAG ACTION RECIPIENT project.max-shm-memory privileged 22.0GB - deny - system 16.0EB max deny - au11qapa70tels2:TSPRD2:/ora/admin> |
那既然oracle设置的没有问题,为什么node1上会启动成功,node2上会失败?接着继续找原因,我们发现另外一个文档:SRVCTL FAILS TO START DATABASE WHILE SQLPLUS WORKS (Doc ID 603051.1 该文档全文见最后),该文档说,project for oracle是没有问题的,问题是出在system的project上,因为srvctl启动数据库的时候,会继承system的project。
因为srvctl调用crsd.bin,而crsd.bin的owner是root:
1 2 |
au11qapa70tels2:TSPRD2:/ora/product/crs/10.2.0/crs_1/bin>ls -l crsd.bin -rwxr----x 1 root dba 13545504 Mar 15 2008 crsd.bin |
而在crsd.bin以root身份调用racgwrap 的时候,虽然会setid,setgid来更改自身的user id和group id,但是无法更改其project id,所以,如果root的project,设置的使用内存小于sga的大小,那么由于srvctl在启动的时候,调用以root身份的crsd.bin,并且继承了root(system)的project设置,那么就会出现上述的报错。
我们来检查一下root的project设置:
node1上:
1 2 3 4 5 6 7 8 |
au11qapa50tels2:TSPRD1:/ora/admin>cat /etc/project system:0::::project.max-sem-ids=(privileged,256,deny);project.max-shm-memory=(privileged,23622320128,deny) user.root:1:::: noproject:2:::: default:3::::project.max-sem-ids=(privileged,256,deny);project.max-shm-memory=(priv,17179869184,deny) group.staff:10:::: user.oracle:100::oracle::process.max-sem-nsems=(priv,512,deny);project.max-sem-ids=(priv,256,deny);project.max-shm-ids=(priv,128,deny);project.max-shm-memory=(priv,23622320128,deny) u11qapa50tels2:TSPRD1:/ora/admin> |
node2上:
1 2 3 4 5 6 7 8 |
au11qapa70tels2:TSPRD2:/ora/admin>cat /etc/project system:0:::: user.root:1:::: noproject:2:::: default:3::::project.max-shm-memory=(priv,17179869184,deny) group.staff:10:::: user.oracle:100::oracle::process.max-sem-nsems=(priv,512,deny);project.max-sem-ids=(priv,256,deny);project.max-shm-ids=(priv,128,deny);project.max-shm-memory=(priv,23622320128,deny) au11qapa70tels2:TSPRD2:/ora/admin> |
我们看到node1的project文件中,比node2多了关于system的设置:
1 |
system:0::::project.max-sem-ids=(privileged,256,deny);project.max-shm-memory=(privileged,23622320128,deny) |
这就是问题所在。
如果用prctl命令来看,会看的更清楚:
node1
1 2 3 4 5 6 |
au11qapa70tels2:TSPRD2:/ora/admin>prctl -n project.max-shm-memory -i project user.oracle project: 100: user.oracle NAME PRIVILEGE VALUE FLAG ACTION RECIPIENT project.max-shm-memory privileged 22.0GB - deny - system 16.0EB max deny - |
node2:
1 2 3 4 5 6 |
au11qapa70tels2:TSPRD2:/ora/admin>prctl -n project.max-shm-memory -i project system project: 0: system NAME PRIVILEGE VALUE FLAG ACTION RECIPIENT project.max-shm-memory privileged 7.74GB - deny - system 16.0EB max deny - |
看,在node2上,system的内存限制在了7.74G(注:在没有指定project.max-shm-memory的情况下,默认大小为系统物理内存的是四分之一,因此,32G物理内存的主机,四分之一即为8G。),小于了sga的14G。
解决方法为以root用户运行一下命令,使得system的内存设置大小为22G,超过sga的14G:
1 2 |
# prctl -n project.max-shm-memory -r -v 22G -i project system # projmod -s -K "project.max-shm-memory=(priv,22G,deny)" system |
另外,需要给/etc/user_attr中root的用户加上project=system:
1 |
root::::type=normal;auths=solaris.*,solaris.grant;profiles=Web Console Management,All;lock_after_retries=no;clearance=admin_high;min_label=admin_low;project=system |
附:Bug 5340239和Doc ID 603051.1
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 |
Bug 5340239: SRVCTL IS NOT ABLE TO GET THE RIGHT SHM RELATED SETTINGS Bug Comments Hdr: 5340239 10.2.0.2 RDBMS 10.2.0.2 RAC PRODID-5 PORTID-23 CRS-215 5217662 Abstract: SRVCTL IS NOT ABLE TO GET THE RIGHT SHM RELATED SETTINGS *** 06/17/06 12:26 am *** TAR: ---- 5521206.992 PROBLEM: -------- The srvctl utility is unable to get the right shm related settings using prctl. It seems that the srvctl is reading the settings from /etc/system. This is causing serious problems when CRS is trying to startup databases with an SGA larger than 1/4 of the total system memory. The same instance is starting up correctly using sqlplus. DIAGNOSTIC ANALYSIS: -------------------- oracle>srvctl start instance -d rac -i rac1 -o open PRKP-1001 : Error starting instance rac1 on node test1 CRS-0215: Could not start resource 'ora.rac.rac1.inst'. test1:oracle> WORKAROUND: ----------- The workaround is to add the shmmax parameter to the old /etc/system file. This workaround is not acceptable because on Sun's Solaris manuals it is written that "Using the obsolete parameters is not recommended" RELATED BUGS: ------------- REPRODUCIBILITY: ---------------- TEST CASE: ---------- STACK TRACE: ------------ SUPPORTING INFORMATION: ----------------------- 24 HOUR CONTACT INFORMATION FOR P1 BUGS: ---------------------------------------- DIAL-IN INFORMATION: -------------------- IMPACT DATE: ------------ *** 06/17/06 12:27 am *** (CHG: Sta->16) *** 06/18/06 11:49 pm *** *** 06/18/06 11:49 pm *** *** 06/19/06 12:47 am *** (CHG: Sta->10) *** 06/19/06 12:47 am *** *** 06/19/06 01:19 am *** *** 06/19/06 01:19 am *** (CHG: Sta->16) *** 06/19/06 01:50 am *** (CHG: Sta->10) *** 06/19/06 01:50 am *** *** 06/19/06 02:01 am *** *** 06/19/06 05:21 am *** *** 06/20/06 01:02 am *** (CHG: Sta->32) *** 06/20/06 01:02 am *** *** 06/20/06 08:01 am *** *** 06/20/06 08:01 am *** (CHG: Sta->16) *** 06/20/06 08:13 am *** (CHG: Sta->10) *** 06/20/06 08:13 am *** *** 06/20/06 11:01 pm *** *** 06/20/06 11:01 pm *** (CHG: Sta->16) *** 06/20/06 11:51 pm *** (CHG: Sta->10) *** 06/20/06 11:51 pm *** *** 06/21/06 05:16 am *** *** 06/21/06 05:17 am *** (CHG: Sta->16) *** 06/21/06 05:27 am *** *** 06/21/06 05:59 am *** *** 06/21/06 06:21 am *** *** 06/21/06 06:21 am *** (CHG: Sta->11 Asg->RDBMSREP) *** 06/21/06 07:32 am *** *** 06/22/06 05:38 am *** *** 07/31/06 05:21 am *** *** 08/03/06 08:47 am *** *** 11/20/06 06:51 am *** *** 01/03/07 07:21 am *** *** 01/08/07 11:34 am *** *** 01/09/07 11:34 am *** (CHG: Sta->30) *** 01/09/07 11:34 am *** *** 01/19/07 05:07 pm *** *** 01/24/07 04:54 pm *** *** 01/25/07 12:35 am *** (CHG: Sta->11) *** 01/25/07 12:35 am *** *** 01/26/07 11:35 am *** *** 01/29/07 12:51 pm *** *** 01/31/07 04:35 pm *** (CHG: Sta->36) *** 01/31/07 04:35 pm *** *** 12/27/07 08:10 am *** (CHG: Sta->96) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 |
SRVCTL FAILS TO START DATABASE WHILE SQLPLUS WORKS (Doc ID 603051.1) In this Document Symptoms Changes Cause Solution References Applies to: Oracle Server - Enterprise Edition - Version: 10.1.0.5 and later [Release: 10.1 and later ] Oracle Solaris on x86 (32-bit) Oracle Solaris on SPARC (64-bit) Oracle Solaris on x86-64 (64-bit) Solaris Operating System (Intel)Solaris Operating System (SPARC) (64-bit)Solaris Operating System (x86-64) Solaris Operating System (x86)Solaris Operating System (SPARC 64-bit) Symptoms Attempting to start an instance/database via 'srvctl' would fail as follows: % srvctl start instance -d test -i test1 PRKP-1001 : Error starting instance test1 on node node1 CRS-0215: Could not start resource 'ora.test.test1.inst'. There are two cases: Case I: alert.log shows: WARNING: EINVAL creating segment of size 0x00000004e200c000 fix shm parameters in /etc/system or equivalent RACG log under $ORACLE_HOME/log/<hostname>/racg/imon_<DBNAME>.log shows: SQL> ORA-27102: out of memory SVR4 Error: 22: Invalid argument SQL> Disconnect' Case II: alert log shows: Fri Jun 4 22:32:44 2010 Starting ORACLE instance (normal) RACG log under $ORACLE_HOME/log/<hostname>/racg/imon_<DBNAME>.log shows: 2010-06-04 22:32:44.501: [ RACG][6] [18218][6][ora.test.test1.inst]: SQL*Plus: Release 10.2.0.4.0 - Production on Fri Jun 4 22:32:44 2010 Copyright (c) 1982, 2007, Oracle. All Rights Reserved. Enter user-name: Connected to an idle instance. SQL> ORA-27102: out of memory SVR4 Error: 28: No space left on device However, Starting the instance via sqlplus as oracle user works fine! Changes % cat /etc/procject shows -> system:0:::: user.root:1:::: noproject:2:::: default:3::::process.max-file-descriptor=(priv,4096,deny) group.staff:10:::: group.dba:100::::process.max-file-descriptor=(priv,4096,deny);process.max-sem-nsems=(priv,12000,deny );process.max-sem-ops=(priv,4096,deny);project.max-sem-ids=(priv,4096,deny);project.max-shm-ids=(priv,4096,deny);project.max-shm-memory=(priv,30064771072,deny) Project for the oracle user is correct.. You can verify an instance is using it if you start with sqlplus & then run -> % ps -u oracle -o pid,user,project,args Cause Problem is not the oracle project. This occurs because when we start an instance using srvctl it will inherit the system project, since the crsd.bin is running as root, our racg infrastructure is starting the resource as oracle however racgwrap is initially started as root. It calls setuid/setgid to change the user & group IDs but unfortunately when we do this we don't call the project ID. One can confirm the project setting used by crsd.bin process via: as root user: ps -ef|grep crsd.bin prctl <pid of crsd.bin> For case I, the project setting for max-shm-memory is too small. For case II, the project setting for max-shm-ids is too small. Solution Easiest fix for this problem is to define a project for system. 1. As root user, run following command: # projmod -s -K "project.max-shm-memory=(privileged,2.00TB,deny)" system # projmod -s -K "project.max-sem-ids=(privileged,2048,deny)" system # projmod -s -K "project.max-shm-ids=(privileged,512,deny)" system Above command will make changes in /etc/project. After above command, run: projects -l to confirm system project has above parameters set. The first two parameters will fix case I error while the 3rd parameter will help to fix case II error. 2. Assign root user the system project in order for the CRSD daemon (run under root's user id) to pick up the changes: # vi /etc/user_attr Append "project=system" to the line with "root::::", for example root::::auths=solaris.*,solaris.grant;profiles=Web Console Management,All;lock_after_retries=no;project=system Please do not add a new line if there is an existing line with "root::::...". The new line will not take effect. 3. Assign crsd.bin to use system project as root user: # ps -ef | grep crsd.bin # newtask -p system -c <pid of crsd.bin> To verify crsd.bin process is now using system project: # prctl -n project.max-shm-memory <pid of crsd.bin> # prctl -n project.max-sem-ids <pid of crsd.bin> # prctl -n project.max-shm-ids <pid of crsd.bin> It should show the new setting, for example: # prctl -n project.max-shm-memory 29316 process: 29316: /u01/oracle/product/10.2.0/crs/bin/crsd.bin reboot NAME PRIVILEGE VALUE FLAG ACTION RECIPIENT project.max-shm-memory privileged 2.00TB - deny - system 16.0EB max deny - 4. Now starting database via srvctl should work 5. For above change to take effect across node reboot, please create an rc script /etc/rc3.d/S91proj with content: #!/bin/sh # # see http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=5000940 # why this is needed # newtask -p system /usr/bin/true For more information, please refer to NOTE:435464.1 References BUG:5217662 - SRVCTL START DATABASE FAILS WITH ORA-27102: OUT OF MEMORY BUG:5629487 - WRONG RESOURCE CONTROL PROJECT SELECTED WHEN STARTING DATABASE USING SRVCTL NOTE:390547.1 - ORA-27102 Cannot Startup Instance via sqlplus NOTE:429191.1 - Kernel setup for Solaris 10 using project files. NOTE:435464.1 - ENABLING SOLARIS PROJECT SETTINGS FOR CRS |
5条评论
这个问题我印象深刻,因为我在11gR2也有一个类似的bug,在Soalris上。
project的设置有问题会srvctl启动db有问题。不过现在这个问题已经修复了。
能具体的解释下这个project在solaris系统里具体的功能和作用是什么吗?
楼主很爱捉摸,总能把问题搞得很清楚,很是佩服。
我之前遇到一个问题到现在还没有头绪。
9.2的库
1、使用sqlplus “user/pass”执行语句没问题。
2、使用sqlplus “user/pass@tns”执行语句就报错。说是LARGE POOL设置太小。
re wangliang :project是solaris对系统资源的一种管理方式,solaris 10开始配置的文件有所不同,这里有个文章可以参考。http://www.princeton.edu/~unix/Solaris/troubleshoot/resmgmt.html
解决了我的一个疑惑,我以前遇到这个问题没搞明白,就修改了root的project解决了,没搞明白怎么解决的这个问题。