在客户处遇到个问题,客户昨晚清理数据,删除一些表的分区,第一个表跑了20多分钟,但是差不多数据量的另外一个表,跑了好几个小时才完成。为何类似的数据在drop 分区的时候,有那么大的差异?
检查了改删除分区的语句,发现是update了global索引,并且,还用到了parallel 8的并行。
1 |
alter table MYOWNER_1.MYTABLE_FOR_CK DROP partition MYTABLE_FOR_CK201306 update global indexes parallel 8 |
虽然时间已经过去了一个晚上,但是根据当晚操作人员提供的信息,结合在测试库上的测试,我们还是能从已知的信息中获取一些可能导致当时运行缓慢的一些线索:
(1) 从测试库的测试看到,正常执行是有并发的PX类型的等待的。
(2) 从DBA反馈得知,当晚没有看到并发子进程,只看到单个进程。
(3) 测试有并发和没并发的drop分区,在10046的trace中行为有很大不同,有并发的主要是PX、direct path read、db file scattered read等待,没并发主要是db file sequential read等待。
(4) 设置parallel max servers为4,先运行4个并发的脚本占据并发度,再运行并发drop分区的脚本,从10046的trace看,后面执行的并发drop分区没有并发子进程,且大部分是db file sequential read等待。
执行上述的drop partition时,我们能看到相关并发进程的等待事件:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 |
--从16:47开始运行,这是并发执行的sql,我们看到,在测试时,MYTEST用户下是有多个并发进程的。 --主进程是PX Deq Credit: send blkd,表示在等待处理来自各个子进程的消息。 --这个过程一直到17:59分左右 TO_CHAR(SYSDATE,'YYYY-MM-DDHH2 EVENT SADDR SID ------------------------------ ---------------------------------------------------------------- ---------------- ---------- 2014-01-15 17:06:52 local write wait 000000046F17D928 201 2014-01-15 17:06:52 PX Deq Credit: send blkd 000000046D17AEC8 239 2014-01-15 17:06:52 PX Deq: Execution Msg 000000046F171078 87 2014-01-15 17:06:52 PX Deq: Execution Msg 000000046C160510 96 2014-01-15 17:06:52 PX Deq: Execution Msg 000000046F173028 105 2014-01-15 17:06:52 PX Deq: Execution Msg 000000046B211EE8 205 2014-01-15 17:06:52 PX Deq: Execution Msg 000000046F17AEE8 177 2014-01-15 17:06:52 PX Deq: Execution Msg 000000046C179670 324 2014-01-15 17:06:52 PX Deq: Execution Msg 000000046A1780A0 326 2014-01-15 17:06:52 PX Deq: Execution Msg 000000046F181888 237 2014-01-15 17:06:52 PX Deq: Table Q Normal 000000046C16A380 186 2014-01-15 17:06:52 PX Deq: Table Q Normal 000000046A16B7F0 212 2014-01-15 17:06:52 PX Deq: Table Q Normal 000000046C17AB90 336 2014-01-15 17:06:52 PX Deq: Table Q Normal 000000046A17A050 344 2014-01-15 17:06:52 PX Deq: Table Q Normal 000000046C178150 312 2014-01-15 17:06:52 PX Deq: Table Q Normal 000000046E1720C0 214 2014-01-15 17:06:52 PX Deq: Table Q Normal 000000046D176F68 203 17 rows selected SQL> …… --从17:59分开始,也一直是在并发处理,主进程开始出现db file scattered read。且和PX Deq Credit: send blkd交替出现, --重复检查该脚本的等待事件,一直到18:49分,发现该脚本交替处于db file scattered read和PX Deq Credit: send blkd。 SQL> / TO_CHAR(SYSDATE,'YYYY-MM-DDHH2 EVENT SADDR SID ------------------------------ ---------------------------------------------------------------- ---------------- ---------- -- 2014-01-15 17:59:20 db file scattered read 000000046D17AEC8 239 2014-01-15 17:59:20 PX Deq: Execution Msg 000000046F171078 87 2014-01-15 17:59:20 PX Deq: Execution Msg 000000046C160510 96 2014-01-15 17:59:20 PX Deq: Execution Msg 000000046F173028 105 2014-01-15 17:59:20 PX Deq: Execution Msg 000000046F181888 237 2014-01-15 17:59:20 PX Deq: Execution Msg 000000046F17AEE8 177 2014-01-15 17:59:20 PX Deq: Execution Msg 000000046B211EE8 205 2014-01-15 17:59:20 PX Deq: Execution Msg 000000046C179670 324 2014-01-15 17:59:20 PX Deq: Execution Msg 000000046A1780A0 326 2014-01-15 17:59:20 PX Deq: Table Q Normal 000000046C16A380 186 2014-01-15 17:59:20 PX Deq: Table Q Normal 000000046F17D928 201 2014-01-15 17:59:20 PX Deq: Table Q Normal 000000046A16B7F0 212 2014-01-15 17:59:20 PX Deq: Table Q Normal 000000046C17AB90 336 2014-01-15 17:59:20 PX Deq: Table Q Normal 000000046A17A050 344 2014-01-15 17:59:20 PX Deq: Table Q Normal 000000046C178150 312 2014-01-15 17:59:20 PX Deq: Table Q Normal 000000046E1720C0 214 2014-01-15 17:59:20 PX Deq: Table Q Normal 000000046D176F68 203 17 rows selected …… |
从上面,我们可以看出,脚本在执行的时候,应该是会有并发进程出现,如果按照username=’MYTEST’查询,应该可以看得到PX的session等待。但是和当晚操作人员的沟通中得知,当天晚上,他观察到的现象是没有看到并发的进程,只看到一个进程长时间处于db file sequential read的等待,该进程是在执行4094517213的语句。
从当晚的V$SESSION_WAIT snapshot记录进一步检查, 检查db file sequential read所对应的file id和block id,都是在操作PK_MY_TEST_TABLE_OWN_BY_ORA:
1 2 3 |
OWNER SEGMENT_TYPE SEGMENT_NAME ------------------------------ ------------------ -------------------------------------------------------------------------------- MYOWNER_1 INDEX PK_MY_TEST_TABLE_OWN_BY_ORA |
另外,我在我的测试库中测试了用并发和不用并发,drop 500万数据的partition:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 |
--有并发, alter table t drop partition t6 update global indexes parallel 8: ******************************************************************************** insert /*+ RELATIONAL(T) APPEND PARALLEL(T,8) */ into "TEST"."T" partition ("T6") select /*+ RELATIONAL(T) */ * from "TEST"."T" partition ("T6") delete global indexes call count cpu elapsed disk query current rows ------- ------ -------- ---------- ---------- ---------- ---------- ---------- Parse 1 0.00 0.02 0 0 0 0 Execute 2 4.29 1592.97 37244 28910 1648 0 Fetch 0 0.00 0.00 0 0 0 0 ------- ------ -------- ---------- ---------- ---------- ---------- ---------- total 3 4.29 1592.99 37244 28910 1648 0 Misses in library cache during parse: 1 Optimizer goal: CHOOSE Parsing user id: 23 (recursive depth: 1) Rows Row Source Operation ------- --------------------------------------------------- 0 INDEX MAINTENANCE (object id 0) 0 LOAD AS SELECT 0 TABLE ACCESS FULL T PARTITION: 6 6 Elapsed times include waiting on following events: Event waited on Times Max. Wait Total Waited ---------------------------------------- Waited ---------- ------------ db file sequential read 2 0.01 0.01 <<<<<<<4. 处于sequential read的很少 db file parallel read 2 0.37 0.48 process startup 9 0.00 0.01 PX Deq: Join ACK 8 0.00 0.00 PX Deq: Execute Reply 1958 1.96 1509.03 <<<<<<2. 其次是在PX上 PX Deq Credit: send blkd 602 1.95 11.10 PX Deq: Parse Reply 1 0.00 0.00 PX Deq Credit: need buffer 1065 0.47 2.22 db file scattered read 919 1.39 29.27 <<<<<<<<3. 另外也有部分是在db file scattered read direct path write 8 0.00 0.00 direct path read 2666 0.00 0.00 <<<<<<< 1. 主要的等待是在direct path PX Deq: Signal ACK 1 0.00 0.00 ******************************************************************************** |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 |
--无并发,alter table t drop partition t6 update global indexes: ******************************************************************************** insert /*+ RELATIONAL(T) APPEND PARALLEL(T,1) */ into "TEST"."T" partition ("T6") select /*+ RELATIONAL(T) */ * from "TEST"."T" partition ("T6") delete global indexes call count cpu elapsed disk query current rows ------- ------ -------- ---------- ---------- ---------- ---------- ---------- Parse 2 0.00 0.00 0 0 0 0 Execute 2 9.01 144.04 54992 49158 130178 0 Fetch 0 0.00 0.00 0 0 0 0 ------- ------ -------- ---------- ---------- ---------- ---------- ---------- total 4 9.01 144.04 54992 49158 130178 0 Misses in library cache during parse: 1 Optimizer goal: CHOOSE Parsing user id: 23 (recursive depth: 1) Rows Row Source Operation ------- --------------------------------------------------- 0 LOAD AS SELECT 1276637 TABLE ACCESS FULL OBJ#(7017) PARTITION: 6 6 Elapsed times include waiting on following events: Event waited on Times Max. Wait Total Waited ---------------------------------------- Waited ---------- ------------ control file sequential read 1600 0.00 0.00 db file scattered read 902 0.00 0.05 <<<<<<<<<<2. 除了更新控制文件的读,第二位就是db file scattered read。 db file sequential read 21778 0.24 0.41 <<<<<<1. 主要的等待是在db file sequential read上,且没有看到PX的等待。 db file parallel read 3 0.00 0.00 local write wait 469 0.42 1.43 db file single write 132 0.79 9.75 control file parallel write 264 1.10 4.66 log file switch completion 86 0.97 32.70 direct path write 4 0.00 0.00 direct path read 178 0.00 0.00 <<<<<<<<2. Direct path read的等待也远远小于db file sequential read。 log buffer space 79 0.97 41.94 async disk IO 101 0.00 0.00 rdbms ipc reply 10 0.02 0.06 enqueue 8 1.66 1.72 ******************************************************************************** |
再结合我们V$SESSION_WAIT snapshot记录,我们看到,语句发起时,开始的行为吻合并发时的行为,而后面从1:58分之后,行为就变成了像拿不到并发,变成了串行的执行,于是出现大量的db file sequential read的等待。
……
因此从DBA反馈的当晚的情况看,和测试的现象看,应该是当晚执行的时候,没有拿到并发,导致语句虽然带有并发参数,但是实际执行的时候没有并发。由于串行,所以执行的时间很长。
至于为什么没有拿到并行,我们目前对于9i库没有更多的信息来证明这个问题,没有session hist的记录或者别的记录。但是根据以往的经验,当数据库中有多个程序都是并发执行时,抢光了parallel_max_servers,后续有并发的进程就不能用并发进程。且哪怕占用并发度的进程结束后,没有拿到并发度的进程还在执行,此时依旧还是会按照串行的方式继续进行下去,一直到sql执行结束。
因此,我测试了设置paralle max server为4,先用并发度为4的drop partition语句占用这4个并发度,再运行并发度为8的语句drop 另外一个表的partition,后面运行的这个sql在拿不到并发度之后,会和我们当晚的情况非常一致,会长期处于db file sequential read的等待:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 |
--先运行一个并发度为4的drop partition,占据4个并发度: ******************************************************************************** insert /*+ RELATIONAL(T2) APPEND PARALLEL(T2,4) */ into "TEST"."T2" partition ("T1") select /*+ RELATIONAL(T2) */ * from "TEST"."T2" partition ("T1") delete global indexes call count cpu elapsed disk query current rows ------- ------ -------- ---------- ---------- ---------- ---------- ---------- Parse 1 0.00 0.00 0 0 0 0 Execute 2 0.77 169.47 9363 17660 8 0 Fetch 0 0.00 0.00 0 0 0 0 ------- ------ -------- ---------- ---------- ---------- ---------- ---------- total 3 0.77 169.48 9363 17660 8 0 Misses in library cache during parse: 1 Optimizer goal: CHOOSE Parsing user id: 23 (recursive depth: 1) Rows Row Source Operation ------- --------------------------------------------------- 0 INDEX MAINTENANCE (object id 0) 0 LOAD AS SELECT 0 TABLE ACCESS FULL T2 PARTITION: 1 1 Elapsed times include waiting on following events: Event waited on Times Max. Wait Total Waited ---------------------------------------- Waited ---------- ------------ db file sequential read 17 0.00 0.00 <<<<<< 2. db file sequential很少 db file parallel read 1 0.00 0.00 PX Deq: Join ACK 4 0.00 0.00 PX Deq: Execute Reply 1651 1.98 150.10 <<<<<1. 主要是并发等待 PX Deq: Parse Reply 3 0.00 0.00 PX Deq Credit: send blkd 3734 1.95 8.44 <<<<<1. 主要是并发等待 PX Deq Credit: need buffer 1182 1.49 4.30 <<<<<1. 主要是并发等待 db file scattered read 597 0.00 0.02 free buffer waits 3 0.97 1.84 direct path write 4 0.00 0.00 direct path read 23 0.00 0.00 PX Deq: Signal ACK 2 0.00 0.00 ******************************************************************************** |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 |
--另外session运行并发度为8的drop分区,此时语句虽然带8的并发,但是实际没有拿到并发,和我们当晚的情况非常吻合。 ******************************************************************************** insert /*+ RELATIONAL(T) APPEND PARALLEL(T,8) */ into "TEST"."T" partition ("T5") select /*+ RELATIONAL(T) */ * from "TEST"."T" partition ("T5") delete global indexes call count cpu elapsed disk query current rows ------- ------ -------- ---------- ---------- ---------- ---------- ---------- Parse 1 0.00 0.00 0 8 0 0 Execute 1 18.97 412.85 156193 85127 410116 0 Fetch 0 0.00 0.00 0 0 0 0 ------- ------ -------- ---------- ---------- ---------- ---------- ---------- total 2 18.97 412.85 156193 85135 410116 0 Misses in library cache during parse: 1 Optimizer goal: CHOOSE Parsing user id: 23 (recursive depth: 1) Rows Row Source Operation ------- --------------------------------------------------- 1 INDEX MAINTENANCE (object id 0) 1 LOAD AS SELECT 4500000 TABLE ACCESS FULL T PARTITION: 5 5 Elapsed times include waiting on following events: Event waited on Times Max. Wait Total Waited ---------------------------------------- Waited ---------- ------------ db file sequential read 57681 0.00 0.38 <<<<<<<1. 虽然是并发度为8,但是没有拿到并发度,大量的sequential read db file parallel read 2 0.00 0.00 control file sequential read 8 0.00 0.00 db file scattered read 893 0.01 0.08 free buffer waits 124 0.98 58.25 log file switch completion 189 0.97 72.27 log buffer space 188 0.98 99.21 direct path write 8 0.00 0.00 direct path read 5450 0.00 0.01 buffer busy waits 1 0.00 0.00 undo segment extension 19027 0.16 0.32 ******************************************************************************** |
建议:
1. 目前该数据库的并发度为48(parallel_max_servers = 48),可适当加大。
2. 拿不到并发度,可能是在凌晨12点左右有大量job或者其他占用并发的作业启动,建议错峰执行,如提早2小时执行。
3. 执行前判定剩余并发度,如果不能获得足够的并发度,就暂缓执行。