This post is about another example of a performance problem that was passed to us, where the real issue lay outside the database. There was a very similar issue discussed in this post, so I’ll just cover the differences in this case.
The users of a helpdesk application with an Oracle database backend were complaining about performance every Monday. The first thing I did was establish the period during which performance was bad and compare it to a time when performance was normal. I used Statspack reports for this and found that I/O-related waits were much higher, especially the average time spent waiting for redo log writes. The overall workload was similar; the significant difference was that the I/O was taking much longer on Mondays.
I verified this theory using sar and vxstat to compare response times on Mondays vs Tuesdays-Fridays.
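On Solaris, sar keeps its binary history under /var/adm/sa/saDD (DD being the day of the month), so the same time window can be replayed for different days. A sketch of that kind of comparison — the file names and times here are assumptions for illustration, not the actual data:

```shell
# Replay the same business-hours window for two different days.
# sa18 is assumed to be the slow Monday and sa19 the following Tuesday.
sar -d -f /var/adm/sa/sa18 -s 09:00 -e 17:00   # Monday: per-device avwait/avserv
sar -d -f /var/adm/sa/sa19 -s 09:00 -e 17:00   # Tuesday, same window, for contrast
```

The per-device avserv column makes the difference in service times obvious once you have the two days side by side.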
This is an example of vxstat output showing the problem:
vxstat -i 60 -c 100 -d vol01

                   OPERATIONS          BLOCKS        AVG TIME(ms)
TYP NAME        READ    WRITE      READ    WRITE     READ    WRITE
Mon Jan 18 15:32:15 2010
dm  cx00          31      219       496     3156     16.1      5.5
dm  cx01          27      247       432     4240      5.9      4.2
dm  cx02           1       20        16      320     50.0     26.0
dm  cx03           4       80        64     2800     12.5   8270.9
Mon Jan 18 15:33:15 2010
dm  cx00          29      171       560     1827      6.6      3.7
dm  cx01          29      232       464     5440      6.2   1818.0
dm  cx02          11       15       176      240     13.6     10.7
dm  cx03           2      180        32     4400     10.0    337.3
Mon Jan 18 15:34:15 2010
dm  cx00          38      131       608     1632     10.3   2299.0
dm  cx01          35      107       704     2304      7.1      8.8
dm  cx02          14        7       224      112     27.9      2.9
dm  cx03           4       94        64     2096     10.0      9.1
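Scanning hours of vxstat output by eye for spikes like those gets tedious. A small awk filter can flag the bad intervals instead; this is a hypothetical helper, and the column positions are assumed to match the layout above, so check them against your vxstat version:

```shell
# Flag any device whose average write time in a vxstat interval exceeds
# a threshold (100 ms here). Assumed column layout per data line:
# TYP NAME R-OPS W-OPS R-BLKS W-BLKS R-MS W-MS.
awk -v limit=100 '
NF == 5 { ts = $0; next }          # 5-field lines are interval timestamps
$1 == "dm" && $8 + 0 > limit {     # field 8 = average write time (ms)
    printf "%s: %s write avg %.1f ms\n", ts, $2, $8
}' <<'EOF'
Mon Jan 18 15:32:15 2010
dm  cx00   31  219   496  3156  16.1     5.5
dm  cx03    4   80    64  2800  12.5  8270.9
EOF
```

Run against the full capture, this prints one line per device/interval over the threshold, which makes the Monday pattern jump out.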
The effect for the users was an occasional wait of up to eight seconds for a commit!
I requested SAN statistics, but these didn’t show the jump in response times that I was expecting to see. In fact, the SAN reported better response times during the poor performance periods, and it couldn’t separate writes from reads. It appeared that finding the cause from SAN statistics was not an option.
There were a number of Oracle databases on multiple servers using the same SAN, so I gathered sar and Statspack (or AWR for 10g) data for them, e.g.:
set pagesize 5000
set numwidth 15
select a.snap_time, b.TIME_WAITED_MICRO, total_waits
from STATS$SNAPSHOT a, STATS$SYSTEM_EVENT b
where a.SNAP_ID = b.snap_id
and b.event = 'log file parallel write'
order by 1
/
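One thing to keep in mind when reading this output: TIME_WAITED_MICRO and total_waits are cumulative counters, so the average latency for each interval comes from the difference between consecutive snapshots. A minimal sketch of that arithmetic, with invented sample figures:

```shell
# Per-interval average 'log file parallel write' latency from cumulative
# Statspack counters. Input columns: snap_time time_waited_micro total_waits
# (the sample values below are made up for illustration).
awk 'NR > 1 {
    dmicro = $2 - prev_micro            # microseconds waited this interval
    dwaits = $3 - prev_waits            # waits completed this interval
    if (dwaits > 0)
        printf "%s avg %.1f ms\n", $1, dmicro / dwaits / 1000
}
{ prev_micro = $2; prev_waits = $3 }' <<'EOF'
08:00 1000000000 500000
09:00 1030000000 510000
10:00 1930000000 520000
EOF
```

Here the first interval averages 3.0 ms per redo write and the second 90.0 ms — the kind of jump that showed up on Mondays.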
This data showed that all of the databases using the same SAN had slower response times on Mondays, during the same 14-hour period.
Using sar -b and sar -u I could see that during the periods of poor performance there was a matching spike in waiting for physical (non-buffered) I/O. There was no virtual memory paging, and the databases didn’t use direct I/O of any kind, so this was unexpected.
By watching the running processes on the server at the time performance started degrading each Monday, I found that a process called scc (system configuration collector) was running for that same 14-hour period. It turned out to be looping, running the format command repeatedly for hours on a server with a thousand partitions/disk devices. This activity must have slowed the SAN down enough to cause the response time problems!
The scc script was rescheduled for Sundays and a code fix was suggested.
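The reschedule itself is a one-line cron change. The path and time below are assumptions for illustration, not the actual entry:

```shell
# Hypothetical root crontab entry after the change: run scc at 02:00 on
# Sundays (weekday 0), keeping its format-command looping clear of the
# business week. The binary path is an assumption.
0 2 * * 0 /opt/scc/bin/scc
```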
The important things in this investigation were:
- Using historical information to create a baseline of normal performance for comparison.
- Understanding that tweaking the database was not going to help when the disk write times varied by a factor of up to 1000, even when the general opinion was that the database must be at fault.
- The format command can have an impact on SAN performance if there are many devices.