HP NZ Database Services

2010/01/26

Don’t Blame the Database, part three

This post is about another example of where a performance problem was passed to us and we found that the real issue laid outside of the database.  There was a very similar issue discussed in this post, so I’ll just cover the differences in this case.

The users of a helpdesk application with an Oracle database backend were complaining about performance every Monday.  The first thing I did was establish the period in which performance was bad and compare it to a time when performance was normal.  I used Statspack reports to do this and found that I/O related waits were much higher, especially the average time spend waiting for redo log writes.  The overall workload was similar, but the significant difference was the I/O was taking much longer on Mondays.

I verified this theory using sar and vxstat to compare response times on Mondays vs Tuesdays-Fridays.

This is an example of vxstat output showing the problem:

vxstat  -i 60 -c 100 -d vol01
 OPERATIONS           BLOCKS        AVG TIME(ms)
TYP NAME              READ     WRITE      READ     WRITE   READ  WRITE

Mon Jan 18 15:32:15 2010
dm  cx00            31       219       496      3156   16.1    5.5
dm  cx01            27       247       432      4240    5.9    4.2
dm  cx02             1        20        16       320   50.0   26.0
dm  cx03             4        80        64      2800   12.5 8270.9

Mon Jan 18 15:33:15 2010
dm  cx00            29       171       560      1827    6.6    3.7
dm  cx01            29       232       464      5440    6.2 1818.0
dm  cx02            11        15       176       240   13.6   10.7
dm  cx03             2       180        32      4400   10.0  337.3

Mon Jan 18 15:34:15 2010
dm  cx00            38       131       608      1632   10.3 2299.0
dm  cx01            35       107       704      2304    7.1    8.8
dm  cx02            14         7       224       112   27.9    2.9
dm  cx03             4        94        64      2096   10.0    9.1

The effect for the user was waiting up to eight seconds for a commit now and then!

I requested SAN statistics, but these didn’t show the jump in response times that I was expecting to see.  In fact, the SAN reported better response times during the poor performance periods, and it couldn’t separate writes from reads.  It appeared that finding the cause from SAN statistics was not an option.

There were a number of Oracle databases using the same SAN, on multiple servers so I gathered sar and Statspack (or AWR for 10g), data for them, eg:

set pagesize 5000
set numwidth 15
select a.snap_time, b.TIME_WAITED_MICRO,total_waits from STATS$SNAPSHOT a, STATS$SYSTEM_EVENT b
where a.SNAP_ID=b.snap_id
and b.event='log file parallel write'
order by 1
/

This data showed that all of the databases using the same SAN had slower response times on Mondays, during the same 14 hour period.

Using sar -b and sar -u I could see that during the period of poor performance, there was a matching high amount of waiting for physical I/O (non-buffered).  There was no virtual memory paging, and the databases didn’t use direct I/O of any kind, so this was unexpected.

By watching the running processes on the box at the time the performance started going bad each Monday, I found that the a process called scc (system configuration collector) was running for the same 14 hour period.  It turns out that it was looping and running the format command repeatedly for hours, on a server with a thousand partitions/disk devices.  This activity must have slowed the SAN down enough to cause the response problems!

The scc script was rescheduled for Sundays and a code fix was suggested.

The important things in this investigation were:

  • Using historical information to create a baseline of normal performance for comparison.
  • Understanding that tweaking the database was not going to help when the disk write times varied by a factor of up to 1000, even if the general opinion is that the database must be at fault.
  • The format command can have an impact on SAN performance if there are many devices.
About these ads

The Rubric Theme Blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.

%d bloggers like this: