In the old days, (ie the early 90′s when I was still a student), the most popular tuning tool was the database buffer cache hit ratio. All tuning advice seemed to drag up the same old story. It took a long time for the masses to move away from this metric and to use response time breakdowns (wait events and CPU time) as their main source of performance information. Oracle’s recent tuning tools force it upon any stragglers.
However, such is the frustration of the early adopters, that the backlash against the buffer cache hit rate has gone too far, with some DBAs decrying it as useless. Now it seems popular to dismiss and scoff at the mention of a buffer cache hit rate. This is not a position I agree with. Sure, some SQL runs more efficiently with full scans, or direct I/O, so hit rates can be low for good plans. Sure, with mixed workloads, there is no rule of thumb for the optimal hit rate for the whole database.
But…. the total system buffer cache hit rate percentage can be used to detect changes in database work load, and can be drilled down to the session, SQL, segment and period levels. It’s not the only tool in the box, and not the first one to use, but it certainly helps to build a model of a database’s workload.
There are a couple of gotchas:
- Calculate it correctly. Not too difficult in recent versions, but still some people forget about direct I/O, (eg parallel processing).
- Some of the database metrics used to calculate the hit rate wrap around, leading to disturbing fluctuations in the resulting percentage score.
In this post I’m going to give a couple of recent examples of when total hit rates didn’t matter and did matter. Both times, drilling down made all the difference.
Case 1
First up, I was asked whether a DB buffer cache should be increased by 50% because the monitoring software was alerting due to an 86% hit rate.
The SGA advisory was supplied:
SGA Target SGA Size Est DB Est DB Est Physical Size (M) Factor Time (s) Time Factor Reads ---------- -------- -------- ----------- -------------- 500 .3 32,140 2.3 63,228,627 1,000 .5 14,091 1.0 3,327,207 1,500 .8 13,940 1.0 2,822,972 2,000 1.0 13,890 1.0 2,656,665 (CURRENT) 2,500 1.3 13,858 1.0 2,551,195 3,000 1.5 13,851 1.0 2,528,082 (PROPOSED) 3,500 1.8 13,851 1.0 2,528,082 4,000 2.0 13,863 1.0 2,528,082This indicated to me that the DB buffer cache did not need to be increased, and that doing so would yield no significant benefit.
I thought spending a few more minutes to gather more evidence would make my assertion more convincing.
First I checked the overall hit rate myself using v$buffer_pool_statistics. I calculated a 95% hit rate. Perhaps the monitoring tool was using v$sysstat and including direct I/O. (Outdated SQL).
Next I looked through V$SQL for statements with a lot of physical I/O and low hit rates. The top ones were related to statistics gathering and the monitoring tool itself. To get a broader view, I generated a STATSPACK report for 24 hourly snapshots and saw that 85% of disk reads during that period were from statistics gathering, with the second biggest contributer at 2% being the monitoring software itself.
We don't care about the hit rates of statistics gathering jobs that use parallel processing and direct I/O (negative hit rates), or overnight batch jobs (if any) that do full scans. The online users' hit rates are more relevant. Using V$SESS_IO I could see hit rates from 95-100%:
SID LOGICAL_READS HIT_RATIO ---------- ------------- ---------- 84 6 100 .......<snip>........ 267 29511 99.48 189 29558 99.56 93 215517 98.91 92 270283 94.77 296 418464 99.53 95 424808 94.99 80 437150 99.63So, the only actions I would advise in this case are to upgrade/correct/calibrate the monitoring software, and reconsider how the statistics are being gathered (overkill!).
Case 2
A total hit rate percentage for a 9i database that we support suddenly dropped from mid-90's (for the last year) to mid-30's. The database hadn't been restarted recently, and a quick check of the metrics in v$buffer_pool_statistics showed that the numbers hadn't wrapped around the clock. This was something real to investigate.
We were lucky enough to find a statement still in V$SQL (not flushed yet) that had a low hit rate and a high number of physical reads. If we hadn't have caught the statement in V$SQL we still could have got clues from v$segstat and v$session_longops, which would have identified the table being scanned repeatedly.
When we notified the client of what we'd found, they mentioned that they'd attempted to archive a lot of data recently, but that the process had taken far too long and had been abandoned.
The explain plan of the statement showed a full tablescan, and a CBO trace showed that an execution plan to use an appropriate index was not being chosen by a small difference in cost. Changing an Oracle parameter or two temporarily was an easy way to tip the balance and allow the client to complete the archiving procedure, (which required an outage anyway):
optimizer_index_cost_adj=1 optimizer_index_caching=100 DB_FILE_MULTIBLOCK_READ_COUNT=1The problem and solution took minutes to find, and we were alerted to it by the change in hit rate. Admittedly, an unusual rise in scattered read waits might also have got our attention, but the hit rate is a simpler metric to monitor, graph, compare and explain to the client in a report.