This post follows on from a previous one in which I used a Statspack report to show the value of the time model statistics. In this post I use some guess work to look outside the DB Time metric, to get a feel for what else might affect the response time for the application, and to gauge how important tuning the database (reducing DB Time) is from a holistic point of view.
I should point out that I’m not as familiar with AIX (the OS in this case) as I am with Solaris, Linux, HP-UX or even Windows. I had no access to the server in question during the period for the Statspack report. These are theories. Make up your own mind and send us feedback (see About us page for email address) if you have knowledge that you’d like to share.
The system was described to me as “CPU constrained”. The Statspack report shows that the database was directly responsible for only 24% of the 78% CPU utilisation (in other words increased server CPU utilisation by 19%).
Host CPU (CPUs: 2 Cores: 1 Sockets: 0)
~~~~~~~~ Load Average
Begin End User System Idle WIO WCPU
------- ------- ------- ------- ------- ------- --------
5.90 5.92 44.49 34.03 21.47 1.47 309.01
Instance CPU
~~~~~~~~~~~~ % Time (seconds)
-------- --------------
Host: Total time (s): 7,199.4
Host: Busy CPU time (s): 5,653.5
% of time Host is Busy: 78.5
Instance: Total CPU time (s): 1,364.2
% of Busy CPU used for Instance: 24.1
Instance: Total Database time (s): 4,059.1
%DB time waiting for CPU (Resource Mgr): 0.0
OS Statistics
-> ordered by statistic type (CPU use, Virtual Memory, Hardware Config), Name
Statistic Total
------------------------- ----------------------
BUSY_TIME 565,349
IDLE_TIME 154,590
IOWAIT_TIME 10,579
SYS_TIME 245,021
USER_TIME 320,328
OS_CPU_WAIT_TIME 2,224,700
RSRC_MGR_CPU_WAIT_TIME 0
PHYSICAL_MEMORY_BYTES 8,589,934,592
NUM_CPUS 2
NUM_CPU_CORES 1
GLOBAL_RECEIVE_SIZE_MAX 1,048,576
GLOBAL_SEND_SIZE_MAX 1,048,576
TCP_RECEIVE_SIZE_DEFAULT 16,384
TCP_RECEIVE_SIZE_MAX ######################
TCP_RECEIVE_SIZE_MIN 4,096
TCP_SEND_SIZE_DEFAULT 16,384
TCP_SEND_SIZE_MAX ######################
TCP_SEND_SIZE_MIN 4,096
-------------------------------------------------------------
OS Statistics - detail
Snap Snapshot
Id Day Time Load %Busy %User %System %WIO %WCPU
------ --------------- ------ ------ ------ ------- ------ ------
1654 Tue 18 11:00:04 5.9
1655 Tue 18 11:31:02 7.5 82.5 46.5 36.0 1.9
1656 Tue 18 12:00:04 5.9 74.3 42.4 31.9 1.0
-------------------------------------------------------------
This server is dedicated to the one database, and has no application component running on it. So what was using the rest of the CPU? Perhaps the database CPU measurements were not quite correct, (eg due to LOB OCI activity?), or perhaps the listener was using some CPU (tracing enabled?), but it seems to me there is more to it than that. I see that 44% of the CPU busy time happened in system mode. This could be related to I/O or network traffic (among other things). It was subsequently discovered that the I/O was being cached by the OS, so perhaps that contributed to some degree.
Finding the cause might offer some opportunity for tuning, greater than just trying to reduce background and DB CPU time.
How about network traffic?
I already know that this application uses LOBS, (direct I/O to LOBs showed up in the Statspack report), and LOB access can result in a high number of round trips.
Avg %Total
%Tim Total Wait wait Waits Call
Event Waits out Time (s) (ms) /txn Time
---------------------------- ------------ ---- ---------- ------ -------- ------
SQL*Net message from client 16,327,445 0 471,448 29 69.9
Statistic Total per Second per Trans
--------------------------------- ------------------ -------------- ------------
bytes received via SQL*Net from c 5,432,266,525 1,508,962.9 23,262.5
bytes sent via SQL*Net to client 9,743,180,327 2,706,439.0 41,723.1
SQL*Net roundtrips to/from client 16,329,496 4,536.0 69.9
There were 16 million round trips and about 15GB of data transferred across the network during the hour.
4,536 round trips per second means 9,072 SQL*Net packets (up to 8000bytes) arriving or departing per second, which may then be broken into multiple network packets (up to 1500bytes), each potentially causing CPU interrupts which have to be handled by the OS. (“Potentially” because of interrupt coalescing). Surely this would contribute to the CPU time system in system mode.
Further, it might show that the active database time is less significant to overall application performance than we might otherwise assume.
The total wait time was 471,448 seconds, most of which will be idle application time, but not all of it. Some of this time is network overhead, and some will be application think time. It is not possible to calculate the network + application time from the data in the Statspack report, but we can make a safe guess at a lower limit using the wait event histogram information.
Total ----------------- % of Waits ------------------
Event Waits <1ms <2ms <4ms <8ms <16ms <32ms <=1s >1s
-------------------------- ----- ----- ----- ----- ----- ----- ----- ----- -----
SQL*Net more data to clien 203K 99.1 .2 .3 .3 .0 .0 .0
SQL*Net message from clien 16M 16.1 29.5 25.4 19.3 6.5 1.6 1.4 .2
SQL*Net message to client 16M 100.0 .0 .0 .0 .0 .0 .0
SQL*Net more data from cli 60K 58.5 .2 .2 .2 2.3 38.1 .4
If we decide to treat any SQL*Net wait event longer than one second as idle (application and DB doing nothing), and treat the rest as time during which the user would be waiting, then we can get a lower limit from the highlighted row above. (I’ll ignore the less significant more data waits for this example).
0ms <= 16.1% of waits < 1ms 0.161 * 0ms * 16,327,445 = 0s
1ms <= 29.5% of waits < 2ms 0.295 * 1ms * 16,327,445 = 4,816s
etc...
Taking the lower bound of each I calculate at that 45,701 sec is the lower limit. Compare this to the total DB Time of 3,393 sec, and we can see that the time spent in the database is no more than 6.9% of the total response time of the app server + database server components.
Note that this doesn’t include any time used for communication between the application server and the user, browser processing, etc.
This extra information puts into perspective the scope of any benefit from reducing DB Time. This information could be useful to the overall tuning effort by moving the focus to options that might help to reduce the number of round trips or make them more efficient. (Eg bulk processing, pre-fetching, using varchar2 instead of LOB, increasing SDU, removing firewalls between app and DB, tuning interrupt coalescing parameters, application server tuning, etc).