HP NZ Database Services

2009/09/26

Supplementing Application Error Logging

Database users may complain that something isn’t right with the application and request that the DBAs have a look in the database for problems.  Oracle doesn’t record every error reported to database clients in the alert log, just server related messages, (internal errors, snapshot too old, segment cannot extend, etc).  It is the application that should capture and/or report the full error stack returned to the database client, but unfortunately not all of them do.

The following is an example of this situation, in which I had to track down the cause of a problem where the application should have made that clear from its error log.

Application is Stuck – Example Case

An application owner asked me to look at the database, because an job on the application server was stuck and appeared to  be looping.  I asked for the error messages and while waiting for them, I fired up SQL*Plus.

The predominant and most unusual wait event for the recent and active sessions was “undo segment recovery”.  This happens when the session is waiting for the PMON to roll back a dead transaction.

V$SESSION showed that the SQL being executed just before the wait was:

select STATE from DUMMY where ID = :1  FOR UPDATE

By this time I had received the application error logs.  They referred to a different select statement on the same table, but the only clue was “This may be due to a transient race condition. Rolling back and retrying the transaction.”

The “rolling back” error text fitted with the “undo segment recovery” wait event I was seeing, even if the SQL didn’t match.  The application was trying to do the same thing with the same result repeatedly.

I noticed that the number of “SQL*Net break/reset to client” wait events were steadily increasing.  This indicates error messages were being generated for some database sessions.

At this point I could either try to trace from the client side, or from the server side.  I only had access to the database server, so I enabled SQL tracing for a couple of likely database sessions and soon had a trace file of a session that had been waiting on “undo segment recovery”:

=====================
PARSING IN CURSOR #4 len=52 dep=0 uid=30 oct=3 lid=30 tim=52668306764175 hv=3327540542 ad='914b1fb8'
select STATE from DUMMY where ID = :1  FOR UPDATE
END OF STMT
PARSE #4:c=0,e=941,p=0,cr=0,cu=0,mis=1,r=0,dep=0,og=1,tim=52668306764164
BINDS #4:
kkscoacd
 Bind#0
  oacdty=02 mxl=22(22) mxlc=00 mal=00 scl=00 pre=00
  oacflg=03 fl2=1000000 frm=01 csi=871 siz=24 off=0
  kxsbbbfp=ffffffff7ae74f60  bln=22  avl=04  flg=05
  value=376059
WAIT #4: nam='undo segment recovery' ela= 2929424 segment#=5 tx flags=183 p3=0 obj#=27086 tim=52668309703725
EXEC #4:c=10000,e=2939555,p=0,cr=7,cu=2,mis=1,r=0,dep=0,og=1,tim=52668309703911
ERROR #4:err=1591 tim=1098267617
WAIT #4: nam='SQL*Net break/reset to client' ela= 9 driver id=1952673792 break?=1 p3=0 obj#=27086 tim=52668309704412
WAIT #4: nam='SQL*Net break/reset to client' ela= 973 driver id=1952673792 break?=0 p3=0 obj#=27086 tim=52668309705441
WAIT #4: nam='SQL*Net message to client' ela= 5 driver id=1952673792 #bytes=1 p3=0 obj#=27086 tim=52668309705502
WAIT #4: nam='SQL*Net message from client' ela= 2441 driver id=1952673792 #bytes=1 p3=0 obj#=27086 tim=52668309708092
WAIT #28: nam='SQL*Net message to client' ela= 3 driver id=1952673792 #bytes=1 p3=0 obj#=27086 tim=52668309708548
WAIT #28: nam='SQL*Net message from client' ela= 2608 driver id=1952673792 #bytes=1 p3=0 obj#=27086 tim=52668309711219
XCTEND rlbk=1, rd_only=1
WAIT #0: nam='SQL*Net message to client' ela= 2 driver id=1952673792 #bytes=1 p3=0 obj#=27086 tim=52668309711650

And there is the error message!  ORA-1591 means “lock held by in-doubt distributed transaction”.

Querying DBA_2PC_PENDING showed a couple of pending two-phase-commit transactions since the previous day.  I informed the application owner, who then asked me to ‘commit force’ those transactions.  Problem solved.

Wrap Up

If the users can’t provide the DBA with an ORA-NNNN error message, then the DBA can try to capture the error message by enabling SQL tracing and searching through the trace files for “err=“.

2009/09/19

Tune-n-Tools vs Oracle

The Puzzle

A UNIX administrator contacted us (all the way from San Diego) with a problem regarding semaphore configuration on a new HP-UX 11i v3 server running an Oracle database:

The default semaphore settings were:

SEMMSL=2048   (semaphores per set)
SEMMNI=2048   (number of sets)

He had read Oracle’s installation documentation which said that the product of these two parameters must be at least the number of PROCESSES as defined in the Oracle database’s init.ora.  By default SEMMSL x SEMMNI = 4,194,304; much higher than the database’s PROCESSES parameter of 10,000.

Following Tune’n'Tools guidance, the UNIX administrator changed the parameters to:

SEMMSL=128
SEMMNI=8192

The new settings resulted in product of 1,048,576, which should support 10,000 database processes right?  So why did the database fail to start and report the following error?

SQL> startup nomount pfile="/${ORACLE_SID}/oracle/admin/pfile/init${ORACLE_SID}.ora";
ORA-27154: post/wait create failed
ORA-27300: OS system dependent operation:semids_per_proc failed with status: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: sskgpwcr2
ORA-27303: additional information: semids = 136, maxprocs = 10000

The Solution

Oracle will try to get PROCESSES+4 semaphores per set/id, and then divide the result by two until it is smaller than SEMMSL.  In this case, we start with 10,004 and end up with 78 (74 for database and 4 overhead per set).  This requires 136 sets/ids.

There is an Oracle compilation parameter SS_SEM_MAX which limits the maximum number of semaphore sets oracle will allocate, and this normally defaults to 10.  OS tuning tools may not take this into consideration when making recommendations.

So, there is a hidden limit: Least(SEMMNI,10) x (SEMMSL-4) >= PROCESSES

For the case above: 10 x 124 < 10,000 [error], but 10 x 2044 > 10,000 [no error].

(See Metalink Doc 15566.1 for more about SS_SEM_MAX).

Metalink Doc 15654.1 ( TECH: Calculating Oracle’s SEMAPHORE Requirements) gives this guidance:

SEMMSL is an arbitrary figure which is best set to a round figure no smaller that the smallest ‘processes’ figure for any database on the system.

Which makes sense given the SS_SEM_MAX limitation, and given that the init parameter “processes” for a DB may be increased after the initial installation.  I’d suggest SEMMSL be at least 4 + maximum expected PROCESSES value for all instances, or else Oracle’s “halve and try again” method may introduce inefficiencies.

The default values of SEMMSL=2048 and SS_SEM_MAX=10 would support up to 16,387 processes for a single instance, or ten instances with 2,000 processes each.  Not many people would have run into the SS_SEM_MAX limitation with these default values.

2009/06/24

Firewalls vs Users

Filed under: UNIX / Linux — hpdba @ 8:22 pm
Tags: , ,

While firewalls have their uses, they can also be an obstacle for users.

I have been called out to customer sites a few times, only to find the problem was a software firewall blocking useful traffic.  Once, it was the Windows firewall blocking FTP, once it was Zone Alarm blocking SQL*Net, and a couple of times the Linux iptables firewall has been to blame.

Here is a recent example of a software firewall that was inappropriately configured:

We wanted to install Oracle on an Oracle Enterprise Linux server, and needed to use VNC to do so.  We started up vncserver on the Linux server and tried to connect to it with a VNC client from a Windows PC.  It didn’t work.

So… we started to investigate.

From the PC command prompt we tried telnet NN.NN.NN.NN 5901 .  This timed out.  Hmmmm… not a “Connection refused” so it appeared to be a firewall issue.  The PC’s Windows firewall already had a VNC exception, (part of the VNC client installation), but even turning it off didn’t make a difference.

Next, we checked that vncserver was running on the Linux box, and ran netstat -a to confirm that vncserver was listening on port 5901.  All was as expected.  At this point we suspected a physical firewall, but once that was ruled out, it only left a Linux software firewall.  The commands show how to confirm this and disable the firewall.  (Rather than to reconfigure it).

[root@akadevsrv01 ~]# /etc/init.d/iptables status
$Table: nat
Chain PREROUTING (policy ACCEPT)
num  target     prot opt source               destination
Chain POSTROUTING (policy ACCEPT)
num  target     prot opt source               destination
1    MASQUERADE  all  --  192.168.122.0/24     0.0.0.0/0
Chain OUTPUT (policy ACCEPT)
num  target     prot opt source               destination
$Table: filter
Chain INPUT (policy ACCEPT)
num  target     prot opt source               destination
1    ACCEPT     udp  --  0.0.0.0/0            0.0.0.0/0           udp dpt:53
2    ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0           tcp dpt:53
3    ACCEPT     udp  --  0.0.0.0/0            0.0.0.0/0           udp dpt:67
4    ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0           tcp dpt:67
5    RH-Firewall-1-INPUT  all  --  0.0.0.0/0            0.0.0.0/0
Chain FORWARD (policy ACCEPT)
num  target     prot opt source               destination
1    ACCEPT     all  --  0.0.0.0/0            192.168.122.0/24    state RELATED,ESTABLISHED
2    ACCEPT     all  --  192.168.122.0/24     0.0.0.0/0
3    ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0
4    REJECT     all  --  0.0.0.0/0            0.0.0.0/0           reject-with icmp-port-unreachable
5    REJECT     all  --  0.0.0.0/0            0.0.0.0/0           reject-with icmp-port-unreachable
6    RH-Firewall-1-INPUT  all  --  0.0.0.0/0            0.0.0.0/0
Chain OUTPUT (policy ACCEPT)
num  target     prot opt source               destination
Chain RH-Firewall-1-INPUT (2 references)
num  target     prot opt source               destination
1    ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0
2    ACCEPT     icmp --  0.0.0.0/0            0.0.0.0/0           icmp type 255
3    ACCEPT     esp  --  0.0.0.0/0            0.0.0.0/0
4    ACCEPT     ah   --  0.0.0.0/0            0.0.0.0/0
5    ACCEPT     udp  --  0.0.0.0/0            224.0.0.251         udp dpt:5353
6    ACCEPT     udp  --  0.0.0.0/0            0.0.0.0/0           udp dpt:631
7    ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0           tcp dpt:631
8    ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0           state RELATED,ESTABLISHED
9    ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0           state NEW tcp dpt:21
10   ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0           state NEW tcp dpt:25
11   ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0           state NEW tcp dpt:22
12   ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0           state NEW tcp dpt:443
13   REJECT     all  --  0.0.0.0/0            0.0.0.0/0           reject-with icmp-host-prohibited
[root@akadevsrv01 ~]# /etc/init.d/iptables stop
$Flushing firewall rules:                                  [$  OK  ]
$Setting chains to policy ACCEPT: nat filter               [$  OK  ]
$Unloading iptables modules:                               [$  OK  ]
[root@akadevsrv01 ~]# chkconfig iptables off
[root@akadevsrv01 ~]# /etc/init.d/iptables  status
$Firewall is stopped.

Theme: Rubric. Blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.