Putting firewalls between application servers and database servers seems to be becoming more popular.
One negative side effect of this is when the firewall decides to end a connection due to inactivity. Some networking devices have a timeout period so idle connections are cut off. This can have unpleasant results for the users of an application.
I first came across this problem while working for a new customer a long time ago. I was frequently asked to kill database sessions that held row level locks and were blocking other transactions. After a while, I noticed a pattern. The locking problems always started after the blocking session had held the lock and been idle for at least 900 seconds. When this was described to the users, they added that they have to log in again after lunch, or coffee breaks, etc. I referred this issue to the network administrator, (who sat next to me and so couldn’t ignore me), who soon realised that the cause was a NAT’ing router. The router cut off the (fat client) application’s session after the user had been idle for 15min. The timeout was increased and users were asked to log out of the application when finished for the day. Problem solved.
The SQLNET.LOG on a database server can show a DBA that idle connections are being cut off, eg a recent example I saw on Solaris:
***********************************************************************
Fatal NI connect error 12170.
VERSION INFORMATION:
TNS for Solaris: Version 10.2.0.3.0 - Production
Oracle Bequeath NT Protocol Adapter for Solaris: Version 10.2.0.3.0 - Production
TCP/IP NT Protocol Adapter for Solaris: Version 10.2.0.3.0 - Production
Time: 01-JUL-2009 03:31:04
Tracing not turned on.
Tns error struct:
ns main err code: 12535
TNS-12535: TNS:operation timed out
ns secondary err code: 12560
nt main err code: 505
TNS-00505: Operation timed out
nt secondary err code: 145
nt OS err code: 0
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=nn.nn.nn.nn)(PORT=57922))
In this example, the Solaris error 145 shows that the disconnection was not due to an Oracle setting or database problem, (as some parties were insisting), but was actually from a connection timeout at the OS level. (Look up error number 145 in /usr/include/sys/errno.h ).
In this case I needed more evidence to prove my explanation to others, so I recorded the connected database sessions and their idle times, then matched these errors to idle sessions that had disappeared at the same time. Recording this data for a while showed that only sessions that were idle for a certain amount of time were cut off. Later, the network support staff were able to confirm the idle connection timeout setting.
Over the years, my attempts to persuade network administrators to increase the timeout period have often met resistance, because the timeout exists to free up resources. An alternative is to use some sort of keep alive packets. There are various places keep alives can be enabled, eg at the Windows OS level or in PuTTy. For Oracle databases, SQLNET.EXPIRE_TIME can be used. Although SQLNET.EXPIRE_TIME was intended as a dead connection detection (DCD) method, a side effect of sending probe packets between the database server and client/application server is that the connection remains active. Set the SQLNET.EXPIRE_TIME to less than the firewall timeout and the problem is solved.
Another work around for applications that use connection pools is to keep each connection active by checking running a simple SQL statement at regular intervals.