Human Factors Behind Incidents: Why Settings Like "idle_session

PostgreSQL 14 introduced the parameter idle_session_timeout, and, unfortunately, many DBAs jumped to start using it without understanding or by ignoring the consequences. In a short span of time, it has become one of the most misused parameters in many PostgreSQL installations. There is nothing wrong with idle_session_timeout from a technical perspective; even without this parameter, I know frustrated DBAs run scheduled shell scripts to keep terminating old idle sessions. The new parameter made their work easy to nuke all idle connections. The problem is the collateral damage it can cause.

Note for novice DBAs: There are human and organizational factors for almost every incident, rather than just technical reasons. Trying to do a quick/dirty fix by changing parameters will have far-reaching implications. Not everything can be addressed on the database side. This is more evident in connection-related incidents. Blame games and bad decisions following one incident/outage can lead to further bad decisions and further problems.

For example, many such incidents will have a common root cause” ” something like unexpected, sudden connection explosion — a big bang of connections with hundreds or thousands of idle connections. The RCA investigation may end with only the technical side of the problem, like loose limits invited disaster to happen. But the right question to be addressed would be: Why do DBAs end up setting such loose limits? What are the factors that influenced and forced DBAs to do so? Unfortunately, it gets addressed very rarely in many organizations.

Common factors contributing to connection-related outages

1. Unrestricted connection settings: Leaving connection settings, such as max_connections and per-user limits, wide open exposes the system to malicious Denial-of-Service (DoS) attacks. These attacks can rapidly consume available connections, leaving legitimate users unable to connect and disrupting system functionality.

2. Inadequate application-side pooling: Suboptimal application-side pooling can lead to opening a large number of connections, and they may remain open and unused, consuming valuable resources. These connections can accumulate, eventually exhausting server resources and causing the system to become unresponsive.

3. Unexpected connection leaks: Unanticipated connection leaks can also arise from programming errors or faulty application logic. These leaks can quickly deplete the available connection pool, hindering system performance and potentially leading to crashes.

4. Human factor: In an ideal case, there must be stringent restrictions and enforcement on what load and number of connections can be onboarded to the Database. It should be a tight scrutiny like airline security.

The problem generally begins as a conflict of interests. App-Dev often demands unrestricted, uncontrolled access to the database. A “NO” from the Ops/DBAs at the right time may save the entire system from outages. However, gatekeeping is not an easy job. Pressure builds up, and DBAs give up in most cases and allow settings that the database server cannot accommodate. The first decision goes wrong here. Obviously, an incident is expected down the line.

5. Organizational factor of wrong decisions: A surprising set of wrong decisions generally happens after the investigation following major incidents, The investigation could be pointing fingers at an excessive number of connections. Its common that Ops/DBA becomes responsible for managing the “idle” connections! because “it is a database problem”. I would say the very wrong person takes charge.

The point forgotten is that connections are owned by clients but maintained by a database. It’s like a bank account. The customer owns it, but the bank maintains it. We shouldn’t be terminating or closing connections from the database side. Instead of Investigating who owns them and addressing them at their root, DBAs often go for shortcuts like idle_session_timeout or a script to terminate all “idle” connections. The boundaries of ownership are violated, inviting the next level of problems.

There are a lot of misunderstandings about “idle” connections among the user community. We should remember that a connection will be “idle” immediately after establishing it. It becomes “active” when there is an SQL for executing. It goes back to “idle” when the processing is over. Most of the connections from any connection pooler also will be “idle” most of the time. So, there is nothing wrong with a connection being “idle”. The point I want to make is that a connection appearing as “idle” doesn’t make it a candidate for termination/killing. This may sound too silly for an experienced user, but the reality is that I ended up sitting on emergency calls to explain this.

What matters is the number of connections. The right question to ask in an RCA call is, “Who is responsible for creating the large number of connections that are idle most of the time?

Common symptoms of poor connection management

Gradual server performance degradation due to the high number of connections
Often, the server becomes unresponsive and/or crashes
The server is running out of memory, and OOM killer terminates PostgreSQL
The server is running out of available connections
Application reporting connection failures and unexpected drops of connection crashes

Problems of backend termination

I often get the question, “What is the problem with terminating connections from backends?”. In layman’s terms, it will be like a bank closing your account and taking all the money in it. That can become a big surprise for the database client/application. Most of the applications may not be designed and tested for handling such surprises. Even a psql will get it with a surprise

postgres=> l+
FATAL:  terminating connection due to idle-session timeout
FATAL:  server conn crashed?
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

postgres=> l+

FATAL: terminating connection due to idle-session timeout

FATAL: server conn crashed?

server closed the connection unexpectedly

This probably means the server terminated abnormally

before or while processing the request.

The connection to the server was lost. Attempting reset: Succeeded.

So unless the application is designed, developed, and tested for handling unexpected connection drops, it won’t be able to handle this. The majority of applications just exit with error reporting. In critical systems, this causes serious outages for business.

Despite being tailored for PostgreSQL, pgBouncer is not immune to sudden connection terminations:

2023-11-14 16:17:21.967 UTC [4630] WARNING S-0x10b7940: postgres/authuser@127.0.0.1:5432 got packet 'E' from server when not linked
2023-11-14 16:17:21.967 UTC [4630] LOG S-0x10b7940: postgres/authuser@127.0.0.1:5432 closing because: server conn crashed? (age=218s)
2023-11-14 16:17:22.927 UTC [4630] LOG stats: 0 xacts/s, 0 queries/s, 0 client parses/s, 0 server parses/s, 0 binds/s, in 0 B/s, out 11 B/s, xact 0 us, query 0 us, wait 12 us
2023-11-14 16:17:40.533 UTC [4630] LOG S-0x10b7bf8: postgres/pgbtestapp@[::1]:5432 closing because: server conn crashed? (age=236s)
2023-11-14 16:17:40.533 UTC [4630] LOG C-0x10af130: postgres/pgbtestapp@[::1]:39470 closing because: server conn crashed? (age=28s)
2023-11-14 16:17:40.533 UTC [4630] WARNING C-0x10af130: postgres/pgbtestapp@[::1]:39470 pooler error: server conn crashed?
2023-11-14 16:18:06.370 UTC [4630] LOG S-0x10b7bf8: postgres/authuser@127.0.0.1:5432 new connection to server (from 127.0.0.1:46632)

2023-11-14 16:17:21.967 UTC [4630] WARNING S-0x10b7940: postgres/authuser@127.0.0.1:5432 got packet 'E' from server when not linked

2023-11-14 16:17:21.967 UTC [4630] LOG S-0x10b7940: postgres/authuser@127.0.0.1:5432 closing because: server conn crashed? (age=218s)

2023-11-14 16:17:22.927 UTC [4630] LOG stats: 0 xacts/s, 0 queries/s, 0 client parses/s, 0 server parses/s, 0 binds/s, in 0 B/s, out 11 B/s, xact 0 us, query 0 us, wait 12 us

2023-11-14 16:17:40.533 UTC [4630] LOG S-0x10b7bf8: postgres/pgbtestapp@[::1]:5432 closing because: server conn crashed? (age=236s)

2023-11-14 16:17:40.533 UTC [4630] LOG C-0x10af130: postgres/pgbtestapp@[::1]:39470 closing because: server conn crashed? (age=28s)

2023-11-14 16:17:40.533 UTC [4630] WARNING C-0x10af130: postgres/pgbtestapp@[::1]:39470 pooler error: server conn crashed?

2023-11-14 16:18:06.370 UTC [4630] LOG S-0x10b7bf8: postgres/authuser@127.0.0.1:5432 new connection to server (from 127.0.0.1:46632)

Even if we ignore all other problems, the connection pooler will end up re-establishing pretty much every connection in the pool. This completely destroys the very purpose of the connection pooler.

It is not that the PostgreSQL community is not aware of this possible abuse of parameters. This has been discussed in the community, and the implications on the pooler are known and well documented.

However, users still overlook the implications, and serious mess-ups happen.

From the discussions in the PostgreSQL community, the only strong argument in favor of this parameter is one-off users who directly log in to the database to execute custom statements. They may accidentally leave their sessions open for a long time, which causes damage.

There were discussions about making it more explicit like: Consider setting this for specific users instead of as a server default. Client connections managed by connection poolers or initiated indirectly, like those by a remote postgres_fdw using server, should probably be excluded from this timeout.

What can we do about it without causing any serious consequences?

Obviously, DBAs may have questions like, what if user accounts / application connections are really abusive? As mentioned above, the solution is not just technical, but understanding the system and implementing appropriate strategies is important.

Segregation of database accounts and role-based settings

Those accounts used by individual users for interactive logins need to be differentiated from the accounts used by the application – The service accounts, In terms of user/role-level settings.PostgreSQL allows us to have parameter settings at different levels and scope. More stringent restrictions are to be placed on the interactive login accounts. It is very much OK to have settings like idle_session_timeout for those accounts, Preferably not exceeding 5 minutes. Most importantly, the idle_in_transaction_session_timeout also does not exceed a few seconds.
On the other hand, I would recommend NOT to use idle_session_timeout for service accounts. But using idle_in_transaction_session_timeout for a couple of minutes is acceptable as it helps us to find problematic application logic.

Use network timeouts

There are alternative approaches if we suspect that the connections are just leaked without the application holding them. TCP timeouts could be the best solution for such cases. PostgreSQL has TCP-related timeouts as a parameter option, which can be specified like any other PostgreSQL parameter option.

tcp_keepalives_idle: This parameter specifies the number of seconds of inactivity, after which a keepalive message will be sent to the client by the PostgreSQL server. For example, if we set a value “120”, Sever will send a keepalive message to the client after two minutes of inactivity and wait for a response. If there is an acknowledgment from the client, the connection is validated. What if there is no acknowledgment from a client? That is where the next parameter helps

tcp_keepalives_interval: This is the number of seconds after which the TCP keepalive message will be retransmitted if there is no acknowledgment from the client. For example, setting a value of 20 results in sending keepalive packets every 20 seconds until an acknowledgment is obtained. What if there is no acknowledgment for any of the keepalive attempts? This is where the next parameter comes into play.

tcp_keepalives_count: This is the number of keepalive messages that can be lost before the client connection is considered as “dead”. And terminated.

So, the overall settings summary discussed is:

tcp_keepalives_idle = 120
tcp_keepalives_interval = 20
tcp_keepalives_count = 6

tcp_keepalives_idle = 120

tcp_keepalives_interval = 20

tcp_keepalives_count = 6

But you may adjust the values according to the needs of your environment.

There is one more parameter: tcp_user_timeout, Which, as some caveat, is documented and reportedly affects other parameter values. So this is not something we should consider on day one.

PostgreSQL 14 onwards, we have an additional option: client_connection_check_interval, which can detect whether the connection to the client has gone away or not. This is most useful if a client dies during a long-running statement like those complex reporting queries from OLAP systems. By default, this check is off (value 0). Setting this value to 5 to 10 seconds will be of great value if you are expecting cases of client connectivity. Please note that the value is specified in milliseconds. PostgreSQL allows us to set these values at even user level.

With all TCP timeouts in place, you may see entries as follows in PostgreSQL logs if some timeout happens.

2023-12-20 04:37:12.026 UTC [1230] LOG:  could not receive data from client: Connection timed out
2023-12-20 04:37:12.026 UTC [1230] LOG:  disconnection: session time: 0:22:23.419 user=jobin database=postgres host=fd42:8aba:4133:fb7d:216:3eff:fe3c:7bcb port=52984

1 2	2023-12-20 04:37:12.026 UTC [1230] LOG: could not receive data from client: Connection timed out 2023-12-20 04:37:12.026 UTC [1230] LOG: disconnection: session time: 0:22:23.419 user=jobin database=postgres host=fd42:8aba:4133:fb7d:216:3eff:fe3c:7bcb port=52984

Summary

There should be a cap on what a system can accept as a load. At the PostgreSQL level, it is the max_connections. I would consider it the primary duty of any Production DBA to protect the system from any possible abuse.

Dear DBAs, Please consider the following points to keep the system stable and protect it from any abuses.

Configure connection settings appropriately. Set connection limits, such as max_connections and per-user limits, to reasonable values that align with expected usage patterns. Regularly review and adjust these settings as needed. The max_connections setting of a PostgreSQL instance should not be too far from 10x of available CPUs.
Encourage the use of robust application-side pooling: Implement well-configured application-side pooling mechanisms to manage connections and prevent leaks efficiently. Employ libraries or frameworks that provide reliable pooling capabilities. In case the application framework doesn’t have a good connection pooler, consider external connection poolers like pgBouncer.
Connections are owned by those who create it. If there is an excessive number of idle connections. The part of the system that is causing the creation of connections needs to be fixed. Terminating them silently from the backend is not a solution, but it invites more problems.
Never terminate connections from the backend unless there is a request to do so from the application/client who owns the connections.
Avoid using parameters like idle_session_timeout at the global level; use it only at the user/role level, especially for interactive login accounts.
Avoid idle_session_timeout for application/service accounts. It can be used for accounts that are used for interactive logins where the consequence of timeout is predictable.
Alternate options, as discussed above, to handle leaked connections.
Monitor the connections. It is precious.

And most importantly, the crucial part. Empower yourself or your DBAs with the ability to say NO to any connection abuses. Junior DBAs are more vulnerable to pressure from other parts of organizations to take the wrong steps.

Percona Distribution for PostgreSQL provides the best and most critical enterprise components from the open-source community, in a single distribution, designed and tested to work together.

Download Percona Distribution for PostgreSQL Today!

1 Comment

Oldest

Newest Most Voted

Inline Feedbacks

View all comments

WEKSLEY MESQUITA VIANA

2 months ago

Firstly congratulations for the article. Its very enlightening about the use of the idle_session_timeout parameter.

I wanted to ask a question about the reference where it says: “The max_connections setting of a PostgreSQL instance should not be too far from 10x of available CPUs.”

I’m looking for recent references that help define this max_connections when using SSDs. The literature only deals with old HDDs that point to something between 2 and 4x the number of processors.

I found this 10x number very interesting.

MySQL 5.7
End of Life

Compare Percona to Leading Database Solutions

Software
Downloads

Product
Documentation

Resource Hub

Financial Services

Driving Database Success

Percona Blog

Percona Community Hub

Percona Events Hub

About Percona

Percona in the News

Our Customers

Our Partners

Careers

Contact Us

Human Factors Behind Incidents: Why Settings Like “idle_session_timeout” Can Be a Bad Idea

Common factors contributing to connection-related outages

Common symptoms of poor connection management

Problems of backend termination

What can we do about it without causing any serious consequences?

Summary

Related

Related Blog Articles

RECOMMENDED ARTICLES

Trying out the PostgreSQL pg_tde Tech Preview Release

Benchmarking MongoDB Performance on Kubernetes

Why MariaDB Is “Better” Than MySQL

MOST POPULAR ARTICLES

Auditing login attempts in MySQL

Deploy Django on Kubernetes With Percona Operator for PostgreSQL

MySQL “Got an error reading communication packet”

MySQL 5.7 End of Life

Compare Percona to Leading Database Solutions

Software Downloads

Product Documentation

Resource Hub

Financial Services

Driving Database Success

Percona Blog

Percona Community Hub

Percona Events Hub

About Percona

Percona in the News

Our Customers

Our Partners

Careers

Contact Us

Human Factors Behind Incidents: Why Settings Like “idle_session_timeout” Can Be a Bad Idea

Common factors contributing to connection-related outages

Common symptoms of poor connection management

Problems of backend termination

What can we do about it without causing any serious consequences?

Summary

Related

Share This Post!

Want to get weekly updates listing the latest blog posts?

Related Blog Articles

RECOMMENDED ARTICLES

Trying out the PostgreSQL pg_tde Tech Preview Release

Benchmarking MongoDB Performance on Kubernetes

Why MariaDB Is “Better” Than MySQL

MOST POPULAR ARTICLES

Auditing login attempts in MySQL

Deploy Django on Kubernetes With Percona Operator for PostgreSQL

MySQL “Got an error reading communication packet”

MySQL 5.7
End of Life

Software
Downloads

Product
Documentation