It is very common to see many infrastructure layers standing between a PostgreSQL database and the application server. The most common ones are connection poolers, load balancers, routers, firewalls, etc. We often forget or take for granted the network hops involved and the additional overhead they create, but in many cases this overhead causes severe performance penalties and overall throughput deterioration. I have been trying to get a good estimate of this overhead for quite some time. Previously, I had written about how the volume of data transmitted as part of SQL execution, as well as the cursor location, affects overall performance.

Meanwhile, Hans-Jürgen Schönig’s presentation, which brought up the old discussion of Unix socket vs. TCP/IP connections, triggered me to write about other aspects of the network’s impact on performance. He demonstrated a specific case of a 2x performance degradation while using a TCP/IP connection.

How to detect and measure the impact

There is no easy mechanism for measuring the impact of network overhead. But a close analysis of wait_events from pg_stat_activity can tell the story as accurately as possible, so we should be sampling the wait events. Many methods exist for wait-event sampling, including extensions, but I prefer not to install special tools or extensions in the user environment for wait event sampling. At Percona Support, we use pg_gather to collect and study wait events because it is a standalone SQL script and doesn’t require installing anything on the database systems. It is also designed to be very lightweight, collecting 2,000 samples per session.
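For illustration, wait-event sampling can be approximated with a simple loop from the shell. The sketch below is not what pg_gather does internally; the interval, sample count, and database name are assumptions:

# Sample wait events of active sessions every 100 ms and summarize the counts;
# a NULL wait_event for an active session is treated as time on CPU.
for i in $(seq 1 200); do
  psql -X -At -d postgres -c "SELECT coalesce(wait_event_type, 'CPU'),
                                     coalesce(wait_event, 'CPU')
                              FROM pg_stat_activity
                              WHERE state = 'active'
                                AND pid <> pg_backend_pid();"
  sleep 0.1
done | sort | uniq -c | sort -rn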

The pg_gather analysis report can show the wait events and other information associated with each session.


But in this blog, I will discuss and highlight only the wait events portion of it while going through different types of workloads and how network performance shows up in wait events.

Case 1: Query retrieving a large number of rows

Let’s consider the case of pg_dump taking a logical backup from a different machine. If the network is fast, we may see a good amount of CPU utilization and “DataFileRead” as a wait event.
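For reference, the kind of command behind this scenario is a plain remote dump over TCP/IP; the host, user, and database names below are hypothetical:

# Logical backup taken from a remote client machine over the network
pg_dump -h db.example.com -p 5432 -U postgres -d appdb -Fc -f appdb.dump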

Of course, there are “ClientWrite” events, which is the wait event related to writing the data to the client (pg_dump, in this case). If the client is a lightweight tool like psql and the network is really fast, “ClientWrite” may even become invisible.

But let’s see what the wait events look like if the network slows down.

We can see that the CPU utilization and “DataFileRead” wait events have dropped, indicating that overall server-side session activity has slowed down. At the same time, “ClientWrite” has shot up to 1,821, indicating that the session is spending considerable time sending data to its client (pg_dump). There is also “ClientRead,” indicating that the acknowledgment from pg_dump is taking time.

The spike in “ClientWrite” does not depend on the client tool. The following is the screenshot of a regular psql session running a query that retrieves a large number of records.

This excessive “ClientWrite” is good enough to spot the problem in these cases.

Case 2: Bulk data loading

This is the opposite of the previous case, and PostgreSQL has a lot more work to do for a write operation with bulk data. The following wait events were captured over a really fast, low-latency network.
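For reference, a bulk load like this could be driven with a client-side COPY, so all the data travels over the network connection; the host, table, and file names below are made up:

# \copy reads the file on the client side and streams it to the remote server
psql -h db.example.com -d appdb -c "\copy transactions FROM 'transactions.csv' WITH (FORMAT csv)"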

Obviously, the PostgreSQL process has to spend time in “DataFileExtend,” “WALWrite,” and “WALSync.” Now, if the network slows down, many of those wait events may become invisible as a new performance bottleneck emerges.

The following are the wait events from the same bulk data load over a slower network.

As we can see, “ClientRead” has become the major wait event. This means the server session is spending more time reading from its client.

The change may not be dramatic in many systems, but overall “ClientRead” has become more prominent.

Case 3: Impact on transactions

One might ask what is so special about transactions. In an OLTP workload, individual statements may be too simple and small to cause any observable network impact. But the back-and-forth communication between the server and the client can result in unwanted delays between statements and the final commit or rollback.  Yes, I mean the delays/gaps between each statement.

The following are the wait events from micro-transactions generated with pgbench over a fast network; a comparable run is sketched below.
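The workload is presumably pgbench’s default TPC-B-like script; the connection details here are assumptions:

# Initialize the pgbench tables once, then run micro-transactions for 60 seconds
pgbench -i -s 10 -h db.example.com -U postgres appdb
pgbench -c 4 -j 2 -T 60 -h db.example.com -U postgres appdb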

Obviously, there are high WAL-related wait events and CPU usage, but we can see considerable “ClientRead” as well. This happens because micro-transactions involve a lot of network interaction. “ClientRead” is unavoidable for transactions, and it’s OK to expect 5-10% of it.

But as the network slows down, “ClientRead” becomes increasingly prominent. The following is the information from the same pgbench transaction workload over a slower network.

“ClientRead” became the biggest wait event in this case.

You might wonder what the “Net/Delay*” entry showing up here is. This additional analysis is available in the new version of pg_gather (version 21) to assess the delay outside the transaction block. See the next section for details.

Case 4: Connection utilization

As the network latency increases, the client connection won’t be able to utilize the server session to the extent possible. The server session has to either wait on “ClientRead”/”ClientWrite” or sit idle. Either way, it can drastically affect the throughput of the system.

Within a transaction, the delay is captured as “ClientRead,” but the delay between two transactions is not captured because the session becomes “idle” momentarily. The new version of pg_gather estimates these momentary switches to idle as wasted server time, reported as “Net/Delay*.” It could be due to network delays or poor application response; from the database side, it is difficult to distinguish between the two. But “Net/Delay*” can give a good idea of how much server time is wasted.
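pg_gather derives this estimate from its own samples; a very rough manual approximation is to watch how long client backends have been sitting idle since their last state change. The columns below are standard pg_stat_activity columns, and the database name is an assumption:

# Idle sessions and how long they have been idle since the last state change
psql -d postgres -c "SELECT pid, usename, state,
                            now() - state_change AS time_in_state
                     FROM pg_stat_activity
                     WHERE backend_type = 'client backend'
                       AND state = 'idle';"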

If it is possible to install PostgreSQL client tools on the application server, it is easy to simulate a load, study both the network delay and the application-side response delay, and compare them with the actual data.

The delay/latency becomes more visible when there is a lot of back-and-forth communication between the client and server. This can be easily tested by creating a single-statement file. (Idea taken from Hans-Jürgen.)

This can be executed against a remote database over a TCP connection for a specified number of seconds.
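A minimal sketch of that test, with hypothetical connection details, could look like this:

# A file containing a single trivial statement
echo "SELECT 1;" > select1.sql

# Run it in a loop from a single session against the remote server for 30 seconds
pgbench -h db.example.com -U postgres -n -f select1.sql -c 1 -T 30 appdb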

On a fast network between my servers, I could get the following result as TPS of a single session.

But the wait event analysis by pg_gather tells me that more time is spent on Net/Delay*.

It makes sense because “SELECT 1” doesn’t have much to do at the server, and this workload is all about back-and-forth communication.

With a local Unix socket connection, the single-session throughput more than doubled!
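Switching the same test to the local socket only changes the connection target; the socket directory below is an assumption (omitting -h entirely falls back to the default socket):

# Same single-statement loop, but over the local Unix socket
pgbench -h /var/run/postgresql -U postgres -n -f select1.sql -c 1 -T 30 appdb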

But the wait event analysis tells us that client-server communication is still a major time consumer. 🙁

This kind of highly interactive workload could benefit from server-side programming (a stored procedure/function) or even an extension. Interestingly, CPU usage is proportionally lower compared to the TPS when a Unix socket connection is used; that’s an important point to note. “ClientRead” increased because more data was transferred from the client.
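As a rough illustration of the server-side approach, a hypothetical function like the one below replaces thousands of client-server round trips with a single call; the loop body simply stands in for real per-statement work, and the connection details are made up:

# Create the function and call it in one round trip
psql -h db.example.com -U postgres -d appdb <<'SQL'
CREATE OR REPLACE FUNCTION exercise_server(n int) RETURNS bigint
LANGUAGE plpgsql AS $$
DECLARE
  total bigint := 0;
BEGIN
  FOR i IN 1..n LOOP
    total := total + 1;  -- placeholder for the real per-statement work
  END LOOP;
  RETURN total;
END;
$$;
SELECT exercise_server(100000);
SQL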

If the network slows down in this case, the “Net/Delay*” also increases, and CPU usage and TPS drop because the session spends more time doing nothing between processing two statements.

Since this particular workload doesn’t have transactions and sends very little data to the server, “ClientRead” can drop to an unnoticeable level, as we see.

Summary

The “wait events” information from pg_stat_activity can tell us many details about performance and network congestion. Not just the sum of events, but the gap between two wait events and their pattern carry a lot of information to dig into. Properly collected and analyzed data tells the story from the PostgreSQL perspective: how it experiences the network. More importantly, the analysis becomes independent of the database hosting or OS-level tools. There is no need for sophisticated tools or frameworks to achieve this. Standalone SQL scripts like pg_gather can be handy in spotting problems and bottlenecks. Even though this blog post is specifically about the network, wait event analysis can be applied to many other cases.

Percona Distribution for PostgreSQL provides the best and most critical enterprise components from the open-source community in a single distribution, designed and tested to work together.

 

Download Percona Distribution for PostgreSQL Today!
