In this blog post, we will discuss how to extend Percona Monitoring and Management (PMM) to get PostgreSQL metrics on checkpointing activity, internal buffers, and WAL usage. With this data, we’ll be able to better understand and tune our Postgres servers.

We’ll assume there are working PostgreSQL and PMM environments set up already. You can search the blog for more information on this if needed. For generating load, we used pgbench with the following commands:
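The exact parameters will depend on your environment, but as a minimal sketch (the host, database name, scale factor, client count, thread count, and duration below are illustrative), an initialization plus a sustained read/write run could look like this:

```bash
# Initialize the pgbench schema; -s sets the scale factor (illustrative value)
pgbench -h 127.0.0.1 -U postgres -i -s 100 pgbench_db

# Run a read/write workload: 8 clients, 4 worker threads, for one hour,
# printing progress every 60 seconds (all values are illustrative)
pgbench -h 127.0.0.1 -U postgres -c 8 -j 4 -T 3600 -P 60 pgbench_db
```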

Creating a custom query collector

Note: This step will not be needed with soon-to-come postgresql_exporter PMM versions!

The first step is to include a new query that gathers data on WAL usage, since this is not yet collected by default in the latest version of PMM. Note, however, that only three graphs use these custom metrics, so if you want to try out the dashboard without this customization, you can skip this step (at the expense of a bit less data).

For this, it’s as easy as executing the following (in the PMM client node that is monitoring the Postgres servers):
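A minimal sketch of such a custom query file follows, assuming the default PMM2 custom-queries directory and using an illustrative query name (pg_wal_position) and column names; adjust the path and names to your setup:

```bash
# Directory where pmm-agent looks for custom PostgreSQL queries
# (the exact path may vary with your PMM client version)
QUERIES_DIR=/usr/local/percona/pmm2/collectors/custom-queries/postgresql/high-resolution

sudo tee "$QUERIES_DIR/queries-postgresql-wal.yml" > /dev/null <<'EOF'
# Illustrative custom query: expose WAL pointer locations as byte offsets
pg_wal_position:
  query: >-
    SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0') AS current_lsn_bytes,
    pg_wal_lsn_diff(pg_current_wal_insert_lsn(), '0/0') AS insert_lsn_bytes,
    pg_wal_lsn_diff(pg_current_wal_flush_lsn(), '0/0') AS flush_lsn_bytes
  metrics:
    - current_lsn_bytes:
        usage: "COUNTER"
        description: "WAL write location, in bytes since cluster initialization"
    - insert_lsn_bytes:
        usage: "COUNTER"
        description: "WAL insert location, in bytes since cluster initialization"
    - flush_lsn_bytes:
        usage: "COUNTER"
        description: "WAL flush location, in bytes since cluster initialization"
EOF

# Depending on the PMM version, the exporter may need a restart to pick up the new file
sudo systemctl restart pmm-agent
```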

This will result in the following new metrics on PMM:
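With the illustrative file above, the exporter builds metric names from the query key plus each column, so we would expect new series such as pg_wal_position_current_lsn_bytes, pg_wal_position_insert_lsn_bytes, and pg_wal_position_flush_lsn_bytes (your names will differ if you chose different keys or columns).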

Importing the custom PMM dashboard

To view these new metrics, we will import a new custom PMM dashboard that presents this information in a structured and easy-to-understand way. Since I’ve uploaded it to Grafana Labs, this takes only a few easy steps.

1. Click on the Dashboards -> Import menu: 

Percona Monitoring and Management Dashboard

2. Import via grafana.com with ID number (19600):

Import via Grafana

3. Select the PostgreSQL folder and Metrics, and import it:

Import dashboard from Grafana

After this, you’ll be taken to the dashboard, which I generally “star” for easy access later on:

Using the new dashboard

The dashboard is divided into three main areas, all of which have relevant configurations at the top and graphs following them. The main idea is to be able to easily detect what the server is doing at any point in time so we can better understand how it reacts to the workload and tune accordingly. These graphs and panels not only show changes in usage and workload patterns but also changes in configuration, which can help during root cause analysis and performance reviews.

Checkpointing section

Here, we will find everything related to checkpointing: when checkpoints are started (and what triggers them) and when they finish. Checkpointing has a big impact on Postgres’ write throughput and I/O utilization, so it’s important to make sure it’s not being triggered before it’s needed.

In the following example, we can see that at the beginning there was a lot of forced checkpointing: not only was the checkpointer running more often than once every checkpoint_timeout seconds, but the Checkpoints graph also classifies those checkpoints as requested rather than scheduled. We can’t see the cause yet because the WAL graphs are at the bottom of the dashboard, but this was fixed after we increased the value of max_wal_size.

Checkpointing PostgreSQL

Additionally, after the metrics refresh, we can see that the Checkpoint Timeout stat panel shows another change closer to the end of our currently selected time range:

This means that we have also changed the checkpoint_timeout value. This was done only after our workload was more stable and there were no more forced checkpoints being issued.
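If you want to cross-check what the dashboard shows from the database side, a quick sketch like the following can help; the counters live in pg_stat_bgwriter (they moved to pg_stat_checkpointer in recent PostgreSQL versions), and the target values below are purely illustrative, not recommendations:

```bash
# Compare scheduled (timed) vs. requested (forced) checkpoints
psql -U postgres -c "SELECT checkpoints_timed, checkpoints_req FROM pg_stat_bgwriter;"

# Raise max_wal_size so WAL growth between checkpoints no longer forces them;
# this setting only needs a reload, not a restart (value is illustrative)
psql -U postgres -c "ALTER SYSTEM SET max_wal_size = '4GB';"
psql -U postgres -c "SELECT pg_reload_conf();"

# Once forced checkpoints are gone, checkpoint_timeout can be adjusted as well
psql -U postgres -c "ALTER SYSTEM SET checkpoint_timeout = '15min';"
psql -U postgres -c "SELECT pg_reload_conf();"
```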

Buffers section

This section holds configurations and metrics related to reads into and writes from the shared buffers. Again demonstrating by example, let’s see how much of an impact correctly sizing shared_buffers has on our test workload. It’s set to 128MB by default, which is hardly enough for any serious workload and means we will see a lot of churn in both reads and writes until we increase it. The only downside is that modifying shared_buffers requires a restart, so before doing that, we can increase bgwriter_lru_maxpages (which doesn’t need a restart) so the background writer is no longer stopped each round, gaining a bit more performance at the expense of I/O.
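As a rough sketch of that sequence (the values are illustrative, not recommendations), bgwriter_lru_maxpages only needs a configuration reload, while shared_buffers requires a restart:

```bash
# bgwriter_lru_maxpages can be raised with a simple reload
psql -U postgres -c "ALTER SYSTEM SET bgwriter_lru_maxpages = 1000;"
psql -U postgres -c "SELECT pg_reload_conf();"

# shared_buffers only takes effect after a restart (value is illustrative;
# a common starting point is around 25% of the server's RAM)
psql -U postgres -c "ALTER SYSTEM SET shared_buffers = '4GB';"
sudo systemctl restart postgresql
```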

We can clearly see how, at first, there were constant reads into the shared buffers, and the background writer was doing most of the writes. We can also see the positive impact of increasing bgwriter_lru_maxpages, because that process is no longer being throttled. After we increased shared_buffers and restarted Postgres, reads dropped to almost nothing (little has to be fetched from the page cache or disks), and the writes are no longer done by the background writer but by the checkpointer.

WAL usage section

The last section pertains to write-ahead logs, which are another source of potential performance bottlenecks. Note that (for now) this is the only section that needs data coming from the custom query collector we added at the beginning, so you can use the rest of the dashboard even if you decide not to include it. PMM already collects the configurations, so we will only miss data from the graphs.

The WAL Writes Per Checkpoint Timeout graph will group the number of bytes that were written in chunks of checkpoint_timeout seconds so we can have a clear view of the expected max_wal_size value. If the chunks (shown as orange bars) exceed the max_wal_size (shown as a red line), it means that there will be forced checkpointing. We can easily dimension max_wal_size knowing that it should be larger than any orange bar, and we can tell when exactly it was changed, following changes in the red line. The WAL Writes graph uses the same metric but is shown as a rate, which can help us pinpoint heavy write times more granularly. Lastly, the WAL Locations graph is included for completeness and shows the different locations for WAL pointers (such as insertion and flush locations), which, according to the documentation, are mainly used for debugging purposes.
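If you want to sanity-check the same idea outside the dashboard, a rough sketch is to sample the WAL write location twice, extrapolate the bytes generated per checkpoint_timeout period, and compare that against max_wal_size (the 60-second interval below is illustrative):

```bash
# Sample the current WAL location, wait, and sample again to estimate
# how many WAL bytes are generated over the interval
START=$(psql -U postgres -At -c "SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0');")
sleep 60
END=$(psql -U postgres -At -c "SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), '0/0');")
echo "WAL bytes generated in 60s: $((END - START))"
```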

Hints and tips

Most graphs contain additional information and links on either the configuration variables or the metrics they show, so they are useful in case we need a bit of help interpreting them. Just hover the mouse over the “i” icon on each of them:

Checkpointing PostgreSQL

Getting even more data

Starting from PostgreSQL 14, we have a new view with WAL metrics: pg_stat_wal.
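For reference, a quick look at that view can be as simple as the query below (wal_write_time and wal_sync_time are only populated when track_wal_io_timing is enabled):

```bash
# Cluster-wide WAL activity since the last statistics reset
psql -U postgres -c "SELECT wal_records, wal_fpi, wal_bytes, wal_buffers_full,
                            wal_write, wal_sync, wal_write_time, wal_sync_time, stats_reset
                     FROM pg_stat_wal;"
```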

The metrics used in this custom collector and dashboard will work for older versions, but in the future, it will be nice to see what data we can extract from this new view and how we can use it to tune our servers.

Conclusion

This is another example of how tunable and powerful PMM is: not only can we easily add our own custom metrics, but it also already collects many metrics we can use out of the box. Additionally, it’s easy to share new PMM dashboards via the Grafana Labs page by simply publishing them and sharing the ID number. With all this new information at hand, we can now better tune our PostgreSQL instances!

Lastly, we are working closely with the PMM Dev team to include this new dashboard in the PMM server itself, so all these custom steps won’t be needed in future releases. Let us know if you have any comments on it.

Percona Monitoring and Management is a best-of-breed open source database monitoring solution. It helps you reduce complexity, optimize performance, and improve the security of your business-critical database environments, no matter where they are located or deployed.

 

Download Percona Monitoring and Management Today
