One of the most frequent concerns we see in the Percona support team is: Why is my pg_wal growing? / pg_wal is filling up the filesystem / pg_wal does not release the space, and so on.

In PostgreSQL, the pg_wal directory plays a crucial role in the Write-Ahead Logging (WAL) mechanism, which ensures transaction durability and crash recovery. WAL files are also required for replication, if it is configured.

Sooner or later, administrators may encounter situations where the pg_wal directory accumulates files, gradually consuming disk space and potentially leading to a full filesystem and database crashes.

This blog will cover some common reasons why the pg_wal directory grows, how to troubleshoot them, and strategies to mitigate the problem.

High transaction rate spikes

Problem: High transaction rates or spikes. PostgreSQL writes every change to WAL before the corresponding data files are updated, so WAL segments are generated as transactions are processed. If the transaction rate exceeds the system's capacity to archive or recycle these segments, they accumulate in the pg_wal directory, leading to disk space exhaustion.

The archiving speed does not need to absorb every burst instantly, but it is expected to keep pace with the average WAL generation rate in the pg_wal directory. If the archiving speed falls significantly behind the WAL segment creation rate for too long, pg_wal will start accumulating files until they are archived. If pg_wal does not have enough space to hold an uncommon/unplanned load, it can run out of space.

Troubleshooting: Identify the WAL segment creation speed in pg_wal and compare it with the archiving speed.
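
A minimal way to approximate both rates from psql is sketched below; the LSN values and the 60-second interval are only placeholders for illustration.

-- Sample the current WAL insert position twice (e.g., 60 seconds apart):
SELECT pg_current_wal_lsn();   -- first sample
-- ... wait ~60 seconds ...
SELECT pg_current_wal_lsn();   -- second sample
-- Bytes of WAL generated between the two samples (replace with the LSNs you captured):
SELECT pg_size_pretty(pg_wal_lsn_diff('0/2A001000', '0/29000000')) AS wal_generated;
-- Archiver progress over the same interval; archived_count and last_archived_time
-- should advance at a similar pace:
SELECT archived_count, last_archived_wal, last_archived_time FROM pg_stat_archiver;

If archived_count barely moves while the WAL generation figure keeps growing, archiving is the bottleneck.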

Solution: If the archiving speed is not fast enough, improve the I/O subsystem used for archiving (or reduce network latency when archiving to S3 or other cloud storage). Otherwise, assess how long the workload spikes last and add extra space to pg_wal to absorb them.

Inefficient checkpointing

Problem: Checkpoints in PostgreSQL are crucial for flushing modified data from memory to disk and recycling obsolete WAL segments. However, inefficient checkpointing strategies, whether too infrequent or too aggressive, can contribute to the accumulation of WAL files. Infrequent checkpoints result in prolonged retention of WAL segments, while overly aggressive checkpoints may lead to excessive disk I/O and WAL generation.

Troubleshooting: Assess the checkpoint and WAL configuration against the database workload (checkpoint_timeout, min_wal_size, max_wal_size, wal_keep_size/wal_keep_segments, bgwriter_lru_maxpages, bgwriter_delay, etc.).
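
The queries below are one way to review the relevant settings and how often checkpoints are requested versus timed. Note that wal_keep_size replaces wal_keep_segments from PostgreSQL 13 on, and on PostgreSQL 17+ the checkpoint counters live in pg_stat_checkpointer rather than pg_stat_bgwriter.

-- Current checkpoint/WAL/bgwriter settings:
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('checkpoint_timeout', 'checkpoint_completion_target',
               'min_wal_size', 'max_wal_size', 'wal_keep_size',
               'bgwriter_delay', 'bgwriter_lru_maxpages');
-- Timed vs. requested checkpoints (PostgreSQL 16 and earlier):
SELECT checkpoints_timed, checkpoints_req, buffers_checkpoint, buffers_clean
FROM pg_stat_bgwriter;

A high checkpoints_req count relative to checkpoints_timed usually indicates that max_wal_size is too small for the workload.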

Solution: Find a proper trade-off between checkpoint frequency and bgwriter efficiency.

Replication lag

Scope: This only applies when streaming replication is configured.

Problem: Delays in applying changes on standby servers can exacerbate the accumulation of WAL files on the primary server. A standby server might fall behind its primary due to network issues or slowness, high load, or hardware resource constraints, so the primary server retains WAL segments until they are successfully replayed on the standby. This delay can strain disk space availability on the primary server.

In addition to the above, a misconfiguration of the wal_keep_size/wal_keep_segments parameters might contribute to the space exhaustion.

Abandoned replication slots will also hold WAL segments indefinitely.

Troubleshooting: Verify the replication lag between the primary and its standbys. Verify the configuration of wal_keep_segments/wal_keep_size (depending on your database version). Look for abandoned replication slots on the primary server.
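
On the primary, pg_stat_replication and pg_replication_slots expose the information needed for these checks; the queries below are a starting point.

-- Lag per connected standby:
SELECT application_name, state,
       pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn)) AS replay_gap,
       write_lag, flush_lag, replay_lag
FROM pg_stat_replication;
-- Replication slots and how much WAL each one is retaining:
SELECT slot_name, slot_type, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;

An inactive slot (active = false) with a large retained_wal value is a likely candidate for removal.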

Solution: Improve the network or I/O performance on the standby server (or address any other hardware bottleneck). Drop any abandoned replication slots. Adjust the wal_keep_segments/wal_keep_size configuration according to the replication performance and the pg_wal directory capacity on the primary.
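
Dropping an abandoned slot is a single call; the slot name below is only an example.

-- Remove a replication slot that is no longer needed (it must not be in active use):
SELECT pg_drop_replication_slot('old_standby_slot');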

Failing WAL archive

Scope: This only applies when the database runs with continuous archiving enabled (archive_mode is set to on and archive_command is configured).

Problem: If the archiver process fails to execute the command configured in archive_command, WAL segments will remain in pg_wal until the archiver succeeds.

The most common causes are a full disk/filesystem at the destination archive_command writes to, missing paths, wrong privileges, timeouts, unreachable destinations, and an incorrect archive_command.

Troubleshooting: Whenever archive_command fails, an error message is written to the PostgreSQL log.
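
Besides the log, pg_stat_archiver tracks failures, and on PostgreSQL 12+ pg_ls_archive_statusdir() shows how many segments are still waiting to be archived; for example:

-- Failure counters; the exact error text is in the PostgreSQL log:
SELECT failed_count, last_failed_wal, last_failed_time FROM pg_stat_archiver;
-- Segments waiting to be archived (".ready" status files), PostgreSQL 12+:
SELECT count(*) AS segments_waiting
FROM pg_ls_archive_statusdir()
WHERE name LIKE '%.ready';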

Solution: Identify why archive_command is failing and fix it.

Retention policies

Problem: Misconfigured or inadequate retention policies for WAL archives can also contribute to accumulating files in the pg_wal directory. If obsolete WAL segments are not removed promptly, the directory may become bloated with unnecessary files, consuming disk space that could be used for other purposes.

Troubleshooting: Review the aforementioned min_wal_size, max_wal_size, and wal_keep_size/wal_keep_segments parameters. Review the PostgreSQL log for failing archive events.
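
A quick way to see how much WAL is currently sitting in pg_wal (and whether it keeps growing after parameter changes) is pg_ls_waldir(), available since PostgreSQL 10 and restricted to superusers and members of pg_monitor by default:

-- Number and total size of files in pg_wal:
SELECT count(*) AS wal_files, pg_size_pretty(sum(size)) AS pg_wal_size
FROM pg_ls_waldir();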

Solution: Improve the parametrization and fix the causes of failing archive events.

Conclusion

The pg_wal directory in PostgreSQL is critical for ensuring data durability and recovery, but its unchecked growth can lead to disk space constraints and operational challenges. By understanding the common reasons behind the accumulation of files in the pg_wal directory and implementing appropriate strategies such as tuning checkpoint settings, monitoring replication lag, and enforcing efficient retention policies, administrators can effectively manage disk space usage and maintain the stability and performance of their PostgreSQL databases.

Since WAL segments play a critical role in PostgreSQL databases, you should never manually remove WAL segments from pg_wal. Doing so might lead to database crashes, failed crash recoveries, failed WAL archive events, and incomplete backup copies.

Percona Distribution for PostgreSQL provides the best and most critical enterprise components from the open-source community in a single distribution, designed and tested to work together. Run PostgreSQL in your production and mission-critical environments and easily deploy and orchestrate reliable PostgreSQL in Kubernetes.


Download Percona Distribution for PostgreSQL Today!
