This blog post was originally published in January 2023 and was updated in June 2023.

In PostgreSQL, the vacuum command is a maintenance task that helps to optimize database performance and reclaim disk space. Using the PostgreSQL vacuum command involves removing deleted or outdated rows from tables and indexes and updating statistics used by the query planner. This process is necessary to prevent the accumulation of unnecessary data, known as “dead tuples” or “dead rows,” which can take up significant space and slow down queries.

 

Get database support for PostgreSQL

 

Multi-version concurrency control (MVCC)

To maintain consistency and prevent data loss due to concurrent updates, PostgreSQL employs multi-version concurrency control (MVCC). PostgreSQL and other database management systems use MVCC to ensure consistent reads and prevent data loss from concurrent updates. PostgreSQL is achieved by storing multiple versions of each row within the database, allowing transactions to access a consistent data snapshot.

In PostgreSQL databases, each row within a single table is assigned a transaction ID referred to as an “xmin”. This ID signifies the transaction that inserted the row. When a row is updated or deleted, it is not immediately removed from the table. Instead, a new version of the row is inserted with a new transaction ID, while the old version is marked as “dead” with a transaction ID called an “xmax”.

When a transaction reads a row, it utilizes the xmin and xmax values to determine whether the row is visible to the transaction. If the xmin value is greater than the transaction’s “snapshot” (a record of the transaction IDs that were in progress when the transaction began), the row is not visible to the transaction. If the xmax value is equal to the transaction’s ID, the row has been deleted by the transaction and is also not visible. In all other cases, the row is visible to the transaction.

This allows transactions to access a consistent data snapshot, as they can only see rows committed when the transaction began. It also prevents data loss due to concurrent updates, as conflicting updates result in inserting new row versions rather than overwriting the existing data.

Although MVCC incurs some overhead in terms of storage and performance due to the need to maintain multiple versions of each row, it is a crucial feature of PostgreSQL and other database systems that support concurrent updates.

This allows multiple versions of each row to be stored, enabling transactions to access a consistent data snapshot. However, this can result in the accumulation of dead tuples as rows are updated or deleted.

PostgreSQL vacuuming

The vacuum process in PostgreSQL helps maintain the database’s performance and disk space efficiency by removing rows that are no longer needed.

These rows accumulate because PostgreSQL uses MVCC to allow multiple transactions to access the same data simultaneously without conflicts. During the vacuum process, tables and indexes are scanned, and these dead tuples are removed, helping to reclaim space and improve query performance. It is essential to run a vacuum to keep the database running smoothly periodically. MVCC stores multiple versions of each row, so dead tuples are not immediately removed when a row is updated or deleted.

PostgreSQL vacuum configuration parameters

Vacuum Full

The VACUUM FULL command in PostgreSQL is a tool used to reclaim disk space occupied by deleted or obsolete data. Unlike the regular VACUUM command, it performs a more thorough cleanup by physically rearranging the data on the disk. This ensures that all dead tuples are completely removed, resulting in a more compact storage structure and improved query performance.

VACUUM FULL requires exclusive access to the table, so while the command runs, other transactions cannot read from or write to the table, potentially impacting database availability. It’s important to know this is a resource-intensive operation that may cause database disruptions and longer execution times.

Verbose

The term “Verbose” is used when referring to Postgres log output. Normally Postgres logs only the barest of information when it’s first set up, i.e., STARTUP, SHUTDOWN, ERROR, and FATAL messages. Log output verbosity increases substantially when the logging parameters are set to record additional information relating to client access details and Postgres background processes such as CHECKPOINTs and VACUUM. The most “verbose” setting is when each and every SQL statement is logged and which can increase the Postgres log size generating GB of messages in a matter of a few minutes.

PostgreSQL Vacuum and Analyze

The vacuum process removes dead tuples and updates statistics used by the query planner to more accurately estimate the number of rows returned by a query and choose the most efficient execution plan. There are two types of vacuum in PostgreSQL: VACUUM and ANALYZE. VACUUM removes dead tuples and updates statistics, while ANALYZE only updates statistics. It is generally recommended to run both VACUUM and ANALYZE together.

The vacuum command can be initiated manually using SQL commands or automated using the autovacuum background process, which runs based on configurable thresholds such as the number of dead tuples or live rows in a table. In PostgreSQL 15, the vacuum process has been optimized to make it more efficient and faster to vacuum large tables. It includes improvements such as the ability to vacuum multiple partitions of the same table in parallel, vacuum indexes concurrently, and skip vacuuming indexes unaffected by an update.

From a technical perspective, the vacuum process in PostgreSQL 15 involves several components. The vacuum daemon (autovacuum) initiates vacuum operations based on configurable thresholds. The vacuum worker executes the actual vacuum operation, scanning the table or index and removing dead tuples while updating statistics.

Autovacuum

Autovacuum, which automates routine vacuum maintenance, is enabled by default in PostgreSQL and can be configured using several parameters in postgresql.conf file. PostgreSQL has several settings related to vacuum that can be configured to control how the vacuum process runs. You can find the following settings in postgresql.conf file and include:

  1. autovacuum: This setting enables or disables the autovacuum background process. By default, autovacuum is enabled.
  2. autovacuum_vacuum_threshold: This setting determines the minimum number of dead rows that must be present in a table before it is vacuumed. The default value is 50.
  3. autovacuum_analyze_threshold: This setting determines the minimum number of live rows that must be present in a table before it is analyzed. The default value is 50.
  4. autovacuum_vacuum_scale_factor: This setting is a multiplier that determines how many dead rows are needed to trigger a vacuum based on the table size. The default value is 0.2.
  5. autovacuum_analyze_scale_factor: This setting is a multiplier that determines how many live rows are needed to trigger an analyze based on the size of the table. The default value is 0.1.
  6. autovacuum_vacuum_cost_delay: This setting determines the time (in milliseconds) the autovacuum will wait before starting a vacuum operation. The default value is 20.
  7. autovacuum_vacuum_cost_limit: This setting determines the maximum number of rows that can be vacuumed in a single vacuum operation. The default value is 200.

Here is an example of configuring some of the vacuum parameters in the postgresql.conf file:

In this example, autovacuum is enabled, and the thresholds for vacuum and analyze are set to 100. The scale factors for vacuum and analyze are set to 0.5 and 0.2, respectively, which means that a vacuum will be triggered when there are 50 dead rows per 1,000 live rows in the table (0.5 x 100), and an analyze will be triggered when there are 20 live rows per 1,000 rows in the table (0.2 x 100). The vacuum cost delay is set to 50 milliseconds, and the vacuum cost limit is set to 500 rows, which means that the autovacuum will wait 50 milliseconds before starting a vacuum operation and will vacuum a maximum of 500 rows at a time.

It is important to configure these settings to ensure that vacuum and analyze properly are running effectively and not causing too much load on the database. It is also a good idea to monitor the activity of the autovacuum and manually vacuum tables that are not adequately maintained by the autovacuum.

Autovacuum can also be configured on a per-table basis using the autovacuum_vacuum_cost_delay and autovacuum_vacuum_cost_limit parameters in the table’s storage parameters. These parameters control how aggressively autovacuum vacuums the table, with a lower cost delay causing the vacuum to run more frequently and a higher cost limit allowing more rows to be vacuumed at once.

Parallel vacuuming

Parallel vacuum is a feature in PostgreSQL that allows the vacuum process to be run concurrently on multiple cores or processors for the same table, improving the performance of the vacuum operation. This can be especially useful for vacuuming large tables, as it allows the vacuum process to use multiple CPUs to scan and process the table in parallel.

The parallel vacuum was introduced in PostgreSQL 13 as an experimental feature and made generally available in PostgreSQL 14. To use a parallel vacuum, you need to set the “max_parallel_workers_per_gather” parameter in the postgresql.conf configuration file to a value greater than one. To enable parallel processing, specify the “PARALLEL” option when running a vacuum or analyze command.

For example, to run a parallel vacuum on a table named “foo_table”, you can use the following command:

You can also specify the “PARALLEL” option when running a vacuum or analyze command on an entire schema or database:

Note: Keep in mind that parallel vacuum can increase the load on the database server, as it requires multiple CPU cores to be used simultaneously. You should carefully monitor the performance of your PostgreSQL database when using a parallel vacuum and adjust the “max_parallel_workers_per_gather” parameter as needed to find the optimal setting for your workload.

 

Discover why Percona and PostgreSQL work better together

 

Transaction wraparound

Transaction wraparound is a phenomenon that can occur in PostgreSQL when the maximum transaction ID (TXID) has been reached, and the system wraps around to reuse old transaction IDs. This can cause problems if there are still rows in the database with a higher transaction ID than the current maximum, as they will be considered “dead” and removed by the vacuum process.

To understand how transaction-wraparound works, it is essential to understand how PostgreSQL manages transaction IDs. PostgreSQL transactions are assigned a unique transaction ID called a “xid.” The xid is a 32-bit integer, meaning it has a maximum value of 2^32-1, or 4,294,967,295. When this maximum value is reached, the system wraps around and reuses old xids.

To prevent transaction wraparound from causing problems, it is important to run the vacuum process and remove dead rows regularly. You can use the following query to check for tables at risk of transaction wraparound.

Vacuum statistics

To view the vacuum history for all tables in the current schema:

This query retrieves vacuum statistics for all tables in the “public” schema in the current database. This query can help monitor the status of the vacuum process and identify tables that may need to be vacuumed or analyzed. For example, if a table has many dead rows or has not been vacuumed or analyzed recently. In that case, it may be worth running a manual vacuum or analyzing operations to improve the performance of the database. Download our free e-book, “The 6 Common Causes of Poor Database Performance,” to see the most common database performance issues and get expert insights on how to fix them.

To view the list of tables that have been modified since the last vacuum:

To view the list of tables and indexes that have a high number of dead rows:

You can write the above query for all the databases using dblink. This will give information about all the databases.

Note: You need to replace the placeholders <host>, <port>, <database>, <username>, and <password> with the actual values for your server.

Conclusion

It is important to properly configure the autovacuum to avoid overloading the database with too many vacuums. It is also a good idea to monitor the activity of the autovacuum and manually vacuum tables that are not adequately maintained by the autovacuum.

In summary, PostgreSQL vacuuming is an essential maintenance task in PostgreSQL that helps to reclaim space and improve performance. t is generally recommended to run the vacuum regularly, either manually using the VACUUM SQL command or automatically using the autovacuum background process. This helps to ensure that the database remains efficient and free of dead rows and reduces the risk of transaction wraparound. Autovacuum is a convenient way to automate this process, but it is important to configure and monitor it to ensure it runs effectively and adequately.

Percona Distribution for PostgreSQL provides a complete package for PostgreSQL in a single installation. The components are tested to work together and continuously updated. This means you no longer need to find solutions for common requirements, like high availability and backup, on your own. Percona database experts are the best in the industry when it comes to tuning your complex PostgreSQL database environment for scaling and performance. Join the growing list of thousands of companies and millions of developers that use Percona software every day and learn more about Percona Distribution for PostgreSQL.

 

Discover why Percona and PostgreSQL work better together

 

FAQs

What is vacuum in PostgreSQL?

The term “vacuum” refers to a database maintenance procedure that reclaims storage space and optimizes database performance. Whenever data is inserted, updated, or deleted in a PostgreSQL database, it can generate “dead tuples” – rows that have become obsolete or are inaccessible.

The VACUUM operation in PostgreSQL identifies and eliminates these dead tuples, freeing up disk space to utilize for future operations.

What is the difference between vacuum full and vacuum freeze in PostgreSQL?

Both the VACUUM FULL and VACUUM FREEZE operations are both used in PostgreSQL for maintenance purposes, but they serve different functions.

The VACUUM FULL operation reclaims storage space by physically rewriting the entire table, relocating live tuples to fresh data pages, and discarding obsolete tuples. This can be useful when a significant amount of data has been updated and/or deleted, which results in fragmented space within the table. This operation effectively recaptures the space, but it comes with a cost; longer execution time and exclusive locks on the table, impacting concurrency.

VACUUM FREEZE addresses the issue of transaction ID wraparound. In PostgreSQL, every transaction is associated with a unique transaction ID, represented by a 32-bit number. As transactions progress, the transaction ID incrementally increases, and when it reaches its maximum value, it wraps around to zero, causing issues with transaction visibility and data integrity. The FREEZE operation marks all tuples in a table as frozen, meaning their transaction IDs are set to the maximum value, preventing wraparound and ensuring these tuples are marked as old and not subject to removal during VACUUM operations.

What is the vacuum threshold in PostgreSQL?

The “autovacuum_vacuum_threshold” parameter in PostgreSQL determines the minimum number of updated or deleted tuples required in a table before the autovacuum process is triggered. The default is 50 tuples, meaning that if 50 or more tuples are modified in a table, autovacuum will be triggered. However, you can adjust this value in the PostgreSQL configuration file (postgresql.conf) or by changing table storage parameters.

What is the difference between vacuum and vacuum full?

VACUUM: The VACUUM command analyzes and reclaims storage space in a PostgreSQL database. When rows in a table are deleted or updated, the space they once occupied is not immediately freed up. Rather, it becomes marked as available for future use. This command does not need access to the table, enabling other transactions to read from and write to the table concurrently.

VACUUM FULL: This is a more forceful form of vacuuming, reclaiming space by physically relocating the data within the table, ensuring the removal of all defunct rows, and compacting the storage space. It can take longer to complete than regular vacuuming and requires exclusive access to the table, making it inaccessible for reading or writing by other transactions while the command is in progress.

The primary distinction between VACUUM and VACUUM FULL lies in the level of aggressiveness used to free up storage space. VACUUM is a lightweight and faster operation, marking the space as available for reuse, while VACUUM FULL physically rearranges the data and compresses the storage, resulting in a more thorough cleanup.

Subscribe
Notify of
guest

1 Comment
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
mtnfalcon

Good article. It refers to a query one can use to check for tables at risk of xid wraparound, but I don’t see the query I think the author intended to include. Feel free to delete this comment from me if you fix the missing query.