This blog was originally published in April 2023 and was updated in April 2024.

Indexes are often considered a panacea for SQL performance tuning, and PostgreSQL supports different types of indexes catering to different use cases. I keep seeing articles and talks on “tuning” that discuss how creating new indexes speeds up SQL, but rarely ones that discuss removing them. The urge to create more and more indexes causes severe damage in many systems. Many times, removing indexes is what we should do first, before considering any new indexes, for the benefit of the entire system. Surprised? Understanding the consequences and overhead of indexes can help us make an informed decision and potentially save the system from many problems.

At a very basic level, we should remember that indexes are not free. Their benefits come at a cost in performance and resource consumption. The following is a list of ten problems/overheads that excessive use of indexes can cause. This post is about PostgreSQL, but most of the problems also apply to other database systems.

10 Ways Excessive Indexes Can Hurt PostgreSQL Performance

1. Indexes penalize the transactions

We might see an improvement in the performance of a SELECT statement after adding an index. But we should not forget that the performance gain comes at a cost to transactions on the same table. Conceptually, every DML on a table needs to update all of the table’s indexes. Even though there are many optimizations for reducing this write amplification, it remains a considerable overhead.

For example, let’s assume there are five indexes on a table; every INSERT into the table will also insert an index entry into each of those five indexes. Logically, five index pages will also be updated. So, effectively, the write overhead is 5x.
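
We can observe this on a scratch database with a quick psql session (a minimal sketch; the table, columns, and row counts are made up for illustration, and the actual slowdown depends on hardware and configuration):

    -- Baseline: load rows into a table with no indexes.
    CREATE TABLE t (id bigint, a int, b int, c int, d int, e int);
    \timing on
    INSERT INTO t SELECT n, n, n, n, n, n FROM generate_series(1, 1000000) n;

    -- Add five indexes and repeat the same load.
    CREATE INDEX ON t (a);
    CREATE INDEX ON t (b);
    CREATE INDEX ON t (c);
    CREATE INDEX ON t (d);
    CREATE INDEX ON t (e);
    INSERT INTO t SELECT n, n, n, n, n, n FROM generate_series(1000001, 2000000) n;
    -- The second INSERT is typically several times slower than the first.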

2. Memory usage

Index pages must be in memory regardless of whether any query uses them, because they need to be updated by transactions. Effectively, this leaves less memory available for caching table pages. The more indexes there are, the more memory is required for effective caching. If we don’t increase the available memory, this starts hurting the performance of the entire system.
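
We can inspect how much of shared_buffers each relation occupies using the pg_buffercache contrib extension (a sketch, assuming the default 8 kB page size):

    CREATE EXTENSION IF NOT EXISTS pg_buffercache;

    -- Top 10 relations (tables and indexes) by space held in shared_buffers.
    SELECT c.relname,
           pg_size_pretty(count(*) * 8192) AS buffered
    FROM pg_buffercache b
    JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
    WHERE b.reldatabase = (SELECT oid FROM pg_database
                           WHERE datname = current_database())
    GROUP BY c.relname
    ORDER BY count(*) DESC
    LIMIT 10;

It is common to find index pages crowding out table pages in this list.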

3. Random writes: Updating indexes is more costly

Unlike an INSERT into a table, where new rows are generally appended to the same set of heap pages, an index entry must go to a specific position determined by its key value, so consecutive writes are far less likely to land on the same page. B-Tree indexes in particular are known to cause more random writes.

4. Indexes need more cache than tables

Due to random writes and reads, indexes need more of their pages to be in the cache. Cache requirements for indexes are generally much higher than those of the associated tables.
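
The cache behavior of index blocks versus heap blocks can be compared from the cumulative statistics views (a rough sketch; the counters accumulate since the last statistics reset):

    -- Tables whose indexes cause the most reads from outside the buffer cache.
    SELECT relname,
           heap_blks_read, heap_blks_hit,
           idx_blks_read, idx_blks_hit,
           round(idx_blks_hit * 100.0
                 / nullif(idx_blks_hit + idx_blks_read, 0), 2) AS idx_hit_pct
    FROM pg_statio_user_tables
    ORDER BY idx_blks_read DESC
    LIMIT 10;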

5. WAL generation

In addition to WAL records for table updates, there will also be WAL records for each index; these are needed for crash recovery and replication. If you are using any wait event analysis tool/script like pg_gather, the overhead of WAL generation will be clearly visible. The actual impact depends on the index type.

[Screenshot: pg_gather wait event analysis showing WAL-related wait events]

This is a synthetic test case, but if WAL-related wait events appear among the top wait events, it is a matter of concern for a transactional system, and we should take every step possible to address it.
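
On PostgreSQL 13 and later, the WAL generated by an individual statement can be inspected directly with EXPLAIN (a sketch; the table t is the hypothetical one from the earlier example):

    EXPLAIN (ANALYZE, WAL)
    UPDATE t SET a = a + 1 WHERE id <= 10000;
    -- The plan output includes a line like: WAL: records=... fpi=... bytes=...
    -- Compare the bytes before and after adding indexes to see the difference.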


6. More and more I/O

It’s not just that WAL records are generated; more pages get dirtied as well. As index pages get dirtied, they must be written back to files, leading to more I/O again; this shows up as the “DataFileWrite” wait event in the previous screenshot.

Another side effect is that indexes increase the size of the total active dataset. By “active dataset,” I mean the tables and indexes that are frequently queried and used. As the size of the active dataset increases, the cache becomes less and less efficient. A less effective cache results in more data file reads, so read I/O increases. This is in addition to the read I/O required to bring the additional index pages in from storage for specific queries.

Again, the pg_gather report of another system, one with mainly SELECT queries, shows this problem. As the active dataset grows, PostgreSQL has no choice but to bring pages in from storage.

[Screenshot: pg_gather report showing sustained “DataFileRead” wait events as the active dataset outgrows the cache]

A significant percentage of “DataFileRead” wait events sustained over a long duration indicates that the active dataset is much bigger than what can be cached.
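
Without a dedicated tool, a crude way to observe this is to repeatedly sample wait events from pg_stat_activity (a sketch; each run is a single snapshot, so poll it several times to see a trend):

    -- Snapshot of what the active backends are currently waiting on.
    SELECT wait_event_type, wait_event, count(*)
    FROM pg_stat_activity
    WHERE state = 'active'
    GROUP BY wait_event_type, wait_event
    ORDER BY count(*) DESC;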

7. Impact on VACUUM/AUTOVACUUM

The overhead is not only in inserting or updating index pages, as discussed in the previous points. There is also maintenance overhead, since indexes need cleanup of old (dead) tuple references.

I have seen cases where autovacuum workers on a single table run for a very long time because of the size of the table and, most importantly, the excessive number of indexes on it. In fact, it is very common for users to see an autovacuum worker apparently “stuck” for hours without showing any progress. This happens because index cleanup is the most opaque stage of autovacuum: views like pg_stat_progress_vacuum indicate only that the phase is “vacuuming indexes,” with no per-index progress.
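
To check whether a long-running (auto)vacuum is in the index-cleanup phase, we can query the progress view (a sketch):

    -- What each running vacuum worker is doing right now.
    SELECT p.pid, c.relname, p.phase,
           p.heap_blks_scanned, p.heap_blks_total,
           p.index_vacuum_count
    FROM pg_stat_progress_vacuum p
    JOIN pg_class c ON c.oid = p.relid;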

Indexes can get bloated and become less efficient over time.  Periodic index maintenance (REINDEX) might be needed in many systems.

8. Tunnel vision while tuning

Tunnel vision is the loss of the peripheral field of view. A user may concentrate on one particular SQL statement in an attempt to “tune” it and decide to create indexes for it. By creating an index to tune one query, we shift more system resources toward that query. It may then perform better, but at the expense of penalizing others.

But as we keep creating more and more indexes to tune other queries, the resources shift again toward those queries. This leads to a situation where the effort to tune every query penalizes every other query. Ultimately, everyone gets hurt, and there are only losers in this war. Anyone trying to tune a system should consider how every part of it can coexist (maximizing business value), rather than chasing the absolute maximum performance for a particular query.

9. Greater storage requirement

Almost every day, I see cases where indexes take up more storage than the tables themselves.

[Screenshot: a database where index storage exceeds table storage]
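
You can check this on your own system with a query like the following (a sketch that skips the system schemas):

    -- Tables whose indexes consume the most storage, compared with the table itself.
    SELECT relname,
           pg_size_pretty(pg_table_size(oid))   AS table_size,
           pg_size_pretty(pg_indexes_size(oid)) AS indexes_size
    FROM pg_class
    WHERE relkind = 'r'
      AND relnamespace NOT IN ('pg_catalog'::regnamespace,
                               'information_schema'::regnamespace)
    ORDER BY pg_indexes_size(oid) DESC
    LIMIT 10;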

This may sound trivial to those with money to spend on storage, but we should remember that it has a cascading effect. The total database size grows to a multiple of the actual data. So, obviously, backups take more time, storage, and network resources, and the same backups put more load on the host machine. The time to restore and recover a backup also increases. Bigger databases affect many other things, including the time needed to build standby instances.

10. Indexes are more prone to corruption

I am not just talking about the rarely occurring index-related bugs, like the silent index corruption in PostgreSQL 14 or index corruption due to glibc collation changes, which keep popping up now and then and still affect many environments today. Over decades of working with databases, I have observed that index corruption is reported far more frequently than table corruption. (I hope anyone who has been involved with PostgreSQL for years and has seen hundreds of cases will agree with me.) As we increase the number of indexes, we increase that probability.
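
B-Tree indexes can be verified with the amcheck contrib extension (a sketch; this checks only B-Tree indexes in the public schema and can be I/O intensive on large systems):

    CREATE EXTENSION IF NOT EXISTS amcheck;

    -- Run a structural check on every B-Tree index in the public schema.
    SELECT c.relname, bt_index_check(c.oid)
    FROM pg_class c
    JOIN pg_am am ON am.oid = c.relam
    WHERE c.relkind = 'i'
      AND am.amname = 'btree'
      AND c.relnamespace = 'public'::regnamespace;

PostgreSQL 14 and later also ship the pg_amcheck command-line wrapper for the same checks.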

What should we do?

A set of critical questions should accompany every new index consideration: Is this index essential, or is it just speeding up one query at the cost of one more index for the whole system to maintain? Is there a way to rewrite the query to get better performance? Is it OK to forgo the small gains and live without the index?

Existing indexes also require a critical review over time. All unused indexes (those with idx_scan as zero in pg_stat_user_indexes) should be considered for dropping. Scripts like the one from pgexperts can help with further analysis.
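
A starting point for that review might look like this (a sketch; it excludes unique and primary key indexes, which may be needed for integrity even if never scanned, and remember that the counters only cover the period since the last statistics reset, on the primary alone):

    -- Indexes that have never been scanned, largest first.
    SELECT s.schemaname,
           s.relname AS table_name,
           s.indexrelname AS index_name,
           pg_size_pretty(pg_relation_size(s.indexrelid)) AS index_size
    FROM pg_stat_user_indexes s
    JOIN pg_index i ON i.indexrelid = s.indexrelid
    WHERE s.idx_scan = 0
      AND NOT i.indisunique
      AND NOT i.indisprimary
    ORDER BY pg_relation_size(s.indexrelid) DESC;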

PostgreSQL 16 added one more column to pg_stat_user_indexes / pg_stat_all_indexes, named last_idx_scan, which tells us the last time the index was used (a timestamp). This helps us take a well-informed look at all the indexes in the system.
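
On PostgreSQL 16 or later, that makes queries like this possible (a sketch):

    -- Indexes ordered by how recently they were used; never-used ones come first.
    SELECT indexrelname, idx_scan, last_idx_scan
    FROM pg_stat_user_indexes
    ORDER BY last_idx_scan NULLS FIRST;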

Summary

The summary, in simple words: indexes are not cheap. There is a cost, and the cost can be manifold. Indexes are not always good, and sequential scans are not always bad, either. My humble advice is to avoid reaching for individual query improvements as the first step, because it is a slippery slope. A top-down approach to tuning the system yields better results: start with the host machine, the operating system, PostgreSQL parameters, the schema, and so on. An objective cost-benefit analysis is important before creating any index.


FAQs

Why use indexes in PostgreSQL?

  • Improve Query Performance: Indexes significantly reduce the data search time, making queries faster by allowing the database to locate and retrieve data without scanning every row.
  • Efficiency in Sorting and Filtering: They speed up data retrieval, sorting, and filtering operations, especially in tables with large volumes of data.
  • Optimization of Join Operations: Indexes can improve the efficiency of join operations by quickly matching rows between tables.
  • Unique Data Enforcement: Unique indexes prevent duplicate values in a column, ensuring data integrity.

When should I CREATE INDEX in PostgreSQL?

Creating an index in PostgreSQL is most beneficial for columns frequently involved in search conditions (WHERE clauses), join conditions, or as part of an ORDER BY clause in queries. However, it’s important to balance the need for speed in query processing against the write and storage overhead that indexes introduce.
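
For example, if a hypothetical orders table is frequently filtered by customer_id, an index there is a typical candidate (the table and index names are made up for illustration; CONCURRENTLY avoids blocking writes while the index is built, at the cost of a slower build):

    CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders (customer_id);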

What are the downsides of creating an index in PostgreSQL?

While indexes are invaluable for improving query performance, there are downsides. Every index added to a table increases the storage space required by the database. Additionally, indexes can slow down write operations such as INSERT, UPDATE, and DELETE, as each operation requires updating the indexes. Excessive indexing can lead to increased complexity in query planning and execution, potentially negatively affecting overall performance if not carefully managed.
