Performance is one of the essential aspects of a database management system. Few things are more annoying and frustrating for users than poor performance, meaning long-running queries and high response times at the front end. 

One of the most effective ways to tackle performance improvement is having the proper indexes for the table columns. An index can save a lot of time in data access and help queries gather their results in the fastest way possible.

In PostgreSQL, the planner can leverage indexes in different ways to produce the most efficient plan.

In this blog post, we will review the following three PostgreSQL index scan types, which the planner chooses depending on the table, the data the query retrieves, and the filters used:

  • Bitmap Index Scan
  • Index Scan
  • Index Only Scan

Building the testing scenario

We will use one table with a single index for the following exercises and examples and review how the scan strategy can change depending on the query conditions. 

Next is the table definition. I included a sequence creation statement for the id column since having a primary key column is always a best practice, but for the specific examples we will go through, we don’t need it. 
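The following is a sketch of what that definition could look like; the table name and the age, is_active, and register_date columns appear in the execution plans later in the post, while the name column and the exact data types are assumptions:

```sql
-- Sketch of the test table; the "name" column and exact types are
-- assumptions, but the examples below rely on age, is_active,
-- and register_date.
CREATE SEQUENCE person_id_seq;

CREATE TABLE person (
    id            integer NOT NULL DEFAULT nextval('person_id_seq'),
    name          text,
    age           integer,
    is_active     boolean,
    register_date timestamptz
);
```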

Taking some inspiration from the How To Generate Test Data for Your Database With SQL Percona blog, we will use the following SQL statement to insert ten million rows into this table:
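A statement along these lines, built around generate_series(), would do the job; the exact expressions are an approximation chosen to match the cardinalities described in the next section:

```sql
-- Load 10M rows of dummy data; the expressions approximate the
-- original load script described in the text.
INSERT INTO person (name, age, is_active, register_date)
SELECT
    'person_' || seq,                      -- dummy name
    floor(random() * 99 + 1)::int,         -- ages between 1 and 99
    random() > 0.5,                        -- roughly half true, half false
    now() - random() * interval '2 years'  -- near-unique timestamps
FROM generate_series(1, 10000000) AS seq;
```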

 Now we have our test table with some dummy data so we can practice.

Indexing the data

As we stated before, a best practice is to add a primary key to the table, but we are skipping this step and adding just a composite index that will help us review the different scan types.

We create this multicolumn index as follows: 
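The index name comes from the execution plans shown later in this post; based on it, the statement looks like this:

```sql
-- Multicolumn index; the column order follows the index name.
CREATE INDEX idx_person_age_date_active
    ON person (age, register_date, is_active);
```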

Here, we considered three columns with different cardinalities, meaning the proportion of distinct values relative to the total number of rows. The following are the columns ordered from highest to lowest cardinality:

  1. register_date. We loaded the 10M rows setting this column with the help of the random() function, so the number of distinct values is the largest of these three columns.
  2. age. When we loaded the data, we also used the random() function, but we “limited” the results with the floor() function, so all the different values are between 1 and 99. 
  3. is_active. This column’s data type is boolean, so only two different values are possible, true and false.

It is essential to think about a column’s data cardinality when planning the indexes and, even before that, the filters we will execute against the data.

For example, in the above columns, having a single index over the is_active column will not add any advantage because, across all the 10M rows, only two values are possible; if we wanted to filter all the is_active = true rows, the planner would undoubtedly choose a sequential scan. 

One way to verify a column’s number of distinct values is by querying the pg_stats view in our database. For this, it is important to ensure the statistics are fresh; in this case, we run the ANALYZE command:
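Refreshing the statistics and then checking n_distinct for our three columns could look like the following sketch:

```sql
-- Refresh the planner statistics for the table.
ANALYZE person;

-- Check the estimated number of distinct values per column.
SELECT attname, n_distinct
FROM pg_stats
WHERE tablename = 'person'
  AND attname IN ('age', 'is_active', 'register_date');
```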

For the previous columns, the following is the result of querying the pg_stats view:
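An illustrative output (the n_distinct values match what is described next):

```
    attname    | n_distinct
---------------+------------
 age           |         99
 is_active     |          2
 register_date |         -1
```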

And we confirm the age column has 99 distinct values, whereas is_active has only two. The register_date column shows a negative value of -1 because, as described in the documentation, ANALYZE believed the number of distinct values is likely the same as the total number of rows:

[…] -1 indicates a unique column in which the number of distinct values is the same as the number of rows.

One index, different scan types

Now that we have the table data and the index in place, we can test the different scan types. First, and to have a starting point, let’s verify how PostgreSQL will resolve a query going for all the table data with no filters:
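A baseline query of this shape illustrates the point; the plan below is a sketch with costs elided, and the execution time is the one reported in the original run:

```sql
EXPLAIN ANALYZE SELECT * FROM person;
--                          QUERY PLAN
-- ------------------------------------------------------------------
--  Seq Scan on person (cost=... rows=10000000 ...)
--                     (actual time=... rows=10000000 loops=1)
--  Execution Time: ~1183 ms
```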

As expected, to retrieve all the data from the table, the planner decided on a Sequential Scan, going for all the 10M rows. It makes sense since it is getting all the rows at once. The total time it took was over 1183ms (~1.2 sec). 

Bitmap Index Scan

The planner chooses this index scan method when the query asks for an amount of data large enough to leverage the benefits of a bulk read, like the sequential scan, but not so large that it actually requires processing ALL the table. We can think of the Bitmap Index Scan as something between the Sequential Scan and the Index Scan. 

The Bitmap Index Scan always works in tandem with a Bitmap Heap Scan; the first scans the index to find all the suitable row locations and builds a bitmap, then the second uses that bitmap to scan the heap pages one by one and gather the rows. 

The following is an example of a Bitmap Index Scan using the table and the index we built before:
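A query of this shape produces that kind of plan; the specific age value is an assumption, and the plan numbers are elided, but with 99 distinct ages across 10M rows, one age value matches roughly 101K rows, as discussed below:

```sql
EXPLAIN ANALYZE SELECT * FROM person WHERE age = 20;
--  Gather (actual time=... rows=101... loops=1)
--    Workers Planned: 2
--    ->  Parallel Bitmap Heap Scan on person
--          Recheck Cond: (age = 20)
--          ->  Bitmap Index Scan on idx_person_age_date_active
--                Index Cond: (age = 20)
```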

The inner node (which executes first) is a Bitmap Index Scan on the idx_person_age_date_active index. It creates the bitmap with all the suitable row locations and passes it to its parent node (which executes after), a Parallel Bitmap Heap Scan on the person table. This second stage goes through the pages individually, executes a Recheck for the filter condition, and returns the result data set. 

To compare, consider how the same operation performs using just a sequential scan:
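One way to reproduce this comparison is to disable bitmap scans for the session and run the same query, forcing the planner back to a sequential scan; the resulting plan then shows a Filter line with the rows removed by it:

```sql
SET enable_bitmapscan = off;  -- session-only planner hint for testing
EXPLAIN ANALYZE SELECT * FROM person WHERE age = 20;
RESET enable_bitmapscan;
```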

This query went for 101K rows, about 1% of the total. The Bitmap Index Scan took advantage of the “sequential scan” style bulk read over a limited number of pages and produced a better result than a direct Sequential Scan, performing 2x faster. 

Index Scan

It might be the scan method you think about when hearing something like, “Hey, this query is doing good; it uses the index….” This method is the basic definition of accessing the data by an index. 

The Index Scan consists of two steps: the first is to get the row location from the index, and the second is to gather the actual data from the heap or table pages. So every Index Scan access means two read operations. But still, this is one of the most efficient ways of retrieving data from a table. 

The planner picks this scan method when the number of rows to retrieve is small, so executing the two-step Index Scan operations is “cheaper” and faster than gathering the data by processing the table pages individually. 

Using our test table, the following is an example of an Index Scan:
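The query and the kind of plan it yields look like the following sketch; the age value is an assumption, costs are elided, and the execution time is the one reported below:

```sql
EXPLAIN ANALYZE
SELECT * FROM person
WHERE age = 20
  AND register_date = '2023-03-23 19:50:03.22938+00'::timestamp;
--  Index Scan using idx_person_age_date_active on person
--    Index Cond: ((age = 20) AND (register_date = ...))
--  Execution Time: 0.064 ms
```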

See that in the query we used before, we have added a new filter expression: AND register_date = '2023-03-23 19:50:03.22938+00'::timestamp. The register_date column is part of the multicolumn index idx_person_age_date_active. Since we are filtering by a singular value for it, there is a single index entry for it, so PostgreSQL gets the specific row location from the index in one read and then all the row data from the table page at that location. The whole query took 0.064ms; that was fast! 

In the above example, the query filters by a specific timestamp value for the register_date column, but PostgreSQL will still choose the Index Scan for multiple rows if the number of rows is small, for example, in the following:
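A sketch of such a query; the range bounds are assumptions chosen to cover only a handful of rows:

```sql
EXPLAIN ANALYZE
SELECT * FROM person
WHERE age = 20
  AND register_date BETWEEN '2023-03-23 19:50:00+00'
                        AND '2023-03-23 19:51:00+00';
--  Index Scan using idx_person_age_date_active on person
--    Index Cond: ((age = 20) AND (register_date >= ...)
--                 AND (register_date <= ...))
```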

The query filtered the register_date column by a range using the BETWEEN operator. Based on the statistics, the planner thought the result would be one row and chose the Index Scan. In the end, the result set was eight rows, so eight pairs of read operations were needed. Still, the query was resolved very quickly, in 0.337ms.

Index Only Scan

Finally, we will review the Index Only Scan method. This is a really good approach that PostgreSQL uses to improve the standard Index Scan method. 

PostgreSQL uses this method when all the data the query asks for already exists in the index; in other words, the columns/expressions in the SELECT and WHERE clauses should be part of the index. This makes it possible to skip the second read operation that would get the data from the table pages and return the result from the index read operation alone.

In the following, we are using almost the same query we used for the Index Scan example, but instead of asking for ALL the row columns (*), we are just retrieving the three columns we used to build the multicolumn index:
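The query and plan sketch follow; the age value remains an assumption, while Heap Fetches: 0 and the execution time are the results reported below:

```sql
EXPLAIN ANALYZE
SELECT age, register_date, is_active
FROM person
WHERE age = 20
  AND register_date = '2023-03-23 19:50:03.22938+00'::timestamp;
--  Index Only Scan using idx_person_age_date_active on person
--    Index Cond: ((age = 20) AND (register_date = ...))
--    Heap Fetches: 0
--  Execution Time: 0.058 ms
```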

See the EXPLAIN output now says Index Only Scan, and also, it confirms there was no access to the heap (table pages) with the line Heap Fetches: 0. The time was even better than the Index Scan from before, only 0.058ms. This scan method can help get the best possible performance for the queries that fit its conditions. 

Remember, indexing ALL the columns so the index contains ALL the same data as the table is “not a good idea.” If that is the case, PostgreSQL will not see any advantage of using the index and will pick the sequential scan approach. See the following:
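A sketch of that experiment; the new index name and its full column list are assumptions:

```sql
BEGIN;
DROP INDEX idx_person_age_date_active;
CREATE INDEX idx_person_all              -- hypothetical "index everything" index
    ON person (id, name, age, is_active, register_date);
COMMIT;

ANALYZE person;

EXPLAIN ANALYZE
SELECT * FROM person
WHERE age = 20
  AND register_date = '2023-03-23 19:50:03.22938+00'::timestamp;
-- The planner now falls back to a Parallel Seq Scan despite the index.
```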

In the above, in a single transaction, we dropped the multicolumn index we used in the previous examples and created a new one covering ALL the table columns, then refreshed the statistics and tried one query asking for all the columns (*) with a specific filter. As a result, the planner chose the Sequential Scan and tried to boost it by executing the operation with parallelism. Still, the final execution time was far from our earlier good results. 

Final pieces of advice

Now that we have reviewed the different scan options PostgreSQL can use on a single index depending on the stored data and the query filter conditions, let me share some final thoughts you may find helpful when planning your query filters and table indexes.

  1. Index the columns with the highest cardinality. Following this, your queries can perform best when filtering by those columns. Having indexes on columns with low cardinality will have the opposite effect since it adds extra maintenance, and with high certainty, the planner will not use them.
  2. Plan your queries for small (specific) data sets rather than larger ones. If your workload and service design allow it, consider filtering your data with the aim of retrieving just a few rows. As we saw, the Index Scan is an effective and optimized technique to retrieve data faster, and PostgreSQL will use it if the result data is small enough. 
  3. Retrieve just the columns you need. By doing this, PostgreSQL can leverage the Index Only Scan and avoid the “extra” read from the heap (table pages). These savings will have a noticeably good effect in a high-query-volume environment. Remember that multicolumn indexes are not the only way to lead the planner to choose the Index Only Scan; you can also consider covering indexes (a matter for another blog, perhaps). 
  4. Tune the random_page_cost parameter. Lowering this will lead the planner to prefer index scans rather than sequential scans. Modern SSDs can deliver better throughput for random read access, so you might consider adjusting this parameter accordingly.
  5. Tune the effective_cache_size parameter. Setting this parameter to a higher value (nearly 75% of total RAM if your machine is dedicated to the PostgreSQL service) will help the planner choose index scans over sequential scans. 
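As a sketch of the last two points, for a session (the exact values are assumptions that presume SSD storage and a dedicated 16GB machine):

```sql
SET random_page_cost = 1.1;         -- default is 4.0; lower favors index scans
SET effective_cache_size = '12GB';  -- ~75% of RAM on a dedicated 16GB server
-- After testing, persist with ALTER SYSTEM SET ... or in postgresql.conf.
```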

Remember that every implementation is different, and the details matter, so before adjusting anything in production, it is advisable to analyze and test the effects in a lower environment.

Percona Distribution for PostgreSQL provides the best and most critical enterprise components from the open-source community in a single distribution, designed and tested to work together.

 

Download Percona Distribution for PostgreSQL Today!
