In this blog post, we are going to briefly discuss why column order matters when creating tables (or relations) in PostgreSQL.

Alignment

In its internal representation of tuples, Postgres places each column at an offset that depends on the alignment defined for its data type, keeping the columns in the same sequential order as in the table definition used to create it. We can read more about it in the following online documentation sections:

https://www.postgresql.org/docs/current/catalog-pg-type.html 

Quoting the relevant excerpt here:

typalign is the alignment required when storing a value of this type. It applies to storage on disk as well as most representations of the value inside PostgreSQL. When multiple values are stored consecutively, such as in the representation of a complete row on disk, padding is inserted before a datum of this type so that it begins on the specified boundary. The alignment reference is the beginning of the first datum in the sequence. Possible values are:

c = char alignment, i.e., no alignment needed.

s = short alignment (2 bytes on most machines).

i = int alignment (4 bytes on most machines).

d = double alignment (8 bytes on many machines, but by no means all).

As we will check in more depth later on, this means that the exact position of each column within the tuple will vary depending on the data type used: some types require less alignment and can be packed closer together, while others cannot.

And from https://www.postgresql.org/docs/current/storage-page-layout.html:

> […] Interpreting the actual data can only be done with information obtained from other tables, mostly pg_attribute. The key values needed to identify field locations are attlen and attalign.

Additionally, we can say that choosing a good column order will not only impact disk usage but also memory and, potentially, CPU usage.

Padding

The first documentation quote brings us to padding. Alignment is the reason why column order matters, and padding is the reason why we could be wasting space in each tuple. This is best explained with a simple real-life example.

Let’s say we have two relations, each with the exact same columns but in different order.
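
As a sketch, the definitions could look like this (the table and column names are just illustrative; t1 is the one using the less favorable order described below):

    -- t1: columns in a padding-heavy order
    CREATE TABLE t1 (
        c1 SMALLINT,   -- 2 bytes, 2-byte alignment
        c2 BIGINT,     -- 8 bytes, 8-byte alignment
        c3 INT,        -- 4 bytes, 4-byte alignment
        c4 BIGINT      -- 8 bytes, 8-byte alignment
    );

    -- t2: same columns, ordered from widest alignment to narrowest
    CREATE TABLE t2 (
        c1 BIGINT,
        c2 BIGINT,
        c3 INT,
        c4 SMALLINT
    );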

We have two BIGINT columns, one INT and one SMALLINT in each, so we can say they are exactly the same in terms of the data they can hold. If we go back to the documentation, it mentions:

When multiple values are stored consecutively, such as in the representation of a complete row on disk, padding is inserted before a datum of this type so that it begins on the specified boundary.

Ok, so how do we know each data type's boundary? We can use queries like the following to check the alignment of the data types used in our tables:
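
A sketch of such queries, assuming the illustrative t1 from above (pg_attribute and pg_type are the catalogs that hold this information):

    -- Alignment and length of the data types used by t1's columns
    SELECT a.attname, t.typname, t.typalign, t.typlen
    FROM pg_attribute a
    JOIN pg_type t ON t.oid = a.atttypid
    WHERE a.attrelid = 't1'::regclass
      AND a.attnum > 0
      AND NOT a.attisdropped
    ORDER BY a.attnum;

    -- Or directly by type name
    SELECT typname, typalign, typlen
    FROM pg_type
    WHERE typname IN ('int2', 'int4', 'int8');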

In summary:

  • BIGINT has an alignment of eight bytes
  • INT has an alignment of four bytes
  • SMALLINT has an alignment of two bytes

Alignment and padding example

Now let's think of the tuple in terms of an array, where each position represents one byte. If the first column takes two bytes, and the second one has an alignment of eight, we will need six more bytes added (as padding) before we can store its data. This leaves us with the following representations for the tuples in each table:

    t1 (SMALLINT, BIGINT, INT, BIGINT) - 32 bytes of data per tuple:
    tuple: [c1 c1] [0 0 0 0 0 0] [c2 c2 c2 c2 c2 c2 c2 c2] [c3 c3 c3 c3] [0 0 0 0] [c4 c4 c4 c4 c4 c4 c4 c4]

    t2 (BIGINT, BIGINT, INT, SMALLINT) - 22 bytes of data per tuple:
    tuple: [c1 c1 c1 c1 c1 c1 c1 c1] [c2 c2 c2 c2 c2 c2 c2 c2] [c3 c3 c3 c3] [c4 c4]

The zeros in the "tuple" rows mean that padding is needed, and that space is wasted with no data at all. As we can see in this small example, we need ten more bytes per tuple in table t1, even though both tables hold the exact same information! Of course, you could say that ten more bytes per row is negligible, and it may be; but this could be a very different story in environments with large datasets. You can also think of it as roughly 45% overhead (10 extra bytes on top of 22 bytes of actual data) wasted on padding.

An easy way to check how many bytes each tuple takes in each table is to use pg_column_size. In this example, we are subtracting 24 bytes to discard the size of the row header, but do note that the header could be larger in other cases.
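
A sketch of those queries, assuming the illustrative t1 and t2 from above, each holding a single row:

    -- 24 bytes discarded for the row header in this example
    SELECT pg_column_size(t1.*) - 24 AS t1_data_bytes FROM t1;
    SELECT pg_column_size(t2.*) - 24 AS t2_data_bytes FROM t2;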

The difference between them is 10 bytes, as we calculated above.

We are not considering variable-length data types here because they are not as straightforward. If there is interest in a second part exploring this in more depth, let me know.

Using the pageinspect extension

Another interesting way to check this is to use the pageinspect extension. With it, we can get the actual contents (bytes) for each page used in a table and see exactly how the tuples are being stored. First, we need to enable it (note: this is not suggested for use in a production environment):
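
Enabling it is a single statement (assuming the contrib modules are installed on the server):

    CREATE EXTENSION IF NOT EXISTS pageinspect;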

We will use two functions:

https://www.postgresql.org/docs/current/pageinspect.html#PAGEINSPECT-GENERAL-FUNCS

get_raw_page(relname text, fork text, blkno bigint) returns bytea

get_raw_page reads the specified block of the named relation and returns a copy as a bytea value. This allows a single time-consistent copy of the block to be obtained. fork should be ‘main’ for the main data fork, ‘fsm’ for the free space map, ‘vm’ for the visibility map, or ‘init’ for the initialization fork.

https://www.postgresql.org/docs/current/pageinspect.html#PAGEINSPECT-HEAP-FUNCS

heap_page_items(page bytea) returns setof record

heap_page_items shows all line pointers on a heap page. For those line pointers that are in use, tuple headers as well as tuple raw data are also shown. All tuples are shown, whether or not the tuples were visible to an MVCC snapshot at the time the raw page was copied.

Let’s now insert MAX values for the respective types of each column, so we can clearly see when padding is added (before these inserts, I recreated the tables, so we have only one row in each):
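
Assuming the illustrative t1 and t2 definitions from above, the inserts would look something like this (32767, 2147483647, and 9223372036854775807 being the maximum values for SMALLINT, INT, and BIGINT, respectively):

    INSERT INTO t1 VALUES (32767, 9223372036854775807, 2147483647, 9223372036854775807);
    INSERT INTO t2 VALUES (9223372036854775807, 9223372036854775807, 2147483647, 32767);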

We can use a query like the following to inspect all the page data available:
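
A sketch of such a query, reading block 0 of the main fork of t1:

    SELECT * FROM heap_page_items(get_raw_page('t1', 'main', 0));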

But in this case, we’ll focus on t_data (the actual tuple data):
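
For example, selecting only t_data for the single row in each table:

    SELECT t_data FROM heap_page_items(get_raw_page('t1', 'main', 0));
    SELECT t_data FROM heap_page_items(get_raw_page('t2', 'main', 0));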

Since the output is in hexadecimal notation, each digit represents four bits (which means every two consecutive zeros make up one byte of padding). Since we are using the max values for each data type, we know that any zeros are padding, and this is why we don't get any zeros printed in the second query (the one against t2). In t1's output, we have 12 zeros after the first column and eight after the third, which gives 20 zeros in total, amounting to 10 bytes of padding (20 hex digits at 4 bits each = 80 bits = 10 bytes).

This is the same number we got with the different approaches above. This tool gives us a more visual confirmation of the theory we discussed, which is always helpful.

Conclusion

Taking some time to think about the low-level row structure can save us resources and improve performance when designing our tables. Data will need less disk space, and the overall query and server performance should be positively impacted by it. On the other hand, it could make table definitions less clear if we enforce minimizing padding, so we might need to be a bit lax at times (for instance, we could make the exception for primary key columns and have those at the beginning).

If you are interested in this topic, there is a Percona Live recording that covers it in more depth: "Reducing Costs and Improving Performance With Data Modeling in Postgres" by Charly Batista.


Comments

Aravinth Manogaran

Ideally, it should have been designed to store all the fixed-length data types serially, handled via the metadata definition at run time.

David Burns

I would be very interested in a follow-on article on variable-length columns.

Rishabh

“We are not considering variable-length data types here because they are not as straightforward” – Please write an article on this.

Also, does the same concept apply to MySQL (InnoDB)?