A deep dive into indexes

Understanding database Indexes in PostgreSQL

Paweł Dąbrowski

CTO / Software Engineer

There are three types of developers: those who know that indexes speed up database queries, those who know that indexes speed up database queries but also take additional space and time to write, and those who have a more profound knowledge of indexes, their advantages, disadvantages, different types, and much more. Unfortunately, the first two groups are the most numerous.

This article will help you understand SQL indexes if you fall into the first two groups. If you know a lot about indexes, this article will help you organize your knowledge and remind you about good practices. SQL is a declarative language meaning it tells the database what we want to do but not how to achieve it. The database engine decides how to pull data. We can help the query planner by using indexes.

All information I provide in this article is specific to the PostgreSQL database engine, but much of it also applies to other popular database engines.

Life without indexes

Imagine that you run a web application that is used by millions of users every day. The application uses a database where information about users is collected in a single table called users. Each time someone requests a user’s profile to be rendered, you have to first find the user record in the database and then render the information back to the visitor. An ordinary scenario:

SELECT * FROM users WHERE users.slug = 'user411' LIMIT 1;

The above query was executed against a table with one million records, where every slug is unique, but I didn’t put any index on it. It took 57.419 ms to get one record. Let’s add a unique index on the slug column:

CREATE UNIQUE INDEX slug_idx ON users (slug);

I ran the select query again, and it took 0.055 ms to get the matching record. That’s 1043 times faster than before! Imagine using a more complicated query against a bigger set of data without proper indexes on the columns. Now that we have analyzed a simple yet meaningful example, we can do a deep dive into the world of indexes to learn how to design our databases the right way.

Truly understand what an index is

I mentioned at the beginning that most developers know that an index speeds up database queries, but if you asked them to explain indexes in detail, they wouldn’t know the answer.

I separated the definition of an index into two pieces: a high-level explanation and a low-level definition. The first one is needed to get the right understanding of how indexes work in general, and the second one is for those who like to know how things work under the hood.

High-level definition

Imagine you have a list of contacts on your iPhone, but this list is not sorted alphabetically. You want to call John. What would you do? You would go through every contact starting from the top and stop on the contact named John. Let’s say your list contains 100 contacts, and if you are lucky enough, you would need to go through 20 - 30 contacts before finding the right one. What if John is the last contact? It would take a lot of time to find the contact.

That’s the problem that is solved by indexes. Without an index on the column, when you do the filtering, the database performs a full scan, which simply means that it goes through every record until the matching record is found. That can take a lot of time, depending on the database size.

When you put an index on a column, in our case the name of the contact, the database creates a special structure that keeps the contacts sorted. So if you are looking for John, you can jump straight to contact names starting with the letter J, which speeds up the search considerably.

Low-level definition

There are a few types of indexes; I will discuss them next. Let’s focus on the default one, which is B-tree, so I can demonstrate what it looks like on a low level inside the database.

B-tree stands for balanced search tree. It generalizes the binary search tree by allowing nodes with more than two children. Each node contains keys in ascending order. If we consider the example with contact names, the B-tree structure would look similar to the following visualization:
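
A minimal sketch, assuming the contact list contains the names Adam, Emma, John, Kate, and Tom:

                  [ John ]
                 /        \
   [ Adam | Emma ]        [ Kate | Tom ]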

If you formed all of those names into an array and sorted them alphabetically, John would be in the middle of the array. When a search is performed on the index, the engine checks whether the given value comes before or after the key in the root node and repeats the check for every node it visits on the way down.

Nodes also contain a pointer to the record in the database, so when the matching node is found, the engine will use the pointer to get the record from the database and incorporate it into search results.

The different types of indexes and use-cases

There are a few types of indexes, each one suitable for different use cases. You already learned about the B-tree index, which is the default one and will be applied unless you specify another. Here is the full list, along with use cases for each of the index types:

  • B-tree - the default index, suitable for all kinds of data. The query planner will consider using the B-tree index when the following operators are used: <, <=, =, >=, >, BETWEEN, IN, IS NULL or IS NOT NULL. Additionally, the query planner can also consider this index for queries that involve pattern matching with the LIKE and ~ operators. However, the pattern must be a constant, and the anchor must be at the beginning of the pattern.
  • Hash - an index that can handle only simple equality comparisons with the = operator, and only when such an operator is used will the query planner consider using the hash index. In Postgres versions prior to 10, you shouldn’t be using this index, as it was not crash safe and came with some other disadvantages. Also, the advantage over b-tree is quite small, so in most cases, b-tree is a better choice.
  • GiST - the shortcut that stands for Generalized Search Tree. This type of index is good for geometric data types, network address data, and full-text search. The query planner would consider using it when the following operators are used: <<, &<, &>, >>, <<|, &<|, |&>, |>>, @>, <@, ~= or &&. This type of index is also good for optimizing queries that find the nearest neighbor.
  • SP-GiST - the shortcut that stands for space-partitioned generalized search tree. This type of index is suitable for multimedia, phone routing, IP routing or GIS. The query planner would consider using this index when one of the following operators is used: <<, >>, ~=, <@, <<| or |>>.
  • GIN - the shortcut that stands for Generalized Inverted Index. While the B-tree index is optimized for the case when a row has a single key value, GIN is more suitable when the index must map many values to one row. Think about GIN when you want to index arrays, hstore, or jsonb columns, or implement full-text search. The query planner will consider using this index when the following operators are used: <@, @>, = or &&.
  • BRIN - the shortcut that stands for Block Range Index. It is often used on a column that has a linear sort order, for example, the creation date on records used for reporting. It is much smaller than a B-tree index and less costly to maintain. The query planner will consider using this index when the following operators are used: <, <=, =, >= or >. The BRIN index is generally more suitable for large tables with time series data rather than small tables with randomly ordered data.

The truth is that in most cases, you should be fine with the b-tree index that is created by default, unless you deal with very specific data in your database.
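
For illustration, this is roughly how you would pick a non-default index type with the USING clause (the table and column names here are hypothetical; the GIN example assumes a jsonb column):

CREATE INDEX events_payload_idx ON events USING GIN (payload);
CREATE INDEX measurements_created_at_idx ON measurements USING BRIN (created_at);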

Improving query performance with indexes

Now that it’s clear how indexes work in general and under the hood, it is much easier to understand how we can support our queries with indexes to make them much faster.

Detecting when to add an index

You know why and how. It is time to answer the question of when. You won’t benefit from putting indexes on the wrong columns; your database would only take up more disk space. In general, the following practices are considered good:

  • Index columns that you search on - after discovering how indexes work, this one should be obvious to you. If you filter your records by the last_name column, put an index there. There are countless examples of cases when you can use filtering to narrow search results.
  • Index columns for database-level validation - it’s not enough to have validation on the backend or frontend (or both). It is considered good practice to also use unique indexes and constraints to enforce data integrity at the database level.
  • Index columns used for join operations - putting an index on a column that you use to join tables can improve the performance.
  • Index columns that you often use for sorting - an index organizes the data in a way that makes sorting more efficient, reducing the need for expensive sort operations (see the examples right after this list).
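
For example (the table and column names below are hypothetical), the last two rules could translate into indexes like these:

CREATE INDEX orders_user_id_idx ON orders (user_id);
CREATE INDEX articles_created_at_idx ON articles (created_at);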

You may find some special use cases for your data, but the above rules apply to most of the applications. Think about them each time you design the database, perform the audit of the legacy database, or look for a way to improve the performance of the queries.

Using explain and analyze commands

This section won’t be complete without mentioning the EXPLAIN and ANALYZE commands that help us determine how the query planner sees our data.

The EXPLAIN command prints the execution plan of a query without executing it. The plan contains the order of the operations, join methods, index usage, and estimated costs:

EXPLAIN SELECT title FROM articles WHERE published = true;

The ANALYZE command collects statistics about the data in a table so the query planner can make better estimates. Unlike EXPLAIN, it is run against a table rather than a query:

ANALYZE articles;

You can also combine both keywords: EXPLAIN ANALYZE executes the query and shows the actual run times and row counts next to the estimates, giving you complete insight into query optimization and fine-tuning in PostgreSQL:

EXPLAIN ANALYZE SELECT title FROM articles WHERE published = true;
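
The output will look roughly like the following (the numbers are purely illustrative and depend on your data):

Seq Scan on articles  (cost=0.00..1889.00 rows=52000 width=24) (actual time=0.015..9.321 rows=51876 loops=1)
  Filter: published
  Rows Removed by Filter: 48124
Planning Time: 0.081 ms
Execution Time: 12.552 ms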

Basic index on single or multiple columns

When it comes to filtering, you can filter records based on single or multiple columns. If you plan to filter basic data types, the default index (b-tree) will be enough:

CREATE INDEX title_idx ON articles (title);

If you plan to add an index on multiple columns, just modify the last part of the command:

CREATE INDEX title_category_idx ON articles (title, category);

Partial indexes

When you create an index, by default, you index all records in the given table. If you want to avoid that, you can create a partial index. When you build a query, you use the WHERE keyword to narrow search results, and the same applies to index creation:

CREATE INDEX title_idx ON articles (title) WHERE published = true;

That way, you will reduce the index size by indexing only the subset of all records.
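
A query has to match the index condition to benefit from it. With the index above, a lookup like the following hypothetical one can use the partial index:

SELECT * FROM articles WHERE published = true AND title = 'PostgreSQL indexes';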

Expression indexes

When simple values are not enough, you can create indexes based on the outcome of a function. A popular example would be indexing the lowercase version of the value in a given field:

CREATE INDEX title_idx ON articles (lower(title));

Keep in mind that such an index can take more space (depending on the outcome) and can impact the performance of data modifications. Also, a query has to use the same expression for the planner to consider the index.
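
For example, a case-insensitive lookup written with the same lower() expression can use the index above:

SELECT * FROM articles WHERE lower(title) = 'postgresql indexes';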

Covering indexes

As you may already know, after the index is scanned, the pointer is used to retrieve the data from the table. You can skip the second part and improve the query even more by using covering indexes.

A covering index enables a so-called index-only scan. It contains all of the columns required to satisfy the query, so there is no need to perform the additional lookup in the table. The syntax for creating such indexes is the following:

CREATE INDEX idx_covering ON articles (title, category) INCLUDE (author, slug);

As always, such an index gives us a lot of benefits, but you also have to know that it is not a good idea to use it if you perform frequent updates on the included columns, as every update then has to modify both the index and the table. Note that the INCLUDE clause is available since PostgreSQL 11.
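
With the index above, a query like the following hypothetical one can be satisfied entirely from the index (provided the visibility map is up to date):

SELECT author, slug FROM articles WHERE title = 'PostgreSQL indexes' AND category = 'databases';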

Other types of indexes

Besides the simple queries mentioned above, you can also improve the speed of more advanced queries. These include full-text search and queries on geospatial data. Whenever you create an index, you can refer to the official documentation, where the index creation command is well described along with all possible parameters.
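
As an illustration, a full-text search index on a hypothetical body column could be created with GIN and the to_tsvector function:

CREATE INDEX articles_body_fts_idx ON articles USING GIN (to_tsvector('english', body));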

Maintenance of existing indexes

It’s not enough to just add the right indexes to your database; you also need to maintain the existing indexes to make sure that the structure of your database is in the best shape possible. When you start an application from scratch, you don’t have much to maintain in terms of indexes, but in a legacy application, there are a lot of cases to handle.

You need to maintain indexes because one of the following situations might have happened in the past:

  • The developer indexed more columns than was needed
  • The developer indexed a column before it was needed, and it turned out that it was never used
  • The query planner is not using the index and does a full scan instead; this can happen especially on small tables

These are just some of many situations where invalid indexes can be created, and we need to maintain them to keep our database healthy and as performant as possible.

The table where information about indexes is stored

In PostgreSQL, the information about indexes is stored in the pg_index table. Since it’s a table, you can query it just like any other table in your database. I investigated the table columns (I used version 11.19), and the table contains the following columns among others:

  • indnatts - the total number of columns in the index
  • indnkeyatts - the total number of key columns in the index
  • indisunique - determines if index is unique
  • indisprimary - determines if index represents primary key of the table
  • indisexclusion - determines if the index supports exclusion constraint
  • indisclustered - determines if table was last clustered on this index
  • indisvalid - determines if index is valid for queries
  • indisready - determines if index is ready for inserts
  • indislive - false if the index is in the process of being dropped
  • indisreplident - determines if index has been chosen as replica identity
  • indkey - an array of indnatts values to indicate which table columns the index indexes

The other columns are indexrelid, indrelid, indimmediate, indcheckxmin, indcollation, indclass, indoption, indexprs, indpred. You can check the detailed explanation of each column in the official documentation.
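
Since pg_index can be queried like any other table, a quick illustrative query (assuming a hypothetical users table) lists its indexes together with some of these flags:

SELECT indexrelid::regclass AS index_name, indisunique, indisprimary, indisvalid FROM pg_index WHERE indrelid = 'users'::regclass;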

Detecting unused indexes

To detect unused indexes in our database, we can use the pg_stat_user_indexes table which contains usage statistics. This table contains the following important columns:

  • idx_scan - number of scans performed on the index
  • relname - name of the table
  • indexrelname - name of the index
  • indexrelid - the OID of the index, which lets us find the corresponding record in the pg_index table

Keep in mind that we can’t simply find records where idx_scan equals zero. We also have to consider the following things:

  • Sometimes an index is not simple - it is an expression over one or more columns of the table, or a partial index. In such a case, we don’t want to consider it in our query for finding unused indexes. To take this situation into account, we can use the indkey column from the pg_index table - when the column contains 0, it means that the index includes an expression.
  • For unique indexes, idx_scan can stay at 0 even though the index is doing useful work: unique indexes are used to enforce the unique constraint, and that enforcement is not counted as an index scan. To find unique indexes, we can leverage the indisunique column from the pg_index table.
  • Besides uniqueness, we also have other constraints backed by indexes, and we can’t count those indexes as unused either. To improve our process, we can look into the pg_constraint table, which contains all constraints. If our index is not referenced there, we can take it into consideration.

Taking into account all of the above points, we can finally produce a SQL query that will show us which indexes are unused:

SELECT s.relname, s.indexrelname, pg_relation_size(s.indexrelid) AS index_size
FROM pg_stat_user_indexes s
JOIN pg_catalog.pg_index i ON s.indexrelid = i.indexrelid
WHERE s.idx_scan = 0
  AND 0 <> ALL (i.indkey)
  AND NOT i.indisunique
  AND NOT EXISTS (SELECT 1 FROM pg_catalog.pg_constraint c WHERE c.conindid = s.indexrelid)
ORDER BY pg_relation_size(s.indexrelid) DESC;

The above query will produce a set of results where each result contains three attributes:

  • relname - the name of the table
  • indexrelname - the name of the index
  • index_size - the size of the index

Detecting bloats

Frequent update and delete operations can lead to a situation where there is a lot of unused space in a table or index relation files on the disk; this is called bloat. Such a situation can cause performance degradation.

How bloat is created

While it might be surprising for a lot of developers, when a record is deleted, it is not physically removed from the disk. PostgreSQL just makes it invisible and marks it for deletion. The same thing happens when a record is updated: a new version is created in the database, and the old one is marked as invisible.

During the VACUUM operation, PostgreSQL scans the table for dead rows and removes them; only then is the occupied space cleaned up and made available for reuse. The autovacuum daemon, which runs VACUUM automatically, is enabled by default in modern PostgreSQL versions, but it can be turned off or configured too conservatively.

Bloat builds up when autovacuum is disabled or when it is not running frequently enough to keep up with the workload on the database. You can check whether the feature is enabled by running the following query:

SELECT name, setting, unit FROM pg_settings WHERE name = 'autovacuum';

Finding the size of the bloat

PostgreSQL provides the pgstattuple extension, which reports information about the tuples in our database. For example, you can query the information_schema.tables view and, for each table, call the pgstattuple function and read the percentage of dead tuples with the following query:

SELECT table_name, (pgstattuple(table_name)).dead_tuple_percent AS dead_tuple_percent
FROM information_schema.tables
WHERE table_schema = 'public' AND table_type = 'BASE TABLE'
ORDER BY dead_tuple_percent DESC;

If you receive an error about the unknown extension, you have to enable it first:

CREATE EXTENSION pgstattuple;

It should work now. Since this article is about indexes, let’s find out how we can find the bloat size in indexes. We can use the following query:

SELECT cls.relname, am.amname,
       pg_total_relation_size(indexrelid) AS index_size_bytes,
       pg_total_relation_size(indexrelid) - pg_relation_size(indexrelid) AS index_bloat_bytes
FROM pg_index idx
JOIN pg_class cls ON cls.oid = idx.indexrelid
JOIN pg_am am ON am.oid = cls.relam
WHERE indrelid > 16384;

This query returns the index name, the index type, the total index size, and the estimated bloat size; the indrelid > 16384 condition skips indexes that belong to system catalogs.

Removing the bloat from index

To remove the bloat from the given index, you can rebuild the index:

REINDEX INDEX index_name;

The above command rebuilds the index from the table data and replaces the old copy. Please keep in mind that this operation blocks writes to the table while it runs, can take some time, and can affect database performance.
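
If you are on PostgreSQL 12 or newer, you can also rebuild the index with a much weaker lock, at the cost of a slower rebuild:

REINDEX INDEX CONCURRENTLY index_name;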

Detecting duplicate indexes

There are cases when two or more indexes exist that have different names but cover the same combination of columns. As a result, one of them (or even more) is redundant and just takes up space and makes writes more expensive.

To find duplications, we can query the pg_index table using the following query:

SELECT pg_size_pretty(sum(pg_relation_size(idx))::bigint) AS size,
       (array_agg(idx))[1] AS idx1, (array_agg(idx))[2] AS idx2
FROM (
  SELECT indexrelid::regclass AS idx,
         (indrelid::text || E'\n' || indclass::text || E'\n' || indkey::text || E'\n' ||
          coalesce(indexprs::text, '') || E'\n' || coalesce(indpred::text, '')) AS key
  FROM pg_index
) sub
GROUP BY key
HAVING count(*) > 1
ORDER BY sum(pg_relation_size(idx)) DESC;

To better understand what is happening above, let’s break down the query into smaller pieces. We have the following subquery:

SELECT indexrelid::regclass AS idx,
       (indrelid::text || E'\n' || indclass::text || E'\n' || indkey::text || E'\n' ||
        coalesce(indexprs::text, '') || E'\n' || coalesce(indpred::text, '')) AS key
FROM pg_index

  • indexrelid::regclass as idx - obtains the name of the index. Without the cast, we would only get the OID, which is the object identifier of the index.
  • indrelid::text ||E'\n'|| indclass::text ||E'\n'|| indkey::text ||E'\n'|| coalesce(indexprs::text,'')||E'\n' || coalesce(indpred::text,'') - the || character is a concatenation operator, and this part of the query builds a single string from a few details about the index (its table, operator classes, indexed columns, index expressions, and partial-index predicate). Two indexes that produce the same string are duplicates.

We also have the main query:

SELECT pg_size_pretty(sum(pg_relation_size(idx))::bigint) AS size,
       (array_agg(idx))[1] AS idx1, (array_agg(idx))[2] AS idx2
FROM ( ... ) sub
GROUP BY key
HAVING count(*) > 1
ORDER BY sum(pg_relation_size(idx)) DESC;

It filters the subquery to get keys that appear more than once, sums up the size of the duplicated indexes, and provides their names. The above example returns only two index names per group, but if you suspect there can be more of them, simply add (array_agg(idx))[n] as idxn to get more of them in the result.

Data validation with constraints

I bet you have often heard the term constraint, but what is it really? It’s a condition (or rule) that describes a valid state of the data in a table. Constraints are related to indexing because some of them - primary key, unique, and exclusion constraints - automatically create an index behind the scenes.

You have the following types of constraints at your disposal:

  • Primary key - a constraint that ensures that a column (or combination of columns) uniquely identifies each row in a given table.
  • Unique - a constraint that ensures that a column (or combination of columns) is unique per given table. The difference between the primary key and the unique constraint is that the primary key does not allow null values.
  • Not null - a constraint that ensures that a column does not contain a null value. On the backend, you would call it presence validation for a given field in a form.
  • Foreign key - a constraint that ensures that the referenced record exists and helps avoid the case where you delete the associated record in the second table and leave a “dead” reference in the base table.
  • Check - a constraint that ensures that every row evaluates the given expression to true. For example, if you have a salary column, you can define the salary > 10000 expression, and each row in the table must evaluate it to true; otherwise, it won’t be considered valid.
  • Exclusion - a constraint that ensures that two records do not have overlapping values in the given columns. The most common case is an events table where we don’t want the start_date and end_date range of one record to overlap with another record.

Having proper constraints helps you keep your data in the right format, but it also improves performance: you avoid having invalid information inside the database, so you simply have less data to process, scan, or query.
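
A small, hypothetical example showing several of these constraints in one table definition:

CREATE TABLE employees (
  id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  email text NOT NULL UNIQUE,
  salary numeric CHECK (salary > 10000)
);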

Good practices to memorize

Last, but not least, it’s always worth knowing the good practices regarding indexes, and not only because of technical interviews! It’s easier to memorize more general rules and go deeper each time it is needed instead of trying to memorize all of the information that is available.

Try to remember the following good practices to make the work with databases as efficient as possible:

  • Don’t index all columns - indexes increase the writing time and take space, so look for columns that you filter on or those that you need to validate.
  • Don’t index columns if the table has little data - PostgreSQL won’t use indexes if there is only a little data inside the table. The query planner would prefer to use a full scan in such cases.
  • Select the proper index type - the b-tree index (which is a default one) is good for many cases but not for all. When you want to use full-text search, geospatial data, or more complex data types, the default index won’t be the right choice.
  • Maintain your indexes - as time goes by, some of the indexes may no longer be needed, or the data needs to be reindexed. Remember to look after your indexes on a regular basis.
  • Benchmark and test - always test indexes on production-like data before deploying them to ensure that you will benefit from the change.
  • Use indexes - the query planner does a great job, but you can help it by adding indexes; it will then pull your data more efficiently.

When you make indexing the columns a habit, you will generate your own list of good practices with time. Also, understanding the way indexes work under the hood will help you to make more confident decisions regarding database architecture.