A deep dive into indexes
There are three types of developers: those who know that indexes speed up database queries, those who know that indexes speed up database queries but take additional space and time to write, and those who have a more profound knowledge of indexes, their advantages, disadvantages, different types, and much more. Unfortunately, the first and second groups are the most numerous.
This article will help you understand SQL indexes if you fall into the first two groups. If you already know a lot about indexes, it will help you organize your knowledge and remind you about good practices. SQL is a declarative language, meaning it tells the database what we want but not how to achieve it. The database engine decides how to pull the data. We can help the query planner by using indexes.
All the information I provide in this article is specific to the PostgreSQL database engine, but much of it is also helpful for other popular database engines.
Imagine that you run a web application used by millions of users every day. The application uses a database where information about users is collected in a single table called users. Each time someone requests a user's profile to be rendered, you have to first find the user record in the database and then render the information back to the visitor. An ordinary scenario:
SELECT * FROM users WHERE users.slug = 'user411' LIMIT 1;
The above query was executed against a table with one million records, where every slug is unique, but I hadn't put any index on it. It took 57.419 ms to get one record. Let's add a unique index on the slug column:
CREATE UNIQUE INDEX slug_idx ON users (slug);
I ran the select query again, and it took 0.055 ms to get the matching record. That's roughly 1,000 times faster than before! Imagine using a more complicated query against a bigger set of data without proper indexes on the columns. Now that we have analyzed a simple yet meaningful example, we can do a deep dive into the world of indexes to learn how to design our database the right way.
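If you would like to reproduce a similar experiment yourself, a minimal sketch along these lines should do; the table definition, row count, and slug format here are assumptions made purely for illustration:

-- Hypothetical test table with one million rows
CREATE TABLE users (id bigserial PRIMARY KEY, slug text NOT NULL);
INSERT INTO users (slug) SELECT 'user' || n FROM generate_series(1, 1000000) AS n;

-- Compare the plan and timing before and after adding the index
EXPLAIN ANALYZE SELECT * FROM users WHERE slug = 'user411' LIMIT 1;
CREATE UNIQUE INDEX slug_idx ON users (slug);
EXPLAIN ANALYZE SELECT * FROM users WHERE slug = 'user411' LIMIT 1;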
I mentioned at the beginning that most developers know that an index speeds up queries to the database, but if you asked them to explain indexes in detail, they wouldn't know the answer.
I separated the definition of an index into two pieces: a high-level explanation and a low-level definition. The first is needed to get a proper understanding of how indexes work in general, and the second is for those who like to know how things work under the hood.
Imagine you have a list of contacts on your iPhone, but this list is not sorted alphabetically. You want to call John. What would you do? You would go through every contact starting from the top and stop on the contact named John. Let’s say your list contains 100 contacts, and if you are lucky enough, you would need to go through 20 - 30 contacts before finding the right one. What if John is the last contact? It would take a lot of time to find the contact.
That's the problem solved by indexes. Without an index on the column, when you do the filtering, the database performs a full scan, which simply means that it goes through every record until the matching record is found. It can take a lot of time, depending on the database size.
When you put an index on a column, in our case the name of the contact, the database creates a special structure that keeps the contacts sorted. So if you are looking for John, you only need to look at contact names starting with the letter J, which speeds up the search considerably.
There are a few types of indexes; I will discuss them next. Let’s focus on the default one, which is B-tree, so I can demonstrate what it looks like on a low level inside the database.
B-tree stands for balanced search tree. It generalizes the binary search tree by allowing nodes with more than two children. Each node contains keys in ascending order. If we consider the example with contact names, the B-tree structure would look similar to the following visualization:
If you formed all of those names into an array and sorted them alphabetically, John would be in the middle of the array. When the search is performed on the index, the engine checks whether the given value comes before or after the root node and repeats the check for every subsequent node.
Nodes also contain a pointer to the record in the table, so when the matching node is found, the engine uses the pointer to get the record and include it in the search results.
There are a few types of indexes, each one suitable for different use cases. You already learned about the B-tree index, which is the default one and will be used unless you specify another. Here is the full list, along with use cases for each of the index types:
B-tree - the default index type. The query planner considers it whenever an indexed column is compared using one of the operators <, <=, =, >= or >, or constructs such as BETWEEN, IN, IS NULL or IS NOT NULL. Additionally, the query planner can also consider this index for queries that include pattern matching with the LIKE and ~ operators. However, the pattern must be a constant, and the anchor must be at the beginning of the pattern.
Hash - considered only for queries that use the = operator. In Postgres versions prior to 10, you shouldn't be using this index as it's not crash-safe and comes with some other disadvantages. Also, the advantage over b-tree is quite small, so in most cases b-tree is a better choice.
GiST - considered for queries on geometric data types that use one of the operators <<, &<, &>, >>, <<|, &<|, |&>, |>>, @>, <@, ~= or &&. This type of index is also good for optimizing queries that find the nearest neighbor.
SP-GiST - considered, among others, for queries on geometric data that use one of the operators <<, >>, ~=, <@, <<| or |>>.
GIN - suitable for composite values such as arrays; considered for queries that use one of the operators <@, @>, = or &&.
BRIN - considered for queries that compare an indexed column using one of the operators <, <=, =, >= or >. The BRIN index is generally more suitable for large tables with time series data rather than small tables with randomly ordered data.
The truth is that in most cases, you should be fine with the b-tree index that is created by default, unless you deal with very specific data in your database.
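As an example of those more specific cases, here is a minimal sketch of how the specialized index types could be used; the events table, its columns, and the index names are hypothetical and serve only as illustration:

-- Hypothetical table with a jsonb payload and a time series column
CREATE TABLE events (
    id         bigserial PRIMARY KEY,
    payload    jsonb       NOT NULL,
    created_at timestamptz NOT NULL
);

-- GIN index speeds up containment queries such as payload @> '{"type": "signup"}'
CREATE INDEX events_payload_gin_idx ON events USING gin (payload);

-- BRIN index suits the naturally ordered created_at column in a large table
CREATE INDEX events_created_at_brin_idx ON events USING brin (created_at);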
Now that it's clear how indexes work, both in general and under the hood, it is much easier to understand how we can support our queries with indexes to make them much faster.
You know why and how. It is time to answer the question: when? You won't benefit from putting indexes on the wrong columns; your database would only take up more disk space. In general, the following practices are considered good:
You may find some special use cases for your data, but the above rules apply to most applications. Think about them each time you design a database, perform an audit of a legacy database, or look for a way to improve query performance.
This section wouldn't be complete without mentioning the EXPLAIN and ANALYZE commands, which help us determine how the query planner sees our data.
The EXPLAIN command prints the execution plan of a query without executing it. The plan contains the order of the operations, join methods, index usage, and estimated costs:
EXPLAIN SELECT title FROM articles WHERE published = true;
The ANALYZE command collects statistics about the data in a table or index so the query planner can make better decisions. Unlike EXPLAIN, it does not take a query; you run it against a table:
ANALYZE articles;
You can also combine EXPLAIN with the ANALYZE option; the query is then actually executed, and you get real runtime metrics next to the planner's estimates, which gives you complete insight for query optimization and fine-tuning in PostgreSQL:
EXPLAIN ANALYZE SELECT title FROM articles WHERE published = true;
When it comes to filtering, you can filter records based on a single column or multiple columns. If you plan to filter basic data types, the default index (b-tree) will be enough:
CREATE INDEX title_idx ON articles (title);
If you plan to add an index on multiple columns, just modify the last part of the command:
CREATE INDEX title_category_idx ON articles (title, category);
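To illustrate how such a multi-column index is used (still assuming the articles table from the examples above, with arbitrary values), it can serve queries that filter on both columns as well as queries that filter on the leading column alone:

-- Both columns are present, so title_category_idx can be used
SELECT * FROM articles WHERE title = 'Indexes' AND category = 'databases';
-- The leading column alone can still use the index
SELECT * FROM articles WHERE title = 'Indexes';
-- Filtering only on the second column usually cannot use this index efficiently
SELECT * FROM articles WHERE category = 'databases';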
When you create an index, by default you index all records in the given table. If you want to avoid that, you can create a partial index. When you build a query, you use the WHERE keyword to narrow down the results, and the same applies to index creation:
CREATE INDEX title_idx ON articles (title) WHERE published = true;
That way, you will reduce the index size by indexing only the subset of all records.
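For the planner to pick such a partial index, the query's condition has to match the index predicate; a quick illustration (the values are, of course, arbitrary):

-- Matches the published = true predicate, so the partial index can be used
SELECT * FROM articles WHERE published = true AND title = 'Indexes';
-- Does not match the predicate, so the partial index will not be used
SELECT * FROM articles WHERE published = false AND title = 'Indexes';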
When simple values are not enough, you can create indexes based on the outcome of a function. A popular example would be indexing the lowercase version of the value in a given field:
CREATE INDEX title_idx ON articles (lower(title));
Keep in mind that such an index can take more space (depending on the outcome) and can have an impact on the performance of data modifications.
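Such an expression index is only considered when the query uses the same expression; for example:

-- Can use the expression index, because the expression matches lower(title)
SELECT * FROM articles WHERE lower(title) = lower('A Deep Dive Into Indexes');
-- Cannot use the expression index, because it filters on the raw column
SELECT * FROM articles WHERE title = 'A Deep Dive Into Indexes';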
As you may already know, after the index is scanned, the pointer is used to retrieve the data from the table. You can skip the second part and improve the query even more by using covering indexes.
A covering index enables what is known as an index-only scan. It contains all of the columns required to satisfy the query, so there is no need to perform the additional lookup in the table. The syntax for creating such indexes is the following:
CREATE INDEX idx_covering ON articles (title, category) INCLUDE (author, slug);
As always, such an index gives us a lot of benefits, but you also have to know that it is not a good idea to use it if you perform frequent updates on the included columns, as every change requires updating both the index and the table.
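A query that touches only the indexed and included columns can then be answered straight from the index, which you can verify with EXPLAIN; a small sketch using the index above:

-- All referenced columns live in the index, so the plan should show an Index Only Scan
EXPLAIN SELECT author, slug FROM articles WHERE title = 'Indexes' AND category = 'databases';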
Besides the simple queries mentioned above, you can also improve the speed of more advanced queries, such as full-text search or queries on geospatial data. Whenever you create an index, you can refer to the official documentation, where the index creation command is described in detail along with all possible parameters.
It's not enough to just add the right indexes to your database; you also need to maintain the existing indexes to make sure that the structure of your database is in the best shape possible. While you don't have much to maintain in terms of indexes when you start an application from scratch, in a legacy application there are a lot of cases to handle.
You need to maintain indexes because one of the following situations might have happened in the past:
These are just some of many situations where invalid indexes can be created, and we need to maintain them to keep our database healthy and as performant as possible.
In PostgreSQL, the information about indexes is stored in the pg_index table. Since it’s a table, you can query it just like any other table in your database. I investigated the table columns (I used version 11.19), and the table contains the following columns among others:
indnatts - the total number of columns in the index
indnkeyatts - the number of key columns in the index
indisunique - whether the index is unique
indisprimary - whether the index represents the primary key of the table
indisexclusion - whether the index supports an exclusion constraint
indisclustered - whether the table was last clustered on this index
indisvalid - whether the index is valid for queries
indisready - whether the index is ready for inserts
indislive - false if the index is in the process of being dropped
indisreplident - whether the index has been chosen as the replica identity
indkey - an array of indnatts values indicating which table columns the index indexes
The other columns are indexrelid, indrelid, indimmediate, indcheckxmin, indcollation, indclass, indoption, indexprs, and indpred. You can check the detailed explanation of each column in the official documentation.
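Since pg_index is an ordinary catalog table, you can query it directly. As a small example based on the columns above, the following query lists indexes that are marked as invalid:

-- Invalid indexes (for example, leftovers from a failed CREATE INDEX CONCURRENTLY)
SELECT indexrelid::regclass AS index_name, indisready, indislive
FROM pg_index
WHERE NOT indisvalid;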
To detect unused indexes in our database, we can use the pg_stat_user_indexes view, which contains usage statistics. It includes the following important columns:
idx_scan - the number of scans performed on the index
relname - the name of the table
indexrelname - the name of the index
indexrelid - the OID of the index, which helps us find the corresponding record in the pg_index table
Keep in mind that we can't simply look for records where idx_scan equals zero. We also have to consider the following things:
Expression indexes are excluded from the check - we can detect them using the indkey column from the pg_index table: when the array contains 0, it means that the index is built on an expression.
For unique indexes, idx_scan will contain 0 because such indexes behave differently from non-unique indexes. Unique indexes are used to enforce the unique constraint, and a full scan is not performed. To find unique indexes, we can leverage the indisunique column from the pg_index table.
Indexes that back constraints should not be considered unused either. We can check the pg_constraint table, which contains all constraints; if our index is not referenced there, we can take it into consideration.
Taking into account all of the above points, we can finally produce a SQL query that will show us which indexes are unused:
SELECT s.relname, s.indexrelname, pg_relation_size(s.indexrelid) AS index_size
FROM pg_stat_user_indexes s
JOIN pg_catalog.pg_index i ON s.indexrelid = i.indexrelid
WHERE s.idx_scan = 0
  AND 0 <> ALL (i.indkey)
  AND NOT i.indisunique
  AND NOT EXISTS (SELECT 1 FROM pg_catalog.pg_constraint c WHERE c.conindid = s.indexrelid)
ORDER BY pg_relation_size(s.indexrelid) DESC;
The above query will produce a set of results where each row contains three attributes:
relname - the name of the table
indexrelname - the name of the index
index_size - the size of the index
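Once you have reviewed the results and are confident that an index is truly unneeded, you can drop it. A sketch of how that could look (the index name is just a placeholder); the CONCURRENTLY option avoids locking writes to the table, but it cannot run inside a transaction block:

DROP INDEX CONCURRENTLY some_unused_index;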
Frequent update and delete operations can lead to a situation where there is a lot of unused space in a table or index relation files on the disk; it is called bloat. Such a situation can cause performance degradation.
While it might be surprising to a lot of developers, when a record is deleted, it is not physically removed from the disk. PostgreSQL makes it invisible and marks it for deletion. The same happens when a record is updated: a new version is created in the database, and the old one is marked as invisible.
During the VACUUM operation, PostgreSQL scans the table for dead rows and removes them; only then is that space reclaimed so it can be reused. The autovacuum daemon, which runs VACUUM automatically, is enabled by default, but it can be switched off or tuned too conservatively for a given workload.
Bloat builds up when autovacuum is disabled or when it is enabled but not running frequently enough to keep up with the workload on the database. You can check whether the feature is enabled by running the following query:
SELECT name, setting, unit FROM pg_settings WHERE name = 'autovacuum';
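If autovacuum is enabled but still cannot keep up with a particularly busy table, a common approach is to tune it per table with storage parameters; the values below are illustrative, not recommendations:

-- Trigger autovacuum earlier on this table (thresholds are only an example)
ALTER TABLE articles SET (
    autovacuum_vacuum_scale_factor = 0.05,
    autovacuum_vacuum_threshold = 1000
);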
PostgreSQL also provides the pgstattuple extension, which exposes information about the tuples in our database. For example, you can query the information_schema.tables view and, for each table, call the pgstattuple function to read the percentage of dead tuples with the following query:
SELECT table_name,
       (pgstattuple(table_name)).dead_tuple_percent AS dead_tuple_percent
FROM information_schema.tables
WHERE table_schema = 'public' AND table_type = 'BASE TABLE'
ORDER BY dead_tuple_percent DESC;
If you receive an error saying that the pgstattuple function does not exist, you have to enable the extension first:
CREATE EXTENSION pgstattuple;
It should work now. Since this article is about indexes, let's check how to find the bloat size in indexes. We can use the following query:
SELECT cls.relname, am.amname,
       pg_total_relation_size(indexrelid) AS index_size_bytes,
       pg_total_relation_size(indexrelid) - pg_relation_size(indexrelid) AS index_bloat_bytes
FROM pg_index idx
JOIN pg_class cls ON cls.oid = idx.indexrelid
JOIN pg_am am ON am.oid = cls.relam
WHERE indrelid > 16384;
This query returns the name of the index, index type, and estimated bloat size.
To remove the bloat from the given index, you can rebuild the index:
REINDEX INDEX index_name;
The above command rebuilds the index from scratch and replaces the old version with the new one. Please keep in mind that this operation blocks writes to the table and can take some time, so it affects database performance.
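On PostgreSQL 12 and newer, you can avoid blocking writes during the rebuild by adding the CONCURRENTLY option; it takes longer and temporarily needs extra disk space, but the table stays writable:

REINDEX INDEX CONCURRENTLY index_name;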
There are cases when two or more indexes exist that have different names but cover the same combination of columns. As a result, one of them (or even more) is not needed; it just takes up space and makes the writing process more expensive.
To find duplicates, we can query the pg_index table using the following query:
SELECT pg_size_pretty(sum(pg_relation_size(idx))::bigint) as size,
       (array_agg(idx))[1] as idx1,
       (array_agg(idx))[2] as idx2
FROM (
    SELECT indexrelid::regclass as idx,
           (indrelid::text ||E'\n'|| indclass::text ||E'\n'|| indkey::text ||E'\n'||
            coalesce(indexprs::text,'') ||E'\n'|| coalesce(indpred::text,'')) as key
    FROM pg_index
) sub
GROUP BY key
HAVING count(*) > 1
ORDER BY sum(pg_relation_size(idx)) DESC;
To better understand what is happening above, let’s break down the query into smaller pieces. We have the following subquery:
SELECT indexrelid::regclass as idx,
       (indrelid::text ||E'\n'|| indclass::text ||E'\n'|| indkey::text ||E'\n'||
        coalesce(indexprs::text,'') ||E'\n'|| coalesce(indpred::text,'')) as key
FROM pg_index
indexrelid::regclass as idx - to obtain the name of the index. Without the casting, we would just get the OID, which is the object identifier of the index.
indrelid::text ||E'\n'|| indclass::text ||E'\n'|| indkey::text ||E'\n'|| coalesce(indexprs::text,'')||E'\n' || coalesce(indpred::text,'') - the || symbol is the string concatenation operator, and this part of the query builds a single key string from a few details about the index: its table, operator classes, columns, expressions, and predicate
We also have the main query:
SELECT pg_size_pretty(sum(pg_relation_size(idx))::bigint) as size,
       key,
       (array_agg(idx))[1] as idx1,
       (array_agg(idx))[2] as idx2
FROM ( ... ) sub
GROUP BY key
HAVING count(*) > 1
ORDER BY sum(pg_relation_size(idx)) DESC;
It keeps only the keys that appear more than once, sums up the size of the matching indexes, and returns the names of the duplicated indexes. The above example returns only two index names per key, but if you suspect that there can be more of them, simply add (array_agg(idx))[n] as idxn to the select list to get more of them in the result.
I bet you have often heard the term constraint, but what is it really? It's a condition (or rule) that describes a valid state of the data in a table. Constraints are related to indexing because some of them create an index behind the scenes: unique, primary key, and exclusion constraints are enforced by automatically created indexes.
You have the following types of constraints at your disposal:
Not-null constraint - ensures that a column cannot contain a NULL value.
Unique constraint - ensures that the values in a column (or a group of columns) are unique across the table.
Primary key constraint - a combination of the not-null and unique constraints that uniquely identifies each row in the table.
Foreign key constraint - ensures that a value in a column matches a value existing in another table.
Check constraint - given the salary column, you can define the salary > 10000 expression, and each row in the table must evaluate it to true; otherwise, it won't be considered valid.
Exclusion constraint - prevents conflicting rows; for example, you can forbid a record with the start_date and end_date columns to overlap with another record.
Having proper constraints helps you keep your data in the right format, but it also improves performance: you avoid having invalid information inside the database, so you simply have less data to process, scan, or query.
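As a small illustration of the check and exclusion constraints mentioned above (the table and constraint names are hypothetical), they could be defined like this:

-- Check constraint: every row must satisfy the expression
CREATE TABLE employees (
    id     bigserial PRIMARY KEY,
    salary numeric NOT NULL CONSTRAINT salary_check CHECK (salary > 10000)
);

-- Exclusion constraint: no two rows may have overlapping date ranges
CREATE TABLE bookings (
    id         bigserial PRIMARY KEY,
    start_date date NOT NULL,
    end_date   date NOT NULL,
    CONSTRAINT bookings_no_overlap
        EXCLUDE USING gist (daterange(start_date, end_date) WITH &&)
);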
Last but not least, it's always good to know the good practices regarding indexes, and not only because of technical interviews! It's easier to memorize general rules and go deeper whenever needed than to try to memorize all of the available information.
Try to remember the following good practices to make working with databases as efficient as possible:
When you make thoughtful indexing a habit, you will build your own list of good practices over time. Also, understanding how indexes work under the hood will help you make more confident decisions about your database architecture.