database • indexing • postgres • mysql • relational database • clustered index • primary key index • explain • analyze • postgres indexing

An in-depth look at Database Indexing

technology

In this article, we will explore Database Indexing. We will begin by installing the Docker & running a Postgres container on it. Subsequently, to execute queries and comprehend how the database uses various indexing strategies, we will insert millions of rows into a Postgres table.

Following that, we will explore different tools to gaining insights into the SQL query planner and optimizers. After that, we will delve into understanding database indexing, examining how various types of indexing works with examples, and do a comparison between different types of database scan strategies.

Finally, we will then demystify how database indexes operates for the WHERE clause with the AND and OR operators.

Prerequisite

Installing Docker & Running a Postgres Container:
1. Install Docker by following the instructions provided in the getting started guide on the official Docker website.
2. Verify that Docker is installed by running the command docker --version
Running a PostgresSQL Container:
1. Spin up the Docker container by using the official Postgres image.
```
docker run -e POSTGRES_PASSWORD=secret --name pg postgres
```
2. Start the Postgres command shell:
```
docker exec -it pg psql -U postgres
```

Inserting a Million Rows into a Postgres Table:

Create a table named employees:

create table employees(id serial primary key, employeeid integer, name TEXT);

Insert into the employees table using the generate_series function:

create or replace function gen_random_string(length integer)
RETURNS VARCHAR as
$$
DECLARE
result VARCHAR := '';
BEGIN
FOR i IN 1..length LOOP
	result := result || chr((floor(random() * 26) + 65)::integer);
END LOOP;
RETURN result;
END;
$$ language plpgsql;

INSERT INTO EMPLOYEES(employeeid, name)
SELECT *, gen_random_string((random() * 10 + 5)::integer)
FROM generate_series(0, 1000000);

Confirm the result by executing the count query:

select count(*) from employees;
count
--------
	1000001
(1 row)

This sequence of steps creates a table named employees and inserts one million rows into it, generating random values for the employeeid and name columns. The final count query verifies the successful insertion of the specified number of rows.

The SQL Query Planner and Optimizer:

Explanation: The explain command displays the execution plan generated by the PostgresSQL planner for the provided statement. This plan illustrates how the table(s) referenced in the statement will be scanned, whether through plain sequential scans, index scans, etc.

Examples:

Select All Query:
```
postgres=# explain select * from employees;
QUERY PLAN
-------------------------------------------------------------
Seq Scan on employees (cost=0.00..16139.01 rows=1000001 width=19)
```
- Seq Scan: Directly goes to the heap and fetches everything, similar to a Full Table Scan in other databases. In Postgres, with multiple threads, it's called Parallel Seq Scan.
- Cost=0.00..16139.01: The first number represents work before fetching (e.g., aggregating, ordering), and the second number is the total estimated execution time.
- rows=1000001: An approximate number of rows to be fetched.
- width=19: The sum of bytes for all columns.

Select All Query with Order By (Indexed Column):

postgres=# create index employees_employeeid_idx ON employees(employeeid);
CREATE INDEX
postgres=# explain select * from employees order by employeeid;
QUERY PLAN
-----------------------------------------------------------------
Index Scan using employees_employeeid_idx on employees (cost=0.42..32122.44 rows=1000001 width=19)

cost=0.42: Postgres performs work, ordering by employeeid. An index on employeeid leads to an Index Scan.

Select All Query with Order By (Non-Indexed Column):

postgres=# explain select * from employees order by name;
QUERY PLAN
--------------------------------------------------------------
Sort (cost=136306.96..138806.96 rows=1000001 width=19)
	Sort Key: name
	-> Seq Scan on employees (cost=0.00..16139.01 rows=1000001 width=19)

Seq Scan & Sort: Seq Scan on the table, followed by sorting. Sorting cost is critical.

Select Only ID:

postgres=# explain select id from employees;
QUERY PLAN
---------------------------------------------------
Seq Scan on employees (cost=0.00..16139.01 rows=1000001 width=4)

width=4: Fetching only id, resulting in a smaller width of 4 bytes (integer).

Select All Query for a Particular ID:

postgres=# explain select * from employees where id = 10;
QUERY PLAN
-------------------------------------------------------------------
Index Scan using employees_pkey on employees (cost=0.42..8.44 rows=1 width=19)
	Index Cond: (id = 10)

rows=1: Fetching only 1 record using the primary key index.

What is Database indexing?

An index is a data structure that speeds up data retrieval without needing to scan every row present in the table. Index improves lookup performance but decreases write performance because every time a new row is created, indexes need to be updated.

Indexes are typically stored on the disk. An index is typically a small table with two columns: a primary/candidate key and address. Keys are made from one or more columns.

The data structure used for storing the index is B+ Trees. In the simplest form, an index is a stored table of key-value pairs that allows searches to be conducted in O(logn) time using binary search on sorted data.

Types of Indexes:

Clustered Index
- Index and data reside together and are ordered by the key. A Clustered Index is basically a tree-organized table. Instead of storing the records in an unsorted Heap table space, the clustered index is actually B+Tree index having the Leaf Nodes, which are ordered by the clusters key column value, store the actual table records, as illustrated by the following diagram.
Nonclustered Index
- A nonclustered index contains the key values and each key value entry has a pointer to the data row that contains the key value. Since the Clustered Index is usually built using the Primary Key column values, if you want to speed up queries that use some other column, then you'll have to add a Secondary Non-Clustered Index. The Secondary Index is going to store the Primary Key value in its Leaf Nodes, as illustrated by the following diagram

How database indexes works under the hood?

We have already created a database index on the employeeid column in our employees table using the CREATE INDEX statement. Behind the scenes, Postgres creates a new pseudo-table in the database with two columns: a value for employeeid and a pointer to the corresponding record in the employees table. This pseudo-table is organized and stored as a binary tree with ordered values for the employeeid column. Consequently, the query operates with O(logn) efficiency and typically executes in a second or less.

Let’s delve into two scenarios:

SELECT * FROM employees WHERE employeeid = 4;
Here, with an index on the employeeid column, the query initiates an Index Scan. The process begins by accessing the Index table, retrieving the reference for the Page number, and obtaining the row number for the specific record on that page. Subsequently, it navigates to the corresponding page in the heap and fetches the entire row. This method, known as an Index Scan.
SELECT employeeid FROM employees WHERE employeeid = 4;
In this instance, there is no need to access the heap to retrieve the complete record. Since the required value for employeeid is already present in the index table, the operation is streamlined, and it directly performs an Index Only Scan. This approach allows the system to retrieve the specific employeeid directly from the index table without the additional step of fetching the complete row from the heap. This can lead to improved performance, particularly when the index includes all the columns needed for the query, minimizing the amount of data that needs to be processed.

What are different type of database scan strategies?

Index Only Scan
```
postgres=# EXPLAIN ANALYZE select id from employees where id = 100;
							QUERY PLAN
-----------------------------------------------------------------------------
 Index Only Scan using employees_pkey on employees  (cost=0.42..4.44 rows=1 width=4) (actual time=2.529..2.542 rows=1 loops=1)
   Index Cond: (id = 100)
   Heap Fetches: 0
 Planning Time: 0.510 ms
 Execution Time: 2.708 ms
```
If we examine the given query, we retrieve the ID using a filter on the ID column, which serves as the primary key and has an index on it. Let's break down the query output:
1. Index Only Scan: In the case of an Index Only Scan, Postgres scans the index table, resulting in faster performance as the Index table is significantly smaller than the actual table. With Index Only Scan, results are directly fetched from the Index table when querying columns for which indexes have been created.
2. Heap Fetches: 0: This indicates that the queried ID value did not necessitate accessing the heap table to retrieve information. The information was obtained inline, and this is referred to as an Inline query.
3. Planning Time: 0.510 ms: This represents the time taken by Postgres to determine whether to use the index or perform a full table scan.
4. Execution Time: 2.708 ms: This is the time taken by Postgres to actually fetch the records from the table.

Index Scan

postgres=# EXPLAIN ANALYZE select name from employees where id = 1000;
								QUERY PLAN
----------------------------------------------------------------------------
 Index Scan using employees_pkey on employees  (cost=0.42..8.44 rows=1 width=11) (actual time=1.250..1.260 rows=1 loops=1)
   Index Cond: (id = 1000)
 Planning Time: 0.703 ms
 Execution Time: 1.655 ms

If we examine the given query, we are retrieving the name using a filter on the ID column, which serves as the primary key and has an index on it. In this case, the process begins with an index scan on the Index table to retrieve information about the Page number and row number on the Heap. Since the name is not available in the Index table, we must go to the heap to fetch the name. This type of scan is referred to as an Index Scan.

postgres=# EXPLAIN ANALYZE select name from employees where id < 1000;
                      QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Index Scan using employees_pkey on employees  (cost=0.42..40.75 rows=1047 width=32) (actual time=0.062..1.139 rows=999 loops=1)
   Index Cond: (id < 1000)
 Planning Time: 4.948 ms
 Execution Time: 1.215 ms
(4 rows)

In this case, we are filtering the record using the filter on the id with '<' operator and filtering out record which have id less than 1000. So the process begins with an Index scanning on the Index table then fetching the rows from the heap. Same as in case of fetching single id.

postgres=# EXPLAIN ANALYZE select name from employees where id > 1000;
                    QUERY PLAN
---------------------------------------------------------------------------
 Seq Scan on employees  (cost=0.00..18639.01 rows=998953 width=32) (actual time=0.104..168.884 rows=999001 loops=1)
   Filter: (id > 1000)
   Rows Removed by Filter: 1000
 Planning Time: 0.158 ms
 Execution Time: 198.259 ms
(5 rows)

In this case, we are filtering the record using the filter on the id with '>' operator and filtering out records which have id greater than 1000. So, in this case, as Postgres knows it has to fetch 99% of the data anyway, it prefers to use the Seq Scan on the heap table. Rather than going to the Index table to filter out the records and then again going to the heap to filter those Index-scanned rows.

Parallel Seq Scan

postgres=# EXPLAIN ANALYZE select id from employees where name = 'WABOY';
								QUERY PLAN
----------------------------------------------------------------------------
 Gather  (cost=1000.00..12347.44 rows=1 width=4) (actual time=3.970..120.383 rows=1 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on employees  (cost=0.00..11347.34 rows=1 width=4) (actual time=64.894..102.448 rows=0 loops=3)
		 Filter: ((name)::text = 'WABOY'::text)
		 Rows Removed by Filter: 333333
 Planning Time: 0.898 ms
 Execution Time: 120.850 ms
(8 rows)

If we examine the given query, we are retrieving the id using a filter on the name column, which doesn't have an index on it. As we don't have an index on the name column, that means we have to actually search for the name WABOY one by one and perform a sequential scan on the employees table. Postgres efficiently addresses this by executing multiple worker threads and conducting a parallel sequential scan.

Bitmap Scan
Let's create a Bitmap Index on the name column to get started.
```
postgres=# CREATE INDEX employees_name_idx ON employees(name);
CREATE INDEX
```
Let's explore how Bitmap Scan works in PostgreSQL.

Heap pages are stored on disk, and loading a page into memory can be expensive. When using an Index Scan, if the query yields a large number of rows, the query's performance may suffer because each row's retrieval involves loading a page into memory.
In contrast, with a Bitmap Scan, instead of loading rows into memory, we set a bit to 1 in an array of bits corresponding to heap page numbers. The operation then works on top of this bitmap.
Here's a simplified breakdown of above image:
- In a bitmap index scan, rows are not loaded into memory. PostgreSQL sets the bit to 1 for page number 1 when the name is 'CD' and 0 for other pages.
- When the name is 'BC', page number 2 is set to 1, and others are set to 0.
- Subsequently, a new bitmap is created by performing an OR operation on both bitmaps.
- Finally, PostgreSQL executes a Bitmap Heap Scan where it fully scans each heap page and rechecks the conditions.
This approach minimizes the need to load entire pages into memory for individual rows, improving the efficiency of the query. If the query results in a lot of rows located in only a limited number of heap pages then this strategy will be very efficient.

Now let's filter out the id, name by the name
```
postgres=# EXPLAIN ANALYZE select id, name from employees where name = 'WABOY';
                  QUERY PLAN
--------------------------------------------------------------------------------
 Bitmap Heap Scan on employees  (cost=111.17..6277.29 rows=5000 width=36) (actual time=0.348..0.369 rows=1 loops=1)
   Recheck Cond: (name = 'WABOY'::text)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on employees_name_idx  (cost=0.00..109.92 rows=5000 width=0) (actual time=0.274..0.274 rows=1 loops=1)
         Index Cond: (name = 'WABOY'::text)
 Planning Time: 0.905 ms
 Execution Time: 0.734 ms
```
Upon analyzing the provided query, we extract the id and name by applying a filter on the name column, which has an index. Let's clarify the process:
1. Bitmap Index Scan: This step involves scanning the index table for the name column since an index exists on it. It retrieves the page number and row number to obtain references to the corresponding records in the heap.
2. Bitmap Heap Scan: Since we are filtering based on both id and name, this step is necessary to visit the heap and retrieve the values for both attributes for a specific record. The reference to the record is obtained from the preceding Bitmap Index Scan.

Combining Database Indexes

Prerequisite: Let's create a table to learn how to combine indexes.

CREATE TABLE NUMBERS(id serial primary key, a integer, b integer, c integer);
INSERT INTO NUMBERS(a, b, c) select (random() * 100)::integer, (random() * 1000)::integer, (random() * 2000)::integer from generate_series(0, 10000000);

Now let's create index on the columns A and B.

CREATE INDEX numbers_a_idx on numbers(a);
CREATE INDEX numbers_b_idx on numbers(b);

Select column c for a particular value of column a

postgres=# EXPLAIN ANALYZE SELECT c FROM numbers WHERE a = 88;
                  QUERY PLAN

---------------------------------------------------------------------------------
 Bitmap Heap Scan on numbers  (cost=1101.09..57496.05 rows=98665 width=4) (actual time=41.110..683.631 rows=99888 loops=1)
   Recheck Cond: (a = 88)
   Heap Blocks: exact=45619
   ->  Bitmap Index Scan on numbers_a_idx  (cost=0.00..1076.42 rows=98665 width=0) (actual time=29.403..29.403 rows=99888 loops=1)
         Index Cond: (a = 88)
 Planning Time: 1.569 ms
 Execution Time: 687.152 ms

Here, we can analyze that since we have an index only on column a, a bitmap index scan is performed on column a. To retrieve column c, it jumps to the heap and performs a bitmap heap scan.

Select column c but we are going to query on both a and b with AND operation

postgres=# EXPLAIN ANALYZE SELECT c FROM numbers WHERE a = 90 AND b = 500;
                  QUERY PLAN
-----------------------------------------------------------------------------
 Bitmap Heap Scan on numbers  (cost=1320.12..1746.88 rows=110 width=4) (actual time=32.300..38.262 rows=107 loops=1)
   Recheck Cond: ((b = 500) AND (a = 90))
   Heap Blocks: exact=107
   ->  BitmapAnd  (cost=1320.12..1320.12 rows=110 width=0) (actual time=32.079..32.081 rows=0 loops=1)
         ->  Bitmap Index Scan on numbers_b_idx  (cost=0.00..110.88 rows=9926 width=0) (actual time=4.494..4.494 rows=9974 loops=1
)
               Index Cond: (b = 500)
         ->  Bitmap Index Scan on numbers_a_idx  (cost=0.00..1208.93 rows=111000 width=0) (actual time=26.799..26.800 rows=99868 l
oops=1)
               Index Cond: (a = 90)
 Planning Time: 3.362 ms
 Execution Time: 38.604 ms

Here, we can analyze the following:

PostgreSQL executed a bitmap index scan on column 'A'.
Concurrently, a bitmap index scan was performed on column 'B'.
Subsequently, PostgreSQL executed a bitmap AND operation to combine the results of the scans on 'A' and 'B'.
After obtaining the references for the rows to be retrieved, PostgreSQL proceeds to perform a bitmap heap scan.

Select column c but we are going to query on both a and b with OR operation.

postgres=# EXPLAIN ANALYZE SELECT c FROM numbers WHERE A = 50 OR B = 500;
                QUERY PLAN

-------------------------------------------------------------------------------
 Bitmap Heap Scan on numbers  (cost=1164.23..57490.68 rows=101835 width=4) (actual time=37.957..600.439 rows=109466 loops=1)
   Recheck Cond: ((a = 50) OR (b = 500))
   Heap Blocks: exact=46998
   ->  BitmapOr  (cost=1164.23..1164.23 rows=101926 width=0) (actual time=25.625..25.626 rows=0 loops=1)
         ->  Bitmap Index Scan on numbers_a_idx  (cost=0.00..1002.43 rows=92000 width=0) (actual time=24.309..24.309 rows=99602 lo
ops=1)
               Index Cond: (a = 50)
         ->  Bitmap Index Scan on numbers_b_idx  (cost=0.00..110.88 rows=9926 width=0) (actual time=1.313..1.314 rows=9974 loops=1
)
               Index Cond: (b = 500)
 Planning Time: 1.135 ms
 Execution Time: 604.165 ms

Here, we can analyze the following:

PostgreSQL executed a bitmap index scan on column a.
Concurrently, a bitmap index scan was performed on column b.
Subsequently, PostgreSQL executed a BitmapOr operation to combine the results of the scans on columns a and b.
After obtaining the references for the rows to be retrieved, PostgreSQL proceeds to perform a bitmap heap scan.

Composite Index
First, we need to drop the indexes on both columns a and b, and then create a composite index on columns a and b.
```
postgres=# CREATE INDEX numbers_a_b_idx on numbers(a, b);
CREATE INDEX
```

Select column c for a particular value of column a

postgres=# EXPLAIN ANALYZE SELECT c FROM numbers WHERE a = 70;
							QUERY PLAN

-----------------------------------------------------------------------
Bitmap Heap Scan on numbers  (cost=1189.93..56830.03 rows=106000 width=4) (actual time=38.779..610.173 rows=99789 loops=1)
	Recheck Cond: (a = 70)
	Heap Blocks: exact=45549
	->  Bitmap Index Scan on numbers_a_b_idx  (cost=0.00..1163.43 rows=106000 width=0) (actual time=27.796..27.797 rows=99789 loops
=1)
		Index Cond: (a = 70)
Planning Time: 5.188 ms
Execution Time: 613.305 ms
(7 rows)

Here, we can analyze the following:

This time, PostgreSQL decided to use the composite index numbers_ab_idx on both columns a and b.
Subsequently, it performs a Bitmap Heap Scan on the selected rows based on the composite index.

Select column c for a particular value of column b

postgres=# EXPLAIN ANALYZE SELECT c FROM numbers WHERE b = 900;
							QUERY PLAN
-----------------------------------------------------------------------
Gather  (cost=1000.00..108130.94 rows=9926 width=4) (actual time=24.402..395.326 rows=10027 loops=1)
	Workers Planned: 2
	Workers Launched: 2
	->  Parallel Seq Scan on numbers  (cost=0.00..106138.34 rows=4136 width=4) (actual time=9.913..317.809 rows=3342 loops=3)
		Filter: (b = 900)
		Rows Removed by Filter: 3329991
Planning Time: 0.574 ms
JIT:
	Functions: 12
	Options: Inlining false, Optimization false, Expressions true, Deforming true
	Timing: Generation 4.899 ms, Inlining 0.000 ms, Optimization 3.039 ms, Emission 25.030 ms, Total 32.968 ms
Execution Time: 398.820 ms
(12 rows)

Here, we can analyze the following:

This time, Postgres did not use the index numbers_a_b_idx. Even though we have a composite index on both columns a and b. Why? Because we cannot use this composite index when scanning a filter. The filter condition is on column a, and the composite index can be used for conditions involving both columns a and b or just column a. However, it cannot be used for conditions involving only column b. Therefore, if we have a composite index on columns a and b, querying on column b alone will not utilize the index.

Select column c but we are going to query on both A and B with AND operation

postgres=# EXPLAIN ANALYZE SELECT C FROM numbers WHERE A = 60 AND B = 600;
							QUERY PLAN
-------------------------------------------------------------------------
Bitmap Heap Scan on numbers  (cost=5.44..386.39 rows=98 width=4) (actual time=0.732..6.281 rows=102 loops=1)
	Recheck Cond: ((a = 60) AND (b = 600))
	Heap Blocks: exact=101
	->  Bitmap Index Scan on numbers_a_b_idx  (cost=0.00..5.42 rows=98 width=0) (actual time=0.513..0.513 rows=102 loops=1)
		Index Cond: ((a = 60) AND (b = 600))
Planning Time: 0.756 ms
Execution Time: 6.659 ms

Here, the situation remains the same as earlier when we had an index on both columns A and B.

Select column c but we are going to query on both A and B with OR operation

postgres=# EXPLAIN ANALYZE SELECT C FROM numbers WHERE A = 60  or B = 80;
							QUERY PLAN
------------------------------------------------------------------------
Gather  (cost=1000.00..128404.51 rows=108495 width=4) (actual time=20.721..388.512 rows=109443 loops=1)
	Workers Planned: 2
	Workers Launched: 2
	->  Parallel Seq Scan on numbers  (cost=0.00..116555.01 rows=45206 width=4) (actual time=8.325..304.804 rows=36481 loops=3)
		Filter: ((a = 60) OR (b = 80))
		Rows Removed by Filter: 3296853
Planning Time: 1.009 ms
JIT:
	Functions: 12
	Options: Inlining false, Optimization false, Expressions true, Deforming true
	Timing: Generation 5.795 ms, Inlining 0.000 ms, Optimization 2.561 ms, Emission 21.798 ms, Total 30.154 ms
Execution Time: 397.675 ms

Here, we can analyze the situation as follows:

As observed earlier, it's not feasible to use a composite index on column B individually. The option is either to use it on column A alone or on both columns A and B. Consequently, Postgres opts for a Parallel Sequential Scan in this scenario.

Manish Sharma

Manish Sharma works in the Technology team at eLitmus and loves building things. Outside of work, he enjoys spending time in nature and swimming.