An in-depth look at Database Indexing
In this article, we will explore Database Indexing. We will begin by installing the Docker & running a Postgres container on it. Subsequently, to execute queries and comprehend how the database uses various indexing strategies, we will insert millions of rows into a Postgres table.
Following that, we will explore different tools to gaining insights into the SQL query planner and optimizers. After that, we will delve into understanding database indexing, examining how various types of indexing works with examples, and do a comparison between different types of database scan strategies.
Finally, we will then demystify how database indexes operates for the WHERE clause with the AND and OR operators.
Prerequisite
-
Installing Docker & Running a Postgres Container:
- Install Docker by following the instructions provided in the getting started guide on the official Docker website.
-
Verify that Docker is installed by running the command
docker --version
-
Running a PostgresSQL Container:
- Spin up the Docker container by using the official Postgres image.
- Start the Postgres command shell:
-
Inserting a Million Rows into a Postgres Table:
-
Create a table named
employees
: -
Insert into the
employees
table using the generate_series function: -
Confirm the result by executing the
count
query:
employees
and inserts one million rows into it, generating random values for theemployeeid
andname
columns. The final count query verifies the successful insertion of the specified number of rows. -
Create a table named
-
The SQL Query Planner and Optimizer:
-
Explanation:
The
explain
command displays the execution plan generated by the PostgresSQL planner for the provided statement. This plan illustrates how the table(s) referenced in the statement will be scanned, whether through plain sequential scans, index scans, etc. -
Examples:
-
Select All Query:
-
Seq Scan
: Directly goes to the heap and fetches everything, similar to a Full Table Scan in other databases. In Postgres, with multiple threads, it's calledParallel Seq Scan
. -
Cost=0.00..16139.01
: The first number represents work before fetching (e.g., aggregating, ordering), and the second number is the total estimated execution time. -
rows=1000001
: An approximate number of rows to be fetched. -
width=19
: The sum of bytes for all columns.
-
-
Select All Query with Order By (Indexed Column):
-
cost=0.42
: Postgres performs work, ordering byemployeeid
. An index onemployeeid
leads to anIndex Scan
.
-
-
Select All Query with Order By (Non-Indexed Column):
-
Seq Scan & Sort
: Seq Scan on the table, followed by sorting. Sorting cost is critical.
-
-
Select Only ID:
-
width=4
: Fetching onlyid
, resulting in a smallerwidth
of 4 bytes (integer).
-
-
Select All Query for a Particular ID:
-
rows=1
: Fetching only 1 record using the primary key index.
-
-
Select All Query:
-
Explanation:
The
What is Database indexing?
An index is a data structure that speeds up data retrieval without needing to scan every row present in the table. Index improves lookup performance but decreases write performance because every time a new row is created, indexes need to be updated.
Indexes are typically stored on the disk. An index is typically a small table with two columns: a primary/candidate key and address. Keys are made from one or more columns.
The data structure used for storing the index is B+ Trees. In the simplest form, an index is a stored table of key-value pairs that allows searches to be conducted in O(logn)
time using binary search on sorted data.
Types of Indexes:
-
Clustered Index
- Index and data reside together and are ordered by the key. A Clustered Index is basically a tree-organized table. Instead of storing the records in an unsorted Heap table space, the clustered index is actually B+Tree index having the Leaf Nodes, which are ordered by the clusters key column value, store the actual table records, as illustrated by the following diagram.
-
Nonclustered Index
- A nonclustered index contains the key values and each key value entry has a pointer to the data row that contains the key value. Since the Clustered Index is usually built using the Primary Key column values, if you want to speed up queries that use some other column, then you'll have to add a Secondary Non-Clustered Index. The Secondary Index is going to store the Primary Key value in its Leaf Nodes, as illustrated by the following diagram
How database indexes works under the hood?
We have already created a database index on the employeeid
column in our employees table using the CREATE INDEX
statement. Behind the scenes, Postgres creates a new pseudo-table in the database with two columns: a value for employeeid
and a pointer to the corresponding record in the employees
table. This pseudo-table is organized and stored as a binary tree with ordered values for the employeeid
column. Consequently, the query operates with O(logn) efficiency and typically executes in a second or less.
Let’s delve into two scenarios:
-
SELECT * FROM employees WHERE employeeid = 4;
Here, with an index on theemployeeid
column, the query initiates anIndex Scan
. The process begins by accessing the Index table, retrieving the reference for the Page number, and obtaining the row number for the specific record on that page. Subsequently, it navigates to the corresponding page in the heap and fetches the entire row. This method, known as anIndex Scan
. -
SELECT employeeid FROM employees WHERE employeeid = 4;
In this instance, there is no need to access the heap to retrieve the complete record. Since the required value foremployeeid
is already present in the index table, the operation is streamlined, and it directly performs anIndex Only Scan
. This approach allows the system to retrieve the specificemployeeid
directly from the index table without the additional step of fetching the complete row from the heap. This can lead to improved performance, particularly when the index includes all the columns needed for the query, minimizing the amount of data that needs to be processed.
What are different type of database scan strategies?
-
Index Only Scan
If we examine the given query, we retrieve the ID using a filter on the ID column, which serves as the primary key and has an index on it.
Let's break down the query output:
-
Index Only Scan:
In the case of anIndex Only Scan
, Postgres scans the index table, resulting in faster performance as the Index table is significantly smaller than the actual table. WithIndex Only Scan
, results are directly fetched from the Index table when querying columns for which indexes have been created. -
Heap Fetches: 0:
This indicates that the queried ID value did not necessitate accessing the heap table to retrieve information. The information was obtained inline, and this is referred to as an Inline query. -
Planning Time: 0.510 ms:
This represents the time taken by Postgres to determine whether to use the index or perform a full table scan. -
Execution Time: 2.708 ms:
This is the time taken by Postgres to actually fetch the records from the table.
-
-
Index Scan
If we examine the given query, we are retrieving the
name
using a filter on the ID column, which serves as the primary key and has an index on it. In this case, the process begins with an index scan on the Index table to retrieve information about the Page number and row number on the Heap. Since thename
is not available in the Index table, we must go to the heap to fetch thename
. This type of scan is referred to as anIndex Scan
. In this case, we are filtering the record using the filter on the id with '<' operator and filtering out record which have id less than 1000. So the process begins with an Index scanning on the Index table then fetching the rows from the heap. Same as in case of fetching single id. In this case, we are filtering the record using the filter on the id with '>' operator and filtering out records which have id greater than 1000. So, in this case, as Postgres knows it has to fetch 99% of the data anyway, it prefers to use the Seq Scan on the heap table. Rather than going to the Index table to filter out the records and then again going to the heap to filter those Index-scanned rows. -
Parallel Seq Scan
If we examine the given query, we are retrieving the
id
using a filter on the name column, which doesn't have an index on it. As we don't have an index on the name column, that means we have to actually search for the nameWABOY
one by one and perform a sequential scan on the employees table. Postgres efficiently addresses this by executing multiple worker threads and conducting a parallel sequential scan. -
Bitmap Scan
Let's create a Bitmap Index on the name column to get started. Let's explore how Bitmap Scan works in PostgreSQL.
Heap pages are stored on disk, and loading a page into memory can be expensive. When using an Index Scan, if the query yields a large number of rows, the query's performance may suffer because each row's retrieval involves loading a page into memory.
In contrast, with a Bitmap Scan, instead of loading rows into memory, we set a bit to 1 in an array of bits corresponding to heap page numbers. The operation then works on top of this bitmap.
Here's a simplified breakdown of above image:- In a bitmap index scan, rows are not loaded into memory. PostgreSQL sets the bit to 1 for page number 1 when the name is 'CD' and 0 for other pages.
- When the name is 'BC', page number 2 is set to 1, and others are set to 0.
- Subsequently, a new bitmap is created by performing an OR operation on both bitmaps.
- Finally, PostgreSQL executes a Bitmap Heap Scan where it fully scans each heap page and rechecks the conditions.
This approach minimizes the need to load entire pages into memory for individual rows, improving the efficiency of the query. If the query results in a lot of rows located in only a limited number of heap pages then this strategy will be very efficient.
Now let's filter out the id, name by the name Upon analyzing the provided query, we extract theid
andname
by applying a filter on thename
column, which has an index. Let's clarify the process:-
Bitmap Index Scan
: This step involves scanning the index table for thename
column since an index exists on it. It retrieves the page number and row number to obtain references to the corresponding records in the heap. -
Bitmap Heap Scan
: Since we are filtering based on bothid
andname
, this step is necessary to visit the heap and retrieve the values for both attributes for a specific record. The reference to the record is obtained from the precedingBitmap Index Scan
.
Combining Database Indexes
- Prerequisite: Let's create a table to learn how to combine indexes.
-
Now let's create index on the columns
A
andB
.
-
Select column
c
for a particular value of columna
Here, we can analyze that since we have an index only on columna
, a bitmap index scan is performed on columna
. To retrieve columnc
, it jumps to the heap and performs a bitmap heap scan. -
Select column c but we are going to query on both
a
andb
withAND
operation Here, we can analyze the following:- PostgreSQL executed a bitmap index scan on column 'A'.
- Concurrently, a bitmap index scan was performed on column 'B'.
- Subsequently, PostgreSQL executed a bitmap AND operation to combine the results of the scans on 'A' and 'B'.
- After obtaining the references for the rows to be retrieved, PostgreSQL proceeds to perform a bitmap heap scan.
-
Select column c but we are going to query on both a and b with
OR
operation. Here, we can analyze the following:- PostgreSQL executed a bitmap index scan on column
a
. - Concurrently, a bitmap index scan was performed on column
b
. - Subsequently, PostgreSQL executed a
BitmapOr
operation to combine the results of the scans on columnsa
andb
. - After obtaining the references for the rows to be retrieved, PostgreSQL proceeds to perform a bitmap heap scan.
- PostgreSQL executed a bitmap index scan on column
- Composite Index
-
First, we need to drop the indexes on both columns
a
andb
, and then create a composite index on columnsa
andb
. -
Select column
c
for a particular value of columna
Here, we can analyze the following:- This time, PostgreSQL decided to use the composite index
numbers_ab_idx
on both columnsa
andb
. - Subsequently, it performs a Bitmap Heap Scan on the selected rows based on the composite index.
- This time, PostgreSQL decided to use the composite index
-
Select column
c
for a particular value of columnb
Here, we can analyze the following:-
This time, Postgres did not use the index
numbers_a_b_idx
. Even though we have a composite index on both columnsa
andb
. Why? Because we cannot use this composite index when scanning a filter. The filter condition is on columna
, and the composite index can be used for conditions involving both columnsa
andb
or just columna
. However, it cannot be used for conditions involving only columnb
. Therefore, if we have a composite index on columnsa
andb
, querying on columnb
alone will not utilize the index.
-
This time, Postgres did not use the index
-
Select column c but we are going to query on both
A
andB
withAND
operation Here, the situation remains the same as earlier when we had an index on both columns A and B. -
Select column c but we are going to query on both
A
andB
withOR
operation Here, we can analyze the situation as follows:-
As observed earlier, it's not feasible to use a composite index on column
B
individually. The option is either to use it on columnA
alone or on both columnsA
andB
. Consequently, Postgres opts for a Parallel Sequential Scan in this scenario.
-
As observed earlier, it's not feasible to use a composite index on column