
Postgres -- Online DDL

Index

Processes and Memory

CREATE INDEX has supported parallel processing since version 11.0. This blog post shares the performance improvement gained from this feature. The configurations relevant to this feature are maintenance_work_mem and max_parallel_maintenance_workers.
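
You can check the current values of both settings with:

SHOW maintenance_work_mem;
SHOW max_parallel_maintenance_workers;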

How many processes are created during index creation? The entry point is here and the call stack is

index_build
  -> plan_create_index_workers
    -> compute_parallel_worker

Let’s sort out the details. First, if the parallel_workers storage option is set for the table, then the number of worker processes is just min(table.parallel_workers, max_parallel_maintenance_workers). If not, then we get the number of pages of this table and set #workers = log_3(#pages / min_parallel_table_scan_size). The number of pages is available in pg_class.relpages. Also, min_parallel_table_scan_size defaults to 8MB. Note that this parameter is stored internally as a number of blocks, so we can use the table storage size pg_total_relation_size() to calculate the number of workers. For example, if the table storage is less than 3*8MB, then no workers. If the table storage is in the range [3*8, 3^2*8]MB, then one worker. It is very simple math. But this is not the end. See code.
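
For a rough sanity check of this math on a real table (my_table is just a placeholder name), you can look at the inputs directly:

-- page count the planner sees, table size in MB, and the scan-size threshold
SELECT relpages,
       pg_table_size('my_table') / (1024 * 1024) AS table_mb,
       current_setting('min_parallel_table_scan_size') AS min_scan
FROM pg_class
WHERE relname = 'my_table';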

This is where maintenance_work_mem plays a role. The leader process and all worker processes share a total of maintenance_work_mem when scanning the table and doing the tuple sort. The default value of maintenance_work_mem is 64MB. Suppose we have 3 workers; then each process gets 64/4 = 16MB of memory. See code and code. To make sure each process has at least 32MB, we cannot have more than 2 processes, that is, one leader process and one worker process. Personally, I think we should set maintenance_work_mem to at least 1GB.
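
Based on this, if you want parallel workers to actually be used, it helps to raise the budget for the session before building. A sketch (table, column, and index names are made up):

-- give the build a bigger memory budget so the 32MB-per-participant cap
-- does not reduce the number of workers
SET maintenance_work_mem = '1GB';
SET max_parallel_maintenance_workers = 4;
CREATE INDEX idx_big ON my_table (name);
RESET maintenance_work_mem;
RESET max_parallel_maintenance_workers;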

One final remark about the index_build process. I was curious to know how Postgres forks these worker processes and then joins them. What I found is this line. Postgres builds a wrapper on top of the fork system call. After creating this ParallelContext, it calls LaunchParallelWorkers(pcxt) and WaitForParallelWorkersToAttach(pcxt). All of this makes sense. However, as you can see, it passes the function as a string when creating the parallel context. So how does it look up the real function by this string? The answer is LookupParallelWorkerFunction. It first tries to find the function in a dictionary, InternalParallelWorkers. If not found, it calls dlsym to look up the function in some shared library. I am shocked!

BTW, we can use the view pg_stat_progress_create_index to monitor the index build process.
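
For example, a query like the one below shows which phase the build is in and how far the scan has progressed:

SELECT pid, phase, blocks_done, blocks_total, tuples_done, tuples_total
FROM pg_stat_progress_create_index;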

Create Index Locks

There are two ways to create an index: CREATE INDEX <index name> on ... and CREATE INDEX CONCURRENTLY <index name> on .... The former holds a ShareLock throughout the index build process, which blocks other insert/update/delete operations. So if building an index takes five hours, then the table is read-only for five hours. This is only useful for bootstrapping or migrating a new database and is disastrous for an online production database. That is usually why we take the latter approach.
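
For reference, the two forms look like this (table, column, and index names are placeholders):

CREATE INDEX idx_name ON my_table (name);               -- holds ShareLock for the whole build
CREATE INDEX CONCURRENTLY idx_name ON my_table (name);  -- holds ShareUpdateExclusiveLock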

CREATE INDEX CONCURRENTLY uses three stages (transactions), three waits, and two full table scans. So it is definitely slower than the approach without the CONCURRENTLY keyword, but most of the time it only holds a ShareUpdateExclusiveLock, so table updates are not blocked during the process.

Before we go into the details, we must cover an important concept: Heap Only Tuples (HOT). This README file is very informative. Also, I like some Chinese blog posts with vivid diagrams, for example this one. Not sure if the link is still valid ^_^. Basically, in the world of MVCC, HOT means that if a table update does not change any indexed column, then the new tuple version can be written into the same heap page without creating any new index entries. There is a redirect mechanism in the page for you to retrieve the updated version. This optimization saves space and time. But it requires that the indexes of this table do not change. That is why it makes index creation complicated and why index creation is processed in three stages. Let’s take a closer look at these three stages with some experiments.
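
You can check how often HOT actually kicks in for a table from the statistics views (my_table is the placeholder name used in the experiments below):

-- n_tup_hot_upd counts updates that stayed in the same heap page without new index entries
SELECT n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
WHERE relname = 'my_table';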

The main code is here.

Stage 1: Create an Index

The code block for this part is this code range. I extracted some important logic below.

...
lockmode = concurrent ? ShareUpdateExclusiveLock : ShareLock;
rel = table_open(tableId, lockmode);
...
	if (skip_build || concurrent || partitioned)
		flags |= INDEX_CREATE_SKIP_BUILD;
...

	indexRelationId =
		index_create(rel, indexRelationName, indexRelationId, parentIndexId,
					 parentConstraintId,
					 stmt->oldNumber, indexInfo, indexColNames,
					 accessMethodId, tablespaceId,
					 collationIds, opclassIds, opclassOptions,
					 coloptions, NULL, reloptions,
					 flags, constr_flags,
					 allowSystemTableMods, !check_rights,
					 &createdConstraintId);
...

	if (!concurrent)
	{
		/* Close the heap and we're done, in the non-concurrent case */
		table_close(rel, NoLock);

    ...
		return address;
	}

	/* save lockrelid and locktag for below, then close rel */
	heaprelid = rel->rd_lockInfo.lockRelId;
	SET_LOCKTAG_RELATION(heaplocktag, heaprelid.dbId, heaprelid.relId);
	table_close(rel, NoLock);

...

	PopActiveSnapshot();
	CommitTransactionCommand();

This stage does a lot of validation and preprocessing. The whole phase is wrapped inside a single transaction. First, it determines the lock mode: a non-CONCURRENTLY build uses ShareLock while a CONCURRENTLY build uses ShareUpdateExclusiveLock, so a non-CONCURRENTLY build blocks table updates. Note that we are referring to the lock on the table, not the lock on the index. Second, it sets the build flags. A CONCURRENTLY build sets INDEX_CREATE_SKIP_BUILD, which means the index_create call below only creates the index but does not build the index content. The table is closed after index_build (table_close with NoLock keeps the lock until the transaction ends), and this is where the non-CONCURRENTLY build is complete. For a CONCURRENTLY build, it commits this transaction, which marks the end of phase one.

What is the end state of phase one?

  • Non-CONCURRENTLY build is finished.
  • CONCURRENTLY build has an empty index marked as Invalid.

Let’s do a quick experiment. In terminal one, run the query below

postgres=# select * from my_table;
 id |  name
----+---------
  1 | Alice
  2 | Bob
  3 | Charlie
  4 | David
(4 rows)

postgres=# begin;
BEGIN
postgres=*# update my_table set name = 'A' where id = 1;
UPDATE 1
postgres=*#

and in terminal two, run the query below

postgres=# create index concurrently idx_test on my_table (name);

Basically, terminal one has an ongoing transaction and terminal two tries to create a new index. You will see that terminal two gets stuck. If you check the table details, you will see that it has a new invalid index.

postgres=# \d my_table;
                                   Table "public.my_table"
 Column |         Type          | Collation | Nullable |               Default
--------+-----------------------+-----------+----------+--------------------------------------
 id     | integer               |           | not null | nextval('my_table_id_seq'::regclass)
 name   | character varying(50) |           |          |
Indexes:
    "my_table_pkey" PRIMARY KEY, btree (id)
    "idx_test" btree (name) INVALID

Stage 2: First Scan

The code block for this part is this code range. It is not that long.

	StartTransactionCommand();
...

	WaitForLockers(heaplocktag, ShareLock, true);
...

	PushActiveSnapshot(GetTransactionSnapshot());

	index_concurrently_build(tableId, indexRelationId);

	PopActiveSnapshot();

	CommitTransactionCommand();

Phase 2 is also wrapped inside a single transaction. It first waits for a ShareLock on this table. We need this step to make sure the HOT chains are valid. Refer to the HOT README file above for the reasoning.

Note that WaitForLockers only waits for locks but does not acquire them. It obtains the list of transactions holding conflicting locks at the time the function starts and waits until those transactions disappear (either commit or roll back). It does not add new transactions to the waiting list while waiting. Because it does not acquire the lock, there is no problem with lock queue priority. I was once skeptical: surely building an index needs some metadata lock, so I was worried whether some other process would have higher priority to acquire that metadata lock. In MySQL, the thread that is creating an index acquires a special lock with higher priority than a normal exclusive lock to solve this problem. In Postgres, we do not have this problem at all because we DO NOT need to lock at all. We just wait for the old transactions to die!

Then we take a snapshot, run the first scan via index_concurrently_build, and commit this transaction. The new index is now marked as writable but not yet readable.
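
The "writable but not readable" state is visible in pg_index through the indisready and indisvalid flags, e.g.:

-- indisready: writes maintain the index; indisvalid: queries may use it
SELECT indexrelid::regclass AS index_name, indisready, indisvalid
FROM pg_index
WHERE indrelid = 'my_table'::regclass;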

Continuing with the above experiment, let’s check pg_locks. There is a process waiting for a ShareLock.

admincoin=# select *, relation::regclass from pg_locks;
   locktype    | virtualxid | transactionid  | virtualtransaction |  pid  |           mode           | granted | fastpath |          waitstart           |                       relation
---------------+------------+----------------+--------------------+-------+--------------------------+---------+----------+------------------------------+------------------------------------------------------
 virtualxid    | 4/322      |                | 5/62               | 11036 | ShareLock                | f       | f        | 2024-09-17 00:16:20.36455+00 |

The lock is not granted because there is an old ongoing transaction that is not committed yet. If we check the index build progress, we will see

postgres=# select * from pg_stat_progress_create_index \gx
-[ RECORD 1 ]------+---------------------------------
command            | CREATE INDEX CONCURRENTLY
phase              | waiting for writers before build

The code for this phase is defined here. The enum value 1 corresponds to the code here. So far everything makes sense.

Stage 3: Second Scan

The code block for this part is this code range. It is not that long either.

	StartTransactionCommand();

  pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
                              PROGRESS_CREATEIDX_PHASE_WAIT_2);

  WaitForLockers(heaplocktag, ShareLock, true);

	snapshot = RegisterSnapshot(GetTransactionSnapshot());
	PushActiveSnapshot(snapshot);

	/*
	 * Scan the index and the heap, insert any missing index entries.
	 */
	validate_index(tableId, indexRelationId, snapshot);

	limitXmin = snapshot->xmin;

	PopActiveSnapshot();
	UnregisterSnapshot(snapshot);

	CommitTransactionCommand();

This is the second WaitForLockers. It waits for a ShareLock again. As the comment says, this phase does a second scan to insert index entries that were missed during the last stage.

Let’s continue the experiment. Open terminal three, and run

postgres=# begin;
BEGIN
postgres=*# update my_table set name = 'B' where id = 1;
UPDATE 1

Go to terminal one, and run rollback;, so the first update is aborted. The CREATE INDEX CONCURRENTLY command is still stuck. If you check pg_locks, you will find that it is waiting for a ShareLock. This time it is the new transaction we just started that is blocking it. Also, the index build progress shows

postgres=# select * from pg_stat_progress_create_index \gx
-[ RECORD 1 ]------+--------------------------------------
command            | CREATE INDEX CONCURRENTLY
phase              | waiting for writers before validation

It is slightly different, changing from writers before build to writers before validation. This message corresponds to the enum value 3, which is exactly the enum PROGRESS_CREATEIDX_PHASE_WAIT_2 above. Yes, WAIT_2 has the value 3; I know the naming looks wrong!

Stage 4: Wait for Old Snapshots

The code block for this part is this code range.

	StartTransactionCommand();

	/*
	 * The index is now valid in the sense that it contains all currently
	 * interesting tuples.  But since it might not contain tuples deleted just
	 * before the reference snap was taken, we have to wait out any
	 * transactions that might have older snapshots.
	 */
	pgstat_progress_update_param(PROGRESS_CREATEIDX_PHASE,
								 PROGRESS_CREATEIDX_PHASE_WAIT_3);
	WaitForOlderSnapshots(limitXmin, true);

	/*
	 * Index can now be marked valid -- update its pg_index entry
	 */
	index_set_state_flags(indexRelationId, INDEX_CREATE_SET_VALID);

	CacheInvalidateRelcacheByRelid(heaprelid.relId);

	UnlockRelationIdForSession(&heaprelid, ShareUpdateExclusiveLock);

	pgstat_progress_end_command();

This final stage does another wait, this time not for locks but for old snapshots. It then finally marks the index as valid.

For the experiment above, a new transaction can no longer block this stage. TODO: find a way to create a snapshot to block the phase.
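
One idea for that TODO (a sketch I have not verified): a long-running REPEATABLE READ transaction that takes its snapshot before the second scan's reference snapshot should keep an older xmin around and make WaitForOlderSnapshots wait:

-- run in another terminal before the second scan takes its reference snapshot
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT count(*) FROM my_table;  -- forces the snapshot to actually be taken
-- keep the transaction open; stage 4 should now wait for it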

FAQ

After learning so much about how CREATE INDEX works, what FAQs might people have?

Q1. How to kill a query?

-- gracefully cancel
SELECT pg_cancel_backend(<pid of the process>)

-- hard terminate
SELECT pg_terminate_backend(<pid of the process>)
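
To find the pid in the first place, pg_stat_activity is the usual place to look. For example (the ILIKE filter is just an illustration):

SELECT pid, state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE query ILIKE 'create index%';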

Q2. For a blocked process, how to know which process is blocking it?

pg_locks provides an overall view of all locks in the database. However, it is not straightforward to figure out the blocking relationship by ourselves. Luckily, there is a function pg_blocking_pids(<pid>) to obtain this information. For example,

=# select pg_blocking_pids(2267);
 pg_blocking_pids
------------------
 {3240}
(1 row)

pg_blocking_pids returns integer[], and it is possible that this list contains duplicates. Its documentation is quoted below.

When using parallel queries the result always lists client-visible process IDs (that is, pg_backend_pid results) even if the actual lock is held or awaited by a child worker process. As a result of that, there may be duplicated PIDs in the result.

This happens, for example, when one client issues a CREATE INDEX request and another client tries to update the same table; then the result of pg_blocking_pids(2nd_client_pid) contains the first client’s pid multiple times. This is because CREATE INDEX utilizes multiple processes. See the explanation of max_parallel_maintenance_workers above.

By the way, here is a very useful query about pg_locks.

SELECT pl.*, psa.query FROM pg_locks pl LEFT JOIN pg_stat_activity psa ON pl.pid = psa.pid;

Q3. With the non-CONCURRENTLY option, are all writes forbidden during index creation?

Yes, I verified this with an experiment. If we use CREATE INDEX CONCURRENTLY, there is no such problem.

Q4. What locks are needed for CREATE INDEX CONCURRENTLY?

It acquires a ShareUpdateExclusiveLock in phase 1, waits for ShareLocks in phases 2 and 3, and waits for old snapshots in phase 4.

Q5. What is lock queue priority?

Suppose a write query is running and another write query is blocked. Then a CREATE INDEX request comes in and gets blocked too. Once the lock is released, will the write request or the CREATE INDEX request get the lock?

There is no need to worry about it at all because it does not acquire locks. It only waits for locks.

Wait, I need to be strict: it does acquire a ShareUpdateExclusiveLock in phase 1. But according to https://www.postgresql.org/docs/7.2/locking-tables.html, SELECT/INSERT/UPDATE/DELETE do not conflict with this lock.
