Postgres -- Getting Started
Postgres is a wonderful piece of software!
Set Up Postgres
Postgres documents the process of creating and starting a database in detail. Basically, it provides the initdb and postgres commands to initialize a data directory and start a postmaster process. It also mentions the wrapper program pg_ctl, which can be used instead of running initdb or postgres directly. The source code is here. I am a little surprised by this wrapper because it uses system(cmd), i.e., /bin/sh, to run the wrapped command. Why not just call the relevant entry functions of the wrapped command directly?
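For example, the two styles look like this (a minimal sketch; the data directory path is arbitrary):
# run the low-level commands directly
initdb -D /tmp/demo_data
postgres -D /tmp/demo_data
# or drive them through the pg_ctl wrapper
pg_ctl initdb -D /tmp/demo_data
pg_ctl start -D /tmp/demo_data -l /tmp/demo.log
pg_ctl stop -D /tmp/demo_data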
MacOS
Just use Homebrew to install Postgres.
Linux
# Install the default version
sudo apt-get install postgresql
# Install a newer version
sudo apt-get install -y postgresql-common
sudo /usr/share/postgresql-common/pgdg/apt.postgresql.org.sh -y
sudo apt-get install -y postgresql-16
Things become a little more convoluted on Debian and Red Hat. When you run apt install postgresql, it also installs a dependency, postgresql-common. Basically, it is a wrapper on top of pg_ctl that lets us manage multiple Postgres clusters of the same or different versions (e.g., via pg_lsclusters, pg_createcluster, and pg_ctlcluster). How does this shim layer work? Let's look at a few Postgres commands.
> ll /usr/bin/createuser
lrwxrwxrwx 1 root root 37 Oct 2 2023 /usr/bin/createuser -> ../share/postgresql-common/pg_wrapper
> ll /usr/bin/psql
lrwxrwxrwx 1 root root 37 Oct 2 2023 /usr/bin/psql -> ../share/postgresql-common/pg_wrapper
So pg_wrapper is the shim layer. It is a Perl script that routes the original command to the binary under /usr/lib/postgresql/<version>/bin.
Build From Source
Refer to https://www.postgresql.org/docs/devel/installation.html to build from source.
mkdir build_dir
cd build_dir
# We need to provide an installation prefix because the generated binaries
# hard-code paths under the installation folder (default: /usr/local/pgsql).
../configure --without-icu --prefix $HOME/code/postgres/build_dir/install_dir
# or for a debug build
../configure --without-icu --enable-cassert --enable-debug CFLAGS="-ggdb -Og -O0 -g3 -fno-omit-frame-pointer" --prefix $HOME/code/postgres/build_dir/install_dir
# make
bear -- make -j6
cd .. && ln -s build_dir/compile_commands.json compile_commands.json
cd build_dir
# install
make install
# create a cluster
mkdir data
./install_dir/bin/initdb -D $HOME/code/postgres/build_dir/data
# start server using a different port
./install_dir/bin/postgres -D $HOME/code/postgres/build_dir/data -p 5433
# connect to server and create a test database
./install_dir/bin/psql postgres -p 5433
postgres=# create database test2;
After the first successful compilation, you probably only need incremental compilation afterward.
cd build_dir
make && make install
Sample Data
Finally, we need some sample data. sakila is a popular data set.
git clone git@github.com:jOOQ/sakila.git --depth=1
psql -d test -f ~/code/sakila/postgres-sakila-db/postgres-sakila-schema.sql
psql -d test -f ~/code/sakila/postgres-sakila-db/postgres-sakila-insert-data.sql
psql test
Debugger
https://github.com/tvondra/gdbpg
$ lldb -- ./install_dir/bin/postgres -D $HOME/code/postgres/build_dir/data -p 5433
(lldb) b exec_simple_query
(lldb) process launch -n
I haven't made it work under a debugger because of some signal handler issue. Life is short; I do not want to spend more time on it. Instead, there is a simple trick to inspect internal state: Postgres has a pretty-print function that can print out any node.
#include "nodes/print.h"
pprint(transform);
Configurations
The main configuration file is postgresql.conf. I thought Postgres read it as a plain text file, splitting each line into a key-value pair and treating a line as a comment if it starts with #. No way! Postgres always shows off in unexpected ways. The configuration grammar is defined using Lex & Yacc!
\n ConfigFileLineno++; return GUC_EOL;
[ \t\r]+ /* eat whitespace */
#.* /* eat comment (.* matches anything until newline) */
{ID} return GUC_ID;
{QUALIFIED_ID} return GUC_QUALIFIED_ID;
{STRING} return GUC_STRING;
{UNQUOTED_STRING} return GUC_UNQUOTED_STRING;
{INTEGER} return GUC_INTEGER;
{REAL} return GUC_REAL;
= return GUC_EQUALS;
. return GUC_ERROR;
Integers, real numbers, and strings are recognized. "=" is parsed as an equals token. Whitespace, including tabs and \r, is ignored. Anything after # is treated as a comment and ignored.
What if the same key shows up multiple times? The documentation clearly says
If the file contains multiple entries for the same parameter, all but the last one are ignored.
The code is here. It is funny that all of them are kept, but all except the last one are marked with ignore = true.
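Putting the token rules together, a valid postgresql.conf looks like this (a made-up sample; the values are arbitrary):
# comments and blank lines are skipped
max_connections = 100                  # INTEGER
autovacuum_vacuum_scale_factor = 0.2   # REAL
shared_buffers = 128MB                 # UNQUOTED_STRING
log_line_prefix = '%m [%p] '           # STRING, single-quoted
pgaudit.log = 'ddl'                    # QUALIFIED_ID parameter name
max_connections = 200                  # duplicate key: only this last entry takes effect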
Various Binaries
Import From Binary File
Postgres only supports importing data from text, CSV, and binary files. The binary protocol is covered briefly here. Basically,
- Each row starts with a 16-bit integer denoting the number of fields in the row.
- Then comes a sequence of fields. Each field starts with a 32-bit integer denoting its size in bytes, followed by the binary data of the field.
See the code here. The (de)serialization protocol for each data type is defined by the _send and _recv functions inside the adt folder, i.e., Abstract Data Types. See the timestamp example. Basically, a timestamp is just an int64 that stores microseconds since the Postgres epoch, i.e., 2000-01-01. Note, it is NOT the Unix epoch. Below is part of a binary Postgres data file.
00000560 3a 20 31 7d 00 00 00 04 00 00 00 00 00 00 00 08 |: 1}............|
00000570 00 02 4d 04 9e 82 c7 00 00 00 00 08 00 02 ae 50 |..M............P|
00000580 b9 9b 4a 00 ff ff ff ff ff ff ff ff ff ff ff ff |..J.............|
The 00 00 00 08 in the first line says the next field is 8 bytes long. The value that follows, 00 02 4d 04 9e 82 c7 00, is 647632188000000 (µs). Read against the Unix epoch that would be 5:49:48 PM on a day in 1990; read against the Postgres epoch it is the same clock time in 2020. The 00 00 00 08 in the second line designates another timestamp. Note the many ff bytes that follow: a field length of -1 (ff ff ff ff) denotes a NULL value.
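To make the arithmetic concrete, here is a minimal C sketch (not Postgres code) that decodes that 8-byte field; such a file can be produced with COPY my_table TO '/tmp/t.bin' WITH (FORMAT binary). Field bytes are big-endian on the wire, and the gap between the Unix and Postgres epochs is 946684800 seconds.
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Seconds between 1970-01-01 (Unix epoch) and 2000-01-01 (Postgres epoch). */
#define UNIX_TO_PG_EPOCH_SECS 946684800LL

int main(void)
{
    /* The 8 field bytes from the hexdump above, big-endian on the wire. */
    const unsigned char field[8] = {0x00, 0x02, 0x4d, 0x04, 0x9e, 0x82, 0xc7, 0x00};
    int64_t us = 0;
    for (int i = 0; i < 8; i++)
        us = (us << 8) | field[i];

    /* Microseconds since 2000-01-01 -> seconds since 1970-01-01. */
    time_t unix_secs = (time_t)(us / 1000000 + UNIX_TO_PG_EPOCH_SECS);
    printf("%lld us -> %s", (long long)us, ctime(&unix_secs));
    /* 647632188000000 us -> 2020-07-09 17:49:48 UTC (printed in local time). */
    return 0;
}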
Page Layout
Postgres has good documentation on the Database Page Layout. The diagram in the source code comments is super helpful too. Let's look at an example.
First, create the table and insert sample data.
CREATE TABLE my_table(
id SERIAL PRIMARY KEY,
name VARCHAR(255) NOT NULL,
age INT
);
CREATE INDEX idx_name ON my_table (name);
INSERT INTO my_table (name, age) VALUES
('Alice', 30),
('bob', 25),
('Charlie', 35),
('Diana', 28),
('Edward', 40);
Second, run CHECKPOINT and restart Postgres. Why? The DDL/DML changes above are written to the WAL, but the WAL has not been merged into the data files yet. That merging happens periodically, with a frequency controlled by configuration parameters. To trigger it manually, we can create a checkpoint and then restart Postgres so that all changes before the checkpoint are merged into the data files.
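One way to locate the data file backing a table is the pg_relation_filepath function; it returns a path like base/<database-oid>/<relfilenode>, relative to the data directory, which is how paths like base/1638340/2481266 below are found.
=# SELECT pg_relation_filepath('my_table');
The content of the corresponding data file is shown below.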
> sudo hexdump -C /var/lib/postgresql/13/main/base/1638340/2481266
00000000 76 01 00 00 b0 f3 03 7f 00 00 00 00 2c 00 38 1f |v...........,.8.|
00000010 00 20 04 20 00 00 00 00 d8 9f 50 00 b0 9f 48 00 |. . ......P...H.|
00000020 88 9f 50 00 60 9f 50 00 38 9f 50 00 00 00 00 00 |..P.`.P.8.P.....|
00000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00001f30 00 00 00 00 00 00 00 00 4d e3 07 00 00 00 00 00 |........M.......|
00001f40 00 00 00 00 00 00 00 00 05 00 03 00 02 09 18 00 |................|
00001f50 05 00 00 00 0f 45 64 77 61 72 64 00 28 00 00 00 |.....Edward.(...|
00001f60 4d e3 07 00 00 00 00 00 00 00 00 00 00 00 00 00 |M...............|
00001f70 04 00 03 00 02 09 18 00 04 00 00 00 0d 44 69 61 |.............Dia|
00001f80 6e 61 00 00 1c 00 00 00 4d e3 07 00 00 00 00 00 |na......M.......|
00001f90 00 00 00 00 00 00 00 00 03 00 03 00 02 09 18 00 |................|
00001fa0 03 00 00 00 11 43 68 61 72 6c 69 65 23 00 00 00 |.....Charlie#...|
00001fb0 4d e3 07 00 00 00 00 00 00 00 00 00 00 00 00 00 |M...............|
00001fc0 02 00 03 00 02 09 18 00 02 00 00 00 09 62 6f 62 |.............bob|
00001fd0 19 00 00 00 00 00 00 00 4d e3 07 00 00 00 00 00 |........M.......|
00001fe0 00 00 00 00 00 00 00 00 01 00 03 00 02 09 18 00 |................|
00001ff0 01 00 00 00 0d 41 6c 69 63 65 00 00 1e 00 00 00 |.....Alice......|
00002000
The first 24 bytes are the PageHeaderData. You can see that
pd_lower = 0x002c
pd_upper = 0x1f38
pd_special = 0x2000
page size = pd_pagesize_version & 0xFF00 = 0x2004 & 0xFF00 = 0x2000 = 8192
version   = pd_pagesize_version & 0x00FF = 0x2004 & 0x00FF = 4
Note that my MacBook M1, namely AArch64, is a little-endian CPU, so we interpret the bytes 2c 00 as 0x002c. As you can see, the default page size in Postgres is 8KB. Also, this is the code that splits the page size and version.
The bytes between the end of PageHeaderData and pd_lower are ItemIdData entries: pointers to the data tuples. The size of this part is 0x002c - 24 = 20 bytes. The content is extracted as follows.
> sudo hexdump -C -s 24 -n 20 /var/lib/postgresql/13/main/base/1638340/2481266
00000018 d8 9f 50 00 b0 9f 48 00 88 9f 50 00 60 9f 50 00 |..P...H...P.`.P.|
00000028 38 9f 50 00 |8.P.|
Each ItemIdData takes four bytes, so we have five items in total, corresponding to the five tuples inserted. Let's take one example, d8 9f 50 00. The definition of ItemIdData is copied below.
typedef struct ItemIdData
{
unsigned lp_off:15, /* offset to tuple (from start of page) */
lp_flags:2, /* state of line pointer, see below */
lp_len:15; /* byte length of tuple */
} ItemIdData;
How do we pull the bit fields out of d8 9f 50 00? Read the four bytes as a little-endian 32-bit integer: 0x00509fd8. Then lp_off is the low 15 bits, 0x00509fd8 & 0x7fff = 0x1fd8; lp_flags is the next 2 bits, (0x00509fd8 >> 15) & 0x3 = 1, i.e., LP_NORMAL; and lp_len is the top 15 bits, 0x00509fd8 >> 17 = 0x28 = 40. So this entry points to a 40-byte tuple at offset 0x1fd8 (Alice, the first tuple inserted). Decoding the last entry 38 9f 50 00 the same way gives offset 0x1f38 and length 40, which is the tuple (Edward) dumped below.
> sudo hexdump -C -s 0x1f38 -n 40 /var/lib/postgresql/13/main/base/1638340/2481266
00001f38 4d e3 07 00 00 00 00 00 00 00 00 00 00 00 00 00 |M...............|
00001f48 05 00 03 00 02 09 18 00 05 00 00 00 0f 45 64 77 |.............Edw|
00001f58 61 72 64 00 28 00 00 00 |ard.(...|
You can decode the rest of the entries and tuples the same way. Forgive me! I still need to read more about the tuple header details.
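Here is the same bit fiddling as a small C sketch (not Postgres source; it assumes a little-endian host, matching the machine that wrote this page):
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* The first ItemIdData entry from the hexdump above. */
    const unsigned char raw[4] = {0xd8, 0x9f, 0x50, 0x00};
    uint32_t v;
    memcpy(&v, raw, 4);                   /* little-endian host: v = 0x00509fd8 */

    uint32_t lp_off   = v & 0x7fff;       /* low 15 bits */
    uint32_t lp_flags = (v >> 15) & 0x3;  /* next 2 bits, 1 = LP_NORMAL */
    uint32_t lp_len   = v >> 17;          /* top 15 bits */

    printf("lp_off=0x%x lp_flags=%u lp_len=%u\n", lp_off, lp_flags, lp_len);
    /* prints: lp_off=0x1fd8 lp_flags=1 lp_len=40 */
    return 0;
}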
How Are Data Types Serialized To Pages?
=# select oid, typname, typlen, typbyval, typalign, typstorage from pg_type where typname in ('varchar', 'int2', 'int4', 'char', 'bool', 'bytea', 'float8', 'text', 'uuid', 'timestamp', 'timestamptz', 'date');
oid | typname | typlen | typbyval | typalign | typstorage
------+-------------+--------+----------+----------+------------
16 | bool | 1 | t | c | p
17 | bytea | -1 | f | i | x
18 | char | 1 | t | c | p
21 | int2 | 2 | t | s | p
23 | int4 | 4 | t | i | p
25 | text | -1 | f | i | x
701 | float8 | 8 | t | d | p
1043 | varchar | -1 | f | i | x
1082 | date | 4 | t | i | p
1114 | timestamp | 8 | t | d | p
1184 | timestamptz | 8 | t | d | p
2950 | uuid | 16 | f | c | p
https://github.com/postgres/postgres/blob/a3e6c6f929912f928fa405909d17bcbf0c1b03ee/src/include/varatt.h#L142
https://github.com/postgres/postgres/blob/a3e6c6f929912f928fa405909d17bcbf0c1b03ee/src/backend/access/common/indextuple.c#L65
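To read this catalog: typlen is the fixed size in bytes, and -1 marks a variable-length (varlena) type whose on-page form starts with a length header (see the varatt.h link above). typbyval says whether values are passed by value or by reference. typalign is the required alignment: c = 1 byte, s = 2, i = 4, d = 8. typstorage is the TOAST strategy: p means plain (never toasted) and x means extended (may be compressed and moved out of line).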
Toast Storage
TODO: learn it
pgloader
pgloader is a great tool to migrate data from another database to Postgres. Using MySQL -> Postgres as an example, the main logic is
process-source-and-target (main.lisp#main)
-> load-data (api.lisp#process-source-and-target)
-> lisp-code-for-loading-from-mysql (api.lisp#load-data)
-> copy-database (command-mysql.lisp#lisp-code-for-loading-from-mysql)
-> copy-from (migrate-database.lisp#copy-database)
-> queue-raw-data (copy-data.lisp#copy-from)
-> map-rows (copy-data.lisp#queue-raw-data)
-> (mysql.lisp#map-rows)
There are lots of things going on along this chain. First, it uses a producer-consumer model built on channels provided by the lparallel package, which is very similar to Go; it is true that Lisp is a multi-paradigm programming language. Second, the map-rows implementation inside mysql.lisp actually uses batching, with the batch size controlled by the configuration parameter rows-per-range, so pgloader is not naive enough to cause an obvious OOM. Third, pgloader has a configuration parameter controlling whether to copy only the data or to also create the tables beforehand. See the doc. Last, data is copied in binary form, as described in the binary protocol above. The code is here.
Locale
Let’s talk about locale in general before jumping into locales in Postgres.
We only talk about POSIX-compliant systems. First, what is the locale of my MacOS?
$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=
A locale name has the format
language[_territory][.codeset][@modifier]
In my case, the language is en and the territory (country) is the United States. The codeset is UTF-8. There is no modifier. LC_COLLATE, LC_CTYPE, etc. are locale categories. They specify various cultural conventions for characters, strings, time formats, money formats, and so on. Of them, LC_COLLATE and LC_CTYPE are the most relevant to Postgres.
- LC_COLLATE: specifies the collation order. It controls how strings are compared and sorted.
- LC_CTYPE: the C stands for character. It specifies character classification and case conversion. For example, is a character a digit? What is the uppercase counterpart of a character?
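A small C sketch of what LC_COLLATE changes (assuming the en_US.UTF-8 locale is installed): strcmp compares raw bytes, while strcoll honors the current collation.
#include <locale.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Raw bytes: 'C' (0x43) < 'b' (0x62), so strcmp puts "Charlie" first. */
    printf("strcmp : %s first\n", strcmp("bob", "Charlie") < 0 ? "bob" : "Charlie");

    /* en_US collation compares letters case-insensitively at the primary
     * level, so "bob" < "Charlie" under strcoll. */
    if (setlocale(LC_COLLATE, "en_US.UTF-8") == NULL)
        return 1;
    printf("strcoll: %s first\n", strcoll("bob", "Charlie") < 0 ? "bob" : "Charlie");
    return 0;
}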
In Postgres, LC_COLLATE and LC_CTYPE are fixed once a database is created. To check the current values:
admincoin=# show LC_COLLATE;
-[ RECORD 1 ]-------
lc_collate | C.UTF-8
admincoin=# show LC_CTYPE;
-[ RECORD 1 ]-----
lc_ctype | C.UTF-8
Note, somehow show LC_COLLATE does not work in Aurora Postgres. We can query pg_database directly instead:
admincoin=# select datcollate, datctype from pg_database where datname = 'test';
-[ RECORD 1 ]-------
datcollate | C.UTF-8
datctype | C.UTF-8
How Does Collation Affect Index Layout?
As you can imagine, collation affects how tuples are compared, so it affects the page layout of indices. We can do a quick experiment. I am too lazy to create a new collation, so instead I use the CITEXT type to illustrate the idea. CITEXT is a case-insensitive text data type; it is equivalent to a case-insensitive collation.
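A quick sanity check of the case-insensitive behavior (after CREATE EXTENSION citext):
=# SELECT 'bob'::citext = 'Bob'::citext;  -- true, while 'bob'::text = 'Bob'::text is false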
First, set up the schemas and data.
CREATE TABLE my_table(
id SERIAL PRIMARY KEY,
name VARCHAR(255) NOT NULL,
age INT
);
CREATE INDEX idx_name ON my_table (name);
CREATE EXTENSION IF NOT EXISTS citext;
CREATE TABLE my_table_2(
id SERIAL PRIMARY KEY,
name CITEXT NOT NULL,
age INT
);
CREATE INDEX idx_name_2 ON my_table_2 (name);
INSERT INTO my_table (name, age) VALUES
('Alice', 30),
('bob', 25),
('Charlie', 35),
('Diana', 28),
('Edward', 40);
INSERT INTO my_table_2 (name, age) VALUES
('Alice', 30),
('bob', 25),
('Charlie', 35),
('Diana', 28),
('Edward', 40);
You can see that I deliberately made bob all lowercase. Second, do a CHECKPOINT and restart Postgres. Let's check the data file content.
The left side is the data file of idx_name and the right side is the data file of idx_name_2. There are two pages in each case. The first page is special: it is called the meta page, and it contains information such as the page number of the btree root and the height of the btree. In our case, the root page number is 1 and the height is 1. That makes sense, as there are only 5 tuples; the root page itself can hold all of them.
Let's focus on the second page. The row at 00003ff0 corresponds to BTPageOpaqueData. The flag field is 03, meaning this page is both the root and a leaf. The five tuples are inserted into the page backward, starting from the higher offsets; that is why you see the names going from Edward to Alice. Now let's check how the second page differs between the two indices. Note, inside an index page, the ItemIdData array is sorted according to the tuples they point to. This order is the prerequisite for doing binary search inside a btree node, so whenever a new tuple is inserted into the page, the database must maintain this invariant. You can see more detail about how Postgres does binary search on this ItemIdData array.
On the left side, the ItemIdData array has pointers [0x3fe0, 0x3fc0, 0x3fb0, 0x3fa0, 0x3fd0], which correspond to [Alice, Charlie, Diana, Edward, bob]. On the right side, the ItemIdData array has pointers [0x3fe0, 0x3fd0, 0x3fc0, 0x3fb0, 0x3fa0], which correspond to [Alice, bob, Charlie, Diana, Edward]. So the right side sorts tuples case-insensitively.
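By the way, instead of hexdumping raw files, the contrib extension pageinspect (if installed) can decode btree pages directly. A sketch:
CREATE EXTENSION IF NOT EXISTS pageinspect;
-- meta page: block number of the root and the height of the tree
SELECT root, level FROM bt_metap('idx_name');
-- items of block 1 (our root/leaf page), in ItemIdData order
SELECT itemoffset, ctid, itemlen, data FROM bt_page_items('idx_name', 1);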
How Does Collation Affect Query Performance?
Using the example above, what is the performance of a case-insensitive query?
For my_table_2,
test2=# explain select * from my_table_2 where name = 'alice';
QUERY PLAN
-------------------------------------------------------------------------
Bitmap Heap Scan on my_table_2 (cost=4.20..13.67 rows=6 width=40)
Recheck Cond: (name = 'alice'::citext)
-> Bitmap Index Scan on idx_name_2 (cost=0.00..4.20 rows=6 width=0)
Index Cond: (name = 'alice'::citext)
For my_table, we need to manually add the lower function.
test2=# explain select * from my_table where lower(name) = 'bob';
QUERY PLAN
-----------------------------------------------------------
Seq Scan on my_table (cost=0.00..12.10 rows=1 width=524)
Filter: (lower((name)::text) = 'bob'::text)
You can see that with lower, Postgres does not use the name index. How do we make this query efficient? We can add an expression index:
CREATE INDEX idx_name_lower on my_table (lower(name));
However, this change makes no difference here, probably because the table only has 5 tuples. Let's see a real-life example with object_column_index.
admincoin=# \d object_column_index
Table "public.object_column_index"
Column | Type | Collation | Nullable | Default
-------------------+-----------------------------+-----------+----------+----------------------------------
boolean_value | boolean | | |
column_name | character varying(100) | | |
datetime_value | timestamp(0) with time zone | | |
float_value | double precision | | |
id | integer | | not null | generated by default as identity
int_value | integer | | |
long_value | bigint | | |
object_guid | character varying(36) | | not null |
object_type | character varying(48) | | not null |
organization_guid | character varying(36) | | |
string_value | character varying(200) | | |
Indexes:
"object_column_index_pkey" PRIMARY KEY, btree (id)
"idx_on_boolean_column" btree (object_type, column_name, organization_guid, boolean_value, object_guid)
"idx_on_column_of_object" btree (object_guid, column_name)
"idx_on_datetime_column" btree (object_type, column_name, organization_guid, datetime_value, object_guid)
"idx_on_float_column" btree (object_type, column_name, organization_guid, float_value, object_guid)
"idx_on_int_column" btree (object_type, column_name, organization_guid, int_value, object_guid)
"idx_on_string_column" btree (object_type, column_name, organization_guid, string_value, object_guid)
"uq_object_guid_and_column_name" UNIQUE, btree (object_guid, column_name)
A case-sensitive query
admincoin=# explain select * from object_column_index where object_type = 'a' and column_name = 'b' and organization_guid = 'c' and string_value = 'alice';
-[ RECORD 1 ]------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Index Scan using idx_on_string_column on object_column_index (cost=0.69..8.71 rows=1 width=151)
-[ RECORD 2 ]------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Index Cond: (((object_type)::text = 'a'::text) AND ((column_name)::text = 'b'::text) AND ((organization_guid)::text = 'c'::text) AND ((string_value)::text = 'alice'::text))
A case-insensitive query
admincoin=# explain select * from object_column_index where object_type = 'a' and column_name = 'b' and organization_guid = 'c' and lower(string_value) = 'alice';
-[ RECORD 1 ]-----------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Index Scan using idx_on_string_column on object_column_index (cost=0.69..8.72 rows=1 width=151)
-[ RECORD 2 ]-----------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Index Cond: (((object_type)::text = 'a'::text) AND ((column_name)::text = 'b'::text) AND ((organization_guid)::text = 'c'::text))
-[ RECORD 3 ]-----------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Filter: (lower((string_value)::text) = 'alice'::text)
You see the additional filter stage. Let’s add a lower index
create index idx_on_string_column_lower on object_column_index (object_type, column_name, organization_guid, lower(string_value), object_guid);
Then
admincoin=# explain select * from object_column_index where object_type = 'a' and column_name = 'b' and organization_guid = 'c' and lower(string_value) = 'alice';
-[ RECORD 1 ]-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Index Scan using idx_on_string_column_lower on object_column_index (cost=0.69..8.71 rows=1 width=151)
-[ RECORD 2 ]-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
QUERY PLAN | Index Cond: (((object_type)::text = 'a'::text) AND ((column_name)::text = 'b'::text) AND ((organization_guid)::text = 'c'::text) AND (lower((string_value)::text) = 'alice'::text))
So it indeed uses the lower index!
Replication
Postgres has a concept of logical replication. See these two chapters: