Post

Mysql introduction

I am reading a good book “Database internals” by Alex Petrov.

Below are some useful papers I read

Some useful resources:

  • Alibaba’s database engine blogs http://mysql.taobao.org/monthly/

Build

I did not find any official document about how to build mysql from source code. But its test folder has a setup readme, which is quite useful.

Generate Makefile and compilation database

Mysql relies on boost at configuration stage.

1
2
3
4
5
cd mysql-server
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Debug -G "Unix Makefiles" -DCMAKE_EXPORT_COMPILE_COMMANDS=1 -DDOWNLOAD_BOOST=1  -DWITH_BOOST=./boost/ ..
cd ..
ln -s build/compile_commands.json compile_commands.json

Mysql build has a few dependencies. In ubuntu, you can install them as

1
apt-get install flex bison libudev-dev libssl-dev libncurses5-dev

Build

Do not run make install!

1
make -j8

Post installation

There are a few steps to follow after a fresh installation.

First, create mysql user and group

1
2
$ groupadd mysql
$ useradd -r -g mysql -s /bin/false mysql

This step can be omitted if the OS is MacOS.

Second, initialize the data directory. Suppose the build folder is /code/mysql-server/build.

1
2
3
4
5
6
7
8
cd build
rm -rf data && mkdir data && mkdir log && touch log/mysql.err

# Mysql 8
bin/mysqld --initialize --user=mysql

# Mysql 9
./bin/mysqld --initialize --datadir=$HOME/code/mysql-server/build/data --log-error=$HOME/code/mysql-server/build/log/mysql.err

For Mysql 8, the above command prints a temp root password in console. For Mysql 9, the above command prints a temp root password in the log file. Now we are ready to start the server.

1
2
3
4
5
# Mysql 8
bin/mysqld --user=mysql

# Mysql 9
bin/mysqld --user=mysql --datadir=$HOME/code/mysql-server/build/data --log-error=$HOME/code/mysql-server/build/log/mysql.err --socket=$HOME/code/mysql-server/build/mysqld.sock --pid-file=$HOME/code/mysql-server/build/mysqld.pid

Open another terminal,

1
bin/mysql -u root -p<temp_password>

The above command may fail with error saying socket file /tmp/socket... cannot be used. This is because the existing mysql in the system already uses this file, so we need to use another file. Usually, a quick fix is to pass an additional parameter -h 127.0.0.1 such that we do not use Unix socket.

Then assign a new root password.

1
mysql>  ALTER USER 'root'@'localhost' IDENTIFIED BY 'password';

One thing I found interesting about MacOS. I already installed mysql using homebrew and it is running. I assume the above process will fail because port 3306 is already used. However, it is not.

1
2
3
4
5
6
7
8
$ lsof -i tcp:3306
COMMAND   PID      USER   FD   TYPE             DEVICE SIZE/OFF NODE NAME
mysqld  62685 xiongding   21u  IPv6 0x26ae8b92b7c99911      0t0  TCP *:mysql (LISTEN)
mysqld  96366 xiongding  363u  IPv4 0x26ae8ba13830ee11      0t0  TCP localhost:mysql (LISTEN)

$ ps -ef |grep mysql
  502 96366 95551   0 Thu10AM ??         9:28.96 /opt/homebrew/opt/mysql/bin/mysqld --basedir=/opt/homebrew/opt/mysql --datadir=/opt/homebrew/var/mysql --plugin-dir=/opt/homebrew/opt/mysql/lib/plugin --log-error=xiong-MacBook-Pro-Zip.local.err --pid-file=xiong-MacBook-Pro-Zip.local.pid
  502 62685 50179   0 11:14AM ttys009    0:02.80 ./bin/mysqld --user=mysql

You can see that it works because one uses IPv4 and the other uses IPv6. By default, Mysql server starts to listen to IPv6. If you want to use a different port, then you can do

1
2
bin/mysqld --user=mysql --port 3307
bin/mysql -u root -ppassword -P 3307

Btw, mysqld does not response to Ctl-c, but it responses to SIGQUIT, which is Ctl-\.

Create example database and tables

Mysql provides sample data for an “employee” database. See doc.

1
2
3
4
5
cd build
wget https://github.com/datacharmer/test_db/archive/refs/tags/v1.0.7.zip
unzip v1.0.7.zip
cd test_db-1.0.7/
../bin/mysql -t -u root -ppassword -P 3307 < employees.sql

Also, below is a sample table as well.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
CREATE DATABASE demodb;

USE demodb;

CREATE TABLE t1 (
    id bigint NOT NULL,
    last_name varchar(255) NOT NULL,
    first_name varchar(255),
    age int,
    PRIMARY KEY (id),
    INDEX index_name (first_name, last_name)
);

insert into t1 (id, last_name, first_name, age) values (1, 'x1', 'y1', 18);
insert into t1 (id, last_name, first_name, age) values (2, 'x2', 'y2', 18);

Mount existing database files

You can also mount existing database files as follows,

1
bin/mysqld --user=mysql --port 3307 --datadir=/opt/homebrew/var/mysql

Debug

Run in gdb

1
2
gdb --args bin/mysqld --user=mysql
(gdb) run&

Here, we added & to the run command because the main thread is daemon thread, and it waits forever. We want to check the child threads. When a client connects to the server, gdb will show a message like below

1
(gdb) [New Thread 0xffffe067edc0 (LWP 51367)]

Run in lldb

As of Dec 17 2023, gdb still does not support MacOS Apple chip, so the only debugger I can use is lldb.

1
2
3
4
5
$ lldb -- ./bin/mysqld --user=mysql --port 3307
(lldb) b sql/sql_optimizer.cc:5875
Breakpoint 1: where = mysqld`JOIN::estimate_rowcount() + 44 at sql_optimizer.cc:5875:37, address = 0x0000000100a2c4d8
(lldb) process launch -n
Process 4071 launched: '/Users/xiongding/code/mysql-server/build/bin/mysqld' (arm64)

Use debug info

If Mysql is compiled with debug info, then when you connect to it, you see the promp contains server version xx-debug as below. Also, a configuration variable debug exists.

1
2
3
4
5
6
7
8
9
10
...
Server version: 8.2.0-debug Source distribution
...
mysql> show variables like 'debug';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| debug         |       |
+---------------+-------+
1 row in set (0.02 sec)

Unit tests

Mysql comes with a lot of unit tests. Some of them use gtest framework. For example, we can build a test suite as follows,

1
make merge_small_tests-t -j8

Then run it

1
./runtime_output_directory/merge_small_tests-t --gtest_filter='VarlenSortTest.FailSort'

Grammar

SQL is a nonregular language, but Mysql has a context-free grammar. See code.

MISC

See a lot of #ifndef UNIV_HOTBACKUP in the codebase. This flag is never set. See more https://jira.mariadb.org/browse/MDEV-11690 .

Name conventions

Prefixes:

  • ut_: utility
  • ha_: handler
  • opt_: optimization

InnoDB

InnoDB source code tree has an interesting structure. Checkout this article for more details.

Jeremy Cole is definitely an expert in InnoDB. He wrote a series blogs about internals of InnoDB and also is the author of innodb_ruby. This is most valuable references I found on internet about InnoDB.

First, let’s talk about a few core deta structures inside InnoDB.

  • tablespace: fil_space_t
1
class page_id_t(m_space_no, m_page_no)

Blow terminology table can help you When reading the source code,

  • sx lock: share/exclusive lock

Locks

Next-key lock.

Metadata lock

There are a few good posts from Alibaba that explains MDL every well.

  • http://mysql.taobao.org/monthly/2015/10/02
  • http://mysql.taobao.org/monthly/2015/11/04/

Data structures

KEY

Key is just alias for index.

Code: https://github.com/mysql/mysql-server/blob/7e1ce704209203da2bde727d5ce8b059d2c07c6c/sql/key.h#L121

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
class KEY {
 public:
  /** Tot length of key */
  uint key_length{0};
  /** How many key_parts */
  uint user_defined_key_parts{0};
  /** How many key_parts including hidden parts */
  uint actual_key_parts{0};  //Xiong: This is the total number of fields in this index

  /**
    Array of AVG(number of records with the same field value) for 1st ... Nth
    key part. For internally created temporary tables, this member can be
    nullptr. This is the same information as stored in the above
    rec_per_key array but using float values instead of integer
    values. If the storage engine has supplied values in this array,
    these will be used. Otherwise the value in rec_per_key will be
    used.  @todo In the next release the rec_per_key array above
    should be removed and only this should be used.
  */
  rec_per_key_t *rec_per_key_float{nullptr};
  ...
};

One node about field rec_per_key_float. It is an array that contains the average number of records with the same index prefix. For our example employee database, the table dept_emp has below definition

1
2
3
4
5
6
7
8
9
10
| dept_emp | CREATE TABLE `dept_emp` (
  `emp_no` int NOT NULL,
  `dept_no` char(4) NOT NULL,
  `from_date` date NOT NULL,
  `to_date` date NOT NULL,
  PRIMARY KEY (`emp_no`,`dept_no`),
  KEY `dept_no` (`dept_no`),
  CONSTRAINT `dept_emp_ibfk_1` FOREIGN KEY (`emp_no`) REFERENCES `employees` (`emp_no`) ON DELETE CASCADE,
  CONSTRAINT `dept_emp_ibfk_2` FOREIGN KEY (`dept_no`) REFERENCES `departments` (`dept_no`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

For the primary key, actual_key_parts = 2. so rec_per_key_float is a two-element array. What are their values?

1
2
3
4
5
6
7
8
9
10
mysql> select * from mysql.innodb_index_stats where table_name = 'dept_emp' and index_name = 'PRIMARY';
+---------------+------------+------------+---------------------+--------------+------------+-------------+-----------------------------------+
| database_name | table_name | index_name | last_update         | stat_name    | stat_value | sample_size | stat_description                  |
+---------------+------------+------------+---------------------+--------------+------------+-------------+-----------------------------------+
| employees     | dept_emp   | PRIMARY    | 2023-12-12 12:18:05 | n_diff_pfx01 |     299527 |          20 | emp_no                            |
| employees     | dept_emp   | PRIMARY    | 2023-12-12 12:18:05 | n_diff_pfx02 |     331143 |          20 | emp_no,dept_no                    |
| employees     | dept_emp   | PRIMARY    | 2023-12-12 12:18:05 | n_leaf_pages |        731 |        NULL | Number of leaf pages in the index |
| employees     | dept_emp   | PRIMARY    | 2023-12-12 12:18:05 | size         |        737 |        NULL | Number of pages in the index      |
+---------------+------------+------------+---------------------+--------------+------------+-------------+-----------------------------------+
4 rows in set (0.02 sec)

Field n_diff_pfx[num] refers the approximate number of different values for this index with prefix length num. In our example, the primary key is (emp_no, dept_no), so there are approximate 299527 distinct emp_no, and 331143 distinct (emp_no, dept_no). Also, the approximate total number of records in this table is 331143 as well. Therefore

1
*rec_per_key_float = [299527/331143, 331143/331143] = [1.1055530887032, 1]

Let’s check these numbers in debugger

1
2
3
4
(lldb) p keyinfo->rec_per_key_float[0]
(rec_per_key_t) 1.10555303
(lldb) p keyinfo->rec_per_key_float[1]
(rec_per_key_t) 1

Value of rec_per_key_float is populated by the database engine. For InnoDB, the relevant code is here.

TABLE_SHARE::keys defines the number of indices this table has. TABLE::key_info points to the array of indices of this table. Note TABLE_SHARE also has a field key_info, but they are not the same pointer. It is actually copied over.

And below shows the indices of this table.

1
2
3
4
(lldb) p keyuse->table_ref->table->key_info[0]->name
(const char *) 0x00000001213a2cc0 "PRIMARY"
(lldb) p keyuse->table_ref->table->key_info[1]->name
(const char *) 0x00000001213a2cc8 "dept_no"

BTW, one thing to note is that the table definition is here. Mysql automatically created an index dept_no for the foreign key constraint dept_emp_ibfk_2, but not for dept_emp_ibfk_1 because it smartly knows that the primary key covers this index. See this post.

An index is generated for each FOREIGN KEY definition unless there is already an obviously adequate index in existence.

Query block

Mysql code base is not easy to read because the mutual reference using pointers. For example, Query_block and JOIN keeps a pointer to each other.

Best reference: https://github.com/mysql/mysql-server/blob/eb86b4016060d426858cc09873d12492f1be396e/sql/query_term.h#L125

1
2
3
4
5
6
7
SELECT * FROM t1
JOIN t2 on ...
JOIN (select * from t3) t3t on ..
WHERE ...
group by ...
having ...
order by ...
1
2
3
4
5
6
7
8
class Query_block : public Query_term {
...
  bool is_implicitly_grouped() const {
    return m_agg_func_used && group_list.elements == 0;
  }

...
};

Implicitly grouped: select max(emp_no) from employees;

1
2
3
4
5
class JOIN {
  JOIN_TAB *join_tab{nullptr};
  QEP_TAB *qep_tab{nullptr};
  JOIN_TAB **best_ref{nullptr};
};

qep_tab and best_ref go together. See code.

join_tab:

QEP_shared

QEP_shared_owner seems badly implemented. Using a shared_ptr seems way more cleaner.

JOIN_TAB

QEP_TAB

I frequently found that some comments in mysql codebase help me a lot to understand its internals. Examples

  • https://github.com/mysql/mysql-server/blob/d7255a34e726757df08659e5f295ac72b10a63c8/sql/sql_select.h#L352

TODO:

  1. learn details of dynamic range access
  2. read function GetTableAccessPath

Useful commands

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
SELECT thr.processlist_id AS mysql_thread_id,
       p.thd_id,
       concat(PROCESSLIST_USER,'@',PROCESSLIST_HOST) User,
       Command,
       FORMAT_PICO_TIME(trx.timer_wait) AS trx_duration,
       current_statement as `latest_statement`
  FROM performance_schema.events_transactions_current trx
  INNER JOIN performance_schema.threads thr USING (thread_id)
  LEFT JOIN sys.processlist p ON p.thd_id=thread_id
 WHERE thr.processlist_id IS NOT NULL
   AND PROCESSLIST_USER IS NOT NULL
   AND trx.state = 'ACTIVE'
 GROUP BY thread_id, timer_wait
 ORDER BY TIMER_WAIT DESC LIMIT 20;

SELECT DATE_SUB(now(), INTERVAL (
         SELECT variable_value
           FROM performance_schema.global_status
           WHERE variable_name='UPTIME')-TIMER_START*10e-13 second) `start_time`,
       SQL_TEXT
  FROM performance_schema.events_statements_history
 WHERE nesting_event_id=(
               SELECT EVENT_ID
                 FROM performance_schema.events_transactions_current t
                 LEFT JOIN sys.processlist p ON p.thd_id=t.thread_id
                 WHERE conn_id=9251630)
 ORDER BY event_id;
This post is licensed under CC BY 4.0 by the author.