LLVM introduction
First, Clang is both a frontend for C-like languages and a compiler driver. As a frontend, Clang has a hand-written recursive descent parser, and it is quite complicated. To quote Walter Bright:
“Figure on 10 man-years to do a correct C++ parser, and that’s if you’re an experienced compiler guy.”
To study Clang, I took the same approach as I did when learning CPython internals: I compiled clang/llvm in debug mode, ran a sample program under a debugger, and walked through the stack trace together with the code base.
Another valuable resource is the LLVM developers' meetings: https://llvm.org/devmtg/ Check out the recordings on YouTube!
Compile
Follow https://clang.llvm.org/get_started.html to get started. Building inside Docker is super slow, so I built Clang natively on an M1 MacBook. The build takes about 30 minutes. Also, enable the generation of compile_commands.json.
cd llvm-project
mkdir -p build && cd build
cmake -DLLVM_ENABLE_PROJECTS=clang -DCMAKE_BUILD_TYPE=Debug -G "Unix Makefiles" -DCMAKE_EXPORT_COMPILE_COMMANDS=1 ../llvm
make -j$(sysctl -n hw.ncpu)  # macOS; on Linux use: make -j$(nproc)
After the build finishes, you will see compile_commands.json in the build folder. Make a soft link to it in the root directory.
ln -s "$(pwd)/compile_commands.json" ../
Navigate the code base.
I use neovim + treesitter + clangd lsp.
cd llvm-project
vim .
Trace the call sequence:
lldb ./build/bin/clang
(lldb) b cc1_main
(lldb) run -S -emit-llvm ~/test.c
Call sequence
cc1_main -> CompilerInstance::ExecuteAction -> FrontendAction::Execute
-> ParseAST -> Parser::ParseTopLevelDecl
-> Parser::ParseExternalDeclaration
-> Parser::ParseDeclarationOrFunctionDefinition
CodeGenAction.h defines many kinds of actions.
After reading some of the code, I found this video: 2019 LLVM Developers’ Meeting: S. Haastregt & A. Stulova, “An overview of Clang”. It reinforced what I had learned.
LLVM options
LLVM uses TableGen to generate source files from configuration files. This part confused me when I first tried to figure out how command line options work in LLVM.
The configuration source file is Options.td. The corresponding generated file is Options.inc, which I cannot link to because it is produced at build time. The entry generated for the -E option is below.
OPTION(prefix_1, llvm::StringLiteral("E"), E, Flag, Action_Group, INVALID, nullptr, NoXarchOption | CC1Option | FlangOption | FC1Option, 0,
"Only run the preprocessor", nullptr, nullptr)
Next comes the Options.h file. It defines an enum ID with members of the form OPT_xxx. For the -E option, it is OPT_E. Searching the repo for OPT_E then leads to getFrontendActionTable. Ah, everything makes sense now.
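The way a single OPTION(...) line turns into both an enum member and a table entry is the classic X-macro pattern. Here is a minimal sketch of that idea, not the real Clang code: TOY_OPTIONS stands in for including Options.inc, each entry keeps only three fields instead of the dozen in the real file, and ToyID, ToyInfo, lookupToyOption, and toyActionFor are hypothetical names I made up for illustration. The action strings are just labels.

```cpp
#include <cstring>

// Stand-in for the generated Options.inc: one entry per option.
// The real OPTION(...) entries carry many more fields than these three.
#define TOY_OPTIONS(X)                                        \
  X("E", E, "Only run the preprocessor")                      \
  X("S", S, "Only run preprocess and compilation steps")      \
  X("c", c, "Only run preprocess, compile, and assemble steps")

// First expansion: one OPT_xxx enum member per option.
enum ToyID {
  OPT_INVALID = 0,
#define OPTION(NAME, ID, HELP) OPT_##ID,
  TOY_OPTIONS(OPTION)
#undef OPTION
  OPT_LAST
};

// Second expansion of the same list: a lookup table.
struct ToyInfo {
  const char *Name;
  ToyID ID;
  const char *Help;
};

static const ToyInfo ToyTable[] = {
#define OPTION(NAME, ID, HELP) {NAME, OPT_##ID, HELP},
  TOY_OPTIONS(OPTION)
#undef OPTION
};

// Map a spelling such as "-E" to its enum value.
ToyID lookupToyOption(const char *Arg) {
  if (Arg[0] != '-')
    return OPT_INVALID;
  for (const ToyInfo &Info : ToyTable)
    if (std::strcmp(Arg + 1, Info.Name) == 0)
      return Info.ID;
  return OPT_INVALID;
}

// Dispatch on the enum, similar in spirit to a frontend action table.
// The returned strings are illustrative labels only.
const char *toyActionFor(ToyID ID) {
  switch (ID) {
  case OPT_E: return "PrintPreprocessedInput";
  case OPT_S: return "EmitAssembly";
  case OPT_c: return "EmitObj";
  default:    return "unknown";
  }
}
```

Calling toyActionFor(lookupToyOption("-E")) goes from the command line spelling to the enum and then to an action, which is roughly the path I was trying to trace. The nice property is that adding an option means touching only the list; the enum and the table stay in sync automatically.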
HandleDirective
HandleDirective is the preprocessing stage that deals with lines starting with #.
Handle header import
clang -E <file_name> prints the file content after preprocessing. Below is a sample backtrace.
#0: `clang::Preprocessor::HandleEndOfFile(this=0x000000012b832418, Result=0x000000016fdfba38, isEndOfMacro=false) at PPLexerChange.cpp:333:3
#1: `clang::Lexer::LexEndOfFile(this=0x0000600003900680, Result=0x000000016fdfba38, CurPtr="") at Lexer.cpp:3067:14
#2: `clang::Lexer::LexTokenInternal(this=0x0000600003900680, Result=0x000000016fdfba38, TokAtPhysicalStartOfLine=true) at Lexer.cpp:3627:14
#3: `clang::Lexer::Lex(this=0x0000600003900680, Result=0x000000016fdfba38) at Lexer.cpp:3576:24
#4: `clang::Preprocessor::Lex(this=0x000000012b832418, Result=0x000000016fdfba38) at Preprocessor.cpp:886:33
#5: `clang::DoPrintPreprocessedInput(PP=0x000000012b832418, OS=0x00006000026056e0, Opts=0x000000012b82fa68) at PrintPreprocessedOutput.cpp:1018:8
#6: `clang::PrintPreprocessedAction::ExecuteAction(this=0x0000600002605740) at FrontendActions.cpp:1018:3
#7: `clang::FrontendAction::Execute(this=0x0000600002605740) at FrontendAction.cpp:1060:8
#8: `clang::CompilerInstance::ExecuteAction(this=0x000000012b105cb0, Act=0x0000600002605740) at CompilerInstance.cpp:1049:33
#9: `clang::ExecuteCompilerInvocation(Clang=0x000000012b105cb0) at ExecuteCompilerInvocation.cpp:264:25
#10: `cc1_main(Argv=ArrayRef<const char *> @ 0x000000016fdfc240, Argv0="/Users/xiongding/code/llvm-project/build/bin/clang-17", MainAddr=0x00000001000036b0) at cc1_main.cpp:249:15
#11: `ExecuteCC1Tool(ArgV=0x000000016fdfcb18, ToolContext=0x000000016fdfd5e0) at driver.cpp:366:12
#12: `clang_main(this=0x000000016fdfd5e0, ArgV=0x000000016fdfcb18)::$_0::operator()(llvm::SmallVectorImpl<char const*>&) const at driver.cpp:506:14
#13: `int llvm::function_ref<int (llvm::SmallVectorImpl<char const*>&)>::callback_fn<clang_main(callable=6171907552, params=0x000000016fdfcb18)::$_0>(long, llvm::SmallVectorImpl<char const*>&) at STLFunctionalExtras.h:45:12
#14: `llvm::function_ref<int (llvm::SmallVectorImpl<char const*>&)>::operator(this=0x000000016fdfdc68, params=0x000000016fdfcb18)(llvm::SmallVectorImpl<char const*>&) const at STLFunctionalExtras.h:68:12
#15: `clang::driver::CC1Command::Execute(this=0x000000016fdfca78) const::$_1::operator()() const at Job.cpp:439:34
#16: `void llvm::function_ref<void ()>::callback_fn<clang::driver::CC1Command::Execute(callable=6171904632) const::$_1>(long) at STLFunctionalExtras.h:45:12
#17: `llvm::function_ref<void ()>::operator(this=0x000000016fdfca18)() const at STLFunctionalExtras.h:68:12
#18: `llvm::CrashRecoveryContext::RunSafely(this=0x000000016fdfcab8, Fn=function_ref<void ()> @ 0x000000016fdfca18)>) at CrashRecoveryContext.cpp:426:3
#19: `clang::driver::CC1Command::Execute(this=0x000000012b1057d0, Redirects=ArrayRef<std::__1::optional<llvm::StringRef> > @ 0x000000016fdfcb00, ErrMsg="", ExecutionFailed=0x000000016fdfcfef) const at Job.cpp:439:12
#20: `clang::driver::Compilation::ExecuteCommand(this=0x000000012b004f00, C=0x000000012b1057d0, FailingCommand=0x000000016fdfd0e8, LogOnly=false) const at Compilation.cpp:199:15
#21: `clang::driver::Compilation::ExecuteJobs(this=0x000000012b004f00, Jobs=0x000000012b004f80, FailingCommands=0x000000016fdfd960, LogOnly=false) const at Compilation.cpp:253:19
#22: `clang::driver::Driver::ExecuteCompilation(this=0x000000016fdfd9b0, C=0x000000012b004f00, FailingCommands=0x000000016fdfd960) at Driver.cpp:1863:5
#23: `clang_main(Argc=3, Argv=0x000000016fdfee70, ToolContext=0x000000016fdfebb8) at driver.cpp:542:21
#24: `main(argc=3, argv=0x000000016fdfee70) at clang-driver.cpp:15:10
Basically, when the preprocessor (PP) hits an #include while preprocessing a file, it tries to find the header file and, if found, lexes it. This is implemented as a stack: when entering an included file, we push the current lexer onto the stack and create a new lexer; when the included file has been processed, we restore the old lexer from the stack.
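To make the push/restore idea concrete, here is a toy sketch under heavy assumptions: files live in an in-memory map, the "lexer" yields whole lines rather than tokens, only the quoted #include form is recognized, and there is no include guard handling (so a self-including file would loop forever). ToyLexer and toyPreprocess are hypothetical names, not Clang APIs.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// A toy lexer: yields the lines of one in-memory buffer.
struct ToyLexer {
  std::string Buffer;
  std::size_t Pos = 0;

  explicit ToyLexer(std::string B) : Buffer(std::move(B)) {}

  bool atEnd() const { return Pos >= Buffer.size(); }

  std::string nextLine() {
    std::size_t End = Buffer.find('\n', Pos);
    if (End == std::string::npos)
      End = Buffer.size();
    std::string Line = Buffer.substr(Pos, End - Pos);
    Pos = End + 1; // step past the newline
    return Line;
  }
};

// Expand includes in Main, looking files up in the Files map.
std::vector<std::string>
toyPreprocess(const std::map<std::string, std::string> &Files,
              const std::string &Main) {
  std::vector<std::string> Out;
  std::vector<std::unique_ptr<ToyLexer>> IncludeStack; // suspended lexers

  auto It = Files.find(Main);
  if (It == Files.end())
    return Out;
  auto Cur = std::make_unique<ToyLexer>(It->second);

  const std::string Directive = "#include \"";
  while (Cur) {
    if (Cur->atEnd()) {
      // End of file: resume the includer, or stop if this was the main file.
      if (IncludeStack.empty()) {
        Cur.reset();
      } else {
        Cur = std::move(IncludeStack.back());
        IncludeStack.pop_back();
      }
      continue;
    }
    std::string Line = Cur->nextLine();
    if (Line.compare(0, Directive.size(), Directive) == 0) {
      std::size_t Close = Line.find('"', Directive.size());
      if (Close == std::string::npos)
        continue; // malformed directive: this toy just drops it
      std::string Name =
          Line.substr(Directive.size(), Close - Directive.size());
      auto Inc = Files.find(Name);
      if (Inc == Files.end())
        continue; // header not found: this toy just skips it
      // Entering an include: suspend the current lexer, start a new one.
      IncludeStack.push_back(std::move(Cur));
      Cur = std::make_unique<ToyLexer>(Inc->second);
      continue;
    }
    Out.push_back(Line);
  }
  return Out;
}
```

The end-of-file branch plays the role that HandleEndOfFile has at the top of the backtrace above: it is the point where the old lexer comes back off the stack.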
Header search directories and order
Clang uses a single vector, SearchDirs, to store header lookup directories. The first part of this vector contains quoted header directories, and the rest contains angled header directories. Quoted header directories are passed in by -iquote. Angled header directories are passed in by -I. This function shows the details of how the search list is constructed.
Basically, #include "foo.h" searches from index 0 of this vector, while #include <foo.h> starts from somewhere in the middle. See code here. This small distinction matters when you have a header with the same name as an existing system header.
There is one more caveat. I have always heard that a quoted include searches the current directory first, so #include "foo.h" first looks for the header in the current directory (that of the including file). This part is implemented here.
if (!Includers.empty() && !isAngled && !NoCurDirSearch) {
This part is executed before searching the SearchDirs vector, so the current folder has the highest precedence. The check above shows that this special rule does not apply to angled includes. There is also a parameter, NoCurDirSearch, to disable this behavior. Includers here is set to the current file’s full path.
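Putting the lookup rules together, here is a small model of the search order as I understand it. It is a sketch, not Clang's HeaderSearch: ToyHeaderSearch and its members are hypothetical, the file system is faked with a set of paths, and the includer is reduced to a single directory string.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

struct ToyHeaderSearch {
  std::vector<std::string> SearchDirs; // quoted dirs first, then angled dirs
  std::size_t AngledDirIdx = 0;        // index of the first angled dir
  bool NoCurDirSearch = false;         // disables the includer-dir rule
  std::set<std::string> ExistingFiles; // stand-in for the file system

  // Returns the path of the header that would be picked, or "" if none.
  std::string lookup(const std::string &Name, bool IsAngled,
                     const std::string &IncluderDir) const {
    // Rule 1: a quoted include tries the includer's directory first.
    if (!IncluderDir.empty() && !IsAngled && !NoCurDirSearch) {
      std::string Candidate = IncluderDir + "/" + Name;
      if (ExistingFiles.count(Candidate))
        return Candidate;
    }
    // Rule 2: walk the vector. Quoted includes start at index 0,
    // angled includes start at the first angled directory.
    for (std::size_t I = IsAngled ? AngledDirIdx : 0; I < SearchDirs.size();
         ++I) {
      std::string Candidate = SearchDirs[I] + "/" + Name;
      if (ExistingFiles.count(Candidate))
        return Candidate;
    }
    return "";
  }
};
```

With SearchDirs = {"quote", "inc", "sys"}, AngledDirIdx = 1, and a foo.h present in src, quote, and sys, a quoted include from src picks src/foo.h, while an angled include skips both src and quote and lands on sys/foo.h. That is exactly the shadowing situation where the quoted versus angled distinction matters.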
IR
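When I read the paper “The LLVM Instruction Set and Compilation Strategy”, I was surprised to find that, in the design it describes, an LLVM .o file contains IR rather than machine code. This is quite different from GCC, where source code is first preprocessed and then compiled to assembly, i.e., .s files; each .s file is then assembled into a .o file, and finally the linker links them. Keeping IR in the .o files means LLVM’s linker can potentially do more optimization. Note that in today's Clang a default build emits native machine code in .o files; they contain LLVM bitcode only when link-time optimization is enabled (for example with -flto).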
TODO: find some reference on LLVM linker performance. It may be much slower than the gcc equivalent.