You break debugging after you install gdb
A funny thing happened yesterday. I tried to use memray
to debug memory leak of a python service in production, but memray attach <pid>
failed to work. The reason is that I did apt-get install gdb
before installing memray
. Wait a minute. gdb
is a prerequisite for memray attach
. How does gdb breaks memray
.
A little background about how memray attach
works. Almost all these tools rely on a debugger gdb
or lldb
to attach to the running process. Some ambitious authors will write from scratch by only using ptrace
, but it is rare. memray
is the former case. The source code is here. The whole magic is just a gdb script. This script sets a bunch of breakpoints at function malloc
, free
, PyMem_Malloc
etc and registers a callback function memray_spawn_client
. This makes total sense for a memory profiler. However, the tricky part is the line doing dlopen
. For my case, the two parameters were
1
2
$libpath = /usr/local/lib/python3.12/site-packages/memray/_memray.cpython-312-x86_64-linux-gnu.so
$rtld_now = 2
2
means loading _memray.cpython-312-x86_64-linux-gnu.so
immediately.
If dlopen
succeeds, it should return a valid handler pointer, but what I got was always 0x0
. I tried to add dlerror()
after that line to see what error was, but immediately got segmentation fault!
Is the problem with _memray.cpython-312-x86_64-linux-gnu.so
itself? Chatgpt told me to verify it by running below script.
1
2
3
4
5
gdb -p $PID -batch \
-ex 'set unwindonsignal off' \
-ex 'set $r = (void*) dlopen("libm.so.6", 2)' \
-ex 'printf "dlopen -> %p\n", $r' \
-ex quit
Fine. It printed dlopne -> null
. What? It did not work for any shared library. Maybe the problem was at OS level? Chatgpt quickly generated a simple C program to call dlopne("libm.so.6", 2)
. It ran without problem, so the problem was with this specific $PID
process. I had a few other hypothesis, but none of them turned out to be the root cause. Then I casually read the memory maps of this process. It was a long list! I lost my patience and threw all of them to Chatgpt, and Chatgpt highlighted a few suspicious mappings.
1
2
3
4
5
0x7f75d0bed000 0x7f75d0bee000 0x1000 0x0 r--p /usr/lib/x86_64-linux-gnu/libdl.so.2 (deleted)
0x7f75d0bee000 0x7f75d0bef000 0x1000 0x1000 r-xp /usr/lib/x86_64-linux-gnu/libdl.so.2 (deleted)
0x7f75d0bef000 0x7f75d0bf0000 0x1000 0x2000 r--p /usr/lib/x86_64-linux-gnu/libdl.so.2 (deleted)
0x7f75d0bf0000 0x7f75d0bf1000 0x1000 0x2000 r--p /usr/lib/x86_64-linux-gnu/libdl.so.2 (deleted)
0x7f75d0bf1000 0x7f75d0bf2000 0x1000 0x3000 rw-p /usr/lib/x86_64-linux-gnu/libdl.so.2 (deleted)
Not just libdl
but also libm.so
and a few other shared libs. Nice! This must be related because I did not see (deleted)
on my local host.
(deleted)
means the file was deleted, but its content was still used by the process and it exits in memory. Think about mmap
and then deleting the physical file. I checked the file path, it did exist in disk. I expressed my confusion to Chatgpt, and it told me that the shared lib might be upgraded with the file path kept unchanged. Chatgpt said I might have upgraded libc
. I was very suspicious about this answer because it was not the first time Chatgpt lied to me.
I took a break for two hours because my daughter would not sleep if I did not sleep at the same time. I closed my eyes lying on the bed and thought about what I had done after logging into this machine. I did apt-get install procps
because I wanted to use ps
, then did apt-get installl gdb
and uv pip install memray
and apt-get install neovim
because I wanted to write some C code to test my hypothesis. Nothing I could think of to be related to libc
or ld
.
Wait a minute! apt-get install gdb
! I found my daughter was already asleep, and I went to my laptop, logged into another similar production box, and type apt-get install --simulate gdb
. It actually automatically upgrades glibc
! Holy shit. I quickly compiled gdb
from source, and then ran memray attach <pid>
. And it worked!
This is the full story. I know this isn’t significant at all in many people’s view. We are never short of gotchas in the Linux world. But I kind of feel this is a good lesson for me. I started to appreciate the difficulty of problems that Arch Linux faces. It is not easy to do rolling upgrade for long-running servers. Meanwhile, there is still some puzzle I do not fully resolve. (delete)
means file was deleted/updated on the disk, but still memory has it, so in principle, it should continue to work, but why gdb
must need the physical existence. Moreover, I could not find any manual/documentation discussing this gotcha. gdb
relying on physical existence is just my guess/observation!