
Process and thread

Process

IPC (Interprocess communication)

The wait system call blocks a parent process until a child process terminates. If the parent never calls it, the terminated child stays around as a zombie process. How can we avoid blocking the parent process and still reap the terminated child? The answer is SIGCHLD. We can install a signal handler for SIGCHLD and call wait in that handler, so the parent only waits when a child has actually terminated. Since several children may exit before the handler runs once, it is safer to call waitpid with WNOHANG in a loop inside the handler. See http://web.stanford.edu/~hhli/CS110Notes/CS110NotesCollection/Topic%202%20Multiprocessing%20(4).html
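
The idea can be sketched in Python as follows. This is only a minimal illustration, not a production pattern: the names `reaped`, `on_sigchld`, and `spawn_and_reap` are made up for this example, and it assumes a Unix system and that the code runs in the main thread (Python only delivers signals there).

```python
import os
import signal
import time

reaped = []  # (pid, exit_code) pairs collected by the handler (hypothetical name)


def on_sigchld(signum, frame):
    # Several children may have exited before this handler runs once,
    # so keep reaping until nothing is left to collect.
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # no children at all
        if pid == 0:
            break  # children exist, but none has terminated yet
        # Assumes the child exited normally (not killed by a signal).
        reaped.append((pid, os.WEXITSTATUS(status)))


def spawn_and_reap(exit_code=7):
    previous = signal.signal(signal.SIGCHLD, on_sigchld)
    try:
        pid = os.fork()
        if pid == 0:
            os._exit(exit_code)  # child: terminate immediately
        # Parent: free to do other work; here it just polls the list.
        deadline = time.monotonic() + 5
        while pid not in dict(reaped) and time.monotonic() < deadline:
            time.sleep(0.01)
        return pid, dict(reaped).get(pid)
    finally:
        signal.signal(signal.SIGCHLD, previous)
```

Calling `spawn_and_reap(7)` returns the child's PID together with the exit code that the handler collected, and the parent never blocks in wait.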

pipe

I created the short program below.

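import fcntl
import os


def main():
    rf, wf = os.pipe()
    for fd in (rf, wf):
        # Put both ends of the pipe into non-blocking mode.
        flags = fcntl.fcntl(fd, fcntl.F_GETFL) | os.O_NONBLOCK
        fcntl.fcntl(fd, fcntl.F_SETFL, flags)
        # Close the descriptors automatically on exec.
        flags = fcntl.fcntl(fd, fcntl.F_GETFD) | fcntl.FD_CLOEXEC
        fcntl.fcntl(fd, fcntl.F_SETFD, flags)

    pid = os.fork()
    if pid:
        # parent: may get here before the child has written anything
        bs = os.read(rf, 1)
        print(bs)
    else:
        # child
        os.write(wf, b'a')


if __name__ == '__main__':
    main()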

However, running it produced the error below.

    bs = os.read(rf, 1)
         ^^^^^^^^^^^^^^
BlockingIOError: [Errno 35] Resource temporarily unavailable

The doc page of os.read does not mention this error, but the doc page of io.BufferedIOBase.read says:

A BlockingIOError is raised if the underlying raw stream is in non blocking-mode, and has no data available at the moment.

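Then I went to the CPython source code. Ah, EAGAIN is mapped to BlockingIOError. The parent called read on the non-blocking pipe before the child had written anything, so read failed with EAGAIN (errno 35 on this system) instead of blocking. man 2 read gives more details about it.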

We can use select to wait until the pipe becomes readable before calling read, which fixes this issue. See the full code in pipe_example.py in the tutorial folder.
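
For reference, here is a minimal sketch of the select approach. It is not the content of pipe_example.py; the function name `read_from_child` and the 5-second timeout are assumptions made for this illustration.

```python
import fcntl
import os
import select


def read_from_child(timeout=5.0):
    rf, wf = os.pipe()
    # Only the read end needs to be non-blocking for this demo.
    flags = fcntl.fcntl(rf, fcntl.F_GETFL) | os.O_NONBLOCK
    fcntl.fcntl(rf, fcntl.F_SETFL, flags)

    pid = os.fork()
    if pid == 0:
        # child: write one byte and leave
        os.close(rf)
        os.write(wf, b'a')
        os._exit(0)

    # parent: wait until the read end is readable, then read
    os.close(wf)
    try:
        readable, _, _ = select.select([rf], [], [], timeout)
        data = os.read(rf, 1) if readable else None
    finally:
        os.close(rf)
        os.waitpid(pid, 0)
    return data
```

Because select only returns the descriptor once data (or EOF) is available, the non-blocking read no longer races with the child.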

Concurrent write

If multiple processes write to the same pipe, their messages may get interleaved. However, a write of at most PIPE_BUF bytes is atomic, so such a message is never mixed with data from other writers. POSIX.1 requires PIPE_BUF to be at least 512 bytes; on Linux it is 4096 bytes. See man 7 pipe for more details.
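
A small experiment can illustrate the atomicity guarantee. The sketch below is only a demonstration under assumed parameters: the 64-byte record size, the number of writers, and the function name `interleave_check` are all made up for this example. Each child writes records filled with a single letter, and the parent checks that no record contains bytes from two writers.

```python
import os
import select

RECORD = 64  # assumed record size, far below PIPE_BUF


def interleave_check(writers=4, records=200):
    assert RECORD <= select.PIPE_BUF
    rf, wf = os.pipe()
    pids = []
    for i in range(writers):
        pid = os.fork()
        if pid == 0:
            # child i: write `records` messages made of one repeated letter
            os.close(rf)
            msg = bytes([ord('a') + i]) * RECORD
            for _ in range(records):
                os.write(wf, msg)
            os._exit(0)
        pids.append(pid)

    # parent: drain the pipe until every writer has closed its end
    os.close(wf)
    chunks = []
    while True:
        chunk = os.read(rf, 65536)
        if not chunk:
            break
        chunks.append(chunk)
    os.close(rf)
    for pid in pids:
        os.waitpid(pid, 0)

    data = b''.join(chunks)
    recs = [data[i:i + RECORD] for i in range(0, len(data), RECORD)]
    # A record is intact if it is full length and holds a single letter.
    intact = all(len(r) == RECORD and len(set(r)) == 1 for r in recs)
    return len(recs), intact
```

With records larger than PIPE_BUF the same check would be allowed to fail, since the kernel may then split a write.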

Thread

Let’s talk about Linux first. When you search pthread_create in the glibc repo, you will see two implementations. One is under the htl folder, the other is under the nptl folder. htl is the Hurd threads library used on GNU/Hurd, and nptl is the Native POSIX Thread Library used on Linux. NPTL replaced the older LinuxThreads implementation, which is no longer part of glibc. This paper is a must read to understand why NPTL replaced LinuxThreads.

What are the differences between a process and a thread? First, we need to be explicit about what layer we are talking about: is it in the user land or in the kernel? In the user land, the distinction is clear. Each process has its own distinct PID, and its own virtual memory space. On the other hand, threads in the same group share the same PID and memory space. In the kernel, there is neither process nor thread. The only concept is called task. Tasks themselves can share nothing, something, or everything.
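
The user-land view can be observed from Python. The sketch below assumes CPython 3.8 or later on a system where `threading.get_native_id` is available; the function name `collect_ids` is made up for this example. Every thread reports the same PID but a different kernel-assigned thread ID.

```python
import os
import threading


def collect_ids(n=3):
    ids = []
    lock = threading.Lock()
    # The barrier keeps all threads alive at the same time, so the
    # kernel cannot hand the same thread ID to two of them.
    barrier = threading.Barrier(n)

    def worker():
        with lock:
            ids.append((os.getpid(), threading.get_native_id()))
        barrier.wait()

    threads = [threading.Thread(target=worker) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return ids
```

Each tuple returned by `collect_ids()` has the process PID as its first element, while the second elements are all distinct.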

The threading model in both LinuxThreads and NPTL is a 1:1 mapping, namely, each process or thread in user land is mapped to one task in the kernel. To create a new process, we use the system call fork, which creates a new kernel task that shares nothing with its calling task (it gets a copy-on-write copy of the memory instead). To create a new thread, we use the system call clone with flags such as CLONE_VM, CLONE_FILES, CLONE_SIGHAND, and CLONE_THREAD. CLONE_THREAD places the new task in the thread group of the calling task, so it has the same TGID (the PID seen from user land) and the same parent as the calling task. Actually, fork in glibc is implemented as a wrapper on top of clone. More detailed discussion can be found here.

So let’s take a look at the source code of pthread_create quickly. The signature is below.

int
__pthread_create_2_1 (pthread_t *newthread, const pthread_attr_t *attr,
		      void *(*start_routine) (void *), void *arg) {
    ...
}

First, note that it takes a start function and an argument. This is different from a process. After fork, the child process continues to run the rest of the code after the fork call, but a new thread runs the provided function and exits when that function returns.
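
The same contrast shows up in Python's wrappers around these primitives. This is only an illustrative sketch; the names `start_routine`, `run_thread`, and `run_fork` are made up for this example, and the pipe is just a convenient way to get the child's result back.

```python
import os
import threading


def start_routine(arg, out):
    # A thread runs exactly this function, then exits.
    out.append(arg * 2)


def run_thread(arg):
    out = []
    t = threading.Thread(target=start_routine, args=(arg, out))
    t.start()
    t.join()
    return out[0]


def run_fork(arg):
    rf, wf = os.pipe()
    pid = os.fork()
    # Both parent and child continue from this point; only the
    # return value of fork tells them apart.
    if pid == 0:
        os.close(rf)
        os.write(wf, str(arg * 2).encode())
        os._exit(0)
    os.close(wf)
    data = os.read(rf, 64)
    os.close(rf)
    os.waitpid(pid, 0)
    return int(data)
```

Note that the thread hands its result back through shared memory (the `out` list), while the forked child has to send it through a pipe because it no longer shares memory with the parent.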

Second, the data structure pthread_attr_t defines the properties/flags of the new thread. It is defined as a union here. It is quite weird! It is just a byte array. The integer member is only there for alignment and is never used. This can’t be the real definition, because pthread_attr_t is supposed to carry the scheduling, detach state, and stack settings of the new thread. The actual definition is here, copied below.

struct pthread_attr
{
  /* Scheduler parameters and priority.  */
  struct sched_param schedparam;
  int schedpolicy;
  /* Various flags like detachstate, scope, etc.  */
  int flags;
  /* Size of guard area.  */
  size_t guardsize;
  /* Stack handling.  */
  void *stackaddr;
  size_t stacksize;

  /* Allocated via a call to __pthread_attr_extension once needed.  */
  struct pthread_attr_extension *extension;
  void *unused;
};

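pthread_attr_t is always cast to struct pthread_attr inside glibc. Why design it this way? pthread_attr_t is the public type that user programs see: an opaque block of bytes with a fixed size and alignment. struct pthread_attr is the internal layout that only glibc knows about. Keeping the public type opaque lets glibc change the internal fields without breaking the ABI of existing programs. The kernel never sees either of them; clone takes its flags and stack pointer as plain arguments. So it is the same thing in two representations.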

Third, struct pthread_attr has two members, stackaddr and stacksize. These two members define the stack memory of the new thread. A few lines below in pthread_create is

  int err = allocate_stack (iattr, &pd, &stackaddr, &stacksize);

The stack space of the new thread is allocated in user land instead of in the kernel! Most of the time, the caller leaves these two parameters unspecified, so the default stack size is used.

$ ulimit -a | grep stack
stack size                  (kbytes, -s) 10240

The example above shows that the default stack size is 10MB, which is a lot, right? Suppose the total memory is 16GB; can we then have at most about 1600 threads? No. The calculation is way off. Following the allocate_stack function above, you will see that glibc uses mmap to allocate stack memory. It is definitely not malloc because malloc is used for heap allocation. So the thread’s stack size is just a size in the virtual memory space. The stack is backed by physical memory only for the pages that are actually touched. If the stack usage stays under one page (typically 4KB on x86-64), then only one page is allocated in physical memory.

In practice, the stack size is not what stops Linux from running a very large number of threads. For the stack size calculation, please read Ulrich’s blog post Thread Numbers and Stacks. Meanwhile, I personally find this post quite enlightening as well.

Let’s do a quick calculation. For an x86-64 system with 48-bit virtual addresses, half of the address space belongs to the kernel, so user space can address 2^47 bytes. Now suppose the stack size is 8MB = 2^23 bytes. Then the address space has room for as many as 2^(47-23) = 2^24 = 16M thread stacks.
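
The two estimates can be written down as plain arithmetic. The function names below are made up for this illustration, and the numbers are the ones assumed in the text (16GB of RAM, 10MB and 8MB stacks, 47 bits of user address space), not measurements.

```python
MIB = 2 ** 20
GIB = 2 ** 30


def naive_physical_limit(ram_bytes=16 * GIB, stack_bytes=10 * MIB):
    # Wrong model: pretends every stack is fully backed by RAM.
    return ram_bytes // stack_bytes


def address_space_limit(user_bits=47, stack_bytes=8 * MIB):
    # Better model: stacks only consume virtual address space up front.
    return 2 ** user_bits // stack_bytes
```

`naive_physical_limit()` gives 1638, while `address_space_limit()` gives 16777216, i.e. 2^24. Passing `stack_bytes=GIB` to the second function gives 131072 (128K), which is the figure that applies to 1GB stacks.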

Let’s briefly mention golang here. In golang, a lot of goroutines can run on the same thread. If the goal is to support as many as 1M goroutines, then the stack size of a thread should be at least 1M multiplied by the average function frame size. This can be ~1GB of virtual memory. This is insane, and it means we can have at most 2^(47 - 30) = 128K threads, which is a big constraint. Golang designers implemented a clever idea called “growable stack” to solve this problem. Check out this wonderful talk, Go scheduler: Implementing language with lightweight concurrency by Dmitry Vyukov.

Inter thread communication

TODO:

  1. read https://www.akkadia.org/drepper/futex.pdf
  2. read https://www.akkadia.org/drepper/tls.pdf

Relearn ps

ps is one of the most used commands in Linux. I still learn new things about it from time to time, and I certainly do not know all of its options.

Quote from ps man page

 H      Show threads as if they were processes.
 -L     Show threads, possibly with LWP and NLWP columns.
 -m     Show threads after processes.
 -T     Show threads, possibly with SPID column.

Let’s see a few examples

# ps -efL
UID        PID  PPID   LWP  C NLWP STIME TTY          TIME CMD
root         1     0     1  1   15 15:38 ?        00:00:16 kafka consumer controller
root         1     0    16  0   15 15:38 ?        00:00:00 kafka consumer controller
root         1     0    17  0   15 15:38 ?        00:00:00 kafka consumer controller

# ps -efT
UID        PID  SPID  PPID  C STIME TTY          TIME CMD
root         1     1     0  1 15:38 ?        00:00:16 kafka consumer controller
root         1    16     0  0 15:38 ?        00:00:00 kafka consumer controller
root         1    17     0  0 15:38 ?        00:00:00 kafka consumer controller

Here, LWP stands for Light Weight Process, i.e., thread. NLWP means the number of threads in this thread group. SPID is an alias of LWP. It possibly stands for shared PID. You can check out ps’s man page to read more about these terms.
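
To make the columns concrete, here is a small sketch that groups the LWP values of the `ps -efL` output above by PID. The helper name `threads_by_pid` is made up for this example, and it assumes the column layout shown in the sample (the CMD column comes last and may contain spaces).

```python
SAMPLE = """\
UID        PID  PPID   LWP  C NLWP STIME TTY          TIME CMD
root         1     0     1  1   15 15:38 ?        00:00:16 kafka consumer controller
root         1     0    16  0   15 15:38 ?        00:00:00 kafka consumer controller
root         1     0    17  0   15 15:38 ?        00:00:00 kafka consumer controller
"""


def threads_by_pid(text):
    lines = text.strip().splitlines()
    header = lines[0].split()
    result = {}
    for line in lines[1:]:
        # Split into as many fields as there are columns, so that the
        # spaces inside CMD stay in the last field.
        fields = line.split(None, len(header) - 1)
        row = dict(zip(header, fields))
        result.setdefault(int(row["PID"]), []).append(int(row["LWP"]))
    return result
```

For the sample, `threads_by_pid(SAMPLE)` returns `{1: [1, 16, 17]}`: one process (PID 1) whose listed threads have LWP 1, 16, and 17.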

Misc

read vs pread

read reads from the current file offset and advances it. pread takes an offset parameter and does not change the current offset of the file.
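
The difference is easy to see with Python's thin wrappers `os.read` and `os.pread` (Unix only). The function name `read_vs_pread` and the file content are made up for this sketch.

```python
import os
import tempfile


def read_vs_pread():
    # buffering=0 keeps Python's own buffering out of the picture.
    with tempfile.TemporaryFile(buffering=0) as f:
        f.write(b"0123456789")
        fd = f.fileno()
        os.lseek(fd, 0, os.SEEK_SET)

        first = os.read(fd, 3)       # reads at offset 0, offset becomes 3
        peek = os.pread(fd, 3, 7)    # reads at offset 7, offset stays 3
        second = os.read(fd, 3)      # continues at offset 3
        position = os.lseek(fd, 0, os.SEEK_CUR)
    return first, peek, second, position
```

`read_vs_pread()` returns `(b"012", b"789", b"345", 6)`: the pread in the middle did not disturb the offset used by the two read calls.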

This post is licensed under CC BY 4.0 by the author.