An overview of process management in the Linux kernel.
The process
A process is a running program together with its related resources. It is the live instance of a program and the unit of isolation around it, and it includes:
- One or more threads of execution, including the thread ID, the program counter (PC), stack and stack pointer, a set of saved registers, and scheduling information.
- The resources it uses, such as open files/sockets, signals, mutexes, processor state data, and other memory space (heap, shared libraries, etc.).
- The code (text section) that it executes.
- The data/bss section, which contains global variables or uninitialized variables.
Modern operating systems, including Linux, provide two main types of virtualization for processes: virtual memory and a virtual processor.
Virtual memory
With virtual memory support, the process can allocate and manage memory as if it owned all the memory in the system.
From the process/programmer's perspective, memory access uses virtual memory addresses, which are translated to physical memory addresses by the MMU with the help of page tables.
The above image from this blog shows the virtual memory layout of a process in Linux. We will come back to memory management in later posts.
Virtual processor
On modern operating systems, the process also runs as if it alone monopolizes the CPU, despite the fact that it is sharing the CPU with other processes.
The kernel schedules the processes and decides which processes to run, at what time, and for how long.
Process from programmer's point of view
A process is created by its parent process by calling the fork() system call, which internally calls the clone() system call.
The clone() system call duplicates the parent's task data structure, assigns the copy a new PID,
and copies or shares resources such as open files and signal handlers depending on the flags passed to it.
The parent's memory pages are not copied up front; they are shared copy-on-write (CoW).
(Interestingly, a thread in Linux is implemented as a process that shares more data through clone() than a regular process does.)
After fork()
returns, the parent resumes execution and the child process starts execution from the same place.
The fork()
system call returns twice:
- In the child process, the return value is 0.
- In the parent process, the return value is the PID of the child process.
Programmers can use this return value to distinguish the child process from the parent process.
Often, the child process is used to execute a new, different program.
We can use the exec() family of functions, which are built on the execve() system call,
to create a new address space and load a new program into it.
Finally, a process can exit by calling the exit()
system call.
This system call terminates the process and frees most of the resources associated with it; the process descriptor itself stays until the parent has been notified.
A parent process can wait for a child process to exit by calling the wait()/waitpid()
function,
which internally calls the wait4()
system call.
Process data structure
Internally, a process is represented by a struct task_struct
structure, which is also referred to as a process descriptor.
The kernel maintains a circular doubly linked list of all process descriptors.
In linux-next, the struct task_struct structure is defined in the include/linux/sched.h header file.
Process descriptor content
The process descriptor contains all the information that the kernel needs about a process, including the process properties, scheduling information, the program counter, saved register data, memory space, file system information, etc.
Some examples in struct task_struct
are:
- Process properties
  - pid: Unique process (thread) ID (Line 946 in commit 09688c0166e7).
  - tgid: Thread group ID, that is, the ID of the whole process (Line 947 in commit 09688c0166e7).
  - state, exit_state, etc.: Used to track the process state, such as running, sleeping, or exiting.
  - flags, exit_code, comm, uid, gid, etc.
- Scheduling properties
  - prio, static_prio, normal_prio, rt_priority, etc.: Process priority properties.
  - policy: Process scheduling policy.
  - sched_class: The scheduling class of the process.
  - se, rt, dl, etc.: Scheduling entities, one for each scheduling class.
- Process links
  - real_parent, parent: If the process is not traced, the parent is the real parent.
  - children: A list of all the children of the process.
  - sibling: The list node that links the process into its parent's list of children.
  - group_leader: A pointer to the leader of the thread group.
- Memory and file system
  - mm: A pointer to the mm_struct that describes the process's address space.
  - fs: File system information, such as the root and current working directory.
  - files: File descriptor table.
The task_struct of the very first task is defined statically as init_task in ./init/init_task.c.
Note that the pid
in the kernel source actually means the thread ID from the user space application's view (such as ps
).
And the process ID (PID) returned by ps
is actually the tgid
in the kernel source.
These terms are easy to confuse.
In this blog post there is a nice explanation of the difference. See the PID, TID, PPID, and TGID
section.
Process descriptor lookup
It is important for the kernel to quickly look up the process descriptor of the currently executing process.
On different architectures, the kernel uses different methods to do this, which are abstracted by the current macro (./include/asm-generic/current.h).
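Since Linux 2.6, the process descriptor is allocated dynamically.
In that design, a struct thread_info structure is placed at the end of the process's kernel stack,
and it contains a pointer to the process descriptor.
The kernel then uses the current_thread_info()->task call to get the current process descriptor address.
Newer kernels differ by architecture: x86, for example, embeds thread_info in task_struct (CONFIG_THREAD_INFO_IN_TASK) and reads the current task pointer from a per-CPU variable.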
Task link and task tree
The process descriptor embeds a list node (the tasks member) with a next pointer and a prev pointer, which links all processes into a circular doubly linked list.
You can use for_each_process
to iterate through all processes (./include/linux/sched/signal.h
).
There is also a parent
pointer in the process descriptor, which points to the parent process descriptor.
Every process has exactly one parent, and zero or more children.
Thus all the processes in the system are linked in a tree structure, with the init
process (pid 1) as the root process.
Process state
The state
field of the process descriptor contains the current state of the process.
There are five process states in Linux, which are defined in ./include/linux/sched.h.
The kernel changes the process state when the situation changes.
Exiting and cleanup
When a process calls the exit() system call, or is terminated by a signal or exception, the kernel runs do_exit() on its behalf.
do_exit() cleans up the process's resources, sends a signal to its parent, and then changes its exit_state to EXIT_ZOMBIE.
After that, it calls into the scheduler and lets the kernel switch to a new process.
At this point, the process descriptor (task_struct), the thread_info, and the kernel stack still exist.
The parent process needs to handle the child's exit (for example by calling wait()), or tell the kernel that it is not interested. Only then does the kernel free these memory resources.
Question: What if the parent process exits before the child process does?