Fork () implementation without MMU

In this article I will tell you how we implemented such a strange one
fork()
. I will check the operability on a third-party program - dash - an interpreter that uses fork()
to run applications. Who cares, please, under the cat.
A legitimate question may arise - why even make some sort of a stripped-down version
fork()
, if for some reason it exists vfork()
, just allowing you to create processes without using the MMU? In fact, there are a number of applications that do not use copying the address space “to the fullest,” but alsovfork()
- it’s not enough for them (I remind you that when using vfork()
in a descendant, you can’t even touch the stack, that is, you cannot return from a function or change local variables). dash - POSIX-compatible lightweight shell - just serves as an example of such a program. To simplify, when calling third-party programs, dash uses it
fork()
, after which in both processes (parent and child) it modifies some static data, performs some operations with the heap, then the child process calls the desired program with execv()
, and the parent calls waitpid()
and waits for the child to complete. A small educational program for those who do not understand what all these words mean, but for some reason continued to read the article: in UNIX systems, processes are created using a system call
fork()
. ByThe definition from POSIX is fork()
to create an exact copy of the process with the exception of some variables. If the function succeeds, the function returns a value of zero to the child process and the number of the child process to the parent (after which the processes begin to "live their own lives"). It turns out that two processes begin to work in the system with the same addresses of variables, with the same stack address and so on. However, the data is not "confused" between processes due to the use of virtual memory: a copy of the parent data is created, and the child process accesses other physical addresses. Read more about virtual memory here .Modern MMUs allow you to set permissions for memory pages, therefore, to create a process, it’s enough to just clone the translation table, marking the pages with the copy-on-write flag, and when trying to write to this page, clone the data for real. The lack of MMU not only makes it impossible to use this optimization, but also calls into question the possibility of making a call
fork()
in principle. Without hardware support for virtual memory, programs use physical addresses, not virtual ones, which means that the process obtained with the help fork()
will inevitably access the same data as the parent process.In a broad sense, the address space of a process is understood to mean all memory mapped for a particular process. This includes a segment with program code, a bunch, stacks of threads, partly kernel memory, peripheral memory may be included, and so on. In this article, address space will be understood as heap, static data (hereinafter, static data means data from the .bss and .data sections) and a process stack, that is, data that can be changed in both processes. It is with these data that one will have to deal with in order to avoid confusion between the work of the “forks”.
Since we are talking about implementation
fork()
in a multi-tasking OS, switching the context of the process plays an important role. About switching contexts of objects that share processor time is written in our article."Organization of multitasking in the kernel of the OS . " In this case, the address space should also be switched. If, on systems with an MMU, this is, roughly speaking, a change in a pointer to a translation table, then in the absence of an MMU it will be more difficult to ensure the correct operation of the processes. One way is to swap values on the stack, on the heap, and in static memory when switching processes. This, of course, is very slow, but simple enough. Actually, this is the idea for implementing ours
fork()
without MMU. We memorize the necessary data for forked processes and copy the values to the working address space when switching. For the stack, heap, and static data, this is done in different ways.Stack
In our OS, memory for thread stacks is not allocated dynamically, but from a static pool, respectively, some part of .bss contains a place for the stacks. We are interested only in those stacks whose corresponding flows are forked. Copy processes, according to the definition of POSIX, should have only one thread (a copy of the one in which the corresponding system call took place), but
fork()
can be called from different threads at different times, so you need to maintain a list of threads for each address space whose stacks must be saved .A bunch
Heap data is stored in a buffer that is allocated from the static page pool.
.bss and .data
Embox (along with all applications) is compiled into a single static image, so the information about the .bss and .data sections is not taken from any ELF file, but is saved in a common section under a unique name (for example,
__module_embox__cmd__shell_data_vma
). Information about what data relates to which application is stored in a special structure, which is set at compile time.struct mod_app {
char *data;
size_t data_sz;
char *bss;
size_t bss_sz;
};
During execution of the current process, you can find out where its data is stored.
When starting the program directly, you need to copy the corresponding data to some memory area (so that the next time the program starts, the initial values of the variables are correct).
These are the pieces of memory that we will copy to the allocated portion of the system memory.
fork
The above is enough to understand how the call works
fork()
. The function itself
fork
is architecture dependent. Assembly language implements the transfer of registers as an argument to a call fork_body()
- a function that implements the logic of a system call. Although most readers are more familiar with the x86 instruction set, I’ll give an implementation for ARM, as it is much shorter and more understandable.
The registers are saved in the pt_regs structure:
typedef struct pt_regs {
int r[13]; /* Регистры общего назначения */
int lr; /* Адрес возврата */
int sp; /* Указатель стека */
int psr; /* Состояние программы, ARM-специфичный регистр */
} pt_regs_t;
The source code for the fork () function
/* Регистры будем сохранять на стеке, для этого выделяем 68 байт */
sub sp, sp, #68
/* Копируем 13 регистров общего назначения и регистр возврата */
stmia sp, {r0 - r12, lr}
/* Сохраняем SP */
str sp, [sp, #56]
/* Напрямую записывать CPSR в память не можем, поэтому считываем CPSR в r0*/
mrs r0, cpsr;
/* Сохраняем CPSR на стеке */
str r0, [sp, #60];
/* По соглашению о вызовах в r0 передаётся первый аргумент */
mov r0, sp
/* Переходим к архитектурно-независимой части вызова */
b fork_body
As you can see, in fact, in ptregs, not the correct SP value is stored, but shifted by 68 bytes (in which we put the structure
pt_regs_t
). We will take this into account when restoring registers.The architecture-independent part of the fork () call
void _NORETURN fork_body(struct pt_regs *ptregs) {
struct addr_space *adrspc;
struct addr_space *child_adrspc;
struct task *parent;
pid_t child_pid;
struct task *child;
assert(ptregs);
parent = task_self();
assert(parent);
child_pid = task_prepare("");
if (0 > child_pid) {
ptregs_retcode_err_jmp(ptregs, -1, child_pid);
panic("%s returning", __func__);
}
adrspc = fork_addr_space_get(parent);
if (!adrspc) {
adrspc = fork_addr_space_create(NULL);
fork_addr_space_set(parent, adrspc);
}
/* Save the stack of the current thread */
fork_stack_store(adrspc, thread_self());
child = task_table_get(child_pid);
child_adrspc = fork_addr_space_create(adrspc);
/* Can't use fork_addr_space_store() as we use
* different task as data source */
fork_stack_store(child_adrspc, child->tsk_main);
fork_heap_store(&child_adrspc->heap_space, task_self());
fork_static_store(&child_adrspc->static_space);
memcpy(&child_adrspc->pt_entry, ptregs, sizeof(*ptregs));
sched_lock();
{
child = task_table_get(child_pid);
task_start(child, fork_child_trampoline, NULL);
fork_addr_space_set(child, child_adrspc);
thread_stack_set(child->tsk_main, thread_stack_get(thread_self()));
thread_stack_set_size(child->tsk_main, thread_stack_get_size(thread_self()));
}
ptregs_retcode_jmp(ptregs, child_pid);
sched_unlock();
panic("%s returning", __func__);
}
A function call
ptregs_retcode_jmp()
will return to the parent process. In turn, the child process will use the same call when the process starts.static void *fork_child_trampoline(void *arg) {
struct addr_space *adrspc;
adrspc = fork_addr_space_get(task_self());
fork_stack_restore(adrspc, stack_ptr());
ptregs_retcode_jmp(&adrspc->pt_entry, 0);
panic("%s returning", __func__);
}
After the descendant calls execv (), there is no longer any need to support intersecting address spaces, and therefore, you don’t need to copy anything when switching contexts.
Health Check
Actually, for dash, this functionality was quite enough :)
For testing in Embox, you can run template x86 / qemu
git clone https://github.com/embox/embox.git
cd embox
make confload-x86/qemu
make
./scripts/qemu/auto_qemu
Then we call dash and inside it you can call other commands, for example, ping.

Most likely, by "poking" dash it will be possible to achieve some exception, do not hesitate to create an issue in our repository :)