Execution on bare hardware (base-hw)

The code specific to the base-hw platform is located within the repos/base-hw/ directory. In the following description, unless explicitly stated otherwise, all paths are relative to this directory.

In contrast to classical L4 microkernels where Genode's core process runs as a user-level roottask on top of the kernel, base-hw executes Genode's core directly on the hardware with no distinct kernel underneath. Core and the kernel are merged into one hybrid component. Although all threads of core are running in privileged processor mode, they call a kernel library to synchronize hardware interaction. However, most work is done outside of that library. This design has several benefits. First, the kernel part becomes much simpler. For example, there are no allocators needed within the kernel. Second, base-hw side-steps long-standing, difficult kernel-level problems, in particular the management of kernel resources. For the allocation of kernel objects, the hybrid core/kernel can employ Genode's user-level resource trading concepts as described in Section Resource trading. Finally and most importantly, merging the kernel with roottask removes a lot of redundancies between both programs. Traditionally, both kernel and roottask perform the bookkeeping of physical-resource allocations and the existence of kernel objects such as address spaces and threads. In base-hw, those data structures exist only once. The complexity of the combined kernel/core is significantly lower than the sum of the complexities of a traditional self-sufficient kernel and a distinct roottask on top. This way, base-hw helps to make Genode's TCB less complex.

The following subsections detail the problems that base-hw had to address to become a self-sufficient base platform for Genode.

Bootstrapping of base-hw

Early bootstrap

After the boot loader has loaded the kernel image into memory, it calls the kernel's entry point. At this stage, the MMU is still switched off and no CPU other than the primary boot CPU is initialized. The first job of the loaded kernel is the initialization of all CPUs and their transition from the use of physical memory to virtual memory. This one-time code path is called bootstrap. The corresponding code is located at src/bootstrap/. Besides enabling the MMU, this code performs system-global static hardware configurations such as setting up an ARM TrustZone policy. Once completed, bootstrap ELF-loads the actual core/kernel executable, which is designated to run entirely in virtual memory. After this stage is complete, bootstrap is no longer part of the picture.

Startup of the base-hw kernel

Core on base-hw uses Genode's regular linker script. Like any regular Genode component, its execution starts at the _start symbol. But unlike a regular component, core is started by the bootstrap component as a kernel running in privileged mode. Instead of directly following the startup procedure described in Section Startup code, base-hw uses custom startup code that initializes the kernel part of core first. For example, the startup code for the ARM architecture is located at src/core/spec/arm/crt0.s. It calls the kernel initialization code in src/core/kernel/main.cc. Core's regular C++ startup code (the _main function) is executed by the first thread created by the kernel (see the thread setup in the Core_main_thread::Core_main_thread() constructor).

Kernel entry and exit

The execution model of the kernel can be roughly characterized as a single-stack kernel. In contrast to traditional L4 kernels that maintain one kernel thread per user thread, the base-hw kernel is a mere state machine that never blocks in the kernel. State transitions are triggered by core or user-level threads that enter the kernel via a system call, by device interrupts, or by a CPU exception. Once entered, the kernel applies the state change depending on the event that caused the kernel entry, and leaves the kernel again. The transition between normal threads and kernel execution depends on the concrete architecture. For ARM, the corresponding code is located at src/core/spec/arm/exception_vector.s.
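The state-machine character of the kernel can be illustrated with a small model. The following sketch is not taken from the base-hw sources. The names Toy_kernel, Entry_reason, and Thread_state are hypothetical, and the choice of the next thread is a plain round-robin stand-in for the real scheduler. The point is merely that each kernel entry applies one state change and returns without ever waiting, which is why a single kernel stack suffices.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

/* hypothetical model of kernel entries, not the base-hw implementation */
enum class Entry_reason { SYSCALL_BLOCK, INTERRUPT_WAKEUP, EXCEPTION };
enum class Thread_state { READY, BLOCKED, FAULTED };

struct Toy_kernel
{
    std::vector<Thread_state> threads;
    std::size_t               current = 0;

    explicit Toy_kernel(std::size_t count)
    : threads(count, Thread_state::READY) { assert(count > 0); }

    /*
     * One pass through the kernel: apply the state change that belongs to
     * the entry reason, then return the index of the thread to resume.
     * 'target' is only used for INTERRUPT_WAKEUP.
     */
    std::size_t handle(Entry_reason reason, std::size_t target = 0)
    {
        switch (reason) {
        case Entry_reason::SYSCALL_BLOCK:
            threads[current] = Thread_state::BLOCKED;
            break;
        case Entry_reason::INTERRUPT_WAKEUP:
            if (threads.at(target) == Thread_state::BLOCKED)
                threads[target] = Thread_state::READY;
            break;
        case Entry_reason::EXCEPTION:
            threads[current] = Thread_state::FAULTED;
            break;
        }

        /* pick the next ready thread; if none is ready, keep 'current'
           (a real kernel would run an idle thread instead) */
        for (std::size_t i = 1; i <= threads.size(); i++) {
            std::size_t const candidate = (current + i) % threads.size();
            if (threads[candidate] == Thread_state::READY) {
                current = candidate;
                break;
            }
        }
        return current;
    }
};
```

Because handle() contains no blocking operation, every invocation runs to completion on the same stack regardless of which thread or interrupt caused the entry.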

Interrupt handling and preemptive multi-threading

In order to respond to interrupts, base-hw has to contain a driver for the interrupt controller. The interrupt-controller driver is named Board::Pic. The implementation depends on the board in use. For each board supported by the base-hw kernel, there exists a board.h file that ties together all board-specific definitions. For example, the file core/board/pbxa9/board.h defines the properties of the PBX-A9 platform. In this case, the header maps the Board::Pic type to Hw::Gicv2 as defined in hw/spec/arm/gicv2.h. Each of the Board::Pic drivers implements the same interface.

To support preemptive multi-threading, base-hw requires a hardware timer. The timer is programmed with the time-slice length of the currently executed thread. Once the programmed timeout elapses, the timer device generates an interrupt that is handled by the kernel. As with interrupt controllers, there exists a variety of timer devices for different hardware platforms. Therefore, base-hw features a variety of timer drivers. The selection of the timer driver for a given board follows the same pattern as the definition of the Board::Pic type. The board-specific board.h file defines a Board::Timer type that is mapped to one of the available drivers. For example, the pbxa9/board.h file includes spec/arm/cortex_a9_global_timer.h, which contains the definition of Board::Timer.
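The pattern of mapping generic Board:: names to concrete drivers can be sketched as follows. The driver bodies are stand-ins written for this illustration. In particular, the member functions raise, take_request, and start_one_shot as well as the type name Hw::Global_timer are assumptions and do not claim to mirror the real driver interfaces. Only the idea of the type aliases in the Board namespace is taken from the text above.

```cpp
#include <cassert>
#include <type_traits>

/* hypothetical stand-ins for two drivers, members invented for the sketch */
namespace Hw {

    struct Gicv2
    {
        unsigned pending = 0;
        bool     valid   = false;

        void raise(unsigned irq) { pending = irq; valid = true; }

        /* report and consume the pending interrupt, if any */
        bool take_request(unsigned &irq)
        {
            if (!valid) return false;
            irq   = pending;
            valid = false;
            return true;
        }
    };

    struct Global_timer
    {
        unsigned long timeout = 0;
        void start_one_shot(unsigned long ticks) { timeout = ticks; }
    };
}

/* the role of a board.h file: map generic names to the board's drivers */
namespace Board {
    using Pic   = Hw::Gicv2;
    using Timer = Hw::Global_timer;
}

static_assert(std::is_same<Board::Pic, Hw::Gicv2>::value,
              "board-independent code sees the driver as Board::Pic");

/* board-independent code refers to the Board:: names only */
unsigned long program_time_slice(Board::Timer &timer, unsigned long slice)
{
    assert(slice > 0);
    timer.start_one_shot(slice);
    return timer.timeout;
}

int next_interrupt(Board::Pic &pic)
{
    unsigned irq = 0;
    return pic.take_request(irq) ? static_cast<int>(irq) : -1;
}
```

Supporting another board then amounts to providing a different pair of aliases, while functions such as program_time_slice stay untouched.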

The in-kernel handler of the timer interrupt invokes the thread scheduler (src/core/kernel/cpu_scheduler.h). The scheduler maintains a list of so-called scheduling contexts where each context refers to a thread. Each time the kernel is entered, the scheduler is updated with the elapsed duration. When updated, it takes a scheduling decision by making the next to-be-executed thread the head of the list. At kernel exit, control is passed to the user-level thread that corresponds to the head of the scheduler list.
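The interplay of the update step and the list head can be modelled in a few lines. This is a simplified sketch that assumes plain round-robin time slices. The names Toy_list_scheduler and List_context are hypothetical, and the actual scheduler in cpu_scheduler.h is more elaborate (see Section Scheduler of the base-hw kernel).

```cpp
#include <cassert>
#include <list>

/* hypothetical scheduling context: refers to a thread by a plain ID */
struct List_context
{
    int      thread_id;
    unsigned slice;  /* full time slice */
    unsigned left;   /* remainder of the current slice */
};

class Toy_list_scheduler
{
    std::list<List_context> _contexts;

public:

    void insert(int thread_id, unsigned slice)
    {
        _contexts.push_back({thread_id, slice, slice});
    }

    /* called at each kernel entry with the time consumed by the head */
    void update(unsigned elapsed)
    {
        assert(!_contexts.empty());
        List_context &head = _contexts.front();

        if (elapsed < head.left) {
            head.left -= elapsed;   /* slice not used up, head stays */
            return;
        }

        /* slice used up: refill it and move the context to the tail,
           which makes its successor the new head */
        head.left = head.slice;
        _contexts.splice(_contexts.end(), _contexts, _contexts.begin());
    }

    /* thread that receives control at kernel exit */
    int head() const
    {
        assert(!_contexts.empty());
        return _contexts.front().thread_id;
    }
};
```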

Split kernel interface

The system-call interface of the base-hw kernel is split into two parts. One part is usable by all components and solely contains system calls for inter-component communication and thread synchronization. The definition of this interface is located at include/kernel/interface.h. The second part is exposed only to core. It supplements the public interface with operations for the creation, the management, and the destruction of kernel objects. The definition of the core-private interface is located at src/core/kernel/core_interface.h.

The distinction between both parts of the kernel interface is enforced by the function Thread::_call in src/core/kernel/thread.cc.
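A dispatcher that enforces such a split might look like the following sketch. It is not the code of Thread::_call. The enumeration values are derived from system-call names mentioned in this chapter, whereas Toy_thread, Call, Outcome, and the reaction to a rejected call are assumptions made for illustration.

```cpp
#include <cassert>

enum class Call { SEND_REQUEST_MSG, AWAIT_REQUEST_MSG, YIELD_THREAD,
                  NEW_PD, DELETE_PD, NEW_THREAD };

enum class Outcome { PUBLIC_CALL, CORE_CALL, DENIED };

class Toy_thread
{
    bool const _core;  /* true if the thread belongs to core */

    static bool _public(Call id)
    {
        return id == Call::SEND_REQUEST_MSG
            || id == Call::AWAIT_REQUEST_MSG
            || id == Call::YIELD_THREAD;
    }

    /* handlers of the core-private part, reachable by core threads only */
    Outcome _call_core_only(Call) const
    {
        assert(_core);
        return Outcome::CORE_CALL;
    }

public:

    explicit Toy_thread(bool core) : _core(core) { }

    Outcome call(Call id) const
    {
        /* the public part is available to every thread */
        if (_public(id))
            return Outcome::PUBLIC_CALL;

        /* everything else is rejected unless the caller is a core thread */
        if (!_core)
            return Outcome::DENIED;

        return _call_core_only(id);
    }
};
```

Keeping the check in one central function means that no individual core-private handler has to validate its caller.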

Public part of the kernel interface

Threads do not run independently but interact with each other via synchronous inter-component communication as detailed in Section Inter-component communication. Within base-hw, this mechanism is referred to as IPC (for inter-process communication). To allow threads to perform calls to other threads or to receive RPC requests, the kernel interface is equipped with system calls for performing IPC (send_request_msg, await_request_msg, send_reply_msg). To keep the kernel as simple as possible, IPC is performed using so-called user-level thread-control blocks (UTCB). Each thread has a corresponding memory page that is always mapped in the kernel. This UTCB page is used to carry IPC payload. The greatly simplified procedure of transferring a message is as follows. (In reality, the state space is more complex because the receiver may not be in a blocking state when the sender issues the message.)

1. The sender marshals its payload into its UTCB and invokes the kernel.
2. The kernel transfers the payload from the sender's UTCB to the receiver's UTCB and schedules the receiver.
3. The receiver retrieves the incoming message from its UTCB.

Because all UTCBs are always mapped in the kernel, no page faults can occur during the second step. This way, the flow of execution within the kernel becomes predictable and no kernel exception handling code is needed.
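The three steps of the message transfer can be mimicked by the following sketch. The Utcb structure, its capacity of 256 bytes, and the function names marshal, kernel_transfer, and unmarshal are hypothetical. Scheduling and the handling of a receiver that is not yet waiting are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

/* hypothetical UTCB layout, the real one differs */
struct Utcb
{
    static constexpr std::size_t CAPACITY = 256;

    std::size_t length = 0;
    char        payload[CAPACITY] = { };
};

/* step 1: the sender marshals its payload into its own UTCB */
bool marshal(Utcb &utcb, std::string const &msg)
{
    if (msg.size() > Utcb::CAPACITY)
        return false;

    std::memcpy(utcb.payload, msg.data(), msg.size());
    utcb.length = msg.size();
    return true;
}

/* step 2: the kernel copies between two UTCBs, both of which are mapped,
   so the copy cannot fault */
void kernel_transfer(Utcb const &from, Utcb &to)
{
    assert(from.length <= Utcb::CAPACITY);
    std::memcpy(to.payload, from.payload, from.length);
    to.length = from.length;
}

/* step 3: the receiver retrieves the message from its UTCB */
std::string unmarshal(Utcb const &utcb)
{
    return std::string(utcb.payload, utcb.length);
}
```

Note that the kernel step is a bounded memory copy between two buffers that are known to be present, which reflects the predictability argument made above.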

In addition to IPC, threads interact via the synchronization primitives provided by the Genode API. To implement these portions of the API, the kernel provides system calls for managing the execution control of threads (stop_thread, restart_thread, yield_thread).

To support asynchronous notifications as described in Section Asynchronous notifications, the kernel provides system calls for the submission and reception of signals (await_signal, cancel_next_await_signal, submit_signal, pending_signal, and ack_signal) as well as the life-time management of signal contexts (kill_signal_context). In contrast to other base platforms, Genode's signal API is directly supported by the kernel so that the propagation of signals does not require any interaction with core's PD service. However, the creation of signal contexts is arbitrated by the PD service. This way, the kernel objects needed for the signalling mechanism are accounted to the corresponding clients of the PD service.

The kernel provides an interface to make the kernel's scheduling timer available as time source to the user land. Using this interface, components can bind signal contexts to timeouts (timeout) and follow the progress of time (time and timeout_max_us).

Core-private part of the kernel interface

The core-private part of the kernel interface allows core to perform privileged operations. Note that even though the kernel and core provide different interfaces, both are executed in privileged CPU mode, share the same address space, and ultimately trust each other. The kernel is regarded as a mere support library of core that executes those functions that shall be synchronized between different CPU cores and core's threads. In particular, the kernel does not perform any allocation. Instead, the allocation of kernel objects is performed as an interplay of core and the kernel:

1. Core allocates physical memory from its physical-memory allocator. Most kernel-object allocations are performed in the context of one of core's services. Hence, those allocations can be properly accounted to a session quota (Section Resource trading). This way, kernel objects allocated on behalf of core's clients are "paid for" by those clients.
2. Core allocates virtual memory to make the allocated physical memory visible within core and the kernel.
3. Core invokes the kernel to construct the kernel object at the location specified by core. This kernel invocation is actually a system call that enters the kernel via the kernel-entry path.
4. The kernel initializes the kernel object at the virtual address specified by core and returns to core via the kernel-exit path.
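The division of labour between core and the kernel during object creation can be sketched with placement new. All names in this sketch (Session_quota, Kernel_pd, Toy_core, kernel_new_pd) are hypothetical. A fixed byte array stands in for core's physical-memory allocator and virtual-memory mapping, and the system-call boundary is reduced to an ordinary function call.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

/* quota of the session on whose behalf the object is created */
struct Session_quota
{
    std::size_t avail;

    bool withdraw(std::size_t amount)
    {
        if (amount > avail) return false;
        avail -= amount;
        return true;
    }
};

/* example kernel object */
struct Kernel_pd
{
    unsigned id;
    explicit Kernel_pd(unsigned id) : id(id) { }
};

/* "kernel" side: constructs the object in place and allocates nothing */
Kernel_pd *kernel_new_pd(void *dst, unsigned id)
{
    return new (dst) Kernel_pd(id);
}

/* "core" side: provides accounted memory, then invokes the kernel */
class Toy_core
{
    alignas(std::max_align_t) unsigned char _ram[1024];
    std::size_t _used = 0;

    void *_alloc(Session_quota &quota, std::size_t size)
    {
        std::size_t const align  = alignof(std::max_align_t);
        std::size_t const padded = (size + align - 1) / align * align;

        /* check memory first so that quota is only charged on success */
        if (padded > sizeof(_ram) - _used || !quota.withdraw(padded))
            return nullptr;

        void *ptr = _ram + _used;
        _used += padded;
        return ptr;
    }

public:

    Kernel_pd *create_pd(Session_quota &quota, unsigned id)
    {
        void *dst = _alloc(quota, sizeof(Kernel_pd));
        return dst ? kernel_new_pd(dst, id) : nullptr;
    }
};
```

Since the memory is charged to the session quota before the kernel is invoked, a client without sufficient quota cannot cause the creation of a kernel object.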

The core-private kernel interface consists of the following operations:

The creation and destruction of protection domains (new_pd, delete_pd), invoked by the PD service
The creation, manipulation, and destruction of threads (new_thread, start_thread, resume_thread, thread_quota, pause_thread, delete_thread, thread_pager, and _cancel_thread_blocking), used by the CPU service and the core-specific back end of the Genode::Thread API
The creation and destruction of signal receivers and signal contexts (new_signal_receiver, delete_signal_receiver, new_signal_context, and delete_signal_context), invoked by the PD service
The creation and destruction of kernel-protected object identities (new_obj, delete_obj)
The creation, manipulation, and destruction of interrupt kernel objects (new_irq, ack_irq, and delete_irq)
The mechanisms needed to transfer the flow of control between virtual machines and virtual-machine monitors (new_vm, delete_vm, run_vm, pause_vm)

Scheduler of the base-hw kernel

CPU scheduling in traditional L4 microkernels is based on static priorities. The scheduler always picks the runnable thread with the highest priority for execution. If multiple threads share one priority, the kernel schedules those threads in a round-robin fashion. While fast and easy to implement, this scheme has disadvantages: First, there is no way to prevent high-prioritized threads from starving lower-prioritized ones. Second, CPU time cannot be granted to threads and passed between them by means of quota. To cope with these problems without much loss of performance, base-hw employs a custom scheduler that deviates from the traditional approach.

The base-hw scheduler introduces the distinction between high-throughput-oriented scheduling contexts (called fills) and low-latency-oriented scheduling contexts (called claims). Typical examples of fills are the processing of a compiler job or the rendering computations of a sophisticated graphics program. They shall obtain as much CPU time as the system can spare but there is no demand for high responsiveness. In contrast, an example of the claim category is a typical GUI-software stack covering the control flow from user-input drivers through a chain of GUI components to the drivers of the graphical output. Another example is a user-level device driver that must quickly respond to sporadic interrupts but is otherwise untrusted. The low latency of such components is a key factor for usability and quality of service. Besides introducing the distinction between claim and fill scheduling contexts, base-hw introduces the notion of a so-called super period, which is a multiple of typical scheduling time slices, e.g., one second. The entire super period corresponds to 100% of the CPU time of one CPU. Portions of it can be assigned to scheduling contexts. A CPU quota thereby corresponds to a percentage of the super period.

At the beginning of a super period, each claim has its full amount of assigned CPU quota. The priority defines the absolute scheduling order within the super period among those claims that are active and have quota left. As long as there exist such claims, the scheduler stays in the claim mode and the quota of the scheduled claims decreases. At the end of a super period, the quota of all claims is replenished to the initial value. Every time the scheduler cannot find an active claim with CPU quota left, it switches to the fill mode. Fills are scheduled in a simple round-robin fashion with identical time slices. The progress of the super period does not affect the scheduling order and time slices of this mode. The concept of quota and priority that is implemented through the claim mode aligns nicely with Genode's way of hierarchical resource management: Through CPU sessions, each process becomes able to assign portions of its CPU time and subranges of its priority band to its children without knowing the global meaning of CPU time or priority.
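The alternation between claim mode and fill mode can be reproduced with a small model. The sketch below makes several simplifying assumptions that are not taken from the source: time is counted in abstract ticks, the super period lasts 100 ticks, fill slices last 10 ticks, all contexts are permanently runnable, every context takes part in the fill round once no claim has quota left, a chosen context always runs until its budget or slice is used up, and a slice that would cross the end of the super period is cut short. The names Toy_quota_scheduler and Quota_context are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr unsigned SUPER_PERIOD = 100;  /* hypothetical length in ticks */
constexpr unsigned FILL_SLICE   = 10;   /* hypothetical fill time slice */

struct Quota_context
{
    unsigned priority;  /* higher value is scheduled first in claim mode */
    unsigned quota;     /* claim budget per super period, 0 for pure fills */
    unsigned left;      /* claim budget remaining in this super period */
};

class Toy_quota_scheduler
{
    std::vector<Quota_context> _contexts;
    unsigned                   _elapsed   = 0;  /* position in super period */
    std::size_t                _next_fill = 0;  /* round-robin cursor */

    void _advance(unsigned used)
    {
        _elapsed += used;
        if (_elapsed < SUPER_PERIOD)
            return;

        /* end of the super period: replenish all claims */
        _elapsed = 0;
        for (Quota_context &c : _contexts)
            c.left = c.quota;
    }

public:

    std::size_t add(unsigned priority, unsigned quota)
    {
        assert(quota <= SUPER_PERIOD);
        _contexts.push_back({priority, quota, quota});
        return _contexts.size() - 1;
    }

    /* run the next context for one slice and return its index */
    std::size_t run_next()
    {
        assert(!_contexts.empty());
        unsigned const remaining = SUPER_PERIOD - _elapsed;

        /* claim mode: highest-priority context with quota left */
        Quota_context *claim = nullptr;
        for (Quota_context &c : _contexts)
            if (c.left > 0 && (!claim || c.priority > claim->priority))
                claim = &c;

        if (claim) {
            unsigned const used = std::min(claim->left, remaining);
            claim->left -= used;
            _advance(used);
            return static_cast<std::size_t>(claim - _contexts.data());
        }

        /* fill mode: round robin with identical slices */
        std::size_t const chosen = _next_fill;
        _next_fill = (_next_fill + 1) % _contexts.size();
        _advance(std::min(FILL_SLICE, remaining));
        return chosen;
    }
};
```

With a context of priority 2 and 30% quota, a context of priority 1 and 20% quota, and a third context without quota, the first half of each super period is spent in claim mode in priority order and the second half is shared round-robin among all three.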

Sparsely populated core address space

Even though core has the authority over all physical memory, it has no immediate access to the physical pages. Whenever core requires access to a physical memory page, it first has to explicitly map the physical page into its own virtual memory space. This way, the virtual address space of core stays free of any data of other components. Even in the presence of a bug in core (e.g., a dangling pointer), information cannot accidentally leak between different protection domains because the virtual memory of the other components is not necessarily visible to core.

Multi-processor support of base-hw

On uniprocessor systems, the base-hw kernel is single-threaded. Its execution model corresponds to a mere state machine. On SMP systems, it maintains one kernel thread and one scheduler per CPU core. Access to kernel objects is fully serialized by one global spin lock that is acquired when entering the kernel and released when leaving the kernel. This keeps the use of multiple cores transparent to the kernel model, which greatly simplifies the code compared to traditional L4 microkernels. Given that the kernel is a simple state machine providing lightweight non-blocking operations, there is little contention for the global kernel lock. Even though this claim may not hold up when scaling to a large number of cores, current platforms can be accommodated well.
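A global lock that brackets every pass through the kernel can be sketched with a standard atomic. The class Kernel_lock and the function kernel_pass are hypothetical names, and the real kernel uses its own lock implementation rather than the C++ standard library.

```cpp
#include <atomic>
#include <cassert>

/* minimal spin lock, a stand-in for the global kernel lock */
class Kernel_lock
{
    std::atomic<bool> _taken { false };

public:

    /* returns true if the lock was free and is now held by the caller */
    bool try_lock()
    {
        return !_taken.exchange(true, std::memory_order_acquire);
    }

    void lock()   { while (!try_lock()) { /* spin */ } }
    void unlock() { _taken.store(false, std::memory_order_release); }
};

/* one pass through the kernel on any CPU */
template <typename FN>
void kernel_pass(Kernel_lock &lock, FN const &fn)
{
    lock.lock();    /* kernel entry */
    fn();           /* short, non-blocking state change */
    lock.unlock();  /* kernel exit */
}
```

Because the function executed under the lock never blocks, the lock is held only for the brief duration of one state transition, which is the basis for the low-contention argument above.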

Cross-CPU inter-component communication

Thanks to the global kernel lock, synchronous and asynchronous inter-processor communication does not differ semantically from the uniprocessor case. The only difference is that on a multiprocessor system, one processor may change the schedule of another processor by unblocking one of its threads (e.g., when an RPC call is received by a server that resides on a different CPU than the client). This condition may rescind the current scheduling choice of the other processor. To avoid lags in this case, the kernel lets the unaware target processor trap into an inter-processor interrupt (IPI). The targeted processor can react to the IPI by taking the decision to schedule the receiving thread. As the IPI sender does not have to wait for an answer, the sending and receiving CPUs remain largely decoupled. There is no need for a complex IPI protocol between sender and receiver.

TLB shootdown

With respect to the synchronization of core-local hardware, there are two different situations to deal with. Some hardware components like most ARM caches and branch predictors implement their own coherence protocol and thus need adaptation in terms of configuration only. Others, like the TLBs, lack this feature. When, for instance, a page-table entry becomes invalid, the TLB invalidation of the affected entries must be performed locally by each core. To signal the necessity of TLB maintenance work, an IPI is sent to all other cores. Once all cores have completed the cleaning, the thread that invoked the TLB invalidation resumes its execution.
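The bookkeeping behind such a shootdown can be modelled sequentially. In this sketch, a set of page numbers stands in for each CPU's TLB and a flag stands in for a pending IPI. The names Toy_smp, Toy_cpu, start_shootdown, handle_ipi, and complete are hypothetical, and real code would additionally need the hardware-specific invalidation instructions.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

struct Toy_cpu
{
    std::set<unsigned long> tlb;          /* cached virtual page numbers */
    bool                    ipi_pending = false;
};

class Toy_smp
{
    std::vector<Toy_cpu> _cpus;
    std::size_t          _outstanding = 0;  /* CPUs that still have to act */
    unsigned long        _victim      = 0;  /* page to invalidate */

public:

    explicit Toy_smp(std::size_t count) : _cpus(count) { }

    void cache(std::size_t cpu, unsigned long page)
    {
        _cpus.at(cpu).tlb.insert(page);
    }

    bool cached(std::size_t cpu, unsigned long page) const
    {
        return _cpus.at(cpu).tlb.count(page) != 0;
    }

    /* initiating CPU: invalidate locally and signal all other CPUs */
    void start_shootdown(std::size_t initiator, unsigned long page)
    {
        assert(_outstanding == 0);
        _victim = page;
        _cpus.at(initiator).tlb.erase(page);

        for (std::size_t i = 0; i < _cpus.size(); i++) {
            if (i == initiator) continue;
            _cpus[i].ipi_pending = true;
            _outstanding++;
        }
    }

    /* IPI handler executed by each target CPU */
    void handle_ipi(std::size_t cpu)
    {
        Toy_cpu &c = _cpus.at(cpu);
        if (!c.ipi_pending) return;

        c.tlb.erase(_victim);
        c.ipi_pending = false;
        _outstanding--;
    }

    /* the invoking thread may resume once this returns true */
    bool complete() const { return _outstanding == 0; }
};
```

In contrast to the scheduling IPI of the previous section, the initiator here has to wait for all targets, because a stale TLB entry on any core would still grant access to the revoked mapping.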

Asynchronous notifications on base-hw

The base-hw platform improves the mechanism described in Section Asynchronous notification mechanism by introducing signal receivers and signal contexts as first-class kernel objects. Core's PD service is merely used to arbitrate the creation and destruction of those kernel objects but it does not play the role of a signal-delivery proxy. Instead, signals are communicated directly by using the public kernel operations await_signal, cancel_next_await_signal, submit_signal, and ack_signal.