CPU cores, or what SMP is and what it's all about

Introduction

Good day. Today I would like to touch on a fairly simple topic that few ordinary programmers know in detail, although each of you has most likely relied on it.
It is symmetric multiprocessing (SMP for short): an architecture supported by every mainstream multitasking operating system and an integral part of them. Everyone knows that the more cores a processor has, the more powerful it is. That is true, but how does the OS use several cores at the same time? Some programmers never descend to this level of abstraction because they simply don't need to, but I think everyone will be interested in how SMP works.

Multitasking and its implementation

Anyone who has studied computer architecture knows that a single processor core cannot perform several tasks at once; multitasking is provided by the OS, which switches between tasks. There are several types of multitasking, but the most practical, convenient and widely used is preemptive multitasking (its main aspects are described on Wikipedia). It is based on each process (task) having its own priority, which affects how much processor time it is allocated. Each task is given a time quantum during which it runs; when the quantum expires, the OS transfers control to another task. The question arises: how are computer resources such as memory and devices shared between processes? The OS kernel arbitrates access itself, typically with synchronization primitives such as semaphores; the details differ between Windows and Linux.
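
The quantum idea can be tried out on a host machine with a toy simulation. This is only a sketch under a loud assumption: a task's priority is used directly as its quantum length in ticks, which is one of many possible policies and not how any particular OS does it. The names `task` and `run_round_robin` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical task record for a toy simulation; not a real kernel structure. */
struct task {
    int priority;    /* assumption: quantum length in timer ticks */
    int remaining;   /* ticks of work the task still needs */
    int ticks_used;  /* CPU time the task has received so far */
};

/* Runs the tasks round-robin for at most `budget` ticks in total. On each
 * turn a task gets a quantum of `priority` ticks and is then preempted.
 * Returns the number of ticks actually consumed. */
static int run_round_robin(struct task *tasks, size_t count, int budget)
{
    int used = 0;
    int progressed = 1;

    assert(tasks != NULL || count == 0);
    while (used < budget && progressed) {
        progressed = 0;
        for (size_t i = 0; i < count && used < budget; ++i) {
            struct task *t = &tasks[i];
            int quantum = t->priority;

            if (quantum > t->remaining)
                quantum = t->remaining;      /* task finishes early */
            if (quantum > budget - used)
                quantum = budget - used;     /* simulation runs out of time */
            if (quantum <= 0)
                continue;                    /* nothing left to run */

            t->remaining -= quantum;
            t->ticks_used += quantum;
            used += quantum;
            progressed = 1;
        }
    }
    return used;
}
```

With two tasks of priority 3 and 1, the first one ends up with three times as much processor time as the second over any whole number of rounds.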

Interrupts and PICs

This may be news to some and not to others, but the i386 architecture uses interrupts to notify the OS or a program of a particular event. (I will talk about x86 only; I have not studied ARM and have never worked with it, even at the level of writing a service or resident program. We will also talk only about hardware interrupts, IRQs.) For example, there is interrupt 0x8 (in protected and long modes it might be 0x20 instead, depending on how the PIC is configured; more on this later), which is raised by the PIT, a timer that can generate interrupts at any required frequency. This reduces the OS's work in distributing time quanta to almost nothing: when the interrupt fires, the running program is suspended and control is given, for example, to the kernel, which can then switch to another task.

As you have probably understood, interrupts are functions (or procedures) that are called at any time by the hardware or by the program itself. In total, the classic PC supports 16 hardware interrupt lines on two PICs. The processor has a flags register, and one of its flags is IF, the Interrupt Flag. While this flag is 0, the processor does not respond to maskable hardware interrupts. Note, however, that there are also NMIs (Non-Maskable Interrupts), which are still delivered even when IF is 0. NMIs can be blocked by hardware outside the processor, and the processor itself holds off further NMIs while one is being handled, but only until the handler returns with IRET. Note also that an ordinary program cannot observe the interrupt being taken: its execution is simply suspended and resumes after the handler returns.

PIC - Programmable Interrupt Controller

From Wikipedia:

As a rule, it is an electronic device, sometimes built into the processor itself or into its supporting chips, whose inputs are electrically connected to the corresponding outputs of various devices. The input number of the interrupt controller is denoted “IRQ”. This number should be distinguished from the interrupt priority, as well as from the entry number in the interrupt vector table (INT). For example, on the IBM PC in real mode (the mode MS-DOS runs in), the interrupt from the standard keyboard uses IRQ 1 and INT 9.

The original IBM PC platform uses a very simple interrupt scheme. The interrupt controller is a simple counter that either cycles sequentially through the signals of the different devices, or is reset to the beginning whenever a new interrupt is found. In the first case the devices have equal priority; in the second, devices with a lower (or higher, when counting backwards) sequence number have higher priority.

As you can see, it is an electronic circuit that lets devices send interrupt requests; there are usually exactly two of them.
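
The difference between an IRQ number and an INT number can be shown with a small helper. It is a sketch that assumes the usual two-PIC arrangement, where each controller adds its own vector base to the line number; the real-mode bases 0x08 and 0x70 are the values a PC BIOS conventionally programs, and the function name is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Vector bases conventionally programmed by the PC BIOS for real mode. */
#define PIC_MASTER_BASE_REAL 0x08u  /* IRQ 0-7  -> INT 0x08-0x0F */
#define PIC_SLAVE_BASE_REAL  0x70u  /* IRQ 8-15 -> INT 0x70-0x77 */

/* Translates an IRQ line (0-15) into the interrupt vector the CPU sees,
 * given the vector bases the two PICs were programmed with. */
static uint8_t irq_to_vector(uint8_t irq, uint8_t master_base, uint8_t slave_base)
{
    assert(irq < 16);
    return (irq < 8) ? (uint8_t)(master_base + irq)
                     : (uint8_t)(slave_base + (irq - 8));
}
```

With the real-mode bases, the keyboard on IRQ 1 lands on INT 9, as in the quote above; after remapping the master PIC to 0x20, the timer on IRQ 0 lands on vector 0x20.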

Now, let's move on to the topic of the article itself.

SMP

To support SMP, motherboards gained new components: APIC and ACPI. Let's talk about the first.

APIC - Advanced Programmable Interrupt Controller, an improved version of the PIC. It is used in multiprocessor systems and is an integral part of all the latest Intel processors (and compatible). APIC is used for complex interrupt redirection and for sending interrupts between processors. These things were not possible using the older PIC specification.

Local APIC and IO APIC

In an APIC-based system, each processor consists of a “core” and a “local APIC”. The local APIC handles the processor-specific interrupt configuration. Among other things, it contains the local vector table (LVT), which translates events such as the “internal clock” and other “local” interrupt sources into an interrupt vector (for example, the LocalINT1 pin can raise an NMI exception by storing “2” in the corresponding LVT entry).

More information about local APIC can be found in the “System Programming Guide” for modern Intel processors.

In addition, there is the I/O APIC (for example, the Intel 82093AA), which is part of the chipset and provides multiprocessor interrupt management, including both static and dynamic symmetric distribution of interrupts across all processors. In systems with multiple I/O subsystems, each subsystem can have its own set of interrupts.

Each interrupt pin is individually programmable as either edge or level triggered. The interrupt vector and interrupt steering information can be specified per interrupt. An indirect register access scheme minimizes the memory space needed to reach the I/O APIC's internal registers. To give the system flexibility in assigning memory, the I/O APIC's two-register space is relocatable, but it defaults to 0xFEC00000.
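
The indirect access scheme can be sketched as follows. The two memory-mapped registers are the register select (IOREGSEL, offset 0x00) and the data window (IOWIN, offset 0x10) described in the 82093AA datasheet; the helper names are made up for illustration, and the code assumes `base` already points at a mapping of the I/O APIC (or, for experiments on a host, at a plain array).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IOAPIC_DEFAULT_BASE 0xFEC00000u
#define IOAPIC_IOREGSEL     0x00u  /* byte offset: selects an internal register */
#define IOAPIC_IOWIN        0x10u  /* byte offset: window onto the selected register */
#define IOAPIC_REDTBL_BASE  0x10u  /* first redirection table register */

/* Reads an internal register: write its index to IOREGSEL, then read IOWIN. */
static uint32_t ioapic_read(volatile uint32_t *base, uint8_t reg)
{
    assert(base != NULL);
    base[IOAPIC_IOREGSEL / 4] = reg;
    return base[IOAPIC_IOWIN / 4];
}

/* Writes an internal register: write its index to IOREGSEL, then write IOWIN. */
static void ioapic_write(volatile uint32_t *base, uint8_t reg, uint32_t value)
{
    assert(base != NULL);
    base[IOAPIC_IOREGSEL / 4] = reg;
    base[IOAPIC_IOWIN / 4] = value;
}

/* Each pin has a 64-bit redirection entry split over two 32-bit registers. */
static uint8_t ioapic_redtbl_low(uint8_t pin)
{
    return (uint8_t)(IOAPIC_REDTBL_BASE + 2u * pin);
}

static uint8_t ioapic_redtbl_high(uint8_t pin)
{
    return (uint8_t)(IOAPIC_REDTBL_BASE + 2u * pin + 1u);
}
```

In a kernel, `base` would be something like `(volatile uint32_t *)IOAPIC_DEFAULT_BASE` once that page is mapped uncached. Because the select and data accesses are two separate steps, they must not be interleaved between cores without a lock.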

Initialization of the “local” APIC

The local APIC is enabled at boot time and can be disabled by clearing bit 11 of the IA32_APIC_BASE MSR (this only works on processors with family > 5, since the Pentium does not have this MSR). However, Intel's software developer's manual states that once you disable the local APIC via IA32_APIC_BASE, you cannot enable it again until a full reset. The I/O APIC can also be configured to work in legacy mode so that it emulates an 8259 device.

The local APIC registers are memory-mapped into the physical page FEE00xxx (see table 8-1 of the Intel P4 SPG). This address is the same for every local APIC in the system, which means you can directly access only the registers of the local APIC of the core your code is currently running on. Note that there is an MSR that holds the actual APIC base (available only on processors with family > 5). The MADT contains the local APIC base, and on 64-bit systems it may also contain a field with a 64-bit override of the base address, which you should use instead. You can leave the local APIC base where you find it or move it wherever you like. Note: I do not think you can move it beyond the 4th GB of RAM.
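
To make the MSR layout concrete, here is a sketch that decodes a value read from IA32_APIC_BASE (MSR 0x1B). Reading the MSR itself needs the privileged `rdmsr` instruction, so the code only works on an already-read 64-bit value. The bit positions (bit 8 for the BSP flag, bit 11 for the global enable, base address from bit 12 up) follow Intel's manual as I understand it; the address mask assumes the reserved upper bits are zero, and all names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define IA32_APIC_BASE_MSR  0x1Bu
#define APIC_BASE_BSP       (1ull << 8)    /* set on the bootstrap processor */
#define APIC_BASE_ENABLE    (1ull << 11)   /* global local APIC enable */
#define APIC_BASE_ADDR_MASK 0x000FFFFFFFFFF000ull  /* assumption: bits 12-51 */

struct apic_base_info {
    uint64_t base;     /* physical base address of the local APIC registers */
    bool     enabled;  /* bit 11 */
    bool     is_bsp;   /* bit 8 */
};

/* Splits a raw IA32_APIC_BASE value into its fields. */
static struct apic_base_info decode_apic_base(uint64_t msr_value)
{
    struct apic_base_info info;

    info.base    = msr_value & APIC_BASE_ADDR_MASK;
    info.enabled = (msr_value & APIC_BASE_ENABLE) != 0;
    info.is_bsp  = (msr_value & APIC_BASE_BSP) != 0;
    return info;
}

/* Builds the value to write back when relocating the registers: the flag
 * bits are kept and only the page-aligned base address is replaced. */
static uint64_t encode_apic_base(uint64_t msr_value, uint64_t new_base)
{
    assert((new_base & ~APIC_BASE_ADDR_MASK) == 0);
    return (msr_value & ~APIC_BASE_ADDR_MASK) | new_base;
}
```

On a typical bootstrap processor the raw value is 0xFEE00900, which decodes to base 0xFEE00000 with both the enable and BSP flags set.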

To enable the local APIC to receive interrupts, you must configure the “Spurious Interrupt Vector Register”. The correct value for this register is the vector number you want to assign to spurious interrupts in the lower 8 bits, plus bit 8 set to 1 to actually enable the APIC (see the specification for details). You should choose a vector whose lower 4 bits are all set; the easiest choice is 0xFF. This matters on some older processors, where the lower 4 bits of this value must be 1.

Disable the 8259 PIC properly. This is almost as important as setting up the APIC. You do it in two steps: masking all interrupts and remapping the IRQs. Masking all interrupts disables them in the PIC. Remapping is what you probably already did when you used the PIC: you want interrupt requests to start at vector 32 instead of 0 to avoid conflicts with exceptions (in protected and long modes the first 32 vectors are reserved for exceptions). You should then avoid using those interrupt vectors for other purposes. This is necessary because, even though you have masked all PIC interrupts, the PIC can still deliver spurious interrupts, which would otherwise be mishandled as exceptions in your kernel.
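
The two steps can be written down as a list of port writes. The sketch below only builds that list, so it can be inspected on a host; in a kernel you would feed each pair to your own `outb` routine in order, usually with a short I/O delay between writes. The port numbers and initialization command words are the standard ones for the PC's two 8259 controllers; the struct and function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PIC1_CMD  0x20u
#define PIC1_DATA 0x21u
#define PIC2_CMD  0xA0u
#define PIC2_DATA 0xA1u

struct port_write {
    uint16_t port;
    uint8_t  value;
};

/* Fills `out` with the writes that remap both PICs to the given vector
 * bases and then mask every IRQ line. Returns the number of writes, or 0
 * if `cap` is too small. */
static size_t pic_disable_sequence(struct port_write *out, size_t cap,
                                   uint8_t master_base, uint8_t slave_base)
{
    /* The 8259 ignores the low 3 bits of the vector base. */
    assert((master_base & 7) == 0 && (slave_base & 7) == 0);

    const struct port_write seq[] = {
        { PIC1_CMD,  0x11 },         /* ICW1: start init, ICW4 follows */
        { PIC2_CMD,  0x11 },
        { PIC1_DATA, master_base },  /* ICW2: vector base for IRQ 0-7 */
        { PIC2_DATA, slave_base },   /* ICW2: vector base for IRQ 8-15 */
        { PIC1_DATA, 0x04 },         /* ICW3: slave attached to IRQ 2 */
        { PIC2_DATA, 0x02 },         /* ICW3: slave's cascade identity */
        { PIC1_DATA, 0x01 },         /* ICW4: 8086 mode */
        { PIC2_DATA, 0x01 },
        { PIC1_DATA, 0xFF },         /* mask all lines on the master */
        { PIC2_DATA, 0xFF },         /* mask all lines on the slave */
    };
    const size_t n = sizeof seq / sizeof seq[0];

    if (out == NULL || cap < n)
        return 0;
    for (size_t i = 0; i < n; ++i)
        out[i] = seq[i];
    return n;
}
```

Calling it with bases 0x20 and 0x28 places the sixteen legacy IRQs on vectors 32 to 47, just above the exception range, so any spurious interrupt from the masked PIC lands on a vector you can give a harmless handler.
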
Let's go to SMP.

Symmetric multiprocessing: initialization

The startup sequence differs between CPUs. The Intel Programmer's Guide (Section 7.5.4) contains an initialization protocol for Intel Xeon processors and does not cover older processors. For a generic “all processor types” algorithm, see the “Intel MultiProcessor Specification”.

For the 80486 (with an external 82489DX APIC), you must use an INIT IPI followed by an “INIT level de-assert” IPI, without any SIPI. This means you cannot tell the APs where to start executing your code (that is the job of the SIPI vector), so they always start executing BIOS code. In this case, set the BIOS CMOS reset value to “warm start with far jump” (i.e. set CMOS location 0x0F to 10) so that the BIOS performs a jmp far ~[0:0x0469], and then store the segment and offset of the AP entry point at 0x0469.

“INIT level de-assert” IPI is not supported on new processors (Pentium 4 and Intel Xeon), and AFAIK is completely ignored on these processors.

For newer processors (P6, Pentium 4) one SIPI is enough, but I am not sure whether older Intel processors (Pentium) or processors from other manufacturers need a second SIPI. It is also possible that the second SIPI exists to cover a delivery failure of the first one (bus noise, etc.).

I usually send the first SIPI and then wait to see whether the AP increments the count of running processors. If it does not increment this counter within a few milliseconds, I send the second SIPI. This differs from Intel's generic algorithm (which has a 200 microsecond delay between SIPIs), but finding a time source that can accurately measure a 200 microsecond delay during early boot is not so easy. I have also found that on real hardware, if the delay between SIPIs is too long (and you are not using my method), an AP can run the OS's early AP startup code twice (which in my case would make the OS think there are twice as many processors as there really are).

You can broadcast these signals on the bus to start every processor present. However, doing so may also start processors that were deliberately disabled (because they were “defective”).
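
The "second SIPI only if needed" approach could be sketched like this. Everything platform-specific is hidden behind three hypothetical hooks (`send_sipi`, `running_cpus`, `delay_ms`); a real kernel would back them with the local APIC, the shared counter incremented by the AP startup code, and whatever timer is available. The small fake AP at the end exists only so the logic can be exercised on a host.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical platform hooks; none of these names come from a real API. */
struct smp_ops {
    void     (*send_sipi)(void *ctx, uint32_t apic_id, uint8_t vector);
    uint32_t (*running_cpus)(void *ctx);
    void     (*delay_ms)(void *ctx, uint32_t ms);
};

/* Polls for up to `timeout_ms` until the running-CPU count exceeds `before`. */
static bool wait_for_ap(const struct smp_ops *ops, void *ctx,
                        uint32_t before, uint32_t timeout_ms)
{
    for (uint32_t waited = 0; waited < timeout_ms; ++waited) {
        if (ops->running_cpus(ctx) > before)
            return true;
        ops->delay_ms(ctx, 1);
    }
    return ops->running_cpus(ctx) > before;
}

/* Sends one SIPI and sends the second only if the AP did not check in. */
static bool start_ap(const struct smp_ops *ops, void *ctx,
                     uint32_t apic_id, uint8_t vector, uint32_t timeout_ms)
{
    assert(ops != NULL);
    uint32_t before = ops->running_cpus(ctx);

    ops->send_sipi(ctx, apic_id, vector);
    if (wait_for_ap(ops, ctx, before, timeout_ms))
        return true;

    ops->send_sipi(ctx, apic_id, vector);
    return wait_for_ap(ops, ctx, before, timeout_ms);
}

/* Simulated AP for host-side experiments: it comes up after the
 * `needed`-th SIPI it receives. */
struct fake_ap {
    uint32_t cpus;
    uint32_t sipis;
    uint32_t needed;
};

static void fake_send_sipi(void *ctx, uint32_t apic_id, uint8_t vector)
{
    struct fake_ap *ap = ctx;
    (void)apic_id;
    (void)vector;
    if (++ap->sipis == ap->needed)
        ap->cpus++;
}

static uint32_t fake_running_cpus(void *ctx)
{
    return ((struct fake_ap *)ctx)->cpus;
}

static void fake_delay_ms(void *ctx, uint32_t ms)
{
    (void)ctx;
    (void)ms;
}

static const struct smp_ops fake_ops = {
    fake_send_sipi, fake_running_cpus, fake_delay_ms
};
```

Because the second SIPI is sent only after the counter has failed to move, an AP that is already running never receives it, which avoids the double-startup problem described above.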

Finding information using the MP table

Some of the information needed for multiprocessing is provided in the MP tables (which may not be available on newer machines). First you need to find the MP floating pointer structure. It is aligned on a 16-byte boundary and begins with the signature "_MP_", or 0x5F504D5F. The OS should search the EBDA, the BIOS ROM space, and the last kilobyte of “base memory”; the size of base memory is given in kilobytes, minus 1 KB, by the 2-byte value at 0x413. Here is the structure:

struct mp_floating_pointer_structure {
    char signature[4];
    uint32_t configuration_table;
    uint8_t length; // In 16 bytes (e.g. 1 = 16 bytes, 2 = 32 bytes)
    uint8_t mp_specification_revision;
    uint8_t checksum; // This value should make all bytes in the table equal 0 when added together
    uint8_t default_configuration; // If this is not zero then configuration_table should be
                                   // ignored and a default configuration should be loaded instead
    uint32_t features; // If bit 7 is set then the IMCR is present and PIC mode is being used, otherwise
                       // virtual wire mode is; all other bits are reserved
};
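
The search described above might look like the following sketch. It scans one memory region on 16-byte boundaries, compares the signature and verifies the checksum using the length byte at offset 8 of the structure. It assumes `region` itself starts at a 16-byte aligned address and that the caller has already mapped the EBDA, BIOS ROM or base-memory range in question; the function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sums `len` bytes modulo 256; a valid table sums to 0. */
static uint8_t byte_sum(const uint8_t *p, size_t len)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < len; ++i)
        sum = (uint8_t)(sum + p[i]);
    return sum;
}

/* Scans [region, region + size) for a valid MP floating pointer structure.
 * Returns its byte offset within the region, or -1 if none is found. */
static long find_mp_floating_pointer(const uint8_t *region, size_t size)
{
    assert(region != NULL);
    for (size_t off = 0; off + 16 <= size; off += 16) {
        const uint8_t *p = region + off;

        if (memcmp(p, "_MP_", 4) != 0)
            continue;

        size_t len = (size_t)p[8] * 16;   /* length field, in 16-byte units */
        if (len == 0 || off + len > size)
            continue;
        if (byte_sum(p, len) == 0)
            return (long)off;
    }
    return -1;
}
```

Checking the checksum as well as the signature matters because the four signature bytes can occur by chance in ROM code or data.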

Here is the configuration table that the floating pointer structure points to:

struct mp_configuration_table {
    char signature[4]; // "PCMP"
    uint16_t length;
    uint8_t mp_specification_revision;
    uint8_t checksum; // Again, all bytes in the table should add up to 0
    char oem_id[8];
    char product_id[12];
    uint32_t oem_table;
    uint16_t oem_table_size;
    uint16_t entry_count; // This value represents how many entries are following this table
    uint32_t lapic_address; // This is the memory mapped address of the local APICs
    uint16_t extended_table_length;
    uint8_t extended_table_checksum;
    uint8_t reserved;
};

The configuration table is followed by entry_count entries, which contain more information about the system, and then by an extended table. Each entry is either 20 bytes long (a processor) or 8 bytes long (anything else). Here is what the processor and I/O APIC entries look like.

struct entry_processor {
    uint8_t type; // Always 0
    uint8_t local_apic_id;
    uint8_t local_apic_version;
    uint8_t flags; // If bit 0 is clear then the processor must be ignored
                   // If bit 1 is set then the processor is the bootstrap processor
    uint32_t signature;
    uint32_t feature_flags;
    uint64_t reserved;
};

Here is the IO APIC entry.

struct entry_io_apic {
    uint8_t type; // Always 2
    uint8_t id;
    uint8_t version;
    uint8_t flags; // If bit 0 is clear then the entry should be ignored
    uint32_t address; // The memory mapped address of the IO APIC
};
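
Since the entries have two different sizes, walking them means looking at the type byte of each one to decide how far to step. A minimal sketch, assuming `entries` points at the first byte after the configuration table header and that the caller knows how many bytes are available there (the function name is made up for illustration):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MP_ENTRY_PROCESSOR  0
#define MP_CPU_FLAG_ENABLED 0x01  /* bit 0 of the processor entry flags */

/* Walks `entry_count` variable-sized entries and counts the processor
 * entries whose "enabled" flag is set. Stops early if an entry would
 * run past `bytes_available`. */
static unsigned count_usable_processors(const uint8_t *entries,
                                        uint16_t entry_count,
                                        size_t bytes_available)
{
    unsigned cpus = 0;
    size_t off = 0;

    assert(entries != NULL);
    for (uint16_t i = 0; i < entry_count; ++i) {
        if (off >= bytes_available)
            break;

        uint8_t type = entries[off];
        size_t size = (type == MP_ENTRY_PROCESSOR) ? 20 : 8;
        if (off + size > bytes_available)
            break;

        /* The flags byte sits at offset 3 of a processor entry. */
        if (type == MP_ENTRY_PROCESSOR &&
            (entries[off + 3] & MP_CPU_FLAG_ENABLED))
            ++cpus;

        off += size;
    }
    return cpus;
}
```

The same loop can be extended to record each processor's local APIC ID (offset 1) or the I/O APIC address instead of just counting.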

Finding information using ACPI

You can find the MADT table (signature "APIC") in ACPI. It lists the local APICs, whose number should match the number of cores in your processor. The details of this table are not covered here, but you can find them on the Internet.
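
For orientation only, here is a sketch of collecting local APIC IDs from a MADT that has already been located. The layout used is my reading of the ACPI specification: a 36-byte table header with the total length at offset 4, then 8 bytes of MADT fields, then variable-length records that each start with a type byte and a length byte; record type 0 describes a processor local APIC, with the APIC ID at offset 3 and an enabled bit in the flags at offset 4. Check these offsets against the specification before relying on them. The function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MADT_RECORDS_OFFSET   44  /* 36-byte header + 8 bytes of MADT fields */
#define MADT_TYPE_LOCAL_APIC  0
#define MADT_LAPIC_ENABLED    0x1u

/* Reads a little-endian 32-bit value without assuming alignment. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Stores the APIC IDs of all enabled local APIC records into `ids`
 * (at most `max_ids`) and returns how many were stored. */
static size_t madt_collect_apic_ids(const uint8_t *madt, uint8_t *ids, size_t max_ids)
{
    assert(madt != NULL && ids != NULL);
    if (memcmp(madt, "APIC", 4) != 0)
        return 0;

    uint32_t total = read_le32(madt + 4);
    size_t off = MADT_RECORDS_OFFSET;
    size_t n = 0;

    while (off + 2 <= total) {
        uint8_t type = madt[off];
        uint8_t len = madt[off + 1];

        if (len < 2 || off + len > total)
            break;                      /* malformed record, stop walking */
        if (type == MADT_TYPE_LOCAL_APIC && len >= 8 &&
            (read_le32(madt + off + 4) & MADT_LAPIC_ENABLED) && n < max_ids)
            ids[n++] = madt[off + 3];
        off += len;
    }
    return n;
}
```

The resulting list plays the role of an APIC ID array such as `g_acpiCpuIds` in the startup code further down.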

Starting the APs

Once you have gathered this information, you need to disable the PIC and set up the I/O APIC. You also need to configure the BSP's local APIC. Then start the APs using SIPIs.

Core startup code:

Note that the vector you specify in the startup IPI determines the starting address: vector 0x8 means address 0x8000, vector 0x9 means address 0x9000, and so on.

// ------------------------------------------------------------------------------------------------
static u32 LocalApicIn(uint reg)
{
    return MmioRead32(*g_localApicAddr + reg);
}
// ------------------------------------------------------------------------------------------------
static void LocalApicOut(uint reg, u32 data)
{
    MmioWrite32(*g_localApicAddr + reg, data);
}
// ------------------------------------------------------------------------------------------------
void LocalApicInit()
{
    // Clear task priority to enable all interrupts
    LocalApicOut(LAPIC_TPR, 0);
    // Logical Destination Mode
    LocalApicOut(LAPIC_DFR, 0xffffffff);   // Flat mode
    LocalApicOut(LAPIC_LDR, 0x01000000);   // All cpus use logical id 1
    // Configure Spurious Interrupt Vector Register
    LocalApicOut(LAPIC_SVR, 0x100 | 0xff);
}
// ------------------------------------------------------------------------------------------------
uint LocalApicGetId()
{
    return LocalApicIn(LAPIC_ID) >> 24;
}
// ------------------------------------------------------------------------------------------------
void LocalApicSendInit(uint apic_id)
{
    LocalApicOut(LAPIC_ICRHI, apic_id << ICR_DESTINATION_SHIFT);
    LocalApicOut(LAPIC_ICRLO, ICR_INIT | ICR_PHYSICAL
        | ICR_ASSERT | ICR_EDGE | ICR_NO_SHORTHAND);
    while (LocalApicIn(LAPIC_ICRLO) & ICR_SEND_PENDING)
        ;
}
// ------------------------------------------------------------------------------------------------
void LocalApicSendStartup(uint apic_id, uint vector)
{
    LocalApicOut(LAPIC_ICRHI, apic_id << ICR_DESTINATION_SHIFT);
    LocalApicOut(LAPIC_ICRLO, vector | ICR_STARTUP
        | ICR_PHYSICAL | ICR_ASSERT | ICR_EDGE | ICR_NO_SHORTHAND);
    while (LocalApicIn(LAPIC_ICRLO) & ICR_SEND_PENDING)
        ;
}
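// ------------------------------------------------------------------------------------------------
void SmpInit()
{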
	kprintf("Waking up all CPUs\n");
	*g_activeCpuCount = 1;
	uint localId = LocalApicGetId();
	// Send Init to all cpus except self
	for (uint i = 0; i < g_acpiCpuCount; ++i)
	{
		uint apicId = g_acpiCpuIds[i];
		if (apicId != localId)
		{
			LocalApicSendInit(apicId);
		}
	}
	// wait
	PitWait(200);
	// Send Startup to all cpus except self
	for (uint i = 0; i < g_acpiCpuCount; ++i)
	{
		uint apicId = g_acpiCpuIds[i];
		if (apicId != localId)
			LocalApicSendStartup(apicId, 0x8);
	}
	// Wait for all cpus to be active
	PitWait(10);
	while (*g_activeCpuCount != g_acpiCpuCount)
	{
		kprintf("Waiting... %d\n", *g_activeCpuCount);
		PitWait(10);
	}
	kprintf("All CPUs activated\n");
}

[org 0x8000]
	AP:
		jmp short bsp ; If this is the first core, jump to bsp
		xor ax,ax
		mov ss,ax
		mov sp, 0x7c00
		xor ax,ax
		mov ds,ax
		; Mark CPU as active
		lock inc byte [ds:g_activeCpuCount]
		; Switch to protected mode, set up the stack
		jmp zop
	bsp:
		xor ax,ax
		mov ds,ax
		mov dword [ds:g_activeCpuCount],0
		mov word [ds:0x8000], 0x9090 ; Replace the JMP at AP with two NOPs
		; Switch to protected mode, set up the stack

Now, as you can see, for the OS to use many cores it has to set up each core individually: its stack, its interrupts, and so on. The most important point, though, is that with symmetric multiprocessing all cores share the same resources (one memory, one PCI bus, etc.), and the OS only has to distribute tasks between the cores.
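
As one small example of per-core setup, each core needs a stack of its own. A simple way to hand them out, sketched here under the assumption that the kernel has reserved one contiguous pool and that cores are numbered from 0, is to carve the pool into equal slices (the function name is made up for illustration):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the initial stack pointer for core `cpu_index`. Stacks grow
 * downwards on x86, so each core starts at the top of its own slice. */
static uintptr_t ap_stack_top(uintptr_t pool_base, size_t stack_size,
                              unsigned cpu_index)
{
    assert(stack_size > 0);
    return pool_base + (uintptr_t)stack_size * (cpu_index + 1u);
}
```

With a pool at 0x100000 and 16 KB stacks, core 0 starts at 0x104000 and core 1 at 0x108000, so the two never overlap.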

I hope the article was not too tedious and was reasonably informative. Next time, I think we can talk about how people used to draw on the screen (and still do) without shaders and fancy video cards.

Good luck!
