How stack trace works on ARM
Good day! A few days ago I ran into a small problem in our project: in an interrupt handler, gdb printed an incorrect stack trace for Cortex-M. So it seemed worth finding out once more: in what ways can we get a stack trace on ARM? Which compilation flags affect the ability to trace the stack on ARM? How is this implemented in the Linux kernel? Based on what I found, I decided to write this article.
Let us examine the two main stack trace methods in the Linux kernel.
Stack unwind through frames
Let's start with a simple approach, which can be found in the Linux kernel but is currently deprecated in GCC.
Imagine that a program is running with its stack in RAM, and at some point we interrupt it and want to print the call stack. Suppose we have a pointer to the instruction currently being executed by the processor (PC), as well as the current pointer to the top of the stack (SP). Now, in order to “jump” up the stack to the previous function, we need to know which function that was and where in it we should return. ARM uses the Link Register (LR) for this purpose:
The Link Register (LR) is register R14. It stores the return information for subroutines, function calls, and exceptions. On reset, the processor sets the value to 0xFFFFFFFF.
Next, we need to go up the stack and load the new values of the LR register from the stack. The structure of the stack frame for the compiler is as follows:
/* The stack backtrace structure is as follows:
  fp points to here:  |  save code pointer  |      [fp]
                      |  return link value  |      [fp, #-4]
                      |  return sp value    |      [fp, #-8]
                      |  return fp value    |      [fp, #-12]
                     [|  saved r10 value    |]
                     [|  saved r9 value     |]
                     [|  saved r8 value     |]
                      ...
                     [|  saved r0 value     |]
  r0-r3 are not normally saved in a C function.  */
This description is taken from the GCC header file gcc/gcc/config/arm/arm.h.
That is, the compiler (in our case, GCC) can be told that we want to be able to trace the stack. Then, in the prologue of each function, the compiler will prepare an auxiliary structure. You can see that this structure contains the “next” value of the LR register that we need and, most importantly, the address of the next frame:
| return fp value | [fp, #-12]
This compiler mode is enabled with the -mapcs-frame option. The option's description also mentions leaf functions, for which the frame can be omitted (according to the GCC manual, by additionally specifying -fomit-frame-pointer). Leaf functions are those that do not call any other functions, so they can be made slightly lighter.
You may also ask what to do with assembly functions in this case. Nothing tricky, really: you need to insert special macros. From the tools/objtool/Documentation/stack-validation.txt file in the Linux kernel:
Each callable function must be annotated as such with the ELF function type. In asm code, this is typically done using the ENTRY/ENDPROC macros.
But the same document notes that this is also an obvious disadvantage of this approach. The objtool utility checks whether all functions in the kernel are written in the correct format for stack tracing.
Below is the stack unwinding function from the Linux kernel:
#if defined(CONFIG_FRAME_POINTER) && !defined(CONFIG_ARM_UNWIND)
int notrace unwind_frame(struct stackframe *frame)
{
    unsigned long high, low;
    unsigned long fp = frame->fp;

    /* Some checks go here; we omit them */

    /* restore the registers from the stack frame */
    frame->fp = *(unsigned long *)(fp - 12);
    frame->sp = *(unsigned long *)(fp - 8);
    frame->pc = *(unsigned long *)(fp - 4);

    return 0;
}
#endif
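Calling such a function in a loop walks the whole chain of frames. Here is a minimal sketch of that loop which can be run on a host machine. It is my illustration, not kernel code: the stack is simulated by an array of 32-bit words, and the names sim_stack, read_word and apcs_backtrace are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simulated stack memory (an assumption made for host-side testing). */
struct sim_stack {
    const uint32_t *words; /* stack contents, lowest address first */
    uint32_t base;         /* address that words[0] pretends to live at */
    size_t count;          /* number of words */
};

/* Read one aligned word from the simulated stack; returns 0 on success. */
static int read_word(const struct sim_stack *s, uint32_t addr, uint32_t *out)
{
    if (addr < s->base || (addr & 3u) != 0)
        return -1;
    size_t idx = (addr - s->base) / 4;
    if (idx >= s->count)
        return -1;
    *out = s->words[idx];
    return 0;
}

/* Follow the APCS frame chain starting at fp and collect the return
 * addresses stored at [fp, #-4]. Returns how many were collected. */
size_t apcs_backtrace(const struct sim_stack *s, uint32_t fp,
                      uint32_t *ret_addrs, size_t max)
{
    assert(s != NULL && ret_addrs != NULL);
    size_t n = 0;
    while (n < max && fp != 0) {
        uint32_t lr, next_fp;
        if (read_word(s, fp - 4, &lr) != 0 ||
            read_word(s, fp - 12, &next_fp) != 0)
            break;              /* frame lies outside the stack */
        ret_addrs[n++] = lr;
        if (next_fp != 0 && next_fp <= fp)
            break;              /* the chain must move toward higher addresses */
        fp = next_fp;
    }
    return n;
}
```

The check that each next frame lies at a higher address is a simple guard against loops in a corrupted chain; a frame pointer of 0 is treated here as the end of the chain.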
In the kernel function above, I want to point out the line containing defined(CONFIG_ARM_UNWIND). It hints that the Linux kernel also has another implementation of unwind_frame, which we will talk about a little later.
The -mapcs-frame option is only valid for the ARM instruction set. But ARM microcontrollers are known to have another instruction set, Thumb (Thumb-1 and Thumb-2, to be exact), which is used mainly in the Cortex-M series. To enable frame generation in Thumb mode, use the -mtpcs-frame and -mtpcs-leaf-frame flags. In essence, they are the analog of -mapcs-frame. Interestingly, these options currently work only for Cortex-M0/M1.
For some time I could not figure out why I was unable to build the desired image for Cortex-M3/M4/.... After re-reading all the GCC options for ARM and searching the Internet, I realized that this was probably a bug, so I went straight into the source code of the arm-none-eabi-gcc compiler. After examining how the compiler generates frames for ARM, Thumb-1 and Thumb-2, I came to the conclusion that Thumb-2 had been left out, that is, at the moment frames are generated only for Thumb-1 and ARM. After I filed a bug report, the GCC developers explained that the standard for ARM has already changed several times and that these flags are badly outdated, but for some reason they all still exist in the compiler.
Below is the disassembly of a function for which a frame has been generated:
static int my_func(int a) {
    my_func2(7);
    return 0;
}
00008134 <my_func>:
    8134: b084    sub sp, #16
    8136: b580    push {r7, lr}
    8138: aa06    add r2, sp, #24
    813a: 9203    str r2, [sp, #12]
    813c: 467a    mov r2, pc
    813e: 9205    str r2, [sp, #20]
    8140: 465a    mov r2, fp
    8142: 9202    str r2, [sp, #8]
    8144: 4672    mov r2, lr
    8146: 9204    str r2, [sp, #16]
    8148: aa05    add r2, sp, #20
    814a: 4693    mov fp, r2
    814c: b082    sub sp, #8
    814e: af00    add r7, sp, #0
For comparison, here is the disassembly of the same function for the ARM instruction set:
000081f8 <my_func>:
    81f8: e1a0c00d    mov ip, sp
    81fc: e92dd800    push {fp, ip, lr, pc}
    8200: e24cb004    sub fp, ip, #4
    8204: e24dd008    sub sp, sp, #8
At first glance it may seem that these are completely different things. But in fact the frames are absolutely identical: the point is that in Thumb mode the push instruction can only store the low registers (r0 - r7) and the lr register on the stack. All other registers have to be stored in two steps, via the mov and str instructions, as in the example above.
Stack unwind through exceptions
An alternative approach is to unwind the stack based on the “Exception Handling ABI for the ARM Architecture” standard (EHABI). The main use of this standard is, in fact, exception handling in languages like C++. But the information that the compiler prepares for handling exceptions can also be used to trace the stack. This mode is enabled by the GCC option -fexceptions (or -funwind-tables).
Let's take a closer look at how this is done. To begin with, this document (EHABI) requires the compiler to generate the auxiliary tables .ARM.exidx and .ARM.extab. Here is how the .ARM.exidx section is defined in the Linux kernel source, in the file arch/arm/kernel/vmlinux.lds.h:
/* Stack unwinding tables */
#define ARM_UNWIND_SECTIONS             \
    . = ALIGN(8);                       \
    .ARM.unwind_idx : {                 \
        __start_unwind_idx = .;         \
        *(.ARM.exidx*)                  \
        __stop_unwind_idx = .;          \
    }                                   \
The standard “Exception Handling ABI for the ARM Architecture” defines each element of the table .ARM.exidx as the following structure:
struct unwind_idx {
    unsigned long addr_offset;
    unsigned long insn;
};
The first element is an offset to the start of the function that the entry describes. The second element is either the unwinding instructions themselves, packed into the word, or the address of an entry in the table of instructions (.ARM.extab) that must be interpreted in a special way in order to unwind the stack further. In other words, each entry of that table is just a sequence of words and halfwords that encodes a sequence of instructions. The first word determines how many words of instructions have to be executed to unwind the stack to the next frame.
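Offsets in these tables are stored in the form that EHABI calls prel31: a 31-bit signed offset relative to the address of the word that holds it (in the kernel code this is handled by prel31_to_addr). A minimal sketch of the decoding, assuming a 32-bit address space; the function name prel31_decode is mine and does not come from the kernel:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: convert a prel31 word located at address `place`
 * into the absolute address it refers to. */
uint32_t prel31_decode(uint32_t place, uint32_t word)
{
    /* Bit 31 must be clear for the word to be a prel31 offset. */
    assert((word & 0x80000000u) == 0);

    uint32_t offset = word & 0x7fffffffu;
    if (offset & 0x40000000u)       /* bit 30 is the sign bit */
        offset |= 0x80000000u;      /* sign-extend to 32 bits */

    /* Unsigned addition wraps modulo 2^32, which gives the right
     * result for negative offsets as well. */
    return place + offset;
}
```

For example, a word 0x7ffffffc stored at address 0x1000 encodes the offset -4 and therefore refers to address 0x0ffc.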
The unwinding instructions themselves are described in the already mentioned EHABI standard.
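To make the encoding more concrete, here is a minimal sketch of an interpreter for a handful of these instructions: adding to and subtracting from vsp, popping registers r4-r15 under a mask, vsp = r[n], and finish. It reflects my reading of the EHABI opcode table and is meant to run on a host rather than in a kernel: the stack is simulated by an array, and the names vrs_ctx, load_word and ehabi_exec are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { R_FP = 11, R_SP = 13, R_LR = 14, R_PC = 15 };

struct vrs_ctx {
    uint32_t r[16];         /* virtual registers; r[R_SP] plays the role of vsp */
    const uint32_t *stack;  /* simulated stack contents (host-side assumption) */
    uint32_t stack_base;    /* address that stack[0] pretends to live at */
    size_t stack_words;     /* number of words in the simulated stack */
};

/* Read one aligned word from the simulated stack; returns 0 on success. */
static int load_word(const struct vrs_ctx *c, uint32_t addr, uint32_t *out)
{
    if (addr < c->stack_base || (addr & 3u) != 0)
        return -1;
    size_t idx = (addr - c->stack_base) / 4;
    if (idx >= c->stack_words)
        return -1;
    *out = c->stack[idx];
    return 0;
}

/* Execute unwinding bytes until "finish". Returns 0 on success and -1 on
 * an unsupported opcode, a bad stack access, or a missing "finish". */
int ehabi_exec(struct vrs_ctx *c, const uint8_t *insn, size_t len)
{
    assert(c != NULL && insn != NULL);
    for (size_t i = 0; i < len; ) {
        uint8_t op = insn[i++];
        if ((op & 0xc0) == 0x00) {          /* 00xxxxxx: vsp += (x << 2) + 4 */
            c->r[R_SP] += ((uint32_t)(op & 0x3f) << 2) + 4;
        } else if ((op & 0xc0) == 0x40) {   /* 01xxxxxx: vsp -= (x << 2) + 4 */
            c->r[R_SP] -= ((uint32_t)(op & 0x3f) << 2) + 4;
        } else if ((op & 0xf0) == 0x80) {   /* 1000iiii iiiiiiii: pop r4-r15 */
            if (i >= len)
                return -1;
            uint32_t mask = ((uint32_t)(op & 0x0f) << 8) | insn[i++];
            uint32_t vsp = c->r[R_SP];
            if (mask == 0)
                return -1;                  /* 0x80 0x00: refuse to unwind */
            for (int reg = 4; reg <= 15; reg++) {
                if (!(mask & (1u << (reg - 4))))
                    continue;
                if (load_word(c, vsp, &c->r[reg]) != 0)
                    return -1;
                vsp += 4;
            }
            if (!(mask & (1u << (R_SP - 4))))
                c->r[R_SP] = vsp;           /* sp itself was not popped */
        } else if ((op & 0xf0) == 0x90) {   /* 1001nnnn: vsp = r[nnnn] */
            int n = op & 0x0f;
            if (n == R_SP || n == R_PC)
                return -1;                  /* reserved encodings */
            c->r[R_SP] = c->r[n];
        } else if (op == 0xb0) {            /* finish */
            if (c->r[R_PC] == 0)
                c->r[R_PC] = c->r[R_LR];
            return 0;
        } else {
            return -1;                      /* not handled in this sketch */
        }
    }
    return -1;
}
```

Feeding it the bytes 0x9b 0x40 0x84 0x80 0xb0, with r11 pointing at the slot that holds the saved LR, restores r11 and r14 from the simulated stack and sets PC to the restored LR. The same byte sequence appears in the kernel_start example further on.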
In Linux, the interpreter for these instructions is implemented mainly in the file arch/arm/kernel/unwind.c.
Implementation of the unwind_frame function:
int unwind_frame(struct stackframe *frame)
{
    unsigned long low;
    const struct unwind_idx *idx;
    struct unwind_ctrl_block ctrl;

    /* Some checks go here; we skip them
     * (in the kernel they also set low and ctrl.sp_high) */

    /* Find the descriptor in the ARM.exidx section by binary search,
     * using the current PC */
    idx = unwind_find_idx(frame->pc);
    if (!idx) {
        pr_warn("unwind: Index not found %08lx\n", frame->pc);
        return -URC_FAILURE;
    }

    ctrl.vrs[FP] = frame->fp;
    ctrl.vrs[SP] = frame->sp;
    ctrl.vrs[LR] = frame->lr;
    ctrl.vrs[PC] = 0;

    if (idx->insn == 1)
        /* can't unwind */
        return -URC_FAILURE;
    else if ((idx->insn & 0x80000000) == 0)
        /* prel31 to the unwind table */
        ctrl.insn = (unsigned long *)prel31_to_addr(&idx->insn);
    else if ((idx->insn & 0xff000000) == 0x80000000)
        /* only personality routine 0 supported in the index */
        ctrl.insn = &idx->insn;
    else {
        pr_warn("unwind: Unsupported personality routine %08lx in the index at %p\n",
            idx->insn, idx);
        return -URC_FAILURE;
    }

    /* This is where we analyze the table to find the number of
     * instructions that must be executed to unwind the stack */
    /* check the personality routine */
    if ((*ctrl.insn & 0xff000000) == 0x80000000) {
        ctrl.byte = 2;
        ctrl.entries = 1;
    } else if ((*ctrl.insn & 0xff000000) == 0x81000000) {
        ctrl.byte = 1;
        ctrl.entries = 1 + ((*ctrl.insn & 0x00ff0000) >> 16);
    } else {
        pr_warn("unwind: Unsupported personality routine %08lx at %p\n",
            *ctrl.insn, ctrl.insn);
        return -URC_FAILURE;
    }

    ctrl.check_each_pop = 0;

    /* Finally, interpret the instructions one by one */
    while (ctrl.entries > 0) {
        int urc;
        if ((ctrl.sp_high - ctrl.vrs[SP]) < sizeof(ctrl.vrs))
            ctrl.check_each_pop = 1;
        urc = unwind_exec_insn(&ctrl);
        if (urc < 0)
            return urc;
        if (ctrl.vrs[SP] < low || ctrl.vrs[SP] >= ctrl.sp_high)
            return -URC_FAILURE;
    }

    /* Some more checks */

    /* Finally, update the values for the next frame up the stack */
    frame->fp = ctrl.vrs[FP];
    frame->sp = ctrl.vrs[SP];
    frame->lr = ctrl.vrs[LR];
    frame->pc = ctrl.vrs[PC];

    return URC_OK;
}
This implementation of the unwind_frame function is used when the CONFIG_ARM_UNWIND option is enabled. I added explanatory comments directly to the source text.
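The lookup performed by unwind_find_idx can be pictured as a binary search over a table sorted by function start address. The sketch below is not the kernel's code: it assumes the start addresses have already been converted to absolute values, and the names idx_entry and find_idx are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct idx_entry {
    uint32_t start; /* absolute start address of the function (assumption) */
    uint32_t insn;  /* second word of the index entry, not interpreted here */
};

/* Return the last entry whose start address is <= pc, or NULL if pc
 * lies below the first entry. The table must be sorted by start. */
const struct idx_entry *find_idx(const struct idx_entry *tab, size_t n,
                                 uint32_t pc)
{
    assert(tab != NULL || n == 0);

    const struct idx_entry *found = NULL;
    size_t lo = 0, hi = n;          /* search the half-open range [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (tab[mid].start <= pc) {
            found = &tab[mid];      /* candidate; look for a later one */
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return found;
}
```

Since the index stores only start addresses, a PC anywhere inside a function maps to the entry of that function: the search picks the closest start address that does not exceed the PC.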
Below is an example of what an .ARM.exidx table entry looks like for the kernel_start function in Embox:
$ arm-none-eabi-readelf -u build/base/bin/embox
Unwind table index '.ARM.exidx' at offset 0xaa6d4 contains 2806 entries:
<...>
0x1c3c <kernel_start>: @0xafe40
  Compact model index: 1
  0x9b      vsp = r11
  0x40      vsp = vsp - 4
  0x84 0x80 pop {r11, r14}
  0xb0      finish
  0xb0      finish
<...>
And here is its disassembly:
00001c3c <kernel_start>:
void kernel_start(void) {
1c3c: e92d4800 push {fp, lr}
1c40: e28db004 add fp, sp, #4
<...>
Let's break down the steps. We see the assignment vsp = r11 (r11 is FP), followed by vsp = vsp - 4. This reverses the instruction add fp, sp, #4. Next comes pop {r11, r14}, which reverses the instruction push {fp, lr}. The last instruction, finish, signals the end of execution (the second finish is most likely padding: the instructions are stored in whole 32-bit words, and unused trailing bytes are filled with finish).
Now let's see how much memory a build with the -funwind-tables flag “eats”.
For the experiment, I compiled Embox for the STM32F4-Discovery platform. Here are the results of objdump:
With the -funwind-tables flag:
Sections:
Idx Name          Size      VMA       LMA       File off  Algn
  0 .text         0005a600  08000000  08000000  00004000  2**14
                  CONTENTS, ALLOC, LOAD, CODE
  1 .ARM.exidx    00003fd8  0805a600  0805a600  0005e600  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  2 .ARM.extab    000049d0  0805e5d8  0805e5d8  000625d8  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  3 .rodata       0003e380  08062fc0  08062fc0  00066fc0  2**5
Without the flag:
Sections:
Idx Name          Size      VMA       LMA       File off  Algn
  0 .text         00058b1c  08000000  08000000  00004000  2**14
                  CONTENTS, ALLOC, LOAD, CODE
  1 .ARM.exidx    00000008  08058b1c  08058b1c  0005cb1c  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  2 .rodata       0003e380  08058b40  08058b40  0005cb40  2**5
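The overhead of the tables relative to the code can be worked out directly from the Size column. A tiny sketch of that arithmetic; the helper name overhead_permille is hypothetical and exists only for this illustration:

```c
#include <assert.h>

/* Size of the unwind tables relative to .text, in parts per thousand. */
unsigned long overhead_permille(unsigned long exidx, unsigned long extab,
                                unsigned long text)
{
    assert(text != 0);
    return (exidx + extab) * 1000ul / text;
}
```

For the build with the flag, overhead_permille(0x3fd8, 0x49d0, 0x5a600) gives 95: the two sections together take 35240 bytes against 370176 bytes of .text, or about 9.5%.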
It is easy to calculate that the .ARM.exidx and .ARM.extab sections take about 1/10 of the .text size. After that, I built a larger image, for the ARM Integrator CP based on the ARM9, and there these sections made up 1/12 of the size of the .text section. Of course, this ratio may vary from project to project.
It also turned out that the -mapcs-frame flag adds less to the image size than the option with exceptions (which is expected). For example, with a .text section of 600 KB, the total size of .ARM.exidx + .ARM.extab was 50 KB, while the additional code generated with the -mapcs-frame flag took only 10 KB. But if we look at the large prologue that was generated above for the Cortex-M1 (remember, via mov/str?), it becomes clear that in that case there will be almost no difference, which means that for Thumb mode using -mtpcs-frame hardly makes sense.
Is such a stack trace needed on ARM today? What are the alternatives?
The third approach is to trace the stack using a debugger. It seems that many operating systems for microcontrollers, such as FreeRTOS and NuttX, currently assume exactly this way of tracing, or suggest looking at the disassembly.
As a result, we came to the conclusion that stack tracing at run time is practically never used on ARM. This is probably a consequence of the desire to make the code as efficient as possible during operation and to move debugging actions (which include stack unwinding) offline. On the other hand, if the OS already uses C++ code, it is quite possible to use the trace implementation via .ARM.exidx.
And yes, the problem with the wrong stack trace in the interrupt handler in Embox was solved very simply: it turned out to be enough to save the LR register on the stack.