
Modification of a system call. Part 2
In the previous part, we established that non-exported Linux kernel names can be used in the code of our own kernel modules just as successfully as exported ones. One such name is the selector table of all Linux system calls, which is effectively the main interface between applications and kernel services. Now we will look at how to modify the original handler of any system call: replace it outright, or extend its behavior as we see fit.
Modification Technique
The technique of modifying operating system calls has long been known and has been used in a wide variety of operating systems. It has been a favorite of virus writers since MS-DOS, a system that practically invited such experiments. We, however, will use it for peaceful purposes. (Different publications call these actions by different names: modification, embedding, implementation, substitution, interception. Each term has its own nuances, but we will treat them as synonyms in this discussion.)
Every system call in the Linux kernel is invoked indirectly, through an address stored in the sys_call_table system call table (an array). So, by replacing an address in this selector table with the address of our own handler function, we replace the system call handler. That, in essence, is the modification technique. In practice, such a radical replacement is hardly ever required; more often we want the original system call to run, but with some actions of our own performed before it (preprocessing), after it completes (postprocessing), or both.
In the previous part, we made sure that we can locate and use any kernel symbol, including non-exported ones. In principle, all that is required to modify a system call is to find the base address of the sys_call_table array and write the address of our own handler function at the index given by the number of the required system call.
In reality, the scheme is a little more complicated: we must first save the old (original) handler address, in order to:
- call the original handler from our own function before and/or after our added code runs;
- restore the original handler when the module is unloaded.
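The save, substitute, and restore steps can be tried safely in user space before touching the kernel. The following is only a minimal sketch under that assumption: call_table, NR_DOUBLE, orig_double, and the other names are hypothetical stand-ins, not kernel symbols, and an ordinary array of function pointers plays the role of sys_call_table.

```c
/* Hypothetical user-space model: a "system call" takes an int, returns a long. */
typedef long (*handler_t)(int arg);

enum { NR_DOUBLE = 0, NR_COUNT };          /* hypothetical call numbers */

static long orig_double(int arg) { return 2L * arg; }

/* Stand-in for sys_call_table: callers reach handlers only through it. */
static handler_t call_table[NR_COUNT] = { [NR_DOUBLE] = orig_double };

static handler_t saved_handler;            /* original entry: for chaining and restore */
static int pre_calls, post_calls;          /* counters standing in for "our own actions" */

/* Wrapper: preprocessing, then the original handler, then postprocessing. */
static long hooked_double(int arg)
{
    long ret;

    pre_calls++;                           /* preprocessing */
    ret = saved_handler(arg);              /* original behavior is preserved */
    post_calls++;                          /* postprocessing */
    return ret;
}

static void hook_install(void)
{
    saved_handler = call_table[NR_DOUBLE]; /* 1. save the original address */
    call_table[NR_DOUBLE] = hooked_double; /* 2. substitute our own handler */
}

static void hook_remove(void)
{
    call_table[NR_DOUBLE] = saved_handler; /* restore, as on module unload */
}

/* Indirect call through the table, by call number. */
static long dispatch(int nr, int arg) { return call_table[nr](arg); }
```

Calling dispatch(NR_DOUBLE, 21) gives the same result before and after hook_install(); the only visible difference is that the counters advance, which is the property we want from a wrapper that adds preprocessing and postprocessing.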
Implementation
Implementation, as always, is somewhat more complicated than theory. The first, minor, difficulty is writing the prototype of our own handler function for a specific system call. The slightest error in the prototype will, with high probability, simply crash the operating system. The solution is to look up the prototype of this system call's handler in the kernel header file (include/linux/syscalls.h) and copy it verbatim, the way a lazy student copies an answer.
The next, far more significant, difficulty is that on processor architectures that allow it (i386 and x86_64 among them) the sys_call_table selector table resides in read-only memory pages. This is enforced in hardware by the MMU (Memory Management Unit), and an exception is raised on an access violation. We therefore need to lift the write protection for the duration of the sys_call_table modification and restore it afterwards.
On i386 and x86_64, whether the kernel may write to read-only pages is determined by the WP (Write Protect) flag, bit 16 of the processor control register CR0. To do what we need, we use the following functions; for a 32-bit architecture, for example, they look like this (file CR0.c; the code uses inline assembly, a GCC compiler extension):
// page write protect - off: clear CR0.WP (bit 16) so read-only pages become writable;
// interrupts stay disabled until rw_disable()
#define rw_enable() \
   asm( "cli \n" \
        "pushl %eax \n" \
        "movl %cr0, %eax \n" \
        "andl $0xfffeffff, %eax \n" \
        "movl %eax, %cr0 \n" \
        "popl %eax" );
// page write protect - on: set CR0.WP again and re-enable interrupts
#define rw_disable() \
   asm( "pushl %eax \n" \
        "movl %cr0, %eax \n" \
        "orl $0x00010000, %eax \n" \
        "movl %eax, %cr0 \n" \
        "popl %eax \n" \
        "sti " );
P.S. Various ways of writing to write-protected pages have been discussed, for example, in "WP: Safe or Not?" and "A Kosher way to modify the write-protected areas of the Linux kernel".
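The two constants in the macros are simply the mask for bit 16 of CR0 (the WP flag) and its complement. The small user-space sketch below only reproduces that arithmetic; the helper names are hypothetical, and nothing here touches the real register. Applied to the CR0 value 0x80050033 that appears in the log further on, clearing WP gives 0x80040033.

```c
#include <stdint.h>

#define CR0_WP_BIT  16u                           /* position of the WP flag in CR0 */
#define CR0_WP_MASK (UINT32_C(1) << CR0_WP_BIT)   /* 0x00010000, as in rw_disable() */

/* Value that rw_enable() writes back: WP cleared, all other bits unchanged. */
static uint32_t cr0_wp_clear(uint32_t cr0) { return cr0 & ~CR0_WP_MASK; }

/* Value that rw_disable() writes back: WP set again. */
static uint32_t cr0_wp_set(uint32_t cr0) { return cr0 | CR0_WP_MASK; }

/* Nonzero when write protection is in force. */
static int cr0_wp_is_set(uint32_t cr0) { return (cr0 & CR0_WP_MASK) != 0; }
```

Keeping the mask as a named constant instead of two hand-written hexadecimal literals makes it harder for the "and" and "or" masks to drift apart.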
Now we are ready to replace any Linux system call (section 2 of man) with our own handler function, which is what we were aiming for. To demonstrate that the method works, we replace (extend) the write(1, ...) system call, that is, output to the terminal, so that the output stream is also duplicated to the system log (similar to what the tee command does):
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/string.h>
#include <linux/uaccess.h>
#include <asm/unistd.h>

// find_sym() is the symbol lookup from the previous part; rw_enable() and
// rw_disable() are the CR0 macros above; show_cr0() (not listed here) prints CR0.

#define PREFIX "! "
#define DEB2(...) if( debug > 1 ) printk( KERN_INFO PREFIX " ---- " __VA_ARGS__ )
#define LOG(...) printk( KERN_INFO PREFIX __VA_ARGS__ )
#define ERR(...) printk( KERN_ERR PREFIX __VA_ARGS__ )

static int debug = 0;            // debug output level: 0, 1, 2
module_param( debug, int, 0 );   // parameter type must match the variable type

asmlinkage long (*old_sys_write) ( unsigned int fd, const char __user *buf, size_t count );

#define LEN 250

asmlinkage long new_sys_write ( unsigned int fd, const char __user *buf, size_t count ) {
   if( 1 == fd && count > 0 ) {               // stdout only; an empty write has nothing to log
      char msg[ LEN + 1 ];
      size_t n = count < LEN ? count : LEN;   // log at most LEN bytes
      if( copy_from_user( msg, buf, n ) != 0 ) return -EFAULT;
      if( '\n' == msg[ n - 1 ] ) msg[ n - 1 ] = '\0';   // drop one trailing newline
      else msg[ n ] = '\0';
      if( strchr( msg, '!' ) != NULL ) goto rec;        // to prevent recursion
      LOG( "{%04zu} %s\n", count, msg );
   }
rec:
   return old_sys_write( fd, buf, count );    // original write()
}

static void **taddr;             // address of sys_call_table

static int __init wrchg_init( void ) {
   void *waddr;
   if( NULL == ( taddr = find_sym( "sys_call_table" ) ) ) {
      ERR( "sys_call_table not found\n" ); return -EINVAL;
   }
   old_sys_write = (void*)taddr[ __NR_write ];   // save the original handler
   if( NULL == ( waddr = find_sym( "sys_write" ) ) ) {
      ERR( "sys_write not found\n" ); return -EINVAL;
   }
   if( (void*)old_sys_write != waddr ) {         // sanity check of the table entry
      ERR( "Oooops! : addresses not equal\n" ); return -EINVAL;
   }
   LOG( "set new sys_write syscall [%p]\n", &new_sys_write );
   show_cr0();
   rw_enable();
   taddr[ __NR_write ] = (void*)new_sys_write;   // install our handler
   show_cr0();
   rw_disable();
   show_cr0();
   return 0;
}

static void __exit wrchg_exit( void ) {
   rw_enable();
   taddr[ __NR_write ] = (void*)old_sys_write;   // restore the original handler
   rw_disable();
   LOG( "restore old sys_write syscall [%p]\n", (void*)taddr[ __NR_write ] );
}

module_init( wrchg_init );
module_exit( wrchg_exit );
The kernel symbol lookup function find_sym(), built on the kernel API call kallsyms_on_each_symbol(), was shown in the previous part of the discussion. In addition, mostly for illustration, we check that the address of the original sys_write() symbol matches the address stored at position __NR_write of the sys_call_table table.
Now we can run the system with parallel logging of everything written to the terminal (the write() experiment is not especially elegant, but it is very illustrative and, moreover, safer for early experimentation than other Linux system calls):
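The buffer handling inside new_sys_write() can also be checked in user space. The sketch below is a model under stated assumptions, not the module code: prepare_msg() is a hypothetical helper, and memcpy() stands in for copy_from_user(). It shows the truncation to LEN bytes, the removal of a single trailing newline, and the '!' check that keeps the module from logging its own log lines again when they are printed to the terminal.

```c
#include <string.h>

#define LEN 250

/*
 * Copy at most LEN bytes of buf into msg (msg must hold LEN + 1 bytes),
 * NUL-terminate the copy, and drop one trailing newline if present.
 * Returns 1 if the text should be logged, or 0 if it contains the '!'
 * prefix marker and must be skipped to avoid logging our own output.
 */
static int prepare_msg(char *msg, const char *buf, size_t count)
{
    size_t n = count < LEN ? count : LEN;

    memcpy(msg, buf, n);                 /* stands in for copy_from_user() */
    msg[n] = '\0';
    if (n > 0 && msg[n - 1] == '\n')     /* n > 0 guards against reading msg[-1] */
        msg[n - 1] = '\0';
    return strchr(msg, '!') == NULL;
}
```

The n > 0 guard matters for a zero-length write, where indexing msg[n - 1] would read before the start of the buffer.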
$ sudo insmod wrlog.ko debug=2
$ ls
CR0.c find.c Makefile Modi.hist wrlog.0.c wrlog.1.c wrlog.2.c wrlog.3.c wrlog.c wrlog.hist wrlog.ko
$ sudo rmmod wrlog
$ dmesg | tail -n31
[ 1594.231242] ! set new sys_write syscall [f8854000]
[ 1594.231248] ! ---- CR0 = 80050033
[ 1594.231250] ! ---- CR0 = 80040033
[ 1594.231252] ! ---- CR0 = 80050033
[ 1594.232737] ! {0052} /home/olej/2015_WORK/own.BOOK/SysCalls/Modi/examles
[ 1594.233368] ! {0078} \x1b[01;32molej@nvidia\x1b[01;34m ~/2015_WORK/own.BOOK/SysCalls/Modi/examles $\x1b[00m
[ 1596.866659] ! {0001} l
[ 1597.154675] ! {0001} s
[ 1597.644985] ! {0110} CR0.c find.c Makefile Modi.hist wrlog.0.c wrlog.1.c wrlog.2.c wrlog.3.c wrlog.c wrlog.hist wrlog.ko
[ 1597.645196] ! {0113}
[ 1597.645196] CR0.c find.c Makefile Modi.hist wrlog.0.c wrlog.1.c wrlog.2.c wrlog.3.c wrlog.c wrlog.hist wrlog.ko
[ 1597.645321] ! {0052} /home/olej/2015_WORK/own.BOOK/SysCalls/Modi/examles
[ 1597.645951] ! {0078} \x1b[01;32molej@nvidia\x1b[01;34m ~/2015_WORK/own.BOOK/SysCalls/Modi/examles $\x1b[00m
[ 1600.226651] ! {0001} s
[ 1600.346587] ! {0001} u
[ 1600.522683] ! {0001} d
[ 1601.026667] ! {0001} o
[ 1602.170701] ! {0001}
[ 1602.426522] ! {0001} r
[ 1603.218682] ! {0001} m
[ 1603.682677] ! {0001} m
[ 1603.906615] ! {0001} o
[ 1604.338566] ! {0001} d
[ 1606.442570] ! {0001}
[ 1606.946670] ! {0001} w
[ 1607.226667] ! {0001} r
[ 1607.834662] ! {0001} l
[ 1608.106672] ! {0001} o
[ 1608.842694] ! {0001} g
[ 1612.003059] ! {0002}
[ 1612.014102] ! restore old sys_write syscall [c1179f70]
Discussion
Similarly, we can change the behavior of any Linux system call. This is done dynamically by loading the module; when the module is unloaded, the original behavior of the system is restored. The technique has many applications: monitoring and debugging during development, targeted changes to the behavior of individual system calls for project-specific needs, and more.
The code shown is noticeably simplified. A real module would have to take a number of precautions to ensure integrity. For example, the new handler function could increment the module reference counter by calling try_module_get(THIS_MODULE) to prevent the module from being unloaded while the function is running (which can happen with a vanishingly small, but still nonzero, probability). Before returning, the function then does the opposite: module_put(THIS_MODULE). Other precautions may also be needed while the module is being loaded and unloaded. These are common kernel-module techniques, and they are left out here so as not to obscure the principle.
We will look at some additional nuances and special cases of this technique in the next part of the discussion.
The code archive for experiments can be downloaded here or here (the examples are too minor to be worth posting on GitHub).
P.S. Everything shown works unchanged on 32-bit systems. On a 64-bit architecture the picture becomes somewhat more complicated because of the need to emulate 32-bit applications. To keep things simple, this case was deliberately left out (at least for now; it may be worth returning to later).
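The idea behind the try_module_get()/module_put() pair can be modeled in user space with a counter that may only be incremented while it is nonzero. This is a hypothetical model built on C11 atomics, not the kernel implementation: guard_get(), guard_put(), and try_unload() are invented names, and the counter value 1 is assumed to mean "loaded, no handler running".

```c
#include <stdatomic.h>
#include <stdbool.h>

/* 1 = loaded and idle, >1 = handlers in flight, 0 = unloaded (model only). */
static atomic_int refcnt = 1;

/* Model of try_module_get(): take a reference unless already unloaded. */
static bool guard_get(void)
{
    int old = atomic_load(&refcnt);

    while (old != 0) {
        /* On failure, old is refreshed with the current value and we retry. */
        if (atomic_compare_exchange_weak(&refcnt, &old, old + 1))
            return true;
    }
    return false;
}

/* Model of module_put(): drop the reference taken by guard_get(). */
static void guard_put(void)
{
    atomic_fetch_sub(&refcnt, 1);
}

/* Unload succeeds only when no handler holds a reference (1 -> 0). */
static bool try_unload(void)
{
    int expected = 1;

    return atomic_compare_exchange_strong(&refcnt, &expected, 0);
}
```

A handler would call guard_get() on entry and guard_put() just before returning; while any handler is between those two calls, try_unload() fails, which is the behavior the reference counter is meant to provide.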