Adding a system call. Part 4, the last

                                                                                        - Something about Honduras keeps bothering me ...
                                                                                        - Bothering you? Then don't scratch it.
    


    In the previous parts of this series (1st, 2nd and 3rd), we examined how to change the behavior of a Linux system call by modifying the contents of sys_call_table. Now we continue the experiments and ask whether (and how) a new system call can be added dynamically for the needs of your own software project.

    We will not dwell on the question “why?”: in programming, “why?” is the last thing to ask, and the question to ask is “how?”. If a technique does not appeal to you, you simply do not use it (see the epigraph). Nevertheless, we will return to this briefly near the end, in the discussion.

    What does it look like?


    Despite its general similarity to the system call substitution examples discussed earlier, this task has some complicating features:

    • The size of the original sys_call_table grows slowly but monotonically from one kernel version to the next, and depends significantly on the processor platform.
    • The constant that defines the size of this table (known in the kernel as __NR_syscall_max, or in some newer versions as __NR_syscalls) is a compile-time preprocessor macro and is not available at run time (at least, not as far as I know).
    • If we try to append our own entry point to the end of the table, we run a serious risk of writing beyond the area allocated to the table, and that must not be done!

    The sys_call_table is fairly large, and its size changes from one kernel version to the next. Here is a very rough estimate (for version 3.13):

    $ cat /proc/kallsyms | grep ' sys_' | grep T | wc -l 
    357 
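
    The grep pipeline above is deliberately rough: `grep T` matches a capital T anywhere in the line. A slightly stricter filter can be sketched in user-space C as follows. This is only an illustration that assumes the usual "address type name" layout of /proc/kallsyms lines; the function names is_sys_text_symbol() and count_sys_symbols() are hypothetical and are not part of the article's code archive.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if a /proc/kallsyms line describes a global text symbol
 * (type 'T') whose name starts with "sys_", 0 otherwise.
 * Assumed line layout: "<hex address> <type char> <name> [module]". */
int is_sys_text_symbol(const char *line)
{
    char type;
    char name[128];

    assert(line != NULL);
    /* %*s skips the address, " %c" reads the one-letter symbol type. */
    if (sscanf(line, "%*s %c %127s", &type, name) != 2)
        return 0;
    return type == 'T' && strncmp(name, "sys_", 4) == 0;
}

/* Counts matching lines in an already opened stream,
 * for example one obtained with fopen("/proc/kallsyms", "r"). */
int count_sys_symbols(FILE *fp)
{
    char line[256];
    int n = 0;

    while (fgets(line, sizeof line, fp) != NULL)
        n += is_sys_text_symbol(line);
    return n;
}
```

    The result is still only an estimate of the table size, since not every sys_* text symbol is necessarily an entry in sys_call_table.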
    

    Kernel versions will have to be mentioned constantly in this part of the discussion: something defined in a header file in one version may be defined differently, and in a completely different place (file), in the next version, or not defined at all. This is common practice in the kernel code. Even so, the basic principles and dependencies remain unchanged from version to version.

    What mitigates the limitations above is that the system call table is not dense but rather sparse: it has unused positions (left over from obsolete system calls that are no longer supported). All such positions are filled with the same address, a pointer to sys_ni_syscall(), the handler for unimplemented calls:
    $ cat /proc/kallsyms | grep sys_ni_syscall 
    c045b9a8 T sys_ni_syscall 
    

    The sys_ni_syscall() handler itself is defined roughly like this:
    asmlinkage long sys_ni_syscall( void ) {
       return -ENOSYS;
    }
    

    Therefore, we can install our new system call handler in any unused position of sys_call_table. Note that these positions do not hold the obsolete, unused calls themselves; they hold a stub whose only action is to return an error code. Moreover, kernel developers are not free to reuse these positions, because otherwise a thoroughly obsolete application could unknowingly invoke the new call that replaced the old one.

    Statically, in the source text, we can examine the structure of sys_call_table in detail (for a chosen platform and version). The source tree as shipped by the developers is not very convenient for such study, but fortunately there are now many sites that present the kernel code using the LXR (Linux Kernel Cross Reference) project, for example here or here (this lets you compare versions and easily find the identifiers you need). As an example, I will show only those sys_call_table positions of the 3.0.26 kernel for the x86 architecture that refer to sys_ni_syscall (from kernel 3.2 onwards this file disappears from the source tree altogether, but the principles by which the table is built stay the same and its appearance does not change):
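
    The idea of a sparse table whose retired slots all point at one "not implemented" stub can be modelled in plain user-space C. The sketch below is a toy with eight slots and made-up handlers (call_a, call_b, dispatch and slot_is_free are hypothetical names); it only illustrates the dispatch behavior described above and does not touch the real sys_call_table.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TABLE_LEN 8              /* toy size; the real table has hundreds of slots */

typedef long (*handler_t)(void);

/* Stand-in for sys_ni_syscall(): the shared "not implemented" stub. */
static long ni_syscall(void) { return -ENOSYS; }

/* Two hypothetical live calls, purely for illustration. */
static long call_a(void) { return 11; }
static long call_b(void) { return 22; }

/* Sparse table: retired slots point at the stub rather than being NULL. */
static handler_t table[TABLE_LEN] = {
    call_a, ni_syscall, call_b, ni_syscall,
    ni_syscall, call_a, ni_syscall, call_b,
};

/* Dispatch by number; numbers outside the table fail the same way. */
long dispatch(long nr)
{
    if (nr < 0 || nr >= TABLE_LEN)
        return -ENOSYS;
    assert(table[nr] != NULL);   /* every slot must hold some handler */
    return table[nr]();
}

/* A slot is free for our purposes only if it still holds the stub. */
int slot_is_free(long nr)
{
    return nr >= 0 && nr < TABLE_LEN && table[nr] == ni_syscall;
}
```

    In this model an old program that still uses a retired number simply gets -ENOSYS back, which is exactly why such slots can be recognized by comparing their contents with the stub address.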

    ENTRY(sys_call_table)
            .long sys_restart_syscall    /* 0 - old "setup()" system call, used for restarting */
            .long sys_exit
    ...
            .long sys_ni_syscall         /* old break syscall holder */       //17
            .long sys_ni_syscall         /* old stty syscall holder */        //31 
            .long sys_ni_syscall         /* old gtty syscall holder */        //32 
            .long sys_ni_syscall         /* 35 - old ftime syscall holder */  //35 
            .long sys_ni_syscall         /* old prof syscall holder */        //44 
            .long sys_ni_syscall         /* old lock syscall holder */        //53 
            .long sys_ni_syscall         /* old mpx syscall holder */         //56 
            .long sys_ni_syscall         /* old ulimit syscall holder */      //58 
            .long sys_ni_syscall         /* old profil syscall holder */      //98 
            .long sys_ni_syscall         /* old "idle" system call */         //112 
            .long sys_ni_syscall         /* old "create_module" */            //127 
            .long sys_ni_syscall         /* 130: old "get_kernel_syms" */     //130 
            .long sys_ni_syscall         /* reserved for afs_syscall */       //137 
            .long sys_ni_syscall         /* Old sys_query_module */           //167 
            .long sys_ni_syscall         /* reserved for streams1 */          //188 
            .long sys_ni_syscall         /* reserved for streams2 */          //189 
            .long sys_ni_syscall         /* reserved for TUX */               //222 
            .long sys_ni_syscall                                              //223 
            .long sys_ni_syscall                                              //251 
            .long sys_ni_syscall         /* sys_vserver */                    //273 
            .long sys_ni_syscall         /* 285 */ /* available */            //285
    ... 
            .long sys_setns                                                   // 346 
    

    The listing shows only the unused positions (plus the beginning and end of the table). The /* */ comments come from the source code; the trailing comment with the position number of each system call was added by me.

    We can see that in this kernel version the table has 347 system call positions, of which 21 are unused. The first kernel module we consider analyzes the unused positions at run time, without relying on the changeable kernel sources:

    static void **taddr,                     // address of sys_call_table 
                *niaddr;                     // address of sys_ni_syscall() 
    static int nsys = 0;                    // number of system calls in this version
    #define SYS_NR_MAX 450 
    // SYS_NR_MAX - arbitrarily large, greater than the length of sys_call_table 
    static int sys_length( void* data, const char* sym, struct module* mod, unsigned long addr ) { 
       int i; 
       if( ( strstr( sym, "sys_" ) != sym ) || 
           ( 0 == strcmp( "sys_call_table", sym ) ) ) return 0; 
       for( i = 0; i < SYS_NR_MAX; i++ ) { 
          if( taddr[ i ] == (void*)addr ) {  // found sys_* in sys_call_table 
             if( i > nsys ) nsys = i; 
             break; 
          } 
       } 
       return 0; 
    } 
    static void put_entries( void ) { 
       int i, ni = 0; 
       char buf[ 200 ] = ""; 
       for( i = 0; i <= nsys; i++ ) 
          if( taddr[ i ] == niaddr ) { 
             ni++; 
             sprintf( buf + strlen( buf ), "%03d, ", i ); 
          } 
       LOG( "found %d unused entries: %s\n", ni, buf ); 
    } 
    static int __init init_driver( void ) { 
       if( NULL == ( taddr = (void**)kallsyms_lookup_name( "sys_call_table" ) ) ) { 
          ERR( "sys_call_table not found!\n" ); 
          return -EFAULT; 
       } 
       LOG( "sys_call_table address = %p\n", taddr ); 
       if( NULL == ( niaddr = (void*)kallsyms_lookup_name( "sys_ni_syscall" ) ) ) { 
          ERR( "sys_ni_syscall not found!\n" ); 
          return -EFAULT; 
       } 
       LOG( "sys_ni_syscall address = %p\n", niaddr ); 
       kallsyms_on_each_symbol( sys_length, NULL ); 
       LOG( "sys_call_table length = %d\n", nsys + 1 ); 
       put_entries(); 
       return -EPERM; 
    } 
    module_init( init_driver ); 
    

    As before, non-essential details (such as the LOG() macro, etc.) are not shown; they are all in the complete attached files.
    One could take a simpler route (which is also valid): to find the length of sys_call_table, just count the kernel symbols matching the sys_* mask and subtract 1 (for the sys_call_table symbol itself). But we take the redundant route:
    • each symbol matching the sys_* mask is visited in a loop;
    • its position is looked up in sys_call_table (an additional safeguard that it really is a system call);
    • if this position is greater than any found for the previous symbols, it becomes the current number of the last call (and hence the current size of sys_call_table).

    This redundant (and by no means mandatory) scheme also lets you determine the exact size of the system call table for your architecture and Linux kernel version:
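
    The three steps above can be sketched in user-space C with ordinary arrays standing in for the kernel data. This is a model only: the helper names slot_of(), last_used_slot() and count_unused() are hypothetical, and unlike the module (which scans up to an arbitrary SYS_NR_MAX) the model assumes that limit never exceeds the number of elements actually present in the array.

```c
#include <assert.h>
#include <stddef.h>

/* Index of addr within table[0..limit), or -1 if it is not there. */
static int slot_of(void *const *table, int limit, const void *addr)
{
    for (int i = 0; i < limit; i++)
        if (table[i] == addr)
            return i;
    return -1;
}

/* Highest table index that holds any of the nsyms known symbol addresses.
 * The table length is this value plus one. Returns -1 if nothing matched. */
int last_used_slot(void *const *table, int limit,
                   void *const *syms, int nsyms)
{
    int last = -1;

    assert(table != NULL && syms != NULL);
    for (int s = 0; s < nsyms; s++) {
        int i = slot_of(table, limit, syms[s]);
        if (i > last)
            last = i;
    }
    return last;
}

/* Number of slots in table[0..len) that point at the stub address. */
int count_unused(void *const *table, int len, const void *stub)
{
    int n = 0;

    for (int i = 0; i < len; i++)
        if (table[i] == stub)
            n++;
    return n;
}
```

    Here syms plays the role of the sys_* symbol addresses delivered one by one by kallsyms_on_each_symbol(), and stub plays the role of the sys_ni_syscall address.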

    $ uname -p 
    i686 
    $ uname -r 
    3.13.0-37-generic 
    $ sudo insmod nsys.ko 
    insmod: ERROR: could not insert module nsys.ko: Operation not permitted 
    $ dmesg | tail -n 4 
    [10751.601851] ! sys_call_table address = c1666140 
    [10751.602194] ! sys_ni_syscall address = c1075930 
    [10751.659769] ! sys_call_table length = 351 
    [10751.659779] ! found 27 unused entries: 017, 031, 032, 035, 044, 053, 056, 058, 098, 112, 127, 130, 137, 167, 169, 188, 189, 222, 223, 251, 273, 274, 275, 276, 285, 294, 317, 
    

    In total, this version has 351 system calls, of which 27 are unused (roughly 8% of the table). This list is very stable (version 3.0.26 was deliberately chosen for the source analysis, while versions 2.6.32 and 3.13, which are more than 4 years apart, were chosen for the run-time checks).

    Note: Without digressing too far, we note in passing that a module written in this manner, which a) is not meant to stay loaded at all, b) therefore deliberately returns a non-zero completion code, and c) consequently has no unload function (__exit) at all, is the direct equivalent of a user application (starting from main()), except that it runs in supervisor mode with maximum privileges. But that is a subject for another conversation ...

    Implementing a New System Call


    Now we are ready to return to the task as formulated: adding a new system call. Naturally, we will also need a user-space test application that uses this call. The number of the new call is defined in a shared header file (syscall.h), so that the module and the program use it consistently (the LOG() and ERR() macros and other small details are defined there as well):

    // number of the new system call being added 
    #define __NR_own 223 
    // any number reported when loading the nsys.ko module may be used; 
    // for kernel 3.13 a series of 27 positions was obtained: 
    // 017, 031, 032, 035, 044, 053, 056, 058, 098, 112, 
    // 127, 130, 137, 167, 169, 188, 189, 222, 223, 251, 
    // 273, 274, 275, 276, 285, 294, 317, 
    

    It is simpler and clearer to start with the user application that makes the new system call. Everything here is as simple as it gets:

    static void do_own_call( char *str ) { 
       int n = syscall( __NR_own, str, strlen( str ) ); 
       if( n == 0 ) LOG( "syscall return %d\n", n ); 
       else { 
          ERR( "syscall error %d : %s\n", n, strerror( -n ) ); 
          exit( n ); 
       } 
    } 
    int main( int argc, char *argv[] ) { 
       if( 1 == argc ) do_own_call( "DEFAULT STRING" ); 
       else 
          while( --argc > 0 ) do_own_call( argv[ argc ] ); 
       return EXIT_SUCCESS; 
    } 
    

    The program can make one system call or a series of them (if you give several parameters on the command line), passing a string parameter to each call (much as sys_write does, for example). In the module code we will then see how this string is copied into kernel space. But the main interest here is the return code: whether the system call succeeded or failed.

    And here is the module that “picks up” such a call in the kernel:
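
    One detail about that return code is worth spelling out. The libc syscall() wrapper does not hand back the kernel's negative error code: on failure it returns -1 and stores the positive code in errno. So in do_own_call() the expression strerror( -n ) always describes error 1 (EPERM, "Operation not permitted"), whatever the kernel actually returned; for an unused slot the real code is normally ENOSYS. A minimal sketch of errno-based reporting follows; try_own_call() is a hypothetical helper, not part of the article's archive.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE              /* for the syscall() declaration in glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Invokes system call number nr with a string and its length.
 * Returns 0 if the wrapper did not report an error, otherwise the
 * positive errno value stored by the wrapper (e.g. ENOSYS). */
int try_own_call(long nr, const char *str)
{
    long ret;

    assert(str != NULL);
    errno = 0;
    ret = syscall(nr, str, strlen(str));
    if (ret == -1 && errno != 0)
        return errno;            /* the real reason, suitable for strerror() */
    return 0;
}
```

    A caller would then print strerror() of the returned value, which for a number that no kernel handler serves is expected to read "Function not implemented" rather than "Operation not permitted".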

    asmlinkage long (*old_sys_addr) ( void ); 
    // system call with two parameters: 
    asmlinkage long new_sys_call ( const char __user *buf, size_t count ) { 
       static char buf_msg[ 80 ]; 
       int res; 
       if( count > sizeof( buf_msg ) - 1 )     // never write past buf_msg 
          count = sizeof( buf_msg ) - 1; 
       res = copy_from_user( buf_msg, buf, count ); // number of bytes NOT copied 
       buf_msg[ count ] = '\0'; 
       LOG( "accepted %zu bytes: %s\n", count, buf_msg ); 
       return res; 
    } 
    static void **taddr; // address of the sys_call_table table 
    static int __init new_sys_init( void ) { 
       void *waddr; 
       if( NULL == ( taddr = (void**)kallsyms_lookup_name( "sys_call_table" ) ) ) { 
          ERR( "sys_call_table not found!\n" ); 
          return -EFAULT; 
       } 
       old_sys_addr = (void*)taddr[ __NR_own ]; 
       if( ( waddr = (void*)kallsyms_lookup_name( "sys_ni_syscall" ) ) != NULL ) 
          LOG( "sys_ni_syscall address = %p\n", waddr ); 
       else { 
          ERR( "sys_ni_syscall not found!\n" ); 
          return -EFAULT; 
       } 
       if( old_sys_addr != waddr ) { 
          ERR( "not free slot!\n" ); 
          return -EINVAL; 
       } 
       LOG( "old sys_call_table[%d] = %p\n", __NR_own, taddr[ __NR_own ] ); 
       rw_enable(); 
       taddr[ __NR_own ] = new_sys_call; 
       rw_disable(); 
       LOG( "new sys_call_table[%d] = %p\n", __NR_own, taddr[ __NR_own ] ); 
       return 0; 
    } 
    static void __exit new_sys_exit( void ) { 
       rw_enable(); 
       taddr[ __NR_own ] = old_sys_addr; 
       rw_disable(); 
       LOG( "restore sys_call_table[%d] = %p\n", __NR_own, taddr[ __NR_own ] ); 
       return; 
    } 
    module_init( new_sys_init ); 
    module_exit( new_sys_exit ); 
    

    Here too there is an extra safeguard: the address found at the given (__NR_own) position of sys_call_table is checked against the address of sys_ni_syscall, the handler for unused system calls.

    Now let us look at what we have obtained:
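
    The install-and-restore discipline of the module (refuse an occupied slot, remember the old entry, put it back on exit) can be sketched as a small user-space model. Everything here is illustrative: slots[] stands in for sys_call_table, ni_stub for sys_ni_syscall, and install(), uninstall() and call_slot() are hypothetical names. The real module additionally has to lift write protection (rw_enable()/rw_disable()), which has no counterpart in this model.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define NSLOTS 4

typedef long (*handler_t)(void);

static long ni_stub(void)   { return -ENOSYS; }  /* plays sys_ni_syscall */
static long busy_call(void) { return 1; }        /* some existing call   */
static long own_call(void)  { return 0; }        /* the handler we add   */

static handler_t slots[NSLOTS] = { busy_call, ni_stub, ni_stub, busy_call };
static handler_t saved[NSLOTS];  /* previous contents of slots we took over */

/* Take over slot nr only if it still holds the stub; remember the old entry. */
int install(int nr, handler_t h)
{
    assert(h != NULL);
    if (nr < 0 || nr >= NSLOTS)
        return -EINVAL;
    if (slots[nr] != ni_stub)
        return -EINVAL;          /* the "not free slot!" case in the module */
    saved[nr] = slots[nr];
    slots[nr] = h;
    return 0;
}

/* Put back whatever was in the slot before install(); no-op otherwise. */
void uninstall(int nr)
{
    if (nr >= 0 && nr < NSLOTS && saved[nr] != NULL) {
        slots[nr] = saved[nr];
        saved[nr] = NULL;
    }
}

/* Helper for callers and tests: invoke the handler in slot nr. */
long call_slot(int nr) { return slots[nr](); }
```

    Calling install() a second time on the same slot fails, because the slot no longer holds the stub; this mirrors what would happen if the module were loaded twice under different names.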

    $ ./syscall 
    syscall error -1 : Operation not permitted 
    $ echo $? 
    255 
    $ sudo insmod adds.ko 
    $ lsmod | head -n3 
    Module                  Size  Used by 
    adds                   12622  0 
    pci_stub               12550  1 
    $ dmesg | tail -n3 
    [15000.600618] ! sys_ni_syscall address = c1075930 
    [15000.600622] ! old sys_call_table[223] = c1075930 
    [15000.600623] ! new sys_call_table[223] = f87d9000 
    $ ./syscall new string for call 
    syscall return 0 
    syscall return 0 
    syscall return 0 
    syscall return 0 
    $ dmesg | tail -n4 
    [15070.680753] ! accepted 4 bytes: call 
    [15070.680799] ! accepted 3 bytes: for 
    [15070.680804] ! accepted 6 bytes: string 
    [15070.680807] ! accepted 3 bytes: new 
    $ ./syscall 'new string for call' 
    syscall return 0 
    $ dmesg | tail -n1 
    [15167.526452] ! accepted 19 bytes: new string for call 
    $ sudo rmmod adds 
    $ dmesg | tail -n1 
    [15199.917817] ! restore sys_call_table[223] = c1075930 
    $ ./syscall 
    syscall error -1 : Operation not permitted 
    

    After the module is unloaded, the kernel can no longer service the system call that the program requires!

    Discussion


    Strictly speaking, there is nothing to discuss here: the example shows everything transparently. But I promised at the outset to share my thoughts on why this might be applied at all (while repeating once more my firm conviction that the question “why?” is generally meaningless in programming). The trick shown here provides one more way for applications to interact (in both directions) with the kernel. Yes, of course the same can be done through /dev, /proc, or /sys ... but each of those methods is heavier than a system call and involves more intermediate kernel mechanisms.

    When might such a mechanism be used? For example, for asynchronous notification of an application about events in the kernel, where a separate application thread blocks in a system call until the expected event occurs. Such an event could be, for example, a hardware interrupt (IRQ) from a new device being debugged (one that is not too fast). With this approach, all input-output operations with the device can be performed from user space using the inb(), outb() ... group of operations and ioperm() or iopl(). Taken together, this makes it possible to study the device's operation and write the code that communicates with it in the finest detail without leaving user space, and without the risks and difficulties of privileged kernel mode. And then, depending on circumstances and preference:

    Note: The remark above about the low speed of devices that can be handled this way should not be taken too much to heart either. In fact, high-speed devices inside the Linux kernel do not work on interrupts but use cyclic polling instead, as do, for example, all network interfaces of the network stack at the hardware level ... anyone familiar with the Linux network subsystem will understand what I mean.

    Not to mention developers of proprietary hardware and projects, which have the same right to exist as any others. A technique like this may well find application in their work.

    As before, the code archive can be downloaded here or here ...

    Epilogue


    Since this is the final part of a short series on such an unusual (indecent?) way of handling Linux system calls, I would like to sum up briefly what has been said.

    When you start writing kernel modules or kernel patches, you initially feel constrained, limited to the capabilities offered by the poorly documented kernel API or described in the few, long-outdated books of the “writing Linux drivers” kind. But experiments like those described in this series, and many others like them, show that in a kernel module you have access to all (without exception!) of the capabilities of user space (launching new processes and threads, sending UNIX signals, etc.), plus capabilities unattainable in user space that come with the privileged (supervisor, ring 0) processor protection mode (privileged instructions, internal processor registers, interrupt handling).

    Showing this is the main goal of this series of articles, rather than the particular tasks of substituting or adding system calls. Programming in kernel mode should give you a sense of freedom, the feeling that here you are like gods and can do anything. But this also demands a matching degree of responsibility ...
