Generation of type objects on the fly (or madness with Rust)

In this article we will have a little fun with the programming language Rust, and in particular, with type-objects.


When I got acquainted with Rust, one of the details of the implementation of type-objects seemed to me interesting. Namely, the fact that the virtual table of functions is located not in the data itself, but in the "thick" pointer to them. Each pointer to a type-object) contains a pointer to the data itself, as well as a link to a virtual table, which will contain the addresses of functions that implement this type-object for a given structure (but since this is an implementation detail, the behavior may change.


Let's start with a simple example that demonstrates thick pointers. The following code will output on 64-bit architecture 8 and 16:


fn main () {
    let v: &String = &"hello".into();
    let disp: &std::fmt::Display = v;
    println!("Указатель на строку: {}", std::mem::size_of_val(&v));
    println!("Толстый указатель на типаж-объект: {}", std::mem::size_of_val(&disp));
}

What is interesting? When I was engaged in enterprise-Java, one of the tasks, which quite regularly arose was the adaptation of existing objects for given interfaces. That is, the object is already there, issued in the form of a link, but it is necessary to adapt it to the specified interface. And you can not change the input object, it is what it is.


I had to do something like this:


Person adapt(Json value) {
    // ...какая-нибудь логика, например, проверим, что "value" действительно
    // соответствует контракту Person
    return new PersonJsonAdapter(value);
}

At such an approach, there were various problems. For example, if the same object "adapts" twice, then we get two different ones Person(in terms of comparison of links). And the fact that you have to create new objects every time is somehow ugly.


When I saw the type objects in Rust, I had a thought that in Rust this could be done much more elegantly! You can also take and assign another virtual table to the data and get a new type-object! And do not allocate memory for each instance. At the same time, the whole logic of "borrowing" remains in place - our adaptation function will look something like fn adapt<'a>(value: &'a Json) -> &'a Person(that is, we sort of borrow from the initial data).


Moreover, it is possible to “force” the same type (for example, String) to implement our type-object several times, with different behavior. What for? But you never know what you might need in the enterprise?


Let's try to implement it.


Formulation of the problem


We set the problem in the following way: to make a function annotatethat "assigns" the Stringfollowing type-object to the usual type :


trait Object {
  fn type_name(&self) -> &str;
  fn as_string(&self) -> &String;
}

And the function itself annotate:


/// Адаптировать строку к типажу-объекту `Object`, который будет декларировать,
/// что его "тип" -- тот, который задан в `type_name`.
fn annotate<'a>(input: &'a String, type_name: &str) -> &'a dyn Object {
    // ...
}

Write a test right away. First, we make sure that the "attributed" type coincides with the expected one. Second, make sure that we can get the source line and it will be the same line (from the point of view of pointers):


#[test]
fn test() {
    let input: String = "hello".into();
    let annotated1 = annotate(&input, "Widget");
    let annotated2 = annotate(&input, "Gadget");
    // Типаж-объект ведёт себя так, как мы ожидаем 
    assert_eq!("Widget", annotated1.type_name());
    assert_eq!("Gadget", annotated2.type_name());
    let unwrapped1 = annotated1.as_string();
    let unwrapped2 = annotated2.as_string();
    // Это физически всё та же строка -- сравниваем указатели
    assert_eq!(unwrapped1 as *const String, &input as *const String);
    assert_eq!(unwrapped2 as *const String, &input as *const String);
}

Approach number 1: and after us even the flood!


To begin with we will try to make absolutely naive implementation. Just wrap our data in a "wrapper", which will additionally contain type_name:


struct Wrapper<'a> {
    value: &'a String,
    type_name: String,
}
impl<'a> Object for Wrapper<'a> {
    fn type_name(&self) -> &str {
        &self.type_name
    }
    fn as_string(&self) -> &String {
        self.value
    }
}

So far, nothing special. Just like in Java. But we do not have a garbage collector, where will we store this wrapper? We need to return the link, so that it remains valid after the function call annotate. Nothing terrible shove in the box ( Box), so that the wrapper ( Wrapper) was allocated on the heap. And then return the link to it. And in order to keep the wrapper alive after calling the function annotate, we will “leak” this box :


fn annotate<'a>(input: &'a String, type_name: &str) -> &'a dyn Object {
    let b = Box::new(Wrapper {
        value: input,
        type_name: type_name.into(),
    });
    Box::leak(b)
}

... and the test passes!


But this is some kind of dubious decision. Not only do we still allocate memory for each "annotation", so also the memory leaks ( Box::leakreturns a reference to the data stored on the heap, but at the same time "forgets" the box itself, that is, there will be no automatic release).


Approach number 2: arena!


To begin with, let's try to save these wrappers somewhere so that they would still be released at some point. But at the same time keeping the signature annotateas it is. That is, to return a link with reference counting (for example, Rc<Wrapper>) does not fit.


The simplest option is to create an auxiliary structure, a “type system”, which will be responsible for storing these wrappers. And when we finish, we will free this structure and all the wrappers with it.


Something like this. A library typed-arenafor storing wrappers is used, but it was possible to get along with the type Vec<Box<Wrapper>>, most importantly, to ensure that it Wrapperdoes not move anywhere (in the nightly Rust, you could use the pin API for this ):


struct TypeSystem {
    wrappers: typed_arena::Arena<Wrapper>,
}
impl TypeSystem {
    pub fn new() -> Self {
        Self {
            wrappers: typed_arena::Arena::new(),
        }
    }
    /// Результат заимствует из параметра `input`, и при этом должен жить меньше,
    /// чем система типов (иначе возможна ситуация, когда все обёртки высвободятся,
    /// а при этом на них ещё будут ссылки)!
    pub fn annotate<'a: 'b, 'b>(
        &'a self,
        input: &'b String,
        type_name: &str
    ) -> &'b dyn Object {
        self.wrappers.alloc(Wrapper {
            value: input,
            type_name: type_name.into(),
        })
    }
}

But where did the parameter responsible for the type link lifetime go Wrapper? We had to get rid of it, since we cannot attribute any fixed lifetime to the type typed_arena::Arena<Wrapper<'?>>. Each wrapper has a unique parameter depending on input!


Instead, we sprinkle a little insecure Rust to get rid of the lifetime parameter:


struct Wrapper {
    value: *const String,
    type_name: String,
}
impl Object for Wrapper {
    fn type_name(&self) -> &str {
        &self.type_name
    }
    /// Эта конверсия -- безопасна, так как мы гарантируем (через сигнатуру
    /// `annotate`), что ссылка на обёртку (как часть ссылки на типаж-объект
    /// `&Object`) живет меньше, чем ссылка на сами данные (`String`).
    fn as_string(&self) -> &String {
        unsafe { &*self.value }
    }
}

And the tests pass again, thereby giving us confidence in the correctness of the decision. In addition to feeling a little awkwardness due to unsafe(as it should be, with an unsafe Rust it's better not to joke!).


But still, what about the promised version, which does not require additional memory allocations for wrappers?


Approach # 3: let the gates of hell open


Idea. For each unique "type" ("Widget", "Gadget"), we will create a virtual table. Hands, during the execution of the program. And we assign it to the link given to us to the data itself (which we have, as we remember, is simple String).


First, a small description of what we need to get. So, the link to the type of object, how is it arranged? In fact, these are just two pointers, one to the data itself, and the other to a virtual table. So we write:


#[repr(C)]
struct TraitObject {
    pub data: *const (),
    pub vtable: *const (),
}

( #[repr(C)]we need to guarantee the correct location in memory).


It seems everything is simple, we will generate a new table for the specified parameters and "collect" a link to the type-object! But what does this table consist of?


The correct answer to this question would be “this is the implementation detail”. But we will do so; create a file rust-toolchainin the root of our project and write to: nightly-2018-12-01. After all, a fixed assembly can be considered stable, right?


Now that we have fixed the Rust version (in fact, we will need the nightly build for one of the libraries just below).


After some searching on the Internet , we find out that the table format is simple: first there is a link to the destructor, then two fields related to memory allocation (type size and alignment), and then functions follow one after another (the order is at the discretion of the compiler, but we have only two functions, so the probability of guessing is quite large, 50%).


So we write:


#[repr(C)]
#[derive(Clone, Copy)]
struct VirtualTableHeader {
    destructor_fn: fn(*mut ()),
    size: usize,
    align: usize,
}
#[repr(C)]
struct ObjectVirtualTable {
    header: VirtualTableHeader,
    type_name_fn: fn(*const String) -> *const str,
    as_string_fn: fn(*const String) -> *const String,
}

Similarly, it is #[repr(C)]necessary to ensure the correct location in memory. I divided into two structures, a little later it will be useful to us.


Now let's try to write our type system, which will provide the function annotate. We will need to cache the generated tables, so we will get the cache:


struct TypeInfo {
    vtable: ObjectVirtualTable,
}
#[derive(Default)]
struct TypeSystem {
    infos: RefCell<HashMap<String, TypeInfo>>,
}

We use the internal state RefCellso that our function TypeSystem::annotatecan receive &selfas a shared link. This is important, since we “borrow” y TypeSystemto ensure that the virtual tables we have generated have lived longer than the link to the type-object that we return from annotate.


Since we want to annotate a lot of instances, we cannot borrow &mut selfas a changeable link.


And we outline this code:


impl TypeSystem {
    pub fn annotate<'a: 'b, 'b>(
        &'a self,
        input: &'b String,
        type_name: &str
    ) -> &'b dyn Object {
        let type_name = type_name.to_string();
        let mut infos = self.infos.borrow_mut();
        let imp = infos.entry(type_name).or_insert_with(|| unsafe {
            // Откуда мы её возьмём, эту таблицу?
            let vtable = unimplemented!();
            TypeInfo { vtable }
        });
        let object_obj = TraitObject {
            data: input as *const String as *const (),
            vtable: &imp.vtable as *const ObjectVirtualTable as *const (),
        };
        // Сконвертируем сконструированную структуру в ссылку на типаж-объект
        unsafe { std::mem::transmute::<TraitObject, &dyn Object>(object_obj) }
    }
}

Where do we get this table from? The first three entries in it will match the entries for any other virtual table for a given type. Therefore, just take and copy them. At first we will get this type:


trait Whatever {}
impl<T> Whatever for T {}

It is useful to us to get this "any other virtual table". And then, we copy these three records from him:


let whatever = input as &dyn Whatever;
let whatever_obj = std::mem::transmute::<&dyn Whatever, TraitObject>(whatever);
let whatever_vtable_header = whatever_obj.vtable as *const VirtualTableHeader;
let vtable = ObjectVirtualTable {
    // Скопируем записи!
    header: *whatever_vtable_header,
    type_name_fn: unimplemented!(),
    as_string_fn: unimplemented!(),
};
TypeInfo { vtable }

In principle, the size and alignment we could get through std::mem::size_of::<String>()and std::mem::align_of::<String>(). But from where it is still possible to “steal” the destructor, I do not know.


OK, but where do we get the addresses of these functions, type_name_fnand as_string_fn? You may notice that as_string_fnin general it is not needed, a pointer to the data is always the first entry in the type-object representation. That is, this function is always the same:


impl Object for String {
    // ...
    fn as_string(&self) -> String {
        self
    }
}

But with the second function is not so easy! It depends on our name "type" type_name,.


It does not matter, we can simply generate this function in runtime. Take for this library dynasm(currently requires the nightly build Rust). Read about
function calling conventions .


For simplicity, suppose that we are only interested in Mac OS and Linux (after all these fun transformations, compatibility doesn't really bother us anymore, right?). And, yes, only x86-64, of course.


The second function as_string,, is easy to implement. We are promised that the first parameter will be in the register RDI. And return value to RAX. That is, the function code will be something like:


dynasm!(ops
    ; mov rax, rdi
    ; ret
);

But the first function is a little trickier. First, we need to return &str, and this is a thick pointer. Its first part is a pointer to a string, and the second part is the length of a string slice. Fortunately, the convention above allows you to return 128-bit results using the register EDXfor the second part.


It remains to get somewhere a link to a string slice that contains our string type_name. type_nameWe do not want to rely on (although through annotations of the lifetime you can guarantee that it type_namewill live longer than the returned value).


But we have a copy of this string, which we put in the hash table. Crossing our fingers, we will assume that the location of the string slice that will String::as_strnot return will not change from moving the line itself String(and Stringwill move during the size change HashMap, where this line is stored by the key). I don’t know if the standard library guarantees this behavior, but how can we play just?


We get the necessary components:


let type_name_ptr = type_name.as_str().as_ptr();
let type_name_len = type_name.as_str().len();

And we write this function:


dynasm!(ops
    ; mov rax, QWORD type_name_ptr as i64
    ; mov rdx, QWORD type_name_len as i64
    ; ret
);

And finally, the final code annotate:


pub fn annotate<'a: 'b, 'b>(&'a self, input: &'b String, type_name: &str) -> &'b Object {
    let type_name = type_name.to_string();
    // Запоминаем расположение и длину строкового слайса
    let type_name_ptr = type_name.as_str().as_ptr();
    let type_name_len = type_name.as_str().len();
    let mut infos = self.infos.borrow_mut();
    let imp = infos.entry(type_name).or_insert_with(|| unsafe {
        let mut ops = dynasmrt::x64::Assembler::new().unwrap();
        // Создаём код для функции `type_name`
        let type_name_offset = ops.offset();
        dynasm!(ops
            ; mov rax, QWORD type_name_ptr as i64
            ; mov rdx, QWORD type_name_len as i64
            ; ret
        );
        // Создаём код для функции `as_string`
        let as_string_offset = ops.offset();
        dynasm!(ops
            ; mov rax, rdi
            ; ret
        );
        let buffer = ops.finalize().unwrap();
        // Копируем части из аналогичной таблицы
        let whatever = input as &dyn Whatever;
        let whatever_obj =
            std::mem::transmute::<&dyn Whatever, TraitObject>(whatever);
        let whatever_vtable_header =
            whatever_obj.vtable as *const VirtualTableHeader;
        let vtable = ObjectVirtualTable {
            header: *whatever_vtable_header,
            type_name_fn: std::mem::transmute(buffer.ptr(type_name_offset)),
            as_string_fn: std::mem::transmute(buffer.ptr(as_string_offset)),
        };
        TypeInfo { vtable, buffer }
    });
    assert_eq!(imp.vtable.header.size, std::mem::size_of::<String>());
    assert_eq!(imp.vtable.header.align, std::mem::align_of::<String>());
    let object_obj = TraitObject {
        data: input as *const String as *const (),
        vtable: &imp.vtable as *const ObjectVirtualTable as *const (),
    };
    unsafe { std::mem::transmute::<TraitObject, &dyn Object>(object_obj) }
}

For the purposes dynasmyou need to add a field bufferto our structure TypeInfo. This field controls the memory that stores the code of our generated functions:


#[allow(unused)]
buffer: dynasmrt::ExecutableBuffer,

And all the tests pass!


Done, master!


It’s so easy and natural to generate your type object implementations in the Rust code!


The latter solution actively relies on implementation details and is therefore not recommended for use. But in reality you have to do what is necessary. Desperate times require desperate measures!


There is, however, (yet) one feature that I rely on here. Namely, it is safe to release the memory occupied by the virtual table after there are no references to the type-object that uses it. On the one hand, it is logical that a virtual table can be used only through type-object references. On the other hand, the tables provided by Rust have a lifetime 'static. It can be assumed that some code separates the table from the link for some of its own purposes (for example, for some dirty work ).


Source code can be found here .


Also popular now: