Phantom: the big garbage collection
This article is a sequel; the first part is here. For those who did not click the link, a brief introduction:
We discuss garbage collection in the Phantom operating system, that is, in a virtual (bytecode) machine environment running on top of persistent RAM. The size of persistent memory is on the order of disk size: a few terabytes today and, potentially, tens or hundreds of terabytes tomorrow.
Since we are talking about virtual memory, a significant share of the objects is not in RAM at any given time, regardless of which algorithm or approach we choose. In other words, the cost of accessing an object is high: it is, in general, a disk operation.
In addition, it should be expected that in a network environment collections of such virtual machines will exchange proxy objects, so there will be an object graph stretched across many machines on the network, and in all this nightmare it will eventually be necessary to collect garbage not just on a single machine but across the network as well.
My idea of a garbage collection scheme for such an environment is a combination of two collectors.
The first is fast and deterministic, but does not guarantee 100% collection. It is currently implemented as reference counting: an imperfect but fairly simple and predictable model.
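Purely for illustration, here is a minimal sketch of what that fast path might look like; the structure layout and names below are assumptions, not the actual Phantom code:

```c
/* Hypothetical sketch of the fast, reference-counting collector path.
 * The structure layout and names are illustrative, not Phantom's real ones. */
typedef struct object {
    int            refcount;
    struct object *next_free;   /* link for the deferred-free queue */
    /* ... payload ... */
} object_t;

static object_t *deferred_free;  /* freed only at a safe point, see below */

static void ref_dec(object_t *o)
{
    if (--o->refcount == 0) {
        /* Cannot free immediately: all VM threads must first cross an
         * instruction boundary (discussed later in the article), so the
         * object is only queued for deletion here. */
        o->next_free  = deferred_free;
        deferred_free = o;
    }
}
```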
The second is slow and non-deterministic (its running time is hard to predict), but precise and uncompromising. In a traditional environment such a collector would require stopping the world, but there is a trick: if collection is performed on a full copy of the system state, then everything that is garbage in the copy is also garbage in any later version of that state, so it can safely be removed as if the delay never happened. The beauty of the approach is that Phantom implements persistence precisely through "snapshots" - full copies of the state of the virtual machine's object memory. In other words, an environment in which garbage can be collected calmly and deliberately already exists.
At this point the task seems trivial: garbage can be collected on the "immobilized patient" in the simplest, most straightforward way. Right? Apparently not.
The first question is where to store intermediate information about the state of the collection, and how much of it there will be. We must assume the worst: the system has driven itself into a corner and used up all of its disk space. Exactly how much space should we reserve for the collector so that it can finish the job?
The second question is restartability of the algorithm. It would be tempting to implement the collector as an ordinary program under Phantom, which would therefore live in persistent memory; OS restarts would be invisible to it, the entire state of the program would be preserved, and garbage collection would simply continue after the system restarts. But because of the first requirement such an implementation can be dangerous: if memory runs low, a user process can "eat it up" and the garbage collector will grind to a halt. This could be solved by quotas on memory allocation, but the current version of the OS does not have them; and in general, while the solution looks very elegant, from a debugging point of view it would most likely be hell.
Therefore, it would be nice to base the garbage collector on some simple data structure that is easy to store and process in a linear form. For example, a singly linked list organized as a task list, from which the algorithm takes atomic tasks and to which it adds new tasks while solving the task it has taken.
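A minimal sketch of such a task list; for brevity it is shown here as a flat array in a linearly written region rather than a linked list, and every name and layout detail is an assumption:

```c
#include <stdint.h>

/* A restartable work list kept in a flat, linearly written region of
 * persistent storage.  Names and layout are hypothetical. */
typedef struct gc_task {
    uint64_t obj_addr;            /* address of the object to visit */
} gc_task_t;

typedef struct gc_tasklist {
    uint64_t  head;               /* next task to take              */
    uint64_t  tail;               /* next free slot to append to    */
    gc_task_t tasks[1u << 20];    /* backing array in the GC area   */
} gc_tasklist_t;

/* Append a task: write the payload first, then advance the tail.  If the
 * collector dies between the two writes, the task is simply not yet
 * published and the list stays consistent. */
static void tasklist_push(gc_tasklist_t *l, uint64_t obj_addr)
{
    l->tasks[l->tail].obj_addr = obj_addr;
    l->tail++;                    /* the publishing write comes last */
}

/* Take one task; returns 0 if the list is empty. */
static int tasklist_pop(gc_tasklist_t *l, uint64_t *out_addr)
{
    if (l->head == l->tail)
        return 0;
    *out_addr = l->tasks[l->head].obj_addr;
    l->head++;
    return 1;
}
```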
At the same time, it would be good to avoid modifying the original snapshot (the copy of the system's memory), and to make updates to the list itself atomic, so that shutting the algorithm down at any point does not corrupt it.
A naive implementation suggests itself: a roots list of objects to traverse, a visited list of objects already seen, and an algorithm that reduces to the following (a code sketch follows the list):
- If roots and visited are both empty, put the object at address zero into roots.
- If roots is not empty, take an address out of it; if that object is not in visited, add it to visited and add all of its children to roots.
- If roots is empty and visited is not, walk linearly through all objects in the snapshot address space; any object that does not appear in visited is marked as garbage in the current address space (this pass can also be made restartable by periodically recording the address up to which we have scanned).
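Tying the three rules together, and reusing the task-list sketch above, the mark-and-sweep pass might look roughly like this; visited_*, object_children, first_object, next_object and mark_garbage are hypothetical helpers standing in for the real object-memory interfaces:

```c
#include <stdint.h>
#include <stddef.h>

#define MAX_CHILDREN 64           /* arbitrary bound, just for the sketch */

/* Hypothetical helpers over the snapshot's object memory and the
 * visited structure; declared only so the sketch is self-contained. */
extern int      visited_empty(void);
extern int      visited_contains(uint64_t addr);
extern void     visited_add(uint64_t addr);
extern size_t   object_children(uint64_t addr, uint64_t *out, size_t max);
extern uint64_t first_object(void);
extern uint64_t next_object(uint64_t addr);
extern void     mark_garbage(uint64_t addr);   /* in the *current* space */

void gc_collect(gc_tasklist_t *roots)
{
    uint64_t addr;

    /* Rule 1: seed the traversal with the object at address zero. */
    if (roots->head == roots->tail && visited_empty())
        tasklist_push(roots, 0);

    /* Rule 2: the mark phase, driven by the roots list. */
    while (tasklist_pop(roots, &addr)) {
        if (visited_contains(addr))
            continue;
        visited_add(addr);

        uint64_t children[MAX_CHILDREN];
        size_t n = object_children(addr, children, MAX_CHILDREN);
        for (size_t i = 0; i < n; i++)
            tasklist_push(roots, children[i]);
    }

    /* Rule 3: the sweep - a linear scan of the snapshot address space. */
    for (addr = first_object(); addr != 0; addr = next_object(addr))
        if (!visited_contains(addr))
            mark_garbage(addr);
}
```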
Naturally, this is a very inefficient algorithm, but it can be optimized in fairly trivial ways. For example: make roots a stack rather than a queue and keep the tail of that stack in memory, so that long runs of leaf-object traversals happen without touching the on-disk part of the stack at all. The only important thing is that objects move into the on-disk part only as a whole and leave it only after all of their children have been traversed.
The problem is that every step of this algorithm requires searching all of visited, which threatens to be the collapse of all hopes, a fiasco.
The trouble is that visited cannot simply be cached in memory: each lookup has to examine it in full. The obvious idea suggests itself: make it not a list but a tree sorted by object address. Then looking up an object becomes logarithmic and, most importantly, fragments of the tree can be cached, since a full scan is no longer needed.
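For example, visited could be a B-tree-like structure keyed by object address; here is a sketch of what a lookup might look like, with load_node standing in for a hypothetical page-cache wrapper (names and fan-out are made up):

```c
#include <stdint.h>

#define FANOUT 256                 /* illustrative node fan-out */

/* A node of the hypothetical address-sorted tree backing `visited`. */
typedef struct vnode {
    uint32_t nkeys;
    uint64_t keys[FANOUT];         /* sorted object addresses              */
    uint64_t children[FANOUT + 1]; /* disk addresses of child nodes,
                                      all zero in leaf nodes               */
} vnode_t;

/* Hypothetical page-cache wrapper: reads the node from disk only if it
 * is not already resident, so hot fragments of the tree stay in RAM. */
extern vnode_t *load_node(uint64_t node_addr);

static uint64_t visited_root;      /* disk address of the tree root */

int visited_contains(uint64_t obj_addr)
{
    uint64_t cur = visited_root;
    while (cur != 0) {
        vnode_t *n = load_node(cur);
        uint32_t i = 0;
        while (i < n->nkeys && n->keys[i] < obj_addr)
            i++;                   /* linear scan inside one node          */
        if (i < n->nkeys && n->keys[i] == obj_addr)
            return 1;              /* found                                */
        cur = n->children[i];      /* descend; zero terminates at a leaf   */
    }
    return 0;
}
```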
By the way, if you have ideas about such an algorithm, please write.
A separate question arises in the event of a disk I/O failure on some page. Strictly speaking, in such a situation garbage collection is impossible in general. Although, for this snapshot, the loss of the page (and therefore of the references it contains) makes some objects unreachable, i.e. formally garbage, it would be short-sighted to simply destroy them all.
First, it would be nice to have some analogue of lost+found for such situations, although it is entirely unclear how to implement it. Second, in the currently running system the corresponding page may be perfectly alive. So it seems fair to act like this: check whether the page is present in later snapshots or in memory. If it is in memory, force it into a snapshot even if it has not changed (normally Phantom, of course, does not rewrite unchanged pages), stop the collection and restart it on a later snapshot. If we are unlucky and the page is nowhere to be found, enable recovery mode: at the end of the usual pass, heuristically search the garbage for object subtrees of noticeable size, exclude them from the garbage and "hang" them in a special place in the object hierarchy.
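To make the policy concrete, here is a sketch of that decision procedure; every function in it is a hypothetical placeholder:

```c
#include <stdint.h>

/* Hypothetical placeholders for the checks and actions described above. */
extern int  page_in_later_snapshot(uint64_t page_addr);
extern int  page_in_memory(uint64_t page_addr);
extern void force_page_into_next_snapshot(uint64_t page_addr);
extern void abort_collection_and_restart_on_later_snapshot(void);
extern void enable_recovery_mode(void);

/* What the collector might do when a page fails to read from disk. */
void on_page_read_error(uint64_t page_addr)
{
    if (page_in_memory(page_addr)) {
        /* The live system still has the page: push it into the next
         * snapshot even though it is clean, then retry the collection
         * on that later snapshot. */
        force_page_into_next_snapshot(page_addr);
        abort_collection_and_restart_on_later_snapshot();
    } else if (page_in_later_snapshot(page_addr)) {
        /* A later snapshot already has it: just move the collection there. */
        abort_collection_and_restart_on_later_snapshot();
    } else {
        /* The page is truly lost: finish the normal pass, then look through
         * the "garbage" for sizable object subtrees and hang them under a
         * lost+found-like node instead of destroying them. */
        enable_recovery_mode();
    }
}
```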
What else is important?
In general, Phantom's virtual persistent memory subsystem and its bytecode virtual machine (the object environment) do not know a damn thing about each other. The second lives inside the first. And that's it.
But there is one fairly typical case that requires a bridge between them. The case looks simple: between two snapshots, a program in Phantom OS allocates, uses, and frees a couple of gigabytes of objects. For example, it renders graphics and creates temporary data along the way. By the time the snapshot starts, that data is no longer needed or relevant. But the memory it occupied has been "touched", i.e. modified. From the snapshot's point of view, that is a reason to write this memory into the snapshot. In reality its contents are of no interest to anyone anymore and, moreover, should be zeroed out just to be safe. It would be logical, when freeing a large region of memory, to tell the paging subsystem that this region is not merely not dirty, but is not needed at all and should not be saved; and when restoring from a snapshot it should read back as pages of zeros.
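A hypothetical interface for such a hint (this call does not exist in Phantom today; the name and signature are made up for illustration):

```c
#include <stddef.h>

/* Hypothetical hint to the paging subsystem: the region is neither dirty
 * nor needed, must not be written into the snapshot, and should read back
 * as zero-filled pages after a restore from that snapshot. */
void vm_region_discard(void *start, size_t length);

/* Possible use in the object allocator when a large temporary arena dies
 * between snapshots. */
void arena_destroy(void *arena_base, size_t arena_size)
{
    /* ... unlink the arena from the allocator's bookkeeping ... */
    vm_region_discard(arena_base, arena_size);
}
```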
This again seems trivial, but no. The previous article discussed the problem of deleting objects whose reference counter has dropped to zero, and the fact that this can only be done after all virtual machine threads have crossed an instruction boundary.
Technically, the easiest way is to perform the deletion right after a snapshot.
Why? Because a snapshot is guaranteed to suspend all threads at a bytecode instruction boundary. Why after, and not during? Because the synchronous part of the snapshot must be as quick as possible, so that it is invisible to the running programs.
And afterwards is too late, because by then the "unneeded" pages will nevertheless have ended up in the snapshot.
In the end, this means that a not entirely obvious chain of operations has to be performed (sketched in code after the list):
- Suspend all threads at an instruction boundary and immediately "release" them
- Free the objects with a zero reference counter that were scheduled for deletion before this stop (checking that the counter is still zero)
- Suspend all threads at an instruction boundary again
- Run the synchronous part of the snapshot (in memory)
- "Release" the stopped threads
- Quietly complete the asynchronous part of the snapshot (the I/O)
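A sketch of this sequence in code; each call below is a hypothetical placeholder for the corresponding kernel step:

```c
/* Hypothetical placeholders for the kernel-level steps listed above. */
extern void vm_threads_stop_at_instruction_boundary(void);
extern void vm_threads_resume(void);
extern void free_deferred_zero_refcount_objects(void); /* re-checks counters */
extern void snapshot_do_synchronous_part(void);        /* in memory, fast    */
extern void snapshot_do_asynchronous_part(void);       /* the I/O            */

void snapshot_with_deferred_deletes(void)
{
    /* Steps 1-2: a stop purely to establish that every thread has crossed
     * an instruction boundary since the objects were queued for deletion;
     * the threads are released immediately, then the queued objects are
     * freed. */
    vm_threads_stop_at_instruction_boundary();
    vm_threads_resume();
    free_deferred_zero_refcount_objects();

    /* Steps 3-5: stop again, do the fast in-memory (synchronous) part of
     * the snapshot, and release the threads. */
    vm_threads_stop_at_instruction_boundary();
    snapshot_do_synchronous_part();
    vm_threads_resume();

    /* Step 6: the asynchronous I/O part finishes quietly in the background. */
    snapshot_do_asynchronous_part();
}
```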
On top of all this we also need to protect the snapshot on which garbage collection is running from deletion, as well as a mechanism that keeps at least three partially locked snapshots:
- The old one, on which garbage is being collected
- A more recent one, which is complete and can be restarted from
- The latest one, which is still being formed
And possibly a few more, kept in backup / time machine mode.
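Something as simple as a role tag per snapshot might be enough to express this; the enum below is only an illustration:

```c
/* Illustrative roles of the snapshots that must be protected from deletion. */
typedef enum snap_role {
    SNAP_GC_TARGET,    /* the old snapshot garbage is being collected on       */
    SNAP_RESTARTABLE,  /* the most recent complete snapshot, the restart point */
    SNAP_FORMING,      /* the snapshot currently being formed                  */
    SNAP_BACKUP        /* optional older copies kept time-machine style        */
} snap_role_t;
```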
Let's put a semicolon here rather than a full stop, and ask a question: what statistics about the state of the object environment would you consider useful to gather during garbage collection? We are traversing the objects anyway, so one kind of analysis or another comes almost for free.