Simple object DBMS
Within the framework of one project, the task arose of long-term storage of logically connected data objects, with multi-user access to their contents. There are various ways to meet this need with existing data management systems. Nevertheless, a search was made for a simple and efficient solution, and its results are offered here for consideration.
This article discusses the general logic of data management, without diving into the details of the software implementation, which are often self-evident.
According to the conditions of the task, the object database management system, or rather the part of it responsible for multi-user access, operates on a homogeneous set of isolated objects. Note that the most diverse information entities can take the unified form of an object: data, metadata, lists, transactions, scenario resources, documents, and others.
Data object
Initially, the only things known about an object are that it is serialized for long-term storage on disk and that it consists of two parts: a header and the actual content.
The object header has a fixed length and holds service information. In particular, the header stores the full length of the object in bytes, the object's own descriptor, and its state number.
A priori, the object contains a set of values identified by their ordinal number in the set. The object does not interpret its values in any way, but it “knows” that each value is characterized by a length in bytes, which allows the size of the whole object to be calculated. The set of values is stored in a tuple format.
Identification and Access
To store objects, a conditionally infinite file space is used, logically divided into clusters. In file storage, each object occupies one or more consecutive clusters. The ordinal number of the first cluster is used as the file pointer FP (File Pointer) that locates the object in the storage.
For long-term storage of file pointers, the DAT (Data Allocation Table) is used: a simple, dynamically expandable array of integer pointers. DAT cell indices serve as the system object identifiers (IDO). When a new object is created, it is allocated the next free DAT cell, whose index becomes the permanent identifier of the object. This identifier is a unique, global descriptor of the object within a physical database.
When the system starts, the DAT is loaded from storage into RAM and is used to organize quick access to an object by its IDO, according to the following scheme.
If the value extracted from the DAT is a file pointer, the object is loaded from storage into memory (the object Cache), and the contents of the DAT cell are replaced by a memory pointer A*. The reverse replacement occurs when the object is evicted from memory.
Note: the memory pointer A* is not an absolute address but an offset relative to the beginning of the Cache, and it points directly at the contents of the object. The object header, as well as the service fields intended for temporary storage of the FP and for chaining objects together, are located at negative offsets relative to A*. Notably, the value A* also serves as the identifier of the object in memory.
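For illustration, here is a minimal C sketch of this resolution step. The article does not specify how a file pointer is distinguished from a memory pointer within a DAT cell, so a tag bit is assumed; all names and sizes here are hypothetical.

```c
#include <stdint.h>

#define DAT_SIZE  1024
#define IN_MEMORY (1ULL << 63)       /* assumed tag: cell holds A*, not FP */

static uint64_t dat[DAT_SIZE];       /* the DAT, loaded into RAM at startup */
static uint8_t  cache[1 << 20];      /* the object Cache                    */

/* Hypothetical loader: reads the object at file pointer fp into the Cache
   and returns its A* offset, which points at the object contents. */
static uint64_t cache_load(uint64_t fp) { (void)fp; return 0; }

/* Resolve an IDO to the in-memory contents of its object. */
static void *resolve(uint32_t ido)
{
    uint64_t cell = dat[ido];
    if (!(cell & IN_MEMORY)) {          /* cell still holds a file pointer */
        uint64_t a = cache_load(cell);  /* load the object from storage    */
        dat[ido] = a | IN_MEMORY;       /* FP -> A*; eviction reverses it  */
        return cache + a;
    }
    return cache + (cell & ~IN_MEMORY); /* A* is an offset into the Cache  */
}
```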
Object Cache
The Cache is a contiguous region of memory statically allocated during system initialization; the required size is set as a configuration option.
The main tasks of the Cache are to stay as full as possible and to quickly allocate the space required to place an object. For these purposes, the chain of free memory blocks is kept ranked, and eviction of long-unused objects occurs only when it is the only way to obtain a free block of the required size. When memory is allocated to an object, the reserve needed for the service fields is automatically included; those same fields are also used to manage free memory.
States and Transactions
In the absence of external influences, the complete set of objects that make up the database retains its state unchanged indefinitely. Any action that extracts the contents of objects without changing the state of the database is hereinafter called a selection.
An external action that changes the state of the database is considered a transaction.
A transaction creates new objects or changes the contents of existing ones. The mechanism for making changes is as follows: first a copy of the object is created, acting as its newest version, and the changes are made to that copy. The totality of newly created objects and modified copies forms the set of transaction objects. Accordingly, the new state of the database is the transaction objects plus the objects of the previous state, with the superseded versions ignored. De facto, the sequential transaction number serves as the database state number.
Under multi-user access, a certain effort is required to maintain the logical integrity of the data, both during transaction execution and during selection.
Data integrity
The concept of transactional integrity is well known: “all or nothing.”
A new database state is generated only if the transaction completes successfully; at that moment the transaction objects become publicly available. The new state is committed when the user closes the transaction session, which must be opened before the transaction starts. Until the transaction session completes, the objects it created or modified are available exclusively to the user who opened the session. Any interruption of a transaction session, whatever the reason, simply destroys the objects the session generated.
In addition, one more obvious rule must be observed: a transaction started later than another cannot complete before that earlier, higher-priority transaction completes.
The need to maintain the integrity of selections is far less obvious. Consider an example.
An operational financial report is being generated, which requires a very extensive, long-running selection of data. Meanwhile, the database continuously changes state under the flow of transactions. Without restrictions, there is a distinctly non-zero probability that the report balance will not converge, because only part of an otherwise correct, successfully completed posting fell into the selection (by chance, at the intersection of time intervals). To prevent such a collision, a simple rule must be followed: any selection must be performed against an unchanged database state. The user fixes a state by opening a selection session; within that session, all subsequent database states, that is, newer versions of objects, are ignored.
Thus, at any given moment a user is either doing nothing or is in the middle of one of two sessions: a transaction session or a selection session.
State objects
At least one database state, the background state, always exists and is always current. The background state is formed by the complete set of objects directly addressed from the DAT, some of which remain on disk while others are loaded into memory.
The dynamics of a multi-user process are such that users, by executing transactions, generate a sequence of new temporary database states. Upon successful completion of a transaction session, the temporary state it generated becomes publicly available. When a selection session is opened, the user is given the available state with the highest number. After existing for a while, temporary states no longer used by any selection are sequentially absorbed into the background state.
Any state, including the background one, owns a set of state objects linked into a chain of the same name. Note: the objects of temporary states are the transaction objects mentioned above, made current by the successful completion of a transaction session. In the background state's chain, objects are ordered by how long they have gone unused; accessing an object's contents automatically moves it to the end of the chain, so the objects at the head of the chain are the candidates for eviction from the Cache.
It was mentioned earlier that an attempt to modify an existing object automatically spawns a new version of it, so several versions of the same object can reside in memory at the same time. These versions are linked into a version chain. The pointer (A*) to the first object in the version chain sits in the DAT, and the chain itself lets the user reach the “correct” version of the object for the required state. The correct version (current from the user's point of view) is the one with the highest state number not exceeding the required one.
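As a sketch, the version lookup might look like this in C; the header fields are assumptions (the article only says the chain starts at the DAT pointer and that each version carries a state number).

```c
#include <stdint.h>
#include <stddef.h>

/* Assumed in-memory header of one object version; in the real system such
   fields live at a negative offset from A*. */
typedef struct version {
    struct version *next;   /* version chain, newest version first  */
    uint64_t        state;  /* state number this version belongs to */
} version;

/* Return the version a session must see: the one with the highest state
   number not exceeding the state the session has opened. */
static version *select_version(version *newest, uint64_t session_state)
{
    for (version *v = newest; v != NULL; v = v->next)
        if (v->state <= session_state)
            return v;
    return NULL;   /* the object did not yet exist in this state */
}
```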
Objects are thus distributed across the states, linked together by the state and version chains.
The absorption of a state is initiated by the last session that used it, at session close. When the next temporary state is absorbed, the background state first removes from memory the outdated versions of objects for which the absorbed state holds a newer version, and then simply splices the absorbed state's object chain onto its own.
To control multi-user access to database states and their objects, the state table ST is used.
State table
Each ST record contains a pointer (A*) to the first object in the state's object chain, the user and lock object identifiers, and a counter of the users currently using this state.
Three external pointers operate over the ST table, each holding a full (absolute) database state number. If the table size is a power of two, using the least significant bits of the absolute number as an index into ST makes the pointers move through the table circularly.
The Background State pointer (BS) contains the background state number. When the next temporary state is absorbed, BS is incremented. The absorption condition is that the usage counters of two states at once, the background and the next one, are both zero. The condition is checked whenever any session closes, after its usage counter has been decremented.
The Last Available State pointer (LS) contains the number of the newest available state. This is the number handed to a user opening a selection session. When a transaction session closes, LS is incremented, automatically taking the number of that session's state.
The Next State pointer (NS) provides the state number to a user opening a transaction session, and is then incremented. If there are no open transaction sessions, NS exceeds LS by exactly 1. If there are no temporary states at all, BS and LS coincide.
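A compact sketch of the three pointers and the circular indexing over a power-of-two table; synchronization is omitted and the names are illustrative.

```c
#include <stdint.h>

#define ST_SIZE   64                                 /* power of two      */
#define ST_IDX(n) ((uint32_t)((n) & (ST_SIZE - 1)))  /* low bits as index */

typedef struct {
    uint32_t use_count;   /* sessions still using this state; the record also
                             holds the object chain A*, user id, lock id */
} st_record;

static st_record st[ST_SIZE];
static uint64_t  bs;       /* BS: background state number          */
static uint64_t  ls;       /* LS: newest publicly available state  */
static uint64_t  ns = 1;   /* NS: exceeds LS by 1 when system idle */

static uint64_t open_selection(void)   { st[ST_IDX(ls)].use_count++; return ls; }
static uint64_t open_transaction(void) { return ns++; }
static void     commit_transaction(uint64_t n) { ls = n; } /* state goes public */

/* Run after any session close decrements a usage counter: the background
   state absorbs the next temporary one only when both counters are zero. */
static void try_absorb(void)
{
    while (bs < ls && st[ST_IDX(bs)].use_count == 0
                   && st[ST_IDX(bs + 1)].use_count == 0)
        bs++;   /* merge the object chain of state bs+1 into the background */
}
```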
The state number obtained by the user when opening any session is stored in the corresponding record of the client table CT (Client Table). Every user call to the object service API carries a client identifier, and everything else is retrieved from the corresponding CT record.
Client table
The client's system identifier is the ordinal number of the record allocated to it in the Client Table at authorization. This table registers both the system resources allocated to the client, such as the TCP socket and thread handles, and the resources it uses within the data management system, in particular the number of the state the user has opened, as well as various control flags.
Conflict resolution
Recall: transactions, regardless of their duration and outcome, must complete strictly in the order in which they were started, in ascending order of their numbers. To enforce this ordering, a pool of named lock objects is used, created together with the ST table during system initialization.
Immediately upon opening a transaction session, a free lock object is requested from the pool; it is at once acquired by the user's thread and held until the session completes. The identifier of the acquired lock object is stored in the ST record of the corresponding state. After that, the record of the previous state is checked for an unfinished transaction, indicated by a still-current lock object identifier.
With parallel execution of transactions, there is always the unpleasant possibility that an earlier transaction changes the contents of an object after a later transaction has already used that object for its own purposes. The overhead of constantly monitoring for such a conflict is very high, and resolving it is possible only by re-executing all the later transactions.
If a previous transaction is still in progress, the current one changes its execution logic to avoid a conflict. Recall that disk access is the longest of all operations the object service performs. So while the previous transaction is executing, the current one merely simulates its own execution, without actually creating object copies or changing their contents; in the process, every object the transaction touches in any way is guaranteed to be loaded into the Cache. When the simulation finishes, the transaction re-checks the ST record of the previous state. If a lock object identifier is still found there, the transaction blocks in an attempt to acquire that lock. Once the previous transaction releases the lock object, the current one continues executing, this time for real.
Emergency situations
If something went seriously wrong (for example, a fatal error in the object service or a hardware failure), only auto-recovery from a checkpoint can save the database. In the milder case, when a client's thread crashed and never got back up, or looped forever, leaving its session open, a complete stall of the state pipeline can be prevented.
A hung transaction session is detected when a new transaction cannot obtain a free lock object from the pool, which for this purpose holds half as many lock objects as there are ST records. In this case, the problematic state is the one with index [LS + 1].
A hung selection session is detected only when all available ST records are exhausted, that is, when the BS and NS indices become equal. The hung state, at index [BS] or [BS + 1], will have a non-zero usage counter.
Whatever the cause of the session failure, the consequences are handled the same way: after the identifier of the “hung” user is obtained, its thread is forcibly stopped, all the resources it used are freed, and its session is forcibly closed, after which the pipeline drains automatically in the normal manner. All these actions are performed in the thread of the user who discovered the problem. Once recovery is complete, the hung user's thread is started again, and it makes at least one attempt to repeat the failed session. In the event of a second failure, the user's thread goes back to waiting for external events. The whole process is governed by flags set for the user in the Client Table.
Tuple of values
The content of an object is the set of its values, stored in a tuple format. The properties of the tuple allow it to serve as a universal way of organizing data, from the standpoint of both storage and access. It is worth mentioning that the memory manager (MM) of the object service, which underpins the operation of all its parts, including the object Cache, is oriented from the outset towards supporting the tuple format.
Logically, a tuple is a sequence of elements identified by their ordinal number within it. A tuple element contains some value, characterized by its length; the length of the value is known a priori and is placed in the value's header. A tuple is implemented as an array of relative pointers, each holding the offset of the start of a value relative to the start of the tuple. Both offsets and sizes are measured in bytes.
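One possible C rendering of this layout; the concrete field names are assumptions.

```c
#include <stdint.h>

/* A value: a length header followed by the value's bytes. */
typedef struct { uint32_t len; /* value bytes follow */ } val_hdr;

/* A tuple: an array of byte offsets, each relative to the tuple start.
   A zero offset marks an uninitialized element, and offsets may be
   negative when values are scattered in RAM (see the properties below). */
typedef struct {
    uint32_t count;      /* number of elements          */
    int32_t  off[];      /* offset of each value, bytes */
} tuple;

/* Address of the i-th value, or NULL if the element is uninitialized. */
static val_hdr *tuple_value(tuple *t, uint32_t i)
{
    if (i >= t->count || t->off[i] == 0)
        return NULL;
    return (val_hdr *)((char *)t + t->off[i]);
}
```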
A tuple has a number of remarkable properties.
First of all, a value in a tuple may itself be another tuple; from this point of view, the entire contents of an object can be treated as a single value. The length of any value is known from its header, which also means the number of elements in a tuple is always known.
The order of elements in a tuple is strict and immutable. The “insert” operation is prohibited, but new elements can safely be appended to the existing set.
In a tuple, an uninitialized value has a zero offset, which is what distinguishes it from an “empty” value of zero length. An uninitialized value can be operated on without conflict, including treating it as an “empty” value.
A data structure of arbitrary complexity can be placed in a tuple. Logically, the tuple format resembles XML, but with indices instead of tags and with the ability to operate on more than just text values. A single value within a complex structure can be addressed directly, using a sequence of indices (a route) as the address, and the addressing can also be relative to the owning tuple.
A tuple can create instances of itself. An instance differs from a copy in that the offsets of all its values that are not themselves tuples are zeroed. In other words, an instance is a copy of the structure alone.
A tuple of values can exist both in serialized form (a contiguous run of bytes, for storage on disk) and in an arbitrary form, where individual values of the tuple sit in different places in RAM; an offset relative to the start of the tuple may then also be negative.
The last two of these properties are actively used when an object is modified during a transaction session.
Object change
A change to an object in fact means a modification of one or more of its values. Although it was said earlier that a copy of the object is created to receive the changes, there is actually no need to copy the object in full. Instead of a copy, an instance of the object is created in the Cache, with the pointers in its tuple zeroed. When such an uninitialized value is accessed, the value retrieved from the previous version of the object is returned as the result (saving memory).
A new value assigned to the object is likewise placed in the object Cache, after which the offset of the new value relative to the object instance is written into the corresponding tuple element.
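Combining the instance idea with the version chain, the read path might look like the sketch below; `prev` is an assumed link to the previous version, and `tuple_value` repeats the earlier tuple example.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct { uint32_t len; } val_hdr;
typedef struct { uint32_t count; int32_t off[]; } tuple;

typedef struct obj {
    struct obj *prev;   /* previous version of the object (version chain)     */
    tuple      *t;      /* this version's tuple; all zeros in a fresh instance */
} obj;

static val_hdr *tuple_value(tuple *t, uint32_t i)
{
    if (i >= t->count || t->off[i] == 0) return NULL;
    return (val_hdr *)((char *)t + t->off[i]);
}

/* Read element i as the transaction sees it: an element left uninitialized
   in the new instance falls through to the previous version's value. */
static val_hdr *object_value(obj *o, uint32_t i)
{
    for (; o != NULL; o = o->prev) {
        val_hdr *v = tuple_value(o->t, i);
        if (v != NULL)
            return v;
    }
    return NULL;
}
```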
Then, if the transaction completes successfully, the transaction objects formed in this way must be unloaded to disk.
Saving Objects
The file space allocated for storing objects is not only clustered but also divided into data banks of 2^N clusters each. Each bank occupies one file, named after the bank's serial number. Thus, when an object on disk is accessed, its FP is converted first into a file name and then into a cluster number within that file.
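The conversion is a simple shift and mask; the sketch below assumes a bank of 2^13 clusters (the article does not fix N) and a file named by the bank number.

```c
#include <stdio.h>
#include <stdint.h>

#define BANK_SHIFT 13   /* N: a bank holds 2^13 clusters (assumed value) */

/* Split a file pointer into the bank file name and the cluster inside it. */
static void fp_locate(uint64_t fp, char *path, size_t size, uint64_t *cluster)
{
    uint64_t bank = fp >> BANK_SHIFT;             /* bank serial number  */
    *cluster = fp & ((1ULL << BANK_SHIFT) - 1);   /* cluster within file */
    snprintf(path, size, "%llu",                  /* file is named after */
             (unsigned long long)bank);           /* the bank number     */
}
```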
To minimize disk operations, as well as the time the system needs to recover automatically after a crash, all objects, regardless of where they originally resided, are saved into a single bank. For this purpose, a contiguous memory area of the appropriate size is reserved (32 MB is optimal), and transaction objects are written into this in-memory bank sequentially until it fills up. Before writing, the total length (in clusters) of all the objects is computed; if there is not enough free space in the in-memory bank, a new bank is requested from the system and the filled one is placed in the write thread's queue.
Transaction objects are saved when the transaction session closes. Writing an object into the bank starts at the beginning of the next free cluster, and a new FP is formed for the object. A new FP is not computed if the previous version of the object is already present in this bank and its size allows the new version to be written in its place. The object is written in full, with all the values of its tuple, both the modified ones and those borrowed from the previous version. During the write, the object is serialized, with new pointers (offsets) computed for its tuple.
After the session ends, the modified transaction objects in the Cache become publicly available as they are, that is, with incomplete tuples. These objects must eventually replace their previous versions.
Version merge
Naturally, if an object has a later version in memory, it cannot be evicted from the Cache to obtain free space, even if it sits at the very head of the eviction chain. A different removal order is provided for such an object; until then, the next object in the queue is evicted instead.
The previous version of an object (hereinafter version [0]) is erased from memory by its subsequent version [+1], and this happens during the absorption of the temporary state by the background one. There is one subtlety of execution here.
Version [0], usually loaded from disk, occupies a contiguous area in the Cache, while the tuple and new values of version [+1] sit in different places, increasing memory fragmentation. The scenario of absorbing version [+1] therefore looks more attractive than that of evicting version [0]. If a new value of version [+1] does not exceed the size of the existing one, it is simply copied into the body of version [0], and the memory it occupied is freed. Otherwise, the new value remains outside the object as is, a new offset is computed for it, and a flag is set on the object obliging the ordinary eviction process to check the object for fragments lying outside it.
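A sketch of the in-place merge of one value, under the assumed value layout from the tuple example; names are illustrative.

```c
#include <string.h>
#include <stdint.h>

/* A value: length header plus bytes (assumed layout, as above). */
typedef struct { uint32_t len; char data[]; } value;

enum { HAS_OUTSIDE_FRAGMENTS = 1u };   /* flag checked by the eviction path */

/* Merge one new value of version [+1] into the body of version [0].
   Returns 1 if it fit (the caller then frees the outside copy), 0 if the
   value must stay outside the object body. */
static int merge_value(value *old_slot, const value *new_val, uint32_t *flags)
{
    if (new_val->len <= old_slot->len) {
        memcpy(old_slot->data, new_val->data, new_val->len);
        old_slot->len = new_val->len;      /* shrink in place */
        return 1;
    }
    *flags |= HAS_OUTSIDE_FRAGMENTS;       /* a new offset is computed for it */
    return 0;
}
```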
Transaction formalization
During a transaction session, the transaction itself is formalized in object format. The elements of this object's tuple are the elementary actions performed, such as creating an object or changing one of its values. Among other things, the object header stores the identifier of the user on whose behalf the transaction was executed, as well as a timestamp of the transaction's start. If the transaction session completes successfully, then when the session is closed, the transaction formalized in this way is placed in the write queue.
It is worth noting that the order in which the transaction and the bank of objects formed at the close of a session are queued depends on whether the transaction's objects fit into that bank. If they fit, the formalized transaction is queued first. If a new bank had to be opened to accommodate the transaction's objects, the bank of objects is queued first.
A sequence of formalized transactions stored on disk forms a transaction log.
Transaction log
The complete transaction log is the primary, event-based form of the database. Sequentially re-executing the contents of the log yields database contents completely identical to those obtained in normal multi-user operation. This property motivates using the log as an element of reliability.
It is also worth noting that the log has one more function, an audit one, which has repeatedly proved its usefulness in disputes of the “...this program messed everything up” variety.
Writing to disk
Saving data to disk is the job of the write thread, a background thread serving the file storage. From its queue, the write thread extracts descriptions of the data to be stored: a pointer to the memory area, the size of the area, and the data type. When it finishes writing to the file, the thread releases the memory area handed to it on its own.
The data type determines the target file the data will be written to. There are only three basic data types: formalized transactions, banks of objects, and the DAT.
The write thread appends all transactions to a single, permanently open transaction log file, in which they follow one another without alignment. Before a transaction is written, its checksum is computed and fixed in its header, and a forced commit to disk is performed at the end of the write.
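A POSIX sketch of such an append; the article does not name the checksum algorithm, so FNV-1a stands in for it here, and the record layout is assumed.

```c
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Stand-in checksum (the actual algorithm is not specified). */
static uint64_t fnv1a(const void *p, size_t n)
{
    const unsigned char *b = p;
    uint64_t h = 1469598103934665603ULL;
    while (n--) { h ^= *b++; h *= 1099511628211ULL; }
    return h;
}

typedef struct {
    uint32_t len;        /* length of the record body         */
    uint64_t checksum;   /* checksum of the body, fixed below */
} txn_hdr;

/* Append one formalized transaction to the always-open log file, forcing
   it to stable storage at the end (the "forced commit"). */
static int log_append(int fd, const void *body, uint32_t len)
{
    txn_hdr h = { .len = len, .checksum = fnv1a(body, len) };
    if (write(fd, &h, sizeof h) != (ssize_t)sizeof h) return -1;
    if (write(fd, body, len) != (ssize_t)len)         return -1;
    return fsync(fd);
}
```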
When the turn comes to save a bank of objects, the write thread first creates the next checkpoint and only then writes the bank to its file. To create a checkpoint, the write thread closes the current log file, packs it into a backup copy, packs the bank's contents into a separate file, packs a backup copy of the full DAT, and opens a new log file.
As a result of the write thread's actions, the file storage takes the following form.
The DAT and the banks of objects are placed in the working area. Archive copies of the object banks, and the files that together form the complete transaction log, are placed in the backup area. Each individual transaction log file, together with an archive copy of the DAT and archive copies of its own and all previous object banks, forms a separate checkpoint.
Checkpoint
When the file storage is initialized at system startup, it is checked for validity. The storage is deemed fit for work if the checksums match the values in the file headers and the contents of the working area correspond to the contents of the backup area.
If the server was not shut down properly, then at the very least the DAT in the working area and the last transaction log file will fail this check, and recovery from the most recent checkpoint starts automatically.
The recovery logic is fairly obvious: missing or damaged banks are restored from the archive copies of the object banks, the DAT is restored from the most recent checkpoint, and then the transactions from that checkpoint's log are executed sequentially. Since the integrity of the very last transaction in the log is not guaranteed, the process ends at the first checksum mismatch.
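A sketch of that replay loop, reusing the assumed record layout and stand-in checksum from the log example; `replay_transaction` is hypothetical.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

uint64_t fnv1a(const void *p, size_t n);                  /* from log example */
void replay_transaction(const void *body, uint32_t len);  /* hypothetical     */

/* Re-execute the log of the latest checkpoint, stopping at the first
   truncated record or checksum mismatch (the torn tail of the log). */
static void recover(FILE *log)
{
    struct { uint32_t len; uint64_t checksum; } h;
    for (;;) {
        if (fread(&h, sizeof h, 1, log) != 1)
            break;                                    /* clean end of log   */
        void *body = malloc(h.len);
        if (!body || fread(body, 1, h.len, log) != h.len ||
            fnv1a(body, h.len) != h.checksum) {
            free(body);                               /* unsafe last record */
            break;
        }
        replay_transaction(body, h.len);              /* sequential re-run  */
        free(body);
    }
}
```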
Garbage collection
We have only two physical resources at our disposal: memory and processor cycles. Since performance is the priority, the price is increased memory consumption. Thus, to reduce the volume of file operations, new and changed objects are saved to disk in bulk, into one file. Meanwhile, the earlier versions of the modified objects remain where they were in previous banks and will never be used again. To return disk space to the system, an internally “thinned out” bank must periodically be compacted, reducing its size, without forgetting to update the DAT and overwrite the bank's archive copy. To simplify the analysis of bank occupancy, the object service permanently keeps a bitmap of clusters up to date.
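Such a bitmap might look as follows; the helpers are illustrative, and the bank size repeats the earlier assumption.

```c
#include <stdint.h>

#define BANK_CLUSTERS 8192   /* 2^13 clusters per bank (assumed, as above) */

/* One bit per cluster of a bank, kept current by the object service. */
typedef struct { uint64_t bits[BANK_CLUSTERS / 64]; } bank_map;

static void map_set(bank_map *m, unsigned c)
{
    m->bits[c / 64] |= 1ULL << (c % 64);     /* cluster occupied */
}

static void map_clear(bank_map *m, unsigned c)
{
    m->bits[c / 64] &= ~(1ULL << (c % 64));  /* cluster freed    */
}

/* Occupancy count: the input for deciding when to compact the bank. */
static unsigned map_used(const bank_map *m)
{
    unsigned n = 0;
    for (unsigned i = 0; i < BANK_CLUSTERS / 64; i++)
        for (uint64_t w = m->bits[i]; w != 0; w &= w - 1)  /* count set bits */
            n++;
    return n;
}
```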
Server architecture
All of the relatively simple functionality discussed above is grouped around the internal control data structures of the File Storage and the Object Service. The file storage is responsible for the reliability of long-term data storage, providing both multiple backups of the database in its various forms, including the transaction log, and the auto-recovery mechanisms. The object service provides, with minimal means, multi-user access to the contents of the database.
The data model, by means of structured metadata that is itself stored in object format, ensures the logical coherence of data objects and the consistency of their values, in full accordance with the application's business logic, which is integrated directly into the metadata.
The user-facing Cursors layer, a kind of logical analogue of SQL cursors, is the server-side part of the interface resources used for user interaction with the contents of the database. Importantly, this layer provides, among other things, complete isolation of the interface from the internal scheme for identifying objects and their values.
The internal logic of the Model and the Cursors, as well as how to implement them, is a topic for a separate story.
Scalability
The potential for scalability comes from two factors: the isolation of individual objects, and the separation of user actions into selections and transactions.
If load capacity needs to grow, the first thing that comes to mind is dedicating a master server within the server pool. The master server stores the reference copy of the database and deals exclusively with executing transactions. It receives transactions from the remaining servers, which service user requests and generate selections. The execution results, a stream of changed object versions, are broadcast by the master server to all the other servers serving selections, which simultaneously provides multiple replication of the database.
If the logical connectivity of objects in the database is markedly heterogeneous (objects can be grouped into domains with strong connectivity inside a domain and few connections outside it), the database can be distributed across several master servers, each serving its own group of domains.
A detailed analysis of the specifics of such an implementation is beyond the scope of this article, but its principles are certainly open to discussion.
Behind the scenes
As often happens when presenting a sizeable body of material, many relatively small details were left out. For example, the following were not covered: segmentation and duplication of the DAT in memory; the special procedure for managing “large” objects; the principles of mutual locking of user threads accessing shared resources; the logic and implementation of garbage collection; logging of the execution process; the collection and display of statistics; and much more.
The important thing was to show that multi-threaded object management rests on fairly trivial logic and is not such a difficult task to implement.
Summary
An implementation reduced to primitives is likely to be the most effective and productive, although opinions on this point may differ. The object representation of data looks more natural than the tabular one. Simplified internal identification gives single-step access to objects and promises a certain payoff when implementing their logical connections.
This material is published in the hope that it will be useful to everyone interested in database architecture.