Oracle RAC General Description / Part 1

    Highly loaded sites demand five-nines availability. Behind the scenes (on the backend), a database churns through a mass of information. But what if the hardware fails, some long-dormant bug in the OS crashes the system, or a network interface drops? What happens to the availability of the data? Out of sheer curiosity, I decided to look at what solutions Oracle offers for these problems. Recent versions, unlike Oracle 9i, are called Oracle 10g (or 11g), where g stands for grid, i.e. distributed computing. At the heart of distributed computing, like it or not, are clusters, plus additional data replication technologies (DataGuard, Streams). This article outlines how an Oracle 10g-based cluster works. It is called Real Application Clusters (RAC).

    The article does not claim to be complete or comprehensive, and configuration details are deliberately left out (so as not to inflate its length). The goal is simply to give an idea of RAC technology.

    P.S. Beware: long read ahead.
    Continuation of the article



    I wanted to write the article as accessibly as possible, so that it would be interesting even to a reader new to the Oracle DBMS. Therefore, I will venture to start the description with the most common database configuration, single-instance, where a single Oracle database (RDBMS) runs on one physical server. This is not directly related to the cluster, but the basic requirements and principles of operation are the same.

    Introduction. Single-instance.


    A database installed on a single physical server, the most common setup, is called single-instance. I had never attached much importance to the difference between the concepts of a database instance and the database itself (as a whole). Now I want to stress that an instance refers to the software (processes, threads, services) that resides in RAM and processes the data (sorting, buffering, maintenance) read directly from disk. Thus, a database is the combination of:
    • the data storage area, i.e. the physical files on disk (datastorage) — the database itself
    • the DB instance, which receives and processes this data in RAM — the DBMS



    In all modern relational databases, data is stored in tables. Tables, indexes, and other objects in Oracle are kept in logical containers — tablespaces. Physically, a tablespace resides in one or more files on disk. Inside, data is organized as follows:
    Each database object (a table, an index, a rollback segment, etc.) is stored in its own segment — a disk area that can occupy space in one or more files. A segment, in turn, consists of one or more extents. An extent is a contiguous piece of space in a file. Extents are made up of blocks. A block is the smallest unit of space allocation in Oracle, 8K by default. Blocks store rows of data, index entries, or intermediate sort results. It is in blocks that the Oracle server typically reads from and writes to disk. Each block has an address, the so-called DBA (Database Block Address).
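    The hierarchy above (segment → extents → blocks) can be sketched as a toy model. This is purely illustrative Python, not Oracle internals; all class names and the allocation numbers are invented for the example:

```python
# Toy model of the storage hierarchy: a segment is made of extents, and an
# extent is a contiguous run of fixed-size blocks.
BLOCK_SIZE = 8 * 1024  # the default 8K block


class Extent:
    """A contiguous run of blocks within a file."""
    def __init__(self, start_block, n_blocks):
        self.start_block = start_block   # DBA of the first block in the run
        self.n_blocks = n_blocks

    def size_bytes(self):
        return self.n_blocks * BLOCK_SIZE


class Segment:
    """One database object (table, index, ...) occupies one segment."""
    def __init__(self, name):
        self.name = name
        self.extents = []

    def allocate(self, start_block, n_blocks):
        self.extents.append(Extent(start_block, n_blocks))

    def size_bytes(self):
        return sum(e.size_bytes() for e in self.extents)


t = Segment("EMPLOYEES")
t.allocate(start_block=100, n_blocks=8)   # first extent: 64K
t.allocate(start_block=500, n_blocks=16)  # the segment grew by another 128K
print(t.size_bytes())  # 196608 bytes = 192K
```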



    Whenever DML (Data Manipulation Language) touches the database, Oracle loads the corresponding blocks from disk into RAM, namely into the buffer cache. They may already be there, in which case no disk access is needed. If the request changed data (update, insert, delete), the changes are made to the blocks directly in the buffer cache, and the blocks are marked dirty. But dirty blocks are not flushed to disk immediately. After all, the disk is the bottleneck of any database, so Oracle tries to touch it as rarely as possible. Dirty blocks are flushed to disk automatically by the background DBWn process, for example at a checkpoint or at a log switch.
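    That read-modify-flush cycle can be shown with a minimal sketch. This is a simulation in Python, not Oracle code; the class and method names are invented for the illustration:

```python
# Toy buffer cache: blocks are read from "disk" on demand, changed only in
# memory and marked dirty; a checkpoint (what DBWn does) flushes the dirty
# blocks back to disk.
class BufferCache:
    def __init__(self, disk):
        self.disk = disk      # dict: DBA -> block contents on "disk"
        self.cache = {}       # DBA -> contents currently in RAM
        self.dirty = set()    # DBAs changed but not yet written out

    def read(self, dba):
        if dba not in self.cache:        # not cached: do a physical read
            self.cache[dba] = self.disk[dba]
        return self.cache[dba]

    def write(self, dba, value):
        self.read(dba)                   # load the block if needed
        self.cache[dba] = value          # change it only in the cache...
        self.dirty.add(dba)              # ...and mark it dirty

    def checkpoint(self):                # DBWn flushing dirty blocks
        for dba in self.dirty:
            self.disk[dba] = self.cache[dba]
        self.dirty.clear()


disk = {500: "old row"}
bc = BufferCache(disk)
bc.write(500, "new row")
print(disk[500])   # still "old row": the disk has not been touched yet
bc.checkpoint()
print(disk[500])   # now "new row"
```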


    Suppose a long transaction is started that reads data, and somewhere during its execution another transaction starts with the intention of changing one of the blocks being read. How does the server coordinate these requests? The question actually splits in two:
    1. What happens if Oracle crashes somewhere in the middle of the long transaction (if it had made changes)?
    2. What data will the first transaction read when, right “under its nose” in the cache, another transaction has changed a block?


    To answer these questions, consider the mechanism that ensures consistent reads, CR (consistent read). It all comes down to the “magic” of the transaction logs, which in Oracle come in two kinds:
    • redo log
    • undo segment


    When a change request arrives, Oracle applies it in the buffer cache, simultaneously writing information sufficient to repeat the action into the redo log buffer located in main memory. As soon as the transaction completes, it is committed, and the server flushes the contents of the redo log buffer to the redo log file on disk in append-write mode and confirms the commit. This approach is much cheaper than writing the modified data block itself to disk. If the server crashes, the cache and all changes in it are lost, but the redo log files remain. On startup, Oracle begins by reading them and re-applying the table changes (transactions) that were not yet reflected in the datafiles. This is called “rolling forward” from redo (roll-forward). The online redo log is flushed to disk (by LGWR) when a transaction commits, at a checkpoint, or every 3 seconds (by default).
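    The write-ahead idea can be demonstrated with a toy redo log. This sketch deliberately simplifies (in real Oracle, uncommitted redo can also reach disk — the undo section below deals with that); all names are invented:

```python
# Toy redo: every change is appended to an in-memory redo buffer; commit
# flushes it (append-only) to a durable log. After a "crash" that loses
# everything in memory, replaying the log (roll-forward) restores the
# committed changes.
class ToyRedo:
    def __init__(self):
        self.redo_buffer = []   # in-memory redo log buffer
        self.redo_file = []     # durable redo log on "disk"
        self.datafile = {}      # durable table data on "disk"
        self.cache = {}         # buffer cache (lost on crash)

    def change(self, key, value):
        self.cache[key] = value
        self.redo_buffer.append((key, value))  # enough info to redo it

    def commit(self):                          # LGWR: append-write flush
        self.redo_file.extend(self.redo_buffer)
        self.redo_buffer.clear()

    def crash_and_recover(self):
        self.cache = {}                        # cache contents are lost
        for key, value in self.redo_file:      # roll-forward from redo
            self.datafile[key] = value


db = ToyRedo()
db.change("acct_1", 100)
db.commit()
db.change("acct_2", 200)   # never committed: stayed in the redo buffer
db.crash_and_recover()
print(db.datafile)         # {'acct_1': 100} -- only committed work survives
```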

    Undo is a bit more complicated. With each table, an associated undo (rollback) segment is stored. When a DML query runs, along with the table blocks, the corresponding rollback-segment data is also loaded into the buffer cache. When table data changes in the cache, the undo data in the cache changes too: the “counter-actions” are recorded there. That is, if an insert was made into the table, a delete is added to the rollback segment; for a delete, an insert; for an update, the previous row value. The blocks (and the corresponding undo data) are marked dirty and go into the redo log buffer. Yes, the redo log records not only the changes to be re-applied (redo) but also their counter-actions (undo). Since LGWR flushes the redo log buffer every 3 seconds, if a long transaction (running for a couple of minutes) is in flight and the server crashes a minute in, redo will contain entries for uncommitted changes. On waking up, Oracle rolls them forward (roll-forward), and then, using the rollback-segment data recovered (from the redo log) in memory, rolls back all uncommitted transactions. Justice is restored.
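    The two-phase recovery just described can be simulated as follows. This is a deliberately simplified sketch: each redo record here carries its undo counterpart inline, and uncommitted redo is assumed to already be on disk:

```python
# Toy instance recovery: redo records hold (txn, key, new_value, old_value).
# Recovery first re-applies ALL redo (roll-forward), then uses the undo
# part to roll back the transactions that never committed.
def recover(redo_file, committed_txns):
    datafile = {}
    for txn, key, new, old in redo_file:             # phase 1: roll-forward
        datafile[key] = new
    for txn, key, new, old in reversed(redo_file):   # phase 2: roll-back
        if txn not in committed_txns:
            if old is None:
                del datafile[key]     # the undo of an insert is a delete
            else:
                datafile[key] = old   # the undo of an update restores the old value
    return datafile


redo = [
    ("T1", "a", 10, None),   # T1 inserted a=10 and committed
    ("T2", "a", 99, 10),     # T2 updated a, but never committed
    ("T2", "b", 7, None),    # T2 also inserted b
]
print(recover(redo, committed_txns={"T1"}))   # {'a': 10}
```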

    Another undeniable advantage of the undo segment deserves a brief mention. In the second scenario (from the diagram), when the select reaches the block with DBA 500, it suddenly discovers that this block has already been changed in the cache (the dirty mark), and therefore turns to the rollback segment to obtain the corresponding previous state of the block. If that previous state is not present in the cache, it reads it from disk and the select continues. Thus, even during a long “select count(money) from bookkeeping”, debit and credit converge. This is Consistent Reading (CR).
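    The consistent-read trick can be sketched the same way: a reader that hits a dirty block reconstructs the version it needs from the undo data. Illustrative Python, not Oracle internals; the data values are invented:

```python
# Toy consistent read: the cache may hold a newer, uncommitted version of a
# block; the undo "segment" keeps the pre-image, so a long-running reader
# sees a consistent snapshot instead of the in-flight change.
cache = {500: {"value": "balance=0", "dirty": True}}   # changed by another txn
undo = {500: "balance=100"}                            # pre-image kept in undo

def consistent_read(dba):
    block = cache[dba]
    if block["dirty"]:            # changed since our query began:
        return undo[dba]          # read the previous state from undo
    return block["value"]

print(consistent_read(500))   # balance=100 -- debit and credit still converge
```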

    But I digress. It's time to move on to the cluster configuration. =)


    Data access level. ASM.




    The storage (datastorage) in large databases is almost always a SAN (Storage Area Network), which presents disk arrays to the servers through a transparent interface.
    Third-party vendors (Hitachi, HP, Sun, Veritas) offer comprehensive solutions for building such SANs on top of a number of protocols (the most common being Fibre Channel), with additional features: mirroring, load balancing, hot-plugging of drives, partitioning, etc.
    Oracle's position on building a database of any scale boils down to this: all you need is the appropriate Oracle software (with the appropriate licenses), and the hardware is whatever you can manage (whatever funds remain after buying Oracle :). Thus, to build a highly loaded database you can do without expensive SPARC servers and stuffed SANs, using free Linux servers and cheap RAID arrays.

    At the level of access to data and disks, Oracle offers its own solution: ASM (Automatic Storage Management). It is a mini-instance of Oracle (INSTANCE_TYPE = ASM), installed separately on each cluster node, that provides disk management services.

    Oracle tries to avoid disk access, since the disk is perhaps the main bottleneck of any database. Oracle caches data itself, but file systems also buffer writes to disk. Why buffer the data twice? Moreover, once Oracle has confirmed a transaction and received notification that the changes have been written to the files, they had better actually be there, not in a cache, in case the database crashes. Therefore, it is recommended to use RAW devices (disks without a file system), which is what ASM does.

    ASM runs on top of RAW devices; its advantages are:
    • no separate disk-partition management software is needed
    • no file system is needed

    A disk group is a combination of several disks. When files are written to the disks, the data is written in extents of 1 MB, distributed across all disks in the group. This is done to ensure high availability, because parts of one table (from a tablespace) end up scattered across different physical disks.

    ASM capabilities:
    • Data mirroring:
      usually 2-way or 3-way, i.e. data is written simultaneously to 2 or 3 disks. For mirroring, a disk is assigned up to 8 partner disks across which the copies of its data are distributed.
    • Automatic load balancing across disks (ensuring high availability):
      if tablespace data is spread over 10 disks and, at some point, reads from certain disks start to fail, ASM turns to the same extents located on the mirror disks.
    • Automatic rebalancing:
      when a disk is removed, ASM duplicates the extents it held onto the other disks remaining in the group, on the fly. When a disk is added to a group, ASM moves extents around so that each disk holds approximately the same number of extents.
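    The striping and rebalancing behaviour can be illustrated with a toy allocator: 1 MB extents spread round-robin, and an evening-out pass when a disk joins. The disk names and the rebalancing policy here are simplified assumptions, not ASM's real algorithm:

```python
# Toy ASM-style striping: a file is split into 1 MB extents, which are
# spread across all disks of the group; adding a disk triggers a rebalance
# so every disk ends up with roughly the same number of extents.
EXTENT_MB = 1


class DiskGroup:
    def __init__(self, disks):
        self.placement = {d: [] for d in disks}   # disk -> list of extent ids

    def write_file(self, name, size_mb):
        disks = list(self.placement)
        for i in range(size_mb // EXTENT_MB):     # round-robin striping
            disk = disks[i % len(disks)]
            self.placement[disk].append(f"{name}#{i}")

    def add_disk(self, disk):
        self.placement[disk] = []
        self.rebalance()

    def rebalance(self):
        # Move extents from the fullest disk to the emptiest until even.
        while True:
            fullest = max(self.placement, key=lambda d: len(self.placement[d]))
            emptiest = min(self.placement, key=lambda d: len(self.placement[d]))
            if len(self.placement[fullest]) - len(self.placement[emptiest]) <= 1:
                return
            self.placement[emptiest].append(self.placement[fullest].pop())


dg = DiskGroup(["disk1", "disk2"])
dg.write_file("users.dbf", size_mb=8)   # 8 extents: 4 on each disk
dg.add_disk("disk3")                    # rebalance spreads them out again
print({d: len(e) for d, e in dg.placement.items()})
```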

    Suppose several disks are connected to one controller and therefore constitute an SPF (single point of failure): if the controller fails, we lose the entire disk array. For this, ASM has a mechanism for defining Failure Groups within a Disk Group. With this mechanism, mirroring scatters the copies of extents across disks located in different failure groups, so that, for example, the death of a SAN or RAID controller does not cause data loss.
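    A toy placement rule makes the idea concrete: the copies of each extent must land in different failure groups. The controller and disk names are invented, and real ASM placement is considerably more sophisticated:

```python
# Toy failure-group-aware mirroring: the two copies of each extent are
# placed on disks belonging to DIFFERENT failure groups, so losing one
# controller (an SPF) never loses both copies.
failure_groups = {
    "ctrl_A": ["disk1", "disk2"],   # disks behind controller A
    "ctrl_B": ["disk3", "disk4"],   # disks behind controller B
}


def place_mirrored_extent(extent_id):
    """Return one disk per failure group for a 2-way mirrored extent."""
    copies = {}
    for group in sorted(failure_groups)[:2]:
        disks = failure_groups[group]
        copies[group] = disks[extent_id % len(disks)]
    return copies


print(place_mirrored_extent(0))  # {'ctrl_A': 'disk1', 'ctrl_B': 'disk3'}
print(place_mirrored_extent(1))  # {'ctrl_A': 'disk2', 'ctrl_B': 'disk4'}
```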

    Thus, the cluster can now store and read data from a shared file storage.
    It's time to level up.


    Clusterware CRS



    At this level we need to ensure the coordination and joint operation of the cluster nodes, i.e. the clusterware layer, which sits somewhere between the database instances themselves and the disk storage:

    CRS (Cluster-Ready Services) is a set of services that ensures the joint operation of the nodes, fault tolerance, high availability, and recovery after a failure. CRS looks like a “mini-instance” of a database (software) installed on each cluster node. Installing CRS is mandatory for building Oracle RAC. CRS can also be integrated with third-party clusterware solutions, such as those from HP or Sun.

    Again, a little "terminology" ...

    CRS consists of 3 main components:
    • CSSD - Cluster Synchronization Service Daemon
    • CRSD - Cluster Ready Services Daemon
    • EVMD - Event Monitor Daemon
    Daemon | Purpose (in brief)                                       | Runs as | On failure, restarts
    CSSD   | synchronization mechanism for node interaction           | user    | the process
    CRSD   | the main engine maintaining resource availability        | root    | the host
    EVMD   | notification process for events occurring in the cluster | user    | the process
    Cluster settings are stored in the OCR (Oracle Cluster Registry). The OCR is a special file of node profiles that stores the current cluster configuration: node availability, service distribution (several databases can be served by different groups of nodes in the cluster), network settings, etc. Physically, the OCR resides in the shared datastorage. While the cluster is running, each node keeps a copy of the OCR in memory, and only one node (the master) performs direct updates of the OCR on disk.

    As is already clear from the table, the most important process, the “mightiest daemon”, is CRSD (Cluster Ready Services Daemon). Its responsibilities include starting and stopping the node, generating failure logs, and reconfiguring the cluster when a node crashes; it is also responsible for recovery after failures and for maintaining the OCR profile file. If this daemon crashes, the entire host reboots. CRS manages the OCR resources: the Global Service Daemon (GSD), the ONS daemon, Virtual Internet Protocol addresses (VIP), listeners, databases, instances, and services.

    The responsibilities of the CSSD (Cluster Synchronization Service Daemon) service include coordinating the interaction of the cluster nodes, synchronizing nodes and the resources between them, and determining their availability through the following functions:
    • Node Membership (NM). Checks the heartbeat between nodes every second. NM also shows the other nodes that it has access to the so-called voting disk (if there are several, then at least a majority of them) by making regular entries there. If a node fails to respond to the heartbeat or fails to leave an entry on the voting disk for several seconds (10 for Linux, 12 for Solaris), the master node evicts it from the cluster.
    • Group Membership (GM). This function is responsible for timely notification when a node is added to, removed from, or drops out of the cluster, and for the subsequent cluster reconfiguration.
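    The Node Membership check above can be sketched as a small simulation. The 10-second timeout comes from the text; the liveness rule and data layout are simplified assumptions:

```python
# Toy node-membership check: a node counts as alive only if it has both
# answered the heartbeat AND written to a majority of the voting disks
# within the timeout; otherwise the master evicts it from the cluster.
MISSCOUNT = 10  # seconds without a sign of life before eviction (Linux)


def alive(node, now):
    heartbeat_ok = now - node["last_heartbeat"] < MISSCOUNT
    # with several voting disks, recent writes to a majority are required
    recent_votes = sum(1 for t in node["voting_disk_writes"]
                       if now - t < MISSCOUNT)
    voting_ok = recent_votes > len(node["voting_disk_writes"]) / 2
    return heartbeat_ok and voting_ok


now = 100
node_a = {"last_heartbeat": 99, "voting_disk_writes": [99, 99, 99]}
node_b = {"last_heartbeat": 80, "voting_disk_writes": [80, 99, 85]}  # 1 of 3 recent
print(alive(node_a, now))   # True
print(alive(node_b, now))   # False -> the master evicts it from the cluster
```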

    CSSD provides dynamic information about the nodes and instances currently in the cluster and is responsible for locking resources within it.

    The messenger in the cluster is EVMD (Event Manager Daemon), which notifies the nodes of events: that a node has started, has lost connectivity, or is recovering. It acts as a bridge between CRSD and CSSD. Notifications are also sent to ONS (Oracle Notification Services), Oracle's universal gateway through which alerts can be delivered, for example, via SMS or e-mail.

    The cluster starts according to the following pattern: CSSD reads the shared OCR, from which it learns the cluster configuration and where the voting disk is located; it reads the voting disk to find out how many nodes have come up in the cluster and their names; and it establishes connections with the neighboring nodes over the IPC protocol. By exchanging heartbeats, it checks whether all the neighboring nodes are up and finds out which node in the current configuration is identified as the master. The first node to start becomes the master node. After startup, all running nodes register with the master and subsequently provide it with information about their resources.
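    The startup sequence can be condensed into a few lines of pseudo-simulation. Everything here (the OCR contents, the voting-disk sign-in list, the master-election rule) is a drastic simplification of the protocol described above:

```python
# Toy cluster start: each node consults the shared OCR for configuration,
# leaves an entry on the voting disk, and the first node up becomes the
# master; later nodes register with it.
ocr = {"voting_disk": "/dev/vote1", "nodes": ["node1", "node2", "node3"]}
voting_disk = []   # ordered "sign-in sheet" of nodes that have come up


def start_node(name):
    voting_disk.append(name)     # leave an entry on the voting disk
    master = voting_disk[0]      # the first node to come up is the master
    if name != master:
        print(f"{name}: registering with master {master}")
    return master


print(start_node("node2"))   # node2 -- first up, becomes the master
print(start_node("node1"))   # node2
print(start_node("node3"))   # node2
```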

    On top of CRS, database instances are installed on the nodes.
    The nodes communicate with each other over a private network, the Cluster Interconnect, using the IPC (Interprocess Communication) protocol. Its requirements: high bandwidth and low latency. It can be built on high-speed versions of Ethernet, third-party solutions (HP, Veritas, Sun), or the increasingly popular InfiniBand. The latter, besides its high throughput, writes to and reads from the application buffer directly, without kernel-level calls. On top of IP, Oracle recommends using UDP for Linux and TCP for Windows. Also, when transmitting packets over the interconnect, Oracle recommends keeping latency within 6-15 ms.

    Continuation of the article
