Architecture and platform of the Odnoklassniki project



    In this post we will share the experience accumulated over five years of maintaining a highly loaded project. We hope fellow developers will be interested to learn what we do and how, what problems and difficulties we encounter, and how we deal with them.

    Basic statistics

    Up to 2.8 million users online during peak hours
    7.5 billion requests per day (150,000 requests per second during peak hours)
    2,400 servers and storage systems
    Network traffic during peak hours: 32 Gbit/s

    Architecture

    Layered architecture:

    • presentation layer (the web servers that render HTML)
    • business services layer (servers that select and process data)
    • caching layer (caches frequently used data)
    • persistence layer (database servers)
    • common infrastructure systems (logging, statistics, application configuration, resource localization, monitoring)

    Presentation layer:

    • We use our own framework, which builds page composition in Java using our own GUI factories (typography, lists, tables, portlets).
    • A page is composed of independent blocks (usually portlets), which lets us update information on the screen piecewise via AJAX requests. This approach to navigation avoids constant full-page reloads, so important site functions (Messages, Discussions, and Alerts) are always available to the user. Without JavaScript the page remains fully functional, except for the functionality written in GWT: clicking a link simply redraws the whole page.
    • Functional components such as Messages, Discussions, and Alerts, as well as all dynamic parts (the shortcut menu, photo tags, photo sorting, gift rotation), are written using the Google Web Toolkit framework; a minimal GWT sketch follows this list.
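
    Below is a minimal sketch of the partial-update idea in GWT: one block of the page is redrawn without a full reload. The widget and the "alerts-slot" element id are illustrative, not the real Messages/Discussions modules.

    import com.google.gwt.core.client.EntryPoint;
    import com.google.gwt.event.dom.client.ClickEvent;
    import com.google.gwt.event.dom.client.ClickHandler;
    import com.google.gwt.user.client.ui.Button;
    import com.google.gwt.user.client.ui.Label;
    import com.google.gwt.user.client.ui.RootPanel;

    public class AlertsBlock implements EntryPoint {
        public void onModuleLoad() {
            final Label counter = new Label("Alerts: 0");
            Button refresh = new Button("Refresh");
            refresh.addClickHandler(new ClickHandler() {
                public void onClick(ClickEvent event) {
                    // On the real site this block would be updated from an
                    // asynchronous RPC response rather than a constant.
                    counter.setText("Alerts: 1");
                }
            });
            // "alerts-slot" is a placeholder id of a div in the host page.
            RootPanel.get("alerts-slot").add(counter);
            RootPanel.get("alerts-slot").add(refresh);
        }
    }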

    Selection, processing and caching of data:

    The code is written in Java. There are exceptions: some data-caching modules are written in C and C++.
    We chose Java because it is a convenient development language with a large body of existing work, libraries, and open-source projects.
    At the business-logic level there are about 25 types of servers/components and caches that communicate with each other over remote interfaces. About 3,000,000 remote requests per second flow between these modules.
    The odnoklassniki-cache module is used for caching data. It stores data in memory using Java Unsafe. We cache all data that is accessed frequently: information from user profiles, user groups, information about the groups themselves, the graph of user relationships, the graph of user-group relationships, user holidays, some photo metadata, and so on.
    For example, one of the servers caching the user-connection graph handles about 16,600 requests per second at peak hours. The CPU is at most 7% busy, and the maximum five-minute load average is 1.2. The graph has more than 85 million vertices and 2.5 billion edges; in memory it occupies 30 GB. A sketch of the off-heap idea follows this paragraph.
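
    Here is a minimal sketch of storing data off-heap via sun.misc.Unsafe, the idea behind odnoklassniki-cache as described above. The class is illustrative, not the actual module.

    import java.lang.reflect.Field;
    import sun.misc.Unsafe;

    public class OffHeapLongArray {
        private static final Unsafe UNSAFE = getUnsafe();
        private static final int LONG_SIZE = 8;  // size of a long in bytes
        private final long address;

        public OffHeapLongArray(long length) {
            // Memory allocated here lives outside the Java heap,
            // so it is invisible to the garbage collector.
            this.address = UNSAFE.allocateMemory(length * LONG_SIZE);
        }

        public void set(long index, long value) {
            UNSAFE.putLong(address + index * LONG_SIZE, value);
        }

        public long get(long index) {
            return UNSAFE.getLong(address + index * LONG_SIZE);
        }

        public void free() {
            // Off-heap memory must be released explicitly.
            UNSAFE.freeMemory(address);
        }

        private static Unsafe getUnsafe() {
            try {
                // Unsafe is not public API; obtain the singleton via reflection.
                Field f = Unsafe.class.getDeclaredField("theUnsafe");
                f.setAccessible(true);
                return (Unsafe) f.get(null);
            } catch (Exception e) {
                throw new IllegalStateException(e);
            }
        }
    }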

    Load distribution and balancing:

    • weighted round-robin inside the system (a sketch follows below);
    • vertical and horizontal partitioning of data, both in the databases and at the caching level;
    • the business-logic servers are divided into groups, each processing different events. There is an event-routing mechanism: any event (or group of events) can be singled out and sent for processing to a specific group of servers.
    Services are managed through a centralized configuration system, written in-house. Through a web interface you can change the placement of portlets, the configuration of clusters, the logic of some services, and so on. The changed configuration is saved in the database. Each server periodically checks whether there are updates for the applications running on it and, if so, applies them.
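
    A sketch of the weighted round-robin selection mentioned above, assuming weights are small integers; the real balancer is part of the internal remoting framework, so the class below is illustrative.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.atomic.AtomicInteger;

    public class WeightedRoundRobin<T> {
        // Each server is repeated "weight" times, so heavier servers
        // appear more often in the rotation.
        private final List<T> ring = new ArrayList<T>();
        private final AtomicInteger cursor = new AtomicInteger();

        public void add(T server, int weight) {
            for (int i = 0; i < weight; i++) {
                ring.add(server);
            }
        }

        public T next() {
            // Mask the sign bit so the index stays non-negative after overflow.
            int i = cursor.getAndIncrement() & Integer.MAX_VALUE;
            return ring.get(i % ring.size());
        }
    }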

    Data, database servers, backups:

    The total amount of data without redundancy is 160 TB. Two solutions are used for storing and serving data: MS SQL and BerkeleyDB. Data is always stored in at least two copies; depending on the type of data, there can be two to four. All data is backed up daily, and the newly accumulated data is backed up every 15 minutes. As a result of this backup strategy, the maximum possible data loss is 15 minutes.

    Equipment, data centers, network:

    We use dual-processor, quad-core servers with 4 to 48 GB of memory, depending on their role. Depending on the type and use of the data, it lives either in server memory, on server disks, or on external storage systems.
    All equipment is located in three data centers: about 2,400 servers and storage systems in total. The data centers are joined in an optical ring; at the moment each route has a capacity of 30 Gbit/s. Each route consists of physically independent fiber pairs, which are aggregated into a common "pipe" on the root routers.
    The network is divided into internal and external parts, physically separated from each other: different server interfaces are connected to different switches and operate on different networks. Web servers communicate with the outside world over the external network; over the internal network all servers communicate with each other.
    The internal network topology is a star. Servers are connected to L2 access switches, which in turn are connected by at least two gigabit links to an aggregation stack of routers, each link going to a separate switch in the stack. For this architecture to work we use the RSTP protocol. Where necessary, access switches are connected to the aggregation stack by more than two links, using link aggregation of ports.
    Aggregation switches are connected with 10 Gbit links to the root routers, which provide both communication between the data centers and communication with the outside world.
    We use switches and routers from Cisco. To communicate with the outside world, we have direct connections to several of the largest telecom operators.
    Network traffic during peak hours is 32 Gbit/s.

    Statistics System:

    There is a library responsible for event logging, used in all modules. It aggregates statistics and saves them to temporary databases; the saving itself is done via the log4j library. We usually record the number of calls, the minimum, maximum, and average execution time, and the number of errors that occurred during execution.
    All statistics are then moved from the temporary databases into the DWH: every minute the DWH servers visit the temporary databases in production and collect the data, and the temporary databases are periodically purged.

    Sample code that stores statistics about sent messages:

    public void sendMessage(String message) {
        long startTime = LoggerUtil.getMeasureStartTime();
        try {
            // business logic: send the message
            LoggerUtil.operationSuccess(LogFactory.getLog({log's appender name}),
                    startTime, "messageService", "sendMessage");
        } catch (Exception e) {
            LoggerUtil.operationFailure(LogFactory.getLog({log's appender name}),
                    startTime, "messageService", "sendMessage");
        }
    }
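
    The per-operation aggregation this library performs might look roughly like the sketch below (call count, min/max/average time, error count, as described earlier). The real LoggerUtil is an internal library, so this class is hypothetical.

    public class OperationStats {
        private long calls, errors;
        private long totalTimeMs, maxTimeMs;
        private long minTimeMs = Long.MAX_VALUE;

        public synchronized void record(long durationMs, boolean success) {
            calls++;
            if (!success) errors++;
            totalTimeMs += durationMs;
            minTimeMs = Math.min(minTimeMs, durationMs);
            maxTimeMs = Math.max(maxTimeMs, durationMs);
        }

        public synchronized double averageTimeMs() {
            // Average is derived from the running total at read time.
            return calls == 0 ? 0 : (double) totalTimeMs / calls;
        }
    }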
    


    Our DWH system stores all statistics and provides tools for viewing and analyzing them. The system is based on Microsoft solutions: MS SQL 2008 database servers and Reporting Services for reports. The DWH currently consists of 13 servers located in an environment separate from production. Some of these servers serve operational (i.e. online) statistics; others store and provide access to historical statistics. The total volume of statistics is 13 TB.
    We plan to introduce multidimensional analysis of statistics based on OLAP.

    Monitoring

    Monitoring is divided into two components:

    1. Monitoring of services and site components
    2. Monitoring of resources (equipment, network)
    The primary one is service monitoring: our own monitoring system built on the operational data in the DWH. There are on-duty engineers whose job is to watch the site's health and, in case of any anomalies, take action to find and eliminate their causes.
    For resource monitoring, we track both the "health" of the equipment (temperature; condition of components: CPU, RAM, disks, etc.) and server resource indicators (CPU load, RAM usage, disk subsystem load, etc.). We use Zabbix to monitor equipment "health" and accumulate statistics on server and network resource usage in Cacti.
    Alerts about the most critical anomalies arrive via SMS; other alerts are sent by email.

    Technology:

    • Operating systems: MS Windows, openSUSE
    • Java, C, C++. All main code is written in Java; the data-caching modules are written in C and C++.
    • GWT is used to add dynamics to the web interface; modules such as Messages, Discussions, and Alerts are written with it
    • Web servers: Apache Tomcat
    • Business-logic servers run under JBoss 4
    • Load balancers on the web layer: LVS. We use IPVS for Layer-4 balancing
    • Apache Lucene for indexing and searching textual information
    • Databases:
    MS SQL 2005 Standard edition, used largely for historical reasons. Servers with MS SQL are combined into a failover cluster: if one of the working nodes fails, the standby node takes over its functions.
    BerkeleyDB, accessed through our own internal library. We use the C implementation of BDB, version 4.5, in two-node master-slave clusters with native BDB replication between master and slave. Writes go only to the master; reads go to both nodes. Data is kept in tmpfs; transaction logs are stored on disk and backed up every 15 minutes. The servers of one cluster sit on different power lines so that both copies of the data cannot be lost at once.
    A new storage solution is under development: we need even faster and more reliable access to data.
    • For communication between servers we use our own solution based on JBoss Remoting
    • Communication with the SQL databases goes through JDBC drivers

    People:

    About 70 technical specialists work on the project: 40 developers, 20 system administrators and engineers, and 8 testers.
    All developers are divided into small teams (1-3 people). Each team works autonomously, developing either a new service or improvements to existing ones. Each team has a technical leader (architect) who is responsible for the service's architecture and the choice of technologies and approaches. At different stages of development, designers, testers, and system administrators may join the team.
    For example, there is a separate team for the Groups service, and a team developing the site's communication services (the message system, discussions, activity feeds). There is also a platform team that evaluates, trials, and introduces new technologies and optimizes existing solutions; one of its current tasks is the development and introduction of a high-speed and reliable data-storage solution.

    Basic principles and approaches in development

    Development is carried out in small iterations. An example three-week development cycle:
    week 0: architecture definition
    week 1: development, testing on the developers' machines
    week 2: testing in a pre-production environment, release to production

    Almost all new functionality can be switched off. A typical launch of a new "feature" looks like this (a sketch of the toggle check follows this list):
    1. The functionality is developed and gets into a production release.
    2. Through the centralized configuration system, the functionality is turned on for a small share of users. Statistics on user activity and on infrastructure load are analyzed.
    3. If the previous stage was successful, the functionality is gradually turned on for a larger and larger audience. If during the rollout we do not like the collected statistics, or the infrastructure load grows disproportionately, the functionality is switched off, the causes are analyzed, errors are fixed, optimization is done, and everything is repeated from the first step.
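
    A minimal sketch of such a gradual-rollout check, assuming the percentage is pushed from the centralized configuration system. The class and method names are hypothetical.

    public class FeatureToggle {
        private volatile int rolloutPercent;  // 0..100, updated from the configuration system

        public FeatureToggle(int rolloutPercent) {
            this.rolloutPercent = rolloutPercent;
        }

        public void setRolloutPercent(int percent) {
            this.rolloutPercent = percent;  // called when the central config changes
        }

        public boolean isEnabledFor(long userId) {
            // A stable hash of the user id keeps the same users in the test
            // group while the percentage grows.
            int bucket = (int) ((userId ^ (userId >>> 32)) & 0x7fffffff) % 100;
            return bucket < rolloutPercent;
        }
    }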

    Best practices, tricks & tips

    The specifics of working with DBMS:

    • We use both vertical and horizontal partitioning: different groups of tables are located on different servers (vertical partitioning), and the data of large tables is additionally distributed between servers (horizontal partitioning). The partitioning built into the DBMS is not used; all the logic lives at the business-services level (a routing sketch follows this list).
    • Distributed transactions are not used; every transaction stays within a single server. To ensure integrity, related data is placed on one server or, where that is impossible, recovery logic for the data is programmed explicitly.
    • Database queries do not use JOINs, even between local tables, to minimize CPU load. Instead we denormalize data, or perform the JOIN at the business-service level; in that case the JOIN may combine data from the database with data from the cache.
    • When designing the data structures, we do not use foreign keys, stored procedures, or triggers, again to reduce the CPU load on the database servers.
    • SQL DELETE statements are also used with caution: this is the heaviest DML operation. We try not to delete data unnecessarily, or delete via a marker: a record is first marked as deleted and then removed from the table by a background process.
    • Indexes are used widely, both ordinary and clustered, the latter to optimize the most frequent queries against a table.
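
    A sketch of horizontal partitioning at the business-service level, as described in the first bullet: the shard is chosen in application code, so every transaction touches exactly one server. The names are illustrative.

    import java.util.List;
    import javax.sql.DataSource;

    public class UserShardRouter {
        private final List<DataSource> shards;  // one DataSource per database server

        public UserShardRouter(List<DataSource> shards) {
            this.shards = shards;
        }

        public DataSource shardFor(long userId) {
            // All data for one user lands on one shard, so transactions
            // never have to span servers. Assumes non-negative user ids.
            return shards.get((int) (userId % shards.size()));
        }
    }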

    Caching:

    • We use our own cache servers, implemented in Java. Some data sets, such as user profiles and the social graph, are stored in the cache in their entirety.
    • Data is partitioned across a cluster of cache servers, and partitions are replicated for reliability.
    • Sometimes the performance requirements are so high that local short-lived caches of data obtained from the cache servers are kept directly in the memory of the business-logic servers (a sketch follows this list).
    • Besides the usual key-value operations, the cache servers can execute queries over the data held in memory, minimizing the transfer of unnecessary data over the network. Map-reduce is used to run queries and operations across the cluster. In especially demanding cases, such as queries over the social graph, C is used to increase performance.
    • To store large amounts of data in memory, off-heap memory is used, removing unnecessary load from the Java GC.
    • Caches can use a local disk to store data, which turns them into high-performance database servers.
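
    A sketch of the local short-lived cache in front of the remote cache servers, mentioned in the third bullet. The RemoteCache interface stands in for the real cache client, which is internal.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class LocalCache<K, V> {
        public interface RemoteCache<K, V> { V get(K key); }

        private static class Entry<V> {
            final V value;
            final long expiresAt;
            Entry(V value, long expiresAt) { this.value = value; this.expiresAt = expiresAt; }
        }

        private final Map<K, Entry<V>> map = new ConcurrentHashMap<K, Entry<V>>();
        private final RemoteCache<K, V> remote;
        private final long ttlMs;

        public LocalCache(RemoteCache<K, V> remote, long ttlMs) {
            this.remote = remote;
            this.ttlMs = ttlMs;  // short TTL bounds staleness of local copies
        }

        public V get(K key) {
            long now = System.currentTimeMillis();
            Entry<V> e = map.get(key);
            if (e != null && e.expiresAt > now) {
                return e.value;            // served from local memory
            }
            V value = remote.get(key);     // fall back to the cache cluster
            map.put(key, new Entry<V>(value, now + ttlMs));
            return value;
        }
    }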

    Optimizing page load speed and performance

    • We cache all external resources (Expires and Cache-Control headers; a sketch follows this list) and minify and compress CSS and JavaScript files (gzip).
    • To reduce the number of HTTP requests from the browser, all JavaScript and CSS files are combined into one, and small images are combined into sprites.
    • When a page loads, only the resources actually needed to get started are downloaded.
    • No universal CSS selectors; we try not to use selectors by tag name.
    • If CSS expressions are needed, we write "one-shot" ones that evaluate only once. We avoid filters whenever possible.
    • We cache references into the DOM tree, as well as element properties whose reads trigger reflow, and update the DOM tree "offline", detached from the document.
    • In GWT we use UiBinder and HTMLPanel to build interfaces.
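
    A sketch of setting the Expires and Cache-Control headers from the first bullet, as a servlet filter for Tomcat. The one-year lifetime is an assumed value for versioned static resources.

    import java.io.IOException;
    import javax.servlet.*;
    import javax.servlet.http.HttpServletResponse;

    public class StaticCachingFilter implements Filter {
        private static final long ONE_YEAR_MS = 365L * 24 * 60 * 60 * 1000;

        public void init(FilterConfig config) { }

        public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
                throws IOException, ServletException {
            HttpServletResponse response = (HttpServletResponse) res;
            // Let browsers cache static resources aggressively; versioned
            // file names make stale copies harmless.
            response.setHeader("Cache-Control", "public, max-age=31536000");
            response.setDateHeader("Expires", System.currentTimeMillis() + ONE_YEAR_MS);
            chain.doFilter(req, res);
        }

        public void destroy() { }
    }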

    Thanks for reading! We will be glad to answer your questions.
