
HighLoad++ 2012
A few days ago Moscow hosted HighLoad++, the conference for developers of highly loaded systems, and I was lucky enough to attend. Below I briefly go over the talks I visited, highlighting the points I found most interesting.
Fair warning: I may have misunderstood or misinterpreted some things. If that matters to you, don't read this post - come to the next conference in person!
Day 1
Storage and delivery of VKontakte content
The talk began with a story from two years ago, when I asked Pavel Durov how VKontakte stores user data. Pasha's answer that day was "on disks". I have disliked him ever since :) This time Oleg (the speaker) promised to answer the question properly, devoting a whole talk to it.
Initially, all files (photos) were stored on the same server as the code in the upload folder. This approach is simple, but it has two flaws: security (code injection, XSS) and scalability.
The team then tackled scalability with the following scheme: files are uploaded to the main server and then moved to auxiliary servers dedicated to user data.
Each user is always mapped to the same auxiliary server, determined by the user id.
However, this approach has a bottleneck - the main server.
Therefore, the next step was to upload files directly to a content server, bypassing the main application server. To implement this, you have to answer the question: which server should a given photo be uploaded to?
- the server most recently added to the system (doesn't work: it runs out of space quickly);
- the emptiest server (doesn't work: all write traffic piles onto that one server);
- a random server that still has free space (the winning option; a sketch follows the list);
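To make the idea concrete, here is a minimal sketch in Go of picking a random server with free space (my own illustration, not VKontakte's code; server names, free-space numbers and the threshold are made up):

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
)

// contentServer is a hypothetical record of a storage node and its free space.
type contentServer struct {
	addr   string
	freeGB int
}

// pickUploadServer returns a random server that still has at least minFreeGB free.
// Picking randomly (rather than "newest" or "emptiest") spreads write traffic
// evenly and avoids hammering a single node.
func pickUploadServer(servers []contentServer, minFreeGB int) (string, error) {
	candidates := make([]contentServer, 0, len(servers))
	for _, s := range servers {
		if s.freeGB >= minFreeGB {
			candidates = append(candidates, s)
		}
	}
	if len(candidates) == 0 {
		return "", errors.New("no content server with enough free space")
	}
	return candidates[rand.Intn(len(candidates))].addr, nil
}

func main() {
	servers := []contentServer{
		{"cs1.example.net", 120},
		{"cs2.example.net", 5},
		{"cs3.example.net", 800},
	}
	addr, err := pickUploadServer(servers, 10)
	if err != nil {
		panic(err)
	}
	fmt.Println("upload to:", addr)
}
```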
Find and eliminate bottlenecks. There is usually only one bottleneck - the rest of the system's nodes cope much better. When storing large volumes of user data, the bottleneck can be any of the following:
- data backup (RAID is used; if an entire two-disk array burns out, the user files on it are lost for good);
- caching (60% of the content is served from the cache on the client side);
- traffic (the answer is a CDN, especially for video: VKontakte has servers in Moscow that cache video from the data center in St. Petersburg);
The file system is XFS. However, user files are not stored as individual files: they are appended to one large file. That file is always open for reading, and its index is kept in RAM. The team sees the next step for this technology as a complete abstraction from the physical media.
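The "one big file plus an index in RAM" idea, as a toy sketch in Go (my own simplification; the real storage obviously handles crash recovery, index persistence and much more):

```go
package main

import (
	"fmt"
	"os"
	"sync"
)

// blobStore appends files to one big data file and keeps (offset, size) in RAM.
type blobStore struct {
	mu    sync.Mutex
	data  *os.File
	index map[string]struct{ off, size int64 } // key -> location in the big file
	end   int64
}

func openBlobStore(path string) (*blobStore, error) {
	f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE, 0o644)
	if err != nil {
		return nil, err
	}
	st, err := f.Stat()
	if err != nil {
		return nil, err
	}
	return &blobStore{data: f, index: map[string]struct{ off, size int64 }{}, end: st.Size()}, nil
}

// Put appends the blob and records its position in the in-memory index.
func (b *blobStore) Put(key string, blob []byte) error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if _, err := b.data.WriteAt(blob, b.end); err != nil {
		return err
	}
	b.index[key] = struct{ off, size int64 }{b.end, int64(len(blob))}
	b.end += int64(len(blob))
	return nil
}

// Get reads the blob back using the in-memory index; the file stays open for reads.
func (b *blobStore) Get(key string) ([]byte, error) {
	b.mu.Lock()
	loc, ok := b.index[key]
	b.mu.Unlock()
	if !ok {
		return nil, fmt.Errorf("no such key: %s", key)
	}
	buf := make([]byte, loc.size)
	_, err := b.data.ReadAt(buf, loc.off)
	return buf, err
}

func main() {
	s, err := openBlobStore("photos.dat")
	if err != nil {
		panic(err)
	}
	_ = s.Put("user42/photo1.jpg", []byte("...jpeg bytes..."))
	blob, _ := s.Get("user42/photo1.jpg")
	fmt.Println(len(blob), "bytes")
}
```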
With WiFi now widespread, VKontakte is switching to HTTPS. There are far too many content servers to buy a certificate for each one (too expensive), so several dedicated servers proxy the content over HTTPS.
As for pictures:
- more than 30,000,000,000 in total, with 17,000,000 added per day;
- images are pre-compressed on the client side (Flash);
- GraphicsMagick is used for processing;
- on save, photos are immediately resized to all required resolutions;
Regarding audio:
- 130K tracks are added per day;
- users can "add" a track from another user's page to their own collection, which reduces the volume of newly uploaded audio by orders of magnitude;
- to satisfy copyright holders, files were initially matched by md5 hash; now an algorithm has been developed that finds similar recordings by acoustic characteristics;
As for video:
- 320K videos are added per day;
- video is processed asynchronously through a queue - processing it synchronously on upload is too expensive;
- ffmpeg is used for video processing;
- videos re-uploaded by other users are stored as duplicates - this is not considered a problem;
- at one point VKontakte had P2P video delivery in Flash (wat!?); now they manage without it;
Overall I got the feeling that VKontakte is built relatively simply. Simply, but effectively.
NoSQL in a high-load project
A talk on how NoSQL is used at Mamba. A peculiarity of the system in question is that more than 30% of the executed queries are counter increments and reads.
The search for suitable storage began, as it should, with Memcached. It didn't fit because of the lack of persistence and the RAM-only limit. They tried Redis - also RAM only.
On top of that, under heavy load Memcached performance drops by a factor of 100 (the Brutis tool was used for load testing).
To cut a long story short, the team ended up with TokyoTyrant. However, it turned out to have database integrity problems when the server was yanked out of the wall socket :) Those were solved by a newer product from the same author - KyotoTycoon. But writing to a database of 30M records proved impossible due to architectural limitations.
So the team turned to LevelDB from Google. It is built on LSM-tree technology: data is stored in SSTable files - sorted, immutable key-value pairs.
Writes go into a similar (but mutable) structure in RAM, and from time to time the in-memory subtrees are merged down to disk.
To avoid trouble on a sudden power outage, a write-ahead log is kept.
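A toy sketch of the same idea in Go - every write hits the WAL first, then the in-memory table, which is later dumped in sorted order (this is my illustration of the LSM scheme, not LevelDB's actual code or file format):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"sort"
)

// memStore is a toy memtable with a write-ahead log: every Put is appended to the
// WAL before touching the in-memory map, so a crash can be replayed from the log.
type memStore struct {
	wal  *bufio.Writer
	walF *os.File
	mem  map[string]string
}

func openMemStore(walPath string) (*memStore, error) {
	f, err := os.OpenFile(walPath, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
	if err != nil {
		return nil, err
	}
	return &memStore{wal: bufio.NewWriter(f), walF: f, mem: map[string]string{}}, nil
}

func (s *memStore) Put(k, v string) error {
	// 1. durable log record, 2. fsync, 3. apply to the memtable.
	if _, err := fmt.Fprintf(s.wal, "%s\t%s\n", k, v); err != nil {
		return err
	}
	if err := s.wal.Flush(); err != nil {
		return err
	}
	if err := s.walF.Sync(); err != nil {
		return err
	}
	s.mem[k] = v
	return nil
}

// Flush would dump the memtable as a sorted, immutable "SSTable"; here we just
// print the sorted pairs to show the idea.
func (s *memStore) Flush() {
	keys := make([]string, 0, len(s.mem))
	for k := range s.mem {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("sstable entry: %s=%s\n", k, s.mem[k])
	}
}

func main() {
	s, err := openMemStore("wal.log")
	if err != nil {
		panic(err)
	}
	_ = s.Put("counter:42", "17")
	_ = s.Put("counter:7", "3")
	s.Flush()
}
```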
The team benchmarked once more and found that in most cases LevelDB outperforms all the options used previously. They now sustain 4,700 gets/sec and 1,600 updates/sec with 200M records in the database.
MVCC unmasked
Multi-Version Concurrency Control (MVCC) is a mechanism that lets readers avoid blocking writers and writers avoid blocking readers. Overall, it significantly reduces the number of locks in the database. It is present in Oracle, MySQL's InnoDB, PostgreSQL and others.
Each record in an MVCC table has two attributes: creation (xmin) - the number of the transaction that created the record, and expiration (xmax) - the number of the transaction that deleted it.
For example:
- INSERT: xmin = 40, xmax = NULL
- DELETE: xmin = 40, xmax = 47
- UPDATE (old version / new version): xmin = 64, xmax = 78 / xmin = 78, xmax = NULL
When each statement is executed, an MVCC snapshot is created, which determines which data in the database is visible / accessible to the statement.
The algorithm for determining row visibility is as follows (a small sketch follows the list):
- take the number of the last completed transaction (current);
- consider visible those records for which:
- xmin < current;
- they are not deleted (xmax = NULL), or the transaction that deleted them has not committed;
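As a sketch, the visibility rule can be written as a tiny Go function (a heavy simplification of what PostgreSQL really does - snapshots of in-progress transactions, subtransactions and so on are ignored):

```go
package main

import "fmt"

// tuple mimics a row version with its creating/expiring transaction ids.
// xmax == 0 stands for "not deleted" (NULL).
type tuple struct {
	xmin, xmax uint64
}

// visible reports whether the row version can be seen by a snapshot taken at
// transaction id `current`, given a function that says whether a transaction
// has committed (in PostgreSQL that information lives in pg_clog).
func visible(t tuple, current uint64, committed func(uint64) bool) bool {
	// the creating transaction must be older than our snapshot and committed
	if t.xmin >= current || !committed(t.xmin) {
		return false
	}
	// the row must not be deleted, or the deleting transaction must not have committed
	if t.xmax == 0 {
		return true
	}
	return !committed(t.xmax) || t.xmax >= current
}

func main() {
	committed := func(xid uint64) bool { return xid != 47 } // pretend txn 47 rolled back
	fmt.Println(visible(tuple{xmin: 40, xmax: 0}, 100, committed))  // true: live row
	fmt.Println(visible(tuple{xmin: 40, xmax: 47}, 100, committed)) // true: delete was rolled back
	fmt.Println(visible(tuple{xmin: 40, xmax: 45}, 100, committed)) // false: deleted and committed
}
```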
The xmin and xmax fields are present in every table but hidden by default. They can always be selected explicitly: SELECT xmin, xmax, * FROM test;
The id of the current transaction can be obtained with SELECT txid_current(). Transaction status is tracked in pg_clog, and it is important to understand that rolling a transaction back simply sets the corresponding status flag there. No data is deleted; the transaction is just marked as rolled back.
Non-MVCC DBMSs have to modify data in place when records are changed or deleted. MVCC DBMSs are spared this problem: all outdated row versions are cleaned up later, in a deferred fashion.
Here I'll add on my own behalf that the deferred cleanup (the notorious VACUUM) is not as flawless as one would like... but that is a topic for a separate discussion.
By the way, here is the speaker's site, where he promised many interesting articles.
MySQL at Google
Perhaps the biggest disappointment among the talks. In short, it boils down to a single thesis: "Yes, we run MySQL, slightly patched by us."
Maybe my requirements are too high, but I expected something more interesting from the Good Corporation.
Largest projects: Ads, Checkout and, of course, YouTube.
The data centers use not the most expensive hardware but the most "advanced". Components (disks) fail quite often, but they are easy and cheap to replace.
MySQL is run as a cluster. Each shard has a separate process (a "decider") responsible for electing a new master if the old one fails.
Heartbeat is done by a simple script that writes a bit of data to the master and checks for its presence on the replicas.
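Roughly, such a heartbeat could look like the Go sketch below, using database/sql; the table name, DSNs and the choice of MySQL driver are my assumptions, not Google's actual script:

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql" // hypothetical choice of MySQL driver
)

// writeHeartbeat stamps the current time on the master.
func writeHeartbeat(master *sql.DB) error {
	_, err := master.Exec(
		"REPLACE INTO heartbeat (id, ts) VALUES (1, ?)", time.Now().UnixNano())
	return err
}

// replicaLag reads the stamp back from a replica and returns how far behind it is.
func replicaLag(replica *sql.DB) (time.Duration, error) {
	var ts int64
	if err := replica.QueryRow("SELECT ts FROM heartbeat WHERE id = 1").Scan(&ts); err != nil {
		return 0, err
	}
	return time.Since(time.Unix(0, ts)), nil
}

func main() {
	master, err := sql.Open("mysql", "user:pass@tcp(master:3306)/monitor")
	if err != nil {
		log.Fatal(err)
	}
	replica, err := sql.Open("mysql", "user:pass@tcp(replica1:3306)/monitor")
	if err != nil {
		log.Fatal(err)
	}
	for range time.Tick(5 * time.Second) {
		if err := writeHeartbeat(master); err != nil {
			log.Println("master unreachable:", err) // a decider would start an election here
			continue
		}
		if lag, err := replicaLag(replica); err == nil {
			log.Println("replica lag:", lag)
		}
	}
}
```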
To make scaling and fault tolerance tractable, the team imposes many restrictions on how the database may be used. For example, a write transaction may not run longer than 30 seconds - this is needed so that any slave can be promoted to master quickly.
SPDY Support in NginX
SPDY is a binary protocol over a TCP / TLS connection. It is a transport for HTTP. Designed by Google.
Key features:
- multiplexing (several requests per connection, both from server to client, and vice versa);
- header compression (zlib, deflate);
- flow control (window for tcp connection);
- server push ("loading" data into the browser initiated by the server);
Advantages: HTTPS out of the box (thanks to TLS), and good behavior on pages with many images (thanks to multiplexing).
Drawbacks: it works within a single domain and therefore does not play well with a CDN.
According to measurements on a production WordPress setup: SPDY is faster than HTTPS, but slower than plain HTTP.
SQL Tricks
The talk was a list of interesting but not universally known SQL / PostgreSQL features:
- views are an effective tool for encapsulating logic and/or tables; I disagree on the first, but use the second (e.g. when renaming tables);
- Postgres relies on the OS page cache (hence the advice to give it either very little memory or nearly all of it);
- pgbench - a benchmarking utility for Postgres;
- Common Table Expressions (CTE) / recursive CTEs (a sketch follows the list);
- window functions;
- JSON support (since 9.2);
- LATERAL support (referencing values computed earlier in the FROM clause, since 9.3);
- an extension for working with text and trigrams: pg_trgm;
- an extension for storing key-value pairs: hstore;
- attaching tables from other databases via dblink;
- extensions for working with geo-data: PostGIS and EarthDistance;
- prepared statements - avoid the overhead of parsing the query, checking object existence and permissions, and building a plan (which can sometimes take longer than the execution itself);
- locks can be taken explicitly via the SELECT ... FOR SHARE construct;
- in PL/pgSQL prepared statements are available via EXECUTE;
- anonymous code blocks: DO $code$ ... $code$ (since 9.1);
- upsert is best implemented as insert -> fk constraint violation -> update;
- a function can return several values by returning cursors (valid within the transaction) or by generating JSON (row_to_json, array_to_json);
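As a taste of recursive CTEs and window functions, here is a small Go program running one such query against Postgres; the DSN and the lib/pq driver choice are assumptions for the sake of the example:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // hypothetical choice of Postgres driver
)

// A recursive CTE generates the numbers 1..5, and a window function
// computes a running total over them - no tables needed.
const query = `
WITH RECURSIVE t(n) AS (
    SELECT 1
    UNION ALL
    SELECT n + 1 FROM t WHERE n < 5
)
SELECT n, sum(n) OVER (ORDER BY n) AS running_total FROM t;`

func main() {
	db, err := sql.Open("postgres", "postgres://user:pass@localhost/test?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	rows, err := db.Query(query)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()
	for rows.Next() {
		var n, total int
		if err := rows.Scan(&n, &total); err != nil {
			log.Fatal(err)
		}
		fmt.Println(n, total)
	}
}
```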
The speaker, by the way, is genuinely dangerous: using recursive CTEs and regexps he wrote a JSON-to-array parser in Postgres as a single query!
Crouching Server, Hiding Diode
"Hello, my name is Andrey, I'm the Voronezh cattle." This description of the report can be completed :) In general, Aksyonov in all its glory.
Key points:
- when benchmarking, the main thing is to remember the goal: measure what you actually need, not load average;
- averages mean nothing: the "average temperature across the hospital" covers both cold corpses and patients in a febrile coma;
- scalability is not linear: by Amdahl's law, even with only 5% of the code non-parallel, 64 CPUs give roughly a 14.5x speedup, not 64x;
- a more general capacity model is C(n) = n / (1 + a*(n-1) + b*n*(n-1)), where a is the degree of contention (the cost of the non-parallelizable part) and b is the degree of coherency (the cost of coherence, communication and synchronization);
- from it you can find the sweet spot - the amount of resources at which throughput is optimal (a sketch follows the list);
- past the sweet spot throughput actually starts to fall - more resources make things worse;
- don't test on default settings - for most software they are, to put it mildly, strange (fsync after every INSERT and a 32 MB innodb_buffer_pool in MySQL);
- running the same query 1000 times mostly tests the cache;
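The easiest way to get a feel for that formula is to just compute it; a small Go sketch with made-up contention/coherency coefficients:

```go
package main

import "fmt"

// usl is the capacity curve from the talk (the Universal Scalability Law):
// C(n) = n / (1 + a*(n-1) + b*n*(n-1)), a = contention, b = coherency.
func usl(n, a, b float64) float64 {
	return n / (1 + a*(n-1) + b*n*(n-1))
}

func main() {
	a, b := 0.03, 0.0005 // made-up coefficients for illustration
	bestN, bestC := 1, 0.0
	for n := 1; n <= 128; n *= 2 {
		c := usl(float64(n), a, b)
		fmt.Printf("n=%3d  speedup=%6.1f\n", n, c)
		if c > bestC {
			bestN, bestC = n, c
		}
	}
	fmt.Printf("sweet spot near n=%d (beyond it throughput falls)\n", bestN)
}
```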
"World constants" - it is important to understand roughly what each operation costs:
- CPU L1 - 1,000,000,000 op/sec (1e9)
- CPU L2, branch mispredict - 100,000,000 op/sec (1e8)
- RAM access - 10,000,000 op/sec (1e7)
- SSD behind a megaraid - 100,000 op/sec (1e5)
- SSD - 10,000 op/sec (1e4)
- LAN roundtrip, 1 MB from RAM - 1,000 op/sec (1e3)
- HDD seek, 1 MB over LAN - 100 op/sec (1e2)
- WAN roundtrip - 10 op/sec (1e1)
- Memcached access - 10,000 op/sec (1e4)
- simple RDBMS select - 100 op/sec (1e2)
Go language
Getting started and Hello, World in Go.
Day 2
How the search works
A morning sobering-up session from Andrei Aksyonov :)
Any search engine's job has four major stages: data acquisition (the spider/crawler), indexing, searching, and scaling.
Indexing includes:
- extracting text (html, pdf -> text);
- tokenization;
- morphological processing (stemming, lemmatization);
- building an inverted index (keyword -> the "pages of the book" it occurs on, plus the positions of the word within them); a toy version follows the list;
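A toy illustration of tokenization plus an inverted index in Go (a real engine adds morphology, compression and much more on top):

```go
package main

import (
	"fmt"
	"strings"
)

// posting records where a word occurs: which document and at which position.
type posting struct {
	doc, pos int
}

// buildIndex tokenizes the documents (here: naive lowercase + split on whitespace)
// and builds word -> list of (document, position) postings.
func buildIndex(docs []string) map[string][]posting {
	index := map[string][]posting{}
	for docID, text := range docs {
		for pos, token := range strings.Fields(strings.ToLower(text)) {
			index[token] = append(index[token], posting{doc: docID, pos: pos})
		}
	}
	return index
}

func main() {
	docs := []string{
		"highload is a conference about highload systems",
		"search engines build an inverted index",
	}
	index := buildIndex(docs)
	fmt.Println(index["highload"]) // [{0 0} {0 5}]
	fmt.Println(index["index"])    // [{1 5}]
}
```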
The data in the index must be compressed; compression can be bit-, byte- or block-level. The less data there is, the less has to be processed per query.
Search itself consists of two orthogonal stages: matching quickly (find) and ranking well (rank).
Matching criteria vary with the domain (web search, data mining). Ranking, in principle, can never be perfectly relevant, because relevance is personal to each user. So ranking relies on clever formulas such as BM25 (built on TF, IDF and document length), whose results are then personalized as far as possible.
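For the curious, a minimal BM25 scoring function in Go - the textbook formula over term frequency, inverse document frequency and document length; the k1 and b constants are the usual defaults, and all the numbers in main are made up:

```go
package main

import (
	"fmt"
	"math"
)

// bm25 scores one document against a query using the classic formula:
// sum over query terms of IDF(t) * tf*(k1+1) / (tf + k1*(1 - b + b*docLen/avgLen)).
func bm25(queryTerms []string, tf map[string]float64, docLen, avgLen float64,
	docFreq map[string]float64, totalDocs float64) float64 {

	const k1, b = 1.2, 0.75
	score := 0.0
	for _, t := range queryTerms {
		f := tf[t]
		if f == 0 {
			continue
		}
		idf := math.Log(1 + (totalDocs-docFreq[t]+0.5)/(docFreq[t]+0.5))
		score += idf * f * (k1 + 1) / (f + k1*(1-b+b*docLen/avgLen))
	}
	return score
}

func main() {
	// toy numbers: term frequencies for one document, plus corpus statistics
	tf := map[string]float64{"highload": 3, "conference": 1}
	docFreq := map[string]float64{"highload": 10, "conference": 200}
	score := bm25([]string{"highload", "conference"}, tf, 120, 100, docFreq, 1000)
	fmt.Printf("bm25 score: %.3f\n", score)
}
```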
As for scaling, search requires resources. Therefore, Google has millions of search servers, Yandex has tens of thousands.
Odnoklassniki Search
The team used to run full-scan search (on MS SQL) over 30M users. On a cluster of 16 databases a single search query took 15-30 seconds on average.
Realizing they couldn't go on like that, they started looking for a solution.
Since the project is in Java, they looked toward Lucene. Over three years of working with Lucene they made the following changes:
- added replication;
- moved index storage into memory (they tried a RAM drive and mapping files onto the heap, but in the end simply loaded the files into byte arrays living in the old generation);
- rewrote the index search code (the default implementation created too many objects, which caused GC problems);
The infrastructure now has a dedicated indexer server that builds the search index. The index is replicated to query servers, which keep it both on disk and in memory (the in-memory copy serves queries). The indexer receives data through a queue, which lets it be unavailable from time to time.
A big drawback was the lack of a path for updating the index directly from the database, bypassing the queue, because messages in the queue sometimes simply disappeared. I can confirm the same problem at my company. Conclusion: when working with queues there should always be a sanity check (see the sketch below).
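A sketch of the kind of sanity check meant here: periodically compare what the database says has changed with what the index has actually seen, and re-enqueue anything the queue lost. All types and names below are hypothetical:

```go
package main

import (
	"fmt"
	"time"
)

// fake "source of truth" and "index" to keep the sketch self-contained;
// in reality these would be the database and the Lucene index.
type store struct{ changed []int64 }
type index struct{ seen map[int64]bool }

func (s store) ChangedSince(time.Time) []int64 { return s.changed }
func (i index) Has(id int64) bool              { return i.seen[id] }
func (i index) Enqueue(id int64)               { fmt.Println("re-enqueue", id) }

// reconcile re-enqueues every record that changed in the database but never
// made it into the index - i.e. the queue lost the message somewhere.
func reconcile(db store, idx index, since time.Time) {
	for _, id := range db.ChangedSince(since) {
		if !idx.Has(id) {
			idx.Enqueue(id)
		}
	}
}

func main() {
	db := store{changed: []int64{1, 2, 3}}
	idx := index{seen: map[int64]bool{1: true, 3: true}}
	reconcile(db, idx, time.Now().Add(-time.Hour)) // re-enqueues 2
}
```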
Only 5% of queries (the most popular ones) are permanently cached, yet the cache hit rate is 60%.
For personalized queries, temporary caches are created on demand and live for a few minutes.
Each user of the portal is pinned to a specific app server (computed from the userId). If that server has problems, the user is moved to a reserve one.
Search among online users was initially done against the general index, with filtering by the "online" flag outside the index; later a separate index was created for this.
Storing data in Evernote
An overview talk about Evernote's internals and numbers.
Software: Java 6, MySQL 5.1, Debian (stable), DRBD (Distributed Replicated Block Device), Xen.
Hardware: SuperMicro 1U, 2x L5630 CPU, 96 GB RAM, 6x 300 GB Intel SSD, LSI RAID 5 (+ spare), ~$8,000.
Database: MySQL, 10 TB (peak read IOPS 350, peak write IOPS 50).
Search engine: Lucene (peak read IOPS 800, peak write IOPS 50).
Servers are located in the USA and China (400 Linux servers).
File access goes through Apache/WebDAV. A user's data always lives on the same host and is never moved around. Compared to NFS, WebDAV has a slight overhead but is many times simpler and cheaper to deploy and support.
Load balancing is handled by an A10 hardware balancer with SSL offload (HTTPS faces the outside world, plain HTTP is proxied inside).
Considering the features of the service (storage of many small files), the problems and their criticality can be described in the following table:
|           | size | normal load | peak load |
|-----------|------|-------------|-----------|
| bandwidth | -    | medium      | medium    |
| latency   | -    | low         | low       |
| cpu       | -    | low         | medium    |
| file size | high | low         | medium    |
| metadata  | low  | medium      | low       |
The author does not recommend using cloud platforms if your application is resource-intensive (bandwidth, storage, cpu) or uses them in variable quantities.
If in doubt, estimate how much it would cost to buy and maintain your own servers for the application's needs, and how much the same would cost on Amazon. In Evernote's case the difference is 1-4 orders of magnitude.
DDOS mechanics
I missed this talk almost entirely, since I was at lunch :) From the final part I managed to catch only a couple of points.
The simplest protection against HTTP attacks is nginx's rate-limiting zones (limit_req / limit_conn). The third-party nginx module "testcookie" is also worth a look.
However, it is important to understand that Nginx is not a cure for DDOS attacks.
DDOS attacks in Russia 2012
In total there were more than 2,600 attacks in 2012 (versus roughly 1,700 the year before).
A botnet of 3,000 machines is enough to take down an average site. The largest botnet observed consisted of 150K machines. Most botnets live in Germany, the USA, Ukraine and Kazakhstan (in descending order).
DDoS activity follows e-commerce activity (plus a bit of politics).
Political DDoS attacks use botnets from Pakistan and India (where law enforcement doesn't reach).
Most attacks generate less than 1 Gbps of traffic (an ordinary site's bandwidth is no wider anyway).
When under attack, the first thing to do is analyze the logs (access.log) and take a traffic snapshot (tcpdump -s0 -c1000000 -w attack.dump). Then neutralize the attack: filter the junk out of the traffic, block the IP addresses the attack comes from, or change your own IP.
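A first pass over access.log can be as simple as counting requests per client IP; a small Go sketch (it assumes the common combined log format, where the client IP is the first field):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"sort"
	"strings"
)

// Counts requests per client IP in an nginx/apache access.log and prints the top
// talkers - a quick way to spot attacking addresses before blocking them.
func main() {
	f, err := os.Open("access.log")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	counts := map[string]int{}
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) > 0 {
			counts[fields[0]]++ // first field of the combined format is the client IP
		}
	}

	type stat struct {
		ip string
		n  int
	}
	top := make([]stat, 0, len(counts))
	for ip, n := range counts {
		top = append(top, stat{ip, n})
	}
	sort.Slice(top, func(i, j int) bool { return top[i].n > top[j].n })
	for i := 0; i < len(top) && i < 10; i++ {
		fmt.Printf("%-15s %d\n", top[i].ip, top[i].n)
	}
}
```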
You should always have at least a 2x margin of site performance.
A distinctive feature of Russian DDoS is the Full Browser Stack: real browsers are used, which pass all the anti-DDoS checks.
And yes - if you expect trouble, talk to Qrator.
Scaling in 2012
The talk was most likely about load optimization or something along those lines - alas, I spent its entire beginning in the restroom :) I mention it only because of a rather interesting discussion that unfolded during the Q&A.
The speaker mentioned that his project has no redundancy or scaling at all, which prompted a rather unexpected question: what is your site's business model? It turned out that users pay for access to the site's features by subscription (6, 12 or 24 months). Given that an outage taking the site down happens only a couple of times a year, and that all the users have already paid, high fault tolerance would be expensive, complicated and, most importantly, not all that necessary :) It would be a different story if the project's revenue depended on every single request!
Proactive Web Performance Optimization
The talk was supposed to be given by a Twitter employee, so I was personally looking forward to something interesting. Right after the start, however, it turned out that he had joined Twitter only recently and had spent most of his time before that doing web optimization. And to finish the audience off completely, he announced he would tell us about a wonderful tool/utility/profiler/plugin for web optimization - YSlow.
Distributed, scalable, and highly loaded storage system for virtual machines
By and large, the talk described how Parallels built its own take on GoogleFS, complete with chunk servers, metadata servers and the other usual attributes.
A distinctive feature (to me) is that they propose using still-expensive but fast SSDs, whereas Google's policy, it seems, comes down to buying cheap hardware that nobody minds throwing away and replacing.
The main purposes of virtualization are sharing the resources of a single host and making backup and deployment efficient.
The best solution is to combine the disks into an array and then carve it up among the virtual machines as SAN storage. But that is expensive, so everyone uses ordinary disks.
With a journaling FS, data first goes to the journal and only then to its place on disk. That way, if the power is suddenly cut, the journal can be replayed and the missing data written out to disk.
The talk also mentioned the Paxos algorithm, traditionally explained through the Greek island of Paxos and its parliament. The deputies are also businessmen, so they are away a lot and no more than half of them are ever present at once; the absent can only be reached by messengers. Nevertheless, parliament must keep working and passing laws, and every parliamentarian must always know the latest laws. As far as I understood, the team uses this algorithm to keep the chunk servers in sync.
A power outage can be emulated by sending the process a SIGSTOP: the application on the other end of the TCP connection ends up in much the same situation as if the power had been cut. This is far faster than staging a real outage.
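For instance, from Go this amounts to a plain SIGSTOP/SIGCONT pair; the PID below is a placeholder:

```go
package main

import (
	"log"
	"syscall"
	"time"
)

func main() {
	pid := 12345 // placeholder: PID of the server process to "power off"

	// Freeze the process: it stops responding at the application level,
	// which, per the talk, puts the peer in roughly the same position as
	// after a power failure.
	if err := syscall.Kill(pid, syscall.SIGSTOP); err != nil {
		log.Fatal(err)
	}
	time.Sleep(30 * time.Second) // observe how the other side behaves

	// "Plug it back in".
	if err := syscall.Kill(pid, syscall.SIGCONT); err != nil {
		log.Fatal(err)
	}
}
```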
Recommendation service on a virtual Hadoop cluster
The team implemented a map/reduce job - "service recommendations on the site" - with Hadoop. The talk walked through their specific problems and solutions, so there is not much to write down here.
MariaDB: The new MySQL
An overview talk about MariaDB - the fork of MySQL rewritten from scratch by MySQL's founder after MySQL passed from Sun to Oracle.
As an engine, it uses Percona's XtraDB by default. Supports InnoDB.
Essentially, the talk boiled down to a changelog of MariaDB 5.5.
Personal story of how MySQL grew and the challenges I've met on the journey
A talk from Oracle - mostly about what is coming in new versions, a kind of MySQL roadmap.
The talk answered an interesting question: why does Oracle support MySQL at all? The company wants a presence in every market, and MySQL covers Web, Mobile and Embedded. Besides, it doesn't want to lose clients like Facebook and Twitter - not to mention the huge community.
I'll also note that the MySQL Installer for Windows includes a utility for migrating from databases such as MS SQL, Sybase, etc. Unfortunately, it can't also rewrite all the related application code along the way, so its usefulness is rather lost on me.
How to choose right index
A talk about what indexes are and what they are eaten with. It contained a lot of background material and examples, so I recommend that anyone interested in the topic look at the slides.
The following points are of the greatest value in my opinion:
- don't create too many indexes: they slow down inserts and add overhead in general - index only where it is really needed;
- select only the data you actually need: less data goes over the network and less data has to be processed by the CPU;
- sorting is a fairly expensive operation;
- DISTINCT is a very slow operation, since it runs over the result of the selection;
- index (a, b) beats index (a) only for equality comparisons;
- a clustered index contains all of the row's data (the point apparently being that with a clustered index the table data is physically ordered by the index);
- if a column used by the query is not in the index, a fetch from the table is required;
Presentations
All presentations are available on SlideShare.
PS
I'm leaving the post without pictures, since the WiFi in the MuMu cafe near Kievsky Station is not much better than the local Olivier salad :)