Highload conference review fwdays'17

On October 14th, the Highload fwdays conference was held in Kiev, dedicated to high-load projects, working with databases and architecture, in particular, microservices, machine learning and Big Data. DataArt sponsored the conference. And our colleagues Igor Masternoy (leader of the Java community DataArt Kiev) and Anna Kolot (.NET, SharePoint Developer) talked about the reports they visited.

Details of the conference program can be found here .

Let's start the review with a report by Dmitry Okhonko from FacebookAbout Log Device. “Yet another log storage,” you will think. You would be right, but this Log Storage stands out from its creators. Facebook claimed bandwidth is 1TB / s. And to find out how they handle the processing of such a volume of data was interesting.

Logdevice

What is a Log Device for? The following user cases were presented at the presentation:

Reliable communication between services.
Write ahead logging.
State replication.
Journaling.
Maintaining secondary indexes on a big distributed data store.

The system is designed to quickly reliably accessible and for centuries to record a huge number of append-only logs, which are the smallest addressable units. In addition, the system should guarantee the order of recording the log and the ability to get it if necessary. Let's look at its architecture.

Fig. 1 LogDevice architecture.

The data that is written to LogDevice, for a start, goes to the so-called sequencer nodes (in the figure, these are red and green rhombuses). Each log has its own sequencer, which generates a number consisting of two numbers: the era number and the record serial number. This combination of identifiers is necessary to guarantee the fault tolerance of the sequencer itself. If the sequencer node dies, the new one should continue to record logs, whose identifiers will be more than previously recorded. For this, a new node increases the number of an era. Log metadata, log era identifiers are stored in zookeeper.

Next, the log with the identifier obtained is recorded in a certain number of storage nodes. The speaker explained that inside the storage node, the key-value Rocks DB database is spinning. Each log in it is a record of the form:

{LOG_ID, LSN, COPYSET, DATE, Payload}

LOG_ID - log type;
LSN - record identifier;
COPYSET - a list of identifiers for the nodes that contain the record;
Payload - the body of the log.

How does reading records work? As you can see in the figure, each Reader client connects to all the nodes that contain the records of the logs it needs. Each node issues records starting with the specified identifier or date. The task of the Reader client is to remove duplicates and sort the received records.

You can read more about LogDevice on Facebook . The planned release in open source is the end of 2017.

ML reports

In the neighborhood with speeches about Highload and Big Data, one could get deeper acquainted with the fashionable direction of Machine Learning. We managed to listen to the reports of Sergey Vashchilin (PayMaxi) “Semantic text analysis and search for plagiarism” and Aleksand Zarichkova (Ring Ukraine) “Faster than real-time face detection” .

Text semantic analysis

The speaker walked on the basic algorithms in the field, however, the topic of machine learning remained unrevealed. The report presented SVD algorithms , tf-idf for searching document keywords, shingles algorithm and using the lzm archiver to compare the similarity of two documents. The system described by the author is a relatively inexpensive way to check documents for matches.

At the input of the system, a document is filed in which keywords are found using SVD, these words are used to search for similar documents on the Internet. As a cache, Apache Lucene is used to speed up the search, which allows you to locally save and index documents. Documents found are processed using the shingles algorithm and compared with the incoming document.

The advantages of the system are its simplicity and low cost of development in comparison with analogues. The disadvantage of the system is the lack of real semantic analysis of the text and a more intelligent search for keywords.

Faster than real-time face detection

The report provided an overview of machine learning methods for recognizing objects in pictures and determining their location on them (as in Figure 2). The speaker told the story of recognition methods ranging from the Viola-Jones detector to modern (2015) convolutional neural networks such as YOLO, SSD.

Fig. 2 Recognition of objects and their location.

This report was especially interesting because it was represented by the Ukrainian startup Ring Ukrainewho is working on a smart doorbell device. As it is clear from the speech, the described neural networks are widely used to determine the face of the person who is ringing at your door. The main difficulty, according to the speaker, remains the definition of the caller in real time, which requires a high speed of recognition itself. Modern neural networks YOLO or SSD have such characteristics, below is a comparative table of architectures of various neural networks.

Table 1. Comparison of neural network architectures

Model	Train	mAP	FLOPS	Fps
Yolo	BVOC 2007 + 2012	40.19	40.19	45
SSD300	VOC 2007 + 2012	63.4	-	46
SSD500	VOC 2007 + 2012	74.3	-	nineteen
YOLOv2	VOC 2007 + 2012	76.8	34.90	67
YOLOv2 544x544	VOC 2007 + 2012	76.8	59.68	40
Tiny YOLO	VOC 2007 + 2012	57.1	6.97	207

As we can see in Table 1, neural networks differ in the characteristics of speed (FPS - frames per second) and accuracy (mAP - mean average precision) of recognition, which allows you to select a neural network for a specific task.

New features in Elastic 6

The report by Philip Krenn of Elastic began with the table below from db-engines.com, which tracks the popularity of search engines. Elastic is two times ahead of its closest competitor Solr. It is used by organizations such as CERN, Goldman Sachs and Sprint.

What is so cool about the new Elastic 6 beta released in August? The following innovations were presented in the report.

Cross cluster search

Allows you to search for data in several Elastic clusters. Such an opportunity already existed, but it was not implemented optimally - with the help of the Tribe node. Tribe kept the merged state of both clusters, accepting any update of the state of each Elastic-controlled cluster. Plus, the Tribe node that received the request independently determined which indexes, shards, and nodes should be polled and which top-N documents should be loaded.

Cross cluster search does not require a separate dedicated node and merged state of both clusters. Any node can accept cross-cluster requests. The cross cluster request is handled by a coordination node that communicates with the three configured seed nodes of another cluster. You can also configure the gateway node, which directly accepts the connection from the coordinator and acts as a proxy. For coordination from the point of view of the API, there is no difference in the search queries for indexes on another cluster - the results will be marked with a prefix with the name of the second cluster. Cross cluster search unifies the search API and removes some restrictions compared to the Tribe node (search by index with the same name on the local and related clusters). More details can be read here and here .

How to migrate an existing Elastic 5 cluster to Elastic 6? Starting with the sixth version, you can update the version using Rolling Upgrades without stopping the Elastic cluster (valid for switching from Elastic 5.6 -> Elastic 6).

Sequence numbers

Sequence numbers is one of the important new features of Elastic 6. Each update, delete, put operation will be assigned a sequential identifier, so that the lagging secondary replica can restore the flow of operations from the Transaction log starting from the i-th one without copying files. Users are given the opportunity to configure the transaction log storage time.

Types

Elasticsearch previously represented Mapping types as tables in a database (index). Although the analogy is not entirely accurate. Columns in databases with the same name may not depend on each other, but in Elastic, the fields of two different Mapping types must have the same Field type. You cannot assign the deleted date field of one type to boolean in another type. This is due to how exactly Elastic implements types. Apache Lucene under the hood of Elastic stores fields of different types in the same field. Now Elastic has decided to phase out types in indexes. In Elastic 6, each index can contain only one type and may not be specified during indexing. In future versions, type support will be completely discontinued.

Top 10 architectural files on a real highload project

I liked the presentation by Dmitry Menshikov “Top 10 architectural files on a real highload project”. Dmitry is a born speaker, owns the word, successfully adds humor to the speech and supports the interest of the audience. The speaker showed ten cases, told what led to problems, and how they could resolve the situation.

For example:

As a mistake in choosing a UUID in the format date + time + MAC address and a length of 128 bits, discovered after the project was put into operation, it led to slowdown and almost cost two years of development. And as a simple decision to increase the length of the index to 4 bytes saved the situation. As you know, timestamp is a 60-bit value, and for UUID version 1 it is presented as an account of 100-nanosecond intervals in UTC, starting from 00: 00: 00.00 on October 15. The problem was that the possible inclusion of the second fraction at the beginning and the lack of a uniform distribution were not taken into account due to the features of the generation of identifiers.

As a new developer decided to fix a long-falling test, he added a check for the existence of a key and about 2 million user messages that did not have a key were lost.

In our opinion, the report was not saturated with complex production details. But the conclusion is unchanged: implemented solutions must be tested in practice and think about how change will affect the system.

Microservice Architecture Traps

Nikita Galkin, a systems architect at GlobalLogic, presented a paper entitled “Microservice Architecture Traps” . The main problems in introducing microservices, according to Nikita, consist in developing without using template approaches - when in one case the settings file is called config, in the other - settings, etc. refusal from daily updating of documentation during the development process; the use of microservices where the problem can be solved by the box means of already implemented products.

High [Page] load

An interesting report was “High [Page] load” by Dmitry Voloshin, co-founder of Preply.com, a site for finding tutors around the world. Dmitry spoke about the history of the site’s development, how it became necessary to measure and improve the page loading time with an increase in the number of users from three to a million and with the site entering the intercontinental market.

The main idea of the report is the need for continuous monitoring of factors affecting page loading speed; the need for work to improve performance, as the success of a business depends on the speed of page loading.

Tools used by Preply to increase page loading speed:

basic: caching, orm optimizations;
more advanced: replicas, load-balancing;
current: CDN, microservices.

Tags: