illustrarium September 26, 2014 at 14:33

How we are writing a web service for a billion users


Maxim Model, IT director of the BeSmart.net project, on building a global training service

Our team is working on the BeSmart project. We currently have nine programmers, including the IT director, that is, myself (there are also designers, marketers and other specialists, more than 20 people in all). We work in Vitebsk, Belarus, a city known in Russia for the "Slavic Bazaar" festival.

BeSmart.net is a service for publishing training lectures in video, audio and PDF formats, which, we hope, will eventually be watched around the world. We have many ambitions, but for now we will set them aside and describe the two goals that we, the developers, are facing and how we are pursuing them.

Our first goal is to build a high-load system that can withstand requests from millions of users who will download or buy lectures. We understand that we have taken on an extremely difficult task, and so we want to do everything thoroughly.

The second goal is to protect the content as much as possible. Of course, we understand that protecting information on the Internet is very difficult, if it is possible at all. Still, we intend to make unauthorized access to the lectures posted in our system as hard as we can and, where possible, to prevent it entirely.

Now I will describe the technologies that we have applied to solve both problems.

Which programming language we chose

We develop the core of the system in C++, which, like any language, is not without flaws. The main one is long development time. C++ programs are not so much written as designed: first the architecture of the future application is worked out, and only then does coding begin. This approach is not mandatory, of course, but it makes the project much easier to maintain and extend later.

C++ is an extremely powerful tool, and that power requires careful handling. A C++ programmer must be highly qualified so that the code stays clear and understandable for those who will maintain the project.

The second drawback (which also feeds the first) is the lack of ready-made solutions, especially for web development. PHP, for example, has plenty of them, while C++ has practically none. A lot has to be done from scratch, and BeSmart.net is 90% custom-written.

The work took more than two years. In March 2012 we began preparing the terms of reference, which lasted until May. Then development of the project core began and continued until November 2012. In May 2013 we started developing our own file server. Only in October 2013 did the project reach commercial use, when it became possible to use the site in full.

What we have in common with Facebook

As you have already gathered, we are aware that writing code in C++ is difficult and time-consuming. Now let's turn to the advantages for which we chose the language. First of all, it produces efficient compiled code, which is impossible or extremely difficult to get on other platforms. This is very important for high-load projects.

The compiler (we use GCC) converts the source code into machine code that the computer executes directly. As a result, user requests are processed faster. When thousands or even millions of people use the service at the same time, they get their data quickly, and serving them requires less server hardware than it otherwise would.

Facebook was written in PHP, but later Zuckerberg needed to reduce the load on the servers: with millions of users, this yields significant savings. So Facebook wrote its own software, HipHop, which translates PHP into C++ and then compiles it into machine code. However, not all PHP scripts can be translated, and Facebook switched to C++ only partially.

VKontakte programmers, as far as we know, still write PHP code, but they do so deliberately. New VKontakte features are first written quickly in PHP, then tested, and if they prove themselves, they are rewritten in a compiled language (a year ago, Pavel Durov announced an in-house development, KPHP).

In our project, everything is written in C++ from the start, and that is the mark of a serious project. When we first started working on BeSmart.net, we met with IT people from other projects, and one of them asked me: "Well, how are you? What are you working on?" I said: "We are going to build an application in C++." To which my friend replied: "Are you going to serve a billion people or what?" I said: "Yes, we are going to serve a billion."

Where we store user data

So far we do not have a billion lecturers and students (by the way, you can bring our goal closer by uploading a lecture to BeSmart.net). But we are already preparing to store large amounts of multimedia data. The storage has to be secure, and this is our second big goal. For this we have both software and hardware solutions.

We rent servers in two data centers: one in Moscow, the other in Hong Kong. We are now negotiating to rent a server in a European country as well. The data on the servers is replicated, and content is delivered from the server closest to wherever in the world the request comes from.
Of course, we know that there are companies providing CDN services that do the same thing on an outsourced basis. But the whole point is that we do not use other companies' services, because we vouch for the safety of all data on the site.
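
To illustrate the "closest server" idea, here is a minimal sketch in C++. It is not the BeSmart.net implementation: it assumes the user's approximate coordinates are already known (for example, from an IP geolocation lookup) and simply picks the data center with the smallest great-circle distance. The `DataCenter` structure, the function names and the coordinates are hypothetical; a production system would more likely rely on measured network latency or DNS-based routing.

```cpp
#include <cassert>
#include <cmath>
#include <string>
#include <vector>

// Hypothetical description of one data center (coordinates in degrees).
struct DataCenter {
    std::string name;
    double lat_deg;
    double lon_deg;
};

// Great-circle distance in kilometers between two points (haversine formula).
double haversine_km(double lat1, double lon1, double lat2, double lon2) {
    constexpr double kPi = 3.14159265358979323846;
    constexpr double kEarthRadiusKm = 6371.0;
    const double to_rad = kPi / 180.0;
    const double dlat = (lat2 - lat1) * to_rad;
    const double dlon = (lon2 - lon1) * to_rad;
    const double a = std::sin(dlat / 2) * std::sin(dlat / 2) +
                     std::cos(lat1 * to_rad) * std::cos(lat2 * to_rad) *
                         std::sin(dlon / 2) * std::sin(dlon / 2);
    return 2.0 * kEarthRadiusKm * std::asin(std::sqrt(a));
}

// Returns the data center geographically closest to the user.
// Assumption: geographic distance is a good enough proxy for network distance.
const DataCenter& nearest_data_center(const std::vector<DataCenter>& dcs,
                                      double user_lat, double user_lon) {
    assert(!dcs.empty());
    const DataCenter* best = &dcs.front();
    double best_km = haversine_km(user_lat, user_lon, best->lat_deg, best->lon_deg);
    for (const DataCenter& dc : dcs) {
        const double km = haversine_km(user_lat, user_lon, dc.lat_deg, dc.lon_deg);
        if (km < best_km) {
            best_km = km;
            best = &dc;
        }
    }
    return *best;
}
```

With a list containing Moscow (about 55.75, 37.62) and Hong Kong (about 22.32, 114.17), a request from Berlin would be routed to Moscow and a request from Tokyo to Hong Kong.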

We also wrote our own asynchronous file server (the software responsible for storage) in C++. Its distinguishing feature is that it serves content to users not from the Internet-facing servers, but from our internal local network.
This needs some explanation: in addition to the servers that run the BeSmart.net application software, we have separate servers that hold user data. They are joined into a local network that is not directly connected to the Internet. The file server reaches into this local network when it serves content to site users. In effect, this is one cluster of servers inside another, which protects the data better against break-ins.

How data is stored in BeSmart.net

DC - data center;
Cluster Web - a cluster of web servers that receive and process user requests;
Cluster FS - a cluster of servers that provide access to the project's file resources (content download and upload);
GW - a gateway that forms a secure channel joining the internal networks of the different BeSmart.net data centers into a single network (used to build the distributed data storage network and to replicate it).

How we protect copyrighted content

The site accepts lectures from users in all popular video formats. They are stored in MP4, however, and viewers receive them over HLS (HTTP Live Streaming), a protocol developed by Apple. The video is not delivered as one whole file that is gradually played back from the cache, but as a stream of 10-second fragments. Some browsers save these fragments in the cache while playing the video and some do not, but either way the technology makes life harder for pirates.
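
For readers unfamiliar with HLS: the player first downloads a small text playlist and then requests the listed fragments one by one. The sketch below generates such a playlist for fixed-length fragments. It only illustrates the format and is not our server code; the function name `make_hls_playlist` and the fragment file names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <sstream>
#include <string>

// Builds an HLS media playlist for a video cut into fixed-length fragments.
// The last fragment is shorter when the total length is not a multiple of
// segment_seconds.
std::string make_hls_playlist(double total_seconds, double segment_seconds,
                              const std::string& segment_prefix) {
    assert(total_seconds > 0.0 && segment_seconds > 0.0);
    std::ostringstream out;
    out << std::fixed;
    out.precision(3);
    out << "#EXTM3U\n"
        << "#EXT-X-VERSION:3\n"  // version 3 allows fractional EXTINF durations
        << "#EXT-X-TARGETDURATION:"
        << static_cast<long>(std::ceil(segment_seconds)) << "\n"
        << "#EXT-X-MEDIA-SEQUENCE:0\n"
        << "#EXT-X-PLAYLIST-TYPE:VOD\n";
    int index = 0;
    for (double start = 0.0; start < total_seconds;
         start += segment_seconds, ++index) {
        const double duration = std::min(segment_seconds, total_seconds - start);
        out << "#EXTINF:" << duration << ",\n"
            << segment_prefix << index << ".ts\n";
    }
    out << "#EXT-X-ENDLIST\n";  // marks the playlist as complete
    return out.str();
}
```

For a 25-second lecture and 10-second fragments, `make_hls_playlist(25.0, 10.0, "lecture")` lists three fragments: two of 10 seconds and a final one of 5 seconds.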

We mark all content that users upload to the site with a watermark. Of course, a watermark will not stop a video from being distributed illegally, but asserting your rights to it is, on the whole, far from pointless. At the very least, it will certainly make it easier to get our lectures removed from sites that operate within the law.

To make it harder for pirates to get rid of the watermark, we made it dynamic: the BeSmart.net logo slowly floats across the picture, changing its path. While the clips are being saved on the servers, they are processed: each recording is split into audio and video tracks, the watermark is overlaid, and the tracks are then merged again. As a result, the watermark's motion path in the frame is different every time. We wrote our own software to split the audio and video tracks and to merge them back. To overlay the watermark on the video stream frame by frame, we use FFmpeg.
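
The exact algorithm behind the floating logo is not described here, so the following is only one possible sketch of the idea, under stated assumptions: each video gets a seed, the seed determines the speed and starting offset along each axis, and the logo bounces between the frame edges. All names (`WatermarkPath`, `make_path`, `bounce`, `watermark_position`) and the speed ranges are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <random>

struct Point {
    int x;
    int y;
};

// Per-video motion parameters (hypothetical): speeds in pixels per second
// and starting offsets in pixels.
struct WatermarkPath {
    double speed_x;
    double speed_y;
    double phase_x;
    double phase_y;
};

// Derives a path from a per-video seed, so every video gets its own trajectory.
WatermarkPath make_path(std::uint32_t seed) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> speed(5.0, 15.0);  // slow drift
    std::uniform_real_distribution<double> phase(0.0, 1000.0);
    WatermarkPath path;
    path.speed_x = speed(rng);
    path.speed_y = speed(rng);
    path.phase_x = phase(rng);
    path.phase_y = phase(rng);
    return path;
}

// Triangle wave: folds an ever-growing distance into [0, range], so the value
// moves back and forth between the two edges.
double bounce(double distance, double range) {
    if (range <= 0.0) return 0.0;
    const double period = 2.0 * range;
    const double m = std::fmod(distance, period);
    return m <= range ? m : period - m;
}

// Top-left corner of the logo at time t_seconds, always inside the frame.
Point watermark_position(const WatermarkPath& path, double t_seconds,
                         int frame_w, int frame_h, int logo_w, int logo_h) {
    assert(t_seconds >= 0.0 && logo_w <= frame_w && logo_h <= frame_h);
    const double x = bounce(path.phase_x + path.speed_x * t_seconds, frame_w - logo_w);
    const double y = bounce(path.phase_y + path.speed_y * t_seconds, frame_h - logo_h);
    return {static_cast<int>(x), static_cast<int>(y)};
}
```

Because the speeds along the two axes differ, the logo traces a path that does not repeat for a long time, and the coordinates computed for each frame can then be handed to whatever tool performs the actual overlay.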

As a reminder, besides video and audio files you can also upload PDFs to the site. These files are watermarked too. During upload we split each file into separate pages, and a program we wrote specifically for this puts a watermark on each of them. We also deliberately lower the image quality so that the text remains suitable for reading, but not for printing.

What will happen next

New tasks come up every day. They do not always serve our two main goals, but one way or another they improve the service. Recently we added a so-called "exchange of business cards" for users who want to communicate with each other, an analogue of "friending" on social networks. In addition, users can now keep blogs, leave comments and rate lectures. Writing that code took two weeks.

Our project is young and for that reason not without flaws, but we are working on them. Development is not finished; we are at the beginning of the road. Nothing is impossible: there are only tasks and deadlines for completing them.
