Big Data - Bro or Not Bro

    Moscow hosted the Big Data, Meet Big Brother conference organized by the Sistema_VC fund. It had a bit of everything: an Israeli developer who processes data a hundred times faster than anyone else, MTS declaring that it will die if it does not become an IT company, and Russian businessmen arguing loudly, trying to dispel the alarm.



    It seems everyone is used to talking about big data by now, and if the conversation turns philosophical, sooner or later the Orwellian supervillain Big Brother shows up, just like Hitler in every argument on the Internet. The organizers didn't wait around and put the cliché right in the title. The anxiety, justified or not, is part of the hype; what can you do.

    Big data, in fact, has been collected since antiquity, ever since the ancient Egyptians held censuses to figure out how to use people more effectively. Under Peter I, big data (a population census for tax collection) took three years to gather and another three to process. Today the process has gained wires, speed and new types of data, all in the name of efficiency, optimization and mankind's even older dream that everything would somehow take care of itself.

    Businesses dream that everything will segment itself and decide on its own what to sell to whom and when. Buyers want everything to be chosen, bought, switched on, chewed and digested for them. Smart people gathered at the conference to discuss how to get there. I listened to them attentively, asked questions and wrote everything down.

    Johan Callebaut and psychology in big data




    The conference began with a talk by a psychologist, Johan Callebaut. He works at DataSine, where machine learning and psychological models are used to segment audiences and learn which advertisement is best shown to whom.

    It works like this: they collect all the data they can find, from Internet activity to payment history, and use machine learning to map it onto the Big Five psychological model:
    extraversion - introversion
    affection - detachment
    self-control - impulsivity
    emotional instability - emotional stability
    expressivity - practicality

    Johan says the company does not use the fourth trait because it would be unethical: it could reveal something about a person's mental health, which could then be used against them.

    The classification rules, of course, were devised by people and, if you don't dig deeper, look rather stereotypical. For example, Johan says that if you buy a lot of books, you are most likely an introvert. If you often spend money in bars, you are probably an extrovert (because the introverts are sitting at home keeping quiet).

    To the question "why on earth?!" Johan has a medical answer. It is all about acetylcholine, to which people have different degrees of sensitivity. If a person is sensitive to it, they become an introvert: a strong surge from, say, interacting with people makes them curl up and go quiet. In extroverts the stimulation threshold is higher, so crowds, noise and long stretches of socializing don't wear them out.

    Acetylcholine doesn't spike only at the sight of people; it reacts to many things: colors, sounds, words. That is why Johan's team makes different promotional emails for extroverts and introverts.
    For example, we use the same figures and facts but design the emails in a mailing differently. For extroverts we use bright, orange images; for introverts, cold blue ones. Machine learning helps us pick these images. Changing a single image in an email increases clicks on the link by 40%. If you also tailor the text, the gain grows to 80%.
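
    A minimal sketch of what this kind of personalization could look like in code, assuming a hypothetical trained classifier, made-up feature names and a simple threshold; this illustrates the approach, not DataSine's actual pipeline:

```python
# Toy illustration: route each recipient to an email template based on a
# predicted extraversion score. The features, labels and threshold are
# invented; a real system would be trained on far richer behavioral data.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Fake training data: [books_bought_per_year, bar_spend_per_month]
X = np.array([[20, 5], [1, 120], [15, 10], [2, 90]])
y = np.array([0, 1, 0, 1])  # 0 = introvert-leaning, 1 = extrovert-leaning

model = LogisticRegression().fit(X, y)

def pick_template(features):
    """Return an email template name for one recipient."""
    extrovert_prob = model.predict_proba([features])[0][1]
    # Bright visuals for likely extroverts, calmer ones for introverts.
    return "bright_orange_template" if extrovert_prob > 0.5 else "cool_blue_template"

print(pick_template([3, 100]))  # high bar spend, likely extrovert
print(pick_template([25, 2]))   # many books, likely introvert
```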

    When Johan was anxiously asked whether the spread of big data would turn us all into introverts, he replied that it would not: whatever you were born as is what you will stay.

    But that was the most unusual of the anxious questions. The rest followed the classic script: won't companies start manipulating us with all these psychological tricks of yours?

    Many companies haven't even reached the level where they could use big data, let alone manipulate anyone with it. And in general, we are not going to manipulate you. We don't want to make you do anything against your will. We only personalize offers so that everyone is better off.

    Ami Gal and a GPU-speed database




    Ami Gal, the founder of SQream, came to the conference from Tel Aviv. His company develops its own database which, the company claims, works 100 times faster than conventional ones by processing queries on the GPU. That makes it well suited to big data.

    Among the examples, Ami described the case of an Israeli cancer research center. It has a database covering decades of treatment of thousands of patients: gene samples for each patient, information about all the anomalies and reactions and, of course, the outcome of each particular treatment.

    By combining these huge datasets, scientists learned to pick the statistically most promising treatment for each new patient. The problem was that a single column of such a table could hold up to 6 billion records. The analysis used to take two months; now it takes two hours.

    That is, as soon as scientists receive a patient's DNA sample, they immediately know which method is most likely to lead to success.

    I wanted to learn more about Ami, his company and its technology, so I asked him about all of it in person.



    Ami studied computer science and physics at Tel Aviv University, then worked as a programmer, and in 1996 founded his first company. According to him, it was nothing like today's tech startups: "We had to build something and sell it to customers right away just to survive."

    In 2000, he and his partners founded Magic Software. There Ami took the post of technical director and vice president of R&D, but gradually shifted from technology to business, "moving to the dark side."

    After leaving Magic three years later, Ami began investing in startups. "If startups run on relatives, friends and fools, then I was one of the last," he laughs.

    Finally, in 2010, together with Kostya Varakin, an emigrant from Russia, Ami came up with the idea of speeding up databases with a GPU and founded SQream.

    - When the idea appeared, wasn't there a feeling of "this is obvious, why is nobody running SQL queries on GPUs yet?"

    Today it is obvious. But when we started, no one wanted to listen to us. Everyone thought it couldn't be done.

    The idea came from my co-founder, Kostya Varakin from St. Petersburg. But it seemed so impossible that he didn't dare voice it right away. And I thought: using a graphics processor not for games but for data processing, that's cool. We started working and made this approach the foundation of the company.

    Of course, we believed GPUs would be perfect for data and everyone would start using them immediately. They didn't. I remember when I wanted to raise investment, business people reacted like this: "Are you kidding me? Data processing on a GPU? That doesn't happen, go away."

    Only six years later (about two or three years ago) did GPUs become mainstream, thanks to AI and deep learning. And, of course, processing data on the GPU no longer seems a strange idea.

    - Didn't the people you pitched the idea to see the speed?

    They saw it, everyone saw it. But graphics processors are designed to work with vector graphics, and the way we process data is the exact opposite of that. The chip is not built for this kind of computation. So we have to make the processor believe, through software, that it is processing, say, video, even though it isn't. Everything has to be converted before and after the GPU, because it only accepts its own kind of workload.

    We had to take complex problems and break them into lists of simple instructions for the processor. And that looked almost impossible.
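
    The core trick he describes, recasting a data operation as uniform, data-parallel instructions over whole columns, can be sketched in a few lines. CuPy is used here only as a stand-in for "run this on the GPU"; SQream's real engine is written in C++/CUDA and works very differently:

```python
# Toy illustration of turning a query into uniform, GPU-friendly array
# operations. CuPy is only a stand-in for "run this on a GPU"; if no CUDA
# device is available, the same code falls back to NumPy on the CPU.
import numpy as np

try:
    import cupy as xp   # GPU arrays
except ImportError:
    xp = np             # CPU fallback

# One column of a (tiny) table: transaction amounts and region codes.
amounts = xp.asarray([120.0, 5.5, 990.0, 43.2, 710.0])
regions = xp.asarray([1, 2, 1, 1, 2])

# SELECT SUM(amount) WHERE region = 1, expressed as data-parallel steps:
mask = regions == 1                    # one simple instruction over the column
total = float((amounts * mask).sum())  # masked sum, again one bulk operation
print(total)                           # 1153.2
```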

    - And what was the hardest part to develop?

    Working with Russians (laughs). Actually, the hardest thing in the company's history was not a technical decision. At the very beginning we planned to build only an accelerator for other people's databases, something that would speed up Oracle or MS SQL. Say, we send a query to Oracle and it runs faster thanks to the GPU.

    We entered the market with the question: "Do you need a piece of software that makes your database run 20 times faster?" And the market answered: "No, we don't."

    The problem was that we intercepted the query between the engine and the client, which was an intrusion into Oracle's work. We were told: "You can't do that. Send the query to your own engine and process it yourself." We said: "We don't have a database."
    "Then build one."

    We looked at how other companies do it, how data warehouses with MPP architecture are built. All of them are based on another database, mostly PostgreSQL or MySQL. Vertica, Greenplum and the other previous-generation warehouses are all built around PostgreSQL.

    We decided to try that too: we took PostgreSQL and GPU utilities. It turned out very slow; the speed only doubled. Nobody would move a database to the GPU for a mere twofold speedup. We didn't know what to do and didn't sleep for a week. With all due respect to myself and my colleagues, building a database from scratch seemed too much for us, too big a project.

    But we tried, and after building the first block, performance grew 18 times. Then we decided to keep going, even though we knew the road would be long and hard. That decision was the hardest in all of SQream's history: it meant we would need far more money, people and time to build the company.

    In purely technical terms, the hardest part was getting a JOIN between two large on-disk tables to run on the GPU.
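
    To give a sense of what such a join involves, here is a minimal sketch of the classic textbook answer, a partitioned (grace) hash join, in plain Python: rows are scattered into partitions by a hash of the join key, so only one pair of partitions has to fit in memory (or GPU memory) at a time. This is a generic technique for illustration, not SQream's implementation:

```python
# Minimal partitioned (grace) hash join: join two tables that are too big for
# fast memory by hashing rows into partitions and joining matching partitions
# one pair at a time. Generic textbook technique, shown on toy tuples.
from collections import defaultdict

NUM_PARTITIONS = 4

def partition(rows, key_index):
    parts = defaultdict(list)
    for row in rows:
        parts[hash(row[key_index]) % NUM_PARTITIONS].append(row)
    return parts

def hash_join(left, right, left_key=0, right_key=0):
    left_parts, right_parts = partition(left, left_key), partition(right, right_key)
    result = []
    for p in range(NUM_PARTITIONS):
        # Build a hash table on the left partition...
        table = defaultdict(list)
        for row in left_parts.get(p, []):
            table[row[left_key]].append(row)
        # ...and probe it with the matching right partition.
        for row in right_parts.get(p, []):
            for match in table.get(row[right_key], []):
                result.append(match + row)
    return result

patients = [(1, "anomaly A"), (2, "anomaly B")]
treatments = [(1, "drug X", "success"), (2, "drug Y", "failure")]
print(hash_join(patients, treatments))
```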

    - What is your stack?

    We use CUDA to work with the GPU. Everything is written in C++ and Haskell, with a little Erlang.

    When you work with billions of transactions within a certain window of time, say a fraction of a second, you need something very close to the hardware.
    We go from assembly to CUDA and on to C++. Anything you add along the way costs speed, so we have to stay as low-level as possible. We tried other platforms: for example, we used OpenCL instead of CUDA, but it was not as mature and the process was too slow.

    We have to go as deep as possible to keep performance high.
    For that we use languages like C++, Haskell and CUDA. In some places we bring in Erlang, but that happens much less often; more and more we rely on plain C++.

    - If I had only worked with ordinary databases, would I have to retrain to switch to yours?

    In terms of the language, there is nothing new to learn: if you wrote SQL, everything is the same here. Some things work differently, but the documentation describes how to set everything up.

    - The claimed speedup is 100 times. Is that the maximum that can be squeezed out of the GPU?

    I don't think our company has reached even 10% of what is possible. Already in September we will release the third version of the product, in which performance doubles. In the future we plan to raise it again and again. CPU performance has barely grown since 2006, while the amount of data grows exponentially. GPU performance, on the other hand, keeps growing.

    It turns out we are at the very beginning of the life cycle. One of the things we plan to do in the near future is to scale performance not just on one GPU but across several. Just imagine the speed. Take a query that runs for 100 seconds: we split it into several small ones across ten GPUs, and the query finishes in an instant.
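
    A toy sketch of that idea, splitting one big aggregation into partial queries that run in parallel and merging the partial results; ordinary worker processes stand in for GPUs here, and the data is made up:

```python
# Toy sketch: split one large aggregation into shards processed in parallel,
# then merge the partial results. Worker processes stand in for GPUs.
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def partial_sum(chunk):
    # In a real engine each device would scan its own shard of the column.
    return float(chunk.sum())

if __name__ == "__main__":
    column = np.random.rand(10_000_000)      # one big column to aggregate
    shards = np.array_split(column, 10)      # "ten GPUs"

    with ProcessPoolExecutor(max_workers=10) as pool:
        partials = list(pool.map(partial_sum, shards))

    total = sum(partials)                    # merge the partial results
    print(total, float(column.sum()))        # should agree up to rounding
```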

    In general, I think we are on the edge of a new era in which GPU computing will become dominant in data processing.

    - Why hasn't it become dominant yet? What is stopping it?

    A lot of things. I can name three barriers.

    The first is not as strong as it used to be, but it still exists. When we come to companies that work with Oracle or IBM, they face a choice: go with a small startup from Tel Aviv or stay with the big player. Even when they do decide, the process drags on.

    The second obstacle is the lack of people. Tel Aviv is a small Silicon Valley. Competition for talent in Israel is fierce: it takes me three months to find the right employee, though I need them in three seconds.

    And finally, the third: as the owner of a technology company, I can tell you there is always someone smarter than you, and there are many of them. We constantly have to make sure the technology stays at the peak of its possibilities, and that takes a lot of investment.

    - Don't you think the GPU is still a "crutch", and that it would be better to find or invent a dedicated processor for data?

    Of course we are looking at new types of processors, not only graphics ones. There are already better technologies; they will reach the market in the next couple of years, and we have to be ready for that. That is why we stay in touch with startups and manufacturers of computing chips, including quantum computers.

    As soon as these technologies mature, the world will be able to solve problems much faster, and of course I can't wait to see it. To be very optimistic, the first such machines will appear in five years, in very early versions suitable for academic research. And it will be less than ten years before the first attempts to bring them into public fields like medicine and security. Until then, the GPU will have time to show what it can do. It will be interesting to see which ends up faster.

    Russian companies and big data




    During a break between talks, young and beautiful people loafed around the stylish space, strolled on the roof, chatted and drank herbal lemonade. I didn't get any because of that stupid acetylcholine (thanks to Johan for the explanation), but I'm not offended.

    Then Leonid Tkachenko from MTS Big Data, GOSU Data Lab founder Alisa Chumachenko, Segmento founder Roma Nester and Evgeny Isupov from Tinkoff Bank came on stage to discuss big data.

    Leonid's statements impressed both me and the audience. It is unusual to hear that level of candor from a top manager of one of Russia's largest companies. The fact that I quote him here more than the others doesn't mean he said that much more than anyone else (and this is not an ad for MTS: I use another operator, and Leonid, judging by his words, has come to terms with that, although even so he knows more about me than I thought).



    He began right away with the claim that big data doesn't really work yet and the myth is inflated. In his view, if a problem couldn't be solved by conventional methods, nothing changes with the arrival of big data.

    For example, MTS had a successful customer churn prediction model. When big data was applied to it, the improvement was negligible. And the reverse happened too: MTS could not predict when subscribers would decide to switch to a cheaper tariff (so it could call them in advance and offer a couple of bonuses). When they tried to solve that problem with big data, nothing came of it.
    There is no need to look for a miracle in Big Data technologies

    Evgeny Isupov objected to him:

    - When we added new data, or more specialized mathematics that allows advanced feature engineering and generates features a person would find hard to invent, we did see a significant improvement.

    And Leonid agreed:
    - Here is another example where adding new data plays a significant role. If we look at how our subscribers call, we only know that they are calling. Add a minimal bit of geoanalytics, the base station where the phone spends most nights and the base station where it sits during the day five days a week, and that's it: we know where you live and where you work.

    If we add modeling based on the calling profile, and we do this, we can reconstruct your entire household. We see that it has three MTS subscribers, one on Beeline and one on MegaFon. We have no geoanalytics for those two; we only know how they call into our network.

    This model has more than a thousand very subtle significant features that you couldn't come up with by hand. For example, how the density of communication between two people changes from 3 to 4 on a Friday, and from 4 to 5, and so on. We take all pairs of subscribers who call each other a lot, apply thousands of features and are able to split them into two groups: couples who live together and couples who don't.
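
    A toy sketch of the simplest of those features, inferring a "home" and a "work" base station from call records; the table, its column names and the night and working-hours cutoffs are hypothetical, and real operator pipelines are far richer:

```python
# Toy sketch: infer "home" and "work" base stations from call records.
# The events table, its columns and the hour cutoffs are invented.
import pandas as pd

events = pd.DataFrame({
    "subscriber": ["a", "a", "a", "a", "b", "b"],
    "cell_id":    [101, 101, 205, 205, 330, 412],
    "timestamp":  pd.to_datetime([
        "2018-06-01 23:40", "2018-06-02 01:10",   # a: nights on cell 101
        "2018-06-01 11:00", "2018-06-04 15:30",   # a: weekday daytime on 205
        "2018-06-02 02:00", "2018-06-04 12:00",
    ]),
})

hours = events["timestamp"].dt.hour
weekday = events["timestamp"].dt.weekday < 5

night = events[(hours >= 22) | (hours < 6)]            # where the phone sleeps
work = events[weekday & (hours >= 9) & (hours < 18)]    # weekday working hours

home_cell = night.groupby("subscriber")["cell_id"].agg(lambda s: s.mode()[0])
work_cell = work.groupby("subscriber")["cell_id"].agg(lambda s: s.mode()[0])

print(pd.DataFrame({"home_cell": home_cell, "work_cell": work_cell}))
```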

    Alisa Chumachenko steered the discussion in a pragmatic direction: in her view, tasks come first, not technologies. If it makes sense to do something with big data, and it is cheaper and more effective than the old methods, then it will be used. There is no need to do Big Data for Big Data's sake, yet for some reason many people try.
    Big data is pure hype right now, and it keeps showing up where it doesn't belong at all.

    When she asked whether anyone had heard of DeepMind, I shot my hand up, thinking, "Lord, of course everyone has heard of them, they are damn well better known than the Pope." But only about five people raised their hands.

    Then Alisa talked about the AI victory in Go and added a fact that personally surprised me. It turns out the trained neural network has found a practical application: it is used to cool Google's servers. The AI figures out which cooling units to turn where and when, learns, rewards and punishes itself, and this process has already cut cooling costs by 40%.
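
    What she describes, an agent that tries actions, gets rewarded or punished, and gradually learns a policy, is reinforcement learning. A minimal tabular Q-learning loop on a made-up one-knob "cooling" problem gives the flavor; it has nothing to do with DeepMind's actual system:

```python
# Minimal tabular Q-learning on an invented "cooling" toy problem: pick a fan
# level for the current temperature and get rewarded for keeping temperature
# in range while spending little energy. Illustrative only.
import random

STATES = range(5)      # discretized temperature: 0 (cold) .. 4 (hot)
ACTIONS = range(3)     # fan level: 0 (off), 1 (low), 2 (high)
q = {(s, a): 0.0 for s in STATES for a in ACTIONS}

def step(state, action):
    """Toy dynamics: heat drifts up, a higher fan cools more but costs energy."""
    next_state = max(0, min(4, state + 1 - action))
    reward = (1.0 if next_state == 2 else -1.0) - 0.2 * action
    return next_state, reward

alpha, gamma, eps = 0.1, 0.9, 0.1
state = 4
for _ in range(20_000):
    action = (random.choice(ACTIONS) if random.random() < eps
              else max(ACTIONS, key=lambda a: q[(state, a)]))
    next_state, reward = step(state, action)
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
    state = next_state

# The learned policy: which fan level to pick in each temperature state.
print({s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in STATES})
```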

    Alisa herself, since she works with games, dreams of a system that knows everything about her playing habits. She recalled the first time she launched League of Legends, when the game gave her 30 seconds to choose one of a couple of hundred characters.

    - If the game knew that I always play support, it would highlight the heroes that suit me and advise me not to touch the rest. If the game knew what I love, it would have converted me into a user and I would be bringing money into it.



    The most striking monologue about the future of big data again came from Leonid:

    - MTS is like a 50-year-old man. Everything is behind him. Ahead lies either a miserable old age or Vagankovsky cemetery altogether. Classical telecom is finished. We understand this, and as a business we are looking for a new body to move our soul into, a new business, because in this one we are done for.

    Big data could be that body. We have three strategies:
    - Accumulate absolutely all customer data, even if we don't yet understand how to use it. Storage technologies are cheap enough to keep everything.
    - Give data scientists access to the data and let them try to build something out of it.
    - Build a new business on top of what we know about people, based on getting into their heads, their souls, their desires. Maximum personalization: to know everything about you, as if we were watching and listening to you without actually doing it.

    And the last mile of this business is already built: catch a person on the Internet and show them advertising. What remains is to build the first mile, to get deep inside and find out what this person wants to see, so that every second person buys.

    Leonid believes the future of data can go one of two ways. Either data becomes the property of the people, who can then sell information about themselves and decide which company gets to see what. Or data becomes the full property of the state.
    It will know absolutely everything about us. But at least life will become safer.

    Everyone agreed that, one way or another, data will be heavily regulated.
    - Anyone who has been dealing with the GDPR over the past six months understands that access to private data will be regulated very tightly. On the other hand, there is China, looking at which you realize that maybe it won't be. Russia will most likely follow the Chinese path. Either way, the huge companies that store this data (a sly glance at Leonid) will not have an easy time.

    Roman believes that the anxiety is born of ignorance and misunderstanding:

    - We are in a state of technopanic. Everyone is afraid that someone will learn something about them, and nobody likes it. There are, for example, 15 technological and business reasons why it is unprofitable for Facebook to eavesdrop on people. But people believed it does, and now they perceive the service differently.
    The data collection process should be transparent so that people are not afraid.

    As with all global questions, the contradictions show up in the details: where to draw the line between privacy and comfort, and where, to whom and in which cases personal information should be disclosed.

    As Evgeny put it, when information like "what you did last night" can be used against you, to mock you or to do real harm, you don't want to share it. But if that information can, say, improve your health or your sleep, then you might give it away.

    Roman believes it is the small companies you should be afraid of:

    - For a big company, allowing a leak costs more than selling my data would earn. It is the small companies, the ones trying to monetize data by any means, that should worry you. We buy data from 40 sources, and some of those are companies to which people never handed over any information about themselves. When you might close down tomorrow, you don't carry much responsibility to society or to people.

    Alisa, on the contrary, believes in a bright future:

    - I want to imagine a world where you don't even open a bank account: a card is simply sent to you once. Today everything we do is becoming public. But I don't believe in the extreme scenarios, so I want AI to appear sooner, the kind that shows and offers us everything that is relevant.

    And Leonid summed up:
    - If you want to take it really seriously, better just switch off your phone.

    Instead of conclusions


    In conversations about Big Brother I always remember one story. When Orwell finished "1984", he sent a copy to his former schoolteacher, Aldous Huxley. Huxley replied with a letter: he praised the book but disagreed with its idea. He believed that "encouraging infantilism and drug-induced hypnosis are much better suited to obtaining power than prisons and batons."

    Of course, warning that "Big Brother is watching you" is much more effective, and being afraid of him is much more fun. But, dear Sistema_VC, I think Big Data, Meet Brave New World would have been a better name.
