
You are not google
- Transfer
Developers are crazy about the strangest things. We all prefer to consider ourselves super-rational beings, but when it comes to choosing a particular technology, we fall into a kind of madness, skipping from a comment on HackerNews to a blog post, and now, as if in oblivion, we are helpless we are sailing towards the brightest source of light and bow obediently to it, having completely forgotten what we were originally looking for.
This is not at all how rational people make decisions. But exactly so the developers decide to use, for example, MapReduce.
As Joe Hellerstein noted in his lecture on databases for undergraduate students (in the 54th minute):
The fact is that there are approximately 5 companies in the world that perform such ambitious tasks. As for everyone else ... they spend incredible resources to provide a fault-tolerant system that they really don't need. People had a kind of “google” in the 2000s: “we will do everything exactly like Google does, because we also manage the largest data processing service in the world ...” [ironically shakes his head and waits for laughter from the audience]
How many floors in the building of your data center? Google decided to stay at four, at least in this particular data center located in Mays County, Oklahoma.
Yes, your system is more resilient than you need, but think about what it might cost. The point is not only the need to process large amounts of data. You are likely exchanging a complete system — with transactions, indexes, and query optimization — for something relatively weak. This is a significant step back. How many Hadoop users do this consciously? How many of them make a really balanced decision?
MapReduce / Hadoop is a very simple example. Even the followers of the Cargo Cult already realized that the planes would not solve all their problems. Nevertheless, the use of MapReduce allows you to make an important generalization: if you use the technology created for a large corporation, but at the same time solve small problems, you may be acting thoughtlessly. Not even so, it is most likely that you are guided by mystical ideas that imitating giants like Google and Amazon, you will reach the same heights.
Yes, this article is another opponent of the cargo cult. But wait, I have a useful checklist for you, which you can use to make more informed decisions.
Cool framework: UNPHAT
The next time you google some new cool technique for (re) shaping your system, I urge you to stop and just use the UNPHAT framework :
- Don't even try to think about possible solutions before understanding the (Understand) problem. Your main goal is to “solve” the problem in terms of the problem, not in terms of solutions.
- List (eNumerate) several possible solutions. No need to immediately point your finger at your favorite option.
- Consider a separate solution, and then read the documentation (Paper) , if any.
- Define the Historical context in which this solution was created.
- Match Advantages with Flaws. Analyze what the decision makers had to sacrifice to achieve their goal.
- Think (Think) ! Soberly and calmly consider how well this solution is suitable to meet your needs. What exactly needs to change for you to change your mind? For example, how much less data should be, so that you prefer not to use Hadoop?
You are not amazon
Using UNPHAT is easy. Recall my recent conversation with a company that hastily decided to use Cassandra for an intensive process of reading data downloaded at night.
Since I was already familiar with the Dynamo documentation and knew that Cassandra is a derivative system, I understood that in these databases the main focus was on the ability to record (Amazon needed to make the action “add to cart” never did not fail). I also appreciated that the developers sacrificed data integrity - and indeed, every feature inherent in traditional RDBMS. But after all, the company with which I spoke, the ability to record was not a priority. Honestly, the project meant creating one big record a day.
Amazon sells a lot of everything. If the function "add to basket" suddenly stopped working, they would lose a LOT of money. Do you have a problem of the same order?
This company decided to use Cassandra because it took several minutes to complete the PostgreSQL query in question, and they decided that these were technical limitations on the part of their hardware. After clarifying a couple of points, we realized that the table consisted of approximately 50 million rows of 80 bytes each. It would take about 5 seconds to read it from the SSD if you had to go through it completely. This is slow, but it is still two orders of magnitude faster than the query execution speed at that time.
At this stage, I had a lot of questions (U = understand, understand the problem!) And I began to weigh about 5 different strategies that could solve the original problem (N = eNumerate, list a few possible solutions!), But in any case it was already clear to the moment that using Cassandra was fundamentally the wrong decision. All they needed was a little patience to set up, probably a new design for the database and, possibly (though unlikely), the choice of a different technology ... But definitely not a key-value data storage with intensive recording that Amazon created for their basket!
You are not LinkedIn
I was very surprised to find that one student startup decided to build its architecture around Kafka. It was amazing. As far as I could tell, their business carried out only a few dozen very large operations per day. Perhaps a few hundred on the most successful days. With this bandwidth, the main data warehouse could be handwritten entries in an ordinary book.
For comparison, recall that Kafka was created to handle all analytic events on LinkedIn. This is just an enormous amount of data. Even a couple of years ago, it was about 1 trillion events daily , with a peak load of 10 million messages per second. Of course, I understand that Kafka can be used to work with lower loads, but to 10 orders less?
The Sun, being a very massive object, and that is only 6 orders of magnitude heavier than the Earth.
Maybe the developers even made a deliberate decision, based on the expected needs and a good understanding of the purpose of Kafka. But I think that they were rather fueled by the (usually justified) community enthusiasm for Kafka and almost never wondered if this was really the tool they needed. Just imagine ... 10 orders!
I have already mentioned? You are not amazon
Even more popular than Amazon's distributed data warehouse, is the architectural design approach that provides them with scalability: a service-oriented architecture. As Werner Vogels noted in a 2006 interview with Jim Gray, Amazon realized in 2001 that they were having difficulty scaling the interface (front-end) and that service-oriented architecture could help them. This idea infected one developer after another, while startups, consisting of just a couple of developers and almost no clients, did not begin to split their software into nanoservices.
By the time Amazon decided to switch to SOA (Service-oriented architecture), they had about 7,800 employees and their sales exceeded $ 3 billion .
The Concert Hall Bill Graham Auditorium in San Francisco seats 7,000 people. Amazon had about 7,800 employees when they switched to SOA.
This does not mean that you should postpone the transition to SOA until your company reaches the level of 7800 employees ... just always think with your own head . Is this really the best solution for your task? What exactly is the task before you and are there any other ways to solve it?
If you tell me that the work of your organization, which consists of 50 developers, simply rises without an SOA, then I wondered why so many large companies simply work wonderfully using a single, but well-organized application.
Even Google is not Google.
Examples of using systems for processing highly loaded data streams (Hadoop or Spark) can really be bewildering. Very often, traditional DBMSs are better suited to the load, and sometimes the amount of data is so small that even the available memory would be enough for them . Did you know that you can buy 1TB of RAM somewhere for $ 10,000? Even if you had a billion users, you would still be able to provide each of them with 1 KB of RAM.
Perhaps this will not be enough for your load, because you will need to read and write to disk. But do you really need several thousand disks to read and write? Here is how much data you have in fact? GFS and MapReduce were created to solve computing problems across the Internet ... for example, to recalculate the search index throughout the Internet .
Prices for hard drives are now much lower than in 2003 when the GFS documentation was published.
Maybe you read the GFS and MapReduce documentation and noticed that one of the problems for Google was not the amount of data, but the bandwidth (processing speed): they used distributed storage because it took too much time to transfer bytes from the disks. But what will be the bandwidth of the devices that you will use this year? Given that you don’t even need as many devices as Google needed, would it be better to just buy more modern drives? How much will it cost to use an SSD?
Maybe you want to consider scalability in advance. Have you already done all the necessary calculations? Will you accumulate data faster than SSD prices go down? How many times will your business have to grow so that all available data no longer fits on one device? As of 2016, Stack Exchange was processing 200 million queries per day with support for only 4 SQL servers : the main one for Stack Overflow, one more for everything else, and two copies.
Again, you can resort to UNPHAT and still decide to use Hadoop or Spark. And the decision may even be right. The main thing is that you really use the right technology to solve your problem . By the way, this is well known at Google: when they decided that MapReduce was not suitable for indexing, they stopped using it.
First things first, understand the problem
Пусть мой посыл и не является чем-то новым, но, возможно, именно в таком виде он отзовется в вас или может быть, вам попросту будет легко запомнить UNPHAT и применять его в жизни. Если же нет, вы можете посмотреть речь Рича Хики на Hammock Driven Development, или книгу Поля «How to Solve it», или курс Хэмминга «The Art of Doing Science and Engineering». Потому что главное, о чем мы все просим — это думать!
И действительно понимать задачу, которую вы пытаетесь решить. Говоря вдохновляющими словами Поля:
«Глупо отвечать на вопрос, который вы не понимаете. Грустно стремиться к цели, которую вы не желаете достичь.»
Перевод на русский
Translation: Alexander Tregubov
Editing: Alexey Ivanov (@ponchiknews)
Community: @ponchiknews
Illustration: LucidChart Content Team