
Data is still more important
Here is a quote from Linus Torvalds, from 2006:
I'm a huge proponent of designing your code around the data, rather than the other way around, and I think it's one of the reasons git has been fairly successful… I will, in fact, claim that the difference between a bad programmer and a good one is whether he considers his code or his data structures more important. Bad programmers worry about the code. Good programmers worry about data structures and their relationships.
Which is very similar to Eric Raymond's "Rule of Representation" from 2003:
Fold knowledge into data, so program logic can be stupid and robust.
Which in turn is just a summary of Rob Pike's idea from 1989:
Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.
He, in turn, is quoting Fred Brooks from 1975:
Representation is the essence of programming. Beyond craftsmanship lies invention, and it is here that lean, spare, fast programs are born. Almost always these are the result of strategic breakthrough rather than tactical cleverness. Sometimes the strategic breakthrough will be a new algorithm, such as the Cooley-Tukey Fast Fourier Transform or the substitution of an n log n sort for an n² set of comparisons. Much more often, strategic breakthrough will come from redoing the representation of the data or tables. This is where the heart of a program lies. Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowcharts; they'll be obvious.
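To make "folding knowledge into data" a little more concrete, here is a toy sketch of my own (it is not from any of the authors quoted above, and the country codes and rates are invented): the same shipping rules written once as branching code and once as a plain data table that dumb code walks over.

```python
# Knowledge spread through code: every new rule means another branch.
def shipping_cost_branchy(country: str, weight_kg: float) -> float:
    if country == "AU":
        return 7.0 if weight_kg <= 1.0 else 12.0
    elif country == "NZ":
        return 9.0 if weight_kg <= 1.0 else 15.0
    raise ValueError(f"no rule for {country}")

# Knowledge folded into data: the logic becomes one stupid, robust lookup loop.
# Each row is (country, max weight in kg, cost); the rates are made-up examples.
SHIPPING_RATES = [
    ("AU", 1.0, 7.0),
    ("AU", float("inf"), 12.0),
    ("NZ", 1.0, 9.0),
    ("NZ", float("inf"), 15.0),
]

def shipping_cost(country: str, weight_kg: float) -> float:
    for rate_country, max_weight, cost in SHIPPING_RATES:
        if country == rate_country and weight_kg <= max_weight:
            return cost
    raise ValueError(f"no rule for {country}")
```

Adding a new country or weight band now means editing a table rather than control flow, and that table could just as easily live in a config file or a database.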
So for almost half a century, smart people have been saying it again and again: focus on the data first. Yet it sometimes seems to be the smartest advice that everyone keeps forgetting.
I will give some real examples.
Highly scalable system that failed
This system was designed from the start for incredible scalability under heavy CPU load. Nothing was synchronous. Everything ran through callbacks, task queues, and worker pools.
But there were two problems. The first was that the "CPU load" turned out not to be so heavy: a single task took a few milliseconds at most, so most of the architecture did more harm than good. The second was that the "highly scalable distributed system" in practice ran on only one machine. Why? Because all communication between the asynchronous components went through files on the local filesystem, which was now the bottleneck to any scaling. The original design said almost nothing about data, except to argue for local files in the name of "simplicity". Most of the design effort went into all the extra architecture that was "obviously" needed to cope with the "heavy" CPU load.
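Stripped down to a sketch (the names, paths, and payloads here are my own invention, not the real system's code), the communication pattern looked roughly like this: asynchronous components handing work to each other by dropping JSON files into a local spool directory.

```python
import json
import os
import uuid

# Hypothetical spool directory. Every component quietly assumes it lives on the
# SAME machine, which pins the whole "distributed" system to a single node.
SPOOL_DIR = "pipeline-spool"
os.makedirs(SPOOL_DIR, exist_ok=True)

def enqueue_task(payload: dict) -> str:
    """Producer side: hand work to the next component by writing a local file."""
    task_id = uuid.uuid4().hex
    with open(os.path.join(SPOOL_DIR, f"{task_id}.json"), "w") as f:
        json.dump(payload, f)
    return task_id

def drain_tasks() -> list[dict]:
    """Worker side: poll the directory and pick up whatever files are waiting."""
    tasks = []
    for name in os.listdir(SPOOL_DIR):
        path = os.path.join(SPOOL_DIR, name)
        with open(path) as f:
            tasks.append(json.load(f))
        os.remove(path)  # "ack" by deleting the file
    return tasks

enqueue_task({"job": "resize-image", "input": "cat.png"})
print(drain_tasks())
```

All the callbacks and worker pools in the world won't move this off a single box until the data itself lives somewhere every node can reach.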
The service-oriented architecture that was still data-oriented
This system followed a microservices design: single-purpose applications talking over REST APIs. One component was a database that stored documents (mostly responses to standard forms and other electronic paperwork). Naturally, it exposed an API for storing and retrieving documents, but fairly quickly the need arose for more complex search functionality. The developers felt that adding search to the existing document API would go against the principles of microservice design: since "search" is fundamentally different from "get/put", the architecture shouldn't combine them. Besides, they planned to use a third-party tool for the index, so a separate "search" service made sense for that reason too.
As a result, a search API and a search index were created, and the index was essentially a duplicate of the data in the main database. That data was updated dynamically, so any component that changed document data through the main database API also had to send a request to update the index through the search API. With two separate REST APIs there is no way to do this without a race condition, so the two datasets would drift out of sync from time to time.
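The race is easy to see with a toy model of the two services (the names are made up, and two dicts stand in for the two real data stores): every writer has to make two separate API calls, and nothing orders those calls across writers.

```python
# Two separate "services", each with its own store. (Names invented for illustration.)
document_db: dict[str, str] = {}   # what the document service persists
search_index: dict[str, str] = {}  # what the search service persists

def put_document(doc_id: str, body: str) -> None:
    document_db[doc_id] = body      # call 1: the document API

def put_index_entry(doc_id: str, body: str) -> None:
    search_index[doc_id] = body     # call 2: the search API

# Writers A and B both update the same document. Each makes its two API calls,
# but nothing orders the calls across writers:
put_document("doc-1", "version A")     # A: call 1
put_document("doc-1", "version B")     # B: call 1
put_index_entry("doc-1", "version B")  # B: call 2
put_index_entry("doc-1", "version A")  # A: call 2 arrives last

print(document_db["doc-1"])   # -> "version B"
print(search_index["doc-1"])  # -> "version A" (out of sync, with no error anywhere)
```

Neither call fails, yet the index ends up describing a different version of the document than the database holds, which is exactly the drift described above.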
Despite what the architecture promised, the two APIs were tightly coupled through their data dependencies. Later the developers accepted that the search index should be merged into the common document service, and that made the system much more maintainable. "Do one thing" works at the data level, not at the verb level.
The fantastically modular and configurable ball of mud
This system was a kind of automated deployment pipeline. The original development team wanted a tool flexible enough to solve deployment problems across the whole company. They wrote a set of pluggable components, along with a configuration-file system that not only configured the components but also acted as a domain-specific language (DSL) for programming how the components fit into the pipeline.
Fast forward a few years, and the tool had turned into "that program". There was a long list of known bugs that nobody ever fixed. Nobody wanted to touch the code for fear of breaking something. Nobody used the flexibility of the DSL. Everyone copied and pasted the same known-working configuration that everyone else used.
What went wrong? Although the original design document used words like "modular", "decoupled", "extensible", and "configurable" a lot, it said nothing at all about data. So data dependencies between components ended up being handled in an ad hoc way through a globally shared JSON blob. Over time the components made more and more undocumented assumptions about what was or wasn't in that blob. Sure, the DSL allowed the components to be rearranged in any order, but most configurations simply didn't work.
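Here is a minimal sketch of how that kind of shared blob breeds hidden dependencies (the component names and keys are invented for illustration): each plugin reads whatever keys it expects and writes whatever it likes, so the real contract between components exists only in their code.

```python
# Two plug-in components that communicate through one shared, untyped blob.
# (All names here are invented; this is not the real system's code.)

def build_step(ctx: dict) -> None:
    # Quietly assumes some earlier step already put "source_dir" into the blob.
    ctx["artifact"] = f"artifact built from {ctx['source_dir']}"

def upload_step(ctx: dict) -> None:
    # Quietly assumes build_step has already run and set "artifact",
    # and that *someone* set "region" (nobody documents who).
    print("uploading", ctx["artifact"], "to", ctx.get("region", "default-region"))

# The DSL happily lets users arrange the steps in any order they like...
pipeline = [upload_step, build_step]

ctx: dict = {"source_dir": "src/"}  # the globally shared JSON blob
try:
    for step in pipeline:
        step(ctx)
except KeyError as missing:
    # ...but most orderings just blow up on an undocumented data dependency.
    print(f"pipeline failed: nothing set {missing} before it was needed")
```

None of these assumptions appear in the configuration format or the component interfaces; they only surface when a particular ordering blows up at runtime.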
The lessons
I chose these three projects because they make the broader point easy to explain, not to pick on others. Once I tried to build a website and instead got stuck fiddling with some kind of XML database that didn't even solve my data problems. There was another project that turned into a broken imitation of half the functionality of make, again because I hadn't thought about what I really needed. I've already written about the time I spent building an endless hierarchy of OOP classes that should have been encoded as data.
Update:
Apparently many people still think I'm trying to make fun of someone. My actual colleagues know that I'm far more interested in fixing real problems than in blaming the people who created them, but okay, here is what I think of the developers involved in these projects.
Honestly, the first situation clearly happened because the system architect was more interested in applying research than in solving the actual problem. Many of us can be blamed for that (me included), but it really annoys our colleagues, because they're the ones who have to help with support once we get bored of our toy. If you recognise yourself here, please don't take offence; just stop (though I'd still rather work with a distributed system on a single node than with any system built on my "XML database").
There's nothing personal in the second example. It sometimes feels like everyone talks about how great it is to split things into services, but nobody talks about when it's better not to. People keep learning that the hard way.
The third story actually happened to some of the smartest people I've ever worked with.
(End of update).
The question "what does it say about the problems the data creates?" turns out to be a pretty useful litmus test for good system design. It's also very handy for spotting false "experts" and their advice. The hard, messy problems in the architecture of complex systems are data problems, so false experts love to ignore them. They'll show you an amazingly beautiful architecture, but say nothing about what kind of data it's suitable for and (importantly) what kind of data it isn't.
For example, a false expert might tell you to use a pub/sub system because pub/sub systems are loosely coupled, and loosely coupled components are more maintainable. That sounds nice and produces pretty diagrams, but it's backwards thinking. Pub/sub doesn't make your components loosely coupled; pub/sub itself is loosely coupled, which may or may not match your data's needs.
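A small sketch of that point (the queue, message fields, and subscriber are all made up): even with a perfectly decoupled transport, the consumer is still coupled to the shape of the data the producer publishes.

```python
import json
import queue

bus: "queue.Queue[bytes]" = queue.Queue()   # stand-in for any pub/sub transport

def publish_order(order_id: str, total_cents: int) -> None:
    # The producer decides the message shape.
    bus.put(json.dumps({"order_id": order_id, "total_cents": total_cents}).encode())

def billing_subscriber() -> None:
    msg = json.loads(bus.get())
    # The transport never coupled us to the producer, but this line does:
    # rename "total_cents" upstream and billing silently breaks.
    charge = msg["total_cents"] / 100
    print(f"charging ${charge:.2f} for order {msg['order_id']}")

publish_order("A-1001", 2599)
billing_subscriber()
```

Whether that coupling is acceptable depends on your data's needs, not on the fact that the messages travel over a topic.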
A well-designed data-oriented architecture, on the other hand, goes a long way. Functional programming, service meshes, RPC, design patterns, event loops, whatever: they all have their merits, but personally I've seen far more successful production systems built on boring old DBMSes.