visirok July 1, 2019 at 23:20

About errors that appear out of nowhere and in which there is no one to blame: The phenomenon of the smearing of responsibility

This article is not about irresponsible employees, as the title might suggest. It is about a real technical danger that may await you if you build distributed systems.

One enterprise system contained a component that collected data about a certain product from users and stored it in a database. It consisted of three standard parts: the user interface, the business logic on the server, and the tables in the database.

The component worked well, and for several years no one touched its code.

But one day, for no apparent reason, strange things started to happen to the component.

For some users, the component suddenly started throwing errors in the middle of a session. It happened infrequently but, as usual, at the most inopportune moment. Most puzzling of all, the first errors appeared in a stable version of the system in production, a version in which no component had been changed for several months.

We began to analyze the situation. We checked the component under heavy load: it worked fine. We repeated a fairly extensive set of integration tests, and the component passed them as well.

In short, it was unclear when the error had crept in and where it came from.

We began to dig deeper. A detailed analysis and comparison of the log files showed that the error messages shown to the user were caused by a primary key constraint violation in the database table mentioned above.

The component wrote data to the table using Hibernate, and sometimes Hibernate, when trying to write the next row, reported a constraint violation.

I will not bore readers with further technical details and will go straight to the essence of the error. It turned out that our component was not the only one writing to this table: occasionally (extremely rarely) another component wrote to it as well. And it did so in the simplest way, with a plain SQL INSERT statement. Hibernate, as configured here, writes as follows. To optimize writing, it asks the database sequence for the next primary key value once and then performs several writes simply by incrementing that value locally (10 writes per request in our configuration, set by allocationSize). If the second component wedged itself in after that request and wrote a row using the next primary key value, the subsequent write attempt from Hibernate ended in a constraint violation.
If you are interested in the technical details, see below.

Technical details

The class code started like this:

@Entity
@Table(name="PRODUCT_XXX")
public class ProductXXX {
    // Primary key, filled from the database sequence SEQ_PROD_ID.
    @Id
    @Basic(optional=false)
    @Column(
        name="PROD_ID",
        columnDefinition="integer not null",
        insertable=true,
        updatable=false)
    // allocationSize=10: one sequence request covers the next 10 inserts.
    @SequenceGenerator(
        name="GEN_PROD_ID",
        sequenceName="SEQ_PROD_ID",
        allocationSize=10)
    @GeneratedValue(
        strategy=GenerationType.SEQUENCE,
        generator="GEN_PROD_ID")
    private long prodId;

    // ... the rest of the class is omitted
}

One discussion of a similar issue on Stack Overflow:
https://stackoverflow.com/questions/12745751/hibernate-sequencegenerator-and-allocationsize
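
The failure described above can be reproduced with a toy model in plain Java. This is only a sketch under loud assumptions: the in-memory `ProductTable`, the `CachingWriter` and the `DirectWriter` are hypothetical stand-ins, "ask for the next key" is modeled as "largest key plus one", and none of this is Hibernate's real allocation algorithm. It only mirrors the sequence of events from the article: one writer asks for a key once and then counts up locally, the other asks for a fresh key on every insert.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy in-memory table that enforces a primary key constraint (assumption). */
class ProductTable {
    private final Set<Long> keys = new HashSet<>();
    private long maxKey = 0;

    /** Stand-in for asking the database which key comes next. */
    long nextFreeKey() {
        return maxKey + 1;
    }

    /** Inserts a row and rejects a key that is already present. */
    void insert(long key) {
        if (!keys.add(key)) {
            throw new IllegalStateException("constraint violation on key " + key);
        }
        maxKey = Math.max(maxKey, key);
    }
}

/** Asks for a key once, then counts up locally for allocationSize writes. */
class CachingWriter {
    private final ProductTable table;
    private final int allocationSize;
    private long nextKey;
    private int remaining = 0;

    CachingWriter(ProductTable table, int allocationSize) {
        this.table = table;
        this.allocationSize = allocationSize;
    }

    void write() {
        if (remaining == 0) {
            // Only here does the writer look at the table's current state.
            nextKey = table.nextFreeKey();
            remaining = allocationSize;
        }
        table.insert(nextKey++);
        remaining--;
    }
}

/** Asks for a fresh key on every insert, like a plain SQL INSERT. */
class DirectWriter {
    private final ProductTable table;

    DirectWriter(ProductTable table) {
        this.table = table;
    }

    void write() {
        table.insert(table.nextFreeKey());
    }
}

class KeyCollisionDemo {
    /** Runs one schedule and reports whether it hit a constraint violation. */
    static boolean collides(boolean interleaved) {
        ProductTable table = new ProductTable();
        CachingWriter hibernateLike = new CachingWriter(table, 10);
        DirectWriter other = new DirectWriter(table);
        try {
            if (interleaved) {
                hibernateLike.write(); // asks for a key, gets 1, writes 1
                other.write();         // wedges in and writes 2
                hibernateLike.write(); // still expects 2 to be free
            } else {
                other.write();         // writes 1 before the block is requested
                hibernateLike.write(); // asks for a key, gets 2, writes 2
                hibernateLike.write(); // writes 3
            }
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("separate in time, violation: " + collides(false));
        System.out.println("interleaved, violation: " + collides(true));
    }
}
```

In this model the schedule in which the two writers do not overlap finishes cleanly, while the interleaved one ends in a violation, which is why neither load tests nor integration tests with separate time windows could catch the problem.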

It so happened that for many months after the second component had been changed to write to the table, the write operations of the first and second components never overlapped in time. They began to overlap when the work schedule of one of the departments using the system changed slightly.

The integration tests passed as well, because the time windows in which the two components were exercised within those tests did not overlap either.

In a way, we can say that no one was really to blame for the error.

Or is it not so?

Observations and Thoughts

Once the true cause of the error had been found, it was fixed.

But I would not like to end this article on that happy note. Instead, I would like to reflect on this error as a representative of a vast category of errors that have become common since the move from monolithic to distributed systems.

From the point of view of the individual components or services in the described enterprise system, everything seems to have been done right. All components, or services, had independent life cycles. When the need arose to write to the table from the second component, the operation seemed so minor that a pragmatic decision was made: implement it directly in that component in the simplest way and leave the stable, working first component untouched.

But alas, what happened is something that happens often in distributed systems (and less often in monolithic ones): responsibility for operations on a particular object became smeared across subsystems. If both write operations had been implemented in the same microservice, a single technology would most likely have been chosen for them, and the described error would not have occurred.
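
As an illustration of what "both writes in one place" could look like, here is a minimal sketch in plain Java. The `ProductStore` class and its in-memory map are hypothetical and stand in for a real service or access library in front of the table. The point is only that key allocation and the write happen in one step owned by a single piece of code, so no caller ever picks a key on its own.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical single access point for the product table. */
final class ProductStore {
    private final Map<Long, String> rows = new HashMap<>();
    private long lastKey = 0;

    /** Allocates the key and stores the row in one synchronized step. */
    synchronized long save(String payload) {
        long key = ++lastKey;
        rows.put(key, payload);
        return key;
    }

    synchronized int rowCount() {
        return rows.size();
    }
}

class SharedStoreDemo {
    /** Two "components" write concurrently through the same store. */
    static int writeFromTwoComponents(int rowsPerComponent) {
        ProductStore store = new ProductStore();
        Runnable component = () -> {
            for (int i = 0; i < rowsPerComponent; i++) {
                store.save("row " + i);
            }
        };
        Thread first = new Thread(component);
        Thread second = new Thread(component);
        first.start();
        second.start();
        try {
            first.join();
            second.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for writers", e);
        }
        // Every save got its own key, so no row overwrote another.
        return store.rowCount();
    }

    public static void main(String[] args) {
        System.out.println("rows written: " + writeFromTwoComponents(1000));
    }
}
```

Because every write goes through `save`, the number of stored rows always equals the number of writes, no matter how the two callers interleave.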

Distributed systems, and the concept of microservices in particular, have helped solve a number of problems inherent in monolithic systems. Paradoxically, however, dividing responsibilities among individual services provokes the opposite effect. Components now "live" as independently as possible. When making a big change to one component, there is an inevitable temptation to "bolt on right here" a small piece of functionality that would be better implemented in another component. This gets the result quickly and reduces the amount of coordination and testing. So, from change to change, components become overgrown with features that do not belong to them, the same internal algorithms and functions are duplicated, and multiple variants of solving the same problem arise (sometimes with non-deterministic results). In other words, a distributed system degrades over time.

The "smearing" of responsibility across components in large systems consisting of many services is one of the typical and painful problems of modern distributed systems. The situation is further complicated by shared optimization subsystems such as caching and prediction of upcoming operations, as well as by service orchestration and the like.

Centralizing access to the database, at least at the level of a single library, is a fairly obvious requirement. However, many modern distributed systems have historically grown around databases and use the data stored in them directly (via SQL) rather than through access services.

ORM frameworks and libraries such as Hibernate also "help" the smearing of responsibility. When using them, many developers of database access services are tempted to return objects that are as rich as possible in response to a request. A typical example is a request for user data needed only to display a greeting or the authentication result. Instead of returning the user name as three text values (first_name, mid_name, last_name), such a request often returns a full-fledged user object with dozens of attributes and linked objects, such as the list of the user's roles. This in turn complicates the logic of processing the result of the request.

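The difference between a rich user object and a narrow result for a greeting can be sketched as follows. All class names here (`User`, `UserName`, `UserDirectory`) and the sample data are hypothetical, and plain Java objects stand in for ORM entities. The access layer hands out only the three name values, so the caller cannot start depending on roles or other attributes.

```java
import java.util.Arrays;
import java.util.List;

/** Full domain object, roughly what an ORM would load (hypothetical fields). */
final class User {
    final String firstName;
    final String midName;
    final String lastName;
    final String email;
    final List<String> roles;

    User(String firstName, String midName, String lastName,
         String email, List<String> roles) {
        this.firstName = firstName;
        this.midName = midName;
        this.lastName = lastName;
        this.email = email;
        this.roles = roles;
    }
}

/** Narrow result that carries only what a greeting needs. */
final class UserName {
    final String firstName;
    final String midName;
    final String lastName;

    UserName(String firstName, String midName, String lastName) {
        this.firstName = firstName;
        this.midName = midName;
        this.lastName = lastName;
    }

    /** Joins the non-empty name parts with single spaces. */
    String display() {
        StringBuilder sb = new StringBuilder();
        for (String part : new String[] {firstName, midName, lastName}) {
            if (part != null && !part.isEmpty()) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(part);
            }
        }
        return sb.toString();
    }
}

/** Access layer: exposes the narrow view instead of the whole object. */
class UserDirectory {
    static UserName nameOf(User user) {
        return new UserName(user.firstName, user.midName, user.lastName);
    }
}

class GreetingDemo {
    public static void main(String[] args) {
        User user = new User("Ada", "", "Lovelace", "ada@example.org",
                Arrays.asList("admin", "editor"));
        System.out.println("Hello, " + UserDirectory.nameOf(user).display());
    }
}
```

The greeting code receives a `UserName` and nothing else, which keeps the processing of the result trivial and leaves the access layer free to change how users and roles are stored.
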
What is there to do? (Recommendations)

Alas, in certain cases the smearing of responsibility is forced, and sometimes it is even inevitable and justified.

Nevertheless, where possible, you should try to follow the principle of dividing responsibility among components: one component, one responsibility.

If operations on certain objects cannot be concentrated strictly in one system, such smearing must be recorded very carefully in the system-wide ("supercomponent") documentation as a specific dependency of the components on a data element, on a domain object, or on each other.
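
One possible way to keep such a record from going stale is to make it machine-readable. The sketch below is an assumption about how that might look, not something the described system did: a small registry in which every component declares the tables it writes to, plus a check that lists the tables with more than one writer. The component names and the `AUDIT_LOG` table are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical registry of which component writes to which table. */
class WriteOwnershipRegistry {
    private final Map<String, Set<String>> writersByTable = new TreeMap<>();

    /** Records that the given component writes to the given table. */
    void declareWriter(String table, String component) {
        writersByTable.computeIfAbsent(table, t -> new TreeSet<>()).add(component);
    }

    /** Components that have declared writes to the table. */
    Set<String> writersOf(String table) {
        return writersByTable.getOrDefault(table, new TreeSet<>());
    }

    /** Tables with more than one declared writer, in alphabetical order. */
    List<String> sharedTables() {
        List<String> shared = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : writersByTable.entrySet()) {
            if (entry.getValue().size() > 1) {
                shared.add(entry.getKey());
            }
        }
        return shared;
    }
}

class OwnershipDemo {
    /** Builds a registry with invented component names. */
    static WriteOwnershipRegistry example() {
        WriteOwnershipRegistry registry = new WriteOwnershipRegistry();
        registry.declareWriter("PRODUCT_XXX", "product-entry");
        registry.declareWriter("PRODUCT_XXX", "batch-import");
        registry.declareWriter("AUDIT_LOG", "audit");
        return registry;
    }

    public static void main(String[] args) {
        WriteOwnershipRegistry registry = example();
        for (String table : registry.sharedTables()) {
            System.out.println(table + " is written by " + registry.writersOf(table));
        }
    }
}
```

A list like this does not prevent shared writes, but it makes them visible in one place, so a change to one writer prompts a look at the others.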

It would be interesting to hear your opinion on this matter, as well as cases from practice that confirm or refute the theses of this article.

Thank you for reading the article to the end.

Illustration "Multimedia Mikher" by the author of the article.

Tags:

system architecture
