Basych December 25, 2014 at 12:43

How We Built The Trouble Tree

From the sandbox

Hello. For a relatively long time I have been working in a company creating corporate information systems. In this article I want to share some partially negative experience, maybe someone else will be interested in someone else's "rake".
One of the interesting tasks for the team of our design engineers was to build a single “fault tree” for a large corporate information system for monitoring equipment.

Formulation of the problem

The information system in which we built this classifier centralizes information about accidents on a wide variety of equipment, and also collects data on various problem situations from completely heterogeneous systems, databases and devices. It is clear that the primary alarms in such an architecture will be completely diverse. For example, 3 different external external systems sent us a “break” fault, but in one case it was a break in the carrier cable in the suspension of the contact network, in the other there was a break in the power supply, and in the third, the connection between the subscribers was lost. This situation did not suit us, since a clear classification was required from us, for subsequent use in reporting and analytical tasks.

When finding new types of incoming events, our accident handler simply added them to the directory and by the time the systematization of only message types began, more than 1000 had accumulated.

We set ourselves the following goals:

remove synonyms - combine records that have exactly the same meaning, but different spelling;
to develop a unified hierarchical classification structure in which we could “lay down” faults of any nature;
to simplify the further development (detail) of this classification by defining clear and consistent principles for its construction.

We strove to ensure that the characteristic features of our classification are:

general applicability, the possibility of use in our other projects and products;
simplicity and intuitive understandability, which makes it easy to develop this structure and find the “right place” for new malfunctions;
persuasiveness, the ability to prove this to the potential and potential customers of automation — the fact is that our system covered the automation of several large organizational units at once, each of which adopted its own approach to this issue.

Progress and our mistakes

Soon after the start of work, it became clear that the development of the classification principle is a key task of the whole topic. It was not possible to take any of the classifications coming from external systems as the basis for the following reasons:
- narrowness of the general direction of the estimates due to the specificity of tasks solved by specific systems,
- lack of a hierarchy of problems in many cases (flat fault lists not built into tree structures),
- confusion wording, mixing in one position the causes and consequences.

We created the first version on the basis of a grouping by infrastructure facilities on which these manifestations of malfunctions arose. In fact, this was the easiest way, since it assumed a simple combination of separate “alien” fault lists based on a single (our) infrastructure model.

In general, it turned out something like this:

.... there were about 1,600 lines, of which about 600 could not be attached to specific objects. At the same time, not all problems had a clear object binding and not all the objects mentioned were introduced into our resource base. Although this approach unraveled the situation a little, it did not allow us to introduce a common hierarchy, identify synonyms and reduce the total number, which was one of our goals.

In the future, the “applicability” of the malfunction to objects remained in our system, but this became a directory separate from the general hierarchy of malfunctions.

Result

So, at some point it became clear that we would not be able to create a single structure either on the basis of previously deployed information bases and systems, or on the basis of regulatory documents adopted by the organization.

As a result, we developed the following principles of work:

to separate at the root of the created tree violations (deviations from the norms) and manifestations of natural processes classified by the natural sciences;
to preserve their classification for natural phenomena and processes from the point of view of modern science;
to separate the consequences from the primary manifestations, and refer the wording containing both the phenomenon and the consequences to the branch where the consequence is assigned (for example, “blackout” refers to power failures, and “data loss as a result of blackout” refers to disruptions in the operation of information systems);
All that cannot be classified so far can be allocated to the group “Other” and to establish systematic work on the “analysis” of this group on the basis of the classification principles adopted above
when determining the location of each new record in the general structure, be guided only by the principle: “A particular case of which a manifestation is already entered into the tree” is to search in this way for the place of this record in the tree starting from its root.
Add missing “generalizations” to the tree yourself (if we don’t have such an initial alarm message).

Acting like this, we got about the following set of branches for the first level of the tree:

What is the result?

Unfortunately, this work was not completed, and the result that we stopped at is extremely "raw".
I believe that the reasons for this failure are as follows:
- this work should have been organized and continued by the owner of the infrastructure, but there simply weren’t any specialists ready to take it upon themselves;
- specialists "on the ground" were quite satisfied with the names and classifications familiar to them, and our attempts to generalize and distinguish subgroups met their resistance;
- The implementation of global analytical reporting, for which this work was carried out, did not start.
In general, the customer was not ready for such changes, and we did not have enough administrative resources to affect its employees.

Of course, we can say that time was not wasted. That considerable experience has been accumulated in carrying out such work, which is partly formulated in the principles described above. For myself, I personally concluded that it is important to break up such projects into small stages, to constantly demonstrate an intermediate result to the customer and to provide active support for changes on his part.

Why did it happen this way? Why was the intermediate result that we obtained even in our opinion was far from perfect?
As it turned out during the implementation, users are basically ready to accept (and forgive us) any classification, but with one simple condition - Add a text search to the form!

Classification is a product of the systematization of experience. Obviously, each person, guided by a unique personal experience, sees it in his own way. For example, in the mail program, some (including myself) create a complex system for sorting incoming mail, while others do not sort mail at all, store everything in one folder and at the same time they are perfectly oriented there. And they find the letter I need faster. Maybe such people have Yandex in their heads?

In addition, any predefined classification can be 100% perfect only after it has been finalized taking into account the latest data received in the system. That is, the classification requires constant care, and the user does not need to work on the system, but use it. Search is indexing, and it always works with actual data. Is classification necessary then?

Tags:

How We Built The Trouble Tree

Formulation of the problem

Progress and our mistakes

Result

What is the result?

Also popular now: