Difficulties in finding errors in scientific applications

    This is a continuation of the note on why unit tests do not work well in scientific applications [1]; in this article I want to talk about the difficulties of finding and debugging errors in scientific applications that I have encountered in my time, many of which were surprising to me as a web developer.


    The article consists of several sections:
    1. Introduction
    2. Difficulties in finding bugs
      • Parallelism
      • Nonlocality of errors
      • Non-obviousness of typos
      • Effect of the setup on the result
      • Identifying errors: error or not?

    3. Conclusion

    Introduction


    The main purpose of this article is to describe my own experience, in the hope that someone will find it interesting and unusual (especially, it seems to me, industrial programmers); perhaps it will help someone prepare better for writing a thesis, a term paper, or lab work. The preamble, the statement of the scientific problem, and a brief description of the algorithm can be found in the article mentioned above [1]. So I will turn straight to describing those difficulties in finding errors that are unusual (for web developers, for example), in order to broaden readers' horizons and minds in general.

    Difficulties in finding bugs


    Parallelism

    This is an exceptionally short point, included to complete the picture: searching for errors and debugging is much harder in parallel programs than in single-threaded ones, and most resource-intensive scientific programs are parallel.

    Nonlocality of errors

    By this I mean that an error in the code of one class can manifest itself in completely unexpected places. And the point is not poor application architecture or tight coupling between modules, but the nature of the problems and of the modeling itself. For example, if the velocity profile is distorted near the walls when simulating fluid flow in a pipe, this does not at all mean that the error is in the algorithm that handles the walls; it can be anywhere. Conversely, if the density is distributed strangely in the bulk of the liquid, this does not mean that the wall algorithm is innocent.

    Compare this with a typical business application scenario: if discounts on goods are calculated incorrectly while processing a purchase in an online store, the error is almost certainly hiding in the code that calculates discounts on goods.

    One might object that it is easy to determine the source of an error from the change history: as soon as the application stops working, look for the error in the recently modified code. However, this approach does not apply to parts of the program being added for the first time (for example, when adding heat transfer, new boundary conditions, and so on), because there is no previous working version of them to compare against.

    In addition, errors in scientific applications may not manifest themselves for a long time. For example, if, after a new type of boundary condition was added, the Poiseuille flow [2] driven by a body force rather than a pressure gradient suddenly stopped being modeled correctly, it may turn out that the culprit is not the new boundary condition algorithm but the logic that accounts for the external force; the error simply was not critical before (see also the paragraph "Slow rate of increase in errors" in [1]).

    Non-obviousness of typos

    One of the problems with scientific algorithms is that they are often non-obvious. Even if you build a program with a beautiful architecture, dedicate a separate class to the algorithm, and design and write it well, you probably cannot avoid a few problems.

    Firstly, meaningless variable names (because these are auxiliary variables from the original scientific article that carry no semantic load of their own). Secondly, non-obvious operations on class variables (because they, too, are taken from the original article, where they were obtained by dark-magic methods such as optimization of the standard deviation, the Fredholm alternative, calculation of spatial density harmonics, and so on).
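    A minimal, purely hypothetical sketch of what such article-driven code tends to look like (the names mirror a paper's notation rather than any meaning; the harmonic sum is my illustration, not code from the original project):

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // "a", "x", "L" and "s" mean nothing by themselves; only the article
    // they came from explains them.
    double correction(const std::vector<double>& a, double x, double L)
    {
        const double pi = 3.14159265358979323846;
        double s = 0.0;
        for (std::size_t k = 1; k < a.size(); ++k)
            s += a[k] * std::cos(2.0 * pi * k * x / L);  // sum of spatial density harmonics
        return s;
    }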

    If you are debugging a business application and come across a line like
    bool categoryIsVisible = categoryIsEnabled || productsCount > 0;
    you will immediately notice the typo, because the condition must use a logical "AND".

    But imagine that a line like this (from a real project) catches your eye
    double probability = latticeVectorWeight * density * (1.0 + 3.0 * dotProduct + 9.0 / 2.0 * dotProduct * dotProduct - 3.0 / 2.0 * velocitySquare);
    It is unlikely that you will be able to tell that a plus and a minus have been mixed up somewhere. And note that the variable names here are actually meaningful.
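    For context: this is the standard equilibrium distribution of the lattice Boltzmann method, f_i^eq = w_i * rho * (1 + 3 (c_i . u) + 9/2 (c_i . u)^2 - 3/2 u^2). Here is a sketch of the same line with every term commented against the formula (the function and parameter layout are my assumptions, not the original code); even so, a flipped sign causes no crash and no obviously broken picture:

    // Equilibrium distribution for one lattice direction (standard LBM formula).
    double equilibrium(double latticeVectorWeight,  // w_i: lattice weight, e.g. 4/9, 1/9, 1/36
                       double density,              // rho: local fluid density
                       double dotProduct,           // c_i . u: lattice vector dot velocity
                       double velocitySquare)       // u . u: velocity magnitude squared
    {
        return latticeVectorWeight * density
             * (1.0 + 3.0 * dotProduct
                    + 9.0 / 2.0 * dotProduct * dotProduct
                    - 3.0 / 2.0 * velocitySquare);  // flip this '-' and nothing crashes:
                                                    // the flow just becomes quietly wrong
    }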

    Effect of the setup on the result

    In this section I will try to explain that the behavior of scientific applications depends much more strongly (compared to business applications) on the input data, the parameters of the system, and its initial state, that is, on the setup of the system.

    The main sources of this dependence on the setup are as follows.

    1. System parameters have a very strong (qualitative) effect on the result of the program's work, unlike in business applications, where parameters usually affect it only quantitatively (for example, the operation of a CMS will not fundamentally depend on whether the administrator adds five or ten lines of text to a page).

    2. A smaller region of stability of the algorithms with respect to the input data. In business applications, the main constraint on the data is the absence of overflow errors (and who pays attention to that?!). In scientific algorithms (one of whose distinguishing features is working with sets of higher cardinality), one has to remember stability (and, behind it, the stiffness of differential equations, stability theory, Lyapunov exponents, etc.) and monitor it. Moreover, in business applications all restrictions are deterministic (say, a name entered during registration cannot be longer than 100 characters, an email must match a certain regular expression), whereas in scientific problems you often have to determine the working range of input data by trial and error.

    3. Everything else (which I find hard to formalize so far). In particular, the conversion of quantities from physical units into the units used by the program.

    To illustrate these points, let me show a checklist that I compiled for myself after weeks of futile debugging of an application for modeling hydrodynamics. If I could not find an error within a few hours or days of step-by-step execution, I went through this checklist.

    Attention! It is somewhat far from the subject matter of Habr and the interests of most readers, so feel free to skip it and go to the next section.
    So, the checklist:
    1. Check incompressibility
    2. Check Reynolds numbers
    3. Check translation

    The first point means that the algorithm works only for a weakly compressible fluid, which is equivalent to low velocities (much less than the speed of sound in the fluid), because the flow is induced by a density gradient. The first time I forgot about this restriction, I spent several days searching for errors in the code, because outwardly the program worked almost correctly.
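    This check, at least, is easy to automate. A minimal sketch, assuming standard lattice units where the speed of sound is 1/sqrt(3); the 0.1 threshold is a common rule of thumb, not a value from the original project:

    #include <cmath>
    #include <cstdio>

    // Warn when the flow leaves the weakly compressible regime.
    void checkIncompressibility(double maxVelocity)  // in lattice units
    {
        const double soundSpeed = 1.0 / std::sqrt(3.0);  // c_s in standard lattice units
        const double mach = maxVelocity / soundSpeed;
        if (mach > 0.1)  // rule of thumb for "much less than the speed of sound"
            std::printf("WARNING: Mach = %.3f; results may look plausible but be wrong\n",
                        mach);
    }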

    The second item is equivalent to checking the stability domain of the algorithm. The point is that the Reynolds number determines how turbulent and unstable the fluid motion is [3]: the larger it is, the more unstable the flow; the smaller it is, the more "viscous" the flow. It turns out that even if the motion is never physically turbulent (again, in a Poiseuille flow), the calculations begin to diverge at sufficiently large Reynolds numbers. Of course, until I stepped on this rake (and spent a week raking it up), I did not think about tracking the stability region.
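    Computing the number itself is cheap; the hard part is the admissible limit, which is algorithm-specific and found by trial and error. A sketch (the limit in the usage comment is a placeholder, not a value from the original project):

    // Reynolds number of the setup, all quantities in lattice units.
    double reynoldsNumber(double velocity, double characteristicLength,
                          double kinematicViscosity)
    {
        return velocity * characteristicLength / kinematicViscosity;
    }

    // usage, with an empirically established limit:
    // if (reynoldsNumber(u, pipeDiameter, nu) > empiricalLimit) { /* refuse to run */ }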

    The third item is specific to physical calculations and certain algorithms. The method I used takes its input physical quantities in special lattice units (where the unit of length is the step of the uniform spatial lattice, and the unit of time is proportional to it). Until I came across a special article [4] devoted to the conversion of quantities in this method, I spent several weeks trying in vain to understand why the program was behaving not quite correctly.
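    A sketch of what such a conversion looks like, assuming the usual choice of a lattice step dx and a time step dt as the free discretization parameters (the names are illustrative; see [4] for the actual procedure):

    // Conversion of physical quantities into lattice units.
    struct Discretization {
        double dx;  // metres per lattice step
        double dt;  // seconds per time step
    };

    double toLatticeViscosity(double nuPhysical, Discretization d)  // m^2/s -> lattice units
    {
        return nuPhysical * d.dt / (d.dx * d.dx);
    }

    double toLatticeVelocity(double vPhysical, Discretization d)    // m/s -> lattice units
    {
        return vPhysical * d.dt / d.dx;
    }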

    It is worth noting that the second and third items hardly lend themselves to automatic verification: computing the numbers is easy, but the admissible ranges have to be established by hand.

    Identification of errors: error or not?

    This problem is completely unimaginable in business applications; it consists in the fact that it is often impossible to say for certain whether a deviation of the program's behavior from the expected one is an error at all.

    For example, it is known that the velocity profile of a viscous fluid flowing through a cylindrical pipe is parabolic [2]. However, suppose that in the simulation the fluid near the pipe walls flows a little faster than it should. The options usually considered are:
    1. it is actually an error
    2. it is a feature of the algorithm
    3. it is a consequence of incorrect or unsuitable input data for the algorithm (initial conditions, physical parameters); see "Effect of the setup on the result"

    Checking the first option through unit testing is complicated by the difficulties of writing unit tests in such applications [1].

    The second option is easy to check in this example by replacing the wall calculation algorithm. However, it may turn out that modeling with the new method also produces distorted results. In that case, you can try a couple more algorithms (if they exist at all, and if you have the time to find, understand, and implement them).

    Checking the third option, unfortunately, is far from trivial. One approach is to vary the input parameters and the system setup to determine whether there is a region in the phase space of initial data in which the program works. Sadly, this is not so simple, because the number of degrees of freedom in the initial conditions of a complex simulation is very large (you can set various physical parameters, such as viscosity and thermal conductivity; the initial distribution of velocities, forces, and densities over the whole system; etc.). For example, in the test with fluid flowing through a pipe, it took me several days to think of starting the simulation not from a stationary velocity distribution but from a fluid at rest, subsequently accelerated by a constant force, and the error disappeared!
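    For completeness, a brute-force sketch of such a phase-space scan over two of the many degrees of freedom (runAndMeasureDeviation is a hypothetical hook into the real simulation; the grids are illustrative):

    #include <cstdio>

    // Hypothetical stand-in: runs the pipe-flow test with the given setup and
    // returns the deviation of the result from the parabolic profile.
    double runAndMeasureDeviation(double viscosity, double force)
    {
        (void)viscosity; (void)force;
        return 0.0;  // placeholder; replace with the real simulation
    }

    int main()
    {
        // Scan a 2D slice of the setup phase space for a region where the program works.
        for (double nu = 0.01; nu <= 0.16; nu *= 2.0)
            for (double f = 1e-6; f <= 1e-3; f *= 10.0)
                std::printf("nu = %.3f  F = %.0e  deviation = %.2e\n",
                            nu, f, runAndMeasureDeviation(nu, f));
    }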

    Conclusion


    That is, in fact, all I wanted to say about the difficulties of finding errors. If anyone has thoughts on how to avoid such effects or deal with them effectively, I will be glad to hear them.

    Thanks for reading!

    References:

    [1] Why unit tests do not work in scientific applications
    [2] Poiseuille flow
    [3] Reynolds number
    [4] Translation of quantities into LBM
