Pandora's White Box

    When discussing testing, speakers most often talk about the approach known as the "black box". Here we will talk about the opposite scenario - the "white box", which lets us formulate questions for the code based on an understanding of its internal structure.



    This article is based on a transcript of the talk by Nikita Makarov (Odnoklassniki) at our Heisenbug 2017 Moscow conference in December.

    Theory


    At a great many conferences, and in a great many books, blog posts and other sources, we are told that black box testing is good and correct, because this is how the user sees the system.


    It is as if we put ourselves in the user's place - we see and test the system the same way they do.

    This is all cool, but for some reason very little is said about the white box.



    At some point I wondered why. And what exactly is white box testing?

    White box definition


    I decided to figure it out and began searching for sources. The quality of Russian-language sources turned out to be very low; material translated from English into Russian was a little better. Eventually I got to the English sources - to Glenford Myers himself, who wrote the wonderful book "The Art of Software Testing".

    Literally in the second chapter, the author begins to talk about white box testing:
    "To combat the challenges associated with testing economics, you should establish some strategies before beginning. Two of the most prevalent strategies include black-box testing and white-box testing..."

    Translation
    To keep testing costs reasonable, you should work out some kind of strategy before you begin. There are two prevailing strategies: black box and white box testing.

    In the glossary at the end of the book, Myers gives a definition of white box testing:
    "White-box testing - A type of testing in which you examine the internal structure of a program."

    Translation
    White box testing is a type of testing in which you examine the internal structure of a program.

    What does this mean in practice? Myers suggests building test scenarios based on coverage (a toy illustration follows the list):

    • Statement coverage - every statement in the code is executed;
    • Decision coverage - every decision (branch) outcome is exercised;
    • Condition coverage - every individual condition takes both of its outcomes;
    • Decision-condition coverage - conditions and decisions are covered together;
    • Multiple-condition coverage - combinatorial coverage of all condition combinations.
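
    Here is a toy sketch (my own illustration, not an example from Myers's book) of how these criteria differ on one small method:

        // One decision made of two conditions. Note that Java's && short-circuits,
        // which classic condition coverage ignores.
        static String grade(int score, boolean bonus) {
            String result = "fail";
            if (score >= 50 && bonus) {
                result = "pass";
            }
            return result;
        }

        // Statement coverage: grade(60, true) alone executes every statement.
        // Decision coverage: add grade(40, false) so the 'if' is both true and false.
        // Condition coverage: grade(60, false) and grade(40, true) give each condition
        // both outcomes - yet the decision is never true, which is exactly why the
        // stronger criteria exist.
        // Multiple-condition coverage: all four combinations of the two conditions.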

    Myers wrote all of this 35 years ago. What software was written then, and what is written now? What code bases did people work with then, and what do we work with now? A lot has changed. Coverage is good, of course, and there are many tools for measuring it, which we will discuss below. But coverage is far from everything. Especially given that we live in a world of distributed systems, where a fitness band on a person's wrist forwards data through a phone to cloud services.

    So what should we understand by white box testing now? We look into the code, understand the structure and the dependencies within it, ask questions, draw conclusions and design tests based on this data. We run these tests manually or automatically and use them to obtain new data about the state of our system - about how it may or may not work. That is our profit.

    Why do I need a white box?


    Why should we do any of this if we have the black box - that is, the system as the user sees it? The answer is very simple: life is complicated.

    This is the call stack of a regular modern enterprise application written in Java:



    It is not only Java that is this verbose and plentiful; in any other language the stack will look roughly the same. What is in there?



    There are web server calls here; a security framework that does authorization and authentication, verifies rights and everything else. There is a web framework, and then another web framework (because in 2017 you cannot just write an enterprise application on a single web framework). There are frameworks for working with the database and for mapping objects to tables, rows, columns and everything else. And there is one small yellow square - a single call to the business logic. Everything below and above it happens in your application every time.

    Trying to reach this thing from somewhere outside with a black box (the way the user sees it), there is a lot you cannot test. And sometimes you really need to, especially when user behavior changes something in security, the user is redirected somewhere else, or something happens in the database. The black box does not let you do this. That is why you need to get inside - into the white box.

    How do you do it? Let's see in practice.

    Practice


    So that there are no wrong or exaggerated expectations, let's clarify a few details from the very beginning:

    • There will be no ready-made recipes. At all. Everything I will show requires work with a file, your hands and your head.
    • Much depends on the context. I come from Java development (I have been doing it for quite some time). We have our own tools. Some may seem wonderful, others ugly. Some of them cannot or should not exist in your context. That is normal. I came not to show off tools but to share ideas, which is why all my examples are simplified to the limit.
    • To do all of this with your development team, you need to have influence within it. What do I mean by that? You must be able to read the code that developers write. You must be able to speak the same language with them. Without this, nothing that I discuss below will work.

    So that my further story is more or less structured, I have divided it into three levels. Let's start with the simplest - the easy level.

    Easy level


    As I said, we look at the code and see:

    • the code is not formatted;
    • the code is not written according to the guidelines;
    • the names of methods, classes and variables do not match what is accepted in the company;
    • the code is written in the wrong style (again, it does not match the guidelines);
    • any static code analyzer will find a bunch of problems standard for your language;
    • unit tests for the code are either absent or written in a way that does not stand up to criticism.

    Fixing this is the very first and easiest thing to do in white box testing. All of this is handled remarkably well by static code analysis tools, which are quite sophisticated these days - such as Sonar for Java and its analogues for your languages (in fact, Sonar is multilingual and suits almost everyone).

    I do not want to dwell on this for long - there are plenty of interesting talks about it.

    Medium level


    The medium difficulty level is about scale. When you work in a small company or team - you are the only tester, there are three or four developers (the industry average), 100 thousand lines of code between you all, and code review consists of the lead developer throwing a sneaker at whoever is guilty - you do not need any special tools. But this is rarely the case.

    Large successful projects are usually spread across several offices and development teams. And the size of the code base starts at a million lines.

    When a project has a lot of code, developers begin to build formal rules by which this code is written:

    • the code must go to certain places, into certain packages;
    • the code must be formatted correctly;
    • it must inherit from a certain class, have the correct logger, and carry the necessary annotations, so that all production metrics are counted correctly, statistics are collected, and the output goes where it should.

    In other words, as code grows, formal rules arise that you can verify. Accordingly, there are tools that allow you to do this.

    Let's look at an example.

    ArchUnit


    The source code of the example.

    ArchUnit allows you to describe formal rules about what should or should not be present in the code, in a more or less domain-specific language, and ship them to every project as standard unit tests. So, from within the project itself, ArchUnit can verify that the code meets the "sanitary minimum".

    So, we have a rule built with ArchRuleDefinition:

        // Imports assumed for this snippet:
        //   import com.tngtech.archunit.core.domain.JavaClasses;
        //   import com.tngtech.archunit.lang.ArchRule;
        //   import com.tngtech.archunit.lang.syntax.ArchRuleDefinition;

        @Test
        public void testNoDirectUsagesOfSelenium() {
            ArchRule rule = ArchRuleDefinition
                    .noClasses()
                    .that()
                    .resideInAPackage("org.example.out.test")
                    .should()
                    .accessClassesThat()
                    .resideInAPackage("..org.openqa.selenium..");
            // 'classes' is a JavaClasses set imported elsewhere,
            // e.g. new ClassFileImporter().importPackages("org.example")
            rule.check(classes);
        }
    

    The rule says that no class (.noClasses()) residing in the package with tests (org.example.out.test) may directly access the internals of Selenium (..org.openqa.selenium..).

    Let's run this test. It fails beautifully:



    It reports that we have violated the rule (a class located in one package is knocking on classes that live in another package). Even more valuable, in the form of a stack trace it shows every line where the rule is broken.

    ArchUnit is a wonderful tool that lets you embed such things into the CI/CD cycle, that is, write tests inside the project that check architectural rules. But it has one drawback: it checks everything after the code is already written and committed somewhere (so either a commit hook fires and rejects the commit, or something similar). And there are situations when bad code must be impossible to write at all.

    Annotation processing


    Source code of an example:


    At the previous Heisenbug, in the summer of 2017, my colleague Kirill Merkushev from Yandex talked about how code generation solves test automation problems. If you did not see his talk, please watch it - the video is here.

    Indeed, code generation can solve many problems. It allows you not only to create code that you do not want to write by hand, but also to prohibit the creation of code that should not be written. Let's see how it works.

    Most code generation relies on annotation processing. I have a project that describes a couple of annotation processors specific to the Java world - in particular, a Pojo annotation. There is no such thing as a struct in Java. The Java founding fathers are only now thinking about introducing structs into the language. C had them long ago; we still do not (although more than 40 years have passed). But we found a way out - we have the POJO (plain old Java object): an object with fields, getters and setters, and nothing else in it - no logic.

    So I have an annotation that marks a Pojo, and an annotation that marks a Helper - a stateless object into which all sorts of procedural methods are crammed (pure business logic). And I have a processor for each of these annotations.

    The Pojo annotation processor looks for its annotation in the code and, when it finds one, checks whether the annotated class really is (or is not) a Pojo. The Helper annotation processor works similarly (here is a link to the annotations and annotation processors).
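
    I cannot reproduce the project's actual processors here, but a minimal sketch of such a Pojo check (the annotation name annotations.Pojo is taken from the example; the rule itself - every field must have a getter and a setter - is my simplification) fits in one class:

        import java.util.Set;
        import javax.annotation.processing.AbstractProcessor;
        import javax.annotation.processing.RoundEnvironment;
        import javax.annotation.processing.SupportedAnnotationTypes;
        import javax.lang.model.element.Element;
        import javax.lang.model.element.ElementKind;
        import javax.lang.model.element.TypeElement;
        import javax.tools.Diagnostic;

        // Checks every class annotated with @Pojo: each field must have a matching
        // getter and setter; any violation fails the compilation itself.
        @SupportedAnnotationTypes("annotations.Pojo")
        public class PojoProcessor extends AbstractProcessor {

            @Override
            public boolean process(Set<? extends TypeElement> annotations, RoundEnvironment roundEnv) {
                for (TypeElement annotation : annotations) {
                    for (Element type : roundEnv.getElementsAnnotatedWith(annotation)) {
                        for (Element member : type.getEnclosedElements()) {
                            if (member.getKind() != ElementKind.FIELD) continue;
                            String name = member.getSimpleName().toString();
                            String suffix = Character.toUpperCase(name.charAt(0)) + name.substring(1);
                            if (!hasMethod(type, "get" + suffix) || !hasMethod(type, "set" + suffix)) {
                                // An ERROR diagnostic makes javac abort the build.
                                processingEnv.getMessager().printMessage(Diagnostic.Kind.ERROR,
                                        "@Pojo class must have a getter and a setter for field '" + name + "'",
                                        member);
                            }
                        }
                    }
                }
                return true;
            }

            private boolean hasMethod(Element type, String name) {
                return type.getEnclosedElements().stream().anyMatch(
                        e -> e.getKind() == ElementKind.METHOD && e.getSimpleName().contentEquals(name));
            }
        }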

    How does it all work? I have a small project, I run compilation in it:


    I see that it does not even compile:



    This is because this project contains code that violates the rules:

    package a.b.c;
    import annotations.Pojo;
    @Pojo
    public class AnotherFailed {
        private long point; // a field with no getter or setter - not a valid Pojo
    }

    Unlike the previous example, this thing is embedded right into the development environment and into continuous integration, that is, it covers a wider loop of the CI/CD cycle.

    Nightmare level


    When you've played enough at the previous levels, you want something more.

    Code coverage


    Example Source Code

    Since Myers wrote his book, a great many tools for measuring code coverage have appeared, for almost every programming language. Here I list only those I found to be popular judging by the number of links to them on the Internet (you may say this is the wrong metric - I agree with you):

    • JaCoCo, Cobertura - Java;
    • OpenCover - .NET;
    • Coverage - Python;
    • SimpleCov - Ruby;
    • OpenCppCoverage - C++;
    • cover, gocov - Go.

    In some programming languages (this was a surprise to me), the coverage tooling effectively comes with the language itself - in Go it ships with the standard toolchain, and in Python the coverage package is the de facto standard.

    These tools also integrate with development environments, so that we see that wonderful little strip in the gutter on the left indicating that this piece of code is covered by unit tests (green) and that one is not (red).



    Looking at this in the context of unit tests, the question suggests itself: why can't the same be done with integration or functional tests? In some places you can!

    But besides tests, we have users. We can test whatever we want (the main thing is not to test nonsense), but users keep clicking in one and the same place, because that is what they use 95% of the time. So why not draw the same beautiful stripes, but for code that is or is not used in production?



    In fact, this can be done. Let's see how.


    Imagine that I am the tester of this application, and it lands on me for regression testing ("Urgent, we are on fire, we are doing a mega-startup, we need to check what works and what does not"). I click through all the manipulations in it - everything works, we ship the release. The release is successful, everything is fine.

    Six months pass and the situation repeats itself. Over those six months, the developers have changed something. What exactly, I do not know. Whether I can find out is a separate question. But the main thing is: what code is being called now? By clicking through everything, did I check all of it or not? Clearly not all of it - but did I miss anything important?

    You can answer these questions if, alongside the application, you launch an agent that collects coverage from it.

    I used JaCoCo. You can take any agent; what matters is that you can later make sense of what it has recorded for you. As a result of the agent's work, we get a jacoco.exec file:



    From this file, the application sources and the application binary, you can build a report that shows how it all works.
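
    As a sketch of how this step can be done (this is not the exact script from the talk): the agent is usually attached with a JVM flag such as -javaagent:jacocoagent.jar=destfile=jacoco.exec, and JaCoCo's own report API can then match the dump against the compiled classes. The paths and names here are assumptions:

        import java.io.File;
        import org.jacoco.core.analysis.Analyzer;
        import org.jacoco.core.analysis.CoverageBuilder;
        import org.jacoco.core.analysis.IBundleCoverage;
        import org.jacoco.core.tools.ExecFileLoader;

        public class CoverageReport {
            public static void main(String[] args) throws Exception {
                // 1. Load the execution data the agent dumped on JVM exit.
                ExecFileLoader loader = new ExecFileLoader();
                loader.load(new File("jacoco.exec"));

                // 2. Match it against the compiled classes of the application.
                CoverageBuilder builder = new CoverageBuilder();
                Analyzer analyzer = new Analyzer(loader.getExecutionDataStore(), builder);
                analyzer.analyzeAll(new File("target/classes"));

                // 3. The bundle now knows, per package/class/line, what was executed.
                IBundleCoverage bundle = builder.getBundle("my-app");
                bundle.getPackages().forEach(p -> System.out.printf("%s: %d of %d lines covered%n",
                        p.getName(),
                        p.getLineCounter().getCoveredCount(),
                        p.getLineCounter().getTotalCount()));
            }
        }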

    I have a small script that analyzes this data and creates an html folder:



    The script produces this report:



    During testing I exercised some things by hand and others not, to varying percentages. But since we are not shy about looking into the white box and seeing what happens inside the application, we know where we need to click.



    In this report, the lines I exercised are highlighted in green; the ones I did not are in red.

    If we read this code more or less thoughtfully (without even digging into what is going on inside), we can see that I did not exercise any of the work related to network failure. I also did not check the cases of getting a bad status code (when we are not authorized to request the repositories of this organization).

    To check the network failure, you can break the network or apply fault injection; and to receive a status code other than 200 - for example, 401 - you can write another fault injection implementation and drop it into the application directory. A sketch of the 401 case follows.
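
    As one possible stand-in for the latter (my own sketch, not the tooling from the talk), the JDK's built-in HTTP server is enough to force the "bad status code" branch:

        import java.io.IOException;
        import java.net.InetSocketAddress;
        import com.sun.net.httpserver.HttpServer;

        // Point the application at localhost:8080 (an assumed address) and every
        // request gets a 401, driving the unauthorized branch the report showed in red.
        public class UnauthorizedStub {
            public static void main(String[] args) throws IOException {
                HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
                server.createContext("/", exchange -> {
                    exchange.sendResponseHeaders(401, -1); // 401 with an empty body
                    exchange.close();
                });
                server.start();
            }
        }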

    Trying to answer the questions of what our tests actually test, where our users click, and how the one correlates with the other, we at Odnoklassniki created a service that brings it all together. We build a consumer service. We could test some forgotten corner of our large portal where nobody ever goes - but what would be the value of that?

    At first we called it Cover. But then, thanks to a typo by one of our engineers, we renamed it KOVÖR.

    KOVÖR knows about our software development cycle - in particular, when coverage measurement needs to be turned on, when it needs to be turned off, and when reports need to be rendered from it. And KOVÖR lets us compare reports: what happened, say, last week versus this week; what we exercised with autotests versus what people exercised by hand.

    It looks like this (these are real screenshots from KOVÖR):



    We get a side-by-side comparison of the same code. On the left are the autotests, on the right are the users. What is not exercised is highlighted in red, what is exercised - in green (in this case, autotests exercise this particular piece of business logic much better than users do).

    Naturally, everything is configurable: left and right can be swapped, and so can the colors.

    As a result, we get a fairly simple 2x2 matrix that characterizes the code:



    Where we have coverage both by autotests and by people, the two need to be compared - and that is what KOVÖR does. Where there is autotest coverage but no people, you need to think carefully. On the one hand, it may be dead code - a very big problem in modern development. On the other hand, it may be functionality that people use only in exceptional circumstances (account recovery, unlocking, backup, restoring from backup - things that are rarely invoked).

    Where there are no autotests but there are people, you obviously need to write autotests to cover those places, and strive for all things reasonable, good and eternal. And where there are neither autotests nor people, you should first of all add some metrics and verify that this code really is never called. After that, it must be ruthlessly removed.

    Code coverage tools already exist; you just need to integrate them into your process. With them you can:

    • use them to introspect manual testing;
    • get a quality measure for autotests;
    • find dead code and dead features.

    Meta information


    There is a classic mathematical problem - the knapsack problem: how to pack things into a knapsack so that everything fits and as little space as possible is wasted. I think many of you have heard of it. Let's look at it in the context of testing.

    Suppose I have 10 autotests. They look like this:


    In reality, every autotest takes a different amount of time to run. So at any given moment they look like this:



    And we have two resources on which we run them:

    • Resource to run tests number 1
    • Resource for running tests number 2

    I do not know what these are - Jenkins slaves, virtual machines, Docker containers, phones - anything.

    If we take these 10 tests and divide them equally between the two resources, we get the following picture:



    This picture is neither good nor bad, but it has one feature: the first resource has been idle for quite some time while testing on the second is still underway.

    Without changing the number of tests on each of these resources, you can simply regroup them and get the following picture:



    Each resource still has five tests, but the idle time has shrunk - we saved about 20% of the testing time. When we first implemented this optimization ourselves, it really did save us 20%. That figure is not plucked out of thin air; it comes from practice.

    Looking at this pattern more broadly, the speed of your tests is always a function of how many resources you have and how many tests you have. So you have to balance and somehow optimize it.
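
    The regrouping above can be approximated with the classic greedy heuristic: sort the tests by duration and always hand the next one to the least loaded resource. A minimal sketch with made-up durations (this is not our actual scheduler):

        import java.util.ArrayList;
        import java.util.Arrays;
        import java.util.Comparator;
        import java.util.List;
        import java.util.PriorityQueue;

        public class TestPacking {
            public static void main(String[] args) {
                int[] durations = {90, 80, 70, 60, 50, 40, 30, 20, 15, 10}; // seconds, hypothetical
                int resources = 2;

                // Min-heap of resources ordered by accumulated load: {id, load}.
                PriorityQueue<int[]> load = new PriorityQueue<>(Comparator.comparingInt((int[] r) -> r[1]));
                List<List<Integer>> plan = new ArrayList<>();
                for (int i = 0; i < resources; i++) {
                    load.add(new int[]{i, 0});
                    plan.add(new ArrayList<>());
                }

                // Longest tests first, each onto the currently least loaded resource.
                int[] sorted = Arrays.stream(durations).boxed()
                        .sorted(Comparator.reverseOrder())
                        .mapToInt(Integer::intValue).toArray();
                for (int d : sorted) {
                    int[] r = load.poll();
                    plan.get(r[0]).add(d);
                    r[1] += d;
                    load.add(r);
                }

                for (int i = 0; i < resources; i++) {
                    System.out.printf("resource %d: %s%n", i + 1, plan.get(i));
                }
            }
        }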

    Why is it important?

    Because things are not always the same. Suppose someone runs up to your continuous integration server and says: we urgently need to run the tests - check this fix, and as quickly as possible.



    You can follow this person's lead and give them every resource available to run the tests.



    The truth may be that this fix is not that important compared to the current release, which is due to roll out in two hours. That is the first point.



    And the second: in reality, resources are nowhere near as plentiful as tests. The picture I showed earlier, with 10 tests and two resources, is a very big simplification. There may be 200 resources and 10 thousand tests. And this game of how many resources to give to whom starts to affect everyone.

    To play this game correctly, you must always have the answers to two questions: how many resources you have to run on, and how many tests you have.

    If you think long enough about how many resources you have and how many tests (especially the latter), sooner or later you will come to the conclusion that it would be good to parse the code of your tests and understand what actually goes on in it:



    This thought may seem crazy to you, but do not dismiss it right away. All development environments already do this in order to show you hints like these:



    Moreover, they parse not only the code but also all of its dependencies.
    They know how to do it well, and some even ship libraries that solve such problems in about six lines (at least for Java).



    In these six lines, you fully parse a piece of code. You can extract any meta-information from it: how many fields, methods and constructors it contains - anything, including tests.
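
    The talk does not say which library it is, but JavaParser is one library of this kind. A sketch of pulling meta-information out of a test source file (the file path and the @Test annotation are assumptions):

        import java.nio.file.Paths;
        import com.github.javaparser.StaticJavaParser;
        import com.github.javaparser.ast.CompilationUnit;
        import com.github.javaparser.ast.body.MethodDeclaration;

        public class TestMetaInfo {
            public static void main(String[] args) throws Exception {
                // Parse one test source file into an AST.
                CompilationUnit unit = StaticJavaParser.parse(Paths.get("src/test/java/LoginTest.java"));

                // Walk every method and keep only those annotated with @Test.
                unit.findAll(MethodDeclaration.class).stream()
                        .filter(m -> m.isAnnotationPresent("Test"))
                        .forEach(m -> System.out.printf("test: %s, annotations: %s%n",
                                m.getNameAsString(), m.getAnnotations()));
            }
        }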

    With all of this in mind, we created a service called BERRIMOR.



    BERRIMOR can say "oatmeal, sir!", and it can also:

    • download code from Git repositories;
    • parse the code correctly (and do so regularly);
    • extract meta-information: count the tests, collect test metadata (tags, disabled tests), and know who owns each test.

    BERRIMOR serves all of this data to the outside.

    I could show you the BERRIMOR interface, but you would not see anything there anyway. All of its power is in the API.

    Social code analysis


    In 2010, I read Sergei Arkhipenkov's lectures on managing software projects, and this quote stuck with me:

    "…реальность, которая заключена в особой специфике производства программ, по сравнению с любой другой производственной деятельностью, потому что то, что производят программисты – нематериально, это коллективные ментальные модели, записанные на языке программирования" (Сергей Архипенков, Лекции по управлению программными проектами, 2009).


    The key word is collective. People have handwriting, and not everyone's is good. Programmers have handwriting too (also not always good). And there are relationships between people: someone writes a feature, someone patches it, someone fixes it. These dependencies exist inside every team, inside every development organization. And they affect the quality of what happens in the project.

    Social code analysis is an emerging discipline. I have found three publicly available videos that can help you understand what it is.


    Social code analysis allows you to:

    • understand who repairs and who breaks;
    • find implicit relationships in the code. When your class changes and its test changes with it, that is an explicit connection in the code, and that is normal. But when your class, its test and something else all change, and it happens that way every time - that is an implicit connection in the code;
    • find hot spots in the code - the places that are most often fixed, changed, broken;
    • find dead code and dead features. It looks very strange now (in 2017) to see code that was written back in 2013-2015 and has not changed since. Either it is perfect and works well - and the metrics will show that - or it is dead;
    • if you know what technical debt looks like in your code, you can find that too.

    A little more about technical debt. I have a weak hypothesis about technical debt:

    • an abstract project in a vacuum has a bug tracker (issue tracker). The bug tracker holds all the bugs and tasks, and each of them has some kind of ID;
    • there is a version control system - Git, in the simplest case. Git has commits, and commits have messages in which people write links to task IDs;
    • my hypothesis is that the files in Git that are most often changed under bug tickets are the places where technical debt accumulates.

    At Odnoklassniki, it looks like this:



    When I write and commit something, I include a link to the ticket in Jira. Because of the NDA, I cannot show you social code analysis on the Odnoklassniki repositories, so I will show it on the open source project Kafka.

    Kafka has an open issue tracker, an open repository with code:


    Let's see what happens there.

    So, I have a small utility application that walks all the commits in this repository and parses their messages, using a regular expression - Pattern.compile("KAFKA-\\d+") - to find commits that reference a ticket.

    The console shows 4246 commits in total, of which 1562 have no such reference. That is, the analysis is about a third less accurate than we would like.



    Next, we take each commit and build an index from it: which files changed in it, and under which ticket. We combine all these indices into one large hashmap: file name to the list of tickets under which that file changed. Here is what it looks like:



    For example, we have the KafkaApis file, and next to it is a huge list of issues under which it changed (the API changes often).
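
    A rough sketch of how such an index can be built (this is not the author's utility; the path to the clone is an assumption):

        import java.io.BufferedReader;
        import java.io.File;
        import java.io.InputStreamReader;
        import java.util.ArrayList;
        import java.util.HashMap;
        import java.util.List;
        import java.util.Map;
        import java.util.regex.Matcher;
        import java.util.regex.Pattern;

        public class TicketIndex {
            public static void main(String[] args) throws Exception {
                Pattern ticket = Pattern.compile("KAFKA-\\d+");
                Map<String, List<String>> fileToTickets = new HashMap<>();

                // "@@" marks subject lines so they can be told apart from file names.
                Process git = new ProcessBuilder("git", "log", "--pretty=format:@@%s", "--name-only")
                        .directory(new File("kafka"))
                        .start();

                String currentTicket = null;
                try (BufferedReader out = new BufferedReader(new InputStreamReader(git.getInputStream()))) {
                    for (String line; (line = out.readLine()) != null; ) {
                        if (line.startsWith("@@")) {            // a commit message subject
                            Matcher m = ticket.matcher(line);
                            currentTicket = m.find() ? m.group() : null;
                        } else if (!line.isEmpty() && currentTicket != null) {
                            fileToTickets.computeIfAbsent(line, f -> new ArrayList<>())
                                         .add(currentTicket); // 'line' is a changed file path
                        }
                    }
                }

                fileToTickets.forEach((file, tickets) -> System.out.println(file + " -> " + tickets));
            }
        }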

    Then we go to the Kafka issue tracker and determine what kind of issue each change was made under - was it a bug, a feature, an optimization? As output we get a small hash that records what kind of issue it is and what its priority is (these here are all bugs):



    We get the following output:



    Here we record what share of the changes in each file were made under bugs:



    For example, for the top line: the total number of tickets that passed through commits to this file is 231, of which 128 are bugs; 128 divided by 231 gives 55% - the share of bug-driven changes. With high probability, technical debt is concentrated in these files.

    Summary


    I have shown you six different examples, and this is far from everything that exists. But it means that the white box is, above all, a strategy. How to implement it on your project, you know best. You have to think, and not be afraid to get into the code - that is where the whole truth about your project lies. So read the code, write code, and engage with the code that programmers write.





    If the topic of testing and error handling is as close to you as it is to us, you will probably be interested in these talks at our Heisenbug 2018 Piter conference in May:

