Metamorphic testing: why almost no one knows about this promising technique
- Transfer
I have to admit: I'm reading ACM Magazine . It makes me a “nerd,” even by the standards of programmers. Among other things, I learned from this magazine about “metamorphic testing”. I had never heard of him before, like all the people I asked. But the scientific literature on this topic is surprisingly voluminous: there are many incredibly successful examples of its application in completely different fields of research. So why haven't we heard of him before? There is only one article for people outside the scientific community. Now let there be two of them.
Brief Background
Most written tests use oracles . That is, you know the answer and explicitly check whether the calculations give the correct answer.
def test_dist():
p1 = (0, 3)
p2 = (4, 0)
assert dist(p1, p2) == 5
In addition to oracle tests, there are also manual tests. The tester sits down at the computer and compares the input data with the results. As systems become more complex, manual tests are becoming less and less useful. Each of them checks only one point in a much larger state space, and we need something that explores the entire state space.
This leads us to generative testing : writing tests covering a random set in a state space. The most popular style of generative testing is property based testing , or PBT. We find the “property” of the function, and then generate the input values and check whether the output values match this property.
def test_dist():
p1 = random_point()
p2 = random_point()
assert dist(p1, p2) >= 0
The advantage of PBT is that it covers more space. Its disadvantage is the loss of specificity. This is no longer an oracle test! We do not know what the answer should be, and the function may be erroneous, but in a way that has the same property. Here we rely on heuristics.
A serious problem with PBT is finding good properties. Most functions have simple, general properties and complex, specific properties. General propertiescan be applied to a large number of functions, but they do not give us much information. More specific properties give more information, but they are more difficult to find and they are applicable only in areas of limited tasks. If you have a function that determines whether a graph is acyclic, then what property tests will you write? Will they give you confidence that the function is correct?
Motivation
Now consider a more complex task. Imagine that you want to write a speech-to-text (STT) converter for English. It receives a sound file, and displays the text. How would you test it?
The easiest way to use a hand oracle. Dictate the sentence and check if the output text matches it. But this is not even close enough! The range of human speech is huge . It would be better to test 1,000 or even 10,000 different sound files. Hand-held oracles with transcription would be too costly. This means that we have to use property-based testing instead.
But how do we generate input? For example, we can create random strings, pass them through a text-to-speech (TTS) converter, and then make sure that our STT produces the same text. But this again gives us a very limited range of human voice. Will TTS create changes in intonation, swallow words, imitate a strong accent? If we can’t deal with them, will the STT be particularly useful? It is better to use arbitrary texts, for example, recordings from the radio, from podcasts and online videos.
Now a new problem arises. When using TTS, we started with written text. In the case of arbitrary sound files, we do not have it, and at the same time we do not want to transcribe manually. Instead, we are limited to using properties. What properties do we need to test? Examples of the simplest properties: “the program does not crash with any incoming data” (a good property) or “it does not convert acoustic music into words” (maybe?). These properties do not cover the verification of the main task of the program very well and slightly increase confidence in its quality.
So, we have two tasks. First, we need a large amount of input in the form of speech. Secondly, we need to understand how to convert them into useful tests without spending long hours manually transcribing voices into oracles.
Metamorphic testing
For all this, the output is considered separately. What if we embed them in a wider context? For example, if a sound clip is transcribed into the output
out
, then we should always get out
when:- Double volume up
- Increasing frequency
- Increase pace
- Adding background noise
- Adding vehicle noise
- Any combination of the above.
These are all “simple” transformations that we can easily test. For example, for a test with “traffic noise”, we can take 10 samples of car noise, put them on a sound clip and check if the recognition results of all 11 versions match. We can double or increase the volume, turning 11 versions into 33 versions, and then double the pace to get 66 versions. This principle can be applied to each sound clip in our database, significantly expanding the space of incoming data.
The presence of 66 versions for comparison is quite convenient. But that’s not all: we still don’t need to know what the output should be. If all 66 conversions return
out
, then the test passed successfully, if at least one returns something else, then the test fails. At no stage do we need to check what is contained in out
. This is extremely important. So we significantly increase the test space with very little human involvement. For example, we can download an episode of the series, perform conversions and check if all the results of their conversion to text 1 match . We got useful tests without listening to the voice clip . Now we can generate complex and deep tests without using an oracle! Two sets of input data, as well as their output data, are connected to each other. Such a property related to the set of incoming / outgoing data is called a metamorphic relationship 2. Testing that applies this property is called metamorphic testing . In complex systems, interesting metamorphic relationships can be easier to find than interesting properties of individual incoming / outgoing data.
Let us state it a little more formally: if we have
x
and f(x)
, then we can perform some transformations x
to get x2
and f(x2)
. In the case of STT, we just check f(x) = f(x2)
, but we can use any relationships between the two datasets. There may be metamorphic relationships of the type f(x2) > f(x)
or “is an f(x2)/f(x)
integer value”. Similarly, this principle can be extended to several sets of input data by using f(x)
andf(x3)
. An example of this is comparing the results of a search engine without filters with the results of an engine with one or two filters. In the majority of descriptions of use cases I read, only two sets of input data are used, because even they are enough to find crazy bugs.Examples of using
Speaking of use cases: how effective is metamorphic testing in practice? It is one thing to talk about a technique abstractly or to give artificial examples. Case studies are useful for three reasons. Firstly, it shows if the method actually works. Secondly, from them you can learn about the potential difficulties when using MT. Third, the examples show us how we can use the technique. Any metamorphic connection used in the example of use can be tried to adapt to the solution of our problems.
In the article "Metamorphic Testing: A Review of Challenges and Opportunities" is a list of a lot of research, but all of them are scientific articles. Below are the most interesting. Articles tagged
(pdf)
, posted, as you might guess, in PDF.- METTLE: A Metamorphic Testing Approach To Validating Unsupervised Machine Learning Methods (pdf). Here 11 different metamorphic relationships are described for testing uncontrolled clustering, for example, “will we get the same result if we mix the input data?” And “do the additional input data at the edges of the clusters belong to these clusters?” Different models change with different relationships. For example, about 5% of tested k-means models with mixing of the order of incoming points have an average clustering error of 20%
- DeepTest: Automated Testing of Deep-Neural-Network-driven Autonomous Cars (pdf). The topic of the article is the vision systems of autopilots in cars. Examples of metamorphic relationships (MS): “adding a rain filter” or “slight tilt of the image”. The authors posted the results of sampling here : almost all the systems they tested did not cope with their functions when changing the MS.
- Automated Testing of Graphics Shader Compilers (pdf). Injections of the "dead" code and runtime constants in the shaders led to the disappearance of objects in the images or turning them into noise. Researched on the basis of their work created a startup GraphicsFuzz , which was bought by Google, and the site is closed.
- Metamorphic Testing of RESTful Web APIs (pdf). Will we get the same elements when pagination changes ? What if you sort them by date? This article lists a bunch of bugs in Spotify and Youtube.
- An innovative approach for testing bioinformatics programs using metamorphic testing (formerly pdf, but now no). Search for errors in bioinformation data. I am poorly versed in bioinformatics, but the article shows that MSs are also useful in specialized fields.
Problem
Oh, so all these sources are in PDF.
It took several hours to find all of these articles. And this problem is associated with the biggest obstacle to the development of MT: all of the above links are preprints or first drafts of future scientific articles. When I begin to understand little-known techniques, I first of all ask myself: “why are they not well-known?” Sometimes the reason is obvious, sometimes it is a complex set of small reasons, sometimes the problem is simply that the technique is “out of luck”.
In the case of MT, the problem is obvious. Almost all information is hidden behind scientific paywall. If you want to study MT, you either need access to the journal, or you need to spend several hours searching for preprints 3 .
Further study
The inventor of MT is Ty Chen . He became the driving force of many studies. Other researchers in this area are Zhi Quan Zhou and Sergio Segura ; both of them posted all their preprints on the Internet. Most of the research work is done by one of these people.
Probably the best place to start is with Metamorphic Testing: A Review of Challenges and Opportunities and A Survey on Metamorphic Testing . Although this article is written about metamorphic testing, investigated also applied metamorphic relationships in general to a wide variety of other disciplines, for example, formal code verification and debugging. I have not yet studied these areas of application of the technique in detail, but it is probably worth looking at them too.
From the point of view of applicability, it may theoretically be possible to adapt most PBT libraries to verify metamorphic properties. In fact, the first example in Quickcheck is testing MS, and in this essay on PBT, MS is indirectly applied. GenerallyIt seems to me that most PBT research focuses on efficiently generating and trimming input data, and MT research is mainly focused on determining what we really need to test. Therefore, these techniques are likely to complement each other.
Thanks to Brian Ng for his research assistance.
Postscript: request
In fact, it is not surprising that I had never heard of this technique before. There are many really interesting and useful techniques that could not leave their tiny bubble. I found out about MTs through luck rather than active searches.
If you know something worth widespread use, please write to me .
- Well, there can be obvious problems: there can be music in the podcast, fragments of speech in other languages, etc. But the theory is reliable: if we are able to get speech samples, we can use them as part of the tests without preliminary manual transcription / markup.
- In specifications, the corresponding idea is hyperproperties — properties of sets of behaviors, rather than individual behaviors. Most hyperspecific studies are related to HS safety. As I understand it, its HS are a superset of MS.
- I had a second, now disproved, hypothesis: since most of the main researchers from China and Hong Kong, perhaps this technique is better known in communities of programmers who communicate in Mandarin, rather than English. Brian Eun tested this hypothesis for me, but did not find any significant signs of the use of the technique by the Chinese.