Synthetic vs real test data: pros, cons, pitfalls
Let's start with the juicy part and give a couple of examples from testing practice.
Imagine an online store ready for launch. Nothing hints at trouble: marketers have developed a promotion strategy, articles have been placed on specialized websites, advertising has been paid for. Management expects up to 300 purchases weekly. The first week passes, and the managers record 53 payments. The store's management is furious...
The project manager runs around looking for reasons: poor usability? the wrong traffic? something else? We started digging and studied the analytics system. It turned out that 247 people reached the checkout, and only 53 paid.
The analysis showed that the rest could not complete the purchase because of their email address!
Testing of the order form had been handed to a novice specialist. He tried his best and fed the "Name", "Email" and "Phone" fields every possible and impossible value that online generators gave him. A week later all the bugs were found and fixed, and the release went out. But among the values he tried there was not a single email address with fewer than 8 characters after the @.
So the happy owners of @mail.ru and @ya.ru addresses (and the like) could not enter their email and left the site without buying anything. The owners missed out on roughly 600 thousand rubles of revenue, the entire advertising budget was poured into the void, and the online store collected a pile of negative reviews.
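The missing boundary cases are easy to encode once you know about them. Below is a minimal, hypothetical sketch in Python/pytest: the regular expressions and test names are our own illustration (the project's real validation code is unknown to us), but they show the kind of short-domain values that the generator-produced data set lacked.

```python
import re
import pytest

# Hypothetical reconstruction of the faulty check: it implicitly requires
# at least 8 characters after the "@".
BROKEN_EMAIL_RE = re.compile(r"^[\w.+-]+@[\w.-]{8,}$")

# A saner check only requires a non-empty domain with at least one dot.
FIXED_EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")

# Boundary cases with short domains: exactly the data the generators never produced.
SHORT_DOMAIN_EMAILS = ["user@mail.ru", "user@ya.ru"]

@pytest.mark.parametrize("email", SHORT_DOMAIN_EMAILS)
def test_broken_validator_rejects_short_domains(email):
    assert BROKEN_EMAIL_RE.match(email) is None  # this is the bug described above

@pytest.mark.parametrize("email", SHORT_DOMAIN_EMAILS + ["user@example.com"])
def test_fixed_validator_accepts_real_world_domains(email):
    assert FIXED_EMAIL_RE.match(email)
```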
Do you think this is an isolated case? Then here is another customer horror story.
Riding the general wave of interest in cashless payments, a loan company decided to introduce a new way of disbursing funds: to the borrower's bank card. We implemented the corresponding form in the manager's personal account, tested various error cases in the card-data fields, fixed them, and went live. A month later, management learned that 24% of potential borrowers who had already received approval never completed their loan application. Why? They provided bank cards whose numbers contained 18 digits instead of the designed-for and tested 16. Neither the system nor the managers could register such cards, and the customers left with nothing.
The pilot project ran in 3 offices in the city, which together issued on average 340 loans per month. Result: the organization lost at least 612 thousand rubles in revenue.
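Card number length is a classic boundary to parameterize rather than hard-code. A rough sketch in Python/pytest (the helper function and the exact allowed range are our illustrative assumption; real card numbers under ISO/IEC 7812 vary in length rather than being fixed at 16):

```python
import pytest

def is_plausible_pan(card_number: str) -> bool:
    """Very rough card-number check: digits only, length within the range
    allowed by ISO/IEC 7812 (roughly 12 to 19 digits), not exactly 16."""
    digits = card_number.replace(" ", "")
    return digits.isdigit() and 12 <= len(digits) <= 19

@pytest.mark.parametrize("pan,expected", [
    ("4111111111111111", True),     # classic 16-digit test number
    ("424242424242424242", True),   # 18 digits, e.g. some Maestro cards
    ("1234567890", False),          # too short to be a card number
])
def test_card_length_is_not_hardcoded_to_16(pan, expected):
    assert is_plausible_pan(pan) is expected
```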
These are just a couple of examples of how synthetic data can cause serious losses on a project. Many testers rely on synthetic data to test all kinds of products, from mobile applications to enterprise software. In that case the testers themselves invent test situations, trying to predict user behavior.
However, more often than not they see the user not in full dimension, as in a cinema with 3D glasses, but as a rough sketch, like a stick figure a child draws in a sketchbook.
As a result, the tester does not cover all possible test situations and cannot work with large volumes of data. And although the testing itself may be done well, there is no guarantee that the system will not fall over when a real user (most often illogical and even unpredictable) starts interacting with the product.
Today we'll talk about which data to prefer in the testing process: synthetic or real.
Getting the terminology straight
Every time we start testing, we decide what test data to use. Its sources may be:
- A copy of the production database on the test environment.
- A database from the client's third-party systems that can be used in the current one.
- Test data generators.
- Manual creation of test data.
The first two sources give us real test data: it is created by existing processes, by users or by the system itself.
For example, when we join a project to develop a web product for a manufacturing company, we can use a copy of its existing 1C database for testing, one that has been accumulating and processing data about operations and customers for several years. Using the module for generating and displaying reports on completed orders, we pull information out of 1C in the required format and work with real test data.
We call the sources from points 3 and 4 "synthetic test data" (the term is common in English-language testing literature, but rarely used in the Russian-speaking community). We create such data ourselves for development and testing purposes.
For example, we received an order from a new electronic trading platform for procurement by state and municipal organizations under Federal Law No. 44-FZ. Very strict information-security rules apply there, so the team has no access to real data. The only way to test is to create the entire set of test data from scratch, right down to physical digital signatures intended solely for testing.
In our practice one type of data alone is rarely enough; usually we work with a combination of both, depending on the task.
To check the limits and exceptions in the same system for the manufacturing company, we additionally used synthetic data: with it we checked how the report behaves if one of the orders contains no products. On the electronic trading platform we combined synthetic data with the real OKPD2 and OKVED2 classifiers.
Synthetic Data Capabilities
In a number of situations synthetic data is simply indispensable! Let's see which tasks in the tester's arsenal it can handle:
1. Simplification and standardization
Real data is often a homogeneous array: imagine that the system stores thousands of individuals with an identical set of attributes, legal entities of different types, standard operations and many other kinds of entities. Why spend hours testing the same input if you can combine records into groups and appoint a "representative" for each group?
On one of last year's projects the customer decided to reinforce the testing team before the next release, and a team of our specialists was brought in. The product contained a reworked registration form with many fields. We spent about 30 minutes designing the test cases and roughly the same time running them. It then turned out that another tester had already tested this form and had spent 7 hours on it. Why? He had simply decided to run the test with the real data of 12 employees from the staff list, not taking into account that a test for one employee covers all the attributes that are identical for every registered profile.
The saving: 6 hours 30 minutes, and that is on a single test.
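A sketch of the idea in Python (the employee records and attributes are invented for illustration): group records by the attributes that actually affect the form's behavior, then test one representative per group.

```python
from itertools import groupby

# Hypothetical staff records; only attributes that influence the
# registration form's behavior matter for partitioning.
employees = [
    {"name": "Ivanov",  "citizenship": "RU", "has_middle_name": True},
    {"name": "Petrova", "citizenship": "RU", "has_middle_name": True},
    {"name": "Smith",   "citizenship": "US", "has_middle_name": False},
    # ... the remaining records repeat the same combinations of attributes
]

def partition_key(person):
    # Two profiles sharing these attributes exercise the same logic,
    # so one representative per group is enough.
    return (person["citizenship"], person["has_middle_name"])

representatives = [
    next(group)
    for _, group in groupby(sorted(employees, key=partition_key), key=partition_key)
]
print(f"{len(employees)} records collapse into {len(representatives)} test cases")
```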
2. Combinatorics and test coverage
However large the volume of real data, it may still not contain a number of possible cases. To guarantee that the system works with various combinations of input data, we have to generate them ourselves.
Let's return to the previous example. The registration form in the new release was reworked for a reason: the customer's team, following its corporate culture, decided to make the middle name mandatory. As a result, all the foreign specialists on staff suddenly acquired the same father, Ivan (someone told them to write "Ivanovich" until it gets fixed). The case is funny, but if your tests do not account for every quirky requirement and the real parameters of your clients, don't be offended when those clients stop accounting for you in their budgets and reviews.
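When the combinations matter more than the individual values, it is usually cheaper to enumerate them than to wait for real users to hit them. A small Python sketch (the input dimensions are hypothetical examples for a registration form like this one):

```python
from itertools import product

# Hypothetical input dimensions for the registration form.
citizenships   = ["RU", "KZ", "other"]
middle_names   = ["present", "absent"]          # the case that was missed
name_alphabets = ["cyrillic", "latin"]

# Full combinatorial coverage; for larger grids a pairwise tool
# (e.g. the allpairspy package) keeps the case count manageable.
cases = list(product(citizenships, middle_names, name_alphabets))
for citizenship, middle_name, alphabet in cases:
    print(citizenship, middle_name, alphabet)
print(f"{len(cases)} synthetic combinations, generated in advance")
```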
3. Automation
In automated testing synthetic data is indispensable. Even seemingly insignificant changes in the data can break an entire set of test runs.
An example from the banking sector is illustrative here. To check whether the application correctly inserts bank account numbers into the documents it generates, we spent 120 person-hours writing autotests. There was no access to the database, so the account numbers were hard-coded in the autotests themselves. The tests showed excellent results and we were ready to report 180%+ ROI from automation. But a week later the database was updated and the account numbers changed. All the autotests, along with our automation effort, promptly fell over. After reworking the autotests the final ROI dropped to 106%. We might as well have tested everything by hand from the start.
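One way to soften this kind of breakage, assuming a pytest-style setup, is to keep reference values in a single synthetic fixture file instead of hard-coding them in every autotest. The file path and the document-rendering stub below are our own illustration, not the project's actual code:

```python
import json
import pytest

# One synthetic fixture file, versioned together with the tests
# (hypothetical layout), instead of account numbers scattered across autotests.
ACCOUNTS_FILE = "testdata/accounts.json"

@pytest.fixture
def corporate_account():
    with open(ACCOUNTS_FILE, encoding="utf-8") as f:
        return json.load(f)["corporate"]

def render_payment_order(account: dict) -> str:
    # Stand-in for the application code that fills the document template.
    return f"Payment order for account {account['number']}"

def test_account_number_appears_in_document(corporate_account):
    document = render_payment_order(corporate_account)
    # When the reference data changes, only accounts.json is updated,
    # not a hundred assert statements.
    assert corporate_account["number"] in document
```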
4. Improving manageability
With synthetic data we know (or, in the worst case, assume) what response to expect from the system. If the functionality changes, we understand how the response to the same data will change, or we can adjust the data to get the result we need.
On one project our team started testing with real data from the customer's counterparty database. The database was being actively refined, but at the time it was maintained extremely carelessly. We kept losing time trying to work out where an error came from: the functionality or the data. The solution was simple: compose a synthetic database, shorter but more consistent and more informative. Testing of this functionality was completed in 12 person-hours.
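If the project allows third-party libraries, a seeded generator is one way to build such a controlled synthetic database. A sketch using the Faker package (our suggestion, not necessarily what was used on that project): the fixed seed makes every run reproduce exactly the same records, so a changed response points at the functionality rather than at drifting data.

```python
from faker import Faker  # third-party data generator, assuming it fits the project

# A fixed seed makes the synthetic counterparty database fully reproducible.
Faker.seed(42)
fake = Faker("ru_RU")

counterparties = [
    {"id": i, "name": fake.company(), "city": fake.city()}
    for i in range(1, 11)
]

for counterparty in counterparties:
    print(counterparty)
```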
So what are the disadvantages?
Synthetic data may seem omnipotent. And so it is, until you run into the human factor. Synthetic data is deliberately created by people, and that is its main drawback: we physically cannot think through every possible scenario and combination of input factors (and nobody has cancelled force majeure). Here the advantage goes to real data.
The benefits of working with real data
What advantages do we see in working with real data? Four proofs from our experience:
1. Work with large amounts of information
Real operation of the system gives us millions of transactions. No team of specialists can reproduce the simultaneous work of thousands of users or automatic data generation.
Proof: we created a synthetic mini-database which, as it seemed to us, met every criterion of the customer's existing base. With the synthetic base everything worked beautifully, but as soon as we ran the tests against the real base, everything fell apart. Bottom line: if you cannot account for every nuance of a real data sample, don't waste time generating synthetic data.
2. Using a variety of data formats
Text, sound, video, images, executable files, archives: it is impossible to predict what a user will decide to upload into a form field. Hints about accepted formats may be ignored, and a ban on certain uploads may simply not have been implemented by the development team. So we have to test both the intended scenarios (for example, that an mp3 file really can be uploaded into the audio field) and the dangerous ones (that uploading an executable will not harm the system). Real data helps us track down the exceptions.
Proof: we tested the photo upload field in the user profile. We tried the most common graphic formats from our set, plus several video and text files. Our synthetic selection uploaded perfectly. In real use it turned out that any attempt to upload an audio file caused an error: the entire registration form crashed with no way to replace the file, and even refreshing the page did not help.
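The takeaway translates naturally into a parametrized test over "unintended" formats. A hypothetical Python/pytest sketch (the whitelist and helper are invented; in the real case the failure happened on the server side):

```python
import os
import pytest

# Hypothetical whitelist check for the photo-upload field; the lesson from the
# case above is to feed it formats nobody "intended" to upload, not only images.
ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif"}

def is_accepted(filename: str) -> bool:
    return os.path.splitext(filename.lower())[1] in ALLOWED_EXTENSIONS

@pytest.mark.parametrize("filename,expected", [
    ("avatar.png", True),
    ("photo.JPG", True),
    ("song.mp3", False),      # the format that crashed the real form
    ("clip.mp4", False),
    ("setup.exe", False),
])
def test_upload_filter_rejects_unexpected_formats(filename, expected):
    # The important part is that a rejected file is rejected gracefully,
    # not that it brings down the whole registration form.
    assert is_accepted(filename) is expected
```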
3. Unpredictability of user behavior
Although many QA specialists have become skilled at devising and analysing exceptional situations, let's be honest: we can never accurately predict how a person will behave, or the circumstances around them. It can be anything from the Internet connection dropping in the middle of an operation to users tampering with the program code and internal files.
Proof: every year our employees go through an appraisal in which, among other things, they rate their skills in a special questionnaire. The scores are agreed with the manager, and the specialist's grade (level) is calculated from them. Before the release the module was fully tested and everything ran like clockwork. But one day, at the moment the results were being saved, the system came under a DDoS attack; only part of the data was saved, and subsequent attempts to save only duplicated the errors. Without that real situation we would never have caught such a serious defect.
4. System updates
It is very important to understand how the system will behave during an upgrade, what the risks are, and what might "not take off". In systems such as 1C, with their huge number of directories and references, the question of updates is especially acute. The best option here is to take a fresh copy of the production database, test the update on it, and only then release.
Proof: a fairly common case. A project in the field of factoring services. The cost of data loss or distortion is sky-high, and any doubt about the accuracy of the data shown by the system can stop the whole office. And then our specialist botches the rollout of the next update straight to production, missing the last 10 build versions.
They rolled out at 18:00 and were still fixing it at 11 the next morning. Because of the failed rollout and the confusion over versions, the work of the company's divisions was completely frozen for 2 hours: managers could not process current contracts or register new ones.
Since then, we always work with three environments:
- Develop. New features land here; the anarchy and lawlessness known as exception testing reigns. QA engineers work mainly with synthetic data; real data is brought in only occasionally.
- Pre-release. Once all the improvements are implemented and tested, they are merged into this branch. A fresh copy of production is also rolled out here first, so we test the update and the new features in near-production conditions.
- Production. The live version of the system that end users work with.
So which data should you use, and when?
We'll share three insights from our work with real and synthetic data:
1. Remember that the choice of data type depends on the goals and the stage of testing. When developing a brand-new system, we can only operate with synthetic data; to cover various combinations of input conditions we also most often turn to it. But reproducing some tricky exception tied to user behavior will require real logs. The same goes for working with generally accepted classifiers and registries.
2. Don't forget to optimize your work with test data. Where possible, use generators, build shared registers of basic entities, and save system backups in good time so you can deploy them at the right moment. Then preparation for the next project will be just another stage of the work rather than a source of melancholy and gloom.
3. Don't swing entirely to synthetic data, but don't fixate on real data either. Use a combined approach to test data so that you don't miss errors, you save time, and you get the best results from your work.
Even though synthetic data is being promised a great future, and researchers see in it a new hope for artificial intelligence, synthetics are not a panacea for testing. It is just another approach to preparing test data: it can help solve your problems, and it can create new ones. Knowing the strengths and weaknesses of real and synthetic data, and applying each at the right moment, is what protects you from losses, downtime or lawsuits. I hope that today we have helped you become a little more protected. Or have we not?
Let's discuss. Tell us in the comments about your own cases of working with synthetic and real test data. Let's see who is in the majority: the realists or the synthetics? ;)
Victoria Sokovikova.
Test Analyst at Quality Laboratory