Data testing: requirements and levels
My name is Alexey Chumagin, and I am a tester at Provectus. In this article I will explain how data quality requirements are formed and what levels of data testing exist.
The article deals with data, large or not, whose analysis and aggregation drive various processes: patterns are derived from it for further analysis or for decision making. The data can be collected from scratch for a specific project, or databases previously assembled for other projects or for commercial purposes can be reused. The sources of this data are diverse: not only manual operator input, but also automated and/or automatic measurements stored in databases systematically or unsystematically (dumped in a heap, "we'll figure out what to do with it later").
Why data testing is important
Data plays an ever-growing role in decision making, in everyday life as well as in business. Modern technologies and algorithms make it possible to process and store huge amounts of data, transforming it into useful information.
What is this data? For example, your browser history, the transactions on your card, the movement points of a device. The data is impersonal, but it still belongs to a specific device. If you collect and process it, you can learn quite interesting things about the owner of that device: for example, where they like to go, or their gender and age. So gradually we "humanize" the device and give it some characteristics.
This information can then be used for targeted advertising. If you are a woman, we can say with high probability that you are not interested in ads for men's razors; you should be shown ads related to your interests. The quality of ad targeting can be improved thanks to what is known about the devices on which the ads are shown. You are shown the ads you want to see, so you click on them. The people who show you the ad get paid for it, and the advertiser profits when you learn about their product.
All this is built on data owned by different companies and people. To use this data effectively, it must be reliable: we must know, for example, that a given account really does belong to a given set of transactions.
Since there is a lot of data, storing it requires considerable resources, and data cleansing is a separate task that has to be addressed. We want to store only the data we really need, and we do not want our database to contain duplicates or records that fail our criteria, such as entries with empty fields. This is why data quality requirements exist, and why the question of testing them arises.
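A minimal sketch of the cleansing step described above, with invented sample data: drop exact duplicates and records that have empty fields before storing anything.

```python
# Illustrative records (names and fields are made up for this sketch).
records = [
    {"name": "Alice", "phone": "123456789012"},
    {"name": "Alice", "phone": "123456789012"},  # exact duplicate
    {"name": "Bob", "phone": ""},                # empty field
    {"name": "Carol", "phone": "987654321098"},
]

def clean(rows):
    """Keep only unique rows whose fields are all non-empty."""
    seen = set()
    result = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # skip a duplicate record
        if any(v is None or v == "" for v in row.values()):
            continue  # skip a record with an empty field
        seen.add(key)
        result.append(row)
    return result

cleaned = clean(records)
print(len(cleaned))  # → 2 (Alice and Carol survive)
```

In a real project these rules would come from the agreed quality criteria rather than be hard-coded like this.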
What is quality?
I like this definition: product quality is a measure of user satisfaction. Clearly, everything depends on the context in which the product is used. If you use a well-known product, say Facebook or Skype, you have one set of quality requirements: you will put up with some errors and keep using the product anyway. But if you commissioned a program and paid money for it, the quality requirements will be higher; you will nitpick and scrutinize every detail. Different people have different ideas about quality, and different programs have different quality requirements too.
Therefore, before development and testing begin, people usually decide what they will consider a quality product, and all of this can be described formally. For example, we will consider our product to be of good quality if it contains no critical errors, or if it runs for two weeks without a crash.
Determining these requirements is not an easy task. Typically, software requirements are formed by the business, and if we ask the business what the data should be like, the answer may well be that the data should be "good and clean". The tester's task is to find out, or clarify, what the data actually is and by what criteria we judge its quality and cleanliness. These criteria need to be formalized, fixed, and made measurable.
How data quality requirements are formed
The tester starts by working out what is unclear to him and what he would like to know about the object under test. He makes a list of questions and "interviews" the customer, who, in theory, should know what the data ought to look like. For example, I ask whether empty cells or duplicate rows are acceptable.
An example of such a requirement: if we have a list of people, the first name, last name, and middle name may each repeat, but the row as a whole may not. Repetition can be allowed for a single cell, but not for an entire row or a combination of several cells; a full match must not occur.
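The rule above can be sketched as code (the names are invented): repeated values in a single column are fine, but a fully repeated row is a defect.

```python
# Each tuple is (first name, last name, middle name).
people = [
    ("Ivan", "Petrov", "Sergeevich"),
    ("Ivan", "Sidorov", "Sergeevich"),  # first name repeats: allowed
    ("Ivan", "Petrov", "Sergeevich"),   # whole row repeats: not allowed
]

def duplicate_rows(rows):
    """Return every row that is a full duplicate of an earlier row."""
    seen, dups = set(), []
    for row in rows:
        if row in seen:
            dups.append(row)
        seen.add(row)
    return dups

print(duplicate_rows(people))  # → [('Ivan', 'Petrov', 'Sergeevich')]
```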
Next we ask about the format of the data in each particular cell. For example, a telephone number should contain 12 digits, a bank card number 16. We may also have a criterion that not every sequence of those digits is a valid card number, or we may establish that a surname can contain only letters. There can be many questions about the data format; this is how we find out everything we need to know about the subject under test.
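Once such format criteria are fixed, they translate directly into checks. A hedged sketch: the exact formats are project-specific, and the patterns below simply encode the examples from the text (a 12-digit phone, a 16-digit card number, a letters-only surname).

```python
import re

# Illustrative format rules; real projects would define their own.
RULES = {
    "phone":   re.compile(r"^\d{12}$"),
    "card":    re.compile(r"^\d{16}$"),
    "surname": re.compile(r"^[A-Za-z-]+$"),  # letters (and hyphen) only
}

def validate(field, value):
    """True if the value matches the agreed format for this field."""
    return bool(RULES[field].fullmatch(value))

print(validate("phone", "380501234567"))  # True: exactly 12 digits
print(validate("card", "1234"))           # False: not 16 digits
print(validate("surname", "Smith2"))      # False: digits in a surname
```

Note the criterion that "not every 16-digit sequence is a card number" would need an extra check, such as the Luhn checksum, on top of the format rule.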
What is quality data
Quality data must have several characteristics.
- Completeness: there are no gaps in the records, and all cells are filled. The data should carry as much information as possible.
- Uniqueness: the data should not contain identical records.
- Reliability: this is what everything is done for. Nobody wants to work with data that cannot be trusted. In quality data, table cells contain exactly what they are supposed to contain: an IP address, a telephone number, and so on.
- Accuracy: if we are talking about numeric data, there must be an exact number of characters, for example 12 decimal places, and values should be close to some expected average.
- Consistency: the data must keep the same values regardless of how they are measured.
- Timeliness: the data should be up to date, especially if it is periodically refreshed. For example, the amount of data should grow each month, and the data should not be stale. If we are talking about banking transactions, we are interested, say, in the last six months.
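Three of the characteristics above can be turned into executable checks. This is an illustrative sketch on an invented transaction table; the threshold of 183 days stands in for "the last six months".

```python
from datetime import date

today = date(2024, 6, 1)  # made-up "now" for the example
transactions = [
    {"id": 1, "amount": 100,  "when": date(2024, 5, 20)},
    {"id": 2, "amount": None, "when": date(2024, 5, 21)},  # incomplete
    {"id": 1, "amount": 100,  "when": date(2024, 5, 20)},  # duplicate
    {"id": 3, "amount": 50,   "when": date(2023, 1, 1)},   # outdated
]

def quality_report(rows, now, max_age_days=183):
    """Check completeness, uniqueness, and timeliness of a table."""
    complete = all(all(v is not None for v in r.values()) for r in rows)
    unique = len({tuple(r.items()) for r in rows}) == len(rows)
    timely = all((now - r["when"]).days <= max_age_days for r in rows)
    return {"completeness": complete, "uniqueness": unique, "timeliness": timely}

print(quality_report(transactions, today))
# All three checks fail on this deliberately flawed sample.
```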
Data testing levels
We can group the data into so-called layers; a good analogy here is the testing pyramid, which describes how tests are distributed across the levels of an application.
- The unit layer is when a single module of the program is tested, most often one function or method; there should be the most tests of this kind. For data, a unit test is when we define requirements for each individual cell. It makes no sense to test further if we have errors at the cell level: if a surname contains digits, what is the point of checking anything else? Perhaps those digits should have been visually similar letters. We fix everything first, and only then check the next level to make sure, for example, that records are unique and there are no duplicates, if that is what the requirements say.
- The integration layer is when several pieces of the program are tested together. For data, this layer is about the whole table. Suppose duplicates are allowed, but no more than a hundred of them. Or: if we have a city of a million people, a million of them cannot live on the same street, so if we sample by street, the number of addresses should be on the order of a thousand or ten thousand (the exact figure must be determined); if we get a million, something is wrong with the data.
- The system layer is when the entire program is tested as a whole. For data, this layer means the whole system is tested, including statistics. For example, we may state that no more than 30% of the records can be men born after 1985, or that 80% of the data must be of the same type.
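The three layers can be sketched side by side. The data and thresholds below are invented for illustration; only the 30% statistic comes from the text.

```python
# Invented sample table.
rows = [
    {"surname": "Ivanov",  "birth_year": 1990, "sex": "m"},
    {"surname": "Petrova", "birth_year": 1992, "sex": "f"},
    {"surname": "Sidorov", "birth_year": 1980, "sex": "m"},
]

# Unit layer: one cell at a time (a surname contains only letters).
def cell_ok(row):
    return row["surname"].isalpha()

# Integration layer: a property of the whole table (duplicate budget).
def table_ok(rows, max_duplicates=100):
    dups = len(rows) - len({tuple(r.items()) for r in rows})
    return dups <= max_duplicates

# System layer: a statistic across the data set
# (no more than 30% men born after 1985).
def stats_ok(rows, max_share=0.30):
    young_men = sum(1 for r in rows
                    if r["sex"] == "m" and r["birth_year"] > 1985)
    return young_men / len(rows) <= max_share

print(all(cell_ok(r) for r in rows))  # True: every cell passes
print(table_ok(rows))                 # True: no duplicates at all
print(stats_ok(rows))                 # False: 1 of 3 rows breaks the 30% rule
```

Running the layers in this order mirrors the pyramid: cell checks first, table-wide checks next, statistical checks last.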
In conclusion, I will say that data testing is an area that offers many opportunities for creativity and growth. There is no silver bullet here: different approaches can be used to test data, and the truth, as always, lies somewhere in between.